Back to ComputerTerms

Terms

* ClusteringAlgorithms
* ControlledVocabulary
* InvertedFile
* LexicalAnalysis
* [[Precision]]
* [[Recall]]
* RelevanceFeedback
* SignatureFile
* StemmingConflation
* StopWords
* SuperImposedCoding
* TruncationConflation
* ThesaurusConflation

'''This link provides a nice glossary of terms: http://www.cs.jhu.edu/~weiss/glossary.html'''

== Description ==

Examples: Library catalogs

Generally the '''data''' are organized as a collection of '''documents'''.

== Querying ==

Querying of unstructured textual data is referred to as '''Information Retrieval'''. It covers the following areas:

* Querying based on keywords
* The relevance of documents to the query
* The analysis, classification, and indexing of documents

Queries are formed using keywords and the logical connectives ''and, or,'' and ''not'', where the ''and'' connective is implicit.

'''Full Text''' --> All words in a document are ''keywords''. Since all words are keywords, we use '''term''' to refer to the words in a document.

Given a document ''d'' and a term ''t'', one way of defining the relevance ''r'' is

$$$r(d,t)=\log\left(1+\frac{n(d,t)}{n(d)}\right)$$$

where n(d) denotes the number of terms in the document and n(d,t) denotes the number of occurrences of term t in document d.

KEY: In the information retrieval community, the relevance of a document to a term is referred to as '''term frequency''', regardless of the exact formula used.

'''Inverse document frequency''' is defined as

$$$IDF = \frac{1}{n(t)}$$$

where n(t) denotes the number of documents that contain the term t. A term that occurs in many documents has a low IDF; if it occurs in only a few, it is probably a good term to use!

Thus the '''relevance''' of a document ''d'' to a set of terms ''Q'' is defined as

$$$r(d,Q)=\sum_{t \in Q}\frac{r(d,t)}{n(t)}$$$

or, with user-specified weights,

$$$r(d,Q)=\sum_{t \in Q}\frac{w(t)\, r(d,t)}{n(t)}$$$

where w(t) is a weight specified by the user.
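The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a production indexer: documents are plain lists of tokens, and all names (`term_relevance`, `relevance`, the sample documents) are made up for the example.

```python
import math
from collections import Counter

def term_relevance(doc_terms, t):
    """Term frequency r(d,t) = log(1 + n(d,t)/n(d))."""
    counts = Counter(doc_terms)
    return math.log(1 + counts[t] / len(doc_terms))

def relevance(doc_terms, query, docs, weights=None):
    """r(d,Q) = sum over t in Q of w(t) * r(d,t) / n(t),
    where n(t) is the number of documents containing t."""
    score = 0.0
    for t in query:
        n_t = sum(1 for d in docs if t in d)  # document frequency n(t)
        if n_t == 0:
            continue  # term occurs in no document; it contributes nothing
        w = weights.get(t, 1.0) if weights else 1.0  # w(t) defaults to 1
        score += w * term_relevance(doc_terms, t) / n_t
    return score

docs = [
    ["information", "retrieval", "uses", "term", "frequency"],
    ["library", "catalogs", "are", "collections", "of", "documents"],
    ["term", "frequency", "and", "inverse", "document", "frequency"],
]
query = ["term", "frequency"]
# Rank documents by r(d,Q); documents containing the query terms score highest.
ranked = sorted(docs, key=lambda d: relevance(d, query, docs), reverse=True)
```

Note how the division by n(t) implements the IDF idea: a query term that appears in every document is discounted, while a rare term dominates the score.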
'''KEY: Stop words''' are words that are not indexed, such as ''and'', ''or'', ''the'', ''a'', etc.

'''Proximity''': if the terms occur close to each other in the document, the document would be ranked higher than if they occur far apart. We could (although we don't) modify the formula $$r(d,Q)$$ to take proximity into account.

Back to ComputerTerms
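The proximity idea mentioned above could be sketched as a bonus factor added to r(d,Q). The 1/(1+span) form and the function name below are illustrative assumptions, not a standard formula.

```python
def proximity_bonus(doc_terms, query):
    """Hypothetical proximity score in (0, 1]: larger when the query
    terms occur close together in the document, 0 if a term is missing."""
    positions = []
    for t in query:
        if t not in doc_terms:
            return 0.0
        positions.append(doc_terms.index(t))  # first occurrence of each term
    span = max(positions) - min(positions)    # distance the terms cover
    return 1.0 / (1.0 + span)

doc = ["term", "frequency", "and", "inverse", "document", "frequency"]
proximity_bonus(doc, ["term", "frequency"])  # adjacent terms -> 1/(1+1) = 0.5
proximity_bonus(doc, ["term", "document"])   # far apart      -> 1/(1+4) = 0.2
```

A real implementation would consider all occurrence positions, not just the first, but this captures the intuition: a smaller span between query terms yields a larger bonus.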