Back to ComputerTerms, InformationRetrieval
See Also: StopWords
Lexical analysis is the process of converting an input stream of characters into a stream of words or tokens. Tokens are groups of characters with collective significance. This is the first stage of automated indexing and of query processing.
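For example, a naive whitespace split already turns a character stream into a token stream, though it leaves punctuation attached to the words (a minimal Python sketch, not taken from any particular system):

  text = "The quick brown fox jumps over the lazy dog."
  tokens = text.split()    # naive: white space is the only delimiter
  print(tokens)
  # ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.']

Note the trailing token "dog.": the issues below are exactly about these decisions.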
Issues:
- Digits: Numbers are not usually allowed, but we might allow words to contain digits, as long as they don't start with them. There are exceptions, of course!
- Hyphens: Consistency is important, but there will be problems nonetheless.
- ",._?`~" and other punctuation may be an integral part of the word. How we deal with this is important with respect to the kind of information that we are using!
- Case: Usually we just make everything lower case!
- Delimiters: Choosing delimiters is also very important; usually any white space and unrecognized punctuation or control characters are treated as delimiters. (A sketch that combines these choices follows this list.)
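Here is a minimal sketch of one way to combine these choices, assuming a particular set of rules (fold everything to lower case; tokens start with a letter and may contain digits and internal hyphens; everything else is a delimiter). This is an illustration, not the only reasonable policy:

  import re

  # Assumed token rule: starts with a letter, may continue with
  # letters, digits, or internal hyphens.  Anything else delimits.
  TOKEN = re.compile(r"[a-z][a-z0-9]*(?:-[a-z0-9]+)*")

  def tokenize(text):
      text = text.lower()            # Case: fold everything to lower case
      return TOKEN.findall(text)     # delimiters fall out implicitly

  print(tokenize("State-of-the-art B2B systems, circa 1999!"))
  # ['state-of-the-art', 'b2b', 'systems', 'circa']

Note that "1999" is dropped because it starts with a digit, while "b2b" is kept; a different collection might need the opposite choice.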
Implementation:
- Use a lexical analyzer generator like lex: This is the best approach when the lexical analyzer is complicated.
- Write a lexical analyzer by hand, ad hoc: This is the worst solution; it will likely have subtle errors and may not be efficient.
- Write a lexical analyzer by hand as a finite state machine: This must be a good way, because it is the one our book chose to implement. A sketch of this approach follows the list.
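As a rough illustration of the finite-state-machine style (a Python sketch using the same assumed rules as above, not the book's code), the scanner reads one character at a time and switches between an "outside a token" state and an "inside a token" state:

  def fsm_tokenize(text):
      # Hand-written two-state scanner: outside a token / inside a token.
      tokens, current = [], []
      inside = False                          # current state
      for ch in text.lower():
          if inside:
              if ch.isalnum() or ch == '-':   # stay inside, extend the token
                  current.append(ch)
              else:                           # delimiter: emit token, go outside
                  tokens.append(''.join(current))
                  current, inside = [], False
          else:
              if ch.isalpha():                # a letter starts a new token
                  current, inside = [ch], True
              # digits, punctuation, white space: stay outside
      if inside:                              # flush the last token
          tokens.append(''.join(current))
      return tokens

  print(fsm_tokenize("Lex-style scanning of B6 engines, nonetheless!"))
  # ['lex-style', 'scanning', 'of', 'b6', 'engines', 'nonetheless']

The same state logic translates naturally into a table-driven loop in a language like C, which is why a hand-written finite state machine is a common middle ground between ad hoc code and a generator like lex.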