Back to ComputerTerms, InformationRetrieval

See Also: StopWords

Lexical analysis is the process of converting an input stream of characters into a stream of words or tokens. '''Tokens''' are groups of characters with collective significance. '''This is the first stage of automated indexing and of query processing'''.

Issues:
 * Digits: Numbers are not usually allowed, but we might allow words to contain digits, as long as they don't start with one. There are exceptions, of course!
 * Hyphens: Consistency is important, but there will be problems nonetheless.
 * ",._?`~" and other punctuation may be an integral part of a word. How we deal with this matters, and it depends on the kind of information we are working with!
 * Case: Usually just make everything lower case!
 * Choosing delimiters is also very important: usually any white space and unrecognized punctuation or control characters are delimiters.

Implementation:
 1. Use a lexical analyzer generator like lex: This is the best approach when the lexical analyzer is complicated.
 1. Write a lexical analyzer by hand, ad hoc: The worst solution; this will likely have subtle errors and may not be efficient.
 1. Write a lexical analyzer by hand as a finite state machine: Must be a good way, because this is the one our book chose to implement.
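
The finite state machine approach can be illustrated with a minimal Python sketch. This is not the implementation from the book; the state names, the `tokenize` function, and the specific policy choices are assumptions made for illustration: tokens are lower-cased, a token may contain digits but may not start with one, and hyphens, white space, and all other punctuation are treated as delimiters.

```python
from enum import Enum, auto


class State(Enum):
    DELIM = auto()  # between tokens
    WORD = auto()   # inside a token that started with a letter
    SKIP = auto()   # inside a run that started with a digit (discarded)


def tokenize(text):
    """Yield lower-cased tokens from text using a small finite state machine."""
    state = State.DELIM
    chars = []
    for ch in text:
        if state is State.DELIM:
            if ch.isalpha():
                # A letter starts a new token.
                chars = [ch.lower()]
                state = State.WORD
            elif ch.isdigit():
                # A digit starts a run that is thrown away.
                state = State.SKIP
            # Anything else is a delimiter: stay in DELIM.
        elif state is State.WORD:
            if ch.isalnum():
                # Letters and digits both extend the current token.
                chars.append(ch.lower())
            else:
                # A delimiter ends the token.
                yield "".join(chars)
                state = State.DELIM
        else:  # State.SKIP
            if not ch.isalnum():
                state = State.DELIM
    # Emit a token that runs to the end of the input.
    if state is State.WORD:
        yield "".join(chars)


print(list(tokenize("The B52 bomber, 3rd state-of-the-art design!")))
# ['the', 'b52', 'bomber', 'state', 'of', 'the', 'art', 'design']
```

Note that "B52" is kept because it starts with a letter, while "3rd" is dropped because it starts with a digit. Treating the hyphen as a delimiter is only one consistent choice; a different collection might call for keeping hyphenated words whole, which would mean adding a state or a transition rather than rewriting the loop.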