Back to ComputerTerms, InformationRetrieval

A stoplist is a list of words that, for reasons of volume or ["Precision"] and ["Recall"], are not included in the index and hence are not searchable, e.g. "and", "or", "not".

There are two ways to filter stoplist words from an input token stream:

  1. Examine the lexical analyzer output and remove any stopwords.
  2. Remove stopwords as part of the lexical analysis. This is one of the more efficient ways to implement a StopList.

If we implement the first method, we must look up every token the lexical analyzer produces in a stoplist structure. Hashing is usually the fastest way to do this. We can even build hashing into the lexical analysis process by computing the hash code as part of token generation. Issues include having to compare the token against the stored stopword when we are not using a perfect hashing algorithm.
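A minimal sketch of the first method, assuming Python and a tiny made-up stoplist. The names `STOPLIST`, `tokenize` and `remove_stopwords` are hypothetical, and the regular-expression tokenizer is only a stand-in for a real lexical analyzer:

```python
import re

# Hypothetical, deliberately tiny stoplist (illustration only).
STOPLIST = frozenset({"and", "or", "not", "the", "of"})


def tokenize(text):
    """Stand-in lexical analyzer: lowercased runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stopwords(tokens, stoplist=STOPLIST):
    """Method 1: look up every token in a hash-based set and drop matches."""
    return [token for token in tokens if token not in stoplist]


print(remove_stopwords(tokenize("The cat and the dog")))
```

Python's built-in sets are hash tables that are not perfect hashes, so each membership test also compares the token with the stored word when the hashes match. That is the comparison cost mentioned above.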

The second method is better: we have to do lexical analysis anyway, and recognizing even a large stoplist can be done at almost no extra cost. This was shown (in my IR book) using a lexical analyzer generator that generates a minimum-state deterministic finite automaton.
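The idea behind the second method can be sketched roughly as follows, again assuming Python. This is not the generator-based approach from the book: it uses a hand-built trie, which is a deterministic automaton but not a minimum-state one, and the names `build_trie` and `lex_without_stopwords` are hypothetical. The point it illustrates is that the stoplist automaton advances one state per character while the token is being scanned, so no separate lookup pass is needed:

```python
def build_trie(words):
    """Build a trie of nested dicts; the key "" marks an accepting state."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = True
    return root


def lex_without_stopwords(text, trie):
    """Method 2: scan tokens and recognize stopwords in the same pass."""
    tokens = []
    chars = []
    node = trie  # current automaton state; None means "cannot be a stopword"
    for ch in text.lower() + " ":  # trailing space flushes the last token
        if ch.isalnum():
            chars.append(ch)
            if node is not None:
                node = node.get(ch)  # one transition per input character
        else:
            is_stopword = node is not None and "" in node
            if chars and not is_stopword:
                tokens.append("".join(chars))
            chars = []
            node = trie  # restart the automaton for the next token
    return tokens


trie = build_trie(["and", "or", "not", "the"])
print(lex_without_stopwords("The cat and the android, not or orbit", trie))
```

A token such as "android" passes through the accepting state for "and" but is kept, because acceptance is only checked at the end of the token.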


StopWords (last edited 2004-04-08 16:24:35 by yakko)