Back to ComputerTerms

Topic: InformationRetrieval

= How to create an inverted file representation =

Step 1: Parse the documents to extract tokens, and save each token together with its Document ID. Duplicates are allowed at this stage. The resulting file has two columns: '''Term, Document Number'''. Note: the location of each token within the document may optionally be recorded as well, which is needed if proximity tests are to be supported.

Step 2: Sort the file alphabetically by term (and, within each term, by document number, as in the example below).

Step 3: Aggregate the duplicates: merge repeated entries for the same term in the same '''Document''' into a single entry with a count. The file now has three columns: '''Term, Document Number, Frequency'''.

What you now have is an inverted file implementation.
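
The three steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the `tokenize` helper, the `build_inverted_file` name, and the choice of a case-insensitive sort are assumptions made for the example.

```python
from itertools import groupby


def tokenize(text):
    """Hypothetical tokenizer: split on whitespace and strip punctuation."""
    tokens = (word.strip(".,;:!?()") for word in text.split())
    return [token for token in tokens if token]


def build_inverted_file(documents):
    """Build (term, document number, frequency) rows.

    documents: dict mapping document number -> document text.
    """
    # Step 1: one (term, document number) pair per token, duplicates kept.
    pairs = [
        (term, doc_id)
        for doc_id, text in documents.items()
        for term in tokenize(text)
    ]
    # Step 2: sort by term (case-insensitive, an assumption), then by
    # document number, so that duplicates end up next to each other.
    pairs.sort(key=lambda pair: (pair[0].lower(), pair[0], pair[1]))
    # Step 3: collapse each run of identical pairs into one row with a count.
    return [
        (term, doc_id, len(list(group)))
        for (term, doc_id), group in groupby(pairs)
    ]


inverted_file = build_inverted_file({1: "CS UNL CS", 2: "UNL Lincoln"})
```

Sorting before aggregating is what makes Step 3 a single linear pass: all entries for the same term and document are adjacent, so they can be counted without any lookup structure.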

This can be split into a '''Lexicon (Dictionary)''' and a '''Postings file'''. 
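
One possible way to perform this split is sketched below. The layout is an assumption chosen for illustration: the lexicon maps each term to a document count and an offset into a flat postings list. The names `split_inverted_file` and `lookup` are hypothetical, and the layout described on the PostingsFile page may differ.

```python
def split_inverted_file(inverted_file):
    """Split sorted (term, document number, frequency) rows in two.

    Assumes all rows for a term are adjacent (i.e. the file is sorted).
    """
    lexicon = {}   # term -> (number of documents, offset of first posting)
    postings = []  # flat list of (document number, frequency)
    for term, doc_id, freq in inverted_file:
        if term in lexicon:
            doc_count, offset = lexicon[term]
            lexicon[term] = (doc_count + 1, offset)
        else:
            lexicon[term] = (1, len(postings))
        postings.append((doc_id, freq))
    return lexicon, postings


def lookup(term, lexicon, postings):
    """Return the (document number, frequency) postings for a term."""
    if term not in lexicon:
        return []
    doc_count, offset = lexicon[term]
    return postings[offset:offset + doc_count]


rows = [("CS", 1, 2), ("CS", 2, 4), ("UNL", 1, 3)]
lexicon, postings = split_inverted_file(rows)
cs_postings = lookup("CS", lexicon, postings)
```

With this layout the lexicon holds one small entry per distinct term, while the term itself is no longer repeated in every postings row.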

== Example ==

||Document||||Keywords||
||1||||CS(2), UNL(3), Ferguson(5), Lincoln(2)||
||2||||Lincoln(3), CS(4), Computer(6)||
||3||||CS(3)||
||4||||university(2), UNL(2), CS(1)||
||5||||Ferguson(1)||

'''Here is the inverted file:'''

||Term||||Document Number||||Frequency||
||Computer||||2||||6||
||CS||||1||||2||
||CS||||2||||4||
||CS||||3||||3||
||CS||||4||||1||
||Ferguson||||1||||5||
||Ferguson||||5||||1||
||Lincoln||||1||||2||
||Lincoln||||2||||3||
||university||||4||||2||
||UNL||||1||||3||
||UNL||||4||||2||
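
As a check, the inverted file above can be reproduced from the document table with a short Python sketch. The `keywords` dictionary simply restates the example data (term frequencies per document). The case-insensitive sort is an assumption, chosen because it matches the ordering shown in the table (Computer before CS, university before UNL).

```python
# Example data: document number -> {term: frequency}.
keywords = {
    1: {"CS": 2, "UNL": 3, "Ferguson": 5, "Lincoln": 2},
    2: {"Lincoln": 3, "CS": 4, "Computer": 6},
    3: {"CS": 3},
    4: {"university": 2, "UNL": 2, "CS": 1},
    5: {"Ferguson": 1},
}

# Flatten into (term, document number, frequency) rows, then sort by
# term (ignoring case) and by document number within each term.
rows = sorted(
    (
        (term, doc_id, freq)
        for doc_id, terms in keywords.items()
        for term, freq in terms.items()
    ),
    key=lambda row: (row[0].lower(), row[1]),
)

# Print the rows in the same table markup used on this page.
for term, doc_id, freq in rows:
    print(f"||{term}||||{doc_id}||||{freq}||")
```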

To see how this file is split into a Lexicon and a Postings file, see PostingsFile.
