Back to ComputerTerms

Topic: InformationRetrieval

= How to create an inverted file representation =

Step 1: Parse the documents to extract tokens, and save each token together with its Document ID (duplicates are allowed). The resulting file has two columns: Term, Document Number.

Step 2: Sort the file alphabetically by term.

Step 3: Aggregate the duplicate entries for each term within each document, counting how often each one occurs. The file now has three columns: Term, Document Number, Frequency.

What you now have is an inverted file implementation.
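The three steps can be sketched in Python as follows. This is only a minimal illustration: the function name `build_inverted_file`, the `\w+` tokenizer, and the sample texts are assumptions made for the sketch, not part of the procedure above.

```python
import re
from itertools import groupby


def build_inverted_file(documents):
    """Build (term, document number, frequency) records.

    `documents` maps a document number to its text (hypothetical input format).
    """
    # Step 1: extract tokens and pair each one with its document number.
    # Duplicates are kept at this stage.
    pairs = []
    for doc_id, text in documents.items():
        for token in re.findall(r"\w+", text):
            pairs.append((token, doc_id))

    # Step 2: sort alphabetically by term (ignoring case), then by the exact
    # spelling and the document number so identical pairs end up adjacent.
    pairs.sort(key=lambda p: (p[0].lower(), p[0], p[1]))

    # Step 3: collapse each run of identical (term, document) pairs
    # into one record that carries the frequency.
    return [(term, doc_id, len(list(run)))
            for (term, doc_id), run in groupby(pairs)]


# Hypothetical sample documents, used only to exercise the sketch.
sample = {1: "CS UNL CS Lincoln", 2: "Lincoln CS Computer"}
inverted = build_inverted_file(sample)
# inverted == [("Computer", 2, 1), ("CS", 1, 2), ("CS", 2, 1),
#              ("Lincoln", 1, 1), ("Lincoln", 2, 1), ("UNL", 1, 1)]
```

A real system would usually sort on disk rather than in memory, since the Step 1 file can be far larger than the documents' vocabulary.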

This can be split into a Lexicon (Dictionary) and a Postings file.
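One possible way to do the split is sketched below. It assumes a layout in which the lexicon stores, for each term, the number of documents containing it and the offset of its first entry in the postings file; the names `split_inverted_file` and `lookup` are hypothetical. The records are the term, document and frequency values from the example on this page.

```python
def split_inverted_file(records):
    """Split sorted (term, document number, frequency) records.

    Returns (lexicon, postings), where
      lexicon  maps term -> (number of documents, offset of first posting)
      postings is a flat list of (document number, frequency)
    The records must already be sorted so that each term is contiguous.
    """
    lexicon = {}
    postings = []
    for term, doc_id, freq in records:
        if term not in lexicon:
            # First posting for this term: remember where its list starts.
            lexicon[term] = (0, len(postings))
        count, offset = lexicon[term]
        lexicon[term] = (count + 1, offset)
        postings.append((doc_id, freq))
    return lexicon, postings


def lookup(term, lexicon, postings):
    """Return the (document number, frequency) list for a term."""
    count, offset = lexicon.get(term, (0, 0))
    return postings[offset:offset + count]


# Records taken from the example on this page.
records = [
    ("Computer", 2, 6),
    ("CS", 1, 2), ("CS", 2, 4), ("CS", 3, 3), ("CS", 4, 1),
    ("Ferguson", 1, 5), ("Ferguson", 5, 1),
    ("Lincoln", 1, 2), ("Lincoln", 2, 3),
    ("university", 4, 2),
    ("UNL", 1, 3), ("UNL", 4, 2),
]
lexicon, postings = split_inverted_file(records)
# lookup("CS", lexicon, postings) == [(1, 2), (2, 4), (3, 3), (4, 1)]
```

Keeping the lexicon small lets it be held in memory, while the much larger postings list is read only at the offsets a query needs.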

Example

||Document||Keywords||
||1||CS(2), UNL(3), Ferguson(5), Lincoln(2)||
||2||Lincoln(3), CS(4), Computer(6)||
||3||CS(3)||
||4||university(2), UNL(2), CS(1)||
||5||Ferguson(1)||

||Term||Document Number||
||Computer||2||
||CS||1||
||CS||2||
||CS||3||
||CS||4||
||Ferguson||1||
||Ferguson||5||
||Lincoln||1||
||Lincoln||2||
||university||4||
||UNL||1||
||UNL||4||
