Back to ComputerTerms

Topic: InformationRetrieval

= How to create an inverted file representation =

Step 1: Parse the documents to extract tokens. Save each token together with its Document ID (duplicates allowed). The file now has two columns: Term, Document Number.

Step 2: Sort the file alphabetically by term.

Step 3: Aggregate the duplicates for each document, so that repeated (Term, Document Number) rows collapse into a single row with a count. The file now has three columns: Term, Document Number, Frequency.

What you now have is an inverted file implementation.
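The three steps above can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name `build_inverted_file`, the whitespace tokenizer, and the case-insensitive sort order are all assumptions made for the example.

```python
from collections import Counter


def build_inverted_file(documents):
    """Build (term, document number, frequency) rows.

    `documents` maps a document number to its text (hypothetical input format).
    """
    # Step 1: parse each document into (term, document number) pairs.
    # Duplicates are kept. Splitting on whitespace is a simplifying assumption.
    pairs = [
        (token, doc_id)
        for doc_id, text in documents.items()
        for token in text.split()
    ]

    # Step 2: sort by term (case-insensitively here), then by document number.
    pairs.sort(key=lambda pair: (pair[0].lower(), pair[1]))

    # Step 3: aggregate duplicates into a frequency per (term, document) pair.
    # Counter keeps first-insertion order, so the sorted order is preserved.
    counts = Counter(pairs)
    return [(term, doc_id, freq) for (term, doc_id), freq in counts.items()]


rows = build_inverted_file({1: "CS UNL CS", 2: "Lincoln CS Computer Computer"})
for row in rows:
    print(row)
```

For these two toy documents the rows come out ordered by term and then by document number, with "CS" counted twice in document 1 and "Computer" counted twice in document 2.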

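This can be split into a Lexicon (Dictionary), which lists each distinct term once, and a Postings file, which holds the Document Number and Frequency entries for each term.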
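One possible way to do the split is sketched below in Python. The layout is an assumption chosen for illustration: the lexicon maps each term to an (offset, count) pair that points into a flat postings list. The names `split_inverted_file` and `lookup` are hypothetical, and the rows are the ones from the example on this page.

```python
def split_inverted_file(rows):
    """Split sorted (term, document number, frequency) rows.

    Returns (lexicon, postings). The (offset, count) layout is an assumed one.
    """
    lexicon = {}   # term -> (offset of first posting, number of postings)
    postings = []  # flat list of (document number, frequency)
    for term, doc_id, freq in rows:
        if term not in lexicon:
            lexicon[term] = (len(postings), 0)
        start, count = lexicon[term]
        lexicon[term] = (start, count + 1)
        postings.append((doc_id, freq))
    return lexicon, postings


def lookup(term, lexicon, postings):
    """Return the (document number, frequency) postings for a term."""
    start, count = lexicon.get(term, (0, 0))
    return postings[start:start + count]


# Rows taken from the example inverted file on this page.
rows = [
    ("Computer", 2, 6),
    ("CS", 1, 2), ("CS", 2, 4), ("CS", 3, 3), ("CS", 4, 1),
    ("Ferguson", 1, 5), ("Ferguson", 5, 1),
    ("Lincoln", 1, 2), ("Lincoln", 2, 3),
    ("university", 4, 2),
    ("UNL", 1, 3), ("UNL", 4, 2),
]
lexicon, postings = split_inverted_file(rows)
print(lexicon["CS"])
print(lookup("CS", lexicon, postings))
```

Because the rows are already sorted by term, all postings for one term sit next to each other, so the lexicon only needs a starting position and a length for each term.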

Example

||Document||Keywords||
||1||CS(2), UNL(3), Ferguson(5), Lincoln(2)||
||2||Lincoln(3), CS(4), Computer(6)||
||3||CS(3)||
||4||university(2), UNL(2), CS(1)||
||5||Ferguson(1)||

Here is the inverted file:

||Term||Document Number||Frequency||
||Computer||2||6||
||CS||1||2||
||CS||2||4||
||CS||3||3||
||CS||4||1||
||Ferguson||1||5||
||Ferguson||5||1||
||Lincoln||1||2||
||Lincoln||2||3||
||university||4||2||
||UNL||1||3||
||UNL||4||2||
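The inverted file above can be rebuilt from the keyword listing with a short Python sketch. The dictionary `keywords` restates the example's per-document counts. The sort key is an assumption inferred from the listing, where "university" comes before "UNL": terms are compared case-insensitively, with ties broken by document number.

```python
# Per-document keyword frequencies, copied from the example on this page.
keywords = {
    1: {"CS": 2, "UNL": 3, "Ferguson": 5, "Lincoln": 2},
    2: {"Lincoln": 3, "CS": 4, "Computer": 6},
    3: {"CS": 3},
    4: {"university": 2, "UNL": 2, "CS": 1},
    5: {"Ferguson": 1},
}

# Flatten to (term, document number, frequency) rows, then sort by term
# (ignoring case) and by document number.
rows = sorted(
    (
        (term, doc_id, freq)
        for doc_id, terms in keywords.items()
        for term, freq in terms.items()
    ),
    key=lambda row: (row[0].lower(), row[1]),
)

for term, doc_id, freq in rows:
    print(f"||{term}||{doc_id}||{freq}||")
```

A plain case-sensitive sort would instead place "UNL" before "university", because uppercase letters sort ahead of lowercase ones in Python's default string comparison.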

InvertedFile (last edited 2004-04-08 00:24:04 by yakko)