Back to ComputerTerms, InformationRetrieval

Cluster analysis is a statistical technique used to generate a category structure.The groups which are formed should have a high degree of association between members of hte same group and a low degree between members of different groups.

Similarity Measures:

               2C
 S         = -------
  (Di,dj)     A + B 

   Where C is the number of terms that Di and Dj have in common, 
   and A and B are the number of termsin Di and Dj

Similarity Matrix calculates a similarity measure between document x and y

  | S21                 |
  | S31  S32            |
  |  ...                |
  | SN1  SN2 ...SN(N-1) |

Clustering Methods

Nonhierarchical methods (partitioning methods) divide the data set of N objects into M clusters, where no overlap is allowed. Since discovering the optimal grouping of N objects into M sets is Combinations(n|M)=N!/((N-M)!M!) is impossible for large sets, so we use heuristics. Computational requirements are then usually O(NM)

Hierarchical methods produce a nested data set in which pairs of items or clusters are successively linked until every item in the data set is connected. The agglomerative method performs in N - 1 pairwise joins beginning from an unclustered dataset. A good resource on this is http://www-csli.stanford.edu/~schuetze/completelink.html

NonHierarchical Methods

Used in early research, but now considered obsolete.

single pass method

  1. Assign the first document D1 as the representative for C1.
  2. For Di, caclulate the similarity S with the representative for each existing cluster.
  3. If Smax is greater than a threshold value Sr, add the item to the corresponding cluster and recaclulate teh cluster representative; otherwise, use Di to initiate a new cluster.
  4. If an item Di remains to be clustered, return to step 2.

Reallocation Methods

  1. Select M cluster representatives or centroids
  2. For i=1 to N, assign Di to the most similar centroid
  3. For j=1 to M, recaclulate the cluster centroid Cj
  4. Repeat 2-3 until there is little or no change in cluster membership during a pass through the file.

== Hierarchical Cluster Methods HACMs==

The procedure is to start off with each object in its own cluster. We then start linking objects together using a link method (described below) until all the objects have been merged into a single heirarchically organized cluster.

General Algorithm

  1. Identify the two closest points and combine them in a cluster.
  2. Identify and combine the next two closest points (treating existing clusters as points).
  3. If more than one cluster remains, return to step 2.

Document Retrieval from a Clustered Data Set

This is what we are working for, SO THIS IS THE MOST IMPORTANT ISSUE! If we don't get good results here, whats the use!

Retrieval Approaches

  1. Top Down: Start at the root and work your way down through the cluster hierarchy choosing the most similar sub-cluster until some criterion is met: Number of documents or similarity drops below a threshold value. Works well with the complete link method.
  2. Bottom up: begins with some document or cluster at the base of the tree and moves up until the retrieval criterion has been met. This method gives the best results apart from the complete link method.

Back to ComputerTerms, InformationRetrieval

ClusteringAlgorithms (last edited 2004-04-09 16:40:59 by yakko)