CS 115 Spring 2008
Test 2

  1. If documents vary greatly in size, the frequency of occurrence of a term in a document should be normalized with respect to document length. Explain what this means.

  2. Given a document-term matrix whose elements give the number of times a given term occurs in a given document, a dictionary giving the total number of times each word occurs, and a table giving the number of documents in which each word occurs, how would you replace these counts with weights that indicate the importance of the terms in their respective documents?

  3. In the modified centroid algorithm, term discrimination values are calculated relative to a centroid vector of the document hyperspace. How does this reduce the computational complexity? What other technique does this procedure use to increase the speed of computation?

  4. How can high-frequency terms that would otherwise be discarded be made into indexing terms? How can low-frequency words that would otherwise be discarded be made into indexing terms?

  5. Given two document vectors with weights:
    D1 = <1, 3, 0, 4, 2, 0, 9>
    D2 = <0, 0, 1, 3, 5, 1, 5>
    
    Calculate the cosine similarity between them (show the details, not just a call to a cosine function).

  6. Given a document-document matrix whose elements indicate the similarity between two documents, explain how you would construct document clusters.

Due: Mon April 7 at 10 am. Typed and stapled.