If documents vary greatly in size, the frequency of occurrence of a
term in a document should be normalized to document length. Explain what this means.
Given a Document-Term matrix whose elements give the number of times a given
term occurs in a given document, a dictionary giving the total
number of times each word occurs, and a table giving the number
of documents a word occurs in, how would you replace these counts with weights
which indicate the importance of the terms in their respective documents?
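
One common way to approach this kind of weighting is TF-IDF. It is sketched below as an illustration only, not as the expected answer. The function name `tfidf_weights`, the list-of-rows matrix layout, and the choice of a length-normalized term frequency with a natural-log inverse document frequency are all assumptions made for this example. The total-occurrence dictionary is not used in this particular variant.

```python
import math

def tfidf_weights(counts, doc_freq):
    """Turn raw term counts into TF-IDF weights (one assumed variant).

    counts:   list of rows, one per document; counts[d][t] is the number of
              times term t occurs in document d.
    doc_freq: list where doc_freq[t] is the number of documents containing
              term t.
    """
    n_docs = len(counts)
    weights = []
    for row in counts:
        doc_len = sum(row)  # total term occurrences in this document
        weighted = []
        for count, df in zip(row, doc_freq):
            if count == 0 or df == 0 or doc_len == 0:
                weighted.append(0.0)
                continue
            tf = count / doc_len           # normalize by document length
            idf = math.log(n_docs / df)    # rarer terms get a larger factor
            weighted.append(tf * idf)
        weights.append(weighted)
    return weights

# Hypothetical toy data: 3 documents, 3 terms.
counts = [[2, 0, 1], [1, 1, 0], [3, 0, 0]]
doc_freq = [3, 1, 1]
weights = tfidf_weights(counts, doc_freq)
```

In this toy data the first term occurs in every document, so its inverse document frequency is log(1) = 0 and it receives zero weight everywhere, while terms confined to a single document keep a positive weight.
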
In the modified centroid algorithm, term discrimination values are calculated relative to
a centroid vector of the document hyperspace. How does this reduce the computational
complexity? What other technique does this procedure use to increase the speed of computation?
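
As a rough illustration of the centroid idea, the sketch below measures space density as the average cosine similarity between each document and the centroid, and takes a term's discrimination value to be the change in density when that term is removed. This is a minimal sketch under assumed definitions (cosine similarity, density averaged over documents, hypothetical function names); it recomputes everything from scratch and does not attempt any further speed-up.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def density(docs):
    """Average similarity of each document to the centroid: one pass over N documents."""
    n = len(docs)
    centroid = [sum(col) / n for col in zip(*docs)]
    return sum(cosine(d, centroid) for d in docs) / n

def discrimination_values(docs):
    """Density without term k minus density with all terms, for each term k."""
    base = density(docs)
    values = []
    for k in range(len(docs[0])):
        without_k = [d[:k] + d[k + 1:] for d in docs]  # drop column k
        values.append(density(without_k) - base)
    return values

# Hypothetical toy data: term 0 is frequent in both documents,
# terms 1 and 2 each occur in only one document.
docs = [[5, 1, 0], [5, 0, 1]]
dv = discrimination_values(docs)
```

With these assumed definitions, removing the shared term 0 spreads the documents apart, so its value is negative, while removing term 1 or term 2 packs the documents closer together, so their values are positive.
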
How can otherwise discarded high-frequency terms be made into indexing terms?
How can low-frequency words that would otherwise be discarded be
made into indexing terms?