081:115G Spring 2009
810:115g Information Storage and Retrieval (3 hours)

Last updated: January 29, 2009
Objective To understand computer based automatic indexing and retrieval of text/web based information.
Books: Experiments in Information Storage and Retrieval Using Mumps

Mumps Book
Direct link to Mumps Book PDF

Course Materials: Programming examples will mainly use the Mumps language although other languages including C++, Perl, PHP, and Java may be used as desired.

Data bases, examples and text
Mumps Pocket Guide JPG images

Requirements: The requiements will consist of a set of assignments to be performed individually and a project which may be done either individually or in small groups (approx. 3 persons max). As this couse depends heavily on lecture content, attendance is required. Excessive absence will result in a reduced grade for the project. The term "excessive absence" means: don't push your luck.

Term Project (25%)
Assignments (40%) Late assignments charged 5% per class day. Chronically late assignments will be charged at a higher rate.
Tests (2 at 10% each)
Final (15%).

Classes are lecture format. Cell phones, pagers, laptops and PDA's may not be used.

Assignments must include:

  • in a comment block your name
  • the assignment number
  • the due date
  • you must turn in your code and an example of its execution.
  • multiple pages must be stapled.

    Assignments not complying with the above will be returned ungraded. Assignmnents may not be submitted by email.

  • Test 1 Friday Feb 27
    Test 2 Mon April 6
    Final: Click Here
    Makeup Tests Makeup tests will be given only in cases of demonstrated need for causes such as serious illness, family emergency or University sanctioned schedule conflict. In all cases, written documentation will be required.
    Penalty for
    Hacking & Cheating
    A grade of "F" for the course and possible University disciplinary action. If your work duplicates in whole or part the work of another, both works will receive a grade of F.
    Project

    Projects will be presented in class the last week of classes.

    All projects will be presented in class with a brief online demonstration. A vote will be taken for the best project and the winner will be exempt from the final (grade of A will be entered for the final exam grade). You may work in teams of from 1 to 3 (hint: have one person do the indexing, one the retrieval code and the other do the web interface). All group members will receive the same grade.

    Final Grades Final grades will not be available via email. If you want your grade mailed to you, bring a stamped, self-addressed envelope to the final.
    Contact Click Here
    Computer You may want to use your own computer to do the assignments and project. If that is not the case, you may have an account on one of my Linux servers. The assignments and project are extreme I/O intensive. If you use your own machine, a desktop with lots of memeory is best. Due to high I/O, you may not want to use a laptop.
    Data Base Reuters Unfiltered
    Reuters Filtered
    Reuters Filtered version 2
    Reuters Filtered version 3
    (version 2 has xoxoxo ... XOXOXO, version 3 has title1 ... title2).
    Getting Started
    1. Get and install Ubuntu either native, under WUBI or Sun VirtualBox. or Cygwin for Windows
    2. Get and install Mumps under Cygwin: In Cygwin, type:
      1. wget http://cns2.uni.edu/~okane/source/MUMPS-MDH/mumpscompiler-10.0.src.tar.gz
        (Check http://cns2.uni.edu/~okane/source/MUMPS-MDH/ for the latest version number).
      2. tar xvzf mumpscompiler-10.0.src.tar.gz
      3. cd mumpsc
      4. ./configure prefix=/usr
      5. make
      6. make install
    3. Learn an adult editor (nano/pico are for children) : vi Tutorial
    4. Learn Mumps.
    HTML BareBones HTML Guide
    Assignments Do not push assignments under the office door - leave them in the CS Office. All pages must be stapled.

    1. Using the reuters-filtered dataset, write a program to calculate:
      
      - the frequency of occurrence of each word.
      - the total number of titles/articles
      
      Write out for each word its frequency of occurrence, a blank and the word.
      
      Sort the result in reverse order (high to low)
      
      Read the resulting dataset and write out for each word in
      neat columns:
      
      the frequency of occurrence of the word
      the frequency of occurrence of the word times its rank
      the word.
      
      Turn in the code and the first page of the results.
      
      Notes: 
      
      The dataset is linked from my web page (reuters-filtered)
      and it is in /f/ISR/filtered-reuters on sidhe.cs.uni.edu
      
      If you use sidhe, do not make a copy of the dataset. Create a link:
      
      ln -s /f/ISR/reuters-filtered reuters-filtered
      
      I do not want 27 copies of the dataset on the machine. one is enough.
      the ln command will create a link in your directory back to
      the original file. The link will behave as though it were the
      file itself.
      
      'rank' means the order in the file, first is rank 1, second is
      rank 2 and so forth. freq*rank is known as Zipf's constant.
      
      You can sort in reverse order with something like:
      
      sort -nr < in  > out
      
      The values from the freq*rank may need to be formatted. Use the 
      justify() function. do not print endless decimal places.
      
      Each title has xoxoxo at the beginning of the title and xoxoxo at the
      end of the title. Include title words in your dictionary. use 
      $zzScanAlnum to read the input. This will eliminate punctuation
      and convert the words to lower case.
      
      Due: Fri Feb 6
      
    2. Using the Reuters data base:
      
      1. Build a file of stop list words consisting of very high and very low 
         frequency words. You may use a text editor for this.
      2. Build a document term matrix. Do not include words which are in the stop list. 
         stem (remove endings) of the words that remain. Discard stems less that 3
         characters in length. For each stem in each document, store the frequency
         of occurrence of each stem.
      3. When finished, compute a document frequency vector giving for each stem the
         number of documents it occurs in.
      
      To turn in:
      
      - The first ten document vectors giving document number, word stems and frequency.
      - The 20 stems which occur in the most number of documents.
      - Program code.
      
      Note: there is a stem function in mumps.
      
      due: Fri 20 Feb
      
      (this is about 20 lines of code)
      
    3. From the document frequency vector and doc-term
      matrix in assignment 2, calculate the IDF weight
      of each term and replace the frequencies in the
      doc-term matrix by the frequency times the IDF for
      the word.
      
      From the doc-term matrix, build build the 
      transposed index(word,doc) matrix.
      
      Turn in: the top ten IDF weights, the 
      weighted document vectors for the first ten 
      documents and the weighted term vectors for
      the first ten words in index.
      
      Due Fri Mar 13.
      
    4. Calculate the term-term matrix for the Reuters database 
      and then calculate the cohesion value for terms with
      a significant number of co-occurrences.
      
      what to turn in:
      
      - a sample of the term-term matrix giving the terms and their
        co-occurrence count (about ten terms and their related
        terms)
      
      - the top ten phrases based on the cohesion values.
      
      - the code.
      
      Due: Fri April 3
      
    Other Resources Mesh

    Documentation and Availability
    Documentation Descriptor Data Elements
    Descriptor Records
    Documentation Qualifier Data Elements
    Qualifier Records
    Supplementary Records

    Resources:

    Salton, G., Automatic Informatiuon Organization and Retrieval, McGraw Hill (1968)
    Salton, G., Automatic Text Processing, Addison-Wesley (1989)
    Salton, G., ed., The SMART Retrieval System, Prentice-Hall (1971).
    Salton, G., and McGill, M., Introduction to Modern Information Retrieval, McGraw-Hill, (1983)
    Borko, H., Automated Language Processing, (1968)
    The Smart System from Cornell: ftp://cs.cornell.edu/pub/smart


    Data Sets for Machine Learning
    WordNet
    Apache
    Web Archive (you think you have disk capacity problems!)
    PostgreSQL Tutorial
    Rod Library Electronic Resources
    ACM Digital Library
    Lemur Toolkit for Language Modeling
    Cornell Smart System
    SIGIR List and Archives
    UVA Electronic Text Center
    NLM Gateway
    Digital Library Research Laboratory
    Automatic Text Browsing Using the Vector Space Model
    Lawrence Berkeley Lab Science Articles Archive
    The Internet Archive
    Search Engine Features
    Anatomy of a Search Engine (Google)
    Medline (National Library of Medicine)
    Information Retrieval by C. J. van Rijsbergen
    Modern Information Retrieval Chapter 10
    Cystic Fibrosis Reference Collection
    Marti Hearst Site
    Online Papers
    Ed Fox Links
    IIT IR Publications
    Web IR
    WWW 10
    WWW 9
    WWW 8
    WWW 7
    WWW 5
    IRIS Project
    Lots of Links
    Top Ten Issues
    Searching Genomic Databases
    Amazon.com's recommender algorithm
    Huffman Trees
    Knuth Optimal Binary Trees
    Hu-Tucker Trees
    AVL (Balanced) Trees
    B trees and AVL Trees
    B Trees
    IBM Clever Project
    flex documentation
    Homology Searching
    Project Gutenberg

    The following notice is required by the University:

    "The Americans with Disabilities Act of 1990 (ADA) provides protection from illegal discrimination for qualified individuals with disabilities. Students requesting instructional accommodations due to disabilities must arrange for such accommodation through the Office of Disability Services. The ODS is located at: 213 Student Services Center, and the phone number is: 273-2676."

    Because the Office of Disability Services has procedures in place to determine the validity of disability claims as well as the need for instructional accommodations, faculty are reminded that they are to direct all students with accommodation requests to the above listed office.

    UNDER NO CIRCUMSTANCE SHOULD A FACULTY MEMBER MAKE AN ACCOMMODATION INDEPENDENT OF THE OFFICE OF DISABILITY SERVICES.

    Questions may be directed to: Disability Services Coordinator, at 273-2676 or to this office at 273-2846.