Assessment Date: Tutorial session on 9 October 2018 (no later than 9 October)
Submission Due Date: 11:59 PM, 12 October 2018 (no late submissions allowed)
What to Submit: Zipped source code with detailed comments
Where to Submit: Electronic submission via Blackboard
The goal of this project is to gain practical experience in using the vector space
model with tf.idf weights and the cosine similarity measure for document retrieval.
You must work on this project individually. The standard academic honesty rules apply.
Dataset: Cranfield
Assumptions: This project builds on Projects 1 and 2. It assumes that the corpus has been tokenized and converted to lowercase, that all SGML tags and stopwords have been removed, and that the corpus has been indexed with an inverted index.
Task 1 – Building the vector space model representations for the corpus: Write the necessary code to build the vector space model representation of every document in the corpus. In this representation, each term is weighted by its tf.idf weight. Assume that only the 1000 most frequent words in the corpus are used to construct the term dictionary. (2 marks)
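The steps in Task 1 can be sketched as follows. This is a minimal illustration, not a reference solution. It assumes the corpus is already available as a dictionary mapping document IDs to token lists, and it assumes one common tf.idf variant (log-scaled term frequency and base-10 idf), which the task does not prescribe. The function names are hypothetical.

```python
import math
from collections import Counter

def build_vocabulary(docs, size=1000):
    """Map the `size` most frequent corpus terms to vector positions."""
    counts = Counter(term for tokens in docs.values() for term in tokens)
    return {term: i for i, (term, _) in enumerate(counts.most_common(size))}

def compute_idf(docs, vocab):
    """idf(t) = log10(N / df(t)); the log base is an assumption."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs.values():
        # Count each term once per document.
        df.update(t for t in set(tokens) if t in vocab)
    return {term: math.log10(n_docs / df[term]) for term in vocab}

def tfidf_vector(tokens, vocab, idf):
    """Dense tf.idf vector with weight (1 + log10 tf) * idf (assumed variant)."""
    vec = [0.0] * len(vocab)
    for term, tf in Counter(tokens).items():
        if term in vocab:
            vec[vocab[term]] = (1 + math.log10(tf)) * idf[term]
    return vec
```

A query can be turned into a vector with the same `tfidf_vector` function, so that queries and documents share one term dictionary.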
Task 2 – Using the vector space model representations to perform search: Write the code to implement search. For each of the following queries, construct the query's vector space model representation, compare the query vector with all the document vectors in the dataset, and return the top 10 documents ranked by their cosine similarity to the query vector.
(1) Query = “method” (0.5 mark)
(2) Query = “transfer equations” (1 mark)
(3) Query = “free problem case” (1 mark)
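The exhaustive search in Task 2 might look like the sketch below. It assumes the document vectors are stored in a dictionary keyed by document ID and that the query vector has the same length as each document vector. The names `cosine` and `search` are illustrative only.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def search(query_vec, doc_vectors, k=10):
    """Score every document and return the k best as (score, doc_id) pairs."""
    scored = ((cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vectors.items())
    return heapq.nlargest(k, scored)
```

Using `heapq.nlargest` avoids sorting the full score list when only the top 10 results are needed.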
Task 3 – Using the Inverted Index to speed up the search:
Write the code to speed up the search process in Task 2 by using the inverted index. The idea is to first use the inverted index to select the documents that contain the query words, then compare only the selected documents' vectors with the query vector and rank them by cosine similarity. (2 marks)
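One possible shape for the index-assisted search is sketched below. It assumes the inverted index maps each term to the set of IDs of the documents containing it, and it selects documents containing at least one query word (a union). The task does not say whether any or all query words must match, so that choice is an assumption. All names are hypothetical.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def candidate_docs(query_terms, inverted_index):
    """Union of the postings of all query terms (assumed selection rule)."""
    candidates = set()
    for term in query_terms:
        candidates.update(inverted_index.get(term, ()))
    return candidates

def indexed_search(query_terms, query_vec, doc_vectors, inverted_index, k=10):
    """Rank only the candidate documents instead of the whole corpus."""
    candidates = candidate_docs(query_terms, inverted_index)
    scored = ((cosine(query_vec, doc_vectors[d]), d) for d in candidates)
    return heapq.nlargest(k, scored)
```

A document that contains none of the query words has a zero dot product with the query vector, so skipping it does not change the ranking of documents with non-zero scores.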
Code: Your implementation should be written in a general-purpose programming language (e.g., C, Java, Python, etc.) without using any external IR packages. Your code should offer a simple console interface that provides the following functions: (0.5 mark)
· Allow the user to enter the name of the corpus directory (assume that the corpus directory is in the same directory as your executable code)
· Allow the user to enter the keywords of a search query
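A console interface covering the two functions above could be sketched as follows. The prompt wording, the blank-line exit convention, and the `search_fn(corpus_dir, terms)` callback signature are all assumptions made for illustration. The `read` and `write` parameters default to `input` and `print` and exist only so the loop can be exercised without a terminal.

```python
import os

def prompt_loop(search_fn, read=input, write=print):
    """Ask for a corpus directory, then answer queries until a blank line."""
    corpus_dir = read("Corpus directory: ").strip()
    if not os.path.isdir(corpus_dir):
        write(f"Directory not found: {corpus_dir}")
        return
    while True:
        query = read("Query (blank line to quit): ").strip().lower()
        if not query:
            break
        # search_fn is expected to return (score, doc_id) pairs, best first.
        results = search_fn(corpus_dir, query.split())
        for rank, (score, doc_id) in enumerate(results, start=1):
            write(f"{rank:2d}. {doc_id}  {score:.4f}")
```

The query is lowercased before splitting so that it matches the lowercased corpus described in the assumptions.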
Deliverables: Your submission includes the following components:
1) Program: (5 marks in total)
· Source code and its brief description
· Interface for input
2) Output: (2 marks in total)
· Report each query and its results (see Task 2)
3) Performance Bonus: (3 marks in total)
· Efficiency: Report the average query execution time for Task 2 and for Task 3
separately, each measured over 10 executions of the same query.
· Retrieval Models: Implement two or more retrieval models, one of which is the Vector Space Model. (Boolean Retrieval does not count.)
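For the Efficiency item, the average execution time could be measured along the lines of the sketch below. It assumes the search routine is wrapped in a callable that takes a single query argument. The function name is hypothetical.

```python
import time

def average_query_time(search_fn, query, runs=10):
    """Average wall-clock seconds per call over `runs` executions of one query."""
    start = time.perf_counter()
    for _ in range(runs):
        search_fn(query)
    return (time.perf_counter() - start) / runs
```

Calling this once with the Task 2 search and once with the Task 3 search, using the same query, gives the two figures to report.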