2018-02-20



Assessment Date: Tutorial Session on 9 October 2018 (No later than 9 October)

Submission Due Date: 11:59 PM, 12 October 2018 (No late submission is allowed)

What to Submit: Zipped source code with detailed comments

Where to Submit: Electronic submission via blackboard




The goal of this project is to gain practical experience in using the vector space model with tf.idf weighting and the cosine similarity measure for document retrieval.


You must work on this project individually. The standard academic honesty rules apply.

Dataset: Cranfield

Assumptions: This project builds on Projects 1 and 2. It assumes that the corpus has been tokenized and converted to lowercase, that all SGML tags and stopwords have been removed, and that the corpus is indexed by an inverted index.

Task 1 – Building the vector space model representations for the corpus: Write the code needed to build a vector space model representation for every document in the corpus. In this representation, each term is weighted by its tf.idf value. Assume that the term dictionary consists of only the 1000 most frequent words in the corpus. (2 marks)
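One possible way to approach Task 1 is sketched below in Python. It is only an illustration under stated assumptions: the corpus is taken to be a dict mapping a document id to its list of preprocessed tokens, the weight is raw term frequency times log10(N / df) (the brief does not fix a tf.idf variant), and the names build_vocabulary and build_tfidf_vectors are hypothetical.

```python
import math
from collections import Counter


def build_vocabulary(docs, size=1000):
    """Return the `size` most frequent terms across the whole corpus."""
    counts = Counter()
    for tokens in docs.values():
        counts.update(tokens)
    # Descending frequency; ties broken alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:size]]


def build_tfidf_vectors(docs, vocabulary):
    """Return ({doc_id: {term: weight}}, {term: idf}).

    Assumed weighting: raw tf * log10(N / df). Vectors are stored sparsely,
    so terms that do not occur in a document are simply absent.
    """
    vocab = set(vocabulary)
    n_docs = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens) & vocab)  # count each term once per document
    idf = {term: math.log10(n_docs / df[term]) for term in df}
    vectors = {}
    for doc_id, tokens in docs.items():
        tf = Counter(t for t in tokens if t in vocab)
        vectors[doc_id] = {t: f * idf[t] for t, f in tf.items()}
    return vectors, idf
```

Storing each vector as a sparse dict rather than a dense 1000-element list keeps memory low, since most documents contain only a small fraction of the dictionary.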

Task 2 – Using the vector space model representations to perform search: Write the code to implement search. For each of the following queries, construct the query's vector space model representation, compare the query vector with all the document vectors in the dataset, and return the top 10 documents ranked by their cosine similarity to the query vector.



(1) Query = “method” (0.5 mark)

(2) Query = “transfer equations” (1 mark)

(3) Query = “free problem case” (1 mark)
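The exhaustive search described in Task 2 could look roughly like the following Python sketch. It assumes sparse document vectors of the form {term: weight} and an idf table {term: idf}, both hypothetical inputs, and it weights the query with the same tf times idf scheme as the documents, which is a choice the brief leaves open. The function names are illustrative only.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector; cosine is symmetric
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def query_vector(query, idf):
    """Build the query vector; terms outside the dictionary are ignored."""
    tf = Counter(t for t in query.lower().split() if t in idf)
    return {t: f * idf[t] for t, f in tf.items()}


def search(query, doc_vectors, idf, k=10):
    """Compare the query with every document and return the top k (doc_id, score)."""
    q = query_vector(query, idf)
    scores = [(doc_id, cosine(q, vec)) for doc_id, vec in doc_vectors.items()]
    scores = [(d, s) for d, s in scores if s > 0.0]  # drop non-matching documents
    scores.sort(key=lambda ds: (-ds[1], ds[0]))  # best score first, then doc id
    return scores[:k]
```

A call such as search("transfer equations", doc_vectors, idf) would then return at most 10 (doc_id, score) pairs, best match first.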


Task 3 – Using the Inverted Index to speed up the search:

Write the code to speed up the search process in Task 2 by using the inverted index. The idea is to first use the inverted index to select the documents that contain the query words, then compare only the selected documents' vectors with the query vector and rank them by cosine similarity. (2 marks)
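The candidate-selection idea in Task 3 can be sketched in Python as follows. Several things here are assumptions rather than requirements of the brief: the inverted index is a dict mapping a term to the list of document ids containing it, "contain the query words" is read as containing at least one query word (a union of postings lists), and document norms are precomputed once in a hypothetical doc_norms table.

```python
import math


def candidates(query_terms, inverted_index):
    """Union of the postings lists: documents containing at least one query term."""
    found = set()
    for term in query_terms:
        found.update(inverted_index.get(term, ()))
    return found


def fast_search(q_vec, doc_vectors, doc_norms, inverted_index, k=10):
    """Rank only the candidate documents by cosine similarity to q_vec."""
    q_norm = math.sqrt(sum(w * w for w in q_vec.values()))
    if q_norm == 0.0:
        return []
    scores = []
    for doc_id in candidates(q_vec, inverted_index):
        vec = doc_vectors[doc_id]
        dot = sum(w * vec.get(t, 0.0) for t, w in q_vec.items())
        if dot > 0.0:  # a positive dot product implies a non-zero document norm
            scores.append((doc_id, dot / (q_norm * doc_norms[doc_id])))
    scores.sort(key=lambda ds: (-ds[1], ds[0]))
    return scores[:k]
```

Because documents sharing no term with the query have cosine similarity 0, skipping them should not change the ranking of the matching documents; the saving comes from never touching their vectors.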

Code: Your implementation should be written in a general-purpose programming language (e.g., C, Java, or Python) without using any external IR packages. Your code should provide a simple console interface with the following functions: (0.5 mark)

· Allow the user to enter the name of the corpus directory (assume that the corpus directory is in the same directory as your executable code)

· Allow the user to enter the keywords of a search query
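A console interface covering the two functions above might be structured like this Python sketch. The load_corpus and search callables are hypothetical placeholders for whatever indexing and ranking code the project provides, and the prompts and output format are illustrative choices, not something the brief prescribes.

```python
def run_console(load_corpus, search, read=input, write=print):
    """Prompt for a corpus directory, then answer queries until a blank line.

    load_corpus(directory) and search(query, index) are supplied by the caller;
    read and write default to console I/O but can be swapped out for testing.
    """
    corpus_dir = read("Corpus directory: ").strip()
    index = load_corpus(corpus_dir)
    while True:
        query = read("Query (blank to quit): ").strip()
        if not query:
            break
        write(f"Query: {query}")
        for rank, (doc_id, score) in enumerate(search(query, index), start=1):
            write(f"{rank:2d}. doc {doc_id}  cosine={score:.4f}")
```

Passing read and write as parameters keeps the loop testable without a terminal; in normal use the defaults read from standard input and print to the console.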


Deliverables: Your submission includes the following components:


1) Program: (5 marks in total)

· Source code and its brief description

· Interface for input

2) Output: (2 marks in total)

· Report each query and its results (see Task 2)

3) Performance Bonus: (3 marks in total)

· Efficiency: Report the average query execution time for Task 2 and Task 3, respectively, over 10 executions of the same query.

· Retrieval Models: Implement two or more retrieval models, including the vector space model (Boolean retrieval does not count).
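For the efficiency item in the performance bonus, the average query time could be measured along these lines in Python. The helper name average_query_time is hypothetical, and search_fn stands in for either the Task 2 or the Task 3 search routine.

```python
import time


def average_query_time(search_fn, query, runs=10):
    """Mean wall-clock seconds per call of search_fn(query) over `runs` executions."""
    start = time.perf_counter()
    for _ in range(runs):
        search_fn(query)
    return (time.perf_counter() - start) / runs
```

Running it once with the exhaustive search and once with the inverted-index search, on the same query, gives the two averages to report side by side.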
