SIT772 Database and Information Retrieval · 2018. 9. 10. · SIT772 Database and Information...

4
SIT772 Database and Information Retrieval Assessment Task 2 Trimester 2 2018 Deakin University 1 FutureLearn This task evaluates the student's technical skills in the management of unstructured data, with potential usage in real applications. This assessment supports student understandings of the techniques related to unstructured data management and data processing. This assessment task enables students to demonstrate their proficiency against Unit Learning Outcome 5. ULO5: Demonstrate data retrieval skills in the context of a data processing system. Assessment 2 (Individual) Written report Weight (% of total mark) 30 Due date Friday, 21 September 2018 11:59 pm (AEST). Submission method Referencing style Please read the rubric carefully as it outlines what criteria your assessment will be evaluated on. Instructions Read these instructions Answer as many questions as possible Place your name, ID and answers in your document. Please submit your word file with your answers and graphs (embedded) where appropriate as a SINGLE document in the Submission Portal. Do not submit PDF files Through CloudDeakin via FutureLearn Harvard

Transcript of SIT772 Database and Information Retrieval · 2018. 9. 10. · SIT772 Database and Information...

  • SIT772 Database and Information Retrieval

    Assessment Task 2 Trimester 2 2018

    Deakin University 1 FutureLearn

    This task evaluates the student's technical skills in the management of unstructured data, with potential usage in real applications.

    This assessment supports student understandings of the techniques related to unstructured data management and data processing.

    This assessment task enables students to demonstrate their proficiency against Unit Learning Outcome 5.

    ULO5: Demonstrate data retrieval skills in the context of a data processing system.

    Assessment 2 (Individual) Written report

    Weight (% of total mark) 30

    Due date Friday, 21 September 2018 11:59 pm (AEST).

    Submission method

    Referencing style

    Please read the rubric carefully as it outlines what criteria your assessment will be evaluated on.

    Instructions

    • Read these instructions

    • Answer as many questions as possible

    • Place your name, ID and answers in your document.

    • Please submit your word file with your answers and graphs (embedded) where appropriateas a SINGLE document in the Submission Portal.

    Do not submit PDF files

    Through CloudDeakin via FutureLearn

    Harvard

  • Deakin University 2 FutureLearn

    Question 1 (15 marks):

    Suppose you have joined a search engine development team to design a search algorithm based on both the Vector model and the Boolean model.

    You are supposed to collect unstructured documents for the following topics, and apply an index technique to convert them into an inverted index.

    Please collect 3 documents (less than 30 words for each) in three different topics. Topics are listed as follows, you can also choose some other topics you prefer.

    • Science

    • Computer Vision

    • Search Engine

    • Database

    • Security and privacy.

    An example of document:

    “Google is the most widely used Web search engine in the World. It claims to be the World’s most comprehensive search engine, indexing over 2.4 billion Web pages.”

    1. Creating the inverted index. In the process of creating the inverted index, pleasecomplete the following steps:

    a. Find a stopword list in the Internet and remove all stopwords and punctuationfrom those three documents. Then apply Porter’s stemming algorithm to alldocuments. Note that there are plenty of online stemming applications available,and you may use Porter algorithm for this question. The output will be a set ofstemmed terms.

    b. Create a merged inverted list including the within-document frequencies for eachterm.

    c. Use the index created in step (b) to create a dictionary and the related posting file.

    d. You may like to test the inverted index by using some keywords, please selectsome keywords from the documents. For example: google, web, search.

    2. Boolean and Vector queries.

    a. Please design three Boolean queries, (for example, web AND search) and list therelevant documents for each query.

    b. Please use the Vector model to query on the inverted index, and compare theresult with the Boolean model. (Hint: you can use cosine similarity and set asimilarity threshold).

  • SIT772 Database and Information Retrieval

    Assessment Task 2 Trimester 2 2018

    Deakin University 3 FutureLearn

    Question 2 (IR Evaluation, 15 marks):

    For this exercise, you are required to evaluate the performance of different search engines.

    First, please find two search engines you are familiar with, such as Google, Bing, Yahoo!, etc.

    Second, please choose a target in the following groups, and design two queries to search in both search engines.

    The target is chosen by the last number of your student ID. For example, if your student ID ends with the number is 1, please choose target 1; if it is 0, please choose target 10.

    • Target 1: obtain the unit guide of SIT771.

    • Target 2: obtain the unit guide of SIT772.

    • Target 3: obtain the unit guide of SIT773.

    • Target 4: obtain the unit guide of SIT774.

    • Target 5: obtain the price of the new Macbook.

    • Target 6: obtain the price of the new iPhone.

    • Target 7: obtain the price of a Lenovo Laptop.

    • Target 8: obtain the install document of MongoDB.

    • Target 9: obtain the manual of MongoDB.

    • Target 10: obtain the operation guide of MongoDB.

    Select the first 20 results in both search engines, if they return the target, then mark them as relevant documents, otherwise, they are irrelevant. The following exercises are based on your search results.

    a. List your target and designed search queries (you can use any keywords you think are related to the target). For Search Engine 1, plot the precision versus recall curves for Query 1 and Query 2, interpolated to the 11 standard recall levels. Also plot the average precision versus recall curve for Search Engine 1 (all three curves should be on a single chart).

    b. For Search Engine 2, plot the precision versus recall curves for Query 1 and Query 2, interpolated to the 11 standard recall levels. Also plot the average precision versus recall curve for Search Engine 2 (all three curves should be on a single chart, but a separate chart from that used in part (a)).

    c. Plot the averages for Search Engine 1 and Search Engine 2 on a separate chart, and compare the algorithms in terms of precision and recall. Which search engine do you think is superior? Why?

  • SIT772 DATABASE AND INFORMATION RETRIEVAL, Trimester2 2018 Assessment Task 2 Rubric

    1

    Question 1 Excellent Good Fair Unsatisfactory

    Output porter stemming results

    3 marks

    Output suitable word list on all of the three documents.

    2 marks

    Output suitable word list on two of the three documents.

    1 mark

    Output suitable word list on one of the documents.

    0 mark

    Fail to output stemming results on any document.

    Create a merged inverted list including the within-document frequencies list index

    3 marks

    Successfully created the merged inverted list index.

    2 marks

    2/3 of terms are correctly indexed. 1 mark

    Half of the terms are correctly indexed. 0 mark

    Fail to create an inverted list index.

    Create a dictionary and the related posting file

    3 marks

    Successfully created the dictionary. All terms are pointing to the posting file

    2 marks

    2/3 of the dictionary file is correct. All related terms are pointing to the posting file

    1 mark

    Half of the dictionary file is correct. All related terms are pointing to the posting file

    0 mark

    Fail to create a dictionary and the related posting file.

    Boolean model

    3 marks

    The correct documents for all of the three queries are correct.

    2 marks

    The correct documents for two of the three queries are correct.

    1 mark

    The correct documents for one of the three queries are correct.

    0 mark

    Fail to get the correct documents for any of the queries.

    Vector Model

    3 marks

    The correct documents for all of the three queries are correct. Provide compare result.

    2 marks

    The correct documents for two of the three queries are correct. Provide the compare result

    1 mark

    The correct documents for one of the three queries are correct. Provide the compare result

    0 mark

    Fail to get the correct documents for any of the queries. Fail to compare vector and Boolean models

    Question 2 Excellent Good Fair Unsatisfactory

    List the target and related queries

    3 marks

    List target and both queries. 2 marks

    List target, but only one query 1 mark

    List target, but not queries 0 mark

    Fail to list the target and queries

    Search engine 1: Plot recall and precision curve

    3 marks

    Both Precision and recall values and curves are correct.

    2 marks

    Precision and recall values for both queries are correct, but related curves are incorrect.

    1 mark

    Precision and recall values for one query are correct, but another one is incorrect.

    0 mark

    Curves are missing or wrong.

    Search engine 2: Plot recall and precision curve

    3 marks

    Both values and curves are correct. 2 marks

    Precision and recall values for both queries are correct, but related curves are incorrect.

    1 mark

    Precision and recall values for one query are correct, but another one is incorrect.

    0 mark

    Curves are missing or wrong.

    Interpolation of both curves of search engine 1 and 2.

    3 marks

    Provide correct interpolations for both curves.

    2 marks

    Provide approximate correct interpolation but with some key points missing

    1 mark

    Provide one correct interpolation but other’s a wrong

    0 mark

    Fail to provide the interpolation

    Compare two engines 3 marks

    Obtain correct comparison result with full explanation.

    2 marks

    Obtain correct comparison result, but does not provide enough explanation.

    1 mark

    Compare part of result, but the result in doubtable

    0 mark

    Fail to compare them

    Overall score Level 4 24 or more

    Level 3 16 or more

    Level 2 8 or more

    Level 1 0 or more

    SIT772-Assessment task 2 T2 2018.pdfInstructionsDo not submit PDF filesQuestion 1 (15 marks):Question 2 (IR Evaluation, 15 marks):

    SIT772-Assessment-2-rubric T2 2018.pdf