Graduate Learning Outcome

Unit Information: SIT772 Database and Information Retrieval
Trimester: 2022 T1
Assessment 2: Information Retrieval Techniques Problem Solving Task
This document supplies the detailed information on assessment tasks for this unit.
Key information
• Due: Week 11, Friday 27 May 2022, 8pm (AEST)
• Weighting: 30%
• Submit: Through CloudDeakin
Learning Outcomes
This assessment assesses the following Unit Learning Outcomes (ULO) and related Graduate
Learning Outcomes (GLO):

Unit Learning Outcome (ULO)	Graduate Learning Outcome (GLO)
ULO 5: Demonstrate data retrieval skills in the context of a data processing system.	GLO 1: Discipline-specific knowledge and capabilities

Purpose
This task evaluates students’ technical skills in managing unstructured data, with
potential use in real applications. It supports students’ understanding of the
techniques for unstructured data management and data processing.
Instructions and Submission Guide
This is an individual assessment task. Students are required to submit ONE written report.
• Read these instructions and the following questions.
• Submit ONE written report, with a file name of the form studentID_givenname_A2.pdf, e.g.,
123456_Kevin_A2.pdf.
• The report must be submitted via the CloudDeakin assessment portal. Submitting to the
wrong venue or submitting the wrong file may incur a penalty.
Question 1 (Index Construction): [10 Marks]
Suppose you have joined a search engine development team to design a search algorithm
based on both the Vector model and the Boolean model.
You have collected the following (unstructured) documents and plan to apply an indexing
technique to convert them into an inverted index.
Doc 1: data science is field to use scientific method, process, algorithm, system to extract
knowledge.
Doc 2: data mining is the process to discover pattern in large data to involve method at the
database system.
Doc 3: information system is the study of network of hardware and software that people use
to process data.
Doc 4: data mining is a technique of data science, which can be developed into an information
system to solve practical application problems.
To answer the questions below, show the detailed procedure step by step. Remove all stop
words and punctuation before creating the inverted index (choose the stop words based on
your own reasonable judgement, e.g., is, the, of, that, to, etc.). After that, complete the
following steps:
Question 1.1: [3 Marks]
Create a merged inverted list including the within-document frequencies for each term.
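The preprocessing and indexing steps can be illustrated with a minimal sketch. It runs on a toy two-document collection, not the assignment documents. The stop-word list, the names tokenize and build_inverted_list, and the (doc_id, frequency) layout are assumptions made for illustration; the format in the lecture notes may differ.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list; the task leaves the exact choice to you.
STOP_WORDS = {"is", "the", "of", "that", "to", "a", "in", "and", "at"}

def tokenize(text):
    """Lower-case the text, drop punctuation, then drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def build_inverted_list(docs):
    """Map each term to a list of (doc_id, within-document frequency)."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term, tf in Counter(tokenize(docs[doc_id])).items():
            index[term].append((doc_id, tf))
    # Sort the terms alphabetically so the merged list is easy to read.
    return dict(sorted(index.items()))

# Toy collection (NOT the assignment documents).
toy_docs = {
    1: "The web search engine is fast.",
    2: "Search the web, search the index.",
}

inverted = build_inverted_list(toy_docs)
for term, postings in inverted.items():
    print(term, postings)
```

Each printed line pairs a term with the documents that contain it and how often it occurs in each of them.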
Question 1.2: [3 Marks]
Use the index created in Question 1.1 to create a dictionary and the related posting file.
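One possible way to split an inverted list into a dictionary and a postings file is sketched below. The toy index, the function names split_index and lookup, and the choice to store (document frequency, offset) in the dictionary are assumptions for illustration, not the layout prescribed by the lecture notes.

```python
# Hypothetical merged inverted list: term -> [(doc_id, within-document frequency)].
inverted = {
    "engine": [(1, 1)],
    "search": [(1, 1), (2, 2)],
    "web": [(1, 1), (2, 1)],
}

def split_index(inverted):
    """Return (dictionary, postings_file).

    dictionary maps term -> (document frequency, offset into postings_file);
    postings_file is one flat list of (doc_id, tf) entries.
    """
    dictionary, postings_file = {}, []
    for term in sorted(inverted):
        postings = inverted[term]
        dictionary[term] = (len(postings), len(postings_file))
        postings_file.extend(postings)
    return dictionary, postings_file

def lookup(term, dictionary, postings_file):
    """Fetch the postings of a term through its dictionary entry."""
    if term not in dictionary:
        return []
    df, offset = dictionary[term]
    return postings_file[offset:offset + df]

dictionary, postings_file = split_index(inverted)
print(dictionary)
print(postings_file)
print(lookup("search", dictionary, postings_file))
```

The dictionary stays small enough to search quickly, while the bulk of the data sits in the flat postings file and is read only at the recorded offset.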
Question 1.3: [2 Marks]
Design three Boolean queries using AND, OR, and NOT (for example, web AND search)
and list the relevant documents for each query. Each selected query keyword should be
contained in at least two documents.
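As a sketch of how Boolean queries are evaluated, the three operators can be mapped onto set operations over document IDs. The postings below are hypothetical and unrelated to the assignment documents.

```python
# Hypothetical postings as sets of document IDs (frequencies are not needed here).
postings = {
    "web": {1, 2, 4},
    "search": {2, 3},
    "index": {1, 3},
}
ALL_DOCS = {1, 2, 3, 4}

def docs(term):
    """Return the set of documents containing the term (empty if unknown)."""
    return postings.get(term, set())

# AND -> intersection, OR -> union, NOT -> complement with respect to ALL_DOCS.
web_and_search = docs("web") & docs("search")
web_or_index = docs("web") | docs("index")
web_and_not_index = docs("web") & (ALL_DOCS - docs("index"))

print(sorted(web_and_search))      # [2]
print(sorted(web_or_index))        # [1, 2, 3, 4]
print(sorted(web_and_not_index))   # [2, 4]
```

Note that a Boolean query returns an unranked set: a document either satisfies the expression or it does not.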
Question 1.4: [2 Marks]
Use the Vector model to query the inverted index, and compare the result with that of the
Boolean model. (Hint: you can use cosine similarity and set a similarity threshold.)
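The hint can be illustrated with a minimal sketch of cosine-similarity ranking. It assumes raw term frequencies as weights (no idf), a binary query vector, and a threshold of 0.5; all three are illustrative choices, and the document vectors are hypothetical.

```python
import math

# Hypothetical term-frequency vectors per document (term -> tf).
doc_vectors = {
    1: {"web": 1, "search": 1, "engine": 1},
    2: {"web": 1, "search": 2, "index": 1},
    3: {"index": 2, "engine": 1},
}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def vector_search(query_terms, threshold):
    """Rank documents by similarity and keep those at or above the threshold."""
    query = {t: 1 for t in query_terms}
    scores = {d: cosine(query, vec) for d, vec in doc_vectors.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(d, round(s, 3)) for d, s in ranked if s >= threshold]

results = vector_search(["web", "search"], threshold=0.5)
print(results)  # [(2, 0.866), (1, 0.816)]
```

On this toy data the Boolean query web AND search would return documents 1 and 2 as an unordered set, whereas the Vector model also ranks document 2 above document 1.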
Marking Rubric: For Questions 1.1 and 1.2, 3 marks are given if the solution is fully
correct and the writing format is professional, as shown in the lecture notes; 2 marks if the
solution is correct but the writing format is not professional; 1 mark if the solution is
partially correct; 0 marks otherwise. For Questions 1.3 and 1.4, 2 marks are given if the
queries and the results match; 1 mark if one query and its results do not match; 0 marks
otherwise.
Question 2 (IR Evaluation): [20 Marks]
In this question, you are required to evaluate the performance of different search engines.
• First, select two of the following three search engines that you are familiar with:
https://www.google.com.au/, https://www.bing.com/?cc=au, https://au.yahoo.com/.
Figure 1: Select the search engine located in Australia
• Second, constrain your queries to www.reuters.com as follows: given a keyword query
“high-tech global”, type “high-tech global site:www.reuters.com” into the search box of your
selected search engine. An example using Google is shown in Figure 2 and Figure 3.
[A penalty applies if your results do not follow this instruction.]
Figure 2: Search “high-tech global” within the website www.reuters.com
Figure 3: Search Results of “high-tech global” within the website www.reuters.com
• Third, choose one of the targets below and design two queries, Query 1 and Query 2.
Both queries must be tested in both search engines.
  ◦ Target 1: Australia fully re-opens the border after two years of COVID-19
pandemic closure. [Note: A news item can be considered relevant if it contains
the terms “Australia, open” and has concrete information. This keeps you from
treating all retrieved news as relevant while still yielding a number of relevant
results.]
  ◦ Target 2: The global economy will recover after many years. [Note: A news item
can be considered relevant if it contains the terms “economic, recovery” and has
concrete information. This keeps you from treating all retrieved news as relevant
while still yielding a number of relevant results.]
• Finally, take the first 20 results from each search engine, keeping only news from
Reuters and ignoring advertisements. If a result matches the target news, mark it as
relevant; otherwise, mark it as irrelevant. Note: assume there are 10 relevant news
items in total (retrieved and not retrieved). Based on your own judgement, you may
retrieve only 6~9 relevant results, i.e., 1~4 relevant results are not retrieved. If
there are more than 9 relevant results, you can simply label some relevant news as
irrelevant for the purposes of this exercise.
The following questions are based on your search results.
Question 2.1: [3 Marks]
List your target, your designed search queries, and the results. (You can use any keywords
related to the target news, even if they do not appear in the target news text.) For each
search result, list its URL (web link) and news title, and mark it as relevant or irrelevant
according to the requirements above. [You cannot treat all search results as relevant.]
For instance, if you choose Target 1, you can prepare your solution as follows:
Target 1: Australia fully re-opens the border after two years of COVID-19 pandemic closure
R1 (Relevant) – https://www.reuters.com/world/asia-pacific/australia-fully-reopens-borders-shut-by-covid-pandemic-welcomes-back-tourists-2022-02-20/
‘Welcome back world!’: Australia fully reopens borders after two years
R2 (Relevant) – https://www.reuters.com/world/asia-pacific/australia-fully-reopen-borders-vaccinated-travellers-feb-21-2022-02-07/
After two years of closed borders, Australia welcomes the world back
R3 (Irrelevant) – https://www.reuters.com/lifestyle/sports/australian-open-organisers-deny-slack-covid-testing-2022-01-20/
Australian Open organisers deny slack COVID testing
………
R20 ……
In summary, the output list will be shown as:
1, 1, 0, ….
Marking Rubric: 3 marks (full) if the answer follows the requirements, i.e., 20 URLs
and news titles are clearly presented and the relevance labels make sense (6~9 relevant
results among the 20 retrieved news items); 2 marks if the answer is partially provided;
0 marks if no reasonable answer is provided.
Question 2.2: [3 Marks]
Compute the precision and recall values at each hit for the 20 news results of Query 1 in
search engine 1. Graph the precision and recall of the hits by following Lecture 10 (Page 18).
Draw the result of interpolation at the 11 standard recall levels by following Lecture 10
(Pages 20-22).
The precision and recall values of hits are shown as:
Query 1 in Search Engine 1 –
1 (precision=***, recall=***), 1 (precision=***, recall=***), 0 (precision=***,
recall=***), ….
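The per-hit values can be computed as in the following minimal sketch. The relevance list is a short hypothetical example, not real search results; the function name is illustrative, and the total of 10 relevant items follows the assumption stated in the task.

```python
def precision_recall_at_hits(relevance, total_relevant):
    """Return (label, precision, recall) after each ranked result.

    relevance is the ranked list of labels, 1 = relevant, 0 = irrelevant.
    """
    rows, found = [], 0
    for rank, rel in enumerate(relevance, start=1):
        found += rel
        rows.append((rel, found / rank, found / total_relevant))
    return rows

# Hypothetical relevance list for the first five results.
toy_relevance = [1, 1, 0, 1, 0]
rows = precision_recall_at_hits(toy_relevance, total_relevant=10)
for rel, precision, recall in rows:
    print(f"{rel} (precision={precision:.3f}, recall={recall:.3f})")
```

Precision divides the relevant results seen so far by the current rank; recall divides them by the total number of relevant items.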
Graphing the precision and recall of hits:
[Example only]
Result of interpolation:
[Example only]
Marking Rubric: 3 marks (full) if the precision-recall values, the precision-recall graph,
and the interpolation chart are all correctly presented; 1 mark is deducted for each of these
three items that is incorrect.
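Interpolation at the 11 standard recall levels can be sketched as follows. The sketch assumes the common convention that the interpolated precision at recall level r is the highest precision observed at any recall greater than or equal to r, and 0 when that recall is never reached; check this against the lecture before relying on it. The relevance list is hypothetical.

```python
def interpolate_11pt(relevance, total_relevant):
    """Return [(recall_level, interpolated_precision)] for levels 0.0 to 1.0."""
    points, found = [], 0
    for rank, rel in enumerate(relevance, start=1):
        found += rel
        points.append((found / total_relevant, found / rank))
    curve = []
    for i in range(11):
        level = i / 10
        # Highest precision among all points whose recall reaches this level;
        # the small tolerance guards against floating-point rounding.
        best = max((p for r, p in points if r >= level - 1e-9), default=0.0)
        curve.append((level, best))
    return curve

# Hypothetical relevance list, with 10 relevant items assumed in total.
toy_relevance = [1, 1, 0, 1, 0]
curve = interpolate_11pt(toy_relevance, total_relevant=10)
for level, precision in curve:
    print(f"({level:.1f}, {precision:.2f})")
```

The resulting curve never increases as recall grows, which is what makes curves from different queries comparable level by level.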
Question 2.3: [3 Marks]
Compute the precision and recall values at each hit for the 20 news results of Query 1 in
search engine 2. Graph the precision and recall of the hits by following Lecture 10 (Page 18).
Draw the result of interpolation at the 11 standard recall levels by following Lecture 10
(Pages 20-22).
The precision and recall values of hits are shown as:
Query 1 in Search Engine 2 –
1 (precision=***, recall=***), 1 (precision=***, recall=***), 0 (precision=***,
recall=***), ….
Graphing the precision and recall of hits:
[Example only]
Result of interpolation:
[Example only]
Marking Rubric: 3 marks (full) if the precision-recall values, the precision-recall graph,
and the interpolation chart are all correctly presented; 1 mark is deducted for each of these
three items that is incorrect.
Question 2.4: [3 Marks]
Compute the precision and recall values at each hit for the 20 news results of Query 2 in
search engine 1. Graph the precision and recall of the hits by following Lecture 10 (Page 18).
Draw the result of interpolation at the 11 standard recall levels by following Lecture 10
(Pages 20-22).
The precision and recall values of hits are shown as:
Query 2 in Search Engine 1 –
1 (precision=***, recall=***), 1 (precision=***, recall=***), 0 (precision=***,
recall=***), ….
Graphing the precision and recall of hits:
[Example only]
Result of interpolation:
[Example only]
Marking Rubric: 3 marks (full) if the precision-recall values, the precision-recall graph,
and the interpolation chart are all correctly presented; 1 mark is deducted for each of these
three items that is incorrect.
Question 2.5: [3 Marks]
Compute the precision and recall values at each hit for the 20 news results of Query 2 in
search engine 2. Graph the precision and recall of the hits by following Lecture 10 (Page 18).
Draw the result of interpolation at the 11 standard recall levels by following Lecture 10
(Pages 20-22).
The precision and recall values of hits are shown as:
Query 2 in Search Engine 2 –
1 (precision=***, recall=***), 1 (precision=***, recall=***), 0 (precision=***,
recall=***), ….
Graphing the precision and recall of hits:
[Example only]
Result of interpolation:
[Example only]
Marking Rubric: 3 marks (full) if the precision-recall values, the precision-recall graph,
and the interpolation chart are all correctly presented; 1 mark is deducted for each of these
three items that is incorrect.
Question 2.6: [3 Marks]
Compute the average interpolated precision over Query 1 and Query 2 for search engine 1 and
for search engine 2, respectively. Plot the two sets of average interpolated precisions
against the 11 standard recall levels in one chart by referring to Lecture 10 (Pages 22-23).
Explain which search engine works better according to the evaluation of the two queries, and
what the specific advantages of each engine are in the situations where it works best.
The average precisions of the queries for Search Engine 1 and Search Engine 2 are as follows:
Engine 1: (0, ***), (0.1, ***), ….
Engine 2: (0, ***), (0.1, ***), ….
The chart of the average precisions of the queries for Search Engine 1 and Search
Engine 2 is shown below:
Marking Rubric: 3 marks (full) if both the average precisions and the chart are correctly
calculated and drawn, and the explanation is reasonable; 2 marks if both are correctly
calculated and drawn but without a reasonable explanation; 0 marks otherwise.
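Averaging the interpolated precisions of two queries level by level can be sketched as below. The two precision lists are hypothetical values, not measured results, and the function name is illustrative.

```python
def average_curves(curves):
    """Average several 11-point precision lists, one recall level at a time."""
    return [sum(values) / len(values) for values in zip(*curves)]

# Hypothetical interpolated precisions at recall levels 0.0, 0.1, ..., 1.0.
engine1_q1 = [1.0, 1.0, 0.8, 0.6, 0.5, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0]
engine1_q2 = [1.0, 0.5, 0.5, 0.5, 0.4, 0.4, 0.3, 0.0, 0.0, 0.0, 0.0]

engine1_avg = average_curves([engine1_q1, engine1_q2])
pairs = [(i / 10, round(p, 3)) for i, p in enumerate(engine1_avg)]
print(pairs)
```

Repeating the same computation for the second search engine gives the two curves to plot on one chart; the engine whose curve lies higher at a given recall level is performing better at that level.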
Question 2.7: [2 Marks]
Regarding Query 1, calculate the Mean Average Precision (MAP) and R-precision
(Recall-precision) by following Lecture 10 (Pages 26-27). List the procedure and the final
result.
MAP procedure is as below:
MAP = 0.2307
R-precision procedure is as below:
R-precision = 0.357
Marking Rubric: 2 marks (full) if both MAP and R-precision are calculated correctly; 1 mark
if either of them is incorrect; 0 marks otherwise.
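A minimal sketch of both measures for a single ranked list follows. It assumes that average precision sums the precision at each relevant hit and divides by the total number of relevant items, and that R-precision is the precision over the first R results, where R is the total number of relevant items. The lecture may use a different normaliser, so treat these conventions as assumptions. The relevance list is hypothetical.

```python
def average_precision(relevance, total_relevant):
    """Sum the precision at each relevant hit, divided by total relevant items."""
    found, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            found += 1
            total += found / rank
    return total / total_relevant

def r_precision(relevance, total_relevant):
    """Precision over the first R results, where R = total relevant items."""
    return sum(relevance[:total_relevant]) / total_relevant

# Hypothetical ranked relevance list with 4 relevant items assumed in total.
toy_relevance = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
ap = average_precision(toy_relevance, total_relevant=4)
rp = r_precision(toy_relevance, total_relevant=4)
print(f"AP = {ap:.4f}, R-precision = {rp:.2f}")
```

MAP is the mean of the average precision values over a set of queries, so with a single query it equals that query's average precision.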
Extension requests
Requests for extensions should be made via CloudDeakin – SIT772 – Assessments Menu at
least 3 days before the assessment due date. You also need to show your work in progress and
explain why the extension is needed to complete the assignment. Without clear evidence
of work in progress, the request cannot be approved.
Special consideration
You may be eligible for special consideration if circumstances beyond your control prevent
you from undertaking or completing an assessment task at the scheduled time.
See the following link for advice on the application process:
http://www.deakin.edu.au/students/studying/assessment-and-results/special-consideration
Assessment feedback
Detailed written feedback and results will be provided within two weeks of submission.
Academic integrity, plagiarism and collusion
Plagiarism and collusion constitute extremely serious breaches of academic integrity. They are
forms of cheating, and severe penalties are associated with them, including cancellation of
marks for a specific assignment, for a specific unit or even exclusion from the course. If you
are ever in doubt about how to properly use and cite a source of information, refer to the
referencing site linked below.
Plagiarism occurs when a student passes off as the student’s own work, or copies without
acknowledgement as to its authorship, the work of any other person or resubmits their own
work from a previous assessment task.
Collusion occurs when a student obtains the agreement of another person for a fraudulent
purpose, with the intent of obtaining an advantage in submitting an assignment or other work.
Work submitted may be reproduced and/or communicated by the university for the purpose of
assuring academic integrity of submissions:
https://www.deakin.edu.au/students/study-support/referencing/academic-integrity
