10. Evaluation

The evaluation of an Information Retrieval (IR) system is crucial to assess its effectiveness and performance. It involves comparing the system's retrieved results against a known set of relevant documents, typically provided by human assessors. The evaluation process allows us to measure various metrics that reflect the system's retrieval accuracy, relevance, and user satisfaction.

Key Concepts in IR Evaluation:

  1. Relevance:

    • Relevance refers to the degree to which a retrieved document meets the information needs of a user with respect to their query.
    • Retrieved documents that are relevant are counted as true positives, while retrieved documents that are not relevant are counted as false positives; relevant documents that were not retrieved are false negatives.
  2. Precision and Recall:

    • Precision measures the proportion of retrieved documents that are relevant out of all retrieved documents.

    • Recall measures the proportion of relevant documents that were successfully retrieved out of all relevant documents in the collection.

    • Precision and Recall are commonly used together to evaluate an IR system's retrieval effectiveness; they, along with the ranked metrics below, are illustrated in the code sketches after this list.

  3. F1 Score:

    • The F1 score is the harmonic mean of Precision and Recall and provides a single metric to evaluate the trade-off between the two.
    • It is especially useful when the focus is on balancing precision and recall.
  4. Mean Average Precision (MAP):

    • MAP is a popular metric for evaluating IR systems when there are multiple queries.
    • It calculates the average precision across all queries and provides a comprehensive measure of the system's performance.
  5. Mean Reciprocal Rank (MRR):

    • MRR is a metric used to evaluate the performance of a ranked retrieval system. It is particularly useful when dealing with ranked lists of documents, such as search engine results.
    • For each query in the evaluation set, the ranked list of retrieved documents is generated.
    • The reciprocal rank (RR) for each query is calculated as the inverse of the rank of the first relevant document in the list. If no relevant documents are retrieved for a query, the RR is 0.
    • The Mean Reciprocal Rank (MRR) is computed as the average of the reciprocal ranks across all queries in the evaluation set.
  6. Normalized Discounted Cumulative Gain (nDCG):

    • nDCG is a metric that considers the ranking order of retrieved documents in addition to their relevance.
    • It takes into account the graded relevance of documents and penalizes the system for suboptimal ranking.
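
A minimal sketch of how the set-based metrics above (Precision, Recall, F1) can be computed from a set of retrieved document IDs and a set of judged-relevant document IDs; the function and document names are illustrative, not taken from any particular library:

```python
def precision_recall_f1(retrieved, relevant):
    """Compute set-based Precision, Recall, and F1 for a single query.

    retrieved: collection of document IDs returned by the IR system
    relevant:  collection of document IDs judged relevant for the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = retrieved & relevant            # relevant documents that were retrieved

    precision = len(true_positives) / len(retrieved) if retrieved else 0.0
    recall = len(true_positives) / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)      # harmonic mean of Precision and Recall
    return precision, recall, f1


# Hypothetical example: 4 documents retrieved, 3 of the 5 relevant documents found
p, r, f = precision_recall_f1({"d1", "d2", "d3", "d4"}, {"d1", "d3", "d4", "d7", "d9"})
print(p, r, f)  # 0.75 0.6 0.666...
```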
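
The ranked metrics (AP/MAP, RR/MRR, nDCG) can be sketched in the same spirit. This version uses the linear-gain form of DCG and assumes graded judgments are given as a dictionary from document ID to grade; all names and the small demo data are illustrative:

```python
import math


def average_precision(ranking, relevant):
    """Average Precision (AP) for one ranked list; MAP is the mean of AP over all queries."""
    relevant = set(relevant)
    hits, precision_sum = 0, 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / k        # precision at the rank of each relevant document
    return precision_sum / len(relevant) if relevant else 0.0


def reciprocal_rank(ranking, relevant):
    """Reciprocal Rank (RR) for one query; MRR is the mean of RR over all queries."""
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / k                   # inverse of the rank of the first relevant document
    return 0.0                               # no relevant document retrieved


def ndcg(ranking, graded_relevance, k=None):
    """Normalized DCG with linear gains over graded judgments (e.g. a 0-3 scale)."""
    k = k or len(ranking)
    gains = [graded_relevance.get(doc, 0) for doc in ranking[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(graded_relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# MAP and MRR are simply averages of the per-query values:
queries = {"q1": (["d1", "d5", "d2"], {"d1", "d2"}),
           "q2": (["d9", "d3"], {"d3"})}
ap_values = [average_precision(ranked, rel) for ranked, rel in queries.values()]
rr_values = [reciprocal_rank(ranked, rel) for ranked, rel in queries.values()]
print(sum(ap_values) / len(ap_values))   # MAP ≈ 0.667
print(sum(rr_values) / len(rr_values))   # MRR = 0.75
print(ndcg(["d1", "d5", "d2"], {"d1": 3, "d2": 1, "d5": 0}))  # ≈ 0.96
```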

Evaluation Methodologies:

  1. Offline Evaluation:

    • In offline evaluation, the evaluation is performed using a pre-defined set of queries and known relevant documents.
    • The system's retrieval results are compared to the relevance judgments to compute metrics such as precision, recall, F1 score, and MAP; a minimal evaluation loop is sketched after this list.
  2. Online Evaluation:

    • Online evaluation involves conducting experiments with real users in a live environment.
    • User interactions, click-through rates, and other user behavior are analyzed to assess the system's performance in a real-world setting.
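
As a rough illustration of the offline workflow, assuming the per-query system output and relevance judgments are already loaded into dictionaries (the names here, and the reuse of precision_recall_f1 from the sketch above, are illustrative):

```python
def offline_evaluation(runs, qrels):
    """runs:  query_id -> ranked list of retrieved document IDs (system output)
    qrels: query_id -> set of document IDs judged relevant (ground truth)
    Returns macro-averaged (precision, recall, F1) over all queries."""
    per_query = []
    for query_id, retrieved in runs.items():
        relevant = qrels.get(query_id, set())
        per_query.append(precision_recall_f1(retrieved, relevant))
    n = len(per_query)
    return tuple(sum(values) / n for values in zip(*per_query))
```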

Test Collections:

Definition: A test collection is a carefully curated dataset used for evaluating and benchmarking the performance of an Information Retrieval (IR) system. It consists of a predefined set of queries, a corresponding set of relevant documents for each query, and sometimes additional relevance judgments provided by human assessors.

Purpose of Test Collections: The primary purpose of test collections is to provide a standardized and reproducible way to evaluate the effectiveness of IR systems. They serve as a reference point for comparing different retrieval algorithms, configurations, and improvements in a fair and controlled manner.

Components of Test Collections:

  1. Queries:

    • Queries are the information needs or search requests provided to the IR system for retrieval.
    • They can be specific keywords or phrases, or more complex natural language queries.
  2. Relevant Documents:

    • Relevant documents are the documents considered to be useful and related to a particular query's information need.
    • Human assessors typically provide relevance judgments for each query-document pair.
  3. Additional Metadata:

    • Some test collections may include additional metadata about the documents, such as author names, publication dates, or document categories, to support more sophisticated evaluation.
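
As a concrete illustration of these components, relevance judgments are often distributed as a plain "qrels" file pairing query IDs with document IDs and (binary or graded) judgments. The whitespace-separated layout assumed below follows the common TREC-style convention, though the exact format varies by collection, and the loader is only a sketch:

```python
from collections import defaultdict


def load_qrels(path):
    """Load a qrels file into a nested dict: query_id -> {doc_id: relevance}.

    Each line is assumed to look like: <query_id> <iteration> <doc_id> <relevance>
    e.g. "101 0 DOC-001 2" means DOC-001 has graded relevance 2 for query 101.
    """
    judgments = defaultdict(dict)
    with open(path) as f:
        for line in f:
            query_id, _iteration, doc_id, relevance = line.split()
            judgments[query_id][doc_id] = int(relevance)
    return judgments
```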

Creating Test Collections:

  1. Query Selection:

    • Test collections often include a diverse set of queries to cover various user information needs and scenarios.
    • Queries can be derived from real user search logs, topic lists, or common information needs identified by domain experts.
  2. Relevance Judgments:

    • Relevance judgments are essential for evaluating the performance of IR systems accurately.
    • Human assessors manually judge the relevance of retrieved documents for each query (ground truth).
    • Relevance judgments are typically binary (relevant/not relevant) or graded (e.g., on a scale of 0 to 3).
  3. Document Selection:

    • The document collection used in a test collection should be representative of the target domain or the application area of the IR system being evaluated.
    • The size of the document collection can vary based on the evaluation needs and the available resources.

Two Evaluation Settings:

Ranked Evaluation Setting:

Definition: In the ranked evaluation setting, the retrieved documents are ranked based on their predicted relevance scores, usually obtained from an IR system. The highest-ranked documents are considered more relevant to the query than lower-ranked ones. The primary goal of ranked evaluation is to assess the effectiveness of an IR system in returning the most relevant documents at the top of the ranked list.

Process:

  1. Retrieval and Ranking: The IR system retrieves a set of documents for a given query and ranks them based on their relevance scores. The higher the relevance score, the higher the document's rank.
  2. Relevance Judgments: For each query, human assessors provide relevance judgments for a subset of the retrieved documents, indicating whether they are relevant or not.
  3. Evaluation Metrics: Common evaluation metrics in ranked evaluation include Precision-Recall curves, Average Precision (AP), Mean Average Precision (MAP), and Discounted Cumulative Gain (DCG).

Unranked Evaluation Setting:

Definition: In the unranked evaluation setting, retrieved documents are considered equally relevant to the query, and their order is not taken into account. The primary goal of unranked evaluation is to measure the presence or absence of relevant documents without considering their specific rankings.

Process:

  1. Retrieval: The IR system retrieves a set of documents for a given query.
  2. Relevance Judgments: For each query, human assessors provide binary relevance judgments (relevant or not relevant) for the retrieved documents.
  3. Evaluation Metrics: Common evaluation metrics in unranked evaluation include Precision, Recall, F1 score, and Accuracy.

Precision-Recall Curve:

To construct the Precision-Recall curve, the IR system retrieves documents for a given query and ranks them based on their predicted relevance scores. Then, the Precision and Recall values are computed for each possible cut-off point, where the system retrieves only the top k documents (where k varies from 1 to the total number of retrieved documents).

Plotting the Precision-Recall Curve:

  1. Compute Precision and Recall for each cut-off point (k) as the system retrieves top k documents.
  2. Plot the Precision-Recall pairs on the graph, with Recall on the x-axis and Precision on the y-axis.
  3. Connect the points to create the Precision-Recall curve.

Interpretation of the Curve:

  • The Precision-Recall curve shows the trade-off between Precision and Recall as the system retrieves more or fewer documents.
  • A higher Precision value indicates that most of the retrieved documents are relevant, while a higher Recall value indicates that more of the relevant documents in the collection have been retrieved, regardless of how many non-relevant documents accompany them.
  • The ideal system would have both high Precision and high Recall, achieving a curve that is close to the top-right corner of the graph.

Example:
Suppose an IR system retrieves documents for a query, and the results are as follows:

Rank  Document    Relevant (Yes/No)
1     Document A  Yes
2     Document B  No
3     Document C  Yes
4     Document D  Yes
5     Document E  No
6     Document F  No

Precision-Recall Curve:

  1. At k=1: Precision = 1/1 = 1.0, Recall = 1/3 ≈ 0.333
  2. At k=2: Precision = 1/2 = 0.5, Recall = 1/3 ≈ 0.333
  3. At k=3: Precision = 2/3 ≈ 0.667, Recall = 2/3 ≈ 0.667
  4. At k=4: Precision = 3/4 = 0.75, Recall = 3/3 = 1.0
  5. At k=5: Precision = 3/5 = 0.6, Recall = 3/3 = 1.0
  6. At k=6: Precision = 3/6 = 0.5, Recall = 3/3 = 1.0
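
These cut-off values can be reproduced with a short loop over the ranked list; the sketch below uses the six documents from the example and prints the same Precision/Recall pairs (the plotting step, e.g. with matplotlib, is left as a comment):

```python
ranking = ["A", "B", "C", "D", "E", "F"]   # documents in rank order, as in the example
relevant = {"A", "C", "D"}                 # the 3 relevant documents

hits = 0
curve = []
for k, doc in enumerate(ranking, start=1):
    if doc in relevant:
        hits += 1
    precision = hits / k                   # relevant retrieved / retrieved so far
    recall = hits / len(relevant)          # relevant retrieved / all relevant
    curve.append((recall, precision))
    print(f"k={k}: Precision = {precision:.3f}, Recall = {recall:.3f}")

# To plot, put Recall on the x-axis and Precision on the y-axis, e.g.:
# import matplotlib.pyplot as plt
# plt.plot([r for r, p in curve], [p for r, p in curve], marker="o")
# plt.xlabel("Recall"); plt.ylabel("Precision"); plt.show()
```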

Interpolated Precision:

Interpolated precision is a method used to compute precision values at specific recall levels on a Precision-Recall curve. It is particularly helpful when dealing with sparse Precision-Recall curves that have few or no precision values at certain recall levels. Interpolated precision assigns the highest precision value seen at or after each recall level, which ensures a monotonically non-increasing precision curve.

Calculating Interpolated Precision:

  1. Start with the Precision-Recall curve, where precision values are calculated for various recall levels (usually at each relevant document retrieved).
  2. For each recall level r, find the highest precision value p′ that occurs at or after r; in other words, take the maximum precision over all recall levels greater than or equal to r.
  3. Assign this value p′ to the interpolated precision at recall level r.
  4. Repeat this process for all recall levels.
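
A minimal sketch of this interpolation over the (recall, precision) pairs of a single query, using the values from the example above (the function name is illustrative):

```python
def interpolate(curve):
    """Interpolated precision: for each recall level, take the maximum precision
    observed at that recall level or any higher one, which yields a monotonically
    non-increasing precision curve."""
    curve = sorted(curve)                      # (recall, precision) pairs, ascending recall
    interpolated = []
    best = 0.0
    for recall, precision in reversed(curve):  # sweep from high recall down to low recall
        best = max(best, precision)            # highest precision seen at or after this recall
        interpolated.append((recall, best))
    return list(reversed(interpolated))


# Precision-Recall pairs from the worked example (k = 1..6):
pairs = [(0.333, 1.0), (0.333, 0.5), (0.667, 0.667), (1.0, 0.75), (1.0, 0.6), (1.0, 0.5)]
print(interpolate(pairs))
# [(0.333, 1.0), (0.333, 1.0), (0.667, 0.75), (1.0, 0.75), (1.0, 0.75), (1.0, 0.75)]
```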