ITACA evaluation

Overview

The Itaca dataset is designed for evaluating web search when word sense disambiguation is applied. It consists of 20 queries, each with a set of terms disambiguated with Wikipedia entries. The dataset comprises 3 files, in which each row is terminated by a newline and fields are separated by a space. The files are described below.

Queries.txt

It contains the query ID and the query string:

ID String
1 iphone features
2 ipad features
3 JCR
.........

goals.txt

It contains the query ID and a description of its goal:

ID Goal
1 General features of an Iphone.
2 General features of an Ipad.
3 Information about the Journal Citation Reports publication.
.........
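
Queries.txt and goals.txt share the same layout: a query ID followed by free text that may itself contain spaces. The following is a minimal Python sketch of how such rows might be read, not an official loader. The function name parse_id_text_rows is hypothetical, and the sketch assumes that any row whose first field is not a number (such as the "ID String" header or an elision row) can be skipped; whether the distributed files actually contain a header row is not stated here.

```python
import io


def parse_id_text_rows(lines):
    """Parse rows of the form '<query ID> <free text>' into a dict keyed by ID."""
    rows = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the first space only, because the text field contains spaces.
        head, _, text = line.partition(" ")
        if not head.isdigit():
            # Assumption: a non-numeric first field is a header or filler row.
            continue
        rows[int(head)] = text
    return rows


# Sample rows copied from the descriptions above; real code would open the files.
sample_queries = io.StringIO(
    "ID String\n1 iphone features\n2 ipad features\n3 JCR\n"
)
sample_goals = io.StringIO(
    "ID Goal\n"
    "1 General features of an Iphone.\n"
    "2 General features of an Ipad.\n"
    "3 Information about the Journal Citation Reports publication.\n"
)

queries = parse_id_text_rows(sample_queries)
goals = parse_id_text_rows(sample_goals)

# Pair each query string with its goal through the shared query ID.
for query_id, query in queries.items():
    print(query_id, query, "->", goals.get(query_id))
```

Splitting on the first space only keeps multi-word queries and goals intact, which a plain split on every space would break apart.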

disambiguations.txt

It contains the term ID (formed by the query ID and the term number within that query), the term itself, and the URL of the Wikipedia concept that the term refers to:

ID Term Disambiguation
1.1 iphone http://en.wikipedia.org/wiki/Iphone
1.2 features http://en.wikipedia.org/wiki/Feature_(software_design)
.........
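
As an illustration of the term ID scheme, the sketch below splits each term ID into its query ID and term number and groups the disambiguated terms by query. It is a hedged example rather than a reference implementation: the function name parse_disambiguations is hypothetical, and the code assumes that every term is a single token without spaces and that rows not matching the "<query>.<term> <term> <URL>" shape (such as a header row) can be ignored.

```python
from collections import defaultdict


def parse_disambiguations(lines):
    """Group (term number, term, Wikipedia URL) tuples by query ID."""
    by_query = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            # Assumption: valid rows have exactly three space-separated fields.
            continue
        term_id, term, url = parts
        # A term ID such as "1.2" means query 1, term 2.
        query_part, dot, term_part = term_id.partition(".")
        if not (dot and query_part.isdigit() and term_part.isdigit()):
            # Skips the header row and any filler rows.
            continue
        by_query[int(query_part)].append((int(term_part), term, url))
    return dict(by_query)


# Sample rows copied from the description above.
sample = [
    "ID Term Disambiguation",
    "1.1 iphone http://en.wikipedia.org/wiki/Iphone",
    "1.2 features http://en.wikipedia.org/wiki/Feature_(software_design)",
]

disambiguations = parse_disambiguations(sample)
for term_number, term, url in disambiguations[1]:
    print(term_number, term, url)
```

Grouping by query ID makes it straightforward to look up all disambiguated terms for a query loaded from Queries.txt.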

Usage

Author: Damaris Fuentes-Lorenzo
The author cannot guarantee the correctness of the data, its suitability for any particular purpose, or the validity of results based on the use of the data set, and the author assumes no responsibility for the content, legality, reliability, or accuracy of the data.
The data set may be used for any research purposes.
Please acknowledge the data set in any publications resulting from its use:
Damaris Fuentes-Lorenzo. (2011). Itaca dataset, http://webtlab.it.uc3m.es/results/ITACA/evaluation.html

Download

Itaca dataset, zip.