The Itaca dataset is designed for evaluating web search when word sense disambiguation is applied. It consists of 20 queries, each with a set of terms disambiguated with Wikipedia entries. The dataset is distributed as 3 files, in which each row is terminated by a newline and fields are separated by a space. The files are described below.
The first file contains the query ID and the string of that query:
ID String
1 iphone features
2 ipad features
3 JCR
.........
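Because fields are separated by a space and a query string can itself contain spaces, a reader probably has to split each row on the first space only. The following is a minimal sketch of that idea in Python; the function name `parse_id_text_rows` is hypothetical and not part of the dataset, and the sample rows are the ones shown above.

```python
def parse_id_text_rows(lines):
    """Parse rows of the form '<ID> <free text>' into a dict keyed by integer ID."""
    rows = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the first space only: the text field may contain spaces.
        ident, _, text = line.partition(" ")
        if not ident.isdigit():
            # Skip the header row ("ID String") and any non-data row.
            continue
        rows[int(ident)] = text
    return rows


# Sample rows copied from the listing above.
sample = ["ID String", "1 iphone features", "2 ipad features", "3 JCR"]
queries = parse_id_text_rows(sample)
print(queries[1])  # prints "iphone features"
```

The same approach should work for any file in the dataset whose rows are an integer ID followed by free text.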
The second file contains the query ID and a description of the goal of that query:
ID Goal
1 General features of an Iphone.
2 General features of an Ipad.
3 Information about the Journal Citation Reports publication.
.........
The third file contains the term ID (formed by the query ID and the term number, separated by a dot), the term itself, and the URL of the Wikipedia concept that the term refers to:
ID Term Disambiguation
1.1 iphone http://en.wikipedia.org/wiki/Iphone
1.2 features http://en.wikipedia.org/wiki/Feature_(software_design)
.........
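The term ID can be split back into its query ID and term number, which makes it possible to group the disambiguated terms by query. The sketch below illustrates this in Python under two assumptions that the description does not state explicitly: the URL is the last space-separated field, and it contains no spaces. The names `TermEntry` and `parse_term_rows` are hypothetical.

```python
from typing import NamedTuple


class TermEntry(NamedTuple):
    query_id: int
    term_number: int
    term: str
    url: str


def parse_term_rows(lines):
    """Parse rows of the form '<query>.<term> <term text> <url>'."""
    entries = []
    for line in lines:
        term_id, _, rest = line.strip().partition(" ")
        query_part, _, term_part = term_id.partition(".")
        if not (query_part.isdigit() and term_part.isdigit()):
            # Skip the header row, blank lines, and any non-data row.
            continue
        # Assumption: the URL is the last field and contains no spaces.
        term, _, url = rest.rpartition(" ")
        entries.append(TermEntry(int(query_part), int(term_part), term, url))
    return entries


# Sample rows copied from the listing above.
sample = [
    "ID Term Disambiguation",
    "1.1 iphone http://en.wikipedia.org/wiki/Iphone",
    "1.2 features http://en.wikipedia.org/wiki/Feature_(software_design)",
]
entries = parse_term_rows(sample)
print(entries[1].query_id, entries[1].term_number, entries[1].term)  # prints "1 2 features"
```

Splitting the term text from the right keeps the parser tolerant of a multi-word term, should one occur in the file.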