Semantic similarity evaluation


This dataset is designed for evaluating semantic similarity between Wikipedia concepts. It consists of two sets of concept pairs taken from the English version of Wikipedia, one for training and one for testing. The training set contains 37 pairs and the test set contains 28 pairs.

The original pairs were proposed in 1965 by Rubenstein and Goodenough (henceforth, R&G). However, those pairs consist of words, and there is no information about the intended meaning of each word. Because our approach works with well-defined (disambiguated) concepts from Wikipedia, our experiments had to train and test on all the possible senses of each word proposed in R&G. After the experiments, we selected for each word the Wikipedia concept representing the sense that yielded the best results.

Each subset is provided as a CSV file. Each row represents one pair and contains the following fields, separated by commas:

  1. The first original word of the pair in the R&G set.
  2. The second original word of the pair in the R&G set.
  3. The Wikipedia concept selected for the first original word.
  4. The Wikipedia concept selected for the second original word.

For the Wikipedia concepts, we have eliminated the URL prefix: http://en.wikipedia.org/wiki/


The following is an excerpt of a CSV file from the dataset (fields are shown here separated by semicolons):

Bird; Crane; Bird; Crane_(bird)
Brother; Monk; Brother_(Catholic); Monk

In the first row, the original R&G pair is bird and crane. The Wikipedia concepts that gave the best similarity results with our approach are http://en.wikipedia.org/wiki/bird and http://en.wikipedia.org/wiki/crane_(bird), respectively.
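
The row layout and the removed URL prefix described above can be handled with a few lines of Python. The following is only a minimal sketch, not part of the dataset: the names ConceptPair, read_pairs and concept_url are hypothetical, and the semicolon delimiter is an assumption taken from the excerpt (pass delimiter="," if the downloaded files are comma-separated).

```python
import csv
import io
from typing import NamedTuple

# URL prefix that was stripped from the Wikipedia concepts in the dataset.
WIKI_PREFIX = "http://en.wikipedia.org/wiki/"


class ConceptPair(NamedTuple):
    """One row: the two R&G words and their selected Wikipedia concepts."""
    word1: str
    word2: str
    concept1: str
    concept2: str


def read_pairs(lines, delimiter=";"):
    """Parse rows of four fields; the delimiter is an assumption (see above)."""
    pairs = []
    for row in csv.reader(lines, delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        fields = [field.strip() for field in row]
        if len(fields) != 4:
            raise ValueError(f"expected 4 fields, got {len(fields)}: {row!r}")
        pairs.append(ConceptPair(*fields))
    return pairs


def concept_url(concept):
    """Rebuild the full Wikipedia URL for a concept name."""
    return WIKI_PREFIX + concept


# Usage with the first row of the excerpt.
sample = io.StringIO("Bird; Crane; Bird; Crane_(bird)\n")
for pair in read_pairs(sample):
    print(pair.word1, pair.word2,
          concept_url(pair.concept1), concept_url(pair.concept2))
```

To read an actual file, an open file object can be passed in place of the in-memory sample, since read_pairs accepts any iterable of lines.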


Author: Damaris Fuentes-Lorenzo
The author cannot guarantee the correctness of the data, its suitability for any particular purpose, or the validity of results based on the use of the data set, and assumes no responsibility for the content, legality, reliability, or accuracy of the data.
The data set may be used for any research purposes.

Please acknowledge this data set in publications resulting from its use:
Damaris Fuentes-Lorenzo. (2014). Wikipedia dataset for similarity, http://webtlab.it.uc3m.es/results/similarity/evaluation.html


Training dataset, csv.
Test dataset, csv.
