This dataset is designed for evaluating semantic similarity between Wikipedia concepts. It consists of two sets of concept pairs taken from the English version of Wikipedia, one for training and one for testing. The training set contains 37 pairs and the test set contains 28 pairs.
The original pairs were proposed in 1965 by Rubenstein and Goodenough (henceforth, R&G). However, those pairs consist of words, and there is no information about the intended meaning of each word. As our approach works with well-defined (disambiguated) concepts from Wikipedia, our experiments had to train and test all the possible senses of each of the words proposed in R&G. The experiments showed which senses gave the better results, and we finally selected the Wikipedia concepts representing those senses.
Every subset is listed in a CSV file. Every row represents a pair and contains, separated by commas:
For the Wikipedia concepts, we have eliminated the URL prefix: http://en.wikipedia.org/wiki/
The following is an excerpt of one of the CSV files in the dataset:
Bird; Crane; Bird; Crane_(bird)
Brother; Monk; Brother_(Catholic); Monk
.........
In the first row, the original R&G pair is bird and crane. The Wikipedia concepts that gave the better similarity results with our approach are http://en.wikipedia.org/wiki/Bird and http://en.wikipedia.org/wiki/Crane_(bird), respectively.
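As an illustration of the row layout described above, the following minimal Python sketch reads rows in the format of the excerpt and rebuilds the full Wikipedia URLs. It is only a sketch under stated assumptions: the function name parse_pairs and the dictionary keys are hypothetical, the field order (R&G word 1, R&G word 2, Wikipedia concept 1, Wikipedia concept 2) is inferred from the excerpt, and the delimiter defaults to a semicolon because that is what the excerpt shows. Pass delimiter="," if the actual files are comma-separated.

```python
import csv
import io

# URL prefix that was removed from the Wikipedia concepts in the dataset.
WIKI_PREFIX = "http://en.wikipedia.org/wiki/"


def parse_pairs(text, delimiter=";"):
    """Parse dataset rows into dictionaries (hypothetical helper).

    Assumes the first four fields of each row are: R&G word 1, R&G word 2,
    Wikipedia concept 1, Wikipedia concept 2. Any further fields are ignored.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter,
                        skipinitialspace=True)
    pairs = []
    for row in reader:
        fields = [field.strip() for field in row]
        if len(fields) < 4:
            continue  # skip blank lines and elision markers such as "....."
        word1, word2, concept1, concept2 = fields[:4]
        pairs.append({
            "word1": word1,
            "word2": word2,
            "concept1_url": WIKI_PREFIX + concept1,
            "concept2_url": WIKI_PREFIX + concept2,
        })
    return pairs


# First row of the excerpt, used here as sample input.
sample = "Bird; Crane; Bird; Crane_(bird)\n.........\n"
pairs = parse_pairs(sample)
for pair in pairs:
    print(pair["word1"], pair["word2"],
          pair["concept1_url"], pair["concept2_url"])
```

To read an actual file, the same function can be applied to the file contents, for example parse_pairs(open(path, encoding="utf-8").read()), where path is the location of one of the two CSV files.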
Author: Damaris Fuentes-Lorenzo
The author cannot guarantee the correctness of the data,
its suitability for any particular purpose, or the validity of results
based on the use of the data set, and the author assumes no responsibility
for the content, legality, reliability, and accuracy of the data.
The data set may be used for any research purposes.