The ISLab Instance Matching Benchmark 2009
Provided by the Information Systems
& Knowledge Management Lab
Dipartimento di Informatica e Comunicazione - Università degli Studi di Milano
via Comelico 39, 20135, Milano, Italy
Contact person for IIMB: Alfio Ferrara (ferrara at dico.unimi.it)
The ISLab Instance Matching Benchmark (IIMB) is a benchmark generated automatically from a single data source by modifying it according to various criteria. IIMB 2009 is available for public use. Please acknowledge the use of IIMB 2009 data by citing the following paper in your results.
Ferrara, A.; Lorusso, D.; Montanelli, S. & Varese, G.
Towards a Benchmark for Instance Matching.
Ontology Matching (OM 2008), CEUR-WS.org, 2008
@INPROCEEDINGS{Ferrara08OM,
author = {Alfio Ferrara and Davide Lorusso and Stefano Montanelli and Gaia Varese},
booktitle = {Ontology Matching (OM 2008)},
editor = {Pavel Shvaiko and Jérôme Euzenat and Fausto Giunchiglia and Heiner Stuckenschmidt},
publisher = {CEUR-WS.org},
series = {CEUR Workshop Proceedings},
title = {Towards a Benchmark for Instance Matching.},
volume = {431},
year = {2008}
}
Objective
The objective of the benchmark is to evaluate the results of instance matching algorithms. The aim of these algorithms is to find, as accurately as possible, the instances belonging to one or more ontologies that refer to the same real-world object.
Thus, we provide a reference OWL abox and a set of modified OWL aboxes, each consisting of a collection of individuals whose property values are obtained by modifying the corresponding values in the source ontology in one of the ways described below.
Each execution of the test takes as input a pair of aboxes, that is, the reference abox and one of the modified aboxes, and returns the instance mappings found by the tested algorithm. Each test case measures the capability of the algorithm to handle a specific kind of error, a specific modification in the structure of the ontology, or a different kind of individual classification. For each test in the benchmark, the reference collection of expected mappings is provided.
The algorithms are evaluated according to the following parameters.
-
Precision.
The number of correct retrieved mappings / the number of retrieved mappings.
-
Recall.
The number of correct retrieved mappings / the number of expected mappings.
-
F-measure.
2 x (precision x recall) / (precision + recall).
-
Fall-out.
The number of incorrect retrieved mappings / the number of non-expected mappings.
-
Execution time (optional).
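The four quality measures above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used for the benchmark: the function name evaluate, the representation of mappings as pairs of identifiers, and the candidate_pairs parameter (the total number of possible instance pairs, used here to derive the number of non-expected mappings) are assumptions.

```python
def evaluate(retrieved, expected, candidate_pairs):
    """Compute precision, recall, F-measure and fall-out for one test.

    retrieved, expected: iterables of (source_id, target_id) mappings.
    candidate_pairs: total number of possible instance pairs (assumption:
    non-expected mappings = candidate_pairs - number of expected mappings).
    """
    retrieved, expected = set(retrieved), set(expected)
    correct = retrieved & expected      # correct retrieved mappings
    incorrect = retrieved - expected    # incorrect retrieved mappings
    non_expected = candidate_pairs - len(expected)

    precision = len(correct) / len(retrieved) if retrieved else 0.0
    recall = len(correct) / len(expected) if expected else 0.0
    if precision + recall:
        f_measure = 2 * (precision * recall) / (precision + recall)
    else:
        f_measure = 0.0
    fall_out = len(incorrect) / non_expected if non_expected else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "fall_out": fall_out,
    }
```

Returning 0.0 when a denominator is zero is a convention chosen for this sketch; the benchmark description does not say how such cases are handled.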
Source abox
The benchmark is about movies. In particular, the movie instances in the source abox were extracted from the IMDb Web site (http://www.imdb.com).
The reference ontology contains 5 named classes, 4 object properties, 13 datatype properties and 302 individuals.
The source ontology is available here.
Benchmark description
The transformations introduced in the benchmark ontologies fall into three main categories. Each category introduces a class of modifications to the original value(s) of a specific property in the source ontology.
Values transformations
These transformations are of two different kinds.
Typographical error simulation (T)
These transformations can be applied to the value of any datatype property and are obtained by changing the property value(s) in the source ontology in one of the following four ways.
-
Insert char.
A random char (or a random number, if the property has a numerical value) is inserted in the string at a random position.
-
Modify char.
A random char (or a random number, if the property has a numerical value) is modified in the string.
-
Delete char.
A random char (or a random number, if the property has a numerical value) is deleted from the string.
-
Exchange chars' position.
The position of two adjacent chars (or two adjacent numbers, if the property has a numerical value) is exchanged.
The number of transformations introduced in the string is proportional to the string's length. If more than one transformation is to be applied, the value can be modified by combining different kinds of transformations.
The aim of these transformations is to simulate typographical errors that can be found in real data.
The benchmark applies these transformations at three different levels of severity.
-
Low: (T(L)).
-
Medium: (T(M)).
-
High: (T(H)).
The string is replaced with a random sequence of chars (or numbers, if the property has a numerical value).
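The four typographical operations and the high-severity replacement can be sketched as follows. This is an illustrative sketch under stated assumptions, not the generator used to build the benchmark: all function names, the rate parameter (transformations per character), the choice of alphabets, and the decision to keep the original length in random_replacement are hypothetical.

```python
import random
import string


def insert_char(value, alphabet, rng):
    # Insert a random symbol at a random position.
    pos = rng.randrange(len(value) + 1)
    return value[:pos] + rng.choice(alphabet) + value[pos:]


def modify_char(value, alphabet, rng):
    # Replace one symbol with a different random symbol.
    if not value:
        return value
    pos = rng.randrange(len(value))
    replacement = rng.choice([c for c in alphabet if c != value[pos]])
    return value[:pos] + replacement + value[pos + 1:]


def delete_char(value, alphabet, rng):
    # Delete one symbol at a random position.
    if not value:
        return value
    pos = rng.randrange(len(value))
    return value[:pos] + value[pos + 1:]


def swap_adjacent(value, alphabet, rng):
    # Exchange the positions of two adjacent symbols.
    if len(value) < 2:
        return value
    pos = rng.randrange(len(value) - 1)
    return value[:pos] + value[pos + 1] + value[pos] + value[pos + 2:]


OPERATIONS = (insert_char, modify_char, delete_char, swap_adjacent)


def alphabet_for(value):
    # Digits for numerical values, letters otherwise (assumption).
    return string.digits if value.isdigit() else string.ascii_letters


def typo_transform(value, rate=0.1, seed=None):
    """Apply a number of random edits proportional to the string length.

    rate is a hypothetical knob standing in for the severity level.
    """
    rng = random.Random(seed)
    alphabet = alphabet_for(value)
    n_ops = max(1, round(len(value) * rate))
    for _ in range(n_ops):
        value = rng.choice(OPERATIONS)(value, alphabet, rng)
    return value


def random_replacement(value, seed=None):
    """Replace the value with a random sequence, as described for T(H).

    Keeping the original length is an assumption of this sketch.
    """
    rng = random.Random(seed)
    alphabet = alphabet_for(value)
    return "".join(rng.choice(alphabet) for _ in value)
```

Passing a seed makes a generated test case reproducible, which is convenient when the expected mappings must stay aligned with the modified values.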
Standard modification (S)
These transformations are applied only to properties representing a person's name. So, in the reference ontology, we applied this modification only to actor and director names.
For example, if in the source ontology the name of an actor is stored as "Pacino, Al", the modified value will be "Al Pacino".
Finally, for properties which allow standard modifications, standard and typographical transformations can be combined within the same value.
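The name reordering in the "Pacino, Al" example can be sketched as a small function. The function name standardize_name and the rule of leaving values without a comma unchanged are assumptions made for illustration.

```python
def standardize_name(value):
    """Rewrite "Surname, Given" as "Given Surname".

    Values without a comma are returned unchanged (assumption).
    """
    surname, separator, given = value.partition(",")
    if not separator:
        return value
    return f"{given.strip()} {surname.strip()}"
```

For instance, standardize_name("Pacino, Al") yields "Al Pacino", matching the example given for this modification.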
Structural transformations
These transformations are of three different kinds.
Values deletion (VD)
These transformations can be applied to each property. If a property allows multiple values, two kinds of deletions are possible.
-
Delete all values: (VD(A)).
All the values specified for the property are deleted.
-
Delete a random number of values: (VD(R)).
A random number of the values specified for a property are deleted.
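The two kinds of deletion can be sketched on a simple in-memory representation, where an individual is a dictionary mapping property names to lists of values. That representation, the function names, the property name HasActor used in the usage note, and the choice to always delete at least one value in the random case are assumptions of this sketch.

```python
import random


def delete_all_values(individual, prop):
    """VD(A): return a copy of the individual without any value for prop."""
    modified = dict(individual)
    modified.pop(prop, None)
    return modified


def delete_random_values(individual, prop, seed=None):
    """VD(R): return a copy with a random number of values of prop removed.

    At least one value is removed when the property has values (assumption).
    """
    rng = random.Random(seed)
    values = list(individual.get(prop, []))
    modified = dict(individual)
    if not values:
        return modified
    n_delete = rng.randint(1, len(values))
    dropped = set(rng.sample(range(len(values)), n_delete))
    kept = [v for i, v in enumerate(values) if i not in dropped]
    if kept:
        modified[prop] = kept
    else:
        modified.pop(prop)
    return modified
```

With a movie such as {"HasActor": ["Pacino, Al", "Pfeiffer, Michelle", "Bauer, Steven"]}, delete_random_values returns a copy holding between zero and two of the original names, while the input dictionary is left untouched.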
Depth modification (D)
This modification transforms a datatype property into an object property with the same name. The value of the generated object property is an instance which has an attribute that specifies the same value as the original datatype property.
For example, if a movie instance of the source ontology specifies the value "Scarface" for the datatype property HasTitle, the modified movie instance will have an object property HasTitle, whose value is an instance which has an attribute that specifies the value "Scarface".
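The "Scarface" example can be sketched on a toy abox stored as a dictionary from individual identifiers to property dictionaries. The representation, the function name deepen_property, the attribute name "value", and the scheme used to name the generated individuals are all hypothetical; the benchmark description does not specify them.

```python
def deepen_property(abox, individual_id, prop, attribute="value"):
    """Turn the datatype property prop into an object property.

    Each literal value is moved onto a newly created individual, and the
    original property then refers to that individual by identifier.
    """
    modified = {iid: dict(props) for iid, props in abox.items()}
    references = []
    for index, literal in enumerate(abox[individual_id].get(prop, [])):
        new_id = f"{individual_id}_{prop}_{index}"  # hypothetical naming scheme
        modified[new_id] = {attribute: [literal]}
        references.append(new_id)
    modified[individual_id][prop] = references
    return modified
```

Applied to {"movie1": {"HasTitle": ["Scarface"]}}, the result keeps HasTitle on movie1, but its value is now the identifier of a new individual whose "value" attribute holds "Scarface".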
Values separation (VS)
This modification splits the value of a property into two different property values.
For example, if a movie instance of the source ontology specifies the value "Pacino, Al" for the datatype property Name, the modified movie instance will have the values "Al" for the property Name, and "Pacino" for a new datatype property Surname.
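The split of "Pacino, Al" into Name and Surname can be sketched as follows. The dictionary representation of an individual and the function name separate_name are assumptions; the property names Name and Surname are taken from the example above.

```python
def separate_name(individual, prop="Name", new_prop="Surname"):
    """Split a "Surname, Given" value across two datatype properties.

    Values without a comma are left unchanged (assumption).
    """
    modified = dict(individual)
    surname, separator, given = individual[prop].partition(",")
    if separator:
        modified[prop] = given.strip()
        modified[new_prop] = surname.strip()
    return modified
```

For the input {"Name": "Pacino, Al"} this returns {"Name": "Al", "Surname": "Pacino"}.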
Finally, for properties which allow depth modification or value separation and which allow multiple values, value deletion and depth modification or value separation can be applied together.
Logical transformations
These transformations are of five different kinds. Each requires some kind of reasoning in order to properly recognize which instances need to be compared.
Instantiation on different subclasses (SC)
This transformation is obtained by instantiating identical individuals into different subclasses.
Instantiation on disjoint classes (DJ)
This transformation is obtained by instantiating identical individuals into disjoint classes.
Different instantiation based on explicit class hierarchy declaration (E)
This transformation is obtained by applying a different instantiation for the source and the modified abox.
The corresponding instance hierarchy can be recognized even by an RDFS reasoner.
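The kind of inference an RDFS reasoner contributes here, propagating class membership along explicit subclass declarations, can be approximated by a transitive closure. The sketch below is not a reasoner and is not part of the benchmark: the function names, the dictionary encoding of the subclass hierarchy, and the class names in the usage note are hypothetical.

```python
def superclasses(cls, subclass_of):
    """Return every class reachable from cls through subclass declarations."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def inferred_types(asserted, subclass_of):
    """Asserted classes plus all of their explicit superclasses."""
    types = set(asserted)
    for cls in asserted:
        types |= superclasses(cls, subclass_of)
    return types


def comparable(types_a, types_b, subclass_of):
    """True if two individuals share at least one inferred class."""
    shared = inferred_types(types_a, subclass_of) & inferred_types(types_b, subclass_of)
    return bool(shared)
```

With a hierarchy such as {"Actor": ["Person"]}, an individual asserted as Actor in one abox and as Person in the other shares the inferred class Person, so a matcher can treat the two as candidates for comparison.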
Different instantiation based on implicit class hierarchy declaration (I)
This transformation is obtained by applying a different instantiation for the source and the modified abox. In this case, the class hierarchy is defined through restrictions.
The corresponding instance hierarchy can be recognized using a DL reasoner.
Implicit values specification (V)
These transformations are obtained by setting individuals' property values through the hasValue restriction.
Finally, these transformations can also be combined.
Tests organization
The test organization is available here.