Semantic Similarity and Selection of Resources Published According to Linked Data Best Practice

Riccardo Albertoni,
Monica De Martino
CNR-IMATI-GE
Institute of Applied Mathematics and Information Technologies
(Dept of Genoa)
Consiglio Nazionale delle Ricerche, Italy
The 6th International Workshop on Ontology Content (OnToContent 2010)
Oct 28, 2010, Crete, part of the OTM (OTM'2010)

Outline
• Resource Selection, Semantic Similarity and Linked
data.
▫ Why does Resource Selection matter?
▫ Real example:
▪ Complex metadata to document resources
▪ Linked data paves the way for sharing complex metadata
▫ Semantic similarity as a basis for resource selection
▪ Nice features such as asymmetry & context-dependence
• Scaling semantic similarity up to the Web of Data
▫ Issues & research plan
▫ Exploratory phase with real data from the Web of Data
▪ Are the issues we consider relevant? In which varieties and shapes do the issues occur in real data?
▫ Lessons learnt from the exploratory phase

Resource Selection:
• why does it matter?
▫ Effective sharing and reuse of data are still
desiderata of many scientific and industrial
domains where the selection of tailored and
high-quality data is a necessary condition to
provide successful and competitive services
• Resource selection
▫ to select the resources that fit a given
problem/task, we rely on an analysis of the
metadata documenting the resources

Real Example
[Figure: sea trial data pipeline: Acquisition, Preprocessing, Integration, Models, Analysis, Web server]
Courtesy of NATO Undersea Research Centre (NURC).
Example developed in NURC Research Assistance
granted to R. Albertoni (2008)
Short-term perspective: data is collected and
elaborated for well-planned purposes (aka sea trial experiments)

Potential new “customers” for sea trial
data
• NATO Agencies/Nations asking for data previously
collected
• New scientists arriving at NURC
▫ They want access to the data in order to produce models with their own
approaches and to compare the results with models already
produced at NURC (benchmarking)
• Scientists/Agencies investigating how phenomena
have changed over a long period
▫ They are interested in data collected in the past
• Scientists/Agencies planning a new sea trial
▫ It can be useful to know what was collected in previous sea trials
and how the data was elaborated
Data reusability: unplanned use of data,
long-term perspective

Potential customers’ point of view
These customers were not involved in the sea
trials; thus, when searching for data, they
wonder:
• Is the data collected at NURC suitable for the
application I have in mind?
• Is the data reliable enough?
To answer these macro questions
• Users need details about how the data has
been acquired, pre-processed, integrated and
analyzed, and even about who was in charge
of which part…

Metadata Complexity in the Real World - Linked data helps in
sharing complex metadata
[Figure: metadata graph linking Models, Data, Processes, People, Sensors and Characteristics. Annotations: sensor's responsible party; sensor settings; parameters and choices made during the preprocessing; analysis applied, parameters, etc. Vocabularies shown: APO, FOAF, ISO19115 Core, Test Plan, Dublin Core, SensorML]

Problem: keep the bar balanced!
• On one side: semantic similarity as metadata analysis, to support the user in comparing the features of candidate resources
• On the other side: a huge amount of ontology-driven metadata describing complex features as linked data

Semantic similarity as a metadata
analysis tool
• Instance similarity is fundamental to support detailed
comparison, ranking and selection of resources through
their ontology-driven metadata
▫ Albertoni R., De Martino M., Asymmetric and context-dependent
semantic similarity among ontology instances, Journal on Data
Semantics X, Springer Verlag, (2008).
• Explicitly addressing
▫ Context, as an explicit parameterization of the similarity assessment
▪ The context specifies which features to consider and how
▫ Asymmetry, to highlight containment between resources
▪ Sim(A,B) ranges in [0,1] and measures how many
features A shares with B out of all the features of A
▪ If features(A) are contained in features(B): sim(A,B)=1 and
sim(B,A)<1
• Limitation: it was not designed for linked data, but for a locally stored
ontology-driven repository with one well-defined schema
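The containment behaviour described above can be illustrated with a minimal sketch. It assumes features are plain sets of labels and that similarity is simply the fraction of A's features also found in B; the function name and the sample features are hypothetical, and the full measure in the JODS X paper is considerably richer.

```python
def containment_similarity(features_a, features_b):
    """Fraction of A's features that B also has (asymmetric, in [0, 1])."""
    a, b = set(features_a), set(features_b)
    if not a:
        # Assumption: an empty feature set is trivially contained in any other.
        return 1.0
    return len(a & b) / len(a)


# Hypothetical feature sets: features(A) is contained in features(B).
features_a = {"sensor", "depth"}
features_b = {"sensor", "depth", "salinity"}

sim_ab = containment_similarity(features_a, features_b)  # all of A is in B
sim_ba = containment_similarity(features_b, features_a)  # B has one extra feature
print(sim_ab, sim_ba)
```

Because the denominator is the size of the first argument's feature set, swapping the arguments changes the result, which is what makes the measure asymmetric.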

How to make semantic similarity
scale up to the web of data? 1/2
Identified issues and research plan
• Issue: non-authoritative metadata, i.e., metadata
published by actors who are neither the resource
producers nor the owners
▫ WHEN: metadata documents resources that
have been re-elaborated or reviewed by third
parties
▫ Research plan: synergies with semantic
web indexes (e.g.,
SINDICE) to retrieve
non-authoritative features
• Issue: heterogeneous metadata, i.e., metadata
provided according to different, sometimes
interlinked, more often overlapping metadata
vocabularies
▫ WHEN: metadata for a resource is provided by
stakeholders with different fields of competency,
who may use different vocabularies; these
vocabularies are not always independent
▫ Research plan: deploying schema- and
entity-level
consolidation, using both
explicit metadata
statements and the mining of
implicit equivalences
through co-occurring
resource annotations

How to make semantic similarity
scale up to the web of data? 2/2
Identified issues and research plan
• Issue: non-consistently identified metadata,
namely metadata occurring when the same
resource has different identifiers in distinct
metadata sets
▫ WHEN: two actors in the pipeline
independently document the same resource at different
stages of the pipeline
▫ Research plan: reasoning techniques to be
applied to web datasets, e.g., to
smush fragments of
distributed metadata
▫ Research plan: scripts to interlink
resources relying on a priori
knowledge about how the
datasets have been originated
• Issue: efficiency and computational issues: in
a longer perspective, an accurate similarity
assessment might become computationally
prohibitive
▫ WHEN: the number of resources discovered and
of features considered increases
▫ Research plan: caching of intermediate
comparisons
▫ Research plan: techniques to prune
comparisons according to a
specified application context
▫ Research plan: algorithms for efficient
parallelization can be studied

Exploratory phase
• Facing the aforementioned issues is a very
challenging research plan!
• Let’s get first-hand experience of the varieties introduced
by data providers
▫ Requirements:
▪ Real metadata published as linked data
▪ Provided by third parties
• Linked data provides huge potential for documenting
resources produced in complex pipelines, but it is not yet
a common practice
▫ We considered a simpler domain (researchers and
their publications)
▪ Semantic Web Dog Food-SWDF
(http://data.semanticweb.org/)
▪ DBLP in RDF (http://dblp.l3s.de/d2r).

Instance similarity: redesigned
prototype
• A test bed for experimenting with and further investigating the
aforementioned issues
• Extension
▫ Extended the notion of context to include
namespaces, so as to consider properties from different
RDF schemas
▫ Updated the ontology model, moving from
ProtegeAPI to an RDF model
▪ JENA Reasoner and SPARQL

Context:
Researcher X (URI(X)=A)
<A> rdfs:label "A descr" ;
dc:license <http://vb.com> ;
foaf:primaryTopic <xzc> ;
foaf:made <paperC>;
foaf:made <paperD>;
foaf:made <paperH>.
Researcher Y (URI(Y)=B)
<B> rdfs:label "B descr" ;
foaf:primaryTopic <vbn> ;
foaf:made <paperC>;
foaf:made <paperF>.
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
[foaf:Person]->{{},{(foaf:made, Count)}}
Two researchers are similar to the extent that they
have a similar number of publications
Count(foaf:made): X has 3, Y has 2
SIM(X,Y)= SIM(A,B)= 2/max(3,2)=2/3
SIM(Y,X)= SIM(B,A)= 3/max(3,2)=1
See R. Albertoni, M. De Martino,
JODS X, 2008 for more complex similarity
assessments!
We compare researchers by their URIs
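The arithmetic on this slide can be reproduced with a small sketch. It assumes triples are plain Python tuples with prefixed names as strings, and it implements only the Count comparison exactly as written above (the other researcher's count divided by the larger of the two counts); the function names are hypothetical and this is not the prototype's actual code.

```python
# Triples copied from the slide, with prefixed names kept as plain strings.
triples = [
    ("A", "rdfs:label", "A descr"),
    ("A", "dc:license", "http://vb.com"),
    ("A", "foaf:primaryTopic", "xzc"),
    ("A", "foaf:made", "paperC"),
    ("A", "foaf:made", "paperD"),
    ("A", "foaf:made", "paperH"),
    ("B", "rdfs:label", "B descr"),
    ("B", "foaf:primaryTopic", "vbn"),
    ("B", "foaf:made", "paperC"),
    ("B", "foaf:made", "paperF"),
]


def count_values(triples, subject, prop):
    """Number of distinct objects the subject has for the given property."""
    return len({o for s, p, o in triples if s == subject and p == prop})


def count_similarity(triples, x, y, prop="foaf:made"):
    """SIM(x, y) as on the slide: count(y) / max(count(x), count(y))."""
    cx = count_values(triples, x, prop)
    cy = count_values(triples, y, prop)
    if max(cx, cy) == 0:
        # Assumption: two resources with no values at all are treated as equal.
        return 1.0
    return cy / max(cx, cy)


sim_ab = count_similarity(triples, "A", "B")  # 2 / max(3, 2)
sim_ba = count_similarity(triples, "B", "A")  # 3 / max(3, 2)
print(sim_ab, sim_ba)
```

Only foaf:made contributes to the result because it is the only property named in the context; the label, license and topic statements are ignored.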

Non-Authoritative Metadata - Example
URI(Giovanni)=A
<A> foaf:made <paperC>;
foaf:made <paperD>;
foaf:made <paperH>.
URI(Renaud)=B
<B> rdfs:label "B descr" ;
foaf:primaryTopic <vbn> ;
foaf:made <paperC>;
foaf:made <paperF>.
Let’s compare Giovanni and Renaud starting from their URIs in
DBLP
A= http://dblp.l3s.de/…../Giovanni_Tummarello
B= http://dblp.l3s.de/…./Renaud_Delbru
But we know that Semantic Web Dog Food (SWDF) might provide more info about
Giovanni and Renaud.
What if SWDF provides an additional paper for Renaud,
a paper that Giovanni did not co-author?
SIM(Giovanni, Renaud)=1 instead of 2/3….

Non-Authoritative Metadata - SINDICE
IDEA: query SINDICE by the researchers’ URIs A, B to get the RDF
fragments pertaining to Giovanni and Renaud
• URIs, not names, as keywords: different people might share
the same name, so URIs are in principle more precise
Result: you get RDF fragments from DBLP only!
None from Semantic Web Dog Food, which we know provides further info.
First lesson: non-authoritative metadata and non-consistently
identified metadata are tightly interrelated in practice. To
deal effectively with the former issue, we often have to take care of the
latter.
[Figure: the SWDF researchers’ URIs and the DBLP researchers’ URIs do not overlap]

How to move next?
IDEA: if SWDF added statements like
<http://data.semanticweb.org/person/name-[middlename]-[familyname]>
owl:sameAs <http://dblp.l3s.de/d2r/resource/authors/name_[middlename]_familyname>
the SWDF fragments would have been retrieved by SINDICE.
[Figure: DBLP URI --- owl:sameAs --- SWDF URI]
[We are implicitly assuming some reasoning:
e.g.:
(X owl:sameAs X1) and (X1 rel Z) -> (X rel Z)
]
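The reasoning step assumed in the brackets above can be sketched as a tiny forward-chaining loop. This is an illustration only: triples are plain tuples, the identifiers `swdf:giovanni`, `dblp:giovanni` and `dblp:paperC` are hypothetical stand-ins for the real URIs, and the rule is applied in the single direction written on the slide rather than with full owl:sameAs semantics.

```python
SAME_AS = "owl:sameAs"


def apply_same_as(triples):
    """Apply (X owl:sameAs X1) and (X1 rel Z) -> (X rel Z) until nothing new is derived."""
    closed = set(triples)
    changed = True
    while changed:
        changed = False
        aliases = [(s, o) for s, p, o in closed if p == SAME_AS]
        for x, x1 in aliases:
            for s, p, o in list(closed):
                # owl:sameAs statements themselves are not copied (a simplification).
                if s == x1 and p != SAME_AS and (x, p, o) not in closed:
                    closed.add((x, p, o))
                    changed = True
    return closed


# Hypothetical identifiers standing in for the SWDF and DBLP URIs.
triples = [
    ("swdf:giovanni", SAME_AS, "dblp:giovanni"),
    ("dblp:giovanni", "foaf:made", "dblp:paperC"),
]
inferred = apply_same_as(triples)
print(sorted(inferred))
```

After the rule fires, the statement about the DBLP URI is also available under the SWDF URI, so fragments from both sources can feed the same similarity assessment.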

Heterogeneous metadata - Example
RDF for Giovanni in DBLP
<A> foaf:made <paperC>;
foaf:made <paperD>;
foaf:made <paperH>.
RDF for Giovanni in SWDF
<paperE> foaf:maker <A>.
<paperB> foaf:maker <A>.
Here the problem does not show up as different RDF
schemas:
both DBLP and SWDF deploy FOAF …
foaf:made is owl:inverseOf foaf:maker, but you cannot know it unless you
dereference/load the FOAF schema
Second lesson: the ontologies/schemas/properties in the context must be
dereferenced just like the entities’ URIs to make the semantics of
properties exploitable.

We must be careful when dereferencing
• Dereferencing schemata and URIs
▫ is extremely slow
▫ adds many RDF statements which might turn out to be
useless for the semantic similarity assessment
▪ Info not pertaining to the specified context
▫ ends up with a huge amount of derived RDF
statements, which might worsen the efficiency and
computational problems
Third lesson: specific and context-driven policies to dereference the URIs
and retrieve RDF fragments should be deployed in order to ease the
efficiency and computational problems.
For example: dereference only the properties mentioned in the context, or
consider only the RDF fragments returned by SINDICE with an explicit
reference to the schemas mentioned in the context.
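One way to picture such a context-driven policy is a filter that keeps only the statements whose property is named in the context. This is a minimal sketch under the assumption that triples are plain tuples and that the context is reduced to a set of property names; `filter_by_context` is a hypothetical helper, not part of the prototype.

```python
def filter_by_context(triples, context_properties):
    """Keep only the triples whose property is mentioned in the similarity context."""
    allowed = set(context_properties)
    return [t for t in triples if t[1] in allowed]


# Statements retrieved for researcher A; only foaf:made is named in the context.
retrieved = [
    ("A", "rdfs:label", "A descr"),
    ("A", "dc:license", "http://vb.com"),
    ("A", "foaf:primaryTopic", "xzc"),
    ("A", "foaf:made", "paperC"),
    ("A", "foaf:made", "paperD"),
    ("A", "foaf:made", "paperH"),
]
context_properties = {"foaf:made"}

kept = filter_by_context(retrieved, context_properties)
print(kept)
```

Filtering before reasoning keeps the set of statements handed to the reasoner small, which is the point of the third lesson.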

How to move next?
RDF for Giovanni in DBLP
<A> foaf:made <paperC>;
foaf:made <paperD>;
foaf:made <paperH>;
foaf:primaryTopic <xzc>.
RDF for Giovanni in SWDF
<paperE> foaf:maker <A>.
<paperB> foaf:maker <A>.
Derived statements
<A> foaf:made <paperE>.
<A> foaf:made <paperB>.
Assuming we have dereferenced foaf:maker, or
loaded into the reasoner a rule saying
(P foaf:maker X) -> (X foaf:made P)
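The inverse-property rule can be sketched in a few lines. The sketch assumes triples are plain tuples and hard-codes the foaf:maker / foaf:made pair instead of reading owl:inverseOf from the dereferenced FOAF schema; `apply_inverse` is a hypothetical helper.

```python
def apply_inverse(triples, prop="foaf:maker", inverse="foaf:made"):
    """Derive (X inverse P) from every (P prop X) and return the enlarged set."""
    derived = {(o, inverse, s) for s, p, o in triples if p == prop}
    return set(triples) | derived


# DBLP states foaf:made on the author; SWDF states foaf:maker on the papers.
triples = [
    ("A", "foaf:made", "paperC"),
    ("A", "foaf:made", "paperD"),
    ("A", "foaf:made", "paperH"),
    ("paperE", "foaf:maker", "A"),
    ("paperB", "foaf:maker", "A"),
]
merged = apply_inverse(triples)
papers_by_a = {o for s, p, o in merged if s == "A" and p == "foaf:made"}
print(sorted(papers_by_a))
```

Once the rule has been applied, a Count on foaf:made sees the papers from both sources.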

Non-consistently identified metadata
What if the same publication is provided by both DBLP and SWDF?
E.g., DBLP:paperC and SWDF:paperB are two URIs for the same paper
We count it twice!
Fourth lesson: non-consistently identified metadata is a recursive
problem. Consolidating researchers without consolidating papers leads
to wrong similarity results. We must be sure that the entities and properties in
the similarity context have been properly consolidated before applying
instance similarity.
RDF for Giovanni in DBLP + RDF for Giovanni in SWDF
<A> foaf:made <DBLP:paperC>;
foaf:made <DBLP:paperD>;
foaf:made <DBLP:paperH>;
foaf:primaryTopic <xzc>.
<A> foaf:made <SWDF:paperE>.
<A> foaf:made <SWDF:paperB>.
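The double counting can be avoided by mapping every paper URI to one representative before counting. The sketch below uses a small union-find over known equivalences; it assumes the equivalence between DBLP:paperC and SWDF:paperB is already known (how to discover it is the open problem), and `canonical_map` is a hypothetical helper.

```python
def canonical_map(equivalences):
    """Map each URI appearing in the equivalence pairs to one representative URI."""
    parent = {}

    def find(u):
        parent.setdefault(u, u)
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for a, b in equivalences:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Use the lexicographically smaller URI as a deterministic representative.
            parent[max(ra, rb)] = min(ra, rb)
    return {u: find(u) for u in list(parent)}


papers_made_by_a = [
    "DBLP:paperC", "DBLP:paperD", "DBLP:paperH",
    "SWDF:paperE", "SWDF:paperB",
]
# Assumption: this equivalence is given, as in the example on the slide.
equivalences = [("DBLP:paperC", "SWDF:paperB")]

canon = canonical_map(equivalences)
naive_count = len(set(papers_made_by_a))
consolidated_count = len({canon.get(p, p) for p in papers_made_by_a})
print(naive_count, consolidated_count)
```

The same consolidation has to be applied to the researchers and to every other entity type reachable through the context, which is why the slide calls the problem recursive.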

Conclusion (I)
• Linked data best practice and our semantic
similarity
▫ have good potential to support data selection for
resources in complex domains
• But scaling semantic similarity up to the web of data
means dealing with
▫ Non-authoritative metadata
▫ Heterogeneous metadata
▫ Non-consistently identified metadata
▫ Efficiency and computational issues

Conclusion (II)
• The exploratory phase shows
▫ All the mentioned issues arise even in a very simple
scenario of semantic similarity assessment
▫ It is pivotal to have first-hand experience with real
data to discover the shapes the issues might take
• Consideration
▫ The problems we found are not exclusive to similarity
assessment
▪ We suspect these issues arise whenever you try to
elaborate information published as linked data in
order to mine new facts/info from the published
data

Do not hesitate to email me (Albertoni@ge.imati.cnr.it)
if you have offline questions
