Peter Mika | Yahoo Research, Spain | pmika@yahoo-inc.com
Thanh Tran | Institute AIFB, KIT, Germany | Tran@aifb.uni-karlsruhe.de
Semantic Search Tutorial
Introduction
About the speakers
  • Peter Mika
      • Semantic Search group at Yahoo! Barcelona
      • Semantic Search, Web Object Retrieval, Natural Language Processing
  • Tran Duc Thanh
      • Semantic Search group at AIFB
      • Semantic Search, Semantic Data Management, Linked Data Query Processing
Agenda
  • Introduction (5 min)
  • Semantic Web data (50 min)
      • The RDF data model
      • Publishing RDF
      • Crawling and indexing RDF data
  • Query processing (35 min)
  • Ranking (25 min)
  • Result presentation (15 min)
  • Semantic Search evaluation (15 min)
  • Demos (15 min)
  • Questions (5 min)
Why Semantic Search? I.
  • “We are at the beginning of search.” (Marissa Mayer)
  • Solved large classes of queries, e.g. navigational
  • Heavy investment in computational power
  • Remaining queries are hard, not solvable by brute force, and require a deep understanding of the world and human cognition
  • Background knowledge and metadata can help to address poorly solved queries
Poorly solved information needs
  • Ambiguous searches
      • paris hilton
  • Long tail queries
      • george bush (and I mean the beer brewer in Arizona)
  • Multimedia search
      • paris hilton sexy
  • Imprecise or overly precise searches
      • jim hendler
      • pictures of strong adventures people
  • Precise searches for descriptions
      • countries in africa
      • 32 year old computer scientist living in barcelona
      • reliable digital camera under 300 dollars
  • Many of these queries would not be asked by users, who have learned over time what search technology can and cannot do.
Example: multiple interpretations
Why Semantic Search? II.
  • The Semantic Web is now a reality
      • Large amounts of data published in RDF
      • Heterogeneous data of varying quality
      • Users who are not skilled in writing complex queries (e.g. SPARQL) and may not be experts in the domain
  • Searching data instead of or in addition to searching documents
      • Direct answers
      • Novel search tasks
Example: direct answers in search
  • Information box with content from and links to Yahoo! Travel
  • Points of interest in Vienna, Austria
  • Shopping results from Yahoo! Shopping
  • Since Aug, 2010, ‘regular’ search results are ‘Powered by Bing’
Novel search tasks
  • Aggregation of search results
      • e.g. price comparison across websites
  • Analysis and prediction
      • e.g. world temperature by 2020
  • Semantic profiling
      • recommendations based on particular interests
  • Semantic log analysis
      • understanding user behavior in terms of objects
  • Support for complex tasks
      • e.g. booking a vacation using a combination of services
Document retrieval and data retrieval
  • Information Retrieval (IR) supports the retrieval of documents (document retrieval)
      • Representation based on lightweight syntax-centric models
      • Works well for topical search
      • Not so well for more complex information needs
      • Web scale
  • Database (DB) and Knowledge-based Systems (KB) deliver more precise answers (data retrieval)
      • More expressive models
      • Allow for complex queries
      • Retrieve concrete answers that precisely match queries
      • Not just matching and filtering, but also joins
      • Limitations in scalability
Combination of document and data retrieval
  • Documents with metadata
      • Metadata may be embedded inside the document
      • “I’m looking for documents that mention countries in Africa.”
  • Data retrieval
      • Structured data, but searchable text fields
      • “I’m looking for directors who have directed movies where the synopsis mentions dinosaurs.”
Semantic Search
  • Target (combination of) document and data retrieval
  • Semantic search is a retrieval paradigm that
      • Exploits the structure/semantics of the data or explicit background knowledge to understand user intent and the meaning of content
      • Incorporates the intent of the query and the meaning of content into the search process (semantic models)
  • Wide range of semantic search systems
      • Employ different semantic models, possibly at different steps of the search process and in order to support different tasks
Semantic models
  • Semantics is concerned with the meaning of the resources made available for search
      • Various representations of meaning
  • Linguistic models: models of relationships among words
      • Taxonomies, thesauri, dictionaries of entity names
      • Inference along linguistic relations, e.g. broader/narrower terms
      • Natural language search
  • Conceptual models: models of relationships among objects
      • Ontologies capture entities in the world and their relationships
      • Inference along domain-specific relations
      • Knowledge-based search
  • We will focus on conceptual models in this tutorial
      • In particular, the RDF/OWL conceptual model for representing classes in a domain and describing their instances
Semantic Search – a process view
[Figure: process view of semantic search. Labels: Knowledge Representation, Semantic Models, Resources, Documents, Document Representation]
Semantic Search systems
For data / document retrieval, semantic search systems might combine a range of techniques, from statistics-based IR methods for ranking and database methods for efficient indexing and query processing up to complex reasoning techniques for making inferences!
Semantic Web data
Semantic Web
  • Sharing data across the Web
  • Standard data model
      • RDF
  • A number of syntaxes (file formats)
      • RDF/XML, RDFa
  • Powerful, logic-based languages for schemas
      • OWL, RIF
  • Query languages and protocols
      • HTTP, SPARQL
Resource Description Framework (RDF)
  • Each resource (thing, entity) is identified by a URI
      • Globally unique identifiers
      • Locators of information
  • Data is broken down into individual facts
      • Triples of (subject, predicate, object)
  • A set of triples (an RDF graph) is published together in an RDF document
  • [Figure: an RDF document containing the triples (example:roi, type, foaf:Person) and (example:roi, name, “Roi Blanco”)]
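As a minimal sketch of the triple model, an RDF graph can be held as a plain set of (subject, predicate, object) tuples. The prefixed names `rdf:type` and `foaf:name` are assumptions standing in for the "type" and "name" edges of the slide, and `describe` is a hypothetical helper, not part of any RDF library.

```python
from collections import defaultdict

# A tiny RDF graph as a set of (subject, predicate, object) tuples.
# Prefixed names are used for brevity; real RDF would use absolute URIs.
triples = {
    ("example:roi", "rdf:type", "foaf:Person"),
    ("example:roi", "foaf:name", "Roi Blanco"),
}


def describe(graph, subject):
    """Collect all predicate -> objects pairs stated about one subject."""
    description = defaultdict(list)
    for s, p, o in graph:
        if s == subject:
            description[p].append(o)
    return {p: sorted(objects) for p, objects in description.items()}
```

Because every fact is a self-contained triple, merging two RDF documents amounts to taking the union of their triple sets.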
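Linking resources
[Figure: RDF graphs from two sources (Roi’s homepage and Yahoo) linked to each other and to the Friend-of-a-Friend ontology. Node and edge labels: example:roi, example:roi2, example:peter, foaf:Person, type, name “Roi Blanco”, sameAs, worksWith, email “pmika@yahoo-inc.com”]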
Publishing RDF
  • Linked Data
      • Data published as RDF documents linked to other RDF documents
      • Community effort to re-publish large public datasets (e.g. DBpedia, open government data)
  • RDFa
      • Data embedded inside HTML pages
      • Recommended for site owners by Yahoo, Google, Facebook
  • SPARQL endpoints
      • Triple stores (RDF databases) that can be queried through the web
Linked Data
  • A web of RDF documents in parallel to the current Web
      • often implemented as wrappers around databases or APIs
  • The four rules of Linked Data:
      • Use URIs to identify things.
      • Use HTTP URIs so that these things can be referred to and looked up ("dereferenced") by people and user agents.
      • Provide useful information about the thing when its URI is dereferenced, using standard formats such as RDF/XML.
      • Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web.
Linked Data
  • Advantages:
      • No change to the publishing of the HTML documents
      • Data can be published by a third party (e.g. DBpedia)
  • Disadvantages:
      • Web servers need to be configured to properly handle URIs that identify concepts instead of documents
      • Not favored by search engines
      • Lack of use cases
      • Crawling needs to be changed
      • Authority is difficult to determine
  • Tools
      • Triple stores (Virtuoso, Oracle etc.) and front-ends (Pubby)
      • RDB-to-RDF mappers (e.g. D2RQ, Triplify)
      • Validators (Vapour)
      • Linked Data browsers (many)
The state of Linked Data
  • Rapidly growing community effort to (re)publish open datasets as Linked Data
      • In particular, scientific and government datasets
      • see linkeddata.org
  • Less commercial interest, real usage
Metadata in HTML
  • 1995: HTML meta tags
  • 1996: Simple HTML Ontology Extensions (SHOE)
  • 1998: RDF/XML
      • RDF/XML in HTML
      • RDF linked from HTML
  • 2003: Web 2.0
      • Tagging
      • Microformats
      • Metadata in Wikipedia
      • Machine tags in Flickr
  • 2005: eRDF
  • 2008: RDFa 1.0
  • 2011: RDFa 1.1
  • 2012: Microdata?
HTML meta tags
<HTML>
<HEAD profile="http://dublincore.org/documents/dcq-html/">
  <META name="DC.author" content="Peter Mika">
  <LINK rel="DC.rights copyright" href="http://www.example.org/rights.html">
  <LINK rel="meta" type="application/rdf+xml" title="FOAF" href="http://www.cs.vu.nl/~pmika/foaf.rdf">
</HEAD>
…
</HTML>
Microformats (μf)
  • Agreements on the way to encode certain kinds of metadata in HTML
      • Reuse of semantic-bearing HTML elements
      • Based on existing standards
      • Minimality
  • Microformats exist for a limited set of objects
      • hCard (persons and organizations)
      • hCalendar (events)
      • hResume
      • hProduct
      • hRecipe
  • Varying degrees of support and stability
      • hCard and rel-tag are widely supported
  • Community centered around microformats.org
      • Specifications and discussions are hosted there
Microformats: limitations
  • No shared syntax
      • Each microformat has a separate syntax tailored to the vocabulary
  • No formal schemas
      • Limited reuse, extensibility of schemas
      • Unclear which combinations are allowed
  • No datatypes
  • No namespaces, unique identifiers (URIs)
      • no interlinking
      • mapping between instances is required
  • Always appears in the HTML <body>
Example: the hCard microformat
<div class="vcard">
  <a class="email fn" href="mailto:jfriday@host.com">Joe Friday</a>
  <div class="tel">+1-919-555-7878</div>
  <div class="title">Area Administrator, Assistant</div>
</div>

<cite class="vcard"><a class="fn url" rel="friend colleague met" href="http://meyerweb.com/">Eric Meyer</a></cite> wrote a post (<cite><a href="http://meyerweb.com/eric/thoughts/2005/12/16/tax-relief/">Tax Relief</a></cite>) about an unintentionally humorous letter he received from the <span class="vcard"><a class="fn org url" href="http://irs.gov/">Internal Revenue Service</a></span>.
RDFa
  • W3C standard for embedding RDF data in HTML documents
      • A set of new HTML attributes to be used in head or body
      • A specification of how to extract the data from these attributes
  • RDFa is just a syntax, you have to choose a vocabulary separately
  • RDFa 1.0 is a W3C Recommendation since October, 2008
      • RDFa Primer
  • RDFa 1.1 is a small update on RDFa to make it easier to use
      • Currently Working Draft (March 31, 2011)
      • Updated version of the RDFa Primer (April 19, 2011)
  • RDFa API for accessing RDFa data in a webpage in the browser from JavaScript
      • Currently Working Draft (April 19, 2011)
RDFa 1.1
  • Changes
      • New vocab attribute to define the default namespace for the document or subtree
      • Profile documents to define multiple namespace prefixes
      • The prefix attribute as a recommended replacement of xmlns
      • You can use URIs even where only CURIEs were allowed before
  • RDFa 1.1 is backward compatible with RDFa 1.0
  • RDFa 1.1 is recommended if you want to use HTML5
When to use RDFa
  • Choose microformats when you find a microformat that fits your needs and is supported by your consumers
      • Microformats are the first option because they are simple
      • Yahoo supports all major microformats, see the documentation
  • It’s a common misconception that RDFa requires XHTML or that it’s incompatible with HTML5
      • It’s compatible with HTML4, HTML5, XHTML
  • If you find none that perfectly fits your needs then you need RDFa
      • Microformats have a fixed schema: you cannot add your own attributes
  • Example: a social networking site with user profiles
      • VCard is a good candidate, but for example it doesn’t have a way to express the user’s social connections
      • You either live without this, or go with RDFa
Example: Facebook’s Like and the Open Graph Protocol
  • The ‘Like’ button provides publishers with a way to promote their content on Facebook and build communities
      • Shows up in profiles and news feed
      • Site owners can later reach users who have liked an object
      • Facebook Graph API allows 3rd party developers to access the data
  • Open Graph Protocol is an RDFa-based format for describing the object that the user ‘Likes’
Example: Facebook’s Open Graph Protocol
  • RDF vocabulary to be used in conjunction with RDFa
      • Simplify the work of developers by restricting the freedom in RDFa
  • Activities, Businesses, Groups, Organizations, People, Places, Products and Entertainment
  • Only HTML <head> accepted

<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
  <title>The Rock (1996)</title>
  <meta property="og:title" content="The Rock" />
  <meta property="og:type" content="movie" />
  <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
  <meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" />
  …
</head>
...
Example: Yahoo! Enhanced Results (was: SearchMonkey)
  • Guide for publishers to mark up their pages for common types of objects
      • Product, Local, News, Video, Events, Documents, Discussion, Games
      • Using popular microformats and RDF vocabularies
      • Copy-paste code
      • Validator
  • Yahoo as a consumer
      • See later
Example: Google’s Rich Snippets
  • Google accepts popular microformats and its own RDFa vocabulary
      • Similar approach to RDFa as Facebook
      • Validator to check if the markup is correct
  • Google displays enhanced results based on this metadata
      • Rich Snippets
RDFa on the rise
  • 510% increase between March, 2009 and October, 2010
  • [Figure: percentage of URLs with embedded metadata in various formats]
Microdata
  • Currently under standardization at the W3C
      • Originally part of the HTML5 spec, but now a separate document
  • Similar to microformats, but with the extensibility of RDFa
      • Introduce new terms using reverse domain names or full URIs
  • HTML5 also has a number of “semantic” elements such as <time>, <video>, <article>…
Microdata example
<div itemscope itemid="http://www.yahoo.com/resource/person">
  <p>My name is <span itemprop="name">Neil</span>.</p>
  <p>My band is called <span itemprop="band">Four Parts Water</span>.
     I was born on <time itemprop="birthday" datetime="2009-05-10">May 10th 2009</time>.
     <img itemprop="image" src="me.png" alt="me">
  </p>
</div>
The state of metadata in HTML
  • 5-10% of webpages contain some explicit metadata
      • Depending on how you count…
  • Too many competing approaches
      • Too many formats: microformats vs RDFa vs Microdata
      • When using RDFa, publishers may need to use multiple different vocabularies to satisfy everyone
Crawling the Semantic Web
  • Linked Data
      • Similar to HTML crawling, but the crawler needs to parse RDF/XML (and others) to extract URIs to be crawled
      • Semantic Sitemap/VOID descriptions
  • RDFa
      • Same as HTML crawling, but data is extracted after crawling
      • Mika et al. Investigating the Semantic Gap through Query Log Analysis, ISWC 2010.
  • SPARQL endpoints
      • Endpoints are not linked, need to be discovered by other means
      • Semantic Sitemap/VOID descriptions
Data fusion
  • Ontology matching
      • Widely studied in Semantic Web research, see e.g. the list of publications at ontologymatching.org
      • Unfortunately, not much of it is applicable in a Web context due to the quality of ontologies
  • Entity resolution
      • Logic-based approaches in the Semantic Web
      • Studied as record linkage in the database literature
      • Machine learning based approaches, focusing on attributes
      • Graph-based approaches, see e.g. the work of Lise Getoor, are applicable to RDF data
      • Improvements over only attribute-based matching
  • Blending
      • Merging objects that represent the same real-world entity and reconciling information from multiple sources
Data quality assessment and curation
  • Heterogeneity, quality of data is an even larger issue
      • Quality ranges from well-curated data sets (e.g. Freebase) to microformats
      • In the worst of cases, the data becomes a graph of words
      • Short amounts of text: prone to mistakes in data entry or extraction
      • Example: mistake in a phone number or state code
  • Quality assessment and data curation
      • Quality varies from data created by experts to user-generated content
      • Automated data validation
      • Against known-good data or using triangulation
      • Validation against the ontology or using probabilistic models
      • Data validation by trained professionals or crowdsourcing
      • Sampling data for evaluation
      • Curation based on user feedback
Indexing
  • Search requires matching and ranking
      • Matching selects a subset of the elements to be scored
  • The goal of indexing is to speed up matching
      • Retrieval needs to be performed in milliseconds
      • Without an index, retrieval would require streaming through the collection
  • The type of index depends on the query model to support
      • DB-style indexing
      • IR-style indexing
IR-style indexing
  • Index data as text
      • Create virtual documents from data
      • One virtual document per subgraph, resource or triple
      • typically: resource
  • Key differences to Text Retrieval
      • RDF data is structured
      • Minimally, queries on property values are required
Horizontal index structure
  • Two fields (indices): one for terms, one for properties
  • For each term, store the property on the same position in the property index
      • Positions are required even without phrase queries
  • Query engine needs to support the alignment operator
  • ✓ Dictionary is number of unique terms + number of properties
  • Occurrences is number of tokens * 2
Vertical index structure
  • One field (index) per property
  • Positions are not required
      • But useful for phrase queries
  • Query engine needs to support fields
  • Dictionary is number of unique terms
  • Occurrences is number of tokens
  • ✗ Number of fields is a problem for merging, query performance
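The two layouts can be contrasted with a toy sketch. It assumes one virtual document per resource and whitespace tokenization; the resources, property names and function names are invented for illustration and do not reflect any particular engine.

```python
from collections import defaultdict

# Toy virtual documents: resource -> list of (property, literal value).
RESOURCES = {
    "example:roi": [("name", "Roi Blanco"), ("employer", "Yahoo Research")],
    "example:peter": [("name", "Peter Mika"), ("nick", "Blanco fan")],
}


def build_horizontal(resources):
    """Two aligned fields: a term index and a property index sharing positions."""
    tokens, props = defaultdict(set), defaultdict(set)
    for doc, fields in resources.items():
        pos = 0
        for prop, value in fields:
            for term in value.lower().split():
                tokens[term].add((doc, pos))
                props[prop].add((doc, pos))  # property stored at the term's position
                pos += 1
    return tokens, props


def search_horizontal(index, prop, term):
    tokens, props = index
    # Alignment operator: same document and same position in both fields.
    return {doc for doc, _ in tokens[term] & props[prop]}


def build_vertical(resources):
    """One field (inverted index) per property; no positions needed."""
    fields = defaultdict(lambda: defaultdict(set))
    for doc, pairs in resources.items():
        for prop, value in pairs:
            for term in value.lower().split():
                fields[prop][term].add(doc)
    return fields


def search_vertical(index, prop, term):
    return set(index[prop][term])
```

In this toy collection the horizontal index stores twice as many postings as there are tokens, while the vertical index stores one posting per token but needs one field per property, which matches the trade-off listed on the two slides.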
Indexing using MapReduce
  • MapReduce is the perfect model for building inverted indices
      • Map creates (term, {doc1}) pairs
      • Reduce collects all docs for the same term: (term, {doc1, doc2…})
      • Sub-indices are merged separately
  • Term-partitioned indices
  • Peter Mika. Distributed Indexing for Semantic Search, SemSearch 2010.
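The map and reduce steps above can be simulated in a single process. This is only a sketch of the programming model, assuming whitespace tokenization; `map_phase`, `shuffle`, `reduce_phase` and `build_index` are illustrative names, not the API of Hadoop or any other framework.

```python
from collections import defaultdict
from itertools import chain


def map_phase(doc_id, text):
    """Map: emit one (term, doc_id) pair per distinct term in the document."""
    return [(term, doc_id) for term in sorted(set(text.lower().split()))]


def shuffle(pairs):
    """Group the emitted pairs by key, as the framework would do between phases."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped


def reduce_phase(term, doc_ids):
    """Reduce: collect all documents for the same term into one posting list."""
    return term, sorted(set(doc_ids))


def build_index(docs):
    pairs = chain.from_iterable(map_phase(d, t) for d, t in docs.items())
    return dict(reduce_phase(t, ids) for t, ids in shuffle(pairs).items())
```

Since each reduce call sees all postings of exactly one term, the output is naturally partitioned by term.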
Query Processing
Structure
  • Taxonomy of retrieval approaches
  • Query processing for semantic search
      • Types of semantic data
      • Formalisms for querying semantic data
  • Approaches
      • General task: hybrid graph pattern matching
      • Matching keyword query against text
      • Matching structured query against structured data
      • Matching keyword query against structured data
      • Matching structured query against text (a hybrid case)
  • Main tasks, challenges and opportunities
Taxonomy of retrieval approaches (1)
  • Data retrieval problem
      • A collection of resources represented by the data model G
      • Information needs expressed as queries in Q
      • Retrieval is the task of efficiently computing results from G that are relevant to the queries in Q
  • Document retrieval vs. data retrieval
      • Differences in query and data representation and matching
      • Efficiently retrieve structured data that exactly match formal information needs expressed as structured queries
      • Effectively rank textual results that match ambiguous NL / keyword queries to a certain degree (notions of relevance)
Taxonomy of retrieval approaches (2)
[Figure: matching of a query against data, ranging from exact matching (complete, sound) to approximate matching (not complete, not sound, ranked, best effort, top-k)]
Query processing mainly focuses on the efficiency of matching, whereas ranking deals with the degree of matching (relevance)!
Query processing for Semantic Search (1)
  • The underlying collection of resources is represented by semantic data G ranging from
      • Structured data with well-defined schemas
      • Semi-structured data with incomplete or no schemas
      • Data that largely comprise text
      • Hybrid / embedded data
  • Targeted information needs Q are of varying complexity, captured using different formalisms and querying paradigms
      • Natural language texts and keywords
      • Form-based inputs
      • Formal structured queries
  • Semantic search is mainly the task of efficiently computing results (query processing) from G that are relevant to the queries in Q (ranking)
Query processing for Semantic Search (2)
  • Query
      • Keywords
      • NL questions
      • Form- / facet-based inputs
      • Structured queries (SPARQL)
  • Matching
  • Data
      • OWL ontologies with rich, formal semantics
      • Structured RDF data
      • Semi-structured RDF data
      • RDF data embedded in text (RDFa)
Query processing for Semantic Search (3)
  • Unstructured query, textual data: keyword query on textual data (IR / document retrieval)
  • Structured query, textual data: structured query on textual data
  • Unstructured query, structured data: keyword query on structured data
  • Structured query, structured data: structured query on structured data (DB / data retrieval)
  • Semantic search targets different groups of users, information needs, and types of data. Query processing for semantic search is a hybrid combination of techniques!
Types of data models (1)
  • Textual
      • Bag-of-words
      • Represents documents, text in structured data, …, real-world objects (captured as structured data)
      • Lacks “structure” in text, e.g. linguistic structure, hyperlinks, (positional information)
      • Lacks the structure in structured data representation
  • Example text: “In combination with Cloud Computing technologies, promising solutions for the management of `big data' have emerged. Existing industry solutions are able to support complex queries and analytics tasks with terabytes of data. For example, using a Greenplum.”
  • Terms (with statistics): combination Cloud Computing Technologies solutions management `big data' industry solutions support complex ……
Types of data models (2)
  • Textual
  • Structured
      • Resource Description Framework (RDF)
      • Represents real-world objects, services, applications, …, documents
      • Resource attribute values and relationships between resources
      • Schema
  • [Figure: example graph with the labels Picture, creator, Person, Bob]
Types of data models (3)
  • Textual
  • Structured
  • Hybrid
      • RDF data embedded in text (RDFa)
Types of data models – RDFa (1)
…
<div about="/alice/posts/trouble_with_bob">
  <h2 property="dc:title">The trouble with Bob</h2>
  <h3 property="dc:creator">Alice</h3>
  Bob is a good friend of mine. We went to the same university, and
  also shared an apartment in Berlin in 2008. The trouble with Bob is
  that he takes much better photos than I do:
  <div about="http://example.com/bob/photos/sunset.jpg">
    <img src="http://example.com/bob/photos/sunset.jpg" />
    <span property="dc:title">Beautiful Sunset</span>
    by <span property="dc:creator">Bob</span>.
  </div>
</div>
…
Adapted from: http://www.w3.org/TR/xhtml-rdfa-primer/
Types of semantic data – RDFa (2)
[Figure: graph view of the RDFa example; the text “Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do:” appears as a node, with edges labeled “content”]
Adapted from: http://www.w3.org/TR/xhtml-rdfa-primer/
Types of semantic data - conclusion
Semantic data in general can be conceived as a graph with text and structured data items as nodes, and with edges representing different types of relationships, including explicit semantic relationships and vaguely specified ones such as hyperlinks!
Formalisms for querying semantic data (1)
  • Example information need: “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.”
  • Unstructured queries
  • Fully-structured queries
  • Hybrid queries: unstructured + structured
Formalisms for querying semantic data (2)
  • Example information need: “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.”
  • Unstructured
      • NL
      • Keywords: apartment Berlin Alice shared
Formalisms for querying semantic data (3)
  • Example information need: “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.”
  • Unstructured
  • Fully-structured
      • SPARQL: BGP, filter, optional, union, select, construct, ask, describe

PREFIX ns: <http://example.org/ns#>
SELECT ?x
WHERE { ?x ns:knows ?y. ?y ns:name "Alice".
        ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT" }

Formalisms for querying semantic data (4)
  • Fully-structured
  • Unstructured
  • Hybrid: content and structure constraints
      • Content: “shared apartment Berlin Alice”
      • Structure: ?x ns:knows ?y. ?y ns:name "Alice". ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT"
Formalisms for querying semantic data (5)
  • Fully-structured
  • Unstructured
  • Hybrid: content and structure constraints
      • Content: “shared apartment Berlin Alice”
      • Structure: ?x ns:knows ?y. ?y ns:name "Alice". ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT"
Formalisms for querying semantic data - conclusion
Semantic search queries can be conceived as graph patterns with nodes referring to text and structured data items, and edges referring to relationships between these items!
Processing hybrid graph patterns (1)
  • Example information need: “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.”
  • Hybrid query
      • Structured part: ?x ns:knows ?y. ?y ns:name "Alice". ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT"
      • Keyword part: apartment shared Berlin Alice
  • [Figure: example data graph. Node labels: Alice, Bob, Thanh, Peter, KIT, FluidOps, Germany, Semantic Search, trouble with bob, sunset.jpg, Beautiful Sunset, 34, 2009, and the text “Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do:”. Edge labels: knows, works, located, creator, author, title, year, age]
Processing hybrid graph patterns (2)
Matching hybrid graph patterns against data
Matching keyword query against text
  • Retrieve documents
      • Inverted list (inverted index): keyword → {<doc1, pos, score, ...>, <doc2, pos, score, ...>, ...}
  • AND-semantics: top-k join
  • [Figure: the posting lists for “shared”, “Berlin” and “Alice” are joined on the common document D1]
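A minimal sketch of AND-semantics over posting lists follows. It assumes each list is sorted by document id and carries one integer score per document, with invented values; it scores the full intersection and then keeps the best k, which is a simplification and not the early-terminating top-k join named on the slide.

```python
import heapq

# Hypothetical inverted index: keyword -> posting list sorted by document id.
INDEX = {
    "shared": [("D1", 4), ("D3", 2)],
    "berlin": [("D1", 5), ("D2", 9), ("D3", 1)],
    "alice": [("D1", 3), ("D2", 6), ("D3", 4)],
}


def merge_and(a, b):
    """Linear merge join of two sorted posting lists, summing scores on a match."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        (doc_a, score_a), (doc_b, score_b) = a[i], b[j]
        if doc_a == doc_b:
            out.append((doc_a, score_a + score_b))
            i += 1
            j += 1
        elif doc_a < doc_b:
            i += 1
        else:
            j += 1
    return out


def search_and(index, keywords, k):
    """Return the k best documents that contain every keyword."""
    if not keywords:
        return []
    # Start with the shortest list so intermediate results stay small.
    lists = sorted((index.get(w, []) for w in keywords), key=len)
    result = lists[0]
    for other in lists[1:]:
        result = merge_and(result, other)
    return heapq.nlargest(k, result, key=lambda entry: entry[1])
```

For example, `search_and(INDEX, ["shared", "berlin", "alice"], 1)` keeps only D1, the document in which all three keywords co-occur with the highest combined score.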
Matching structured query against structured data
  • Retrieve data for triple patterns
      • Index on tables
      • Multiple “redundant” indexes to cover different access patterns
  • Join (conjunction of triples)
      • Blocking, e.g. linear merge join (requires sorted input)
      • Non-blocking, e.g. symmetric hash join
      • Materialized join indexes
  • Example query: ?x ns:knows ?y. ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT"
  • [Figure: the pattern Per1 ns:works ?v is answered from the SP-index (Per1 ns:works Ins1) and the pattern ?v ns:name "KIT" from the PO-index (Ins1 ns:name KIT); the two results are joined on ?v = Ins1]
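The lookup-then-join idea can be sketched with in-memory dictionaries. The triples, the extra predicate-only index and the function names are assumptions made for illustration, and the join shown is a plain build-and-probe hash join rather than the symmetric variant mentioned above.

```python
from collections import defaultdict

# Hypothetical data, loosely following the labels in the figure.
TRIPLES = [
    ("Per1", "ns:knows", "Per2"),
    ("Per1", "ns:works", "Ins1"),
    ("Per3", "ns:works", "Ins2"),
    ("Ins1", "ns:name", "KIT"),
    ("Ins2", "ns:name", "FluidOps"),
]


def build_indexes(triples):
    """Redundant indexes, each covering a different access pattern."""
    sp, po, p_only = defaultdict(list), defaultdict(list), defaultdict(list)
    for s, p, o in triples:
        sp[(s, p)].append(o)       # subject + predicate known, object wanted
        po[(p, o)].append(s)       # predicate + object known, subject wanted
        p_only[p].append((s, o))   # only the predicate known
    return sp, po, p_only


def hash_join(left, right, var):
    """Build a hash table on the left bindings, then probe it with the right ones."""
    table = defaultdict(list)
    for row in left:
        table[row[var]].append(row)
    return [{**l, **r} for r in right for l in table.get(r[var], [])]


def who_works_at(indexes, org_name):
    """Evaluate: ?z ns:works ?v . ?v ns:name org_name"""
    _, po, p_only = indexes
    orgs = [{"?v": s} for s in po[("ns:name", org_name)]]            # PO-index lookup
    works = [{"?z": s, "?v": o} for s, o in p_only["ns:works"]]      # predicate scan
    return hash_join(orgs, works, "?v")
```

The selective pattern with the constant "KIT" is evaluated first so that the hash table built for the join stays small, a simple instance of join order optimization.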
Matching keyword query against structured data
  • Retrieve keyword elements
      • Using inverted index: keyword → {<el1, score, ...>, <el2, score, ...>, …}
  • Exploration / “Join”
      • Data indexes for triple lookup
      • Materialized index (paths up to graphs)
      • Top-k Steiner tree search, top-k subgraph exploration
  • [Figure: the keyword elements Alice, Bob and KIT are connected through the triples Alice ns:knows Bob, Bob ns:works Inst1 and Inst1 ns:name KIT]
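The exploration step can be illustrated for the simplest case of two keyword elements, where the smallest connecting structure is just a shortest path. The sketch below uses breadth-first search over the three triples from the figure and ignores edge direction and scores; it is not a general top-k Steiner tree algorithm, and `neighbors` and `connect` are illustrative names.

```python
from collections import deque

# The triples from the figure, read as (subject, predicate, object).
EDGES = [
    ("Alice", "ns:knows", "Bob"),
    ("Bob", "ns:works", "Inst1"),
    ("Inst1", "ns:name", "KIT"),
]


def neighbors(edges):
    """Adjacency lists that allow exploration in both edge directions."""
    adj = {}
    for s, p, o in edges:
        adj.setdefault(s, []).append((p, o))
        adj.setdefault(o, []).append((p, s))
    return adj


def connect(edges, start, goal):
    """Breadth-first search; returns the nodes on a shortest path, or None."""
    adj = neighbors(edges)
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for _, nxt in adj.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None
```

With more than two keywords the connecting structure becomes a tree, which is where Steiner tree search and subgraph exploration come in.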
Matching structured query against text
  • Based on offline IE (for offline IE, see Peter’s slides)
  • Based on online IE, i.e. “retrieve” works as follows
      • Derive keywords to retrieve relevant documents
      • On-the-fly information extraction, i.e. phrase pattern matching, e.g. “X title Y”
      • Retrieve extracted data for the structured part
      • Retrieve documents for derived text patterns, e.g. sequences, windows, regular expressions
  • Index
      • Inverted index for document retrieval and pattern matching
      • Join index: inverted index for storing materialized joins between keywords
      • Neighborhood indexes for phrase patterns
  • Example query (hybrid case): ?x ns:knows ?y. ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT"
  • [Figure: query graph with the labels KIT, name, knows]
Query processing – main tasks
  • Retrieval
      • Documents, data elements, triples, paths, graphs
      • Inverted index, …, but also other indexes (B+ tree)
      • Index documents, triples, materialized join paths
  • Join
      • Different join implementations, efficiency depends on availability of indexes
      • Non-blocking join good for early result reporting and for the “unpredictable” linked data scenario
  • [Figure: a query is matched against data]
Query processing – more tasks
  • Disjunction, aggregation, grouping
  • Join order optimization
  • Approximate
      • Approximate the search space
      • Approximate the results (matching, join)
  • Parallelization
  • Top-k
      • Use only some entries in the input streams to produce k results
  • Multiple sources
      • Federation, routing
      • On-the-fly mapping, similarity join
  • Hybrid
      • Join text and data
  • [Figure: a query is matched against data]
Query processing on the Web - research challenges and opportunities
  • Challenges
      • Large amount of semantic data
      • Data inconsistent, redundant, and of low quality
      • Large amount of data embedded in text
      • Large amount of sources
      • Large amount of links between sources
  • Opportunities
      • Optimization, parallelization
      • Approximation
      • Hybrid querying and data management
      • Federation, routing
      • Online schema mappings
      • Similarity join
Ranking
Structure
  • Problem definition
  • Types of ambiguities
  • Ranking paradigms
  • Model construction
      • Content-based
      • Structure-based
Ranking – problem definition
  • Ambiguities arise when representation is incomplete / imprecise
  • Ambiguities at the level of
      • elements (content ambiguity)
      • structure between elements (structure ambiguity)
  • [Figure: a query is matched against data]
  • Due to ambiguities in the representation of the information needs and the underlying resources, the results cannot be guaranteed to exactly match the query. Ranking is the problem of determining the degree of matching using some notions of relevance.
Content ambiguity
  • [Figure: the example query and data graph from “Processing hybrid graph patterns (1)”]
  • What is meant by “Berlin” in the query? What is meant by “Berlin” in the data?
      • A city with the name Berlin? A person?
  • What is meant by “KIT” in the query? What is meant by “KIT” in the data?
      • A research group? A university? A location?
Structure ambiguity
  • [Figure: the example query and data graph from “Processing hybrid graph patterns (1)”]
  • What is the connection between “Berlin” and “Alice”?
      • Friend? Co-worker?
  • What is meant by “works”?
      • Works at? Employed?
Ambiguity
  • Recall: query processing is matching at the level of syntax and semantics
  • Ambiguities arise when data or query allow for multiple interpretations, i.e. multiple matches
      • Syntactic, e.g. works vs. works at
      • Semantic, e.g. works vs. employ
  • “Aboutness”, i.e. containing some elements which represent the correct interpretation
  • Ambiguities arise when matching elements of different granularities
      • Does i contain the interpretation for j, given that some part(s) of i (syntactically/semantically) match j?
      • E.g. Berlin vs. “…we went to the same university, and also, we shared an apartment in Berlin in 2008…”
  • Strictly speaking, ranking is performed after syntactic / semantic matching is done!
Features: What to use to deal with ambiguities?
  • What is meant by “Berlin”? What is the connection between “Berlin” and “Alice”?
  • Content features
      • Frequencies of terms: d is more likely to be “about” a query term k when d mentions k more often (probabilistic IR)
      • Co-occurrences: terms K that often co-occur form a contextual interpretation, i.e. topics (cluster hypothesis)
  • Structure features
      • Consider relevance at the level of fields
      • Link-based popularity
Ranking paradigms
  • Explicit relevance model
      • Foundation: probability ranking principle
      • Ranking results by the posterior probability (odds) of being observed in the relevant class
      • P(w|R) varies in different approaches, e.g. binary independence model, 2-Poisson model, relevance model
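The formula that followed "relevant class" on the slide did not survive in this transcript. One common textbook form of the odds-based ranking is sketched below as an assumption, not as a reconstruction of the slide: N denotes the non-relevant class, D is treated as a sequence of words, and the words are assumed to be independent given the class.

```latex
\frac{P(R \mid D)}{P(N \mid D)}
  = \frac{P(D \mid R)\, P(R)}{P(D \mid N)\, P(N)}
  \;\propto\; \frac{P(D \mid R)}{P(D \mid N)}
  = \prod_{w \in D} \frac{P(w \mid R)}{P(w \mid N)}
```

The prior odds P(R)/P(N) are the same for every document and can be dropped for ranking, so the approaches listed above differ mainly in how they estimate P(w|R).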
Ranking paradigms
  • No explicit notion of relevance: similarity between the query and the document model
      • Vector space model (cosine similarity)
      • Language models (KL divergence)
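Both similarity notions can be sketched in a few lines. The example assumes raw term-count vectors for the cosine and unigram language models with additive smoothing over a fixed vocabulary for the KL divergence; the smoothing scheme and the parameter `mu` are illustrative choices, not something the slide prescribes.

```python
import math
from collections import Counter


def cosine(query, doc):
    """Cosine similarity between raw term-count vectors of two texts."""
    q, d = Counter(query.split()), Counter(doc.split())
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0


def language_model(text, vocab, mu=1.0):
    """Unigram model with additive smoothing so that no term has probability 0."""
    counts = Counter(text.split())
    total = sum(counts[t] for t in vocab) + mu * len(vocab)
    return {t: (counts[t] + mu) / total for t in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p if p[t] > 0)
```

A higher cosine means a better match, whereas a lower KL divergence between the query model and the document model means a better match, so documents are ranked by increasing divergence.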
Model construction
  • How to obtain
      • Relevance models?
      • Weights for query / document terms?
      • Language models for document / queries?
Content-based model construction
  • Document statistics, e.g.
      • Term frequency
      • Document length
  • Collection statistics, e.g.
      • Inverse document frequency
      • Background language models
  • An object is more likely about “Berlin”?
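A minimal sketch of how these statistics can be combined into term weights: term frequency normalized by document length, multiplied by a smoothed inverse document frequency. The toy collection and the particular IDF variant are assumptions chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical toy collection of virtual documents.
DOCS = {
    "d1": "berlin berlin apartment alice",
    "d2": "alice works at kit",
    "d3": "kit is located in germany",
}


def idf(term, docs):
    """Smoothed inverse document frequency (one of several common variants)."""
    df = sum(1 for text in docs.values() if term in text.split())
    return math.log((len(docs) + 1) / (df + 1)) + 1.0


def score(query, doc_id, docs):
    """Sum of length-normalized term frequency times IDF over the query terms."""
    terms = docs[doc_id].split()
    tf = Counter(terms)
    return sum((tf[t] / len(terms)) * idf(t, docs) for t in query.split())


def rank(query, docs):
    """Document ids ordered by decreasing score."""
    return sorted(docs, key=lambda d: score(query, d, docs), reverse=True)
```

In this toy collection d1 ranks first for the query "berlin" because the term makes up half of its tokens and occurs in no other document.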

SemTech 2011 Semantic Search tutorial

  • 1. Peter Mika | Yahoo Research, Spain | pmika@yahoo-inc.com. Thanh Tran | Institute AIFB, KIT, Germany | Tran@aifb.uni-karlsruhe.de. Semantic Search Tutorial: Introduction
  • 2. About the speakers. Peter Mika: Semantic Search group at Yahoo! Barcelona; Semantic Search, Web Object Retrieval, Natural Language Processing. Tran Duc Thanh: Semantic Search group at AIFB; Semantic Search, Semantic Data Management, Linked Data Query Processing.
  • 3. Agenda. Introduction (5 min). Semantic Web data (50 min): the RDF data model; publishing RDF; crawling and indexing RDF data. Query processing (35 min). Ranking (25 min). Result presentation (15 min). Semantic Search evaluation (15 min). Demos (15 min). Questions (5 min).
  • 4. Why Semantic Search? I. “We are at the beginning of search.” (Marissa Mayer). Solved large classes of queries, e.g. navigational. Heavy investment in computational power. Remaining queries are hard, not solvable by brute force, and require a deep understanding of the world and human cognition. Background knowledge and metadata can help to address poorly solved queries.
  • 5. Poorly solved information needs. Ambiguous searches: paris hilton. Long tail queries: george bush (and I mean the beer brewer in Arizona). Multimedia search: paris hilton sexy. Imprecise or overly precise searches: jim hendler; pictures of strong adventures people. Precise searches for descriptions: countries in africa; 32 year old computer scientist living in barcelona; reliable digital camera under 300 dollars. Many of these queries would not be asked by users, who learned over time what search technology can and cannot do.
  • 7. Why Semantic Search? II. The Semantic Web is now a reality: large amounts of data published in RDF; heterogeneous data of varying quality; users who are not skilled in writing complex queries (e.g. SPARQL) and may not be experts in the domain. Searching data instead of or in addition to searching documents: direct answers; novel search tasks.
  • 8. Example: direct answers in search. Information box with content from and links to Yahoo! Travel. Points of interest in Vienna, Austria. Shopping results from Yahoo! Shopping. Since Aug, 2010, ‘regular’ search results are ‘Powered by Bing’.
  • 9. Novel search tasks. Aggregation of search results, e.g. price comparison across websites. Analysis and prediction, e.g. world temperature by 2020. Semantic profiling: recommendations based on particular interests. Semantic log analysis: understanding user behavior in terms of objects. Support for complex tasks, e.g. booking a vacation using a combination of services.
  • 10. Document retrieval and data retrieval. Information Retrieval (IR) supports the retrieval of documents (document retrieval): representation based on lightweight syntax-centric models; works well for topical search; not so well for more complex information needs; Web scale. Database (DB) and Knowledge-based Systems (KB) deliver more precise answers (data retrieval): more expressive models; allow for complex queries; retrieve concrete answers that precisely match queries; not just matching and filtering, but also joins; limitations in scalability.
  • 11. Combination of document and data retrieval. Documents with metadata: metadata may be embedded inside the document; “I’m looking for documents that mention countries in Africa.” Data retrieval: structured data, but searchable text fields; “I’m looking for directors who have directed movies where the synopsis mentions dinosaurs.”
  • 12. Semantic Search. Target: (combination of) document and data retrieval. Semantic search is a retrieval paradigm that exploits the structure/semantics of the data or explicit background knowledge to understand user intent and the meaning of content, and incorporates the intent of the query and the meaning of content into the search process (semantic models). Wide range of semantic search systems: they employ different semantic models, possibly at different steps of the search process and in order to support different tasks.
  • 13. Semantic models. Semantics is concerned with the meaning of the resources made available for search. Various representations of meaning. Linguistic models: models of relationships among words; taxonomies, thesauri, dictionaries of entity names; inference along linguistic relations, e.g. broader/narrower terms; natural language search. Conceptual models: models of relationships among objects; ontologies capture entities in the world and their relationships; inference along domain-specific relations; knowledge-based search. We will focus on conceptual models in this tutorial, in particular the RDF/OWL conceptual model for representing classes in a domain and describing their instances.
  • 14. Semantic Search – a process view. [Diagram labels: Knowledge Representation; Semantic Models; Resources; Documents; Document Representation]
  • 15. Semantic Search systems. For data / document retrieval, semantic search systems might combine a range of techniques, from statistics-based IR methods for ranking and database methods for efficient indexing and query processing up to complex reasoning techniques for making inferences!
  • 17. Semantic Web. Sharing data across the Web. Standard data model: RDF. A number of syntaxes (file formats): RDF/XML, RDFa. Powerful, logic-based languages for schemas: OWL, RIF. Query languages and protocols: HTTP, SPARQL.
  • 18. Resource Description Framework (RDF). Each resource (thing, entity) is identified by a URI: globally unique identifiers; locators of information. Data is broken down into individual facts: triples of (subject, predicate, object). A set of triples (an RDF graph) is published together in an RDF document. [Diagram: an RDF document with the triples example:roi type foaf:Person and example:roi name “Roi Blanco”]
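The triple model on this slide can be illustrated with a minimal Python sketch. It stores an RDF graph as a set of (subject, predicate, object) tuples and looks triples up by pattern. The `EX` base URI and the `match` helper are assumptions made for illustration; only the `example:roi`, `foaf:Person` and “Roi Blanco” values come from the slide.

```python
# Well-known namespaces (RDF and FOAF); EX is an assumed base for "example:".
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"
EX = "http://example.org/"

# An RDF graph is simply a set of (subject, predicate, object) triples.
graph = {
    (EX + "roi", RDF_TYPE, FOAF + "Person"),
    (EX + "roi", FOAF + "name", "Roi Blanco"),
}

def match(graph, s=None, p=None, o=None):
    """Return all triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

# All facts about example:roi
for triple in match(graph, s=EX + "roi"):
    print(triple)
```

A real system would use an RDF library and a triple store; the set of tuples is only meant to show that every fact is an independent, globally identifiable statement.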
  • 19. Linking resources. [Diagram labels: Friend-of-a-Friend ontology; Roi’s homepage; example:roi type foaf:Person; example:roi name “Roi Blanco”; sameAs; Yahoo; type; worksWith; example:roi2; example:peter; email “pmika@yahoo-inc.com”]
  • 20. Publishing RDF. Linked Data: data published as RDF documents linked to other RDF documents; community effort to re-publish large public datasets (e.g. DBpedia, open government data). RDFa: data embedded inside HTML pages; recommended for site owners by Yahoo, Google, Facebook. SPARQL endpoints: triple stores (RDF databases) that can be queried through the web.
  • 21. Linked Data. A web of RDF documents in parallel to the current Web, often implemented as wrappers around databases or APIs. The four rules of Linked Data: (1) Use URIs to identify things. (2) Use HTTP URIs so that these things can be referred to and looked up ("dereferenced") by people and user agents. (3) Provide useful information about the thing when its URI is dereferenced, using standard formats such as RDF/XML. (4) Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web.
  • 22. Linked Data. Advantages: no change to the publishing of the HTML documents; data can be published by a third party (e.g. DBpedia). Disadvantages: web servers need to be configured to properly handle URIs that identify concepts instead of documents; not favored by search engines; lack of use cases; crawling needs to be changed; authority is difficult to determine. Tools: triple stores (Virtuoso, Oracle etc.) and front-ends (Pubby); RDB-to-RDF mappers (e.g. D2RQ, Triplify); validators (Vapour); Linked Data browsers (many).
  • 23. The state of Linked Data. Rapidly growing community effort to (re)publish open datasets as Linked Data, in particular scientific and government datasets; see linkeddata.org. Less commercial interest, real usage.
  • 24. Metadata in HTML. 1995: HTML meta tags. 1996: Simple HTML Ontology Extensions (SHOE). 1998: RDF/XML (RDF/XML in HTML; RDF linked from HTML). 2003: Web 2.0 (tagging; microformats; metadata in Wikipedia; machine tags in Flickr). 2005: eRDF. 2008: RDFa 1.0. 2011: RDFa 1.1. 2012: Microdata?
  • 25. HTML meta tags: <HTML><HEAD profile="http://dublincore.org/documents/dcq-html/"><META name="DC.author" content="Peter Mika"><LINK rel="DC.rights copyright" href="http://www.example.org/rights.html" /> <LINK rel="meta" type="application/rdf+xml" title="FOAF" href= "http://www.cs.vu.nl/~pmika/foaf.rdf"> </HEAD> …</HTML>
  • 26. Microformats (μf). Agreements on the way to encode certain kinds of metadata in HTML: reuse of semantic-bearing HTML elements; based on existing standards; minimality. Microformats exist for a limited set of objects: hCard (persons and organizations), hCalendar (events), hResume, hProduct, hRecipe. Varying degrees of support and stability: hCard and rel-tag are widely supported. Community centered around microformats.org: specifications and discussions are hosted there.
  • 27. Microformats: limitations. No shared syntax: each microformat has a separate syntax tailored to the vocabulary. No formal schemas: limited reuse, extensibility of schemas; unclear which combinations are allowed. No datatypes. No namespaces, unique identifiers (URIs): no interlinking; mapping between instances is required. Always appears in the HTML <body>.
  • 28. Example: the hCard microformat: <div class="vcard"> <a class="email fn" href="mailto:jfriday@host.com">Joe Friday</a> <div class="tel">+1-919-555-7878</div> <div class="title">Area Administrator, Assistant</div> </div> <cite class="vcard"><a class="fn url" rel="friend colleague met" href="http://meyerweb.com/">Eric Meyer</a> </cite> wrote a post (<cite><a href="http://meyerweb.com/eric/thoughts/2005/12/16/tax-relief/">Tax Relief</a></cite>) about an unintentionally humorous letter he received from the <span class="vcard"> <a class="fn org url" href="http://irs.gov/">Internal Revenue Service</a> </span>.
  • 29. RDFa. W3C standard for embedding RDF data in HTML documents: a set of new HTML attributes to be used in head or body; a specification of how to extract the data from these attributes. RDFa is just a syntax; you have to choose a vocabulary separately. RDFa 1.0 has been a W3C Recommendation since October, 2008 (see the RDFa Primer). RDFa 1.1 is a small update on RDFa to make it easier to use: currently Working Draft (March 31, 2011); updated version of the RDFa Primer (April 19, 2011). RDFa API for accessing RDFa data in a webpage in the browser from JavaScript: currently Working Draft (April 19, 2011).
  • 30. RDFa 1.1. Changes: new vocab attribute to define the default namespace for the document or subtree; profile documents to define multiple namespace prefixes; the prefix attribute as a recommended replacement of xmlns; you can use URIs even where only CURIEs were allowed before. RDFa 1.1 is backward compatible with RDFa 1.0. RDFa 1.1 is recommended if you want to use HTML5.
  • 31. When to use RDFa. Choose microformats when you find a microformat that fits your needs and is supported by your consumers. Microformats are the first option because they are simple. Yahoo supports all major microformats, see the documentation. It’s a common misconception that RDFa requires XHTML or that it’s incompatible with HTML5: it’s compatible with HTML4, HTML5 and XHTML. If you find none that perfectly fits your needs, then you need RDFa. Microformats have a fixed schema: you cannot add your own attributes. Example: a social networking site with user profiles. VCard is a good candidate, but, for example, it doesn’t have a way to express the user’s social connections. You either live without this, or go with RDFa.
  • 32. Example: Facebook’s Like and the Open Graph Protocol. The ‘Like’ button provides publishers with a way to promote their content on Facebook and build communities: shows up in profiles and news feed; site owners can later reach users who have liked an object; Facebook Graph API allows 3rd party developers to access the data. Open Graph Protocol is an RDFa-based format for describing the object that the user ‘Likes’.
  • 33. Example: Facebook’s Open Graph Protocol. RDF vocabulary to be used in conjunction with RDFa. Simplifies the work of developers by restricting the freedom in RDFa. Activities, Businesses, Groups, Organizations, People, Places, Products and Entertainment. Only HTML <head> accepted. <html xmlns:og="http://opengraphprotocol.org/schema/"> <head> <title>The Rock (1996)</title> <meta property="og:title" content="The Rock" /> <meta property="og:type" content="movie" /> <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" /> <meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" /> …</head> ...
  • 34. Example: Yahoo! Enhanced Results (was: SearchMonkey). Guide for publishers to mark up their pages for common types of objects: Product, Local, News, Video, Events, Documents, Discussion, Games. Using popular microformats and RDF vocabularies. Copy-paste code. Validator. Yahoo as a consumer: see later.
  • 35. Example: Google’s Rich Snippets. Google accepts popular microformats and its own RDFa vocabulary: similar approach to RDFa as Facebook. Validator to check if the markup is correct. Google displays enhanced results based on this metadata (Rich Snippets).
  • 36. RDFa on the rise. 510% increase between March, 2009 and October, 2010. [Chart: percentage of URLs with embedded metadata in various formats]
  • 37. Microdata. Currently under standardization at the W3C: originally part of the HTML5 spec, but now a separate document. Similar to microformats, but with the extensibility of RDFa: introduce new terms using reverse domain names or full URIs. HTML5 also has a number of “semantic” elements such as <time>, <video>, <article>…
  • 38. Microdata example: <div itemscope itemid="http://www.yahoo.com/resource/person"> <p>My name is <span itemprop="name">Neil</span>.</p> <p>My band is called <span itemprop="band">Four Parts Water</span>. I was born on <time itemprop="birthday" datetime="2009-05-10">May 10th 2009</time>. <img itemprop="image" src="me.png" alt="me"> </p></div>
  • 39. The state of metadata in HTML. 5-10% of webpages contain some explicit metadata, depending on how you count… Too many competing approaches. Too many formats: microformats vs RDFa vs Microdata. When using RDFa, publishers may need to use multiple different vocabularies to satisfy everyone.
  • 40. Crawling the Semantic Web. Linked Data: similar to HTML crawling, but the crawler needs to parse RDF/XML (and others) to extract URIs to be crawled; Semantic Sitemap/VOID descriptions. RDFa: same as HTML crawling, but data is extracted after crawling; Mika et al. Investigating the Semantic Gap through Query Log Analysis, ISWC 2010. SPARQL endpoints: endpoints are not linked, need to be discovered by other means; Semantic Sitemap/VOID descriptions.
  • 41. Data fusion. Ontology matching: widely studied in Semantic Web research, see e.g. list of publications at ontologymatching.org; unfortunately, not much of it is applicable in a Web context due to the quality of ontologies. Entity resolution: logic-based approaches in the Semantic Web; studied as record linkage in the database literature; machine learning based approaches, focusing on attributes; graph-based approaches, see e.g. the work of Lise Getoor, are applicable to RDF data (improvements over only attribute-based matching). Blending: merging objects that represent the same real world entity and reconciling information from multiple sources.
  • 42. Data quality assessment and curation. Heterogeneity, quality of data is an even larger issue: quality ranges from well-curated data sets (e.g. Freebase) to microformats; in the worst of cases, the data becomes a graph of words; short amounts of text are prone to mistakes in data entry or extraction (example: mistake in a phone number or state code). Quality assessment and data curation: quality varies from data created by experts to user-generated content; automated data validation (against known-good data or using triangulation; validation against the ontology or using probabilistic models); data validation by trained professionals or crowdsourcing (sampling data for evaluation); curation based on user feedback.
  • 43. Indexing. Search requires matching and ranking: matching selects a subset of the elements to be scored. The goal of indexing is to speed up matching: retrieval needs to be performed in milliseconds; without an index, retrieval would require streaming through the collection. The type of index depends on the query model to support: DB-style indexing; IR-style indexing.
  • 44. IR-style indexing. Index data as text: create virtual documents from data; one virtual document per subgraph, resource or triple (typically: resource). Key differences to text retrieval: RDF data is structured; minimally, queries on property values are required.
  • 45. Horizontal index structure. Two fields (indices): one for terms, one for properties. For each term, store the property on the same position in the property index. Positions are required even without phrase queries. Query engine needs to support the alignment operator. ✓ Dictionary is number of unique terms + number of properties. Occurrences is number of tokens * 2.
  • 46. Vertical index structure. One field (index) per property. Positions are not required, but useful for phrase queries. Query engine needs to support fields. Dictionary is number of unique terms. Occurrences is number of tokens. ✗ Number of fields is a problem for merging, query performance.
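The difference between the horizontal and the vertical layout can be sketched in a few lines of Python. This is an illustration under assumptions, not the indexing code used in the tutorial: the virtual document `doc`, the property names, and the helper functions are all hypothetical.

```python
from collections import defaultdict

# Hypothetical virtual document for one resource: property -> text value.
doc = {"foaf:name": "Roi Blanco", "ex:worksFor": "Yahoo Research"}

def horizontal(doc):
    """Two parallel fields: tokens[i] was found under properties[i]."""
    tokens, properties = [], []
    for prop, value in doc.items():
        for tok in value.lower().split():
            tokens.append(tok)
            properties.append(prop)
    return tokens, properties

def vertical(doc):
    """One field per property, each holding only that property's tokens."""
    fields = defaultdict(list)
    for prop, value in doc.items():
        fields[prop].extend(value.lower().split())
    return dict(fields)

def horizontal_match(tokens, properties, term, prop):
    """Alignment operator: term and property must share a position."""
    return any(t == term and p == prop for t, p in zip(tokens, properties))

tokens, properties = horizontal(doc)
print(tokens)       # term field
print(properties)   # property field, aligned by position
print(vertical(doc))
```

The horizontal layout stores every token twice (once per field) but keeps the number of fields constant; the vertical layout stores each token once but needs one field per property, which is the scaling problem the slide marks with ✗.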
  • 47. Indexing using MapReduce. MapReduce is the perfect model for building inverted indices: Map creates (term, {doc1}) pairs; Reduce collects all docs for the same term: (term, {doc1, doc2…}). Sub-indices are merged separately: term-partitioned indices. Peter Mika. Distributed Indexing for Semantic Search, SemSearch 2010.
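The map and reduce steps named on this slide can be simulated in a single Python process. The sketch below is an assumption-laden illustration, not the distributed implementation from the cited paper: the `docs` collection is made up, and the shuffle step that a MapReduce framework performs between the two phases is replaced by an in-memory dictionary.

```python
from collections import defaultdict

# Hypothetical document collection: doc id -> text.
docs = {"doc1": "shared apartment Berlin", "doc2": "Berlin Alice"}

def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per distinct term in the document."""
    for term in set(text.lower().split()):
        yield term, doc_id

def reduce_phase(term, doc_ids):
    """Collect all documents for one term into a sorted posting list."""
    return term, sorted(doc_ids)

def build_index(docs):
    grouped = defaultdict(list)          # stands in for the shuffle step
    for doc_id, text in docs.items():
        for term, d in map_phase(doc_id, text):
            grouped[term].append(d)
    return dict(reduce_phase(t, ds) for t, ds in grouped.items())

index = build_index(docs)
print(index["berlin"])
```

In a real deployment each reducer would write the sub-index for its own range of terms, which yields the term-partitioned indices mentioned on the slide.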
  • 49. Structure. Taxonomy of retrieval approaches. Query processing for semantic search: types of semantic data; formalisms for querying semantic data. Approaches: general task: hybrid graph pattern matching; matching keyword query against text; matching structured query against structured data; matching keyword query against structured data; matching structured query against text (a hybrid case). Main tasks, challenges and opportunities.
  • 50. Taxonomy of retrieval approaches (1). Data retrieval problem: a collection of resources represented by the data model G; information needs expressed as queries in Q; retrieval is the task of efficiently computing results from G that are relevant to the queries in Q. Document retrieval vs. data retrieval: differences in query and data representation and matching; efficiently retrieve structured data that exactly match formal information needs expressed as structured queries; effectively rank textual results that match ambiguous NL / keyword queries to a certain degree (notions of relevance).
  • 51. Taxonomy of retrieval approaches (2). [Diagram labels: Exact; Complete; Sound; Query; Matching; Approximate]
  • 56. [Diagram labels: Top-k; Data] Query processing mainly focuses on efficiency of matching, whereas ranking deals with degree of matching (relevance)!
  • 57. Query processing for Semantic Search (1). The underlying collection of resources is represented by semantic data G ranging from: structured data with well-defined schemas; semi-structured data with incomplete or no schemas; data that largely comprise text; hybrid / embedded data. Targeted information needs Q are of varying complexity, captured using different formalisms and querying paradigms: natural language texts and keywords; form-based inputs; formal structured queries. Semantic search, mainly, is the task of efficiently computing results (query processing) from G that are relevant to the queries in Q (ranking).
  • 58. Query processing for Semantic Search (2). Query: keywords; NL questions; form- / facet-based inputs; structured queries (SPARQL). Matching. Data: OWL ontologies with rich, formal semantics; structured RDF data; semi-structured RDF data; RDF data embedded in text (RDFa).
  • 59. Query processing for Semantic Search (3). [Matrix of unstructured / structured query against textual / structured data] Keyword query on textual data (IR/document retrieval). Structured query on textual data. Keyword query on structured data. Structured query on structured data (DB/data retrieval). Semantic Search targets different groups of users, information needs, and types of data. Query processing for semantic search is a hybrid combination of techniques!
  • 60. Types of data models (1). Textual: bag-of-words. Represents documents, text in structured data, …, real-world objects (captured as structured data). Lacks “structure” in text, e.g. linguistic structure, hyperlinks, (positional information). Structure in structured data representation. [Figure: a text mapped to terms (statistics).] Text: “In combination with Cloud Computing technologies, promising solutions for the management of `big data' have emerged. Existing industry solutions are able to support complex queries and analytics tasks with terabytes of data. For example, using a Greenplum.” Terms: combination; Cloud; Computing; Technologies; solutions; management; `big data'; industry; solutions; support; complex; ……
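As a small illustration of the bag-of-words model, the following Python sketch turns a text into term statistics. The tokenization rule (lowercase runs of letters and digits) is an assumption; the slide does not say how terms were extracted, and it appears to drop stop words, which this sketch does not.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Map a text to term -> frequency, discarding order and structure."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

bag = bag_of_words(
    "Existing industry solutions are able to support complex queries. "
    "Promising solutions have emerged."
)
print(bag.most_common(3))
```

Everything except the term counts is lost: word order, sentence boundaries, and any links between the text and structured data.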
  • 61. Types of data models (2). Textual. Structured: Resource Description Framework (RDF). Represents real-world objects, services, applications, …, documents. Resource attribute values and relationships between resources. Schema. [Diagram labels: Picture; creator; Person; Bob]
  • 62. Types of data models (3). Textual. Structured. Hybrid: RDF data embedded in text (RDFa).
  • 63. Types of data models – RDFa (1). …<div about="/alice/posts/trouble_with_bob"> <h2 property="dc:title">The trouble with Bob</h2> <h3 property="dc:creator">Alice</h3> Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do: <div about="http://example.com/bob/photos/sunset.jpg"> <img src="http://example.com/bob/photos/sunset.jpg" /> <span property="dc:title">Beautiful Sunset</span> by <span property="dc:creator">Bob</span>. </div> </div>… Adapted from: http://www.w3.org/TR/xhtml-rdfa-primer/
  • 64. Types of semantic data – RDFa (2). [Graph figure: the text “Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do:” is attached to the data graph through content edges.] Adapted from: http://www.w3.org/TR/xhtml-rdfa-primer/
  • 65. Types of semantic data – conclusion. Semantic data in general can be conceived as a graph with text and structured data items as nodes, and edges that represent different types of relationships, including explicit semantic relationships and vaguely specified ones such as hyperlinks!
  • 66. Formalisms for querying semantic data (1). Example information need: “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.” Unstructured queries. Fully-structured queries. Hybrid queries: unstructured + structured.
  • 67. Formalisms for querying semantic data (2). Example information need: “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.” Unstructured: NL; keywords: apartment, Berlin, Alice, shared.
  • 68. Formalisms for querying semantic data (3). Example information need: “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.” Unstructured. Fully-structured: SPARQL (BGP, filter, optional, union, select, construct, ask, describe). PREFIX ns: <http://example.org/ns#> SELECT ?x WHERE { ?x ns:knows ?y. ?y ns:name "Alice". ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT" }
  • 69. Formalisms for querying semantic data (4). Fully-structured. Unstructured. Hybrid: content and structure constraints. Keywords: “shared apartment Berlin Alice”. Structured part: ?x ns:knows ?y. ?y ns:name "Alice". ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT"
  • 70. Formalisms for querying semantic data (5). Fully-structured. Unstructured. Hybrid: content and structure constraints. Keywords: “shared apartment Berlin Alice”. Structured part: ?x ns:knows ?y. ?y ns:name "Alice". ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT"
  • 71. Formalisms for querying semantic data – conclusion. Semantic search queries can be conceived as graph patterns with nodes referring to text and structured data items, and edges referring to relationships between these items!
  • 72. Processing hybrid graph patterns (1). Example information need: “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.” Query parts: ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT"; keywords: apartment shared Berlin Alice; ?y ns:name "Alice". ?x ns:knows ?y. [Data graph figure. Nodes: Alice, Bob, Thanh, Peter, KIT, FluidOps, Germany, Semantic Search, trouble with bob, sunset.jpg, Beautiful Sunset, 34, 2009, and the text “Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do:”. Edge labels: age, works, author, title, creator, knows, year, located.]
  • 73. Processing hybrid graph patterns (2). Matching hybrid graph patterns against data
  • 74. Matching keyword query against text. Retrieve documents
  • 75. Inverted list (inverted index): keyword → {<doc1, pos, score, ...>, <doc2, pos, score, ...>, ...}. AND-semantics: top-k join. [Diagram: the posting lists for shared, Berlin and Alice each contain D1 and are joined on equal document IDs]
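AND-semantics over inverted lists amounts to intersecting sorted posting lists. The Python sketch below shows that join in its simplest form; it leaves out positions, scores and the top-k early termination mentioned on the slide, and the `index` contents are made-up document IDs.

```python
def intersect(a, b):
    """Merge-intersect two sorted posting lists of document IDs."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def and_query(index, keywords):
    """Documents containing every keyword; shortest lists are joined first."""
    lists = sorted((index.get(k, []) for k in keywords), key=len)
    if not lists:
        return []
    result = lists[0]
    for postings in lists[1:]:
        result = intersect(result, postings)
    return result

# Hypothetical inverted index: keyword -> sorted list of document IDs.
index = {"shared": [1, 4], "berlin": [1, 2, 4], "alice": [1, 3]}
print(and_query(index, ["shared", "berlin", "alice"]))
```

Because both inputs are sorted, `intersect` is a linear merge join: it touches each posting at most once and never needs to buffer a whole list.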
  • 76. Matching structured query against structured data. Retrieve data for triple patterns
  • 78. Multiple “redundant” indexes to cover different access patterns
  • 80. Blocking, e.g. linear merge join (requires sorted input)
  • 82. Materialized join indexes. Query: ?x ns:knows ?y. ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT". [Diagram: the pattern Per1 ns:works ?v is answered from the SP-index and ?v ns:name "KIT" from the PO-index; the results Per1 ns:works Ins1 and Ins1 ns:name KIT are joined on Ins1]
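How triple patterns are answered from an index and then joined can be sketched as follows. This is a minimal in-memory illustration using only a (predicate, object) index and an index nested-loop join; it is not a materialized join index. `Per1`, `Ins1` and `KIT` follow the slide, while `Per2`, `Ins2` and `AIFB` are made-up filler data.

```python
from collections import defaultdict

triples = [
    ("Per1", "ns:works", "Ins1"),
    ("Ins1", "ns:name", "KIT"),
    ("Per2", "ns:works", "Ins2"),   # hypothetical extra data
    ("Ins2", "ns:name", "AIFB"),    # hypothetical extra data
]

def build_index(triples, key_positions):
    """Index triples by the components at key_positions, e.g. (1, 2) = PO."""
    index = defaultdict(list)
    for t in triples:
        index[tuple(t[i] for i in key_positions)].append(t)
    return index

po_index = build_index(triples, (1, 2))   # (predicate, object) -> triples

def who_works_at(name):
    """Evaluate: ?z ns:works ?v . ?v ns:name name  (joined on ?v)."""
    results = []
    for v, _, _ in po_index.get(("ns:name", name), []):
        for z, _, _ in po_index.get(("ns:works", v), []):
            results.append((z, v))
    return results

print(who_works_at("KIT"))
```

A triple store keeps several such indexes (SP, PO, and others) so that every access pattern has one that fits; a materialized join index would go one step further and store the joined pairs themselves.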
  • 83. Matching keyword query against structured data. Retrieve keyword elements
  • 84. Using inverted index: keyword → {<el1, score, ...>, <el2, score, ...>, …}. Exploration / “Join”
  • 85. Data indexes for triple lookup
  • 87. Top-k Steiner tree search, top-k subgraph exploration. [Diagram: the keyword elements Alice, Bob and KIT are connected pairwise; resulting triples: Alice ns:knows Bob; Bob ns:works Inst1; Inst1 ns:name KIT]
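The exploration step can be illustrated with a deliberately simplified Python sketch: a breadth-first search that connects two keyword elements through the data graph. This is an assumption-heavy stand-in, since real systems search for top-k Steiner trees or subgraphs over many keyword elements with scores; only the three triples come from the slide.

```python
from collections import deque

edges = [
    ("Alice", "ns:knows", "Bob"),
    ("Bob", "ns:works", "Inst1"),
    ("Inst1", "ns:name", "KIT"),
]

def neighbors(node):
    """Yield (adjacent node, connecting triple), ignoring edge direction."""
    for s, p, o in edges:
        if s == node:
            yield o, (s, p, o)
        if o == node:
            yield s, (s, p, o)

def connect(start, goal):
    """Return the triples on a shortest path from start to goal, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for nxt, triple in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [triple]))
    return None

print(connect("Alice", "KIT"))
```

The returned path is one candidate answer subgraph for the keywords “Alice” and “KIT”; ranking several such subgraphs is what makes the search top-k.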
  • 88. Matching structured query against text. Based on offline IE (for offline IE, see Peter’s slides)
  • 89. Based on online IE, i.e., “retrieve” is as follows
  • 90. Derive keywords to retrieve relevant documents
  • 91. On-the-fly information extraction, i.e., phrase pattern matching “X title Y”
  • 92. Retrieve extracted data for structured part
  • 93. Retrieve documents for derived text patterns, e.g. sequence, windows, reg. exp.
  • 94. Index
  • 95. Inverted index for document retrieval and pattern matching
  • 96. Join index → inverted index for storing materialized joins between keywords
  • 97. Neighborhood indexes for phrase patterns. Query: ?x ns:knows ?y. ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT". [Diagram labels: KIT; name; knows; name; KIT; hybrid case]
  • 98. Query processing – main tasks. Retrieval: documents, data elements, triples, paths, graphs; inverted index, …, but also other indexes (B+ tree); index documents, triples, materialized join paths. Join: different join implementations, efficiency depends on availability of indexes; non-blocking join is good for early result reporting and for the “unpredictable” linked data scenario. [Diagram labels: Query; Matching; Data]
  • 99. Query processing – more tasks. Disjunction, aggregation, grouping. Join order optimization. Approximate: approximate the search space; approximate the results (matching, join). Parallelization. Top-k: use only some entries in the input streams to produce k results. Multiple sources: federation, routing; on-the-fly mapping, similarity join. Hybrid: join text and data. [Diagram labels: Query; Matching; Data]
  • 100. Query processing on the Web – research challenges and opportunities. Large amount of semantic data. Data inconsistent, redundant, and low quality. Large amount of data embedded in text. Large amount of sources. Large amount of links between sources. Optimization parallelization,
  • 102. Hybrid querying and data management
  • 106. Structure. Problem definition. Types of ambiguities. Ranking paradigms. Model construction: content-based; structure-based.
  • 107. Ranking – problem definition. [Diagram label: Query] Ambiguities arise when representation is incomplete / imprecise
  • 108. Ambiguities at the level of
  • 110. structure between elements (structure ambiguity). [Diagram labels: Matching; Data] Due to ambiguities in the representation of the information needs and the underlying resources, the results cannot be guaranteed to exactly match the query. Ranking is the problem of determining the degree of matching using some notions of relevance.
  • 111. Content ambiguity. Query parts: ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT"; keywords: apartment shared Berlin Alice; ?y ns:name "Alice". ?x ns:knows ?y. [Data graph figure. Nodes: Alice, Bob, Thanh, Peter, KIT, FluidOps, Germany, Semantic Search, trouble with bob, sunset.jpg, Beautiful Sunset, 34, 2009, and the text “Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do:”. Edge labels: age, works, author, title, creator, knows, year, located.] What is meant by “Berlin” in the query? What is meant by “Berlin” in the data? A city with the name Berlin? A person? What is meant by “KIT” in the query? What is meant by “KIT” in the data? A research group? A university? A location?
  • 112. Structure ambiguity. Query parts: ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT"; keywords: apartment shared Berlin Alice; ?y ns:name "Alice". ?x ns:knows ?y. [Data graph figure. Nodes: Alice, Bob, Thanh, Peter, KIT, FluidOps, Germany, Semantic Search, trouble with bob, sunset.jpg, Beautiful Sunset, 34, 2009, and the text “Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do:”. Edge labels: age, works, author, title, creator, knows, year, located.] What is the connection between “Berlin” and “Alice”? Friend? Co-worker? What is meant by “works”? Works at? Employed?
  • 113. Ambiguity. Recall: query processing is matching at the level of syntax and semantics. Ambiguities arise when data or query allow for multiple interpretations, i.e. multiple matches: syntactic, e.g. works vs. works at; semantic, e.g. works vs. employ. “Aboutness”, i.e., contain some elements which represent the correct interpretation: ambiguities arise when matching elements of different granularities. Does i contain the interpretation for j, given that some part(s) of i (syntactically/semantically) match j? E.g. Berlin vs. “…we went to the same university, and also, we shared an apartment in Berlin in 2008…”. Strictly speaking, ranking is performed after syntactic / semantic matching is done!
  • 114. Features: what to use to deal with ambiguities? What is meant by “Berlin”? What is the connection between “Berlin” and “Alice”? Content features: frequencies of terms: d is more likely to be “about” a query term k when d mentions k more often (probabilistic IR); co-occurrences: terms K that often co-occur form a contextual interpretation, i.e., topics (cluster hypothesis). Structure features: consider relevance at the level of fields; link-based popularity.
  • 115. Ranking paradigms. Explicit relevance model. Foundation: probability ranking principle. Ranking results by the posterior probability (odds) of being observed in the relevant class. P(w|R) varies in different approaches, e.g., binary independence model, 2-Poisson model, relevance model.
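The formula that accompanied this slide is not part of the transcript. One standard way to write the odds used under the probability ranking principle, assuming that the terms w of a document D occur independently and writing N for the non-relevant class (both assumptions, not taken from the slide), is:

```latex
\frac{P(D \mid R)}{P(D \mid N)} \;=\; \prod_{w \in D} \frac{P(w \mid R)}{P(w \mid N)}
```

Documents are ranked by decreasing value of this ratio; the approaches listed on the slide differ in how they estimate P(w|R).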
  • 116. Ranking paradigms. No explicit notion of relevance: similarity between the query and the document model. Vector space model (cosine similarity). Language models (KL divergence).
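Both similarity measures named on this slide are short enough to sketch in Python. The term-frequency vectors, the add-`mu` smoothing in `language_model`, and the example texts are assumptions chosen for illustration; the slide does not prescribe a weighting or smoothing scheme.

```python
import math
from collections import Counter

def cosine(q, d):
    """Cosine similarity between two term-frequency vectors (Counters)."""
    dot = sum(q[t] * d[t] for t in q)   # Counter returns 0 for missing terms
    norm = math.sqrt(sum(v * v for v in q.values())) * \
           math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

def language_model(counts, vocab, mu=1.0):
    """Unigram model with add-mu smoothing (the smoothing is an assumption)."""
    total = sum(counts.values()) + mu * len(vocab)
    return {t: (counts[t] + mu) / total for t in vocab}

def kl_divergence(p, q):
    """KL(p || q); q must be non-zero wherever p is non-zero."""
    return sum(pv * math.log(pv / q[t]) for t, pv in p.items() if pv > 0)

query = Counter("berlin alice".split())
doc = Counter("berlin berlin apartment".split())
vocab = set(query) | set(doc)

print(cosine(query, doc))
print(kl_divergence(language_model(query, vocab), language_model(doc, vocab)))
```

A higher cosine means a better match, whereas a lower KL divergence between the query model and the document model means a better match, so the two scores are sorted in opposite directions.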
  • 117. Model construction. How to obtain: relevance models? Weights for query / document terms? Language models for documents / queries?
  • 118. Content-based model construction. Document statistics, e.g. term frequency, document length. Collection statistics, e.g. inverse document frequency, background language models. An object is more likely about “Berlin”?
  • 119. When it contains a relatively high number of mentions of the term “Berlin”
  • 120. When the number of mentions of this term in the overall collection is relatively low. Structure-based model construction. Consider structure of objects during content-based modeling, i.e., to obtain a structured content-based model. Content-based model for structured objects, documents and for general tuples. An object is more likely about “Berlin”?
  • 121. When one of its (important) fields contains a relatively high number of mentions of the term “Berlin”. Structure-based model construction. PageRank: link analysis algorithm; measuring relative importance of nodes; a link counts as a vote of support; the PageRank of a node recursively depends on the number and PageRank of all nodes that link to it (incoming links). ObjectRank: types and semantics of links vary in the structured data setting; an authority transfer schema graph specifies connection strengths; recursively compute the authority transfer data graph. An object about “Berlin” is more important than another?
  • 122. When a relatively large number of objects are linked to it. Taxonomy of ranking approaches: explicitly vs. non-explicitly relevance-based; content-based ranking; structure-based ranking; content-and-structure-based ranking.
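The recursive definition of PageRank given under structure-based model construction can be made concrete with a short power-iteration sketch in Python. The damping factor of 0.85, the fixed iteration count and the three-node example graph are assumptions; ObjectRank would additionally weight each link by the authority transfer rate of its edge type, which this sketch does not do.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a graph given as node -> list of linked nodes."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = links.get(v, [])
            if targets:
                # Each outgoing link passes on an equal share of v's rank.
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[v] / n
                for t in nodes:
                    new[t] += share
        rank = new
    return rank

# Hypothetical link graph: a and b both link to c, c links back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))
```

In this example c collects votes from both a and b and ends up most important, a inherits importance from c, and b, which nobody links to, keeps only the baseline share.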
  • 124. Search interfaceInput and output functionalityhelping the user to formulate complex queriespresenting the results in an intelligent mannerSemantic Search brings improvements inQuery formulationSnippet generationAdaptive and interactive presentationPresentation adapts to the kind of query and results presentedObject results can be actionable, e.g. buy this productAggregated searchGrouping similar items, summarizing results in various waysFiltering (facets), possibly across different dimensionsTask completionHelp the user to fulfill the task by placing the query in a task context
  • 125. Query interpretation“Snap-to-grid”: find the most likely interpretation of the query given the ontology or a summary of the dataSee Query Processing Display the system’s interpretation of the user queryOffer one or more interpretations, possibly while the user is typing
  • 127. Example: TrueKnowledgeQ: “How many people live in Shanghai?”I: What is the population of Shanghai (Shanghainese: Zånhae), the metropolis in eastern China and a direct-controlled municipality of the People's Republic of China? A: The population of Shanghai on November 7th 2010 is approximately 19,300,389. (Extrapolated from a population of 18,884,600 in 2008 and a population of 19,210,000 on June 6th 2010.)
  • 128. Snippet generation using metadata. Yahoo displays enriched search results for pages that contain microformat or RDFa markup using recognized ontologies: displaying data, images, video. Example: GoodRelations for products. Enhanced results also appear for sites from which we extract information ourselves. Also used for generating facets that can be used to restrict search results by object type. Example: “Shopping sites” facet for products. Documentation and validator for developers at http://developer.search.yahoo.com. Formerly: SearchMonkey allowed developers to customize the result presentation and create new ones for any object type.
  • 129. Example: Yahoo! Enhanced Results. Enhanced result with deep links, rating, address.
  • 130. Automated snippet summarization. Generate search result snippets given a query and a search result. Penin et al. Snippet Generation for Semantic Web Search Engines, ASWC 2010. Search results are ontologies.
  • 131. Example: Facets in Yahoo! Search. Click to restrict results to shopping sites.
  • 132. Example: Yahoo! Vertical Intent Search. Related actors and movies.
  • 133. Adaptive presentation: semantic bookmarking. Extract objects from pages tagged/bookmarked by a user. Visualize the extracted objects: tabular display; sorting on attributes; map. Tracking changes in data: “Alert me when the price drops below…”. Prototype: house search application; Delicious profiles; extracting housing data from popular Spanish real-estate sites.
  • 135. Interactive presentation: Time Explorer. Deliverable of the LivingKnowledge European Project; not a Yahoo product. http://fbmya01.barcelonamedia.org:8080/future/ Won the HCIR 2010 challenge. Tool for understanding current news stories: What are the events that led to a particular situation? What are the important entities for a given topic (people, places, dates, etc.)? What entities are important at a given time? How do their relationships change? What are the predictions made about a given topic?
  • 136. Interactive presentation: Time Explorer. Technology: Named Entity Recognition (persons, organizations); temporal expression mining; inverted (sentence and document) index; forward index (archive) for retrieving relevant entities; ranking of both documents and relevant entities. Display: two synchronized timelines showing relevant documents and the volume of documents; entity relationships; sentiments (future work).
  • 139. Semantic Search evaluation at SemSearch 2010/2011. Started at SemSearch 2010. Two tasks. Entity Search: queries where the user is looking for a single real world object (Pound et al. Ad-hoc Object Retrieval in the Web of Data, WWW 2010). List search (new in 2011): queries where the user is looking for a class of objects. Billion Triples Challenge 2009 dataset. Evaluated using Amazon’s Mechanical Turk; see Halpin et al. Evaluating Ad-Hoc Object Retrieval, IWEST 2010. Prizes sponsored by Yahoo! Labs (2011).
  • 140. Entity Search Track. Entity Search: retrieval of data related to a single entity. Queries: selected from the Search Query Tiny Sample v1.0 dataset, provided by the Yahoo! Webscope program; real web search queries sampled from the US query log of January, 2009; queries asked by at least three different users and with long number sequences removed (privacy reasons); 50 selected queries that name an entity explicitly (but may also provide context). Last year: same type of queries, but a mix of Microsoft and Yahoo! logs.
  • 141. List query track. List queries: queries that describe a set of entities. The answer is a closed set: relatively small number of possible answers; the answer is not likely to change. Hand-picked but not hand-written. Yahoo! Search logs: queries from the Tiny Sample v1.0 dataset; queries with clicks on Wikipedia. TrueKnowledge: recent queries.
  • 142. Data set. Same as Billion Triples Challenge 2009 data set. Blank nodes are encoded as URIs. A data set combining crawls of multiple semantic search engines: doesn’t necessarily match the current state of the Web; doesn’t necessarily match the coverage of any particular search engine. Final dataset
  • 143. Collecting the results. Submissions via semsearch.yahoo.com: NTNU, IIIT Hyderabad, DERI, U of Delaware, DA-IICT; max. 3 submissions per team per track. Pooling of results: top 20 results are evaluated. Despite validation, still problems, e.g. N-Triples encoded URIs, lowercased URIs. Collecting triples for each result: all triples where the URI is the subject; discarded URIs that didn’t appear as subject. Rendering result display: values are clipped at 300 chars (last # or / for object-properties); RDF built-ins shown first; preference to English language values.
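The pooling step on slide 143 can be illustrated with a small sketch. It assumes each submitted run is a mapping from a query id to a ranked list of URIs; the function name `pool_results`, the run names, and the example URIs are hypothetical and not taken from the evaluation tooling.

```python
from collections import defaultdict


def pool_results(runs, depth=20):
    """Collect, per query, the union of the top `depth` results of every run.

    `runs` maps a (hypothetical) run name to {query_id: [uri, ...]},
    where each list is in rank order.
    """
    pool = defaultdict(set)
    for ranking_by_query in runs.values():
        for query_id, ranked_uris in ranking_by_query.items():
            # Only the head of each ranking enters the assessment pool.
            pool[query_id].update(ranked_uris[:depth])
    # Sort for a stable, reproducible order of items to be judged.
    return {query_id: sorted(uris) for query_id, uris in pool.items()}


# Made-up runs for one query.
runs = {
    "run-a": {"q1": ["http://example.org/x", "http://example.org/y"]},
    "run-b": {"q1": ["http://example.org/y", "http://example.org/z"]},
}
pooled = pool_results(runs, depth=20)
```

Because a result returned by several runs enters the pool only once, the number of judgments grows more slowly than the number of submissions.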
  • 145. Assessment with Amazon Mechanical Turk. Evaluation using non-expert judges. Paid $0.2 per 12 results; typically done in 1-2 minutes; $6-$12 an hour. Sponsored by the European SEALS project. Each result is evaluated by 5 workers. Workers are free to choose how many tasks they do, which makes agreement difficult to compute. Number of tasks completed per worker (2010)
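The hourly figures on slide 145 follow directly from the payment per task and the time per task. The short sketch below reproduces that arithmetic; the function names are illustrative, and the cost per judged result is derived here only from the slide's numbers (5 workers, 12 results per task), not quoted from the source.

```python
def hourly_rate(pay_per_task, minutes_per_task):
    """Dollars earned per hour when one task takes `minutes_per_task` minutes."""
    return pay_per_task * 60 / minutes_per_task


def cost_per_result(pay_per_task, results_per_task, workers_per_result):
    """Dollars spent to have one result judged by all of its workers."""
    return pay_per_task / results_per_task * workers_per_result


PAY_PER_TASK = 0.20      # dollars per task, from the slide
RESULTS_PER_TASK = 12    # results judged in one task, from the slide
WORKERS_PER_RESULT = 5   # judgments collected per result, from the slide

slow_worker = hourly_rate(PAY_PER_TASK, 2)   # a task every 2 minutes
fast_worker = hourly_rate(PAY_PER_TASK, 1)   # a task every minute
per_result = cost_per_result(PAY_PER_TASK, RESULTS_PER_TASK, WORKERS_PER_RESULT)
```

A worker finishing a task in two minutes earns about $6 an hour and one finishing in a minute about $12, which matches the range on the slide.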
  • 148. Catching the bad guys. Payment can be rejected for workers who try to game the system; an explanation is commonly expected, though cheaters rarely complain. We opted to mix control questions into the real results: gold-win cases that are known to be perfect; gold-lose cases that are known to be bad. Metrics: avg. and std. dev. on gold-win and gold-lose results; time to complete.
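A minimal sketch of how the control questions on slide 148 could be used, assuming each judgment is an (item id, numeric score) pair and the known-good and known-bad control items are labelled "win" and "lose". The function names, the score scale, and the thresholds `min_gap` and `min_seconds` are illustrative assumptions, not values from the evaluation.

```python
from statistics import mean, pstdev


def control_summary(judgments, gold_labels):
    """Average and std. dev. of one worker's scores on the control items.

    `judgments` is a list of (item_id, score); `gold_labels` maps the
    control item ids to "win" (known perfect) or "lose" (known bad).
    """
    scores = {"win": [], "lose": []}
    for item_id, score in judgments:
        label = gold_labels.get(item_id)
        if label in scores:
            scores[label].append(score)
    return {
        label: (mean(values), pstdev(values)) if values else None
        for label, values in scores.items()
    }


def looks_suspicious(summary, seconds_taken, min_gap=1.0, min_seconds=30):
    """Flag a worker who rushes or does not separate good from bad items.

    Both thresholds are hypothetical and would need tuning on real data.
    """
    win, lose = summary["win"], summary["lose"]
    if win is None or lose is None:
        return False  # no control evidence for this worker
    too_fast = seconds_taken < min_seconds
    no_separation = (win[0] - lose[0]) < min_gap
    return too_fast or no_separation


gold = {"g1": "win", "g2": "lose"}
careful = control_summary([("g1", 3), ("r1", 2), ("g2", 1)], gold)
careless = control_summary([("g1", 2), ("r1", 2), ("g2", 2)], gold)
```

Here `looks_suspicious(careful, 90)` would not flag the worker, while `looks_suspicious(careless, 90)` would, because the careless worker gives the same score to the perfect and the bad control item.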
  • 149. Lessons learned. Finding complex queries is not easy: query logs from web search engines contain simple queries; Semantic Web search engines are not used by the general public. Computing agreement is difficult: each judge evaluates a different number of items. Result rendering is critically important. The Semantic Web is not necessarily all DBpedia: sub30-RES.3 40.6% DBpedia; sub31-run3 93.8% DBpedia. Follow-up experiments validated the Mechanical Turk approach: Blanco et al. Repeatable and Reliable Search System Evaluation using Crowd-Sourcing, SIGIR 2011.
  • 150. Next steps. Achieve repeatability: simplify our process and publish our tools; automate as much as possible… except the Turks ;) Web site for evaluation: continuous submission? Positioning compared to other evaluation campaigns: TREC Entity Track; Question Answering over Linked Data; SEALS campaigns. Join the discussion at semsearcheval@yahoogroups.com
  • 151. Demos

Editor's Notes

  1. Search is a form of content aggregation
  2. In recent years we have witnessed tremendous interest and substantial economic exploitation of search technologies, both at web and enterprise scale. In this regard, technologies for Information Retrieval (IR) can be distinguished from solutions in the field of Database (DB) and Knowledge-based Expert Systems (KB). Whereas IR applications support the retrieval of documents (document retrieval), DB and KB systems deliver more precise answers (data retrieval). The technical differences between these two paradigms can be broken down into three main dimensions, i.e. the representation of the user need (query model), the underlying resources (data model) and the matching technique. The representation of user need and resource content in current IR systems is still almost exclusively achieved by lightweight syntax-centric models such as the predominant keyword paradigm (i.e. keyword queries matched against bag-of-words document representation). While these systems have been shown to work well for topical search, i.e. retrieving documents based on a topic, they usually fail to address more complex information needs. Using more expressive models for the representation of the user need, and leveraging the structure and semantics inherent in the data, DB and KB systems allow for complex queries, and for the retrieval of concrete answers that precisely match them.
  3. Semantic search can be seen as a retrieval paradigm centered on the use of semantics. When a system incorporates the semantics entailed by the query and/or the resources into the matching process, it essentially performs semantic search.
  4. Another trend resulting from this convergence of textual, structured and semantic data is the need for hybrid semantic search systems. While standard IR focuses on the retrieval of documents, the DB and KB systems are built for data retrieval. As opposed to these types of systems, hybrid search systems manage the different types of data in a holistic way, and support the retrieval of answers that are integrated units of information, possibly assembled from different types of data. In such a system, there is not only a convergence at the
  5. Close to the topic of keyword-search in databases, except knowledge-bases have a schema-oblivious design. Different papers assume vastly different query needs even on the same type of data.
  6. Intro Session: motivation, overview etc. (10 min); 1) Representation of the Search Space (M) 45; 2) Offline Preprocessing: Crawling and Indexing (P) 60; 3) Query Processing (T), 4) Matching (T) 90; 5) Ranking 60; 6) Result Presentation (45); 7) Evaluation (45); Demo Session (30); Wrap-up Session (10)
  7. Approximate many results  need rankingRanking also needed in the case where qp is complete and sound, but queries and data representation so imprecise such that we have to deal with too many results
  8. Miss structural information in textsHyperlinksLinguistic structurePositional information
  9. - Real world objects
  10. SELECT: returns all, or a subset of, the variables bound in a query pattern match. CONSTRUCT: returns an RDF graph constructed by substituting variables in a set of triple templates. ASK: returns a boolean indicating whether a query pattern matches or not. DESCRIBE: returns an RDF graph that describes the resources found. Graph patterns are defined recursively. A graph pattern may have zero or more optional graph patterns, and any part of a query pattern may have an optional part. In this example, there are two optional graph patterns. Section 6 introduces the ability to make portions of a query optional; Section 7 introduces the ability to express the disjunction of alternative graph patterns; and Section 8 introduces the ability to constrain portions of a query to particular source graphs. Section 8 also presents SPARQL's mechanism for defining the source graphs for a query. Basic graph patterns allow applications to make queries where the entire query pattern must match for there to be a solution. For every solution of a query containing only group graph patterns with at least one basic graph pattern, every variable is bound to an RDF Term in a solution. However, regular, complete structures cannot be assumed in all RDF graphs. It is useful to be able to have queries that allow information to be added to the solution where the information is available, but do not reject the solution because some part of the query pattern does not match. Optional matching provides this facility: if the optional part does not match, it creates no bindings but does not eliminate the solution. The UNION pattern combines graph patterns; each alternative possibility can contain more than one triple pattern. SPARQL provides a means of combining graph patterns so that one of several alternative graph patterns may match. If more than one of the alternatives matches, all the possible pattern solutions are found.
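The behaviour of optional and alternative matching described in note 10 can be imitated over a small in-memory list of triples. The sketch below is a deliberate simplification and not a SPARQL engine: every pattern is a single triple pattern, variables are strings starting with `?`, and the helper names `match_pattern`, `optional` and `union` as well as the example data are invented for illustration.

```python
def match_pattern(triples, pattern, binding=None):
    """Yield every extension of `binding` that matches one triple pattern."""
    binding = binding or {}
    for triple in triples:
        candidate = dict(binding)
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against its earlier binding.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield candidate


def optional(triples, solutions, pattern):
    """Extend solutions where the pattern matches; keep them unchanged otherwise."""
    for solution in solutions:
        extended = list(match_pattern(triples, pattern, solution))
        yield from extended if extended else [solution]


def union(triples, patterns):
    """Return the solutions of every alternative pattern."""
    for pattern in patterns:
        yield from match_pattern(triples, pattern)


triples = [
    ("alice", "name", "Alice"),
    ("alice", "mbox", "alice@example.org"),
    ("bob", "name", "Bob"),
]
names = list(match_pattern(triples, ("?x", "name", "?name")))
with_mbox = list(optional(triples, names, ("?x", "mbox", "?mbox")))
labels = list(union(triples, [("?x", "name", "?l"), ("?x", "mbox", "?l")]))
```

In `with_mbox` the solution for `bob` survives without a `?mbox` binding, which is the point of optional matching: a missing part adds no bindings but does not eliminate the solution.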
  11. Web data: text + Linked Data + semi-structured RDF + hybrid data that can be conceived as forming data graphs. We hear about Bob and Alice all the time (in the computer science literature) and want to find out more… build a Semantic Web search engine. To address complex information needs by exploiting Web data: query as a set of constraints; match structured data; match text.
  12. - Less than 5 percent of IR papers deal with query processing and the aspect of efficiency
  13. Partitioning has an impact on performance! Blocking: iterator-based approaches. Non-blocking: good for streaming, good when we cannot wait for some parts of the results to be completely worked off. Linked data: cannot wait for sources (some are slower than others), thus better to push data into query processing as they come instead of pulling data and waiting (busy waiting). Top-k:
  14. - Phrase patterns (e.g., “X is the capital of Y”) for large scale extraction. Such simple patterns, when coupled with the richness and redundancy of the Web, can be very useful in scraping millions or even billions of facts from the Web. - Patterns: matched when keywords or data types Xi appear in sequence; matched if all keywords/data types/patterns appear within an m-word window. For extraction: relation patterns. For text search: entity patterns. - When we cannot assume that all relevant data can be extracted, such matching against text is still needed: hybrid search.
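The two kinds of matching in note 14 can be sketched as follows. The regular expression is a deliberately naive stand-in for the phrase pattern “X is the capital of Y” (it only accepts single capitalized words), and `within_window` checks whether all keywords fall inside some run of m consecutive tokens. Both function names and the example sentences are assumptions made for illustration.

```python
import re

# Naive relation pattern: one capitalized word on each side of the phrase.
CAPITAL_PATTERN = re.compile(
    r"(?P<x>[A-Z][a-z]+) is the capital of (?P<y>[A-Z][a-z]+)"
)


def extract_capitals(text):
    """Return (capital, country) pairs matched by the phrase pattern."""
    return [(m.group("x"), m.group("y")) for m in CAPITAL_PATTERN.finditer(text)]


def within_window(tokens, keywords, m):
    """True if all keywords occur within some window of m consecutive tokens."""
    wanted = {keyword.lower() for keyword in keywords}
    lowered = [token.lower() for token in tokens]
    # A text shorter than m tokens is treated as a single window.
    for start in range(max(1, len(lowered) - m + 1)):
        if wanted <= set(lowered[start:start + m]):
            return True
    return False


facts = extract_capitals(
    "Berlin is the capital of Germany. Paris is the capital of France."
)
tokens = "cheap apartment near the center of Berlin".split()
close_enough = within_window(tokens, ["apartment", "Berlin"], 6)
too_far = within_window(tokens, ["apartment", "Berlin"], 3)
```

The relation pattern serves extraction, while the window test is the kind of check a text search would run when the relevant facts were never extracted.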
  15. Given some materialized indexes: no joins at all. Given sorted inputs: sorted merge join.
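For the sorted-input case in note 15, a merge join can walk both inputs once instead of comparing every pair. The sketch below assumes two lists of (key, value) pairs already sorted by key, for example bindings sorted by subject URI; the function name and the example data are hypothetical.

```python
def sort_merge_join(left, right):
    """Join two lists of (key, value) pairs that are already sorted by key."""
    results = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of equal keys on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == left_key:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Emit every combination within the two runs.
            for _, left_value in left[i:i_end]:
                for _, right_value in right[j:j_end]:
                    results.append((left_key, left_value, right_value))
            i, j = i_end, j_end
    return results


# Made-up bindings, both sorted by subject.
names = [("ex:alice", "Alice"), ("ex:bob", "Bob"), ("ex:carol", "Carol")]
employers = [("ex:alice", "KIT"), ("ex:carol", "AIFB"), ("ex:carol", "Yahoo")]
joined = sort_merge_join(names, employers)
```

Each input is scanned from front to back only once, which is why keeping index entries sorted pays off at query time.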
  17. Every task is a challenge in itself, some more and some less well elaborated. There are separate challenges for every problem.
  21. Syntactic: “works” vs. “works at”. Semantic: “works” vs. “employ”.
  22. A text that contains a large number of mentions of “Berlin” is likely to be about “Berlin”. An interpretation i is more likely to be the correct interpretation of K when terms in K co-occur in a large number of contexts (bags of words) associated with i. “Berlin” and “apartment” co-occur more often in the geographic location context/topic than in the context of people.
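The co-occurrence idea in note 22 can be sketched by counting, for each candidate interpretation, how many of its associated contexts contain all query terms. The scoring function, the interpretation labels, and the tiny bags of words below are invented for illustration; a real system would estimate these counts from a large corpus.

```python
def score_interpretations(keywords, contexts):
    """Count, per interpretation, the contexts that contain every keyword.

    `contexts` maps an interpretation label to a list of bags of words
    (sets of lowercased terms) associated with that interpretation.
    """
    terms = {keyword.lower() for keyword in keywords}
    return {
        interpretation: sum(1 for bag in bags if terms <= bag)
        for interpretation, bags in contexts.items()
    }


# Made-up contexts for two readings of "Berlin".
contexts = {
    "Berlin (location)": [
        {"berlin", "apartment", "rent"},
        {"berlin", "apartment", "district"},
        {"berlin", "wall"},
    ],
    "Berlin (person)": [
        {"berlin", "composer", "song"},
        {"berlin", "apartment", "biography"},
    ],
}
scores = score_interpretations(["Berlin", "apartment"], contexts)
best = max(scores, key=scores.get)
```

With these toy contexts the location reading wins because the two terms co-occur in more of its contexts than in those of the person reading.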