A Semantic Big Data Companion
Stefano Bortoli @stefanobortoli
bortoli@okkam.it
Flavio Pompermaier @fpompermaier
pompermaier@okkam.it
The company (briefly)
• Okkam is
– an SME based in Trento, Italy
– started as a spin-off of the
University of Trento and FBK (2010)
• Okkam core business is
– large-scale data integration using
semantic technologies and
an Entity Name System
• Okkam operative sectors
– Services for public administration
– Services for restaurants (and more)
– Research projects
• FP7, H2020, and Local agencies
Who we are
• Stefano Bortoli, PhD
– works as technical director and researcher at Okkam S.R.L.
(Trento, Italy). His research and development interests are in the
area of Information Integration, with a special focus on entity-centric
applications exploiting semantic technologies.
• Flavio Pompermaier, MSc.
– works as senior software engineer at Okkam S.R.L. (Trento, Italy).
Flavio is a passionate developer working with state-of-the-art
technologies, combining semantic and big data technologies.
Our contributions
• Early Flink adopters and promoters (since Stratosphere)
• Example on how to use Flink with MongoDB
– https://github.com/okkam-it/flink-mongodb-test
• 2 pull requests
– [FLINK-1928] [hbase] Added HBase write example
– [FLINK-1828] [hadoop] Fixed missing call to configure() for Configurable HadoopOutputFormats
• 8 JIRA tickets
– OPEN FLINK-2503 Inconsistencies in FileInputFormat hierarchy
– RESOLVED FLINK-1978 POJO serialization NPE
– OPEN FLINK-1834 Is mapred.output.dir conf parameter really required?
– CLOSED FLINK-1828 Impossible to output data to an HBase table
– OPEN FLINK-1827 Move test classes in test folders and fix scope of test dependencies
– OPEN FLINK-2800 kryo serialization problem
– RESOLVED FLINK-2394 HadoopOutFormat OutputCommitter is default to FileOutputCommiter
– OPEN FLINK-1241 Record processing counter for Dashboard
• Hundreds of email threads and discussions
What we do
Our toolbox
Our Flink Use Cases
• Our objective is to build and manage (very)
large entity-centric knowledge bases to serve
different purposes and domains
• So far, we used Apache Flink for:
– Domain reasoning (Flink + Parquet + Thrift)
– RDF data lifecycle (Flink + Parquet + Jena/Sesame)
– RDF data intelligence (Flink + ELKiBi)
– Duplicate record detection (Flink + HBase + Solr)
– Entiton Record linkage (Flink + MongoDB + Kryo)
– Telemetry analysis (Flink + MongoDB + Weka)
Why we need Flink
[Diagram: the Entiton data model combines the database record and the RDF statement. Each quad carries a predicate, an object, an object type, and a provenance IRI; subjects have a local IRI, a global IRI, and an RDF type. Storage alternatives shown: triplestore, NoSQL + indexes, expensive data warehouse.]
Entiton using Parquet+Thrift
namespace java it.okkam.flink.entitons.serialization.thrift

struct EntitonQuad {
  1: required string p;                 // predicate
  2: required string o;                 // object
  3: optional string ot;                // object type
  4: required string g;                 // source IRI
}

struct EntitonAtom {
  1: required string s;                 // local IRI
  2: optional string oid;               // ENS IRI
  3: required list<string> types;       // rdf:types
  4: required list<EntitonQuad> quads;  // quads
}

struct EntitonMolecule {
  1: required EntitonAtom r;            // root atom
  2: optional list<EntitonAtom> atoms;  // other atoms
}
[Diagram: an entiton groups the quads of one subject, identified by a local IRI, an ENS IRI, and an RDF type.]
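For illustration, here is a minimal sketch (not the actual Okkam job) of how Thrift-generated Entiton atoms could be written to Parquet from a Flink batch job, using parquet-thrift's ParquetThriftOutputFormat wrapped in Flink's Hadoop-compatibility HadoopOutputFormat. Parquet package names changed across releases, so the imports may need adjusting.

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.hadoop.thrift.ParquetThriftOutputFormat;

public class EntitonParquetSink {

  // Writes atoms to Parquet; the Tuple2 key is unused (Void), since Parquet's
  // Hadoop output formats only consume the value. Calling execute() on the
  // ExecutionEnvironment is left to the caller.
  public static void writeAtoms(DataSet<Tuple2<Void, EntitonAtom>> atoms, String outputPath)
      throws Exception {
    Job job = Job.getInstance();
    ParquetThriftOutputFormat.setThriftClass(job, EntitonAtom.class);   // schema from the Thrift struct
    ParquetThriftOutputFormat.setCompression(job, CompressionCodecName.SNAPPY);
    FileOutputFormat.setOutputPath(job, new Path(outputPath));

    HadoopOutputFormat<Void, EntitonAtom> parquetOut =
        new HadoopOutputFormat<>(new ParquetThriftOutputFormat<EntitonAtom>(), job);
    atoms.output(parquetOut);
  }
}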
Hardware-wise
• We compete with expensive data warehouse solutions
– e.g. Oracle Exadata Database Machines, IBM Netezza, etc.
• Testing on small machines fosters optimization
– If you don’t want to wait, make your code faster!
• Our code is ready to scale, without big investments
• Fancy stuff can be done without millions of euros in HW
8 x Gigabyte Brix
16 GB RAM
256 GB SSD
1 TB HDD
Intel i7-4770, 3.2 GHz
+
1 Gbit Switch
ENS Maintenance
• Entity Name System (ENS) (FP7 ’08-’10)
– Web-scale support for minting and reusing
persistent entity identifiers for the Semantic Web and Linked Open Data (SW/LOD)
ENS Maintenance
• Duplicate detection of 9.2M entities in 6h45
(using Flink incubator 0.6)
– Query Apache Solr global index to perform flexible blocking given
a subset of attributes of each entity (names)
– Distinct/join pairs of candidate duplicates
– Rich filter implementing the match function
– Consolidate sets of candidate duplicates by grouping
• Tricks:
– If HBase does not distribute regions uniformly, call rebalance()
– Compress byte[] with LZ4 in custom HBase input format to reduce
network traffic
– Reverse keys to speed up join execution (up to 10% in some cases; see the sketch below)
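The key-reversal trick, sketched below under the assumption that join keys are IRIs sharing long common prefixes (all field names here are illustrative): reversing the strings puts the discriminating characters first, which presumably makes key comparisons during the join cheaper. Reversal is applied symmetrically to both inputs, so key equality is preserved.

import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.tuple.Tuple2;

public class ReversedKeyJoin {

  // Reverses the String join key in field 0 of a (key, payload) pair.
  public static class ReverseKey implements MapFunction<Tuple2<String, String>, Tuple2<String, String>> {
    @Override
    public Tuple2<String, String> map(Tuple2<String, String> value) {
      value.f0 = new StringBuilder(value.f0).reverse().toString();
      return value;
    }
  }

  // Joins two (key, payload) datasets on the reversed keys and keeps the payloads.
  public static DataSet<Tuple2<String, String>> joinOnReversedKeys(
      DataSet<Tuple2<String, String>> left, DataSet<Tuple2<String, String>> right) {
    return left.map(new ReverseKey())
        .join(right.map(new ReverseKey()))
        .where(0).equalTo(0)
        .with(new JoinFunction<Tuple2<String, String>, Tuple2<String, String>, Tuple2<String, String>>() {
          @Override
          public Tuple2<String, String> join(Tuple2<String, String> l, Tuple2<String, String> r) {
            return new Tuple2<>(l.f1, r.f1);
          }
        });
  }
}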
Tax Reasoner
• Pilot project for ACI and Val d’Aosta
Objectives: to produce analytics and investigate:
1. Who did not pay Vehicle Excise Duty (Kraftfahrzeugsteuer)?
2. Who did not pay Vehicle Insurance?
3. Who skipped Vehicle Inspection?
4. Who did not pay Vehicle Sales Taxes?
5. Who violated exceptions to the above?
Dataset: 15 data sources over 5 years, with 12M records about 950k
vehicles and 500k subjects, for a total of 90M facts
Challenge: consider events (time) and infer implicit information.
Apache Flink jobs (job 1 is sketched below):
1. From RDF to Entiton
2. Domain Specific Temporal Inference (Tax Reasoner)
3. Build ElasticSearch Indexes
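As a rough illustration of job 1 only (a hypothetical sketch, not the production code): quads keyed by subject can be folded into EntitonAtom records with a groupBy/reduceGroup; the accessors follow the Thrift-generated classes from the structs shown earlier, and the RDF parsing that produces the keyed quads is omitted.

import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class RdfToEntitonJob {

  private static final String RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

  // Folds all quads of one subject into a single EntitonAtom; rdf:type objects
  // are additionally collected into the atom's types list.
  public static class QuadsToAtom
      implements GroupReduceFunction<Tuple2<String, EntitonQuad>, EntitonAtom> {
    @Override
    public void reduce(Iterable<Tuple2<String, EntitonQuad>> quads, Collector<EntitonAtom> out) {
      EntitonAtom atom = new EntitonAtom();
      List<String> types = new ArrayList<>();
      List<EntitonQuad> quadList = new ArrayList<>();
      for (Tuple2<String, EntitonQuad> q : quads) {
        atom.setS(q.f0); // subject local IRI (the same for the whole group)
        if (RDF_TYPE.equals(q.f1.getP())) {
          types.add(q.f1.getO());
        }
        quadList.add(q.f1);
      }
      atom.setTypes(types);
      atom.setQuads(quadList);
      out.collect(atom);
    }
  }

  // quadsBySubject: (subject local IRI, quad) pairs parsed from the RDF sources.
  public static DataSet<EntitonAtom> toAtoms(DataSet<Tuple2<String, EntitonQuad>> quadsBySubject) {
    return quadsBySubject.groupBy(0).reduceGroup(new QuadsToAtom());
  }
}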
Tax Reasoner
Tax Reasoner
Temporal Inference Execution Plan
1h ETA with SSD (2h30 on HDD) on the developer machine
11M new facts inferred
It took 1 DAY to perform the select query on one of the sources!
RDF Data Intelligence
[Screenshots: Business Intelligence analytics with SIREn Solution KiBi, pivot browser, timeline with details about a vehicle, geospatial indicators]
Using Entiton with MongoDB
• Inspired by the work of Gregg Kellogg (2012)
– http://www.slideshare.net/gkellogg1/jsonld-and-mongodb
• We updated JAOB (Java Architecture for OWL Binding)
– Serializes RDF into POJOs and vice versa
– Also provides a Maven plugin to compile OWL ontologies into POJOs
– https://github.com/okkam-it/jaob
• Data access layer (using Spring Data):
– POJO → RDF → JSON-LD + Kryo → MongoDB
– MongoDB → Kryo → POJO
• Bottom line:
– We use (framed) JSON-LD to allow (complex) tree queries on an entiton
database modeled according to a domain ontology
– We exploit Kryo deserialization for fast reading
– We enjoy the Spring Data abstraction to implement data access APIs
Using Entiton with MongoDB
@Document(collection = Entiton.TABLE)
@CompoundIndexes({
  @CompoundIndex(name = "nestedId", def = "{ 'jsonld.@id': 1 }", unique = true),
  ….
})
public class Entiton<T extends Thing> implements Serializable {
  @Id private String id;
  @Version private String version;
  private DBObject jsonld;
  private Binary kryo;
  @GeoSpatialIndexed(type = GeoSpatialIndexType.GEO_2DSPHERE)
  private Point point;
  public String javaClass;
  …
}
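A minimal sketch, assuming standard Spring Data MongoDB conventions, of how such a document might be looked up; the repository and method names are illustrative, not Okkam's actual data access API, and depending on the Spring Data version the geo types live in org.springframework.data.geo or org.springframework.data.mongodb.core.geo.

import java.util.List;

import org.springframework.data.geo.Distance;
import org.springframework.data.geo.Point;
import org.springframework.data.mongodb.repository.MongoRepository;
import org.springframework.data.mongodb.repository.Query;

public interface EntitonRepository extends MongoRepository<Entiton<Thing>, String> {

  // Tree query on the framed JSON-LD: match the nested node id (backed by the
  // 'nestedId' compound index declared on the entity).
  @Query("{ 'jsonld.@id': ?0 }")
  Entiton<Thing> findByNestedId(String iri);

  // Geospatial lookup backed by the 2dsphere index on the 'point' field.
  List<Entiton<Thing>> findByPointNear(Point location, Distance maxDistance);
}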
Entiton (Mongo JSON-LD)
Contextual personalization in Recommendation Architecture
Telemetry data collection
Flink batch processes
Contextual Personalization on demand
Lessons learned
• Reversing String tuple IDs improves join performance
• When you make joins, ensure distinct dataset keys
• Reuse objects to reduce the impact of garbage collection (see the sketch after this list)
• When writing Flink jobs, start with small, debuggable unit tests, then
run the job on the cluster on the entire dataset (waiting for the big
data debugging work of Dr. Leich @ TUB)
• Serialization matters: less memory required, less GC, faster data
loading → faster execution
• HD speed matters when RAM is not enough, SSD rulez
• Parquet rulez: self-describing data, push-down filters
• Use Gelly consciously, sometimes joins are good enough
• If your code crashes, it is usually your fault: since 0.9, Flink is
quite stable, at least for batch job execution
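The object-reuse lesson as a small hypothetical example: a map function that mutates and re-emits a single Tuple2 instance instead of allocating one per record. This is safe only when downstream operators do not hold on to references; Flink's ExecutionConfig.enableObjectReuse() pursues the same goal at the runtime level.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;

public class AtomToKeyedPair implements MapFunction<EntitonAtom, Tuple2<String, EntitonAtom>> {

  // One reusable output instance per parallel task instead of one per record.
  private final Tuple2<String, EntitonAtom> reuse = new Tuple2<>();

  @Override
  public Tuple2<String, EntitonAtom> map(EntitonAtom atom) {
    reuse.f0 = atom.getS(); // key: the atom's subject local IRI
    reuse.f1 = atom;
    return reuse;
  }
}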
Suggested improvements
• Running jobs on a cluster is always tricky because of poor Maven
dependency management in Flink
– Define a POM for the build settings, a parent POM, a bill-of-materials
(BOM / release train), and a common resources folder
– Validate the submitted jar against the deployed Flink dist bundle on
client submit, warning about possibly conflicting classes
• Enable fast compilation (-Dmaven.test.skip=true) FLINK-1827
• Better monitoring, we’re eager to use the new web client!
• Complete Hadoop compatibility
– counters and custom grouping/sorting functions
• Start thinking about education of professionals
– e.g. courses and certification
Future work
• Benchmark Entiton serialization models on
Parquet (Avro vs Thrift vs Protobuf)
• Manage declarative data fusion policies
– à la LDIF: http://ldif.wbsg.de/
• Define a formal entiton operations algebra (e.g.
merge, project, select, filter)
• Try out Cloudera Kudu
– a novel Hadoop storage engine addressing bulk-loading stability, scan
performance, and random access
– https://github.com/cloudera/kudu
Conclusions
• We think we are walking along the “last mile”
towards real world enterprise Semantic Applications
• Combining big data and semantics allows us to be
flexible, expressive and, thanks to Flink, very scalable
with very competitive costs
• Apache Flink gives us the leverage to shuffle data
around without much headache
• We proved cool stuff can be done in a simple and
efficient way, with the right tools and mindset
• Hopefully, Flink will help us reduce tax evasion in Italy; not bad for
a German squirrel, eh?
Thanks for your attention
Any questions (before beer)?