Data Science with the Help of Metadata
Jim Dowling
Associate Prof @ KTH
Senior Researcher @ SICS
CEO @ Logical Clocks AB
www.hops.io
@hopshadoop
Metadata for Source Code
•Metadata for Source Code
- Enables questions like: who, when, what, why?
•Metadata for Automation
- Enables testing, quality-control, deployment.
•Metadata for Collaboration
- GitHub projects, teams
Metadata for Datasets?
•Access Control
•Data provenance
•Auditing
•Development
- Schema for the dataset
- How can I load/download this dataset?
- Quality control
Metadata can simplify development
Hive-on-Spark: the schema and file location come from the Hive metastore.

from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)
f1_df = sqlContext.sql("""
    SELECT id, count(*) AS nb_entries
    FROM my_db.log
    WHERE ts = '20160515'
    GROUP BY id
""")

SparkSQL without the metastore: the schema and the record format have to be re-declared by hand.

from pyspark.sql import SQLContext
from pyspark.sql.types import StructField, StructType, StringType

sqlContext = SQLContext(sc)
# Raw text file: we must know and re-implement the record format ourselves
# (tab-separated fields are assumed here).
f0 = sc.textFile('logfile').map(lambda line: line.split('\t'))
fpFields = [
    StructField('ts', StringType(), True),
    StructField('id', StringType(), True),
    StructField('it', StringType(), True)
]
fpSchema = StructType(fpFields)
df_f0 = sqlContext.createDataFrame(f0, fpSchema)
df_f0.registerTempTable('log')
f1_df = sqlContext.sql("""
    SELECT log.id, count(*) AS nb_entries
    FROM log
    WHERE ts = '20160515'
    GROUP BY id
""")
Hive is Metadata for HDFS files
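For example, registering existing HDFS files as a Hive table is purely a metadata operation. A minimal sketch, reusing the HiveContext from the previous slide and assuming a hypothetical tab-delimited log directory at /data/logs:

# Attach schema and location metadata to files already in HDFS; no data is moved.
sqlContext.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS my_db.log (
        ts STRING,
        id STRING,
        it STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/logs'
""")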
Metadata for Files/Directories in HDFS
•Add Schemas using the Filesystem API (see the sketch below)
•Add auditing using the FSImage API
•Add access control using a Filesystem Plugin
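A minimal sketch of the schema-via-filesystem-API idea, using plain HDFS extended attributes from Python (the attribute name user.schema and the path /data/logs are hypothetical; Hops attaches richer metadata through its own API):

import subprocess

# Store a schema description as an extended attribute on the log directory.
subprocess.run(
    ["hdfs", "dfs", "-setfattr", "-n", "user.schema",
     "-v", "ts string, id string, it string", "/data/logs"],
    check=True)

# Read the attribute back.
subprocess.run(
    ["hdfs", "dfs", "-getfattr", "-n", "user.schema", "/data/logs"],
    check=True)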
Access Control in Hadoop
hdfs dfs -chmod -R 000 /apps/hive
[http://hortonworks.com/blog/best-practices-in-hdfs-authorization-with-apache-ranger]
Metadata Totem Poles in Hadoop
How do you ensure the consistency of the metadata and the data?
Why are the Metadata Services Siloed?
HDFS v2
[Diagram: HDFS Client, an active NameNode and a Standby NameNode (coordinated through Journal Nodes and Zookeeper), and DataNodes. Max 200 GB of metadata, held on the NameNode's JVM heap.]
YARN
[Diagram: YARN Client, an active ResourceMgr and a Standby ResourceMgr (failover via Zookeeper), and NodeManagers. Metadata on the JVM heap again.]
Hops: Distributed Metadata for Hadoop
HopsFS Architecture
[Diagram: HDFS Client, multiple NameNodes (one acting as Leader) storing their metadata in NDB, and DataNodes. > 12 TB of metadata, > 2.6× throughput.]
[HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases, Niazi et al., arXiv 2016]
HopsYARN Architecture
[Diagram: YARN Client, multiple ResourceMgrs with their state in NDB: one Scheduler and several Resource Trackers, plus NodeManagers. Leader election replaces a failed Scheduler. Clusters of up to 10K nodes.]
Experience Designing Metadata in Hops
Hops Metadata services
[Diagram: a Metadata API in front of the Database (HDFS/YARN), with Elasticsearch, Kafka, and Zookeeper alongside]
The Distributed Database is the Single Source-of-Truth for Metadata
Metadata for HDFS and YARN
Files
Directories
Containers
Provenance
Security
Quotas
Projects
Datasets
Metadata + Data in the same Database
2-phase commit (transactions)
Strong Consistency for Metadata.
Metadata Integrity maintained using 2PC and Foreign Keys.
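A hedged illustration of how foreign keys tie extended metadata to the file system data it describes. The table and column names (and connection parameters) are invented for the example, not the real HopsFS schema:

import mysql.connector

conn = mysql.connector.connect(host="ndb-mysqld", user="hops",
                               password="hops", database="hops")
cur = conn.cursor()
# A metadata row can only exist for an inode that exists, and the cascade
# removes it in the same transaction that deletes the inode.
cur.execute("""
    CREATE TABLE IF NOT EXISTS ext_metadata (
        inode_id   BIGINT       NOT NULL,
        attr_name  VARCHAR(255) NOT NULL,
        attr_value TEXT,
        PRIMARY KEY (inode_id, attr_name),
        FOREIGN KEY (inode_id) REFERENCES inodes (id) ON DELETE CASCADE
    ) ENGINE = ndbcluster
""")
conn.commit()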
Metadata in Elasticsearch
[Diagram: Files, Directories, and Metadata in the Database, with one-way replication from the Database into Elasticsearch Search Indexes]
Eventual Consistency for Metadata.
Metadata Integrity maintained by Asynchronous Replication.
[ePipe Tutorial, BOSS Workshop, VLDB 2016]
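A hedged sketch of what the one-way, asynchronous replication amounts to (this is not ePipe's implementation; the index name and event fields are invented for the example):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

def apply_change(event):
    """Index one metadata change event read from the database's change log.
    Deletes are handled analogously with es.delete(); the index lags the
    database, which is the eventual consistency described above."""
    es.index(index="hops_metadata", id=event["inode_id"],
             body={"name": event["name"], "path": event["path"]})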
Metadata for Kafka
[Diagram: Topics, Partitions, and ACLs in the Database, kept in sync with Zookeeper/Kafka through the Metadata API and polling]
Eventual Consistency for Metadata.
Metadata integrity maintained by custom recovery logic and polling.
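A hedged sketch of the polling-and-recovery idea (not the actual Hopsworks code; the expected-topics mapping is assumed to come from the database):

from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="broker:9092")

def reconcile(expected):
    """expected maps topic name -> partition count, as recorded in the
    database; create any topic that Kafka is missing."""
    existing = set(admin.list_topics())
    missing = [NewTopic(name=n, num_partitions=p, replication_factor=1)
               for n, p in expected.items() if n not in existing]
    if missing:
        admin.create_topics(new_topics=missing)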
Case Study: Self-Service Multi-Tenant Projects
www.hops.io
@hopshadoop
Problem: Sensitive Data needs its own Cluster
[Diagram: Alice, with a single identity, has access to both an NSA DataSet and a User DataSet in the same cluster]
Alice can copy/cross-link between data sets.
Alice has only one Kerberos Identity.
Neither attribute-based access control nor dynamic roles are supported in Hadoop.
Solution: Project-Specific UserIDs
[Diagram: Alice is mapped to a project-specific user in each project: NSA__Alice is a member of Project NSA, Users__Alice is a member of Project Users; HDFS enforces access control]
How can we share DataSets between Projects?
Sharing DataSets between Projects
[Diagram: a project owns the DataSet; to share it, the members of Project NSA are added to the DataSet's group, so NSA__Alice gains access alongside Users__Alice]
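At the HDFS level, a hedged sketch of what dataset sharing relies on (the paths and the group name are made-up examples; Hopsworks manages the group memberships itself rather than through shell commands):

import subprocess

def hdfs(*args):
    # Thin wrapper around the hdfs CLI, for readability.
    subprocess.run(["hdfs", "dfs", *args], check=True)

# The DataSet directory belongs to a dataset-specific group, and only
# group members can read or write it.
hdfs("-chgrp", "-R", "users__logs", "/Projects/Users/logs")
hdfs("-chmod", "-R", "770", "/Projects/Users/logs")
# Sharing the DataSet with Project NSA then means adding NSA__Alice
# (and the other NSA project users) to the users__logs group.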
HopsWorks (WebApp) Enforces Dynamic Roles
[Diagram: Alice@gmail.com authenticates to HopsWorks, which maps her to NSA__Alice or Users__Alice depending on the active project and uses secure impersonation, backed by X.509 certificates, towards HopsFS, HopsYARN, and Kafka]
X.509 Certificate Per Project-Specific User
[Diagram: Alice@gmail.com authenticates; the Project Mgr adds/deletes project-specific users, inserts/removes their certificates in the Distributed Database, and sends Cert Signing Requests to the Root CA; the certificates are used with the services (Hadoop, Spark, Kafka, etc.)]
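A hedged sketch of issuing a certificate signing request for a project-specific user with the Python cryptography library (the subject fields follow the NSA__Alice naming above; the rest is illustrative, not Hopsworks' certificate code):

from cryptography import x509
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.x509.oid import NameOID

# Key pair for the project-specific user NSA__Alice.
key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

# CSR that the platform's Root CA would sign.
csr = (x509.CertificateSigningRequestBuilder()
       .subject_name(x509.Name([
           x509.NameAttribute(NameOID.COMMON_NAME, u"NSA__Alice"),
           x509.NameAttribute(NameOID.ORGANIZATION_NAME, u"Project NSA"),
       ]))
       .sign(key, hashes.SHA256()))

csr_pem = csr.public_bytes(serialization.Encoding.PEM)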
Project
•A project is a collection of
- Members
- HDFS DataSets
- Kafka Topics
- Notebooks, Jobs
•A project has an owner
•A project has quotas
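Put together, a project record amounts to roughly the following (a hedged sketch with illustrative field names, not Hopsworks' actual data model):

from dataclasses import dataclass, field
from typing import List

@dataclass
class Project:
    name: str
    owner: str
    members: List[str] = field(default_factory=list)    # project-specific users
    datasets: List[str] = field(default_factory=list)   # HDFS DataSets
    topics: List[str] = field(default_factory=list)     # Kafka Topics
    notebooks: List[str] = field(default_factory=list)  # Notebooks, Jobs
    cpu_quota_hours: float = 0.0                         # YARN CPU quota
    storage_quota_gb: float = 0.0                        # HDFS storage quota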
[Diagram: a project groups HDFS datasets (dataset 1 … dataset N) and Kafka topics (Topic 1 … Topic N)]
Project Roles
Data Owner Privileges
- Import/Export data
- Manage Membership
- Share DataSets, Topics
Data Scientist Privileges
- Write and Run code
We delegate administration of privileges to users
Elastic Hadoop
Each Project has a:
• YARN CPU Quota
• HDFS Storage Quota
Uber-Style Pricing to
incentivize cluster usage
Sharing DataSets/Topics between Projects
The same as Sharing Folders in Dropbox
Added Multi-Tenancy to Zeppelin
www.hops.site
A 2 MW datacenter research and test environment
5 lab modules, planned for up to 3,000-4,000 servers and 2,000-3,000 square meters
[Slide by Prof. Tor Björn Minde, CEO SICS North Swedish ICT AB]
Demo
Status and Upcoming
•Automated installation support using Vagrant/Chef
or Karamel/Chef
•First official release of Hopsworks coming soon
•Globally shared datasets with peer-to-peer
technology, backed by our data center.
•Support for Apache Beam
Summing Up
Metadata services have the potential to make your life easier as a Data Scientist.
Most Hadoop Metadata services are proprietary and require an administrator-in-the-loop.
Hops provides an open, tinker-friendly platform for building consistent metadata.
Hopsworks shows how you can leverage metadata to build a self-service, project-based model for Hadoop/Spark/Flink applications.
The Team
Active: Jim Dowling, Seif Haridi, Tor Björn Minde,
Gautier Berthou, Salman Niazi, Mahmoud Ismail,
Theofilos Kakantousis, Johan Svedlund Nordström,
Vasileios Giannokostas, Ermias Gebremeskel,
Antonios Kouzoupis, Misganu Dessalegn, Rizvi Hasan,
Paul Mälzer, Bram Leenders, Juan Roca.
Alumni: K. “Sri” Srijeyanthan, Steffen Grohsschmiedt,
Alberto Lorente, Andre Moré, Ali Gholami,
Stig Viaene, Hooman Peiro, Evangelos Savvidis,
Jude D’Souza, Qi Qi, Gayana Chandrasekara,
Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos,
Peter Buechler, Pushparaj Motamari, Hamid Afzali,
Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.
Join us!
http://github.com/hopshadoop
www.hops.io
@hopshadoop