EMR Zeppelin & Livy
AWS BIG DATA demystified
Omid Vahdaty, Big Data Ninja
Agenda
● What is Zeppelin?
● Motivation?
● Features?
● Performance?
● Demo?
Zeppelin
Apache Zeppelin is a completely open, web-based notebook that enables interactive data analytics: a multi-purpose notebook that brings data ingestion, data exploration, visualization, sharing, and collaboration features.
Zeppelin out of the box features
● Web-based GUI.
● Supported languages
○ Spark SQL
○ PySpark
○ Scala
○ SparkR
○ JDBC (Redshift, Athena, Presto, MySQL, ...)
○ Bash
● Visualization
● Users, Sharing and Collaboration
● Advanced Security features
● Built-in AWS S3 support
● Orchestration
Why Zeppelin?
● Sexy look and feel of a SQL web client
● Back up your SQL easily and automatically via S3
● Share and collaborate on your notebooks
● Orchestration & scheduler for your nightly jobs
● Combine system commands + SQL + Scala + Spark visualization
● Advanced security features
● Combine all the DBs you need in one place, including data transfer
● Get one step closer to PySpark, Scala and SparkR
● Visualize your data easily
Getting started - Provisioning EMR
● Zeppelin is installed on the master node of the EMR cluster (choose the
installation that is right for you)
● Don't forget to add the AWS Glue connectors
● Don't forget to add Spark …
● https://zeppelin.apache.org/docs/0.7.3/
● ML notebook example with Zeppelin:
● https://raw.githubusercontent.com/hortonworks-gallery/zeppelin-notebooks/hdp-2.6/2CCBNZ5YY/note.json
Interpreter
● The Zeppelin interpreter concept allows any language/data-processing backend to be plugged into Zeppelin. Currently, Zeppelin
supports many interpreters, such as Scala (with Apache Spark), Python (with Apache Spark), Spark SQL, JDBC, Markdown, Shell,
and so on.
● SparkContext, SQLContext and ZeppelinContext are automatically created and exposed as the variables sc, sqlContext and z,
respectively, in the Scala, Python and R environments. Starting from 0.6.1, SparkSession is available as the variable spark when
you are using Spark 2.x.
● https://zeppelin.apache.org/docs/latest/manual/interpreters.html
Binding modes
1. In Scoped mode, Zeppelin still runs a single interpreter JVM process, but a
separate interpreter group serves each note.
2. In Shared mode, a single JVM process and a single interpreter group serve all
notes.
3. Isolated mode runs a separate interpreter process for each note, so each note
has a completely isolated session.
Binding modes - shared mode
In Shared mode, a single JVM process
and a single session serve all notes.
As a result, note A can directly access
variables (e.g. Python, Scala, ...)
created from other notes.
Binding modes - scoped mode
In Scoped mode, Zeppelin still runs a
single interpreter JVM process but, in
the case of per note scope, each note
runs in its own dedicated session. (Note
it is still possible to share objects
between these notes via ResourcePool)
Binding modes - Isolated mode
Isolated mode runs a separate
interpreter process for each note in the
case of per note scope. So, each note
has an absolutely isolated session. (But
it is still possible to share objects via
ResourcePool)
When to use each binding mode?
● Isolated mode means higher resource utilization but fewer options to share
objects between notes
● In Scoped mode, each note has its own Scala REPL, so a variable defined in
one note cannot be read or overridden in another note. However, a single
SparkContext still serves all the sessions: all jobs are submitted to this
SparkContext and the fair scheduler schedules them. This is useful when you
do not want to share the Scala session but want to keep a single Spark
application and leverage its fair scheduler.
● In Shared mode, one SparkContext and one Scala REPL are shared among all
interpreters in the group, so every note shares a single SparkContext and a
single Scala REPL
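Where the mode is set: in recent Zeppelin versions the binding mode is a per-interpreter option, configured in the interpreter UI and persisted in conf/interpreter.json. A minimal sketch of the relevant fragment, assuming the 0.8-style settings format (the surrounding structure is abbreviated and the values are only an example, not from this deck):

```json
{
  "interpreterSettings": {
    "spark": {
      "name": "spark",
      "option": {
        "perNote": "scoped",
        "perUser": "shared"
      }
    }
  }
}
```

Setting "perNote" to "shared", "scoped" or "isolated" selects between the three modes described above.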
Import/Export Notebooks
● You can import/export notebooks from a URL, local disk, or Zeppelin storage: S3 and Git
● Notes on Zeppelin S3 storage:
○ You need to import from local disk the first time
○ You can use roles to provide access to S3 instead of an access key / secret key
○ Each notebook is saved on S3 under a specific path (see docs)
○ Can't open directly from S3 - bug?
○ Yes, you can use S3 encryption...
Zeppelin storage s3 (use role instead of accesskey/secretkey)
https://aws.amazon.com/blogs/big-data/running-an-external-zeppelin-instance-using-s3-backed-notebooks-with-spark-on-amazon-emr/
{
  "Classification": "zeppelin-env",
  "Properties": {},
  "Configurations": [
    {
      "Classification": "export",
      "Properties": {
        "ZEPPELIN_NOTEBOOK_STORAGE": "org.apache.zeppelin.notebook.repo.S3NotebookRepo",
        "ZEPPELIN_NOTEBOOK_S3_BUCKET": "my-zeppelin-bucket-name",
        "ZEPPELIN_NOTEBOOK_USER": "user"
      },
      "Configurations": []
    }
  ]
}
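If you generate cluster configurations programmatically, the entry above can be built and serialized with a few lines of Python. This is a sketch, not part of the deck; the bucket and user names are placeholders:

```python
import json

def zeppelin_s3_storage_config(bucket, user):
    """Build the EMR 'zeppelin-env' configuration entry that points
    Zeppelin's notebook storage at an S3 bucket."""
    return {
        "Classification": "zeppelin-env",
        "Properties": {},
        "Configurations": [
            {
                "Classification": "export",
                "Properties": {
                    "ZEPPELIN_NOTEBOOK_STORAGE":
                        "org.apache.zeppelin.notebook.repo.S3NotebookRepo",
                    "ZEPPELIN_NOTEBOOK_S3_BUCKET": bucket,
                    "ZEPPELIN_NOTEBOOK_USER": user,
                },
                "Configurations": [],
            }
        ],
    }

if __name__ == "__main__":
    # The EMR --configurations flag takes a JSON *list* of such entries.
    print(json.dumps(
        [zeppelin_s3_storage_config("my-zeppelin-bucket-name", "user")],
        indent=2))
```

The resulting JSON list can be saved to a file and passed to `aws emr create-cluster --configurations file://zeppelin-config.json`.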
Advanced Security
● Basic authentication (via Apache Shiro): user management
(user, pass, groups), even LDAP
https://zeppelin.apache.org/docs/0.7.3/security/shiroauthentication.html
● Notebook permissions management: read/write/share
https://zeppelin.apache.org/docs/0.7.3/security/notebook_authorization.html
● Data source authorization (e.g. 3rd-party DB):
https://zeppelin.apache.org/docs/0.7.3/security/datasource_authorization.html
● Zeppelin with Kerberos:
https://zeppelin.apache.org/docs/latest/interpreter/spark.html#setting-up-zeppelin-with-kerberos
HTTPS/SSL
● You can use an SSH tunnel, as with the other EMR web UIs (secured by default)
● Authentication and SSL via nginx
○ https://zeppelin.apache.org/docs/0.7.3/security/authentication.html#http-basic-authentication-using-nginx
○ https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.3/bk_zeppelin-component-guide/content/config-ssl-zepp.html
● You can add an ELB in front of EMR (in: 443, out: 8890) to serve the Zeppelin
GUI via HTTPS
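A minimal sketch of the nginx approach from the links above: terminate SSL and basic authentication in nginx and proxy to Zeppelin on port 8890. The hostname, certificate paths, and htpasswd file are placeholders, not values from this deck:

```nginx
server {
    listen 443 ssl;
    server_name zeppelin.example.com;             # placeholder
    ssl_certificate     /etc/nginx/ssl/cert.pem;  # placeholder
    ssl_certificate_key /etc/nginx/ssl/key.pem;   # placeholder

    auth_basic           "Zeppelin";
    auth_basic_user_file /etc/nginx/.htpasswd;    # created with htpasswd

    location / {
        proxy_pass http://localhost:8890;  # Zeppelin's default port on EMR
        # Zeppelin's UI uses websockets; forward the upgrade headers
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```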
User management
In order to manage groups/roles, you can create them under the
"[roles]" section of the "shiro.ini" file. For example, you could have a set of groups like:
```
[roles]
admin = *
readonly = *
poweruser = *
scientist = *
engineer = *
```
User management
Then the "[users]" section could look like the below:
```
[users]
admin = <password>, admin
user1 = <password>, scientist, poweruser
user2 = <password>, engineer, poweruser
user3 = <password>, readonly
```
User management
For example, the above means that:
- user "admin" is in the "admin" group;
- user "user1" is in the "poweruser" and "scientist" groups;
- etc.
Once the groups/roles are created, the authorization setup is similar to what is described in
https://zeppelin.apache.org/docs/0.7.3/security/notebook_authorization.html . For instance, in a notebook's permission
page, you can put the group name instead of the individual users:
```
Owners admin
Writers scientist,engineer,poweruser
Readers readonly
```
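To sanity-check a shiro.ini before deploying it, the user/role lines can be parsed with Python's stdlib configparser, since the file uses INI syntax. This is only an illustrative checker, not how Shiro itself parses the file; the sample content mirrors the example above:

```python
from configparser import ConfigParser

# Sample content in the same shape as the shiro.ini example above.
SHIRO_INI = """
[users]
admin = <password>, admin
user1 = <password>, scientist, poweruser
user2 = <password>, engineer, poweruser

[roles]
admin = *
scientist = *
poweruser = *
engineer = *
"""

def user_roles(ini_text):
    """Map each user to its set of roles.

    In Shiro's [users] section, the value is the password followed by a
    comma-separated list of roles, so we drop the first element.
    """
    cp = ConfigParser()
    cp.read_string(ini_text)
    roles = {}
    for user, value in cp.items("users"):
        parts = [p.strip() for p in value.split(",")]
        roles[user] = set(parts[1:])  # parts[0] is the password
    return roles

if __name__ == "__main__":
    for user, groups in user_roles(SHIRO_INI).items():
        print(user, "->", sorted(groups))
```

A check like this makes it easy to confirm that every role referenced by a user also appears in the [roles] section before restarting Zeppelin.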
Orchestration & Scheduling
You can go to any Zeppelin notebook and click the clock icon to set up scheduling
using CRON. You can use http://www.cronmaker.com/ to generate the CRON
expression for the interval you are interested in.
Orchestration & Scheduling
You can run any job (if you have permission) and see its status
Bootstrapping EMR zeppelin
● To launch an EMR cluster with a pre-defined notebook, we can use Amazon
S3 for persistent storage of the notebook, together with EMR steps, since EMR
bootstrap actions run before Zeppelin is installed on the cluster.
● sudo aws s3 cp s3://<my bucket name>/<location>/zeppelin-site.xml
/etc/zeppelin/conf/
● aws s3 cp /etc/zeppelin/conf.dist/shiro.ini s3://my-zeppelin/config/
● sudo stop zeppelin
● sudo start zeppelin
Apache Livy
A REST API to manage Spark jobs:
● Interactive Scala, Python and R shells
● Batch submissions in Scala, Java, Python
● Multiple users can share the same Zeppelin server (impersonation support)
● Can be used to submit jobs from anywhere via REST
● Does not require any code change to your programs
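As a sketch of what "submitting jobs from anywhere with REST" looks like: Livy's batch endpoint accepts a POST to /batches with a JSON body naming the application file. The host, bucket, and job names below are placeholders, and this uses only the Python standard library:

```python
import json
from urllib import request  # stdlib; the 'requests' library would work the same way

LIVY_URL = "http://localhost:8998"  # Livy's default port; adjust for your cluster

def batch_payload(app_file, class_name=None, args=None):
    """Build the JSON body for Livy's POST /batches endpoint."""
    payload = {"file": app_file}
    if class_name:
        payload["className"] = class_name  # needed for JAR submissions
    if args:
        payload["args"] = list(args)
    return payload

def submit_batch(payload):
    """POST the batch to Livy. Requires a reachable Livy server."""
    req = request.Request(
        LIVY_URL + "/batches",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    return json.load(request.urlopen(req))

if __name__ == "__main__":
    # Example payload for a PySpark job stored on S3 (placeholder path).
    p = batch_payload("s3://mybucket/jobs/my-spark-job.py",
                      args=["--date", "2018-01-01"])
    print(json.dumps(p, indent=2))
```

The response (when a Livy server is reachable) includes a batch id that can be polled via GET /batches/{id} to track the job's state.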
Livy + Zeppelin use case
Multi tenant users/jobs:
● Sharing of Spark context across multiple Zeppelin instances.
● When the Zeppelin server runs with authentication enabled, the Livy interpreter
propagates user identity to the Spark job so that the job runs as the originating
user. This is especially useful when multiple users are expected to connect to
the same set of data repositories within an enterprise.
EMR bootstrap of zeppelin in an EMR STEP
If you want, you can automate the above process by using an EMR step: a simple shell script that downloads your
zeppelin-site.xml file from S3 onto your EMR cluster and restarts the Zeppelin service.
To run it, copy the script to an S3 bucket and then use the script-runner.jar process as outlined in [2] below, with the script's S3
location as its only argument.
To do this via the AWS EMR Console:
1 - Under the "Add steps (optional)" section, select "Custom JAR" for the "Step type" and click the "Configure" button.
2 - In the pop-up window, for us-east-1 the JAR location for script-runner.jar is:
s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar
3 - For the argument, pass in the S3 location of the "setupZeppelin.sh" file, e.g.:
s3://mybucket/mylocation/setupZeppelin.sh
Once done, click "Add" and continue with your EMR cluster creation (this process is also included when cloning an EMR cluster).
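The same step can be added programmatically. The sketch below builds the step definition the console steps above produce; the cluster id and script path are placeholders, and the boto3 call is shown only in a comment since it needs AWS credentials:

```python
# Regional script-runner.jar location for us-east-1, as in step 2 above.
SCRIPT_RUNNER_JAR = "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar"

def zeppelin_setup_step(script_s3_path):
    """Build the EMR step definition that runs setupZeppelin.sh
    via script-runner.jar."""
    return {
        "Name": "Setup Zeppelin",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": SCRIPT_RUNNER_JAR,
            "Args": [script_s3_path],  # the script's S3 location is the only argument
        },
    }

# Attaching the step to a running cluster would look roughly like:
#   import boto3
#   emr = boto3.client("emr", region_name="us-east-1")
#   emr.add_job_flow_steps(
#       JobFlowId="j-XXXXXXXXXXXX",  # placeholder cluster id
#       Steps=[zeppelin_setup_step("s3://mybucket/mylocation/setupZeppelin.sh")])
```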
Livy + Zeppelin Architecture
Resources
● https://medium.com/@leemoonsoo/apache-zeppelin-interpreter-mode-explained-bae0525d0555
● https://zeppelin.apache.org/docs/0.8.0-SNAPSHOT/usage/interpreter/interpreter_binding_mode.html
● https://aws.amazon.com/blogs/big-data/import-zeppelin-notes-from-github-or-json-in-zeppelin-0-5-6-on-amazon-emr/
● https://zeppelin.apache.org/docs/0.7.3/storage/storage.html#notebook-storage-in-local-git-repository
● https://zeppelin.apache.org/docs/0.7.3/storage/storage.html#notebook-storage-in-s3
● encryption on s3 :https://zeppelin.apache.org/docs/0.7.3/storage/storage.html#data-encryption-in-s3
● Reference - https://community.hortonworks.com/questions/98101/scheduler-in-zeppelin.html.
● https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_zeppelin-component-guide/content/zepp-with-spark.html
● https://zeppelin.apache.org/docs/0.6.1/interpreter/livy.html
● https://hortonworks.com/blog/recent-improvements-apache-zeppelin-livy-integration/
● https://www.slideshare.net/HadoopSummit/apache-zeppelin-livy-bringing-multi-tenancy-to-interactive-data-analysis
