Building an Enterprise/Cloud
Analytics Platform with Jupyter
Notebooks and Apache Spark
Fred Reiss
Chief Architect, IBM Spark Technology Center
2
Hi!
Fred Reiss
• 2014-present: Chief Architect,
IBM Spark Technology Center.
• 2006-2014: Worked for IBM
Research.
• 2006: Ph.D. from U.C.
Berkeley.
3
The Jupyter Project
• Open Source project that builds software to enable
interactive notebooks for data science
– Started in 2014
– Grew out of the IPython project
4
What is IPython?
https://upload.wikimedia.org/wikipedia/commons/4/47/IPython-shell.png
By Shishirdasika (Own work) [CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
[Screenshot callouts] An interactive console for Python; can open a window to display graphics.
5
IPython Notebooks
https://upload.wikimedia.org/wikipedia/commons/a/af/IPython-notebook.png
By Shishirdasika (Own work) [CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
Text and
graphics in the
same browser
window
6
Jupyter Notebooks Today
https://developer.ibm.com/code/patterns/create-visualizations-to-understand-food-insecurity/
[Annotated screenshot] User input and system output: cells, tables, graphs, text output.
7
Jupyter Notebooks
• Jupyter notebooks are widely used by data scientists, social
scientists, physical scientists, engineers, and others
• Useful for many tasks
– Analyzing data
– Developing and debugging software
– Running experiments
– Keeping track of experimental results
– Presenting results
• Jupyter is a central part of the IBM Data Science Experience
(http://datascience.ibm.com)
8
Jupyter in the Enterprise: Key Challenges
• Collaboration among multiple users
• Large-scale data analysis (problems that don’t fit in
a laptop)
– Shared cloud infrastructure like Kubernetes
– Parallel frameworks like Spark
• Security and authentication
• Auditing and data access control
9
Isn’t this just shipping strings around?
[Diagram] The browser’s JavaScript sends the string “1+1” to a server, which forwards it to a Python process; the string “2” flows back along the same path.
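That cartoon can be collapsed into a few lines of Python. This is a deliberately naive toy (the function name `naive_kernel` is made up for illustration), not how Jupyter is actually built:

```python
# Toy illustration (NOT how Jupyter works): the naive "ship strings
# around" model, collapsed into a single function.
def naive_kernel(code: str) -> str:
    """Evaluate a code string and return the result as a string."""
    return str(eval(code))  # a real kernel never does a bare eval()

print(naive_kernel("1+1"))  # prints "2"
```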
10
Isn’t this just shipping strings around?
[Diagram] The same picture, but the server is now a FancyNewSystem that layers on security, multitenancy, authentication, Spark, and Kubernetes.
11
The Five Stages of Enterprise Jupyter Deployment
12
The Five Stages of Enterprise Jupyter Deployment
1. Denial
13
Jupyter does more than just pass strings around.
• Quite a bit more!
14
Asynchronous Operations
• Queue up multiple cells for
execution
– …in arbitrary order
• Stream output while a cell is
running
• Interrupt any operation
[Screenshot callout] The fifteenth cell that executed in this session.
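Under the hood, each queued cell becomes one message on the kernel’s shell channel. A rough sketch of the shape of such a message, following the Jupyter messaging protocol (the field values here are illustrative):

```python
import uuid
from datetime import datetime, timezone

# Sketch of the wire message a frontend sends to run one cell, modeled on
# the "execute_request" shape from the Jupyter messaging spec.
def execute_request(code: str, session: str) -> dict:
    return {
        "header": {
            "msg_id": uuid.uuid4().hex,
            "msg_type": "execute_request",
            "session": session,
            "username": "alice",          # illustrative value
            "date": datetime.now(timezone.utc).isoformat(),
            "version": "5.3",
        },
        "parent_header": {},
        "metadata": {},
        "content": {"code": code, "silent": False},
    }

session = uuid.uuid4().hex
# Queuing several cells is just sending several messages on the shell channel:
queue = [execute_request(c, session) for c in ("x = 1", "x + 1", "print(x)")]
print(len(queue), queue[0]["header"]["msg_type"])
```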
15
Jupyter’s Display System: Much More than Text
https://nbviewer.jupyter.org/github/ipython/ipython/blob/master/examples/IPython%20Kernel/Custom%20Display%20Logic.ipynb
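One piece of that display system is the rich-representation protocol: any Python object can define methods such as `_repr_html_`, and the notebook renders the richest representation it can. A minimal sketch (the `RedBox` class is a made-up example):

```python
# Jupyter's display system asks objects for rich representations.
# An object opts in by defining methods like _repr_html_; the notebook
# frontend prefers these over the plain-text __repr__.
class RedBox:
    def __init__(self, text):
        self.text = text

    def __repr__(self):                 # plain-text fallback
        return f"RedBox({self.text!r})"

    def _repr_html_(self):              # used by the notebook frontend
        return f'<div style="color:red">{self.text}</div>'

box = RedBox("hello")
print(box._repr_html_())
```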
16
Profiling and Debugging
17
Magics
• Jupyter’s
standard
Python kernel
has over 90
built-in magic
commands
http://ipython.readthedocs.io/en/stable/interactive/magics.html
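Many magics are thin wrappers over the standard library. For instance, `%timeit` is, roughly speaking, a convenience layer over the stdlib `timeit` module (the snippet below approximates what it measures; numbers vary by machine):

```python
import timeit

# Roughly what %timeit does for you: run the statement many times,
# repeat the measurement, and report the best per-run time.
timings = timeit.repeat("sum(range(1000))", repeat=3, number=1000)
best = min(timings) / 1000  # best time per single run, in seconds
print(f"{best * 1e6:.1f} us per loop (best of 3)")
```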
18
Extensions
• Many additional extensions in the IPython project’s GitHub repository
– https://github.com/ipython-contrib/jupyter_contrib_nbextensions
19
PixieDust
https://developer.ibm.com/code/patterns/analyze-san-francisco-traffic-data-with-ibm-pixiedust-and-data-science-experience/
20
Brunel
https://developer.ibm.com/open/videos/brunel-visualization-update-tech-talk/
21
The Actual Architecture of Jupyter Notebooks
Notebook Server Process
22
The Actual Architecture of Jupyter Notebooks
[Architecture diagram] The browser (JavaScript) handles notebook management and talks to the notebook server process (Python), which holds notebook server state, kernel management, and a kernel proxy. The proxy connects to the IPython kernel over five channels: Shell, IOPub, stdin, control, and heartbeat. The kernel holds session state and runs user code against libraries such as sklearn, Spark, and TensorFlow, with access to the local filesystem.
23
The Five Stages of Enterprise Jupyter Deployment
1. Denial
24
The Five Stages of Enterprise Jupyter Deployment
1. Denial
2. Anger
Notebook Server Process
28
The Actual Architecture of Jupyter Notebooks
[Same architecture diagram, annotated] Five ZeroMQ message queues over unencrypted TCP sockets… per kernel.
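Those five channels are advertised to clients through a kernel “connection file”. The sketch below mirrors the field names used by the Jupyter kernel protocol; the ports and key are made-up example values:

```python
import json

# Illustrative kernel connection file: one TCP port per channel.
# Messages are HMAC-signed with `key`, but the sockets themselves
# are plain, unencrypted TCP.
connection_file = json.dumps({
    "transport": "tcp",
    "ip": "127.0.0.1",
    "shell_port": 54321,
    "iopub_port": 54322,
    "stdin_port": 54323,
    "control_port": 54324,
    "hb_port": 54325,               # heartbeat
    "key": "example-hmac-key",      # made-up example value
    "signature_scheme": "hmac-sha256",
})

info = json.loads(connection_file)
ports = [info[k] for k in
         ("shell_port", "iopub_port", "stdin_port", "control_port", "hb_port")]
print(ports)  # five distinct TCP ports, one per channel
```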
29
Third-Party Kernels
• The IPython kernel is
the most common…
• …but there is a long tail
of other Jupyter kernels
– 103 kernels currently
listed on the Jupyter
project’s wiki
Notebook Server Process
30
The Actual Architecture of Jupyter Notebooks
[Same architecture diagram, annotated] To share notebooks among users, you need to share the notebook server.
Notebook Server Process
31
The Actual Architecture of Jupyter Notebooks
[Same architecture diagram, annotated] To use Apache Spark™ on YARN, you need to be inside the YARN cluster’s network.
32
Jupyter in the Enterprise: Key Challenges
• Collaboration among multiple users
• Large-scale data analysis
– Shared cloud infrastructure like Kubernetes
– Parallel frameworks like Spark
• Security and authentication
• Auditing and data access control
Bringing these properties to the Jupyter stack is hard!
33
The Five Stages of Enterprise Jupyter Deployment
1. Denial
2. Anger
34
The Five Stages of Enterprise Jupyter Deployment
1. Denial
2. Anger
3. Bargaining
35
Bargaining
• Meeting all the enterprise requirements is expensive
• Compromise to bring down the cost
36
Compromise #1: Gigantic Server
• Find the biggest machine or container you can get
• Run the entire Jupyter stack on that one machine
• Issues:
– Machine needs to be sized for the maximum aggregate
memory of all active users’ active kernels
• Hard upper limit of 256GB-1TB in most organizations
• Very problematic if you have many users and big data
– Need to authenticate all these users to the same machine
and notebook server
37
Compromise #2: Notebook Server Per User
• Proxy server manages a pool of containers, one per active user
• Each container contains an entire Jupyter notebook stack
• JupyterHub project provides a pre-built implementation of this
approach
• Issues:
– Container needs to be big enough for all the user’s kernels
• What size container to allocate when the user logs in?
• Does a big enough container even exist?
– Disables collaboration features
– Many more moving parts → more failure modes
38
Compromise #3: Replace the Kernel
[Diagram] The IPython kernel on the far side of the five channels (Shell, IOPub, stdin, control, heartbeat) is swapped out for a proxy.
39
Compromise #3: Replace the Kernel
• Replace the IPython kernel with a proxy
• Put something enterprise-friendly on
the other side of the proxy
• Apache Livy implements this approach
– https://github.com/jupyter-incubator/sparkmagic
• Issues:
– Breaks Jupyter’s magics and extensions
– Breaks data visualization libraries
– Breaks third-party kernels
– Less control over code execution
[Diagram] The five channels (Shell, IOPub, stdin, control, heartbeat) now terminate at a RESTful web service instead of a local kernel.
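The proxy idea can be illustrated end to end with the standard library alone. This toy stands in for something like Livy: a tiny HTTP service that evaluates code strings, plus a client that ships code to it and relays the result. Everything here, including the eval-based handler, is a hypothetical sketch, not Livy’s actual API:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Toy stand-in for a remote execution service: POST a code string,
# get back a JSON result. A real deployment would authenticate,
# sandbox, and certainly never eval() untrusted input.
class EvalHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        code = self.rfile.read(int(self.headers["Content-Length"])).decode()
        body = json.dumps({"result": str(eval(code))}).encode()  # toy only!
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), EvalHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The "kernel" seen by Jupyter would just do this: forward code, relay output.
url = f"http://127.0.0.1:{server.server_port}/execute"
req = urllib.request.Request(url, data=b"1+1", method="POST")
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())["result"]
print(result)  # "2"
server.shutdown()
```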
40
The Five Stages of Enterprise Jupyter Deployment
1. Denial
2. Anger
3. Bargaining
41
The Five Stages of Enterprise Jupyter Deployment
1. Denial
2. Anger
3. Bargaining
4. Depression
42
The Five Stages of Enterprise Jupyter Deployment
1. Denial
2. Anger
3. Bargaining
4. Depression
5. Jupyter Enterprise Gateway
43
The Origins of Jupyter Enterprise Gateway
• Multiple IBM products embedding Spark on YARN
• All wanted to add Jupyter notebooks with Spark
• Usual enterprise requirements (multitenancy,
scalability, security, etc.)
• Had reached the “Bargaining” stage
– Mix of compromises 1, 2, and 3
YARN Cluster
Initial
Prototype
44
[Architecture diagram] A notebook node runs nb2kg (a proxy) in front of the Jupyter Kernel Gateway, which starts Python kernels (each with a Spark driver) behind a security layer inside the YARN cluster; Spark executors run on the YARN workers under the YARN Resource Manager. The five kernel channels (Shell, IOPub, stdin, control, heartbeat) pass through the gateway.
YARN Cluster
Initial
Prototype
45
[Same diagram, annotated]
Issue #1: All kernels and Spark drivers run on a single node.
Issue #2: All Spark jobs run as the same user ID.
Issue #1: All kernels run on a single node
[Bar chart: Maximum Number of Simultaneous Kernels (4GB heap) vs. Cluster Size (32GB nodes)] The maximum stays at 8 kernels whether the cluster has 4, 8, 12, or 16 nodes.
46
Jupyter Enterprise Gateway: Initial Goals
• Optimized Resource Allocation
– Run Spark in YARN Cluster Mode to better utilize cluster resources.
– Pluggable architecture for additional Resource Managers
• Multiuser support with user impersonation
– Enhance security and sandboxing by enabling user impersonation
when running kernels (using Kerberos).
– Individual HDFS home folder for each notebook user.
– Use the same user ID for notebook and batch jobs.
• Enhanced Security
– Secure socket communications
– Any network communication should be encrypted
47
YARN Cluster
Jupyter Enterprise Gateway
48
[Architecture diagram] The Jupyter Enterprise Gateway sits behind the security layer and provides multitenancy plus remote kernel and kernel lifecycle management. Each Jupyter kernel runs with its Spark driver in its own YARN container on the YARN workers, with its own set of Spark executors. Impersonation: Alice’s kernel runs under Alice’s user ID.
Scalability Benefits
[Bar chart: Maximum Number of Simultaneous Kernels (4GB heap) vs. Cluster Size (32GB nodes)] Before JEG: 8 kernels at 4, 8, 12, and 16 nodes. After JEG: 16, 32, 48, and 64 kernels, scaling with the size of the cluster.
49
Jupyter Enterprise Gateway: Open Source
50
• Released through the
Jupyter Incubator
– BSD License
– https://github.com/jupyter-incubator/enterprise_gateway
– Current release: 0.7.0
Jupyter Enterprise Gateway: Supported Platforms
• Python/Spark 2.x using IPython kernel
– With Spark Context delayed initialization
• Scala 2.11 / Spark 2.x using Apache Toree kernel
– With Spark Context delayed initialization
• R / Spark 2.x with IRkernel
51
Jupyter Enterprise Gateway – Roadmap
• Add support for other resource managers
– Kubernetes support
• Kernel Configuration Profile
– Enable client to request different resource configuration for kernels (e.g. small,
medium, large)
– Profiles should be defined by administrators and enabled for users or groups of users.
• Administration UI
– Dashboard with running kernels and administration actions
• Time running, stop/kill, Profile Management, etc
• User Environments
• High Availability
52
Jupyter Enterprise Gateway
• Jupyter Enterprise Gateway at IBM Code
– https://developer.ibm.com/code/openprojects/jupyter-enterprise-gateway/
• Jupyter Enterprise Gateway source code at GitHub
– https://github.com/jupyter-incubator/enterprise_gateway
• Docker images
– https://github.com/jupyter-incubator/enterprise_gateway/tree/master/etc/docker
• Jupyter Enterprise Gateway 0.7 release
– https://github.com/jupyter-incubator/enterprise_gateway/releases/tag/v0.7.0
• Jupyter Enterprise Gateway Documentation
– http://jupyter-enterprise-gateway.readthedocs.io/en/latest/
53
Free
IBM Data Science
trial
https://ibm.biz/BdZceR
54
Thank you!
And special thanks to the Jupyter
Enterprise Gateway team: Luciano Resende,
Kevin Bates, Kun Liu, Christian Kadner,
Sanjay Saxena, Alan Chin, Sherry Guo, Alex
Bozarth, Zee Chen
55
Backup
56
Building your own test
environment with
Jupyter Enterprise Gateway
Jupyter Enterprise Gateway: Deployment
57
[Deployment diagram] A management node, powered by Ambari, runs the Enterprise Gateway (EG) in front of a compute engine based on Apache Spark.
Jupyter Enterprise Gateway: Deployment
• Ansible deployment scripts
– https://github.com/lresende/spark-cluster-install
• One click deployment of the Spark Cluster
– Configure your host inventory (see example on git repository)
– Run the "setup-ambari.yml" playbook
• $ ansible-playbook --verbose setup-ambari.yml -i hosts-fyre-ambari -c paramiko
• One click deployment of the Jupyter Enterprise Gateway
– Run the "setup-enterprise-gateway.yml" playbook
• $ ansible-playbook --verbose setup-enterprise-gateway.yml -i hosts-fyre-ambari -c
paramiko
58
Jupyter Enterprise Gateway - Deployment
• Docker images
– yarn-spark: Basic one node Spark on Yarn configuration
– enterprise-gateway: Adds Anaconda and Jupyter Enterprise Gateway to the
yarn-spark image
– nb2kg: Minimal Jupyter Notebook client configured with hooks to access the
Enterprise Gateway
– https://github.com/jupyter-incubator/enterprise_gateway/tree/master/etc/docker
• Building the latest docker images
– git clone https://github.com/jupyter-incubator/enterprise_gateway
– make docker-clean docker-images
– Note: the Makefile also has individual targets to clean and build each image
(type make for help)
59
Jupyter Enterprise Gateway - Deployment
• Connecting to a Spark Cluster using a docker image
docker run -t --rm \
  -e KG_URL='http://<Enterprise Gateway IP>:8888' \
  -p 8888:8888 \
  -e VALIDATE_KG_CERT='no' \
  -e LOG_LEVEL=DEBUG \
  -e KG_REQUEST_TIMEOUT=40 \
  -e KG_CONNECT_TIMEOUT=40 \
  -v ${HOME}/opensource/jupyter/jupyter-notebooks/:/tmp/notebooks \
  -w /tmp/notebooks \
  elyra/nb2kg:dev
60
Editor's Notes
  1. Now, when I first saw these requirements, my initial reaction was, “sounds easy”. I mean, to a first approximation, all that Jupyter is doing is passing strings around.
  2. This is what I initially thought, and I’ve met a good number of other people who were in the same situation and came up with the same design. The problem with this design is that it’s actually only the first stage of a much longer process that I like to call…
  3. And in particular, the first stage of this process is called…
  4. Let me explain.
  5. All these cool features of Jupyter notebooks rely on an architecture that is substantially more baroque than the cartoon picture from ten slides back…
  6. When an enterprise architect becomes aware of all this complexity, that’s when he or she moves from stage 1 to stage 2, which is…
  7. Let me explain.
  8. This architecture was designed for an academic setting. When you try to transplant it into an enterprise environment and layer enterprise requirements on top of it, things go downhill rather quickly.
  9. …and the purpose of this talk is to help you to work through this fourth stage as quickly as possible and move on to stage 5, which is…
  10. …Jupyter Enterprise Gateway. (Bet you thought I was going to say “acceptance”). So, what is Jupyter Enterprise Gateway?
  11. Min RK
  12. Min RK
  13. Min RK