2018 02 20-jeg_index
- 3. 3
The Jupyter Project
• Open Source project that builds software to enable
interactive notebooks for data science
– Started in 2014
– Grew out of the IPython project
- 7. 7
Jupyter Notebooks
• Jupyter notebooks are widely used by data scientists, social
scientists, physical scientists, engineers, and others
• Useful for many tasks
– Analyzing data
– Developing and debugging software
– Running experiments
– Keeping track of experimental results
– Presenting results
• Jupyter is a central part of the IBM Data Science Experience
(http://datascience.ibm.com)
- 8. 8
Jupyter in the Enterprise: Key Challenges
• Collaboration among multiple users
• Large-scale data analysis (problems that don’t fit in
a laptop)
– Shared cloud infrastructure like Kubernetes
– Parallel frameworks like Spark
• Security and authentication
• Auditing and data access control
- 9. 9
Isn’t this just shipping strings around?
[Diagram: three boxes in a row, JavaScript, Server, and Python Process. The string “1+1” is passed from box to box, and the string “2” is passed back.]
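The cartoon design on this slide can be sketched in a few lines of Python. This is only an illustration of the naive mental model, not how Jupyter executes code; the function name naive_execute is hypothetical.

```python
def naive_execute(code: str) -> str:
    """Evaluate an expression string and return the result as a string.

    Hypothetical stand-in for the "Server" box in the cartoon: a string
    goes in, a string comes out, and nothing else is modeled.
    """
    # eval is unsafe on untrusted input; it is used here only as a toy.
    return str(eval(code))


print(naive_execute("1+1"))
```

Everything the later slides discuss (asynchronous output, rich display, multiple channels) is missing from this picture.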
- 10. 10
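Isn’t this just shipping strings around?
[Diagram: the same three boxes as on slide 9, with Server replaced by FancyNewSystem. FancyNewSystem is labeled: Security, Multitenancy, Authentication, Spark, Kubernetes. The strings “1+1” and “2” are passed along as before.]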
- 14. 14
Asynchronous Operations
• Queue up multiple cells for execution
– …in arbitrary order
• Stream output while a cell is running
• Interrupt any operation
[Screenshot callout: Fifteenth cell that executed in this session]
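The behavior listed on this slide can be illustrated with a small standard-library simulation. It is a sketch under stated assumptions, not Jupyter code: kernel_loop, run_cell, and the cell IDs are hypothetical names, and "running" a cell here only echoes its lines.

```python
import asyncio


async def run_cell(code, emit):
    """Pretend to run a cell, streaming one output chunk per line of code."""
    for line in code.splitlines():
        await asyncio.sleep(0)  # yield control so other tasks can run
        emit(f"ran: {line}")


async def kernel_loop(queue, emit):
    """Execute queued cells one at a time, numbering them in execution order."""
    execution_count = 0
    while True:
        cell_id, code = await queue.get()
        execution_count += 1
        emit(f"In [{execution_count}] cell {cell_id}")
        await run_cell(code, emit)
        queue.task_done()


async def main():
    outputs = []
    queue = asyncio.Queue()
    worker = asyncio.create_task(kernel_loop(queue, outputs.append))
    # Queue cells in an arbitrary order: the notebook's third cell goes first.
    for cell_id, code in [("c3", "x = 1"), ("c1", "y = 2\nprint(x + y)")]:
        queue.put_nowait((cell_id, code))
    await queue.join()  # wait until every queued cell has finished
    worker.cancel()     # stand-in for an interrupt: stop the kernel loop
    return outputs


outputs = asyncio.run(main())
print("\n".join(outputs))
```

The execution counter follows the order in which cells ran, not their position in the notebook, which is what the "fifteenth cell" callout refers to.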
- 15. 15
Jupyter’s Display System: Much More than Text
https://nbviewer.jupyter.org/github/ipython/ipython/blob/master/examples/IPython%20Kernel/Custom%20Display%20Logic.ipynb
- 22. The Actual Architecture of Jupyter Notebooks
[Architecture diagram. Boxes: JavaScript (browser); Notebook Server Process containing Notebook Management, Kernel Management, Notebook Server State, and Kernel Proxy; Python Process containing the IPython Kernel, Kernel Session State, and User Code (sklearn, Spark, TensorFlow, …); Local Filesystem. Channels between the Kernel Proxy and the kernel: Shell, IOPub, stdin, control, heartbeat.]
- 28. The Actual Architecture of Jupyter Notebooks
[Architecture diagram repeated from slide 22, with a callout:]
Five ZeroMQ message queues over unencrypted TCP sockets…
…per kernel
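To make the five channels concrete, here is a sketch that reads connection info in the shape of a Jupyter kernel connection file and derives one ZeroMQ endpoint per channel. The port numbers and key are made-up illustrative values, and channel_endpoints is a hypothetical helper.

```python
import json

# Example connection info in the shape of a Jupyter kernel connection file.
# The port numbers and the key below are made-up illustrative values.
connection_info = json.loads("""
{
  "transport": "tcp",
  "ip": "127.0.0.1",
  "shell_port": 50001,
  "iopub_port": 50002,
  "stdin_port": 50003,
  "control_port": 50004,
  "hb_port": 50005,
  "signature_scheme": "hmac-sha256",
  "key": "example-key"
}
""")

CHANNELS = ("shell", "iopub", "stdin", "control", "hb")  # hb = heartbeat


def channel_endpoints(info):
    """Return the ZeroMQ endpoint URL for each of the five channels."""
    return {
        name: f"{info['transport']}://{info['ip']}:{info[name + '_port']}"
        for name in CHANNELS
    }


endpoints = channel_endpoints(connection_info)
for name, url in endpoints.items():
    print(f"{name:8s} {url}")
```

Every running kernel gets its own set of five ports, so a firewall or proxy between the notebook server and its kernels has to account for all of them.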
- 29. 29
Third-Party Kernels
• The IPython kernel is
the most common…
• …but there is a long tail
of other Jupyter kernels
– 103 kernels currently
listed on the Jupyter
project’s wiki
- 30. The Actual Architecture of Jupyter Notebooks
[Architecture diagram repeated from slide 22, with a callout:]
To share notebooks among users, need to share notebook server
- 31. The Actual Architecture of Jupyter Notebooks
[Architecture diagram repeated from slide 22, with a callout:]
To use Apache Spark™ on YARN, need to be inside the YARN cluster’s network.
- 32. 32
Jupyter in the Enterprise: Key Challenges
• Collaboration among multiple users
• Large-scale data analysis
– Shared cloud infrastructure like Kubernetes
– Parallel frameworks like Spark
• Security and authentication
• Auditing and data access control
Bringing these properties to the Jupyter stack is hard!
- 36. 36
Compromise #1: Gigantic Server
• Find the biggest machine or container you can get
• Run the entire Jupyter stack on that one machine
• Issues:
– Machine needs to be sized for the maximum aggregate
memory of all active users’ active kernels
• Hard upper limit of 256GB-1TB in most organizations
• Very problematic if you have many users and big data
– Need to authenticate all these users to the same machine
and notebook server
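The sizing issue is simple arithmetic, sketched below. The user counts and the 4 GB per-kernel figure are hypothetical inputs chosen for illustration (4 GB matches the heap size used in the charts later in the deck); 256 GB is the lower bound quoted on this slide.

```python
def required_memory_gb(active_users, kernels_per_user, gb_per_kernel):
    """Memory the single server needs for every active kernel at once."""
    return active_users * kernels_per_user * gb_per_kernel


def max_kernels(machine_gb, gb_per_kernel):
    """How many kernels fit on one machine of the given size."""
    return machine_gb // gb_per_kernel


# Hypothetical load: 100 active users, 2 kernels each, 4 GB per kernel.
print(required_memory_gb(100, 2, 4))  # 800 (GB)

# A 256 GB machine holds at most 64 such kernels.
print(max_kernels(256, 4))  # 64
```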
- 37. 37
Compromise #2: Notebook Server Per User
• Proxy server manages a pool of containers, one per active user
• Each container contains an entire Jupyter notebook stack
• JupyterHub project provides a pre-built implementation of this
approach
• Issues:
– Container needs to be big enough for all the user’s kernels
• What size container to allocate when the user logs in?
• Does a big enough container even exist?
– Disables collaboration features
– Many more moving parts, and therefore more failure modes
- 39. Compromise #3: Replace the Kernel
• Replace the IPython kernel with a proxy
• Put something enterprise-friendly on the other side of the proxy
• Apache Livy implements this approach
– https://github.com/jupyter-incubator/sparkmagic
• Issues:
– Breaks Jupyter’s magics and extensions
– Breaks data visualization libraries
– Breaks third-party kernels
– Less control over code execution
[Diagram labels: Kernel Proxy; Proxy; channels Shell, IOPub, stdin, control, heartbeat; RESTful web service]
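As an illustration of what sits behind such a proxy, the sketch below builds (without sending) the two HTTP requests a client would make against Livy's REST API: one to create an interactive session and one to submit a statement. The host name is hypothetical, the session id is a placeholder, and build_post is a helper invented for this example.

```python
import json
import urllib.request

# Hypothetical host; 8998 is Livy's default port.
LIVY_URL = "http://livy.example.com:8998"


def build_post(url, payload):
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# 1. Create an interactive PySpark session.
create_session = build_post(f"{LIVY_URL}/sessions", {"kind": "pyspark"})

# 2. Submit code to that session. The real session id is returned by
#    step 1; 0 is only a placeholder here.
session_id = 0
run_statement = build_post(
    f"{LIVY_URL}/sessions/{session_id}/statements", {"code": "1+1"}
)

print(run_statement.get_method(), run_statement.full_url)
print(run_statement.data.decode("utf-8"))
```

Code travels as a string in a request body and results come back as a REST response, which suggests why features that depend on the kernel's own channels are listed as broken on this slide.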
- 41. 41
The Five Stages of Enterprise Jupyter Deployment
1. Denial
2. Anger
3. Bargaining
4. Depression
- 42. 42
The Five Stages of Enterprise Jupyter Deployment
1. Denial
2. Anger
3. Bargaining
4. Depression
5. Jupyter Enterprise Gateway
- 43. 43
The Origins of Jupyter Enterprise Gateway
• Multiple IBM products embedding Spark on YARN
• All wanted to add Jupyter notebooks with Spark
• Usual enterprise requirements (multitenancy,
scalability, security, etc.)
• Had reached the “Bargaining” stage
– Mix of compromises 1, 2, and 3
- 46. Issue #1: All kernels run on a single node
[Bar chart: Maximum Number of Simultaneous Kernels. Y axis: max kernels (4GB heap), 0 to 80. X axis: cluster size (32GB nodes).]
Cluster size: 4 nodes, 8 nodes, 12 nodes, 16 nodes
Max kernels: 8, 8, 8, 8
- 47. Jupyter Enterprise Gateway: Initial Goals
• Optimized Resource Allocation
– Run Spark in YARN Cluster Mode to better utilize cluster resources.
– Pluggable architecture for additional Resource Managers
• Multiuser support with user impersonation
– Enhance security and sandboxing by enabling user impersonation
when running kernels (using Kerberos).
– Individual HDFS home folder for each notebook user.
– Use the same user ID for notebook and batch jobs.
• Enhanced Security
– Secure socket communications
– Any network communication should be encrypted
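The first two goals map onto standard spark-submit options. The sketch below assembles such a command line; it is not Enterprise Gateway's actual launch code, and the function name and the launch_kernel.py script are hypothetical. The flags themselves (--master yarn, --deploy-mode cluster, --proxy-user) are regular spark-submit options.

```python
import shlex


def kernel_launch_command(kernel_script, user, impersonate=True):
    """Build a spark-submit argv that runs the driver inside the YARN cluster."""
    argv = ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster"]
    if impersonate:
        # Run the application as the notebook user, not the gateway's user.
        argv += ["--proxy-user", user]
    argv.append(kernel_script)  # hypothetical script that starts the kernel
    return argv


cmd = kernel_launch_command("launch_kernel.py", "alice")
print(shlex.join(cmd))
```

In cluster mode the Spark driver, and with it the kernel, runs in a YARN container on a worker node instead of on the gateway host.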
- 48. Jupyter Enterprise Gateway
[Diagram: a Security Layer; Jupyter Enterprise Gateway (multitenancy, remote kernels and kernel lifecycle management); a YARN cluster whose YARN workers run three YARN containers, each holding a Jupyter kernel and Spark driver alongside its Spark executors.]
Callout: Impersonation: Alice’s kernel runs under Alice’s user ID.
- 49. Scalability Benefits
[Bar chart: Maximum Number of Simultaneous Kernels. Y axis: max kernels (4GB heap), 0 to 80. X axis: cluster size (32GB nodes).]
Cluster size    Before JEG    After JEG
4 nodes         8             16
8 nodes         8             32
12 nodes        8             48
16 nodes        8             64
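The chart's numbers follow a simple pattern, modeled below. The "before" value is one node's memory divided by the kernel heap size; the "after" value of 4 kernels per node is read off the chart (16/4 = 32/8 = 48/12 = 64/16), and the slide does not say why it is 4 and not 8, so treat that constant as an observation, not a derivation.

```python
NODE_MEMORY_GB = 32          # from the chart's x-axis label
KERNEL_HEAP_GB = 4           # from the chart's y-axis label
KERNELS_PER_NODE_AFTER = 4   # read off the chart, not derived


def max_kernels_before(nodes):
    """All kernels share one node, so cluster size does not matter."""
    return NODE_MEMORY_GB // KERNEL_HEAP_GB


def max_kernels_after(nodes):
    """Kernels are spread across the cluster, so capacity grows with it."""
    return nodes * KERNELS_PER_NODE_AFTER


for nodes in (4, 8, 12, 16):
    print(nodes, max_kernels_before(nodes), max_kernels_after(nodes))
```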
- 50. Jupyter Enterprise Gateway: Open Source
• Released through the Jupyter Incubator
– BSD License
– https://github.com/jupyter-incubator/enterprise_gateway
– Current release: 0.7.0
- 51. Jupyter Enterprise Gateway: Supported Platforms
• Python/Spark 2.x using IPython kernel
– With Spark Context delayed initialization
• Scala 2.11 / Spark 2.x using Apache Toree kernel
– With Spark Context delayed initialization
• R / Spark 2.x with IRkernel
- 52. Jupyter Enterprise Gateway – Roadmap
• Add support for other resource managers
– Kubernetes support
• Kernel Configuration Profile
– Enable clients to request different resource configurations for kernels (e.g. small, medium, large)
– Profiles should be defined by administrators and enabled per user or group of users.
• Administration UI
– Dashboard with running kernels and administration actions
• Time running, stop/kill, Profile Management, etc
• User Environments
• High Availability
- 53. Jupyter Enterprise Gateway
• Jupyter Enterprise Gateway at IBM Code
– https://developer.ibm.com/code/openprojects/jupyter-enterprise-gateway/
• Jupyter Enterprise Gateway source code at GitHub
– https://github.com/jupyter-incubator/enterprise_gateway
• Docker images
– https://github.com/jupyter-incubator/enterprise_gateway/tree/master/etc/docker
• Jupyter Enterprise Gateway 0.7 release
– https://github.com/jupyter-incubator/enterprise_gateway/releases/tag/v0.7.0
• Jupyter Enterprise Gateway Documentation
– http://jupyter-enterprise-gateway.readthedocs.io/en/latest/
• Free IBM Data Science trial
– https://ibm.biz/BdZceR
- 54. 54
Thank you!
And special thanks to the Jupyter
Enterprise Gateway team: Luciano Resende,
Kevin Bates, Kun Liu, Christian Kadner,
Sanjay Saxena, Alan Chin, Sherry Guo, Alex
Bozarth, Zee Chen
- 58. Jupyter Enterprise Gateway: Deployment
• Ansible deployment scripts
– https://github.com/lresende/spark-cluster-install
• One-click deployment of the Spark cluster
– Configure your host inventory (see the example in the git repository)
– Run the "setup-ambari.yml" playbook
• $ ansible-playbook --verbose setup-ambari.yml -i hosts-fyre-ambari -c paramiko
• One-click deployment of Jupyter Enterprise Gateway
– Run the "setup-enterprise-gateway.yml" playbook
• $ ansible-playbook --verbose setup-enterprise-gateway.yml -i hosts-fyre-ambari -c paramiko
- 59. Jupyter Enterprise Gateway - Deployment
• Docker images
– yarn-spark: Basic one node Spark on Yarn configuration
– enterprise-gateway: Adds Anaconda and Jupyter Enterprise Gateway to the
yarn-spark image
– nb2kg: Minimal Jupyter Notebook client configured with hooks to access the
Enterprise Gateway
– https://github.com/jupyter-incubator/enterprise_gateway/tree/master/etc/docker
• Building the latest docker images
– git clone https://github.com/jupyter-incubator/enterprise_gateway
– cd enterprise_gateway
– make docker-clean docker-images
– Note: The Makefile also has individual targets to clean and build individual images (type make for help)
- 60. Jupyter Enterprise Gateway - Deployment
• Connecting to a Spark Cluster using a docker image
docker run -t --rm \
  -e KG_URL='http://<Enterprise Gateway IP>:8888' \
  -p 8888:8888 \
  -e VALIDATE_KG_CERT='no' \
  -e LOG_LEVEL=DEBUG \
  -e KG_REQUEST_TIMEOUT=40 \
  -e KG_CONNECT_TIMEOUT=40 \
  -v ${HOME}/opensource/jupyter/jupyter-notebooks/:/tmp/notebooks \
  -w /tmp/notebooks \
  elyra/nb2kg:dev
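The KG_URL setting points the notebook client at the gateway's kernel REST API. The sketch below shows, without sending anything, the kind of request such a client makes to start a kernel and the WebSocket URL it would then open. The host name, the kernel name python3, and the kernel id are placeholders, and both helper functions are hypothetical; the /api/kernels paths follow the Jupyter kernel REST API.

```python
import json
import urllib.request


def start_kernel_request(kg_url, kernel_name):
    """Build (but do not send) the POST that asks the gateway for a kernel."""
    return urllib.request.Request(
        f"{kg_url.rstrip('/')}/api/kernels",
        data=json.dumps({"name": kernel_name}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def channels_url(kg_url, kernel_id):
    """WebSocket URL that carries the kernel's messages for one kernel id."""
    ws_base = kg_url.rstrip("/").replace("http", "ws", 1)  # http->ws, https->wss
    return f"{ws_base}/api/kernels/{kernel_id}/channels"


# Placeholder gateway address and kernel name.
request = start_kernel_request("http://gateway.example.com:8888", "python3")
print(request.get_method(), request.full_url)
print(channels_url("http://gateway.example.com:8888", "KERNEL_ID"))
```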
Editor's Notes
- Now, when I first saw these requirements, my initial reaction was, “sounds easy”. I mean, to a first approximation, all that Jupyter is doing is passing strings around.
- This is what I initially thought, and I’ve met a good number of other people who were in the same situation and came up with the same design. The problem with this design is that it’s actually only the first stage of a much longer process that I like to call…
- And in particular, the first stage of this process is called…
- Let me explain.
- All these cool features of Jupyter notebooks rely on an architecture that is substantially more baroque than the cartoon picture from ten slides back…
- When an enterprise architect becomes aware of all this complexity, that’s when he or she moves from stage 1 to stage 2, which is…
- Let me explain.
- This architecture was designed for an academic setting. When you try to transplant it into an enterprise environment and layer enterprise requirements on top of it, things go downhill rather quickly.
- …and the purpose of this talk is to help you to work through this fourth stage as quickly as possible and move on to stage 5, which is…
- …Jupyter Enterprise Gateway. (Bet you thought I was going to say “acceptance”). So, what is Jupyter Enterprise Gateway?
- Min RK