Jupyter Notebook is a nifty tool that you can use in your day-to-day work. To explain its benefits, we will share how we use it to solve our regular puzzles at Elucidata.
But before we deep dive into our specific usage, let’s get some context around Jupyter Notebooks.
What is a Jupyter Notebook?
Jupyter Notebook is an open-source web application that allows a user, scientific researcher, scholar or analyst to create and share a document, called a notebook, containing live code, documentation, graphs and plots.
Jupyter Notebook supports 40+ programming languages, including the most frequently used ones such as Python, R and Julia. It allows the user to download the notebook in various file formats, such as PDF, HTML, Python or Markdown.
What makes Jupyter Notebook the de facto standard for analysis?
Thanks to its multi-language support, rich feature set and rapidly growing popularity in the community, Jupyter Notebook has become a standard for all sorts of analysis, visualization, rapid prototyping, ML and other coding practices.
Who uses Jupyter Notebooks?
Data scientists, data engineers, data analysts, machine learning scientists, research scholars, scientific researchers, or any general user who wants to do scientific computation, data processing or visualization work can use Jupyter Notebook.
Base Architecture Behind Jupyter Notebook
Fig:  Image depicting the base components of the Jupyter Notebook
How a Jupyter Notebook works
When the user saves a notebook, it is sent from the user’s browser to the notebook server, which saves it on disk as a JSON file with a .ipynb extension. The notebook server, not the kernel, is responsible for saving and loading notebooks, so a notebook can still be opened and edited when its kernel is not available; it just cannot run code.
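Because a saved notebook is plain JSON, the on-disk format can be inspected with nothing more than the standard library. The following is a minimal sketch, assuming the nbformat 4 layout (`cells`, `metadata`, `nbformat`, `nbformat_minor`); the file name `example.ipynb` and the helper names are made up for illustration, and in practice the `nbformat` package is the usual way to read and write notebooks.

```python
import json
import os
import tempfile


def build_minimal_notebook(source_lines):
    """Return a dict shaped like an nbformat 4 notebook with one code cell."""
    return {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": None,
                "metadata": {},
                "outputs": [],
                "source": source_lines,
            }
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


def save_notebook(notebook, path):
    """Write the notebook dict to disk as JSON, as the notebook server does."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(notebook, handle, indent=1)


def load_notebook(path):
    """Read a .ipynb file back into a plain dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def count_cells_by_type(notebook):
    """Count cells per cell_type (code, markdown, raw)."""
    counts = {}
    for cell in notebook["cells"]:
        counts[cell["cell_type"]] = counts.get(cell["cell_type"], 0) + 1
    return counts


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "example.ipynb")  # hypothetical file name
        save_notebook(build_minimal_notebook(["print('hello')"]), path)
        print(count_cells_by_type(load_notebook(path)))
```

No kernel is involved anywhere in this round trip, which is why saving and loading keep working even when code cannot be executed.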
Jupyter Notebook @Elucidata
At Elucidata, we are working on this project to develop new features and services on top of the traditional Jupyter Notebook, so that our end users get the best possible experience.
We have built a Jupyter Notebook with a brand new, elegant UI and new custom functionality. We are leaving no stone unturned to make it the best notebook experience a user can have.
We are dedicated to making it what a data scientist, data engineer or data scholar would want on our platform.
Our Use Cases
We introduced Jupyter Notebooks into the ecosystem of our platform, Polly™, to support manipulating, visualizing and programming against the end results of the built-in workflows. Later, we combined the Jupyter Notebook with the JupyterHub architecture to extend it for the following use cases:
- Project: In Polly™, users can go to their project and, from there, open their saved Jupyter Notebooks, create new ones, or upload an existing one. Projects are the most widely used feature for opening, creating, modifying or deleting Jupyter Notebooks in Polly™.
- Templates: We have also built some generic template notebooks to support all the use cases.
- Analysis: Analysis is key to finding valuable insights in the huge genomics and metabolomics datasets uploaded to our platform for running various in-house builds and workflows. By integrating the Jupyter Notebook, we offer our end users and our in-house data scientists a convenient interface for interactively running code, exploring output, and visualizing data, all from a single cloud-based development environment. Along with this, we have added our platform API as custom functionality, which works seamlessly for fetching datasets from our cloud-managed environment directly into the notebook.
- Our in-house build workflows: As we expanded the platform, we introduced new capabilities called workflows. Through the hard work of our engineering team and the collaboration of the data science team, we succeeded in building workflows, which are a series of algorithms that run in a set order to achieve a specific goal.
- Data Scientists can code the experiment using our hosted Jupyter Notebook.
- Software Engineers can code various functionalities using the Jupyter Notebook.
Our Jupyter Notebook System
Supporting such use cases requires a highly scalable infrastructure. Let’s walk through some of the components of our Jupyter Notebook System.
- Docker: Docker plays a key role in the infrastructure behind our interactive notebooks. All our Jupyter Notebooks run inside a containerized environment, with all the pre-configured functionality and library packages that end users might need in their day-to-day tasks.
- JupyterHub: JupyterHub is the high-level layer that handles user authentication, routing, spawning notebook Docker containers, and detecting notebooks that are no longer in use and deleting them.
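Removing notebooks that are no longer in use is commonly done by idle culling: compare each server’s last activity time with a timeout and stop the ones that exceed it. The sketch below shows only that selection logic in plain Python. The function name, the one-hour default timeout and the sample user names are assumptions for illustration; this is not JupyterHub’s API or our production configuration.

```python
from datetime import datetime, timedelta, timezone


def find_idle_servers(last_activity, now, timeout=timedelta(hours=1)):
    """Return the sorted names of servers idle for longer than `timeout`.

    `last_activity` maps a server name to a timezone-aware datetime of
    its most recent activity.
    """
    return sorted(
        name
        for name, seen in last_activity.items()
        if now - seen > timeout
    )


if __name__ == "__main__":
    now = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
    # Hypothetical users: one active recently, one idle for three hours.
    activity = {
        "alice": now - timedelta(minutes=5),
        "bob": now - timedelta(hours=3),
    }
    print(find_idle_servers(activity, now))
```

A real deployment would then stop each returned server through the hub rather than just listing it.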
Why did we choose JupyterHub + Docker?
We didn’t want our users to wrestle with package versions and dependency management. We wanted every user, whether data scientist, data engineer or data analyst, to have an identical, reproducible environment with the same libraries and the same datasets. In fact, the same version of everything.
If we let users install packages on their own pods, environments would diverge depending on the workflow and package installer each user relied on.
A fully hosted environment makes sure that everybody has the same seamless starting point.
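One way to picture what “the same version of everything” means is a check that compares a pinned list of package versions with what is actually installed. The following is a minimal sketch using `importlib.metadata` from the standard library (Python 3.8+). The helper names and the pinned versions are hypothetical; in our setup the Docker image itself is what guarantees the environment, not a script like this.

```python
from importlib import metadata


def installed_versions(package_names):
    """Look up installed versions; packages that are missing map to None."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def find_version_drift(pinned, installed):
    """Return {package: (expected, actual)} for every mismatch."""
    return {
        name: (expected, installed.get(name))
        for name, expected in pinned.items()
        if installed.get(name) != expected
    }


if __name__ == "__main__":
    # Hypothetical pins, for illustration only.
    pinned = {"numpy": "1.18.1", "pandas": "1.0.1"}
    print(find_version_drift(pinned, installed_versions(pinned)))
```

An empty result means the environment matches the pins; anything else is exactly the kind of drift a shared image is meant to rule out.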
- User Interface: We have redesigned the UI of the Jupyter Notebook, using CSS and JS with libraries like jQuery to give it a clean look. It is an intuitive UI with a minimalistic aesthetic, which required thoughtful UX design that makes the hard things easy to do. Below is the look of our notebook.
- Compute: Each user’s virtual machine instance provides one core, 2 GB of RAM and 100 GB of block storage; depending on availability and cluster usage, the compute power can increase. Our cluster has autoscaling enabled, which allows new user pods to be spawned on the fly when demand is high. We have deployed our whole notebook infrastructure on Google Cloud.
- Cluster Management Software: We use Kubernetes to manage our compute instances and cluster. Kubernetes keeps running pods alive when errors occur, maintaining high availability. With Kubernetes we are able to manage 1000+ user pods without losing any data.
- Deployment: We use Helm, a package manager for Kubernetes, to automate our deployment process. Helm ensures that the correct Docker image is deployed and kept for future use, which avoids pulling the image again and reduces spawning time.
- Storage: We use Amazon S3 as the storage system for users’ notebooks and their reusable scripts across the Jupyter Notebook. Each user project has a directory structure on S3 for storing, managing, creating or deleting its notebooks, and users can launch an interactive notebook from within our platform. Following is the snapshot of the users’ project storage on S3.
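The compute figures in the list above (one core, 2 GB of RAM and 100 GB of block storage, with more compute when the cluster has spare capacity) map naturally onto Kubernetes resource requests. The following is a hedged sketch of how such a spec could be built as a plain Python dict. The function name, the wrapper keys `container_resources` and `volume_claim`, and the choice to set requests without limits are assumptions for illustration, not our actual deployment configuration.

```python
import json


def build_user_pod_resources(cpu_cores=1, memory_gb=2, storage_gb=100):
    """Build Kubernetes-style resource settings for a single user pod.

    Only `requests` are set: they describe the guaranteed baseline, and
    leaving out `limits` lets a pod use spare capacity on its node.
    """
    return {
        # Shape of a container's `resources` field.
        "container_resources": {
            "requests": {"cpu": str(cpu_cores), "memory": f"{memory_gb}G"},
        },
        # Shape of a PersistentVolumeClaim `spec` for the block storage.
        "volume_claim": {
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": f"{storage_gb}G"}},
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_user_pod_resources(), indent=2))
```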
Here is a brief tech stack map
What did we learn?
- Reduce human maintenance: It is easy to scale to large scenarios without much human intervention, which avoids bottlenecks. With Helm, we have also reduced the bottlenecks our engineers faced during deployment.
- Great infrastructure: With this development, we have a stable infrastructure that can handle a large number of user requests and multiple contained environments, enabling multiple Docker containers to run on an instance and empowering users to do their tasks.
- Discover new possibilities: During the development and integration of Jupyter Notebook, we discovered several new possibilities that we are working on to give better features and a good user experience to our end users.
For further reference, see the following useful links or schedule a demo with us to see this in action.
Also published on Medium