Reproducible analyses

Objectives 📍

  • the Jupyter ecosystem
  • levels of computational environments

Jupyter notebooks

A Jupyter notebook is a shareable document that combines computer code, plain language descriptions, data, rich visualizations like 3D models, charts, graphs and figures, and interactive controls. A notebook, along with an editor (like JupyterLab), provides a fast interactive environment for prototyping and explaining code, exploring and visualizing data, and sharing ideas with others.
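On disk, a notebook is a JSON file that stores its cells in order. The sketch below builds a minimal one using only the standard library. It assumes the version 4 notebook format; the helper names and the cell contents are made up for illustration.

```python
import json

def markdown_cell(text):
    """A cell holding a plain-language description."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}

def code_cell(code):
    """A cell holding code that has not been executed yet."""
    return {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {},
        "outputs": [],
        "source": code,
    }

def make_notebook(cells):
    """Wrap a list of cells in a minimal notebook dict (format v4 assumed)."""
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}

notebook = make_notebook([
    markdown_cell("# Demographic analysis\nMean age of the sample."),
    code_cell("ages = [23, 35, 41]\nsum(ages) / len(ages)"),
])

# Serialized form; saving this text as a .ipynb file should give a
# document that a notebook editor such as JupyterLab can open.
notebook_json = json.dumps(notebook, indent=1)
print(notebook_json[:200])
```

Because the description, the code, and (once run) the outputs live in one file, sharing that file shares the whole analysis.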

Writing your analyses in a Jupyter notebook not only benefits your own work but also helps make your results reproducible. You can share your notebook together with your conventional paper, so that other researchers or the reviewers of your paper can directly follow and reproduce your analysis. This lends credibility to your work. Besides that, it's just so awesome with its many features, the big community and support!

Example of a Jupyter notebook.

Some universities offer a so-called JupyterHub, a multi-user server that hosts JupyterLab for its users. JupyterLab enables you to work with documents and activities such as Jupyter notebooks, text editors, terminals, and custom components in a flexible, integrated, and extensible manner. For a demonstration of JupyterLab and its features, you can view this video:

Computational Environments

Info

Most of the content was copied from Peer Herholz's "There and back again: a short introduction to virtualization technologies" and from "Docker vs. python virtualization vs. virtual machines" by Dr. Stephen Odaibo.

The problem statement

Imagine you want to analyze some demographic data, including obtaining and reading the data, filtering and descriptive analyses, inferential statistics, and visualization. A colleague has a Python script that does all of these things ready to go and shares it with you. Everything is ok…

The script doesn’t run? The script leads to different results?

...but it worked on my machine?!

What went wrong?

  • Each and every project in a lab depends on a complex software environment:
    • Operating system
    • Drivers
    • Software dependencies: Python, R, MATLAB + libraries

This leads to statements like:

  • "The computer I used was shut down a year ago, I can’t rerun the analyses from my publication…"
  • "The analyses were run by my student, I have no idea where and how..."

Virtualization technologies aim to isolate the computing environment by providing a mechanism to encapsulate environments in a self-contained unit that can run anywhere. Thus, with virtualization techniques it is possible to reconstruct and share computing environments.
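Before an environment can be reconstructed, it has to be recorded. As a minimal sketch (not a replacement for the virtualization tools discussed here), the following standard-library snippet collects the operating system, the Python version, and the installed package versions. The function name and the layout of the returned dict are assumptions made for this example.

```python
import json
import platform
import sys
from importlib import metadata

def environment_snapshot():
    """Collect OS, Python, and installed-package versions as a dict."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata lacks a name
            packages[name] = dist.version
    return {
        "os": platform.platform(),
        "python": platform.python_version(),
        "executable": sys.executable,
        "packages": dict(sorted(packages.items())),
    }

snapshot = environment_snapshot()
# Print only the beginning; the full package list can be long.
print(json.dumps(snapshot, indent=2)[:300])
```

Storing such a snapshot next to the results at least documents where and how an analysis was run, even if the machine itself is later shut down.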

Virtualization technology

There are three main types of virtualization technology:

Virtualization levels

Python virtualization    Containers     Virtual machines
venv                     Docker         VirtualBox
conda                    Singularity    VMware

Python virtualization

Python virtual environments are a mechanism to prevent incompatibilities and other conflicts that arise because third-party Python libraries share, to some extent, the same installation space. For instance, an update of TensorFlow from 1.13 to 2.0 may break any application that relied specifically on TensorFlow 1.13.

To avoid this problem, one would like to configure environments that each pin a specific set of third-party Python package versions. For instance, one virtual environment could be our TF2.0/Keras 2.2.5/Python 2.7.14 environment, while another is our TF2.0/Keras 2.0/Python 3.6.8 environment, and yet another our TF1.10-gpu/Keras 2.3.0/Python 3.6.0 environment. This setup facilitates sandboxing and encourages experimentation by greatly decreasing the risk that we will break anything.
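To make the idea of a version "signature" concrete, here is a small sketch that compares the packages installed in the current environment against a set of expected versions. The function `check_pins` and the pinned versions are hypothetical and only serve as an illustration.

```python
from importlib import metadata

def check_pins(pins):
    """Compare installed package versions against exact expected versions.

    Returns a dict mapping package name -> (expected, installed) that lists
    only packages which are missing or installed at a different version.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches

# Hypothetical pins for one project
pins = {"tensorflow": "1.13.1", "keras": "2.2.5"}
for name, (expected, installed) in check_pins(pins).items():
    print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

Running such a check at the top of a script turns a silent version mismatch into an explicit message.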

Python virtual environments...

  • keep the dependencies required by different projects in separate places
  • allow you to work with specific versions of libraries or of Python itself without affecting other Python projects
  • Applications
    • venv: an environment manager that ships with Python (3.3 and up), usually preinstalled
    • conda: an environment manager and package manager (for Python and beyond)
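As a minimal sketch of what venv does, the snippet below creates an environment from within Python using the standard-library `venv` module and prints the configuration file that marks the directory as a virtual environment. The directory name `demo-env` is a placeholder, and a temporary directory is used so that nothing is left behind.

```python
import tempfile
import venv
from pathlib import Path

def create_environment(path):
    """Create a virtual environment at `path` and return its config file."""
    # with_pip=False keeps the sketch fast; use with_pip=True to also
    # install pip into the new environment.
    venv.create(path, with_pip=False)
    return Path(path) / "pyvenv.cfg"

with tempfile.TemporaryDirectory() as tmp:
    env_dir = Path(tmp) / "demo-env"  # placeholder location
    cfg = create_environment(env_dir)
    print(cfg.read_text())
```

On the command line, the usual equivalent is `python -m venv demo-env`; with conda, a comparable environment would typically be created with `conda create --name demo-env python=3.6.8`.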

Containers

A container wraps an application’s software into a self-contained box with everything the application needs to run: the application code, runtime, system tools, system libraries, and the user-space parts of an operating system. The kernel of the host operating system is shared across containers. One of those container systems is Docker. A Docker container is created from a fully specified set of dependencies. These dependencies, as well as the instructions on how to create the container, are stored in the container’s image. An image is portable and can be published on one of a number of registries/repositories/hubs. Once there, anyone can “pull” it so long as they know its unique name.
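For Docker, the build instructions behind an image are usually written in a text file called a Dockerfile. The sketch below generates a minimal one from Python, to stay in the language of this tutorial. The function name, the file names `requirements.txt` and `analysis.py`, and the choice of the `python:3.11-slim` base image are assumptions for illustration only.

```python
def render_dockerfile(python_version, requirements_file, script):
    """Return the text of a minimal Dockerfile for a Python analysis script."""
    lines = [
        # Start from an official Python base image
        f"FROM python:{python_version}-slim",
        "WORKDIR /analysis",
        # Install the pinned dependencies first
        f"COPY {requirements_file} .",
        f"RUN pip install --no-cache-dir -r {requirements_file}",
        # Then add the analysis script and run it by default
        f"COPY {script} .",
        f'CMD ["python", "{script}"]',
    ]
    return "\n".join(lines) + "\n"

print(render_dockerfile("3.11", "requirements.txt", "analysis.py"))
```

Saved as `Dockerfile`, such a file would typically be turned into an image with `docker build`, and a published image can be fetched by name with `docker pull`.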


Unlike a VM, which provides hardware virtualization, a container provides operating-system-level virtualization by abstracting the “user space”.

optional/reading/further materials