SciPy 2023

Data of an Unusual Size: A practical guide to analysis and interactive visualization of massive datasets
07-11, 08:00–12:00 (America/Chicago), Classroom 203

While most scientists don't work at the scale of black hole imaging research teams, which analyze petabytes of data every day, it is easy to end up in a situation where your laptop doesn't have quite enough power for the analysis you need.

In this hands-on tutorial, you will learn the fundamentals of analyzing massive datasets through real-world examples, running on powerful machines on a public cloud provided by the presenters. We start with how the data is stored and read, then move on to how it is processed and visualized.


"Big data" refers to any data that is too large to handle comfortably with your current tools and infrastructure. As the leading language for data science, Python has many mature options that allow you to work with datasets that are orders of magnitude larger than what can fit into a typical laptop's memory.

This tutorial will help you understand how large-scale analysis differs from local workflows, the unique challenges associated with scale, and some best practices to work productively with your data.

By the end, you will be able to answer:

  • What makes some data formats more efficient at scale?
  • Why, how, and when (and when not) should you use parallel and distributed computation (primarily with Dask) in your work?
  • How can you manage cloud storage, resources, and costs effectively?
  • How can interactive visualization (primarily with hvPlot) make large and complex data more understandable?
  • How can you collaborate comfortably on data science projects with your entire team?

The tutorial focuses on the reasoning, intuition, and best practices around big data workflows, while covering the practical details of Python libraries like Dask and hvPlot that are great at handling large data. It includes plenty of exercises to help you build a foundational understanding within three hours.
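
To give a flavor of the intuition involved, here is a minimal sketch of the chunked (out-of-core) processing idea that underlies many tools for larger-than-memory data: read a bounded number of records at a time, reduce each chunk to a small summary, and combine the summaries. It uses only the Python standard library and is not Dask's API or the tutorial's own material. The function name `chunked_mean`, the column names `trip_id` and `fare`, and the synthetic data are all hypothetical and chosen purely for illustration.

```python
import csv
import io


def chunked_mean(rows, column, chunk_size=1000):
    """Mean of `column`, buffering at most `chunk_size` values at a time."""
    total = 0.0
    count = 0
    chunk = []
    for row in rows:
        chunk.append(float(row[column]))
        if len(chunk) == chunk_size:
            # Reduce the full chunk to two numbers, then discard it.
            total += sum(chunk)
            count += len(chunk)
            chunk = []
    # Fold in the final, possibly partial, chunk.
    total += sum(chunk)
    count += len(chunk)
    return total / count if count else float("nan")


# Hypothetical stand-in for a CSV file too large to load at once.
# With a real file, pass an open file handle to csv.DictReader instead.
data = io.StringIO(
    "trip_id,fare\n" + "\n".join(f"{i},{i % 10}" for i in range(10_000))
)
reader = csv.DictReader(data)  # yields one row at a time, lazily
mean_fare = chunked_mean(reader, "fare", chunk_size=256)
print(mean_fare)
```

Because each chunk is reduced independently before the results are combined, the same pattern can in principle be spread across several cores or machines, which is the kind of work that parallel and distributed libraries automate.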


Prerequisites

The tutorial material will be in the form of Jupyter Notebooks, so a basic understanding of the notebook interface is nice to have, but we will share a quick primer on using Jupyter Notebooks at the beginning of the tutorial. If participants want to run the tutorial materials locally (which is not necessary because the material will be hosted on the cloud for them), a fundamental understanding of the command line interface, git-based version control, and packaging tools like pip and conda will be helpful.

Installation Instructions

https://github.com/nebari-dev/big-data-tutorial/blob/main/00-introduction.ipynb

Pavithra is a Developer Advocate at Quansight, where she works to support the PyData community. She also contributes to the Bokeh and Dask projects, and has helped administer Wikimedia’s outreach programs in the past. In her spare time, she enjoys a good book and hot coffee. :)
