SciPy 2024

Data of an Unusual Size (2024 edition): A practical guide to analysis and interactive visualization of massive datasets
07-09, 13:30–17:30 (US/Pacific), Ballroom A

While most scientists aren't at the scale of the black hole imaging teams that analyze petabytes of data every day, it is easy to end up in a situation where your laptop doesn't have quite enough power for the analysis you need.

In this hands-on tutorial, you will learn the fundamentals of analyzing massive datasets with real-world examples, running on powerful public-cloud machines provided by the presenters – starting from how the data is stored and read, to how it is processed and visualized.

"Big data" refers to any data that is too large to handle comfortably with your current tools and infrastructure. As the leading language for data science, Python has many mature options that allow you to work with datasets that are orders of magnitude larger than what can fit into a typical laptop's memory.
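As a minimal illustration of the idea (not part of the tutorial materials): a dataset larger than memory can still be aggregated by streaming it and keeping only small running totals in RAM. This sketch uses only the standard library; libraries like Dask generalize the same pattern across many files and many cores. The file name and sensor labels are made up for the example.

```python
import csv
import os
import tempfile

# Write a sample CSV to disk -- a stand-in for a file too large for memory.
path = os.path.join(tempfile.mkdtemp(), "measurements.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["sensor", "value"])
    for i in range(90_000):
        writer.writerow([f"s{i % 3}", i % 10])

# Stream the file row by row, keeping only small running aggregates
# in memory -- the core idea behind out-of-core processing.
totals, counts = {}, {}
with open(path, newline="") as f:
    for row in csv.DictReader(f):
        key = row["sensor"]
        totals[key] = totals.get(key, 0.0) + float(row["value"])
        counts[key] = counts.get(key, 0) + 1

means = {k: totals[k] / counts[k] for k in sorted(totals)}
print(means)
```

Tools like Dask apply this same chunk-at-a-time strategy automatically, partitioning the data and scheduling the partial aggregations in parallel.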

This tutorial will help you understand how large-scale analysis differs from local workflows, the unique challenges associated with scale, and some best practices to work productively with your data.

By the end, you will be able to answer:

  • Why, how, and when (and when not) should you leverage parallel and distributed computation in your work?
  • What are the best tools to use at various scales of “big data”, and how to use them practically?
  • How do you manage cloud storage, resources, and costs effectively?
  • How can interactive visualization make large and complex data more understandable?

The tutorial focuses on reasoning, intuition, and the latest best practices around big data workflows. In addition to the Python libraries like Dask and hvPlot discussed last year, this refreshed version covers the practical details of Polars and DuckDB, the current leading tools in this area. It includes plenty of exercises to help you build a foundational understanding within four hours.
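To give a flavor of the query-engine approach behind DuckDB, here is a minimal sketch using Python's built-in sqlite3 module. DuckDB's Python API follows a very similar connect/execute pattern, but adds columnar, larger-than-memory execution and direct scans of Parquet and CSV files. The table and values below are invented for illustration.

```python
import sqlite3

# In-memory database standing in for an embedded analytical engine.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trips (city TEXT, distance REAL)")
con.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("Austin", 2.5), ("Austin", 4.0), ("Tacoma", 1.5)],
)

# The aggregation runs inside the engine; only the small result
# crosses back into Python. In DuckDB, the same pattern scales to
# datasets far larger than RAM.
rows = con.execute(
    "SELECT city, AVG(distance) FROM trips GROUP BY city ORDER BY city"
).fetchall()
print(rows)  # [('Austin', 3.25), ('Tacoma', 1.5)]
```

Polars takes a related approach with its lazy DataFrame API: you describe the whole query first, and the engine optimizes and executes it, reading only the data it needs.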


We expect participants to have some familiarity with Python programming in a data science context. If you know how to create and import Python functions and have some experience doing exploratory data analysis with pandas, you will be able to follow along comfortably.

The tutorial material will be in the form of Jupyter Notebooks, so a basic understanding of the notebook interface is helpful, but we will share a quick primer on using Jupyter Notebooks at the beginning of the tutorial. If you want to run the tutorial materials locally (not necessary, because the material will be hosted on the cloud for you), a fundamental understanding of the command-line interface, git-based version control, and packaging tools like pip and conda will be helpful.

Installation Instructions (will be updated)

Pavithra Eswaramoorthy is a Developer Advocate at Quansight, where she works to improve the developer experience and community engagement for several open source projects in the PyData community. She currently contributes to the Bokeh visualization library, as well as the Nebari (adjacent to the Jupyter community), conda-store (part of the conda ecosystem), and Ragna (a RAG orchestration framework) projects. Pavithra has been involved in the open source community for over 5 years, notably as a maintainer of the Dask library and an administrator for Wikimedia's OSS programs. In her spare time, she enjoys a good book and hot coffee. :)


Dharhas Pothina is the CTO at Quansight, where he helps clients wrangle their data using the PyData stack. His background includes expertise in computational modeling, big data/high-performance computing, visualization, and geospatial analysis. He has been part of the HoloViz (hvPlot) and Dask communities for over 10 years, has given many talks and workshops on distributed computing and big data visualization, and actively leads large-scale data science projects at Quansight.
