SciPy 2023

Advanced Dask Tutorial
07-11, 13:30–17:30 (America/Chicago), Classroom 106

Dask is a Python library for scaling and parallelizing Python code. It provides familiar, high-level interfaces to extend the SciPy ecosystem to larger-than-memory or distributed environments, as well as lower-level interfaces for parallelizing custom algorithms. In this tutorial, we’ll cover advanced features of Dask like applying custom operations to Dask DataFrames and arrays, debugging computations, diagnosing performance issues, and more. Attendees should walk away with a deeper understanding of Dask’s internals, an introduction to more advanced features, and ideas of how they can apply these features effectively to their own workloads.


Dask is a popular Python library for scaling and parallelizing Python code on a single machine or across a cluster. It provides familiar, high-level interfaces to extend the SciPy ecosystem (e.g. NumPy, pandas, scikit-learn) to larger-than-memory or distributed environments, as well as lower-level interfaces for parallelizing custom algorithms and workflows. In this tutorial, we’ll cover advanced features of Dask such as applying custom operations to Dask DataFrames and arrays, inspecting the internal state of clusters, debugging distributed computations, diagnosing performance issues, and more. Attendees should walk away with a deeper understanding of Dask’s internals, an introduction to more advanced features, and ideas of how they can apply these features effectively to their own data-intensive workloads. Basic Dask experience is required, though knowledge of Dask’s internals is not. This hands-on tutorial is intended for existing or aspiring Dask users looking to gain a deeper understanding of intermediate and advanced topics.


Installation Instructions

https://github.com/jsignell/dask-tutorial-advanced/

Prerequisites

This tutorial is intended for working and aspiring data professionals. Working knowledge of the basics of Dask and/or distributed computing is required, though knowledge of Dask's internals is not. Tutorial attendees should walk away with a deeper understanding of Dask’s internals, an introduction to more advanced features, and ideas of how they can apply these features effectively to their own data-intensive workloads.

James Bourbeau is a core maintainer of Dask and an experienced educator who has presented on Dask at various conferences and meetups, such as SciPy, PyCon, and PyData Global. His most recent presentation was an introductory Dask tutorial at SciPy 2020, a recording of which can be found at https://www.youtube.com/watch?v=EybGGLbLipI&t.

Naty is an Open Source Software Engineer at Coiled, a Dask contributor, and an experienced educator. She has taught multiple Dask tutorials at conferences such as SciPy and PyData, at Women Who Code meetups, and in periodic live tutorials. Her most recent presentations were at SciPy 2022 (Dask Tutorial, https://youtu.be/J0NcbvkYPoE) and PyData NYC 2022. In her free time, she likes playing ultimate frisbee, going fly-fishing, and playing video games.