SciPy 2023

Thinking in arrays
07-11, 08:00–12:00 (America/Chicago), Classroom 101

Despite its reputation for being slow, Python is the leading language of scientific computing, which generally requires fast, large-scale computation. This works because most scientific problems can be split into "metadata bookkeeping" and "number crunching," where the latter is performed by array-oriented (vectorized) calls into precompiled routines.
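As a rough illustration of that split, the sketch below computes the same quantity twice: once with an explicit Python loop and once with a single array expression that runs in NumPy's compiled code. The function names and the sum-of-squares task are hypothetical, chosen only for illustration; they are not taken from the tutorial materials.

```python
import numpy as np


def sum_of_squares_loop(values):
    """Imperative version: the Python interpreter executes every step."""
    total = 0.0
    for x in values:
        total += x * x
    return total


def sum_of_squares_array(values):
    """Array-oriented version: each operation is one call into compiled code."""
    return float(np.sum(values * values))


values = np.arange(100_000, dtype=np.float64)

# Both versions agree up to floating-point rounding.
assert np.isclose(sum_of_squares_loop(values), sum_of_squares_array(values))
```

On large inputs the array version is typically much faster, because the per-element work happens outside the interpreter.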

This tutorial is an introduction to array-oriented programming. We'll focus on techniques that are equally useful in NumPy, Pandas, xarray, CuPy, Awkward Array, and other libraries, and we'll work in groups on three class projects: Conway's Game of Life, evaluating decision trees, and computations on ragged arrays.


Array-oriented programming is a paradigm in its own right, challenging us to think about problems in a different way. From APL in 1966 to NumPy today, most users of array-oriented programming are scientists, analyzing or simulating data. This tutorial focuses on the thought process: every problem will be solved both in an imperative way (with for loops) and in an array-oriented way. Matplotlib will be used for plotting, but all plotting commands will be given, so plotting experience is not a prerequisite.

We'll alternate between short lectures and small group projects (3‒4 people each), during which tutors will be available to help. Each project is followed by a guided tour through solutions, alternatives, and trade-offs.

Here is a general outline:

0:00‒0:20 (20 min): Array-oriented programming as a paradigm: APL, SPEAKEASY, IDL, MATLAB, S, R, NumPy. Overview of basic and advanced slicing, broadcasting, and dimensional reduction. Powerful concept: element indexing is function application and advanced slicing is function composition.

0:20‒0:40 (20 min): Project 1: Conway's Game of Life. Calculating number of neighbors and updating the board "all at once."

0:40‒0:55 (15 min): Break

0:55‒1:15 (20 min): Guided discussion of solutions to Project 1.

1:15‒1:35 (20 min): Array-oriented programming and the "iteration until converged" problem. How to update arrays in which some elements have converged and others haven't.

1:35‒1:55 (20 min): Project 2: evaluating a decision tree, by walking over each node individually (as in a computer science class) and by million-ball Plinko! (how Scikit-Learn actually does it).

1:55‒2:10 (15 min): Break

2:10‒2:30 (20 min): Solutions to Project 2.

2:30‒2:45 (15 min): Demo: Mandelbrot (fractal) picture, computed 11 different ways: Python, NumPy, C++ (pybind11), Cython, Numba imperative, Numba vectorized, CuPy, CuPy with custom CUDA, Numba-CUDA, JAX-CPU, and JAX-GPU. Discussion of performance and trade-offs.

2:45‒3:05 (20 min): Non-rectilinear (ragged) arrays and arrays of arbitrary data structures: Apache Arrow and Awkward Array.

3:05‒3:25 (20 min): Project 3: a big, ragged dataset: computing lengths of taxi trips from polylines with varying numbers of edges. Since this is a big dataset, we'll also look at ways to scale it up with Dask.

3:25‒3:40 (15 min): Break

3:40‒4:00 (20 min): Solutions to Project 3.
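To give a flavor of two ideas from the outline, here is a minimal sketch in NumPy. The first part reads integer-array indexing as function composition; the second updates a Game of Life board "all at once." The helper `life_step` and the wraparound (periodic) edges are assumptions made for this sketch, not the tutorial's reference solution.

```python
import numpy as np

# Part 1: an array can be read as a function from index to value.
f = np.array([2, 0, 3, 1])       # f maps 0->2, 1->0, 2->3, 3->1
g = np.array([10, 20, 30, 40])   # g maps 0->10, 1->20, 2->30, 3->40

# Element indexing applies the function: g[2] is "g of 2".
# Integer-array indexing composes functions: g[f][i] == g[f[i]].
g_of_f = g[f]


# Part 2: one Game of Life generation without looping over cells.
def life_step(board):
    """Advance a 2D board of 0s and 1s by one generation.

    Edges wrap around (an assumption made to keep the sketch short).
    """
    # Add up the eight shifted copies of the board to count neighbors.
    neighbors = sum(
        np.roll(board, (dy, dx), axis=(0, 1))
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    born = (board == 0) & (neighbors == 3)
    survives = (board == 1) & ((neighbors == 2) | (neighbors == 3))
    return (born | survives).astype(board.dtype)


# A "blinker": a vertical bar of three cells that flips to horizontal.
blinker = np.zeros((5, 5), dtype=int)
blinker[1:4, 2] = 1
after = life_step(blinker)
```

Counting neighbors by shifting the whole board is one of several possible array-oriented approaches; a convolution would be another.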


Prerequisites

Participants should have a basic familiarity with NumPy, such as the content of the "Introduction to Numerical Computing With NumPy" tutorial.

Installation Instructions

https://github.com/jpivarski-talks/2023-07-11-scipy2023-tutorial-thinking-in-arrays

Jim was trained as a particle physicist with a Ph.D. from Cornell and helped commission the CMS experiment at the Large Hadron Collider (LHC). Then he worked as a data scientist for Open Data Group for 5 years before joining Princeton as a computational physicist in 2016. Now he develops software tools for data analysis in Python, leading the development of Awkward Array, and helps users with a wide range of data analysis problems.