SciPy 2024

Working with U.S. Census Data in Python: Discovery, Analysis, and Visualization
07-09, 13:30–17:30 (US/Pacific), Room 315

The United States Census Bureau publishes over 1,600 data sets via its APIs. These are useful across a myriad of fields in the social sciences. In this interactive tutorial, attendees will learn how to use open-source Python tools to discover, download, analyze, and generate maps of U.S. Census data. The tutorial is full of practical examples and best practices to help participants avoid the tedium of data wrangling and concentrate on their research questions.

This hands-on tutorial will consider the full breadth and richness of data available from the U.S. Census. We will cover not only American Community Survey (ACS) and similarly well-known data sets, but also a number of data sets that are less well-known but nonetheless useful in a variety of research contexts.

The tutorial has no slides. Instead, it will be presented from a series of live Jupyter notebooks. After the instructor presents each lesson notebook, participants will be given a hands-on exercise to put what they just learned into practice. Essentially, they will start with a research question and a blank notebook, then write the code to answer the question.

Lessons will start with the most basic queries and mapping and move through more advanced topics related to geographies, variables, groups and trees of related variables, and data set exploration.
After covering the concepts, the group as a whole will go through a complete end-to-end research example. Finally, individuals and small groups will have a chance to complete a series of short interactive exercises extending what they have learned and share the results with their peers.

All Python tooling used in the workshop is available as open-source software. Final versions of the notebooks used in the tutorial will also be made available as open source.


Objective

The objective of this tutorial is to give attendees immediately applicable skills they can use
to work with U.S. Census data in Python. The presentation is filled with both practical knowledge
and examples of best practices for both basic and advanced use cases.

U.S. Census data can be difficult to wrangle, even though vast quantities of it are available via a
web API. Data sets and variables can be difficult to locate, and geographic hierarchies can be hard
to manage. Something as simple as querying data for all the census tracts in a metropolitan area that
crosses state lines can be non-trivial. The censusdis
package that we will introduce in this tutorial manages this complexity behind a simple interface
that makes it easy to bring the full set of data and maps the U.S. Census Bureau provides into a
Python environment.
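
To make the state-line difficulty concrete, here is a minimal sketch of building the raw Census API queries by hand, using only the Python standard library. It is not how censusdis works internally; it only illustrates the kind of bookkeeping the package is meant to hide. The helper name `tract_query_urls` and the `KC_COUNTIES` mapping are hypothetical, and the county list is an illustrative subset of the Kansas City area, not an official metropolitan area definition. The sketch assumes the 2022 ACS 5-year data set and the total population variable `B01003_001E`.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.census.gov/data"


def tract_query_urls(year, dataset, variables, counties_by_state):
    """Build one Census API URL per (state, county) pair.

    Tracts nest inside counties, which nest inside states, so a metro
    area that spans several states needs a separate query for each piece.
    (Hypothetical helper for illustration; not part of censusdis.)
    """
    urls = []
    for state_fips, county_fips_list in counties_by_state.items():
        for county_fips in county_fips_list:
            params = [
                ("get", ",".join(variables)),
                ("for", "tract:*"),
                ("in", f"state:{state_fips}"),
                ("in", f"county:{county_fips}"),
            ]
            # Keep ":", "*", and "," unescaped so the URL stays readable.
            query = urlencode(params, safe=":*,")
            urls.append(f"{BASE_URL}/{year}/{dataset}?{query}")
    return urls


# Illustrative subset of Kansas City area counties, keyed by state FIPS code:
# Missouri (29) and Kansas (20). Not a complete metro area definition.
KC_COUNTIES = {"29": ["095", "047"], "20": ["091", "209"]}

urls = tract_query_urls(2022, "acs/acs5", ["NAME", "B01003_001E"], KC_COUNTIES)
for url in urls:
    print(url)
```

Even in this small case, the caller has to know which counties belong to the area, issue a separate request for each, and then combine the results, before any analysis or mapping begins.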

Armed with a working knowledge of censusdis, attendees
will be able to spend less time wrangling data and more time answering research questions.

Format

This tutorial will consist of a series of lessons, each 15-20 minutes in length. Each lesson
will be followed by an interactive exercise where attendees will get a chance to write some
code using the concepts they just learned. Solutions will be provided.

After the lessons and exercises, attendees will break into groups of 2-4 and work on
one of several available projects. These projects will require attendees to apply several
of the techniques they learned in the lessons and practiced in the exercises to answer
a research question using U.S. Census data.

Finally, at the end of the session, attendees will have a chance to present the work
they did on their chosen project to the entire group.

Materials

Tutorial materials can be found in GitHub at https://github.com/censusdis/censusdis-tutorial-2024.

This tutorial is an extended, updated, and more interactive version of a 90-minute tutorial presented at
PyData Seattle 2023.

Environment Setup

We will be using Nebari,
an open source data science platform designed to quickly set up and deploy
an opinionated JupyterHub distribution, with built-in integrations and features for collaboration.

Setup is simple and takes 10-15 minutes. We encourage participants to set up their environment in advance of the
start of the tutorial.

Details on how to set up Nebari for this tutorial can be found in the README.md file of the tutorial's
GitHub repository, found at
https://github.com/censusdis/censusdis-tutorial-2024.

Prerequisites

Attendees are expected to have a basic working knowledge of Python. Some experience with Pandas is
helpful but not strictly required. No prior knowledge of the U.S. Census data model is required.


Dr. Vengroff is a Computer Scientist with 20+ years of experience in Data Science, Machine Learning, Algorithms, and Software Development. He is the creator and principal maintainer of several open-source projects including censusdis and divintseg.

Dr. Vengroff has worked with organizations large and small, ranging from tech startups to the Bill and Melinda Gates Foundation, Microsoft, and Amazon. His recent work centers on metrics of diversity and integration (e.g. an interactive map of diversity and integration in the U.S.), and modeling techniques to identify systematic bias in areas including home valuation, eviction, and food accessibility. He holds a B.S.E. from Princeton University and an Sc.M. and Ph.D. from Brown University.

Dr. Vengroff's blog can be found at https://datapinions.com.
