SciPy 2024

How to bootstrap a Data Warehouse with DuckDB
07-12, 13:15–13:45 (US/Pacific), Room 316

A Data Warehouse (DW) is a powerful tool to manage your scientific data, training data, logs, or any other type of relational data. Most Data Warehouses are cloud-based and built to scale to petabyte workflows, but might not be optimal for smaller workloads that need a fast iteration cycle. Likewise, a collection of CSV files and python scripts can become painful to share and maintain. This is where DuckDB comes in! DuckDB is a fast, in-process database that you can run on your laptop, supports a rich SQL dialect, and you can push to the cloud with just a single line of code. In this talk, we’ll show you how to bootstrap a Data Warehouse on your laptop using open source, including ETL (extract-transform-load) data pipelines, dashboard visualization, and sharing via the cloud.


The goal of this talk is to show you how to set up a Data Warehouse from scratch with DuckDB and other open source tools. This includes a data pipeline that pre-processes and saves data to the DW as it comes in and a live data exploration dashboard, all running right on your laptop. Then we'll show you how you can move your work to the cloud so you can easily share, publish, collaborate, visualize, and scale your data.

For the demo we'll use all open source or free-tier tools: dbt (https://www.getdbt.com/) to create the ETL pipeline, Apache Superset (https://superset.apache.org/) for data exploration and visualization, and MotherDuck (https://www.motherduck.com) with AWS (https://aws.amazon.com) for cloud sharing. All the tools you'll need can be installed with pip and are built with Python.

This talk is for you if you are:

  • a student or researcher who wants to learn about data engineering best practices and create your own data warehouse with limited resources
  • a seasoned Data Engineer who is using an enterprise cloud data warehouse and wants to learn about DuckDB for production workflows
  • …or simply curious about DuckDB and want to see it in action.

Talk outline:

  • What is a Data Warehouse and who needs one?
    • Talk about ETL (extract-transform-load), data pipelines, databases, and how they all come together in a Data Warehouse
    • Give some examples of companies that have data warehouses and how they use them
    • Talk about the similarities that scientific data workloads share with commercial data flows and how reproducibility comes in
  • What is DuckDB?
    • Explain why DuckDB was created, what problem does it solve?
    • Compare to Pandas, Postgres and Sqlite
  • Technical how-tos and demos! In this part we'll do a demo and explain the code and how to set everything up. We'll share the code via GitHub so you can try it out yourself afterward.
    • How to use DuckDB as a local storage hub for your data
    • How to set up dbt to load your data into DuckDB
    • How to explore and visualize your data
    • How to share your data with collaborators or publish it online

Image credit: Created with Magic Media by Canva
Prompt: A yellow rubber duck that's sitting on top of a cloud and flying in the clouds

Guen is a software engineer at MotherDuck on the Ecosystems team. Previously, she was a Sr. Quantum Measurement Engineer at Microsoft. Guen has broad experience with software engineering, data engineering and data science with Python in the context of scientific data acquisition, analysis and computation for experimental physics and biotech. She has given introductory talks and workshops on quantum computing with Python at various conferences, hackathons and events.

This speaker also appears in:

Software Engineer at MotherDuck. Previously data at The New York Times, Better.com. M.Sc CompSci (Columbia)

Alex is a forward deployed software engineer at MotherDuck and writes blogs and docs part time for DuckDB Labs. He has a bachelor's in Industrial and Systems Engineering from Virginia Tech. Alex recently joined MotherDuck after 9 years at Intel. After starting at Intel as an industrial engineer, Alex later became a technical analyst, and then moved into a data scientist role. Back in 2020 Alex discovered DuckDB while building an internal self-service analytics platform. It was such a perfect fit that he quickly integrated it and began using it in multiple projects. Alex also became one of DuckDB's biggest Twitter fans! He has been diving deeper into duck-themed databases ever since.

This speaker also appears in: