SciPy 2024

All the SQL a Pythonista needs to know: an introduction to SQL and DataFrames with DuckDB
07-09, 08:00–12:00 (US/Pacific), Ballroom A

Structured Query Language (or SQL for short) is a programming language to manage data in a database system and an essential part of any data engineer’s tool kit. In this tutorial, you will learn how to use SQL to create databases, tables, insert data into them and extract, filter, join data or make calculations using queries. We will use DuckDB, a new open source embedded in-process database system that combines cutting edge database research with dataframe-inspired ease of use. DuckDB is only a pip install away (with zero dependencies), and runs right on your laptop. You will learn how to use DuckDB with your existing Python tools like Pandas, Polars, and Ibis to simplify and speed up your pipelines. Lastly, you will learn how to use SQL to create fast, interactive data visualizations, and how to teach your data how to fly and share it via the Cloud.


In this tutorial, you will…

  • Learn how to quack SQL with DuckDB. We will dedicate the first half of the tutorial to a beginner-friendly introduction to SQL. You'll learn to load data, get a table, filter a table by column names, add a calculated column to your table and handy tools like "group by all". We'll also learn how to work with multiple tables by using joins and subqueries. For the exercises, you can use the example data we provide, or BYO data.
  • Learn how to use SQL seamlessly with your favorite Python tools. Here we’ll talk a bit about why we think DuckDB is so helpful for a Pythonista to have in their toolkit! If SQL is the right tool for the job, DuckDB is fluent in the world’s friendliest SQL dialect. DuckDB can also fit seamlessly into any existing dataframe workflow by reading and writing Pandas, Apache Arrow, and Polars dataframes. You'll learn how to use the DuckDB engine with no SQL in sight using either DuckDB’s relational API, PySpark API or Ibis.
  • Run SQL queries in live data viz for on-the-fly analytics. You'll learn how to use DuckDB in the browser and run SQL queries as part of interactive data visualization, and how to plot data directly from DuckDB with Matplotlib.
  • Share your data via the Cloud. Now that you've mastered the basics of SQL, you will teach your data to fly with Cloud providers such as AWS and MotherDuck.

Prerequisites

Beginner/Intermediate Python skills
Basic knowledge of DataFrames in Python (e.g. Pandas, Ibis)

Installation Instructions

1) Create a Google account to access Google Colab or BYO Jupyter Notebook, i.e. set up a Python environment that has Jupyter Notebook, pandas and duckdb installed, 2) (Optional for "Do more with your data" section) Sign up for MotherDuck via https://app.motherduck.com/?auth_flow=signup

Guen is a software engineer at MotherDuck on the Ecosystems team. Previously, she was a Sr. Quantum Measurement Engineer at Microsoft. Guen has broad experience with software engineering, data engineering and data science with Python in the context of scientific data acquisition, analysis and computation for experimental physics and biotech. She has given introductory talks and workshops on quantum computing with Python at various conferences, hackathons and events.

This speaker also appears in:

Alex is a forward deployed software engineer at MotherDuck and writes blogs and docs part time for DuckDB Labs. He has a bachelor's in Industrial and Systems Engineering from Virginia Tech. Alex recently joined MotherDuck after 9 years at Intel. After starting at Intel as an industrial engineer, Alex later became a technical analyst, and then moved into a data scientist role. Back in 2020 Alex discovered DuckDB while building an internal self-service analytics platform. It was such a perfect fit that he quickly integrated it and began using it in multiple projects. Alex also became one of DuckDB's biggest Twitter fans! He has been diving deeper into duck-themed databases ever since.

This speaker also appears in:

Mehdi (aka mehdio) is a data enthusiast with nearly a decade of experience in data engineering for companies of all sizes. He's not your average data guy, injecting humor and fun into his work to make complex topics easier to digest. When he's not actively contributing to the data community through his blog, YouTube, and social media, you can find him off-beat, marching to the beat of his own data drum. In 2023, Mehdi joined Motherduck as a developer advocate, bringing his data eng expertise to supercharge DuckDB.

Elena Felder works on ecosystem integrations at MotherDuck, a DuckDB-powered serverless data warehouse. Diving into data, and helping rows, columns and numbers tell interesting stories has always been a personal hobby! After working mostly in Java for many years, Elena keeps being awed at how dozens of lines of Java code can often be solved by a Python one liner. Prior to embarking on a duck-shaped startup adventure, she led Cloud Java frameworks integrations at Google.