SciPy 2025

Building an AI Agent for Natural Language to SQL Query Execution on Live Databases
07-08, 08:00–12:00 (US/Pacific), Ballroom A

This hands-on tutorial will guide participants through building an end-to-end AI agent that translates natural language questions into SQL queries, executes them on live databases, and returns coherent responses. Using the Retrieval-Augmented Generation (RAG) approach with modern LLMs, participants will learn how to construct robust NL2SQL systems that understand database schema, respect database constraints, and generate accurate SQL. By the end of this 4-hour session, attendees will have created a working prototype using the Brazilian E-Commerce dataset that they can adapt to their own data sources.


Overview

Natural Language to SQL systems enable non-technical users to access database insights. This tutorial bridges the gap between theoretical understanding and practical implementation through a RAG-based approach.

Participants will build an AI agent that can:
1. Ingest and understand database schemas
2. Retrieve relevant context about tables and relationships
3. Generate accurate SQL from natural language questions
4. Execute queries safely on live databases
5. Present results in an understandable format

We'll use the Kaggle dataset "Brazilian E-Commerce dataset by Olist" as our working example, demonstrating how to handle multiple tables with complex relationships. This dataset will be hosted on an EC2 AWS instance for live interaction during the tutorial.

The tutorial balances theoretical foundations with hands-on practice. Participants will start from a repository with backbone code and implement the key components during the session. By the end, attendees will have a working prototype they can adapt to their own datasets.

Tools and Frameworks

This tutorial will leverage modern tools and frameworks for efficient development:

Development Tools:
- pyproject.toml for standardized project configuration
- UV for fast, reliable package management
- Ruff for comprehensive Python linting and formatting
- YAML for configuration management

AI and RAG Frameworks:
- LangChain's agent framework for orchestration
- OpenAI models (GPT-4) with examples of alternatives
- Vector databases (pgvector) for efficient retrieval

Database Tools:
- SQLAlchemy for database interactions
- pandas for data manipulation and analysis
- PostgreSQL as the database engine for the live dataset


Installation Instructions

Follow the official instructions to install Git, UV (astral-uv), and VSCode. All other Python dependencies will be managed through this setup.

Prerequisites
  • Python programming experience
  • Basic understanding of API interactions
  • Basic familiarity with SQL and database management
  • Laptop with Git and UV, UV, and a Python IDE (I recommend VSCode) installed.

No prior experience with LLMs, RAG, or advanced NLP is required.

I’m a data scientist and AI engineer with 10+ years of experience across academic research and industry, building GenAI and machine learning solutions for research labs, startups, and Fortune 500 companies. I’m also a passionate educator, contributing to data training programs as a professor and consultant, and an active open-source contributor and speaker at conferences like PyData.