SciPy 2025

Building an AI Agent for Natural Language to SQL Query Execution on Live Databases
07-08, 08:00–12:00 (US/Pacific), Ballroom A

This hands-on tutorial will guide participants through building an end-to-end AI agent that translates natural language questions into SQL queries, validates and executes them on live databases, and returns accurate responses. Participants will build a system that intelligently routes between a specialized SQL agent and a ReAct chat agent, implementing RAG for query similarity matching, comprehensive safety validation, and human-in-the-loop confirmation. By the end of this 4-hour session, attendees will have created a powerful and extensible system they can adapt to their own data sources.


Overview

Natural‑language interfaces unlock database insights for non‑technical users. This tutorial provides a practical implementation for building these systems reliably and effectively.

Participants will build an AI agent system that can:

  1. Route intelligently between SQL generation and ReAct chat agent workflows
  2. Ingest and understand database schemas with domain knowledge
  3. Retrieve relevant context and similar query examples using RAG with vector similarity
  4. Generate accurate SQL with validation and safety guardrails
  5. Execute queries safely with human-in-the-loop approval
  6. Present results in an understandable format
  7. Track costs and monitor performance using LangSmith
  8. Manage session-based memory and conversation context

We'll use the Kaggle dataset "Brazilian E-Commerce dataset by Olist" as our working example, demonstrating how to handle multiple tables across two schemas with complex relationships. This dataset will be hosted on an EC2 AWS instance for live interaction during the tutorial.

This tutorial addresses real-world database complexity with production-grade considerations. Participants will start from a repository with backbone code and implement the key components during the session. By the end, attendees will have a working system they can adapt to their own datasets.

Tools and Frameworks

This tutorial will leverage modern tools and frameworks for efficient development:

AI and Agent Frameworks:
- LangGraph for agent orchestration and workflow management
- LangChain for agent components and LLM interactions
- LangSmith for comprehensive cost tracking and monitoring
- OpenAI models with examples of alternatives

Database and Vector Store:
- SQLAlchemy for database interactions and schema retrieval
- PostgreSQL as the database engine for the live dataset
- PGVector for similarity-based query retrieval

Development:
- YAML for configuration management
- pyproject.toml for standardized project configuration
- UV reliable package management and Ruff for code formatting/linting


Installation Instructions

Follow the official instructions to install Git, UV (astral-uv), and your preferred Python IDE. All Python dependencies will be managed through this setup.

Prerequisites
  • Python programming experience
  • Basic understanding of API interactions
  • Basic familiarity with SQL and database management
  • Laptop with Git and UV, and your preferred Python IDE (I recommend VSCode) installed.

No prior experience with LLMs, RAG, or advanced NLP is required.

See also: SQL Agent Diagram (669.5 KB)

I’m a data scientist and AI engineer with 10+ years of experience across academic research and industry, building GenAI and machine learning solutions for research labs, startups, and Fortune 500 companies. I’m also a passionate educator, contributing to data training programs as a professor and consultant, and an active open-source contributor and speaker at conferences like PyData.