SciPy 2025

Accelerating scientific data releases: Automated metadata generation with LLM agents
07-11, 14:35–15:05 (US/Pacific), Room 315

The rapid growth of scientific data repositories demands innovative solutions for efficient metadata creation. In this talk, we present our open-source project that leverages large language models to automate the generation of standard-compliant metadata files from raw scientific datasets. Our approach uses pre-trained open-source models, fine-tuned with domain-specific data and integrated with LangGraph, to orchestrate a modular, end-to-end pipeline that ingests heterogeneous raw data files and outputs metadata conforming to specific standards.

The methodology involves a multi-stage process where raw data is first parsed and analyzed by the LLM to extract relevant scientific and contextual information. This information is then structured into metadata templates that adhere strictly to recognized standards, thereby reducing human error and accelerating the data release cycle. We demonstrate the effectiveness of our approach using the USGS ScienceBase repository, where we have successfully generated metadata for a variety of scientific datasets, including images, time series, and text data.
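The multi-stage flow described above can be sketched roughly as follows. This is an illustration only, not the project's code (which has not been released yet): the stage names, the `PipelineState` class, the `REQUIRED_FIELDS` tuple, and the `fake_llm` stub are all hypothetical, and plain Python functions stand in for whatever orchestration the real pipeline uses.

```python
import json
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical required fields; a real metadata standard defines many more.
REQUIRED_FIELDS = ("title", "abstract", "keywords")


@dataclass
class PipelineState:
    """Shared state handed from one stage to the next."""
    raw_text: str
    extracted: dict = field(default_factory=dict)
    metadata: dict = field(default_factory=dict)
    errors: list = field(default_factory=list)


def parse_stage(state: PipelineState, llm: Callable[[str], str]) -> PipelineState:
    """Ask the model to extract descriptive information as JSON."""
    prompt = "Extract title, abstract, and keywords as JSON from:\n" + state.raw_text
    try:
        parsed = json.loads(llm(prompt))
    except json.JSONDecodeError as exc:
        state.errors.append(f"LLM returned invalid JSON: {exc}")
        return state
    if isinstance(parsed, dict):
        state.extracted = parsed
    else:
        state.errors.append("LLM output was not a JSON object")
    return state


def template_stage(state: PipelineState) -> PipelineState:
    """Copy extracted values into a fixed template, ignoring unknown keys."""
    state.metadata = {name: state.extracted.get(name) for name in REQUIRED_FIELDS}
    return state


def validate_stage(state: PipelineState) -> PipelineState:
    """Record every required field that is missing or empty."""
    for name in REQUIRED_FIELDS:
        if not state.metadata.get(name):
            state.errors.append(f"missing required field: {name}")
    return state


def run_pipeline(raw_text: str, llm: Callable[[str], str]) -> PipelineState:
    state = PipelineState(raw_text=raw_text)
    state = parse_stage(state, llm)
    state = template_stage(state)
    return validate_stage(state)


def fake_llm(prompt: str) -> str:
    """Stand-in for a fine-tuned model so the sketch runs offline."""
    return json.dumps({
        "title": "Stream gauge readings",
        "abstract": "Hourly discharge measurements.",
        "keywords": ["hydrology"],
    })


result = run_pipeline("site,datetime,discharge\n01,2024-01-01T00:00,12.5", fake_llm)
print(result.metadata["title"], result.errors)
```

Keeping validation as its own stage means malformed model output is reported as errors rather than written into a metadata file, which is one simple way to approach the "reducing human error" goal without trusting the model blindly.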

Beyond its immediate application to the USGS ScienceBase repository, our open-source framework is designed to be extensible, allowing adaptation to other data release processes across various scientific domains. We will discuss the technical challenges encountered, such as managing diverse data formats and ensuring metadata quality, and outline strategies for community-driven enhancements. This work not only streamlines the metadata creation workflow but also sets the stage for broader adoption of generative AI in scientific data management.

Additional Material:
- Project supported by USGS and ORNL
- Codebase will be available on GitHub after paper publication
- Fine-tuned LLMs will be available on Hugging Face after paper publication


We present a novel approach to automating metadata generation for scientific data using LLM agents. Building on the use case of the USGS ScienceBase data repository, we demonstrate how pre-trained models can be fine-tuned to understand new scientific datasets and generate standard-compliant metadata. Our system orchestrates a modular pipeline that leverages multiple LLM agents to parse, analyze, and generate high-quality metadata for a variety of scientific datasets, including images, time series, and text data. We discuss the technical challenges and opportunities of using LLMs for metadata generation and outline strategies for community-driven enhancements.

Software Engineer at Oak Ridge National Laboratory, specializing in generative AI and machine learning applications for scientific data processing. Dedicated to designing scalable AI architectures, collaborating with research teams, and integrating AI-driven solutions to enhance data workflows.

Chirag Shah is an environmental data science and full-stack software engineer with a keen interest in climate science and cutting-edge technologies. As an integral part of both the Atmospheric Radiation Measurement Facility (ARM) Data Center and the U.S. Geological Survey (USGS) Core Science Systems group, he works closely with product owners, scientists, and researchers to design and develop next-generation tools for managing, analyzing, and visualizing scientific data. These tools are aimed at advancing research in Earth, climate, and environmental sciences.

His software development interests include data management, data analytics, artificial intelligence, machine learning, the Internet of Things (IoT), and machine-to-machine communication. Chirag aims to harness the power of technology to solve real-world challenges and is committed to staying at the forefront of technological advancements.