SciPy 2025

Chirag

Chirag Shah is an environmental data science and full-stack software engineer with a keen interest in climate science and cutting-edge technologies. As a member of both the Atmospheric Radiation Measurement Facility (ARM) Data Center and the U.S. Geological Survey (USGS) Core Science Systems group, he works closely with product owners, scientists, and researchers to design and develop next-generation tools for managing, analyzing, and visualizing scientific data. These tools aim to advance research in the Earth, climate, and environmental sciences.

His software development interests include data management, data analytics, artificial intelligence, machine learning, the Internet of Things (IoT), and machine-to-machine communication. He aims to harness technology to solve real-world challenges and is committed to staying at the forefront of technological advancements.


Sessions

07-11
14:35
30min
Accelerating scientific data releases: Automated metadata generation with LLM agents
Tudor Garbulet, Chirag

The rapid growth of scientific data repositories demands innovative solutions for efficient metadata creation. In this talk, we present our open-source project that leverages large language models to automate the generation of standard-compliant metadata files from raw scientific datasets. Our approach uses pre-trained open-source models, fine-tuned with domain-specific data and orchestrated with LangGraph into a modular, end-to-end pipeline that ingests heterogeneous raw data files and outputs metadata conforming to specific standards.

The methodology involves a multi-stage process where raw data is first parsed and analyzed by the LLM to extract relevant scientific and contextual information. This information is then structured into metadata templates that adhere strictly to recognized standards, thereby reducing human error and accelerating the data release cycle. We demonstrate the effectiveness of our approach using the USGS ScienceBase repository, where we have successfully generated metadata for a variety of scientific datasets, including images, time series, and text data.
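The project's code is not yet published, so the following is only a minimal sketch of the multi-stage flow described above (parse, extract, fill a template), not the authors' implementation. Every function name and the three template fields are hypothetical, the LLM call is replaced by a deterministic stub so the example runs offline, and plain function composition stands in for the LangGraph orchestration.

```python
import csv
import io
import json

# Hypothetical template fields; a real metadata standard defines many more.
REQUIRED_FIELDS = ("title", "abstract", "keywords")


def parse_raw(text: str) -> dict:
    """Stage 1: summarize a raw CSV file (column names and row count)."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    return {"columns": header, "n_rows": len(body)}


def extract_with_llm(summary: dict) -> dict:
    """Stage 2: stand-in for the LLM step.

    Assumption: a real pipeline would prompt a fine-tuned model here.
    This stub derives the values directly from the parsed summary.
    """
    columns = ", ".join(summary["columns"])
    return {
        "title": f"Dataset with columns {columns}",
        "abstract": f"Tabular data with {summary['n_rows']} rows.",
        "keywords": summary["columns"],
    }


def fill_template(info: dict) -> dict:
    """Stage 3: map extracted information onto the template and reject gaps."""
    record = {name: info.get(name) for name in REQUIRED_FIELDS}
    missing = [name for name, value in record.items() if not value]
    if missing:
        raise ValueError(f"missing metadata fields: {missing}")
    return record


def run_pipeline(text: str) -> dict:
    """Chain the three stages in order."""
    return fill_template(extract_with_llm(parse_raw(text)))


if __name__ == "__main__":
    sample = "time,temperature\n0,21.5\n1,21.7\n"
    print(json.dumps(run_pipeline(sample), indent=2))
```

Keeping each stage a separate function mirrors the modular design the abstract describes: the stub in `extract_with_llm` could be swapped for a model call without touching parsing or template checks.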

Beyond its immediate application to the USGS ScienceBase repository, our open-source framework is designed to be extensible, allowing adaptation to other data release processes across various scientific domains. We will discuss the technical challenges encountered, such as managing diverse data formats and ensuring metadata quality, and outline strategies for community-driven enhancements. This work not only streamlines the metadata creation workflow but also sets the stage for broader adoption of generative AI in scientific data management.

Additional Material:
- Project supported by USGS and ORNL
- Codebase will be available on GitHub after paper publication
- Fine-tuned LLMs will be available on Hugging Face after paper publication

Machine Learning, Data Science, and Explainable AI
Room 315