SciPy 2024

Generative AI Copilot for Scientific Software – a RAG-Based Approach using OLMo
07-09, 13:30–17:30 (US/Pacific), Ballroom D

Generative AI systems built upon large language models (LLMs) have shown great promise as tools that let people access information through natural conversation. Scientists can build on these breakthroughs to create advanced tools that accelerate their research. This tutorial will cover: (1) the basics of language models, (2) setting up an environment for using open-source LLMs without the expensive compute resources needed for training or fine-tuning, (3) using Retrieval-Augmented Generation (RAG) to improve the output of an LLM, and (4) building a “production-ready” app that demonstrates how researchers can turn disparate knowledge bases into special-purpose AI-powered tools. The intended audience is scientists and research engineers who want to use LLMs in their work.


Generative AI is changing the way we adopt and use technology, and its power is already visible in everyday interactions with consumer tech and tools. This tutorial applies the underlying methods of Generative AI to scientific research. It introduces and implements RAG, a technique for improving the accuracy and reliability of a generative AI model. We will go through the fundamentals of RAG and then implement it over documents retrieved from existing GitHub repositories and several domain-specific knowledge bases.

The language model used in the tutorial is the Allen Institute for AI (AI2) Open Language Model (OLMo), an LLM with open data, code, weights, and evaluation benchmarks. OLMo is purpose-built for scientific discovery: it has been trained on Dolma, an open dataset of 3 trillion tokens collected from diverse web content, academic publications, code, books, and encyclopedic materials. LangChain is a Python and JavaScript framework for developing applications powered by LLMs. Using LangChain, we create a context-aware question-answering agent by implementing a RAG chain.

Using a simple example from the astronomy community, we demonstrate how the tool answers correctly with RAG-enabled context and incorrectly without it. By the end of the tutorial, attendees will have created an AI-powered question-answering application that they can use to advance their research.
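The retrieval half of RAG can be illustrated without any LLM or framework. The sketch below is not the tutorial's LangChain pipeline: it is a minimal, standard-library stand-in that ranks a few hypothetical document snippets against a question using bag-of-words cosine similarity. The names `tokenize`, `cosine`, `retrieve`, and `DOCS`, and the snippet texts themselves, are assumptions made for illustration; a real system would typically use learned embeddings and a vector store.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, documents, k=1):
    """Return the k documents most similar to the question."""
    query = Counter(tokenize(question))
    ranked = sorted(
        documents,
        key=lambda doc: cosine(query, Counter(tokenize(doc))),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical knowledge-base snippets (placeholders, not real documentation text).
DOCS = [
    "Astropy is a community-developed Python package for astronomy.",
    "NumPy provides n-dimensional arrays for numerical computing.",
    "LangChain is a framework for building applications powered by LLMs.",
]

top = retrieve("What is Astropy?", DOCS, k=1)
print(top[0])
```

In a full RAG chain, the retrieved snippets would then be placed into the prompt sent to the language model so that the answer is grounded in them.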

The following is an example of the output for a simple use case from the tutorial:

Before Implementing RAG

<|user|> 

What is Astropy?  

<|assistant|> 

A community-driven effort to build a common platform of astronomical tools and data formats, Astpy (short for "astronomy") is an open-source project that aims to create a unified framework for working with data in astronomy. Built on top of NumPy, SciPy, and other popular scientific libraries, Astropy provides functions, classes, and tools to facilitate and simplify the processing and analysis of astronomical datasets and simulations. This platform offers consistency in data structures, formatting, and interfaces across different packages, making it easier for researchers to work with large-scale observational and theoretical datasets alike. The project serves as a central location for many open-source tools, simplifying access and encouraging collaboration among astronomers and computer scientists.

After Implementing RAG

<|user|> 

What is Astropy?

<|assistant|> 

Astropy is a community-developed open-source project that provides an integrated package for astronomical data analysis in Python. It includes various functionalities, such as time management, coordinate transformations, and data manipulation. Astropy is designed to be compatible with other popular data science tools like NumPy, SciPy, and IPython.

The context provided describes the `astropy.utils` package, which contains various utility functions and classes that are not astronomy-specific but mainly for use within the Astropy project. This package covers topics such as downloading data from URLs, data structures, compatibility with older versions of Python/Numpy, and version intercompatibility functions.

To summarize, Astropy is an open-source astronomical data analysis toolkit built upon existing packages like NumPy, SciPy, and IPython to provide a cohesive experience for users seeking astronomy-related functionality in Python. The `astropy.utils` package contains various utility functions and classes that are not specific to astronomy but instead cater to the Astropy project's requirements and internal compatibility concerns.
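The difference between the two answers above comes from what the model is given, not from the model itself. The following sketch shows one way retrieved text could be placed into the user turn before generation. The `build_prompt` helper, the instruction wording, and the placeholder context string are all hypothetical, and the `<|user|>` / `<|assistant|>` layout simply mirrors the markers shown in the example output rather than a confirmed OLMo chat template.

```python
def build_prompt(question, context_chunks=()):
    """Return a chat-style prompt, optionally augmented with retrieved context.

    The <|user|> / <|assistant|> markers mirror the example output above;
    the exact template a given model expects is an assumption here.
    """
    if context_chunks:
        context = "\n\n".join(context_chunks)
        user_turn = (
            "Answer the question using the context below.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}"
        )
    else:
        user_turn = question
    return f"<|user|>\n{user_turn}\n<|assistant|>\n"


# Without RAG: the model sees only the question.
plain = build_prompt("What is Astropy?")

# With RAG: a retrieved snippet (placeholder text, not real documentation)
# is inserted ahead of the question.
augmented = build_prompt(
    "What is Astropy?",
    ["astropy.utils: utility functions and classes used within the Astropy project."],
)

print(plain)
print(augmented)
```

Because the retrieved snippet here concerns `astropy.utils`, a model answering from the augmented prompt would be expected to mention that package, as the "After Implementing RAG" answer does.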

Tutorial Outline (4 hours total)

Module 1: Introduction to LLMs and Problem Setup (1 hour 15 minutes)

  • Python Environment Setup
  • Introduction to Language Models
  • LLMs, Prompt Engineering, and OLMo
  • LangChain: The LLM Application Framework

Break 1 (15 minutes)

Module 2: Building an LLM-Powered Domain-Specific Question and Answer System (1 hour 15 minutes)

  • Domain-Specific Question and Answer Examples
  • Astrophysics Prompting with OLMo
  • Retrieval-Augmented Text Generation

Break 2 (15 minutes)

Module 3: Creating the Application (1 hour)

  • Astrophysics Chat Application
  • Hands-on group exercise and practice: Try it with your data!

Acknowledgements

We would like to thank Professor Andy Connolly for providing the questions and the domain expertise for the astronomy use case of this tutorial.

We also thank the Allen Institute for AI (AI2) for providing the Open Language Model (OLMo) and the Dolma dataset, and the organizers of SciPy 2024 for giving us the opportunity to present this tutorial.

This tutorial was developed by engineers at the University of Washington's Scientific Software Engineering Center (SSEC), supported by Schmidt Sciences, LLC, as part of the Virtual Institutes for Scientific Software (VISS).


Prerequisites

No prior experience with LLMs is required, but familiarity with the Jupyter environment will be helpful.

Installation Instructions

You can find our setup instructions at https://github.com/uw-ssec/tutorials/blob/main/SciPy2024/module1/setup.md.

Don Setiawan is a Senior Research Software Engineer at the University of Washington, eScience Institute, Scientific Software Engineering Center (SSEC). He has expertise in Python programming, web development, geospatial data analytics, and cloud-based data engineering. He is interested in building scalable, open software to facilitate scientific discovery across fields and in enforcing software best practices. He has been involved with various open-source software projects with the Ocean Observatories Initiative (OOI), the U.S. Integrated Ocean Observing System (IOOS), the National Oceanic and Atmospheric Administration (NOAA), and the National Aeronautics and Space Administration (NASA).

Anshul Tambay is a Technical Program Manager with the UW Scientific Software Engineering Center (SSEC) at the eScience Institute. He aims to develop open-source infrastructure that bolsters research across a variety of disciplines.

Prior to joining SSEC, Anshul worked as a Data Analyst at Northwestern University’s Center for Neighborhood Engaged Research and Science, focusing on community violence intervention programs in Chicago. His other experience includes working in support at a tele-health software company and on a development study in Ethiopia, evaluating a mobile phone-based experience sampling method of measuring time use. Anshul received his B.A. in Economics and Mathematics from Grinnell College in Iowa.

Outside of work, Anshul enjoys pickup sports, reading longform journalism, and cooking. He is a passionate supporter of Bay Area sports and Leeds United and interested in learning more about statistical inference in sports.

Cordero Core is a highly skilled senior software engineer with over 14 years of experience successfully delivering innovative software solutions to healthcare, e-commerce, aerospace, and security industries. He has achieved a patent for his groundbreaking work in computational microscopy and digital pathology that has revolutionized imaging and analysis techniques for medical professionals. Cordero maintains active involvement in his field through mentoring startups and entrepreneurs as an advisory board member for the Journal of Small Business and Enterprise Development. He is currently focused on creating software solutions that enable scientific research, data management, and collaboration through his role as a Senior Software Engineer at the eScience Institute.

Niki Burggraf is a Senior Software Engineer for the UW Scientific Software Engineering Center (SSEC) at the eScience Institute. With over 6 years of experience building and maintaining cloud web services, Niki is excited to bring her industry knowledge to the scientific software engineering sphere.

Anant Mittal is a Ph.D. student at the Paul G. Allen School of Computer Science & Engineering, University of Washington, advised by Prof. James Fogarty. He also holds a graduate research assistant position at the eScience Institute, Scientific Software Engineering Center (SSEC). His research interests include designing and building interactive systems for real-world human-computer interaction impact and evaluating them through mixed methods. His Ph.D. focuses on building systems for communication and collaboration in settings where multiple stakeholders have different roles. He has presented several papers and posters at conferences and given invited talks.

Vani Mandava is the Head of Engineering for the UW Scientific Software Engineering Center within the eScience Institute. She is responsible for setting up the SSEC organization and working with PIs to define the priorities and scope of software infrastructure that will strengthen the scientific software community. Before joining UW in 2022, Vani spent over two decades at Microsoft. Her career spanned engineering and product roles in client, server, and services products across Microsoft Office, Bing AdCenter, Microsoft Academic Search, and Microsoft Research Open Data. As Director for Data Science at Microsoft Research, she led Cloud, Data Science, and Trustworthy AI research collaborations with partners in academia and government.

Ishika is a graduate student at the University of Washington, Seattle, specializing in Data Science. She is currently working as a Research Graduate Scholar at the Scientific Software Engineering Center at the eScience Institute, UW. Before this, Ishika worked as a Decision Analytics Associate at ZS Associates, India, where she played a pivotal role in informing strategic decision-making processes and driving impactful outcomes for a pharma client, ultimately contributing to the success of their business objectives. Her passion lies at the confluence of Software Development Engineering (SDE) and Data Science best practices, and she is driven by a relentless pursuit of knowledge and a thirst for mastery in these dynamic disciplines.

Anuj is a graduate student specializing in Data Science at the University of Washington, Seattle. Currently, he is working as a Research Graduate Scholar at the Scientific Software Engineering Center at the eScience Institute, UW. Before joining UW, he worked as a Software Developer at Goldman Sachs, India, for 4 years.

Madhav Kashyap is a graduate student at the University of Washington majoring in Computational Linguistics and Natural Language Processing. In his recent work as a Graduate Research Assistant at the UW Scientific Software Engineering Center (eScience Institute), he has developed open-source software used by oceanographers to measure seafloor tectonic shifts to the centimeter level. His thesis focuses on system optimizations for faster information retrieval in Large Language Model workflows. As a Backend Software Engineer at Akamai, he has industry experience coding robust Python and Go systems powering enterprise cybersecurity.