SciPy 2024

Don Setiawan

Don Setiawan is a Senior Research Software Engineer at the Scientific Software Engineering Center (SSEC), part of the eScience Institute at the University of Washington. He has expertise in Python programming, web development, geospatial data analytics, and cloud-based data engineering. He is interested in building scalable, open software that facilitates scientific discovery across fields and promotes software best practices. He has been involved in various open-source software projects with the Ocean Observatories Initiative (OOI), the U.S. Integrated Ocean Observing System (IOOS), the National Oceanic and Atmospheric Administration (NOAA), and the National Aeronautics and Space Administration (NASA).

Sessions

07-08
13:30
240min
Xarray: Friendly, Interactive, and Scalable Scientific Data Analysis
Negin Sobhani, Max Jones, Jessica Scheick, Don Setiawan, Tom Nicholas, Luis Lopez, Scott Henderson, Wietze Suijker

Xarray provides data structures for multi-dimensional labeled arrays and a toolkit for scalable data analysis on large, complex datasets with many related variables. Xarray combines the convenience of labeled data structures inspired by Pandas with NumPy-like multi-dimensional arrays to provide an intuitive and scalable interface for scientific analysis. This hands-on tutorial focuses on intermediate and advanced workflows using complex real-world data. We encourage participants to bring their own datasets, as we will dedicate ample time to applying tutorial concepts to datasets of interest.

Tutorials
Ballroom B/C
07-09
13:30
240min
Generative AI Copilot for Scientific Software – a RAG-Based Approach using OLMo
Don Setiawan, Anshul Tambay, Cordero Core, Niki Burggraf, Anant Mittal, Vani Mandava, Ishika Khandelwal, Anuj Sinha, Madhav Kashyap

Generative AI systems built upon large language models (LLMs) have shown great promise as tools that enable people to access information through natural conversation. Scientists can benefit from the breakthroughs these systems enable to create advanced tools that will help accelerate their research outcomes. This tutorial will cover: (1) the basics of language models, (2) setting up an environment for using open-source LLMs without the expensive compute resources needed for training or fine-tuning, (3) using a technique such as Retrieval-Augmented Generation (RAG) to improve the output of an LLM, and (4) building a “production-ready” app to demonstrate how researchers could turn disparate knowledge bases into special-purpose AI-powered tools. This tutorial is intended for scientists and research engineers who want to use LLMs in their work.

Tutorials
Ballroom D
07-11
14:20
30min
Echostack: A flexible and scalable open-source software suite for echosounder data processing
Don Setiawan, Caesar Tuguinay, Soham Kishor Butala, Brandyn Lucca, Valentina Staneva, Wu-Jung Lee, Dingrui Lei

Water column sonar data collected by echosounders are essential for fisheries and marine ecosystem research, enabling the detection, classification, and quantification of fish and zooplankton from many different ocean observing platforms. However, the broad usage of these data has been hindered by the lack of modular software tools that allow flexible composition of data processing workflows that incorporate powerful analytical tools in the scientific Python ecosystem. We address this gap by developing Echostack, a suite of open-source Python software packages that leverage existing distributed computing and cloud-interfacing libraries to support intuitive and scalable data access, processing, and interpretation. These tools can be used individually or orchestrated together, which we demonstrate in example use cases for a fisheries acoustic-trawl survey.

Earth, Ocean, Geo, and Atmospheric Science
Room 315
0min
Development of Caustics: a differentiable, GPU accelerated, gravitational lensing simulator
Connor Stone, Cordero Core, Don Setiawan

We present Caustics, a tool to accelerate the analysis of gravitational lensing systems for the next generation of astronomical data. Caustics will enable precision measurements of dark matter properties, the expansion rate of the Universe, lensed black holes, the first stars, and more. In this talk I will discuss the benefits and challenges of using PyTorch (a differentiable, GPU-accelerated scientific Python package) to allow for fast development without sacrificing numerical performance. I will detail our development process as well as how we encourage users of all skill levels to engage with our documentation and tools.

General
0min
Prefect Workflows for Scaling Scientific Data Pipelines
Valentina Staneva, Soham Kishor Butala, Don Setiawan, Wu-Jung Lee

With the influx of large volumes of data from multiple instruments and experiments, scientists are wrangling complex data pipelines that are context-dependent and non-reproducible. In this talk, we will share our experience using the Prefect orchestration framework to let scientists and data managers without cyberinfrastructure experience execute complex data workflows on a variety of local and cloud platforms by editing existing recipes. We hope this will serve as a guide for others who are starting to streamline workflows with Prefect or who simply want to see how modern orchestration tools can be applied in a scientific context.

General