SciPy 2024

GitHub Actions for Scientific Data Workflows
07-08, 13:30–17:30 (US/Pacific), Ballroom D

In this tutorial we will introduce GitHub Actions to scientists as a tool for lightweight automation of scientific data workflows. We will show that GitHub Actions is not just a tool for software testing, but can be used in various ways to improve the reproducibility and impact of scientific analysis. Through a sequence of examples, we will demonstrate some applications of GitHub Actions to scientific workflows, such as scheduled deployment of algorithms to sensor streams, updating visualizations based on new data, processing large datasets, model versioning, and performance benchmarking. GitHub Actions can particularly empower scientific Python programmers who do not want to build full-fledged applications or set up complex computational infrastructure, but would like to increase the impact of their research. The goal is for participants to leave with their own ideas of how to integrate GitHub Actions into their work.


GitHub Actions is quite popular within the software engineering community, but a scientific Python programmer may not have seen it used for anything beyond continuous integration and unit testing. We would like to increase its visibility through a scientific workflow lens. We will use examples that are relevant to the community: wrangling a messy real-time hydrophone data stream to display noise sounds from Puget Sound (not far from the conference venue!) or processing hundreds of satellite radar images over glacial lakes in High-Mountain Asia to study flood hazards. We assume no knowledge of GitHub Actions and will start slowly with a “Hello World” step, but build quickly to create complex and exciting workflows. We will also showcase the value of GitHub Actions for scientific collaborations across institutions as a means to share reproducible workflows and computing infrastructure.

Key Learning Objectives:

  • Learners can distinguish between GitHub Actions and Workflows and understand their role within the Python software development cycle
  • Learners can trigger GitHub Actions Workflows in several different ways and can determine which method is useful in typical science applications
  • Learners can export and visualize (data) outputs of GitHub Actions Workflows, e.g. tables and plots
  • Learners can process large data sets in parallel with GitHub Actions
  • Learners can identify a few components of their work which can be executed as GitHub Actions
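
As a rough illustration of the "export outputs" objective above, here is a minimal Python sketch of a script that a workflow step could run to publish a small results table on the workflow run page. It relies on the `GITHUB_STEP_SUMMARY` environment variable, which GitHub Actions sets to the path of a file whose Markdown content is rendered as the job summary. The function names, the fallback file name `summary.md`, and the table values are hypothetical and are not taken from the tutorial materials.

```python
import os
from pathlib import Path


def to_markdown_table(rows, headers):
    """Render rows (a list of tuples) as a GitHub-flavored Markdown table."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(value) for value in row) + " |")
    return "\n".join(lines) + "\n"


def write_job_summary(markdown, fallback="summary.md"):
    """Append Markdown text to the job summary file and return its path.

    On a GitHub Actions runner, GITHUB_STEP_SUMMARY points to the summary file.
    Outside a runner the variable is unset, so the text goes to a local
    fallback file instead (hypothetical name, chosen for this sketch).
    """
    path = Path(os.environ.get("GITHUB_STEP_SUMMARY", fallback))
    with path.open("a", encoding="utf-8") as handle:
        handle.write(markdown)
    return path


if __name__ == "__main__":
    # Made-up placeholder values for illustration, not real measurements.
    rows = [("station_a", 3), ("station_b", 5)]
    table = to_markdown_table(rows, ["station", "files_processed"])
    print(f"Summary written to {write_job_summary(table)}")
```

Run locally, the script appends the table to `summary.md`; run inside a workflow step, the same table would appear on the summary page of that run.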

Tutorial Repo: https://github.com/uwescience/SciPy2024-GitHubActionsTutorial/

Tutorial Guide: https://uwescience.github.io/SciPy2024-GitHubActionsTutorial/intro.html


Prerequisites

GitHub account, familiarity with git, GitHub, and Python (conda, scipy, matplotlib), some maturity in manipulating scientific data and exposure to the challenges associated with it, ability to read code (our examples may use libraries not familiar to the audience, but the focus will be on the steps these libraries accomplish rather than the details)

Installation Instructions

Participants can make edits from the GitHub interface, but if they prefer to make updates locally, they need a working git installation (set up instructions)

Valentina Staneva is a Senior Data Scientist and Data Science Fellow at the eScience Institute, Paul G. Allen School of Computer Science & Engineering, University of Washington. As part of her role she collaborates with researchers from a wide range of domains on extracting information from large data sets of various modalities, such as time series, images, videos, audio, and text. She is involved in data science education for audiences with a broad range of experience levels, and regularly teaches workshops on introductory and advanced topics. She supports open science and reproducible research, and strives to help others adopt better data science workflows.

Quinn is a PhD student in Civil and Environmental Engineering at the University of Washington. Quinn's research involves developing methods to study changing arctic and alpine landscapes with satellite remote sensing data, particularly radar data. This work is at the intersection of data science, geoscience, and remote sensing. Some of Quinn's previous experiences include TAing a Geospatial Data Analysis in Python course, leading a project at the GeoSmart Machine Learning Hackweek, and collaborating on open source remote sensing software (e.g. https://github.com/SnowEx/spicy-snow).