Valentina Staneva
Valentina Staneva is a Senior Data Scientist and Data Science Fellow at the eScience Institute, Paul G. Allen School of Computer Science & Engineering, University of Washington. As part of her role she collaborates with researchers from a wide range of domains on extracting information from large data sets of various modalities, such as time series, images, videos, audio, and text. She is involved in data science education for audiences with a broad range of experience levels, and regularly teaches workshops on introductory and advanced topics. She supports open science and reproducible research, and strives to help others adopt better data science workflows.
Sessions
In this tutorial we will introduce GitHub Actions to scientists as a tool for lightweight automation of scientific data workflows. We will demonstrate that GitHub Actions is not just a tool for software testing, but can be used in various ways to improve the reproducibility and impact of scientific analysis. Through a sequence of examples, we will demonstrate some applications of GitHub Actions to scientific workflows, such as scheduled deployment of algorithms to sensor streams, updating visualizations based on new data, processing large datasets, model versioning, and performance benchmarking. GitHub Actions can particularly empower Python scientific programmers who would rather not build fully fledged applications or set up complex computational infrastructure, but would like to increase the impact of their research. Our goal is for participants to leave with their own ideas for integrating GitHub Actions into their work.
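As a sketch of the "scheduled deployment" pattern mentioned above, the following is a minimal GitHub Actions workflow that runs a Python script on a daily cron schedule and commits any regenerated figures back to the repository. The script name `update_plot.py`, the `figures/` directory, and the schedule are hypothetical placeholders, not part of the tutorial materials:

```yaml
name: Scheduled data update
on:
  schedule:
    - cron: "0 6 * * *"    # run every day at 06:00 UTC
  workflow_dispatch:        # also allow manual runs from the Actions tab

jobs:
  update:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: python update_plot.py     # hypothetical analysis/plotting script
      - run: |
          # commit regenerated outputs only when something actually changed
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add figures/
          git diff --staged --quiet || git commit -m "Update figures"
          git push
```

The `workflow_dispatch` trigger is a convenient companion to `schedule`: it lets you test the same workflow on demand without waiting for the next cron tick.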
Water column sonar data collected by echosounders are essential for fisheries and marine ecosystem research, enabling the detection, classification, and quantification of fish and zooplankton from many different ocean observing platforms. However, the broad usage of these data has been hindered by the lack of modular software tools that allow flexible composition of data processing workflows that incorporate powerful analytical tools in the scientific Python ecosystem. We address this gap by developing Echostack, a suite of open-source Python software packages that leverage existing distributed computing and cloud-interfacing libraries to support intuitive and scalable data access, processing, and interpretation. These tools can be used individually or orchestrated together, which we demonstrate in example use cases for a fisheries acoustic-trawl survey.
With the influx of large volumes of data from multiple instruments and experiments, scientists are wrangling complex data pipelines that are context-dependent and non-reproducible. In this talk, we will share our experience leveraging the Prefect orchestration framework to allow scientists and data managers without cyberinfrastructure experience to execute complex data workflows on a variety of local and cloud platforms by editing existing recipes. We hope this will serve as a guide to others embarking on streamlining workflows through Prefect, or simply wanting to see how modern orchestration tools can be applied in a scientific context.