SciPy 2024

Echostack: A flexible and scalable open-source software suite for echosounder data processing
07-11, 14:20–14:50 (US/Pacific), Room 315

Water column sonar data collected by echosounders are essential for fisheries and marine ecosystem research, enabling the detection, classification, and quantification of fish and zooplankton from many different ocean observing platforms. However, the broad usage of these data has been hindered by the lack of modular software tools that allow flexible composition of data processing workflows that incorporate powerful analytical tools in the scientific Python ecosystem. We address this gap by developing Echostack, a suite of open-source Python software packages that leverage existing distributed computing and cloud-interfacing libraries to support intuitive and scalable data access, processing, and interpretation. These tools can be used individually or orchestrated together, which we demonstrate in example use cases for a fisheries acoustic-trawl survey.


The goal of our talk is twofold: 1) to introduce Echostack, an open-source Python software toolbox aimed at democratizing the access, processing, and interpretation of water column sonar data collected by echosounders; and 2) to share our experience adopting the Pandata stack of Python libraries in this domain to solve general challenges of working with instrument data that are not only high-volume but also irregularly structured in time and space. We expect this work to interest scientists who face similar challenges with domain- or instrument-specific data and want to learn how different communities address them.

Echosounders are sonar systems optimized for sensing fish and zooplankton in the ocean. They are an essential tool for marine ecosystem and fisheries research, and in recent years have been installed widely on many ocean observing platforms, resulting in a deluge of data accumulating at an unprecedented speed from all corners of the ocean. These extensive datasets contain crucial information that can help ocean scientists better understand marine ecosystems and their response to a changing climate. However, broad usage of these data has been hindered by the lack of open, easily interoperable, and scalable software tools. Our development addresses the urgent need for handling large (100s of GBs to TBs) datasets with highly heterogeneous instrument-specific formats, and provides the fisheries acoustics and ocean science communities with a set of open tools for intuitive and transparent access, organization, processing, and interpretation of these data.

In this talk, we will discuss our design philosophy that enables Echostack packages to be used in a mix-and-match manner depending on the use case, and share lessons learned from leveraging the Pandata software packages and storage format, in particular Zarr, Dask, Xarray, and the HoloViz visualization suite. The packages include:

  • Echopype: performs data standardization and computation from raw instrument files to acoustic data products
  • Echopop: generates acoustically derived biological estimates, such as abundance
  • Echoshader: enables interactive acoustic data visualization and exploration
  • Echoregions: interfaces acoustic data with machine learning developments
  • Echodataflow: orchestrates workflows via text-based configuration “recipes” instead of code

These packages are accompanied by a set of data processing level definitions, Echolevels, which categorize data products at different workflow stages to enhance data understanding and provenance tracking.
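
To make the idea of text-based “recipes” more concrete, here is a minimal sketch of a configuration-driven pipeline using only the Python standard library. It is not Echodataflow's actual configuration schema or API: the recipe layout, the stage names (`scale`, `threshold`), and the helper `run_recipe` are all hypothetical and chosen only to illustrate how a text recipe, rather than code, can decide which processing steps run and with what parameters.

```python
import json

# Hypothetical recipe: stage names and parameters are illustrative only,
# not Echodataflow's actual configuration schema.
RECIPE_TEXT = """
{
  "stages": [
    {"name": "scale", "params": {"factor": 2.0}},
    {"name": "threshold", "params": {"minimum": 3.0}}
  ]
}
"""


def scale(values, factor):
    """Multiply every sample by a constant factor."""
    return [v * factor for v in values]


def threshold(values, minimum):
    """Keep only samples at or above a minimum value."""
    return [v for v in values if v >= minimum]


# Registry mapping stage names used in the recipe to Python callables.
STAGES = {"scale": scale, "threshold": threshold}


def run_recipe(recipe_text, values):
    """Apply the stages listed in the recipe, in order."""
    recipe = json.loads(recipe_text)
    for stage in recipe["stages"]:
        func = STAGES[stage["name"]]
        values = func(values, **stage.get("params", {}))
    return values


result = run_recipe(RECIPE_TEXT, [0.5, 1.0, 2.0, 4.0])
print(result)
```

In this sketch, changing the workflow means editing the recipe text (reordering stages or adjusting parameters) while the Python functions stay untouched, which is the general appeal of recipe-driven orchestration.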

We will conclude the talk by demonstrating Echostack in an orchestrated end-to-end workflow that processes publicly available water column sonar data from NOAA and the US Ocean Observatories Initiative in a cloud-hosted Jupyter environment.

Don Setiawan is a Senior Research Software Engineer at the University of Washington, eScience Institute, Scientific Software Engineering Center (SSEC). He has expertise in Python programming, web development, geospatial data analytics, and cloud-based data engineering. He is interested in building scalable, open software to facilitate scientific discovery across fields and enforce software best practices. He has been involved with various open-source software projects with the Ocean Observatories Initiative (OOI), the U.S. Integrated Ocean Observing System (IOOS), the National Oceanic and Atmospheric Administration (NOAA), and the National Aeronautics and Space Administration (NASA).

I'm Soham, a Data Science Graduate from the University of Washington. With four years of diverse experience at Deloitte and AWS, I've delved into software engineering, data engineering, and application security. I'm deeply passionate about Data Engineering and always eager to embrace new technologies. Beyond the screen and code, I find solace in the great outdoors; hiking is not just an activity for me but a way to rejuvenate my spirit. And when it comes to mental exercises, who can resist the allure of a thrilling game of chess? Looking forward to connecting and exploring the vast horizons of technology and beyond.

Brandyn Lucca is a postdoctoral scholar at the Applied Physics Laboratory, University of Washington (Seattle, WA). His academic background includes a BSc in marine biology (University of Rhode Island), and both an MSc and a PhD in Marine and Atmospheric Science (Stony Brook University). Brandyn's research focuses on using active acoustics to study environmental variability in the spatiotemporal distributions of marine organisms, and to better understand how sound scatters from different types of animals through physics-based and numerical methods.

Valentina Staneva is a Senior Data Scientist and Data Science Fellow at the eScience Institute, Paul G. Allen School of Computer Science & Engineering, University of Washington. As part of her role, she collaborates with researchers from a wide range of domains on extracting information from large data sets of various modalities, such as time series, images, videos, audio, and text. She is involved in data science education for audiences with a broad range of experience levels, and regularly teaches workshops on introductory and advanced topics. She supports open science and reproducible research, and strives to help others adopt better data science workflows.

Wu-Jung Lee is a scientist at the Applied Physics Laboratory, University of Washington in Seattle, WA, USA. She has an interdisciplinary background, including undergraduate degrees in Electrical Engineering and Life Science from National Taiwan University and a PhD from the MIT-WHOI Joint Program in Oceanography. Her research spans two primary areas, acoustical oceanography and animal echolocation, with a goal of advancing acoustic sensing technology to better observe and understand the marine ecosystem. Dr. Lee loves going to sea despite being very prone to motion sickness. Outside of work, she enjoys spending time in the mountains and drawing.