07-11, 15:00–15:30 (US/Pacific), Room 315
This talk illustrates how machine learning models that detect harmful algal blooms from satellite imagery can help water quality managers make informed decisions about public health warnings for lakes and reservoirs. Rooted in the development of the open source package CyFi, this talk includes insights on identifying when your model is getting the right answer for the wrong reasons, the upsides of using decision tree models with satellite imagery, and how to help non-technical users build confidence in machine learning models. The intended audience is those interested in using satellite imagery to monitor and respond to the world around us.
Motivation
Inland water bodies provide a variety of critical services for both human and aquatic life, including drinking water, recreational and economic opportunities, and marine habitats. A significant challenge water quality managers face is the formation of harmful algal blooms (HABs). One of the major types of HABs is cyanobacteria. HABs produce toxins that are poisonous to humans and their pets, and threaten marine ecosystems by blocking sunlight and oxygen.
While there are established methods for using satellite imagery to detect cyanobacteria in larger water bodies like oceans, detection in small inland lakes and reservoirs remains a challenge. Machine learning is particularly well-suited to this task because indicators of cyanobacteria are visible from free, routinely collected data sources. Whereas manual water sampling is time and resource intensive, machine learning models can generate estimates in seconds. This allows water managers to prioritize where water sampling will be most beneficial, and can provide a birds-eye view of water conditions across the state.
Methods
CyFi (Cyanobacteria Finder) is an open-source Python package that uses satellite imagery and machine learning to estimate levels of cyanobacteria, one type of HAB. CyFi helps decision makers protect the public by flagging the highest-risk areas in lakes, reservoirs, and rivers quickly and easily.
CyFi was born out of the Tick Tick Bloom machine learning competition, hosted by DrivenData. The model in CyFi is based on the winning solutions, and has been optimized for generalizability and efficiency.
CyFi uses two main data sources: Sentinel-2 satellite imagery and a land cover gridded map. Cyanobacteria estimates are generated by a LightGBM model, a gradient-boosted decision tree algorithm. The model was trained and evaluated using nearly 13,000 "in situ" labels collected manually by organizations across the U.S.
To build intuition around model predictions and error cases, CyFi comes with a visualization tool, which lets users view the base satellite imagery tile corresponding to each sample point prediction.
Results
CyFi uses high-resolution Sentinel-2 satellite imagery (10-30 m) to focus on smaller water bodies with rapidly changing blooms. Sentinel-3 is used by most existing tools, but its resolution of 300-500 m is often too coarse for small, inland water bodies. We find that CyFi performs at least as well as Sentinel-3-based tools and has 10 times more coverage of lakes across the U.S.
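To make the resolution gap concrete, the following back-of-the-envelope sketch counts how many whole pixels of a given size fit inside a water body. The 1 square kilometer lake is a hypothetical example, not a figure from the CyFi evaluation, and the function name is illustrative. The calculation also ignores shoreline pixels that mix water and land, which would reduce the usable count further.

```python
def pixels_covering(area_m2: float, pixel_size_m: float) -> int:
    """Approximate number of whole pixels that fit inside a water body."""
    return int(area_m2 // (pixel_size_m ** 2))


# Hypothetical small lake of 1 square kilometer (an assumption for illustration).
LAKE_AREA_M2 = 1_000_000

# Resolution ranges quoted in the text: 10-30 m (Sentinel-2), 300-500 m (Sentinel-3).
for sensor, pixel_size_m in [
    ("Sentinel-2 (10 m)", 10),
    ("Sentinel-2 (30 m)", 30),
    ("Sentinel-3 (300 m)", 300),
    ("Sentinel-3 (500 m)", 500),
]:
    count = pixels_covering(LAKE_AREA_M2, pixel_size_m)
    print(f"{sensor}: about {count} pixels")
```

Under these assumptions, the lake spans roughly 10,000 pixels at 10 m but only about 11 at 300 m, which leaves very little signal to work with once mixed shoreline pixels are discarded.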
CyFi is most accurate at low and high cyanobacteria densities and is intended to plug into human-in-the-loop workflows. Where blooms are likely absent, water quality managers can better allocate ground sampling resources by deprioritizing these water bodies. Where severe blooms are likely present, water quality managers can flag these for public health interventions, such as swimming or drinking water advisories.
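The human-in-the-loop workflow described above can be sketched as a simple triage step. This is a minimal illustration, not part of the CyFi API: the thresholds (in cells/mL), the `Estimate` class, the `triage` function, and the sample values are all assumptions, and an agency would substitute its own advisory levels. The middle band is routed to ground sampling on the reasoning that model estimates are least reliable there.

```python
from dataclasses import dataclass

# Illustrative thresholds in cells/mL (assumptions, not values from the source).
LOW_MAX = 20_000
HIGH_MIN = 100_000


@dataclass
class Estimate:
    """A hypothetical container for one model estimate at a sampling site."""
    site: str
    density_cells_per_ml: float


def triage(estimate: Estimate) -> str:
    """Map a model estimate to a suggested next step for a human reviewer."""
    if estimate.density_cells_per_ml < LOW_MAX:
        return "deprioritize ground sampling"
    if estimate.density_cells_per_ml >= HIGH_MIN:
        return "flag for public health review"
    # Mid-range estimates are where a manual sample adds the most information.
    return "schedule ground sampling"


# Hypothetical estimates for three sites.
estimates = [
    Estimate("Lake A", 3_500),
    Estimate("Reservoir B", 48_000),
    Estimate("Lake C", 250_000),
]
for e in estimates:
    print(f"{e.site}: {triage(e)}")
```

In each case the output is a suggestion for a water quality manager to review, not an automated decision.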
Previous speaking experience
Emily Dorne is a lead data scientist at DrivenData and is the technical lead for CyFi. She has previously given talks at the Women in Data Science (WiDS) Global conference, WiDS Puget Sound, and CamTrapAI. In addition, she has led in-person data ethics workshops using the open-source Python package Deon, of which she is an author.
Emily Dorne is a lead data scientist at DrivenData where she develops machine learning models for social impact. Her expertise lies in classifying animals in camera trap videos to support conservationists, identifying harmful algal blooms to support water quality managers, and helping data scientists consider the ethical implications of their work. She is passionate about using data for social good and has previously worked at the Bill & Melinda Gates Foundation, Stanford Center for International Development, and the Brookings Institution.