SciPy 2023

Using Python to accelerate sustainable aviation fuel research and development
07-13, 14:20–14:50 (America/Chicago), Zlotnik Ballroom

Aviation accounts for 2-3% of global CO2 emissions. Transitioning to cleaner, more sustainable aviation fuels can reduce its environmental impacts. To help accelerate sustainable aviation fuel development, we trained machine learning models to predict fundamental properties of biofuel blends using Fourier transform infrared (FTIR) spectra. We used TPOT and standard libraries like NumPy, pandas, and scikit-learn to develop the models. This presentation will discuss how we overcame challenges with decomposing FTIR spectral data and applying machine learning to small datasets (<100 samples). We will also discuss integration of the models into our open-source webtool to support biofuel research.


Aviation accounts for 2-3% of global carbon dioxide emissions and 9-12% of U.S. transportation greenhouse gas emissions. Sustainable aviation fuels have the potential to reduce emissions and environmental impacts; however, because of high costs and high-volume requirements, experimental property testing of bio-based jet fuels is usually conducted years after initial bench-scale experiments are completed. Neglecting property testing early in the development cycle can lead to investments wasted on producing biofuels that do not meet performance expectations.

Machine learning has already proven to be a valuable tool for predicting sustainable aviation fuel properties and accelerating research. In 2020, we presented our approach at SciPy (https://www.youtube.com/watch?v=ENOf0IZDla8) for high-throughput prediction of aviation fuel properties of over 10,000 molecules using molecular descriptors. The correlation analysis and tree-based methods for feature ranking were later published in Fuel (https://doi.org/10.1016/j.fuel.2022.123836). Using the property prediction models, we created the first Python-based, comprehensive, open-source webtool that enables scientists and companies to explore viable bio-based molecules without spending time and money testing in the lab (https://feedstock-to-function.lbl.gov).

Because aviation fuels are blends of many molecules and compounds, our current research focuses on expanding the webtool to predict properties of fuel blends from Fourier transform infrared (FTIR) spectra and experimental property data. Specifically, we use binning and smoothing techniques to reduce experimental noise across more than 6,700 FTIR spectral features, then apply non-negative matrix factorization (NMF) to reduce them to a small set of features for models that predict fundamental properties of biofuel blends (e.g., boiling point, flash point, melting point, specific gravity, and kinematic viscosity). The predictive models are also integrated into the webtool to support sustainable aviation fuel research.
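
As a rough illustration of the binning, smoothing, and NMF steps described above, the following sketch runs them on synthetic data. It is not the authors' pipeline: the bin width, smoothing window, number of components, and the helper names `bin_spectra` and `smooth_spectra` are all assumptions chosen for the example, and the "spectra" are random non-negative mixtures rather than real FTIR measurements.

```python
import numpy as np
from sklearn.decomposition import NMF


def bin_spectra(spectra, bin_width):
    """Average adjacent absorbance values in non-overlapping bins (hypothetical helper)."""
    n_samples, n_features = spectra.shape
    n_bins = n_features // bin_width  # any ragged tail is dropped
    trimmed = spectra[:, : n_bins * bin_width]
    return trimmed.reshape(n_samples, n_bins, bin_width).mean(axis=2)


def smooth_spectra(spectra, window):
    """Moving-average smoothing along the spectral axis (hypothetical helper).

    mode="same" zero-pads the ends, so values near the edges are damped.
    """
    kernel = np.ones(window) / window
    return np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, spectra
    )


# Synthetic stand-in data (assumption): 60 samples, 6700 spectral features,
# each sample a non-negative mixture of 3 "pure" spectra plus small noise.
rng = np.random.default_rng(0)
pure = rng.random((3, 6700))
weights = rng.random((60, 3))
spectra = weights @ pure + 0.01 * rng.random((60, 6700))

binned = bin_spectra(spectra, bin_width=10)      # 6700 features -> 670 bins
smoothed = smooth_spectra(binned, window=5)
smoothed = np.clip(smoothed, 0.0, None)          # NMF requires non-negative input

# Factor the spectra into per-sample weights W and spectral components H.
nmf = NMF(n_components=3, init="nndsvda", max_iter=1000, random_state=0)
W = nmf.fit_transform(smoothed)   # shape (60, 3): reduced features per sample
H = nmf.components_               # shape (3, 670): which bins load on each component
```

Here the rows of `H` indicate which spectral bins move together, and the columns of `W` are the reduced features that a downstream property model could consume.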

Our workflow uses libraries such as NumPy, pandas, and scikit-learn to reduce FTIR spectral data into interpretable components, and the Tree-based Pipeline Optimization Tool (TPOT) to develop property prediction models that use the reduced FTIR spectra as features. Specifically, we will discuss methods for coalescing experimental spectral data from different sources and for reducing the influence of experimental noise on model performance. We will also discuss using NMF as a dimensionality reduction technique that correctly groups FTIR wavelengths together and yields meaningful features. Additionally, we will address common challenges such as defining an applicability domain and recognizing and limiting overfitting.
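
One possible way to check for overfitting on a small dataset and to define a simple applicability domain is sketched below. It makes several assumptions: a scikit-learn `RandomForestRegressor` stands in for the TPOT-optimized pipeline, the data and the target property are synthetic, and the applicability domain is a plain per-component range check with a small margin (the `in_domain` helper is hypothetical). The talk's actual methods may differ.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption): 80 blends, 200 binned spectral features,
# and a made-up property that depends linearly on the mixture weights.
rng = np.random.default_rng(0)
pure = rng.random((3, 200))
weights = rng.random((80, 3))
X = weights @ pure
y = weights @ np.array([150.0, 200.0, 250.0]) + rng.normal(0.0, 2.0, 80)

# NMF sits inside the pipeline so it is refit on each training fold,
# which keeps held-out spectra from leaking into the learned components.
pipeline = Pipeline([
    ("nmf", NMF(n_components=3, init="nndsvda", max_iter=1000, random_state=0)),
    ("model", RandomForestRegressor(n_estimators=100, random_state=0)),  # TPOT stand-in
])

# Compare error on held-out folds with error on the training data itself.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_pred = cross_val_predict(pipeline, X, y, cv=cv)
cv_mae = mean_absolute_error(y, cv_pred)

pipeline.fit(X, y)
train_mae = mean_absolute_error(y, pipeline.predict(X))
print(f"train MAE: {train_mae:.2f}  cross-validated MAE: {cv_mae:.2f}")

# Simple applicability domain (assumption): the range of each NMF component
# seen during training, widened by a 5% margin.
W_train = pipeline.named_steps["nmf"].transform(X)
lo, hi = W_train.min(axis=0), W_train.max(axis=0)
margin = 0.05 * (hi - lo) + 1e-9


def in_domain(spectra):
    """Return True for each spectrum whose NMF components fall within the training range."""
    W_new = pipeline.named_steps["nmf"].transform(spectra)
    return np.all((W_new >= lo - margin) & (W_new <= hi + margin), axis=1)
```

A large gap between the training error and the cross-validated error is a warning sign of overfitting, and predictions for spectra where `in_domain` returns `False` should be treated with caution because the model is extrapolating.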

By sharing our experience and lessons learned, we aim to help the community overcome similar challenges when developing models for advancing science, while also demonstrating how a Python-based, open-source webtool can facilitate faster, less expensive bioprocess optimization and scale-up of sustainable aviation fuels.

Ana Comesana is a Scientific Engineering Associate at Lawrence Berkeley National Laboratory. She is a data scientist who conducts applied machine learning research to support projects in a variety of areas, including water treatment, energy management, and bio-jet fuel research. Ana received her B.S. in Mathematics from UC Berkeley.