SciPy 2024

Atomistic uncertainty driven data generation in ANI neural network potentials
07-10, 10:45–11:15 (US/Pacific), Room 316

Using machine learning to predict chemical properties and behavior is an important complement to traditional approaches to computation and simulation in chemistry. The ANAKIN-ME (ANI) methodology has been shown to produce generalized and transferable neural network potentials, trained on density functional theory (DFT) molecular energies, at a greatly reduced computational cost. The work presented here details an approach to generating new data in an active learning scheme in order to improve predictions in the regions of chemical space with high predictive uncertainty at the atom level.


In computational chemistry, systems of interest often lie in regions of chemical space where experimental data are sparse, or where computational approaches must rely heavily on approximations to be practical. Highly accurate quantum chemistry methods, which are sensitive to the smallest electronic interactions, are generally considered impractical for large systems such as proteins, which are instead typically studied using force fields based on classical physics. The reason is how computational cost scales with system size: quantum methods start at around O(N^3) in the number of electrons N, whereas classical force fields scale as O(N) in the number of atoms. Machine learning (ML) has emerged as a means of bridging the gap in accuracy and computational cost between these traditional approaches to chemical computations. The Accurate NeurAl networK engINe for Molecular Energies (ANAKIN-ME, or ANI) methodology has been shown to produce generalized and transferable neural network potential energy surfaces, trained on density functional theory (DFT) molecular energies, at a cost similar to that of force fields. ANI neural network potentials (NNPs) are implemented in PyTorch, in a package we call TorchANI, which has a public version available on GitHub at https://github.com/aiqm/torchani.

The key to success in ML is curating the right dataset for the task at hand, but this becomes quite complicated when trying to capture general and transferable behavior in chemical systems. The ANI training datasets were created using active learning: structures whose ensemble predictions of molecular energy showed high uncertainty were set aside for additional conformational sampling, expanding the size of the dataset. This approach works well for small organic molecules, but as we increase the number of supported atom types (C, H, N, O for ANI-1x; C, H, N, O, S, F, Cl for ANI-2x), the amount of data needed to capture possible interactions also increases. The work presented here is an approach to sampling at the atom level via the uncertainty in predicted atomic forces, rather than the global property of molecular energy, in an effort to sample the chemical space surrounding problematic local atomic environments more efficiently. An automated workflow for this sampling approach is still in development, but results so far are promising.
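As a rough illustration of the energy-based selection step described above, the following sketch flags structures whose ensemble energy predictions disagree strongly. It is a minimal example under stated assumptions, not the TorchANI API or the actual ANI workflow: the function names, the toy energies, the threshold value, and the choice to scale the standard deviation by the square root of the atom count are all assumptions made for illustration.

```python
import math
import statistics


def energy_uncertainty(ensemble_energies, n_atoms):
    """Ensemble disagreement for one structure.

    Assumption: uncertainty is the population standard deviation of the
    ensemble members' energy predictions, divided by sqrt(n_atoms) so that
    molecules of different sizes can be compared.
    """
    return statistics.pstdev(ensemble_energies) / math.sqrt(n_atoms)


def select_for_sampling(structures, threshold):
    """Return the names of structures whose uncertainty exceeds the threshold.

    `structures` maps a name to (list of ensemble energies, number of atoms).
    """
    return [
        name
        for name, (energies, n_atoms) in structures.items()
        if energy_uncertainty(energies, n_atoms) > threshold
    ]


# Toy data (hypothetical values): four ensemble members per structure.
structures = {
    "mol_a": ([-40.501, -40.502, -40.500, -40.501], 5),  # members agree
    "mol_b": ([-76.40, -76.10, -76.70, -76.25], 3),      # members disagree
}

# The threshold is an arbitrary illustrative value.
flagged = select_for_sampling(structures, threshold=0.01)
print(flagged)
```

Structures returned by `select_for_sampling` would be the ones set aside for additional conformational sampling in a scheme of this kind.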

A strong correlation has been found between molecules with high ensemble uncertainty in the predicted molecular energy and high uncertainty in the predicted atomic forces. Because molecular energy predictions are computed as the sum of individual atomic contributions (non-physical values produced by the individual atomic neural networks), high uncertainty in atomic forces is proposed as an indicator of poor ensemble understanding of atomic interactions, which in turn leads to high uncertainty in the predicted molecular energy. By focusing on high-uncertainty atomic force predictions, we can narrow the scope of conformational sampling: instead of indiscriminately generating new conformations of entire molecules based on normal modes, we sample the conformational space around the atoms with high-uncertainty force predictions.
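To make the atom-level criterion concrete, here is a minimal sketch that computes a per-atom force uncertainty from an ensemble of force predictions and flags the atoms that exceed a cutoff. The specific measure used (root-mean-square deviation of each member's force vector from the ensemble mean), the function names, the toy forces, and the threshold are assumptions for illustration; the abstract does not specify the exact measure used in this work.

```python
import math


def atomic_force_uncertainty(ensemble_forces):
    """Per-atom disagreement among ensemble force predictions.

    `ensemble_forces[m][i]` is the (fx, fy, fz) force on atom i predicted by
    ensemble member m. For each atom, returns the root-mean-square deviation
    of the members' force vectors from the ensemble-mean force vector.
    """
    n_models = len(ensemble_forces)
    n_atoms = len(ensemble_forces[0])
    uncertainties = []
    for i in range(n_atoms):
        vectors = [ensemble_forces[m][i] for m in range(n_models)]
        mean = [sum(v[k] for v in vectors) / n_models for k in range(3)]
        mean_sq_dev = sum(
            sum((v[k] - mean[k]) ** 2 for k in range(3)) for v in vectors
        ) / n_models
        uncertainties.append(math.sqrt(mean_sq_dev))
    return uncertainties


def flag_atoms(uncertainties, threshold):
    """Return indices of atoms whose force uncertainty exceeds the threshold."""
    return [i for i, u in enumerate(uncertainties) if u > threshold]


# Toy data (hypothetical values): 3 ensemble members, 3 atoms.
ensemble_forces = [
    [(0.10, 0.00, 0.00), (0.00, 0.20, 0.00), (0.50, 0.00, 0.00)],
    [(0.11, 0.00, 0.00), (0.00, 0.21, 0.00), (-0.40, 0.30, 0.00)],
    [(0.09, 0.00, 0.00), (0.00, 0.19, 0.00), (0.10, -0.60, 0.20)],
]

uncertainties = atomic_force_uncertainty(ensemble_forces)
# The threshold is an arbitrary illustrative value.
problem_atoms = flag_atoms(uncertainties, threshold=0.05)
print(problem_atoms)
```

In a workflow of the kind described, the flagged atom indices would mark the local environments around which new conformations are generated, rather than resampling the whole molecule.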

Fifth-year PhD candidate in Physical Chemistry working in Adrian Roitberg's lab at the University of Florida. Research interests involve deep learning techniques and their applications to chemical computation and simulation.