Preprint has been published in a journal as an article
DOI of the published article https://doi.org/10.1021/acs.jpcb.5c03825
Preprint / Version 1

Improvement of Diffusion Coefficient Prediction by Active Learning

Authors

  • Zeno Romero, Laboratory of Engineering Thermodynamics (LTD)
  • Kerstin Münnemann, Laboratory of Engineering Thermodynamics (LTD)
  • Hans Hasse, Laboratory of Engineering Thermodynamics (LTD)
  • Fabian Jirasek, Laboratory of Engineering Thermodynamics (LTD)

DOI:

https://doi.org/10.31224/5491

Keywords:

Diffusion, Diffusion Coefficient, Active Learning

Abstract

Methods for predicting diffusion coefficients in mixtures are essential in many applications, as experimental data are scarce. Machine learning (ML) methods offer promising alternatives to established semiempirical models for predicting diffusion coefficients, but their performance strongly depends on the available training data. Increasing the size of data sets is a straightforward strategy for improving ML methods, but measuring diffusion coefficients is costly, which limits the number of experiments that can be carried out. We have therefore studied active learning (AL) strategies for planning diffusion coefficient measurements and for the targeted improvement of ML methods for their prediction, specifically matrix completion methods (MCMs) for predicting diffusion coefficients at infinite dilution $D_{ij}^{\infty}$ in binary mixtures at 298 K. In the first step, different AL strategies were systematically tested on a synthetic data set for $D_{ij}^{\infty}$, and uncertainty sampling was found to be a simple but effective choice. This strategy was therefore used for planning $D_{ij}^{\infty}$ measurements using pulsed-field gradient (PFG) nuclear magnetic resonance (NMR) spectroscopy. In total, $D_{ij}^{\infty}$ was measured in 19 mixtures for which no data were previously available, and the data were used for retraining two hybrid MCMs. The results show that significant improvement in the prediction of $D_{ij}^{\infty}$ can be achieved with only a few suitably planned experiments, but also that the impact strongly depends on the prediction model used: while no clear influence was found on the performance of an MCM trained on the residuals of the semiempirical SEGWE model, the accuracy of a hybrid MCM that incorporates SEGWE predictions as soft prior information could be substantially increased, almost halving the relative mean squared error on the test set.
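
As an illustration of the uncertainty sampling idea named in the abstract, the following minimal sketch picks the next mixture to measure as the unmeasured solute-solvent pair with the largest predictive uncertainty. It is not the authors' implementation: the function name `select_next_measurement`, the 3x3 arrays, and the assumption that the model supplies one predictive standard deviation per matrix entry are all hypothetical and chosen only for illustration.

```python
import numpy as np


def select_next_measurement(pred_std, observed):
    """Return (i, j) of the unmeasured entry with the largest predictive std.

    pred_std : 2D array, assumed model uncertainty for each solute i / solvent j
    observed : 2D boolean array, True where an experimental value already exists
    """
    pred_std = np.asarray(pred_std, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    if observed.all():
        raise ValueError("all entries are already measured")
    # Mask out measured entries so they can never be selected.
    scores = np.where(observed, -np.inf, pred_std)
    i, j = np.unravel_index(int(np.argmax(scores)), scores.shape)
    return int(i), int(j)


# Hypothetical uncertainties for a 3x3 solute-by-solvent matrix.
pred_std = np.array([[0.10, 0.40, 0.05],
                     [0.30, 0.90, 0.20],
                     [0.15, 0.25, 0.60]])
# Hypothetical mask of entries that have already been measured.
observed = np.array([[True, False, True],
                     [False, True, False],
                     [True, False, False]])

next_ij = select_next_measurement(pred_std, observed)
print(next_ij)
```

In this toy example the largest uncertainty (0.90) belongs to an entry that is already measured, so the sketch skips it and selects the most uncertain unmeasured entry instead. In an actual AL loop, the selected mixture would be measured, added to the training data, and the model retrained before the next selection.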

Posted

2025-10-01