Sampling considerations for the training and validation of machine learning classifiers from genomic data

28.06.2022

Lukas Lueftinger

PhD Student
Department of Microbiology and Ecosystem Science, University of Vienna
Advisor: Thomas Rattei

Abstract

With the increasing availability of bacterial whole-genome sequencing (WGS) in the clinic, genomic data analysis tools promise to drive down cost and time-to-result for infectious disease diagnostics. In recent years, a wide array of machine learning (ML) tools for the prediction of antimicrobial resistance (AMR) from sequencing data has been put forward. In several cases, however, evaluation of these tools on independent datasets showed significantly lower accuracy than the estimates reported in the original publications. In this work, I explore how sampling choices for genomic ML tools affect the validity of published performance estimates. I show that a commonly used technique for the validation of ML classifiers is unsuitable for learning from genomic data: random cross-validation (CV), which evaluates the performance of a classifier on a randomly sampled subset of the input training data, overestimates the predictive accuracy of AMR prediction models. One likely reason is that a key assumption of random CV is violated: the sampled data are not independent and identically distributed (i.i.d.). On the contrary, in most cases the sampling of bacterial isolates used to train ML classifiers is biased along several dimensions, such as geography, time and clinical importance. I propose an alternative validation technique that takes genome similarity into account when splitting the data, thereby controlling for dependence structures arising from strain relatedness. I validate this technique on a large and diverse proprietary database of genomic data, showing significant improvements in the validity of performance estimates.
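
The abstract does not say how genome similarity is measured or how the split is constructed, so the following is only a minimal sketch of the general idea, not the method presented in the talk. It assumes each isolate has already been assigned to a similarity cluster (the labels and the function name `grouped_folds` are hypothetical) and builds folds so that no cluster contributes isolates to both the training and the test side, in contrast to random CV, where closely related genomes can land on both sides.

```python
import random
from collections import defaultdict


def grouped_folds(groups, n_folds, seed=0):
    """Yield (train_idx, test_idx) so that no group appears on both sides.

    `groups[i]` is a hypothetical cluster label for isolate i, e.g. obtained
    by clustering genomes on pairwise distance (an assumption of this sketch).
    """
    members = defaultdict(list)
    for idx, g in enumerate(groups):
        members[g].append(idx)
    if n_folds > len(members):
        raise ValueError("need at least as many groups as folds")

    # Shuffle the groups, then place the largest first into the currently
    # smallest fold to keep fold sizes roughly balanced.
    order = sorted(members)
    random.Random(seed).shuffle(order)
    order.sort(key=lambda g: len(members[g]), reverse=True)
    folds = [[] for _ in range(n_folds)]
    for g in order:
        min(folds, key=len).extend(members[g])

    for k in range(n_folds):
        test = sorted(folds[k])
        train = sorted(i for j, f in enumerate(folds) if j != k for i in f)
        yield train, test


# Toy example: 8 isolates in 4 hypothetical similarity clusters.
clusters = ["A", "A", "A", "B", "B", "C", "C", "D"]
splits = list(grouped_folds(clusters, n_folds=2))
for train, test in splits:
    train_clusters = {clusters[i] for i in train}
    test_clusters = {clusters[i] for i in test}
    print(sorted(test_clusters), "held out; overlap:", train_clusters & test_clusters)
```

In each fold the overlap set is empty, so the classifier is always evaluated on clusters it has not seen during training. How clusters are defined (distance measure, threshold) is left open here and would strongly influence the resulting performance estimate.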