Naive Bayes classification model for isotopologue detection in LC-HRMS data

Summary

Isotopologue identification or removal is a necessary step for reducing the number of features that must be identified in samples analyzed by non-targeted analysis. Currently available approaches rely either on predicted isotopic patterns, which require information on the molecular formula, or on an arbitrary mass tolerance, which requires information on the instrumental error. To avoid both requirements, a Naive Bayes isotopologue classification model was developed that does not depend on any thresholds or molecular formula information. This classification model uses the elemental mass defects of six elemental ratios and successfully identified isotopologues in both theoretical isotopic patterns and wastewater influent samples, outperforming one of the most commonly used approaches (i.e., the 1.0033 Da mass difference method, CAMERA). For the theoretical isotopologues, the classification model outperformed an “in-house” mass difference method, with a true positive rate (TPr) of 99.0% and a false positive rate (FPr) of 1.8%, compared to a TPr of 16.2% and an FPr of 0.02% for the mass difference method, assuming no error. For the wastewater influent samples, the classification model, with a TPr of 99.8% and a false detection rate (FDr) of 0.5%, again performed better than the mass difference method, which had a TPr of 96.3% and an FDr of 4.8%. In conclusion, the classification model can be used for isotopologue identification without requiring thresholds or information on the molecular formula.
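
To make the baseline and the reported metrics concrete, the following is a minimal Python sketch of a mass difference approach and of the three rates quoted above. It is an illustration under stated assumptions, not the CAMERA implementation or the in-house method evaluated here: the function names, the 0.005 Da default tolerance, and the conventional confusion-matrix definitions of TPr, FPr, and FDr are all assumptions introduced for this example.

```python
from typing import List, Tuple

# Nominal isotopologue spacing quoted in the summary (Da).
MASS_DIFF_DA = 1.0033


def find_isotopologue_pairs(masses: List[float],
                            tol_da: float = 0.005) -> List[Tuple[int, int]]:
    """Flag feature pairs whose mass difference is within tol_da of 1.0033 Da.

    tol_da is a hypothetical tolerance; choosing it is exactly the kind of
    arbitrary threshold that a mass difference method depends on.
    Returns (lighter_index, heavier_index) pairs into the input list.
    """
    order = sorted(range(len(masses)), key=lambda k: masses[k])
    pairs = []
    for pos, i in enumerate(order):
        for j in order[pos + 1:]:
            diff = masses[j] - masses[i]
            if diff > MASS_DIFF_DA + tol_da:
                break  # masses are sorted, so later features are even further away
            if abs(diff - MASS_DIFF_DA) <= tol_da:
                pairs.append((i, j))
    return pairs


def rates(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Conventional definitions (assumed here, not taken from the text):
    TPr = TP / (TP + FN), FPr = FP / (FP + TN), FDr = FP / (FP + TP).
    A rate with a zero denominator is returned as NaN.
    """
    def ratio(num: int, den: int) -> float:
        return num / den if den else float("nan")

    return {
        "TPr": ratio(tp, tp + fn),
        "FPr": ratio(fp, fp + tn),
        "FDr": ratio(fp, fp + tp),
    }
```

For example, calling `find_isotopologue_pairs([180.0634, 181.0668, 182.0701, 200.0])` with the default tolerance flags the pairs `(0, 1)` and `(1, 2)`. Widening or narrowing `tol_da` changes which pairs are flagged, which illustrates why a threshold-free classifier is attractive.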