Machine Learning methods for genome enabled prediction of complex traits: Benchmarking and robustness to marker elimination

Elenter, Juan - Etchebarne, Guillermo - Hounie, Ignacio - Fariello, María Inés - Lecumberry, Federico

Abstract:

A plethora of machine learning and statistical methods have been applied to genome-enabled prediction. Here we address the prediction of complex traits from SNP marker data in agriculture. The datasets used present different levels of trait complexity: yeast yield, Holstein cattle milk yield, German bulls' sire conception rate, and wheat yield. Population structure, number of samples, and number of SNPs also vary among datasets. We benchmark several popular models, including Bayesian and penalized linear regressions, kernel methods, and decision tree ensembles. Through exhaustive hyperparameter tuning we outperform state-of-the-art results in all datasets. Furthermore, we compare two genome encodings: one-hot encoding and additive encoding, the latter being the standard encoding used in quantitative genetics. We show that, in these datasets, additive encoding outperforms categorical encodings even though the variables are categorical in nature. This difference in performance may be caused by the predominance of additive effects, the increase in dimensionality, and the loss of the one-to-one correspondence between variables and biological markers. Regarding robustness to random marker elimination, we found that on all datasets most models show a negligible loss in predictive power even when trained on a small, random sample of markers. We argue that sample size limits the number of SNPs that are informative with respect to the downstream prediction task.
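The two encodings and the random marker elimination described in the abstract can be illustrated with a minimal sketch. It assumes genotypes are stored as integer counts 0, 1, 2 of a reference allele per SNP, which is a common convention but is not stated in this record. The function names, the toy genotype matrix, and the 25% retention fraction are hypothetical and are not taken from the authors' work.

```python
import numpy as np


def additive_encoding(genotypes):
    """Additive encoding: one numeric column per SNP holding the allele count.

    Assumes genotypes are already coded as 0, 1, 2 (assumption, see lead-in).
    """
    return np.asarray(genotypes, dtype=float)


def one_hot_encoding(genotypes, n_classes=3):
    """One-hot encoding: each SNP becomes n_classes indicator columns.

    Columns 3*j .. 3*j+2 of the result belong to SNP j, so the feature
    count triples and a single column no longer maps to a single marker.
    """
    g = np.asarray(genotypes, dtype=int)
    n_samples, n_snps = g.shape
    indicators = np.eye(n_classes)[g]  # shape: (n_samples, n_snps, n_classes)
    return indicators.reshape(n_samples, n_snps * n_classes)


def random_marker_subset(X, fraction, rng):
    """Keep a uniformly random subset of marker columns (no replacement)."""
    n_snps = X.shape[1]
    n_keep = max(1, int(round(fraction * n_snps)))
    keep = np.sort(rng.choice(n_snps, size=n_keep, replace=False))
    return X[:, keep], keep


# Toy data for illustration only: 5 samples, 8 SNPs, random genotypes.
rng = np.random.default_rng(0)
G = rng.integers(0, 3, size=(5, 8))

X_add = additive_encoding(G)   # 8 features
X_hot = one_hot_encoding(G)    # 24 features
X_sub, kept = random_marker_subset(X_add, fraction=0.25, rng=rng)  # 2 features
```

Any regression model could then be fitted on `X_add`, `X_hot`, or `X_sub` to compare predictive performance; which models and which retention fractions the authors used is not specified in this record.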


Bibliographic Details
2021
This work was partially funded by the ANII project FSDA 1-2018-1-154364.
Genomic prediction
Machine learning
Dimensionality reduction
English
Universidad de la República
COLIBRI
https://meetings.cshl.edu/meetings.aspx?meet=PROBGEN&year=21
https://hdl.handle.net/20.500.12008/36814
Open access
Creative Commons Attribution - NonCommercial - NoDerivatives License (CC BY-NC-ND 4.0)