Machine Learning methods for genome enabled prediction of complex traits: Benchmarking and robustness to marker elimination
Abstract:
A plethora of machine learning and statistical methods have been applied to genome-enabled prediction. Here we address the prediction of complex traits from SNP marker data in agriculture. The datasets used present different levels of trait complexity: yeast yield, Holstein cattle milk yield, German bull sire conception rate, and wheat yield. Population structure and the numbers of samples and SNPs also vary among datasets. We benchmark several popular models, including Bayesian and penalized linear regressions, kernel methods, and decision tree ensembles. Through exhaustive hyperparameter tuning we outperform state-of-the-art results on all datasets. Furthermore, we compare two genome codifications: one-hot encoding and additive encoding, the latter being the standard codification used in quantitative genetics. We show that, in these datasets, additive encoding outperforms categorical encoding even though the variables are categorical in nature. This difference in performance may be caused by the predominance of additive effects, the increase in dimensionality, and the loss of the one-to-one correspondence between variables and biological markers. Regarding robustness to random marker elimination, we found that on all datasets most models show a negligible loss in predictive power even when trained on a small, random sample of markers. We argue that sample size limits the number of SNPs that are informative with respect to the downstream prediction task.
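The two codifications and the random marker elimination mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' code. It assumes that genotypes are stored as allele counts 0, 1 or 2 per SNP, and the function names (`additive_encoding`, `one_hot_encoding`, `random_marker_subset`) and the toy data are hypothetical.

```python
import numpy as np

def additive_encoding(genotypes):
    """One numeric column per SNP holding the allele count (assumed 0/1/2)."""
    return np.asarray(genotypes, dtype=float)

def one_hot_encoding(genotypes, n_classes=3):
    """One indicator column per (SNP, genotype class) pair.

    The feature count grows from n_snps to n_snps * n_classes, and a
    column no longer maps one-to-one to a biological marker.
    """
    g = np.asarray(genotypes, dtype=int)
    n_samples, n_snps = g.shape
    indicators = np.eye(n_classes)[g]  # shape: (samples, snps, classes)
    return indicators.reshape(n_samples, n_snps * n_classes)

def random_marker_subset(X, fraction, rng):
    """Keep a uniformly random fraction of the marker columns."""
    n_snps = X.shape[1]
    k = max(1, int(round(fraction * n_snps)))
    kept = np.sort(rng.choice(n_snps, size=k, replace=False))
    return X[:, kept], kept

# Toy data (hypothetical): 5 individuals, 8 SNPs.
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(5, 8))

X_add = additive_encoding(genotypes)                # 8 features
X_hot = one_hot_encoding(genotypes)                 # 24 features
X_sub, kept = random_marker_subset(X_add, 0.25, rng)  # 2 of the 8 markers
```

In a robustness experiment of the kind the abstract describes, a model would be trained on `X_sub` for decreasing values of `fraction` and its predictive power compared with the model trained on all markers.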
2021
This work was partially funded by the ANII project FSDA 1-2018-1-154364.
Genomic prediction; Machine learning; Dimensionality reduction
English
Universidad de la República
COLIBRI
https://meetings.cshl.edu/meetings.aspx?meet=PROBGEN&year=21
https://hdl.handle.net/20.500.12008/36814
Open access
Creative Commons Attribution-NonCommercial-NoDerivatives License (CC BY-NC-ND 4.0)
Summary: The experiments presented in this work were carried out using ClusterUy (site: https://cluster.uy).