DNAI : Machine learning for genome enabled prediction of complex traits in agriculture

Elenter, Juan - Etchebarne, Guillermo - Hounie, Ignacio

Supervisor(s): Fariello, María Inés - Lecumberry, Federico

Abstract:

Genome-enabled prediction of complex traits aims to predict a measurable characteristic of an organism from its genetic information. In the present work we address diverse traits and organisms, including yeast growth, wheat yield, Jersey bull fertility and Holstein cattle milk yield. We benchmark several popular Machine Learning models: Bayesian and penalized linear regressions, kernel methods, and decision tree ensembles. Through exhaustive hyperparameter tuning we outperform state-of-the-art results on most datasets. We also compare two encoding techniques for the input data and perform ablation studies to assess robustness to the elimination of genetic markers (i.e., input features). We then explore different Deep Learning architectures for this task. We propose and evaluate CNN architectures, showing that residual connections improve performance but that in some cases Fully Connected Networks outperform CNNs. We link this to the fact that absolute positions are relevant in genomes, so the translational equivariance of CNNs may not be an adequate inductive bias for this problem. In addition, we explore using PCA and t-SNE to map input features to two-dimensional image-like feature maps that serve as inputs to 2D-CNN architectures. We assess the effectiveness of these dimensionality reduction techniques when used to construct those mappings, and find that in some cases random mappings perform comparably. We also propose a method to construct these image-like feature maps based on an approximation to the Fermat distance. Furthermore, we evaluate graph neural network architectures by formulating trait prediction as a node regression problem on a population graph, where each node represents an individual and edges represent associations between their genetic information. We evaluate the transferability of these graphical models and find that the extent to which they exploit neighbourhood information is limited.
We also propose a model combining CNN and GNN architectures, which outperforms all other models in Holstein cattle milk yield prediction. Lastly, we propose directly optimising Pearson correlation, the metric commonly used to evaluate model performance, in place of the MSE that is usually minimised. Although this loss does not penalise learning an affine transformation of the actual phenotypes, we show that this affine transformation can be estimated from training data, and that doing so leads to models with both lower MSE and higher predictive correlations.
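The last idea in the abstract can be illustrated with a small sketch. The abstract does not show the authors' implementation, so the code below is only a toy illustration under stated assumptions: the loss is taken to be 1 - r, the affine transformation is estimated by ordinary least squares on the training predictions, and all function names and data are hypothetical.

```python
from math import sqrt

def pearson(pred, target):
    """Pearson correlation between two equal-length sequences."""
    n = len(pred)
    mp, mt = sum(pred) / n, sum(target) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, target))
    vp = sum((p - mp) ** 2 for p in pred)
    vt = sum((t - mt) ** 2 for t in target)
    return cov / sqrt(vp * vt)

def pearson_loss(pred, target):
    """Assumed loss form: 1 - r. Unchanged if pred is rescaled by a
    positive factor or shifted, so scale and offset are not learned."""
    return 1.0 - pearson(pred, target)

def fit_affine(pred, target):
    """Least-squares slope a and intercept b so that a * pred + b
    approximates target (an assumed way to estimate the transformation)."""
    n = len(pred)
    mp, mt = sum(pred) / n, sum(target) / n
    a = sum((p - mp) * (t - mt) for p, t in zip(pred, target)) / sum(
        (p - mp) ** 2 for p in pred
    )
    return a, mt - a * mp

def mse(pred, target):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# Hypothetical toy data: raw outputs are perfectly correlated with the
# phenotypes but live on the wrong scale and offset.
train_y = [1.0, 2.0, 3.0, 4.0, 5.0]
train_raw = [0.1 * y - 2.0 for y in train_y]

# Estimate the affine map on training data only, then apply it elsewhere.
a, b = fit_affine(train_raw, train_y)
test_y = [6.0, 7.0]
test_raw = [0.1 * y - 2.0 for y in test_y]
test_calibrated = [a * p + b for p in test_raw]

print(pearson_loss(train_raw, train_y))   # close to 0: correlation is maximal
print(mse(test_raw, test_y), mse(test_calibrated, test_y))
```

In this toy case the correlation loss is already near zero before calibration, while the MSE only becomes small once the affine map fitted on the training set is applied to the held-out predictions.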


Bibliographic Details
2021
Deep learning
Genomic prediction
Neural networks
Graphs
English
Universidad de la República
COLIBRI
https://hdl.handle.net/20.500.12008/28582
Open access
Creative Commons Attribution-NonCommercial-NoDerivatives License (CC BY-NC-ND 4.0)