
DANN: a deep learning approach for annotating the pathogenicity of genetic variants

Daniel Quang 1,2,†, Yifei Chen 1,† and Xiaohui Xie 1,2,∗

1 Department of Computer Science, University of California, Irvine, CA 92697, USA and 2 Center for Complex Biological Systems, University of California, Irvine, CA 92697, USA

∗ To whom correspondence should be addressed.
† The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

ABSTRACT

Summary: Annotating genetic variants, especially noncoding variants, for the purpose of identifying pathogenic variants remains a challenge. CADD is an algorithm designed to annotate both coding and noncoding variants, and has been shown to outperform other annotation algorithms. CADD trains a linear kernel support vector machine (SVM) to differentiate evolutionarily derived, likely benign, alleles from simulated, likely deleterious, variants. However, SVMs cannot capture nonlinear relationships among the features, which can limit performance. To address this issue, we have developed DANN. DANN uses the same feature set and training data as CADD to train a deep neural network (DNN). DNNs can capture nonlinear relationships among features and are better suited than SVMs for problems with a large number of samples and features. We exploit CUDA-compatible GPUs and deep learning techniques such as dropout and momentum training to accelerate the DNN training. DANN achieves about a 19% relative reduction in the error rate and about a 14% relative increase in the area under the curve (AUC) metric over CADD's SVM methodology.

Availability and implementation: All data and source code are available at https://cbcl.ics.uci.edu/public_data/DANN/.

Contact: xhx@ics.uci.edu

1 INTRODUCTION

Identifying the genetic variants responsible for diseases can be very challenging. The majority of candidate variants lie in noncoding sections of the genome, whose role in maintaining normal genome function is not well understood. Most annotation methods can only annotate protein coding variants, excluding >98% of the human genome. Another annotation method, Combined Annotation-Dependent Depletion (CADD) (Kircher et al., 2014), can annotate both coding and noncoding variants. CADD trains a linear kernel SVM to separate observed genetic variants from simulated genetic variants. Observed genetic variants are derived from differences between human genomes and the inferred human-chimpanzee ancestral genome. Because of natural selection effects, observed variants are depleted of deleterious variants. Simulated genetic variants are enriched for deleterious variants.

CADD's SVM can only learn linear representations of the data, which limits its performance. To overcome this, we implemented a DNN algorithm that we have named DANN (Deleterious Annotation of genetic variants using Neural Networks). A DNN is an artificial neural network with several hidden layers of units between the input and output layers. The extra layers give a DNN added levels of abstraction, but can greatly increase the computational time needed for training. Deep learning techniques and GPU hardware can significantly reduce the computational time needed to train DNNs. DNNs outperform simpler linear approaches such as logistic regression (LR) and SVMs for classification problems involving many features and samples.

2 METHODS

2.1 Model training

DANN trains a DNN consisting of an input layer, a sigmoid function output layer, and three 1000-node hidden layers with hyperbolic tangent activation function. We use deepnet (https://github.com/nitishsrivastava/deepnet) to exploit fast CUDA parallelized GPU programming on an NVIDIA Tesla M2090 card and applied dropout and momentum training to minimize the cross entropy loss function. Dropout reduces overfitting by randomly dropping nodes from the DNN (Srivastava, 2013). Momentum training adjusts the parameter increment as a function of the gradient and learning rate (Sutskever et al., 2013).
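Concretely, the classical momentum update analyzed by Sutskever et al. (2013) takes the form

    v_{t+1} = μ·v_t − ε·∇f(θ_t),        θ_{t+1} = θ_t + v_{t+1},

where θ_t are the model parameters at step t, μ is the momentum coefficient (the "momentum rate" below) and ε is the learning rate.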

DANN uses a hidden node dropout rate of 0.1, a momentum rate that increases from 0.01 to 0.99 linearly for the first 10 epochs and then remains at 0.99, and stochastic gradient descent (SGD) with a minibatch size of 100.
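As a concrete illustration, the architecture and training schedule described above could be written as follows. The original work used the deepnet library, so this PyTorch rendering, along with the learning rate, epoch count, and all variable names, is an assumption made for illustration only.

import torch
import torch.nn as nn

N_FEATURES = 949  # one input unit per feature (see Section 2.2)

# Three 1000-node tanh hidden layers with 0.1 dropout, and a sigmoid output
model = nn.Sequential(
    nn.Linear(N_FEATURES, 1000), nn.Tanh(), nn.Dropout(0.1),
    nn.Linear(1000, 1000), nn.Tanh(), nn.Dropout(0.1),
    nn.Linear(1000, 1000), nn.Tanh(), nn.Dropout(0.1),
    nn.Linear(1000, 1), nn.Sigmoid(),
)

loss_fn = nn.BCELoss()  # cross entropy loss for a binary label
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.01)

def momentum_at(epoch):
    # Linear ramp from 0.01 to 0.99 over the first 10 epochs, then constant
    return 0.99 if epoch >= 10 else 0.01 + (0.99 - 0.01) * epoch / 10

def train(loader, n_epochs=20):
    # loader is assumed to yield minibatches of 100:
    # x as float tensors of shape (100, 949), y as floats of shape (100, 1)
    for epoch in range(n_epochs):
        for group in optimizer.param_groups:
            group["momentum"] = momentum_at(epoch)  # update the momentum schedule
        for x, y in loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()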

As a baseline comparison, we trained an LR model. For LR training, we applied SGD using the scikit-learn library (Pedregosa et al., 2011) with parameter α = 0.01, which we found to maximize the accuracy of the LR model. LR and DNN are sensitive to feature scaling, so we preprocess the features to have unit variance before training either model. We also train an SVM using the LIBOCAS v0.97 library (Franc and Sonnenburg, 2009) with parameter C = 0.0025, closely replicating CADD's training.
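A rough scikit-learn sketch of the baselines and the unit-variance preprocessing might look as follows. The synthetic data, the use of StandardScaler, and LinearSVC as a stand-in for the LIBOCAS solver are illustrative assumptions, not the authors' exact pipeline.

import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC

# Stand-in data: a sparse 949-feature matrix and binary labels
rng = np.random.default_rng(0)
X = sparse.random(1000, 949, density=0.05, format="csr", random_state=0)
y = rng.integers(0, 2, size=1000)

# Scale each feature to unit variance; with_mean=False preserves sparsity
X_scaled = StandardScaler(with_mean=False).fit_transform(X)

# LR trained by SGD with regularization parameter alpha = 0.01, as in the text
lr_model = SGDClassifier(loss="log_loss", alpha=0.01).fit(X_scaled, y)

# A linear SVM with C = 0.0025; LinearSVC only approximates LIBOCAS behavior
svm_model = LinearSVC(C=0.0025).fit(X_scaled, y)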

2.2 Features

There are a total of 949 features defined for each variant. The feature set is sparse, and includes a mix of real valued numbers, integers, and binary values. For example, amino acid identities are only defined for coding variants. To account for this, we include Boolean features that indicate whether a given feature is undefined, and missing values are imputed. Moreover, all n-level categorical values, such as reference allele identity, are converted to n individual Boolean flags. See the Supplementary of Kircher et al. (2014) for more details about the features and imputation.
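To make the preprocessing concrete, a toy pandas sketch of the indicator, imputation, and one-hot steps is given below. The column names and the zero-imputation choice are hypothetical; the actual imputation scheme is described in the CADD supplementary.

import pandas as pd

# Toy rows: a categorical feature and a feature defined only for coding variants
df = pd.DataFrame({
    "ref_allele": ["A", "C", "G", "T"],
    "aa_score": [0.7, None, None, 0.2],
})

# Boolean feature flagging where aa_score is undefined, then impute the gaps
df["aa_score_undefined"] = df["aa_score"].isna()
df["aa_score"] = df["aa_score"].fillna(0.0)

# Convert the n-level categorical into n individual Boolean flags
df = pd.get_dummies(df, columns=["ref_allele"], prefix="ref")
print(df.head())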


© The Author (2014). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

Associate Editor: Dr. John Hancock

Bioinformatics Advance Access published October 22, 2014