EnzML : multi-label prediction of enzyme classes using InterPro signatures

De Ferrari, Luna; Aitken, Stuart; van Hemert, Jano; Goryanin, Igor

Files in this item

Name: 1471_2105_13_61.pdf
Size: 1.049Mb
Format: PDF

Item metadata

dc.contributor.author	De Ferrari, Luna
dc.contributor.author	Aitken, Stuart
dc.contributor.author	van Hemert, Jano
dc.contributor.author	Goryanin, Igor
dc.date.accessioned	2014-01-13T13:01:04Z
dc.date.available	2014-01-13T13:01:04Z
dc.date.issued	2012-04
dc.identifier	35085614
dc.identifier	2a87b079-caa1-4a01-b6df-0624708ed880
dc.identifier	000308067200001
dc.identifier	84860154266
dc.identifier.citation	De Ferrari, L, Aitken, S, van Hemert, J & Goryanin, I 2012, 'EnzML : multi-label prediction of enzyme classes using InterPro signatures', BMC Bioinformatics, vol. 13, 61. https://doi.org/10.1186/1471-2105-13-61	en
dc.identifier.issn	1471-2105
dc.identifier.uri	https://hdl.handle.net/10023/4361
dc.description	LDF is funded by ONDEX DTG, BBSRC TPS Grant BB/F529038/1 of the Centre for Systems Biology at Edinburgh and the University of Newcastle. SA is supported by a Wellcome Trust Value In People award and, together with IG, the Centre for Systems Biology at Edinburgh, a centre funded by the Biotechnology and Biological Sciences Research Council and the Engineering and Physical Sciences Research Council (BB/D019621/1). JVH was funded by various BBSRC and EPSRC grants.	en
dc.description.abstract	Background: Manual annotation of enzymatic functions cannot keep up with automatic genome sequencing. In this work we explore the capacity of InterPro sequence signatures to automatically predict enzymatic function. Results: We present EnzML, a multi-label classification method that can also efficiently account for proteins with multiple enzymatic functions (50,000 in UniProt). EnzML was evaluated using a standard set of 300,747 proteins for which the manually curated Swiss-Prot and KEGG databases have agreeing Enzyme Commission (EC) annotations. EnzML achieved more than 98% subset accuracy (exact match of all correct Enzyme Commission classes of a protein) for the entire dataset and between 87 and 97% subset accuracy in reannotating eight entire proteomes: human, mouse, rat, mouse-ear cress, fruit fly, the S. pombe yeast, the E. coli bacterium and the M. jannaschii archaebacterium. To understand the role played by the dataset size, we compared the cross-evaluation results of smaller datasets, either constructed at random or from specific taxonomic domains such as archaea, bacteria, fungi, invertebrates, plants and vertebrates. The results were confirmed even when the redundancy in the dataset was reduced using UniRef100, UniRef90 or UniRef50 clusters. Conclusions: InterPro signatures are a compact and powerful attribute space for the prediction of enzymatic function. This representation makes multi-label machine learning feasible in reasonable time (30 minutes to train on 300,747 instances with 10,852 attributes and 2,201 class values) using the Mulan Binary Relevance Nearest Neighbours algorithm implementation (BR-kNN).
dc.format.extent	12
dc.format.extent	1100222
dc.language.iso	eng
dc.relation.ispartof	BMC Bioinformatics	en
dc.subject	Enzymatic function	en
dc.subject	Automatic genome sequencing	en
dc.subject	InterPro signatures	en
dc.subject	EnzML	en
dc.subject	QA75 Electronic computers. Computer science	en
dc.subject	QH301 Biology	en
dc.subject.lcc	QA75	en
dc.subject.lcc	QH301	en
dc.title	EnzML : multi-label prediction of enzyme classes using InterPro signatures	en
dc.type	Journal article	en
dc.contributor.institution	University of St Andrews. School of Chemistry	en
dc.identifier.doi	10.1186/1471-2105-13-61
dc.description.status	Peer reviewed	en

This item appears in the following Collection(s)

University of St Andrews Research
