Improving classification in protein structure databases using text mining

Koussounadis, Antonis; Redfern, Oliver; Jones, David

Files in this item

Name: koussounadis2009bmcbioinformatics129.pdf
Size: 709.7Kb
Format: PDF

Item metadata

dc.contributor.author	Koussounadis, Antonis
dc.contributor.author	Redfern, Oliver
dc.contributor.author	Jones, David
dc.date.accessioned	2014-04-29T09:01:04Z
dc.date.available	2014-04-29T09:01:04Z
dc.date.issued	2009-05-05
dc.identifier	6215264
dc.identifier	52344b18-b696-4d32-8313-b57c28e9e192
dc.identifier	66549127733
dc.identifier.citation	Koussounadis, A, Redfern, O & Jones, D 2009, 'Improving classification in protein structure databases using text mining', BMC Bioinformatics, vol. 10, no. 129, pp. 638-647. https://doi.org/10.1186/1471-2105-10-129	en
dc.identifier.issn	1471-2105
dc.identifier.uri	https://hdl.handle.net/10023/4648
dc.description	AK was funded by BBSRC grant BBC5072531 and the BioSapiens Network of Excellence (funded by the European Commission within its FP6 programme, under the thematic area Life Sciences, Genomics and Biotechnology for Health), grant number: LSHG-CT-2003-503265	en
dc.description.abstract	BACKGROUND: The classification of protein domains in the CATH resource is primarily based on structural comparisons, sequence similarity and manual analysis. One of the main bottlenecks in the processing of new entries is the evaluation of 'borderline' cases by human curators with reference to the literature, and better tools for helping both expert and non-expert users quickly identify relevant functional information from text are urgently needed. A text based method for protein classification is presented, which complements the existing sequence and structure-based approaches, especially in cases exhibiting low similarity to existing members and requiring manual intervention. The method is based on the assumption that textual similarity between sets of documents relating to proteins reflects biological function similarities and can be exploited to make classification decisions. RESULTS: An optimal strategy for the text comparisons was identified by using an established gold standard enzyme dataset. Filtering of the abstracts using a machine learning approach to discriminate sentences containing functional, structural and classification information that are relevant to the protein classification task improved performance. Testing this classification scheme on a dataset of 'borderline' protein domains that lack significant sequence or structure similarity to classified proteins showed that although, as expected, the structural similarity classifiers perform better on average, there is a significant benefit in incorporating text similarity in logistic regression models, indicating significant orthogonality in this additional information. Coverage was significantly increased, especially at low error rates, which is important for routine classification tasks: 15.3% for the combined structure and text classifier compared to 10% for the structural classifier alone, at a 10^-3 error rate. Finally, when only the highest scoring predictions were used to infer classification, an extra 4.2% of correct decisions were made by the combined classifier. CONCLUSION: We have described a simple text based method to classify protein domains that demonstrates an improvement over existing methods. The method is unique in incorporating structural and text based classifiers directly and is particularly useful in cases where inconclusive evidence from sequence or structure similarity requires laborious manual classification.
dc.format.extent	9
dc.format.extent	726832
dc.language.iso	eng
dc.relation.ispartof	BMC Bioinformatics	en
dc.rights	© 2009 Koussounadis et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited	en
dc.subject	QH301 Biology	en
dc.subject	QA75 Electronic computers. Computer science	en
dc.subject.lcc	QH301	en
dc.subject.lcc	QA75	en
dc.title	Improving classification in protein structure databases using text mining	en
dc.type	Journal article	en
dc.contributor.institution	University of St Andrews. School of Biology	en
dc.identifier.doi	10.1186/1471-2105-10-129
dc.description.status	Peer reviewed	en
dc.identifier.url	http://www.ncbi.nlm.nih.gov/pubmed/19416501	en
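
The abstract in this record describes comparing sets of documents by textual similarity and combining that score with a structural similarity score in a logistic regression model. The following is a minimal Python sketch of that general idea, not the authors' implementation: the bag-of-words weighting, the function names (`term_counts`, `cosine_similarity`, `combined_score`) and the regression weights are all assumptions made for illustration, and the record does not state which similarity measure or coefficients the paper used.

```python
import math
import re
from collections import Counter


def term_counts(documents):
    """Pool a set of abstracts into one bag-of-words term-frequency vector."""
    counts = Counter()
    for text in documents:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def combined_score(structure_score, text_score, weights=(-4.0, 3.0, 3.0)):
    """Logistic model over both scores.

    The weights are hypothetical placeholders, not fitted values from the paper.
    """
    b0, b_struct, b_text = weights
    z = b0 + b_struct * structure_score + b_text * text_score
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical inputs: abstracts linked to a query domain and to a candidate family.
query_docs = ["Crystal structure of a serine protease with a catalytic triad"]
family_docs = ["Serine protease fold and catalytic triad geometry"]

text_score = cosine_similarity(term_counts(query_docs), term_counts(family_docs))
probability = combined_score(structure_score=0.5, text_score=text_score)
```

In practice the weights would be fitted on labelled domain pairs, and the structural score would come from a structure comparison method; both are outside what this record specifies.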
