PFClust: a novel parameter free clustering algorithm

Mavridis, Lazaros; Nath, Neetika; Mitchell, John B. O.

Files in this item

Name: Mavridis2013_1471_2105_14_213.pdf
Size: 3.967Mb
Format: PDF

Item metadata

dc.contributor.author	Mavridis, Lazaros
dc.contributor.author	Nath, Neetika
dc.contributor.author	Mitchell, John B. O.
dc.date.accessioned	2013-08-06T11:31:07Z
dc.date.available	2013-08-06T11:31:07Z
dc.date.issued	2013-07-03
dc.identifier	57883350
dc.identifier	e3f0b5d6-d9ec-4c86-b3d4-9ec6529538c0
dc.identifier	84879809457
dc.identifier.citation	Mavridis, L, Nath, N & Mitchell, J B O 2013, 'PFClust: a novel parameter free clustering algorithm', BMC Bioinformatics, vol. 14, no. 213, 213. https://doi.org/10.1186/1471-2105-14-213	en
dc.identifier.issn	1471-2105
dc.identifier.other	ORCID: /0000-0002-0379-6097/work/34033396
dc.identifier.uri	https://hdl.handle.net/10023/3930
dc.description.abstract	Background: We present the algorithm PFClust (Parameter Free Clustering), which is able automatically to cluster data and identify a suitable number of clusters to group them into without requiring any parameters to be specified by the user. The algorithm partitions a dataset into a number of clusters that share some common attributes, such as their minimum expectation value and variance of intra-cluster similarity. A set of n objects can be clustered into any number of clusters from one to n, and there are many different hierarchical and partitional, agglomerative and divisive, clustering methodologies available that can be used to do this. Nonetheless, automatically determining the number of clusters present in a dataset constitutes a significant challenge for clustering algorithms. Identifying a putative optimum number of clusters to group the objects into involves computing and evaluating a range of clusterings with different numbers of clusters. However, there is no agreed or unique definition of optimum in this context. Thus, we test PFClust on datasets for which an external gold standard of 'correct' cluster definitions exists, noting that this division into clusters may be suboptimal according to other reasonable criteria. PFClust is heuristic in the sense that it cannot be described in terms of optimising any single simply-expressed metric over the space of possible clusterings.
Results: We validate PFClust firstly with reference to a number of synthetic datasets consisting of 2D vectors, showing that its clustering performance is at least equal to that of six other leading methodologies -- even though five of the other methods are told in advance how many clusters to use. We also demonstrate the ability of PFClust to classify the three dimensional structures of protein domains, using a set of folds taken from the structural bioinformatics database CATH.
Conclusions: We show that PFClust is able to cluster the test datasets a little better, on average, than any of the other algorithms, and furthermore is able to do this without the need to specify any external parameters. Results on the synthetic datasets demonstrate that PFClust generates meaningful clusters, while our algorithm also shows excellent agreement with the correct assignments for a dataset extracted from the CATH part-manually curated classification of protein domain structures.
dc.format.extent	21
dc.format.extent	4159889
dc.language.iso	eng
dc.relation.ispartof	BMC Bioinformatics	en
dc.subject	PFClust (Parameter Free Clustering)	en
dc.subject	Clustering algorithms	en
dc.subject	QD Chemistry	en
dc.subject.lcc	QD	en
dc.title	PFClust: a novel parameter free clustering algorithm	en
dc.type	Journal article	en
dc.contributor.institution	University of St Andrews. School of Chemistry	en
dc.contributor.institution	University of St Andrews. Biomedical Sciences Research Complex	en
dc.contributor.institution	University of St Andrews. EaSTCHEM	en
dc.identifier.doi	10.1186/1471-2105-14-213
dc.description.status	Peer reviewed	en
dc.identifier.url	http://www.biomedcentral.com/1471-2105/14/213/	en

This item appears in the following Collection(s)

University of St Andrews Research