Discovering topic structures of a temporally evolving document corpus

In this paper we describe a novel framework for the discovery of the topical content of a data corpus, and the tracking of its complex structural changes across the temporal dimension. In contrast to previous work our model does not impose a prior on the rate at which documents are added to the corpus nor does it adopt the Markovian assumption which overly restricts the type of changes that the model can capture. Our key technical contribution is a framework based on (i) discretization of time into epochs, (ii) epoch-wise topic discovery using a hierarchical Dirichlet process-based model, and (iii) a temporal similarity graph which allows for the modelling of complex topic changes: emergence and disappearance, evolution, splitting, and merging. The power of the proposed framework is demonstrated on two medical literature corpora concerned with the autism spectrum disorder (ASD) and the metabolic syndrome (MetS)—both increasingly important research subjects with significant social and healthcare consequences. In addition to the ASD and metabolic syndrome literature corpora which we collected and made freely available, our contribution also includes an extensive empirical analysis of the proposed framework. We describe a detailed and careful examination of the effects that our algorithm's free parameters have on its output and discuss the significance of the findings both in the context of the practical application of our algorithm as well as in the context of the existing body of work on temporal topic analysis. Our quantitative analysis is followed by several qualitative case studies highly relevant to the current research on ASD and MetS, on which our algorithm is shown to capture well the actual developments in these fields.


Introduction
In the last decade and a half, so-called topic modelling has emerged as a powerful statistical paradigm for the automatic semantic analysis of large collections of documents. Topic models, as their name suggests, can be seen as formalizations of the colloquial understanding of the 'topics' addressed in a piece of text. More specifically, in this context a topic becomes a probability distribution over a fixed vocabulary of words (or, more generally, terms). An example of a particular topic can thus be given as a two-row table whose upper row contains probabilities which correspond to the vocabulary words shown in the bottom row. Using higher order semantic understanding, a human interpreting such a formal representation of a topic may describe it as capturing a discussion of the educational needs of children with ASD, although it should be noted that such interpretation may not always be straightforward [16].
Most research on topic modelling to date has focused on analysis in the context of static corpora, that is, document collections which do not possess a temporal structure. In such collections the documents are said to be exchangeable. The key techniques dominating this domain are Bayesian probabilistic models, the latent Dirichlet allocation (LDA) in particular, first described by Blei et al. [15] and subsequently extended in a variety of ways. Indeed, at the bottom-most level the present paper uses a model based on the hierarchical Dirichlet process (HDP), which is one of the aforementioned extensions. Both LDA and HDP are explained in some detail in Section 3.1.
However, in many problems of practical significance, it is not only the instantaneous topic structure that is of interest: the change in this structure over time often conveys important information and insight too. For this reason in recent years the problem of temporal topic modelling has been attracting an increasing amount of research attention [8,9]. Indeed the focus of the present paper is on temporally changing corpora. At the heart of the method that we describe is an automatically constructed and temporally constrained similarity graph superimposed over the topics extracted in successive epochs. In particular, the work described herein extends our contribution first described in [8] (also see [10,11] for related prior work). Retaining the same structural framework, in the present paper we analyse in detail the effects of different pruning parameters in the construction of the graph superimposed over the extracted topics. Specifically we consider different choices of inter-topic distance measures and the process of selecting an appropriate pruning threshold given a specific distance measure. In addition, we describe extended experiments on the collection of abstracts of scholarly articles on the autism spectrum disorder (ASD), gathered by ourselves and made publicly available, as well as additional experiments on a newly gathered collection of abstracts of scholarly articles in the highly active sphere of work on the metabolic syndrome. This new collection of documents will also be made publicly available.
The remainder of the present paper is structured as follows. In the next section we review relevant previous work, both on topic modelling in general and on temporal modelling, which is the focus of our work. Then in Section 3 we first describe the key prerequisite modelling techniques which our contribution builds upon (Section 3.1), followed by the detail of the proposed modelling framework (Section 3.2). Section 3.2 introduces our main technical contributions.
Specifically these are our general temporal model based on what we term the temporal similarity graph, and the two key aspects involved in the construction of this graph: the choice of an inter-topic similarity measure (Section 3.2.1) and a graph pruning process (Section 3.2.2). The proposed method is extensively evaluated in Section 4. Section 4.1 motivates our focus on the biomedical documents used in the evaluation, with Section 4.1.1 summarizing previous work in the specific domain of biomedical text mining.

Previous work
Owing to its growing popularity in recent years, the existing literature on probabilistic topic modelling is already vast. Hence a comprehensive survey of the field is out of the scope of the present paper. Herein we overview the broad and most influential research directions, with a particular focus on techniques of direct relevance to the methodology described in the present paper. In particular we direct our attention first to latent topic models, which have dominated the field in the last decade, and then to biomedical text mining, given the application domain within which our framework is evaluated in Section 4.
Topic models in modern machine learning are often described as latent probabilistic models. The attribute 'latent' is intended to capture the nature of inferred topics: these are hidden variables in the sense that they are not explicit in the observable data itself. Arguably the sought-after "true" topic structure is also neither objective nor accessible even in principle. On the other hand, the attribute 'probabilistic' conveys the inherent aspect of the aforementioned models whereby modelling imperfections as well as ambiguities in data are handled through the use of probability distributions which readily accommodate uncertainty. Therefore we start our discussion with the simplest latent topic models which formalize the key ideas in the field and underlie many of the subsequent, more complex models.

Latent topic models
An important early topic modelling approach comes in the form of so-called latent semantic indexing (LSI) [19], which remains popular. Two notable limitations of LSI are its inability to deal effectively with polysemy and its failure to produce an explicit description of the latent space. Probabilistic LSI overcomes these by explicitly characterizing the latent space with semantic topics, and by employing a probabilistic generative model that addresses the polysemy problem [31]. Nevertheless, probabilistic LSI is prone to parameter overfitting caused by an uncontrolled growth in the number of parameters as the document corpus grows. In addition, the necessary assignment of probabilities to documents is a nontrivial task [15].
The recently proposed latent Dirichlet allocation (LDA) method [15] overcomes the overfitting problem by adopting a Bayesian framework and a generative process at the document level. While LDA has quickly become a standard tool for topic modelling, it too experiences challenges when applied to real-world data. In particular, being a parametric model, it requires the number of desired output topics to be specified in advance. The HDP model, the nonparametric counterpart of LDA introduced by Teh et al. [55], addresses this limitation by using a hierarchical Dirichlet process (as opposed to a Dirichlet distribution) as the prior on topics. Each document is thus modelled using an infinite mixture model, allowing the data to inform the complexity of the model and the number of resulting topics to be inferred automatically. We discuss this model in further detail in Section 3.

Temporal topic modelling
A notable limitation of most models described in the existing literature lies in their assumption that the data corpus is static; this includes those based on LDA mentioned previously, or the hierarchical Dirichlet process described in detail in the next section. Here the term 'static' is used to describe the lack of any associated temporal (or indeed sequential) information associated with the documents in a corpus -the documents are said to be exchangeable [14].
However, in many practical applications documents are added to the corpus in a temporal manner and their ordering has significance i.e. the documents are non-exchangeable. As a consequence, the topical structure of the corpus changes over time. For example, the corpus of scholarly literature on a particular subject is a growing corpus which by its very nature exhibits significant changes in its topic structure over time [22,21]: new ideas emerge, old ideas are refined, novel discoveries result in multiple ideas being related to one another thereby forming more complex concepts or a single idea multifurcating into different 'sub-ideas' which are thereafter investigated with some degree of independence etc. A good example from a different realm can be readily found in the corpus of social media contributions, such as Twitter [9]. With an even faster pace here too complex topic changes can be observed, with novel topics of conversation being instigated by e.g. 'real-world' events (such as epidemics, terrorist attacks, or developments in the world of popular culture), changed by the contributions of other users, split into new topics, merged with others etc.
The assumption made by all previous work, and indeed adopted by us, is that documents are not exchangeable at large temporal scales but are at short time scales, thus treating the corpus as temporally locally static. The scale at which this assumption can be considered as valid is clearly application and corpus dependent, and is an important consideration. Indeed the present paper investigates this in detail.
The existing work on temporal topic modelling can be divided into two groups of approaches, both of which can be based on parametric [14,58,59] or nonparametric [44,62] techniques, the former suffering from the limitation that they contain free parameters which must be set a priori. Methods of the first group discretize time into epochs, apply a static topic model to each epoch, and by making the Markovian assumption relate the parameters of each epoch's topic model to those of the epochs adjacent to it in time [14,58,44,62]. While the approach we propose in this paper adopts the idea of time discretization, it diverges in its other features from this group of methods thereafter. In particular, instead of employing the Markovian assumption we describe a novel structure in the form of a temporal similarity graph, which gives our method greater flexibility, as described in detail in the next section. The second group of methods in the literature regard document time-stamps as observations of a continuous random variable [59,20]. This assumption severely limits the type of topic changes which can be described. For example, as opposed to our model, these models are not capable of describing the evolution of topics, or their splitting and merging, and are rather constrained to tracking simple topic popularity.

Bayesian mixture models
In recent years mixture models have become popular choices for the modelling of so-called heterogeneous data. In this context heterogeneity is taken to mean that observable data is generated by more than one process (source). One of the key challenges in the analysis of heterogeneous data lies in the lack of observability of the correspondence between specific data points and their sources i.e. it is not known which data source generated which data. Usually it is also the case that the number of sources is not known either [46]. Mixture models, and in particular mixture models enveloped within a Bayesian framework, have distinct advantages over alternative approaches, as we shall explain shortly.

Finite mixture modelling
Finite mixture models rely on the assumption that the observed data is generated by K clusters, each cluster being associated with the parameter φ_k and underlain by the probability density function f(·|φ_k). An observation x is assumed to be generated by first choosing a cluster k with probability π_k, followed by a random draw from the corresponding distribution described by φ_k. Therefore the process can be summarized as follows:

$$z_i \sim \mathrm{Discrete}(\pi_{1:K}), \qquad x_i \mid z_i, \phi_{1:K} \sim f(\cdot \mid \phi_{z_i}),$$

or, equivalently, $p(x) = \sum_{k=1}^{K} \pi_k\, f(x \mid \phi_k)$. In a Bayesian setting the model parameters (i.e. mixing proportions π_{1:K} and component parameters φ_{1:K}) are further endowed with priors. Typically a symmetric Dirichlet distribution is placed on π_{1:K} and a prior on φ_{1:K} conjugate to f(·|φ_k) is chosen for computational convenience.
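To make the generative process concrete, the following minimal Python sketch samples from a finite mixture; the choice of Gaussian components (i.e. f(·|φ_k) with φ_k = (μ_k, σ_k)) and all function names and parameter values are purely illustrative assumptions, not prescribed by the model above.

```python
import numpy as np

def sample_finite_mixture(n, pi, mus, sigmas, seed=None):
    """Draw n points from a K-component Gaussian mixture.

    pi     : mixing proportions pi_1..pi_K (must sum to 1)
    mus    : component means (stand-ins for the parameters phi_k)
    sigmas : component standard deviations
    """
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pi), size=n, p=pi)      # step 1: choose cluster k with probability pi_k
    x = rng.normal(np.asarray(mus)[z],         # step 2: draw from f(.|phi_{z_i})
                   np.asarray(sigmas)[z])
    return x, z

# Toy usage: three clusters with mixing proportions (0.5, 0.3, 0.2)
x, z = sample_finite_mixture(1000, [0.5, 0.3, 0.2], [-2.0, 0.0, 3.0], [0.5, 1.0, 0.7])
```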

Latent Dirichlet allocation
In the previous section we described how to model a group of data points with a Bayesian finite mixture model. Latent Dirichlet allocation adds a level of hierarchy on the mixing proportions to allow for the modelling of data points in groups that share a set of components.
Following the consensus in the literature we adopt the terminology used in the analysis of textual data (which is the context in which LDA was originally proposed [15]) and hereafter interchangeably refer to data points as words, their groups as documents, and mixture components as topics. The technical term 'topic' can be interpreted as formalizing and abstracting the colloquial notion of a topic which is understood at a higher semantic level. The modelling framework of LDA can then be described by the following generative process:

$$\pi_j \mid \alpha \sim \mathrm{Dir}(\alpha), \qquad \phi_k \mid H \sim H, \qquad z_{ji} \mid \pi_j \sim \mathrm{Discrete}(\pi_j), \qquad x_{ji} \mid z_{ji}, \phi_{1:K} \sim f(\cdot \mid \phi_{z_{ji}}),$$

where H is the base distribution of topics, α the hyperparameter of the prior on the distribution of topics within a document, π_j the distribution of topics in document j, and z_ji the topic corresponding to the i-th word in the j-th document. The corresponding model likelihood is:

$$p(x_{ji} \mid \pi_j, \phi_{1:K}) = \sum_{k=1}^{K} \pi_{jk}\, f(x_{ji} \mid \phi_k).$$

Approximation techniques such as MCMC [27] and Variational Bayes [15] can be used for posterior inference.
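The generative process above is straightforward to simulate. The sketch below samples a toy corpus under the common choice of multinomial topics with a symmetric Dirichlet base measure H; the function name and all parameter values are illustrative assumptions.

```python
import numpy as np

def generate_lda_corpus(n_docs, doc_len, K, n_v, alpha, eta, seed=None):
    """Sample a toy corpus from the LDA generative process.

    K / n_v : number of topics / vocabulary size
    alpha   : document-level Dirichlet hyperparameter
    eta     : parameter of the (assumed) symmetric Dirichlet base measure H
    """
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet([eta] * n_v, size=K)        # topics phi_k ~ H
    docs = []
    for _ in range(n_docs):
        pi_j = rng.dirichlet([alpha] * K)           # per-document topic proportions pi_j
        z = rng.choice(K, size=doc_len, p=pi_j)     # topic assignments z_ji
        docs.append([rng.choice(n_v, p=phi[k]) for k in z])  # words x_ji
    return docs, phi

docs, phi = generate_lda_corpus(n_docs=5, doc_len=50, K=3, n_v=100, alpha=0.1, eta=0.01)
```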

Infinite mixture modelling
As mentioned earlier, LDA requires the number of topics to be fixed in advance, which is a serious limitation in practice. Choosing the number of topics is usually performed by examining how well the model fits a set of held-out documents.
However, if a previously unseen topic has contributed in generating the held-out data, LDA is not able to infer correct parameters of that topic.
Bayesian non-parametric (BNP) methods place priors on the infinite-dimensional space of probability distributions and provide an elegant solution to this problem.
The Dirichlet process (DP) [23], the non-parametric counterpart of the Dirichlet distribution and the building block of BNP models, allows the model to accommodate a potentially infinite number of mixture components. The generative likelihood for a data point x in an infinite mixture model is:

$$p(x) = \sum_{k=1}^{\infty} \pi_k\, f(x \mid \phi_k).$$

A draw $G \sim \mathrm{DP}(\gamma, H)$ is parameterized by the base distribution H and the concentration parameter γ. An alternative view of the DP emerges from the so-called stick-breaking process which adopts a constructive approach using a sequence of discrete draws [50]:

$$G = \sum_{k=1}^{\infty} \beta_k\, \delta_{\phi_k}, \qquad \phi_k \sim H,$$

where $\beta = (\beta_1, \beta_2, \ldots)$ is the vector of weights obtained by a stick-breaking construction, $\beta_k = v_k \prod_{l<k} (1 - v_l)$ with $v_k \sim \mathrm{Beta}(1, \gamma)$. Owing to the discrete nature and infinite dimensionality of its draws, the DP is a highly useful prior for Bayesian mixture models. By associating different mixture components with the atoms $\phi_k$ of the stick-breaking process, and assuming data are generated as in the finite mixture case, we obtain the Dirichlet process mixture (DPM) model.
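The stick-breaking construction translates directly into code. The following sketch draws a truncated approximation of the weight vector β; the truncation level is an assumption made for practicality, since an exact draw has infinitely many atoms.

```python
import numpy as np

def stick_breaking(gamma, truncation, seed=None):
    """Truncated stick-breaking weights for a draw G ~ DP(gamma, H)."""
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, gamma, size=truncation)                   # v_k ~ Beta(1, gamma)
    stick_left = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * stick_left                                       # beta_k = v_k * prod_{l<k}(1 - v_l)

beta = stick_breaking(gamma=1.0, truncation=100)
# A draw is then G = sum_k beta_k * delta_{phi_k}, with atoms phi_k drawn i.i.d. from H
```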

Hierarchical Dirichlet process mixture models
While the DPM is suitable for non-parametric clustering of exchangeable data in a single group, many real-world problems are more appropriately modelled as comprising multiple groups of exchangeable data. In such cases it is usually desirable to model the observations of different groups jointly, allowing them to share their generative clusters. This idea is known as "sharing statistical strength" and it is naturally achieved through a hierarchical architecture in Bayesian modelling.
Consider a collection of documents. DPM models each group with an infinite number of topics. However, it is desired for multiple group-level DPMs to share their clusters. Amongst different ways of linking group-level DPMs, HDP [55] offers an interesting solution whereby base measures of group-level DPs are drawn from a corpus-level DP. In this way the atoms of the corpus-level DP (i.e. topics in our case) are shared across the documents.
This is illustrated schematically in Figure 2(a). Since $G_j$ is drawn from a DP with base measure $G_0$, it has the same support as $G_0$. Likewise, the parameters of the group-level mixture components, $\theta_{ji}$, share their values with the corpus-level DP support $\{\phi_1, \phi_2, \ldots\}$. Therefore $G_j$ can be equivalently expressed using the stick-breaking process as

$$G_j = \sum_{k=1}^{\infty} \pi_{jk}\, \delta_{\phi_k}.$$

The posterior for $\theta_{ji}$ has been shown to follow a Chinese restaurant franchise process, which can be used to develop inference algorithms based on Gibbs sampling [55].
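For intuition, a truncated sketch of this sharing mechanism is given below: the corpus-level weights β are drawn once by stick-breaking, and each group-level DP is approximated by a Dirichlet distribution over the same shared atoms, π_j ~ Dir(α₀β). This finite approximation is our own illustrative device, not the inference scheme used in [55].

```python
import numpy as np

def hdp_group_weights(gamma, alpha0, n_groups, truncation, seed=None):
    """Truncated sketch of HDP weight sharing across groups.

    beta : corpus-level weights from stick-breaking with parameter gamma
    pi   : per-group weights, pi_j ~ Dirichlet(alpha0 * beta), defined over
           the same shared atoms phi_k (the corpus-level topics)
    """
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, gamma, size=truncation)
    beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    beta /= beta.sum()                      # renormalize the truncated weights
    pi = rng.dirichlet(alpha0 * beta, size=n_groups)
    return beta, pi

beta, pi = hdp_group_weights(gamma=1.0, alpha0=5.0, n_groups=3, truncation=50)
```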

Modelling topic evolution over time
Hitherto, the discussion in this section focused on the modelling of static document corpora. We now show how our work builds on top of these ideas in the existing literature and in particular how the described HDP-based model can be applied to the analysis of temporal topic changes in a longitudinal data corpus.
Like some of the previous work in this area we begin by discretizing time into epochs, with the subset of documents falling within each epoch treated as a static collection; we shall discuss this issue in more detail in the next section.
Each epoch is then modelled separately using an HDP, with models corresponding to different epochs sharing their hyperparameters and the corpus-level base measure. Hence if n is the number of epochs, we obtain n sets of topics, where {φ 1,t , . . . , φ K_t,t } is the set of topics that describe epoch t, and K_t their number (which is inferred automatically, as described previously). This is illustrated in Figure 2(b). In the next section we describe how, given an inter-topic similarity measure, the evolution of different topics across epochs can be tracked. The key idea behind our approach stems from the observation that while topics may change significantly over time, provided that the duration of epochs is chosen appropriately, the change between successive epochs is limited. Therefore we infer the continuity of a topic in one epoch by relating it to all topics in the immediately subsequent epoch which are sufficiently similar to it under a suitable similarity measure. This can be seen to lead naturally to a similarity graph representation whose nodes correspond to topics and whose edges link those topics in two successive epochs which are related. Formally, the weight of the directed edge that links φ j,t , the j-th topic in epoch t, and φ k,t+1 is set equal to ρ(φ j,t , φ k,t+1 ) where ρ is a similarity measure. Given that in our HDP-based model each topic is represented by a probability distribution, similarity measures such as the Bhattacharyya coefficient [12], the Jensen-Shannon or Kullback-Leibler divergences [38,3], and the Hellinger [30] or resistor-average distances [47], for example, are all readily adapted for use in the proposed framework, as we shall demonstrate in the next section.
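A minimal sketch of the graph construction described above might look as follows; the data layout (one collection of topic distributions per epoch) and the function name are our own assumptions.

```python
def build_similarity_graph(topics_by_epoch, rho):
    """Fully connect topics in successive epochs, weighting edges by rho.

    topics_by_epoch : list over epochs; element t holds the K_t topic
                      distributions of epoch t (e.g. rows of an array)
    rho             : similarity measure between two distributions
    Returns one dict of {(j, k): weight} per pair of successive epochs.
    """
    layers = []
    for t in range(len(topics_by_epoch) - 1):
        layer = {(j, k): rho(p, q)                   # weight = rho(phi_{j,t}, phi_{k,t+1})
                 for j, p in enumerate(topics_by_epoch[t])
                 for k, q in enumerate(topics_by_epoch[t + 1])}
        layers.append(layer)
    return layers
```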
A conceptual illustration of a small section of a similarity graph is shown in Figure 3. It shows three consecutive time epochs t − 1, t, and t + 1 and a selection of topics in these epochs. Graph edge weight (i.e. inter-topic similarity) is encoded by the thickness of the line representing the edge: the thicker the line, the more similar the corresponding topics are. In constructing a similarity graph we use a threshold to automatically eliminate weak edges, retaining only the connections between sufficiently similar topics in adjacent epochs. This readily allows us to detect the disappearance of a particular topic, the emergence of new topics, as well as the splitting or merging of different topics, as follows:
-New topic emergence: If a node does not have any edges incident to it, the corresponding topic is taken as having emerged in the associated epoch; for example, in Figure 3 the topic φ j+2 can be seen to emerge during the epoch with the timestamp t.
-Topic disappearance: If no edges originate from a node, the corresponding topic is taken to vanish in the associated epoch; for example, in Figure 3 the topic φ j can be seen to disappear during the epoch with the timestamp t.
-Topic evolution: When exactly one edge originates from a node in one epoch and it is the only edge incident to a node in the following epoch, the topic is understood as having evolved in the sense that its memetic content (captured by the probability distribution over the underlying vocabulary) may have changed; for example, in Figure 3 the topic φ j+2 evolves into the topic φ k+1 between the epochs with the timestamps t and t + 1.
-Topic splitting: If more than a single edge originates from a node, we interpret this as the corresponding topic splitting into multiple topics between the corresponding epochs and the successive epoch; for example, in Figure 3 the topic φ i splits into topics φ j and φ j+1 between the epochs with the timestamps t − 1 and t.
-Topic merging: If more than a single edge is incident to a node, the topics of the nodes from which the edges originate are understood to have interacted by merging to form a new topic; for example, in Figure 3 between the epochs with the timestamps t − 1 and t the topics φ i and φ i+1 merge to form φ j+1 .
Lastly, observe that just as our understanding of topics at a higher semantic level would allow, the proposed framework readily models complex structural changes which involve a topic concurrently undergoing merging and splitting.
For example, the topic labelled φ j+1 in Figure 3 is created through the merging of topic φ i+1 and a split offshoot of the topic φ i.
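The five rules above reduce to simple in-degree and out-degree tests on the pruned graph. The sketch below makes this explicit for one pair of successive epochs; the edge representation follows the earlier graph-construction sketch and is an assumption of ours.

```python
def classify_topic_changes(pruned_layer, n_topics_t, n_topics_t1):
    """Classify topic changes between epochs t and t+1 from a pruned edge layer.

    pruned_layer : dict mapping (j, k) -> weight for the surviving edges
    Returns the topic indices implied to have died, been born, split, merged,
    or evolved, following the rules described in the text.
    """
    out_deg = {j: 0 for j in range(n_topics_t)}
    in_deg = {k: 0 for k in range(n_topics_t1)}
    for (j, k) in pruned_layer:
        out_deg[j] += 1
        in_deg[k] += 1

    events = {
        "died":   [j for j, d in out_deg.items() if d == 0],  # no outgoing edges
        "born":   [k for k, d in in_deg.items() if d == 0],   # no incoming edges
        "split":  [j for j, d in out_deg.items() if d > 1],   # several outgoing edges
        "merged": [k for k, d in in_deg.items() if d > 1],    # several incoming edges
    }
    # Evolution: a single outgoing edge which is also the sole incoming edge
    events["evolved"] = [(j, k) for (j, k) in pruned_layer
                         if out_deg[j] == 1 and in_deg[k] == 1]
    return events
```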

Automatic temporal similarity graph construction
In our previous work [8,9] the temporal similarity graph is built in two stages.
In the first stage the graph is connected fully in the sense that all pairs of topics in successive epochs are connected by edges. Then the graph is pruned using a similarity threshold t s : any edge corresponding to an inter-topic similarity less than t s is removed from the graph.
A major limitation of the described pruning step is that the similarity threshold t s is not readily interpretable and the framework provides little insight as to how the threshold should be chosen. In addition, considering that the threshold value inherently depends on the similarity measure which is used, it is not clear how two inter-topic similarity measures may be compared i.e. how to control for the threshold in the presence of a changing similarity metric which underlies it.
Hence in the present paper we describe an alternative strategy which employs a more meaningful and more interpretable manner of pruning. Firstly we consider all inter-topic similarities present in the initial fully connected graph and extract the empirical estimate of the corresponding cumulative distribution function (CDF).
Examples of CDFs obtained using three similarity metrics on a typical epoch are shown in the accompanying figure. If F ρ is the CDF corresponding to a specific initial, fully connected graph formed using a particular similarity measure ρ, and ζ ∈ [0, 1] the chosen CDF operating point, we prune the edge between topics φ j,t and φ k,t+1 whenever

$$\rho(\phi_{j,t}, \phi_{k,t+1}) < F_\rho^{-1}(\zeta).$$
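In code, the pruning rule amounts to thresholding at an empirical quantile; a minimal sketch (reusing the per-layer edge dictionaries assumed in the earlier graph sketch) is given below.

```python
import numpy as np

def prune_by_cdf(layer, zeta):
    """Keep only edges whose similarity exceeds the empirical zeta-quantile.

    layer : dict mapping (j, k) -> similarity in the fully connected graph
    zeta  : CDF operating point in [0, 1]; e.g. zeta = 0.95 retains only
            the strongest ~5% of edges
    """
    weights = np.fromiter(layer.values(), dtype=float)
    threshold = np.quantile(weights, zeta)        # empirical F_rho^{-1}(zeta)
    return {e: w for e, w in layer.items() if w >= threshold}
```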

Experimental evaluation
Having introduced the main technical contribution of our work we now analyse the performance of the proposed framework empirically on two large real-world data sets. For the sake of completeness we next briefly survey the most influential work on the analysis of biomedical texts, then outline the key reasons for our choice of the specific research areas we focus on, and then proceed with a description of the adopted experimental methodology, a presentation of the most significant results emerging from our analysis, and the associated discussion.

Biomedical text mining
Most previous work on text-based knowledge discovery in biomedicine to date has focused on (i) the tagging of names of entities such as genes, proteins, and diseases [51], (ii) the discovery of relationships between different entities e.g.
functional associations between genes [45], or (iii) the extraction of information pertaining to events such as gene expression or protein binding [52].
The idea that the medical literature could be mined for new knowledge is typically attributed to Swanson [53]. For example by manually examining medical literature databases he hypothesised that dietary fish oil could be beneficial for Raynaud's syndrome patients, which was later confirmed by experimental evidence. Work that followed sought to develop statistical methods which would make this process automatic. Most approaches adopted the use of term frequencies and co-occurrences using dictionaries such as Medical Subject Headings (MeSH) [48].
Most existing work on biomedical knowledge discovery is based on what may be described as traditional data mining techniques (neural networks, support vector machines etc); comprehensive surveys can be found in [35,52].

Autism spectrum disorder and the metabolic syndrome
In this paper we evaluate our topic discovery framework on two corpora of scholarly papers. The first corpus comprises abstracts of papers concerning the autism spectrum disorder, and the second abstracts of papers related to research on the metabolic syndrome. These specific research areas were chosen for several reasons. Firstly, they concern medical issues of major practical importance: they affect (directly or indirectly) a large number of people and impose a significant financial cost both on society as a whole and on those affected. Secondly, the understanding of the mechanisms underlying both conditions has proven to pose a significant intellectual challenge. Consequently, the dominant ideas regarding them are continually changing, experiencing both refinement and more abrupt paradigm shifts. These aspects make the chosen areas of research highly suitable for the evaluation of the framework described in the present work.
The autism spectrum disorder. Autism spectrum disorder is a life-long neurodevelopmental disorder with poorly understood causes on the one hand, and a wide range of potential treatments supported by little evidence on the other. The disorder is characterized by severe impairments in social interaction, communication, and in some cases cognitive abilities, and typically begins in infancy or at the very latest by the age of three. ASD is recognized as comprising an aetiologically and clinically heterogeneous group of conditions whose diagnosis remains based solely on the complex behavioural phenotype [41]. According to the definition in the latest version (5th edition) of the Diagnostic and Statistical Manual of Mental Disorders, the autism spectrum disorder includes disorders which were previously diagnosed with more specificity as autism, Asperger syndrome, Rett syndrome, childhood disintegrative disorder, and 'pervasive developmental disorder not otherwise specified' [1]. Current evidence suggests that approximately 0.5-0.6% of the population is afflicted by ASD, though the actual diagnosis rate is on the increase due to the broadening diagnostic criteria [6]. The condition is usually detected in early childhood when an abnormal lack of social reciprocity is observed.
Although the last few decades have seen significant progress in the study of ASD, the still relatively poorly understood aetiology of the condition, its phenotypical heterogeneity [37], and the stigma associated with mental conditions [26] have all contributed to the penetration of beliefs, and of behavioural and educational interventions, which are often questionable [60] and poorly supported by evidence (e.g. gluten-free and casein-free diets, and cognitive behavioural therapy [18]), and sometimes outright in conflict with science [56]. For example, a recent review of early intensive behavioural and developmental interventions for young children with ASD found 1 existing study as being of good quality, 10 as fair quality, and 23 as poor quality [60]. From the public policy point of view, understanding the practices and beliefs of parents and carers of ASD-affected individuals is crucial, yet often lacking [29].
Metabolic syndrome. Much like ASD, metabolic syndrome (also known as insulin resistance syndrome and syndrome X) does not describe a single disorder but rather a cluster of interconnected health risk factors [28]. Specifically, the diagnostic criterion is the presence of at least three of the following: visceral obesity, arterial hypertension, hyperglycaemia, hypertriglyceridemia, and hypoalphalipoproteinemia [28]. MetS is recognized as a major and escalating public health challenge, chiefly in the developed world, and is thought to be caused in part by excess energy intake, and decreased energy output due to an increasingly sedentary lifestyle [36]. Metabolic syndrome is associated with an increased risk of numerous diseases and particularly notably with the development of cardiovascular disease and type 2 diabetes mellitus [2]. Approximately one third of adults in the USA suffer from MetS [25], with the prevalence of the syndrome increasing with age [25].

Data collection
To the best of our knowledge there are no publicly available corpora of ASD or MetS related medical literature. Hence we collected them ourselves. The data sets, their collection, and preparation are described next.

Raw data collection
We used the PubMed interface to access the US National Library of Medicine and retrieve abstracts and references of life science and biomedical scholarly articles. We assumed that a paper is related to ASD or MetS respectively if the terms "autism" or "metabolic syndrome" are present in its title or abstract. We used the abstract text to evaluate our method.
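The paper does not specify the tooling used for retrieval; purely as an illustration, a query of this kind can be scripted against the same PubMed interface using Biopython's Entrez module, as sketched below (the e-mail address and the retmax value are placeholders).

```python
from Bio import Entrez  # Biopython; pip install biopython

Entrez.email = "you@example.org"  # required by NCBI; placeholder address

def fetch_pubmed_abstracts(term, retmax=100):
    """Retrieve PubMed abstracts whose title/abstract contains `term`."""
    handle = Entrez.esearch(db="pubmed",
                            term=f'"{term}"[Title/Abstract]',
                            retmax=retmax)
    ids = Entrez.read(handle)["IdList"]
    handle = Entrez.efetch(db="pubmed", id=",".join(ids),
                           rettype="abstract", retmode="text")
    return handle.read()

asd_abstracts = fetch_pubmed_abstracts("autism")
mets_abstracts = fetch_pubmed_abstracts("metabolic syndrome")
```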

Data pre-processing
Data collected in the manner described in the previous section comprises abstracts as freeform text. To prepare it for the type of analysis described in Section 3 we perform a series of 'pre-processing' steps. The goal is to remove words which are largely uninformative in any context, reduce dispersal of semantically equivalent terms, and thereafter select terms which are included in the vocabulary over which topics are learnt.
We firstly applied soft lemmatization using the WordNet lexicon [42] to normalize for word inflections. No stemming was performed to avoid semantic distortion often effected by heuristic rules used by stemming algorithms.
After lemmatization and the removal of so-called stop-words, we obtained the vocabulary of terms over which topics are learnt.
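A sketch of the described pre-processing pipeline is given below, assuming the NLTK implementations of WordNet lemmatization and English stop-word lists (the paper does not name its tooling).

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-off resource downloads for tokenization, lemmatization, and stop-words
nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)
nltk.download("stopwords", quiet=True)

lemmatizer = WordNetLemmatizer()
stop = set(stopwords.words("english"))

def preprocess(abstract):
    """Lowercase, lemmatize with WordNet, and drop stop-words; no stemming."""
    tokens = nltk.word_tokenize(abstract.lower())
    return [lemmatizer.lemmatize(tok) for tok in tokens
            if tok.isalpha() and tok not in stop]
```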

Inter-topic similarity measures
Recall that in this work topics are probability distributions over a fixed vocabulary of terms. The inter-topic similarity measures used to construct our temporal graph are therefore similarity measures between probability distributions.
Considering that the vocabulary of terms is fixed, we represent each topic, say p, using a fixed length vector

$$p = (p_1, p_2, \ldots, p_{n_v}),$$

where $n_v$ is the number of terms in the vocabulary.
For the experiments in this paper we adopted three well known measures for quantifying the similarity (or equivalently, dissimilarity) between the probability distributions representing extracted topics. The first of these is the well known Hellinger distance [30]. For two discrete probability distributions p and q representing two topics, it is defined as follows:

$$H(p, q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{n_v} \left(\sqrt{p_i} - \sqrt{q_i}\right)^2}.$$

It can be readily seen that H(·) is symmetric and that it takes on a value between 0 and 1, with 0 signifying the greatest degree of similarity between p and q (in this case p = q) and 1 the least (in this case the supports of p and q are disjoint). The second similarity measure we evaluate is the Bhattacharyya coefficient, defined as [12]:

$$B(p, q) = \sum_{i=1}^{n_v} \sqrt{p_i q_i}.$$

Like the Hellinger distance, the Bhattacharyya coefficient is symmetric and takes on a value between 0 and 1. However note that in this case it is the maximum value of 1 which is attained when there is the greatest degree of similarity between p and q (i.e. p = q) and 0 the least (as before, when the supports of p and q are disjoint). It is straightforward to demonstrate that the Bhattacharyya coefficient is related to the Hellinger distance as follows:

$$H(p, q) = \sqrt{1 - B(p, q)}.$$

Lastly, due to its widespread use we also compare the performance of the proposed algorithm using the similarity measure often erroneously referred to as Tanimoto similarity [39] (or the Jaccard similarity [24]), defined as:

$$S(p, q) = \frac{\sum_{i=1}^{n_v} p_i q_i}{\sum_{i=1}^{n_v} p_i^2 + \sum_{i=1}^{n_v} q_i^2 - \sum_{i=1}^{n_v} p_i q_i}.$$

Like the Bhattacharyya coefficient, this measure is symmetric and takes on a value between 0 and 1, with the maximum value of 1 being attained when p and q are equal, and 0 when there is no overlap between them (i.e. $p_i > 0 \implies q_i = 0$ and $q_i > 0 \implies p_i = 0$). In an effort to avoid perpetuating the aforementioned misnomer on the one hand, while retaining some nomenclatural continuity with the existing literature on the other, we will refer to this measure as the quasi-Jaccard similarity.
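All three measures are one-liners over the fixed-length topic vectors introduced above; the sketch below also checks the stated relationship between the Hellinger distance and the Bhattacharyya coefficient numerically.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance: 0 when p = q, 1 for disjoint supports."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def bhattacharyya(p, q):
    """Bhattacharyya coefficient: 1 when p = q, 0 for disjoint supports."""
    return np.sum(np.sqrt(p * q))

def quasi_jaccard(p, q):
    """The measure often (erroneously) called Tanimoto/Jaccard similarity."""
    dot = np.dot(p, q)
    return dot / (np.dot(p, p) + np.dot(q, q) - dot)

# Toy check on two small distributions
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.6, 0.3, 0.1])
assert np.isclose(hellinger(p, q), np.sqrt(1.0 - bhattacharyya(p, q)))
```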

Experiments and results
In this section we conduct experiments on the two corpora of scholarly literature described in Section 4.1.3, and report and discuss our results both using quantitative findings and representative qualitative examples.

Quantitative comparison
We started our evaluation with an experiment examining the quantitative differences effected by varying the free parameters of the proposed method.
In particular, our aim was to see how the evolving topic structure extracted by our algorithm is affected by the choice of epoch length, successive epoch overlap, inter-topic similarity measure, and pruning threshold; the parameter settings examined are summarized in Table 1 and Table 2.

Lastly we analysed the effect that the choice of successive epoch overlap has on the output of our algorithm. As in the preceding experiments we visualize a summary of the results using sets of plots of normalized rates of topic birth, death, splitting, and merging, each set corresponding to a different overlap; please see Figure 10. Even a cursory examination of the plots readily reveals that the parameter in question has a profound effect. While a consistent pattern of changes can be observed in each set of plots, the most noticeable effect is on the rate of topic death. This is consistent with the expectation that the smaller the overlap between successive epochs, the less relatedness can be expected between the sets of topics extracted in successive epochs. Quantitatively, however, the magnitude of the effect is rather astonishing. Considering that most of the methods described in the literature which discretize time by epochs adopt the no-overlap design, our finding provides strong and valuable evidence that the performance of these methods could be improved with little effort, merely by a slight alteration in the manner in which discretization is performed.

Parameter combinations and topic life expectancy
In the experiments so far we examined how the topic structure of a longitudinal document corpus, and the evolution of this structure over time, is affected by the different free parameters of the proposed algorithm. Our results suggest that our algorithm is not very sensitive to the exact choice of parameter values. This behaviour is highly desirable because it obviates the need for the substantial amounts of data required to learn sensible parameter values for a particular application. The one parameter which we found to be of particular importance is the amount of overlap between successive epochs used to discretize time. In particular we found that while the precise amount of overlap is not of particular importance, the introduction of some overlap (25-50% of the epoch length) significantly improves the ability of our algorithm to capture temporal structural changes in the topic content of a corpus. This was most significantly demonstrated in the rate of topic death. In the experiments described so far we examined the normalized topic birth and death rates independently. In other words we looked at the proportion of topics which are respectively newly created (born) and which disappear (die) as a proportion of the total number of topics extracted in the corresponding epoch.
This information does not provide any insight into which topics die off, that is, how long a specific topic has been in existence before it disappears. Here we adopt a more nuanced approach which comprises the following steps (a code sketch of the resulting tracking procedure is given after this list):
-Identification of topic creation: We identify the epoch in which a particular topic was created either as the epoch of its birth, or as the epoch in which the topic is created through the splitting or the merging of topics from the previous epoch.

-Topic tracking:
Following the creation of a topic we track it in the context of complex changes of its topic environment by considering its natural descendant to be the child in our temporal topic graph with the highest degree of inter-topic similarity across all siblings. A topic is considered extant as long as it has any descendants.
-Identification of topic death: A topic is considered to have died in the first epoch in which it no longer has any descendants.
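As noted above, a sketch of this tracking procedure is given here; the per-layer edge dictionaries follow the representation assumed in the earlier graph sketches.

```python
def topic_lifespan(birth_epoch, birth_topic, pruned_layers):
    """Count the epochs a topic survives by following its most similar child.

    pruned_layers[t] maps (j, k) -> similarity between topic j in epoch t
    and topic k in epoch t+1 (after pruning).
    """
    epoch, topic, lifespan = birth_epoch, birth_topic, 1
    while epoch < len(pruned_layers):
        children = {k: w for (j, k), w in pruned_layers[epoch].items() if j == topic}
        if not children:                          # no descendants: the topic dies here
            break
        topic = max(children, key=children.get)   # natural descendant (most similar child)
        epoch += 1
        lifespan += 1
    return lifespan
```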
Using the operating point corresponding to ζ = 0.95 on the inter-topic similarity CDF, we performed experiments using the six settings summarized in Table 1. Our results are shown in Table 3.
Lastly, we examined the manner in which topics extracted by our algorithm cease to exist. In particular, for each new topic (where "new" in this context is taken to mean that a topic is either newly born, as defined previously, or that it is created by the splitting or merging of topics from the previous epoch) we measured its lifespan; the results are summarized in Table 3 and the plots in Figure 11. In Table 3 it is important to observe the lack of sensitivity of our method to its specific settings. From the plots in Figure 11 it is interesting to notice that most topics cease to exist already in the first epoch after their creation.

Table 3. Average lifespan of topics (in epochs) extracted from our autism spectrum disorder abstracts data set. For full detail of the different settings see Table 1.

Qualitative results
In the previous section we described an extensive set of experiments providing insight into the role that different free parameters of our algorithm have on its output. Importantly we demonstrated that within a wide range of what may be described as reasonable choices for the parameter values, the proposed method exhibits a high degree of robustness. Following these encouraging quantitative findings, using our (human) higher level semantic understanding of the corpora used, we now examine if the output of our algorithm is meaningful and ultimately useful.
Case study 1: ASD and genetics. While the exact aetiology of the ASD is still poorly understood, the existence of a significant genetic component is beyond doubt [41]. Work on understanding complex genetic factors affecting the development of autism, which possibly involve multiple genes which interact with each other and the environment, is a major theme of research and as such a good case study on which the usefulness of the proposed method can be illustrated.
We started by identifying the topic of interest as that with the highest probability of the terms "gene" or "genetic" conditioned on the topic, and tracing its evolution through the temporal similarity graph. Figure 12 shows the evolution of this topic from 1992 onwards as revealed by our method (due to space constraints only the most significant parts of the similarity graph are shown; minor changes to the topic before 1992 are also omitted for clarity, as indicated in the figure). Each topic is labelled with its first few dominant terms. The following interpretation of our findings is readily apparent. Firstly, in the period 1992-1997, the topic is rather general in nature. Over time it evolves and splits into topics which concern more specific concepts (recall that such splitting cannot be captured by the existing methods). Our framework also allows us to look 'back' in time. For example, by examining the topics that the 1992 genetics topic originates from we discovered that the topic evolved from the early concept of "infantile ASD" [34].

Case study 2: ASD and vaccination. Our second case study concerns the controversy instigated by the now discredited claim of a link between vaccination and autism made by Wakefield [57]. The evolution of the corresponding topic is illustrated in Figure 13 in the same way as in the previous section. It can be seen that the original topic concerned the subjects initially brought to attention such as "measles", "vaccine", and "autism". In the subsequent epoch, when the original claim was still thought to have credibility, the topic evolves and splits into numerous others mirroring the research directions taken by various researchers. Following this period and the revelations of the claim's fraudulence, the topic assumes mainly single-threaded evolution, at times incorporating various originally separate ideas. For example, observe the independent emergence of the term "mercury". Though initially unrelated to it, this topic merges with the topic that concerns vaccination, which can be explained by the widely publicized thiomersal (vaccine preservative) controversy (again note that such merging of topics cannot be captured by the existing methods).
Although rejected by the medical community due to a lack of evidence, this topic can be seen as persisting to date.
Case study 3: MetS and plasma fatty acids. As noted earlier MetS is highly associated with the risk of developing type 2 diabetes and cardiovascular diseases, and is characterized by insulin resistance, abdominal obesity, and high blood pressure, all of which are intimately linked with dyslipidemia and elevated plasma fatty acid levels. In this case study we sought to investigate patterns associated with topics concerning this aspect of MetS.
As in the previous two case studies we began by identifying the topics with the highest probability of the relevant terms (in this case "acid" and "fatty") conditioned on the topic, and traced their evolution. Fatty acid metabolism plays a key role in the metabolic syndrome; among the related developments reflected in the extracted topic structure is the increasing incidence of gout (historically known as "the rich man's disease") in the Western world [17].

Summary and conclusions
This paper focused on the problem of modelling and extracting the topic structure of a longitudinal document corpus over time. The approach we described starts with a discretization of time into epochs which may overlap. Then, using the approximation that the topic structure within each epoch is temporally locally static, the aforesaid structure is modelled and extracted using a hierarchical Dirichlet process. Finally, the evolution of the topic structure over time is captured using a temporal graph underlain by an inter-topic similarity measure.
The graph, initially populated by edges between all pairs of topics in two consecutive epochs, is pruned automatically, and the result is used to infer complex structural changes over time which the existing methods in the literature cannot capture.

The proposed framework was evaluated extensively on two large real-world data sets of abstracts of scientific papers, one concerning the autism spectrum disorder and the other the metabolic syndrome. This data was collected by ourselves and made freely available for public use. Our detailed quantitative analysis of the effects that the free parameters of the proposed method have on its performance revealed a number of important insights. We found that within a wide range of parameter values our algorithm was little affected by the specific value choices.
Another important finding, the significance of which extends further than the scope of the proposed algorithm, is that in the discretization of time into epochs it is important that successive epochs overlap. The significantly inferior performance observed with non-overlapping epochs has immediate consequences for the interpretation of previous work and the findings reported in the literature, suggesting a simple and immediate way of enhancing the performance of any algorithm which did not adopt the use of overlapping epochs. Lastly, on several case studies highly relevant to the currently popular directions of research on ASD and MetS, our algorithm's output was analysed qualitatively, and shown to capture well the actual developments in these fields.