Coupled differentiation and division of embryonic stem cells inferred from clonal snapshots



April 2020
Abstract. The deluge of single-cell data obtained by sequencing, imaging and epigenetic markers has led to an increasingly detailed description of cell state. However, it remains challenging to identify how cells transition between different states, in part because data are typically limited to snapshots in time. A prerequisite for inferring cell state transitions from such snapshots is to distinguish whether transitions are coupled to cell divisions. To address this, we present two minimal branching process models of cell division and differentiation in a well-mixed population. These models describe dynamics where differentiation and division are coupled or uncoupled. For each model, we derive analytic expressions for each subpopulation's mean and variance and for the likelihood, allowing exact Bayesian parameter inference and model selection in the idealised case of fully observed trajectories of differentiation and division events. In the case of snapshots, we present a sample path algorithm and use this to predict optimal temporal spacing of measurements for experimental design. We then apply this methodology to an in vitro dataset assaying the clonal growth of epiblast stem cells in culture conditions promoting self-renewal or differentiation. Here, the larger number of cell states necessitates approximate Bayesian computation. For both culture conditions, our inference supports the model where cell state transitions are coupled to division. For culture conditions promoting differentiation, our analysis indicates a possible shift in dynamics, with these processes becoming more coupled over time.

Introduction
Changes in gene expression underlie many aspects of cellular behaviour in tissue development, homeostasis, and regeneration. The concept of discrete cell states is intended to capture the distinct patterns of gene expression that are observed within tissues over time. The deluge of single-cell data is leading to an increasingly detailed description of cell state [1,2]. There are various ways to interrogate a cell's state in different contexts in vivo and in vitro. Modern technologies such as scRNAseq produce vast amounts of data but are costly, laborious to analyse, and relatively noisy. Older techniques, such as immunofluorescent staining, where cell state can be defined by the co-expression of a small number of genes/proteins, are cheaper and simpler, and thus remain heavily used.
While there are numerous techniques for classifying cell states based on single cell data, and ordering them along pseudotime trajectories [3,4], such classification only offers very limited insight into the dynamics of cell state transitions [4,5]. The ability to quantify the dynamics of cell state transitions promises greater insights into cell heterogeneity and how differentiation varies with culture conditions, and could help steer targeted differentiation of stem cells in vitro. Currently, the data remain limited to snapshots in most cases, with only one or few time-point measurements available, and a loss of cell identity across measurements due to their destructive nature. Fluorescent reporters of gene expression for live-imaging circumvent this, but their availability is relatively limited and they allow observation of at most a few genes.
Correct tissue development and regeneration requires balanced cell proliferation and differentiation at the tissue level to ensure the right cell type in the right place at the right amount. However, which dynamics at the cellular level give rise to this population behaviour is in many cases still unclear. In particular, deciding how or whether cell state transitions are coupled to, or occur independently of, cell division is a pre-requisite for quantifying cell state transition rates. Determining evidence for such coupling from snapshot data is an incompletely solved problem [6].
If we only have access to snapshots of clonal colony growth, how can one distinguish whether cell division and cell state transitions are coupled? The answer to this question will affect how well one can infer cell state transition rates, as assuming a "wrong" underlying model will impair the quality of the inference. Conversely, can we inform experimental design to maximise the information gained from experimental data? To address this, we combine biophysical theory of stochastic population dynamics with Bayesian parameter inference and model comparison. As an example we choose the well-established system of epiblast stem cells (EpiSCs) [7]. EpiSCs are self-renewing, pluripotent cell lines that are an in vitro equivalent of the embryonic epiblast tissue. In vivo, these cells begin to differentiate (undergo lineage commitment) during gastrulation.
We start with minimal branching process models of cell division and cell state transition (e.g. reversible differentiation), and derive the population dynamics and exact likelihoods for inference based on complete trajectory data (observing every cell division/state transition). We then present a sample path algorithm for inference based on snapshot data. We verify our methods using simulated data and show how the information gained from experimental data can be improved by adaptively spacing the snapshots in time. Next, we extend our models and inference pipeline to EpiSC clonal assay data. In this case, the sample path algorithm becomes too computationally intensive, due to the larger cell state space. Using an approximate Bayesian computation sequential Monte-Carlo scheme, we find that inference favours coupling of cell division and transitions. We compare inferred transition rates in two experimental conditions, promoting self-renewal and differentiation, and find a bias towards increased expression of two transcription factors (T(Bra) and FoxA2) and decreased expression of another (Sox2) under conditions promoting differentiation, in line with published results of a previous experimental study that cultured EpiSCs in vitro [7].

Methods
We begin this section by introducing two continuous-time Markov processes as minimal stochastic models for stem cell dynamics. These simple models are then generalised to describe the dynamics of epiblast stem cells (EpiSCs) in more detail. We close this section with a brief summary of Bayesian parameter inference and model selection.

Minimal models
We begin by defining minimal models of cell division and differentiation in a well-mixed population. We assume that the population dynamics can be adequately described by a continuous-time Markovian stochastic process where state transition rates depend only on the current state of the system. This in turn is specified by a set of population numbers {N} of different cell states. In this section we will consider systems with two different cell states and assume that cells within the population divide and change state independently of one another.
We consider two distinct models of population dynamics: one in which cell differentiation is coupled to cell division (Model C) and one in which the two processes are uncoupled (Model U). In both models there are two cell states, A and B. The essential difference is that in Model C, cells of a different state can be produced only through cell division, whereas in Model U cells can change state reversibly without dividing. The dynamics in the state space spanned by N_A(t) and N_B(t) are illustrated in Figure 1. The initial populations N_A(0) and N_B(0) will be denoted N_A0 and N_B0, respectively.
2.1.1. Model C
Model C is defined by the following continuous-time transition rates:

A → A + A, with hazard h^C_1 = λ_AA N_A,
A → A + B, with hazard h^C_2 = λ_AB N_A,
A → B + B, with hazard h^C_3 = λ_BB N_A,     (1)

where the system's transition rates (hazards) are denoted h^C_k. In each of these three reactions, an A-state cell divides, producing two A-state cells, an A- and a B-state cell, or two B-state cells, at per-capita rates λ_AA, λ_AB, and λ_BB, respectively. In this model, B-state cells do not divide.
Let p_{i,j}(t) be the probability that the cell population consists of i A-state and j B-state cells at time t. The master equation for Model C is then

dp_{i,j}/dt = λ_AA (i − 1) p_{i−1,j} + λ_AB i p_{i,j−1} + λ_BB (i + 1) p_{i+1,j−2} − (λ_AA + λ_AB + λ_BB) i p_{i,j}.     (2)

This master equation is a differential equation defining the time evolution of the population probabilities p_{i,j}(t) caused by stochastic transitions.
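The dynamics defined by these reactions can be simulated directly with the Gillespie algorithm. The sketch below is our own illustration (the function name and seeding are not from the paper's code base):

```python
import random

def gillespie_model_c(n_a, n_b, rates, t_max, seed=0):
    """Simulate Model C (A -> 2A, A -> A+B, A -> 2B) until t_max and
    return the final (N_A, N_B). B-state cells never divide."""
    lam_aa, lam_ab, lam_bb = rates
    rng = random.Random(seed)
    t, total_rate = 0.0, lam_aa + lam_ab + lam_bb
    while n_a > 0 and total_rate > 0.0:
        t += rng.expovariate(n_a * total_rate)  # time to next reaction
        if t > t_max:
            break
        u = rng.random() * total_rate
        if u < lam_aa:                # A -> 2A
            n_a += 1
        elif u < lam_aa + lam_ab:     # A -> A + B
            n_b += 1
        else:                         # A -> 2B
            n_a -= 1
            n_b += 2
    return n_a, n_b

# A single A -> 2B event exhausts the A population: the absorbing state.
print(gillespie_model_c(1, 0, (0.0, 0.0, 1.0), 100.0))  # (0, 2)
```

Note how the simulation halts as soon as N_A = 0: this is the absorbing state discussed below.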
2.1.2. Model U
Model U is defined by the transition rates

A → A + A, with hazard h^U_1 = λ_A N_A,
B → B + B, with hazard h^U_2 = λ_B N_B,
A → B, with hazard h^U_3 = k_AB N_A,
B → A, with hazard h^U_4 = k_BA N_B,     (3)

with transition rates (hazards) h^U_k. These reactions correspond to division of A-state and B-state cells into two daughter cells of the same state at per-capita rates λ_A and λ_B, respectively, and to dynamic state changes of an A-state cell into a B-state cell at per-capita rate k_AB, and of a B-state cell into an A-state cell at per-capita rate k_BA.
The master equation of Model U is given by

dp_{i,j}/dt = λ_A (i − 1) p_{i−1,j} + λ_B (j − 1) p_{i,j−1} + k_AB (i + 1) p_{i+1,j−1} + k_BA (j + 1) p_{i−1,j+1} − [(λ_A + k_AB) i + (λ_B + k_BA) j] p_{i,j}.     (4)

2.1.3. Generalisations of Models C and U: Epiblast stem cell models
While Models C and U are useful for illustrative purposes, we require more cell states to make use of experimental data on epiblast-derived stem cell (EpiSC) dynamics. Here, we define EpiSC states by the binary expression of the transcription-factor genes T(Bra), Sox2, and Foxa2, which is what the experimental dataset contains information about [7]. There are thus 8 distinct cell states accessible for modelling, which we will denote by Φ_{(T±,S±,F±)}, where the indices T, S, F represent the genes T(Bra), Sox2 and Foxa2, respectively. For simplicity, we start with the assumption that cell state transitions change only one of the three genes T, S, F at a time. This leads to 12 allowed transitions between cell states, which can be conveniently visualised as a cube where vertices represent cell states and edges represent transition paths (see Figure 2a).
The generalisation of Model C retains the defining feature that new cell states can be created only at division of the parent. Specifically, we allow symmetric division of cells in each state, Φ_i → 2Φ_i, at per-capita rate θ_ii. Note that we can recover Model C by setting all but one of these rates to zero. To keep the number of parameters small, we accommodate only the other symmetric cell division processes, Φ_i → 2Φ_j, for any j sharing an edge with i, at per-capita rate θ_ij. These dynamics are illustrated in Figure 2b. The fact that we chose to include only symmetric cell divisions should not change the outcome of model comparison too much, considering the following argument: the asymmetric division process Φ_i → Φ_i + Φ_j with i ≠ j effectively increases the Φ_j population by one. The same outcome can also be achieved either by a symmetric division Φ_j → 2Φ_j or by the sequence of two symmetric cell divisions Φ_i → 2Φ_j, i ≠ j, and Φ_j → 2Φ_k, j ≠ k.

Figure 2: Dynamics in generalisations of Models C and U. (a) We characterise EpiSCs by the co-expression of three genes, T(Bra), Sox2 and Foxa2, corresponding to a space of eight possible cell states (cube vertices), S = {(T±, S±, F±)}. A combination of these 8 states may be present in a heterogeneous EpiSC population. We assume that cell state transitions change the expression of only one gene at a time. This results in 12 possible transitions between cell states (cube edges). (b) In the generalised Model C, cells in state Φ_i ∈ S divide symmetrically with rate θ_ii into two daughter cells with state Φ_i, and symmetrically with rate θ_ij into two daughter cells with a neighbouring state Φ_j. (c) In the generalised Model U, cells in state Φ_i ∈ S divide symmetrically with rate θ_ii into two daughter cells with state Φ_i, and transition with rate θ_ij into a neighbouring state Φ_j.
Since cells in any state can divide, there are eight different reactions of the form Φ i → 2Φ i . State-changing divisions of the form Φ i → 2Φ j can happen between all neighbouring cell states (i, j) in the network shown in Figure 2a; there are 3 × 8 = 24 such reactions, since each cell state has three neighbouring states. The total number of reaction types in extended Model C is therefore 8 + 24 = 32.
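The count of 8 + 24 = 32 reaction types can be verified by enumerating the cube of states; `states`, `symmetric` and `state_changing` are hypothetical names used only for this check:

```python
from itertools import product

# Cell states as binary triples (T, S, F); two states are neighbours
# if they differ in exactly one gene (cube edges).
states = list(product((0, 1), repeat=3))

symmetric = [(s, s) for s in states]                          # Phi_i -> 2 Phi_i
state_changing = [(s, t) for s in states for t in states
                  if sum(a != b for a, b in zip(s, t)) == 1]  # Phi_i -> 2 Phi_j

print(len(symmetric), len(state_changing))   # 8 24
print(len(symmetric) + len(state_changing))  # 32 reaction types
```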
In principle, therefore, there are 32 parameters in this model. To reduce the size of the parameter space, we further assume that the symmetric cell division rates are the same for all cell states, and define the cell division rate θ_ii = θ_0 for all i. We also assume that switching each gene M ∈ {T, S, F} on or off does not depend on the state of the other two genes; this independence of the genes' transition rates is a first approximation only, which we relax in future work (Schumacher et al., in preparation). Each gene M thus contributes two independent rate constants, θ_M+ and θ_M−, for switching on and off, respectively. The extended Model C therefore has 7 free parameters:

θ_C = (θ_0, θ_T+, θ_T−, θ_S+, θ_S−, θ_F+, θ_F−).

We now turn to the generalisation of Model U, recalling that the original two-state model allowed symmetric division of both states, and dynamic differentiation in which one state could turn into the other. This is easily extended to 8 cell states Φ_i, wherein each cell state can perform symmetric cell division at rate θ_ii, and dynamic state changes between all pairs of neighbouring cell states occur at rates θ_ij and θ_ji (see Figure 2c). As in the extended Model C, there are eight symmetric division rates, which we again set to a universal value θ_ii = θ_0. Similarly, the rates of dynamic cell state changes are defined by the gene that is switched on or off, which again leads to a total of 8 + 24 = 32 reaction types and 7 free parameters:

θ_U = (θ_0, θ_T+, θ_T−, θ_S+, θ_S−, θ_F+, θ_F−).

An appealing feature of Models C and U from a classical model-selection perspective is that they have the same number of free parameters. A problem when comparing models with different numbers of free parameters is that the model with more free parameters will usually show better agreement with the data. However, this is because the more flexible model can be fitted more closely to the data, not because its underlying assumptions describe nature more accurately. Bayesian model selection largely eliminates this problem since it embodies Occam's razor: simpler models have higher posterior probabilities than more complex models if both fit the data equally well [8].

Bayesian inference and model selection
We use Bayesian methods to perform model selection and parameter inference. Given some experimental data ξ, we seek the probability distribution p(θ|ξ) over the parameters θ of either Model C or Model U. This is achieved by appealing to Bayes' theorem,

p(θ|ξ) = p(ξ|θ) p(θ) / p(ξ).

Here p(θ) encapsulates our prior beliefs as to appropriate values for the parameters θ (i.e., in the absence of any data), and p(ξ|θ) is the likelihood of the data ξ given a parameter choice. In this context, the likelihood is obtained by solving the master equation for the model under consideration. Lastly, p(ξ) is the probability of observing the data ξ, and is usually obtained indirectly by ensuring that the posterior distribution is correctly normalised. It is worth mentioning that the data ξ can take very different forms, such as continuous trajectory data of cell populations or population snapshots at a finite number of times. The form of ξ does not affect Bayes' theorem, but enters the calculation via the likelihood function p(ξ|θ).
If not only the model parameters θ are unknown, but also the model itself, we can use Bayesian methods for model comparison. Given data ξ and the two competing models M_C and M_U, we wish to calculate the probability p(M_j|ξ) of each model given the data. Following Bayes' theorem, the posterior odds of the models are given by

p(M_C|ξ) / p(M_U|ξ) = [p(ξ|M_C) / p(ξ|M_U)] × [q(M_C) / q(M_U)],

where q(M_i) is the prior belief that model M_i is the appropriate description of the data. The selection of the "best" model is usually based not on the posterior odds, but on the ratio of posterior to prior odds: that is, how much more the data point to one model over the other, given our prior beliefs,

B_CU = p(ξ|M_C) / p(ξ|M_U).     (10)

This ratio is called the Bayes factor [9]. Following (10), the marginal likelihoods p(ξ|M_j) are the quantities of key interest. They can be obtained by integrating over the parameter space θ_j of the corresponding model:

p(ξ|M_j) = ∫ p(ξ|θ_j, M_j) p(θ_j|M_j) dθ_j,

where p(θ_j|M_j) = q(θ_j) is the parameter prior of model M_j. All algorithms shown in this work were implemented in Python 3 with standard libraries (numpy, pandas, scipy, pyabc). All code can be accessed via GitHub (https://github.com/real-save/Stem-cell-inference).
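As a toy illustration of how marginal likelihoods produce a Bayes factor (our own sketch, not the paper's pipeline), the code below integrates a Poisson likelihood against two different gamma priors; all numbers are arbitrary:

```python
from scipy.integrate import quad
from scipy.stats import gamma, poisson

def marginal_likelihood(r, exposure, prior_a, prior_b):
    """p(xi|M) = integral of p(xi|theta) p(theta|M) dtheta for a toy model
    in which r events are observed with Poisson mean theta * exposure and
    the prior on theta is Gamma(prior_a, prior_b)."""
    def integrand(theta):
        return poisson.pmf(r, theta * exposure) * gamma.pdf(theta, prior_a, scale=1.0 / prior_b)
    value, _ = quad(integrand, 0.0, 50.0)
    return value

r, exposure = 7, 4.0                               # arbitrary toy data
m1 = marginal_likelihood(r, exposure, 2.0, 1.0)    # prior centred near theta ~ 2
m2 = marginal_likelihood(r, exposure, 12.0, 1.0)   # prior centred near theta ~ 12
print(m1 / m2)  # Bayes factor > 1: the data favour the first prior
```

For this conjugate toy case the integral also has a closed (negative-binomial) form, which makes the numerical result easy to check.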

Coupling of differentiation and division leads to qualitatively different population dynamics
Our first set of results pertains to how populations evolve over time in Models C and U in their original formulation with two cell states. In particular, we will see the utility of probability generating functions. In the following we derive equations for the probability generating function and expressions for the mean populations N_A and N_B in each of Models C and U. In the last subsection we also derive approximate expressions for the mean populations ⟨Φ⟩ in the extended models.
Population dynamics of Model C
For Model C we introduce the probability generating function G(z_A, z_B, t) = Σ_{i,j} p_{i,j}(t) z_A^i z_B^j, with initial condition G(z_A, z_B, 0) = z_A^{N_A0} z_B^{N_B0}. The master equation translates into

∂G/∂t = [λ_AA z_A^2 + λ_AB z_A z_B + λ_BB z_B^2 − (λ_AA + λ_AB + λ_BB) z_A] ∂G/∂z_A.

This is a linear, first-order PDE for G(z, t) and can be solved by the method of characteristics [10]. With the initial condition stated above, we find a closed-form solution, Eq. (13), whose explicit form is given in Appendix B. The full time-dependence of the mean population sizes then follows from (B.2):

⟨N_A(t)⟩ = N_A0 e^{∆λ t},     (16)
⟨N_B(t)⟩ = N_B0 + N_A0 (λ_AB + 2λ_BB)(e^{∆λ t} − 1)/∆λ,

where ∆λ := λ_AA − λ_BB. To understand these results, we note that cells in state A divide at per-capita rate λ_AA, and are lost at per-capita rate λ_BB. These competing processes lead to either exponential growth (λ_AA > λ_BB) or exponential decay (λ_AA < λ_BB) of the mean number of cells in state A. Since there is no loss of cells from state B, the mean number of cells in state B either grows exponentially with time (λ_AA > λ_BB) or plateaus at a finite value (λ_AA < λ_BB).
A particular feature of Model C is that there is an absorbing state. This arises because A-state cells can be lost via the process A → 2B, and B-state cells do not divide. Therefore, there is a finite probability that the number of cells in state A will eventually decrease to zero and the system dynamics will come to a halt. We can obtain p_ext(t), the extinction probability that this has occurred by time t, from the generating function as p_ext(t) = Pr[N_A(t) = 0] = G(0, 1, t). From (13) we find

p_ext(t) = [λ_BB (e^{∆λ t} − 1) / (λ_AA e^{∆λ t} − λ_BB)]^{N_A0}.

In the special case λ_AA = λ_BB = λ this simplifies to

p_ext(t) = [λt / (1 + λt)]^{N_A0}.

The extinction probability increases monotonically with time and converges to a finite value lim_{t→∞} p_ext(t) = p_ext^max. For λ_BB ≥ λ_AA, it follows that p_ext^max = 1 and A-state cells always die out. For λ_BB < λ_AA, however, p_ext^max = (λ_BB/λ_AA)^{N_A0} < 1.
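These formulas are easy to check by simulation. The sketch below (our own illustrative code) estimates the mean A-population and the extinction probability for the critical case λ_AA = λ_BB = λ with λ_AB = 0, where the mean stays at N_A0 and, for a single founder cell, p_ext(t) = λt/(1 + λt):

```python
import random

def simulate_c(n_a, n_b, lam_aa, lam_ab, lam_bb, t_max, rng):
    """One Gillespie realisation of Model C, returning (N_A, N_B) at t_max."""
    t = 0.0
    while n_a > 0:
        t += rng.expovariate(n_a * (lam_aa + lam_ab + lam_bb))
        if t > t_max:
            break
        u = rng.random() * (lam_aa + lam_ab + lam_bb)
        if u < lam_aa:                 # A -> 2A
            n_a += 1
        elif u < lam_aa + lam_ab:      # A -> A + B
            n_b += 1
        else:                          # A -> 2B
            n_a -= 1
            n_b += 2
    return n_a, n_b

rng = random.Random(1)
lam, t_end, n_runs = 1.0, 1.0, 20000   # critical case: lam_AA = lam_BB
finals = [simulate_c(1, 0, lam, 0.0, lam, t_end, rng) for _ in range(n_runs)]

mean_a = sum(f[0] for f in finals) / n_runs
p_ext = sum(f[0] == 0 for f in finals) / n_runs
print(mean_a)  # theory: N_A0 * exp(0 * t) = 1
print(p_ext)   # theory: lam*t/(1 + lam*t) = 0.5
```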

Population dynamics of Model U
We can solve Model U by following the same sequence of steps. From the master equation, the generating function obeys the PDE

∂G/∂t = [λ_A z_A (z_A − 1) + k_AB (z_B − z_A)] ∂G/∂z_A + [λ_B z_B (z_B − 1) + k_BA (z_A − z_B)] ∂G/∂z_B.

The solution can again be found by the method of characteristics; it involves functions c_1(t), c_2(t), c_3(t), c_4(t) of time and constants Γ, ω, θ that depend on the model parameters (λ_A, λ_B, k_AB, k_BA). The mean numbers of cells in each state satisfy the linear system

d⟨N_A⟩/dt = (λ_A − k_AB)⟨N_A⟩ + k_BA⟨N_B⟩,
d⟨N_B⟩/dt = k_AB⟨N_A⟩ + (λ_B − k_BA)⟨N_B⟩,

so that ⟨N_A(t)⟩ and ⟨N_B(t)⟩ are linear combinations of the exponentials e^{(α+β)t} and e^{(α−β)t}, with coefficients involving a further constant γ; here α, β, γ depend on the model parameters (λ_A, λ_B, k_AB, k_BA):

α = (λ_A − k_AB + λ_B − k_BA)/2,  β = (1/2)√[(λ_A − k_AB − λ_B + k_BA)² + 4 k_AB k_BA].     (28)

Over long times, the population means grow approximately exponentially, as in Model C, but with a different rate, α + β. In contrast to Model C, however, the population means never decay, since α + β ≥ 0, as is easily shown. This is due to the absence of an absorbing state in Model U. It is worth mentioning that for any given parameter set of Model C with growing means, parameters of Model U can be chosen such that the mean populations behave identically. Thus, we cannot use information about mean populations alone to distinguish Model U from Model C. The converse is not true, since ⟨N_A⟩ decays in some parameter regimes of Model C, which is impossible in Model U. This is part of the rationale for appealing to Bayesian methods to infer model parameters and facilitate model comparison.
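The mean dynamics can equivalently be obtained numerically from the 2 × 2 matrix of the linear ODE system for (⟨N_A⟩, ⟨N_B⟩); the following sketch (arbitrary parameter values) computes the means by matrix exponentiation and the long-time growth rate α + β as the largest eigenvalue:

```python
import numpy as np
from scipy.linalg import expm

# Mean-field dynamics of Model U: d<N>/dt = M <N> with
# M = [[lam_A - k_AB, k_BA],
#      [k_AB,         lam_B - k_BA]].
# Parameter values below are arbitrary, for illustration only.
lam_a, lam_b, k_ab, k_ba = 1.0, 0.5, 0.3, 0.2
M = np.array([[lam_a - k_ab, k_ba],
              [k_ab, lam_b - k_ba]])

n0 = np.array([1.0, 1.0])          # initial means (N_A0, N_B0)
mean_t = expm(M * 3.0) @ n0        # means at t = 3
growth_rate = np.max(np.linalg.eigvals(M).real)  # alpha + beta

print(mean_t)
print(growth_rate)  # ~ 0.816 for these parameters; always >= 0
```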

Population dynamics of extended models C & U
As we showed in the last two sections, obtaining an exact solution of the probability generating function is already quite challenging for only two distinct cell states A and B. To obtain the probability generating function for the extended models with eight distinct cell states, we would need to solve PDEs in eight dimensions with seven free parameters. We therefore restrict the analysis of population dynamics in the extended models to obtaining approximate expressions for the mean populations ⟨Φ(t)⟩. Following the definition of allowed transitions in the extended models, one can derive a system of ODEs for the time evolution of the first moments ⟨Φ⟩ of the form [11]

d⟨Φ⟩/dt = M ⟨Φ⟩,

where M is an 8 × 8 matrix which depends on the allowed transitions in the respective model. The general solution for the mean populations is a linear combination ⟨Φ(t)⟩ = Σ_m c_m v_m e^{λ_m t}, where λ_m and v_m are the m-th eigenvalue and eigenvector of the matrix M, respectively, and the c_m are constants depending on the initial conditions. At late times, λ_m t ≫ 1, the term with the largest eigenvalue max({λ_m}) dominates the dynamics. For the extended Model U, the late-time evolution according to the largest eigenvalue is given by

⟨Φ(t)⟩ ∝ e^{θ_0 t} e,

where the entry e_i of the vector e equals one for those cell states favoured by the imbalance θ_M+/θ_M− of the switching rates and zero for all other cell states. Cell states are therefore distributed according to their equilibrium distribution, and the effective growth rate of the total population equals θ_0, as expected in Model U, where state transitions do not change cell numbers. On the other hand, in the extended Model C, if the rates for switching genes on and off are approximately equal, θ_M+ ≈ θ_M− = θ_M, the late-time dynamics are given by

⟨Φ(t)⟩ ∝ e^{(θ_0 + θ_T + θ_S + θ_F) t} 1,

where 1 is an 8 × 1 all-ones vector. Cell states are therefore uniformly distributed and the overall population grows with effective growth rate (θ_0 + θ_T + θ_S + θ_F). Differentiation events can therefore trigger increased population growth in the extended Model C. However, the magnitude of the increased growth depends on the rate imbalance θ_M+/θ_M− of the marker switching rates.
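For the extended Model C with balanced switching rates, the predicted effective growth rate θ_0 + θ_T + θ_S + θ_F can be checked numerically by building the 8 × 8 mean-field matrix M and computing its largest eigenvalue (our own illustrative code; rate values are arbitrary):

```python
import numpy as np
from itertools import product

# Mean-field matrix of the extended Model C (balanced switching rates,
# theta_M+ = theta_M- = theta_M). For each cube vertex i:
# d<Phi_i>/dt = theta_0 <Phi_i> - sum_j theta_ij <Phi_i> + 2 sum_j theta_ji <Phi_j>,
# with j running over the three neighbours of i.
theta_0 = 1.0
theta = {0: 0.2, 1: 0.3, 2: 0.1}   # switching rates theta_T, theta_S, theta_F

states = list(product((0, 1), repeat=3))
index = {s: k for k, s in enumerate(states)}

M = np.zeros((8, 8))
for s in states:
    i = index[s]
    M[i, i] += theta_0
    for gene, rate in theta.items():
        nbr = list(s)
        nbr[gene] ^= 1                 # flip one gene -> neighbouring state
        j = index[tuple(nbr)]
        M[i, i] -= rate                # loss of Phi_i via Phi_i -> 2 Phi_j
        M[j, i] += 2.0 * rate          # gain of two Phi_j cells

lam_max = float(np.max(np.linalg.eigvals(M).real))
print(lam_max)  # theory: theta_0 + theta_T + theta_S + theta_F = 1.6
```

The all-ones vector is the corresponding eigenvector, reproducing the uniform late-time state distribution.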

Summary
In this section we derived exact solutions of the probability generating function for both Model C and Model U. This provides an efficient and systematic way of calculating moments of the population distribution. We explicitly derived expressions for the time evolution of the mean populations in both models. Neglecting the possibility of extinction in Model C, mean populations grow exponentially over long time scales, with rates depending on the underlying model parameters. These expressions will be used in the next section for deriving an efficient algorithm for reaction rate inference (see Appendix D.2) and for proposing experimental measurement procedures which optimise parameter inference (see Section 3.2.3). We showed that the extended stem cell models follow similar growth dynamics and presented approximate expressions for the mean populations at late times.

Bayesian inference on minimal models
In the previous section we analysed the time-evolution of the distribution over cell states N (t) as a function of model parameters θ. Now we will look at the inverse problem: obtaining the distribution over model parameters θ given population data N . In the first section we consider parameter inference given complete knowledge about population sizes N (t) at all times. In the second section we generalise parameter inference to cases where we have incomplete data and population sizes N (t i ) are only known at certain time points {t i }.

Longer observations decrease uncertainty exponentially on complete data
We first suppose that we have complete knowledge of how the population sizes N (t) have changed over time. In other words, each cell division and differentiation event is evident in the trajectory N (t). Such data could, for example, come from live-imaging experiments in which each cell state can be identified and the full lineage be tracked.
Since each reaction as defined by Eq. (1), (3) leads to a unique change in the number of cells in each state, we can infer the type of each reaction that occurred from the changes in N(t). We let τ_i, ν_i and h_{ν_i}(τ_i) denote the reaction times, the corresponding reaction types and their transition rates, respectively. We define the combined hazard h_0(t) as the sum over the reaction hazards of all possible reactions that can take place at time t (which will depend on the state of the system at that time). Since each reaction event is modelled as a Poisson process with corresponding hazard h_{ν_i}, the likelihood of a sequence of n reactions is given by [12]

L_comp(θ) = [Π_{i=1..n} h_{ν_i}(τ_i)] exp(−∫_0^T h_0(t) dt),

where the subscript comp denotes complete data. Now, in the models we consider here, each reaction hazard h_{ν_i} is proportional to a per-capita rate θ_{ν_i} multiplied by the number of cells in the state Φ_ν associated with a reaction of type ν. If we define N_ν(t) as the total number of cells in state Φ_ν at time t, we have

L_comp(θ) ∝ Π_ν θ_ν^{r_ν} exp(−θ_ν ∫_0^T N_ν(t) dt),     (34)

where the product is over each type of reaction ν (i.e. division or differentiation of each cell state) and r_ν is the number of times that reaction ν occurs during the observation period [0, T]. Note that what matters for Bayesian inference is the dependence of the likelihood on the parameters θ. Therefore factors that do not depend on these parameters (such as the population sizes N_ν that multiply each of the transition rates θ_ν) can be subsumed into the constant of proportionality in (34).
A key feature of (34) is that it factorises into terms that each depend on a different reaction rate θ_ν. This means that each parameter can be inferred independently, assuming that the parameter priors are independent: p(θ) = Π_ν p(θ_ν). That is, the posterior distribution of θ_ν is

p(θ_ν|ξ) ∝ θ_ν^{r_ν} exp(−θ_ν ∫_0^T N_ν(t) dt) p(θ_ν),     (35)

where p(θ_ν) is the prior distribution over the reaction rate θ_ν. As before, the constant of proportionality can be obtained by normalising the posterior distribution p(θ_ν|ξ). It is convenient for the prior distribution to be taken conjugate to the likelihood, which means that the product of prior and likelihood is of the same functional form as the prior. Here, the likelihood is of gamma-distribution form in θ_ν, and therefore the conjugate prior is also a gamma distribution [13]. If the prior is a gamma distribution with shape a_ν and rate b_ν, then the posterior is a gamma distribution with shape a'_ν = a_ν + r_ν and rate b'_ν = b_ν + ∫_0^T N_ν(t) dt. The confidence of parameter estimation can be quantified by the dispersion of the posterior distribution p(θ_ν|ξ), measured by its coefficient of variation (COV),

c_v = 1/√(a_ν + r_ν).

For experimental design it is essential to estimate how much data is needed to achieve parameter inference of a certain precision. More precisely, if we observe n identical systems over a time period T, we would like to know how the COV scales on average with sample size n and observation time T. As shown in Appendix C, the asymptotic scaling of the mean COV in Model C follows from the expected number of reactions,

⟨c_v(θ_ν)⟩ ≈ [n θ_ν N_A0 (e^{∆λ T} − 1)/∆λ]^{−1/2} ∼ n^{−1/2} e^{−∆λ T/2},

where ∆λ = λ_AA − λ_BB. For Model U, we obtain the analogous expression with the mean population driving reaction ν; at late times the COV decays as e^{−(α+β)T/2}, where α and β are functions of the rate constants as given by (28). Note that the mean COV of θ_ν depends on the rate constant θ_ν itself. This result can be useful for experimental design if a rough estimate for the magnitude of θ_ν is available. If not, one has to rely only on the scaling with sample size n.
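The conjugate gamma update is a one-liner in practice; the helper below is a hypothetical sketch (names and numbers are our own) of the posterior mean and COV for one per-capita rate:

```python
import math

def gamma_posterior(prior_shape, prior_rate, n_events, integrated_population):
    """Conjugate update for one per-capita rate theta_nu:
    Gamma(a, b) prior -> Gamma(a + r, b + int_0^T N_nu(t) dt) posterior.
    Returns the posterior mean and coefficient of variation."""
    a = prior_shape + n_events
    b = prior_rate + integrated_population
    return a / b, 1.0 / math.sqrt(a)

# Hypothetical numbers: 50 reactions observed while the integrated
# driving population was 48 cell-hours, with a Gamma(1, 1) prior.
mean, cov = gamma_posterior(1.0, 1.0, 50, 48.0)
print(mean)  # 51/49 ~ 1.04
print(cov)   # 1/sqrt(51) ~ 0.14
```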
To summarise, if we have a complete history of all the population sizes N_ν(t) over time, we can infer the reaction rate θ_ν by counting how many times reaction ν occurs, and by integrating the size of the population that drives the reaction, N_ν(t), over time t. Scaling laws for the confidence of parameter estimation can be obtained as a function of sample size n and observation time T: the uncertainty of the parameter estimation decreases exponentially with observation time, but only decays as a power law with sample size. Besides parameter inference, the aim of experiments is often to compare competing models based on the data ξ. In the complete-data scenario, reaction types can be inferred directly from changes in the population. Hence, if reactions are observed that are impossible in either Model C or Model U, we can rule out the respective model, and model selection is trivial. This will be very different in scenarios where we only have access to snapshot data, as we now show.

Efficient sampling enables inference for analytically intractable snapshot data
The results of the previous section depend on having complete knowledge of population sizes over time. It is unlikely that this level of detail will be available in practice. More likely are a set of snapshots, that is, measurements of N at a finite set of times t 0 = 0, t 1 , t 2 , . . . , t n .
In the previous section we obtained an expression for the posterior distribution p(θ|ξ) given a completely specified trajectory ξ. To obtain the corresponding quantity from snapshot data, we integrate over all trajectories ξ that are consistent with the observations {N(t_i)}, weighted by the probability p(ξ|{N(t_i)}), which is conditioned on passing through the observation points. That is,

p(θ|{N(t_i)}) = ∫ p(θ|ξ) p(ξ|{N(t_i)}) dξ.

Since the underlying model is Markovian, we have that

p(ξ|{N(t_i)}) = Π_i p(ξ_i|N(t_{i−1}), N(t_i)),

where ξ_i is the portion of the trajectory corresponding to the time interval [t_{i−1}, t_i]. This observation implies that the posterior distribution can be sampled using the following Sample Path MCMC algorithm (see also Figure 3): starting from an initial trajectory consistent with the snapshots, we repeatedly (1) pick an inter-snapshot interval [t_{i−1}, t_i]; (2) propose a new trajectory segment ξ_i with the fixed endpoints N(t_{i−1}) and N(t_i); (3) accept or reject the proposal with an acceptance ratio A; and (4) sample rate constants θ from the complete-data posterior of the current trajectory [12]. Previously we used the Gillespie algorithm to sample random trajectories with a fixed starting point. However, to propose new trajectory segments ξ_i in step 2 of the algorithm, we need to sample a random trajectory where both end points are fixed. This is a much harder problem, and we use an approach based on the Metropolis-Hastings-Green algorithm [12,15] to achieve this. The basic idea is to draw ξ_i from a distribution that approximates the true distribution, and to adjust the acceptance ratio A accordingly.
To construct the approximate distribution, we make use of the exact results for the mean population sizes derived in Sections 3.1.1 and 3.1.2. We found that these typically grow or decline exponentially. Therefore we consider a set of reactions ν whose hazards h*_ν(t) are deterministic functions of time, interpolating between the populations at the two ends of the segment according to the exponential mean dynamics; here t is the time from the start of the trajectory segment, and T is the length of the trajectory segment. The simplification is that the probability that a reaction occurs at time 0 ≤ t ≤ T is not affected by the sequence of reactions that has occurred up to time t, as would be the case for the true distribution. This amounts to sampling r_ν reactions of type ν from an inhomogeneous Poisson process with the time-dependent reaction hazards h*_ν. The procedure is to first draw the numbers of reactions r_ν, and then sample the reaction times from the inhomogeneous Poisson process. To account for the approximation in sampling the trajectory, the acceptance ratio in the Sample Path MCMC algorithm must be chosen accordingly; this is shown in detail in Appendix D.
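Event times of an inhomogeneous Poisson process with a bounded hazard can be sampled by thinning (rejection); the sketch below uses an illustrative exponentially growing hazard in the spirit of the proposal distribution described here, but is not the paper's implementation:

```python
import math
import random

def thinning_sample(hazard, t_max, h_upper, rng):
    """Sample event times on [0, t_max] from an inhomogeneous Poisson
    process with rate hazard(t) <= h_upper, by thinning a homogeneous
    process of rate h_upper."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(h_upper)
        if t > t_max:
            return times
        if rng.random() < hazard(t) / h_upper:  # accept with prob h(t)/h_upper
            times.append(t)

# Illustrative exponentially growing hazard, mimicking an exponentially
# growing mean population (parameters are arbitrary).
growth = lambda t: 2.0 * math.exp(0.5 * t)
events = thinning_sample(growth, 3.0, growth(3.0), random.Random(0))
print(len(events))  # mean count is int_0^3 h(t) dt = 4(e^1.5 - 1) ~ 13.9
```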
3.2.3. Optimal snapshot timing can improve parameter inference
We now apply the Sample Path MCMC algorithm to understand how well it performs given specific snapshots. Recall that this algorithm generates a sequence of rate-constant vectors θ which (when converged) are distributed according to the correct posterior distribution p(θ|{N(t_i)}). Our first test is a single run of Model C, in which the initial condition N(0) = (1, 1) has evolved after time t = 1.5 to N(1.5) = (3, 7), with reaction rates θ = (λ_AA, λ_AB, λ_BB) = (1, 0, 1). With a large amount of data, one would expect the posterior distribution to be peaked around this point. A two-dimensional projection of the three-dimensional posterior obtained via kernel density estimation (KDE) is shown in Figure 4; it shows that the parameter estimates are reasonably well constrained despite the small amount of data (2 snapshots) used for the inference. Throughout, we use Scott's rule [16] for KDE bandwidth estimation. Unlike the complete-data scenario (Section 3.2.1), the likelihood (and so the posterior) does not factorise into independent component likelihoods (cf. Eq. (35)). This means that, given snapshot data, rate constants are correlated and rate inference cannot be carried out separately for each parameter. The alignment of the high-density regions in Figure 4 along the diagonal clearly indicates correlations between λ_AA and λ_BB.
We now investigate how the parameter estimates are affected by the length of time, ∆t, between the two snapshots. We anticipate that when ∆t is small, only a small number of reactions could reasonably have occurred, and so one would not expect to lose much from having only the trajectory endpoints (as opposed to the full trajectory). On the other hand, when ∆t is large, we would expect estimates based on snapshots to be less tightly constrained than those based on full trajectories.
Our procedure is to generate a trajectory from Model C of length ∆t, starting from N (0) = (1, 1) and with θ = (λ AA , λ AB , λ BB ) = (1, 1, 1). We then perform the parameter inference with both the full trajectory, and with the two endpoints. The resulting posterior distributions for the parameter λ AA for different ∆t are shown in Figure 5.
The snapshot posteriors (blue, filled curves) are always less accurate with respect to the true parameters than their complete-data counterparts (red lines), as one would expect. Because the full trajectory ξ contains more reaction events at longer times, it provides more information about the system, and the variance of the complete-data posterior decreases. The information contained in snapshot data also increases with the number of reactions, but at a much slower rate than for the full trajectory ξ (see Figure 6).
Suppose now we have the ability to take more than two snapshots, and have freedom to choose when they are taken. Is it more informative to space them out evenly, or to cluster them together? Given that the total rate at which reactions occur is proportional to the number of cells in the population, we expect to learn more from closely-spaced observations when the population size is large than when it is small. More precisely, we can attempt to maximise the information content of multiple snapshots by keeping the mean number of reactions between them constant. For example, in Model C we know that the total reaction rate is N_A(t)(λ_AA + λ_AB + λ_BB) and from equation (16) that the mean number of A-state cells grows exponentially as N_A0 e^{∆λ t}, where ∆λ = λ_AA − λ_BB. We can therefore deduce that the mean number of reactions between times t_{i−1} and t_i is proportional to e^{∆λ t_i} − e^{∆λ t_{i−1}}. Setting this number equal for all i, with the constraint that t_0 = 0 and t_n = T, we find that the i-th snapshot should be taken at time
t_i = (1/∆λ) ln[1 + (i/n)(e^{∆λ T} − 1)], (43)
where n is the total number of snapshots taken in the time interval. In Figure 7 we compare the variance of posterior distributions obtained from differently spaced snapshots by replacing ∆λ with a variation parameter z in equation (43) (z = 0 represents uniformly spaced snapshots). We see that parameter estimates are most precise if the average number of reactions between snapshots is kept constant (z = ∆λ), as proposed. This confirms our intuition that one is likely to be best served by sampling larger populations more frequently. A similar analysis is possible for Model U, which yields the same optimal snapshot times as in Eq. (43), but with ∆λ replaced by α + β from Eq. (28).
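A small sketch of this spacing rule, assuming exponential mean growth of the total reaction rate with exponent Δλ (written dlam below): the snapshot times t_i = (1/Δλ) ln[1 + (i/n)(e^{ΔλT} − 1)] make the increments of e^{Δλ t_i}, and hence the expected reaction counts between snapshots, equal.

```python
import numpy as np

def snapshot_times(dlam, T, n):
    """Times t_1..t_n in [0, T] keeping the expected number of reactions
    between consecutive snapshots constant when the total reaction rate
    grows like exp(dlam * t); dlam -> 0 recovers uniform spacing."""
    i = np.arange(1, n + 1)
    if abs(dlam) < 1e-12:
        return i * T / n
    return np.log1p((i / n) * np.expm1(dlam * T)) / dlam

times = snapshot_times(dlam=0.7, T=3.0, n=5)
# equal increments of exp(dlam * t_i)  <=>  equal expected reaction counts
incr = np.diff(np.exp(0.7 * np.concatenate([[0.0], times])))
```

For growing populations (Δλ > 0) the rule front-loads the waiting and clusters snapshots late, when the population, and hence the reaction rate, is largest.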

Application to epiblast stem cell data
Having established that parameter inference and model selection are viable with relatively limited amounts of data in the context of Models C and U with two cell states, we turn now to the more challenging case of epiblast stem cell data with eight states, as described in Section 2.1.3. This may elucidate whether cell division and state changes are independent or coupled in this particular biological system.
where M1, M2 are the measured marker expressions and M3 is the unknown marker. To compare snapshot data obtained from simulations with experimental data, we first have to calculate the projected populations {N_[M1,M2]} := D_sim before comparing them to the measured populations from experiments. Population snapshot data were experimentally recorded from cell colonies subject to two different environmental conditions: CHIR (D_CHIR), which refers to the presence of the differentiation factor Chiron, and EPISC (D_EPISC), which refers to the absence of this factor [7]. The first data set D_CHIR consists of eight measurement series recorded at different times and for different marker expressions: (i) Initial marker T+, markers measured after ∆t = 2d and ∆t = 3d: T, F; (ii) Initial marker T+, markers measured after ∆t = 2d and ∆t = 3d: T, S; (iii) Initial marker T−, markers measured after ∆t = 2d and ∆t = 3d: T, F; (iv) Initial marker T−, markers measured after ∆t = 2d and ∆t = 3d: T, S.
The second data set D EPISC consists of four measurement series recorded at different times and for marker expressions: (i) Initial marker T + , marker measured after ∆t = 3d: T, F ; (ii) Initial marker T + , marker measured after ∆t = 3d: T, S; (iii) Initial marker T − , marker measured after ∆t = 3d: T, F ; (iv) Initial marker T − , marker measured after ∆t = 3d: T, S.
Each measurement series in both sets contains between 98 and 142 observations of projected population snapshots.

Inference using ABC
The main challenge in applying the MCMC algorithm developed above to the EpiSC cell state network is the high dimensionality of the state space. The state space of our system is given by the population numbers {N_Φ} of all 8 cell states Φ_i. We need to evaluate snapshot likelihoods of the form L_SS({N_Φ(∆t)}|θ) by an MC estimate, so a sufficiently large ensemble of trajectories is necessary to populate most parts of the state space {N_Φ}. The experimental data D_CHIR and D_EPISC contain up to 8 cells per cell state, so the MC estimate would need to sample up to 8^8 ≈ 2·10^7 possible system states. Unfortunately, sampling a sufficiently large ensemble from ∼10^7 system states turned out not to be computationally feasible. We therefore turn to Approximate Bayesian Computation (ABC) [17]. In our case, ABC speeds up calculations by several orders of magnitude by comparing data based on lower-dimensional summary statistics only, at the expense of introducing an additional error by approximating posterior distributions. We choose the following summary statistics: the mean N̄_Φ, the median med{N_Φ} and the standard deviation σ_Φ. We use a Sequential Monte Carlo algorithm (SMC-ABC) for parameter inference and model selection [18,19]. We also verify our results using a Rejection-ABC algorithm (modified from [17]) for parameter inference and model selection, which is presented in Appendix E.
In ABC, a distance function D(s(D_sim), s(D)) quantifies how "close" the summary statistics of observed and simulated data are. A tolerance ε ≥ 0 controls the level of agreement between the two summary statistics, and ε > 0 introduces the eponymous approximation in ABC. We choose the following summary statistic for a measurement series d: s_i(d) = [N̄_i, Var(N_i), Med(N_i)]^T, where i ∈ {1, 2, 3, 4} indexes the projected populations (expression of either T and F or T and S, starting from either T+ or T− cells). This summary statistic captures the centre, spread and skewness of the respective distribution p(N_i). We use the following norm to compare the summary statistics of experimental and simulated data:
D(s(D_sim), s(D)) = Σ_k ||d_k^sim − d_k||_1, (45)
where d_k^sim and d_k are the summary statistics of the k-th measurement series of simulated and experimental data, respectively, and ||·||_1 is the L1-norm. The choice of norm does not significantly affect the shape of the parameter posteriors if the acceptance threshold is sufficiently small (see Figure G1). Note that k ∈ {1, . . . , 8} for data set D_CHIR and k ∈ {1, . . . , 4} for data set D_EPISC. The output of this ABC algorithm is a set of parameters {θ} distributed according to p(θ|D(s(D_sim), s(D)) < ε). For sufficiently small ε, this distribution should be a good approximation of the true posterior p(θ|s(D)), which we confirmed by comparing the ABC posterior of the cell division rate θ_0 in Model U with the analytical solution (see Figure F1).
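To make the rejection variant concrete, here is a minimal sketch on a toy one-parameter clone-size model rather than the eight-state system: a pure-birth (Yule) process started from one cell has a geometrically distributed size at time T, so clone-size data can be simulated cheaply. The summary statistics (mean, standard deviation, median), L1 distance and log-uniform prior mirror the choices described above; the tolerance and rate values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)

def summary(x):
    # centre, spread and a robust location measure per measurement series
    return np.array([np.mean(x), np.std(x), np.median(x)])

def simulate(theta, n_obs, T=3.0):
    # toy clone-size model: a pure-birth (Yule) process from one cell has a
    # geometric size distribution at time T with success prob exp(-theta*T)
    return rng.geometric(np.exp(-theta * T), size=n_obs)

data = simulate(0.4, 201)        # synthetic "experimental" data, true rate 0.4
s_obs = summary(data)

accepted = []
for _ in range(8000):
    theta = 10 ** rng.uniform(-2, 0)           # log10(theta) ~ U[-2, 0] prior
    dist = np.sum(np.abs(summary(simulate(theta, 201)) - s_obs))
    if dist < 2.0:                             # tolerance epsilon = 2.0
        accepted.append(theta)

posterior_mean = np.mean(accepted)
```

Shrinking the tolerance concentrates the accepted parameters around the true rate at the cost of a lower acceptance probability, which is the trade-off the SMC scheme navigates with its decreasing tolerance schedule.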
For parameter inference and model selection we choose identical log-uniform priors for all rate constants in both models, log10(θ_k) ∼ U[−2, 0] (in units of per day). The upper cut-off can be justified by inspection of the posterior distributions, which are close to zero above this threshold, p(θ_k > 1) ≈ 0. Cell transitions of a certain type should occur sufficiently often over the observation period of the experiment to be considered in our model. By constraining the prior distribution to p(θ_k < 0.01) = 0, we only discard very rare cell transitions which would occur less than once every 100 days in a single cell. Considering that total population sizes in the experiment are around 10 cells and the observation period is three days, this lower cut-off seems reasonable.
We run the SMC-ABC algorithm, which progressively lowers the tolerance ε until the acceptance probability becomes too small and ε cannot be decreased much further without impractical computational costs. The Bayes factor B_CU can be calculated for every SMC iteration from the marginal posterior distributions of models M_C and M_U. This was done for both datasets D_CHIR and D_EPISC, and the resulting Bayes factors are shown in Figure 9 as a function of ε/N_set. We normalised the tolerance value by the number of measurement series N_set in a dataset, since our distance function Eq. (45) sums over the deviations of every measurement series 1 ≤ k ≤ N_set. For the lowest achievable tolerance (see red dots in Figure 9) we obtain Bayes factors of B_CU = 4.8 ± 0.5 for D_CHIR and B_CU = 20 ± 4 for D_EPISC. This constitutes moderate to strong evidence in favour of Model C dynamics, i.e. a coupling of cell division and state transitions.
The posterior distributions of the Model C log-rates log10(θ_ν) are shown in Figure 10(a) for both data sets. While the cell division rate θ_0 depends only slightly on the experimental conditions (log10(θ_0) ≈ −1.1), the rates for switching markers on and off differ considerably between experimental conditions. The rate imbalance θ_M+/θ_M− of the marker switching rates is shown in Figure 10, and indicates: • Under CHIR conditions there are still more T+ cells than under EPISC conditions, which is consistent with the raw data from [7]. The rate imbalances of markers F and S are quite small, so in steady state there may be significant sub-populations with marker expression S+ or F−.
• Cells in EPISC conditions are biased towards the state [T−, S+, F−]. The rate imbalance of marker F is quite small, so in steady state there may be sub-populations with marker expression F+.
It is worth mentioning that although Model U was shown to be less likely, the posterior distribution of Model U shows a very similar rate imbalance (see Figure F2). Furthermore, the inferred log-rates of Model C allow us to obtain the complete distribution over cell states, which is not directly accessible in the experiments. As shown in Figure 11, the cell state distribution depends on the initial cell conditions (T−-sorted/T+-sorted stem cells) and the environmental conditions (CHIR/EPISC). The cell state distribution for initially unsorted stem cells (assuming every cell state is equally likely) is shown in Appendix F.
So far we have assumed that cell transitions happen at a fixed rate θ_k which does not vary in time. We can test this assumption using the experimental data D_CHIR, for which population snapshots were recorded at two different times, ∆t = 2d and ∆t = 3d. We used exclusively day 2 and day 3 data to obtain parameter posteriors separately for the population dynamics at early times 0 < t < 2d and late times 2d < t < 3d, respectively. The inferred reaction rates are compared in Figure 12, which indicates: • The universal reproduction rate θ_0 and the marker rates θ_S+, θ_S− do not show a substantial time dependence and stay constant over the course of three days.
• The rate imbalance of marker T, θ_T+/θ_T−, significantly decreases at day 3 compared to days 1 and 2 (difference > 2σ). The rate imbalance of marker F, θ_F+/θ_F−, seems to increase slightly with time, although the difference between day 2 and day 3 is not very substantial (difference < σ).
Again, the posterior distribution of Model U shows a very similar time dependence of the reaction rates (see Figure F4).
Figure 11: The initial cell state distribution of T±-sorted stem cells was obtained from steady-state populations from previous experiments [7], and Model C dynamics with the inferred log-rates (see Figure 10) were used to obtain the cell-state distribution at t = 3d. The probability of the respective marker expression is indicated by the colour map.

Discussion
In this paper, we studied the population dynamics of two stochastic models of cell division and differentiation, or more generally, cell state changes: one in which cell state transitions are coupled to cell division, Model C, defined by Eq. (2), and one in which these two processes are independent, Model U, defined by Eq. (4). We solved the corresponding master equations analytically using probability generating functions, which allow a systematic calculation of the moments of the population distribution. The resulting solutions for both Models C and U allowed us to analyse the dynamics of the population mean, based on which we showed that one cannot always differentiate between the two models. In Model C, there is a finite probability that the proliferating sub-population will eventually die out and the dynamics will come to a halt.
Figure 12: CHIR D2 and CHIR D3 show the inferred parameters using exclusively day 2 or day 3 data, respectively. CHIR D2+D3 was obtained by including all data points of D_CHIR and coincides with the results shown in Figure 10.
The probability of extinction, Eq. (18), can be used to approximately infer the underlying reaction rates of Model C based solely on the fraction of systems in which extinction was observed. In contrast, in Model U both populations grow without bound and extinction is impossible for finite reaction constants. This characteristic extinction feature of Model C might be useful for classifying experimental observations, or for inferring cell division rates when there is negligible cell death.
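This can be checked numerically. The A-subpopulation of Model C is a linear birth-death process (birth rate λ_AA, death rate λ_BB; the A → A+B reaction leaves the A-count unchanged), for which the textbook extinction probability from a single cell is min(1, λ_BB/λ_AA). The sketch below, with illustrative rates, reads the differentiation rate off the observed extinction fraction:

```python
import numpy as np

rng = np.random.default_rng(3)

def a_population_extinct(lam_aa, lam_bb, n_max=200, t_max=30.0):
    """Gillespie simulation of the A-count of Model C started from one cell:
    +1 at rate n*lam_aa (A -> A+A), -1 at rate n*lam_bb (A -> B+B).
    Returns True if the A-population died out."""
    n, t = 1, 0.0
    p_birth = lam_aa / (lam_aa + lam_bb)
    while 0 < n < n_max and t < t_max:
        t += rng.exponential(1.0 / (n * (lam_aa + lam_bb)))
        n += 1 if rng.random() < p_birth else -1
    return n == 0

lam_aa, lam_bb = 1.0, 0.4
p_ext = np.mean([a_population_extinct(lam_aa, lam_bb) for _ in range(2000)])
# for lam_bb < lam_aa, p_ext = lam_bb / lam_aa, so the observed extinction
# fraction gives an estimate of the differentiation rate:
lam_bb_est = p_ext * lam_aa
```

Clones that survive past a few hundred cells are effectively safe from extinction in the supercritical regime, so the finite simulation cut-offs introduce negligible bias here.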
For Bayesian parameter inference we were able to obtain analytical expressions for the likelihood when continuous-time observations on [0, T] of all populations are available, e.g. as would be obtained from live imaging. Using conjugate priors, we obtained the posterior distributions analytically. When confronted with incomplete data in the form of population snapshots, Bayesian inference is more challenging, and we have to rely on numerical results. As estimating the snapshot likelihood directly is computationally costly, we used a more elaborate but efficient sample path method to obtain posterior distributions of the rate constants given snapshot data. We further showed that the information content of snapshot data depends on the times at which these snapshots were taken, and proposed an optimal spacing strategy which maximises the confidence of the parameter inference. For parameter and model inference based on experimental EpiSC data, we had to extend Models C and U from two distinct cell populations to a total of 8 cell states. Since our sample path method turned out not to be computationally feasible here, we used Approximate Bayesian Computation (ABC), which speeds up the calculation by several orders of magnitude. We used ABC rejection sampling and ABC-SMC to obtain posterior parameter distributions and Bayes factors for selecting between the two competing Models C and U. The Bayes factors provide moderate to strong evidence in favour of Model C, in which cell division and differentiation are coupled (see Figure 9). The posterior distributions of the rate constants θ quantify how cell fate changes depending on the in vitro culture conditions (CHIR/EPISC, see Figure 10). For cells in the CHIR condition, we quantified how the rates of cell state transitions might change over time (see Figure 12).
As we only have data from multiple time points for the CHIR conditions, we cannot say whether this apparent time dependence is caused by a cell-intrinsic timing mechanism, or whether the environment the cells experience changes over time because the cell colony is growing. Since cells kept under EPISC conditions do not shift their expression of transcription factors over comparable timescales [7], it seems more likely that these rates depend on the local environment and that cell-cell interactions are important, so that reaction rates change depending on the state of neighbouring cells. Interestingly, the data for cells under CHIR conditions indicate that cell division and differentiation are initially independent of each other (Model U) up to day 2, after which cell division and differentiation appear to be coupled (Model C, see Figure F5). To account for these time-dependent dynamics more rigorously, one would have to consider more complex models such as non-homogeneous Markov models or explicit representations of dynamic gene expression [20].

4.1. Limitations and alternative approaches
4.1.1. Assumption of Markovian dynamics
We began our investigation with minimal models of dynamics with two cell states. Although Models C and U were motivated by biological arguments, the stochastic, two-dimensional jump processes we defined could be applied to a variety of systems. The assumption of Markovian dynamics is often used because of a range of useful properties which make the models easier to analyse. Often, however, especially in the context of cell dynamics, the Markov assumption is at best a rough approximation [21,22]. Waiting times τ between jumps are exponentially distributed, which is biologically unrealistic: on a molecular level, many transport processes are required for a cell to replicate, and therefore waiting times between two division events cannot be arbitrarily small [22].
Indeed, recent work has suggested that experimentally observed cell cycle time distributions differ substantially from a Markovian exponential distribution and are better described by an Erlang distribution [22]. How do population dynamics change for a non-Markovian process? In this scenario cell populations cannot grow arbitrarily fast; there is a time-dependent upper bound on population sizes, N < N_lim(t). State probabilities p_{i,j}(t) = p(N_A(t) = i, N_B(t) = j) are therefore zero for population numbers (i, j) exceeding the limit N_lim(t). Cell population dynamics with Erlang-distributed waiting times are non-Markovian stochastic processes with discrete states in continuous time. Non-Markovian processes like this can in general be analysed by converting them into Markov processes through the inclusion of supplementary variables [23].
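The contrast between exponential and Erlang waiting times, and the supplementary-variable trick, can be seen in a few lines: an Erlang(k, kλ) wait is realised as k sequential exponential sub-stages (k extra sub-states per cell restore the Markov property), keeping the mean at 1/λ while reducing the coefficient of variation to 1/√k and making very short division times vanishingly rare.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, k, n = 1.0, 10, 200_000

# Markovian model: exponential waiting times with mean 1/lam
exp_times = rng.exponential(1.0 / lam, size=n)

# Erlang(k, k*lam) waiting times via k exponential sub-stages, i.e. the
# supplementary-variable embedding of a non-Markovian waiting time
erl_times = rng.exponential(1.0 / (k * lam), size=(n, k)).sum(axis=1)

cv_exp = np.std(exp_times) / np.mean(exp_times)   # close to 1
cv_erl = np.std(erl_times) / np.mean(erl_times)   # close to 1/sqrt(k)
```

With k = 10 stages, waits shorter than a tenth of the mean cycle time essentially never occur, whereas the exponential model produces them about 10% of the time.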

4.1.2. Insight from minimal models
For Bayesian inference given snapshot data, our main focus was on building a Bayesian framework which can be applied to experimental data. However, the algorithms for parameter inference and model selection originally proposed for Models C and U were computationally too intensive for the more complicated models with larger cell state networks. Nonetheless, the optimal snapshot spacing and the scaling behaviour of Bayes factors with snapshot separation presented here should be generally useful for experimental design. Measurements should be taken at times distributed non-uniformly according to equation (43), and if only a single snapshot can be taken after observation time ∆t, this time should be maximised.

4.1.3. Sufficiency of summary statistics
We carried out parameter inference and model selection based on experimental EpiSC data using an ABC algorithm which works with summary statistics of the data. We obtained Bayes factors of B_CU ≈ 4.8 for D_CHIR and B_CU ≈ 20 for D_EPISC (see Fig. 9), which indicates moderate to strong evidence in favour of Model C. This result, however, has to be interpreted with care: first, model selection based on Bayes factors depends on the prior distribution accurately reflecting the state of knowledge about the underlying model parameters θ. Here we chose log-uniform priors, log10(θ_k) ∼ U[−2, 0], which reflect not knowing the magnitude of the rates θ. The exact values of the Bayes factors might therefore change slightly depending on the choice of prior.
Secondly, because we used summary statistics of the data instead of the full set of observations, we did not obtain the "true" Bayes factor based on the data D, but one based on the summary statistic s(D) of the data. Considering the full data would mean looking at each experimental replicate individually: calculating the parameter posterior given a single experimental realisation and iterating over all data points, using the posterior distribution as the prior distribution for the next data point. The "true" Bayes factor and the one obtained using summary statistics only coincide if the summary statistic s(D) is sufficient for comparing Models C and U: p(D|s(D), M_C) = p(D|s(D), M_U).
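The replicate-by-replicate updating described here can be illustrated with a conjugate toy model (a Beta-Bernoulli example, not the models of this paper), where using each posterior as the prior for the next data point provably reproduces the batch posterior:

```python
# sequential Bayesian updating: the posterior after seeing all replicates
# one-by-one (posterior of step i becomes the prior of step i+1) equals the
# batch posterior -- shown here with a conjugate Beta-Bernoulli model
data = [1, 0, 1, 1, 0, 1, 1, 1]

a, b = 1.0, 1.0                      # Beta(1, 1) prior
for x in data:                       # one replicate at a time
    a, b = a + x, b + (1 - x)        # conjugate update

a_batch = 1.0 + sum(data)            # all replicates at once
b_batch = 1.0 + len(data) - sum(data)
```

For conjugate models the two routes agree exactly; with summary statistics or ABC approximations they generally do not, which is the source of the discrepancy discussed above.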
There are only a few cases in which summary statistics are known to be sufficient, e.g. [24]. More usually, we have to work with arbitrary summary statistics, which involve an unknown loss of information if they are insufficient. This "curse of insufficiency" is the reason why some researchers caution against a naive use of ABC for model selection [25]. There are in fact three levels of approximation error: one due to the MC estimation, one due to the non-zero ABC tolerance ε > 0, and one due to insufficient summary statistics. Alternatives to model selection with insufficient statistics were explored in [26], which selected models via a machine learning approach (random forests).
Despite concerns about the sufficiency of our summary statistics, we verified that the summary statistic used here is able to discriminate between the extended Models C and U. Using Rejection-ABC, we obtained Bayes factors of B_CU ≈ 3.5 on synthetic CHIR data D_sim_CHIR (ε/N_set = 3.75) and B_CU ≈ 3.7 on synthetic EPISC data D_sim_EPISC (ε/N_set = 6; note that one would likely achieve higher Bayes factors at lower ε with more efficient sampling schemes). The synthetic data were generated from Model C with the reaction rates set to the posterior means obtained from the experimental data (see Figure 10). It is worth noting that for simulated data, the acceptance probability for a given ε value is much higher than for experimental data. This suggests that Models C and U do not perfectly describe the experimental data. More realistic models of cell state changes are therefore discussed in the following section.

4.1.4. Building models for single cell biology
Several approaches that aim to go beyond pseudotime ordering and connect single-cell gene expression data to dynamic models have recently emerged. Examples include fitting drift-diffusion equations on a k-nearest-neighbour graph [5] or along one-dimensional pseudotime trajectories [27], simulating cell trajectories in dimensionally reduced gene expression space based on RNA velocity estimates [28], and inferring continuous-state Markov processes from single-cell time series [29]. Here, we have presented an alternative approach which allows for rigorous quantification of uncertainty. Working in a Bayesian framework further allows us to integrate multiple sources of data relatively easily, e.g. to combine bulk gene expression and single-cell data, which we discuss further below.

Future work
The models proposed in this work assumed that the rates for switching genes on and off are independent of each other, thereby neglecting regulatory interactions between genes. A natural next step is to consider models with interdependencies between transcription factors, i.e., where the rate of transitioning into the on/off state of one transcription factor can depend on the expression state of another transcription factor. The number of possible pairwise dependencies alone leads to a model comparison problem that is computationally too intensive for typical ABC inference approaches. However, systematic Bayesian model comparison can be performed on bulk population data, where likelihood evaluation is tractable, before parameterising the models on clonal data (Schumacher et al., in preparation).
Here we have applied Bayesian inference and model comparison to in vitro data. It would be interesting to compare estimates of transition rates with the in vivo case where, in certain contexts, lineage tracing data can be used to identify clones. A potential application of our model comparison approach in this context would be to compare the number of stages in stem/progenitor cell lineages that best explains observed clone size distributions [30]. Contemporary experiments increasingly measure the expression of many more genes than considered here, for which one would need to consider models with a larger number of genes. As we showed in the last section, however, exact Bayesian inference is already computationally challenging for models considering only three genes. Since the number of cell states grows exponentially with the number of genes considered, methods such as ours would run into computational scaling issues. Thus, dimensionality reduction or careful consideration of the state space will be crucial for extending the inference of cell state transition rates to single-cell transcriptomic data.
The presented work focused on total population sizes of cell states in which individual cells are independent of each other, which implies a spatially homogeneous or well-mixed system. In many biological systems, however, cell fate depends on the local environment and cell signalling. Here there is potential to draw on, and add to, recent work inferring motility as well as proliferation in cell populations and tissue patterning [31,32,33]. Extending these population models to spatially heterogeneous models that include interactions between neighbouring cells could thus be promising for inferring rates of differentiation in embryonic development.

Appendix A. Mapping parameters from Model U to Model C
In the following we show that for any given parameter set (λ_A, λ_B, k_AB, k_BA) in Model U, there exists a set of parameters (λ_AA, λ_AB, λ_BB) in Model C for which the asymptotic behaviour (t → ∞) of the means N_A, N_B is identical in both models. Following equation (25), the asymptotic behaviour (βt ≫ 1) of the mean populations in Model U is given by equations (A.2), where α, β, γ are constants depending on the Model U parameters and given by equation (25). By comparing equations (A.2) to the mean populations in Model C, it can be shown that the mean populations in Model C can be mapped to those of Model U by a suitable choice of (λ_AA, λ_AB, λ_BB), where m is a free parameter which can be chosen within the interval 0 < m < f(λ), and f(λ) is a function depending on the Model U parameters.

The reaction hazards are approximated by replacing the population N_ν with its mean value ⟨N_ν(t)⟩. The resulting process is a Poisson process with time-dependent reaction hazard h*_ν = θ_ν · ⟨N_ν(t)⟩, also called an inhomogeneous Poisson process. In this case, the total number of reactions r_ν is Poisson distributed [35], with mean ⟨r_ν⟩ = Λ. The mean COV for an ensemble of n trajectories therefore follows from equation (C.4), where the mean COV of θ_ν depends explicitly on the rate constant itself (additionally, N_ν may also depend on θ_ν). This result can again be useful for experimental design: a rough estimate of the magnitude of θ_k might be obtained either from prior knowledge or from the posteriors of quick trial runs. Equation (C.4) then allows us to estimate how long a full experiment should run in order to obtain parameter posteriors of the desired precision.

Appendix C.3. Parameter estimation in Models C and U
As derived in equation (16), the mean number of cells in state A in Model C,
⟨N_A(t)⟩ = N_A0 e^{∆λ t}, (C.5)
either grows or decays over time, depending on the sign of ∆λ. Combining equations (C.5) and (C.4) yields the asymptotic behaviour for T∆λ ≫ 1. The posterior precision therefore does not increase with observation time T if ∆λ < 0. This makes sense, since A-type cells will always die out for λ_BB > λ_AA (see equation (18)). The scaling of c_v with observation time T is shown in Figure C1a for the rate constant θ_1 of Model C. The proposed exponential scaling e^{−∆λT/2} coincides well with the ensemble mean of c_v for large observation times T. For small times T, both the mean-field approximation (C.2) and the condition T∆λ ≫ 1 break down. We derived the asymptotic behaviour of the means ⟨N_A⟩ for Model U in equation (C.7), where α and β are functions of the rate constants θ (see Eq. (28)). Since α + β ≥ 0, the mean populations always grow exponentially. Combining equations (C.7) and (C.4) then yields the asymptotic behaviour for (α + β)T ≫ 1.

The Poisson approximation step is taken into account by selecting an appropriate Metropolis-Hastings acceptance probability, ensuring that the MCMC output is actually drawn from the correct posterior (see the Metropolis-Hastings-Green algorithm in [15]).
To construct the acceptance probability, we require the complete-data likelihood L_A(ξ|θ) of trajectories ξ in the inhomogeneous Poisson model. For arbitrary reaction hazards h*_k(t), the complete-data likelihood of n reaction events in the range [0, ∆t] is given by the standard expression, where h*_0 = Σ_ν h*_ν is the combined reaction hazard and ν_i is the type of the i-th reaction. Plugging in the exponential reaction hazard (D.3) yields an expression in terms of ∆N_k = N_k(∆t) − N_k(0) and ∆log(N_k) = log(N_k(∆t)) − log(N_k(0)). We are now ready to construct the Metropolis-Hastings step for updating trajectories ξ with fixed endpoints on the interval [0, ∆t]. First we propose a new set of valid reaction numbers r′ from a proposal distribution f(r′|r). We chose to update r′ = r + ζ by adding to each element a random integer ζ drawn from a symmetric Skellam distribution [37] with variance 2ω. The tuning parameter ω was chosen such that the proposals r′ are accepted on average with a probability of ≈ 30%. Then, conditional on r′, we sample a trajectory (reaction times τ_j) from the approximate Poisson process with the exponential reaction hazards (D.3). The trajectory ξ′ := {r′, τ_i} is then accepted with probability min(1, A), where A involves L_CD, the complete-data likelihood of the true model (see (34)), and q(r) = Π_ν q_ν(r_ν), the probability that r reactions were observed in the inhomogeneous Poisson model. The terms q_ν are the PMF of the Poisson distribution. Note that the acceptance probability (D.6) depends on the current parameter vector θ, which is drawn at every iteration of the MCMC. The overall structure of this algorithm was originally motivated by [12] and modified by us to fit Models C and U. It is an application of the general Metropolis-Hastings-Green algorithm as described in [15].
Appendix E. Bayes factor calculation using population snapshots

Appendix E.1. Harmonic mean identity
Dividing both sides of Bayes' theorem by the likelihood and integrating over θ gives us the identity
1/p(ξ) = ∫ p(θ|ξ)/p(ξ|θ) dθ = E_θ|ξ [1/p(ξ|θ)].
This is useful because the average is taken over the posterior distribution of the parameters θ, conditioned on a particular set of observations ξ. This is exactly what the MCMC algorithm generates, and hence we can estimate the marginal likelihood as
p(ξ) ≈ [ (1/M) Σ_{m=1}^M 1/L_SS({N(t_i)}|θ_m) ]^{−1}, (E.3)
where θ_m is the m-th sample of the parameter vector obtained from the MCMC algorithm. Unfortunately, the snapshot likelihood L_SS({N(t_i)}|θ_m) needs to be evaluated by a Monte Carlo (MC) estimate for each parameter vector θ_m separately:
L_SS({N(t_i)}|θ_m) ≈ (1/K) Σ_{k=1}^K Π_i δ(N_k^sim(t_i), N(t_i)),
where N_k^sim(t_i) is a simulated trajectory from the respective model with parameters θ_m. The delta function δ(·,·) equals one if the simulated and observed populations coincide at the snapshot times. This results in a computationally intensive MC-within-MCMC algorithm, which is nevertheless less computationally intensive than the other methods presented (thermodynamic integration, product space search).
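A minimal numerical illustration of the harmonic mean step, on a toy conjugate model where the marginal likelihood is known exactly (posterior samples are drawn directly here, standing in for MCMC output):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy conjugate model with a known marginal likelihood:
# prior theta ~ N(0, 1), likelihood x | theta ~ N(theta, 1), single datum x.
# Then p(x) = N(x; 0, 2) and theta | x ~ N(x/2, 1/2).
x = 1.0
p_true = np.exp(-x ** 2 / 4.0) / np.sqrt(4.0 * np.pi)

# draw directly from the posterior (standing in for MCMC samples)
theta = rng.normal(x / 2.0, np.sqrt(0.5), size=500_000)
lik = np.exp(-(x - theta) ** 2 / 2.0) / np.sqrt(2.0 * np.pi)

# harmonic mean identity: 1/p(x) = E_{theta|x}[1 / p(x|theta)]
p_hat = 1.0 / np.mean(1.0 / lik)
```

The estimator is consistent but can be heavy-tailed, which is why its convergence is monitored against thermodynamic integration in Figure E1.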

Appendix E.2. Thermodynamic integration
Thermodynamic integration is another method for estimating the marginal likelihoods of models. It is based on the fact that the logarithm of the marginal likelihood can be represented by the integral [38]
log p(ξ) = ∫_0^1 E_θ|ξ,t [log p(ξ|θ)] dt, (E.5)
with ξ being general data, p(θ) the parameter prior, p(ξ|θ) the likelihood and p(ξ) the marginal likelihood. The distribution used to evaluate the mean E_θ|ξ,t[·] is called the power posterior p_t(θ|ξ) with temperature t ∈ [0, 1], defined as [39]
p_t(θ|ξ) ∝ p(ξ|θ)^t p(θ).
The power posterior can be interpreted as an intermediate distribution between prior and posterior, since p_{t=0}(θ|ξ) = p(θ) and p_{t=1}(θ|ξ) = p(θ|ξ). For derivations and evaluation techniques of the integral (E.5), see [38]. Bayes factor calculation via thermodynamic integration is usually very accurate. On the other hand, it is computationally even more intensive than the harmonic mean identity and also more complicated to implement. We therefore avoid thermodynamic integration when other methods provide equally good estimates. As an example, Figure E1 compares the Bayes factors obtained via the harmonic mean estimator (E.3) and via thermodynamic integration given two population snapshots. The harmonic mean estimator converges quite quickly and provides Bayes factor estimates as good as those of thermodynamic integration.
Figure E1: Calculation of the Bayes factor B_CU using the harmonic mean identity (solid blue line) and thermodynamic integration (dashed black line), given two snapshots N(0) = (3, 3), N(1) = (8, 8) and flat priors θ_ν ∼iid U[0, 3] for Models C and U. The harmonic mean estimate (E.3) converges after only ≈ 300 MCMC iterations, although the convergence speed depends strongly on the snapshot data and on the sample size of the MC step for estimating the snapshot likelihood L_SS({N(t_i)}|θ). We use thermodynamic integration only to estimate convergence speed, since this calculation is much more intensive than the harmonic mean identity.

Appendix E.3. Product space search
Another popular method to obtain Bayes factors is the product space search [40]. Whereas the harmonic mean identity and thermodynamic integration are used to estimate the marginal likelihoods of Models C and U, the product space search estimates B_CU directly. This method involves an MCMC algorithm that not only explores the parameter space of a single model, but also jumps between competing models. In general, jumps within a whole set of competing models {M_j} are allowed, but we allow the algorithm to switch only between Models C and U. At the start of each iteration, we either stay in the current model or jump to the other model with a certain probability. In the end, the Bayes factor can be estimated by the ratio of the number of iterations the chain spent in each model, B_CU ≈ N(M = C)/N(M = U) (see Figure E2(a)). We implement the product space search following [41, "Metropolised Carlin and Chib"]. The performance of this method is compared to the harmonic mean estimator in Figure E2(b). Both methods have similar convergence speeds and variances. This is surprising, since the harmonic mean estimator has been criticised as the "Worst Monte Carlo Method Ever" [42]: although it is a consistent estimator, it may have infinite variance in some cases. We did not observe this in our system, where its variance is similar to that of more sophisticated methods such as the product space search.

Figure F3: Complete cell state distribution of initially unsorted stem cells after t = 3d for different environmental conditions (CHIR/EPISC). The initial cell state distribution at t = 0d is assumed to be uniform. The probability of the respective marker expression is indicated by the colour map.

Figure F4: Median (horizontal line), 0.25/0.75 quantiles (box) and 0.05/0.95 quantiles (vertical lines) of the rate imbalance of Model U log-rates log_10(θ_k) obtained from D_CHIR. CHIR D2 and CHIR D3 show inferred parameters using exclusively day 2 or day 3 data, respectively. CHIR D2+D3 was obtained by including all data points of D_CHIR.

Figure F5: Bayes factors B_CU of dataset D_CHIR using either exclusively day 2 or day 3 data over a range of tolerance values ε. Identical log-uniform priors were used for all rate constants in both models (log_10(θ_k) ~ U[−2, 0]). The x-axis was re-scaled by the number of measurement series N_set (N_set = 4 for both D2 and D3). Red dots show the minimal ε value to which the SMC-ABC algorithm converged after 20 iterations. The data up to day 2 favours Model C dynamics (B_CU > 1), while the population data at day 3 provides evidence in favour of Model U dynamics (B_CU < 1).

Figure G1: Mean and standard deviation of Model C log-rates log_10(θ_k) calculated from posterior distributions given D_CHIR using rejection-ABC. Replacing the L_1-norm in eqn (45) by the L_2-norm does not significantly change the shape of the parameter posteriors, provided ε is small enough (ε/N_set = 5 used for L_1 and ε/N_set = 3 used for L_2).
(a) Acceptance probability as a function of the tolerance for Model C and dataset D_CHIR.
(b) Bayes factors B_CU of datasets D_CHIR and D_EPISC for a range of tolerance values ε.

Figure G2: Acceptance probability and Bayes factors for a range of tolerance values ε, with identical log-uniform priors for all rate constants in both models (log_10(θ_k) ~ U[−3, 1]). The x-axis was re-scaled by the number of measurement series N_set in the respective dataset (N_set = 8 in D_CHIR, N_set = 4 in D_EPISC). Error bars indicate the spread between five independent rejection-ABC runs.
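To illustrate the counting idea behind the product space search of Appendix E.3, the sketch below runs a Metropolis chain on a model indicator that jumps between two fixed-parameter Gaussian models for a single datum, and estimates the Bayes factor from the visit counts. Because neither toy model has free parameters, the pseudo-priors required by the full "Metropolised Carlin and Chib" scheme [41] drop out; the data value, model means and iteration count are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two competing fixed-parameter models for the same datum, equal model priors:
# M1: y ~ N(0, 1)   vs   M2: y ~ N(1, 1).
y = 0.8

def loglik(model):
    """Log-likelihood of the datum under model 1 or 2."""
    mu = 0.0 if model == 1 else 1.0
    return -0.5 * np.log(2 * np.pi) - 0.5 * (y - mu) ** 2

# Metropolis chain on the model indicator: at each iteration propose the
# other model and accept with probability min(1, L_other / L_current).
model, visits = 1, {1: 0, 2: 0}
for _ in range(200_000):
    proposal = 2 if model == 1 else 1
    if np.log(rng.random()) < loglik(proposal) - loglik(model):
        model = proposal
    visits[model] += 1

# Bayes factor from visit counts, analogous to B_CU ≈ N(M = C)/N(M = U).
B_12_mcmc = visits[1] / visits[2]
B_12_exact = float(np.exp(loglik(1) - loglik(2)))   # exact, since no free parameters
print(f"visit-count estimate {B_12_mcmc:.3f}  exact {B_12_exact:.3f}")
```

In the setting of Models C and U each model carries rate parameters, so a jump proposal must also supply parameter values for the model being entered (drawn from a pseudo-prior), and within-model parameter updates alternate with the model jumps; the visit-count estimator of the Bayes factor is unchanged.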