Computers in Biology and Medicine 167 (2023) 107573

Contents lists available at ScienceDirect
Computers in Biology and Medicine
journal homepage: www.elsevier.com/locate/compbiomed

Localization and phenotyping of tuberculosis bacteria using a combination of deep learning and SVMs

Marios Zachariou a,∗, Ognjen Arandjelović a, Evelin Dombay b, Wilber Sabiiti b, Bariki Mtafya c, Nyanda Elias Ntinginya c, Derek J. Sloan b
a School of Computer Science, University of St Andrews, St Andrews, KY16 9SX, United Kingdom
b School of Medicine, University of St Andrews, St Andrews, KY16 9AJ, United Kingdom
c Mbeya Medical Research Center, Mbeya, Tanzania

Keywords: Microscopy; Machine learning; Fluorescence; Feature descriptors; MSVR; Regression; Deep learning; Treatment monitoring; Mycobacterium tuberculosis

Abstract: Successful treatment of pulmonary tuberculosis (TB) depends on early diagnosis and careful monitoring of treatment response. Identification of acid-fast bacilli by fluorescence microscopy of sputum smears is a common tool for both tasks. Microscopy-based analysis of the intracellular lipid content and dimensions of individual Mycobacterium tuberculosis (Mtb) cells also describes phenotypic changes which may improve our biological understanding of antibiotic therapy for TB. However, fluorescence microscopy is a challenging, time-consuming and subjective procedure. In this work, we automate the examination of fields of view (FOVs) from microscopy images to determine the lipid content and dimensions (length and width) of Mtb cells. We introduce an adapted variation of the UNet model to efficiently localise bacteria within FOVs stained by two fluorescence dyes: auramine O to identify Mtb and LipidTox Red to identify intracellular lipids. Thereafter, we propose a feature extractor in conjunction with feature descriptors to feed a representation into a support vector multi-regressor and estimate the length and width of each bacterium.
Using a real-world data corpus from Tanzania, the proposed method i) outperformed previous methods for bacterial detection with an 8% improvement (Dice coefficient) and ii) estimated the cell length and width with a root mean square error of less than 0.01%. Our network can be used to examine phenotypic characteristics of Mtb cells visualised by fluorescence microscopy, improving consistency and time efficiency of this procedure compared to manual methods.

∗ Corresponding author. E-mail address: marios.zachariou@hotmail.com (M. Zachariou). https://doi.org/10.1016/j.compbiomed.2023.107573
Received 7 June 2023; Received in revised form 9 September 2023; Accepted 11 October 2023; Available online 13 October 2023
0010-4825/© 2023 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

Tuberculosis (TB), the leading infectious cause of death worldwide, is mainly caused by Mycobacterium tuberculosis (Mtb), a bacterial species which is transmitted by coughing droplets and aerosols. 85% of TB disease is pulmonary, affecting the lungs. The World Health Organisation (WHO) report up to 10 million cases of active disease per year, with almost 2 million deaths [1]. The greatest burden of morbidity and mortality from TB occurs in low- and middle-income countries with fewer healthcare resources [2]. Since the 1940s, TB has been curable by antibiotic treatment, but the long duration of therapy (commonly at least 6 months) is challenging, both for patients and public health programmes.

In 2015, WHO developed the ‘‘End TB Strategy’’, aiming to eliminate TB as a public health problem by 2050 [3]. However, this will require major advances in biological tools to improve our understanding of the effect of antibiotic treatment on Mtb. This paper will use sputum smear microscopy images to show how automated approaches for measurement of individual Mtb lipid content and cell dimensions could contribute to this effort.

Traditionally, sputum smear microscopy has been important for diagnosis and treatment monitoring of pulmonary TB. The technique involves heat-fixing a small (10–20 μm) aliquot of sputum from symptomatic patients onto microscopy slides and staining them using procedures that selectively detect acid-fast bacilli (AFB) such as Mtb cells. The contemporary approach to this uses Auramine-O based fluorescence staining to label AFB yellow-green on a black background (usually at ×400 magnification) [4]. Sputum smear grading scales describing the number of AFB seen in each sample are able to triage disease severity at the start of treatment and describe changes in bacterial load over time [5]. In recent years, many centres worldwide have shifted their focus from smear microscopy to molecular technologies (such as the Xpert MTB/RIF test) for TB diagnosis [6]. However, current molecular tools are not recommended for on-treatment monitoring as results stay ‘positive’ for months even when treatment is progressing well [7]. Therefore, smear microscopy remains important for this purpose [8]. Smear microscopy can also study changes within individual Mtb cells during therapy. This is important because findings from recent microbiological research suggest that the physical morphology of each organism offers phenotypic information on its physiological behaviour in relation to antibiotic susceptibility. For example, some Mtb cells accumulate nonpolar lipids intracellularly, allowing them to be classed as lipid-rich (LR) rather than lipid-poor (LP) [9–13].
In vitro microbiological data suggest that LR bacteria are antibiotic tolerant (less easy to kill by the first-line drugs used to treat TB) [12,13] and may play a role in poor patient outcomes (treatment failure or post-treatment relapse) [14]. Modification of the Auramine-O fluorescence staining method, by incorporating a LipidTox Red (LTR) dye to show intracellular lipids within AFB, allows discrimination between LR and LP cells [11,14]. Additionally, in vitro microscopy has previously demonstrated that Mtb cells grow asymmetrically, creating variation in cell length over time [15,16]. Cells of different sizes with different growth poles have variable susceptibility to individual antibiotics [16,17]. Preliminary clinical data suggest that the median length of persistent Mtb cells may be associated with worse disease severity and increases after antibiotic exposure [18,19]. To understand whether changes in intracellular lipid content or dimensions of Mtb cells really are useful characteristics for the study of TB treatment response, larger scale laboratory and clinical studies are required. However, smear microscopy is time-intensive and subjective, which makes this work difficult to perform at scale [20]. Each slide must be examined in discrete fields of view (FOV) that are inspected sequentially. This process is tiring, which can introduce errors [20]. Some slides are challenging to evaluate because AFB might have odd appearances or because non-bacterial components (artefacts) inside the sputum matrix mimic Mtb cells. A possible strategy for tackling these issues is to apply contemporary artificial intelligence techniques [21]. Recent studies have demonstrated significant accomplishments in the realm of automated diagnosis, treatment monitoring, and the potential prevention of other medical conditions (e.g., cardiovascular and gynaecological pathology) [22,23].
In this paper we aim to advance computer-based approaches to TB microscopy by developing methods to:

• Locate Mtb cells within given FOVs on Auramine-O and LTR stained fluorescence microscopy images, with performance evaluation by two established metrics (Jaccard index and Dice coefficient).
• Co-localise the same Mtb cells on paired Auramine-O and LTR stained images of an FOV, in order to assess the proportion of LR bacteria, achieving a maximum error of less than 1.5 pixels difference between ground truth and predicted FOVs.
• Estimate the length and width of Mtb cells in FOV patches from sputum smears collected at 0, 2 and 6 months of therapy with error of 2% or less across 3 regression metrics.

2. Related work

Most research on automating TB microscopy has focussed on enhancing diagnosis. We were unable to find any other work that was directly comparable to our approach of developing morphological phenotypes of Mtb cells which may be relevant to treatment response. However, some previous literature describes use of deep learning tools for morphological phenotypic evaluation of other cell types in order to understand the pathophysiology of infectious diseases and bacterial response to antibiotics.

In the realm of mycobacteria, Bao et al. used light microscopy and convolutional neural networks (CNN) to classify morphological alterations of macrophages infected with Mycobacterium marinum, a surrogate model for Mtb, to show the role of the essential virulence factor EsxA [24]. Whilst this work focussed on identification of phenotypic changes in host cells rather than bacteria and did not fall under the purview of treatment monitoring, it still demonstrates the capacity of automated image analysis to detect changes in the appearance of individual cells which enhance our understanding of bacterial pathophysiology. In the domain of antibiotic response, Yu et al.
assessed susceptibility of Escherichia coli bacteria in urine to five relevant antibiotics using deep learning video microscopy [25]. Whilst conventional procedures for antimicrobial susceptibility testing can take several days and delay clinical decision making, the authors described a technique that used a 7 layer CNN to evaluate footage of freely moving bacterial cells in real time. Inhibition (or not) by antibiotics was reported by learning several phenotypic characteristics of the cell without requiring the definition and quantification of each characteristic. Antibiotic susceptibility was reported with mean accuracy of 91.8% within 30 min. Similarly, Zahir et al. used high throughput screening and deep learning to describe phenotypic ‘bulging’ in E. coli which is associated with resistance and tolerance to 𝛽-lactam antibiotics [26].

To the best of our knowledge there are three published algorithms for the detection of bacteria from fluorescence microscopy slides. Mithra and Emmanuel proposed a methodology consisting of three sequential stages: segmentation, feature extraction, and classification [27]. The initial step in detecting bacteria is to perform a colour space transformation on the input microscope image to better separate bacteria from the background. Thresholding is then applied to the transformed image to segment potential bacteria regions based on colour intensity for further analysis. Data on length, density, area, and histogram characteristics are gathered for the purpose of classifying contours using a fuzzy Hyco-Entropy Decision Tree classifier as: low-bacilli, non-bacilli, and overlapping bacilli. Diaz-Huerta et al. proposed a method that focuses solely on the segmentation stage and implements a Bayesian classifier, based on a Gaussian mixture model, to differentiate bacteria from background [28]. The latest technique in tuberculosis bacterial detection utilises Cycle-GANs in an image-to-image translation approach.
The objective of this method is to learn how to transfer bounding boxes around possible regions of interest from labelled fields of view (FOVs) to unlabelled ones [29].

3. Proposed method

The method proposed in this paper consists of three stages: (i) bacterial detection from microscopy FOVs, (ii) paired detection of bacterial locations from two images of each FOV (one captured to show auramine-O staining of Mtb cells, and one captured to show LTR staining of intracellular lipid; collectively these allow inference of the proportion of LR bacteria in the FOV), and finally (iii) estimation of individual bacterial dimensions (length and width) from cropped patches of FOVs containing one or more Mtb cells. Segmentation techniques are used for stages (i) and (ii), and regression is used for stage (iii). These methods are designed and evaluated separately, with distinct objectives and evaluation criteria. Although we describe them as operating independently, they could be used sequentially with a future goal of pipeline integration.

3.1. Convolutional neural networks: A brief introduction

CNNs are a class of artificial neural networks that have proven highly effective for visual analysis tasks. CNNs take advantage of the inherent grid structure of image data by employing convolutional layers as building blocks. These convolutional layers consist of learnable filters that are convolved across the input image to extract spatially-correlated features. During training, the CNN learns values for the filter weights that activate on specific visual patterns, such as edges, textures, or higher-level concepts. The convolutional filters are slid across the image, computing dot products between the filter and local regions of the input at each location.
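This sliding dot product can be made concrete with a minimal NumPy sketch (illustrative only, not the paper's implementation; a real CNN layer adds learned multi-channel filters, bias terms and a nonlinearity):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` and take a dot product at each
    location ('valid' cross-correlation, stride 1, no padding)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # dot product with the filter
    return out

# A hand-crafted vertical-edge filter responds strongly where intensity
# changes from left to right; learned CNN filters behave analogously.
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)
edge_filter = np.array([[-1.0, 1.0],
                        [-1.0, 1.0]])
response = conv2d_valid(image, edge_filter)  # peaks along the 0->1 boundary
```

Stacking many such filters, each followed by an activation, yields the feature maps described above.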
Multiple convolutional layers are stacked to extract hierarchical feature representations, from low-level features like edges in early layers to high-level semantic features like objects in later layers. CNNs also contain pooling layers to gradually reduce spatial dimensions and provide translational invariance. This hierarchical architecture loosely mimics the organisation of neurons in the visual cortex. After several convolutional and pooling layers, CNNs generally use fully-connected layers to condense feature representations and perform classification. Moreover, CNNs are also suitable for regression and segmentation tasks. The entire network, including filter values (with the exception of pooling layers), is trained using backpropagation to minimise a loss function through gradient descent. This provides an efficient way to tune the filters to extract optimal features tailored to the training data.

CNNs have achieved state-of-the-art results on many computer vision tasks, including image classification, object detection and semantic segmentation. Their automatic feature extraction, ease of training, and layered feature learning have made CNNs the dominant approach for nearly all vision problems. Careful CNN architecture design, regularisation, and hyperparameter tuning are crucial to ensure robust generalisation and avoid overfitting. Overall, CNNs provide a flexible yet efficient deep learning framework well-suited for diverse computer vision applications.

3.1.1. UNet: segmentation-based CNN

As explained previously, CNNs leverage convolutional layers to extract hierarchical features from images. A UNet model builds on these CNN principles to create an encoder–decoder segmentation architecture [30]. The encoder portion of UNet uses repeated blocks of convolution, activation, and max pooling layers, similarly to a typical CNN.
This encodes the input image into high-level feature representations while downsampling spatially. The decoder pathway then upsamples these features back to the original input resolution using transpose convolutions. A key difference from a CNN is the introduction of skip connections that concatenate encoder features with the upsampled decoder features. These skips provide the decoder with both contextual information from the encoder (information recall) as well as fine-grained localisation from the upsampled features. Finally, the decoded features are fed into a convolution layer to generate a pixel-wise probability map for semantic segmentation. So UNet leverages a CNN encoder to analyse contextual features, but adds a decoding path with skip connections to localise and precisely segment input images in an end-to-end manner. The model is trained via backpropagation just like ordinary CNNs. This architecture remains popular for segmentation tasks, especially in medical imaging where precision is critical. In summary, UNet extends CNNs into an efficient encoder–decoder structure specialised for precise pixel-level segmentation while retaining automated feature extraction capabilities. Fig. 1 provides a visual example of the UNet architecture.

3.2. Bacteria detection

As described in Section 1, lipid content within Mtb cells is calculated as the proportion of total bacteria detected in an FOV stained with Auramine-O (green channel) which are also detected in the same FOV stained with LTR (red channel) at the same location. As each FOV is represented as two RGB images, we set the red and blue channels to 0 to make bacteria visible in the green channel only. Similarly, setting the green and blue channels to 0 makes bacteria visible in the red channel only. If a bacterium is localised in the green channel and co-localised at the same spot in the red channel, it is LR.
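The channel-suppression and co-localisation logic can be sketched with NumPy masks (a minimal illustration with hypothetical detection masks and helper names; in the paper the masks come from the segmentation network, not from thresholding):

```python
import numpy as np

def suppress_to_green(rgb):
    """Zero the red and blue channels so only the green signal remains."""
    out = np.zeros_like(rgb)
    out[..., 1] = rgb[..., 1]
    return out

def lipid_content(green_mask, red_mask):
    """Proportion of bacteria detected in the green channel that are also
    detected (co-localised) at the same pixels in the red channel.

    `green_mask` is an integer-labelled image: 0 = background, k = bacterium k;
    `red_mask` is a boolean detection mask for the LTR-stained image."""
    labels = [k for k in np.unique(green_mask) if k != 0]
    lr = sum(1 for k in labels if np.any(red_mask[green_mask == k]))
    return lr / len(labels) if labels else 0.0

# Hypothetical FOV: five green-channel bacteria, three overlapping the red mask.
green = np.zeros((10, 10), dtype=int)
red = np.zeros((10, 10), dtype=bool)
for k in range(1, 6):
    green[k, 2:5] = k                            # five single-row "bacteria"
red[1, 2:5] = red[2, 2:5] = red[3, 2:5] = True   # first three are lipid-rich
ratio = lipid_content(green, red)                # 3 of 5 -> 0.6
```

Note the direction of the comparison: only green-channel detections are tested against the red mask, matching the unidirectional workflow of microscopists.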
If the bacterium is localised in the green channel without co-localisation in the red channel, it is LP. For example, when examining paired images of an FOV, if 5 bacterial locations are found in the green channel and 3 of these are co-localised in the red channel, 5 Mtb cells have been identified in total, 3 (60% lipid content) of which are LR. Any objects resembling bacteria that appear in the red channel only are insignificant, as microbiologists typically address this task in a unidirectional manner rather than bidirectionally, i.e. they first detect bacteria in the green channel and then check the red one.

The key component in this analysis is not the actual colour intensity, but rather the intensity of the object in contrast to the image background when the other two channels of an RGB image are set to a value of 0, i.e. suppressed. We convert each FOV to greyscale in order to avoid separate training for each coloured FOV and also to reduce complexity, as we go from three dimensions to one. In addition, to make FOVs less susceptible to noise, we use the image enhancement technique described by Zachariou et al. to make the approach more robust and effective [29,31]. Fig. 2 presents an example of our preprocessing procedure, as well as the segmentation ground truths.

The first objective is to binarise the FOV, which means that the resulting image has a black background and the objects of interest (bacteria) are white areas. We adapt and apply UNet [30] since it is an effective choice for learning to collect important information about objects of interest and generate a binarised image. We replace the first layer of the UNet with one that has one input channel rather than three, and 32 output channels rather than the 64 required by the original implementation.
Therefore, the input and output channels of subsequent layers are adjusted to align with the original UNet implementation, whereby the number of channels in each layer is doubled compared to the previous layer. Consequently, the proposed network architecture exhibits a reduction in the number of channels at the bottleneck level from 1024 to 512. Kernel sizes and padding for the convolutional layers are not changed. In addition, the max pooling layers in the model have a stride of 1, as opposed to the original UNet that used a stride of 2, while the kernel size remains the same. These modifications of the layers are driven by the fact that bacteria do not have complicated shapes. As the form of a bacterium is relatively concise, the first layer requires less deductive reasoning; therefore, higher channel layers may cause the model to overfit on the training data and acquire extraneous features. As this is supervised learning, an experienced mycobacterial microscopist manually examines and highlights bacterial outlines in each FOV, which are then converted into a binary image and used as ground truth for both the UNet and the proposed network training.

3.2.1. Training of segmentation networks

The training is carried out in an end-to-end fashion; there is no use of transfer learning. Because there is no transfer learning, we train the network for over 1000 epochs. We employ AdaBelief, a novel optimiser that has been demonstrated to converge as rapidly as adaptive optimisers (such as Adam [32]) and to generalise better than Stochastic Gradient Descent (SGD) [33] in intricate models such as GANs [34]. A circular scheduler with a step size equal to five times the size of the dataset (which in turn is dependent on the batch size) is used in conjunction with a learning rate of 0.0001, which was the default setting [35]. The base learning rate and the upper learning rate are set to their respective default values of 0.00001 and 0.0004.
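For illustration, a cyclical schedule between these base and upper rates can be written down compactly. The sketch below uses the common triangular formulation with the rates quoted above; the concrete `step_size` value is hypothetical (in the text it equals five times the dataset size):

```python
def cyclical_lr(step, base_lr=1e-5, max_lr=4e-4, step_size=5000):
    """Triangular cyclical learning rate: rises linearly from base_lr to
    max_lr over `step_size` steps, falls back over the next `step_size`
    steps, and then repeats."""
    cycle = step // (2 * step_size)
    x = abs(step / step_size - 2 * cycle - 1)  # 1 at cycle edges, 0 at peak
    return base_lr + (max_lr - base_lr) * (1.0 - x)

# The rate starts at base_lr, peaks mid-cycle and returns to base_lr.
rates = [cyclical_lr(s) for s in (0, 2500, 5000, 10000)]
```

In practice a framework scheduler (e.g. a cyclic scheduler stepped once per batch) implements the same waveform.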
We use Dice loss [36] (also known as F1-score) as the loss function to train the model. To increase the robustness and generalisability of the learning process, we augmented real data with synthetic data. To achieve this we synthesise images randomly rotated by ±25° and mirrored around the vertical or horizontal axis; this increases the quantity of training data by roughly 50% [37]. Note that this type of enhancement is particularly well suited to the task at hand because, unlike natural images, in which there is an inherent asymmetry in directions (e.g., the horizontal and vertical directions are objectively defined and cannot be swapped), in the microscopy slides of interest all directions are interchangeable and therefore equivalent. Furthermore, input images are resized to 256 × 256 pixels using bicubic interpolation [38].

Fig. 1. Flow of information of the UNet architecture.

3.2.2. Minimising false positives with bacterial morphological features

The existing literature describes that detection of bacteria using gradient-based methods alone is not always successful [39,40]. Specificity can be compromised by false positive misinterpretation of artefacts as bacteria, on the basis of similar colour intensity. To reduce detection of these false positives, our method includes heuristic morphological characteristics (area, perimeter, number of edges, and Fourier descriptors). Using the Douglas–Peucker [41] technique, we compute the area, perimeter and approximate form of each contour. Essential parameters for a detected shape to be identified as a bacterium are: the area must be between 80 and 1200 pixels, the perimeter must be between 40 and 300 pixels, and the approximate form must have between 9 and 20 edges. In the last step of this process, we calculate the elliptic Fourier descriptors for each contour of the ground truth labels using the 20th harmonic.
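A sketch of this morphological filter is given below. The thresholds are taken from the text; the polygon helpers are simplified stand-ins for the OpenCV-style routines a real implementation would likely use, and the elliptic Fourier descriptor follows the standard Kuhl–Giardina formulation:

```python
import numpy as np

def shoelace_area(poly):
    """Area of a closed polygon given as an (N, 2) array of vertices."""
    x, y = poly[:, 0], poly[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def perimeter(poly):
    d = np.diff(np.vstack([poly, poly[:1]]), axis=0)  # close the contour
    return float(np.sum(np.hypot(d[:, 0], d[:, 1])))

def rdp(points, eps):
    """Ramer-Douglas-Peucker simplification of a polyline."""
    if len(points) < 3:
        return list(points)
    start = np.asarray(points[0], float)
    seg = np.asarray(points[-1], float) - start
    norm = np.hypot(seg[0], seg[1])
    dists = []
    for q in points[1:-1]:
        q = np.asarray(q, float)
        if norm == 0.0:
            dists.append(np.hypot(*(q - start)))
        else:  # perpendicular distance to the start-end chord
            dists.append(abs(seg[0] * (q[1] - start[1])
                             - seg[1] * (q[0] - start[0])) / norm)
    i = int(np.argmax(dists)) + 1
    if dists[i - 1] > eps:
        return rdp(points[:i + 1], eps)[:-1] + rdp(points[i:], eps)
    return [points[0], points[-1]]

def looks_like_bacterium(contour, eps=2.0):
    """Gate a contour on the area, perimeter and edge-count criteria."""
    edges = len(rdp(list(contour), eps)) - 1   # edges of the simplified form
    return bool(80 <= shoelace_area(contour) <= 1200
                and 40 <= perimeter(contour) <= 300
                and 9 <= edges <= 20)

def elliptic_fourier_descriptors(contour, order=20):
    """Kuhl-Giardina elliptic Fourier coefficients, an (order, 4) matrix
    of (a_n, b_n, c_n, d_n); assumes no duplicate consecutive points."""
    d = np.diff(np.vstack([contour, contour[:1]]), axis=0)
    dt = np.hypot(d[:, 0], d[:, 1])
    t = np.concatenate([[0.0], np.cumsum(dt)])
    phi = 2.0 * np.pi * t / t[-1]
    coeffs = np.zeros((order, 4))
    for n in range(1, order + 1):
        c = t[-1] / (2.0 * n ** 2 * np.pi ** 2)
        dcos = np.cos(n * phi[1:]) - np.cos(n * phi[:-1])
        dsin = np.sin(n * phi[1:]) - np.sin(n * phi[:-1])
        coeffs[n - 1] = c * np.array([np.sum(d[:, 0] / dt * dcos),
                                      np.sum(d[:, 0] / dt * dsin),
                                      np.sum(d[:, 1] / dt * dcos),
                                      np.sum(d[:, 1] / dt * dsin)])
    return coeffs
```

For a circular contour of radius r the first harmonic recovers (a1, d1) ≈ (r, r) with the remaining coefficients near zero, which is why a small number of harmonics suffices for smooth, bacterium-like outlines.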
Representation with the 20th harmonic yields approximate coefficients that capture well the morphology of a majority of randomly chosen bacteria specimens. Higher numbers of harmonics result in an overfitted outline of the current contour. Once each Fourier descriptor for every contour has been computed, the resulting matrix has the dimensions 𝑛 × 20 × 4, where n is the total number of contours. The last dimension, 4, reflects the returned coefficients of the Fourier series representation of the contour. The final 20 × 4 matrix is created by averaging the Fourier descriptors from all calculated contours. Furthermore, the Fourier descriptors of each predicted contour are calculated. These descriptors are then used to compute the Euclidean distance to the average descriptors derived from the ground truth labels. To be considered a valid bacterial shape, the Euclidean distance between the predicted contour's Fourier descriptors and the average descriptors must be between 14 and 18 pixels.

3.3. Estimating cell length and width

For the last step of this process, any bacterium/contour in the green channel images that fits the requirements given in Section 3.2.2 is utilised as the test set. Firstly, a medical microscopist manually crops patches containing one or more bacteria that overlap and annotates the cells with straight lines down their entire length. Multiple straight lines are needed for bacteria with curved or angular forms. Since the cell width across all cells is very similar (typically 5–6 pixels), width is averaged per patch; thus a patch with three cells is represented by a scalar for its width. Furthermore, since we observe that the maximum number of bacteria per patch is four (n.b. for our dataset), the size of the vector acting as the ground truth label during training is five.
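This label encoding can be assembled as in the following sketch (a hypothetical helper written for illustration; each bacterium contributes the sum of the lengths of its annotated straight-line segments):

```python
def make_label(width_avg, segment_lengths, max_bacteria=4):
    """Build the regression target: [mean width, len(b1), ..., len(b4)].

    `segment_lengths` holds one list of annotated line-segment lengths per
    bacterium; a curved cell is traced with several straight segments
    whose lengths are summed. Unused slots are padded with zeros."""
    if len(segment_lengths) > max_bacteria:
        raise ValueError("more bacteria than the label can encode")
    label = [float(width_avg)] + [float(sum(s)) for s in segment_lengths]
    label += [0.0] * (1 + max_bacteria - len(label))
    return label

# Two bacteria: one straight cell, one curved cell traced with two segments.
label = make_label(5.5, [[30.0], [12.0, 9.5]])  # [5.5, 30.0, 21.5, 0.0, 0.0]
```

Counting the nonzero tail entries of such a label also recovers the number of bacteria in the patch.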
If a patch contains two bacteria, for example, the first entry is the average width, the second entry is the sum of the segment lengths of the first bacterium, and the third entry is the corresponding sum for the second bacterium. The remaining entries are all 0. Evidently, an additional benefit of our approach is its ability to count the number of bacteria present in a patch, similarly to previous works [21,29].

We utilise these labels to train a second CNN model using regression, i.e. the final linear output layer does not contain a sigmoid activation. The trained model is stored and later used as a pre-trained model with its linear output layer removed, transforming it into a feature extraction encoder producing a 128-dimensional vector. Additionally, several feature descriptors are applied to extract a supplementary 128-dimensional vector of features from the input patches. These are: RootSIFT [42], Multiple Kernel local descriptors [43], HardNet [44], HardNet8 [45], HyNet [46], TFeat [47], SOSNet [46], Histogram of Oriented Gradients [48] and Local Binary Patterns [49,50]. The two vectors, one from the CNN and the other from the feature descriptor, are then concatenated, creating a 256-dimensional feature vector. This vector serves as input to a multi-output support vector regressor (MSVR) [51] aiming to predict the same 5-dimensional ground truth. Fig. 3 shows an overview of our method's information flow.

Fig. 2. Examples of paired images from the same FOV from our dataset. In the top row, (a) has all colour channels except green suppressed and (b) has all colour channels except red suppressed. In the second row, all images are converted to greyscale using the aforementioned image enhancement technique. In the bottom row, manual ground-truth labelling of cells in the two images is shown; the two separate fluorescence labels visible on the same FOV display different information, necessitating a different ground truth label. At this phase, we train our model to detect as many Mtb-shaped objects as possible in both green and red channel images, even though objects which are only detected in the red image are ultimately discarded.

Fig. 3. Following an encoding procedure from both the CNN and the feature descriptor, the MSVR outputs the final predictions. Since 1 × 1 convolutional filters may be used to modify the dimensionality of the filter space while maintaining linear activation of pixel values, the kernel size of all CNN layers is set to 1. This is due to the small dimensions of the input images; thus, we want to capture bacterial characteristics without losing spatial information.

3.3.1. Training setup

As with the training for segmentation, no transfer learning is performed in this instance, and the model is trained from scratch for 1000 epochs. For the optimiser we use Adam for its straightforward implementation, computational efficiency, and low memory requirements. The hyper-parameters 𝛽1 and 𝛽2 are set to 0.5 and 0.999 respectively, the learning rate to 0.001, and the cosine annealing learning rate scheduler is employed [52]. This scheduler decreases the learning rate every 20 iterations until it reaches 0.0001 before starting over. Finally, the loss function used is the Least Absolute Deviation (L1), since the dataset contains many outliers which would be emphasised by squared differences. Considering that we have 1000 patches available for training in this stage (80% for training and 20% for testing), no data augmentation is performed. Like the previous CNN, the input patches are made square before being scaled to 80 × 80 pixels using bicubic interpolation. Following grid search hyperparameter tuning, the following are used: a radial basis function (RBF) kernel, 𝐶 = 1, 𝜖 = 0.001, and 𝛾 = 0.01. Fig. 4 presents a graphical summary.

4. Experimental evaluation

4.1.
Dataset

The images used in this work are from the dataset of TB patients described in previous work [29,31]. Briefly, 46 patients with pulmonary TB were recruited at clinical facilities affiliated to NIMR-Mbeya Medical Research Centre (NIMR-MMRC). Microscopy smears were made from sputum samples collected pre-treatment, and after 2 and 5–6 months of TB therapy. These were stained according to standard Auramine-O/LTR protocols and viewed at ×1000 using an oil immersion lens of a Leica DM5500 microscope with a DFC 300G camera attachment. Paired FOVs containing Mtb were photographed at manual microscopy, using an N3 filter cube (excitation and emission spectra of 546/12 and 600/40 nm) to assess Auramine-O staining and a TX2 filter cube (excitation and emission spectra of 560/40 and 645/75 nm) to assess LTR staining.

Altogether, 1000 FOVs were selected at random from the Tanzanian corpus [29,31]. To confirm that the automated image analysis approach under development is unaffected by changes in the morphology of Mtb cells during or after TB treatment, images were selected from all sample collection time periods. To create ground-truth data for the segmentation analysis, a microscopist who was independent of the original Tanzanian project re-examined these images, labelling objects of interest in both green and red channel images. 80% of the FOVs were utilised for training, while the remaining 20% were used for testing and assessment.

4.2. Semantic segmentation of bacteria detection

As outlined in Section 3.2, bacterial detection and estimation of lipid content must be done in combination. Therefore, evaluating the performance of these tasks should also be done together. However, distinct techniques are required to assess the separate processes of semantic segmentation on green and red channel images of an FOV, and distance-based evaluation of whether the same objects have been localised on both images.
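The two overlap metrics used for the segmentation assessment have standard definitions on binary masks; a minimal NumPy version:

```python
import numpy as np

def jaccard(pred, gt):
    """Jaccard index: intersection over union of two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def dice(pred, gt):
    """Dice coefficient: 2|A n B| / (|A| + |B|); equals the F1-score."""
    inter = np.logical_and(pred, gt).sum()
    total = pred.sum() + gt.sum()
    return 2.0 * inter / total if total else 1.0

# Two overlapping 4x4 squares: 16 pixels each, 9 pixels in common.
pred = np.zeros((8, 8), dtype=bool); pred[2:6, 2:6] = True
gt = np.zeros((8, 8), dtype=bool); gt[3:7, 3:7] = True
```

Since Dice = 2J/(1 + J), the Dice value on a given mask pair never falls below the Jaccard value.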
For example, a detection may be a true positive in terms of detection, yet not be deemed accurate for the lipid content estimation. This is also the primary reason why these two stages of our work require two distinct assessment techniques.

The evaluation metrics used in the assessment of semantic segmentation are the Dice coefficient [54] and the Jaccard index [53]. When only Auramine-O stained FOVs are included, these are 97.00% and 96.06% respectively. The value of the Dice coefficient and Jaccard index exceeds that achieved by earlier efforts [27–29]. All three works employ the same evaluation metrics, which facilitates direct comparison with our method. However, when LTR stained images are included, the percentages decrease to 92.03% and 85.84%, respectively. As seen in Fig. 5, Nile red stained FOVs often result in false positives, which motivated our subsequent use of morphology. Table 1 presents a comprehensive overview of the outcomes obtained from the comparison of the two UNet models. Following the application of morphological criteria and Fourier descriptors to the FOVs of both dyes, the final percentages are 95.47% and 91.33%. Considering that it is very difficult to match precise bacterial outlines by manual or automated labelling, it is unrealistic to anticipate that the form of the predicted contour would precisely match the shape outline of the ground-truth contour. Therefore, even if the model accurately predicted a contour, the errors in the reference used as the ground truth itself may penalise it slightly.

Fig. 4. The diagram presented illustrates the proposed approach. The first method involves the training of the UNet model and the proposed network. Afterwards, bacterial patches are manually cropped from ground truth labels in the green channel that were employed in the segmentation method.
The decision stage determines whether it is necessary to pre-train the CNN model for the final stage or, alternatively, to use the same CNN with a regression layer to predict the vector of cell length and width directly. If the pre-trained CNN is used, the features extracted from it are combined with the output of a feature descriptor. This concatenated feature representation is then fed into an MSVR, which produces a prediction vector analogous to the output of the CNN's regression layer.

Fig. 5. (a) and (b) are separate examples of Auramine-O stained FOVs, together with the prediction image from the segmentation model and the labels applied by a microscopist to the corresponding ground truth images. (c) is an example of a different LipidTox Red stained FOV (not paired with (a) or (b)). The prediction in (c) has localised three false positive objects, which are likely due to noise or artefacts and are subsequently rejected using our morphology-based approach.

Table 1
A comparison of segmentation results between the original UNet and our proposed network. In the training phase, a composite of both stained FOVs was used, whereas in the testing phase, both models were first evaluated using only green FOVs and then using both types of FOV. The LTR dye stained more artefacts, making it more difficult to detect Mtb cells precisely on the red images. Although the original UNet performed better in training, the proposed network performed better on unseen test data.

Model            Test FOVs    Training Dice  Training Jaccard  Test Dice  Test Jaccard
UNet (baseline)  Green FOVs   99.53%         99.07%            96.10%     92.49%
UNet (baseline)  All FOVs     –              –                 91.29%     83.97%
Our network      Green FOVs   99.04%         98.11%            97.00%     96.06%
Our network      All FOVs     –              –                 92.03%     85.24%

Fig. 6. Plots (a) and (b): length and width samples versus their respective percentage error rates.
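A minimal sketch of the descriptor-plus-MSVR path described above, assuming scikit-image and scikit-learn are available. Since scikit-learn has no native MSVR, the multi-output SVR of [51] is approximated here by independent per-output SVRs; the array shapes, hyperparameters, and random stand-in data are illustrative only:

```python
import numpy as np
from skimage.feature import hog
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

def describe_patch(patch, cnn_features):
    """Concatenate a CNN embedding with a HOG descriptor of the same patch."""
    hog_vec = hog(patch, orientations=9, pixels_per_cell=(8, 8),
                  cells_per_block=(2, 2), feature_vector=True)
    return np.concatenate([cnn_features, hog_vec])

rng = np.random.default_rng(0)
patches = rng.random((20, 64, 64))         # stand-ins for cropped bacterial patches
cnn_feats = rng.random((20, 128))          # stand-ins for CNN embeddings
targets = rng.random((20, 2)) * [100, 10]  # stand-in [length, width] targets in pixels

X = np.stack([describe_patch(p, f) for p, f in zip(patches, cnn_feats)])
reg = MultiOutputRegressor(SVR(kernel="rbf", C=10.0)).fit(X, targets)
pred = reg.predict(X[:1])                  # one [length, width] prediction per patch
```

In the actual pipeline the 128-dimensional vectors would come from the pre-trained CNN backbone rather than a random generator, and HOG is only one of the descriptor choices compared in Table 2.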
Due to the large number of samples in both the training and test sets, the graph is simplified by averaging samples within a 0.01% error difference.

Table 2
Performance evaluation metrics for both training and test sets, covering all combinations of the model and feature descriptors. The quantitative results demonstrate that our approach has learnt and generalised the problem. The CNN model used throughout the evaluation phase was the pre-trained model specifically designed for this purpose.

Model            Training RMSE  Training MAPE  Training MAE  Test RMSE  Test MAPE  Test MAE
CNN              1.9840         0.0213         0.5366        2.4746     0.1111     1.7442
CNN & HOG        0.0161         0.0046         0.0212        0.0815     0.0112     0.1004
CNN & SIFT       0.6727         0.0350         0.6753        0.8357     0.0533     1.0778
CNN & MKD        0.5374         0.0290         0.5469        0.6915     0.0431     0.8732
CNN & HardNet    0.4651         0.0246         0.4646        0.5307     0.0339     0.6628
CNN & HardNet 8  0.7747         0.0393         0.7599        0.9575     0.0615     1.2421
CNN & HyNet      0.5034         0.0263         0.5162        0.6476     0.0402     0.8076
CNN & TFeat      0.1322         0.0077         0.1572        0.1563     0.0096     0.2068
CNN & SOSNet     0.4718         0.0256         0.4948        0.6248     0.0394     0.8007
CNN & LBPs       0.7684         0.0396         0.7624        0.9737     0.0626     1.2495

4.3. Distance-based evaluation

Next, we evaluated the ability of our network to recognise the same bacteria at the same location in both images of each FOV. We use the L1, L2, and L-infinity (L∞) norms in a manner similar to that described by Zachariou et al. [29]. Instead of comparing ground truth contours to predicted contours, contours from the green FOV and the red FOV are compared in this paper. Essentially, we attempt to match the centroids of bacteria in the green FOV with the centroids of bacteria in the red FOV. The pairing was determined by the minimum Euclidean distance between centroid positions, with a 15-pixel threshold beyond which an apparent bacterium in one image was deemed to have no partner in the other.
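This matching rule can be sketched as follows; greedy one-to-one nearest-centroid assignment is our assumption of the exact strategy, and the coordinates are toy values:

```python
import numpy as np

def pair_centroids(green, red, threshold=15.0):
    """Greedily pair each green-FOV centroid with its nearest red-FOV centroid.

    Returns the list of paired distances; a green contour with no red partner
    within `threshold` pixels is discarded as irrelevant.
    """
    red = list(red)
    distances = []
    for g in green:
        if not red:
            break
        d = [np.hypot(g[0] - r[0], g[1] - r[1]) for r in red]
        i = int(np.argmin(d))
        if d[i] <= threshold:
            distances.append(d[i])
            red.pop(i)          # each red contour may be used only once
    return distances

green = [(10.0, 10.0), (50.0, 40.0), (90.0, 90.0)]
red = [(12.0, 11.0), (53.0, 44.0), (300.0, 300.0)]   # third bacterium unmatched
d = pair_centroids(green, red)
l1, l2, linf = sum(d), float(np.sqrt(np.sum(np.square(d)))), max(d)
```

The resulting distance vector is exactly what the L1, L2, and L∞ norms reported below are computed over.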
If no suitable contour is found in the red FOV, the contour from the green FOV is discarded, since it is deemed irrelevant. The paired distances constitute a vector which is subsequently used for the norm calculations. Additionally, we provide the counts of paired contours from each category, namely green and red ground truth FOVs, along with their corresponding predicted counterparts. The L1, L2, and L∞ norms for the ground truth FOVs measured 1010.77, 49.17, and 8.54 pixels, respectively, with a total of 572 pairs. For the predicted FOVs, the norms were 1067.7, 56.12, and 9.85 pixels, with 577 pairs. The close agreement of the norms between the two sets of FOV pairings underscores the accuracy of our technique in predicting bacterial locations in both scenarios. Specifically, the L∞ norm, representing the maximum absolute distance, differs by less than 2 pixels, and the sums of all distances are within 70 pixels of each other. Considering that the average length of a bacterium can range from 20 to 100 pixels, these numbers suggest that the predicted pairings closely align with the ground truth ones.

4.4. Bacterial length and width

As described in Section 3.3, we use regression to estimate the individual length and average width of bacteria. We therefore apply regression evaluation metrics, comprising the root mean square error (RMSE), mean absolute percentage error (MAPE), and mean absolute error (MAE). The rationale for incorporating both MAPE and MAE lies in the dissimilar scaling of length and width. As depicted in Fig. 6, the scaling of length and width differs significantly, implying that an error in length does not have an effect equivalent to the same error in width.
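The three metrics can be sketched as follows; the toy numbers illustrate why both absolute and percentage errors are reported when lengths (tens of pixels) and widths (a few pixels) share one target vector:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAPE, MAE) for arrays of targets and predictions."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_pred - y_true
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mape = float(np.mean(np.abs(err) / np.abs(y_true)))  # assumes no zero targets
    mae = float(np.mean(np.abs(err)))
    return rmse, mape, mae

# A 2-pixel error on a ~50 px length is small in percentage terms, while a
# 1-pixel error on a ~5 px width is large - hence reporting MAPE alongside MAE.
rmse, mape, mae = regression_metrics([50.0, 5.0], [52.0, 4.0])
# rmse = sqrt((4 + 1) / 2) ~= 1.581, mape = (0.04 + 0.2) / 2 = 0.12, mae = 1.5
```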
This figure also indicates that MAE is a more suitable loss function than MSE for our dataset: the outliers, represented by the two tails of the distribution, exhibit a smaller deviation from 0% error, while the majority of errors occur on the average samples. Altogether the results are promising and all model combinations performed well, with the CNN & HOG combination consistently performing best according to all the criteria. Fig. 7 shows two examples of cell dimension measurements using the best model. Table 2 summarises all training and test set metrics. Two additional plots, depicted in Figs. 8 and 9 and derived from the test set, provide supplementary evidence that the model has performed strongly and has acquired the ability to generalise to novel data.

Fig. 7. Examples of patches to illustrate the labelling procedure for cell dimensions, with examples of ground truth and predicted distances. The lengths of bacteria are shown by the blue straight lines, while widths are depicted by green straight lines. When the length of a curved or angular bacterium requires several blue lines for full coverage, its total length is calculated as the sum of all the blue lines within it. Distances written next to individual cells in blue are ground truth lengths in pixels, while those in red are predicted lengths. The width value is the average of all green lines in each patch and is written in the bottom left corner of each image.

Fig. 8. The histogram of residuals shows a concentration of residuals around 0, indicating that the model's residuals are predominantly distributed in close proximity to the origin. Patches consisting of 3 or 4 cells are infrequent; as a result, the third length is typically 0, which aligns with the model's accurate prediction. For clarity, the fourth length is not displayed.

Fig. 9. The residual plot indicates a dispersion of residuals close to zero. This provides further evidence for the selection of MAE as the more appropriate loss function, given that the outliers in the test set lie closer to zero than the average sample.

5. Conclusion

The majority of machine learning and deep learning research on automating sputum smear microscopy has focussed on its long-established role as a frontline diagnostic test for pulmonary TB. As molecular tools, such as Xpert MTB/RIF, replace this function, a key contribution of microscopy may become its ability to report on phenotypic characteristics of individual Mtb cells for treatment monitoring and to improve our biological understanding of therapeutic response. The work we publish here is the first demonstration of artificial intelligence approaches for this application. We have pioneered a new method for semantic segmentation of Mtb bacteria on fluorescence microscopy FOVs which performs well according to established evaluation metrics. Our method is robust for use with multiple fluorescence stains, so that paired images of the same FOV can be used to report on bacterial detection and the presence of important intracellular structures such as lipid content. Finally, a significant contribution of our work is that our models accurately predict the dimensions (length and width) of cells in original ground truth images, which will improve the ability of clinical researchers and microbiologists to investigate the relevance of heterogeneous bacterial appearances in biological samples.

Next steps for this work will include: (i) interdisciplinary collaboration between Infectious Disease and Computer Science researchers to deploy these tools on more microscopy image-sets to assess their real-world application, and (ii) optimisation of methods for automated reading of whole slides, so that the manual labour required to identify FOVs and patches before deep learning techniques can be used is also eliminated.

Overall, the information compiled in this work argues that microscopy-based treatment monitoring and Mtb cell phenotyping research is important, and we have shown that automated deep learning techniques make these activities possible.

Declaration of competing interest

The authors declare no conflict of interest.

Acknowledgements

Supported by a Wellcome Trust Institutional Strategic Support Fund award to the University of St Andrews, grant code 204821/Z/16/Z.

References

[1] World Health Organization, Global Tuberculosis Report, Tech. Rep., 2022, URL https://www.who.int/teams/global-tuberculosis-programme/tb-reports/global-tuberculosis-report-2022.
[2] D.P. Spence, J. Hotchkiss, C.S. Williams, P.D. Davies, Tuberculosis and poverty, Br. Med. J. 307 (6907) (1993) 759–761.
[3] World Health Organization, End TB strategy, 2015, URL https://apps.who.int/iris/bitstream/handle/10665/331326/WHO-HTM-TB-2015.19-eng.pdf?sequence=1&isAllowed=y.
[4] K.R. Steingart, M. Henry, V. Ng, P.C. Hopewell, A. Ramsay, J. Cunningham, R. Urbanczik, M. Perkins, M.A. Aziz, M. Pai, Fluorescence versus conventional sputum smear microscopy for tuberculosis: a systematic review, Lancet Infect. Dis. 9 (6) (2006) 570–581.
[5] Stop TB
Partnership, Global Laboratory Initiative Advancing TB Diagnosis: Mycobacteriology Laboratory Manual, Stop TB Partnership, Geneva, 2014, URL https://stoptb.org/wg/gli/assets/documents/gli_mycobacteriology_lab_manual_web.pdf.
[6] C.C. Boehme, P. Nabeta, D. Hillemann, M.P. Nicol, S. Shenai, F. Krapp, J. Allen, R. Tahirli, R. Blakemore, R. Rustomjee, A. Milovic, M. Jones, S.M. O'Brien, D.H. Persing, S. Ruesch-Gerdes, E. Gotuzzo, C. Rodrigues, D. Alland, M.D. Perkins, Rapid molecular detection of tuberculosis and rifampin resistance, N. Engl. J. Med. 363 (11) (2010) 1005–1015.
[7] S.O. Friedrich, A. Rachow, E. Saathoff, K. Singh, C.D. Mangu, R. Dawson, P.P. Phillips, A. Venter, A. Bateson, C.C. Boehme, N. Heinrich, R.D. Hunt, M.J. Boeree, A. Zumla, T.D. McHugh, S.H. Gillespie, A.H. Diacon, M. Hoelscher, Assessment of the sensitivity and specificity of Xpert MTB/RIF assay as an early sputum biomarker of response to tuberculosis treatment, Lancet Respir. Med. 1 (6) (2013) 462–470.
[8] World Health Organization, WHO Operational Handbook on Tuberculosis. Module 4: Treatment. Drug-Susceptible Tuberculosis Treatment, World Health Organization, Geneva, 2022, URL https://www.who.int/publications/i/item/9789240050761.
[9] R.J.H. Hammond, F. Kloprogge, O.D. Pasqua, S.H. Gillespie, Implications of drug-induced phenotypical resistance: Is isoniazid radicalizing M. tuberculosis? Front. Antibiot. 1 (2022) 928365.
[10] J. Daniel, H. Maamar, C. Deb, T.D. Sirakova, P.E. Kolattukudy, Mycobacterium tuberculosis uses host triacylglycerol to accumulate lipid droplets and acquires a dormancy-like phenotype in lipid-loaded macrophages, PLoS Pathog. 7 (6) (2011) e1002093.
[11] N.J. Garton, S.J. Waddell, A.L. Sherratt, S.-M. Lee, R.J. Smith, C. Senner, J. Hinds, K. Rajakumar, R.A. Adegbola, G.S. Besra, P.D. Butcher, M.R. Barer, Cytological and transcript analyses reveal fat and lazy persister-like bacilli in tuberculous sputum, PLoS Med. 5 (4) (2008) e75.
[12] R.J.H. Hammond, V.O.
Baron, K. Oravcova, S. Lipworth, S.H. Gillespie, Phenotypic resistance in mycobacteria: is it because I am old or fat that I resist you? J. Antimicrob. Chemother. 70 (10) (2015) 2823–2827.
[13] C. Deb, C.M. Lee, V.S. Dubey, J. Daniel, B. Abomoelak, T.D. Sirakova, S. Pawar, L. Rogers, P.E. Kolattukudy, A novel in vitro multiple-stress dormancy model for Mycobacterium tuberculosis generates a lipid-loaded, drug-tolerant, dormant pathogen, PLoS One 4 (6) (2009) e6077.
[14] D.J. Sloan, H.C. Mwandumba, N.J. Garton, S.H. Khoo, A.E. Butterworth, T.J. Allain, R.S. Heyderman, E.L. Corbett, M.R. Barer, G.R. Davies, Pharmacodynamic modeling of bacillary elimination rates and detection of bacterial lipid bodies in sputum to predict and understand outcomes in treatment of pulmonary tuberculosis, Clin. Infect. Dis. 61 (1) (2015) 1–8.
[15] E.S. Chung, W.C. Johnson, B.B. Aldridge, Types and functions of heterogeneity in mycobacteria, Nat. Rev. Microbiol. 20 (9) (2022) 529–541.
[16] B.B. Aldridge, M. Fernandez-Suarez, D. Heller, V. Ambravaneswaran, D. Irimia, M. Toner, S.M. Fortune, Asymmetry and aging of mycobacterial cells lead to variable growth and antibiotic susceptibility, Science 335 (6064) (2012) 100–104.
[17] K. Richardson, O.T. Bennion, S. Tan, A.N. Hoang, M. Cokol, B.B. Aldridge, Temporal and intrinsic factors of rifampicin tolerance in mycobacteria, Proc. Natl. Acad. Sci. 113 (29) (2016) 8302–8307.
[18] S. Vijay, D.N. Vinh, H.T. Hai, V.T.N. Ha, V.T.M. Dung, T.D. Dinh, H.N. Nhung, T.T.B. Tram, B.B. Aldridge, N.T. Hanh, D.D.A. Thu, N.H. Phu, G.E. Thwaites, N.T.T. Thuong, Influence of stress and antibiotic resistance on cell-length distribution in Mycobacterium tuberculosis clinical isolates, Front. Microbiol. 8 (2017) 2296.
[19] D.A. Barr, C. Schutz, A. Balfour, M. Shey, M. Kamariza, C.R. Bertozzi, T.J. de Wet, R. Dinkele, A. Ward, K.A. Haigh, Serial measurement of M.
tuberculosis in blood from critically-ill patients with HIV-associated tuberculosis, EBioMedicine 78 (2022).
[20] H.L. Rieder, A. Van Deun, K. Man Kam, S. Jae Kim, T.M. Chonde, A. Trebucq, R. Urbanczik, Priorities for Tuberculosis Bacteriology Services in Low-Income Countries, Bull. Int. Union Tuberc. Lung Dis. (2007).
[21] D. Vente, O. Arandjelović, V.O. Baron, E. Dombay, S.H. Gillespie, Using machine learning for automatic estimation of M. smegmatis cell count from fluorescence microscopy images, in: International Workshop on Health Intelligence, 2019, pp. 57–68.
[22] B. Yesilkaya, M. Perc, Y. Isler, Manifold learning methods for the diagnosis of ovarian cancer, J. Comput. Sci. 63 (2022) 101775.
[23] M. Surucu, Y. Isler, M. Perc, R. Kara, Convolutional neural networks predict the onset of paroxysmal atrial fibrillation: Theory and applications, Chaos 31 (11) (2021).
[24] Y. Bao, X. Zhao, L. Wang, W. Qian, J. Sun, Morphology-based classification of mycobacteria-infected macrophages with convolutional neural network: reveal EsxA-induced morphologic changes indistinguishable by naked eyes, Transl. Res. 212 (2019) 1–13.
[25] H. Yu, W. Jing, R. Iriya, Y. Yang, K. Syal, M. Mo, T.E. Grys, S.E. Haydel, S. Wang, N. Tao, Phenotypic antimicrobial susceptibility testing with deep learning video microscopy, Anal. Chem. 90 (10) (2018) 6314–6322.
[26] T. Zahir, R. Camacho, R. Vitale, C. Ruckebusch, J. Hofkens, M. Fauvart, J. Michiels, High-throughput time-resolved morphology screening in bacteria reveals phenotypic responses to antibiotics, Commun. Biol. 2 (1) (2019) 1–13.
[27] K.S. Mithra, W.R. Sam Emmanuel, FHDT: fuzzy and hyco-entropy-based decision tree classifier for tuberculosis diagnosis from sputum images, Sādhanā 43 (8) (2018) 1–15.
[28] J.L. Díaz-Huerta, A.d.C. Téllez-Anguiano, M. Fraga-Aguilar, J.A. Gutierrez-Gnecchi, S. Arellano-Calderón, Image processing for AFB segmentation in bacilloscopies of pulmonary tuberculosis diagnosis, PLoS One 14 (7) (2019) e0218861.
[29] M. Zachariou, O. Arandjelović, W. Sabiiti, B. Mtafya, D. Sloan, Tuberculosis bacteria detection and counting in fluorescence microscopy images using a multi-stage deep learning pipeline, Information 13 (2) (2022) 96.
[30] O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015, pp. 234–241.
[31] M. Zachariou, O. Arandjelović, E. Dombay, W. Sabiiti, B. Mtafya, D. Sloan, Extracting and classifying salient fields of view from microscopy slides of tuberculosis bacteria, in: International Conference on Pattern Recognition and Artificial Intelligence, 2022, pp. 1–12.
[32] D.P. Kingma, J. Ba, Adam: A method for stochastic optimization, 2014, arXiv preprint arXiv:1412.6980.
[33] H. Robbins, S. Monro, A stochastic approximation method, Ann. Math. Stat. (1951) 400–407.
[34] J. Zhuang, T. Tang, Y. Ding, S. Tatikonda, N. Dvornek, X. Papademetris, J.S. Duncan, AdaBelief optimizer: Adapting stepsizes by the belief in observed gradients, Adv. Neural Inf. Process. Syst. 33 (2020) 18795–18806.
[35] L.N. Smith, Cyclical learning rates for training neural networks, in: IEEE Winter Conference on Applications of Computer Vision, 2017, pp. 464–472.
[36] X. Li, X. Sun, Y. Meng, J. Liang, F. Wu, J. Li, Dice loss for data-imbalanced NLP tasks, 2019, arXiv preprint arXiv:1911.02855.
[37] X. Yue, N. Dimitriou, O. Arandjelović, Colorectal cancer outcome prediction from H&E whole slide images using machine learning and automatically inferred phenotype profiles, in: 11th International Conference, Vol. 60, 2019, pp. 139–149.
[38] O. Arandjelović, Hallucinating optimal high-dimensional subspaces, Pattern Recognit. 47 (8) (2014) 2662–2672.
[39] R.O. Panicker, K.S. Kalmady, J. Rajan, M.K. Sabu, Automatic detection of tuberculosis bacilli from microscopic sputum smear images using deep learning methods, Biocybern. Biomed. Eng. 38 (3) (2018) 691–699.
[40] V. Makkapati, R. Agrawal, R. Acharya, Segmentation and classification of tuberculosis bacilli from ZN-stained sputum smear images, in: International Conference on Automation Science and Engineering, 2009, pp. 217–220.
[41] D.H. Douglas, T.K. Peucker, Algorithms for the reduction of the number of points required to represent a digitized line or its caricature, Cartogr. Int. J. Geogr. Inf. Geovisualization 10 (2) (1973) 112–122.
[42] R. Arandjelović, A. Zisserman, Three things everyone should know to improve object retrieval, in: IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2911–2918.
[43] A. Mukundan, G. Tolias, A. Bursuc, H. Jégou, O. Chum, Understanding and improving kernel local descriptors, Int. J. Comput. Vis. 127 (11) (2019) 1723–1737.
[44] A. Mishchuk, D. Mishkin, F. Radenovic, J. Matas, Working hard to know your neighbor's margins: Local descriptor learning loss, Adv. Neural Inf. Process. Syst. 30 (2017).
[45] M. Pultar, Improving the HardNet descriptor, 2020, arXiv preprint arXiv:2007.09699.
[46] Y. Tian, X. Yu, B. Fan, F. Wu, H. Heijnen, V. Balntas, SOSNet: Second order similarity regularization for local descriptor learning, in: IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 11016–11025.
[47] V. Balntas, E. Riba, D. Ponsa, K. Mikolajczyk, Learning local feature descriptors with triplets and shallow convolutional neural networks, in: BMVC, Vol. 1, No. 2, 2016, p. 3.
[48] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 1, IEEE, 2005, pp. 886–893.
[49] T. Ojala, M. Pietikainen, D. Harwood, Performance evaluation of texture measures with classification based on Kullback discrimination of distributions, in: 12th International Conference on Pattern Recognition, Vol. 1, 1994, pp. 582–585.
[50] J. Fan, O.
Arandjelović, Employing domain specific discriminative information to address inherent limitations of the LBP descriptor in face recognition, in: International Joint Conference on Neural Networks, IEEE, 2018, pp. 1–7.
[51] Y. Bao, T. Xiong, Z. Hu, Multi-step-ahead time series prediction using multiple-output support vector regression, Neurocomputing 129 (2014) 482–493.
[52] I. Loshchilov, F. Hutter, SGDR: Stochastic gradient descent with warm restarts, 2016, arXiv preprint arXiv:1608.03983.
[53] A. Beykikhoshk, O. Arandjelović, D. Phung, S. Venkatesh, Overcoming data scarcity of Twitter: using tweets as bootstrap with application to autism-related topic content analysis, in: International Conference on Advances in Social Networks Analysis and Mining, 2015, pp. 1354–1361.
[54] B. Guindon, Y. Zhang, Application of the Dice coefficient to accuracy assessment of object-based image classification, Can. J. Remote Sens. 43 (1) (2017) 48–61.