Frame-by-frame annotation of video recordings using deep neural networks

Video data are widely collected in ecological studies, but manual annotation is a challenging and time-consuming task that has become a bottleneck for scientific research. Classification models based on convolutional neural networks (CNNs) have proved successful in annotating images, but few applications have extended these to video classification. We demonstrate an approach that combines a standard CNN summarizing each video frame with a recurrent neural network (RNN) that models the temporal component of video. The approach is illustrated using two datasets: one collected by static video cameras detecting seal activity inside coastal salmon nets, and another collected by animal-borne cameras deployed on African penguins, used to classify behaviour. The combined RNN-CNN reduced test set classification error relative to an image-only model by 25% for penguins (accuracy improving from 80% to 85%), and substantially improved classification precision or recall for four of six behaviour classes (by 12–17%). Image-only and video models classified seal activity with equally high accuracy (90%). Temporal patterns related to movement provide valuable information about animal behaviour, and classifiers benefit from including these explicitly. We recommend the inclusion of temporal information whenever manual inspection suggests that movement is predictive of class membership.


Introduction
There are three approaches to using DNNs for video classification beyond treating the problem as an image classification task by modelling frames independently. The simplest approach concatenates the vector encodings obtained from each of a sequence of input images to predict the class of the last image in the sequence; images in the input sequence are considered to be independent.
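As an illustration, a minimal sketch of this first approach is given below, assuming frame encodings have already been extracted with a pre-trained CNN; the sequence length, encoding dimension, and layer sizes are placeholders rather than the configuration used in this study.

```python
# Sketch of the "concatenated encodings" approach: each of SEQ_LEN frames is
# summarised by a fixed-length CNN encoding, the encodings are flattened into
# one vector, and dense layers predict the class of the last frame.
from tensorflow.keras import layers, models

SEQ_LEN = 5      # frames per input sequence (placeholder)
ENC_DIM = 2048   # e.g. ResNet50 average-pooled output size
N_CLASSES = 6    # e.g. six behaviour classes

inputs = layers.Input(shape=(SEQ_LEN, ENC_DIM))
x = layers.Flatten()(inputs)                    # concatenate frame encodings
x = layers.Dense(256, activation="relu")(x)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(N_CLASSES, activation="softmax")(x)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```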

The second approach uses the sequence of vector encodings produced from the sequence of input images as input to a recurrent neural network, which models the temporal dependence between frames.

In this paper we have used these approaches to perform frame-by-frame annotation of two video datasets. The first was taken from a fixed underwater camera placed inside nets at a salmon trap net fishery in Scotland, for the purpose of detecting seal visits to salmon nets and ultimately reducing conflict between fisheries and seals. Here the task was to detect whether a seal is present in a frame, based on that and preceding frames. The second dataset was collected by animal-borne cameras deployed on African penguins and used to classify behaviour. A total of 52722 images were obtained, with substantial imbalance between behaviours.

We consider four broad classes of models, of increasing complexity. The first ignores the temporal component and classifies each frame independently; the remaining models encoded each frame with a pre-trained CNN and passed these encodings as input to a recurrent neural network, which combined these temporally (Figure 1). We used two pre-trained CNNs to encode frames (ResNet50, VGG16) and three different RNN architectures (Long Short-Term Memory (LSTM), SimpleRNN, Gated Recurrent Units (GRU)).
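The sketch below shows one way such a pipeline can be put together in Keras: frames are summarised by a pre-trained CNN, the encodings are cached, and a recurrent layer classifies each frame from the sequence of encodings ending at that frame. The image size, batch handling, and layer sizes are illustrative rather than the tuned configuration used here.

```python
# Minimal sketch of an RNN-CNN pipeline (illustrative sizes and settings):
#   1) summarise each frame with a pre-trained CNN (ResNet50, average pooling),
#   2) cache the encodings, and
#   3) classify each frame from the sequence of encodings ending at that frame,
#      using a recurrent layer (LSTM; GRU or SimpleRNN are drop-in swaps).
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input

SEQ_LEN, ENC_DIM, N_CLASSES = 5, 2048, 6    # placeholders

# 1) + 2): pre-compute frame encodings once, so RNN experiments can reuse them
encoder = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def encode_frames(frame_paths, batch_size=64):
    """Return an (n_frames, ENC_DIM) array of encodings for time-ordered frames."""
    out = []
    for start in range(0, len(frame_paths), batch_size):
        imgs = [tf.keras.utils.img_to_array(
                    tf.keras.utils.load_img(p, target_size=(224, 224)))
                for p in frame_paths[start:start + batch_size]]
        out.append(encoder.predict(preprocess_input(np.stack(imgs)), verbose=0))
    return np.concatenate(out)

# 3): recurrent classifier over sequences of cached encodings
inputs = layers.Input(shape=(SEQ_LEN, ENC_DIM))
x = layers.LSTM(128)(inputs)                # temporal combination of frames
x = layers.Dense(64, activation="relu")(x)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(N_CLASSES, activation="softmax")(x)

rnn_cnn = models.Model(inputs, outputs)
rnn_cnn.compile(optimizer="adam",
                loss="sparse_categorical_crossentropy",
                metrics=["accuracy"])
```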

One key step was to pre-compute the frame vector encodings from the pre-trained CNN models so that these did not have to be re-computed in each RNN model. A single training epoch for models in which the encodings were not pre-computed took substantially longer, and these models did not achieve better accuracy than even an image-only model. We therefore do not report on these results further. We chose model hyperparameters using a grid search over the number of nodes in each of the three dense layers in Models 1 and 2 (32, 64, 96, ..., 512), the dropout rate (0, 0.1, 0.2, ..., 0.5), and the length of the sequence of images used in Models 2 and 3 (1, 3, 5, 7, 9, ..., 31).

Including temporal information improved precision or recall for several behaviour classes (Table B.1, Appendix B), but particularly for descent and bottom dive phases (precision increasing by 17% and 14%), and for shallow and subsurface dives (recall increasing by 12% and 13%). Image-only models tended to misclassify bottom dives as descent dives, and mistook parts of the ascending and descending dive phases for shallow dives. To some extent this reflects fuzzy boundaries between behavioural classes, but temporal information resolved some of these misclassifications (Figure 2). Search activity, the sole surface behaviour and also the most prevalent class, was almost perfectly discriminated.

We found that for a relatively simple task (detecting seal activity in an image) an image-only CNN was adequate, and incorporating temporal information did not meaningfully improve out-of-sample performance, even for those difficult cases in which a seal enters or exits the field of view. For the more difficult task of inferring penguin behaviour from animal-borne cameras, using a video model led to a substantial reduction in classification error over an image-only model, and was particularly useful in disentangling certain kinds of diving behaviour. In both applications accuracy is not sufficient for full automation of the tasks, but can facilitate manual processing by partially labelling the data: identifying those classes that can be accurately discriminated and pointing the researcher to segments requiring closer inspection. Our datasets were relatively small, consisting of 6–12 hours of labelled footage, and the ability of the models to generalize to new environments is unclear, but even in those classes where absolute performance was moderate, video models outperformed image-only models. Improvements are likely to be larger with larger datasets.

Practically, researchers wanting to construct a model for the frame-by-frame annotation of video have to follow a number of steps: manually labelling a subset of the data; converting the video into images; allocating these images between training, validation, and test sets; choosing appropriate neural network architectures and estimating the parameters of those models; selecting a preferred model and using it to process the unlabelled portion of the data; and linking frame-by-frame predictions to the broader research objectives for which the classifier was developed.
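For this last step, linking per-frame predictions back to discrete events can be as simple as collapsing runs of identical predicted labels into start and end times; the sketch below illustrates one way to do this. The function name, integer-coded labels, and fps argument are illustrative choices rather than part of the workflow described here.

```python
# Illustrative helper: collapse per-frame class predictions into contiguous
# events (start time, end time, class), which can then be compared with the
# manually annotated event log or summarised for downstream analyses.
import numpy as np

def frames_to_events(labels, fps):
    """Turn a 1-D array of integer-coded per-frame labels into (start_s, end_s, class)."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return []
    change = np.flatnonzero(np.diff(labels)) + 1      # frames where the class changes
    starts = np.concatenate(([0], change))
    ends = np.concatenate((change, [len(labels)]))
    return [(s / fps, e / fps, int(labels[s])) for s, e in zip(starts, ends)]

# e.g. frames_to_events(predicted_probs.argmax(axis=1), fps=5)
```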

Video data are manually annotated by recording the start and end times of events.

Video data are converted to images at a user-specified frame rate, with the recording equipment setting an upper bound. A higher frame rate increases the number of images available to train models, which is always beneficial as long as there are meaningful differences between adjacent images. It is important to randomly allocate contiguous sequences of frames (i.e. video sequences) to training, validation, and test datasets, rather than randomly allocating the frames themselves. Doing the latter breaks apart sequences, losing potentially valuable information, and also means that very similar images occur in both training and test sets. We also recommend assessing whether the video in the test dataset has the same environmental conditions as the video used to train the model (e.g. if a random segment of each file is used for testing). If so, the ability of the model to generalize to new environments may be overestimated.
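One way to implement the sequence-level allocation recommended above is to split at the level of whole clips or files, as in the sketch below; the clip_ids array (one identifier per frame, recording which clip it came from) and the split fractions are illustrative assumptions. Grouped splitting utilities such as scikit-learn's GroupShuffleSplit achieve the same effect.

```python
# Sketch of a sequence-level split: whole clips, not individual frames, are
# assigned to train/validation/test, so near-duplicate frames never straddle
# the split. clip_ids is a hypothetical per-frame array of clip identifiers.
import numpy as np

def split_by_clip(clip_ids, val_frac=0.15, test_frac=0.15, seed=0):
    """Assign every frame to 'train', 'val' or 'test' based on its source clip."""
    rng = np.random.default_rng(seed)
    clips = rng.permutation(np.unique(clip_ids))
    n_val = int(len(clips) * val_frac)
    n_test = int(len(clips) * test_frac)
    val_clips = set(clips[:n_val])
    test_clips = set(clips[n_val:n_val + n_test])
    return np.array(["val" if c in val_clips
                     else "test" if c in test_clips
                     else "train"
                     for c in clip_ids])
```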

When building an RNN-CNN, the key choices are what frame rate and sequence length to use. These factors are study-specific, and the chosen frame rate need not be the same as the frame rate used to convert video to frames. Higher frame rates allow fine-scale changes in movement to be captured, but the same number of frames then covers a shorter time interval. Increasing the sequence length requires more parameters, increasing the chances of overfitting and requiring more data. Which of the two (looking back further in time or capturing fine-scale movement) benefits classification accuracy more will be study-specific. These factors can be investigated by searching over possible frame rate/length pairs, but this quickly becomes computationally expensive.
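As a concrete illustration, the sketch below builds input sequences for one candidate frame rate and sequence length by sub-sampling encodings extracted at a higher base rate; a sequence of L frames sampled at F frames per second looks back (L - 1)/F seconds. All names and rates here are placeholders rather than values from this study.

```python
# Build (n, seq_len, enc_dim) input sequences for a candidate frame rate and
# sequence length by sub-sampling encodings extracted at base_fps.
import numpy as np

def make_sequences(encodings, base_fps, seq_fps, seq_len):
    step = int(round(base_fps / seq_fps))       # base frames between samples
    span = step * (seq_len - 1)                 # how far back a sequence reaches
    last = np.arange(span, len(encodings))      # frames with a full history
    idx = last[:, None] - step * np.arange(seq_len - 1, -1, -1)[None, :]
    return encodings[idx], last                 # sequences and their target frames

# e.g. X, target_frames = make_sequences(enc, base_fps=25, seq_fps=5, seq_len=7)
# each sequence then spans (7 - 1) / 5 = 1.2 seconds of footage
```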

Our applications have relatively little labelled data, and so we fixed the frame rate to one that would allow broad differences in behaviour, observed over a few seconds, to be captured, with 5 < F < 10.

Subsurface dives were also well differentiated. These distinctions hold practical value, and also limit the amount of manual labelling that must be done.

Deep learning holds enormous promise for automating the labelling of video data, a process that looks increasingly unsustainable with manual methods. Case studies such as the ones reported here play an important role in reporting successes and failures, and in developing and disseminating best practices. Classification of ecological data is difficult. Limited time and other resources, remote locations, and rare or difficult-to-detect target species serve to decrease sample sizes at the same time that variable background environments increase the necessary sample sizes for good classification. In these contexts full automation is perhaps, for the time being, unrealistic. Facilitating the process of manually annotating video datasets is both valuable and achievable.

Video data have the great advantage that large datasets, in terms of numbers of images, are often collected relatively quickly. At 60 fps, a one-minute encounter with an animal provides 3600 images. This offers exciting opportunities for developing and testing deep learning approaches. Our study suggests that many applications may benefit from incorporating temporal information in video, where the goal remains to predict the class to which a particular frame or image belongs.

Figure: Predicted probabilities for penguin behaviour classes, with misclassifications plotted as crosses. Observed and predicted classes are plotted above the probabilities, using the same notation.