2.1 Object detection
Object detection is a field of CV that deals with detecting instances of objects in images and videos (Zhao et al., 2019). Methods for object detection generally include traditional image processing and analysis algorithms, and deep learning techniques (Zhao et al., 2019). Deep learning is a subset of machine learning that uses neural networks capable of learning from unstructured data, whether labelled (supervised) or unlabelled (unsupervised) (LeCun, Bengio & Hinton, 2015). Studies show that deep learning models are robust and efficient for fish detection in underwater scenarios (Cui et al., 2020; Ditria et al., 2020b; Jalal et al., 2020; Villon et al., 2020). In this paper, we use deep learning, specifically a Mask Region-based Convolutional Neural Network (Mask R-CNN), for fish detection. Mask R-CNN is one of the most effective open-access deep learning models for locating and classifying objects of interest (He et al., 2017).
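For illustration only, the following sketch shows how a Mask R-CNN detector with a ResNet-50 backbone can be instantiated and applied to a single frame using the torchvision library; this is a minimal example under assumed file names and is not the FishID pipeline itself, in which the weights are fine-tuned on annotated bream frames rather than generic COCO data.

# Minimal illustration of running a Mask R-CNN detector (ResNet-50 backbone)
# on one video frame with torchvision; file name and pretrained weights are
# stand-ins, not the study's fine-tuned model.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

frame = Image.open("frame_000001.jpg").convert("RGB")  # hypothetical frame path
with torch.no_grad():
    prediction = model([to_tensor(frame)])[0]

# Keep only detections at or above the 80% confidence threshold used in this study.
keep = prediction["scores"] >= 0.8
boxes = prediction["boxes"][keep]
masks = prediction["masks"][keep]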
To develop and train the fish detection model, we collected video footage of the target species, yellowfin bream, in the Tweed River estuary, Australia (-28.169438, 153.547594) between May and September 2019. We used six submerged action cameras (1080p Haldex Sports Action Cam HD) deployed for 1 hr over a variety of marine habitats (i.e. rocky reefs and seagrass meadows). We varied the angle and placement of the cameras to ensure the capture of diverse backgrounds and fish angles (Ditria et al., 2020a). We trimmed the original 1-hour videos into snippets where yellowfin bream were present using VLC media player 3.0.8. The snippets were then converted into still frames at 5 frames per second. The training videos included 8,700 fish annotated across the video sequences (Supplementary A). We used software developed at Griffith University for data preparation and annotation tasks (FishID - https://globalwetlandsproject.org/tools/fishid/). We trained the model using a ResNet50 architecture with a learning rate of 0.0025 (He et al., 2017). We used a randomly selected 90% sample of the annotated dataset for training, with the remaining 10% for validation. To minimise overfitting, we used the early-stopping technique (Prechelt, 2012), assessing mAP50 on the validation set at intervals of 2,500 iterations and identifying the point at which performance began to drop. We used a confidence threshold of 80%, meaning that we retained OD outputs only where the model was at least 80% confident that the detection was a yellowfin bream. We developed the models and analysed the videos using a Microsoft Azure Data Science Virtual Machine powered by either NVIDIA V100 or Tesla K80 GPUs.
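As an illustrative sketch of the hyperparameters reported above (ResNet-50 backbone, base learning rate of 0.0025, 80% confidence threshold, validation checks every 2,500 iterations), the configuration below is expressed with the Detectron2 config API; the dataset names are hypothetical and the FishID software may configure training differently.

# Sketch of a Mask R-CNN training configuration matching the hyperparameters
# reported in the text, written with Detectron2 for illustration only.
from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file(
    model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
)
cfg.DATASETS.TRAIN = ("bream_train",)        # hypothetical name for the 90% split
cfg.DATASETS.TEST = ("bream_val",)           # hypothetical name for the 10% split
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1          # single class: yellowfin bream
cfg.SOLVER.BASE_LR = 0.0025                  # learning rate reported in the text
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.8  # 80% confidence threshold
cfg.TEST.EVAL_PERIOD = 2500                  # assess validation mAP50 every 2,500 iterations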

2.2 Object tracking

Tracking objects in underwater videos is challenging due to the 3D medium that aquatic animals move through, which creates greater variation in the shape and texture of the objects and their surroundings in a video (Sidhu, 2016). Additionally, underwater images are often obscured by floating objects, so automated tracking of fish is not a trivial task. Advances in object tracking have started to address these issues, and have managed to track objects consistently even under natural variations in an object's shape, size and location (Bolme et al., 2010; Cheng et al., 2018). We developed a pipeline in which the OT architecture activates once the OD model detects a fish of the target species. This approach enabled automated detection and subsequent tracking of fish in underwater videos. Additionally, we benchmarked the performance of three OT architectures (MOSSE, Seq-NMS and SiamMask) using movement data gathered one month after the training dataset was collected, at a different location in the Tweed River estuary, Australia. In this location, a 150 m long rocky wall restricts access to a seagrass-dominated harbour (Figure 1). The placement of the rock wall creates a 20 m wide passageway that fish use as a movement corridor to access a seagrass meadow.
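The detect-then-track handoff can be sketched as below, using the MOSSE tracker (one of the three benchmarked architectures) as implemented in opencv-contrib-python; the detection function and video path are placeholders for the Mask R-CNN step and our footage, not the exact pipeline code.

# Illustrative detect-then-track loop: once a bream is detected, a MOSSE
# tracker follows it through subsequent frames until the track is lost.
import cv2

def detect_bream(frame):
    """Placeholder for the Mask R-CNN detection step; would return an
    (x, y, w, h) bounding box for a detected yellowfin bream, or None."""
    return None

cap = cv2.VideoCapture("corridor_snippet.mp4")  # hypothetical video path
tracker = None

while True:
    ok, frame = cap.read()
    if not ok:
        break
    if tracker is None:
        box = detect_bream(frame)
        if box is not None:
            tracker = cv2.legacy.TrackerMOSSE_create()  # requires opencv-contrib
            tracker.init(frame, box)
    else:
        tracked, box = tracker.update(frame)
        if not tracked:
            tracker = None  # track lost; fall back to detection
cap.release()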
We collected the fish movement data by submerging two sets of three action cameras (1080p Haldex Sports Action Cam HD) for one hour during a morning flood tide in October 2019. We placed the sets of cameras parallel to each other, separated by 20 m (Figure 1). Within each set, the cameras faced horizontally towards the fish corridor and parallel with the seafloor, and were separated by ~3 m. The camera placement allowed us to calculate horizontal movement (left or right) of fish through the corridor. The distance between the cameras and between the sets ensured non-overlapping fields of view. Set 1 cameras faced north and set 2 faced south (Figure 1). We placed the cameras in a continuous line at a depth of 2-3 m, starting at the harbour entrance and ending at the border of the seagrass meadow. We manually trimmed each video using VLC media player 3.0.8 into video snippets with continuous yellowfin bream movement. The trimming resulted in 76 videos of varying durations (3-70 seconds), with each video transformed into still frames at 25 frames per second. All frames with bream present were manually annotated, and these annotations were used as ground truth. We used the fish movement dataset to evaluate the object detection model and the OT architectures.
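As a simple illustration of how horizontal movement direction through the corridor could be inferred from a track, the helper below classifies a fish as moving left or right from the x-coordinates of its bounding-box centroids; the function name, threshold and example values are assumptions for illustration, not the analysis code used in the study.

# Illustrative helper: infer left/right movement from a track's centroid
# x-coordinates (in frame order), with a small displacement threshold in pixels.
def movement_direction(centroid_xs, min_displacement=5.0):
    if len(centroid_xs) < 2:
        return "unknown"
    displacement = centroid_xs[-1] - centroid_xs[0]
    if displacement > min_displacement:
        return "right"
    if displacement < -min_displacement:
        return "left"
    return "stationary"

# Example: a track drifting from x = 120 px to x = 480 px is classed as moving right.
print(movement_direction([120.0, 210.5, 355.2, 480.0]))  # -> "right"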