2.1 Object detection
Object detection (OD) is a field of CV that deals with detecting instances of
objects in images and videos (Zhao et al., 2019). Methods for object
detection generally include traditional image processing and analysis
algorithms, and deep learning techniques (Zhao et al., 2019). Deep
learning is a subset of machine learning that uses networks capable of
learning from unstructured data, either labelled (supervised) or
unlabelled (unsupervised) (LeCun, Bengio & Hinton, 2015). Studies
show that deep learning models are robust and efficient for fish
detection in underwater scenarios (Cui et al., 2020; Ditria et al.,
2020b; Jalal et al., 2020; Villon et al., 2020). In this paper, we use
deep learning, and more specifically the Mask Region-based Convolutional Neural
Network (Mask R-CNN), for fish detection. Mask R-CNN is one of the most
effective open-access deep learning models for locating and classifying
objects of interest (He et al., 2017).
To develop and train the fish detection model, we collected video
footage of the target species, yellowfin bream in the Tweed River
estuary, Australia (-28.169438, 153.547594) between May and September
2019. We used six submerged action cameras (1080p Haldex Sports Action
Cam HD) deployed for 1 hr over a variety of marine habitats (i.e. rocky
reefs and seagrass meadows). We varied the angle and placement of the
cameras to ensure the capture of diverse backgrounds and fish angles
(Ditria et al., 2020a). Using VLC media player 3.0.8, we trimmed the original
1 hr videos into snippets in which yellowfin bream were present.
The snippets were then converted into still frames at 5 frames per second.
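A minimal sketch of this frame-extraction step is given below, assuming OpenCV is used to sample every Nth frame of a trimmed snippet. Only the 5 fps target comes from the text; the paths, file names and helper name are illustrative.

```python
# Sketch: extract still frames at ~5 fps from a trimmed video snippet.
# Paths and naming are illustrative; only the 5 fps target follows the text.
import cv2

def extract_frames(video_path, out_dir, target_fps=5):
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(int(round(native_fps / target_fps)), 1)  # keep every Nth frame
    index, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            cv2.imwrite(f"{out_dir}/frame_{saved:06d}.jpg", frame)
            saved += 1
        index += 1
    cap.release()
    return saved
```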
training videos included 8,700 fish annotated across the video sequences
(Supplementary A). We used software developed at Griffith University for
data preparation and annotation tasks (FishID -
https://globalwetlandsproject.org/tools/fishid/). We trained the model
using a ResNet50 architecture with a learning rate of 0.0025 (He et al.,
2017). We used a randomly selected 90% sample of the annotated dataset
for the training, with the remaining 10% for validation. To minimise
overfitting, we used the early-stopping technique (Prechelt, 2012),
where we assessed mAP50 on the validation set at intervals of 2,500
iterations and determined where the performance began to drop. We used a
confidence threshold of 80%, meaning that we retained only OD outputs for which
the model was at least 80% confident that a detection was a yellowfin bream. We
developed the models and analysed the videos using a Microsoft Azure
Data Science Virtual Machine powered with either NVIDIA V100 GPUs or
Tesla K80 GPUs.
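For readers reproducing this step outside FishID, the configuration below is a non-authoritative sketch of an equivalent Mask R-CNN setup in Detectron2, a widely used open-source implementation of He et al. (2017). The dataset names ("bream_train", "bream_val") and the iteration budget are placeholders we introduce for illustration; only the ResNet50 backbone, the 0.0025 learning rate, the 90/10 split, the 2,500-iteration evaluation interval and the 80% confidence threshold come from the methods above.

```python
# Sketch of an equivalent Mask R-CNN (ResNet50) training setup in Detectron2.
# "bream_train" and "bream_val" are assumed to be registered COCO-format
# datasets holding the 90% training and 10% validation frames, respectively.
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer
from detectron2.evaluation import COCOEvaluator

class BreamTrainer(DefaultTrainer):
    @classmethod
    def build_evaluator(cls, cfg, dataset_name):
        # COCO-style evaluation so validation mAP50 can be monitored for early stopping
        return COCOEvaluator(dataset_name, output_dir=cfg.OUTPUT_DIR)

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.DATASETS.TRAIN = ("bream_train",)        # 90% random sample of annotated frames
cfg.DATASETS.TEST = ("bream_val",)           # remaining 10% held out for validation
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1          # single class: yellowfin bream
cfg.SOLVER.BASE_LR = 0.0025                  # learning rate reported in the text
cfg.SOLVER.MAX_ITER = 50000                  # placeholder upper bound; stop early when mAP50 drops
cfg.TEST.EVAL_PERIOD = 2500                  # evaluate validation performance every 2,500 iterations
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.8  # 80% confidence threshold at inference

trainer = BreamTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```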
2.2 Object tracking
Tracking objects in underwater videos is challenging due to the 3D
medium that aquatic animals move through, which creates greater
variation in the shape and texture of the objects and their surroundings
in a video (Sidhu, 2016). Additionally, underwater images are often
obscured by floating objects, so automated tracking of fish is not a
trivial task. Advances in object tracking have started to address these
issues, and have managed to track objects consistently despite natural
variation in an object’s shape, size and location (Bolme et al., 2010;
Cheng et al., 2018).
We developed a pipeline in which the object tracking (OT) architecture
activates once the OD model detects a fish of the target species, providing
automated detection and subsequent tracking of fish in underwater videos.
Additionally, we benchmarked the performance of three OT architectures
(MOSSE, Seq-NMS and SiamMask) using fish movement data gathered one month
after the training footage was collected, at a different location in the
Tweed River estuary, Australia. In this
location, a 150 m long rocky wall restricts access to a
seagrass-dominated harbour (Figure 1). The placement of the rock wall
creates a 20 m wide passageway that fish use as a movement corridor to
access a seagrass meadow.
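The detect-then-track handover can be sketched as follows, using OpenCV's implementation of the MOSSE tracker (Bolme et al., 2010) as the OT stage. The function detect_bream() is a hypothetical stand-in for the trained Mask R-CNN detector and is assumed to return a bounding box (x, y, w, h) for a detection above the 80% confidence threshold, or None when no bream is present; this illustrates the pipeline logic rather than the exact FishID implementation.

```python
# Sketch: tracking is triggered only after the OD model detects a target fish.
# detect_bream() is a hypothetical wrapper around the Mask R-CNN detector.
import cv2

def detect_and_track(video_path, detect_bream):
    cap = cv2.VideoCapture(video_path)
    tracker = None
    track = []  # per-frame bounding boxes of the tracked fish
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if tracker is None:
            box = detect_bream(frame)                       # OD stage
            if box is not None:
                tracker = cv2.legacy.TrackerMOSSE_create()  # requires opencv-contrib
                tracker.init(frame, box)
                track.append(box)
        else:
            ok, box = tracker.update(frame)                 # OT stage
            if ok:
                track.append(box)
            else:
                tracker = None  # track lost; wait for a new detection
    cap.release()
    return track
```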
We collected the fish movement data by submerging two sets of three
action cameras (1080p Haldex Sports Action Cam HD) for one hour during a
morning flood tide in October 2019. We placed the sets of cameras
parallel to each other and separated by 20 m (Figure 1). Within each
set, the cameras faced horizontally towards the fish corridor and
parallel with the seafloor, and were separated by ~ 3 m.
The camera placement allowed us to calculate horizontal movement (left
or right) of fish through the corridor. The distance between the cameras
and between the sets ensured non-overlapping fields of view. Set 1
cameras faced north and set 2 faced south (Figure 1). We placed the
cameras in a continuous line starting at the harbour entrance and ending
at the border of the seagrass meadow, at a depth of 2–3 m. We
manually trimmed each video using VLC media player 3.0.8 into video
snippets with continuous yellowfin bream movement. The trimming resulted
in 76 videos of varying duration (3–70 seconds), each of which was
transformed into still frames at 25 frames per second. All frames
with bream present were manually annotated, and these annotations were used as
ground truth. We used the fish movement dataset to evaluate the object
detection model and the OT architectures.
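As an illustration of how detections and tracker outputs can be scored against these manual annotations, the sketch below matches predicted boxes to ground-truth boxes by intersection-over-union (IoU) and counts a prediction as a true positive at an IoU of 0.5, the threshold underlying the mAP50 metric used above. The box format and helper names are our own illustrative choices, not part of the evaluation software used in the study.

```python
# Sketch: per-frame true-positive counting against manual ground-truth boxes.
def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2) in pixel coordinates.
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def frame_true_positives(predicted, groundtruth, threshold=0.5):
    # Greedy one-to-one matching: each ground-truth box can match at most
    # one prediction, and a match requires IoU >= threshold (0.5 for mAP50).
    unmatched = list(range(len(groundtruth)))
    tp = 0
    for p in predicted:
        if not unmatched:
            break
        best = max(unmatched, key=lambda i: iou(p, groundtruth[i]))
        if iou(p, groundtruth[best]) >= threshold:
            unmatched.remove(best)
            tp += 1
    return tp
```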