Although state-of-the-art text recognition systems perform well on
printed text, accurately recognising handwritten text remains a
challenge. Older handwritten text might contain unique writing styles
and may have deteriorated. Nevertheless, such cases can still provide
valuable information relating to the writing style. Text written by the
same author could be automatically clustered based on visual similarity
and used to identify the collection and reduce manual validation.
Besides text, secondary data hidden in the handwriting, ink colour,
mounting paper, label shape and printed label decorations (Figs. \ref{312487}, \ref{478509} and \ref{295237}) can
be used to determine the origin and history of specimens. Image analysis by itself
can be sufficient to cluster specimens for particular purposes, for
example, to group specimens from a particular expedition. These
clusters can also support further analysis of images that share
common characteristics.
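
The clustering idea above can be illustrated with a minimal sketch. It assumes that each label image has already been reduced to a numeric feature vector by some image descriptor, which is not specified here. The function names \texttt{cosine\_similarity} and \texttt{cluster\_by\_similarity} and the threshold value are hypothetical, and single-linkage grouping on cosine similarity is only one simple choice among many.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_by_similarity(features, threshold=0.9):
    """Assign a cluster label to each feature vector.

    Two items end up in the same cluster when they are linked by a chain of
    pairs whose similarity is at least `threshold` (single linkage).
    The threshold is an assumption and would need tuning on real data.
    """
    n = len(features)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue  # already assigned to an earlier cluster
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and cosine_similarity(features[i], features[j]) >= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels
```

Clusters produced this way are only candidates for, say, a shared author or expedition, and would still need checking against the specimen metadata.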
Rulers and colour checkers
Other elements often seen on digitised images of collection objects are
rulers, scale bars and colour checkers. These come in many different
types and sizes, as different institutions often customise them based on
the requirements of the imaging campaign. Colour checkers are used to
validate the fidelity of the colours of the specimen image, while a
ruler provides a reference for the actual size of the specimen
relative to the image size. Especially when digitising with a digital
camera, it can be complex to calculate actual dimensions from the
image, as these depend on the camera lens and individual camera
parameters. As it is time-consuming to measure each specimen manually,
specimen dimensions are often not included as metadata. Therefore, the
detection of rulers and colour checkers on digital images can prove
useful for estimating the actual specimen size and correcting the colour balance.
A generic object detection or instance segmentation model can be trained
to detect these common objects. If all the rulers in a collection are of
a fixed size, the length of the detected ruler can be used to calculate
a transformation from pixels to the ruler’s unit of measurement (e.g.
cm, mm). This transformation can then be combined with specimen
segmentation models to automatically extract specimen dimensions and
traits \citep{triki_deep_2021-1}. However, when rulers are not of a
uniform size, the distance transformation needs to be estimated by
calculating the pixel distance between the measurement stripes or bars
on the ruler \citep*{bhalerao_ruler_2014}. To extract the specific unit
of measurement, the text denoting the unit on the ruler can be
recognised, or additional metadata about the specimen can be used to
infer it.
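
The two cases described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the cited works. All function names are hypothetical. The ruler length in pixels is assumed to come from an object detector, and the intensity profile is assumed to be a one-dimensional list of grey values sampled along the ruler axis. The tick spacing is estimated here from the autocorrelation of that profile, a simple stand-in for the more robust estimator of \citet{bhalerao_ruler_2014}.

```python
def pixels_per_unit(ruler_length_px, ruler_length_units):
    """Scale factor (pixels per unit) from a detected ruler of known physical length."""
    if ruler_length_px <= 0 or ruler_length_units <= 0:
        raise ValueError("lengths must be positive")
    return ruler_length_px / ruler_length_units


def px_to_units(length_px, scale):
    """Convert a pixel length to physical units using a pixels-per-unit scale."""
    return length_px / scale


def tick_spacing_px(profile):
    """Estimate the tick period, in pixels, of a 1-D intensity profile.

    Assumption: the strongest autocorrelation peak after the first negative
    lag corresponds to the spacing between neighbouring tick marks.
    """
    n = len(profile)
    if n < 4:
        raise ValueError("profile too short")
    mean = sum(profile) / n
    centred = [v - mean for v in profile]
    max_lag = n // 2
    # Autocorrelation for lags 0..max_lag.
    ac = [
        sum(centred[i] * centred[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]
    # Skip the initial peak around lag 0 by starting at the first negative value.
    first_negative = next((lag for lag, v in enumerate(ac) if v < 0), None)
    if first_negative is None:
        raise ValueError("no periodic structure found")
    return max(range(first_negative, max_lag + 1), key=lambda lag: ac[lag])


# Hypothetical example: a synthetic profile with one tick every 10 pixels.
profile = [1.0 if i % 10 == 0 else 0.0 for i in range(100)]
spacing = tick_spacing_px(profile)           # 10 pixels between ticks
scale = pixels_per_unit(spacing, 1.0)        # assuming ticks are 1 mm apart
specimen_width_mm = px_to_units(250, scale)  # 25.0 for a 250-pixel-wide mask
```

The unit of the result is only as reliable as the assumed tick interval, which is why the unit still has to be read from the ruler or inferred from metadata.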
Finding stamps and signatures
Specimens are often labelled with rubber stamps and occasionally printed
or embossed with crests that indicate provenance or ownership (Fig. \ref{295237}).
Examples include the stamps of botanical exchange clubs (Fig. \ref{295237}C, \ref{295237}E),
which operated in Europe, and particularly the United Kingdom, from the
middle of the 19th century into the 1930s \citep{groom_herbarium_2014}. Tens of thousands of specimens were exchanged this way and
found their way into collections around the world. If a specimen was
distributed through a botanical exchange club, this implies that duplicates of the
specimen existed and it constrains the dates within which the specimen
was collected. Although stamps usually contain some text, they are often
circular or oval, making them difficult for standard OCR engines to read.