Material & Methods
Bait development
For an overview over the whole bait development process, see Figure 1.
Transcriptome assemblies
To target expressed genomic regions relevant for investigating signature
of selection, genome-guided transcriptome assemblies were used for bait
design. Raw Illumina RNA-Seq data from blood, epidermis, subdermis,
muscle and heart tissue of the white shark (Carcharodon
carcharias ) were obtained from Richards et al. (2013), Marra et al.
(2017) and Marra et al. (2019). The data was quality checked with FastQC
(v. 0.11.5, Babraham Bioinformatics 2010) and detected adapters trimmed
via Trimmomatic (v. 0.39, Bolger et al. 2014). Paired end (PE) data were
filtered for TruSeq3-PE-2 adapters in palindrome mode, minimum adapter
length of 1 bp and retention of both reads, with the addition of
TruSeq2-SE-adapter clipping for data obtained from Marra et al. (2019).
Single end (SE) data were filtered for TruSeq3-SE adapters.
Genome-guided transcriptome assemblies were constructed via Scallop
(Shao and Kingsford 2017) and StringTie (Pertea et al. 2015; Kovaka et
al. 2019). For assembly, RNA reads were aligned to the white shark
reference genome (Marra et al. 2019) in four data sets:
Specimen-specific and data type-specific (SE vs. PE), via the
splice-aware mapper STAR (v. 2.7.5a_2020-06-29, Dobin et al. 2013).
Genome indices were generated with 150 bp splice junction overhang and
usage of the parent GFF attribute to assign the exons to the
transcripts. Sorted bam-files were then generated in basic two-pass
mode, including the XS field. Splice junction overhang was again 150 bp
for all data sets and the parent GFF attribute was used.
Transcripts of the individual, aligned data sets were assembled in
StringTie (v. 2.1.4, Pertea et al. 2015; Kovaka et al. 2019), and the
results were merged via StringTie’s merge mode. Furthermore, Scallop (v.
0.10.5, Shao and Kingsford 2017) was used to assemble transcripts of
each aligned data set, using default options for unstranded library
types. The resulting four Scallop-assemblies were then merged via TACO
(v. 0.7.3, Niknafs et al. 2017), with no minimum length or expression
filtering applied.
Expressed exon sequences were extracted from gtf to fasta-format via the
software gffread (v. 0.12.1, Pertea and Pertea 2020). Assemblies were
then compared for quality via the tool rnaQUAST (v. 2.1.0, Bushmanova et
al. 2016), using the PE read data. Estimation of the presence of
Benchmarking Universal Single-Copy Orthologs (BUSCO) for vertebrates
(Seppey et al. 2019) were included in the analysis. Genes and
transcripts were not inferred in the process, as provided via the genome
annotation. After analysis, the Scallop/TACO-derived assembly was chosen
for further downstream analysis because the assembly covered more
annotated genes as well as vertebrate BUSCOs than the StringTie-derived
assembly.
Sequence selection for probe set
design
To reduce the probability of inclusion of paralogues into the bait
design, multiple filtering steps were applied to the assembled
transcriptomes. For this, transcripts were aligned to the genome and to
themselves with minimap2 (v. 2.17-r941, Li 2018). First, the transcripts
were aligned to the genome in splice-aware mode, with parameters that in
theory allow for up to 33%, but in practice for up to 15%, sequence
divergence (personal communication by the software’s authors).
Paralogous genes with 15% divergence or less should therefore be
covered by alignments from the same transcript. Up to 1 Mio secondary
alignments to the genome were included to allow for hits to putative
paralogous genomic loci. In a second step, all transcripts were aligned
to each other, not allowing for splicing but instead for up to 20% of
sequence divergence, and again including up to 1 Mio secondary hits to
catch as many paralogues as possible.
Continuing from the alignment output, sequences for the target capture
probe design were selected via a custom-written Python 3 script (v.
3.8.5, Rossum and Boer 1991). In a first step, potential paralogous
sequences were detected, to be later excluded from the probe set. First,
all transcripts that aligned to more than one locus in the genome were
marked as possible paralogues. Furthermore, two transcripts mapping to
each other were also marked as possible paralogues if the two
transcripts did not overlap in their mapping locus in the genome. If
one, or both of two transcripts mapping to each other, did not map to
the genome at all, both were declared as possible paralogues. Finally,
all transcripts that mapped to a potential paralog were themselves
marked as potential paralogues in a last, repetitive step.
Annotated genes were then selected to be included in the probe design if
they corresponded to a transcript, therefore being acknowledged as
expressed, but excluded if they corresponded to a transcript previously
identified as a potential paralogue, therefore excluding putative
paralogue gene copies. Next, to avoid genomic linkage between loci
included in our design, all selected genes from the same scaffold were
filtered for being located at least 250,000 bp apart. Then, sequences
were cut to their first 200 bases and sequences with two or more
consecutive missing nucleotides were discarded. 49 additional genes
annotated as part of a major histocompatibility complex (MHC) were
included in the bait set with their whole assembled nucleotide sequence
(Marra et al. 2019). Another four full genes suggested to be under
positive selection in some shark species were included, two of them
possibly involved in reproduction (VAMP4 and TCTEX1D2 ) and
two in brain development (YWHAE and ARL6IP5 ) (Swift et al.
2016).
Baits were designed as a myBaits Custom DNA-Seq kit by Arbor Biosciences
(Ann Arbor, USA) with 80-mer probes and a tiling density of 3x, as
recommended for fresh as well as degraded DNA.
Capture probe testing
DNA extraction & library
preparation
To test the efficiency of the designed baits, three individuals each of
bull sharks (Carcharhinus leucas ), tope (Galeorhinus
galeus ), porbeagle (Lamna nasus ), shortfin mako (Isurus
oxyrinchus ) and spurdog (Squalus acanthias ), as well as nine
basking sharks (Cetorhinus maximus ) and twelve white sharks
(Carcharodon carcharias , Table 1 and Supplementary File 1) were
used.
All sharks were sampled either via muscle plugs or fin clips, except for
the basking sharks from which skin mucus swabs were taken. Genomic DNA
was extracted following a standard phenol-chloroform extraction approach
(Sambrook et al. 1989) or with the DNeasy Blood & Tissue Kit (QIAGEN
N.V., Venlo, The Netherlands), including an over-night incubation step.
In the following, DNA was fragmented via sonication with a M220
Focused-ultrasonicator (Covaris, LLC, Woburn, USA) to a target peak
length of 350 bp and libraries were prepared using the NEBNext® Ultra™
II DNA Library Prep Kit for Illumina (New England Biolabs, Inc.,
Ipswich, USA) and 8 nucleotide long NEBNext® Multiplex Oligos for
Illumina (New England Biolabs, Inc., Ipswich, USA) with 530 ng as a
starting amount. Library quality and quantity was checked on an Agilent
Tape Station 4150 (Agilent Technologies, Inc., Santa Clara, USA). Three
target gene capture reactions were performed according to manufacturer´s
high sensitivity hybridization protocol v. 5.01, by pooling twelve
individual per capture reaction and using equal starting amount of each
individual library within each capture reaction (namely 500 ng, 400 ng,
and 300 ng). The white shark samples were pooled in one single pool,
while the individuals of the other species were randomly mixed between
two pools, so that each reaction contained representatives of each of
the six species. Hybridization and wash temperatures were 60 °C to
account for bait- and target sequence divergence, and hybridization was
performed over 72 hours. The captured libraries were sequenced together
on an Illumina NextSeq 500 platform (Illumina, Inc., San Diego, USA)
with a 2x150 bp mid-output NextSeq 500/550 v. 2.5. kit (Illumina, Inc.,
San Diego, USA). After sequencing, reads were demultiplexed with
bcl2fastq (v. 2.20.0.422, Illumina, Inc., San Diego, USA).
Data analysis
Sequencing data was analyzed for technical artefacts with FastQC (v.
0.11.9, Babraham Bioinformatics 2010). Poly-G tails were then trimmed
with cutadapt (v. 3.5, Martin 2011) via the option “nextseq-trim”, and
a phred threshold of 20. Trimmomatic (v. 0.39, Bolger et al. 2014) was
then used in palindrome paired end mode to remove internally provided
TruSeq3-PE-2 adapters and low-quality base calls, with a maximum seed
mismatch count of 2, a palindrome clip threshold of 30, a simple clip
threshold of 10, and minimum detected adapter length threshold of 1.
Both reads were retained after adapter clipping. Reads were then cleaned
of low quality, with a phred threshold of 20 for clipping bases at both
read ends, and for trimming within a sliding window of 4 bp. All reads
with a minimum length of 36 bp were retained.
Read mapping, cleaning and Single Nucleotide Polymorphisms (SNPs)
calling was performed following a pipeline developed elsewhere (Choquet
et al. 2019; Choo et al. 2020), which was adapted to the needs of this
study. Reads were mapped to the white shark reference genome (NCBI
accession number GCF_017639515.1, Rhie et al. 2021; Sayers et al. 2022)
using BWA-MEM (v. 0.7.17-r1188, Li 2013). Reads with ambiguous
alignments were then removed via SAMtools (v. 1.10, Danecek et al.
2021), with filtering for reads with multiple mapping loci (XA flag),
reads that produced chimeric alignments (SA flag), non-mapping reads,
reads with secondary or supplementary alignments, and duplicated reads
(flag 3332, “UNMAP,SECONDARY,DUP,SUPPLEMENTARY”). In addition,
duplicate reads were removed and all remaining reads grouped into one
single read group via Picard (v. 2.26.6, Broad Institute 2019).
SNPs were then called and processed using the Genome Analysis Toolkit
(GATK, v. 4.2.3.0, McKenna et al. 2010), following the project’s best
practice recommendations (Van der Auwera and O’Connor 2020) and sped up
using GNU Parallel (v. 20161222, Tange 2011). Variants were called for
each individual with HaplotypeCaller and afterwards combined for each
species separately with CombineGVCFs. Species-specific joined genotyping
was performed using the GenotypeGVCFs tool, allowing for a maximum of 18
alternative alleles. SNPs were then extracted with SelectVariants, and,
after visualization of quality metrics via the VariantsToTable tool and
R (v. 4.1.2, R Core Team 2019), hard filtered with VariantFiltration.
Lenient filtering thresholds were used, with one filter expression of
QualByDepth (QD) < 2.0, FisherStrand (FS) > 60.0,
RMSMappingQuality (MQ) < 50.0, MappingQualityRankSumTest
(MQRankSum) < -5.0 or ReadPosRankSumTest (ReadPosRankSum)
< -5.0, in accordance with parameters of previous studies by
our group (Choo et al., 2020; Choquet et al., 2019).
Afterwards, SNPs were filtered to retain only biallelic SNPs, using a
custom Python3 script (Supplementary File 2, Python3 v. 3.8.5, van
Rossum and de Boer 1991) and PLINK2 (v. 2.00a3, Chang et al. 2015) for
allele frequency computation. In addition, only SNPs present in at least
80% of the genotypes were retained, as well as SNPs with a minimum read
coverage of 5, using VCFtools (v. 0.1.16, Danecek et al. 2011).
Furthermore, to allow for better comparison between data sets, SNP
numbers were calculated for target regions only. For this, the bait
sequences were aligned to the white shark reference genome from NCBI
(accession number GCF_017639515.1, Rhie et al. 2021; Sayers et al.
2022) with Bowtie 2 (v. 2.4.5, Langmead and Salzberg 2012) in end-to-end
mode. All target loci with multiple alignments were then excluded
(“XS:i” flag). The resulting coordinate set, produced with BEDTools
(v. 2.27.1, Quinlan and Hall 2010), was used in the filtering steps
described above to keep the detection of SNPs within the actual target
regions.