II. MATERIALS AND METHODS
IIA. CROSSING AND SAMPLE
COLLECTION
Gynogenetic double haploids were produced by fertilizing eggs with UV
irradiated sperm, then pressure shocking embryos immediately following
the first mitotic division (as described in
Thorgaard et al. 1983; Limborg et
al. 2016). Double haploid (DH) offspring were created at Pendill’s Creek
National Fish Hatchery using eggs and sperm collected from captive adult
Lake Trout from the Seneca Lake brood stock. Due to low survivorship of
DH offspring (Komen & Thorgaard
2007), we tested multiple UV and pressure shock treatments on eggs from
five different females. Batches of 900 eggs from each female were
fertilized with sperm that was irradiated for 140, 280, or 1,260
seconds. Each batch was then split and sub-batches were pressure shocked
at 11,000 PSI for five minutes at either 6.5, 7, 7.5, 8, 8.5, 9, 9.5, or
10 hours post-fertilization. One batch of 900 eggs from each female was
also exposed to a control treatment which involved no sperm irradiation
or pressure shock. Embryos were incubated in heath trays at ambient
temperature until eye-up stage (E36 per Balon 1980),
with dead embryos being removed from trays on a daily basis. Individuals
surviving past post-embryo stage (sensuMarsden et al. 2021) were
euthanized using a lethal dose of tricaine methanesulfonate (MS-222) and
flash frozen in liquid nitrogen. Prior to sequencing and assembly, we
verified that the DH individual chosen for sequencing was completely
homozygous at 15 microsatellite loci that are typically highly
heterozygous in the Seneca Lake hatchery population
(Valiquette et al. 2014). One of
the two individuals was selected for high molecular weight DNA
extraction and long-read sequencing.
IIB. LABORATORY METHODS
A long-read sequencing library was prepared for the selected individual
using the SMRTbell Template Prep Kit 1.0 (Pacific Biosciences, Menlo
Park, California), with the optional DNA Damage Repair step after size
selection. Size selection was made for >10 kb using a Blue
Pippin instrument (Sage Science, Beverly, Massachusetts) according to
the manufacturer recommended protocol for 20kb template preparation. 5ug
of concentrated DNA was used as input for the library preparation
reaction. Library quality and quantity were assessed using a genomic DNA
Tape Station assay (Agilent, Santa Clara, California), as well as Broad
Range and High Sensitivity Qubit fluorometric assays (Thermo Fisher,
Waltham, Massachusetts). Single-Molecule Real Time sequencing was
performed on the Pacific Biosciences Sequel instrument at the McGill
Genome Centre (McGill University, Montreal, Canada,
https://www.mcgillgenomecentre.ca/) using an on-plate concentration
ranging from 1.5-7.5pM and the Sequel Sequencing Kit 2.0 with diffusion
loading. 38 SMRTCells were run with 600-minute movies and two SMRTCells
were run with 1200-minute movies.
Hi-C proximity ligation libraries were generated to aid with assembly
scaffolding. Two libraries were prepared from spleen and muscle tissue
using library preparation kits manufactured by Kapa Biosystems
(Wilmington, Massachusetts) and Lucigen (Middleton, Wisconsin),
respectively. Each Hi-C library was spiked into a portion of an Illumina
HiSeq lane in order to assess how effectively reads could be mapped
against the draft contig assembly. Genpipes version 3.1.5
(Bourgey, Dali et al. 2019) and
HiCUP version 0.7.2 (Wingett,
Ewels et al. 2015) were used to map Hi-C sequencing reads. The Hi-C
library prepared using muscle-derived DNA and prepared using the
Arima-Hi-C Lucigen Kit (Arima Genomics, San Diego, CA), was selected for
further sequencing. This kit employs a restriction enzyme cocktail that
digests chromatin at N^GATC and G^ANTC sequence motifs.
DNA was also extracted from four Lake Trout from the Seneca Lake, Isle
Royale, and Green Lake hatchery broodstocks using MagAttract HMW DNA
extraction kits (Qiagen, Hilden, Germany) for the purpose of generating
Illumina shotgun sequencing data. Sequencing reads from Seneca origin
individuals were later used for contig polishing and correction
(described below in Assembly and Scaffolding ). Libraries were
prepared for these individuals using 100ng of input DNA and the NEBNext
Ultra Library Preparation Kit for Illumina (New England Biolabs,
Ipswich, Massachusetts). Libraries sheared to approximately 400 bp using
a Covaris M220 Ultrasonicator, amplified for eight cycles, and
quantified using Quant-It Picogreen dsDNA assays (Thermo Fisher,
Waltham, Massachusetts) run in triplicate. Fragment size was assessed
using a genomic DNA Tape Station assay (Agilent, Santa Clara,
California). Libraries were multiplexed in equal concentrations and
sequenced in two HiSeqX lanes using paired end 2x150 format by the
Novogene Corporation (Beijing, China).
IIC. ASSEMBLY AND
SCAFFOLDING
Assembly was carried out using the polished_falcon_fat assembly
workflow run using the SMRT Analysis v3.0 pbsmrtpipe workflow engine
provided with an installation of SMRT Link v5.0
(smrtlink-release_6.0.0.47841;
https://github.com/PacificBiosciences/pbsmrtpipe). Read metadata
were extracted using the SMRT Analysis v3.0 dataset tool with the merge
option. Sequencing read metadata, pipeline settings, and an output
directory were specified for the polished_falcon_fat pipeline option.
Default assembly settings were used except genome size
(HGAP_GenomeLength_str) was set to 3 gigabases (GB), seed coverage
(HGAP_SeedCoverage_str) was set to 40X, and the minimum read length to
use a read as a seed (HGAP_SeedLengthCutoff_str) was set to 1000.
Multiple compute settings were also changed. The resulting assembly
settings file, read metadata file, and commands used to run the pipeline
are available in Supplemental Material 1 -Assembly Parameters).
The polished_falcon_fat workflow uses FALCON assembly algorithm
(Chin et al. 2013) and the
Quiver/Arrow consensus tool
(https://github.com/PacificBiosciences/GenomicConsensus) to
generate a polished contig assembly. The Falcon method operates in two
phases: First, overlapping sequence reads were compared to generate
accurate consensus sequences with read N50 greater than 10.9Kb. Next,
overlaps between the corrected longer reads were used to generate a
string graph. The graph was reduced so that multiple edges formed by
heterozygous structural variation were replaced to represent a single
haplotype. Contigs were formed by using the sequences of nonbranching
paths. Two supplemental graph cleanup operations were applied to improve
assembly quality by removing spurious edges from the string graph: tip
removal and chimeric duplication edge removal. Tip removal discards
sequences with errors that prevent 5′ or 3′ overlaps. Chimeric
duplication edges may result from the raw sequence information or during
the first sequence cleanup step and artificially increase the copy
number of a duplication. In a second and final workflow stage, the
polished_falcon_fat workflow used the Arrow consensus tool to perform
error correction on the assembly and generate an initial polished
assembly. The resulting contigs were passed through a second round of
error correction using Pilon in order to resolve SNP, indel, and local
assembly errors before proceeding with scaffolding
(https://github.com/broadinstitute/pilon). The Illumina paired-end
sequencing dataset described above was used as input for Pilon after
removing adapters and trimming reads using the sliding window approach
implemented in Trimmomatic v0.32
(Bolger et al. 2014).
We adopted a multifaceted scaffolding approach leveraging information
from Hi-C sequencing and a high-density linkage map for Lake Trout
(Smith et al. 2020). Hi-C reads were mapped to Pilon corrected contigs
with default setting using the Arima Genomics Mapping pipeline (Arima
Genomics, https://github.com/ArimaGenomics/mapping_pipeline), which
included four primary steps. First, forward and reverse reads were
mapped to the reference genome using bwa version 0.7.17 (Li 2013)
separately. Next, the 5’ end of the mapped reads were trimmed. Samtools
version 1.9 (Li, Handsaker et al.
2009) was then used to filter reads with mapping quality (MAPQ) less
than 10. Finally, Picard version 2.17.3
(https://broadinstitute.github.io/picard/) was used to add read group
information and mark duplicate reads. The resulting BAM file was used as
input for SALSA v2.2 (Ghurye et
al. 2017) run with default settings (three iterations). We also tested
Salsa2 using five iterations and compared results with those produced
using default settings by calculating Spearman’s rank order correlation
coefficients between the order of loci on the Lake Trout linkage map
(Smith et al. 2020) and the order of loci on the 50 largest scaffolds.
Linkage mapped RAD contigs were aligned to the reference assembly using
minimap2 using the -asm5 option. RAD contigs with mapping qualities less
than 60 were removed before calculating correlation coefficients using
the cor function and the method argument set to
“spearman.”
Additional scaffolding was carried out using Chromonomer v1.13
(Catchen et al. 2020). The
assembly was initially scaffolded using default settings, which yielded
chromosome length scaffolds with a high degree of concordance with the
linkage map; however, structural differences between the linkage map and
scaffolds were apparent on six chromosomes. In order to resolve these
inconsistences, we aligned the full set of PacBio subreads to the
assembly using Minimap2 (Li 2018)
using the preset option for PacBio data. The resulting bam file was
sorted, indexed, and per-base coverage was calculated for all positions
using samtools depth with the –a option. We then ran a second round
of Chromonomer using the –rescaffold, –depth, and depth_stdevs =
2 options, which allowed for gaps to be opened in contigs if the
site-specific depth within a sliding window of 1000 base pairs was
greater than 2 standard deviations from the mean, suggesting an assembly
error. This resulted in an assembly with improved concordance with the
linkage map; however, linkage group 41 still exhibited a large inversion
relative to the scaffolds. We determined the approximate location of
this assembly error by identifying the pair of linkage mapped loci for
which the level of discordance between the linkage map and assembly was
maximized. The scaffold was manually broken and reoriented using an
existing gap that existed between these two loci.
Gaps were filled using PBJelly from PBSuite v15.8.24 (English et al.
2012). All PacBio reads were aligned to the draft assembly using
Minimap2 using the -pb preset option and reads mapping within 5000 base
pairs of a gap were retained for gap filling using bedtools intersect
(Quinlan and Hall 2010). Retained
reads were re-mapped with Blasr v5.3.2
(Chaisson et al. 2012) using the
options –minMatch 11, –minPctIdentity 75, –bestn 1,
–nCandidates 10, –maxScore -500, and –fastSDP. The
“maxWiggle” argument was set to 100 kilobases (KB) for the PBJelly
assembly stage in order to account for gaps of unknown length. After
filling gaps, we corrected single nucleotide and short indel errors by
running 3 iterations of Polca (distributed with MaSuRCA v. 3.4.2;
Zimin and Salzberg 2020) using
Illumina data from a Seneca strain female as input. Default settings
were used except alignments overlapping gaps were removed from bam files
using bedtools intersect (Quinlan and Hall 2010) prior to running the
Polca variant calling step.
Illumina paired end data from the same individual used for genome
polishing and PacBio data from one SMRTcell were aligned to the Arctic
Char (Salvelinus alpinus ) mitome (RefSeq: NC_000861.1) in order
to obtain reads useful for assembling the Lake Trout mitome. Reads were
aligned using Minimap2 using the sr and map-pb present options for
short-reads and long-reads, respectively. Reads aligning to the Arctic
Char mitome were extracted from original fastq files using seqtk subseq
(https://github.com/lh3/seqtk) and hybrid assembly was conducted
using Unicycler v0.4.8 (Wick et al. 2017) using the settings
–min_fasta_length 15000 and –keep 0. Unicycler implements a
hybrid-assembly approach using Spades
(Bankevich et al. 2012), SeqAn
(Döring et al. 2008), and Pilon.
First, Spades (v3.13.1 here) was used to assemble Illumina short-reads
and contigs with graph coverage less than half the median coverage were
removed due to potential contamination from the nuclear genome. Contigs
were then scaffolded using long-reads and SeqAn (Döring et al. 2008) was
used to generate gap consensus sequences. Finally, Pilon was used to
resolve assembly errors using short-read alignments as input.
IIG. ASSEMBLY QUALITY
CONTROL
We used multiple approaches to assess the accuracy, contiguity, and
completeness of the genome assembly. First, we determined the proportion
of the genome that was recovered in our assembly by comparing total
assembly size with an estimate of genome size based on the distribution
of k-mer frequencies from Illumina paired-end 2x150 data generated using
DNA from a Seneca strain female. The frequency of all 19mers in the read
data was calculated using the count function in Jellyfish v2.2.6
(Marcais and Kingsford 2012) with
the options -m 19 and -C. K-mer counts were then exported to the
histogram format using the histo function. This file was used as
input for GenomeScope v1.0 (http://qb.cshl.edu/genomescope/;
Vurture et al. 2017) with read
length set to 150 bp and k-mer length set to 19.
Basic assembly statistics were calculated using the program
summarizeAssembly.py from PBSuite v15.8.24
(English et al. 2012). Statistics
included total assembly size, contig and scaffold N50s, and minimum and
maximum contig and scaffold lengths. Assembly statistics were calculated
with and without gaps. Contig and scaffold N50s and counts were obtained
for 14 additional salmonid assemblies from NCBI for comparison. Single
base consensus accuracy was estimated after each iteration of polishing
with Polca.
Next, we calculated percentages of complete singleton, complete
duplicated, fragmented, and missing Benchmarking Single Copy Orthologs
(BUSCOs) for seven chromosome-level salmonid assemblies and compared
these with scores for the Lake Trout assembly discussed here. These
included genomes for Brown Trout (Salmo trutta ;
GCA_901001165.1), European Whitefish (Coregonus sp. balchen ;
GCA_902810595.1; De-Kayne et al.
2020), Atlantic Salmon (Salmo salar; GCA_000233375.4; Lien et al.
2016), Coho Salmon (Oncorhynchus kisutch ;
GCA_002021735.1), Rainbow Trout (Oncorhynchus mykiss ;
GCA_002163505.1; Pearse et al. 2019), Chinook Salmon
(Oncorhynchus tshawytscha ; GCA_002872995.1; Christensen et al.
2018b), and Dolly Varden (Salvelinus malma ;
GCA_002910315.1; Christensen et al. 2018a). It should be noted that the
assembly originally produced for Arctic Char (GCA_002910315.1;
Christensen et al. 2018a, referred
to as the Dolly Varden assembly here) was later found to be from a Dolly
Varden or potentially a Dolly Varden – Arctic Char hybrid (see
Shedko et al. 2019 and Christensen
et al. 2021). BUSCO scores were also calculated for the Northern Pike
genome (Esox lucius ; GCA_000721915.3), a member of the
order Salmoniformes that is commonly used as a pre-Ss4R outgroup
species. BUSCO scores were calculated using BUSCO v4.0.6, the
actinopterygii_odb10 database (created November 20th,
2019), and the -genome option.
Finally, we aligned the linkage mapped contigs from Smith et al. (2020)
to the final assembly and calculated Spearman’s rank order correlation
coefficients between physical mapping locations and the order of loci
along linkage groups. Linkage mapped contigs were aligned to the
reference assembly using minimap2 using the -asm5 preset parameters and
the resulting sam file was filtered to exclude contigs with mapping
qualities less than 60. Correlation coefficients were calculated using
the cor function in R (R
Core Team 2017) with the method argument set to “spearman.”
Correlation coefficients were then converted to absolute values using
the abs function in order to compare chromosomes and linkage
groups with reversed orientations.
III. REPETITIVE DNA
A custom repeat library was created using RepeatModeler v2.0.1
(Flynn et al. 2020) and repeats
were subsequently classified using RepeatClassifier
(Smit et al. 2015). Repeats were
then masked using RepeatMasker (Smit et al. 2015) and the output of
RepeatMasker was used to determine the genome-wide abundance of
different repeat families and the relative density of repeat types
across chromosomes. The density of the most abundant repeat type
(Tcl-mariner) was visualized across chromosomes using the R-package
circlize (Gu et al. 2014; Figure
2).
IIK. HOMEOLOG IDENTIFICATION AND
SYNTENY
We performed a self-vs-self synteny analysis using SyMap v5
(Soderlund et al. 2006; Soderlund
et al. 2011) to identify Lake Trout homeologs resulting from the
Salmonid specific autotetraploid event
(Macqueen and Johnston 2014; Lien
et al. 2016). Prior to running SyMap, we hard-masked the genome using
RepeatMasker v4.1.0 (Smit et al. 2015) using our custom repeat library
as input and RMblast as the search engine (-e ncbi). Nucmer was used for
SyMap alignments and options were set to –min-dots = 30, top_n = 2,
and merge_blocks = 1. We then used Symap to identify blocks of synteny
between Lake Trout and Dolly Varden, Rainbow Trout, and Atlantic Salmon.
Alignments were conducted using Promer, and we used the options
min_dots = 30, top_n = 1, merge_blocks = 1, and
no_overlapping_blocks = 1. Results from self-vs-self synteny analysis
were visualized using the R-package circlize (Gu et al. 2014). Results
from the species-vs-species synteny analysis were visualized using the
Chromosome Explorer option in Symap v5 (Supplemental Material 4 –
Syntenic Blocks and Between Species Circos Plots ).
IIF. RNA SEQUENCING AND GENE
ANNOTATION
RNA samples were obtained from the offspring of Seneca Lake hatchery
strain fish held within the Ontario Ministry of Natural Resources and
Forestry (OMNRF) hatchery system. Offspring were produced using four
males and four females in a full factorial mating cross, by dry-spawning
anesthetized fish (anesthetic: 0.1 g L-1 MS-222; Aqua Life, Syndel
Laboratories Ltd., B.C., Canada). Eggs (140 mL) were stripped from each
female, divided evenly among four jars, and fertilized by pipetting milt
directly onto them. After fertilization, embryos were transported to the
Codrington Fish Research Facility (Codrington, Ontario, Canada) where
they were transferred from the jars into perforated steel boxes with one
family per box. These boxes were contained in flow-through tanks
receiving freshwater at ambient temperature (5-6℃) and natural
photoperiod under dim light. When the embryos fully absorbed their yolk
sacs and were ready to feed exogenously (i.e. free embryos;
approximately March 2016), 14 individuals from each family were randomly
selected and split into two groups of seven, then transferred into one
of four larger (200 L) tanks.
Tissue sample collection occurred between June 28 to August 9, 2016.
Each fish was euthanized in a bath of 0.3 g L-1 of MS-222 and dissected
to remove the whole liver. The liver was gently blotted on a lab wipe
and stored in RNAlater (Invitrogen, Thermo Fisher Scientific) for 24-48
hours at room temperature. RNALater was pipetted from the liver tissue
and the samples were stored at -80℃ until RNA isolation. Liver tissues
were homogenized individually in 2 mL Lysing Matrix D tubes (MP
Biomedicals) with 1 mL of Trizol reagent (Invitrogen, Thermo Fisher
Scientific). RNA was extracted from the homogenate using
phenol-chloroform extraction (Chomczynski & Sacchi, 2006). RNA was
precipitated with RNA precipitation solution (Sambrook & Russel, 2001)
and isopropanol, and washed with 75% ethanol. RNA samples were
resuspended in nuclease-free water (Thermo Fisher Scientific). The
purity and concentration of the RNA were initially determined using a
NanoDrop-8000 spectrophotometer. RNA quality was also assessed using a
Bioanalyzer (Agilent) and resulting RNA integrity numbers (RIN). All RNA
samples met our minimum RIN threshold of 7.5.
RNA sequencing was performed over two years. Twenty-four samples were
sent to The Centre for Applied Genomics (Sick Kids Hospital, Toronto,
Ontario, Canada) in 2018, and another 30 samples were sent to the Centre
d’expertise et de services Génome Québec (Montreal, Quebec, Canada;
https://cesgq.com/) in 2020. cDNA libraries were produced by enriching
the poly(A) tails of mRNA with oligo dT-beads using the NEBNext Ultra II
Directional polyA mRNA Library Prep kit (New England Biolabs; Ipswich,
Massachusetts). The group of 24 individuals was sequenced in 2.5
Illumina HiSeq 2500 lanes using 2X126 bp paired end reads. The
additional thirty individuals were sequenced in three Illumina HiSeq
4000 lanes using 2X126 bp paired end reads. Data were deposited in
sequence read archives associated with BioProject PRJNA682236. These
sequencing reads, along with those from two previous RNAseq experiments
(Goetz et al. 2010; Goetz et al.
2016), were used as input for NCBI’s Eukaryotic Genome Annotation
Pipeline (Thibaud-Nissen et al. 2016).
IIE. RECOMBINATION RATES AND
CENTROMERES
Sex averaged recombination rates were estimated across chromosomes using
the sliding window interpolation approach implemented in MareyMap
(Rezvoy et al. 2007). Restriction
site associated DNA (RAD) contigs from the Lake Trout linkage map (Smith
et al. 2020) were mapped to chromosomes using minimap2 using the -asm5
preset option and reads with mapping qualities less than 60 were
removed. At this point, RAD loci overlapping centromere mapping
intervals for each linkage group were extracted and the centromere
center was considered to be the mean mapping position for centromere
associated RAD tags. Centromere positions were visualized using the
R-package circlize (Gu et al. 2014).
In order to remove contigs with anomalous mapping positions that could
bias recombination rate estimates, we fit a loess model describing
linkage map position as a function of physical position for each
chromosome, extracted model residuals, and removed markers with
residuals that were greater than one standard deviation from the mean.
Loess models were fit using the loess function in R with the span
argument set to 0.2 and the degree argument set to 2. The remaining
markers were output to MareyMap format and were manually curated using
MareyMap Online (Siberchicot et
al. 2017). A sex averaged recombination map was calculated using sliding
window interpolation and exported from the program (Supplemental
Material 3 – Recombination Map ).