III. RESULTS

IIIA. SEQUENCING, ASSEMBLY, AND SCAFFOLDING

Of the 13,500 embryos exposed to UV irradiation and pressure shock treatments, two individuals survived beyond the post-embryo stage. The individual selected for assembly was homozygous at all 15 genotyped microsatellite loci, suggesting that the chromosome set manipulations successfully induced doubled haploidy. We proceeded with PacBio sequencing and produced a dataset with an estimated genome coverage of 89X, of which 53X was provided by reads longer than 12 KB.
The Falcon-based assembly pipeline and polishing with Arrow and Pilon yielded an initial assembly with 8,321 contigs, a total length of 2.3 GB, a contig N50 of 1.3 megabases (MB), and a maximum contig length of 19.6 MB. Comparing the correlation between the Lake Trout linkage map and Hi-C scaffolds indicated that three iterations of Salsa (the default setting) produced moderately large scaffolds while yielding a mean map versus scaffold correlation of 0.89. Thirty-three of the 50 largest scaffolds had correlations greater than 0.95 and 42 had correlations greater than 0.8. We opted to use these settings for scaffolding. Salsa v2.2 split multiple contigs, resulting in 8,367 contigs with an N50 of 1.25 MB and 5,171 scaffolds with an N50 of 5.15 MB. Additional scaffolding with Chromonomer v1.13 increased scaffold N50 to 44 MB and reduced the total number of scaffolds to 4,122. Chromonomer v1.13 also reduced contig N50 slightly because it inserted additional gaps at likely misassemblies. Scaffolding with Hi-C and the Lake Trout linkage map ultimately allowed us to assign 84.7% of the genome to chromosomes. Gap filling with PBJelly increased scaffold N50 to 44.97 MB, the total assembly size to 2.345 GB, and contig N50 to 1.8 MB. Gap filling also increased the maximum contig length to 34.78 MB and the maximum scaffold length to 98.19 MB. The estimated consensus accuracy after three rounds of error correction with Polca was 99.9959%. The polished assembly was submitted to GenBank for public use (accession GCA_016432855.1).
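For readers unfamiliar with the N50 statistic reported above, the following is a minimal sketch of how it is conventionally computed from a list of contig or scaffold lengths. The function name `n50` and the toy lengths are illustrative assumptions and are not taken from the assembly pipeline used here.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downward, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Toy example (hypothetical lengths, not from the Lake Trout assembly).
    toy_lengths = [10, 8, 6, 4, 2]
    print(n50(toy_lengths))
```

Because N50 depends only on the length distribution, splitting contigs at suspected misassemblies lowers contig N50 even when total assembly size is unchanged, which is consistent with the small decreases reported after Salsa and Chromonomer.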

IIIB. ASSEMBLY QUALITY CONTROL

We estimated the total haploid genome size for Lake Trout to be between 2.119 and 2.122 GB using k-mer analysis and GenomeScope v1.0, with 38% of the genome composed of unique sequence and 62% composed of repetitive sequence. Heterozygosity for the sample used for polishing was estimated to be between 2.78 and 2.9 heterozygous sites per 1000 base pairs. Note that the individual used for polishing was a diploid and not a gynogenetic doubled haploid. The estimated coverage for the sample used for genome-size estimation was 16X, which should be sufficient for k-mer-based methods (Williams et al. 2013).
We recovered 93.2% of BUSCO genes with 60.3% and 32.9% being present as singletons and duplicates, respectively (Figure 3). The salmonid genomes evaluated recovered between 88.1% and 95.3% complete BUSCOs with between 25.3% and 34.9% being duplicated and between 58.3% and 65% being singletons. The proportion of duplicated BUSCOs in the Lake Trout genome was the second highest among salmonid genomes (32.9%) and appears to be comparable to the Brown Trout genome (GCA_901001165.1; River Trout), which was also assembled using Falcon (Falcon-unzip) and polished using a method based on the Freebayes variant caller (Garrison and Marth 2012).
Spearman’s rank order correlations between the genome assembly and the Lake Trout linkage map ranged from 0.89 to 1.0 for the 42 Lake Trout chromosomes. The mean correlation was 0.98 and 39 of 42 chromosomes had correlations greater than or equal to 0.96, suggesting that the final genome assembly provides an accurate representation of the order of loci along Lake Trout chromosomes.
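As an illustration of the statistic used above, the sketch below computes a Spearman rank-order correlation between linkage map positions and physical positions for the markers on one chromosome. It is a plain standard-library implementation under stated assumptions: the function names and the toy marker positions are hypothetical, and it is not the software used for the analysis reported here.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical markers: linkage map position (cM) and physical position (bp).
    map_cm = [0.0, 1.5, 3.2, 7.8, 9.1]
    physical_bp = [120_000, 950_000, 2_400_000, 2_100_000, 6_300_000]
    print(round(spearman(map_cm, physical_bp), 3))
```

Because the statistic depends only on rank order, it is insensitive to variation in recombination rate along a chromosome and penalizes only markers whose order differs between the map and the assembly.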

IIIC. REPETITIVE DNA

RepeatModeler 2 identified 2,810 interspersed repeats and 462 of these were classified by RepeatClassifier. RepeatMasker reported that 53.8% of the Lake Trout genome is composed of sequences from this repeat library. A total of 13.04% of the genome was composed of retroelements, with 10.47% being LINEs and 2.57% being LTR elements, and 9.97% of the genome was composed of DNA transposons. As has been observed in other salmonids, TcMar-Tc1 was the most abundant superfamily and these repeats were most abundant near centromeres (Figure 2; Lien et al. 2016; Pearse et al. 2019). A total of 30.79% of the genome was composed of interspersed repeats that were not classified by RepeatClassifier.

IIID. HOMEOLOG IDENTIFICATION AND SYNTENY

Self-vs-self synteny analysis conducted using Symap v5 identified 126 syntenic blocks shared between putative Lake Trout homeologs (Figure 2). Blocks ranged in size from 477,153 bp to 57,126,662 bp. Fifty-two blocks were longer than 10 MB and 70 were longer than 5 MB (Figure 2, inner links). We identified 50 syntenic blocks shared between Rainbow Trout and Lake Trout and identified homologous Rainbow Trout chromosomes for all Lake Trout chromosomes. Syntenic blocks shared between these two species ranged in size from 1.9 MB to 97.2 MB. Symap identified homologous chromosomes in Atlantic Salmon for all Lake Trout chromosomes except 32 and 39. However, based on the size of the missing synteny blocks, we expect that Lake Trout chromosome 39 is homologous to a region of Atlantic Salmon chromosome 2 and that chromosome 32 is homologous to a region of Atlantic Salmon chromosome 14. Fifty-four syntenic blocks were detected between Lake Trout and Atlantic Salmon, ranging in size from 208,516 bp to 88 MB. We identified 42 syntenic blocks shared between Dolly Varden and Lake Trout and identified homologs for all chromosomes except chromosome 41. These syntenic blocks ranged in size from 6.8 MB to 79.9 MB (Supplemental Material 4 – Syntenic Blocks and Between Species Circos Plots).

IIIE. GENOME ANNOTATION

We generated a total of 3.45 billion RNA-seq reads that were subsequently used as input for the NCBI Eukaryotic Genome Annotation Pipeline v8.5 (July 9, 2020 release date). An additional 528,760 reads were used from previous Lake Trout gene expression studies. A total of 86% of reads were aligned to the genome assembly. In addition, 12 Lake Trout transcripts from GenBank and 3,547 known Atlantic Salmon transcripts from RefSeq were used as input for the pipeline.
The pipeline produced annotations for 49,668 genes and pseudogenes. A total of 3,307 non-transcribed pseudogenes and two transcribed pseudogenes were identified. Gene length ranged from 53 to 1,198,409 bp, with a median length of 8,676 bp. Gene densities for chromosomes ranged from 15.45 to 31.39 genes/MB with an average genome-wide density of 21.07 genes/MB (Figure 2, C). A total of 422,014 exons were identified, with between 1 and 224 exons per transcript (mean=10.31, median=8).

IIIF. RECOMBINATION RATES AND CENTROMERES

We mapped between 1 and 238 centromere-associated RAD contigs to their respective chromosomes and determined approximate centromere locations for all chromosomes except chromosome 42. Smith et al. (2020) did not determine the location of the centromere for chromosome 42, which prevented us from identifying its location. On average, we mapped 35 centromere-associated RAD loci per chromosome. Between 39 and 238 centromeric loci were mapped to metacentric chromosomes (mean = 93), while between 1 and 59 loci were mapped to acrocentric or telocentric chromosomes (mean = 21).
In all, 14,438 linkage-mapped contigs were mapped to the genome with mapping qualities greater than 60. A total of 11,232 loci were retained for recombination rate estimation after manual curation and filtering using loess model residuals. We determined the mean sex-averaged recombination rate to be 1.09 centimorgans/MB, with recombination rates varying between 0 and 6.58 centimorgans/MB across the genome.
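To make the units of these estimates concrete, the sketch below computes recombination rates in centimorgans per megabase as the change in genetic position divided by the change in physical position between adjacent markers on one chromosome. This is a simplified finite-difference illustration and not the loess-based procedure used here; the function names and marker values are hypothetical.

```python
def interval_rates(markers):
    """Recombination rate (cM/MB) between each pair of adjacent markers.

    `markers` is a list of (physical_position_bp, genetic_position_cm)
    tuples for a single chromosome. Intervals with zero physical
    distance are skipped because the rate is undefined there.
    """
    ordered = sorted(markers)  # sort by physical position
    rates = []
    for (bp_a, cm_a), (bp_b, cm_b) in zip(ordered, ordered[1:]):
        distance_mb = (bp_b - bp_a) / 1e6
        if distance_mb == 0:
            continue
        rates.append((cm_b - cm_a) / distance_mb)
    return rates


def chromosome_rate(markers):
    """Chromosome-wide rate: total map length over total physical span."""
    ordered = sorted(markers)
    span_mb = (ordered[-1][0] - ordered[0][0]) / 1e6
    if span_mb == 0:
        raise ValueError("markers must span a nonzero physical distance")
    return (ordered[-1][1] - ordered[0][1]) / span_mb


if __name__ == "__main__":
    # Hypothetical markers: (physical position in bp, map position in cM).
    toy_markers = [(0, 0.0), (1_000_000, 1.0), (3_000_000, 2.0)]
    print(interval_rates(toy_markers))
    print(chromosome_rate(toy_markers))
```

Adjacent-marker differences are sensitive to small ordering errors between the map and the assembly, which can produce negative or inflated rates; smoothing the relationship between genetic and physical position before estimating rates reduces this sensitivity.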