2.2 | Filtering raw reads and variant calling
We used Trim Galore 0.4.2 (available here:
https://github.com/FelixKrueger/TrimGalore), a wrapper script for
Cutadapt (Martin, 2011), to remove sequencing adaptors and to trim
low-quality ends (Phred quality score below 30) from reads. Reads were
aligned to the caribou reference genome (Taylor et al., 2019) using
Bowtie2 2.3.0 (Langmead & Salzberg, 2012), and the resulting SAM files
were converted to BAM files using Samtools 1.5 (Li et al., 2009). We removed duplicate
reads and added correct read group information to each BAM file using
Picard 2.17.3 (available here:
http://broadinstitute.github.io/picard/). We then sorted each BAM
file using Samtools 1.5 and built an index using Picard. All BAM files
were checked using FastQC 0.11.8 (Andrews, 2010).
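Assuming paired-end FASTQ files and the tool versions above, the read-processing steps could be run along these lines; file names, read-group values, and option choices are illustrative placeholders, not the exact commands used:

```shell
# 1. Adaptor removal and quality trimming (Phred < 30) with Trim Galore
trim_galore --paired --quality 30 sample_R1.fastq.gz sample_R2.fastq.gz

# 2. Alignment to the caribou reference with Bowtie2, then SAM-to-BAM conversion
bowtie2 -x caribou_ref -1 sample_R1_val_1.fq.gz -2 sample_R2_val_2.fq.gz -S sample.sam
samtools view -b sample.sam > sample.bam

# 3. Duplicate removal and read-group assignment with Picard
java -jar picard.jar MarkDuplicates I=sample.bam O=sample_dedup.bam \
    M=dup_metrics.txt REMOVE_DUPLICATES=true
java -jar picard.jar AddOrReplaceReadGroups I=sample_dedup.bam O=sample_rg.bam \
    RGID=sample RGLB=lib1 RGPL=ILLUMINA RGPU=unit1 RGSM=sample

# 4. Sorting with Samtools, indexing with Picard, final QC with FastQC
samtools sort -o sample_sorted.bam sample_rg.bam
java -jar picard.jar BuildBamIndex I=sample_sorted.bam
fastqc sample_sorted.bam
```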
In addition to the 30 caribou genomes we sequenced, we also used a
reindeer genome from a domesticated animal from Inner Mongolia,
sequenced by Li et al. (2017). The reads were downloaded from NCBI
(SRR5763125-SRR5763133) and mapped back to the caribou reference genome
using the same methods as above. FastQC detected adaptor contamination
and duplicate reads, so we used ClipReads in GATK to remove the Nextera
adaptor contamination and Picard to remove the duplicates. Re-checking
the file with FastQC confirmed that the contamination had been
successfully removed. Some sequence duplication was still detected,
however, and so results from the reindeer sequence may need to be
treated with caution.
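The adaptor clean-up for the reindeer data could take roughly this form under GATK 3, assuming the adaptor sequences are supplied to ClipReads via a sequence file (`nextera_adaptors.fasta` and the other file names are placeholders):

```shell
# Hard-clip Nextera adaptor sequence from the aligned reads (GATK 3 ClipReads)
java -jar GenomeAnalysisTK.jar -T ClipReads \
    -R caribou_ref.fasta -I reindeer.bam -o reindeer_clipped.bam \
    -XF nextera_adaptors.fasta -CR HARDCLIP_BASES

# Remove duplicates with Picard, then re-check with FastQC
java -jar picard.jar MarkDuplicates I=reindeer_clipped.bam \
    O=reindeer_dedup.bam M=dup_metrics.txt REMOVE_DUPLICATES=true
fastqc reindeer_dedup.bam
```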
We ran each BAM file through BUSCO 3.0.2 (Benchmarking Universal
Single-Copy Orthologs; Waterhouse et al., 2018) to reconstruct 4,104 conserved
mammalian genes to assess the completeness of each genome. As our
reference genome reconstructed 3,820 (93.1%; Taylor et al., 2019)
complete mammalian BUSCO genes, this represents an upper limit for our
re-sequenced individuals. We used HaplotypeCaller in GATK 3.8 (McKenna
et al., 2010) to call variants and produce a genomic variant call format
(GVCF) file for each caribou. Individual GVCF files were combined using
the CombineGVCFs function, and then we performed joint genotyping using
GenotypeGVCFs, both in GATK, to produce a VCF file with all caribou and
the reindeer. For some PCAs (see below), we also made VCF files
containing subsets of individuals.
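In GATK 3.8, the per-individual calling and joint genotyping described above would look approximately like this; file names are placeholders, and this sketch assumes HaplotypeCaller is run in GVCF mode, which CombineGVCFs requires:

```shell
# Per-individual variant calling in GVCF mode
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R caribou_ref.fasta -I caribou01_sorted.bam \
    --emitRefConfidence GVCF -o caribou01.g.vcf

# Combine the per-individual GVCFs (one --variant per sample)
java -jar GenomeAnalysisTK.jar -T CombineGVCFs -R caribou_ref.fasta \
    --variant caribou01.g.vcf --variant caribou02.g.vcf \
    --variant reindeer.g.vcf -o combined.g.vcf

# Joint genotyping across all samples to produce the final multi-sample VCF
java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R caribou_ref.fasta \
    --variant combined.g.vcf -o all_samples.vcf
```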
We downloaded the raw reads for a Sitka deer (Odocoileus hemionus
sitkensis) genome from the NCBI database (Bioproject PRJNA476345, run
SRR7407804) sequenced as part of the CanSeq150 Initiative, to use as an
outgroup. We aligned and filtered the reads in the same way as for the
caribou genomes to produce an individual GVCF file. We then used the
CombineGVCFs function and performed joint genotyping using
GenotypeGVCFs, both in GATK, to produce a VCF file with all caribou, the
reindeer, and the Sitka deer, for analyses requiring an outgroup.
We did some additional filtering on the combined VCF files to ensure
quality. We used VCFtools 0.1.14 (Danecek et al., 2011) to do two rounds
of filtering. First, we removed indels and any site with a depth below
10 or above 80 (roughly double the average depth across the genome), and
removed any low-quality genotype call with a quality score below 20,
which VCFtools converts to missing data. In the second round, we removed
sites with more than 10% missing data.
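With VCFtools 0.1.14, the two filtering rounds described above could be expressed roughly as follows; file names are placeholders, and applying the depth thresholds per site via --min-meanDP/--max-meanDP is one plausible reading of the description (per-genotype --minDP/--maxDP is the alternative):

```shell
# Round 1: drop indels, enforce site depth 10-80, set genotypes with GQ < 20 to missing
vcftools --vcf all_samples.vcf --remove-indels \
    --min-meanDP 10 --max-meanDP 80 --minGQ 20 \
    --recode --recode-INFO-all --out round1

# Round 2: keep only sites with at most 10% missing data
vcftools --vcf round1.recode.vcf --max-missing 0.9 \
    --recode --recode-INFO-all --out round2
```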
We did not filter to remove SNPs with a minor allele frequency (MAF)
below 0.05: because we have only one or two individuals from each
location, such a filter would remove the private sites. Instead, we
relied on very high depth and stringent filtering to ensure a
high-quality dataset. We did, however, also conduct the PCAs with an MAF
filter applied, and the results were identical to those without it
(Figures S2-S5). The combined VCF file used for analyses with all
individuals apart from the Sitka deer contained 34,573,476 SNPs, and the
VCF including the Sitka deer contained 65,412,957 SNPs.
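The rationale for skipping the MAF filter can be checked with a quick calculation: in a data set of roughly 31 diploid individuals (our reading of the 30 caribou plus the reindeer, i.e. 62 allele copies), a variant private to a single individual is a singleton, and its frequency falls well below a 0.05 cutoff:

```shell
# MAF of a singleton among 62 allele copies (31 diploid individuals)
maf=$(awk 'BEGIN { printf "%.4f", 1/62 }')
echo "singleton MAF: $maf"   # well below the 0.05 cutoff
```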