Nuclear SNPs
Reads from 13 louse genomes were individually mapped with Bowtie2 (Langmead & Salzberg, 2012) to the reference genome described above. Before mapping, reference sequence was indexed using Samtools (Li et al., 2009) and a dictionary file was made with CreateSequenceDictionary in Picard v.2.0.1 (https://broadinstitute.github.io/picard/). After mapping with Bowtie2, resulting SAM files were sorted to the BAM files and indexed using Samtools. Duplicated sequences were removed from sorted BAM files with Picard v.2.0.1 (https://broadinstitute.github.io/picard/) and quality of mapping was verified with QualiMap (http://qualimap.bioinfo.cipf.es/). SNPs for population-level analysis were called using the GATK Genome Analysis Toolkit following the “Best Practices” guide from the Broad Institute (Van der Auwera et al., 2013). SNPs were jointly called for all 13 Polyplax serrata samples and filtered with QD (quality by depth) <2.0, FS (Fisher strand test) >60.0, MQ (mapping quality) <40.0, and MQRankSum (mapping quality rank sum test) <–12.5. The SNPs from GATK were filtered with minor allele frequency (MAF) threshold 0.05 in PLINK 1.9 (https://www.cog-genomics.org/plink/1.9/). SNPs in linkage disequilibrium (LD) were pruned with the squared coefficient of correlation (r2) 0.2 and missing data threshold 0.2.
Resulting 12, 285 variants passed filters and quality control in 13P. serrata samples and used in population structure estimation. Principal Component Analysis (PCA) was performed in R package SNPRelate (DOI: 10.18129/B9.bioc.SNPRelate) and phylogenetic tree based on Maximum likelihood analysis was reconstructed using SNPhylo pipeline (Lee, Guo, Wang, Kim, & Paterson, 2014) with 1000 bootstraps.