2.3 RNA preparation and sequencing
Total RNA was extracted from the tissues of pooled mites (adult mites, nymphs, larvae and eggs) using TRIzol reagent (Invitrogen, Carlsbad, CA, USA) according to the manufacturer’s instructions. RNA degradation and contamination were monitored by 1 % agarose gel electrophoresis. RNA purity was assessed using Nanodrop®2000 spectrophotometer (ThermoFisher Scientific, UK) and then quantified. RNA integrity was assessed using an RNA Nano 6000 Assay Kit with the Bioanalyzer 2100 system (Agilent Technologies, Santa Clara, CA, USA). RNA sequencing libraries were constructed using the NEB Next mRNA Library Prep Kit following the manufacturer’s instructions and sequenced on the Illumina HiSeq X platform. Finally, a total of 25.03 Gb clean data were generated.
Genome assembly and evaluation of S. scabiei var.cuniculi genome
Jellyfish (v2.0) was used to estimate genome size based on k -mer (kmer=17) distribution using short-insert size libraries (Kingsford, 2011). The estimated genome size of S. scabiei var. cuniculi is 49.83 Mb and the heterozygous ratio is 1.02%.
Using long reads generated by the PacBio Sequel platform, contigs were assembled using the canu (canu-1.7) software (Koren et al. , 2017) with parameters setting as ‘-correct genomeSize=50m gnuplotTested=true stopOnReadQuality=false -pacbio-raw pacbio.fasta -trim genomeSize=50m gnuplotTested=true stopOnReadQuality=false -pacbio-corrected PacBio.correctedReads.fasta.gz -assemble genomeSize=50m gnuplotTested=true stopOnReadQuality=false -pacbio-corrected PacBio.trimmedReads.fasta.gz’. The initial assembly was then polished using Quiver (smrtlink 6.0.1) with default parameters (Chin et al. , 2013). Heterozygosity in the assembly was removed by Purge Haplotigs software (v1.0.4) (Roach et al. , 2018). Short Illumina reads were then used to correct any remaining errors by pilon (v1.22) with parameters set as follows: ‘-Xmx300G –diploid –threads 20’ (Walker et al. , 2014). Finally, we used Hi-C data to scaffold S. scabiei var.cuniculi genome to chromosome-level by Lachesis software (version-201701) with default parameters (Burton et al. , 2013).
To evaluate the accuracy of the assembly at single base level, short Illumina reads were mapped to the S. scabiei var. cuniculigenome using BWA (H. Li & Durbin, 2009) with parameters setting as ‘-k 32 -w 10 -B 3 -O 11 -E 4’ and performed variant calling with SAMtools (v1.8) (H. Li, 2011). Meanwhile, assembly completeness was assessed based on Benchmarking Universal Single-Copy Orthologs (BUSCO v4.0, arthropoda_odb9) (Simao et al. , 2015) and Core Eukaryotic Genes Mapping Approach (CEGMA V2.5) (Parra et al. , 2007).
Genome annotation
Homologous comparison and de novo prediction methods were used to annotate the repeat sequences on S. scabiei var. cuniculigenome. RepeatMasker and the RepeatProteinMask v4.0.8 (https://www.girinst.org/repbase/) were performed for homologous comparison against Repbase database (https://www.girinst.org/repbase/) (Jurka et al. , 2005). For ab initio prediction, LTR_FINDER (http://tlife.fudan.edu.cn/ltr_finder/)(Xu & Wang, 2007), RepeatScout (http://www.repeatmasker.org/)(Priceet al. , 2005) and RepeatModeler (v2.1) (http://www.repeatmasker.org/RepeatModeler.html) were first used for repetitive elements de novo candidate database constructing, and further annotated using RepeatMasker. Tandem repeat was predicted using TRF (http://tandem.bu.edu/trf/trf.html).
Gene prediction was performed through combination methods of homology-based prediction, de novo prediction and transcriptome-based prediction. For homologous annotation, protein sequences including S. scabiei var. canis ,Metaseiulus occidentalis , Ixodes scapularis ,Tetranychus urticae , Drosophila melanogaster ,Tropilaelaps mercedesae , Pediculus humanus ,Tribolium castaneum and Stegodyphus mimosarum were aligned against the S. scabiei var. cuniculi genome using TBLASTN (Altschul et al. , 1990). Blast hits that correspond to reference proteins were concatenated by Solar software (version 0.9.6) and low-quality records were filtered. Sequence of each reference protein was extended upstream and downstream by 1000 bp to represent a protein-coding region. GeneWise (Birney et al. , 2004) software was used to predict gene structure contained in each protein region. Homology predictions were denoted as “Homology-set”. All RNA-seq clean data were first de novo assembled using Trinity (v2.0)(Grabherr et al. , 2011) and the assembled sequences were then aligned against the S. scabieivar. cuniculi genome using PASA pipeline v2.0.2(Haas et al. , 2003) with BLAT (Kent, 2002) as the aligner. Gene models created by PASA were denoted as PASA-T-set (PASA Trinity set). We simultaneously employed five tools of Augustus (Stanke & Morgenstern, 2005), GeneID (Guigo et al. , 1992), GeneScan (Burge & Karlin, 1997), GlimmerHMM (Majoros et al. , 2004), and SNAP(Korf, 2004) for ab initioprediction, in which Augustus, SNAP, and GlimmerHMM were trained by PASA-T-set gene models. In addition, RNA-seq reads were directly mapped to the genome using Tophat (v2.0.9) (Trapnell et al. , 2009), and then the mapped reads were assembled into gene models (Cufflinks-set) by Cufflinks (Trapnell et al. , 2010). According to these three approaches, all the gene models were finally integrated by EvidenceModeler(EVM v1.1.1)(Haas et al. , 2008). Weights for each type of evidence were set as follows: PASA-T-set > Homology-set > Cufflinks-set > Augustus > GeneID = SNAP = GlimmerHMM = GeneScan. In order to get the untranslated regions (UTRs) and alternative splicing information, PASA2 was used to update the gene structure. To achieve the functional annotation, the predicted protein sequences were aligned against public databases, including SwissProt (Bairoch & Apweiler, 2000), NR database (from NCBI), InterPro and KEGG pathway (Kanehisa & Goto, 2000) (release 76). Of that, InterproScan tool (Jones et al. , 2014) in coordination with InterPro database (Finn et al. , 2017) were applied to predict protein function based on the conserved protein domains and functional sites. NR, KEGG pathway and SwissProt database were mainly used to map gene set to identify the best match for each gene.
Reconstruction of phylogenetic relationship and estimation of divergence time
We retrieved the protein-coding sequence of 7 arthropod includingT. urticae , Apis mellifera , D. melanogaster ,I. scapularis , S. mimosarum , M. occidentalis ,T. mercedesae and a nematode Caenorhabditis elegans from Ensembl (Release 85). For gene model with multiple alternative isoforms, only the longest transcript was selected to represent this gene. To identify orthologous genes, OrthoMCL pipeline with the parameter of “-inflation 1.5” (L. Li et al. , 2003) as used to construct gene families. In total, 15,542 gene families were identified, involved in 1,338 single-copy orthologous gene families.
The phylogenetic relationship of S. scabiei var. cuniculiwas reconstructed using the 1,338 shared single-copy orthologous genes. The sequences were aligned by MUSCLE tool with default parameters (Edgar, 2004). Sequences were then concatenated to one super-gene sequence for each species and formed a data matrix. The GTR-GAMMA was selected as the best substitution model deduced by jModeltest tool (Darribaet al. , 2012). Then the phylogenetic analysis was performed using maximum-likelihood (ML) algorithm in RAxML tool (Stamatakis, 2006). The best-scoring ML tree was inferred by rapid BP algorithm and ML searches after performing 1,000 rapid bootstraps. The MEGA (version 7) was used for visualizing the constructed phylogenetic tree (Kumaret al. , 2016).
Furthermore, divergence time between S. scabiei var.cuniculi and other species were estimated based on the phylogeny. Four calibration times obtained from TimeTree database were used to calibrate divergence dates of other nodes on this phylogenetic tree (Sudhir et al. , 2017). In this process, we implemented the Monte Carlo Markov Chain algorithm for divergence time estimation by MCMCtree tool in PAML package (Yang, 2007). Finally, we found thatS. scabiei var. cuniculi was separated away from its closest species of T. urticae at approximate 338.8 million years ago.
Homolog annotation method was applied for gene models prediction of scabies mites variants (isolated from pigs and humans)(Mofiz et al. , 2016),Psoroptes ovis (Burgess et al. , 2018) and Dermatophagoides pteronyssinus(Waldron et al. , 2017) that only have genome assembly files available in public database. OrthoMCL pipeline was employed to acquire shared single copy genes, molecular phylogenetic relationship of scabies mites from four hosts with plant-living, free-living, predatory mites, and other parasitic mite species that have available genomic data sets was constructed based on shared single copy genes generated from the same methods mentioned above.
Gene-family evolution and identification of gene families related with detoxification
The evolutionary dynamics (expansion/contraction) of gene families were analyzed using CAFÉ tool(De Bie et al. , 2006 ) with a stochastic birth and death model. Global parameter λ was estimated based on the phylogenetic tree and the datasets of gene family clustering. Viterbi method in CAFÉ was employed to identify the significantly changed families (p -value < 0.05). Finally, significantly changed expansion/contraction gene families were used to perform enrichment analysis.
To further analyze gene families of detoxification enzymes and transporters, the S. scabiei var. cuniculi genome sequence assembly was searched by TBLASTN (E -value=10-5) (Altschul et al. , 1990) with close arthropod relatives including T. urticae , T. mercedesae ,M. occidentalis , and D. melanogaster target gene family protein sequences. The Basic Local Alignment Search Tool (BLAST) hits were then conjoined by Solar software (Yuet al. , 2006). GeneWise was used to predict the exact gene structure of the corresponding genomic region on each BLAST hit (Birney et al. , 2004). We then annotated preliminary identified candidate genes with Pfam database (http://pfam.xfam.org/) and NR database to get putative gene sets. Finally, a The proteins of target genes were aligned with mafft software (Katoh & Standley, 2014) and a neighbor-joining tree was constructed with TreeBest (Vilella et al. , 2009) with aligned by MAFFT software (Katoh & Standley, 2014).
Characterization of Hox genes
We identified putative Homeobox genes in S. scabiei var.cuniculi genome sequence assembly by performing TBLASTN (E -value=10-5)(Altschulet al. , 1990), using curated Homeobox proteins from A. mellifera, Bombyx mori, Anopheles gambiae, Daphnia pulex, D. melanogaster, I. scapularis, Pachycrepoideus vindemmiae, Strigamia maritima, T. urticae, and Tribolium castaneum . The Basic Local Alignment Search Tool (BLAST) hits were then conjoined by Solar software (Yu et al. , 2006). GeneWise was used to predict the exact gene structure of the corresponding genomic region on each BLAST hit (Birney et al. , 2004). HMM searches of the above genes against the Pfam database (http://pfam.xfam.org/) revealed each to have PF00046.29 domain. The classification of deduced proteins and their integrity were verified using BlASTP with model species D. melanogaster Hox genes.E -values < 10-5 was considered as significant hits of similarity.
Phylogenetic and population genetic analysis
Twelve scabies mite colonies from three different mammal hosts (one from dog, three from pigs and eight from rabbits) were sequenced on the Illumina HiSeq X sequencing platform using standard procedures. In addition, we also downloaded 7 individuals in the public database from other representative populations, including one mite population isolated from dogs, 2 isolated from humans and 4 isolated from pigs (Mofiz et al. , 2016; Rider et al. , 2015). Detailed information for all 20 samples was shown in Tables S1 . Consequently, a total of 136.45 Gb high-quality paired-end DNA sequence was obtained. For downloaded datasets, high sequencing depth data were truncated to around 50x coverage for further analysis.
FASTQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) was used for reads quality control. Clean reads were then aligned to theS. scabiei var. cuniculi genome using Burrows-Wheeler Alignment MEM (BWA-MEM) version 0.7.8 with default parameters except for the “-t 4 -k 32 -M -R” option (H. Li & Durbin, 2009). The ‘sort’ and ‘rmdup’ commands of SAMtools (v1.3) were used to perform data manipulation and alignment statistics. SNP calling was performed by SAMtools (v1.3) ‘mpileup’ command (H. Li et al. , 2009), bcftools ‘call’ command. To estimate individual admixture assuming different numbers of clusters, the genetic structure was inferred using sNMF (http://membres-timc.imag.fr/Olivier.Francois/snmf/software.htm) , which is based on sparse non-negative matrix factorization algorithms. We calculated the numbers of genetic clusters from 2 to 7 to explore the convergence of individuals with default settings and plotted Kfrom 2-4.
The software GCTA (http://cnsgenomics.com/software/gcta/) was used for principal component analysis (PCA) with biallelic SNPs of the 20 individuals. Only the first two significant components were plotted. The pam function of Rggfortify package (https://cran.r-project.org/web/packages/ggfortify/index.html) was used to determine the significant level of the principal components.