2.2 De novo assembly and annotation of transcripts
To explore genomic variation across the macrotis group, we
generated a de novo transcriptome assembly for each sequenced species. We
obtained clean reads from raw data by removing reads containing adapter
sequences, reads containing poly-N, and low-quality reads. All the downstream
analyses were based on the clean data. The transcriptome was
assembled from the pooled paired-end reads of three tissues
using Trinity (Grabherr et al., 2011) with min_kmer_cov set to 2
and all other parameters set to default. We
selected the longest transcript of each gene as its unigene and used it in
the following analyses.
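
The unigene selection step can be illustrated with a minimal sketch. It assumes Trinity-style transcript IDs of the form <gene>_i<number> (for example TRINITY_DN1_c0_g1_i2), so that stripping the isoform suffix yields the gene ID. The function names read_fasta and select_unigenes are hypothetical and are not taken from the pipeline used here; ties in length are resolved by keeping the first transcript encountered, which is also an assumption.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple

# Assumption: transcript IDs end in "_i<number>", as in Trinity output.
_ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (transcript ID, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the ID, dropping any description after whitespace.
            name = line[1:].split()[0]
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def select_unigenes(
    records: Iterable[Tuple[str, str]]
) -> Dict[str, Tuple[str, str]]:
    """Map each gene ID to its longest transcript as (transcript ID, sequence)."""
    best: Dict[str, Tuple[str, str]] = {}
    for transcript_id, seq in records:
        gene_id = _ISOFORM_SUFFIX.sub("", transcript_id)
        current = best.get(gene_id)
        # Strict comparison: on equal length the first transcript seen is kept.
        if current is None or len(seq) > len(current[1]):
            best[gene_id] = (transcript_id, seq)
    return best
```

In practice the input would be the lines of the assembled FASTA file, for example select_unigenes(read_fasta(open(path))) with path pointing to the assembly.
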
To obtain functional annotation for more unigenes, we used the genome
data of R. sinicus and H. armiger from NCBI as references.
First, the protein sequence of each unigene was aligned to the NCBI
non-redundant (Nr) protein database using DIAMOND v0.8.22 to produce annotation
results. NCBI BLAST 2.2.28+ was then used to retrieve matching NCBI nucleotide
(Nt) sequences for each unigene. Functional annotation of the unigene
was undertaken based on the best match derived from the alignments to
the proteins annotated in the SwissProt and euKaryotic Ortholog Groups (KOG)
databases. We used the HMMER 3.0 package to annotate unigenes against the
Protein family (Pfam) database. Gene Ontology (GO) terms and their descriptions
were retrieved based on the Nr and Pfam annotation results. Finally, the Kyoto
Encyclopedia of Genes and Genomes (KEGG) orthology of each protein was
determined with the KEGG Automatic Annotation Server (KAAS), using the
bi-directional best hit (BBH) method.
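
The idea behind a bi-directional best hit can be sketched as follows. This is only an illustration of the general BBH principle, not the KAAS implementation, which applies its own scoring thresholds and assignment rules that are not modeled here. The hit tuples (query ID, subject ID, bit score) and the function names best_hits and bidirectional_best_hits are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical hit record: (query ID, subject ID, bit score).
Hit = Tuple[str, str, float]


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Return, for each query, the subject with the highest bit score."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        # Strict comparison: on equal scores the first hit seen is kept.
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(
    forward: Iterable[Hit], reverse: Iterable[Hit]
) -> List[Tuple[str, str]]:
    """Return (query, subject) pairs that are each other's best hit.

    `forward` holds hits from searching queries against the reference set;
    `reverse` holds hits from searching the reference set against the queries.
    """
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    return sorted((q, s) for q, s in fwd.items() if rev.get(s) == q)
```

A pair is reported only when the best subject of a query also has that query as its own best hit in the reverse search; a query whose best subject prefers a different sequence is left unassigned in this sketch.
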