Extraction of functional CNVs from transcriptomes
The transcriptome data for 99 tree species (listed in Table S1) were
obtained from a research program of Han et al. (2017). Plant mRNAs were
extracted from seedling leaves collected in the GTS FDP using the
following sampling strategy. Fully expanded leaves from three seedling
individuals per species were sampled in each of the five main habitats
(low valley, low ridge, mid-slope, high slope and high ridge) (Chen et
al. 2010). For some rare species, however, only three individuals were
sampled, or three leaves were taken from the only available seedling
individual. The samples were immediately frozen in liquid nitrogen in
the field and then stored in a -80°C freezer until sequencing. The
transcriptomes were sequenced on an
Illumina HiSeq 2500 platform with 2×125 bp reads and at least 6G of
clean data per sample, assembled de novo with Trinity v2.2 without a
reference genome sequence (Grabherr et al. 2011), and annotated with
GOs using the software Blast2GO and the UniProt database (The UniProt
Consortium 2016). In this study, we focused on four GOs, with the terms
“defense response to fungus” (GO: 0050832), “defense response to
bacterium” (GO: 0042742), “defense response to insect” (GO: 0002213),
and “defense response to virus” (GO: 0051607), which are involved in
the defense response to four lineages of natural enemies. Based on the
GO annotation, we selected the transcripts annotated with each of the
four GOs for the 99 well-sequenced tree species and translated them
into protein sequences using TransDecoder and the Pfam database (Haas
et al. 2013). For each GO, we ran an all-by-all BLAST on the set of
protein sequences. Before clustering, the BLAST results were filtered
with a hit fraction cutoff of 0.4. Homologous gene clusters were then
obtained using the MCL software with an e-value cutoff of 10⁻⁵ and an
inflation value of 2.0. The steps from BLAST to clustering followed
the pipeline of Yang & Smith (2014). Finally, we counted the number of
genes in each cluster for each species. This resulted in four matrices
with gene clusters in columns and the 99 tree species in rows
(hereafter denoted as functional CNV matrices). To show the
dissimilarity of functional CNV among species, a heat map was drawn
with two-way clustering using the heatmap package in R 4.0.2 (R Core
Team, 2020), based on the six largest gene clusters for each GO.
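
The filtering and counting steps described above can be sketched as
follows. This is a minimal illustration, not the pipeline actually run:
it assumes that the hit fraction is the smaller of the aligned
proportions of the query and the subject sequence, and that each
sequence ID has the hypothetical form `species@sequence`. The function
names and the toy clusters are likewise illustrative only.

```python
from collections import Counter


def hit_fraction(q_start, q_end, q_len, s_start, s_end, s_len):
    """Smaller of the aligned proportions of query and subject.

    Assumed definition; coordinates are 1-based and inclusive.
    """
    q_cov = (abs(q_end - q_start) + 1) / q_len
    s_cov = (abs(s_end - s_start) + 1) / s_len
    return min(q_cov, s_cov)


def passes_filter(fraction, cutoff=0.4):
    """Keep a BLAST hit only if its hit fraction reaches the cutoff."""
    return fraction >= cutoff


def species_of(seq_id):
    """Extract the species from a hypothetical 'species@sequence' ID."""
    return seq_id.split("@", 1)[0]


def count_matrix(clusters):
    """Build a species-by-cluster matrix of gene counts.

    clusters: list of clusters, each a list of sequence IDs.
    Returns a dict mapping species -> list of counts, one per cluster.
    """
    species = sorted({species_of(s) for c in clusters for s in c})
    matrix = {sp: [0] * len(clusters) for sp in species}
    for j, cluster in enumerate(clusters):
        for sp, n in Counter(species_of(s) for s in cluster).items():
            matrix[sp][j] = n
    return matrix


# Toy example: two clusters, three species (made-up IDs).
toy_clusters = [
    ["spA@t1", "spA@t2", "spB@t1"],
    ["spB@t2", "spC@t1"],
]
toy_matrix = count_matrix(toy_clusters)
```

In this toy case, `toy_matrix` has one row per species and one column
per cluster, mirroring the layout of the functional CNV matrices.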
Before calculating functional CNV at the species and community levels
(the latter by seedling station), the values in the functional CNV
matrices were standardized by dividing by the maximum of each cluster,
so that all values range between 0 and 1. For each defense response
GO, the gene copy number of each species was defined as the sum of the
standardized values across all clusters, and the gene copy number of
each seedling station was defined as the average gene copy number of
all individuals in that station.
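
The standardization and the two summary measures can be illustrated
with a short sketch. The matrix layout (a dict of species rows), the
function names and the toy numbers are assumptions made for
illustration; only the arithmetic follows the description above.

```python
def standardize(matrix):
    """Divide each cluster (column) by its maximum across species."""
    n_clusters = len(next(iter(matrix.values())))
    col_max = [max(row[j] for row in matrix.values())
               for j in range(n_clusters)]
    return {
        sp: [row[j] / col_max[j] if col_max[j] else 0.0
             for j in range(n_clusters)]
        for sp, row in matrix.items()
    }


def species_copy_number(std_matrix):
    """Sum the standardized values over all clusters for each species."""
    return {sp: sum(row) for sp, row in std_matrix.items()}


def station_copy_number(individuals, species_cn):
    """Average the species-level value over all individuals in a station.

    individuals: list of species names, one entry per seedling.
    """
    return sum(species_cn[sp] for sp in individuals) / len(individuals)


# Toy functional CNV matrix: three species, two gene clusters.
counts = {"spA": [2, 0], "spB": [1, 1], "spC": [0, 1]}
std = standardize(counts)
cn = species_copy_number(std)

# A hypothetical station with two seedlings of spA and one of spB.
station_value = station_copy_number(["spA", "spA", "spB"], cn)
```

Because the station value averages over individuals rather than over
species, a species with many seedlings in a station contributes
proportionally more to that station's gene copy number.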