Hu LD et al / Acta Pharmacol Sin 2003 Aug; 24 (8): 741-745

Mutation analysis of 20 SARS virus genome sequences: evidence for negative selection in replicase ORF1b and spike gene1

HU Lan-Dian2, ZHENG Guang-Yong2, JIANG Hai-Song2, XIA Yu2, ZHANG Yi2, KONG Xiang-Yin2,3,4

2Health Science Center, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences and Shanghai Second Medical University, Shanghai 200025;

3State Key Lab for Medical Genomics, Rui Jin Hospital, Shanghai Second Medical University, Shanghai 200025, China

1 Project supported by National High Technology "863" Programs of China, National Natural Science Foundation of China, National Science Fund for Distinguished Young Scholars, State Key Technologies R&D programme (973).

4 Correspondence to Prof KONG Xiang-Yin. Phn 86-21-6467-8976. Fax 86-21-6467-8976. E-mail xykong@sibs.ac.cn

Received 2003-06-22 Accepted 2003-07-07

KEY WORDS severe acute respiratory syndrome; SARS-CoV virus; negative selection; replicase polyprotein; spike protein

ABSTRACT

AIM: More SARS-CoV virus genome sequences have recently been released to the GenBank database. The aim of this study is to reveal the evolutionary forces acting on SARS-CoV virus by analyzing the nucleotide mutations in these sequences. METHODS: We obtained 20 SARS-CoV virus genome sequences from the NCBI database and calculated the ratio of non-synonymous nucleotide substitutions per non-synonymous site (Ka) to synonymous nucleotide substitutions per synonymous site (Ks) for SARS-CoV virus genes. RESULTS: The Ka/Ks ratios for replicase polyprotein ORF1a, ORF1b, and the spike protein gene are 1.09 (P=0.6501), 0.38 (P=0.0074), and 0.65 (P=0.0685), respectively. CONCLUSION: SARS-CoV virus replicase polyprotein ORF1b is undergoing negative selection; a negative selection force is also probably operating on the spike protein gene. These results provide a basis for the future development of drugs and vaccines against SARS.

INTRODUCTION

Severe acute respiratory syndrome (SARS) is a disease that has spread globally in an epidemic beginning in November 2002. The pathogen has been identified as a novel coronavirus (SARS-CoV). SARS-CoV is a positive-strand ssRNA virus with a genome of about 30 kb[1-6]. Similar to other known coronaviruses, the viral RNA genome has five major open reading frames (ORFs) and nine additional potential ORFs. The proteins encoded by these ORFs include the replicase polyprotein, the spike (S), envelope (E), and membrane (M) glycoproteins, and the nucleocapsid protein (N)[2-5]. Sequence analysis reveals that the SARS-CoV virus is distinct from all known human viruses. Therefore, SARS-CoV virus is unlikely to be a mutant or recombinant of any known human coronavirus; instead, it probably jumped to the human population from an unknown source[2-5]. Recently, a coronavirus resembling the SARS virus has been detected in palm civets (Paguma larvata) and a raccoon dog (Nyctereutes procyonoides)[7]. However, at present, it is uncertain whether these animals are the actual origin of human SARS-CoV virus.

After jumping to humans, SARS-CoV virus should be under evolutionary selection as it adapts to the new host and evades the host immune system. The decoding of the complete virus genome sequence and the identification of its encoded proteins provide the basis for evolutionary analysis of the virus genome. Sequence comparison of 14 SARS-CoV virus genome sequences revealed the common origin of human SARS-CoV viruses[8]. Although the SARS-CoV virus is relatively stable[8], the mutations within the virus genome are not evenly distributed. This indicates that some genes may mutate more rapidly than others. Identifying the evolutionary processes that these genes are undergoing will be helpful in virus detection and therapy.

Worldwide SARS research has accelerated the progress of SARS-CoV virus genome sequencing. So far, 20 complete SARS-CoV virus genome sequences have been released to GenBank. Thus, more mutations of the SARS-CoV virus genome can be obtained from these sequences. These mutations harbor information about the virus-host interaction during the epidemic of the past half year. In this study, we investigate the evolution of virus genes, particularly the replicase polyprotein gene and the spike protein gene.

MATERIALS AND METHODS

SARS-CoV virus genome sequences The 20 complete SARS-CoV virus genome sequences were obtained from GenBank (http://www.ncbi.nlm.nih.gov/).

Multiple sequence alignment and phylogenetic analysis We performed multiple sequence alignment and constructed a consensus neighbor-joining tree of the 20 SARS-CoV virus genome sequences using the free online ClustalW program (http://www.ebi.ac.uk/clustalw/).

Prediction of transmembrane helices We used TMHMM Server v. 2.0 (http://www.cbs.dtu.dk/services/TMHMM-2.0/) to predict the transmembrane helices of the membrane glycoprotein.

Non-synonymous nucleotide substitution per non-synonymous site (Ka) and synonymous nucleotide substitution per synonymous site (Ks) analysis We calculated the numbers of synonymous and non-synonymous sites and of synonymous and non-synonymous mutations using the DnaSP 3.51 program[9]. We used Fisher's exact test to calculate the P value under the null hypothesis of equal rates of synonymous and non-synonymous changes[10].
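The calculation described above can be illustrated with a minimal Python sketch. It is not the DnaSP implementation: it assumes that the site and substitution counts are already available, applies a Jukes-Cantor correction to each proportion (an assumption about the distance correction), and builds the 2x2 table for Fisher's exact test from changed and unchanged sites, which is one common construction and may differ from the one actually used. The function names and the counts in the demonstration are hypothetical.

```python
from math import comb, log


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for a proportion p of changed sites."""
    return -0.75 * log(1.0 - 4.0 * p / 3.0)


def ka_ks(nonsyn_subs, syn_subs, nonsyn_sites, syn_sites):
    """Ka/Ks from substitution counts and site counts (sketch, not DnaSP)."""
    ka = jukes_cantor(nonsyn_subs / nonsyn_sites)
    ks = jukes_cantor(syn_subs / syn_sites)
    return ka / ks


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the probabilities of all tables with the same margins that are
    no more probable than the observed table.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


if __name__ == "__main__":
    # Hypothetical counts for illustration only; they are not taken from Tab 1.
    nonsyn_subs, syn_subs = 10, 10
    nonsyn_sites, syn_sites = 3000, 1000  # site counts rounded to integers
    ratio = ka_ks(nonsyn_subs, syn_subs, nonsyn_sites, syn_sites)
    p_value = fisher_exact_two_sided(
        nonsyn_subs, nonsyn_sites - nonsyn_subs,
        syn_subs, syn_sites - syn_subs,
    )
    print(f"Ka/Ks = {ratio:.3f}, P = {p_value:.4f}")
```

Because Fisher's exact test works on integer counts, the fractional site numbers produced by codon-based counting have to be rounded before the table is built; that rounding is a simplification of this sketch.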

RESULTS AND DISCUSSION

Mutation distribution SARS-CoV virus spread to the human population about half a year ago and then quickly broke out around the world. In the new host, the viruses are subject to positive, negative, or neutral evolutionary forces. Positive selection often operates on genes involved in evading host defensive systems or immunity, such as the human immunodeficiency virus-1 envelope gene (env). During genome replication, SARS-CoV viruses are apt to acquire mutations because of their error-prone polymerase. A Ka/Ks ratio greater than 1 characterizes positive Darwinian selection; conversely, a Ka/Ks ratio less than 1 implies negative selection[11]. In this study, we try to explore which genes are under positive selection and which are under negative selection. We calculated the numbers of non-synonymous and synonymous nucleotide substitutions that occurred in each of the five major genes in the 20 SARS-CoV virus genome sequences (Tab 1). In the coding regions of the five major genes, there are 129 nucleotide substitutions in total. Multiple sequence alignment shows that some mutations seem to be clustered in these genes (Fig 1). This result suggests that these regions are undergoing rapid adaptive evolution.
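As an illustration of how substitutions in a coding region can be sorted into synonymous and non-synonymous classes, the following Python sketch compares two aligned, in-frame coding sequences codon by codon using the standard genetic code. It is a simplification rather than the procedure used by DnaSP: it counts differing codons instead of individual nucleotide substitutions, and a codon that differs at more than one position is classified only by its translated amino acid. The function name count_codon_changes is hypothetical.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def count_codon_changes(ref, seq):
    """Return (synonymous, non_synonymous) counts of differing codons.

    Both inputs must be aligned, in-frame coding sequences of equal length.
    Codons containing gaps or ambiguous bases are skipped.
    """
    if len(ref) != len(seq) or len(ref) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    syn = nonsyn = 0
    for i in range(0, len(ref), 3):
        c1 = ref[i:i + 3].upper().replace("U", "T")
        c2 = seq[i:i + 3].upper().replace("U", "T")
        if c1 == c2 or c1 not in CODON_TABLE or c2 not in CODON_TABLE:
            continue
        if CODON_TABLE[c1] == CODON_TABLE[c2]:
            syn += 1      # same amino acid: synonymous change
        else:
            nonsyn += 1   # different amino acid: non-synonymous change
    return syn, nonsyn


if __name__ == "__main__":
    # Toy sequences for illustration only; they are not SARS-CoV data.
    print(count_codon_changes("ATGGCTAAA", "ATGGCCAGA"))
```

In the toy example, GCT to GCC leaves alanine unchanged, whereas AAA to AGA replaces lysine with arginine.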

Tab 1. Nucleotide substitutions and Ka/Ks values of the five SARS-CoV virus genes.

Note. Fisher's exact test was used to test the null hypothesis of neutral evolution.

Fig 1. Multiple sequence alignment of SARS-CoV viruses showing clustered mutations in different regions of the virus genome.
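One simple way to look for clustering of mutations in a multiple sequence alignment is to count variable columns in sliding windows along the genome. The Python sketch below illustrates this idea under stated assumptions: the alignment is given as equal-length strings, gaps and N are ignored, and the window and step sizes are free parameters. It is not the procedure used to produce Fig 1, and the function names are hypothetical.

```python
def variable_sites(alignment):
    """Return 0-based column indices where the aligned sequences differ.

    Gap characters ('-') and 'N' are ignored when comparing bases.
    """
    length = len(alignment[0])
    if any(len(s) != length for s in alignment):
        raise ValueError("all sequences must have the same aligned length")
    sites = []
    for col in range(length):
        bases = {s[col].upper() for s in alignment} - {"-", "N"}
        if len(bases) > 1:
            sites.append(col)
    return sites


def window_counts(sites, length, window, step):
    """Return (window_start, number_of_variable_sites) for each window."""
    counts = []
    for start in range(0, max(length - window, 0) + 1, step):
        end = start + window
        counts.append((start, sum(start <= s < end for s in sites)))
    return counts


if __name__ == "__main__":
    # Toy alignment for illustration only; these are not SARS-CoV sequences.
    toy = ["ACGTACGTAC", "ACGTACGAAC", "ATGTACGTAC"]
    sites = variable_sites(toy)
    print(sites)
    print(window_counts(sites, len(toy[0]), window=5, step=5))
```

Windows with markedly more variable sites than the genome-wide average would be candidates for the clustered regions, although a formal test would be needed before drawing conclusions about selection.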

Negative selection of S protein The structural proteins of SARS-CoV virus include the S, E, M, and N proteins[2-5]. The first three form the surface of the SARS-CoV viral particle. These proteins are under selection pressure from the host immune response. Because there were too few nucleotide substitutions (Tab 1), we could not analyze the evolutionary forces operating on genes E, M, and N. The S protein is a large protein of 1255 amino acids. It recognizes specific receptors on the surface of host cells and mediates membrane fusion[3,12]. Furthermore, it is important in determining the species specificity, tissue tropism, and virulence of virus infection. We observed 14 non-synonymous and 9 synonymous nucleotide substitutions in this gene (Tab 1). The Ka/Ks value is 0.65 and the P value is 0.0685 (Tab 1). This result suggests that the S protein is likely under negative selection in the human host and is stable during human passage. Thus, the S protein is an ideal vaccine component. The result also hints at a high similarity between the receptors bound by the S protein in humans and in the original host.

Evolution of replicase polyprotein gene The coronavirus replicase polyprotein is a large protein of 7073 amino acids, composed of two polyproteins, ORF1a and ORF1b. They are translated from the virus genomic RNA. The replicase polyprotein is processed autocatalytically to produce a group of proteins including the proteases PLPpro and 3CLpro, RNA-dependent polymerase (POL), RNA helicase (HEL), and other proteins of unknown function[2-5]. These proteins are important targets for drug design[13,14]. Consistent with the observation that the product of ORF1 is relatively less conserved among different coronaviruses[3], the Ka/Ks ratio for ORF1a is 1.09 (P=0.6501) (Tab 1). Thus, according to the Ka/Ks ratio, ORF1a is likely evolving neutrally. However, in view of the uneven distribution of nucleotide substitutions (Fig 1), some parts of ORF1a might be under positive selection and other parts under negative selection. In fact, when only the first 1667 codons are analyzed, the Ka/Ks ratio reaches 1.58 (P=0.3192). ORF1b encodes several important proteins including the RNA-dependent polymerase and the RNA helicase. Based on its Ka/Ks ratio of 0.38 (P=0.0074), this ORF is under negative selection (Tab 1), reflecting the functional conservation of its encoded proteins in the new host.

CONCLUSION

In this study, we identified two genes, the replicase polyprotein ORF1b gene and the spike protein gene, that are subject to negative selection. However, based on the currently available SARS-CoV virus sequences, we could not detect any effect of positive selection. With the accumulation of new data, a positive selection force might be uncovered, for example on the M gene. In this study, we could not distinguish mutations caused by host pressure from mutations arising during in vitro expansion or from sequencing errors[8]. The current results therefore need to be confirmed in the future by analyzing SARS-CoV sequences free of in vitro expansion mutations.

REFERENCES