Mapping the genetic variations of SARS-CoV-2 onto its proteins’ crystal structures – Post 1

Many people have spent a few months in quarantine because of the emergence of a novel coronavirus, SARS-CoV-2, the pathogen that causes COVID-19. (1) More people around the globe are fighting this virus as the number of cases increases. As of today, May 19, 2020, the number of confirmed cases globally has almost reached five million, and over 320,000 deaths have been reported so far. (2) There is no doubt that our fight against this disease is time-sensitive, and that has spurred many scientists into action.

About a month ago, my supervisor, Matthieu Schapira, discussed a potential project with me where the goal would be to map the genetic variations of SARS-CoV-2 onto its proteins’ crystal structures. Without a second's pause, I agreed to join his mission. In this post and my next couple of posts, I will discuss some of the work we have done so far to achieve that. You can view the methods here.

I want to give you just a little background about the “why” of this project. Viral genomes are genetically unstable, which means that they mutate rapidly and can, therefore, easily develop mutations associated with drug resistance. (3) To identify whether we can potentially target viral proteins with antiviral drugs, we first identify druggable binding sites on their 3D structures. Following this step, we make a list of the amino acids lining these drug binding sites. Then we look at the variation of these amino acids across hundreds or thousands of SARS-CoV-2 samples from COVID-19 patients to evaluate the genetic variability at these binding sites and see how conserved they are from one patient to another. We then extend the analysis to other coronaviruses, to see if some binding sites are conserved not only across SARS-CoV-2 variants from COVID-19 patients but also across all coronaviruses. If such binding sites exist, drugs that target them have a better chance of being active against future coronaviruses representing future pandemic threats.
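To make the conservation step concrete, here is a minimal sketch of how one might measure how conserved each binding-site position is across a set of aligned sequences. It is only an illustration under stated assumptions: the function name `site_conservation`, the toy alignment, and the column indices are all made up for this example and are not taken from our actual pipeline or data.

```python
from collections import Counter


def site_conservation(aligned_seqs, site_columns):
    """For each chosen alignment column, return the most common residue
    and the fraction of sequences that carry it."""
    result = {}
    for col in site_columns:
        counts = Counter(seq[col] for seq in aligned_seqs)
        residue, n = counts.most_common(1)[0]
        result[col] = (residue, n / len(aligned_seqs))
    return result


# Toy, made-up aligned fragments (NOT real SARS-CoV-2 sequences).
toy_alignment = [
    "TLHCM",
    "TLHCM",
    "TLHCI",
    "TVHCM",
]

# Hypothetical binding-site columns: here, simply every column of the toy alignment.
conservation = site_conservation(toy_alignment, [0, 1, 2, 3, 4])
for col, (residue, fraction) in conservation.items():
    print(f"column {col}: {residue} in {fraction:.0%} of sequences")
```

A position where the top residue appears in nearly all samples would count as conserved; a real analysis would also need to handle gaps and ambiguous residues, which this sketch ignores.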

To start, we focused on the SARS-CoV-2 main protease (MPro). SARS-CoV-2 is an RNA virus with a genome size of about 30 kb. About two-thirds of its genome is directly translated into two polyproteins called pp1a and pp1ab. The MPro plays a critical role in the viral life cycle by processing these polyproteins into functional pieces. (4)

We used one of the available crystal structures from the Protein Data Bank (PDB code: 7bqy). As a first step, we identified the druggable pocket on the MPro using the PocketFinder function in ICM (Molsoft, San Diego). The druggability analysis of the MPro catalytic site, calculated using Schrödinger’s SiteMap, is shown in Figure 1. The druggability score (Dscore) was around 0.88, which suggests that it will be difficult to make drugs that bind with high potency to this site. The good news is that one of the amino acids lining this pocket is a cysteine, which drugs can exploit to form a covalent bond, and that is indeed what all potent inhibitors targeting this site have done so far. We then identified the residues within 3.5 Å of the catalytic site, which gave 21 residue side chains: Thr25, Leu27, His41, Cys44, Met49, Pro52, Tyr54, Phe140, Asn142, Ser144, Cys145, His163, His164, Met165, Glu166, Leu167, Pro168, His172, Asp187, Gln189, Gln192.
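For readers curious about what a distance-based residue selection looks like in practice, the sketch below picks every residue that has at least one atom within a cutoff of a set of reference points. This is not how ICM or SiteMap work internally; it is a plain-Python illustration. The function name `residues_near_site`, the choice of reference points (for example, the atoms of a bound inhibitor), and all coordinates are assumptions made up for the example.

```python
import math


def residues_near_site(protein_atoms, site_points, cutoff=3.5):
    """Return labels of residues with at least one atom within `cutoff`
    angstroms of any reference point.

    protein_atoms: iterable of (residue_name, residue_number, (x, y, z))
    site_points:   iterable of (x, y, z) reference coordinates
    """
    near = set()
    for res_name, res_num, xyz in protein_atoms:
        if any(math.dist(xyz, point) <= cutoff for point in site_points):
            near.add((res_num, res_name))
    # Sort by residue number so the output follows the sequence order.
    return [f"{name}{num}" for num, name in sorted(near)]


# Made-up coordinates for illustration only (NOT taken from 7bqy).
toy_atoms = [
    ("HIS", 41, (1.0, 0.0, 0.0)),
    ("CYS", 145, (0.0, 3.0, 0.0)),
    ("CYS", 145, (0.0, 8.0, 0.0)),
    ("GLY", 2, (10.0, 10.0, 10.0)),
]
toy_site = [(0.0, 0.0, 0.0)]

nearby = residues_near_site(toy_atoms, toy_site)
print(nearby)
```

With real data, the atom tuples would come from parsing the structure file, and a spatial index would be faster than this brute-force loop for large structures.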

Figure 1. Druggability analysis of binding pockets at the surface of the MPro. Druggability scores (Dscore) are calculated with SiteMap (Schrödinger, NY).

We then downloaded reviewed coronavirus sequences from a well-established database (UniProt). We focused on alpha and beta coronaviruses (SARS-CoV-2 is a beta coronavirus) because they are among the most likely viruses to jump to humans. (5) A sequence similarity search retrieved MPro from 27 different coronaviruses, and we ran a multiple sequence alignment on them. Figure 2 shows the resulting diversity dendrogram alongside the multiple sequence alignment. Not surprisingly, we find that the MPro catalytic site of SARS-CoV-2 is closest to that of SARS-CoV-1 and furthest from those of alpha coronaviruses.
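One simple way to put a number on "closest" and "furthest" is the fraction of aligned catalytic-site positions that differ between two viruses (sometimes called the p-distance). The sketch below ranks sequences by that measure. The distance actually used to build our dendrogram may differ, so treat this only as an illustration: the function names, the virus labels, and the residue strings are all hypothetical.

```python
def p_distance(a, b):
    """Fraction of aligned positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def rank_by_distance(query, others):
    """Return (name, distance) pairs sorted from closest to furthest."""
    pairs = ((name, p_distance(query, seq)) for name, seq in others.items())
    return sorted(pairs, key=lambda pair: pair[1])


# Made-up catalytic-site strings (NOT real coronavirus data).
query_site = "TLHCM"
other_sites = {
    "virus_A": "TLHCI",  # differs at 1 of 5 positions
    "virus_B": "SVHCI",  # differs at 3 of 5 positions
}

ranking = rank_by_distance(query_site, other_sites)
for name, distance in ranking:
    print(f"{name}: {distance:.2f}")
```

A full dendrogram would feed the complete pairwise distance matrix into a hierarchical clustering method; ranking against a single query is the smallest piece of that idea.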

Figure 2. The diversity dendrogram of the coronavirus MPro catalytic site. 27 entries from the UniProt database and their corresponding organism names are shown in the table. Critical non-conserved residues are highlighted in orange.

In my next post, I will explain how we looked into MERS, SARS-CoV, and SARS-CoV-2 MPro structures and will highlight their catalytic sites’ critical non-conserved sidechains for ligand-protein interactions.

Please contact me via the “Leave a comment” link at the top of this post. Stay tuned for more updates!


  1. Yang, P., Wang, X. COVID-19: a new challenge for human beings. Cell Mol Immunol 17, 555–557 (2020).
  2. “Coronavirus Cases:” Worldometer,
  3. Sanjuán, R., Domingo-Calap, P. Mechanisms of viral mutation. Cellular and Molecular Life Sciences 73(23), 4433-4448 (2016).
  4. Dai, W. et al. Structure-based design of antiviral drug candidates targeting the SARS-CoV-2 main protease. Science, (2020).  
  5. Ye, Z. W., Yuan, S., Yuen, K. S., Fung, S. Y., Chan, C. P., & Jin, D. Y. Zoonotic origins of human coronaviruses. International Journal of Biological Sciences 16(10), 1686–1697 (2020).


