In my previous posts (post 1 and post 2), I explained how we found the neighboring residues of the SARS-CoV-2 main protease (MPro) catalytic site, created a diversity dendrogram of the residues, and looked at three examples of coronavirus MPro structures (SARS-CoV, SARS-CoV-2, MERS). In this post, I explain how we analyzed hundreds to thousands of coronavirus sequences to build a matrix table that shows sequence diversity. You can view the detailed method and ICM scripts in this Zenodo post.
To depict sequence diversity at the MPro catalytic site (21 residues), we looked at 682 SARS-CoV-2 samples from COVID-19 patients, which we downloaded from https://bigd.big.ac.cn/gwh/browse/virus/coronaviridae on April 9, 2020. The result of this analysis is shown in Table 1. All 682 samples are identical at the catalytic site: we saw no mutations at any of the 21 residues.
Table 1. Sequence diversity at the MPro catalytic site across 682 SARS-CoV-2 samples downloaded from https://bigd.big.ac.cn/gwh/browse/virus/coronaviridae on April 9, 2020. The column “X” represents deletions.
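To make the idea of a diversity matrix concrete, here is a minimal Python sketch of how per-position residue counts could be tabulated from aligned sequences. It is not the ICM workflow we actually used (that is in the Zenodo post). The function names, the toy sequences, and the choice to count every non-standard character under "X" are assumptions made for illustration only.

```python
from collections import Counter

# The 20 standard amino acids plus an "X" column for deletions, as in the tables.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
COLUMNS = list(AMINO_ACIDS) + ["X"]


def diversity_matrix(aligned_seqs, positions):
    """Count residues at the selected columns of an alignment.

    aligned_seqs: equal-length strings from a multiple sequence alignment.
    positions: 0-based alignment columns (e.g. the catalytic-site residues).
    Assumption: gaps and any other non-standard character are counted as "X".
    Returns one row per position, with counts ordered as in COLUMNS.
    """
    matrix = []
    for pos in positions:
        counts = Counter()
        for seq in aligned_seqs:
            residue = seq[pos].upper()
            counts[residue if residue in AMINO_ACIDS else "X"] += 1
        matrix.append([counts[col] for col in COLUMNS])
    return matrix


def variable_positions(matrix, positions):
    """Return the positions where more than one column has a non-zero count."""
    return [pos for pos, row in zip(positions, matrix)
            if sum(1 for n in row if n > 0) > 1]


# Hypothetical toy alignment, not real MPro data.
toy_alignment = ["HCMA", "HCMA", "HC-A", "HSMA"]
toy_positions = [0, 1, 2]
toy_matrix = diversity_matrix(toy_alignment, toy_positions)
for pos, row in zip(toy_positions, toy_matrix):
    observed = {col: n for col, n in zip(COLUMNS, row) if n > 0}
    print(pos, observed)
print("variable positions:", variable_positions(toy_matrix, toy_positions))
```

For a fully conserved set such as the 682 SARS-CoV-2 samples, every row would have a single non-zero column and variable_positions would return an empty list.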
Finally, we extended our analysis to coronavirus samples beyond SARS-CoV-2 that have at least 40% sequence similarity to the SARS-CoV-2 MPro sequence. The number of samples that met this cutoff was 4903. Table 2 shows the sequence diversity across the 21 critical residues of the MPro catalytic site. We saw substantial variation at the catalytic site of MPro across the 4903 coronavirus samples. In Figure 1, we highlight the non-conserved residues at the MPro active site in orange.
Table 2. Sequence diversity at the MPro catalytic site across 4903 coronavirus sequence samples downloaded from https://bigd.big.ac.cn/gwh/browse/virus/coronaviridae on April 9, 2020. The column “X” represents deletions. Non-conserved residues critical for inhibitor binding are highlighted in orange.
Figure 1. Active site residues of the coronavirus MPro that are not conserved across ~5000 coronavirus samples from dozens of viral strains (Table 2) are highlighted in orange.
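The 40% cutoff used to select the 4903 sequences can be illustrated with a short Python sketch. Note the swap: this example scores percent identity over a pairwise alignment, which is a simpler stand-in for the sequence similarity measure we used in ICM. The function names, the gap handling, and the toy sequences are all hypothetical.

```python
def percent_identity(ref, other, gap="-"):
    """Percent of compared columns where two aligned sequences match.

    ref and other must be equal-length strings from the same alignment.
    Assumption: columns where both sequences have a gap are skipped, and a
    gap paired with a residue counts as a mismatch.
    """
    if len(ref) != len(other):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(ref.upper(), other.upper()):
        if a == gap and b == gap:
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def filter_by_cutoff(ref, candidates, cutoff=40.0):
    """Keep the names of candidate sequences scoring at or above the cutoff."""
    return [name for name, seq in candidates.items()
            if percent_identity(ref, seq) >= cutoff]


# Hypothetical toy sequences, not real MPro data.
reference = "ACDEFGHIKL"
candidates = {
    "close": "ACDEFGHIKL",
    "borderline": "ACDEWWWWWW",
    "distant": "AWWWWWWWWW",
}
print(filter_by_cutoff(reference, candidates))
```

A real pipeline would first align each sequence to the reference MPro sequence and would likely score similarity with a substitution matrix rather than exact matches.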
After we completed this analysis, Andrew Leach at the European Bioinformatics Institute (EBI) brought to our attention an important report from Nick Goldman’s lab, also at EBI, showing that a minority of the coronavirus sequences uploaded to public databases contain sequencing errors. Our next step is to work with Nicola de Maio, in Nick’s lab, who kindly agreed to collaborate with us to make sure we only use valid genome sequences.
If you can think of other ways to improve this ongoing effort, please contact me via the “Leave a comment” link at the top of this post. Stay tuned for more updates.