We investigated genetic variation at the catalytic site of the main SARS-CoV-2 protease, following the work of https://zenodo.org/record/3834875#.Xs1IHsZ7nyk and https://openlabnotebooks.org/mapping-the-genetic-variations-of-sars-cov-2-onto-its-proteins-crystal-structures-post-1/ .
We collected information from more than 15,000 genomic sequences that were available from GISAID (https://www.epicov.org/) on the 17th of May 2020, giving us considerable power to detect viral genetic variation within the current outbreak.
Because the sequences considered may contain errors, we used a new approach to detect variants that are likely the result of sequencing issues and are therefore not reliable. Our analysis is based on phylogenetic inference of homoplasy (how many independent times the mutation seems to have appeared along the phylogenetic tree), on the clustering of variants along the genome, on frequent ambiguous characters (positions of the genome where the sequence seems uncertain), and on the sequence metadata (in particular the laboratory that generated the sequence). The idea behind our approach is that sequencing artefacts are likely to result in homoplasic patterns, in ambiguous characters, in clustered rare variants, and in variants detected only in specific sequencing laboratories (or, more generally, with specific sequencing and sample preparation methods).
For details of our methods, see https://zenodo.org/record/3865719#.XtJ9UcZ7nyl and http://virological.org/t/issues-with-sars-cov-2-sequencing-data/473 .
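The four signals described above can be illustrated with a small sketch. This is not the pipeline described in the linked methods. It is a simplified illustration in which the `Variant` fields, the `suspicion_flags` function, all thresholds, and the example values are hypothetical and chosen only to show how the criteria might be combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Summary of one variant; all field names are hypothetical."""
    position: int      # genome coordinate
    n_samples: int     # number of samples carrying the variant
    n_origins: int     # independent emergences inferred on the tree
    n_ambiguous: int   # samples with an ambiguous character at this position
    labs: frozenset    # laboratories that submitted the carrier samples


def suspicion_flags(v, variants, min_origins=5, min_ambiguous=10,
                    window=10, rare=5):
    """Return the reasons (if any) to distrust variant v.

    All thresholds are illustrative assumptions, not published values.
    """
    flags = []
    # Many independent emergences on the tree suggest a recurrent artefact.
    if v.n_origins >= min_origins:
        flags.append("homoplasy")
    # Frequent ambiguous characters suggest an uncertain position.
    if v.n_ambiguous >= min_ambiguous:
        flags.append("ambiguity")
    # A rare variant lying close to another rare variant is suspicious.
    if v.n_samples <= rare and any(
        o is not v and o.n_samples <= rare
        and abs(o.position - v.position) <= window
        for o in variants
    ):
        flags.append("clustered")
    # A variant seen repeatedly, but only by one laboratory.
    if v.n_samples >= 2 and len(v.labs) == 1:
        flags.append("single_lab")
    return flags


# Made-up toy values, not taken from the GISAID data.
toy = [
    Variant(100, 40, 8, 15, frozenset({"LabA", "LabB"})),
    Variant(200, 2, 1, 0, frozenset({"LabC"})),
    Variant(205, 1, 1, 0, frozenset({"LabC"})),
    Variant(5000, 3, 1, 0, frozenset({"LabA", "LabD"})),
]
results = {v.position: suspicion_flags(v, toy) for v in toy}
```

In this toy set, only the variant at position 5000 would be treated as reliable, because it triggers none of the four criteria.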
We find that the positions considered are mostly conserved.
Only one of the positions considered (Thr25) seems to be affected by a likely sequencing artefact, and this artefact results only in synonymous variants.
Only four amino acid variants were detected: M49I, P52S, N142S, and P168S. All of them appear at extremely low frequencies (at most two samples each). More details for each position considered are given in Table 1 below.