A preprint manuscript that a group of bioinformaticians at two prestigious universities in Delhi, India, posted to the bioRxiv preprint server on Friday has led many to speculate wildly that 2019-nCoV may have been deliberately engineered using HIV protein sequences.
The paper, entitled “Uncanny similarity of unique inserts in the 2019-nCoV spike protein to HIV-1 gp120 and Gag,” presented a sequence alignment analysis of the unique elements of the 2019-nCoV genome which noted some similarities to elements of the HIV genome. The authors seemed to suggest that these similarities couldn’t have arisen randomly, so people can be forgiven for jumping straight to “it’s a bioweapon” after reading it.
But having read the paper, I still don’t find the bioweapon argument convincing. Despite the new paper’s language, a random sequence overlap is still the leading explanation for the sequence alignment with HIV that it identifies.
Take this to the bank: 2019-nCoV continues to give every appearance of being a wild coronavirus that jumped from bats to humans by way of an animal intermediary in the Huanan seafood market in Wuhan in late 2019. It is not an escaped bioweapon.
Author’s note: I have a Ph.D. in bioinformatics, and am a principal data scientist at a major pharmaceutical company. This paper isn’t directly in my wheelhouse, but it’s pretty close.
What the Indian group did
The new paper’s authors took 28 sequences of the 2019-nCoV genome, isolated from 28 different patient samples, and aligned them with the genome of the bat coronavirus that is 2019-nCoV’s closest known relative. Although the two viral genomes are 96% identical, this leaves about 1,200 of the genome’s roughly 30,000 nucleotides, and a smaller number of protein residues, where they differ.
Among the differences, the Indian team identifies four insertions, where the 2019-nCoV genome contains a small extra sequence corresponding to a few additional amino acids in an otherwise similar protein. These insertions are as short as 6 residues.
All four insertions were located in the “spike protein” of 2019-nCoV, the projecting protein on the virus’s round envelope which recognizes the ACE2 receptor and enables the virus to penetrate mucous membrane cells, and also gives the coronavirus its name. The variable sequences of these recognition regions enable viruses to penetrate different types of cells in the human body.
The authors took the 2019-nCoV version of the spike protein sequence, and performed homology modeling to generate a likely 3D structure of the spike protein, using known 3D structure of the spike protein from the SARS virus as a starting point. They found that although the four sequences are distant in the 1D chain of the protein, the folding of the spike protein brings three of them together in 3D space, and that they are on the “tip” of the spike, at the ACE2 recognition site.
The authors then used the BLASTp (protein BLAST) sequence alignment tool to identify any sequences from any known viral genome that look similar to the short sequences identified from 2019-nCoV. They searched the National Center for Biotechnology Information’s viral genome database, which contains over three million viral genome sequences.
They found that all four of these spike protein inserts appear as matches to at least one sequence in at least one variant of the HIV virus. The sequences come from the gp120 and Gag proteins in HIV, the former of which is also a viral envelope recognition protein. This has led many to credulously assume that this is evidence, or even a strong indication, that 2019-nCoV was engineered from its bat ancestor by humans inserting HIV sequences.
No, 2019-nCoV is still not an escaped bioweapon
But they’re wrong; it’s still not engineered. An analysis of the paper clearly reveals that:
- There is nothing remarkable about the fact that 2019-nCoV’s sequence diverges from its nearest known relative, or that its unique sequences are conserved among cases of 2019-nCoV.
- The sequence matches with HIV are very short and appear in hypervariable regions of both viruses, and similar overlaps are seen between 2019-nCoV sequences and many other organisms.
- The unique biological properties that HIV sequences could theoretically impart to another virus are completely missing from 2019-nCoV, and 2019-nCoV has no unique clinical properties that are outside what is known to be possible for a coronavirus.
In other words, the sequence overlap is not actually uncanny, and there is no big scoop here. The group in India fell prey to some of the pitfalls of bioinformatics research.
There’s no genomic or clinical anomaly that needs explaining
The 2019-nCoV genome does not contain remarkable genomic properties which need explaining, and for which we’d look to some kind of bioengineering as a cause.
The virus has a close 96% sequence overlap with a naturally occurring bat coronavirus, and coronaviruses have been known to jump from bats to humans by way of intermediates before, as the SARS coronavirus did. The differences between the genome sequences, including the ones identified by the Indian study, are in variable regions of the genome that we’d expect to differ. The 4% difference in the genomes is hard to call “high” or “low,” given that we don’t know exactly which bats the 2019-nCoV strain came from or when it diverged from its closest known ancestor.
Nor is it surprising that the known 2019-nCoV sequences all contain the same genomic changes relative to a known relative. They all came from the same outbreak from the same animal reservoir, i.e. they only diverged from each other a few months ago at most. It’s not surprising that they haven’t evolved very much away from each other.
Nor does the clinical presentation of 2019-nCoV have novel features which need explaining. Its symptom profile, degree of transmissibility, severity, mortality rate, duration, incubation and latent period, ability to jump from animals to humans, and ability to transmit asymptomatically and by skin contact are all within the precedents established by other human coronaviruses.
That is, the 2019-nCoV genome and the way it affects humans have, by themselves, no special anomaly which needs explaining.
The sequence overlap is not remarkable and is probably random
Worse, though, the HIV sequence overlap is not particularly remarkable.
The insertions are as short as 6 peptide residues long, and the two which are longer are not identical matches.
Very short sequences are not really what BLASTp was designed for, especially not when searching huge databases. Looking through three million viral genomes for a sequence that short means you’re bound to find something. Other scientists have pointed out in the hours since the Indian paper was posted that similar overlaps, just as strong, may be found in a wide variety of viruses, and also in bacteria, protists, fungi, fruit flies, and plants.
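To put a rough number on “bound to find something,” here is a back-of-the-envelope sketch in Python. It is not the method used in the paper or by BLAST itself. It assumes all 20 amino acids are equally likely and independent, and the database size (1,000 residues per entry) is a made-up figure for illustration only.

```python
def expected_chance_hits(query_len, db_residues, alphabet_size=20):
    """Expected number of exact chance matches for one query of length query_len.

    Simplifying assumption: residues are independent and uniformly
    distributed over the alphabet, which real proteins are not.
    """
    # Probability that one specific database window equals the query exactly.
    p_window = (1.0 / alphabet_size) ** query_len
    # The number of windows is approximated by the total number of residues.
    return db_residues * p_window


# Hypothetical size: 3 million entries times an assumed 1,000 residues each.
ASSUMED_DB_RESIDUES = 3_000_000 * 1_000

for k in (6, 8, 12):
    hits = expected_chance_hits(k, ASSUMED_DB_RESIDUES)
    print(f"query length {k}: about {hits:.3g} exact matches expected by chance")
```

Under these assumptions a 6-residue query is expected to turn up dozens of exact matches purely by chance, while a 12-residue query is expected to turn up essentially none. That gap is why short matches carry so little evidential weight.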
The overlap to HIV is not to a “characteristic” HIV region that is conserved among HIV, but to particular samples (in fact, three different ones from three different countries). They’re just the flotsam and jetsam of variable regions generating a lot of different sequences which get picked up in mass sequencing efforts, not a smoking gun.
In particular, the sequences identified both come from short alpha helical regions on the surface of an envelope/membrane protein, and both feature a lot of positively charged polar residues. These kinds of similar residues have a tendency to appear together on sequences of this type, increasing the chance that unrelated sequences may share short overlaps if they both come from this type of protein domain.
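The effect of a skewed residue composition can be sketched the same way. The chance that two independent random residues are identical is the sum of the squared residue frequencies, so any skew away from a uniform distribution raises it. The skewed frequencies below are hypothetical, chosen only to make the arithmetic easy, and are not measured from any real protein.

```python
def kmer_match_probability(freqs, k):
    """Chance that two independent random k-residue stretches are identical."""
    # Chance that two residues drawn from the same distribution are equal.
    per_position = sum(p * p for p in freqs)
    return per_position ** k


# Uniform composition: every one of the 20 amino acids is equally likely.
uniform = [1 / 20] * 20

# Hypothetical skewed composition: four residues (standing in for a few
# favored charged or polar ones) at 15% each, the other sixteen at 2.5% each.
skewed = [0.15] * 4 + [0.025] * 16

k = 6
ratio = kmer_match_probability(skewed, k) / kmer_match_probability(uniform, k)
print(f"a chance {k}-residue match is about {ratio:.0f}x more likely with the skewed composition")
```

With these made-up numbers the per-position match chance doubles, which compounds over six positions into a chance match that is dozens of times more likely than under a uniform composition.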
The sequence alignment charts of these variable regions of the spike proteins of known coronaviruses from the Indian paper are a good example of this, as the “alignment” is an alphabet soup of wildly variable sequences from different coronaviruses, with no real consistency. The genome database for all organisms is an ocean of this kind of alphabet soup, and the kind of overlap we’re talking about isn’t Hamlet, or even full sentences, but just a few words.
There’s no special effect for this overlap to achieve
On a clinical level, too, there’s no link between these two things. The coronavirus spike protein and the HIV gp120 protein are both recognition proteins on the envelope surface, but they’re very different. The spike protein allows the coronavirus to recognize the ACE2 receptor and invade mucous membrane epithelium, while the gp120 protein allows the HIV virus to recognize the CD4 receptor and invade CD4+ T-cells. The Gag protein on HIV, host of the fourth matching sequence recognized by the Indian team, is in the interior of the virus.
So, if the hypothesis were true, you might expect the 2019-nCoV strain to be able to infect T-cells or recognize the CD4 receptor. But there is no evidence so far that 2019-nCoV can infect T-cells, or that it can infect any cells expressing CD4, or that it can infect any cells which don’t express ACE2 or can’t be infected by other known coronaviruses.
The epidemiology still suggests animal (zoonotic) origin, not an escaped weapon
Human nature, crowd psychology, the availability heuristic of this storyline from fiction, and a bunch of other factors have made the “escaped bioweapon” storyline appear over and over, and spread like wildfire when it does. But there’s no evidence it’s true.
2019-nCoV continues to give every appearance of being a wild coronavirus that jumped from bats to humans by way of an animal intermediary in the Huanan seafood market in Wuhan in late 2019.