No, the 2019-nCoV genome doesn’t really seem engineered from HIV

A group of bioinformaticians at two prestigious universities in Delhi, India, published a preprint scientific manuscript on the bioRxiv preprint server Friday that has led many to speculate wildly that 2019-nCoV may have been deliberately engineered using HIV protein sequences.

The paper, entitled “Uncanny similarity of unique inserts in the 2019-nCoV spike protein to HIV-1 gp120 and Gag,” presented a sequence alignment analysis of the unique elements of the 2019-nCoV genome which noted some similarities to elements of the HIV genome. The authors seemed to suggest that these similarities couldn’t have arisen randomly, so people can be forgiven for jumping straight to “it’s a bioweapon” after reading it.

But having read the paper, I still don’t find this bioweapon argument convincing, and despite this new paper’s language, a random sequence overlap is still the leading explanation for the sequence alignment it identifies with HIV.

Take this to the bank: 2019-nCoV continues to give every appearance of being a wild coronavirus that jumped from bats to humans by way of an animal intermediary in the Huanan seafood market in Wuhan in late 2019. It is not an escaped bioweapon.

Author’s note: I have a Ph.D. in bioinformatics, and am a principal data scientist at a major pharmaceutical company. This paper isn’t directly in my wheelhouse, but it’s pretty close.

Read our continuously updated explainer on nCoV 2019, and check out our guide to the best respirators if you’re stocking up this weekend.

What the Indian group did

The new paper’s authors took 28 sequences of the 2019-nCoV genome isolated from 28 different patient samples, and aligned them with the genome of the bat coronavirus that is 2019-nCoV’s closest known relative. Although the two viral genomes are 96% identical, the remaining 4% amounts to about 1,200 nucleotide bases, and a smaller number of protein residues, where they differ.
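As a rough check on that arithmetic, here is a minimal sketch. The genome length of about 29,900 bases is an assumption brought in for illustration (it is the approximate length of published 2019-nCoV genomes, not a number taken from the paper discussed here).

```python
# Back-of-the-envelope check: how many bases differ at 96% identity?
# GENOME_LENGTH is an assumption for illustration, not a figure from the paper.
GENOME_LENGTH = 29_900   # approximate 2019-nCoV genome length, in bases
IDENTITY = 0.96          # fraction of identical positions quoted in the text


def differing_bases(genome_length: int, identity: float) -> int:
    """Return the approximate number of positions that differ."""
    return round(genome_length * (1.0 - identity))


if __name__ == "__main__":
    print(differing_bases(GENOME_LENGTH, IDENTITY))
```

The result lands close to the "about 1,200" quoted above. Because three bases encode one amino acid and some base changes do not alter the protein, the number of differing protein residues is smaller.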

Among the differences, the Indian team identifies four insertions, where the 2019-nCoV genome contains a small extra sequence corresponding to a few additional amino acids in an otherwise similar protein. These insertions are as short as 6 residues.
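To make "insertion" concrete, here is a small sketch of how insertions can be read off a pairwise alignment: they are the runs where the reference row shows gaps and the query row shows residues. The function name and the toy sequences are made up for illustration. They are not real viral sequences, and this is not the authors' pipeline.

```python
from typing import List, Tuple


def find_insertions(ref_aln: str, query_aln: str, gap: str = "-") -> List[Tuple[int, str]]:
    """Return (query_position, inserted_residues) for each insertion.

    Both arguments are rows of the same pairwise alignment, so they must
    have equal length. query_position is the 0-based index of the first
    inserted residue in the ungapped query sequence.
    """
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned rows must have equal length")

    insertions: List[Tuple[int, str]] = []
    current: List[str] = []   # residues of the insertion being read
    start = 0                 # ungapped query index where that run began
    query_pos = 0             # index into the ungapped query sequence

    for r, q in zip(ref_aln, query_aln):
        if r == gap and q != gap:
            if not current:
                start = query_pos
            current.append(q)
        elif current:
            # The run of reference gaps has ended; record the insertion.
            insertions.append((start, "".join(current)))
            current = []
        if q != gap:
            query_pos += 1

    if current:  # insertion that runs to the end of the alignment
        insertions.append((start, "".join(current)))
    return insertions


if __name__ == "__main__":
    # Toy alignment (hypothetical sequences): the query carries two inserts.
    reference = "MKT---LLV-A"
    query = "MKTGHRLLVPA"
    print(find_insertions(reference, query))
```

An insertion found this way says only that the query has residues the chosen reference lacks. It says nothing about where those residues came from.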

All four insertions were located in the “spike protein” of 2019-nCoV, the projecting protein on the virus’s round envelope which recognizes the ACE2 receptor and enables the virus to penetrate mucous membrane cells, and also gives the coronavirus its name. The variable sequences of these recognition regions enable viruses to penetrate different types of cells in the human body.

The authors took the 2019-nCoV version of the spike protein sequence, and performed homology modeling to generate a likely 3D structure of the spike protein, using the known 3D structure of the spike protein from the SARS virus as a starting point. They found that although the four sequences are distant in the 1D chain of the protein, the folding of the spike protein brings three of them together in 3D space, and that they are on the “tip” of the spike, at the ACE2 recognition site.

The authors then used the BLASTp (protein BLAST) sequence alignment tool to identify any sequences from any known viral genome that look similar to the short sequences identified from 2019-nCoV. They searched the National Center for Biotechnology Information’s viral genome database, which contains over three million viral genome sequences.

They found that all four of these spike protein inserts appear as matches to at least one sequence in at least one variant of the HIV virus. The sequences come from the gp120 and Gag proteins in HIV, the former of which is also a viral envelope recognition protein. This has led many to credulously assume that this is evidence, or even a strong indication, that 2019-nCoV was engineered from its bat ancestor by humans inserting HIV sequences.

No, 2019-nCoV is still not an escaped bioweapon

But they’re wrong; it’s still not engineered. An analysis of the paper clearly reveals that:

  • There is nothing remarkable about the fact that 2019-nCoV’s sequence diverges from its nearest known relative, or that its unique sequences are conserved among cases of 2019-nCoV.
  • The sequence matches with HIV are very short and appear in hypervariable regions of both viruses, and similar overlaps are seen between 2019-nCoV sequences and many other organisms.
  • The unique biological properties that HIV sequences could theoretically impart to another virus are completely missing from 2019-nCoV, and 2019-nCoV has no unique clinical properties that are outside what is known to be possible for a coronavirus.

In other words, the sequence overlap is not actually uncanny, and there is no big scoop here. The group in India fell prey to some of the pitfalls of bioinformatics research.

There’s no genomic or clinical anomaly that needs explaining

The 2019-nCoV genome does not contain remarkable genomic properties which need explaining, and for which we’d look to some kind of bioengineering as a cause.

The virus has a close 96% sequence overlap with a naturally occurring bat coronavirus, and coronaviruses have been known to jump from bats to humans by way of intermediates before, like the SARS coronavirus. The differences between the genome sequences, including the ones identified by the Indian study, are in variable regions of the genome that we’d expect to differ, and the 4% difference in the genomes is hard to call “high” or “low,” given that we don’t know exactly which bats the 2019-nCoV strain came from or when it diverged from its closest known ancestor.

Nor is it surprising that the known 2019-nCoV sequences all contain the same genomic changes relative to a known relative. They all came from the same outbreak from the same animal reservoir, i.e. they only diverged from each other a few months ago at most. It’s not surprising that they haven’t evolved very much away from each other.

Nor does the clinical presentation of 2019-nCoV have novel features which need explaining. Its symptom profile, degree of transmissibility, severity, mortality rate, duration, incubation and latent period, ability to jump from animals to humans, and ability to transmit asymptomatically and by skin contact are all within the precedents established by other human coronaviruses.

That is, the 2019-nCoV genome and the way it affects humans have, by themselves, no special anomaly which needs explaining.

The sequence overlap is not remarkable and is probably random

Worse, though, the HIV sequence overlap is not particularly remarkable.

The insertions are as short as 6 amino acid residues, and the two that are longer are not identical matches.

Very short sequences are not really what BLASTp was designed for, especially not when searching huge databases. Looking through three million viral genomes for a sequence that short means you’re bound to find something, and other scientists have pointed out in the hours since the Indian paper was posted that similar overlaps, just as strong, may be found in a wide variety of viruses, and also bacteria, protists, fungi, fruit flies, and plants.
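The "bound to find something" point can be put in rough numbers. The sketch below estimates how many exact chance matches to one fixed short peptide you would expect in a large pile of random protein sequence. It assumes every residue is independent and equally likely among 20 amino acids, which real proteins are not, and the database size is a made-up round figure, not the size of the database the authors searched.

```python
def expected_chance_hits(peptide_length: int, database_residues: int,
                         alphabet_size: int = 20) -> float:
    """Expected number of exact matches to one fixed peptide in random sequence.

    Assumes independent, equally likely residues, so treat the result as an
    order-of-magnitude guide only.
    """
    # Number of starting positions at which the peptide could match.
    windows = max(database_residues - peptide_length + 1, 0)
    return windows / alphabet_size ** peptide_length


if __name__ == "__main__":
    # Hypothetical database size, chosen only for illustration.
    HYPOTHETICAL_DB_RESIDUES = 1_000_000_000
    for length in (6, 8, 12):
        hits = expected_chance_hits(length, HYPOTHETICAL_DB_RESIDUES)
        print(f"{length}-residue peptide: ~{hits:.3g} expected chance hits")
```

Under these assumptions a 6-residue peptide is expected to turn up by chance more than a dozen times in a billion residues, while a 12-residue exact match would be very unlikely. Real searches also accept inexact matches, which makes chance hits to short queries more common still.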

The overlap to HIV is not to a “characteristic” HIV region that is conserved among HIV, but to particular samples (in fact, three different ones from three different countries). They’re just the flotsam and jetsam of variable regions generating a lot of different sequences which get picked up in mass sequencing efforts, not a smoking gun.

In particular, the sequences identified both come from short alpha helical regions on the surface of an envelope/membrane protein, and both feature a lot of positively charged polar residues. These kinds of similar residues have a tendency to appear together on sequences of this type, increasing the chance that unrelated sequences may share short overlaps if they both come from this type of protein domain.
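The effect of this kind of compositional bias can be sketched with a simple probability. If two unrelated sequences draw their residues from the same frequency distribution, the chance that they agree at one position is the sum of the squared frequencies, and the chance that a whole stretch agrees is that value raised to the stretch length. The skewed composition below is hypothetical, chosen only to show the direction of the effect. It was not measured from spike, gp120, or Gag.

```python
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def per_position_match_probability(freqs: Dict[str, float]) -> float:
    """Chance that two residues drawn independently from freqs are identical."""
    total = sum(freqs.values())
    return sum((f / total) ** 2 for f in freqs.values())


def stretch_match_probability(freqs: Dict[str, float], length: int) -> float:
    """Chance that two independent random stretches of this length are identical."""
    return per_position_match_probability(freqs) ** length


# Uniform composition: every residue equally likely.
uniform = {aa: 1.0 for aa in AMINO_ACIDS}

# Hypothetical skewed composition: four residues make up 60% of positions.
ENRICHED = "KRSN"  # illustrative choice, not measured data
skewed = {aa: (0.15 if aa in ENRICHED else 0.025) for aa in AMINO_ACIDS}

if __name__ == "__main__":
    for name, freqs in (("uniform", uniform), ("skewed", skewed)):
        p1 = per_position_match_probability(freqs)
        p6 = stretch_match_probability(freqs, 6)
        print(f"{name}: per-position {p1:.3f}, 6-residue stretch {p6:.3g}")
```

With this made-up skew the per-position match chance doubles from 0.05 to 0.10, so a 6-residue chance match becomes 64 times more likely than under a uniform composition.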

The sequence alignment charts of these variable regions of the spike proteins of known coronaviruses from the Indian paper are a good example of this, as the “alignment” is an alphabet soup of wildly variable sequences from different coronaviruses, with no real consistency. The genome database for all organisms is an ocean of this kind of alphabet soup, and the kind of overlap we’re talking about isn’t Hamlet, or even full sentences, but just a few words.

There’s no special effect for this overlap to achieve

On a clinical level, too, there’s no link between these two things. The coronavirus spike protein and the HIV gp120 protein are both recognition proteins on the envelope surface, but they’re very different. The spike protein allows the coronavirus to recognize the ACE2 receptor and invade mucous membrane epithelium, while the gp120 protein allows the HIV virus to recognize the CD4 receptor and invade CD4+ T-cells. The Gag protein on HIV, host of the fourth matching sequence recognized by the Indian team, is in the interior of the virus.

So, if the hypothesis were true, you might expect the 2019-nCoV strain to be able to infect T-cells or recognize the CD4 receptor. But there is no evidence so far that 2019-nCoV can infect T-cells, or that it can infect any cells expressing CD4, or that it can infect any cells which don’t express ACE2 or can’t be infected by other known coronaviruses.

The epidemiology still suggests animal (zoonotic) origin, not an escaped weapon

Human nature, crowd psychology, the availability heuristic of this storyline from fiction, and a bunch of other factors have made the “escaped bioweapon” storyline appear over and over, and spread like wildfire when it does. But there’s no evidence it’s true.

2019-nCoV continues to give every appearance of being a wild coronavirus that jumped from bats to humans by way of an animal intermediary in the Huanan seafood market in Wuhan in late 2019.


    • Blougram

      Isn’t the (discredited) idea rather that this may have been the unfortunate result of a zoonosis study (to find out the risk of bat coronaviruses making the leap) with someone getting accidentally infected? I have no idea what “HIV pseudotyping” is, and I hope it would not lead to the sequences discussed in the discredited paper, but this kind of research gives me the creeps. Heck. The authors even talk about the danger:

      -7 |
      • Ari Allyn-Feuer Blougram

        Interesting link.  Thanks for sending.  It’s unlikely that this kind of research would lead to an outbreak, because they take a lot of precautions in virology labs.  They work with HIV, Ebola, Smallpox, and they don’t let them out.  A few individual cases of SARS have happened in virology labs since the 2003 outbreak, but no widespread transmission.

        -67 |
      • P N Blougram

        That’s not a logical argument at all. It’s very unlikely that nuclear catastrophes happen, but they do. A lot of things that are “very unlikely” because “we take all the security steps” still happen. It’s very unlikely to die from lightning or to find a treasure, but it happens. But here, one time is enough, one person is enough.

        Logically, your argument doesn’t mean it can happen, or it cannot happen: it means nothing.

        0 |
    • Bryan

      BLAST searching all the insertions in the paper reveals there is indeed a 100% match of those residues in a known bat coronavirus (Accession #QHR63300.1). However, the submitter of that bat coronavirus is actually Wuhan Virology Institute, submitted 4 days ago (Feb 27, 2020). As a professional of a closely related field, do you think any of that is even relevant? If the validity of that bat coronavirus is in question, then these 4 insertions are truly novel in any known coronavirus.

      -10 |
      • Bryan Bryan
        0 |
      • Ari Allyn-Feuer Bryan

        It looks like that may be a more recently generated sequence of a wild coronavirus that more closely matches the 2019-nCoV strain than the previously known bat coronavirus, but was only found recently.  It would make sense for that kind of research to be done.

        But anyway, the larger point is, there’s nothing surprising about 2019-nCoV having novel sequences.

        -16 |
      • Jason Weishaupt Bryan

        Go to the NIH’s BLAST database and search for a genetic match to virus serial # AVP78033.1 and it comes up with a 100% match for the “Wuhan Seafood Market Virus”. AVP78033.1 was registered with the NIH in March of 2018 by the Chinese Military.

        -16 |
    • Gretchen Dulak
      -27 |
    • Isaac

      Is there any way you could touch on the ORF10 protein ?


      ( Please and thank you )

      -19 |
    • Tom Young

      A long time ago, last millennium in fact, I had peripheral involvement in a project using peptides from the helical regions of the S protein to interfere with the cell fusion process.  This type of work keeps getting repeated and repeated, eg here

      Or this one

      It always seems to work in cell assays but as far as I know no one goes on to do human tests.  For a number of reasons I think, it is not an economically attractive proposition for a drug company, peptides are comparatively expensive and not easy to administer as a pill.  Also these high profile pandemics don’t infect large numbers and then they disappear and don’t leave an easy population to run clinical trials.

      But I do wonder if a regulatory and market gap has opened up here.  Since the development time for locating suitable peptide agents would be very short.  Perhaps someone should just bite the bullet this time and start testing them in a clinical setting.

      Or perhaps someone has and it hasn’t worked as well as it did in cell culture.

      -3 |
    • Gerry Lassche

      I’m no scientist. From my limited understanding of what you’re saying:

      1) the indian research is real, not fake. The facts are not in question, just their interpretation of the data is disputed. Correct?

      2) the sequences in question are not novel enough to warrant a special HIV resemblance. They are common to coronaviruses.

      3) a smoking gun would be someone being diagnosed with HIV after catching ncova. But currently arriving passengers are not being screened for HIV, are they?

      -52 |
      • Tom Young Gerry Lassche

        I’m not much of a scientist, but what I would say is

        1).  Correct

        2.)  The Indian group identified some inserts in 2019-nCoV in the spike protein (used for cell entry) compared with its closest relative.  Having inserts is not suspicious, however they managed to identify these inserts from the HIV genome, suggesting someone might have deliberately engineered them in.  If you look at Table 1 you find 3 of the 4 inserts are located in hypervariable regions of HIV.  These are regions that are not conserved as they are not under heavy selection pressure – so there will be lots of different variations in existence.  The isolates are from Kenya, India and Thailand which suggests that the underlying database they are searching is probably very large.  If someone was searching the database for useful segments to insert into Coronavirus, then why would they choose one from Kenya, one from Thailand and two from India?  It wouldn’t be a sensible approach for bioengineering because the target proteins of the S protein and GP120 (HIV) are completely different.

        3.)  No, I think the Indian authors were arguing this was a deliberate case of engineering, hence you wouldn’t expect any cases of  HIV to crop up.

        Just out of curiosity, India was where the story of HIV being an escaped bioweapon originally emerged back in 1983.  The US managed to stamp down on it successfully, but it looks like an oral history of it has survived among the Indian scientific community

        -7 |
      • Jason Weishaupt Gerry Lassche

        Go to the NIH’s BLAST database and search for a genetic match to virus serial # AVP78033.1 and it comes up with a 100% match for the “Wuhan Seafood Market Virus”. AVP78033.1 was registered with the NIH in March of 2018 by the Chinese Military. Cold Spring Harbor Laboratory just found 2 HIV viruses attached to it…

        -30 |
      • Ale Sh Gerry Lassche

        Can you provide links to the Cold Spring Harbor Laboratory lead you mentioned?

        1 |
    • John RameyThe Prepared

      Looks like we’re getting hit with a downvote ring. Ignore the votes folks. Might just be trolls, or might be pro-chinese groups.

      1 |
    • Jason Weishaupt

      [comment deleted]

      -8 |
    • Ridha

      [removed by moderators for breaking community policy]

      -6 |
      • Niklas E. Ridha

        It’s a complicated matter, which makes it super difficult to make it easily understandable for people without a degree in the topic. I am studying immunology and have my issues understanding it completely.

        What I can tell you though is that the effectiveness of HIV medication is very unlikely related to the inserts identified by the researchers. Why am I so certain of that? The HIV medications are protease inhibitors, so substances that block a specific enzyme of the virus. The inserts that are alleged to come from HIV are in the spike proteins (part of the virus surface) and these proteins are not the target of the medication. The sequence of the nCoV protease doesn’t seem to contain any insertions.

        The escaped bioweapons / failed experiment theory is certainly the more exciting one and I can understand why people would want to believe it, but I think for now this is not really convincing evidence. The overlapping sequences are very short, these kinds of sequences are not completely random but tend to follow certain patterns, and the inserts don’t seem to transfer any of the unique properties of HIV (especially the ability to infect T-Cells via the CD4 surface protein).

        6 |
    • Christian Ryan White

      [comment deleted]

      0 |
    • Jason Weishaupt

      Go to the NIH’s BLAST database and search for a genetic match to virus serial # AVP78033.1 and it comes up with a 100% match for the “Wuhan Seafood Market Virus”. AVP78033.1 was registered with the NIH in March of 2018 by the Chinese Military. A lab in the U.S. just found HIV attached to the Coronavirus.$=prottop&blast_rank=1&RID=3CFMZWSH01R

      5 |
    • Jason

      I was a medical student, though I am not currently working in medicine-related industries. Some points in your article do arouse my interest.

      1. “Among the differences, the Indian team identifies four insertions… These insertions are as short as 6 residues.” It seems to me you are suggesting that given the sequences matched with those of HIV virus are as short as it is that we cannot jump to the conclusion that those sequences were actually from HIV. Fair enough. BUT, how about the fact that 2019-nCoV has 4 genes similar to HIV (or many other virus) is compared with other coronaviruses, if it is the case that we have not until now seen any other coronavirus has even 4 short genes resembling those of HIV (or many other virus), it seems a smoking gun to me. The sheer number and length of those genes in itself is not enough to debunk any assumption as is not to affirm the conspiracy theory, whereas on average, how common it is that other coronaviruses have 4 genes similar to those of HIV (or other viruses) is definitely much more convincing. In other word, if it’s a rarity that coronaviruses have 4 or more genes similar to those of other viruses, WOW, you can tell, your assertion does not weaken the claim at all, BUT on contrary, strengthens the original claim.

      2. “In particular, the sequences identified both come from short alpha helical regions on the surface of an envelope/membrane protein…” it seems to me that you are claiming the similarities in genes come from regions where the chances of genes showing the same traits abound, and so the appearance of 4 genes from these regions resembling those of HIV (or other viruses) is not unique enough for someone to raise concern. However, this logic does not sit well with me. It would be natural for me, were I a researcher, to choose highly adaptive (changeable) regions to conduct preliminary studies, as that bears a higher chance of success. Again, how likely it is that other coronaviruses have 4 genes from highly changeable regions resembling those of HIV (or other viruses) is the GOLD STANDARD as far as I’m concerned.

      3 |
    • Alfinea

      German scientists in earthworms discovered the LIN-53 protein – this is a chaperone (a class of protein responsible for restoring the correct structure of proteins and the formation of protein complexes), which binds to special molecules (histones) that provide the “packing” of DNA strands in the cell nucleus, similar to human protein RBBP4 / 7. If so, why not create the RBBP4 / 7 genomic virus?)

      0 |
    • Max Kennedy

      I think your first sentence needs fixing, like maybe a word added (e.g. “THAT”):  “A group of bioinformaticians at two prestigious universities in Delhi, India, published a preprint scientific manuscript on the bioRxiv preprint server Friday THAT has led many to speculate wildly that 2019-nCoV may have been deliberately engineered using HIV protein sequences.”

      1 |
    • Amanda federick

      [comment deleted]

      0 |