Miltoncontact: What could I find out with a Wuhan COVID-19 coronavirus sequence?

When I heard on the news that the sequence of the new Wuhan COVID-19 coronavirus had been made public, the nerd in me was awoken. It triggered memories of when I used to work with plant viruses two decades ago. Would I still be able to find out more about the virus using some of the methods I would have used back then? Would I be able to understand how the up-to-date and vastly more experienced teams in animal and human virology might approach the problem? Could I explain it to others who don’t work with viruses?

If you are simply looking for general information on the current COVID-19 epidemic, have a look at my earlier post here: http://www.miltoncontact-blog.com/2020/02/should-i-worry-about-wuhan-2019-ncov.html. This includes up-to-date charts and tables on the progress of the epidemic using data from the WHO situation reports.

For my adventure with the Covid-19 sequence, read on.

Finding the Wuhan COVID-19 sequence

One of the great early benefits of the internet was the setting up of DNA databases accessible to all scientists, via The European Molecular Biology Laboratory (EMBL), a molecular biology research institution supported by 27 member states - https://www.embl.org/. This database is currently held by the EBI, the European Bioinformatics Institute, whose centre is based in Hinxton, just outside of Cambridge.

My first step was to see if I could find and download the 2019 nCoV (COVID-19) sequence, using the European Nucleotide Sequence Browser that I found at https://www.ebi.ac.uk/ena/browser/home. In the last week of January, I found a sequence of "A novel coronavirus associated with a respiratory disease in Wuhan of Hubei province, China" provided by F. Wu and a further 18 co-workers. It had been submitted on 05-JAN-2020 by the Shanghai Public Health Clinical Center & School of Public Health, Fudan University, Shanghai, China and entered into the database on 13-JAN-2020. Its entry number on the EBI database is MN908947.

Note that if you search now, you will pick up a different set of 7 later sequences under the accession number MN988668. The first two are from "RNA based mNGS approach identifies a novel human coronavirus from two individual pneumonia cases in 2019 Wuhan outbreak"; Emerg Microbes Infect :0(2019) by Chen L and others, submitted 23-JAN-2020, State Key Laboratory of Virology, Wuhan University, Bayi Road, Wuchang District, Wuhan, Hubei 430072, P.R. China.

There are 5 further sequences from US laboratories from isolates of the virus from US patients in Arizona, California and Illinois, also from around the end of January.

Update 24 February 2020: 24 different COVID-19 sequences were available, from the US, Japan, China and Italy.

The virus sequence is given as 29,881 nucleotides.

Are there differences between the eight COVID-19 sequences?

I downloaded all the sequences and combined them into a merged file using the software SeqVerter, part of a downloaded freeware called GenStudio Pro. I then uploaded the merged file to Clustal Omega, an EBI program that compares multiple sequences online. After a cup of tea and a scone (with jam), the result appeared on the screen.
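For readers who would rather script the merging step, here is a minimal sketch in Python. It assumes each download is a plain FASTA text file. The two records below are toy stand-ins, not real virus data, and `parse_fasta` and `merge_fasta` are names invented for this illustration. They are not part of SeqVerter or any EBI tool.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def merge_fasta(texts, width=60):
    """Combine several FASTA texts into one multi-record FASTA string."""
    out = []
    for text in texts:
        for header, seq in parse_fasta(text):
            out.append(">" + header)
            # Re-wrap each sequence to a fixed line width.
            out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


# Toy records standing in for the downloaded files (hypothetical data).
file_a = ">isolate_A toy\nATTAAAGG\nTTTATACC\n"
file_b = ">isolate_B toy\nATTAAAGGTTTATACC\n"
merged = merge_fasta([file_a, file_b])
print(merged)
```

The merged text is the kind of multi-sequence input that an alignment program such as Clustal Omega accepts.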

They were all 100% identical – the three Chinese and the five different US isolates.

Short section showing sequence identity over the eight available 2019 nCoV (COVID-19) sequences

From a disease point of view, this was ‘good’ news. It showed that from the beginning of the outbreak in China to the first cases in the US, the virus had not mutated into a different strain.
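To make "100% identical" concrete, here is a small sketch of one way to score identity across an alignment: count the columns in which every sequence carries the same letter. It assumes the sequences are already aligned to the same length. The toy sequences and the function name `percent_identity` are made up for illustration, and this is not necessarily how Clustal Omega reports its figures.

```python
def percent_identity(aligned):
    """Percentage of alignment columns where all sequences agree."""
    if not aligned or len(aligned[0]) == 0:
        raise ValueError("need at least one non-empty sequence")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    # zip(*aligned) walks through the alignment one column at a time.
    identical = sum(1 for column in zip(*aligned) if len(set(column)) == 1)
    return 100.0 * identical / length


# Toy alignments (hypothetical data, ten positions each).
toy_identical = ["ATTAAAGGTT", "ATTAAAGGTT", "ATTAAAGGTT"]
toy_variant = ["ATTAAAGGTT", "ATTAAAGGTT", "ATTACAGGTT"]

print(percent_identity(toy_identical))  # 100.0
print(percent_identity(toy_variant))    # 90.0
```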

From an old virologist’s point of view, this was quite an unusual result. Why? Well, virus RNA is replicated with a far higher error rate than DNA. The rate is about 1 error in every 10,000 nucleotides copied. The virus RNA is almost 30,000 nucleotides long. So every time the virus reproduces, I would expect two to three differences to be introduced. When you get millions of virus particles made during an infection, the sequences are actually a spread of mutations that average out around a consensus sequence. An RNA virus is thus not a species as such, but technically a quasispecies. This holds true for another coronavirus disease, MERS (Middle East Respiratory Syndrome), as explained in Mandary et al. (2019) “Impact of RNA Virus Evolution on Quasispecies Formation and Virulence” https://www.researchgate.net/publication/335938867_Impact_of_RNA_Virus_Evolution_on_Quasispecies_Formation_and_Virulence.
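The back-of-envelope arithmetic can be written out as a short sketch. The error rate and genome length are the round figures quoted above. The second function adds an assumption of my own that is not in the original reasoning: that copying errors are rare and independent, so the chance of an error-free copy follows a Poisson distribution. Both function names are invented for this example.

```python
import math


def expected_copy_errors(genome_length, nucleotides_per_error=10_000):
    """Average number of errors introduced in one copy of the genome."""
    return genome_length / nucleotides_per_error


def chance_of_perfect_copy(genome_length, nucleotides_per_error=10_000):
    """Probability of zero errors, ASSUMING independent (Poisson) errors."""
    return math.exp(-expected_copy_errors(genome_length, nucleotides_per_error))


print(expected_copy_errors(30_000))               # 3.0
print(round(chance_of_perfect_copy(30_000), 3))   # 0.05
```

Under these assumptions only about one copy in twenty would be error-free, which is why a set of perfectly identical sequences looks surprising at first sight.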

I would expect it to be true for the COVID-19.

Diseases like polio, for example, took advantage of this spread of mutations. After infecting a body and causing mild symptoms, a few viruses were able to break through the blood-brain barrier and cause the more severe paralysis. This spread of mutations is also what might make it possible for a virus to jump species and infect a host it does not normally reproduce in.

When I did sequencing more than 20 years ago, we had to clone virus fragments and then sequence each clone. We would have seen the different sequence mutations and would have had to sequence a number of clones to get the average quasispecies sequence.

Modern virus sequencing gets around the individual cloning by using NGS, next generation sequencing https://bitesizebio.com/21193/a-beginners-guide-to-next-generation-sequencing-ngs-technology/. The viral RNA, with its whole population of different sequences, is extracted, amplified and sequenced. The sequence obtained is the consensus, the average sequence of the quasispecies (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3708773/pdf/1471-2164-14-444.pdf).
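The idea of an average sequence can be illustrated with a simple majority vote at each position. This is only a sketch, assuming the reads are already aligned and of equal length. Real NGS pipelines also deal with read mapping, quality scores and coverage, none of which is modelled here. The reads are toy data and `consensus` is a name invented for this example.

```python
from collections import Counter


def consensus(aligned_reads):
    """Return the most common letter at each position of the aligned reads."""
    if not aligned_reads:
        raise ValueError("need at least one read")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must all have the same length")
    result = []
    for column in zip(*aligned_reads):
        # most_common(1) gives the single most frequent letter in the column.
        letter, _count = Counter(column).most_common(1)[0]
        result.append(letter)
    return "".join(result)


# Toy population: three of the five reads carry one mutation each.
toy_reads = ["ATGGCA", "ATGGCA", "ATGACA", "CTGGCA", "ATGGCT"]
print(consensus(toy_reads))  # ATGGCA
```

The individual mutations disappear from the consensus because no single variant forms the majority at its position.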

Update 24 February 2020: A news report stated that there were now sufficient sequences of COVID-19 for researchers to determine how the disease might be migrating into different countries.

I therefore had a go myself, generating a first phylogenetic tree. Here are the results I obtained with Clustal, using unedited sequences straight from the EBI.

I do not claim that these results are definitive linkages of relationship or origin at this point.

24 February 2020: Phylogenetic tree showing possible relationships using unedited COVID-19 sequences from the EBI.

Tentative questions from my data. Are there really two major families of the virus? Are the Japanese and the Italian strains closely related?

A more comprehensive chart created by researchers can be found at the global flu initiative GISAID at https://www.gisaid.org/.

What is the closest relative to the COVID-19?

The beauty of having a public sequence database is that you can take a new sequence, like COVID-19, and use it to see if you find similar sequences amongst the millions already there. I did this online at EBI, using the nucleotide similarity search, Fasta. It was limited to finding 50 related sequences.

I had the results displayed as a “Phylogenetic tree cladogram”, a branching pattern showing the degree of similarity between the different sequences.
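A tree like this is a summary of how different the sequences are from one another. As a rough sketch of that underlying idea, the code below computes the proportion of differing positions between aligned sequences and picks the nearest neighbour of a query. This simple measure is a stand-in chosen for illustration. It is not the algorithm used by Fasta or by the EBI tree display, and the sequence names and data are invented.

```python
def p_distance(a, b):
    """Proportion of differing positions between two aligned sequences.

    Positions where either sequence has a gap ('-') are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return differing / compared


def closest_relative(query_name, aligned):
    """Return (name, distance) of the sequence nearest to the query."""
    query = aligned[query_name]
    distances = {
        name: p_distance(query, seq)
        for name, seq in aligned.items()
        if name != query_name
    }
    best = min(distances, key=distances.get)
    return best, distances[best]


# Toy alignment (hypothetical names and sequences).
toy_alignment = {
    "toy_query": "ACGTACGTAC",
    "toy_relative_1": "ACGTACGTAA",
    "toy_relative_2": "ACGAACTTAA",
    "toy_relative_3": "TCGAAGTTCA",
}
print(closest_relative("toy_query", toy_alignment))
```

Tree-building programs start from a full table of such pairwise distances, or from the alignment itself, and then work out a branching order that fits them.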

My first phylogenetic tree using 2019 nCoV (Wuhan Seafood Market, COVID-19) against the EBI nucleotide database

The tree showed that the closest similarity of the COVID-19 (Wuhan Seafood Market) coronavirus was to Bat SARS-like viruses, and more distantly to other SARS viruses. Those simply called SARS are human isolates. There was a SARS epidemic, which also originated in China, back in 2002, which was finally brought under control in 2004. SARS stands for Severe Acute Respiratory Syndrome. It killed almost one in ten of the people infected.

It is interesting how many patent sequences were also picked up. Presumably from companies and organisations that wanted to provide detection and possibly treatment products against SARS.

I wanted a different display to put the COVID-19 in a wider context. I therefore downloaded a number of species specific Coronavirus sequences that I found by searching the European Nucleotide Sequence Browser for coronavirus. I also removed all the patent sequences from the original set found. The new set of data was uploaded for analysis.

The new phylogenetic tree is shown below. I’ve left the accession numbers in to make it easier for future work. I also stretched the tree horizontally from the original to make the branching clearer, and coloured different groups for interpretation.

My second phylogenetic tree from using the results from the first search, minus the patent sequences, plus 15 other coronaviruses from different animals

The different human SARS (simply called SARS) sequences in dark red divide into two groups. One has similarities to Civet SARS (in orange), the other has links that reach to Bat SARS strains (orange). The Wuhan COVID-19, marked in bold red, is more closely related to the Bat SARS. MERS, Middle East Respiratory Syndrome (dark red), was first identified in Saudi Arabia in 2012. It seems to be more lethal than SARS, with about 36% of individuals diagnosed with the disease dying from it. However, it does not seem to spread easily and there were around 2000 cases recorded in the period 2012 to 2017. My advice, and that of the professionals, is – keep away from sick camels.

The remaining animal coronaviruses marked in black cover a range from pig to human to rat. The human coronavirus OC43 is one of a number of viruses that cause the common cold.

Using the COVID-19 sequence to find a vaccine

Companies and organisations around the world, including the US and Porton Down in the UK, are now racing to develop a vaccine. One company in the news recently is hoping to get to human trials in the summer.

Where would I begin?

A Google search for antigenic regions in SARS will find a number of publications from after the last SARS outbreak. Researchers use blood serum from people who are ill with, or have just recovered from, SARS and see whether their sera react with any of the virus’s proteins (that is, whether the proteins are antigenic). If they do, the sera probably contain useful antibodies.

A paper I particularly liked studied the SARS spike protein. The spike protein sits on the outside of the virus capsule. It interacts with the human cells during infection and plays a part in the absorption of the virus into the cell. The spike protein has two domains (stretches), S1 and S2. S1 is not very antigenic, but S2 is. Hong Zhang et al (2004) used a strain of the human SARS called BJ01 (accession no. AY278488). They created 12 different overlapping fragments of the S2 domain of the spike protein. They labelled them F1 to F13. They looked at which of these fragments were bound by antibodies in sera from 15 different SARS patients. (published as “Identification of an Antigenic Determinant on the S2 Domain of the Severe Acute Respiratory Syndrome Coronavirus Spike Glycoprotein Capable of Inducing Neutralizing Antibodies” https://www.ncbi.nlm.nih.gov/pmc/articles/PMC421668/).

The fragments F3 and F9 in SARS turned out to be antigenic, i.e., the patients’ sera reacted with them. Proteins are made up of chains of amino-acids. F3 stretched from the amino acid Arginine at position 797 in the spike protein to amino acid Proline at position 844. F9 stretched from amino-acid Leucine at position 1045 to Aspartic acid at 1109.

Using a single letter code for each amino acid, the sequences of F3 and F9 are:

F3=RSFIEDLLFNKVTLADAGFMKQYGECLGDINARDLICAQKFNGLTVLP

F9=LHVTYVPSQERNFTTAPAICHEGKAYFPREGVFVFNGTSWFITQRNFFSPQIITTDNTFVSGNCD
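As a quick sanity check, the quoted positions and the fragment sequences can be compared in a few lines of Python. The sketch below confirms that each fragment has the length implied by its start and end positions, and shows how a fragment would be cut out of a full protein sequence using 1-based, inclusive positions. The helper names are invented for this example, and the full spike sequence itself is not included here.

```python
F3 = "RSFIEDLLFNKVTLADAGFMKQYGECLGDINARDLICAQKFNGLTVLP"
F9 = "LHVTYVPSQERNFTTAPAICHEGKAYFPREGVFVFNGTSWFITQRNFFSPQIITTDNTFVSGNCD"


def span_length(start, end):
    """Number of residues from start to end, counting both ends."""
    return end - start + 1


def extract_fragment(protein, start, end):
    """Cut residues start..end (1-based, inclusive) out of a protein string."""
    return protein[start - 1:end]


# F3 runs from Arginine (R) 797 to Proline (P) 844.
print(len(F3), span_length(797, 844), F3[0], F3[-1])
# F9 runs from Leucine (L) 1045 to Aspartic acid (D) 1109.
print(len(F9), span_length(1045, 1109), F9[0], F9[-1])
```

Both fragments agree with the positions given in the text: 48 residues for F3 and 65 for F9.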

First I compared how similar the spike proteins of human SARS BJ01, a Bat SARS, a MERS and a human coronavirus were to that of COVID-19 (labelled 2019 nCoV). I did this in pairs using the alignment program in GenStudioPro. The results are shown in the figure below. You can get a good first impression of how well the sequences match by looking at how much of each alignment is coloured darkly, indicating a 100% match.

Pairs of alignments between COVID-19 (2019 nCoV) and human SARS BJ01, a Bat SARS, a MERS and human coronavirus OC43. The locations of the antigenic fragments F3 and F9 from human SARS BJ01 are marked in red bars. The dark colours in the aligned sequences show a 100% match.

Just from the colour patterns alone, you can see that the spike protein of COVID-19 is very similar to both the human and the bat SARS sequences. There is much less matching between 2019 nCoV and the MERS or human coronavirus OC43.

I then looked more closely at the similarities between the F3 and F9 antigenic fragments from human SARS BJ01 and a variety of sequences. I always included the F3 or F9 fragment, the human SARS BJ01 sequence from which it was derived, and the COVID-19 sequence.

Looking at similarities to the F3 antigen from human SARS BJ01 with COVID-19 (2019 nCoV) and Bat SARS, a MERS and human coronavirus OC43. Differences just in the COVID-19 (2019 nCoV) are highlighted in red. Differences in the other sequences are highlighted in orange.

Looking at similarities to the F9 antigen from human SARS BJ01 with COVID-19 (2019 nCoV) and Bat SARS, a MERS and human coronavirus OC43. Differences just in the COVID-19 (2019 nCoV) are highlighted in red. Differences in the other sequences are highlighted in orange.

COVID-19 shows three differences in amino acids from F3, marked in red and a further three differences from Bat SARS marked in orange. Compared to F9, COVID-19 shows 9 differences marked in red, as well as a further three differences from Bat SARS.
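Counting differences like these by eye is error-prone, so here is a sketch of how the comparison could be scripted. It lists each position where an aligned sequence differs from the reference fragment. Note the assumptions: the reference is only the first ten residues of F3, and the "variant" is a made-up sequence with two invented substitutions. It is NOT the real COVID-19 sequence, and `substitutions` is a name chosen for this example.

```python
def substitutions(reference, other, first_position=1):
    """List (position, reference_residue, other_residue) for each mismatch."""
    if len(reference) != len(other):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (first_position + i, ref, alt)
        for i, (ref, alt) in enumerate(zip(reference, other))
        if ref != alt
    ]


# First ten residues of F3, which starts at position 797 of the spike protein.
reference_start = "RSFIEDLLFN"
# Hypothetical variant with two invented changes (illustration only).
made_up_variant = "RSLIEDLLFK"

for position, ref, alt in substitutions(reference_start, made_up_variant, 797):
    print(f"{ref}{position}{alt}")
```

The output uses the common shorthand of original residue, position, new residue, which makes it easy to compare lists of changes between several sequences.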

The question I might try to answer in experimental trials would be, if I changed the F3 and the F9 sequences to match those of COVID-19, would these changes give me antigens that could generate antibodies, and therefore help create potential vaccines against the current COVID-19 outbreak?

Other approaches

The answer to my question might actually be no. Remember earlier on we learnt that an RNA virus like COVID-19 is likely to be a quasispecies. It is not one definite sequence but a population of viruses with an average around the published sequence. The altered F3 and F9 based vaccines would only affect those viruses in the population with exactly these changes. Those with a different mutation might slip through.

Vaccines made with live attenuated viruses are often the most effective and can be made by using such a mixed population found in a quasispecies virus. The other strategy could well be to create an attenuated version of the COVID-19. Weakened in some functions so that it did not cause disease, it might still be able to induce the same antibodies against the native and more dangerous virus and so create an effective vaccine. The availability of the consensus viral sequence and existing information on related SARS viruses might make this work easier.

Conclusion

I hope this gives an insight into what can be done if you have available sequences for a new disease like COVID-19. But do remember, I am just an outsider, 20 years behind the times, who used to work with plant viruses, which are very different to animal and human viruses.
It is reassuring to know, as the disease continues its growth and spread, that vastly more experienced research teams in public labs and private companies are racing to generate an effective vaccine.
They are likely to do so at a pace that I would have found unbelievable when I was working in my field.

Miltoncontact

Wednesday, 5 February 2020
