Mining the tobacco genome

In 2017 we published a gold-standard assembly of the tobacco genome – a significant achievement given the genomic complexity of tobacco.

This assembly is a gold mine of information, and we’re already using it to tailor and improve tobacco for our potentially reduced-risk products and to answer broader plant science questions.


The tobacco plant

Tobacco (Nicotiana tabacum) was the first plant to be adapted for tissue culture and among the first to be genetically engineered, both of which made significant contributions to molecular plant biology. Today, potential applications in biofuel and biopharmaceutical production have generated renewed interest in this plant. In order to explore these and other potential applications, improved genomic resources are critical. Unfortunately, progress has been hampered by the very large size of the tobacco genome (roughly 50% bigger than the human genome) and its highly complex nature: N. tabacum arose from the fusion of two different ancestral species (Nicotiana sylvestris and Nicotiana tomentosiformis). As a result, each cell contains a set of chromosomes originating from each parent, and there is a great deal of repetition and duplication throughout the genome.



In collaboration with Lukas Mueller’s group at the Boyce Thompson Institute at Cornell University, New York, NY, USA, we created a new assembly of the N. tabacum genome by laying out (or “anchoring”) 64% of the genetic code onto chromosomes, dramatically improving its coherency and utility.

Although draft sequences of the tobacco genome were previously available, their usefulness was limited because less than 20% of the genetic sequence was accurately placed on the 24 tobacco chromosomes, making it very difficult to link identified genes to particular traits. We tried a different approach to put the pieces of the genome together, using a new technology that enhances the utility of the optical mapping technique for large genomes. This involves marking specific sequence patterns in very long pieces of DNA to create barcoded DNA fragments. The barcode is then used as a template onto which the new assembly can be dropped and matched, a bit like completing a jigsaw on top of a trace of its picture.
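The barcode idea can be illustrated with a toy sketch. This is not the pipeline used for the assembly: the motif, the spacing values and the function names below are all hypothetical, chosen only to show how the spacing between labelled sites in a sequenced fragment might be matched against a longer optical map.

```python
# Toy illustration only: the motif, map spacings and helper names are
# assumptions made for this example, not details from the published work.
MOTIF = "GCTCTTC"  # hypothetical label site


def label_positions(seq, motif):
    """Return the start position of every occurrence of motif in seq."""
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + 1)
    return positions


def spacing(positions):
    """Convert label positions into distances between neighbouring labels."""
    return [b - a for a, b in zip(positions, positions[1:])]


def best_offset(map_spacing, contig_spacing, tolerance=0):
    """Slide the contig barcode along the map barcode.

    Returns the index of the first window in which every distance agrees
    within the tolerance, or None if there is no such window.
    """
    n = len(contig_spacing)
    if n == 0:
        return None
    for i in range(len(map_spacing) - n + 1):
        window = map_spacing[i:i + n]
        if all(abs(m - c) <= tolerance for m, c in zip(window, contig_spacing)):
            return i
    return None


# Distances between labels on a made-up optical map of one long DNA molecule.
optical_map = [120, 45, 300, 80, 210, 95]

# A made-up sequenced fragment with label sites 300, 80 and 210 bases apart.
contig = (MOTIF + "A" * 293 + MOTIF + "A" * 73 + MOTIF + "A" * 203 + MOTIF)

contig_barcode = spacing(label_positions(contig, MOTIF))
placement = best_offset(optical_map, contig_barcode)
```

Here the fragment's barcode lines up with the third interval of the map, which is the jigsaw-on-a-trace step in miniature. Real optical maps are noisy and far longer, so practical tools rely on tolerant, scored alignments rather than the exact sliding match sketched here.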

Using optical mapping alongside the more common method of next generation sequencing enabled us to anchor much more of the genome to tobacco chromosomes than in previous assemblies. “Assembly of the genome sequence was technically very difficult because the two parental genomes are very similar – in effect a bit like trying to put together a jigsaw puzzle showing a picture of very similar but non-identical twins,” says Jennifer Bromley, Computational Biology and Gene Identification Manager at BAT. “The sequence also contains a lot of repetition, making anchoring some areas like trying to complete a jigsaw puzzle of blue sky.” Rather than keeping the method proprietary, we chose to publish these results in BMC Genomics [1]. The sequence, along with tools that are important for interrogating this complex genome, is now available as a central resource for scientists around the world on the Solanaceae Genomics Network.


Using the genome to probe nitrogen use and utilisation

Although a substantial step forward in understanding tobacco, generating this genomic roadmap was just the beginning. The challenge now is to use it to figure out the underlying genetic pathways within the tobacco plant and identify which genes are involved in determining reduced levels of toxicants or toxicant precursors in the leaf, increased yields and improved disease resistance. "By modifying these traits, not only can we change the characteristics of the tobacco leaf and start breeding plants with optimal traits for our potentially reduced risk products, we can also help our farmers grow tobacco more sustainably," says Bromley.

We have already used the genome assembly and associated genetic map to identify two mutated genes implicated in nitrogen use and utilisation in collaboration with Ramsey Lewis’ group at North Carolina State University, Raleigh, NC, USA. "Our increased understanding of these mutant genes could lead to the development of novel tobacco cultivars that contain lower levels of carcinogenic compounds, for use in potentially reduced risk products, such as tobacco heating products," says Bromley.

It has long been known that the Burley variety of tobacco is poor at utilising nitrogen. This affects its metabolism and growth: as well as increased levels of nicotine and other alkaloids, the plant contains more nitrate, which results in raised levels of carcinogenic tobacco-specific nitrosamines in its leaves during harvesting and curing. When the tobacco is burnt in a cigarette, these compounds transfer to the smoke. "Different cultivars of Burley tobacco all share these two gene mutations, giving us a handle on why they differ from other tobaccos," explains Allen Griffiths, former Head of Plant Biotechnology and now Head of Reduced Risk Substantiation. "We believe this represents the first successful map-based gene discovery for N. tabacum, and demonstrates the value of a high-quality genome assembly for future research." Nitrogen is essential for plant growth and many farmers add nitrogen-based fertilisers to crops to achieve good yields, but excess nitrogen can have adverse effects on the environment. By helping to improve commercially important crops, discovery of these genes could ultimately reduce the need for chemical fertilisers.


Using the genome to understand flowering timing

Using the genome assembly as a reference to identify genetic differences between varieties of tobacco could help breeders develop new cultivars with improved characteristics for research and other applications. We are already using it this way to unravel a mystery, at the molecular level, that has long baffled plant biologists: why the Maryland Mammoth mutant variety of tobacco can flower only when the days are short (less than 14 hours of daylight). One of the ancestral parents of tobacco (N. sylvestris) flowers during long days and the other (N. tomentosiformis) when days are short. As a result, most varieties of cultivated tobacco can generally flower under any day length. "Maryland Mammoth, however, is unusual and cannot produce flowers during the long days of summer – something has been lost," explains Bromley. "If you want seeds from Maryland Mammoth, you have to grow it in a greenhouse under a specific light regime to induce it to flower."

From a broader plant biology perspective, the discovery of the Maryland Mammoth tobacco mutant in 1919 led scientists to first hypothesise about the concept of photoperiodism: that organisms can behave differently in response to changes in the length of daily, seasonal or yearly cycles of light or darkness. The genetic pathways governing why, how and when plants transition from making leaves to making flowers were worked out by 1995 in the model plants Arabidopsis thaliana (which has a much quicker seed-to-seed life cycle and a much simpler genome) and Antirrhinum spp. (snapdragon), but they have remained a mystery in tobacco for nearly 100 years.


To identify what was preventing flowering during long days in the Maryland Mammoth, we focused on all the genes involved in the plant’s response to seasonal changes in daylight, since no single gene controls flowering, and looked for differences between genes from the two parents. We found that tobacco carries a mutation in one of two copies of a gene that forms part of a protein complex that triggers expression of other genes and causes the plant to flower. In normal flowering tobacco, the non-mutant form (or homeolog) dominates, whereas in Maryland Mammoth the mutant form dominates. "We think that this disrupts the formation of the protein complex that induces the plant to make the transition from making leaves to making flowers at its growing tip," says Bromley. "In the parents, these homeologs will be expressed at different times, in precise response to different environmental cues. In tobacco, which has both copies, you would expect that the N. sylvestris copy would be expressed under long days, while the N. tomentosiformis copy would be expressed under short days. We have found that the gene from N. sylvestris is mutated, and it’s the N. tomentosiformis gene that does all the work under any photoperiod to induce flowering, but in Maryland Mammoth, the N. sylvestris homeolog is still highly expressed under long days, which we think causes the plant to keep producing leaves. We can use this information to start regulating when the plants flower, helping to speed up the breeding process."

Shortening the normal life cycle of particular plants could accelerate plant research by greatly speeding up the breeding pipeline for that species. Tobacco currently has a normal life cycle of four months, whereas A. thaliana, which is commonly used for research, has a six-week life cycle. Understanding how to encourage a tobacco plant to produce seed faster therefore has many commercial applications.

"Generating this dramatically improved assembly for tobacco is a substantial step forward," says Griffiths. "It will open up several avenues of research, from helping scientists gain a greater understanding of the evolution of the tobacco plant to the identification of genes responsible for several traits, whether they be related to improving sustainability of agriculture, reducing the levels of toxicants in tobacco products, or improving the production of pharmaceuticals and biofuels."



  1. Edwards K et al. (2017). A reference genome for Nicotiana tabacum enables map-based cloning of homeologous loci implicated in nitrogen utilization efficiency. BMC Genomics 18: 448.