Lab Exercise 1

Abstract


Degenerate primer design is an important step in planning polymerase chain reaction (PCR) experiments that aim to identify novel members of plant gene families. In this laboratory exercise we apply ClustalW multiple sequence alignment to the amino acid sequences of 6 members of the A. thaliana NHX gene family retrieved from NCBI³. The degeneracy of the conserved regions helps us identify which of the 10 primer sequences predicted by the j-CODEHOP primer prediction tool⁶ are likely to work best. We then generate gene-specific primers from the nucleotide sequences of 2 gene family members and use the conserved amino acid sequences to inform our decisions about which primers are preferable. We conclude that degenerate primer design must be tailored to the constraints of the proposed experiment and that it is up to the researcher to decide which primers best suit their needs.

Introduction


The study and modification of plant genomes to produce better transgenic plants is of great interest to the plant science community. To modify a plant genome we must first characterize the genes that regulate its biological processes. Many plant genes are further classified into “gene families”: sets of genes that express similar proteins, which may share the same function, express a new function, or be regulated in different ways¹. Molecular biologists often use the polymerase chain reaction (PCR) to amplify small amounts of DNA into quantities large enough to study. During PCR, double-stranded DNA is heated until it melts into single strands. Primers, short strands of DNA designed to correspond to conserved regions of a particular gene or group of genes, then anneal to their complementary regions. A thermostable DNA polymerase extends each primer to synthesize the rest of the complementary sequence. Primer design is therefore a crucial step in experimental design and depends on an assortment of factors, chief among which is the degeneracy of the genetic code.

The “wobble hypothesis” holds that the last base in a codon is variable, so that several different codons can code for the same amino acid¹. If we already know the exact DNA sequence we wish to amplify, we need only consider the physical properties that make a primer suitable for PCR. But what if we want to identify new members of a gene family? Genes of the same gene family share some similarity with each other, so by aligning their amino acid sequences with bioinformatics software we can identify domains that are conserved across all family members. By designing primers from this amino acid similarity rather than from a single nucleotide sequence, we can use the wobble hypothesis to our advantage and design degenerate (non-specific) primers. The degree of conservation also tells us which proposed primers might work best to identify novel genes.

We begin designing a degenerate primer for a gene of interest by finding known representative sequences of the NHX gene family in a genomic database such as NCBI³. We then design degenerate primers to identify novel genes by performing a multiple sequence alignment of the amino acid sequences of our query genes, and design primers specific to individual genes from their nucleotide sequences.

Methods


All information concerning the methodology of this lab exercise can be found in the class lab manual¹. The author used R for data analysis and presentation. Full source code can be found here with partial code snippets embedded in the document.

Results


The first analysis of this exercise used the ClustalW⁸ multiple sequence alignment algorithm, as implemented by the Kyoto University Bioinformatics Center², to identify conserved amino acid sequences shared by the six members of the A. thaliana NHX gene family identified in the lab manual¹. ClustalW aligns the sequences and marks the degree of conservation at each position of the alignment: a “*” denotes a fully conserved residue, a “:” denotes strong conservation, and a “.” denotes weak conservation. The website uses spaces to mark positions with no conservation; these are represented here by underscores. The results are summarized in Tables 1a-1c below as blocks of 50 amino acids, further split into groups of 10. The groups containing the conserved regions are highlighted in gray.


Conserved Region 1

Sequence ID             51-60aa     61-70aa     71-80aa     81-90aa     91-100aa
AAM08403.1              RWMNESITAL  LIGLGTGVVI  LLISRGKNS-  HLLVFSEDLF  FIYLLPPIIF
sp|Q68KI4.2|NHX1_ARATH  RWMNESITAL  LIGLGTGVTI  LLISKGKSS-  HLLVFSEDLF  FIYLLPPIIF
sp|Q84WG1.2|NHX3_ARATH  RWMNESITAL  IIGSCTGIVI  LLISGGKSS-  RILVFSEDLF  FIYLLPPIIF
AAM08405.1              RWVNESITAI  LVGAASGTVI  LLISKGKSS-  HILVFDEELF  FIYLLPPIIF
AAM08407.1              YYLPEASASL  LIGLIVGGLA  NISNTETSIR  TWFNFHDEFF  FLFLLPPIIF
AAM08406.1              HYLPEASGSL  LIGLIVGILA  NISDTETSIR  TWFNFHEEFF  FLFLLPPIIF
Conservation Output     _::_*:__::  ::*___*___  _:_.__..__  __:_*_:::*  *::*******
Table 1a: A subset of the Clustal 2.1 multiple sequence alignment, displayed in an easy-to-read format with the region of interest highlighted.

Conserved Region 2

Sequence ID             151-160aa   161-170aa   171-180aa   181-190aa   191-200aa
AAM08403.1              LGDFLAIGAI  FAATDSVCTL  QVLNQD-ETP  LLYSLVFGEG  VVNDATSVVL
sp|Q68KI4.2|NHX1_ARATH  LGDYLAIGAI  FAATDSVCTL  QVLNQD-ETP  LLYSLVFGEG  VVNDATSVVV
sp|Q84WG1.2|NHX3_ARATH  IADYLAIGAI  FSATDSVCTL  QVLNQD-ETP  LLYSLVFGEG  VVNDATSVVL
AAM08405.1              ARDYLAIGTI  FSSTDTVCTL  QILHQD-ETP  LLYSLVFGEG  VVNDATSVVL
AAM08407.1              FVECLMFGSL  ISATDPVTVL  SIFQELGSDV  NLYALVFGES  VLNDAMAISL
AAM08406.1              FVECLMFGAL  ISATDPVTVL  SIFQDVGTDV  NLYALVFGES  VLNDAMAISL
Conservation Output     :_*_:*::_:  ::**.*_.*_  .::::_____  _**:*****.  *:***_::_:
Table 1b: A subset of the Clustal 2.1 multiple sequence alignment, displayed in an easy-to-read format with the region of interest highlighted.

Conserved Region 3

Sequence ID             251-260aa   261-270aa   271-280aa   281-290aa   291-300aa
AAM08403.1              FGRHSTD-RE  VALMMLMAYL  SYMLAELFAL  SGILTVFFCG  IVMSHYTWHN
sp|Q68KI4.2|NHX1_ARATH  FGRHSTD-RE  VALMMLMAYL  SYMLAELFDL  SGILTVFFCG  IVMSHYTWHN
sp|Q84WG1.2|NHX3_ARATH  IGRHSTD-RE  VALMMLLAYL  SYMLAELFHL  SSILTVFFCG  IVMSHYTWHN
AAM08405.1              FGRHSTT-RE  LAIMVLMAYL  SYMLAELFSL  SGILTVFFCG  VLMSHYASYN
AAM08407.1              LDVDNLQNLE  CCLFVLFPYF  SYMLAEGLSL  SGIVSILFTG  IVMKHYTYSN
AAM08406.1              LDTENLQNLE  CCLFVLFPYF  SYMLAEGVGL  SGIVSILFTG  IVMKRYTFSN
Conservation Output     :._..____*  _.:::*:.*:  ******_._*  *.*::::*_*  *::*.:*:_*
Table 1c: A subset of the Clustal 2.1 multiple sequence alignment, displayed in an easy-to-read format with the region of interest highlighted.
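The conserved stretches in the tables above can also be located programmatically. The following is a minimal Python sketch, not the R code used for this report: it scans a conservation row for runs of “*” and converts them to alignment positions. The function name `conserved_runs` and its parameters are hypothetical, and the minimum run length of 5 is an assumption taken from the caption of Table 2.

```python
import re

def conserved_runs(conservation, start=1, min_len=5):
    """Return (first, last) alignment positions of each run of '*' that is
    at least min_len long. `start` is the position of the first column."""
    # Spaces only separate the 10-residue display groups, so drop them.
    line = conservation.replace(" ", "")
    pattern = r"\*{%d,}" % min_len
    return [(start + m.start(), start + m.end() - 1)
            for m in re.finditer(pattern, line)]

# Conservation row of Table 1a, whose first column is alignment position 51.
table_1a = "_::_*:__:: ::*___*___ _:_.__..__ __:_*_:::* *::*******"
print(conserved_runs(table_1a, start=51))  # [(94, 100)]
```

Applied to the row from Table 1a, the sketch reports the single stretch at positions 94-100, which corresponds to the LLPPIIF motif.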

Multiple sequence alignment using ClustalW identified 3 conserved regions, summarized in Table 2 below. A degeneracy score was calculated for each conserved region by multiplying together the number of codons that can code for each of its amino acid residues¹. In addition, the amino acid sequences were reverse-translated into degenerate nucleotide sequences using an online translation tool⁵. Preliminary examination shows that conserved region 1 has a much higher degeneracy score than conserved regions 2 and 3.


Conserved Region  Protein Sequence  ClustalW Position (aa)  Calculated Degeneracy  Nucleotide Sequence
1                 LLPPIIF           94-100                  10368                  YTNYTNCCNCCNATHATHTTY
2                 LVFGE             185-189                 384                    YTNGTNTTYGGNGAR
3                 SYMLAE            271-276                 576                    WSNTAYATGYTNGCNGAR
Table 2: A summary of the conserved amino acid stretches of length 5 or greater identified in the Clustal 2.1 multiple sequence alignment of 6 members of the A. thaliana NHX gene family.
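The degeneracy scores and nucleotide sequences in Table 2 can be reproduced with a short calculation. The sketch below is written in Python for illustration and is not the R code used for this report. The codon counts are those of the standard genetic code; the `IUPAC_CODONS` mapping is an assumption inferred from the nucleotide column of Table 2 and covers only the residues that appear there, and the function names are hypothetical.

```python
from math import prod

# Number of codons that encode each amino acid in the standard genetic code.
CODON_COUNTS = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}

# Degenerate codon per residue, inferred from Table 2 (assumption; only the
# residues that occur in the three conserved regions are listed).
IUPAC_CODONS = {
    "L": "YTN", "P": "CCN", "I": "ATH", "F": "TTY", "V": "GTN", "G": "GGN",
    "E": "GAR", "S": "WSN", "Y": "TAY", "M": "ATG", "A": "GCN",
}

def degeneracy(peptide):
    """Multiply together the codon counts of every residue in the peptide."""
    return prod(CODON_COUNTS[aa] for aa in peptide)

def reverse_translate(peptide):
    """Join the degenerate codon of every residue into one IUPAC sequence."""
    return "".join(IUPAC_CODONS[aa] for aa in peptide)

for region in ("LLPPIIF", "LVFGE", "SYMLAE"):
    print(region, degeneracy(region), reverse_translate(region))
```

Note that a degenerate codon can cover more sequences than the codon count suggests: YTN expands to 8 sequences (CTN and TTN), of which only 6 encode leucine, because TTY encodes phenylalanine.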

For our third analysis, 10 degenerate primers were generated using the j-CODEHOP platform, part of the Base-By-Base sequence analysis suite⁶, with default settings and ClustalW for multiple sequence alignment. An equal number of primers was generated in the forward and reverse directions. These results are reported in Table 3 below.


Primer Name  Primer Sequence 5’-3’                  Direction  Annealing Temp (C)  Length (NT)  Clamp (NT)  Core (NT(AA))  Degeneracy  Location (AA)  Location (NT)  AA Sequence    Clamp Score
VFGE-F 32x   ACTCCTCTTCTGTATTCTCTGgtnttyggnga       forward    N/A                 32           21          11(4)          32          179-189        535-566        TPLLYSLVFGE    62
VFGE-R 64x   CAGACGTAGCATCATTAACAACACCytcnccraanac  reverse    N/A                 37           25          12(4)          64          186-198        556-592        VFGEGVVNDATSV  73
YMLA-F 16x   TATGATGCTGATGGCTTATTTCTCTtayatgytngc   forward    N/A                 36           25          11(4)          16          263-275        789-824        LMmLmAYLSYMLA  68
MLAE-F 32x   GATGCTGATGGCTTATTTCTCTTATatgytngcnga   forward    N/A                 36           25          11(4)          32          264-276        792-827        MmLmAYLSYMLAE  70
GIVM-F 48x   TGGTATTCTGACTGTCTTCTTCTGTggnathgtnat   forward    N/A                 36           25          11(4)          48          281-293        843-878        SGILTVFFCGIVM  69
AETF-F 32x   TGCCTTTGCTATGATGTCCTTTCTTgcngaracntt   forward    N/A                 36           25          11(4)          32          311-323        933-968        HfFAllSFLAETF  60
YMLA-R 64x   GAATACCAGACAGAGCAAATAGTTCngcnarcatrta  reverse    N/A                 37           25          12(4)          64          272-284        814-850        YMLAELFsLSGIL  64
MLAE-R 64x   TCAGAATACCAGACAGAGCAAATAGytcngcnarcat  reverse    N/A                 37           25          12(4)          64          273-285        817-853        MLAELFsLSGILT  63
GIVM-R 48x   TAACATTATGCCAAGTATAATGTGTcatnacdatncc  reverse    N/A                 37           25          12(4)          48          290-302        868-904        GIVMSHYTwhNVT  68
AETF-R 64x   CATCCATTCCCACATAAAGAAAGATraangtytcngc  reverse    N/A                 37           25          12(4)          64          320-332        958-994        AETFIFLYVGmDA  74
Table 3: Predicted primers output by the j-CODEHOP program using ClustalW for alignment with all other settings kept at default.
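The clamp length, core length, and degeneracy columns of Table 3 can be checked directly from the primer sequences. The Python sketch below is illustrative only and is not part of j-CODEHOP. It assumes that the uppercase bases form the consensus clamp and the lowercase bases form the degenerate core, as the column values suggest, and it uses the standard IUPAC nucleotide codes. The function names are hypothetical.

```python
from math import prod

# Number of bases represented by each IUPAC nucleotide code.
IUPAC_SIZE = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}

def split_primer(primer):
    """Split a primer into its uppercase clamp and lowercase degenerate core."""
    clamp = primer.rstrip("abcdefghijklmnopqrstuvwxyz")
    return clamp, primer[len(clamp):]

def primer_degeneracy(primer):
    """Count the distinct sequences represented by an IUPAC-coded primer."""
    return prod(IUPAC_SIZE[base] for base in primer.upper())

# Primer VFGE-F from Table 3.
clamp, core = split_primer("ACTCCTCTTCTGTATTCTCTGgtnttyggnga")
print(len(clamp), len(core), primer_degeneracy(core))  # 21 11 32
```

For VFGE-F the sketch returns a 21 nt clamp, an 11 nt core, and a degeneracy of 32, matching the values in the table.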

The final analysis of this laboratory exercise used the Primer3 platform⁷ to generate primer pairs specific to A. thaliana genes AAM08407.1 and AAM08406.1, chosen for their similarity to each other as outlined in the Discussion section. The primary primer pair represents the best predicted result, while the secondary primer pair is a less optimal alternative as scored by the Primer3 algorithm using default settings.


AAM08407.1 Primers

Primary
Oligo         Start  Length  Melting Temp (C)  GC (%)  Overall Self Complementarity  3’ Self Complementarity  Nucleotide Sequence
Left Primer   1177   20      59.98             45      5                             0                        ATGGCATTTGCTCTTGCTCT
Right Primer  1375   20      59.94             45      3                             0                        TGTTCACCACCTCAAATCCA
Table 4a: The primary forward and reverse primers computed for A. thaliana gene AAM08407.1 using the Primer3 analysis platform.

Secondary
Oligo         Start  Length  Melting Temp (C)  GC (%)  Overall Self Complementarity  3’ Self Complementarity  Nucleotide Sequence
Left Primer   1011   20      59.94             45      4                             2                        TTGGTCACACTTGGGATTCA
Right Primer  1211   20      60.14             50      4                             2                        TCGTGAACAGATTGCAGAGC
Table 4b: A secondary pair of forward and reverse primers computed for A. thaliana gene AAM08407.1 using the Primer3 analysis platform.

AAM08406.1 Primers

Primary
Oligo         Start  Length  Melting Temp (C)  GC (%)  Overall Self Complementarity  3’ Self Complementarity  Nucleotide Sequence
Left Primer   82     20      59.84             45      4                             0                        ATGATGCTCGTGCTTTCCTT
Right Primer  281    20      60.31             40      3                             0                        ATGATGGGAGGCAACAAAAA
Table 5a: The primary forward and reverse primers computed for A. thaliana gene AAM08406.1 using the Primer3 analysis platform.

Secondary
Oligo         Start  Length  Melting Temp (C)  GC (%)  Overall Self Complementarity  3’ Self Complementarity  Nucleotide Sequence
Left Primer   82     20      59.84             45      4                             0                        ATGATGCTCGTGCTTTCCTT
Right Primer  278    20      60.34             40      2                             0                        ATGGGAGGCAACAAAAACAA
Table 5b: A secondary pair of forward and reverse primers computed for A. thaliana gene AAM08406.1 using the Primer3 analysis platform.
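Two of the quantities behind Tables 4 and 5 are simple enough to verify by hand. The Python sketch below is an illustration rather than a reimplementation of Primer3: it recomputes the GC percentage of a primer and estimates the amplicon length of a primer pair. The product size formula assumes that the “Start” of the right primer is the position of its 5’ end on the template, which is how Primer3 is understood to report it here; treat that as an assumption. The function names are hypothetical.

```python
def gc_percent(primer):
    """Percentage of bases in the primer that are G or C."""
    primer = primer.upper()
    return 100 * (primer.count("G") + primer.count("C")) / len(primer)

def product_size(left_start, right_start):
    """Amplicon length, assuming right_start is the template position of the
    right primer's 5' end (assumption about the reported coordinates)."""
    return right_start - left_start + 1

# Primary pair for AAM08407.1 (Table 4a).
print(gc_percent("ATGGCATTTGCTCTTGCTCT"))  # 45.0
print(gc_percent("TGTTCACCACCTCAAATCCA"))  # 45.0
print(product_size(1177, 1375))            # 199 under the stated assumption
```

Melting temperature and the self complementarity scores are not recomputed here, since Primer3 derives them from thermodynamic and alignment models that a short sketch cannot reproduce faithfully.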

Discussion


From the results of the Clustal 2.1 multiple sequence alignment we identified three conserved regions that may be suitable for degenerate primer design. On closer examination, genes AAM08407.1 and AAM08406.1 share more similarity with each other than with the other gene family members. The positions marked “:” make this apparent: in most cases those two genes match each other exactly, while the remaining four genes share a different amino acid among themselves. According to the lab manual, these two genes are about 79% similar to each other but only about 22% similar to the other genes, which share 56% similarity among themselves. Removing these two genes from the multiple sequence alignment would allow us to identify more regions of conservation, bringing the overall similarity from 22% to 56%.
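The closeness of AAM08407.1 and AAM08406.1 can be illustrated on the aligned block shown in Table 1a. The Python sketch below computes a simple percent identity between two aligned sequences, skipping any column where either sequence has a gap. It is only an illustration: the lab manual's similarity figures are computed over the full sequences, possibly with a different measure, so values from a 50-residue excerpt are not expected to match them. The function name `percent_identity` is hypothetical.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of aligned columns in which both sequences carry the same
    residue. Columns with a gap ('-') in either sequence are skipped."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    matches = sum(a == b for a, b in pairs)
    return 100 * matches / len(pairs)

# Alignment positions 51-100 from Table 1a.
nhx_07 = "YYLPEASASL" "LIGLIVGGLA" "NISNTETSIR" "TWFNFHDEFF" "FLFLLPPIIF"  # AAM08407.1
nhx_06 = "HYLPEASGSL" "LIGLIVGILA" "NISDTETSIR" "TWFNFHEEFF" "FLFLLPPIIF"  # AAM08406.1
nhx_03 = "RWMNESITAL" "LIGLGTGVVI" "LLISRGKNS-" "HLLVFSEDLF" "FIYLLPPIIF"  # AAM08403.1

print(percent_identity(nhx_07, nhx_06))
print(percent_identity(nhx_07, nhx_03))
```

Over this excerpt the first comparison scores far higher than the second, in line with the grouping described above.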

Table 2 shows that conserved regions 2 and 3 are better suited to degenerate primer design than conserved region 1, whose degeneracy is 27 times that of region 2 and 18 times that of region 3. Recall that degeneracy measures how uncertain we are of the underlying nucleotide sequence when given only an amino acid sequence, because several codons can code for the same amino acid. The greater the degeneracy, the less specific our primer will be and the greater the chance of it annealing to and amplifying genes outside of the target gene family.

With this information we can better interpret the results of the j-CODEHOP analysis of our ClustalW multiple sequence alignment, summarized in Table 3. More primers were generated than there are conserved amino acid sequences because the algorithm first translates each sequence into a multitude of possible DNA sequences, for the reasons described earlier. In addition, the algorithm outputs equal numbers of forward and reverse primers. Examining the amino acid column of Table 3, we can see that the predicted reverse primers VFGE-R, YMLA-R, and MLAE-R all correspond to conserved amino acid regions listed in Table 2. Primer VFGE-R corresponds to conserved region 2 and has a consensus clamp score of 73, while the other two primers correspond to conserved region 3 and have consensus clamp scores of 64 and 63 respectively. A higher consensus clamp score indicates higher sequence similarity across all members of the multiple sequence alignment. Since these three primers all share the same degeneracy of 64, we conclude that primers VFGE-R and YMLA-R might be the best candidates for novel gene isolation, based on their consensus clamp scores and their correspondence to highly conserved sequence domains.
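The selection reasoning above can be written down as a small filter-and-sort step. The Python sketch below is one possible formalization, not a procedure prescribed by j-CODEHOP or the lab manual. It assumes that the four letters before the hyphen in each primer name are the amino acids encoded by the degenerate core, keeps the reverse primers whose core residues fall inside a conserved region from Table 2, and ranks them by lowest degeneracy and then highest clamp score. The function name `shortlist` and the tuple layout are hypothetical.

```python
# Conserved regions from Table 2.
CONSERVED_REGIONS = ["LLPPIIF", "LVFGE", "SYMLAE"]

# (name, degeneracy, consensus clamp score) for the reverse primers in Table 3.
REVERSE_PRIMERS = [
    ("VFGE-R", 64, 73),
    ("YMLA-R", 64, 64),
    ("MLAE-R", 64, 63),
    ("GIVM-R", 48, 68),
    ("AETF-R", 64, 74),
]

def shortlist(primers, regions, top=2):
    """Keep primers whose core residues lie in a conserved region, then rank
    by ascending degeneracy and descending clamp score."""
    def core_residues(name):
        # Assumption: the primer name starts with the core's amino acids.
        return name.split("-")[0]
    hits = [p for p in primers
            if any(core_residues(p[0]) in region for region in regions)]
    return sorted(hits, key=lambda p: (p[1], -p[2]))[:top]

print([name for name, _, _ in shortlist(REVERSE_PRIMERS, CONSERVED_REGIONS)])
# ['VFGE-R', 'YMLA-R']
```

Other rankings are equally defensible. For example, GIVM-R has a lower degeneracy but is excluded here only because its core does not fall within a region from Table 2.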

While we can predict degenerate primers from the amino acid sequences of proteins within the same gene family, the more traditional approach is to use the nucleotide sequence of a gene of interest directly. Tables 4 and 5 summarize the output of the Primer3 platform for NHX genes AAM08407.1 and AAM08406.1. The primary and secondary primer pairs for AAM08407.1 differ in both their nucleotide sequences and their 3’ self complementarity. In contrast, the two pairs for AAM08406.1 share the same left primer, and their right primers are offset by only 3 nucleotides, so they differ only slightly in melting temperature and overall self complementarity. While using an individual nucleotide sequence to generate primer pairs eliminates the uncertainty of reverse-translating amino acid sequences, it may exclude genes of the same gene family that share little of that nucleotide sequence yet still code for functionally similar proteins. This is why using several different primer pairs for the same gene can be beneficial for isolation: the more sites we can target, the more genetic material we can recover during our experiment for further analysis.

Conclusion


Degenerate primer design through bioinformatic analysis requires consideration of multiple approaches to best predict which primer sequence will most effectively probe for novel members of a plant gene family. Many factors, such as primer melting temperature and 3’ end complementarity, must be considered to narrow down the results of these approaches⁴. Overall complementarity is especially important because it indicates how likely a primer is to anneal to itself and form a dimer rather than anneal to the target sequence. The 3’ end of the primer is particularly susceptible to dimer formation and receives its own score⁴. Even with careful consideration and data analysis, degenerate primer design depends on the constraints of the proposed PCR experiment, and different algorithms allow varying levels of customization for a particular need. In addition, it is often necessary to convert the file outputs of bioinformatics tools into more human-readable formats for the benefit of researcher and reader alike. Degenerate primer design is not an exact process, and it falls to the researcher to weigh all of the available information when choosing the best set of primers.

References


1 Experiment 1: Bioinformatics. (n.d.). BIT161B SQ2020. Retrieved April 14, 2020, from https://canvas.ucdavis.edu/courses/461005/files/folder/Laboratory%20Manual?preview=8296468

2 Multiple Sequence Alignment—CLUSTALW. (n.d.). Retrieved April 14, 2020, from https://www.genome.jp/tools-bin/clustalw

3 NHX and Arabidopsis—Protein—NCBI. (n.d.). Retrieved April 14, 2020, from https://www.ncbi.nlm.nih.gov/protein/?term=NHX%20and%20Arabidopsis&utm_source=gquery&utm_medium=search

4 Primer Design. (n.d.). Retrieved April 14, 2020, from http://bioweb.uwlax.edu/GenWeb/Molecular/seq_anal/primer_design/primer_design.htm

5 Protein to DNA reverse translation. (n.d.). Retrieved April 14, 2020, from http://www.biophp.org/minitools/protein_to_dna/demo.php

6 Tu, S.-L., Staheli, J. P., McClay, C., McLeod, K., Rose, T. M., & Upton, C. (2018). Base-By-Base Version 3: New comparative tools for large virus genomes. Viruses, 10(11), 637. https://doi.org/10.3390/v10110637

7 Rozen, S., & Skaletsky, H. J. (1998). Primer3. Code available at http://www-genome.wi.mit.edu/genome_software/other/primer3.html

8 Thompson, J. D., Higgins, D. G., & Gibson, T. J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic acids research, 22(22), 4673–4680. https://doi.org/10.1093/nar/22.22.4673