Transgene Design

A transgene is an artificial gene, and transgene design must incorporate all of the elements critical for gene expression. A simple construction scheme has been developed that provides the best transgene expression: each transgene contains a promoter, an intron, a protein coding sequence (termed the reporter), and a transcriptional stop sequence. These elements are typically assembled in a bacterial plasmid, and the sequences are usually chosen from previous transgenes with proven function. In addition, the construct must be linearized and the prokaryotic sequences removed before injection into the nucleus of a mouse zygote.

The promoter. The transgene promoter is a regulatory sequence that determines in which cells and at what time the transgene is active. The promoter is typically derived from sequences of a mammalian gene upstream of the start site of transcription, and has been tested to confirm that it contains the appropriate transcriptional regulatory elements. For example, the insulin promoter sequence spanning from -585 to the start site of transcription will direct transgene expression exclusively to pancreatic β-cells (1). Each promoter will produce the same expression pattern with any protein reporter. A large number of promoters have now been characterized that direct a wide variety of transgene expression patterns. When choosing a promoter, it is worthwhile to read carefully the original papers defining its expression pattern. Subtle differences from the expression pattern of the original gene are often evident, as are differences in developmental expression or expression in ectopic tissues. The exact promoter sequence that was originally characterized and published should be used to construct your transgene. The promoter sequence normally contains the transcriptional start site as well as the transcriptional regulatory sequences. It also typically contains some extraneous sequence downstream of the transcriptional start site, as discussed below. Synthetic promoters have also been designed for inducible gene expression and other specialized applications.

The reporter (protein coding sequence). Transgenes are normally designed to produce a protein, and must contain a valid protein coding sequence (CDS). This sequence is usually derived from the cDNA for the protein of interest. The CDS must contain a translational start codon (ATG) and a translational stop codon, plus a Kozak sequence upstream of the start codon. The ideal Kozak sequence is GCCGCCACC, but the ten nucleotides found immediately 5' of any native start codon can be assumed to function appropriately. Frequently, extra linker sequences are incorporated between the transcriptional start site and the translational start codon as byproducts of assembling the transgene fragments. These sequences should be examined for ATG start codons or other potential regulatory elements (see below), but otherwise rarely cause problems. The 5' and 3' untranslated sequences from the protein coding transcript should be avoided as much as possible, since these may contain regulatory elements controlling translation or mRNA stability.
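The requirements above (an ATG start codon, an in-frame stop codon, and a Kozak context) can be checked mechanically before any cloning is done. The following is a minimal sketch, not a validated tool: the function name `check_cds` and its two-argument layout are hypothetical, and a Kozak mismatch is reported only as a note because, as stated above, native upstream sequences usually function.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
IDEAL_KOZAK = "GCCGCCACC"  # ideal upstream context quoted in the text


def check_cds(upstream: str, cds: str) -> list[str]:
    """Return a list of problems found in a candidate coding sequence.

    `upstream` is the DNA immediately 5' of the start codon; `cds` runs from
    the start codon through the stop codon. Both are hypothetical inputs
    written 5'->3' on the coding strand.
    """
    upstream, cds = upstream.upper(), cds.upper()
    problems = []
    if not cds.startswith("ATG"):
        problems.append("CDS does not begin with an ATG start codon")
    if len(cds) % 3 != 0:
        problems.append("CDS length is not a multiple of three")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("CDS does not end with a stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("premature in-frame stop codon")
    if not upstream.endswith(IDEAL_KOZAK):
        problems.append("upstream context differs from the ideal Kozak sequence")
    return problems


print(check_cds("GCCGCCACC", "ATGGCTTGA"))  # → []
```

An empty list means none of these basic checks failed; it does not show that the protein will be expressed.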

Introns and transcriptional stop sequences. Transgenes are incorporated into the murine genome at random sites, and transgene expression will vary depending on the sequences that surround the insertion site. The same transgene will be more active in one site than another, and a four-log (10,000-fold) variation in activity is not unusual among insertion sites. This variation in expression levels complicates the study of factors that influence transgene activity. However, inclusion of an intron in a transgene construct results in a significantly greater percentage of active transgenes (2-4). In a direct comparison, 6/7 transgenes with an intron had detectable activity, while only 2/5 otherwise identical constructs without an intron had detectable, albeit weaker, expression (3). The exact mechanism for this effect is unknown, but it is hypothesized to be related to the known functional link between transcription and splicing.

Each transgene must also contain a transcriptional stop signal to match the start signal typically included in the promoter. Eukaryotic transcriptional stop signals include a polyA addition sequence (AAUAAA) as well as hundreds of downstream nucleotides whose function is important but not clearly understood. Numerous introns and transcriptional stop sequences have been tested in transgenes. The most convenient arrangement is to include a gene or gene fragment at the end of the coding sequence with both an intron and a transcriptional stop sequence. Examples of commonly used introns are the rabbit β-globin intron and the SV40 intron. Examples of transcriptional stop sequences are those from SV40 or human growth hormone. Examples of combined sequences are an SV40 intron/stop, the last exon of human growth hormone plus stop sequences, or the entire human growth hormone gene.

Transgene linearization and removal of bacterial ori and prokaryotic sequences. Transgenes are subject to epigenetic regulation, which may be influenced by the transgene sequence as well as the integration site. Transgenes that include prokaryotic sequences, such as the bacterial origin of replication (ori), are less likely to be expressed, and when expressed are active at lower levels, than transgenes without prokaryotic sequences. In a direct comparison, 4/4 transgenes without vector sequences were expressed while only 2/6 transgenes that included vector sequences were active at detectable levels (5). In addition, transgene incorporation into the murine genome is increased by orders of magnitude if the construct is linear as opposed to circular. It is normally convenient to remove the prokaryotic sequences and purify the linear transgene fragment with a single restriction endonuclease digest, so the transgene construction scheme should include a strategy for cutting the transgene out of the plasmid backbone.
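Planning the excision digest amounts to finding an enzyme that cuts on both sides of the transgene but never inside it. The sketch below illustrates that search under several stated simplifications: the function names and the short enzyme table are hypothetical, the plasmid is treated as a linear string whose ends lie in the backbone (a site spanning that junction would be missed), and the retained vector length is counted from the outer edge of each recognition site, not from the exact cut position.

```python
# Recognition sites (5'->3') for a few common enzymes. All are palindromic,
# so scanning one strand finds every site.
ENZYME_SITES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
    "NotI": "GCGGCCGC",
    "XhoI": "CTCGAG",
    "SalI": "GTCGAC",
}


def site_positions(seq: str, site: str) -> list[int]:
    """Return the 0-based start of every occurrence of `site` in `seq`."""
    positions, start = [], seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions


def excising_enzymes(plasmid: str, insert_start: int, insert_end: int) -> dict[str, tuple[int, int]]:
    """Map each usable enzyme to the vector nucleotides left on each end.

    The transgene occupies plasmid[insert_start:insert_end]. An enzyme is
    usable if it has a site upstream and downstream of the transgene and no
    site overlapping the transgene itself.
    """
    plasmid = plasmid.upper()
    usable = {}
    for name, site in ENZYME_SITES.items():
        hits = site_positions(plasmid, site)
        upstream = [p for p in hits if p + len(site) <= insert_start]
        downstream = [p for p in hits if p >= insert_end]
        internal = [p for p in hits if p + len(site) > insert_start and p < insert_end]
        if upstream and downstream and not internal:
            left_extra = insert_start - max(upstream)
            right_extra = min(downstream) + len(site) - insert_end
            usable[name] = (left_extra, right_extra)
    return usable
```

Enzymes that leave the fewest extra nucleotides are generally preferable, since any vector sequence carried along with the transgene has to be known to be free of regulatory function.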

Linker and extra sequences. Typical bacterial cloning methods will result in inclusion of extra sequences between each segment of the transgene that derive from sequences between restriction endonuclease sites or plasmid polylinker sequences. These sequences may be as much as one hundred nucleotides in length and contain multiple restriction sites, but they do not normally affect transgene function as long as inadvertent regulatory elements are not created. For example, the promoter will contain the start site of transcription and usually tens of nucleotides downstream of the start site of transcription that will ultimately be incorporated into the 5' untranslated sequence of the transgene transcript. This sequence must be verified to be free of translational start or stop sites. Any extraneous sequences must be examined to ensure the absence of unwanted functional elements. In addition, any plasmid sequences included in the final linearized transgene construct should be known to be free of regulatory function. For example, if the transgene is freed from the plasmid backbone with restriction endonucleases that leave an extra hundred nucleotides at the ends of the transgene, these extra sequences should be known to be free of eukaryotic enhancers or promoters that are frequently present in plasmids.
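Screening the 5' leader for stray start codons is easy to automate. The following sketch lists every ATG between the transcriptional start site and the intended start codon and notes whether it is in frame with the intended one. The function name `upstream_atgs` and its input convention are assumptions made for illustration; it checks only for ATG triplets, not for other regulatory elements.

```python
def upstream_atgs(leader: str) -> list[dict]:
    """List ATG triplets in the 5' leader of a transgene transcript.

    `leader` is the DNA from the transcriptional start site up to, but not
    including, the intended start codon (hypothetical input, written 5'->3').
    """
    leader = leader.upper()
    found = []
    for pos in range(len(leader) - 2):
        if leader[pos:pos + 3] == "ATG":
            # Distance in nucleotides from this ATG to the intended ATG.
            distance = len(leader) - pos
            found.append({"position": pos, "in_frame": distance % 3 == 0})
    return found


# A leader carrying an NcoI site (CCATGG) ahead of the Kozak sequence.
print(upstream_atgs("CCATGGCCGCCACC"))  # → [{'position': 2, 'in_frame': True}]
```

Polylinker restriction sites are a common source of such codons: NcoI (CCATGG), NdeI (CATATG), and SphI (GCATGC) each contain an ATG.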

A note on transgene construction strategy. Transgenes are normally assembled from proven promoters, introns, etc., and the protein coding sequence of interest. A cloning scheme should be designed with a clear understanding of all the elements and a strategy to free the transgene from vector sequences once constructed. Meticulous attention to detail in transgene design will be rewarded by avoiding the time and expense spent generating a transgene that does not function as anticipated.

Synthetic promoters. Most transgene promoters are derived from endogenous mammalian gene regulatory sequences, but synthetic promoters with specialized functions have also been developed. The most common synthetic promoters respond to synthetic activators as part of a binary system that allows inducible gene expression (6). For example, one transgene will produce a synthetic transcription factor that contains a prokaryotic tetracycline binding domain and a tet operator (tetO) DNA binding domain coupled to a eukaryotic acidic transcriptional activation domain. The second transgene will be constructed of multimerized tetO binding sites upstream of a minimal promoter that contains only a TATA box. The second transgene will only be active in the presence of the synthetic transcription factor and in the absence of tetracycline, allowing inducible, tissue-specific transgene activation. These transgenes are designed and assembled the same way as other transgenes.
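The logic of this binary system can be written as a two-input truth table. The sketch below is only a restatement of the rule described above, with a hypothetical function name; it treats expression as all-or-none and ignores dose and timing.

```python
def responder_active(activator_present: bool, tetracycline_present: bool) -> bool:
    """Second (tetO) transgene is on only with activator and without tetracycline."""
    return activator_present and not tetracycline_present


# Print all four combinations of the two inputs.
for activator in (False, True):
    for tetracycline in (False, True):
        state = "ON" if responder_active(activator, tetracycline) else "off"
        print(f"activator={activator!s:5} tetracycline={tetracycline!s:5} -> {state}")
```

Tissue specificity comes from the first input: the activator is present only in cells where the promoter driving the first transgene is active.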


  1. Hanahan, D. Heritable formation of pancreatic β-cell tumours in transgenic mice expressing recombinant insulin/simian virus 40 oncogenes. Nature 315, 115-122 (1985).

  2. Clark, A.J., Archibald, A.L., McClenaghan, M., Simons, J.P., Wallace, R., Whitelaw, C.B. Enhancing the efficiency of transgene expression. Philos Trans R Soc Lond B Biol Sci 339, 225-232 (1993).

  3. Choi, T., Huang, M., Gorman, C., Jaenisch, R. A generic intron increases gene expression in transgenic mice. Molecular and Cellular Biology 11, 3070-3074 (1991).

  4. Duncker, B.P., Davies, P.L., Walker, V.K. Introns boost transgene expression in Drosophila melanogaster. Mol Gen Genet 254, 291-296 (1997).

  5. Kjer-Nielsen, L., Holmberg, K., Perera, J.D., McCluskey, J. Impaired expression of chimaeric major histocompatibility complex transgenes associated with plasmid sequences. Transgenic Research 1, 182-187 (1992).

  6. Lewandoski, M. Conditional control of gene expression in the mouse. Nat Rev Genet 2, 743-755 (2001).

© 2020 Washington University in St. Louis