Analyzing characteristics of Highly Expressed Genes (HEG) of Escherichia coli K12

Hieu Trung Le¹,†, Nam Tri Vo¹,†, Tan Huynh¹, and Hoang Duc Nguyen¹

¹University of Science, Ho Chi Minh City, Viet Nam

†Equally contributing authors

Abstract

In recent years, gene optimization has been widely applied in fields such as medicine, industry and agriculture. The nucleotide composition of the original gene is altered to improve expression of the target protein. An optimization program must capture the structural trends of Highly Expressed Genes (HEG) so that genes are optimized in the appropriate direction. In this study, we analyzed the characteristics of HEG in depth and compared them with those of low and moderately expressed genes (non-HEG) in Escherichia coli K12. The results showed that the codon usage and GC content of HEG are more stable than those of non-HEG. Structures that destabilize the mRNA or restrict transcription and translation, such as Shine-Dalgarno sequences, polynucleotides, polycodons, mRNA secondary structures and repeated sequences, are limited in HEG. These results can be applied to the construction of gene optimization programs.

Citation: Le HT, Vo NT, Huynh T & Nguyen HD (2015) Analyzing characteristics of Highly Expressed Genes (HEG) of Escherichia coli K12. Genomic Medicine 2015, eds Le L & Pham S (Ho Chi Minh City, Viet Nam).

VJS Editor: Van Hoang, Sunnybrook Research Institute, Toronto, Canada

Introduction

Gene optimization is a genetic technique that increases the expression level of a gene by increasing the efficiency of its transcription and translation (1). In principle, gene optimization replaces codons in native genes to obtain new sequences that carry the features of Highly Expressed Genes (HEG). HEG were predicted and published in the Highly Expressed Genes Database (HEG-DB) in 2008 by Puigbo et al. (2). This database has provided useful data for research on both gene expression and gene optimization.

The characteristics of a target gene affect (increase or decrease) its expression level. Codon usage is one of the most important criteria in gene optimization. Each amino acid may be encoded by more than one codon, and each organism has its own bias in the use of the 61 available codons. The intracellular tRNA populations are correlated with the codon bias of the mRNA population (3). Highly expressed genes tend to contain codons for which the cell has abundant tRNA, whereas genes expressed at low levels tend to include rare codons. The GC content of a target gene also affects the amount of its product: the closer the GC content of the gene is to the desired value for the expression organism, the higher the expression level. The presence of special structures such as Shine-Dalgarno sequences, polynucleotides, polycodons, mRNA secondary structures and repeated sequences decreases the expression level by reducing transcription and translation efficiency. The presence of hidden stop codons also affects the amount of protein, but in a distinct way, by blocking frame-shifted translation (4). In this study, we analyzed and compared these gene optimization criteria between HEG and non-HEG of Escherichia coli K12, a microorganism commonly used in the production of both prokaryotic and eukaryotic recombinant proteins (5).

Materials and Methods

Microorganism

The analysis was carried out on the genome of Escherichia coli str. K-12 substr. W3110 (accession number NC_007779.1).

HEG and non-HEG

E. coli HEG were obtained from HEG-DB, a database of predicted highly expressed genes in prokaryotic genomes (2).

E. coli non-HEG were obtained by removing the HEG from the E. coli K12 genome (from the NCBI database).

Criteria analysis

Codon usage

The w_c variable, which measures the relative frequency of usage of a codon, was calculated using the formula:

w_c = O_c / max_Oc

In which,

w_c: relative codon adaptation of codon c

C_a: the set of codons that code for amino acid a

O_c: the number of occurrences of codon c, which codes for amino acid a

max_Oc: the largest number of occurrences among the codons in C_a
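As an illustration, the calculation can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: it assumes that w_c is the count of codon c divided by the count of the most frequently used codon for the same amino acid, and that counts are pooled over all genes in a group. The SYNONYMOUS table is a hypothetical, truncated example; a real analysis would need the full genetic code.

```python
from collections import Counter

# Hypothetical, truncated synonymous-codon table (illustration only).
SYNONYMOUS = {
    "Lys": ["AAA", "AAG"],
    "Phe": ["TTT", "TTC"],
}


def relative_adaptiveness(sequences, synonymous=SYNONYMOUS):
    """Return {codon: w_c}, assuming w_c = O_c / max_Oc within each amino acid."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
        for i in range(0, usable, 3):
            counts[seq[i:i + 3]] += 1

    w = {}
    for codons in synonymous.values():
        max_oc = max(counts[c] for c in codons)
        if max_oc == 0:
            continue  # amino acid never observed; w_c is undefined
        for c in codons:
            w[c] = counts[c] / max_oc
    return w
```

For example, calling relative_adaptiveness(["AAAAAAAAGTTT"]) reads the codons AAA, AAA, AAG and TTT, so AAA receives the maximum value of 1.0 and AAG receives 0.5.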

GC content (%GC)

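The average GC content of each group of genes was analyzed according to the formula:

%GC = (G + C) / len × 100

In which, G and C are the numbers of guanine and cytosine nucleotides in the gene and len is the length of the gene.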
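A minimal Python sketch of this calculation is given below. It assumes that %GC is the share of G and C nucleotides in a gene expressed as a percentage, and that the group value is the unweighted mean over the genes in the group; the function names are illustrative.

```python
def gc_percent(seq):
    """GC content of one gene as a percentage of its length."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def mean_gc_percent(genes):
    """Unweighted mean GC content over a group of genes (an assumption)."""
    return sum(gc_percent(g) for g in genes) / len(genes)
```

Averaging per-gene percentages weights every gene equally; pooling all nucleotides before dividing would instead weight long genes more heavily.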

Shine-Dalgarno (SD)

We calculated the frequency of occurrence of SD-like sequences in genes according to the formula:

In which,

Ratio_SD: the frequency of occurrence of sequences homologous to the SD consensus sequence

s_i: the number of sequences homologous to the SD consensus sequence with i mismatches (i from 0 to 2)

len: the length of the gene
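One way to count SD-like sequences is sketched below in Python. Several points are assumptions rather than details taken from this study: the consensus is taken to be AGGAGG, every window of the gene is compared with it by counting mismatched positions, and Ratio_SD is taken to be the sum of s_0, s_1 and s_2 divided by the gene length.

```python
SD_CONSENSUS = "AGGAGG"  # assumed consensus; not specified in the text


def sd_ratio(seq, consensus=SD_CONSENSUS, max_mismatch=2):
    """Return ([s_0, ..., s_max], ratio) for SD-like windows in one gene."""
    seq = seq.upper()
    k = len(consensus)
    s = [0] * (max_mismatch + 1)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        # Number of positions where the window differs from the consensus.
        mismatches = sum(a != b for a, b in zip(window, consensus))
        if mismatches <= max_mismatch:
            s[mismatches] += 1
    # Assumption: the ratio is the total number of hits divided by gene length.
    return s, sum(s) / len(seq)
```

Overlapping windows are counted separately here, which may overstate the frequency in purine-rich regions.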

Hidden stop codons (HSCs)

The frequency of occurrence of HSC was calculated using the formula below:

In which,

Ratio_fi: the frequency of occurrence of HSC in the i-th frame

HSC: the number of HSC

len: the length of gene
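The counting step can be sketched in Python as follows. The sketch assumes that hidden stop codons are the triplets TAA, TAG and TGA read in the two frames shifted by one or two nucleotides from the coding frame, and that Ratio_fi is the count in frame i divided by the gene length; which frames were actually analyzed is not stated here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def hsc_ratio(seq, frame):
    """Return (count, ratio) of stop codons in a shifted frame (1 or 2)."""
    if frame not in (1, 2):
        raise ValueError("frame must be 1 or 2 (offset from the coding frame)")
    seq = seq.upper()
    # Step through complete triplets starting at the given offset.
    count = sum(
        1
        for i in range(frame, len(seq) - 2, 3)
        if seq[i:i + 3] in STOP_CODONS
    )
    return count, count / len(seq)
```

For the sequence ATGATAAGC, the coding frame reads ATG ATA AGC, while the frame shifted by one nucleotide reads TGA TAA and therefore contains two hidden stop codons.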

Repeated sequence (RS)

We analyzed RS with different repeat lengths (from 6 to 11 and greater than 11) and then calculated the frequency of occurrence of RS with this formula:

In which,

Ratio_RSi: the frequency of occurrence of RS with repeat level i

RS_i: the number of RS with repeat level i

len: the length of gene
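How a repeated sequence is detected is not specified, so the Python sketch below rests on a loudly stated assumption: a repeated sequence of length k is any substring of length k that occurs at least twice in the same gene, and the ratio is the number of distinct such substrings divided by the gene length. Inverted repeats and repeats across genes are not considered.

```python
from collections import Counter


def repeated_kmers(seq, k):
    """Return {substring: occurrences} for length-k substrings seen more than once."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n > 1}


def rs_ratio(seq, k):
    """Assumed ratio: distinct repeated substrings of length k per nucleotide."""
    return len(repeated_kmers(seq, k)) / len(seq)
```

Overlapping occurrences are counted, so a long homopolymer run also registers as a repeat under this definition.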

Polynucleotide (PN)

The frequency of occurrence of PN with lengths from 4 to 8 was statistically analyzed according to the following formula:

In which,

Ratio_PNi: the frequency of occurrence of PN with the repeat level i

PN_i: the number of PN with repeat level i

len: the length of gene
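A Python sketch of the PN count is given below. It assumes that a polynucleotide is a maximal run of one identical nucleotide, that a run is assigned to repeat level i only when its length is exactly i, and that Ratio_PNi is PN_i divided by the gene length. Runs longer than the upper limit are ignored in this sketch.

```python
from itertools import groupby


def pn_counts(seq, min_len=4, max_len=8):
    """Count maximal single-nucleotide runs by exact length."""
    seq = seq.upper()
    counts = {i: 0 for i in range(min_len, max_len + 1)}
    for _, run in groupby(seq):  # groups consecutive identical nucleotides
        n = len(list(run))
        if n in counts:
            counts[n] += 1
    return counts


def pn_ratios(seq, min_len=4, max_len=8):
    """Assumed ratio: PN_i divided by gene length, for each level i."""
    return {i: c / len(seq) for i, c in pn_counts(seq, min_len, max_len).items()}
```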

Polycodon (PC)

The frequency of occurrence of PC, a motif in which a codon is repeated from 2 to 5 times, was statistically analyzed according to the following formula:

In which,

Ratio_PCi: the frequency of occurrence of PC with repeat level i

PC_i: the number of PC with repeat level i

len: the length of gene
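The PC count can be sketched in Python in the same spirit. The sketch assumes that a polycodon is a run of the same codon repeated consecutively in the coding frame, that a run is assigned to level i only when it contains exactly i copies, and that Ratio_PCi is PC_i divided by the gene length in nucleotides.

```python
from itertools import groupby


def pc_counts(seq, min_rep=2, max_rep=5):
    """Count in-frame runs of one repeated codon by exact number of copies."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = {i: 0 for i in range(min_rep, max_rep + 1)}
    for _, run in groupby(codons):  # groups consecutive identical codons
        n = len(list(run))
        if n in counts:
            counts[n] += 1
    return counts


def pc_ratios(seq, min_rep=2, max_rep=5):
    """Assumed ratio: PC_i divided by gene length, for each level i."""
    return {i: c / len(seq) for i, c in pc_counts(seq, min_rep, max_rep).items()}
```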

mRNA secondary structure (SS)

The mRNA secondary structure was evaluated via minimum free energy (MFE) which was calculated as described (6).

Statistical analysis

All statistical description and testing were performed in R i386 3.0.2. We used the t-test at the 95% confidence level to test the data obtained.
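The variant of the t-test is not stated. As an illustration only, the Python sketch below computes the statistic and degrees of freedom of Welch's unequal-variance t-test, which is the default behaviour of t.test in R; treating the HEG and non-HEG values as two independent samples is an assumption. The p-value is not computed here because it requires the t distribution, which the Python standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's two-sample t-test on samples a and b."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

At the 95% level, the difference would be called significant when the absolute value of t exceeds the two-sided critical value of the t distribution for the computed degrees of freedom.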

Results

HEG and Non-HEG

From HEG-DB, we obtained 253 HEG in E. coli K12.

The genome of E. coli str. K-12 substr. W3110 was obtained from the NCBI database to gather all gene names and their corresponding sequences. After preprocessing, we obtained 4206 genes in total. The 3953 non-HEG were obtained by removing the 253 HEG from this set.