
Gene 345 (2005) 127–138

Codon bias as a factor in regulating expression via translation rate in the human genome

Yizhar Lavner, Daniel Kotlar*

Department of Computer Science, Tel Hai Academic College, Upper Galilee 12210, Israel

Received 13 September 2004; received in revised form 10 November 2004; accepted 11 November 2004

Available online 24 December 2004

Received by H.E. Roman

Abstract

We study the interrelations between tRNA gene copy numbers, gene expression levels and measures of codon bias in the human genome. First, we show that isoaccepting tRNA gene copy numbers correlate positively with expression-weighted frequencies of amino acids and codons. Using expression data of more than 14,000 human genes, we show a weak positive correlation between gene expression level and frequency of optimal codons (codons with highest tRNA gene copy number). Interestingly, contrary to non-mammalian eukaryotes, codon bias tends to be high in both highly expressed genes and lowly expressed genes. We suggest that selection may act on codon bias, not only to increase elongation rate by favoring optimal codons in highly expressed genes, but also to reduce elongation rate by favoring non-optimal codons in lowly expressed genes. We also show that the frequency of optimal codons is in positive correlation with estimates of protein biosynthetic cost, and suggest another possible action of selection on codon bias: preference of optimal codons as production cost rises, to reduce the rate of amino acid misincorporation. In the analyses of this work, we introduce a new measure of frequency of optimal codons (FOP′), which is unaffected by amino acid composition and is corrected for background nucleotide content; we also introduce a new method for computing expected codon frequencies, based on the dinucleotide composition of the introns and the non-coding regions surrounding a gene.

© 2004 Elsevier B.V. All rights reserved.

Keywords: Homo sapiens; Codon bias; Gene expression; Translation efficiency; Optimal codon; Biosynthetic cost

0378-1119/$ - see front matter © 2004 Elsevier B.V. All rights reserved.

doi:10.1016/j.gene.2004.11.035

Abbreviations: CB, codon bias; ENC, effective number of codons; FOP, frequency of optimal codons; MCB, maximum likelihood codon bias.

* Corresponding author. Tel.: +972 4 6952965; fax: +972 4 6952899.

E-mail address: [email protected] (D. Kotlar).

1. Introduction

Codon bias, the unequal use of synonymous codons for encoding amino acids (Grantham et al., 1980; Moriyama, 2003), has been found in many organisms, both prokaryotes and eukaryotes. This bias varies considerably among organisms and even among the genes of the same organism. The bias has been related to many genomic factors, such as gene length, GC-content, recombination rate, gene expression level, and density of genes (Duret and Mouchiroud, 1999; Kreitman and Comeron, 1999; Duret, 2000; Marais et al., 2001; Urrutia and Hurst, 2001, 2003; Hey and Kliman, 2002; Versteeg et al., 2003), or to other regularities in the genetic code (Karlin and Mrazek, 1996). In different species, codon bias was found to correlate weakly with gene expression level (Ikemura, 1981; Sharp et al., 1986; Duret and Mouchiroud, 1999; Urrutia and Hurst, 2003). Two main processes were proposed to explain codon bias: natural selection acting on silent changes in DNA, mutational bias, or both. In unicellular organisms, such as E. coli and S. cerevisiae, it was found that the codons translated by the most abundant tRNA are the most frequently used (Ikemura, 1981, 1982). In multicellular organisms, such as C. elegans (Duret, 2000) and Drosophila (Akashi, 1995; Moriyama and Powell, 1997), similar results were obtained, namely, that codon bias favoring codons with high tRNA gene copy number rises with expression level, thus supporting the action of selection on codon bias to improve translation efficiency. This idea has not been confirmed in mammals (Kanaya et al., 2001). Although a weak correlation between gene expression level and codon bias has been observed in the human genome (Urrutia and Hurst, 2003), this relation has not been linked to tRNA abundance. Recently, Comeron (2004) showed that in the human genome, for the majority of amino acids with degeneracy greater than one, the codons with the highest tRNA gene copy numbers also increase in frequency in highly expressed genes compared to lowly expressed genes.

In this study, we introduce new methods for computing

the frequency of optimal codons (FOP) and for correcting

codon bias for background nucleotide content. Using these

methods, we show evidence indicating that the human

genome translation efficiency, as estimated using tRNA

gene copy numbers, is in weak positive correlation with

expression level, and that codon bias has a role in this

relation, although not the simple role it has in the model

described above: on the one hand, we found that codon bias

favors codons with high tRNA gene copy number in highly

expressed genes, and on the other hand, based on the

evidence presented here, we suggest that codon bias may act

as a gene expression regulator by favoring codons with low

tRNA gene copy numbers in lowly expressed genes. This

supports a mechanism proposed by Fiers and Grosjean

(1979) and supported by Konigsberg and Godson (1983) for

rare codons in regulatory genes of E. coli. Zhang et al.

(1991) also proposed this regulatory mechanism for several

organisms, including primates. In addition, we present

evidence that selection might act on codon bias to prefer

optimal codons, possibly to reduce the rate of amino acid

misincorporation as protein production cost rises.

2. Materials and methods

2.1. Frequency weighted by expression

The count $c_a$ of each amino acid $a$ is calculated as follows:

$$c_a = \sum_g c_a(g)\,E(g)$$

where $c_a(g)$ is the count of $a$ in the gene $g$, $E(g)$ is the expression level of $g$ (average of expression; see below), and the sum is taken over all the relevant genes (either the highly expressed genes or all expressed genes). The expression-weighted frequency $f_a^{ex}$ of the amino acid $a$ is given by

$$f_a^{ex} = \frac{c_a}{\sum_a c_a} \tag{1}$$

where the sum in the denominator is over all the amino acids. This calculation is similar to the one performed by Duret (2000) for C. elegans. In a similar manner, we compute the expression-weighted frequency of a codon.
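As an illustration of Eq. (1), the following minimal Python sketch computes expression-weighted amino acid frequencies. The function name, the input layout (pairs of protein sequence and expression level) and the toy data are assumptions made for this example and are not taken from the original analysis.

```python
from collections import Counter

def expression_weighted_frequencies(genes):
    """genes: iterable of (amino_acid_sequence, expression_level) pairs.

    Returns {amino_acid: f_a^ex} as in Eq. (1).
    """
    weighted = Counter()
    for protein, expression in genes:
        for aa, count in Counter(protein).items():
            weighted[aa] += count * expression  # c_a(g) * E(g)
    total = sum(weighted.values())
    return {aa: c / total for aa, c in weighted.items()}

# Hypothetical toy data: two short "genes" with different expression levels.
toy_genes = [("MAAK", 10.0), ("MKKL", 1.0)]
freqs = expression_weighted_frequencies(toy_genes)
```

The same function applies to codons if each gene is passed as a list of codons instead of a string of amino acids.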

2.2. Estimating translation efficiency

2.2.1. Gene copy numbers data

Gene copy number data were taken from Lander et al. (2001) and from the tRNA-scan site (http://www.rna.wustl.edu/GtRDB/Hs/Hs-summary.html). In these data, pseudogenes have already been removed. We use tRNA gene copy numbers as an assumed estimate of cellular tRNA abundance (see the explanation at the beginning of the Results section).

2.2.2. Frequency of optimal codons (FOP)

The optimal codon of an amino acid is defined here as the codon with the highest number of tRNA genes for its anticodon, among its synonymous codons. The simplest way to compute the frequency of optimal codons (FOP) of a gene is to count the number of appearances of optimal codons in the gene, and divide it by the total number of codons in the gene (excluding the stop codons):

$$\mathrm{FOP}_s(g) = \frac{1}{N} \sum_i n_i(g) \tag{2}$$

where $n_i(g)$ is the count of the codon $i$ in the gene $g$, $N$ is the total number of codons in $g$, and the sum is taken over all the optimal codons. The subscript $s$ stands for “simple”. This FOP measure is affected by amino acid usage. If synonymous codon usage is random, a gene composed only of amino acids of degeneracy two would have FOP of 0.5, whereas a gene composed of amino acids of degeneracy four would have FOP of 0.25. In order to obtain a measure which is independent of amino acid composition, we multiply each codon count in Eq. (2) by the corresponding amino acid degeneracy:

$$\mathrm{FOP}(g) = \frac{1}{N} \sum_i \mathrm{syn}(i)\, n_i(g). \tag{3}$$

Here, $\mathrm{syn}(i)$ is the degeneracy of the amino acid coded by $i$. This way a gene with close to random synonymous codon usage will have FOP value close to 1, regardless of its amino acid composition. To see that this is a sensible measure, we write Eq. (3) in a slightly different way:

$$\mathrm{FOP}(g) = \sum_i \frac{n_{aa(i)}(g)}{N} \cdot \frac{n_i(g)/n_{aa(i)}(g)}{1/\mathrm{syn}(i)} \tag{4}$$

where $n_{aa(i)}(g)$ is the count of the amino acid coded by $i$ in $g$. Assigning $f_i(g) = n_i(g)/n_{aa(i)}(g)$ and $f_{aa(i)}(g) = n_{aa(i)}(g)/N$, we have:

$$\mathrm{FOP}(g) = \sum_i f_{aa(i)}(g)\, \frac{f_i(g)}{1/\mathrm{syn}(i)} \tag{5}$$

Now, the second multiplier is just the relative synonymous codon usage, or RSCU, of the codon $i$ in the gene $g$ (Sharp et al., 1986). Hence, the FOP measure is a weighted mean of the RSCUs of the optimal codons, where the weights are the corresponding amino acid frequencies:

$$\mathrm{FOP}(g) = \sum_i f_{aa(i)}(g)\, \mathrm{RSCU}_i(g) \tag{6}$$

This measure does not take into account the background nucleotide composition. In order to correct FOP for background nucleotide content, we replace $1/\mathrm{syn}(i)$ in Eq. (5) by $E_i^{nc}(g)$, the expected proportion of the codon $i$ among its synonyms, based on the non-coding region surrounding the gene (see below for the way to compute $E_i^{nc}(g)$). Taking $\mathrm{RSCU}'_i(g) = f_i(g)/E_i^{nc}(g)$ in Eq. (6), we get

$$\mathrm{FOP}'(g) = \sum_i f_{aa(i)}(g)\, \mathrm{RSCU}'_i(g) \tag{7}$$

Replacing $\mathrm{syn}(i)$ in Eq. (3) with $1/E_i^{nc}(g)$, we get a simpler way to compute FOP′(g):

$$\mathrm{FOP}'(g) = \frac{1}{N} \sum_i \frac{n_i(g)}{E_i^{nc}(g)} \tag{8}$$

where the sum is taken over all optimal codons for which $E_i^{nc}(g) \neq 0$.
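The three variants in Eqs. (2), (3) and (8) can be sketched in a few lines of Python. This is only an illustration: the function names and the toy genes are hypothetical, and the example is restricted to the Lys and Ala codon families, whose optimal codons (aag and gct) are taken from Table 3.

```python
def fop_simple(codons, optimal):
    """Eq. (2): fraction of codons in the gene that are optimal."""
    return sum(c in optimal for c in codons) / len(codons)

def fop(codons, optimal, syn):
    """Eq. (3): each optimal codon is weighted by its degeneracy syn(i)."""
    return sum(syn[c] for c in codons if c in optimal) / len(codons)

def fop_prime(codons, optimal, expected):
    """Eq. (8): each optimal codon is weighted by 1 / E_i^nc(g)."""
    return sum(1.0 / expected[c] for c in codons
               if c in optimal and expected[c] != 0) / len(codons)

# Toy setting: Lys (aaa, aag) and Ala (gca, gcc, gcg, gct) only.
OPTIMAL = {"aag", "gct"}
SYN = {"aaa": 2, "aag": 2, "gca": 4, "gcc": 4, "gcg": 4, "gct": 4}
UNIFORM = {codon: 1.0 / s for codon, s in SYN.items()}  # E = 1/syn(i)

even_gene = ["aaa", "aag", "gca", "gcc", "gcg", "gct"]    # unbiased usage
biased_gene = ["aag", "aag", "aag", "gct", "gct", "gca"]  # prefers optimal codons

fop_even = fop(even_gene, OPTIMAL, SYN)
fop_biased = fop(biased_gene, OPTIMAL, SYN)
```

With a uniform background (E equal to 1/syn(i)), `fop_prime` reduces to `fop`, which mirrors the substitution that leads from Eq. (3) to Eq. (8).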

2.3. Computation of gene expression levels

Expression levels for individual genes were taken from

SAGE (http://www.cgap.nci.nih.gov/SAGE/SALL, version

of July 21, 2003). Only tags that matched a named gene

were taken into account. Expression values were calculated

by counting the tags for each gene in each library,

normalized per 200,000 tags, and combined over 43

libraries representing 18 normal tissues: brain (7 libraries,

311,726 tags), breast (7 libraries, 310,477 tags), colon (2

libraries, 76,954 tags), heart (1 library, 71,926 tags), kidney

(1 library, 30,721 tags), liver (1 library, 58,467 tags), lung (1

library, 77,024 tags), muscle (2 libraries, 88,332 tags), ovary

(2 libraries, 81,270 tags), pancreas (2 libraries, 54,673 tags),

peritoneum (1 library, 53,527 tags), placenta (2 libraries,

207,348 tags), prostate (4 libraries, 232,573 tags), retina (4

libraries, 239,211 tags), spinal cord (1 library, 45,109 tags),

stomach (1 library, 18,193 tags), vascular (2 libraries,

91,131 tags), white blood cells (2 libraries, 67,177 tags).

We combined the expression levels in the libraries in three ways: (a) breadth of expression, defined here as the number of libraries in which the gene was expressed; (b) average over the libraries; and (c) maximum over the libraries. The correlation values among the three methods are listed in Table 1. All correlations are highly significant, with $p < 10^{-100}$. Average is the method that correlates best with the two other methods.

Table 1
Correlation coefficients between different methods of combining values in SAGE libraries

           Breadth   Average   Maximum
Breadth    1         0.904     0.734
Average              1         0.913
Maximum                        1

[Our definition of breadth differs from the usual definition (Urrutia and Hurst, 2001), which is the number of tissues in which a gene is expressed. However, calculating breadth in both ways yielded two sets of values that are highly correlated (R>0.96). As for average expression, it may seem more accurate to average first among the libraries of each tissue and then to average over the tissues, as in Urrutia and Hurst (2001); we observed that the values computed this way and those computed by simply averaging over all available libraries are highly correlated (R>0.98).]
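The three ways of combining expression values across libraries can be illustrated with a short Python sketch. The function names and the toy numbers are hypothetical, and the per-library normalization shown (tag count scaled to 200,000 tags by library size) is an assumed reading of the procedure described above.

```python
def normalize(tag_count, library_size, per=200_000):
    """Scale a raw tag count to a common library size (assumed normalization)."""
    return tag_count * per / library_size

def combine_expression(values):
    """values: normalized expression of one gene in each library."""
    breadth = sum(v > 0 for v in values)  # number of libraries with expression
    average = sum(values) / len(values)   # average over the libraries
    maximum = max(values)                 # maximum over the libraries
    return breadth, average, maximum

# Hypothetical gene observed in three libraries of different sizes.
raw_counts = [3, 0, 12]
library_sizes = [60_000, 50_000, 120_000]
normalized = [normalize(c, s) for c, s in zip(raw_counts, library_sizes)]
breadth, average, maximum = combine_expression(normalized)
```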

2.4. Codon bias corrected for background nucleotide

content

We used four methods to compute codon bias, corrected for background nucleotide content:

1. Effective number of codons corrected for background nucleotide content, or ENC′ (Wright, 1990; Novembre, 2002).

2. B measure (Karlin et al., 1998), which is applied as follows:

$$B(g) = \sum_i f_{aa(i)}(g)\, \left| f_i(g) - E_i^{nc}(g) \right| \tag{9}$$

where $f_i(g)$ is the proportion of the codon $i$ in the gene $g$ among its synonymous codons; $E_i^{nc}(g)$ is the expected proportion of $i$ in $g$ (see below); $f_{aa(i)}(g)$ is the frequency of the amino acid coded by $i$ in $g$; and the sum is over all codons.

3. HK measure: we first compute the uncorrected codon bias (Karlin and Mrazek, 1996):

$$\mathrm{CB}(g) = \sum_i f_{aa(i)}(g)\, \left| f_i(g) - \frac{1}{\mathrm{syn}(i)} \right| \tag{10}$$

where $\mathrm{syn}(i)$ is the degeneracy of the amino acid coded by $i$ (the number of synonymous codons for $i$). We then compute the regression line of CB(g) versus non-coding GC-content, taken from the non-coding regions surrounding $g$, and subtract the regression line from the CB measure (as done by Hey and Kliman, 2002; thus we shall denote this method HK). This is based on the known observation that codon bias is positively correlated with both non-coding GC-content and expression level in some eukaryotic genomes, including the human genome (see the next subsection).

4. Maximum-likelihood codon bias, or MCB (Urrutia and Hurst, 2001).
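Eqs. (9) and (10) differ only in the expected proportion that is subtracted, so both can be sketched with one Python function; a second helper illustrates the regression-line subtraction used for HK. All names and toy values are hypothetical, and the ordinary least-squares fit is an assumption about how the regression line is obtained.

```python
def weighted_bias(codon_counts, family_of, expected):
    """Sum over codons i of f_aa(i)(g) * |f_i(g) - expected[i]|."""
    family_totals = {}
    for codon, n in codon_counts.items():
        fam = family_of[codon]
        family_totals[fam] = family_totals.get(fam, 0) + n
    total = sum(family_totals.values())
    bias = 0.0
    for codon, exp in expected.items():
        fam_n = family_totals.get(family_of[codon], 0)
        if fam_n:
            f_i = codon_counts.get(codon, 0) / fam_n  # share among synonyms
            bias += (fam_n / total) * abs(f_i - exp)
    return bias

def regression_residuals(x, y):
    """Subtract the least-squares line of y on x (HK-style correction)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]

# Toy gene containing only Lys codons.
family_of = {"aaa": "Lys", "aag": "Lys"}
counts = {"aaa": 1, "aag": 3}
cb = weighted_bias(counts, family_of, {"aaa": 0.5, "aag": 0.5})    # Eq. (10)
b = weighted_bias(counts, family_of, {"aaa": 0.25, "aag": 0.75})   # Eq. (9)

# Hypothetical CB values and non-coding GC-content of three genes.
hk = regression_residuals([0.40, 0.50, 0.60], [0.10, 0.30, 0.20])
```

In the toy gene the uncorrected bias is 0.5, whereas B is zero because the (hypothetical) background already predicts the observed 1:3 usage.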

2.5. Computing expected values

For the first three of the four methods described above, we need non-coding sequences neighboring a given gene (the fourth method, MCB, uses the coding sequence itself). We used the sequence consisting of the introns of the gene, the 1000 nucleotides immediately preceding the coding area of the gene, and similarly, the 1000 nucleotides immediately succeeding it (truncated, as necessary, where genes were less than 1000 nucleotides apart; see also Hey and Kliman, 2002 and Urrutia and Hurst, 2003). If an intron is longer than 2000 bp, only the 1000 nucleotides on each of the intron’s ends were taken. By taking 1000 flanking bases, we ensure that regions that may be under selective constraints, both in flanking regions and introns, constitute only a small portion of the sequences that are used as control. On the other hand, regions of large introns that are far from any coding sequence may not represent the mutational bias that acts on the nearby exons, and thus introns were truncated to 1000 bases on each end. We masked repetitive elements using RepeatMasker (http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker).
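The assembly of the non-coding background described above can be sketched as follows. The function names are hypothetical; repeat masking and the truncation of flanks between closely spaced genes are not modeled and are left to the caller. Returning separate segments, rather than one concatenated string, is a design choice of this sketch so that no triplet is later counted across an artificial junction.

```python
def truncate_intron(intron, flank=1000):
    """Keep only the two ends of an intron longer than 2 * flank."""
    if len(intron) > 2 * flank:
        return [intron[:flank], intron[-flank:]]
    return [intron]

def background_segments(upstream, introns, downstream, flank=1000):
    """Collect the non-coding segments surrounding one gene."""
    segments = [upstream[-flank:]]           # bases preceding the coding area
    for intron in introns:
        segments.extend(truncate_intron(intron, flank))
    segments.append(downstream[:flank])      # bases succeeding the coding area
    return [s for s in segments if s]

# Hypothetical sequences, chosen only to exercise the length rules.
segments = background_segments("A" * 1500, ["C" * 2500, "G" * 2000], "T" * 300)
segment_lengths = [len(s) for s in segments]
```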

In warm-blooded vertebrates, including human, it is well known that ‘isochores’ structure (Bernardi et al., 1985) correlates with both expression level and codon bias (Versteeg et al., 2003; Knight et al., 2001). Although correction of codon bias for the influence of isochore structure is essential, it is not enough. It has been observed that the G and C compositions (and similarly the A and T compositions), in both coding and non-coding sequences in the human genome, are not equal, and the differences (termed GC-skew and AT-skew, respectively) correlate with expression level (Duret, 2002). Thus, the correction of codon bias should consider the base composition of the background in a more differential manner. Here we introduce two methods for computing $E_i^{nc}(g)$ (see Eqs. (7)–(9)).

We treated the amino acids with six synonymous codons

as two independent amino acids each, one with four codons,

and one with two (as in Urrutia and Hurst, 2001), so that any

two synonymous codons differ only in the third codon

position.
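The splitting of sixfold amino acids can be made concrete by grouping the codons of the standard genetic code by amino acid and by their first two bases. The sketch below is illustrative; the variable and function names are assumptions, and the 64-letter string is the usual compact encoding of the standard genetic code in TCAG order.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def synonymous_families(code=CODE):
    """Group sense codons by (amino acid, first two bases)."""
    families = {}
    for codon, aa in code.items():
        if aa != "*":  # skip stop codons
            families.setdefault((aa, codon[:2]), []).append(codon)
    return families

def third_bases(codon, code=CODE):
    """The set S(XYZ): third bases that keep the codon synonymous."""
    family = synonymous_families(code)[(code[codon], codon[:2])]
    return {c[2] for c in family}

families = synonymous_families()
```

With this grouping, Leu, Ser and Arg each contribute one fourfold and one twofold family, so any two codons of a family differ only in the third position.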

The first and simpler method applies the base proportions in the non-coding surrounding of the gene to the third codon position. For example, if the base A appears 21% of the time in the non-coding surrounding of a given gene, then in a fourfold amino acid, the codon ending with an A will have an $E_i^{nc}(g)$ value of 0.21.
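A minimal Python sketch of this first method follows. The text gives only the fourfold example; renormalizing the base proportions over the allowed third bases for twofold and threefold families is an assumption of this sketch, as are the function name and the toy background.

```python
def expected_single_nucleotide(background, allowed_third_bases):
    """Expected proportion of each synonymous codon, keyed by its third base."""
    counts = {b: background.count(b) for b in allowed_third_bases}
    total = sum(counts.values())
    return {b: counts[b] / total for b in allowed_third_bases}

# Hypothetical background with 21% A, 29% C, 30% G and 20% T.
background = "A" * 21 + "C" * 29 + "G" * 30 + "T" * 20
fourfold = expected_single_nucleotide(background, "ACGT")
twofold = expected_single_nucleotide(background, "CT")  # assumed renormalization
```

For a fourfold family the result is simply the base proportion in the background, which reproduces the 0.21 of the example above.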

However, since it is known that there are dinucleotides that are in excess or are avoided in the genome, for example, the dinucleotide CG is depleted by its tendency to mutate and to disappear from genomic sequences (Graur and Li, 2000), it may not be enough to consider single nucleotide frequencies for correction. The second method incorporates the dinucleotide composition of the non-coding surrounding of the gene in the following manner:

We count the number of appearances of each triplet in all three reading frames of the non-coding surrounding of a gene. For each triplet XYZ, we denote this number by #XYZ. For a codon XYZ, we denote by S(XYZ) the set of bases that when replacing Z would yield a synonymous codon (including the base Z itself). For example, S(GCA)={A,C,G,T}, S(AGC)={C,T}, and S(ATT)={A,C,T} (recall that the sixfold amino acids are split into a twofold amino acid and a fourfold one). It is clear that the frequency of a codon XYZ among its synonyms in a gene could be affected by the background frequencies of both the YZ dinucleotide and the ZN dinucleotide, where N is any of the four bases. Calculating the expected relative frequencies of XYZ at the non-coding surrounding is done as follows:

Let

$$w^N_{XYZ} = \frac{\#YZN}{\sum_{K \in S(XYZ)} \#YKN}, \qquad N \in \{A, T, C, G\}$$

Then

$$E^{nc}_{XYZ}(g) = f^A_{aa(XYZ)}\, w^A_{XYZ} + f^C_{aa(XYZ)}\, w^C_{XYZ} + f^G_{aa(XYZ)}\, w^G_{XYZ} + f^T_{aa(XYZ)}\, w^T_{XYZ} \tag{11}$$

where aa(XYZ) is the amino acid coded by XYZ and $f^N_{aa(XYZ)}$ is the number of triplets that code for aa(XYZ), that are followed by the base N, divided by the total number of triplets that code for aa(XYZ). All triplets are counted in the non-coding surrounding of g in all three reading frames.
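Eq. (11) can be sketched in Python by counting 3-mers and 4-mers in the background segments. The function names and the toy background are hypothetical. One detail is an assumption of this sketch: the denominator of $f^N_{aa(XYZ)}$ counts only those synonymous triplets that are followed by a base, so that the four weights sum to one.

```python
from collections import Counter

def kmer_counts(segments, k):
    """Count all k-mers, in every reading frame, within each segment."""
    counts = Counter()
    for seg in segments:
        for i in range(len(seg) - k + 1):
            counts[seg[i:i + k]] += 1
    return counts

def expected_dinucleotide(codon, third_bases, segments):
    """E^nc_XYZ(g) of Eq. (11); third_bases is the set S(XYZ)."""
    x, y, z = codon
    tri = kmer_counts(segments, 3)
    quad = kmer_counts(segments, 4)
    # Synonymous triplets XYK followed by base N (numerator of f^N).
    followed = {n: sum(quad[x + y + k + n] for k in third_bases) for n in "ACGT"}
    total = sum(followed.values())
    expected = 0.0
    for n in "ACGT":
        denom = sum(tri[y + k + n] for k in third_bases)
        if total and denom:
            w = tri[y + z + n] / denom  # w^N_XYZ
            expected += (followed[n] / total) * w
    return expected

# Hypothetical background segment; Ala codons GCN with S = {A, C, G, T}.
background = ["GCAGCTGCCGCGGCATGCTAGCCATTGCGA"]
ala = {z: expected_dinucleotide("GC" + z, "ACGT", background) for z in "ACGT"}
```

Under the stated assumption, the expected proportions of the codons of one family sum to one, as required of proportions among synonyms.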

We calculated FOP′ and B using both methods. The B (Eq. (9)) values for all genes in the study, with $E_i^{nc}$ calculated by both methods, are highly correlated (R=0.96), and similarly for FOP′ in Eqs. (7) and (8) (R=0.93). Thus, the results involving these measures, with $E_i^{nc}(g)$ calculated in both ways, are almost identical. In this paper, we included only the results where $E_i^{nc}(g)$ was computed by the second method, which seems more accurate, as it considers dinucleotide frequencies.

2.6. Protein biosynthetic cost measures

We used the size/complexity score of Dufton (1997) for

amino acids (Table 2). To evaluate the biosynthetic cost of a

protein, encoded by a given gene, we used two measures:

2.6.1. Average size/complexity score

Each codon was given the score of the amino acid it

encodes. The size complexity score was averaged over the

codons of a gene.

2.6.2. Frequency of expensive amino acids

This is the relative frequency of amino acids with a size/complexity score greater than 40 (Arg, Cys, His, Phe, and Tyr). We excluded the single-codon amino acids Met and Trp, since they do not contribute to FOP′ or to the codon bias.
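Both cost measures can be sketched in Python using the size/complexity scores listed in Table 2. The function names and the one-letter amino acid keys are choices made for this example, and taking the frequency of expensive amino acids relative to all residues of the protein is an assumption.

```python
# Size/complexity scores from Table 2, keyed by one-letter amino acid code.
DUFTON = {
    "A": 4.76, "R": 56.36, "N": 33.72, "D": 32.72, "C": 57.16,
    "Q": 37.48, "E": 36.48, "G": 1.0, "H": 58.7, "I": 16.04,
    "L": 16.04, "K": 30.14, "M": 64.68, "F": 44.0, "P": 31.8,
    "S": 17.86, "T": 21.62, "W": 73.0, "Y": 57.0, "V": 12.28,
}

# Score above 40, excluding the single-codon amino acids Met and Trp.
EXPENSIVE = {aa for aa, score in DUFTON.items() if score > 40} - {"M", "W"}

def average_cost(protein):
    """Average size/complexity score over the residues of a protein."""
    return sum(DUFTON[aa] for aa in protein) / len(protein)

def expensive_fraction(protein):
    """Relative frequency of expensive amino acids in a protein."""
    return sum(aa in EXPENSIVE for aa in protein) / len(protein)
```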

2.7. Sequence data

Gene and intron sequences were downloaded from NCBI GenBank, Build 33 (ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/). We included only CDSs that start with a start codon, end with a stop codon, have a length that is a multiple of three, and have no unidentified bases. For genes with more than one CDS, we took the longest CDS. Thus, between 5% and 10% of the genes were excluded. In addition, less than half the genes were expressed in the 43 SAGE libraries (see below). After a further removal of genes, required by the computation of MCB (Urrutia and Hurst, 2001), 14,131 genes were left for the analysis.

Table 2
Amino acids in the human genome: Frequency, frequency weighted by expression in highly expressed genes and in all expressed genes, isoaccepting tRNA gene copy numbers, and size/complexity score (as in Dufton, 1997)

Amino   Frequency weighted    Frequency weighted    Isoaccepting    Dufton’s size/
acid    by expression in      by expression in      tRNA gene       complexity
        highly expressed      all expressed         copy number     score
        genes                 genes
Ala     7.50                  7.41                  40              4.76
Arg     5.74                  5.73                  30              56.36
Asn     3.71                  3.69                  34              33.72
Asp     5.22                  5.16                  10              32.72
Cys     1.92                  1.97                  30              57.16
Gln     4.48                  4.53                  32              37.48
Glu     7.34                  7.32                  22              36.48
Gly     7.19                  7.07                  24              1.0
His     2.34                  2.40                  12              58.7
Ile     4.51                  4.48                  19              16.04
Leu     9.12                  9.29                  35              16.04
Lys     6.76                  6.59                  38              30.14
Met     2.36                  2.32                  17              64.68
Phe     3.59                  3.60                  14              44.0
Pro     5.70                  5.80                  25              31.8
Ser     7.22                  7.41                  26              17.86
Thr     5.16                  5.17                  25              21.62
Trp     1.12                  1.14                  7               73.0
Tyr     2.80                  2.77                  12              57.0
Val     6.22                  6.18                  44              12.28
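The CDS inclusion rules of Section 2.7 can be sketched as a simple Python filter. The function names are hypothetical; taking ATG as the only start codon, using the three standard stop codons, and applying the validity filter before choosing the longest CDS are assumptions of this sketch.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def is_valid_cds(cds):
    """Start codon, stop codon, length multiple of three, no unidentified bases."""
    cds = cds.upper()
    return (len(cds) >= 6
            and len(cds) % 3 == 0
            and cds.startswith("ATG")
            and cds[-3:] in STOP_CODONS
            and set(cds) <= set("ACGT"))

def longest_valid_cds(candidates):
    """For a gene with several CDSs, keep the longest one that passes the filter."""
    valid = [c for c in candidates if is_valid_cds(c)]
    return max(valid, key=len) if valid else None

# Hypothetical candidate CDSs of one gene.
chosen = longest_valid_cds(["ATGAAATAA", "ATGAAACCCTAG", "ATGAAACCCGGGTA"])
```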

3. Results

In some unicellular organisms, a positive relation between cellular tRNA and tRNA gene copy number was found (Dong et al., 1996; Percudani et al., 1997; Kanaya et al., 1999). As in C. elegans and other eukaryotes (Moriyama and Powell, 1997; Duret, 2000; Kanaya et al., 2001), in the human genome there is also a redundancy in the set of tRNA genes (Lander et al., 2001; see also http://www.rna.wustl.edu/GtRDB/Hs/Hs-summary.html). The number of tRNA genes varies from 7 (Trp) to 44 (Val). Although tRNA genes could also differ in their transcription level, a positive correlation between intracellular tRNA and tRNA gene copy number is expected (Duret, 2000; Comeron, 2004). We assume such a correlation and use the number of tRNA genes as a measure of the amount of intracellular tRNAs. As done by Duret (2000) for C. elegans, we measured the correlation between the isoaccepting tRNA gene copy numbers and the expression-weighted frequencies of the 20 amino acids, and those of the individual codons (see Tables 2 and 3 and Materials and methods). When amino acid frequencies were considered (Fig. 1a), a significant correlation was found for highly expressed genes (R=0.585, p=0.007, N=4320). For codon frequencies (Fig. 1b), excluding two outliers, we obtained a highly significant positive correlation (R=0.654, p<0.001, and R=0.56, p<0.001 when including the outliers). Similar results were obtained when all expressed genes were taken, and slightly lower correlations were obtained when frequencies were not weighted (R=0.565, p=0.009 for amino acids; R=0.62, p<0.001 for codons). These correlations do not necessarily prove a correlation between cellular tRNA abundance and the number of tRNA genes, but, as indicated by Duret (2000), if such a relation were not present, we would not expect any correlation between the frequencies of amino acids and codons and the number of tRNA genes.
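The amino acid correlation can be recomputed from the values listed in Table 2. The following Python sketch uses the Pearson correlation coefficient, which is an assumption about the statistic behind the reported R; the dictionary and function names are chosen for this example only.

```python
import math

# From Table 2: (frequency weighted by expression in highly expressed genes,
#                isoaccepting tRNA gene copy number).
TABLE2 = {
    "Ala": (7.50, 40), "Arg": (5.74, 30), "Asn": (3.71, 34), "Asp": (5.22, 10),
    "Cys": (1.92, 30), "Gln": (4.48, 32), "Glu": (7.34, 22), "Gly": (7.19, 24),
    "His": (2.34, 12), "Ile": (4.51, 19), "Leu": (9.12, 35), "Lys": (6.76, 38),
    "Met": (2.36, 17), "Phe": (3.59, 14), "Pro": (5.70, 25), "Ser": (7.22, 26),
    "Thr": (5.16, 25), "Trp": (1.12, 7), "Tyr": (2.80, 12), "Val": (6.22, 44),
}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

frequencies = [f for f, _ in TABLE2.values()]
copy_numbers = [c for _, c in TABLE2.values()]
r = pearson(frequencies, copy_numbers)  # about 0.585 for these tabulated values
```

The result agrees with the value reported for highly expressed genes (Fig. 1a), up to the rounding of the tabulated frequencies.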

3.1. The relation between gene copy numbers and gene

expression level

Under the assumption that gene copy numbers can be used as an indication of cellular tRNA abundance (see also Duret, 2000; Comeron, 2004), we used the frequency of optimal codons (FOP′, see Materials and methods) as a measure of translation efficiency. As explained above, this measure is independent of the amino acid frequencies and is corrected for background nucleotide content. Here, the term optimal codon denotes the codon with the highest tRNA gene copy number, for each amino acid (also termed major codon and translationally superior codon; see Akashi, 1995). We calculated the correlation between FOP′ and expression level in over 14,000 genes in all human chromosomes. We combined the expression levels in 43 libraries in three ways: (a) breadth of expression; (b) average over the libraries; and (c) maximum over the libraries (see Materials and methods). The correlations are very weak (R=0.075, 0.08, and 0.06, respectively) but highly significant (p<0.0001). The error-bar graph in Fig. 2 illustrates the relation between FOP′ and the average expression (see Materials and methods). Although the correlations are weak, the graph indicates that the FOP′ values rise as expression level rises.

Table 3
Codons in the human genome: tRNA gene copy numbers and frequencies in highly expressed genes and in all expressed genes

Amino   Codon    Isoaccepting    Frequency weighted    Frequency in
acid             tRNA gene       by expression in      all expressed
                 copy number     highly expressed      genes
                                 genes
Ala     gca      10              1.57                  1.63
        gcc^w    0               3.08                  2.80
        gcg      5               0.73                  0.74
        gct*     25              2.12                  1.86
Arg     aga      5               1.07                  1.20
        agg      4               1.03                  1.17
        cga      7               0.64                  0.63
        cgc^w    0               1.19                  1.05
        cgg      5               1.20                  1.17
        cgt*     9               0.60                  0.46
Asn     aac*     33              2.08                  1.90
        aat^w    1               1.63                  1.73
Asp     gac*     10              2.76                  2.55
        gat^w    0               2.46                  2.26
Cys     tgc*     30              1.06                  1.22
        tgt^w    0               0.85                  1.05
Gln     caa      11              1.02                  1.26
        cag*     21              3.45                  3.49
Glu     gaa*     14              2.95                  3.06
        gag      8               4.39                  4.06
Gly     gga      5               1.70                  1.66
        ggc*     11              2.56                  2.23
        ggg      8               1.54                  1.63
        ggt^w    0               1.39                  1.07
His     cac*     12              1.39                  1.51
        cat^w    0               0.95                  1.10
Ile     ata      5               0.57                  0.75
        atc^w    1               2.31                  2.02
        att*     13              1.62                  1.60
Leu     cta      2               0.58                  0.71
        ctc^w    0               1.69                  1.90
        ctg      6               3.90                  3.95
        ctt*     13              1.18                  1.33
        tta      8               0.58                  0.78
        ttg      6               1.19                  1.30
Lys     aaa      16              2.56                  2.52
        aag*     22              4.20                  3.23
Met     atg      17              2.36                  2.17
Phe     ttc*     14              2.00                  1.95
        ttt^w    0               1.59                  1.73
Pro     cca      10              1.52                  1.72
        ccc^w    0               1.92                  2.00
        ccg      4               0.61                  0.71
        cct*     11              1.65                  1.78
Ser     agc      7               1.70                  1.98
        agt^w    0               1.07                  1.26
        tca      5               1.00                  1.24
        tcc^w    0               1.65                  1.74
        tcg      4               0.41                  0.45
        tct*     10              1.40                  1.52
Thr     aca*     10              1.36                  1.50
        acc^w    0               1.95                  1.82
        acg      7               0.57                  0.60
        act      8               1.28                  1.32
Trp     tgg      7               1.12                  1.24
Tyr     tac*     11              1.58                  1.48
        tat^w    1               1.22                  1.21
Val     gta      5               0.68                  0.72
        gtc^w    0               1.52                  1.40
        gtg      19              2.91                  2.78
        gtt*     20              1.12                  1.11

Fig. 1. Isoaccepting tRNA gene copy number vs. frequency of amino acids and codons, weighted by expression, in highly expressed genes, in the human genome. Frequencies were computed and weighted for the top 30% of all expressed genes (a total of 4320 genes). (a) Amino acid frequencies, R=0.585, p=0.007. (b) Codon frequencies, R=0.654, p<0.001. Codons translated by the same anticodon (the wobble effect) were regarded as one point. The data appear in Tables 2 and 3. Expression values were calculated by averaging over 43 SAGE libraries (see Materials and methods).

3.2. The role of codon bias

Since codon bias is the unequal use of synonymous codons, we expect high codon bias in genes with stronger preference for optimal codons, and thus we expect codon bias to rise with frequency of optimal codons. Interestingly, we found that genes with the lowest values of FOP′ also tend to have high codon bias, even higher than in genes with the highest values of FOP′. This is illustrated in the graphs in Fig. 3. The high values of codon bias in genes with low frequency of optimal codons indicate that the bias tends to be for codons that correspond to low tRNA gene copy numbers, and this in turn may suggest that the bias is for codons with low translational values. The figure shows the relation between FOP′ and four measures of codon bias: effective number of codons, or ENC′ (Novembre, 2002), B measure (Karlin et al., 1998), HK measure (Hey and Kliman, 2002), and maximum-likelihood codon bias, or MCB (Urrutia and Hurst, 2001). ENC′ and B are corrected for background nucleotide composition by considering dinucleotides in the background (see Materials and methods).

Fig. 2. Frequency of optimal codons (FOP′) vs. average expression. 14,131 genes expressed in 43 SAGE libraries were divided, according to expression level, into ten categories of approximately equal size. Circles represent the mean value. Error bars show 95% confidence intervals. R=0.08, p<0.0001.
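The error-bar graphs used throughout the Results (genes divided into ten categories of approximately equal size, with a mean and a 95% interval per category) can be sketched in Python as follows. The function name and the toy data are hypothetical, and the normal-approximation interval (1.96 standard errors) is an assumption about how the error bars were derived.

```python
import math
import statistics

def binned_means(x, y, bins=10):
    """Sort genes by x, split into equal-size bins, summarize y in each bin."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    summary = []
    for b in range(bins):
        idx = order[b * len(order) // bins:(b + 1) * len(order) // bins]
        values = [y[i] for i in idx]
        mean = statistics.fmean(values)
        half_width = 0.0
        if len(values) > 1:
            # Assumed 95% interval: 1.96 standard errors around the mean.
            half_width = 1.96 * statistics.stdev(values) / math.sqrt(len(values))
        summary.append((mean, half_width))
    return summary

# Hypothetical data: 100 "genes" whose y value rises linearly with x.
x = list(range(100))
y = [2.0 * v for v in x]
summary = binned_means(x, y)
```

Each entry of `summary` corresponds to one circle and its error bar in a figure of this kind.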

When considering the relation between codon bias and

gene expression level, we encounter another unexpected

result. Instead of the expected positive correlation between

codon bias and expression level, based on studies in

different organisms (see the Introduction), we observed that

the average codon bias is highest both in the classes of

genes with the highest as well as with the lowest expression

levels. This is shown in the graphs of Fig. 4. Similar graphs

were obtained when using breadth of expression or

maximum expression (see Materials and methods, graphs

available upon request).

As indicated above, a rise in codon bias as expression level drops has not been observed in lower organisms. It appears that, in addition to the role associated with codon bias in enhancing the expression of certain genes by preferring codons with high cellular levels of isoaccepting tRNA (as was found for unicellular organisms; Ikemura, 1981; Kanaya et al., 1999) or with high tRNA gene copy number (Duret, 2000), codon bias in the human genome may have an additional role: controlling the expression of certain genes by preferring codons with small tRNA gene copy number.

Fig. 3. Codon bias vs. frequency of optimal codons (FOP′). (a) Effective number of codons (ENC′); (b) B; (c) Regression line subtraction (HK, see Materials and methods); (d) Maximum-likelihood codon bias (MCB). A total of 14,131 genes was divided, according to FOP′, into ten categories of approximately equal size. Circles represent the mean value. Error bars show 95% confidence intervals.

3.3. Translation efficiency and amino acid biosynthetic cost

Translation efficiency is also affected by the size, the

structure, and the production cost of the amino acids

incorporated in the protein. To further analyze these factors,

we hypothesized that proteins coded by highly expressed

genes are composed of smaller and biosynthetically cheaper

amino acids.

The relation between expression levels and biosynthetic cost is shown in Fig. 5. We use the size/complexity score of Dufton (1997) as a measure of biosynthetic cost (Table 2). A clear monotonic relation between expression levels and biosynthetic cost is evident. Genes with higher average expression (see Materials and methods) tend to code less for “expensive” amino acids, as estimated by the size/complexity measure. Similar graphs were obtained when using breadth of expression or maximum expression.

Interestingly, we found that the frequency of optimal codons is in significant positive correlation with the measures of biosynthetic cost (R=0.18, p<0.0001), as indicated in Fig. 6.

Although this result seems to be counterintuitive, since it

was shown above that for highly expressed genes, the size/

complexity score of genes tends to decrease with the

average expression, and thus showing a possible mechanism

of preference for cheap and smaller amino acids in highly

expressed genes, the correlations between the FOP and

expression level and between expression level and size/

complexity score are too weak to infer from about the

relation between the FOP and size/complexity score.

The graph indicates that genes encoding more expensive amino acids tend to use codons with more tRNA genes for their anticodons. This suggests the possibility of a mechanism that compensates for the high production cost of a protein through codon bias favoring optimal codons, enabling faster and more accurate translation.

Fig. 4. Codon bias vs. average expression level. (a) Effective number of codons (ENC′); (b) B; (c) Regression line subtraction (HK, see Materials and methods); (d) Maximum-likelihood codon bias (MCB). A total of 14,131 genes was divided, according to expression level, into ten categories of approximately equal size. Circles represent the mean values. Error bars show 95% confidence intervals.

Fig. 5. Biosynthetic cost vs. average expression. (a) Average size/complexity score; (b) Frequency of occurrence of the most expensive amino acids (see Materials and methods). A total of 14,131 genes expressed in 43 SAGE libraries was divided, according to expression level, into ten categories of approximately equal size. Circles represent the mean values. Error bars show 95% confidence intervals. Correlation coefficients are (a) R = −0.071 (p < 0.0001); (b) R = −0.127 (p < 0.0001).

Fig. 6. Frequency of optimal codons vs. average size/complexity score. R = 0.18, p < 0.0001. A total of 14,131 genes expressed in 43 SAGE libraries was divided, according to average size/complexity, into ten categories of approximately equal size. Circles represent the mean values. Error bars show 95% confidence intervals.
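The binned summaries shown in Figs. 4 to 6 (genes sorted by one variable, split into ten categories of approximately equal size, and plotted as means with 95% error bars) can be sketched as follows. This is only an illustration under stated assumptions: the function name `binned_means` is hypothetical, and the error bar is taken as a normal-approximation interval (1.96 standard errors), which may differ from the procedure actually used for the figures.

```python
import math
import statistics


def binned_means(expression, values, n_bins=10, z=1.96):
    """Sort genes by expression, split them into n_bins groups of nearly
    equal size, and return (mean expression, mean value, CI half-width)
    for each group.

    Assumption: the half-width is z * standard error of the mean, i.e. a
    normal approximation to the 95% confidence interval.
    """
    if len(expression) != len(values):
        raise ValueError("expression and values must have the same length")
    # Indices of genes ordered from lowest to highest expression.
    order = sorted(range(len(expression)), key=lambda i: expression[i])
    n = len(order)
    summary = []
    for b in range(n_bins):
        members = order[b * n // n_bins:(b + 1) * n // n_bins]
        if len(members) < 2:
            continue  # too few genes to estimate a standard error
        xs = [expression[i] for i in members]
        ys = [values[i] for i in members]
        sem = statistics.stdev(ys) / math.sqrt(len(ys))
        summary.append((statistics.fmean(xs), statistics.fmean(ys), z * sem))
    return summary
```

Each returned tuple corresponds to one circle and its error bar; `values` could hold any per-gene quantity, such as a codon bias measure or a size/complexity score.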

4. Discussion

In this work, we studied the relation between translation

efficiency (as estimated by the number of tRNA genes) and

gene expression level (as estimated by measures derived

from SAGE/EST data) in the human genome, and the

interrelations between these two factors and codon bias. For

this purpose, we introduced two methods: a method for

computing the frequency of optimal codons, which is

independent of amino acid composition and with correction

for background nucleotide content, and a method for

computing expected values of codon frequency, based on

dinucleotide composition of the background. We showed

that amino acid and codon frequencies, weighted by

expression, correlate positively with tRNA gene copy

number, thus possibly indicating a relation between the

number of tRNA genes and tRNA abundance. We showed

that expression level is in weak, highly significant, positive

correlation with frequency of optimal codons (which is

assumed to be a measure of translation efficiency). This, in

turn, shows that codon choice, or codon bias, relates to

expression level. A caveat must be admitted here: we used measures of the

transcriptome, which indicate numbers of mRNAs rather than of proteins.

Owing to the lack of data on protein levels in human, we assume that

protein levels can be inferred from the mRNA levels used here. In addition, we

obtained a surprising result not observed in previously

studied organisms, namely, that the average codon bias is

high both in weakly expressed genes and in highly expressed

genes. This result was obtained with all four measures of

codon bias used in this study. Previous analysis of the human

genome (Urrutia and Hurst, 2003) did not show high codon

bias for the weakly expressed genes. This may be explained

by the fact that in that study an older version of the genome

was used and that a large part of the weakly expressed genes

was deliberately excluded from the analysis.
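To make the notion of "frequency of optimal codons" concrete, the sketch below computes the classical, uncorrected frequency for a single coding sequence. It is not the FOP′ measure introduced in this work: it applies no correction for amino acid composition or background nucleotide content. The function name and the toy set of optimal codons are hypothetical and do not reflect actual human tRNA gene copy numbers.

```python
def frequency_of_optimal_codons(cds, optimal_codons, codon_to_aa):
    """Fraction of codons in `cds` that are optimal, counted only over
    amino acids for which an optimal codon has been designated.

    Assumption: this is the plain, uncorrected frequency, shown for
    illustration only.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # Amino acids that have at least one designated optimal codon.
    optimal_aas = {codon_to_aa[c] for c in optimal_codons}
    n_optimal = 0
    n_total = 0
    for codon in codons:
        if codon_to_aa.get(codon) not in optimal_aas:
            continue  # e.g. Met, or codons missing from the table
        n_total += 1
        if codon in optimal_codons:
            n_optimal += 1
    return n_optimal / n_total if n_total else float("nan")


# Partial genetic code, sufficient for the toy example below.
TOY_CODE = {"AAA": "K", "AAG": "K", "GAA": "E", "GAG": "E", "ATG": "M"}
# Hypothetical optimal codons, chosen for illustration only.
TOY_OPTIMAL = {"AAG", "GAG"}
```

For the toy sequence `ATGAAGAAAGAGGAA`, the Met codon is skipped and two of the remaining four codons are in the optimal set.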

In highly expressed genes, we show a tendency for high

codon bias, and also for higher frequency of optimal

codons. A possible explanation for this is that in these

genes, the high codon bias is a consequence of more codons

with high tRNA gene copy number, increasing the trans-

lation elongation rate. The finding that the average codon

bias is high for lowly expressed genes, together with the

result that these genes tend to have low frequency of optimal

codons (and therefore their high bias is probably the

consequence of favoring non-optimal codons in terms of

translation efficiency), suggests that some lowly expressed

genes may also experience the effect of natural selection

against optimal codons. Comeron (2004) showed strong

evidence for a transcription-associated bias for higher G and

T content on the coding strand of introns. This bias is

observed in coding regions in the third codon position as

well (Duret, 2002). Computing codon bias and RSCU

values relative to expected values, derived from introns on

the coding strand only, as done in this study, avoids the

difference between the coding and the non-coding strands,

as observed in Comeron (2004). Since only half of the

optimal codons end with a G or a T, it is unlikely that the

transcription-associated bias for G and T in both introns and

coding regions can account for the correlation between

FOP and expression level (Fig. 2) and the relation between

codon bias and expression level (Fig. 4).

We hypothesize that the translation efficiency of proteins that

can be disadvantageous at high levels is controlled by

this mechanism of preferring codons with low tRNA

abundance, thereby regulating the elongation rate of these

proteins. Such a mechanism was suggested by Fiers and

Grosjean (1979), and supporting evidence was provided by

Konigsberg and Godson (1983). The latter found that some

E. coli regulatory genes contain an unusually high number

of codons that are not frequently used in most E. coli genes,

and therefore suggested that this could be part of a

mechanism that helps to keep a low expression level in

some regulatory genes. In another study, Zhang et al.

(1991) showed that in E. coli, S. cerevisiae, D.

melanogaster, and primates (mainly Homo sapiens), proteins

containing a high percentage of low-usage codons can

be characterized as cases where an excess of the protein

could be detrimental. Another indication of this mechanism

in bacteria is provided by Saier (1995) who showed

evidence suggesting that the translation of proteins involved

in various specialized functions may be regulated by using

rare codons. Robinson et al. (1984) and Pedersen (1984)

presented experimental evidence which also indicates that

the presence of non-optimal codons can reduce translation

efficiency. Support for the notion of an expression regulation

mechanism in lowly expressed genes in mammals can

also be found in the work of Zhou et al. (1999) and Wells et al.

(1997), who showed that modifying the codon composition

of non-mammalian genes to resemble that of mammalian

genes can significantly enhance their translation in mammalian

cells, where the translation of the original genes is

limited. This model was challenged by arguing that the

presence of rare codons in lowly expressed genes is due to

mutational drift randomizing codon usage (Sharp and Li,

1986). However, it is hard to see how random drift can

account for the high level of codon bias, favoring rare and

non-optimal codons, in lowly expressed genes, as observed

in the human genome in this study. More research is needed

to show whether this mechanism is used to regulate

translation rate. The regulation hypothesis is valuable because it

leads to a testable prediction, namely, that proteins that could be

disadvantageous in excess indeed contain significantly more

non-optimal codons.

Using the size/complexity index developed by Dufton

(1997, see Materials and methods) as an estimate of the

amino acid cost, we found that there is a negative correlation

(R = −0.071) between expression level and the size/complexity score. This result is expected if amino acid usage is shaped by selective forces to optimize translation efficiency. A more pronounced effect was observed for the correlation between the frequency of expensive amino acids within genes and the expression level (R = −0.127, see Fig. 5). Urrutia and Hurst (2003) also used Dufton's index to examine its association with gene expression level, and found a similar tendency to avoid the use of complex amino acids in highly expressed genes.

Another study (Akashi and Gojobori, 2002), which was conducted on

B. subtilis and E. coli and focused on the relation

between metabolic costs of amino acid biosynthesis and

patterns of amino acid composition, shows an increased

usage of less energetically costly amino acids in highly

expressed genes in both species, and thus supports the action of

selection on amino acid usage to increase metabolic

efficiency. It should be noted, however, that Dufton's index,

although an indirect measure, takes into account the size and

the structural complexity of amino acids, factors that have

an influence on the rate of incorporating amino acids in the

elongation process.

Contrary to these results, Comeron (2004), using micro-

array data, reports that there is no detectable influence of

expression on amino acid usage. One reason for this discrepancy could be the use of two different methods for estimating the expression level. Another possible explanation is that Comeron analyzed the expression of each tissue separately ('expression level'), which can be problematic: genes that are highly expressed in one tissue can be poorly expressed in another, or not expressed at all. If there is selection for translational efficiency, it is more reasonable to expect that it will be detected in genes with high activity in many tissues. The more appropriate measure of gene expression for assessing its association with amino acid usage is therefore average expression or breadth of expression (see Materials and methods). These measures reflect more accurately the total activity of genes in different tissues.
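The contrast between a per-tissue expression level and the two summary measures can be illustrated with a short sketch. The definitions below are assumptions made for illustration: average expression is taken as the mean of a gene's normalized levels over all libraries, and breadth as the fraction of libraries in which the level exceeds a threshold. The precise definitions used in this work are those given in Materials and methods, and the function name is hypothetical.

```python
def expression_summaries(levels_by_gene, threshold=0.0):
    """Map each gene to (average expression, breadth of expression).

    levels_by_gene: dict of gene name -> list of normalized expression
    levels, one per tissue or library.
    Assumption: breadth is the fraction of libraries whose level is
    strictly above `threshold`.
    """
    summaries = {}
    for gene, levels in levels_by_gene.items():
        if not levels:
            continue  # no libraries recorded for this gene
        average = sum(levels) / len(levels)
        breadth = sum(1 for x in levels if x > threshold) / len(levels)
        summaries[gene] = (average, breadth)
    return summaries
```

With the toy input `{"geneA": [10, 0, 0, 0], "geneB": [3, 2, 3, 4]}`, geneA has the highest level in any single tissue but a breadth of only 0.25, whereas geneB is expressed in every tissue and has the higher average.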

We showed that frequency of optimal codons correlates

positively with protein production cost. We suggest that this

may be an indication of the action of selection on codon bias

to reduce error rate in the production of costly proteins. This

mechanism was suggested by Akashi (1994), who showed

evidence that natural selection acts on synonymous codon

usage to enhance the accuracy of protein synthesis in

Drosophila, based on association between synonymous

codon usage and amino acid constraint. In the study of

Akashi and Gojobori (2002), a negative correlation between

metabolic costs of amino acids and codon bias was shown.

This relation seems to contradict the result presented here.

However, apart from the fact that their study deals with

bacteria, and that both selective forces and regulation

mechanisms may be different in higher organisms, there

are other factors that may explain the difference: as was

mentioned above, the metabolic cost calculation does not

take into account the size and the complexity of amino

acids, and besides, in other organisms other factors such as

biosynthetic pathways and dietary conditions may contrib-

ute differently to the amino acid composition. In addition,

the MCU measure in that study differs from the FOP′ measure defined here; in particular, the former is not

corrected for background composition or for amino acid

composition, as the FOP′ is.

Two points of caution should be emphasized here. First,

as done in previous studies (Duret, 2000; Comeron, 2004)

we assumed a correspondence between tRNA gene copy

numbers and tRNA cellular abundance. As far as we know,

this relation, although demonstrated for several organisms, has not

been substantiated for humans. As noted above, the results of

this study may suggest this relation, since without such

correspondence one could not expect the correlations,

observed here, between gene copy numbers and amino acid

and codon frequencies. The fact that, in 14 out of 18 amino

acids, the codons with the highest tRNA gene copy numbers

also exhibit an increase in their frequency when comparing

between lowly and highly expressed genes (Comeron, 2004),

supports this assumption. However, this is not a proof of such

a relation. Thus, some of the results suggested in the above

discussion, concerning the action of selection on codon bias

for translation efficiency, rely partly on an assumption, that,

although reasonable, has not been firmly substantiated yet.

The second point is that the correlations between FOP′ and the different expression level measures are very weak.

FOP′ was calculated by controlling for the effect of amino acid

composition and that of background nucleotide content. It is

not inconceivable that a third effect may account for

the observed correlations.

In summary, based on the evidence presented, we suggest

three possible ways in which selection may act on codon

bias in the human genome: (1) Increasing translation


efficiency in highly expressed genes; (2) regulating trans-

lation efficiency of some proteins that can be a disadvantage

at high levels; and (3) improving translation efficiency and

reducing the rate of amino acid misincorporation in the

production of biosynthetically expensive proteins.

Acknowledgements

We thank Edward Trifonov, Laurent Duret and Giuseppe

D’Onofrio for valuable discussions that contributed to this

manuscript. We also thank Yefim Yakir for preparing part of

the figures and for technical support and Nurit Carmi for

statistical consultation.

References

Akashi, H., 1994. Synonymous codon usage in Drosophila melanogaster:

natural selection and translational accuracy. Genetics 136 (3), 927–935.

Akashi, H., 1995. Inferring weak selection from patterns of polymorphism

and divergence at 'silent' sites in Drosophila DNA. Genetics 139,

1067–1076.

Akashi, H., Gojobori, T., 2002. Metabolic efficiency and amino acid

composition in the proteomes of Escherichia coli and Bacillus subtilis.

Proc. Natl. Acad. Sci. U. S. A. 99 (6), 3695–3700.

Bernardi, G., et al., 1985. The mosaic genome of warm-blooded vertebrates.

Science 228 (4702), 953–958.

Comeron, J.M., 2004. Selective and mutational patterns associated with

gene expression in humans: influences on synonymous composition and

intron presence. Genetics 167, 1293–1304.

Dong, H., Nilsson, L., Kurland, C.G., 1996. Co-variation of tRNA

abundance and codon usage in Escherichia coli at different growth

rates. J. Mol. Biol. 260, 649–663.

Dufton, M.J., 1997. Genetic code synonym quotas and amino acid

complexity: cutting the cost of proteins? J. Theor. Biol. 187, 165–173.

Duret, L., 2000. tRNA gene number and codon usage in the C. elegans

genome are co-adapted for optimal translation of highly expressed

genes. Trends Genet. 16 (7), 287–289.

Duret, L., 2002. Evolution of synonymous codon usage in metazoans. Curr.

Opin. Genet. Dev. 12, 640–649.

Duret, L., Mouchiroud, D., 1999. Expression pattern and, surprisingly, gene

length shape codon usage in Caenorhabditis, Drosophila, and

Arabidopsis. Proc. Natl. Acad. Sci. U. S. A. 96 (8), 4482–4487.

Fiers, W., Grosjean, H., 1979. On codon usage. Nature 277 (5694), 328.

Grantham, R., Gautier, C., Gouy, M., Mercier, R., Pave, A., 1980. Codon

catalog usage and the genome hypothesis. Nucleic Acids Res. 8, r49–r62.

Graur, D., Li, W.-H., 2000. Fundamentals of Molecular Evolution, 2nd ed.

Sinauer, Sunderland, Mass.

Hey, J., Kliman, R.M., 2002. Interactions between natural selection,

recombination and gene density in the genes of Drosophila. Genetics

160, 595–608.

Ikemura, T., 1981. Correlation between the abundance of Escherichia coli

transfer RNAs and the occurrence of the respective codons in its protein

genes. J. Mol. Biol. 146 (1), 1–21.

Ikemura, T., 1982. Correlation between the abundance of yeast transfer

RNAs and the occurrence of the respective codons in protein genes.

Differences in synonymous codon choice patterns of yeast and

Escherichia coli with reference to the abundance of isoaccepting

transfer RNAs. J. Mol. Biol. 158 (4), 573–597.

Kanaya, S., Yamada, Y., Kudo, Y., Ikemura, T., 1999. Studies of codon

usage and tRNA genes of 18 unicellular organisms and quantification of

Bacillus subtilis tRNAs: gene expression level and species-specific

diversity of codon usage based on multivariate analysis. Gene 238,

143–155.

Kanaya, S., Yamada, Y., Kinouchi, M., Kudo, Y., Ikemura, T., 2001. Codon

usage and tRNA genes in eukaryotes: correlation of codon usage

diversity with translation efficiency and with CG-dinucleotide usage as

assessed by multivariate analysis. J. Mol. Evol. 53, 290–298.

Karlin, S., Mrazek, J., 1996. What drives codon choices in human genes?

J. Mol. Biol. 262, 459–472.

Karlin, S., Mrazek, J., Campbell, A.M., 1998. Codon usages in different

gene classes of the Escherichia coli genome. Mol. Microbiol. 29 (6),

1341–1355.

Knight, R.D., Freeland, S.J., Landweber, L.F., 2001. A simple model based

on mutation and selection explains trends in codon and amino-acid

usage and GC composition within and across genomes. Genome Biol. 2

(4) (http://www.genomebiology.com/2001/2/4/research/010.1).

Konigsberg, W., Godson, N., 1983. Evidence for use of rare codons in the

dnaG gene and other regulatory genes of Escherichia coli. Proc. Natl.

Acad. Sci. U. S. A. 80 (3), 687–691.

Kreitman, M., Comeron, J.M., 1999. Coding sequence evolution. Curr.

Opin. Genet. Dev. 9 (6), 637–641.

Lander, E.S., et al. International Human Genome Sequencing Consortium,

2001. Initial sequencing and analysis of the human genome. Nature

409, 860–921.

Marais, G., Mouchiroud, D., Duret, L., 2001. Does recombination improve

selection on codon usage? Lessons from nematode and fly complete

genomes. Proc. Natl. Acad. Sci. U. S. A. 98 (10), 5688–5692.

Moriyama, E.N., 2003. Codon usage, Encyclopedia of the human genome.

Macmillan Publishers, Nature Publishing Group. http://www.ehgonline.net.

Moriyama, E.N., Powell, J.R., 1997. Codon usage bias and tRNA

abundance in Drosophila. J. Mol. Evol. 45, 514–523.

Novembre, J.A., 2002. Accounting for background nucleotide composition

when measuring codon usage bias. Mol. Biol. Evol. 19 (8), 1390–1394.

Pedersen, S., 1984. Escherichia coli ribosomes translate in vivo with

variable rate. EMBO J. 3 (12), 2895–2898.

Percudani, R., Pavesi, A., Ottonello, S., 1997. Transfer RNA gene

redundancy and translational selection in Saccharomyces cerevisiae.

J. Mol. Biol. 268, 322–330.

Robinson, M., et al., 1984. Codon usage can affect efficiency of translation

of genes in Escherichia coli. Nucleic Acids Res. 12 (17), 6663–6671.

Saier, M.J., 1995. Differential codon usage: a safe guard against

inappropriate gene expression of specialized genes. FEBS Lett. 362, 1–4.

Sharp, P.M., Li, W.-H., 1986. Codon usage in regulatory genes in

Escherichia coli does not reflect selection for rare codons. Nucleic

Acids Res. 14, 7737–7749.

Sharp, P.M., Tuohy, T.M., Mosurski, K.R., 1986. Codon usage in yeast:

cluster analysis clearly differentiates highly and lowly expressed genes.

Nucleic Acids Res. 14, 5125–5143.

Urrutia, A.O., Hurst, L.D., 2001. Codon usage bias covaries with

expression breadth and the rate of synonymous evolution in humans,

but this is not evidence for selection. Genetics 159, 1191–1199.

Urrutia, A.O., Hurst, L.D., 2003. The signature of selection mediated by

expression on human genes. Genome Res. 13 (10), 2260–2264.

Versteeg, R., et al., 2003. The human transcriptome map reveals extremes

in gene density, intron length, GC content, and repeat pattern for

domains of highly and weakly expressed genes. Genome Res. 13 (9),

1998–2004.

Wells, K.D., Foster, J.A., Moore, K., Pursel, V.G., Wall, R.J., 1997. Codon

optimization, genetic insulation, and an rtTA reporter improve perform-

ance of the tetracycline switch. Transgenic Res. 8 (5), 371–381.

Wright, F., 1990. The 'effective number of codons' used in a gene. Gene 87, 23–29.

Zhang, S., Zubay, G., Goldman, E., 1991. Low usage codons in Escherichia

coli, yeast, fruit fly, and primates. Gene 105, 61–72.

Zhou, J., Liu, W-J., Peng, S.W., Sun, X.Y., Frazer, I., 1999. Papillomavirus

capsid protein expression level depends on the match between codon

usage and tRNA availability. J. Virol. 73, 4972–4982.
