www.elsevier.com/locate/gene
Gene 345 (2005) 127–138
Codon bias as a factor in regulating expression via translation rate
in the human genome
Yizhar Lavner, Daniel Kotlar*
Department of Computer Science, Tel Hai Academic College, Upper Galilee 12210, Israel
Received 13 September 2004; received in revised form 10 November 2004; accepted 11 November 2004
Available online 24 December 2004
Received by H.E. Roman
Abstract
We study the interrelations between tRNA gene copy numbers, gene expression levels and measures of codon bias in the human genome.
First, we show that isoaccepting tRNA gene copy numbers correlate positively with expression-weighted frequencies of amino acids and
codons. Using expression data of more than 14,000 human genes, we show a weak positive correlation between gene expression level and
frequency of optimal codons (codons with highest tRNA gene copy number). Interestingly, contrary to non-mammalian eukaryotes, codon bias
tends to be high in both highly expressed genes and lowly expressed genes. We suggest that selection may act on codon bias, not only to increase
elongation rate by favoring optimal codons in highly expressed genes, but also to reduce elongation rate by favoring non-optimal codons in
lowly expressed genes. We also show that the frequency of optimal codons is in positive correlation with estimates of protein biosynthetic cost,
and suggest another possible action of selection on codon bias: preference of optimal codons as production cost rises, to reduce the rate of amino
acid misincorporation. In the analyses of this work, we introduce a new measure of frequency of optimal codons (FOP′), which is unaffected by
amino acid composition and is corrected for background nucleotide content; we also introduce a new method for computing expected codon
frequencies, based on the dinucleotide composition of the introns and the non-coding regions surrounding a gene.
© 2004 Elsevier B.V. All rights reserved.
Keywords: Homo sapiens; Codon bias; Gene expression; Translation efficiency; Optimal codon; Biosynthetic cost
Abbreviations: CB, codon bias; ENC, effective number of codons; FOP,
frequency of optimal codons; MCB, maximum likelihood codon bias.
* Corresponding author. Tel.: +972 4 6952965; fax: +972 4 6952899.
E-mail address: [email protected] (D. Kotlar).
0378-1119/$ - see front matter © 2004 Elsevier B.V. All rights reserved.
doi:10.1016/j.gene.2004.11.035
1. Introduction
Codon bias, the unequal use of synonymous codons for
encoding amino acids (Grantham et al., 1980; Moriyama,
2003), has been found in many organisms, both prokaryotes
and eukaryotes. This bias varies considerably among
organisms and even within the genes of the same organism.
The bias was found to be in relation with many genomic
factors, such as gene length, GC-content, recombination rate,
gene expression level, and density of genes (Duret and
Mouchiroud, 1999; Kreitman and Comeron, 1999; Duret,
2000; Marais et al., 2001; Urrutia and Hurst, 2001, 2003;
Hey and Kliman, 2002; Versteeg et al., 2003), or with other
regularities in the genetic code (Karlin and Mrazek, 1996). In
different species, codon bias was found to be in weak
correlation with gene expression level (Ikemura, 1981; Sharp
et al., 1986; Duret and Mouchiroud, 1999; Urrutia and Hurst,
2003). Two main processes were proposed to explain codon
bias: natural selection acting on silent changes in DNA,
mutational bias, or both. In unicellular organisms, such as E.
coli and S. cerevisiae, it was found that the codons translated
by the most abundant tRNA are the most frequently used
(Ikemura, 1981, 1982). In multicellular organisms, such as
C. elegans (Duret, 2000) and Drosophila (Akashi, 1995;
Moriyama and Powel, 1997), similar results were reported,
namely, that codon bias favoring codons with high tRNA
gene copy number rises with expression level, thus supporting
the action of selection on codon bias to improve
translation efficiency. This idea has not been confirmed in
mammals (Kanaya et al., 2001). Although a weak correlation
between gene expression level and codon bias has been
observed in the human genome (Urrutia and Hurst, 2003),
this relation has not been linked to tRNA abundance.
Recently, Comeron (2004) showed that in the human
genome, for the majority of amino acids with degeneracy
greater than one, the codons with the highest tRNA
gene copy numbers also increase in frequency in
highly expressed genes compared to lowly expressed genes.
In this study, we introduce new methods for computing
the frequency of optimal codons (FOP) and for correcting
codon bias for background nucleotide content. Using these
methods, we show evidence indicating that the human
genome translation efficiency, as estimated using tRNA
gene copy numbers, is in weak positive correlation with
expression level, and that codon bias has a role in this
relation, although not the simple role it has in the model
described above: on the one hand, we found that codon bias
favors codons with high tRNA gene copy number in highly
expressed genes, and on the other hand, based on the
evidence presented here, we suggest that codon bias may act
as a gene expression regulator by favoring codons with low
tRNA gene copy numbers in lowly expressed genes. This
supports a mechanism proposed by Fiers and Grosjean
(1979) and supported by Konigsberg and Godson (1983) for
rare codons in regulatory genes of E. coli. Zhang et al.
(1991) also proposed this regulatory mechanism for several
organisms, including primates. In addition, we present
evidence that selection might act on codon bias to prefer
optimal codons, possibly to reduce the rate of amino acid
misincorporation as protein production cost rises.
2. Materials and methods
2.1. Frequency weighted by expression
The count $c_a$ of each amino acid $a$ is calculated as
follows:
\[ c_a = \sum_{g} c_a(g)\, E(g) \]
where $c_a(g)$ is the count of $a$ in the gene $g$, $E(g)$ is the
expression level of $g$ (average of expression; see below),
and the sum is taken over all the relevant genes (either the
highly expressed genes or all expressed genes). The
expression-weighted frequency $f_a^{ex}$ of the amino acid $a$ is
given by
\[ f_a^{ex} = \frac{c_a}{\sum_{a} c_a} \tag{1} \]
where the sum in the denominator is over all the amino
acids. This calculation is similar to the one performed by
Duret (2000) for C. elegans. In a similar manner, we
compute the expression-weighted frequency of a codon.
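The calculation in Eq. (1) can be illustrated with a short Python sketch. It is not the authors' code: the function name and the input layout (a list of protein sequence and expression level pairs) are assumptions made for illustration.

```python
from collections import Counter


def expression_weighted_frequencies(genes):
    """Expression-weighted amino acid frequencies, following Eq. (1).

    `genes` is an iterable of (protein_sequence, expression_level) pairs;
    this input layout is a hypothetical choice, not taken from the paper.
    """
    weighted = Counter()
    for protein, expression in genes:
        # c_a(g) * E(g): residue counts in the gene, scaled by its expression
        for residue, count in Counter(protein).items():
            weighted[residue] += count * expression
    total = sum(weighted.values())
    # Normalize so that the frequencies sum to one over all amino acids
    return {residue: c / total for residue, c in weighted.items()}


if __name__ == "__main__":
    toy = [("AAG", 2.0), ("GG", 1.0)]
    print(expression_weighted_frequencies(toy))
```

The same function applies to codons if each gene is passed as a list of codon strings instead of a protein string.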
2.2. Estimating translation efficiency
2.2.1. Gene copy numbers data
Gene copy number data was taken from Lander et al.
(2001) and from the tRNA-scan site
(http://www.rna.wustl.edu/GtRDB/Hs/Hs-summary.html). In these data,
pseudogenes have already been removed. We use tRNA gene copy
numbers as an assumed estimate of cellular tRNA abundance
(see the explanation at the beginning of the Results
section).
2.2.2. Frequency of optimal codons (FOP)
The optimal codon of an amino acid is defined here as
the codon with the highest number of tRNA genes for its
anticodon, among its synonymous codons. The simplest
way to compute the frequency of optimal codons (FOP)
of a gene is to count the number of appearances of
optimal codons in the gene, and divide it by the total
number of codons in the gene (excluding the stop
codons):
\[ \mathrm{FOP}_s(g) = \frac{1}{N} \sum_{i} n_i(g) \tag{2} \]
where $n_i(g)$ is the count of the codon $i$ in the gene $g$, $N$
is the total number of codons in $g$, and the sum is taken
over all the optimal codons. The subscript $s$ stands for
"simple". This FOP measure is affected by amino acid
usage. If synonymous codon usage is random, a gene
composed only of amino acids of degeneracy two would
have FOP of 0.5, whereas a gene composed of amino
acids of degeneracy four would have FOP of 0.25. In
order to obtain a measure which is independent of amino
acid composition, we multiply each codon count in Eq.
(2) by the corresponding amino acid degeneracy:
\[ \mathrm{FOP}(g) = \frac{1}{N} \sum_{i} \mathrm{syn}(i)\, n_i(g). \tag{3} \]
Here, $\mathrm{syn}(i)$ is the degeneracy of the amino acid coded by
$i$. This way a gene with close to random synonymous codon
usage will have FOP value close to 1, regardless of its
amino acid composition. To see that this is a sensible
measure, we write Eq. (3) in a slightly different way:
\[ \mathrm{FOP}(g) = \sum_{i} \frac{n_{aa(i)}(g)}{N} \cdot \frac{n_i(g)/n_{aa(i)}(g)}{1/\mathrm{syn}(i)} \tag{4} \]
where $n_{aa(i)}(g)$ is the count of the amino acid coded by $i$ in
$g$. Assigning $f_i(g) = n_i(g)/n_{aa(i)}(g)$ and $f_{aa(i)}(g) = n_{aa(i)}(g)/N$,
we have:
\[ \mathrm{FOP}(g) = \sum_{i} f_{aa(i)}(g)\, \frac{f_i(g)}{1/\mathrm{syn}(i)} \tag{5} \]
Now, the second multiplier is just the relative synonymous
codon usage, or RSCU, of the codon $i$ in the gene $g$
(Sharp et al., 1986). Hence, the FOP measure is a weighted
mean of the RSCUs of the optimal codons, where the
weights are the corresponding amino acid frequencies:
\[ \mathrm{FOP}(g) = \sum_{i} f_{aa(i)}(g)\, \mathrm{RSCU}_i(g) \tag{6} \]
This measure does not take into account the background
nucleotide composition. In order to correct FOP for background
nucleotide content, we replace $1/\mathrm{syn}(i)$ in Eq. (5) by
$E_i^{nc}(g)$, the expected proportion of the codon $i$ among its
synonyms, based on the non-coding region surrounding the
gene (see below for the way to compute $E_i^{nc}(g)$). Taking
$\mathrm{RSCU}'_i(g) = f_i(g)/E_i^{nc}(g)$ in Eq. (6), we get
\[ \mathrm{FOP}'(g) = \sum_{i} f_{aa(i)}(g)\, \mathrm{RSCU}'_i(g) \tag{7} \]
Replacing $\mathrm{syn}(i)$ in Eq. (3) with $1/E_i^{nc}(g)$, we get a simpler
way to compute $\mathrm{FOP}'(g)$:
\[ \mathrm{FOP}'(g) = \frac{1}{N} \sum_{i} \frac{n_i(g)}{E_i^{nc}(g)} \tag{8} \]
where the sum is taken over all optimal codons for which
$E_i^{nc}(g) \neq 0$.
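The three variants of the measure in Eqs. (2), (3) and (8) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the dictionary-based inputs, and the convention that stop codons have already been removed from the counts are all hypothetical.

```python
def fop_simple(counts, optimal):
    """Eq. (2): fraction of the codons in a gene that are optimal codons."""
    n_total = sum(counts.values())
    return sum(counts.get(c, 0) for c in optimal) / n_total


def fop(counts, optimal, syn):
    """Eq. (3): each optimal codon count is weighted by the degeneracy syn(i)."""
    n_total = sum(counts.values())
    return sum(syn[c] * counts.get(c, 0) for c in optimal) / n_total


def fop_prime(counts, optimal, expected):
    """Eq. (8): counts divided by the expected proportion among synonyms.

    Optimal codons whose expected proportion is zero are skipped.
    """
    n_total = sum(counts.values())
    return sum(counts.get(c, 0) / expected[c]
               for c in optimal if expected.get(c, 0) != 0) / n_total


if __name__ == "__main__":
    # Toy gene made only of twofold amino acids with even synonymous usage
    counts = {"AAC": 5, "AAT": 5, "GAC": 5, "GAT": 5}
    optimal = {"AAC", "GAC"}
    syn = {"AAC": 2, "AAT": 2, "GAC": 2, "GAT": 2}
    expected = {c: 0.5 for c in counts}
    print(fop_simple(counts, optimal), fop(counts, optimal, syn),
          fop_prime(counts, optimal, expected))
```

For this toy gene the simple measure gives 0.5, while the degeneracy-weighted and background-corrected measures both give 1, which matches the behavior described in the text for random synonymous codon usage.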
2.3. Computation of gene expression levels
Expression levels for individual genes were taken from
SAGE (http://www.cgap.nci.nih.gov/SAGE/SALL, version
of July 21, 2003). Only tags that matched a named gene
were taken into account. Expression values were calculated
by counting the tags for each gene in each library,
normalized per 200,000 tags, and combined over 43
libraries representing 18 normal tissues: brain (7 libraries,
311,726 tags), breast (7 libraries, 310,477 tags), colon (2
libraries, 76,954 tags), heart (1 library, 71,926 tags), kidney
(1 library, 30,721 tags), liver (1 library, 58,467 tags), lung (1
library, 77,024 tags), muscle (2 libraries, 88,332 tags), ovary
(2 libraries, 81,270 tags), pancreas (2 libraries, 54,673 tags),
peritoneum (1 library, 53,527 tags), placenta (2 libraries,
207,348 tags), prostate (4 libraries, 232,573 tags), retina (4
libraries, 239,211 tags), spinal cord (1 library, 45,109 tags),
stomach (1 library, 18,193 tags), vascular (2 libraries,
91,131 tags), white blood cells (2 libraries, 67,177 tags).
We combined the expression levels in the libraries in
three ways: (a) breadth of expression, defined here as the
number of libraries in which the gene was expressed; (b)
average over the libraries; and (c) maximum over the
libraries. The correlation values among the three methods
are listed in Table 1. All correlations are highly significant
with $p < 10^{-100}$. Average is the method that correlates the
best with the two other methods.
Table 1
Correlation coefficients between different methods of combining values in
SAGE libraries
           Breadth   Average   Maximum
Breadth    1         0.904     0.734
Average              1         0.913
Maximum                        1
[Our definition of Breadth is different from the usual
definition (Urrutia and Hurst, 2001), which is the number of
tissues in which a gene is expressed. However, calculating
breadth in both ways yielded two sets of values which are
highly correlated (R>0.96). As for Average expression, it
may seem more accurate to average first among the libraries
of each tissue and then to average over the tissues, as in
Urrutia and Hurst (2001); we observed that the values
computed this way and those computed by simply averaging
over all available libraries are highly correlated (R>0.98).]
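The three ways of combining expression values over libraries can be sketched as follows. This is an illustrative Python sketch only: the function names and argument layout are assumptions, and a gene is taken to be "expressed" in a library when its tag count is greater than zero.

```python
def normalize(tag_count, library_size, per=200_000):
    """Scale a raw tag count to tags per 200,000, as described in the text."""
    return tag_count * per / library_size


def combine_expression(tag_counts, library_sizes):
    """Combine one gene's expression over libraries in the three ways used.

    `tag_counts[k]` is the gene's tag count in library k and
    `library_sizes[k]` is the total number of tags in that library
    (hypothetical input layout).
    """
    levels = [normalize(c, s) for c, s in zip(tag_counts, library_sizes)]
    return {
        # (a) breadth: number of libraries in which the gene is expressed
        "breadth": sum(1 for v in levels if v > 0),
        # (b) average over the libraries
        "average": sum(levels) / len(levels),
        # (c) maximum over the libraries
        "maximum": max(levels),
    }


if __name__ == "__main__":
    print(combine_expression([2, 0, 1], [100_000, 200_000, 50_000]))
```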
2.4. Codon bias corrected for background nucleotide
content
We used four methods to compute codon bias, corrected
for background nucleotide content:
1. Effective number of codons corrected for background
nucleotide content, or ENC′ (Wright, 1990; Novembre,
2002).
2. B measure (Karlin et al., 1998), which is applied as
follows:
\[ B(g) = \sum_{i} f_{aa(i)}(g)\, \left| f_i(g) - E_i^{nc}(g) \right| \tag{9} \]
where $f_i(g)$ is the proportion of the codon $i$ in the gene
$g$ among its synonymous codons; $E_i^{nc}(g)$ is the expected
proportion of $i$ in $g$ (see below); $f_{aa(i)}(g)$ is the
frequency of the amino acid coded by $i$ in $g$; and the
sum is over all codons.
3. HK measure: we first compute the uncorrected codon bias
(Karlin and Mrazek, 1996):
\[ CB(g) = \sum_{i} f_{aa(i)}(g)\, \left| f_i(g) - \frac{1}{\mathrm{syn}(i)} \right| \tag{10} \]
where $\mathrm{syn}(i)$ is the degeneracy of the amino acid coded
by $i$ (the number of synonymous codons for $i$). We then
compute the regression line of $CB(g)$ versus non-coding
GC-content, from the non-coding regions
surrounding $g$, and subtract the regression line from
the CB measure (as done by Hey and Kliman, 2002;
thus we shall denote this method HK). This is based on
the known observation that codon bias is positively
correlated with both non-coding GC-content and
expression level in some eukaryotic genomes, including
the human genome (see the next subsection).
4. Maximum-likelihood codon bias, or MCB (Urrutia and
Hurst, 2001).
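Eqs. (9) and (10) share one form: a sum, over all codons, of the amino acid frequency times the absolute deviation of the observed codon proportion from a reference proportion. The Python sketch below makes this explicit. It is an illustration under assumptions: the function name and the `family` and `reference` dictionaries are hypothetical, and N is taken as the total of the supplied codon counts.

```python
def codon_bias(counts, family, reference):
    """Shared form of Eq. (9) (B) and Eq. (10) (CB).

    counts    : codon -> count in the gene
    family    : codon -> identifier of its synonymous family (every codon
                of each family must be listed, including unused ones)
    reference : codon -> reference proportion among synonyms; pass the
                expected background proportion for B, or 1/syn(i) for CB
    """
    n_total = sum(counts.values())
    family_totals = {}
    for codon, n in counts.items():
        family_totals[family[codon]] = family_totals.get(family[codon], 0) + n
    bias = 0.0
    for codon in family:  # the sum runs over all codons
        fam_n = family_totals.get(family[codon], 0)
        if fam_n == 0:
            continue  # amino acid absent from the gene contributes nothing
        f_aa = fam_n / n_total               # amino acid frequency
        f_i = counts.get(codon, 0) / fam_n   # proportion among synonyms
        bias += f_aa * abs(f_i - reference[codon])
    return bias


if __name__ == "__main__":
    family = {"AAC": "Asn", "AAT": "Asn", "GAC": "Asp", "GAT": "Asp"}
    uniform = {c: 0.5 for c in family}  # 1/syn(i) for twofold amino acids
    print(codon_bias({"AAC": 10, "GAC": 5, "GAT": 5}, family, uniform))
```

In the toy gene, Asn uses a single codon while Asp uses both codons evenly, so only Asn contributes to the bias.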
2.5. Computing expected values
For the first three of the four methods described above,
we need non-coding sequences neighboring a given gene
(the fourth method, MCB, uses the coding sequence itself).
We used the sequence consisting of the introns of the gene,
the 1000 nucleotides immediately preceding the coding area
of the gene, and similarly, the 1000 nucleotides immediately
succeeding it (truncated, as necessary, where
genes were less than 1000 nucleotides apart; see also
Hey and Kliman, 2002 and Urrutia and Hurst, 2003). If an
intron is longer than 2000 bp, only the 1000 nucleotides on
each of the intron's ends were taken. By taking 1000
flanking bases, we ensure that regions that may be under
selective constraints, both in flanking regions and introns,
constitute only a small portion of the strands that are used as
control. On the other hand, regions of large introns that are
far from any coding sequence may not represent the
mutational bias that acts on the nearby exons, and thus
introns were truncated to 1000 bases on each end. We
masked repetitive elements using RepeatMasker
(http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker).
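The selection of background sequence described above can be sketched in Python as follows. The function names are hypothetical, and returning the segments as a list (rather than one concatenated string) is an assumption made here so that no artificial triplets are created across segment junctions; the text does not state how the segments were joined. Repeat masking and the handling of closely spaced genes are left out.

```python
def truncate_intron(intron, end=1000):
    """Keep only the first and last `end` bases of introns longer than 2*end."""
    if len(intron) > 2 * end:
        return intron[:end] + intron[-end:]
    return intron


def noncoding_background(upstream, introns, downstream, flank=1000):
    """Collect the non-coding segments used as background for one gene.

    `upstream` and `downstream` are the sequences before and after the
    coding area; at most `flank` bases adjacent to the gene are kept.
    """
    segments = [upstream[-flank:]]
    segments += [truncate_intron(i, flank) for i in introns]
    segments.append(downstream[:flank])
    return [s for s in segments if s]  # drop empty segments


if __name__ == "__main__":
    parts = noncoding_background("A" * 1200, ["C" * 2500, "G" * 300], "T" * 800)
    print([len(p) for p in parts])
```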
In warm-blooded vertebrates, including human, it is well
known that "isochores" structure (Bernardi et al., 1985)
correlates with both expression level and codon bias
(Versteeg et al., 2003; Knight et al., 2001). Although
correction of codon bias for the influence of isochore
structure is essential, it is not enough. It has been observed
that the G and C compositions (and similarly the A and T
compositions), in both coding and non-coding sequences, in
the human genome, are not equal, and the differences
(termed GC-skew and AT-skew, respectively) correlate with
expression level (Duret, 2002). Thus, the correction of codon
bias should consider the base composition of the background
in a more differential manner. Here we introduce two
methods for computing $E_i^{nc}(g)$ (see Eqs. (7)–(9)).
We treated the amino acids with six synonymous codons
as two independent amino acids each, one with four codons,
and one with two (as in Urrutia and Hurst, 2001), so that any
two synonymous codons differ only in the third codon
position.
The first and simpler method applies the base proportions
in the non-coding surrounding of the gene to the third codon
position. For example, if the base A appears 21% of the
times in the non-coding surrounding of a given gene, then in
a fourfold amino acid, the codon ending with an A will have
an $E_i^{nc}(g)$ value of 0.21.
However, since it is known that there are dinucleotides
that are in excess or are avoided in the genome, for example,
the dinucleotide CG is depleted by its tendency to mutate
and to disappear from genomic sequences (Graur and Li,
2000), it may not be enough to consider single nucleotide
frequencies for correction. The second method incorporates
the dinucleotide composition of the non-coding surrounding
of the gene in the following manner:
We count the number of appearances of each triplet in all
three reading frames of the non-coding surrounding of a
gene. For each triplet XYZ, we denote this number by #XYZ.
For a codon XYZ, we denote by S(XYZ) the set of bases that
when replacing Z would yield a synonymous codon
(including the base Z itself). For example, S(GCA)=
{A,C,G,T}, S(AGC)={C,T}, and S(ATT)={A,C,T} (recall
that the sixfold amino acids are split into a twofold amino
acid and fourfold one). It is clear that the frequency of a
codon XYZ among its synonyms in a gene could be affected
by the background frequencies of both the YZ dinucleotide
and the ZN dinucleotide, where N is any of the four bases.
Calculating the expected relative frequencies of XYZ at the
non-coding surrounding is done as follows:
Let
\[ w^{N}_{XYZ} = \frac{\#YZN}{\sum_{K \in S(XYZ)} \#YKN}, \qquad N \in \{A, T, C, G\} \]
Then
\[ E^{nc}_{XYZ}(g) = f^{A}_{aa(XYZ)}\, w^{A}_{XYZ} + f^{C}_{aa(XYZ)}\, w^{C}_{XYZ} + f^{G}_{aa(XYZ)}\, w^{G}_{XYZ} + f^{T}_{aa(XYZ)}\, w^{T}_{XYZ} \tag{11} \]
where $aa(XYZ)$ is the amino acid coded by XYZ and
$f^{N}_{aa(XYZ)}$ is the number of triplets that code for $aa(XYZ)$ and
are followed by the base N, divided by the total number of
triplets that code for $aa(XYZ)$. All triplets are counted in the
non-coding surrounding of g in all three reading frames.
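The dinucleotide-based computation of Eq. (11) can be sketched in Python as follows. This is an illustration, not the authors' program. The function names are hypothetical; the standard genetic code is assumed; synonymous families are defined by a shared first two bases, which reproduces the split of sixfold amino acids described above; and only triplets that have a following base are used when computing the $f^N$ terms (an assumption, so that they sum to one).

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...)
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def syn_third_bases(codon):
    """S(XYZ): third-position bases that give a synonymous codon."""
    return [k for k in BASES if CODE[codon[:2] + k] == CODE[codon]]


def expected_proportion(codon, noncoding):
    """Expected proportion of `codon` among its synonyms, following Eq. (11)."""
    s = noncoding.upper()
    # All overlapping triplets, i.e. triplets in all three reading frames
    tri = Counter(s[i:i + 3] for i in range(len(s) - 2))
    # Triplets together with the base that follows them
    tri_next = Counter((s[i:i + 3], s[i + 3]) for i in range(len(s) - 3))
    third = syn_third_bases(codon)
    fam = [codon[:2] + k for k in third]
    fam_total = sum(tri_next[(t, n)] for t in fam for n in BASES)
    if fam_total == 0:
        return 0.0
    y, z = codon[1], codon[2]
    e = 0.0
    for n in BASES:
        denom = sum(tri[y + k + n] for k in third)
        if denom == 0:
            continue
        w = tri[y + z + n] / denom                         # w^N_XYZ
        f = sum(tri_next[(t, n)] for t in fam) / fam_total  # f^N_aa(XYZ)
        e += f * w
    return e


if __name__ == "__main__":
    background = "AAGCATTGCCGAGCGTTGCTAA"  # toy sequence, for illustration only
    for cod in ("GCA", "GCC", "GCG", "GCT"):
        print(cod, expected_proportion(cod, background))
```

With this construction the expected proportions of the codons in one synonymous family sum to one whenever the family occurs in the background with a following base.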
We calculated FOP′ and B using both methods. The B
(Eq. (9)) values for all genes in the study, with $E_i^{nc}$
calculated by both methods, are highly correlated (R=0.96),
and similarly for FOP′ in Eqs. (7) and (8) (R=0.93). Thus,
the results involving these measures, with $E_i^{nc}(g)$ calculated
in both ways, are almost identical. In this paper, we included
only the results where $E_i^{nc}(g)$ was computed by the second
method, which seems more accurate, as it considers
dinucleotide frequencies.
2.6. Protein biosynthetic cost measures
We used the size/complexity score of Dufton (1997) for
amino acids (Table 2). To evaluate the biosynthetic cost of a
protein, encoded by a given gene, we used two measures:
2.6.1. Average size/complexity score
Each codon was given the score of the amino acid it
encodes. The size complexity score was averaged over the
codons of a gene.
2.6.2. Frequency of expensive amino acids
This is the relative frequency of amino acids with a size/
complexity score greater than 40 (Arg, Cys, His, Phe, and
Tyr). We excluded the single-codon amino acids Met and
Trp, since they do not contribute to FOP′ or to the codon bias.
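The two cost measures can be sketched in Python using the size/complexity scores listed in Table 2. The function names, the one-letter residue keys, and the choice of the full protein length as the denominator of the second measure are assumptions made for illustration.

```python
# Size/complexity scores from Table 2 (Dufton, 1997), keyed by one-letter code
SCORE = {
    "A": 4.76, "R": 56.36, "N": 33.72, "D": 32.72, "C": 57.16,
    "Q": 37.48, "E": 36.48, "G": 1.0, "H": 58.7, "I": 16.04,
    "L": 16.04, "K": 30.14, "M": 64.68, "F": 44.0, "P": 31.8,
    "S": 17.86, "T": 21.62, "W": 73.0, "Y": 57.0, "V": 12.28,
}
# Amino acids with score > 40, excluding the single-codon Met and Trp
EXPENSIVE = {aa for aa, s in SCORE.items() if s > 40 and aa not in "MW"}


def average_cost(protein):
    """Average size/complexity score over the residues of a protein."""
    return sum(SCORE[aa] for aa in protein) / len(protein)


def expensive_fraction(protein):
    """Relative frequency of expensive amino acids in a protein."""
    return sum(1 for aa in protein if aa in EXPENSIVE) / len(protein)


if __name__ == "__main__":
    print(sorted(EXPENSIVE))
    print(average_cost("GGRR"), expensive_fraction("GGRR"))
```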
Table 2
Amino acids in the human genome: Frequency, frequency weighted by expression in highly expressed genes and in all expressed genes, isoaccepting tRNA
gene copy numbers, and size/complexity score (as in Dufton, 1997)
Amino acid   Frequency weighted    Frequency weighted    Isoaccepting tRNA   Dufton's size/
             by expression in      by expression in      gene copy number    complexity score
             highly expressed      all expressed
             genes                 genes
Ala          7.50                  7.41                  40                  4.76
Arg          5.74                  5.73                  30                  56.36
Asn          3.71                  3.69                  34                  33.72
Asp          5.22                  5.16                  10                  32.72
Cys          1.92                  1.97                  30                  57.16
Gln          4.48                  4.53                  32                  37.48
Glu          7.34                  7.32                  22                  36.48
Gly          7.19                  7.07                  24                  1.0
His          2.34                  2.40                  12                  58.7
Ile          4.51                  4.48                  19                  16.04
Leu          9.12                  9.29                  35                  16.04
Lys          6.76                  6.59                  38                  30.14
Met          2.36                  2.32                  17                  64.68
Phe          3.59                  3.60                  14                  44.0
Pro          5.70                  5.80                  25                  31.8
Ser          7.22                  7.41                  26                  17.86
Thr          5.16                  5.17                  25                  21.62
Trp          1.12                  1.14                  7                   73.0
Tyr          2.80                  2.77                  12                  57.0
Val          6.22                  6.18                  44                  12.28
2.7. Sequence data
Gene and intron sequences were downloaded from
NCBI GenBank, Build 33 (ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/).
We included only CDSs that start with a start
codon, end with a stop codon, have a length that is a
multiple of three, and have no unidentified bases. For genes
with more than one CDS, we took the longest CDS. Thus,
between 5% and 10% of the genes were excluded. In
addition, less than half the genes were expressed in the 43
SAGE libraries (see below). After a further removal of
genes, required by the computation of MCB (Urrutia and
Hurst, 2001), 14,131 genes were left for the analysis.
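The CDS inclusion criteria can be expressed as a small Python filter. This is a sketch under assumptions: ATG is taken as the start codon, the standard stop codons are used, and the criteria are applied before the longest CDS of a gene is chosen (the text does not specify the order). The function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_valid_cds(seq):
    """Check the inclusion criteria described in the text for one CDS."""
    s = seq.upper()
    return (
        len(s) >= 6
        and len(s) % 3 == 0         # length is a multiple of three
        and s.startswith("ATG")     # starts with a start codon (assumed ATG)
        and s[-3:] in STOP_CODONS   # ends with a stop codon
        and set(s) <= set("ACGT")   # no unidentified bases
    )


def longest_valid_cds(cds_list):
    """Return the longest CDS of a gene that passes the filter, or None."""
    valid = [c for c in cds_list if is_valid_cds(c)]
    return max(valid, key=len) if valid else None


if __name__ == "__main__":
    print(longest_valid_cds(["ATGAAATAA", "ATGAAACCCTAG", "ATGNNNTAA"]))
```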
3. Results
In some unicellular organisms, a positive relation
between cellular tRNA and tRNA gene copy number was
found (Dong et al., 1996; Percudani et al., 1997; Kanaya
et al., 1999). As in C. elegans and other eukaryotes
(Moriyama and Powel, 1997; Duret, 2000; Kanaya et al.,
2001), in the human genome there is also a redundancy in
the set of tRNA genes (Lander et al., 2001; see also
http://www.rna.wustl.edu/GtRDB/Hs/Hs-summary.html). The
number of tRNA genes varies from 7 (Trp) to 44 (Val).
Although a variation between different tRNA genes could
be in the transcription level, a positive correlation between
intracellular tRNA and tRNA gene copy number is expected
(Duret, 2000; Comeron, 2004). We assume such correlation
and use the number of tRNA genes as a measure for the
amount of intracellular tRNAs. As done by Duret (2000) for
C. elegans, we measured the correlation between the
isoaccepting tRNA genes and the expression-weighted
frequencies of the 20 amino acids, and those of the
individual codons (see Tables 2 and 3 and Materials and
methods). When amino acid frequencies were considered
(Fig. 1a), a significant correlation was found for highly
expressed genes (R=0.585, p=0.007, N=4320). For codon
frequencies (Fig. 1b), excluding two outliers, we obtained a
highly significant positive correlation (R=0.654, p<0.001,
and R=0.56, p<0.001 when including the outliers). Similar
results were obtained when all expressed genes were taken,
and slightly lower correlations were obtained when frequencies
were not weighted (R=0.565, p=0.009 for amino acids;
R=0.62, p<0.001 for codons). These correlations do not
necessarily prove a correlation between cellular tRNA
abundance and the number of tRNA genes, but, as indicated
by Duret (2000), if such a relation were not present, we would
not expect any correlation between the frequencies of amino
acids and codons and the number of tRNA genes.
3.1. The relation between gene copy numbers and gene
expression level
Under the assumption that gene copy numbers can be
used as an indication for cellular tRNA abundance (see also
Duret, 2000; Comeron, 2004), we used frequency of optimal
codons (FOP′, see Materials and methods) as a measure of
translation efficiency. As explained above, this measure is
independent of the amino acid frequencies and is corrected
for background nucleotide content. Here, the term optimal
codon denotes the codon with the highest tRNA gene copy
number, for each amino acid (also termed major codon and
translationally superior codon; see Akashi, 1995). We
calculated the correlation between FOP′ and expression
level in over 14,000 genes across all human chromosomes.
We combined the expression levels in 43 libraries in three
ways: (a) breadth of expression; (b) average over the
libraries; and (c) maximum over the libraries (see Materials
and methods). The correlations are very weak (R=0.075,
0.08, and 0.06, respectively) but highly significant
(p<0.0001). The error-bar graphs in Fig. 2 illustrate the
relation between FOP′ and the average expression (see
Materials and methods). Although the correlations are weak,
the graph indicates that the FOP′ values rise as expression
level rises.

Table 3
Codons in the human genome: tRNA gene copy numbers and frequencies in highly expressed genes and in all expressed genes
Amino acid   Codon    Isoaccepting tRNA   Frequency weighted by   Frequency in all
                      gene copy number    expression in highly    expressed genes
                                          expressed genes
Ala          gca      10                  1.57                    1.63
             gcc^w    0                   3.08                    2.80
             gcg      5                   0.73                    0.74
             gct*     25                  2.12                    1.86
Arg          aga      5                   1.07                    1.20
             agg      4                   1.03                    1.17
             cga      7                   0.64                    0.63
             cgc^w    0                   1.19                    1.05
             cgg      5                   1.20                    1.17
             cgt*     9                   0.60                    0.46
Asn          aac*     33                  2.08                    1.90
             aat^w    1                   1.63                    1.73
Asp          gac*     10                  2.76                    2.55
             gat^w    0                   2.46                    2.26
Cys          tgc*     30                  1.06                    1.22
             tgt^w    0                   0.85                    1.05
Gln          caa      11                  1.02                    1.26
             cag*     21                  3.45                    3.49
Glu          gaa*     14                  2.95                    3.06
             gag      8                   4.39                    4.06
Gly          gga      5                   1.70                    1.66
             ggc*     11                  2.56                    2.23
             ggg      8                   1.54                    1.63
             ggt^w    0                   1.39                    1.07
His          cac*     12                  1.39                    1.51
             cat^w    0                   0.95                    1.10
Ile          ata      5                   0.57                    0.75
             atc^w    1                   2.31                    2.02
             att*     13                  1.62                    1.60
Leu          cta      2                   0.58                    0.71
             ctc^w    0                   1.69                    1.90
             ctg      6                   3.90                    3.95
             ctt*     13                  1.18                    1.33
             tta      8                   0.58                    0.78
             ttg      6                   1.19                    1.30
Lys          aaa      16                  2.56                    2.52
             aag*     22                  4.20                    3.23
Met          atg      17                  2.36                    2.17
Phe          ttc*     14                  2.00                    1.95
             ttt^w    0                   1.59                    1.73
Pro          cca      10                  1.52                    1.72
             ccc^w    0                   1.92                    2.00
             ccg      4                   0.61                    0.71
             cct*     11                  1.65                    1.78
Ser          agc      7                   1.70                    1.98
             agt^w    0                   1.07                    1.26
             tca      5                   1.00                    1.24
             tcc^w    0                   1.65                    1.74
             tcg      4                   0.41                    0.45
             tct*     10                  1.40                    1.52
Thr          aca*     10                  1.36                    1.50
             acc^w    0                   1.95                    1.82
             acg      7                   0.57                    0.60
             act      8                   1.28                    1.32
Trp          tgg      7                   1.12                    1.24
Tyr          tac*     11                  1.58                    1.48
             tat^w    1                   1.22                    1.21
Val          gta      5                   0.68                    0.72
             gtc^w    0                   1.52                    1.40
             gtg      19                  2.91                    2.78
             gtt*     20                  1.12                    1.11

Fig. 1. Isoaccepting tRNA gene copy number vs. frequency of amino acids and codons, weighted by expression, in highly expressed genes, in the human
genome. Frequencies were computed and weighted for the top 30% of all expressed genes (a total of 4320 genes). (a) Amino acid frequencies, R=0.585,
p=0.007. (b) Codon frequencies, R=0.654, p<0.001. Codons translated by the same anticodon (the wobble effect) were regarded as one point. The data appear
in Tables 2 and 3. Expression values were calculated by averaging over 43 SAGE libraries (see Materials and methods).
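The error-bar summaries used in the figures (genes divided into ten categories of approximately equal size, with the mean and a 95% interval per category) can be sketched as follows. The function name is hypothetical, and the interval is computed with a normal approximation (1.96 standard errors), which is an assumption; the text does not state how the error bars were derived.

```python
import math
import statistics


def binned_means(x, y, n_bins=10):
    """Sort genes by x, split into n_bins groups of near-equal size, and
    return (mean of y, half-width of an approximate 95% interval) per group.

    Each group must contain at least two values.
    """
    pairs = sorted(zip(x, y))
    n = len(pairs)
    summary = []
    for b in range(n_bins):
        chunk = [v for _, v in pairs[b * n // n_bins:(b + 1) * n // n_bins]]
        mean = statistics.fmean(chunk)
        # Normal-approximation half-width: 1.96 * standard error of the mean
        half_width = 1.96 * statistics.stdev(chunk) / math.sqrt(len(chunk))
        summary.append((mean, half_width))
    return summary


if __name__ == "__main__":
    expression = list(range(100))        # toy values, for illustration only
    measure = [2 * v for v in expression]
    for mean, hw in binned_means(expression, measure):
        print(round(mean, 2), round(hw, 2))
```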
3.2. The role of codon bias
Since codon bias is the unequal use of synonymous
codons, we expect high codon bias in genes with stronger
preference for optimal codons, and thus we expect codon
bias to rise with frequency of optimal codons. Interestingly,
we found that genes with the lowest values of FOP′ also
tend to have high codon bias, even higher than in genes
with the highest values of FOP′. This is illustrated in the
graphs in Fig. 3. The high values of codon bias in genes with
low frequency of optimal codons indicate that the bias tends
to be for codons that correspond to low tRNA gene copy
numbers, and this in turn may suggest that the bias is for
codons with low translational values. The figure shows the
relation between FOP′ and four measures of codon bias:
effective number of codons, or ENC′ (Novembre, 2002), B
measure (Karlin et al., 1998), HK measure (Hey and Kliman,
2002), and maximum-likelihood codon bias, or MCB
(Urrutia and Hurst, 2001). ENC′ and B are corrected for
background nucleotide composition by considering
dinucleotides in the background (see Materials and methods).

Fig. 2. Frequency of optimal codons (FOP′) vs. average expression. 14,131 genes expressed in 43 SAGE libraries were divided, according to expression
level, into ten categories of approximately equal size. Circles represent the
mean value. Error bars show 95% confidence intervals. R=0.08, p<0.0001.
When considering the relation between codon bias and
gene expression level, we encounter another unexpected
result. Instead of the expected positive correlation between
codon bias and expression level, based on studies in
different organisms (see the Introduction), we observed that
the average codon bias is highest both in the classes of
genes with the highest as well as with the lowest expression
levels. This is shown in the graphs of Fig. 4. Similar graphs
were obtained when using breadth of expression or
maximum expression (see Materials and methods, graphs
available upon request).
As indicated above, a rise in codon bias as expression
level drops has not been observed in lower organisms. It
appears that, in addition to the role associated with codon bias in
enhancing the expression of certain genes by preferring
codons with high cellular levels of isoaccepting tRNA (as
was found for unicellular organisms; Ikemura, 1981; Kanaya
et al., 1999) or with high tRNA gene copy number (Duret,
2000), codon bias in the human genome
may have an additional role: controlling the expression of
certain genes by preferring codons with small tRNA gene
copy number.

Fig. 3. Codon bias vs. frequency of optimal codons (FOP′). (a) Effective number of codons (ENC′); (b) B; (c) Regression line subtraction (HK, see Materials
and methods); (d) Maximum-likelihood codon bias (MCB). A total of 14,131 genes was divided, according to FOP′, into ten categories of approximately equal
size. Circles represent the mean value. Error bars show 95% confidence intervals.
3.3. Translation efficiency and amino acid biosynthetic cost
Translation efficiency is also affected by the size, the
structure, and the production cost of the amino acids
incorporated in the protein. To further analyze these factors,
we hypothesized that proteins coded by highly expressed
genes are composed of smaller and biosynthetically cheaper
amino acids.
The relation between expression levels and biosynthetic
cost is shown in Fig. 5. We use the size/complexity quotient
of Dufton (1997) as a measure of biosynthetic cost (Table
2). A clear monotonic relation between expression levels
and biosynthetic cost is evident. Genes with higher average
expression (see Materials and methods) tend to code less for
"expensive" amino acids, as estimated by the size/complexity
measure. Similar graphs were obtained when using
breadth of expression or maximum expression.
Interestingly, we found that the frequency of optimal
codons is in significant positive correlation with the
measures of biosynthetic cost (R=0.18, p<0.0001), as
indicated in Fig. 6.
This result may seem counterintuitive, since it
was shown above that the size/complexity score of genes
tends to decrease with the average expression, which points
to a possible preference for cheaper and smaller amino acids
in highly expressed genes. However, the correlations between
the FOP and expression level, and between expression level
and size/complexity score, are too weak to allow any inference
about the relation between the FOP and size/complexity score.
The graph indicates that genes that encode more
expensive amino acids tend to have more tRNA genes for
their anticodons. This suggests the possibility of a mechanism
that compensates for the high production cost of a protein
by codon bias favoring optimal codons, enabling faster and
more accurate translation.

Fig. 4. Codon bias vs. average expression level. (a) Effective number of codons (ENC′); (b) B; (c) Regression line subtraction (HK, see Materials and
methods); (d) Maximum-likelihood codon bias (MCB). A total of 14,131 genes was divided, according to expression level, into ten categories of approximately
equal size. Circles represent the mean value. Error bars show 95% confidence intervals.

Fig. 5. Biosynthetic cost vs. average expression. (a) Average size/complexity score; (b) Frequency of occurrence of the most expensive amino acids (see
Materials and methods). A total of 14,131 genes expressed in 43 SAGE libraries was divided, according to expression level, into ten categories of
approximately equal size. Circles represent mean values. Error bars show 95% confidence intervals. Correlation coefficients are (a) R=−0.071 (p<0.0001); (b)
R=−0.127 (p<0.0001).

Fig. 6. Frequency of optimal codons vs. average size/complexity score.
R=0.18, p<0.0001. A total of 14,131 genes expressed in 43 SAGE libraries
was divided according to average size/complexity into ten categories of
approximately equal size. Circles represent the mean values. Error bars
show 95% confidence intervals.
4. Discussion
In this work, we studied the relation between translation
efficiency (as estimated by the number of tRNA genes) and
gene expression level (as estimated by measures derived
from SAGE/EST data) in the human genome, and the
interrelations between these two factors and codon bias. For
this purpose, we introduced two methods: a method for
computing the frequency of optimal codons, which is
independent of amino acid composition and with correction
for background nucleotide content, and a method for
computing expected values of codon frequency, based on
dinucleotide composition of the background. We showed
that amino acid and codon frequencies, weighted by
expression, correlate positively with tRNA gene copy
number, thus possibly indicating a relation between the
number of tRNA genes and tRNA abundance. We showed
that expression level has a weak but highly significant positive
correlation with frequency of optimal codons (which is
assumed to be a measure of translation efficiency). This, in
turn, shows that codon choice, or codon bias, relates to
expression level. A caveat must be admitted here: we
used measures of the transcriptome, indicating numbers of
mRNAs, and not of proteins. Owing to the lack of data on
protein levels in human, we assume that protein levels can be
inferred from the mRNA levels used here. In addition, we
obtained a surprising result not observed in previously
studied organisms, namely, that the average codon bias is
high both in weakly expressed genes and in highly expressed
genes. This result was obtained with all four measures of
codon bias used in this study. Previous analysis of the human
genome (Urrutia and Hurst, 2003) did not show high codon
bias for the weakly expressed genes. This may be explained
by the fact that in that study an older version of the genome
was used and that a large part of the weakly expressed genes
was deliberately excluded from the analysis.
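For orientation, the classical, uncorrected frequency of optimal codons can be sketched as below. This is not the corrected measure introduced in this work, which additionally removes the effect of amino acid composition and background nucleotide content; it only illustrates the basic ratio of optimal codons to all codons that have synonymous alternatives. The function name `fop` is hypothetical, and the set of optimal codons is supplied by the caller.

```python
NON_DEGENERATE = {"ATG", "TGG"}       # Met and Trp have a single codon each
STOP_CODONS = {"TAA", "TAG", "TGA"}

def fop(cds, optimal_codons):
    """Classical frequency of optimal codons for one coding sequence.

    Codons without synonymous alternatives and stop codons are skipped;
    a trailing incomplete codon is ignored.
    """
    cds = cds.upper()
    counted = 0
    optimal = 0
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3]
        if codon in NON_DEGENERATE or codon in STOP_CODONS:
            continue
        counted += 1
        if codon in optimal_codons:
            optimal += 1
    return optimal / counted if counted else float("nan")
```

With a purely illustrative optimal set `{"CTG", "GCC"}`, the sequence `ATGCTGCTAGCCGCATAA` has four countable codons (CTG, CTA, GCC, GCA), two of which are in the set, so the function returns 0.5.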
In highly expressed genes, we show a tendency for high
codon bias, and also for higher frequency of optimal
codons. A possible explanation for this is that in these
genes, the high codon bias is a consequence of more codons
with high tRNA gene copy number, increasing the trans-
lation elongation rate. The finding that the average codon
bias is high for lowly expressed genes, together with the
result that these genes tend to have low frequency of optimal
codons (and therefore their high bias is probably the
consequence of favoring non-optimal codons in terms of
translation efficiency), suggests that some lowly expressed
genes may also experience the effect of natural selection
against optimal codons. Comeron (2004) showed strong
evidence for a transcription-associated bias for higher G and
T content on the coding strand of introns. This bias is
observed in coding regions in the third codon position as
well (Duret, 2002). Computing codon bias and RSCU
values relative to expected values, derived from introns on
the coding strand only, as done in this study, avoids the
difference between the coding and the non-coding strands,
as observed in Comeron (2004). Since only half of the
optimal codons end with a G or a T, it is unlikely that the
transcription-associated bias for G and T in both introns and
coding regions can account for the correlation between
FOP and expression level (Fig. 2) and the relation between
codon bias and expression level (Fig. 4).
We hypothesize that the translation efficiency of proteins
that can be disadvantageous at high levels is controlled by
this mechanism of preferring codons with low tRNA
abundance, which regulates the elongation rate of these
proteins. Such a mechanism was suggested by Fiers and
Grosjean (1979), and supporting evidence was provided by
Konigsberg and Godson (1983). The latter found that some
E. coli regulatory genes contain an unusually high number
of codons that are not frequently used in most E. coli genes,
and therefore suggested that this could be part of a
mechanism that helps to keep a low expression level in
some regulatory genes. In another study, Zhang et al.
(1991) showed that in E. coli, S. cerevisiae, D.
melanogaster, and primates (mainly Homo sapiens), proteins
containing a high percentage of low-usage codons can
be characterized as cases where an excess of the protein
could be detrimental. Another indication of this mechanism
in bacteria is provided by Saier (1995) who showed
evidence suggesting that the translation of proteins involved
in various specialized functions may be regulated by using
rare codons. Robinson et al. (1984) and Pedersen (1984)
presented experimental evidence which also indicates that
the presence of non-optimal codons can reduce translation
efficiency. Support for the notion of an expression regulation
mechanism in lowly expressed genes in mammals can
also be found in the work of Zhou et al. (1999) and Wells et al.
(1997), who showed that modifying the codon composition
of non-mammalian genes to resemble that of mammalian
genes can significantly enhance their translation in mamma-
lian cells, where the translation of the original genes is
limited. This model has been challenged by the argument that the
presence of rare codons in lowly expressed genes is due to
mutational drift randomizing codon usage (Sharp and Li,
1986). However, it is hard to see how random drift can
account for the high level of codon bias, favoring rare and
non-optimal codons, in lowly expressed genes, as observed
in the human genome in this study. More research is needed
to show whether this mechanism is used to regulate
translation rate. The regulation hypothesis is of high value
because it leads to an experimental prediction, namely, that
proteins that could be disadvantageous in excess indeed
contain significantly more non-optimal codons.
Using the size/complexity index developed by Dufton
(1997, see Materials and methods) as an estimate of the
amino acid cost, we found that there is a negative correlation
(R=−0.071) between expression level and the size/complexity
score. This result is expected if amino acid usage is
shaped by selective forces to optimize translation efficiency.
A more pronounced effect was demonstrated when the
correlation between the frequency of expensive amino acids
within genes and the expression level is considered
(R=−0.127, see Fig. 5). Urrutia and Hurst (2003) also used
Dufton’s index to examine its association with gene
expression level, and found a similar tendency to avoid
the use of complex amino acids in highly expressed genes.
Another study (Akashi and Gojobori, 2002), conducted on
B. subtilis and E. coli and focused on the relation
between metabolic costs of amino acid biosynthesis and
patterns of amino acid composition, shows an increased
usage of less energetically costly amino acids in highly
expressed genes in both species, and thus supports the action of
selection on amino acid usage to increase metabolic
efficiency. It should be noted, however, that Dufton’s index,
although an indirect measure, takes into account the size and
the structural complexity of amino acids, factors that have
an influence on the rate of incorporating amino acids in the
elongation process.
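The kind of calculation behind these correlations can be sketched as follows: a per-protein mean cost score, and a correlation coefficient between that score and expression level across genes. This is an illustration under stated assumptions. The names `mean_cost` and `pearson_r` are hypothetical, the cost table must be supplied by the caller (Dufton's actual index values are not reproduced here), and a Pearson coefficient is assumed for R, which may differ from the statistic actually used.

```python
import math

def mean_cost(protein, cost):
    """Mean per-residue cost score of a protein sequence.

    `cost` maps one-letter amino acid codes to a score; residues
    without a score are skipped.
    """
    scored = [cost[aa] for aa in protein if aa in cost]
    return sum(scored) / len(scored)

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

In use, `mean_cost` would be evaluated for every gene product and `pearson_r` applied to the resulting scores paired with the genes' expression levels.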
Contrary to these results, Comeron (2004), using micro-
array data, reports that there is no detectable influence of
expression on amino acid usage. One reason for this
discrepancy could be the use of two different
methods for estimating the expression level. Another
possible explanation could be that Comeron
analyzed the expression in each tissue separately
(‘expression level’), which can be problematic. Genes
that are highly expressed in one tissue can be poorly
expressed in another, or not expressed at all. Since it is
more reasonable to assume that if there is selection for
translational efficiency, it will be detected in genes with high
activity in many tissues, the more appropriate measure of
gene expression, to assess its association with amino acid
usage, is average expression or breadth of expression (see
Materials and methods). These measures reflect more
accurately the total activity of genes in different tissues.
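The difference between a single-tissue expression level and the two aggregate measures can be illustrated with a small sketch. The exact definitions are given in Materials and methods and are not reproduced here; the version below assumes that average expression is the mean of normalized counts over all libraries and that breadth of expression is the number of libraries in which the gene is detected. Both function names are hypothetical.

```python
def average_expression(levels):
    """Mean expression of one gene across all libraries (tissues)."""
    return sum(levels) / len(levels)

def breadth_of_expression(levels, threshold=0.0):
    """Number of libraries in which the gene is detected above threshold."""
    return sum(1 for level in levels if level > threshold)

# Two hypothetical genes measured in four libraries.
tissue_specific = [120.0, 0.0, 0.0, 0.0]   # high in one tissue only
housekeeping = [30.0, 30.0, 30.0, 30.0]    # moderate everywhere

print(average_expression(tissue_specific), breadth_of_expression(tissue_specific))
print(average_expression(housekeeping), breadth_of_expression(housekeeping))
```

Judged by its first library alone, the tissue-specific gene looks highly expressed, yet its breadth is 1 against 4 for the broadly expressed gene. This is the distinction the paragraph above draws between per-tissue and aggregate measures.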
We showed that frequency of optimal codons correlates
positively with protein production cost. We suggest that this
may be an indication of the action of selection on codon bias
to reduce error rate in the production of costly proteins. This
mechanism was suggested by Akashi (1994), who showed
evidence that natural selection acts on synonymous codon
usage to enhance the accuracy of protein synthesis in
Drosophila, based on association between synonymous
codon usage and amino acid constraint. In the study of
Akashi and Gojobori (2002), a negative correlation between
metabolic costs of amino acids and codon bias was shown.
This relation seems to contradict the result presented here.
However, apart from the fact that their study deals with
bacteria, and that both selective forces and regulatory
mechanisms may differ in higher organisms, there
are other factors that may explain the difference. As
mentioned above, the metabolic cost calculation does not
take into account the size and the complexity of amino
acids; moreover, in other organisms, factors such as
biosynthetic pathways and dietary conditions may contribute
differently to the amino acid composition. In addition,
the MCU measure in that study differs from the FOP′
measure defined here, and the former is not
corrected for background composition or for amino acid
composition, as FOP′ is.
Two points of caution should be emphasized here. First,
as done in previous studies (Duret, 2000; Comeron, 2004)
we assumed a correspondence between tRNA gene copy
numbers and tRNA cellular abundance. As far as we know,
this relation, although demonstrated for several organisms, has not
been substantiated for humans. As noted above, the results of
this study may suggest this relation, since without such
correspondence one could not expect the correlations,
observed here, between gene copy numbers and amino acid
and codon frequencies. The fact that, in 14 out of 18 amino
acids, the codons with the highest tRNA gene copy numbers
also increase in frequency when lowly and highly expressed
genes are compared (Comeron, 2004)
supports this assumption. However, this is not a proof of such
a relation. Thus, some of the results suggested in the above
discussion, concerning the action of selection on codon bias
for translation efficiency, rely partly on an assumption, that,
although reasonable, has not been firmly substantiated yet.
The second point is that the correlations between FOP′
and the different expression level measures are very weak.
FOP′ was calculated by controlling for the effect of amino acid
composition and that of background nucleotide content. It is
not inconceivable that a third effect may account for
the observed correlations.
In summary, based on the evidence presented, we suggest
three possible ways in which selection may act on codon
bias in the human genome: (1) Increasing translation
efficiency in highly expressed genes; (2) regulating trans-
lation efficiency of some proteins that can be a disadvantage
at high levels; and (3) improving translation efficiency and
reducing the rate of amino acid misincorporation in the
production of biosynthetically expensive proteins.
Acknowledgements
We thank Edward Trifonov, Laurent Duret and Giuseppe
D’Onofrio for valuable discussions that contributed to this
manuscript. We also thank Yefim Yakir for preparing part of
the figures and for technical support and Nurit Carmi for
statistical consultation.
References
Akashi, H., 1994. Synonymous codon usage in Drosophila mela-
nogaster: natural selection and translational accuracy. Genetics 136
(3), 927–935.
Akashi, H., 1995. Inferring weak selection from patterns of polymorphism
and divergence at “silent” sites in Drosophila DNA. Genetics 139,
1067–1076.
Akashi, H., Gojobori, T., 2002. Metabolic efficiency and amino acid
composition in the proteomes of Escherichia coli and Bacillus subtilis.
Proc. Natl. Acad. Sci. U. S. A. 99 (6), 3695–3700.
Bernardi, G., et al., 1985. The mosaic genome of warm-blooded vertebrates.
Science 228 (4702), 953–958.
Comeron, J.M., 2004. Selective and mutational patterns associated with
gene expression in humans: influences on synonymous composition and
intron presence. Genetics 167, 1293–1304.
Dong, H., Nilsson, L., Kurland, C.G., 1996. Co-variation of tRNA
abundance and codon usage in Escherichia coli at different growth
rates. J. Mol. Biol. 260, 649–663.
Dufton, M.J., 1997. Genetic code synonym quotas and amino acid
complexity: cutting the cost of proteins? J. Theor. Biol. 187, 165–173.
Duret, L., 2000. tRNA gene number and codon usage in the C. elegans
genome are co-adapted for optimal translation of highly expressed
genes. Trends Genet. 16 (7), 287–289.
Duret, L., 2002. Evolution of synonymous codon usage in metazoans. Curr.
Opin. Genet. Dev. 12, 640–649.
Duret, L., Mouchiroud, D., 1999. Expression pattern and, surprisingly, gene
length shape codon usage in Caenorhabditis, Drosophila, and
Arabidopsis. Proc. Natl. Acad. Sci. U. S. A. 96 (8), 4482–4487.
Fiers, W., Grosjean, H., 1979. On codon usage. Nature 277 (5694), 328.
Grantham, R., Gautier, C., Gouy, M., Mercier, R., Pave, A., 1980. Codon
catalog usage and the genome hypothesis. Nucleic Acids Res. 8,
r49–r62.
Graur, D., Li, W.-H., 2000. Fundamentals of Molecular Evolution, 2nd ed.
Sinauer, Sunderland, MA.
Hey, J., Kliman, R.M., 2002. Interactions between natural selection,
recombination and gene density in the genes of Drosophila. Genetics
160, 595–608.
Ikemura, T., 1981. Correlation between the abundance of Escherichia coli
transfer RNAs and the occurrence of the respective codons in its protein
genes. J. Mol. Biol. 146 (1), 1–21.
Ikemura, T., 1982. Correlation between the abundance of yeast transfer
RNAs and the occurrence of the respective codons in protein genes.
Differences in synonymous codon choice patterns of yeast and
Escherichia coli with reference to the abundance of isoaccepting
transfer RNAs. J. Mol. Biol. 158 (4), 573–597.
Kanaya, S., Yamada, Y., Kudo, Y., Ikemura, T., 1999. Studies of codon
usage and tRNA genes of 18 unicellular organisms and quantification of
Bacillus subtilis tRNAs: gene expression level and species-specific
diversity of codon usage based on multivariate analysis. Gene 238,
143–155.
Kanaya, S., Yamada, Y., Kinouchi, M., Kudo, Y., Ikemura, T., 2001. Codon
usage and tRNA genes in eukaryotes: correlation of codon usage
diversity with translation efficiency and with CG-dinucleotide usage as
assessed by multivariate analysis. J. Mol. Evol. 53, 290–298.
Karlin, S., Mrazek, J., 1996. What drives codon choices in human genes?
J. Mol. Biol. 262, 459–472.
Karlin, S., Mrazek, J., Campbell, A.M., 1998. Codon usages in different
gene classes of the Escherichia coli genome. Mol. Microbiol. 29 (6),
1341–1355.
Knight, R.D., Freeland, S.J., Landweber, L.F., 2001. A simple model based
on mutation and selection explains trends in codon and amino-acid
usage and GC composition within and across genomes. Genome Biol. 2
(4) (http://www.genomebiology.com/2001/2/4/research/010.1).
Konigsberg, W., Godson, N., 1983. Evidence for use of rare codons in the
dnaG gene and other regulatory genes of Escherichia coli. Proc. Natl.
Acad. Sci. U. S. A. 80 (3), 687–691.
Kreitman, M., Comeron, J.M., 1999. Coding sequence evolution. Curr.
Opin. Genet. Dev. 9 (6), 637–641.
Lander, E.S., et al. International Human Genome Sequencing Consortium,
2001. Initial sequencing and analysis of the human genome. Nature
409, 860–921.
Marais, G., Mouchiroud, D., Duret, L., 2001. Does recombination improve
selection on codon usage? Lessons from nematode and fly complete
genomes. Proc. Natl. Acad. Sci. U. S. A. 98 (10), 5688–5692.
Moriyama, E.N., 2003. Codon usage. Encyclopedia of the Human Genome.
Macmillan Publishers, Nature Publishing Group. http://www.ehgonline.net.
Moriyama, E.N., Powell, J.R., 1997. Codon usage bias and tRNA
abundance in Drosophila. J. Mol. Evol. 45, 514–523.
Novembre, J.A., 2002. Accounting for background nucleotide composition
when measuring codon usage bias. Mol. Biol. Evol. 19 (8), 1390–1394.
Pedersen, S., 1984. Escherichia coli ribosomes translate in vivo with
variable rate. EMBO J. 3 (12), 2895–2898.
Percudani, R., Pavesi, A., Ottonello, S., 1997. Transfer RNA gene
redundancy and translational selection in Saccharomyces cerevisiae.
J. Mol. Biol. 268, 322–330.
Robinson, M., et al., 1984. Codon usage can affect efficiency of translation
of genes in Escherichia coli. Nucleic Acids Res. 12 (17), 6663–6671.
Saier, M.J., 1995. Differential codon usage: a safeguard against
inappropriate gene expression of specialized genes. FEBS Lett. 362, 1–4.
Sharp, P.M., Li, W.-H., 1986. Codon usage in regulatory genes in
Escherichia coli does not reflect selection for rare codons. Nucleic
Acids Res. 14, 7737–7749.
Sharp, P.M., Tuohy, T.M., Mosurski, K.R., 1986. Codon usage in yeast:
cluster analysis clearly differentiates highly and lowly expressed genes.
Nucleic Acids Res. 14, 5125–5143.
Urrutia, A.O., Hurst, L.D., 2001. Codon usage bias covaries with
expression breadth and the rate of synonymous evolution in humans,
but this is not evidence for selection. Genetics 159, 1191–1199.
Urrutia, A.O., Hurst, L.D., 2003. The signature of selection mediated by
expression on human genes. Genome Res. 13 (10), 2260–2264.
Versteeg, R., et al., 2003. The human transcriptome map reveals extremes
in gene density, intron length, GC content, and repeat pattern for
domains of highly and weakly expressed genes. Genome Res. 13 (9),
1998–2004.
Wells, K.D., Foster, J.A., Moore, K., Pursel, V.G., Wall, R.J., 1997. Codon
optimization, genetic insulation, and an rtTA reporter improve perform-
ance of the tetracycline switch. Transgenic Res. 8 (5), 371–381.
Wright, F., 1990. The ‘effective number of codons’ used in a gene. Gene 87, 23–29.
Zhang, S., Zubay, G., Goldman, E., 1991. Low usage codons in Escherichia
coli, yeast, fruit fly, and primates. Gene 105, 61–72.
Zhou, J., Liu, W-J., Peng, S.W., Sun, X.Y., Frazer, I., 1999. Papillomavirus
capsid protein expression level depends on the match between codon
usage and tRNA availability. J. Virol. 73, 4972–4982.