Digital Commons@Becker

Overview of gene structure

Throughout the C. elegans sequencing project, Genefinder was the primary protein-coding gene prediction program. These initial predictions were manually reviewed by curators as part of a “first-pass annotation” and are actively curated by WormBase staff using a variety of data and information. In the WormBase data release WS133 there are 22,227 protein-coding genes, including 2,575 alternatively-spliced forms. Twenty-eight percent of these have every base


What is a gene?
Sidney Brenner, the founder of modern worm biology, once said, "Old geneticists knew what they were talking about when they used the term 'gene', but it seems to have become corrupted by modern genomics to mean any piece of expressed sequence…" (Brenner, 2000). Sidney's lament serves to illustrate two points. The first is that the concept of a gene can mean different things to different people in different contexts. The second is that the concept of a gene has been evolving, not only in the modern genomic era, but ever since it first appeared in the early 1900s as a term to conceptualize the particulate basis of heritable physical traits (Snyder and Gerstein, 2003). Therefore, in a review of gene structure in C. elegans it seems prudent to define what we mean by a gene.
Our definition of a gene is necessarily heavily influenced by modern genomics, but we prefer to think it has not been corrupted by it. We define a gene as "the complete sequence region necessary for generating a functional product". This encompasses promoters and control regions necessary for the transcription, processing and, if applicable, translation of a gene. Hence, we include not only protein-coding genes (genes that encode polypeptides), but also non-coding RNA genes (ribosomal RNA, transfer RNA, micro RNA and small nuclear RNA genes). One additional type of gene we will briefly discuss is the pseudogene, though usually these are not considered to be functional.
The full extent of a gene, or the "complete sequence region" is not known for most C. elegans genes because promoters remain, for the most part, incompletely defined. Even the full extent of the primary transcript is frequently not known because a majority (70%) of protein-coding genes are rapidly modified by trans-splicing, which involves the addition of a short 22 nt exogenous RNA species to the 5' end of a transcript (Zorio et al., 1994). Recently, it has been shown that some non-coding RNA genes are also trans-spliced; a precursor of the microRNA let-7 was identified with a trans-splice leader sequence (Bracht et al., 2004).

Protein-coding genes
Protein-coding genes are the largest class of genes in the C. elegans genome, and probably the most interesting to the majority of people, so we will cover these genes first.

Prediction and curation
Throughout the C. elegans sequencing project Genefinder (Green and Hillier, unpublished software) was the gene prediction program of choice. Genefinder is an ab initio predictor and requires only a genomic DNA sequence and parameters based on a training set of confirmed coding sequences. Note that Genefinder, like most other gene prediction tools, is actually a coding-sequence (CDS) predictor and does not attempt to locate promoters, or untranslated regions (UTRs). All Genefinder predictions were appraised by human curators as part of the 'first-pass annotation' prior to the publication of the genome in 1998 (The C. elegans Sequencing Consortium, 1997). The quality of CDS predictions has improved over the course of the sequencing project as better training sets of confirmed CDS were generated and improved versions of Genefinder became available.
With the completion of the C. elegans genome sequence and funding for WormBase (NHGRI and MRC sources), a renewed effort to improve the gene predictions was initiated. These 'curated' gene structures are available through WormBase (http://www.wormbase.org/) and public nucleotide and protein databases (GenBank/EMBL/DDBJ/UniProt).
The generation and maintenance of a gene prediction data set is always a 'work in progress'. WormBase has invested heavily in correlating transcript data (experimentally confirmed coding sequences) with gene predictions. Hence, all messenger RNA (mRNA), Expressed Sequence Tag (EST) (Yuji Kohara, unpublished; http://nematode.lab.nig.ac.jp/) and Orfeome Sequence Tag (OST) (Reboul et al., 2003; Vaglio et al., 2003) sequences are routinely mapped to the C. elegans genome and compared to the current set of gene predictions. Annotators then modify the exon/intron structure of the prediction to accommodate any changes highlighted by new transcript data. Other types of information used in gene curation include protein alignment data (with special weighting to matches within C. elegans and related nematode species such as C. briggsae; Stein et al., 2003), repeat sequences and sequence features such as trans-splice leaders and poly(A) sites.
Curation endeavors to continually improve the accuracy of the gene structures. How accurate are the gene structures? In WormBase release WS133 of September 24, 2004, 6,202 CDS predictions (28%) have every base of every exon confirmed by some type of transcription evidence, showing the gene is real and the structure correct. An additional 11,459 (51%) have at least one base of an exon confirmed by transcript data, indicating the gene is real and part of the structure is correct. The remaining 21% of the CDS predictions currently have no EST or mRNA support but could have underlying protein alignments or strong sequence conservation with C. briggsae (Stein et al., 2003).
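The three confirmation classes can be checked with a few lines of arithmetic. The sketch below assumes that the denominator is the 22,227 WS133 protein-coding predictions quoted elsewhere in this chapter; the constant and function names are illustrative only.

```python
# Counts taken from the text (WormBase release WS133).
# Assumption: the denominator is the 22,227 protein-coding predictions.
TOTAL_CDS = 22_227
FULLY_CONFIRMED = 6_202        # every base of every exon confirmed
PARTIALLY_CONFIRMED = 11_459   # at least one exon base confirmed


def percent(count, total=TOTAL_CDS):
    """Return count as a percentage of total."""
    return 100.0 * count / total


# Whatever is left has no EST or mRNA support at all.
unconfirmed = TOTAL_CDS - FULLY_CONFIRMED - PARTIALLY_CONFIRMED

for label, count in [
    ("fully confirmed", FULLY_CONFIRMED),
    ("partially confirmed", PARTIALLY_CONFIRMED),
    ("no transcript support", unconfirmed),
]:
    print(f"{label:>22}: {count:>6} ({percent(count):.1f}%)")
```

Under this assumption the three classes come out within about one percentage point of the figures quoted in the text; the small differences presumably reflect rounding or a slightly different denominator.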

Gene number and sizes
There were 22,227 protein-coding genes on Sept 24, 2004 (WormBase data release WS133), including 2,575 alternatively-spliced forms. This is up from 19,099 in 1998 when the genome was declared essentially complete (The C. elegans Sequencing Consortium, 1997), primarily due to new transcript data indicating the existence of previously undetected new genes, the splitting of existing genes, or the detection of conserved sequences between the C. elegans and C. briggsae genomes that are predicted to be coding (Stein et al., 2003).
How many additional genes might be missing from WormBase? It's hard to know for certain, but an estimated 1,119 new genes come from 2,228 gene predictions made by TWINSCAN (Korf et al., 2001); these gene predictions do not overlap existing WormBase genes and an RT-PCR success rate of 55% confirms a subset of the novel genes (Wei et al., 2005, in press).
The average C. elegans protein-coding gene is compact in comparison to vertebrate genes. Most C. elegans genes are relatively small (Figure 1), covering a genomic region of approximately 3 kb (from start to stop codon including introns); however, there are some very large genes, which skew the average. The median size is only 1,956 bases with a range from 48 bases (Y108G3AL.6, confirmed by transcript data) to 80,957 bases (W06H8.8g, the C. elegans titin gene). The distribution of gene sizes for confirmed genes is nearly identical to that for all genes (Figure 1), suggesting that Genefinder does not significantly over- or under-predict the size of genes.
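The gap between the average and the median arises whenever a few very large values sit in an otherwise compact distribution. The following minimal sketch uses hypothetical gene sizes (not WormBase data) to show how a single titin-like outlier pulls the mean far above the median.

```python
from statistics import mean, median

# Hypothetical gene sizes in bases: mostly small genes plus one very large one.
gene_sizes = [900, 1_200, 1_500, 1_900, 2_000, 2_400, 3_100, 80_000]

# The mean is dragged upward by the outlier; the median is not.
print(f"mean   = {mean(gene_sizes):.0f} bases")
print(f"median = {median(gene_sizes):.0f} bases")
```

This is why the median is the more informative summary for gene, exon and intron sizes in this chapter.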

Exons
C. elegans genes, like most eukaryotic protein-coding genes, contain exons separated by introns. There are 126,477 predicted unique, coding exons (the same exon used in alternatively-spliced isoforms of the same gene is considered as one unique exon) in the WS133 protein-coding gene set, which account for 25.55% of the genome, considerably more than the 1.5% estimated for the human genome (The International Human Genome Sequencing Consortium, 2001).
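Counting "unique" exons amounts to de-duplicating exon coordinates across the isoforms of a gene. The sketch below is one way to do this, assuming that an exon is identified by its chromosome, start, end and strand; the isoform names and coordinates are hypothetical and the representation is not WormBase's own.

```python
from typing import Dict, List, Set, Tuple

# An exon is identified here by (chromosome, start, end, strand).
Exon = Tuple[str, int, int, str]


def unique_exons(isoforms: Dict[str, List[Exon]]) -> Set[Exon]:
    """Collect exons across isoforms, counting identical coordinates once."""
    seen: Set[Exon] = set()
    for exons in isoforms.values():
        seen.update(exons)
    return seen


# Hypothetical gene with two isoforms that share their first two exons
# and differ only in the start of the third.
toy_gene = {
    "geneX.a": [("I", 100, 250, "+"), ("I", 400, 520, "+"), ("I", 700, 900, "+")],
    "geneX.b": [("I", 100, 250, "+"), ("I", 400, 520, "+"), ("I", 760, 900, "+")],
}

print(len(unique_exons(toy_gene)), "unique exons from 6 exon annotations")
```

Note that two exons differing at only one boundary count as two unique exons under this definition.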
The average gene contains 6.4 coding exons, 6.0 if only genes that are confirmed by ESTs or mRNAs are considered; however, there are a few genes with a large number of exons (Figure 2). W06H8.8g, an isoform of the titin gene, has 66 coding exons. If only confirmed genes are considered then the gene with the largest number of exons (62) is F15G9.4b, an isoform of him-4. There are also a few single exon genes (570 in WS133) amounting to about 3% of total genes. Almost 60% of these are supported by EST or mRNA data. The average size of unique exons in all protein-coding genes is 208 bases, but there are a small number of very large exons. Again, as with gene size, these few large exons skew the average. The median size is only 123 bases, thus exons are similar in size to exons in human and fly genes (The International Human Genome Sequencing Consortium, 2001). The average size of unique exons in confirmed genes (201 bases), the median size (144 bases) and the distribution (Figure 2) are very similar to those in all worm genes, and to an earlier study based on 862 C. elegans exons in GenBank entries (Blumenthal and Steward, 1997). The largest exon in a confirmed gene is 7,569 bases, found in pqn-43 (F54E2.3b). The largest exon in all genes is 14,975 bases, found in an isoform of the unc-44 gene. The smallest confirmed coding exon is 3 bases, found in F54C1.3b, an isoform of mes-3.

Introns
There are 106,909 predicted unique introns (the same intron used in different isoforms or spliced variants is considered as one unique intron) in all of the protein-coding genes of C. elegans (WS133 release). Some of these are probably not real introns or have incorrect boundaries because they are either predicted only by Genefinder or based on imperfect alignments of cDNA or single-pass EST reads. Of these, 824 are less than 30 bases, almost all of which probably result from erroneous EST alignments in WormBase. A total of 67,833 introns are considered confirmed because there are EST or cDNA sequences spanning the intron boundaries. The most common size of confirmed introns is 47 bases with the median size being 65 bases. The smallest confirmed intron is only 10 bases. It is found in the 3' UTR of mag-1 (R09B3.5), has good splice acceptor and donor sites (5'-CAAAAA/gtacagttag/AAAAG-3') and is supported by an mRNA sequence and 3 separate EST clones. The largest confirmed intron is 21,230 bases, found in kin-1 (ZK909.2). It is confirmed by a single EST clone containing the SL1 trans-splice leader sequence on its 5' end. Interestingly, intron size in C. elegans appears to be positively correlated with local recombination rates (Prachumwat et al., 2004) and short introns are preferentially found in highly expressed genes (Castillo-Davis et al., 2002).
The introns of C. elegans have always been considered small, but as more genomes are being sequenced and annotated it is becoming evident that they are not distinctly smaller than those of most eukaryotes. The most common size for fly introns is only 59 bases (The International Human Genome Sequencing Consortium, 2001), as compared to 47 bases for the worm. The average size of introns on the largest, somatic, macronuclear chromosome of Paramecium is only 25 bases (Zagulski et al., 2004). Fungal introns are also small; Neurospora introns average 134 bases (Galagan et al., 2003), S. macrospora 106 bases (Nowrousian et al., 2004), and C. neoformans 67 bases (Loftus et al., 2005). Even in humans the most common intron size is only 87 bases, but there are also some very large introns, shifting the mean size to more than 3,300 bases (The International Human Genome Sequencing Consortium, 2001).
C. elegans introns follow the GU-AG splice site rule, although GC is a rare 5' splice site variant (Blumenthal and Steward, 1997). From their analysis of 669 introns, Blumenthal and Steward found that C. elegans has a highly conserved and extended 3' splice site (UUUCAG) and no obvious polypyrimidine tract other than the 3' splice site consensus. They suggest that the 3' intron boundary may be more important in C. elegans intron recognition than in other organisms.
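The splice-site rules described here are simple enough to express as string checks. The sketch below is a hypothetical helper, not part of any annotation pipeline; it tests an intron sequence for the GU (GT in DNA) or rare GC donor, the AG acceptor, and the extended UUUCAG 3' consensus, and is applied to the 10-base mag-1 intron quoted earlier in this chapter.

```python
def classify_intron(intron: str) -> dict:
    """Report splice-site features of an intron given 5'->3' (DNA or RNA)."""
    # Normalize to upper-case DNA so that GU and GT are treated alike.
    seq = intron.upper().replace("U", "T")
    donor = seq[:2]
    return {
        "length": len(seq),
        "donor": donor,
        "canonical_donor": donor == "GT",             # GU in RNA
        "variant_gc_donor": donor == "GC",            # rare 5' variant
        "ag_acceptor": seq.endswith("AG"),
        "extended_acceptor": seq.endswith("TTTCAG"),  # UUUCAG consensus
    }


# The 10-base intron from the 3' UTR of mag-1 (lower-case part of
# 5'-CAAAAA/gtacagttag/AAAAG-3').
print(classify_intron("gtacagttag"))
```

A check like this only flags sequences that are consistent with the consensus; it says nothing about whether a site is actually used, which still requires transcript evidence.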
In addition to splicing information, some C. elegans introns contain sequences involved in the regulation of gene expression (Zhang and Emmons, 2000).

Alternative splicing
Alternative splicing will be covered in detail in another chapter (see Alternative splicing in C. elegans) so here we will just mention the topic with reference to WormBase. Alternative splice forms are only annotated when there is direct transcript or literature citation evidence for the alternative form. In WormBase release WS133 there are 1,834 genes that have a total of 4,407 alternatively-spliced forms. The number of alternatively-spliced forms per gene tends to be small. Over 90% have either one (1,375) or two (302) alternatively-spliced forms. Many of these alternative forms show only minor changes to the CDS, in which the splice donor or acceptor is shifted by a multiple of 3 bases (i.e., 3, 6 or 9 bases) so that the reading frame is preserved.
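The "modulo 3" condition can be stated in one line: moving a splice donor or acceptor keeps the downstream reading frame only if the shift is a multiple of three bases. The function name below is illustrative.

```python
def preserves_frame(shift_in_bases: int) -> bool:
    """True if moving a splice site by this many bases keeps the reading frame."""
    return shift_in_bases % 3 == 0


# Positive shifts lengthen the exon, negative shifts shorten it.
for shift in (3, 4, 6, 9, -3):
    status = "in frame" if preserves_frame(shift) else "frameshift"
    print(f"{shift:+d} bases: {status}")
```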

Pseudogenes
Processed pseudogenes, which are created by reverse transcription of mRNA into cDNA followed by reintegration into the genome, are fairly easy to detect because they lack introns. These are rare in C. elegans (Harrison et al., 2001). Unprocessed pseudogenes arise by duplication of a gene, which is subsequently disabled by random mutation. Unprocessed pseudogenes usually have features that aid in their identification, such as frameshifts, premature stops, insertions and truncations compared to their functional homolog, or a ratio of non-synonymous vs. synonymous nucleotide substitution rates indicating a lack of purifying selection. These features are probably valid indicators for most pseudogenes, assuming that the function of the gene is at the protein level. Some pseudogenes may even be expressed, but mRNAs containing premature stops are usually subject to rapid, nonsense-mediated decay (NMD), making them difficult to detect.
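Two of the disabling features listed above, premature stops and frameshifts, can be screened for directly in a candidate coding sequence. The sketch below is a deliberately simplified illustration that assumes the standard stop codons and an already extracted, spliced CDS; the function names and example sequences are hypothetical, and real pseudogene detection also relies on comparison with a functional homolog.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_premature_stop(cds: str):
    """Return the 0-based codon index of the first in-frame stop that occurs
    before the final codon, or None if there is none."""
    seq = cds.upper()
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    for index, codon in enumerate(codons[:-1]):  # skip the terminal codon
        if codon in STOP_CODONS:
            return index
    return None


def length_breaks_frame(cds: str) -> bool:
    """A CDS whose length is not a multiple of 3 hints at a frameshift."""
    return len(cds) % 3 != 0


intact = "ATGGCTGAAACCTAA"    # hypothetical: ATG GCT GAA ACC TAA
disabled = "ATGGCTTGAACCTAA"  # hypothetical: ATG GCT TGA ACC TAA

print(first_premature_stop(intact), first_premature_stop(disabled))
```

As the uaf-1 example in this chapter shows, a positive result from a screen like this is a hint rather than proof that a sequence is a pseudogene.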
The uaf-1 gene is an interesting example of how features indicating a pseudogene should be viewed with caution. uaf-1 encodes the essential splicing factor U2AF 65 (Zorio et al., 1997) and produces several classes of mRNA, including a 1.7 kb mRNA that encodes a functional U2AF 65 and a slightly larger mRNA with an extra exon, which inserts an in-frame stop (MacMorris et al., 1999). The premature stop in the latter isoform suggests that this form is non-functional and should be degraded. However, the larger mRNA remains in the nucleus, thus escaping nonsense-mediated decay, probably because the extra exon contains multiple copies of a 3' splice-site consensus sequence, which can bind U2AF 65 (Zorio and Blumenthal, 1999). The likely function of this alternatively-spliced form of uaf-1, which is retained in the nucleus, is to down-regulate levels of uaf-1 when the need for splicing is reduced and to retain free splicing factors in the nucleus where they can be made quickly available when the need for splicing increases. So even though this alternatively-spliced form of uaf-1 is non-functional at the protein level, and would appear to be a pseudogene version of uaf-1, it does function at the RNA level and therefore is not a pseudogene.
Even before large-scale sequencing of the C. elegans genome commenced, pseudogenes were identified in the major sperm protein (MSP; Ward et al., 1988) and heat-shock protein gene families (Heschl and Baillie, 1989). Genefinder predictions associated with the genome sequencing project did not attempt to predict pseudogenes. The first genome-wide analysis of pseudogenes was done in 2001 (Harrison et al., 2001). Analyzing Wormpep release 18 and the corresponding version of the genomic sequence (which had only 332 annotated pseudogenes), the authors found 2,168 pseudogenes, or 11.7% of the annotated genes. Most of these were unprocessed pseudogenes, with only 208 designated as processed pseudogenes. They found that pseudogenes are unevenly distributed across the genome with a disproportionate number on chromosome IV; the density was also higher on chromosome arms than in the central regions. Looking at the distribution of pseudogenes among gene families, they found that the number of pseudogenes is not correlated with the size of the gene family, but several families were associated with large numbers of pseudogenes. One of these families was the 7-TM receptor family, a finding supported by Robertson's characterization of chemoreceptor gene families (Robertson, 1998, Robertson, 2000, Robertson, 2002). A higher estimate of the number of pseudogenes comes from an analysis of reporter gene fusions (Mounsey et al., 2002). Extrapolating from the number of 364 randomly selected reporter gene fusions that showed no expression, the authors estimate that 20% of the annotated C. elegans genes may be pseudogenes. Furthermore, they found that pseudogenes were enriched for genes that had been recently duplicated.
WormBase release WS133 contains only 561 annotated pseudogenes, far fewer than either of the above estimates. Half of these are located on chromosome V (Table 1), reflecting the curation of chemoreceptor genes, which are located primarily on chromosome V (Robertson, 1998, Robertson, 2000, Robertson, 2002; see Putative chemoreceptor families of C. elegans). It seems likely that the number of annotated pseudogenes in WormBase is too low and that other gene families need to be scrutinized for them in the same way the chemoreceptor gene family has.

Transfer RNA genes
There are 608 nuclear and 22 mitochondrial tRNA genes in C. elegans. Seven nuclear genes have been identified as suppressor tRNAs (sup-5, sup-7, sup-21, sup-24, sup-29, sup-28 and sup-33) and two are likely to be pseudogenes (rtw-5 and rtw-6). The remainder are predicted by tRNAscan-SE (Lowe and Eddy, 1997). The tRNA genes range in size from 64 to 122 bases with 72% having 72 or 73 bases. Of the 608 nuclear genes, 29 have two exons and the remainder have a single exon. tRNAscan-SE also predicts that there are 213 tRNA pseudogenes.
Nuclear tRNA genes are over-represented on the X chromosome with 45% residing there (Table 1). The other 55% are distributed uniformly over the autosomes; however, there is a slight enrichment on chromosome III and a lower density in the central region of chromosome IV and on the left arm of chromosome I.

Ribosomal RNA genes
The genes for the 18S, 5.8S and 26S ribosomal RNAs, first sequenced and characterized by Ellis et al. (1986), are found in a large tandem repeat of 100-150 copies on the right end of chromosome I. Each repeat contains one copy each of the 18S, 5.8S and 26S genes. The 5S ribosomal RNA genes are found in a tandem repeat of an estimated 100 copies on chromosome V. Each copy of the 5S gene is interspersed with one SL1 splice leader gene.

Genomic organization
Protein-coding genes are found equally on either strand of DNA and are fairly uniformly distributed throughout the genome. They are slightly denser on autosomes than on chromosome X (Table 1) and, in general, the central regions of the autosomes are denser than the arms. The left arm of chromosome II is an exception.
Genes in general do not overlap one another, that is to say, their exons do not overlap, but there are numerous examples of genes that fall within introns of another gene, either on the same or the opposite strand. F10F2.2 contains 5 genes on the opposite strand in 2 large introns. A rare and unusual example of overlapping genes can be found with unc-17 and cha-1. These two genes share a common promoter and a first, non-coding exon. The rest of the coding exons do not overlap, so the two genes encode different proteins with mutationally separable functions. unc-17 encodes a synaptic vesicle acetylcholine transporter, while cha-1 encodes a choline acetyltransferase (Alfonso et al., 1994).
An unusual and interesting feature of the worm genome is the existence of genes organized into operons. These polycistronic gene clusters contain two or more closely spaced genes, which are oriented in a head to tail