IGTC Annotation Pipeline
As described in the Overview Tutorial, a gene trap
experiment produces a sequence that includes some of the vector along with
sequence from the trapped gene. The vector sequence is removed, and bioinformatics
techniques are used to identify the trapped gene and its location in the
mouse genome. Using this information, researchers can select cell lines to
generate mice with the desired mutant genes. The IGTC pipeline annotates cell line
sequences with the identities of the trapped genes (transcripts) and their
genomic locations. The pipeline is entirely
automated, allowing us to annotate large numbers of new trap lines quickly,
according to systematic and reproducible procedures.
The steps in the IGTC pipeline are shown in Figure 1.
A "feeder" loads data from dbGSS and initiates the pipeline on each of the new
cell line sequence tags, or cell line sequence tag that is being reprocessed.
The pipeline is described in three main processing stages: Coordinate Identification,
Gene Association, and Additional Annotation. These are labeled in
Figure 1 as Identification,
Association, and Annotation.
Figure 1. The IGTC pipeline.
The goal of Identification is to assign a single genomic coordinate set to
each sequence tag. This stage employs both a BLAST-based Transcript protocol
and a BLAT-based Localization protocol. These two protocols provide
overlapping results (see Figure 2).
The Association stage uses the coordinates to assign an Ensembl and an Entrez
gene to the tag. The Annotation stage collects additional information from
public databases related to the associated genes.
A detailed description of the three pipeline stages follows.
Figure 2. Advantage of utilizing two different
protocols (transcript search and genomic localization) to
characterize cell line sequences. (Overlap is not to scale).
Coordinate Identification
This stage is further broken down into three distinct protocols:
- Transcript Protocol: determine the transcript(s) associated with this cell line.
- Localization Protocol: determine the genomic coordinates of the sequence.
- Reconciliation Protocol: determine if the two identification protocols agree or disagree.
Transcript Protocol
Each sequence tag is queried against the GenBank 'nt' nucleotide database using
blastall v 2.2.12. The '-S' flag is used to limit results to those with the
correct orientation relative to the sequence tag, as communicated by IGTC
members. Results returned are limited to 50; maximum e-value is 1e-5.
Each result is examined for the longest "ungapped" run, i.e., containing no
gaps exceeding 3 bases. Statistics are gathered for run length and number of
matching bases. N's are treated as mismatches. Percent run is calculated as the
maximum ungapped run length / sequence tag length; percent identity is
calculated as matching base count / maximum ungapped run length. The last 60
bases of the sequence tag are separately examined. Contrary to the above, an N
in both the query and the target qualifies here as a matching base. A so-called
3prime statistic is calculated as matching base count / (base count + query gap
count).
Each result earns the title of putative id if a) 3prime = 1.00 or b) percent
run >= .90 and percent identity >= .95. Results without a GenBank record or to
a non-mouse species or without a sequence or with a sequence exceeding 65536
bases in length are discarded.
Putative id results are then sorted into lists of synonyms. The first grouping
attempt is by MGI gene ID. Results not associated to an MGI gene are compared
by sequence to those that are. Two sequences are synonyms if they differ in
length by no more than 2% of the shorter sequence and share at least 98%
sequence identity. Finally, remaining results are collected into synonym lists
by comparison against each other, using the same sequence synonym metric, so
that each result ends up in a synonym list of one or more members.
The members of each list are then sorted by accession type, e-value, descending
length of maximum ungapped run, descending maximum match count, and GenBank
accession. In sorting, accessions are ranked by preferred type based on prefix:
"NM", all others, "XM", "NT", and "NC". Using the first member of each list,
the lists themselves are then sorted by preferred accession type, e-value, and
accession.
Finally, the first member of each synonym list is taken to represent a distinct
result. If only one such result remains, it is labeled a Putative ID, otherwise
we have Multiple IDs.
If the above process finds no results, we try again with less stringent
parameters: percent run >= .80 and percent identity >= .40 and results are not
limited to mouse. No 3prime statistic is calculated. The protocol for grouping
synonyms remains as before. Results found here are labeled Homolog IDs.
Localization Protocol
The sequence tag is reverse-complemented if the tag orientation is '+/-' to the
genome. The sequence is localized against the current mouse build using a
locally installed UCSC BLAT server, gfServer v 32x1 and gfClient v 32, both
with default parameters. Blat results are filtered to removed results against
"random" chromosomes, then filtered through pslReps, again with default
parameters. These results are sorted by descending number of: matched
bases, mis-matched bases, sequence tag gap bases, and genomic span. The
sorted list is then filtered to drop any result where the number of matching bases is
less than 50% of the query sequence length. Next, we discard results where the
number of matching bases is less than 90% of the highest matching result.
Finally, in an attempt to discard matches against pseudogenes, we discard a
result if the span of the genome match is more than 100 bases less than the
span of the top result. If exactly one result remains, the tag is deemed
Localized.
Reconciliation Protocol
Next, we attempt to reconcile a single genomic coordinate set from the two
identification protocols, Transcript and Localization. In our Reconciliation protocol,
we compare genomic coordinates two at a time. If species, chromosome,
or strand do not match, we have a conflict and no reconciled result. Likewise
if the genomic regions do not overlap. If species, chromosome, and strand
match, then the overlapping genomic region of each comparison is returned and
used for the next comparison, until we get either a conflict or a single set of
genomic coordinates.
In applying the reconciliation protocol, we first reconcile just the Transcript
results, if any. To obtain genomic coordinates, each Transcript protocol result
is run through the localization protocol using a match length filter cut-off of
80% in place of the 50% used for tags. Transcripts which fail to localize are
discarded. The remaining transcript coordinates are reconciled as above. The
sequence tag localization protocol returns at most one result, so no separate
reconciliation of that protocol is needed.
If we have a result from each of the two protocols, the results are reconciled
as above. If a conflict is observed, the tag's identification status is
'Conflict' and the reason recorded. If an overlapping region is returned, it is
assigned to the sequence tag and the tag's status becomes 'Localized +
Transcript'. If the Transcript result is a Homolog ID, then the Localization
result is treated as if it were the only result. If we have only a Transcript
result or only a Localization result, then the sequence tag's identification
status becomes 'Transcript Only' or 'Localized Only', respectively; if neither
a transcript nor a localization result remains, the sequence tag is deemed
'Unidentified'.
Results are stored in the database and presented on the web site.
The results of the reconciliation process are displayed on the web site using the
following symbols:
L (Localized): the cell line sequence was successfully localized
T (Transcript): a transcript matching the cell line sequence was found
U (Unidentified): the cell line sequence was not able to be localized, nor was a transcript found
C (Conflict): the cell line sequence was localized and a transcript was found, but they disagree
P (Processing): the cell line is in the pipeline processing queue.
This can be expressed as the following 2X2 table:
Gene Association
If the sequence tag is assigned genomic coordinates by the Identification
stage, those coordinates are used to search for genes identified by
Entrez and Ensembl.
IGTC maintains a local database of Entrez gene information
that includes genomic mapping data.
The IGTC uses as its reference genome build the build in use at the
time by Entrez Gene.
If the sequence tag is assigned genomic coordinates by the pipeline's
Identification stage, the local Entrez is searched for genomic
coordinates that overlap with the tag coordinates. If exactly one overlapping
gene is found, it is associated with the sequence tag.
Some period of time may elapse after the release of a new mouse genome and its
adoption by Ensembl. (Ensembl provides a large number of automatically
generated annotations that can take significant time to compute and verify.)
During these times of build "skew," we relocalize all sequence tag results to
the older Ensembl mouse genome build and rerun the entire reconciliation
protocol using that build. Regardless of the mouse build used, the sequence
tag's coordinates, if any, are used to search the local Ensembl database for
overlapping genomic coordinates. If exactly one overlapping gene is found, it
is associated with the sequence tag.
Additional Annotation
The IGTC pipeline provides additional information about the gene associated with a
trapped cell line so that researchers can decide whether a mutant mouse derived
from this cell line will be appropriate for their experiments.
While information on
a gene is generally available at a number of sites, cross-references and links directly
from the IGTC site provide important context for the researcher.
The IGTC pipeline cross-references to annotations at Ensembl, Entrez, and MGI, providing
links to expression data, pathway information, literature
references, gene ontology (GO) terms, phenotype information, and a variety
of additional data.
Tying together the genes and the gene annotations is the assignment of the MGI symbol.
The final step in the processing pipeline produces an image showing an alignment
of all cell line sequences against the full-length cDNA
of the trapped gene and its exon boundaries (see Figure
3).
Figure 3. Example alignment image from IGTC web
site. The coding sequence of a gene is shown in magenta and exons are
shown in blue. Red bars represent the cell line sequences associated
with that gene. (See individual images for a more extensive legend.)