International Gene Trap Consortium
IGTC Pipeline

IGTC Annotation Pipeline

As described in the Overview Tutorial, a gene trap experiment produces a sequence that includes some of the vector along with sequence from the trapped gene. The vector sequence is removed, and bioinformatics techniques are used to identify the trapped gene and its location in the mouse genome. Using this information, researchers can select cell lines to generate mice with the desired mutant genes. The IGTC pipeline annotates cell line sequences with the identities of the trapped genes (transcripts) and their genomic locations. The pipeline is entirely automated, allowing us to annotate large numbers of new trap lines quickly, according to systematic and reproducible procedures.

The steps in the IGTC pipeline are shown in Figure 1. A "feeder" loads data from dbGSS and initiates the pipeline on each of the new cell line sequence tags, or cell line sequence tag that is being reprocessed. The pipeline is described in three main processing stages: Coordinate Identification, Gene Association, and Additional Annotation. These are labeled in Figure 1 as Identification, Association, and Annotation.

Figure 1. The IGTC pipeline.

The goal of Identification is to assign a single genomic coordinate set to each sequence tag. This stage employs both a BLAST-based Transcript protocol and a BLAT-based Localization protocol. These two protocols provide overlapping results (see Figure 2). The Association stage uses the coordinates to assign an Ensembl and an Entrez gene to the tag. The Annotation stage collects additional information from public databases related to the associated genes. A detailed description of the three pipeline stages follows.

Figure 2. Advantage of utilizing two different protocols (transcript search and genomic localization) to characterize cell line sequences. (Overlap is not to scale).

Coordinate Identification
This stage is further broken down into three distinct protocols:

  • Transcript Protocol: determine the transcript(s) associated with this cell line.
  • Localization Protocol: determine the genomic coordinates of the sequence.
  • Reconciliation Protocol: determine if the two identification protocols agree or disagree.

    Transcript Protocol

Each sequence tag is queried against the GenBank 'nt' nucleotide database using blastall v 2.2.12. The '-S' flag is used to limit results to those with the correct orientation relative to the sequence tag, as communicated by IGTC members. Results returned are limited to 50; maximum e-value is 1e-5.

Each result is examined for the longest "ungapped" run, i.e., containing no gaps exceeding 3 bases. Statistics are gathered for run length and number of matching bases. N's are treated as mismatches. Percent run is calculated as the maximum ungapped run length / sequence tag length; percent identity is calculated as matching base count / maximum ungapped run length. The last 60 bases of the sequence tag are separately examined. Contrary to the above, an N in both the query and the target qualifies here as a matching base. A so-called 3prime statistic is calculated as matching base count / (base count + query gap count).

Each result earns the title of putative id if a) 3prime = 1.00 or b) percent run >= .90 and percent identity >= .95. Results without a GenBank record or to a non-mouse species or without a sequence or with a sequence exceeding 65536 bases in length are discarded.

Putative id results are then sorted into lists of synonyms. The first grouping attempt is by MGI gene ID. Results not associated to an MGI gene are compared by sequence to those that are. Two sequences are synonyms if they differ in length by no more than 2% of the shorter sequence and share at least 98% sequence identity. Finally, remaining results are collected into synonym lists by comparison against each other, using the same sequence synonym metric, so that each result ends up in a synonym list of one or more members.

The members of each list are then sorted by accession type, e-value, descending length of maximum ungapped run, descending maximum match count, and GenBank accession. In sorting, accessions are ranked by preferred type based on prefix: "NM", all others, "XM", "NT", and "NC". Using the first member of each list, the lists themselves are then sorted by preferred accession type, e-value, and accession.

Finally, the first member of each synonym list is taken to represent a distinct result. If only one such result remains, it is labeled a Putative ID, otherwise we have Multiple IDs.

If the above process finds no results, we try again with less stringent parameters: percent run >= .80 and percent identity >= .40 and results are not limited to mouse. No 3prime statistic is calculated. The protocol for grouping synonyms remains as before. Results found here are labeled Homolog IDs.

    Localization Protocol

The sequence tag is reverse-complemented if the tag orientation is '+/-' to the genome. The sequence is localized against the current mouse build using a locally installed UCSC BLAT server, gfServer v 32x1 and gfClient v 32, both with default parameters. Blat results are filtered to removed results against "random" chromosomes, then filtered through pslReps, again with default parameters. These results are sorted by descending number of: matched bases, mis-matched bases, sequence tag gap bases, and genomic span. The sorted list is then filtered to drop any result where the number of matching bases is less than 50% of the query sequence length. Next, we discard results where the number of matching bases is less than 90% of the highest matching result. Finally, in an attempt to discard matches against pseudogenes, we discard a result if the span of the genome match is more than 100 bases less than the span of the top result. If exactly one result remains, the tag is deemed Localized.

    Reconciliation Protocol

Next, we attempt to reconcile a single genomic coordinate set from the two identification protocols, Transcript and Localization. In our Reconciliation protocol, we compare genomic coordinates two at a time. If species, chromosome, or strand do not match, we have a conflict and no reconciled result. Likewise if the genomic regions do not overlap. If species, chromosome, and strand match, then the overlapping genomic region of each comparison is returned and used for the next comparison, until we get either a conflict or a single set of genomic coordinates.

In applying the reconciliation protocol, we first reconcile just the Transcript results, if any. To obtain genomic coordinates, each Transcript protocol result is run through the localization protocol using a match length filter cut-off of 80% in place of the 50% used for tags. Transcripts which fail to localize are discarded. The remaining transcript coordinates are reconciled as above. The sequence tag localization protocol returns at most one result, so no separate reconciliation of that protocol is needed.

If we have a result from each of the two protocols, the results are reconciled as above. If a conflict is observed, the tag's identification status is 'Conflict' and the reason recorded. If an overlapping region is returned, it is assigned to the sequence tag and the tag's status becomes 'Localized + Transcript'. If the Transcript result is a Homolog ID, then the Localization result is treated as if it were the only result. If we have only a Transcript result or only a Localization result, then the sequence tag's identification status becomes 'Transcript Only' or 'Localized Only', respectively; if neither a transcript nor a localization result remains, the sequence tag is deemed 'Unidentified'. Results are stored in the database and presented on the web site. The results of the reconciliation process are displayed on the web site using the following symbols:

  • L (Localized): the cell line sequence was successfully localized
  • T (Transcript): a transcript matching the cell line sequence was found
  • U (Unidentified): the cell line sequence was not able to be localized, nor was a transcript found
  • C (Conflict): the cell line sequence was localized and a transcript was found, but they disagree
  • P (Processing): the cell line is in the pipeline processing queue.
  • This can be expressed as the following 2X2 table:

      Transcript Found?
    Yes No
    Localized? Yes L T or C L
    No T U

    Gene Association
    If the sequence tag is assigned genomic coordinates by the Identification stage, those coordinates are used to search for genes identified by Entrez and Ensembl.

    IGTC maintains a local database of Entrez gene information that includes genomic mapping data. The IGTC uses as its reference genome build the build in use at the time by Entrez Gene. If the sequence tag is assigned genomic coordinates by the pipeline's Identification stage, the local Entrez is searched for genomic coordinates that overlap with the tag coordinates. If exactly one overlapping gene is found, it is associated with the sequence tag.

    Some period of time may elapse after the release of a new mouse genome and its adoption by Ensembl. (Ensembl provides a large number of automatically generated annotations that can take significant time to compute and verify.) During these times of build "skew," we relocalize all sequence tag results to the older Ensembl mouse genome build and rerun the entire reconciliation protocol using that build. Regardless of the mouse build used, the sequence tag's coordinates, if any, are used to search the local Ensembl database for overlapping genomic coordinates. If exactly one overlapping gene is found, it is associated with the sequence tag.

    Additional Annotation
    The IGTC pipeline provides additional information about the gene associated with a trapped cell line so that researchers can decide whether a mutant mouse derived from this cell line will be appropriate for their experiments. While information on a gene is generally available at a number of sites, cross-references and links directly from the IGTC site provide important context for the researcher. The IGTC pipeline cross-references to annotations at Ensembl, Entrez, and MGI, providing links to expression data, pathway information, literature references, gene ontology (GO) terms, phenotype information, and a variety of additional data. Tying together the genes and the gene annotations is the assignment of the MGI symbol.

    The final step in the processing pipeline produces an image showing an alignment of all cell line sequences against the full-length cDNA of the trapped gene and its exon boundaries (see Figure 3).

    Figure 3. Example alignment image from IGTC web site. The coding sequence of a gene is shown in magenta and exons are shown in blue. Red bars represent the cell line sequences associated with that gene. (See individual images for a more extensive legend.)

    closet