Introduction To Sequence Analysis |
|
Exercises in Data Retrieval and Using Blast Searches
[The NCBI web site and its parts are updated periodically, therefore
results given below may change with time.]
Step-by-step instructions with screen shots are given at the end of each exercise. Instructions not in the
step by step section of the handout are given in green type.
Screen shots were taken from either Safari or Firefox, depending upon which produced the clearer, more easily read image.
| |
Answer: Lots of data to be explored. |
#1 step by step instructions
| |
enter cystic fibrosis in the for box |
| |
and click Go. |
| 2. |
The returned Entrez page is organized with literature matches in the top box, sequence based information
in the middle one and NLM's resources in the bottom box. Items of possible interest are denoted by numbers
in a white box next to a topic title. |

[search was run on 2/1/2008]
These results will change with time as more information is generated on the topic.
There are 151 OMIM entries [catalog of human genes and genetic disorders].
OMIM background information with links off to help and frequently asked questions
The sequence based information section contains data on cystic fibrosis from all species.
Some of these topics are sensitive to the addition of species information to the search query.
OMIM and MeSH are examples of this.
| 3. |
Return to the previous window. To restrict the sequence data to that from humans, add homo sapiens to the current cystic
fibrosis term in the Search across databases box. |
| 4. |
The numbers of matches for the nucleotide, protein and gene topics have decreased, but there is still a very
large number of items to sift through. |
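The same species restriction can be expressed outside the web form. The sketch below is an illustration only: it assumes the public NCBI E-utilities ESearch endpoint and the [Organism] field tag, neither of which is part of this handout, and it only builds the query URLs without contacting NCBI.

```python
from urllib.parse import urlencode

# Assumption: the public NCBI E-utilities ESearch endpoint (not used in the handout itself).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(database, keywords, organism=None):
    """Return an ESearch URL for `keywords`, optionally limited to one organism."""
    term = keywords
    if organism:
        # The [Organism] field tag limits matches to the named species.
        term = f"{keywords} AND {organism}[Organism]"
    return ESEARCH_URL + "?" + urlencode({"db": database, "term": term})


# One URL per sequence database mentioned in the exercise.
human_urls = {
    db: build_search_url(db, "cystic fibrosis", organism="homo sapiens")
    for db in ("nucleotide", "protein", "gene")
}
for db, url in human_urls.items():
    print(db, url)
```

Opening one of the printed URLs in a browser returns the match count for that database, which can be compared with the count obtained without the organism restriction.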
| |
Hints: |
Perform an Entrez Gene search (http://www.ncbi.nlm.nih.gov) to find the other genes and their
function or relationship to the disease. |
| S100A8 |
- |
cystic fibrosis antigen |
| TGFB1 |
- |
mutations modify severity of pulmonary disease in cystic fibrosis patients |
| |
- |
protein expression correlates with portal tracts showing histological abnormalities associated
with cystic fibrosis liver disease |
| GOPC |
- |
CFTR binding |
| ADRB2 |
- |
2002 polymorphisms contribute to clinical severity and disease progression in cystic fibrosis
2005 - transfected beta3 not beta2-adrenergic receptors regulates CFTR activity via new pathway |
| SLC9A3R1 |
- |
6/2007 plays a role in the turnover of CFTR at the cell surface |
| |
- |
5/2007 modulation of the expression of CFTR protein partners, like NHE-RF1, can rescue
sequence-deleted CFTR activity. |
| ABCB1 |
- |
study to see how the common cystic fibrosis mutation might disturb transmembrane segments of the protein using ABCB1 as a model
ABCB1 expression increases ATP release in respiratory cystic fibrosis cells potential clinical benefits discussed |
#2 step by step instructions
| |
change the Search option from All Databases to Gene using the pull down menu, |
| |
enter homo sapiens cystic fibrosis in the for box |


| |
Click on at least five diverse hits below the CFTR gene, finding out their relationship to cystic fibrosis. Ignore any gene that doesn't
have a NCBI Reference Sequences (RefSeq) section. |
| |
Here is what one of the results pages looks like. |




| |
Check out the Gene References into Function section of the Bibliography part of the page, if it exists. If it has no cystic fibrosis hits, click on the PUBMED links in
this section and scan through the titles for mention of cystic fibrosis. |
| |
|
| |
The given example (S100A8) had 4 pages of references and the cystic fibrosis ones were on the last page. |
| |
|
| |
If no papers are listed with cystic fibrosis in the description, check out the
OMIM link on the side of the page or the links in the GeneOntology section. |
| CFM1 |
- |
no RefSeq data (ignored) |
| CFM2 |
- |
no RefSeq data (ignored) |
| S100A8 |
- |
cystic fibrosis antigen |
| TGFB1 |
- |
mutations modify severity of pulmonary disease in cystic fibrosis patients |
| TGFB1 |
- |
protein expression correlates with portal tracts showing histological abnormalities associated with cystic fibrosis liver disease |
| GOPC |
- |
CFTR binding |
| ADRB2 |
- |
2002 polymorphisms contribute to clinical severity and disease progression in cystic fibrosis
2005 - transfected beta3 not beta2-adrenergic receptors regulates CFTR activity via new pathway |
| SLC9A3R1 |
- |
6/2007 results indicate that NHERF1 plays a role in the turnover of CFTR at the cell surface, and that rDeltaF508 CFTR at the cell
surface remains highly susceptible to degradation |
| SLC9A3R1 |
- |
5/2007 modulation of the expression of CFTR (cystic fibrosis transmembrane conductance regulator) protein partners, like NHE-RF1, can
rescue sequence-deleted CFTR activity |
| ABCB1 |
- |
study to see how the common cystic fibrosis mutation might disturb transmembrane segments of the protein using ABCB1 as a model
ABCB1 expression increases ATP release in respiratory cystic fibrosis cells potential clinical benefits discussed |
| |
Note that there is another database that is relevant for getting clinical information, Online Mendelian
Inheritance in Man (OMIM or MIM). Although OMIM cannot be searched with an actual sequence, it does allow searching
by gene symbol, chromosome location, keywords or other features.
|
| |
Notice that gene names can change over time. SLC9A3R1 used to be called EBP50, NHERF, NHERF1 or NHE-RF1. |
| |
Hints: |
Do a Gene search at NCBI (http://www.ncbi.nlm.nih.gov) and record the codes. Compare the formats of the
mRNA and protein sequences. Run a BLAST search (http://blast.ncbi.nlm.nih.gov). |
gene - ADRB2 adrenergic, beta-2-, receptor, surface
RefSeq codes: mRNA Sequence NM_000024
Product (protein) NP_000015
FASTA format is very concise, limited to the actual sequence and an identification line that starts
with a > symbol. The default format is very verbose, giving all sorts of reference details about the sequence
and a version of the sequence that is more easily read by the user.
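Because FASTA is just an identification line followed by sequence lines, it is easy to read programmatically. The following is a minimal sketch of a reader; the function name and the example record are made up for illustration and are not taken from NCBI.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (identification_line, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new identification line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are joined into one continuous string.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up record for illustration only; not a real database entry.
example = """>example_1 hypothetical protein fragment
ACDEFGHIKL
MNPQRSTVWY
"""
records = parse_fasta(example)
for header, sequence in records:
    print(header, len(sequence))
```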
|
| BLAST searching allows for different types of data entry including the use of accession
codes (such as a RefSeq accession code). |
| ADRB2 contains the 7tm_1 conserved domain signature which is highly conserved across species. |
#3 step by step instructions
| |
change the All Databases Search option to Gene using the pull down menu, |
| |
enter homo sapiens nocturnal asthma in the for box |
| |
and click Go. |
| 2. |
Check the resulting hits to ensure that the summary information on the gene mentions that various types
of changes in this gene are associated with the disease. |
Here are the summary sections of the top three hits.



Only the first one contains a reference to nocturnal asthma.
ADRB2 adrenergic, beta-2-, receptor, surface
| 3. |
Scroll down the page to the NCBI Reference Sequences (RefSeq) section. Record the mRNA sequence and
Product (protein) codes: |
The required information:
mRNA Sequence NM_000024
Product (protein) NP_000015
| 4. |
Click on the mRNA code to see the actual mRNA sequence data. Scroll down the page taking in the
format of the information presented.
|
| 5. |
Scroll back to the top of the page and change the Display option from GenBank to FASTA.
The format automatically changes. Note the difference. FASTA format is the sequence format required by many
database searching programs.
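The switch between the default display and FASTA can also be requested directly. This sketch assumes the NCBI E-utilities EFetch endpoint with its rettype parameter (an assumption, since the handout only uses the web pages); it builds the URLs for the two RefSeq codes recorded above and does not download anything.

```python
from urllib.parse import urlencode

# Assumption: the public NCBI E-utilities EFetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_fetch_url(database, accession, fasta=True):
    """Return an EFetch URL for one record, as FASTA or as the verbose flat file."""
    if fasta:
        rettype = "fasta"
    else:
        # "gb" is the GenBank flat file (nucleotide), "gp" the GenPept flat file (protein).
        rettype = "gb" if database == "nuccore" else "gp"
    query = urlencode(
        {"db": database, "id": accession, "rettype": rettype, "retmode": "text"}
    )
    return EFETCH_URL + "?" + query


mrna_fasta_url = build_fetch_url("nuccore", "NM_000024")
protein_fasta_url = build_fetch_url("protein", "NP_000015")
protein_flat_url = build_fetch_url("protein", "NP_000015", fasta=False)
print(mrna_fasta_url)
print(protein_fasta_url)
print(protein_flat_url)
```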
|
| 6. |
Click back to the Entrez Gene page and repeat this process with the protein code.
|
| 7. |
After noting the difference, click on the NCBI logo at the top of the page. |
| |
From the blue navigation bar on the main NCBI page, click on BLAST. |
| 8. |
From the main BLAST page, click on protein blastp in the Basic BLAST section. |
| |
In the blastp suite page, click on the ? icon at the top of the
box in the Enter Query Sequence section to find out about what sort of inputs this form accepts. Clicking the
more... link provides additional details. |
| |
After reading the presented information, click on the "?" icon again to close the information block and then enter the protein code
into the Enter Query Sequence box. Once this is done, information appears in the Job Title box. |
| |
In the Choose Search Set section of the page, start to enter
the term Vertebrata into the Organism box. |
| |
As the term is entered, matching Entrez terms start to
appear. When enough of the word is entered to find the desired term, select this term from the list. |
| |
Clicking the BLAST button at the bottom of the page starts the search. |
| |
If results are to be displayed in a new window, click on the "Show results
in a new window" box prior to clicking the BLAST button. |
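For reference, the form settings chosen in this step can be written down as a set of request parameters. The sketch below assumes the parameter names of the NCBI BLAST URL API (CMD, PROGRAM, DATABASE, QUERY, ENTREZ_QUERY); treat these names as assumptions to check against the current NCBI documentation. Nothing is submitted; the code only assembles the request.

```python
from urllib.parse import urlencode

# Assumption: the NCBI BLAST URL API entry point and parameter names.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_blastp_request(query, database="nr", organism=None):
    """Return the parameters for a protein BLAST submission as a dict."""
    params = {
        "CMD": "Put",          # submit a new search
        "PROGRAM": "blastp",   # protein query against protein database
        "DATABASE": database,
        "QUERY": query,        # an accession code or a FASTA sequence
    }
    if organism:
        # Mirrors the Organism box in the Choose Search Set section.
        params["ENTREZ_QUERY"] = f"{organism}[Organism]"
    return params


params = build_blastp_request("NP_000015", organism="Vertebrata")
request_body = urlencode(params)
print(BLAST_URL)
print(request_body)
```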
| |
Protein searches get around the problem of multiple codons coding for the same amino acid, which impacts
nucleotide searches. However, depending on the information sought, a protein search is not always possible. |
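The codon problem can be seen with a small example. The two DNA sequences below are made up for illustration; the codon table is a partial extract of the standard genetic code containing only the codons used here.

```python
# Partial codon table (standard genetic code); only the codons used in this example.
CODONS = {
    "ATG": "M",
    "CTT": "L", "TTG": "L",
    "AGA": "R", "CGT": "R",
    "TCA": "S", "AGC": "S",
}


def translate(dna):
    """Translate a DNA string, three bases at a time, using the partial table."""
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, len(dna), 3))


def count_mismatches(a, b):
    """Count positions at which two equal-length sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


seq_a = "ATGCTTAGATCA"  # made-up coding sequence
seq_b = "ATGTTGCGTAGC"  # different codons, same amino acids

print(translate(seq_a), translate(seq_b))
print(count_mismatches(seq_a, seq_b), "of", len(seq_a), "bases differ")
```

The two nucleotide sequences differ at more than half of their positions, yet the proteins they encode are identical, which is why a protein search finds the relationship more easily.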
| 9. |
It may take a few seconds for the search to be completed. While waiting, click on the 7tm_1 in the image to find
out about the conserved domain that was found in the sequence. |
Initial Conserved Domains page.
Clicking on the + symbol at the beginning of the green highlighted line produces the full version of the page.
| |
7tm_1 indicates that the protein being searched with contains the transmembrane receptor signature of the
rhodopsin family of transmembrane proteins.
This signature is located in residues 50 to 326 of the sequence.
Close the popup window.
Note the length of the query sequence; this may be given on the Query line, the Job Title line or in the conserved domain image.
413 letters
Wait for the results page to appear. |
| |
Here is a screen shot of the top part of the Blast results page. |



| 10. |
Scroll down the results page past the image with its colored horizontal bars to the
Sequences producing significant alignments section. |
| |
|
| |
Scores are based on the length of the query sequence and the size of the database. Short sequences
will never produce great scores. To get an E value of 0.0 requires a match of at least 330 characters.
A very long sequence could easily have a match this long and still not have a match that covers a
significant portion of the query sequence. Always look at the resulting alignment. The mathematics
of the process can sometimes result in the strange ordering of hits. |
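The dependence on query length and database size can be made concrete with the standard relation from BLAST statistics, E = m * n / 2^S, where m is the query length, n the database length and S the bit score. This is a simplified sketch: BLAST itself uses corrected "effective" lengths, and the database size used below is a made-up number chosen only for illustration.

```python
def expected_hits(bit_score, query_length, database_length):
    """E value from the relation E = m * n / 2**S (raw lengths, no edge correction)."""
    return query_length * database_length / 2.0 ** bit_score


QUERY_LENGTH = 413            # length of the query protein in this exercise
DATABASE_LENGTH = 2_000_000_000  # hypothetical total residues in the database

# The same bit score becomes less significant as the database grows.
for bits in (50, 100, 200):
    small = expected_hits(bits, QUERY_LENGTH, DATABASE_LENGTH // 10)
    large = expected_hits(bits, QUERY_LENGTH, DATABASE_LENGTH)
    print(bits, small, large)
```

Very high bit scores give E values so small that the results page rounds them to 0.0.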
| |
|
| |
A hit line gives the database the hit is from, its accession code, a description of the sequence from the database, its Bits score and finally
the E value. Hits in the list are ordered by their E value, then by their Bits score, which reflects the
length of their actual match. Enough of the description may be given to see what species the match
is from. |
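The ordering rule described above (E value first, then Bits score) can be sketched as a two-part sort key. The hit names and numbers below are invented for illustration and are not taken from the actual results page.

```python
# Invented hits for illustration: accession label, Bits score, E value.
hits = [
    {"accession": "hit_C", "bits": 310.0, "evalue": 1e-85},
    {"accession": "hit_A", "bits": 830.0, "evalue": 0.0},
    {"accession": "hit_B", "bits": 845.0, "evalue": 0.0},
]

# Smallest E value first; among equal E values, the highest Bits score first.
ordered = sorted(hits, key=lambda hit: (hit["evalue"], -hit["bits"]))
order = [hit["accession"] for hit in ordered]
print(order)
```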
| |
|
| |
Clicking on the link given on the left side of a hit line goes off to the actual sequence information.
Clicking on the right side link moves down to the alignment data for that hit. |
| |
|
| |
Notice that there are over 30 hits with an E value of 0.0 at the top of this list and that the protein
code entered is at the top of the list. There are about 130 hits in the list which mention
ADRB2, beta-2 adrenergic receptor or variations thereof before the sequence description changes
to something else. The first 12 hits are all from man with from 0 to 2 mismatches in the alignment.
NCBI used to make an effort to remove redundant sequences, but the size of the database increased to
such an extent that it was no longer possible to do this quickly enough so that it wouldn't impact
the processing of new data.
When an accession code begins with XP_, it means that the data is the results of an automated
analysis process. This situation usually occurs when a genome sequencing project is first being analyzed.
These sequences have not been checked for accuracy and can be much longer or shorter than their homologs
from more mature genome studies. These sequences usually have their description start with PREDICTED:.
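A quick way to flag such entries in a hit list is to look at the accession prefix. The sketch below covers only four common RefSeq prefixes (NM_, NP_, XM_, XP_); the function names are made up for illustration, and other prefixes exist that are not handled here.

```python
# Four common RefSeq accession prefixes; this table is deliberately incomplete.
REFSEQ_PREFIXES = {
    "NM_": "mRNA",
    "NP_": "protein",
    "XM_": "predicted mRNA (automated analysis)",
    "XP_": "predicted protein (automated analysis)",
}


def describe_accession(accession):
    """Return a short description of a RefSeq accession based on its prefix."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "prefix not covered by this sketch")


def is_predicted(accession):
    """True when the accession comes from an automated prediction (XM_ or XP_)."""
    return accession.upper().startswith(("XM_", "XP_"))


for code in ("NP_000015", "NM_000024", "XP_522330"):
    print(code, describe_accession(code), is_predicted(code))
```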
The letters enclosed in colored boxes to the right of the hit line indicate that there is additional information
available about that matching sequence elsewhere. A boxed U means that there is Unigene data. A boxed G indicates
that there is Entrez Gene data. The boxed S means that there is structural data.
Check out some of the hits beyond the 0.0 E values and determine where the match is actually taking place within
the query sequence.
PREDICTED: similar to beta-2 adrenergic receptor [Monodelphis domestica] Length=404 (opossum)
PREDICTED: similar to beta-2 adrenergic receptor, [Gallus gallus] Length=397 (chicken)
beta-2 adrenergic receptor [Homo sapiens] Length=275 (man)
beta-2 adrenergic receptor [Macaca mulatta] Length=275 (rhesus monkey)
beta-2 adrenergic receptor [Hylobates concolor] Length=275 (crested gibbon)
beta-2 adrenergic receptor [Ateles fusciceps] Length=275 (spider monkey)
The match is happening in the 7tm_1 region of the sequence which appears to be highly conserved. |
| |
Hints: |
Use the protein accession code from the previous exercise and run a protein
BLAST search (http://blast.ncbi.nlm.nih.gov).
This time, instead of using the default database, use the swissprotein database and a structure database.
Compare the available structure information to make the decision. |
swissprotein - ADRB2_HUMAN
| transmembrane segments: |
1. 35 - 58 |
2. 72 - 95 |
3. 107 - 129 |
4. 151 - 174 |
5. 197 - 220 |
6. 275 - 298 |
7. 306 - 329 |
|
pdb match - the Human Beta2 Adrenoceptor structure covers most of the transmembrane segments
| comparison - The results show that the human proteins being compared are identical to one another. However,
the structure and the swissprotein TMD segments don't agree as to number and location. Perhaps more study needs
to be done on this protein to get the correct TMD locations and a complete structure.
|
#4 step by step instructions
| 2. |
From the main BLAST page, click on protein blast in the Basic BLAST section. |
| |
In the blastp suite page, click on the ? icon at the end of the Database line in the Choose Search Set section to find out
information about the databases that can be used in this protein BLAST search. Clicking on the more... link provides additional
information. Once a suitable structure database name has been located, close the more... page and click on the "?" icon again
to close the information block. |
| |
From the list given the structural database to use is pdb (Protein Data Bank proteins). The swissprotein database (Swissprot protein
sequences) was also listed. |
| |
|
| |
Of the protein databases, swissprotein is considered to have the best annotation. One of the
features they report is transmembrane segment locations when available or predicted. |
| |
|
| 3. |
Change the Choose Search Set Database option from nr to swissprotein using the pull down menu, |
| |
enter the previously found RefSeq protein accession code into the Enter Query Sequence box. |
| |
To speed things up and reduce the size of the output file, restrict the organism searched to humans by starting to enter
homo sapiens in the Organism line. Select the proper line when it appears. |
| |
Start the run by clicking the BLAST button. |
| 4. |
At the top of the actual results page, click on the "Reformat these Results" link. |
| |
This leads off to a form which allows the changing of the produced results. |
| |
The number of descriptions, lines in the image and alignments can be restricted using the Descriptions:, Graphical overview:
and Alignments: pull down menus. Restrict these three options to 10 each and then click on the View report button near the top of
the page.
Here is the top part of the results page. |


| 5. |
Scroll down the results to the significant alignments section and click on the sequence link containing
the term ADRB2_HUMAN. It should be the first one on the list and have a 100% match to
the submitted reference sequence. |
| |
|
| |
link to ADRB2_human |
| |
|
| 6. |
Scroll down the swissprotein data file to the FEATURES section. Then read through the listed features
to find those regions called "Transmembrane region" and record them. |
| |
The first Transmembrane region from the data file |
| |
There are a total of 7 TMD regions |
| |
| transmembrane segments: |
1. 35 - 58 |
2. 72 - 95 |
3. 107 - 129 |
4. 151 - 174 |
5. 197 - 220 |
6. 275 - 298 |
7. 306 - 329 |
|
|
| 7. |
Return to the protein blast page, re-enter the RefSeq accession code if necessary, and change
the database to be used to pdb, and return Organism to its default blank
value. |
| |
Click the BLAST button. |
| 8. |
Wait until the results page appears.
|
| 9. |
The best hit comes from Human Beta2 Adrenoceptor and is a perfect match. |
| |
The alignment does cover the entire area containing the transmembrane segments. The pdb
code is in two parts, the first four alphanumeric characters refer to the structure name and
the character after the | refers to the chain within the structure that has the match.
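The two-part pdb code can be taken apart as follows. This is a minimal sketch; the code "1ABC|A" is a made-up placeholder, not the code of the beta2 adrenoceptor structure.

```python
def split_pdb_hit(code):
    """Split a 'structure|chain' code into its structure name and chain."""
    structure, _, chain = code.partition("|")
    # The structure name is expected to be four alphanumeric characters.
    if len(structure) != 4 or not structure.isalnum():
        raise ValueError(f"unexpected structure name in {code!r}")
    return structure, chain or None


structure, chain = split_pdb_hit("1ABC|A")  # placeholder code for illustration
print(structure, chain)
```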
|
| 10. |
To find out more information about this structure, click on the red boxed S. This goes off to a
"Related Structures" page. |
| |
The image shows that the 100% match is only for the first ~ 370 residues of the
protein. Click on the name of the structure in the lower left hand side of the page. |

| |
Now at the Structure Summary site, more information is given about the actual
structure. Sequence A (chain A) that matches the protein appears to have structural information
for residues 1 to about 245, or the first five TMD sections according to swissprotein. There
are two other parts of the structure which don't appear to be part of the ADRB2 protein. Click
on the pdb link in the top section of the page to find out more. |
| |
Scrolling down the PDB page on this structure to the Molecular Description section
results in finding out that chains L and H are antibodies to human beta2 adrenoceptor protein.
Transmembrane proteins are very difficult to crystallize, and crystallization is the first step in X-ray
diffraction studies. It appears that attaching these antibodies to the protein made
crystallization possible.
Return to the Structure Summary page and look closely at the green and gold image at the
top of the page. The region with the green cylinders is the part of the structure from the ADRB2_HUMAN
protein and the gold regions are the antibodies. Count the number of green cylinders shown. |
| |
There are 6 distinct cylinders. These represent helical structural elements in the protein. TMD sections
for most proteins are expected to be helices. However, according to the swissprotein data, the part of the protein
that was crystallized should only contain 5. Perhaps, additional study needs to be done on this protein to
clarify the number of TMD sections the protein contains and where they are located.
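The count of 5 expected from the swissprotein data can be checked with a few lines of code. The sketch uses the seven segment ranges recorded in step 6 and the approximate residue range (1 to about 245) read from the Structure Summary page; the function name is made up for illustration.

```python
# Transmembrane segments recorded from the swissprotein FEATURES section (step 6).
TM_SEGMENTS = [
    (35, 58), (72, 95), (107, 129), (151, 174),
    (197, 220), (275, 298), (306, 329),
]


def segments_within(segments, start, end):
    """Return the segments that lie entirely inside the residue range start..end."""
    return [(s, e) for s, e in segments if s >= start and e <= end]


# Chain A appears to have structural information for residues 1 to about 245.
covered = segments_within(TM_SEGMENTS, 1, 245)
print(len(covered), "segments fall inside residues 1-245:", covered)
```

Because the upper limit of 245 is only approximate, the check is worth repeating with the exact residue range once it has been read from the structure file.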
|
| |
Hints: |
Determine the proteins associated with the disease by doing an Entrez protein search.
Choose the top hit and check its length. Use the sequence in a NCBI protein BLAST search to try to find possible model animals.
Also use this sequence to find dog sequences. Compare any found dog sequence(s) with the human sequence you started with. |
protein - potassium voltage-gated channel, shaker-related subfamily, member 5
rhesus monkey, dog, rabbit, mouse and rat would be possible animal models
| best dog match |
NP_001006646 600 aa |
| complete dog sequence - Yes, but there is a 15 residue gap starting at residue 72 that
would have to be further investigated before proceeding. |
#5 step by step instructions
| 1. |
Go to http://www.ncbi.nlm.nih.gov, change the All Databases option to Protein and enter the term
pulmonary artery hypertension homo sapiens into the for box |
| |
and click Go. |
| |
The results of the search. |
| 2. |
Choose the top hit, in this case NP_002225. Clicking on the link will take you to the page for the
protein.
|
| 3. |
Check out the length of the protein. The length is the second item on the "LOCUS" line and is a number followed by aa. |
| |
the protein is 613 residues long |
| |
(The sequence to be used in the search is 613 residues long.) |
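Reading the length from the LOCUS line can also be done in code. The line below is a reconstruction of the usual layout with the date left out (the spacing and the trailing fields are assumptions); only the accession and the length come from this exercise.

```python
def locus_length(locus_line):
    """Return (name, length) from a GenPept LOCUS line, using the number before 'aa'."""
    fields = locus_line.split()
    if not fields or fields[0] != "LOCUS":
        raise ValueError("not a LOCUS line")
    unit_index = fields.index("aa")  # the length is the number just before 'aa'
    return fields[1], int(fields[unit_index - 1])


# Reconstructed example line; trailing fields are assumed and the date is omitted.
line = "LOCUS       NP_002225                613 aa            linear   PRI"
name, length = locus_length(line)
print(name, length)
```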
| |
Scan down the presented references to ensure that this protein has some relationship to pulmonary artery hypertension. |
| |
Reference 4 seems to indicate that SNPs which impacted function were found in this protein in patients with idiopathic pulmonary arterial
hypertension. |
| 4. |
To obtain the sequence of NP_002225 for the BLAST search, change the Display option of the page from
GenPept to FASTA using the pull down menu. |
| |
This automatically changes the format to FASTA. |
| |
Copy this data, starting with the ">" and continuing to the end of the sequence.
|
| 5. |
Click on the NCBI logo in the upper left-hand side of the screen. |
| 6. |
From the main NCBI page click on BLAST in the blue navigation bar. |
| |
On the main BLAST page, click on protein blast in the Basic BLAST portion of the page. |
| 7. |
Paste your sequence into the Enter Query Sequence box, |
| |
be sure that the Choose Search Set parameters are at their default values (database nr and organism blank) |
| |
and then click the BLAST button. |
| 8. |
Wait until the results page appears. |
| 9. |
Check out the best hits with a description that appears to be correct. |
| |
Checking out the best hits to determine the quality of the matches and the species
results in the following information. |
| XP_522330 |
Pan troglodytes (chimpanzee) |
602 residues |
Identities = 600/613 (97%) |
| XP_001102294 |
Macaca mulatta (rhesus monkey) |
605 residues |
Identities = 596/616 (96%) |
| KCNA5_MUSPF |
Mustela putorius furo (domestic ferret) |
601 residues |
Identities = 547/613 (89%) |
| NP_001006646 |
Canis lupus familiaris (dog) |
600 residues |
Identities = 538/615 (87%) |
| XP_001495044 |
Equus caballus (horse) |
595 residues |
Identities = 539/614 (87%) |
| NP_001075505 |
Oryctolagus cuniculus (rabbit) |
598 residues |
Identities = 533/616 (86%) |
| NP_037104 |
Rattus norvegicus (Norway rat) |
602 residues |
Identities = 529/615 (86%) |
| NP_666095 |
Mus musculus (house mouse) |
602 residues |
Identities = 530/615 (86%) |
| NP_001015552 |
Bos taurus (cattle) |
598 residues |
Identities = 526/614 (85%) |
| NP_001006593 |
Sus scrofa (pig) |
600 residues |
Identities = 524/616 (85%) |
| XP_001368410 |
Monodelphis domestica (gray short-tailed opossum) |
609 residues |
Identities = 472/621 (76%) |
| |
These results would indicate that rhesus monkey [Macaca mulatta] would be the best model. However, dog, rabbit, mouse and rat would all be good animal
models in which to study this gene and its function. |
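The Identities lines listed above can be compared programmatically. This is a minimal sketch that parses a few of the lines from the list and ranks the species by the fraction of identical residues; the whole-number percentage is obtained by rounding down, which reproduces the values shown in the list.

```python
import re

IDENTITY_PATTERN = re.compile(r"Identities\s*=\s*(\d+)/(\d+)")


def identity_counts(line):
    """Return (identical, aligned, percent) parsed from a BLAST Identities line."""
    match = IDENTITY_PATTERN.search(line)
    if match is None:
        raise ValueError(f"no Identities field in {line!r}")
    identical, aligned = int(match.group(1)), int(match.group(2))
    # Whole-number percentage, rounded down.
    return identical, aligned, 100 * identical // aligned


# A subset of the results listed above.
results = {
    "chimpanzee": "Identities = 600/613 (97%)",
    "rhesus monkey": "Identities = 596/616 (96%)",
    "dog": "Identities = 538/615 (87%)",
    "house mouse": "Identities = 530/615 (86%)",
    "opossum": "Identities = 472/621 (76%)",
}


def identity_fraction(species):
    identical, aligned, _ = identity_counts(results[species])
    return identical / aligned


ranked = sorted(results, key=identity_fraction, reverse=True)
for species in ranked:
    print(species, identity_counts(results[species]))
```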
| 10. |
Return to the blastp suite submission page and change the Organism option in the Choose Search Set section from
blank to Canis familiaris by starting to enter this term into the field. Highlight
the term when it appears on the list. |
| |
Click the BLAST button to start the run. |
| |
Confirm that the hits are from dog and determine how
close the length is to that of the starting human sequence. The first one looks the most likely. |
| |
Looking at these results there is only one real area of concern. The 15 residue gap starting at position 72 would
need to be looked at to find out if there are any known functions or features associated with this region in the human sequence. If so, then
the dog protein wouldn't be a good study model. |
| |
Hints: |
Find the mRNA FASTA formatted sequence for the AGPAT6 mouse gene by doing an Entrez Gene
search at NCBI (http://www.ncbi.nlm.nih.gov). Then run a BLAST search at the International Gene Trap Consortium
(IGTC) site (http://www.genetrap.org) to see if such knockouts exist. |
There are four possible knockouts. One is an intronic trap and its interaction with the gene is not shown,
the other three (DTM030, XS0453 and XS0575) are. It is up to the user to decide which of these three shown gene
traps would be most useful and best meet their needs.
Information on how to order a cell line is provided in the Sequence Tag Information section of the cell line page.
In this case:
DTM030 is from BayGenomics and can be ordered from MMRRC
http://www.mmrrc.org/catalog/StrainCatalogSearchForm.jsp?pageSize=25&jboEvent=Search&SourceCollection=BayGenomics
XS0453 and XS0575 are from the Sanger International Gene Trap Resource and can be ordered from the MMRRC
http://www.mmrrc.org/catalog/StrainCatalogSearchForm.jsp?pageSize=25&jboEvent=Search&SourceCollection=SIGTR
#6 step by step instructions
last updated 3/14/2008
|
|