HUMAN GENOME PROJECT AND BIOINFORMATICS
The Human Genome Project is an extensive international effort to identify and sequence all the genes in the human genome. This information will be a resource for the investigation of hereditary diseases as well as normal gene structure and expression. The following techniques, which have been used to analyze the 23 pairs of human chromosomes, were developed to map the genomes of other organisms.
1. Genomic Libraries. A cellular genome is extracted and fragmented with restriction enzymes. A so-called genomic library is obtained when each of the fragments is cloned. The nucleotide sequence of each cloned DNA fragment can be determined.
2. Chromosomal Walking. This technique allows for sequencing DNA fragments of several hundred kb. A series of overlapping fragments from the genome is cloned. Once the first fragment has been sequenced, it provides the sequence information required to construct a probe to identify the clone containing the neighboring DNA sequence. After the second clone is sequenced, the process is repeated until all the overlapping fragments have been sequenced.
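The walking procedure can be illustrated with a toy example. The following Python sketch is not a laboratory protocol or any published program; the function name walk, the probe_len parameter and the short clone sequences are all invented for illustration. It treats the last few bases of the region sequenced so far as the probe, looks for a clone that contains the probe and extends beyond it, and repeats until no clone extends the walk any further. Real probes work by hybridization rather than exact string matching, so this is only a simplified picture.

```python
def walk(first_clone, clones, probe_len=4):
    """Extend a sequenced region clone by clone (hypothetical helper)."""
    contig = first_clone
    remaining = [c for c in clones if c != first_clone]
    while True:
        # The end of the region sequenced so far serves as the probe.
        probe = contig[-probe_len:]
        next_clone = None
        for clone in remaining:
            position = clone.find(probe)
            # Accept a clone only if it carries new sequence past the probe.
            if position != -1 and position + probe_len < len(clone):
                next_clone = clone
                break
        if next_clone is None:
            return contig
        # Add the part of the neighboring clone that lies beyond the probe.
        contig += next_clone[position + probe_len:]
        remaining.remove(next_clone)


# Made-up overlapping clones covering one short region.
clones = ["ATGCGTAC", "GTACCTGAAG", "GAAGTCCA"]
assembled = walk(clones[0], clones)
print(assembled)
```

Each pass through the loop corresponds to one step of the walk: sequence a clone, derive a probe from its end, and pick up the neighboring clone.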
Because the human genome is complex and enormous, faster and more convenient
techniques have been developed. For example, genomic libraries are now
routinely constructed from individual human chromosomes obtained by using
automated machines called fluorescence-activated cell-sorters. Similarly,
vast improvements in computers that store and retrieve sequence information
have been made. Despite intense and innovative research, progress in sequencing
the genome, as well as its cost effectiveness, must be improved substantially
for its goal to be met by the year 2005.
Bioinformatics is the convergence of two technology revolutions: the explosive growth in biotechnology, paralleled by the equally explosive growth in information technology. This is illustrated, in an uncanny way, by the fact that both the size of GenBank and the power of computers have been doubling at about the same rate for many years. The term bioinformatics still carries with it enough hype to make investigating ‘biology with computers’ seem cutting edge. Apart from the obvious data management applications of bioinformatics, computational biology research is divided into two main schools: the analysis and interpretation of data, and the development of new algorithms and statistics.
Finding the genes within the genomic sequence is a massive task in itself. Once identified, otherwise uncharacterized coding regions must be assigned a function. Thereafter, the interactions between the genes and gene products must be understood at all levels, not merely in the context of the pathways within and between cells but also in terms of the evolution of gene families within and between species. These questions can all be addressed using bioinformatics.
Bioinformatics touches all of biology, and straightforward access to data via the Internet means that a wealth of information is available, literally at our fingertips. However, the newcomer to bioinformatics might be discouraged by the initially daunting computational and mathematical content. Others might simply be confused by the language and terminology of bioinformatics. To help overcome these barriers, this exercise is intended to give students a sampling of Internet computational tools for the analysis of biological molecules, with emphasis on proteins and nucleic acids.
The amount of biological information accessible via the World Wide Web is truly astonishing, and the volume of data is increasing at a fast pace. It is important for the bench scientist to have easy and efficient ways of wading through the data and finding what is important to his or her research. Although one can browse the data, a far more efficient access method is to perform a search. Depending on the type of data at hand, there are two basic ways of searching: using descriptive words to search text databases, or using a nucleotide or protein sequence to search a sequence database. Entrez, Sequence Retrieval System (SRS) and DBGET allow text searching of multiple molecular biology databases and provide links to relevant information for entries that match the search criteria. In these retrieval systems, queries can be as simple as the accession number of a newly published sequence or as complex as a search of multiple database fields for specific terms. The advantage of these systems is that they not only return matches to a query, but also provide handy pointers to additional important information in related databases. The three differ in the databases that they search and the links that they make to other information.
Entrez, a molecular biology database and retrieval system, was developed
by the National Center for Biotechnology Information (NCBI). It is an entry
point for exploring distinct but integrated databases. It provides access
to nucleotide and protein sequence databases, a molecular modeling 3-D
structures database (MMDB), a genomes and maps database, and the literature.
The literature database, PubMed, provides excellent and easy access to
MEDLINE articles. The taxonomy database contains more than 23,000 different
species and allows retrieval of DNA and protein sequences for any taxonomic
group. Of the three, Entrez is the easiest to use, but it offers more limited
information to search.
National Center for Biotechnology Information (NCBI)
Entrez Home http://www.ncbi.nlm.nih.gov/Entrez/
Entrez Help http://www.ncbi.nlm.nih.gov/Entrez/entrezhelp.html
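A text search of the kind described above can also be written down as a URL. The Python sketch below only builds such a URL and does not contact any server. It assumes the ESearch endpoint of NCBI's E-utilities, which is not mentioned in this exercise; the endpoint address, the db, term and retmax parameters, the [Organism] field tag and the helper name build_search_url should all be checked against the current NCBI documentation before use.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities ESearch (not part of this exercise).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(database, term, max_results=20):
    """Return an ESearch URL for a text query (hypothetical helper)."""
    params = {"db": database, "term": term, "retmax": max_results}
    # urlencode escapes spaces and brackets so the query is a valid URL.
    return ESEARCH_URL + "?" + urlencode(params)


# Text search: descriptive words, restricted to one organism.
url = build_search_url("protein", "hemoglobin AND Homo sapiens[Organism]", 5)
print(url)
```

A query by accession number would use the same helper with the accession number as the term.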
SRS is a homogeneous interface to over biological databases that has been developed at the European Bioinformatics Institute (EBI) at Hinxton, UK. The types of databases included are sequence and sequence-related data, metabolic pathways, transcription factors, application results, protein 3-D structure, genome, mapping, mutation and locus-specific mutation databases. You can access and query their contents and navigate among them. The web page listing all the databases contains a link to a description page for each database, which includes the date on which it was last updated. You select one or more of the databases to search before entering your query. Over 30 versions of SRS are currently running on the WWW. Each includes a different subset of databases and associated analytical tools. The SRS databases are well indexed, which reduces the search time.
European Bioinformatics Institute (EBI)
SRS http://srs.ebi.ac.uk/
SRS Help http://srs.ebi.ac.uk/srs5/man/mi
srswww.html
DBGET/LinkDB is an integrated database retrieval system, developed by the
Institute for Chemical Research, Kyoto University, and the Human Genome
Center of the University of Tokyo, that is available through the GenomeNet.
After querying one of these databases, DBGET presents links to associated
information in addition to the list of results. The LinkDB database can
also be searched directly with a specific entry. Another unique feature of
DBGET is its connection with the Kyoto Encyclopedia of Genes and Genomes
(KEGG) database, which is a database of metabolic and regulatory pathways
that has been developed by the same group. DBGET has simpler, but more limited,
search methods than the other two. The bfind command allows searching based on
text terms while the bget command searches by entry name or accession number.
GenomeNet (Kyoto University and University of Tokyo)
DBGET http://www.genome.ad.jp/dbget/dbget2.html
DBGET Help http://www.genome.ad.jp/dbget/dbget
manual.html
I. Databases
1) Brookhaven Protein Databank: a huge database of molecular 3-D structures
2) GenBank & EMBL: nucleic acid sequence database
3) SWISS-PROT, PIR, PRF: protein sequence database
4) PROSITE: dictionary of protein sites and patterns
5) KEGG: pathway and genes database
6) BRITE: biomolecular interaction database
7) PMD: protein mutant database
8) ExPASy Molecular Biology Server: a Swiss site for molecular biology database tools
9) MMDB: Molecular Modeling Database
II. Tools
1) ExPASy-Proteomics Tools: ExPASy tools for proteomics menu
2) WebCutter 2.0: an internet based program for restriction mapping
3) CMS Molecular Biology Resources: a collection of internet tools for DNA analyses
A. Physical Properties
1. Compute pI/MW is a simple tool that computes the theoretical pI (isoelectric point) and MW (molecular weight) for a list of SWISS-PROT database entries or for a user-entered sequence.
2. ProtParam is a more advanced tool that allows computation of various physical and chemical parameters for a given protein stored in the SWISS-PROT database or for a user-entered sequence.
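The molecular weight part of such a calculation is simple enough to sketch by hand. The Python example below is an illustration only and is not the ExPASy implementation: it sums standard average residue masses and adds one water molecule for the free ends of the chain. The function name molecular_weight is invented here, and no attempt is made to compute the pI, which depends on the set of pKa values chosen.

```python
# Average masses (in Da) of amino acid residues, i.e. amino acid minus water.
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528


def molecular_weight(protein):
    """Average molecular weight of an unmodified peptide (hypothetical helper)."""
    protein = protein.upper()
    # One water is added back for the N-terminal H and the C-terminal OH.
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in protein) + WATER_MASS


print(round(molecular_weight("MKTAYIAK"), 2))
```

A sequence containing a letter outside the 20 standard amino acids raises a KeyError, which is a deliberate choice in this sketch: it is better to fail than to report a weight for an ambiguous sequence.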
B. Sequence Properties
1. Protein sequence alignment of raw data with existing database entries can be done using Entrez’s BLAST Homology Search Engine. This is a basic local alignment search tool for both protein and DNA sequences.
2. Protein Machine allows raw DNA sequence data or a known DNA sequence (with accession number) to be translated into a protein sequence.
3. GOR, nnPredict and Predator are simple secondary structure prediction programs. PredictProtein will give a secondary structure prediction as well as predictions about residue solvent accessibility and other esoteric information. Here you might notice that the different programs don’t necessarily agree with each other!
4. If you feel lucky and you think that your protein is similar to one whose structure has been solved, you can try to obtain a 3-D model of your protein with Swiss-Model or at least a structural classification using SCOP.
5. ProtScale is a versatile tool that allows you to compute and represent the profile produced by any amino acid scale on a selected protein. An amino acid scale is defined by a numerical value assigned to each type of amino acid. The most frequently used scales are hydrophobicity or hydrophilicity scales and secondary structure conformational parameter scales, but many other scales exist that are based on different chemical and physical properties of amino acids. This program provides 50 predefined scales taken from the literature.
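The idea of a profile computed from an amino acid scale can be sketched in a few lines of Python. The example below is an assumption-laden illustration and not the ProtScale program: it uses the Kyte-Doolittle hydropathy values as one example of a scale, and the function name hydropathy_profile and its window parameter are invented here. Each point of the profile is the average scale value over a window of consecutive residues.

```python
# Kyte-Doolittle hydropathy values, used here as one example of a scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=3, scale=KYTE_DOOLITTLE):
    """Sliding-window average of a scale along a protein (hypothetical helper)."""
    values = [scale[aa] for aa in protein.upper()]
    # One averaged value per window position; short sequences give an empty list.
    return [
        sum(values[start:start + window]) / window
        for start in range(len(values) - window + 1)
    ]


profile = hydropathy_profile("AILKDE", window=3)
print([round(value, 2) for value in profile])
```

Any other scale can be substituted by passing a different dictionary as the scale argument; wider windows give smoother profiles.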
These and many other tools are located at the ExPASy server at the University of Geneva.
Useful URL: ExPASy Molecular Biology Server http://expassy.hcuge.ch/
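Translating a DNA sequence into a protein sequence, as the Protein Machine tool in the list above does, follows directly from the genetic code. The Python sketch below is a minimal illustration and not the code of any of the tools named here. It assumes the standard genetic code, reads only the first reading frame of the strand given, and stops at the first stop codon; the function name translate is invented for the example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate reading frame 1 up to the first stop codon (hypothetical helper)."""
    dna = dna.upper()
    protein = []
    # Step through complete codons only; trailing bases are ignored.
    for start in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[start:start + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCAAGTAA"))
```

A complete tool would also examine the other two frames and the three frames of the complementary strand.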
1. Physical parameters for a list of GenBank database sequences or for a user-entered sequence can be computed with Physico-Chemical Features Analyses at the CMS Molecular Biology Resource. These parameters include Tm, %GC, and the thermal denaturation profile, among others. This website is a compendium of electronic, Internet-accessible tools and resources for Molecular Biology, Biotechnology, Molecular Evolution, Biochemistry and Biomolecular Modeling.
2. DNA sequence alignment of raw data with existing database entries can be done using Entrez’s BLAST Homology Search Engine. This is a basic local alignment search tool for both protein and DNA sequences. You can also use the programs WuBLAST, ClustalW or ClustalX for alignment of DNA sequences.
WuBLAST http://blast.wustl.edu
CLUSTALW http://ftp-igbmc.u-strasberg.fr/pub/ClustalW/
CLUSTALX http://ftp-igbmc.u-strasberg.fr/pub/ClustalX/
3. Restriction and mapping. Virtual restriction digestion of DNA sequences can be done using WebCutter.
4. PCR Primer design
5. Translation of Nucleotide sequence to protein sequence. See Protein tools
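The virtual restriction digestion mentioned in item 3 amounts to finding every occurrence of a recognition site and cutting at a fixed position within it. The Python sketch below illustrates this for EcoRI, which recognizes GAATTC and cuts after the G. It is not how WebCutter is implemented; the function name digest and its parameters are invented here, and the sketch treats the DNA as a linear molecule and searches only the strand given.

```python
def digest(sequence, site="GAATTC", cut_offset=1):
    """Cut a linear sequence at every recognition site (hypothetical helper).

    cut_offset is the number of bases of the site left on the upstream
    fragment; 1 corresponds to EcoRI, which cuts G^AATTC.
    """
    sequence = sequence.upper()
    cut_positions = []
    start = sequence.find(site)
    while start != -1:
        cut_positions.append(start + cut_offset)
        start = sequence.find(site, start + 1)

    # Slice the sequence between consecutive cut positions.
    fragments = []
    previous = 0
    for position in cut_positions:
        fragments.append(sequence[previous:position])
        previous = position
    fragments.append(sequence[previous:])
    return fragments


fragments = digest("AAGAATTCTTGAATTCGG")
print(fragments)
print([len(fragment) for fragment in fragments])
```

The list of fragment lengths is what would be compared with the band pattern on a gel.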
Useful URLs
Entrez: GenBank and BLAST
DBGET
SRS
CMS Molecular Biology Resources http://www.unl.edu/stc-95/ResTools/cmchp.html
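Two of the physical parameters mentioned for DNA sequences, %GC and Tm, can be estimated by hand for short sequences. The Python sketch below is only an approximation and makes no claim about the method used by the CMS Molecular Biology Resource. For Tm it uses the Wallace rule of thumb, Tm = 2(A+T) + 4(G+C) in degrees Celsius, which is intended for short oligonucleotides; the function names gc_percent and wallace_tm are invented for the example.

```python
def gc_percent(dna):
    """Percentage of G and C bases in a sequence (hypothetical helper)."""
    dna = dna.upper()
    gc_count = dna.count("G") + dna.count("C")
    return 100.0 * gc_count / len(dna)


def wallace_tm(oligo):
    """Rough melting temperature of a short oligonucleotide, in degrees C."""
    oligo = oligo.upper()
    at_count = oligo.count("A") + oligo.count("T")
    gc_count = oligo.count("G") + oligo.count("C")
    # Wallace rule: 2 degrees per A/T base, 4 degrees per G/C base.
    return 2 * at_count + 4 * gc_count


primer = "ATGCATGCAT"
print(gc_percent(primer), wallace_tm(primer))
```

Longer sequences need methods that account for salt concentration and neighboring bases, so the values from this sketch should be treated as a first estimate only.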
Other large collections of biochemistry and molecular biology databases and computational tools are located at:
1. Pedro’s Biomolecular Research Tools: most complete molecular biology link page
2. The Biophysical Society List
3. University of Biochemistry Home pages
4. BiochemNet: a list of several Biochemistry and Molecular Biology Internet Resources
http://www.schimidel.com./bionet/
5. BioMed Net: electronic magazine for Biological and Medical Researchers
http://www.biomednet.com/