Human Genome Project and Bioinformatics

The Human Genome Project is an extensive international effort to identify and sequence all the genes in the human genome. This information will be a resource for the investigation of hereditary diseases as well as normal gene structure and expression. The following techniques, which have been used to analyze the 23 pairs of human chromosomes, were originally developed to map the genomes of other organisms.

1. Genomic Libraries. A cellular genome is extracted and fragmented with restriction enzymes. A so-called genomic library is obtained when each of the fragments is cloned. The nucleotide sequence of each cloned DNA fragment can be determined.

2. Chromosomal Walking. This technique allows for sequencing DNA fragments of several hundred kb. A series of overlapping fragments from the genome is cloned. Once the first fragment has been sequenced, it provides the sequence information required to construct a probe that identifies the clone containing the neighboring DNA sequence. After the second clone is sequenced, the process is repeated until all the overlapping fragments have been sequenced.
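The walking procedure can be illustrated with a small sketch. It is a toy model under stated assumptions, not a laboratory protocol: clones are plain strings, the "probe" is simply the last few bases of the sequence read so far, and the function name `walk`, the `probe_len` parameter, and the example sequence are all hypothetical.

```python
def walk(start, clones, probe_len=6):
    """Toy chromosome walk: extend `start` by probing `clones` with its 3' end."""
    contig = start
    remaining = [c for c in clones if c != start]
    while True:
        probe = contig[-probe_len:]  # probe built from the end just sequenced
        hit = next((c for c in remaining if probe in c), None)
        if hit is None:
            break  # no clone overlaps the current end; the walk stops
        remaining.remove(hit)
        # keep only the part of the new clone that lies beyond the probe
        contig += hit[hit.index(probe) + probe_len:]
    return contig


# Hypothetical 30-base "genome" cut into three overlapping clones.
genome = "ATGCGTACGTTAGCCTAGGATCCGATTACA"
clone1, clone2, clone3 = genome[0:12], genome[6:20], genome[14:30]
contig = walk(clone1, [clone3, clone1, clone2])
```

In this example the walk starts at `clone1` and recovers the full sequence whatever order the clones are listed in. Real walks must also cope with repeated sequences, which can make a probe match more than one clone.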

Because the human genome is complex and enormous, faster and more convenient techniques have been developed. For example, genomic libraries are now routinely constructed from individual human chromosomes obtained by using automated machines called fluorescence-activated cell sorters. Similarly, vast improvements have been made in the computers that store and retrieve sequence information. Despite intense and innovative research, the pace and cost-effectiveness of sequencing must improve substantially for the project's goal to be met by the year 2005.

Bioinformatics is the convergence of two technology revolutions: the explosive growth in biotechnology, paralleled by the equally explosive growth in information technology. This is illustrated, in an uncanny way, by the fact that both the size of GenBank and the power of computers have been doubling at about the same rate for many years. The term bioinformatics still carries with it enough hype to make investigating ‘biology with computers’ seem cutting-edge. Apart from the obvious data management applications of bioinformatics, computational biology research is divided into two main schools: the analysis and interpretation of data, and the development of new algorithms and statistics.

Finding the genes within the genomic sequence is a massive task in itself. Once identified, otherwise uncharacterized coding regions must be assigned a function. Thereafter, the interactions between the genes and gene products must be understood at all levels, not merely in the context of the pathways within and between cells but also in terms of the evolution of gene families within and between species. These questions can all be addressed using bioinformatics.

Bioinformatics touches all of biology, and straightforward access to data via the Internet means that a wealth of information is available, literally at our fingertips. However, the newcomer to bioinformatics might be discouraged by the initially daunting computational and mathematical content. Others might simply be confused by the language and terminology of bioinformatics. To help overcome these barriers, this exercise is intended to give students a sampling of Internet computational tools for the analysis of biological molecules, with emphasis on proteins and nucleic acids.

The amount of biological information accessible via the World Wide Web is truly astonishing, and the volume of data is increasing at a fast pace. It is important for the bench scientist to have easy and efficient ways of wading through the data and finding what is important to his or her research. Although one can browse the data, a far more efficient method of access is to perform a search. Depending on the type of data at hand, there are two basic ways of searching: using descriptive words to search text databases, or using a nucleotide or protein sequence to search a sequence database. Entrez, Sequence Retrieval System (SRS) and DBGET allow text searching of multiple molecular biology databases and provide links to relevant information for entries that match the search criteria. With these retrieval systems, queries can be as simple as the accession number of a newly published sequence or as complex as a search of multiple database fields for a specific term. The advantage of these systems is that they not only return matches to a query, but also provide handy pointers to additional important information in related databases. The three differ in the databases that they search and the links that they make to other information.

Entrez, a molecular biology database and retrieval system, was developed by the National Center for Biotechnology Information (NCBI). It is an entry point for exploring distinct but integrated databases. It provides access to nucleotide and protein sequence databases, a molecular modeling 3-D structures database (MMDB), a genomes and maps database, and the literature. The literature database, PubMed, provides excellent and easy access to MEDLINE articles. The taxonomy database contains more than 23,000 different species and allows retrieval of DNA and protein sequences for any taxonomic group. Of the three systems, Entrez is the easiest to use, but it offers more limited information to search.

National Center for Biotechnology Information (NCBI)
Entrez Home http://www.ncbi.nlm.nih.gov/Entrez/
Entrez Help http://www.ncbi.nlm.nih.gov/Entrez/entrezhelp.html
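
Entrez text searches can also be scripted. The sketch below only builds query URLs and does not contact any server. It assumes NCBI's E-utilities web interface (`esearch.fcgi` and `efetch.fcgi`), which is not described in this handout; the helper names `esearch_url` and `efetch_url` and the example query are illustrative.

```python
from urllib.parse import urlencode

# Assumption: the NCBI E-utilities base address, not given in this handout.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build a text-search URL for one Entrez database."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(db, accession, rettype="fasta"):
    """Build a URL that retrieves one record, by default in FASTA format."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# A complex query: search two fields of the nucleotide database at once.
query = "insulin[Title] AND Homo sapiens[Organism]"
search = esearch_url("nucleotide", query)
```

The two functions mirror the two kinds of query described above: `efetch_url` corresponds to the simple case of retrieving a record by accession number, and `esearch_url` to a search across specific database fields.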

SRS is a homogeneous interface to over biological databases that has been developed at the European Bioinformatics Institute (EBI) at Hinxton, UK. The types of databases included are sequence and sequence-related, metabolic pathways, transcription factors, application results, protein 3-D structure, genome, mapping, mutations, and locus-specific mutations. You can access and query their contents and navigate around them. The web page listing all the databases contains a link to a description page for each database and includes the date on which it was last updated. You select one or more of the databases to search before entering your query. Over 30 versions of SRS are currently running on the WWW. Each includes a different subset of databases and associated analytical tools. The SRS databases are well indexed, which reduces the search time.

European Bioinformatics Institute (EBI)
SRS http://srs.ebi.ac.uk/
SRS Help http://srs.ebi.ac.uk/srs5/man/mi srswww.html

DBGET/LinkDB is an integrated database retrieval system, developed by the Institute for Chemical Research, Kyoto University, and the Human Genome Center of the University of Tokyo, that is available through GenomeNet. After one of these databases is queried, DBGET presents links to associated information in addition to the list of results. The LinkDB database can also be searched directly with a specific entry. Another unique feature of DBGET is its connection with the Kyoto Encyclopedia of Genes and Genomes (KEGG) database, a database of metabolic and regulatory pathways developed by the same group. DBGET has simpler, but more limited, search methods than the other two systems. The bfind command searches by text terms, while the bget command searches by entry name or accession number.

GenomeNet (Kyoto University and University of Tokyo)
DBGET http://www.genome.ad.jp/dbget/dbget2.html
DBGET Help http://www.genome.ad.jp/dbget/dbget manual.html

I. Databases
1) Brookhaven Protein Databank: a huge database of molecular 3-D structures
2) GenBank & EMBL: nucleic acid sequence databases
3) SWISS-PROT, PIR, PRF: protein sequence databases
4) PROSITE: dictionary of protein sites and patterns
5) KEGG: pathway and genes database
6) BRITE: biomolecular interaction database
7) PMD: protein mutant database
8) ExPASy Molecular Biology Server: a Swiss site for molecular biology database tools
9) MMDB: Molecular Modeling Database

II. Tools
1) ExPASy-Proteomics Tools: ExPASy tools for proteomics menu
2) WebCutter 2.0: an Internet-based program for restriction mapping
3) CMS Molecular Biology Resources: a collection of Internet tools for DNA analyses

A. Physical Properties
1. Compute pI/MW is a simple tool that computes the theoretical pI (isoelectric point) and MW (molecular weight) for a list of SWISS-PROT database entries or for a user-entered sequence.

2. ProtParam is a more advanced tool that computes various physical and chemical parameters for a given protein stored in the SWISS-PROT database or for a user-entered sequence.
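
The molecular weight part of such a calculation can be sketched in a few lines. This is not the ExPASy implementation: it assumes the commonly tabulated average residue masses, sums them, and adds one water molecule for the free termini. The function name `molecular_weight` is illustrative, and pI is left out because it additionally requires a set of pKa values.

```python
# Average residue masses in daltons (amino acid minus one water).
AVG_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # one H2O accounts for the free N- and C-termini


def molecular_weight(sequence):
    """Approximate average molecular weight of an unmodified peptide."""
    sequence = sequence.upper()
    unknown = set(sequence) - set(AVG_RESIDUE_MASS)
    if unknown:
        raise ValueError(f"unrecognized residues: {sorted(unknown)}")
    return sum(AVG_RESIDUE_MASS[aa] for aa in sequence) + WATER


mw_glycine = molecular_weight("G")  # a single residue gives free glycine
```

As a sanity check, a one-residue "peptide" of glycine should come out close to the molecular weight of free glycine, about 75.07 Da.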

B. Sequence Properties
1. Protein sequence alignment of raw data with existing database entries can be done using Entrez’s BLAST Homology Search Engine. This is a basic local alignment search tool for both protein and DNA sequences.

2. Protein Machine allows raw DNA sequence data or known DNA sequence (with accession number) to be translated into a protein sequence.

3. GOR, nnPredict and Predator are simple secondary structure prediction programs. PredictProtein will give a secondary structure prediction as well as predictions about residue solvent accessibility and other esoteric information. Here you might notice that the different programs don’t necessarily agree with each other!

4. If you feel lucky and you think that your protein is similar to one whose structure has been solved, you can try to obtain a 3-D model of your protein with Swiss-Model or at least a structural classification using SCOP.

5. ProtScale is a versatile tool that computes and plots the profile produced by any amino acid scale on a selected protein. An amino acid scale is defined by a numerical value assigned to each type of amino acid. The most frequently used scales are the hydrophobicity or hydrophilicity scales and the secondary structure conformational parameter scales, but many other scales exist that are based on different chemical and physical properties of amino acids. This program provides 50 predefined scales taken from the literature.
These and many other tools are located at the ExPASy server at the University of Geneva.
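
The idea of a scale profile can be sketched as a sliding-window average. The example assumes the Kyte-Doolittle hydropathy scale as one possible amino acid scale and a plain unweighted window; it is not ProtScale's own code, and the function name `scale_profile` and the `window` default are illustrative choices.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def scale_profile(sequence, scale=KYTE_DOOLITTLE, window=9):
    """Average the scale value over each window of consecutive residues."""
    sequence = sequence.upper()
    if not 1 <= window <= len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    values = [scale[aa] for aa in sequence]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical peptide: a hydrophobic stretch followed by a charged one.
profile = scale_profile("IIIKKK", window=3)
```

Each value in `profile` belongs to one window position, so a sequence of length n gives n - window + 1 points. Plotting these against position shows hydrophobic stretches as peaks and charged stretches as troughs.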

Useful URL: ExPASy Molecular Biology Server http://expassy.hcuge.ch/

1. Physical parameters for a list of GenBank database sequences or for a user-entered sequence can be computed with the Physico-Chemical Features Analyses at the CMS Molecular Biology Resource. These parameters include Tm, %GC, and the thermal denaturation profile, among others. This website is a compendium of electronic, Internet-accessible tools and resources for Molecular Biology, Biotechnology, Molecular Evolution, Biochemistry and Biomolecular Modeling.

2. DNA sequence alignment of raw data with existing database entries can be done using Entrez’s BLAST Homology Search Engine, a basic local alignment search tool for both protein and DNA sequences. You can also use the programs WuBLAST, ClustalW, or ClustalX to align DNA sequences.

WuBLAST   http://blast.wustl.edu
CLUSTALW   http://ftp-igbmc.u-strasberg.fr/pub/ClustalW/
CLUSTALX   http://ftp-igbmc.u-strasberg.fr/pub/ClustalX/

3. Restriction and mapping. Virtual restriction digestion of DNA sequences can be done using WebCutter.

4. PCR primer design.

5. Translation of nucleotide sequence to protein sequence. See Protein tools.
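
Several of the DNA analyses listed above reduce to simple string operations, as the sketch below suggests. It rests on stated assumptions and is not the code behind WebCutter or the CMS tools: %GC is a plain base count, Tm uses the Wallace rule (2 °C per A/T and 4 °C per G/C, a rough estimate meant only for short oligonucleotides), the digest treats the DNA as linear and uses EcoRI (G^AATTC) as the example enzyme, and translation uses the standard genetic code in the first reading frame. All function names are illustrative.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def gc_percent(seq):
    """Percentage of G and C bases in the sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq):
    """Rough Tm in degrees C for a short oligonucleotide (Wallace rule)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def digest(seq, site="GAATTC", cut_offset=1):
    """Cut linear DNA at every occurrence of `site`; returns the fragments."""
    seq = seq.upper()
    cuts, start = [], seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)  # EcoRI cuts after the first G
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


def translate(seq):
    """Translate reading frame 1 up to the first stop codon."""
    seq = seq.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")  # X marks an unknown codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


fragments = digest("AAGAATTCTT")
peptide = translate("ATGGCCTAA")
```

Here the ten-base example contains one EcoRI site, so `digest` returns two fragments, and the nine-base example encodes Met-Ala followed by a stop codon. A full tool would also scan the reverse strand and all six reading frames.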

Useful URLs
Entrez: GenBank and BLAST
DBGET
SRS
CMS Molecular Biology Resources
http://www.unl.edu/stc-95/ResTools/cmchp.html

Other large collections of biochemistry and molecular biology databases and computational tools are located at:
1. Pedro’s Biomolecular Research Tools: most complete molecular biology link page
2. The Biophysical Society List
3. University of Biochemistry Home pages
4. BiochemNet: a list of several Biochemistry and Molecular Biology Internet Resources
http://www.schimidel.com./bionet/
5. BioMed Net: electronic magazine for Biological and Medical Researchers
http://www.biomednet.com/
