Chemical database

A chemical database is a database specifically designed to store chemical information, such as chemical and crystal structures, spectra, reactions and syntheses, and thermophysical data.

Types of chemical databases

Chemical structures

Chemical structures are traditionally represented as 2D structural formulae: drawings on paper in which lines indicate chemical bonds between atoms. While these are ideal visual representations for the chemist, they are unsuitable for computational use, especially for search and storage. Small molecules (also called ligands in drug design applications) are usually represented as lists of atoms and their connections. Large molecules such as proteins are, however, more compactly represented by the sequences of their amino acid building blocks. Large chemical structure databases are expected to store and search information on millions of molecules, occupying terabytes of storage.

Literature database

Chemical literature databases correlate structures or other chemical information with relevant references such as academic papers or patents. This type of database includes STN and SciFinder. Links to literature are also included in many databases that focus on chemical characterization.

Crystallographic database

Crystallographic databases store X-ray crystal structure data. Common examples include the Protein Data Bank and the Cambridge Structural Database.

NMR spectra database

NMR spectra databases correlate chemical structure with NMR data. These databases often include other characterization data, such as FTIR and mass spectrometry data.

Reactions database

Most chemical databases store information on stable molecules, but reaction databases also store intermediates and transient, unstable molecules. Reaction databases contain information about products, educts (reactants), and reaction mechanisms.

Thermophysical database

Thermophysical data are information about:

phase equilibria, including vapor-liquid equilibrium, solubility of gases in liquids, liquids in solids (SLE), heats of mixing, vaporization, and fusion;
caloric data, such as heat capacity and heats of formation and combustion;
transport properties, such as viscosity and thermal conductivity.

Chemical structure representation

There are two principal techniques for representing chemical structures in digital databases:

As connection tables, adjacency matrices, or lists with additional information on bond (edge) and atom (node) attributes, such as MDL Molfile, PDB, and CML.
As a linear string notation based on depth-first or breadth-first traversal, such as SMILES/SMARTS, SLN, WLN, and InChI.

These approaches have been refined to allow representation of stereochemical differences and charges, as well as special kinds of bonding such as those seen in organometallic compounds. The principal advantage of a computer representation is the possibility of increased storage and fast, flexible search.
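The connection-table approach can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the layout of any of the file formats named above: the `Molecule` class and its methods are hypothetical, hydrogens are left implicit, and atom attributes are reduced to an element symbol.

```python
from dataclasses import dataclass, field


@dataclass
class Molecule:
    """A minimal connection table: atoms are nodes, bonds are edges."""
    atoms: list = field(default_factory=list)  # element symbols, indexed by position
    bonds: dict = field(default_factory=dict)  # frozenset({i, j}) -> bond order

    def add_atom(self, symbol):
        """Append an atom and return its index."""
        self.atoms.append(symbol)
        return len(self.atoms) - 1

    def add_bond(self, i, j, order=1):
        """Record an undirected bond between atoms i and j."""
        self.bonds[frozenset((i, j))] = order

    def neighbours(self, i):
        """Indices of the atoms bonded to atom i, in ascending order."""
        return sorted(next(iter(pair - {i})) for pair in self.bonds if i in pair)


# Heavy atoms of ethanol, the structure that SMILES writes as "CCO".
ethanol = Molecule()
c1 = ethanol.add_atom("C")
c2 = ethanol.add_atom("C")
o = ethanol.add_atom("O")
ethanol.add_bond(c1, c2)
ethanol.add_bond(c2, o)
```

Here `ethanol.neighbours(c2)` returns the indices of both atoms attached to the central carbon, which is the kind of lookup that graph-based search relies on. A linear notation encodes the same graph as a single string.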

Search

Substructure

Chemists can search databases using parts of structures, parts of IUPAC names, or constraints on properties. Chemical databases differ from general-purpose databases particularly in their support for substructure search. This kind of search is achieved by looking for a subgraph isomorphism (sometimes also called a monomorphism) and is a widely studied application of graph theory. The search algorithms are computationally intensive, often of O(n³) or O(n⁴) time complexity, where n is the number of atoms involved. The intensive component of the search is called atom-by-atom searching (ABAS), in which a mapping of the atoms and bonds of the search substructure onto the target molecule is sought. ABAS usually makes use of Ullmann's algorithm or variations of it (e.g. SMSD^[1]).

Speedups are achieved by time amortization; that is, some of the time spent on search tasks is saved by using precomputed information. This precomputation typically involves creating bit strings that represent the presence or absence of molecular fragments. By looking at the fragments present in a search structure, it is possible to skip the ABAS comparison for target molecules that lack any of those fragments. This elimination is called screening (not to be confused with the screening procedures used in drug discovery). The bit strings used for these applications are also called structural keys. The performance of such keys depends on the choice of fragments used to construct them and on the probability of their presence in the database molecules.

Another kind of key makes use of hash codes based on computationally derived fragments. These are called 'fingerprints', although the term is sometimes used synonymously with structural keys. The amount of memory needed to store structural keys and fingerprints can be reduced by 'folding', which combines parts of the key using bitwise operations and thereby reduces its overall length.^[2]
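The screen-then-verify pipeline described above might be sketched as follows. This is an illustrative sketch under several assumptions: fragments are simply bonded element pairs, the hash is CRC-32 modulo the key length, and the atom-by-atom step is plain backtracking rather than Ullmann's algorithm. All function names are hypothetical.

```python
import zlib


def mol(atoms, *pairs):
    """Build (atom list, bond set) from single-letter symbols and index pairs."""
    return list(atoms), {frozenset(p) for p in pairs}


def bond_fragments(atoms, bonds):
    """One fragment string per bond, e.g. 'C-O' (symbols sorted, so atom order is irrelevant)."""
    return {"-".join(sorted(atoms[i] for i in bond)) for bond in bonds}


def fingerprint(fragments, n_bits=64):
    """Hash each fragment to one bit of an n_bits-wide integer."""
    fp = 0
    for frag in fragments:
        fp |= 1 << (zlib.crc32(frag.encode()) % n_bits)
    return fp


def passes_screen(query_fp, target_fp):
    """True if every bit set in the query is also set in the target."""
    return query_fp & target_fp == query_fp


def fold(fp, n_bits):
    """Halve the key length by OR-ing the upper half onto the lower half."""
    half = n_bits // 2
    return (fp >> half) | (fp & ((1 << half) - 1))


def find_mapping(q_atoms, q_bonds, t_atoms, t_bonds, mapping=None):
    """Backtracking atom-by-atom search: map each query atom to a distinct target
    atom with the same symbol so that every query bond exists in the target."""
    mapping = mapping or {}
    if len(mapping) == len(q_atoms):
        return mapping
    qi = len(mapping)  # next query atom to place
    for ti, symbol in enumerate(t_atoms):
        if symbol != q_atoms[qi] or ti in mapping.values():
            continue
        bonds_ok = all(frozenset((ti, mapping[qj])) in t_bonds
                       for qj in mapping if frozenset((qi, qj)) in q_bonds)
        if bonds_ok:
            result = find_mapping(q_atoms, q_bonds, t_atoms, t_bonds, {**mapping, qi: ti})
            if result is not None:
                return result
    return None


def substructure_search(query, database, n_bits=64):
    """Screen with fingerprints first, then confirm survivors atom by atom."""
    q_fp = fingerprint(bond_fragments(*query), n_bits)
    hits = []
    for name, target in database.items():
        t_fp = fingerprint(bond_fragments(*target), n_bits)  # precomputed in a real system
        if passes_screen(q_fp, t_fp) and find_mapping(*query, *target) is not None:
            hits.append(name)
    return hits


database = {
    "ethanol": mol("CCO", (0, 1), (1, 2)),
    "dimethyl ether": mol("COC", (0, 1), (1, 2)),
    "propane": mol("CCC", (0, 1), (1, 2)),
}
query = mol("CO", (0, 1))  # a carbon bonded to an oxygen
hits = substructure_search(query, database)
```

Because different fragments can hash to the same bit, and folding makes such collisions more likely, the screen can let through molecules that do not contain the query, but it never rejects one that does. The atom-by-atom step is therefore still needed to confirm each surviving candidate.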

Conformation

Searching by matching the 3D conformation of molecules, or by specifying spatial constraints, is another feature that is particularly useful in drug design. Searches of this kind can be computationally very expensive. Many approximate methods have been proposed, including BCUTS, special function representations, moments of inertia, ray-tracing histograms, maximum distance histograms, and shape multipoles.^[3]^[4]^[5]^[6]^[7]
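To give a feel for how an approximate shape comparison avoids expensive 3D alignment, the sketch below bins all interatomic distances into a histogram, which does not change when the molecule is rotated or translated. It is a generic illustration only and does not reproduce any of the cited methods; the function names, bin width, and coordinates are assumptions.

```python
import math
from itertools import combinations


def distance_histogram(coords, bin_width=1.0, n_bins=8):
    """Count interatomic distances per bin; the last bin collects all longer distances."""
    hist = [0] * n_bins
    for a, b in combinations(coords, 2):
        d = math.dist(a, b)
        hist[min(int(d / bin_width), n_bins - 1)] += 1
    return hist


def histogram_distance(h1, h2):
    """Sum of absolute bin differences; 0 means identical histograms."""
    return sum(abs(x - y) for x, y in zip(h1, h2))


# Hypothetical atom coordinates (arbitrary length units).
shape = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0), (0.0, 0.0, 2.2)]
rotated = [(-y, x, z) for x, y, z in shape]              # 90 degree turn about the z axis
stretched = [(2 * x, 2 * y, 2 * z) for x, y, z in shape]  # same topology, different size
```

Comparing `distance_histogram(shape)` with the histogram of `rotated` gives a distance of zero, while the histogram of `stretched` differs. Two fixed-length histograms can be compared far more cheaply than two conformations can be superimposed, at the cost of discarding some geometric detail.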

Descriptors

All properties of molecules beyond their structure can be divided into physico-chemical and pharmacological attributes, also called descriptors. In addition, there are various artificial, more or less standardized naming systems for molecules, which supply more or less ambiguous names and synonyms. The IUPAC name is usually a good choice for representing a molecule's structure as a string that is both human-readable and unique, although it becomes unwieldy for larger molecules. Trivial names, on the other hand, abound with homonyms and synonyms and are therefore a poor choice as a defining database key. While physico-chemical descriptors such as molecular weight, (partial) charge, and solubility can mostly be computed directly from the molecule's structure, pharmacological descriptors can be derived only indirectly, using involved multivariate statistics or experimental (screening, bioassay) results. To save computational effort, all of these descriptors can be stored along with the molecule's representation, and usually are.
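As a small illustration of a descriptor that follows directly from structure, the sketch below computes a molecular weight from an elemental composition and stores it next to the structure so that it need not be recomputed at query time. The record layout and field names are assumptions, and the atomic weights are rounded standard values for only four elements.

```python
# Rounded standard atomic weights for a handful of elements.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def molecular_weight(composition):
    """Sum the atomic weights of all atoms, given a mapping of element -> count."""
    return sum(ATOMIC_MASS[element] * count for element, count in composition.items())


# Hypothetical database record: the descriptor is computed once and kept with the structure.
record = {"name": "ethanol", "smiles": "CCO", "composition": {"C": 2, "H": 6, "O": 1}}
record["mol_weight"] = round(molecular_weight(record["composition"]), 3)
```

A property query such as "molecular weight below 50" can then be answered from the stored column alone, without touching the structure representation.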

Similarity

Main article: Chemical similarity

There is no single definition of molecular similarity; the concept is defined according to the application and is often described as the inverse of a distance measure in descriptor space. For instance, two molecules might be considered more similar to each other than to others if the difference in their molecular weights is smaller. A variety of other measures can be combined to produce a multivariate distance measure. Distance measures are often classified into Euclidean and non-Euclidean measures, depending on whether the triangle inequality holds. Maximum common subgraph (MCS) based substructure search^[1] is also a very common similarity or distance measure. MCS is also used for screening drug-like compounds by finding molecules that share a common subgraph (substructure).^[8]
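The idea of similarity as an inverse of distance can be made concrete with a short sketch. The mapping 1 / (1 + d) is only one of many possible choices and is an assumption here, as are the function names. The Tanimoto coefficient on bit-string keys is included as a widely used alternative that works on fragment bits rather than on numeric descriptors.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similarity_from_distance(d):
    """Map a distance in [0, inf) to a similarity in (0, 1]; identical items score 1."""
    return 1.0 / (1.0 + d)


def tanimoto(fp_a, fp_b):
    """Shared set bits divided by all set bits, for keys stored as integers."""
    union = bin(fp_a | fp_b).count("1")
    return bin(fp_a & fp_b).count("1") / union if union else 1.0
```

Descriptors measured on very different scales, for example molecular weight and partial charge, would normally be rescaled before being combined into one distance, otherwise the largest-valued descriptor dominates the result.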

Chemicals in the databases may be clustered into groups of 'similar' molecules based on these similarities. Both hierarchical and non-hierarchical clustering approaches can be applied to chemical entities with multiple attributes. These attributes or molecular properties may be either empirically determined or computationally derived descriptors. One of the most popular clustering approaches is the Jarvis-Patrick algorithm, which groups items by their shared k-nearest neighbours.^[9]
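A compact sketch of Jarvis-Patrick style clustering is given below. It assumes one common variant of the rule: two items are joined when each appears in the other's list of k nearest neighbours and the two lists share at least `k_min` members. The function signature and the example values are assumptions, and a one-dimensional descriptor stands in for a real descriptor vector.

```python
def jarvis_patrick(points, distance, k=3, k_min=2):
    """Cluster item indices by shared nearest neighbours; returns sorted index lists."""
    n = len(points)
    neighbours = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: distance(points[i], points[j]))
        neighbours.append(set(others[:k]))  # the k nearest items, excluding i itself

    parent = list(range(n))  # union-find forest used to merge clusters

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            mutual = j in neighbours[i] and i in neighbours[j]
            if mutual and len(neighbours[i] & neighbours[j]) >= k_min:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical one-dimensional descriptor values for nine compounds.
values = [1.0, 1.1, 1.2, 1.3, 9.0, 9.1, 9.2, 9.3, 50.0]
clusters = jarvis_patrick(values, lambda a, b: abs(a - b))
```

With these values the two tight groups form one cluster each, and the outlier stays on its own because no other item lists it among its nearest neighbours. The method needs only neighbour lists, not a full hierarchy, which is part of its appeal for large collections.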

In pharmacologically oriented chemical repositories, similarity is usually defined in terms of the biological effects of compounds (ADME/tox), which can in turn be semi-automatically inferred from similar combinations of physico-chemical descriptors using QSAR methods.

Registration systems

Database systems for maintaining unique records on chemical compounds are termed registration systems. These are often used for chemical indexing, patent systems, and industrial databases.

Registration systems usually enforce the uniqueness of each chemical represented in the database through the use of unique representations. By applying rules of precedence when generating string notations, one can obtain unique ('canonical') string representations such as canonical SMILES. Some registration systems, such as the CAS system, use algorithms that generate unique hash codes to achieve the same objective.
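The role of a canonical representation in enforcing uniqueness can be illustrated with a toy registry. The sketch below is an assumption-laden illustration, not the method of any real registration system: it finds a canonical key by brute force over every atom ordering, which is feasible only for very small molecules, whereas production systems rely on far more efficient canonical-numbering algorithms. The `canonical_key` function and `Registry` class are hypothetical.

```python
from itertools import permutations


def canonical_key(atoms, bonds):
    """Return the smallest (symbols, bonds) encoding over all atom orderings.

    atoms is a list of element symbols; bonds is a list of (i, j, order) tuples.
    """
    best = None
    for perm in permutations(range(len(atoms))):
        new_index = {old: new for new, old in enumerate(perm)}
        symbols = tuple(atoms[old] for old in perm)
        edges = tuple(sorted(
            tuple(sorted((new_index[i], new_index[j]))) + (order,)
            for i, j, order in bonds))
        candidate = (symbols, edges)
        if best is None or candidate < best:
            best = candidate
    return best


class Registry:
    """Assigns one registry number per distinct canonical key."""

    def __init__(self):
        self._ids = {}

    def register(self, atoms, bonds):
        key = canonical_key(atoms, bonds)
        if key not in self._ids:
            self._ids[key] = len(self._ids) + 1
        return self._ids[key]


registry = Registry()
id_a = registry.register(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])  # ethanol
id_b = registry.register(["O", "C", "C"], [(0, 1, 1), (1, 2, 1)])  # ethanol, atoms renumbered
id_c = registry.register(["C", "O", "C"], [(0, 1, 1), (1, 2, 1)])  # dimethyl ether
```

The two ethanol inputs receive the same registry number despite their different atom numbering, while dimethyl ether, which has the same atoms connected differently, receives a new one.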

A key difference between a registration system and a simple chemical database is the ability to accurately represent that which is known, unknown, and partially known. For example, a chemical database might store a molecule with stereochemistry unspecified, whereas a chemical registry system requires the registrar to specify whether the stereo configuration is unknown, a specific (known) mixture, or racemic. Each of these would be considered a different record in a chemical registry system.

Registration systems also preprocess molecules so that trivial differences, such as differences in halogen ions in chemicals, are not treated as distinct records.

An example is the Chemical Abstracts Service (CAS) registration system. See also CAS registry number.

Tools

The computational representations are usually made transparent to chemists by graphical display of the data. Data entry is also simplified through the use of chemical structure editors. These editors internally convert the graphical data into computational representations.

There are also numerous algorithms for the interconversion of various formats of representation. An open-source utility for conversion is Open Babel. These search and conversion algorithms are implemented either within the database system itself or, as is now the trend, as external components that fit into standard relational database systems. Both Oracle and PostgreSQL based systems make use of cartridge technology that allows user-defined datatypes. These allow the user to make SQL queries with chemical search conditions. For example, a query to search for records having a phenyl ring in their structure, represented as a SMILES string in a SMILESCOL column, could be:

 SELECT * FROM CHEMTABLE WHERE SMILESCOL.CONTAINS('c1ccccc1')
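The general idea of plugging a chemical predicate into SQL can be tried out with SQLite's user-defined functions, as in the sketch below. It is a stand-in under stated assumptions and not a cartridge: SQLite replaces Oracle or PostgreSQL, structures are stored as JSON connection tables instead of SMILES, and the hypothetical `has_substructure` function uses plain backtracking on element symbols and bonds.

```python
import json
import sqlite3


def has_substructure(target_json, query_json):
    """SQL-callable predicate: 1 if the query graph maps into the target graph, else 0."""
    target, query = json.loads(target_json), json.loads(query_json)
    t_bonds = {frozenset(b) for b in target["bonds"]}
    q_bonds = {frozenset(b) for b in query["bonds"]}

    def extend(mapping):
        # mapping[qj] is the target atom chosen for query atom qj
        qi = len(mapping)
        if qi == len(query["atoms"]):
            return True
        for ti, symbol in enumerate(target["atoms"]):
            if symbol != query["atoms"][qi] or ti in mapping:
                continue
            if all(frozenset((ti, mapping[qj])) in t_bonds
                   for qj in range(qi) if frozenset((qi, qj)) in q_bonds):
                if extend(mapping + [ti]):
                    return True
        return False

    return int(extend([]))


conn = sqlite3.connect(":memory:")
conn.create_function("has_substructure", 2, has_substructure)
conn.execute("CREATE TABLE chemtable (name TEXT, structure TEXT)")
rows = [
    ("ethanol", {"atoms": ["C", "C", "O"], "bonds": [[0, 1], [1, 2]]}),
    ("propane", {"atoms": ["C", "C", "C"], "bonds": [[0, 1], [1, 2]]}),
]
conn.executemany("INSERT INTO chemtable VALUES (?, ?)",
                 [(name, json.dumps(structure)) for name, structure in rows])

# Query fragment: a carbon bonded to an oxygen.
query = json.dumps({"atoms": ["C", "O"], "bonds": [[0, 1]]})
hits = [name for (name,) in conn.execute(
    "SELECT name FROM chemtable WHERE has_substructure(structure, ?) ORDER BY name",
    (query,))]
```

The chemical logic lives outside the database engine and is exposed to SQL only as a function name, which is the same separation that lets external components slot into a standard relational system. A real cartridge would add index support so that the predicate is not evaluated row by row.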

Algorithms for the conversion of IUPAC names to structure representations and vice versa are also used for extracting structural information from text. However, there are difficulties due to the existence of multiple dialects of IUPAC nomenclature. Work is under way to establish a unique IUPAC standard (see InChI).

References

[1] Rahman, S. A.; Bashton, M.; Holliday, G. L.; Schrader, R.; Thornton, J. M. (2009) Small Molecule Subgraph Detector (SMSD) toolkit. Journal of Cheminformatics 1:12. DOI:10.1186/1758-2946-1-12
[2] Cummings, Maxwell D.; Maxwell, Alan C.; DesJarlais, Renee L. (2007) Processing of Small Molecule Databases for Automated Docking. Medicinal Chemistry 3(1):107-113
[3] Pearlman, R. S.; Smith, K. M. (1999) Metric Validation and the Receptor-Relevant Subspace Concept. J. Chem. Inf. Comput. Sci. 39:28-35
[4] Lin, Jr-Hung; Clark, Timothy (2005) An analytical, variable resolution, complete description of static molecules and their intermolecular binding properties. J. Chem. Inf. Model. 45(4):1010-1016
[5] Meek, P. J.; Liu, Z.; Tian, L.; Wang, C. J.; Welsh, W. J.; Zauhar, R. J. (2006) Shape Signatures: speeding up computer aided drug discovery. Drug Discovery Today (19-20):895-904
[6] Grant, J. A.; Gallardo, M. A.; Pickup, B. T. (1996) A fast method of molecular shape comparison: A simple application of a Gaussian description of molecular shape. J. Comput. Chem. 17(14):1653-1666
[7] Ballester, P. J.; Richards, W. G. (2007) Ultrafast shape recognition for similarity search in molecular databases. Proc. R. Soc. A 463:1307-1321
[8] http://www.ebi.ac.uk/thornton-srv/software/SMSD/
[9] Butina, Darko (1999) Unsupervised Data Base Clustering Based on Daylight’s Fingerprint and Tanimoto Similarity: A Fast and Automated Way To Cluster Small and Large Data Sets. J. Chem. Inf. Comput. Sci. 39:747-750

External links

Database and registration software

CDK, a Java open source library for chemical data handling
JChem Base and JChem Cartridge Java and .NET database management and search toolkits from ChemAxon
Instant JChem Java desktop database management and search application from ChemAxon. Personal Edition free.
SMSD (Small Molecule Subgraph Detector) is a Java based software library for calculating Maximum Common Subgraph (MCS) between small molecules
JOELib, a Java chemical data handling software library
'Chemical Structure Lookup Service' and 'NCI Enhanced Database Browser', web services of the CADD group at the National Cancer Institute (NCI)
Pinpoint from Dotmatics is a C based cartridge for Oracle that supports the free Oracle XE.
Bingo from GGA Software Services is a free and open-source cartridge for Oracle, Microsoft SQL Server and PostgreSQL.
MolSql from Scilligence is a chemistry cartridge built on Microsoft .NET for Microsoft SQL Server, supporting the free SQL Server Express edition.

Databases of chemical structures

Synthesis references database
Aurora Fine Chemicals
eChemPortal, a global portal to information on Chemical Substances
NLM ChemIDplus, biomedical chemicals searchable by name and structure.
Chembase, a chemical compounds database with data and properties.
Organic synthesis database
ZINC, a free database for virtual screening
ChemSpider, Free access to > 20 Million Chemical Structures, Physical Property Data and Systematic Identifiers
MMsINC, a free web-oriented database of commercially available compounds for virtual screening and chemoinformatic applications
ChemIndustry, a free database derived from PubChem data
OpenCDLig, a free Web application for host/guest complexes
NCI/CADD Chemical Structure Lookup Service, lookup in which databases a structure occurs (currently > 70 million indexed chemical structures)
Chempedia, the open, peer reviewed chemical substance registry
ChEBI, the free chemical substance registry for biologically relevant molecules
Chemonaut, a source of physically available commercial compounds

Databases of chemical names

Chemical Substances Database, a free database of chemical names, mainly useful for translation of names between Japanese and English. More than 37,000 entries.
ChemSub Online, the Free Web Portal and Information System on Chemical Substances, substance names available in 8 languages.
EuroChem Online Database, the free Chemical Database.

Wikimedia Foundation. 2010.
