Fragment Based Properties

The definitions of these functions are located in the gnova schema. There are corresponding tables in gnova as used in the examples below. For a general discussion of fragment based functions and how you might implement new ones, see below.

Function        SQL example

amw             Select amw('c1ccccc1COCN');
                Update nci.structure set molwt = amw(smiles);
                -- takes about 5 minutes for 250,000 structures

tpsa            Select tpsa('c1ccccc1COCN');
                Update nci.structure set tpsa = tpsa(smiles);
                -- takes about 5 minutes for 250,000 structures

public166keys   Select public166keys('c1ccccc1COCN');
                Update nci.structure set fkey = public166keys(smiles);
                -- takes about 20 minutes for 250,000 structures

All of the above functions are installed in a schema named gnova. You can call them by their schema-qualified names, for example gnova.public166keys, or you can add the gnova schema to your search_path and call them by their unqualified names, for example public166keys.
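
The two ways of calling these functions can be sketched as follows. The schema list shown is only an assumption: keep whatever schemas are already on your search_path (including the one that holds the CHORD functions such as matches and count_matches) and add gnova to them.

```sql
-- Inspect the current setting before changing it.
Show search_path;

-- Assumed schema list for illustration; adjust to your own database.
Set search_path To gnova, public;

-- With gnova on the search_path, the unqualified name resolves.
Select public166keys('c1ccccc1COCN');

-- The schema-qualified name works regardless of the search_path.
Select gnova.public166keys('c1ccccc1COCN');
```

Set changes the search_path for the current session only; a permanent change would have to be made at the role or database level.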

amw(text Smiles) returns numeric

This function returns the average molecular weight of the structure given by the input Smiles. It uses a table of Smarts for each of the first 103 elements in the periodic table and their corresponding average atomic weights. This table is gnova.amw. The table also contains entries for atoms with 1 to 6 implicit H atoms. Note: this function will return an incorrect value if any of your Smiles contains an atom having more than 6 implicit H atoms. Note: this function should not be used extensively since it is, in our opinion, the world's most inefficient computer of molecular weight. However, it is an excellent introductory example of how fragment Smarts can be used to compute molecular properties.

tpsa(text Smiles) returns numeric

This function returns a value for the polar surface area of the structure given by the input Smiles. It uses a table of Smarts for the fragments defined by Ertl et al. and their corresponding partial surface areas [1]. This table is gnova.tpsa.

public166keys(text Smiles) returns bit

This function returns a bitstring fingerprint for the structure given by the input Smiles. It uses a table of Smarts for the 166 fragment keys published by MDL [2] and a corresponding bit number to set. This table is gnova.public166keys.

General discussion of fragment based functions

Many molecular properties can be computed by considering the fragments making up the molecule and combining the fragment values to estimate the molecular value, for example: molecular weight, polar surface area and fragment key fingerprints.

Molecular Weight

Molecular weight is computed as a sum of atomic weights. The CHORD SQL function count_matches(Smiles,'[#6]') will return the number of carbon atoms in the input smiles. When this is multiplied by 12.01, the average atomic weight of carbon, and the process is repeated for each element in Smiles, the sum of these computations will be the average molecular weight.

First nine rows of table gnova.amw.

atomic_smarts   atomic_weight   atomic_symbol
[#1]             1.01           H
[#2]             4.00           He
[#3]             6.94           Li
[#4]             9.01           Be
[#5]            10.81           B
[#6]            12.01           C
[#7]            14.01           N
[#8]            16.00           O
[#9]            19.00           F

The atomic_smarts and atomic_weight values are conveniently stored in a table such as the one above. Using the built-in aggregate function sum, we can compute the average molecular weight of benzoic acid as:

Select sum(atomic_weight*count_matches('c1ccccc1C(=O)O',atomic_smarts)) from gnova.amw;

The gnova.amw function is defined as the above SQL statement, but with $1 substituted for the benzoic acid smiles so that any smiles string can be used. Note: the same name gnova.amw is used for both the table and the function. This is not necessary; some view this as confusing, while some think it is elegant. Note: this function is presented as an example. While it is entirely correct and useable, we claim that it is the world's least efficient computer of molecular weight.
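
Written out, a definition of this kind might look like the sketch below. It is not the definition shipped with CHORD. The function name gnova.amw_sketch is hypothetical, chosen so that running the example does not replace the installed function, and the sketch assumes that count_matches is visible on your search_path and returns an integer.

```sql
-- Hypothetical name; the installed function is gnova.amw.
Create Or Replace Function gnova.amw_sketch(text) Returns numeric As $$
    -- $1 is the input Smiles. Each row of the gnova.amw table contributes
    -- its weight multiplied by the number of atoms its Smarts matches.
    Select sum(atomic_weight * count_matches($1, atomic_smarts))
    From gnova.amw;
$$ Language sql;

-- Usage: the same benzoic acid example, now through the function.
Select gnova.amw_sketch('c1ccccc1C(=O)O');
```
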

Last six rows of table gnova.amw.

atomic_smarts   atomic_weight   atomic_symbol
[*;h1]           1.01           H1
[*;h2]           2.02           H2
[*;h3]           3.03           H3
[*;h4]           4.04           H4
[*;h5]           5.05           H5
[*;h6]           6.06           H6

Since Smiles typically contain implicit H atoms which are not matched by '[#1]', there are six additional rows in the gnova.amw table to account for this. These Smarts will match atoms having one to six implicit H atoms. The corresponding extra H atom weights will be properly added to the sum in the gnova.amw function. If your database tables contain structures with atoms having more than six implicit H atoms, it should now be crystal clear how to add to the gnova.amw table to account for this.
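
As a check on how the implicit H rows enter the sum, the benzoic acid example can be tallied by hand: the Smiles c1ccccc1C(=O)O has seven carbon atoms, two oxygen atoms and six atoms carrying one implicit H each (five aromatic carbons and the hydroxyl oxygen). The second statement is a sketch of the extension described above. The [*;h7] row and its values are hypothetical, and the column names are taken from the table listing shown earlier.

```sql
-- Hand tally for benzoic acid using the rounded weights in gnova.amw:
--   7 x [#6]    (12.01 each)
--   2 x [#8]    (16.00 each)
--   6 x [*;h1]  ( 1.01 each)
Select 7 * 12.01 + 2 * 16.00 + 6 * 1.01 As benzoic_acid_amw;  -- 122.13

-- Hypothetical row for atoms having seven implicit H atoms.
Insert Into gnova.amw (atomic_smarts, atomic_weight, atomic_symbol)
Values ('[*;h7]', 7.07, 'H7');
```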

You could consider creating an analogous exact molecular weight table/function in which the atomic weights are replaced by the weight of the exact isotope instead of the average weight of atomic isotopes. This function would be useful in the analysis of mass spectra.
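
A minimal sketch of such a table and function is given below. The names gnova.emw and exact_weight are hypothetical and do not exist in the distribution, and only a handful of rows are shown. The weights are the masses of the most abundant isotope of each element. As with gnova.amw, the sketch assumes count_matches is on your search_path. Note that a Smarts such as [#6] matches every isotope of carbon, so isotopically labeled atoms would need additional handling.

```sql
-- Hypothetical table, analogous to gnova.amw.
Create Table gnova.emw (
    atomic_smarts text,
    exact_weight  numeric,
    atomic_symbol text
);

Insert Into gnova.emw (atomic_smarts, exact_weight, atomic_symbol) Values
    ('[#1]',    1.007825, 'H'),
    ('[#6]',   12.000000, 'C'),
    ('[#7]',   14.003074, 'N'),
    ('[#8]',   15.994915, 'O'),
    ('[*;h1]',  1.007825, 'H1'),   -- implicit H rows, as in gnova.amw
    ('[*;h2]',  2.015650, 'H2'),
    ('[*;h3]',  3.023475, 'H3'),
    ('[*;h4]',  4.031300, 'H4');

-- Hypothetical function: same pattern as amw, different weights.
Create Or Replace Function gnova.emw(text) Returns numeric As $$
    Select sum(exact_weight * count_matches($1, atomic_smarts))
    From gnova.emw;
$$ Language sql;
```

A complete table would need one row per element of interest plus as many implicit H rows as your structures require.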

Polar Surface Area

Polar surface area can be computed as a sum of parameterized partial surface areas based on atom types defined by Ertl et al. [1] The table gnova.tpsa contains Smarts (column smarts) and partial surface areas (column atom_psa) taken from that work. The function gnova.tpsa uses SQL and the data in that table to sum the partial surface areas for atoms which match each Smarts. The polar surface area of c1ccccc1C(=O)NC can be computed as:

Select sum(atom_psa*oechem.count_matches('c1ccccc1C(=O)NC',smarts)) from gnova.tpsa;

The gnova.tpsa function simply uses the SQL $1 parameter in place of 'c1ccccc1C(=O)NC' above.
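
Spelled out, that substitution might look like the following sketch. The name gnova.tpsa_sketch is hypothetical, used here so that the installed gnova.tpsa function is left untouched; the body is the statement above with $1 in place of the Smiles literal.

```sql
-- Hypothetical name; the installed function is gnova.tpsa.
Create Or Replace Function gnova.tpsa_sketch(text) Returns numeric As $$
    -- Sum each fragment's partial surface area times its match count.
    Select sum(atom_psa * oechem.count_matches($1, smarts))
    From gnova.tpsa;
$$ Language sql;

-- Usage with the same structure as the example above.
Select gnova.tpsa_sketch('c1ccccc1C(=O)NC');
```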

Structure Fragment/Keys Fingerprints

Structural fingerprints can be produced as a bit string with each bit representing the presence or absence of a particular fragment or key [2]. We describe here how the CHORD SQL functions matches and orsum can be used to implement this computation. The table gnova.public166keys contains 166 rows, each with a SMARTS representing a fragment, a brief description of the fragment and a bit number between 1 and 166. The following SQL statement:

Select * from gnova.public166keys where matches('c1ccccc1C(=O)NC',smarts);

will produce 14 rows detailing which fragments are contained in this structure. A single row containing just the bitstring with each of these bits set can be computed using:

Select orsum(bit_set(0::bit(166),bit)) from gnova.public166keys where matches('c1ccccc1C(=O)NC',smarts);

The general public166keys function is simply the above statement, with the SQL parameter $1 substituted in place of 'c1ccccc1C(=O)NC'.
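
As a sketch, the resulting definition might read as follows. The name gnova.public166keys_sketch is hypothetical so that the installed function is not replaced, and the column names smarts and bit are those used in the statement above. The sketch assumes the CHORD functions matches, bit_set and orsum are visible on your search_path.

```sql
-- Hypothetical name; the installed function is gnova.public166keys.
Create Or Replace Function gnova.public166keys_sketch(text) Returns bit As $$
    -- Start from 166 zero bits, set the bit for every key whose Smarts
    -- matches the input Smiles ($1), and OR the per-row results together.
    Select orsum(bit_set(0::bit(166), bit))
    From gnova.public166keys
    Where matches($1, smarts);
$$ Language sql;

-- Usage with the same structure as the example above.
Select gnova.public166keys_sketch('c1ccccc1C(=O)NC');
```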

Other Methods

LogP can be estimated in a similar fashion as a sum of atom based fragment values, using the method of Ghose and Crippen [3-5]. Andrews' binding energy [6] can also be computed as a sum of atom based fragment values. We have not implemented functions for these. If you have implemented any of these or other fragment based methods and wish to contribute them for other users, please let us know and we'll be glad to include them in future distributions of CHORD.


  1. P. Ertl, B. Rohde, P. Selzer, Fast Calculation of Molecular Polar Surface Area as a Sum of Fragment-based Contributions and Its Application to the Prediction of Drug Transport Properties, J. Med. Chem. 43, 3714-3717, 2000.
  2. J. L. Durant, B. A. Leland, D. R. Henry, and J. G. Nourse, Reoptimization of MDL Keys for Use in Drug Discovery, J. Chem. Inf. Comput. Sci., 42 (6), 1273-1280, 2002.
  3. A. K. Ghose, G. M. Crippen, Atomic physicochemical parameters for three-dimensional-structure-directed quantitative structure-activity relationships. 2. Modeling dispersive and hydrophobic interactions, Journal of Chemical Information and Computer Sciences, 27(1), 21-35, 1987.
  4. A. K. Ghose, A. Pritchett and G. M. Crippen, J. Comput. Chem., 9, 80-90, 1988.
  5. S. A. Wildman and G. M. Crippen, J. Chem. Inf. Comput. Sci., 39, 868-873, 1999.
  6. R. Andrews, D. J. Craik, J. L. Martin, J. Med. Chem. 27(12), 1648-1657, 1984.