SQL Functions for Smiles Strings

Smiles [1] is a compact string representation of molecular structures. The functions described here help you work with these strings in your database.

Function and SQL examples

cansmiles
    Select cansmiles('c1ccccc1C(=O)NC');
    Select count(smiles) From nci.structure Where smiles != cansmiles(smiles);
    Update nci.structure Set cansmiles = cansmiles(smiles);

keksmiles
    Select keksmiles('c1ccccc1C(=O)NC');

impsmiles
    Select impsmiles('c1ccccc1C(=O)NC'); -- example 1 --
    Select count(smiles) From nci.structure Where
      matches(smiles, impsmiles('c1cc(*)ccc1C(=O)NC')); -- example 2 --
    Select impsmiles(keksmiles('c1ccccc1C(=O)NC')); -- example 3 --

isosmiles
    Select isosmiles('C[C@@H](C(=O)O)N'); -- example 1 --
    Select isosmiles('C\\C(F)=C(/C)Br'); -- example 2 --

oe_valid
    Select oe_valid('c1ccccc1C(=O)NC'); -- example 1 --
    Select oe_valid('c1cncc1C(=O)NC'); -- example 2 --
    Select oe_valid('c1c[nH]cc1C(=O)NC'); -- example 3 --
    Update nci.structure Set cansmi=cansmiles(smiles) Where oe_valid(smiles); -- example 4 --

All of the above functions are installed in a SCHEMA named gnova. You can call them by their qualified names, for example gnova.cansmiles, or you can set your search_path to include the SCHEMA gnova and call them by their unqualified names, for example cansmiles.
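
The two ways of calling the functions can be sketched as follows. This is a minimal sketch that assumes a PostgreSQL session; keeping the public schema on the path is an assumption about your installation, not something this document specifies.

```sql
-- Call the function by its schema-qualified name.
SELECT gnova.cansmiles('c1ccccc1C(=O)NC');

-- Or put gnova on the search path for the current session
-- (assumption: the default "public" schema should stay on the path),
-- then call the function by its unqualified name.
SET search_path TO gnova, public;
SELECT cansmiles('c1ccccc1C(=O)NC');
```

SET search_path only affects the current session, so it has to be repeated (or configured at the role or database level) for new connections.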

cansmiles(text Smiles) returns text

This function takes a Smiles string and returns a canonical Smiles string. The process of canonicalizing Smiles may reorder the atomic symbols from the input Smiles and will convert any aromatic atoms to use lower-case atomic symbols. The purpose of reordering the atomic symbols is to produce a unique string for each unique molecular structure. Any stereo-isomeric atoms will NOT be preserved when using cansmiles. To keep stereo-isomeric atoms, use the isosmiles() function.

Hints: if your tables contain canonical Smiles strings, you can ensure that each structure is entered only once. The SQL unique constraint is important for this purpose. If you index a canonical Smiles column, then direct lookups of structure become simple and very fast.
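
The hint above could look like the following sketch. The table my_compounds and its columns are hypothetical, and the statements assume PostgreSQL with gnova on the search_path. In PostgreSQL a UNIQUE constraint creates its own index, so the same constraint also serves the fast direct lookup.

```sql
-- Hypothetical table: the UNIQUE constraint rejects a second row whose
-- canonical Smiles matches an existing one, and it is backed by an index.
CREATE TABLE my_compounds (
    id     serial PRIMARY KEY,
    name   text,
    cansmi text NOT NULL UNIQUE
);

INSERT INTO my_compounds (name, cansmi)
VALUES ('N-methylbenzamide', cansmiles('c1ccccc1C(=O)NC'));

-- The same molecule written with a different atom order. Because cansmiles
-- is meant to give one string per structure, this insert is expected to
-- fail with a unique-constraint violation.
INSERT INTO my_compounds (name, cansmi)
VALUES ('duplicate entry', cansmiles('CNC(=O)c1ccccc1'));

-- Direct lookup by structure: canonicalize the query Smiles, then compare
-- strings so the index on cansmi can be used.
SELECT id, name
FROM my_compounds
WHERE cansmi = cansmiles('CNC(=O)c1ccccc1');
```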

keksmiles(text Smiles) returns text

This function takes a Smiles string and returns a kekule Smiles string. The process of kekulizing Smiles will not reorder the atomic symbols from the input Smiles and will convert any aromatic atoms to use upper-case atomic symbols with alternating double bonds. A kekule Smiles is generally considered to be more portable among various programs which may have somewhat different notions of aromaticity. It may not be unique, since there are multiple kekule structures possible for any one (aromatic) molecular structure. For non-aromatic structures, the keksmiles is usually identical to the input Smiles.

Hints: if you're having trouble using Smiles or cansmiles with other programs (e.g. Marvin, ChemDraw, Concord), try converting to keksmiles before passing the string to the other program.
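
As a small sketch of preparing structures for another program, the query below lists each stored Smiles next to its kekule form. It assumes the nci.structure table and smiles column used in the examples above; the oe_valid filter is a precaution, since this document only states how cansmiles behaves on invalid input, not keksmiles.

```sql
-- Show the stored Smiles next to the kekule form intended for export.
-- Rows that OEChem does not accept are skipped as a precaution.
SELECT smiles,
       keksmiles(smiles) AS kekule_smiles
FROM nci.structure
WHERE oe_valid(smiles)
LIMIT 10;
```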

impsmiles(text Smiles) returns text

This function takes a Smiles string and returns an implicit-H Smiles string. The implicit-H Smiles string shows all the H atoms that are normally only implied in an input Smiles string, written as hydrogen counts inside each atom's brackets rather than as explicit H atoms. For example, CCC becomes [CH3][CH2][CH3] rather than [H]C([H])([H])C([H])([H])C([H])([H])[H].

Hints: if you're having trouble using Smiles or cansmiles with other programs (e.g. Marvin, ChemDraw, Concord), try converting to keksmiles and then to impsmiles before passing the string to the other program (example 3 above). If you want to do a sub-structure search and also need to limit substitution to only one (or a few) selected sites, put a [*] or [R] into the Smiles at the selected sites, then convert to impsmiles (example 2 above).

isosmiles(text Smiles) returns text

This function takes a Smiles string and returns a canonical Smiles string that retains any atom and bond stereo-isomerism. It also preserves any atom maps in the input smiles. The process of canonicalizing Smiles may reorder the atomic symbols from the input Smiles and will convert any aromatic atoms to use lower-case atomic symbols. The purpose of reordering the atomic symbols is to produce a unique string for each unique molecular structure.
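
The difference from cansmiles can be seen by running both functions on the same input. The sketch below assumes gnova is on the search_path and reuses the nci.structure table from the examples above. The CASE expression is a defensive choice so that cansmiles is only evaluated on rows that oe_valid accepts.

```sql
-- Same input through both functions: cansmiles drops the stereo center,
-- isosmiles keeps it.
SELECT cansmiles('C[C@@H](C(=O)O)N') AS without_stereo,
       isosmiles('C[C@@H](C(=O)O)N') AS with_stereo;

-- Count stored structures for which the two canonical forms differ,
-- i.e. rows whose Smiles carries information that cansmiles discards.
SELECT count(*)
FROM nci.structure
WHERE CASE WHEN oe_valid(smiles)
           THEN cansmiles(smiles) <> isosmiles(smiles)
           ELSE false
      END;
```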

oe_valid(text Smiles) returns boolean

This function takes a Smiles string and returns true or false, depending on whether OEChem interprets it as a valid Smiles string. In example 2 above, the result is false because the 5-membered aromatic nitrogen ring is not valid as written. Example 3 is valid because the hydrogen on the nitrogen is specified as [nH]. Aromatic nitrogens can be troublesome in Smiles, along with some other notorious issues with aromatic Smiles. The oe_valid function can be very useful when importing a set of Smiles from another database or file. The cansmiles() function raises an error if it is called with an invalid Smiles. Of the two statements below, the first fails if any row holds an invalid Smiles; the second avoids the problem by updating only the rows that oe_valid accepts:

Update nci.structure set cansmi=cansmiles(smiles);

Update nci.structure set cansmi=cansmiles(smiles) where oe_valid(smiles);
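
Before running the guarded update, it can help to see which rows it will skip. This is a minimal sketch that assumes the same nci.structure table and smiles column as the statements above.

```sql
-- How many rows would the guarded update leave untouched?
SELECT count(*)
FROM nci.structure
WHERE NOT oe_valid(smiles);

-- Inspect a sample of the rejected Smiles so they can be corrected at
-- the source (for example by adding [nH] to an aromatic nitrogen).
SELECT smiles
FROM nci.structure
WHERE NOT oe_valid(smiles)
LIMIT 20;
```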

References

1. D. Weininger, SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules, J. Chem. Inf. Comput. Sci. 28 (1988) 31–36.