SQL Functions for Converting Chemical Names and Files

CHORD supports many ways of interconverting molecular structure representations. Smiles is the "coin of the realm", but molecular structures can also be represented using chemical names, molfiles or other types of files. There are several utility functions for common conversions, such as those involving molfiles and IUPAC names, as well as a more general molconvert function.

There are also two general functions that allow the full range of format conversions. In order to fully utilize the general conversion functions, this section introduces two data types for storing molecular structures other than Smiles. The first is a string representation called a strmol. This may contain a chemical name or a molecular structure "file". The other type is a binary representation called a binmol. This is not directly readable, but is useful as an intermediate during file conversion.

Function            SQL example
molfile_to_smiles   Select molfile_to_smiles(molfile) from vla4.sdf;
smiles_to_molfile   Select smiles_to_molfile(smiles) from nci.structure;
iupac               Select iupac('c1ccccc1COCN');
name_to_smiles      Select name_to_smiles('phenylmethoxymethanamine');
                    Select name_to_smiles('benzoxymethylamine');
                    Select name_to_smiles('benzyloxymethanamine');
molconvert          Select molconvert('c1ccccc1COCN', 'smi', 'pdb');
                    Select molconvert(molfile, 'mol', 'syb') from vla4.sdf;
                    Select smiles, molconvert(smiles, 'smi', 'iupac') from nci.structure;
oe_binmol           Update vla4.sdf set binmol=oe_binmol(molfile, format_code('mdl'));
oe_strmol           Select type, oe_strmol('c1ccccc1COCN', code) from gnova.formats where writeable;

All of the above functions are installed in a schema named gnova. They can be called by their schema-qualified names, for example gnova.iupac. Alternatively, you can set your search_path to include the gnova schema and call them by their unqualified names, for example iupac.
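
The two ways of calling a function can be sketched as follows. This is only an illustration; the public schema in the search path is an assumption about a default PostgreSQL setup, so adjust it to match your own database.

```sql
-- Schema-qualified call; works regardless of the current search_path.
Select gnova.iupac('c1ccccc1COCN');

-- Add gnova to the search path for the current session
-- (public is assumed here as the usual default schema),
-- then call the function by its unqualified name.
Set search_path to gnova, public;
Select iupac('c1ccccc1COCN');
```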

molfile_to_smiles(text Molfile) returns text

This function takes a Molfile represented as a text string and returns a valid Smiles. The isomeric Smiles is returned in order to ensure that any stereo atoms and bonds are preserved. Any name contained in the molfile is not included in the Smiles. See the molconvert or oe_strmol functions below if you need to preserve the structure name contained in the molfile.

smiles_to_molfile(text Smiles) returns text

This function takes a Smiles string and returns a string representation of the molecule in the molfile format. It can be useful when exporting a structure to a program that does not recognize Smiles. The coordinates of the atoms are all zero. In future versions of CHORD, it may be possible to produce 2D or 3D coordinates when generating a molfile. In addition, another argument specifying an array of coordinates will be allowed. The oe_strmol function is described below.

iupac(text Smiles) returns text

This function takes a Smiles string and returns a chemical name formed according to the IUPAC naming system. In some cases an IUPAC name for a valid Smiles cannot be generated. In that case, a string 'BLAH' is returned. This is a quirk of the underlying Lexichem library. In future versions of CHORD, a null string will be returned.
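
Until a null is returned directly, one possible workaround is to map the 'BLAH' placeholder to null with the standard PostgreSQL nullif function. This is a sketch, not part of CHORD itself; the column alias iupac_name is chosen here for illustration.

```sql
-- nullif returns null when its two arguments are equal,
-- so structures that could not be named show up with a null name.
Select smiles, nullif(iupac(smiles), 'BLAH') as iupac_name
from nci.structure;
```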

name_to_smiles(text Name) returns text

This function takes a string interpreted as a chemical name and returns a Smiles representation of the name, if possible. If the name cannot be interpreted, a null value is returned.
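
Because uninterpretable names yield null, failures can be detected with an ordinary null test. The sketch below uses an inline list of names; the second entry is a made-up string that is assumed, not guaranteed, to be rejected.

```sql
-- Convert a short list of candidate names and flag those that
-- could not be interpreted as chemical names.
Select n as name,
       name_to_smiles(n) as smiles,
       name_to_smiles(n) is null as failed
from (values ('phenylmethoxymethanamine'),
             ('not a chemical name')) as t(n);
```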

molconvert(text Mol, text InType, text OutType)

This function takes a string representation of a molecule, Mol, which is expected to be in the format InType, and converts it to a string representation of type OutType. The allowed values for InType and OutType are listed in the type column of the table gnova.formats shown below.
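
As a sketch of typical use, the type names can be listed directly from gnova.formats and then passed to molconvert. The round trip shown here is only an illustration; not every type is necessarily usable as an output type, which is presumably what the writeable column used in the oe_strmol example indicates.

```sql
-- List the type names that can be given as InType and OutType.
Select code, type from gnova.formats order by code, type;

-- Convert a Smiles to an MDL molfile, then back to isomeric Smiles.
Select molconvert(molconvert('c1ccccc1COCN', 'smi', 'mol'), 'mol', 'ism');
```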

oe_strmol(bytea binmol, integer code) returns text

This function is more general than the preceding functions. The concept of a strmol, or string representation of a molecule, is explained in the section below. Smiles is one string representation, but any of the various chemical name types or file formats are also possible string representations. The oe_strmol function converts a binmol (described below) to any of these various names or formats. The integer code corresponds to the type desired and is defined in the table gnova.formats. That table is shown below.

The functions shown above all use oe_strmol or oe_binmol in various ways; they are very simple SQL wrappers around these two functions. Their definitions can be found in the functions.sql file contained in the installation tarball for CHORD.
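
To illustrate the wrapper idea, here is a hypothetical function written in the same spirit. It is not the definition shipped in functions.sql, which may differ; the name my_molconvert is invented to avoid clashing with gnova.molconvert, and the sketch assumes that format_code returns a single integer for each type name.

```sql
-- Hypothetical wrapper, for illustration only.
Create function my_molconvert(text, text, text) returns text as $$
    Select gnova.oe_strmol(
        gnova.oe_binmol($1, gnova.format_code($2)),  -- parse the input string into a binmol
        gnova.format_code($3)                        -- write the binmol out in the target type
    );
$$ language sql;

-- Used the same way as molconvert.
Select my_molconvert('c1ccccc1COCN', 'smi', 'pdb');
```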

oe_binmol(text mol, integer code) returns bytea

This function takes a string representation of a molecular structure and returns a binary representation called a binmol. This uses the PostgreSQL data type bytea. The binmol corresponds to the OpenEye version 2 binary format. The binmol is not expected to be used directly, but instead functions as an intermediate during conversion to and from the various name and file formats. However, it is entirely possible to store a binmol in a column in a table in your database. Future versions of CHORD will make more use of the binmol representation, especially for handling 3D coordinates and interfacing to other OpenEye functions.

The integer codes correspond to the code column in the gnova.formats table shown below; the format_code function looks up the code for a given type name. To convert a molfile to a binmol, use the following SQL for molfile strings stored in a table named vla4.sdf. The coordinates in the molfile will be preserved in the binmol.

Update vla4.sdf set binmol=oe_binmol(molfile, format_code('mdl'));
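
The Update above presumes that vla4.sdf already has a binmol column. A fuller sketch of the round trip, assuming the column still has to be created, might look like this:

```sql
-- Assumption: vla4.sdf does not yet have a binmol column.
Alter table vla4.sdf add column binmol bytea;

-- Store the binary form of each molfile.
Update vla4.sdf set binmol = oe_binmol(molfile, format_code('mdl'));

-- Read the stored binmol back as isomeric Smiles.
Select oe_strmol(binmol, format_code('ism')) from vla4.sdf;
```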

Format codes and name codes

Integer codes are used in the conversion functions. Sometimes the code is used to determine which type of strmol is output; sometimes to determine which type of strmol is input. In both cases, the following table will help you choose the right code. Rather than using a constant integer, we recommend using the utility functions format_code and name_code. These functions are defined as:

Select code from gnova.formats where type ilike $1;
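
A few example lookups are sketched below. Because the definition uses ilike, the type name is matched without regard to case. The name_code call assumes that name_code looks up the name-style rows in the same way as format_code.

```sql
Select format_code('pdb');   -- 3 according to the gnova.formats table
Select format_code('PDB');   -- same row, since ilike ignores case
Select name_code('iupac');   -- assumed to look up the iupac name style
```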

The gnova.formats table consists of the following rows.

code  type              description
1     smi               SMILES
2     mdl               MDL Mol
2     mol               MDL Mol
2     rxn               MDL Mol
3     ent               PDB
3     pdb               PDB
4     mol2              Tripos MOL2
4     syb               Tripos MOL2
5     bin               OEBinary v1
6     tdt               Daylight TDT
7     ism               Isomeric SMILES
7     isosmi            Isomeric SMILES
8     mol2h             MOL2 with H
9     sd                MDLSDF
9     sdf               MDLSDF
10    can               Canonical SMILES
11    mf                Molecular Formula
12    xyz               XYZ
13    fasta             FASTA
13    seq               FASTA
14    mopac             MOPAC
14    pac               MOPAC
15    oeb               OEBinary v2
16    dat               Macromodel
16    mmd               Macromodel
16    mmod              Macromodel
17    sln               Tripos SLN
18    rd                MDL RDF
18    rdf               MDL RDF
19    cdx               ChemDraw CDX
101   openeye_name      OpenEye names loosely correspond to the kinds of names familiar to a medicinal chemist. These names are intended to be a subset of the IUPAC 2005 standard's acceptable names, but not necessarily the PIN (Preferred IUPAC Name). These correspond to the types of names found in a Sigma-Aldrich catalogue or a Journal of Medicinal Chemistry article, for example.
102   iupac_name        IUPAC names are intended to follow the IUPAC 2005 recommendations for the Preferred IUPAC Name (PIN). Unfortunately, this functionality is relatively recent, so the best that can be hoped for these names is that they are more IUPAC-like than the default OpenEye name style. Future releases of Lexichem may further refine this definition to provide IUPAC2005, IUPAC93 and IUPAC79 name styles that reflect the corresponding standard's preferred name.
103   cas_name          The Lexichem CAS name style is intended to follow the Chemical Abstracts Service's naming conventions, where they differ from IUPAC's. Once again, as this functionality is relatively recent, the effect is to generate names that are more CAS-like than the default OpenEye name style.
104   traditional_name  The Traditional name style corresponds to forms of compound naming that are no longer acceptable under the IUPAC rules. The boundary between OpenEye and Traditional is blurred for a trivial/common name that is acceptable to IUPAC but not preferred, with OpenEye attempting to follow the more prevalent usage.
105   systematic_name   Systematic names correspond to the fully systematic IUPAC names that the IUPAC preferred names are slowly converging towards.
101   openeye           openeye_name
102   iupac             iupac_name
103   cas               cas_name
104   traditional       traditional_name
105   systematic        systematic_name

String representations of molecular structure (strmol)

A strmol is any of the various ways of representing a molecular structure as a text string. This includes molfiles, chemical names and any of the various "file" formats in the gnova.formats table above. These may be stored in tables in the database or simply used when exporting structures.

Some of the possible strmol types correspond to Smiles (smi, isosmi, ism, can). Because the Smiles "file" format allows for an optional name, the name will be included when converting, say, from a molfile to a Smiles using the oe_strmol function or the molconvert function. The molfile_to_smiles function does not preserve names.
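
The difference can be seen by running both conversions side by side. This is a sketch using the vla4.sdf table from the earlier examples; the column aliases are chosen here for illustration.

```sql
-- molfile_to_smiles drops any name stored in the molfile,
-- while molconvert to a Smiles type keeps it.
Select molfile_to_smiles(molfile) as smiles_only,
       molconvert(molfile, 'mol', 'ism') as smiles_with_name
from vla4.sdf;
```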

Byte string binary representations of molecular structure (binmol)

A binmol is a binary representation of a molecular structure. This uses the PostgreSQL data type bytea. The binmol corresponds to the OpenEye version 2 binary format. The binmol is not expected to be used directly, but instead functions as an intermediate during conversion to and from the various name and file formats. However, it is entirely possible to store a binmol in a column in a table in your database. Future versions of CHORD will make more use of the binmol representation, especially for handling 3D coordinates and interfacing to other OpenEye functions.

One option currently under consideration at gNova is using Python functions to implement some OpenEye functionality. It is possible to pass a binmol to a Python function and operate on it as an OEMol type. If you would like to pursue this further, please contact us for some experimental Python wrappers.