CHORD supports many ways of interconverting molecular structure representations. Smiles is the "coin of the realm", but molecular structures can also be represented using chemical names, molfiles or other file types. There are several utility functions for common conversions, such as those involving molfiles and IUPAC names, as well as a more general molconvert function.
There are also two general functions, oe_binmol and oe_strmol, that allow the full range of format conversions. To make full use of these general conversion functions, this section introduces two data types for storing molecular structures other than Smiles. The first is a string representation called a strmol. This may contain a chemical name or a molecular structure "file". The other type is a binary representation called a binmol. This is not directly readable, but is useful as an intermediate during file conversion.
Function | SQL example |
---|---|
molfile_to_smiles | Select molfile_to_smiles(molfile) from vla4.sdf; |
smiles_to_molfile | Select smiles_to_molfile(smiles) from nci.structure; |
iupac | Select iupac('c1ccccc1COCN'); |
name_to_smiles | Select name_to_smiles('phenylmethoxymethanamine'); Select name_to_smiles('benzoxymethylamine'); Select name_to_smiles('benzyloxymethanamine'); |
molconvert | Select molconvert('c1ccccc1COCN', 'smi', 'pdb'); Select molconvert(molfile, 'mol', 'syb') from vla4.sdf; Select smiles, molconvert(smiles, 'smi', 'iupac') from nci.structure; |
oe_binmol | Update vla4.sdf set binmol=oe_binmol(molfile, format_code('mdl')); |
oe_strmol | Select type, oe_strmol('c1ccccc1COCN', code) from gnova.formats where writeable; |
molfile_to_smiles(text Molfile) returns text
This function takes a Molfile represented as a text string and returns a valid Smiles. The isomeric Smiles is returned in order to ensure that any stereo atoms and bonds are preserved. Any name contained in the molfile is not included in the Smiles. See the molconvert or oe_strmol functions below if you need to preserve the structure name contained in the molfile.
smiles_to_molfile(text Smiles) returns text
This function takes a Smiles string and returns a string representation of the molecule in the molfile format. It can be useful when exporting a structure to a program that does not recognize Smiles. The coordinates of the atoms are all zero. In future versions of CHORD, it may be possible to produce 2D or 3D coordinates when generating a molfile, and an additional argument specifying an array of coordinates will be allowed. The more general oe_strmol function is described below.
iupac(text Smiles) returns text
This function takes a Smiles string and returns a chemical name formed according to the IUPAC naming system. In some cases an IUPAC name cannot be generated for a valid Smiles. In that case, the string 'BLAH' is returned. This is a quirk of the underlying Lexichem library. In future versions of CHORD, a null string will be returned instead.
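Until that change is made, the 'BLAH' placeholder can be mapped to a null value at query time. The following is a minimal sketch using the standard PostgreSQL nullif function and the nci.structure table from the examples above; the column alias iupac_name is only illustrative.

```sql
-- Turn the 'BLAH' placeholder into NULL so that failed names are easy to filter out.
Select smiles, nullif(iupac(smiles), 'BLAH') as iupac_name
  from nci.structure;
```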
name_to_smiles(text Name) returns text
This function takes a string interpreted as a chemical name and returns a Smiles representation of the name, if possible. If the name cannot be interpreted, a null value is returned.
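Because a failed interpretation yields a null value, it can be detected with an ordinary is null test. The sketch below runs name_to_smiles over a small inline list; the list, the column name chem_name and the deliberately invalid second entry are illustrative assumptions, not part of CHORD.

```sql
-- chem_name and the inline values are made up for illustration.
Select chem_name,
       name_to_smiles(chem_name) as smiles,
       name_to_smiles(chem_name) is null as failed
  from (values ('benzyloxymethanamine'), ('not a chemical name')) as t(chem_name);
```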
molconvert(text Mol, text InType, text OutType) returns text
This function takes a string representation of a molecule, Mol, which is expected to be in the format InType, and converts it to a string representation of type OutType. The allowed values for InType and OutType are the entries in the type column of the gnova.formats table shown below.
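As a small illustration, molconvert calls can be nested to make a round trip between formats. This sketch assumes only the 'smi', 'mol' and 'ism' type strings listed in gnova.formats; no particular output is claimed here.

```sql
-- Smiles to molfile, then molfile back to isomeric Smiles.
Select molconvert(molconvert('c1ccccc1COCN', 'smi', 'mol'), 'mol', 'ism');
```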
oe_strmol(bytea binmol, integer code) returns text
This function is more general than the preceding functions. The concept of a strmol, or string representation of a molecule, is explained in the section below. Smiles is one string representation, but any of the various chemical name types or file formats are also possible string representations. The oe_strmol function converts a binmol (described below) to any of these names or formats. The integer code corresponds to the type desired and is defined in the table gnova.formats, which is shown below.
The functions shown above all use oe_strmol or oe_binmol in various ways. They are very simple SQL wrappers around oe_strmol and oe_binmol. Their definitions can be found in the functions.sql file contained in the installation tarball for CHORD.
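To give a feel for how such a wrapper might compose the two general functions, the sketch below spells out a Smiles to PDB conversion by hand. It is an assumption about the general pattern, not a copy of the definitions in functions.sql, which remain the authoritative source.

```sql
-- Assumed hand-written equivalent of molconvert('c1ccccc1COCN', 'smi', 'pdb'):
-- parse the input string into a binmol, then write the binmol in the output format.
Select oe_strmol(
         oe_binmol('c1ccccc1COCN', format_code('smi')),
         format_code('pdb'));
```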
oe_binmol(text mol, integer code) returns bytea
This function takes a string representation of a molecular structure and returns a binary representation called a binmol. This uses the PostgreSQL data type bytea. The binmol corresponds to the OpenEye version 2 binary format. The binmol is not expected to be used directly, but instead functions as an intermediate during conversion to and from the various name and file formats. However, it is entirely possible to store a binmol in a column of a table in your database. Future versions of CHORD will make more use of the binmol representation, especially for handling 3D coordinates and interfacing to other OpenEye functions.
The integer codes correspond to the code column of the gnova.formats table shown below. To convert a molfile to a binmol, use the following SQL for molfile strings stored in a table named vla4.sdf. The coordinates in the molfile will be preserved in the binmol.
Update vla4.sdf set binmol=oe_binmol(molfile, format_code('mdl'));
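The update above presumes that vla4.sdf already has a binmol column. A minimal sketch of the full sequence, assuming the column does not yet exist and that 'sdf' is a writeable format, might look like this:

```sql
-- Add a bytea column to hold the binary representation (skip if it already exists).
Alter table vla4.sdf add column binmol bytea;

-- Fill it from the stored molfiles.
Update vla4.sdf set binmol = oe_binmol(molfile, format_code('mdl'));

-- Write the stored binmol back out in another format.
Select oe_strmol(binmol, format_code('sdf')) from vla4.sdf;
```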
Integer codes are used in the conversion functions. Sometimes the code is used to determine which type of strmol is output; sometimes to determine which type of strmol is input. In both cases, the following table will help you choose the right code. Rather than using a constant integer, we recommend using the utility functions format_code and name_code. These functions are defined as:
Select code from gnova.formats where type ilike $1;
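For readers unfamiliar with SQL-language functions in PostgreSQL, the statement above is a function body in which $1 stands for the first argument. A hedged sketch of a complete definition follows. It uses the hypothetical name my_format_code so as not to replace the function shipped with CHORD, and it assumes the code column is of type integer.

```sql
-- my_format_code is a hypothetical name used only for this illustration.
Create function my_format_code(text) returns integer as '
  Select code from gnova.formats where type ilike $1;
' language sql stable;

-- ilike makes the lookup case-insensitive, so this matches the mdl row.
Select my_format_code('MDL');
```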
The gnova.formats table consists of the following rows.
code | type | description |
---|---|---|
1 | smi | SMILES |
2 | mdl | MDL Mol |
2 | mol | MDL Mol |
2 | rxn | MDL Mol |
3 | ent | PDB |
3 | pdb | PDB |
4 | mol2 | Tripos MOL2 |
4 | syb | Tripos MOL2 |
5 | bin | OEBinary v1 |
6 | tdt | Daylight TDT |
7 | ism | Isomeric SMILES |
7 | isosmi | Isomeric SMILES |
8 | mol2h | MOL2 with H |
9 | sd | MDLSDF |
9 | sdf | MDLSDF |
10 | can | Canonical SMILES |
11 | mf | Molecular Formula |
12 | xyz | XYZ |
13 | fasta | FASTA |
13 | seq | FASTA |
14 | mopac | MOPAC |
14 | pac | MOPAC |
15 | oeb | OEBinary v2 |
16 | dat | Macromodel |
16 | mmd | Macromodel |
16 | mmod | Macromodel |
17 | sln | Tripos SLN |
18 | rd | MDL RDF |
18 | rdf | MDL RDF |
19 | cdx | ChemDraw CDX |
101 | openeye_name | OpenEye names loosely correspond to the kinds of names familiar to a medicinal chemist. These names are intended to be a subset of the IUPAC 2005 standard's acceptable names, but not necessarily the PIN (Preferred IUPAC Name). These correspond to the types of names found in a Sigma-Aldrich catalogue or Journal of Medicinal Chemistry article for example. |
102 | iupac_name | IUPAC names are intended to follow the IUPAC 2005 recommendations for the Preferred IUPAC Name (PIN). Unfortunately, this functionality is relatively recent, so the best that can be hoped for these names is that they are more IUPAC-like than the default OpenEye name style. Future releases of Lexichem may further refine this definition to provide IUPAC2005, IUPAC93 and IUPAC79 name styles that reflect the corresponding standard's preferred name. |
103 | cas_name | The Lexichem CAS name style is intended to follow the Chemical Abstracts Service's naming conventions, where they differ from IUPAC's. Once again, as this functionality is relatively recent, the effect is to generate names that are more CAS-like than the default OpenEye name style. |
104 | traditional_name | The Traditional name style corresponds to forms of compound naming that are no longer acceptable under the IUPAC rules. The boundary between OpenEye and Traditional is blurred for a trivial/common name that is acceptable to IUPAC but not preferred, with OpenEye attempting to follow the more prevalent usage. |
105 | systematic_name | Systematic names correspond to the fully systematic IUPAC names that the IUPAC preferred names are slowly converging towards. |
101 | openeye | openeye_name |
102 | iupac | iupac_name |
103 | cas | cas_name |
104 | traditional | traditional_name |
105 | systematic | systematic_name |
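Since several type strings share one code, it can be handy to query the table directly. The sketch below lists the aliases of a given type and, using the writeable column that appears in the oe_strmol example at the top of this section, the types that can be produced as output.

```sql
-- All type strings that share a code with 'sdf'.
Select type, code
  from gnova.formats
 where code = format_code('sdf')
 order by type;

-- Types that can be written by oe_strmol.
Select type, code
  from gnova.formats
 where writeable
 order by code, type;
```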
A strmol is any of the various ways of representing a molecular structure as a text string. This includes molfiles, chemical names and any of the various "file" formats in the gnova.formats table above. These may be stored in tables in the database or simply used when exporting structures.
Some of the possible strmol types correspond to Smiles (smi, isosmi, ism, can). Because the Smiles "file" format allows for an optional name, the name will be included when converting, say, from a molfile to a Smiles using the oe_strmol function or the molconvert function. The molfile_to_smiles function does not preserve names.
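The difference can be seen by running both conversions side by side on the same molfiles. This is a sketch against the vla4.sdf example table; the column aliases are illustrative.

```sql
Select molfile_to_smiles(molfile)        as smiles_only,       -- name dropped
       molconvert(molfile, 'mol', 'ism') as smiles_with_name   -- name kept, if the molfile has one
  from vla4.sdf;
```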
A binmol is a binary representation of a molecular structure. This uses the PostgreSQL data type bytea. The binmol corresponds to the OpenEye version 2 binary format. The binmol is not expected to be used directly, but instead functions as an intermediate during conversion to and from the various name and file formats. However, it is entirely possible to store a binmol in a column of a table in your database. Future versions of CHORD will make more use of the binmol representation, especially for handling 3D coordinates and interfacing to other OpenEye functions.
One option currently under consideration at gNova is using Python functions to implement some OpenEye functionality. It is possible to pass a binmol to a Python function and operate on it as an OEMol type. If you would like to pursue this further, please contact us for some experimental Python wrappers.