SQL Functions for Hashed Path Fingerprints

These functions are used to compute path-based fingerprints. CHORD also functions to compute fragment-based fingerprints. There is some discussion about the differences between these two methods below.
Function SQL example
fp Select fp('c1ccccc1C(=O)NC');

Update nci.structure set gfp = fp(smiles); -- example 2; takes about 10 minutes for 250,000 structures

Select count(smiles) from nci.structure where
    contains(gfp, fp('c1ccccc1C(=O)NC'))
 and matches(smiles, 'c1ccccc1C(=O)NC')
; -- example 3 --
fp Select fp('c1ccccc1C(=O)NC',128);

Update nci.structure set gfp = fp(smiles,512);

Select count(smiles) from nci.structure where
    contains(gfp, fp('c1ccccc1C(=O)NC',512))
 and matches(smiles, 'c1ccccc1C(=O)NC')
;
fp Select fp('c1ccccc1C(=O)NC',2048,64,25);

Update nci.structure set vfp = fp(smiles,2048,64,25);

Select count(smiles) from nci.structure where
bit_contains(vfp, fp('c1ccccc1C(=O)NC',2048,64,25))
 and matches(smiles, 'c1ccccc1C(=O)NC');
All the above functions are installed into a SCHEMA named gnova. They can be accessed as, for example gnova.fp. Or, you can set your search_path to include the SCHEMA gnova and access them by their unqualified names, for example fp.

fp(text Smiles) returns bit

This function takes a Smiles string and returns a 1024 bit string fingerprint which encodes many of the structural features of the Smiles. The main purpose of this fingerprint is to speed up sub-structure searching. One typically stores a fingerprint for each Smiles in a table, say in a column called gfp, as in example 2 above. The search shown in example 3 runs faster because it uses the bit_contains and fp functions before the matches function. This is discussed in more detail in later sections of this document. More flexible versions of this function are described below.

fp(text Smiles, integer bitsize) returns bit

This function takes a Smiles string and returns a bit string fingerprint which encodes many of the structural features of the Smiles. The parameter bitsize specifies the number of bits the fingerprint is to contain. The fp function refuses to compute fingerprints longer than 32768 bits, or shorter than 16 bits. The bitsize should be a power of two: 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384 or 32768. Other bitsizes are tolerated, but the distribution of bits within the bit string fingerprint is uneven, due to the algorithm used to compute the fingerprint. The size of the fingerprint can affect it's usefulness. This is discussed in more detail in later sections of this document.

fp(text Smiles, integer maxbits, integer minbits, integer fold) returns bit varying

This function takes a Smiles string and returns a bit string fingerprint which encodes many of the structural features of the Smiles. This is an even more flexible version of the function above, since it allows you to provide additional parameters specifying how the fingerprint is to be computed. The parameter maxbits specifies the maximum number of bits the fingerprint is to contain. The parameter minbits specifies the minimum number of bits the fingerprint is to contain. The fingerprint is first computed using maxbits bits and then possibly folded in half repeatedly, as long as the percentage of bits set to 1 is less than the fold parameter and the size of the fingerprint does not go below minbits. This ensures that sparse fingerprints are avoided. Small and simple structures would otherwise have sparse fingerprints. Both maxbits and minbits must be less than 32768 and greater than 16. Both the maxbits and minbits parameters should be a power of two: 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384 or 32768. The parameter maxbits must not be less than minbits. A more elaborate explanation of how this all works is given below.

Hints: If you give the same number for maxbits and minbits, the fingerprint will not be folded. If you give fold as 0, no folding will occur and the resulting fingerprint will be maxbits in size. If you give fold as 100, folding will occur until the fingerprint is minbits in size. Don't use fold=100, since this is the slowest way to produce a fingerprint of size minbits. You can also use this function with only the Smiles argument and default values will be used: maxbits=1024, minbits=1024, fold=0.

Fingerprint Method and Usage

The fingerprint method used here results in a hashed path fingerprint. All bonded paths of atoms (up to a maximum length) are encoded using a hash function and compounded into a string of bits. Here we only consider paths up to a maximum length of 7 atoms. Each path is hashed, producing a pseudo-random number that is used to set bits in the fingerprint. While constructing the path, if a ring is found, two paths are generated - the original path and an additional path containing just this ring. After all paths are considered, many bits are set in the fingerprint and it becomes a nearly-unique representation of the molecule. More importantly, it becomes a screen for all structures of which it is a sub-structure. In other words, all the bits set in the fingerprint for structure A are guaranteed to also be set in fingerprints for structures B that contain structure A, although structures B may also have other bits set. So, instead of exhaustively searching every structure in a database for a substructure match, each structure can be screened by comparing it's fingerprint to the fingerprint of the search target. In this way, impossible matches can be quickly ruled out.

To put this into practice, consider the following two searches:

  1. Select count(smiles) from nci.structure where matches(smiles,'c1ccccc1C(=O)NC');
  2. Select count(smiles) from nci.structure where
    bit_contains(gfp, fp('c1ccccc1C(=O)NC'))
     and matches(smiles, 'c1ccccc1C(=O)NC');

In the first search, every Smiles in the table nci.structure is matched against the Smiles c1ccccc1C(=O)NC. The match function is relatively slow. In the second search, the fingerprint (pre-computed and stored in column gfp) of each Smiles is first compared to the fingerprint of c1ccccc1C(=O)NC. Only if gfp contains every bit of fp('c1ccccc1C(=O)NC')does the search proceed to match to ensure an accurate match. The bit_contains function is relatively fast and the overall effect is to speed up the search by a factor of two to ten. When the search target is small or commonplace, most every structure in the table will also have the target fingerprint's bits set and few structures will be screened out. In these cases there is almost no increase in search speed.

Fingerprint Size

The size of the fingerprint is the total number of bits that it contains, whether they are set or not. A large structure will contain a large number of paths and hence will cause more bits to be set than a small structure. The size of the fingerprint must be large enough to contain all these bits. For example, the CHORD fingerprint for caffeine, Cn1cnc2n(C)c(=O)n(C)c(=O)c12 has 222 bits set, using the default size of 1024 bits. However, if we use a size of 64, all 64 bits are set. Most, if not all other molecules of similar or larger size and complexity would also have all 64 bits set. The use of a 64 bit fingerprint as a screen becomes nil. If we use 32768 as the size of the fingerprint, caffeine sets 243 bits. This is perfectly OK, but it wastes some space in the database and it takes more time to compare (using bit_contains) two long fingerprints. The CHORD default of 1024 bits is a compromise based on an analysis of the NCI database of about 240,000 compounds. If your database contains only small molecules or contains many very large molecules, you may use a smaller or larger fingerprint, or a variable size fingerprint as described below.

Try these examples:

select nbits_set(fp('Cn1cnc2n(C)c(=O)n(C)c(=O)c12', 64));
select nbits_set(fp('Cn1cnc2n(C)c(=O)n(C)c(=O)c12', 1024));
select nbits_set(fp('Cn1cnc2n(C)c(=O)n(C)c(=O)c12', 32768));
select nbits_set(fp('CNC',1024));
select nbits_set(fp('CNC',64));
select nbits_set(fp('CCC(C)C1C(=O)NC(C(=O)NC(=C)C(=O)NC(C(=O)NC23CCC(=NC2c4csc(n4)C(C(OC(=O)c5cc(c6c(n5)C(C(N1)C=C6)O)C(C)O)C)NC(=O)c7csc(n7)C(NC(=O)C8CSC(=N8)C(=CC)NC(=O)C(NC(=O)c9csc3n9)C(C)O)C(C)(C(C)O)O)c1nc(cs1)C(=O)NC(=C)C(=O)NC(=C)C(=O)N)C)C', 2048));

Variable Size Fingerprints

The fp function of CHORD allows you to compute a variable size fingerprint. A variable size fingerprint is still limited to a specified maximum and minimum size. CHORD uses a default maximum size of 2048 and minimum size of 64. The maximum size must be less than 32768 and greater than 16 bits. The maximum size must not be less than the minimum size. Furthermore, the maximum and minimum bit size must be a power of 2. This is because the fp function attempts to reduce the size of the fingerprint from it's maximum bit size by folding it in half. (Only a size which is a power of two can be folded an arbitrary number of times without ever yielding an odd-number size, which could not be folded in half.) Each fold reduces the size of the fingerprint by one half and doubles (approximately) the number of bits set. The folding stops when the number of bits set, expressed as a percent of the fingerprint size, is at least as large as a specified number, or the size of the fingerprint has been reduced to the minimum bit size.

Try these examples:

select length(fp('Cn1cnc2n(C)c(=O)n(C)c(=O)c12', 2048, 64, 25)),
    nbits_set(fp('Cn1cnc2n(C)c(=O)n(C)c(=O)c12', 2048, 64, 25));
select length(fp('CNC', 2048, 64, 25)),
    nbits_set(fp('CNC', 2048, 64, 25));
select length(fp(smiles,2048,64,25)),
    nbits_set(fp(smiles,2048,64,25)),
  bit_density(fp(smiles,2048,64,25)) from nci.structure limit 100;

Fingerprints and Similarity

Fingerprints are sometimes used to compute the similarity of one structure to another. We do not recommend using the default CHORD fingerprint for this purpose. Instead, consider using a fragment/key fingerprint such as the public166keys or the mesa.maccskeys320 function along with the tanimoto or other similarity function. The results of tanimoto(vfp, fp('c1ccccc1C(=O)NC',2048,64,25)) are not incorrect, but are untested for real-world applications. If you are interested in investigating this, we are interested in collaborating in this research. If your database contains structures which have many fragments not found in the public 166 keys, the gNova makefp program can help you create a table of fragments upon which you can base similarity measures.