John Bradshaw has given a good general introduction to similarity measures.
Function | SQL example |
---|---|
tversky | Select tversky(public166keys('c1ccccc1COCN'), public166keys('c1ccccc1COCN'), 0.8, 0.2); Select count(smiles) from nci.structure where tversky(fkey, public166keys('c1ccccc1COCN'), 0.8, 0.2) > 0.7; |
tanimoto | Select tanimoto(public166keys('c1ccccc1COCN'), public166keys('c1ccccc1COCN')); Select count(smiles) from nci.structure where tanimoto(fkey, public166keys('c1ccccc1COCN')) > 0.45; |
euclid | Select euclid(public166keys('c1ccccc1COCN'), public166keys('c1ccccc1COCN')); Select count(smiles) from nci.structure where euclid(fkey, public166keys('c1ccccc1COCN')) > 0.95; |
hamming | Select hamming(public166keys('c1ccccc1COCN'), public166keys('c1ccccc1COCN')); Select count(smiles) from nci.structure where hamming(fkey, public166keys('c1ccccc1COCN')) < 0.1; |
similarity | Select similarity(public166keys('c1ccccc1COCN'), public166keys('c1ccccc1COCN')); Select count(smiles) from nci.structure where similarity(fkey, public166keys('c1ccccc1COCN')) > 0.7; |
similarity | Select similarity('c1ccccc1C(=O)NC', 'c1ccccc1COCN'); |
tversky(bit Afp, bit Bfp, real alpha, real beta) returns real
This function takes two bit-string fingerprints and two real-number parameters. It applies the appropriate logical operations and ratios to produce the Tversky measure of similarity. The result is between 0.0 and 1.0, where 1.0 means the fingerprints are identical and 0.0 means they have nothing in common. Intermediate values may be interpreted in various ways, depending upon how the fingerprints were computed.
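For reference, the Tversky index is usually written as shown below. This is the textbook definition, not a statement of how CHORD computes it internally. The symbols a, b and c are introduced here for illustration: c is the number of bits set in both Afp and Bfp, a is the number set only in Afp, and b is the number set only in Bfp.

$$
\mathrm{tversky}(A, B, \alpha, \beta) = \frac{c}{c + \alpha\,a + \beta\,b},
\qquad c = |A \wedge B|,\quad a = |A \wedge \neg B|,\quad b = |\neg A \wedge B|
$$

Under this definition the measure is asymmetric whenever alpha and beta differ. With alpha = 0.8 and beta = 0.2, for example, a bit found only in the first fingerprint is penalized four times as heavily as a bit found only in the second. That alpha weights the first argument is an assumption based on the argument order in the signature.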
tanimoto(bit Afp, bit Bfp) returns real
This function takes two bit-string fingerprints and applies the appropriate logical operations and ratios to produce the Tanimoto measure of similarity. The result is between 0.0 and 1.0, where 1.0 means the fingerprints are identical and 0.0 means they have nothing in common. Intermediate values may be interpreted in various ways, depending upon how the fingerprints were computed.
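As a sketch of the arithmetic, the standard Tanimoto coefficient is the number of bits set in both fingerprints divided by the number set in either. The symbols a, b and c are illustrative: c counts bits set in both fingerprints, a counts bits set only in Afp, and b counts bits set only in Bfp. The four-bit fingerprints in the worked example are made up for illustration and are far shorter than real fragment keys.

$$
\mathrm{tanimoto}(A, B) = \frac{|A \wedge B|}{|A \vee B|} = \frac{c}{a + b + c}
$$

$$
A = 1101,\; B = 1001: \qquad c = 2,\; a = 1,\; b = 0, \qquad \mathrm{tanimoto}(A, B) = \frac{2}{1 + 0 + 2} = \frac{2}{3} \approx 0.67
$$

This is the same as the textbook Tversky index with alpha = 1 and beta = 1.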
euclid(bit Afp, bit Bfp) returns real
This function takes two bit-string fingerprints and applies the appropriate logical operations and ratios to produce the Euclidean measure of similarity. The result is between 0.0 and 1.0, where 1.0 means the fingerprints are identical and 0.0 means they have nothing in common. Intermediate values may be interpreted in various ways, depending upon how the fingerprints were computed.
hamming(bit Afp, bit Bfp) returns real
This function takes two bit-string fingerprints and applies the appropriate logical operations and ratios to produce the Hamming measure of similarity. The result is between 0.0 and 1.0, where 1.0 means the fingerprints are identical and 0.0 means they have nothing in common. Intermediate values may be interpreted in various ways, depending upon how the fingerprints were computed.
similarity(bit Afp, bit Bfp) returns real
For convenience, we have included this function that accepts two bit string fingerprints and returns a similarity measure. It uses the tversky function with alpha = 0.5 and beta = 0.5.
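If tversky follows the textbook definition (an assumption; the text above only says which parameters are used), then alpha = 0.5 and beta = 0.5 give the Dice coefficient. Here c is the number of bits set in both fingerprints, a the number set only in Afp, and b the number set only in Bfp; these symbols are illustrative.

$$
\mathrm{similarity}(A, B) = \frac{c}{c + 0.5\,a + 0.5\,b} = \frac{2c}{2c + a + b} = \frac{2c}{|A| + |B|}
$$

$$
A = 1101,\; B = 1001: \qquad c = 2,\; a = 1,\; b = 0, \qquad \mathrm{similarity}(A, B) = \frac{4}{4 + 1 + 0} = 0.8
$$

On that assumption, similarity and tanimoto thresholds are not interchangeable. A Dice value D corresponds to a Tanimoto value of D / (2 - D), so a cutoff of similarity > 0.7 is roughly equivalent to tanimoto > 0.54.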
similarity(text ASmiles, text BSmiles) returns real
For convenience, we have included this version of the function, which accepts two SMILES strings and returns a similarity measure. It uses the tversky function with alpha = 0.5 and beta = 0.5, along with the public166keys function to compute a fragment key fingerprint for each SMILES. For a direct comparison of two constant SMILES, this version of similarity is very convenient, e.g. Select similarity('c1ccccc1C(=O)NC', 'c1ccccc1COCN'); When searching an entire table, however, avoid this version, because it is much slower. Consider the following two statements:
Select count(smiles) from nci.structure where similarity(smiles, 'c1ccccc1C(=O)NC') > 0.7; -- don't do this!
Select count(smiles) from nci.structure where similarity(fkey, public166keys('c1ccccc1C(=O)NC')) > 0.7;
The second statement executes much more quickly because it uses the fkey column, which contains pre-computed fragment/key fingerprints. The first statement has to compute a fingerprint for every SMILES in the table.
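A table of your own can be given the same kind of pre-computed column. The following is a minimal sketch, not a prescribed procedure: the table names my_structures and my_keys are hypothetical, and it assumes that public166keys accepts a SMILES column in the same way it accepts the constant SMILES used above.

```sql
-- Hypothetical names: my_structures (a table with a smiles column) and my_keys.
-- Compute each fragment/key fingerprint once and store it next to its SMILES.
Create table my_keys as
  Select smiles, public166keys(smiles) as fkey
  from my_structures;

-- Later searches compare against the stored fingerprints, so only the
-- query structure's fingerprint is computed at search time.
Select count(smiles) from my_keys
where similarity(fkey, public166keys('c1ccccc1C(=O)NC')) > 0.7;
```

Using Create table ... as avoids having to declare the fingerprint column's type by hand, since it is taken from whatever public166keys returns.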
Hashed path fingerprints are sometimes used to compute the similarity of one structure to another. We do not recommend using the default CHORD fingerprint for this purpose. Instead, consider using a fragment/key fingerprint, such as one produced by the public166keys or the Mesa maccskeys320 function, along with tanimoto or another similarity function. The results of tanimoto(gfp, fp('c1ccccc1C(=O)NC')) or tanimoto(vfp, fp('c1ccccc1C(=O)NC',2048,64,25)) are not incorrect, but they are untested in real-world applications. If you would like to investigate this, we would be glad to collaborate on the research.
If your database contains structures with many fragments not found in the public 166 keys, the gNova makefp program can help you create a table of fragments on which to base similarity measures.
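As an illustration of the recommended fragment/key approach, the sketch below ranks the structures most similar to a query using tanimoto on the pre-computed fkey column. It uses only the functions, table and columns shown earlier in this section. The 0.7 cutoff and the limit of 10 rows are arbitrary choices for the example, and the limit clause assumes a PostgreSQL-style SQL dialect.

```sql
-- Return the ten structures most similar to the query SMILES,
-- comparing public 166-key fingerprints with the Tanimoto measure.
Select smiles,
       tanimoto(fkey, public166keys('c1ccccc1C(=O)NC')) as sim
from nci.structure
where tanimoto(fkey, public166keys('c1ccccc1C(=O)NC')) > 0.7
order by sim desc
limit 10;
```

The expression is repeated in the where clause because a column alias such as sim cannot be referenced there in standard SQL.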