makefp
makefp is a command-line program that computes hashed path fingerprints from input Smiles, or from other file formats such as SDF or MOL files. The same functionality is available through CHORD's fp function; this stand-alone program is provided for your convenience. makefp may have additional features that are still being developed and are not yet incorporated into CHORD. Note that makefp may use a different default size and folding than CHORD. However, if the parameters used to create fingerprints with makefp are the same as those used with CHORD, the output fingerprints will also be the same.
Usage: makefp [options] <infile>
Options:
--max-path # max length of path to consider; default 7
--min-path # min length of path to consider; default 1
--max-bits # max bits in fingerprint; default 2048
--min-bits # min bits to fold down to; default 64
--density # bit density percent, below which folding occurs; default 0.30
--binary output fingerprints as 0's and 1's instead of hex
--rings include ring paths
--no-paths do not include atom or ring paths (overrides --rings)
--neighbors include atom nearest neighbors
--valence include formal charge, valence, connection and implicit H atom counts
--strings show each path and neighbor string
--percents print density statistics
--debug print debug info
--no-output no output at all (useful for timing)
If <infile> is not given, Smiles are read from standard input. Otherwise, <infile> can be a file in any of the following formats:
File type | File extension |
---|---|
SMILES | .smi .ism .can |
SDF | .sdf .mol |
MOL2 | .mol2 |
PDB | .pdb .ent |
MacroModel | .mmod |
The output is a canonical Smiles followed by the fingerprint in hexadecimal format, or optionally using 0's and 1's.
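As a rough illustration of working with this output, the sketch below splits one line into its Smiles and fingerprint parts and expands the hexadecimal string into individual bits. It assumes the two fields are separated by whitespace and that each hex digit stands for four bits; neither detail is stated above, and the function name `parse_fp_line` and the sample line are made up for this example.

```python
def parse_fp_line(line):
    """Split one output line into (smiles, list of 0/1 bits).

    Assumes "<smiles> <hex fingerprint>" separated by whitespace,
    which is an assumption about the layout, not a documented format.
    """
    smiles, hex_fp = line.split()
    nbits = 4 * len(hex_fp)          # four bits per hex digit
    value = int(hex_fp, 16)
    # Zero-pad so leading zero bits are not lost.
    bits = [int(c) for c in format(value, "0{}b".format(nbits))]
    return smiles, bits


# Made-up sample line, not real makefp output.
smiles, bits = parse_fp_line("CCO 00ff")
print(smiles, len(bits), sum(bits))
```

The count of bits set to 1 divided by the fingerprint length gives the bit density that the folding step relies on.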
The input structures are examined to find connected paths of atoms. The Smarts-like strings representing these paths are encoded into a bit pattern called a path-based fingerprint. You will see the string values when using the -s switch. This can be useful for creating a library of fragments for an entire database, to be used as fragment-based or key-based fingerprints. Note: for aromatic compounds, some fragments may be ccc or ccccc, which cannot be interpreted as Smiles but are perfectly valid Smarts.
If there are rings in the input structure, the closed-ring paths are output as well as the open-chain paths. For example, for methyl-cyclobutane the paths C1CCC1 and C1CCC1C are output as well as CCCCC. If you want only open-chain paths, use the -r switch.
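The idea of turning atom paths into fingerprint bits can be sketched in a few lines of Python. This is a simplified illustration, not makefp's actual algorithm: it enumerates open-chain paths only (no ring-closure paths, no bond symbols), and it uses `zlib.crc32` as a stand-in hash because the real hash function is not described here. The names `enumerate_paths` and `paths_to_bits` are hypothetical.

```python
import zlib


def enumerate_paths(atoms, bonds, min_len=1, max_len=7):
    """Return the set of open-chain path strings with min_len..max_len atoms.

    atoms: list of atom symbols; bonds: list of (i, j) index pairs.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    found = set()

    def extend(path):
        if len(path) >= min_len:
            labels = tuple(atoms[i] for i in path)
            # A path and its reverse are the same fragment; keep one form.
            found.add("".join(min(labels, labels[::-1])))
        if len(path) == max_len:
            return
        for nxt in neighbors[path[-1]]:
            if nxt not in path:          # never revisit an atom
                extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return found


def paths_to_bits(paths, nbits=2048):
    """Set one bit per path string (crc32 is an assumed stand-in hash)."""
    bits = [0] * nbits
    for p in paths:
        bits[zlib.crc32(p.encode()) % nbits] = 1
    return bits


# Ethanol, CCO: three atoms in a chain.
paths = enumerate_paths(["C", "C", "O"], [(0, 1), (1, 2)])
print(sorted(paths))
fingerprint = paths_to_bits(paths)
```

Because different path strings can hash to the same position, the number of bits set may be smaller than the number of distinct paths.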
This program normally considers continuous paths of atoms up to a maximum length of 7, but this can be changed using the -l argument. The maximum size of the fingerprint is 2048, but this can be adjusted using the -b switch.
If the resulting fingerprint has too few bits set to 1 (this happens for small and simple structures), the fingerprint is "folded", which halves the total length and approximately doubles the bit density. The bit density is defined as the ratio of the number of bits set to 1 to the size of the fingerprint. Folding is repeated until the bit density is at least 0.30 or the total length has been reduced to 64. You can change the density and minimum fingerprint length using the -y and -f arguments, respectively.
The maximum (-b) and minimum (-f) fingerprint lengths should be powers of two. Other sizes are tolerated, but the bits will be unevenly distributed within the fingerprint due to the nature of the algorithm used to compute the fingerprint. A size greater than 32768 or less than 16 is not tolerated.
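The folding loop described above can be sketched as follows. This is a minimal illustration under one stated assumption: folding combines the two halves of the fingerprint with a bitwise OR, which is the common convention but is not confirmed here for makefp. The function names are hypothetical, and the defaults mirror the documented values (density 0.30, minimum length 64).

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ...; the recommended kind of fingerprint size."""
    return n > 0 and (n & (n - 1)) == 0


def bit_density(bits):
    """Ratio of bits set to 1 to the fingerprint length."""
    return sum(bits) / len(bits)


def fold_fingerprint(bits, min_density=0.30, min_bits=64):
    """Halve the fingerprint until it is dense enough or min_bits long.

    Assumption: each fold ORs the upper half onto the lower half.
    """
    bits = list(bits)
    while len(bits) > min_bits and bit_density(bits) < min_density:
        half = len(bits) // 2
        bits = [a | b for a, b in zip(bits[:half], bits[half:])]
    return bits


# A sparse 256-bit fingerprint with 40 bits set (density about 0.16).
sparse = [1] * 40 + [0] * 216
folded = fold_fingerprint(sparse)
print(len(folded), round(bit_density(folded), 4))
```

In this example a single fold is enough: at 128 bits the density is 40/128, which is above 0.30, so the loop stops before reaching the 64-bit minimum. The density only approximately doubles in general, because two set bits that land on the same position after a fold merge into one.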
The -d switch will output extra debugging information. The -n switch will generate no output. It is useful if you just want timing information without including the overhead of formatting and producing output. The -p switch will also output the bit density of the final fingerprint as well as intermediate fingerprints generated during the folding process. The -z switch will output the fingerprint using 0's and 1's instead of hexadecimal.