glogP simple atom types + xlogP training set

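Simple atom types are derived from six atom properties: atomic number, aromaticity, charge, valence, hydrogen count and substitution count. In the SMARTS below, the element symbol (lowercase when aromatic) is followed by the charge (e.g. +0), valence (v), hydrogen count (H) and substitution count (D). This yields 36 types from the xlogP training set. One type (=NH, [N+0v3H1D1]) occurs only once and is left out, leaving 35 types.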
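
As an illustration of the naming scheme, the sketch below assembles one atom-type SMARTS string from the six properties. The function name and argument names are hypothetical and chosen only for this example; how the properties are perceived from a structure is not covered here.

```python
def simple_atom_type(symbol, aromatic, charge, valence, h_count, degree):
    """Format one simple atom type as a SMARTS atom expression (illustrative helper)."""
    # Aromatic atoms are written in lowercase in SMARTS (c, n, o, s).
    element = symbol.lower() if aromatic else symbol
    # "{charge:+d}" always prints the sign, so a neutral atom gives "+0".
    return f"[{element}{charge:+d}v{valence}H{h_count}D{degree}]"


# Aromatic carbon with no hydrogens and three neighbours (first table row).
print(simple_atom_type("C", True, 0, 4, 0, 3))
# Chlorine with a single neighbour.
print(simple_atom_type("Cl", False, 0, 1, 0, 1))
```

The two calls reproduce the strings [c+0v4H0D3] and [Cl+0v1H0D1] used in the table.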

simple
smarts train_freq coefficient error
[c+0v4H0D3] 1471 0.294 0.019
[c+0v4H1D2] 1453 0.318 0.012
[O+0v2H0D1] 1075 -0.309 0.051
[C+0v4H3D1] 969 0.553 0.027
[C+0v4H2D2] 808 0.325 0.012
[C+0v4H0D3] 796 0.177 0.045
[O+0v2H1D1] 594 -0.401 0.034
[O+0v2H0D2] 543 -0.313 0.029
[N+0v3H1D2] 377 -0.523 0.040
[N+0v3H2D1] 371 -0.688 0.043
[n+0v3H0D2] 347 -0.375 0.035
[C+0v4H1D3] 283 -0.058 0.024
[Cl+0v1H0D1] 238 0.724 0.032
[C+0v4H0D4] 174 0.009 0.065
[C+0v4H1D2] 159 0.267 0.037
[N+0v3H0D3] 156 -0.893 0.052
[F+0v1H0D1] 140 0.405 0.033
[N+0v5H0D3] 138 0.631 0.106
[N+0v3H0D2] 129 0.317 0.062
[C+0v4H0D2] 108 0.573 0.160
[n+0v3H1D2] 103 -0.562 0.069
[Br+0v1H0D1] 87 0.964 0.058
[n+0v3H0D3] 87 -0.468 0.086
[S+0v6H0D4] 81 -0.216 0.118
[N+0v3H0D1] 79 -0.241 0.176
[S+0v2H0D2] 65 0.423 0.079
[S+0v2H0D1] 46 0.555 0.138
[C+0v4H2D1] 41 0.518 0.088
[I+0v1H0D1] 40 1.376 0.107
[P+0v5H0D4] 29 -0.225 0.156
[s+0v2H0D2] 13 0.663 0.188
[o+0v2H0D2] 12 -0.176 0.196
[S+0v2H1D1] 5 0.826 0.300
[S+0v4H0D3] 5 -1.174 0.305
[C+0v4H1D1] 4 0.020 0.372
[N+0v3H1D1] 1 (occurs only once; no coefficient fitted)
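
Because the coefficients come from a linear model on atom-type counts, a compound's estimate is a weighted sum of its counts. The sketch below shows that sum for ethanol (CCO), assuming its atoms are typed as one CH3, one CH2 and one OH. The function name is hypothetical, and the intercept is passed as a placeholder of 0.0 because the fitted intercept is not reported here.

```python
# A few rows copied from the table above (SMARTS -> coefficient).
COEFFICIENTS = {
    "[C+0v4H3D1]": 0.553,
    "[C+0v4H2D2]": 0.325,
    "[O+0v2H1D1]": -0.401,
}


def predict_logp(counts, coefficients, intercept):
    """Return intercept + sum(coefficient * count) over the atom types present."""
    total = intercept
    for smarts, n in counts.items():
        total += coefficients[smarts] * n
    return total


# Ethanol, CCO: assumed typing of one CH3, one CH2 and one OH.
ethanol_counts = {"[C+0v4H3D1]": 1, "[C+0v4H2D2]": 1, "[O+0v2H1D1]": 1}

# intercept=0.0 is a placeholder, not a fitted value.
print(round(predict_logp(ethanol_counts, COEFFICIENTS, intercept=0.0), 3))
```

The printed value is only the sum of the three atom-type contributions (0.553 + 0.325 - 0.401); a real estimate would also need the model's intercept.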
 

Coefficient computation

  • xlogP training set of 1850 compounds
  • gNova CHORD count_match(smiles, smarts) function to count the matches of each atom-type SMARTS in each compound
  • R linear model function (lm) to regress experimental logP on those counts
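
The fitting step can be sketched as an ordinary least-squares regression of logP on a matrix of counts. The code below is a small pure-Python stand-in for R's linear model function, not the procedure actually used; fit_ols is a hypothetical name, and the counts and logP values are made-up toy numbers for two unnamed atom types, not data from the xlogP set.

```python
def fit_ols(rows, y):
    """Least-squares fit of y on the columns of rows, with an intercept.

    Returns [intercept, coef_1, coef_2, ...]. Assumes the columns are
    linearly independent (otherwise a division by zero occurs).
    """
    # Prepend a column of ones for the intercept.
    X = [[1.0] + [float(v) for v in r] for r in rows]
    p = len(X[0])
    # Build the augmented normal equations [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(p)]
        + [sum(x[i] * yi for x, yi in zip(X, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Toy data: each row holds the counts of two atom types in one compound.
toy_counts = [(1, 0), (2, 1), (0, 2), (3, 1), (1, 3)]
# Toy logP values generated as 0.5 + 0.3*a - 0.4*b, so the fit is exact.
toy_logp = [0.8, 0.7, -0.3, 1.0, -0.4]

intercept, coef_a, coef_b = fit_ols(toy_counts, toy_logp)
print(round(intercept, 3), round(coef_a, 3), round(coef_b, 3))
```

In the real setting each row would hold the 35 atom-type counts for one compound, and the fitted values would correspond to the coefficient column of the table.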

Residual standard error: 0.6645 on 1817 degrees of freedom
Multiple R-Squared: 0.8104, Adjusted R-squared: 0.8068 
F-statistic: 221.9 on 35 and 1817 DF, p-value: < 2.2e-16
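
As a consistency check on the summary above, the adjusted R-squared and the F-statistic can be recomputed from the reported R-squared and the two degrees-of-freedom values. This is a sketch using the standard textbook formulas; the function names are chosen for this example.

```python
def adjusted_r2(r2, df_model, df_resid):
    """Adjusted R-squared from R-squared and the model/residual degrees of freedom."""
    return 1.0 - (1.0 - r2) * (df_model + df_resid) / df_resid


def f_statistic(r2, df_model, df_resid):
    """Overall F-statistic from R-squared and the model/residual degrees of freedom."""
    return (r2 / df_model) / ((1.0 - r2) / df_resid)


# Values taken from the regression summary above.
print(round(adjusted_r2(0.8104, 35, 1817), 4))
print(round(f_statistic(0.8104, 35, 1817), 1))
```

The recomputed adjusted R-squared is about 0.8067, which agrees with the reported 0.8068 to within the rounding of R-squared, and the F-statistic comes out at 221.9 as reported.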