calculate exit vector distance of each molecule v2.

Somedays ago, I posted the topics about ‘exit vector’.
The way to represent chemical space give me new insight for molecular design.
I think it’s useful for replacement of linkage or scaffold.
It’s important for drug discovery to design new molecules that similar ligand for target, but not similar for competitor’s one.

So, I added some function to my code. New code calculate exit vector and tanimoto similarity of each molecule.

All code is following.

```from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import AllChem
from rdkit.Chem import rdMolTransforms
from rdkit.Chem import DataStructs
import numpy as np
from rdkit.Chem import Draw
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
sdf = Chem.SDMolSupplier( "3d_amine.sdf", removeHs=False )
mols = [ m for m in sdf ]
fps = [ AllChem.GetMorganFingerprintAsBitVect( mol,2 ) for mol in mols ]
mol = mols[1]
mol.GetSubstructMatch( Chem.MolFromSmarts( "N[H]" ) )
atoms = [atom for atom in mol.GetAtoms()]

def getev( mol ):
if mol.GetNumConformers() >= 1:
matches = mol.GetSubstructMatches( Chem.MolFromSmarts( "N[H]" ) )
conf = mol.GetConformer()
theta = rdMolTransforms.GetDihedralDeg( conf,
matches[0][1],
matches[0][0],
matches[1][0],
matches[1][1]  )
temp_phi1 = 180 - rdMolTransforms.GetAngleDeg(conf,
matches[1][0],
matches[0][0],
matches[0][1]
)
temp_phi2 = 180 - rdMolTransforms.GetAngleDeg(conf,
matches[0][0],
matches[1][0],
matches[1][1]
)
if temp_phi1 >= temp_phi2:
phi1 = temp_phi1
phi2 = temp_phi2
else:
phi1 = temp_phi2
phi2 = temp_phi1

r = rdMolTransforms.GetBondLength( conf, matches[0][0], matches[1][0] )
return theta, phi1, phi2, r
else:
print( "No conformer!" )

def transform_cartegian( theta, phi1, phi2, r ):
x = np.sin( theta ) * np.sin( phi1 ) * np.sin( phi2 ) *r
y = np.sin( theta ) * np.sin( phi1 ) * np.cos( phi2 ) *r
z = np.sin( theta ) * np.cos( phi1 ) *r
t = np.cos( theta ) *r
return x, y, z, t

def get_dist(v1,v2):
v1 = np.asarray( v1 )
v2 = np.asarray( v2 )
delta =  v1 - v2
d = np.linalg.norm( delta )
return d

def calc_distance( mol1, mol2 ):
theta1, phi11, phi21, r1 = getev( mol1 )
theta2, phi12, phi22, r2 = getev( mol2 )
cart1 = transform_cartegian( theta1, phi11, phi21, r1 )
cart2 = transform_cartegian( theta2, phi12, phi22, r2 )
d = get_dist( cart1, cart2 )
return d

f = open( 'ditance_sim.txt','w' )
f.write( 'idx1,idx2,distance,ECFP4sim\n' )
dataset = []
for i, mol in enumerate( mols ):
d = calc_distance(mols[3],mol)
sim = DataStructs.TanimotoSimilarity( fps[3], fps[i] )
if d < 2:
print( "idx_"+str(3), "idx_"+str(i), "distance_"+str(d), "tanimoto"+str(sim) )
else:
pass
f.write( '{},{},{},{}\n'.format( 3,i,d,sim ) )
dataset.append([d,sim])
f.close()
```

Out put is following.

```idx_3 idx_1 distance_1.48017700543 tanimoto0.47058823529411764
idx_3 idx_3 distance_0.0 tanimoto1.0
idx_3 idx_6 distance_1.40146560569 tanimoto0.6666666666666666
idx_3 idx_7 distance_0.0619944778129 tanimoto0.6
idx_3 idx_13 distance_1.85179121146 tanimoto0.5
idx_3 idx_14 distance_1.80235655185 tanimoto0.5
```

OK, plot these results.

```df = pd.DataFrame(dataset, columns=['dist','sim'])
g=sns.lmplot('dist','sim',df, fit_reg=False)
plt.show(g)
```

Finally, I got the image.

Some molecules showed similarity scores are low ( ~0.5) but high distance similarity ( 1.~2.).
Is it useful for hopping ?
I want to try more large data sets ASAP.
*All code is written in Python, and pushed my code to my github repo!
https://github.com/iwatobipen/rdkit_evp