Matched molecular pair analysis (MMPA) is a very common method for medicinal chemists to analyze SAR. There are many publications about it and about its applications in this area.
I often use rdkit/Contrib/mmpa to make my own MMP dataset.
The original algorithm is described at the following URL.
https://www.ncbi.nlm.nih.gov/pubmed/20121045
Yesterday, @RDKit_org announced good news: the release of mmpdb, a package that can build a matched molecular pair database.
I tried the package immediately.
The package is provided from a GitHub repo. To use it, I first needed to install apsw, which can be installed with conda.
Then I installed mmpdb with its Python setup script.
iwatobipen$ conda install -c conda-forge apsw
iwatobipen$ git clone https://github.com/rdkit/mmpdb.git
iwatobipen$ cd mmpdb
iwatobipen$ python setup.py install
After the installation succeeded, the mmpdb command was available in my terminal.
I used CYP3A4 inhibition data from ChEMBL for the test.
I prepared two files: one has the SMILES and ID, and the other has the ID and IC50 value.
In the property file, * means a missing value. In the following case I provided a single property (IC50), but the package can handle multiple properties. If you are interested in the package, please see more details with the mmpdb --help command.
iwatobipen$ head -n 10 chembl_cyp3a4.csv
CANONICAL_SMILES,MOLREGNO
Cc1ccccc1c2cc(C(=O)n3cccn3)c4cc(Cl)ccc4n2,924282
CN(C)CCCN1c2ccccc2CCc3ccccc13,605
Cc1ccc(cc1)S(=O)(=O)\N=C(/c2ccc(F)cc2)\n3c(C)nc4ccccc34,1698776
NC[C@@H]1O[C@@H](Cc2c(O)c(O)ccc12)C34CC5CC(CC(C5)C3)C4,59721
Cc1ccc(cc1)S(=O)(=O)N(Cc2ccccc2)c3ccccc3C(=O)NCc4occc4,759749
O=C(N1CCC2(CC1)CN(C2)c3ccc(cc3)c4ccccc4)c5ccncc5,819161
iwatobipen$ head -n 10 prop.csv
ID STANDARD_VALUE
924282 *
605 *
1698776 *
59721 19952.62
759749 2511.89
819161 2511.89
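Two files in this layout can be written from one table with a few lines of Python. This is only a minimal sketch and not the script I actually used: the function name write_mmpdb_inputs and the (smiles, id, value) record layout are hypothetical, and writing the property file as tab-separated text is my assumption about the whitespace shown in the listing above.

```python
import csv


def write_mmpdb_inputs(records, smiles_path, props_path):
    """Write a SMILES file and a property file from (smiles, id, value) records.

    Hypothetical helper for illustration. A value of None is written as '*',
    the missing-value marker used in the property file.
    """
    with open(smiles_path, "w", newline="") as smi, \
            open(props_path, "w", newline="") as prop:
        smi_writer = csv.writer(smi, lineterminator="\n")
        smi_writer.writerow(["CANONICAL_SMILES", "MOLREGNO"])
        # Assumption: tab-separated columns for the property file.
        prop.write("ID\tSTANDARD_VALUE\n")
        for smiles, compound_id, value in records:
            smi_writer.writerow([smiles, compound_id])
            shown = "*" if value is None else value
            prop.write(f"{compound_id}\t{shown}\n")
```

Keeping both files in one loop guarantees that every ID in the property file also exists in the SMILES file.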
The mmpdb fragment command has a --cut-smarts option.
It looks attractive to me! ;-)
'''
--cut-smarts SMARTS alternate SMARTS pattern to use for cutting (default:
'[#6+0;!$(*=,#[!#6])]!@!=!#[!#0;!#1;!$([CH2]);!$([CH3]
[CH2])]'), or use one of: 'default',
'cut_AlkylChains', 'cut_Amides', 'cut_all',
'exocyclic', 'exocyclic_NoMethyl'
'''
The next step is to build the mmpdb database and load the property into it.
# Run fragmentation. My input file has a header and its delimiter is a comma (the default is white space). The output file is cyp3a4.fragments.
# Each line of the input file must be unique!
iwatobipen$ mmpdb fragment chembl_cyp3a4.csv --has-header --delimiter 'comma' -o cyp3a4.fragments
# Run indexing on the fragment file to create the mmpdb database.
iwatobipen$ mmpdb index cyp3a4.fragments -o cyp3a4.mmpdb
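Because the input must not contain repeated entries, it is worth checking the file before fragmenting. This is a minimal sketch, assuming that "unique" refers to the ID column (the second column of my comma-separated file); the function name find_duplicate_ids is hypothetical.

```python
from collections import Counter


def find_duplicate_ids(lines, delimiter=",", id_column=1, has_header=True):
    """Return the sorted IDs that occur more than once in the given lines."""
    rows = [line.rstrip("\n").split(delimiter) for line in lines if line.strip()]
    if has_header:
        rows = rows[1:]  # skip the CANONICAL_SMILES,MOLREGNO header
    counts = Counter(row[id_column] for row in rows)
    return sorted(cid for cid, n in counts.items() if n > 1)
```

For example, calling find_duplicate_ids(open("chembl_cyp3a4.csv")) should return an empty list before mmpdb fragment is run.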
OK, I got the cyp3a4.mmpdb file (SQLite3 format).
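Since the file is an SQLite3 database, it can be opened with Python's standard sqlite3 module. The sketch below only lists the table names through the generic sqlite_master catalog; it assumes nothing about the mmpdb schema itself, and the function name list_tables is hypothetical.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in an SQLite database file."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Note that sqlite3.connect creates an empty file if the path does not exist, so check the path first, e.g. before calling list_tables("cyp3a4.mmpdb").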
Next, add the properties to the DB.
Type the following command.
iwatobipen$ mmpdb loadprops -p prop.csv cyp3a4.mmpdb
Using dataset: MMPs from 'cyp3a4.fragments'
Reading properties from 'prop.csv'
Read 1 properties for 17143 compounds from 'prop.csv'
5944 compounds from 'prop.csv' are not in the dataset at 'cyp3a4.mmpdb'
Imported 5586 'STANDARD_VALUE' records (5586 new, 0 updated).
Generated 83759 rule statistics (1329408 rule environments, 1 properties)
Number of rule statistics added: 83759 updated: 0 deleted: 0
Loaded all properties and re-computed all rule statistics.
The DB is ready to use. Let's play with it.
Identify possible transforms.
iwatobipen$ mmpdb transform --smiles 'c1ccc(O)cc1' cyp3a4.mmpdb --min-pair 10 -o transfom_res.txt
iwatobipen$ head -n3 transfom_res.txt
ID SMILES STANDARD_VALUE_from_smiles STANDARD_VALUE_to_smiles STANDARD_VALUE_radius STANDARD_VALUE_fingerprint STANDARD_VALUE_rule_environment_id STANDARD_VALUE_count STANDARD_VALUE_avg STANDARD_VALUE_std STANDARD_VALUE_kurtosis STANDARD_VALUE_skewness STANDARD_VALUE_min STANDARD_VALUE_q1 STANDARD_VALUE_median STANDARD_VALUE_q3 STANDARD_VALUE_max STANDARD_VALUE_paired_t STANDARD_VALUE_p_value
1 CC(=O)NCCO [*:1]c1ccccc1 [*:1]CCNC(C)=O 0 59SlQURkWt98BOD1VlKTGRkiqFDbG6JVkeTJ3ex3bOA 1049493 14 3632 5313.6 -0.71409 -0.033683 -6279.7 498.81 2190.5 7363.4 12530 -2.5576 0.023849
2 CC(C)CO [*:1]c1ccccc1 [*:1]CC(C)C 0 59SlQURkWt98BOD1VlKTGRkiqFDbG6JVkeTJ3ex3bOA 1026671 20 7390.7 8556.1 -1.1253 -0.082107 -6503.9 -0 8666.3 13903 23534 -3.863 0.0010478
The output file lists each transformation together with its statistics.
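The transform output is easy to post-process. Here is a minimal sketch that reads the file and ranks the transformations by p-value. It assumes the columns are tab-separated (my reading of the output, not something I checked in the source), and the function name load_transforms is hypothetical; the column names are the ones shown in the header above.

```python
import csv


def load_transforms(path, prop="STANDARD_VALUE"):
    """Read an mmpdb transform output file and sort rows by ascending p-value.

    Assumption: the file is tab-separated with a header line.
    Rows without a p-value for the property are skipped.
    """
    p_column = f"{prop}_p_value"
    with open(path, newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    ranked = [row for row in rows if row.get(p_column)]
    ranked.sort(key=lambda row: float(row[p_column]))
    return ranked
```

With the two rows shown above, load_transforms("transfom_res.txt") would put CC(C)CO first because its p-value is smaller.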
The DB can also be used to make a prediction.
The following command generates two files with the prefix CYP3A.
CYP3A_pairs.txt
CYP3A_rules.txt
iwatobipen$ mmpdb predict --reference 'c1ccc(O)cc1' --smiles 'c1ccccc1' cyp3a4.mmpdb -p STANDARD_VALUE --save-details --prefix CYP3A
iwatobipen$ head -n 3 CYP3A_pairs.txt
rule_environment_id from_smiles to_smiles radius fingerprint lhs_public_id rhs_public_id lhs_smiles rhs_smiles lhs_value rhs_value delta
868610 [*:1]O [*:1][H] 0 59SlQURkWt98BOD1VlKTGRkiqFDbG6JVkeTJ3ex3bOA 1016823 839661 C[C@]12CC[C@@H]3[C@H](CC[C@H]4C[C@@H](O)CC[C@@]43C)[C@@H]1CC[C@H]2C(=O)CO CC(=O)[C@@H]1CC[C@H]2[C@H]3CC[C@H]4C[C@@H](O)CC[C@]4(C)[C@@H]3CC[C@@]21C 1000 15849 14849
868610 [*:1]O [*:1][H] 0 59SlQURkWt98BOD1VlKTGRkiqFDbG6JVkeTJ3ex3bOA 3666 47209 O=c1c(O)c(-c2ccc(O)c(O)c2)oc2cc(O)cc(O)c12 O=c1cc(-c2ccc(O)c(O)c2)oc2cc(O)cc(O)c12 15849 5011.9 -10837
iwatobipen$ head -n 3 CYP3A_rules.txt
rule_environment_statistics_id rule_id rule_environment_id radius fingerprint from_smiles to_smiles count avg std kurtosis skewness min q1 median q3 max paired_t p_value
28699 143276 868610 0 59SlQURkWt98BOD1VlKTGRkiqFDbG6JVkeTJ3ex3bOA [*:1]O [*:1][H] 16 -587.88 14102 -0.47579 -0.065761 -28460 -8991.5 -3247.8 10238 23962 0.16674 0.8698
54091 143276 1140189 1 tLP3hvftAkp3EUY+MHSruGd0iZ/pu5nwnEwNA+NiAh8 [*:1]O [*:1][H] 15 -1617 13962 -0.25757 -0.18897 -28460 -9534.4 -4646 7271.1 23962 0.44855 0.66062
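To get a feeling for the statistics columns, I compared the reported paired_t with the usual paired t statistic, avg / (std / sqrt(count)). This is a small sketch under the assumption that std is the sample standard deviation of the pair deltas; the function name paired_t_magnitude is hypothetical and this is not taken from the mmpdb source.

```python
import math


def paired_t_magnitude(count, avg, std):
    """Magnitude of a paired t statistic from summary values of the deltas.

    Assumption: std is the sample standard deviation of the deltas.
    """
    return abs(avg) / (std / math.sqrt(count))


# First row of CYP3A_rules.txt: count=16, avg=-587.88, std=14102
print(round(paired_t_magnitude(16, -587.88, 14102), 3))
# Second row: count=15, avg=-1617, std=13962
print(round(paired_t_magnitude(15, -1617, 13962), 3))
```

The magnitudes agree with the reported paired_t values (0.16674 and 0.44855) to about three decimals. In these rows the reported sign appears to be the opposite of the sign of avg; I have not checked the source code for the sign convention.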
It is worth noting that the package can handle not only structure-based information but also properties.
I learned a lot of things from the source code.
The RDKit community is cool!
I pushed my code to my repo.
https://github.com/iwatobipen/mmpdb_test
The original repo URL is
https://github.com/rdkit/mmpdb
Do not miss it!