Fragmentation of molecules.

I often use molecular fragmentation to get small, drug-like fragments from various data sources.
Algorithms such as RECAP and BRICS are commonly used for this.
RDKit is a suitable toolkit for the job, and I use it often.
I was also interested in a tool named “molBLOCKS” and tried it out.
If you are interested, more details are described in this paper.
You can get the source code from the following URL.
http://compbio.cs.princeton.edu/molblocks/
Installation is easy.

iwatobipen$ wget http://compbio.cs.princeton.edu/molblocks/molblocks.tar.gz
iwatobipen$ tar xzvf molblocks.tar.gz
iwatobipen$ cd molblocks
iwatobipen$ make

That’s all.
===========================================
molBLOCKS depends on Open Babel,
so I installed Open Babel before building molBLOCKS.
===========================================
Now we are ready for fragmentation.
I tested molBLOCKS on a DrugBank data set.

The set contains over 6,000 molecules,
and the structures are provided as SMILES.

iwatobipen$ wc drugbank/all_drugs.txt 
    6813   10647  379523 drugbank/all_drugs.txt
iwatobipen$ head -n 5 drugbank/all_drugs.txt 
O=C(N1[C@@H](CCC1)C(=O)NNC(=O)N)[C@@H](NC(=O)[C@@H](NC(=O)[C@H](NC(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@@H](NC(=O)[C@H]1NC(=O)CC1)Cc1[nH]cnc1)Cc1c2c([nH]c1)cccc2)CO)Cc1ccc(O)cc1)COC(C)(C)C)CC(C)C)CCCN=C(N)N	[NO NAME]
NC(=O)CNC(=O)[C@@H](NC(=O)[C@@H]1CCCN1C(=O)[C@H]1NC(=O)[C@@H](NC(=O)[C@H](CCC(=O)N)NC(=O)[C@H](Cc2ccccc2)NC(=O)[C@@H](NC(=O)CCSSC1)Cc1ccc(cc1)O)CC(=O)N)CCCNC(=N)N	[NO NAME]
c12c(cccc1)ccc(c2)C[C@@H](NC(=O)C)C(=O)N[C@@H](C(=O)N[C@H](Cc1cccnc1)C(=O)N[C@H](C(=O)N[C@@H](Cc1ccc(cc1)O)C(=O)N[C@@H](C(=O)N[C@@H](CC(C)C)C(=O)N[C@H](C(=O)N1[C@H](C(=O)N[C@@H](C(=O)N)C)CCC1)CCCNC(=N)N)CCCNC(=O)N)CO)Cc1ccc(cc1)Cl	[NO NAME]
C/C=C/CC(C)C(C1C(=NC(CC)C(=O)N(C)CC(=O)N(C)C(CC(C)C)C(=NC(C(C)C)C(=O)N(C)C(CC(C)C)C(=NC(C)C(=NC(C)C(=O)N(C)C(CC(C)C)C(=O)N(C)C(CC(C)C)C(=O)N(C)C(C(C)C)C(=O)N1C)O)O)O)O)O	
NC(=O)CNC(=O)[C@@H](NC(=O)[C@@H]1CCCN1C(=O)[C@H]1NC(=O)[C@@H](NC(=O)[C@H](CCC(=O)N)NC(=O)[C@H](Cc2ccccc2)NC(=O)[C@@H](NC(=O)[C@H](CSSC1)N)Cc1ccccc1)CC(=O)N)CCCCN	

Here we go!
I ran the fragment command with the following options (flag, meaning / value I used).
-i input file / drugbank/all_drugs.txt
-r fragmentation rules file / RECAP.txt
-n minimum number of atoms in a fragment / 4
-o output file / drugbank/fragment_drugs.txt

iwatobipen$ time ./fragment -i drugbank/all_drugs.txt -r ./RECAP.txt -n 4 -o drugbank/fragment_drugs.txt 
[==================================================] 100%     


real	0m17.283s
user	0m17.184s
sys	0m0.081s

iwatobipen$ head -n 5 drugbank/fragment_drugs.txt 
CC(O)(C)C.NC(=O)NNC(=O)[C@@H]1CCCN1.O=C[C@@H]1CCC(=O)N1.O=C[C@H](Cc1c[nH]c2c1cccc2)N.N[C@H](C=O)Cc1cnc[nH]1.O=C[C@H](Cc1ccc(cc1)O)N.O=C[C@H](CC(C)C)N.O=C[C@H](CCCN=C(N)N)N.N[C@H](C=O)CO.C[C@H](C=O)N	[NO
O=C[C@@H]1CSSCCC(=O)N[C@@H](Cc2ccc(cc2)O)C(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@H](C(=O)N1)CC(=O)N)CCC(=O)N)Cc1ccccc1.O=C[C@@H]1CCCN1.O=C[C@H](CCCNC(=N)N)N.NCC(=O)N	[NO
NC(=O)[C@H](N)C.O=C[C@@H](Cc1ccc2c(c1)cccc2)NC(=O)C.O=C[C@@H]1CCCN1.O=C[C@@H](Cc1cccnc1)N.O=C[C@H](Cc1ccc(cc1)O)N.O=C[C@@H](Cc1ccc(cc1)Cl)N.O=C[C@H](CC(C)C)N.O=C[C@H](CCCNC(=N)N)N.O=C[C@@H](CCCNC(=O)N)N.N[C@H](C=O)CO	[NO
C/C=C/CC(C(C1C(=NC(CC)C(=O)N(C)CC(=O)N(C)C(CC(C)C)C(=NC(C(C)C)C(=O)N(C)C(CC(C)C)C(=NC(C(=NC(C(=O)N(C(C(=O)N(C(C(=O)N(C(C(=O)N1C)C(C)C)C)CC(C)C)C)CC(C)C)C)C)O)C)O)O)O)O)C
O=C[C@@H]1CSSC[C@H](N)C(=O)N[C@@H](Cc2ccccc2)C(=O)N[C@H](C(=O)N[C@H](C(=O)N[C@H](C(=O)N1)CC(=O)N)CCC(=O)N)Cc1ccccc1.O=C[C@@H]1CCCN1.NCC(=O)N.NCCCC[C@@H](C=O)N

For each molecule, I got its fragments as a period-separated list of SMILES.
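
If you want to work with the individual fragments, a line of this output can be split in Python. This is only a minimal sketch: the function name split_fragments is my own (it is not part of molBLOCKS), and it assumes each line is the fragment SMILES, optionally followed by whitespace and a molecule name, as in the output above. Splitting on "." is safe because in SMILES a period only ever separates disconnected components.

```python
def split_fragments(line):
    """Split one line of fragment output into (fragments, molecule name).

    Assumed layout: '<frag1>.<frag2>...<whitespace><name>', where the
    name part may be missing. Returns ([], None) for a blank line.
    """
    fields = line.strip().split(None, 1)
    if not fields:
        return [], None
    # A period in SMILES separates disconnected components.
    fragments = [frag for frag in fields[0].split(".") if frag]
    name = fields[1] if len(fields) > 1 else None
    return fragments, name


if __name__ == "__main__":
    example = "O=C[C@@H]1CCCN1.NCC(=O)N.NCCCC[C@@H](C=O)N"
    frags, name = split_fragments(example)
    for frag in frags:
        print(frag)
```

Lines without a name simply come back with None as the second value.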
molBLOCKS can also analyze the frequency of fragments,
like this.

iwatobipen$ time ./analyze -i drugbank/fragment_drugs.txt -o drugbank/all_drug_analyse.txt

Storing the main fragment set

[==================================================] 100%     


real	0m0.348s
user	0m0.307s
sys	0m0.023s
iwatobipen$ head -n 5 drugbank/all_drug_analyse.txt 
208	c1ccccc1	177 274 349 166548 384 483 678 712 713 796 805 812 872 991 1029 1075 1138 1139 1148 1167 1263 1342 1349 1424 1429 1435 1580 4575 4842 65781 71388 5335 2200 [NO [NO [NO 1495 580 3585 4817 4826 11047 3516 4468 4468 4468 4468 4468 220 110221-27-7.mol TNL L10 A45 BFL BIR HTQ IOE 669 INV BPG LY2 PFE IOF NFP BPY R18 OPA 696 BSI TSX PF3 FXN S58 EQU DFA IPC PTU ISF D1L LVA 132 CL3 977 KMB FTA 656 2AN 580 201 824 BIF HYQ UHD 2361 11F 11X 130 199 19A 251 2A6 2PD 2SC 37A 3DE 509 521 53R 547 5B2 5MS 642 6C3 6NH 6SC 783 839 979 A51 AAX AAZ ARH B08 B28 B2Y B70 B76 BCE BDL BHF BP4 BP7 BPS BWP BWP C00 C4E CMF CP7 CY0 D25 D26 DF1 DF2 DFW DFY DFZ DMZ FL9 FRZ G14 HA3 HDY HS4 IHI IOK J88 JNH JT5 JT6 KJ2 KOM KSF KSL L5G LJ3 LKG LRG LZL MD7 MMG N4D NN3 OA4 P19 P29 P44 P45 P49 P4T P55 P63 PB2 PFD PFQ PIQ PXB QPP RBS RF2 RIV SCJ SCQ SCW SCZ T1D TF5 TFG TFK TR1 TUO VII VRV VX3 X98 ZY6 ZZA 1548992
162	Nc1ncnc2c1nc[nH]2	118 131 170 32326 464205 640 718 2703 65781 157 1892 4468 929 537 5RM 2HC [NO [NO I84 SUB NLA NLA LI6 LI6 DFN DFN DMA PIL 186 BEK 446202 FL8 FL8 FSP INT INT A45 WAC WAC SU2 FR2 HTQ OFL OFL HBO DST ROI 669 669 669 669 6JZ 3IP KTP PHN BYS BPG FOH CP5 FLV FLV 655 TQT ISA 113 207 LHY FIL Structure NU1 EMT PVB 772 PLO CAH NBF CXM FCD CK4 K21 K21 6IN R18 B1V CXA 4PN 7I2 219 TRR BH7 696 696 IH5 PFA 5DE 5DE AHC AL8 SN2 P28 INQ TSX TSX U55 U55 46936647 LIH LIH TYI PPT I06 AI1 NBL NBL S58 S58 AL9 AL9 BRN BRN MBP MXA PTU TPI ISF BLN LVA F6B PGX 616 PU4 PU4 PU4 PY1 SDS MF2 CB4 194 HC4 DEB AL3 CK3 HUX CL3 QUE LUM LUM PHC SG1 MBS MBS 2AN TCO BTQ 103 3DH 5F1 6IA AD5 B32 EH9 ZIP
98	Cc1ccccc1	214 274 298 425 482 524 612 622 678 692 743 796 843 863 925 966 1029 1116 1244 1349 1626 4842 5105 [NO 4833 11047 4468 4468 4468 TNL 16A WSK PP1 INT INT INT A45 OLO CP8 4BT 669 3IP INV 8IN TAQ 6IN 1PB IOA OPA I3N SN2 TSX 2BF DFA SCL PTU BBH AFI 580 008 201 047 22M 23M 2SK 4BR 53S 575 5SC 812 8AP 8CA 910 A17 B34 CP9 E20 F1H G6A GK3 GK4 GN8 II4 IX1 J67 JEN L05 L13 LJ1 LJ2 LZQ MI2 OST PB2 SC3 SP6 SPB TR1
82	Oc1ccccc1	251 481 573 600 887 925 944 1010 1025 1149 1400 1608 2703 4842 [NO [NO 3585 [NO GEN 929 WSK KMP DZN 5353306 IOE 122 124 BPY FRM GP8 DTQ HQQ 9HP 340 132 LIM 194 MAX CK6 AB3 87Y CA2 134 166 1HP 244 342 397 3B9 3GV 4MR 4NA 555 5PP 608 694 697 7MR 7NH 859 92G ABJ HMDB02124.mol ALH B65 BI5 C00 EES EI1 FMX IHX INK LJ4 MSI MSR N20 OA5 OBP PDZ POT RS2 ZTW
66	c1cccnc1	235 51-83-2 469 619 705 792 817 976 1072 1427 4842 [NO 5282204 25295 [NO 4819 4468 FYX 446202 IPO MNX FPH Structure NHB PF3 LIG PRC WAI PY1 D4M M1N 255 2EA 2PY 325 6EA 7PY 855 896 8IP AAX AAZ D13 D1G D2G D3G GAX GG5 HM2 L20 LL1 MOJ MUH PF8 PF9 PPX RC8 SB2 SB6 SDZ SM5 SS3 SS4 TAK XMJ XMK

Each line of the output contains the following fields: the frequency of the fragment, the fragment SMILES, and the IDs of the molecules that contain it.
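
The analysis output can be read back into Python in a similar way. Again this is just a sketch under my own assumptions: parse_analysis_line is a name I made up, and I assume the three fields (frequency, fragment, molecule IDs) are separated by whitespace, with the IDs themselves separated by spaces, as in the lines shown above.

```python
def parse_analysis_line(line):
    """Parse one line of the analyze output.

    Assumed layout: '<frequency> <fragment SMILES> <id> <id> ...'.
    Returns (frequency, fragment, list of ID strings).
    """
    fields = line.strip().split(None, 2)
    if len(fields) < 2:
        raise ValueError("expected at least a frequency and a fragment")
    frequency = int(fields[0])
    fragment = fields[1]
    # IDs are kept as strings because many of them are not numeric.
    ids = fields[2].split() if len(fields) > 2 else []
    return frequency, fragment, ids


if __name__ == "__main__":
    freq, frag, ids = parse_analysis_line("98\tCc1ccccc1\t214 274 298 TNL 16A")
    print(freq, frag, len(ids))
```

Note that some IDs in my output appear as "[NO". They seem to come from molecules named "[NO NAME]" in the input file, so they cannot be mapped back to a single molecule.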

The program has additional options.
Good documentation is also available on the developer’s site.
http://compbio.cs.princeton.edu/molblocks/

Published by iwatobipen

I'm a medicinal chemist at a mid-sized pharmaceutical company. I love cheminformatics, coding, organic synthesis, and my family.
