Make QSAR models and predict activity using pandas_ml and RDKit

At mishima.syk, y_sama introduced us to pandas_ml.
I know I'm late to the party, but I was interested in the module, so I used pandas_ml for QSAR.
pandas_ml is a Python library that integrates pandas, scikit-learn, xgboost and seaborn.
To use pandas_ml, I first installed the xgboost Python binding and then installed pandas_ml with pip.
I tried to build a QSAR model, an Ames positive/negative classifier.
First, I read an SDF with RDKit's PandasTools and added a Morgan fingerprint column.

from rdkit import Chem
from rdkit.Chem import PandasTools
from rdkit.Chem import AllChem, DataStructs
import pandas as pd
import pandas_ml as pdml
import numpy as np
from sklearn.preprocessing import LabelEncoder

df = PandasTools.LoadSDF('cas_4337.sdf')
# Encode the Ames categorisation strings as integer labels.
le = LabelEncoder()
le.fit( df["Ames test categorisation"] )
df['target'] = le.transform( df["Ames test categorisation"] )

def getFparr( mol ):
    # Morgan fingerprint (radius 2) converted to a numpy array.
    fp = AllChem.GetMorganFingerprintAsBitVect( mol, 2 )
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray( fp, arr )
    return arr

df['morganFP'] = df.ROMol.apply(getFparr)

Next, I made a dataframe that has X and Y.

res = pd.DataFrame(df.morganFP.tolist())
res["target"] = df.target

Now we're ready to use pandas_ml. The first step was to make a ModelFrame.
From this object it is easy to call the usual methods, e.g. train_test_split, and to build random forest, SVC or xgboost models.

df4ml = pdml.ModelFrame( res, target='target' )
train_df, test_df = df4ml.cross_validation.train_test_split()

I predicted Ames activity using three methods: xgboost, SVC and random forest.
The code follows, and it is very simple regardless of the ML algorithm.
pandas_ml can also do cross-validation and grid search of parameters, and can chain several steps with a pipeline (a grid-search sketch follows the examples below).

est = df4ml.xgboost.XGBClassifier()
train_df.fit( est )
pred = test_df.predict( est )
test_df.metrics.confusion_matrix()

Predicted	0	1
Target		
0	486	126
1	96	377

est2 = df4ml.ensemble.RandomForestClassifier()
train_df.fit(est2)
predict_rf=test_df.predict(est2)
test_df.metrics.confusion_matrix()

Predicted	0	1
Target		
0	527	85
1	126	347

est3 = df4ml.svm.SVC()
train_df.fit(est3)
predict_svc=test_df.predict(est3)
test_df.metrics.confusion_matrix()
Predicted	0	1
Target		
0	466	146
1	147	326
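
As mentioned above, grid search also works through the ModelFrame accessors. The following is a minimal sketch based on the pandas_ml documentation; the parameter grid is illustrative only.

# Sketch: grid search over SVC hyperparameters via the ModelFrame
# (the parameter grid below is illustrative only).
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10]}]
cv = df4ml.grid_search.GridSearchCV(df4ml.svm.SVC(), tuned_parameters, cv=5)
df4ml.fit(cv)
df4ml.grid_search.describe(cv)  # per-parameter-set scores as a DataFrame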

In summary, I think pandas_ml is very useful for building models.
(In my personal opinion, users should learn scikit-learn, xgboost and pandas first, before using pandas_ml.)
I uploaded the code to my repo.
https://github.com/iwatobipen/chemo_info/blob/master/qsar/pdml_rdkit.ipynb

Try to use casperjs

CasperJS is a navigation scripting and testing utility for PhantomJS and SlimerJS, written in JavaScript.
PhantomJS and SlimerJS are headless browsers.
Some years ago I used Selenium for web scraping, because it has a Python binding and is easy to use.
Today, I tried CasperJS.
Installation is very easy: just use Homebrew (for Mac users) or npm (you need to install PhantomJS first).😉
I wrote simple code that searches Google Patents and echoes each result link.
First, create a casper object. Then chain the next action with 'casper.then( function() { /* your function */ } );'.
The fill function is useful for form input; the user doesn't need a separate button-press command.
The following code accesses Google Patents, searches for patents about JAK3, and then echoes the URLs.


var casper = require( 'casper' ).create();

// Collect the href of every search-result link on the page.
function getLinks() {
    var links = [];
    var list = document.querySelectorAll( 'article > a' );
    for ( var i = 0; i < list.length; i++ ) {
        links.push( list[i].href );
    }
    return links;
}

casper.start().viewport( 1600, 1000 );

// Open Google Patents and echo the page title.
casper.thenOpen( 'https://patents.google.com/', function() {
    this.echo( this.getTitle() );
});
casper.then( function() {
    this.capture( 'top.png' );
});

// Fill the search form with the query 'JAK3' and submit it.
casper.then( function() {
    this.fill( "form", { q: "JAK3" }, true );
});
// Wait for the results to load, then take a screenshot.
casper.wait( 5000, function() {
    this.capture( 'res.png' );
});

// Extract the result links in the page context and echo them.
casper.then( function() {
    var links = this.evaluate( getLinks );
    this.echo( links.length + 'patents found' );
    for ( var i = 0; i < links.length; i++ ) {
        this.echo( links[i] );
    }
});

casper.run();

To run the code, just type: casperjs yourscript.js

 iwatobipen$ casperjs googlepat.js 
Google Patents
10patents found
https://patents.google.com/patent/US6210654B1/en?q=jak3
https://patents.google.com/patent/US6136595A/en?q=jak3
https://patents.google.com/patent/US5741899A/en?q=jak3
https://patents.google.com/patent/US7598257B2/en?q=jak3
https://patents.google.com/patent/US7335667B2/en?q=jak3
https://patents.google.com/patent/US7491732B2/en?q=jak3
https://patents.google.com/patent/US6080747A/en?q=jak3
https://patents.google.com/patent/US20060030018A1/en?q=jak3
https://patents.google.com/patent/US5916792A/en?q=jak3
https://patents.google.com/patent/US6160010A/en?q=jak3

It works fine, and I got the following screenshots.
CasperJS has more functions for scraping. I'll read the API docs as soon as possible.

[Screenshots: res.png and top.png]

New index for compound prioritization

Some days ago, I found a report about compound prioritization in single-concentration screening.

http://pubs.acs.org/doi/abs/10.1021/acs.jcim.6b00299

In the early stages of a drug discovery project, we need to screen lots of compounds. Single-concentration assays are often used because of their throughput, and the output is percent inhibition or a related readout.

Medicinal chemists need to prioritize compounds using these datasets. In my case, I used PEI (inh% / MW) instead of LE or LLE. But I knew that PEI does not correlate with LE (I'll check the paper ASAP)!

The authors proposed a new index, the single-concentration response value scpIC50(Resp):

scpIC50(Resp) = -log10[ (100/Resp - 1) * 10^-5 ]   ---- eq(5)

The equation is the same as the standard IC50 relationship when the Hill slope = 1.0; the 10^-5 factor is the test concentration in molar units (10 uM).
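
As a quick sanity check, here is the equation as a small Python function (the function itself is mine, not from the paper; it assumes a 10 uM test concentration and a Hill slope of 1.0, as in eq 5):

import math

def scpic50(resp_percent, conc_molar=1e-5):
    # Estimate pIC50 from a single-concentration response (eq 5),
    # assuming a Hill slope of 1.0 at test concentration conc_molar.
    return -math.log10((100.0 / resp_percent - 1.0) * conc_molar)

print(scpic50(50))  # 5.0   (IC50 equals the test concentration, 10 uM)
print(scpic50(90))  # ~5.95 (90% inhibition implies a ~9-fold lower IC50)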

I think this is a bold simplification, but the index works well.

In Fig. 2, scLE and LE show a good correlation, and scLE can be used for hit triage. Figs. 3 and 4 indicate that the index can rescue sets of attractive compounds.

It was interesting for me.

By the way, there are now a lot of scores for medchem, based not only on biological/ADME data but also on in silico data.

AlogP, ClogP, PSA, SAscore, CNS MPO, druggability, LE, LLE, LEAT, PEI, BEI, ...

Which score do you use?


RandomForest Classification on Hadoop.

Belatedly, I've become interested in Hadoop.
I felt that Hadoop would be difficult for me to handle (I'm not good at data science....), but some days ago I found a very attractive library named 'Hivemall'.
The following description is from the GitHub page.

Hivemall is a scalable machine learning library that runs on Apache Hive. Hivemall is designed to be scalable to the number of training instances as well as the number of training features.
https://github.com/myui/hivemall

This means I can store data, build models and predict, all on Hadoop. Hmm, that sounds nice. Let's DIY!
My environment is a Mac, so I installed Hadoop and Hive using Homebrew.
It's very simple. Type the following commands.

iwatobipen$ brew install hadoop
iwatobipen$ brew install hive

After installation, I set up some configuration files, formatted the file system, and started Hadoop.

iwatobipen$ hdfs namenode -format
iwatobipen$ /usr/local/Cellar/hadoop/2.7.2/sbin/start-all.sh 

It works fine.
Next, I installed Hivemall. It was easy: just put two files into /tmp.
The files can be downloaded from https://github.com/myui/hivemall/releases.
I got the following files.
hivemall-core-0.4.2-rc.2-with-dependencies.jar
define-all.hive

If your Hive version is newer, you need to comment out the lines for the sha1 function in define-all.hive (the name appears to clash with a function built into recent Hive).

~part of define-all.hive~
-----------------------
-- hashing functions --
-----------------------

drop temporary function mhash;
create temporary function mhash as 'hivemall.ftvec.hashing.MurmurHash3UDF';

--the following lines cause an error on newer Hive
--drop temporary function sha1;
--create temporary function sha1 as 'hivemall.ftvec.hashing.Sha1UDF';

drop temporary function array_hash_values;
create temporary function array_hash_values as 'hivemall.ftvec.hashing.ArrayHashValuesUDF';

It’s ready.

I wrote a sample SQL script for classification of the iris dataset.
The following code is almost the same as the GitHub example.

test.sql

-- install hivemall ( ADD jar & SOURCE ) 
ADD jar /tmp/hivemall-core-0.4.2-rc.2-with-dependencies.jar;
SOURCE /tmp/define-all.hive

CREATE TABLE iris_raw(
              F1 float,
              F2 float,
              F3 float,
              F4 float,
              CLASS string
              )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/Users/iwatobipen/iris.data' INTO TABLE iris_raw;

CREATE TABLE iris_dataset
AS
SELECT
  CLASS,
  ARRAY( concat('1:',F1), concat('2:',F2), concat('3:', F3), concat('4:', F4) ) AS FEATURES
FROM iris_raw;

CREATE TABLE label_mapping
     AS
     SELECT
       CLASS,
       RANK -1 AS LABEL
     FROM (
       SELECT
         distinct CLASS,
         dense_rank() over (order by CLASS) AS RANK
     FROM
       iris_raw
     ) t
     ;

--SELECT * FROM label_mapping;

CREATE TABLE training
AS
SELECT
  rowid() as rowid,
  array( t1.F1, t1.F2, t1.F3, t1.F4 ) AS FEATURES,
  t2.LABEL
FROM
  iris_raw t1
  JOIN label_mapping t2 ON ( t1.class = t2.class )
;

CREATE TABLE model
STORED AS SEQUENCEFILE
AS
SELECT train_randomforest_classifier( features, label )
FROM training;

desc model;

set hivevar:classification = true;
set hive.auto.convert.join = true;
set hive.mapjoin.optimized.hashtable = false;

CREATE TABLE predict_vm
AS
SELECT
  rowid,
  rf_ensemble( predicted ) as predicted
FROM(
  SELECT
    rowid,
    tree_predict( p.model_id, p.model_type, p.pred_model, t.FEATURES, ${classification} ) AS predicted
  FROM
    model p
    LEFT OUTER JOIN training t
) t1
GROUP BY
  rowid;

SELECT t.ROWID,pv.ROWID, t.LABEL, pv.predicted
FROM training t
LEFT OUTER JOIN predict_vm pv ON (t.ROWID = pv.ROWID);

Hive is useful because users can handle Hadoop like an RDB.

Hivemall's random forest can handle arrays as features, so as a next step I'll make fingerprint arrays and predict SAR or ADMET properties.
Prediction is done with a join. Hmm, it's all just SQL.
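
As a rough idea of how those features could be prepared, here is a minimal RDKit sketch that writes Morgan fingerprint on-bits in an 'index:value' style similar to the iris features above. The file name and the 'activity_class' SDF property are hypothetical.

# Sketch: export Morgan fingerprints as 'index:value' features for Hive.
# 'mols.sdf' and the 'activity_class' property are hypothetical.
import csv
from rdkit import Chem
from rdkit.Chem import AllChem

with open('fp_features.csv', 'w') as out:
    writer = csv.writer(out)
    for i, mol in enumerate(Chem.SDMolSupplier('mols.sdf')):
        if mol is None:
            continue
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)
        # Store only the on-bits, e.g. '12:1;345:1;...'; Hive's split()
        # can turn this string back into an array column.
        feats = ';'.join('%d:1' % b for b in fp.GetOnBits())
        writer.writerow([i, feats, mol.GetProp('activity_class')])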

Finally, run the script.

iwatobipen$ hive -f query.sql > loghive.txt
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/Cellar/hive/2.1.0/libexec/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/Cellar/hadoop/2.7.2/libexec/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in jar:file:/usr/local/Cellar/hive/2.1.0/libexec/lib/hive-common-2.1.0.jar!/hive-log4j2.properties Async: true
Added [/tmp/hivemall-core-0.4.2-rc.2-with-dependencies.jar] to class path
Added resources: [/tmp/hivemall-core-0.4.2-rc.2-with-dependencies.jar]
OK
Time taken: 1.349 seconds
OK
Time taken: 0.011 seconds

.............
..............
2016-09-06 22:35:46	End of local task; Time Taken: 0.79 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2016-09-06 22:35:49,634 Stage-3 map = 100%,  reduce = 0%
Ended Job = job_local1070174860_0008
MapReduce Jobs Launched: 
Stage-Stage-3:  HDFS Read: 33486 HDFS Write: 29891 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 10.06 seconds, Fetched: 150 row(s)

Of course, the results show good accuracy, because I used the training dataset for testing.😉
Hivemall has lots of other functions for machine learning; next, I'll try to use the library for chemoinformatics.
The log file follows.

model_id            	string              	                    
model_type          	int                 	                    
pred_model          	string              	                    
var_importance      	array<double>       	                    
oob_errors          	int                 	                    
oob_tests           	int                 	                    
1-1	1-1	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-2	1-2	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-3	1-3	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-4	1-4	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-5	1-5	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-6	1-6	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-7	1-7	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-8	1-8	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-9	1-9	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-10	1-10	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-11	1-11	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-12	1-12	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-13	1-13	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-14	1-14	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-15	1-15	0	{"label":0,"probability":0.96,"probabilities":[0.96,0.04]}
1-16	1-16	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-17	1-17	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-18	1-18	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-19	1-19	0	{"label":0,"probability":0.98,"probabilities":[0.98,0.02]}
1-20	1-20	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-21	1-21	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-22	1-22	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-23	1-23	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-24	1-24	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-25	1-25	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-26	1-26	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-27	1-27	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-28	1-28	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-29	1-29	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-30	1-30	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-31	1-31	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-32	1-32	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-33	1-33	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-34	1-34	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-35	1-35	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-36	1-36	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-37	1-37	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-38	1-38	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-39	1-39	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-40	1-40	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-41	1-41	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-42	1-42	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-43	1-43	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-44	1-44	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-45	1-45	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-46	1-46	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-47	1-47	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-48	1-48	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-49	1-49	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-50	1-50	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-51	1-51	1	{"label":1,"probability":0.98,"probabilities":[0.0,0.98,0.02]}
1-52	1-52	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-53	1-53	1	{"label":1,"probability":0.9,"probabilities":[0.0,0.9,0.1]}
1-54	1-54	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-55	1-55	1	{"label":1,"probability":0.98,"probabilities":[0.0,0.98,0.02]}
1-56	1-56	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-57	1-57	1	{"label":1,"probability":0.98,"probabilities":[0.0,0.98,0.02]}
1-58	1-58	1	{"label":1,"probability":0.94,"probabilities":[0.0,0.94,0.06]}
1-59	1-59	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-60	1-60	1	{"label":1,"probability":0.98,"probabilities":[0.0,0.98,0.02]}
1-61	1-61	1	{"label":1,"probability":0.98,"probabilities":[0.0,0.98,0.02]}
1-62	1-62	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-63	1-63	1	{"label":1,"probability":0.98,"probabilities":[0.0,0.98,0.02]}
1-64	1-64	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-65	1-65	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-66	1-66	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-67	1-67	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-68	1-68	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-69	1-69	1	{"label":1,"probability":0.96,"probabilities":[0.0,0.96,0.04]}
1-70	1-70	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-71	1-71	1	{"label":2,"probability":0.56,"probabilities":[0.0,0.44,0.56]}
1-72	1-72	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-73	1-73	1	{"label":1,"probability":0.76,"probabilities":[0.0,0.76,0.24]}
1-74	1-74	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-75	1-75	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-76	1-76	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-77	1-77	1	{"label":1,"probability":0.88,"probabilities":[0.0,0.88,0.12]}
1-78	1-78	1	{"label":1,"probability":0.66,"probabilities":[0.0,0.66,0.34]}
1-79	1-79	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-80	1-80	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-81	1-81	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-82	1-82	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-83	1-83	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-84	1-84	1	{"label":1,"probability":0.64,"probabilities":[0.0,0.64,0.36]}
1-85	1-85	1	{"label":1,"probability":0.98,"probabilities":[0.0,0.98,0.02]}
1-86	1-86	1	{"label":1,"probability":0.96,"probabilities":[0.02,0.96,0.02]}
1-87	1-87	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-88	1-88	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-89	1-89	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-90	1-90	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-91	1-91	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-92	1-92	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-93	1-93	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-94	1-94	1	{"label":1,"probability":0.98,"probabilities":[0.0,0.98,0.02]}
1-95	1-95	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-96	1-96	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-97	1-97	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-98	1-98	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-99	1-99	1	{"label":1,"probability":0.96,"probabilities":[0.0,0.96,0.04]}
1-100	1-100	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-101	1-101	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-102	1-102	2	{"label":2,"probability":0.96,"probabilities":[0.0,0.04,0.96]}
1-103	1-103	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-104	1-104	2	{"label":2,"probability":0.98,"probabilities":[0.0,0.02,0.98]}
1-105	1-105	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-106	1-106	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-107	1-107	2	{"label":1,"probability":0.58,"probabilities":[0.0,0.58,0.42]}
1-108	1-108	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-109	1-109	2	{"label":2,"probability":0.98,"probabilities":[0.0,0.02,0.98]}
1-110	1-110	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-111	1-111	2	{"label":2,"probability":0.98,"probabilities":[0.0,0.02,0.98]}
1-112	1-112	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-113	1-113	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-114	1-114	2	{"label":2,"probability":0.92,"probabilities":[0.0,0.08,0.92]}
1-115	1-115	2	{"label":2,"probability":0.98,"probabilities":[0.0,0.02,0.98]}
1-116	1-116	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-117	1-117	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-118	1-118	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-119	1-119	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-120	1-120	2	{"label":1,"probability":0.52,"probabilities":[0.0,0.52,0.48]}
1-121	1-121	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-122	1-122	2	{"label":2,"probability":0.8,"probabilities":[0.0,0.2,0.8]}
1-123	1-123	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-124	1-124	2	{"label":2,"probability":0.88,"probabilities":[0.0,0.12,0.88]}
1-125	1-125	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-126	1-126	2	{"label":2,"probability":0.96,"probabilities":[0.0,0.04,0.96]}
1-127	1-127	2	{"label":2,"probability":0.86,"probabilities":[0.0,0.14,0.86]}
1-128	1-128	2	{"label":2,"probability":0.88,"probabilities":[0.0,0.12,0.88]}
1-129	1-129	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-130	1-130	2	{"label":2,"probability":0.66,"probabilities":[0.0,0.34,0.66]}
1-131	1-131	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-132	1-132	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-133	1-133	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-134	1-134	2	{"label":2,"probability":0.56,"probabilities":[0.0,0.44,0.56]}
1-135	1-135	2	{"label":2,"probability":0.68,"probabilities":[0.0,0.32,0.68]}
1-136	1-136	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-137	1-137	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-138	1-138	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-139	1-139	2	{"label":2,"probability":0.84,"probabilities":[0.0,0.16,0.84]}
1-140	1-140	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-141	1-141	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-142	1-142	2	{"label":2,"probability":0.98,"probabilities":[0.0,0.02,0.98]}
1-143	1-143	2	{"label":2,"probability":0.96,"probabilities":[0.0,0.04,0.96]}
1-144	1-144	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-145	1-145	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-146	1-146	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-147	1-147	2	{"label":2,"probability":0.94,"probabilities":[0.0,0.06,0.94]}
1-148	1-148	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-149	1-149	2	{"label":2,"probability":0.98,"probabilities":[0.0,0.02,0.98]}
1-150	1-150	2	{"label":2,"probability":0.94,"probabilities":[0.0,0.06,0.94]}

Install rdkit to CentOS7

I usually use OS X or Ubuntu for coding.
But in some cases, CentOS is needed to build services.
So, I checked how to install the RDKit on CentOS 7.
I used Docker for my test, because Docker can create a virtual environment on my MacBook.
First, I pulled the CentOS 7 image.

iwatobipen$ docker pull centos:7

Then run the image.

iwatobipen$ docker run -it centos:7

OK, I'm logged in to CentOS 7.😉
Next, I installed pyenv in order to install Anaconda.
I used dnf instead of yum, so I first installed dnf using yum.
The commands are as follows.

[root@1db835ac61b4 ~]# mkdir rdkittest && cd rdkittest
[root@1db835ac61b4 rdkittest]# yum install epel-release
[root@1db835ac61b4 rdkittest]# yum install dnf

OK, dnf is installed.
But I couldn't update with dnf, so I referred to the following URL and fixed it.
http://stackoverflow.com/questions/32541196/i-attempted-to-enable-the-epel-repo-on-my-fedora-22-machine-and-i-broke-it-now

[root@1db835ac61b4 rdkittest]# rm -rf /etc/yum.repos.d/epel*
[root@1db835ac61b4 rdkittest]# dnf clean all
[root@1db835ac61b4 rdkittest]# dnf install epel-release
[root@1db835ac61b4 rdkittest]# dnf upgrade

Next, install pyenv.

[root@1db835ac61b4 rdkittest]# dnf install -y gcc zlib-devel bzip2 bzip2-devel readline-devel sqlite sqlite-devel openssl-devel
[root@1db835ac61b4 rdkittest]# git clone https://github.com/yyuu/pyenv.git ~/.pyenv
[root@1db835ac61b4 rdkittest]# echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bash_profile
[root@1db835ac61b4 rdkittest]# echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bash_profile
[root@1db835ac61b4 rdkittest]# echo 'eval "$(pyenv init -)"' >> ~/.bash_profile

Done! pyenv is installed (source ~/.bash_profile or restart the shell so the settings take effect). Now to install Anaconda ASAP.😉
And then install the RDKit!

[root@1db835ac61b4 rdkit]# pyenv install anaconda3-4.1.0
[root@1db835ac61b4 rdkit]# pyenv local anaconda3-4.1.0
[root@1db835ac61b4 rdkit]# conda install -c rdkit rdkit=2016.03.4

Finished! Now I have the RDKit on CentOS 7.
Let's check it.

[root@1db835ac61b4 rdkittest]# ipython
Python 3.5.1 |Anaconda 4.1.0 (64-bit)| (default, Jun 15 2016, 15:32:45)
Type "copyright", "credits" or "license" for more information.

IPython 4.2.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: from rdkit import Chem

In [2]: from rdkit import rdBase

In [3]: rdBase.rdkitVersion
Out[3]: '2016.03.4'

It works fine.

Anaconda is a very useful tool for everyone, I think!

BTW, why did I install dnf?

The answer is that yum only works with Python 2, while dnf can handle Python 3.

https://www.linux.com/learn/what-you-need-know-about-fedoras-switch-yum-dnf


Visualize chemical space using Knime rdkit node

Usually I use Python to analyse and visualize chemical space, because I love coding.😉
But I know that workflow tools are a useful solution for this.

So, I tried to plot chemical space using KNIME. KNIME is a famous workflow tool, and lots of nodes have been developed for it.

I made a very simple workflow to do PCA, shown below.
[Screenshot: workflow]

First, the flow reads SMILES strings from an Excel file and converts the SMILES to RDKit molecules.
Then it calculates Morgan FPs using the RDKit Fingerprint node. The node can also calculate various other FPs, such as MACCS, topological, etc.
Next, it expands the bit vector into 1024 bit columns.
Finally, it does PCA and makes a scatter plot. The plotting node comes from the Erlwood cheminformatics nodes.
When I open the scatter plot view, I get the following dynamic scatter plot.
[Screenshot: PCA scatter plot]
The node makes it easy to select columns; users can set color or size by their own criteria, and visualize structures as labels. Wow, cool!
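
For comparison, roughly the same pipeline can be written in a few lines of Python. This is only a sketch; the file name and the 'smiles' column are hypothetical.

# Sketch of the same PCA pipeline (file/column names are hypothetical).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.decomposition import PCA

def fp2arr(mol):
    # 1024-bit Morgan fingerprint as a numpy array.
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)
    arr = np.zeros((1024,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

df = pd.read_excel('compounds.xlsx')  # SMILES in a 'smiles' column
mols = [Chem.MolFromSmiles(smi) for smi in df['smiles']]
X = np.array([fp2arr(mol) for mol in mols if mol is not None])
pcs = PCA(n_components=2).fit_transform(X)  # 1024 bits -> 2 components
plt.scatter(pcs[:, 0], pcs[:, 1])
plt.xlabel('PC1'); plt.ylabel('PC2')
plt.show()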

I also set up the activity cliff viewer.
The node needs two inputs: one is SMILES, the other is a distance matrix of similarities.
The N x N distance matrix is generated with the distance matrix calculate node.
Finally, I ran the flow and got a network view of activity cliffs.
[Screenshot: activity cliff network view]
Edges colored green indicate activity cliffs (in my case, delta pIC50 >= 1.0 and similarity >= 0.5).
Hmm, but the image seems too hard to read for understanding SAR. Cytoscape would be a more suitable tool for visualizing the network.
Did I make a mistake???
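
For reference, the cliff criterion itself is easy to express in Python. A minimal sketch, assuming fps is a list of RDKit fingerprint bit vectors and pic50s a parallel list of activities (both hypothetical):

# Sketch: enumerate activity-cliff pairs
# (similarity >= 0.5 and delta pIC50 >= 1.0, as above).
from itertools import combinations
from rdkit import DataStructs

def find_cliffs(fps, pic50s, sim_cut=0.5, act_cut=1.0):
    cliffs = []
    for i, j in combinations(range(len(fps)), 2):
        sim = DataStructs.TanimotoSimilarity(fps[i], fps[j])
        if sim >= sim_cut and abs(pic50s[i] - pic50s[j]) >= act_cut:
            cliffs.append((i, j, sim))
    return cliffs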

The activity cliffs table looks good, though.

KNIME is a powerful tool for medchem.

Scoring 3D diversity using RDKit #RDKit

Recently, the importance of the 3D character of molecules has been increasing.
When I design libraries, I want to assess not only 2D but also 3D diversity.
Fortunately, the RDKit implements a function to characterize the three-dimensionality of molecules, named 'plane of best fit' (PBF).
You can use this function from the rdkit/Contrib/PBF folder.😉 Great!!!
The reference for PBF is:
Plane of Best Fit: A Novel Method to Characterize the Three-Dimensionality of Molecules
http://pubs.acs.org/doi/abs/10.1021/ci300293f

In the paper, the authors compare the relationship between the PBF score and PMI (principal moments of inertia). I was interested in the algorithm, so I calculated both scores.
The PBF can be calculated with the RDKit Contrib/PBF code, but code for calculating PMI was not available, so I wrote a calculator.
My code, calc_pmi.py, is as follows.

import sys
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors

def inertia( mol, confId=-1, principal=True, moments_only=True ):
    numatoms = mol.GetNumAtoms()
    conf = mol.GetConformer( confId )
    if not conf.Is3D():
        return 0
    # get the coordinates of each atom
    pts = np.array( [ list( conf.GetAtomPosition( atmidx ) ) for atmidx in range( numatoms ) ] )
    atoms = [ atom for atom in mol.GetAtoms() ]
    mass = Descriptors.MolWt( mol )
    # get the center of mass
    center_of_mass = sum( atoms[i].GetMass() * pts[i] for i in range( numatoms ) ) / mass

    # build the (mass-normalized) inertia tensor from the squared
    # cross-product matrix of each centered atom position
    inertia = np.zeros( (3,3) )
    for atmidx in range( numatoms ):
        cmx, cmy, cmz = pts[ atmidx ] - center_of_mass
        inertia += -atoms[ atmidx ].GetMass() / mass * np.matrix(
                [[ 0, -cmz, cmy ],
                 [ cmz, 0, -cmx ],
                 [ -cmy, cmx, 0 ]]
                ) ** 2
    if principal:
        if moments_only:
            # eigenvalues = principal moments, sorted I1 <= I2 <= I3
            return np.linalg.eigvalsh( inertia )
        else:
            return np.linalg.eigh( inertia )
    else:
        return inertia

if __name__=='__main__':
    mols = [ mol for mol in Chem.SDMolSupplier( sys.argv[1], removeHs=False ) ]
    for mol in mols:
        I = inertia( mol )
        # normalized PMI ratios
        npr1, npr2 = I[0]/I[2], I[1]/I[2]
        print( npr1, npr2 )
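
Since the script prints one NPR1/NPR2 pair per molecule, the values can be placed on the usual PMI triangle (rod, disc and sphere at the corners) to eyeball the 3D diversity of a set. A minimal matplotlib sketch of my own, not part of the paper:

# Sketch: NPR1/NPR2 values on the PMI triangle
# (rod = (0,1), disc = (0.5,0.5), sphere = (1,1)).
import matplotlib.pyplot as plt

def plot_pmi_triangle(npr1s, npr2s):
    plt.plot([0, 0.5, 1, 0], [1, 0.5, 1, 1], color='grey')  # triangle edges
    plt.scatter(npr1s, npr2s)
    plt.xlabel('NPR1')
    plt.ylabel('NPR2')
    plt.show()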

OK, let's start the calculation!
I picked one sample molecule from the reference.

from rdkit import Chem
from rdkit.Chem import AllChem, Draw
from rdkit.Chem.Draw import IPythonConsole
IPythonConsole.ipython_useSVG=True
# pbf imported from rdkit/Contrib/PBF
import pbf
import calc_pmi

# (1) molecule in the ref.
mol = Chem.MolFromSmiles( 'c1cc(O)c(C(=O)O)cc1N=Nc1ccc(O)c(C(=O)O)c1' )
hmol = Chem.AddHs(mol)
# generate a 3D conformer and optimize it with UFF
AllChem.EmbedMolecule(hmol, useExpTorsionAnglePrefs=True, useBasicKnowledge=True)
AllChem.UFFOptimizeMolecule(hmol)
rhmol = Chem.RemoveHs(hmol)

Compare the scores with the results in the reference.

#PBF score
print(pbf.PBFRD(hmol), pbf.PBFRD(rhmol))
#out
#0.140618602774 0.0769181645171

#NPR1, NPR2
hmolI = calc_pmi.inertia(hmol)
rmolI = calc_pmi.inertia(rhmol)
print(hmolI[0]/hmolI[2], hmolI[1]/hmolI[2])
print(rmolI[0]/rmolI[2], rmolI[1]/rmolI[2])
# out
# with hydrogens
#0.128978367365 0.876583937787
# without hydrogens
#0.125631084212 0.879566191929

The data in the reference are as follows.

NPR1 = 0.089 ( 0.128978367365, 0.125631084212 )
NPR2 = 0.910 ( 0.876583937787, 0.879566191929 )
PBF = 0.00313 ( 0.140618602774, 0.0769181645171 )
*The values in parentheses are my calculated data, with and without hydrogens respectively.

My data indicate that PMI hardly depends on the presence of hydrogens, but the PBF score does.
The data calculated without hydrogens seem to correlate well with the literature data, but I can't understand the reason.😦
I need to study the method more deeply.
BTW, the RDKit has lots of functions for chemoinformatics. It's a Swiss Army knife for chemoinformatics!

I referred to the following URL, which gave me good suggestions:
https://github.com/wrwrwr/mmoi-calc

Also, you can calculate PMI using the KNIME Vernalis community nodes!!!
Wow! That's great for OSS! So cool!😉

[Screenshot: PMI calculation with the KNIME Vernalis nodes]