QSAR toolbox for myself #RDKit #Chemoinformatics

I often work with SDF files to build QSAR models, and the data preparation process is almost always the same. So I decided to make a toolbox for myself. The code is not complex; it is very simple.
https://github.com/iwatobipen/QSAR_TOOLBOX

The current code has two main scripts. One is a fingerprint generator, and the other is a model builder and optimizer that trains SVM, Random Forest, and Gaussian Process classifiers and tunes their hyperparameters with Optuna. Optuna is developed by Preferred Networks.

The fingerprint maker, fpgen.py, takes an SDF file as input and generates a fingerprint array and a target array in npz format. ECFP or FCFP can be selected as an option.
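The core of the script is roughly the sketch below. This is a minimal sketch, not the actual fpgen.py; the input file name and the 'Class' property are placeholders, and useFeatures switches Morgan fingerprints between ECFP-like and FCFP-like bits.

# Minimal sketch of the fingerprint generation step; file names and the
# 'Class' property are placeholders, not the actual fpgen.py code.
import os
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fp2arr(fp):
    # convert an RDKit ExplicitBitVect to a numpy array
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

mols = [m for m in Chem.SDMolSupplier('input.sdf') if m is not None]
# useFeatures=False -> ECFP-like bits, useFeatures=True -> FCFP-like bits
fps = [AllChem.GetMorganFingerprintAsBitVect(m, radius=2, nBits=1024, useFeatures=False)
       for m in mols]
X = np.asarray([fp2arr(fp) for fp in fps])
Y = np.asarray([int(m.GetProp('Class')) for m in mols])
os.makedirs('data', exist_ok=True)
np.savez('data/ECFP_2_1024_X.npz', X)
np.savez('data/Class_arr.npz', Y)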

As an example, I downloaded JAK3 inhibitors from ChEMBL as a TSV file and converted it to SDF.
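The conversion can be done with pandas and RDKit's PandasTools, for example like the hypothetical sketch below; the column names 'Smiles' and 'pChEMBL Value' and the activity threshold for the 'Class' label are my assumptions about the ChEMBL export, not something fixed by the toolbox.

# Hypothetical sketch of the TSV -> SDF conversion; column names and the
# activity threshold for the 'Class' label are assumptions.
import pandas as pd
from rdkit.Chem import PandasTools

df = pd.read_csv('CHEMBL25-chembl_activity-JAK3.tsv', sep='\t')
PandasTools.AddMoleculeColumnToFrame(df, smilesCol='Smiles')
df = df[df['ROMol'].notnull()]
# label actives/inactives, e.g. pChEMBL >= 7 as the positive class
df['Class'] = (df['pChEMBL Value'] >= 7).astype(int)
PandasTools.WriteSDF(df, 'example/CHEMBL25-chembl_activity-JAK3.sdf',
                     molColName='ROMol',
                     properties=[c for c in df.columns if c != 'ROMol'])

With the SDF file ready, fpgen.py can be run as shown below.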

 
usage: fpgen.py [-h] [--fptype FPTYPE] [--radius RADIUS] [--nBits NBITS]
                [--molid MOLID] [--target TARGET] [--output OUTPUT]
                input

positional arguments:
  input            filename of sdf

optional arguments:
  -h, --help       show this help message and exit
  --fptype FPTYPE  ECFP, FCFP,
  --radius RADIUS  radius of ECFP, FCFP
  --nBits NBITS    number of bits
  --molid MOLID    molid prop
  --target TARGET  target name for predict
  --output OUTPUT  output path
# Example usage
python qsartools/fpgen.py example/CHEMBL25-chembl_activity-JAK3.sdf --molid Molecule --target Class

The fingerprint and class arrays are stored in the ./data folder.
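For a quick sanity check, the saved arrays can be loaded back with numpy (the file names below are the ones used in the model-building example later in this post):

# quick sanity check of the generated npz files
import numpy as np

X = np.load('data/ECFP_2_1024_X.npz')['arr_0']
Y = np.load('data/Class_arr.npz')['arr_0']
print(X.shape, Y.shape)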

Now we are ready to build models with the next script. I referred to the following blog post when I wrote the code: https://qiita.com/koshian2/items/ef9c0c74fe38739599d5
This script builds models, optimizes their hyperparameters, and saves each model as a pkl file.

import argparse
import uuid
import pickle
import optuna
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier


def make_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument('X', type=str, help='data for X npz format')
    parser.add_argument('Y', type=str, help='data for Y npz format')
    parser.add_argument('target', type=str, help='name of target, it will be database name')
    parser.add_argument('--testsize', type=float, help='test size of train/test split, default = 0.2', default=0.2)
    parser.add_argument('--n_trials', type=int, help='number of trials, default = 200', default=200)
    parser.add_argument('--outpath', type=str, help='path for output, default = svc_rf_gp_output', default='svc_rf_gp_output')
    return parser

def objectives(trial):
    # tag the uuid with the classifier name so the pickled model files are
    # easy to identify; the tagged uuid is also stored as a user attribute
    trial_uuid = str(uuid.uuid4())
    classifier_name = trial.suggest_categorical('classifier', ['SVC', 'RandomForest', 'GP'])
    if classifier_name == 'SVC':
        svc_c = trial.suggest_loguniform('svc_c', 1e-10, 1e10)
        svc_gamma = trial.suggest_loguniform('svc_gamma', 1e-10, 1e10)
        classifier_obj = SVC(C=svc_c, gamma=svc_gamma)
        trial_uuid = 'svc_' + trial_uuid
    elif classifier_name == 'RandomForest':
        rf_max_depth = int(trial.suggest_loguniform('rf_max_depth', 2, 32))
        classifier_obj = RandomForestClassifier(max_depth=rf_max_depth, n_estimators=10)
        trial_uuid = 'rf_' + trial_uuid
    else:
        classifier_obj = GaussianProcessClassifier()
        trial_uuid = 'gp_' + trial_uuid
    trial.set_user_attr('uuid', trial_uuid)
    # fit on the whole training set for the predictions below;
    # cross_val_score clones the estimator, so the fitted model is unaffected
    classifier_obj.fit(train_x, train_y)
    score = cross_val_score(classifier_obj, train_x, train_y, n_jobs=4, cv=5)
    val_error = 1.0 - score.mean()
    trial.set_user_attr('val_error', val_error)

    y_pred_train = classifier_obj.predict(train_x)
    y_pred_test = classifier_obj.predict(test_x)

    acc_train = accuracy_score(train_y, y_pred_train)
    acc_test =  accuracy_score(test_y, y_pred_test)
    
    error_train = 1.0 - acc_train
    error_test = 1.0 - acc_test

    trial.set_user_attr('train_error', error_train)
    trial.set_user_attr('train_acc', acc_train)
    trial.set_user_attr('test_error', error_test)
    trial.set_user_attr('test_acc', acc_test)

    # save each trial's fitted model under the output path; args is visible
    # here because it is parsed in the __main__ block before study.optimize()
    if not os.path.exists(args.outpath):
        os.mkdir(args.outpath)
    with open(os.path.join(args.outpath, f'{trial_uuid}.pkl'), 'wb') as fp:
        pickle.dump(classifier_obj, fp)
    return error_test

if __name__=='__main__':
    parser = make_parser()
    args = parser.parse_args()
    # the objective returns test error, so Optuna's default 'minimize' direction is appropriate
    study = optuna.create_study(storage=f'sqlite:///{args.target}_svc_rf.db')
    X = np.load(args.X)['arr_0']
    Y = np.load(args.Y)['arr_0']
    print(X.shape)
    print(Y.shape)
    train_x, test_x, train_y, test_y = train_test_split(X, Y, test_size=args.testsize)
    study.optimize(objectives, n_trials=args.n_trials)
    print(study.best_params)
    print(study.best_value)
    df = study.trials_dataframe()
    df.to_csv(f'{args.outpath}/optuna_svc_rf.csv')

Running the code is very easy.

 
usage: svm_rf_gp_optuna.py [-h] [--testsize TESTSIZE] [--n_trials N_TRIALS]
                           [--outpath OUTPATH]
                           X Y target

positional arguments:
  X                    data for X npz format
  Y                    data for Y npz format
  target               name of target, it will be database name

optional arguments:
  -h, --help           show this help message and exit
  --testsize TESTSIZE  testsize of train/test split default = 0.2
  --n_trials N_TRIALS  number of trials default = 200
  --outpath OUTPATH    path for output, default = svc_rf_gp_output

python qsartools/svm_rf_gp_optuna.py data/ECFP_2_1024_X.npz data/Class_arr.npz JAK3_ --n_trials 50

After the run finished, I got a CSV file summarizing the optimization process.

A screenshot is below: SVC showed the best performance on the test dataset, and GP was second.
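Because each trial stores its tagged uuid as a user attribute and pickles the fitted model, the best model can be reloaded later, for example like this (a sketch; optuna.load_study with study_name=None works when the storage holds a single study, which may depend on your Optuna version):

# Sketch: reload the best model found by the study above.
import pickle
import optuna

study = optuna.load_study(study_name=None, storage='sqlite:///JAK3__svc_rf.db')
best_uuid = study.best_trial.user_attrs['uuid']
with open(f'svc_rf_gp_output/{best_uuid}.pkl', 'rb') as f:
    model = pickle.load(f)
# 'model' is now the fitted scikit-learn classifier from the best trial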


There are many useful packages, documents, and web resources, and I can learn a lot from them.

I would like to update the toolbox with more functions in the near future.
