Make predictive models with small data and visualize them #Chemoinformatics

I enjoyed the chemoinformatics conference held in Kumamoto this week.
On the first day of the conference, I heard a very interesting lecture. It was a very basic data-handling and visualization tutorial, but useful for newcomers to chemoinformatics.
I wanted to reproduce the code examples, so I tried it.

First, visualize the training data. It is important to know the properties of the training data.

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# toy training data: 2D points with binary labels (1 / -1)
points = [(1.5, 2), (2, 1), (2, 3), (3, 5), (4, 3), (7, 6), (9, 10)]
label = [1, -1, 1, 1, -1, -1, 1]

def color_map(label):
    # positive labels are drawn in blue, negative in red
    if label > 0:
        return 'blue'
    return 'red'

train_color = list(map(color_map, label))

# check the training data
train_x = [i[0] for i in points]
train_y = [i[1] for i in points]
plt.scatter(x=train_x, y=train_y, c=train_color)
plt.xlim(0, 15)
plt.ylim(0, 15)

Hmm, it looks like the two classes are linearly separable: the blue points lie above the line y = x and the red points below it.

Next, I made test data and a helper function to visualize the predictions.

# make a 20 x 20 grid of test points covering the plot area
test_x = np.linspace(0, 10, 20)
test_y = np.linspace(0, 10, 20)
xx, yy = np.meshgrid(test_x, test_y)
test_x = xx.ravel()
test_y = yy.ravel()
n_data = len(test_x)
test_data = [(test_x[i], test_y[i]) for i in range(n_data)]

def makeplot(test_x, test_y, predict_data):
    # overlay predictions (light colors) on the training data (solid colors)
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.scatter(train_x, train_y, c=train_color)
    color = list(map(color_map, predict_data))
    ax.scatter(test_x, test_y, c=color, alpha=0.3)
    fig.show()
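By the way, instead of a scatter plot of colored test points, the decision surface can also be drawn with matplotlib's contourf. This is just an optional sketch reusing the xx and yy grid from above; make_contourplot is a helper name I made up:

def make_contourplot(predictor):
    # predict on the grid and draw a filled-contour decision surface
    zz = np.array(predictor.predict(np.c_[xx.ravel(), yy.ravel()]))
    # coolwarm_r roughly matches color_map: high (positive) values in blue
    plt.contourf(xx, yy, zz.reshape(xx.shape), alpha=0.3, cmap='coolwarm_r')
    plt.scatter(train_x, train_y, c=train_color)
    plt.show()

Calling make_contourplot(predictor) after fitting any of the models below gives the same picture as makeplot, just with a filled surface.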

OK, let’s build models!

# Linear regression: continuous predictions, colored by their sign via color_map
model = LinearRegression()
predictor = model.fit(points, label)
reg = predictor.predict(test_data)
makeplot(test_x, test_y, reg)


Oh, the simple linear regressor works very well. ;-)
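Since color_map colors a prediction by its sign, the linear model's decision boundary is simply the line where the prediction equals zero. A small sketch to inspect the fitted coefficients and draw that line (using the predictor fitted above):

# the decision boundary is where a*x + b*y + c = 0
a, b = predictor.coef_
c = predictor.intercept_
print(f'prediction = {a:.3f}*x + {b:.3f}*y + {c:.3f}')
xs = np.linspace(0, 15, 50)
plt.scatter(train_x, train_y, c=train_color)
plt.plot(xs, -(a * xs + c) / b)  # y = -(a*x + c) / b, assuming b != 0
plt.xlim(0, 15)
plt.ylim(0, 15)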
OK, next, how about a random forest?

# Random forest classifier
# note: np.random.seed(794) returns None, so the original random_state was
# effectively unset; pass the seed directly for reproducibility
model = RandomForestClassifier(random_state=794)
predictor = model.fit(points, label)
reg = predictor.predict(test_data)
makeplot(test_x, test_y, reg)


The result is quite different from the linear regressor's: the random forest carves the plane into blocky, axis-aligned regions instead of a single straight boundary.
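One nice thing about the classifier is that it also gives class probabilities via predict_proba, which softens the hard boundary above. A minimal sketch with the random forest fitted above (classes_ is sorted, so column 1 is the probability of label 1):

# visualize the predicted probability of the positive class
proba = predictor.predict_proba(test_data)[:, 1]
plt.scatter(train_x, train_y, c=train_color)
plt.scatter(test_x, test_y, c=proba, cmap='Blues', alpha=0.3)
plt.colorbar(label='P(label = 1)')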

Next, I checked non-linear data.

# new training data that is not linearly separable
points = [(1, 5), (2, 10), (2, 3), (3, 5), (5, 4), (7, 6), (9, 10), (11, 2), (7, 3)]
label = [1, 1, 1, 1, -1, -1, 1, 1, -1]

train_color = list(map(color_map, label))
train_x = [i[0] for i in points]
train_y = [i[1] for i in points]
plt.scatter(x=train_x, y=train_y, c=train_color)
plt.xlim(0, 15)
plt.ylim(0, 15)

In this case, a linear model does not work well.

# Ridge regression (a linear model with L2 regularization)
model = Ridge()
predictor = model.fit(points, label)
reg = predictor.predict(test_data)
makeplot(test_x, test_y, reg)
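Ridge is just linear regression with an L2 penalty, so its coefficients shrink as the regularization strength alpha grows (alpha=1.0 is scikit-learn's default; the other values here are arbitrary, for illustration):

# compare coefficient shrinkage for a few regularization strengths
for alpha in [0.01, 1.0, 100.0]:
    m = Ridge(alpha=alpha).fit(points, label)
    print(alpha, m.coef_, m.intercept_)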

How about RF and SVR?

# Random forest regressor (seed passed directly, as above)
model = RandomForestRegressor(random_state=794)
predictor = model.fit(points, label)
reg = predictor.predict(test_data)
makeplot(test_x, test_y, reg)

# Support vector regression (default RBF kernel)
model = SVR()
predictor = model.fit(points, label)
reg = predictor.predict(test_data)
makeplot(test_x, test_y, reg)

(First plot above: RF prediction; second plot: SVR prediction.)
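The shape of the SVR surface depends strongly on the kernel parameters. With the default RBF kernel, gamma controls how local the fit is; a quick sketch trying a few values (the specific values are arbitrary, for illustration):

# larger gamma gives a more local, wiggly fit
for gamma in [0.01, 0.1, 1.0]:
    m = SVR(gamma=gamma).fit(points, label)
    makeplot(test_x, test_y, m.predict(test_data))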

Non-linear regressors can fit non-linear data, but each model gives a different output.
Model selection is important, and to select a model you need to check the training data carefully; a simple way to compare candidates quantitatively is cross-validation, as in the sketch below.
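With such small data, leave-one-out cross-validation is a natural choice. A minimal sketch using scikit-learn's cross_val_score (accuracy is just one possible metric here):

from sklearn.model_selection import LeaveOneOut, cross_val_score
# leave-one-out CV on the non-linear dataset above
clf = RandomForestClassifier(random_state=794)
scores = cross_val_score(clf, points, label, cv=LeaveOneOut())
print('LOO accuracy:', scores.mean())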
All my experiments can be checked on Google Colab:

https://colab.research.google.com/drive/1ywqRlcjEPm7pLP-IeawPTsclb9siuFI4

Any comments and suggestions are appreciated.
