I enjoyed the chemoinformatics conference held in Kumamoto this week.

On the first day of the conference, I attended a very interesting lecture. It was a very basic data-handling and visualization tutorial, but a useful one for newcomers to chemoinformatics.

I wanted to reproduce the code examples, so I tried them myself.

First, visualize the training data. It is important to know the properties of your training data.

```python
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge, ElasticNet
from sklearn.svm import SVC, SVR
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

points = [(1.5, 2), (2, 1), (2, 3), (3, 5), (4, 3), (7, 6), (9, 10)]
label = [1, -1, 1, 1, -1, -1, 1]

def color_map(label):
    if label > 0:
        return 'blue'
    return 'red'

train_color = list(map(color_map, label))

# check data
train_x = [i[0] for i in points]
train_y = [i[1] for i in points]
plt.scatter(x=train_x, y=train_y, c=train_color)
plt.xlim(0, 15)
plt.ylim(0, 15)
```

Hmm, it seems there is a linear relationship between x and y.

Next, I made test data and a helper function for visualizing the results.

```python
test_x = np.linspace(0, 10, 20)
test_y = np.linspace(0, 10, 20)
xx, yy = np.meshgrid(test_x, test_y)
test_x = xx.ravel()
test_y = yy.ravel()
n_data = len(test_x)
test_data = [(test_x[i], test_y[i]) for i in range(n_data)]

def makeplot(test_x, test_y, predict_data):
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.scatter(train_x, train_y, c=train_color)
    color = list(map(color_map, predict_data))
    ax.scatter(test_x, test_y, c=color, alpha=0.3)
    fig.show()
```
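As a quick sanity check (my own addition, not from the lecture), the meshgrid construction above pairs every x with every y, turning the two 20-point axes into a flat list of 20 × 20 = 400 grid points:

```python
import numpy as np

# Two 20-point axes over [0, 10].
test_x = np.linspace(0, 10, 20)
test_y = np.linspace(0, 10, 20)

# meshgrid pairs every x with every y, giving a 20 x 20 grid.
xx, yy = np.meshgrid(test_x, test_y)
test_x = xx.ravel()  # flatten to 1-D: 400 x coordinates
test_y = yy.ravel()  # flatten to 1-D: 400 y coordinates

test_data = [(test_x[i], test_y[i]) for i in range(len(test_x))]
print(len(test_data))  # 400
```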

OK, let’s build a model!

```python
# Linear Regression
model = LinearRegression()
predictor = model.fit(points, label)
reg = predictor.predict(test_data)
makeplot(test_x, test_y, reg)
```

Oh, the simple linear regressor works very well. ;-)
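Why does a regressor work at all for this classification-style plot? LinearRegression returns continuous values, and color_map simply thresholds them at zero. A minimal sketch of that idea (my addition; np.sign stands in for the implicit threshold):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same training data as above.
points = [(1.5, 2), (2, 1), (2, 3), (3, 5), (4, 3), (7, 6), (9, 10)]
label = [1, -1, 1, 1, -1, -1, 1]

model = LinearRegression().fit(points, label)
pred = model.predict([(0, 0), (5, 5), (10, 10)])

# The regressor emits continuous values, not class labels;
# thresholding at zero recovers a hard -1/+1 decision.
hard_labels = np.sign(pred)
print(pred)         # three continuous values
print(hard_labels)  # each is -1.0, 0.0, or 1.0
```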

OK, next, how about a random forest?

```python
# Random forest
# Note: random_state expects an int (np.random.seed(794) returns None),
# so pass the seed directly.
model = RandomForestClassifier(random_state=794)
predictor = model.fit(points, label)
reg = predictor.predict(test_data)
makeplot(test_x, test_y, reg)
```

The result is quite different from the linear regressor.
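One reason for the difference: RandomForestClassifier outputs hard labels drawn from the training set, while LinearRegression outputs a continuous gradient over the grid, so the thresholded colors change smoothly in one case and in tree-like axis-aligned blocks in the other. A small sketch to verify the output types (my addition, same data as above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier

points = [(1.5, 2), (2, 1), (2, 3), (3, 5), (4, 3), (7, 6), (9, 10)]
label = [1, -1, 1, 1, -1, -1, 1]
test_data = [(x, y) for x in np.linspace(0, 10, 20)
                    for y in np.linspace(0, 10, 20)]

lin = LinearRegression().fit(points, label)
rf = RandomForestClassifier(random_state=794).fit(points, label)

lin_pred = lin.predict(test_data)
rf_pred = rf.predict(test_data)

# The classifier only ever emits the training labels...
print(set(np.unique(rf_pred)))  # subset of {-1, 1}
# ...while the regressor produces continuous values across the grid.
print(lin_pred.dtype)           # float64
```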

Next, I checked non-linear data.

```python
points = [(1, 5), (2, 10), (2, 3), (3, 5), (5, 4), (7, 6), (9, 10), (11, 2), (7, 3)]
label = [1, 1, 1, 1, -1, -1, 1, 1, -1]

def color_map(label):
    if label > 0:
        return 'blue'
    return 'red'

train_color = list(map(color_map, label))
train_x = [i[0] for i in points]
train_y = [i[1] for i in points]
plt.scatter(x=train_x, y=train_y, c=train_color)
plt.xlim(0, 15)
plt.ylim(0, 15)
```

In this case, a linear model does not work well.

```python
# Ridge
model = Ridge()
predictor = model.fit(points, label)
reg = predictor.predict(test_data)
makeplot(test_x, test_y, reg)
```

How about RF and SVR?

```python
# RandomForest (random_state fixed to an int, as above)
model = RandomForestRegressor(random_state=794)
predictor = model.fit(points, label)
reg = predictor.predict(test_data)
makeplot(test_x, test_y, reg)

# SVR
model = SVR()
predictor = model.fit(points, label)
reg = predictor.predict(test_data)
makeplot(test_x, test_y, reg)
```

Non-linear regressors can fit non-linear data, but each one produces a different decision surface.

Model selection is important, and to select a model you need to check the training data carefully.

All my experiments can be checked on Google Colab:

https://colab.research.google.com/drive/1ywqRlcjEPm7pLP-IeawPTsclb9siuFI4

Any comments and suggestions are appreciated.