Can a machine learn important features from SMILES?

Today I found a challenging article on arXiv.
It describes SMILES2Vec.
As you know, word2vec is a very attractive and widely used technique in the ML area, and SMILES2Vec is built on the same concept.
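To make the word2vec analogy a little more concrete, here is a minimal sketch of the usual first step: mapping each SMILES character to an integer index that an embedding layer could then turn into a vector. The function names `build_vocab` and `encode` are hypothetical and the character-level split is an assumption for illustration, not necessarily the preprocessing used in the paper.

```python
# Hypothetical helper names; a sketch of character-level encoding only.

def build_vocab(smiles_list):
    """Collect every character seen in the SMILES strings.
    Index 0 is reserved for padding."""
    chars = sorted({ch for s in smiles_list for ch in s})
    return {ch: i + 1 for i, ch in enumerate(chars)}

def encode(smiles, vocab, max_len):
    """Turn a SMILES string into a fixed-length list of integer indices.
    Characters missing from the vocabulary raise KeyError."""
    ids = [vocab[ch] for ch in smiles[:max_len]]
    return ids + [0] * (max_len - len(ids))

vocab = build_vocab(["CCO", "CCN", "c1ccccc1"])
encoded = encode("CCO", vocab, max_len=8)
print(vocab)
print(encoded)
```

Note that splitting by single characters breaks two-letter atoms such as Cl or Br into two tokens, so a real tokenizer would need to handle those.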
It converts SMILES to vectors and learns which characters are important. The authors use a “black box” approach when building the model. I am not sure about the details of the “black box” approach, but I think it is similar to leave-one-out: the method masks some features, builds a model, and finds which features are important.
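The masking idea can be sketched roughly as follows. This is only an illustration of the leave-one-out intuition described above, not the procedure from the paper: `toy_score` is a made-up stand-in for a trained model, and `occlusion_importance` is a hypothetical name.

```python
# Toy illustration: nothing here comes from the SMILES2Vec code.

def toy_score(tokens):
    """Stand-in for a trained model: counts O and N characters.
    A real model would be a neural network."""
    return float(sum(1 for t in tokens if t in ("O", "N")))

def occlusion_importance(smiles, score_fn, mask="*"):
    """Mask one character at a time and record how much the score changes."""
    tokens = list(smiles)
    base = score_fn(tokens)
    importance = []
    for i in range(len(tokens)):
        masked = tokens[:i] + [mask] + tokens[i + 1:]
        importance.append(abs(base - score_fn(masked)))
    return list(zip(tokens, importance))

result = occlusion_importance("CCO", toy_score)
for ch, imp in result:
    print(ch, imp)
```

With this toy scorer, only the masked O changes the score, so it is the only character flagged as important.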

Using this method, SMILES2Vec can find the important characters in a given SMILES string.
They found that a CNN-GRU model gives the best results for solubility prediction. My question is: why does convolution over SMILES work so well?
My opinion is that solubility or logP depends on the presence or absence of substituents such as hydroxyl or amino groups, and in some cases does not strongly depend on their position. So I think the method is interesting, but it would be difficult to use it to predict biological affinity.
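One way to picture this opinion: a filter that slides over the string and is followed by global max pooling only reports whether a pattern is present somewhere, not where it is. The sketch below uses a hand-written exact-match "filter" instead of learned convolution weights, so it is an analogy and not the CNN-GRU architecture from the paper; the function names are hypothetical.

```python
# Illustrative sketch only; a real CNN learns its filters from data.

def motif_response(smiles, motif):
    """Slide a window over the string; 1.0 where the window equals the motif."""
    k = len(motif)
    return [1.0 if smiles[i:i + k] == motif else 0.0
            for i in range(len(smiles) - k + 1)]

def global_max_pool(responses):
    """Keep only the strongest response, discarding its position."""
    return max(responses) if responses else 0.0

molecules = {
    "1-butanol": "CCCCO",
    "2-butanol": "CCC(C)O",
    "butane": "CCCC",
}

# Note: matching the character "O" also hits ethers and carbonyls,
# so this is a crude proxy for "has a hydroxyl group".
responses = {name: motif_response(smi, "O") for name, smi in molecules.items()}
positions = {name: [i for i, r in enumerate(resp) if r] for name, resp in responses.items()}
pooled = {name: global_max_pool(resp) for name, resp in responses.items()}

print(positions)
print(pooled)
```

The two alcohols give the same pooled value even though the O sits at a different index in each string, which fits a property like solubility better than a position-sensitive one like binding affinity.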

SMILES strings are a major input format in the deep learning area. I often use SMILES too. ;-) But I want to find another format for DL.

Reference for the black box model:

1602.07043.pdf


Published by iwatobipen

I'm a medicinal chemist at a mid-sized pharmaceutical company. I love chemoinformatics, coding, organic synthesis, and my family.
