Today I found a challenging article on arXiv.
It describes SMILES2Vec. https://arxiv.org/pdf/1712.02034.pdf
As you know, word2vec is a very attractive and widely used technique in the ML area, and SMILES2Vec follows the same concept.
It converts SMILES to vectors and learns which characters are important. The authors use a “black box” model approach for building the model. I am not sure about the details of the “black box” method, but I think it is like leave-one-out: mask some features, build the model, and find the important features.
Using this method, SMILES2Vec can find the important characters in a given SMILES string.
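The masking idea is easy to sketch in a few lines of Python. The names below (char_importance, predict_solubility) are hypothetical and just for illustration; this is not code from the paper.

```python
# Leave-one-out style masking sketch; predict_solubility is a hypothetical
# stand-in for a trained model, not code from the SMILES2Vec paper.

def char_importance(smiles, predict_solubility, mask_char="*"):
    """Score each character by how much masking it changes the prediction."""
    baseline = predict_solubility(smiles)
    scores = []
    for i, ch in enumerate(smiles):
        masked = smiles[:i] + mask_char + smiles[i + 1:]
        scores.append((ch, abs(predict_solubility(masked) - baseline)))
    return scores

# Toy predictor that just counts oxygens, only for demonstration.
toy_model = lambda s: 0.5 * s.count("O")
print(char_importance("CCO", toy_model))
# -> [('C', 0.0), ('C', 0.0), ('O', 0.5)]
```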
They found that a CNN-GRU model gives the best result for solubility prediction. My question is… why does convolution over SMILES work so well???
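For reference, a character-level CNN-GRU on SMILES might look roughly like the Keras sketch below. The sequence length, vocabulary size, and layer sizes are my own guesses, not the hyperparameters reported in the paper.

```python
# Rough Keras sketch of a character-level CNN-GRU on SMILES.
# MAX_LEN, VOCAB_SIZE and all layer sizes are assumptions, not paper values.
from tensorflow.keras import layers, models

MAX_LEN = 100     # padded SMILES length (assumption)
VOCAB_SIZE = 40   # number of distinct SMILES characters (assumption)

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, 32),                      # character embeddings
    layers.Conv1D(64, kernel_size=3, activation="relu"),   # local character patterns
    layers.GRU(64),                                        # longer-range sequence context
    layers.Dense(1),                                        # solubility (regression)
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```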
My opinion is that solubility and logP depend mainly on the presence or absence of substituents such as hydroxyl or amino groups, and in many cases they do not strongly depend on their positions. So I think the method is interesting, but it would be difficult to predict biological affinity this way.
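For example, a quick RDKit substructure check makes the point about presence vs. position. The molecules and SMARTS patterns below are toy examples I picked, not data from the paper.

```python
# Toy RDKit check: the first two molecules both contain an -OH group,
# just at different positions; the last one contains an -NH2 group.
from rdkit import Chem

hydroxyl = Chem.MolFromSmarts("[OX2H]")   # -OH
amino = Chem.MolFromSmarts("[NX3;H2]")    # -NH2

for smi in ["CCCCO", "CC(O)CC", "Nc1ccccc1"]:
    mol = Chem.MolFromSmiles(smi)
    print(smi,
          "OH:", mol.HasSubstructMatch(hydroxyl),
          "NH2:", mol.HasSubstructMatch(amino))
```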
SMILES strings are a major input format in the deep learning area, and I often use SMILES myself. ;-) But I want to find another format for DL.
Reference for the “black box” model: