Molecular fingerprints (FPs) are very important in chemoinformatics because they are used to build many predictive models, not only for ADMET properties but also for biological activities.
As you know, ECFP (Morgan fingerprint) is one of the gold-standard fingerprints in chemoinformatics because it performs stably across a wide range of problems. Since ECFP was reported, many new fingerprint algorithms have been proposed, but ECFP is still used in many cases.
This suggests that better fingerprints are still needed, but that improving on ECFP is a very difficult task.
Recently, a new fingerprint algorithm named MAP4 was proposed by Prof. Reymond's lab. The URL is below.
The MAP in MAP4 stands for “MinHashed atom-pair fingerprint up to a diameter of four bonds”.
The calculation process is as follows.
First, the circular substructures surrounding each non-hydrogen atom in the molecule at radii 1 to r are written as canonical, non-isomeric, rooted SMILES strings.
Second, the minimum topological distance TPj,k separating each atom pair (j, k) in the input molecule is calculated.
Third, all atom-pair shingles are written for each atom pair (j, k) and each value of r, placing the two SMILES strings in lexicographical order.
Fourth, the resulting set of atom-pair shingles is hashed to a set of integers using the unique mapping SHA-1, and its corresponding transposed vector is finally MinHashed to form the MAP4 vector.
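To make the four steps more concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. The rooted SMILES strings are hand-written for ethanol instead of being generated with RDKit, the helper names (topological_distances, atom_pair_shingles, sha1_to_int, minhash) are my own, the shingle layout "SMILES|distance|SMILES" is an assumption, and the MinHash step uses simple random linear hash functions, which may differ from what the map4 package does.

```python
import hashlib
import random
from collections import deque
from itertools import combinations


def topological_distances(adjacency):
    """Shortest path length in bonds between all atom pairs (BFS per atom)."""
    distances = {}
    for start in adjacency:
        seen = {start: 0}
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for neighbor in adjacency[atom]:
                if neighbor not in seen:
                    seen[neighbor] = seen[atom] + 1
                    queue.append(neighbor)
        distances[start] = seen
    return distances


def atom_pair_shingles(env_smiles, adjacency):
    """env_smiles[atom][r - 1] is the rooted SMILES of that atom at radius r."""
    distances = topological_distances(adjacency)
    shingles = set()
    for j, k in combinations(sorted(adjacency), 2):
        # One shingle per radius: pair the radius-r environments of j and k.
        for env_j, env_k in zip(env_smiles[j], env_smiles[k]):
            first, second = sorted((env_j, env_k))  # lexicographical order
            shingles.add(f"{first}|{distances[j][k]}|{second}")
    return shingles


def sha1_to_int(shingle):
    """Map a shingle to an integer using the first 4 bytes of its SHA-1 digest."""
    digest = hashlib.sha1(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little")


def minhash(hash_values, n_perm=16, seed=42):
    """Toy MinHash: keep the minimum of each random linear hash function."""
    prime = (1 << 61) - 1
    rng = random.Random(seed)
    params = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(n_perm)]
    return [min((a * h + b) % prime for h in hash_values) for a, b in params]


# Ethanol (CCO): atom 0 = CH3 carbon, atom 1 = CH2 carbon, atom 2 = oxygen.
adjacency = {0: [1], 1: [0, 2], 2: [1]}
# Hand-written, illustrative rooted SMILES at radius 1 and 2 for each atom.
env_smiles = {0: ["CC", "CCO"], 1: ["C(C)O", "C(C)O"], 2: ["OC", "OCC"]}

shingles = atom_pair_shingles(env_smiles, adjacency)
signature = minhash({sha1_to_int(s) for s in shingles})
print(sorted(shingles))
print(signature)
```

Three atom pairs and two radii give six shingles here, and the signature length is fixed by n_perm no matter how many shingles the molecule produces. That fixed length is what makes MinHashed fingerprints easy to compare.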
The approach looks similar to ECFP, but it uses the MinHash technique. It is very fast, and according to the authors it outperforms their previously developed MHFP, ECFP, and other fingerprints implemented in RDKit.
In their article, Fig. 2 shows virtual screening performance against various targets, and MAP4 outperformed the other fingerprints on many of them.
Fig. 8 shows the Jaccard distance between sets of molecules. MAP4 performs better not only on small molecules but also on large molecules such as peptides, compared to other fingerprints such as atom-pair FP, ECFP, etc.
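As a reminder of what the Jaccard distance measures, here is a small sketch in plain Python. jaccard_distance works on exact sets (for example, sets of shingles), and minhash_distance estimates the same quantity from two MinHash signatures of equal length by counting the positions where they agree. Both function names are my own, and this is only an illustration of the definition, not code from the paper.

```python
def jaccard_distance(set_a, set_b):
    """1 - |A ∩ B| / |A ∪ B| for two collections treated as sets."""
    set_a, set_b = set(set_a), set(set_b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here: two empty sets are identical
    return 1.0 - len(set_a & set_b) / len(union)


def minhash_distance(sig_a, sig_b):
    """Estimate the Jaccard distance from two MinHash signatures."""
    if not sig_a or len(sig_a) != len(sig_b):
        raise ValueError("signatures must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return 1.0 - matches / len(sig_a)


print(jaccard_distance({"a", "b", "c"}, {"b", "c", "d"}))
print(minhash_distance([3, 7, 1, 9], [3, 7, 4, 2]))
```

The estimate from signatures gets closer to the exact value as the signature length grows, so longer MinHash vectors trade memory for accuracy.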
So I wanted to try the FP in my own notebook.
The authors have released the source code, so I could get it from GitHub.
git clone https://github.com/reymond-group/map4
cd map4
I think the original repo has a bug in the folded fingerprint calculation, so I made a PR to the original repo.
The following code uses my modified version of the original code.
I compared the FP against the Morgan FP from RDKit.
Today's code was uploaded to my gist and GitHub.
MAP4Calculator provides calculate and calculate_many methods. The first calculates the MAP4 FP of a single molecule, and the second calculates the MAP4 FPs of a list of molecules.
The is_folded option is False by default, so to get a folded (fixed-length) fingerprint, the user needs to change the option from False to True.
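To illustrate what "folded" means, here is a minimal sketch that maps a variable-size collection of integer hashes onto a fixed-length bit vector with the modulo operation. This is a generic folding scheme and an assumption on my part: fold_hashes and n_bits are names I made up, and the map4 package may implement folding differently.

```python
def fold_hashes(hash_values, n_bits=1024):
    """Fold integer hashes into a fixed-length list of 0/1 bits."""
    if n_bits <= 0:
        raise ValueError("n_bits must be positive")
    bits = [0] * n_bits
    for value in hash_values:
        bits[value % n_bits] = 1  # different hashes can collide on one bit
    return bits


folded = fold_hashes([5, 1029, 2048, 77], n_bits=1024)
print(len(folded), sum(folded))  # 1024 3 (5 and 1029 share bit 5)
```

A folded vector is convenient as input for machine learning models that expect a fixed number of features, at the cost of some information loss from bit collisions.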
For the test data shown above, Morgan FP seems more sensitive to differences in compound structure, because its similarity values are lower than those of MAP4 FP. The folded and unfolded MAP4 FPs showed almost the same performance.
Today I showed how to calculate MAP4 FP. Next, I would like to check its performance on a real drug discovery project with some machine learning algorithms. :-)
I also uploaded the code to my GitHub repo.