# Cross entropy clustering

My kids love mickey mouse. ;-)
If you have mickey mouse plot, can you cluster the data reasonably ?
R package named gmum.r provides interesting solution to do it.
The package has many features for data analysis. If reader who has interest it please use it.
I used the package to cluster mickey mouse data!

```library(gmum.r)
data("cec.mouse1.spherical")
plot(  cec.mouse1.spherical )
```

Next, compare k-means and CEC.

``` k<-kmeans(cec.mouse1.spherical,10)
plot(cec.mouse1.spherical,col=k\$cluster)
```

Oh! colorful mickey !!! Next use CEC.

```c <- CEC(k=3, x=dataset, control.nstart=10, method.type='spherical')
plot(c)
``` The result fits my feeling.

I think CEC is similar to EM based clustering.

# Draw radar chart with R

A radar chart is a graphical method of displaying multivariate data in the form of a two-dimensional chart of three or more quantitative variables represented on axes starting from the same point. The relative position and angle of the axes is typically uninformative. (from wikipedia)

It’s useful for visualize multi parameters in drug discovery.
For example, visualize compound profile, Lipinsky rule, ADMET profile etc….
Yesterday, I found new library to make radar chart in R language named ggradar.
You know, ggradar is library based on ggplot2. It sounds nice! I used ggradar.
I wrote simple example using iris dataset.
I used dpylr for data preparation. ( dplyr is cool library !!!! )

```library(ggradar);
library(scales);
suppressMessages( library(dplyr) );
irisdata <- iris %>%
group_by( Species ) %>%
mutate_each(funs(rescale)) %>%
summarise( mean(Sepal.Length), mean(Sepal.Width), mean( Petal.Length ), mean( Petal.Width )  )
```

Then I got following image. ggradar can set many option, font color, size and line size etc.
But now legend text size can not change. The issue is submitted github.

Ggradar is early project, but useful and attractive library for me.

# Find MCS in R

Find maximum common substructure is useful for finding core scaffold.
I think that finding MCS, using commercially available tools is common (pipeline pilot ?).
I often use RDkit. ;-)
Today I found the library that search MCS in R, named fmcsR.
That’s sounds nice, because if fmcsR works fine, I’ll implement the library to Spotfire using TERR.
So, let’s try it.
Install is very easy.
Type following command.

```source("http://bioconductor.org/biocLite.R")
biocLite("fmcsR")
```

TIPS; fmcsR depend on ChemmineR.
Then write test code.

```library(fmcsR)
data("fmcstest")
test &lt;- fmcs(fmcstest, fmcstest, au=2,bu=1)
plotMCS(test, regenerateCoords=TRUE)
```

au is Upper bound for the number of atom mismatches.
bu is Upper bound for the number of bound mismatches.

Then I got following image. Works fine.

This library also compute batch search.
Example is following.

```&gt; fmcsBatch(sdf, sdf[1:30], au=0, bu=0)
starting worker pid=2002 on localhost:11906 at 22:33:40.230
Query_Size Target_Size MCS_Size Tanimoto_Coefficient Overlap_Coefficient
CMP1 33 33 33 1.0000000 1.0000000
CMP2 33 26 11 0.2291667 0.4230769
CMP3 33 26 10 0.2040816 0.3846154
CMP4 33 32 9 0.1607143 0.2812500
CMP5 33 23 14 0.3333333 0.6086957
CMP6 33 19 13 0.3333333 0.6842105
CMP7 33 21 9 0.2000000 0.4285714
CMP8 33 31 8 0.1428571 0.2580645
CMP9 33 21 9 0.2000000 0.4285714
CMP10 33 21 8 0.1739130 0.3809524
CMP11 33 36 15 0.2777778 0.4545455
CMP12 33 26 12 0.2553191 0.4615385
CMP13 33 26 11 0.2291667 0.4230769
CMP14 33 16 12 0.3243243 0.7500000
CMP15 33 34 15 0.2884615 0.4545455
CMP16 33 25 8 0.1600000 0.3200000
CMP17 33 19 8 0.1818182 0.4210526
CMP18 33 24 10 0.2127660 0.4166667
CMP19 33 25 14 0.3181818 0.5600000
CMP20 33 26 10 0.2040816 0.3846154
CMP21 33 25 15 0.3488372 0.6000000
CMP22 33 21 11 0.2558140 0.5238095
CMP23 33 26 11 0.2291667 0.4230769
CMP24 33 17 6 0.1363636 0.3529412
CMP25 33 27 9 0.1764706 0.3333333
CMP26 33 24 13 0.2954545 0.5416667
CMP27 33 26 11 0.2291667 0.4230769
CMP28 33 20 10 0.2325581 0.5000000
CMP29 33 20 8 0.1777778 0.4000000
CMP30 33 18 7 0.1590909 0.3888889
&gt;
```

This method is useful for bach search.

Unfortunately, batch rmcs search is very slow on win7 32bit environment. ;-(

# make 3d PCA plot

I often use PCA(principal component analysis) to reduce dimension.
I do PCA using Python sklearn or R language.

Basic function of R “biplot” makes 2D chart.
It’s easy way to make biplot.

Today I found cool library of R, named “pca3d”.
Install is easy! Just type following command.

```install.packages("pca3d")
```

Now make chart.
I used iris data set for test.

```> library(pca3d)
> data(iris)
> pca <- prcomp(iris[,-5], scale.=TRUE)
> pca3d(pca, group=iris[,5], biplot=TRUE)
 0.06599283 0.05354630 0.02004088
Creating new device
```

Enter pca3d command, X quarts was launched and I got 3d biplot.
The chart can move. Hmm, 3D chart is useful to check the distribution of datapoint, but… I like 2D biplot. ;-)

If reader who is interested in the library, please check following site.
https://cran.r-project.org/web/packages/pca3d/pca3d.pdf

# PKPD in R.

You know, to drug development understanding PKPD is important.
I’m not DMPK dept. but I think it’s better to know about basic PKPD theory.
There are some packages about pkpd analysis in R.
And I found cool library developed ronkeiser named “PKPDsim”.
http://ronkeizer.github.io/PKPDsim/
This library can integrate shiny, so user can calculate PKPD on the fly!
I used the library today.
I following code is almost same as document.
Following code is simulation of 1 compartment model, oral dose.
pk_1cmt_oral is defined in source code.

```library("PKPDsim")
library("ggplot2")
p <- list(CL=1, V=10, KA=0.5)
pk1 <- new_ode_model("pk_1cmt_oral")
r1 <- new_regimen( amt=100,
times=c(0,12,24,36)
)
dat <- sim_ode(ode = "pk1",
par = p,
regimen=r1
)
plt <- ggplot( dat, aes(x=t, y=y) ) + geom_line() + facet_wrap(~comp)
print(plt)
```

Now I got following image.
1 means elimination phase, 2 means absorption phase. Then run shiny, and simulate on the fly.

```sim_ode_shiny(ode='pk1', par=p, regimen=r1)
```

I think it’s good library to study or simulate PKPD.

# Array or sparse array ?

In a process of lead optimization, chemist often do SAR expansion around potent compound.
If lead compound can be break down three parts A(head), B(core), C(tail), chemist(me…) often fix one part(e.g core B) and change two parts.
After optimize A and C then, fix A, C and change B.
This approach is called array synthesis (not matrix).

It’s very simple and easy to design molecules, Free wilson approach maybe useful in this case.

But, sometime this approach meet problem that SAR is not additive. Each parts shows interaction because of many factors, steric, electrostatic, etc….
I need to think about SAR design more effectively.

DOE(design of experiment) is used drug discovery.
It’s not only used for QC, QA, but also used in process research or any other process of drug discovery.

I think DOE is useful for LO.
So, I searched DOE library in PyPi, and found pyDOE.
https://pythonhosted.org/pyDOE/
It’s cool but the library can’t make orthogonal array.
So, I moved to R.
CRAN provided DOE library called DOE.base.
I used the library.
Situation is…
Lead compound is build from 3 components.
r1, side chain.
r2, core part.
r3, side chain.
To optimize molecule, I selected 6 parts in r1, 3 parts in core and 3 parts in r2.
Simply, if I’ll try to make all combination, I need to make 6 * 3 * 3 = 54 cmpds.
But based on DOE, make orthogonal array like following code I could reduce number of compounds that I have to make.
I think combination of QSAR and DOE is good method of lead opt process.

```> library("DoE.base")
> r1 <- c("C", "CO", "CN","F","Br", "Cl")
> core <- c("phenyl", "pyridyl", "cyclohexyl")
> r2 <- c("c-pro", "c-bu", "c-pent")

> des <- oa.design( factor.names = list(r1, core, r2))
> des
A          B      C
1  Br cyclohexyl   c-bu
2  CN cyclohexyl  c-pro
3  CN    pyridyl c-pent
4  Cl cyclohexyl   c-bu
5  Br    pyridyl  c-pro
6  CO     phenyl  c-pro
7   F    pyridyl c-pent
8   F     phenyl   c-bu
9  Br     phenyl c-pent
10  C    pyridyl   c-bu
11  F cyclohexyl  c-pro
12 Cl    pyridyl  c-pro
13  C     phenyl  c-pro
14  C cyclohexyl c-pent
15 CO    pyridyl   c-bu
16 Cl     phenyl c-pent
17 CO cyclohexyl c-pent
18 CN     phenyl   c-bu
class=design, type= oa
```

How do you expansion SAR in LO stage?

# Get distance matrix via tibco spotfire.

I often calculate molecular distance matrix or similarity matrix.
Distance matrix is useful for visualise molecular similarity but some time it is bother to calculate it.
Today I wrote data function that calculate distance matrix for molecular set.
There are not native function to calculate that in Spotfire. And RCDK can not install TERR.
So, I used RinR. Using RinR, Spotfire can use local R environment function.
I set input parameter ‘smiles’ as column and output parameter ‘res’ as table.
Then register following data function to TS server.

```res <- REvaluate(
{ library( rcdk );
mols <- parse.smiles( smiles );
fps <- lapply( mols, get.fingerprint, type='extended' );
fp.sim <- fp.sim.matrix( fps, method=tanimoto'' );
dist <- 1 - fp.sim;
dist.df <- data.frame( smiles, dist );
res <- dist.df;
}, data = 'smiles'
)
```

I could get distance matrix when I ran the function.
RinR is very flexible to extend function of spotfire.