I've belatedly become interested in Hadoop.

I felt that Hadoop would be difficult for me to handle (I'm not good at data science...), but a few days ago I found a very attractive library named 'Hivemall'.

The following description is taken from its GitHub page.

Hivemall is a scalable machine learning library that runs on Apache Hive. Hivemall is designed to be scalable to the number of training instances as well as the number of training features.

https://github.com/myui/hivemall

This means that I can store data, build a model, and predict, all on Hadoop. Hmm, that sounds nice. Let's DIY!

My environment is a Mac, so I installed Hadoop and Hive using Homebrew.

It was very simple. Just type the following commands.

iwatobipen$ brew install hadoop
iwatobipen$ brew install hive

After installation, I set up some files, formatted the file system, and started Hadoop.

iwatobipen$ hdfs namenode -format
iwatobipen$ /usr/local/Cellar/hadoop/2.7.2/sbin/start-all.sh

Works fine.

Next I installed Hivemall. It was easy: I just put two files into /tmp.

The files are available from https://github.com/myui/hivemall/releases.

I got the following files.

hivemall-core-0.4.2-rc.2-with-dependencies.jar

define-all.hive

If your Hive version is newer, you need to comment out the sha1 function lines in define-all.hive.

~part of define-all.hive~
-----------------------
-- hashing functions --
-----------------------
drop temporary function mhash;
create temporary function mhash as 'hivemall.ftvec.hashing.MurmurHash3UDF';
-- the following two lines cause an error
--drop temporary function sha1;
--create temporary function sha1 as 'hivemall.ftvec.hashing.Sha1UDF';
drop temporary function array_hash_values;
create temporary function array_hash_values as 'hivemall.ftvec.hashing.ArrayHashValuesUDF';
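
The same edit can be made from the command line instead of by hand. This is only a minimal sketch: the function name comment_out_sha1 and the small sample file are my own for illustration, and the sed pattern assumes the two sha1 lines look exactly like the excerpt above.

```shell
# Prefix the two sha1 UDF lines with "--" so that Hive skips them.
# Reads the file given as $1 and writes the patched text to stdout.
comment_out_sha1() {
  sed '/temporary function sha1[ ;]/s/^/--/' "$1"
}

# Demo on a throwaway excerpt (not the real define-all.hive).
sample=$(mktemp)
cat > "$sample" <<'EOF'
drop temporary function mhash;
drop temporary function sha1;
create temporary function sha1 as 'hivemall.ftvec.hashing.Sha1UDF';
drop temporary function array_hash_values;
EOF
patched=$(comment_out_sha1 "$sample")
printf '%s\n' "$patched"
rm -f "$sample"
```

For the real file, something like comment_out_sha1 /tmp/define-all.hive > /tmp/define-all.patched.hive should work, after which the SOURCE line has to point at the patched copy.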

Now it's ready.

I wrote a sample SQL script for iris dataset classification.

The following code is almost the same as the GitHub example.

test.sql

-- install hivemall ( ADD jar & SOURCE )
ADD jar /tmp/hivemall-core-0.4.2-rc.2-with-dependencies.jar;
SOURCE /tmp/define-all.hive;
CREATE TABLE iris_raw(
F1 float,
F2 float,
F3 float,
F4 float,
CLASS string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/Users/iwatobipen/iris.data' INTO TABLE iris_raw;
CREATE TABLE iris_dataset
AS
SELECT
CLASS,
ARRAY( concat('1:',F1), concat('2:',F2), concat('3:', F3), concat('4:', F4) ) AS FEATURES
FROM iris_raw;
CREATE TABLE label_mapping
AS
SELECT
CLASS,
RANK -1 AS LABEL
FROM (
SELECT
distinct CLASS,
dense_rank() over (order by CLASS) AS RANK
FROM
iris_raw
) t
;
--SELECT * FROM label_mapping;
CREATE TABLE training
AS
SELECT
rowid() as rowid,
array( t1.F1, t1.F2, t1.F3, t1.F4 ) AS FEATURES,
t2.LABEL
FROM
iris_raw t1
JOIN label_mapping t2 ON ( t1.class = t2.class )
;
CREATE TABLE model
STORED AS SEQUENCEFILE
AS
SELECT train_randomforest_classifier( features, label )
FROM training;
desc model;
set hivevar:classification = true;
set hive.auto.convert.join = true;
set hive.mapjoin.optimized.hashtable = false;
CREATE TABLE predict_vm
AS
SELECT
rowid,
rf_ensemble( predicted ) as predicted
FROM(
SELECT
rowid,
tree_predict( p.model_id, p.model_type, p.pred_model, t.FEATURES, ${classification} ) AS predicted
FROM
model p
LEFT OUTER JOIN training t
) t1
GROUP BY
rowid;
SELECT t.ROWID,pv.ROWID, t.LABEL, pv.predicted
FROM training t
LEFT OUTER JOIN predict_vm pv ON (t.ROWID = pv.ROWID);

Hive is useful because the user can handle Hadoop like an RDB.

Hivemall's random forest can handle arrays as features, so as a next step I'll make fingerprint arrays and predict SAR or ADMET properties.

And prediction is done with a join. Hmm, it's all just like SQL.

Finally, run the script.
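
To make the join-then-group step more concrete: the inner query produces one prediction per (tree, row) pair, and the GROUP BY rowid combines them into one answer per row. The sketch below imitates that combining step with a plain majority vote in awk. It is only my illustration of the idea, not Hivemall's rf_ensemble implementation; the function name majority_vote and the input format ("rowid label" per line) are assumptions, and ties are broken arbitrarily.

```shell
# Input on stdin: one "rowid label" pair per line, one line per tree vote.
# Output: "rowid majority_label fraction" per rowid, sorted by rowid.
majority_vote() {
  awk '
    { votes[$1 " " $2]++; total[$1]++ }
    END {
      # Find the label with the most votes for each rowid.
      for (key in votes) {
        split(key, parts, " ")
        id = parts[1]; label = parts[2]
        if (votes[key] > best[id]) { best[id] = votes[key]; winner[id] = label }
      }
      for (id in winner)
        printf "%s %s %.2f\n", id, winner[id], best[id] / total[id]
    }
  ' | sort
}

# Demo: three trees vote on row 1-1, two trees vote on row 1-2.
voted=$(majority_vote <<'EOF'
1-1 0
1-1 0
1-1 1
1-2 2
1-2 2
EOF
)
printf '%s\n' "$voted"
```

The fraction column plays the same role as the "probability" value in the prediction output: the share of trees that agreed with the winning label.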

iwatobipen$ hive -f test.sql > loghive.txt
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/Cellar/hive/2.1.0/libexec/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/Cellar/hadoop/2.7.2/libexec/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/usr/local/Cellar/hive/2.1.0/libexec/lib/hive-common-2.1.0.jar!/hive-log4j2.properties Async: true
Added [/tmp/hivemall-core-0.4.2-rc.2-with-dependencies.jar] to class path
Added resources: [/tmp/hivemall-core-0.4.2-rc.2-with-dependencies.jar]
OK
Time taken: 1.349 seconds
OK
Time taken: 0.011 seconds
.............
..............
2016-09-06 22:35:46 End of local task; Time Taken: 0.79 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2016-09-06 22:35:49,634 Stage-3 map = 100%, reduce = 0%
Ended Job = job_local1070174860_0008
MapReduce Jobs Launched:
Stage-Stage-3: HDFS Read: 33486 HDFS Write: 29891 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 10.06 seconds, Fetched: 150 row(s)

The log file (loghive.txt) is shown at the end of this post.

Of course the result showed good accuracy (147 of the 150 rows in the log are predicted correctly) because I used the training dataset for testing. 😉

Hivemall also has lots of other functions for machine learning.

Next, I'll try to use the library for chemoinformatics.
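
The accuracy can be counted from the log with a few lines of awk rather than by eye. This is a rough sketch: the function name accuracy is my own, and it assumes each prediction row has the true label in the 3rd column and a JSON object containing "label":N after it, as in the log of this post. Lines without a "label" field (such as the desc model output) are skipped.

```shell
# Compare the true label (3rd column) with the "label" value in the JSON
# and print "correct/total = fraction".
accuracy() {
  awk '
    match($0, /"label":[0-9]+/) {
      # Skip the 8 characters of "label": to get the predicted label.
      predicted = substr($0, RSTART + 8, RLENGTH - 8)
      total++
      if ($3 == predicted) correct++
    }
    END { if (total > 0) printf "%d/%d = %.4f\n", correct, total, correct / total }
  ' "$@"
}

# Demo on four rows copied from the log plus one non-prediction line.
result=$(accuracy <<'EOF'
model_type int
1-70 1-70 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-71 1-71 1 {"label":2,"probability":0.56,"probabilities":[0.0,0.44,0.56]}
1-72 1-72 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-101 1-101 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
EOF
)
printf '%s\n' "$result"
```

To check the whole run, call it on the log file: accuracy loghive.txt.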

loghive.txt

model_id string
model_type int
pred_model string
var_importance array<double>
oob_errors int
oob_tests int
1-1 1-1 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-2 1-2 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-3 1-3 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-4 1-4 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-5 1-5 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-6 1-6 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-7 1-7 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-8 1-8 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-9 1-9 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-10 1-10 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-11 1-11 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-12 1-12 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-13 1-13 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-14 1-14 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-15 1-15 0 {"label":0,"probability":0.96,"probabilities":[0.96,0.04]}
1-16 1-16 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-17 1-17 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-18 1-18 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-19 1-19 0 {"label":0,"probability":0.98,"probabilities":[0.98,0.02]}
1-20 1-20 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-21 1-21 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-22 1-22 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-23 1-23 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-24 1-24 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-25 1-25 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-26 1-26 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-27 1-27 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-28 1-28 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-29 1-29 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-30 1-30 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-31 1-31 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-32 1-32 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-33 1-33 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-34 1-34 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-35 1-35 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-36 1-36 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-37 1-37 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-38 1-38 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-39 1-39 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-40 1-40 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-41 1-41 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-42 1-42 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-43 1-43 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-44 1-44 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-45 1-45 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-46 1-46 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-47 1-47 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-48 1-48 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-49 1-49 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-50 1-50 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-51 1-51 1 {"label":1,"probability":0.98,"probabilities":[0.0,0.98,0.02]}
1-52 1-52 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-53 1-53 1 {"label":1,"probability":0.9,"probabilities":[0.0,0.9,0.1]}
1-54 1-54 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-55 1-55 1 {"label":1,"probability":0.98,"probabilities":[0.0,0.98,0.02]}
1-56 1-56 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-57 1-57 1 {"label":1,"probability":0.98,"probabilities":[0.0,0.98,0.02]}
1-58 1-58 1 {"label":1,"probability":0.94,"probabilities":[0.0,0.94,0.06]}
1-59 1-59 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-60 1-60 1 {"label":1,"probability":0.98,"probabilities":[0.0,0.98,0.02]}
1-61 1-61 1 {"label":1,"probability":0.98,"probabilities":[0.0,0.98,0.02]}
1-62 1-62 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-63 1-63 1 {"label":1,"probability":0.98,"probabilities":[0.0,0.98,0.02]}
1-64 1-64 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-65 1-65 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-66 1-66 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-67 1-67 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-68 1-68 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-69 1-69 1 {"label":1,"probability":0.96,"probabilities":[0.0,0.96,0.04]}
1-70 1-70 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-71 1-71 1 {"label":2,"probability":0.56,"probabilities":[0.0,0.44,0.56]}
1-72 1-72 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-73 1-73 1 {"label":1,"probability":0.76,"probabilities":[0.0,0.76,0.24]}
1-74 1-74 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-75 1-75 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-76 1-76 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-77 1-77 1 {"label":1,"probability":0.88,"probabilities":[0.0,0.88,0.12]}
1-78 1-78 1 {"label":1,"probability":0.66,"probabilities":[0.0,0.66,0.34]}
1-79 1-79 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-80 1-80 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-81 1-81 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-82 1-82 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-83 1-83 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-84 1-84 1 {"label":1,"probability":0.64,"probabilities":[0.0,0.64,0.36]}
1-85 1-85 1 {"label":1,"probability":0.98,"probabilities":[0.0,0.98,0.02]}
1-86 1-86 1 {"label":1,"probability":0.96,"probabilities":[0.02,0.96,0.02]}
1-87 1-87 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-88 1-88 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-89 1-89 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-90 1-90 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-91 1-91 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-92 1-92 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-93 1-93 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-94 1-94 1 {"label":1,"probability":0.98,"probabilities":[0.0,0.98,0.02]}
1-95 1-95 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-96 1-96 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-97 1-97 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-98 1-98 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-99 1-99 1 {"label":1,"probability":0.96,"probabilities":[0.0,0.96,0.04]}
1-100 1-100 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-101 1-101 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-102 1-102 2 {"label":2,"probability":0.96,"probabilities":[0.0,0.04,0.96]}
1-103 1-103 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-104 1-104 2 {"label":2,"probability":0.98,"probabilities":[0.0,0.02,0.98]}
1-105 1-105 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-106 1-106 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-107 1-107 2 {"label":1,"probability":0.58,"probabilities":[0.0,0.58,0.42]}
1-108 1-108 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-109 1-109 2 {"label":2,"probability":0.98,"probabilities":[0.0,0.02,0.98]}
1-110 1-110 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-111 1-111 2 {"label":2,"probability":0.98,"probabilities":[0.0,0.02,0.98]}
1-112 1-112 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-113 1-113 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-114 1-114 2 {"label":2,"probability":0.92,"probabilities":[0.0,0.08,0.92]}
1-115 1-115 2 {"label":2,"probability":0.98,"probabilities":[0.0,0.02,0.98]}
1-116 1-116 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-117 1-117 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-118 1-118 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-119 1-119 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-120 1-120 2 {"label":1,"probability":0.52,"probabilities":[0.0,0.52,0.48]}
1-121 1-121 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-122 1-122 2 {"label":2,"probability":0.8,"probabilities":[0.0,0.2,0.8]}
1-123 1-123 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-124 1-124 2 {"label":2,"probability":0.88,"probabilities":[0.0,0.12,0.88]}
1-125 1-125 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-126 1-126 2 {"label":2,"probability":0.96,"probabilities":[0.0,0.04,0.96]}
1-127 1-127 2 {"label":2,"probability":0.86,"probabilities":[0.0,0.14,0.86]}
1-128 1-128 2 {"label":2,"probability":0.88,"probabilities":[0.0,0.12,0.88]}
1-129 1-129 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-130 1-130 2 {"label":2,"probability":0.66,"probabilities":[0.0,0.34,0.66]}
1-131 1-131 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-132 1-132 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-133 1-133 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-134 1-134 2 {"label":2,"probability":0.56,"probabilities":[0.0,0.44,0.56]}
1-135 1-135 2 {"label":2,"probability":0.68,"probabilities":[0.0,0.32,0.68]}
1-136 1-136 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-137 1-137 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-138 1-138 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-139 1-139 2 {"label":2,"probability":0.84,"probabilities":[0.0,0.16,0.84]}
1-140 1-140 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-141 1-141 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-142 1-142 2 {"label":2,"probability":0.98,"probabilities":[0.0,0.02,0.98]}
1-143 1-143 2 {"label":2,"probability":0.96,"probabilities":[0.0,0.04,0.96]}
1-144 1-144 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-145 1-145 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-146 1-146 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-147 1-147 2 {"label":2,"probability":0.94,"probabilities":[0.0,0.06,0.94]}
1-148 1-148 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-149 1-149 2 {"label":2,"probability":0.98,"probabilities":[0.0,0.02,0.98]}
1-150 1-150 2 {"label":2,"probability":0.94,"probabilities":[0.0,0.06,0.94]}