Belatedly, I’ve become interested in Hadoop.

I felt that Hadoop would be difficult for me to handle (I’m not good at data science...), but a few days ago I found a very attractive library named ‘Hivemall’.

The following description is taken from the GitHub page.

Hivemall is a scalable machine learning library that runs on Apache Hive. Hivemall is designed to be scalable to the number of training instances as well as the number of training features.

https://github.com/myui/hivemall

This means that I can store data, build a model, and predict, all on Hadoop. Hmm, that sounds nice. Let’s DIY!

My environment is a Mac, so I installed Hadoop and Hive using Homebrew.

It was very simple. Type the following commands.

iwatobipen$ brew install hadoop
iwatobipen$ brew install hive

After installation, I set up some files and formatted the file system.

Then I started Hadoop.

iwatobipen$ hdfs namenode -format
iwatobipen$ /usr/local/Cellar/hadoop/2.7.2/sbin/start-all.sh

Works fine.

Next I installed Hivemall. It was easy: I just put two files into /tmp.

The files are available from https://github.com/myui/hivemall/releases.

I got the following files.

hivemall-core-0.4.2-rc.2-with-dependencies.jar

define-all.hive

If the Hive version is newer, the lines for the sha1 function in define-all.hive need to be commented out.

~part of define-all.hive~

-----------------------
-- hashing functions --
-----------------------

drop temporary function mhash;
create temporary function mhash as 'hivemall.ftvec.hashing.MurmurHash3UDF';

--following line cause error
--drop temporary function sha1;
--create temporary function sha1 as 'hivemall.ftvec.hashing.Sha1UDF';

drop temporary function array_hash_values;
create temporary function array_hash_values as 'hivemall.ftvec.hashing.ArrayHashValuesUDF';
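The same edit can be scripted instead of done by hand. The following is only a minimal sketch: the helper name comment_out_function is hypothetical, and it works on the text of the file, so reading and writing /tmp/define-all.hive is left to the caller.

```python
import re


def comment_out_function(hive_script: str, func_name: str) -> str:
    """Prefix the DROP/CREATE TEMPORARY FUNCTION lines for func_name with '--'."""
    pattern = re.compile(
        r"^\s*(drop|create)\s+temporary\s+function\s+"
        + re.escape(func_name)
        + r"\b",
        re.IGNORECASE,
    )
    patched_lines = []
    for line in hive_script.splitlines():
        # Lines that are already commented out start with '--' and do not match.
        if pattern.match(line):
            line = "--" + line
        patched_lines.append(line)
    return "\n".join(patched_lines)


sample = """drop temporary function mhash;
create temporary function mhash as 'hivemall.ftvec.hashing.MurmurHash3UDF';
drop temporary function sha1;
create temporary function sha1 as 'hivemall.ftvec.hashing.Sha1UDF';"""

patched = comment_out_function(sample, "sha1")
print(patched)
```

Because lines that are already commented out are skipped, running the helper twice gives the same result as running it once.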

It’s ready.

I wrote a sample SQL script for classifying the iris dataset.

The following code is almost the same as the GitHub example.

test.sql

-- install hivemall ( ADD jar & SOURCE )
ADD jar /tmp/hivemall-core-0.4.2-rc.2-with-dependencies.jar;
SOURCE /tmp/define-all.hive;

-- raw iris data: four numeric columns and the class name
CREATE TABLE iris_raw(
  F1 float, F2 float, F3 float, F4 float, CLASS string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/Users/iwatobipen/iris.data' INTO TABLE iris_raw;

CREATE TABLE iris_dataset AS
SELECT
  CLASS,
  ARRAY( concat('1:',F1), concat('2:',F2), concat('3:', F3), concat('4:', F4) ) AS FEATURES
FROM iris_raw;

-- map each class name to an integer label starting from 0
CREATE TABLE label_mapping AS
SELECT
  CLASS,
  RANK -1 AS LABEL
FROM (
  SELECT distinct CLASS, dense_rank() over (order by CLASS) AS RANK
  FROM iris_raw
) t
;
--SELECT * FROM label_mapping;

CREATE TABLE training AS
SELECT
  rowid() as rowid,
  array( t1.F1, t1.F2, t1.F3, t1.F4 ) AS FEATURES,
  t2.LABEL
FROM iris_raw t1
JOIN label_mapping t2 ON ( t1.class = t2.class )
;

-- train the random forest
CREATE TABLE model STORED AS SEQUENCEFILE AS
SELECT train_randomforest_classifier( features, label )
FROM training;

desc model;

set hivevar:classification = true;
set hive.auto.convert.join = true;
set hive.mapjoin.optimized.hashtable = false;

-- predict each row with every tree, then combine the predictions per row
CREATE TABLE predict_vm AS
SELECT
  rowid,
  rf_ensemble( predicted ) as predicted
FROM (
  SELECT
    rowid,
    tree_predict( p.model_id, p.model_type, p.pred_model, t.FEATURES, ${classification} ) AS predicted
  FROM model p
  LEFT OUTER JOIN training t
) t1
GROUP BY rowid;

-- compare the true label with the prediction
SELECT t.ROWID, pv.ROWID, t.LABEL, pv.predicted
FROM training t
LEFT OUTER JOIN predict_vm pv ON (t.ROWID = pv.ROWID);

Hive is useful because the user can handle Hadoop like an RDB.

Hivemall’s random forest can handle an array as features, so as a next step I’ll make fingerprint arrays and predict SAR or ADMET properties.

Prediction can be done with a join. Hmm, it’s all like SQL.
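To picture what the GROUP BY rowid step does with the per-tree predictions, here is a small sketch of majority voting. It assumes that rf_ensemble combines the trees by counting votes, which is my reading of the output format (label, probability, probabilities) and not Hivemall's actual implementation; the function name ensemble_votes and the n_classes parameter are made up for this example.

```python
from collections import Counter


def ensemble_votes(tree_labels, n_classes):
    """Combine per-tree class labels into one prediction by majority vote."""
    if not tree_labels:
        raise ValueError("at least one tree prediction is required")
    counts = Counter(tree_labels)
    total = len(tree_labels)
    # Pick the most voted label; ties go to the smallest label (an arbitrary choice).
    label = min(counts, key=lambda k: (-counts[k], k))
    probabilities = [counts.get(i, 0) / total for i in range(n_classes)]
    return {
        "label": label,
        "probability": counts[label] / total,
        "probabilities": probabilities,
    }


# 50 hypothetical trees: 49 vote for class 1 and one votes for class 2.
votes = [1] * 49 + [2]
result = ensemble_votes(votes, n_classes=3)
print(result)
```

With these votes the sketch returns label 1 with probability 0.98, where the probability is simply the share of trees that voted for the winning label.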

Finally run the script.

iwatobipen$ hive -f query.sql > loghive.txt
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/Cellar/hive/2.1.0/libexec/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/Cellar/hadoop/2.7.2/libexec/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/usr/local/Cellar/hive/2.1.0/libexec/lib/hive-common-2.1.0.jar!/hive-log4j2.properties Async: true
Added [/tmp/hivemall-core-0.4.2-rc.2-with-dependencies.jar] to class path
Added resources: [/tmp/hivemall-core-0.4.2-rc.2-with-dependencies.jar]
OK
Time taken: 1.349 seconds
OK
Time taken: 0.011 seconds
.............
..............
2016-09-06 22:35:46 End of local task; Time Taken: 0.79 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2016-09-06 22:35:49,634 Stage-3 map = 100%, reduce = 0%
Ended Job = job_local1070174860_0008
MapReduce Jobs Launched:
Stage-Stage-3: HDFS Read: 33486 HDFS Write: 29891 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 10.06 seconds, Fetched: 150 row(s)

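The log file (loghive.txt) is shown at the end of this post.

Of course the result shows good accuracy (147 of the 150 rows in the log are predicted correctly) because I used the training dataset as the test set. ;-)

Hivemall also has lots of other functions for machine learning.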
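For a fairer check, some rows should be held out before training and used only for prediction. The following is just a sketch of such a split on row ids, done outside Hive; the function name holdout_split, the 20% test fraction and the seed are assumptions for illustration, not something taken from Hivemall.

```python
import random


def holdout_split(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and return (train_rows, test_rows)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Row ids in the same style as the Hive output (1-1 ... 1-150).
rowids = [f"1-{i}" for i in range(1, 151)]
train_ids, test_ids = holdout_split(rowids)
print(len(train_ids), len(test_ids))
```

The model would then be trained only on the rows in train_ids, and the accuracy measured only on the rows in test_ids.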

Next, I’ll try to use the library for chemoinformatics.

model_id string
model_type int
pred_model string
var_importance array<double>
oob_errors int
oob_tests int

1-1 1-1 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-2 1-2 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-3 1-3 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-4 1-4 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-5 1-5 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-6 1-6 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-7 1-7 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-8 1-8 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-9 1-9 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-10 1-10 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-11 1-11 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-12 1-12 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-13 1-13 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-14 1-14 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-15 1-15 0 {"label":0,"probability":0.96,"probabilities":[0.96,0.04]}
1-16 1-16 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-17 1-17 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-18 1-18 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-19 1-19 0 {"label":0,"probability":0.98,"probabilities":[0.98,0.02]}
1-20 1-20 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-21 1-21 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-22 1-22 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-23 1-23 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-24 1-24 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-25 1-25 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-26 1-26 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-27 1-27 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-28 1-28 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-29 1-29 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-30 1-30 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-31 1-31 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-32 1-32 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-33 1-33 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-34 1-34 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-35 1-35 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-36 1-36 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-37 1-37 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-38 1-38 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-39 1-39 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-40 1-40 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-41 1-41 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-42 1-42 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-43 1-43 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-44 1-44 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-45 1-45 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-46 1-46 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-47 1-47 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-48 1-48 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-49 1-49 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-50 1-50 0 {"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-51 1-51 1 {"label":1,"probability":0.98,"probabilities":[0.0,0.98,0.02]}
1-52 1-52 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-53 1-53 1 {"label":1,"probability":0.9,"probabilities":[0.0,0.9,0.1]}
1-54 1-54 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-55 1-55 1 {"label":1,"probability":0.98,"probabilities":[0.0,0.98,0.02]}
1-56 1-56 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-57 1-57 1 {"label":1,"probability":0.98,"probabilities":[0.0,0.98,0.02]}
1-58 1-58 1 {"label":1,"probability":0.94,"probabilities":[0.0,0.94,0.06]}
1-59 1-59 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-60 1-60 1 {"label":1,"probability":0.98,"probabilities":[0.0,0.98,0.02]}
1-61 1-61 1 {"label":1,"probability":0.98,"probabilities":[0.0,0.98,0.02]}
1-62 1-62 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-63 1-63 1 {"label":1,"probability":0.98,"probabilities":[0.0,0.98,0.02]}
1-64 1-64 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-65 1-65 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-66 1-66 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-67 1-67 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-68 1-68 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-69 1-69 1 {"label":1,"probability":0.96,"probabilities":[0.0,0.96,0.04]}
1-70 1-70 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-71 1-71 1 {"label":2,"probability":0.56,"probabilities":[0.0,0.44,0.56]}
1-72 1-72 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-73 1-73 1 {"label":1,"probability":0.76,"probabilities":[0.0,0.76,0.24]}
1-74 1-74 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-75 1-75 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-76 1-76 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-77 1-77 1 {"label":1,"probability":0.88,"probabilities":[0.0,0.88,0.12]}
1-78 1-78 1 {"label":1,"probability":0.66,"probabilities":[0.0,0.66,0.34]}
1-79 1-79 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-80 1-80 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-81 1-81 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-82 1-82 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-83 1-83 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-84 1-84 1 {"label":1,"probability":0.64,"probabilities":[0.0,0.64,0.36]}
1-85 1-85 1 {"label":1,"probability":0.98,"probabilities":[0.0,0.98,0.02]}
1-86 1-86 1 {"label":1,"probability":0.96,"probabilities":[0.02,0.96,0.02]}
1-87 1-87 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-88 1-88 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-89 1-89 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-90 1-90 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-91 1-91 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-92 1-92 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-93 1-93 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-94 1-94 1 {"label":1,"probability":0.98,"probabilities":[0.0,0.98,0.02]}
1-95 1-95 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-96 1-96 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-97 1-97 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-98 1-98 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-99 1-99 1 {"label":1,"probability":0.96,"probabilities":[0.0,0.96,0.04]}
1-100 1-100 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-101 1-101 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-102 1-102 2 {"label":2,"probability":0.96,"probabilities":[0.0,0.04,0.96]}
1-103 1-103 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-104 1-104 2 {"label":2,"probability":0.98,"probabilities":[0.0,0.02,0.98]}
1-105 1-105 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-106 1-106 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-107 1-107 2 {"label":1,"probability":0.58,"probabilities":[0.0,0.58,0.42]}
1-108 1-108 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-109 1-109 2 {"label":2,"probability":0.98,"probabilities":[0.0,0.02,0.98]}
1-110 1-110 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-111 1-111 2 {"label":2,"probability":0.98,"probabilities":[0.0,0.02,0.98]}
1-112 1-112 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-113 1-113 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-114 1-114 2 {"label":2,"probability":0.92,"probabilities":[0.0,0.08,0.92]}
1-115 1-115 2 {"label":2,"probability":0.98,"probabilities":[0.0,0.02,0.98]}
1-116 1-116 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-117 1-117 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-118 1-118 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-119 1-119 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-120 1-120 2 {"label":1,"probability":0.52,"probabilities":[0.0,0.52,0.48]}
1-121 1-121 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-122 1-122 2 {"label":2,"probability":0.8,"probabilities":[0.0,0.2,0.8]}
1-123 1-123 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-124 1-124 2 {"label":2,"probability":0.88,"probabilities":[0.0,0.12,0.88]}
1-125 1-125 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-126 1-126 2 {"label":2,"probability":0.96,"probabilities":[0.0,0.04,0.96]}
1-127 1-127 2 {"label":2,"probability":0.86,"probabilities":[0.0,0.14,0.86]}
1-128 1-128 2 {"label":2,"probability":0.88,"probabilities":[0.0,0.12,0.88]}
1-129 1-129 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-130 1-130 2 {"label":2,"probability":0.66,"probabilities":[0.0,0.34,0.66]}
1-131 1-131 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-132 1-132 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-133 1-133 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-134 1-134 2 {"label":2,"probability":0.56,"probabilities":[0.0,0.44,0.56]}
1-135 1-135 2 {"label":2,"probability":0.68,"probabilities":[0.0,0.32,0.68]}
1-136 1-136 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-137 1-137 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-138 1-138 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-139 1-139 2 {"label":2,"probability":0.84,"probabilities":[0.0,0.16,0.84]}
1-140 1-140 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-141 1-141 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-142 1-142 2 {"label":2,"probability":0.98,"probabilities":[0.0,0.02,0.98]}
1-143 1-143 2 {"label":2,"probability":0.96,"probabilities":[0.0,0.04,0.96]}
1-144 1-144 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-145 1-145 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-146 1-146 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-147 1-147 2 {"label":2,"probability":0.94,"probabilities":[0.0,0.06,0.94]}
1-148 1-148 2 {"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-149 1-149 2 {"label":2,"probability":0.98,"probabilities":[0.0,0.02,0.98]}
1-150 1-150 2 {"label":2,"probability":0.94,"probabilities":[0.0,0.06,0.94]}
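The prediction rows in the log can be scored with a few lines of Python. This is a minimal sketch that assumes each row has four whitespace-separated columns (the two row ids, the true label, and the predicted JSON); the function name score_rows is hypothetical, and the three sample rows are copied from the log.

```python
import json


def score_rows(lines):
    """Return (correct, total, misclassified_rowids) for prediction rows."""
    correct = 0
    total = 0
    misclassified = []
    for line in lines:
        parts = line.split(None, 3)
        if len(parts) != 4:
            continue  # e.g. the 'desc model' lines at the top of the log
        rowid, _, label, predicted_json = parts
        try:
            true_label = int(label)
            predicted = json.loads(predicted_json)
        except ValueError:
            continue  # not a prediction row
        total += 1
        if predicted["label"] == true_label:
            correct += 1
        else:
            misclassified.append(rowid)
    return correct, total, misclassified


sample_rows = [
    'model_id string',
    '1-70 1-70 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}',
    '1-71 1-71 1 {"label":2,"probability":0.56,"probabilities":[0.0,0.44,0.56]}',
    '1-72 1-72 1 {"label":1,"probability":1.0,"probabilities":[0.0,1.0]}',
]
correct, total, misclassified = score_rows(sample_rows)
print(correct, total, misclassified)
```

To score the whole run, the lines of loghive.txt can be passed in instead of sample_rows.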