RandomForest Classification on Hadoop.

I’m late to the party, but I’ve become interested in Hadoop.
I had felt that Hadoop was too difficult for me to handle (I’m not good at data science...), but a few days ago I found a very attractive library named ‘Hivemall’.
The following description is taken from its GitHub page.

Hivemall is a scalable machine learning library that runs on Apache Hive. Hivemall is designed to be scalable to the number of training instances as well as the number of training features.
https://github.com/myui/hivemall

This means that I can store data, build a model, and predict, all on Hadoop. Hmm, that sounds nice. Let’s DIY!
My environment is a Mac, so I installed Hadoop and Hive using Homebrew.
It was very simple. Just type the following commands.

iwatobipen$ brew install hadoop
iwatobipen$ brew install hive

After installation, I set up some configuration files, formatted the file system, and started Hadoop.

iwatobipen$ hdfs namenode -format
iwatobipen$ /usr/local/Cellar/hadoop/2.7.2/sbin/start-all.sh 

Works fine.
Next I installed Hivemall. It was easy: I just put two files into /tmp.
The files are available from https://github.com/myui/hivemall/releases.
I got the following files.
hivemall-core-0.4.2-rc.2-with-dependencies.jar
define-all.hive

If your Hive version is newer, you need to comment out the sha1 function lines in define-all.hive.

~part of define-all.hive~
-----------------------
-- hashing functions --
-----------------------

drop temporary function mhash;
create temporary function mhash as 'hivemall.ftvec.hashing.MurmurHash3UDF';

-- the following lines cause an error on newer Hive
--drop temporary function sha1;
--create temporary function sha1 as 'hivemall.ftvec.hashing.Sha1UDF';

drop temporary function array_hash_values;
create temporary function array_hash_values as 'hivemall.ftvec.hashing.ArrayHashValuesUDF';
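
If you would rather not edit the file by hand, the same change could be scripted. The following is only a sketch in Python: the function name comment_out_sha1 is my own, and it assumes the two sha1 lines look exactly like the ones in the excerpt above.

```python
import re

# Matches the uncommented "drop/create temporary function sha1" lines.
SHA1_LINE = re.compile(r"^\s*(drop|create)\s+temporary\s+function\s+sha1\b", re.IGNORECASE)

def comment_out_sha1(text):
    """Prefix the sha1 drop/create lines with '--' and leave every other line alone."""
    out = []
    for line in text.splitlines():
        out.append("--" + line if SHA1_LINE.match(line) else line)
    return "\n".join(out)

snippet = """drop temporary function mhash;
create temporary function mhash as 'hivemall.ftvec.hashing.MurmurHash3UDF';
drop temporary function sha1;
create temporary function sha1 as 'hivemall.ftvec.hashing.Sha1UDF';"""

print(comment_out_sha1(snippet))
```

Lines that are already commented out do not match the pattern, so running it twice should not add a second '--'.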

Now everything is ready.

I wrote a sample SQL script for classification of the iris dataset.
The following code is almost the same as the GitHub example.

query.sql

-- install hivemall ( ADD jar & SOURCE ) 
ADD jar /tmp/hivemall-core-0.4.2-rc.2-with-dependencies.jar;
SOURCE /tmp/define-all.hive;

CREATE TABLE iris_raw(
              F1 float,
              F2 float,
              F3 float,
              F4 float,
              CLASS string
              )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/Users/iwatobipen/iris.data' INTO TABLE iris_raw;

CREATE TABLE iris_dataset
AS
SELECT
  CLASS,
  ARRAY( concat('1:',F1), concat('2:',F2), concat('3:', F3), concat('4:', F4) ) AS FEATURES
FROM iris_raw;

CREATE TABLE label_mapping
     AS
     SELECT
       CLASS,
       RANK -1 AS LABEL
     FROM (
       SELECT
         distinct CLASS,
         dense_rank() over (order by CLASS) AS RANK
     FROM
       iris_raw
     ) t
     ;

--SELECT * FROM label_mapping;

CREATE TABLE training
AS
SELECT
  rowid() as rowid,
  array( t1.F1, t1.F2, t1.F3, t1.F4 ) AS FEATURES,
  t2.LABEL
FROM
  iris_raw t1
  JOIN label_mapping t2 ON ( t1.class = t2.class )
;

CREATE TABLE model
STORED AS SEQUENCEFILE
AS
SELECT train_randomforest_classifier( features, label )
FROM training;

desc model;

set hivevar:classification = true;
set hive.auto.convert.join = true;
set hive.mapjoin.optimized.hashtable = false;

CREATE TABLE predict_vm
AS
SELECT
  rowid,
  rf_ensemble( predicted ) as predicted
FROM(
  SELECT
    rowid,
    tree_predict( p.model_id, p.model_type, p.pred_model, t.FEATURES, ${classification} ) AS predicted
  FROM
    model p
    LEFT OUTER JOIN training t
) t1
GROUP BY
  rowid;

SELECT t.ROWID,pv.ROWID, t.LABEL, pv.predicted
FROM training t
LEFT OUTER JOIN predict_vm pv ON (t.ROWID = pv.ROWID);
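
To make the label_mapping step in the script concrete: dense_rank() over the sorted class names, minus one, gives each class a 0-based integer label. Here is a rough Python equivalent. The helper name make_label_mapping is mine, and the class names are the ones I assume are in the UCI iris.data file (they are not shown in this post).

```python
def make_label_mapping(classes):
    """Mimic `dense_rank() over (order by CLASS) - 1`: sorted distinct classes -> 0, 1, 2, ..."""
    return {name: rank for rank, name in enumerate(sorted(set(classes)))}

# Class names assumed from the UCI iris.data file; duplicates are collapsed like DISTINCT.
rows = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]
print(make_label_mapping(rows))
# {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
```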

Hive is useful because it lets the user handle Hadoop like an RDB.

Hivemall’s random forest accepts an array as features, so as a next step I’ll build fingerprint arrays and predict SAR or ADMET properties.
Prediction is done with a join. Hmm, it all looks like SQL.
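
To spell out what prediction by join means here: the join without an ON clause pairs every tree in the model table with every row in the training table, tree_predict gives one vote per pair, and GROUP BY rowid with rf_ensemble combines the votes for each row. The Python sketch below imitates only that shape. The toy trees are made-up functions, the feature values are illustrative, and treating rf_ensemble as a plain majority vote is my assumption, not something I checked against Hivemall's source.

```python
from collections import Counter
from itertools import product

def ensemble(votes):
    """Majority vote over per-tree labels; returns (label, fraction of trees agreeing)."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)

def predict_all(trees, rows):
    """Cross join trees x rows, collect one vote per pair, then group the votes by rowid."""
    votes_by_row = {}
    for tree, (rowid, features) in product(trees, rows):
        votes_by_row.setdefault(rowid, []).append(tree(features))
    return {rowid: ensemble(votes) for rowid, votes in votes_by_row.items()}

# Hypothetical stand-ins for trained trees: each maps a feature list to a label.
trees = [
    lambda f: 0 if f[2] < 2.5 else 1,
    lambda f: 0 if f[3] < 0.8 else 1,
    lambda f: 0 if f[0] < 5.0 else 1,
]
rows = [("1-1", [5.1, 3.5, 1.4, 0.2]), ("1-51", [7.0, 3.2, 4.7, 1.4])]
print(predict_all(trees, rows))
```

In this toy run the first row gets two votes for label 0 and one for label 1, and the second row gets three votes for label 1.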

Finally, I ran the script.

iwatobipen$ hive -f query.sql > loghive.txt
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/Cellar/hive/2.1.0/libexec/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/Cellar/hadoop/2.7.2/libexec/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in jar:file:/usr/local/Cellar/hive/2.1.0/libexec/lib/hive-common-2.1.0.jar!/hive-log4j2.properties Async: true
Added [/tmp/hivemall-core-0.4.2-rc.2-with-dependencies.jar] to class path
Added resources: [/tmp/hivemall-core-0.4.2-rc.2-with-dependencies.jar]
OK
Time taken: 1.349 seconds
OK
Time taken: 0.011 seconds

.............
..............
2016-09-06 22:35:46	End of local task; Time Taken: 0.79 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2016-09-06 22:35:49,634 Stage-3 map = 100%,  reduce = 0%
Ended Job = job_local1070174860_0008
MapReduce Jobs Launched: 
Stage-Stage-3:  HDFS Read: 33486 HDFS Write: 29891 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 10.06 seconds, Fetched: 150 row(s)

The contents of the log file are shown below.
Of course, the result shows good accuracy because I used the training dataset as the test set. 😉
Hivemall also has lots of other functions for machine learning.
Next, I’ll try to use the library for chemoinformatics.

model_id            	string              	                    
model_type          	int                 	                    
pred_model          	string              	                    
var_importance      	array<double>       	                    
oob_errors          	int                 	                    
oob_tests           	int                 	                    
1-1	1-1	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-2	1-2	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-3	1-3	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-4	1-4	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-5	1-5	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-6	1-6	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-7	1-7	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-8	1-8	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-9	1-9	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-10	1-10	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-11	1-11	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-12	1-12	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-13	1-13	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-14	1-14	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-15	1-15	0	{"label":0,"probability":0.96,"probabilities":[0.96,0.04]}
1-16	1-16	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-17	1-17	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-18	1-18	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-19	1-19	0	{"label":0,"probability":0.98,"probabilities":[0.98,0.02]}
1-20	1-20	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-21	1-21	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-22	1-22	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-23	1-23	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-24	1-24	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-25	1-25	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-26	1-26	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-27	1-27	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-28	1-28	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-29	1-29	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-30	1-30	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-31	1-31	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-32	1-32	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-33	1-33	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-34	1-34	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-35	1-35	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-36	1-36	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-37	1-37	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-38	1-38	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-39	1-39	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-40	1-40	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-41	1-41	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-42	1-42	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-43	1-43	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-44	1-44	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-45	1-45	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-46	1-46	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-47	1-47	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-48	1-48	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-49	1-49	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-50	1-50	0	{"label":0,"probability":1.0,"probabilities":[1.0,0.0]}
1-51	1-51	1	{"label":1,"probability":0.98,"probabilities":[0.0,0.98,0.02]}
1-52	1-52	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-53	1-53	1	{"label":1,"probability":0.9,"probabilities":[0.0,0.9,0.1]}
1-54	1-54	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-55	1-55	1	{"label":1,"probability":0.98,"probabilities":[0.0,0.98,0.02]}
1-56	1-56	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-57	1-57	1	{"label":1,"probability":0.98,"probabilities":[0.0,0.98,0.02]}
1-58	1-58	1	{"label":1,"probability":0.94,"probabilities":[0.0,0.94,0.06]}
1-59	1-59	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-60	1-60	1	{"label":1,"probability":0.98,"probabilities":[0.0,0.98,0.02]}
1-61	1-61	1	{"label":1,"probability":0.98,"probabilities":[0.0,0.98,0.02]}
1-62	1-62	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-63	1-63	1	{"label":1,"probability":0.98,"probabilities":[0.0,0.98,0.02]}
1-64	1-64	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-65	1-65	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-66	1-66	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-67	1-67	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-68	1-68	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-69	1-69	1	{"label":1,"probability":0.96,"probabilities":[0.0,0.96,0.04]}
1-70	1-70	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-71	1-71	1	{"label":2,"probability":0.56,"probabilities":[0.0,0.44,0.56]}
1-72	1-72	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-73	1-73	1	{"label":1,"probability":0.76,"probabilities":[0.0,0.76,0.24]}
1-74	1-74	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-75	1-75	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-76	1-76	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-77	1-77	1	{"label":1,"probability":0.88,"probabilities":[0.0,0.88,0.12]}
1-78	1-78	1	{"label":1,"probability":0.66,"probabilities":[0.0,0.66,0.34]}
1-79	1-79	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-80	1-80	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-81	1-81	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-82	1-82	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-83	1-83	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-84	1-84	1	{"label":1,"probability":0.64,"probabilities":[0.0,0.64,0.36]}
1-85	1-85	1	{"label":1,"probability":0.98,"probabilities":[0.0,0.98,0.02]}
1-86	1-86	1	{"label":1,"probability":0.96,"probabilities":[0.02,0.96,0.02]}
1-87	1-87	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-88	1-88	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-89	1-89	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-90	1-90	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-91	1-91	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-92	1-92	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-93	1-93	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-94	1-94	1	{"label":1,"probability":0.98,"probabilities":[0.0,0.98,0.02]}
1-95	1-95	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-96	1-96	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-97	1-97	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-98	1-98	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-99	1-99	1	{"label":1,"probability":0.96,"probabilities":[0.0,0.96,0.04]}
1-100	1-100	1	{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}
1-101	1-101	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-102	1-102	2	{"label":2,"probability":0.96,"probabilities":[0.0,0.04,0.96]}
1-103	1-103	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-104	1-104	2	{"label":2,"probability":0.98,"probabilities":[0.0,0.02,0.98]}
1-105	1-105	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-106	1-106	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-107	1-107	2	{"label":1,"probability":0.58,"probabilities":[0.0,0.58,0.42]}
1-108	1-108	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-109	1-109	2	{"label":2,"probability":0.98,"probabilities":[0.0,0.02,0.98]}
1-110	1-110	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-111	1-111	2	{"label":2,"probability":0.98,"probabilities":[0.0,0.02,0.98]}
1-112	1-112	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-113	1-113	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-114	1-114	2	{"label":2,"probability":0.92,"probabilities":[0.0,0.08,0.92]}
1-115	1-115	2	{"label":2,"probability":0.98,"probabilities":[0.0,0.02,0.98]}
1-116	1-116	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-117	1-117	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-118	1-118	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-119	1-119	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-120	1-120	2	{"label":1,"probability":0.52,"probabilities":[0.0,0.52,0.48]}
1-121	1-121	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-122	1-122	2	{"label":2,"probability":0.8,"probabilities":[0.0,0.2,0.8]}
1-123	1-123	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-124	1-124	2	{"label":2,"probability":0.88,"probabilities":[0.0,0.12,0.88]}
1-125	1-125	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-126	1-126	2	{"label":2,"probability":0.96,"probabilities":[0.0,0.04,0.96]}
1-127	1-127	2	{"label":2,"probability":0.86,"probabilities":[0.0,0.14,0.86]}
1-128	1-128	2	{"label":2,"probability":0.88,"probabilities":[0.0,0.12,0.88]}
1-129	1-129	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-130	1-130	2	{"label":2,"probability":0.66,"probabilities":[0.0,0.34,0.66]}
1-131	1-131	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-132	1-132	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-133	1-133	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-134	1-134	2	{"label":2,"probability":0.56,"probabilities":[0.0,0.44,0.56]}
1-135	1-135	2	{"label":2,"probability":0.68,"probabilities":[0.0,0.32,0.68]}
1-136	1-136	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-137	1-137	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-138	1-138	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-139	1-139	2	{"label":2,"probability":0.84,"probabilities":[0.0,0.16,0.84]}
1-140	1-140	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-141	1-141	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-142	1-142	2	{"label":2,"probability":0.98,"probabilities":[0.0,0.02,0.98]}
1-143	1-143	2	{"label":2,"probability":0.96,"probabilities":[0.0,0.04,0.96]}
1-144	1-144	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-145	1-145	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-146	1-146	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-147	1-147	2	{"label":2,"probability":0.94,"probabilities":[0.0,0.06,0.94]}
1-148	1-148	2	{"label":2,"probability":1.0,"probabilities":[0.0,0.0,1.0]}
1-149	1-149	2	{"label":2,"probability":0.98,"probabilities":[0.0,0.02,0.98]}
1-150	1-150	2	{"label":2,"probability":0.94,"probabilities":[0.0,0.06,0.94]}
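
To put a number on the accuracy, the rows above can be parsed and the true label compared with the predicted one. The sketch below is Python and assumes each prediction row is tab-separated as rowid, rowid, true label, and the rf_ensemble JSON. The function name accuracy_from_log is my own. By my reading of the rows above, three of them (1-71, 1-107 and 1-120) have a predicted label that differs from the true one.

```python
import json

def accuracy_from_log(lines):
    """Return (n_correct, n_total) over rows shaped like
    rowid <TAB> rowid <TAB> true_label <TAB> {JSON with a "label" key}."""
    correct = total = 0
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 4 or not parts[3].startswith("{"):
            continue  # skip the `desc model` rows and anything else
        true_label = int(parts[2])
        predicted = json.loads(parts[3])["label"]
        total += 1
        correct += int(predicted == true_label)
    return correct, total

# Two prediction rows copied from the log above, plus one `desc model` row to be skipped.
sample = [
    '1-70\t1-70\t1\t{"label":1,"probability":1.0,"probabilities":[0.0,1.0]}',
    '1-71\t1-71\t1\t{"label":2,"probability":0.56,"probabilities":[0.0,0.44,0.56]}',
    'model_type\tint\t',
]
print(accuracy_from_log(sample))
```

To run it on the real file, something like `accuracy_from_log(open("loghive.txt"))` should work, as long as the log really uses tabs.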
