Wednesday, August 7, 2013
Mahout classification in practice (logistic regression)
Starting April 1, 2008, I began collecting data in order to learn classification algorithms. So far I have collected more than 130,000 records and have run a practice experiment on them.
prepare training data
http://dl.vmall.com/c0t2stwgs6
prepare the test sample data
http://dl.vmall.com/c0bqyvvsvz
start work:
create model
./mahout trainlogistic --input ~/sniffer.csv --output ~/sniffer_model --target CABINSTATUS --categories 2 --predictors SUBMITTOFLIGHT CABINVALUE OPENSIZEMAX BOOKSIZEMAX SUPRISEDATEVALUE --types numeric word numeric numeric word --features 1000 --passes 100 --rate 100
results are as follows:
CABINSTATUS ~ -3.903 * BOOKSIZEMAX + -21.278 * CABINVALUE = B + -525.060 * CABINVALUE = C + -142.396 * CABINVALUE = E + 510.136 * CABINVALUE = F + 1.575 * CABINVALUE = H + 0.793 * CABINVALUE = K + -3.148 * CABINVALUE = L + 0.063 * CABINVALUE = M + -22.743 * CABINVALUE = Q + -359.977 * CABINVALUE = R + 57.664 * CABINVALUE = U + -13.304 * CABINVALUE = X + 521.823 * CABINVALUE = Y + -0.124 * Intercept Term + 11.577 * OPENSIZEMAX + -1.658 * SUBMITTOFLIGHT + 1.889 * SUPRISEDATEVALUE = NORMAL + -4.491 * SUPRISEDATEVALUE = QINGMING + -2.710 * SUPRISEDATEVALUE = WUYIJIE
BOOKSIZEMAX -3.90338
CABINVALUE = B -21.27764
CABINVALUE = C -525.05977
CABINVALUE = E -142.39577
CABINVALUE = F 510.13636
CABINVALUE = H 1.57494
CABINVALUE = K 0.79265
CABINVALUE = L -3.14779
CABINVALUE = M 0.06252
CABINVALUE = Q -22.74316
CABINVALUE = R -359.97720
CABINVALUE = U 57.66431
CABINVALUE = X -13.30351
CABINVALUE = Y 521.82344
Intercept Term -0.12375
OPENSIZEMAX 11.57650
SUBMITTOFLIGHT -1.65824
SUPRISEDATEVALUE = NORMAL 1.88903
SUPRISEDATEVALUE = QINGMING -4.49111
SUPRISEDATEVALUE = WUYIJIE -2.71033
13/04/06 09:54:29 INFO driver.MahoutDriver: Program took 203150 ms (Minutes: 3.3858333333333333)
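To make the printed formula easier to read, here is a minimal Python sketch of how a logistic model turns such coefficients into a score: it sums the intercept, each numeric predictor times its weight, and the weight of the matching category level, then applies the logistic function. The coefficient values are copied from the output above; the input row is hypothetical, and so is any assumption about how the numeric fields are scaled. As far as I understand, Mahout hashes features into the 1000-dimensional vector set by the features option, so the printed weights are a summary and this sketch may not reproduce the saved model's scores exactly. It also does not say which of the two CABINSTATUS categories the score refers to.

```python
import math

# Coefficients copied from the trainlogistic output above (subset of levels).
COEFFS = {
    "Intercept Term": -0.12375,
    "BOOKSIZEMAX": -3.90338,
    "OPENSIZEMAX": 11.57650,
    "SUBMITTOFLIGHT": -1.65824,
    "CABINVALUE=K": 0.79265,
    "CABINVALUE=M": 0.06252,
    "SUPRISEDATEVALUE=NORMAL": 1.88903,
    "SUPRISEDATEVALUE=QINGMING": -4.49111,
}


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(row):
    """Linear combination of the coefficients, passed through the logistic link."""
    z = COEFFS["Intercept Term"]
    # Numeric predictors contribute weight * value.
    for name in ("BOOKSIZEMAX", "OPENSIZEMAX", "SUBMITTOFLIGHT"):
        z += COEFFS[name] * row[name]
    # Word predictors contribute the weight of the matching level;
    # levels not listed in COEFFS contribute nothing in this sketch.
    for name in ("CABINVALUE", "SUPRISEDATEVALUE"):
        z += COEFFS.get(f"{name}={row[name]}", 0.0)
    return sigmoid(z)


# Hypothetical row: these input values are made up for illustration only.
example = {
    "BOOKSIZEMAX": 0.5,
    "OPENSIZEMAX": 0.2,
    "SUBMITTOFLIGHT": 1.0,
    "CABINVALUE": "K",
    "SUPRISEDATEVALUE": "NORMAL",
}
p = score(example)
print(f"score = {p:.3f}")
```

Very large weights such as those for CABINVALUE = C or CABINVALUE = Y push the score to almost exactly 0 or 1 no matter what the other fields contain, which may be worth checking against how often those cabin values occur in the training data.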
PS: Note that the first line of the CSV file must be a header row, and every field name used in the command above must match a name in that header exactly. The names are case sensitive.
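A quick way to catch a header mismatch before a multi-minute training run is to compare the header row with the field names from the command. The following is a minimal sketch, not part of Mahout: the function name `missing_fields` and the two-line sample are hypothetical, and the required names are the target and predictor fields from the command above.

```python
import csv
import io

# Target and predictor field names as used in the training command above.
REQUIRED = ["CABINSTATUS", "SUBMITTOFLIGHT", "CABINVALUE",
            "OPENSIZEMAX", "BOOKSIZEMAX", "SUPRISEDATEVALUE"]


def missing_fields(csv_file, required):
    """Return the required names absent from the header row (case-sensitive)."""
    header = next(csv.reader(csv_file), [])
    present = set(header)
    return [name for name in required if name not in present]


# Hypothetical sample; a real file would be opened with open(path, newline="").
sample = io.StringIO(
    "SUBMITTOFLIGHT,CABINVALUE,OPENSIZEMAX,BookSizeMax,SUPRISEDATEVALUE,CABINSTATUS\n"
    "3,K,10,8,NORMAL,1\n"
)
print(missing_fields(sample, REQUIRED))  # ['BOOKSIZEMAX'] because the case differs
```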
validate the model:
./mahout runlogistic --input ~/sniffer_test.csv --model ~/sniffer_model --auc --confusion
Running on hadoop, using /opt/hadoop/hadoop/hadoop-1.0.4/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /opt/hadoop/mahout-distribution-0.7/mahout-examples-0.7-job.jar
Warning: $HADOOP_HOME is deprecated.
AUC = 0.92 [area under the ROC curve: how well the model ranks one class above the other, which is not the same as the accuracy rate]
confusion: [[553.0, 46.0], [134.0, 787.0]]
entropy: [[NaN, NaN], [-37.9, -1.4]]
PS: An AUC of 0.92 is quite satisfactory. Different combinations of predictor variables give different AUC values, so choosing appropriate predictors is still very important.
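The confusion matrix printed above also gives the plain accuracy on the test sample, which is a different number from the AUC. A minimal sketch of the arithmetic, assuming the diagonal cells are the cases where the predicted and actual categories agree (this does not depend on which axis holds the predictions):

```python
# Confusion matrix as printed by runlogistic above.
confusion = [[553.0, 46.0], [134.0, 787.0]]

total = sum(sum(row) for row in confusion)
# Diagonal cells: predicted category equals actual category.
correct = sum(confusion[i][i] for i in range(len(confusion)))
accuracy = correct / total

print(f"{correct:.0f} of {total:.0f} correct, accuracy = {accuracy:.3f}")
# prints "1340 of 1520 correct, accuracy = 0.882"
```

So about 88% of the 1,520 test rows are classified correctly, with 46 + 134 = 180 errors split unevenly between the two off-diagonal cells.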
------ Solution ---------------------------------------- ----
When I was first learning MapReduce, I ran a k-means clustering experiment with Mahout, and it was indeed very easy to use.
But there was one big headache: the Mahout library seemingly only reads SequenceFile input, and I did not know how to make it analyze an ordinary text file.
In the end I gave up and wrote my own MapReduce version of k-means. Could the original poster share some experience on this?