Internship final report @ Treasure Data Inc. (2016 8/1 - 9/30)

ITO Ryuichi

Outline

• Who am I?

• What I did?
  • About Hivemall
  • Benchmark
  • Add several new features

Who am I?

• ITO Ryuichi (@amaya382)

• Graduate School of Information Science and Technology, Osaka University ('16-)

  • Accelerating a graph processing engine: concurrency control, hardware-aware optimization

  • (a little) Natural language processing: a conversation system with context consistency

❤ Scala, C#

What I did?

• About Hivemall
• Benchmark
• Add several new features

About Hivemall

• A scalable machine learning library running on Apache Hive (+ Spark, Pig)
  • Developed by @myui and others as OSS
  • Joined the Apache Incubator 🎉
• Many features can be used via HQL (Hive Query Language, similar to SQL)
  • Classification
    • Perceptron, AdaGradRDA, Soft Confidence Weighted, etc.
  • Recommendation
    • Matrix Factorisation, Factorisation Machine, etc.
  • Utilities
    • Feature engineering, additional array operations, etc.
  • etc.

(Slide annotation pointing at the Hivemall logo: "Cute Logo!")


About Hivemall (cont.)

• How does Hivemall work on Hive?
  • Hivemall is a set of UDFs (User-Defined Functions)
    • UDF: projection, one entry -> one entry
    • UDTF (table-generating): some entries -> some entries
    • UDAF (aggregate): all entries -> one entry
  • Each feature is defined as a UDF implementing the Java interfaces provided by Hive
  • By loading the Hivemall jar file, the extra functions become usable from HQL (see the sketch below)
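A minimal sketch of that registration step in HQL. The jar file name is illustrative and the class name follows Hivemall 0.4.x conventions; in practice the bundled define-all.hive script registers all functions at once:

  ADD JAR hivemall-core-0.4.2-with-dependencies.jar;
  -- expose one Hivemall UDTF under a short name usable from HQL
  CREATE TEMPORARY FUNCTION logress AS 'hivemall.regression.LogressUDTF';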

About Hivemall (cont.)

• Example: training a logistic regression model (a prediction sketch follows the query)

• Only HQL is needed; no programming knowledge is required. (And HQL/Hive is already close to the data!)

CREATE TABLE model AS
SELECT feature, AVG(weight) AS weight
FROM (
  SELECT logress(features, label, ...) AS (feature, weight)
  FROM train_data
) t
GROUP BY feature
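Once the model table exists, prediction is again plain HQL. A hedged sketch of the usual Hivemall pattern; the test_exploded table and its rowid/feature/value columns are illustrative, not from the slides:

  SELECT
    t.rowid,
    sigmoid(SUM(m.weight * t.value)) AS probability  -- dot product of feature weights, squashed to (0, 1)
  FROM test_exploded t
  LEFT OUTER JOIN model m ON (t.feature = m.feature)
  GROUP BY t.rowid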

What I did?

• About Hivemall
• Benchmark
• Add several new features

Benchmark

• Based on benchm-ml (https://github.com/szilard/benchm-ml)
  • Several pre-defined test cases with prepared data sets:
    1. Logistic Regression (tried)
    2. Random Forest (tried)
    3. Boosting
    4. Deep Learning
  • Several hyperparameters per case
  • Already tested with several tools (e.g. R, Python scikit-learn, Spark, etc.)

NOTE: basically a common environment is used, but some cases use different environments. For more details, see the benchm-ml project.

Benchmark (cont.)

• Environment
  • Amazon Web Services
    • EMR (Elastic MapReduce)
    • m3.xlarge * 3 + c3.xlarge * 3
    • Hadoop: Amazon 2.7.2
    • Tez: 0.8.4
    • Hive: 2.1.0
    • Hivemall: 0.4.2-RC2
• Misc.
  • Basically, six-way parallelism is used, matching the number of instances

Benchmark - Logistic Regression

• Using logress() on Hivemall
• Hivemall is relatively slow and gets a lower AUC (NOTE: Hivemall's logress uses SGD)
  • But its scalability can be confirmed
• High initial overhead caused by Hive

Overall: https://github.com/amaya382/benchm-ml/tree/hivemall/1-linear/6-hivemall

(Chart: Time[sec] / AUC[%] per data size; annotations on the chart: ✖, 10x, 12.5x, 1.3x, 4.9x, 3.9x, ±0, +0.4)

Benchmark - Random Forest (1)

• Using train_randomforest_classifier() on Hivemall
• (1) Regulation: 500 trees, three variables
• Hivemall does almost fine up to 0.1M rows, but cannot process 1M
  • The environment and parameters need tuning

Overall: https://github.com/amaya382/benchm-ml/tree/hivemall/2-rf/7-hivemall

(Chart: Time[sec] / AUC[%] per data size; annotation on the chart: "Amazing…")

Benchmark - Random Forest (2)

• Using train_randomforest_classifier() on Hivemall
• (2) Regulation: 100 trees, max depth 20
• Hivemall does fine up to 1M rows, but cannot process 10M
  • The environment and parameters need tuning

Overall: https://github.com/amaya382/benchm-ml/tree/hivemall/z-other-tools/10-hivemall

(Time[sec] / AUC[%])

What I did?

• About Hivemall
• Benchmark
• Add several new features  <- Main topic!

Add several new features

• systemtest module

• Feature binning

• Feature selection

• Some spark integrations

Add new features - systemtest

• What's systemtest?
  • A testing framework for UDFs
  • It can also be applied to other applications built on UDFs
• Tests already exist, so why is it needed?
  • Yes, but the existing tests...
    • Do not actually run on Hive; they only run as plain Java programs
    • Make it difficult to write full-coverage tests
      • e.g. a UDAF has several workflows depending on the kind of function, the data set, and the environment
    • Make it difficult to reuse existing resources
    • Have low extensibility, etc.

Add new features - systemtest

• Example: part of an existing test (a lot is omitted)

final SignalNoiseRatioUDAF snr = new SignalNoiseRatioUDAF();
final ObjectInspector[] OIs = new ObjectInspector[] {
    ObjectInspectorFactory.getStandardListObjectInspector(
        PrimitiveObjectInspectorFactory.writableDoubleObjectInspector),
    ObjectInspectorFactory.getStandardListObjectInspector(
        PrimitiveObjectInspectorFactory.writableIntObjectInspector)};
final SignalNoiseRatioUDAF.SignalNoiseRatioUDAFEvaluator evaluator =
    (SignalNoiseRatioUDAF.SignalNoiseRatioUDAFEvaluator) snr.getEvaluator(
        new SimpleGenericUDAFParameterInfo(OIs, false, false));
evaluator.init(GenericUDAFEvaluator.Mode.PARTIAL1, OIs);
final SignalNoiseRatioUDAF.SignalNoiseRatioUDAFEvaluator.SignalNoiseRatioAggregationBuffer agg =
    (SignalNoiseRatioUDAF.SignalNoiseRatioUDAFEvaluator.SignalNoiseRatioAggregationBuffer)
        evaluator.getNewAggregationBuffer();
evaluator.reset(agg);
...
for (int i = 0; i < features.length; i++) {
    final List<IntWritable> labelList = new ArrayList<IntWritable>();
    for (int label : labels[i]) {
        labelList.add(new IntWritable(label));
    }
    evaluator.iterate(agg,
        new Object[] {WritableUtils.toWritableList(features[i]), labelList});
}
final List<DoubleWritable> resultObj = (List<DoubleWritable>) evaluator.terminate(agg);
...
Assert.assertArrayEquals(answer, result, 1e-5);

Problems with this style:
• Useless, long initialization
• Many useless conversions
• And it does not actually run on Hive; it is only a logical test!!

Add new features - systemtest

• Solution
  • A new module based on JUnit, HiveRunner, and td-client-java
• What can it do?
  • Short and unified initialization
  • Write and combine HQL
  • Run on local Hive and also on remote Treasure Data with the same code
  • Testbeds are prepared and cleaned up automatically
  • Easy use of external resources, e.g. TSV files
  • Literal definitions (HQL), but testable with a debugger
  • A useful DSL

Add new features - systemtest: how does it work?

(Architecture diagram: the user's test code drives a SystemTestTeam, which talks to implementations of the SystemTestRunner interface: HiveSystemTestRunner, backed by HiveRunner, and TDSystemTestRunner, backed by Treasure Data.)

1. Write tests against the SystemTestRunner interface
2. Read the initialization and execute it via the SystemTestRunner implementations
   • Works via JUnit @ClassRule
   • Prepares a database specialized for each test class
   • Uses external resources as needed
3. Execute the first test
   • Works via JUnit @Rule
   • Runs as HQL and checks the return values
   • The DSL & HQL are rewritten for each environment
4. Reset the testbeds
   • Works via JUnit @Rule
   • Drops temporary tables
5. Execute the second test
6. Reset the testbeds
   … repeat for all tests
7. Finalize the test
   • Works via JUnit @ClassRule
   • Drops the temporary database and disconnects

Add new features - systemtest

• Example: initialization (shown with no omissions!)

// Common initialization with external data
private static SystemTestCommonInfo ci = new SystemTestCommonInfo(HogeTest.class);

private static HQBase createIrisTable = HQ.uploadByResourcePathAsNewTable(
    "iris0", ci.initDir + "iris0.csv",
    new LinkedHashMap<String, String>() {{
        put("a", "double"); put("b", "double"); put("c", "double"); put("d", "double");
        put("c0", "int"); put("c1", "int"); put("c2", "int");
    }});

// Testbed-specific initialization
@ClassRule
public static HiveSystemTestRunner hRunner = new HiveSystemTestRunner(ci) {{
    initBy(createIrisTable);
    initBy(HQ.fromStatements(""
        + "CREATE TEMPORARY FUNCTION transpose_and_dot as 'hivemall.tools.matrix.TransposeAndDotUDAF';"
        + "CREATE TEMPORARY FUNCTION array_sum as 'hivemall.tools.array.ArraySumUDAF';"
        + "CREATE TEMPORARY FUNCTION array_avg as 'hivemall.tools.array.ArrayAvgGenericUDAF';"
        + "CREATE TEMPORARY FUNCTION chi2 as 'hivemall.ftvec.selection.ChiSquareUDF';"
        + "CREATE TEMPORARY FUNCTION snr as 'hivemall.ftvec.selection.SignalNoiseRatioUDAF';"));
}};

@ClassRule
public static TDSystemTestRunner tRunner = new TDSystemTestRunner(ci) {{
    initBy(createIrisTable);
}};

// Set the common runner
@Rule
public SystemTestTeam team = new SystemTestTeam(hRunner);

Add new features - systemtest

• Example: test cases (1) (shown with no omissions!)
• Tests are executed on clean testbeds, using the database created during initialization

// Runs on HiveRunner (the common runner)
@Test
public void snr() throws Exception {
    team.set(HQ.fromStatement(""
        + "WITH iris AS ("
        + " SELECT array(a, b, c, d) AS X, array(c0, c1, c2) AS Y FROM iris0)"
        + "SELECT snr(X, Y)"
        + "FROM iris"), "$ANSWER");
    team.run();
}

// Runs on HiveRunner and Treasure Data
@Test
public void chi2() throws Exception {
    team.add(tRunner);
    team.set(HQ.fromStatement(""
        + "WITH iris AS ("
        + " SELECT array(a, b, c, d) AS X, array(c0, c1, c2) AS Y FROM iris0),"
        + "stats AS ("
        + " SELECT"
        + "  transpose_and_dot(Y, X) AS observed,"
        + "  array_sum(X) AS feature_count,"
        + "  array_avg(Y) AS class_prob"
        + " FROM iris),"
        + "test AS ("
        + " SELECT"
        + "  transpose_and_dot(class_prob, feature_count) AS expected"
        + " FROM stats)"
        + "SELECT"
        + " chi2(observed, expected) AS x "
        + "FROM"
        + " test JOIN stats"), "$ANSWER");
    team.run();
}

Add new features - systemtest

• Example: test cases (2) (shown with no omissions!)

@Test
public void someTest0() throws Exception {
    final String tableName = "color";
    // Test-specific initialization (it can also be chained)
    team.initBy(HQ.uploadByResourcePathAsNewTable(
        tableName, ci.initDir + "color.tsv",
        new LinkedHashMap<String, String>() {{
            put("name", "string"); put("red", "int"); put("green", "int"); put("blue", "int");
        }}));
    team.set(HQ.fromStatement(""
        + "SELECT CONCAT('rgb(', red, ',', green, ',', blue, ')') FROM "
        + tableName + " u LEFT JOIN color c on u.favorite_color = c.name"),
        "rgb(255,165,0)\trgb(255,192,203)");
    team.run();
}

@Test
public void someTest1() throws Exception {
    // Uses HQL and answers written in external files
    team.set(HQ.autoMatchingByFileName("hoge"), ci);
    team.run();
}

Add new features - systemtest

• More details?
  • https://github.com/myui/hivemall/issues/323
  • https://github.com/myui/hivemall/pull/336
  • And systemtest/README.md

Add new features - feature binning

• What's feature binning?
  • A method to divide quantitative variables into meaningful categorical variables

Add new features - feature binning

• How does it work? (see the sketch below)
  • [UDAF] build_bins(weight, num_of_bins[, auto_shrink])
  • [UDF] feature_binning(features, quantiles_map) / (weight, quantiles)

(Diagram: build_bins produces the quantiles, which feature_binning then uses to assign bins.)
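A minimal usage sketch combining the two, based only on the signatures above; the users table and age column are illustrative:

  -- build 3 bins from the age column, then map each age to its bin index
  WITH bins AS (
    SELECT build_bins(age, 3) AS quantiles
    FROM users
  )
  SELECT u.age, feature_binning(u.age, b.quantiles) AS age_bin
  FROM users u
  CROSS JOIN bins b

The (features, quantiles_map) form does the same for a whole feature vector at once.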

Add new features - feature binning

• [UDAF] build_bins(weight, num_of_bins[, auto_shrink])

  • Uses percentiles internally so that every bin covers roughly the same amount of data

Add new features - feature binning

• [UDAF] build_bins(weight, num_of_bins[, auto_shrink])
  • What's auto_shrink?
  • A small or skewed data set can sometimes produce void (empty) bins, and without auto_shrink this ends in an exception
  • With auto_shrink enabled, the number of bins is automatically reduced (shrunk) so that void bins are avoided

(Diagram: a histogram where one bin ends up empty ("!?"), leading to "Exception!")

Add new features - feature binning

• [UDF] feature_binning(features, quantiles_map) / (weight, quantiles)
  • Distributes variables into bins according to their values
  • Diagram example: Age: 17, with bins bin 0, bin 1, bin 2; since 17 is between -Infinity and 18.0, it falls into bin 0

Add new features - feature binning

• More details?
  • https://github.com/myui/hivemall/issues/319
  • https://github.com/myui/hivemall/pull/322

Add new features - feature selection

• What's feature selection?
  • A generic term for methods that select meaningful features
  • Used as a preprocessing step for machine learning

• Why is it used?
  • To improve results
  • To shorten learning time
  • To make a set of features human-understandable

Add new features - feature selection

• Kinds of feature selection:
  • Using variance
  • Using the Chi-square value  <- implemented
  • Using SNR (Signal-to-Noise Ratio)  <- implemented
  • mRMR (minimum Redundancy Maximum Relevance)
  • etc.

Add new features - feature selection

• Feature selection using the Chi-square value
  • To calculate the Chi-square value, both observed values and expected values (= the hypothesis) are needed
    • Observed: features aggregated per class
    • Expected: expected values calculated under the assumption that each feature and each class are independent
  • Calculate the Chi-square value
  • Select the top-k features

[Chi-square]

Add new features - feature selection

• How does it work on Hivemall? (the three functions are combined in the sketch after this list)
  • [UDAF] transpose_and_dot(X::array<number>, Y::array<number>)::array<array<double>>
  • [UDF] chi2(observed::array<array<number>>, expected::array<array<number>>)::struct<array<double>, array<double>>
  • [UDF] select_k_best(X::array<number>, importance_list::array<int>, k::int)::array<double>

[Chi-square]
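A rough sketch of how the three functions combine, adapted from the chi2 test case shown earlier; the final select_k_best step, the chi2 struct field name, and k = 2 are illustrative assumptions rather than taken from the slides:

  WITH iris AS (
    SELECT array(a, b, c, d) AS X, array(c0, c1, c2) AS Y FROM iris0),
  stats AS (
    SELECT
      transpose_and_dot(Y, X) AS observed,   -- per-class feature sums
      array_sum(X) AS feature_count,
      array_avg(Y) AS class_prob
    FROM iris),
  test AS (
    SELECT transpose_and_dot(class_prob, feature_count) AS expected
    FROM stats),
  importance AS (
    SELECT chi2(observed, expected) AS chi2_result
    FROM test JOIN stats)
  -- keep the 2 features with the highest Chi-square values (assumes the struct field is named chi2)
  SELECT select_k_best(X, chi2_result.chi2, 2) AS selected_features
  FROM iris CROSS JOIN importance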

Add new features - feature selection

• [UDAF] transpose_and_dot(X::array<number>, Y::array<number>)::array<array<double>>
  • A generic utility UDAF for matrix calculation: it transposes its first argument and multiplies it by the second (e.g. transpose_and_dot(Y, X) gives the class-by-feature `observed` matrix above)
  • You might think matrix multiplication requires holding and iterating over the whole matrices, but it is calculated incrementally: each input row only adds the outer product of its two vectors to the accumulator (see the formula below)

[Chi-square]
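In formula form (my summary of the incremental computation, treating the i-th input arrays as row vectors $x_i$ and $y_i$):

  \[ \mathrm{transpose\_and\_dot}(X, Y) \;=\; X^{\top} Y \;=\; \sum_{i=1}^{n} x_i^{\top} y_i \]

so each row contributes one rank-1 update and the full matrices never need to be materialized.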

Add new features - feature selection

• [UDF] chi2(observed::array<array<number>>, expected::array<array<number>>)::struct<array<double>, array<double>>
  • Calculates the Chi-square value and the p-value
  • The p-value is derived from the Chi-square value and the Chi-square distribution

[Chi-square]
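For reference, the statistic behind this UDF is the standard Chi-square formula; per feature $j$, summing over the classes $c$ of the observed matrix $O$ and expected matrix $E$:

  \[ \chi^2_j \;=\; \sum_{c} \frac{(O_{cj} - E_{cj})^2}{E_{cj}} \]

and the p-value is the upper tail of the Chi-square distribution evaluated at $\chi^2_j$.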

Add new features - feature selection

• [UDF] select_k_best(X::array<number>, importance_list::array<int>, k::int)::array<double>
  • Selects the top-k elements of X according to importance_list
  • Generic UDF
  • NOTE: the current implementation expects importance_list and k to be identical for every row

(Diagram: an example with k = 2)

[Chi-square]

Add new features - feature selection

• Feature selection using SNR
  • Aggregate the mean and variance of each feature for each class
  • On termination, calculate the Signal-to-Noise Ratio between every pair of classes, for each feature
  • Sum up the Signal-to-Noise Ratios for each feature

[Signal-to-Noise Ratio]
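As a reference point, the Signal-to-Noise Ratio between two classes is commonly defined per feature as the separation of the class means relative to their spreads (this is the usual definition; the slides do not spell out the exact formula used):

  \[ \mathrm{SNR}_j(c_1, c_2) \;=\; \frac{|\mu_{j,c_1} - \mu_{j,c_2}|}{\sigma_{j,c_1} + \sigma_{j,c_2}} \]

and the per-feature score is the sum of this quantity over all class pairs.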

Add new features - feature selection

• How does it work on Hivemall?
  • [UDAF] snr(X::array<number>, label::array<int>)::array<double>

[Signal-to-Noise Ratio]

Add new features - feature selection

• [UDAF] snr(X::array<number>, label::array<int>)::array<double>
  • Aggregates the variances using Chan's method
  • Calculates the Signal-to-Noise Ratio and sums it up for each feature (a usage sketch follows)

[Signal-to-Noise Ratio]
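A minimal usage sketch, reusing the iris0 table from the systemtest examples; the select_k_best step and k = 2 are illustrative additions:

  WITH iris AS (
    SELECT array(a, b, c, d) AS X, array(c0, c1, c2) AS Y FROM iris0),
  importance AS (
    SELECT snr(X, Y) AS snr_values FROM iris)
  -- keep the 2 features with the highest summed SNR scores
  SELECT select_k_best(X, snr_values, 2) AS selected_features
  FROM iris CROSS JOIN importance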

Add new features - feature selection

• More details?
  • https://github.com/myui/hivemall/issues/338
  • https://github.com/myui/hivemall/pull/352

Add new features - spark integration

• Integrated feature selection into the Spark module

• Improved the build flow to resolve the binary incompatibility between spark-1.6 and spark-2.0

Thank you for listening!

Any questions?