
Interpretable Automated Machine Learning in Maana™ Knowledge Platform

Alexander Elkholy, Fangkai Yang, Steven Gustafson
Maana, Inc.
Bellevue, WA

[email protected]

Abstract—Machine learning is becoming an essential part of developing solutions for many industrial applications, but the lack of interpretability hinders wide industry adoption to rapidly build, test, deploy and validate machine learning models, in the sense that the insights gained while developing machine learning solutions are not structurally encoded, justified and transferred. In this paper we describe the Maana Meta-learning Service, an interpretable and interactive automated machine learning service residing in the Maana Knowledge Platform that performs machine-guided, user-assisted pipeline search and hyper-parameter tuning, and generates structured knowledge about the decisions made during pipeline profiling and selection. The service is shipped with the Maana Knowledge Platform and is validated using benchmark datasets. Furthermore, its capability of deriving knowledge from pipeline search facilitates various inference tasks and transfer to similar data science projects.

I. INTRODUCTION

Machine learning is becoming an essential part of developing solutions for many industrial applications. Developers of such applications need to rapidly build, test, deploy and validate machine learning models. The validation of models is a key capability that will enable industries to more widely adopt machine learning for business decision making; however, this process suffers from a lack of interpretability. First, in most cases, validation begins with understanding how a machine learning model is developed: the pipeline from source data, through data processing and featurization, to model building and parameter tuning. The capability of understanding machine learning models also represents a key piece of domain knowledge: data scientists who understand how to make successful domain-specific machine learning pipelines will be in high demand across that domain.

Unfortunately, this level of understanding remains somewhat of a "dark art", in that the knowledge and judgment used to find good domain-specific machine learning pipelines usually live in the heads of the data scientists. Therefore, while it is possible to see the final machine learning pipeline, the steps the data scientist went through, and the compromises and decisions they made, are not captured. Once the model is delivered, most insights and assumptions related to the development of the solution are lost, making long-term sustainment difficult. Second, there is no clear way of encoding the empirical experience a data scientist derives from developing data science solutions so that it can be transferred and shared, applied efficiently to similar projects, and spread among a group of data scientists in an organization; this causes repetitive work, low efficiency and inconsistent solution quality.

To facilitate rapid development and deployment of data science solutions, automated machine learning (AutoML) has gained interest recently due to the availability of public dataset repositories and open-source machine learning code bases. AutoML systems such as Auto-WEKA [1], Auto-SKLEARN [2] and TPOT [3] attempt to optimize the entire machine learning pipeline, which can consist of independent steps such as featurization, which encodes data in numeric form, or feature selection, which attempts to pick the best subset of features from which to train a model. Given sufficient computational resources, these systems can achieve good accuracy in building machine learning pipelines, but they do not provide clear explanations to justify the choice of models that can be verified by data scientists; consequently, the problem of interpretability remains unsolved.



To address this challenge, in this paper we describe the Maana Meta-learning service, which provides interpretable automated machine learning. The goal of this project is twofold. First, we hope that the efficiency of developing data science solutions can be improved by leveraging an automated search and profiling algorithm, such that a baseline solution can be automatically generated for data scientists to fine-tune. Second, we hope that such an automated search process is transparent to human users, and that through the learning process the service can return interpretable insights on the choice of models and hyper-parameters and encode them as knowledge. In contrast with most AutoML systems that provide end-to-end solutions, the Maana Meta-learning service is an interactive assistant to data scientists that performs user-guided, machine-assisted automated machine learning. Data scientists specify a pre-determined search space, and the Meta-learning service then goes through several stages to perform model selection, pipeline profiling and hyper-parameter tuning. During this process, it returns intermediate results, and the user can inject feedback to steer the search. Finally, it generates an optimal pipeline along with structured knowledge encoding the decision-making process, leading to an interpretable automated machine learning process.

The Maana Meta-learning service features two components: (1) a knowledge representation that captures the domain knowledge of data scientists and (2) an AutoML algorithm that generates machine learning pipelines, evaluates their efficacy by sampling hyper-parameters, and encodes all the information about the choices made and the subsequent performance and parameters into the knowledge representation. The knowledge representation is defined using GraphQL1. Developed by Facebook as an alternative to the popular REST interface [4], GraphQL provides a single API endpoint for data access, backed by a structured, hierarchical type system. Consequently, it allows us to define a knowledge taxonomy that captures the concepts of machine learning pipelines, to seamlessly populate facts into the predefined knowledge graph, and to reason with them. The AutoML algorithm, in charge of generating and choosing which pipelines to pursue, is based on the PEORL framework [5], an integration of symbolic planning [6] and hierarchical reinforcement learning [7]. Symbolic plans generated from a pre-defined symbolic formulation of a dynamic domain are used to guide reinforcement learning, and recently this approach has been generalized to improve the interpretability of deep reinforcement learning. In the setting of AutoML, generating a machine learning pipeline is treated as a symbolic planning problem over an action description in the action language BC [8] that contains actions such as preprocessing, featurizing, cross-validation, training and prediction. The pipeline is then executed by mapping each symbolic action to primitive actions in a Markov Decision Process (MDP) [9], namely ML pipeline components instantiated with random hyper-parameters, in order to learn the quality of the actions in the pipeline. The learning process is value iteration with R-learning [10], [11], where the cross-validation accuracy of the pipeline is used as the reward. After the quality of the current pipeline is measured, an improved ML pipeline is generated using the learned values, and the interaction with learning continues until no better pipeline can be found. This step is called model profiling. After that, a more systematic parameter sweep is performed, i.e., model searching. This allows us to describe the pipeline steps in an intuitive representation and to explore the program space more systematically and efficiently with the help of reinforcement learning.

1 https://graphql.org/

In this paper, we demonstrate that Maana Meta-learning provides a decent baseline on a variety of datasets, involving both binomial and multinomial classification tasks on various data types. Furthermore, when knowledge instances are filled into the pre-defined knowledge schema, the insights derived from the Meta-learning process can be visualized as a knowledge graph, improving interpretability and facilitating knowledge sharing, sustainment and transfer to similar tasks. We show that an interactive process that leverages domain knowledge and user feedback to populate a structured knowledge graph addresses the interpretable automated machine learning sought after by industrial applications of data science.

II. RELATED WORK

In contrast with optimizing the selection of model parameters, the goal of the AutoML task is to optimize an entire machine learning pipeline. That


is, starting from the raw data, it concerns itself with everything, including the optimal selection of featurization, selection of the algorithm, hyper-parameter selection, as well as the cohesive collection of these as an ensemble. The most recent and most relevant approaches to the AutoML paradigm are Auto-WEKA and Auto-SKLEARN. Auto-WEKA [1], [12] calls this the combined algorithm selection and hyperparameter optimization problem, or CASH. The approach is formalized as a Bayesian optimization problem in which different pipelines are tested sequentially based on the performance of the last, using what is called a sequential model-based optimization (SMBO) formulation [13]. In combination with the algorithms available in the WEKA library [14], it provides a complete package targeted toward non-expert users, allowing them to build machine learning solutions without necessarily knowing the details required to do so. Auto-SKLEARN [2] approaches the AutoML task in much the same way, treating it as a Bayesian optimization problem. However, its authors claim that by giving a "warm start" to the optimization procedure, the time to reach performant pipelines is significantly reduced: they pre-select possibly good configurations to begin the procedure, with the goal of increasing efficiency and reducing build time. Additionally, instead of the WEKA library, they use the scikit-learn library [15]. The more recent system KeystoneML [16] uses techniques similar to database query optimization to optimize machine learning pipelines end-to-end, where each ML operator has a declarative logical representation. By comparison, our work has a different focus and scope. Instead of directly outputting the best machine learning pipeline as a one-shot solution, we focus on an interactive process where data scientists use the service to explore their predefined search space and refine their decisions. In this setting, Meta-learning provides "user-guided, machine-assisted" automated search, facilitates encoding knowledge and the decision-making process, and addresses the challenge of interpretability of data science solutions. The interpretability of automated machine learning and of the knowledge derived from the search algorithm is enabled by the PEORL framework [5], a combination of symbolic planning and reinforcement learning. Symbolic planning [6] generates possible sequences of actions to achieve a goal, which are pre-defined and logic-based (e.g.

only certain data types are compatible with certain featurizers). Combined with R-learning [10], feedback on the actions taken is learned in order to generate new plans.

III. PRELIMINARIES

A. GraphQL and Maana Knowledge Platform

GraphQL is a unified layer for data access and manipulation. In a distributed system, it is located at the same layer as REST, SOAP and XML-RPC; that is, it is used as an abstraction layer to hide database internals. A GraphQL schema consists of a hierarchical definition of types and the operations that can be applied to them, i.e., queries and mutations. GraphQL's type system is very expressive and supports features like inheritance, interfaces, lists, custom types and enumerated types. By default, every type is nullable, i.e., not every value specified in the type system or query has to be provided. Every GraphQL type system must specify a special root type called Query, which serves as the entry point for query validation and execution. One example of a GraphQL schema definition is shown below. It contains two types: Person, which contains the fields name (the exclamation mark ! denotes non-nullable fields), age, a list of book instances (denoted by brackets []) and a list of friend instances; and Book, with a field title and a list of persons as authors. Furthermore, there are three queries that retrieve an instance of Person by name, an instance of Book by title, and a list of books by applying a filter. There is also a mutation that adds a person given a name.

type Person {
  name: String!
  age: Int
  books(favorite: Boolean): [Book]
  friends: [Person]
}

type Book {
  title: String!
  authors: [Person]
}

type Query {
  person(name: String!): Person
  book(title: String!): Book
  books(filter: String!): [Book]
}

type Mutation {
  addPerson(name: String!): Boolean
}

Such a schema provides a representational abstraction of the operations and the data they manipulate, and connects front-end query/mutation calls with back-end implementation details.
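To make the access pattern concrete, the following Python sketch posts a query against such a schema over HTTP; the endpoint URL, the queried name and the use of the requests package are illustrative assumptions, not part of the platform description.

import requests

# Hypothetical GraphQL endpoint; a real deployment exposes its own URL.
ENDPOINT = "http://localhost:8080/graphql"

QUERY = """
query {
  person(name: "Ada") {
    name
    books(favorite: true) { title }
  }
}
"""

# A single POST to the single endpoint carries the whole query.
response = requests.post(ENDPOINT, json={"query": QUERY})
response.raise_for_status()
print(response.json())  # e.g. {"data": {"person": {"name": "Ada", "books": [...]}}}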


The Maana Knowledge Platform2 is architected around GraphQL-based microservices whose type systems are connected with each other to form a Computational Knowledge Graph (CKG). Different from traditional semantic systems based on ontologies and description logic [17], the CKG separates the conceptual modeling of data, the content of the data and the operations on the data. This separation enables a fluidity of modeling, allowing data from any source and in any format to be seamlessly integrated, modeled, searched, analyzed, operationalized and re-purposed. Each resulting model is a unique combination of three key components: subject-matter expertise, relevant data from silos, and the right algorithm, all of which are instrumental in optimizing operations and decision flows. Furthermore, the CKG is dynamic, which means that it can represent conceptual and computational models. In addition, it can be used to perform complex transformations and calculations at interactive speeds, making it a game-changing technology for agile development of AI-driven knowledge applications.

B. PEORL Framework

PEORL [5] is a framework that integrates symbolic planning with reinforcement learning [7]. Using a symbolic formulation to capture high-level domain dynamics and planning with it, a symbolic plan guides reinforcement learning to explore the domain, instead of performing random trial-and-error. Because domain knowledge significantly reduces the search space, this approach accelerates learning and also improves the robustness and adaptability of symbolic plans for sequential decision making. One instantiation of this framework in [5] uses the action language BC to formulate the dynamic domain through a set of causal laws, i.e., preconditions and effects of actions and static relationships between properties (fluents) of a state. In particular, PEORL requires causal laws formulating the cumulative effect (plan quality) defined over a sequence of actions. For an action a executed at state s, such a causal law has the form

a causes quality = C + Z if s, ρ(s, a) = Z, quality = C,

2 https://www.Maana.io/knowledge-platform/

where ρ(s, a) is a value that will be further updated by reinforcement learning. Reinforcement learning is achieved by R-learning [10], [11], i.e., performing the value iteration

R_{t+1}(s_t, a_t) \xleftarrow{\alpha_t} r_t - \rho_t(s_t) + \max_a R_t(s_{t+1}, a),
\rho_{t+1}(s_t) \xleftarrow{\beta_t} r_t + \max_a R_t(s_{t+1}, a) - \max_a R_t(s_t, a)    (1)

to approximate the policy that achieves the maximal long-term average reward (using R) and the gain reward (using ρ).

At any time t, given an action description in BC, an initial state I and a goal state G, PEORL uses an answer set solver such as CLINGO to generate a plan Π_t, i.e., a sequence of actions that transits the state from I to G. The actions are then sent for execution one by one, and value iteration (1) is performed. After that, the ρ values for all states s in plan Π_t are summed up to obtain the quality of the plan,

quality(\Pi_t) = \sum_{\langle s,a,s' \rangle \in \Pi_t} \rho(s)

and the ρ(s, a) values for all transitions ⟨s, a, s′⟩ are used to update the facts in the action description. A new plan Π_{t+1} is then generated that not only satisfies the goal condition G but also has a plan quality greater than quality(Π_t). This process terminates when the plan cannot be further improved.
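As a concrete reading of value iteration (1) and the plan-quality sum, the following Python sketch performs one R-learning update per executed transition and then scores a plan; the tabular dictionaries and the learning rates are illustrative assumptions, not the service's actual implementation.

from collections import defaultdict

alpha, beta = 0.1, 0.05        # learning rates alpha_t, beta_t (illustrative)
R = defaultdict(float)         # R(s, a): relative value of action a in state s
rho = defaultdict(float)       # rho(s): average (gain) reward at state s

def max_R(s, actions):
    # max over available actions of R(s, a)
    return max(R[(s, a)] for a in actions)

def r_learning_update(s, a, r, s_next, actions):
    # Eq. (1): exponential moving averages toward the R-learning targets.
    R[(s, a)] += alpha * (r - rho[s] + max_R(s_next, actions) - R[(s, a)])
    rho[s] += beta * (r + max_R(s_next, actions) - max_R(s, actions) - rho[s])

def plan_quality(plan):
    # quality(Pi) = sum of rho(s) over the transitions <s, a, s'> in the plan.
    return sum(rho[s] for (s, a, s_next) in plan)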

Meta-learning concerns generating a machine learning pipeline with proper hyper-parameters to meet an objective, such as accuracy. This problem can be formulated as an interplay between generating a reasonable machine learning pipeline, viewed as a symbolic plan generated from a domain formulation of commonsense knowledge of data science, and evaluating the machine learning pipeline, viewed as executing actions and receiving rewards from the environment, derived from the objective. This approach allows interpretable, explicitly represented expert knowledge to delineate the search space in which to look for a proper pipeline along with its hyper-parameters, and also allows the user to change their specification of the search space at run time, leading to a more interpretable and transparent meta-learning process. The details of the algorithm are described in Section IV-B.


IV. METHODOLOGY

A. Knowledge Schema for Data Science

We first show the knowledge schema defined to capture the concepts and relationships in a data science solution. First we define a machine learning interface

interface MachineLearningModel {
  id: ID
  algorithm: MachineLearningAlgorithm
  features: [Feature]
  preprocessor: Preprocessor
  saved: Boolean
  accuracy: Float
  timeToLearnInSeconds: Float
  labels: [Label]
}

where MachineLearningAlgorithm, Feature and Preprocessor are enumeration types, such as

enum MachineLearningAlgorithm {
  random_forest_classifier
  linear_svc_classifier
  gaussian_nb_classifier
  multinomial_nb_classifier
  logistic_classifier
  sgd_classifier
  gradient_boosting_classifier
}

The interface is implemented by every classifier, each of which is defined as a type that includes an ID and the values of all hyper-parameters, for instance

type LogisticClassifier implements MachineLearningModel {
  norm: String
  tolerance: Float
  C: Float
  balance: Boolean
  solver: String
  maxIterations: Int
}

In order to train a classifier, the training input consists of training data specified by a URL or path. Field inputs consist of user-defined preferences for applying featurizers to a column:

input FieldInput {
  name: String!
  type: FeatureType!
  featurizerName: [FeaturizerAlgorithm]
}

The user can also specify the minimal accuracy required for the search to stop (based on a selection criterion: accuracy, F1, precision or recall), candidate models and candidate preprocessors. The full definition of the training input type is

input TrainingInput {
  modelId: ID
  minimumAccuracy: Float
  targetName: String!
  dataInput: TrainingDataInput!
  fields: [FieldInput]
  folds: Int
  selectionCriteria: Metric
  candidateModels: [MachineLearningAlgorithm]
  candidatePreprocessors: [PreprocessorAlgorithm]
  modelProfilingEpisode: Int
  modelSearchEpisode: Int
}

The mutation that triggers Meta-learning accepts a training input and outputs an instance of MachineLearningModel:

trainClassifier(input: TrainingInput!): MachineLearningModel

After a machine learning model is trained, it can be used to classify new input data in JSON format:

classifyInstances(modelID: ID!, data: JSON!)

which returns instances of the type [Label].

The schema defined above can be viewed as a taxonomy, or structure, of the data that will be used to store the results generated by the automated machine learning algorithm (Section IV-B). Encoding the knowledge this way enables the pipeline search process to be better understood, insights about deriving the optimal pipeline to be encoded, and inference tasks over Meta-learning to be performed.
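For illustration, a client could drive this mutation as in the Python sketch below; the endpoint URL and the exact shape of TrainingDataInput are assumptions, since the paper does not spell them out.

import requests

ENDPOINT = "http://localhost:8080/graphql"  # hypothetical service endpoint

MUTATION = """
mutation Train($input: TrainingInput!) {
  trainClassifier(input: $input) { id algorithm accuracy }
}
"""

variables = {"input": {
    "targetName": "field_salary",
    "dataInput": {"url": "file:///data/adult.csv"},  # assumed field shape
    "folds": 10,
    "selectionCriteria": "accuracy",
    "candidateModels": ["logistic_classifier", "random_forest_classifier"],
    "modelProfilingEpisode": 10,
    "modelSearchEpisode": 20,
}}

resp = requests.post(ENDPOINT, json={"query": MUTATION, "variables": variables})
print(resp.json())  # the trained model instance, per the schema above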

B. Automated Machine Learning

The automated machine learning algorithm is based on the PEORL framework, where pipeline generation is viewed as a symbolic planning problem and pipeline evaluation as a reinforcement learning problem in which actions manipulate data and rewards are derived from cross-validation scores. This approach significantly accelerates learning by incorporating domain knowledge to guide exploration, enables online injection of user feedback that changes the pipeline search in a flexible and easy way, and makes learning interpretable in the sense


that learning is performed only on reasonable pipelines generated from pre-defined symbolic knowledge.

1) Representing Domain Knowledge: We use the action language BC to represent the dynamic domain of ML operations and translate the causal laws into a corresponding answer set program (ASP). We first introduce three types of objects:

• Preprocessors, including
  – matrix decompositions: truncatedSVD, pca, kernelPCA, fastICA;
  – kernel approximation: rbfsampler, Nystroem;
  – feature selection: selectkbest, selectpercentile;
  – scaling: minmaxscaler, robustscaler, absscaler; and
  – no preprocessing (noop).
• Featurizers, including two standard featurizers for text classification, i.e., CountVectorizer and TfidfVectorizer.
• Classifiers, including logistic regression, Gaussian naive Bayes, linear SVM, random forest, multinomial naive Bayes and stochastic gradient descent. (A concrete scikit-learn pipeline assembled from such components is sketched below.)
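For concreteness, the following scikit-learn sketch assembles one such featurizer-preprocessor-classifier chain; the particular components and the toy data are illustrative, not the service's actual wiring.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

# One concrete instantiation: TF-IDF featurizer -> truncated SVD preprocessor
# -> logistic regression classifier.
pipeline = Pipeline([
    ("featurizer", TfidfVectorizer()),
    ("preprocessor", TruncatedSVD(n_components=2)),
    ("classifier", LogisticRegression(max_iter=1000)),
])

texts = ["free money now", "meeting at noon", "win a prize", "project update"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam (toy labels)
pipeline.fit(texts, labels)
print(pipeline.predict(["claim your free prize"]))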

We treat each operation in the pipeline as an action and describe its causal laws accordingly. First, the formulation includes facts about compatibility with sparse vectors, such as

acceptsparse(random_forest_classifier)

facts relating operators and data types, such as

compatible(integer,std_scaler)...

and actions such as data import, train and predict. The actions related to the configuration of machine learning pipelines are described as follows:

• Select featurizers for each column. The following rules describe the effects of initializing a featurizer for each column; after every column has one featurizer selected, the cumulative quality increments.

{initfeaturizer(F,Field,Y,k)} :-
    feature(F), has_attr(Field,Y,k),
    compatible(Type,F), has_type(Field,Type),
    datatype(Y,train),
    not modeltrained(Y,k),
    not featurizerinitialized(F,Field,Y,k).

featurizerinitialized(F,Field,Y,k) :-
    initfeaturizer(F,Field,Y,k-1), feature(F).

cost(Q+R,k) :-
    featurizationcompleted(k), ro(R,k), cost(Q,k-1).

cost(Q+10,k) :-
    featurizationcompleted(k),
    #count{R:ro(R,k)}=0, cost(Q,k-1).

featurizationcompleted(k) :-
    #count{Field:featurizerinitialized(_,Field,_,k)}=X1,
    #count{Field:has_field(_,Field)}=X2,
    X1=X2.

• Select preprocessor:

{initpreprocessor(P,Y,k)} :-
    featurizationcompleted(k), preprocessor(P),
    not modeltrained(Y,k),
    datatype(Y,train).

preprocessorinitialized(P,Y,k) :-
    initpreprocessor(P,Y,k-1), featurizationcompleted(k).

cost(Q+R,k) :-
    initpreprocessor(P,Y,k-1), ro(P,R,k-1), cost(Q,k-1).

cost(Q+10,k) :-
    initpreprocessor(P,Y,k-1),
    #count{R:ro(P,R,k-1)}=0, cost(Q,k-1).

• Crossvalidate. Given tokens and labels from the data, we can cross-validate the pipeline by choosing featurizers, a preprocessor and a classifier, with the effect that the model is validated. If the preprocessor or the classifier does not accept sparse vectors, the feature matrix needs to be transformed into a dense one.

sparse :- has_type(X,text),
    #count{T:has_type(X1,T)}=1.

{crossvalidate(C,P,dense,T,k)} :-
    classifier(C), preprocessorinitialized(P,Y,k),
    has_attr(T,Y,k), not sparse,
    has_targetfield(data,T).

{crossvalidate(C,P,sparse,T,k)} :-
    classifier(C), has_attr(T,Y,k),
    preprocessorinitialized(P,Y,k),
    acceptsparse(P), acceptsparse(C), sparse.

modelvalidated(C,P,S,T,k) :-
    datatype(Y,train), crossvalidate(C,P,S,T,k-1).

modelvalidated(T,k) :-
    datatype(Y,train), crossvalidate(C,P,S,T,k-1).

cost(Q+R,k) :-
    ro(P,C,R,k-1), cost(Q,k-1), crossvalidate(C,P,S,T,k-1).

cost(Q+10,k) :-
    crossvalidate(C,P,S,T,k-1),
    #count{R:ro(P,C,R,k-1)}=0, cost(Q,k-1).

Besides the causal laws described above, all fluents are declared inertial, and concurrent execution of actions is prohibited except for initfeaturizer.

Furthermore, to facilitate fast profiling, we use empirical assignments of featurizers to data types unless they are overridden by the user. This capability leverages the non-monotonic reasoning of answer set programming. The default applications of featurizers are: for the categorical data type, apply one_hot; for the float data type, apply std_scaler; for the integer type, apply min_max_scaler; and for text, apply hashing_vectorizer (bag of words). One example of these rules is

:- initfeaturizer(F,Field,Y,k),
   has_type(Field,categorical), F!=one_hot,
   #count{F1:use_featurizer(Field,F1)}=0.

2) Generation of Pipeline: The generation of a pipeline is treated as a symbolic planning problem given the above formulation of the dynamic domain. The initial condition comes from two sources: (1) the data schema and (2) the configuration of the pipeline search space. The data schema consists of the column names and their data types extracted from a data source (e.g., a CSV file in which each column has been pre-labelled with its data type and one column is designated as the classification target). For instance, the following ASP file is generated for the Adult3 dataset, where field_salary is the classification target:

datatype(adult_data,train).
datatype(adult_test,test).
has_field(data,field_age).
has_type(field_age,integer).
...
has_targetfield(adult_data,field_salary).

3 https://archive.ics.uci.edu/ml/datasets/Adult
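A minimal Python sketch of this translation step is shown below, assuming pandas dtypes as the source of type labels; the actual extraction logic of the platform is not shown in the paper, and the type mapping here is simplified.

import pandas as pd

def asp_facts(csv_path, dataset, target):
    """Translate a CSV's column names and dtypes into ASP facts (simplified)."""
    df = pd.read_csv(csv_path)
    type_map = {"int64": "integer", "float64": "float", "object": "categorical"}
    facts = [f"datatype({dataset},train)."]
    for col in df.columns:
        field = "field_" + col.strip().lower().replace(" ", "_")
        facts.append(f"has_field(data,{field}).")
        facts.append(f"has_type({field},{type_map.get(str(df[col].dtype), 'text')}).")
    facts.append(f"has_targetfield({dataset},field_{target}).")
    return "\n".join(facts)

print(asp_facts("adult.csv", "adult_data", "salary"))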

The configuration of the pipeline search space comes from the user's specification of candidate featurizers, candidate preprocessors and candidate models. This information is translated into an ASP file as well. For example, the following file configures the pipeline search to be performed among the listed classifiers, preprocessors and featurizers:

classifier(linear_svc_classifier)...
feature(one_hot)...
preprocessor(pca)...

Furthermore, the user can override the default application of featurizers by specifying their own preference. For instance, if the user wants to use robust_scaler for the column Age, the following fact is appended to the ASP file:

use_featurizer(robust_scaler,field_age).

The planning goal to train a classifier is defined to be

:- not modeltrained(adult_data,k), query(k).
:- query(k), cost(Q,k), Q<=1.

Alternatively, given a test file, the goal can also be classifying the test data, with the first constraint replaced by

:- not has_attr(adult_test, field_salary,k).

Given the initial condition and a goal, a plan can be generated by translating the action description above to ASP and running the answer set solver CLINGO:

1: import_train(adult_data)
2: initfeaturizer(one_hot,field_sex,adult_data)
   initfeaturizer(one_hot,field_race,adult_data)
   initfeaturizer(one_hot,field_education,adult_data)
   initfeaturizer(one_hot,field_workclass,adult_data)
   initfeaturizer(robust_scaler,field_age,adult_data)
3: initpreprocessor(Nystroem,adult_data)
4: crossvalidate(gradient_boosting_classifier,nystroem,dense,field_salary)
5: train(gradient_boosting_classifier,nystroem,dense,field_salary)

The above output is a plan that achieves the goal from the initial state. The machine learning pipeline is encoded as operations step by step: importing the training data in step 1, initializing the featurizers in step 2, initializing the preprocessor in step 3, and picking a classifier to perform cross-validation in step 4. In practice,


the plan request is sent from a front-end UI or a GraphQL query, where there are other hyper-parameters that need to be specified, as will be described in Section V.
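Programmatically, such a plan can be obtained through the CLINGO Python bindings, as in the sketch below; the program string is a stand-in for the translated action description, initial state and goal.

import clingo  # Python bindings shipped with the CLINGO solver

def solve_plan(asp_program):
    """Ground and solve an ASP program, collecting the shown atoms per model."""
    ctl = clingo.Control(["0"])          # "0": enumerate all answer sets
    ctl.add("base", [], asp_program)
    ctl.ground([("base", [])])
    plans = []
    ctl.solve(on_model=lambda m: plans.append(
        [str(atom) for atom in m.symbols(shown=True)]))
    return plans

# asp_program would hold the action description, initial state and goal;
# each returned answer set corresponds to one candidate pipeline plan.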

Currently we only allow one featurizer to be applied to a column. In the future, we will allow the user to specify multiple featurizers per column, to enable more flexible feature specification and increase the expressivity of the pipeline search space.

3) Pipeline Learning: The evaluation of a pipeline is performed using the PEORL-based procedure shown in Algorithm 1. The algorithm accepts a set of candidate classifiers, a set of candidate preprocessors, user-specified applications of featurizers to columns, and a training dataset. The input also includes the number of model profiling episodes and the number of cross-validation folds. First, the initial state of the planning problem is generated (line 2) and a pipeline is generated (line 5). Each action in the pipeline is executed by calling a library of methods that wraps SCIKIT-LEARN; featurization and preprocessing actions receive a reward of -1 to promote shorter pipelines (line 11), followed by the value iteration of R-learning (line 12). For the cross-validation action, random parameters are picked for the preprocessor, featurizers and classifier (line 16) to assemble a pipeline (line 17) and perform cross-validation (line 18), converting the feature matrix to dense if so specified. A reward proportional to the average cross-validation score is then derived (line 19), and the R and ρ values are updated using R-learning. After the whole pipeline is profiled, the pipeline quality is evaluated (line 27), the planning goal is updated (line 28), and the ρ values are written back into the symbolic formulation to generate a new pipeline. When the pipeline cannot be further improved, the optimal pipeline is returned (line 6). In our application the pipeline output takes the form of a serialized pickle object.
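The inner profiling loop (lines 15-22 of Algorithm 1) can be pictured with the scikit-learn sketch below; the sampled hyper-parameter range and the omission of the R-learning update are simplifications for illustration.

import random
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

def profile_crossvalidate(X, y, episodes=10, folds=10):
    """Sample random hyper-parameters and cross-validate (lines 15-22)."""
    rewards = []
    for _ in range(episodes):
        # Line 16: instantiate the classifier with random hyper-parameters.
        clf = LogisticRegression(C=10 ** random.uniform(-3, 2), max_iter=1000)
        # Line 18: v-fold cross validation; line 19: reward from the CV score.
        reward = cross_val_score(clf, X, y, cv=folds).mean()
        rewards.append(reward)  # in Algorithm 1 each reward updates R and rho
    return rewards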

4) Interactive Process of Meta-Learning: Meta-learning is by its nature a long-running job, especially with larger datasets. The PEORL algorithm applied to machine learning by itself does not scale well. Because of this we have incorporated two features intended to improve performance. First, before running the PEORL algorithm, we have a broad search phase where we test each classifier once. We then restrict the algorithm to only the best-performing classifier and continue the search process.

Algorithm 1 Pipeline Profiling

Require: candidate classifiers C, candidate preprocessors P, user-specified featurizer-column applications F, planning goal G = (:- not modeltrained(data), ∅), domain representation D, profiling episodes profilingepisodes and cross-validation folds v.

1:  P_0 ⇐ ∅, Π ⇐ ∅
2:  generate initial state I from C, P, F
3:  while True do
4:      Π_o ⇐ Π
5:      solve planning problem Π ⇐ CLINGO.solve(I, G, D ∪ P_t)
6:      if Π = Π_o then
7:          return Π_o
8:      end if
9:      for action ⟨s, a, s′⟩ ∈ Π do
10:         if a ∈ {initfeaturizer(F, Col, Y), initpreprocessor(P, Col, Y)} where F ∈ F, P ∈ P, Col is a column name and Y is the training dataset name then
11:             reward ⇐ −1
12:             update R(a, s) and ρ_t^a(s) for action a
13:         end if
14:         if a ∈ {crossvalidate(C, P, D, Col, Y)} where C ∈ C, Col is a column name and Y is the training dataset name then
15:             for i < profilingepisodes do
16:                 instantiate C, P, F by randomly sampling their hyper-parameters
17:                 assemble the pipeline using C, P, F
18:                 perform v-fold cross validation using C, F, P
19:                 obtain reward ∝ cvscore
20:                 update R(a, s) and ρ_t^a(s) for action a
21:                 i ← i + 1
22:             end for
23:         else
24:             execute a
25:         end if
26:     end for
27:     quality(Π) ← Σ_{⟨s,a,s′⟩∈Π} ρ^a(s)
28:     update planning goal G ⇐ (A, quality > quality(Π))
29:     update facts P_t ⇐ {ρ(a) = z : ρ_t^a(s) = z}
30: end while


1) Model Selection (Phase 1). For each classifier C ∈ C, call Algorithm 1 using the chosen feature set and P = {noop}, recording the performance. Select classifier C_0 based on the predefined selection criterion (accuracy, F1, precision, recall).

2) Pipeline Learning (Phase 2). Call Algorithm 1 with C = {C_0}, and with P and F being the user-specified preprocessors and featurizers, and generate the optimal pipeline Π_1.

3) Parameter Sweeping (Phase 3). Perform grid search or random search over the hyper-parameters of Π_1 and return the final pipeline Π_2.

We also allow the user to gradually refine their preferences during the process. When the user sees intermediate results, they may inject feedback by overriding any preset configuration above, at any time; this information is picked up by the Meta-learning search algorithm, which changes its behavior toward the user's feedback in the next episode of planning and learning. The user can remove or add algorithms to test, cancel the current pipeline, stop a phase with the current best classifier, or stop the entire process and use the best classifier found.

V. SYSTEM IMPLEMENTATION

In the Maana Knowledge Platform, CSV files can be uploaded; each column becomes a field, and its type is automatically identified. The user can trigger the Meta-learning service by submitting a query through the GraphQL endpoint. The GraphQL input is used to generate part of the initial state for planning, and the Meta-learning service is triggered for pipeline search. Throughout the pipeline search process, the results are constantly written to the Maana Knowledge Platform according to the knowledge schema. The service is implemented in Python with the Graphene library to provide the GraphQL server and endpoints. It is deployed as a Docker image along with the other components of the Maana Knowledge Platform.

Additionally, another feature we use to improve performance is the parallelization of model building. Because the model profiling and model search episodes can be done in parallel, we use an asynchronous approach, in which multiple workers are launched, each performing its own parameter sampling and cross-validation on the dataset, and the results are returned to the dispatcher to perform value iteration.
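A minimal sketch of this dispatcher/worker pattern, using only the Python standard library, is shown below; the scoring function is a stand-in for the sampling and cross-validation done by a real worker.

import random
from concurrent.futures import ProcessPoolExecutor

def evaluate_pipeline(params):
    # Stand-in for assembling a pipeline with `params` and cross-validating it.
    return random.random()

def profile_once(seed):
    # Worker: sample hyper-parameters and score one pipeline independently.
    random.seed(seed)
    params = {"C": 10 ** random.uniform(-3, 2)}
    return params, evaluate_pipeline(params)

def parallel_profiling(episodes=10, workers=4):
    # Dispatcher: fan episodes out to workers and collect (params, score)
    # pairs, which would then feed the value-iteration step.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(profile_once, range(episodes)))

if __name__ == "__main__":
    print(parallel_profiling())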

A. Example: Classify Spam Email

We show how the Meta-learning service performs on an example dataset obtained from the UCI machine learning repository: spam message detection4. The dataset contains 4,601 entries with 57 float and integer features used to detect whether a message is spam or not.

Once the Meta-learning service is running, the user can load a CSV file into a workspace in a Maana project. After that, the user launches the service through the GraphQL endpoint, specifying the feature fields and their types, the candidate classifiers (logistic regression, random forest, linear SVC, SGD classifier) and the candidate preprocessors (noop, random trees embedding, truncated SVD, PCA, Nystroem, kernel PCA). The run performs 10-fold cross validation, 10 episodes of model profiling and 20 episodes of model search.

After the service is launched, it enters the first phase, model selection. In this phase, it applies no preprocessors and only the default featurizers to the columns, picks 10 sets of random hyper-parameters for each model, and calculates the average cross-validation accuracy. By the end of model selection (Phase 1), it stores the metric information in the knowledge graph, visualized in the first column of Fig. 1b. The figure shows that the most accurate model, based on cross-validation results, is logistic regression. At this point, the user is notified that logistic regression has been selected according to the predefined selection criterion. In Pipeline Learning (Phase 2), Meta-learning tries to find the best combination of preprocessors and featurizers for the selected classifier, following Algorithm 1. Since the user did not override any default selection of featurizers, one_hot_encoder is applied to categorical fields and min_max_scaler is applied to integer fields. During this process, the ASP-based planner generates pipelines using the selected classifier, and the reinforcement learner evaluates each generated pipeline on the data, using a reward derived from the cross-validation accuracy. The pipeline is gradually improved up to the point where it no longer changes. By the end of Phase 2, the performance of different preprocessors paired with the classifier

4 https://archive.ics.uci.edu/ml/datasets/Spambase


Fig. 1: Meta-learning Service Overview. (a) The Meta-learning service architecture. (b) Results of the three stages. (c) The results in the knowledge graph.

Dataset              | Featurizer               | Preprocessor           | Classifier          | CV accuracy
Reuters 50/50        | hashing vectorizer       | none                   | linear SVC          | 0.848
IMDB                 | hashing vectorizer       | none                   | SGD classifier      | 0.879
Adult                | one hot, min max scaling | random trees embedding | logistic regression | 0.8523
Spam detection       | min max scaling          | none                   | logistic regression | 0.927
Parkinsons detection | std scaler               | none                   | random forest       | 0.887
Abalone              | std scaler               | Nystroem               | random forest       | 0.552
Car                  | one hot                  | Nystroem               | gradient boosting   | 0.938

TABLE I: Baseline Pipelines Learned on Datasets

is shown in Fig. 1b. The result shows that applying no preprocessor yields the best performance with logistic regression. Combined with the default featurizers, a pipeline using min_max_scaler for integer fields and one_hot_encoder for categorical fields, with no preprocessing and a logistic regression classifier, is learned.

Finally, during parameter sweeping (Phase 3) the hyper-parameters are swept, leading to the final results in the third column. All of the intermediate search results are stored in the knowledge graph, shown as a snapshot in Fig. 1c. The upper part of the screenshot shows the knowledge schema organized as a knowledge graph, and on clicking each of the


schema nodes, the data instances are shown in the lower part of the workspace.

During this process, based on the pre-defined candidate models, preprocessors and profiling episodes, the system evaluated 4 pipelines (each parameterized with 10 sets of hyper-parameters) in the model selection phase. During the pipeline learning phase, 8 pipelines based on the selected classifier (logistic regression) and the candidate preprocessors were further evaluated (with 10 hyper-parameter sets tested in a single learning episode) until the optimal pipeline converged. This process does not provide systematic pipeline optimization and search. Instead, it leverages the decision space pre-defined by the data scientist to perform quick profiling and provides evidence for the data scientist to further refine their decisions.

B. Evaluation on Datasets

We evaluate the Meta-learning service on classification tasks using the default settings for featurizers. Our datasets include:

• The Reuters 50/50 dataset5 contains 2,500 texts (50 per author) for author identification.
• The IMDB movie review dataset6 contains 25,000 movie reviews obtained from IMDB. The classification task is to predict whether a movie review is positive or negative.
• The Adult dataset7 contains 48,842 instances. Each instance has 14 fields, including age (integer), working class (categorical), education (categorical), capital gain (float), etc., that constitute the feature space to predict one of two classes: salary > 50K or <= 50K.
• The Spam email detection dataset8 contains 4,601 entries with 57 float and integer features to detect whether a message is spam or not.
• The Parkinson's detection dataset9 contains 197 instances with 22 float features to detect whether a person has Parkinson's disease based on vocal characteristics.
• The Abalone dataset10 contains 4,177 instances with 8 float and integer attributes to detect the sex of abalone.
• The Car evaluation dataset11 contains 1,728 instances with 6 categorical data fields to classify purchasing decisions.

5 https://archive.ics.uci.edu/ml/datasets/Reuter+50+50
6 http://ai.stanford.edu/~amaas/data/sentiment/
7 https://archive.ics.uci.edu/ml/datasets/Adult
8 https://archive.ics.uci.edu/ml/datasets/Spambase
9 https://archive.ics.uci.edu/ml/datasets/Parkinsons
10 https://archive.ics.uci.edu/ml/datasets/Abalone

Detailed results are shown in Table I. The results show that the Meta-learning service generates competitive baseline pipelines for the data scientist to further work on.

VI. CONCLUSION AND FUTURE WORK

The Meta-learning service provides a novel framework for machine learning pipeline search that is transparent, interpretable and interactive. It serves as a profiling tool for data scientists: by incorporating human knowledge, the Meta-learning service performs efficient pipeline generation and profiling in the search space delineated by the data scientists, allows feedback to be injected along the way to alter the search space, and provides useful feedback for the data scientist to understand the best machine learning pipeline for the dataset of interest. While the Meta-learning service will by no means replace data scientists in finishing data science projects automatically, it can save a large amount of time on manual search and tuning. Currently it is deployed in the Maana Knowledge Platform to help data scientists build machine learning solutions faster, with better insight, and to facilitate knowledge management and sharing across different projects.

This framework leaves several paths for improvement. Until now we have only applied featurization to data based on its type; however, it is possible to perform some level of automated feature extraction and clean-up of the data. It should also be possible to use the knowledge gleaned from meta-attributes to guide the algorithm. More advanced parameter optimization can also be applied.

REFERENCES

[1] C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown, “Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms,” in Proc. of KDD-2013, 2013, pp. 847–855.

[2] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter, “Efficient and robust automated machine learning,” in Advances in Neural Information Processing Systems, 2015, pp. 2962–2970.

11 https://archive.ics.uci.edu/ml/datasets/Car+Evaluation

[3] R. S. Olson, N. Bartley, R. J. Urbanowicz, and J. H. Moore, “Evaluation of a tree-based pipeline optimization tool for automating data science,” in Proceedings of the Genetic and Evolutionary Computation Conference 2016. ACM, 2016, pp. 485–492.

[4] R. T. Fielding and R. N. Taylor, Architectural Styles and the Design of Network-based Software Architectures. University of California, Irvine, doctoral dissertation, 2000, vol. 7.

[5] F. Yang, D. Lyu, B. Liu, and S. Gustafson, “PEORL: Integrating symbolic planning and hierarchical reinforcement learning for robust decision-making,” in International Joint Conference on Artificial Intelligence (IJCAI), 2018.

[6] A. Cimatti, M. Pistore, and P. Traverso, “Automated planning,” in Handbook of Knowledge Representation, F. van Harmelen, V. Lifschitz, and B. Porter, Eds. Elsevier, 2008.

[7] R. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.

[8] J. Lee, V. Lifschitz, and F. Yang, “Action Language BC: A Preliminary Report,” in International Joint Conference on Artificial Intelligence (IJCAI), 2013.

[9] M. L. Puterman, Markov Decision Processes. New York, USA: Wiley Interscience, 1994.

[10] A. Schwartz, “A reinforcement learning method for maximizing undiscounted rewards,” in Proc. 10th International Conf. on Machine Learning. Morgan Kaufmann, San Francisco, CA, 1993.

[11] S. Mahadevan, “Average reward reinforcement learning: Foundations, algorithms, and empirical results,” Machine Learning, vol. 22, pp. 159–195, 1996.

[12] L. Kotthoff, C. Thornton, H. H. Hoos, F. Hutter, and K. Leyton-Brown, “Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 826–830, 2017.

[13] F. Hutter, H. H. Hoos, and K. Leyton-Brown, “Sequential model-based optimization for general algorithm configuration,” in International Conference on Learning and Intelligent Optimization. Springer, 2011, pp. 507–523.

[14] F. Eibe, M. Hall, and I. Witten, “The WEKA workbench,” online appendix for Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, 2016.

[15] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

[16] E. R. Sparks, S. Venkataraman, T. Kaftan, M. J. Franklin, and B. Recht, “KeystoneML: Optimizing pipelines for large-scale advanced analytics,” in Data Engineering (ICDE), 2017 IEEE 33rd International Conference on. IEEE, 2017, pp. 535–546.

[17] F. Baader, D. Calvanese, D. McGuinness, P. Patel-Schneider, and D. Nardi, The Description Logic Handbook: Theory, Implementation and Applications. Cambridge University Press, 2003.