
An Integrated Development Environment for Faster Feature Engineering

Michael R. Anderson, Michael Cafarella, Yixing Jiang, Guan Wang, Bochun Zhang
University of Michigan
Ann Arbor, MI 48109

{mrander, michjc, ethanjyx, wangguan, bochun}@umich.edu

ABSTRACT

The application of machine learning to large datasets has become a core component of many important and exciting software systems being built today. The extreme value in these trained systems is tempered, however, by the difficulty of constructing them. As shown by the experience of Google, Netflix, IBM, and many others, a critical problem in building trained systems is that of feature engineering. High-quality machine learning features are crucial for the system's performance but are difficult and time-consuming for engineers to develop. Data-centric developer tools that improve the productivity of feature engineers will thus likely have a large impact on an important area of work.

We have built a demonstration integrated development environment for feature engineers. It accelerates one particular step in the feature engineering development cycle: evaluating the effectiveness of novel feature code. In particular, it uses an index and runtime execution planner to process raw data objects (e.g., Web pages) in order of descending likelihood that the data object will be relevant to the user's feature code. This demonstration IDE allows the user to write arbitrary feature code, evaluate its impact on learner quality, and observe exactly how much faster our technique performs compared to a baseline system.

1. INTRODUCTION

Many of today's most compelling software systems, such as Google's core search engine, Netflix's recommendation system, and IBM's Watson question answering system, are trained systems that use machine learning techniques to extract value from very large datasets. They are quite difficult to construct, requiring many years of work, even for the most sophisticated technical organizations. One reason for this difficulty appears to be feature engineering, which we have discussed previously [1].

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/. Obtain permission prior to any use beyond those covered by the license. Contact copyright holder by emailing [email protected]. Articles from this volume were invited to present their results at the 40th International Conference on Very Large Data Bases, September 1st - 5th 2014, Hangzhou, China. Proceedings of the VLDB Endowment, Vol. 7, No. 13. Copyright 2014 VLDB Endowment 2150-8097/14/08.

A feature is a set of values distilled from raw data objects and used to train a supervised machine learning system. For example, consider a Web search engine that uses a trained regressor to estimate the relevance of a given Web page to a user's query; suitable features might include whether the user's query appears in italicized type, whether the user's query is in the document's title, and so on. Human-written features are generated by applying user-defined feature code to a raw data object. Good feature code is not only correct from a software engineering point of view, but it also produces a useful feature for the machine learning task at hand.
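To make the notion of feature code concrete, here is a minimal sketch of what such functions might look like. The function names and the page representation are our own illustrative inventions; the paper does not prescribe any particular API.

    # A hedged sketch of human-written feature code. `page` stands for one
    # raw data object (a parsed Web page, here a dict) and `query` for the
    # user's search query; both representations are assumptions.
    def query_in_title(page, query):
        """Feature: does the user's query appear in the document title?"""
        return query.lower() in page.get("title", "").lower()

    def query_in_italics(page, query):
        """Feature: does the query appear inside italicized text?"""
        italic_spans = page.get("italic_text", [])  # assumed pre-extracted
        return any(query.lower() in span.lower() for span in italic_spans)

    # Applying the feature code to one raw object yields one feature vector.
    def featurize(page, query):
        return [query_in_title(page, query), query_in_italics(page, query)]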

Features have two important characteristics. First, good features are crucial for the overall performance of trained systems [1]. Public accounts from the engineering teams of successful trained systems indicate that their success comes from slow development of many good features, rather than statistical breakthroughs [5, 7].

Second, features are difficult for engineers to write. Because the raw input data is large and diverse (as in a Web crawl), the "specification" of the feature is always unclear. It is often difficult for the engineer to predict whether a particular feature will in fact improve the learned artifact's accuracy; the programmer may struggle to implement a feature function that will ultimately serve no useful purpose. As a result, feature development involves many small code-and-evaluate iterations by the engineer.

Unfortunately, each such iteration can be time-consuming. Every evaluation entails applying the novel feature function to a massive input dataset, using the resulting feature data to train a machine learning artifact, and then testing the artifact's accuracy. Our demonstration integrated development environment (IDE) aims to speed up this code-and-evaluate loop. Its intended impact on engineer productivity is designed to be akin to that of giving the engineer a faster compiler: she will not necessarily write better code, but she should be able to evaluate the effectiveness of her code more quickly. Our system should thus improve the productivity of feature engineers, thereby enabling trained systems that are either more accurate or less expensive to construct.

Modern Feature Engineering — Today there is no explicit support for feature engineering in most software stacks. When developing novel feature code, a feature engineer uses the following evaluation sequence (sketched in code after the list):

1. The engineer uses a bulk data processing system, such as MapReduce [4], to apply the feature functions to a large raw dataset. This job is very time-consuming due to the large input, but it produces the desired set of feature vectors.

2. The feature vectors are used to train a machine learning artifact, generally using a standard software toolkit, such as Weka [6]. The precise runtime cost of this step depends on the chosen machine learning technique.

3. The resulting artifact—say, a text classifier that determines whether a news story is related to politics, sports, or entertainment—is evaluated for accuracy using a holdout set of labeled data. If the artifact's accuracy is not sufficient, the engineer modifies her feature code and returns to the first step. If the artifact's accuracy is acceptable, the engineer's job is complete.
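One iteration of this sequence can be rendered schematically as follows. This is our own sketch, not the authors' code: `train_fn` and `score_fn` stand in for a learning toolkit (e.g., Weka) and a holdout-set accuracy measure, and all names are illustrative.

    # Schematic single iteration of the code-and-evaluate loop.
    def evaluate_iteration(feature_fn, raw_data, labels,
                           train_fn, score_fn, holdout):
        # Step 1: bulk feature application over the entire raw dataset.
        # In practice this runs on a system like MapReduce and dominates
        # the runtime of the whole iteration.
        vectors = [feature_fn(obj) for obj in raw_data]
        # Step 2: train a machine learning artifact from labeled vectors.
        artifact = train_fn(vectors, labels)
        # Step 3: measure the artifact's accuracy on held-out labeled data.
        return score_fn(artifact, holdout)

If the returned accuracy is insufficient, the engineer edits `feature_fn` and calls `evaluate_iteration` again, paying the full cost of Step 1 each time.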

The runtime costs of machine learning, reflected in the second step here, are well-known and are an active area of research. We focus instead on the first step: the time spent in bulk data processing. Current systems simply process the entire raw input dataset in an arbitrary order, and always process the data in full. If instead the bulk data processor chose to process the most promising raw inputs first, the engineer could terminate the processing task early and still be able to evaluate the current iteration of the feature code.

Technical Challenge — The technical core of our work is a bulk data processing system that performs input selection optimization. Instead of blindly processing every raw input, our processor chooses inputs that will best exploit the user's feature code. For example, if the user's feature code is most useful when applied to news-oriented Web pages, the processor should locate news pages for processing first. In contrast, if the user's feature code is designed to predict e-commerce prices, the processor should instead focus on pages from Amazon.com. Because the feature code changes with every invocation of the system, the "usefulness" of an input can change with each iteration; preprocessing the input data to determine input utility is impossible, so the system must locate and choose useful inputs on the fly.

This problem is somewhat similar to that of active learning, in which a learning system is given a budget of supervised data labels [8]. The system must choose which training feature vectors to label in order to maximize the overall improvement to the artifact's accuracy. Unlike active learning, however, in our setting we wish to avoid the time costs associated with obtaining the feature vectors themselves. This difference makes most active learning techniques irrelevant.

Our technical solution combines a runtime planner with domain-independent indexing and is called Zombie. Its technical details are described in a separate paper currently under preparation. This paper describes the working IDE that is built on top of the Zombie technique.

Demonstration — Our IDE allows the user to enter arbitrary feature function code and then apply that function to a large input set of Web pages. The user can process the features using either a standard Baseline data processing system or our Zombie system. All tasks run on a configurable number of machines on Amazon's EC2 service.

In the rest of this paper we will provide an overview of our system's approach to evaluating feature code and its architecture (Section 2). We also describe its user interface (Section 3) and details of our live demonstration (Section 4).

2. SYSTEM FRAMEWORK

We now examine our feature development system's inputs and outputs in more detail.

Figure 1: Snapshots over time of Zombie's performance compared with Baseline.

2.1 Inputs and Outputs

In addition to novel feature function code, an engineer who uses the feature IDE must supply a handful of parameters to configure the overall learning task (a minimal configuration sketch follows the list):

• A large raw dataset, such as a Web crawl. The IDE will obtain feature vectors by applying the user's functions to each element in this dataset.

• A method for providing a supervised label for the generated feature vectors. This could be a database of human-provided labels or an algorithmic "distant supervision" technique [9].

• A machine learning training procedure that consumes the labeled feature vectors and produces a learned artifact, such as a naïve Bayes classifier. This procedure is likely drawn from a machine learning library.

• A method for scoring the learned artifact's quality. This could simply measure the accuracy of the learned artifact over a holdout set of labeled training vectors.
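One way to picture these four static parameters is as a single configuration object, sketched below. The container and field names are our invention, not the IDE's actual interface.

    from dataclasses import dataclass
    from typing import Any, Callable, Iterable

    # Hypothetical container for the four static parameters. Only the
    # feature code itself changes between iterations; this stays fixed.
    @dataclass
    class LearningTaskConfig:
        raw_dataset: Iterable[Any]        # e.g., a Web crawl or Wikipedia dump
        label_fn: Callable[[Any], str]    # human labels or distant supervision [9]
        train_fn: Callable[..., Any]      # e.g., a naive Bayes trainer from a library
        score_fn: Callable[[Any], float]  # e.g., accuracy over a holdout set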

The engineer's feature code will change repeatedly. In contrast, these four parameters will remain static for the life of an engineering task. They define the environment in which the user's feature code must be successful. For our demonstration, we will preconfigure these values to describe a four-way classification task that consumes a dump of Wikipedia, uses synthetic labels derived from semantic tags on those pages, and employs multiclass naïve Bayes classification. While this is a simple task designed to allow demo users to quickly complete a full evaluation sequence, it takes full advantage of the behavior of the Zombie system.

When the user has completed a draft of the feature code and is ready to begin Step 1 of the above evaluation sequence, she clicks "Go" on the IDE's main interface. The feature development system then applies the feature function code to each element in the raw input set. Since the raw input set can be extremely large, this function application step is often very time-consuming. A standard feature development system might use MapReduce; our IDE uses the Zombie system for choosing which raw inputs to process first.

Runtime Output — One way the engineer can save time is to terminate function application early and simply train the artifact using whatever feature vectors were in fact generated. Doing so with the Baseline system will likely cause many high-value vectors to be dropped from the output. Zombie, however, locates and selects the most useful items early in the function application process, so that early termination will lose comparatively fewer high-value vectors. For a given amount of processing time, our Zombie-powered IDE can produce a set of training vectors that is more useful than that produced by MapReduce's blind scans of the input set. Of course, there is no secret raw data available to our system: if it is run to completion, Baseline and Zombie will produce identical training sets.

Figure 1 shows graphs for a single run of the IDE on a single set of user-written feature code. The x-axis is time spent in feature function execution, and the y-axis shows the number of high-value raw data items processed so far. In this dataset 0.5% of the items are high-value, but a real-life dataset would likely have even fewer: consider a task requiring web pages of computer science faculty. Given a crawl of roughly 5 billion pages and an estimated 50,000 computer science faculty in the US, the high-value percentage falls to 0.001%. Even limited to academic domains, the high-value percentage would be far less than 1%.

Describing raw inputs as "high-value" or "low-value" is a helpful but unnecessary simplification. Our system also works with data items that have fine-grained differences in utility. Further, it works with pages whose utility depends on whether similar items have been seen previously. For example, a web page from a faculty member of a yet-unseen institution may have more utility than one from an institution that has already provided several pages.

The tan upper line of Figure 1 describes the raw data items processed by Zombie, while the lower blue line describes the raw data items processed by Baseline. As the IDE runs, it continually updates the graph with new results. The engineer can use this information to decide when the generated training vector set is "good enough" and then terminate the function application. The Zombie line clearly shows that it is able to find many more high-quality raw inputs early on; a user who terminates after, say, 25 seconds of execution would find that Zombie yields 3-4x as many high-value inputs as the traditional Baseline system. After roughly 75 more seconds of execution, both mechanisms have found an identical training set. Of course, in non-demonstration settings these runtimes would be tens of minutes or even hours.

2.2 Our Approach

The core novel functionality of our IDE is supplied by the Zombie execution system. It must decide which raw data inputs will yield the biggest payoffs to the learned artifact, without actually spending the computation time necessary to convert the raw inputs to feature vectors. In some ways this task resembles a multi-armed bandit problem [3], in which an agent must simultaneously gather information about the payouts of the bandit arms while exploiting the best of those arms to maximize overall utility. Well-studied policies exist for the multi-armed bandit problem, and Zombie adopts one of them.

A standard bandit has two main components: a set of bandit arms and a reward function that determines the utility given by a "pull" of one of those arms. We obtain the set of arms by performing an initial clustering of the raw inputs, which is independent of any feature code. The Zombie planner can decide to "pull" by choosing a random element from a cluster and processing it using the programmer's feature functions. The pull's reward is given by some measure of the resulting feature vector's impact on the quality of the learned artifact.

Figure 2: The IDE's basic runtime architecture. (The diagram shows a User Client and a Web Application Server on the local machine, and, on Amazon EC2, a Master Node coordinating Zombie Workers and Baseline Workers over a Zombie Index and the Raw Corpus.)

We use the upper confidence bound algorithm UCB1 [2] to repeatedly determine which arm (that is, which cluster of inputs) to examine next.
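For concreteness, UCB1 scores each arm j by its empirical mean reward plus an exploration bonus, mean_j + sqrt(2 ln n / n_j), where n is the total number of pulls so far and n_j the number of pulls of arm j. The sketch below is our own minimal rendering of this policy over clusters of raw inputs; it is not the Zombie implementation, whose details appear in the companion paper, and the clustering and reward computation are assumed to exist elsewhere.

    import math
    import random

    def ucb1_pick(means, counts, total_pulls):
        """Return the index of the cluster (arm) to sample from next."""
        for j, n_j in enumerate(counts):
            if n_j == 0:
                return j  # pull every arm once before using the bonus
        scores = [means[j] + math.sqrt(2 * math.log(total_pulls) / counts[j])
                  for j in range(len(counts))]
        return max(range(len(scores)), key=scores.__getitem__)

    def ucb1_loop(clusters, reward_fn, budget):
        """Process `budget` raw inputs, favoring high-reward clusters."""
        means = [0.0] * len(clusters)
        counts = [0] * len(clusters)
        for t in range(1, budget + 1):
            j = ucb1_pick(means, counts, t)
            item = random.choice(clusters[j])       # one "pull" of arm j
            r = reward_fn(item)                     # feature vector's impact
            counts[j] += 1
            means[j] += (r - means[j]) / counts[j]  # running mean reward
        return means, counts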

2.3 System Architecture

Figure 2 shows the IDE's basic architecture. A web front-end server presents an interface to the user and spawns any needed jobs on machines at Amazon's EC2 service.

When the engineer is writing a new feature function, all communication is between the web application and the engineer's browser. When the developer decides to evaluate the feature code, the IDE starts to use the Amazon EC2 service. It launches a series of Worker machines, which apply the user's feature code to the raw data inputs, thereby obtaining usable feature vectors. The Baseline Workers simply scan through the entire raw input data. The Zombie Workers attempt to choose inputs intelligently through use of the feature-independent index described above. Each Worker runs on a different region of the input set and reports results back to a Master Node every few seconds. The Master Node collates responses from the Workers and presents ongoing statistics to the user's browser. The Baseline Workers are present for comparative illustration only; in real-world operation they would be omitted, with their computing resources instead used as Zombie Workers.
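The worker-to-master reporting pattern might be pictured as follows. This is a schematic with invented names (`report_fn` standing in for whatever RPC the system actually uses), not the system's code.

    import time

    # Schematic worker loop: process one region of the input set and
    # report running statistics to the Master Node every few seconds.
    def worker_loop(region, feature_fn, report_fn, report_interval=5.0):
        processed, last_report = 0, time.time()
        for item in region:           # a Zombie Worker would instead pick
            feature_fn(item)          # items via the feature-independent index
            processed += 1
            if time.time() - last_report >= report_interval:
                report_fn({"processed": processed})  # sent to the Master Node
                last_report = time.time()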

Although the feature engineer ultimately aims to build a high-quality learned artifact, in this demonstration system we want to illustrate how effectively Zombie chooses useful inputs. Thus, as described below in Section 3, the IDE emphasizes runtime execution statistics like those in Figure 1.

3. USER INTERFACE

The Zombie IDE is a web application with two primary pages. The first page, shown in Figure 3, is a window where the feature engineer can write feature code and configure details of the learning task. A realistic user of our system would spend all of her programming time using this interface. For our demonstration system, the user can either write novel feature code or choose from several pre-written functions. We plan to add more features to the IDE's editor in the future, as described in Anderson et al. [1].

The second page of the interface is shown in Figure 4. With this runtime control panel, the engineer starts the feature function application and evaluation process. The lower third of the page is for job control: the user chooses how many worker machines to use and when to start and stop.

The upper two-thirds of the page is devoted to a live display of Zombie's ability to find high-value inputs. This display is a live version of the basic performance curves shown in Figure 1 and is updated every few seconds. The user can watch this display and decide when to terminate processing early. Our model of the feature engineer suggests she will repeatedly modify feature code using the editor (Figure 3) and then evaluate the code by switching to the runtime control panel (Figure 4).

Figure 3: The IDE's code editor. For demonstration purposes, users can either write code from scratch or choose from among several pre-written functions.

4. DEMONSTRATION DETAILS

Our demonstration system allows conference attendees to test Zombie on a real text classification task. Users can either choose from several pre-written feature functions or write their own arbitrary feature code. To allow users to observe meaningful differences between Baseline and Zombie in just a short demo-style interaction, we will configure the system to process a text corpus that is unrealistically small but functions well for demonstration purposes.

If attendees choose to write their own feature code, we will suggest the following script to illustrate how a feature engineer can be more productive when using our IDE:

1. The feature engineer is interested in building a classifier for news pages on the Web, wanting to mark each page as politics, sports, entertainment, or other. The raw dataset is a dump of Wikipedia pages. The engineer wants to iteratively develop a set of features, but of course, each change to the feature code can require a time-consuming evaluation process. For this task, the other category is much more common than the others; Zombie should prioritize the non-other raw inputs.

2. The feature engineer writes two feature functions (sketched in code after this list). Feature function A is fast and simply tests for the presence of a handful of politics-relevant words: president, politician, etc. Feature function B is time-consuming, entailing a natural language parse of the input page.

3. The feature engineer applies function A, followed by function B. For each function, the IDE runs Zombie in parallel with Baseline. Zombie should find relevant feature vectors faster than Baseline, despite a modest amount of additional system overhead. The conference attendee will see that Zombie offers a very large advantage when evaluating the time-consuming function B, but a lesser one when evaluating A. The attendee can further modify the feature code to see its impact on Zombie's performance delta over Baseline.
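The two functions in this script might look roughly as follows. Function A is a cheap keyword test; for function B we use an NLTK part-of-speech pass as a stand-in for "a natural language parse," purely to illustrate the cost asymmetry. Both bodies, and the word list, are our illustrative guesses, not the demo's pre-written code.

    import nltk  # requires the punkt and averaged_perceptron_tagger data packages

    # Feature function A: fast keyword test (illustrative word list).
    POLITICS_WORDS = {"president", "politician", "senate", "election"}

    def feature_a(page_text):
        tokens = set(page_text.lower().split())
        return len(tokens & POLITICS_WORDS) > 0

    # Feature function B: an expensive NLP pass over the full page.
    # Any heavyweight per-page analysis would exhibit the same asymmetry.
    def feature_b(page_text):
        counts = {}
        for sentence in nltk.sent_tokenize(page_text):
            for _, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
                counts[tag] = counts.get(tag, 0) + 1
        return counts.get("NNP", 0)  # e.g., count of proper nouns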

Figure 4: The IDE’s runtime control panel.

Finally, we will log all feature functions written by conference attendees. These functions will inform our own research and suggest interesting test scenarios for future users.

5. CONCLUSION

Our demonstration feature engineering IDE showcases the Zombie system. By choosing raw data inputs intelligently, rather than performing the flat scan that is standard today, it can substantially reduce the amount of time that feature engineers idle away unproductively. We believe our demonstration IDE is an important first step toward a data-centric development environment for feature engineers.

6. ACKNOWLEDGMENTS

This project is supported by National Science Foundation grants IGERT-0903629, IIS-1054913, and IIS-1064606, as well as by gifts from Yahoo! and Google.

7. REFERENCES

[1] M. Anderson, D. Antenucci, V. Bittorf, M. Burgess, M. Cafarella, A. Kumar, F. Niu, Y. Park, C. Ré, and C. Zhang. Brainwash: A data system for feature engineering. In CIDR, 2013.

[2] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

[3] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

[4] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004.

[5] D. Ferrucci et al. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79, 2010.

[6] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: An update. SIGKDD Explorations Newsletter, 11(1):10–18, 2009.

[7] S. Levy. How Google's Algorithm Rules the Web. Wired, February 2010.

[8] B. Settles. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2009.

[9] C. Zhang, F. Niu, C. Ré, and J. W. Shavlik. Big data versus the crowd: Looking for relationships in all the right places. In ACL, 2012.
