Intel Labs Graph Analytics Operation
Machine Learning may nourish the soul… but Data Preparation will consume it.
Source: Wikipedia (Hell)
Source: Wikipedia (Banquet)
Machine Learning on Large Datasets
Data Quality and Feature Engineering
[Diagram labels: Input Data; Extract Transform Load; "Argghh!"; Feature Data; Training Set; Validation Set; Test Set; Build Model; Validate; New Data; Value. Phase labels: Supervised Learning; Supervised and Unsupervised Learning.]
• Figure out what’s there
• Extract a bunch of features
• Figure out what’s needed
• Finalize and feed
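The Training Set, Validation Set, and Test Set named on this slide can be produced with a simple shuffled split. This is a minimal sketch, not the speakers' method: the function name `split_dataset`, the 60/20/20 fractions, and the fixed seed are all assumptions chosen for illustration.

```python
import random


def split_dataset(rows, train_frac=0.6, validation_frac=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and test sets.

    The fractions are illustrative defaults; whatever is left after the
    training and validation slices becomes the test set.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = int(len(shuffled) * train_frac)
    n_validation = int(len(shuffled) * validation_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]
    return train, validation, test
```

Keeping the test set untouched until the model is final is what makes the Validate step meaningful.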
Problems with Processing Large Datasets
Not turn-key
Are data scientists really expected to know…
how to set up Hadoop from scratch?
Java, Pig, Hadoop APIs?
how to extend with UDFs?
how to extract, analyze and visualize output beyond Hadoop?
“After hours of debugging our Hadoop setup, I was ecstatic to run a Hadoop command without a java stack trace.”
- Zach
Not agile
                      Traditional Environment   Distributed Environment
Command response      < 1 sec                   > 30 sec
Dependency inclusion  Simple                    Several steps and changes
Validation            Established methods       Not clear
Development Cycle     Fast Iteration            Slow or linear
Apache Pig
• A dataflow processing system for MapReduce
• A high-level scripting language -- Pig Latin
Why Pig for ETL?
• Easy to get up & running
• Easy to program – simple declarative scripting language, built-in dataflow primitives
• Nested data model support
• First-class extensibility – custom filters, transforms, input/output formats, etc.
• Automatic dataflow optimization – Pig/MR runtime: ~0.97x for 0.12
• As configurable as MR
The story gets even better
• Elephant Bird – good support for different formats, codecs, etc.
• DataFu – Pig UDFs for data mining & statistics
• PiggyBank – collection of additional UDFs
So, we’re done, right?
No. Many open challenges, including complex models.
Property Graph Data Models
Source: Tinkerpop (Property Graph)
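A property graph attaches key/value properties to both vertices and edges, and gives each edge a label. The sketch below is a minimal in-memory illustration of that model; the class names, the `out_edges` helper, and the property names in the usage lines are hypothetical and are not the Tinkerpop or Titan API.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    vid: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    src: str
    dst: str
    label: str
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Directed multigraph whose vertices and edges carry properties."""

    def __init__(self):
        self.vertices = {}  # vertex id -> Vertex
        self.edges = []     # list of Edge, parallel edges allowed

    def add_vertex(self, vid, **properties):
        self.vertices[vid] = Vertex(vid, properties)

    def add_edge(self, src, dst, label, **properties):
        self.edges.append(Edge(src, dst, label, properties))

    def out_edges(self, vid, label=None):
        """Return edges leaving vid, optionally restricted to one label."""
        return [e for e in self.edges
                if e.src == vid and (label is None or e.label == label)]


graph = PropertyGraph()
graph.add_vertex("ted", kind="person")          # "kind" is an illustrative property
graph.add_vertex("bicycles", kind="product")
graph.add_edge("ted", "bicycles", "likes", rating=5)  # "rating" is illustrative
```

A homogeneous graph is the special case where every vertex and edge has the same type and no properties.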
Graph Applications
Machine Learning
• Neural Networks
• Deep Learning (RBM)
• Belief Propagation
• Label Propagation/ARW
• Collaborative Filtering (ALS, SGD, SVD)
• Topic Modeling (LDA)
• K-Means
Mining
• PageRank
• Random Walk with Restart
• Connected Components
• Triangle Counting
• K-Truss
• Centrality Measures
• Network Diameter
• Degree Distribution
Traversal (Search)
• Depth-First Search
• Breadth-First Search
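As one concrete instance of the algorithms listed here, PageRank can be written as a short power iteration over an edge list. This is a single-machine sketch for intuition only; the damping factor of 0.85, the fixed iteration count, and the uniform redistribution of rank from dangling nodes are conventional assumptions, not details from the talk.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Compute PageRank for a directed graph given as (src, dst) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no out-links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n
                    for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank
```

At the scale discussed in this talk the same iteration would run on a distributed graph engine rather than in a Python loop.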
Graphical Machine Learning
[Diagram. Stage labels: Input Data → Construct Graph → Build Model → Serve Model → Insight & Prediction. Source labels: HDFS; DB; Web Docs. Component labels: Intel Graph Builder; Graph Query; Processing & Storage.]
• Need fully-integrated solutions that are easy to program
• Scale like Hadoop; speed and accuracy of in-memory graph analytics and mining
• Enables applications in broadband services, network security, retail, life sciences,
financial markets, etc.
Graph Processing: Technology Challenges
Intel Labs continues to work on the gaps.
Performance – Has skyrocketed with in-memory and asynchronous graph engines and scalable graph query architectures
Integration – Multiple frameworks are difficult to synchronize, coordinate, and manage
Data Models – Most large-scale work is still on homogeneous graphs, but property graphs and meta-path concepts are more widely discussed
Algorithms – A wide range of toolkits with graph mining and graphical machine learning algorithms, with more sophisticated and scaled versions arriving “every day”
Data Visualization – No great packages to visualize relationships du jour, and interactive big data sampling and projection are too crude & slow
Programming – Challenging programming models in languages not popular with data scientists, IT developers, and other end-users
Data Preparation – Takes way too long, is way too manual, and is fraught with error
[Status labels on slide: Progress! / Traction / Not so much]
Nothing specific for graph ETL. What’s needed:
• support for well-known input-output graph formats
• graph specific filters & transforms
• STORE functions for graph stores
Pig ETL for Graphs?
Original Vision
Graph Builder 2 Alpha
• Construction of heterogeneous information networks with Pig
• Better “progressive refinement” during acquisition, cleaning, and integration
• Incremental graph construction
• Interfacing for popular graph databases (Titan, RDF output, etc.)
[Figure: example property graph. Vertex labels: Ted, Kushal, Mohit, Danny, Ivy, Frank, Nezih, Food Cart, Bicycles. Edge labels: friends, brothers, likes, uses. Subgraph labels: Social Graph, Ratings Graph, Product Graph. Inference shown: Ted may like bicycle-powered food cart.]
* Inspired by “Titan: Rise of Big Graph Data,” by M. Rodriguez and M. Broecheler
Example Stack Architecture
[Diagram labels: Raw Data; RDBMS; HDFS; NFS; Graph ETL; Pig Graph Builder; Hadoop; HDFS; HBase; Titan; ZooKeeper; Giraph; Mahout; Graph Analytics; ML; Real-time Graph Queries; Blueprints; Rexster; Gremlin; Feature Store; Model Store]
67033:-20071306431384422339653 http://www.kog.com http://www.dlstainedglass.com 2
91658:-20071306431384422339653 http://www.kog.com http://www.haegerstainedglass.com 2
941:-19442631361384422339653 http://www.ks-p.jp http://www.drag-race.nuhuh.bee.pl 1
44116:-18273037921384422339653 http://www.kune.fr http://www.chezfanny.fr 3
36891:-18273037921384422339653 http://www.kune.fr http://www.wp-jobboard.kune.fr 3
79906:-17817899301384422339654 http://www.kwc.edu http://www.umsl-sports.com 1
2238:-17817799001384422339654 http://www.kwc.org http://www.onlamp.com 1
68133:-17817799001384422339654 http://www.kwc.org http://www.tjhsst.edu 1
30677:-17817799001384422339654 http://www.kwc.org http://www.floydlandis.com 1
81185:-17817799001384422339654 http://www.kwc.org http://www.you-are-here.com 1
47527:-17817799001384422339654 http://www.kwc.org http://www.phonak-cycling.ch 1
63112:-17817799001384422339654 http://www.kwc.org http://www.link.brightcove.com 1
74837:-17817799001384422339654 http://www.kwc.org http://www.trustbut.blogspot.com 6
53668:-17817799001384422339654 http://www.kwc.org http://www.icanhascheezburger.com 4
97945:-17817799001384422339654 http://www.kwc.org http://www.mythbustersfanclub.com 12
93849:-17709983361384422339654 http://www.kwmd.us http://www.sierraclub.typepad.com 1
51421:-17700453681384422339654 http://www.kwne.jp http://www.ppvj.co.jp 1
13022:-17651665521384422339654 http://www.kwu.edu http://www.rollinghillszoo.com 2
16530:-17113867601384422339654 http://www.kyou.nu http://www.fan.unfading-scar.net 2
14199:-16755866041384422339654 http://www.kzy.com http://www.wbbm780.com 1
95253:-16755866041384422339654 http://www.kzy.com http://www.brewview.com 1
25828:-14077538951384422339655 http://www.lee.org http://www.kaiju.com 1
88133:-14077538951384422339655 http://www.lee.org http://www.sfgov.org 2
94243:-14077538951384422339655 http://www.lee.org http://www.liftport.com 1
56826:-14077538951384422339655 http://www.lee.org http://www.nishioka.com 1
88574:-14077538951384422339655 http://www.lee.org http://www.Smartflix.com 1
81966:-14077538951384422339655 http://www.lee.org http://www.smartflix.com 145
83164:-14077538951384422339655 http://www.lee.org http://www.torrentspy.com 1
99087:-14077538951384422339655 http://www.lee.org http://www.SerpentMother.com 1
39124:-14077538951384422339655 http://www.lee.org http://www.serpentmother.com 3
95995:-14077538951384422339655 http://www.lee.org http://www.toolbar.google.com 2
Extract Transform Load
Parse HTML, look for links and words
Graph Builder to Titan
Archive Records
{ "url": "http://www.1stvwparts.com/default.php?cPath=159", "timestamp": "20091120145711", "links": ["http://www.1stvwparts.com/shopping_cart.php", "http://www.partsfirm.com", ...], "words": ["covermodel", "golf", "rabbit", "just", "text", "hig", "platestainless", "find", "copyright", "html", "php"] }
http://www.1stvwparts.com/default.php?cPath=159 74.86.123.84 20091120145711 text/html 28628 HTTP/1.1 200 OK <table border="0" width="100%" cellspacing="0" cellpadding="0"> <tr> <td width="100%" class="infoBoxHeading_search">Quick Find</td> </tr></table><table border="0" width="100%" cellspacing="0" cellpadding="0" class="infoBox_search"> <tr> <td><table border="0" width="100%" cellspacing="0" cellpadding="3" . . .
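The "Parse HTML, look for links and words" step can be sketched with Python's standard `html.parser`. This is an illustration of the idea rather than the code used for the talk: the class name `LinkAndWordExtractor`, the helper `extract_record`, and the rule that a word is a run of ASCII letters are assumptions. The output dictionary mirrors the field names of the JSON record shown above.

```python
import re
from html.parser import HTMLParser


class LinkAndWordExtractor(HTMLParser):
    """Collect href targets of <a> tags and lower-cased words from text nodes."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Assumption: a word is a run of ASCII letters.
        self.words.extend(re.findall(r"[a-z]+", data.lower()))


def extract_record(url, timestamp, html):
    """Build one record with the same fields as the archive JSON example."""
    parser = LinkAndWordExtractor()
    parser.feed(html)
    parser.close()
    unique_words = list(dict.fromkeys(parser.words))  # de-duplicate, keep order
    return {"url": url, "timestamp": timestamp,
            "links": parser.links, "words": unique_words}
```

Note that this sketch also collects text inside script and style elements, which is one way non-visible tokens can end up in the word list.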
PageRank and Latent Dirichlet Allocation
Graph ETL Example
row src dst #links
Development Flow
(or, what actually happened)
Extract with python
Develop transforms
Test on a couple files
Fix bugs
Run python in Jython (fail miserably)
Spend too much time enabling
Write UDF in Java
Find limitations
Develop custom load UDF instead
…
Development Pains with Pig As-Is
Data Process Flow
Load with Pig
Turn into edge list (Pig, UDF)
Store to HDFS (Pig)
Load into Titan (GraphBuilder)
Run ML algorithms (Giraph)
Model queries (Gremlin)
All of this before any Machine Learning!
Custom UDFs add a lot of complexity, time and effort.
If you don’t have this… you’re stuck with this…
Out-of-the-Box Tools
package org.apache.pig.builtin;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;

// Splits a chararray into a bag of single-field tuples, one tuple per token.
public class TOKENIZE extends EvalFunc<DataBag> {
    TupleFactory mTupleFactory = TupleFactory.getInstance();
    BagFactory mBagFactory = BagFactory.getInstance();

    public DataBag exec(Tuple input) throws IOException {
        try {
            DataBag output = mBagFactory.newDefaultBag();
            Object o = input.get(0);
            if (o == null) {
                return null; // null field in, null bag out
            }
            if (!(o instanceof String)) {
                throw new IOException("Expected input to be chararray, but got " + o.getClass().getName());
            }
            // Delimiters: space, double quote, comma, parentheses, asterisk.
            StringTokenizer tok = new StringTokenizer((String) o, " \",()*", false);
            while (tok.hasMoreTokens()) {
                output.add(mTupleFactory.newTuple(tok.nextToken()));
            }
            return output;
        } catch (ExecException ee) {
            // Rethrow so the failure is reported instead of silently swallowed.
            throw ee;
        }
    }

    // Declares the output as a bag of tuples, each holding one chararray token.
    public Schema outputSchema(Schema input) {
        try {
            Schema.FieldSchema tokenFs = new Schema.FieldSchema("token", DataType.CHARARRAY);
            Schema tupleSchema = new Schema(tokenFs);
            Schema.FieldSchema tupleFs;
            tupleFs = new Schema.FieldSchema("tuple_of_tokens", tupleSchema, DataType.TUPLE);
            Schema bagSchema = new Schema(tupleFs);
            bagSchema.setTwoLevelAccessRequired(true);
            Schema.FieldSchema bagFs = new Schema.FieldSchema("bag_of_tokenTuples", bagSchema, DataType.BAG);
            return new Schema(bagFs);
        } catch (Exception e) {
            return null;
        }
    }
}
X = FOREACH A GENERATE
TOKENIZE(f1);
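For readers who do not write Java, the core of the UDF above is only the split on a fixed delimiter set. A rough Python equivalent of that logic is sketched here; the function name `tokenize` is hypothetical, and the Pig bag and schema handling has no counterpart in this sketch.

```python
import re

# Same delimiter characters as the StringTokenizer call in the Java UDF:
# space, double quote, comma, parentheses, asterisk.
DELIMITERS = ' ",()*'
_SPLIT_PATTERN = re.compile("[" + re.escape(DELIMITERS) + "]+")


def tokenize(text):
    """Split text on the delimiter set, dropping empty tokens."""
    if text is None:
        return None
    if not isinstance(text, str):
        raise TypeError("Expected input to be str, but got " + type(text).__name__)
    return [token for token in _SPLIT_PATTERN.split(text) if token]
```

The contrast in length between the two versions is the point of this slide: most of the Java is Pig plumbing, not tokenization.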
(More of these please)
Breadth of Knowledge
Load Raw Data
Extract Links
Filter Bad Data
Group Like Links Together
Store - HBase
Store into Titan (Graph Builder)
Pig Java MapReduce
Even if you have ninja skills, you’ll still need to deal with weirdness.
Random Record
{ "url": "http://www.1stvwparts.com/default.php?cPath=159", "timestamp": "20091120145711", "words": ["covermodel", "golf", "rabbit", "just", "text", "hig", "platestainless", "find", "copyright", "gti", "cache", "returnsprivacy", "partsfirm", "rear", "scratch", "passat", "gmt", "plate", "handle", "scuff", "was", "hatch", "wagonnew", "got", "nov", "password", "advanced", "and", "aluminum", "controlbackup", "server", "specials", "quick", "x", "largo", "www", "splash", "items", "edition", "beetle", "home", "even", "cargo", "what", "mats", "for", "sportwagen", "please", "anniversary", "content", "lights", "fog", "new", "email", "monster", "ipod", "beetlepassatrabbitroutantiguantouareg", "quite", "cart", "acoustic", "apache", "november", "includes", "by", "care", "search", "ok", "of", "openssl", "shipping", "covereuropean", "s", "products", "stainless", "wheels", "protector", "com", "vinyl", "duty", "pre", "encoding", "cc", "sweet", "noticeconditions", "clear", "unix", "electronic", "post", "cabrio", "powered", "chrome", "transfer", "fri", "accent", "mkv", "cpath", "jetta", "maintenance", "type", "store", "more", "function", "shopping", "gear", "hub", "tiguan", "park", "pragma", "brushed", "must", "steel", "distance", "account", "look", "default", "bbs", "cap", "us", "guards", "reviews", "routan", "fun", "chunked", "my", "usecontact", "control", "they", "bumper", "warning", "brake", "eurovan", "exterior", "close", "rig", "dent", "check", "registerforgot", "information", "revalidate", "sensors", "connection", "member", "html", "touareg", "online", "interior", "european", "wheel", "http", "though", "models", "bestsellers", "expires", "categories", "kit", "catalog", "date", "php", "e", "center", "light", "thu", "no", "cover", "frontpage", "eos", "gorilla", "contact", "the", "selectcabriocceoseurovangligolfgtijettajetta", "left"] }
Uselessly common words
Common connector words can be trimmed
…with a bunch more ETL.
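Trimming common connector words is typically a set-membership filter. The sketch below assumes a small hand-picked stop-word list drawn from tokens visible in the record above and a minimum token length of 2; a real pipeline would likely use a larger published list.

```python
# Illustrative stop-word list; every entry appears in the sample record.
STOP_WORDS = frozenset({
    "and", "the", "of", "for", "by", "was", "my", "they",
    "what", "no", "us", "just", "even", "more", "must", "though",
})


def remove_stop_words(words, stop_words=STOP_WORDS, min_length=2):
    """Drop stop words and very short tokens, keeping the original order."""
    return [word for word in words
            if word not in stop_words and len(word) >= min_length]
```

The length threshold removes single-letter leftovers such as "x", "s", and "e" that also appear in the record.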
Words mangled together?
Is there an edge case that’s causing this?
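One plausible cause of tokens such as "covermodel" and "platestainless" is removing HTML tags without leaving whitespace behind, so that text from adjacent table cells is joined. This is an assumption about the bug, not something the source confirms; the sketch only demonstrates the effect with two hypothetical helpers.

```python
import re

TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags_naive(html):
    """Delete tags outright; text from adjacent elements runs together."""
    return TAG_PATTERN.sub("", html)


def strip_tags_spaced(html):
    """Replace each tag with a space, then collapse repeated whitespace."""
    return " ".join(TAG_PATTERN.sub(" ", html).split())
```

Comparing both outputs on a handful of raw pages is a quick way to confirm or rule out this explanation.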
Were these actually visible?
“html” was found in every record, something seems wrong.
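A token that occurs in every record carries no signal and often points to an extraction bug. A document-frequency count makes such tokens easy to spot. The sketch assumes records shaped like the JSON above, with a "words" list; the function name is hypothetical.

```python
from collections import Counter


def tokens_in_every_record(records):
    """Return, sorted, the tokens that occur in all of the given records."""
    document_frequency = Counter()
    for record in records:
        # Count each token once per record, however often it repeats.
        document_frequency.update(set(record["words"]))
    return sorted(token for token, count in document_frequency.items()
                  if count == len(records))
```

Running this on a sample of records before modeling would have flagged "html" immediately.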
raw_data = LOAD '/zach/common-crawl/1285409360731_9.arc.gz' USING ArcLoader()
    AS (header:chararray, html:chararray);
edge_list = FOREACH raw_data GENERATE ExtractLinks(*);
edge_list_filtered = FILTER edge_list BY FilterAny(*);
src_based = FOREACH edge_list_filtered GENERATE NormalizeURL(*, 0);
src_based_cleaned = FILTER src_based BY FilterMalformedURL(*, 1);
dest_based = FOREACH src_based_cleaned GENERATE NormalizeURL(*, 1);
dest_based_self_loops_removed = FILTER dest_based BY FilterLoop(*);
final = FILTER dest_based_self_loops_removed
    BY NOT (src_domain MATCHES '.*mailto.*' OR dest_domain MATCHES '.*mailto.*');
grouped = GROUP final BY (src_domain,dest_domain) PARALLEL 64;
with_link_count = FOREACH grouped GENERATE group.src_domain,
    group.dest_domain,
    COUNT(final) AS num_links:long;
with_hbase_keys = FOREACH with_link_count GENERATE RowKeyAssignerUDF(*);
final_graph = FOREACH with_hbase_keys GENERATE FLATTEN($0)
    AS (key:chararray, src_domain:chararray, dest_domain:chararray, num_links:long);
STORE_GRAPH(final_graph, 'hbase://pagerank_edge_list', 'Titan');
Load raw data
Extract links
Filter & Normalize
Generate Link Counts
Assign HBase Keys
Store into Titan
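The filter, normalize, and count steps of the script can be prototyped on a small in-memory sample before running at scale. The sketch below is an approximation, not a port of the UDFs: `domain_of` and `build_edge_list` are hypothetical names, it keeps only the host name rather than the full `http://…` prefix, and it lower-cases hosts because the sample edge list shown earlier contains both `www.Smartflix.com` and `www.smartflix.com`.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url):
    """Return the lower-cased host of an http(s) URL, or None if unusable."""
    try:
        parsed = urlparse(url)
    except ValueError:
        return None  # malformed URL
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return None  # covers mailto: links and relative paths
    return parsed.hostname  # urlparse already lower-cases the host name


def build_edge_list(links):
    """Turn (src_url, dst_url) pairs into sorted (src, dst, num_links) rows."""
    counts = Counter()
    for src_url, dst_url in links:
        src, dst = domain_of(src_url), domain_of(dst_url)
        if src is None or dst is None:
            continue  # drop malformed and non-web links
        if src == dst:
            continue  # drop self-loops
        counts[(src, dst)] += 1
    return sorted((src, dst, n) for (src, dst), n in counts.items())
```

Because the grouping key is the normalized pair, case variants of the same host collapse into one edge with a combined count.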
Demo.
Open Problems with Pig ETL
(for Data Science)
1. Complex JSON/XML processing is painful
{ "Top-Level-Field": "top_level", "Inner-Json": [{ "Name": "inner-name", "Value": 10 }]}
[Diagram: Pig architecture. Labels: User Interface; Interactive Mode; Batch Mode; Embedded Mode (Java, Python, etc.); Pig Scripting Interface; Parser; Planner; Built in Functions and Operators; UDFs; Open source packages; LOAD Functions; STORE Functions; Data Type Support; Backend & Execution Engines; MR Jobs]
json_data = LOAD 'test.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
unnested = FOREACH json_data GENERATE $0#'Top-Level-Field' AS (top_level_field_value: chararray),
    FLATTEN($0#'Inner-Json') AS (inner_json: map[]);
unnested = FOREACH unnested GENERATE top_level_field_value,
    FLATTEN(inner_json#'Name') AS (inner_name: chararray),
    FLATTEN(inner_json#'Value') AS (inner_value:long);
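For comparison, the same unnesting in a general-purpose language is a single comprehension. The sketch uses the sample document from this slide; the function name `flatten_record` is hypothetical.

```python
import json


def flatten_record(record):
    """Emit one (top_level, inner_name, inner_value) row per inner object."""
    return [(record["Top-Level-Field"], inner["Name"], inner["Value"])
            for inner in record["Inner-Json"]]


document = json.loads(
    '{"Top-Level-Field": "top_level",'
    ' "Inner-Json": [{"Name": "inner-name", "Value": 10}]}'
)
rows = flatten_record(document)
```

The gap between this and the two-statement Pig version with map lookups and FLATTEN is what makes nested data feel painful in Pig.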
2. Better high-level language integration
• Native-like experience with non-JVM languages (Python, R, etc.)
• REST interface can be improved (HCATALOG-182)
3. Better data exploration & error reporting
• Faster iterative processing (Spark, YARN)
• Better SAMPLE (WIP: PIG-1713)
• SUMMARY for descriptive statistics
• More descriptive error messages
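To make the request for a SUMMARY operator concrete, here is a sketch of the kind of descriptive statistics it might report for a numeric column such as link counts. The operator does not exist in Pig as described here; the function name `summarize` and the choice of statistics are assumptions for illustration.

```python
import statistics


def summarize(values):
    """Return basic descriptive statistics for a non-empty numeric column."""
    values = list(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }
```

A large gap between mean and median, as with link counts where one pair dominates, is the sort of skew a quick summary would expose before modeling.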
4. Better control with HBaseStorage
• Inefficient for bulk loading
• Better HBase filter support
• Batching support
• Fetch multiple versions
Questions?
• Graph Builder 2 Alpha Dec’13
• Apache 2.0 OS code available at: www.01.org/graphbuilder/
Legal Notices
• INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.
• Intel may make changes to specifications and product descriptions at any time, without notice.
• All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.
• Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.
• Code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user.
• Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.
• Intel, Intel Inside, and the Intel logo are trademarks of Intel Corporation in the United States and other countries.
• *Other names and brands may be claimed as the property of others.
• Copyright © 2013 Intel Corporation.
Abstract
Intel is working hard to build datacenter software from the silicon up that provides for a wide range of advanced analytics on Apache Hadoop. The Graph Analytics Operation within Intel Labs is helping to transform Hadoop into a full-blown “knowledge discovery platform” that can deftly process a wide range of data models, from simple tables to multi-property graphs, using sophisticated machine learning algorithms and data mining techniques. But the analysis cannot start until features are engineered, a task that takes a lot of time and effort today. In this talk, I will describe some of the Hadoop-based tools we are developing to make it easier for data scientists to deal with data quality issues and construct features for scalable machine learning, including graph-based approaches.