Hw09 Enabling Ad Hoc Analytics At Web Scale

Rod Smith ([email protected]) © 2006 IBM Corporation. Enabling ad-hoc Analytic Apps with Hadoop. Friday, October 2, 2009

Transcript of Hw09 Enabling Ad Hoc Analytics At Web Scale

Page 1: Hw09   Enabling Ad Hoc Analytics At Web Scale

rod smith ([email protected])

© 2006 IBM Corporation

Enabling ad-hoc Analytic Apps with Hadoop

Friday, October 2, 2009

Page 2: Hw09   Enabling Ad Hoc Analytics At Web Scale

IBM Software Group, October 2009, SWG Emerging Internet Technology

Hadoop World ’09

Emerging Technology - What do we work on?

Making Hadoop accessible to business professionals

Page 3: Hw09   Enabling Ad Hoc Analytics At Web Scale

New Intelligence - Big Data

Nearly 15 petabytes of data are created every day, eight times more than the information in all the libraries in the U.S.

Volume of data in enterprises is doubling approximately every 3 years (Forrester Research)

• Includes structured and unstructured data, excludes rich media

Costs to find, collect & analyze data are decreasing significantly as web innovation proceeds

Content is untapped value for business insights & intelligence

Page 4: Hw09   Enabling Ad Hoc Analytics At Web Scale

New Intelligence - New Class of Application on Horizon?

[Figure labels: Gather, Extract, Explore]

Internet Evolution: A web of data sources, services for exploring & manipulating data, and ways that users can connect them together (Tom Coates/Yahoo™)

Enterprises recognizing potential of leveraging the broader web for business intelligence coverage - as well as for internal data

Next wave of content-centric webApps emerging

• Long(er) running data collection & analytic applications

Page 6: Hw09   Enabling Ad Hoc Analytics At Web Scale

New Intelligence - New Class of Application on Horizon?

Hear business users asking for the ability to directly manipulate, analyze & remix massive data sources & services

• LOB: “… Google whetted my appetite... I want more customizable analytics with me in the driver's seat…”

Leveraging easy-to-use, rich data manipulation metaphors like spreadsheets, etc.

Rich visualizations to quickly identify insights

Page 7: Hw09   Enabling Ad Hoc Analytics At Web Scale

Rich Spectrum DIY Analytic Applications Emerging

Page 8: Hw09   Enabling Ad Hoc Analytics At Web Scale

Let's Talk Customer Scenarios - BBC

BBC Digital Democracy Project

Achieving Increased Government Transparency

Web Content To Gather:

• UK Parliament Web Site

• Timeframe: 10+ years

Business Questions:

• Name names: Who is doing what, who isn't doing what

• Overlay voting record with demographic & voting records over time

• Buzz - what are people talking about?

• Visualize content relationships

Knowledge of Interest:

• Members of Parliament (MPs)

• Bills, Debates, Voting Districts

Page 9: Hw09   Enabling Ad Hoc Analytics At Web Scale

Let's Talk Customer Scenarios - Thomson Reuters

Enrich Trader's Desktop Enhancement

Timely aggregation & analytics of content originating from public internet sites

Web Content To Gather:

• ~118 3rd-party financial news services and blogs, including: BBC, CNN, Yahoo News, Financial Times, NY Times, The Big Picture, Fox News, PR Newswire, Market Watch, World Press, Forbes, Google News, Wall Street Journal, MSNBC, The Sun, ZDNet

Business Questions:

• NewsBuzz: What are the headlines? What are not the headlines but still in focus?

• OpinionMonitor: Who is saying what? What are the debate topics?

• NewsTimeline: Chronology (pulse) of headline news?

• TopicCloud: Tag-based topic metrics

• IssueAnalytics: Link backs to semantically related news

Knowledge of Interest:

• People, places, events

Scenario:

• Gather unstructured data from anywhere between 200 and 2000 data sources - every 15 minutes

• Perform preprocessing (search, transform, index) over each source

• Publish harvested content for distributed content services and downstream mashups
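The gather, preprocess, publish scenario on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Document` type, the `preprocess` and `run_cycle` functions, and the two sample feeds are hypothetical stand-ins, not part of the Thomson Reuters system, and real sources would be fetched over the network rather than from in-memory lambdas.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Document:
    source: str
    text: str


def preprocess(doc: Document) -> Dict[str, object]:
    """Transform step: normalize tokens and build a tiny sorted term index."""
    tokens = [t.strip(".,;!?").lower() for t in doc.text.split()]
    tokens = [t for t in tokens if t]
    return {"source": doc.source, "tokens": tokens, "terms": sorted(set(tokens))}


def run_cycle(sources: Dict[str, Callable[[], str]]) -> List[Dict[str, object]]:
    """One gather -> preprocess -> publish pass over every configured source."""
    published = []
    for name, fetch in sources.items():
        doc = Document(source=name, text=fetch())  # gather
        published.append(preprocess(doc))          # preprocess, then publish
    return published


# Hypothetical in-memory feeds standing in for the 200 to 2000 real sources.
SAMPLE_SOURCES = {
    "feed_a": lambda: "Markets rally on earnings.",
    "feed_b": lambda: "Earnings miss; markets slide!",
}

# The slide's harvest interval; a scheduler would call run_cycle this often.
CYCLE_SECONDS = 15 * 60

if __name__ == "__main__":
    for record in run_cycle(SAMPLE_SOURCES):
        print(record["source"], record["terms"])
```

At the scale the slide describes, each source would be an independent unit of work, which is what makes the per-source preprocessing step a natural fit for parallel execution.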

Page 10: Hw09   Enabling Ad Hoc Analytics At Web Scale

IBM Emerging Technology Project: M2

What is it?

An insight engine for enabling ad-hoc business insights for business users - at web scale

How does it work? Discovery Process:

1. Point M2 to data sources of interest

• unstructured web data, feeds, XML, etc.

2. Transform data into a form that can be analyzed

• Unstructured data becomes semi-structured data

• Example: name: Rod Smith, employer: IBM, state: GA

• Apply analytics - enriching the data

3. “What if tooling” - browser-based visual front end - spreadsheet metaphor to create worksheets for exploring/visualizing the data

What's different?

• Unlocking insights embedded in unstructured data

• Analyzing data previously unavailable to analyze
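As a rough illustration of the transform step in the discovery process, the sketch below turns one sentence of free text into the semi-structured record the slide gives as an example (name, employer, state) and then adds one enriched field. The input sentence, the regular expression, the `STATE_NAMES` lookup, and the function names are all assumptions made for this example; the slides do not describe how M2 performs extraction.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern for sentences like "Rod Smith works for IBM in GA."
PATTERN = re.compile(
    r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+) works (?:for|at) "
    r"(?P<employer>[A-Z][A-Za-z]*) in (?P<state>[A-Z]{2})"
)

# Illustrative lookup used for enrichment; a real system would use a full table.
STATE_NAMES = {"GA": "Georgia"}


def extract_record(text: str) -> Optional[Dict[str, str]]:
    """Unstructured text -> semi-structured record, or None if nothing matches."""
    match = PATTERN.search(text)
    return match.groupdict() if match else None


def enrich(record: Dict[str, str]) -> Dict[str, str]:
    """Apply a simple analytic: attach the full state name when it is known."""
    enriched = dict(record)
    enriched["state_name"] = STATE_NAMES.get(record["state"], "unknown")
    return enriched


if __name__ == "__main__":
    record = extract_record("Rod Smith works for IBM in GA.")
    if record is not None:
        print(enrich(record))
```

Once data is in this key/value shape, it can be loaded into a worksheet-style view, which is the role the slide assigns to the "what if" tooling.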

Page 11: Hw09   Enabling Ad Hoc Analytics At Web Scale

M2 -> Demo

Project: Improve IP Portfolio Analysis for Mergers & Acquisitions

“...please collect all US Patent filings… then let’s do…”

Web Content To Gather:

• Gathered 1.4m patent docs from USPTO

• 1991-2007 case records from the United States Court of Appeals for the Federal Circuit (CAFC)

Business Questions:

• How much is a target company worth?

• What are the high-value areas of their portfolio?

• Explored cited patent topics, litigated patents

Knowledge of Interest:

• Patents ranked by citation - e.g., how often a patent was referenced determines value

• Corporate genealogies: IP ownership roll-up

• Augment analysis with items affecting IP value, inventor affiliation, citation rank by time
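The citation ranking mentioned on this slide amounts to counting how often each patent is referenced by others. A minimal sketch, assuming citations are available as (citing, cited) pairs; the patent IDs below are made up for illustration and the function name is hypothetical:

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_by_citations(citations: Iterable[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Rank patents by how often they are cited, most cited first.

    Each input pair is (citing_patent, cited_patent). Ties are broken by
    patent ID so the ordering is deterministic.
    """
    counts = Counter(cited for _citing, cited in citations)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Made-up citation pairs, not real USPTO data.
SAMPLE_CITATIONS = [
    ("P4", "P1"), ("P5", "P1"), ("P5", "P2"),
    ("P6", "P1"), ("P6", "P3"), ("P7", "P3"),
]

if __name__ == "__main__":
    for patent, count in rank_by_citations(SAMPLE_CITATIONS):
        print(patent, count)
```

Over 1.4 million documents the same count would typically run as a distributed group-and-sum job rather than in memory.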

Page 12: Hw09   Enabling Ad Hoc Analytics At Web Scale

What's Under the Covers: Hadoop

Emergence of map/reduce programming model for a new class of webApp

Hadoop: provides a framework for large-scale parallel processing map/reduce apps (Apache project led by Yahoo)

• Offers simplicity of “programming” - Looks like a simple single-threaded app model for developers

• Handles big data - scalable storage across machine clusters (think read-only file system)

• Deployment: no application knowledge of runtime or OS or cloud necessary

• Today - setting up, coding Hadoop jobs in Java, etc. is the domain of skilled Java engineers
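To make the "simple single-threaded app model" concrete, here is a toy word count written as map, shuffle, and reduce phases in plain Python. It runs in one process and does not use Hadoop or any Hadoop API; it only shows the shape of the code a developer writes, while the framework takes care of distributing the phases across a cluster. All function names are chosen for this sketch.

```python
from collections import defaultdict
from itertools import chain
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield (word, 1)


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big insights", "data at web scale"]))
```

The developer supplies only the map and reduce functions; neither needs to know how many machines are involved, which is the simplicity the slide refers to.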

Page 13: Hw09   Enabling Ad Hoc Analytics At Web Scale

IBM Emerging Technology Project: M2 Architectural Components

Expanding upon the Hadoop stack

• Visual tooling builds extensively on Pig

M2 Architecture Characteristics:

• Extensible via UDFs

• REST API for customer choice of analytic service/engine

• REST API for choice of visualization packages

• Export content as feeds, XML, etc.

• ...more to come
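Extensibility via UDFs (user-defined functions) generally means that new per-field operations can be registered by name and then applied to data without changing the core engine. The sketch below illustrates that idea in plain Python. It is not Pig's UDF interface and not M2's API; the `udf` decorator, the `UDFS` registry, and `apply_udf` are hypothetical names invented for this example.

```python
from typing import Callable, Dict, List

# Hypothetical registry mapping a UDF name to its implementation.
UDFS: Dict[str, Callable[[str], object]] = {}


def udf(name: str):
    """Decorator that registers a function under the given UDF name."""
    def register(fn: Callable[[str], object]) -> Callable[[str], object]:
        UDFS[name] = fn
        return fn
    return register


@udf("upper")
def to_upper(value: str) -> str:
    return value.upper()


@udf("word_count")
def count_words(value: str) -> int:
    return len(value.split())


def apply_udf(name: str, rows: List[dict], field: str) -> List[dict]:
    """Return new rows with an extra column holding the UDF result."""
    fn = UDFS[name]
    return [{**row, f"{name}({field})": fn(row[field])} for row in rows]


if __name__ == "__main__":
    rows = [{"title": "ad hoc analytics"}]
    print(apply_udf("word_count", rows, "title"))
```

Adding a new analytic then means registering one more function, which is the kind of extension point the architecture slide points to.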

Page 14: Hw09   Enabling Ad Hoc Analytics At Web Scale

Conclusions

In God we trust

Page 15: Hw09   Enabling Ad Hoc Analytics At Web Scale

Conclusions

…all others bring data

Page 16: Hw09   Enabling Ad Hoc Analytics At Web Scale

Conclusions

Enterprises quickly evolving their thinking from a Database strategy to a Data Strategy encompassing unstructured & structured content

Repeatable business patterns in broad range of industries emerging

Hadoop has potential to be the platform for broad range of solutions from web-based analytics -> business event processing -> collaboration

Page 17: Hw09   Enabling Ad Hoc Analytics At Web Scale

Almost The End

Selecting customer proof of concept projects

www-01.ibm.com/software/ebusiness/jstart/about.html

INTERESTED?
