Big Data Sourcebook Your Guide to the Data Revolution Free eBook



Learn more at www.splicemachine.com

Tired of the Big Data hype?

Get Real with SQL on Hadoop

Real-Time. Real Scale. Real Apps. Real SQL.

Splice Machine is the real-time, SQL-on-Hadoop database.

For companies contemplating a costly scale up of a traditional RDBMS, struggling to extract value out of their data inside of Hadoop, or looking to build new data-driven applications, the power of Big Data can feel just out of reach.

Splice Machine powers real-time queries and real-time updates on both operational and analytical workloads, delivering real answers and real results to companies looking to harness their Big Data streams.


introduction

2 The Big Picture

Joyce Wells

industry updates

4 The Battle Over Persistence and the Race for Access Hill

John O’Brien

10 The Age of Big Data Spells the End of Enterprise IT Silos

Alex Gorbachev

16 Big Data Poses Legal Issues and Risks

Alon Israely

20 Unlocking the Potential of Big Data in a Data Warehouse Environment

W. H. Inmon

26 Cloud Technologies Are Maturing to Address Emerging Challenges and Opportunities

Chandramouli Venkatesan

30 Data Quality and MDM Programs Must Evolve to Meet Complex New Challenges

Elliot King

34 In Today’s BI and Advanced Analytics World, There Is Something for Everyone

Joe McKendrick

40 Social Media Analytic Tools and Platforms Offer Promise

Peter J. Auditore

46 Big Data Is Transforming the Practice of Data Integration

Stephen Swoyer

CONTENTS: BIG DATA SOURCEBOOK, DECEMBER 2013

Michael Corey, Chief Executive Officer, Ntirety

Bill Miller, Vice President and General Manager, BMC Software

Mike Ruane, President/CEO, Revelation Software

Robin Schumacher, Vice President of Product Management, DataStax

Susie Siegesmund, Vice President and General Manager, U2 Brand, Rocket Software

BIG DATA SOURCEBOOK is published annually by Information Today, Inc.,

143 Old Marlton Pike, Medford, NJ 08055

POSTMASTER

Send all address changes to:

Big Data Sourcebook, 143 Old Marlton Pike, Medford, NJ 08055

Copyright 2013, Information Today, Inc. All rights reserved.

PRINTED IN THE UNITED STATES OF AMERICA

The Big Data Sourcebook is a resource for IT managers and professionals providing information on the enterprise and technology issues surrounding the ‘big data’ phenomenon and the need to better manage and extract value from large quantities of structured, unstructured and semi-structured data. The Big Data Sourcebook provides in-depth articles on the expanding range of NewSQL, NoSQL, Hadoop, and private/public/hybrid cloud technologies, as well as new capabilities for traditional data management systems. Articles cover business- and technology-related topics, including business intelligence and advanced analytics, data security and governance, data integration, data quality and master data management, social media analytics, and data warehousing.

No part of this magazine may be reproduced by any means—print, electronic or any other—without written permission of the publisher.

COPYRIGHT INFORMATION

Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by Information Today, Inc., provided that the base fee of US $2.00 per page is paid directly to Copyright Clearance Center (CCC), 222 Rosewood Drive, Danvers, MA 01923, phone 978-750-8400, fax 978-750-4744, USA. For those organizations that have been granted a photocopy license by CCC, a separate system of payment has been arranged. Photocopies for academic use: Persons desiring to make academic course packs with articles from this journal should contact the Copyright Clearance Center to request authorization through CCC’s Academic Permissions Service (APS), subject to the conditions thereof. Same CCC address as above. Be sure to reference APS.

Creation of derivative works, such as informative abstracts, unless agreed to in writing by the copyright owner, is forbidden.

Acceptance of advertisement does not imply an endorsement by Big Data Sourcebook. Big Data Sourcebook disclaims responsibility for the statements, either of fact or opinion, advanced by the contributors and/or authors.

© 2013 Information Today, Inc.

PUBLISHED BY Unisphere Media—a Division of Information Today, Inc.

EDITORIAL & SALES OFFICE 630 Central Avenue, Murray Hill, New Providence, NJ 07974

CORPORATE HEADQUARTERS 143 Old Marlton Pike, Medford, NJ 08055

Thomas Hogan Jr., Group Publisher 808-795-3701; thoganjr@infotoday

Joyce Wells, Managing Editor 908-795-3704; [email protected]

Joseph McKendrick, Contributing Editor; [email protected]

Sheryl Markovits, Editorial and Project Management Assistant (908) 795-3705; [email protected]

Celeste Peterson-Sloss, Deborah Poulson, Alison A. Trotta, Editorial Services

Denise M. Erickson, Senior Graphic Designer

Jackie Crawford, Ad Trafficking Coordinator

Alexis Sopko, Advertising Coordinator 908-795-3703; [email protected]

Sheila Willison, Marketing Manager, Events and Circulation 859-278-2223; [email protected]

DawnEl Harris, Director of Web Events; [email protected]

ADVERTISING Stephen Faig, Business Development Manager,908-795-3702; [email protected]

Thomas H. Hogan, President and CEO

Roger R. Bilboul, Chairman of the Board

John C. Yersak, Vice President and CAO

Richard T. Kaser, Vice President, Content

Thomas Hogan Jr., Vice President, Marketing and Business Development

M. Heide Dengler, Vice President, Graphics and Production

Bill Spence, Vice President, Information Technology

INFORMATION TODAY, INC. EXECUTIVE MANAGEMENT

DATABASE TRENDS AND APPLICATIONS EDITORIAL ADVISORY BOARD

From the publishers of


The Big Picture

By Joyce Wells

DBTA’s Big Data Sourcebook is a guide to the enterprise and technology issues IT professionals are being asked to cope with as business or organizational leadership increasingly defines strategies that leverage the ‘big data’ phenomenon.

It has been well-documented that social media, web, transactional, as well as machine-generated and traditional relational data, are being collected within organizations at an accelerated pace. Today, according to common industry estimates, 80% of enterprise data is unstructured—or schema-less.

The reality of what is taking place in IT organizations today is more than hype. According to an SAP-sponsored survey of 304 data managers and professionals, conducted earlier this year by Unisphere Research, a division of Information Today, Inc., between one-third and one-half of respondents report high levels of volume, variety, velocity, and value in their data—the well-known four characteristics that define big data. The “2013 Big Data Opportunities Survey” found that two-fifths of respondents have data stores reaching into the hundreds of terabytes and greater. Eleven percent of respondents said the total data they manage ranges from 500TB to 1PB, 8% had between 1PB and 10PB, and 9% had more than 10PB.

In addition, data stores are growing rapidly. According to another study produced by Unisphere Research and sponsored by Oracle, almost nine-tenths of the 322 respondents say they are experiencing year-over-year growth in their data assets. Respondents to the survey were data managers and professionals who are members of the Independent Oracle Users Group (IOUG). For many, this growth is in double-digit ranges. Forty-one percent report significant growth levels, defined as exceeding 25% a year. Seventeen percent report that the rate of growth has been more than 50% (“Achieving Enterprise Data Performance: 2013 IOUG Database Growth Survey”).

Big data offers enormous potential to organizations and represents a major transformation of information technology. Beyond the obvious need to effectively store and protect this data, IT organizations are increasingly seeking to integrate their disparate forms of data and to perform analytics in order to uncover information that will give their organization a competitive advantage. What makes big data valuable is the ability to deliver insights to decision makers that can propel organizations forward and grow revenue.

As might be expected, the largest organizations in the SAP-Unisphere study—those with 1,000 employees and up—are engaged in big data initiatives, but many smaller firms are pursuing big data projects as well. More than a third of the smallest companies or agencies in the survey, 37%, say they are involved in big data efforts, along with 43% of organizations with employees in the hundreds. According to the study, three-fourths of respondents have users at their organizations who are pushing for access to more data to do their jobs.

Products to address the big data challenge are coming to the rescue. The expanding range of NewSQL, NoSQL, Hadoop, and private/public/hybrid cloud technologies, as well as newer capabilities for traditional data management systems, present extraordinary advantages in effectively dealing with the data deluge. But which approaches are best for each individual organization? Which approaches will have staying power, and which will fall by the wayside? As Radiant Advisors’ John O’Brien rightly points out in his overview of the state of big data, we are in the infancy of a new era—and moving into a new era has never been easy.

To help advance the discussion, in this issue, DBTA has assembled a cadre of expert authors who each drill down on a single key area of the big data picture. In addition, leading vendors showcase their products and unique approaches to achieving value and mitigating risk in big data projects. Together, these articles and sponsored content provide rich insight on the current trends and opportunities—as well as pitfalls to avoid—when addressing the emerging big data challenge. ■


sponsored content

Does Big Data = Big Business Value?

With big data, the world has gotten far more complex for IT managers and those in charge of keeping a business moving forward. So how do you simplify your architecture and operations while raising the value of the innovative tools you’ve crafted to meet your business goals? With the emergence of simple key/value type data stores—such as MongoDB, Cassandra, social media databases, and Hadoop—data connectivity is evolving to meet requirements for speed and consistency.

AN EXAMPLE

Every year, NASA and the National Science Foundation host a contest across the scientific communities, the results often resonating in both the academic and business worlds. The latest challenge: How can organizations pull together all the right data from a variety of sources before performing analysis, drawing conclusions and making decisions? Sounds like big data, right?

Consider the problem of determining if life ever existed on Mars. A huge variety of data collected by the Mars rover is fed into clusters of databases around the world. It then gets transmitted as a whole to a variety of data sets and Hadoop clusters. What do we do with it? How does the scientific community organize itself to deal with this influx?

There are similar examples in every industry, all leading to key integration challenges: How do we make dissimilar data sets uniformly accessible? And how do we extract the most relevant information in a fast, scalable and consistent way?

The problems of data access and relevancy are complicated by three additional data processing realities:

1. Big data is driven by economics. When the cost of keeping information is less than the cost of throwing it away, more data survives.

2. Applications are driven by data. Big data applications drive data analysis. That’s what they’re for. And they all have the same marching orders: Get the right data to the right people at the right time.

3. Dark data happens. Because nothing is thrown away, some data may linger for years without being valued or used. This “dark data” might not be relevant for one analysis, but could be critical for another. In theory and in future practice, nothing is “irrelevant.”

THE BIG DATA MARKET

According to a recent Progress DataDirect survey, most respondents use Hadoop file systems or plan to use them within two years. Respondents also included Microsoft HDInsight, Cloudera, Oracle BDA and Amazon EMR in the list of technologies they plan to use in the next two years. This indicates the growing market awareness that it is now economically feasible to store and process many large data sets, and to analyze them in their entirety.

The survey also asked respondents to rank leading new data storage technologies. MongoDB and Cassandra have both gained a large foothold. Progress DataDirect will soon be supporting them.

TECHNOLOGY ADDRESSES THE NEED

Market growth and maturation has led to new approaches for storage and analysis of both structured and multi-structured data. Recent breakthroughs include:

• Integration of external and social data with corporate data for a more complete perspective.

• Adoption of exploratory analytic approaches to identify new patterns in data.

• Predictive analytics coming on strong as a fundamental component of business intelligence (BI) strategies.

• Increased adoption of in-memory databases for rapid data ingestion.

• Real-time analysis of data prior to storage within data warehouses and Hadoop clusters.

• A requirement for interactive, native, SQL-based analysis of data in Hadoop and HBase.

As the cost of keeping collected data plummets, new data sources are proliferating. To address the growing need, organizations must be able to connect a variety of BI applications to a variety of data sources, all with different APIs and designs—without forcing developers to learn new APIs or to constantly re-code applications. The connection has to be fast, consistent, scalable and efficient. And most importantly, it should provide real-time data access for smarter operations and decision making.

SQL connectivity, the central value of our Progress DataDirect solutions, is the answer. It delivers a high-performance, scalable and consistent way to access new data sources both on-premise and in the cloud. With SQL, we treat every data source as a relational database—a fundamentally more efficient and simplified way of processing data.
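As a rough illustration of what SQL-style access to a non-relational source looks like from application code, the sketch below uses Python's pyodbc module against a hypothetical ODBC data source name; the DSN, table, and column names are placeholders and are not taken from DataDirect's documentation.

# Minimal sketch: querying a non-relational source through a SQL/ODBC layer.
# "BigDataSource" is a hypothetical DSN configured to point at, e.g., Hive or
# MongoDB via a SQL-capable driver; table and column names are illustrative.
import pyodbc

conn = pyodbc.connect("DSN=BigDataSource")   # the driver translates SQL to the native API
cursor = conn.cursor()

# The application speaks plain SQL regardless of how the data is stored.
cursor.execute(
    "SELECT customer_id, SUM(amount) AS total "
    "FROM orders GROUP BY customer_id"
)
for customer_id, total in cursor.fetchall():
    print(customer_id, total)

conn.close()

The point of the pattern is that the application issues ordinary SQL while the driver handles whatever native interface the underlying store exposes.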

PROGRESS DATADIRECT www.datadirect.com



The Battle Over Persistence and the Race for Access Hill

The State of Big Data in 2013

By John O’Brien


Shifting gears into a new era has never been easy during the transition. Only in hindsight do we clearly see what was right in front of our faces—and probably the whole time. This is one nugget of wisdom I have been sharing with audiences through keynotes at data warehousing (DW) and big data conferences and major company onsite briefings. Having been part of the data management and business intelligence (BI) industry for 25 years, I have witnessed emerging technologies, business management paradigms, and Moore’s Law reshape our industry time and again.

Big data and business analytics have all the promise to usher in the information age, but we are still in the infancy of our next era—and frankly, that’s what makes it so exciting!

In 2013, the marketplace for big data, BI, NoSQL, and cloud computing has seen emerging vendors, adapting incumbents, and maturing technologies as each competes for market position. Some of these battles are being resolved in 2013, while others will be resolved in later years—or potentially not at all. Either way, understanding the challenges on the landscape will assist with technology decision making, strategies, and architecture road maps today and when planning for the years ahead.

Two of the more dominant shifts occurring around us this year can be called the Battle Over Persistence and the Race for Access Hill.

The Battle Over Persistence

The Battle Over Persistence didn’t just start 5 years ago with the emergence of big data or the Apache Foundation’s Hadoop; it’s been an ongoing battle for decades in the structured data world. As the pendulum swings broadly between centralized data and distributed disparate data, the Battle Over Persistence is somewhat of a “holy war” between the data consistency inherently derived from a singular data store and the performance derived from data stores optimized for specific workloads. The consistency camp argues that with enough resources, the single data store can overcome performance challenges, while the performance camp argues that it can manage the complexity of mixed heterogeneous data stores to ensure consistency.

Decades ago, multidimensional databases, or MOLAP cubes, were optimized to persist and work with data in a way different than row-based relational database management systems (RDBMSs) did. It wasn’t just about representing data in star schemas derived from a dimensional modeling paradigm—both of which are very powerful—but about how that data should be persisted when you knew how users would access and interact with it. OLAP cubes represent the first highly interactive user experience: the ability to swiftly “slice and dice” through summarized dimensional data—a behavior that could not be delivered by relational databases, given the price-performance of computing resources at the time.

Persisting data in two different data stores for different purposes has been a part of BI architecture for decades already, and today’s debates challenge the core notion of transactional system and analytical system workloads: They could be run from the same data store in the near future.

Data Is Data

The NoSQL family of data stores was born out of the business demand to capitalize on the “orders of magnitude” of data volume and complexity inherent to instrumented data acquisition—first from internet websites and search engines tracking your every click, then from the mobile revolution tracking your every post. What’s different about NoSQL and Hadoop is the paradigm on which they’re built: “Data is data.”

Technically speaking, data is free; what does cost money, and what contributes to return on investment calculations, is the cost to store and access data: infrastructure. So, developing a software solution that leverages the lowest-cost infrastructure, operating costs, and footprint was required to tackle the order of magnitude that big data represented—i.e., the lowest capital cost of servers, the lowest data center costs for power and cooling, and the highest density of servers to fit the most into a smaller space. With the “data is data” mantra, we don’t require an understanding of how the data needs to be structured beforehand, and we accept that the applications creating the data may be continuously changing structure or introducing new data elements. Fortunately, at the heart of data abstraction and flexibility is the key-value pair, and this simple elemental data unit enables the highest scalability.
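As a rough sketch of why that elemental unit scales so readily, consider hash-partitioning key-value records across a cluster. Everything below (node names, record contents) is hypothetical and illustrative rather than any particular NoSQL engine's implementation.

# Minimal sketch: key-value records hash-partitioned across nodes.
# Records need no fixed schema; each value can carry different fields.
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster members

def node_for(key: str) -> str:
    """Pick a node by hashing the key, so no central schema or index is needed."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

records = {
    "click:2013-11-02:u17": {"url": "/pricing", "ms_on_page": 4200},
    "post:8841": {"user": "u17", "text": "loving the new app"},
    "sensor:turbine-3": {"rpm": 3600, "temp_c": 71.4},   # structure varies freely
}

for key, value in records.items():
    print(f"{key} -> {node_for(key)}: {value}")

Adding capacity is then a matter of adding nodes and redistributing keys, which is why the model tolerates the data volumes described above.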

A Modern Data Platform Has Emerged

The Battle Over Persistence principle argues that there are multiple databases (or data technologies), each with its own clear strengths and each best suited to different kinds of data and different kinds of workloads with that data. For now, the pendulum has swung back to the distributed and federated data architecture. We can embrace the flexibility and overall manageability of big data platforms such as Hadoop and MongoDB. Entity-relationship modeled data in enterprise data warehouses and master data management fuses consistent and standard context into schemas and supports temporal aspects of reference data with rich attribution to fuel analytics. Even analytic-optimized databases—such as columnar, MPP (massively parallel processing), appliance, and even multidimensional databases—can be combined with in-memory databases, cloud computing, and high-performance networks. Separately, highly specialized NoSQL or analytic databases—such as graph databases, document stores, or text-based analytic engines—have their place, and those workloads can be executed natively in these specialized databases.

Companies and vendors are beginning to accept that multiple database technologies need to be interwoven to deliver the much-needed Modern Data Platform (MDP), but keep in mind that the pendulum will continue to swing—it may be 5 or 10 years from now, but some things that we know about technology hold true. Computing price-performance will continue to improve as it has with Moore’s Law, so we can converge higher numbers of CPU cores in parallel with lower-cost, more abundant memory, faster solid state storage, and higher-capacity mechanical disk drives. Tack on the rate of technology innovation and maturity that is driving big data today, and we could see the capabilities of Hadoop derivatives, MongoDB, or some emerging data technologies eclipse the highly specialized and optimized data technologies being deployed today to meet demands. There are great debates about disparate database ecosystems versus the all-in-one Hadoop—it’s simply a matter of timing and vision versus the reality of today’s demanding, data-centric environments.

The Race for Access Hill

When you accept the premise of a federated data architecture based primarily on workloads rather than logical data subjects, the next question that arises is, “How do I find anything, and where do I start?” The ability to manage the semantic context of all data, its usage for monitoring and compliance, or to provide users with a single or simple point of access is the Race for Access Hill.

When you think about “the internet,” you realize that it’s used as a singular noun, similar to how “Google” has become a verb meaning to search through the millions of servers that comprise the internet. Therefore, if the Modern Data Platform represents all the disparate data stores and information assets of the enterprise in a singular noun form, we need a point of access and navigation. Otherwise, the MDP is simply a bunch of databases.

One major concept at stake for modern data architects in the Race for Access Hill is how to centralize semantic context for consistency, collaboration, and navigation. Previously, in the organized world of data schemas, there were many database vendors and technologies that made data access heterogeneous, but it was still unified SQL data access under a single paradigm. Federated data architectures were predominantly still SQL-schema in nature and easier to unify. Today’s key-value stores, such as Hadoop, have the ability to separate the context of data, or its schema, from the data itself, which has great discovery-oriented benefits for late-binding the schema with the data, rather than analyzing and designing a schema prior to loading data, as a traditional RDBMS requires.

Centralizing context can be done in a Hadoop cluster’s HCatalog or Hive components for semantic integration with other SQL-oriented databases for federation, hence joining the SQL world where possible. (Reminds me of my favorite recent Twitter quote: “Who knew the future of NoSQL was SQL?”) Data virtualization (DV) can serve as a single access point for the broad, SQL-based consumer community, thereby becoming the “glue” of the Modern Data Platform that unifies persistence across many data store workloads. The later addition of HCatalog and Hive to Hadoop also has this capability, but only for the data that can fit this paradigm; MapReduce functionality was designed to enable any analytic capability through a programming model. Other NoSQL data stores, such as graph databases, don’t inherently “speak SQL,” so in order to be comprehensive, an access layer (or point) needs to be service-oriented as well. Consumers will need a simple navigation map that allows them to access and consume information from data services, as well as virtual data tables. The long-term strategy will lean further toward a service orientation over time; however, virtualized data will still be needed for some information access situations.

Competing for the Hill

The resolution of this portion of the Race for Access Hill will come gradually over the coming years; as the need arises, a technology and strategy are already in place for companies to adopt. However, this is not the case with the “hill” portion of the race: Vendors are racing to position their products to be that single point of access (the hill), with compelling arguments and case studies to support them. Aside from the SQL/services centralization of semantic context, the next question becomes, “Where should this access point live within the architecture?”

There are four different locations, or layers, where centralized access and context could be effectively managed—a continuum between two points, with the data at one end and the consumer or user at the other, if you will. Along this continuum are several points where you could introduce centralized access and information context. Starting from the data end, you could make the single point of access within a database—this database could have connections to other data stores and virtualization as the representation for the users. Next could be to centralize the access and information context above the database layer but below the BI application and consumer layers, with a data virtualization technology. Third could be to move further along the path toward the user into the BI application layer, where BI tools have the ability to create meta catalogs and data objects in a managed order for reporting, dashboards, and other consumers. Finally, some argue that the user—or desktop—application is the place where users can freely navigate and work with data within the context they need locally and in a much more agile fashion.

Not All Data Is Created Equal

Despite database, data virtualization, and BI tool vendors racing to be the single point of access for all the data assets in the modern data platform for their own gains, there isn’t one answer for where singular access and context should live, because it’s not necessarily an architectural question but perhaps a more philosophical one—a classic “it depends.” With so many options available from the vendors today, understanding how to blend and inherit context under which circumstances or workload is key.

First, understand which data needs to be governed vigorously—not all data is created equal. When the semantic context of data needs to be governed absolutely, moving the context closer to the data itself ensures that access will inherit that context every time. For relational databases, this is the physical tables, columns, and data types that define entities and attribution within a schema of the data. For Hadoop, instead, this would be the definition of the table and columns, with the Hive or HCatalog abstraction layer bound to the data within the Hadoop Distributed File System (HDFS). A data virtualization tool or BI server could therefore integrate multiple data stores’ schemas as a single virtual access point. Counter to this approach is data that does not yet have a set definition (discovery), or cases where local interpretation is more valuable than enterprise consistency—here it makes more sense for the context to be managed by users or business analysts in a self-service or collaborative manner. The semantic life cycle of data can be thought of as discovery, verification, governance, and, finally, adoption by different users in different ways.
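As one concrete and purely illustrative way of binding such a definition to data already sitting in HDFS, a Hive external table declares the columns without moving the files; the sketch below uses the PyHive client, and the host, path, and column names are assumptions rather than anything specified in the article.

# Minimal sketch: binding a Hive schema to data that already lives in HDFS,
# so the semantic context (columns, types) travels with the data itself.
# Host, table, path, and columns are hypothetical.
from pyhive import hive

conn = hive.connect(host="hive-gateway.example.com", port=10000)
cur = conn.cursor()

cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS web_clicks (
        user_id    STRING,
        url        STRING,
        event_time TIMESTAMP
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION '/data/raw/web_clicks'
""")

# Any SQL-speaking tool (a BI server or data virtualization layer) can now
# query the governed definition instead of the raw files.
cur.execute("SELECT url, COUNT(*) FROM web_clicks GROUP BY url LIMIT 10")
print(cur.fetchall())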

As for the “it depends” comment regarding different analytic workloads, let’s examine another hot topic of 2013: Analytic Discovery—or, specifically, the analytic discovery process. Analytic databases have been positioned as higher-performing, analytic-optimized databases sitting between the vast amounts of big data in Hadoop and the enterprise reference data, such as data warehouses and master data management hubs. The analytic database is highly optimized for performing dataset operations and statistics by combining the ease of use of SQL with the performance of MPP database technology, columnar data storage, or in-memory processing.

Discovery is a highly iterative mental process—somewhat trial and error and verification. Analytic databases may not be as flexible or scalable as Hadoop, but they are faster out of the box. So, when an analytic database is used for a discovery workload, some degree of semantics and remote database connections should live within it. Whether the analytic sandbox remains a discovery environment or ends up running production analytics, accumulating more analytic jobs over time, is still unknown.

What’s Ahead

In 2013, two major shifts in the data landscape occurred. The acceptance of leveraging the strengths of various database technologies in an optimized Modern Data Platform has more or less been resolved, but the recognition of a single point of access and context is next. Likewise, the race for access will continue well into 2014—and while one solution may win out over the others with enough push and marketing from vendors, the overall debate will continue for years, with blended approaches being the reality at companies.

And, get ready: The next wave in data is now emerging, once again pushing beyond web and mobile data. The Internet of Things (IoT)—or Machine-to-Machine (M2M) data—comes from thousands of devices per person that create, share, and, in some cases, analyze data every second. Whether it’s every device in your home, car, office, or everywhere in between that has a plug or battery generating and sharing data in a cloud somewhere—or it’s the 10,000 data points being generated every second by each jet engine on the flight I’m on right now—there will be new forms of value created by business intelligence, energy efficiency intelligence, operational intelligence, and many other forms and families of artificial intelligence. ■

John O’Brien is principal and CEO of Radiant Advisors. With more than 25 years of experience delivering value through data warehousing and business intelligence programs, O’Brien’s unique perspective comes from the combination of his roles as a practitioner, consultant, and vendor CTO in the BI industry. As a globally recognized business intelligence thought leader, O’Brien has been publishing articles and presenting at conferences in North America and Europe for the past 10 years. His knowledge in designing, building, and growing enterprise BI systems and teams brings real-world insights to each role and phase within a BI program. Today, through Radiant Advisors, O’Brien provides research, strategic advisory services, and mentoring that guide companies in meeting the demands of next-generation information management, architecture, and emerging technologies.


WHAT HAS YOUR BIG DATA DONE FOR YOU LATELY?

TransLattice helps solve the world’s Big Data problems.

Bridge your federated systems with effortless visibility and data control to get real benefit from your data.

www.TransLattice.com


The Age of Big Data Spells the End of Enterprise IT Silos

The State of Big Data Management

By Alex Gorbachev

Data management has been a hot topic in recent years, topping even cloud computing. Here is a look at some of the trends and how they are going to impact data management professionals.

The Rise of ‘Datafication’

Today, businesses are ending up with more and more critical dependency on their data infrastructure. Before widespread electrification, most businesses were able to operate well without electricity, but in a matter of a couple of decades, dependency on electricity became so strong and so broad that almost no business could continue to operate without it. Similarly, “datafication” is what’s happening right now. If underlying database systems are not available, manufacturing floors cannot operate, stock exchanges cannot trade, retail stores cannot sell, banks cannot serve customers, mobile phone users cannot place calls, stadiums cannot host sports games, and gyms cannot verify their subscribers’ identities. The list keeps growing as more and more companies rely on data to run their core business.

Consolidation and Private Database Clouds

Database consolidation has been lagging behind application server consolidation. The latter has long since moved to virtual platforms, while the database posed unique challenges with host-based virtualization. However, with server virtualization improvements and database software innovations such as Oracle’s Multitenant, database consolidation moved to the next level and most recently reemerged as database as a service, with SLA management, resource accounting and chargeback, self-service capabilities, and elastic capacity.

Commodity Hardware and Software

Hardware performance has been rising consistently for decades with Moore’s Law, high-speed networking, solid-state storage, and the abundance of memory. On the other hand, the cost of hardware has been consistently decreasing to the point where we now call it a commodity resource. Public cloud infrastructure as a service (IaaS) has dropped the last barriers to adoption.

On the software side, the open source phenomenon has resulted in the availability of free or inexpensive database software that, combined with access to affordable hardware, allows practically any company to build its own data management systems—no barriers to datafication.



The Future of Database Outsourcing

Datafication, consolidation, virtualization, Moore’s Law, engineered systems, cloud computing, big data, and software innovations will all result in more eggs (business applications) ending up in one basket (a single data management system). Consequently, the impact of an incident on such a system is significantly higher, affecting larger numbers of more critical business applications and functions—for example, a major U.S. retailer that has $1 billion of annual revenue dependent on a single engineered system, or another single engineered system handling 2% of Japan’s retail transactions.

Operating such critical data systems becomes much more skills-intensive rather than labor-intensive, and, as companies follow the trend of moving from a zillion low-importance systems to just a few highly critical systems, outsourcing vendors will have to adapt. The modern database outsourcing industry is broken because it’s designed to source an army of cheap but mediocre workers. The future of database outsourcing is with the vendors focused on enabling their clients to build an A-team to manage the critical data systems of today and tomorrow.

Breaking Enterprise IT Silos

The age of big data spells the end of enterprise IT silos. Big data projects are very difficult to tackle by orchestrating a number of very specialized teams such as storage administrators, system engineers, network specialists, DBAs, application developers, data analysts, etc.

It’s difficult to specialize due to the quickly changing scope of roles as well as rapid evolution of the software. Getting things done in a siloed environment takes a very long time—this is misaligned with the need to be more agile and adaptable to changing requirements and timelines. A single, well-jelled big data team is able to get work done quickly and in a more optimal way—big data systems are basically new commercial supercomputers in the age of datafication and, just like traditional supercomputers, they require a team of professionals responsible for the management of the complete system end-to-end.

Pre-integrated solutions and engineered systems also break enterprise IT silos by forcing companies to build a cross-skilled single team responsible for that whole engineered system.

The Future for Hadoop and NoSQL

Whether Hadoop is the best big data platform from a technology perspective or not, it has such broad (and growing) adoption in the industry nowadays that there is little chance of it being displaced by any other technology stack.

While, traditionally, core Hadoop has been thought of as a combination of HDFS and MapReduce, today both HDFS and MapReduce are really optional. For example, the MapR Hadoop distribution uses MapR-FS, and Amazon EMR uses S3. The same applies to MapReduce—Cloudera Impala has its own parallel execution engine, Apache Spark is a new low-latency parallel execution framework, and many more are becoming popular. Even Apache Hive and Apache Pig are moving from pure MapReduce to Apache Tez, yet another big data real-time distributed execution framework.
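To make the point concrete, here is a minimal word-count sketch on Spark's Python RDD API; it expresses the familiar map-and-reduce pattern without submitting MapReduce jobs. The input path is a placeholder.

# Minimal sketch: the classic word count on Spark's RDD API, reading from HDFS.
# The HDFS path is hypothetical; any text source would do.
from pyspark import SparkContext

sc = SparkContext("local[*]", "word-count-sketch")

counts = (
    sc.textFile("hdfs:///data/raw/web_clicks")   # distributed read
      .flatMap(lambda line: line.split())        # "map" phase
      .map(lambda word: (word, 1))
      .reduceByKey(lambda a, b: a + b)           # "reduce" phase, no MapReduce jobs
)

for word, count in counts.take(10):
    print(word, count)

sc.stop()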

Hadoop is here to stay, and that means the Hadoop ecosystem at large. It will evolve and add new capabilities at a blazing-fast pace. Some components will die out and others will move into the mainstream. “Core Hadoop” as we know it will change.

There are many commercial off-the-shelf (COTS) applications available that use relational databases as a data platform—CRM, ERP, ecommerce, health records management, and more. Deploying COTS applications on one of the supported relational database platforms is a relatively straightforward task, and application vendors have a proven track record of deployments with clearly defined guidelines. It can be argued that the majority of relational database deployments today host a third-party application rather than an in-house developed application.

Big data projects, on the other hand, are pretty much 100% custom-developed solutions that are not easily repeatable at another company. As Hadoop has become the standard platform of the big data industry, expect a slew of COTS applications to deploy on top of Hadoop platforms just as they are deployed on top of relational databases such as Oracle and SQL Server.


For example, all retail players have to solve the challenges of providing a seamless experience to clients across both physical and online channels. All city governments have the same needs for traffic planning and real-time control to minimize traffic jams and, at the same time, to minimize the cost of operations and ownership. Companies will be able to buy a COTS application and deploy it on their own Hadoop infrastructure no matter what Hadoop distribution it is.

It is, however, quite possible that the new big data COTS applications will be dominated by software as a service (SaaS) offerings or completely integrated solution appliances (as an evolution of engineered systems), and that means a completely different repeatable deployment model for big data.

Unlike Hadoop, however, the world of NoSQL is still represented by a huge variety of incompatible platforms, and it’s not obvious who will dominate the market. Each of the NoSQL technologies has a certain specialization, and no one size fits all—unlike relational databases.

Relational Databases Are Not Going Anywhere

While there is much speculation about how modern data processing technologies are displacing proven relational databases, the reality is that most companies will be better served with relational technologies for most of their needs.

As the saying goes, if all you have is a hammer, everything looks like a nail. When database professionals drink enough of the big data Kool-Aid, many of their challenges look like big data problems. In reality, though, most of their problems are self-inflicted. A bad data model is not a big data problem. Using 7-year-old hardware is not a big data problem. Lack of a data purging policy is not a big data problem. Misconfigured databases, operating systems, and storage arrays are not big data problems.

There is one good rule of thumb to assess whether you have a big data problem or not: If you are not using new data sources, you likely don’t have a big data problem. If you are consuming new information from new data sources, you might have a big data problem.

What’s Ahead

There are a few areas in which we can certainly expect to have many innovations over the next few years.

Real-time analytics on massive data volumes is in more and more demand. While there are many in-memory database technologies, including many proprietary solutions, I believe the future is with the Hadoop ecosystem and open standards. However, proprietary solutions such as SAP HANA or the just-announced Oracle In-Memory Database are very credible alternatives.

Graph databases will see significant uptake. There are several graph databases and libraries available, but they all have unique weaknesses when it comes to scalability, availability, in-memory requirements, data size, modification consistency, and plain stability. As we generate more and more data that is based on dynamic relations between entities, graph theory becomes a very convenient way to model data. Thus, the graph database space is bound to evolve at a fast pace.

Continuously increasing security demands are a general trend in many industries, although most of the modern data processing technologies have weak security capabilities out of the box. This is where established relational databases, with very strong security models and the ability to integrate easily with central security controls, have a strong edge. While it’s possible to deploy a Hadoop-based solution with encryption of data in transit and at rest, strong authentication, granular access controls, and access auditing, it takes significantly more effort than deploying mature database technologies. It’s especially difficult to satisfy strict security standards compliance with newer technologies, as there are no widely accepted and/or certified secure deployment blueprints.

The future of the database professional—One of the challenges holding companies back from adopting new data processing technologies is the lack of skilled people to implement and maintain the new technology. Those of us with a strong background in traditional database technologies are already in high demand and are in even higher demand when it comes to the bleeding-edge, not-yet-proven databases. If you want to be ahead of the industry, look for opportunities to invest in learning one of the new database technologies, and do not be afraid that it might be one of those technologies that becomes nonexistent in a couple of years. What you learn will take you to the next level in your professional career and make it much easier to adapt to the quickly changing database landscape. ■

Alex Gorbachev, chief technology officer at Pythian, has architected and designed numerous successful database solutions to address challenging business requirements. He is a respected figure in the database world and a sought-after leader and speaker at conferences. Gorbachev is an Oracle ACE director, a Cloudera Champion of Big Data, and a member of the OakTable Network. He serves as director of communities for the Independent Oracle User Group (IOUG). In recognition of his industry leadership, business achievements, and community contributions, he received the 2013 Forty Under 40 award from the Ottawa Business Journal and the Ottawa Chamber of Commerce.


sponsored content

Elephant Traps … How to Avoid Them With Data Virtualization

Big Data is being talked about everywhere … in IT and business conferences, venture capital, legal, medical and government summits, blogs and tweets … even Fox News! The prevailing mindset is that if you don’t have a Big Data project, you’re going to be left behind. In turn, CIOs are feeling pressured to do something—anything—about Big Data. So while they are putting up Hadoop clusters and crunching some data, it seems that the really big (data) questions all of them should be asking are: Where is the value going to come from? What are the “real” use cases? And, finally, how can they prevent this from becoming yet another money pit, or “elephant trap,” of technologies and consultants?

TRAP 1—NOT FOCUSING ON VALUE

Much of the talk about Big Data is focused on data … not the value in it. Perhaps we should start with value—identify those business entities and processes where having infinitely more information could directly influence revenue, profitability or customer satisfaction. Take, for example, the customer as an entity. If we had perfect knowledge of current and potential customers, past transactions and future intentions, demographics and preferences—how would we take advantage of that to drive loyalty and increase share of wallet and margins? Or, to focus on a process such as delivering healthcare services—how would Big Data impact clinical quality and cost, and reduce relapse rates? Enumerating the possible impact of Big Data on real business goals (or social goals for non-profits) should be the first step of your Big Data strategy, followed by prioritizing them—weeding out the whimsical to focus on the practical.

TRAP 2—SEEKING DATA PERFECTION

With value in mind, you must be willing to experiment with many different types of Big Data (structured to highly unstructured) and sources—machine and sensor data (weather sensors, machine logs, web click streams, RFID), user-generated data (social media, customer feedback), Open Government and public data (financial data, court records, yellow pages), corporate data (transactions, financials) and many more. In many cases the “broader view” might yield more value than the “deep and narrow” view. And this allows companies to experiment with data that may be less than perfect quality but more than “fit for purpose.”

While quality, trustworthiness, performance and security are valid concerns, over-zealously filtering out new sources of data using old standards will fail to achieve the full value of Big Data. Also, data integration technologies and approaches are themselves siloed, with different technology stacks for analytics (ETL/DW), for business process (BPM, ESB), and for content and collaboration (ECM, Search, Portals). Companies need to think more broadly about data acquisition and integration capabilities if they want to acquire, normalize, and integrate multi-structured data from internal and external sources and turn the collective intelligence into relevant and timely information through a unified/common/semantic layer of data.

TRAP 3—COST, TIME AND RIGIDITY

While all the data in the world—and its potential value—can excite companies, it would not be economically attractive except to the largest organizations if Big Data integration and analytics were done using traditional high-cost approaches such as ETL, data warehouses, and high-performance database appliances. From the start, Big Data projects should be designed with low cost, speed and flexibility as the core objectives of the project. Big Data is still nascent, meaning both business needs and data realms are likely to evolve faster than previous generations of analytics, requiring tremendous flexibility. Traditional analytics relied heavily on replicated data, but Big Data is too large for replication-based strategies and must be leveraged in place or in flight where possible. This also applies in the output direction, where Big Data results must be easy to reuse across unanticipated new projects in the future.

[Figure: The Denodo Platform as a unified data layer (Connect -> Combine -> Publish), providing unified data access and universal data publishing between agile BI and analytics users and data sources across the enterprise (enterprise and cloud apps, log files, unstructured content), the web and cloud (Hadoop, cloud storage, Data.gov, web streams), and relational/parallel/columnar stores.]

AVOIDING THE TRAPS

To prevent Big Data projects from becoming yet another money pit and suffering from the same rigidity as data warehouses, there are four areas in particular to consider: data access, data storage, data processing, and data services. The middle two areas (storage and processing) have received the most attention, as open source and distributed storage and processing technologies like Hadoop have raised hopes that big value can be squeezed out of Big Data using small budgets. But what about data access and data services?

Companies should be able to harness Big Data from disparate realms cost-effectively, conform multi-structured data, minimize replication, and provide real-time integration. The Big Data and analytic result sets may need to be abstracted and delivered as reusable data services in order to allow different interaction models such as discover, search, browse, and query. These practices ensure a Big Data solution that is not only cost-effective, but also flexible enough to be leveraged across the enterprise.

DATA VIRTUALIZATION

Several technologies and approaches serve Big Data needs, of which two categories are particularly important. The first has received a lot of attention and involves distributed computing across standard hardware clusters or cloud resources, using open source technologies. Technologies that fall into this category—and that have all received a lot of attention—include Hadoop, Amazon S3, Google BigQuery, etc. The other is data virtualization, which has been less talked about until now, but is particularly important for addressing the challenges of Big Data mentioned above:

Data virtualization accelerates time to value in Big Data projects: Because data virtualization is not physical, it can rapidly expose internal and external data assets and allow business users and application developers to explore and combine information into prototype solutions that can demonstrate value and validate projects faster.

Best-of-breed data virtualization solutions provide better and more efficient connectivity: Best-of-breed data virtualization solutions connect diverse data realms and sources ranging from legacy to relational to multi-dimensional to hierarchical to semantic to Big Data/NoSQL to semi-structured web, all the way to fully unstructured content and indexes. These diverse sources are exposed as normalized views so they can be easily combined into semantic business entities and associated across entities as linked data.

Virtualized data inherently provides lower costs and more flexibility: The output of data virtualization is a set of data services that hide the complexity of the underlying data and expose business data entities through a variety of interfaces, including RESTful linked data services, SOA web services, data widgets, and SQL views, to applications and end users. This makes Big Data reusable, discoverable, searchable, browsable and queryable using a variety of visualization and reporting tools, and makes the data easily leveraged in real-time operational applications as well.
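To make the data-services idea concrete, the sketch below consumes a business entity from a hypothetical RESTful endpoint of the kind a virtualization layer might publish; the URL, parameters, and field names are invented for illustration and do not describe any specific product's API.

# Minimal sketch: consuming a business entity exposed as a RESTful data service.
# The endpoint and fields are hypothetical; a SQL view over the same entity
# would return equivalent rows to BI tools.
import requests

resp = requests.get(
    "https://dataservices.example.com/api/customers",
    params={"region": "EMEA", "fields": "id,name,lifetime_value"},
    timeout=30,
)
resp.raise_for_status()

for customer in resp.json():
    print(customer["id"], customer["name"], customer["lifetime_value"])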

CONCLUSION

CIOs and Chief Data Officers alike would do well to keep the dangers of elephant traps in mind before they find themselves ensnared. The truth is that every Big Data project needs a balance between the Big Data technologies for storage and processing on the one hand and data virtualization for data access and data services delivery on the other.


Big Data Poses Legal Issues and Risks

The State of Data Security and Governance

By Alon Israely

The use of “big data” by organizations today raises some important legal and regulatory concerns. The use of big data systems and cloud-based systems is expanding faster than the rules or legal infrastructure to manage it. Risk management implications are becoming more critical to business strategy. Businesses must get ahead of the practice to protect themselves and their data.

Before a discussion of those legal and risk issues, it’s important that we speak the same language, as the terms “big data” and “cloud” are overused and mean many different things. For our purposes here, big data is the continuously growing collection of datasets that derive from different sources, under individualized conditions, and which form an overall set of information to be analyzed and mined in a manner for which traditional database technologies and methods are not sufficient. Big data analysis requires powerful computing systems that sift through massive amounts of information with large numbers of variables to produce results and reporting that can be used to determine trends and discover patterns to ultimately make smarter and more accurate (business) decisions.

Big data analysis is used to spot everything from business or operational trends to QA issues, new products, new diseases, new ways of socializing, etc. Cloud technologies are required to help manage big data analysis. Big data leverages cloud technologies such as utility computing and distributed storage—that is, massive parallel software that runs to crunch, correlate, and present data in new ways. Cloud infrastructure is highly scalable and allows for an on-demand and usage-based economic model that translates to low-cost yet powerful IT resources, with a low capital expense and low maintenance costs.

Cloud infrastructure becomes even more

important as the creation and use of the data

continues to grow. Every day, Google pro-

cesses more than 24,000TB of data, and a

few of the largest banks process more than

75TB of internal corporate data daily across

the globe. Those massive sets of data form

the basis for big data analysis. And as big

data becomes more widely used and those

datasets continue to grow, so do the legal and

risk issues.

Legal and risk management implications are typically sidelined in the quest for big data mining and analysis because the organization is focused, first and foremost, on trying to use the data effectively and efficiently for its own internal business purposes, rather than on ensuring that any legal and risk management implications are also covered. The potential value of the results

of using big data analysis to increase income

(or lower expenses) for the company tends

to drown out the calls for risk oversight. Big

data can be a Siren, whose beautiful call lures

unsuspecting sailors to a rocky destruction.


Understanding the legal and regulatory con-

sequences will help keep your company safe

from those dangerous rocks.

Developing Protection Strategies

In order to protect the organization from

legal risks when using big data, businesses must

assess issues and develop protection strategies.

The areas most often discussed in relation to legal risk and big data fall in the realm of consumer privacy, but legal compliance issues, such as legal discovery and preservation obligations, are also critical to address. Records informa-

tion management, information governance,

legal, and IT/IS professionals must know how

to identify, gather, and manage big datasets in

a defensible manner when that data and asso-

ciated systems are implicated in legal matters

such as lawsuits, regulatory investigations, and

commercial arbitrations. Organizations must

understand the risks, obligations, and stan-

dards associated with storing and managing

big data for legal purposes. As with all technol-

ogy decisions, there should be a cost/benefit

analysis completed to quantify all risks, includ-

ing soft risks such as the risk to reputation of

data breaches or the misuse of data.

Big data can be a sensitive topic when lawsuits or regulators come knocking—especially if the potential legal risks have not been thoroughly considered by companies early on as they put big data systems in place and then rely upon the associated analysis. Thus it's

important to bring in the lawyers together

with the technologists early, though this is

not always easy to do. Big data from a legal

perspective includes consumer privacy and

international data transfer (cross-border)

issues, but riskier is the potential exposure from using that data in the normal course of business and maintaining the underlying raw data and analyses (e.g., trending reports). For example,

one question raised is about those parts of an

organization’s big data that may be protected

by a legal privilege.

Some examples of big data usage in the

market that carry critical legal implications

and ramifications and which have their own

tough questions include:

• Determining customer trends to identify

new products and markets

• Finding combinations of proteins and

other biological components to identify

and cure diseases

• Using social-networking data (e.g.,

Twitter) to predict financial market

movements

• Consumer level support for finding better

deals, products, or info (e.g., Amazon just-

like-this, or LinkedIn people-you-may-

know functions)

• Using satellite and other geo-related

imagery and data to determine

movement of goods across shipping

lanes and to spot trends in global

manufacturing/distribution

• Corporate reputation management

by following social media and other

internet-based mentions, and comparing

those with internal customer trend data

• Use by government and others to

determine voting possibilities and

accuracy for demographic-related issues

The Legal Risks

With respect to the legal risks involved,

what’s good for the goose is good for the

gander. That is, it’s important to remem-

ber that use of big data by a company may

open the door for discovery by opposing lit-

igants, government regulators, and other legal

adversaries.

Technical limitations of identifying,

storing, searching, and producing raw data

underlying big data analysis may not guard

against discovery, and being forced to pro-

duce raw data underlying the big data

analysis used by the organization to make

important (possibly trade-secret-classified)

decisions can be potentially dangerous for

a company—especially as that data may

end up in the hands of competitors. Thus,

an organization should perform a legal/risk

evaluation before any analysis using big data

is formulated, used, or published.

A major risk faced by organizations uti-

lizing big data analysis is a legal request by

opposing parties and regulators (e.g., for dis-

covery or legal investigation purposes) for

big datasets or their underlying raw data. It can

be very difficult to maintain a limited scope

related only to the legal issues at hand. This

means the organization can end up turning

over far more data than is either necessary

or appropriate due to technical limitations

for segmenting or identifying the relevant

data subsets. Challenges associated with such

issues are still new and thus there are no

known industry best practices, and no legal

authority yet exists. Though this is not good

news for organizations currently using big

data analysis that may also be implicated in lawsuits or other legal matters, there are ways to mitigate exposure and protect the organization as best as possible, even now, while this is still very much unknown territory from a legal compliance perspective.


Information security risks are also import-

ant factors to consider within the larger legal

and risk context. If they are not mitigated

early on, they alone can lead to opening the

door for broader discovery related to big

datasets and systems. Information security in

a broad sense can include:

• Data Integrity and Privacy

• Encryption

• Access Control

• Chain-of-Custody

• Relevant Laws/Regulations

• Corporate Policies

Specific examples of situations where infor-

mation security policies should be monitored

include:

• Vendor Agreements

• Data Ownership & Custody

Requirements

• International Regulations

• Confidentiality Terms

• Data Retention/Archiving

• Geographical Issues

Entering into contracts with third-party

big data-related providers is an area that war-

rants special attention and where legal or risk

problems may arise. Strict controls related to

third parties are important. More and more

big data systems and technologies are sup-

plied by third parties, so the organization

must have certain restrictions and protections

in place to ensure side-door and backdoor

discovery doesn’t occur.

When dealing with third-party control,

avoiding common pitfalls leads to better data

risk and cost control. Common problems that

arise include:

• Inadvertent data spoliation, which

can include stripping metadata and

truncating communication threads

• Custody and control of the data,

including access rights and issues with

data removal

• Problems with relevant policies/

procedures, which can include a lack

of planning and a lack of enforcement

of rules

• International rules and regulations,

including cross-border issues

Big data sources are no different from traditional data sources: they, and the use of big data, should be protected like any other critical corporate document, dataset, or record.

Mitigating Risk

To best mitigate risk from both internal and third-party users, certain procedures related to data access and handling should be implemented via IT control (a brief sketch follows this list):

• Auditing and validation of logins and access

• Logging of actions

• Monitoring

• Chain-of-custody
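As a rough illustration of the first two controls above, the Python sketch below (with hypothetical names and a deliberately simplified permission model) validates each access attempt and records both grants and denials in an append-only audit log.

```python
# Simplified sketch only; a real information risk program would integrate with
# the organization's identity, logging, and chain-of-custody infrastructure.
import logging

logging.basicConfig(filename="access_audit.log",
                    format="%(asctime)s %(message)s", level=logging.INFO)
audit = logging.getLogger("data_access_audit")

def fetch_dataset(user, dataset, allowed):
    """Validate access against an allow list and log the outcome before returning data."""
    if dataset not in allowed.get(user, set()):
        audit.info("DENIED user=%s dataset=%s", user, dataset)
        raise PermissionError(f"{user} may not read {dataset}")
    audit.info("GRANTED user=%s dataset=%s", user, dataset)
    return f"<contents of {dataset}>"  # placeholder for the actual read

# Example: only this analyst may read the 2013 clickstream dataset.
permissions = {"jsmith": {"clickstream_2013"}}
fetch_dataset("jsmith", "clickstream_2013", permissions)
```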

Executive oversight, however, is also an

extremely important method of managing data

risk. Organizational commitment to appro-

priate control procedures evidenced through

executive support is a key factor in creating,

deploying, and maintaining a successful infor-

mation risk management program. Employees

who are able to see the value of the procedures

through the actions and attitudes of those in

management better appreciate the importance

of those procedures themselves.

All in all, a practical, holistic approach is

best for risk mitigation. Here are some tips for

managing legal information/data risk:

• Use a team approach: Include

representatives from legal, IT, risk, and

executives to cover all bases.

• Use written SOPs and protocols:

Standard ways of operating/responding/

process management and following

written protocols are key to consistency.

Consistency helps defend the process in

legal proceedings if needed.

• Leverage native functionality when

responding to legal requests: Reporting

that is sufficient for the business should

be appropriate for the courts. Also be sure

to establish a strong separation of the

presentation layer from the underlying

data for implicated system identification

purposes.

Multi-departmental involvement is also

very important to creating and maintaining

a successful risk mitigation environment and

plan. It is easy to lose track of weak spots in

data handling when only one group is try-

ing to guess the activities of all the others in

an organization. Executives, IT, legal, and

risk all have experiences to share that could

reveal weaknesses in the systems. Review by

a team helps cover all the bases.

Implementation across departments also

reinforces the importance to the organiza-

tion of the risk procedures. Organizations

that create risk programs but choose not to

implement them, or that implement them

inconsistently, face their own challenges when the courts enforce data and document requests, even those requests with a broad scope.

What's Ahead

This is a new field for legal professionals

and the courts. Big data is here to stay and

will become increasingly ubiquitous and a

necessary part of running an efficient and

successful business. Because of that, those sys-

tems and data (including derived analysis and

underlying raw information) will be impli-

cated in legal matters and will thus be subject

to legal rules of preservation, discovery, and

evidence. Those types of legal requirements

are typically burdensome and expensive

when processes are not in place and people

are not trained. Relevant big data systems and

applications are not designed for the type of

operations required by legal rules of preser-

vation and discovery—requirements related

to maintaining evidentiary integrity, chain-

of-custody, data origination, use, metadata

information, and historical access control.

This new technical domain will quickly

become critical to the legal fact-finding pro-

cess. Thus, organizations must begin to think

about how the data is used and maintained

during the normal course of business and

how that may affect their legal obligations if

big data or related systems are implicated—

which will likely be the case in every legal

situation an organization may face. ■

Alon Israely, Esq., CISSP, is a co-founder of Business Intelligence Associates. As a licensed attorney and IT professional with the CISSP credential, he brings a unique perspective to articles and lectures, which has made him one of the most sought-after speakers and contributors in his field. Israely has worked with corporations and law firms to address the management, identification, gathering, and handling of data involved in e-discovery for more than a decade.


sponsored content

With HPCC Systems®, LexisNexis® Data Enrichment is Achieved in Less Than One Day

OVERVIEW

The LexisNexis® Global Content

Systems Group provides content to a wide

array of market facing delivery systems,

including Lexis® for Microsoft® Office, and

LEXIS.COM®. These services deliver access

to content to more than a million end users.

The LexisNexis content collection consists

of more than 2.3 billion documents of

various sizes, and is more than 20 terabytes

of data. New documents are added to the

collection every day.

The raw text documents are prospectively

enhanced by recognizing and resolving

embedded citations, performing multiple

topical classifications, recognizing entities,

and creating statistical summaries and

other data mining activities.

The older documents in the collection

require periodic retrospective processing to

apply new or modified topical classification

rules, and to account for changes on the basis

of the other data enhancements. Without

the periodic retrospective processing, the

collection of documents would become

increasingly inconsistent. The inconsistent

application of the above enhancements

materially reduces the effectiveness of the

data enhancements.

THE CHALLENGE

The LexisNexis content management

system had evolved over a 40-year period

into a complex heterogeneous distributed

environment of proprietary and commodity

servers. The systems acting as repository

nodes were separated from the systems

that performed the data enhancements.

The separation of the repository nodes

from the processing systems required that

copies of the documents be transmitted

from the repository systems to the data

enhancement system, and then transmitted

back to the repository after the enhancement

process completed. The transmission of the

documents created additional processing

latencies, and the elapsed time to perform

a retrospective topical classification or

indexing became several months.

The delay to apply a new classification

to the collection retrospectively created a

situation where older documents might not

be found by a researcher via the topical index

when the index topic was new or recently

modified. The lack of certainty about the

coverage of the indexing required the

researcher to conduct additional searches,

especially when the classification covered

a new or emerging topic.

THE SOLUTION

LexisNexis Global Content Systems Group

consolidated the content management and

document enhancement and mining systems

onto HPCC Systems® to solve multiple data

challenges, including content enrichment

since data enrichment must be applied across

all the content simultaneously to provide a

superior search result.

HPCC Systems from LexisNexis is an

open-source, enterprise-ready solution

designed to help detect patterns and hidden

relationships in Big Data across disparate data

sets. Proven for more than 10 years, HPCC

Systems helped LexisNexis Risk Solutions

scale to a $1.4 billion information company

now managing several petabytes of data on

a daily basis from 10,000 different sources.

HPCC Systems is proven in entity

recognition/resolution, clustering and content

analytics. The massively parallel nature of the

HPCC platform provides both the processing

and storage resources required to fulfill the

dual missions of content storage and content

enrichment.

HPCC Systems was easily integrated with

the existing Content Management workflow

engine to provide document level locking and

other editorial constraints.

The migration of the content repository

and data enhancement processing to the

HPCC platform involved creating several

HPCC “worker” clusters of varying sizes

to perform data enrichments and a single

HPCC Data Management cluster to house

the content. This configuration provides

the ability to send document workloads of

varying sizes to appropriately sized worker

clusters while reserving a substantially sized

Data Management cluster for content storage

and update promotions. Interactive access is

also provided to support search and browse

operations.

THE RESULTS

The new system achieves the goal

of having a tightly integrated content

management and enrichment system that

takes full advantage of HPCC Systems

supercomputing capabilities for both

computation and high speed data access.

The elapsed time to perform an

enrichment pass of the entire data collection

dropped from six to eight weeks to less

than a day. This change is so significant that

LexisNexis has already expanded enrichment into other capabilities that were previously out of reach.

ABOUT HPCC SYSTEMS

HPCC Systems was built for small

development teams and offers a single

architecture and one programming

language for efficient data processing

of large or complex queries. Customers,

such as financial institutions, insurance

companies, law enforcement agencies,

federal government and other enterprise

organizations, leverage the HPCC Systems

technology through LexisNexis products and

services. For more information, visit

www.hpccsystems.com.

LEXISNEXIS www.hpccsystems.com


LexisNexis and the Knowledge Burst Logo are registered trademarks of Reed Elsevier Properties Inc., used under license. HPCC Systems is a registered trademark of LexisNexis Risk Data Management Inc. Copyright © 2012 LexisNexis. All rights reserved.

Unlocking the Potential of Big Data in a Data Warehouse Environment

By W. H. Inmon

The State of Data Warehousing

In the beginning, the “data warehouse”

was a concept that was not accepted by the

database fraternity. From that humble begin-

ning, the data warehouse has become conven-

tional wisdom and is a standard part of the

infrastructure in most organizations. The data

warehouse has become the foundation of

corporate data. When an organization wants

to look at data from a corporate perspective,

not an application perspective, the data ware-

house is the tool of choice.

Data Warehousing and Business Intelligence

A data warehouse is the enabling founda-

tion of business intelligence. Data warehousing

and business intelligence are linked as closely

as fish and water.

The spending on data warehousing and

business intelligence long ago surpassed spending on transaction-based operational

systems. Once, operational systems domi-

nated the budget of IT. Now, data warehous-

ing and business intelligence dominate.

Through the years, data warehouses have

grown in size and sophistication. Once, data

warehouse capacity was measured in giga-

bytes. Today, many data warehouses are mea-

sured in terabytes. Once, single processors

were sufficient to manage data warehouses.

Today, parallel processors are the norm.

Today, also, most corporations understand

the strategic significance of a data warehouse.

Most corporations appreciate that being able

to look at data uniformly across the corpora-

tion is an essential aspect of doing business.

But in many ways, the data warehouse

is like a river. It is constantly moving, never

standing still. The architecture of data ware-

houses has evolved with time. First, there was

just the warehouse. Then, there was the cor-

porate information factory (CIF). Then, there

was DW 2.0. Now there is big data.

Enter Big Data

Continuing the architectural evolution is the

newest technology—big data. Big data technol-

ogy arrived on the scene as an answer to the need

to service very large amounts of data. There are

several definitions of big data. The definition

discussed here is the one typically discussed in

Silicon Valley. Big data technology:

• Is capable of handling lots and lots

of data


• Is capable of operating on inexpensive

storage

Big data:

• Is managed by the “Roman census”

method

• Resides in an unstructured format

Organizations are finding that big data

extends their capabilities beyond the scope

of their current horizon. With big data tech-

nology, organizations can search and ana-

lyze data well beyond what would have ever

fit in their current environment. Big data

extends well beyond anything that would

ever fit in the standard DBMS environment.

As such, big data technology extends the

reach of data warehousing as well.

Some Fundamental Challenges

But with big data there come some fun-

damental challenges. The biggest challenge is

that big data cannot be analyzed using

standard analytical software. Standard analyt-

ical software makes the assumption that data

is organized into standard fields, columns,

rows, keys, indexes, etc. This classical DBMS

structuring of data provides context to the

data. And analytical software greatly depends

on this form of context. Stated differently, if

standard analytical software does not have the

context of data that it assumes is there, then

the analytical software simply does not work.

Therefore, without context, unstructured

data cannot be analyzed by standard analyt-

ical software. If big data is to fulfill its destiny,

there must be a means by which to analyze big

data once the data is captured.

Determining Context

There have been several earlier attempts

to analyze unstructured data. Each of the

attempts has its own major weakness. The

previous attempts to analyze unstructured

data include:

1. NLP—natural language processing. NLP is intuitive. But the flaw with NLP is

that NLP assumes context can be determined

from the examination of text. The problem

with this assumption is that most context is

nonverbal and never finds its way into any

form of text.

2. Data scientists. The problem with

throwing a data scientist at the problem of

needing to analyze unstructured data is that

the world only has a finite supply of those

scientists. Even if the universities of the world

started to turn out droves of data scientists, the

demand for data scientists everywhere there is

big data would far outstrip the supply.

3. MapReduce. The leading technology of

big data—Hadoop—has technology called

MapReduce. In MapReduce, you can create and

manage unstructured data to the nth degree.

But the problem with MapReduce is that it

requires very technical coding in order to be

implemented. In many ways MapReduce is like

coding in Assembler. Thousands and thousands

of lines of custom code are required. Further-

more, as business functionality changes, those

thousands of lines of code need to be main-

tained. And no organization likes to be stuck with ongoing maintenance of thousands of lines of detailed, technical custom code. (A brief sketch of the mapper/reducer pattern appears after this list.)

4. MapReduce on steroids. Organizations

have recognized that creating thousands of

lines of custom code is no real solution.

Instead, technology has been developed that

accomplishes the same thing as MapReduce

except that the code is written much more

efficiently. But even here there are some

basic problems. The MapReduce on steroids

approach is still written for the technician, not

the business person. And the raw data found

in big data is essentially missing context.

5. Search engines. Search engines have

been around for a long time. Search engines

have the capability of operating on unstruc-

tured data as well as structured data. The only

problem is that search engines still need

data to have context in order for a search to

produce sophisticated results. While search

engines can produce some limited results

while operating on unstructured data, sophis-

ticated queries are out of the reach of search

engines. The missing ingredient that search

engines need is the context of data which is

not present in unstructured data.
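To give a sense of the hand coding described in item 3 above, here is a minimal word-count sketch written in the style of a Hadoop Streaming job in Python; it is illustrative only, and a real job over unstructured text grows far larger once parsing, context handling, and error recovery are added, which is exactly the maintenance burden described above.

```python
# Minimal mapper/reducer sketch: the mapper emits (word, 1) pairs and the
# reducer sums counts for each word. Hadoop normally runs these as separate
# stages and sorts the mapper output by key; here both are chained locally.
import sys
from itertools import groupby

def mapper(lines):
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

def reducer(sorted_pairs):
    for word, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    for word, total in reducer(sorted(mapper(sys.stdin))):
        print(f"{word}\t{total}")
```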

So the data warehouse has arrived at the

point where it is possible to include big data

in the realm of data warehousing. But in order

to include big data, it is necessary to overcome

a very basic problem—the data found in big

data is void of context, and without context,

it is very difficult to do meaningful analysis

on the data.

While it is possible that data warehousing

will be extended to include big data, unless

the basic problem of achieving or creating

context in an unstructured environment is

solved, there will always be a gap between big

data and the potential value of big data.

Deriving context then is the forthcoming

major issue of data warehouse and big data for

the future. Without being able to derive context


for unstructured data, there are limited uses for

big data. So exactly how can context of text be

derived, especially when context of text cannot

be derived from the text itself?

Deriving Context

In fact, there are two ways to derive con-

text for unstructured data. Those ways are

“general context” and “specific context.” Gen-

eral context can be derived by merely declar-

ing a document to be of a particular variety. A

document may be about fishing. A document

may be about legislation. A document may

be about healthcare, and so forth. Once the

general context of the document is declared,

then the interpretation of text can be made in

accordance with the general category.

As a simple example, suppose there were

in the raw text this sentence: “President Ford

drove a Ford.” If the general context were

about motor cars, then Ford would be inter-

preted to be an automobile. If the general

context were about the history of presidents

of the U.S., then Ford would be interpreted to

be a reference to a former president.
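A toy lookup, with invented categories and glossaries, shows how a declared general context steers the interpretation of an ambiguous token such as "Ford"; real textual disambiguation is, of course, far richer than a dictionary lookup.

```python
# Illustrative only: the general context declared for a document selects which
# glossary is used to interpret an ambiguous term.
GLOSSARIES = {
    "automobiles": {"ford": "an automobile make"},
    "us_presidents": {"ford": "Gerald Ford, 38th U.S. president"},
}

def interpret(token, general_context):
    glossary = GLOSSARIES.get(general_context, {})
    return glossary.get(token.lower(), "unknown")

print(interpret("Ford", "automobiles"))      # an automobile make
print(interpret("Ford", "us_presidents"))    # Gerald Ford, 38th U.S. president
```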

Textual Disambiguation

The other type of context is specific con-

text. Specific context can be derived in many

different ways. Specific context can be derived

by the structure of a word, the text sur-

rounding a word, the placement of words in

proximity to each other, and so forth. There

is new technology called “textual disambig-

uation” which allows raw unstructured text

to have its context specifically determined.

In addition, textual disambiguation allows

the output of its processing to be placed in

a standard database format so that classical

analytical tools can be used.

At the end of textual disambiguation,

analytical processing can be done on the

raw unstructured text that has now been

disambiguated.

The Value of Determining Context

The determination of the context of unstruc-

tured data opens the door to many types of

processing that previously were impossible. For

example, corporations can now:

Read, understand, and analyze their corporate contracts en masse. Prior to textual

disambiguation, it was not possible to look at

contracts and other documents collectively.

Analyze medical records. For all the work

done in the creation of EMRs (electronic

medical records), there is still much narrative

in a medical record. The ability to understand

narrative and restructure that narrative into a

form and format that can be analyzed auto-

matically is a powerful improvement over the

techniques used today.

Analyze emails. Today after an email is read,

it is placed on a huge trash heap and is never

seen again. There is, however, much valuable

information in most corporations’ emails. By

using textual disambiguation, the organization

can start to determine what important infor-

mation is passing through their hands.

Analyze and capture call center data. Today, most corporations look at and analyze

only a sampling of their call center conversa-

tions. With big data and textual disambigua-

tion, now corporations can capture and ana-

lyze all of their call center conversations.

Analyze warranty claims data. While a

warranty claim is certainly important to the

customer who has made the claim, warranty

analysis is equally important to the manufac-

turer to understand what manufacturing pro-

cesses need to be improved. By being able to

automatically capture and analyze warranty

data and to put the results in a database, the

manufacturer can benefit mightily.

And the list goes on and on. This short

list is merely the tip of the tip of the iceberg

when it comes to the advantages of being

able to capture and analyze unstructured

data. Note that with standard structured

processing, none of these opportunities have

come to fruition.

Some Architectural Considerations

One of the architectural considerations

of managing big data through textual dis-

ambiguation technology is that raw data on

a big data platform cannot be analyzed in a

sophisticated manner. In order to set the stage

for sophisticated analysis, the designer must

take the unstructured text from big data,

pass the text through textual disambiguation,

then return the text back to big data. How-

ever, when the raw text passes through textual

disambiguation, it is transformed into disam-

biguated text. In other words, when the raw

text passes through textual disambiguation, it

passes back into big data, where the context of

the raw text has been determined.

Once the context of the unstructured text

has been determined, it can then be used for

sophisticated analytical processing.

What's Ahead

The argument can be made that the pro-

cess of disambiguating the raw text then

rewriting it to big data in a disambiguated

state increases the amount of data in the

environment. Such an observation is abso-

lutely true. However, given that big data is

cheap and that the big data infrastructure is

designed to handle large volumes of data, it

should be of little concern that there is some

degree of duplication of data after raw text

passes through the disambiguation process.

Only after big data has been disambiguated is

the big data store fit to be called a data ware-

house. However, once the big data is disam-

biguated, it makes a really valuable and really

innovative addition to the analytical, data

warehouse environment.

Big data has much potential. But unlock-

ing that potential is going to be a real chal-

lenge. Textual disambiguation promises to be

as profound as data warehousing once was.

Textual disambiguation is still in its infancy,

but then again, everything was once in its

infancy. However, the early seeds sown in tex-

tual disambiguation are bearing some most

interesting fruit. ■

W. H. Inmon—the "father of the data warehouse"—has written 52 books published in nine languages. Inmon speaks at conferences regularly. His latest adventure is the building of Textual ETL—textual disambiguation—technology that reads raw text and allows raw text to be analyzed. Textual disambiguation is used to create business value from big data. Inmon was named by ComputerWorld as one of the 10 most influential people in the history of the computer profession, and lives in Castle Rock, Colo.


sponsored content

Filling the Content Blind Spot

The adage "Every company is a data

company” is more true today than ever.

The problem is most companies don’t

realize how much valuable data they’re

actually sitting on, nor how to access

and use this untapped data. Companies

must exploit whatever data enters their

enterprise in every format and from every

source to gain a comprehensive view of

their business.

Most IT professionals focus all

their resources on figuring out how to

effectively access structured data sources.

Projects associated with data warehousing

and business intelligence get all the

attention. And in some cases they yield

valuable insights into the business. But

the fact is that structured data sources

are just the tip of the iceberg inside most

companies. There is so much intelligence

that goes unseen and unanalyzed simply

because they don’t know how to get at it.

For that reason, forward-looking CIOs

and IT organizations have begun exploring

new strategies for tapping into other

non-traditional sources of information

to get a more complete picture of their

business. These strategies attempt to gather

and analyze highly unstructured data like

websites, tweets and blogs to discover trends

that might impact the business.

While this is a step in the right direction,

it misses the bigger picture of the Big Data

landscape. The “blind spots” in these data

strategies are both the unstructured and

semi-structured data that is contained in

content like reports, EDI streams, machine

data, PDF files, print spools, ticker feeds,

message buses, and many other sources.

UNDERSTANDING THE CONTENT BLIND SPOT

A growing number of IT organizations

now see value in information contained

within these content blind spots. The key

reason: It enhances their business leaders’

ability to make smarter decisions because

much of this data provides a link to past

decisions.

Companies also realize that these non-

traditional data sources are growing at an

exponential rate. They have become the

language of business for industries like

healthcare, financial services and retail. So

where do you find these untapped sources

of information? Easy; they’re everywhere.

As companies have rolled out ERP, CRM

and other enterprise systems (including

enterprise content management tools), they

have also created thousands of standard

reports. Companies are also stockpiling

volumes of commerce data with EDI

exchanges. Excel spreadsheets are ubiquitous

as well. And as PDF files of invoices and

bills-of-lading are exchanged, vital data is

being saved. All these sources possess semi-

structured data that can reveal valuable

business insight.

But how do you get to these sources,

and what do you do with them?

OPTIMIZING INFORMATION THROUGH VISUAL DATA DISCOVERY

Next-generation analytics enable

businesses to analyze any data variety,

regardless of structure, at real-time velocity

for fast decision making in a visual data

discovery environment. These analytic tools

link diverse data types with traditional

decision-making tools like spreadsheets and

business intelligence (BI) systems to offer

a richer decision making capability than

previously possible.

By tapping into semi-structured and

unstructured content from varied sources

throughout an organization, next-gen

analytics solutions are able to map these

sources to models so that they can be

combined, restructured and analyzed.

While it sounds simple, the technology

actually requires significant intelligence

regarding the structural components of the

content types to be ingested and the ability

to break these down into “atomic level” items

that can be combined and mapped together

in different ways.
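As a rough illustration of that structural intelligence, the sketch below breaks one line of an invented report layout into atomic fields with a regular expression so that it could be combined with other sources; the line format and field names are hypothetical.

```python
# Illustrative only: parse a fixed-layout report/print-spool line into fields.
import re

INVOICE_LINE = "INV-2013-0456  2013-11-02  ACME CORP            1,249.00"
PATTERN = re.compile(
    r"(?P<invoice>\S+)\s+(?P<date>\d{4}-\d{2}-\d{2})\s+"
    r"(?P<customer>.+?)\s+(?P<amount>[\d,]+\.\d{2})$"
)

def parse_line(line):
    match = PATTERN.match(line)
    if not match:
        return None  # line did not fit the expected layout
    record = match.groupdict()
    record["amount"] = float(record["amount"].replace(",", ""))
    return record

print(parse_line(INVOICE_LINE))
```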

For organizations to fully exploit the

power of their information, they have to

uncover the content blind spots in their

enterprise that hold so much underutilized

value. Leveraging structured, unstructured

and semi-structured content in a visual

discovery environment can deliver enormous

improvements in decision making and

operational effectiveness.

DATAWATCH www.datawatch.com



sponsored content

Overcoming the Big Data Transfer Bottleneck

Businesses all over the world are

beginning to realize the promise of Big

Data. After all, being able to extract data

from various sources across the enterprise,

including operational data, customer

data, and machine/sensor data, and then

transform it all into key business insights can

provide significant competitive advantage. In

fact, having up-to-date, accurate information

for analytics can make the difference

between success and failure for companies.

However, it’s not easy. A recent study by

Wikibon noted that returns thus far on Big

Data investments are only 50 cents to the

dollar. A number of challenges stand in front

of maximizing return. The data transfer

bottleneck is but one real and pervasive issue

that’s causing many headaches in IT today.

REASONS FOR THE BIG DATA TRANSFER BOTTLENECK

Outdated technology. Moving data is

hard. Moving Big Data is harder. When

companies rely on heritage platforms

engineered to support structured data

exclusively, such as ETL, they quickly find

out that the technology simply cannot scale

to handle the volume, velocity or variety of

data, and therefore, cannot meet the real-

time information needs of the business.

Lagging system performance. Even if

source and target systems are in the same

physical location, data latency can still be

a problem. Data often resides in systems

that are used daily for operational and

transaction processing. Using complex

queries to extract data and launching

bulk data loads mean extra work for CPU

and disk resources, resulting in delayed

processing for all users.

Complex setup and implementation. Sometimes companies manage to deliver

data using complex, proprietary scripts and

programs that take months of IT time and

effort to develop and implement. With SLAs

to meet and business opportunities at risk of

being lost, most companies simply don’t have

the luxury of wading through this difficult

and time-consuming process.

Delays caused by writing data to disk. When information is extracted from systems,

it is often sent to a staging area and then

relayed to the target to be loaded. Storage to

disk causes delays as data is written and then

read in preparation for loading.

Proliferation of sources and targets. With data that can reside in a variety of

transactional databases such as Oracle, SQL

Server, IBM Mainframe, and with newer

data warehouse targets such as Vertica,

Pivotal, Teradata UDA and Microsoft PDW

on the rise, setup time can increase and

performance can be lost using solutions that

are not optimized to each platform.

Limited Internet bandwidth. If source

and target systems are in different physical

locations, or if the target is in the cloud,

insufficient Internet bandwidth can be a

major cause of data replication lag. Most

networks are configured to handle general

operations but are not built for massive data

migrations.

HARSH REALITY

When timely information isn't available,

key decisions need to be deferred. This

can lead to lost revenues, decreased

competitiveness, or lower levels of customer

satisfaction. Additionally, the reliability of

decisions made without real-time data may

also be called into question.

THE ANSWER

There is a solution to overcoming this

challenge. Attunity beats the Big Data

bottleneck by providing high-performance

data replication and loading for the broadest

range of databases and data warehouses

in the industry. Its easy, Click-2-Replicate

design and unique TurboStream DX data

transfer and CDC technologies give it the

power to stand up to the largest bottlenecks

and win. Partner with Attunity. You too can

beat the data transfer bottleneck!

Learn more! Download this eBook by data management expert, David Loshin: Big Data Analytics Strategies —Beating the Data Transfer Bottleneck for Competitive Gain http://bit.ly/ATTUeBook

ATTUNITY For more information, visit www.Attunity.com or call (800) 288-8648 (toll free) +1 (781) 730-4070.


Cloud Technologies Are Maturing to Address Emerging Challenges and Opportunities

By Chandramouli Venkatesan

The State of Cloud Technologies

Cloud technologies and frameworks have matured in recent years and enter-

prises are starting to realize the benefits of

cloud adoption—including savings in infra-

structure costs, and a pay-as-you-go service

model similar to Amazon Web Services. Here

is a look at the cloud market and its conver-

gence with the big data market, including key

technologies and services, challenges, and

opportunities.

Evolution of the 'Cloud' Adoption

The technology, platform, and services

that were available in the early 1990s were

similar to the “cloud” adoption of the last

decade. We had distributed systems with Sun

RISC-based server workstations, IBM main-

frames, millions of Intel-based Windows

desktops, Oracle Database Servers (includ-

ing Grid Computing–10g), and J2EE N-tier

architecture. There were application service

providers (ASPs), managed service providers

(MSPs), and internet service providers (ISPs)

offering services similar to cloud offerings

today. What has changed?

There were significant events that trig-

gered the emergence of cloud offerings and

their adoption. The first one was the Amazon

Web Services (S3, EC2, RDS, SQS) scale out

and the development of IaaS (infrastructure

as a service) once Amazon was able to real-

ize the benefits of its offering for its own

internal use. The second major event was Google's search engine, advertising platform, and Bigtable (memcache), and the realization that millions of nodes of cheap commodity hardware could be leveraged to harness MapReduce and other frameworks to distribute search queries and provide results with millisecond response times (unheard of with even mainframes).

In the mid-2000s, the traditional telecom and mobile phone service providers saw that they needed to move

to scale-out platforms (the cloud) to manage

their mobile customer base, which grew from

a few million to a billion (factor of 1000).

The mobile data grew from a few terabytes

to petabytes and they needed newer scale-out

platforms and wanted on-premise as well as

hybrid cloud deployments.

The creators of Hadoop ran TeraSort

benchmarks with large clusters of nodes in

order to determine the benefits of MapReduce

frameworks. It resulted in the emergence of

the Hadoop Cluster Distribution; NoSQL data

stores such as columnar, document, and graph

databases; and massively parallel processing

(MPP) analytical databases. An ecosystem of

vendors emerged to reap the benefits of the

scale-out cloud infrastructure, MapReduce

frameworks, and Hadoop and NoSQL data

stores. The applications included data migra-

tion, predictive analytics, fraud detection, and

data aggregation from multiple data sources.

The new paradigm shift addressed the

key issue of scale as well as the handling of

unstructured data that was lacking in tradi-

tional relational databases. The paradigm


shift occurred as a result of the availability of

commodity hardware and a framework to run

massively parallel data processing across clusters of nodes, including a distributed file system, high-

performance analytical databases, and NoSQL

data stores for handling unstructured data.

Hybrid Cloud

For enterprises that are adopting the hybrid

(public/private/community) cloud pay-as-you-

go model for IaaS, PaaS, and SaaS cloud

deployments, the key drivers are cost, flexibil-

ity, and speed (time to set up hardware, soft-

ware, and services). The primary use cases for

the new hybrid model include data migration, fraud detection, and the ability to manage unstructured data in real time.

But the move to hybrid cloud deploy-

ment comes with new challenges and risks.

The biggest challenge for cloud deployments

today is in the area of data security and iden-

tity. There are several cloud providers who

offer IaaS, PaaS, SaaS, network as a service,

and “everything as a service” and probably

offer good firewalls to protect data within

the boundaries of their data center. The chal-

lenges include data at rest, data in flight to and from mobile devices accessing the cloud provider, and data derived from multiple cloud providers to present a single view to the mobile customer.

BYOD

Ubiquitous mobile computing is driv-

ing the new cloud adoption model faster than

anticipated and a key driver is BYOD (bring

your own device). The traditional IT shop

had control of its assets whether on-premise

or on cloud. However, the demands of BYOD

and the myriad mobile devices, applications,

and mobile stores have resulted in the IT

organization losing control of users’ identity,

as one can have more than one profile. The

use of biometric information such as fingerprints and eye scans is still in its infancy for mobile users. There are some efforts

in standardization in cloud identity manage-

ment such as OpenID Connect, OAuth, and

SIEM, but the adoption is slow, and it will

take time to work seamlessly across many

cloud providers.

Trust the 'Cloud' Providers

The key security issue for cloud and mobil-

ity deployment is establishment of “trust” and

“trust boundaries.” There are several players

in the cloud and mobile deployments offer-

ing different services, and they need to work

seamlessly end-to-end. Trustworthiness

is enabled by the ability to automatically sign-

off or hand-off to another cloud/mobile ser-

vice provider in the “trust boundary” and still

maintain the data integrity at each hand-off.

The automatic sign-off would need to verify

the validity of the cloud provider, protect

the identity of the users, as well as guarantee

the nontampering of content. The interme-

diate trust verification providers would also

be a cloud provider similar to verification of

ecommerce internet sites. The trust verifica-

tion provider must support the SLAs for secu-

rity, identity, and trust between mobile and

cloud service provider. The key requirement

is to ensure the integrity and trust between

mobile and cloud providers, inter-cloud, and

intra-cloud providers.

The mobile end user will have a trust

boundary with mobile/telecom service pro-

vider (cellular) or managed service provider

(Wi-Fi). The trust will be recorded, and some

portion of identity will be passed on to one or

more cloud providers offering different ser-

vices. Each trust boundary will have a nego-

tiation between mobile and cloud providers

or between cloud providers to establish the

identity, security, and integrity of data as well

as the mobile user.

The future of cloud is in the convergence

of simple standards for security, identity, and

trust, and it involves all participants in the

cloud: mobile device vendors; service provid-

ers; cloud IaaS, PaaS, and SaaS vendors; and

the network. The pay-as-you-go model would

have a price tag factored in for a minimum

SLA level in terms of guarantee and addi-

tional pricing based on additional levels of

security, including security locks at the CPU

boundary.

Cloud/Big Data Frameworks

In addition to cloud security, identity ver-

ification, and trust regarding data integrity,

the technology of cloud/big data frameworks

will have rapid changes and adoption in the

next few years. One such adoption is the

standardization of a query language for the

NoSQL data stores similar to SQL for rela-

tional database management systems. The

query language will result in query nodes that

accept incoming queries and in turn result

in distributed queries across the cluster of

nodes, handling all issues dealing with data,

including security, speed, and reliability of the

transaction.
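A minimal sketch of that scatter/gather pattern, with invented node names and a placeholder standing in for the real remote call, looks roughly like this:

```python
# Illustrative only: a query node fans the same query out to every node in the
# cluster in parallel and merges the partial results.
from concurrent.futures import ThreadPoolExecutor

CLUSTER = ["node-a", "node-b", "node-c"]

def query_node(node, query):
    # Placeholder for a real remote call returning that node's partial result.
    return [f"{node}: row matching {query!r}"]

def distributed_query(query):
    with ThreadPoolExecutor(max_workers=len(CLUSTER)) as pool:
        partials = pool.map(lambda node: query_node(node, query), CLUSTER)
    return [row for partial in partials for row in partial]

print(distributed_query("status = 'active'"))
```

A production query layer would also handle the security, speed, and reliability concerns noted above; the sketch shows only the fan-out and merge.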

The price per TB (not GB) of flash and random access memory will drive the future adoption of cloud/big data predictive analytics and learning models. This is key in generating value in different verticals such as healthcare, education, energy, and finance. The ability


to keep big data in an in-memory cache in a

smaller footprint including mobile devices will

result in improvements in gathering, collecting,

and storing data from trillions of mobile devices

and performing data, predictive, behavioral, and

visual analytics at near-real time (microseconds).

The key cloud adoption driver today is the num-

ber of cores per computing node. The future

of cloud adoption will involve a large memory

cache in addition to many cores per computing

node (commodity hardware).

The technology of Hadoop frameworks

has evolved since 2004, and it includes the

MapReduce framework, the Hadoop Distrib-

uted File System, and additional technologies.

There is a need for a “beyond Hadoop” frame-

work, and the future of Hadoop will be built

into the platform (iOS, Linux, Windows,

etc.) similar to a task scheduler in a platform

OS. The new frameworks “beyond Hadoop”

will need to provide distributed query search

engines out of the box, the ability to easily

manage custom queries, and the ability to

provide a mechanism to have an audit trail

of data transformations end-to-end across

several mobile and cloud providers. The audit

trail or probe will be similar to a ping or trace

route command, and it should be available to

ensure the integrity of data for end-to-end

deployment.

Emerging Standards

There are several emerging standards

for cloud deployments, primarily to address

identity, security, and software-defined

networking (SDN). IaaS, PaaS, and SaaS

cloud deployments have matured, and there

are several players that coexist in the cloud

ecosystem today. The standards such as

OpenID, Open Connect, OAuth, and Open

Data Center Alliance have several cloud pro-

viders and enterprises signing up every day,

but the adoption will take a few more years

to evolve and mature. Open standards are the

key to the future adoption of cloud and the

seamless flow of secure data among differ-

ent cloud providers. This offers a paradigm

similar to a free market economy, which is a

goal, but in reality, the goal to be strived for

by future cloud players is about 60% open

standards and 40% proprietary frameworks

in order to promote competition and a level

playing field. Customers will demand faster

adoption in open standards for cloud deploy-

ments, and the keys to adoption are speed,

flexibility, cost, and focus on solving their

problems efficiently. The current approach of

enterprises spending time and money in the

evaluation, selection, and use of cloud pro-

viders will pave the way for on-demand, pay-as-you-go cloud providers for blended services. There will be a blend of services lever-

aging mobile and cloud deployment, such as

single sign-on, presales, actual customer sale,

post-sales, recommendation systems, etc. The

cloud adoption of IaaS, PaaS, and SaaS will

give way to business models similar to prepay,

post-pay debit/credit cards for products and

services with “cloud” ready offerings.

The cloud/big data deployments will

see the emergence of multiple data centers

managed by multiple cloud providers, and

the cloud will have to support distributed

query-based search, with results that can be

provided to the mobile user at near real time.

This would require open standards to allow

seamless data exchange between multiple

data centers, maintaining the SLA levels for

performance, scalability, security, and iden-

tity. It is a clear challenge and opportunity for

the future of cloud, but it is likely that new

mobile apps will drive the need for coopera-

tion between cloud providers or result in con-

solidation of several players into a few mobile

and cloud providers.

Billing Systems for the Cloud

Future cloud deployments will require

both mobile and cloud provider payment

processing to keep pace with other aspects of

the cloud deployment model, such as secu-

rity, scalability, cost-savings, and reliability.

We would require, at a minimum, a billing

provider to provide platform billing and rec-

onciliation of payments between cloud and

mobile service providers. The challenge of the

future for cloud-based billing providers is the

payment processing for a blend of services.

For example, payment of different rates for

providers in the cloud, such as device, mobile,

cloud infrastructure/platform provider, stor-

age, network, and payment service providers.

The break-even and moderate margin for a

pay-as-you-go model in the cloud will be 40%

cost and 60% revenue; the cost reduction over

time would result from consolidation of both the mobile and cloud service providers offering integrated services. The pay-as-

you-go business model, with SLA guarantees,

will be appealing for mom-and-pop stores

that want to adopt cloud services, coexist, and

compete with big retail stores, and will ulti-

mately result in better service and lower cost

for the consumer.

The future of cloud deployments will

involve rapid adoption of new technology

frameworks beyond Hadoop, open standards

in the area of cloud security, identity, and

trust, as well as a universal and simple query

language for aggregating data from legacy and

emerging data stores. Future cloud adoption

will involve trillions of mobile devices, ubiq-

uitous computing, zettabytes of data, and

improved SLAs between cloud providers, as

well as larger, cheaper memory cache and

multiple cores per computing node. ■

Chandramouli Venkatesan (Mouli) has more than 20 years of experience in the telecom industry, including technical leadership roles at Fujitsu Networks and Cisco Systems, and as a big data integration architect in the financial and healthcare industries. Venkatesan's company MEICS, Inc. (www.meics.org) provides the analytics and learning platform for cloud deployments. Venkatesan evangelizes emerging technologies and platforms and innovation in cloud, big data, mobility, and content delivery networks.


sponsored content

Get Real With Big Data

Many organizations recognize

the value of generating insights from

their rapidly increasing web, social,

mobile and machine-generated data.

However, traditional batch analysis is

not fast enough.

If data is even one day old, the

insights may already be obsolete.

Companies need to analyze data in

near-real-time, often in seconds.

Additionally, no matter the timeliness

of these insights, they are worthless

without action. The faster a company

acts, the more likely there is a return on

the insight such as increased customer

conversion, loyalty, satisfaction or

lower inventory, manufacturing, or

distribution costs.

Companies seek technology solutions

that allow them to become real-time,

data-driven businesses, but have been

challenged by existing solutions.

LEGACY DATABASE OPTIONS

Traditional RDBMSs (Relational

Database Management Systems) such

as Oracle or IBM DB2 can support

real-time updates, but require expensive

specialized hardware to “scale up” to

support terabytes to petabytes of data.

At millions of dollars per installation,

this becomes cost-prohibitive—quickly.

Traditional open source databases such

as MySQL and PostgreSQL are unable

to scale beyond a few terabytes without

manual sharding. However, manual sharding

requires a partial rewrite of every application

and makes periodically rebalancing shards

a maintenance nightmare.
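
A minimal sketch of what manual sharding looks like at the application level helps explain both objections: every query path must route through code like the following, and changing the number of shards remaps keys, forcing data migration and application redeployment. The connection strings and schema are hypothetical.

# Minimal illustration of manual (application-level) sharding.
# Every query path in the application must route through a function like this,
# which is why adding or rebalancing shards forces application rewrites.
SHARDS = ["mysql://db0", "mysql://db1", "mysql://db2"]  # hypothetical connection strings

def shard_for(customer_id):
    # Hash-mod routing: changing len(SHARDS) remaps most keys,
    # so rebalancing means migrating data and redeploying every client.
    return SHARDS[hash(customer_id) % len(SHARDS)]

def fetch_orders(customer_id):
    target = shard_for(customer_id)
    # In a real application this would open a connection to `target`
    # and run the query there; cross-shard joins need extra application code.
    return f"SELECT * FROM orders WHERE customer_id = {customer_id} -- run on {target}"

print(fetch_orders(42))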

New Big Data technologies such as

Hadoop and HBase are cost-effective

platforms that are proven to scale from

terabytes to petabytes, but they provide

little or no SQL support. This lack of SQL

support is a major barrier to Hadoop

adoption and is also a major shortcoming

of NoSQL solutions, because of the massive

retraining required. Companies adopting

these technologies cannot leverage existing

investments in SQL-trained people, or SQL

Business Intelligence (BI) tools.

SPLICE MACHINE: THE REAL-TIME SQL-ON-HADOOP DATABASE

Splice Machine brings the best of these

worlds together. It is a standard SQL

database supporting real-time updates and

transactions implemented on the scalable,

Hadoop distributed computing platform.

Designed to meet the needs of real-time,

data-driven businesses, Splice Machine is

the only transactional SQL-on-Hadoop

database. Like Oracle and MySQL, it is a

general-purpose database that can handle

operational (OLTP) or analytical (OLAP)

workloads, but can also scale out cost-

effectively on inexpensive commodity

servers.

Splice Machine marries two proven

technology stacks: Apache Derby, a Java-

based, full-featured ANSI SQL database, and

HBase/Hadoop, the leading platforms for

distributed computing.
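
Apache Derby normally exposes its SQL engine over a JDBC network client, so a Python application could plausibly reach such a database through a JDBC bridge like jaydebeapi. The sketch below is illustrative only: the driver class, URL, port, credentials, and jar path are placeholder assumptions, not values taken from this article or from vendor documentation.

# Hedged sketch: reaching a SQL-on-Hadoop database from Python over JDBC.
# The driver class, JDBC URL, port, credentials, and jar path are assumptions
# for illustration only; consult the vendor documentation for the real values.
import jaydebeapi

conn = jaydebeapi.connect(
    "com.splicemachine.db.jdbc.ClientDriver",   # assumed driver class
    "jdbc:splice://localhost:1527/splicedb",    # assumed URL and port
    ["app_user", "app_password"],               # assumed credentials
    "splice-jdbc-driver.jar",                   # assumed local jar path
)
try:
    cur = conn.cursor()
    # Ordinary ANSI SQL: the point of a transactional SQL-on-Hadoop engine is that
    # standard DDL/DML like this works unchanged at HBase/Hadoop scale.
    cur.execute("CREATE TABLE events (id INT PRIMARY KEY, ts TIMESTAMP, payload VARCHAR(256))")
    cur.execute("INSERT INTO events VALUES (1, CURRENT_TIMESTAMP, 'page_view')")
    conn.commit()
    cur.execute("SELECT COUNT(*) FROM events")
    print(cur.fetchall())
finally:
    conn.close()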

SPLICE MACHINE ENABLES YOU TO GET REAL WITH BIG DATA

As the only transactional SQL-on-

Hadoop database, Splice Machine presents

unlimited possibilities to application

developers and database architects. Best of

all, it eliminates the compromises that have

been part of any Big Data database platform

selection to date.

Splice Machine is uniquely qualified to

power applications that can harness real-

time data to create more valuable insights

and drive better, more timely actions. This

enables companies that use Splice Machine

to become real-time, data-driven businesses

that can leapfrog their competition and get

real results from Big Data.

SPLICE MACHINE [email protected] www.splicemachine.com

Get Real With Big Data

WITH SPLICE MACHINE, COMPANIES CAN:

Unlock the Value of Hadoop. Splice

Machine provides a standard ANSI SQL

engine, so any SQL-trained analyst or SQL-

based application can unlock the value of

the data in a current Hadoop deployment,

across most major distributions.

Combine NoSQL and SQL. Splice

Machine enables application developers

to enjoy the best of both SQL and NoSQL,

bringing NoSQL scalability with SQL

language support.

Avoid Expensive “Big Iron.” Splice

Machine frees companies with specialized

server hardware from the spiraling costs of

scaling up to handle more than a few terabytes.

Scale Beyond MySQL. Splice Machine

can help those companies scale beyond

a few terabytes with the proven auto-

sharding capability of HBase.

“Future-proof ” New Apps. Splice

Machine provides a “future-proof”

database platform that can scale cost-

effectively from gigabytes to petabytes

for new applications.

Data Quality and MDM Programs Must Evolve to Meet Complex New Challenges

By Elliot King

The State of Data Quality and Master Data Management

Data quality has been one of the central

issues in information management since the

beginning—not the beginning of modern

computing and the development of the cor-

porate information infrastructure but since

the beginning of modern economics and

probably before that. Data quality is what

audits are all about.

Nonetheless, the issues surrounding data

quality took on added importance with the

data explosion sparked by the large-scale inte-

gration of computing into every aspect of

business activity. The need for high-quality

data was captured in the punch-card days of

the computer revolution with the epigram gar-

bage in, garbage out. If the data isn’t good, the

outcome of the business process that uses that

data isn’t good either.

Data growth has always been robust, and

the rate keeps accelerating with every new gen-

eration of computing technology. Mainframe

computers generated and stored huge amounts

of information, but then came minicomputers

and then personal computers. At that point,

everybody in a corporation and many people

at home were generating valuable data that was

used in many different ways. Relational data-

bases became the repositories of information

across the enterprises, from financial data to

product development efforts, from manufac-

turing to logistics to customer relationships to

marketing. Unfortunately, given the organiza-

tional structure of most companies, frequently

data was captured in divisional silos and could

not be shared among different departments—

finance and sales, for example, or manufac-

turing and logistics. Since data was captured

in different ways by different organizational

units, integrating the data to provide a holistic

picture of business activities was very difficult.

The explosion in the amount of structured

data generated by a corporation sparked two

key developments. First, it cast a sharp spotlight

on data quality. The equation was pretty simple.

Bad data led to bad business outcomes. Second,

efforts were put in place to develop master data

management programs so data generated by

different parts of an organization could be coor-

dinated and integrated, at least to some degree.

Challenges to Data Quality and MDM

Efforts in both data quality and master data

management have only been partially success-

ful. Not only is data quality difficult to achieve,

it is a difficult problem even to approach. In

addition, the scope of the problem keeps

broadening. Master data management pres-

ents many of the same challenges that data

quality itself presents. Moreover, the complex-

ity of implementing master data management

solutions has restricted them to relatively large

companies. At the bottom line, both data qual-

ity programs and master data management

solutions are tricky to successfully implement,

in part because, to a large degree, the impact

of poor quality and disjointed data is hidden

from sight. Too often, data quality seems to be

nobody’s specific responsibility.

Despite the difficulties in gathering corpo-

rate resources to address these issues, during

the past decade, the high cost of poor quality

and poorly integrated data has become clearer,

and a better understanding of what defines

data quality, as well as a general methodology

for implementing data quality programs, has

emerged. The establishment of the general

foundation for data quality and master data

management programs is significant, particu-

larly because the corporate information envi-

ronment is undergoing a tremendous upheaval,

generating turbulence as vigorous as that cre-

ated by mainframe and personal computers.

The spread of the internet and mobile

devices such as smartphones and tablets is not


only generating more data than ever before;

many kinds of data—much of it largely

unstructured or semistructured—have

also become very important. The use of RFID and

other kinds of sensor data has led to a data

tsunami of epic proportions. Cloud com-

puting has created an imperative for compa-

nies to integrate data from many different

sources both inside and outside the corpo-

ration. And compliance with regulations in a

wide range of industries means that data has

to be held for longer periods of time and must

be correct. In short, the basics for data quality

and master data management are in place but

the basics are not nearly sufficient.

The Current Situation

In 2002, the Data Warehousing Institute

estimated that poor data quality cost Amer-

ican businesses about $600 billion a year.

Through the years, that figure has been the

number most commonly bandied about

as the price tag for bad data. Of course, the

accuracy of such an eye-popping number

covering the entire scope of American indus-

try is hard to assess.

However, a more recent study of busi-

nesses in the U.K. presented an even starker

picture. It found that as much as 16% of

many companies’ budgets is squandered

because of poor data quality. Departments

such as sales, operations, and finance waste

on average 15% of their budgets, according

to the study. That figure climbs to 18% for

IT. And the number is even higher for cus-

tomer-facing activities such as customer loy-

alty programs. In all, 90% of the companies

surveyed said that their activities

were hindered by poor data.

When specific functional areas are assessed,

the substantial cost that poor data quality

extracts can become pretty clear. For exam-

ple, contact information was one of the first

targets for data quality programs. Obviously,

inaccurate, incomplete, and duplicated address

information hurts the results of direct market-

ing campaigns. In one particularly egregious

example, a major pharmaceutical company

once reported that 25% of the glossy brochures

it mailed were returned. Not only are potential

sales missed, current customers can be alien-

ated. Marketing material delivered to the wrong

address is pure cost.

Marketing is only one area in which the

impact of poor information is visible. One

European bank found that 100% of customer

complaints had their roots in poor or outright

incorrect information. Moreover, this study

showed, customers who register complaints

are much more likely to shop for alternative

suppliers than those who don’t. The difference

in churn between customers whose complaints

are rooted in poor data quality and those who

do not complain is a direct cost of

poor data quality.

And the list goes on. Poor data quality in

manufacturing slows time to market, leads

to inventory management problems, and can

result in product defects. Bad logistics data can

have a material impact on both the front end

and back end of the manufacturing process.

The Benefits of Improving Data Quality

On the other side of the equation, improv-

ing data quality can lead to huge benefits.

One company reported that improving the

quality of data available to its call center per-

sonnel resulted in nearly $1 million in sav-

ings. Another realized $150,000 in billing

efficiencies by improving its customer contact

information.

As the cost/benefit equation of data quality

has become more apparent, the need to define

data quality has become more pressing. In

addition to the core characteristics of accuracy

and timeliness, the most concise expression of

the attributes of high-quality data is consistency,

completeness, and conciseness. Consistency

means that each “fact” is represented in the

same way across the information ecosystem.

For example, a date is represented by two digits

for the month, two for the day, and four for the

year and is represented in that order across the

informational ecosystem in a company. More-

over, the “facts” represented must be logical. An

“order due” date, for example, cannot be earlier

than an “order placed” date.
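
Consistency rules of this kind are straightforward to encode. The short Python sketch below checks the two rules just described, a single canonical date format (assumed here to be MM/DD/YYYY) and the order-date sanity check; the record layout is hypothetical.

# Minimal sketch of the consistency rules described above: one canonical
# date format (assumed MM/DD/YYYY) and a cross-field sanity check.
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y"   # two digits for month and day, four for the year

def check_record(record):
    """Return a list of consistency violations for one order record."""
    errors = []
    dates = {}
    for field in ("order_placed", "order_due"):
        try:
            dates[field] = datetime.strptime(record[field], DATE_FORMAT)
        except (KeyError, ValueError):
            errors.append(f"{field} is missing or not in MM/DD/YYYY format")
    if len(dates) == 2 and dates["order_due"] < dates["order_placed"]:
        errors.append("order_due is earlier than order_placed")
    return errors

print(check_record({"order_placed": "02/01/2013", "order_due": "01/15/2013"}))
# -> ['order_due is earlier than order_placed']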

Maintaining consistency is more difficult

than it may appear at first. Companies capture

data in a multitude of ways. In many cases,

customers are entering data via web forms,

and both the accuracy and the consistency

of the data can be an issue. Moreover, data is

often imported from third-party source sys-

tems, which may use alternative formats to

represent “facts.” Indeed, even separate oper-

ational units within a single enterprise may

represent data differently.

Maintaining Data Consistency

Master data management is one approach

companies have used to maintain data consis-

tency. MDM technology consolidates, cleanses,

and augments corporate data, synchronizing

data among all applications, business pro-

cesses, and analytical tools. Master data man-

agement tools provide the central repository

for cross-referenced data in the organization,

building a single view of organizational data.
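
In miniature, that consolidation step might look like the Python sketch below: records from two systems are cross-referenced on a cleansed key and merged into one golden record. The field names and the simple survivorship rule are invented for illustration; real MDM tools apply far richer matching logic.

# Minimal sketch of the master-data idea described above: consolidate records
# from several systems, cross-reference them on a cleansed key, and keep one
# "golden" view per customer. Field names and survivorship rules are hypothetical.
crm = [{"email": "Ann@Example.com ", "name": "Ann Smith", "phone": None}]
billing = [{"email": "ann@example.com", "name": "A. Smith", "phone": "555-0100"}]

def clean_key(email):
    return email.strip().lower()

def build_golden(*sources):
    golden = {}
    for source in sources:
        for rec in source:
            key = clean_key(rec["email"])
            merged = golden.setdefault(key, {})
            for field, value in rec.items():
                # simple survivorship rule: keep the first non-empty value seen
                if value and not merged.get(field):
                    merged[field] = value
    return golden

print(build_golden(crm, billing))   # one consolidated record keyed by 'ann@example.com'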

The second element of data quality is com-

pleteness. Different stakeholders in an organi-

zation need different information. For example,

the academic records department in a university may be most interested in a student's grade point average, the courses in which the student is enrolled, and the student's progress toward graduation. The dean of students wants to know if the student is living on campus, the extracurricular activities in which the student participates, and any disciplinary problems the student has had. The bursar's office wants to know the scholarships the student has received


and the student’s payment history. A good data

system will not only capture all that informa-

tion but also ensure that none of the key ele-

ments are missing.

The last element of good quality data is con-

ciseness. Information is flowing into organiza-

tions through several different avenues. Inevita-

bly, records will be duplicated and information

comingled, and nobody likes to receive three

copies of the same piece of direct mail.

Because companies currently operate within

such a dynamic information environment, no

matter how diligent enterprises are, their sys-

tems will contain faulty, incorrect, duplicate,

and incomplete information. Indeed, if compa-

nies do nothing at all, the quality of their data

will degrade. Time decay is an ongoing, con-

sistent cause of data errors. People move. They

get married and change their names. They get

divorced and change their names again. Corpo-

rate records have no way to keep up.

But time is only one of the root causes for

bad data. Corporate change also poses a prob-

lem. As companies grow, they add new applica-

tions and systems, making other applications

and systems obsolete. In addition, an enter-

prise may merge with or purchase another

organization whose data is in completely dif-

ferent formats. Finally, companies are increas-

ingly incorporating data from outside sources.

If not managed correctly, each of these events

can introduce large-scale problems with cor-

porate data.

The third root cause of data quality prob-

lems is that old standby—human error.

People already generate a lot of data and are

generating even more as social media content

and unstructured data become more signifi-

cant. Sadly, people make mistakes. People are

inconsistent. People omit things. People enter

data multiple times. Inaccuracies, omissions,

inconsistencies, and redundancies are hall-

marks of poor data quality.

Given that data deterioration is an ongoing

facet of enterprise information, for a data qual-

ity program to work, it must be ongoing and

iterative. Modern data quality programs rest

on a handful of key activities—data profiling

and assessment, data improvement, data inte-

gration, and data augmentation.

In theory, data improvement programs are

not complicated. The first step is to character-

ize or profile the data at hand and measure how

closely it conforms to what is expected. The

next step is to fix the mistakes. The third step

is to eliminate duplicated and redundant data.

Finally, data quality improvement programs

should address holes in the enterprise infor-

mation environment by augmenting existing

data with data from appropriate sources. Fre-

quently, data improvement programs do not

address enterprise data in its entirety but focus

on high-value, high-impact information used

in what can be considered mission-critical

business processes.
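
The profiling step, in particular, is easy to picture in code. The Python sketch below computes completeness, conformance to an expected pattern, and a duplicate count for one field; the sample records and the rule are hypothetical.

# Minimal sketch of the profiling/assessment step described above: measure how
# closely a column conforms to expectations before fixing, deduplicating, and
# augmenting. The data and the rule are hypothetical.
import re
from collections import Counter

records = [
    {"id": 1, "zip": "08055"}, {"id": 2, "zip": "8055"},
    {"id": 3, "zip": ""},      {"id": 4, "zip": "08055"},
]

def profile(records, field, pattern):
    values = [r.get(field, "") for r in records]
    non_empty = [v for v in values if v]
    return {
        "rows": len(values),
        "completeness": len(non_empty) / len(values),
        "conformance": sum(bool(re.fullmatch(pattern, v)) for v in non_empty) / len(non_empty),
        "duplicates": sum(c - 1 for c in Counter(non_empty).values() if c > 1),
    }

print(profile(records, "zip", r"\d{5}"))
# e.g. {'rows': 4, 'completeness': 0.75, 'conformance': 0.67, 'duplicates': 1}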

The Big Data Challenge

To date, most data quality programs have

been focused on structured data. But, iron-

ically, while the tools, processes, and organi-

zational structures needed to implement an

effective data quality program have developed,

the emergence of big data has the potential to

completely rewrite the rules of the game.

Though the term “big data” is still debated,

it represents something qualitatively new.

Big data does not just mean the explosion of

transactional data driven by the widespread

use of sensors and other data-generating

devices. It also refers to the desire and ability

to extract analytic value from new data types

such as video and audio. And it refers to the

trend toward capturing huge amounts of data

produced by the internet, mobile devices, and

social media.

The availability of more data, new types of

data, and data from a wider array of sources

has had a major impact on data analysis and

business intelligence. In the past, people would

identify a problem they wanted to solve and

then gather and analyze the data needed to

solve that problem. With big data, that work

flow is reversed. Companies are realizing that

they have access to huge amounts of new

data—tweets, for example—and are working

to determine how to extract value from that

data, reversing the usual process.

Data quality programs will have to evolve

to meet these new challenges. Perhaps the first

step will be methods for developing appropri-

ate metadata. In general, big data is complex,

messy, and can come from a variety of dif-

ferent sources, so good metadata is essential.

Data classification, efficient data integration,

and the establishment of standards and data

governance will also be critical elements of

data quality programs that encompass big data

elements.

Ensuring data quality has been a serious

challenge in many organizations. Frequently,

data quality problems are masked. Business

processes seem to be working well enough, and

it is hard to determine beforehand what the

return on investment in a data quality program

would be. In addition, in many organizations,

nobody seems to “own” responsibility for the

overall quality of corporate data. People are

responsible or are sensitive to their own slice

of the data pie but are not concerned with the

overall pie itself.

What's Ahead

It should not be a surprise that in a recent

survey of data quality professionals, two-

thirds of the respondents felt the data quality

programs in their organizations were only

"OK," meaning only some goals were met, or were poor.

On the brighter side, however, 70% indicated

that the company’s management felt data

and information were important corporate

assets and recognized the value of improving

their quality. On balance, however, data quality

must be improved. In another survey, 61% of

IT and business professionals said they lacked

confidence in their company data.

During the next several years, data qual-

ity professionals will face a series of complex

challenges. Perhaps the most immediate is to

be able to view data quality issues within their

organizations holistically. Data generated by

one division—marketing, let’s say—may be

consumed by another—manufacturing, per-

haps. Data quality professionals need to be

able to respond to the needs of both.

Secondly, data quality professionals must

develop tools, processes, and procedures to

manage big data. Since a lot of big data is also

real-time data, data quality must become a

real-time process integrated into the enterprise information ecosystem. And finally, perhaps most importantly, data quality professionals will have to set priorities. Nobody can do everything at once. ■

Elliot King has reported on IT for 30 years. He is the chair of the Department of Communication at Loyola University Maryland, where he is a founder of an M.A. program in Emerging Media. He has written six books and hundreds of articles about new technologies. Follow him on Twitter @joyofjournalism. He blogs at emergingmedia360.org.


sponsored content

Big Data ... Big Deal?

Data is experiencing unprecedented growth at

phenomenal speed. According to IDC

(International Data Corporation), by

2015, nearly 3 billion people will be online,

generating nearly 8 zettabytes of data.

Analyzing large data sets and leveraging

new data-driven strategies will be essential

for establishing competitive differentiation

in the foreseeable future.

Big Data represents a fundamental shift

in the way companies conduct business

and interact with customers. Deriving value

from data sets requires that companies

across all industries be aggressive about

data collection, integration, cleansing

and analysis.

DATA SOURCES—BEYOND THE TRADITIONAL

Enterprises understand the intrinsic

value in mining and analyzing traditional

data sources such as demographics,

consumer transactions, behavior models,

industry trends, and competitor information.

However, the age of Big Data and advanced

technologies necessitate the analysis of new

data universes, such as social media and

mobile technologies.

Social media is one of the major elements

driving the overall Big Data phenomenon.

Twitter streams, Facebook posts and

blogging forums flood organizations

with massive amounts of data. Successful

Big Data strategies include the adoption

of technologies to pull relevant social

media into a single stream and integrate

the information into the core functions

of the enterprise. Automated processes,

matching technology and filters extract

content and consumer sentiment. When

social stream data is cleansed and integrated

into a database, enterprises gain invaluable

information on customer insights,

competitive intelligence, product feedback,

and market trends.

Mobile technology is also contributing

to the data influx as mobile devices become

more powerful, networks run faster and

apps more numerous. According to a report

by Cisco, global traffic on data networks

grew by 70% in 2012. The traffic on mobile

data networks in 2012—885 petabytes or

885 quadrillion bytes—was nearly 12 times

greater than total Internet traffic around the

world in 2000. As consumer behavior shifts

to new digital technologies, enterprises

are in a prime position to take advantage

of opportunities such as location-based

marketing.

GPS technologies are much more precise,

allowing marketers to deliver targeted

real-time messaging based on a consumer’s

location. Geofencing, a technology gaining

popularity among industries such as retail,

establishes a virtual perimeter around a

real-world site. For example, geofences

may be set up around a storefront. When

a customer carrying a smart device enters

the area, the device emits geodata, allowing

companies to send locally-targeted content

and promotions. According to research by

Placecast, a company specializing in location-

based services, one of every two consumers

visits a location after receiving an alert.
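
The mechanics behind such an alert are simple to sketch. The Python fragment below checks whether a device's reported coordinates fall inside a virtual perimeter around a storefront, using a great-circle distance; the coordinates, radius, and offer text are invented for illustration.

# Minimal sketch of the geofence check described above: a virtual perimeter
# around a storefront, with a targeted offer sent when a device's reported
# position falls inside the radius. Coordinates and radius are hypothetical.
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in meters."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

STORE = (40.7128, -74.0060)   # hypothetical storefront location
RADIUS_M = 200                # geofence radius in meters

def on_location_update(device_id, lat, lon):
    if haversine_m(lat, lon, *STORE) <= RADIUS_M:
        return f"send offer to {device_id}: 20% off today in-store"
    return None

print(on_location_update("device-123", 40.7131, -74.0055))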

MANAGING BIG DATA

When properly managed, Big Data brings

big opportunities. Solid data management

processes and well-designed procedures for

data stewardship are crucial investments

for Big Data projects to be successful.

Structured and unstructured data must be

properly formatted, integrated and cleansed

to fully extract actionable and agile business

intelligence.

As the speed of business continues

to accelerate, data is generated instantly.

Traditional data quality batch processing is

no longer enough to fully sustain effective

operational decision-making. Integrating,

cleansing and analyzing data in real-time

allows a company to engage in opportunities

instantly. For example, using real-time data

processing, a company can personalize a

customer’s on-line website visit, enhancing

the overall customer experience. Monitoring

of transactions in real-time also has

important benefits for security. Security

threats can instantly be identified, such

as fraudulent activity or individuals on a

security watch list. The applications are

numerous. Corporations able to react to

information the fastest will have the greatest

competitive advantage.

Big Data initiatives require planning

and dedication to be successful. According

to Gartner Predicts 2012 research, more

than 85% of Fortune 500 organizations

will be unable to effectively exploit Big

Data by 2015. Companies that successfully

incorporate Big Data projects into the overall

business strategy will gain significant returns,

including better customer relationships,

improved operational efficiency, identification

of marketing opportunities, security risk

mitigation, and more.

DATAMENTORS provides award-winning data quality and database marketing solutions. Offered as either a customer-premise installation or ASP-delivered solution, DataMentors leverages proprietary data discovery, analysis, campaign management, data mining, and modeling practices to drive proactive, knowledge-driven decisions. Learn more at www.DataMentors.com, including how to obtain a complimentary customer database quality analysis.

In Today's BI and Advanced Analytics World, There Is Something for Everyone

By Joe McKendrick

The State of Business Intelligence and Advanced Analytics

There is something for everyone within

today’s generation of business intelligence and

advanced analytics solutions. Built on open,

flexible frameworks and designed for users

who expect and need information at internet

speeds, BI and analytics are undergoing their

first revolutionary transformation since com-

puters became mainstream business tools.

Not only are the tools evolving, end users

are evolving as well. People are demanding

more of their analytics solutions, but analyt-

ics are also changing the way people across

enterprises, from end-users to infrastructure

specialists to top-level executives, work and

run their businesses.

All About Choice

For today's data infrastructure managers

charged with capturing, cleansing, processing,

and storing data, the new BI/analytics world

is all about choice—and lots of it. An array of

technologies and solutions is now surging into

the marketplace that offers smarter ways to

capture, manage, and store big data of all types

and volumes.

A company doesn’t need to be an enter-

prise on the scale of a Google or eBay, turn-

ing huge datasets into real-time insights on

millions of customers. Organizations of all

sizes are now getting into the game. In fact,

more than two-fifths of 304 data managers

surveyed from all types and sizes of busi-

nesses report they have formal “big data” ini-

tiatives in progress, with the goals of deliv-

ering predictive analytics, customer analysis,

and growing new business revenue streams

(“2013 Big Data Opportunities Survey,”

sponsored by SAP and conducted by Uni-

sphere Research, a division of Information

Today, Inc., May 2013).

There are a variety of data infrastructure

tools and platforms that are paving the way to

big data analysis:

Open Source/NoSQL/NewSQL Databases: Alternative forms of databases are filling the

need to manage and store unstructured data.

These new databases often hail from the open

source space, meaning that they are immedi-

ately available to administrators and develop-

ers for little or no charge. NewSQL databases

tend to be cloud-based systems. NoSQL (“Not

only” SQL)-based databases are designed

to store unstructured or nonrelational data.

There are four categories of NoSQL databases:

key-value stores (for the storage of schema-less

data); column family databases (storing data

within columns); graph databases (employing

structures with nodes, edges, and properties to

represent and store data); and document data-

bases (for the simple storage and retrieval of

document aggregates).

Hadoop/MapReduce Open Source Ecosphere: Apache Hadoop, an open source

framework, is designed for processing and

managing big data stores of unstructured data,

such as log files. Hadoop is a parallel-pro-

cessing framework, linked to the MapReduce

analytics engine, that captures and packages

both unstructured and structured data into

digestible files that can be accessed by other

enterprise applications.
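
The programming model itself is easy to illustrate. The Python sketch below is the classic word-count example written in the Hadoop Streaming style, where the mapper and reducer read standard input and write key/value pairs to standard output; the job-submission command and input data are left out.

# Classic illustration of the MapReduce model described above, written in the
# Hadoop Streaming style (mapper and reducer read stdin and write stdout).
# Local test: cat logfile.txt | python wordcount.py map | sort | python wordcount.py reduce
import sys
from itertools import groupby

def mapper():
    # emit one "word<TAB>1" pair per token
    for line in sys.stdin:
        for word in line.split():
            print(f"{word.lower()}\t1")

def reducer():
    # input arrives sorted by key, so counts for a word are contiguous
    pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()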

A survey of 298 data managers affiliated

with the Independent Oracle Users Group

(IOUG) has found that Hadoop adoption is

likely to triple during the coming years. At

the time of the survey, 13% of respondents

had deployed or were in the process of imple-

menting or piloting Hadoop, with an addi-

tional 22% considering adoption of the open

source framework at some point in the future

(“Big Data, Big Challenges, Big Opportunities:


2012 IOUG Big Data Strategies Survey,” spon-

sored by Oracle and conducted by Unisphere

Research, September 2012).

Relational Database Management Systems: RDBMSs, on the market for close to 3

decades, structure data into tables that can

be cross-indexed within applications and are

increasingly being tweaked for the data surge

ahead. The IOUG survey finds nine out of

10 enterprises intend to continue using rela-

tional databases for the foreseeable future,

and it is likely that many organizations will

have hybrid environments with both SQL and

NoSQL running side by side.

Cloud: Cloud-based BI solutions offer

functionality on demand, along with more

rapid deployment, low upfront cost, and scal-

ability. Many database vendors now support

data management and storage capabilities via

a cloud or software as a service environment.

In addition, other vendors are also optimiz-

ing their data products to be able to leverage

cloud resources—either as the foundation

of private clouds, or running in on-premises

server environments that also access applica-

tion programming interfaces (APIs) or web

services for additional functions.

In another survey of 262 data managers,

37% say their organizations are either run-

ning private clouds—defined as on-demand

shared services provided to internal depart-

ments or lines of business within enter-

prises—at full or limited scale, or are in pilot

stages (“Enterprise Cloudscapes—Deeper

and More Strategic: 2012–13 IOUG Cloud

Computing Survey,” sponsored by Oracle and

conducted by Unisphere Research, February

2013). This is up from 29% in 2010, the first

year this survey was conducted. In addition,

adoption of public clouds—defined as on-

demand services provided by public cloud

providers—is on the upswing. Twenty-six

percent of respondents say they now use pub-

lic cloud services either in full or limited ways,

or within pilot projects. This is up by 86%

from the first survey in this series, conducted

in 2010, when 14% reported adoption.

In addition, 50% of private cloud users

report they run database as a service, up from

35% 2 years ago. Among public cloud users,

37% run database as a service, up from 12%

2 years ago.

Data Virtualization: Just as IT assets are

now offered through service layers via soft-

ware as a service or platform as a service,

information can be available through a “data

as a service” approach. In tandem with the

rise of private cloud and server virtualization

within enterprises, there has been a similar

movement to data virtualization, or data-

base as a service. By decoupling the database

layer from hardware and applications, users

are able to access disparate data sources from

anywhere across the enterprise, regardless of

location or underlying platform.
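
Stripped to its essence, the idea can be sketched in a few lines of Python: a thin logical catalog routes each request to whichever source holds the data, so the consumer never deals with location or platform. The sources and schemas below are invented for illustration; real data virtualization products add query pushdown, security, and caching.

# Minimal sketch of the data-virtualization idea described above: a thin logical
# layer routes requests to whichever physical source holds the data.
class VirtualCatalog:
    def __init__(self):
        self._sources = {}          # logical name -> callable returning rows

    def register(self, name, fetch):
        self._sources[name] = fetch

    def query(self, name, **filters):
        rows = self._sources[name]()
        return [r for r in rows if all(r.get(k) == v for k, v in filters.items())]

catalog = VirtualCatalog()
# One "table" might live in a warehouse, another behind a cloud API; the consumer cannot tell.
catalog.register("customers", lambda: [{"id": 1, "region": "EMEA"}, {"id": 2, "region": "APAC"}])
catalog.register("orders", lambda: [{"customer_id": 1, "amount": 250.0}])

print(catalog.query("customers", region="EMEA"))
print(catalog.query("orders", customer_id=1))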

In-Memory Technologies: Many ven-

dors are adding in-memory capabilities to

offerings in which data and processing are

moved into a machine’s random access mem-

ory. In-memory eliminates what is probably

the slowest part of data processing—pulling

data off disks. In an environment with large

datasets—scaling into the hundreds of tera-

bytes—this becomes a serious bottleneck for

rapid analysis, limiting the amount of data

that can be analyzed at one time. Some esti-

mate that the capacity of such systems can

already go as high as those of large, disk-based

databases—all that data stored in a RAID

array could potentially be moved right into

machine memory.

A recent survey of 323 data managers

demonstrates that in-memory technology is

poised for rapid growth. While in-memory is

seen within many organizations, it is mainly

focused on specific sites or pilot projects at

this time. A handful of respondents to the

survey, 5%, report the technology is currently

in “widespread” use across their enterprises,

while another 8% say it is in limited use across

more than three departments within their

organizations. Close to one-third, 31%, report

that they are either piloting or considering this

technology (“Accelerating Enterprise Insights:

2013 IOUG In-Memory Strategies Survey,”

sponsored by SAP and conducted by Uni-

sphere Research, January 2013).

Technologies to Connect Data to Business

For quants, data analysts, data scientists,

and business users, the new BI/analytics

world is all about diving deep into datasets

and being able to engage in “storytelling” as a

way to connect data to the business.

There is a perception that developing

and supporting “data scientist”-type skill sets

require specially trained statisticians and

mathematicians supported by sophisticated

algorithms. However, with the help of tools

and platforms now widely available in today’s

market, members of existing data departments

can also be brought up-to-speed and made

capable of delivering insightful data analysis.

Open Source: The revolutionary frame-

work that broke open the big data analysis

scene is Hadoop and MapReduce. One of the

most potent tools in the quants’ toolboxes is

R, the open source, object-oriented analytics

language. R is rapidly deployable, tends to

be well-suited for building analytics against

large and highly diverse datasets, and has been

embedded in many applications. There are a

number of solutions that build upon R and

make the language easy to work with to visu-

ally manipulate data for the more effective

delivery of business insights.

Predictive Analytics: Predictive analytics technology is a key mission awaiting quants, data analysts, and data scientists. The tech-nology is available; all it takes is a little imag-ination. For example, during the presidential election in the fall of 2012, Nate Silver of The New York Times put predictive analytics on the map with his almost dead-on prediction of the winning candidate. The same principles can


be applied for more routine business prob-

lems, which potentially can uncover unfore-

seen outcomes. For example, one bank found

that its most profitable customers were not

high-wealth individuals, but rather those who

were not meeting minimums and overdrafting

accounts and thus anteing up fees. In another

case, an airline found that passengers specify-

ing vegetarian preferences in their on-board

meals were less likely to miss flights. Or even

counterintuitive findings—such as the dating

site that found people rated the most attractive

received less attention than “average”-looking

members. (Suitors felt they faced more compe-

tition with more attractive members.)
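
Under the hood, findings like these usually come from a model fitted on historical outcomes and then applied to new cases. The Python sketch below, which assumes scikit-learn, shows the general shape with invented numbers loosely echoing the bank example; it is not a reconstruction of any study cited here.

# Hedged sketch of the kind of predictive model behind the examples above:
# fit a simple classifier on historical outcomes, then score new cases.
# Features and numbers are invented for illustration; assumes scikit-learn.
from sklearn.linear_model import LogisticRegression

# columns: [overdrafts_last_year, avg_balance_thousands]; label: 1 = profitable customer
X = [[6, 0.4], [4, 0.2], [0, 25.0], [1, 40.0], [7, 0.1], [0, 12.0]]
y = [1, 1, 0, 0, 1, 0]

model = LogisticRegression().fit(X, y)
new_customers = [[5, 0.3], [0, 30.0]]
for features, p in zip(new_customers, model.predict_proba(new_customers)[:, 1]):
    print(features, f"probability profitable = {p:.2f}")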

Programming Tools: A range of script-

ing and open source languages—including

Python, Ruby, and Perl—also include exten-

sions for parallel programming and machine

learning.

Opening Up Analytics to the Business

For business users, the new BI/analytics

world is all about analytics for all. There

has been a growing movement to open up

analytics across the organization—pushing

these capabilities down to all levels of deci-

sion makers, including frontline customer

service representatives, production per-

sonnel, and information workers. A recent

survey of 250 data managers finds that in

most companies, fewer than one out of 10

employees have access to BI and analytic

systems (“Opening Up Business Intelligence

to the Enterprise: 2012 Survey On Self-Ser-

vice BI and Analytics,” sponsored by Tab-

leau Software and published by Unisphere

Research, October 2012).

Now, a new generation of front-end tools

is making this possible:

Visualization: Visual analytics is the new

frontier for end-user data access. Data visualiza-

tion tools provide highly graphic, yet relatively

simple, interfaces that help end users dig deep

into queries. This represents a departure from

the ubiquitous spreadsheet—rows of num-

bers—as well as static dashboards or PDF-based

reports with their immovable variables.

Self-Service: There is a growing trend among

enterprises to enable end users to build or design

their own interfaces and queries. Self-service

may take the form of enterprise “mashups,” in

which end users build their own front ends that

are combined with one or more data sources,

or through highly configurable portals. Accord-

ing to the 2012 Tableau-Unisphere self-service

BI and analytics study, self-service BI is now

offered to some extent in half of the organiza-

tions surveyed.

Pervasive BI: Pervasive BI and analyt-

ics are increasingly being embedded within

applications or devices, in which the end user

is oblivious to the software and data feeds

running in the background.

Cloud: Many users are looking to the

cloud to support BI data and tools in a more

cost-effective way than on-premises desk-

top tools. Third-party cloud providers have

almost unlimited capacity and can support

and provide big data analytics in a way that

would be cost-prohibitive for most organizations to build themselves. Cloud

opens up business intelligence and analytics

to more users—nonanalysts—within orga-

nizations. With the drive to make BI more

ubiquitous, the cloud will only accelerate this

move toward simplified access.

Mobile: Mobile technology, which is only

just starting to seep into the BI and analytics

realm, promises to be a source of disruption.

The availability of analytics on an easy-to-use

mobile app, for example, will bring analytics

to decision makers almost instantaneously.

With many employees now bringing their

own devices to work, analytics may be read-

ily used by users that previously did not have

access to those capabilities.

The Opportunity to Compete on Analytics

For top-level executives, the new BI/

analytics world presents opportunities to compete

on analytics. The ability to employ analytics

means understanding customers and markets

better, as well as spotting trends as they are

starting to happen, or before they happen.

As found in the Unisphere Research sur-

vey on big data opportunities, most execu-

tives instinctively understand the advantages

big data can bring to their operations, espe-

cially with predictive analytics and customer

analytics. A majority of the respondents with

such efforts under way, 59%, seek to improve

existing business processes, while another

41% are concerned with the need to create

new business processes/models.

BI and advanced analytics not only pro-

vide snapshots of aspects of the business such

as sales or customer churn, but also make it

possible to apply key performance indicators

against data to develop a picture of a busi-

ness’s overall performance.

What's Ahead

To compete in today's hyper-competitive

global marketplace, businesses need to under-

stand what’s around the corner. Predictive

analytics technology enables this to happen,

and the new generation of tools incorporates

such predictive capabilities.

The ability to automate low-level deci-

sions is freeing up organizations to apply

their mind power against tougher, more stra-

tegic decisions. These days, analytical appli-

cations are being embedded into processes

and applied against business rules engines to

enable applications and machines to handle

the more routine, day-to-day decisions that

come up—rerouting deliveries, extending

up-sell offers to customers, or canceling or

revising a purchase order.

Many organizations beginning their jour-

ney into the new BI and analytics space are

starting to discover all the possibilities it offers.

But, in an era in which data is now scaling into

the petabyte range, BI and analytics are more

than technologies; they are a disruptive force. And,

with disruption comes new opportunities for

growth. Companies interested in capitaliz-

ing on the big data revolution need to move

forward with BI and analytics as a strategic

and tactical part of their business road map.

The benefits are profound—including vastly

accelerated business decisions and lower IT

costs. This will open new and often surprising

avenues to value. ■

Joe McKendrick is an author and independent researcher covering innovation, information technology trends, and markets. Much of his research work is in conjunction with Unisphere Research, a division of Information Today, Inc. (ITI), for user groups including SHARE, the Oracle Applications Users Group, the Independent Oracle Users Group, and the International DB2 Users Group. He is also a regular contributor to Database Trends and Applications, published by ITI.

sponsored content

Five Key Pieces in the Big Data Analytics Puzzle

Big data continues to be a mystery to many companies. Industry research

validates our experience that there are five

major stages that companies go through

when working with big data. We call these

the 5 Es—Evading, Envisioning, Evaluating,

Executing, and Expanding—of the big data

journey. Today, approximately 40% of

companies are still in the Evading stage,

waiting to get the clarity, means and purpose

for tackling big data.

To provide some clarity on the subject,

here we present five essential technological

means needed for this inevitable journey. If

your purpose is to find measurable returns

from big data, any one of these will be

sufficient to begin tasting the value. When

blended together, these means will provide

an irresistible recipe for big data success.

1) ENABLE VISUAL SELF EXPLORATION OF DATA

Human beings are visual creatures.

Big data analytics is all about “seeing”

relationships, anomalies and outliers present

in large quantities of data. Techniques for

advanced ways to graph, map and visualize

data, therefore, are a core requirement.

Secondly, visualizations need to be

intuitive and easy to work with. Business

users need the control to define what data

will be visualized and iterate through ideas

to determine the best visual representation.

They need the flexibility to share their

output through web browsers, mobile apps,

email, and other presentation modes.

Finally, the tools used need to be highly

responsive to a user’s needs. Effective

analysis can only happen when users move

uninterrupted at the speed-of-thought with

every exploration.

2) DEMOCRATIZE ADVANCED ANALYTICS

Big data has no voice without analytics.

Often the reason to work with large

quantities of low-level data is to apply

sophisticated analytic models, which can

tease out valuable insights not readily

apparent in aggregated information.

In business, analytical

modeling is the job of trained

data scientists who use a variety

of tools for developing these

models. Frontline business

users do not have such skill, but

everyday decisions they make can

be vastly improved based on such

big data insights. Challenges arise

in this transfer of knowledge,

since most tools don’t typically

talk to one another.

Organizations can enable

data scientists and trained analysts to

easily transfer business insights to frontline

workers by adopting tools that can expose

the widest support for advanced analytics

and predictive techniques, either natively or

through open integration with other tools.

3) COMBINE DATA FROM MULTIPLE SOURCES

Organizations never keep all data in one

place. Even with big data storage like Hadoop,

businesses will be hard pressed to unify all data

under one roof, owing to the ever-proliferating

systems. To date, IT has solved this problem

by transforming and moving data between

sources, before analysis is conducted. In today’s

age, exponentially larger datasets make data

movement virtually impossible, especially

when organizations want to be more nimble

but keep costs in check.

New technologies allow business users to

blend data from multiple sources, in-place,

and without involving IT. IT can take this a

step further by providing a scalable analytic

architecture masking the data complexity while

providing common business terminology.

Such architecture will easily facilitate

analyses that span customer information,

sales transactions, cost data, service history,

marketing promotions and more.

4) GIVE STRUCTURE TO ACTIONABLE UNSTRUCTURED DATA

Unstructured data accounts for 80% of all

data in a business. It typically consists of

text-heavy formats like internal documents,

service records, web logs, emails, etc.

First, unstructured data has to be

structured to enable any analysis. While

trained analysts can do this interactively

at small scale, larger scale and general

access would demand an offline process.

Second, analysis of unstructured data will

often be useful only in conjunction with

other structured enterprise data. Third, the

insights from such analyses can be quite

amorphous. Unless businesses can take

concrete action based on the insights from

a certain unstructured source, its ROI will

be hard to justify.

5) SET UP CONNECTIVITY TO REAL-TIME DATA

Not all big data use cases lend themselves

to real-time analysis. But some do. When

decisions need to be taken in real-time (or

near real-time), this capability becomes a

key success factor. Analytic solutions for

financial trading, customer service, logistics

planning, etc. can all be beneficiaries of tying

live actual data to historical information or

forecasted outcomes.

In the end, big data analytics initiatives

are very much like traditional business

intelligence initiatives. These five technological

needs, however, demand significantly greater emphasis

in your big data journey. Will you stop

evading it now?

MICROSTRATEGY To learn how MicroStrategy can help craft solutions for your big data analytics needs, visit microstrategy.com/bigdatabook.

Social Media Analytic Tools and Platforms Offer Promise

By Peter J. Auditore

The State of Social Media

Social media networks are creating large

datasets that are now enabling companies and

organizations to gain competitive advantage

and improve performance by understand-

ing customer needs and brand experience

in nearly real time. These datasets provide

important insights into real-time customer

behavior, brand reputation, and the over-

all customer experience. Intelligent or “data

analysis”-driven organizations are now mon-

itoring, and some are collecting, this data

from "proprietary social media networks," such

as Salesforce Chatter and Microsoft Yammer

and “open social media networks” such as

LinkedIn, Twitter, Facebook, and others.

The majority of organizations today are

not harvesting and staging data from these

networks but are leveraging a new breed of

social media listening tools and social analyt-

ics platforms. Many are tapping their public

relations agencies to execute this new business

process. Smarter data-driven organizations

are extrapolating social media datasets and

performing predictive analytics in real time

and in-house.

There are, however, significant regula-

tory issues associated with harvesting, stag-

ing, and hosting social media data. These

regulatory issues apply to nearly all data

types in regulated industries such as health-

care and financial services in particular.

The SEC and FINRA, along with Sarbanes-Oxley,

require different types of electronic com-

munications to be organized, indexed in

a taxonomy schema, and then be archived

and easily discoverable over defined time

periods. Data protection, security, gover-

nance, and compliance have entered an

entirely new frontier with the introduction and

management of social data.

This article provides a broad overview of

the current state of analytical tools and plat-

forms that enable accelerated and real-time

decision making in organizations based on

customer data. Social media is driving organi-

zational demand for insights on “customer

everything” in addition to BI and analytics

tools. Providing “enterprise BI” that includes social analytics will be a significant challenge to many enterprises in the near future. This is one of the primary reasons for the success of the new wave of innovative and easy-to-use BI and social media analytical tools within the last several years.


Analytic Tools Overview

In the beginning, there was SPSS and

SAS Institute, the first analytical and statis-

tical platforms to be computerized and go

mainstream in the early 1980s. There is no

way in my view you can talk about anything

analytical without mentioning them. When

I was a young marine scientist, these were

the first DOS-based analytical tools we used

to do basic statistical analysis in addition to

rudimentary predictive analytics employed to

forecast fisheries populations.

During the last 40 years, these platforms

evolved to include a host of new capabilities

and functionality and are now considered

business intelligence tools. For the last 20

years, the majority of business intelligence

tools accessed structured datasets in vari-

ous databases; however, now that nearly 80%

of enterprise data is unstructured, many of

the BI platforms incorporate sophisticated

enterprise search capabilities that rely on

metadata, inferences, and connections to

multiple data sources. The vast majority

of social media data is unstructured, as we

know, and this presents significant chal-

lenges to many organizations in its overall

management: collection, staging, archiving,

analysis, governance, and security.

Many organizations today are leveraging

their legacy business intelligence tools and

platforms to perform analysis on social media

datasets, in addition to the use of sophisti-

cated tagging and automated taxonomy tools

that make search (finding the right contents

and/or objects) easier. The most basic and

easy analytical tool used by nearly everyone

is a simple alert, which combs/crawls the web

for topics related to your alert criteria.

Modern capabilities of business intelligence tools and platforms:

• Enterprise Search—structured and

unstructured data

• Ad Hoc Query Analysis and Reporting

• OLAP, ROLAP, MOLAP

• Data Mining

• Predictive and Advanced Analytics

• In-Database Analytics

• In-Memory Analytics

• Performance Management Dashboards

• Advanced Visualization, Modeling,

Simulation, and Scenario Planning

• Cloud and Mobile BI

Cloud-Based and Mobile BI and the New Innovative Business Intelligence Tools

Within the last several years, a new class

of BI tools has emerged including some open

source and cloud-based platforms/tools, some

of which are specialized for specific vertical

market segments or business processes. They

are easy to use, highly collaborative via workflow,

and some include standard and custom

reporting as well as some rudi-

mentary ETL tools. Mobile BI is one of the

fastest growing areas; however, many legacy

vendors have been slow to develop applica-

tions for BYOD, especially tablets.

These new products have innovative

semantic layers and new ways of visualizing

data, both structured and unstructured. In

some cases, these new tools tout the fact that

they can work with any database and don’t

require the building of a data warehouse or

data mart but provide access to any data any-

where. Innovative visualization dashboard

platforms and implementations have been

very attractive to business managers and have

found their way into many organizations, in

some cases, without the knowledge of the IT

department.

In-Memory

In-memory database technology, the next

major innovation in the world of business

intelligence and social media analytics,

is the game changer that will provide the

unfair advantage that leads to the compet-

itive advantage every CEO wants today.

In-memory technologies and built-in ana-

lytics are beginning to play major roles in

social analytics. The inherent business value

of in-memory technology revolves around

the ability to make real-time decisions based

on accurate information about seminal busi-

ness processes such as social media.

The ability to know and understand the

customer experience is paramount in the new

millennium as organizations strive to improve

customer service, keep customers loyal, and

gain greater insights into customer purchasing

patterns. This has become even more import-

ant as a result of social media and social media

networks that are now the new “word-of-

mouth platforms.” In-memory promises to

provide real-time data not only from transac-

tional systems but also to allow organizations to harvest and manage unstructured data from the social media sphere.

Predictive Analytics and Graph Databases

Graph databases are sometimes faster

than SQL databases and greatly enhance and extend the capabilities of predictive analytics by incorporating multiple data points and interconnections across multiple sources in real time. Predictive analytics and graph


databases are a perfect fit for the social media

landscape where various data points are

interconnected.
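
A toy version of that interconnection analysis can be sketched with the networkx library (an assumption; the article names no specific tool): model mentions and retweets as directed edges, then rank accounts with a link-analysis score such as PageRank. The accounts and edges below are invented.

# Minimal sketch of the graph idea described above: model social interactions as
# edges and rank key influencers with a link-analysis score. Assumes networkx.
import networkx as nx

g = nx.DiGraph()
# an edge u -> v means "u retweeted or mentioned v"
g.add_edges_from([
    ("ann", "brand_x"), ("bob", "brand_x"), ("cal", "ann"),
    ("dee", "ann"), ("eve", "bob"), ("ann", "bob"),
])

scores = nx.pagerank(g)
for account, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{account:8s} {score:.3f}")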

Social media analytic tools enable businesses and organizations to enhance:

• Brand and sentiment analysis

• Identification and ranking of key

influencers

• Campaign tracking and measurement

• Product launches

• Product innovation through

crowdsourcing

• Digital channel influence

• Purchase intent analysis

• Customer care

• Risk management

• Competitive intelligence

• Partner monitoring

• Category analysis

The Social Media Listening Centers

Many organizations are just starting to use social data, few are at the forefront, and most are using off-the-shelf vendor products to create social media listening/monitoring centers. These platforms operate in real time and visually display sentiment and brand analysis for products and services. The majority of organizations today are at this stage of social analytics, and again, few appear to be collecting, staging, and archiving data for further analysis and predictive analytics.

Monitoring and performing predictive analytics on social media datasets are the most obvious and common uses of analytic solutions today. Many solutions use natural language processing in the indexing and staging of social media data. Predictive analytics enable a wide array of business functions including marketing, sales, product development, competitive intelligence, customer service, and human resources to identify common and unusual patterns and opportunities in the unstructured world of social media data.

Social Media Analytical Tools

Social media analytical tools identify and analyze text strings that contain targeted search terms, which are then loaded into databases or data staging platforms such as Hadoop. This enables database queries, for example, by date, region, keyword, or sentiment, which in turn yield insights into customer attitudes toward brands, products, services, employees, and partners. The majority of products work at multiple levels and drill down into conversations, with results depicted in customizable charts and dashboards.
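As a rough illustration of the pipeline just described, the following Python sketch filters posts on targeted search terms, attaches a naive keyword-based sentiment tag, and rolls the results up by date and region for downstream queries. The field names, search terms, and word lists are hypothetical, and a production system would use proper natural language processing rather than simple keyword matching.

```python
# Minimal sketch of the pipeline described above: filter posts on targeted
# search terms, attach a naive sentiment tag, and roll up by date and region.
# Field names, search terms, and word lists are hypothetical.
from collections import Counter

SEARCH_TERMS = {"acmephone", "acme"}
POSITIVE = {"love", "great", "fast"}
NEGATIVE = {"hate", "slow", "broken"}

def sentiment(text: str) -> str:
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def matches(text: str) -> bool:
    return any(term in text.lower() for term in SEARCH_TERMS)

posts = [
    {"date": "2013-10-01", "region": "US", "text": "I love my new AcmePhone"},
    {"date": "2013-10-01", "region": "EU", "text": "AcmePhone battery is broken"},
    {"date": "2013-10-02", "region": "US", "text": "Nothing to see here"},
]

# Keep only matching posts, tag them, and count by (date, region, sentiment).
tagged = [dict(p, sentiment=sentiment(p["text"])) for p in posts if matches(p["text"])]
rollup = Counter((p["date"], p["region"], p["sentiment"]) for p in tagged)
for key, count in rollup.items():
    print(key, count)
```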

Often, analytic results are provided in customizable charts and dashboards that are easy to visualize and interpret and can be shared on enterprise collaborative platforms for decision makers. Some social media analytic platforms integrate easily with existing analytic platforms and business processes to help you act on social media insights, which can lead to improved customer satisfaction and enhanced brand reputation, and can even enable your organization to anticipate new opportunities or resolve problems.

On the bleeding edge of social media analytics is a new wave of tools and highly integrated platforms that have emerged to provide not only social media listening tools but also enable organizations to understand content preferences (or content intelligence) by affinity groups and the brands they are following or trending. Some of the innovators taking social media data to a new level include Attensity, InfiniGraph, Brandwatch, Bamboo, Kapow, Crimson Hexagon, Sysomos, Simply Measured, NetBase, and Gnip.

Current Use of Social Media BI Tools

In 2012, the SHARE users group and Guide SHARE Europe conducted a Social Media and Business Intelligence Survey, produced by Unisphere Research, a division of Information Today, Inc., and sponsored by IBM and Marist College. The survey, which examined the current state of social media data monitoring and collection and the use of business intelligence tools in more than 500 organizations, found that IBM, SAS, Oracle, and SAP were the entrenched BI platform market leaders. The majority of the sample base indicated that they were not using third-party BI tools for social media analytics.

What’s AheadThe 2012 social media and BI survey data

still provide a relevant picture of the state of

social media analytics. A majority of organi-

zations will leverage legacy business intelli-

gence vendors with familiar semantic layers

to perform rudimentary social media data

analysis. The big issue is that line-of-busi-

ness managers will not wait for nonagile IT

departments to collect, harvest, stage/build,

and perform analytics on new social media

data marts or data warehouses.

New bleeding-edge social media analytical platforms are addressing the needs of line-of-business professionals in real time. They are also leveraging the economics of utility computing and the cloud to bring cost-effective analytical platforms to nearly all organizations. These highly integrated platforms include simple social media listening tools, along with embedded analytics and predictive analytics that incorporate content and sometimes advertising abilities to meet the needs of modern digital marketers. There are also other new vendors that specialize in collecting and delivering raw social media data for those organizations which are building their own in-house social media analytics platforms.

Traditionally, marketing has had four P's. Today, marketing has five P's: product, place, position, price, and people—because in this millennium, the social media network is the new platform for "word-of-mouth marketing." ■

Peter J. Auditore is currently the principal researcher at Asterias Research, a boutique consultancy focused on information management, traditional and social analytics, and big data ([email protected]). Auditore was a member of SAP's Global Communications team for 7 years and most recently head of the SAP Business Influencer Group. He is a veteran of four technology startups: Zona Research (co-founder); Hummingbird (VP, marketing, Americas); Survey.com (president); and Exigen Group (VP, corporate communications).


sponsored content

Data Virtualization Brings Velocity and Value to Big Data Analytics

Big data analytic opportunities are abundant, with business value the driver. According to Professors Andrew McAfee and Erik Brynjolfsson of MIT:

"Companies that inject big data and analytics into their operations show productivity rates and profitability that are 5% to 6% higher than those of their peers."

DATA IS THE LIFEBLOOD OF ANALYTICS

Enterprises, flooded with a deluge of data about their customers, prospects, business processes, suppliers, partners, and competitors, understand data's critical role as the lifeblood of analytics.

THE ANALYTIC DATA CHALLENGE

However, integrating data consumes the better half of any analytic project, as variety and volume complexity constrain progress.

• Diverse data types—In the past, most analytic data was tabular, typically relational. That changed with the rise of web services and other non-relational and big data sources. Analysts must now work with multiple data types, including tabular, XML, key-value pairs, and semi-structured log data.
• Multiple interfaces and protocols—Accessing data is now more complicated. Before, analysts used ODBC to access a database or a spreadsheet. Now, analysts must access data through a variety of protocols, including web services via SOAP or REST, Hadoop data through Hive, and other types of NoSQL data via proprietary APIs.
• Larger data sets—Data sets are significantly larger. Analysts can no longer assemble all data in one place, especially if that place is their desktop. Analysts must be able to work with data where it is, intelligently subsetting it and combining it with relevant data from other high-volume sources.
• Iterative analytic methods—Exploration and experimentation define the analytic process. Finding, accessing, and pulling together data is difficult enough on its own; continuous updating and reassembling of data sets is also a must-have.

CONSOLIDATING EVERYTHING: SLOW AND COSTLY

Providing analytics with the data required has always been difficult, with data integration long considered the biggest bottleneck in any analytics or BI project. No longer is consolidating all analytics data into a data warehouse the answer.

When you need to integrate data from new sources to perform a wider, more far-reaching analysis, does it make sense to create yet another silo that physically consolidates other data silos? Or is it better to federate these silos using data virtualization?

DATA VIRTUALIZATION TO THE RESCUE

Cisco’s Data Virtualization Suite addresses

your difficult analytic data challenges.

• Rapid Data Gathering Accelerates Analytics Impact—Cisco’s nimble data

discovery and access tools makes it faster

and easier to gather together the data sets

each new analytic project requires.

• Data Discovery Addresses Data Proliferation—Data discovery automates

entity and relationship identification;

accelerating data modeling so your

analysts can better understand and

leverage your distributed data assets.

• Query Optimization for Timely Business Insight—Optimization

algorithms and techniques deliver the

timely information your analytics require.

• Data Federation Provides the Complete Picture—Virtual data integration in

memory provides the complete picture

without the cost and overhead of

physical data consolidation.

• Data Abstraction Simplifies Complex Data —Data abstraction transforms

data from native structures to common

semantics your analysts understand.

• Analytic Sandbox and Data Hub Options Provide Deployment Flexibility—Data

virtualization supports your diverse

analytic requirements from ad hoc

analyses via sandboxes to recurring

analyses via data hubs.

• Data Governance Maximizes Control—

Built-in governance ensures data security,

data quality and 7x24 operations to

balance business agility with needed

controls.

• Layered Data Architecture Enables Rapid Change—Loose coupling and

rapid development tools provide the

agility required to keep pace with your

ever-changing analytic needs.

CONCLUSION

The business value of analytics has never been greater. But data volumes and variety impact the velocity of analytic success. Data virtualization helps overcome data challenges to fulfill critical analytic data needs significantly faster and with far fewer resources than other data integration techniques.

• Empower your people with instant access to all the data they want, the way they want it
• Respond faster to your changing analytics and business intelligence needs
• Reduce complexity and save money

Better analysis equals business advantage. So take advantage of data virtualization.

LEARN MORE

To learn more about Cisco's data virtualization offerings for big data analytics, visit www.compositesw.com.


Big Data Is Transforming the Practice of Data Integration

By Stephen Swoyer

Big data is transforming both the scope and the practice of data integration. After all, the tools and methods of classic data integration evolved over time to address the requirements of the data warehouse and its orbiting constellation of business intelligence tools. In a sense, then, the single biggest change wrought by big data is a conceptual one: Big data has displaced the warehouse from its position as the focal point for data integration.

The warehouse remains a critical system and will continue to service a critical constituency of users; for this reason, data integration in the context of data warehousing and BI will continue to be important. Nevertheless, we now conceive of the warehouse as just one system among many systems, as one provider in a universe of providers. In this respect, the impact of big data isn't unlike that of the Copernican Revolution: The universe, after Copernicus, looked a lot bigger. The same can be said about data integration after big data: The size and scope of its projects—to say nothing of the problems or challenges it's tasked with addressing—look a lot bigger.

This isn't so much a function of the "bigness" of big data—of its celebrated volumes, varieties, or velocities—as of the new use cases, scenarios, projects, or possibilities that stem from our ability to collect, process, and—most important—to imaginatively conceive of "big" data management. To say that big data is the sum of its volume, variety, and velocity is a lot like saying that nuclear power is simply and irreducibly a function of fission, decay, and fusion. It's to ignore the societal and economic factors that—for good or ill—ultimately determine how big data gets used. In other words, if we want to understand how big data has changed data integration, we need to consider the ways in which we're using—or in which we want to use—big data.

Big Data Integration in Practice

In this respect, no application—no use case—is more challenging than that of advanced analytics. This is an umbrella term for a class of analytics that involves statistical analysis, machine learning, and the use of new techniques such as numerical linear algebra. From a data integration perspective, what's most challenging about advanced analytics is that it involves the combination of data from an array of multistructured sources. "Multistructured" is a category that includes structured hierarchical databases (such as IMS or ADABAS on the mainframe or—a recent innovation—HBase on Hadoop); semistructured sources, such as graph and network databases, along with human-readable sources including JSON, XML, and txt documents; and a host of so-called "unstructured" file types—documents, emails, audio and video recordings, etc. (The term "unstructured" is misleading: Syntax is structure; semantics is structure. Understood in this context, most so-called unstructured artifacts—emails, tweets, PDF files, even audio and video files—have structure. Much of the work of the next decade will focus on automating the profiling, preparation, analysis, and—yes—integration of unstructured artifacts.)

If all of this multistructured information is to be analyzed, it needs to be prepared; however, the tools or techniques required to prepare multistructured data for analysis far outstrip the capabilities of the handiest tools (e.g., ETL) in the data integration toolset. For one thing, multistructured information can't efficiently or, more to the point, cost-effectively, be loaded into a data warehouse or OLTP database. The warehouse, for example, is a schema-mandatory platform; it needs to store and manage information in terms of "facts" and "dimensions." It is most comfortable speaking SQL, and to the extent that information from nonrelational sources (such as hierarchical databases, sensor events, or machine logs) can be transformed into tabular format, it can be expressed in SQL and ingested by the data warehouse. But what about information from all multistructured sources?

Enter the category of the "NoSQL" data store, which includes a raft of open source software (OSS) projects, such as the Apache Cassandra distributed database, MongoDB, CouchDB, and—last but not least—the Hadoop stack. Increasingly, Hadoop and its Hadoop Distributed File System (HDFS) are being touted as an all-purpose "landing zone" or staging area for multistructured information.

ETL Processing and Hadoop

Hadoop is a schema-optional platform; it can function as a virtual warehouse—i.e., as a general-purpose storage area—for information of any kind. In this respect, Hadoop can be used to land, to stage, to prepare, and—in many cases—to permanently store data. This approach makes sense because Hadoop comes with its own baked-in data processing engine: MapReduce.

For this reason, many data integration vendors now market ETL products for Hadoop. Some use MapReduce itself to perform ETL operations; others substitute their own, ETL-optimized libraries for the MapReduce engine. Traditionally, programming for MapReduce is a nontrivial task: MapReduce jobs can be coded in Java, Pig Latin (the high-level language used by Pig, a platform designed to abstract the complexity of the MapReduce engine), Perl, Python, and (using open source libraries) C, C++, Ruby, and other languages. Moreover, using MapReduce as an ETL technology also presupposes a detailed knowledge of data management structures and concepts. For this reason, ETL tools that support Hadoop usually generate MapReduce jobs in the form of Java code, which can be fed into Hadoop. In this scheme, users design Hadoop MapReduce jobs just like they'd design other ETL jobs or workflows—in a GUI-based design studio.
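As a rough, hand-coded illustration of the kind of job such tools generate, and not the output of any particular vendor's design studio, the following Hadoop Streaming sketch implements a simple cleanse-and-aggregate ETL step in Python. Hadoop Streaming pipes each input record through the mapper and reducer scripts over standard input and output; the log format and field layout assumed here are hypothetical.

```python
# mapper.py -- cleanse raw, tab-delimited web log lines and emit a
# composite key ("user|day") with a count of 1 per valid record.
# Assumed input layout: timestamp <TAB> user_id <TAB> url <TAB> status
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        continue  # drop malformed records (a typical ETL cleanse rule)
    timestamp, user_id, url, status = fields
    if not status.isdigit():
        continue
    day = timestamp[:10]
    print(f"{user_id}|{day}\t1")


# reducer.py -- Hadoop Streaming delivers mapper output sorted by key,
# so equal keys arrive contiguously and can be summed in a single pass.
import sys

current_key, total = None, 0
for line in sys.stdin:
    key, count = line.rstrip("\n").split("\t")
    if current_key is not None and key != current_key:
        print(f"{current_key}\t{total}")
        total = 0
    current_key = key
    total += int(count)
if current_key is not None:
    print(f"{current_key}\t{total}")
```

A job like this would typically be submitted with the hadoop-streaming JAR (the exact path varies by distribution), passing the two scripts via the -mapper and -reducer options; the same logic could equally be expressed in Java or generated by an ETL tool as described above.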

The benefits of doing ETL processing in Hadoop are manifold: For starters, Hadoop is a massively parallel processing (MPP) environment. An ETL workload scheduled as a MapReduce job can be efficiently distributed—i.e., parallelized—across a Hadoop cluster. This makes MapReduce ideal for crunching massive datasets, and, while the sizes of the datasets used in decision support workloads aren't all that big, those used in advanced analytic workloads are. From a data integration perspective, they're also considerably more complicated, inasmuch as they involve a mix of analytic methods and traditional data preparation techniques.

Let's consider the steps involved in an "analysis" of several hundred terabytes of image or audio files sitting in HDFS. Before this data can be analyzed, it must be profiled; this means using MapReduce (or custom-coded analytic libraries) to run a series of statistical and numerical analyses, the results of which will contain information about the working dataset. From there, a series of traditional ETL operations—performed via MapReduce—can be used to prepare the data for additional analysis.

There's still another benefit to doing ETL processing in Hadoop: The information is already there. And Hadoop has an adequate—though by no means spectacular—data management toolset. For example, Hive, an interpreter that compiles its own language (HiveQL) into Hadoop MapReduce jobs, exposes a SQL-like query facility; HBase is a column-oriented data store for Hadoop that supports high user concurrency levels as well as basic insert and update operations. Finally, HCatalog is a primitive metadata catalog for Hadoop.
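As a small, hypothetical illustration of that SQL-like facility, and not an excerpt from any tool discussed here, the snippet below submits a HiveQL query from Python by shelling out to the Hive command-line client; Hive then compiles the statement into one or more MapReduce jobs. It assumes the hive client is on the PATH and that a clickstream_raw table exists, both of which are assumptions for the example.

```python
# Illustrative only: running a HiveQL query from Python via the Hive CLI.
# Assumes the `hive` client is on the PATH and a hypothetical
# clickstream_raw table; Hive compiles the query into MapReduce jobs.
import subprocess

query = """
SELECT region, COUNT(*) AS clicks
FROM clickstream_raw
WHERE dt = '2013-10-01'
GROUP BY region
"""

# `hive -e` executes a quoted HiveQL statement and writes rows to stdout.
result = subprocess.run(
    ["hive", "-e", query], capture_output=True, text=True, check=True
)
for line in result.stdout.splitlines():
    region, clicks = line.split("\t")
    print(region, clicks)
```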

Data Integration Use Cases

Right now, most data integration use cases involve getting information out of Hadoop. This is chiefly because Hadoop's data management feature set is primitive compared to those of more established platforms. Hadoop, for example, isn't ACID-compliant. In the advanced analytic example cited above, a SQL platform—not Hadoop—would be the most likely destination for the resultant dataset. Almost all database vendors and a growing number of analytic applications boast connectivity of some kind into Hadoop.

Others promote the use of Hadoop as a kind of queryable archive. This use case could involve using Hadoop to persist historical data—e.g., "cold" or infrequently accessed data that (by virtue of its sheer volume) could impact the performance or cost of a data warehouse. Still another emerging scenario involves using Hadoop as a repository in which to persist the raw data that feeds a data warehouse. In traditional data integration, this data is often staged in a middle tier, which can consist of an ETL repository or an operational data store (ODS). On a per-gigabyte or per-terabyte basis, both the ETL and ODS stores are more expensive than Hadoop. In this scheme, some or all of this data could be shifted into Hadoop, where it could be used to (inexpensively) augment analytic discovery (which prefers denormalized or raw data) or to assist with data warehouse maintenance—e.g., in case dimensions are added or have to be rekeyed.
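A minimal sketch of the queryable-archive idea, under the assumption of a configured warehouse ODBC data source, a hypothetical sales_fact table, and a Hadoop client on the local machine, might pull cold rows out of the warehouse and land them in HDFS, where Hive or MapReduce jobs can still reach them:

```python
# Hedged sketch of the cold-data offload scenario: export infrequently
# accessed warehouse rows to a local file, then push that file into HDFS.
# The ODBC DSN, table name, and cutoff date are hypothetical.
import csv
import subprocess

import pyodbc

conn = pyodbc.connect("DSN=warehouse")  # assumes a configured ODBC data source
cursor = conn.cursor()
cursor.execute(
    "SELECT order_id, customer_id, order_date, amount "
    "FROM sales_fact WHERE order_date < ?", "2008-01-01"
)

with open("sales_fact_cold.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cursor.description])
    for row in cursor:
        writer.writerow(list(row))

conn.close()

# Land the extract in HDFS, where Hadoop tools (Hive, MapReduce) can query it.
subprocess.run(
    ["hdfs", "dfs", "-put", "-f", "sales_fact_cold.csv", "/archive/sales_fact/"],
    check=True,
)
```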

Still another use case involves offloading workloads from Hadoop to SQL analytic platforms. Some of these platforms are able to execute analytic algorithms inside their database engines. Some SQL DBMS vendors claim that an advanced analysis will run faster on their own MPP platforms than on Hadoop using MapReduce. They note that MapReduce is a brute-force data processing tool, and while it's ideal for certain kinds of workloads, it's far from ideal as a general-purpose compute engine. This is why so much Hadoop development work has focused on YARN—Yet Another Resource Negotiator—which will permit Hadoop to schedule, execute, and manage non-MapReduce jobs.

The benefits of doing so are manifold, especially from a data integration perspective. First, even though some ETL tools run in Hadoop and replace MapReduce with their own engines, Hadoop itself provides no native facility to schedule or manage non-MapReduce jobs. (Hadoop's existing JobTracker and TaskTracker paradigm is tightly coupled to the MapReduce compute engine.) Second, YARN should permit users to run optimized analytic libraries—much like the SQL analytic database vendors do—in the Hadoop environment. This promises to be faster and more efficient than the status quo, which involves coding analytic workloads as MapReduce jobs. Third, YARN could help stem the flow of analytic workloads out of Hadoop and encourage analytic workloads to be shifted from the SQL world into Hadoop. Even though it might be faster to run an analytic workload in an MPP database platform, it probably isn't cheaper—relative, that is, to running the same workload in Hadoop.

Alternatives to Hadoop

But while big data is often discussed through the prism of Hadoop, owing to the popularity and prominence of that platform, alternatives abound. Among NoSQL platforms, for example, there's Apache Cassandra, which is able to host and run Hadoop MapReduce workloads and which, unlike Hadoop, has no single point of failure. There's also Spanner, Google's successor to BigTable. Google runs its F1 DBMS—a SQL- and ACID-compliant database platform—on top of Spanner, which has already garnered the sobriquet "NewSQL." (And F1, unlike Hadoop, can be used as a streaming database. Here and elsewhere, Hadoop's file-based architecture is a significant constraint.) Remember, a primary contributor to Hadoop's success is its cost: As an MPP storage and compute platform, Hadoop is significantly less expensive than existing alternatives. But Hadoop by itself isn't ACID-compliant and doesn't expose a native SQL interface. To the extent that technologies such as F1 address existing data management requirements, enable scalable parallel workload processing, and expose more intuitive programming interfaces, they could constitute compelling alternatives to Hadoop.

What's Ahead

Big data, along with related technologies such as Hadoop and other NoSQL platforms, is just one of several destabilizing forces on the IT horizon, however. Other technologies are changing the practice of data integration—such as the shift to the cloud and the emergence of data virtualization.

Cloud will change how we consume and interact with—and, for that matter, what we expect of—applications and services. From a data integration perspective, cloud, like big data, entails its own set of technological, methodological, and conceptual challenges. Traditional data integration evolved in a client-server context; it emphasizes direct connectivity between resources—e.g., a requesting client and a providing server. The conceptual model for cloud, on the other hand, is that of representational state transfer, or REST. In place of client-server's emphasis on direct, stateful connectivity between resources, REST emphasizes abstract, stateless connectivity. It prescribes the use of new and nontraditional APIs or interfaces. Traditional data integration makes use of tools such as ODBC, JDBC, or SQL to query for and return a subset of source data. REST components, on the other hand, structure and transfer information in the form of files—e.g., HTML, XML, or JSON documents—that are representations of a subset of source data. For this reason, data integration in the context of the cloud entails new constraints, makes use of new tools, and will require the development of new practices and techniques.
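To make the contrast concrete, consider a sketch of the two access styles side by side; the ODBC data source name and the REST endpoint here are hypothetical placeholders, not references to any specific product.

```python
# A sketch of the two access styles contrasted above. The ODBC DSN and the
# REST endpoint are hypothetical placeholders.
import pyodbc
import requests

# Client-server style: a direct, stateful connection; the query returns rows.
conn = pyodbc.connect("DSN=orders_db")
rows = conn.cursor().execute(
    "SELECT order_id, status FROM orders WHERE customer_id = ?", 1042
).fetchall()
conn.close()

# REST style: a stateless request; the response is a JSON representation
# of a subset of the source data, not a live cursor over it.
resp = requests.get(
    "https://api.example.com/customers/1042/orders",
    headers={"Accept": "application/json"},
    timeout=30,
)
resp.raise_for_status()
orders = resp.json()

print(len(rows), "rows via ODBC;", len(orders), "records via REST")
```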

That said, it doesn't mean throwing out existing best practices: If you want to run sales analytics on data in your Salesforce.com cloud, you've either got to load it into an existing, on-premises repository or—alternatively—expose it to a cloud analytics provider. In the former case, you're going to have to extract your data from Salesforce, prepare it, and load it into the analytic repository of your choice, much as you would do with data from any other source. The shift to the cloud isn't going to mean the complete abandonment of on-premises systems. Both will coexist.

Data virtualization, or DV, is another technology that should be of interest to data integration practitioners. DV could play a role in knitting together the fabric of the post-big data, post-cloud application-scape. Traditionally, data integration was practiced under fairly controlled conditions: Most systems (or most consumables, in the case of flat files or files uploaded via FTP) were internal to an organization, i.e., accessible via a local area network. In the context of both big data and the cloud, data integration is a far-flung practice. Data virtualization technology gives data architects a means to abstract resources, regardless of architecture, connectivity, or physical location.

Conceptually, DV is REST-esque in that it exposes canonical representations (i.e., so-called business views) of source data. In most cases, in fact, a DV business view is a representation of subsets of data stored in multiple distributed systems. DV can provide a virtual abstraction layer that unifies resources strewn across—and outside of—the information enterprise, from traditional data warehouse systems to Hadoop and other NoSQL platforms to the cloud. DV platforms are polyglot: They speak SQL, ODBC, JDBC, and other data access languages, along with procedural languages such as Java and (of course) REST APIs.
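Conceptually, a business view behaves like the toy function below, which federates a relational source and a REST source behind one callable and joins only the needed subsets in memory. This is an illustration of the idea, not any DV product's API; the connection string, table, and endpoint are hypothetical, and a real DV platform would also push filters down to the sources and optimize the federation.

```python
# Toy illustration of a "business view" in the data virtualization sense:
# one callable that federates a relational source and a REST source and
# returns a unified representation. Connection details are hypothetical.
import pyodbc
import requests

def customer_360_view(customer_id: int) -> dict:
    # Subset from the on-premises warehouse (SQL over ODBC).
    conn = pyodbc.connect("DSN=warehouse")
    profile = conn.cursor().execute(
        "SELECT name, segment FROM dim_customer WHERE customer_id = ?",
        customer_id,
    ).fetchone()
    conn.close()

    # Subset from a cloud application (REST returning JSON).
    resp = requests.get(
        f"https://support.example.com/api/tickets?customer={customer_id}",
        timeout=30,
    )
    resp.raise_for_status()

    # The "view": a canonical representation joined in memory.
    return {
        "customer_id": customer_id,
        "name": profile.name,
        "segment": profile.segment,
        "open_tickets": [t["id"] for t in resp.json() if t.get("open")],
    }
```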

Moreover, DV's prime directive is to move as little data as possible. As data volumes scale into the petabyte range, data architects must be alert to the practical physics of data movement. It's difficult, if not impossible, to move even a subset of a multi-petabyte repository in a timely or cost-effective manner. ■

Stephen Swoyer is a technology writer with more than 15 years of experience. His writing has focused on business intelligence, data warehousing, and analytics for almost a decade. He's particularly intrigued by the thorny people and process problems most BI and DW vendors almost never want to acknowledge, let alone talk about. You can contact him at [email protected].


industry directory


Appfluent transforms the economics of Big Data and Hadoop. Appfluent provides IT organizations with unprecedented visibility into usage and performance of data warehouse and business intelligence systems. IT decision makers can view exactly which data is being used or not used, determine how business intelligence systems are performing, and identify causes of database performance issues. With Appfluent, enterprises can address exploding data growth with confidence, proactively manage performance of BI and data warehouse systems, and realize the tremendous economies of Hadoop.

Learn more at www.appfluent.com.

APPFLUENT TECHNOLOGY, INC.

6001 Montrose Road, Suite 1000

Rockville, MD 20852

301-770-2888

[email protected]

www.appfluent.com

Attunity is a leading provider of data integration software solutions that make Big Data available where and when needed across heterogeneous enterprise platforms and the cloud. Attunity solutions accelerate mission-critical initiatives including BI/Big Data Analytics, Disaster Recovery, Content Distribution, and more. Solutions include data replication, change data capture (CDC), data connectivity, enterprise file replication (EFR), managed file transfer (MFT), and cloud data delivery. For 20 years, Attunity has supplied innovative software solutions to thousands of enterprise-class customers worldwide to enable real-time access and availability of any data, anytime, anywhere across the maze of systems making up today's IT environment.

Learn more at www.attunity.com.

ATTUNITY

www.attunity.com

SEE OUR AD ON

PAGE 49

CodeFutures is the provider of dbShards, the Big Data platform that makes your database scalable and reliable. dbShards is not a database—instead, dbShards works with proven DBMS engines you know and trust. dbShards gives your application transparent access to one or more DBMS engines, providing the Big Data scalability, high availability, and disaster recovery you need for demanding "always-on" operation. You can even use dbShards to seamlessly migrate your database from one environment to another—between regions, cloud vendors, and your own data center.

For more information, go to www.dbshards.com.

CODEFUTURES CORPORATION

11001 West 120th Avenue, Suite 400

Broomfield, CO 80021

(303) 625-4084

[email protected]

www.dbshards.com

Composite Software, now part of Cisco, is the data virtualization market leader. Hundreds of organizations use the Composite Data Virtualization Platform's streamlined approach to data integration to gain more insight from their data, respond faster to ever-changing analytics and BI needs, and save 50–75% over data replication and consolidation.

Cisco Systems, Inc. completed the acquisition of Composite Software, Inc. on July 29, 2013.

COMPOSITE SOFTWARE

Please call us: (650) 227-8200

Follow us on Twitter: http://twitter.com/compositesw

www.compositesw.com

SEE OUR AD ON

PAGE 35


Datawatch is the leading provider of visual data discovery solutions that allow organizations to optimize the use of any information, whether it is structured, unstructured, or semi-structured data locked in content like static reports, PDF files, and EDI streams, or in real-time sources like CEP engines, tick feeds, and machine data. Through an unmatched visual data discovery environment and the industry's leading information optimization software, Datawatch allows you to utilize ALL data to deliver a complete picture of your business from every aspect and then manage, secure, and deliver that information to transform business processes, increase visibility to critical Big Data sources, and improve business intelligence applications offering broader analytical capabilities.

Datawatch provides the solution to Get the Whole Story!

DATAWATCH CORPORATION 271 Mill Road, Quorum Office Park Chelmsford, MA 01824 978-441-2200 [email protected]

www.datawatch.com

SEE OUR AD ON

PAGE 11

DataMentors provides award-winning data quality and database marketing solutions. Offered as either a customer-premise installation or an ASP-delivered solution, DataMentors leverages proprietary data discovery, analysis, campaign management, data mining, and modeling practices to identify proactive, knowledge-driven decisions. DataFuse, DataMentors' data quality and integration solution, is consistently recognized by industry-leading analysts for its extreme flexibility and ease of householding. DataMentors' marketing database solution, PinPoint, quickly and accurately analyzes, segments, and profiles customers' preferences and behaviors. DataMentors also offers social media marketing, drive time analysis, email marketing, data enhancements, and behavior models to further enrich the customer experience across all channels.

DATAMENTORS

2319-104 Oak Myrtle Lane

Wesley Chapel, FL 33544

Phone: 813-960-7800

Email: [email protected]

www.DataMentors.com

SEE OUR AD ON

PAGE 47

Delphix delivers agility to enterprise application projects, addressing the largest source of inefficiency and inflexibility in the datacenter—provisioning, managing, and refreshing databases for business-critical applications. With Delphix in place, QA engineers spend more time testing and less time waiting for new data, increasing utilization of expensive test infrastructure. Analysts and managers make better decisions with fresh data in data marts and warehouses. Leading global organizations use Delphix to dramatically reduce the time, cost, and risk of application rollouts, accelerating packaged and custom applications projects and reporting.

DELPHIX

275 Middlefield Road

Menlo Park, CA 94025

[email protected]

www.delphix.com

Denodo is the leader in data virtualization. Denodo enables hybrid data storage for big data warehouse and analytics—providing unmatched performance, unified virtual access to the broadest range of enterprise, big data, cloud, and unstructured sources, and agile data services provisioning—which has allowed reference customers in every major industry to minimize the cost and pitfalls of big data technology and accelerate its adoption and value by making it transparent to business users. Denodo is also used for cloud integration, single-view applications, and RESTful linked data services. Founded in 1999, Denodo is privately held.

DENODO TECHNOLOGIES

[email protected]

www.denodo.com


DBMoto® is the preferred solution for heterogeneous Data Replication and Change Data Capture requirements in an enterprise environment. Whether replicating data to a lower TCO database, synchronizing data among disparate operational systems, creating a new columnar or high-speed analytic database or data mart, or building a business intelligence application, DBMoto is the solution of choice for fast, trouble-free, easy-to-maintain Data Replication and Change Data Capture projects. DBMoto is mature and approved by enterprises ranging from midsized to Fortune 1000 worldwide. HiT Software®, Inc., a BackOffice Associates® LLC Company, is based in San Jose, CA.

For more information see www.info.hitsw.com/DBTA-bds2013/

HIT SOFTWARE, INC., A BACKOFFICE ASSOCIATES LLC COMPANY Contact: Giacomo Lorenzin 408-345-4001 [email protected]

www.hitsw.com

SEE OUR AD ON

PAGE 37

Nearly 80% of all existing data is generally only available in unstructured form and does not contain additional, descriptive metadata. This content, therefore, cannot be machine-processed automatically with conventional IT. It demands human interaction for interpretation, which is impossible to achieve when faced with the sheer volume of information. Based on the highly scalable Information Access System, Empolis offers methods for analyzing unstructured content perfectly suitable for a wide range of applications. For instance, Empolis technology is able to semantically annotate and process an entire day of traffic on Twitter in less than 20 minutes, or the German version of Wikipedia in three minutes. In addition to statistical algorithms, this also covers massive parallel processing utilizing linguistic methods for information extraction. These, in turn, form the basis for our Smart Information Management solutions, which transform unstructured content into structured information that can be automatically processed with the help of content analysis.

EMPOLIS INFORMATION MANAGEMENT GMBH Europaallee 10 | 67657 Kaiserslautern | Germany Phone +49 631 68037-0 | Fax +49 631 68037-77 [email protected]

www.empolis.com

Kapow Software, a Kofax company, harnesses the power of legacy data and big data, making it actionable and accessible across organizations. Hundreds of large global enterprises including Audi, Intel, Fiserv, Deutsche Telekom, and more than a dozen federal agencies rely on its agile big data integration platform to make smarter decisions, automate processes, and drive better outcomes faster. They leverage the platform to give business consumers a flexible 360-degree view of information across any internal and external source, providing organizations with a data-driven advantage.

For more information, please visit: www.kapowsoftware.com.

KAPOW SOFTWARE

260 Sheridan Avenue, Suite 420

Palo Alto, CA 94306

Phone: +1 800 805 0828

Fax: +1 650 330 1062

Email: [email protected]

www.kapowsoftware.com

HPCC Systems® from LexisNexis® is an open-source, enterprise-ready solution designed to help detect patterns and hidden relationships in Big Data across disparate data sets. Proven for more than 10 years, HPCC Systems helped LexisNexis Risk Solutions scale to a $1.4 billion information company now managing several petabytes of data on a daily basis from 10,000 different sources.

HPCC Systems was built for small development teams and offers a single architecture and one programming language for efficient data processing of large or complex queries. Customers, such as financial institutions, insurance companies, law enforcement agencies, the federal government, and other enterprise organizations, leverage the HPCC Systems technology through LexisNexis products and services. HPCC Systems is available in an Enterprise and a Community version under the Apache license.

LEXISNEXIS

Phone: 877.316.9669

www.hpccsystems.com

www.lexisnexis.com/risk

SEE OUR AD ON

PAGE 43


Founded in 1989, MicroStrategy (Nasdaq: MSTR) is a leading worldwide provider of enterprise software platforms. Millions of users use the MicroStrategy Analytics Platform™ to analyze vast amounts of data and distribute actionable business insight throughout the enterprise. Our analytics platform delivers interactive dashboards and reports that users can access and share via web browsers, information-rich mobile apps, and inside Microsoft® Office applications. Big data analytics delivered with MicroStrategy will enable businesses to analyze big data visually without writing code and apply advanced analytics to obtain deep insights from all of their data.

To learn more and try MicroStrategy free, visit microstrategy.com/bigdatabook.

MICROSTRATEGY

1850 Towers Crescent Plaza

Tysons Corner, VA 22182 USA

Phone: 888.537.8135

Email: [email protected]

www.microstrategy.com/bigdatabook

SEE OUR AD ON

COVER 4

Since 1988, Objectivity, Inc. has been the Enterprise NoSQL leader, helping customers harness the power of Big Data. Our leading-edge technologies: InfiniteGraph, The Distributed Graph Database™, and Objectivity/DB, a distributed and scalable object management database, enable organizations to discover hidden relationships for improved Big Data analytics and develop applications with significant time-to-market advantages and technical cost savings, achieving greater return on data-related investments. Objectivity, Inc. is committed to our customers' success, with representatives worldwide. Our clients include: AWD Financial, CUNA Mutual, Draeger Medical, Ericsson, McKesson, IPL, Siemens, and the US Department of Defense.

OBJECTIVITY, INC.

3099 North First Street, Suite 200

San Jose, CA 95134 USA

408-992-7100

[email protected]

www.objectivity.com

SEE OUR AD ON

PAGE 7

Progress DataDirect provides high-performance, real-time connectivity to applications and data deployed anywhere. From SaaS applications like Salesforce to Big Data sources such as Hadoop, DataDirect makes these sources appear just like a regular relational database. Whether you are connecting your own application or your favorite BI and reporting tools, DataDirect makes it easy to access your critical business information.

More than 300 leading independent software vendors embed Progress Software's DataDirect components in over 400 commercial products. Further, 96 of the Fortune 100 turn to Progress Software's DataDirect to simplify and streamline data connectivity.

PROGRESS DATADIRECT

www.datadirect.com

SEE OUR AD ON

PAGE 41

Percona has made MySQL and integrated MySQL/big data solutions faster and more reliable for over 2,000 customers worldwide. Our experts help companies integrate MySQL with big data solutions including Hadoop, HBase, Hive, MongoDB, Vertica, and Redis. Percona provides enterprise-grade Support, Consulting, Training, Remote DBA, and Server Development services for MySQL or integrated MySQL/big data deployments. Our founders authored the book High Performance MySQL and the MySQL Performance Blog. We provide open source software including Percona Server, Percona XtraDB Cluster, Percona Toolkit, and Percona XtraBackup. We also host Percona Live conferences for MySQL users worldwide.

For more information, visit www.percona.com.

PERCONA

www.percona.com


TransLattice provides its customers corporate-wide visibility, dramatically improved system availability, simple scalability, and significantly reduced deployment complexity, all while enabling data location compliance. Computing resources are tightly integrated to enable enterprise databases to be spread across an organization as needed, whether on-premise or in the cloud, providing data where and when it is needed. Nodes work seamlessly together, and if a portion of the system goes down, the rest of the system is not affected. Data location is policy-driven, enabling proactive compliance with regulatory requirements. This simplified approach is fundamentally more reliable, more scalable, and more cost-effective than traditional approaches.

TRANSLATTICE

+1 408 749-8478

[email protected]

www.TransLattice.com

SEE OUR AD ON

PAGE 9

Splice Machine is the only transactional SQL-on-Hadoop database for real-time Big Data applications. Splice Machine provides all the benefits of NoSQL databases, such as auto-sharding, scalability, fault tolerance, and high availability, while retaining SQL—the industry standard. It optimizes complex queries to power real-time OLTP and OLAP apps at scale without rewriting existing SQL-based apps and BI tool integrations. Splice Machine provides fully ACID transactions and uses Multi-Version Concurrency Control (MVCC) with lockless snapshot isolation to enable real-time database updates with very high throughput.

SPLICE MACHINE

[email protected]

www.splicemachine.com

SEE OUR AD ON

COVER 2



Each issue of DBTA features original and valuable content—providing you with clarity, perspective, and objectivity in a complex and exciting world where data assets hold the key to organizational competitiveness.

Don’t miss an issue! Subscribe FREE* today!

*Print edition free to qualified U.S. subscribers.

Need Help Unlocking the Full Value of Your Information?

DBTA magazine is here to help.



Get the inside scoop on the hottest topics in data management and analysis:

• Big Data technologies, including Hadoop, NoSQL, and in-memory databases
• Increasing efficiency through cloud technologies and services
• Solving complex data and application integration challenges
• Tools and techniques reshaping the world of business intelligence
• Key strategies for increasing database performance and availability
• New approaches for agile data warehousing

Best Practices and Thought Leadership Reports

For information on upcoming reports: http://iti.bz/dbta-editorial-calendar
To review past reports: http://iti.bz/dbta-whitepapers
