Big Data Sourcebook: Your Guide to the Data Revolution
Tired of the Big Data hype? Get real with SQL on Hadoop: real-time, real scale, real apps, real SQL.

Splice Machine is the real-time, SQL-on-Hadoop database. For companies contemplating a costly scale-up of a traditional RDBMS, struggling to extract value out of their data inside of Hadoop, or looking to build new data-driven applications, the power of Big Data can feel just out of reach. Splice Machine powers real-time queries and real-time updates on both operational and analytical workloads, delivering real answers and real results to companies looking to harness their Big Data streams.

Learn more at www.splicemachine.com
CONTENTS, BIG DATA SOURCEBOOK, DECEMBER 2013

introduction
2  The Big Picture, by Joyce Wells

industry updates
4  The Battle Over Persistence and the Race for Access Hill, by John O’Brien
10 The Age of Big Data Spells the End of Enterprise IT Silos, by Alex Gorbachev
16 Big Data Poses Legal Issues and Risks, by Alon Israely
20 Unlocking the Potential of Big Data in a Data Warehouse Environment, by W. H. Inmon
26 Cloud Technologies Are Maturing to Address Emerging Challenges and Opportunities, by Chandramouli Venkatesan
30 Data Quality and MDM Programs Must Evolve to Meet Complex New Challenges, by Elliot King
34 In Today’s BI and Advanced Analytics World, There Is Something for Everyone, by Joe McKendrick
40 Social Media Analytic Tools and Platforms Offer Promise, by Peter J. Auditore
46 Big Data Is Transforming the Practice of Data Integration, by Stephen Swoyer
Michael Corey, Chief Executive Officer, Ntirety
Bill Miller, Vice President and General Manager, BMC Software
Mike Ruane, President/CEO, Revelation Software
Robin Schumacher, Vice President of Product Management, DataStax
Susie Siegesmund, Vice President and General Manager, U2 Brand, Rocket Software
BIG DATA SOURCEBOOK is published annually by Information Today, Inc.,
143 Old Marlton Pike, Medford, NJ 08055
POSTMASTER
Send all address changes to:
Big Data Sourcebook, 143 Old Marlton Pike, Medford, NJ 08055
Copyright 2013, Information Today, Inc. All rights reserved.
PRINTED IN THE UNITED STATES OF AMERICA
The Big Data Sourcebook is a resource for IT managers and professionals providing information on the enterprise and technology issues surrounding the ‘big data’ phenomenon and the need to better manage and extract value from large quantities of structured, unstructured and semi-structured data. The Big Data Sourcebook provides in-depth articles on the expanding range of NewSQL, NoSQL, Hadoop, and private/public/hybrid cloud technologies, as well as new capabilities for traditional data management systems. Articles cover business- and technology-related topics, including business intelligence and advanced analytics, data security and governance, data integration, data quality and master data management, social media analytics, and data warehousing.
No part of this magazine may be reproduced by any means—print, electronic, or any other—without written permission of the publisher.
COPYRIGHT INFORMATION
Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by Information Today, Inc., provided that the base fee of US $2.00 per page is paid directly to Copyright Clearance Center (CCC), 222 Rosewood Drive, Danvers, MA 01923, phone 978-750-8400, fax 978-750-4744, USA. For those organizations that have been granted a photocopy license by CCC, a separate system of payment has been arranged. Photocopies for academic use: Persons desiring to make academic course packs with articles from this journal should contact the Copyright Clearance Center to request authorization through CCC’s Academic Permissions Service (APS), subject to the conditions thereof. Same CCC address as above. Be sure to reference APS.
Creation of derivative works, such as informative abstracts, unless agreed to in writing by the copyright owner, is forbidden.
Acceptance of advertisement does not imply an endorsement by Big Data Sourcebook. Big Data Sourcebook disclaims responsibility for the statements, either of fact or opinion, advanced by the contributors and/or authors.
© 2013 Information Today, Inc.
PUBLISHED BY Unisphere Media—a Division of Information Today, Inc.
EDITORIAL & SALES OFFICE 630 Central Avenue, Murray Hill, New Providence, NJ 07974
CORPORATE HEADQUARTERS 143 Old Marlton Pike, Medford, NJ 08055
Thomas Hogan Jr., Group Publisher 808-795-3701; thoganjr@infotoday
Joyce Wells, Managing Editor 908-795-3704; [email protected]
Joseph McKendrick, Contributing Editor; [email protected]
Sheryl Markovits, Editorial and Project Management Assistant (908) 795-3705; [email protected]
Celeste Peterson-Sloss, Deborah Poulson, Alison A. Trotta, Editorial Services
Denise M. Erickson, Senior Graphic Designer
Jackie Crawford, Ad Trafficking Coordinator
Alexis Sopko, Advertising Coordinator 908-795-3703; [email protected]
Sheila Willison, Marketing Manager, Events and Circulation 859-278-2223; [email protected]
DawnEl Harris, Director of Web Events; [email protected]
ADVERTISING Stephen Faig, Business Development Manager, 908-795-3702; [email protected]
Thomas H. Hogan, President and CEO
Roger R. Bilboul, Chairman of the Board
John C. Yersak, Vice President and CAO
Richard T. Kaser, Vice President, Content
Thomas Hogan Jr., Vice President, Marketing and Business Development
M. Heide Dengler, Vice President, Graphics and Production
Bill Spence, Vice President, Information Technology
INFORMATION TODAY, INC. EXECUTIVE MANAGEMENT
DATABASE TRENDS AND APPLICATIONS EDITORIAL ADVISORY BOARD
From the publishers of
2 BIG DATA SOURCEBOOK 2013
The Big Picture

By Joyce Wells

DBTA’s Big Data Sourcebook is a guide to the enterprise and technology issues IT professionals are being asked to cope with as business or organizational leadership increasingly defines strategies that leverage the ‘big data’ phenomenon.

It has been well-documented that social media, web, transactional, machine-generated, and traditional relational data are being collected within organizations at an accelerated pace. Today, according to common industry estimates, 80% of enterprise data is unstructured—or schema-less.

The reality of what is taking place in IT organizations today is more than hype. According to an SAP-sponsored survey of 304 data managers and professionals, conducted earlier this year by Unisphere Research, a division of Information Today, Inc., between one-third and one-half of respondents have high levels of volume, variety, velocity, and value in their data—the well-known four characteristics that define big data. The “2013 Big Data Opportunities Survey” found that two-fifths of respondents have data stores reaching into the hundreds of terabytes and greater. Eleven percent of respondents said the total data they manage ranges from 500TB to 1PB, 8% had between 1PB and 10PB, and 9% had more than 10PB.

In addition, data stores are growing rapidly. According to another study produced by Unisphere Research and sponsored by Oracle, almost nine-tenths of the 322 respondents say they are experiencing year-over-year growth in their data assets. Respondents to the survey were data managers and professionals who are members of the Independent Oracle Users Group (IOUG). For many, this growth is in double-digit ranges. Forty-one percent report significant growth levels, defined as exceeding 25% a year. Seventeen percent report that the rate of growth has been more than 50% (“Achieving Enterprise Data Performance: 2013 IOUG Database Growth Survey”).

Big data offers enormous potential to organizations and represents a major transformation of information technology. Beyond the obvious need to effectively store and protect this data, IT organizations are increasingly seeking to integrate their disparate forms of data and to perform analytics in order to uncover information that will result in their organization’s competitive advantage. What makes big data valuable is the ability to deliver insights to decision makers that can propel organizations forward and grow revenue.

As might be expected, the largest organizations in the SAP-Unisphere study—those with 1,000 employees and up—are engaged in big data initiatives, but many smaller firms are pursuing big data projects as well. More than a third of the smallest companies or agencies in the survey, 37%, say they are involved in big data efforts, along with 43% of organizations with employees in the hundreds. According to the study, three-fourths of respondents have users at their organizations who are pushing for access to more data to do their jobs.

Products to address the big data challenge are coming to the rescue. The expanding range of NewSQL, NoSQL, Hadoop, and private/public/hybrid cloud technologies, as well as newer capabilities for traditional data management systems, presents extraordinary advantages in effectively dealing with the data deluge. But which approaches are best for each individual organization? Which approaches will have staying power, and which will fall by the wayside? As Radiant Advisors’ John O’Brien rightly points out in his overview of the state of big data, we are in the infancy of a new era—and moving into a new era has never been easy.

To help advance the discussion, in this issue, DBTA has assembled a cadre of expert authors who each drill down on a single key area of the big data picture. In addition, leading vendors showcase their products and unique approaches to achieving value and mitigating risk in big data projects. Together, these articles and sponsored content provide rich insight on the current trends and opportunities—as well as pitfalls to avoid—when addressing the emerging big data challenge. ■
DBTA.COM 3
sponsored content
Does Big Data = Big Business Value?

With big data, the world has gotten far more complex for IT managers and those in charge of keeping a business moving forward. So how do you simplify your architecture and operations while raising the value of the innovative tools you’ve crafted to meet your business goals? With the emergence of simple key/value type data stores—such as MongoDB, Cassandra, social media databases, and Hadoop—data connectivity is evolving to meet requirements for speed and consistency.

AN EXAMPLE
Every year, NASA and the National Science Foundation host a contest across the scientific communities, the results often resonating in both the academic and business worlds. The latest challenge: How can organizations pull together all the right data from a variety of sources before performing analysis, drawing conclusions, and making decisions? Sounds like big data, right?

Consider the problem of determining if life ever existed on Mars. A huge variety of data collected by the Mars rover is fed into clusters of databases around the world. It then gets transmitted as a whole to a variety of data sets and Hadoop clusters. What do we do with it? How does the scientific community organize itself to deal with this influx?

There are similar examples in every industry, all leading to key integration challenges: How do we make dissimilar data sets uniformly accessible? And how do we extract the most relevant information in a fast, scalable, and consistent way?

The problems of data access and relevancy are complicated by three additional data processing realities:

1. Big data is driven by economics. When the cost of keeping information is less than the cost of throwing it away, more data survives.

2. Applications are driven by data. Big data applications drive data analysis. That’s what they’re for. And they all have the same marching orders: Get the right data to the right people at the right time.

3. Dark data happens. Because nothing is thrown away, some data may linger for years without being valued or used. This “dark data” might not be relevant for one analysis, but could be critical for another. In theory and in future practice, nothing is “irrelevant.”

THE BIG DATA MARKET
According to a recent Progress DataDirect survey, most respondents use Hadoop file systems or plan to use them within two years. Respondents also included Microsoft HDInsight, Cloudera, Oracle BDA, and Amazon EMR in the list of technology they plan to use in the next two years. This indicates the growing market awareness that it is now economically feasible to store and process many large data sets, and to analyze them in their entirety.

The survey also asked respondents to rank leading new data storage technologies. MongoDB and Cassandra have both gained a large foothold. Progress DataDirect will soon be supporting them.

TECHNOLOGY ADDRESSES THE NEED
Market growth and maturation have led to new approaches for storage and analysis of both structured and multi-structured data. Recent breakthroughs include:

• Integration of external and social data with corporate data for a more complete perspective.
• Adoption of exploratory analytic approaches to identify new patterns in data.
• Predictive analytics coming on strong as a fundamental component of business intelligence (BI) strategies.
• Increased adoption of in-memory databases for rapid data ingestion.
• Real-time analysis of data prior to storage within data warehouses and Hadoop clusters.
• A requirement for interactive, native, SQL-based analysis of data in Hadoop and HBase.

As the cost of keeping collected data plummets, new data sources are proliferating. To address the growing need, organizations must be able to connect a variety of BI applications to a variety of data sources, all with different APIs and designs—without forcing developers to learn new APIs or to constantly re-code applications. The connection has to be fast, consistent, scalable, and efficient. And most importantly, it should provide real-time data access for smarter operations and decision making.

SQL connectivity, the central value of our Progress DataDirect solutions, is the answer. It delivers a high-performance, scalable, and consistent way to access new data sources both on-premise and in the cloud. With SQL, we treat every data source as a relational database—a fundamentally more efficient and simplified way of processing data.

PROGRESS DATADIRECT www.datadirect.com
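The idea of treating every data source as a relational database can be sketched with Python's standard library alone. This is not Progress DataDirect's implementation, just a minimal illustration: a handful of JSON documents (the field names and values are invented) are projected onto a relational table so that ordinary SQL works against a source that never had a schema.

```python
import json
import sqlite3

# Hypothetical JSON-lines export from a key/value store such as MongoDB;
# note the records do not all share the same fields.
raw_records = [
    '{"user": "ana", "clicks": 12, "region": "EU"}',
    '{"user": "bo", "clicks": 7}',
    '{"user": "cy", "clicks": 31, "region": "US"}',
]

# Project each document onto a relational table; a missing field
# simply becomes a NULL, so SQL tooling can query the whole source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER, region TEXT)")
for line in raw_records:
    doc = json.loads(line)
    conn.execute(
        "INSERT INTO events VALUES (?, ?, ?)",
        (doc.get("user"), doc.get("clicks"), doc.get("region")),
    )

# Standard SQL over a non-relational source
total = conn.execute("SELECT SUM(clicks) FROM events").fetchone()[0]
print(total)  # 50
```

A SQL-on-Hadoop driver does far more than this, of course, but the core move is the same: impose a relational projection at the connectivity layer so BI tools need no new API.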
industry updates
The Battle Over Persistence and the Race for Access Hill
The State of Big Data in 2013
By John O’Brien
Shifting gears into a new era has never been easy during the transition. Only in hindsight do we clearly see what was right in front of our faces—and probably the whole time. This is one nugget of wisdom I have been sharing with audiences through keynotes at data warehousing (DW) and big data conferences, and at major company onsite briefings. Having been part of the data management and business intelligence (BI) industry for 25 years, I have witnessed emerging technologies, business management paradigms, and Moore’s Law reshape our industry time and time again.

Big data and business analytics have all the promise to usher in the information age, but we are still in the infancy of our next era—and frankly, that’s what makes it so exciting!

In 2013, the marketplace for big data, BI, NoSQL, and cloud computing has seen emerging vendors, adapting incumbents, and maturing technologies as each competes for market position. Some of these battles are being resolved in 2013, while others will be resolved in later years—or potentially not at all. Either way, understanding the challenges on the landscape will assist with technology decision making, strategies, and architecture road maps today and when planning for years ahead.

Two of the more dominant shifts occurring around us this year can be called the Battle Over Persistence and the Race for Access Hill.
The Battle Over Persistence

The Battle Over Persistence didn’t just start 5 years ago with the emergence of big data or the Apache Foundation’s Hadoop; it’s been an ongoing battle for decades in the structured data world. As the pendulum swings broadly between centralized data and distributed, disparate data, the Battle Over Persistence is somewhat of a “holy war” between the data consistency inherently derived from a singular data store and the performance derived from data stores optimized for specific workloads. The consistency camp argues that with enough resources, the single data store can overcome performance challenges, while the performance camp argues that it can manage the complexity of mixed heterogeneous data stores to ensure consistency.

Decades ago, multidimensional databases, or MOLAP cubes, were optimized to persist and work with data in a way different than row-based relational database management systems (RDBMSs) did. It wasn’t just about representing data in star schemas derived from a dimensional modeling paradigm—both of which are very powerful—but about how that data should be persisted when you knew how users would access and interact with it. OLAP cubes represent the first highly interactive user experience: the ability to swiftly “slice and dice” through summarized dimensional data—a behavior that could not be delivered by relational databases, given the price-performance of computing resources at the time.

Persisting data in two different data stores for different purposes has been a part of BI architecture for decades already, and today’s debates challenge the core notion of transactional system and analytical system workloads: They could be run from the same data store in the near future.
Data Is Data

The NoSQL family of data stores was born out of the business demand to capitalize on the “orders of magnitude” of data volume and complexity inherent to instrumented data acquisition—first from internet websites and search engines tracking your every click, then from the mobile revolution tracking your every post. What’s different about NoSQL and Hadoop is the paradigm on which they’re built: “Data is data.”

Technically speaking, data is free; what does cost money and contributes to return on investment calculations is the cost to store and access data: infrastructure. So, developing a software solution that leverages the lowest-cost infrastructure, operating costs, and footprint was required to tackle the order of magnitude that big data represented—i.e., the lowest capital cost of servers, the lowest data center costs for power and cooling, and the highest density of servers to fit the most in a smaller space. With the “data is data” mantra, we don’t require an understanding of how the data needs to be structured beforehand, and we accept that the applications creating the data may be continuously changing structure or introducing new data elements. Fortunately, at the heart of data abstraction and flexibility is the key-value pair, and this simple elemental data unit enables the highest scalability.
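The "data is data" paradigm can be made concrete with a small sketch (the event names and fields below are invented for illustration): when each record is just a bag of key-value pairs, the producing application can add new fields at any time without a schema migration, and structure is imposed only when the data is read.

```python
# Each event is simply a set of key-value pairs; no schema is declared up front.
events = [
    {"ts": 1, "action": "click", "page": "/home"},
    {"ts": 2, "action": "click", "page": "/buy"},
    # The application later starts emitting a new field; nothing breaks.
    {"ts": 3, "action": "click", "page": "/buy", "device": "mobile"},
]

# Structure is imposed only at read time ("schema on read"):
# count clicks per page, tolerating keys that may be absent.
clicks_per_page = {}
for e in events:
    if e.get("action") == "click":
        page = e.get("page", "unknown")
        clicks_per_page[page] = clicks_per_page.get(page, 0) + 1

print(clicks_per_page)  # {'/home': 1, '/buy': 2}
```

Hadoop and the NoSQL stores apply the same principle at cluster scale: store the raw pairs cheaply, and let each reader decide what the data means.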
A Modern Data Platform Has Emerged

The Battle Over Persistence principle argues that there are multiple databases (or data technologies), each with its own clear strengths and each best-suited for different kinds of data and different kinds of workloads with that data. For now, the pendulum has swung back toward the distributed and federated data architecture. We can embrace the flexibility and overall manageability of big data platforms,
such as Hadoop and MongoDB. Entity-relationship modeled data in enterprise data warehouses and master data management fuses consistent, standard context into schemas and supports temporal aspects of reference data with rich attribution to fuel analytics. Even analytic-optimized databases—such as columnar, MPP (massively parallel processing), appliance, and even multidimensional databases—can be combined with in-memory databases, cloud computing, and high-performance networks. Separately, highly specialized NoSQL or analytic databases—such as graph databases, document stores, or text-based analytic engines—have their place, and workloads can be executed natively in these specialized databases.

Companies and vendors are beginning to accept that multiple database technologies need to be interwoven to deliver the much-needed Modern Data Platform (MDP), but keep in mind that the pendulum will continue to swing—it may be 5 or 10 years from now, but some things about technology hold true. Computing price-performance will continue to improve as it has with Moore’s Law, so we can converge higher numbers of CPU cores in parallel with lower-cost, more abundant memory, faster solid state storage, and higher-capacity mechanical disk drives. Tack on the rate of technology innovation and maturity that is driving big data today, and we could see the capabilities of Hadoop derivatives, MongoDB, or some emerging data technologies eclipse the highly specialized and optimized data technologies being deployed today. There are great debates about disparate database ecosystems versus the all-in-one Hadoop—it’s simply a matter of timing and vision versus the reality of today’s demanding, data-centric environments.
The Race for Access Hill

When you accept the premise of a federated data architecture based primarily on workloads rather than logical data subjects, the next question that arises is, “How do I find anything, and where do I start?” The ability to manage the semantic context of all data and its usage for monitoring and compliance, or to provide users with a single or simple point of access, is the Race for Access Hill.

When you think about “the internet,” you realize that it’s used as a singular noun, similar to how “Google” has become a verb meaning to search through the millions of servers that comprise the internet. Therefore, if the Modern Data Platform represents all the disparate data stores and information assets of the enterprise in a singular noun form, we need a point of access and navigation. Otherwise, the MDP is simply a bunch of databases.

One major concept at stake for modern data architects in the Race for Access Hill is how to centralize semantic context for consistency, collaboration, and navigation. Previously, in the organized world of data schemas, there were many database vendors and technologies that made data access heterogeneous, but it was still unified SQL data access under a single paradigm. Federated data architectures were predominantly still SQL-schema in nature and easier to unify. Today’s key-value stores, such as Hadoop, have the ability to separate the context of data, or its schema, from the data itself, which has great discovery-oriented benefits: The schema can be late-bound to the data, rather than analyzed and designed before data is loaded, as in a traditional RDBMS.
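Late binding can be shown in miniature (the file layout and column names here are invented): the raw bytes are stored first with no schema attached, and different consumers bind their own column definitions to the same data at read time, exactly the inversion of the design-then-load RDBMS sequence described above.

```python
import csv
import io

# Raw data lands first; no schema is attached at load time.
raw = "2013-11-02,WEB,137\n2013-11-02,MOBILE,52\n2013-11-03,WEB,160\n"

def read_with_schema(raw_text, columns):
    """Bind a list of column names to the raw rows at read time."""
    return [dict(zip(columns, row)) for row in csv.reader(io.StringIO(raw_text))]

# Two consumers late-bind two different "schemas" to the same bytes.
traffic = read_with_schema(raw, ["day", "channel", "visits"])
audit = read_with_schema(raw, ["event_date", "source_system", "record_count"])

print(traffic[0]["channel"])      # WEB
print(audit[0]["source_system"])  # WEB
```

In Hadoop, Hive or HCatalog plays the role of `read_with_schema` at cluster scale: a table definition is a projection over files in HDFS, not a constraint enforced at load.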
Centralizing context can be done in a Hadoop cluster’s HCatalog or Hive components, for semantic integration with other SQL-oriented databases for federation—hence joining the SQL world where possible. (This reminds me of my favorite recent Twitter quote: “Who knew the future of NoSQL was SQL?”) Data virtualization (DV) can serve as a single access point for the broad, SQL-based consumer community, therefore becoming the “glue” of the Modern Data Platform that unifies persistence across many data store workloads. The later addition of HCatalog and Hive to Hadoop also has this capability, but only for the data that can fit this paradigm; MapReduce functionality was designed to enable any analytic capability through a programming model. Other NoSQL data stores, such as graph databases, don’t inherently “speak SQL,” so in order to be comprehensive, an access layer (or point) needs to be service-oriented as well. Consumers will need a simple navigation map that allows them to access and consume information from data services, as well as virtual data tables. The long-term strategy will lean further toward a service orientation over time; however, virtualized data will still be needed for information access situations.
Competing for the Hill

The resolution of this portion of the Race for Access Hill will be gradual over the coming years; as the need arises, a technology and strategy are already in place for companies to adopt. However, this is not the case with the “hill” portion of the race: Vendors are racing to position their products to be that single point of access (the hill), with compelling arguments and case studies to support them. Aside from the SQL/services centralization of semantic context, the next question becomes, “Where should this access point live within the architecture?”

There are four different locations or layers where centralized access and context could be effectively managed—a continuum between two points, with the data at one end and the consumer or user at the other, if you will. Along this continuum are several points where you could introduce centralized access and information context. Starting from the data end, you could make the single point of access within a database—this database could have connections to other data stores and virtualization as the representation for the users. Next could be to centralize the access and information context above the database
layer, but between the BI app and consumer layers, with a data virtualization technology. Third could be to move further along the path toward the user, into the BI application layer, where BI tools have the ability to create meta catalogs and data objects in a managed order for reporting, dashboards, and other consumers. Finally, some argue that the user—or desktop—application is the place where users can freely navigate and work with data within the context they need locally, and in a much more agile fashion.
Not All Data Is Created Equal

Despite database, data virtualization, and BI tool vendors racing to be the single point of access for all the data assets in the Modern Data Platform for their own gains, there isn’t one answer for where singular access and context should live, because it’s not necessarily an architectural question but perhaps a more philosophical one—a classic “it depends.” With so many options available from vendors today, understanding how to blend and inherit context under which circumstances or workloads is key.

First, understand which data needs to be governed vigorously—not all data is created equal. When the semantic context of data needs to be governed absolutely, moving the context closer to the data itself ensures that access will inherit that context every time. For relational databases, this means the physical tables, columns, and data types that define entities and attribution within a schema. For Hadoop, instead, this would be the definition of the tables and columns, with the Hive or HCatalog abstraction layer bound to the data within the Hadoop Distributed File System (HDFS). Therefore, a data virtualization tool or BI server could integrate multiple data stores’ schemas as a single virtual access point. Counter to this approach, certain data does not have a set definition yet (discovery), or local interpretation is more valuable than enterprise consistency—here it makes more sense for the context to be managed by users or business analysts in a self-service or collaborative manner. The semantic life cycle of data can be thought of as discovery, verification, governance, and, finally, adoption by different users in different ways.
As for the “it depends” comment regarding different analytic workloads, let’s examine another hot topic of 2013: Analytic Discovery—or, specifically, the analytic discovery process. Analytic databases have been positioned as higher-performing, analytics-optimized databases sitting between the vast amounts of big data in Hadoop and the enterprise reference data, such as data warehouses and master data management hubs. The analytic database is highly optimized for performing dataset operations and statistics by combining the ease of use of SQL with the performance of MPP database technology, columnar data storage, or in-memory processing.
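Why columnar storage suits dataset operations can be seen in a toy model (the table below is invented): an aggregate such as SUM touches one field, so a column-oriented layout lets the engine scan one contiguous array instead of walking every field of every row.

```python
# Row-oriented layout: each record carries every field together.
rows = [
    {"order_id": 1, "customer": "ana", "amount": 120.0},
    {"order_id": 2, "customer": "bo",  "amount": 80.0},
    {"order_id": 3, "customer": "cy",  "amount": 200.0},
]

# Column-oriented layout: one array per field.
columns = {
    "order_id": [1, 2, 3],
    "customer": ["ana", "bo", "cy"],
    "amount":   [120.0, 80.0, 200.0],
}

# SUM(amount) over the row layout must visit every row object...
row_total = sum(r["amount"] for r in rows)

# ...but over the column layout it scans one contiguous array, which is
# what lets columnar engines skip unrelated columns and compress well.
col_total = sum(columns["amount"])

print(row_total, col_total)  # 400.0 400.0
```

Real columnar engines add compression, vectorized execution, and column pruning on top of this basic layout choice.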
Discovery is a highly iterative mental process—somewhat trial and error, plus verification. Analytic databases may not be as flexible or scalable as Hadoop, but they are faster out of the box. So, when an analytic database is used for a discovery workload, some degree of semantics and remote database connections should live within it. Whether the analytic sandbox remains a place for discovery or ends up running production analytics, accumulating more analytic jobs over time, is still unknown.
What’s Ahead

In 2013, two major shifts in the data landscape occurred. The acceptance of leveraging the strengths of various database technologies in an optimized Modern Data Platform has more or less been resolved, but the recognition of a single point of access and context is next. Likewise, the race for access will continue well into 2014—and while one solution may win out over the others with enough push and marketing from vendors, the overall debate will continue for years, with blended approaches being the reality at companies.

And, get ready: The next wave in data is now emerging, once again pushing beyond web and mobile data. The Internet of Things (IoT)—or Machine-to-Machine (M2M)—data comes from thousands of devices per person that create and share data and, in some cases, perform analytics every second. Whether it’s every device in your home, car, office, or everywhere in between that has a plug or battery generating and sharing data in a cloud somewhere—or the 10,000 data points being generated every second by each jet engine on the flight I’m on right now—there will be new forms of value created by business intelligence, energy efficiency intelligence, operational intelligence, and many other forms and families of artificial intelligence. ■
John O'Brien is principal and CEO of Radiant Advisors. With more than 25 years of experience delivering value through data warehousing and business intelligence programs, O'Brien's unique perspective comes from the combination of his roles as a practitioner, consultant, and vendor CTO in the BI industry. As a globally recognized business intelligence thought leader, O'Brien has been publishing articles and presenting at conferences in North America and Europe for the past 10 years. His knowledge in designing, building, and growing enterprise BI systems and teams brings real-world insights to each role and phase within a BI program. Today, through Radiant Advisors, O'Brien provides research, strategic advisory services, and mentoring that guide companies in meeting the demands of next-generation information management, architecture, and emerging technologies.
With so many options available, understanding how to blend and inherit context under different circumstances and workloads is key.
The State of Big Data in 2013
WHAT HAS YOUR BIG DATA DONE
FOR YOU LATELY?
TransLattice helps solve the world’s Big Data problems.
Bridge your federated systems with effortless visibility and data control to get real benefit from your data.
www.TransLattice.com
industry updates
10 BIG DATA SOURCEBOOK 2013
Data management has been a hot topic in recent years, topping even cloud computing. Here is a look at some of the trends and how they are going to impact data management professionals.
The Rise of 'Datafication'

Today, businesses depend more and more critically on their data infrastructure. Before widespread electrification, most businesses could operate well without electricity; within a couple of decades, though, dependency on electricity became so strong and so broad that almost no business could operate without it. "Datafication" is the same shift happening right now. If the underlying database systems are not available, manufacturing floors cannot operate, stock exchanges cannot trade, retail stores cannot sell, banks cannot serve customers, mobile phone users cannot place calls, stadiums cannot host sports games, and gyms cannot verify their subscribers' identities. The list keeps growing as more and more companies rely on data to run their core business.
Consolidation and Private Database Clouds

Database consolidation has lagged behind application server consolidation. The latter long ago moved to virtual platforms, while the database posed unique challenges for host-based virtualization. However, with server virtualization improvements and database software innovations such as Oracle Multitenant, database consolidation has moved to the next level and most recently reemerged as database as a service, with SLA management, resource accounting and chargeback, self-service capabilities, and elastic capacity.
Commodity Hardware and Software

Hardware performance has risen consistently for decades with Moore's Law, high-speed networking, solid-state storage, and the abundance of memory. At the same time, the cost of hardware has consistently decreased to the point where we now call it a commodity resource. Public cloud infrastructure as a service (IaaS) has dropped the last barriers to adoption.

On the software side, the open source phenomenon has made free or inexpensive database software widely available. Combined with access to affordable hardware, this allows practically any company to build its own data management systems: no barriers to datafication.
The Age of Big Data Spells the End of Enterprise IT Silos
By Alex Gorbachev
The State of Big Data Management
The Future of Database Outsourcing

Datafication, consolidation, virtualization, Moore's Law, engineered systems, cloud computing, big data, and software innovations will all result in more eggs (business applications) ending up in one basket (a single data management system). Consequently, the impact of an incident on such a system is significantly higher, affecting larger numbers of more critical business applications and functions. Consider, for example, a major U.S. retailer with $1 billion of annual revenue dependent on a single engineered system, or another single engineered system handling 2% of Japan's retail transactions.
Operating such critical data systems
becomes much more skills-intensive rather
than labor-intensive, and, as companies fol-
low the trend of moving from a zillion low
importance systems to just a few highly crit-
ical systems, outsourcing vendors will have
to adapt. The modern database outsourcing
industry is broken because it’s designed to
source an army of cheap but mediocre work-
ers. The future of database outsourcing is with
the vendors focused on enabling their clients
to build an A-team to manage the critical data
systems of today and tomorrow.
Breaking Enterprise IT Silos

The age of big data spells the end of enterprise IT silos. Big data projects are very difficult to tackle by orchestrating a number of highly specialized teams: storage administrators, system engineers, network specialists, DBAs, application developers, data analysts, and so on.
It's difficult to specialize, given the quickly changing scope of roles and the rapid evolution of the software. Getting things done in a siloed environment takes a very long time, which is misaligned with the need to be more agile and adaptable to changing requirements and timelines. A single, well-gelled big data team is able to get work done quickly and more optimally. Big data systems are essentially the new commercial supercomputers of the age of datafication, and, just as with traditional supercomputers, they require a team of professionals responsible for managing the complete system end to end.
Pre-integrated solutions and engineered
systems also break enterprise IT silos by
forcing companies to build a cross-skilled
single team responsible for that whole engi-
neered system.
The Future for Hadoop and NoSQL

Whether or not Hadoop is the best big data platform from a technology perspective, it has such broad (and growing) adoption in the industry that there is little chance of it being displaced by any other technology stack.
While, traditionally, core Hadoop has
been thought of as a combination of HDFS
and MapReduce, today, both HDFS and
MapReduce are really optional. For example,
the MapR Hadoop distribution uses MapR-FS,
and Amazon EMR uses S3. The same applies
to MapReduce—Cloudera Impala has its own
parallel execution engine, Apache Spark is a
new low-latency parallel execution frame-
work, and many more are becoming popular.
Even Apache Hive and Apache Pig are mov-
ing from pure MapReduce to Apache Tez, yet
another big data real-time distributed execu-
tion framework.
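Whichever engine ultimately executes the job, the programming model these frameworks generalize is still map and reduce. A minimal word count in plain Python (purely illustrative; no Hadoop involved) shows the map, shuffle, and reduce phases that the frameworks automate and distribute:

```python
from collections import defaultdict

def map_phase(documents):
    """Emit (word, 1) pairs, as a mapper would."""
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Group intermediate pairs by key (done by the framework in Hadoop)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the counts for each key, as a reducer would."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data big value", "data in motion"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"], counts["data"])  # 2 2
```

The newer engines (Tez, Spark, Impala) keep some variant of this dataflow but replace the rigid two-phase execution with general DAGs, which is where the latency gains come from.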
Hadoop is here to stay, and that means the Hadoop ecosystem at large. It will evolve and add new capabilities at a blazing-fast pace. Some components will die out and others will move into the mainstream. "Core Hadoop" as we know it will change.
There are many commercial off-the-shelf (COTS) applications available that use relational databases as a data platform: CRM, ERP, ecommerce, health records management, and more. Deploying COTS applications on one of the supported relational database platforms is a relatively straightforward task, and application vendors have a proven track record of deployments with clearly defined guidelines. It can be argued that the majority of relational database deployments today host a third-party application rather than an in-house developed application.

Big data projects, on the other hand, are almost entirely custom-developed solutions, not easily repeatable at another company. As Hadoop has become the standard platform of the big data industry, expect a slew of COTS applications to deploy on top of Hadoop platforms just as they are deployed on top of relational databases such as Oracle and SQL Server.
DBTA.COM 13
For example, all retail players have to solve the challenge of providing a seamless experience to clients across both physical and online channels. All city governments have the same needs for traffic planning and real-time control to minimize traffic jams while minimizing the cost of operations and ownership. Companies will be able to buy a COTS application and deploy it on their own Hadoop infrastructure, no matter which Hadoop distribution it is.
It is, however, quite possible that the new
big data COTS applications will be domi-
nated by software as a service (SaaS) offerings
or completely integrated solution appliances
(as an evolution of engineered systems) and
that means a completely different repeatable
deployment model for big data.
Unlike Hadoop, however, the world of
NoSQL is still represented by a huge variety of
incompatible platforms and it’s not obvious
who will dominate the market. Each of the
NoSQL technologies has a certain specializa-
tion and no one size fits all—unlike relational
databases.
Relational Databases Are Not Going Anywhere
While there is much speculation about
how modern data processing technologies
are displacing proven relational databases, the
reality is that most companies will be better
served with relational technologies for most
of their needs.
As the saying goes, if all you have is a
hammer, everything looks like a nail. When
database professionals drink enough of the
big data Kool-Aid, many of their challenges
look like big data problems. In reality, though,
most of their problems are self-inflicted. A
bad data model is not a big data problem.
Using 7-year-old hardware is not a big data
problem. Lack of data purging policy is not
a big data problem. Misconfigured databases,
operating systems, and storage arrays are not
big data problems.
There is one good rule of thumb to assess
whether you have a big data problem or not—
if you are not using new data sources, you likely
don’t have a big data problem. If you are con-
suming new information from the new data
sources, you might have a big data problem.
What's Ahead

There are a few areas in which we can certainly expect many innovations over the next few years.
Real-time analytics on massive data volumes is in more and more demand. While there are many in-memory database technologies, including many proprietary solutions, I believe the future is with the Hadoop ecosystem and open standards. That said, proprietary solutions such as SAP HANA or the just-announced Oracle In-Memory Database are very credible alternatives.
Graph databases will see significant uptake. There are several graph databases and libraries available, but they all have weaknesses when it comes to scalability, availability, in-memory requirements, data size, modification consistency, and plain stability. As more and more of the data we generate is based on dynamic relations between entities, graph theory becomes a very convenient way to model it. Thus, the graph database space is bound to evolve at a fast pace.
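The convenience of the graph model is that relationships are first-class, so traversal replaces joins. A minimal adjacency-list sketch in Python (illustrative only; production graph databases add persistence, indexing, and a query language on top of this idea):

```python
from collections import deque

# Entities and their dynamic relations, stored as an adjacency list.
follows = {
    "alice": ["bob", "carol"],
    "bob":   ["carol"],
    "carol": ["dave"],
    "dave":  [],
}

def reachable(graph, start):
    """Breadth-first traversal: every entity reachable from `start`."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    seen.discard(start)
    return seen

print(sorted(reachable(follows, "alice")))  # ['bob', 'carol', 'dave']
```

Expressing the same multi-hop question relationally would take a chain of self-joins or a recursive query; in the graph model it is a single traversal.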
Continuously increasing security demands are a general trend in many industries, although most modern data processing technologies have weak security capabilities out of the box. This is where established relational databases, with very strong security models and the ability to integrate easily with central security controls, have a strong edge. While it's possible to deploy a Hadoop-based solution with encryption of data in transit and at rest, strong authentication, granular access controls, and access auditing, it takes significantly more effort than deploying mature database technologies. It's especially difficult to satisfy strict security standards compliance with newer technologies, as there are no widely accepted and/or certified secure deployment blueprints.
The Future of the Database Professional

One of the challenges holding companies back from adopting new data processing technologies is the lack of skilled people to implement and maintain them. Those of us with a strong background in traditional database technologies are already in high demand, and in even higher demand when it comes to the bleeding-edge, not-yet-proven databases. If you want to be ahead of the industry, look for opportunities to invest in learning one of the new database technologies, and do not be afraid that it might be one of those technologies that disappears in a couple of years. What you learn will take you to the next level in your professional career and make it much easier to adapt to the quickly changing database landscape. ■
Alex Gorbachev, chief technology officer at Pythian, has architected and designed numerous successful database solutions to address challenging business requirements. He is a respected figure in the database world and a sought-after leader and speaker at conferences. Gorbachev is an Oracle ACE director, a Cloudera Champion of Big Data, and a member of the OakTable Network. He serves as director of communities for the Independent Oracle User Group (IOUG). In recognition of his industry leadership, business achievements, and community contributions, he received the 2013 Forty Under 40 award from the Ottawa Business Journal and the Ottawa Chamber of Commerce.
sponsored content
Big Data is being talked about everywhere: in IT and business conferences, venture capital, legal, medical, and government summits, blogs and tweets, even Fox News! The prevailing mindset is that if you don't have a Big Data project, you're going to be left behind. In turn, CIOs are feeling pressured to do something, anything, about Big Data. So while they are putting up Hadoop clusters and crunching some data, the really big (data) questions all of them should be asking are: Where is the value going to come from? What are the "real" use cases? And how can they prevent this from becoming yet another money pit, or "elephant trap," of technologies and consultants?
TRAP 1—NOT FOCUSING ON VALUE

Much of the talk about Big Data is focused on the data, not the value in it. Perhaps we should start with value: identify those business entities and processes where having infinitely more information could directly influence revenue, profitability, or customer satisfaction. Take, for example, the customer as an entity. If we had perfect knowledge of current and potential customers, past transactions and future intentions, demographics and preferences, how would we take advantage of that to drive loyalty and increase share of wallet and margins? Or, to focus on a process such as delivering healthcare services: how would Big Data improve clinical quality, lower cost, and reduce relapse rates? Enumerating the possible impact of Big Data on real business goals (or social goals for nonprofits) should be the first step of your Big Data strategy, followed by prioritizing them, which means weeding out the whimsical and focusing on the practical.
TRAP 2—SEEKING DATA PERFECTION

With value in mind, you must be willing to experiment with many different types of Big Data (structured to highly unstructured) and sources: machine and sensor data (weather sensors, machine logs, web click streams, RFID), user-generated data (social media, customer feedback), Open Government and public data (financial data, court records, yellow pages), corporate data (transactions, financials), and many more. In many cases the "broader view" might yield more value than the "deep and narrow" view, and this allows companies to experiment with data that may be less than perfect quality but more than "fit for purpose." While quality, trustworthiness, performance, and security are valid concerns, over-zealously filtering out new sources of data using old standards will fail to achieve the full value of Big Data. Data integration technologies and approaches are themselves siloed, with different technology stacks for analytics (ETL/DW), business process (BPM, ESB), and content and collaboration (ECM, search, portals). Companies need to think more broadly about data acquisition and integration capabilities if they want to acquire, normalize, and integrate multi-structured data from internal and external sources and turn the collective intelligence into relevant and timely information through a unified, common, semantic layer of data.
TRAP 3—COST, TIME AND RIGIDITY

While all the data in the world, and its potential value, can excite companies, Big Data integration and analytics would not be economically attractive to any but the largest organizations if done using traditional high-cost approaches such as ETL, data warehouses, and high-performance database appliances. From the start, Big Data projects should be designed with low cost, speed, and flexibility as the core objectives. Big Data is still nascent, meaning both business needs and data realms are likely to evolve faster than previous generations of analytics, requiring tremendous flexibility. Traditional analytics relied heavily on replicated data, but Big Data is too large for replication-based strategies and must be leveraged in place or in flight where possible. This also applies in the output
direction, where Big Data results must be easy to reuse across unanticipated new projects in the future.

[Sponsored-content figure: "Elephant Traps … How to Avoid Them With Data Virtualization." The diagram shows the Denodo Platform as a unified data layer (connect → combine → publish) providing unified data access and universal data publishing over Big Data in the enterprise (enterprise apps, log files, unstructured content, relational/parallel/columnar storage) and in the web/cloud (Hadoop, cloud storage, Data.gov, web streams, enterprise and cloud apps), serving query, chart, and map widgets to agile BI and analytics users.]
AVOIDING THE TRAPS

To prevent Big Data projects from becoming yet another money pit and suffering from the same rigidity as data warehouses, there are four areas in particular to consider: data access, data storage, data processing, and data services. The middle two areas (storage and processing) have received the most attention, as open source and distributed storage and processing technologies like Hadoop have raised hopes that big value can be squeezed out of Big Data using small budgets. But what about data access and data services?

Companies should be able to harness Big Data from disparate realms cost-effectively, conform multi-structured data, minimize replication, and provide real-time integration. Big Data and analytic result sets may need to be abstracted and delivered as reusable data services in order to allow different interaction models such as discover, search, browse, and query. These practices ensure a Big Data solution that is not only cost-effective but also flexible enough to be leveraged across the enterprise.
DATA VIRTUALIZATION

Several technologies and approaches serve Big Data needs, of which two categories are particularly important. The first has received a lot of attention and involves distributed computing across standard hardware clusters or cloud resources using open source technologies; Hadoop, Amazon S3, Google BigQuery, and the like all fall into this category. The other is data virtualization, which has been less talked about until now but is particularly important for addressing the Big Data challenges mentioned above:
Data virtualization accelerates time to value in Big Data projects: Because data
virtualization is not physical, it can rapidly
expose internal and external data assets
and allow business users and application
developers to explore and combine
information into prototype solutions that can
demonstrate value and validate projects faster.
Best of breed data virtualization solutions provide better and more efficient connectivity: Best of breed data virtualization
solutions connect diverse data realms and
sources ranging from legacy to relational to
multi-dimensional to hierarchical to semantic
to Big Data/NoSQL to semi-structured web
all the way to fully unstructured content and
indexes. These diverse sources are exposed
as normalized views so they can be easily
combined into semantic business entities and
associated across entities as linked data.
Virtualized data inherently provides lower costs and more flexibility: The output of data virtualization is a set of data services that hide the complexity of the underlying data and expose business data entities through a variety of interfaces, including RESTful linked data services, SOA web services, data widgets, and SQL views, to applications and end users. This makes Big Data reusable, discoverable, searchable, browsable, and queryable using a variety of visualization and reporting tools, and makes the data easy to leverage in real-time operational applications as well.
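The pattern behind these claims can be sketched as a thin facade: per-source adapters normalize each physical source into one semantic entity at read time, and a published view combines them without replicating anything. A deliberately simplified Python illustration (the sources and field names are hypothetical, not the Denodo API):

```python
# Two "sources" with different native shapes (stand-ins for a CRM
# database and a web/cloud feed).
crm_rows = [{"cust_name": "IBM", "city": "Seattle"}]
web_rows = [{"customer": "JPMorgan", "location": "New Jersey"}]

# Per-source adapters normalize into one semantic entity at read time;
# nothing is copied into a central store.
def from_crm():
    for r in crm_rows:
        yield {"customer": r["cust_name"], "address": r["city"]}

def from_web():
    for r in web_rows:
        yield {"customer": r["customer"], "address": r["location"]}

def unified_view():
    """The published 'customer' data service: one combined, normalized view."""
    yield from from_crm()
    yield from from_web()

customers = list(unified_view())
print([c["customer"] for c in customers])  # ['IBM', 'JPMorgan']
```

A real platform would add query pushdown, caching, and security on top, but the essential move is the same: consumers see one entity shape, while the sources keep their own.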
CONCLUSION

CIOs and Chief Data Officers alike would do well to keep the dangers of elephant traps in mind before they find themselves ensnared. The truth is that every Big Data project needs a balance between Big Data technologies for storage and processing on the one hand and data virtualization for data access and data services delivery on the other.
The use of "big data" by organizations today raises some important legal and regulatory concerns. Big data systems and cloud-based systems are expanding faster than the rules or legal infrastructure to manage them, and the risk management implications are becoming more critical to business strategy. Businesses must get ahead of the practice to protect themselves and their data.
Before a discussion of those legal and risk issues, it's important that we speak the same language, as the terms "big data" and "cloud" are overused and mean many different things. For our purposes here, big data is the continuously growing collection of datasets that derive from different sources, under individualized conditions, and which form an overall set of information to be analyzed and mined in ways that traditional database technologies and methods cannot handle. Big data analysis requires powerful computing systems that sift through massive amounts of information with large numbers of variables to produce results and reporting that can be used to determine trends and discover patterns, and ultimately to make smarter and more accurate (business) decisions.
Big data analysis is used to spot everything from business or operational trends to QA issues, new products, new diseases, and new ways of socializing. Cloud technologies are required to help manage big data analysis: big data leverages cloud capabilities such as utility computing and distributed storage, that is, massively parallel software that crunches, correlates, and presents data in new ways. Cloud infrastructure is highly scalable and allows for an on-demand, usage-based economic model that translates to low-cost yet powerful IT resources, with low capital expense and low maintenance costs.
Cloud infrastructure becomes even more important as the creation and use of data continues to grow. Every day, Google processes more than 24,000TB of data, and a few of the largest banks each process more than 75TB of internal corporate data across the globe. Those massive sets of data form the basis for big data analysis. And as big data becomes more widely used and those datasets continue to grow, so do the legal and risk issues.
Legal and risk management implications are typically sidelined in the quest for big data mining and analysis: the organization is focused, first and foremost, on trying to use the data effectively and efficiently for its own internal business purposes, and gives far less attention to ensuring that the legal and risk management implications are also covered. The potential value of using big data analysis to increase income (or lower expenses) tends to drown out the calls for risk oversight. Big data can be a Siren, whose beautiful call lures unsuspecting sailors to a rocky destruction.
Big Data Poses Legal Issues and Risks
By Alon Israely
The State of Data Security and Governance
Understanding the legal and regulatory con-
sequences will help keep your company safe
from those dangerous rocks.
Developing Protection Strategies

In order to protect the organization from legal risks when using big data, businesses must assess the issues and develop protection strategies. The areas most often discussed in relation to legal risk and big data are in the realm of consumer privacy, but legal compliance obligations, such as legal discovery and preservation, are also critical to address. Records information management, information governance, legal, and IT/IS professionals must know how to identify, gather, and manage big datasets in a defensible manner when that data and its associated systems are implicated in legal matters such as lawsuits, regulatory investigations, and commercial arbitrations. Organizations must understand the risks, obligations, and standards associated with storing and managing big data for legal purposes. As with all technology decisions, a cost/benefit analysis should be completed to quantify all risks, including soft risks such as the reputational risk of data breaches or the misuse of data.
Big data can be a sensitive topic when lawsuits or regulators come knocking, especially if the potential legal risks were not thoroughly considered early on, as companies put big data systems in place and then rely upon their associated analyses. Thus it's important to bring in the lawyers together with the technologists early, though this is not always easy to do. Big data from a legal perspective includes consumer privacy and international (cross-border) data transfer issues, but riskier still is the potential exposure of using that data in the normal course of business and maintaining the underlying raw data and analyses (e.g., trending reports). For example, one question raised is which parts of an organization's big data may be protected by a legal privilege.
Some examples of big data usage in the market that carry critical legal implications, each with its own tough questions, include:
• Determining customer trends to identify
new products and markets
• Finding combinations of proteins and
other biological components to identify
and cure diseases
• Using social-networking data (e.g.,
Twitter) to predict financial market
movements
• Consumer level support for finding better
deals, products, or info (e.g., Amazon just-
like-this, or LinkedIn people-you-may-
know functions)
• Using satellite and other geo-related
imagery and data to determine
movement of goods across shipping
lanes and to spot trends in global
manufacturing/distribution
• Corporate reputation management
by following social media and other
internet-based mentions, and comparing
those with internal customer trend data
• Use by government and others to
determine voting possibilities and
accuracy for demographic-related issues
The Legal Risks

With respect to the legal risks involved, what's good for the goose is good for the gander. That is, it's important to remember that a company's use of big data may open the door for discovery by opposing litigants, government regulators, and other legal adversaries.
Technical limitations of identifying,
storing, searching, and producing raw data
underlying big data analysis may not guard
against discovery, and being forced to pro-
duce raw data underlying the big data
analysis used by the organization to make
important (possibly, trade secret classified)
decisions can be potentially dangerous for
a company—especially as that data may
end up in the hands of competitors. Thus,
an organization should perform a legal/risk
evaluation before any analysis using big data
is formulated, used, or published.
A major risk faced by organizations utilizing big data analysis is a legal request by opposing parties and regulators (e.g., for discovery or legal investigation purposes) for big datasets or their underlying raw data. It can be very difficult to maintain a scope limited only to the legal issues at hand. This means the organization can end up turning over far more data than is either necessary or appropriate, due to technical limitations in segmenting or identifying the relevant data subsets. Challenges associated with such issues are still new, so there are no known industry best practices, and no legal authority yet exists. Though this is not good news for organizations currently using big data analysis that may also be implicated in lawsuits or other legal matters, there are ways to mitigate exposure and protect the organization as best as possible, even now while this remains very much unknown territory from a legal compliance perspective.
The State of Data Security and Governance
18 BIG DATA SOURCEBOOK 2013
Information security risks are also important factors to consider within the larger legal and risk context. If they are not mitigated early on, they alone can open the door for broader discovery related to big datasets and systems. Information security in a broad sense can include:
• Data Integrity and Privacy
• Encryption
• Access Control
• Chain-of-Custody
• Relevant Laws/Regulations
• Corporate Policies
Specific examples of situations where information security policies should be monitored include:
• Vendor Agreements
• Data Ownership & Custody
Requirements
• International Regulations
• Confidentiality Terms
• Data Retention/Archiving
• Geographical Issues
Entering into contracts with third-party big data providers is an area that warrants special attention and where legal or risk problems may arise. Strict controls related to third parties are important. More and more big data systems and technologies are supplied by third parties, so the organization must have restrictions and protections in place to ensure side-door and backdoor discovery doesn't occur.
When dealing with third-party control,
avoiding common pitfalls leads to better data
risk and cost control. Common problems that
arise include:
• Inadvertent data spoliation, which
can include stripping metadata and
truncating communication threads
• Custody and control of the data,
including access rights and issues with
data removal
• Problems with relevant policies/
procedures, which can include a lack
of planning and a lack of enforcement
of rules
• International rules and regulations,
including cross-border issues
Big data sources are no different from traditional data sources: big data and its sources should be protected like any other critical corporate document, dataset, or record.
Mitigating Risk
To best mitigate risk from both internal and third-party users, certain procedures related to data access and handling should be implemented via IT controls:
• Auditing and validation of logins and access
• Logging of actions
• Monitoring
• Chain-of-custody
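The controls listed above can be made concrete with a small sketch. The hash-chained audit log below is a hypothetical illustration (the class name and fields are invented, not drawn from the article) of one way auditing, logging, and chain-of-custody might be wired together; it is a sketch, not a prescribed implementation.

```python
import hashlib
import json
from datetime import datetime, timezone

class AuditLog:
    """Minimal hash-chained audit trail for data-access events.

    Each entry embeds the hash of the previous entry, so any later
    tampering with the log breaks the chain -- a simple form of
    chain-of-custody evidence.
    """

    def __init__(self):
        self.entries = []
        self._last_hash = "0" * 64  # genesis value

    def record(self, user, action, dataset):
        entry = {
            "user": user,
            "action": action,          # e.g. "login", "query", "export"
            "dataset": dataset,
            "time": datetime.now(timezone.utc).isoformat(),
            "prev": self._last_hash,
        }
        # Hash the entry body (before the hash field exists) deterministically.
        payload = json.dumps(entry, sort_keys=True).encode()
        entry["hash"] = hashlib.sha256(payload).hexdigest()
        self._last_hash = entry["hash"]
        self.entries.append(entry)
        return entry["hash"]

    def verify(self):
        """Recompute every hash; return False if any entry was altered."""
        prev = "0" * 64
        for e in self.entries:
            body = {k: v for k, v in e.items() if k != "hash"}
            if body["prev"] != prev:
                return False
            payload = json.dumps(body, sort_keys=True).encode()
            if hashlib.sha256(payload).hexdigest() != e["hash"]:
                return False
            prev = e["hash"]
        return True
```

Because each record commits to the one before it, a court-ready export of the log can be checked end to end with a single `verify()` call.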
Executive oversight, however, is also an extremely important method of managing data risk. Organizational commitment to appropriate control procedures, evidenced through executive support, is a key factor in creating, deploying, and maintaining a successful information risk management program. Employees who can see the value of the procedures in the actions and attitudes of management better appreciate the importance of those procedures themselves.
All in all, a practical, holistic approach is
best for risk mitigation. Here are some tips for
managing legal information/data risk:
• Use a team approach: Include
representatives from legal, IT, risk, and
executives to cover all bases.
• Use written SOPs and protocols: Standardized ways of operating, responding, and managing processes, along with adherence to written protocols, are key to consistency. Consistency helps defend the process in legal proceedings if needed.
• Leverage native functionality when
responding to legal requests: Reporting
that is sufficient for the business should
be appropriate for the courts. Also be sure
to establish a strong separation of the
presentation layer from the underlying
data for implicated system identification
purposes.
Multi-departmental involvement is also very important to creating and maintaining a successful risk mitigation environment and plan. It is easy to lose track of weak spots in data handling when only one group is trying to guess the activities of all the others in an organization. Executives, IT, legal, and risk all have experiences to share that could reveal weaknesses in the systems. Review by a team helps cover all the bases.
Implementation across departments also reinforces the importance of the risk procedures to the organization. Organizations that create risk programs but choose not to implement them, or that implement them inconsistently, face their own challenges when dealing with the courts in enforcing data and document requests, even those with a broad scope.
What's Ahead
This is a new field for legal professionals and the courts. Big data is here to stay and will become increasingly ubiquitous and a necessary part of running an efficient and successful business. Because of that, those systems and data (including derived analysis and underlying raw information) will be implicated in legal matters and will thus be subject to legal rules of preservation, discovery, and evidence. Those legal requirements are typically burdensome and expensive when processes are not in place and people are not trained. Most big data systems and applications are not designed for the operations required by legal rules of preservation and discovery: maintaining evidentiary integrity, chain-of-custody, data origination and use, metadata, and historical access control.
This new technical domain will quickly become critical to the legal fact-finding process. Thus, organizations must begin to think about how their data is used and maintained during the normal course of business, and how that may affect their legal obligations if big data or related systems are implicated, which may well be the case in every legal situation an organization faces. ■
Alon Israely, Esq., CISSP, is a co-founder of Business Intelligence Associates. As a licensed attorney and IT professional with the CISSP credential, he brings a unique perspective to articles and lectures, which has made him one of the most sought-after speakers and contributors in his field. Israely has worked with corporations and law firms for more than a decade to address the management, identification, gathering, and handling of data involved in e-discovery.
DBTA.COM 19
sponsored content
OVERVIEW
The LexisNexis® Global Content Systems Group provides content to a wide array of market-facing delivery systems, including Lexis® for Microsoft® Office and LEXIS.COM®. These services deliver content to more than a million end users. The LexisNexis content collection consists of more than 2.3 billion documents of various sizes, totaling more than 20 terabytes of data. New documents are added to the collection every day.
The raw text documents are prospectively enhanced by recognizing and resolving embedded citations, performing multiple topical classifications, recognizing entities, and creating statistical summaries and other data mining outputs.
The older documents in the collection
require periodic retrospective processing to
apply new or modified topical classification
rules, and to account for changes on the basis
of the other data enhancements. Without
the periodic retrospective processing, the
collection of documents would become
increasingly inconsistent. The inconsistent
application of the above enhancements
materially reduces the effectiveness of the
data enhancements.
THE CHALLENGE
The LexisNexis content management system had evolved over a 40-year period into a complex, heterogeneous, distributed environment of proprietary and commodity servers. The systems acting as repository nodes were separated from the systems that performed the data enhancements.
The separation of the repository nodes
from the processing systems required that
copies of the documents be transmitted
from the repository systems to the data
enhancement system, and then transmitted
back to the repository after the enhancement
process completed. The transmission of the
documents created additional processing
latencies, and the elapsed time to perform
a retrospective topical classification or
indexing became several months.
The delay to apply a new classification
to the collection retrospectively created a
situation where older documents might not
be found by a researcher via the topical index
when the index topic was new or recently
modified. The lack of certainty about the
coverage of the indexing required the
researcher to conduct additional searches,
especially when the classification covered
a new or emerging topic.
THE SOLUTION
The LexisNexis Global Content Systems Group consolidated the content management, document enhancement, and mining systems onto HPCC Systems® to solve multiple data challenges, including content enrichment, since enrichment must be applied across all the content simultaneously to provide superior search results.
HPCC Systems from LexisNexis is an
open-source, enterprise-ready solution
designed to help detect patterns and hidden
relationships in Big Data across disparate data
sets. Proven for more than 10 years, HPCC
Systems helped LexisNexis Risk Solutions
scale to a $1.4 billion information company
now managing several petabytes of data on
a daily basis from 10,000 different sources.
HPCC Systems is proven in entity
recognition/resolution, clustering and content
analytics. The massively parallel nature of the
HPCC platform provides both the processing
and storage resources required to fulfill the
dual missions of content storage and content
enrichment.
HPCC Systems was easily integrated with
the existing Content Management workflow
engine to provide document level locking and
other editorial constraints.
The migration of the content repository
and data enhancement processing to the
HPCC platform involved creating several
HPCC “worker” clusters of varying sizes
to perform data enrichments and a single
HPCC Data Management cluster to house
the content. This configuration provides
the ability to send document workloads of
varying sizes to appropriately sized worker
clusters while reserving a substantially sized
Data Management cluster for content storage
and update promotions. Interactive access is
also provided to support search and browse
operations.
THE RESULTS
The new system achieves the goal of a tightly integrated content management and enrichment system that takes full advantage of HPCC Systems' supercomputing capabilities for both computation and high-speed data access.
The elapsed time to perform an enrichment pass over the entire data collection dropped from six to eight weeks to less than a day. This change is so significant that LexisNexis has already extended enrichment into capabilities that were previously out of reach.
ABOUT HPCC SYSTEMS
HPCC Systems was built for small development teams and offers a single architecture and one programming language for efficient processing of large or complex data queries. Customers such as financial institutions, insurance companies, law enforcement agencies, the federal government, and other enterprise organizations leverage HPCC Systems technology through LexisNexis products and services. For more information, visit www.hpccsystems.com.
LEXISNEXIS www.hpccsystems.com
With HPCC Systems®, LexisNexis® Data Enrichment is Achieved in Less Than One Day
LexisNexis and the Knowledge Burst Logo are registered trademarks of Reed Elsevier Properties Inc., used under license. HPCC Systems is a registered trademark of LexisNexis Risk Data Management Inc. Copyright © 2012 LexisNexis. All rights reserved.
In the beginning, the "data warehouse" was a concept that was not accepted by the database fraternity. From that humble beginning, the data warehouse has become conventional wisdom and a standard part of the infrastructure in most organizations. The data warehouse has become the foundation of corporate data. When an organization wants to look at data from a corporate perspective, not an application perspective, the data warehouse is the tool of choice.
Data Warehousing and Business Intelligence
A data warehouse is the enabling foundation of business intelligence; the two are linked as closely as fish and water.
Spending on data warehousing and business intelligence long ago surpassed spending on transaction-based operational systems. Once, operational systems dominated the IT budget. Now, data warehousing and business intelligence dominate.
Through the years, data warehouses have grown in size and sophistication. Once, data warehouse capacity was measured in gigabytes; today, many data warehouses are measured in terabytes. Once, single processors were sufficient to manage data warehouses; today, parallel processors are the norm. Today, also, most corporations understand the strategic significance of a data warehouse and appreciate that being able to look at data uniformly across the corporation is an essential aspect of doing business.
But in many ways, the data warehouse is like a river: constantly moving, never standing still. The architecture of data warehouses has evolved with time. First, there was just the warehouse. Then came the corporate information factory (CIF). Then there was DW 2.0. Now there is big data.
Enter Big Data
Continuing the architectural evolution is the newest technology: big data. Big data technology arrived on the scene as an answer to the need to service very large amounts of data. There are several definitions of big data; the one discussed here is the definition typically used in Silicon Valley. Big data technology:
• Is capable of handling lots and lots of data
Unlocking the Potential of Big Data in a Data Warehouse Environment
By W. H. Inmon
The State of Data Warehousing
• Is capable of operating on inexpensive storage
Big data:
• Is managed by the "Roman census" method
• Resides in an unstructured format
Organizations are finding that big data extends their capabilities beyond their current horizon. With big data technology, organizations can search and analyze data well beyond what would ever have fit in their current environment, or in the standard DBMS environment. As such, big data technology extends the reach of data warehousing as well.
Some Fundamental Challenges
But with big data come some fundamental challenges. The biggest is that big data cannot be analyzed using standard analytical software, which assumes that data is organized into standard fields, columns, rows, keys, indexes, and so on. This classical DBMS structuring of data provides context, and analytical software greatly depends on that context. Stated differently, if standard analytical software does not find the context it assumes is there, it simply does not work. Therefore, without context, unstructured data cannot be analyzed by standard analytical software. If big data is to fulfill its destiny, there must be a means to analyze big data once it is captured.
Determining Context
There have been several earlier attempts to analyze unstructured data, each with its own major weakness. They include:
1. NLP (natural language processing). NLP is intuitive, but its flaw is the assumption that context can be determined by examining the text itself. The problem with this assumption is that most context is nonverbal and never finds its way into any form of text.
2. Data scientists. The problem with throwing a data scientist at the problem of analyzing unstructured data is that the world has only a finite supply of them. Even if the universities of the world started turning out droves of data scientists, the demand everywhere there is big data would far outstrip the supply.
3. MapReduce. The leading big data technology, Hadoop, includes a framework called MapReduce. With MapReduce, you can create and manage unstructured data to the nth degree. But the problem is that MapReduce requires very technical coding; in many ways, it is like coding in Assembler. Thousands and thousands of lines of custom code are required. Furthermore, as business functionality changes, those thousands of lines of code must be maintained, and no organization likes to be stuck with ongoing maintenance of thousands of lines of detailed, technical custom code.
4. MapReduce on steroids. Organizations have recognized that creating thousands of lines of custom code is no real solution. Instead, technology has been developed that accomplishes the same thing as MapReduce but with much more efficient code. Even here, though, there are basic problems: the approach is still written for the technician, not the business person, and the raw data found in big data is still missing context.
5. Search engines. Search engines have been around for a long time and can operate on unstructured as well as structured data. The problem is that search engines still need data to have context in order to produce sophisticated results. While they can produce some limited results on unstructured data, sophisticated queries are out of their reach. The missing ingredient is, again, the context of data, which is not present in unstructured data.
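The coding burden attributed to MapReduce above is visible even in a toy job. The sketch below uses plain Python as a stand-in for Hadoop's Java API (the function names are invented) to show the map/shuffle/reduce plumbing a developer must write and then maintain by hand:

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every token in one document."""
    for word in document.lower().split():
        yield (word, 1)

def shuffle(mapped_pairs):
    """Shuffle: group all emitted values by key, as Hadoop does between phases."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

def word_count(documents):
    """Run the three phases over a collection of raw text documents."""
    mapped = (pair for doc in documents for pair in map_phase(doc))
    return reduce_phase(shuffle(mapped))
```

Even this trivial word count needs three hand-written phases; a production job on a real cluster multiplies that with drivers, serialization, partitioning, and configuration, which is exactly the maintenance burden the article describes.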
So the data warehouse has arrived at the
point where it is possible to include big data
in the realm of data warehousing. But in order
to include big data, it is necessary to overcome
a very basic problem—the data found in big
data is void of context, and without context,
it is very difficult to do meaningful analysis
on the data.
While it is possible that data warehousing
will be extended to include big data, unless
the basic problem of achieving or creating
context in an unstructured environment is
solved, there will always be a gap between big
data and the potential value of big data.
Deriving context then is the forthcoming
major issue of data warehouse and big data for
the future. Without being able to derive context
for unstructured data, there are limited uses for big data. So exactly how can the context of text be derived, especially when it cannot be derived from the text itself?
Deriving Context
In fact, there are two ways to derive context for unstructured data: "general context" and "specific context." General context can be derived merely by declaring a document to be of a particular variety. A document may be about fishing, legislation, healthcare, and so forth. Once the general context of the document is declared, the text can be interpreted in accordance with that general category.
As a simple example, suppose the raw text contained this sentence: "President Ford drove a Ford." If the general context were motor cars, then "Ford" would be interpreted as an automobile. If the general context were the history of U.S. presidents, then "Ford" would be interpreted as a reference to a former president.
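The "President Ford" example can be expressed as a simple lookup keyed on the declared general context. Everything here (the dictionary and the function) is purely illustrative, not part of any actual disambiguation product:

```python
# Hypothetical general-context dictionary: the same token resolves
# differently depending on what the document is declared to be about.
GENERAL_CONTEXT = {
    "automobiles": {"ford": "Ford Motor Company vehicle"},
    "us_presidents": {"ford": "Gerald Ford, 38th U.S. president"},
}

def interpret(token, document_context):
    """Resolve a token using the document's declared general context.

    Tokens with no context-specific meaning pass through unchanged.
    """
    meanings = GENERAL_CONTEXT.get(document_context, {})
    return meanings.get(token.lower(), token)
```

The declaration happens once per document, after which every ambiguous term in it inherits that interpretation, which is exactly what makes general context cheap to apply.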
Textual Disambiguation
The other type of context is specific context, which can be derived in many different ways: from the structure of a word, the text surrounding a word, the placement of words in proximity to each other, and so forth. There is new technology called "textual disambiguation" that allows the context of raw unstructured text to be specifically determined. In addition, textual disambiguation places the output of its processing in a standard database format so that classical analytical tools can be used.
At the end of textual disambiguation,
analytical processing can be done on the
raw unstructured text that has now been
disambiguated.
The Value of Determining Context
Determining the context of unstructured data opens the door to many types of processing that previously were impossible. For example, corporations can now:
Read, understand, and analyze their corporate contracts en masse. Prior to textual disambiguation, it was not possible to look at contracts and other documents collectively.
Analyze medical records. For all the work done in the creation of EMRs (electronic medical records), there is still much narrative in a medical record. The ability to understand narrative and restructure it into a form and format that can be analyzed automatically is a powerful improvement over the techniques used today.
Analyze emails. Today, after an email is read, it is placed on a huge trash heap and never seen again. There is, however, much valuable information in most corporations' emails. By using textual disambiguation, an organization can start to determine what important information is passing through its hands.
Analyze and capture call center data. Today, most corporations look at and analyze only a sampling of their call center conversations. With big data and textual disambiguation, corporations can now capture and analyze all of them.
Analyze warranty claims data. While a warranty claim is certainly important to the customer who made it, warranty analysis is equally important to the manufacturer in understanding which manufacturing processes need to be improved. By automatically capturing and analyzing warranty data and putting the results in a database, the manufacturer can benefit mightily.
And the list goes on. This short list is merely the tip of the iceberg when it comes to the advantages of being able to capture and analyze unstructured data. Note that with standard structured processing, none of these opportunities ever came to fruition.
Some Architectural Considerations
One architectural consideration of managing big data through textual disambiguation is that raw data on a big data platform cannot be analyzed in a sophisticated manner. To set the stage for sophisticated analysis, the designer must take the unstructured text from big data, pass it through textual disambiguation, and then return it to big data. When the raw text passes through textual disambiguation, it is transformed into disambiguated text; it passes back into big data with its context now determined.
Once the context of the unstructured text
has been determined, it can then be used for
sophisticated analytical processing.
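The round trip described above (raw text out of big data, through disambiguation, back in as context-bearing records) can be sketched as a toy pipeline. The extraction rules below are invented stand-ins, not the actual textual disambiguation technology; they only illustrate how narrative text might be reduced to flat, database-ready records stored alongside the raw source:

```python
import re

def disambiguate(raw_text, general_context):
    """Toy 'disambiguation' pass: turn raw narrative into flat records
    (one dict per recognized term) that could be loaded into a
    standard relational table for classical analytical tools."""
    records = []
    # Recognize dollar amounts such as $1,200.
    for match in re.finditer(r"\$\d+(?:,\d{3})*", raw_text):
        records.append({"term": match.group(), "type": "money",
                        "context": general_context})
    # Recognize four-digit years such as 2013.
    for match in re.finditer(r"\b(?:19|20)\d{2}\b", raw_text):
        records.append({"term": match.group(), "type": "year",
                        "context": general_context})
    return records

def enrich_store(big_data_store, general_context):
    """Pass every raw document through disambiguation and write the
    structured output back alongside the raw text, as described in
    the architecture above."""
    for doc in big_data_store:
        doc["disambiguated"] = disambiguate(doc["raw"], general_context)
    return big_data_store
```

Note that the store grows: each document now carries both its raw text and its disambiguated records, which is the duplication the next section argues is an acceptable cost.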
What's Ahead
The argument can be made that disambiguating the raw text and then rewriting it to big data in a disambiguated state increases the amount of data in the environment. That observation is absolutely true. However, given that big data storage is cheap and that the big data infrastructure is designed to handle large volumes, some duplication of data after raw text passes through disambiguation should be of little concern. Only after big data has been disambiguated is the big data store fit to be called a data warehouse. Once disambiguated, it makes a really valuable and innovative addition to the analytical data warehouse environment.
Big data has much potential, but unlocking that potential is going to be a real challenge. Textual disambiguation promises to be as profound as data warehousing once was. Textual disambiguation is still in its infancy, but then again, everything was once in its infancy. The early seeds sown in textual disambiguation are already bearing some most interesting fruit. ■
W. H. Inmon, the "father of the data warehouse," has written 52 books published in nine languages. Inmon speaks at conferences regularly. His latest adventure is the building of Textual ETL (textual disambiguation), technology that reads raw text and allows it to be analyzed. Textual disambiguation is used to create business value from big data. Inmon was named by Computerworld as one of the 10 most influential people in the history of the computer profession, and lives in Castle Rock, Colo.
sponsored content
The adage “Every company is a data
company” is more true today than ever.
The problem is that most companies don't realize how much valuable data they're actually sitting on, or how to access and use it. Companies
must exploit whatever data enters their
enterprise in every format and from every
source to gain a comprehensive view of
their business.
Most IT professionals focus all
their resources on figuring out how to
effectively access structured data sources.
Projects associated with data warehousing
and business intelligence get all the
attention. And in some cases they yield
valuable insights into the business. But
the fact is that structured data sources
are just the tip of the iceberg inside most
companies. There is so much intelligence
that goes unseen and unanalyzed simply
because they don’t know how to get at it.
For that reason, forward-looking CIOs
and IT organizations have begun exploring
new strategies for tapping into other
non-traditional sources of information
to get a more complete picture of their
business. These strategies attempt to gather
and analyze highly unstructured data like
websites, tweets and blogs to discover trends
that might impact the business.
While this is a step in the right direction,
it misses the bigger picture of the Big Data
landscape. The “blind spots” in these data
strategies are both the unstructured and
semi-structured data that is contained in
content like reports, EDI streams, machine
data, PDF files, print spools, ticker feeds,
message buses, and many other sources.
UNDERSTANDING THE CONTENT BLIND SPOT
A growing number of IT organizations
now see value in information contained
within these content blind spots. The key
reason: It enhances their business leaders’
ability to make smarter decisions because
much of this data provides a link to past
decisions.
Companies also realize that these non-
traditional data sources are growing at an
exponential rate. They have become the
language of business for industries like
healthcare, financial services and retail. So
where do you find these untapped sources
of information? Easy; they’re everywhere.
As companies have rolled out ERP, CRM
and other enterprise systems (including
enterprise content management tools), they
have also created thousands of standard
reports. Companies are also stockpiling
volumes of commerce data with EDI
exchanges. Excel spreadsheets are ubiquitous
as well. And as PDF files of invoices and
bills-of-lading are exchanged, vital data is
being saved. All these sources possess semi-
structured data that can reveal valuable
business insight.
But how do you get to these sources,
and what do you do with them?
OPTIMIZING INFORMATION THROUGH VISUAL DATA DISCOVERY
Next-generation analytics enable
businesses to analyze any data variety,
regardless of structure, at real-time velocity
for fast decision making in a visual data
discovery environment. These analytic tools
link diverse data types with traditional
decision-making tools like spreadsheets and
business intelligence (BI) systems to offer
a richer decision making capability than
previously possible.
By tapping into semi-structured and
unstructured content from varied sources
throughout an organization, next-gen
analytics solutions are able to map these
sources to models so that they can be
combined, restructured and analyzed.
While it sounds simple, the technology
actually requires significant intelligence
regarding the structural components of the
content types to be ingested and the ability
to break these down into “atomic level” items
that can be combined and mapped together
in different ways.
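Breaking semi-structured content into "atomic level" items might look like the sketch below, which parses a hypothetical fixed-layout invoice report (the kind found in print spools) into rows that could be combined with other data. The report layout and field names are invented for illustration:

```python
REPORT = """\
INVOICE 10023  ACME CORP      2013-09-04     1,250.00
INVOICE 10024  GLOBEX INC     2013-09-05       980.50
"""

def parse_report(text):
    """Split each fixed-layout report line into atomic fields.

    Non-data lines (headers, blanks) are skipped; each invoice line
    becomes a dict ready to be mapped, combined, or loaded into BI tools.
    """
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0] != "INVOICE":
            continue
        rows.append({
            "invoice_no": int(parts[1]),
            "customer": " ".join(parts[2:-2]),  # name may contain spaces
            "date": parts[-2],
            "amount": float(parts[-1].replace(",", "")),
        })
    return rows
```

Once the report is atomized this way, its rows can be joined against spreadsheets or BI extracts, which is the restructuring step the paragraph above describes.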
For organizations to fully exploit the
power of their information, they have to
uncover the content blind spots in their
enterprise that hold so much underutilized
value. Leveraging structured, unstructured
and semi-structured content in a visual
discovery environment can deliver enormous
improvements in decision making and
operational effectiveness.
DATAWATCH www.datawatch.com
Filling the Content Blind Spot
Join DBTA via Facebook, Twitter, Google+, and LinkedIn to connect with industry peers, receive the latest-breaking news, gain insights, get conference discounts,
download white papers, hear about webinars, and much more.
GROWyour connections
sponsored content
Businesses all over the world are
beginning to realize the promise of Big
Data. After all, being able to extract data
from various sources across the enterprise,
including operational data, customer
data, and machine/sensor data, and then
transform it all into key business insights can
provide significant competitive advantage. In
fact, having up-to-date, accurate information
for analytics can make the difference
between success and failure for companies.
However, it’s not easy. A recent study by
Wikibon noted that returns thus far on Big
Data investments are only 50 cents to the
dollar. A number of challenges stand in the way of maximizing returns. The data transfer bottleneck is just one real and pervasive issue causing many headaches in IT today.
REASONS FOR THE BIG DATA TRANSFER BOTTLENECK
Outdated technology. Moving data is
hard. Moving Big Data is harder. When
companies rely on heritage platforms
engineered to support structured data
exclusively, such as ETL, they quickly find
out that the technology simply cannot scale
to handle the volume, velocity or variety of
data, and therefore, cannot meet the real-
time information needs of the business.
Lagging system performance. Even if
source and target systems are in the same
physical location, data latency can still be
a problem. Data often resides in systems
that are used daily for operational and
transaction processing. Using complex
queries to extract data and launching
bulk data loads mean extra work for CPU
and disk resources, resulting in delayed
processing for all users.
Complex setup and implementation. Sometimes companies manage to deliver
data using complex, proprietary scripts and
programs that take months of IT time and
effort to develop and implement. With SLAs
to meet and business opportunities at risk of
being lost, most companies simply don’t have
the luxury of wading through this difficult
and time-consuming process.
Delays caused by writing data to disk. When information is extracted from systems,
it is often sent to a staging area and then
relayed to the target to be loaded. Storage to
disk causes delays as data is written and then
read in preparation for loading.
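The disk round trip described above can be sketched in a few lines. This is an illustrative contrast, not any vendor's implementation; the staging file and record format are hypothetical, and real replication tools stream change records rather than text rows:

```python
import tempfile, os

def staged_transfer(rows, load):
    """Extract to a staging file first, then read it back and load:
    loading cannot begin until the entire extract has been written."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "w") as f:
            for row in rows:
                f.write(row + "\n")          # write delay
        with open(path) as f:
            for line in f:
                load(line.strip())           # then read delay
    finally:
        os.remove(path)

def streamed_transfer(rows, load):
    """Pipe each record straight to the target as it is extracted,
    skipping the disk round trip entirely."""
    for row in rows:
        load(row)

loaded = []
streamed_transfer(["r1", "r2"], loaded.append)
print(loaded)  # ['r1', 'r2']
```

The staged version touches the disk twice per record; the streamed version never does, which is why staging latency grows with extract size.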
Proliferation of sources and targets. With data that can reside in a variety of
transactional databases such as Oracle, SQL
Server, IBM Mainframe, and with newer
data warehouse targets such as Vertica,
Pivotal, Teradata UDA and Microsoft PDW
on the rise, setup time can increase and
performance can be lost using solutions that
are not optimized to each platform.
Limited Internet bandwidth. If source
and target systems are in different physical
locations, or if the target is in the cloud,
insufficient Internet bandwidth can be a
major cause of data replication lag. Most
networks are configured to handle general
operations but are not built for massive data
migrations.
HARSH REALITY
When timely information isn’t available,
key decisions need to be deferred. This
can lead to lost revenues, decreased
competitiveness, or lower levels of customer
satisfaction. Additionally, the reliability of
decisions made without real-time data may
also be called into question.
THE ANSWER
There is a solution to overcoming this
challenge. Attunity beats the Big Data
bottleneck by providing high-performance
data replication and loading for the broadest
range of databases and data warehouses
in the industry. Its easy, Click-2-Replicate
design and unique TurboStream DX data
transfer and CDC technologies give it the
power to stand up to the largest bottlenecks
and win. Partner with Attunity. You too can
beat the data transfer bottleneck!
Learn more! Download this eBook by data management expert David Loshin: Big Data Analytics Strategies—Beating the Data Transfer Bottleneck for Competitive Gain: http://bit.ly/ATTUeBook
ATTUNITY For more information, visit www.Attunity.com or call (800) 288-8648 (toll free) +1 (781) 730-4070.
Overcoming the Big Data Transfer Bottleneck
26 BIG DATA SOURCEBOOK 2013
Cloud technologies and frameworks have
matured in recent years and enterprises
are starting to realize the benefits of
cloud adoption—including savings in infra-
structure costs, and a pay-as-you-go service
model similar to Amazon Web Services. Here
is a look at the cloud market and its conver-
gence with the big data market, including key
technologies and services, challenges, and
opportunities.
Evolution of the ‘Cloud’ Adoption
The technology, platform, and services
that were available in the early 1990s were
similar to the “cloud” adoption of the last
decade. We had distributed systems with Sun
RISC-based server workstations, IBM main-
frames, millions of Intel-based Windows
desktops, Oracle Database Servers (includ-
ing Grid Computing–10g), and J2EE N-tier
architecture. There were application service
providers (ASPs), managed service providers
(MSPs), and internet service providers (ISPs)
offering services similar to cloud offerings
today. What has changed?
There were significant events that trig-
gered the emergence of cloud offerings and
their adoption. The first one was the Amazon
Web Services (S3, EC2, RDS, SQS) scale out
and the development of IaaS (infrastructure
as a service) once Amazon was able to real-
ize the benefits of its offering for its own
internal use. The second major event was
the search engine, advertising platform, and
Google Big Table (memcache) and realiza-
tion that millions of nodes with commodity
hardware (cheap) can be leveraged to harness
MapReduce and other frameworks to distrib-
ute the search query and provide results at a
millisecond response time (unheard of with
even mainframes).
In the mid-2000s, traditional telecom
and mobile phone service providers saw
that they needed to move
to scale-out platforms (the cloud) to manage
their mobile customer base, which grew from
a few million to a billion (factor of 1000).
The mobile data grew from a few terabytes
to petabytes and they needed newer scale-out
platforms and wanted on-premise as well as
hybrid cloud deployments.
The creators of Hadoop ran TeraSort
benchmarks with large clusters of nodes in
order to determine the benefits of MapReduce
frameworks. It resulted in the emergence of
the Hadoop Cluster Distribution; NoSQL data
stores such as columnar, document, and graph
databases; and massively parallel processing
(MPP) analytical databases. An ecosystem of
vendors emerged to reap the benefits of the
scale-out cloud infrastructure, MapReduce
frameworks, and Hadoop and NoSQL data
stores. The applications included data migra-
tion, predictive analytics, fraud detection, and
data aggregation from multiple data sources.
The new paradigm shift addressed the
key issue of scale as well as the handling of
unstructured data that was lacking in tradi-
tional relational databases. The paradigm
Cloud Technologies Are Maturing to Address Emerging Challenges and Opportunities
By Chandramouli Venkatesan
The State of Cloud Technologies
shift occurred as a result of the availability of
commodity hardware and a framework to run
massive parallel data processing across clusters
of nodes including distributed file system, high
performance analytical databases, and NoSQL
data stores for handling unstructured data.
Hybrid Cloud
For enterprises that are adopting the hybrid
(public/private/community) cloud pay-as-you-
go model for IaaS, PaaS, and SaaS cloud
deployments, the key drivers are cost, flexibil-
ity, and speed (time to set up hardware, soft-
ware, and services). The primary use cases for
the new hybrid model include the ability to do
data migration, fraud detection, and the ability
to manage unstructured data in real time.
But the move to hybrid cloud deploy-
ment comes with new challenges and risks.
The biggest challenge for cloud deployments
today is in the area of data security and iden-
tity. There are several cloud providers who
offer IaaS, PaaS, SaaS, network as a service,
and “everything as a service” and probably
offer good firewalls to protect data within
the boundaries of their data center. The chal-
lenges include data at rest, data in flight used
in mobile devices accessing the cloud pro-
vider, and data derived from multiple cloud
providers and provision of a single-view to
the mobile customer.
BYOD
Ubiquitous mobile computing is driving
the new cloud adoption model faster than
anticipated and a key driver is BYOD (bring
your own device). The traditional IT shop
had control of its assets whether on-premise
or on cloud. However, the demands of BYOD
and the myriad mobile devices, applications,
and mobile stores have resulted in the IT
organization losing control of users’ identity,
as one can have more than one profile. The
use of the biometric information such as fin-
gerprint and eye scans is still in its infancy
for the mobile users. There are some efforts
in standardization in cloud identity manage-
ment such as OpenID Connect, OAuth, and
SIEM, but the adoption is slow, and it will
take time to work seamlessly across many
cloud providers.
Trust the ‘Cloud’ Providers
The key security issue for cloud and mobility
deployment is establishment of “trust” and
“trust boundaries.” There are several players
in the cloud and mobile deployments offer-
ing different services, and they need to work
seamlessly end-to-end. Trustworthiness
is enabled by the ability to automatically sign-
off or hand-off to another cloud/mobile ser-
vice provider in the “trust boundary” and still
maintain the data integrity at each hand-off.
The automatic sign-off would need to verify
the validity of the cloud provider, protect
the identity of the users, as well as guarantee
the nontampering of content. The interme-
diate trust verification providers would also
be a cloud provider similar to verification of
ecommerce internet sites. The trust verifica-
tion provider must support the SLAs for secu-
rity, identity, and trust between mobile and
cloud service provider. The key requirement
is to ensure the integrity and trust between
mobile and cloud providers, inter-cloud, and
intra-cloud providers.
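One way to picture the hand-offs described above is a simple hash chain, where each provider's link covers the payload plus the previous link, so tampering anywhere breaks every later link. The provider names and payload below are hypothetical, and a real trust broker would use signed tokens rather than bare digests; this is only a sketch of the integrity-at-each-hand-off idea:

```python
import hashlib, json

def handoff(payload: dict, provider: str, chain: list) -> list:
    """Append one hand-off link: a digest over the payload plus the
    previous link's digest (an empty string for the first hop)."""
    prev = chain[-1]["digest"] if chain else ""
    body = json.dumps(payload, sort_keys=True) + prev
    digest = hashlib.sha256(body.encode()).hexdigest()
    return chain + [{"provider": provider, "digest": digest}]

def verify(payload: dict, chain: list) -> bool:
    """Rebuild the chain from the payload and compare; any change to
    the payload or an intermediate link makes verification fail."""
    rebuilt = []
    for link in chain:
        rebuilt = handoff(payload, link["provider"], rebuilt)
    return rebuilt == chain

chain = handoff({"user": "u1"}, "mobile-carrier", [])
chain = handoff({"user": "u1"}, "cloud-storage", chain)
print(verify({"user": "u1"}, chain))           # True
print(verify({"user": "u1-tampered"}, chain))  # False
```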
The mobile end user will have a trust
boundary with mobile/telecom service pro-
vider (cellular) or managed service provider
(Wi-Fi). The trust will be recorded, and some
portion of identity will be passed on to one or
more cloud providers offering different ser-
vices. Each trust boundary will have a nego-
tiation between mobile and cloud providers
or between cloud providers to establish the
identity, security, and integrity of data as well
as the mobile user.
The future of cloud is in the convergence
of simple standards for security, identity, and
trust, and it involves all participants in the
cloud: mobile device vendors; service provid-
ers; cloud IaaS, PaaS, and SaaS vendors; and
the network. The pay-as-you-go model would
have a price tag factored in for a minimum
SLA level in terms of guarantee and addi-
tional pricing based on additional levels of
security, including security locks at the CPU
boundary.
Cloud/Big Data Frameworks
In addition to cloud security, identity verification,
and trust regarding data integrity,
the technology of cloud/big data frameworks
will have rapid changes and adoption in the
next few years. One such adoption is the
standardization of a query language for the
NoSQL data stores similar to SQL for rela-
tional database management systems. The
query language will result in query nodes that
accept incoming queries and in turn result
in distributed queries across the cluster of
nodes, handling all issues dealing with data,
including security, speed, and reliability of the
transaction.
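The query nodes envisioned above follow a scatter-gather pattern: fan the query out to every data node, then merge the partial results. Here is a minimal sketch with in-memory lists standing in for data nodes; the data and predicate are illustrative and this is not any real NoSQL store's API:

```python
def scatter_gather(query_predicate, nodes):
    """A query node fans the predicate out to every data node holding
    a partition, then merges the partial results."""
    results = []
    for node in nodes:  # in a real cluster these calls run in parallel
        results.extend(row for row in node if query_predicate(row))
    return results

# Each "node" holds one partition of the data.
nodes = [
    [{"user": "a", "score": 10}, {"user": "b", "score": 3}],
    [{"user": "c", "score": 7}],
]
print(scatter_gather(lambda r: r["score"] > 5, nodes))
# [{'user': 'a', 'score': 10}, {'user': 'c', 'score': 7}]
```

A standardized query language would compile down to exactly this kind of distributed fan-out, hiding the per-node details from the application.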
The price per TB (not GB) of flash and random access memory will drive the future adoption of cloud/big data predictive analytics and learning models. This is key in generating value in different verticals such as healthcare, education, energy, and finance. The ability
to keep big data in an in-memory cache in a
smaller footprint including mobile devices will
result in improvements in gathering, collecting,
and storing data from trillions of mobile devices
and perform data, predictive, behavioral, and
visual analytics at near-real time (microseconds).
The key cloud adoption driver today is the num-
ber of cores per computing nodes. The future
of cloud adoption will involve a large memory
cache in addition to many cores per computing
node (commodity hardware).
The technology of Hadoop frameworks
has evolved since 2004, and it includes the
MapReduce framework, the Hadoop Distrib-
uted File System, and additional technologies.
There is a need for a “beyond Hadoop” frame-
work, and the future of Hadoop will be built
into the platform (iOS, Linux, Windows,
etc.) similar to a task scheduler in a platform
OS. The new frameworks “beyond Hadoop”
will need to provide distributed query search
engines out of the box, the ability to easily
manage custom queries, and the ability to
provide a mechanism to have an audit trail
of data transformations end-to-end across
several mobile and cloud providers. The audit
trail or probe will be similar to a ping or trace
route command, and it should be available to
ensure the integrity of data for end-to-end
deployment.
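A toy version of such a traceroute-style probe might record the value after each named transformation hop, giving the end-to-end audit trail the author calls for. The step names and transformations below are invented for illustration:

```python
def trace_transformations(value, steps):
    """Apply each named transformation in order and record the value
    after every hop, like a traceroute for data transformations."""
    trail = [("source", value)]
    for name, fn in steps:
        value = fn(value)
        trail.append((name, value))
    return trail

steps = [
    ("strip-whitespace", str.strip),
    ("normalize-case", str.lower),
]
for hop, value in trace_transformations("  Jane SMITH  ", steps):
    print(f"{hop}: {value!r}")
```

In a multi-provider deployment, each hop would be a different cloud or mobile service, and the trail is what lets an auditor confirm data integrity end-to-end.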
Emerging Standards
There are several emerging standards
for cloud deployments, primarily to address
identity, security, and software-defined
networking (SDN). IaaS, PaaS, and SaaS
cloud deployments have matured, and there
are several players that coexist in the cloud
ecosystem today. The standards such as
OpenID, Open Connect, OAuth, and Open
Data Center Alliance have several cloud pro-
viders and enterprises signing up every day,
but the adoption will take a few more years
to evolve and mature. Open standards are the
key to the future adoption of cloud and the
seamless flow of secure data among differ-
ent cloud providers. A free market economy
is the ideal, but in practice the goal for
future cloud players is about 60% open
standards and 40% proprietary frameworks
in order to promote competition and a level
playing field. Customers will demand faster
adoption in open standards for cloud deploy-
ments, and the keys to adoption are speed,
flexibility, cost, and focus on solving their
problems efficiently. The current approach of
enterprises spending time and money in the
evaluation, selection, and use of cloud pro-
viders will pave the way for pay-as-you-go
cloud providers on demand for blended
services. There will be blend of services lever-
aging mobile and cloud deployment, such as
single sign-on, presales, actual customer sale,
post-sales, recommendation systems, etc. The
cloud adoption of IaaS, PaaS, and SaaS will
give way to business models similar to prepay,
post-pay debit/credit cards for products and
services with “cloud” ready offerings.
The cloud/big data deployments will
see the emergence of multiple data centers
managed by multiple cloud providers, and
the cloud will have to support distributed
query-based search, with results that can be
provided to the mobile user at near real time.
This would require open standards to allow
seamless data exchange between multiple
data centers, maintaining the SLA levels for
performance, scalability, security, and iden-
tity. It is a clear challenge and opportunity for
the future of cloud, but it is likely that new
mobile apps will drive the need for coopera-
tion between cloud providers or result in con-
solidation of several players into a few mobile
and cloud providers.
Billing Systems for the Cloud
Future cloud deployments will require
both mobile and cloud provider payment
processing to keep pace with other aspects of
the cloud deployment model, such as secu-
rity, scalability, cost-savings, and reliability.
At a minimum, a billing provider will be
needed to handle platform billing and rec-
onciliation of payments between cloud and
mobile service providers. The challenge of the
future for cloud-based billing providers is the
payment processing for a blend of services.
For example, payment of different rates for
providers in the cloud, such as device, mobile,
cloud infrastructure/platform provider, stor-
age, network, and payment service providers.
The break-even and moderate margin for a
pay-as-you-go model in the cloud will be 40%
cost and 60% revenue; the cost reduction over
time would be as a result of consolidation
from both the mobile and cloud service pro-
vider offering integrated services. The pay-as-
you-go business model, with SLA guarantees,
will be appealing for mom-and-pop stores
that want to adopt cloud services, coexist, and
compete with big retail stores, and will ulti-
mately result in better service and lower cost
for the consumer.
The future of cloud deployments will
involve rapid adoption of new technology
frameworks beyond Hadoop, open standards
in the area of cloud security, identity, and
trust, as well as a universal and simple query
language for aggregating data from legacy and
emerging data stores. Future cloud adoption
will involve trillions of mobile devices, ubiq-
uitous computing, zettabytes of data, and
improved SLAs between cloud providers, as
well as larger, cheaper memory cache and
multiple cores per computing node. ■
Chandramouli Venkatesan (Mouli) has more than 20 years of experience in the telecom industry, including technical leadership roles at Fujitsu Networks and Cisco Systems, and as a big data integration architect in the financial and healthcare industries. Venkatesan’s company MEICS, Inc. (www.meics.org) provides the analytics and learning platform for cloud deployments. Venkatesan evangelizes emerging technologies and platforms and innovation in cloud, big data, mobility, and content delivery networks.
sponsored content
Many organizations recognize
the value of generating insights from
their rapidly increasing web, social,
mobile and machine-generated data.
However, traditional batch analysis is
not fast enough.
If data is even one day old, the
insights may already be obsolete.
Companies need to analyze data in
near-real-time, often in seconds.
Additionally, no matter the timeliness
of these insights, they are worthless
without action. The faster a company
acts, the more likely there is a return on
the insight such as increased customer
conversion, loyalty, satisfaction or
lower inventory, manufacturing, or
distribution costs.
Companies seek technology solutions
that allow them to become real-time,
data-driven businesses, but have been
challenged by existing solutions.
LEGACY DATABASE OPTIONS
Traditional RDBMSs (Relational
Database Management Systems) such
as Oracle or IBM DB2 can support
real-time updates, but require expensive
specialized hardware to “scale up” to
support terabytes to petabytes of data.
At millions of dollars per installation,
this quickly becomes cost-prohibitive.
Traditional open source databases such
as MySQL and PostgreSQL are unable
to scale beyond a few terabytes without
manual sharding. However, manual sharding
requires a partial rewrite of every application
and becomes a maintenance nightmare to
periodically rebalance shards.
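The maintenance burden of manual sharding can be sketched in a few lines: every data access in the application must route through key-hashing logic like the hypothetical function below (the shard naming and hash choice are illustrative), and changing the shard count remaps nearly every key, forcing the bulk rebalancing the text describes:

```python
import hashlib

def shard_for(customer_id: str, num_shards: int) -> str:
    """Pick a shard by hashing the key; every query in the application
    has to be rewritten to route through a function like this."""
    digest = hashlib.md5(customer_id.encode()).hexdigest()
    return f"mysql-shard-{int(digest, 16) % num_shards}"

# Rebalancing is the nightmare: going from 4 shards to 5 changes the
# modulus, so almost every key maps to a different shard and must move.
print(shard_for("cust-1042", 4))
```

Auto-sharding stores such as HBase avoid this by splitting and moving data ranges internally, with no application-side routing code.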
New Big Data technologies such as
Hadoop and HBase are cost-effective
platforms that are proven to scale from
terabytes to petabytes, but they provide
little or no SQL support. This lack of SQL
support is a major barrier to Hadoop
adoption and is also a major shortcoming
of NoSQL solutions, because of the massive
retraining required. Companies adopting
these technologies cannot leverage existing
investments in SQL-trained people, or SQL
Business Intelligence (BI) tools.
SPLICE MACHINE: THE REAL-TIME SQL-ON-HADOOP DATABASE
Splice Machine brings the best of these
worlds together. It is a standard SQL
database supporting real-time updates and
transactions implemented on the scalable,
Hadoop distributed computing platform.
Designed to meet the needs of real-time,
data-driven businesses, Splice Machine is
the only transactional SQL-on-Hadoop
database. Like Oracle and MySQL, it is a
general-purpose database that can handle
operational (OLTP) or analytical (OLAP)
workloads, but can also scale out cost-
effectively on inexpensive commodity
servers.
Splice Machine marries two proven
technology stacks: Apache Derby, a Java-
based, full-featured ANSI SQL database, and
HBase/Hadoop, the leading platforms for
distributed computing.
SPLICE MACHINE ENABLES YOU TO GET REAL WITH BIG DATA
As the only transactional SQL-on-
Hadoop database, Splice Machine presents
unlimited possibilities to application
developers and database architects. Best of
all, it eliminates the compromises that have
been part of any Big Data database platform
selection to date.
Splice Machine is uniquely qualified to
power applications that can harness real-
time data to create more valuable insights
and drive better, more timely actions. This
enables companies that use Splice Machine
to become real-time, data-driven businesses
that can leapfrog their competition and get
real results from Big Data.
SPLICE MACHINE [email protected] www.splicemachine.com
Get Real With Big Data

WITH SPLICE MACHINE, COMPANIES CAN:
Unlock the Value of Hadoop. Splice
Machine provides a standard ANSI SQL
engine, so any SQL-trained analyst or SQL-
based application can unlock the value of
the data in a current Hadoop deployment,
across most major distributions.
Combine NoSQL and SQL. Splice
Machine enables application developers
to enjoy the best of both SQL and NoSQL,
bringing NoSQL scalability with SQL
language support.
Avoid Expensive “Big Iron.” Splice
Machine frees companies from the spiraling
costs of scaling up specialized server
hardware to handle more than a few terabytes.
Scale Beyond MySQL. Splice Machine
can help those companies scale beyond
a few terabytes with the proven auto-
sharding capability of HBase.
“Future-proof ” New Apps. Splice
Machine provides a “future-proof”
database platform that can scale cost-
effectively from gigabytes to petabytes
for new applications.
Data quality has been one of the central
issues in information management since the
beginning—not the beginning of modern
computing and the development of the cor-
porate information infrastructure but since
the beginning of modern economics and
probably before that. Data quality is what
audits are all about.
Nonetheless, the issues surrounding data
quality took on added importance with the
data explosion sparked by the large-scale inte-
gration of computing into every aspect of
business activity. The need for high-quality
data was captured in the punch-card days of
the computer revolution with the epigram gar-
bage in, garbage out. If the data isn’t good, the
outcome of the business process that uses that
data isn’t good either.
Data growth has always been robust, and
the rate keeps accelerating with every new gen-
eration of computing technology. Mainframe
computers generated and stored huge amounts
of information, but then came minicomputers
and then personal computers. At that point,
everybody in a corporation and many people
at home were generating valuable data that was
used in many different ways. Relational data-
bases became the repositories of information
across the enterprises, from financial data to
product development efforts, from manufac-
turing to logistics to customer relationships to
marketing. Unfortunately, given the organiza-
tional structure of most companies, frequently
data was captured in divisional silos and could
not be shared among different departments—
finance and sales, for example, or manufac-
turing and logistics. Since data was captured
in different ways by different organizational
units, integrating the data to provide a holistic
picture of business activities was very difficult.
The explosion in the amount of structured
data generated by a corporation sparked two
key developments. First, it cast a sharp spotlight
on data quality. The equation was pretty simple.
Bad data led to bad business outcomes. Second,
efforts were put in place to develop master data
management programs so data generated by
different parts of an organization could be coor-
dinated and integrated, at least to some degree.
Challenges to Data Quality and MDM
Efforts in both data quality and master data
management have only been partially success-
ful. Not only is data quality difficult to achieve,
it is a difficult problem even to approach. In
addition, the scope of the problem keeps
broadening. Master data management pres-
ents many of the same challenges that data
quality itself presents. Moreover, the complex-
ity of implementing master data management
solutions has restricted them to relatively large
companies. At bottom, both data quality
programs and master data management
solutions are tricky to implement successfully,
in part because, to a large degree, the impact
of poor quality and disjointed data is hidden
from sight. Too often, data quality seems to be
nobody’s specific responsibility.
Despite the difficulties in gathering corpo-
rate resources to address these issues, during
the past decade, the high cost of poor quality
and poorly integrated data has become clearer,
and a better understanding of what defines
data quality, as well as a general methodology
for implementing data quality programs, has
emerged. The establishment of the general
foundation for data quality and master data
management programs is significant, particu-
larly because the corporate information envi-
ronment is undergoing a tremendous upheaval,
generating turbulence as vigorous as that cre-
ated by mainframe and personal computers.
The spread of the internet and mobile
devices such as smartphones and tablets is not
Data Quality and MDM Programs Must Evolve to Meet Complex New Challenges
By Elliot King
The State of Data Quality and Master Data Management
only generating more data than ever before,
many kinds of data—much of it largely
unstructured or semistructured—have
become very important. The use of RFID and
other kinds of sensor data has led to a data
tsunami of epic proportions. Cloud com-
puting has created an imperative for compa-
nies to integrate data from many different
sources both inside and outside the corpo-
ration. And compliance with regulations in a
wide range of industries means that data has
to be held for longer periods of time and must
be correct. In short, the basics for data quality
and master data management are in place but
the basics are not nearly sufficient.
The Current Situation
In 2002, the Data Warehousing Institute
estimated that poor data quality cost Amer-
ican businesses about $600 billion a year.
Through the years, that figure has been the
number most commonly bandied about
as the price tag for bad data. Of course, the
accuracy of such an eye-popping number
covering the entire scope of American indus-
try is hard to assess.
However, a more recent study of busi-
nesses in the U.K. presented an even starker
picture. It found that as much as 16% of
many companies’ budgets is squandered
because of poor data quality. Departments
such as sales, operations, and finance waste
on average 15% of their budgets, according
to the study. That figure climbs to 18% for
IT. And the number is even higher for cus-
tomer-facing activities such as customer loy-
alty programs. In all, 90% of the companies
surveyed opined that they felt their activities
were hindered by poor data.
When specific functional areas are assessed,
the substantial cost that poor data quality
exacts becomes quite clear. For exam-
ple, contact information was one of the first
targets for data quality programs. Obviously,
inaccurate, incomplete, and duplicated address
information hurts the results of direct market-
ing campaigns. In one particularly egregious
example, a major pharmaceutical company
once reported that 25% of the glossy brochures
it mailed were returned. Not only are potential
sales missed, current customers can be alien-
ated. And marketing material that arrives
at the wrong address is pure cost.
Marketing is only one area in which the
impact of poor information is visible. One
European bank found that 100% of customer
complaints had their roots in poor or outright
incorrect information. Moreover, this study
showed, customers who register complaints
are much more likely to shop for alternative
suppliers than those who don’t. The added
churn among customers whose complaints
are rooted in poor data quality is thus a
direct cost of poor data quality.
And the list goes on. Poor data quality in
manufacturing slows time to market, leads
to inventory management problems, and can
result in product defects. Bad logistics data can
have a material impact on both the front end
and back end of the manufacturing process.
The Benefits of Improving Data Quality
On the other side of the equation, improving
data quality can lead to huge benefits.
One company reported that improving the
quality of data available to its call center per-
sonnel resulted in nearly $1 million in sav-
ings. Another realized $150,000 in billing
efficiencies by improving its customer contact
information.
As the cost/benefit equation of data quality
has become more apparent, the need to define
data quality has become more pressing. In
addition to the core characteristics of accuracy
and timeliness, the most concise expression of
the attributes of high-quality data is consistency,
completeness, and conciseness. Consistency
means that each “fact” is represented in the
same way across the information ecosystem.
For example, a date is represented by two digits
for the month, two for the day, and four for the
year and is represented in that order across the
informational ecosystem in a company. More-
over, the “facts” represented must be logical. An
“order due” date, for example, cannot be earlier
than an “order placed” date.
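Both consistency rules can be checked mechanically. Here is a small sketch, assuming a single agreed-upon MM/DD/YYYY format; the format choice and function name are illustrative, not any particular data quality tool's API:

```python
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y"  # two-digit month, two-digit day, four-digit year

def check_order(placed: str, due: str) -> bool:
    """Both rules from the text: each date must parse in the one agreed
    format, and 'order due' may not precede 'order placed'."""
    placed_dt = datetime.strptime(placed, DATE_FORMAT)  # raises on bad format
    due_dt = datetime.strptime(due, DATE_FORMAT)
    return due_dt >= placed_dt

print(check_order("01/05/2013", "01/20/2013"))  # True
print(check_order("01/20/2013", "01/05/2013"))  # False
```

A record arriving as "2013-01-05" would fail the parse, flagging a representation inconsistency before the logical check even runs.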
Maintaining consistency is more difficult
than it may appear at first. Companies capture
data in a multitude of ways. In many cases,
customers are entering data via web forms,
and both the accuracy and the consistency
of the data can be an issue. Moreover, data is
often imported from third-party source sys-
tems, which may use alternative formats to
represent “facts.” Indeed, even separate oper-
ational units within a single enterprise may
represent data differently.
Maintaining Data Consistency
Master data management is one approach
companies have used to maintain data consis-
tency. MDM technology consolidates, cleanses,
and augments corporate data, synchronizing
data among all applications, business pro-
cesses, and analytical tools. Master data man-
agement tools provide the central repository
for cross-referenced data in the organization,
building a single view of organizational data.
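One common consolidation strategy, a "golden record" built by survivorship rules, can be sketched as follows. The field names and the most-recent-non-empty-wins rule are illustrative, not any particular MDM product's behavior:

```python
def build_golden_record(records):
    """Merge per-system views of the same customer into one record:
    for each field, keep the most recently updated non-empty value."""
    golden = {}
    for rec in sorted(records, key=lambda r: r["updated"]):
        for field, value in rec.items():
            if field != "updated" and value:
                golden[field] = value  # later (fresher) values overwrite
    return golden

crm     = {"name": "J. Smith",   "phone": "",         "updated": 1}
billing = {"name": "Jane Smith", "phone": "555-0100", "updated": 2}
print(build_golden_record([crm, billing]))
# {'name': 'Jane Smith', 'phone': '555-0100'}
```

The resulting record is what the MDM repository would cross-reference back to each source system, giving the single organizational view described above.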
The second element of data quality is com-
pleteness. Different stakeholders in an organi-
zation need different information. For example,
the academic records department in a university may be most interested in a student’s grade point average, the courses in which the student is enrolled, and the student’s progress toward graduation. The dean of students wants to know if the student is living on campus, the extracurricular activities in which the student participates, and any disciplinary problems the student has had. The bursar’s office wants to know the scholarships the student has received
The State of Data Quality and Master Data Management
32 BIG DATA SOURCEBOOK 2013
and the student’s payment history. A good data
system will not only capture all that informa-
tion but also ensure that none of the key ele-
ments are missing.
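A completeness check of this kind can be expressed as a required-fields rule per stakeholder. The sketch below uses hypothetical field names drawn from the university example above.

```python
# Hypothetical per-stakeholder requirements from the example above.
REQUIRED_FIELDS = {
    "academic_records": {"gpa", "courses", "progress"},
    "dean_of_students": {"residence", "activities", "disciplinary_record"},
    "bursar": {"scholarships", "payment_history"},
}

def missing_fields(student, stakeholder):
    """Return the key elements a given stakeholder needs that are absent."""
    present = {k for k, v in student.items() if v is not None}
    return REQUIRED_FIELDS[stakeholder] - present
```

The same student record can be complete for one stakeholder and incomplete for another, which is exactly the point of the example.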
The last element of good quality data is con-
ciseness. Information is flowing into organiza-
tions through several different avenues. Inevita-
bly, records will be duplicated and information
comingled, and nobody likes to receive three
copies of the same piece of direct mail.
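Even a light-touch sketch shows the idea: normalize the fields that identify a person, then drop exact repeats. (Production matching is fuzzier than this, handling nicknames, transpositions, and address variants.)

```python
def dedupe_mailing_list(records):
    """Drop duplicate direct-mail records after light normalization."""
    seen = set()
    unique = []
    for rec in records:
        # Normalize case and whitespace so "J. Smith " matches "j. smith".
        key = (rec["name"].strip().lower(), rec["address"].strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```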
Because companies currently operate within
such a dynamic information environment, no
matter how diligent enterprises are, their sys-
tems will contain faulty, incorrect, duplicate,
and incomplete information. Indeed, if compa-
nies do nothing at all, the quality of their data
will degrade. Time decay is an ongoing, con-
sistent cause of data errors. People move. They
get married and change their names. They get
divorced and change their names again. Corpo-
rate records have no way to keep up.
But time is only one of the root causes for
bad data. Corporate change also poses a prob-
lem. As companies grow, they add new applica-
tions and systems, making other applications
and systems obsolete. In addition, an enter-
prise may merge with or purchase another
organization whose data is in completely dif-
ferent formats. Finally, companies are increas-
ingly incorporating data from outside sources.
If not managed correctly, each of these events
can introduce large-scale problems with cor-
porate data.
The third root cause of data quality prob-
lems is that old standby—human error.
People already generate a lot of data and are
generating even more as social media content
and unstructured data become more signifi-
cant. Sadly, people make mistakes. People are
inconsistent. People omit things. People enter
data multiple times. Inaccuracies, omissions,
inconsistencies, and redundancies are hall-
marks of poor data quality.
Given that data deterioration is an ongoing
facet of enterprise information, for a data qual-
ity program to work, it must be ongoing and
iterative. Modern data quality programs rest
on a handful of key activities—data profiling
and assessment, data improvement, data inte-
gration, and data augmentation.
In theory, data improvement programs are
not complicated. The first step is to character-
ize or profile the data at hand and measure how
closely it conforms to what is expected. The
next step is to fix the mistakes. The third step
is to eliminate duplicated and redundant data.
Finally, data quality improvement programs
should address holes in the enterprise infor-
mation environment by augmenting existing
data with data from appropriate sources. Fre-
quently, data improvement programs do not
address enterprise data in its entirety but focus
on high-value, high-impact information used
in what can be considered mission-critical
business processes.
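The four steps can be sketched as a small pipeline. Everything here is illustrative: the field names, the correction lookup, and the reference table stand in for real profiling tools and augmentation sources.

```python
def profile(records, field):
    """Step 1: measure how closely the data conforms to expectations."""
    valid = sum(1 for r in records if r.get(field))
    return valid / len(records)

def improve(records, field, fixes):
    """Step 2: repair known mistakes using a lookup of corrections."""
    return [{**r, field: fixes.get(r.get(field), r.get(field))} for r in records]

def dedupe(records, field):
    """Step 3: eliminate duplicate and redundant records."""
    seen, out = set(), []
    for r in records:
        if r[field] not in seen:
            seen.add(r[field])
            out.append(r)
    return out

def augment(records, reference, field, extra):
    """Step 4: fill holes from an appropriate outside source."""
    return [{**r, extra: reference.get(r[field])} for r in records]
```

Chaining the four functions over a handful of records mirrors the profile, fix, dedupe, and augment cycle described above.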
The Big Data Challenge

To date, most data quality programs have
been focused on structured data. But, iron-
ically, while the tools, processes, and organi-
zational structures needed to implement an
effective data quality program have developed,
the emergence of big data has the potential to
completely rewrite the rules of the game.
Though the term “big data” is still debated,
it represents something qualitatively new.
Big data does not just mean the explosion of
transactional data driven by the widespread
use of sensors and other data-generating
devices. It also refers to the desire and ability
to extract analytic value from new data types
such as video and audio. And it refers to the
trend toward capturing huge amounts of data
produced by the internet, mobile devices, and
social media.
The availability of more data, new types of
data, and data from a wider array of sources
has had a major impact on data analysis and
business intelligence. In the past, people would
identify a problem they wanted to solve and
then gather and analyze the data needed to
solve that problem. With big data, that work
flow is reversed. Companies are realizing that
they have access to huge amounts of new
data—tweets, for example—and are working
to determine how to extract value from that
data.
Data quality programs will have to evolve
to meet these new challenges. Perhaps the first
step will be methods for developing appropri-
ate metadata. In general, big data is complex,
messy, and can come from a variety of dif-
ferent sources, so good metadata is essential.
Data classification, efficient data integration,
and the establishment of standards and data
governance will also be critical elements of
data quality programs that encompass big data
elements.
Ensuring data quality has been a serious
challenge in many organizations. Frequently,
data quality problems are masked. Business
processes seem to be working well enough, and
it is hard to determine beforehand what the
return on investment in a data quality program
would be. In addition, in many organizations,
nobody seems to “own” responsibility for the
overall quality of corporate data. People are
responsible or are sensitive to their own slice
of the data pie but are not concerned with the
overall pie itself.
What’s Ahead

It should not be a surprise that in a recent
survey of data quality professionals, two-
thirds of the respondents felt the data quality
programs in their organizations were only
“OK” (some goals were met) or poor.
On the brighter side, however, 70% indicated
that the company’s management felt data
and information were important corporate
assets and recognized the value of improving
their quality. On balance, however, data quality
must be improved. In another survey, 61% of
IT and business professionals said they lacked
confidence in their company data.
During the next several years, data qual-
ity professionals will face a series of complex
challenges. Perhaps the most immediate is to
be able to view data quality issues within their
organizations holistically. Data generated by
one division—marketing, let’s say—may be
consumed by another—manufacturing, per-
haps. Data quality professionals need to be
able to respond to the needs of both.
Secondly, data quality professionals must
develop tools, processes, and procedures to
manage big data. Since a lot of big data is also
real-time data, data quality must become a
real-time process integrated into the enterprise information ecosystem. Finally, and perhaps most importantly, data quality professionals will have to set priorities. Nobody can do everything at once. ■
Elliot King has reported on IT for 30 years. He is the chair of the Department of Communication at Loyola University Maryland, where he is a founder of an M.A. program in Emerging Media. He has written six books and hundreds of articles about new technologies. Follow him on Twitter @joyofjournalism. He blogs at emergingmedia360.org.
DBTA.COM 33
sponsored content
Data is growing exponentially, at unprecedented speed. According to IDC
(International Data Corporation), by
2015, nearly 3 billion people will be online,
generating nearly 8 zettabytes of data.
Analyzing large data sets and leveraging
new data-driven strategies will be essential
for establishing competitive differentiation
in the foreseeable future.
Big Data represents a fundamental shift
in the way companies conduct business
and interact with customers. Deriving value
from data sets requires that companies
across all industries be aggressive about
data collection, integration, cleansing
and analysis.
DATA SOURCES—BEYOND THE TRADITIONAL
Enterprises understand the intrinsic
value in mining and analyzing traditional
data sources such as demographics,
consumer transactions, behavior models,
industry trends, and competitor information.
However, the age of Big Data and advanced
technologies necessitate the analysis of new
data universes, such as social media and
mobile technologies.
Social media is one of the major elements
driving the overall Big Data phenomenon.
Twitter streams, Facebook posts and
blogging forums flood organizations
with massive amounts of data. Successful
Big Data strategies include the adoption
of technologies to pull relevant social
media into a single stream and integrate
the information into the core functions
of the enterprise. Automated processes,
matching technology and filters extract
content and consumer sentiment. When
social stream data is cleansed and integrated
into a database, enterprises gain invaluable
information on customer insights,
competitive intelligence, product feedback,
and market trends.
Mobile technology is also contributing
to the data influx as mobile devices become
more powerful, networks run faster and
apps more numerous. According to a report
by Cisco, global traffic on data networks
grew by 70% in 2012. The traffic on mobile
data networks in 2012—885 petabytes or
885 quadrillion bytes—was nearly 12 times
greater than total Internet traffic around the
world in 2000. As consumer behavior shifts
to new digital technologies, enterprises
are in a prime position to take advantage
of opportunities such as location-based
marketing.
GPS technologies are much more precise,
allowing marketers to deliver targeted
real-time messaging based on a consumer’s
location. Geofencing, a technology gaining
popularity among industries such as retail,
establishes a virtual perimeter around a
real-world site. For example, geofences
may be set up around a storefront. When
a customer carrying a smart device enters
the area, the device emits geodata, allowing
companies to send locally-targeted content
and promotions. According to research by
Placecast, a company specializing in location-
based services, one of every two consumers
visits a location after receiving an alert.
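The perimeter test behind geofencing reduces to a distance check. A minimal sketch, assuming a circular fence and using the haversine great-circle distance (coordinates and radius are illustrative):

```python
import math

def inside_geofence(lat, lon, center_lat, center_lon, radius_m):
    """Return True if a device's GPS fix falls inside a circular geofence."""
    r_earth = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat), math.radians(center_lat)
    dphi = math.radians(center_lat - lat)
    dlmb = math.radians(center_lon - lon)
    # Haversine formula: great-circle distance between the two points.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    distance = 2 * r_earth * math.asin(math.sqrt(a))
    return distance <= radius_m
```

When the check returns True, the marketing system would trigger the locally targeted content described above.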
MANAGING BIG DATA

When properly managed, Big Data brings
big opportunities. Solid data management
processes and well-designed procedures for
data stewardship are crucial investments
for Big Data projects to be successful.
Structured and unstructured data must be
properly formatted, integrated and cleansed
to fully extract actionable and agile business
intelligence.
As the speed of business continues
to accelerate, data is generated instantly.
Traditional data quality batch processing is
no longer enough to fully sustain effective
operational decision-making. Integrating,
cleansing and analyzing data in real-time
allows a company to engage in opportunities
instantly. For example, using real-time data
processing, a company can personalize a
customer’s on-line website visit, enhancing
the overall customer experience. Monitoring
of transactions in real-time also has
important benefits for security. Security
threats can instantly be identified, such
as fraudulent activity or individuals on a
security watch list. The applications are
numerous. Corporations able to react to
information the fastest will have the greatest
competitive advantage.
Big Data initiatives require planning
and dedication to be successful. According
to Gartner Predicts 2012 research, more
than 85% of Fortune 500 organizations
will be unable to effectively exploit Big
Data by 2015. Companies that successfully
incorporate Big Data projects into the overall
business strategy will gain significant returns,
including better customer relationships,
improved operational efficiency, identification
of marketing opportunities, security risk
mitigation, and more.
DATAMENTORS provides award-winning data quality and database marketing solutions. Offered as either a customer-premise installation or ASP delivered solution, DataMentors leverages proprietary data discovery, analysis, campaign management, data mining and modeling practices to identify proactive, knowledge-driven decisions. Learn more at www.DataMentors.com, including how to obtain a complimentary customer database quality analysis.
Big Data ... Big Deal?
There is something for everyone within
today’s generation of business intelligence and
advanced analytics solutions. Built on open,
flexible frameworks and designed for users
who expect and need information at internet
speeds, BI and analytics are undergoing its
first revolutionary transformation since com-
puters became mainstream business tools.
Not only are the tools evolving, end users
are evolving as well. People are demanding
more of their analytics solutions, but analyt-
ics are also changing the way people across
enterprises, from end-users to infrastructure
specialists to top-level executives, work and
run their businesses.
All About ChoiceFor today’s data infrastructure managers
charged with capturing, cleansing, processing,
and storing data, the new BI/analytics world
is all about choice—and lots of it. An array of technologies and solutions offering smarter ways to capture, manage, and store big data of all types and volumes is now surging into the marketplace.
A company doesn’t need to be an enterprise on the scale of a Google or eBay, turning huge datasets into real-time insights on millions of customers, to put big data to work. Organizations of all sizes are now getting into the game. In fact,
more than two-fifths of 304 data managers
surveyed from all types and sizes of busi-
nesses report they have formal “big data” ini-
tiatives in progress, with the goals of deliv-
ering predictive analytics, customer analysis,
and growing new business revenue streams
(“2013 Big Data Opportunities Survey,”
sponsored by SAP and conducted by Uni-
sphere Research, a division of Information
Today, Inc., May 2013).
There are a variety of data infrastructure
tools and platforms that are paving the way to
big data analysis:
Open Source/NoSQL/NewSQL Databases: Alternative forms of databases are filling the
need to manage and store unstructured data.
These new databases often hail from the open
source space, meaning that they are immedi-
ately available to administrators and develop-
ers for little or no charge. NewSQL databases
tend to be cloud-based systems. NoSQL (“Not
only” SQL)-based databases are designed
to store unstructured or nonrelational data.
There are four categories of NoSQL databases:
key-value stores (for the storage of schema-less
data); column family databases (storing data
within columns); graph databases (employing
structures with nodes, edges, and properties to
represent and store data); and document data-
bases (for the simple storage and retrieval of
document aggregates).
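The key-value category is the simplest of the four to illustrate: records of any shape are stored and fetched by key alone, with no table schema imposed. A toy in-memory sketch:

```python
class KeyValueStore:
    """A toy key-value store: schema-less values indexed by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Any shape of value is accepted; no schema is enforced.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)
```

Two records with entirely different structures can live side by side under different keys, which is what "schema-less" means in practice.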
Hadoop/MapReduce Open Source Ecosphere: Apache Hadoop, an open source
framework, is designed for processing and
managing big data stores of unstructured data,
such as log files. Hadoop is a parallel-pro-
cessing framework, linked to the MapReduce
analytics engine, that captures and packages
both unstructured and structured data into
digestible files that can be accessed by other
enterprise applications.
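The map and reduce phases can be imitated in a few lines of ordinary Python. This sketch counts words in log lines; in Hadoop the map tasks would run in parallel across the cluster rather than sequentially.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map: emit a (key, 1) pair for each word in one log line."""
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

def word_count(lines):
    # Simulate the framework: run map over each input record,
    # then feed all emitted pairs to the reducer.
    return reduce_phase(chain.from_iterable(map_phase(l) for l in lines))
```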
A survey of 298 data managers affiliated
with the Independent Oracle Users Group
(IOUG) has found that Hadoop adoption is
likely to triple during the coming years. At
the time of the survey, 13% of respondents
had deployed or were in the process of imple-
menting or piloting Hadoop, with an addi-
tional 22% considering adoption of the open
source framework at some point in the future
(“Big Data, Big Challenges, Big Opportunities: 2012 IOUG Big Data Strategies Survey,” sponsored by Oracle and conducted by Unisphere Research, September 2012).
Relational Database Management Systems: RDBMSs, on the market for close to 3
decades, structure data into tables that can
be cross-indexed within applications and are
increasingly being tweaked for the data surge
ahead. The IOUG survey finds nine out of
10 enterprises intend to continue using rela-
tional databases for the foreseeable future,
and it is likely that many organizations will
have hybrid environments with both SQL and
NoSQL running side by side.
Cloud: Cloud-based BI solutions offer
functionality on demand, along with more
rapid deployment, low upfront cost, and scal-
ability. Many database vendors now support
data management and storage capabilities via
a cloud or software as a service environment.
In addition, other vendors are also optimiz-
ing their data products to be able to leverage
cloud resources—either as the foundation
of private clouds, or running in on-premises
server environments that also access applica-
tion programming interfaces (APIs) or web
services for additional functions.
In another survey of 262 data managers,
37% say their organizations are either run-
ning private clouds—defined as on-demand
shared services provided to internal depart-
ments or lines of business within enter-
prises—at full or limited scale, or are in pilot
stages (“Enterprise Cloudscapes—Deeper
and More Strategic: 2012–13 IOUG Cloud
Computing Survey,” sponsored by Oracle and
conducted by Unisphere Research, February
2013). This is up from 29% in 2010, the first
year this survey was conducted. In addition,
adoption of public clouds—defined as on-
demand services provided by public cloud
providers—is on the upswing. Twenty-six
percent of respondents say they now use pub-
lic cloud services either in full or limited ways,
or within pilot projects. This is up by 86%
from the first survey in this series, conducted
in 2010, when 14% reported adoption.
In addition, 50% of private cloud users
report they run database as a service, up from
35% 2 years ago. Among public cloud users,
37% run database as a service, up from 12%
2 years ago.
Data Virtualization: Just as IT assets are
now offered through service layers via soft-
ware as a service or platform as a service,
information can be available through a “data
as a service” approach. In tandem with the
rise of private cloud and server virtualization
within enterprises, there has been a similar
movement to data virtualization, or data-
base as a service. By decoupling the database
layer from hardware and applications, users
are able to access disparate data sources from
anywhere across the enterprise, regardless of
location or underlying platform.
In-Memory Technologies: Many ven-
dors are adding in-memory capabilities to
offerings in which data and processing are
moved into a machine’s random access mem-
ory. In-memory eliminates what is probably
the slowest part of data processing—pulling
data off disks. In an environment with large
datasets—scaling into the hundreds of tera-
bytes—this becomes a serious bottleneck for
rapid analysis, limiting the amount of data
that can be analyzed at one time. Some esti-
mate that the capacity of such systems can
already go as high as that of large, disk-based
databases—all that data stored in a RAID
array could potentially be moved right into
machine memory.
A recent survey of 323 data managers
demonstrates that in-memory technology is
poised for rapid growth. While in-memory is
seen within many organizations, it is mainly
focused on specific sites or pilot projects at
this time. A handful of respondents to the
survey, 5%, report the technology is currently
in “widespread” use across their enterprises,
while another 8% say it is in limited use across
more than three departments within their
organizations. Close to one-third, 31%, report
that they are either piloting or considering this
technology (“Accelerating Enterprise Insights:
2013 IOUG In-Memory Strategies Survey,”
sponsored by SAP and conducted by Uni-
sphere Research, January 2013).
Technologies to Connect Data to Business

For quants, data analysts, data scientists,
and business users, the new BI/analytics
world is all about diving deep into datasets
and being able to engage in “storytelling” as a
way to connect data to the business.
There is a perception that developing
and supporting “data scientist”-type skill sets
require specially trained statisticians and
mathematicians supported by sophisticated
algorithms. However, with the help of tools
and platforms now widely available in today’s
market, members of existing data departments
can also be brought up-to-speed and made
capable of delivering insightful data analysis.
Open Source: The revolutionary frame-
work that broke open the big data analysis
scene is Hadoop and MapReduce. One of the
most potent tools in the quants’ toolboxes is
R, the open source, object-oriented analytics
language. R is rapidly deployable, tends to
be well-suited for building analytics against
large and highly diverse datasets, and has been
embedded in many applications. There are a
number of solutions that build upon R and
make the language easy to work with to visu-
ally manipulate data for the more effective
delivery of business insights.
Predictive Analytics: Predictive analytics technology is a key mission awaiting quants, data analysts, and data scientists. The technology is available; all it takes is a little imagination. For example, during the presidential election in the fall of 2012, Nate Silver of The New York Times put predictive analytics on the map with his almost dead-on prediction of the winning candidate. The same principles can
The State of Business Intelligence and Advanced Analytics
be applied for more routine business prob-
lems, which potentially can uncover unfore-
seen outcomes. For example, one bank found
that its most profitable customers were not
high-wealth individuals, but rather those who
were not meeting minimums and overdrafting
accounts and thus anteing up fees. In another
case, an airline found that passengers specify-
ing vegetarian preferences in their on-board
meals were less likely to miss flights. Analytics can even surface counterintuitive findings, such as the dating site that found people rated the most attractive received less attention than “average”-looking members. (Suitors felt they faced more competition for the most attractive members.)
Programming Tools: A range of script-
ing and open source languages—including
Python, Ruby, and Perl—also include exten-
sions for parallel programming and machine
learning.
Opening Up Analytics to the Business

For business users, the new BI/analytics
world is all about analytics for all. There
has been a growing movement to open up
analytics across the organization—pushing
these capabilities down to all levels of deci-
sion makers, including frontline customer
service representatives, production per-
sonnel, and information workers. A recent
survey of 250 data managers finds that in
most companies, fewer than one out of 10
employees have access to BI and analytic
systems (“Opening Up Business Intelligence
to the Enterprise: 2012 Survey On Self-Ser-
vice BI and Analytics,” sponsored by Tab-
leau Software and published by Unisphere
Research, October 2012).
Now, a new generation of front-end tools
is making this possible:
Visualization: Visual analytics is the new
frontier for end-user data access. Data visualiza-
tion tools provide highly graphic, yet relatively
simple, interfaces that help end users dig deep
into queries. This represents a departure from
the ubiquitous spreadsheet—rows of num-
bers—as well as static dashboards or PDF-based
reports with their immovable variables.
Self-Service: There is a growing trend among
enterprises to enable end users to build or design
their own interfaces and queries. Self-service
may take the form of enterprise “mashups,” in
which end users build their own front ends that
are combined with one or more data sources,
or through highly configurable portals. Accord-
ing to the 2012 Tableau-Unisphere self-service
BI and analytics study, self-service BI is now
offered to some extent in half of the organiza-
tions surveyed.
Pervasive BI: Pervasive BI and analyt-
ics are increasingly being embedded within
applications or devices, in which the end user
is oblivious to the software and data feeds
running in the background.
Cloud: Many users are looking to the
cloud to support BI data and tools in a more
cost-effective way than on-premises desk-
top tools. Third-party cloud providers have
almost unlimited capacity and can support
and provide big data analytics at a scale that would be cost-prohibitive for most organizations to build in-house. Cloud
opens up business intelligence and analytics
to more users—nonanalysts—within orga-
nizations. With the drive to make BI more
ubiquitous, the cloud will only accelerate this
move toward simplified access.
Mobile: Mobile technology, which is only
just starting to seep into the BI and analytics
realm, promises to be a source of disruption.
The availability of analytics on an easy-to-use
mobile app, for example, will bring analytics
to decision makers almost instantaneously.
With many employees now bringing their
own devices to work, analytics may be readily used by people who previously did not have
access to those capabilities.
The Opportunity to Compete on Analytics

For top-level executives, the new BI/analytics world presents opportunities to compete
on analytics. The ability to employ analytics
means understanding customers and markets
better, as well as spotting trends as they are
starting to happen, or before they happen.
As found in the Unisphere Research sur-
vey on big data opportunities, most execu-
tives instinctively understand the advantages
big data can bring to their operations, espe-
cially with predictive analytics and customer
analytics. A majority of the respondents with
such efforts under way, 59%, seek to improve
existing business processes, while another
41% are concerned with the need to create
new business processes/models.
BI and advanced analytics not only pro-
vide snapshots of aspects of the business such
as sales or customer churn, but also make it
possible to apply key performance indicators
against data to develop a picture of a busi-
ness’s overall performance.
What’s Ahead

To compete in today’s hyper-competitive
global marketplace, businesses need to under-
stand what’s around the corner. Predictive
analytics technology enables this to happen,
and the new generation of tools incorporates
such predictive capabilities.
The ability to automate low-level deci-
sions is freeing up organizations to apply
their mind power against tougher, more stra-
tegic decisions. These days, analytical appli-
cations are being embedded into processes
and applied against business rules engines to
enable applications and machines to handle
the more routine, day-to-day decisions that
come up—rerouting deliveries, extending
up-sell offers to customers, or canceling or
revising a purchase order.
Many organizations beginning their jour-
ney into the new BI and analytics space are
starting to discover all the possibilities it offers.
But, in an era in which data is now scaling into
the petabyte range, BI and analytics are more
than technologies. They are a disruptive force. And,
with disruption comes new opportunities for
growth. Companies interested in capitaliz-
ing on the big data revolution need to move
forward with BI and analytics as a strategic
and tactical part of their business road map.
The benefits are profound—including vastly
accelerated business decisions and lower IT
costs. This will open new and often surprising
avenues to value. ■
Joe McKendrick is an author and independent researcher covering inno-vation, information tech-nology trends, and markets. Much of his research work is in conjunction with Uni-sphere Research, a division of Information Today, Inc. (ITI), for user groups including SHARE, the Oracle Applications Users Group, the Independent Oracle Users Group, and the International DB2 Users Group. He is also a regular contributor to Database Trends and Applications, published by ITI.
sponsored content
Big data continues to be a mystery to many companies. Industry research
validates our experience that there are five
major stages that companies go through
when working with big data. We call these
the 5 Es—Evading, Envisioning, Evaluating,
Executing, and Expanding—of the big data
journey. Today, approximately 40% of
companies are still in the Evading stage,
waiting to get the clarity, means and purpose
for tackling big data.
To provide some clarity on the subject,
here we present five essential technological
means needed for this inevitable journey. If
your purpose is to find measurable returns
from big data, any one of these will be
sufficient to begin tasting the value. When
blended together, these means will provide
an irresistible recipe for big data success.
1) ENABLE VISUAL SELF EXPLORATION OF DATA
Human beings are visual creatures.
Big data analytics is all about “seeing”
relationships, anomalies and outliers present
in large quantities of data. Advanced techniques to graph, map, and visualize data are therefore a core requirement.
Secondly, visualizations need to be
intuitive and easy to work with. Business
users need the control to define what data
will be visualized and iterate through ideas
to determine the best visual representation.
They need the flexibility to share their
output through web browsers, mobile apps,
email, and other presentation modes.
Finally, the tools used need to be highly
responsive to a user’s needs. Effective
analysis can only happen when users move
uninterrupted at the speed-of-thought with
every exploration.
2) DEMOCRATIZE ADVANCED ANALYTICS
Big data has no voice without analytics.
Often the reason to work with large
quantities of low-level data is to apply
sophisticated analytic models, which can
tease out valuable insights not readily
apparent in aggregated information.
In business, analytical
modeling is the job of trained
data scientists who use a variety
of tools for developing these
models. Frontline business
users do not have such skill, but
everyday decisions they make can
be vastly improved based on such
big data insights. Challenges arise
in this transfer of knowledge,
since most tools don’t typically
talk to one another.
Organizations can enable
data scientists and trained analysts to
easily transfer business insights to frontline
workers by adopting tools that can expose
the widest support for advanced analytics
and predictive techniques, either natively or
through open integration with other tools.
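The hand-off described above can be pictured in a few lines: a data scientist exports a model as plain data, and frontline-facing code scores against it. This is a rough sketch only; the logistic form, coefficients, and field names are illustrative assumptions, not any vendor's API.

```python
import json
import math

# Hypothetical hand-off artifact: a data scientist exports trained
# churn-model coefficients as plain JSON that frontline tools can load.
MODEL_JSON = json.dumps({
    "intercept": -1.2,
    "coefficients": {"support_tickets": 0.8, "months_inactive": 0.5},
})

def churn_risk(customer, model_json=MODEL_JSON):
    """Score a customer record with the exported model (returns 0..1)."""
    model = json.loads(model_json)
    z = model["intercept"]
    for feature, weight in model["coefficients"].items():
        z += weight * customer.get(feature, 0.0)
    return 1.0 / (1.0 + math.exp(-z))   # logistic link

risk = churn_risk({"support_tickets": 3, "months_inactive": 2})
```

Because the model travels as data rather than as a tool-specific object, any frontline application that can parse JSON can consume the insight.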
3) COMBINE DATA FROM MULTIPLE SOURCES
Organizations never keep all data in one
place. Even with big data storage like Hadoop,
businesses will be hard pressed to unify all data
under one roof, owing to the ever-proliferating
systems. To date, IT has solved this problem
by transforming and moving data between
sources, before analysis is conducted. In today’s
age, exponentially larger datasets make data
movement virtually impossible, especially
when organizations want to be more nimble
but keep costs in check.
New technologies allow business users to
blend data from multiple sources, in-place,
and without involving IT. IT can take this a
step further by providing a scalable analytic
architecture masking the data complexity while
providing common business terminology.
Such architecture will easily facilitate
analyses that span customer information,
sales transactions, cost data, service history,
marketing promotions and more.
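As a rough sketch of in-place blending, the idea is to join extracts from two silos on a shared key at analysis time, without first consolidating them into one store. The silo contents and field names here are invented for illustration.

```python
# Hypothetical extracts from two silos: a CRM system and a sales system.
crm = [
    {"customer_id": 1, "name": "Acme", "region": "West"},
    {"customer_id": 2, "name": "Globex", "region": "East"},
]
sales = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 1, "amount": 80.0},
    {"customer_id": 2, "amount": 40.0},
]

def blend(crm_rows, sales_rows):
    """Join the two sources on customer_id and total sales per customer."""
    totals = {}
    for row in sales_rows:
        totals[row["customer_id"]] = totals.get(row["customer_id"], 0.0) + row["amount"]
    return [
        {**c, "total_sales": totals.get(c["customer_id"], 0.0)}
        for c in crm_rows
    ]

blended = blend(crm, sales)
```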
4) GIVE STRUCTURE TO ACTIONABLE UNSTRUCTURED DATA
Unstructured data accounts for 80% of all data in a business. It typically comprises text-heavy formats such as internal documents, service records, web logs, and emails.
First, unstructured data has to be
structured to enable any analysis. While
trained analysts can do this interactively
at small scale, larger scale and general
access would demand an offline process.
Second, analysis of unstructured data will
often be useful only in conjunction with
other structured enterprise data. Third, the
insights from such analyses can be quite
amorphous. Unless businesses can take
concrete action based on the insights from
a certain unstructured source, its ROI will
be hard to justify.
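The first step above, giving structure to text, can be sketched with simple pattern extraction. The ticket format, product names, and regular expression below are invented; real pipelines would use more robust parsing or NLP.

```python
import re

# Hypothetical free-text service records to be structured for analysis.
records = [
    "Ticket #4412: customer reported outage on product Widget-X, resolved.",
    "Ticket #4413: billing question about product Widget-Y.",
]

# Assumed record shape: "Ticket #<id>: ... product <name> ..."
PATTERN = re.compile(r"Ticket #(?P<ticket>\d+):.*product (?P<product>[\w-]+)")

def structure(texts):
    """Turn raw text into rows that can join with structured data."""
    rows = []
    for text in texts:
        match = PATTERN.search(text)
        if match:
            rows.append({"ticket": int(match.group("ticket")),
                         "product": match.group("product")})
    return rows

rows = structure(records)
```

Once the text is reduced to rows like these, it can be joined with structured enterprise data, which is the second requirement the section describes.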
5) SET UP CONNECTIVITY TO REAL-TIME DATA
Not all big data use cases lend themselves
to real-time analysis. But some do. When
decisions need to be taken in real-time (or
near real-time), this capability becomes a
key success factor. Analytic solutions for
financial trading, customer service, logistics
planning, etc. can all be beneficiaries of tying
live actual data to historical information or
forecasted outcomes.
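Tying live data to historical information can be illustrated with a minimal sketch: compare each incoming reading against a historical baseline and flag large deviations as they arrive. The values and tolerance are stand-ins, not from any particular product.

```python
# Hypothetical historical readings (e.g., daily order counts).
historical = [100, 102, 98, 101, 99]
baseline = sum(historical) / len(historical)

def flag_anomalies(live_stream, baseline, tolerance=0.2):
    """Yield (value, is_anomaly) for each incoming reading."""
    for value in live_stream:
        deviation = abs(value - baseline) / baseline
        yield value, deviation > tolerance

results = list(flag_anomalies([101, 150, 97], baseline))
```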
In the end, big data analytics initiatives
are very much like traditional business
intelligence initiatives. These five technological
needs demand a significantly greater emphasis
for your big data journey. Will you stop
evading it now?
MICROSTRATEGY To learn how MicroStrategy can help craft solutions for your big data analytics needs, visit microstrategy.com/bigdatabook.
Five Key Pieces in the Big Data Analytics Puzzle
40 BIG DATA SOURCEBOOK 2013
Social media networks are creating large
datasets that are now enabling companies and
organizations to gain competitive advantage
and improve performance by understand-
ing customer needs and brand experience
in nearly real time. These datasets provide
important insights into real-time customer
behavior, brand reputation, and the over-
all customer experience. Intelligent or “data
analysis”-driven organizations are now mon-
itoring, and some are collecting, this data
from “proprietary social media networks,” such
as Salesforce Chatter and Microsoft Yammer
and “open social media networks” such as
LinkedIn, Twitter, Facebook, and others.
The majority of organizations today are
not harvesting and staging data from these
networks but are leveraging a new breed of
social media listening tools and social analyt-
ics platforms. Many are tapping their public
relations agencies to execute this new business
process. Smarter data-driven organizations
are extrapolating social media datasets and
performing predictive analytics in real time
and in-house.
There are, however, significant regula-
tory issues associated with harvesting, stag-
ing, and hosting social media data. These
regulatory issues apply to nearly all data
types in regulated industries such as health-
care and financial services in particular.
The SEC and FINRA with Sarbanes-Oxley
require different types of electronic com-
munications to be organized, indexed in
a taxonomy schema, and then be archived
and easily discoverable over defined time
periods. Data protection, security, gover-
nance, and compliance have entered an
entirely new frontier with introduction and
management of social data.
This article provides a broad overview of
the current state of analytical tools and plat-
forms that enable accelerated and real-time
decision making in organizations based on
customers. Social media is driving organi-
zational demand for insights on “customer
everything” in addition to BI and analytics
tools. Providing “enterprise BI” that includes social analytics will be a significant challenge to many enterprises in the near future. This is one of the primary reasons for the success of the new wave of innovative and easy-to-use BI and social media analytical tools within the last several years.
Social Media Analytic Tools and Platforms Offer Promise
By Peter J. Auditore
The State of Social Media
Analytic Tools Overview
In the beginning, there were SPSS and SAS Institute, the first analytical and statistical platforms to be computerized and go
mainstream in the early 1980s. There is no
way in my view you can talk about anything
analytical without mentioning them. When
I was a young marine scientist, these were
the first DOS-based analytical tools we used
to do basic statistical analysis in addition to
rudimentary predictive analytics employed to
forecast fisheries populations.
During the last 40 years, these platforms
evolved to include a host of new capabilities
and functionality and are now considered
business intelligence tools. For the last 20
years, the majority of business intelligence
tools accessed structured datasets in vari-
ous databases, however now that nearly 80%
of enterprise data is unstructured, many of
the BI platforms incorporate sophisticated
enterprise search capabilities that rely on
metadata, inferences, and connections to
multiple data sources. The vast majority
of social media data is unstructured, as we
know, and this presents significant chal-
lenges to many organizations in its overall
management: collection, staging, archiving,
analysis, governance, and security.
Many organizations today are leveraging
their legacy business intelligence tools and
platforms to perform analysis on social media
datasets, in addition to the use of sophisti-
cated tagging and automated taxonomy tools
that make search (finding the right contents
and/or objects) easier. The most basic and
easy analytical tool used by nearly everyone
is a simple alert, which combs/crawls the web
for topics related to your alert criteria.
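The simple-alert idea can be sketched in a few lines: scan incoming text items for alert criteria and surface the matches. Real alert services crawl the web; the items and keywords below are stand-ins.

```python
# Hypothetical alert criteria and incoming text items.
ALERT_KEYWORDS = {"outage", "recall", "lawsuit"}

def match_alerts(items, keywords=ALERT_KEYWORDS):
    """Return items whose text mentions any alert keyword."""
    hits = []
    for item in items:
        words = set(item.lower().split())
        if words & keywords:   # any overlap triggers the alert
            hits.append(item)
    return hits

hits = match_alerts([
    "Acme announces product recall in Europe",
    "Acme quarterly earnings beat expectations",
])
```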
Modern capabilities of business intelligence tools and platforms:
• Enterprise Search—structured and
unstructured data
• Ad Hoc Query Analysis and Reporting
• OLAP, ROLAP, MOLAP
• Data Mining
• Predictive and Advanced Analytics
• In-Database Analytics
• In-Memory Analytics
• Performance Management Dashboards
• Advanced Visualization, Modeling,
Simulation, and Scenario Planning
• Cloud and Mobile BI
Cloud-Based and Mobile BI and the New Innovative Business Intelligence Tools
Within the last several years, a new class
of BI tools has emerged including some open
source and cloud-based platforms/tools, some
of which are specialized for specific vertical
market segments or business processes. They
are easy-to-use, highly collaborative via work
flow, and some include standard and custom
reporting in addition to including some rudi-
mentary ETL tools. Mobile BI is one of the
fastest growing areas; however, many legacy
vendors have been slow to develop applica-
tions for BYOD, especially tablets.
These new products have innovative
semantic layers and new ways of visualizing
data, both structured and unstructured. In
some cases, these new tools tout the fact that
they can work with any database and don’t
require the building of a data warehouse or
data mart but provide access to any data any-
where. Innovative visualization dashboard
platforms and implementations have been
very attractive to business managers and have
found their way into many organizations, in
some cases, without the knowledge of the IT
department.
In-Memory
In-memory database technology, the next
major innovation in the world of business
intelligence and social media analytics,
is the game changer that will provide the
unfair advantage that leads to the compet-
itive advantage every CEO wants today.
In-memory technologies and built-in ana-
lytics are beginning to play major roles in
social analytics. The inherent business value
of in-memory technology revolves around
the ability to make real-time decisions based
on accurate information about seminal busi-
ness processes such as social media.
The ability to know and understand the
customer experience is paramount in the new
millennium as organizations strive to improve
customer service, keep customers loyal, and
gain greater insights into customer purchasing
patterns. This has become even more import-
ant as a result of social media and social media
networks that are now the new “word-of-
mouth platforms.” In-memory promises to
provide real-time data not only from transac-
tional systems but also to allow organizations to harvest and manage unstructured data from the social media sphere.
Predictive Analytics and Graph Databases
Graph databases are sometimes faster than SQL databases and greatly enhance and extend the capabilities of predictive analytics by incorporating multiple data points and interconnections across multiple sources in real time. Predictive analytics and graph
databases are a perfect fit for the social media
landscape where various data points are
interconnected.
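Why graphs fit social data can be shown with a minimal sketch: model follower relationships as edges and rank users by in-degree, a crude stand-in for "key influencer" scoring. The names and edges are hypothetical, and a real graph database would traverse far richer structures.

```python
# Hypothetical "follows" edges: (follower, followed).
follows = [
    ("alice", "carol"), ("bob", "carol"),
    ("carol", "dave"), ("alice", "dave"), ("eve", "carol"),
]

def rank_by_in_degree(edges):
    """Count incoming edges per node; most-followed users first."""
    in_degree = {}
    for _, target in edges:
        in_degree[target] = in_degree.get(target, 0) + 1
    return sorted(in_degree.items(), key=lambda kv: kv[1], reverse=True)

ranking = rank_by_in_degree(follows)
```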
Social media analytic tools enable businesses and organizations to enhance:
• Brand and sentiment analysis
• Identification and ranking of key
influencers
• Campaign tracking and measurement
• Product launches
• Product innovation through
crowdsourcing
• Digital channel influence
• Purchase intent analysis
• Customer care
• Risk management
• Competitive intelligence
• Partner monitoring
• Category analysis
The Social Media Listening Centers
Many organizations are just starting to
use social data, few are at the forefront, and
most are using off-the-shelf vendor products
to create social media listening/monitoring
centers. These platforms operate in real time
and visually display sentiment and brand
analysis for products and services. The major-
ity of organizations today are at this stage of
social analytics, and again, few appear to be
collecting, staging, and archiving data for
further analysis and predictive analytics.
Monitoring and performing predictive
analytics on social media datasets are the most
obvious and common uses of analytic solu-
tions today. Many solutions use natural lan-
guage processing in the indexing and staging of
social media data. Predictive analytics enable
a wide array of business functions including
marketing, sales, product development, com-
petitive intelligence, customer service, and
human resources to identify common and
unusual patterns and opportunities in the
unstructured world of social media data.
Social Media Analytical Tools
Social media analytical tools identify and
analyze text strings that contain targeted
search terms, which are then loaded into
databases or data staging platforms such as
Hadoop. This can enable database queries, for
example, by date, region, keyword, or sentiment.
This can then enable insights and analysis into
customer attitudes toward brand, product,
services, employees, and partners. The major-
ity of products work at multiple levels and drill
down into conversations with results depicted
in customizable charts and dashboards.
Often analytic results are provided in
customizable charts and dashboards that are
easy to visualize and interpret and can be
shared on enterprise collaborative platforms
for decision makers. Some social media ana-
lytic platforms integrate easily with existing
analytic platforms and business processes to
help you act on social media insights, which
can lead to improved customer satisfaction,
enhanced brand reputation, and can even
enable your organization to anticipate new
opportunities or resolve problems.
On the bleeding edge of social media
analytics is a new wave of tools and highly
integrated platforms that have emerged to
provide not only social media listening tools
but also enable organizations to understand
content preferences (or content intelligence)
by affinity groups and brands they are follow-
ing or trending. Some of the innovators tak-
ing social media data to a new level include
Attensity, InfiniGraph, Brandwatch, Bamboo,
Kapow, Crimson Hexagon, Sysomos, Simply
Measured, NetBase, and Gnip.
Current Use of Social Media BI Tools
In 2012, the SHARE users group and
Guide SHARE Europe conducted a Social
Media and Business Intelligence Survey, pro-
duced by Unisphere Research, a division of
Information Today, Inc., and sponsored by
IBM and Marist College. The survey, which
examined the current state of social media
data monitoring and collection and use of
business intelligence tools in more than 500
organizations, found that IBM, SAS, Oracle,
and SAP were the entrenched BI platform
market leaders. The majority of the sample
base indicated that they were not using third-
party BI tools for social media analytics.
What’s Ahead
The 2012 social media and BI survey data
still provide a relevant picture of the state of
social media analytics. A majority of organi-
zations will leverage legacy business intelli-
gence vendors with familiar semantic layers
to perform rudimentary social media data
analysis. The big issue is that line-of-busi-
ness managers will not wait for nonagile IT
departments to collect, harvest, stage/build,
and perform analytics on new social media
data marts or data warehouses.
New bleeding-edge social media analyt-
ical platforms are addressing the needs of
line-of-business professionals in real time.
They are also leveraging the economics of
utility computing and the cloud to bring cost-
effective analytical platforms to nearly all orga-
nizations. These highly integrated platforms
include simple social media listening tools,
along with embedded analytics and predictive
analytics that incorporate content and some-
times advertising abilities to meet the needs of
modern digital marketers. There are also other
new vendors that specialize in collecting and
delivering raw social media for those organi-
zations which are building their own in-house
social media analytics platforms.
Traditionally, marketing has always had
four P’s. Today, marketing has five P’s: prod-
uct, place, position, price, and people—
because in this millennium, the social media
network is the new platform for “word-of-
mouth marketing.” ■
Peter J. Auditore is currently the principal researcher at Asterias Research, a boutique consultancy focused on information management, traditional and social analytics, and big data ([email protected]). Auditore was a member of SAP’s Global Communications team for 7 years and most recently head of the SAP Business Influencer Group. He is a veteran of four technology startups: Zona Research (co-founder); Hummingbird (VP, marketing, Americas); Survey.com (president); and Exigen Group (VP, corporate communications).
sponsored content
Big data analytic opportunities are
abundant, with business value the driver.
According to Professors Andrew McAfee and Erik Brynjolfsson of MIT:
“Companies that inject big data and
analytics into their operations show
productivity rates and profitability
that are 5% to 6% higher than those
of their peers.”
DATA IS THE LIFEBLOOD OF ANALYTICS
Enterprises, flooded with a deluge of
data about their customers, prospects,
business processes, suppliers, partners and
competitors, understand data’s critical role
as the lifeblood of analytics.
THE ANALYTIC DATA CHALLENGE
However, integrating data consumes the better part of any analytic project, as variety and volume complexity constrain progress.
• Diverse data types—In the past, most
analytic data was tabular, typically
relational. That changed with the rise of
web services and other non-relational
and big data sources. Analysts must now
work with multiple data types, including
tabular, XML, key-value pairs and semi-
structured log data.
• Multiple interfaces and protocols—
Accessing data is now more complicated.
Before, analysts used ODBC to access a
database or a spreadsheet. Now, analysts
must access data through a variety of
protocols, including web services via
SOAP or REST, Hadoop data through
Hive, and other types of NoSQL data
via proprietary APIs.
• Larger data sets—Data sets are
significantly larger. Analysts can no
longer assemble all data in one place,
especially if that place is their desktop.
Analysts must be able to work with data
where it is, intelligently sub-setting it
and combining it with relevant data
from other high volume sources.
• Iterative analytic methods—Exploration and experimentation define the analytic process. Finding, accessing, and pulling together data is difficult enough on its own; continuous updating and reassembling of data sets is also a must.
CONSOLIDATING EVERYTHING, SLOW AND COSTLY
Providing analytics with the data
required has always been difficult, with
data integration long considered the biggest
bottleneck in any analytics or BI project.
No longer is consolidating all analytics
data into a data warehouse the answer.
When you need to integrate data from
new sources to perform a wider, more
far-reaching analysis, does it make sense
to create yet another silo that physically
consolidates other data silos?
Or is it better to federate these silos
using data virtualization?
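The federation alternative can be illustrated with a toy sketch: a "virtual view" that pulls matching rows from two silos at query time instead of copying everything into yet another store. This illustrates the concept only; it is not Cisco's suite, and the silos and fields are invented.

```python
# Two hypothetical silos that stay where they are.
silo_orders = [{"id": 1, "customer": "Acme", "total": 250.0},
               {"id": 2, "customer": "Globex", "total": 90.0}]
silo_support = [{"customer": "Acme", "open_tickets": 2},
                {"customer": "Globex", "open_tickets": 0}]

def virtual_customer_view(name):
    """Federated lookup: join the silos on demand, with no physical copy."""
    orders = [o for o in silo_orders if o["customer"] == name]
    tickets = next((s["open_tickets"] for s in silo_support
                    if s["customer"] == name), 0)
    return {"customer": name,
            "order_total": sum(o["total"] for o in orders),
            "open_tickets": tickets}

view = virtual_customer_view("Acme")
```

The consumer sees one combined answer, while each silo keeps serving its own workloads unchanged.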
DATA VIRTUALIZATION TO THE RESCUE
Cisco’s Data Virtualization Suite addresses
your difficult analytic data challenges.
• Rapid Data Gathering Accelerates Analytics Impact—Cisco’s nimble data discovery and access tools make it faster and easier to gather the data sets each new analytic project requires.
• Data Discovery Addresses Data Proliferation—Data discovery automates entity and relationship identification, accelerating data modeling so your analysts can better understand and leverage your distributed data assets.
• Query Optimization for Timely Business Insight—Optimization
algorithms and techniques deliver the
timely information your analytics require.
• Data Federation Provides the Complete Picture—Virtual data integration in
memory provides the complete picture
without the cost and overhead of
physical data consolidation.
• Data Abstraction Simplifies Complex Data—Data abstraction transforms
data from native structures to common
semantics your analysts understand.
• Analytic Sandbox and Data Hub Options Provide Deployment Flexibility—Data
virtualization supports your diverse
analytic requirements from ad hoc
analyses via sandboxes to recurring
analyses via data hubs.
• Data Governance Maximizes Control—
Built-in governance ensures data security,
data quality and 7x24 operations to
balance business agility with needed
controls.
• Layered Data Architecture Enables Rapid Change—Loose coupling and
rapid development tools provide the
agility required to keep pace with your
ever-changing analytic needs.
CONCLUSION The business value of analytics has never
been greater. But data volumes and variety
impact the velocity of analytic success.
Data virtualization helps overcome data
challenges to fulfill critical analytic data needs
significantly faster with far fewer resources
than other data integration techniques.
• Empower your people with instant access to
all the data they want, the way they want it
• Respond faster to your changing analytics
and business intelligence needs
• Reduce complexity and save money
Better analysis equals business advantage.
So take advantage of data virtualization.
LEARN MORE To learn more about Cisco’s data virtualization offerings for big data analytics, visit www.compositesw.com
Data Virtualization Brings Velocity and Value to Big Data Analytics
Big data is transforming both the
scope and the practice of data integration.
After all, the tools and methods of classic data
integration evolved over time to address the
requirements of the data warehouse and its
orbiting constellation of business intelligence
tools. In a sense, then, the single biggest change
wrought by big data is a conceptual one: Big
data has displaced the warehouse from its posi-
tion as the focal point for data integration.
The warehouse remains a critical system
and will continue to service a critical constit-
uency of users; for this reason, data integra-
tion in the context of data warehousing and
BI will continue to be important. Neverthe-
less, we now conceive of the warehouse as
just one system among many systems, as one
provider in a universe of providers. In this
respect, the impact of big data isn’t unlike that
of the Copernican Revolution: The universe,
after Copernicus, looked a lot bigger. The
same can be said about data integration after
big data: The size and scope of its projects—
to say nothing of the problems or challenges
it’s tasked with addressing—look a lot bigger.
This isn’t so much a function of the “big-
ness” of big data—of its celebrated volumes,
varieties, or velocities—as of the new use cases,
scenarios, projects, or possibilities that stem
from our ability to collect, process, and—most
important—to imaginatively conceive of “big”
data management. To say that big data is the
sum of its volume, variety, and velocity is a lot
like saying that nuclear power is simply and
irreducibly a function of fission, decay, and
fusion. It’s to ignore the societal and economic
factors that—for good or ill—ultimately deter-
mine how big data gets used. In other words,
if we want to understand how big data has
changed data integration, we need to consider
the ways in which we’re using—or in which we
want to use—big data.
Big Data Integration in Practice In this respect, no application—no use
case—is more challenging than that of
advanced analytics. This is an umbrella term
for a class of analytics that involves statistical
analysis, machine learning, and the use of new
techniques such as numerical linear algebra.
From a data integration perspective, what’s
most challenging about advanced analytics is
that it involves the combination of data from
an array of multistructured sources. “Multi-
structured” is a category that includes struc-
tured hierarchical databases (such as IMS
or ADABAS on the mainframe or—a recent
innovation—HBase on Hadoop); semistructured sources (such as graph and network databases, along with human-readable sources, including JSON, XML, and text documents);
and a host of so-called “unstructured” file
types—documents, emails, audio and video
recordings, etc. (The term “unstructured” is
misleading: Syntax is structure; semantics is
structure. Understood in this context, most
so-called unstructured artifacts—emails,
tweets, PDF files, even audio and video files—
have structure. Much of the work of the next
decade will focus on automating the profiling,
preparation, analysis, and—yes—integration
of unstructured artifacts.)
If all of this multistructured information is
to be analyzed, it needs to be prepared; how-
ever, the tools or techniques required to prepare
multistructured data for analysis far outstrip
the capabilities of the handiest tools (e.g., ETL)
in the data integration toolset. For one thing,
multistructured information can’t efficiently
or, more to the point, cost-effectively, be loaded
into a data warehouse or OLTP database. The warehouse, for example, is a schema-mandatory platform; it needs to store and manage information in terms of “facts” or “dimensions.” It is most comfortable speaking SQL, and to the extent that information from nonrelational sources (such as hierarchical
Big Data Is Transforming the Practice of Data Integration
By Stephen Swoyer
The State of Data Integration
databases, sensor events, or machine logs) can
be transformed into tabular format, they can
be expressed in SQL and ingested by the data
warehouse. But what about information from
all multistructured sources?
Enter the category of the “NoSQL” data
store, which includes a raft of open source soft-
ware (OSS) projects, such as the Apache Cassan-
dra distributed database, MongoDB, CouchDB,
and—last but not least—the Hadoop stack.
Increasingly, Hadoop and its Hadoop Distrib-
uted File System (HDFS) are being touted as an
all-purpose “landing zone” or staging area for
multistructured information.
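Before data staged in such a landing zone can reach a SQL platform, it must be flattened into tabular form. As a rough sketch, the snippet below flattens nested JSON events, of the kind that might land in HDFS, into flat rows a warehouse could ingest; the event shape is hypothetical.

```python
import json

# Hypothetical raw JSON events staged in a landing zone.
raw_events = [
    '{"user": {"id": 7, "country": "DE"}, "action": "click", "ts": 1690000000}',
    '{"user": {"id": 9, "country": "US"}, "action": "view", "ts": 1690000042}',
]

def flatten(event_json):
    """Turn one nested event into a flat, column-like dict."""
    event = json.loads(event_json)
    return {"user_id": event["user"]["id"],
            "country": event["user"]["country"],
            "action": event["action"],
            "ts": event["ts"]}

rows = [flatten(e) for e in raw_events]
```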
ETL Processing and Hadoop
Hadoop is a schema-optional platform; it
can function as a virtual warehouse—i.e., as
a general-purpose storage area—for informa-
tion of any kind. In this respect, Hadoop can
be used to land, to stage, to prepare, and—
in many cases—to permanently store data.
This approach makes sense because Hadoop
comes with its own baked-in data processing
engine: MapReduce.
For this reason, many data integra-
tion vendors now market ETL products
for Hadoop. Some use MapReduce itself to
perform ETL operations; others substitute
their own, ETL-optimized libraries for the
MapReduce engine. Traditionally, program-
ming for MapReduce is a nontrivial task:
MapReduce jobs can be coded in Java, Pig
Latin (the high-level language used by Pig, a
platform designed to abstract the complex-
ity of the MapReduce engine), Perl, Python,
and (using open source libraries) C, C++,
Ruby, and other languages. Moreover, using
MapReduce as an ETL technology also pre-
supposes a detailed knowledge of data man-
agement structures and concepts. For this rea-
son, ETL tools that support Hadoop usually
generate MapReduce jobs in the form of Java
code, which can be fed into Hadoop. In this
scheme, users design Hadoop MapReduce
jobs just like they’d design other ETL jobs or
workflows—in a GUI-based design studio.
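The shape of the jobs such tools generate follows the MapReduce pattern, which can be sketched in-process in a few lines: map raw records to (key, value) pairs, shuffle by key, then reduce. This is a single-machine illustration of the pattern, not actual Hadoop code, and the log format is invented.

```python
from itertools import groupby

# Hypothetical raw log lines to be transformed by an ETL-style job.
logs = [
    "2013-10-01 ERROR disk full",
    "2013-10-01 INFO started",
    "2013-10-02 ERROR timeout",
    "2013-10-01 ERROR disk full",
]

def map_phase(line):
    """Map: emit a ((date, level), 1) pair for each log line."""
    date, level = line.split()[:2]
    yield (date, level), 1

def run_job(lines):
    """Shuffle (sort by key) and reduce (sum counts per key)."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    pairs.sort()
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=lambda kv: kv[0])}

counts = run_job(logs)
```

On a cluster, the map and reduce functions run in parallel across nodes and the framework performs the shuffle, but the logical flow is the same.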
The benefits of doing ETL processing in
Hadoop are manifold: For starters, Hadoop
is a massively parallel processing (MPP) envi-
ronment. An ETL workload scheduled as a
MapReduce job can be efficiently distribut-
ed—i.e., parallelized—across a Hadoop
cluster. This makes MapReduce ideal for
crunching massive datasets, and, while the
sizes of the datasets used in decision support
workloads aren’t all that big, those used in
advanced analytic workloads are. From a data
integration perspective, they’re also consid-
erably more complicated, inasmuch as they
involve a mix of analytic methods and tradi-
tional data preparation techniques.
Let’s consider the steps involved in an
“analysis” of several hundred terabytes of
image or audio files sitting in HDFS. Before
this data can be analyzed, it must be pro-
filed; this means using MapReduce (or cus-
tom-coded analytic libraries) to run a series of
statistical and numerical analyses, the results
of which will contain information about the
working dataset. From there, a series of tra-
ditional ETL operations—performed via
MapReduce—can be used to prepare the data
for additional analysis.
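The profile-then-prepare sequence just described can be sketched in plain Python rather than MapReduce: a first statistical pass summarizes the working dataset, and a second ETL pass is driven by that profile. The data and the two-sigma cutoff are illustrative assumptions.

```python
# Hypothetical working dataset (e.g., sensor readings staged in HDFS).
readings = [4.0, 5.0, 5.5, 4.5, 99.0, 5.2]

def profile(values):
    """First pass: summary statistics about the working dataset."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return {"mean": mean, "std": variance ** 0.5}

def prepare(values, stats, max_sigma=2.0):
    """Second pass: drop values more than max_sigma std devs from the mean."""
    return [v for v in values
            if abs(v - stats["mean"]) <= max_sigma * stats["std"]]

stats = profile(readings)
clean = prepare(readings, stats)
```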
There’s still another benefit to doing ETL
processing in Hadoop: The information is
already there. It has an adequate—though
by no means spectacular—data management
toolset. For example, Hive, an interpreter
that compiles its own language (HiveQL)
into Hadoop MapReduce jobs, exposes a
SQL-like query facility; HBase is a hierarchi-
cal data store for Hadoop that supports high
user concurrency levels as well as basic insert
and update operations. Finally, HCatalog is a
primitive metadata catalog for Hadoop.
Data Integration Use Cases
Right now, most data integration use cases
involve getting information out of Hadoop.
This is chiefly because Hadoop’s data manage-
ment feature set is primitive compared to those
of more established platforms. Hadoop, for
example, isn’t ACID-compliant. In the advanced
analytic example cited above, a SQL platform—
not Hadoop—would be the most likely des-
tination for the resultant dataset. Almost all
database vendors and a growing number of
analytic applications boast connectivity of some
kind into Hadoop. Others promote the use of
Hadoop as a kind of queryable archive. This
use case could involve using Hadoop to per-
sist historical data—e.g., “cold” or infrequently
accessed data that (by virtue of its sheer volume)
could impact the performance or cost of a data
warehouse. Still another emerging scenario
involves using Hadoop as a repository in which
to persist the raw data that feeds a data ware-
house. In traditional data integration, this data
is often staged in a middle tier, which can con-
sist of an ETL repository or an operational data
store (ODS). On a per-gigabyte or per-terabyte
basis, both the ETL and ODS stores are more
expensive than Hadoop. In this scheme, some
or all of this data could be shifted into Hadoop,
where it could be used to (inexpensively) aug-
ment analytic discovery (which prefers denor-
malized or raw data) or to assist with data ware-
house maintenance—e.g., in case dimensions
are added or have to be rekeyed.
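The tiering idea behind the queryable archive can be sketched in a few lines: age out cold rows to a cheap archive tier while keeping them reachable by queries. This pure-Python toy (the schema and the 90-day cutoff are illustrative assumptions, not from the article) shows the mechanics:

```python
# A toy two-tier "queryable archive": rows older than a cutoff are offloaded
# from the expensive warehouse tier to a cheap Hadoop-like archive tier,
# yet both tiers remain visible to queries.
from datetime import date, timedelta

TODAY = date(2013, 11, 1)
CUTOFF = TODAY - timedelta(days=90)  # anything older is "cold"

orders = [
    {"id": 1, "placed": date(2013, 10, 20), "total": 50.0},
    {"id": 2, "placed": date(2013, 3, 14), "total": 75.0},   # cold
    {"id": 3, "placed": date(2012, 12, 1), "total": 20.0},   # cold
]

warehouse = [o for o in orders if o["placed"] >= CUTOFF]
archive = [o for o in orders if o["placed"] < CUTOFF]  # offloaded tier

def query_all(predicate):
    """A query that spans both tiers, so offloaded history stays reachable."""
    return [o for o in warehouse + archive if predicate(o)]

print(len(warehouse), len(archive))                        # 1 2
print(sum(o["total"] for o in query_all(lambda o: True)))  # 145.0
```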
Still another use case involves offloading workloads from Hadoop to SQL analytic platforms. Some of these platforms are able to execute analytic algorithms inside their database engines. Some SQL DBMS vendors claim that an advanced analysis will
run faster on their own MPP platforms than on Hadoop using MapReduce. They note that MapReduce is a brute-force data processing tool, and while it’s ideal for certain kinds of workloads, it’s far from ideal as a general-purpose compute engine. This is why so much Hadoop development work has focused on YARN (Yet Another Resource Negotiator), which will permit Hadoop to schedule, execute, and manage non-MapReduce jobs.

The benefits of doing so are manifold, especially from a data integration perspective. First, even though some ETL tools run in Hadoop and replace MapReduce with their own engines, Hadoop itself provides no native facility to schedule or manage non-MapReduce jobs. (Hadoop’s existing JobTracker and TaskTracker paradigm is tightly coupled to the MapReduce compute engine.) Second, YARN should permit users to run optimized analytic libraries in the Hadoop environment, much as the SQL analytic database vendors do. This promises to be faster and more efficient than the status quo, which involves coding analytic workloads as MapReduce jobs. Third, YARN could help stem the flow of analytic workloads out of Hadoop and encourage analytic workloads to be shifted from the SQL world into Hadoop. Even though it might be faster to run an analytic workload in an MPP database platform, it probably isn’t cheaper relative to running the same workload in Hadoop.
Alternatives to Hadoop

But while big data is often discussed through the prism of Hadoop, owing to the popularity and prominence of that platform, alternatives abound. Among NoSQL platforms, for example, there’s Apache Cassandra, which is able to host and run Hadoop MapReduce workloads and which, unlike Hadoop with its single-point-of-failure NameNode, has no single point of failure. There’s also Spanner, Google’s successor to BigTable. Google runs its F1 DBMS, a SQL- and ACID-compliant database platform, on top of Spanner; the combination has already garnered the sobriquet “NewSQL.” (And F1, unlike Hadoop, can be used as a streaming database. Here and elsewhere, Hadoop’s file-based architecture is a significant constraint.) Remember, a primary contributor to Hadoop’s success is its cost: as an MPP storage and compute platform, Hadoop is significantly less expensive than existing alternatives. But Hadoop by itself isn’t ACID-compliant and doesn’t expose a native SQL interface. To the extent that technologies such as F1 address existing data management requirements, enable scalable parallel workload processing, and expose more intuitive programming interfaces, they could be compelling alternatives to Hadoop.
What’s Ahead

Big data, along with related technologies such as Hadoop and other NoSQL platforms, is just one of several destabilizing forces on the IT horizon. Other technologies are changing the practice of data integration, too, such as the shift to the cloud and the emergence of data virtualization.

Cloud will change how we consume and interact with (and, for that matter, what we expect of) applications and services. From a data integration perspective, cloud, like big data, entails its own set of technological, methodological, and conceptual challenges. Traditional data integration evolved in a client-server context; it emphasizes direct connectivity between resources, e.g., a requesting client and a providing server. The conceptual model for cloud, on the other hand, is that of representational state transfer, or REST. In place of client-server’s emphasis on direct, stateful connectivity between resources, REST emphasizes abstract, stateless connectivity, and it prescribes the use of new and nontraditional APIs and interfaces. Traditional data integration uses interfaces such as ODBC and JDBC, together with SQL, to query for and return a subset of source data. REST components, on the other hand, structure and transfer information in the form of documents (e.g., HTML, XML, or JSON) that are representations of a subset of source data. For this reason, data integration in the context of the cloud entails new constraints, makes use of new tools, and will require the development of new practices and techniques.
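The contrast can be made concrete: a traditional integration call returns a rowset over a direct connection, while a REST-style exchange transfers a stateless document that represents the same subset of source data. In this sketch the table, field names, and resource path are illustrative assumptions:

```python
# SQL-style rowset vs. REST-style representation of the same data subset.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, region TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, "Acme", "east"), (2, "Globex", "west"), (3, "Initech", "east")])

# Traditional integration: a direct query returns a rowset.
rows = conn.execute(
    "SELECT id, name FROM customers WHERE region = ?", ("east",)).fetchall()

# REST-style integration: the same subset serialized as a self-describing
# JSON document, as a resource like /customers?region=east might return it.
document = json.dumps({
    "resource": "/customers?region=east",
    "items": [{"id": i, "name": n} for i, n in rows],
})

print(document)
```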
That said, the shift to the cloud doesn’t mean throwing out existing best practices. If you want to run sales analytics on data in your Salesforce.com cloud, you’ve either got to load it into an existing, on-premises repository or, alternatively, expose it to a cloud analytics provider. In the former case, you’re going to have to extract your data from Salesforce, prepare it, and load it into the analytic repository of your choice, much as you would with data from any other source. The shift to the cloud isn’t going to mean the complete abandonment of on-premises systems; the two will coexist.
Data virtualization, or DV, is another technology that should be of interest to data integration practitioners. DV could play a role in knitting together the fabric of the post-big data, post-cloud application-scape. Traditionally, data integration was practiced under fairly controlled conditions: most systems (or most consumables, in the case of flat files or files uploaded via FTP) were internal to an organization, i.e., accessible via a local area network. In the context of both big data and the cloud, data integration is a far-flung practice. Data virtualization technology gives data architects a means to abstract resources, regardless of architecture, connectivity, or physical location.

Conceptually, DV is REST-esque in that it exposes canonical representations (i.e., so-called business views) of source data. In most cases, in fact, a DV business view is a representation of subsets of data stored in multiple distributed systems. DV can provide a virtual abstraction layer that unifies resources strewn across, and outside of, the information enterprise, from traditional data warehouse systems to Hadoop and other NoSQL platforms to the cloud. DV platforms are polyglot: they speak SQL and other data access languages over interfaces such as ODBC and JDBC, support procedural languages such as Java, and (of course) expose REST APIs.
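A toy model of a DV business view makes the idea concrete: one virtual relation federates several backend sources, and the filter is evaluated at each source so that only qualifying rows move. The source names and fields here are illustrative assumptions:

```python
# A toy "business view" federating two backend sources with the filter
# pushed down to each source, so only matching rows cross the wire.

SOURCES = {
    "warehouse": [  # e.g., rows from a traditional data warehouse
        {"customer": "Acme", "revenue": 120, "region": "east"},
        {"customer": "Globex", "revenue": 75, "region": "west"},
    ],
    "hadoop": [     # e.g., raw records landed in Hadoop
        {"customer": "Initech", "revenue": 40, "region": "east"},
    ],
}

def scan(source, predicate):
    """Stand-in for a per-source query; the predicate runs at the source,
    so only qualifying rows are returned."""
    return [row for row in SOURCES[source] if predicate(row)]

def business_view(predicate):
    """The canonical business view: a union of all sources, filtered
    remotely rather than after a bulk copy."""
    rows = []
    for source in SOURCES:
        rows.extend(scan(source, predicate))
    return rows

east = business_view(lambda row: row["region"] == "east")
print([r["customer"] for r in east])  # ['Acme', 'Initech']
```

A real DV platform would additionally translate the predicate into each source’s native dialect (SQL, HiveQL, a REST query string, and so on).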
Moreover, DV’s prime directive is to move as little data as possible. As data volumes scale into the petabyte range, data architects must be alert to the practical physics of data movement: it’s difficult, if not impossible, to move even a subset of a multi-petabyte repository in a timely or cost-effective manner. ■
Stephen Swoyer is a technology writer with more than 15 years of experience. His writing has focused on business intelligence, data warehousing, and analytics for almost a decade. He’s particularly intrigued by the thorny people and process problems most BI and DW vendors almost never want to acknowledge, let alone talk about. You can contact him at [email protected].
The State of Data Integration
industry directory
DBTA.COM 51
Appfluent transforms the economics of Big Data and Hadoop.
Appfluent provides IT organizations with unprecedented visibility into
usage and performance of data warehouse and business intelligence
systems. IT decision makers can view exactly which data is being
used or not used, determine how business intelligence systems are
performing and identify causes of database performance issues.
With Appfluent, enterprises can address exploding data growth with
confidence, proactively manage performance of BI and data warehouse
systems, and realize the tremendous economies of Hadoop.
Learn more at www.appfluent.com.
APPFLUENT TECHNOLOGY, INC.
6001 Montrose Road, Suite 1000
Rockville, MD 20852
301-770-2888
www.appfluent.com
Attunity is a leading provider of data integration software solutions
that make Big Data available where and when needed across
heterogeneous enterprise platforms and the cloud. Attunity solutions
accelerate mission-critical initiatives including BI/Big Data Analytics,
Disaster Recovery, Content Distribution and more. Solutions include
data replication, change data capture (CDC), data connectivity,
enterprise file replication (EFR), managed-file-transfer (MFT), and cloud
data delivery. For 20 years, Attunity has supplied innovative software
solutions to thousands of enterprise-class customers worldwide to
enable real-time access and availability of any data, anytime, anywhere
across the maze of systems making up today’s IT environment.
Learn more at www.attunity.com.
ATTUNITY
www.attunity.com
SEE OUR AD ON
PAGE 49
CodeFutures is the provider of dbShards, the Big Data platform that
makes your database scalable and reliable. dbShards is not a database
—instead dbShards works with proven DBMS engines you know
and trust. dbShards gives your application transparent access to
one or more DBMS engines, providing the Big Data scalability,
High-Availability, and Disaster Recovery you need for demanding
“always-on” operation. You can even use dbShards to seamlessly
migrate your database from one environment to another—between
regions, cloud vendors and your own data center.
For more information, go to www.dbshards.com.
CODEFUTURES CORPORATION
11001 West 120th Avenue, Suite 400
Broomfield, CO 80021
(303) 625-4084
www.dbshards.com
Composite Software, now part of Cisco, is the data virtualization
market leader. Hundreds of organizations use the Composite Data
Virtualization Platform’s streamlined approach to data integration
to gain more insight from their data, respond faster to ever changing
analytics and BI needs, and save 50–75% over data replication and
consolidation.
Cisco Systems, Inc. completed the acquisition of Composite Software, Inc.
on July 29, 2013.
COMPOSITE SOFTWARE
Please call us: (650) 227-8200
Follow us on Twitter: http://twitter.com/compositesw
www.compositesw.com
SEE OUR AD ON
PAGE 35
Datawatch is the leading provider of visual data discovery solutions that allow organizations to optimize the use of any information, whether it is structured, unstructured, or semi-structured data locked in content like static reports, PDF files, and EDI streams, or in real-time sources like CEP engines, tick feeds, and machine data. Through an unmatched visual data discovery environment and the industry’s leading information optimization software, Datawatch allows you to utilize ALL data to deliver a complete picture of your business from every aspect and then manage, secure, and deliver that information to transform business processes, increase visibility to critical Big Data sources, and improve business intelligence applications offering broader analytical capabilities.
Datawatch provides the solution to Get the Whole Story!
DATAWATCH CORPORATION 271 Mill Road, Quorum Office Park Chelmsford, MA 01824 978-441-2200 [email protected]
www.datawatch.com
SEE OUR AD ON
PAGE 11
DataMentors provides award-winning data quality and database marketing solutions. Offered as either a customer-premises installation or an ASP-delivered solution, DataMentors leverages proprietary data discovery, analysis, campaign management, data mining and modeling practices to enable proactive, knowledge-driven decisions.
DataFuse, DataMentors’ data quality and integration solution, is
consistently recognized by industry-leading analysts for its extreme
flexibility and ease of householding. DataMentors’ marketing database
solution, PinPoint, quickly and accurately analyzes, segments, and
profiles customers’ preferences and behaviors. DataMentors also
offers social media marketing, drive time analysis, email marketing,
data enhancements and behavior models to further enrich the customer
experience across all channels.
DATAMENTORS
2319-104 Oak Myrtle Lane
Wesley Chapel, FL 33544
Phone: 813-960-7800
Email: [email protected]
www.DataMentors.com
SEE OUR AD ON
PAGE 47
Delphix delivers agility to enterprise application projects, addressing
the largest source of inefficiency and inflexibility in the datacenter—
provisioning, managing, and refreshing databases for business-critical
applications. With Delphix in place, QA engineers spend more time
testing and less time waiting for new data, increasing utilization of
expensive test infrastructure. Analysts and managers make better
decisions with fresh data in data marts and warehouses. Leading global
organizations use Delphix to dramatically reduce the time, cost, and risk
of application rollouts, accelerating packaged and custom applications
projects and reporting.
DELPHIX
275 Middlefield Road
Menlo Park, CA 94025
www.delphix.com
Denodo is the leader in data virtualization. Denodo enables hybrid data
storage for big data warehouse and analytics—providing unmatched
performance, unified virtual access to the broadest range of enterprise,
big data, cloud and unstructured sources, and agile data services
provisioning—which has allowed reference customers in every major
industry to minimize the cost and pitfalls of big data technology
and accelerate its adoption and value by making it transparent to
business users. Denodo is also used for cloud integration, single-view
applications, and RESTful linked data services. Founded in 1999,
Denodo is privately held.
DENODO TECHNOLOGIES
www.denodo.com
DBMoto® is the preferred solution for heterogeneous Data Replication and Change Data Capture requirements in an enterprise environment. Whether replicating data to a lower TCO database, synchronizing data among disparate operational systems, creating a new columnar or high-speed analytic database or data mart, or building a business intelligence application, DBMoto is the solution of choice for fast, trouble-free, easy-to-maintain Data Replication and Change Data Capture projects. DBMoto is mature and approved by enterprises ranging from midsized to Fortune 1000 worldwide. HiT Software®, Inc., a BackOffice Associates® LLC Company, is based in San Jose, CA.
For more information see www.info.hitsw.com/DBTA-bds2013/
HIT SOFTWARE, INC., A BACKOFFICE ASSOCIATES LLC COMPANY Contact: Giacomo Lorenzin 408-345-4001 [email protected]
www.hitsw.com
SEE OUR AD ON
PAGE 37
Nearly 80% of all existing data is generally only available in unstructured form and does not contain additional, descriptive metadata. This content, therefore, cannot be machine-processed automatically with conventional IT. It demands human interaction for interpretation, which is impossible to achieve when faced with the sheer volume of information. Based on the highly scalable Information Access System, Empolis offers methods for analyzing unstructured content perfectly suitable for a wide range of applications. For instance, Empolis technology is able to semantically annotate and process an entire day of traffic on Twitter in less than 20 minutes, or the German version of Wikipedia in three minutes. In addition to statistical algorithms, this also covers massive parallel processing utilizing linguistic methods for information extraction. These, in turn, form the basis for our Smart Information Management solutions, which transform unstructured content into structured information that can be automatically processed with the help of content analysis.
EMPOLIS INFORMATION MANAGEMENT GMBH Europaallee 10 | 67657 Kaiserslautern | Germany Phone +49 631 68037-0 | Fax +49 631 68037-77 [email protected]
www.empolis.com
Kapow Software, a Kofax company, harnesses the power of legacy data
and big data, making it actionable and accessible across organizations.
Hundreds of large global enterprises including Audi, Intel, Fiserv,
Deutsche Telekom, and more than a dozen federal agencies rely on its
agile big data integration platform to make smarter decisions, automate
processes, and drive better outcomes faster. They leverage the platform
to give business consumers a flexible 360-degree view of information
across any internal and external source, providing organizations with a
data-driven advantage.
For more information, please visit: www.kapowsoftware.com.
KAPOW SOFTWARE
260 Sheridan Avenue, Suite 420
Palo Alto, CA 94306
Phone: +1 800 805 0828
Fax: +1 650 330 1062
Email: [email protected]
www.kapowsoftware.com
HPCC Systems® from LexisNexis® is an open-source, enterprise-ready
solution designed to help detect patterns and hidden relationships in
Big Data across disparate data sets. Proven for more than 10 years,
HPCC Systems helped LexisNexis Risk Solutions scale to a $1.4 billion
information company now managing several petabytes of data on a
daily basis from 10,000 different sources.
HPCC Systems was built for small development teams and offers a single
architecture and one programming language for efficient data processing
of large or complex queries. Customers, such as financial institutions,
insurance companies, law enforcement agencies, federal government and
other enterprise organizations, leverage the HPCC Systems technology
through LexisNexis products and services. HPCC Systems is available
in an Enterprise and Community version under the Apache license.
LEXISNEXIS
Phone: 877.316.9669
www.hpccsystems.com
www.lexisnexis.com/risk
SEE OUR AD ON
PAGE 43
Founded in 1989, MicroStrategy (Nasdaq: MSTR) is a leading
worldwide provider of enterprise software platforms. Millions of users
use the MicroStrategy Analytics Platform™ to analyze vast amounts
of data and distribute actionable business insight throughout the
enterprise. Our analytics platform delivers interactive dashboards and
reports that users can access and share via web browsers, information-
rich mobile apps, and inside Microsoft® Office applications. Big data
analytics delivered with MicroStrategy will enable businesses to
analyze big data visually without writing code and apply advanced
analytics to obtain deep insights from all of their data.
To learn more and try MicroStrategy free, visit
microstrategy.com/bigdatabook.
MICROSTRATEGY
1850 Towers Crescent Plaza
Tysons Corner, VA 22182 USA
Phone: 888.537.8135
Email: [email protected]
www.microstrategy.com/bigdatabook
SEE OUR AD ON
COVER 4
Since 1988, Objectivity, Inc. has been the Enterprise NoSQL leader,
helping customers harness the power of Big Data. Our leading edge
technologies: InfiniteGraph, The Distributed Graph Database™,
and Objectivity/DB, a distributed and scalable object management
database, enable organizations to discover hidden relationships for
improved Big Data analytics and develop applications with significant
time-to-market advantages and technical cost savings, achieving
greater return on data related investments. Objectivity, Inc. is
committed to our customers’ success, with representatives worldwide.
Our clients include: AWD Financial, CUNA Mutual, Draeger Medical,
Ericsson, McKesson, IPL, Siemens and the US Department of Defense.
OBJECTIVITY, INC.
3099 North First Street, Suite 200
San Jose, CA 95134 USA
408-992-7100
www.objectivity.com
SEE OUR AD ON
PAGE 7
Progress DataDirect provides high-performance, real-time connectivity
to applications and data deployed anywhere. From SaaS applications
like Salesforce to Big Data sources such as Hadoop, DataDirect
makes these sources appear just like a regular relational database.
Whether you are connecting your own application, or your favorite BI
and reporting tools, DataDirect makes it easy to access your critical
business information.
More than 300 leading independent software vendors embed Progress
Software’s DataDirect components in over 400 commercial products.
Further, 96 of the Fortune 100 turn to Progress Software’s DataDirect
to simplify and streamline data connectivity.
PROGRESS DATADIRECT
www.datadirect.com
SEE OUR AD ON
PAGE 41
Percona has made MySQL and integrated MySQL/big data solutions
faster and more reliable for over 2,000 customers worldwide. Our
experts help companies integrate MySQL with big data solutions
including Hadoop, HBase, Hive, MongoDB, Vertica, and Redis.
Percona provides enterprise-grade Support, Consulting, Training,
Remote DBA, and Server Development services for MySQL or
integrated MySQL/big data deployments. Our founders authored the
book High Performance MySQL and the MySQL Performance Blog.
We provide open source software including Percona Server, Percona
XtraDB Cluster, Percona Toolkit, and Percona XtraBackup. We also
host Percona Live conferences for MySQL users worldwide.
For more information, visit www.percona.com.
PERCONA
www.percona.com
TransLattice provides its customers corporate-wide visibility,
dramatically improved system availability, simple scalability and
significantly reduced deployment complexity, all while enabling data
location compliance. Computing resources are tightly integrated to
enable enterprise databases to be spread across an organization as
needed, whether on-premise or in the cloud, providing data where and
when it is needed. Nodes work seamlessly together and if a portion
of the system goes down, the rest of the system is not affected. Data
location is policy-driven, enabling proactive compliance with regulatory
requirements. This simplified approach is fundamentally more reliable,
more scalable, and more cost-effective than traditional approaches.
TRANSLATTICE
+1 408 749-8478
www.TransLattice.com
SEE OUR AD ON
PAGE 9
Splice Machine is the only transactional SQL-on-Hadoop database
for real-time Big Data applications. Splice Machine provides all the
benefits of NoSQL databases, such as auto-sharding, scalability, fault
tolerance and high availability, while retaining SQL—the industry
standard. It optimizes complex queries to power real-time OLTP and
OLAP apps at scale without rewriting existing SQL-based apps and
BI tool integrations. Splice Machine provides fully ACID transactions
and uses multiversion concurrency control (MVCC) with lockless
snapshot isolation to enable real-time database updates with very
high throughput.
SPLICE MACHINE
www.splicemachine.com
SEE OUR AD ON
COVER 2
From the publishers of
Each issue of DBTA features original and valuable content—providing you with clarity, perspective, and objectivity in a complex and exciting world where data assets hold the key to organizational competitiveness.
Don’t miss an issue! Subscribe FREE* today!
*Print edition free to qualified U.S. subscribers.
Need Help Unlocking the Full Value of Your Information?
DBTA magazine is here to help.
Best Practices and Thought Leadership Reports

Get the inside scoop on the hottest topics in data management and analysis:

- Big Data technologies, including Hadoop, NoSQL, and in-memory databases
- Increasing efficiency through cloud technologies and services
- Solving complex data and application integration challenges
- Tools and techniques reshaping the world of business intelligence
- Key strategies for increasing database performance and availability
- New approaches for agile data warehousing
For information on upcoming reports: http://iti.bz/dbta-editorial-calendar
To review past reports: http://iti.bz/dbta-whitepapers