


Extreme Data: Challenges and Opportunities for Large-Scale Data Warehousing, BI and Analytics

Gartner The Future of IT Conference

October 4-6, 2011, Centro Banamex, Mexico City, Mexico

Donald Feinberg

Notes accompany this presentation. Please select Notes Page view. These materials can be reproduced only with written approval from Gartner. Such approvals must be requested via email: [email protected]. Gartner is a registered trademark of Gartner, Inc. or its affiliates.

This presentation, including any supporting materials, is owned by Gartner, Inc. and/or its affiliates and is for the sole use of the intended Gartner audience or other authorized recipients. This presentation may contain information that is confidential, proprietary or otherwise legally protected, and it may not be further copied, distributed or publicly displayed without the express written permission of Gartner, Inc. or its affiliates. © 2011 Gartner, Inc. and/or its affiliates. All rights reserved.

Gartner is a registered trademark of Gartner, Inc. or its affiliates.

Extreme Data: Challenges and Opportunities for Large-Scale Data Warehousing, BI and Analytics

Tactical Guideline: The growth and diversity of available information from many sources will create new opportunities for applications that yield major business value through content analytics, predictive analytics and real-time analysis using sophisticated data services.

Information challenges are hampering organizations' ability to run, grow and transform the business. The volume of available information alone is overwhelming. The problem of determining which information is useful and how long it is useful increases the stress on current systems. Many end users say they do not understand the potential value of information until it is used. But that is no different than someone boasting about an expensive painting the person has never tried to sell. Nothing is worth anything until it is used for a benefit somewhere. Keeping everything in a giant information landfill is a "hoarder" principle, and hoarding is born of fear. In this case, fear comes from a deep awareness that when business strategies and tactics change, so does the value of every piece of information. In other words, the valuation of information assets changes over time, just like any other commodity or asset. But it has no value until it is used. But just how big is this problem?

• According to an OAUG survey, 33% of respondents indicate data growth contributes to performance issues at least "most of the time," and 54% more indicate growth is an issue "some of the time." (Information, Unplugged, by Joseph McKendrick, Analyst; produced by Unisphere Research, a division of Information Today, December 2009; sponsored by the 2009 OAUG ResearchLine Survey on Enterprise Application Information Lifecycle Management.)

• "… managing storage effectively may cost 3-10 times what it cost to procure." Also, "The average company keeps 20-40 duplicates of its data." (Data Growth Challenges Demand Proactive Data Management, Adrian, IBM, November 2009. ftp://public.dhe.ibm.com/common/ssi/sa/wh/n/iml14218usen/IML14218USEN.PDF)

• Gartner estimates that 85% of data is unstructured. Growth of file content outside traditional databases is typically predicted to track database growth closely. In a recent Enterprise Strategy Group study, 36% of surveyed organizations predicted over 41% annual growth in e-mail archives alone.

MEX38L_115, 10/11

• EMC has a data growth flash ticker indicating how fast the data volume in the world is growing based on their data about storage volumes in use around the world at http://www.emc.com/leadership/digital-universe/expanding-digital-universe.htm.

Extreme Data: Challenges and Opportunities for Large-Scale Data Warehousing, BI and Analytics

Strategic Planning Assumption: Through 2015, organizations integrating high-value, diverse, new information types and sources into a coherent information management infrastructure will outperform their industry peers financially by more than 20%.

By 2015, organizations using traditional parse, annotate and store approaches to link unstructured information assets to structured assets will witness a sixfold increase in asset volume and costs to process and manage it. While the ratio of structured to unstructured data is not known in the IT market today, it is generally assumed unstructured data volumes hold more than a 20% advantage (60/40); most industry professionals place it much higher (75/25). Without statistical data on adoption rate, this is a difficult position. However, based on market growth for text analytics, text mining and the acquisition rate of data management vendors acquiring unstructured analytics tools, it is a solid conjecture that a minimal amount of e-mail, document or CRM annotation analysis will occur in more than half of BI analytics by 2015. By 2015, the market adoption of unstructured information assets in performance management and business intelligence analytics will create demand for a $1 billion software market for real-time unstructured asset analysis software. The reasoning is that converting and tagging will give way to leaving data in place to analyze on demand (leveraging processors, networks and fast storage). The total worldwide volume of data is growing at 59% per year, with the number of files growing at 88% per year. Even the familiar vocabulary used for data quantities has expanded from kilobytes to megabytes and then gigabytes, and soon will include terabytes. Nearly every gadget today is generating data: cell phones, personal digital assistants (PDAs), cars, and even toys.

IT personnel have become so inundated with data growth statistics that they haven't digested the implications of these facts past the point of increasing storage capacity. Planning for and providing more storage capacity is only one of many tasks needed for managing this morass of data. Today's data management challenges pale in comparison to those of tomorrow. Moore's Law supports the idea that computing power and storage capacities have been increasing exponentially, doubling every 18 to 24 months. That's an increase of about 1,000 times every 20 years. Despite this increase, data is now growing at a faster rate than storage capacity and computing power. Some projections expect that, by 2012, only 51% of the data produced will be efficiently collated and stored; by 2017, that figure drops to only 28%.
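The arithmetic behind these figures is easy to verify. The following is a quick sketch; the 59% annual data-growth rate and the 18-to-24-month doubling period are the values cited in this presentation:

```python
# Verify the growth arithmetic cited above.

def annual_factor(months_to_double):
    """Annual growth factor implied by doubling every `months_to_double` months."""
    return 2 ** (12 / months_to_double)

# Doubling every 18-24 months means roughly 41%-59% growth per year.
print(round(annual_factor(24), 2))  # 1.41
print(round(annual_factor(18), 2))  # 1.59

# "About 1,000 times every 20 years": ten doublings at the 24-month pace.
print(2 ** (20 * 12 / 24))          # 1024.0

# Data volume growing 59% per year outpaces capacity that doubles only
# every 24 months, which is why data growth outstrips storage and compute.
print(1.59 > annual_factor(24))     # True
```

Note that at the fastest (18-month) doubling pace, capacity growth and the cited 59% data growth are roughly equal; it is at the slower end of the range that the gap opens up.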

Extreme Data: Challenges and Opportunities for Large-Scale Data Warehousing, BI and Analytics

Big data has such a vast size that it exceeds the capacity of traditional data management technologies; it requires the use of new or exotic technologies simply to manage the volume alone. But processing matters, too. A complex statistical model can make a 300GB database seem bigger than a 110TB database, even if both are running on multicore, distributed parallel processing platforms. Big data has quickly emerged as a significant challenge for IT leaders. The term only became popular in 2009. By February 2011, a Google search on "big data" yielded 2.9 million hits, and vendors now advertise their products as solutions to the big data challenge. Inquiries about big data from Gartner clients have risen sharply as well.

In this presentation, we examine the definitions and issues involved with managing extreme information. We will first define what it is and explain why big data is only a small part of the overall concept of extreme information, as well as what new opportunities it will open for the enterprise. Then we will look at the issues around managing this information with today's systems. Finally, we will look at the new technologies and methods for managing extreme information.



Extreme Data: Challenges and Opportunities for Large-Scale Data Warehousing, BI and Analytics

Key Issue: What is extreme information, how is it used and what opportunities does it present for businesses?

Strategic Imperative: IT leaders must understand all 12 dimensions that emphasize the extreme information management environment. Quantification is only one aspect.

Access Enablement and Control: These dimensions relate to making sure people and machines can find and use information when needed, but that unauthorized use does not occur. Access enablement and control set the stage for who can see data, how fast it should be provided, the different delivery mechanisms and much more. As a result, they provide the most significant opportunity to plan for context-aware computing, data center modernization, and the convergence of OT and IT. Classification includes sensitive/nonsensitive and private/public classifications and an understanding of their implications. Authorization and security fit here, as does public or private information. Contracts involve agreements on who will share information and how, inside and outside the enterprise, usually represented by metadata. This dimension includes the terms of sharing, how the records will be exposed, the intended use, how long you can use the information and so on. Pervasiveness refers to information and data that becomes "hot" and is in great demand across the organization. How long does data remain active? What do you do with orphaned data that has outlived its value, but for some reason keeps hanging around? Technology-enablement involves specifications derived from the other 11 dimensions that guide the design and integration infrastructure of systems such as data integration tools, data quality tools, master data management and application middleware.

Qualification and Assurance: These dimensions relate to the levels of trust users can place in data. Addressing qualification and assurance can help with Pattern-Based Strategy, social networking and analysis. Fidelity means the ability or inability to confidently adapt an asset for wider use. Linking involves data in combination and the uses related to this context. Validation ensures that the information was created in accordance with a complete understanding of the use cases and includes all the other aspects of data quality. Unknown future use cases make validation a constant challenge. Perishability refers to the confidence that the data remains valid and reaches all use cases while it remains so. What is the shelf life of the information? How long does it remain useful? How long should it be kept? What are the aging aspects of information?

Extreme Data: Challenges and Opportunities for Large-Scale Data Warehousing, BI and Analytics

Strategic Imperative: IT leaders must think about big data and all the dimensions implied in quantification in order to take advantage of the economics of data, and to manage extreme information.

The most difficult information management issues emerge from the simultaneous and persistent interaction of extreme volume, variety of data formats, velocity of record creation and variable latencies, and the complexity of individual data types within formats. Big data is a popular proxy for this concept, but encourages an inappropriate focus on volume. The weight of data processing requirements is constantly increasing and overbears the existing technology. This creates a perpetual pendulum swing between the capability of systems and the demand on them to process information. When business leaders or data management professionals talk about big data, they often emphasize volume, with minimal consideration for velocity, variety and complexity, the other aspects of quantification:

• Velocity involves streams of data, structured record creation and availability for access and delivery. Velocity means both how fast data is being produced, and how fast the data must be processed to meet demand.

• Variety includes tabular data (databases), hierarchical data, documents, e-mail, metering data, video, image, audio, stock ticker data, financial transactions and more.

• Complexity means different standards, domain rules and storage formats can exist with each asset type. An information management system for media cannot have only one video solution.

Extreme Data: Challenges and Opportunities for Large-Scale Data Warehousing, BI and Analytics

Definition: Velocity involves streams of data, structured record creation and availability for access and delivery. Velocity means how fast data is being produced and how fast the data must be processed to meet demand.

Velocity can be thought of in terms of streams of data, structured record creation, and availability for access and delivery. It is important to remember that velocity includes the concept of a throttle on how fast data is being produced, and a demand concept regarding how fast the data must be processed to be effective.

Suppose a large country builds a smart electrical grid with billions of meters to detect power usage. The meters would send huge amounts of data to those monitoring the health of the grid, but most of the readings would amount to noise because nearly all would report normal operation. A technology like MapReduce can perform a statistical analysis to gather similar readings together, on the source device. An analyst can then specify that the system should filter out readings within normal parameters and display only abnormal readings by moving the Reduce process to the data on the metering devices. This cuts down on the data volume being transmitted, and addresses the fidelity of data as it relates to the use case contract under a time constraint that matches the perishable nature of data (if you don't get a right answer fast enough, it might be too late). Further, it does all this without transferring large volumes that would otherwise be "noise." Other people can set different parameters for the meter data to perform other kinds of analysis in additional MapReduce functions, but the resulting polling of the operational technology could overwhelm the primary job of the smart meters. Other examples include combining massive environmental data sets with each other (ocean temperature or salinity data combined with weather data), and if one data set has "expired" under the perishable dimension, the entire analysis is invalidated.
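The filter-at-the-source idea in the smart-meter example can be sketched in a few lines. This is a minimal illustration only; the field names, the threshold and the two-reading sample are hypothetical, not a real metering API:

```python
# Sketch of "moving the Reduce to the data": tag each reading on the
# device, then transmit only the abnormal ones across the network.
# NORMAL_RANGE is a hypothetical acceptable ratio of actual to expected load.
NORMAL_RANGE = (0.95, 1.05)

def map_reading(reading):
    """Map step, run on the device: tag each reading normal or abnormal."""
    ratio = reading["actual_kw"] / reading["expected_kw"]
    ok = NORMAL_RANGE[0] <= ratio <= NORMAL_RANGE[1]
    return {**reading, "status": "normal" if ok else "abnormal"}

def reduce_on_device(readings):
    """Reduce step, also run on the device: drop the 'noise' (normal
    readings) so only abnormal readings are transmitted."""
    return [r for r in map(map_reading, readings) if r["status"] == "abnormal"]

readings = [
    {"meter": "A1", "actual_kw": 5.0, "expected_kw": 5.0},  # normal: filtered out
    {"meter": "B2", "actual_kw": 9.0, "expected_kw": 5.0},  # abnormal: transmitted
]
print(reduce_on_device(readings))  # only meter B2 survives the filter
```

An analyst changing the "parameters" in the text above corresponds to swapping in a different NORMAL_RANGE or a different map function; the point is that the filtering happens before the data leaves the device.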

Extreme Data: Challenges and Opportunities for Large-Scale Data Warehousing, BI and Analytics

Definition: Variety includes tabular data (databases), hierarchical data, documents, e-mail, metering data, video, image, audio, stock ticker data, financial transactions and more.

In considering variety of information assets, there is tabular data (databases), hierarchical data, content/documents, e-mail, metering data, video, image, audio, stock ticker data, financial transactions and more. For example, an increase in the variety of new data types in the data environment (such as video) may mean dealing with technologies that have no standard formats and thus involve more-complex data types. Thus, IT leaders cannot focus only on data volume or any one dimension. They must be aware of all of the dimensions that are moving toward extremes in their environment, then learn how these dimensions interact. Data management leaders must make daily decisions that balance the long-term needs of the organization against the immediate pain points. In social media, new opportunities to discover patterns, detect early warning signs and distill other actionable information from social media data hold great promise, but tools and techniques are still used only sparsely. Moreover, much of the increased diversity and volume of information is unstructured and is a result of the social media and network phenomenon. Traditional BI techniques and skills are necessary to analyze social networks and glean business insights, but they are insufficient. Social network analysis examines relationships, information flows and influence to understand who is participating, who has influence and what is being discussed on the different types of social networks. Further analysis of unstructured social data requires content analytics, taxonomy and ontology. Nontraditional BI tools, such as network maps and related visualization tools, aid the mining process by exposing the interworking and structure of people in groups.

Extreme Data: Challenges and Opportunities for Large-Scale Data Warehousing, BI and Analytics

Definition: Complexity means that different standards, domain rules and even storage formats can exist with each asset type. An information management system for media cannot have only one video solution.

In considering the complexity of individual data types, it's important to note that within each asset type is the potential for different standards, domain rules and even storage formats. Thus, a "media" information management solution cannot have only one video solution. XML standards are intended to streamline data exchange; however, as with all standards, modification is prevalent, so semantic interpreters are required. XML is among the better-defined information exchange standards, but still has additional complexity. For example, the Digital Imaging and Communications in Medicine (DICOM) standard (a set of protocols, identifiers and representations used by virtually the entire diagnostic medical systems industry) is gaining wider acceptance in virtually every medical profession. DICOM specifies how medical images are transferred (protocols) and identifiers that can be combined into representations. The standard has been in development since the 1990s and is the subject of constant revision and updating. A widely used but dynamic standard has to address changes in the standards over time, creating complexity.
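The point that one asset type can hide several coexisting formats, each needing its own handling, can be sketched as a handler registry. This is purely illustrative; the asset types, format names and handler behavior are made-up examples, not any real media or DICOM library:

```python
# Illustrative registry: within a single asset type ("video"), multiple
# storage formats can coexist, and each format needs its own handler.
handlers = {}

def register(asset_type, fmt):
    """Decorator that registers a handler for one (asset type, format) pair."""
    def deco(fn):
        handlers[(asset_type, fmt)] = fn
        return fn
    return deco

@register("video", "mpeg4")
def handle_mpeg4(blob):
    return "decoded mpeg4"

@register("video", "h264")
def handle_h264(blob):
    return "decoded h264"

def process(asset_type, fmt, blob):
    """Dispatch to the right handler; an unknown format is an explicit error,
    not something a single 'video solution' can silently absorb."""
    if (asset_type, fmt) not in handlers:
        raise ValueError(f"no handler for {asset_type}/{fmt}")
    return handlers[(asset_type, fmt)](blob)

print(process("video", "h264", b""))  # decoded h264
```

A dynamic standard like DICOM adds a further wrinkle this sketch omits: the set of valid formats itself changes over time, so the registry must be versioned as well.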


Extreme Data: Challenges and Opportunities for Large-Scale Data Warehousing, BI and Analytics

Strategic Imperative: Adopt an information management architecture incorporating extreme information management to enable new applications with high business value.

Moore's Law supports the idea that computing power and storage capacities have been increasing exponentially, doubling every 18 to 24 months. That's an increase of about 1,000 times every 20 years. Despite this increase, data is growing at a faster rate than storage capacity and computing power. Some projections expect that, by 2012, only 51% of the data produced will be efficiently collated and stored; by 2017, that figure drops to only 28%. By 2015, organizations using traditional parse, annotate and store approaches to link unstructured information assets to structured assets will witness a sixfold increase in asset volume and costs to process and manage it. While the ratio of structured to unstructured data is not known in the IT market today, it is generally assumed that unstructured data volumes hold more than a 20% advantage (60/40), and most industry professionals place it much higher (75/25). The new technologies, both hardware and software, that will allow an in-memory column-store DBMS to be used for both OLTP and DW will enable new applications requiring higher performance and lower-latency access to information. Predictive analytics, Pattern-Based Strategies, real-time forecasting and real-time pricing are just a few of the applications that will return high value to the business.

Action Item: Look for new applications with high business value to offset the higher costs and risks associated with the infrastructure needed to enable these applications.


Extreme Data: Challenges and Opportunities for Large-Scale Data Warehousing, BI and Analytics

Key Issue: How does extreme information challenge today's architectures and approaches for data management?

Information managers, enterprise information management leads, information architects, data architects, system architects and other IT leaders must recognize when they face extreme information management challenges, plan how to address them (including deferral if needed), determine if more Web-enabled architecture approaches are needed, and acquire new technologies and practices to manage them when the business needs demand it.

In this hyperconnected age, where users are empowered, IT leaders may never regain the control they once had. They must change from being the sole owner of the technology stack to being a business partner and technology steward. Thus, they must help their business counterparts understand how to use new technologies and new data sources, while educating them on the risks. In the hyperconnected age, everyone with an Internet browser can become an information specialist. More data and more technology introduce more risk (and, potentially, more costs). These challenges reach their peak in the widening array of data sources available to the business — that is, in the big-data phenomenon.


Extreme Data: Challenges and Opportunities for Large-Scale Data Warehousing, BI and Analytics
Tactical Guideline: The ability to manage extreme data will be a core competency of enterprises that are increasingly using new forms of information (text, social, context) to look for patterns that support business decisions (Pattern-Based Strategy).

Extreme management issues like big data will have a significant impact on the reuse, analysis and sharing of information assets. This will impact efforts to manage Pattern-Based Strategy, data center modernization, OT/IT convergence and context-aware computing. Figure 3 provides contextual awareness of how these issues drive these 21st-century information management and utilization initiatives, and highlights the fact that ignoring qualification and access/control leaves a gaping hole in an information management strategy. Importantly, while quantification is depicted (with big data appropriately along only one axis), all 12 dimensions remain important. Pattern-Based Strategy, as an engine of change, utilizes all the dimensions (not just quantification) in its pattern-seeking process. It then provides the basis of the modeling for new business solutions, which allows the business to adapt. The seek, model and adapt cycle can then be completed in various mediums, such as social computing analysis or context-aware computing engines. Finally, newly modified application engines create more data, and the cycle repeats itself.

The dimensions within each category interact to complicate the challenge of managing information and data. Urgent demand for a particular set of data across the organization and higher expectations for data validation may work against each other. Making data available to more people and technologies exacerbates the challenge of preventing unauthorized access. The dimensions interact among categories as well. For example, linking external data to internal sources increases the challenge of maintaining a consistent ontology. Adding metatags to illuminate the context of data can vastly increase the size of the data and further complicate technology issues.



Extreme Data: Challenges and Opportunities for Large-Scale Data Warehousing, BI and Analytics
Strategic Imperative: With the growth of data and applications, coupled with users' desire for new, faster applications with added business value, today's DBMS is straining. Organizations are looking for ways to enhance performance and fix many issues caused by these changes.

For both OLTP and data warehouses, the scalability and performance of the DBMS is being pushed to the maximum. The issues are normally not about functionality. Extreme information is adding a dimension to the already stressed data warehouse environment.

Another issue lies with the latency necessary to move data into a data warehouse to achieve the rapid or real-time analysis necessary for many new applications in areas such as predictive analytics and pattern-based strategies. This is leading to the creation of new in-memory DBMS and caching products to increase the speed in an attempt to reduce this latency.
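The in-memory caching approach can be sketched as a read-through layer in front of the disk-based warehouse. This is a toy illustration, not any vendor's product; `slow_fetch` and the keys are hypothetical stand-ins:

```python
import time

class ReadThroughCache:
    """Minimal in-memory cache: serve repeat reads from RAM instead of
    re-querying the disk-based store. Real caching products add eviction,
    invalidation and distribution; this sketch shows only the latency idea."""

    def __init__(self, fetch_fn):
        self._fetch = fetch_fn   # the slow, authoritative source
        self._store = {}         # in-memory copy

    def get(self, key):
        if key not in self._store:          # cache miss: pay the latency once
            self._store[key] = self._fetch(key)
        return self._store[key]             # cache hit: RAM-speed access

def slow_fetch(key):
    """Simulated disk-based DW lookup (hypothetical)."""
    time.sleep(0.01)  # stand-in for I/O latency
    return f"row-for-{key}"

cache = ReadThroughCache(slow_fetch)
print(cache.get("customer:42"))  # first read pays the fetch cost
print(cache.get("customer:42"))  # repeat read is served from memory
```

The design choice is simply trading memory for latency: the first access still pays the I/O cost, but every subsequent read is served at memory speed.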

Finally, another change is the increased use of complex mixed data types (e.g., XML, biometrics and spatial) requiring new file systems and DBMSs specializing in these. In many cases, it is far more realistic and easier to manage this data in separate special-purpose databases (e.g., HDFS and Hbase), using data services to bring the data together with structured data for analysis.

As more-mature organizations embrace the issue of extreme information, we believe new tools will emerge not only to manage the data, but also to support it with metadata engines and new, innovative methods to combine information from multiple sources (i.e., federation).



Extreme Data: Challenges and Opportunities for Large-Scale Data Warehousing, BI and Analytics
Strategic Imperative: Best practices of the past will not work with modern technologies, workloads and new applications. To be successful in the future, organizations must be willing to experiment and change these outdated practices.

Why do best practices, over time, become just practices? Often, organizations can't remember why something was a best practice. Tape backup is a good example. Today, with the large size of databases, tape is not even a reasonable choice; normally, after a couple of years, the tape is not readable. Also, since database structures change over time, the tapes are not restorable. Yet, organizations are reluctant to substitute disk for tape. To make the leap to replace disk with SSD and flash will take much more time. Making major shifts in technology will require organizations to think differently about many of these practices. BI and data warehousing without optimization structures, including cubes, is a very different approach. Even though tests show better performance without them while using in-memory DBMS technology, organizations will be reluctant to change. High availability (HA) without disks is a very new and different concept that is uncomfortable for most organizations.

This major technology change from traditional DBMS structures on disk to in-memory, and from row store to column store, while creating new opportunities, will come with many prejudices against the technology. It will take years for these perceptions to change.


Action Item: Do not always fall back on old practices as best practices without understanding the benefits.

Extreme Data: Challenges and Opportunities for Large-Scale Data Warehousing, BI and Analytics



Extreme Data: Challenges and Opportunities for Large-Scale Data Warehousing, BI and Analytics
Key Issue: What are the emerging techniques for harnessing extreme information and making it available and usable by business analytics?
Technology: New DBMS technologies and delivery methods are emerging to handle the explosion of information, complexity of systems and simultaneous growth in users. Some will be absorbed into the relational DBMS model; others will coexist with the relational model.

Many new models are emerging as new products (open source and commercial) for the DBMS (see "Hype Cycle for Data Management, 2010" [G00200878]). Some of these new technologies have the potential to be disruptive (e.g., in-memory DBMS), while others are hyped in the market and, although interesting, add little real value (e.g., noSQL). Many new vendors are also emerging with this technology (see "Cool Vendors in Data Management and Integration, 2010" [G00174919]).

In addition to these, other new forces are coming into play. Analytics has become a major driving application for the DW, with in-DBMS analytics (as delivered by Teradata and SAS, as well as others), use of MapReduce outside and inside the DBMS, and the use of self-service data marts, implemented in EMC/Greenplum and Teradata as a private cloud for internal implementation.
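The MapReduce programming model mentioned above can be illustrated in a few lines. This single-process sketch shows only the map, shuffle and reduce phases; real implementations (inside or outside the DBMS) distribute each phase across many nodes, and the records here are hypothetical:

```python
from collections import defaultdict

# Toy records: (region, amount) pairs, standing in for warehouse fact rows.
records = [("north", 10), ("south", 5), ("north", 7), ("east", 3), ("south", 2)]

def map_phase(record):
    region, amount = record
    return (region, amount)          # emit key/value pairs

def shuffle(pairs):
    groups = defaultdict(list)       # group values by key, as the framework would
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    return (key, sum(values))        # aggregate each key's values

pairs = [map_phase(r) for r in records]
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(result)  # {'north': 17, 'south': 7, 'east': 3}
```

The appeal of the model is that the map and reduce functions contain no distribution logic at all; the framework can parallelize the same code over terabytes of input.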

Another driver in the market is extreme information. Many organizations are beginning to realize they must begin to use this data for decisions, new analytic applications and pattern-based strategies (see "'Big Data' Is Only the Beginning of Extreme Information Management" [G00211490]).



Action Item: Use caution when evaluating new vendors and technologies, as in many cases they tend to be short-lived.

Extreme Data: Challenges and Opportunities for Large-Scale Data Warehousing, BI and Analytics
Tactical Guideline: The RDBMS model, along with SQL, will not disappear and will remain the predominant model for structured data. However, other DBMS and file store models will be used for management of unstructured data.

The name noSQL DBMS is given to a new group of DBMSs (an open-source movement beginning in 2009) that are nonrelational. However, noSQL is a misnomer, as the use of SQL has nothing to do with the underlying DBMS model; in fact, many noSQL DBMSs allow the use of SQL. For this reason, the noSQL community has been trying to use the acronym to mean "not only SQL." There are now many of these available from the open-source community (for example, CouchDB, Dynomite, HamsterDB, Terrastore and Voldemort) and from commercial vendors (for example, Azure Table Storage, BerkeleyDB, GenieDB, Membase, MongoDB and SimpleDB). Some, such as BerkeleyDB, have been around for longer than the noSQL movement, and some have been developed for special purposes, such as Amazon's Dynamo key-value store (internal to Amazon, with its design eventually recreated in open source as Voldemort). Many are open-source projects that will simply disappear. A few, such as Azure Data Services, BerkeleyDB, Cassandra, MongoDB and SimpleDB, will be used more widely during the next two to five years. Additionally, with the increasing availability of in-memory DBMSs with ACID properties, the speed of the noSQL DBMS will not remain a distinguishing characteristic. Some, such as the graph DBMS, will be incorporated into the current DBMS model (as Aster is doing today).


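The key-value access model that many of these noSQL stores share can be sketched in a few lines. This toy, single-node class is purely illustrative (the names and data are hypothetical); production stores such as Voldemort or Membase add partitioning, replication and persistence:

```python
class KeyValueStore:
    """Toy key-value store illustrating the schemaless access model:
    any value can be stored under any key, with no SQL layer required."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value          # no schema: a document, not a row

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

kv = KeyValueStore()
kv.put("user:1001", {"name": "Ana", "visits": 3})
print(kv.get("user:1001")["name"])  # Ana
```

Note how the entire API is three operations; this simplicity, rather than the absence of SQL, is what actually distinguishes the model from the RDBMS.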

Action Item: Watch these new technologies for opportunities to gain performance and support the dimensions of extreme information (especially variety and complexity).

Extreme Data: Challenges and Opportunities for Large-Scale Data Warehousing, BI and Analytics
Strategic Planning Assumption: By 2016, in-memory column store DBMSs using storage-class memory will replace 25% of traditional DW and OLTP systems.

Since the inception of the data warehouse (DW) circa 1989, organizations have wanted to combine the DW and OLTP databases into one. This has not been possible, as the schema and models for OLTP are far different from those used for the DW. In addition, the DW has required many optimization structures to support the applications used in BI, analytics and other applications. These structures, such as summary tables, aggregates, star schemas, indexes and multidimensional structures (cubes), have prevented the use of the DW for OLTP purposes. In recent years, the column-store DBMS model has matured for use as an analytic data mart and, more recently, for the DW. The primary issue with the column store is the overhead required to write data to it when it resides on disk devices. One benefit of the column store is the degree of data compression that can be attained with it. When the column store is moved to a true in-memory DBMS, the write latency disappears, and the compression allows for the storage of large amounts of data in the database. Research has shown that indexes actually slow down retrieval of data from an in-memory database. Finally, the speed of in-memory access, bypassing any need for I/O, allows the removal of all the optimization structures, yielding a simple third-normal-form database containing only the data necessary for the DW. This database can also be used for OLTP, finally giving us the ability to have only one database for OLTP and OLAP applications, with little or no maintenance and overhead involved to support the database.


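The compression benefit of the column store can be illustrated with a toy example. The sketch below uses hypothetical data and simple run-length encoding (one of several schemes real products use) to show why repeated column values compress well, and why a scan can proceed without indexes:

```python
from itertools import groupby

# Row-store view of a table: one tuple per row.
rows = [("MX", 100), ("MX", 250), ("US", 80), ("US", 90), ("US", 120)]

# Column-store view: one array per column. Repeated values in a sorted
# column collapse to (value, count) runs under run-length encoding.
country_col = [(value, len(list(run))) for value, run in groupby(r[0] for r in rows)]
amount_col = [r[1] for r in rows]

print(country_col)  # [('MX', 2), ('US', 3)]: five values stored as two runs

def sum_where_country(target):
    """Full scan with no index: walk the compressed runs and sum the
    amounts at the matching row positions."""
    total, pos = 0, 0
    for value, count in country_col:
        if value == target:
            total += sum(amount_col[pos:pos + count])
        pos += count
    return total

print(sum_where_country("US"))  # 290
```

In memory, this run-walking scan touches far less data than a row-by-row read, which is why in-memory column stores can drop indexes and other optimization structures.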

Extreme Data: Challenges and Opportunities for Large-Scale Data Warehousing, BI and Analytics
Technology: Many new DBMS technologies are emerging, and organizations with an awareness of these technologies, architectures and strategies will be able to leverage investments to optimize the value from extreme information.

Organizations recognize that how they manage data is critical to their ability to improve agility and productivity, reduce costs and support compliance. These major business pressures are driving a greater focus on optimizing data management technologies and approaches, as well as investigating emerging and transformational technologies. Technologies such as appliances, OSS (especially Linux) and data integration suites should be watched closely for value to the IT organization. As many of these technologies mature, it will become easier to demonstrate the business value to IT and, therefore, use them to gain a competitive position.

Many new technologies have been added to the Hype Cycle, such as noSQL, in-memory DBMS and MapReduce. Some of these have real potential to change the DBMS of the future. Use of cloud DBMSs also has the potential to offload some of the infrastructure for applications, especially those that operate in the cloud.

Action Item: Use the Gartner "Hype Cycle for Data Management" to understand the level of maturity of key data management concepts and technologies, and to judge the vision of existing and potential vendors in your environment.


Extreme Data: Challenges and Opportunities for Large-Scale Data Warehousing, BI and Analytics
Strategic Imperatives: The concept of one enterprise data warehouse containing all information needed for decisions is dead. Multiple systems, including content management, data warehouses, data marts and specialized file systems tied together with data services and metadata, will become the "logical" enterprise data warehouse.

The EDW is evolving from the pursuit of a single, centrally managed repository, sometimes with multiple dependent data marts (hub and spoke), toward a logical information asset processing engine. This evolution demands that data best served by inclusion in a centralized repository be stored in that repository, but also that information assets requiring logical reduction and/or complex file/format types be accessed by efficient processes that dynamically generate evaluated records with consistent, predictable results for use in analyses. This style of EDW allows massive-volume source data sets to persist in their native formats, but standardizes the results of accessing them on demand. The logical consistency of this warehouse is moved to the data processing algorithms instead of being persisted in physical data stores after processing (although the results should be stored for later temporal comparisons). But it is imperative to recognize that the extended processing must be able to faithfully replicate the processing as designed every time it executes. The information store becomes a federated store, but the processing designs must be managed as if they are definitive data integration (DI) services with specified output, the only variable being more and more data. With the use of data services and metadata, the multiple sources of data can be brought together for analysis and/or operated on by DI to load into permanent or temporary data marts for further analysis. This would include data from content management systems and large data repositories, such as HDFS for MapReduce.


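The data-service idea behind the logical EDW can be sketched as a function that joins sources on demand instead of persisting a merged copy. The sources and fields below are hypothetical stand-ins for a relational fact table and records derived from a file store:

```python
# Two sources of the "logical" warehouse (hypothetical stand-ins):
# a relational fact table and records parsed from a file store.
relational_rows = [
    {"customer_id": 1, "revenue": 500},
    {"customer_id": 2, "revenue": 120},
]
file_store_records = [
    {"customer_id": 1, "sentiment": "positive"},
    {"customer_id": 2, "sentiment": "negative"},
]

def federated_view():
    """Data-service sketch: join the two sources on demand, yielding
    consistent evaluated records rather than storing a merged copy."""
    by_id = {rec["customer_id"]: rec for rec in file_store_records}
    for row in relational_rows:
        merged = dict(row)
        merged.update(by_id.get(row["customer_id"], {}))
        yield merged

for record in federated_view():
    print(record)
```

Because the join logic lives in the service, each source keeps its native format, yet every execution of the view produces the same predictable, combined records, which is the replication guarantee described above.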

Extreme Data: Challenges and Opportunities for Large-Scale Data Warehousing, BI and Analytics


Extreme Data: Challenges and Opportunities for Large-Scale Data Warehousing, BI and Analytics
