MapR Data Hub White Paper V2 2014

The Hadoop Data Refinery and Enterprise Data Hub

Prepared for:

By Mike Ferguson, Intelligent Business Strategies

May 2014

WHITE PAPER
INTELLIGENT BUSINESS STRATEGIES

Copyright © Intelligent Business Strategies Limited, 2014, All Rights Reserved

Table of Contents

Management Summary ............................................................................................ 3
Introduction - Data Warehousing and the Origins of ETL Processing ....................... 5
    Scaling Up Data Integration – The Shift from ETL to ELT ................................... 5
The Emergence of Big Data and Multiple Analytical Workloads ............................... 6
    Characteristics of Multi-structured and Structured Big Data ................................ 6
    Big Data Analytical Workloads ............................................................................ 7
    Hadoop – A Key Platform for Big Data Analytics ................................................. 7
Building An Enterprise Data Hub Using MapR ......................................................... 8
    What Is An Enterprise Data Hub? ....................................................................... 8
    The MapR Hadoop Distribution as an Enterprise Data Hub Platform .................. 9
        MapR Disaster Recovery and Data Protection ............................................... 9
        Hadoop Workloads and MapR Extensions ................................................... 10
    The Data Refinery - Accelerating ETL Processing at Low Cost ........................ 10
    The Data Refinery - Exploratory Analysis ......................................................... 12
        Accelerating Big Data Consumption and Filtering Using Automated Analytics During in-Hadoop ELT Processing ............................................................ 12
    Key MapR Features That Meet Enterprise Data Hub Requirements ................. 13
    Data Hub ELT Processing With MapR Hadoop Distributions ............................ 14
    Hadoop as a Data Hub for All Analytical Platforms ........................................... 15
    Feeding Data Warehouses from a Hadoop Data Hub to Produce New Insight from Enriched Data .................................................................................... 16
    Archiving Data Warehouse Data into Hadoop ................................................... 17
Conclusion ............................................................................................................. 18


MANAGEMENT SUMMARY

Over recent years many companies have seen huge growth in data volumes. This has come from both existing structured data sources and new semi-structured and unstructured ones. The relentless rise in online shopping, together with the convenience of mobile devices, is accelerating transaction volumes significantly. Transaction data is therefore increasing rapidly, and clickstream data from online browsing is reaching unprecedented volumes. That clickstream data also harbours deep insight into online customer behaviour.

For most companies, sales and other transaction activity is analysed by extracting data from e-commerce systems; cleaning, transforming and integrating it with customer, product and financial data from other core transaction processing systems; loading it into a data warehouse; and then analysing subsets of it in data marts using business intelligence tools. Over the years, as transaction data has grown and other data sources have become available for analysis, extract, transform and load (ETL) processing has become increasingly difficult to scale. ETL tools switched to ELT, extracting and loading data into data warehouse staging tables first and then using the power of parallel SQL processing in a massively parallel database to deal with scalability. Today, however, the momentum behind online channels as the preferred way of transacting business and interacting with companies has become so great that data volumes are increasing at rates we have not seen before. Clickstream data, inbound email interactions, social media interactions, and sensor data are taking data volumes to new heights. The result is that staging areas set aside in data warehouses for ETL processing are becoming so large that ETL processing on its own is driving expensive data warehouse upgrades to handle the workload. In addition, analysis of complex new data types such as text, JSON, clickstream, images and video is also happening on Hadoop.

This paper looks at an alternative solution: the creation of an Enterprise Data Hub, with a data landing zone and a data refinery, on a much lower cost Hadoop platform that can scale to manage increasing data volumes as well as integrate structured master and transaction data with more complex, high value data such as clickstream and multi-structured interaction data. In addition, we look at how Hadoop can be used as an analytical platform to support exploratory analysis of raw data within the data refinery, producing new insights that can be published and offered to business analysts for use in further analyses. From here, business analysts throughout the enterprise can subscribe to receive new insights into traditional data warehouses and data marts, enriching what companies already know with the intent of delivering competitive advantage in existing and new markets. We also look at how Hadoop can act as a long-term data store for big data as well as an on-line archive for data warehouse data that is no longer analysed frequently.

MapR is a Hadoop vendor that has enhanced its MapR M5 Edition and MapR M7 Edition to support high availability features such as JobTracker HA™ and No NameNode HA™, MapR Direct Access NFS™, snapshots for online point-in-time data recovery, automatic data compression, remote mirroring, disaster recovery, and data protection. Its disaster recovery and data protection features make M5 and M7 capable of becoming a long-term, low cost data store.

The switch to online channels is driving unprecedented volumes of transaction data and clickstream data

This is driving up the cost of data warehousing as staging areas holding data for ETL processing grow rapidly

Companies are looking to lower the cost of data warehousing by archiving data and offloading processing

Hadoop offers a complementary low cost alternative that supports big data analytics and the ability to offload ETL processing

MapR has created an enterprise-grade Hadoop platform that supports long term data storage, data warehouse archive, offloading of ETL processing and big data analytics


In such a store, new big data sources can be analysed and archived data from traditional data warehouses can be stored and selectively reprocessed. In addition, M5 and M7 offer workload management support, allowing a Hadoop cluster to be logically divided to support different use cases, job types, user groups, and administrators. Jobs can also be isolated. All of this helps support multiple workloads and allows usage to be managed and tracked. These capabilities make MapR an enterprise-grade Hadoop platform capable of supporting an enterprise data hub encompassing a data landing zone and data refinery, where data can be cleaned, integrated and analysed by data scientists to produce new insights for competitive advantage. These new insights can then be supplied to data warehouses, data marts and other analytical platforms, forming the data foundation of a multi-platform analytical ecosystem.

It can also act as an enterprise data hub to supply data to a multi-platform analytical ecosystem


INTRODUCTION - DATA WAREHOUSING AND THE ORIGINS OF ETL PROCESSING

For many years, companies have been building data warehouses to analyse business activity and produce insights for decision makers to act on to improve business performance. These traditional analytical systems are often based on a classic pattern where data from multiple transaction processing systems is captured, cleaned, transformed and integrated before loading it into a data warehouse.

Initially, the challenge of capturing, cleaning and integrating data was the role of IT programmers who wrote hand-crafted code to extract, transform and load (ETL) data from multiple sources into newly designed data warehouse databases for subsequent analysis and reporting. Soon however, new software ETL tools emerged to take on this task and improve productivity. Some of these tools generated 3GL and 4GL code to do the work while others interpreted graphically defined rules at run time. ETL execution involved extracting data from multiple operational systems, moving the data to the ETL server, and transforming and integrating it on the server before loading it into a target data warehouse. In the early years as customer demand grew, vendors added support for more and more structured data sources including popular packaged transaction processing applications, new file formats and popular external data providers.

However, more data sources led to larger data volumes causing many customers to start hitting performance limitations especially when data was being totally refreshed. ETL tool vendors responded by adding support for change data capture, but even so, the problem of ETL performance emerged again as business demand for data increased.

SCALING UP DATA INTEGRATION – THE SHIFT FROM ETL TO ELT

To counter this problem, many ETL vendors began to look at new ways of achieving scalability. One of the most popular ways to do this was to exploit parallel query processing in massively parallel (MPP) relational DBMSs. Rather than just loading transformed data from an ETL server into target MPP RDBMSs, several ETL vendors realised that they could boost performance by capturing data from multiple data sources, loading it into staging tables on a target MPP RDBMS, and then generating SQL to transform the data using massively parallel query processing in the DBMS. The result was a significant performance improvement that also made it possible for transformed, integrated data to be moved from staging tables into production as a "within the box" process on the same RDBMS platform. This approach gave rise to the term Extract, Load, Transform (ELT), whereby MPP RDBMSs took data integration scalability to a new level.
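The ELT pattern described above can be illustrated with a small sketch. SQLite is used here purely as a stand-in for an MPP RDBMS, and all table and column names are invented for the example:

```python
import sqlite3

# Illustrative ELT sketch: SQLite stands in for an MPP RDBMS.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# 1. Extract + Load: raw source rows land in a staging table untransformed.
cur.execute(
    "CREATE TABLE stg_sales (sale_id INTEGER, amount_pence INTEGER, country_code TEXT)"
)
cur.executemany(
    "INSERT INTO stg_sales VALUES (?, ?, ?)",
    [(1, 1099, "gb"), (2, 2550, "us"), (3, 499, "gb")],
)

# 2. Transform: SQL runs inside the database engine, populating the
# production table. In an MPP RDBMS this step is what gets parallelised
# across nodes, which is where the ELT scalability comes from.
cur.execute("CREATE TABLE fact_sales (sale_id INTEGER, amount REAL, country TEXT)")
cur.execute(
    """
    INSERT INTO fact_sales
    SELECT sale_id, amount_pence / 100.0, UPPER(country_code)
    FROM stg_sales
    """
)
conn.commit()

totals = dict(
    cur.execute("SELECT country, SUM(amount) FROM fact_sales GROUP BY country").fetchall()
)
print(totals)
```

The point of the sketch is the shape of the workload, not the engine: data is loaded first and transformed second, entirely inside the database.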

Extract, Transform and Load (ETL) tools emerged in the early years of data warehousing to extract, clean, transform and integrate data from multiple transaction processing systems into data warehouses

ETL tools, while successful, experienced performance problems as the demand for data grew

ETL tools vendors switched to loading data into MPP RDBMS staging tables first and then used SQL to transform it in parallel. This became known as ELT processing


THE EMERGENCE OF BIG DATA AND MULTIPLE ANALYTICAL WORKLOADS

Although this traditional environment is now mature, many new more complex types of data have now emerged that businesses want to analyse to enrich what they already know. In addition, the rate at which much of this new data is being created and/or generated is far beyond what we have ever seen before.

Customers and prospects are creating huge amounts of new data on social networks and review web sites. In addition, online news items, weather data, competitor web site content, and even data marketplaces are now available as candidate data sources for business consumption.

Within the enterprise, web logs are growing at staggering rates as customers switch to online channels as their preferred way to transact business and interact with companies. Also, increasing numbers of sensor networks and machines are being deployed to instrument and optimise business operations. The result is an abundance of new "big data" sources, rapidly increasing data volumes and a flurry of new data streams that all need to be analysed.

CHARACTERISTICS OF MULTI-STRUCTURED AND STRUCTURED BIG DATA

The characteristics of these new data sources are different from the structured data that has been analysed in data warehouses for the last twenty years. For example, the variety of data types being captured now includes:

• Structured data
• Semi-structured data, e.g. XML, HTML
• Unstructured data, e.g. text, audio, video
• Machine-generated data, e.g. sensor data

Semi-structured data such as XML can be navigated via its element paths to go deeper into the content and derive business value. Unstructured text requires text mining to parse the data and derive structured data from it, while also building full-text indexes. Deriving insight from unstructured audio and video data is more challenging, but even here demand is growing, especially from government agencies and law enforcement.
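As a simple illustration of deriving structured records by navigating XML paths, the following sketch uses Python's standard library; the order document and its fields are invented for the example:

```python
import xml.etree.ElementTree as ET

# Hypothetical semi-structured order document.
doc = """
<orders>
  <order id="1001"><customer>Alice</customer><total currency="GBP">42.50</total></order>
  <order id="1002"><customer>Bob</customer><total currency="USD">19.99</total></order>
</orders>
"""

root = ET.fromstring(doc)

# Navigate the element paths to pull structured fields out of the document.
records = [
    {
        "id": order.get("id"),
        "customer": order.findtext("customer"),
        "total": float(order.findtext("total")),
        "currency": order.find("total").get("currency"),
    }
    for order in root.findall("order")
]
print(records)
```

Each resulting record is a flat, structured row of the kind that could be loaded into a warehouse table for conventional analysis.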

In addition to data variety, the volumes of data are also increasing. Unstructured and machine-generated data, in particular, can be very large in volume. However, volumes of structured transaction data are also increasing rapidly, mainly because of the growth in the use of online channels from desktop computers and mobile devices. One side effect of much larger transaction volumes is that the staging tables on data warehouses that hold data awaiting ELT processing are growing rapidly, which in turn is forcing companies to upgrade data warehouse platforms, often at considerable cost, to hold more data.

Finally, the rate (velocity) at which data is being generated is also increasing. Clickstream data, sensor data and financial markets data are good examples of this and are sometimes referred to as data streams.

Businesses now want to analyse new more complex types of data to add new insights to what they already know

Social network data, web logs, archived data warehouse data and sensor data are all new data sources attracting analytical attention

The variety of data is more complex than traditional data warehousing with multi-structured data now in demand

Big data can be much larger in volume

Machine-generated data is being created at very high rates


BIG DATA ANALYTICAL WORKLOADS

The arrival of big data and big data analytics has taken us beyond the traditional analytical workloads seen in data warehouses. Examples of new analytical workloads include:

• Analysis of data in motion
• Complex analysis of structured data
• Exploratory analysis of un-modeled multi-structured data
• Graph analysis, e.g. social networks
• Accelerating ETL processing of structured and multi-structured data to enrich data in a data warehouse or analytical appliance
• The long term storage and reprocessing of archived data warehouse data for rapid selective retrieval

These new analytical workloads are more likely to be processed outside of traditional data warehouses and data marts, on platforms more suited to these kinds of workloads.

HADOOP – A KEY PLATFORM FOR BIG DATA ANALYTICS

One key platform that has emerged to support big data analytical workloads is Apache Hadoop. The Hadoop software "stack" has a number of components including:

Hadoop HDFS – A distributed file system that partitions large files across multiple machines for high-throughput access to data

Hadoop YARN – A framework for job scheduling and cluster resource management

Hadoop MapReduce – A programming framework for distributed batch processing of large data sets across multiple servers

Hive – A data warehouse system for Hadoop that facilitates data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. HiveQL programs are converted into MapReduce programs

HBase – An open-source, distributed, versioned, column-oriented store modeled after Google's BigTable

Pig – A high-level data-flow language for expressing MapReduce programs that analyse large distributed data sets in HDFS

Mahout – A scalable machine learning and data mining library

Oozie – A workflow/coordination system to manage Hadoop jobs

Spark – A general purpose engine for large scale in-memory data processing. It supports analytical applications that make use of stream processing, SQL access to columnar data and analytics on distributed in-memory data

ZooKeeper – A coordination service for distributed applications
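The MapReduce programming model listed above can be sketched in miniature. This single-process Python sketch imitates the map, shuffle and reduce phases that Hadoop distributes across cluster nodes, using the classic word-count example:

```python
from collections import defaultdict

# Single-process sketch of the MapReduce model; in Hadoop, map_phase and
# reduce_phase would run in parallel on nodes across the cluster.

def map_phase(line):
    # Emit intermediate (key, value) pairs: one (word, 1) per word.
    for word in line.lower().split():
        yield word, 1

def reduce_phase(word, counts):
    # Combine all intermediate values for one key.
    return word, sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle: group intermediate pairs by key before reducing.
grouped = defaultdict(list)
for line in lines:
    for word, count in map_phase(line):
        grouped[word].append(count)

result = dict(reduce_phase(w, c) for w, c in grouped.items())
print(result)
```

The same map/shuffle/reduce structure underlies the Hive and Pig jobs discussed later: their compilers emit exactly this kind of program for the cluster to run.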

Big data has created new analytical workloads beyond those typical of traditional data warehouses and data marts

Hadoop has emerged as a platform very much at the centre of big data analytics

Hive is a data warehouse system for Hadoop that provides a mechanism to project structure on Hadoop data

Hive provides an interface whereby SQL can be converted into MapReduce programs

Mahout offers a whole library of analytics that can exploit the full power of a Hadoop cluster


BUILDING AN ENTERPRISE DATA HUB USING MAPR

WHAT IS AN ENTERPRISE DATA HUB?

Having discussed the characteristics of new sources of data, the new analytical workloads that now need to be supported, and Hadoop as a key platform for analytics, a key question is: "How does Hadoop fit into an existing analytical environment?" A key emerging role for Hadoop is that of an Enterprise Data Hub, as shown in Figure 1.

Figure 1

An enterprise data hub is a managed and governed Hadoop environment in which to land raw data, refine it and publish new insights that can be delivered to authorised users throughout the enterprise, either on-demand or on a subscription basis. These users may want to add the new insights to existing data warehouses and data marts to enrich what they already know and/or conduct further analyses for competitive advantage.

The Enterprise Data Hub consists of:

• A managed data reservoir

• A governed data refinery

• Published, protected and secure high value insights

• Long-term storage of archived data from data warehouses

Figure 1 shows the managed Hadoop Enterprise Data Hub, which includes a data reservoir, a data refinery and a zone for new insights. Data is loaded into Hadoop (the data reservoir) and then discovered, parsed, prepared, transformed and cleansed in Hadoop (MapReduce) under an ELT workflow in the data refinery, with sandboxes for exploration of this and other data. New high value insights are published (pub/sub) to platforms such as an EDW, a graph DBMS and a DW appliance.


All of this is made available in a secure, well-governed environment. Within the enterprise data hub, the data reservoir is where raw data is landed, collected and organised before it enters the data refinery, where data and data relationships are discovered and data is parsed, profiled, cleansed, transformed and integrated. It is then made available to data scientists, who may combine it with other trusted data such as master data or historical data from a data warehouse before conducting exploratory analyses in a sandbox environment to identify and produce new business insights. These insights are the output of the data refining process. They are made available to other authorised users in the enterprise by first describing them using common vocabulary data definitions and then publishing them into a new insights zone, where they become available for distribution to other platforms and analytical projects. In addition, cold data that is no longer being used frequently can be archived into the hub from data warehouses.

THE MAPR DISTRIBUTION FOR HADOOP AS AN ENTERPRISE DATA HUB PLATFORM

MapR is a vendor that provides a Hadoop platform upon which to build a managed Enterprise Data Hub. MapR was founded in 2009, is based in San Jose, California, and offers three editions of its Hadoop Distribution. These are:

MapR M3 Standard Edition – A free community edition that includes HBase™, Pig, Hive, Mahout, Cascading, Sqoop, Flume etc. It includes POSIX-compliant NFS file system access.

MapR M5 Enterprise Edition – This enterprise edition includes HBase™, Pig, Hive, Mahout, Cascading, Sqoop, Flume, Impala, Spark etc. M5 is a no-single-point-of-failure edition with high availability and data protection features such as JobTracker HA, No NameNode HA, Snapshots and Mirroring to synchronize data across clusters.

MapR M7 Enterprise Database Edition – The MapR M7 Database Edition includes all the capabilities of M5 plus enterprise-grade modifications to HBase to make it more dependable and faster.

MapR Disaster Recovery and Data Protection

MapR has also strengthened Hadoop by adding support for disaster recovery and data protection to its M5 and M7 Hadoop distributions. For disaster recovery, MapR provides remote mirroring to keep a synchronized copy of data at a remote site, so that processing can continue uninterrupted in the case of a disaster. Management of multiple on-site or geographically dispersed clusters is available with the MapR Control System. With respect to data protection, MapR has no single points of failure, with No NameNode HA and distributed cluster metadata. MapR Snapshot provides point-in-time recovery while MapR Mirroring offers business continuity. Everything in the MapR distribution is logged and able to restart.

MapR has three editions of its distribution for Hadoop

MapR provides a distribution for Hadoop that includes over 20 Apache Hadoop projects


The intent is that the entire cluster is self-healing and self-tuning. The JobTracker and NameNode have been re-engineered to be distributed and replicated. Direct Access NFS HA means that clients do not idle waiting for unavailable servers, and rolling upgrades ensure that the cluster is always available. In addition, workload management is also supported, including job isolation, job placement control, logical volumes, SLA enforcement and enterprise access control to isolate and secure data access.

Hadoop Workloads and MapR Extensions

Specific examples of workloads for which Hadoop is particularly well suited include:

• Offloading and accelerating data warehouse ELT processing at low cost

• Exploratory analysis of un-modeled multi-structured data

• Extreme analytics – for example, running millions of scoring models concurrently on millions of accounts to detect "cramming" fraud on credit cards. This is an activity whereby fraudsters attempt to steal small amounts of money from large numbers of credit card accounts by associating false charges with vague financial services and hoping consumers just don't notice. Running millions of analytical models concurrently on data is typically not a workload you would see running in a data warehouse.

• The long term storage of data and reprocessing of archived data warehouse data for rapid selective retrieval

These are all workloads that you would expect to find in an Enterprise Data Hub. The MapR enhancements to the underlying data platform that powers its Hadoop distribution provide the capabilities needed to support them, including continuous data capture, offloading of ELT processing, exploratory analytics, long term storage of archived warehouse data, and selective retrieval of that data for analytical processing. Let's look at these in more detail, with particular focus on the data refining process and on offloading ELT processing and some analytical workloads from data warehouses.

THE DATA REFINERY - ACCELERATING ETL PROCESSING AT LOW COST

The evolution of ETL on big data platforms like Hadoop has mirrored that on traditional data warehouses. First of all, hand-crafted ETL programs were written to provision data into Hadoop and to transform and integrate it for exploratory analysis. The problem with this approach is that, even if these programs exploit the multi-processor, multi-server Hadoop platform, development is slow and expensive, requiring scarce MapReduce programming skills.

ETL tool vendors responded by announcing support for Hadoop as both a target to provision data for exploratory analysis and a source to move derived insights from Hadoop into data warehouses. However, while this approach works, ETL processing occurs outside the Hadoop environment and so is unable to exploit the scalability of the Hadoop platform to deal with the characteristics of big data.

In order to get scalability, ETL vendors have evolved their products to exploit Hadoop by implementing ELT processing, much as they did on data warehouse systems. The difference now, however, is that the data is loaded into a Hadoop cluster for ELT processing via generated 3GL, Hive or Pig ELT jobs that run natively on a low cost Hadoop cluster.

Hand-crafted ETL programs were initially created in a Hadoop environment

ETL servers then emerged to handle big data integration to load into Hadoop


This is shown in Figures 2 and 3. It is this capability that is so attractive to many companies looking for a way to offload ELT processing from data warehouses and create an enterprise data hub. Offloading ELT processing to Hadoop frees up considerable capacity on data warehouse platforms, thereby potentially avoiding expensive data warehouse upgrades. This is especially significant as transaction data volumes continue to grow and new big data sources become available for analysis.

Figure 2

Several ETL tool vendors have now re-written transforms to run on Hadoop. Also, several have added new tools and transformations to handle large volumes of multi-structured data. Therefore what we are seeing in Hadoop environments is a full repetition of what happened with ETL tools on MPP RDBMSs. This time however, the attraction is that ELT processing can potentially be done at a much lower cost given that a Hadoop cluster is a much cheaper platform to store any kind of data. It also opens up the way for ELT processing to be offloaded from data warehouses and to exploit the full power of a Hadoop cluster to get the scalability needed to improve performance in a big data environment.

Figure 3

Figure 2 shows scaling of ETL transformations by generating Pig, Hive or 3GL MapReduce code for in-Hadoop ELT processing. A data cleansing and integration tool (extract, parse, clean, transform, analyse, load insights) has three options: Option 1, the ETL tool generates HQL or converts generated SQL to HQL; Option 2, the ETL tool generates Pig Latin, whose compiler converts every transform to a MapReduce job; Option 3, the ETL tool generates 3GL MapReduce code.

Figure 3 shows provisioning of data into Hadoop for exploratory analysis of multi-structured data using in-Hadoop ELT processing. Web logs, un-modelled multi-structured data, structured data and filtered sensor data flow through generated MapReduce ELT jobs into sandboxes to produce business insight.

ETL servers that handle big data cleansing and integration outside of Hadoop are unlikely to scale well – they need to exploit Hadoop

ETL processing in a big data environment needs to exploit Hadoop to get scalability at low cost

Several ETL vendors have ported their software to Hadoop to run ELT map-reduce processing on multi-structured big data


The aforementioned attractions of the Figure 3 pattern are leading many companies to consider placing a significant slice of their structured and multi-structured data into a Hadoop Enterprise Data Hub for ELT processing, before making subsets of it available for exploratory analysis on Hadoop itself or to other data warehouse and NoSQL platforms. The only challenge with this is the migration of existing ELT jobs running on existing data warehouses, which is helped by Hive being able to convert ELT-generated SQL (used to transform data) into MapReduce or potentially even in-memory Spark programs.

THE DATA REFINERY - EXPLORATORY ANALYSIS

In addition to ETL processing, another analytical workload very much part of the data refining process in a Hadoop Enterprise Data Hub is the exploratory analysis of complex data types. This is where data scientists in the enterprise data hub use freeform exploratory tools like search and/or develop and run batch MapReduce or Spark analytic applications (written in languages like Java, Python, Scala and R) to conduct exploratory analyses on un-modelled data stored in the Hadoop system. The purpose of this analysis in the data refining process is to derive structured insight from unstructured data, which may then be stored in HBase or Hive, or moved into a data warehouse for further analysis. With Hadoop MapReduce, these analytical programs are copied to thousands of compute nodes in a Hadoop cluster, where the data is located, in order to run the batch analysis in parallel. In addition, in-Hadoop analytics in the Mahout library can run in parallel close to the data to exploit the full power of a Hadoop cluster. The addition of Spark means that MapR can improve the performance of exploratory analytical applications by exploiting in-memory processing. Data access can also be simplified by using SQL via Shark on Spark instead of lower level HDFS APIs. Insight derived from this exploratory analysis can then be published and moved into data warehouses, or into other analytical data stores, for further analysis.

Accelerating Big Data Consumption and Filtering Using Automated Analytics During in-Hadoop ELT Processing

One of the challenges with big data is dealing with the data deluge: data is arriving faster than companies can consume it. They therefore have to find a way to automate the consumption and refining of popular data, bringing data into the enterprise in a timely way for the business to analyse and act on.

That means not only doing ETL processing on Hadoop, but also being able to analyse data during ELT processing as part of the data refining process. An example might be to score Twitter sentiment during ELT processing of Twitter data on Hadoop so that negative sentiment can be identified quickly and attributed to customers, products, brand or business functions like customer service.
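As a minimal sketch of this idea, the following example embeds a tiny, invented keyword-based sentiment scorer inside a transform step, so each record is scored in the same pass that cleanses it. The scorer stands in for whatever MapReduce or Spark analytic a data scientist might have built; none of this reflects a specific MapR or ETL vendor API:

```python
# Hypothetical keyword lists standing in for a real sentiment model.
NEGATIVE = {"terrible", "broken", "refund"}
POSITIVE = {"great", "love", "fast"}

def score_sentiment(text):
    # Positive keyword hits minus negative keyword hits.
    words = set(text.lower().split())
    return len(words & POSITIVE) - len(words & NEGATIVE)

def transform(tweet):
    # Ordinary ELT cleansing plus an in-line analytic on the same pass.
    cleaned = tweet["text"].strip()
    return {"text": cleaned, "sentiment": score_sentiment(cleaned)}

tweets = [
    {"text": " Love the fast delivery "},
    {"text": "Terrible service, want a refund"},
]

refined = [transform(t) for t in tweets]

# Negative sentiment is flagged during refining, not in a later pass.
negatives = [r for r in refined if r["sentiment"] < 0]
print(negatives)
```

The design point is that the analytic runs inside the transform, so negative sentiment is attributed as the data is refined rather than in a separate downstream job.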

Figure 4 takes in-Hadoop ELT processing further than in Figure 1 by performing in-line automated analytics on Hadoop during ELT processing. In this way, popular structured and multi-structured data sources may be consumed and refined in a more automated way, thereby expediting time to value. In addition, if data scientists have built custom MapReduce or Spark-based analytics on this kind of data, then it is potentially possible to exploit these analytics during ELT processing. This means that once data scientists have built analytics that analyse data, they can be used and re-used in analyses to produce insight from new big data sources. Note how in Figure 4 automated analysis allows the ELT workflow to span the entire data refining process.

Embedding in-Hadoop analytics in Hadoop-based ELT processing allows data to be consumed more rapidly and in a more automated way

Figure 4

In this way it becomes possible to build an “Enterprise Data Filter” that can speed consumption of new and existing data sources to expedite the production of new high value business insights.

KEY MAPR FEATURES THAT MEET ENTERPRISE DATA HUB REQUIREMENTS

Given what is potentially possible, the next question is: "How has MapR enhanced its Hadoop distributions to support the operation of an Enterprise Data Hub?" Since its inception, MapR has sought to enhance a critical part of the open source Apache Hadoop stack to improve availability, open up access and improve overall performance and usability.

MapR has strengthened Apache Hadoop considerably to improve its resilience, improve performance and make it easier to manage. For example, it has removed multiple single points of failure in Apache Hadoop and introduced data mirroring across clusters, using asynchronous replication, to support failover and disaster recovery. In addition, it has added data snapshots and a heat map management console, and has improved performance through data compression and by rewriting the intermediate shuffle phase that occurs after Map and before Reduce. HBase has also been strengthened (MapR M7 Edition) to remove compactions, and Spark has been added to facilitate high-performance in-memory analytics. All of this makes the MapR distributions much more enterprise-grade.

Key features of MapR M5 and M7 that benefit ETL processing include:

[Figure 4 diagram: "The Managed Hadoop Enterprise Data Hub Includes Automated Analyses to Refine Data Much More Rapidly" - an ELT workflow (load data into Hadoop, discover data in Hadoop, parse & prepare data in Hadoop with MapReduce, transform & cleanse data in Hadoop with MapReduce) runs over a Data Reservoir/Data Refinery, with automated invocation of custom-built and pre-built analytics on Hadoop publishing new high-value insights (pub/sub) from these and other data to an EDW, a DW appliance and a graph DBMS.]

MapR has strengthened Apache Hadoop to improve resilience and performance

MapR features that benefit ETL processing include high availability and Direct Access NFS


• No single point of failure, with high availability features such as JobTracker High Availability and no-NameNode High Availability. In a global business where ETL processing may need to happen several times within the day, high availability is very important.

• MapR Direct Access™ NFS, which enables real-time read/write data flows via the industry-standard Network File System (NFS) protocol. With MapR Direct Access NFS, any remote client can simply mount the cluster. This means that application servers can write their log files and other data directly into the cluster, rather than writing it first to direct- or network-attached storage. This reduces the need for log collection tools that may require agents on every application server. Application servers can either write data directly into the cluster or use standard tools like rsync to synchronize data between local disks and the cluster. Either way, it means that ELT processing on log data needed for clickstream analytics could potentially avoid the extract and load into MapR M5/M7, thus speeding up the process of making this data available for analysis. It also reduces data latency, which can be important in many applications.

• MapR Snapshots allow for online point-in-time data recovery without replication of data. A volume snapshot is consistent (atomic), does not copy data, and does not impact performance. Snapshots can potentially help ETL processing in the event of a failure, where an ETL job may need to be restarted from the point of failure, or at least from an intermediate snapshot taken at specific points in the ETL processing.

• The rewriting of the intermediate shuffle phase that occurs after Map and before Reduce can really help improve ETL performance for ETL tools generating MapReduce ETL jobs via Hive, Pig or natively in a 3GL language such as Java.

• Automatic data compression can also help improve performance and so speed up data refinery processes.

• The MapR data protection and disaster recovery capabilities make the MapR Distributions for Hadoop suitable for long-term storage of big data and of archived data warehouse data, which can then be selectively reprocessed in specific analyses even though it has been offloaded from the data warehouse.

• The MapR remote mirroring capability also allows ELT and analytical workloads in a data refinery to be spread across clusters in order to get more work done.
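To illustrate how snapshot-style checkpoints make ETL restartable, here is a hedged Python sketch. The stage names and the in-memory checkpoint set are hypothetical stand-ins; on a real cluster each checkpoint would correspond to a MapR volume snapshot rather than a Python set.

```python
# Hypothetical ETL stages; on MapR a volume snapshot would be taken
# after each stage completes, giving a point-in-time restart position.
STAGES = ["extract", "parse", "transform", "load"]

def run_etl(completed, fail_at=None):
    """Run remaining ETL stages, skipping any stage already checkpointed."""
    for stage in STAGES:
        if stage in completed:
            continue                 # restart point: skip finished work
        if stage == fail_at:
            raise RuntimeError(f"failure during {stage}")
        completed.add(stage)         # "snapshot" taken after the stage

checkpoints = set()
try:
    run_etl(checkpoints, fail_at="transform")   # job fails mid-way
except RuntimeError:
    pass

print(sorted(checkpoints))  # -> ['extract', 'parse']
run_etl(checkpoints)        # restart resumes from the last checkpoint
print(sorted(checkpoints))  # -> ['extract', 'load', 'parse', 'transform']
```

The design point: because each checkpoint is cheap (a MapR snapshot does not copy data), checkpointing frequently costs little while avoiding a full re-run after a failure.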

DATA HUB ELT PROCESSING WITH MAPR HADOOP DISTRIBUTIONS

With respect to ELT processing on Hadoop, MapR partners with a number of ETL tool vendors and ETL accelerator vendors that can run ELT jobs on MapR M5 and M7 Hadoop clusters. These include:

• Informatica
• Pentaho
• Talend

Direct Access NFS speeds up the ability to capture data and support change data capture

MapR snapshots help to support ‘point-in-time’ restart of big data ETL processing in the event of a failure without going back to the beginning

Rewrite of shuffle and data compression helps improve ETL processing performance

MapR has several ETL partners that run on Hadoop to accelerate ETL processing

MapR together with its ETL partners can support most ETL patterns


• Syncsort

Using these partner technologies in combination with the MapR M5 and M7 editions, the following patterns are supported:

• Accelerating big data consumption and filtering by using in-Hadoop analytics during ELT processing

• In-Hadoop ELT processing via MapReduce-based transformations

• Provisioning data into Hadoop sandboxes for exploratory analysis as part of a data refining process

• Feeding data warehouses from Hadoop to accelerate multi-platform analytics

In terms of scalability, adding more Hadoop nodes to the cluster allows you to process this data at speed due to greater I/O parallelism and compute power. For very large amounts of data, a MapR Hadoop cluster can be spun up in a cloud environment like the Google Compute Engine to undertake this work at a much lower cost than trying to configure a cluster of similar size in-house.

With respect to restart of ELT processing, MapR snapshots taken at specific points in ELT processing make it possible to restart any data refinery ELT process quickly.

MapR also provides random read/write access in its Hadoop distribution. In Apache Hadoop, HDFS is normally append-only, but one of the key features of the MapR Distribution is Direct Access NFS. In the context of ELT, Direct Access NFS allows faster and more convenient loading of data into the Hadoop cluster, thereby reducing data latency. ETL tools that support Change Data Capture can write changes straight into the MapR Hadoop cluster. An example of a MapR partner ETL tool vendor that can do this is Talend. Change data capture is very important to ETL performance, especially on large volumes of data.
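Because Direct Access NFS lets a remote client mount the cluster as an ordinary file system, change data capture can reduce to plain file appends. The sketch below is an assumption-laden illustration: a temporary directory stands in for the NFS mount point (conventionally something like /mapr/<cluster>), and the change-record format is invented.

```python
import json
import os
import tempfile

# Stand-in for the NFS-mounted MapR cluster, so the sketch runs anywhere.
cluster_mount = tempfile.mkdtemp()
cdc_file = os.path.join(cluster_mount, "orders_changes.jsonl")

# Hypothetical captured changes from a source system.
changes = [
    {"op": "insert", "order_id": 101, "amount": 25.0},
    {"op": "update", "order_id": 99, "amount": 40.0},
]

# Append each change directly "into the cluster" with ordinary file I/O -
# no staging area and no separate extract-and-load step.
with open(cdc_file, "a") as f:
    for change in changes:
        f.write(json.dumps(change) + "\n")

# Downstream ELT on the cluster reads the same file.
with open(cdc_file) as f:
    loaded = [json.loads(line) for line in f]
print(len(loaded))  # -> 2
```

This is why the paper singles out Direct Access NFS for CDC: the writer needs no Hadoop client libraries at all, just the standard file system calls every tool already makes.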

Finally, in terms of ELT performance, the rewriting of the intermediate shuffle phase that occurs after Map and before Reduce will benefit sorting, aggregation, hashing and pattern-matching transformations, all of which are mainstream transformation functionality needed in most ELT jobs. This functionality, together with data compression, will boost performance.

Together, these features allow MapR to support the key ETL patterns listed above.

HADOOP AS A DATA HUB FOR ALL ANALYTICAL PLATFORMS

Given these enhancements, the MapR Distribution could potentially be used not only to offload processing from data warehouses but also to create a low-cost data hub (see Figure 5). An Enterprise Data Hub is the foundation pattern in data and new insight provisioning in a multi-platform analytical environment. It would be possible to use MapR M5 or M7 as an Enterprise Data Hub that cleans, transforms and integrates data from multiple structured and multi-structured sources and provisions trusted data into any analytical platform in a big data analytical ecosystem for subsequent analysis. This includes:

• MapR M5/M7 Hadoop distribution itself, where sandboxes are created for data scientists to conduct exploratory analysis as part of a data refining process

ETL processing can take place on premises or in the cloud

ETL jobs can be designed with restart in mind using MapR snapshots

ETL jobs can handle change data capture using MapR Direct Access NFS

ETL performance is accelerated using MapR shuffle processing and data compression



• Enterprise data warehouses
• Data marts
• Analytical appliances
• Other NoSQL databases, e.g. graph databases

Figure 5

FEEDING DATA WAREHOUSES FROM A HADOOP DATA HUB TO PRODUCE NEW INSIGHT FROM ENRICHED DATA

Having transformed data on Hadoop and produced insights from it, there is a need to add any new insights produced to existing environments, adding to what is already known. This means ETL tools must also be able to extract derived insights produced on Hadoop from that platform and integrate them with other structured data going into a data warehouse (see Figure 6). This may happen on Hadoop itself (i.e. pushing data into a data warehouse) or outside of Hadoop, pulling the data into a data warehouse. In this way we can facilitate multi-platform analytics that may start by analysing data on Hadoop and end up offering new insights to self-service BI users accessing a data warehouse.
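The push/pull pattern described above can be sketched in a few lines of Python. SQLite stands in for the data warehouse and the derived insight rows are hypothetical, so this illustrates the flow rather than any vendor's implementation.

```python
import sqlite3

# Insight produced by analysis on Hadoop, e.g. a sentiment score per
# customer (hypothetical values).
derived_insight = [("cust1", -2), ("cust2", 2)]

# SQLite stands in for the enterprise data warehouse.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE customer_sentiment (customer TEXT, score INTEGER)")

# "Push" the Hadoop-derived insight into the warehouse.
dw.executemany("INSERT INTO customer_sentiment VALUES (?, ?)", derived_insight)

# Self-service BI users now see the new insight alongside existing
# structured data in the warehouse.
rows = dw.execute(
    "SELECT customer FROM customer_sentiment WHERE score < 0").fetchall()
print(rows)  # -> [('cust1',)]
```

In practice the load step would be an ETL tool or a bulk loader against the actual DW platform; the shape of the flow (derive on Hadoop, publish to the warehouse) is the point.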

By embedding analytics in Hadoop ELT processing, it is also potentially possible to turn ELT workflows into multi-platform analytical workflows.

[Figure 5 diagram: "Data Hub - Consume, Clean, Integrate, Analyse And Provision Data From Hadoop To Any Analytical Platform" - ELT processing and generated MapReduce ELT jobs, plus sandboxes for exploratory analysis, consume feeds, sensor data, RDBMS data, files, office documents, social and cloud data, web logs and web services, and provision business insight to the EDW, DW and marts, a DW appliance, NoSQL databases (e.g. a graph DBMS) and advanced analytics on structured data.]

ETL tools also need to extract data from Hadoop and provide it to data warehouses and other NoSQL data stores

Providing new insights from Hadoop into data warehouses is a very common requirement


Figure 6

ARCHIVING DATA WAREHOUSE DATA INTO HADOOP

In order to maximise the value from ETL processing in a big data environment, it must be possible to move data from Hadoop into other NoSQL and relational analytical platforms and vice-versa. This includes orchestrating multi-platform analytical ETL workflows to solve complex analytical problems.

Figure 7

Figure 7 shows this capability. With two-way data movement it becomes possible to take dimension data into Hadoop and to archive data from data warehouses into Hadoop. It also becomes possible to manage data across all data stores and analytical platforms in Hadoop. What this also shows is that data management software has to scale much more than before, not just to handle big data volumes, but also to handle data movement across platforms during analytical processing. It will also be used for data archiving across platforms. ETL scalability on a robust, highly available, self-healing Hadoop platform like MapR is therefore even more important going forward.
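Here is a hedged sketch of the archiving direction of this two-way movement, with SQLite standing in for the warehouse and a temporary directory for the (NFS-mounted) Hadoop cluster; the table name, schema and retention cut-off are invented for illustration.

```python
import csv
import os
import sqlite3
import tempfile

# SQLite stands in for the data warehouse, with some hypothetical rows.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales (yr INTEGER, amount REAL)")
dw.executemany("INSERT INTO sales VALUES (?, ?)",
               [(2010, 10.0), (2011, 20.0), (2014, 30.0)])

# 1. Extract rows older than the retention cut-off.
cutoff = 2012
aged = dw.execute("SELECT yr, amount FROM sales WHERE yr < ?",
                  (cutoff,)).fetchall()

# 2. Write them into the cluster, where they remain available for
#    selective reprocessing (e.g. via Hive) even though offloaded.
archive_dir = tempfile.mkdtemp()   # stand-in for the cluster mount
with open(os.path.join(archive_dir, "sales_archive.csv"), "w",
          newline="") as f:
    csv.writer(f).writerows(aged)

# 3. Remove the archived rows from the warehouse to reclaim capacity.
dw.execute("DELETE FROM sales WHERE yr < ?", (cutoff,))
print(dw.execute("SELECT COUNT(*) FROM sales").fetchone()[0])  # -> 1
```

The reverse direction (dimension data into Hadoop, or insight back into the warehouse) follows the same extract-write-load shape with the endpoints swapped.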

[Figure 6 diagram: "Leveraging Hadoop for Data Integration on Massive Volumes of Data to Bring Additional Insights Into a DW" - cloud data, e.g. huge volumes of social web content from sites like Twitter, Facebook, Digg, Myspace, TripAdvisor and LinkedIn for sentiment analytics, is extracted into HDFS, transformed by MapReduce data transformation and analytics applications (e.g. Pig, JAQL) at scales of hundreds of terabytes up to petabytes, and the relevant insight is loaded into the DW via ETL.]

[Figure 7 diagram: "Need to Manage the Supply of Consistent Data And Archive Data Across The Entire Analytical Ecosystem" - an Enterprise Information Management tool suite manages CRUD on master data (product, asset, customer), stream processing of feeds and sensor data, and sources including RDBMSs, files, office documents, social and cloud data, web logs and web services, keeping data consistent and archived across operational systems, the MDM system, the EDW, DW and marts, a DW appliance, NoSQL databases (e.g. a graph DBMS) and advanced analytics platforms for both structured and multi-structured data.]

Archiving data from data warehouses into Hadoop is also needed

There is a need for two-way movement of data between Hadoop and other data stores and to manage data across all analytical platforms in a big data ecosystem


CONCLUSION

The emergence of new data sources and the need to analyse everything from unstructured data to live event streams has led many organisations to realise that the spectrum of analytical workloads is now so broad that they cannot all be dealt with in a single enterprise data warehouse. Companies now need multiple analytical platforms in addition to traditional data warehouses and data marts to manage big data workloads. New big data platforms like Hadoop, stream processing engines and NoSQL graph DBMSs are all emerging as platforms optimised for specific analytical workloads that need to be added to the enterprise analytical setup.

This has resulted in a more complex analytical environment that has put much more emphasis on data management to keep data consistent across big data workload-optimised analytical platforms and traditional data warehouses.

ETL software now has to deal with multiple data types, very large data volumes and high velocity event streams as well as handling traditional ETL processing into data warehouses. In addition, this software must now deal with the need to rapidly move data between big data and traditional data warehousing platforms during the execution of analytical workloads. All this is needed while continuing to deliver value for money and without causing dramatic increases in cost. Data refinery processes have to be fast, efficient, simple to use and cost-effective.

Several data management vendors now support Hadoop as both a source and a target in their ETL tools. They also generate HiveQL, Pig or Java to create MapReduce ELT processing jobs that fully exploit massively parallel Hadoop clusters. To support faster filtering and consumption of data, we are also seeing ETL tools starting to support the embedding of analytics into ETL workflows, so that fully automated analytical workflows can be built to speed up the rate at which organisations can consume, analyse and act on data.

It is this combination of Hadoop with data management software and in-Hadoop analytics that opens up the attractive proposition of creating a low-cost Enterprise Data Hub (as shown in Figure 4) that manages and accelerates the data refinery process in an end-to-end big data analytical ecosystem. The MapR Distribution for Hadoop is well suited to this role and can also support the offloading of a subset of analytical processing from data warehouses. The Enterprise Data Hub is not just for data warehousing, however. Its job is to become the foundation for cleansing, transforming and integrating structured and multi-structured data from multiple sources before provisioning filtered data and new insights to any platform in the entire big data analytical ecosystem for subsequent analysis.

MapR, with its enterprise-grade Hadoop distribution and its partners, looks ready for the challenge.

Business is now demanding more analytical power to analyse new sources of structured and multi-structured data

ETL tools have to scale to support more data, more complex transformations and faster data loading

Support for Hadoop and rapid data movement between Hadoop and data warehouses is needed

Hadoop in combination with ETL processing offers an attractive low cost way to implement a data management hub for the entire big data analytical ecosystem

MapR can help customers take ETL processing to the next level


About Intelligent Business Strategies

Intelligent Business Strategies is a research and consulting company whose goal is to help companies understand and exploit new developments in business intelligence, analytical processing, data management and enterprise business integration. Together, these technologies help an organisation become an intelligent business.

Author

Mike Ferguson is Managing Director of Intelligent Business Strategies Limited. As an analyst and consultant he specialises in business intelligence and enterprise business integration. With over 31 years of IT experience, Mike has consulted for dozens of companies on business intelligence strategy, big data, data governance, master data management, enterprise architecture, and SOA. He has spoken at events all over the world and has written numerous articles and blogs providing insights on the industry. Formerly he was a principal and co-founder of Codd and Date Europe Limited (the inventors of the Relational Model), a Chief Architect at Teradata on the Teradata DBMS, and European Managing Director of Database Associates, an independent analyst organisation. He teaches popular master classes in Big Data Analytics, New Technologies for Business Intelligence and Data Warehousing, Enterprise Data Governance, Master Data Management, and Enterprise Business Integration.

INTELLIGENT  BUSINESS  STRATEGIES  

Water Lane, Wilmslow Cheshire, SK9 5BG

England Telephone: (+44)1625 520700

Internet URL: www.intelligentbusiness.biz E-mail: [email protected]

The Hadoop Data Refinery and Enterprise Data Hub Copyright © 2014 by Intelligent Business Strategies

All rights reserved