TDWI CHECKLIST REPORT

Seven Steps to Faster Analytics Processing with Open Source
Realizing Business Value from the Hadoop Ecosystem

By David Stodder

December 2015

Sponsored by Cloudera, DataTorrent, Platfora, and Talend

TABLE OF CONTENTS

FOREWORD

NUMBER ONE Take advantage of open source innovations to push analytics to the next level

NUMBER TWO Overcome limitations of slow, hard-to-develop batch processing

NUMBER THREE Gain business value from open source stream processing and stream analytics

NUMBER FOUR Choose the right strategy for supporting interactive access to Hadoop data

NUMBER FIVE Evaluate technologies for improving data integration and preparation for big data analytics

NUMBER SIX Accelerate the business impact of processing and analytics with open standard technologies

NUMBER SEVEN Balance the value of unified architecture with the benefits of best-of-breed innovation

ABOUT OUR SPONSORS

ABOUT THE AUTHOR

ABOUT TDWI RESEARCH

ABOUT TDWI CHECKLIST REPORTS

© 2015 by TDWI (The Data Warehousing Institute™), a division of 1105 Media, Inc. All rights reserved. Reproductions in whole or in part are prohibited except by written permission. E-mail requests or feedback to [email protected]. Product and company names mentioned herein may be trademarks and/or registered trademarks of their respective companies.

555 S Renton Village Place, Ste. 700, Renton, WA 98057-3295
T 425.277.9126 F 425.687.2842 E [email protected]
tdwi.org


FOREWORD

Excellence in analytics is a competitive advantage in nearly all industries. Technology trends are moving in a positive direction for organizations seeking to expand the business impact of analytics. New technologies can help organizations democratize the analytics experience so that more managers, operational employees, and other users can engage in faster data-driven decision making.

On the front end, visual analytics and data discovery tools are enabling users to move beyond the limited, static data views typical of traditional business intelligence (BI) reporting and spreadsheet applications. Nontechnical users as well as developers, data engineers, and data scientists working with advanced tools and techniques are pushing past traditional BI and data warehousing barriers to access and interact with a broader range of data, including real-time data streams. They have little time to wait for long extract, transform, and load (ETL) or other preparation processes to complete before they can touch the data. Their urgency is driving innovation toward faster, easier, and more flexible data integration, preparation, and processing.

Open source projects are spawning many key innovations. The Hadoop ecosystem, which developed out of a series of ongoing Apache Software Foundation projects, is maturing and expanding. Hadoop ecosystem technologies could supplement or supplant many traditional BI and data warehousing technologies and practices that may have worked for classic BI querying and reporting but struggle in the brave new world of highly iterative analytics, where questions often lead to follow-up questions as part of data discovery. Old ways also struggle when business-critical analytics processes require ready access to big data and need better performance and flexibility to support the variety of analytic models.

Organizations should evaluate Hadoop ecosystem technologies for how they can contribute to giving users easier, more interactive, and more integrated experiences with data. They should examine how open source technologies and frameworks can reduce delays in preparing and processing data for users, developers, and data scientists who are seeking to employ advanced analytics. This Checklist will discuss seven key considerations to help organizations focus their evaluation and develop a strategy for gaining value from open source technologies to support faster, more powerful, and more flexible analytics.

NUMBER ONE
TAKE ADVANTAGE OF OPEN SOURCE INNOVATIONS TO PUSH ANALYTICS TO THE NEXT LEVEL

Core Apache Hadoop, consisting of the Hadoop Distributed File System (HDFS) and MapReduce, was developed over a decade ago to enable analytics that was not possible, or at best was cumbersome and expensive, with traditional relational data warehouses. Its developers needed software and data processing capabilities that would allow them to engage in deeper and more computationally intensive analytics with huge data sets drawn primarily from multi-structured information on the Web and user search behavior. Hadoop has since been used by many companies to analyze customer behavior data, interpret sensor and machine data, and address other use cases that could not easily be handled by traditional systems.

Hadoop was designed to take advantage of affordable, commodity servers that could be clustered to support distributed, massively parallel processing of large data sets. MapReduce, while not easy for developers to use, enabled organizations to process data inside the clusters themselves rather than dealing with the bottleneck of “moving data to the code” (i.e., across networks to application servers or other platforms for processing). HDFS could facilitate rapid data transfer among the nodes in a cluster, ensure resiliency if a node failed, and with MapReduce distribute the work across the nodes and assemble (i.e., “reduce”) the results. A key benefit for analytics has been the ability to run processes directly on integrated volumes of raw, highly varied data, not just on a set of disparate samples or aggregates.
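To make the programming model concrete, below is a minimal word-count sketch of the map-and-reduce pattern using Hadoop Streaming, which lets a plain Python script act as both phases. The script name, paths, and launch command are illustrative assumptions, not taken from this report.

```python
#!/usr/bin/env python
# wordcount.py -- a minimal Hadoop Streaming sketch of the MapReduce pattern.
# Hypothetical launch (paths and file names are assumptions):
#   hadoop jar hadoop-streaming.jar -file wordcount.py \
#     -mapper "wordcount.py map" -reducer "wordcount.py reduce" \
#     -input /data/raw -output /data/counts
import sys

def mapper():
    # Runs on the nodes holding the data blocks ("moving code to the data");
    # emits a (word, 1) pair for every word seen.
    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)

def reducer():
    # Receives each word's pairs grouped and sorted by key;
    # assembles ("reduces") the totals.
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = word, 0
        count += int(n)
    if current is not None:
        print("%s\t%d" % (current, count))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```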

Starting with the initial related open source Apache projects, such as Hive, HBase, and ZooKeeper, a thriving Hadoop ecosystem has grown up with tools and frameworks for different types of storage, processing, data integration, resource management, security, analytics, search, and data discovery. The ecosystem continues to expand with the introduction of both new and revised technologies. Hadoop 2.0's YARN (Yet Another Resource Negotiator), for example, provides a new layer between HDFS and MapReduce for better cluster resource management and a flexible execution model for programming frameworks other than MapReduce.

Organizations should ensure that their developers have current knowledge about new and evolving open source concepts, frameworks, and technologies that could contribute to objectives with analytics, BI, and data discovery.


NUMBER TWO
OVERCOME LIMITATIONS OF SLOW, HARD-TO-DEVELOP BATCH PROCESSING

Many organizations today are focused on driving out latency, that is, the slowness in gaining insight from analytics caused by processing delays, which has a detrimental impact on business performance. The ultimate zero latency is real-time processing; at the other end of the spectrum is batch processing. Traditionally, IT will schedule BI updates, analytic queries, and ETL jobs to run without manual intervention in off hours so that data- and compute-intensive processes are not competing with online applications. Batch is still often the best way of performing bulk updates, scans, ETL, and other activities against large data volumes where the aim is repeatable results drawn from unchanging data. However, batch windows are getting tighter.

Once batch processes are underway, developers and administrators must wait until they finish before they can give users access to the resulting snapshot of historical data. No one can interact with the data through iterative queries in a continuous fashion, which is necessary for many kinds of analytics. Everyone has to wait for the results of batch processes, and if they have further questions, they must wait again for subsequent batch runs to finish.

HDFS and MapReduce provided new means of storing and processing data but did not offer an alternative to batch processing. Likewise, the Hive SQL interface to Hadoop also works in batch (and until recently, only on MapReduce) to perform reads and writes of historical, disk-based data. Major areas of focus for the next generation of the Hadoop ecosystem have been to enable organizations to perform faster batch processing, use in-memory computing to allow more continuous data interaction despite batch cycles, and support multi-step processing and multi-pass computations and algorithms that do more work faster. Apache Spark, Apache Apex, and other technologies support these innovations and are gaining strength as flexible alternatives to MapReduce that deliver more efficient batch processes.

Organizations should improve batch processing to increase end-user data access and satisfy analytic workloads. Developers should be encouraged to use newer techniques such as pipeline processing, which can speed data engineering because individual stages can run in parallel rather than waiting for an entire upstream process to complete.
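To illustrate, here is a minimal PySpark sketch of a nightly batch job expressed as a single pipelined program rather than a chain of MapReduce passes. The paths, field layout, and job name are hypothetical.

```python
# A minimal batch pipeline sketch in PySpark (paths and fields are assumed).
from pyspark import SparkContext

sc = SparkContext(appName="NightlySalesRollup")

# Transformations are lazy: Spark pipelines the per-record steps in memory
# and shuffles only once for the aggregation, rather than writing
# intermediate results to disk between separate MapReduce jobs.
totals = (sc.textFile("hdfs:///raw/sales/2015-12-01")
            .map(lambda line: line.split(","))
            .filter(lambda f: f[3] == "COMPLETED")       # drop incomplete orders
            .map(lambda f: (f[1], float(f[2])))          # (store_id, amount)
            .reduceByKey(lambda a, b: a + b))            # total sales per store

totals.cache()  # keep in memory so analysts can run follow-up queries quickly
totals.saveAsTextFile("hdfs:///curated/sales_by_store/2015-12-01")
```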

NUMBER THREE
GAIN BUSINESS VALUE FROM OPEN SOURCE STREAM PROCESSING AND STREAM ANALYTICS

Many firms want to tap real-time data streams generated by e-commerce, social media, mobile devices, and Internet of Things (IoT) sources such as sensors and machines. In many cases, the requirement is to stream data as it is generated straight into Hadoop files. Other firms, however, want to apply real-time "stream" analytics to "data in motion" before (or instead of) landing it in Hadoop or other system files so they can analyze and act on it automatically in real time.

Stream analytics is about applying statistical models, algorithms, or other analytics practices to data that arrives continuously, often as an unbounded sequence of instances. By running predictive models on these flows, organizations can monitor events and processes and become more situationally aware. Organizations can be proactive in spotting trends, patterns, correlations, and anomalous events and respond as they are happening. They can filter data for relevance and enrich the quality of data flowing in real time with information from their other sources.

The Hadoop ecosystem has been generating open source projects relevant to stream processing and stream analytics. The Apache Spark Streaming module, Apache Storm, and Apache Apex are aimed at processing streams of real-time data for analytics. Apache Spark is a general-purpose data processing engine with an extensible application programming interface for different workloads, including Spark Streaming for stream processing. Storm, though more mature, may eventually give way to Heron, which Storm's original developer, Twitter, is implementing as a replacement. Apache Apex, developed by DataTorrent and accepted in September 2015 as a project by the Apache Incubator, unifies stream and batch processing. These technologies can be integrated with Apache Kafka, a publish-and-subscribe messaging system that can serve data to streaming systems. Kafka is growing in popularity for meeting high-throughput, horizontal scalability requirements.

Organizations should evaluate open source technologies that support stream processing and analytics. Organizations engaged in tracking and monitoring activities—or seeking to perform analytics on "fast" data to react in real time—should make establishing a technology strategy for stream processing and stream analytics a priority.
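As a concrete illustration, the following sketch uses the Spark Streaming micro-batch API (circa Spark 1.x) fed directly from a Kafka topic. The topic name, broker address, record layout, and alert threshold are all hypothetical.

```python
# A minimal stream analytics sketch: Kafka -> Spark Streaming (assumptions noted).
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="ClickstreamMonitor")
ssc = StreamingContext(sc, batchDuration=5)  # process in 5-second micro-batches

# Subscribe directly to a hypothetical Kafka topic of e-commerce click events.
clicks = KafkaUtils.createDirectStream(
    ssc, ["clicks"], {"metadata.broker.list": "broker1:9092"})

# Count events per page in each micro-batch and surface anomalous spikes
# while the data is still in motion.
counts = (clicks.map(lambda kv: (kv[1].split(",")[0], 1))  # key on page ID
                .reduceByKey(lambda a, b: a + b))
counts.filter(lambda pc: pc[1] > 1000).pprint()            # flag unusual volume

ssc.start()
ssc.awaitTermination()
```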

NUMBER FOUR
CHOOSE THE RIGHT STRATEGY FOR SUPPORTING INTERACTIVE ACCESS TO HADOOP DATA

Armed with self-service BI and visual analytics tools, nontechnical users are ready to join data scientists in reaching beyond data warehousing systems to interact with data stored in Hadoop files. Technology trends have begun to make this easier by offering a variety of means to interact with Hadoop data, including directly rather than by first loading the data into a relational database.

Interactivity typically includes the ability to send ad hoc SQL queries to the data via front-end tools, either through execution engines and intermediate databases or directly against Hadoop files. Response time for interactive queries should be significantly faster than with batch processes, keeping in mind that interactive querying of Hadoop data is primarily for supporting discovery. Users will most likely not be interacting with true real-time data; thus, it is important for users to know the freshness (and quality) of the data. Another key issue to examine is the degree to which the technology solution supports standard ANSI SQL and how extensions or customized functions are handled.

For developers, Hive was the first generation, offering functionality that converted SQL into MapReduce jobs that would run in batch. The Apache Tez project, built on YARN, is aimed at offering Hive (or Pig) a better execution engine than MapReduce. However, an alternative trend is to integrate Hive (or other frameworks such as Pig and Crunch) with Apache Spark. Existing Hive jobs can then run on Spark, which handles the processing of data; developers are able to choose their preferred execution engine. Another alternative is the Spark SQL module, which exposes Spark-only data sets via JDBC to BI and visual analytics tools.
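For example, here is a minimal Spark SQL sketch (Spark 1.x API) that registers Parquet files in HDFS for ad hoc querying, with no prior load into a relational database; the path, table name, and columns are hypothetical.

```python
# A minimal interactive SQL-on-Spark sketch (paths and columns are assumed).
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="AdHocDiscovery")
sqlContext = SQLContext(sc)

# Read Parquet files from HDFS straight into a DataFrame.
events = sqlContext.read.parquet("hdfs:///data/web_events")
events.registerTempTable("web_events")

# Iterative, ad hoc discovery queries; the same tables can also be exposed
# to BI and visual analytics tools over JDBC.
top_pages = sqlContext.sql("""
    SELECT page, COUNT(*) AS hits
    FROM web_events
    WHERE event_date = '2015-12-01'
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")
top_pages.show()
```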

SQL-on-Hadoop is a different option for faster, near-real-time interactivity and for situations that demand high concurrency. This approach allows developers—or users through front-end tools—to query Hadoop data directly through the SQL-on-Hadoop software’s own parallel query execution engine. Cloudera Impala, Apache Drill, Presto, and other technologies are implementing this approach.

Open source technologies are maturing to enable interactive querying and iterative data exploration from BI and visual analytics tools. Organizations should evaluate options given their current and anticipated user demand.

NUMBER FIVE
EVALUATE TECHNOLOGIES FOR IMPROVING DATA INTEGRATION AND PREPARATION FOR BIG DATA ANALYTICS

Data integration, ETL, and other data preparation steps can be time-consuming parts of BI and analytics projects. Often because of poor data quality, business analysts and data scientists have to spend the majority of their time manually preparing data. Nontechnical users of visual analytics and data discovery tools are also frustrated because they do not have the technology or knowledge to complete data preparation steps on their own. The problem is becoming acute in the age of big data as organizations amass volumes of multi-structured data, often in Hadoop files, and want to blend it with traditional data to gain comprehensive views.

Open source technologies are generating new ways of addressing data integration, ETL, and data preparation. With so much data being stored and processed in Hadoop files—now joined by data sets processed by Spark, Apex, and other technologies—it follows that many organizations would seek to move tasks close to where the data is stored and processed. Being able to run processing and analytics on the same data in the same platform can eliminate data movement, cut down metadata confusion, and help lower overall costs. Talend, for example, can run natively in Hadoop and has introduced a data integration platform built on Spark. Technologies such as Platfora are able to use Spark to drive and consolidate data preparation and transformation in support of a range of analytics.
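Below is a minimal sketch of this approach: cleansing raw clickstream data and blending it with a traditional reference extract entirely inside Spark, so nothing moves to a separate ETL server. The paths and fields are hypothetical.

```python
# A minimal data preparation and blending sketch in Spark (assumed inputs).
from pyspark import SparkContext

sc = SparkContext(appName="PrepAndBlend")

# Raw, multi-structured clickstream data landed in HDFS.
clicks = (sc.textFile("hdfs:///raw/clicks")
            .map(lambda line: line.split("\t"))
            .filter(lambda f: len(f) == 2 and f[0].isdigit())  # basic quality gate
            .map(lambda f: (int(f[0]), f[1])))                 # (customer_id, page)

# Reference data exported from a traditional system, e.g., a CRM extract.
customers = (sc.textFile("hdfs:///reference/customers.csv")
               .map(lambda line: line.split(","))
               .map(lambda f: (int(f[0]), f[1])))              # (customer_id, segment)

# Enrich in place: join raw behavior with customer segments for analytics.
enriched = clicks.join(customers)  # (customer_id, (page, segment))
enriched.saveAsTextFile("hdfs:///curated/clicks_by_segment")
```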

As organizations bring in big data from heterogeneous and sometimes unstable edge sites, they must ensure a continuous data flow. The incubating Apache project NiFi, originally developed by the National Security Agency, is focused on automating the flow of data between external devices and the data center. NiFi can enable firms to track pieces of data from entry until landing in the data center. This can be important for data governance and for understanding the data lineage behind analytics.

Newer technologies are beginning to embed advanced analytics for faster integration, transformation, and enrichment of data from Hadoop files as well as from traditional sources. Organizations should evaluate how these technologies can help them reduce errors and waste in data integration, transformation, and preparation processes for big data.



NUMBER SIX
ACCELERATE THE BUSINESS IMPACT OF PROCESSING AND ANALYTICS WITH OPEN STANDARD TECHNOLOGIES

Open source technologies that follow open, accepted standards are helping spread innovation in analytics. Development efforts can be repeatable, reusable, and optimized with less custom work. Standards are also important in easing integration with applications and systems developed by IT with commercial technologies. As advanced analytics spreads, it will be important to enable developers and data scientists to work with flexible tools and frameworks as well as those that follow open standards.

Advances in open source for stream analytics were discussed previously. Another type of analytics spreading fast due to standards is machine learning—a term coined in 1959 by Arthur Samuel, a pioneer in artificial intelligence, to mean "a field of study that gives computers the ability to learn without being explicitly programmed." Provided with data in either a supervised (by a human) or unsupervised way, a machine can learn from examples. Machine learning algorithms are a growing presence for discovering patterns, predictive analytics, and more. Machines can use collaborative filtering of customers' past behavior, for example, to create actionable predictive insights for automating the order in which recommended products are presented.

There are several open source projects underway focused on machine learning, some of which have existed for some time, such as Shogun, created in 1999. Apache Mahout, first released in 2008, is aimed at giving developers the ability to create scalable machine learning applications, primarily for implementation on top of Hadoop and MapReduce but more recently on Spark, which has a machine learning module called MLlib. Each has strengths as well as gaps in the types of use cases it can support and in how it works with R, Python, and other programming languages for analytics.
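As a small illustration of the collaborative filtering use case mentioned above, here is a sketch using Spark MLlib's ALS recommender (Spark 1.x API). The ratings file, its layout, and the model parameters are hypothetical.

```python
# A minimal collaborative filtering sketch with Spark MLlib (assumed inputs).
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext(appName="ProductRecommendations")

# Hypothetical ratings file in HDFS: userId,productId,rating on each line.
ratings = (sc.textFile("hdfs:///data/ratings.csv")
             .map(lambda line: line.split(","))
             .map(lambda p: Rating(int(p[0]), int(p[1]), float(p[2]))))

# Learn latent factors from customers' past behavior.
model = ALS.train(ratings, rank=10, iterations=10)

# Rank the top five recommended products for a hypothetical user 42,
# e.g., to automate the order in which recommendations are presented.
for rec in model.recommendProducts(42, 5):
    print(rec)
```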

Online analytical processing (OLAP) could also benefit from open source. Though not yet mature enough to be considered standards, prominent projects include Druid, which can ingest data and events in real time into a data store for exploration, time-series analysis, and other OLAP processes. Apache incubator project Kylin, originally developed at eBay, is aimed at providing a distributed analytics engine that supports SQL and multidimensional OLAP on Hadoop. Organizations should evaluate emerging advanced analytics and OLAP technologies for their potential to give business functions a competitive edge.

NUMBER SEVEN
BALANCE THE VALUE OF UNIFIED ARCHITECTURE WITH THE BENEFITS OF BEST-OF-BREED INNOVATION

Having an integrated and well-governed data architecture and set of technologies is important. The architecture must have a wide scope to include both on-premises and cloud systems, and it must knit together traditional BI and data warehouses with the Hadoop ecosystem. A unified architecture must also be integrated with business processes so that analytics can directly improve processes. However, as important as integration and unification are, these efforts must be balanced with the need to let developers, data scientists, and business users experiment with innovative technologies and choose the right tool.

Industry analysts describe unified environments as "logical" or "hybrid" architectures. The notion is to have a cohesive platform that enables multiple frameworks and workloads for users with different skills as well as the range of potential use cases. Of course, few if any organizations have reached such a level of unity while still balancing unification with flexibility. More typically, IT management establishes a centralized architecture and then has to grapple with "shadow" systems run by business units that threaten the unity. Some organizations shut out open source technologies, particularly those used by shadow systems, out of fear that they will exacerbate disunity.

Historically, the Hadoop ecosystem has indeed been a collection of disparate tools and technologies. Hadoop 2.0 technologies are beginning to steer the ecosystem toward greater integration. YARN, for example, supports multiple types of processing from multiple sources and can therefore enable best-of-breed technologies to work within a unified framework. Organizations can choose different analytics, data integration and preparation, data processing, and storage technologies but still plug them into the ecosystem thanks to YARN. However, most organizations still need to work with vendors' platforms to gain fuller integration and management of open-source-based technologies.

Integration and unification are vital, in particular to support users of self-service BI and visual analytics tools who are not seasoned developers and do not have the knowledge, time, or interest in getting their hands dirty with integrating technologies. Organizations should evaluate vendors' multifunction platforms and systems to ensure they not only match requirements but also can be updated as new technologies emerge.

ABOUT OUR SPONSORS

cloudera.com

Cloudera is revolutionizing enterprise data management by offering the first unified Platform for Big Data, an enterprise data hub built on Apache Hadoop. Cloudera offers enterprises one place to store, process, and analyze all their data, empowering them to extend the value of existing investments while enabling fundamental new ways to derive value from their data.

Only Cloudera offers everything needed on a journey to an enterprise data hub, including software for business-critical data challenges such as storage, access, management, analysis, security, and search. As the leading educator of Hadoop professionals, Cloudera has trained over 22,000 individuals worldwide. Over 1,000 partners and a seasoned professional services team help deliver faster time to value.

Finally, only Cloudera provides proactive and predictive support to run an enterprise data hub with confidence. Leading organizations in every industry plus top public sector organizations globally run Cloudera in production.

www.datatorrent.com

DataTorrent is the leader in real-time big data analytics and the creator and sponsor of Apache Apex, an open source, enterprise-grade native YARN big data platform that unifies stream processing and batch processing. Apex processes big data in motion in a highly scalable, highly performant, fault-tolerant, stateful, secure, distributed, and easily operable way. Apex provides a simple API that enables users to write or reuse generic Java code, thereby lowering the expertise needed to write big data applications. Apache Apex is currently in incubation status.

DataTorrent RTS, built on the foundation of Apache Apex, is the industry’s only enterprise-grade open source, unified stream and batch processing platform.

DataTorrent dtIngest simplifies the collection, aggregation, and movement of large amounts of data to and from Hadoop for a more efficient data processing pipeline and is available to organizations for unlimited use at no cost.

DataTorrent is proven in production environments to reduce time to market, development costs, and operational expenditures for Fortune 100 and leading Internet companies. Based in Santa Clara, California, DataTorrent is backed by leading investors including August Capital, GE Ventures, Singtel Innov8, Morado Ventures, and Yahoo co-founder Jerry Yang. For more information, visit our website (www.datatorrent.com) or follow us on Twitter (www.twitter.com/datatorrent).



www.platfora.com

Platfora is the fastest, most powerful, flexible, and complete Big Data Discovery platform built natively on Apache Hadoop and Spark. Platfora enables business users and data scientists alike to visually interact with petabyte-scale data in seconds, allowing them to work with even the rawest forms of data. The latest update to the platform provides expanded support for SQL, Excel, and Apache Spark, creating a more open workflow that lets users seamlessly connect to the most popular data tools. Platfora’s next-generation data prep provides instant statistics and sample data to better guide users toward smart, customized data-driven decisions and facilitates more intelligent, iterative investigations. With Platfora, data insights can be shared across the organization without silos, driving collaboration on even the most complex data queries.

Platfora is transforming the way businesses unlock insights from big data to achieve more meaningful outcomes through the use of its industry-defining Customer Analytics, Security Analytics, and Internet of Things solutions. Data-driven organizations use Platfora Big Data Discovery to tightly integrate analytics workflows with the most in-demand features, including advanced visualizations, self-service data preparation, UI and data transforms, drag-and-drop data sets, and machine learning.

Talend.com

At Talend, it’s our mission to connect the data-driven enterprise, so our customers can operate in real time with new insight about their customers, markets, and business.

Founded in 2006, our global team of integration experts builds on open source innovation to create enterprise-ready solutions that help unlock business value more quickly. By design, Talend integration software simplifies the development process, reduces the learning curve, and decreases total cost of ownership with a unified, open, and predictable platform. Through native support of modern big data platforms, Talend takes the complexity out of integration efforts.

More than 1,700 enterprise customers worldwide rely on Talend’s solutions and services. Privately held and headquartered in Redwood City, California, the company has offices in North America, Europe, and Asia, along with a global network of technical and services partners. For more information, please visit www.talend.com and follow us on Twitter: @Talend.



ABOUT THE AUTHOR

David Stodder is director of TDWI Research for business intelligence. He focuses on providing research-based insight and best practices for organizations implementing business intelligence (BI), analytics, performance management, data discovery, data visualization, and related technologies and methods. He is the author of TDWI Best Practices Reports on mobile BI and customer analytics in the age of social media, as well as TDWI Checklist Reports on data discovery and information management. He has chaired TDWI conferences on BI agility and big data analytics. Stodder has provided thought leadership on BI, information management, and IT management for over two decades. He has served as vice president and research director with Ventana Research, and he was the founding chief editor of Intelligent Enterprise, where he served as editorial director for nine years. You can reach him by e-mail ([email protected]), on Twitter (www.twitter.com/dbstodder), and on LinkedIn (www.linkedin.com/in/davidstodder).

ABOUT TDWI RESEARCH

TDWI Research provides research and advice for data professionals worldwide. TDWI Research focuses exclusively on business intelligence, data warehousing, and analytics issues and teams up with industry thought leaders and practitioners to deliver both broad and deep understanding of the business and technical challenges surrounding the deployment and use of business intelligence, data warehousing, and analytics solutions. TDWI Research offers in-depth research reports, commentary, inquiry services, and topical conferences as well as strategic planning services to user and vendor organizations.

ABOUT TDWI CHECKLIST REPORTS

TDWI Checklist Reports provide an overview of success factors for a specific project in business intelligence, data warehousing, or a related data management discipline. Companies may use this overview to get organized before beginning a project or to identify goals and areas of improvement for current projects.