COLL Report Typesafe Apache Spark


  • APACHE SPARK: PREPARING FOR THE NEXT WAVE OF REACTIVE BIG DATA

  • CONTENTS

    Foreword

    Apache Spark Survey 2015 - Quick Snapshot

    INTRODUCTION: Is Apache Spark the Future in Reactive Big Data?

    CHAPTER 2: The People and Organizations Interested in Apache Spark

    CHAPTER 3: What Goals Do Organizations Hope to Achieve with Apache Spark?

    CHAPTER 4: How Organizations Use Spark Today

    CHAPTER 5: Barriers, Concerns and Support Desires Expressed by Respondents

    Final Thoughts

  • FOREWORD BY MATEI ZAHARIA, CREATOR OF APACHE SPARK

    I'm very excited to see this survey, built with Typesafe, that represents the largest poll of Spark developers yet. Apache Spark has rapidly been gaining traction over the past few years, and I'm thrilled to see the wide variety of use cases and environments where it is being deployed. This survey of over 2100 developers alone highlights over 500 enterprises using or planning to use Spark in production in 2015, in environments ranging from Hadoop clusters to public and private clouds, with data sources including key-value stores, databases, streaming data and file systems. Their use cases range from batch workloads to SQL queries, stream processing and machine learning, highlighting Spark's unique capability as a simple, unified platform for data processing.

    At Databricks and within the Spark community, this type of feedback is critical in helping us continue to enhance Spark for many more use cases and make Big Data simpler for enterprises of all sizes.

    Matei Zaharia, CTO at Databricks and Vice President, Apache Spark (@matei_zaharia)

  • APACHE SPARK SURVEY 2015 - QUICK SNAPSHOT

    RESPONDENTS: 74% Developers, 8% Data Scientists, 7% C-level execs

    TOP 3 LANGUAGES USED WITH SPARK: 88% Scala, 44% Java, 22% Python

    TOP 3 INDUSTRIES: Telecoms, Banks, Retail

    31% are evaluating Spark now

    20% are planning to use Spark in 2015

    13% are running Spark in production

    82% of users chose Spark to replace MapReduce

    78% of users need faster processing of larger data sets

    67% of users need Spark for event stream processing

    62% of users load data into Spark with Hadoop DFS

    54% of users run Spark standalone

  • CHAPTER 1: INTRODUCTION

    Is Apache Spark the Future in Reactive Big Data?

  • INTRODUCTION

    Back in the summer of 2014, we published the results of a survey on Java 8, which gave us a lot of the information we were looking for but also contained a small, golden nugget of data that we didn't expect: out of more than 3000 developers surveyed, a shocking 17% reported using Apache Spark in production. Whoa.

    Apache Spark is a fast and general engine for large-scale data processing built using Scala and Akka, two technologies among many that we at Typesafe recommend for building Reactive systems. Notice that "fast" is emphasized in the Spark description? As we've learned, it's actually not the size, but rather the speed or velocity of the data that is the challenge. So why Scala and Akka, you ask? You can refer to this posting by Matei for his full answer.

    With this foundation in mind, it made a lot of sense to learn more. So we asked a total of 2136 respondents about Spark awareness and adoption, the most-demanded features/modules, and how organizations use Spark in production today. We partnered with Databricks (also founded by Matei) in order to bring full lifecycle support for Apache Spark to Typesafe customers.

    We think of this next phase of technology as Reactive Big Data. But whatever you call it, it's already here.

    When we started Spark, we had two goals: we wanted to work with the Hadoop ecosystem, which is JVM-based, and we wanted a concise programming interface similar to Microsoft's DryadLINQ (the first language-integrated Big Data framework I know of, that begat things like FlumeJava and Crunch). On the JVM, the only language that would offer that kind of API was Scala, due to its ability to capture functions and ship them across the network. Scala's static typing also made it much easier to control performance compared to, say, Jython or Groovy.

    Matei Zaharia, CTO at Databricks and Vice President, Apache Spark (@matei_zaharia)

  • CHAPTER 2: WHO IS GETTING FIRED UP OVER SPARK?

    The People and Organizations Interested in Apache Spark

  • WHAT BEST DESCRIBES YOUR ROLE?

    The respondents who joined our survey generally adhere to the common technology industry demographics: a vast majority of software developers (74%) along with a smattering of other professionals. However, rather than a more sizeable segment of Architects (3.5%), we see higher representation of Data Scientists (7.5%) and C-level Executives (6.5%), clearly speaking to the ripple effect that Big Data has across an organization.

    The industry verticals in which respondents place themselves are fairly varied. The largest segments, Telcos (16%), Banks (12%), Retailers (11%), Software/Tech (10%) and Advertising (9%), are all huge consumers of complex data sets, and their business models often depend on crunching real-time data for reactive decision making at times of peak traffic/usage.

    JOB TYPE/ROLE

    74% Developer

    7.5% Data Scientist

    6.5% C-Level Executive

    3.5% Software Architect

    3.5% Dev Ops

    1% Business Analyst

    6.5% Other

    INDUSTRY FOCUS

    16% Telecommunications / Networks

    12% Banking / Finance

    11% Retail

    10% Software / Technology

    9% Advertising

    5% Consulting

    4% Healthcare / Insurance

    33% Other (including Biotechnology/Chemistry, Machinery, Education, Government and Utilities and other sectors)

  • WHICH OF THE FOLLOWING TECHNOLOGIES DO YOU USE FOR YOUR PRODUCTION INFRASTRUCTURE?

    We see quite a lot of complementary technologies in this breakdown of production infrastructure tools, from IaaS/PaaS to frameworks and containers. The market has settled on Amazon EC2 (53%), with Docker (34%) and Cloudera CDH (22%) also holding good market shares. It's interesting to see multi-functional Ansible (16%), relatively obscure just 2 years ago, appear in the mix. Mesos (14%) and OpenStack (13%) haven't always been so close in market share, so it will be interesting to see where things head in 2015-16.

    In the end, we are receiving self-reported statistics from a sample population that includes mainly developers, so it's not always clear whether this question was interpreted as "have you ever seen this technology appear in your organization in any form?" as opposed to confirmed instances of enterprise-wide production usage.

    INFRASTRUCTURE TECHNOLOGIES IN USE

    53% Amazon EC2

    34% Docker

    22% Cloudera CDH

    16% Ansible

    14% Mesos

    13% OpenStack

    12% Apache.org Builds of Hadoop

    10% HortonWorks HDP

    10% Heroku

    8% Google Compute Engine

    7% CoreOS

    7% MapR Hadoop Distribution

    6% Microsoft Azure

    5% Marathon

    4% Kubernetes

    2% Aurora

    11% Other XaaS

  • CHAPTER 3: A NEW HOPE

    What Goals Do Organizations Hope to Achieve with Apache Spark?

  • WHICH BEST DESCRIBES YOUR COMPANY'S INTEREST (OR AWARENESS) WITH SPARK?

    A solid majority of respondents (72%) have at least some experience with Apache Spark, and a total of 35% are currently using it or planning to use it this year (or next). Notably, the largest single segment (31%) is currently evaluating Spark, but since 28% had never heard of Spark at the time of this survey (funnily, this group is now 0%!), there is still a way to go. Trends in both buzz and adoption can nonetheless be discerned from sources as varied as this survey and Google Trends.

    That said, a similar linear trend exists for searches like "Hadoop" and "Big Data", so while Spark might defeat Hadoop in the processing power and event streaming areas, it is also designed to cooperate very well with Hadoop; both are Apache Software Foundation projects, after all. This is no secret; the creators of Spark, who later founded Databricks, speak directly to the complementary relationship between Hadoop and Spark in a January 2014 blog post.

    CURRENT RELATIONSHIP WITH SPARK

    31% Evaluating Spark now

    28% Um, what's Spark?

    20% Planning to use in 2015

    13% Currently using in production

    6% Evaluated, not planning to use

    2% Evaluated, will use in 2016 or later

    GOOGLE TRENDS - APACHE SPARK INTEREST OVER TIME (chart; time axis labeled 2011 and 2013)
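
    As a quick arithmetic check, the segment shares reported for this question can be tallied to reproduce the 72% and 35% figures quoted in the text. The sketch below is illustrative only: the dictionary keys are shorthand labels of our own, and the assignment of 6% and 2% to the two "Evaluated" segments is an assumption inferred from the 35% total rather than something the chart states outright.

```python
# Segment shares (percent of respondents) for "current relationship with Spark".
# Assumption: 6% = evaluated, not planning to use; 2% = evaluated, will use in
# 2016 or later (inferred from the 35% total quoted in the text).
segments = {
    "evaluating_now": 31,
    "never_heard_of_spark": 28,
    "planning_2015": 20,
    "in_production": 13,
    "evaluated_not_planning": 6,
    "evaluated_2016_or_later": 2,
}

# All segments together should cover every respondent.
total = sum(segments.values())

# "At least some experience" = everyone except those who had not heard of Spark.
some_experience = total - segments["never_heard_of_spark"]

# "Using or planning to use" = in production + planned for 2015 + planned later.
using_or_planning = (
    segments["in_production"]
    + segments["planning_2015"]
    + segments["evaluated_2016_or_later"]
)

print(total, some_experience, using_or_planning)
```

    Under these assumptions the segments cover all respondents and the two headline figures follow directly from the chart values.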

  • WHAT PROBLEMS ARE YOU TRYING TO SOLVE WITH SPARK THAT OTHER TOOLS DON'T SOLVE?

    Respondents' most prevalent goals focus on gains in processing speed, which is indeed one of Spark's most exciting benchmarks: recent Spark in-memory performance tests showed it could process data at up to 100x the speed of Hadoop MapReduce. However, users are also excited to implement event stream processing, which was an impossibility using previous technologies. As Typesafe CTO Jonas Bonér explains in his 2015 tech trends article on Wired.com, it's the velocity of data that concerns most organizations, not the size.

    Most so-called Big Data problems today are actually better described in the context of velocity instead of size. You want Fast Data. Speed is the problem to solve, not size.

    Jonas Bonér, CTO, Typesafe (@jboner)

    BUSINESS GOALS IN MIND

    78% Fast Batch Processing of Large Data Sets

    60% Support for Event Stream Processing

    56% Fast Data Queries in Real Time

    55% Improved Programmer Productivity

  • WHICH OF THE FOLLOWING SPARK FEATURES OR MODULES ARE MOST LIKELY TO SOLVE YOUR BIG DATA CHALLENGES?

    As you might predict, the Spark Core API as a replacement for MapReduce (82%) and, to a lesser extent, Spark Streaming (65%) are seen as the biggest benefits of adoption, highlighting the shortcomings of MapReduce in terms of API friendliness, sheer performance and event streaming. Spark's MLlib (59%) and SparkSQL (51%) modules are smaller priorities, and GraphX (25%) seems like a distant goal for most.

    SPARK FEATURES/MODULES IN DEMAND

    82% Core API as a Replacement for MapReduce

    65% Streaming Library (Spark Streaming)

    59% Machine Learning Library (MLlib)

    51% Integrated SQL (SparkSQL)

    25% Graph Algorithms Library (GraphX)

    Spark uses sophisticated caching of intermediate data in memory between processing steps, considerably improving the performance of applications compared to comparable MapReduce implementations. Compared to the MapReduce API, the Spark API is amazingly intuitive, providing concise, expressive operations that are often needed for analytics. So, in addition to addressing a wider class of problems, Spark is improving the productivity of developers who use it.

    Dean Wampler, Author & Big Data Expert, Typesafe (@deanwampler)
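
    To make the caching point concrete, here is a minimal plain-Python sketch of the general idea. It is not Spark code: it only shows an intermediate result being computed once, kept in memory, and reused by two downstream steps instead of being recomputed for each. The function names, the `parse_calls` counter and the sample data are all hypothetical.

```python
from functools import lru_cache

RAW_LINES = ("spark 3", "hadoop 5", "spark 4", "kafka 1")  # made-up sample data
parse_calls = 0  # counts how often the intermediate step actually runs


@lru_cache(maxsize=None)
def parse_records():
    """Intermediate step: parse raw lines into (key, value) pairs.

    The result is cached in memory, so later callers reuse it.
    """
    global parse_calls
    parse_calls += 1
    return tuple((k, int(v)) for k, v in (line.split() for line in RAW_LINES))


def total_per_key():
    """Downstream step 1: sum the values for each key."""
    totals = {}
    for key, value in parse_records():
        totals[key] = totals.get(key, 0) + value
    return totals


def max_value():
    """Downstream step 2: find the largest single value."""
    return max(value for _, value in parse_records())


totals = total_per_key()
largest = max_value()
print(totals, largest, parse_calls)
```

    Both downstream steps read the same parsed records, yet the parsing step runs only once. Spark's own caching applies to distributed datasets rather than a single in-process function, so this is an analogy for the principle, not a model of its implementation.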

  • HOW WILL YOU USE SPARK TO PROCESS YOUR DATA?

    When it comes to how Spark will be used to process data, there is a reasonable amount of variance. Event stream processing (67%), clearly a priority, is a focus for two-thirds of respondents; a further breakdown of this aspect is presented below. The remaining priorities speak to current legacy systems: developers will use Spark as a replacement for MapReduce in traditional batch-mode applications, including ETL (61%) jobs for moving, cleaning and reformatting data sets, and this will affect the rest of their data processing methods as well.

    Many respondents feel that event stream processing will be a key killer feature of Spark, and see it helping their data pipeline as a whole (71%). This points to the idea of extracting information from data sooner rather than later (65%), and seems to encourage the evolution towards Reactive systems with Big Data at the heart of it all. Automating decision making at runtime (which sounds a bit to us like continuous deployment) is also something that about 40% of respondents are considering as data velocity increases.

    DATA PROCESSING WITH SPARK

    67% Event Stream Processing

    61% ETL Data from External Sources

    59% Ad-hoc Queries and Reporting

    46% Write Data to Hadoop Distributed File System (HDFS)

    46% SQL Queries and Business Intelligence

    41% Static Reports

    39% Read or Write Data to One or More Databases

    Breakdown for Event Stream Processing:

    71% Use Spark as Part of a Larger Data Pipeline

    65% Extract Information from Data Sooner Rather than Later

    40% Automate Decision Making at Runtime

  • CHAPTER 4: APACHE SPARK IN USE

    How Organizations Use Spark Today

  • WHICH PROGRAMMING LANGUAGES ARE IMPORTANT TO YOUR SPARK INSTALLATION?

    1st Scala 88%

    2nd Java 44%

    3rd Python 22%

    Considering that Apache Spark was designed with Scala and Akka, it's not surprising that the earliest users of this technology would be focused on Scala (88%). That said, as Spark adoption goes more mainstream on the JVM, we expect Java (44%) to increase in priority over time. Python (22%) is important to just over one-fifth of respondents, and is the 3rd language after Scala and Java that Spark documentation has prioritized. Other languages that users would like to see supported include R, loved by data scientists and statisticians, plus Clojure, Groovy, Ruby and Go.


    Honorable mentions: R, Clojure, Groovy, Ruby & Go

  • WHERE ARE YOU RUNNING SPARK CURRENTLY?

    Standalone (54%) and Local mode (29%) installations of Spark seem logical for early users with different testing purposes, and one can always add to a cluster later. Otherwise, YARN (42%), the resource manager introduced with MapReduce 2, and Mesos (26%) are the general go-to choices for integrating and running Spark with current systems. Cassandra (20%) is another Apache project that not only integrates well with Spark's event streaming power, but shares a similar vision of supporting highly responsive, resilient, elastic systems. Also mentioned by about 3% of respondents is Amazon Elastic MapReduce.

    WHERE DO YOU RUN SPARK?

    54% Standalone

    42% YARN

    29% Local Mode

    26% Mesos

    20% Cassandra

  • HOW DO YOU LOAD YOUR DATA INTO SPARK?

    62% Hadoop Distributed File System (HDFS)

    18% Other Services (e.g. over socket connection)

    41% Apache Kafka

    46% Databases

    29% Amazon S3

    12% Other*

    When it comes to data loading, respondents draw on a wide spectrum of technologies: from databases to messaging and file systems to plain socket connections, almost anything goes. The winner here is HDFS (62%), which makes perfect sense: the things users cannot get done with Hadoop are designed to be ported over to Spark to finish the job, again emphasizing the complementary nature of these two technologies. Unspecified databases (46%) are in use by almost half of respondents, and Apache Kafka (41%), a hot messaging broker built by LinkedIn using Scala in 2011, now works with Spark's event streaming capabilities. Amazon S3 comes in at 29%, little surprise considering Amazon's infrastructure dominance with EC2 and its fairly comprehensive stack portfolio.

    *Including Apache Cassandra, Amazon Kinesis and Apache HBase

  • CHAPTER 5: SO WHAT'S THE DELAY IN ADOPTION?

    Barriers, Concerns and Support Desires Expressed by Respondents

  • WHAT IS YOUR BIGGEST BARRIER TO USING SPARK EFFECTIVELY?

    Here we get to analyze hundreds of write-in answers by hand...fun! We found the write-in answers to be generally legible and only occasionally off-topic mumbo jumbo (e.g. something about tabs vs. spaces). We asked about barriers to using Spark effectively at this time, then manually clustered the answers into sentiment categories, if you will.

    "Low awareness / experience" makes sense, since Spark adoption is still growing. A year from now, we predict that awareness of Spark will be considerably higher and no longer considered a barrier to adoption or use.

    "Current requirements don't fit" reflects a lack of urgency among the majority of enterprises; however, since the data shows that most early adopters use Spark to replace MapReduce, this group will likely re-evaluate their requirements as the need for data velocity increases.

    "Too immature" refers to integrations with middleware, platforms, tooling and programming languages. As adoption increases, you should check the Spark pages regularly for updates on feature and API maturity.

    LARGEST BARRIERS TO USING SPARK EFFECTIVELY

    1st Low Awareness / Experience

    2nd Current Requirements Don't Fit

    3rd Too Immature

  • HOW CAN SUPPORT BE IMPROVED?

    In line with the previous question, we also had a large collection of suggestions for improving support. Generally, these mirror the issues perceived as barriers to using Spark effectively in the previous question, but with some slight differences in semantics. Here are the top 3 sentiment categories that we hope can serve as useful feedback for future Spark development.

    "Integration integration integration!" comes in loudly as a definite requirement for many users, some of whom may not be aware of currently supported technologies, since they specifically mentioned Scala, Java and Hadoop, which are first-class citizens for Spark.

    "Deeper examples, docs & tutorials" are important for making the case for Spark. We see documentation, more real-life case studies and tutorial options (like these) from vendors as answering these needs.

    "Maturity through features" is the final area where respondents see a lot of room to improve. Specifically mentioned are immaturity in the Spark feature set related to the client and streaming functionality, issues related to clustering, and the overall stability of Spark in production.

    1st Integration Integration Integration!

    2nd Deeper Examples, Docs & Tutorials

    3rd Maturity Through Features

  • FINAL THOUGHTS

    Spark has become the Big Data tool of choice for a future of Reactive Systems, fueled by organizations in need of faster data and event streaming features.


    By this point, we're sure you understand that Spark awareness and adoption are experiencing remarkable growth. Developers have a pent-up need to eliminate issues with MapReduce, such as a difficult API, poor performance, and restriction to batch jobs only.

    You should consider Spark as the tool that meets these needs, providing excellent performance at scale, a concise and intuitive API, and support for event stream processing and iterative algorithms.

    Spark is less mature than older technologies, like MapReduce, so developers also need good documentation, example applications, and guidance on runtime performance tuning, management and monitoring. Spark is also driving interest in Scala, the language in which Spark is written, but developers and data scientists can also use Java, Python, and soon, R.

    It's all very good, more or less. So if you, like our sensible PR team, were looking for the Top 3 Takeaways From This Survey, here they are in more shareable form:

    Spark awareness and adoption are seeing rapid growth.

    Google Trends confirms this and the survey shows that 72% of respondents have at least evaluation or research experience with Spark; 35% are using it or have decided to implement it.

    Faster data processing and event streaming are the focus for enterprises.

    By far the most desirable features are Spark's vastly improved processing power over MapReduce (over 78% mention this) and the ability to process event streams (over 66% mention this), a limitation of current technologies.

    Perceived barriers to adoption are not major blockers.

    When asked, respondents mentioned lack of in-house experience and perceived immaturity of some Spark components and integrations with other middleware and management tools. Also cited are needs for better commercial support options and for more comprehensive documentation and advanced examples.

  • DON'T WORRY...WE HAVE MORE FOR YOU HERE

    Typesafe (Twitter: @Typesafe) is dedicated to helping developers build Reactive applications on the JVM. Backed by Greylock Partners, Shasta Ventures, Bain Capital Ventures and Juniper Networks, Typesafe is headquartered in San Francisco with offices in Switzerland and Sweden. To start building Reactive applications today, download Typesafe Activator.

    © 2015 Typesafe

    Introducing the Typesafe Reactive Platform

    Hands-on Spark Workshop with Typesafe Activator

    Getting Started with Spark
