DataOps with Project Amaterasu


Yaniv Rodenski, Karel Alfonso

What Data Pipelines are Made Of

• Big Data applications:

• Ingestion

• Storage

• Processing

• Serving

• Workflows

• Machine learning

• Data Sources and Destinations

• Tests?

• Schemas??

Archetypes of Data Pipeline Builders

Data People (Data Scientists/Analysts/BI Devs):

• Exploratory workloads

• Data centric

• Simple deployment

Software Developers:

• Code centric

• Heavy on methodologies

• Heavy tooling

• Very complex deployment

Making Big Data Teams Scale

• Scaling teams is hard

• Scaling Big Data teams is harder

• Different mentalities between data professionals and engineers

• Mixture of technologies

• Data as integration point

• Often schema-less

• Lack of tools

Continuous Delivery

• Keep software in a production-ready state

• Test all the changes: unit, integration

• Exercise deployments

• Faster feedback cycle

DevOps & Collaboration:

• No silos

• Autonomous teams

• Feedback

• Automation

• Build quality in

• Shared responsibility

The case for CI/CD/DevOps in Big Data Projects

• Coordination: data engineers, analysts, business, ops

• Integrate and test critical jobs

• Complex infrastructure: multiple distributed systems

• Need to decouple cluster operation via APIs/DSLs

• DevOps team to manage cluster operations: scaling, monitoring, deployment.

• Include CI/CD practices as part of the delivery process.

How are these techniques applicable to Big Data applications?

What Do We Need to Deploy Our Apps?

• Source control system: Git, Hg, etc.

• CI process to run tests and package app

• A repository to store packaged app

• A repository to store configuration

• An API/DSL to deploy to the cluster

• Mechanism to monitor the behaviour and performance of the app

Who are we? Software developers with years of Big Data experience

What do we want? A simple and robust way to deploy Big Data applications

How will we get it? Write thousands of lines of code on top of Mesos

Amaterasu - Simple Continually Deployed Data Apps

• Amaterasu is the Shinto goddess of the sun

• In the Japanese manga series Naruto, Amaterasu is a supernatural power in the shape of a black flame that can only be put out by its caster

• Started as a framework to reliably execute Spark driver programs

Amaterasu - Simple Continually Deployed Data Apps

• Big Data apps in multiple frameworks (currently only Spark is supported)

• Multiple Languages (soon)

• Workflow as YAML

• Simple to write, easy to deploy

• Reliable execution (via Mesos)

• Multiple Environments

Big Data Pipeline Ops Requirements

• Support managing multiple distributed technologies: Apache Spark, HDFS, Kafka, Cassandra, etc.

• Treat the data center as the OS, while providing resource isolation, scalability and fault tolerance.

• Ability to run multiple tasks per machine to maximize utilization

Why Mesos?

• General purpose, battle-tested cluster resource scheduler.

• Can run major modern Big Data systems: Hadoop, Spark, Kafka, Cassandra

• Can deploy Spark as part of the execution

• Supports scheduled and long-running apps.

• Improves resource management and efficiency

• Great APIs

• DC/OS provides an even richer environment

Amaterasu Repositories

• Jobs are defined in repositories

• Current implementation - git repositories

• Support for local directories is planned for a future release

• Repo structure (see the sketch after this list):

• maki.yml - The workflow definition

• src - a folder containing the actions (Spark scripts, etc.) to be executed

• env - a folder containing configuration per environment

• Benefits of using git:

• Branching

• Tooling
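
A minimal sketch of such a job repository, using the file names that appear on the following slides (the repository name and exact layout beyond maki.yml, src and env are illustrative):

amaterasu-job/
├── maki.yml
├── src/
│   ├── file.scala
│   ├── file2.scala
│   └── cleanup.scala
└── env/
    ├── production.json
    └── dev.json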

Workflow DSL - maki.yml

---
job-name: amaterasu-test
flow:
  # Actions
  - name: start
    type: spark-scala
    file: file.scala
  - name: step2
    type: spark-scala
    file: file2.scala
    # Error handling actions
    error:
      name: handle-error
      type: spark-scala
      file: cleanup.scala
...

Amaterasu is not a workflow engine; it’s a deployment tool that understands that Big Data applications are rarely deployed independently of other Big Data applications.

Actions DSL

• Your Scala Spark code (more languages in the future)

• A few changes:

• Don’t create a new sc/sqlContext, use the one in scope or access via AmaContext.sc and AmaContext.sqlContext

• AmaContext.getDataFrame and AmaContext.getRDD are used to access data from previously executed actions

import io.shinto.amaterasu.runtime._

// Read the RDD named "rdd" produced by the "start" action and keep the odd numbers
val oddRdd = AmaContext.getRDD[Int]("start", "rdd").filter(x => x % 2 != 0)

oddRdd.take(5).foreach(println)

// Read the dataset named "odd" from "start" as a DataFrame
val highNoDf = AmaContext.getDataFrame("start", "odd").where("_1 > 3")

highNoDf.write.json("file:///tmp/test1")
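
The same rule applies to the SQL context: rather than constructing a SQLContext, use the managed one. A minimal sketch (hypothetical usage; only the AmaContext.sqlContext accessor itself is named on these slides):

import io.shinto.amaterasu.runtime._

// Read back the JSON written by the previous action via the managed SQLContext
val df = AmaContext.sqlContext.read.json("file:///tmp/test1")
df.show()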

Actions DSL (in action)

Action 1 (“start”):

import io.shinto.amaterasu.runtime._

// Create datasets; downstream actions can read them by name via AmaContext
val data = Array(1, 2, 3, 4, 5)
val x = data.tail

val rdd = AmaContext.sc.parallelize(data)
val odd = rdd.filter(n => n % 2 != 0)

Action 2 would be a snippet like the one above, reading “rdd” and “odd” back from “start” via AmaContext.

Environments

• Configuration is stored per environment

• Stored as JSON

• Contains:

• Spark master URI

• Input/output path

• Work dir

• User defined key-values

production.json:

{
  "name": "production",
  "sparkMasterUrl": "mesos://server1:5050",
  "inputPath": "hdfs://hdfsprd:9000/user/amaterasu/input",
  "outputPath": "hdfs://hdfsprd:9000/user/amaterasu/output",
  "workingDir": "alluxio://server3:19998/",
  "configuration": {
    "spark.cassandra.connection.host": "cassie-prod",
    "sourceTable": "documents"
  }
}

dev.json:

{
  "name": "test",
  "sparkMasterUrl": "local[*]",
  "inputRootPath": "file:///tmp/input",
  "outputRootPath": "file:///tmp/output",
  "workingDir": "file:///tmp/work",
  "configuration": {
    "spark.cassandra.connection.host": "127.0.0.1",
    "sourceTable": "documents"
  }
}

Environments in the Actions DSL

import io.shinto.amaterasu.runtime._

val oddRdd = AmaContext.getRDD[Int]("start", "rdd").filter(x => x % 2 != 0)

oddRdd.take(5).foreach(println)

val highNoDf = AmaContext.getDataFrame("start", "x").where("_1 > 3")

// The output location comes from the current environment, not from the code
highNoDf.write.json(Env.outputPath)
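
This is what makes the same action portable across environments. A rough sketch of the idea (the fields below mirror the JSON keys shown above; only Env.outputPath actually appears on these slides, the rest is an assumption about the runtime API):

// Hypothetical shape of the environment object exposed to actions.
// Amaterasu would load env/<name>.json for the selected environment
// and hand it to the action code as Env.
case class Environment(
  name: String,
  sparkMasterUrl: String,
  inputPath: String,
  outputPath: String,
  workingDir: String,
  configuration: Map[String, String]
)

With the dev configuration, Env.outputPath would resolve to file:///tmp/output; in production, the very same line writes to hdfs://hdfsprd:9000/user/amaterasu/output.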

Future Development

• Continuous integration and test automation

• R, shell and Python support (R is already in progress)

• Extend environments to support:

• Full Spark configuration (spark-defaults.conf, etc.)

• Extendable configuration model

• Better tooling

• DC/OS universe package

• Other frameworks: Flink, Vowpal Wabbit

• YARN?

Getting started

• Amaterasu + demos: https://github.com/shintoio/

• Slack: http://shintoio.slack.com

Thank you!