DATA INGESTION MADE EASY · Zwoox is a data ingestion tool, or a Change Data Capture (CDC) tool...

DATA INGESTION MADE EASY

@ Xpand IT 2019 www.xpand-it.com

Zwoox is a data ingestion tool, or a Change Data Capture (CDC) tool that facilitates companies importing and extracting structured data into a Hadoop cluster.

Zwoox is fully integrated with Cloudera Enterprise Data Hub and is a highly scalable tool based on proven Hadoop technologies (Spark, Hbase, Kafka) that also allows you to change and model the data as you ingest it.

By using this framework, you can eliminate the need to manually code individual data pipelines for every data source.

Zwoox allows you to accelerate data processing by giving you several options on how to import your data, even allowing you to replicate any RDBMS DML in near real time into target Hadoop structures.

There are several tools out there that can drop data into Hadoop (HDFS or Kafka), the hard part is how to make this data easily accessible in a scalable and efficient way.

All these tools almost always need extra work to consolidate and normalize the data (CDC for each source) or arrange data (formatting and partitioning).

Zwoox can take care of all these needs by efficiently replicating data in HDFS, with Hive or Kudu tables created to access the final data.

What’s under the hood?

Kafka for CDC ingestion

By using Kafka for DML event storage Zwoox can access data change events in a structured and ordered fashion, guaranteeing fault tolerance, accurate event replication.

Sqoop for direct Database access

Uses Sqoop for a table's initial load before activating CDC stream, or to do a full or incremental import of a table in case the CDC functionality isn't available.

Spark for all processing

Zwoox fully leverages Spark for its high scalability, resilience and fast data processing. Combined with YARN, you can configure the amount of resources to use in your cluster for Zwoox, making all processes highly scalable and manageable.

HBase for configuration repository

Zwoox uses Hbase as its internal database since it allows very quick access to all relevant metadata even when dealing with thousands of concurrent ingestions.

sales@xpand-it.com

powered by Cloudera Enterprise

DATA INGESTION WITH ZWOOX

HIVE/METASTORE

@ Xpand IT 2019 www.xpand-it.com

Derive new columns from pre-set functions or "pluggable" code

Data availability through atomic bulk changes

Zwoox automatically converts data types between your chosen source database and your chosen target destination. If you want to change the types however, it is also possible to specify which datatype you want a certain column to be converted to.

Besides near real time ingestion using CDC and Kafka, it is also possible to ingest a table using full or incremental imports with Sqoop. This is the simplest way to have a table being ingested, needing nothing but the source database and Sqoop to work.

You can specify a certain time based column in a table (timestamp/date/long) and a cutoff date, and Zwoox will handle the cleansing for you by erasing all information before the cut-off date.

Zwoox uses a rolling view system to provide continuous access to a target tables; at the end of each batch the view is updated to reflect data changes atomically. By using this system, any user can access a table and get a consistent view of it even as ingestion processes occur.

You can - for chosen tables - instruct Zwoox to keep all changed data (deleted or updated rows) in a separate table. This keeps current table access speedy and when you need you can easily access a view with the full audit history, that includes operation timestamp allowing you to know what the table holded anywhere in time.

Cloudera Manager is Cloudera's administration tool for Hadoop clusters. Zwoox is fully integrated with the Cloudera Manager, with installation being possible with a few easy steps. By checking the manager, you can also easily see all Zwoox logs, service statuses, stop/restart services, check on important metrics and change configurations.

Zwoox offers the possibility to add metacolumns to a target destination table, either by using built-in functions or custom 3rd party code. By using this functionality, it is possible for you to: derive new partitioning columns from source columns; do data treatment on incoming data; anonymize relevant data; and do any other thing you can think of.

Partitioning is a very relevant factor when you're using HDFS based tables, for ease of access and speed. You can specify any combination of columns you want to be part of the partition key for any target table.

Data type translation

What does ZwooX offer?

Partitioning automation

Full audit history without performance impacts

Operational Integration withCloudera Manager

Incremental Sqoop based imports Structured historical data cleansing

ABOUT XPAND IT

Everything we do is to inspire others to achieve outstanding results. We do that by investing in IT mastery to generate market value. As a global company, specialised in Big Data, Business Intelligence & Analytics, Middleware, Enterprise Mobility, Collaboration & Development tools, we have top products and services used by Fortune 500 companies.

United Kingdom USAPortugal Sweden

DATA INGESTION MADE EASY · Zwoox is a data ingestion tool, or a Change Data Capture (CDC) tool...

Documents

Transcript of DATA INGESTION MADE EASY · Zwoox is a data ingestion tool, or a Change Data Capture (CDC) tool...

Designing a Real Time Data Ingestion Pipeline

Efﬁcient Data Ingestion and Query Processing for LSM ...asterix.ics.uci.edu/pub/luo-lsm-storage.pdfEfﬁcient Data Ingestion and Query Processing for LSM-Based Storage Systems Chen

Data Ingestion in AsterixDB

Tuning Live Data Map Performance - kb.informatica.com Library/1/0853... · Tuning Live Data Map performance involves tuning parameters for metadata ingestion, ingestion database,

Big Data Ingestion @ Flipkart Data Platform

Real-time Data Ingestion with CDC - Continuous Real-Time ... · Real-time Data Ingestion with CDC The Striim platform ingests real-time, streaming data from HPE NonStop servers and

Reliable and Scalable Data Ingestion at Airbnb

White Paper - Software...Big Data Design Principles This is the Hadoop architecture for storing and querying data within HDFS. Data Ingestion & Storage: Data ingestion (unstructured

The Process of Data Ingestion in ÆKOS

Data ingestion and distribution with apache NiFi

Automating Data Synchronization, Checking, Ingestion and ......2016/07/12 · Automating Data Synchronization, Checking, Ingestion and Publication for CMIP6 Ag Stephens & Alan Iwi

The Automation of the Data Lake Ingestion Process from ...

Real Time Data Ingestion for Your Analytics Ecosystem

Oracle Database 12c - Built for Data Warehousing...5 | ORACLE DATABASE 12C – BUILT FOR DATA WAREHOUSING Data Ingestion Data Ingestion is responsible for moving, cleaning and transforming

Self-Service Data Ingestion Using NiFi, StreamSets & Kafka

PAWN: A Novel Ingestion Workflow Technology for Scientific Data

Gobblin: Unifying Data Ingestion for Hadoop

Pentaho and Big Data Ingestion Patterns · This document covers some best practices on collecting Pentaho Data Integration (PDI) and Hadoop usage patterns for data ingestion and processing.

Data Ingestion at Scale - MICDE...Data Ingestion at Scale Jeffrey Sica ARC-TS @jeefy. Overview What is Data Ingestion? Concepts Use Cases GPS collection with mobile devices Collecting

An IDEA: An Ingestion Framework for Data Enrichment in AsterixDBasterix.ics.uci.edu/pub/vldb-ingestion.pdf · 2019. 9. 27. · An IDEA: An Ingestion Framework for Data Enrichment