Using Apache Spark as an ETL engine: Pros and Cons
Using Apache Spark as an ETL engine: Pros and cons
Maksym Doroshenko, Big Data Software Engineer
LeadGenius, Provectus
Agenda
1. What is Spark
2. Spark components
3. Spark pillars
4. What is ETL pipeline
5. Using Spark SQL for ETL
6. Customer use case
7. Demo
Spark, who are you?
I am a fast and general engine
for large-scale data processing.
Prove it
                Hadoop MR               Spark (100 TB)     Spark (1 PB)
Data Size       102.5 TB                100 TB             1000 TB
Elapsed Time    72 mins                 23 mins            234 mins
# Nodes         2100                    206                190
# Cores         50400                   6592               6080
# Reducers      10,000                  29,000             250,000
Rate            1.42 TB/min             4.27 TB/min        4.27 TB/min
Rate/node       0.67 GB/min             20.7 GB/min        22.5 GB/min
Environment     dedicated data center   EC2 (i2.8xlarge)   EC2 (i2.8xlarge)
Apache Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing.
Spark use cases
- Simplify the challenging and compute-intensive task of processing high volumes of data
- Real-time data processing
- Seamlessly integrate complex capabilities such as machine learning and graph algorithms
- Bring Big Data processing to the masses
Survey: Why do companies use Spark?
- 91% use Apache Spark because of its performance gains
- 77% use Apache Spark as it is easy to use
- 71% use Apache Spark due to the ease of deployment
- 64% use Apache Spark to leverage advanced analytics
- 52% use Apache Spark for real-time streaming
Spark components
What is an RDD?
Resilient Distributed Dataset: a large collection of data with the following properties:
- Immutable
- Distributed
- Lazily evaluated
- Fault tolerant
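A minimal Scala sketch of these properties (the app name, data, and local master are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rdd-demo").master("local[*]").getOrCreate()

// Distributed: the collection is split into partitions across the cluster
val numbers = spark.sparkContext.parallelize(1 to 1000000)

// Immutable + lazily evaluated: map returns a new RDD and computes nothing yet
val doubled = numbers.map(_ * 2)

// Only an action (here: reduce) triggers the actual computation;
// fault tolerance comes from replaying this lineage on lost partitions
val total = doubled.reduce(_ + _)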
Operations
Narrow transformations:
- map
- flatMap
- filter
- etc.
Wide transformations:
- reduceByKey
- groupByKey
- sortByKey
- etc.
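For example (a sketch reusing the spark session above): narrow transformations stay within a partition, while wide ones shuffle data between partitions:

val lines = spark.sparkContext.parallelize(Seq("a b a", "b c"))

// Narrow: flatMap and map need no data from other partitions
val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

// Wide: reduceByKey shuffles records so equal keys meet in one partition
val counts = pairs.reduceByKey(_ + _)
counts.collect() // Array((a,2), (b,2), (c,1)), order may vary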
Spark application tree
Spark DataFrames
A DataFrame is a distributed collection of data grouped into named columns (an RDD with a schema), with more efficient storage options, an advanced optimizer, and direct operations on serialized data. These components are essential for getting the best performance out of Spark.
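A short sketch of the same idea (names and values are made up):

// toDF needs the session's implicits in scope
import spark.implicits._

val people = Seq(("Alice", 29), ("Bob", 31)).toDF("name", "age")

people.printSchema() // name: string, age: integer
people.filter($"age" > 30).show()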
What is ETL?
1. A sequence of transformations on data
2. Source data is typically semi-structured/unstructured (text, JSON, CSV, etc.) or structured (JDBC, Parquet, ORC, Avro, etc.)
3. Output data is clean, structured, integrated, and ready for further processing, analysis, and reporting
ETL query in Spark
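The original slide showed the query as an image; a minimal extract-transform-load sketch in the same spirit (paths and column names are hypothetical) could look like:

import org.apache.spark.sql.functions.{col, to_date}

// Extract: read semi-structured JSON
val raw = spark.read.json("/data/in/events.json")

// Transform: drop bad rows, derive a partition column, keep what is needed
val cleaned = raw
  .filter(col("user_id").isNotNull)
  .withColumn("day", to_date(col("ts")))
  .select("user_id", "day", "event_type")

// Load: write clean, structured, partitioned Parquet
cleaned.write.mode("overwrite").partitionBy("day").parquet("/data/out/events")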
Why is ETL hard?
1. Various sources/formats
2. Schema mismatch
3. Different representations
4. Corrupted files and data
5. Scalability
6. Schema evolution
This is why ETL is important
Consumers of this data do not want to deal with this messiness and complexity
Spark SQL's flexible APIs, support for a wide variety of data sources, built-in support for structured streaming, and the state-of-the-art Catalyst optimizer and Tungsten execution engine make it a great framework for building end-to-end ETL pipelines.
Spark SQL
Data sources
https://spark-packages.org/
Schema inference: semi-structured data
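For example, Spark can sample semi-structured input and derive the schema itself (a sketch; the path is hypothetical):

// Spark scans the JSON and infers column names and types
val users = spark.read.json("/data/in/users.json")
users.printSchema()
// root
//  |-- email: string (nullable = true)
//  |-- age: long (nullable = true)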
User-specified schema
- Faster: no scan needed to infer the schema
- More flexible: easily handles schema evolution
- More robust: catches type errors as early as possible
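A sketch of supplying the schema up front instead (the columns are illustrative):

import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("email", StringType, nullable = true),
  StructField("age", LongType, nullable = true)
))

// No inference scan over the data; type errors surface immediately on read
val users = spark.read.schema(schema).json("/data/in/users.json")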
Deal with bad data
java.io.IOException: org.apache.hadoop.io.compress.DecompressorStream.decompress
java.io.EOFException: Unexpected end of input stream
java.lang.RuntimeException: file:/temp/path/c000.json is not a Parquet file (too small)
[SPARK-17850] If set to true, Spark jobs will continue to run even when they encounter corrupt files; the contents that have been read will still be returned.
spark.sql.files.ignoreCorruptFiles = true
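The flag can also be set on a live session, e.g.:

// Skip unreadable files instead of failing the whole job
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")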
Deal with bad data
[SPARK-12833] [SPARK-13764] Text file formats (JSON and CSV) support 3 parse modes while reading data:
PERMISSIVE, DROPMALFORMED, FAILFAST
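A sketch of choosing a parse mode (PERMISSIVE is the default and routes broken lines into a corrupt-record column):

// Silently drop lines that do not match the expected structure
val tolerant = spark.read.option("mode", "DROPMALFORMED").json("/data/in/users.json")

// Or fail fast on the first malformed record
val strict = spark.read.option("mode", "FAILFAST").json("/data/in/users.json")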
Better JSON and CSV support
[SPARK-18352] [SPARK-19610] Multiline JSON and CSV support.
Spark SQL reads JSON/CSV one line at a time; before Spark 2.2, multiline records required custom ETL.
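Since Spark 2.2 a single reader option covers this (sketch; the paths are made up):

// Parse records that span multiple physical lines
val prettyJson = spark.read.option("multiLine", true).json("/data/in/pretty.json")
val quotedCsv = spark.read.option("multiLine", true).csv("/data/in/quoted.csv")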
Transformations: higher-order functions in SQL
Transformations on complex objects like arrays, maps, and structs inside columns.
1. Check for element existence: SELECT EXISTS(values, e -> e > 30) AS v FROM tbl_nested;
2. Transform an array: SELECT TRANSFORM(values, e -> e * e) AS v FROM tbl_nested;
3. Filter an array: SELECT FILTER(values, e -> e > 30) AS v FROM tbl_nested;
4. Aggregate an array: SELECT REDUCE(values, 0, (value, acc) -> acc + value) AS v FROM tbl_nested;
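In open-source Spark these functions are built in from 2.4 onward; a sketch that exercises them on a made-up table:

import spark.implicits._

// A tiny stand-in for tbl_nested with an array column
Seq((1, Seq(10, 25, 40))).toDF("key", "values").createOrReplaceTempView("tbl_nested")

spark.sql("SELECT exists(values, e -> e > 30) AS v FROM tbl_nested").show()   // true
spark.sql("SELECT transform(values, e -> e * e) AS v FROM tbl_nested").show() // [100, 625, 1600]
spark.sql("SELECT filter(values, e -> e > 30) AS v FROM tbl_nested").show()   // [40]
spark.sql("SELECT aggregate(values, 0, (acc, e) -> acc + e) AS v FROM tbl_nested").show() // 75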
Load
Different save modes: error (default), append, overwrite, ignore
Wide functionality:
df.write
  .partitionBy("favorite_color")
  .bucketBy(42, "name")
  .sortBy("age")
  .saveAsTable("people_partitioned_bucketed")
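The modes map onto DataFrameWriter.mode, e.g. (sketch; the path is hypothetical):

val df = Seq(("Alice", "blue", 29)).toDF("name", "favorite_color", "age")

df.write.mode("overwrite").parquet("/data/out/people") // replace existing output
df.write.mode("ignore").parquet("/data/out/people")    // silently no-op: path exists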
Customer use case
- Data sources in different formats
- Mapping data to a golden customer schema
- Normalizing all data (e.g. email, phone)
- Linking and merging the same entities
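For illustration only, the normalization step might look like this with built-in functions (column names and rules are assumptions, not the customer's actual logic):

import org.apache.spark.sql.functions.{col, lower, regexp_replace, trim}

val customers = Seq((" Bob@Example.COM ", "+1 (555) 010-7788")).toDF("email", "phone")

val normalized = customers
  .withColumn("email", lower(trim(col("email"))))                   // canonical case
  .withColumn("phone", regexp_replace(col("phone"), "[^0-9+]", "")) // keep digits and +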
Spark ETL pros & cons
Pros:
- Open source
- Great community
- Easy to scale
- Strong transformation engine
- Supports different languages
- Unified API across components
Cons:
- No file management system
- Resource consuming
- Manual configuration tuning
- No ETL UI