Data Transformation made easy
Building a successful agile data transformation stack
Martin Magdinier
March 2014
Building an Agile Data Transformation Stack - Martin Magdinier
Agile Data Transformation Stack is the Key for Success
If Data is the new oil, where are the gas stations?!
● Data is not (yet?) a standardized good:
– Environment with evolving technology and formats
– Unique need:
  ● Industry
  ● Department
  ● Business case
The Data Transformation Process
Your data transformation stack should help you to:
– Explore and search new data
– Identify and Extract relevant data
– Refine/Turn data into usable information
– Store & distribute to business users
The Agile Data Transformation Stack
● Is a combination of complementary tools, technology and processes,
● Supporting rapid iteration of ideas, processes and products
● Focused on value creation for the customer (internal or external)
The Data Transformation Stack
[Diagram of the stack. Labels: Platform, Data Processing, Solutions, Storage]
● Free / Open Source
● Suit your needs
● All software is cross-platform
Agile Data Transformation Iteration
● Data Discovery & Profiling
– Mine existing data
– Add new data
● Data Transformation
– Process & Code
– Prototype (MVP)
– Semi automated
– Automation
● Track / Measure
– Collect feedback
– Learn from your experience
● Data Consumption
– Create value
– Generate new need
Progress in small incremental steps
Data Discovery & Profiling
Data Discovery
● Seek:
– New data sources
– New usage for existing data
● Validate:
– Does the data match my quality criteria?
– Can I create value out of it?
Data Profiling
● Understand your data and make sense of it:
– Mine
– Explore
– Interact
– Transform
● Combine with visualization and reporting tools
● Iterate and explore various vantage points
Data Transformation
● Data Discovery & Profiling
– Mine existing data
– Add new data
– Refine requirements
● Data Transformation
– Process & Code
– Prototype (MVP)
– Semi automated
– Automation
● Track / Measure
– Collect feedback
– Learn from your experience
● Data Consumption
– Create value
– Generate new need
Progress in small incremental steps
Role of a Working Prototype
● Minimize project cost and development time
● Focus on core functions of the transformation process (packaging will come later)
● Define your transformation strategy in a sandbox mode:
– Validate your assumptions
– Identify roadblocks on the path to automation
Iterate - Iterate - Iterate
● Improve and grow by incremental steps
● Start feeding your business with data:
– Validate if there is value in this data
– Collect feedback from the users
● Iterate as much as necessary
Discovery, Profiling & Prototyping
● Designed for technical and business users
● Supports a variety of input formats
● Allows easy and safe interaction with the data:
– Somewhere between Excel
  ● Point-and-click, user-friendly interface
  ● Change preview
  ● Undo / Redo functions
– and SQL
  ● Query-oriented language
  ● Handles large amounts of data
OpenRefine Interface
Facets for fast filtering
Expression builder
Instant preview of the transformation
Prototyping & Automation
● Extract – Transform – Load solution
● Process focus with:
– Drag and drop component graphical interface
– Java based
● Compile your job to run it on your server:
– Java (Talend Open Studio)
– MapReduce (Talend for Big Data)
● Connect to anything
● Open Source: easy to add or customize your own components / libraries
Talend Open Studio Interface
Drag, drop, connect and configure components
Process oriented interface
List of components available
Semi Automated Cleaning
● Intelligent Meta Crowd-sourcing Platform
● Build your job for data:
– clean up
– analysis
– categorization
– collection ...
● Ensure quality output:
– Check consistency of results
– Select best worker
● Web interface to:
– Build prototype
– Test job
● API for automation:
– OpenRefine extension
– Talend Internet component
Lessons Learned
Don't repeat yourself
● 1 process = 1 independent component / job
● Reuse your existing components
● Maintain your code in one place
● Add a few new items at each iteration
Name Splitting
● Split FullName into FirstName and LastName
– John Doe / John Van de Doe / John Della Doe
1. Define logic and exception list in OpenRefine
2. Translate the logic into a Talend component (tJavaRow)
3. Move the Talend component to a routine
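A minimal sketch of what the splitting logic and its exception list might look like, written in Python rather than in OpenRefine or a Talend tJavaRow. The particle list and the function name `split_full_name` are assumptions made for illustration; the slide does not show the actual rules used.

```python
# Hypothetical exception list: particles that belong to the last name.
SURNAME_PARTICLES = {"van", "de", "della", "von", "der", "di", "la", "le"}


def split_full_name(full_name):
    """Split a full name into (first_name, last_name).

    The last name starts at the first surname particle found after the
    first token. If no particle is found, the last name is the final token.
    """
    tokens = full_name.split()
    if not tokens:
        return "", ""
    if len(tokens) == 1:
        return tokens[0], ""
    # Default: everything except the final token is the first name.
    split_at = len(tokens) - 1
    for i in range(1, len(tokens) - 1):
        if tokens[i].lower() in SURNAME_PARTICLES:
            split_at = i
            break
    return " ".join(tokens[:split_at]), " ".join(tokens[split_at:])


for name in ("John Doe", "John Van de Doe", "John Della Doe"):
    print(name, "->", split_full_name(name))
```

Keeping the exception list in one place is what makes the later move to a shared routine worthwhile: every job that splits names picks up a new particle as soon as it is added.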
Garbage in - Garbage out
● Catch errors early:
– The sooner, the easier
– Do not build the next step on erroneous data
● Independent processes:
– Make it easier to track and debug
– When the bug is fixed, every process / job benefits from it
Know where the value is
● A poorly planned data cleaning process is a never-ending job (and a depressing experience)
● Prototyping helps to:
– Anticipate how dirty the data is
  ● Plan appropriate strategy
  ● Discard the source early on if too dirty
– Set quality level of acceptance
  ● Level of granularity
  ● Data format
  ● ...
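One way to make "discard the source early on if too dirty" concrete is a small quality gate run on a sample during prototyping. The sketch below is an illustration only: the 20% threshold, the `has_city` check and the sample rows are all assumptions, not values from the talk.

```python
def invalid_share(rows, is_valid):
    """Return the fraction of rows that fail the validity check."""
    rows = list(rows)
    if not rows:
        return 0.0
    return sum(1 for row in rows if not is_valid(row)) / len(rows)


def assess_source(rows, is_valid, max_invalid_share=0.2):
    """Decide whether a source is clean enough to keep working on.

    max_invalid_share is a hypothetical acceptance level; set it per project.
    """
    share = invalid_share(rows, is_valid)
    decision = "discard" if share > max_invalid_share else "keep"
    return decision, share


def has_city(row):
    """Example validity rule: the city field must not be blank."""
    return bool(row.get("city", "").strip())


# Made-up sample rows, for illustration only.
sample = [
    {"city": "Mississauga"},
    {"city": ""},
    {"city": "Toronto"},
    {"city": "  "},
    {"city": "Ottawa"},
]

decision, share = assess_source(sample, has_city)
print(decision, share)
```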
Example: Address parsing
– 91 King Street East
– 305 – 1055, 20 TH ST SW
● Option A:
– Address Line 1
– Address Line 2
● Option B:
– Street Number
– Street Name
– Unit / PO Box
– Unit / PO Box Number
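To give a sense of the extra effort Option B implies, here is a rough regular-expression sketch that handles only the two sample addresses above. The pattern, the field names and `parse_address` are assumptions for illustration; real address data would need many more cases (PO boxes, suffixes, missing numbers), which is the cost to weigh against Option A.

```python
import re

# Optional "unit - " prefix (hyphen or en dash), then street number,
# an optional comma, and the rest of the line as the street name.
ADDRESS_PATTERN = re.compile(
    r"^(?:(?P<unit>\d+)\s*[-\u2013]\s*)?"
    r"(?P<number>\d+),?\s+"
    r"(?P<street>.+)$"
)


def parse_address(raw):
    """Parse a raw address into Option B style fields, or None if no match."""
    match = ADDRESS_PATTERN.match(raw.strip())
    if match is None:
        return None
    return {
        "unit_number": match.group("unit"),
        "street_number": match.group("number"),
        "street_name": match.group("street"),
    }


for raw in ("91 King Street East", "305 \u2013 1055, 20 TH ST SW"):
    print(raw, "->", parse_address(raw))
```

Rows that return `None` are the ones to route to manual or semi-automated cleaning instead of stretching the pattern further.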
Know when to stop
● Plan your process keeping in mind the effort to:
– Build
– Operate
– Maintain
● Balance fully automated vs semi-automated process
– Manual Cleaning - Crowdflower API
– OpenRefine Redo / Apply function
– Talend job
Undo / redo in OpenRefine
History to undo previous steps
Extract and reapply transformation steps on a different project
JSON code to copy / paste in a different project
Know when to stop
Build your job in Crowdflower
Cleaning Typos
● How do you spell:
– Mississagua
– mississauga
– Mississauga
– Mississuaga
– Misssisauga
● Algorithms:
– Levenshtein
– Fingerprint
– n-gram
– Metaphone
– PPM
● Process followed:
– Test and explore various algorithms in OpenRefine
– Automate in Talend with tFuzzyMatch
– Add human validation over a certain threshold
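The process above can be sketched in plain Python as a stand-in for the OpenRefine clustering and Talend tFuzzyMatch steps; it does not reproduce either tool. The `fingerprint` function follows the general key-collision idea (trim, lowercase, strip punctuation, sort unique tokens), `levenshtein` is the standard edit distance, and the reference list and the `auto_max_distance=2` threshold are assumptions chosen for illustration.

```python
import string
import unicodedata


def fingerprint(value):
    """Normalize a string into a key: trimmed, lowercased, ASCII-folded,
    punctuation removed, unique tokens sorted."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def levenshtein(a, b):
    """Edit distance between a and b (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (char_a != char_b),   # substitution
            ))
        previous = current
    return previous[-1]


def match_city(value, reference, auto_max_distance=2):
    """Return (best_match, distance, status).

    Matches within auto_max_distance are accepted automatically;
    anything further away is flagged for human validation.
    """
    key = fingerprint(value)
    best = min(reference, key=lambda ref: levenshtein(key, fingerprint(ref)))
    distance = levenshtein(key, fingerprint(best))
    status = "auto" if distance <= auto_max_distance else "review"
    return best, distance, status


REFERENCE = ["Mississauga", "Toronto"]  # hypothetical reference list

for raw in ("Mississagua", "mississauga", "Mississuaga", "Misssisauga", "Missisga"):
    print(raw, "->", match_city(raw, REFERENCE))
```

Fingerprinting alone only merges the case variants; the edit distance catches the transposed letters, and the threshold decides which values go to a person.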
Cleaning Typos
1. OpenRefine cluster interface to test different algorithms
2. tFuzzyMatch in Talend to automate transformation
Conclusion
● Think Agile!
● Iterate as often as you can:
– Start small and build on it
– Confirm your assumptions
– Focus on value creation
● Build a data friendly environment:
– Choose your tools carefully
– Leave room for learning and growing
Contact
Ask me questions!
Martin Magdinier
● LinkedIn: www.linkedin.com/in/magdinier/en
● Twitter: @magdmartin
● Email