Finding and Using Big Data in your business
Simon Elliston Ball, Head of Big Data
@sireb
Finding (and using) Big Data in your Business
#findBigData
http://bit.ly/findBigData
Now THAT's Big Data
• A modern Ford kicks out 25GB per car, in a day.
• Ad networks: over a billion event logs per day.
• PayPal: 3 billion transactions a year
• Climate Corporation: soil type record for every square meter in the USA
• Facebook: 10PB a day
So you're probably not Facebook
• Big Data takes many forms
• Velocity
• Variety
• Volume
• Value
• Veracity
Feature usage at Red Gate
• We are obsessed with UX
• Knowing what our users do helps us make their lives better
• Error reporting
• Feature usage reporting
• Conversations, surveys, sales: everything goes into making products better.
The default: SQL Server
The problem: SQL Server
"I used to use FUR all the time! I can't use it anymore, it's too slow." - Michelle, Product Manager
"I'm running a query right now... It started yesterday :(" - Ben, Product Manager
"Hey, this database is taking up a few TBs, can we just delete it?" - Simon, DBA
DELETE IT!?!?!?
• Thinning out old data
• Archiving to cheaper storage (even tape)
• Turning down collection
Big Data to the rescue
• Cheap storage in Hadoop
• Scale out, not scale up
• Distributed computing required for speed
• Occasional bursty workloads
• Semi-structured
Hadoop
• Created by Doug Cutting as a backend for a search engine and crawler (Nutch) in 2005.
• Developed further at Yahoo
• Based on Google's papers on Google Filesystem, and MapReduce
• Since grown into an ecosystem of tools
• Now version 2.0
Hadoop
All grown up
Really complex
• Lots of moving parts
• Integrating into your network can be complex
• Getting all the tools to play nice
• Self build
• Fixing up from a good starting point
• Use a distro
Sandboxes
• Quick Start
• Great to learn
The menagerie
What we did
• Test cloud
• Virtualization is not Hadoop's friend.
• Performance is not good
• “Can we have 2TB on the SAN for /tmp?” Er. No.
• "Borrowed" some old hardware, and got a small cluster running.
Putting data in
• Sqoop
• Cleaning
• ORC
How to not kill SQL Server
• To a DBA, Sqoop is a DDoS attack
• Limit the number of mappers Sqoop uses
• Import from a replica, or backup
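The mapper limit above can be sketched as a small helper that builds a `sqoop import` command line. This is only an illustration: the host name, database, table, split column and target directory are hypothetical, credentials are left out on purpose, and the right mapper count depends on what your DBA will tolerate. The flags themselves (`--connect`, `--table`, `--target-dir`, `--split-by`, `--num-mappers`) are standard Sqoop 1 import options.

```python
import shlex


def build_sqoop_import(jdbc_url, table, target_dir, split_by, num_mappers=2):
    """Build a `sqoop import` argument list with a capped mapper count.

    Each mapper opens its own connection to the source database, so a
    small num_mappers keeps the load on SQL Server predictable.
    """
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
        "--split-by", split_by,
        "--num-mappers", str(num_mappers),
    ]


# Hypothetical values: point the import at a read-only replica, not production.
cmd = build_sqoop_import(
    jdbc_url="jdbc:sqlserver://replica.example.local:1433;databaseName=FeatureUsage",
    table="UsageEvents",
    target_dir="/data/raw/usage_events",
    split_by="EventId",
    num_mappers=2,
)

# Print the command for review rather than running it.
print(shlex.join(cmd))
```

Printing the command instead of executing it makes it easy to agree the mapper count with the DBA before anything touches the database.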
Immediate value
• The data was a lot smaller
• Cheaper to store
• Column formats
• Compression: use LZO. bzip2 costs too much CPU, and gzip files cannot be split across mappers, which is bad for Hadoop.
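One reason column formats shrink this kind of data can be shown with a toy run-length encoder. This is a sketch under stated assumptions, not ORC's actual on-disk encoding: the feature names and the run length of 100 are made up, and the column is assumed to hold long runs of repeated values, as a low-cardinality column often does once similar rows are stored together.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


def run_length_decode(runs):
    """Expand (value, count) pairs back into the original sequence."""
    return [value for value, count in runs for _ in range(count)]


# Hypothetical feature-usage column: 10,000 values in runs of 100.
FEATURES = ["compare", "deploy", "export", "search", "sync"]
feature_column = [FEATURES[(i // 100) % len(FEATURES)] for i in range(10_000)]

runs = run_length_encode(feature_column)

# The encoding is lossless: decoding gives back the original column.
assert run_length_decode(runs) == feature_column

print(f"{len(feature_column)} values stored as {len(runs)} runs")
```

In a row layout the same values are interleaved with timestamps and user IDs, so runs like these never form; storing each column on its own is what makes the trick available.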
Give it back! Queries and ETL
• Hive. Reuse your SQL
• Pig. New, but worth learning
• MapReduce? (Optional. Warning: may contain Java. Or snakes)
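For anyone who has not met MapReduce, the model underneath Hive and Pig can be sketched in a few lines of plain Python. This runs in one process and only mimics the three phases (map, shuffle, reduce); the `user,feature` log format and the sample lines are hypothetical, and a real job would run these steps across the cluster.

```python
from collections import defaultdict


def map_event(line):
    """Map: emit a (feature, 1) pair for one 'user,feature' log line."""
    _user, feature = line.strip().split(",")
    yield feature, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce: sum the values collected for one key."""
    return key, sum(values)


# Hypothetical feature-usage log lines.
log_lines = [
    "alice,compare",
    "bob,deploy",
    "alice,compare",
    "carol,export",
    "bob,compare",
]

mapped = (pair for line in log_lines for pair in map_event(line))
usage = dict(reduce_counts(key, values) for key, values in shuffle(mapped).items())
print(usage)
```

The equivalent Hive query is a single GROUP BY, which is the argument for reusing your SQL first and reaching for hand-written MapReduce only when you have to.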
Give it back to the business
• Summary report in Excel
• Batch jobs
• Pump back into SQL for slicing and dicing
• Give us MORE!
Give it back! The platform
• To the cloud!
• Reuse all our existing queries and workflow
• On demand compute
• Takes time to lift the initial data set into cloud storage, but incremental updates are fast
Demo HDInsight
Thinking like a data scientist
• Plan your experiments
• Precision is subjective.
• Show the error bars
• Use whatever tool works
• Embrace uncertainty
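"Show the error bars" can be made concrete with a confidence interval on a usage rate. The sketch below uses the Wilson score interval, which is one reasonable choice among several and not necessarily what the team used; the figures (42 of 300 sampled sessions using a feature) are invented for illustration.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    )
    return centre - half_width, centre + half_width


# Hypothetical sample: 42 of 300 sessions used the feature.
low, high = wilson_interval(42, 300)
print(f"usage rate: {42 / 300:.1%} (95% interval {low:.1%} to {high:.1%})")
```

Reporting the interval alongside the headline number tells a product manager how much a difference between two features can be trusted, and ten times the sample gives a visibly narrower bar.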
Know your business
Think strategically
• Business buy-in
• Show quick wins
• What is your analysis for?
• What will it deliver to the business?
Break down the requirements
• Prioritize
• Go for the top value pieces
• Perfect fit for Agile methodologies
Communication
• Talk to everyone you can
• Before
• After
• During
• Organizational knowledge
• Keep a log
Communication
• Conversations
• Coffee machine
• Formal talks
So what's next?
• Denormalize
• Democratize
• Machine learning for alerts
• Marketing
• Sales
And of course new tools
• We want to talk to you...