Big Data Pragmaticalities - University of Tasmania
Big Data Pragmaticalities: Experiences from Time Series Remote Sensing
MARINE & ATMOSPHERIC RESEARCH
Edward King Remote Sensing & Software Team Leader
3 September 2013
Overview
• Remote sensing (RS) and RS time series (type of processing & scale)
• Opportunities for parallelism
• Compute versus Data
• Scientific programming versus software engineering
• Some handy techniques
• Where next
Big Data Pragmaticalities 2 |
Automated data collection….
Presto! Big Data(sets).
More Detail…

• Processing levels: L0 (raw sensor) → L1B (calibrated) → L2 (derived quantity) → Remapped → Composites
• Examples: 1 km imagery: 3000 scenes/year × 500 MB/scene × 10 years = 15 TB; 500 m imagery: ×4 = 60 TB
Recap - Big Picture View
• These archives are large
• They are often only stored in raw format
• We usually need to do some significant amount of processing to extract the geophysical variable(s) of interest
• We often need to process the whole archive to achieve consistency in the data
• As scientists, unless you have a background in high performance computing and data intensive science, this is a daunting prospect.
• There are things that can make it easier…
Output types…

• Scenes: each individual scene is delivered to the user
• Composites: scene + scene + scene = “best pixels” composite delivered to the user …etc
Things to notice
• Some operations are done over and over again on data from different times.
• For example: processing Monday’s data and Tuesday’s data are independent.
• This is an opportunity to do things in parallel (ie all at the same time).
• Operations on one place in the data are completely independent of operations in other places.
• For example: processing data from WA doesn’t depend on data from Tas.
• This is another opportunity to do things in parallel (ie all at the same time).
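Per-scene independence is exactly what a process pool exploits. A minimal sketch in Python's standard library (the `process_scene` function is a hypothetical stand-in for a real workflow step such as calibration or remapping):

```python
# Each scene is independent, so a pool of workers can process
# several at once; one worker per CPU is the usual choice.
from multiprocessing import Pool

def process_scene(scene_name):
    # Placeholder for real per-scene work (calibrate, remap, ...).
    return f"{scene_name}: done"

if __name__ == "__main__":
    scenes = [f"scene_{day:03d}" for day in range(1, 8)]
    with Pool(processes=2) as pool:
        results = pool.map(process_scene, scenes)  # runs scenes in parallel
    print(results[0])
```

The same shape applies to splitting by place instead of by time: the pool just maps over tiles rather than dates.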
Big Data Pragmaticalities 8 |
12th ARSPC - Fremantle
Note: this general pattern is often referred to as “MapReduce”, and there are software frameworks, such as Hadoop, that formalise it – for example, it lies behind Google search indexing. (Disclaimer: I’ve never used one.)
So what?
• Our previous example: 10 yrs × 3000 scenes/yr (~0.5 GB/scene) @ 10 mins/scene = 5000 hrs = 30 weeks
– Give me 200 CPUs = 25 hours
• But what about the data flux?
• 15 TB / 30 weeks = 3 GB/hour
• 15 TB / 25 hours = 600 GB/hour
• The problem is transformed from compute-bound to I/O-bound
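The flux figures above are simple back-of-envelope arithmetic, which is easy to check:

```python
# Verify the data-flux numbers: the same 15 TB archive, consumed
# serially over 30 weeks versus in parallel over 25 hours.
archive_gb = 15 * 1024              # 15 TB in GB
serial_hours = 30 * 7 * 24          # 30 weeks of wall-clock time
parallel_hours = 25                 # with 200 CPUs

serial_rate = archive_gb / serial_hours        # ~3 GB/hour
parallel_rate = archive_gb / parallel_hours    # ~614 GB/hour

print(f"serial:   {serial_rate:.1f} GB/hour")
print(f"parallel: {parallel_rate:.1f} GB/hour")
```

A 200-fold speed-up in compute demands a 200-fold increase in data supply rate, which is where the I/O bound comes from.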
Key tradeoff #1:
• Can you supply data fast enough to make the most of your computing?
• How much effort you put into this depends on:
• How big your data set is
• How much computing you have available
• How many times you have to do it
• How soon you need your result
• Figuring out how to balance data organisation and supply against time spent computing is key to getting the best results.
• Unless you have an extraordinarily computationally intensive algorithm, you’re (usually) better off focussing on steps to speed up data.
Computing Clusters
Workstation 2 CPUs (15 weeks)
My first (& last) cluster (2002) 20 CPUs (1.5 weeks)
NCI (now obsolete) 20000 CPUs (20 mins)
Plumbing & Software
• Somehow we have to connect data to operations:
• Operations = atmospheric correction | remap | calibrate | myCleverAlgorithm
– Might be pre-existing packages
– Or your own special code (Fortran, C, Python, … Matlab, IDL)
• Connect = provide the right data to the right operation and collect the results
– Usually you will use a scripting language, since you need to:
– work with the operating system
– run programs
– analyse file names
– maybe read log files to see if something went wrong
• Software for us is like glassware in a chem lab: a specialised setup for our experiments; you can get components off the shelf, but only you know how you want to connect them together.
• Bottom line – you’re going to be doing some programming of some sort.
Scientific Programming versus Software Engineering (Key Tradeoff #2)
• Do you want to do this processing only once, or many times?
• Which parts of your workflow are repeated, and which are one-off?
• Eg base processing runs many times, followed by one-off analysis experiments
• How does the cost of your time spent programming compare with the availability of computing and the time spent running your workflow?
• Why spend a week making something twice as fast if it already runs in two days? (Maybe because you need to do it many times?)
• Will you need to understand it later?
Proprietary fly in the ointment (#1)
• If you use licensed software (IDL, Matlab etc.) you need licences for each CPU you want to run on.
• This may mean you can’t use anything like as much computing as you otherwise could.
• These languages are good for prototyping and testing.
• But to really make the most of modern computing, you need to escape the licensing encumbrance = migrate to free software.
• PS: Windows is licensed software.
• Example: we have complex IDL code that we run on a big data set at the NCI, but only 4 licences. It runs in a week (6 days); with 50 licences it would run in 12 hours. We can live with that, since porting it to Python would take weeks and weeks of coding and testing.
How to do it…
Maximise performance:
1. Minimise the amount of programming you do
• Exploit existing tools (eg std. processing packages, operating system commands)
• Write things you can re-use (data access, logging tools)
• Choose file names that make it easy to figure out what to do
• Use the file-system as your database
2. Maximise your ability to use multiple CPUs
• Eliminate unnecessary differences (eg data formats, standards)
• Look for opportunities to parallelise
• Avoid licensing (eg proprietary data formats, libraries, languages)
3. Seek data movement efficiency everywhere
• Data layout
• Compression
• RAM disks
4. Minimise the number of times you have to run your workflow
• Log everything (so there is no uncertainty about whether you did what you think you did)
RAM disks
• Tapes are slow
• Disks are less slow
• Memory is even less slow
• Cache is fast – but small
• Most modern systems have multiple GB of RAM for each CPU, which you can assign to working memory and as virtual disk.
• If you have multiple processing steps that need intermediate file storage, use a RAM disk. You can get a factor of 10 improvement.
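A sketch of the RAM-disk trick, assuming a Linux system where `/dev/shm` is a RAM-backed tmpfs mount (it usually is; elsewhere, fall back to an ordinary temp directory):

```python
# Put intermediate files between processing steps on a RAM-backed
# filesystem instead of spinning disk.
import os
import tempfile

ram_root = "/dev/shm" if os.path.isdir("/dev/shm") else tempfile.gettempdir()

with tempfile.TemporaryDirectory(dir=ram_root) as workdir:
    intermediate = os.path.join(workdir, "step1_output.bin")
    with open(intermediate, "wb") as f:        # step 1 writes here...
        f.write(b"\x00" * 1024)
    size = os.path.getsize(intermediate)       # ...step 2 reads it back
print("intermediate files lived in", ram_root)
```

The `TemporaryDirectory` context also handles the clean-up, so RAM isn't slowly eaten by forgotten intermediates.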
Compression
• Data that is half the size takes half as long to move (but then you have to uncompress it – but CPUs are faster than disks)
• Zip and gzip will usually get you a factor of 2–4 compression
• Bzip2 is often 10–15% better
• BUT – it is much slower (a factor of 5)
• Don’t store random precision (3.14 compresses more than 3.1415926)
• Avoid recompressing (treat the compressed archive as read-only, ie copy-uncompress-use-delete, DO NOT move-uncompress-use-recompress-move-back)
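The copy-uncompress-use-delete pattern can be wrapped in a small helper, so the compressed archive copy is never touched. A sketch (the helper name and demo file are made up for illustration):

```python
# Decompress a gzipped archive file into scratch space, hand the
# plain file to a caller-supplied function, then delete the scratch
# copy. The .gz original is only ever read, never re-compressed.
import gzip
import os
import shutil
import tempfile

def with_uncompressed(archive_gz, use):
    """Call use(path) on a temporary uncompressed copy of archive_gz."""
    scratch = tempfile.mkdtemp()
    plain = os.path.join(scratch, "working_copy")
    try:
        with gzip.open(archive_gz, "rb") as src, open(plain, "wb") as dst:
            shutil.copyfileobj(src, dst)   # copy + uncompress
        return use(plain)                  # use
    finally:
        shutil.rmtree(scratch)             # delete; archive untouched

if __name__ == "__main__":
    with gzip.open("sample.nc.gz", "wb") as f:   # make a demo archive
        f.write(b"sample data")
    data = with_uncompressed("sample.nc.gz", lambda p: open(p, "rb").read())
    print(data)
```

Because the scratch copy is deleted in a `finally` block, a crash mid-use still leaves the read-only archive intact.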
(Figure: File.gz is read from the remote disk and decompressed by the CPU into a plain file in RAM.)
Data Layout
• Look at your data access patterns and organise your code/data to match
• Eg 1: if your analysis uses multiple files repeatedly, reorganise the data so you reduce the number of open and close operations.
• Eg 2: big files tend to end up as contiguous blocks on a disk, so try to localise access to the data rather than jumping around, which will entail waiting for the disk.
(Figure: access by row versus access by column in a row-major file layout.)
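The row-versus-column asymmetry is easy to demonstrate with a small binary grid stored row-major on disk (sizes here are made up for illustration): reading one row is a single contiguous read, while reading one column needs a seek per row.

```python
# Write a rows x cols float32 grid row-major, then compare access
# patterns: one row = one read; one column = one seek+read per row.
import os
import struct
import tempfile

rows, cols, itemsize = 100, 200, 4
path = os.path.join(tempfile.mkdtemp(), "grid.bin")
with open(path, "wb") as f:
    for r in range(rows):                        # cell value = row index
        f.write(struct.pack(f"{cols}f", *[float(r)] * cols))

with open(path, "rb") as f:
    row = f.read(cols * itemsize)                # row 0: 1 contiguous read

    column = []
    for r in range(rows):                        # column 5: 100 seeks
        f.seek((r * cols + 5) * itemsize)
        column.append(struct.unpack("f", f.read(itemsize))[0])

print(len(row), "bytes in one read;", len(column), "reads for one column")
```

On a real disk those hundred seeks, not the bytes moved, dominate the column's cost.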
Data Formats (and metadata)
• This is still a religious subject; factors to consider:
• Avoid proprietary formats (may need licences or libraries for undocumented formats) – versus open formats that are publicly documented
• Self-contained (keep header (metadata) and data together)
• Self-documenting formats have structure that can be decoded using only information already in the file
• Architectural independence – will work on different computers
• Storage efficiency – binary versus ASCII
• Access efficiency and flexibility – support for different layouts
• Interoperability – openness and standards conformance = reuse
• Need some conventions around metadata for consistency
• Automated metadata harvest (for indexing/cataloguing)
• Longevity (& migration)
• Answer: use netCDF or HDF (or maybe FITS in astronomy)
The file-system is my database
• Often, in your multi-step processing of 1000s of files, you will want to use a database to keep track of things – DON’T!
• Every time you do something, you have to update the DB.
• It doesn’t usually take long before inconsistencies arise (eg someone deletes a file by hand).
• Databases are a pain to work with by hand (SQL syntax, forgettable rules).
• Use the file-system (folders, filenames) to keep track. Egs:
• Once file.nc has been processed, rename it to file.nc.done and just have your processing look for files *.nc. (Rename it back to file.nc to run it again; use ls or dir to see where things are up to, and rm to get rid of things that didn’t work.)
• Create zero-size files as breadcrumbs:
– touch file.nc.FAIL.STEP2
– ls *.FAIL.* to see how many failures there were and at what step
• Use directories to group data that need to be grouped – for example, all files for a particular composite.
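The same bookkeeping done from a script rather than the shell – a minimal sketch of the rename-to-done and breadcrumb conventions above (function names are made up):

```python
# "The file-system is my database": rename processed files to *.done,
# leave zero-size breadcrumb files for failures, and answer
# "what's left to do?" with an ordinary directory listing.
import pathlib

def mark_done(path):
    """Rename file.nc -> file.nc.done once it has been processed."""
    path = pathlib.Path(path)
    return path.rename(path.parent / (path.name + ".done"))

def mark_failed(path, step):
    """Leave a zero-size breadcrumb, eg file.nc.FAIL.STEP2."""
    pathlib.Path(f"{path}.FAIL.STEP{step}").touch()

def pending(folder):
    """Plain *.nc files (no .done suffix) still awaiting processing."""
    return sorted(pathlib.Path(folder).glob("*.nc"))
```

Note that the glob pattern `*.nc` naturally skips `file.nc.done`, so the "database query" is just a directory listing.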
Filenames are really important
• Filenames are a good place to store metadata relevant to the processing workflow:
• They’re easy to access without opening the file
• You can use file-system tools to select data
• Use YYYYMMDD (or YYYYddd) for dates in filenames – then they will automatically sort into time order (cf DDMMYY, DDmonYYYY)
• Make it easy to get metadata out of file names:
• Fixed-width numerical fields (F1A.dat, F10B.dat, F100C.dat is harder to interpret by program than F001A.dat, F010B.dat, F100C.dat)
• Structured names – but don’t go overboard!
– D-20130812.G-1455.P-aqua.C-20130812172816.T-d000000n274862.S-n.pds
– Eg ls *.G-1[234]* to choose files at a particular time of day
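Both payoffs can be seen in a few lines (filenames below are made up for illustration): plain string sorting of YYYYMMDD names gives chronological order for free, and shell-style patterns select subsets just as `ls` does.

```python
# YYYYMMDD dates sort chronologically as plain strings;
# DDMMYYYY dates do not.
import fnmatch

good = ["D-20120301.nc", "D-20111225.nc", "D-20120115.nc"]   # YYYYMMDD
bad = ["01032012.nc", "25122011.nc", "15012012.nc"]          # DDMMYYYY

chronological = sorted(good)      # 2011-12-25, 2012-01-15, 2012-03-01
lexical = sorted(bad)             # NOT time order: 25 Dec 2011 sorts last

# Selecting by time of day with a pattern, as in `ls *.G-1[234]*`:
passes = ["D-20130812.G-1455.P-aqua.pds", "D-20130812.G-0230.P-aqua.pds"]
afternoon = fnmatch.filter(passes, "*.G-1[234]*")
print(chronological, afternoon)
```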
Logging and Provenance
• Every time you do something (move data, feed it to a program, put it somewhere), write a time-stamped message to a log file.
• Write a function that automatically prepends a timestamp to a piece of text you give to it.
• Time-stamps are really useful for profiling – identifying where the bottlenecks are, or figuring out if something has gone wrong.
• Huge log files are a tiny marginal overhead.
• Make them easy to read by program (eg grep).
• Make your processing code report a version (number, or description), and its inputs, to the log file. Write the log file into the output data file as a final step.
• This lets you understand what you did months later (so you don’t do it again).
• It keeps the relevant log file with the data (so you don’t lose it, or mix it up).
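The timestamp-prepending function mentioned above is only a few lines; a sketch (log messages and file names are illustrative):

```python
# Append a timestamped message to a log file; ISO-8601 stamps keep
# the log sortable and grep-friendly.
import datetime

def log(message, logfile="run.log"):
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    line = f"{stamp} {message}"
    with open(logfile, "a") as f:
        f.write(line + "\n")
    return line

if __name__ == "__main__":
    log("remap start scene=D-20130812")
    log("remap done  scene=D-20130812")
```

Subtracting consecutive stamps (eg with a one-line script over the log) gives the per-step profile for free.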
Final Thoughts
• Most of this is applicable to other data-intensive parallel processing tasks
• Eg spatio-temporal model output grids
• Advantages may vary depending on file size
• Data organisation has many subtleties – a little work in understanding can offer great returns in performance
• Keep an eye on file format capabilities
• More CPUs is a double-edged sword
• Data efficiency will only become more important
• Haven’t really touched on spatial metadata (very important for ease of end-use/analysis – but tedious (= automatable))
• Get your data into a self-documenting, machine-readable, open file format – and you’ll never have to reformat by hand again.
• These are things we now do out of habit because they work for us
• Perhaps they’ll work for you?
Marine & Atmospheric Research Edward King Team Leader: Remote Sensing & Software
t +61 3 6232 5334 e [email protected] w www.csiro.au/cmar
MARINE & ATMOSPHERIC RESEARCH
Thank you