1 © 2014 The MathWorks, Inc.
2^48: No Fear of Large Data Sets in MATLAB
9 July 2014
Rainer Mümmler
Application Engineering Group
3
Challenges with Large Data Sets
“Out of memory”
– Running out of address space
Slow processing
– Data too large to be efficiently managed between RAM and virtual memory
– Lots of data to process
Gaining insight
– Large data visualization
– Modeling with no equation and lots of predictors
4
Agenda
Available system memory
Memory usage in MATLAB
Techniques for processing large data sets
5
System Memory
System Memory = RAM + Swap/Page on Disk
Virtual Memory
– Process sees contiguous block of memory
– Memory actually divided between RAM and disk
(swap/page file)
– OS maps virtual address to physical address
General guidelines:
– Add RAM, possibly swap space
– If thrashing, consider alternative approaches
[Figure: virtual memory (per process) mapped onto RAM and disk]
6
Memory and Your Operating System
32-bit operating systems
– 4GB of addressable memory per process
– Part of it is reserved by the OS,
leaving the application < 4GB
64-bit operating systems
– In theory, can address 2^64 bytes (about 18 exabytes) of memory
– Determined by OS and processor
– Essentially limited by the amount of RAM and
disk available on the computer
Use 64-bit OS, if possible
[Figure: memory per process, split into the part available for the process and the part reserved by the operating system]
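The address-space figures above can be checked with a few lines of arithmetic. This is only a sketch: the 48-bit case is included on the assumption that the 2^48 in the talk title refers to the 48-bit virtual addresses implemented by common 64-bit processors, which the slides do not state explicitly.

```matlab
% Address-space sizes in bytes (powers of two are exact in double precision)
bytes32 = 2^32;   % 32-bit process
bytes48 = 2^48;   % assumed: 48-bit virtual addresses on common 64-bit CPUs
bytes64 = 2^64;   % theoretical 64-bit limit

fprintf('32-bit: %g GiB\n', bytes32 / 2^30);
fprintf('48-bit: %g TiB\n', bytes48 / 2^40);
fprintf('64-bit: %.1f EB (decimal exabytes)\n', bytes64 / 1e18);
```

The last line shows where the "18 Exabytes" on the slide comes from: 2^64 bytes is roughly 18.4 x 10^18 bytes.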
8
Memory Management in MATLAB
Preallocate arrays
– Large matrices first
Clear variables when no longer needed
Check available memory (Windows only)
>> memory
Control contiguous memory with startup switch (Windows only)
C:\matlab -shield medium
[Figure: MATLAB process address space, shown twice. Already allocated regions alternate with the arrays x1 = zeros(25,1), x2 = zeros(50,1), x3 = zeros(25,1), and x4 = zeros(100,1)]
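As a minimal sketch of the first two recommendations (preallocate, then clear), the following uses made-up variable names and an arbitrary array size; it is not taken from the talk.

```matlab
n = 1e5;                 % illustrative size, not from the slides

% Preallocate the result once instead of growing it inside the loop,
% so MATLAB requests a single contiguous block up front.
y = zeros(n, 1);
for k = 1:n
    y(k) = k^2;
end

% Release temporaries that are no longer needed.
clear k
```

Growing `y` element by element would instead force repeated reallocation into ever larger contiguous blocks, which is what fragments the address space in the figure above.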
9
Data Copies: Function Calls
Data is “copy-on-write” (lazy copy)
Passed by reference into the function
If not modified, no copy is made (a is not copied):
function y = foo(x,a,b)
    y = a * x + b;
end
If modified, a temporary copy is made (a is copied):
function y = foo(x,a,b)
    a(1) = a(1) + 12;
    y = a * x + b;
end
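The same lazy-copy behaviour can be observed with plain assignment, which avoids the need for a separate function file. This is a sketch with hypothetical variable names; the copy itself is not directly visible from MATLAB code, only its effect on the values.

```matlab
a = rand(1000);        % about 8 MB of double data
b = a;                 % b shares a's data; nothing is copied yet
b(1) = b(1) + 12;      % first write to b triggers the actual copy

% a is untouched, so the two arrays now differ only in element 1
sameRest = isequal(a(2:end), b(2:end));
```

Watching the MATLAB process in the operating system's task monitor while stepping through these lines is one way to see that memory use grows at the third line rather than the second.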
10
Data Copies: In-Place Optimizations
MATLAB performs calculations “in-place” when:
– Output variable name is the same as input variable name
– Performing element-wise computation
y = 2*x + 3;   % not in-place
x = 2*x + 3;   % in-place
11
Techniques for Minimizing Data Copies
In-place operations, if possible
Nested functions
– Share the workspace of all outer functions
– Avoids making temporary copies
of input arguments
For objects, consider handle classes
– Copy of a handle object refers to the
same object as the original handle
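The handle behaviour described above can be tried without writing a class file by using `containers.Map`, which is a built-in handle class. This is an illustrative sketch; the key name is made up.

```matlab
m1 = containers.Map();            % built-in handle class
m2 = m1;                          % copies the handle, not the stored data

m2('samples') = zeros(1, 1000);   % modify through the second handle

% Both handles refer to the same underlying object
visibleViaM1 = isKey(m1, 'samples');
```

With a value class (for example a struct), the assignment to `m2` would have left `m1` unchanged.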
12
Using Appropriate Data Storage
Numerical data types
– Floating point for math (e.g. linear algebra)
– Integers where appropriate (e.g. images)
Cells and structures
Sparse arrays
Categorical arrays
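To make the effect of the numeric type concrete, the sketch below compares the data size of the same 1000-by-1000 array stored as double, single, and uint8. Variable names are illustrative.

```matlab
xd = zeros(1000, 1000);             % double: 8 bytes per element
xs = zeros(1000, 1000, 'single');   % single: 4 bytes per element
xi = zeros(1000, 1000, 'uint8');    % uint8:  1 byte per element

infoD = whos('xd');
infoS = whos('xs');
infoI = whos('xi');

fprintf('double: %d bytes\n', infoD.bytes);
fprintf('single: %d bytes\n', infoS.bytes);
fprintf('uint8:  %d bytes\n', infoI.bytes);
```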
13
How does MATLAB store data? Container overhead*
d = [1 2]
– d: Header (112 bytes) + Data
dcell = {[1 2]}
– dcell: Header (112 bytes) + Cell Header (112 bytes) + Data
dstruct.d = [1 2]
– dstruct: Header (112 bytes) + Fieldname (64 bytes) + Element Header (112 bytes) + Data
* Using values for 64-bit MATLAB
14
Sparse Matrices
Require less memory and are faster to operate on when most elements are zero
When to use sparse?
– < 1/2 dense on 64-bit (double precision)
– < 2/3 dense on 32-bit (double precision)
Functions that support sparse matrices
>> help sparfun
Blog Post: Creating Sparse Finite Element Matrices http://blogs.mathworks.com/loren/2007/03/01/creating-sparse-finite-element-matrices-in-matlab/
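A quick way to see the memory argument is to compare a sparse matrix with its full counterpart. The sketch below uses an identity matrix purely as an easy, very sparse example.

```matlab
n = 2000;
S = speye(n);                 % sparse identity: only n nonzeros stored
F = full(S);                  % same values stored densely

density = nnz(S) / numel(S);  % fraction of nonzero elements

infoS = whos('S');
infoF = whos('F');
fprintf('density: %g, sparse: %d bytes, full: %d bytes\n', ...
        density, infoS.bytes, infoF.bytes);
```

Here the density is far below the 1/2 rule of thumb from the slide, so the sparse form is much smaller.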
18
Processing Large Data Sets
Break your large data into separate pieces
and process independently
– Partial reading and writing of files
– Built-in functionality for block-processing
– System Objects for stream processing (signals, videos)
Use the whole dataset at once
– Single array across memory of multiple machines
19
Reading in Part of a Dataset from Files
ASCII file
– Import Tool, textscan
MAT file
– Load and save part of a variable using the matfile
Binary file
– Read and write directly to/from file using memmapfile
– Maps address space to file
Databases (with Database Toolbox)
– ODBC and JDBC-compliant (e.g. Oracle, MySQL, Microsoft SQL Server)
– Database Explorer App
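As a sketch of the MAT-file case, the following writes a matrix to a temporary file and then reads back only its first 100 rows with `matfile`. The file name and variable names are hypothetical; the `-v7.3` flag is used because partial loading is efficient for that format.

```matlab
fname = [tempname '.mat'];         % temporary file, illustrative only
X = rand(1000, 50);
save(fname, 'X', '-v7.3');         % v7.3 format supports efficient partial access

m = matfile(fname);                % no data is loaded yet
[nrows, ncols] = size(m, 'X');     % query the size without reading X
block = m.X(1:100, :);             % read only the first 100 rows

delete(fname);
```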
20
Summary Examples: Reading in Part of a Dataset from Files
Only read/write parts of datasets, and not the whole file
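For the binary-file case, a minimal `memmapfile` sketch looks like this. The file is created on the fly so the example is self-contained; names and sizes are made up.

```matlab
fname = [tempname '.bin'];           % temporary file, illustrative only
fid = fopen(fname, 'w');
fwrite(fid, 1:1000, 'double');       % 1000 doubles on disk
fclose(fid);

m = memmapfile(fname, 'Format', 'double');   % map the file into address space
chunk = m.Data(101:110);             % only this part of the file is accessed

clear m                              % release the mapping before deleting
delete(fname);
```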
21
Block Processing Images
blockproc automatically divides an
image into blocks for processing
Reduces memory usage
– Read and write block directly from image file
Processes arbitrarily large images
Available from Image Processing Toolbox
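A small `blockproc` sketch, assuming Image Processing Toolbox is installed. The image is random data and the per-block function (an intensity inversion) is chosen only for illustration; for truly large images the first argument would be a file name rather than an in-memory array.

```matlab
% Requires Image Processing Toolbox
I = uint8(randi([0 255], 256, 256));   % stand-in for a large image

% The function handle receives a block struct; the pixels are in .data
invertBlock = @(blk) 255 - blk.data;

J = blockproc(I, [64 64], invertBlock);   % process in 64-by-64 blocks
```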
22
Batch processing
Load the entire file and process it all at once
Stream processing
Load a frame and process it before moving on to the next frame
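[Figure: in batch processing, the whole source is loaded into MATLAB memory before the algorithm runs; in stream processing, frames pass one at a time from the stream source through the processing step]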
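The difference can be sketched in plain MATLAB without any toolbox: instead of loading the whole file, read one frame at a time and keep only a running result. The file, frame length, and the statistic computed (a mean) are all illustrative choices.

```matlab
% Create a small binary file to stand in for a large data source
fname = [tempname '.bin'];
fid = fopen(fname, 'w');
fwrite(fid, 1:10000, 'double');
fclose(fid);

frameLen = 1024;      % samples per frame (arbitrary)
total = 0;
count = 0;

fid = fopen(fname, 'r');
while true
    frame = fread(fid, frameLen, 'double');   % load one frame only
    if isempty(frame)
        break;                                % end of file reached
    end
    total = total + sum(frame);               % update the running state
    count = count + numel(frame);
end
fclose(fid);
delete(fname);

streamMean = total / count;
```

Only one frame is held in memory at any time; the running `total` and `count` play the role of the algorithm state that System objects manage automatically.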
23
System Objects
A class of MATLAB objects that support streaming workflows
Simplifies data access for streaming applications
– Manages flow of data from files or network
– Handles data indexing and buffering
Contain algorithms to work with streaming data
– Manages algorithm state
– Available for Signal Processing, Communications, Video Processing, and Phased Array Applications
Available from:
– DSP System Toolbox
– Communications System Toolbox
– Computer Vision System Toolbox
– Phased Array System Toolbox
25
Distributing Large Data
Distributed array lives on the workers
Remotely manipulate array from client
[Figure: the MATLAB desktop (client) connected to four workers, with the rows of a numeric array partitioned across the workers]
Available from Parallel Computing Toolbox and MATLAB Distributed Computing Server
26
Using Distributed Arrays
Regular MATLAB code
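The code shown on this slide is not part of the transcript. As a rough sketch of what "regular MATLAB code" on a distributed array can look like, the example below solves a linear system. It assumes Parallel Computing Toolbox and an available parallel pool; the problem, size, and variable names are invented for illustration.

```matlab
% Requires Parallel Computing Toolbox and a running (or auto-started) pool
n = 500;                               % small, illustrative size
Alocal = rand(n) + n * eye(n);         % well-conditioned test matrix

A = distributed(Alocal);               % spread the array across the workers
b = A * ones(n, 1);                    % ordinary syntax, executed on the workers
x = A \ b;                             % backslash works on distributed arrays

xLocal = gather(x);                    % bring the result back to the client
```

Apart from `distributed` and `gather`, the operators are the same ones used for in-memory arrays.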
27
Investigation: Distributed Calculations
Effect of number of computers on execution time
Processor: Intel Xeon E5-2670, 16 cores, 60 GB RAM per compute node, 10 Gigabit Ethernet
Time (s):
N        1 node,           Distributed,           Distributed,
         multithreaded     2 nodes, 32 workers    4 nodes, 64 workers
4000       2                 3                      3
8000      16                14                     12
16000    126               102                     67
20000    244               187                    118
32000      -               664                    394
40000      -                 -                    710
30
Sample of Other Technical Resources
MATLAB documentation: User’s Guide
– Programming Fundamentals > Software Development > Memory Usage
The Art of MATLAB, Loren Shure’s blog
– blogs.mathworks.com/loren/
Memory Management Guides
– www.mathworks.com/support/tech-notes/1100/1106.html
– www.mathworks.com/support/tech-notes/1100/1107.html
MATLAB Answers
– http://www.mathworks.com/matlabcentral/answers/