ExtSort
-
Upload
nurgazy-nazhimidinov -
Category
Documents
-
view
220 -
download
0
Transcript of ExtSort
-
8/13/2019 ExtSort
1/30
-
8/13/2019 ExtSort
2/30
CENG 351 Data Management and File Structures 2
External Sorting
Problem: Sort 1Gb of data with 1Mb of
RAM.
When a file doesnt fit in memory, there
are two stages in sorting:
1. File is divided into several segments, each of
which sorted separately
2. Sorted segments are merged(Each stage involves reading and writing the file
at least once)
-
8/13/2019 ExtSort
3/30
CENG 351 Data Management and File Structures 3
Sorting Segments
Two possibilities depending on the number of disks:
1. Heapsort:
optimal routine if only one disk drive is available.
It can be executed by overlapping the input/outputwith processing
Each sorted segment will be the size of the availablememory.
2. Replacement selection:
optimal for two or more disk drives.
Sorted segments are twice the size of memory.
Reading in and writing out can be overlapped
-
8/13/2019 ExtSort
4/30
CENG 351 Data Management and File Structures 4
Heap: ReviewWhat is a heap?
A heap is a binary tree with the following
properties:
1. Each node has a single key and that key is greater than
or equal to the key at its parent node.2. It is a complete binary tree. i.e. All leaves are on at
most 2 levels, leaves on the lowest level are at the
leftmost position.
3. Can be stored in an array; the root is at index 1, thechildren of node i are at indexes 2*i, and 2*i+1.
Conversely, the parent of node j is stored at index
j/2(very compact: no need to store pointers)
-
8/13/2019 ExtSort
5/30
CENG 351 Data Management and File Structures 5
10
35 20
45 40
60 50 55
25 30
Heap as abinary tree:
Height = log n
Example
Heap as an array:
10 35 20 45 40 25 30 60 50 55
-
8/13/2019 ExtSort
6/30
CENG 351 Data Management and File Structures 6
Heapsort Algorithm for Small Files First Stage: Building the heap while reading the
file :
While there is available space
Get the next record from current input buffer
Put the new record at the end of the heap
Reestablish the heap by exchanging the new node with its
parent, if it is smaller than the parent: otherwise leave it, whereit should be. Repeat this step as long as heap property isviolated.
Second stage: Sorting while writing the heap outto the file :
While there are records in heap Put the root record in the current output buffer.
Replace the root by the last record in the heap.
Restore the heap again, which has the complexity of O(log n)
-
8/13/2019 ExtSort
7/30
CENG 351 Data Management and File Structures 7
Example
Trace the algorithm with:
48 70 30 19 50 45 100 15
-
8/13/2019 ExtSort
8/30
CENG 351 Data Management and File Structures 8
Heapsort for Large Files
Basic idea
1. Forming runs (i.e. sorted subfiles):
bring as many records as possible to main
memory, sort them using heapsort, save it intoa small file.
Repeat this until we have read all records
from the original file.2. Do a multiway merge of the sorted
subfiles.
-
8/13/2019 ExtSort
9/30
CENG 351 Data Management and File Structures 9
Heapsort for Large Files (cont.)
How big is a heap? As big as the available memory.
What is the time it takes to create the sortedsegments?
Ignoring the seek time and assuming bblocks in the file,where heap processing overlaps (approximately) withI/O.
The time for creating the initial sorted segments is2b*btt (read in the segment and write out the runs)
Note that the entire file has not been sorted yet. Theseare just sorted segments, and the size of each segment islimited to the size of the available memory used for this
purpose.
-
8/13/2019 ExtSort
10/30
CENG 351 Data Management and File Structures 10
Multiway Merging
K-way merge: we want to merge K input lists tocreate a single sequentially ordered output list. (Kis the order of a K-way merge)
We will adapt the 2-way merge algorithm:
Instead of two lists, keep an array of lists: list[0], list[1], list[k-1]
Keep an array of the items that are being used fromeach list: item[0], item[1], item[k-1]
The merge processing requires a call to a function (sayMinIndex) to find the index of the item with theminimum value.
-
8/13/2019 ExtSort
11/30
CENG 351 Data Management and File Structures 11
Finding the minimum item
When the number of lists is small (K8)sequential search among items works nicely.
(O(K))
When the number of lists is large, we could placethe items in a priority queue (an array heap).
The min value will be at the root (1stposition in
array)
Replace the root with the next value from the
associated list. This insert operation is O(log K)
-
8/13/2019 ExtSort
12/30
CENG 351 Data Management and File Structures 12
Merging as a way of Sorting Large Files
Let us consider the following example: File to be sorted:
8,000,000 records
R = 100 bytes
Size of the key = 10 bytes Memory available as a work area: 10MB (not
counting memory used to hold program, O.S., I/Obuffers etc.)
Total file size = 800MB
-
8/13/2019 ExtSort
13/30
CENG 351 Data Management and File Structures 13
Cost of Heapsort
I/O operations are performed in the following times:
1. Reading each record into main memory for
sorting and forming the runs.
2. Writing sorted runs to disk.These two steps are done as follows:
Read a chunk of 10MB, write a chunk of 10MB
(repeat this 80 times)
In terms of basic disk operations, we spend:
For reading: 80 seeks + transfer time for 800 MB
Same for writing
-
8/13/2019 ExtSort
14/30
CENG 351 Data Management and File Structures 14
3. Reading runs into memory for merging. Read
one chunk of each run, so 80 chunks. Since
available memory is 10MB each chunk can have
(10,000,000/80)bytes = 125,000 bytes = 1250
records.
How many chunks to be read for each run?Size of run/size of chunk = 10,000,000/125,000= 80
Total number of basic seeks = Total number of
chunks (counting all runs) is 80 runs * 80 chunks/run
= 802
chunks = 6400 seeks. Reading each chunk involves average seeking.
-
8/13/2019 ExtSort
15/30
CENG 351 Data Management and File Structures 15
4. Writing sorted file to disk: after the first
pass, the number of separate writes closely
approximate reads. We estimate two seeks
- one for reading and one for writing- for
each piece: 80* 80 pieces therefore
6400 seeks
-
8/13/2019 ExtSort
16/30
CENG 351 Data Management and File Structures 16
Sorting a File that is 10 times larger
How is the time for merge phase affected ifthe file is 80 million records?
More runs: 800 runs
800-way merge in 10MB memory
i.e. divide the memory into 800 buffers.
Each buffer holds 1/800thof a run
So, 800 runs * 800 seeks/run = 640,000 seeks
-
8/13/2019 ExtSort
17/30
CENG 351 Data Management and File Structures 17
The cost of increasing the file size
In general, for a K-way merge of K runs, thebuffer size for each run is
(1/K) * size of memory space = (1/K) * size of each run
So K seeks are required to read all of the recordsin each run.
Since there are K runs, merge requires K2seeks.
Because K is directly proportional to N it also
follows that the total cost is an O(N2) operation.
-
8/13/2019 ExtSort
18/30
CENG 351 Data Management and File Structures 18
Improvements
There are several ways to reduce the time:
1. Allocate more hardware (e.g. Disk drives,
memory)
2. Perform merge in more than one step.
3. Algorithmically increase the lengths of the
initial sorted runs
4. Find ways to overlap I/O operations.
-
8/13/2019 ExtSort
19/30
CENG 351 Data Management and File Structures 19
Multiple-step merges
Instead of merging all runs at once, webreak the original set of runs into small
groups and merge the runs in these groups
separately.more buffer space is available for each run;
hence fewer seeks are required per run.
When all of the smaller merges arecompleted, a second pass merges the new
set of merged runs.
-
8/13/2019 ExtSort
20/30
CENG 351 Data Management and File Structures 20
25 sets of 32 runs each (25 32-way merges)
Two-step merge of 800 runs
Each run is320M
32-way
merge32-way
merge
32-way
merge32-way
merge
25-way
merge
Each run is
10M
-
8/13/2019 ExtSort
21/30
CENG 351 Data Management and File Structures 21
Cost of multi-step merge
25 sets of 32 runs, followed by 25-way merge: Disadvantage: we read every record twice.
Advantage: we can use larger buffers and avoid a large
number of disk seeks.
Calculations:
First Merge Step:
Buffer size = 1/32 run => 32*32 = 1024 seeks
For 25 32-way merges=> 25 * 1024 = 25,600 seeks
-
8/13/2019 ExtSort
22/30
CENG 351 Data Management and File Structures 22
Second Merge Step:
For each 25 final runs, 1/25 buffer space is allocated.
So each input buffer is 0.4M and it can hold 4000
records (or 1/800 run) Hence, 800 seeks per run, so we end up making 25 *
800 = 20,000 seeks.
Total number of seeks for reading in two steps:
25600 + 20000 = 45,600
What about the total time for merge?
We now have to transmit all of the records 4 timesinstead of two.
We also write the records twice, requiring an extra45,600 seeks.
Still the trade is profitable.
-
8/13/2019 ExtSort
23/30
Overall Formula
Sort a file withNblocks (pages) Available memoryBblocks (buffer pages):
Pass 0: useBbuffer pages. Produce sorted runs
ofBpages each. Pass 2, , etc.: mergeB-1 runs.
N B/
B Main memory buffers
INPUT 1
INPUT B-1
OUTPUT
DiskDisk
INPUT 2
. . . . . .. . .
-
8/13/2019 ExtSort
24/30
Cost of External Merge Sort
Number of passes:
Cost = 2N * (# of passes)
E.g., with 5 buffer pages, to sort 108 page
file:
Pass 0: = 22 sorted runs of 5 pages
each (last run is only 3 pages)
Pass 1: = 6 sorted runs of 20 pageseach (last run is only 8 pages)
Pass 2: 2 sorted runs, 80 pages and 28 pages
Pass 3: Sorted file of 108 pages
1 1 log /B N B
108 5/
22 4/
-
8/13/2019 ExtSort
25/30
Number of Passes of External Sort
N B=3 B=5 B=9 B=17 B=129 B=257
100 7 4 3 2 1 1
1,000 10 5 4 3 2 210,000 13 7 5 4 2 2
100,000 17 9 6 5 3 3
1,000,000 20 10 7 5 3 3
10,000,000 23 12 8 6 4 3
100,000,000 26 14 9 7 4 4
1,000,000,000 30 15 10 8 5 4
-
8/13/2019 ExtSort
26/30
CENG 351 Data Management and File Structures 26
Increasing Run Lengths
A longer initial run means
fewer total runs,
a lower-order merge,
bigger buffers,
fewer seeks.
How can we create initial runs that are twice aslarge as the number of records that we can hold inmemory?
=> Replacement selection
-
8/13/2019 ExtSort
27/30
CENG 351 Data Management and File Structures 27
Replacement Selection
Idea: Modify heap sort as follows:
always select the key from memory that has the
lowest value
output the keyreplacing it with a new key from the input list
-
8/13/2019 ExtSort
28/30
CENG 351 Data Management and File Structures 28
Input:
16, 47, 5, 12, 67, 21
Remaining input Memory (P=3) Output run
12,67,21 5 47 16 _
67, 21 12 47 16 5
21 67 47 16 5,12
_ 67 47 21 5,12,16
_ 67 47 _ 5,12,16,21
_ 67 _ _ 5,12,16,21 47 __ _ _ 5,12,16,21 47,67
What about a key arriving in memory too late to beoutput into its proper position? => use of second heap
-
8/13/2019 ExtSort
29/30
CENG 351 Data Management and File Structures 29
Trace of replacement selection
Input: ( P = 3)
16,47,5,12,67,21,7,17,14,58,24,18,33
-
8/13/2019 ExtSort
30/30
CENG 351 Data Management and File Structures 30
Replacement Selection with two disks
Algorithm:
1. Construct a heap (primary heap) in the memory, whilereading records block by block from the first disk drive,
2. As we move records from the heap to output buffer, we
replace those records with records from the input buffer.
If some new records have keys smaller than those alreadywritten out, asecondary heapis created for them.
The other new records are inserted to the primary heap.
3. Repeat step 2 as long as there are records left in the
primary heap and there are records to be read.
4. When the primary heap is empty make the secondary
heap into primary heap and repeat steps 1-3.