Post on 26-Dec-2015
File Processing - Cosequential Processing MVNC 1
Cosequential Processing
Chapter 8
Cosequential Processing
Coordinated processing of two or more sequential lists
Goals
» To merge lists into a single sorted list (union)
– Make a single sorted list from many
» To match records with the same keys (intersection)
– Apply transactions to a master file
– Find entries which exist in multiple lists
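The matching goal can be sketched in a few lines. This is a minimal illustration, assuming both lists are already sorted and hold plain comparable keys; the function name `match` is hypothetical and not from the slides.

```python
def match(list1, list2):
    """Return the keys that appear in both sorted lists (intersection)."""
    result = []
    i = j = 0
    while i < len(list1) and j < len(list2):
        if list1[i] < list2[j]:
            i += 1            # key only in list1 so far: advance list1
        elif list1[i] > list2[j]:
            j += 1            # key only in list2 so far: advance list2
        else:
            result.append(list1[i])  # keys match: record it, advance both
            i += 1
            j += 1
    return result


print(match(["Bill", "Gray", "Mary"], ["Cathy", "Gray", "Mary", "Zeke"]))
# prints ['Gray', 'Mary']
```

Each list is read once from front to back, which is what makes the processing cosequential.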
Cosequential Processing
Keys
» Matching/merging may be by a single key or several.
» Number of keys only affects the compare operator, not the sort strategy
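As a small illustration of why the number of keys only touches the compare step, the sketch below assumes each record's key is a tuple of one or more fields. The name `compare` and the sample keys are hypothetical.

```python
def compare(key_a, key_b):
    """Return -1, 0 or 1; tuples are compared field by field, left to right."""
    return (key_a > key_b) - (key_a < key_b)


# Single key and multi-field key use the same compare function.
print(compare((20231,), (20379,)))                  # prints -1
print(compare(("Smith", "Ann"), ("Smith", "Bob")))  # prints -1
print(compare(("Smith", "Bob"), ("Smith", "Bob")))  # prints 0
```

The matching or merging loop that calls `compare` stays the same whether the key has one field or several.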
Master Transaction File Processing
Common processing strategy on sequential files.
Common since historically sequential processing was the rule (tapes, cards)
Companies stored data in sequential files
Lists of “transactions” were posted against these records periodically.
Master Transaction File Processing
Consider a grocery store
» Record of inventory for each type of item stored in a large sequential file (master file)
» As items are sold, the item number and quantity sold are posted (written) as records to a transaction file
» As trucks deliver new items, item numbers and quantities are entered into the transaction file.
» As new types of items are added to inventory, or old items are discontinued, entries about this are placed in the transaction file.
Master Transaction File Processing
grocery store example:
Master File
Item #   Item Name         Type   Quan
20231    Shoe Shine (br)   6      4
20231    Shoe Shine (bl)   6      1
20177    Cottage Cheese    5      392
20179    Chicken Soup      6      32
20231    T-bone            2      43
....

Transaction File
Item #   Trans   Quan   Item Name
20231    U       -2
20231    U       50
20379    U       -5
20443    U       -4
20445    A       40     Corn Chips
20532    A       300    Butter
20534    D
20558    U       200
....

Trans codes: U - Update, A - Add, D - Delete
Master Transaction File Processing
Periodically update master from transaction
Transaction File + Old Master File → Update Operation → New Master File + Update Messages
Master Transaction File Processing
Transactions are applied against master.
New master is created
Invalid transactions result in a message
Important changes in messages - audit trail
Transaction and master must be in sorted order.
Master Transaction File Processing
Processing Scheme
Read record Mast from old Master and Trans from Transaction
While more records in both files
    if Trans.ID < Mast.ID then
        If Add then write Trans as a new record to new master
        else transaction error
        Read next Trans
    else If Trans.ID = Mast.ID then
        If Update then update Mast (keep it in memory for later transactions with the same ID)
        else If Delete then read next Mast (no write)
        else transaction error
        Read next Trans
    else write Mast to new master and read next Mast
If more records in old master, write them to new master
If more records in transaction, write Adds to new master and give errors for the rest
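The processing scheme can be sketched as runnable code. This is a minimal illustration under stated assumptions, not the course's implementation: master records are `(item_id, quantity)` tuples, transactions are `(item_id, code, quantity)` tuples with codes `"U"`, `"A"`, `"D"`, both inputs are sorted by item ID, and the name `update_master` and the sample data are hypothetical. One design choice differs slightly from the pseudocode: an added record becomes the current master record, so later transactions for the same new item in the same batch still apply.

```python
from itertools import chain


def update_master(master, transactions):
    """Apply sorted transactions to a sorted master; return (new_master, messages)."""
    new_master, messages = [], []
    records = iter(master)
    mast = next(records, None)            # current master record, or None at EOF
    for item_id, code, qty in transactions:
        # Copy master records that sort before this transaction unchanged.
        while mast is not None and mast[0] < item_id:
            new_master.append(mast)
            mast = next(records, None)
        matches = mast is not None and mast[0] == item_id
        if code == "A" and not matches:
            # Push the old current record back and make the new one current.
            if mast is not None:
                records = chain([mast], records)
            mast = (item_id, qty)
        elif code == "U" and matches:
            mast = (item_id, mast[1] + qty)   # held in memory until written
        elif code == "D" and matches:
            mast = next(records, None)        # drop the record: no write
        else:
            messages.append(f"invalid {code} transaction for item {item_id}")
    # Transactions exhausted: copy the rest of the old master.
    while mast is not None:
        new_master.append(mast)
        mast = next(records, None)
    return new_master, messages


old = [(20177, 392), (20179, 32), (20231, 43)]
trans = [(20231, "U", -2), (20231, "U", 50), (20379, "U", -5),
         (20445, "A", 40), (20534, "D", 0)]
new, msgs = update_master(old, trans)
print(new)   # prints [(20177, 392), (20179, 32), (20231, 91), (20445, 40)]
print(msgs)
```

The two messages report the update to item 20379 and the delete of item 20534, neither of which exists in the old master.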
Merging
Merge two (or more) sorted lists into a single sorted list
May remove duplicates (union) or keep
List 1: Bill, Gray, Hillery, Jenny, Linda, Mary, Randy
List 2: Cathy, Fran, Kenny, Pete, Sally, Zeke
Merged: Bill, Cathy, Fran, Gray, Hillery, Jenny, Kenny, Linda, Mary, Pete, Randy, Sally, Zeke
Merging
Merge(List1, Max1, List2, Max2, Result)
    // Max1 and Max2 are the last valid indexes of List1 and List2
    int next1 := 0; next2 := 0; out := 0;
    while next1 <= Max1 and next2 <= Max2
        if (List1[next1] > List2[next2])
            Result[out++] := List2[next2++];
        else
            Result[out++] := List1[next1++];
    // copy whatever remains in the list that was not exhausted
    for (; next1 <= Max1; Result[out++] := List1[next1++]);
    for (; next2 <= Max2; Result[out++] := List2[next2++]);
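A runnable version of the merge might look like the sketch below. It assumes in-memory Python lists stand in for the sequential files; the `remove_duplicates` flag (an assumption, not on the slide) switches between keeping duplicates and producing a union.

```python
def merge(list1, list2, remove_duplicates=False):
    """Merge two sorted lists into one sorted list."""
    result = []

    def emit(item):
        # Skip the item only when building a union and it repeats the last output.
        if not (remove_duplicates and result and result[-1] == item):
            result.append(item)

    i = j = 0
    while i < len(list1) and j < len(list2):
        if list1[i] > list2[j]:
            emit(list2[j])
            j += 1
        else:
            emit(list1[i])
            i += 1
    # One list is exhausted: copy the remainder of the other.
    for item in list1[i:]:
        emit(item)
    for item in list2[j:]:
        emit(item)
    return result


list1 = ["Bill", "Gray", "Hillery", "Jenny", "Linda", "Mary", "Randy"]
list2 = ["Cathy", "Fran", "Kenny", "Pete", "Sally", "Zeke"]
print(merge(list1, list2))
print(merge([1, 2, 3], [2, 3, 4], remove_duplicates=True))  # prints [1, 2, 3, 4]
```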
Sorting
Small files
» sort completely in memory
» Called internal sorting.
Sorting
Larger files
» may be too large to fit in memory simultaneously
» require "external sorting"
» Sorting using secondary devices
External Sorting
Criteria for evaluating external sorting algorithms
» Different from internal sorts
Internal sort comparison criteria
» Number of comparisons required
» Number of swaps made
» Memory needs
External sort comparison criteria
» Dominated by I/O time
» Minimize transfers between secondary storage and main memory
External Sorting
Two major external sorting methods
» in situ - sort the file in place
» use additional storage space
External Sorting
Characteristics of in situ sorting
» uses less file space, thus larger files may be sorted.
» if crash occurs during sort, file may be left in corrupt state
» in situ sorts may be done on direct-access files using standard internal type sorts.
» direct-access required (may not be available)
» performance of such algorithms tends to be data sensitive
External Sorting
Consider a file with 1000 records, 120 bytes each
We have 25,000 bytes available for a buffer.
Solution?
» read in 200 records at a time, sort internally
» This results in 5 sorted files
» merge the resulting sorted files into 1 sorted file
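The numbers in this example can be checked with a short calculation. The variable names are illustrative; the choice of 200 records per read is the slide's, rounded down from what the buffer could hold.

```python
import math

RECORD_SIZE = 120        # bytes per record
BUFFER_SIZE = 25_000     # bytes available for the buffer
TOTAL_RECORDS = 1000

max_records_in_buffer = BUFFER_SIZE // RECORD_SIZE   # 208 records fit
records_per_run = 200                                # the slide's round figure
runs = math.ceil(TOTAL_RECORDS / records_per_run)    # sorted files produced

print(max_records_in_buffer, runs)  # prints 208 5
```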
Sort/Merge
A common non-in situ method is an algorithm called "sort-merge"
"safe" sorting technique
performance is guaranteed
requires only serial file access
Sort/Merge
Partition → Sort (one sort per partition) → Merge
Sort/Merge
Sort/Merge techniques have two stages:
» sort stage - sorted partitions are generated
– Size depends on available memory
» merge stage - sorted partitions are merged (repetitively if necessary)
– Why might more than one merge phase be needed?
Basic Sort/Merge
initial partition size is 1
» Merge begins immediately (no sort)
» Smallest main memory use
» requires only 2 buffers in memory.
File starts with N "sorted" files of size 1
Similar to internal merge/sort
Improving Sort/Merge
Increase buffer size
» Partitions sorted (in memory) with little I/O
» Larger partitions mean fewer (I/O intensive) merges needed
» Take advantage of already sorted runs of data
» Consider the "unsortedness" of the data
Sort/Merge
Producing sorted partitions
» internal sorting
» natural selection - (use already sorted runs)
» replacement selection
Internal sorting
read M records (M determined by available memory)
sort them using internal sorting techniques
write back out, creating a partition of size M
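A minimal sketch of this partition step, followed by a single merge, is shown below. It keeps the partitions as in-memory lists purely for illustration; a real external sort would write each partition to a file. The names `sorted_runs` and `sort_merge` are hypothetical, and `heapq.merge` from the Python standard library stands in for the merge stage.

```python
import heapq
from itertools import islice


def sorted_runs(records, m):
    """Yield sorted partitions of at most m records each."""
    it = iter(records)
    while True:
        chunk = sorted(islice(it, m))   # read up to m records, sort internally
        if not chunk:
            return
        yield chunk                     # stands in for "write back out"


def sort_merge(records, m):
    """Generate sorted partitions, then merge them in one phase."""
    return list(heapq.merge(*sorted_runs(records, m)))


data = [5, 3, 8, 1, 9, 2, 7]
print(list(sorted_runs(data, 3)))  # prints [[3, 5, 8], [1, 2, 9], [7]]
print(sort_merge(data, 3))         # prints [1, 2, 3, 5, 7, 8, 9]
```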
Sort/Merge
Replacement selection (snowshovel)
» files usually not totally out of order
» take advantage of partial ordering in file
» partition size varies with already existing ordering
Replacement selection (snowshovel)
Start with primary buffer of size N (snowshovel)
1. Read in N records into buffer
2. Output record with smallest key
3. Replace with next record in file
4. If this new record is smaller than the last record written, "freeze" it (must wait for next partition)
5. If unfrozen records remain, go to 2
6. If all records frozen, unfreeze them all, start new partition, go to 2
Replacement selection (snowshovel)
if file is sorted or almost sorted, one pass may suffice for complete sort!
average partition length is 2N
Consider file, with N = 4:
» 29 42 3 7 9 101 99 87 89 100 16 8 12 2 15 [EOF]
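The replacement selection steps can be tried on the sequence above with the sketch below. It is one possible rendering, assuming a min-heap holds the unfrozen records and a plain list holds the frozen ones; the function name and the `_EOF` sentinel are hypothetical.

```python
import heapq
from itertools import islice

_EOF = object()  # sentinel marking end of input


def replacement_selection(records, n):
    """Split records into sorted partitions using a buffer of n records."""
    it = iter(records)
    heap = list(islice(it, n))      # step 1: read n records into the buffer
    heapq.heapify(heap)
    frozen, partitions, current = [], [], []
    while heap:
        smallest = heapq.heappop(heap)      # step 2: output smallest key
        current.append(smallest)
        nxt = next(it, _EOF)                # step 3: replace from the file
        if nxt is not _EOF:
            if nxt < smallest:
                frozen.append(nxt)          # step 4: freeze for next partition
            else:
                heapq.heappush(heap, nxt)
        if not heap:                        # step 6: only frozen records left
            partitions.append(current)
            current = []
            heap, frozen = frozen, []
            heapq.heapify(heap)
    return partitions


data = [29, 42, 3, 7, 9, 101, 99, 87, 89, 100, 16, 8, 12, 2, 15]
for p in replacement_selection(data, 4):
    print(p)
# prints [3, 7, 9, 29, 42, 87, 89, 99, 100, 101]
# prints [2, 8, 12, 15, 16]
```

With a buffer of only 4 records the first partition holds 10 records, because most of the incoming keys are larger than the last key written.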
Natural Selection
Frozen records in the replacement scheme take up space and search time.
Natural selection, rather than freezing, writes these unused records to a fixed length secondary file (called reservoir)
partition creation terminates when reservoir full.
Next, buffer is refilled first with records from reservoir, then records from file (if more needed)
expected partition length is 2.718N if reservoir and buffer same size - (about 30)
Natural Selection
Redo example with reservoir size 4
» 29 42 3 7 9 101 99 87 89 100 16 8 12 2 15 [EOF]
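One way to work the example is the sketch below. It is an interpretation of natural selection under stated assumptions: the reservoir is an in-memory list rather than a secondary file, the reservoir is no larger than the buffer, and when the reservoir fills the remaining buffer records are written out to finish the partition. The function name and `_EOF` sentinel are hypothetical.

```python
import heapq
from itertools import islice

_EOF = object()  # sentinel marking end of input


def natural_selection(records, n, reservoir_size):
    """Split records into sorted partitions using a buffer and a reservoir."""
    if not 1 <= reservoir_size <= n:
        raise ValueError("sketch assumes 1 <= reservoir_size <= n")
    it = iter(records)
    heap = list(islice(it, n))
    heapq.heapify(heap)
    partitions = []
    while heap:
        partition, reservoir = [], []
        while heap:
            smallest = heapq.heappop(heap)
            partition.append(smallest)
            # Read until a record fits this partition, the reservoir fills,
            # or the input ends.
            while len(reservoir) < reservoir_size:
                nxt = next(it, _EOF)
                if nxt is _EOF:
                    break
                if nxt >= smallest:
                    heapq.heappush(heap, nxt)
                    break
                reservoir.append(nxt)       # too small: goes to the reservoir
            if len(reservoir) == reservoir_size:
                while heap:                 # reservoir full: finish partition
                    partition.append(heapq.heappop(heap))
        partitions.append(partition)
        # Refill the buffer from the reservoir first, then from the file.
        heap = reservoir + list(islice(it, n - len(reservoir)))
        heapq.heapify(heap)
    return partitions


data = [29, 42, 3, 7, 9, 101, 99, 87, 89, 100, 16, 8, 12, 2, 15]
for p in natural_selection(data, 4, 4):
    print(p)
# prints [3, 7, 9, 29, 42, 87, 89, 99, 100, 101]
# prints [2, 8, 12, 15, 16]
```

Under these assumptions the example input yields two partitions of 10 and 5 records; the input is too short for the reservoir to lengthen the first partition further.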
Distribution and Merging
Merging
» required to bring the sorted partitions together into a sorted whole
» may require a series of merge “phases”, where shorter partitions are merged into larger partitions
– More than one partition per file
– Not all partitions can be opened at once
Merging
Single phase
Merging
Multiple phase
Merging
Multiple Partitions / File
P1 P3 P5 P7 P9 P11 | P2 P4 P6 P8 P10 P12
P1-2 P5-6 P9-10 | P3-4 P7-8 P11-12
P1-4 P9-12 | P5-8
P1-8 | P9-12
P1-12
Merging
Major issues - minimizing overall I/O
» Different length partitions
– Spend time simply reading and writing from one file
» Left over partitions
– Spend time simply copying partitions
Distribution and Merging
Distribution
» In order to merge, partitions must be “distributed” to files in a manner facilitating the merge process.
» If 1 partition per file, distribution is trivial
» If >1 partition per file, distribution should minimize I/O
» Several partitions may be placed in each file
Balanced N-way merge
use as many files (or tapes) as the system can open at once
Distribute the partitions evenly among F/2 files
repetitively merge back and forth between one set of F/2 files and the other
Distribute the generated partitions evenly among the F/2 output files
Balanced 2-way merge
File 1: P1 P3 P5 P7 P9 P11     File 2: P2 P4 P6 P8 P10 P12
File 3: P1-2 P5-6 P9-10        File 4: P3-4 P7-8 P11-12
File 1: P1-4 P9-12             File 2: P5-8
File 3: P1-8                   File 4: P9-12
File 1: P1-12
Balanced 2-way merge
Example: 4 files, 700 records, 100 records can be sorted in primary memory at a time
Unsorted file: 1-700
After sort: 1-100 201-300 401-500 601-700 | 101-200 301-400 501-600
After merge 1: 1-200 401-600 | 201-400 601-700
After merge 2: 1-400 | 401-700
After merge 3: 1-700
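The partition sizes in this example can be reproduced with a small simulation. It tracks only partition lengths, not records, and assumes each phase merges neighbouring partitions in groups of `ways`; the function name is hypothetical.

```python
def balanced_merge_phases(partition_lengths, ways=2):
    """Return the list of partition lengths after each merge phase."""
    phases = []
    parts = list(partition_lengths)
    while len(parts) > 1:
        # Merge each group of `ways` neighbouring partitions into one.
        parts = [sum(parts[i:i + ways]) for i in range(0, len(parts), ways)]
        phases.append(parts)
    return phases


# 700 records sorted 100 at a time gives 7 initial partitions.
for phase in balanced_merge_phases([100] * 7):
    print(phase)
# prints [200, 200, 200, 100]
# prints [400, 300]
# prints [700]
```

The leftover 100-record partition in the first phase is simply copied, which is the wasted I/O noted for this method.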
Balanced N-way merge
advantage
» simple
disadvantage
» wastes time if partition sizes differ
» spend time reading and writing records without actually merging
Polyphase merging
Strategically distribute the partitions onto F files based on the Fibonacci Sequence
Algorithm
» During each phase merge the F smallest files until the end of one file is reached.
After each phase at least one file will now be empty - this file becomes available as a new place to merge into
Continue to merge until only one file exists
Polyphase merging
Consider: Initially generate three files:
» 24 partitions, 20 partitions, and 13 partitions
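The phases for this distribution can be traced with a short simulation. It counts partitions per file only and assumes a fourth, initially empty file receives the first merge (the slide names only the three loaded files). Each phase merges one partition from every non-empty file into the empty file, repeating until the smallest input file runs out. The function name is hypothetical.

```python
def polyphase_phases(counts):
    """Return the partition count per file after each polyphase merge phase."""
    counts = list(counts)
    history = [tuple(counts)]
    while sum(counts) > 1:
        inputs = [i for i, c in enumerate(counts) if c > 0]
        if 0 not in counts or len(inputs) < 2:
            raise ValueError("sketch needs an empty output file and 2+ inputs")
        out = counts.index(0)                   # the empty file is the target
        k = min(counts[i] for i in inputs)      # merges until one input empties
        for i in inputs:
            counts[i] -= k
        counts[out] += k
        history.append(tuple(counts))
    return history


for state in polyphase_phases([24, 20, 13, 0]):
    print(state)
# prints (24, 20, 13, 0)
# prints (11, 7, 0, 13)
# prints (4, 0, 7, 6)
# prints (0, 4, 3, 2)
# prints (2, 2, 1, 0)
# prints (1, 1, 0, 1)
# prints (0, 0, 1, 0)
```

Under these assumptions the 57 initial partitions are reduced to one in six phases, and exactly one file empties at the end of every phase, so no partitions are copied without being merged.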
Polyphase merging
advantages
» No overhead from merging partitions of different sizes
disadvantages
» complex management of files
» must know partition sizes
» still not completely optimal - partition sizes not always maximal.