File Processing - Cosequential Processing MVNC1 Cosequential Processing Chapter 8.

Cosequential Processing

Chapter 8

Cosequential Processing

Coordinated processing of two or more sequential lists

Goals
» To merge lists into a single sorted list (union)
– Make a single sorted list from many
» To match records with the same keys (intersection)
– Apply transactions to a master file
– Find entries which exist in multiple lists
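
The matching goal can be sketched in a few lines. This is a minimal illustration, assuming both lists are already sorted and hold plain comparable keys; the function name `match` and the sample names are made up for the example.

```python
def match(list1, list2):
    """Return the keys that appear in both sorted lists (intersection)."""
    result = []
    i = j = 0
    while i < len(list1) and j < len(list2):
        if list1[i] < list2[j]:
            i += 1                    # list1 is behind, advance it
        elif list1[i] > list2[j]:
            j += 1                    # list2 is behind, advance it
        else:
            result.append(list1[i])   # same key in both lists
            i += 1
            j += 1
    return result

common = match(["Bill", "Gray", "Mary", "Zeke"], ["Cathy", "Gray", "Pete", "Zeke"])
print(common)  # ['Gray', 'Zeke']
```

Each list is read once from front to back, which is what makes the approach usable on sequential files.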

Cosequential Processing

Keys
» Matching/merging may be by a single key or several.
» The number of keys only affects the compare operator, not the sort strategy
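
One way to see this: if the comparison is isolated in a key function, the cosequential loop never needs to know how many fields make up the key. The record layout and the names `by_item`, `by_name` and `compare` below are hypothetical, chosen only for illustration.

```python
from operator import itemgetter

# Hypothetical record layout: (last_name, first_name, item_number)
by_item = itemgetter(2)       # single key
by_name = itemgetter(0, 1)    # two keys, compared left to right as a tuple

def compare(rec_a, rec_b, key):
    """Return -1, 0 or 1; this is all a match or merge loop needs."""
    a, b = key(rec_a), key(rec_b)
    return (a > b) - (a < b)

a = ("Smith", "Ann", 20231)
b = ("Smith", "Bob", 20231)
print(compare(a, b, by_item))  # 0
print(compare(a, b, by_name))  # -1
```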

Master Transaction File Processing

Common processing strategy on sequential files.

Common since historically sequential processing was the rule (tapes, cards)

Companies stored data in sequential files

Lists of “transactions” were posted against these records periodically.

Master Transaction File Processing

Consider a grocery store
» A record of inventory for each type of item is stored in a large sequential file (master file)
» As items are sold, the item number and quantity sold are posted (written) as records to a transaction file
» As trucks deliver new items, item numbers and quantities are entered into the transaction file.
» As new types of items are added to inventory, or old items are discontinued, entries about this are placed in the transaction file.

Master Transaction File Processing

Grocery store example:

Master File

Item #  Item Name        Type  Quan
20231   Shoe Shine (br)  6     4
20231   Shoe Shine (bl)  6     1
20177   Cottage Cheese   5     392
20179   Chicken Soup     6     32
20231   T-bone           2     43
....

Transaction File

Item #  Trans  Quan  Item Name
20231   U      -2
20231   U      50
20379   U      -5
20443   U      -4
20445   A      40    Corn Chips
20532   A      300   Butter
20534   D
20558   U      200
....

Trans codes: U - Update, A - Add, D - Delete

Master Transaction File Processing

Periodically update master from transaction

Diagram: Transaction File + Old Master File -> Update Operation -> New Master File + Update Messages

Master Transaction File Processing

Transactions are applied against the master.

A new master is created

Invalid transactions result in a message

Important changes are recorded in messages - audit trail

Transaction and master must be in sorted order.

Master Transaction File Processing

Processing Scheme

Read record Mast from old master and Trans from transaction

While more records in both files

    If Trans.ID < Mast.ID then
        If ADD then write Trans to new master as a new record
        else transaction error (no such master record)
        Read next Trans

    else if Trans.ID = Mast.ID then
        If UPDATE then update Mast (written later, more transactions may apply)
        else if DELETE then read next Mast (no write)
        else transaction error (ADD of an existing record)
        Read next Trans

    else write Mast to new master and read next Mast

If more records in old master, write them to new master

If more records in transaction, give errors (unless ADD)
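
The update pass can be sketched in Python as follows. This is an illustrative sketch, not the slide's exact scheme: it assumes in-memory lists instead of files, master records of the form (item_id, quantity), transactions of the form (item_id, code, quantity), unique item numbers in the master, and both lists sorted by item number. The function name `update_master` and the sample data are hypothetical.

```python
def update_master(master, transactions):
    """Apply sorted transactions to a sorted master; return (new_master, messages)."""
    new_master, messages = [], []
    m = 0                                    # index of the next unread master record
    for item_id, code, qty in transactions:
        # Copy every master record up to and including this transaction's key.
        while m < len(master) and master[m][0] <= item_id:
            new_master.append(master[m])
            m += 1
        # The last record copied is the only one that can match this key.
        matched = bool(new_master) and new_master[-1][0] == item_id
        if code == "A" and not matched:
            new_master.append((item_id, qty))
        elif code == "U" and matched:
            new_master[-1] = (item_id, new_master[-1][1] + qty)
        elif code == "D" and matched:
            new_master.pop()                 # dropped from the new master
        else:
            messages.append(f"invalid transaction {code} for item {item_id}")
    new_master.extend(master[m:])            # rest of the old master is unchanged
    return new_master, messages

old = [(20177, 392), (20179, 32), (20231, 43)]
trans = [(20179, "U", -2), (20179, "U", 50), (20200, "A", 40),
         (20231, "D", 0), (20300, "U", 5)]
new, msgs = update_master(old, trans)
print(new)   # [(20177, 392), (20179, 80), (20200, 40)]
print(msgs)  # ['invalid transaction U for item 20300']
```

Keeping the most recently copied record at the end of `new_master` plays the role of a one-record output buffer, so several transactions against the same item are all applied before the record is final.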

Merging

Merge two (or more) sorted lists into a single sorted list

May remove duplicates (union) or keep them

List 1: Bill, Gray, Hillery, Jenny, Linda, Mary, Randy

List 2: Cathy, Fran, Kenny, Pete, Sally, Zeke

Merged: Bill, Cathy, Fran, Gray, Hillery, Jenny, Kenny, Linda, Mary, Pete, Randy, Sally, Zeke

Merging

Merge(List1, Max1, List2, Max2, Result)
    // Max1 and Max2 are the last valid indexes of List1 and List2
    int next1 := 0; next2 := 0; out := 0;
    while next1 <= Max1 and next2 <= Max2
        if (List1[next1] > List2[next2])
            Result[out++] := List2[next2++];
        else
            Result[out++] := List1[next1++];
    // copy whatever remains; at most one of these loops does any work
    for (; next1 <= Max1; Result[out++] := List1[next1++]);
    for (; next2 <= Max2; Result[out++] := List2[next2++]);
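
For comparison, the same two-way merge can be written as a short runnable Python sketch. The `remove_duplicates` flag is an addition for illustration (it covers the "union" case mentioned earlier) and is not part of the slide's pseudocode.

```python
def merge(list1, list2, remove_duplicates=False):
    """Merge two sorted lists into one sorted list."""
    result = []

    def emit(item):
        # For a union, skip an item equal to the one just written.
        if remove_duplicates and result and result[-1] == item:
            return
        result.append(item)

    i = j = 0
    while i < len(list1) and j < len(list2):
        if list1[i] > list2[j]:
            emit(list2[j])
            j += 1
        else:
            emit(list1[i])
            i += 1
    for item in list1[i:]:     # at most one of these two loops has work to do
        emit(item)
    for item in list2[j:]:
        emit(item)
    return result

names1 = ["Bill", "Gray", "Hillery", "Jenny", "Linda", "Mary", "Randy"]
names2 = ["Cathy", "Fran", "Kenny", "Pete", "Sally", "Zeke"]
print(merge(names1, names2))
```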

Sorting

Small files
» sort completely in memory
» Called internal sorting.

Sorting

Larger files
» may be too large to fit in memory simultaneously
» require "external sorting"
» Sorting using secondary devices

External Sorting

Criteria for evaluating external sorting algorithms
» Different from internal sorts

Internal sort comparison criteria
» Number of comparisons required
» Number of swaps made
» Memory needs

External sort comparison criteria
» Dominated by I/O time
» Minimize transfers between secondary storage and main memory

External Sorting

Two major external sorting methods
» in situ - sort the file in place
» use additional storage space

External Sorting

Characteristics of in situ sorting
» uses less file space, thus larger files may be sorted.
» if a crash occurs during the sort, the file may be left in a corrupt state
» in situ sorts may be done on direct-access files using standard internal-type sorts.
» direct access is required (may not be available)
» performance of such algorithms tends to be data sensitive

External Sorting

Consider a file with 1000 records, 120 bytes each

We have 25,000 bytes available for a buffer. Solution?

» read in 200 records at a time (25,000 / 120 gives room for 208 records), sort internally
» This results in 5 sorted files
» merge the resulting sorted files into 1 sorted file
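
The arithmetic and the overall approach can be checked with a small sketch. It keeps everything in memory for simplicity, so the "sorted files" are just Python lists; the constant and function names are made up for the example, and `heapq.merge` from the standard library stands in for the file merge.

```python
import heapq

RECORD_SIZE = 120       # bytes per record
BUFFER_SIZE = 25_000    # bytes available for the buffer
NUM_RECORDS = 1000

records_that_fit = BUFFER_SIZE // RECORD_SIZE    # 208
run_length = 200                                 # rounded down to a convenient size
num_runs = -(-NUM_RECORDS // run_length)         # ceiling division gives 5

def sort_merge(records, run_length):
    """Sort fixed-size chunks, then merge the sorted chunks into one list."""
    runs = [sorted(records[i:i + run_length])
            for i in range(0, len(records), run_length)]
    return list(heapq.merge(*runs))

print(records_that_fit, num_runs)  # 208 5
```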

Sort/Merge

A common non-in situ method is an algorithm called "sort-merge"

"safe" sorting technique

performance is guaranteed

requires only serial file access

Sort/Merge

Diagram: Partition -> Sort (each partition separately) -> Merge

Sort/Merge

Sort/Merge techniques have two stages:
» sort stage - sorted partitions are generated
– Size depends on available memory
» merge stage - sorted partitions are merged (repetitively if necessary)
– Why might more than one merge phase be needed?

Basic Sort/Merge

Initial partition size is 1
» Merge begins immediately (no sort)
» Smallest main memory use
» requires only 2 buffers in memory.

File starts with N "sorted" files of size 1

Similar to internal merge/sort
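
A sketch of this basic scheme, assuming in-memory lists in place of files: every record starts as its own sorted partition, and each pass merges neighbouring pairs until one partition is left. The function name `basic_sort_merge` is hypothetical, and counting passes is added only to show how the work grows with N.

```python
import heapq

def basic_sort_merge(records):
    """Merge sort starting from partitions of size 1; return (sorted, passes)."""
    runs = [[r] for r in records]      # N "sorted" partitions of size 1
    passes = 0
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs), 2):
            if i + 1 < len(runs):
                merged.append(list(heapq.merge(runs[i], runs[i + 1])))
            else:
                merged.append(runs[i])  # odd partition is copied unchanged
        runs = merged
        passes += 1
    return (runs[0] if runs else []), passes

print(basic_sort_merge([5, 2, 8, 1, 9, 3, 7, 4]))
# ([1, 2, 3, 4, 5, 7, 8, 9], 3)
```

Eight records need three passes here; every pass reads and writes the whole data set, which is why larger initial partitions pay off.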

Improving Sort/Merge

Increase buffer size
» Partitions sorted (in memory) with little I/O
» Larger partitions mean fewer (I/O intensive) merges needed
» Take advantage of already sorted runs of data
» Consider the "unsortedness" of the data

Sort/Merge

Producing sorted partitions
» internal sorting
» natural selection - (use already sorted runs)
» replacement selection

Internal sorting

read M records (M determined by available memory)

sort them using internal sorting techniques

write back out, creating a partition of size M

Sort/Merge

Replacement selection (snowshovel)
» files usually not totally out of order
» take advantage of partial ordering in file
» partition size varies with already existing ordering

Replacement selection (snowshovel)

Start with primary buffer of size N (snowshovel)

1. Read in N records into buffer

2. Output the unfrozen record with the smallest key

3. Replace with next record in file

4. If this new record is smaller than the last record written, "freeze" it (it must wait for the next partition)

5. if unfrozen records remain, go to 2

6. If all records frozen, unfreeze them all, start new partition, go to 2

Replacement selection (snowshovel)

if file is sorted or almost sorted, one pass may suffice for complete sort!

average partition length is 2N

Consider file with N = 4:

» 29 42 3 7 9 101 99 87 89 100 16 8 12 2 15 [EOF]
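
The example can be worked through with a short sketch of replacement selection. It is a simplified illustration: records are plain integers, the input is a list rather than a file, and the buffer is searched with `min` instead of a heap. The names `replacement_selection` and `_END` are made up here.

```python
from itertools import islice

_END = object()   # sentinel marking end of input

def replacement_selection(records, n):
    """Split records into sorted partitions using a buffer of n records."""
    source = iter(records)
    active = list(islice(source, n))   # unfrozen records in the buffer
    frozen = []                        # records waiting for the next partition
    partitions, current = [], []
    while active:
        smallest = min(active)
        active.remove(smallest)
        current.append(smallest)       # "output" the smallest unfrozen record
        nxt = next(source, _END)
        if nxt is not _END:
            if nxt < smallest:
                frozen.append(nxt)     # too small for the current partition
            else:
                active.append(nxt)
        if not active:                 # everything left is frozen
            partitions.append(current)
            current = []
            active, frozen = frozen, []
    return partitions

data = [29, 42, 3, 7, 9, 101, 99, 87, 89, 100, 16, 8, 12, 2, 15]
for p in replacement_selection(data, 4):
    print(p)
# [3, 7, 9, 29, 42, 87, 89, 99, 100, 101]
# [2, 8, 12, 15, 16]
```

With a buffer of only 4 records the first partition holds 10 records, which is the effect the snowshovel analogy describes.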

Natural Selection

Frozen records in the replacement scheme take up space and search time.

Natural selection, rather than freezing, writes these unused records to a fixed-length secondary file (called the reservoir)

Partition creation terminates when the reservoir is full.

Next, the buffer is refilled first with records from the reservoir, then with records from the file (if more are needed)

Expected partition length is 2.718N if the reservoir and buffer are the same size - (about 30)

Natural Selection

Redo example with reservoir size 4
» 29 42 3 7 9 101 99 87 89 100 16 8 12 2 15 [EOF]
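
A possible sketch of natural selection for this example follows. Several details are assumptions rather than something the slides state: the buffer holds 4 records, the reservoir is a list rather than a file, input is read until a usable record is found or the reservoir fills, and once the reservoir is full the remaining buffer contents are written out to close the partition. The function name and the `_END` sentinel are hypothetical.

```python
from itertools import islice

_END = object()   # sentinel marking end of input

def natural_selection(records, n, reservoir_size):
    """Sorted partitions using a buffer of n records plus a reservoir."""
    if reservoir_size > n:
        raise ValueError("this sketch assumes reservoir_size <= n")
    source = iter(records)
    buffer = list(islice(source, n))
    reservoir = []
    partitions = []
    while buffer:
        current = []
        while buffer:
            smallest = min(buffer)
            buffer.remove(smallest)
            current.append(smallest)
            # Refill the free slot; once the reservoir is full, stop reading
            # so the buffer drains and the partition ends.
            while len(reservoir) < reservoir_size:
                nxt = next(source, _END)
                if nxt is _END:
                    break
                if nxt >= smallest:
                    buffer.append(nxt)
                    break
                reservoir.append(nxt)   # too small: set aside, not frozen
        partitions.append(current)
        # Next partition: reservoir records first, then more from the file.
        buffer = reservoir + list(islice(source, n - len(reservoir)))
        reservoir = []
    return partitions

data = [29, 42, 3, 7, 9, 101, 99, 87, 89, 100, 16, 8, 12, 2, 15]
for p in natural_selection(data, 4, 4):
    print(p)
# [3, 7, 9, 29, 42, 87, 89, 99, 100, 101]
# [2, 8, 12, 15, 16]
```

On this particular input the partitions come out the same as with replacement selection; the difference is that the set-aside records never occupy buffer slots while the first partition is being written.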

Distribution and Merging

Merging
» required to bring the sorted partitions together into a sorted whole
» may require a series of merge “phases”, where shorter partitions are merged into larger partitions
– More than one partition per file
– Not all partitions can be opened at once

Merging

Single phase

Merging

Multiple phase

Merging

Multiple Partitions / File

P1 P3 P5 P7 P9 P11  |  P2 P4 P6 P8 P10 P12

P1-2 P5-6 P9-10  |  P3-4 P7-8 P11-12

P1-4 P9-12  |  P5-8

P1-8  |  P9-12

P1-12

Merging

Major issues - minimizing overall I/O
» Different length partitions
– Spend time simply reading and writing from one file
» Left over partitions
– Spend time simply copying partitions

Distribution and Merging

Distribution
» In order to merge, partitions must be “distributed” to files in a manner facilitating the merge process.
» If 1 partition per file, distribution is trivial
» If >1 partition per file, distribution should minimize I/O
» Several partitions may be placed in each file

Balanced N-way merge

Use as many files (or tapes) as the system can open at once; call this number F

Distribute the partitions evenly among F/2 files

Repetitively merge back and forth between one set of F/2 files and the other

Distribute the generated partitions evenly among the F/2 output files

Balanced 2-way merge

File 1: P1 P3 P5 P7 P9 P11
File 2: P2 P4 P6 P8 P10 P12

File 3: P1-2 P5-6 P9-10
File 4: P3-4 P7-8 P11-12

File 1: P1-4 P9-12
File 2: P5-8

File 3: P1-8
File 4: P9-12

File 1: P1-12

Balanced 2-way merge

Example: 4 files, 700 records, 100 records can be sorted in primary memory at a time

Original file: 1-700

File 1: 1-100 201-300 401-500 601-700
File 2: 101-200 301-400 501-600

File 3: 1-200 401-600
File 4: 201-400 601-700

File 1: 1-400
File 2: 401-700

File 3: 1-700
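
The phases of this example can be reproduced with a small simulation that tracks only partition sizes, not the records themselves. It is a sketch under assumptions: partitions are dealt out round-robin, the k-th partitions of the input files are merged together, and the results are dealt round-robin to the output files. The function name `balanced_merge` is made up for the example.

```python
def balanced_merge(partition_sizes, num_files):
    """Return the partition sizes held by each input file at every phase."""
    half = num_files // 2
    if half < 2:
        raise ValueError("need at least 4 files for a balanced merge")
    # Distribute the initial partitions round-robin over half of the files.
    inputs = [partition_sizes[i::half] for i in range(half)]
    phases = [inputs]
    while sum(len(f) for f in inputs) > 1:
        outputs = [[] for _ in range(half)]
        longest = max(len(f) for f in inputs)
        for k in range(longest):
            # Merge the k-th partition of every input file that has one.
            merged = sum(f[k] for f in inputs if k < len(f))
            outputs[k % half].append(merged)
        inputs = outputs
        phases.append(inputs)
    return phases

for phase in balanced_merge([100] * 7, 4):
    print(phase)
# [[100, 100, 100, 100], [100, 100, 100]]
# [[200, 200], [200, 100]]
# [[400], [300]]
# [[700], []]
```

The partition of size 100 in the second phase is the leftover 601-700 partition: it is read and written without being merged with anything.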

Balanced N-way merge

advantage
» simple

disadvantage
» wastes time if partition sizes are different
» spends time reading and writing records without actually merging

Polyphase merging

Strategically distribute the partitions onto F files based on the Fibonacci Sequence

Algorithm
» During each phase merge the F smallest files until the end of one file is reached.

After each phase at least one file will now be empty - this file becomes the new place to merge into

Continue to merge until only one file exists

Polyphase merging

Consider: Initially generate three files:
» 24 partitions, 20 partitions, and 13 partitions
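
How the partition counts shrink can be traced with a small simulation. It assumes a fourth, initially empty file to merge into, which the slide does not state explicitly, and it only counts partitions per file rather than moving records. The function name `polyphase_phases` is hypothetical.

```python
def polyphase_phases(counts):
    """Trace partition counts per file; exactly one file must be empty."""
    counts = list(counts)
    history = [tuple(counts)]
    while sum(counts) > 1:
        empties = [i for i, c in enumerate(counts) if c == 0]
        if len(empties) != 1:
            raise ValueError("not a perfect polyphase distribution")
        # Merge until the smallest input file runs out of partitions.
        merges = min(c for c in counts if c > 0)
        # Each input loses `merges` partitions; the empty file gains them.
        counts = [c - merges if c > 0 else merges for c in counts]
        history.append(tuple(counts))
    return history

for state in polyphase_phases([24, 20, 13, 0]):
    print(state)
# (24, 20, 13, 0)
# (11, 7, 0, 13)
# (4, 0, 7, 6)
# (0, 4, 3, 2)
# (2, 2, 1, 0)
# (1, 1, 0, 1)
# (0, 0, 1, 0)
```

Under that assumption the 57 initial partitions are reduced to one in six phases, and every phase leaves exactly one file empty for the next phase to merge into.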

Polyphase merging

advantages
» No overhead from merging partitions of different sizes

disadvantages
» complex management of files
» must know partition sizes
» still not completely optimal - partition sizes not always maximal.