Chapter 14: Searching and Sorting


In the process of examining data structure implementation techniques, we have in the previous thirteen chapters discovered many algorithms that can be used to speed the process of searching a collection of values. This began in Chapter 2, where we asked the rhetorical question “why do phone books maintain their values in sorted order based on name?” That question led to a discussion of binary search versus linear search, and of the huge improvement in speed that is possible when searching an ordered collection.

Binary search is an example of a general problem-solving heuristic known as divide and conquer. The fundamental goal of divide and conquer is to choose an initial step that divides a problem roughly in half, allowing us in one step to eliminate one half or the other. In binary search this step is a comparison against the middle element. For example, suppose you are asking whether the name “Amanda” occurs in a sorted list of names, and the middle element of the list is “Alex”. With one comparison you know that “Amanda”, if it occurs at all, cannot be in the first half of the collection.
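Expressed in code, the repeated halving looks something like the following minimal C sketch; the function name, parameters, and string-array setting are our own illustration, not code from the book.

    #include <string.h>

    /* Return the index of target in the sorted array words[0..n-1],
       or -1 if it is not present.  Each comparison against the middle
       element eliminates half of the remaining range. */
    int binarySearch(const char *words[], int n, const char *target) {
        int low = 0, high = n - 1;
        while (low <= high) {
            int mid = low + (high - low) / 2;
            int cmp = strcmp(target, words[mid]);
            if (cmp == 0)
                return mid;          /* found it */
            else if (cmp < 0)
                high = mid - 1;      /* target, if present, is in the lower half */
            else
                low = mid + 1;       /* target, if present, is in the upper half */
        }
        return -1;                   /* not found */
    }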

The true power of the heuristic occurs when you apply the technique repeatedly. After eliminating one half of the possibilities, you again ask about the value in the middle of the remaining range, eliminating one half of the remaining values, and so on. This invariably leads to fast O(log n) behavior.

Root Finding

The basic insight behind divide and conquer is applicable to a wide variety of situations, not just searching an ordered array. For example, suppose you are trying to discover a root of the function x³ – 14x² + 59x – 65. A root is a point where the function has the value zero. If you evaluate the function at 0, and again at 8, you can readily discover that at one of these points the value of the function is negative, and at the other it is positive. Since the function is continuous (that is, it has no “jumps”), you know there must be a root somewhere in between. We can once again use a binary search technique: simply test the value in the middle of the range, and depending upon whether it is positive or negative, move either the top or the bottom of the range to that midpoint.
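Here is a minimal C sketch of this bisection search; the function names, the starting bounds, and the tolerance are our own illustration, not code from the chapter.

    #include <stdio.h>

    /* The example polynomial: x^3 - 14x^2 + 59x - 65 */
    double f(double x) {
        return ((x - 14) * x + 59) * x - 65;
    }

    /* Bisection search for a root.  We assume, as in the example,
       that f(low) is negative and f(high) is positive (f(0) = -65,
       f(8) = 23), so a root is bracketed between them.  At each step
       the midpoint is tested and the half that still brackets a root
       is kept, until the bracket is smaller than the tolerance. */
    double findRoot(double low, double high, double tolerance) {
        while (high - low > tolerance) {
            double mid = (low + high) / 2.0;
            if (f(mid) > 0)
                high = mid;   /* root lies in the lower half */
            else
                low = mid;    /* root lies in the upper half */
        }
        return (low + high) / 2.0;
    }

    int main(void) {
        printf("root near %f\n", findRoot(0.0, 8.0, 0.01));  /* about 1.72 */
        return 0;
    }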


This process would first examine the value of the function at 4. Since that value is positive, the top half of the range is eliminated. Next the value at 2 is tested, then the value at 1. Since the value at 1 is negative, the lower bound is changed. The process continues in this fashion, examining 1.5, 1.75, 1.625, 1.6875, and so on. Within ten iterations the root can be determined to within 0.01.

Partition and Fast Median

In Chapter 4 you developed the quick sort algorithm. At the heart of this algorithm was a process termed forming a partition. The partition algorithm is most commonly associated with quick sort; however, it is a useful operation in its own right. To partition is to divide a data set into two groups, so that all the elements in the first group are smaller than or equal to a key, and all the elements in the second group are larger than or equal to the key. It is easy to imagine situations in which you would like to partition a data set. Perhaps you have employee records and want to separate those with fewer than ten years of employment from those with ten or more. Or you have a list of students and are creating an honor roll of those whose grade point average is greater than 3.5.
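For reference, one possible C sketch of the partition step follows. The signature matches the findIth code shown later in this chapter, but the body is our own reconstruction, not the Chapter 4 code itself.

    /* Swap two array elements. */
    void swap(double data[], int i, int j) {
        double tmp = data[i];
        data[i] = data[j];
        data[j] = tmp;
    }

    /* Partition data[low..high] (inclusive) around the element that
       initially sits at pivotIndex, which must lie in low..high.
       On return, elements smaller than the pivot come first, then
       the pivot, then the rest; the pivot's final index is returned. */
    int partition(double data[], int low, int high, int pivotIndex) {
        double pivot = data[pivotIndex];
        swap(data, pivotIndex, high);        /* park the pivot at the end */
        int store = low;
        for (int i = low; i < high; i++) {
            if (data[i] < pivot) {
                swap(data, i, store);
                store++;
            }
        }
        swap(data, store, high);             /* move the pivot into place */
        return store;
    }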

Another common use for the partition algorithm is to find the median of a data collection. The median is the element in the middle: the value with the property that half the remaining elements are larger, and half are smaller. A median is often a better indication of a range of data values than an average, since it is not easily skewed by a few unusual values. Imagine, for example, that the vast majority of houses along a lake cost around $200,000, but there is one house that cost $10 million. The single expensive house pulls the average far above $200,000, while the median remains close to the $200,000 figure.


One way to find a median is to sort the data set, then select the value at index n/2. This can be done in O(n log n) steps. But sorting orders all the values, which is much more work than is necessary, whereas the partition algorithm works in O(n) time. Using partition, we can find the one element in question much more quickly.

We can generalize the median-finding problem to the task of finding the ith smallest element, where i is given as a parameter. Finding the median is then the same as finding element n/2, where n is the number of values in the data set. The idea is to partition the data, which returns the index of the pivot value. Either this is or is not the position we seek. If it is, we are finished. If it is not, it might at first appear that little information has been gained. But we have, in fact, divided the vector into two portions. As with binary search, we can in one step immediately rule out one of these sections: by comparing the index position of the pivot to the position we are seeking, we can recursively search either the lower or the upper portion of the transformed vector. This yields the following algorithm (note that when the pivot lands below position i, the search must continue in the upper portion, and vice versa):

    double findIth(double data[], int i, int low, int high) {
        /* Partition data[low..high]; pivot is the final index of
           the pivot element. */
        int pivot = partition(data, low, high, i);
        if (pivot == i)
            return data[i];
        if (pivot < i)
            return findIth(data, i, pivot + 1, high);   /* seek in upper portion */
        else
            return findIth(data, i, low, pivot - 1);    /* seek in lower portion */
    }

How fast is this algorithm? As with quick sort, the answer depends upon how lucky we are in selecting pivots that divide the input roughly in half. Suppose this is the case, and recall that partition is an O(n) algorithm. Each recursive call then processes about half as many elements as the one before.


Execution time in this case is proportional to the sum n + n/2 + n/4 + … + 1. If we factor out the common term n, the remaining sum is 1 + ½ + ¼ + …, a geometric series bounded by the constant 2. This shows that in the best case the algorithm runs in O(n) steps.
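In symbols (a standard geometric-series bound; the notation here is ours):

    \[
      n + \frac{n}{2} + \frac{n}{4} + \cdots + 1
        \;=\; n \sum_{k=0}^{\lfloor \log_2 n \rfloor} \frac{1}{2^k}
        \;<\; 2n \;=\; O(n).
    \]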

What if we consistently make a bad choice for the pivot? As with quick sort, the worst case occurs when each partition divides the input into two groups, one of which is empty and the other of which holds n - 1 elements. Each call then eliminates only a single element, so the work performed is proportional to n + (n - 1) + … + 1. What is the big-O execution time in this case?

So the findIth algorithm has terrific best-case performance, and a worst case that is slower than the naïve approach of sorting the input and then selecting the ith element. To find out how it performs on average, you can experimentally try finding the median of a set of random values. If you try this, you will discover that the algorithm works very well most of the time.

Review of Sorting Algorithms

In earlier chapters we have seen a number of different sorting algorithms. It is useful to summarize them here, both to remind the reader of the wide variety of algorithms that have been presented, and to compare and contrast some of the features of each.

Bubble Sort (Chapter 2). Simple to state and easy to analyze, but poor O(n²) performance on all inputs. Should never be used in practice.

Selection Sort (Chapter 2). Avoids swapping until the final position of a value is known, and hence performs only O(n) swaps, although it still requires O(n²) comparisons. Only slightly better than bubble sort.

Gnome Sort (Worksheet 6). An amusing variation on insertion sort, using only a single loop. Easy to prove correct, but still requires O(n²) work in the worst case.

Insertion Sort (Chapter 4). Sorts by inserting new values into an already sorted array. Practical for small arrays. O(n) in the best case, but still O(n²) in the worst case.

Shell Sort. An improvement over insertion sort obtained by moving elements long distances in a single comparison. Much faster than insertion sort, although the mathematical analysis is complex.

Merge Sort (Chapter 4). The first algorithm we looked at with guaranteed fast O(n log n) performance; however, it uses O(n) extra storage.

Quick Sort (Chapter 4). One of the fastest sorting algorithms on random data and uses no extra storage, but degenerates to O(n²) in the worst case.

Tree Sort (Chapter 10). Simple idea: copy the elements into a binary search tree, then copy them back out. Easy to write, with O(n log n) performance as long as the tree remains balanced, but uses extra storage.

Heap Sort (Chapter 12). Simple idea: form a heap, then repeatedly pull the smallest element out and rebuild the heap. Since the heap keeps getting smaller, the already sorted elements can be stored in the bottom of the array, so no extra memory is used. Guaranteed O(n log n) performance, but in practice not as fast as some of the others.

Radix Sort (Worksheet 39). Sorts on digit positions using hash tables. Of historical interest for its use with punched cards. Can be very fast in the right situation, but not generalizable.

It would be foolish to think that even after looking at all these algorithms we have exhausted everything there is to say about sorting. In the following sections we describe just a few of the surprisingly large number of variations on this simple idea.

Counting Sort

Suppose you need to sort an array of 1000 integers, but you know that the values all lie between 0 and 19. The key idea of counting sort is that in this case you do not need to remember the values themselves, just their frequencies. The algorithm works in two steps. In step one you compute the frequency of each value, keeping the counts in an array. This array of frequencies is sometimes termed a histogram, and it can be constructed very quickly, in O(n) time, where n is the number of elements. You might, for example, end up with a table such as the following:

    value: count    value: count    value: count    value: count
      0: 47           5: 114         10: 36          15: 93
      1: 92           6: 16          11: 92          16: 12
      2: 12           7: 37          12: 43          17: 15
      3: 14           8: 41          13: 17          18: 63
      4: 32           9: 3           14: 132         19: 89

The histogram tells you that in the sorted array there will be 47 zeros, followed by 92 ones, then 12 twos, and so on. So in the second step of the algorithm you loop through the frequency table and copy the values back into the array.
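Both steps fit in a few lines of C. This sketch is our own, with the value range from the example hard-coded:

    #define MAX_VALUE 20   /* values are assumed to lie in 0..19 */

    /* Sort data[0..n-1] in place, assuming every value is in
       the range 0..MAX_VALUE-1. */
    void countingSort(int data[], int n) {
        int counts[MAX_VALUE] = {0};
        /* Step one: build the histogram. */
        for (int i = 0; i < n; i++)
            counts[data[i]]++;
        /* Step two: walk the histogram, writing each value back
           as many times as it occurred. */
        int out = 0;
        for (int v = 0; v < MAX_VALUE; v++)
            for (int c = 0; c < counts[v]; c++)
                data[out++] = v;
    }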

The first loop is O(n), where n is the number of elements, and the second loop is O(m), where m is the number of possible values (the size of the frequency table). The resulting execution time is proportional to the larger of the two, which is normally written as O(n + m). This is extremely fast in comparison to the other sorting algorithms we have seen, but it is applicable in only a very limited set of circumstances.

Bucket Sorting


In Chapter 12 you were introduced to the idea of hashing, and more specifically to hash tables using buckets. Recall that the basic idea was to very quickly compute an index value for each element. This index was then used to place the value into one of a small number of groups, termed buckets.

If it is possible to find a hash function with one particular characteristic, then hashing can be the source of a very fast sorting algorithm. What we need is for all the elements in the ith bucket to be less than or equal to all the elements in the (i+1)st bucket. If we can find such a hash function, then the process of sorting the entire list reduces to sorting each bucket and then placing the contents of the buckets end to end to generate the final sorted list. Since the number of elements in each bucket is a small fraction of the number of original elements, this is very fast. For example, suppose the input consists of values between 0 and 9,999, and you select as your hash function “division by 100”. The first bucket will contain only values between 0 and 99, the second only values between 100 and 199, and so on.

Instructors frequently use bucket sort when sorting exam papers. In the first pass all exams are divided into piles based on the students’ names: a pile of names beginning with “a”, a pile of names beginning with “b”, and so on. The second step is to sort the individual piles. The result is that all exams are completely sorted. The post office uses a somewhat similar approach, sorting first on state (a 50-bucket hash table), then on city, and finally on route. (Expressed in this fashion the algorithm starts to resemble radix sort, which was discussed in Chapter 12. In fact, bucket sort is sometimes described as a “top down radix sort”.)
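The following C sketch (our own, under the assumptions that values lie between 0 and 9,999 and that no bucket receives more than a fixed number of values) uses the division-by-100 hash just described:

    #define NUM_BUCKETS 100
    #define BUCKET_CAP  256   /* assumed upper bound on any one bucket */

    /* Simple insertion sort, used to order each bucket. */
    static void insertionSort(int a[], int n) {
        for (int i = 1; i < n; i++) {
            int v = a[i], j = i;
            while (j > 0 && a[j-1] > v) { a[j] = a[j-1]; j--; }
            a[j] = v;
        }
    }

    /* Sort data[0..n-1], assuming every value is in 0..9999.  The hash
       function value/100 sends smaller values to lower-numbered buckets,
       which is exactly the ordering property bucket sort requires. */
    void bucketSort(int data[], int n) {
        static int bucket[NUM_BUCKETS][BUCKET_CAP];
        int count[NUM_BUCKETS] = {0};
        for (int i = 0; i < n; i++) {          /* distribute into buckets */
            int b = data[i] / 100;
            bucket[b][count[b]++] = data[i];
        }
        int out = 0;
        for (int b = 0; b < NUM_BUCKETS; b++) { /* sort and concatenate */
            insertionSort(bucket[b], count[b]);
            for (int j = 0; j < count[b]; j++)
                data[out++] = bucket[b][j];
        }
    }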


As with all hashing algorithms, bucket sort can be very fast, but only if an appropriate hash function can be identified.

Library Sort

Library sort, sometimes termed gapped insertion sort, is an interesting combination of bucket sort, counting sort, and insertion sort. Like bucket sort, it requires a hash function that divides the input so that all values in the first bucket are smaller than all values in the second bucket, and so on. In the first step each of the n original values is examined and its hash value computed. However, rather than actually placing the elements into buckets, only a count of the number of items destined for each bucket is maintained. This is similar to counting sort, and the counts can be determined in O(n) time. Imagine, for example, that you discover there are 10 items in the first bucket, 14 in the second, 22 in the third, and so on.

Once the counts have been determined, a second pass is made through the items, copying the values into a new array. Rather than sorting the values, you simply place them “close” to where they will eventually go. For example, if the very first item in the list belongs in the third bucket, you place it at location 25, since locations 1 through 24 are reserved for the ten items you know will go into the first bucket and the 14 items you know will go into the second. The next item that belongs in bucket 3 goes into location 26, regardless of whether it is larger or smaller than the item placed in location 25. Again, this copying process can be performed in O(n) time. The name library sort comes from the analogy of placing books onto a bookshelf:

Suppose a librarian were to store his books alphabetically on a long shelf, starting with the As at the left end, and continuing to the right along the shelf with no spaces between the books until the end of the Zs. If the librarian acquired a new book that belongs to the B section, once he finds the correct space in the B section, he will have to move every book over, from the middle of the Bs all the way down to the Zs in order to make room for the new book. This is an insertion sort. However, if he were to leave a space after every letter, as long as there was still space after B, he would only have to move a few books to make room for the new one. This is the basic principle of the Library Sort. (From the Wikipedia article on Library sort).

After the second step the new array will not yet be sorted, but it will be much more organized than the original: values will be more or less in their correct locations. If you remember the analysis performed in Worksheet 7, this is the ideal situation for insertion sort. A final insertion sort pass over the new array will place all the values in their correct locations, and will be much faster than the O(n²) worst case that an insertion sort on the original array might have encountered.
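A compact C sketch of the two-pass scheme just described; the bucket width, the value range, and the separate output array are our assumptions:

    #define NUM_BUCKETS 100

    /* Values assumed in 0..9999; value/100 preserves order across
       buckets, as library sort requires. */
    static int bucketOf(int v) { return v / 100; }

    /* Gapped placement as described in the text: count the items per
       bucket, compute each bucket's starting offset, drop each item
       near its final position, then finish with one insertion sort.
       The sorted result is left in out[0..n-1]. */
    void librarySort(const int data[], int out[], int n) {
        int count[NUM_BUCKETS] = {0};
        int start[NUM_BUCKETS];

        for (int i = 0; i < n; i++)          /* pass 1: bucket counts */
            count[bucketOf(data[i])]++;

        int offset = 0;                      /* prefix sums give offsets */
        for (int b = 0; b < NUM_BUCKETS; b++) {
            start[b] = offset;
            offset += count[b];
        }

        for (int i = 0; i < n; i++) {        /* pass 2: near-final placement */
            int b = bucketOf(data[i]);
            out[start[b]++] = data[i];
        }

        /* The array is now nearly sorted, so insertion sort is fast. */
        for (int i = 1; i < n; i++) {
            int v = out[i], j = i;
            while (j > 0 && out[j-1] > v) { out[j] = out[j-1]; j--; }
            out[j] = v;
        }
    }

Because the hash preserves order between buckets, the final insertion sort only has to repair inversions within each bucket, which is what keeps the clean-up pass cheap.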


Like merge sort, library sort has the disadvantage of requiring an extra array, but in the right situation it can be very fast.

External Sorting

Sometimes it is necessary to sort collections that are too large to fit into memory; an example would be sorting a large file of text records. Because the data is never all in memory at one time, this is termed external sorting. The key insight behind external sorting is the same idea that motivated the merge sort algorithm you examined in Chapter 4 and Worksheet 12: two ordered collections can be merged very easily to form a new ordered collection. The difference here is that the input is read from a pair of files, rather than from an array as was the case with merge sort. The two sorted files are combined, producing a third file containing their merged contents. During the merge process one record (for example, one line of text) is held from each of the two input files. The smaller of the two records is written to the output file and replaced by the next record from the same input file. This continues until all records have been read from both files.

To build on this idea, we must first create a number of small sorted files. This is the first phase of the external sort algorithm. The input, which is assumed to be too large to fit into memory, is divided into small pieces, each piece small enough to be sorted in memory using whatever internal sorting algorithm is desired. Each sorted piece is then written to a temporary file. When this first phase is finished, there will be some number of temporary files, each holding a sorted portion of the input.

The next phase is the merge step. It is based on the observation made previously, namely that it is relatively easy to merge two sorted file streams into a new sorted file. To do this, the names of the sorted files are placed into a queue. At each step two file names are removed from the queue, and a new temporary file is created containing the merged contents of the two input files. The new temporary file is then added to the queue, and the two input files can be closed and discarded. This process continues until there is only one file in the queue. Since each step removes two entries from the queue and inserts only one, the process must eventually halt.
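The core file-merge step might look like this in C; the line-based record format and the buffer size are our assumptions:

    #include <stdio.h>
    #include <string.h>

    #define LINE_MAX_LEN 1024

    /* Merge two text files whose lines are already in sorted order,
       writing the combined sorted lines to the output file. */
    void mergeFiles(FILE *a, FILE *b, FILE *out) {
        char lineA[LINE_MAX_LEN], lineB[LINE_MAX_LEN];
        int haveA = (fgets(lineA, sizeof lineA, a) != NULL);
        int haveB = (fgets(lineB, sizeof lineB, b) != NULL);
        while (haveA && haveB) {
            if (strcmp(lineA, lineB) <= 0) {  /* write the smaller record */
                fputs(lineA, out);
                haveA = (fgets(lineA, sizeof lineA, a) != NULL);
            } else {
                fputs(lineB, out);
                haveB = (fgets(lineB, sizeof lineB, b) != NULL);
            }
        }
        while (haveA) {                       /* copy whichever file remains */
            fputs(lineA, out);
            haveA = (fgets(lineA, sizeof lineA, a) != NULL);
        }
        while (haveB) {
            fputs(lineB, out);
            haveB = (fgets(lineB, sizeof lineB, b) != NULL);
        }
    }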

When the merge phase is finished there is only one file remaining, which is the final result: all the records from the original file, now in sorted order.

Parallel Sorting

In all the algorithms described in this text we have implicitly assumed we were running on a single-processor machine. That is, there is one CPU, only one instruction is executed at any point, and there is a single point of “control”, which is handed from function to function. For example, when one function calls another, it passes control to the new function and waits, in a state of suspended animation, until the called function completes and passes control back to the caller. It is becoming increasingly common for these assumptions to be challenged, if not outright violated. One of the most interesting new areas is the rise of multi-core architectures, which move parallel programming from the arena of a few specialized and very expensive computers to a point where it may even become the accepted (and therefore expected) norm. There are a wide variety of parallel programming models, each with its own peculiar programming techniques. Therefore, in this section we describe parallel sorting algorithms only in a very informal fashion, to give you a hint as to how the advent of parallelism will radically change the programming landscape.


Imagine you are sorting a group of people, say into increasing order based on their height. If you are in a classroom, you might want to try this. The following algorithm is sometimes known as the polite sorting algorithm.

The Polite Sorting Algorithm (also sometimes known as “King of the Hill”). All members of the group stand in a line, then turn 90 degrees to their right, so that each member (except the first) faces the back of the next. Each person in the group (except the person at the front) then compares their height to that of the person in front of them. If they are taller, they politely tap the next person on the shoulder and ask, “can I please exchange places with you?” After exchanging, or after the person in front of them has exchanged, they continue comparing themselves to whoever is now in front of them, until everybody is in order.

Notice that all comparisons occur at the same time. If we start to analyze the time this algorithm will take, we can ask, for example, how long it takes the tallest person to reach the end of the line. A tall person never moves backwards, so the worst case occurs when the tallest person starts out at the very bottom. But if there were originally N people in the line, this tallest person moves one position at each step, and so reaches their proper location after O(N) exchanges. This is much faster than any of the O(N log N) sorting algorithms we have seen up to now. But notice that this is only possible because all N individuals were potentially moving at the same time; that is, there was no single focus of “control”. This is what we mean when we say there is parallel execution.

If you tried performing this algorithm as a group exercise, you will undoubtedly have noticed the confusion that can arise when an individual taps the next person on the shoulder and is tapped themselves at the same moment. The person in the middle is then faced with a dilemma: they need to exchange with the person in front of them, but they have also been asked to exchange with the person behind them. Which action should they perform first? This is what is technically known as a race condition, and it can result in much confusion. The person in the middle may simply “freeze up”, endlessly weighing the two possible actions (“which do I do, exchange forward, or exchange backward?”) and never actually doing anything. You have probably experienced your computer getting into a similar sort of quandary.

To eliminate the possibility of a race condition, a slightly more complicated parallel sorting algorithm can be employed. It is known as an even-odd transposition sort or, when performed in a group setting, a dos-a-dos sort. It requires slightly more coordination, in order to ensure that actions happen as a sequence of “steps”. In a group setting this coordination can be provided by the remaining members of the class singing or clapping.


Even-Odd Parallel Transposition Sort (Dos-a-dos Sort). As with the polite sorting algorithm, all members of the sorting group initially stand in a line, only this time facing forward. Then every other member turns around, so that they face backward. With the exception of the people at the end points, each forward-facing person therefore has two backward-facing people beside them, and vice versa. (In a computer the effect of facing forward and backward would be achieved by numbering the positions and using the even or odd property; hence the name.) At each step of the algorithm the forward-facing people (and only the forward-facing people) compare their height to that of the backward-facing person on their left. If the two are out of order, they dos-a-dos around each other, thereby exchanging both places and orientations. Then, when all exchanges have taken place, everybody claps twice, shouts “Ye-Ha!”, and jumps around 180 degrees, so that those who were facing backward now face forward, and vice versa. The people at the end points can participate in the clapping and the “Ye-Ha!” jumps, even if they have nobody to dos-a-dos with. The new forward-facing people then compare themselves to their backward-facing companions, and possibly exchange. This continues until all members of the group have reached their correct positions.
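On a conventional single processor the same algorithm can be simulated by alternating passes over the even and odd positions. This sketch is our own, not code from the book; in a true parallel implementation all comparisons within one phase would happen simultaneously.

    /* Odd-even transposition sort, simulated sequentially.  The even
       phase compares pairs (0,1), (2,3), ...; the odd phase compares
       pairs (1,2), (3,4), ....  The loop stops after a full round
       with no exchanges, at which point the array is sorted. */
    void oddEvenSort(int data[], int n) {
        int sorted = 0;
        while (!sorted) {
            sorted = 1;
            for (int i = 0; i + 1 < n; i += 2)   /* even phase */
                if (data[i] > data[i+1]) {
                    int t = data[i]; data[i] = data[i+1]; data[i+1] = t;
                    sorted = 0;
                }
            for (int i = 1; i + 1 < n; i += 2)   /* odd phase */
                if (data[i] > data[i+1]) {
                    int t = data[i]; data[i] = data[i+1]; data[i+1] = t;
                    sorted = 0;
                }
        }
    }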

Race conditions are not possible, since only the forward-facing people initiate comparisons, and their neighbors are both facing backward and so cannot ask them to exchange. But we have eliminated the race condition only at the cost of some global coordination (keeping time in steps) and of each person keeping track of whether they are currently facing forward or backward.

Question: How many steps will it take until everybody is in order? What is the best possible starting arrangement? What is the worst possible starting arrangement?

Our informal description of parallel sorting algorithms has only scratched the surface of this interesting and complex topic. To go any further we would need to introduce concepts such as communication latency, bandwidth, and the number of processors, and these issues would take us far beyond the boundaries of this book. Nevertheless, it is a safe prediction that the study of parallel algorithms will become an increasingly important part of computer science as the cost of machines with parallel capabilities steadily decreases, and hence their availability steadily increases.

Study Questions

1. What property must a collection have before one can perform a binary search?
2. Explain the principle of divide and conquer. How does binary search illustrate this idea?
3. What does it mean to form a partition of a collection?
4. Explain how forming a partition can be used to find the median of a collection.
5. What is the best-case execution time for the median-finding algorithm? What is the worst-case execution time?
6. Which sorting algorithms had O(n²) execution behavior?
7. Which sorting algorithms worked in place, without requiring any additional memory?
8. Which sorting algorithms had guaranteed O(n log n) behavior?
9. What restrictions on the input values were required in order to use radix sort?
10. What restrictions on input values are required to use counting sort?
11. Explain in your own words how bucket sort works. What property must the hashing function have in order for this algorithm to work?
12. Explain how library sort contains features similar to bucket sort, counting sort, and insertion sort.
13. Under what conditions would you need to use an external sort?
14. What does it mean for a computer to support parallel execution?
15. What is the worst-case execution time for the even-odd parallel transposition sort (dos-a-dos sort)?

On The Web

There is far more to the topic of searching and sorting than this chapter (or this book) can cover. All the sorting algorithms described here, including external sorting, are the topics of entries in Wikipedia. The field of parallel programming is changing rapidly due to the increasing introduction of multi-core architectures; because of this, any reference to specific web pages would become almost immediately dated. The best advice is to google terms such as “multicore programming” to find the latest work.