Scalable parallel word search in multicore/multiprocessor systems


J Supercomput (2010) 51: 58–75. DOI 10.1007/s11227-009-0308-3

Scalable parallel word search in multicore/multiprocessor systems

Frank Drews · Jens Lichtenberg · Lonnie Welch

Published online: 7 July 2009. © Springer Science+Business Media, LLC 2009

Abstract This paper presents a parallel algorithm for fast word search to determine the set of biological words of an input DNA sequence. The algorithm is designed to scale well on state-of-the-art multiprocessor/multicore systems for large inputs and large maximum word sizes. The pattern exhibited by many sequential solutions to this problem is a repetitive execution over a large input DNA sequence, and the generation of large amounts of output data to store and retrieve the words determined by the algorithm. As we show, this pattern does not lend itself to straightforward standard parallelization techniques. The proposed algorithm aims to achieve three major goals to overcome the drawbacks of embarrassingly parallel solution techniques: (i) to impose a high degree of cache locality on a problem that, by nature, tends to exhibit nonlocal access patterns, (ii) to be lock free or largely reduce the need for data access locking, and (iii) to enable an even distribution of the overall processing load among multiple threads. We present an implementation and performance evaluation of the proposed algorithm on DNA sequences of various sizes for different organisms on a dual-processor quad-core system with a total of 8 cores. We compare the performance of the parallel word search implementation with a sequential implementation and with an embarrassingly parallel implementation. The results show that the proposed algorithm far outperforms the embarrassingly parallel strategy and achieves speed-ups of up to 6.9 on our 8-core test system.

Keywords Biological word discovery · Parallel algorithms · Cache-awareness · Lock-free data partitioning · Multicore/multiprocessor systems

F. Drews (✉) · J. Lichtenberg · L. Welch
School of Electrical Engineering and Computer Science, Ohio University, Athens, OH 45701, USA
e-mail: [email protected]

J. Lichtenberg
e-mail: [email protected]

L. Welch
e-mail: [email protected]


1 Introduction

Today, more organisms are having their genomes sequenced, thus creating a greater demand from the biological community to better understand the exact biological mechanisms which are encoded within the genomic blueprint of each organism. While biologists continue to analyze genomes and identify new functional elements within organisms, there remain several regions of the genome that are often overlooked, such as nonprotein coding regions, introns, and intergenic regions. Several bioinformatics-oriented methodologies exist to discover functional elements, which are also referred to here as biological words, within these regions. There are a number of existing word discovery methods and tools reported in the literature [1–6]. Addressing several drawbacks of classical alignment-based methods [1, 2], new enumerative approaches have emerged during the current decade [3–6]. Enumerative approaches identify and score words that are present in a set of sequences. Some of the approaches exhaustively consider all possible words of a given range of lengths using brute force, whereas others aim to reduce the size of the word search space (typically by employing biologically inspired heuristics), thus potentially limiting the quality of the results. Facing the computational demand and complexity of the word discovery process, there is a clear trade-off between the desire to accurately analyze wide ranges of word lengths in very large DNA sequences (up to the whole genome of an organism), and the limited available computing power and data storage.

In this paper, we focus on a class of exhaustive enumerative word discovery methods. The main contribution of this work is a new parallel algorithm for enumerating the whole word space for a given DNA sequence (or set of sequences) within a given minimum and maximum word length. The algorithm is designed to handle both long input sequences and long maximum word lengths, and scales well on state-of-the-art multicore/multiprocessor systems. While the paper will primarily demonstrate that the algorithm provides a good speed-up under a varying number of CPU cores (in a shared memory environment) for a given maximum input size and word length, we will also briefly discuss properties of the algorithm that can be further exploited in distributed memory systems to push the scalability of the problem instance itself.

The performance and scalability are achieved by a number of techniques, including lock-free data access and update, cache-aware data processing and layout, intelligent data partitioning, and load balancing. The lock-free nature of the structures representing enumerated words helps eliminate (or reduce) system-wide overheads such as those caused by front-side bus locking. The cache-aware data processing performed by the parallel algorithm ensures that each CPU core requires only a small (and adjustable) amount of memory over a short time window. Additionally, although the structures representing the word space give rise to pointer chasing, cache-aware data layout helps allocate memory that is likely to be used in the near future in a consecutive fashion, thus improving both the spatial locality and the performance of hardware prefetching. All the aforementioned features effectively enable multiple CPU cores to efficiently process different parts of the word space in a parallel fashion. Finally, intelligent data partitioning allows an equal partition of the overall workload among the CPU cores.

2 Related work

In this section, we will first look at some fundamental algorithmic word searching techniques around which word discovery tools are built. We will then discuss some previous work pertaining to cache-aware data access and layout.

Word searching has been the subject of a large body of previous work. In the classical exact string matching problem, a short string called the pattern and a longer string called the text are given. The problem is then to find all the occurrences of the pattern in the text [7]. There are a number of early approaches that all utilize the fact that, under certain circumstances, the pattern can be shifted to the right by more than one character without missing a possible occurrence of a word, thus reducing the overall number of iterations [8–10]. The exact set matching problem is a generalization of the string matching problem. It consists of finding all the occurrences of a set of patterns, as opposed to just a single one (see, e.g., [11]).

Suffix trees [12–15] have been successfully applied to a large number of problems in bioinformatics and computational biology. A suffix tree is a data structure that, once built, solves the exact string matching problem in O(m) time [14], where m is the length of the pattern. Suffix trees are different from the techniques described above in the sense that they first need to create a suffix tree data structure from the input text. In order for the above listed suffix tree algorithms to achieve the time bounds, they require Θ(n · |Σ|) space, where Σ is the alphabet and n is the length of the text. After building the suffix tree, any pattern of length m can be matched in O(m) time [7]. Suffix trees have been the preferred choice for a number of problems, including finding maximal repeats, finding supermaximal repeats, and finding the longest common substring [7]. Note that suffix trees can be built for either the input text or the pattern (or set of patterns), depending on which one is longer. For very long input sequences, the linear space bounds for suffix trees may still not be sufficient. More recently, suffix arrays [16] were introduced as a more space-efficient alternative to suffix trees. Despite the lower space complexity, they are almost as efficient at solving the exact string and set matching problems as suffix trees [7]. A suffix array for a string of n characters is defined as an array of integers in the range of 1 to n, specifying the lexicographic order of all the n suffixes of the string. Suffix arrays are built by lexicographically sorting, and hence grouping together, all the suffixes of a string. A given pattern can then be found by performing a binary search. A similar approach can be achieved using the Burrows–Wheeler transform [17]. The Burrows–Wheeler transform also offers the possibility to compress the DNA sequence, and for the algorithm to operate in the compressed space [18].
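The suffix-array lookup described above can be sketched in a few lines of C. This is only an illustration: construction here is a naive O(n² log n) sort of suffix pointers, not the efficient constructions of the cited works, and the function names are ours.

```c
/* Minimal suffix array sketch: sort all suffix start positions
 * lexicographically, then binary-search for a pattern prefix. */
#include <stdlib.h>
#include <string.h>

static const char *g_text; /* text shared with the qsort comparator */

static int cmp_suffix(const void *a, const void *b) {
    return strcmp(g_text + *(const int *)a, g_text + *(const int *)b);
}

/* Returns a malloc'd array sa where sa[k] is the start index of the
 * k-th suffix in lexicographic order. */
int *build_suffix_array(const char *text, int n) {
    int *sa = malloc(n * sizeof *sa);
    for (int i = 0; i < n; i++) sa[i] = i;
    g_text = text;
    qsort(sa, n, sizeof *sa, cmp_suffix);
    return sa;
}

/* Binary search over the sorted suffixes: does `pat` occur in `text`? */
int sa_contains(const char *text, const int *sa, int n, const char *pat) {
    int lo = 0, hi = n - 1, m = (int)strlen(pat);
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        int c = strncmp(text + sa[mid], pat, m);
        if (c == 0) return 1;       /* suffix begins with the pattern */
        if (c < 0) lo = mid + 1; else hi = mid - 1;
    }
    return 0;
}
```

Each probe compares only the first m characters of a suffix against the pattern, so the search costs O(m log n), matching the binary-search behavior the text describes.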

Compressed radix tries [19] (or Patricia tries) are data structures for storing collections of strings. They are similar to radix trees but, unlike conventional tries, nodes with only one child are merged with their parents. Radix trees allow a number of operations, including insertion, deletion, lookup, finding a successor, and finding a predecessor, to be performed in linear time (with respect to the length of the string to which they are applied) [19]. They are among the fastest algorithms for managing strings [20] but are generally not as space efficient as suffix trees. A number of authors have developed mechanisms for reducing the space requirements of the radix tree while attempting to maintain fast access times [21–24]. Others have proposed space reductions at the expense of time [25, 26]. Recently, burst tries have been shown to be among the fastest and most compact in-memory tree structures to handle large volumes of variable-length strings [20, 27, 28].

Suffix trees, suffix arrays, the Burrows–Wheeler transform, Aho and Corasick's algorithm, radix trees, and burst tries can all be employed as a basis for enumerating the word space of a DNA sequence. For example, to enumerate all the words of a given maximum length l in a suffix tree, an algorithm could perform a depth-first search until no further nodes with edge labels corresponding to strings of length ≤ l can be found. For each word represented by an inner node, the number of leaves of the subtree starting from this node gives the number of repeats of the word. For a radix tree, each node at level l of the tree represents a word of length l. The number of times a given node is hit during the generation of the radix tree gives the number of repetitions. Our approach to fully enumerating the word space of a DNA sequence is based on compressed radix tries because (i) they allow for an efficient parallelization (as will be shown below), (ii) they allow constant-time insertions and lookups, and (iii) utilizing some of the aforementioned improvements for reduction of space requirements, they offer a good trade-off between space complexity and access time.

One of the main challenges of parallelizing the word space enumeration for multicore/multiprocessor systems is the use of data structures and data accesses that make efficient use of the cache hierarchy. While this is generally true even in sequential applications, it becomes of paramount importance in the parallel case. The main issue is the fact that the trie structures generated during the enumeration of the word space are extremely space intensive. The individual trie nodes are, however, to a certain extent cache conscious due to the fact that their size is very small [20]. This issue of cache-aware data structures and layout has been previously studied in the literature. Strategies for improving the cache usage for pointer-based data structures by cache-conscious data layout were discussed in [29, 30], and by performing cache-conscious memory allocation in [31, 32]. These strategies aim at improving the cache usage of pointer-intensive data structures by improving the access locality of algorithms. Rao and Ross developed cache-sensitive search tree techniques that aim at improving the locality of references by changing the data layout in order to reduce pointer-chasing operations [33, 34]. Our approach will borrow some of the ideas presented above for allocating new trie nodes likely to be used in the near future in a consecutive fashion, thus improving the spatial locality and the performance of hardware prefetchers [35]. While much of the aforementioned work is focused on efficient and space-optimized storage of strings in memory, the focus of our work is on developing an efficient parallel word search algorithm by improving the access locality and cache usage of multiple parallel threads, while largely avoiding the need for locking. In our approach, we exploit the special correlation between input sequence data and output data to develop a cache-conscious parallelization of word storage and retrieval in the context of the particular application of building a vocabulary for genome sequences.


3 Word space enumeration

3.1 Word search framework

The word search is typically divided into several stages: (1) Word Search: In this stage, either a single input DNA sequence or a set of sequences is provided as input. Subsequently, all the words between a specified minimum and maximum word size are identified and stored, along with statistical information such as the total number of occurrences of a word and the number and indices of sequences in which the word occurs. The reason for considering ranges of word lengths is biologically motivated. For example, Bilu and Barkai [36] state that binding sites in S. cerevisiae are between 5 and 19 bp (bp = base pairs) in length and 9.3 bp on average. Tomovic and Oakeley [37] conducted a detailed analysis of the transcription factor binding sites stored in JASPAR [38] and determined average lengths for different subsets between 8.25 and 12.15 bp. (2) Word Scoring: This stage computes a score for each word determined in the Word Search phase which represents the exceptionality of the word. The degree of exceptionality of a word is typically defined by its statistical over- or underrepresentation. A number of scores are used in existing tools, ranging from observed-to-expected scores to z-scores to p-values. A number of scoring schemes require the computation of the probability for a word to occur anywhere in the input sequence. Markov chain models of a given order m are frequently used to perform this computation [39]. Note that a background model of maximum Markov order to determine word probabilities may require determining the number of occurrences of all words ranging from 1 to the maximum word size. This is another reason why the ability to enumerate ranges of words is important. (3) Word Selection and Word Clustering: In subsequent stages, a number of top words will be selected based on the score computed in the previous stage, and groups of different words will be formed that are candidates for being related in an evolutionary sense (i.e., changed through mutations, insertions, and deletions).
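To make the simplest of the scores mentioned in stage (2) concrete, an observed-to-expected ratio under a 0-order (Bernoulli) background model can be computed as below. This is a sketch, not the paper's scoring code; the function names and the uniform-frequency example are our assumptions.

```c
/* Observed-to-expected score under a 0-order background model:
 * the expected count of a word w of length m in a sequence of
 * length n is (n - m + 1) times the product of its character
 * frequencies. z-scores and p-values would additionally need a
 * variance estimate, which is not shown here. */
#include <string.h>

/* freq[0..3] are the background frequencies of a, c, g, t. */
double expected_count(const char *w, long n, const double freq[4]) {
    int m = (int)strlen(w);
    double p = 1.0;
    for (int i = 0; i < m; i++) {
        switch (w[i]) {
        case 'a': p *= freq[0]; break;
        case 'c': p *= freq[1]; break;
        case 'g': p *= freq[2]; break;
        default:  p *= freq[3]; break;
        }
    }
    return (double)(n - m + 1) * p;
}

double obs_exp_score(long observed, const char *w, long n,
                     const double freq[4]) {
    return (double)observed / expected_count(w, n, freq);
}
```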

3.2 Parallelizing the word space enumeration

In the following, we will focus on the challenges of stage 2, where an internal representation of the input sequence is generated to determine repeats in the input sequence. Depending on the length of the input sequence, this phase may require a substantial amount of main memory and may adversely affect the system performance when not done in an efficient and cache-aware manner. The main challenge is to impose locality on a problem that, by definition, does not exhibit locality.

For example, consider a naive implementation that uses a standard hash table for the internal representation. For a DNA sequence corresponding to a human chromosome, the input will amount to somewhere between 24 and 230 Mb of data. Tracking repeats within a minimum and maximum word length, and maintaining statistics for them, will result in gigabytes of data required to store this information. A hash table approach first needs to choose a very large table size to reduce the number of pointer-chasing operations and string comparisons in case of conflicts. The hash function is, by definition, not local, and front-side bus contention will greatly reduce the performance of any parallel implementation based on this representation.

As mentioned in Sect. 2, our approach employs a data representation that is based on compressed radix trees [19]. In this representation, each path from the root to an intermediate node or leaf corresponds to a word. The pseudo-code in Fig. 1 (left) sketches the individual steps of stage 2 of the Word Search framework. The algorithm iterates over the whole input sequence. For a given index into the input sequence, all the words ranging from the minimum to the maximum word length are generated. Each word is presented to the radix tree. If it does not exist, it is added to the trie. If it does exist, its statistics are updated.
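The insert-or-update loop just described can be sketched with a plain (uncompressed) 4-ary trie. The paper's implementation uses compressed radix tries with richer per-node statistics; this sketch only illustrates the sliding-window counting pattern, and the names are ours.

```c
/* Enumerate all words of length 1..maxlen in a DNA sequence with an
 * uncompressed 4-ary trie; each node counts occurrences of the word
 * ending there. */
#include <stdlib.h>
#include <string.h>

typedef struct Node { struct Node *child[4]; long count; } Node;

static int code(char c) {            /* map a/c/g/t to 0..3 */
    switch (c) { case 'a': return 0; case 'c': return 1;
                 case 'g': return 2; default: return 3; }
}

Node *new_node(void) { return calloc(1, sizeof(Node)); }

/* For every start position, walk/extend the trie for lengths 1..maxlen. */
void count_words(Node *root, const char *seq, int maxlen) {
    int n = (int)strlen(seq);
    for (int i = 0; i < n; i++) {
        Node *cur = root;
        for (int j = i; j < n && j - i < maxlen; j++) {
            int c = code(seq[j]);
            if (!cur->child[c]) cur->child[c] = new_node();
            cur = cur->child[c];
            cur->count++;            /* word seq[i..j] seen once more */
        }
    }
}

long lookup(Node *root, const char *w) {
    Node *cur = root;
    for (; *w; w++) {
        cur = cur->child[code(*w)];
        if (!cur) return 0;
    }
    return cur->count;
}
```

Note that, as in the text, every word from the minimum to the maximum length is counted in a single downward walk per start position, so words sharing a prefix are handled incrementally.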

There are different conceivable ways of parallelizing the above algorithm, all of which have certain advantages and disadvantages. The main aspects and challenges of parallelizing the generic word search algorithm can be stated as follows: (i) Cache efficiency: There are two main data structures, the input sequence and the radix tree. Ideally, we want to achieve a cache-optimal access pattern to both data structures while processing words. In other words, a good scheme would be one in which processing a limited window of the input sequence always results in accesses to a limited window of the radix tree. (ii) Synchronization efficiency: Since the input sequence is read-only, locking would affect only the radix tree. Avoiding any locking would be preferred to an approach in which fine- or coarse-grained locking is necessary. If locking is necessary, the lock contention should be as small as possible. (iii) Load balancing and dynamic partitioning: When partitioning the problem among different threads, we should ensure that the work can be partitioned in a way that gives each thread an equal amount of work to perform. This may be done either statically or by dynamic repartitioning.

The most obvious attempt at parallelization would be to divide the total input sequence into windows of a smaller size that can be processed by different threads (see Fig. 2 (left)). This approach has several drawbacks with respect to the requirements of an efficient implementation. While it (theoretically) offers cache locality for the repetitive iteration over the input sequence window, it is clearly not cache-efficient with respect to the radix tree. The main problem here is that there is generally no clear correlation between the sequence window and an isolated area in the radix tree. Different threads may consequently need to update shared branches in the radix tree. Additionally, most of the cache efficiency for the input sequence is largely offset by cache pollution incurred by the nonlocal radix tree access pattern. This strategy also introduces the need for either coarse-grained or fine-grained locking, where coarse-grained locking will result in a very high lock contention, and very fine per-node locking may result in a significant additional space overhead (a locking scheme such as per-level locking might be a trade-off between these two approaches). One of the few advantages of such a scheme is that the load can be fairly evenly balanced among the different threads, and no additional measures to ensure load balancing need to be taken. As we will show in the experimental results section, the overall speed-up achieved by such an embarrassingly parallel implementation is very low.

Fig. 1 (Left) Sequential word search algorithm (SWSA); (Right) Parallelized word search algorithm (PWSA)

Fig. 2 (Left) Embarrassingly parallel approach; (Right) Cache-aware partitioning scheme

The approach taken in our parallelized word space enumeration algorithm is based on the observation that, in order to achieve any reasonable speed-up in a multicore/multiprocessor system, we need to ensure that, during the processing of words, different threads access (mostly) isolated and preferably relatively small portions of the radix tree. Besides being more cache-effective, this would also enable us to largely avoid the need for locking. In order to achieve a partition of the radix tree as suggested in Fig. 2 (right), we need to partition the input sequence among the threads in such a way as to let each thread process only sequences starting with a particular prefix (see the explanations below for what we mean by "locking region").

Example 1 As a simple example, consider the case where we partition the input sequence among 4 threads, T1, T2, T3, T4, such that T1 only processes words in the original sequence starting with prefix "a," T2 only elements with "c," T3 elements with "g," and T4 elements with "t." If a thread encounters a prefix that does not match its allocated prefix, the word is skipped. If a match is found, the prefix can be used to determine the words following this prefix from the minimum up to the maximum total word length. Similarly, a partition into 8 threads T1, T2, . . . , T8 could be performed by assigning them the following prefixes of depth 2: "aa," "ac" (T1), "ag," "at" (T2), "ca," "cc" (T3), . . . , "tg," "tt" (T8).

These prefixes are used to divide the radix tree among the threads. Each thread can now further reduce the size of the region of the radix tree that is accessed during a small time window, and thus increase the degree of locality of references, by further refining the depth of the prefixes.

Example 2 Thread T1 in the previous example was assigned prefixes "aa" and "ac." We could increase the total prefix depth to 4 and let T1 process the input sequence in the following order: "aaaa," . . . , "aatt," "acaa," . . . , "actt."
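The prefix sets in Examples 1 and 2 can be generated mechanically. The sketch below enumerates all 4^l prefixes of depth l and assigns them to threads in contiguous blocks, matching Example 1's layout; the function names are ours, and Sect. 3.3 replaces this uniform split with a probability-weighted assignment.

```c
/* Enumerate DNA prefixes of depth l and assign them to threads
 * in contiguous blocks. */
#include <string.h>

#define ALPHA "acgt"

/* Write prefix number `idx` (0 <= idx < 4^l) as a string of length l,
 * treating the prefix as a base-4 numeral over {a,c,g,t}. */
void prefix_string(int idx, int l, char *out) {
    for (int pos = l - 1; pos >= 0; pos--) {
        out[pos] = ALPHA[idx % 4];
        idx /= 4;
    }
    out[l] = '\0';
}

/* Thread that owns prefix idx when the n_prefixes prefixes are split
 * into n_threads contiguous blocks (as in Example 1). */
int owner_thread(int idx, int n_threads, int n_prefixes) {
    return (idx * n_threads) / n_prefixes;
}
```

With l = 2 and 8 threads this reproduces Example 1: prefixes "aa" and "ac" land on thread 0, and "tg" and "tt" on thread 7.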

In the following, we formalize the prefix problem and describe strategies of how to assign prefixes to threads.

In general, we consider a set of threads T1, T2, . . . , Tn. Each thread Ti is assigned a set of prefixes S_i = {p^i_1, p^i_2, . . . , p^i_{k_i}}, where

p^i_j = α^i_{j,1} α^i_{j,2} · · · α^i_{j,l} α^i_{j,l+1} · · · α^i_{j,l+c−1},

and α^i_{j,k} ∈ Σ (Σ is the string alphabet). We generally assume that all the threads have the same prefix depth l and that |Σ|^l > n. Consider a prefix. The first l elements of the prefix are used to partition the trie among different threads. When thread Ti scans through the input sequence, it will skip all words whose prefix does not match α^i_{j,1} α^i_{j,2} · · · α^i_{j,l}. The remaining c elements in each prefix are not shared among threads and may be used to enable a thread to further limit the region in the radix tree that is operated on at a given time. This can be done by having each thread Ti systematically generate all the permutations of α^i_{j,l+1} · · · α^i_{j,l+c−1}. To summarize, each thread will skip any word whose prefix does not match α^i_{j,1} α^i_{j,2} · · · α^i_{j,l} α^i_{j,l+1} · · · α^i_{j,l+c−1}.

This scheme can be arbitrarily refined to allow a finer-grained partitioning among threads. Note that this refinement is done to achieve three complementary goals: (i) to achieve a (largely) disjoint partition of the total radix tree among a number of threads, (ii) to avoid or limit the need for locking, and (iii) to achieve a better cache locality within each thread by limiting the window within the radix tree that is accessed at any given point in time.

Note that, while the need for locking is reduced, it is still necessary for pairs of threads sharing a common prefix path. Since the prefixes are partitioned among threads, the longest possible shared path is of length l − 1. However, since generally l ≪ m, where m is the maximum word size, this is practically not a major concern.

While theoretically a promising approach, in practice this partition scheme results in a trade-off between cache optimization and additional overhead incurred by the prefix matching. On the one hand, in theory, we can see that the more refined the partitioning among the threads, the smaller the region of the radix tree that is accessed by any thread at any given point in time, and hence the cache efficiency of the implementation increases. If the region is small enough to fit in the L2 cache of the system, this might dramatically reduce the front-side bus contention between cores/CPUs and main memory. On the other hand, restricting threads to the processing of certain prefixes incurs an additional overhead due to the fact that the prefixes of each thread need to be matched against the input sequence, which may result in skipping of characters, and hence an overall larger number of read operations on the input sequence. Overall, the probability of skipping characters, and thus the overhead, will increase with longer prefixes. In Sect. 4, we will present experiments to determine suitable values for the prefix depth.

Figure 1 (right) presents the parallelized Word Search algorithm. Each thread receives the list of prefixes to use. The outer loop of each thread iterates over all prefixes. For a given prefix, the inner loop iterates over the whole input sequence. Each generated word is matched against the current prefix and, if no match is found, the thread skips to the next character. If a match is found, the processing is performed as usual.
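The inner loop of a single PWSA worker might look as follows. This is a sketch under our own naming: `process_word` stands in for the radix-tree insert/update (not shown), and the real implementation interleaves this scan with the per-thread prefix refinement described above.

```c
/* One pass of a PWSA worker for a single assigned prefix: scan the
 * whole input once, skipping every position whose upcoming characters
 * do not match the prefix, and hand matching words to process_word. */
#include <string.h>

typedef void (*word_fn)(const char *word, int len);

/* Returns the number of positions at which the prefix matched. */
long scan_with_prefix(const char *seq, const char *prefix,
                      int min_len, int max_len, word_fn process_word) {
    long hits = 0;
    int n = (int)strlen(seq), pl = (int)strlen(prefix);
    for (int i = 0; i + min_len <= n; i++) {
        if (strncmp(seq + i, prefix, pl) != 0)
            continue;                  /* prefix mismatch: skip position */
        hits++;
        for (int len = min_len; len <= max_len && i + len <= n; len++)
            if (process_word) process_word(seq + i, len);
    }
    return hits;
}
```

The outer loop of the worker would simply call this function once per assigned prefix, which is where the extra read operations over the input sequence, discussed above, come from.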

3.3 Calculating prefixes

Another issue that has not been discussed yet concerns the fact that partitioning the data among the threads based on prefixes may result in a load imbalance among the threads if the characters of the given alphabet are not uniformly distributed in the input sequences, as is the case for the frequencies of the four nucleotides (adenine, thymine, guanine, and cytosine). Different organisms have different frequencies, leading to the use of "gc" content (the percentage of guanine and cytosine in contrast to adenine and thymine) as a way to discern between different species. The reference organism Arabidopsis thaliana (thale cress) has a "gc" content of around 36% [40], varying over its 5 chromosomes, while the bacterium Streptomyces coelicolor has a "gc" content of 72% [41].

We assume again that the prefix depth l (or, more specifically, d = l + c) is specified by the user, and that for the number of threads n, |Σ|^l > n. The prefix assignment problem can be stated as follows: Find a partition of all the |Σ|^l distinct prefixes among all the threads so as to minimize the imbalance in load that the threads will encounter at runtime.

Since we do not have a priori knowledge of the exact distribution of characters in the input sequence, we use a simplified statistical model to describe the word distribution. We employ a 0-order Markov model (= Bernoulli model) in which the probability for a word w = (w_1, w_2, . . . , w_m), w_i ∈ Σ = {α_1, α_2, . . . , α_s}, to occur anywhere in the input sequence is given by φ̂_w = ∏_{i=1}^{m} φ(w_i), where φ(α_1), φ(α_2), . . . , φ(α_s) is the distribution of the characters in Σ. Hence, this model assumes that the probability for any string to occur in the input sequence is independent of the context of the string.

Let P = {p_1, p_2, . . . , p_{|Σ|^l}} be the set of prefixes to be partitioned among the n threads and p_i = (ρ_{i,1}, ρ_{i,2}, . . . , ρ_{i,l}) ∈ Σ^l. Furthermore, let φ be the distribution of the characters of the input sequence. Since we assume a Bernoulli model, φ̂_i = ∏_{j=1}^{l} φ(ρ_{i,j}) gives the probability for the prefix p_i to occur anywhere in the input sequence. If a set of prefixes is assigned to a thread, then we can use the sum of the φ̂_i's as a measure of the load for this thread.

Let a partition of P into n subsets S_1, S_2, . . . , S_n be described by

π_{i,j} = 1 if p_i ∈ S_j, and π_{i,j} = 0 otherwise.

Equipped with the above model, we can now reformulate the prefix problem as follows: Find a partition π_{i,j}, i = 1, 2, . . . , |Σ|^l, j = 1, 2, . . . , n, such that

max_{j=1,...,n} { ∑_{i=1}^{|Σ|^l} π_{i,j} · φ̂_i } → min.

This problem is a well-known NP-complete problem known as the Multiprocessor Scheduling Problem (MSP), where a set of independent tasks is to be scheduled on identical processors in order to minimize the schedule length [42]. A well-studied heuristic for this problem is the LPT (Longest Processing Time) rule [43], which orders the tasks in nonincreasing order of processing times. The tasks are assigned to processors in such a way that at each step the first available processor is selected to process the first available task on the list.

Applied to our problem, the set of processors in the MSP corresponds to the threads in the prefix problem, and the set of tasks in the MSP corresponds to the prefixes in the prefix problem. This makes the LPT rule applicable to the prefix problem. The following section will discuss an implementation of the LPT rule.
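Under the correspondence above, an LPT-based assignment could be sketched as follows: each prefix carries a weight equal to its Bernoulli probability φ̂_i, the prefixes are sorted by nonincreasing weight, and each is handed to the currently least-loaded thread. The structure and names are ours, not the paper's implementation.

```c
/* LPT assignment of weighted prefixes to threads. */
#include <stdlib.h>

typedef struct { int id; double w; } Prefix;   /* w = Bernoulli prob. */

static int by_weight_desc(const void *a, const void *b) {
    double d = ((const Prefix *)b)->w - ((const Prefix *)a)->w;
    return (d > 0) - (d < 0);
}

/* Fills owner[prefix id] with the assigned thread index; returns the
 * makespan, i.e. the largest per-thread load. Assumes n_threads <= 64. */
double lpt_assign(Prefix *p, int n_prefixes, int n_threads, int *owner) {
    double load[64] = {0};
    qsort(p, n_prefixes, sizeof *p, by_weight_desc);
    for (int i = 0; i < n_prefixes; i++) {
        int best = 0;                       /* least-loaded thread so far */
        for (int t = 1; t < n_threads; t++)
            if (load[t] < load[best]) best = t;
        owner[p[i].id] = best;
        load[best] += p[i].w;
    }
    double max = 0;
    for (int t = 0; t < n_threads; t++)
        if (load[t] > max) max = load[t];
    return max;
}
```

For example, four prefixes with weights 0.4, 0.3, 0.2, 0.1 on two threads are balanced into loads 0.5 and 0.5, whereas a naive in-order split into pairs would give 0.7 and 0.3.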

4 Experimental results

In this section, we present a set of experiments that demonstrate the effectiveness and scalability of the proposed framework. We compare the performance of the Parallelized Word Search algorithm (PWSA; Fig. 1 (right)) with the sequential Word Search Algorithm (SWSA; Fig. 1 (left)). Finally, we present an experimental comparison with the embarrassingly parallel version of the code (EPWSA; see Fig. 2 (left)) to demonstrate its failure to perform adequately on the given hardware platform.


4.1 Implementation and experimental setup

The experiments described in the following subsections were all executed on an Intel dual-processor Xeon E5410 quad-core 2.33 GHz system with 32 GB of main memory and 2 × 6 MB of L2 cache (for each CPU). The system features a total of 8 cores (divided into two dual-core pairs per physical package). The operating system environment consists of Ubuntu 64-bit Linux, kernel release 2.6.22-14-generic. All the word search algorithms were implemented in C using the gcc compiler version 4.1.3 with "-O3" compiler optimizations.

The PWSA implementation is based on POSIX Pthreads [44]. The implementation aimed to achieve two main (and generally conflicting) goals: (1) minimization of the total memory necessary to build the radix tree, and (2) elimination of all potential factors that might adversely affect the performance of the multithreaded implementation. The first goal was achieved by optimizing the data structures and algorithms for a minimal memory footprint of the application, trading time complexity for space complexity where necessary. Several of the techniques introduced in Sect. 2 to minimize the space for radix trees were employed [21–24], including: (i) compression of nodes with only one successor into a single node, (ii) preallocation of radix nodes to each thread to overcome the lack of scalability of the malloc() function, and (iii) allocation of radix tree nodes in a (per-thread) array, which allows nodes to be referenced by 32-bit array indices instead of 64-bit pointers. The second goal involved avoiding performance-limiting factors such as "cache ping-pong" caused by false sharing, and avoiding locking wherever possible. False sharing was eliminated by padding all frequently accessed writable data structures to the size of a cache line. As described above, the need to lock radix data structures is limited to accesses to radix nodes on paths (from the root) up to the length of the longest shared prefix.
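Technique (ii) and (iii) above can be sketched as a per-thread node pool; the struct layout and names here are illustrative assumptions, not the actual PWSA data structures:

```c
#include <stdint.h>
#include <stdlib.h>

/* A per-thread pool of radix-tree nodes, preallocated in one block so that
 * malloc() is never called on the hot path, and referenced by 32-bit array
 * indices instead of 64-bit pointers (index 0 serves as a null sentinel). */
#define RX_NULL 0u

typedef struct {
    uint32_t child[4];   /* one slot per nucleotide: a, c, g, t */
    uint32_t count;      /* occurrences of the word ending at this node */
} rx_node;

typedef struct {
    rx_node *nodes;      /* one contiguous, preallocated block */
    uint32_t used;       /* next free slot; slot 0 reserved as null */
    uint32_t capacity;
} rx_pool;

int rx_pool_init(rx_pool *p, uint32_t capacity)
{
    p->nodes = calloc(capacity, sizeof(rx_node));
    if (!p->nodes) return -1;
    p->used = 1;         /* reserve index 0 as RX_NULL */
    p->capacity = capacity;
    return 0;
}

/* Bump-allocate one node; returns RX_NULL when the pool is exhausted. */
uint32_t rx_alloc(rx_pool *p)
{
    if (p->used >= p->capacity) return RX_NULL;
    return p->used++;
}
```

On a 64-bit platform, replacing child pointers with 32-bit indices halves the space for child links, which is the effect the paper relies on in goal (1).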

The sequential version, SWSA, utilizes the same radix tree data structure as PWSA. An optimal sequential radix-tree-based word enumeration algorithm has to extract every word from the input sequence (whose length is within the specified interval) and present it to the radix tree to check whether it has already been stored, updating it if necessary. Skipping characters using the techniques described in Sect. 2 is not possible, since the whole word space needs to be enumerated and all repeats have to be tracked. Our SWSA implementation slides a variable-size window (ranging from the minimum to the maximum word size) over the input sequence. While retrieving each new word has a complexity linear in the minimum word length, all the words with the same prefix (i.e., words with a length between the minimum and maximum word size) can be retrieved in constant time. To our knowledge, this strategy is an optimal sequential implementation of radix-tree-based word space enumeration.
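The variable-size sliding window can be sketched as below; `visit` stands in for the radix-tree lookup-and-update, and all names are illustrative assumptions rather than the SWSA source:

```c
#include <stddef.h>

/* Enumerate every word of length minw..maxw from seq (length n) by sliding
 * a variable-size window over the sequence; visit() stands in for the
 * radix-tree update. Returns the number of words presented. */
size_t enumerate_words(const char *seq, size_t n,
                       size_t minw, size_t maxw,
                       void (*visit)(const char *start, size_t len))
{
    size_t count = 0;
    for (size_t i = 0; i + minw <= n; i++) {
        size_t lim = maxw;
        if (lim > n - i) lim = n - i;   /* clip window at sequence end */
        /* All words starting at i share the prefix seq[i..i+minw-1]; in
         * the real radix tree, each one-character extension beyond that
         * prefix costs O(1), matching the constant-time claim above. */
        for (size_t w = minw; w <= lim; w++) {
            if (visit) visit(seq + i, w);
            count++;
        }
    }
    return count;
}
```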

The embarrassingly parallel version of the code, EPWSA, also uses POSIX Pthreads, and essentially the same trie structures as PWSA and SWSA. However, since this version introduces the need to synchronize access to the radix tree shared by multiple threads, locking primitives have to be added. After some experimentation, we found that a locking scheme with one lock per combination of the two start characters of a word and the level of the node in the radix tree (i.e., the word length) yields the best trade-off between space requirements and concurrent access to the radix tree. Hence, there are m · |Σ|² locks, where m is the maximum word length and Σ is the alphabet.
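The m · |Σ|² lock addressing can be illustrated as a simple index computation; the function names and array layout are assumptions made for this sketch:

```c
/* EPWSA-style lock addressing (a sketch): one lock per combination of the
 * first two characters of a word and its length, m * |Sigma|^2 = m * 16
 * locks total for Sigma = {a, c, g, t}. */
static int base2idx(char c)
{
    switch (c) {
    case 'a': return 0;
    case 'c': return 1;
    case 'g': return 2;
    case 't': return 3;
    default:  return -1;   /* character outside the alphabet */
    }
}

/* Returns an index into a lock array of size max_len * 16,
 * or -1 for invalid input. */
int lock_index(const char *word, int len, int max_len)
{
    int c0 = base2idx(word[0]);
    int c1 = base2idx(word[1]);
    if (c0 < 0 || c1 < 0 || len < 1 || len > max_len)
        return -1;
    return ((len - 1) * 16) + (c0 * 4) + c1;
}
```

A thread would then lock the `pthread_mutex_t` at that index before touching the corresponding region of the shared tree, so that words differing in either start characters or length never contend for the same lock.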

4.2 LPT vs. RR prefix calculation strategy

In the first experiment, we investigate the impact of the load-balancing strategy on the speed-up of PWSA over SWSA. We compare an implementation of the LPT heuristic for prefix calculation (introduced in Sect. 3.3) with a simple greedy heuristic, RR, that partitions the prefixes among threads in a round-robin manner.1 For both heuristics, we selected a prefix depth of d = l + c with l = 2 and c = 2, and n = 8 threads (see Sect. 4.3 for the choice of d and l). Since Σ = {a, c, g, t}, there are a total of 4² = 16 prefixes to be partitioned among the 8 threads. The distribution of the nucleotides in the input sequence, φ(a), φ(c), φ(g), φ(t), can be computed by simply counting the occurrences of a, c, g, t and dividing the counts by the length of the input sequence. Our LPT implementation avoids iterating through the whole input sequence by calculating an approximate distribution from a number of sample windows selected randomly from the input sequence. Since both the LPT and RR heuristics take time linear in the number of prefixes, neither incurs a significant overhead for the chosen values of l and n.
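The sample-window estimation of the nucleotide distribution might look as follows; the function signature and parameters are illustrative assumptions, not the paper's code:

```c
#include <stdlib.h>

/* Estimate the nucleotide distribution phi(a), phi(c), phi(g), phi(t) of a
 * sequence from `samples` randomly placed windows of length `win`, instead
 * of scanning the full input. freq[] receives the four relative
 * frequencies in a, c, g, t order. */
void estimate_freq(const char *seq, size_t n, size_t win,
                   int samples, double freq[4])
{
    size_t counts[4] = {0, 0, 0, 0}, total = 0;
    for (int s = 0; s < samples; s++) {
        size_t start = (size_t)rand() % (n - win + 1);
        for (size_t i = start; i < start + win; i++) {
            switch (seq[i]) {
            case 'a': counts[0]++; total++; break;
            case 'c': counts[1]++; total++; break;
            case 'g': counts[2]++; total++; break;
            case 't': counts[3]++; total++; break;
            /* other characters (e.g. 'n') are simply skipped */
            }
        }
    }
    for (int b = 0; b < 4; b++)
        freq[b] = total ? (double)counts[b] / (double)total : 0.0;
}
```

The estimated frequencies would then be multiplied out over each l-character prefix to obtain the per-prefix weights fed to the LPT rule.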

Since the frequencies of the nucleotides tend to be fairly consistent within an organism, we demonstrate the load-balancing effects on three sample organisms: human, C. elegans, and Arabidopsis thaliana. For this experiment, we chose a minimum word length of 2, a maximum word length of 32, and a sequence of 100,000,000 nucleotides for each of the three organisms. We ran SWSA and PWSA on n = 2, n = 4, and n = 8 cores, and computed the speed-ups by averaging the results over 10 sample runs for each organism. The bottom left plot in Fig. 3 shows the results of this experiment. The results show that LPT performs significantly better overall than RR by balancing the load between the cores more evenly. We can also see that for the human genome, which exhibits the largest differences in the distribution of the nucleotide frequencies, the performance gain of LPT over RR is larger than for the other two organisms. Finally, a larger number of cores yields a larger performance benefit.

4.3 PWSA vs. SWSA

In the next experiment, we investigated the impact of the prefix depth d = l + c on the performance of PWSA. We chose an input sequence of 300,000,000 nucleotides of the human genome and a minimum word length of 2, and varied the maximum word length from 12 to 20 in increments of 2. We ran PWSA using the LPT heuristic with n = 8 threads, as well as the sequential version, SWSA, and computed both the execution times and the speed-ups by averaging the results over 10 sample runs. The top left plot in Fig. 3 shows the speed-ups, and the top right table presents both the speed-ups and the execution times. The results show that the choice of d = 2 + 2 appears to

1Note that the prefixes are initially alphanumerically sorted.

Fig. 3 (Top left) PWSA vs. SWSA, human genome, input size 300,000,000, word lengths n = 12, 14, 16, 18, 20, varying prefix depths d = l + c, LPT heuristic; (top right) PWSA vs. SWSA, table: corresponding execution times and speed-ups; (bottom left) PWSA vs. SWSA, LPT vs. RR prefix allocation strategy for the genomes of human, C. elegans, and Arabidopsis thaliana, input size 100,000,000, word length n = 32, prefix depth d = 2 + 2; (bottom right) PWSA vs. EPWSA, human genome, input size 1,500,000,000, word length n = 16, prefix depth d = 2 + 2, LPT heuristic

generally yield the best trade-off between overhead and access locality. The variations in the speed-up for the different word lengths appear to be caused by the fact that the LPT heuristic assumes a Bernoulli model for the composition of words, which deviates from the true character distribution. The best speed-up achieved was 6.9.

4.4 PWSA vs. EPWSA

In our final experiment, we compare the performance of an implementation of the embarrassingly parallel word search algorithm, EPWSA, with PWSA. EPWSA divides the input array among the threads as indicated in Fig. 2 (left). The fact that threads now share the whole radix tree introduces the need for locking primitives in EPWSA; we added the locking scheme described above. We ran both EPWSA and PWSA on 1,500,000,000 base pairs of the human genome and varied the number of cores from 2 to 8 in increments of 1. We applied the LPT heuristic and chose a prefix depth of d = 2 + 2 for PWSA. The bottom right plot in Fig. 3 presents the execution times for both parallel algorithms. PWSA achieves a super-linear speed-up when running on 2, 3, 4, and 5 cores. This appears to be due to caching effects: since each pair of cores shares 6 MB of L2 cache, more cache memory is available to each thread when fewer threads are running. For a larger number of threads, each thread has less L2 cache available, and the resulting saturation of the front-side bus due to a larger number of memory and memory-prefetch requests decreases the overall speed-up to approximately 6 for 8 threads. We clearly see that PWSA provides the more scalable approach.

4.5 Outlook and conclusions

In this paper, we demonstrated that the problem of determining the repeats in an input DNA sequence does not lend itself to embarrassingly parallel solution techniques on multicore/multiprocessor systems, and proposed a new, scalable parallel algorithm for enumerating the word space of genomic sequences. In a series of experiments, we demonstrated that the algorithm performs well for large input sequences on a multiprocessor/multicore machine with a total of 8 cores. The focus of this paper was on demonstrating scalability in terms of CPU cores for a given instance of the problem. Recently, we have begun to use a very similar strategy to distribute the radix tree among the nodes of a distributed-memory system. The first results show that this approach also appears promising for pushing the limits of scalability in terms of the input size of the problem.

References

1. Bailey TL, Elkan C (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers. AAAI Press, Menlo Park, pp 28–36
2. Roth FP, Hughes JD, Estep PW, Church GM (1998) Finding DNA regulatory motifs within unaligned non-coding sequences clustered by whole-genome mRNA quantitation. Nature Biotechnol 16(10):939–945
3. Pavesi G, Mereghetti P, Mauri G, Pesole G (2004) Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucleic Acids Res 32:W199–W203
4. Rigoutsos I, Huynh T, Miranda K, Tsirigos A, McHardy A, Platt D (2006) Short blocks from the noncoding parts of the human genome have instances within nearly all known genes and relate to biological processes. Proc Nat Acad Sci 103(17):6605–6610
5. Sinha S, Tompa M (2003) YMF: a program for discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Res 32(13):3586–3588
6. Wang G, Yu T, Zhang W (2005) WordSpy: identifying transcription factor binding motifs by building a dictionary and learning a grammar. Nucleic Acids Res 33(Web Server issue). http://view.ncbi.nlm.nih.gov/pubmed/15980501
7. Gusfield D (1997) Algorithms on strings, trees, and sequences: Computer science and computational biology. Cambridge University Press, Cambridge
8. Boyer RS, Moore JS (1977) A fast string searching algorithm. Commun ACM 20(10):762–772
9. Knuth DE, Morris JH, Pratt VR (1977) Fast pattern matching in strings. SIAM J Comput 6(2):323–350
10. Apostolico A, Giancarlo R (1986) The Boyer-Moore-Galil string searching strategies revisited. SIAM J Comput 15:98–105
11. Aho AV, Corasick MJ (1975) Efficient string matching: an aid to bibliographic search. Commun ACM 18(6):333–340
12. Weiner P (1973) Linear pattern matching algorithms. In: Proc of the 14th annual IEEE symposium on switching and automata theory, pp 1–11. URL http://citeseer.ist.psu.edu/context/43441/0
13. McCreight EM (1976) A space-economical suffix tree construction algorithm. J ACM 23(2):262–272
14. Ukkonen E (1995) On-line construction of suffix trees. Algorithmica 14(3):249–260
15. Giegerich R, Kurtz S (1997) From Ukkonen to McCreight and Weiner: A unifying view of linear-time suffix tree construction. Algorithmica 19:331–353
16. Manber U, Myers G (1993) Suffix arrays: a new method for on-line string searches. SIAM J Comput 22(5):935–948
17. Burrows M, Wheeler DJ (1994) A block-sorting lossless data compression algorithm. Tech Rep 124. URL http://citeseer.ist.psu.edu/76182.html
18. Adjeroh D, Bell T, Mukherjee A (2008) The Burrows-Wheeler transform: Data compression, suffix arrays, and pattern matching. Springer, Berlin
19. Morrison DR (1968) PATRICIA—Practical algorithm to retrieve information coded in alphanumeric. J ACM 15(4):514–534. doi:10.1145/321479.321481
20. Askitis N, Sinha R (2007) HAT-trie: A cache-conscious trie-based data structure for strings. In: Dobbie G (ed) Proceedings of the thirtieth Australasian computer science conference (ACSC 2007). CRPIT, vol 62. Australian Computing Society, Ballarat, pp 97–105
21. Knuth DE (1998) Art of computer programming, vol 3: Sorting and searching, 2nd edn. Addison-Wesley, Reading
22. Bell TC, Cleary JG, Witten IH (1990) Text compression. In: Prentice Hall advanced reference series. Prentice Hall, New York
23. Sedgewick R (2002) Algorithms in C. Addison-Wesley/Longman, Boston
24. Severance DG (1974) Identifier search mechanisms: A survey and generalized model. ACM Comput Surv 6(3):175–194
25. Al-Suwaiyel M, Horowitz E (1984) Algorithms for trie compaction. ACM Trans Database Syst 9(2):243–263
26. Maly K (1976) Compressed tries. Commun ACM 19(7):409–415. doi:10.1145/360248.360258
27. Sinha R, Zobel J (2004) Cache-conscious sorting of large sets of strings with dynamic tries. J Exp Algorithmics 9:1.5
28. Sinha R, Ring D, Zobel J (2006) Cache-efficient string sorting using copying. J Exp Algorithmics 11:1.2
29. Chilimbi TM, Davidson B, Larus JR (1999) Cache-conscious structure definition. In: PLDI'99: Proceedings of the ACM SIGPLAN 1999 conference on programming language design and implementation. ACM, New York, pp 13–24
30. Chilimbi TM, Hill MD, Larus JR (2000) Making pointer-based data structures cache conscious. Computer 33(12):67–74
31. Badawy AHA, Aggarwal A, Yeung D, Tseng CW (2001) Evaluating the impact of memory system performance on software prefetching and locality optimizations. In: ICS'01: Proceedings of the 15th international conference on supercomputing. ACM, New York, pp 486–500
32. Hallberg J, Palm T, Brorsson M (2003) Cache-conscious allocation of pointer-based data structures revisited with hw/sw prefetching. In: 2nd Annual workshop on duplicating, deconstructing, and debunking
33. Rao J, Ross KA (1999) Cache conscious indexing for decision-support in main memory, pp 78–89
34. Rao J, Ross KA (2000) Making B+-trees cache conscious in main memory. In: Proceedings of the 2000 ACM SIGMOD international conference on management of data, pp 475–486
35. Yang CL, Lebeck AR, Tseng HW, Lee CH (2004) Tolerating memory latency through push prefetching for pointer-intensive applications. ACM Trans Archit Code Optim 1(4):445–475
36. Bilu Y, Barkai N (2005) The design of transcription-factor binding sites is affected by combinatorial regulation. Genome Biol 6(12):R103. doi:10.1186/gb-2005-6-12-r103. URL http://genomebiology.com/2005/6/12/R103
37. Tomovic A, Oakeley EJ (2007) Position dependencies in transcription factor binding sites. Bioinformatics 23(8):933–941. doi:10.1093/bioinformatics/btm055. URL http://bioinformatics.oxfordjournals.org/cgi/content/abstract/23/8/933
38. Sandelin A, Alkema W, Engstrom P, Wasserman WW, Lenhard B (2004) JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucl Acids Res 32(1):D91–94. doi:10.1093/nar/gkh012
39. Robin S, Rodolphe F, Schbath S (2005) DNA, words and models. Cambridge University Press, New York
40. Swarbreck D, Wilks C, Lamesch P, Berardini TZ, Garcia-Hernandez M, Foerster H, Li D, Meyer T, Muller R, Ploetz L, Radenbaugh A, Singh S, Swing V, Tissier C, Zhang P, Huala E (2007) The Arabidopsis Information Resource (TAIR): gene structure and function annotation. Nucleic Acids Res, pp 965+. doi:10.1093/nar/gkm965
41. Borodina I, Krabben P, Nielsen J (2005) Genome-scale analysis of Streptomyces coelicolor A3(2) metabolism. Genome Res 15(6):820–829. doi:10.1101/gr.3364705
42. Karp RM (1972) Reducibility among combinatorial problems. In: Miller RE, Thatcher JW (eds) Complexity of computer computations. Plenum, New York, pp 85–103
43. Graham R (1966) Bounds for certain multiprocessing anomalies. Bell Syst Tech J
44. Butenhof DR (1997) Programming with POSIX threads. Addison-Wesley/Longman, Boston

Frank Drews is an Assistant Professor of Computer Science and Electrical Engineering at Ohio University. Dr. Drews received his Ph.D. in Computer Science from Clausthal University of Technology in Germany. His main research interests are high-performance computing, real-time systems, and bioinformatics. He has served as General Chair and Program Chair for the IEEE International Workshop on Parallel and Distributed Real-Time Systems, as Program Chair for the Second IEEE International Symposium on Applied Computing and Computational Sciences (ACCS 2009), and as a member of the Organization Committee of the Bioinformatics Open Source Conference (BOSC 2009). He is a Member of the Editorial Board of the International Journal of Computational Bioscience, and was Guest Editor for the Journal of Systems and Software Special Issue on Resource Management for Real-Time and Distributed Systems.

Jens Lichtenberg received his B.Sc. and M.Sc. degrees in Business Informatics from the Clausthal University of Technology, Germany, in 2002 and 2004, respectively. He is currently a Ph.D. candidate in Bioinformatics at Ohio University, USA, and the President of the Regional Student Group Ohio for the Student Council of the International Society of Computational Biology. His research interests include regulatory genomics and proteomics.


Lonnie Welch received a Ph.D. in Computer and Information Science from the Ohio State University. Currently, he is the Stuckey Professor of Electrical Engineering and Computer Science at Ohio University, and he is a member of the Graduate Faculties of the Biomedical Engineering Program and of the Molecular and Cellular Biology Program. Dr. Welch performs research in the areas of bioinformatics and high performance computing. His research has been sponsored by the Defense Advanced Research Projects Agency, the Navy, NASA, the National Science Foundation, the Army, and the Ohio Board of Regents. Dr. Welch has more than 20 years of research experience in the area of high performance computing. In his graduate work at Ohio State University, he developed high performance 3-D graphics rendering algorithms, and he invented a parallel virtual machine for object-oriented software. For 15 years, his research focused on middleware and optimization algorithms for high performance computing; this work produced three successive generations of adaptive resource management middleware for high performance real-time systems, and resulted in two patents and more than 150 publications. Currently, Professor Welch directs the Bioinformatics Laboratory at Ohio University, where he performs research in the area of computational regulatory and functional genomics. Dr. Welch is founder and Co-Editor-in-Chief of the International Journal of Computational Biosciences, and is a member of the editorial boards of the International Journal of Computational Science and the Journal of Scalable Computing: Practice and Experience. He is the founder and Chair of the Ohio Bioinformatics Consortium and the Ohio Collaborative Conference on Bioinformatics. He is also the principal investigator of the $9M Bioinformatics Program which is funded by the Ohio Board of Regents and eleven academic institutions from Ohio. Dr. Welch has served on the organizing committees of the Bioinformatics Open Source Conference, the International Symposium on Bioinformatics Research and Applications, and the IEEE International Symposium on Bioinformatics and Bioengineering.