GeneIndex: an open source parallel program for enumerating and locating words in a genome
![Page 1: GeneIndex: an open source parallel program for enumerating and locating words in a genome](https://reader034.fdocuments.in/reader034/viewer/2022052622/558cb71ed8b42af25c8b456a/html5/thumbnails/1.jpg)
GeneIndex: an open source parallel program for enumerating and locating words in a genome
Huian Li, David Hart, Matthias Mueller, Ulf Markward, Craig Stewart
August 3, 2009
![Page 2: GeneIndex: an open source parallel program for enumerating and locating words in a genome](https://reader034.fdocuments.in/reader034/viewer/2022052622/558cb71ed8b42af25c8b456a/html5/thumbnails/2.jpg)
Contents
• Motivation
• Serial algorithm
• Parallel implementation
• Performance analysis
• Conclusion
![Page 3: GeneIndex: an open source parallel program for enumerating and locating words in a genome](https://reader034.fdocuments.in/reader034/viewer/2022052622/558cb71ed8b42af25c8b456a/html5/thumbnails/3.jpg)
Motivation
Question from a Biology professor:
Given a word length, is the computational task of scanning a DNA sequence and recording the positions of all possible words trivial?
5 10 15 20 25 30 * * * * * *
5’ TAGCCGTGGCGGAGCCTCTTGGCTTTGTTTATTC 3’
![Page 4: GeneIndex: an open source parallel program for enumerating and locating words in a genome](https://reader034.fdocuments.in/reader034/viewer/2022052622/558cb71ed8b42af25c8b456a/html5/thumbnails/4.jpg)
Serial algorithm
• Straightforward implementation:
Binary coding for A, C, G, T. For example:
A: 00
C: 01
G: 10
T: 11

T A G C C G T G G C G G A G C C T C T T G ...
110010010110111010011010001001011101111110 ...
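The 2-bit coding can be sketched in a few lines of Python. This is only an illustration of the table on the slide, not the GeneIndex source; the names `CODE`, `encode` and `to_bits` are hypothetical.

```python
# 2-bit code per base, as listed on the slide: A=00, C=01, G=10, T=11.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode(word: str) -> int:
    """Pack a DNA word into an integer, two bits per base, leftmost base first."""
    value = 0
    for base in word.upper():
        value = (value << 2) | CODE[base]
    return value


def to_bits(value: int, k: int) -> str:
    """Render an encoded word of k bases as a string of 2*k bits."""
    return format(value, "0{}b".format(2 * k))


print(to_bits(encode("TAGC"), 4))  # 11001001
```

Encoding the first 21 bases shown on the slide this way reproduces the bit string printed under the sequence.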
![Page 5: GeneIndex: an open source parallel program for enumerating and locating words in a genome](https://reader034.fdocuments.in/reader034/viewer/2022052622/558cb71ed8b42af25c8b456a/html5/thumbnails/5.jpg)
Serial algorithm
• Given a sequence and a word length k, in order to list all possible words, we scan the sequence once from left to right

T A G C C G T G G C G G A G C C T C T T G ...
110010010110111010011010001001011101111110 ...

TAGC: 11001001
AGCC: 00100101
GCCG: 10010110
...
CTTG: 01111110
![Page 6: GeneIndex: an open source parallel program for enumerating and locating words in a genome](https://reader034.fdocuments.in/reader034/viewer/2022052622/558cb71ed8b42af25c8b456a/html5/thumbnails/6.jpg)
Serial algorithm
T A G C C G T G G C G G A G C C T C T T G ...
110010010110111010011010001001011101111110 ...

TAGC: 11001001
AGCC: 00100101

ENCODE("AGCC") = 00100101
ENCODE("AGCC") = ((ENCODE("TAGC") & MASK) << 2) | ENCODE('C')
MASK = $4^{k-1} - 1$ = 111111 in binary (in case of k = 4)
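A minimal sketch of the single left-to-right scan with the mask-and-shift update follows. It assumes an uppercase sequence containing only A, C, G and T, and it reports 1-based positions to match the ruler on the slides; the function name `enumerate_words` is hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def enumerate_words(seq: str, k: int):
    """Yield (encoded_word, position) for every k-letter word in one scan."""
    if k < 1 or len(seq) < k:
        return
    mask = 4 ** (k - 1) - 1  # keeps the 2*(k-1) bits of the last k-1 bases
    word = 0
    for base in seq[:k]:  # encode the first word from scratch
        word = (word << 2) | CODE[base]
    yield word, 1
    for pos in range(1, len(seq) - k + 1):
        # Drop the leading base, shift left by one base, append the new base.
        word = ((word & mask) << 2) | CODE[seq[pos + k - 1]]
        yield word, pos + 1


for word, pos in enumerate_words("TAGCC", 4):
    print(format(word, "08b"), pos)
```

Each step costs one AND, one shift and one OR regardless of k, so the scan is linear in the sequence length.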
![Page 7: GeneIndex: an open source parallel program for enumerating and locating words in a genome](https://reader034.fdocuments.in/reader034/viewer/2022052622/558cb71ed8b42af25c8b456a/html5/thumbnails/7.jpg)
Serial algorithm
• This essentially becomes a sorting problem, since each word is now converted into an integer.
• Each word is associated with its position information: (Encoded Word, Position)
• Sorting has to be stable so that, for the same words, their positions will be in a certain order.
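To illustrate why stability matters, the sketch below sorts (Encoded Word, Position) pairs by word only and relies on the sort being stable to keep the positions of equal words in scan order. It uses Python's built-in `sorted`, which is guaranteed stable; this is an illustration of the idea, not the sorting routine used by GeneIndex, and `index_words` is a hypothetical name.

```python
from itertools import groupby
from operator import itemgetter


def index_words(pairs):
    """Group positions by encoded word, using a stable sort on the word only."""
    ordered = sorted(pairs, key=itemgetter(0))  # stable: ties keep input order
    return {
        word: [pos for _, pos in group]
        for word, group in groupby(ordered, key=itemgetter(0))
    }


pairs = [(2, 1), (0, 2), (2, 3), (0, 4)]  # (encoded word, position) in scan order
print(index_words(pairs))  # {0: [2, 4], 2: [1, 3]}
```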
![Page 8: GeneIndex: an open source parallel program for enumerating and locating words in a genome](https://reader034.fdocuments.in/reader034/viewer/2022052622/558cb71ed8b42af25c8b456a/html5/thumbnails/8.jpg)
Serial algorithm
Implementation details:
• Words & positions are stored in a long long integer (8 bytes = 64 bits)
• Hash table with a linked list for each entry
• Space required for all words in the given sequence is pre-allocated, instead of malloc one by one
• Mostly AND, OR and SHIFT-LEFT operations.
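One way to picture a hash table with a linked list per entry and pre-allocated storage is the sketch below, where all list nodes live in arrays sized once up front and links are array indices. The slides do not describe the actual layout or hash function, so the class `WordTable`, the modulo hash and the tail pointers are assumptions made for illustration only.

```python
class WordTable:
    """Chained hash table whose nodes are allocated once, up front."""

    def __init__(self, capacity: int, buckets: int):
        # One slot per word occurrence; nothing is allocated during inserts.
        self.word = [0] * capacity
        self.pos = [0] * capacity
        self.next = [-1] * capacity   # index of the next node in the chain
        self.head = [-1] * buckets    # first node of each chain
        self.tail = [-1] * buckets    # last node, so appends keep scan order
        self.buckets = buckets
        self.used = 0

    def add(self, word: int, pos: int) -> None:
        node = self.used              # raises IndexError if capacity is exceeded
        self.word[node], self.pos[node] = word, pos
        self.used += 1
        b = word % self.buckets       # assumed hash function
        if self.head[b] == -1:
            self.head[b] = node
        else:
            self.next[self.tail[b]] = node
        self.tail[b] = node

    def positions(self, word: int):
        out = []
        node = self.head[word % self.buckets]
        while node != -1:
            if self.word[node] == word:
                out.append(self.pos[node])
            node = self.next[node]
        return out
```

Because a sequence of length L contains exactly L-k+1 words, the capacity is known before scanning, which is what makes the single up-front allocation possible.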
![Page 9: GeneIndex: an open source parallel program for enumerating and locating words in a genome](https://reader034.fdocuments.in/reader034/viewer/2022052622/558cb71ed8b42af25c8b456a/html5/thumbnails/9.jpg)
Word frequencies
![Page 10: GeneIndex: an open source parallel program for enumerating and locating words in a genome](https://reader034.fdocuments.in/reader034/viewer/2022052622/558cb71ed8b42af25c8b456a/html5/thumbnails/10.jpg)
Word distribution
![Page 11: GeneIndex: an open source parallel program for enumerating and locating words in a genome](https://reader034.fdocuments.in/reader034/viewer/2022052622/558cb71ed8b42af25c8b456a/html5/thumbnails/11.jpg)
Motivation for parallel implementation
Another question from Biology professor:
How about the human genome?
Fact: Human genome includes about 3 billion DNA bases.
![Page 12: GeneIndex: an open source parallel program for enumerating and locating words in a genome](https://reader034.fdocuments.in/reader034/viewer/2022052622/558cb71ed8b42af25c8b456a/html5/thumbnails/12.jpg)
Parallel implementation: input
Large dataset input:
• Each process reads its own partition from the input file.
• Boundary area between neighboring processes has to be considered.

[Figure: sample DNA sequence text illustrating the partitioned input]
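The boundary issue can be sketched as follows: a word that starts near the end of one partition extends k-1 bases into the next one, so each process has to read that many extra bases. The sketch assumes the input is a plain string of bases (no headers or line breaks) and uses 0-based offsets; `read_range` and `words_in_slice` are hypothetical helper names.

```python
def read_range(length: int, k: int, n: int, rank: int):
    """Return the (start, end) slice of the sequence that process `rank` reads."""
    total = length - k + 1            # number of words in the whole sequence
    first = rank * total // n         # first word start owned by this process
    last = (rank + 1) * total // n    # one past the last owned word start
    if last == first:                 # more processes than words
        return first, first
    return first, last + k - 1        # k-1 extra bases overlap the neighbor


def words_in_slice(seq: str, k: int, start: int, end: int):
    """List (word, start offset) for every word fully inside seq[start:end]."""
    return [(seq[p:p + k], p) for p in range(start, end - k + 1)]


seq = "TAGCCGTGGCGGAGCCTCTTG"
for rank in range(4):
    print(rank, read_range(len(seq), 4, 4, rank))
```

With this choice every word is owned by exactly one process, and concatenating the per-process word lists gives the same result as scanning the whole sequence serially.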
![Page 13: GeneIndex: an open source parallel program for enumerating and locating words in a genome](https://reader034.fdocuments.in/reader034/viewer/2022052622/558cb71ed8b42af25c8b456a/html5/thumbnails/13.jpg)
Parallel implementation: load balancing
Computation and load balancing:
1. Each process deals with its own piece of data
2. All processes perform global sorting
• Straightforward implementation: binary tree merge sorting (X)
• Possible solution but could be problematic (X)
• Ideal solution leading to load balancing
![Page 14: GeneIndex: an open source parallel program for enumerating and locating words in a genome](https://reader034.fdocuments.in/reader034/viewer/2022052622/558cb71ed8b42af25c8b456a/html5/thumbnails/14.jpg)
Parallel implementation: load balancing
Straightforward implementation: binary tree merge sorting

[Figure: binary merge tree. The sorted word lists of P0, P1, P2 and P3 are merged pairwise onto P0 and P2, and then onto P0.]

| Word | P0 | P1 | P2 | P3 |
|---|---|---|---|---|
| AAAA | 5, 9 | 19 | 4, 8 | 35 |
| AAAC | 22 | 12 | 67 | 46 |
| ... | ... | ... | ... | ... |
| TTTG | 101 | 201 | 88 | 40 |
| TTTT | 80 | 26 | 53 | 30 |
![Page 15: GeneIndex: an open source parallel program for enumerating and locating words in a genome](https://reader034.fdocuments.in/reader034/viewer/2022052622/558cb71ed8b42af25c8b456a/html5/thumbnails/15.jpg)
Parallel implementation: load balancing
Possible solution but could be problematic:
• Straightforward solution: partition word range $[0, 4^k)$ equally, so each process has

$$\left[\frac{i \cdot 4^k}{n},\ \frac{(i+1) \cdot 4^k}{n}\right), \quad \text{where } i = 0, 1, \ldots, n-1$$

[Figure: four per-process word lists running from AAAA to TTTT, partitioned by word range]
![Page 16: GeneIndex: an open source parallel program for enumerating and locating words in a genome](https://reader034.fdocuments.in/reader034/viewer/2022052622/558cb71ed8b42af25c8b456a/html5/thumbnails/16.jpg)
Parallel implementation: load balancing
Implementation of the straightforward solution:
• Problem is that some words occur more often than others, leading to different memory requests for different processes

| Words starting with A | Words starting with C | Words starting with G | Words starting with T |
|---|---|---|---|
| AAAA: 5, 9 | CAAA: 19 | GAAA: 4, 8 | TAAA: 35, 93 |
| AAAC: 22, 37 | CAAC: 12, 47 | GAAC: 67, 72 | TAAC: 46 |
| ... | ... | ... | ... |
| ATTG: 101 | CTTG: 201 | GTTG: 88 | TTTG: 40, 87 |
| ATTT: 80 | CTTT: 26 | GTTT: 53 | TTTT: 15, 30 |
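The imbalance caused by equal word ranges is easy to reproduce on a toy input. The sketch below assigns each encoded word to the process whose range contains it and counts words per process; the helper names and the deliberately skewed test sequence are made up for illustration.

```python
from collections import Counter

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(word: str) -> int:
    value = 0
    for base in word:
        value = (value << 2) | CODE[base]
    return value


def owner_by_range(word: int, k: int, n: int) -> int:
    """Process i owns encoded words in [i*4^k/n, (i+1)*4^k/n)."""
    return word * n // 4 ** k


def loads(seq: str, k: int, n: int):
    """Number of words each of the n processes would have to store."""
    counts = Counter(
        owner_by_range(encode(seq[i:i + k]), k, n)
        for i in range(len(seq) - k + 1)
    )
    return [counts[i] for i in range(n)]


# A run of A's makes the word "AA" dominate, so process 0 gets most of the work.
print(loads("AAAAAAAAAACGT", 2, 4))  # [10, 1, 1, 0]
```

The word ranges are equal in size, but the memory each process needs depends on how often its words occur, which the range split ignores.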
![Page 17: GeneIndex: an open source parallel program for enumerating and locating words in a genome](https://reader034.fdocuments.in/reader034/viewer/2022052622/558cb71ed8b42af25c8b456a/html5/thumbnails/17.jpg)
Parallel implementation: load balancing
Ideal solution leading to load balancing:
• Partition the total number of words L-k+1 equally, so each process has (L-k+1)/n words, where L is the length of the given sequence and k is the given word length.

Implementation:
1. After each process has scanned its own piece, we know that:

$$\sum_{x=0}^{4^k-1} f(W_x, P_i) = \frac{L-k+1}{n}, \quad \text{where } i = 0, 1, \ldots, n-1$$

2. We divide the word range $[0, 4^k)$ into many small divisions, with total divisions of d (where d >> n):

$$\sum_{j=0}^{d-1} \ \sum_{x = j \cdot 4^k/d}^{(j+1) \cdot 4^k/d - 1} f(W_x, P_i) = \frac{L-k+1}{n}, \quad \text{where } i = 0, 1, \ldots, n-1$$
![Page 18: GeneIndex: an open source parallel program for enumerating and locating words in a genome](https://reader034.fdocuments.in/reader034/viewer/2022052622/558cb71ed8b42af25c8b456a/html5/thumbnails/18.jpg)
Parallel implementation: load balancing
Implementation:
3. The number of words in each small division is as below:

$$T(i, j) = \sum_{x = j \cdot 4^k/d}^{(j+1) \cdot 4^k/d - 1} f(W_x, P_i), \quad \text{where } i = 0, 1, \ldots, n-1 \text{ and } j = 0, 1, \ldots, d-1$$

4. The total number of words in each small division across all processes will be:

$$T(j) = \sum_{i=0}^{n-1} \ \sum_{x = j \cdot 4^k/d}^{(j+1) \cdot 4^k/d - 1} f(W_x, P_i), \quad \text{where } j = 0, 1, \ldots, d-1$$
![Page 19: GeneIndex: an open source parallel program for enumerating and locating words in a genome](https://reader034.fdocuments.in/reader034/viewer/2022052622/558cb71ed8b42af25c8b456a/html5/thumbnails/19.jpg)
Parallel implementation: load balancing
Implementation:
5. Find an array of boundary B, such that:

$$\sum_{j=B(m)}^{B(m+1)} T(j) \approx \frac{L-k+1}{n}$$

where B(0) = 0, B(0<m<n) = any number in (0, d-1), and B(n) = d-1

6. Process $P_i$ should have all words in $[B(i), B(i+1))$, where i = 0, 1, ..., n-1.
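Steps 2 to 6 can be sketched with plain lists standing in for the processes. The sketch reads f(W_x, P_i) as the number of occurrences of word x found by process i, which the slides do not state explicitly. It also uses half-open division ranges with B(n) = d, a slightly different convention from the slide, and a greedy prefix-sum search, which is one possible way to pick B rather than the method GeneIndex is known to use. In an MPI program the per-division totals T(j) would typically come from a reduction across processes such as `MPI_Allreduce`.

```python
from bisect import bisect_left
from itertools import accumulate


def division_counts(words, k: int, d: int):
    """T(i, j): how many of one process's encoded words fall in division j."""
    counts = [0] * d
    for w in words:
        counts[w * d // 4 ** k] += 1
    return counts


def find_boundaries(per_process_counts, n: int):
    """Return B[0..n]; process m takes divisions j with B[m] <= j < B[m+1]."""
    d = len(per_process_counts[0])
    totals = [sum(col) for col in zip(*per_process_counts)]  # T(j)
    prefix = list(accumulate(totals))       # prefix[j] = T(0) + ... + T(j)
    total = prefix[-1]                      # equals L-k+1
    bounds = [0]
    for m in range(1, n):
        target = m * total / n              # words wanted below boundary m
        j = bisect_left(prefix, target) + 1 # cut right after that division
        bounds.append(min(max(j, bounds[-1]), d))
    bounds.append(d)
    return bounds


# Two processes, d = 8 divisions, redistributed over n = 4 owners.
counts = [[1, 2, 0, 1, 3, 0, 1, 0],
          [1, 0, 1, 2, 1, 0, 1, 2]]
print(find_boundaries(counts, 4))  # [0, 2, 4, 5, 8]
```

In this toy case every owner ends up with 4 of the 16 words. A division is never split, so a single very heavy division still limits the balance, which is one reason for choosing d much larger than n.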
![Page 20: GeneIndex: an open source parallel program for enumerating and locating words in a genome](https://reader034.fdocuments.in/reader034/viewer/2022052622/558cb71ed8b42af25c8b456a/html5/thumbnails/20.jpg)
Parallel implementation: load balancing
[Figure: four per-process word lists running from AAAA to TTTT, partitioned by word range]
![Page 21: GeneIndex: an open source parallel program for enumerating and locating words in a genome](https://reader034.fdocuments.in/reader034/viewer/2022052622/558cb71ed8b42af25c8b456a/html5/thumbnails/21.jpg)
Parallel implementation: output
Output:
• Each process creates its own output file
• If necessary, all files can be concatenated into one single file, while keeping the order

AAAA: 5, 9, 10
AAAC: 22, 37
...
TTTG: 15, 30
TTTT: 40, 87
![Page 22: GeneIndex: an open source parallel program for enumerating and locating words in a genome](https://reader034.fdocuments.in/reader034/viewer/2022052622/558cb71ed8b42af25c8b456a/html5/thumbnails/22.jpg)
Testbed
![Page 23: GeneIndex: an open source parallel program for enumerating and locating words in a genome](https://reader034.fdocuments.in/reader034/viewer/2022052622/558cb71ed8b42af25c8b456a/html5/thumbnails/23.jpg)
Testbed specification
• Consists of 768 IBM JS21 blades
• On each blade:
  • 2 dual-core PowerPC CPUs @ 2.5GHz
  • 8 GB memory
  • SUSE Linux Enterprise Server 9 (ppc)
• Interconnect: Myrinet
• Parallel environment: MPI
![Page 24: GeneIndex: an open source parallel program for enumerating and locating words in a genome](https://reader034.fdocuments.in/reader034/viewer/2022052622/558cb71ed8b42af25c8b456a/html5/thumbnails/24.jpg)
Performance analysis
| Number of nodes | Number of processes | D. melanogaster, k=6 | D. melanogaster, k=25 | H. sapiens, k=6 | H. sapiens, k=25 |
|---|---|---|---|---|---|
| 1 | 1 | 95 | 23203 | | |
| 2 | 2 | 53 | 5770 | | |
| 4 | 4 | 29 | 1500 | | |
| 8 | 8 | 15 | 386 | | |
| 16 | 16 | 9 | 107 | 212 | 93672 |
| 32 | 32 | 6 | 32 | 118 | 24934 |
| 64 | 64 | 5 | 14 | 73 | 5998 |
| 128 | 128 | 9 | 11 | 59 | 1450 |
| 256 | 256 | 11 | 15 | 49 | 558 |
| 512 | 512 | 18 | 22 | 71 | 195 |

Timings of running against two datasets on BigRed using 1 PPN (seconds)
![Page 25: GeneIndex: an open source parallel program for enumerating and locating words in a genome](https://reader034.fdocuments.in/reader034/viewer/2022052622/558cb71ed8b42af25c8b456a/html5/thumbnails/25.jpg)
Performance analysis
| Number of nodes | Number of processes | D. melanogaster, k=6 | D. melanogaster, k=25 | H. sapiens, k=6 | H. sapiens, k=25 |
|---|---|---|---|---|---|
| 1 | 2 | 54 | 5958 | | |
| 2 | 4 | 31 | 1545 | | |
| 4 | 8 | 17 | 400 | | |
| 8 | 16 | 11 | 115 | | |
| 16 | 32 | 8 | 40 | 156 | 25661 |
| 32 | 64 | 6 | 19 | 101 | 6233 |
| 64 | 128 | 7 | 12 | 82 | 1506 |
| 128 | 256 | 25 | 16 | 61 | 576 |
| 256 | 512 | 20 | 27 | 81 | 198 |
| 512 | 1024 | 34 | 37 | 115 | 170 |

Timings of running against two datasets on BigRed using 2 PPN (seconds)
![Page 26: GeneIndex: an open source parallel program for enumerating and locating words in a genome](https://reader034.fdocuments.in/reader034/viewer/2022052622/558cb71ed8b42af25c8b456a/html5/thumbnails/26.jpg)
Performance analysis
| Number of nodes | Number of processes | D. melanogaster, k=6 | D. melanogaster, k=25 | H. sapiens, k=6 | H. sapiens, k=25 |
|---|---|---|---|---|---|
| 1 | 4 | 37 | 1738 | | |
| 2 | 8 | 23 | 453 | | |
| 4 | 16 | 15 | 130 | | |
| 8 | 32 | 10 | 46 | | |
| 16 | 64 | 9 | 23 | 163 | 6649 |
| 32 | 128 | 11 | 17 | 121 | 1632 |
| 64 | 256 | 20 | 25 | 96 | 634 |
| 128 | 512 | 39 | 42 | 120 | 233 |
| 256 | 1024 | 79 | 90 | 180 | 187 |
| 512 | 2048 | 134 | 131 | 270 | 281 |

Timings of running against two datasets on BigRed using 4 PPN (seconds)
![Page 27: GeneIndex: an open source parallel program for enumerating and locating words in a genome](https://reader034.fdocuments.in/reader034/viewer/2022052622/558cb71ed8b42af25c8b456a/html5/thumbnails/27.jpg)
Performance analysis
[Figure: Scalability in terms of node numbers. Speedup versus number of nodes (16 to 512) for 1 ppn, 2 ppn and 4 ppn.]

Scalability of enumerating 6-mers in H. sapiens
![Page 28: GeneIndex: an open source parallel program for enumerating and locating words in a genome](https://reader034.fdocuments.in/reader034/viewer/2022052622/558cb71ed8b42af25c8b456a/html5/thumbnails/28.jpg)
Performance analysis
[Figure: Scalability in terms of node numbers. Speedup versus number of nodes (16 to 512) for 1 ppn, 2 ppn and 4 ppn.]

Scalability of enumerating 25-mers in H. sapiens
![Page 29: GeneIndex: an open source parallel program for enumerating and locating words in a genome](https://reader034.fdocuments.in/reader034/viewer/2022052622/558cb71ed8b42af25c8b456a/html5/thumbnails/29.jpg)
Conclusion
• Addressed questions from the biology professor
• The complicated solution arose from memory restrictions.
• It can handle words of length up to 30.
• It can find frequently repeated, rarely occurring, or even non-occurring words.
• It scales relatively well on large cluster machines.
• We recently developed a Java version for small DNA sequences, which was “our future work”. It can zoom in or zoom out to view distribution and frequencies interactively.
![Page 30: GeneIndex: an open source parallel program for enumerating and locating words in a genome](https://reader034.fdocuments.in/reader034/viewer/2022052622/558cb71ed8b42af25c8b456a/html5/thumbnails/30.jpg)
The End
Thank you