Fast Exact String Pattern-Matching Algorithm for Fixed Length Patterns

Ing. Ľuboš Takáč, PhD student, Faculty of Management Science and Informatics, University of Žilina

description

Description of a designed and implemented fast exact string pattern-matching algorithm for fixed-length patterns, used for fast generation of word search games (wordfind).

Transcript of Fast Exact String Pattern-Matching Algorithm for Fixed Length Patterns

Page 1: Fast Exact String Pattern-Matching Algorithm for Fixed Length Patterns

Fast Exact String Pattern-Matching Algorithm for Fixed Length Patterns

Ing. Ľuboš Takáč

PhD student

Faculty of Management Science and Informatics

University of Žilina

Page 2: Fast Exact String Pattern-Matching Algorithm for Fixed Length Patterns

Presentation overview

• Motivation

• Problem Definition

• Existing Solutions

• Our Implemented Algorithm

• Testing Results

• Conclusion

Page 3: Fast Exact String Pattern-Matching Algorithm for Fixed Length Patterns

Motivation

• Word search game generator

• Searching string patterns with fixed length

– M . . . E R

– . . . A H

– . . . . .

Page 4: Fast Exact String Pattern-Matching Algorithm for Fixed Length Patterns

Problem Definition

• Design a fast in-memory data structure (class)

• Requirements

– fast searching, with O(1) complexity if possible

– each found word is returned only once

– each found word must be randomly chosen

– each found word has to match the pattern

class Model

FastStringPatternSearch

+ FastStringPatternSearch(String[], Random)

+ FastStringPatternSearch(String[])

+ reset() : void

+ searchPattern(String) : String
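For orientation, the methods of the class model can be satisfied by a plain linear scan, which is the kind of O(N) baseline the designed structure is meant to beat. The following is only a minimal sketch under assumptions: the class name NaivePatternSearch and the helper matches are hypothetical, and '.' is assumed to mark an undefined position, written without the spaces used on the slides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical naive baseline: same methods as the class model, O(N) per search.
class NaivePatternSearch {
    private final List<String> words = new ArrayList<String>();
    private final boolean[] used; // one "used" flag per word

    NaivePatternSearch(String[] dictionary, Random random) {
        Collections.addAll(words, dictionary);
        Collections.shuffle(words, random); // shuffled order gives randomly chosen results
        used = new boolean[words.size()];
    }

    NaivePatternSearch(String[] dictionary) {
        this(dictionary, new Random());
    }

    // Assumption: '.' stands for an undefined position.
    static boolean matches(String word, String pattern) {
        if (word.length() != pattern.length()) {
            return false;
        }
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c != '.' && c != word.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    // Returns an unused word matching the pattern, or null if none is left.
    String searchPattern(String pattern) {
        for (int i = 0; i < words.size(); i++) {
            if (!used[i] && matches(words.get(i), pattern)) {
                used[i] = true;
                return words.get(i);
            }
        }
        return null;
    }

    // Makes every word available again.
    void reset() {
        Arrays.fill(used, false);
    }
}
```

Calling searchPattern("M...ER") repeatedly on such an object returns each matching word once and then null, until reset() is called.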

Page 5: Fast Exact String Pattern-Matching Algorithm for Fixed Length Patterns

Existing Solutions

• Relational DB table with a full-text index – requires access to the hard drive

• Linked list or array in memory – O(N) complexity

• Indexing of an array – to reach O(1) complexity, all possible combinations of patterns must be indexed

Number of undefined positions   Example of pattern   All combinations count (binomial coefficient)
0                               PATTERNS             1
1                               PATT-RNS             8
2                               PA-TE-NS             28
3                               -AT--RNS             56
4                               P-T-E--S             70
5                               -A--ER--             56
6                               --TT----             28
7                               ----E---             8
8                               --------             1

Total combinations count: 256
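The total can be checked by summing the binomial coefficients for a word of length 8, where k denotes the number of undefined positions:

```latex
\sum_{k=0}^{8} \binom{8}{k} = 1 + 8 + 28 + 56 + 70 + 56 + 28 + 8 + 1 = 2^{8} = 256
```

By the same identity, a single word of length L would need 2^L index entries, which is why indexing every combination quickly becomes impractical.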

Page 6: Fast Exact String Pattern-Matching Algorithm for Fixed Length Patterns

Our Implemented Algorithm

• Dynamic in-memory tree(s) with linked lists of words (IDs) on the nodes

• Roots are stored in a 3-dimensional matrix

• Each node has a 2-dimensional matrix of children

Page 7: Fast Exact String Pattern-Matching Algorithm for Fixed Length Patterns

Root

• 3-dimensional matrix of root nodes with shuffled linked lists

– alphabet dimension

– word length dimension

– character position dimension

• Example

– We put the word “NAUTICAL” into nodes [N][8][1], [A][8][2], [U][8][3], …, [L][8][8]

– When we search for the pattern “. . U . . . . .”, we look into root node [U][8][3], where we find the word “NAUTICAL” in the linked list
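The root level can be sketched as follows. This is only an illustration under assumptions, not the presented implementation: the class name RootMatrix and the method rootList are hypothetical, the alphabet is assumed to be uppercase A to Z, and positions are stored 0-based internally while the lookup takes 1-based positions as on the slide.

```java
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of the root level only: roots[letter][length][position].
class RootMatrix {
    private static final int ALPHABET = 26; // assumption: uppercase A-Z only
    private final LinkedList<String>[][][] roots;

    @SuppressWarnings("unchecked")
    RootMatrix(String[] dictionary, int maxLength, Random random) {
        roots = new LinkedList[ALPHABET][maxLength + 1][maxLength];
        for (String word : dictionary) {
            int len = word.length();
            if (len > maxLength) {
                continue; // words longer than the matrix are skipped in this sketch
            }
            // Each word is registered once per character position.
            for (int pos = 0; pos < len; pos++) {
                int letter = word.charAt(pos) - 'A';
                if (roots[letter][len][pos] == null) {
                    roots[letter][len][pos] = new LinkedList<String>();
                }
                roots[letter][len][pos].add(word);
            }
        }
        // Shuffle every list once after initialization.
        for (LinkedList<String>[][] byLetter : roots) {
            for (LinkedList<String>[] byLength : byLetter) {
                for (LinkedList<String> list : byLength) {
                    if (list != null) {
                        Collections.shuffle(list, random);
                    }
                }
            }
        }
    }

    // position1 is 1-based, as in the slide example [U][8][3].
    List<String> rootList(char letter, int length, int position1) {
        LinkedList<String> list = roots[letter - 'A'][length][position1 - 1];
        return list == null ? Collections.<String>emptyList() : list;
    }
}
```

With “NAUTICAL” in the dictionary, rootList('U', 8, 3) contains that word, which mirrors the lookup described in the example.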

Page 8: Fast Exact String Pattern-Matching Algorithm for Fixed Length Patterns

Root

Page 9: Fast Exact String Pattern-Matching Algorithm for Fixed Length Patterns

Child nodes

• 2-dimensional matrix of child nodes with shuffled linked lists

– alphabet dimension

– word length dimension – not needed, since it can be determined from the ancestor

– character position dimension

Page 10: Fast Exact String Pattern-Matching Algorithm for Fixed Length Patterns

Searching algorithm

• Searching for the pattern “. . T . E R . .”

1. Get the first defined character, the pattern length and the position of the first defined character (T, 8, 3). Get the node of the three-dimensional array at [character][length][position] ([T][8][3]). Continue to step 2 with this node.

2. If the node is null, no string with this pattern exists. – END.

If the node is not null and either has no children (leaf node) or the pattern has no further defined characters, find the first string in the node's list that matches the pattern. Return the found string, or null if no string matches the pattern. – END.

If the node is not null and has children (non-leaf node), take the next defined character in the pattern (E, position 5) and access the node's two-dimensional array of child nodes at element [position][character] ([5][E]). Go to step 2 with that node.
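The two steps above can be sketched in Java as follows. This is a simplified illustration, not the presented implementation: the class PatternTree and its members are hypothetical names, the alphabet is assumed to be uppercase A to Z, '.' is assumed to mark an undefined position, and positions are 0-based. The rule used here for creating children (split a node once its list exceeds maxListSize) is also an assumption, since the slides do not spell it out. Shuffling and the "used" flags are left out to keep the search itself visible.

```java
import java.util.LinkedList;

// Hypothetical sketch of the tree and the two-step search; names are illustrative.
class PatternTree {
    static final int ALPHABET = 26; // assumption: uppercase A-Z only

    static class Node {
        final LinkedList<String> words = new LinkedList<String>();
        Node[][] children; // [position][letter]; null while the node is a leaf
    }

    private final Node[][][] roots; // [letter][length][position], 0-based positions
    private final int maxLength;
    private final int maxListSize;

    PatternTree(String[] dictionary, int maxLength, int maxListSize) {
        this.maxLength = maxLength;
        this.maxListSize = maxListSize;
        this.roots = new Node[ALPHABET][maxLength + 1][maxLength];
        for (String word : dictionary) {
            int len = word.length();
            if (len > maxLength) continue; // too long for this sketch
            for (int pos = 0; pos < len; pos++) {
                int letter = word.charAt(pos) - 'A';
                if (roots[letter][len][pos] == null) roots[letter][len][pos] = new Node();
                roots[letter][len][pos].words.add(word);
            }
        }
        for (Node[][] byLetter : roots)
            for (Node[] byLength : byLetter)
                for (int pos = 0; pos < byLength.length; pos++)
                    if (byLength[pos] != null) split(byLength[pos], pos);
    }

    // Assumed splitting rule: a node whose list exceeds maxListSize gets children
    // for every later position, each holding the words with that letter there.
    private void split(Node node, int pos) {
        int len = node.words.getFirst().length();
        if (node.words.size() <= maxListSize || pos + 1 >= len) return;
        node.children = new Node[len][ALPHABET];
        for (String word : node.words) {
            for (int p = pos + 1; p < len; p++) {
                int letter = word.charAt(p) - 'A';
                if (node.children[p][letter] == null) node.children[p][letter] = new Node();
                node.children[p][letter].words.add(word);
            }
        }
        for (int p = pos + 1; p < len; p++)
            for (Node child : node.children[p])
                if (child != null) split(child, p);
    }

    private static int nextDefined(String pattern, int from) {
        for (int i = from; i < pattern.length(); i++)
            if (pattern.charAt(i) != '.') return i;
        return -1;
    }

    private static boolean matches(String word, String pattern) {
        for (int i = 0; i < pattern.length(); i++)
            if (pattern.charAt(i) != '.' && pattern.charAt(i) != word.charAt(i)) return false;
        return true;
    }

    // Step 1: pick the root by (character, length, position).
    // Step 2: descend while the node has children and the pattern has more defined characters.
    String search(String pattern) {
        int pos = nextDefined(pattern, 0);
        if (pos < 0 || pattern.length() > maxLength) return null;
        Node node = roots[pattern.charAt(pos) - 'A'][pattern.length()][pos];
        while (node != null) {
            int next = nextDefined(pattern, pos + 1);
            if (node.children == null || next < 0) {
                for (String word : node.words)
                    if (matches(word, pattern)) return word;
                return null;
            }
            node = node.children[next][pattern.charAt(next) - 'A'];
            pos = next;
        }
        return null; // a null node means no string with this pattern exists
    }
}
```

In this sketch a pattern without any defined character returns null, because that case is handled by separate lists in the presented design.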

Page 11: Fast Exact String Pattern-Matching Algorithm for Fixed Length Patterns

Complexity of algorithm

• We can set MaxListSize on leaf nodes, which bounds the search complexity by O(L + MaxListSize), where L is the length of the string

• Low MaxListSize = fast searching, high memory consumption, slow initialization

• High MaxListSize = slow searching, low memory consumption, fast initialization

• Recommendation

– Set it based on the purpose and the dictionary size

– Create the data structure only once and share it

Page 12: Fast Exact String Pattern-Matching Algorithm for Fixed Length Patterns

Other requirements

• Get every word only once

– Keep an array map with a boolean “used” value per word; compare and update it during search

– Function reset sets all values to “not used” – O(N)

• Get randomly chosen words

– All linked lists are shuffled after initialization

– After finding a word, we move it to the end of the linked list – O(1)

• Get words for a pattern without any defined character, e.g. “. . . . . . .”

– Create a special linked list for each word length and put the words from the dictionary there
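These three requirements can be illustrated together on the per-length lists used for patterns without any defined character. The sketch below is an assumption-laden illustration rather than the presented code: the class name AnyWordLists and the method anyWord are hypothetical, and words are referenced by their index in the dictionary array.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.Random;

// Hypothetical sketch: one shuffled list of word ids per word length,
// used for patterns that contain no defined character.
class AnyWordLists {
    private final String[] dictionary;
    private final boolean[] used;                 // "used" flag per word id
    private final LinkedList<Integer>[] byLength; // index = word length

    @SuppressWarnings("unchecked")
    AnyWordLists(String[] dictionary, int maxLength, Random random) {
        this.dictionary = dictionary.clone();
        this.used = new boolean[dictionary.length];
        this.byLength = new LinkedList[maxLength + 1];
        for (int len = 0; len <= maxLength; len++) {
            byLength[len] = new LinkedList<Integer>();
        }
        for (int id = 0; id < dictionary.length; id++) {
            int len = dictionary[id].length();
            if (len <= maxLength) {
                byLength[len].add(id);
            }
        }
        // Shuffle once after initialization so results come in random order.
        for (LinkedList<Integer> list : byLength) {
            Collections.shuffle(list, random);
        }
    }

    // Returns a not yet used word of the given length, or null if none is left.
    String anyWord(int length) {
        if (length < 0 || length >= byLength.length) {
            return null;
        }
        LinkedList<Integer> list = byLength[length];
        Iterator<Integer> it = list.iterator();
        while (it.hasNext()) {
            int id = it.next();
            if (!used[id]) {
                used[id] = true;
                it.remove();      // O(1) unlink through the iterator
                list.addLast(id); // O(1) append, so other words come first next time
                return dictionary[id];
            }
        }
        return null;
    }

    // Sets all flags back to "not used" in O(N).
    void reset() {
        Arrays.fill(used, false);
    }
}
```

Moving a found word to the end of its list keeps unused words near the front, so later searches rarely have to skip over used entries.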

Page 13: Fast Exact String Pattern-Matching Algorithm for Fixed Length Patterns

Testing Results

• Dictionary with 225 thousand words

• Generating 5000 word search games of size 25x25

• More than 1300 times faster than the naive algorithm

For testing we used an HP ProBook 6550b with the following configuration: Win 7 Professional 64-bit, Intel® Core™ i5 CPU M450, 2 cores, 2.40 GHz, 4 GB RAM, Java 7.

MaxListSize             Initializing time (s)   Generating time (s)   Memory consumption (MB)
Unlimited               1.508                   989.643               86
5000                    2.726                   839.294               101
1000                    4.843                   400.539               265
500                     7.062                   324.728               340
100                     16.141                  279.410               808
Naive algorithm O(N)    0.095                   381073.600            15
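The stated speedup can be reproduced from the generating times in the table, reading the decimal commas of the original as decimal points. The first ratio compares the naive algorithm with MaxListSize = 100, the second with an unlimited list size:

```latex
\frac{381073.600\ \text{s}}{279.410\ \text{s}} \approx 1364
\qquad
\frac{381073.600\ \text{s}}{989.643\ \text{s}} \approx 385
```

So the figure of more than 1300 corresponds to the smallest tested MaxListSize, which is also the most memory-hungry configuration.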

Page 14: Fast Exact String Pattern-Matching Algorithm for Fixed Length Patterns

Conclusion

• We designed and implemented a fast in-memory data structure for searching string patterns with fixed length

• Dynamic structure, up to O(1) complexity

• Randomly chosen words matching the pattern, each found only once

• Option to reset the data structure to get all words again without reinitializing it (complexity O(N))

Page 15: Fast Exact String Pattern-Matching Algorithm for Fixed Length Patterns

Thank you for your attention!

[email protected]