Text indexing and search libraries for PHP - Zoë Slattery - Barcelona PHP Conference 2008

Post on 22-Apr-2015


Can you be dynamic and fast?

“Miss Marple and the case of the Missing MIPS”

Zoë Slattery

Agenda

● Index and search applications

● The problem for PHP programmers

● Understanding execution times

● Conclusions

Index and search

● The problem of finding relevant information is not new.
– 3000 years BC [1]
– Vannevar Bush, As We May Think, 1945.

● Today, applications that search the Web must provide instant access to more than 10 billion documents

● Many applications need some form of search, e.g. searching your hard drive, email....

1. Lagoze, C., Singhal, A. Information Discovery: Needles and Haystacks. IEEE Internet Computing. Volume 9(3), 16-18, 2005.

Options for information retrieval

● Search engines
– Nutch, SearchBlox.....

● Information Retrieval libraries
– Three with broadly similar features

           Implementation language   Language bindings               Language ports        License
Egothor    Java                      None                            None                  BSD like
Xapian     C++                       Perl, Python, PHP, Java, TCL    None                  GPL
Lucene     Java                      None                            C++, Perl, PHP, C#    Apache 2

Lucene [2]

[Figure: a typical application built on Lucene. The application gathers data (DB, Web, filesystem), gets the user query and presents search results to the user; Lucene indexes the documents, holds the index and searches the index.]

2. Gospodnetic, O., Hatcher, E. Lucene in Action. Manning Publications Co., Greenwich. 2005.

Lucene indexing

1. Documents
Oh for a muse of fire that would ascend the brightest heaven of invention.....

Analysis

2. Token stream
[fire]   [ascend]  [bright]  [heaven]

Index creation

3. Inverted index
Terms     Documents
fire      Henry V, Scouting for boys...
ascend    Aerospace, Henry V...
...

Optimise

4. Optimised inverted index
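The steps above can be sketched in a few lines of PHP. This is only an illustration of the idea of an inverted index, not how Lucene or Zend Search Lucene stores its data. The function names tokenize() and build_inverted_index() are made up for this sketch, the 'Aerospace' text is invented sample data, and the tokeniser does no stemming or stop-word removal.

```php
<?php
declare(strict_types=1);

/**
 * Split text into lower-case word tokens (a stand-in for a real analyser).
 *
 * @return string[]
 */
function tokenize(string $text): array
{
    preg_match_all('/[a-z]+/', strtolower($text), $matches);
    return $matches[0];
}

/**
 * Build a map of term => list of document names containing that term.
 * Document names appear in the order the documents were supplied.
 *
 * @param array<string, string> $documents document name => text
 * @return array<string, string[]>
 */
function build_inverted_index(array $documents): array
{
    $index = [];
    foreach ($documents as $name => $text) {
        // array_unique() so a document is listed once per term
        foreach (array_unique(tokenize($text)) as $term) {
            $index[$term][] = $name;
        }
    }
    ksort($index); // keep the terms in alphabetical order
    return $index;
}

// Sample data: the second document is invented for illustration.
$documents = [
    'Henry V'   => 'Oh for a muse of fire that would ascend the brightest heaven of invention',
    'Aerospace' => 'Rockets ascend on a column of fire',
];

$index = build_inverted_index($documents);
print_r($index['fire']);
print_r($index['heaven']);
```

Looking up a term is then a single array access, which is why search is cheap once the indexing work has been paid for.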


Indexing speed

              Time to index/seconds   Time to optimise/seconds   Total time
Java + JIT    4                       0.3                        4.3
Java          32                      3                          35
PHP           167                     43                         210

Benchmark:
● 17.4 MB, 814 files of PHP source code
● Linux/Thinkpad T60

Ouch! nearly 50 times as fast in Java

Why is the performance so bad?

First make sure we are comparing the same thing:

➢ Analyser
  ➢ Java Lucene has many analysers

➢ Limits on terms
  ➢ Java stops looking at 10,000 terms

➢ Scoring
  ➢ Java rounds down, PHP rounds to closest

➢ Compare indexes using Luke

Analysis - Java

Analyzing "A Quick Brown Fox jumped over the Lazy Dog"
StandardAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
SimpleAnalyzer: [a] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
StopAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dog]

Analyzing "XY&Z Corporation - xyz@example.com"
StandardAnalyzer: [xy&z] [corporation] [xyz@example.com]
SimpleAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]
StopAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]

Analysis - PHP

Analysing "A Quick Brown Fox jumped over the Lazy Dog"
Default (lower case) filter: [a] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
Stop words filter: [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
Short words filter: [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]

Analysing "XY&Z Corporation - xyz@example.com"
Default (lower case) filter: [xy] [z] [corporation] [xyz] [example] [com]
Stop words filter: [xy] [z] [corporation] [xyz] [example] [com]
Short words filter: [xy] [corporation] [xyz] [example] [com]
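The three PHP filters above can be imitated with a short sketch. It does not use the Zend Search Lucene classes; the function names are made up, the stop-word list ['a', 'the'] and the minimum length of 2 are assumptions chosen because they reproduce the token lists shown above.

```php
<?php
declare(strict_types=1);

/** Tokenise on runs of letters and lower-case each token. */
function lower_case_tokens(string $text): array
{
    preg_match_all('/[a-zA-Z]+/', $text, $matches);
    return array_map('strtolower', $matches[0]);
}

/** Drop tokens that appear in the stop-word list. */
function stop_words_filter(array $tokens, array $stopWords): array
{
    return array_values(array_diff($tokens, $stopWords));
}

/** Drop tokens shorter than $minLength characters (assumed default: 2). */
function short_words_filter(array $tokens, int $minLength = 2): array
{
    return array_values(array_filter(
        $tokens,
        static function (string $token) use ($minLength): bool {
            return strlen($token) >= $minLength;
        }
    ));
}

$tokens = lower_case_tokens('XY&Z Corporation - xyz@example.com');
echo implode(' ', $tokens), "\n";                                   // xy z corporation xyz example com
echo implode(' ', stop_words_filter($tokens, ['a', 'the'])), "\n";  // xy z corporation xyz example com
echo implode(' ', short_words_filter($tokens)), "\n";               // xy corporation xyz example com
```

Indexes built with different filters contain different terms, so timings are only comparable once both implementations use equivalent analysis.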

Compare indexes

[Screenshots: the Java index and the PHP index]

Same 663 terms

Agenda

● Index and search applications

● The problem for PHP programmers

● Understanding execution times
– Part one
– Part two

● Conclusions

Execution profiles

● Now that we are definitely comparing the same thing, look at execution profiles for Java and PHP implementations

● Profiling tools (all open source)

– Java
  ● Eclipse TPTP

– PHP
  ● Xdebug
  ● KCachegrind

– System
  ● Sysprof
  ● vmstat, iostat

Java profile

Small problems with TPTP...

                  Time to index/seconds   Time to optimise/seconds   % time in indexing
Java              2.3                     0.3                        88
Java + profile    687258                  673851                     50

● Invasive and slow. Takes 600,000 times as long to execute
● Some problems getting it to run on Ubuntu (missing C++ libraries, ksh-specific scripts)
● Output file is machine readable only

But – it's free, open source and it works well enough.

Benchmark data:
● 39 files of PHP source code (php/Zend), 1.2 MB

PHP profile

No problems with this tool

                 Time to index/seconds   Time to optimise/seconds   % time in indexing
PHP              5                       3                          63
PHP + profile    70                      55                         56

● Not so invasive as the Java tool but still adds to time and distorts slightly
● Results easy to display with KCachegrind
● Output file is readable

Benchmark data:
● 39 files of PHP source code (php/Zend), 1.2 MB

The normalize() function

Sum( ) = 2.92;  

18.99 – 2.92 = 16.07 

Micro benchmark

<?php
        require_once "Token.php";
        require_once "LowerCase.php";

        $token = new Token("GO", 105, 107);
        $filter = new LowerCase();

        // Call normalize() ten million times so that its cost dominates the run
        for ($i = 0; $i < 10000000; $i++) {
                $norm_token = $filter->normalize($token);
        }
?>

normalize() opcodes

compiled vars:  !0 = $srcToken, !1 = $newToken
line     #  op                       ext  return  operands
-----------------------------------------------------------------------
11       0  RECV                                  1
13       1  ZEND_FETCH_CLASS              :0      'Token'
         2  NEW                           $1      :0
         3  ZEND_INIT_METHOD_CALL                 !0, 'getTermText'
         4  DO_FCALL_BY_NAME         0
         5  SEND_VAR_NO_REF                       $3
         6  DO_FCALL                 1            'strtolower'
         7  SEND_VAR_NO_REF                       $4
14       8  ZEND_INIT_METHOD_CALL                 !0, 'getStartOffset'
         9  DO_FCALL_BY_NAME         0
        10  SEND_VAR_NO_REF                       $6
15      11  ZEND_INIT_METHOD_CALL                 !0, 'getEndOffset'
        12  DO_FCALL_BY_NAME         0
        13  SEND_VAR_NO_REF                       $8
        14  DO_FCALL_BY_NAME         3
        15  ASSIGN                                !1, $1
16      ......

System profile

1. Convert to lower case
2. Look up opcodes

How Xdebug works

Script execution:

ZEND_INIT_METHOD_CALL
● Convert function name to lower case
● Look up function in function table

DO_FCALL_BY_NAME
● Call out to profiler – start time
● Execute function
● Call out to profiler – end time

The normalize() function

Sum( ) = 2.92;  

18.99 – 2.92 = 16.07 

Is consumed in setting up functions to be run

Why is function calling faster in Java?

● Java is a static language. VM structures are known at start up – can't add code on the fly, types are known at compile time.

● First time a function is called Java caches a reference to it in a virtual dispatch table. After that function calls are fast.

● In PHP, code can be added during execution (for example, with create_function()), and types are not known until the code is executed. This makes keeping virtual dispatch tables much more difficult.
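A small sketch of what "code can be added during execution" means in practice. It uses a nested function declaration rather than create_function(), which was removed in PHP 8; the names install_greeter() and greet() are invented for the example.

```php
<?php
declare(strict_types=1);

function install_greeter(): void
{
    // This nested declaration only takes effect when install_greeter() runs.
    function greet(string $name): string
    {
        return "Hello, $name";
    }
}

$existsBefore = function_exists('greet'); // greet() is not in the function table yet
install_greeter();                        // adds greet() while the script is running
$existsAfter = function_exists('greet');

// The callee is named by a string at run time, so the engine has to
// look it up in the function table rather than bind it ahead of time.
$functionName = 'greet';
$message = $functionName('Miss Marple');

var_dump($existsBefore, $existsAfter, $message);
```

Because the set of callable functions can change between one call and the next, the engine cannot simply resolve every call site once at start up the way a static language can.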


PHP profile

look at the call to normalize()

$token = $this->normalize(new Zend_Search_Lucene_Analysis_Token($str, $pos, $endpos));

public function normalize(Token $srcToken)
{
    $newToken = new Token(strtolower($srcToken->getTermText()),
                          $srcToken->getStartOffset(),
                          $srcToken->getEndOffset());

    $newToken->setPositionIncrement($srcToken->getPositionIncrement());

    return $newToken;
}

normalize() recoded....

look at the call to normalize()

$token = $this->normalize(new Zend_Search_Lucene_Analysis_Token($str, $pos, $endpos));

public function normalize(Token $srcToken)
{
    $srcToken->setTermText(strtolower($srcToken->getTermText()));
    return $srcToken;
}
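The two versions of normalize() can be compared side by side with a self-contained sketch. The Token class here is a hypothetical minimal stand-in, not the Zend Search Lucene class, and the function names normalize_copy(), normalize_in_place() and time_normalize() are invented for the example. It assumes PHP 7.4 or later (typed properties). Timings will vary by machine and PHP version, so none are quoted.

```php
<?php
declare(strict_types=1);

/** Hypothetical minimal token, standing in for the real analysis token class. */
final class Token
{
    private string $text;
    private int $start;
    private int $end;
    private int $positionIncrement = 1;

    public function __construct(string $text, int $start, int $end)
    {
        $this->text = $text;
        $this->start = $start;
        $this->end = $end;
    }

    public function getTermText(): string { return $this->text; }
    public function setTermText(string $text): void { $this->text = $text; }
    public function getStartOffset(): int { return $this->start; }
    public function getEndOffset(): int { return $this->end; }
    public function getPositionIncrement(): int { return $this->positionIncrement; }
    public function setPositionIncrement(int $n): void { $this->positionIncrement = $n; }
}

/** Original style: build a second token via several accessor calls and a constructor. */
function normalize_copy(Token $src): Token
{
    $new = new Token(strtolower($src->getTermText()),
                     $src->getStartOffset(),
                     $src->getEndOffset());
    $new->setPositionIncrement($src->getPositionIncrement());
    return $new;
}

/** Recoded style: lower-case the existing token in place and return it. */
function normalize_in_place(Token $src): Token
{
    $src->setTermText(strtolower($src->getTermText()));
    return $src;
}

/** Time $iterations calls, passing a fresh token each time as the calling code does. */
function time_normalize(callable $normalize, int $iterations): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $normalize(new Token('GO', 105, 107));
    }
    return microtime(true) - $start;
}

printf("copy:     %.3f s\n", time_normalize('normalize_copy', 200000));
printf("in place: %.3f s\n", time_normalize('normalize_in_place', 200000));
```

The in-place version changes the token that was passed in. That is safe at the call site shown above because the token is a temporary created inside the call, but it would be a behaviour change for a caller that kept using the original token afterwards.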

After fix

Performance improvement?

              Time to index/seconds   Time to optimise/seconds   Total time
Java + JIT    4                       0.3                        4.3
Java          32                      3                          35
PHP           167                     43                         210
PHP + fix     151                     43                         194

9.5 % improvement in time to index (167 s down to 151 s)


Conclusions

● Two reasons why the PHP implementation of Lucene is slow:
– Function calling overhead in PHP
– Inefficient code in the analyser [3]
– These are the main two, there are others....

● Dynamic and fast?
– Hard to get to the same execution speed as Java, but possible to get closer.
– But development speed is much better [4]: what speed do you care about?
– Better not to use Java coding style (lots of methods that do nothing)

● So which implementation of Lucene should I use?
– it depends.....

3. http://framework.zend.com/issues/browse/ZF-3683
4. Prechelt, L. An empirical comparison of seven programming languages. Computer. Volume 33(10), 23-29, 2000.

Options for PHP 

[Flowchart: a decision tree of yes/no questions leading to one of four options.]

Questions:
● Do you care about speed?
● Only need basic features?
● Can support Java environment?
● Use a Web Service?

Options:
● Use Zend Search Lucene
● Use Lucene via a Java bridge
● Use SOLR as web service
● No Lucene solution today [5]

5. http://pecl.php.net/package/clucene

Acknowledgements

● Rob Young's presentation [6] to the London PHP user group.

● Members of the PHP internals community, in particular Scott MacVicar, Derick Rethans and Dmitry Stogov.

6. http://www.phplondon.org/wiki/Search_tools_in_PHP_(Rob_Young)

Other useful links

● http://www.egothor.org/
● http://xapian.org/
● http://lucene.apache.org/
● http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html
● http://www.derickrethans.nl/vld.php
● http://lucene.apache.org/nutch/
● http://www.searchblox.com/
● http://www.xdebug.org/
● http://www.eclipse.org/tptp/
● http://www.getopt.org/luke/
● http://www.projectzero.org
● http://www.ibm.com/developerworks/ (Publication due 24/09/08)
● http://php-java-bridge.sourceforge.net/doc/
● http://www.zend.com/en/products/platform/product-comparison/java-bridge
● http://lucene.apache.org/solr/
● http://www.ibm.com/developerworks/websphere/library/techarticles/0809_phillips/0809_phillips.html