Benchmarking Oracle 8i Intermedia Text Background for this benchmark Interesting new features in...
-
Upload
itzel-cleverly -
Category
Documents
-
view
216 -
download
0
Transcript of Benchmarking Oracle 8i Intermedia Text Background for this benchmark Interesting new features in...
Benchmarking Oracle 8Benchmarking Oracle 8ii Intermedia TextIntermedia Text
• Background for this benchmark
• Interesting new features in OIMT
• Benchmarking, methodology and problems
• Results
• Conclusions
ODF 000419 Benchmarking Oracle8i Intermedia Text
Background for this benchmarkBackground for this benchmark
The task of the thesis projectThe EDMS Search Engine
CERN’s EDMS
CADIM/EDB, managing product data documents
MP5, managing physical components
Oracle 7, database platform
Implement Oracle8i in the future?
ODF 000419 Benchmarking Oracle8i Intermedia Text
Oracle 8Oracle 8i i IntermediaIntermedia
A product embedded in Oracle8i
Intermedia allows a unified technique for accessing various types of data such as:
Text Documents Images Audio Video
ODF 000419 Benchmarking Oracle8i Intermedia Text
Features in Intermedia TextFeatures in Intermedia Text
ODF 000419 Benchmarking Oracle8i Intermedia Text
Database integration• Mixed queries against multiple text columns
• Single SQL API
• Indexes on most database columns
• All index data in the database
• Automatic triggers on textcolumns to detect changes
Full text search• Exact Word and phrase search, operators, removing stopwords
• XML and HTML document section searching
Features in Intermedia TextFeatures in Intermedia Text
ODF 000419 Benchmarking Oracle8i Intermedia Text
About Search (All languages)• Parses any text search to perform an optimal search
• Complements full-text search
Theme Identification (English only)• Identifies strong themes in documents using a ”Theme base”
• ”Thematic” information is put into the index
• Themesearch is available via the about operator
• May be customized for specific terminology
Features in Intermedia TextFeatures in Intermedia Text
ODF 000419 Benchmarking Oracle8i Intermedia Text
Document services• View documents as plain text or HTML format
• Store documents in a database, file system or at an URL adress
Multilingual text search• Full-text search in most languages including Japanese, Chinese and Korean
• Support for all Oracle-NLS character sets
• Stemming and Fuzzy search for Dutch, English, French, German and Spanish
• Base-letter support and alternate spelling for Western European languages
Features in Intermedia TextFeatures in Intermedia Text
To be investigated:
• The text indexing technique
• The new query operators
• How to use this in the EDMS Search Engine
ODF 000419 Benchmarking Oracle8i Intermedia Text
OIMT TextindexesOIMT TextindexesAn OIMT textindex can be created on: • varchar2 columns
• Large object (LOB) columns
• B-file columns, indexing entire files
Allowing queries like:Select id from table where contains(column,’Atlas’)>0;
1 One of the detectors in the LHC ring is ATLAS
2 The Atlasmountains are situated in the north of Africa
-> 1
… contains(column,’Atlas inner detector’)>0;
1 The problems with the ATLAS inner detector...
2 The inner detector of ATLAS...
-> 1
Query optimization
Selects the best executing plan based on analyze table…compute statistics;
ODF 000419 Benchmarking Oracle8i Intermedia Text
OIMT TextindexesOIMT TextindexesAn OIMT textindex is an ”inverted” index
1 the LHC accelerator
2 the LEP accelerator
-> accelerator: row #1 position 2, row #2 position 2
LHC: row #1 position 1
LEP: row #2 position 1
A textindex is built up by five objects
Four tables: I,K,N,R and one b-tree index: X
Create OIMT textindex statement
create index myindex on table(column)
indextype is ctxsys.context;
Note that table must have a primary key
Updating, two choices
Automaticly by starting ctxsrv and commit
Rebuild ”manually”
ODF 000419 Benchmarking Oracle8i Intermedia Text
OIMT TextindexesOIMT Textindexes
Datastore
Filter
Sectioner
Lexer
The indexing ”pipeline”
• Loops over the rows and reads data out of the column
• Or from remote servers, accessed via http or ftp via pointers
• Transformns the data into text representation
• Output can be in HTML or XML
• Takes the output from the filter and converts it to plain text
• Different sectioner for different formats
• Detects important section tags
• Separates text into tokens and words
• Remove stopwords
ODF 000419 Benchmarking Oracle8i Intermedia Text
OIMT TextindexesOIMT TextindexesODF 000419 Benchmarking Oracle8i Intermedia Text
The preference system allows customization of textindexes
Classes of ”customizable” objects:
DATASTORE, FILTER, SECTION_GROUP, LEXER, WORDLIST, STOPLIST, STORAGE
Create a preference to customize an OIMT textindex:execute ctx_ddl.create_preference(’my_pref’, ’BASIC_LEXER’);
execute ctx_ddl.set_attribute(’my_pref’, ’INDEX_THEMES’, ’YES’);
create index my_index on table(column) indextype is
ctxsys.context parameters(’LEXER my_pref’);
Default preferences
Unset preferences get their value from the default system
OIMT Query operatorsOIMT Query operators
Scoring operators: WEIGHT(*), THRESHOLD(>), ACCUM(,), MINUS(-)Examples: …contains(column, ’(edms*2) AND cms’)>10;
…contains(column, ’edms, cms’)>0; edms+cms scores higher than each word alone
Word expansion operators: WILDCARD(%), FUZZY(?), STEM($), SOUNDEX(!), EQUIV(=)Examples: …contains(column, ’?mignets’)>0; Fuzzy correct missspellings
1 The magnets of the LHC accelerator…
-> 1
…contains(column, ’$go’)>0; Stem considers e.g. go, went, gone as the ”same” word
1 I will go to the cinema. as well as plurals, magnet=magnets
2 I went back home.
3 The train has gone.
-> 1,2,3
…contains(column, ’!dog’)>0; Soundex retrieves all word which sounds alike
1 I have a dog.
2 Someone has dug a hole.
-> 1,2
ODF 000419 Benchmarking Oracle8i Intermedia Text
Proximity operator: NEAR(;)Examples: …contains(column, NEAR((edms,cms),4, TRUE))>0;
1 EDMS is available for CMS, ATLAS, LHCb...
2 The EDMS at CERN manages all product data documents of CMS
-> 1
Section limiting and theme operators: WITHIN, ABOUTabout is used as a ”general” operator, optimizing the query by including stem($) and theme search if available
Thesaurus operator: SYNExamples: SYN(dog) == {boxer} | {rotweiler} | {terrier}
Boolean operators: AND, OR, NOT
OIMT Query operatorsOIMT Query operatorsODF 000419 Benchmarking Oracle8i Intermedia Text
OIMT BenchmarksOIMT Benchmarks
• Is SELECT…CONTAINS(column,’word’)>0; faster than SELECT…LIKE(’%word%’); ?
• How do indexes on entire files perform?
• What is the prize in terms of memory storage, maintenance?
• Still fast retrieval?
• Is updating flexible?
• Do the query operators perform as expected?
ODF 000419 Benchmarking Oracle8i Intermedia Text
Searchword "ALICE", matching 5% of the documents
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
6009 12018 24036 48072
Tablesize [rows]
Tim
e [
s] contains
like
Comparing CONTAINS and LIKE, retrieval times
Using OIMT’s CONTAINS
Fulltablescan using LIKE, as in EDMS Search today
Y-axis: Retrieval time [s]
X-axis: Tablesize [# of rows]
ODF 000419 Benchmarking Oracle8i Intermedia Text
Searchword "POLYIMIDE", matching 0.01% of the documents
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
6009 12018 24036 48072
Tablesize [rows]
Tim
e [
s] contains
like
Comparing CONTAINS and LIKE, retrieval times
Using OIMT’s CONTAINS
Fulltablescan using LIKE, as in EDMS Search today
Y-axis: Retrieval time [s]
X-axis: Tablesize [# of rows]
ODF 000419 Benchmarking Oracle8i Intermedia Text
Indexing entire filesTests with smaller amounts of files (<1000) works fine
Encountered heavy problems when indexing up to 9000 files:
• Security bug
A user gets the database owner’s OS privileges when accessing files through FILE DATASTORE.
Special roles will be introduced in the 8.1.7 version to manage this problem.
• Not indexing certain Excel files, bug Said to be fixed in the 8.1.7 version
• Storage problems, tablespace, temp, shared pool, lobsegmentsA OIMT textindex is complex, several storage parameters have to be extended, which was non-trivial.
• Unrelevant errormessagesSomewtimes when problems occur, unrelevant errormessages or no errormessages at all are returned,
making troubleshooting difficult.
ODF 000419 Benchmarking Oracle8i Intermedia Text
Indexing entire files
Statistics about the 9000 file index:
Total indexsize: 1.0 GB Number of files: 8942Accumulated filesize: 4.4 GB Tot indexsize/acc filesize: 23.6 %Average indexsize per file: 115.8 KB/fileAverage filesize: 490.1 KB/fileCreation time: 7:06 hours Creation time per file: 2.9 seconds/file
Updating: Works fine, 2 seconds/inserted rowQuering: 1-3 seconds/query (depends on the query!)
Machine: SUN 4CPU 300MHz 1.5 GB RAM
ODF 000419 Benchmarking Oracle8i Intermedia Text
A view of the testtable
ODF 000419 Benchmarking Oracle8i Intermedia Text
Index created on this file reference column
ConclusionsODF 000419 Benchmarking Oracle8i Intermedia Text
• Indexing entire files works, but is not trivial to do in version 8.1.6, to be improved in 8.1.7?
• Quering is fast, both for indexed varchar2 columns and filecolumns
• Updating textindexes can be done automatic
• Index preferences may be customized
• Multilingual
Textindexes
Query Operators
• Works mainly as expected, providing powerful tools for an advanced Search Engine
• Wildcards a very slow when executed on fileindexed columns
O8i IMT
• A very interesting ”platform” for the future EDMS Search Engine
• Will provide tools for fast full-text searches, with both simple and advanced queries
Application interfaces are non-trivial to create
Interface important to make userfriendly
Many hits may be retrieved from full-text searches, limiting these may be crucial
Some ideas:
Ask the users for hints
Keep the interface as simple as possible, avoid too many graphical objects etc
A simple and an advanced search option
Menus for choosing query operators, scoring etc
The EDMS Search Engine Interface
ODF 000419 Benchmarking Oracle8i Intermedia Text
ODF 000419 Benchmarking Oracle8i Intermedia Text
An OIMT Test Search Engine:
9000 indexed files from the EDMS production database
http://oraweb01.cern.ch:9000/anders/owa/edms_extended_search.enter_search