KU Leuven combines/connects SAP HANA, fuzzy search, gateway services and SAP UI5 to build apps for students and staff
Nico Croes - Head of Development Coaching team
Kris Claes - Sr. Software Architect
SAPience.be TECHday ‘15
Agenda
• Intro KU Leuven
• The project
• The flow
UI5
GW-Hub
GW-Client
ODATA model
Handler Fuzzy Frame
ADBC Frame
ADBC classes
HANA
KU Leuven & Association
> 102.000 students
> 22.000 staff members
[Pie chart: academic programmes, professional programmes, arts]
SAP @ KU Leuven
SRM (e-catalog)
CRM
Solution Manager
PI
BO BW
ca. 800 GB data
RRB PAY-TIME
PA-PD
e-recruiting
HR-FPM
FM
CO
AA
SD
MM
FSCM
FICA IS: SLcM
PM
PS IM
RE-FX
FI
ERP
ca. 2 TB data
+ A lot of custom code (workflow, web applications, interfaces, …)
Since 1999…
Central IT Department
DIRECTORATE ICTS
Customer & Service Centre
• Customer & service managers
• IT Vendor & Purchasing mgmt
• ICTS Helpdesk, communication & training
Facilities for Education, Research, Communication and Collaboration
• Inter- and Intranet
• Facilities for Education
• Facilities for Research
• Communication & Collaboration
• Competence Centre Information Security
Administrative Applications for General Management
• Finance
• Logistics
• Human Resources
• SAP Basis & ICTS
Administrative Applications for University management
• Students
• Education
• Individual Study programmes & Exams
• Research
• CRM
Local Network & Support
• Local Infrastructure System administration
• Local Infrastructure support
• PC Classrooms support
Central IT Infrastructure
• System administration AIX
• System administration UNIX
• System administration Windows
• Data Centre Network
Competence Centre Management Information
ICTS Administrative Office
Total: 215 FTE
The project
Original requirement
• Add some additional fields to the existing BSP application
Objections
• Conflict with our internal guidelines concerning modifications of 'old' BSP programs
• The existing application is slow (a search can take up to 20 s)
• The existing application is not MVC (which makes modifications tricky)
New proposal
• Refactor the 'old' BSP application into a new UI5 app
• Use HANA to speed up the search and data collection
• Offer one application that runs on all devices (responsive)
• Separate the UI (frontend developer) from data processing (backend developer)
Old BSP application
Search for (bio)chemical substances by name, formula, or CAS number (in SAP EH&S) → detailed sheet with information on risks, regulations, safety measures → print labels for recipients
UI5 application design
Use standard UI5 UI elements & concepts
• 'Master detail' design pattern
• View drill down to details
https://sapui5.netweaver.ondemand.com/explored.html#/entity/sap.m.Table/samples
GW-Hub
• Endpoint for the GW service
• Location of the UI5 app
• The GW REST service can be used by other (non-SAP) applications
https://admin.kuleuven.be/icts/services/dataservices
OData model
• Logical data model (not the DB model)
• Link between frontend and backend (developers)
Database Model - Search
TCGTPLREL
ESTRH ESTRI
TCG22 TCG24 TCG11
ESTVA ESTVH
TCG53 TCG12
AUSP
ESTPH
ESTPP
ESTPJ
ESTPS
ESTPT
CABN
CABNT
KSML
KLAH
ZLT_BIG_LBL
ZLT_BIG_LBL_UNIT
ZLT_BIG_LBL_PHR
T006
GW implementation SEGW
Data entity types
Service implementation
• Different service implementations are 'grouped' into a handler/model class
Improving Existing Search Help
• Performance
• Only 'exact' search
• Typos: Ethyleen, Etyleen, Ethylene, Etilien, …
What data do we have?
EH&S – hazardous substances – database
• Approximately 750,000 records
• Text fields with product names, synonyms, formulas
• Some typos
Database Model - Fetch
TCGTPLREL
ESTRH ESTRI
TCG22 TCG24 TCG11
ESTVA ESTVH
TCG53 TCG12
AUSP
ESTPH
ESTPP
ESTPJ
ESTPS
ESTPT
CABN
CABNT
KSML
KLAH
ZLT_BIG_LBL
ZLT_BIG_LBL_UNIT
ZLT_BIG_LBL_PHR
T006
HANA – Fuzzy Search
Fuzzy Search is a fast and fault-tolerant search feature for SAP HANA. The term "fault-tolerant search" means that a database query returns records even if the search term (the user input) contains additional or missing characters, or other types of spelling error.
Fuzzy search can be used in various applications, for example:
• Fault-tolerant search in text columns (HTML or PDF, for example): search for documents on 'Driethanolamyn' and find all documents that contain the term 'Triethanolamine'.
• Fault-tolerant search in structured database content: search for a product called 'coffe krisp biscuit' and find 'Toffee Crisp Biscuits'.
• Fault-tolerant check for duplicate records: before creating a new customer record in a CRM system, search for similar customer records and verify that no duplicates are already stored in the system. When creating a new record called 'SAB Aktiengesellschaft & Co KG Deutschl.' in 'Wahldorf', for example, the system would bring up 'SAP Deutschland AG & Co. KG' in 'Walldorf' as a possible duplicate.
You can call the fuzzy search by using the CONTAINS predicate with the FUZZY option in the WHERE clause of a SELECT statement.
Search queries with CONTAINS
... where contains( ident, 'ethyleen' , FUZZY(0.5) )
... where contains( ident, 'ethyleen' , LINGUISTIC )
... where contains( ident, 'ethyleen' , EXACT )
"A linguistic search finds all words that have the same word stem as the search term. It also finds all words for which the search term is the word stem. In the SELECT statement of the full-text search query, you can specify the LINGUISTIC search type. When you execute a linguistic search, the system has to determine the stems of the searched terms. It will look up the stems in the stem dictionary. The hits in the stem dictionary point to all words in the word dictionary that have this stem."
Basic Fuzzy
select score() as score, * from z_chemical_nofuzz where contains ( product , 'ethyleen' , fuzzy( 0.7 ) ) order by score desc
select * from z_chemical_nofuzz where product like '%ethyleen%' order by product ;
No typos
No longer texts
Fuzzy Search on String Columns
Not all options are available on string columns!
select score() as score, * from z_chemical_nofuzz where contains ( product , 'ethyleen' , fuzzy( 0.7 , 'textsearch=compare' ) ) order by score desc
Could not execute 'select score() as score, * from z_chemical_nofuzz where contains ( product , 'ethyleen' , fuzzy( ...' in 4 ms 709 µs . SAP DBTech JDBC: [2048]: column store error: search table error: [2018] Option 'textSearch' not allowed for column 'PRODUCT'
You can make this available by building a Fuzzy Search Index on the required columns
• String types: basic fuzzy
• Text types: sophisticated fuzzy
• Date types: sophisticated fuzzy
Fuzzy Search Index
• Creation by SQL
• Creation in the HANA Studio: offers fewer possibilities
CREATE FULLTEXT INDEX <index_name> ON <tableref> '(' <column_name> ')' [<fulltext_parameter_list>]
Specify any of the following additional parameters for the full-text index:
  LANGUAGE COLUMN <column_name>
  LANGUAGE DETECTION '(' <string_literal_list> ')'
  MIME TYPE COLUMN <column_name>
  FUZZY SEARCH INDEX <on_off>
  PHRASE INDEX RATIO <index_ratio>
  CONFIGURATION <string_literal>
  SEARCH ONLY <on_off>
  FAST PREPROCESS <on_off>
  TEXT MINING <on_off>
  TEXT MINING CONFIGURATION <string_literal>
  TEXT ANALYSIS <on_off>
  MIME TYPE <specified mime type, e.g. application/pdf>
  TOKEN SEPARATORS <\/;,.:-_()[]<>!?*@+{}="&>
CREATE FULLTEXT INDEX "INDENT" ON "UX000657"."Z_ESTRI" ("IDENT")
  SYNC
  PHRASE INDEX RATIO 0.200000
  FUZZY SEARCH INDEX OFF
  SEARCH ONLY ON
  FAST PREPROCESS ON
  TEXT MINING OFF
  TEXT ANALYSIS OFF
  TOKEN SEPARATORS '\/;,.:-_()[]<>!?*@+{}="&#$~|'
Basic Fuzzy - again
select score() as score, * from z_chemical where contains ( product , 'ethyleen' , fuzzy( 0.7 ) ) order by score desc
select * from z_chemical where product like '%ethyleen%' order by product ;
Yes! longer texts
Still no typos
How does it work?
SELECT * FROM <tableref> WHERE CONTAINS ( (<col1>, <col2>, <col3> ) ,<search_string> )
• Tokenize using the token separators.
• Similarity: defined by the number of common characters, wrong characters, and additional characters in the search string and the reference string.
• Standardization: translation to lower-case characters without diacritics.
  • Possible to get a match when comparing 2 unequal terms ( "Café" = "café" ).
  • High fuzzy scores for common differences in the spelling of words.
• Influenced by some additional parameters
  • excessTokenWeight
  • andThreshold
  • …
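The tokenize and standardize steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not HANA's implementation: the function names are hypothetical, and the separator set is assumed to be the one from the TOKEN SEPARATORS clause of the index example.

```python
import re
import unicodedata

# Assumption: same separator characters as in the TOKEN SEPARATORS example.
TOKEN_SEPARATORS = '\\/;,.:-_()[]<>!?*@+{}="&#$~|'


def standardize(text: str) -> str:
    """Lower-case the text and strip diacritics (for example 'Café' -> 'cafe')."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text: str) -> list:
    """Standardize, then split on whitespace and on the assumed separators."""
    pattern = "[" + re.escape(TOKEN_SEPARATORS) + r"\s]+"
    return [tok for tok in re.split(pattern, standardize(text)) if tok]


print(tokenize("Ethyleen-oxide (zuiver)"))
```

After standardization, "Café" and "café" both become "cafe", which is why two unequal terms can still match.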
Fuzzy Score
The higher the score, the more similar the strings are. A score of 1.0 means the strings are identical. A score of 0.0 means the strings have nothing in common.
You can request the score in the SELECT statement by using the SCORE() function. You can sort the results of a query by score in descending order to get the best records first (the best record is the record that is most similar to the user input). If a fuzzy search of multiple columns is used in a SELECT statement, the score is returned as an average of the scores of all columns used.
When searching text columns, a TF/IDF (term frequency/inverse document frequency) score is returned by default instead of the fuzzy score.
The TF/IDF calculation can be disabled so that you get the fuzzy score instead. In particular, this makes sense for short-text columns containing data such as product names or company names.
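To make the scoring behaviour concrete, here is a minimal Python sketch. It assumes `difflib.SequenceMatcher` as a stand-in similarity measure, which is not the formula HANA uses; it only mimics the properties described above (1.0 for identical strings, 0.0 for nothing in common, best record first, average over columns). All function names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(search_term: str, reference: str) -> float:
    """Stand-in similarity in [0.0, 1.0]; NOT the HANA fuzzy-score formula."""
    return SequenceMatcher(None, search_term.lower(), reference.lower()).ratio()


def combined_score(column_scores: list) -> float:
    """Average the per-column scores, as described for multi-column searches."""
    return sum(column_scores) / len(column_scores)


def rank(search_term: str, candidates: list, threshold: float) -> list:
    """Keep candidates at or above the threshold, highest score first."""
    scored = [(similarity(search_term, c), c) for c in candidates]
    kept = [pair for pair in scored if pair[0] >= threshold]
    return sorted(kept, key=lambda pair: (-pair[0], pair[1]))


for score, name in rank("ethyleen", ["ethyleen", "etyleen", "water"], 0.7):
    print(round(score, 2), name)
```

The threshold plays the role of the first FUZZY() argument: records scoring below it are dropped before sorting.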
New result
select distinct score() as score, ident from z_estri where contains( ident, 'ethyleen' , fuzzy(0.6 , 'textsearch=compare' ) ) order by score desc, ident
Our typos get top scores … … but we lose points in the longer texts
excessTokenWeight
• excessTokenWeight defines the weight of excess (that is, unassigned) tokens. It is set to 1.0 by default.
• Excess tokens are tokens that do not have a counterpart token on either the input side or the request side.
• This parameter enables a better sorting by score when the lengths (that is, the number of tokens) of the request entry and the reference entry are different.
• We want to select the data regardless of the excess tokens.
• But a record without excess tokens must be more exact than one with excess tokens.
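The effect of the weight can be illustrated with a toy calculation in Python. The formula below (matched tokens divided by matched tokens plus weighted excess tokens) is an assumption made purely for illustration and is not HANA's scoring formula; it only reproduces the two goals listed above.

```python
def toy_score(request_tokens, reference_tokens, excess_token_weight=1.0):
    """Toy formula: matched / (matched + weight * excess). Not HANA's formula."""
    request, reference = set(request_tokens), set(reference_tokens)
    matched = len(request & reference)
    # Excess tokens have no counterpart on the other side.
    excess = len(request ^ reference)
    if matched == 0:
        return 0.0
    return matched / (matched + excess_token_weight * excess)


exact = toy_score(["ethyleen"], ["ethyleen"])
longer_default = toy_score(["ethyleen"], ["ethyleen", "oxide"], 1.0)
longer_relaxed = toy_score(["ethyleen"], ["ethyleen", "oxide"], 0.1)
print(exact, longer_default, round(longer_relaxed, 3))
```

With the weight lowered to 0.1, the longer text scores high enough to pass a 0.6 threshold, yet still ranks below the record without excess tokens.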
excessTokenWeight
where contains( ident, 'ethyleen' , fuzzy(0.6 , 'textsearch=compare, excessTokenWeight=1.0' ) )
• Default = 1.0
where contains( ident, 'ethyleen' , fuzzy(0.6 , 'textsearch=compare, excessTokenWeight=0.1' ) )
andThreshold
• Specify a 'partial AND'
• The 'andThreshold' parameter defines the percentage of tokens that have to match
• andThreshold = 1.0 → all tokens have to match, 'strict AND'
• 0.0 < andThreshold < 1.0 → some of the tokens have to match, 'soft AND'
• andThreshold = 0.0 → at least one token has to match, 'OR'
• The parameter influences performance.
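The three cases can be sketched as a small Python predicate. This is a simplified reading of the rule: tokens are compared exactly rather than fuzzily, and rounding the required count up is an assumption, so treat it as an illustration rather than HANA's behaviour.

```python
import math


def passes_and_threshold(request_tokens, reference_tokens, and_threshold=1.0):
    """Return True if enough request tokens occur in the reference entry."""
    reference = set(reference_tokens)
    matched = sum(1 for tok in request_tokens if tok in reference)
    # 0.0 still requires at least one matching token ('OR').
    required = max(1, math.ceil(and_threshold * len(request_tokens)))
    return matched >= required


request = ["zuiver", "ethyleen"]
reference = ["ethyleen", "oxide"]
print(passes_and_threshold(request, reference, 1.0))  # strict AND
print(passes_and_threshold(request, reference, 0.0))  # OR
```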
andThreshold
where contains( ident, 'zuiver ethyleen' , fuzzy(0.6 , 'textsearch=compare, excessTokenWeight=0.1 ,andThreshold=1.0' ) )
where contains( ident, 'zuiver ethyleen' , fuzzy(0.6 , 'textsearch=compare, excessTokenWeight=0.1 ,andThreshold=0.0' ) )
• Default = 1.0 = strict AND
• 0.0 = OR
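Because the search options travel as one comma-separated string inside FUZZY(), a small helper can assemble the predicate. The following Python sketch is hypothetical (the function name and its signature are not part of any SAP API); it only builds the text of a CONTAINS predicate like the ones shown above and doubles single quotes in the search term.

```python
def fuzzy_predicate(column, term, threshold, **options):
    """Build a CONTAINS(..., FUZZY(...)) predicate as text.

    The column name is inserted as-is, so it must come from trusted code.
    """
    escaped = term.replace("'", "''")  # standard SQL quote doubling
    option_text = ",".join(f"{key}={value}" for key, value in options.items())
    if option_text:
        fuzzy = f"FUZZY({threshold}, '{option_text}')"
    else:
        fuzzy = f"FUZZY({threshold})"
    return f"CONTAINS({column}, '{escaped}', {fuzzy})"


print(fuzzy_predicate("ident", "zuiver ethyleen", 0.6,
                      textSearch="compare",
                      excessTokenWeight=0.1,
                      andThreshold=0.0))
```

Keeping the options as keyword arguments makes it easy to try different excessTokenWeight and andThreshold values without editing the statement by hand.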
(de)composeWords
• composeWords: how words in the user input are combined into compound words
• decomposeWords: how words in the user input are split into separate words, building a decomposition phrase
• compoundWordWeight: how compound word hits affect the score of a document
• The parameter influences performance.
composeWords = 3: Van der weyden → Vander weyden, Van derweyden, Vanderweyden
decomposeWords = 3: Vanderweyden → Vander weyden, Van derweyden, Va nd erweyden, …
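One way to picture the composition step is to enumerate every way of gluing adjacent words together. The Python sketch below is an interpretation of the 'Van der weyden' example, not HANA's algorithm; `compose_variants` and `max_words` are hypothetical names, with `max_words` standing in for the composeWords setting.

```python
def compose_variants(tokens, max_words=3):
    """All ways to glue runs of up to max_words adjacent tokens together."""
    if not tokens:
        return [[]]
    variants = []
    for size in range(1, min(max_words, len(tokens)) + 1):
        head = "".join(tokens[:size])  # glue the first `size` tokens
        for rest in compose_variants(tokens[size:], max_words):
            variants.append([head] + rest)
    return variants


for variant in compose_variants(["Van", "der", "weyden"], max_words=3):
    print(" ".join(variant))
```

The number of variants grows quickly with the number of input words, which fits the note that these parameters influence performance.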
ABAP
Could it be this easy?
SELECT * FROM estri INTO TABLE DATA(lt_estri) WHERE contains (ident , 'ethyleen' , FUZZY (0.8) ).
Not available in ABAP Open SQL:
• http://scn.sap.com/thread/3757646
• http://scn.sap.com/community/abap/hana/blog/2012/12/28/sap-teched-2012-abap-for-sap-hana-how-to-exploit-the-power-of-sap-hana
ABAP – ADBC-interface
• Prepare the statement in HANA Studio
• Our own framework, a wrapper around the ADBC interface, reduces programming effort
DATA(lr_data) = lr_object->execute( ).
SAP Documentation
• SAP HANA Search Developer Guide (24/06/2015): http://help.sap.com/hana/SAP_HANA_Search_Developer_Guide_en.pdf
• SAP HANA Developer Guide (28/05/2014), Chapter 10: http://hcp.sap.com/content/dam/website/saphana/en_us/Technology%20Documents/SAP_HANA_Developer_Guide_en.pdf
• SAP HANA Fuzzy Search Reference (Help Portal): http://help.sap.com/saphelp_hanaplatform/helpdata/en/27/b6f00d4d4744d1b3dcfdea68e0eb0a/frameset.htm