KU Leuven combines/connects SAP HANA, fuzzy search, gateway services and SAP UI5 to build apps for students and staff

Nico Croes - Head of Development Coaching team

Kris Claes - Sr. Software Architect

SAPience.be TECHday ‘15

Agenda

•  Intro KU Leuven
•  The project
•  The flow

[Flow diagram: UI5, GW-Hub, GW-Client, OData model, Handler, Fuzzy Frame, ADBC Frame, ADBC classes, HANA]

KU Leuven & Association

•  More than 102,000 students
•  More than 22,000 staff members

[Pie chart with the categories academic programmes, professional programmes and arts; of the percentage labels only "2%" survived]

SAP @ KU Leuven

Since 1999…

[System landscape diagram]
•  SRM (e-catalog), CRM, Solution Manager, PI, BO, BW
•  ca. 800 GB data
•  ERP: RRB, PAY-TIME, PA-PD, e-recruiting, HR-FPM, FM, CO, AA, SD, MM, FSCM, FICA, IS: SLcM, PM, PS, IM, RE-FX, FI
•  ca. 2 TB data
•  + A lot of custom code (workflow, web applications, interfaces, …)

Central IT Department

DIRECTORATE ICTS

Customer & Service Centre
•  Customer & service managers
•  IT Vendor & Purchasing mgmt
•  ICTS Helpdesk, communication & training

Facilities for Education, Research, Communication and Collaboration
•  Inter- and Intranet
•  Facilities for Education
•  Facilities for Research
•  Communication & Collaboration
•  Competence Centre Information Security

Administrative Applications for General Management
•  Finance
•  Logistics
•  Human Resources
•  SAP Basis & ICTS

Administrative Applications for University management
•  Students
•  Education
•  Individual Study programmes & Exams
•  Research
•  CRM

Local Network & Support
•  Local Infrastructure System administration
•  Local Infrastructure support
•  PC Classrooms support

Central IT Infrastructure
•  System administration AIX
•  System administration UNIX
•  System administration Windows
•  Data Centre Network

Competence Centre Management Information

ICTS Administrative Office

Total: 215 FTE

The project

Original requirement
•  Add some additional fields to the existing BSP application

Objections
•  Conflict with our internal guidelines concerning modifications of ‘old’ BSP programs
•  The existing application is slow (a search can take up to 20 s…)
•  The existing application is not MVC (which makes modifications tricky)

New proposal
•  Refactor the ‘old’ BSP application into a new UI5 app
•  Use HANA to speed up the search and data collection
•  Offer one application that runs on all devices (responsive)
•  Separate the UI (frontend developer) from data processing (backend developer)


Old BSP application


Search for (bio)chemical substances by name, formula or CAS number (in SAP EH&S) → detailed sheet with information on risks, regulations and safety measures → print labels for recipients

New UI5 application


[Screenshots of the new application: exact search and fuzzy search]

The Flow

UI5 → GW-Hub → GW-Client → OData model → Handler → Fuzzy Frame → ADBC Frame → ADBC classes → HANA

UI5 application design

Use standard UI5 UI elements & concepts
•  ‘Master detail’ design pattern
•  View drill down to details
https://sapui5.netweaver.ondemand.com/explored.html#/entity/sap.m.Table/samples

GW-Hub

•  Endpoint for the GW service
•  Location of the UI5 app
•  The GW REST service can be used by other (non-SAP) applications
https://admin.kuleuven.be/icts/services/dataservices

OData model


•  Logical data model (not the DB model)

•  Link between frontend and backend (developers)

Database Model - Search

[Data model diagram] Tables shown: TCGTPLREL, ESTRH, ESTRI, TCG22, TCG24, TCG11, ESTVA, ESTVH, TCG53, TCG12, AUSP, ESTPH, ESTPP, ESTPJ, ESTPS, ESTPT, CABN, CABNT, KSML, KLAH, ZLT_BIG_LBL, ZLT_BIG_LBL_UNIT, ZLT_BIG_LBL_PHR, T006

GW implementation SEGW


Data entity types

Service implementation

•  Different service implementations are ‘grouped’ into a handler/model class

Improving Existing Search Help

•  Performance
•  Only ‘exact’ search
•  Typos

Ethyleen, Etyleen, Ethylene, Etilien, …

What data do we have?

EH&S – hazardous substances – database
•  Approximately 750,000 records
•  Text fields with product names, synonyms, formulas
•  Some typos

Database Model - Fetch

[Data model diagram] Tables shown: TCGTPLREL, ESTRH, ESTRI, TCG22, TCG24, TCG11, ESTVA, ESTVH, TCG53, TCG12, AUSP, ESTPH, ESTPP, ESTPJ, ESTPS, ESTPT, CABN, CABNT, KSML, KLAH, ZLT_BIG_LBL, ZLT_BIG_LBL_UNIT, ZLT_BIG_LBL_PHR, T006

HANA – Fuzzy Search

Fuzzy Search is a fast and fault-tolerant search feature for SAP HANA. The term “fault-tolerant search” means that a database query returns records even if the search term (the user input) contains additional or missing characters, or other types of spelling error.

You can call the fuzzy search by using the CONTAINS predicate with the FUZZY option in the WHERE clause of a SELECT statement.

Fuzzy search can be used in various applications, for example:
•  Fault-tolerant search in text columns (html or pdf for example): Search for documents on 'Driethanolamyn' and find all documents that contain the term 'Triethanolamine'.
•  Fault-tolerant search in structured database content: Search for a product called 'coffe krisp biscuit' and find 'Toffee Crisp Biscuits'.
•  Fault-tolerant check for duplicate records: Before creating a new customer record in a CRM system, search for similar customer records and verify that no duplicates are already stored in the system. When creating a new record called 'SAB Aktiengesellschaft & Co KG Deutschl.' in 'Wahldorf' for example, the system would bring up 'SAP Deutschland AG & Co. KG' in 'Walldorf' as a possible duplicate.

Search queries with CONTAINS

... where contains( ident, 'ethyleen' , FUZZY(0.5) )

... where contains( ident, 'ethyleen' , LINGUISTIC )

... where contains( ident, 'ethyleen' , EXACT )

“A linguistic search finds all words that have the same word stem as the search term. It also finds all words for which the search term is the word stem. In the SELECT statement of the full-text search query, you can specify the LINGUISTIC search type. When you execute a linguistic search, the system has to determine the stems of the searched terms. It will look up the stems in the stem dictionary. The hits in the stem dictionary point to all words in the word dictionary that have this stem.”

Basic Fuzzy

select score() as score, * from z_chemical_nofuzz where contains ( product , 'ethyleen' , fuzzy( 0.7 ) ) order by score desc

select * from z_chemical_nofuzz where product like '%ethyleen%' order by product ;

No typos

No longer texts

Fuzzy Search on String Columns

Not all options are available on string columns!

select score() as score, * from z_chemical_nofuzz where contains ( product , 'ethyleen' , fuzzy( 0.7 , 'textsearch=compare' ) ) order by score desc

Could not execute 'select score() as score, * from z_chemical_nofuzz where contains ( product , 'ethyleen' , fuzzy( ...' in 4 ms 709 µs . SAP DBTech JDBC: [2048]: column store error: search table error: [2018] Option 'textSearch' not allowed for column 'PRODUCT'

You can make this available by building a Fuzzy Search Index on the required columns

•  String types: basic fuzzy
•  Text types: sophisticated fuzzy
•  Date types: sophisticated fuzzy

Fuzzy Search Index

•  Creation by SQL
•  Creation in the HANA Studio: offers fewer possibilities

CREATE FULLTEXT INDEX <index_name> ON <tableref> '(' <column_name> ')' [<fulltext_parameter_list>]

Specify any of the following additional parameters for the full-text index:
LANGUAGE COLUMN <column_name>
LANGUAGE DETECTION '(' <string_literal_list> ')'
MIME TYPE COLUMN <column_name>
FUZZY SEARCH INDEX <on_off>
PHRASE INDEX RATIO <on_off>
CONFIGURATION <string_literal>
SEARCH ONLY <on_off>
FAST PREPROCESS <on_off>
TEXT MINING <on_off>
TEXT MINING CONFIGURATION <string_literal>
TEXT ANALYSIS <on_off>
MIME TYPE <specified mime type, e.g. application/pdf>
TOKEN SEPARATORS <\/;,.:-_()[]<>!?*@+{}="&>

CREATE FULLTEXT INDEX "INDENT" ON "UX000657"."Z_ESTRI" ("IDENT")
  SYNC
  PHRASE INDEX RATIO 0.200000
  FUZZY SEARCH INDEX OFF
  SEARCH ONLY ON
  FAST PREPROCESS ON
  TEXT MINING OFF
  TEXT ANALYSIS OFF
  TOKEN SEPARATORS '\/;,.:-_()[]<>!?*@+{}="&#$~|'

Basic Fuzzy - again

select score() as score, * from z_chemical where contains ( product , 'ethyleen' , fuzzy( 0.7 ) ) order by score desc

select * from z_chemical where product like '%ethyleen%' order by product ;

Yes! longer texts

Still no typos

How does it work?

SELECT * FROM <tableref> WHERE CONTAINS ( (<col1>, <col2>, <col3> ) ,<search_string> )

•  Tokenize using the token separators.
•  Similarity: defined by the number of common characters, wrong characters and additional characters in the search string and the reference string.
•  Standardization: translation to lower-case characters without diacritics.
   •  Possible to get a match when comparing 2 unequal terms ( “Café” = “café” ).
   •  High fuzzy scores for common differences in the spelling of words.
•  Influenced by some additional parameters
   •  excessTokenWeight
   •  andThreshold
   •  …
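The tokenize and standardize steps can be sketched outside HANA. The Python below is only an illustration of the idea, not HANA's implementation: the separator list is copied from the TOKEN SEPARATORS value of the index shown earlier, whitespace is assumed to separate tokens as well, and the function names are hypothetical.

```python
import unicodedata

# Separators copied from the TOKEN SEPARATORS clause of the index definition.
# Treating whitespace as a separator too is an assumption of this sketch.
TOKEN_SEPARATORS = r'\/;,.:-_()[]<>!?*@+{}="&#$~|'


def standardize(term: str) -> str:
    """Lower-case a term and strip diacritics, e.g. 'Café' becomes 'cafe'."""
    decomposed = unicodedata.normalize("NFKD", term)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.lower()


def tokenize(text: str) -> list[str]:
    """Split text on the separators, then standardize every token."""
    for separator in TOKEN_SEPARATORS:
        text = text.replace(separator, " ")
    return [standardize(token) for token in text.split()]


print(tokenize("Café-au-lait (zuiver)"))  # ['cafe', 'au', 'lait', 'zuiver']
```

After this step, 'Café' and 'café' compare as equal tokens, which is why two visibly different terms can still produce a match.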

Fuzzy Score

The higher the score, the more similar the strings are. A score of 1.0 means the strings are identical. A score of 0.0 means the strings have nothing in common.

You can request the score in the SELECT statement by using the SCORE() function. You can sort the results of a query by score in descending order to get the best records first (the best record is the record that is most similar to the user input). If a fuzzy search of multiple columns is used in a SELECT statement, the score is returned as an average of the scores of all columns used.

When searching text columns, a TF/IDF (term frequency/inverse document frequency) score is returned by default instead of the fuzzy score.

The TF/IDF calculation can be disabled so that you get the fuzzy score instead. In particular, this makes sense for short-text columns containing data such as product names or company names.
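To make the threshold-and-sort behaviour concrete, here is a small Python sketch. It uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure; HANA's own scoring algorithm is different, so the numbers will not match what SCORE() returns. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in score between 0.0 (nothing in common) and 1.0 (identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_search(term: str, candidates: list[str], threshold: float) -> list[tuple[float, str]]:
    """Keep candidates scoring at least `threshold`, best score first."""
    scored = [(similarity(term, candidate), candidate) for candidate in candidates]
    hits = [(score, candidate) for score, candidate in scored if score >= threshold]
    # Mirrors "order by score desc, ident": highest score first, ties by name.
    return sorted(hits, key=lambda hit: (-hit[0], hit[1]))


for score, name in fuzzy_search("ethyleen", ["ethylene", "ethyleen", "water"], 0.7):
    print(f"{score:.2f}  {name}")
```

The exact match comes first, the near match follows, and the unrelated entry is filtered out by the threshold, which is the same shape of result the fuzzy queries in this deck produce.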

New result

select distinct score() as score, ident from z_estri where contains( ident, 'ethyleen' , fuzzy(0.6 , 'textsearch=compare' ) ) order by score desc, ident

Our typos get top scores … … but we lose points in the longer texts

excessTokenWeight

•  excessTokenWeight defines the weight of excess (that is, unassigned) tokens. It is set to 1.0 by default.

•  Excess tokens are tokens that do not have a counterpart token on either the input side or the request side.

•  This parameter enables a better sorting by score when the lengths (that is, the number of tokens) of the request entry and the reference entry are different.

•  We want to select the data regardless of the excess tokens.
•  But a record without excess tokens must be more exact than one with excess tokens.

excessTokenWeight

where contains( ident, 'ethyleen' , fuzzy(0.6 , 'textsearch=compare, excessTokenWeight=1.0' ) )

•  Default = 1.0

where contains( ident, 'ethyleen' , fuzzy(0.6 , 'textsearch=compare, excessTokenWeight=0.1' ) )
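The effect of lowering excessTokenWeight can be shown with a toy model. The formula below is an assumption made for illustration only (matched-token similarity divided by the number of request tokens plus the weighted number of excess tokens); it is not the formula HANA uses, and `toy_score` is a hypothetical name.

```python
from difflib import SequenceMatcher


def token_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()


def toy_score(request: str, record: str, excess_token_weight: float = 1.0) -> float:
    """Toy score: excess record tokens dilute the score in proportion to their weight."""
    request_tokens = request.lower().split()
    remaining = record.lower().split()
    if not request_tokens or not remaining:
        return 0.0
    total = 0.0
    for token in request_tokens:
        if not remaining:
            break
        # Greedily pair each request token with its most similar record token.
        best = max(remaining, key=lambda candidate: token_similarity(token, candidate))
        total += token_similarity(token, best)
        remaining.remove(best)
    excess = len(remaining)  # record tokens without a counterpart in the request
    return total / (len(request_tokens) + excess_token_weight * excess)


print(toy_score("ethyleen", "ethyleen"))                     # 1.0
print(toy_score("ethyleen", "ethyleen glycol zuiver", 1.0))  # about 0.33
print(toy_score("ethyleen", "ethyleen glycol zuiver", 0.1))  # about 0.83
```

With the default weight the longer record drops below a 0.6 threshold; with a weight of 0.1 it is selected again, yet still ranks below the record without excess tokens.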

andThreshold

•  Specifies a 'partial AND'
•  The 'andThreshold' parameter defines the percentage of tokens that have to match
   •  andThreshold = 1.0 → all tokens have to match, 'strict AND'
   •  0.0 < andThreshold < 1.0 → some of the tokens have to match, 'soft AND'
   •  andThreshold = 0.0 → at least one token has to match, 'OR'

•  The parameter influences performance.

andThreshold

where contains( ident, 'zuiver ethyleen' , fuzzy(0.6 , 'textsearch=compare, excessTokenWeight=0.1 ,andThreshold=1.0' ) )

where contains( ident, 'zuiver ethyleen' , fuzzy(0.6 , 'textsearch=compare, excessTokenWeight=0.1 ,andThreshold=0.0' ) )

•  Default = 1.0 = strict AND •  0.0 = OR
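The three andThreshold regimes can be expressed as a simple rule on the share of request tokens that find a match. This Python sketch is an illustrative model, not HANA's implementation: the rounding rule (`ceil`), the per-token similarity measure and the function names are all assumptions.

```python
import math
from difflib import SequenceMatcher


def token_matches(token: str, record_tokens: list[str], fuzziness: float) -> bool:
    """True if any record token is similar enough to the request token."""
    return any(SequenceMatcher(None, token, r).ratio() >= fuzziness for r in record_tokens)


def passes_and_threshold(request: str, record: str,
                         and_threshold: float = 1.0, fuzziness: float = 0.6) -> bool:
    """Require that at least `and_threshold` of the request tokens match.

    1.0 behaves like a strict AND, 0.0 like an OR (at least one token).
    """
    request_tokens = request.lower().split()
    record_tokens = record.lower().split()
    if not request_tokens:
        return False
    matched = sum(token_matches(t, record_tokens, fuzziness) for t in request_tokens)
    required = max(1, math.ceil(and_threshold * len(request_tokens)))
    return matched >= required


print(passes_and_threshold("zuiver ethyleen", "ethyleen", and_threshold=1.0))  # False
print(passes_and_threshold("zuiver ethyleen", "ethyleen", and_threshold=0.0))  # True
```

For the request 'zuiver ethyleen', a record containing only 'ethyleen' is rejected under the strict AND and accepted under the OR setting.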

(de)composeWords

•  composeWords: how words in the user input are combined into compound words
•  decomposeWords: how words in the user input are split into separate words, building a decomposition phrase
•  compoundWordWeight: how compound word hits affect the score of a document

•  The parameter influences performance.

composeWords = 3, decomposeWords = 3

Van der weyden, Vander weyden, Vanderweyden, Va nd erweyden, Van derweyden, …

Vanderweyden, Vander weyden, Van derweyden
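The composition side of this can be sketched as enumerating every way to glue adjacent input words together. The Python below is a simplified illustration that assumes composeWords caps how many adjacent words may be joined into one; it does not cover decomposeWords, which would also need to split inside a word, and `compose_variants` is a hypothetical name.

```python
def compose_variants(words: list[str], max_compose: int) -> list[list[str]]:
    """All ways to glue adjacent words, joining at most `max_compose` words into one."""
    if not words:
        return [[]]
    variants = []
    for size in range(1, min(max_compose, len(words)) + 1):
        head = "".join(words[:size])  # glue the first `size` words together
        for tail in compose_variants(words[size:], max_compose):
            variants.append([head] + tail)
    return variants


for variant in compose_variants(["Van", "der", "weyden"], 3):
    print(" ".join(variant))
```

For the three input words this yields 'Van der weyden', 'Van derweyden', 'Vander weyden' and 'Vanderweyden'. Each extra variant is one more phrase to search for, which is one way to see why these parameters influence performance.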

ABAP

Could it be this easy?

SELECT * FROM estri INTO TABLE DATA(lt_estri) WHERE contains (ident , 'ethyleen' , FUZZY (0.8) ).

Not available in ABAP Open SQL:
•  http://scn.sap.com/thread/3757646
•  http://scn.sap.com/community/abap/hana/blog/2012/12/28/sap-teched-2012-abap-for-sap-hana-how-to-exploit-the-power-of-sap-hana

ABAP – ADBC-interface


•  Prepare statement in HANA Studio
•  Own Framework as wrapper of ADBC interface reduces programming effort

DATA(lr_data) = lr_object->execute( ).
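The wrapper idea, assembling the native statement in one place so that callers only pass a table, a column and search options, can be illustrated in a language-neutral way. The Python sketch below is not the team's ABAP framework: the function name, its signature and the use of a `?` parameter marker for the search term are assumptions. It only builds the statement text, following the shape of the queries shown in this deck.

```python
def build_fuzzy_query(table: str, column: str, threshold: float, **options: object) -> str:
    """Build a CONTAINS ... FUZZY statement; the search term is left as a '?' marker."""
    for name in (table, column):
        # Reject anything that is not a plain identifier to avoid SQL injection.
        if not name.isidentifier():
            raise ValueError(f"not a plain identifier: {name!r}")
    option_text = ", ".join(f"{key}={value}" for key, value in options.items())
    fuzzy = f"FUZZY({threshold}, '{option_text}')" if option_text else f"FUZZY({threshold})"
    return (
        f"SELECT DISTINCT SCORE() AS score, {column} FROM {table} "
        f"WHERE CONTAINS({column}, ?, {fuzzy}) "
        f"ORDER BY score DESC, {column}"
    )


print(build_fuzzy_query("z_estri", "ident", 0.6,
                        textsearch="compare", excessTokenWeight=0.1))
```

Keeping the statement construction in one helper means the search options can be tuned in a single place while the calling code stays a one-line execute, as in the ABAP call above.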

SAP Documentation

SAP HANA Search Developer Guide (24/06/2015) http://help.sap.com/hana/SAP_HANA_Search_Developer_Guide_en.pdf

SAP HANA Developer Guide (28/05/2014) http://hcp.sap.com/content/dam/website/saphana/en_us/Technology%20Documents/SAP_HANA_Developer_Guide_en.pdf (Chapter 10)

SAP HANA Fuzzy Search Reference (Help Portal) http://help.sap.com/saphelp_hanaplatform/helpdata/en/27/b6f00d4d4744d1b3dcfdea68e0eb0a/frameset.htm

Thank you!

