SQL for pattern matching (Oracle 12c)
Logan Palanisamy
SQL for Pattern Matching
Agenda
Introduction to regular expressions
RegEx functions in Oracle
SQL for Pattern Matching
Meeting Basics
Put your phones/pagers on vibrate/mute
Messenger: Change the status to offline or in-meeting
Remote attendees: Mute yourself (*6). Ask questions via WebEx.
What are Regular Expressions?
A way to express patterns
credit cards, license plate numbers, vehicle identification numbers, voter id, driving license, SSNs, phone numbers
UNIX (grep, egrep), PHP, JAVA support Regular Expressions
PERL made it popular
Regular Expression Examples
Example Meaning
[0-9]{10,} 10 or more digits.
[0-9]{3}-[0-9]{2}-[0-9]{4} Social Security number
\([0-9]{3}\)[0-9]{3}-[0-9]{4} Phone number (xxx)yyy-zzzz (the literal parentheses must be escaped)
\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3} Very basic IPv4 address format using Perl notation
(\d{4}[- ]?){3}\d{4} Credit Card (three occurrences of four digits followed optionally by a space or dash, and one 4-digit series)
[1-9][A-Z]{3}[0-9]{3} Car License Plate in California
[A-Z][a-z]+(\s+[A-Z][a-z]*)?\s+[A-Z][a-z]+ First name, optional Middle Initial/name, and Last name
(([01]?[0-9][0-9]?|2[0-4][0-9]|25[0-5])\.){3}([01]?[0-9][0-9]?|2[0-4][0-9]|25[0-5]) IPv4 address format (each octet limited to 0-255)
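A minimal sketch of trying two of the patterns above in Oracle. The sample strings and the column names (label, val) are made up for illustration; the ^ and $ anchors are added so the whole string has to match.

```sql
-- Hypothetical sample strings; the patterns come from the examples table.
WITH samples (label, val) AS (
  SELECT 'ssn',   '123-45-6789' FROM dual UNION ALL
  SELECT 'plate', '7ABC123'     FROM dual UNION ALL
  SELECT 'junk',  '123-456-789' FROM dual
)
SELECT label,
       val,
       -- REGEXP_LIKE is a condition, so wrap it in CASE to show it as a column
       CASE WHEN REGEXP_LIKE(val, '^[0-9]{3}-[0-9]{2}-[0-9]{4}$')
            THEN 'Y' ELSE 'N' END AS is_ssn,
       CASE WHEN REGEXP_LIKE(val, '^[1-9][A-Z]{3}[0-9]{3}$')
            THEN 'Y' ELSE 'N' END AS is_ca_plate
FROM samples;
```

Without the anchors, a pattern matches if it occurs anywhere inside the string, which is usually too permissive for validation.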
Regular Expression Meta Characters
Meta character Meaning
. Matches any single "character" except newline.
* Matches zero or more of the character preceding it. e.g.: bugs*, table.*
^ Denotes the beginning of the line. ^A denotes lines starting with A
$ Denotes the end of the line. :$ denotes lines ending with :
\ Escape character (\., \*, \[, \\, etc)
[ ] matches any one of the characters within the brackets. e.g. [aeiou], [a-z], [a-zA-Z], [0-9], [[:alpha:]], [a-z?,!]
[^] negation - matches any single character other than the ones inside the brackets. e.g. ^[^13579] denotes all lines not starting with an odd digit, [^02468]$ denotes all lines not ending with an even digit
Extended Regular Expressions Meta Characters
Meta character Meaning
| alternation. e.g.: the(y|m), (they|them)
+ one or more occurrences of previous character.
? zero or one occurrences of previous character.
{n} exactly n repetitions of the previous char or group
{n,} n or more repetitions of the previous char or group
{n,m} n to m repetitions of the previous char or group
(....) grouping or subexpression
\n back referencing where n stands for the nth sub-expression. e.g.: \1 is the back reference for first sub-expression.
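A small sketch of grouping plus a back reference, assuming an arbitrary sample sentence: the group captures a word, and \1 requires the same word again after a single space.

```sql
-- ([[:alpha:]]+) captures a run of letters; " \1" demands the same run again.
-- The replacement string '\1' keeps a single copy of the repeated word.
SELECT REGEXP_REPLACE('the the quick brown fox',
                      '([[:alpha:]]+) \1',
                      '\1') AS deduped
FROM dual;
-- deduped: the quick brown fox
```

The pattern is not anchored on word boundaries, so it would also collapse "this is" to "this"; a production pattern would need stricter delimiters.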
POSIX Character Classes
POSIX Description
[:alnum:] Alphanumeric characters
[:alpha:] Alphabetic characters
[:ascii:] ASCII characters (an extension, not part of the POSIX standard)
[:blank:] Space and tab
[:cntrl:] Control characters
[:digit:] Digits
[:xdigit:] Hexadecimal digits
[:graph:] Visible characters (i.e. anything except spaces, control characters, etc.)
[:lower:] Lowercase letters
[:print:] Visible characters and spaces (i.e. anything except control characters)
[:punct:] Punctuation and symbols.
[:space:] All whitespace characters, including line breaks
[:upper:] Uppercase letters
[:word:] Word characters (letters, numbers and underscores; an extension, not part of the POSIX standard)
Perl Character Classes
Perl POSIX Description
\d [[:digit:]] [0-9]
\D [^[:digit:]] [^0-9]
\w [[:alnum:]_] [0-9a-zA-Z_]
\W [^[:alnum:]_] [^0-9a-zA-Z_]
\s [[:space:]]
\S [^[:space:]]
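As a quick illustration, the Perl-style shorthand and the POSIX class can be used interchangeably in Oracle's REGEXP_* functions (the Perl-influenced shorthands are available from 10g Release 2). The input string is an arbitrary example.

```sql
-- \d+ and [[:digit:]]+ both extract the first run of digits.
SELECT REGEXP_SUBSTR('Order 12345 shipped', '\d+')          AS perl_style,
       REGEXP_SUBSTR('Order 12345 shipped', '[[:digit:]]+') AS posix_style
FROM dual;
-- perl_style: 12345   posix_style: 12345
```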
Tools to learn Regular Expressions
http://www.weitz.de/regex-coach/
http://www.regexbuddy.com/
String operations before Regular Expression support in Oracle
Pull the data from DB and perform it in middle tier or FE
LIKE operator
OWA_PATTERN in 9i and before
LIKE operator
% matches zero or more of any character
_ matches exactly one character
Examples:
WHERE col1 LIKE 'abc%';
WHERE col1 LIKE '%abc';
WHERE col1 LIKE 'ab_d';
WHERE col1 LIKE '\_%' escape '\';
WHERE col1 NOT LIKE 'abc%';
Very limited functionality
Check whether first character is numeric: where c1 like '0%' OR c1 like '1%' OR .. .. c1 like '9%'
Very trivial with Regular Exp: where regexp_like(c1, '^[0-9]')
REGEXP_* functions
Available from 10g onwards.
Powerful and flexible, but CPU-hungry.
Easy and elegant, but sometimes less performant
Usable on text literal, bind variable, or any column that holds character data such as CHAR, NCHAR, CLOB, NCLOB, NVARCHAR2, and VARCHAR2 (but not LONG).
Useful as column constraint for data validation
REGEXP_LIKE
Determines whether pattern matches.
REGEXP_LIKE(source_str, pattern [, match_parameter])
Returns TRUE or FALSE.
Use in WHERE clause to return rows matching a pattern
Use as a constraint: alter table t add constraint alphanum check (regexp_like (x, '[[:alnum:]]'));
Use in PL/SQL to return a boolean: IF (REGEXP_LIKE(v_name, '[[:alnum:]]')) THEN ..
Can't be used directly in the SELECT list (it is a condition, not a value-returning function)
regexp_like.sql
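Note that the unanchored pattern '[[:alnum:]]' only requires one alphanumeric character somewhere in the value. A hedged sketch of a stricter variant follows, using a hypothetical table t_demo, where the anchors force every character to be alphanumeric.

```sql
-- t_demo and its constraint name are illustrative, not from the original deck.
CREATE TABLE t_demo (
  x VARCHAR2(30),
  CONSTRAINT t_demo_alphanum CHECK (REGEXP_LIKE(x, '^[[:alnum:]]+$'))
);

INSERT INTO t_demo (x) VALUES ('abc123');   -- accepted
INSERT INTO t_demo (x) VALUES ('abc-123');  -- rejected: ORA-02290 (check constraint violated)
```

As with any check constraint, a NULL value of x still passes, so add NOT NULL if empty values should be rejected too.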
REGEXP_SUBSTR
Extracts the matching pattern. Returns NULL when nothing matches
REGEXP_SUBSTR(source_str, pattern [, position [, occurrence [, match_parameter]]])
position: character at which to begin the search. Default is 1
occurrence: The occurrence of pattern you want to extract
regexp_substr.sql
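A short sketch of the occurrence argument, assuming a comma-separated sample string: the pattern [^,]+ matches one list item, and occurrence picks which item to return.

```sql
-- position = 1 (start of string), occurrence = 2 (second match)
SELECT REGEXP_SUBSTR('alpha,beta,gamma', '[^,]+', 1, 2) AS second_item,
       REGEXP_SUBSTR('alpha,beta,gamma', '[^,]+', 1, 4) AS fourth_item
FROM dual;
-- second_item: beta   fourth_item: NULL (there is no fourth match)
```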
REGEXP_INSTR
Returns the location of match in a string
REGEXP_INSTR(source_str, pattern [, position [, occurrence [, return_option [, match_parameter]]]])
return_option:
0, the default, returns the position of the first character of the match.
1 returns the position of the character following the occurrence.
regexp_instr.sql
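To make the two return_option values concrete, here is a small sketch on an arbitrary sample string.

```sql
-- 'abc123def': the digit run occupies positions 4 through 6.
SELECT REGEXP_INSTR('abc123def', '[0-9]+', 1, 1, 0) AS match_start,  -- first char of match
       REGEXP_INSTR('abc123def', '[0-9]+', 1, 1, 1) AS after_match   -- char following match
FROM dual;
-- match_start: 4   after_match: 7
```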
REGEXP_REPLACE
Search and Replace a pattern
REGEXP_REPLACE(source_str, pattern [, replace_str [, position [, occurrence [, match_parameter]]]])
If replace_str is not specified, each match is removed (replaced with an empty string)
occurrence:
when 0, the default, replaces all occurrences of the match.
when n, any positive integer, replaces the nth occurrence.
regexp_replace.sql
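A minimal sketch of how the occurrence argument and an omitted replace_str behave, using an arbitrary sample string.

```sql
SELECT REGEXP_REPLACE('a1b22c333', '[0-9]+', '#')       AS all_runs,        -- occurrence defaults to 0
       REGEXP_REPLACE('a1b22c333', '[0-9]+', '#', 1, 2) AS second_run_only, -- only the 2nd match
       REGEXP_REPLACE('a1b22c333', '[0-9]+')            AS runs_removed     -- no replace_str
FROM dual;
-- all_runs: a#b#c#   second_run_only: a1b#c333   runs_removed: abc
```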
REGEXP_COUNT
New in 11g
Returns the number of times a pattern appears in a string.
REGEXP_COUNT(source_str, pattern [,position [,match_param]])
For a simple (literal) pattern it is the same as (LENGTH(source_str) - LENGTH(REPLACE(source_str, pattern)))/LENGTH(pattern)
regexp_count.sql
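The equivalence for a literal pattern can be checked with a quick sketch; the sample string is arbitrary.

```sql
-- 'an' occurs twice in 'banana' (positions 2-3 and 4-5).
SELECT REGEXP_COUNT('banana', 'an') AS regex_count,
       (LENGTH('banana') - LENGTH(REPLACE('banana', 'an'))) / LENGTH('an') AS length_count
FROM dual;
-- regex_count: 2   length_count: 2
```

One caveat with the LENGTH form: if REPLACE removes every character, Oracle treats the empty result as NULL, so the arithmetic yields NULL rather than a count.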
Why “SQL for Pattern Matching”
Deficiency of REGEXP_* functions
Retrieving contiguous rows that are inter-related.
Shortcoming of LEAD/LAG analytic functions
Example: Identify successive login failures
Given a sequence of records, identify two or more consecutive login failures showing all the details
SELECT user_id, login_time, result, mn, classifier
FROM logins MATCH_RECOGNIZE (
PARTITION BY user_id
ORDER BY login_time
MEASURES MATCH_NUMBER() as MN,
CLASSIFIER() as classifier
ALL ROWS PER MATCH
PATTERN (F{2,} S)
DEFINE
F AS result = 'FAILURE',
S AS result = 'SUCCESS')
ORDER BY user_id, login_time;
Logins_pm.sql
Components of SQL for pattern matching
PARTITION BY: Logically divides the rows into groups
ORDER BY: Orders the rows in a partition
[ONE ROW | ALL ROWS] PER MATCH: Chooses summaries or details for each match
MEASURES: Defines calculations for use in the query
PATTERN: Defines the row pattern to be matched
DEFINE: Defines primary pattern variables
AFTER MATCH SKIP: Defines where to restart the matching process after a match is found
SUBSET: Defines union row pattern variables
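A hedged sketch that puts several of these components together, as a summary variant of the login-failure query. The inline sample rows and the measure names (first_failure, failures, recovered_at) are assumptions for illustration only.

```sql
-- Inline sample data stands in for a real logins table.
WITH logins (user_id, login_time, result) AS (
  SELECT 'u1', TIMESTAMP '2015-01-01 09:00:00', 'FAILURE' FROM dual UNION ALL
  SELECT 'u1', TIMESTAMP '2015-01-01 09:01:00', 'FAILURE' FROM dual UNION ALL
  SELECT 'u1', TIMESTAMP '2015-01-01 09:02:00', 'FAILURE' FROM dual UNION ALL
  SELECT 'u1', TIMESTAMP '2015-01-01 09:03:00', 'SUCCESS' FROM dual UNION ALL
  SELECT 'u1', TIMESTAMP '2015-01-01 10:00:00', 'FAILURE' FROM dual UNION ALL
  SELECT 'u1', TIMESTAMP '2015-01-01 10:01:00', 'SUCCESS' FROM dual
)
SELECT user_id, first_failure, failures, recovered_at
FROM logins MATCH_RECOGNIZE (
  PARTITION BY user_id                        -- one matching stream per user
  ORDER BY login_time                         -- row order inside the partition
  MEASURES FIRST(F.login_time) AS first_failure,
           COUNT(F.login_time) AS failures,
           S.login_time        AS recovered_at
  ONE ROW PER MATCH                           -- summary instead of detail rows
  AFTER MATCH SKIP PAST LAST ROW              -- the default, spelled out
  PATTERN (F{2,} S)                           -- 2+ failures, then a success
  DEFINE
    F AS result = 'FAILURE',
    S AS result = 'SUCCESS'
)
ORDER BY user_id, first_failure;
```

With this sample data the query should produce a single summary row for the three failures starting at 09:00; the lone failure at 10:00 does not satisfy F{2,}.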
Operator Precedence
Order of precedence
1. Quantifiers (*, +, {n, m}, etc)
2. Concatenation
3. Alternation (vertical bar “|” is the alternation operator)
PATTERN (A B*)
Is equivalent to PATTERN (A (B*))
But not equivalent to PATTERN ((A B)*)
PATTERN (A B | C D)
Is equivalent to PATTERN ( (A B) | (C D))
But not equivalent to PATTERN ( A (B | C) D)
Your Pals: MATCH_NUMBER & CLASSIFIER
The two most useful functions
MATCH_NUMBER ()
Tells which rows are members of which match
CLASSIFIER()
Tells which pattern variable applies to which rows
Difference between an Empty Match and No Match
Empty-Match: A match with zero rows
PATTERN (X*) could result in an empty match
MATCH_NUMBER() increases for an empty-match
CLASSIFIER() returns null value
No match: No match at all
PATTERN (X+) will never produce an empty-match. It either matches something or doesn’t.
empty_N_nomatch.sql
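A minimal sketch of an empty match, assuming a three-row sample set. With ALL ROWS PER MATCH, empty matches are shown by default, so the non-matching row is expected to come back with its own match number and a NULL classifier.

```sql
WITH t (id, val) AS (
  SELECT 1, 'X' FROM dual UNION ALL
  SELECT 2, 'Y' FROM dual UNION ALL
  SELECT 3, 'X' FROM dual
)
SELECT id, val, mn, cls
FROM t MATCH_RECOGNIZE (
  ORDER BY id
  MEASURES MATCH_NUMBER() AS mn,
           CLASSIFIER()   AS cls
  ALL ROWS PER MATCH
  PATTERN (X*)            -- zero or more: row 2 can only yield an empty match
  DEFINE X AS val = 'X'
);
```

Changing the pattern to (X+) should drop row 2 from the output altogether, since X+ cannot match zero rows.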
EMS Incident analysis
Show worst incident periods (e.g. series of Sev0/Sev1/Sev2s back to back)
Show series of incidents that affected multiple properties
Explain how the following things work
PERMUTE (A, B, C)
Not displaying certain matched rows with {- -}
Incidents_pm.sql
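A hedged sketch of both features on made-up incident data (the table, columns, and variable names are assumptions). PERMUTE(S0, S1) is shorthand for (S0 S1 | S1 S0), and rows matched inside {- -} take part in the match but are left out of the ALL ROWS PER MATCH output.

```sql
WITH incidents (incident_id, severity) AS (
  SELECT 1, 1 FROM dual UNION ALL
  SELECT 2, 0 FROM dual UNION ALL
  SELECT 3, 3 FROM dual UNION ALL
  SELECT 4, 2 FROM dual
)
SELECT incident_id, severity, mn, cls
FROM incidents MATCH_RECOGNIZE (
  ORDER BY incident_id
  MEASURES MATCH_NUMBER() AS mn,
           CLASSIFIER()   AS cls
  ALL ROWS PER MATCH
  -- Sev0 and Sev1 in either order, any lower-severity noise (hidden), then a Sev2
  PATTERN (PERMUTE(S0, S1) {- NOISE* -} S2)
  DEFINE
    S0    AS severity = 0,
    S1    AS severity = 1,
    S2    AS severity = 2,
    NOISE AS severity > 2
);
```

With this data, incidents 1, 2 and 4 should be returned, while incident 3 is matched as NOISE but not displayed.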
Example: Sessionization of clickstream data
Sessionize based on 30 or more minutes of inactivity
select *
from clicks MATCH_RECOGNIZE (
partition by user_id
order by click_time
MEASURES MATCH_NUMBER() as session_id
ALL ROWS PER MATCH
PATTERN (A B*)
DEFINE
B AS B.click_time < PREV(B.click_time) + 1/48  -- 1/48 of a day = 30 minutes (DATE arithmetic)
)
ORDER BY user_id, click_time;
clicks_pm.sql
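A self-contained sketch of the same sessionization idea, this time summarizing each session in one row. The sample clicks and the measure names (session_start, session_end, clicks) are assumptions for illustration.

```sql
-- click_time is a DATE here so that + 1/48 means "plus 30 minutes".
WITH clicks (user_id, click_time) AS (
  SELECT 'u1', TO_DATE('2015-01-01 09:00', 'YYYY-MM-DD HH24:MI') FROM dual UNION ALL
  SELECT 'u1', TO_DATE('2015-01-01 09:10', 'YYYY-MM-DD HH24:MI') FROM dual UNION ALL
  SELECT 'u1', TO_DATE('2015-01-01 09:50', 'YYYY-MM-DD HH24:MI') FROM dual UNION ALL
  SELECT 'u1', TO_DATE('2015-01-01 10:05', 'YYYY-MM-DD HH24:MI') FROM dual
)
SELECT user_id, session_id, session_start, session_end, clicks
FROM clicks MATCH_RECOGNIZE (
  PARTITION BY user_id
  ORDER BY click_time
  MEASURES MATCH_NUMBER()    AS session_id,
           FIRST(click_time) AS session_start,
           LAST(click_time)  AS session_end,
           COUNT(*)          AS clicks
  ONE ROW PER MATCH
  PATTERN (A B*)   -- A starts a session, B rows extend it
  DEFINE B AS B.click_time < PREV(B.click_time) + 1/48
);
```

Because A is left undefined it matches any row, so every click either extends the current session (B) or opens a new one (A). The 40-minute gap before 09:50 should split the sample into two sessions of two clicks each.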
Defining Where to Restart the Matching Process After a Match Is Found
AFTER MATCH SKIP TO NEXT ROW: Resume pattern matching at the row after the first row of the current match.
AFTER MATCH SKIP PAST LAST ROW: Resume pattern matching at the next row after the last row of the current match. This is the default.
AFTER MATCH SKIP TO FIRST pattern_variable: Resume pattern matching at the first row that is mapped to the pattern variable.
AFTER MATCH SKIP TO LAST pattern_variable: Resume pattern matching at the last row that is mapped to the pattern variable.
AFTER MATCH SKIP .. : Things to watch out for
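1. Resuming at non-existent row
AFTER MATCH SKIP TO B
PATTERN (A B* C)
B* can match zero rows, so there may be no B row to skip to.
2. Resuming at the same row (infinite loop)
AFTER MATCH SKIP TO A
PATTERN (A B+ C+)
A is the first row of the match, so matching would restart exactly where it just started.
3. Resuming at the same row or non-existent row
AFTER MATCH SKIP TO FIRST A
PATTERN (A* B)
A* may be empty (no row to skip to) or may begin at the first row of the match (same row).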
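For contrast with the pitfalls above, here is a sketch of a safe non-default option, AFTER MATCH SKIP TO NEXT ROW, which allows overlapping matches. The four sample rows are assumptions.

```sql
WITH t (id, result) AS (
  SELECT 1, 'F' FROM dual UNION ALL
  SELECT 2, 'F' FROM dual UNION ALL
  SELECT 3, 'F' FROM dual UNION ALL
  SELECT 4, 'S' FROM dual
)
SELECT mn, first_id, last_id
FROM t MATCH_RECOGNIZE (
  ORDER BY id
  MEASURES MATCH_NUMBER() AS mn,
           FIRST(id)      AS first_id,
           LAST(id)       AS last_id
  ONE ROW PER MATCH
  AFTER MATCH SKIP TO NEXT ROW   -- retry from the row after the match's first row
  PATTERN (F{2,} S)
  DEFINE
    F AS result = 'F',
    S AS result = 'S'
);
```

With SKIP TO NEXT ROW this should report two overlapping matches (rows 1-4 and rows 2-4); with the default SKIP PAST LAST ROW only the first would be found.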
Greedy Versus Reluctant quantifier
By default, quantifiers are greedy. They try to match as many instances of regular expression as possible.
A* or A+ will try to match as many instances of A as possible
Greedy behavior can be changed to reluctant by suffixing the quantifiers with a question mark
A*? or A+? will match as few instances of A as possible
It is also called Lazy match
greedy_vs_reluctant.sql
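A small sketch of the difference, assuming four sample rows. X is deliberately left out of DEFINE, so it matches any row; only the quantifier decides how many rows it takes.

```sql
WITH t (id, val) AS (
  SELECT 1, 'A' FROM dual UNION ALL
  SELECT 2, 'B' FROM dual UNION ALL
  SELECT 3, 'A' FROM dual UNION ALL
  SELECT 4, 'B' FROM dual
)
SELECT mn, first_id, last_id
FROM t MATCH_RECOGNIZE (
  ORDER BY id
  MEASURES MATCH_NUMBER() AS mn,
           FIRST(id)      AS first_id,
           LAST(id)       AS last_id
  ONE ROW PER MATCH
  PATTERN (X*? Y)          -- reluctant: stop at the first possible Y
  DEFINE Y AS val = 'B'
);
```

The reluctant form should yield two matches (rows 1-2 and rows 3-4). Changing the pattern to (X* Y) makes X greedy, so a single match covering rows 1-4 is expected instead.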
RUNNING vs FINAL Semantics
RUNNING semantics
Includes the rows from the beginning of the match to the currently matched row.
This is the default
Can be used in MEASURES and DEFINE sections
FINAL semantics
Includes all rows in a match
Can be used only in MEASURES
running_vs_final.sql
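A minimal sketch contrasting the two semantics on three made-up rows: the RUNNING sum grows row by row, while the FINAL sum shows the total of the whole match on every row.

```sql
WITH t (id, amt) AS (
  SELECT 1, 10 FROM dual UNION ALL
  SELECT 2, 20 FROM dual UNION ALL
  SELECT 3, 30 FROM dual
)
SELECT id, amt, running_sum, final_sum
FROM t MATCH_RECOGNIZE (
  ORDER BY id
  MEASURES RUNNING SUM(amt) AS running_sum,  -- rows up to the current row
           FINAL   SUM(amt) AS final_sum     -- all rows of the match
  ALL ROWS PER MATCH
  PATTERN (A+)
  DEFINE A AS amt > 0
);
-- running_sum: 10, 30, 60   final_sum: 60, 60, 60
```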
Detecting spikes/drops, and trends
Simple V-Shape with 1 Row Output per Match (Ex. 18-1)
Simple V-Shape with All Rows Output per Match (Ex. 18-2)
Pattern match for a W-Shape (Ex. 18-4)
Pattern match V and U shapes (Ex. 18-11)
Other detectable trends:
Linearly increasing or Linearly decreasing
Increasingly increasing or Increasingly decreasing
Decreasingly increasing or Decreasingly decreasing
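A sketch of the simple V-shape with one row per match, modeled on the form of the guide's ticker example. The inline sample prices are assumptions; STRT, DOWN and UP are the pattern variables.

```sql
WITH ticker (symbol, tstamp, price) AS (
  SELECT 'ACME', DATE '2011-04-01', 12 FROM dual UNION ALL
  SELECT 'ACME', DATE '2011-04-02', 11 FROM dual UNION ALL
  SELECT 'ACME', DATE '2011-04-03', 10 FROM dual UNION ALL
  SELECT 'ACME', DATE '2011-04-04', 13 FROM dual UNION ALL
  SELECT 'ACME', DATE '2011-04-05', 14 FROM dual
)
SELECT symbol, start_tstamp, bottom_tstamp, end_tstamp
FROM ticker MATCH_RECOGNIZE (
  PARTITION BY symbol
  ORDER BY tstamp
  MEASURES STRT.tstamp       AS start_tstamp,   -- where the V begins
           LAST(DOWN.tstamp) AS bottom_tstamp,  -- lowest point
           LAST(UP.tstamp)   AS end_tstamp      -- where the recovery ends
  ONE ROW PER MATCH
  AFTER MATCH SKIP TO LAST UP                   -- a peak can start the next V
  PATTERN (STRT DOWN+ UP+)
  DEFINE
    DOWN AS DOWN.price < PREV(DOWN.price),
    UP   AS UP.price   > PREV(UP.price)
)
ORDER BY symbol, start_tstamp;
```

With these prices a single V is expected, starting on April 1, bottoming on April 3 and ending on April 5. Other trend shapes follow the same recipe with different DEFINE conditions.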
References
Oracle Data Warehousing Guide (12c), Chapter 18
Q&A