This post was originally published on this site

I’ve been playing around with MATCH_RECOGNIZE – the data pattern matching extension to SELECT that was introduced in version 12.

Most examples I’ve seen have used the default AFTER MATCH SKIP PAST LAST ROW as most often the logic dictates, that when we have found a match in a group of rows, we want to search for further matches after those rows to avoid unwanted „double“ matches.

But can there be uses where we want overlapping or even nested matches? Well, I found at least one case where I think it makes perfect sense…

Suppose for the employees in SCOTT.EMP we want the employee hierarchy and for each employee we wish to show how many subordinates that employee has (employees below him in the hierarchy.)

First let’s do a classic hierarchial query on SCOTT.EMP:

select empno
, lpad(' ', (level-1)*2) || ename as ename
, (
select count(*)
from emp sub
start with sub.mgr = emp.empno
connect by sub.mgr = prior sub.empno
) subs
from emp
start with mgr is null
connect by mgr = prior empno
order siblings by empno
/

EMPNO ENAME        SUBS
----- ------------ ----
7839 KING 13
7566 JONES 4
7788 SCOTT 1
7876 ADAMS 0
7902 FORD 1
7369 SMITH 0
7698 BLAKE 5
7499 ALLEN 0
7521 WARD 0
7654 MARTIN 0
7844 TURNER 0
7900 JAMES 0
7782 CLARK 1
7934 MILLER 0

The outer query shows the employee hierarchy, the scalar subquery finds how many subordinates each employee has.

Simple and easy to understand. Now for the MATCH_RECOGNIZE method:

with hierarchy as (
select lvl, empno, ename, rownum as rn
from (
select level as lvl, empno, ename
from emp
start with mgr is null
connect by mgr = prior empno
order siblings by empno
)
)
select empno
, lpad(' ', (lvl-1)*2) || ename as ename
, subs
from hierarchy
match_recognize (
order by rn
measures
strt.rn as rn
, strt.lvl as lvl
, strt.empno as empno
, strt.ename as ename
, count(higher.lvl) as subs
one row per match
after match skip to next row
pattern (
strt higher*
)
define
higher as higher.lvl > strt.lvl
)
order by rn
/

EMPNO ENAME        SUBS
----- ------------ ----
7839 KING 13
7566 JONES 4
7788 SCOTT 1
7876 ADAMS 0
7902 FORD 1
7369 SMITH 0
7698 BLAKE 5
7499 ALLEN 0
7521 WARD 0
7654 MARTIN 0
7844 TURNER 0
7900 JAMES 0
7782 CLARK 1
7934 MILLER 0

Exactly same result, but at the cost of longer code, true. Let’s take a closer look at the code and see if it really is that hard…

We place the base query in a WITH clause just for clarity, so the pattern matching query is easier to read (it could have been an inline view instead if we wanted.)

To preserve the hierarchial ordering, the CONNECT BY query with its ORDER BY clause is wrapped in an inline view, so we can save ROWNUM as rn, and use RN for ordering both in the MATCH_RECOGNIZE clause and the final ORDER BY.

Using the PATTERN clause in line 26 we look for a pattern in the rows that consists of first exactly one STRT row followed by zero or more HIGHER rows. The syntax here is very similar to regular expressions, we are just not matching characters, we are matching rows.

But how does it then know what is a STRT row and what is a HIGHER row? By the DEFINE clause, where we in line 29 define that a HIGHER row is defined as a row whose LVL value is greater than the LVL value of the STRT row.

The STRT row has no definition, so any row can match a STRT row.

This means that the database takes a look a the first row. Can it match STRT? Yes, any row can do that. OK, then it tries second row, can it match HIGHER? Yes, the LVL value is higher than LVL of the STRT row. How about the third? Yes, that too. And so on, actually on to the end, since everybody has a LVL value higher than King…

So the first match has been found with one STRT row (the first one) followed by 13 HIGHER rows. In line 23 we state we only want a single row returned from this match containing the values we define in the MEASURES clause, which is values from the STRT row plus a COUNT of the HIGHER rows. That count is actually the number of subordinates.

So once the first match has been found, line 24 state SKIP TO NEXT ROW. That means NEXT ROW after the first row of the match. So the database moves on to Jones (even though he was a HIGHER row in the first match.)

Can Jones match STRT? Yes, any row can do that. OK, can Scott match HIGHER? Yes, Scott has higher LVL than Jones. And so on until Blake – can Blake match HIGHER? No, Blake does not have a higher LVL than Jones. So the match stops here. Even though there are more rows further down that has a higher LVL than Jones, the pattern defines zero or more consecutive HIGHER rows.

So the second match contains one STRT row (Jones) followed by 4 HIGHER rows (Scott to Smith), and therefore is output a row with Jones having 4 subordinates (the count of HIGHER rows in the match.)

And so it continues for all 14 rows, each of them will start their own match, since they match STRT row, and then each will have the count of HIGHER rows immediately following the STRT row.

We can see in more details how this works if we change from ONE ROW PER MATCH to ALL ROWS PER MATCH:

with hierarchy as (
select lvl, empno, ename, rownum as rn
from (
select level as lvl, empno, ename
from emp
start with mgr is null
connect by mgr = prior empno
order siblings by empno
)
)
select mn
, rn
, empno
, lpad(' ', (lvl-1)*2) || ename as ename
, roll
, subs
, cls
, stno
, stname
, hino
, hiname
from hierarchy
match_recognize (
order by rn
measures
match_number() as mn
, classifier() as cls
, strt.empno as stno
, strt.ename as stname
, higher.empno as hino
, higher.ename as hiname
, count(higher.lvl) as roll
, final count(higher.lvl) as subs
all rows per match
after match skip to next row
pattern (
strt higher*
)
define
higher as higher.lvl > strt.lvl
)
order by mn, rn
/

 MN  RN EMPNO ENAME        ROLL SUBS CLS     STNO STNAME  HINO HINAME
--- --- ----- ------------ ---- ---- ------ ----- ------ ----- ------
1 1 7839 KING 0 13 STRT 7839 KING
1 2 7566 JONES 1 13 HIGHER 7839 KING 7566 JONES
1 3 7788 SCOTT 2 13 HIGHER 7839 KING 7788 SCOTT
1 4 7876 ADAMS 3 13 HIGHER 7839 KING 7876 ADAMS
1 5 7902 FORD 4 13 HIGHER 7839 KING 7902 FORD
1 6 7369 SMITH 5 13 HIGHER 7839 KING 7369 SMITH
1 7 7698 BLAKE 6 13 HIGHER 7839 KING 7698 BLAKE
1 8 7499 ALLEN 7 13 HIGHER 7839 KING 7499 ALLEN
1 9 7521 WARD 8 13 HIGHER 7839 KING 7521 WARD
1 10 7654 MARTIN 9 13 HIGHER 7839 KING 7654 MARTIN
1 11 7844 TURNER 10 13 HIGHER 7839 KING 7844 TURNER
1 12 7900 JAMES 11 13 HIGHER 7839 KING 7900 JAMES
1 13 7782 CLARK 12 13 HIGHER 7839 KING 7782 CLARK
1 14 7934 MILLER 13 13 HIGHER 7839 KING 7934 MILLER
2 2 7566 JONES 0 4 STRT 7566 JONES
2 3 7788 SCOTT 1 4 HIGHER 7566 JONES 7788 SCOTT
2 4 7876 ADAMS 2 4 HIGHER 7566 JONES 7876 ADAMS
2 5 7902 FORD 3 4 HIGHER 7566 JONES 7902 FORD
2 6 7369 SMITH 4 4 HIGHER 7566 JONES 7369 SMITH
3 3 7788 SCOTT 0 1 STRT 7788 SCOTT
3 4 7876 ADAMS 1 1 HIGHER 7788 SCOTT 7876 ADAMS
4 4 7876 ADAMS 0 0 STRT 7876 ADAMS
5 5 7902 FORD 0 1 STRT 7902 FORD
5 6 7369 SMITH 1 1 HIGHER 7902 FORD 7369 SMITH
6 6 7369 SMITH 0 0 STRT 7369 SMITH
7 7 7698 BLAKE 0 5 STRT 7698 BLAKE
7 8 7499 ALLEN 1 5 HIGHER 7698 BLAKE 7499 ALLEN
7 9 7521 WARD 2 5 HIGHER 7698 BLAKE 7521 WARD
7 10 7654 MARTIN 3 5 HIGHER 7698 BLAKE 7654 MARTIN
7 11 7844 TURNER 4 5 HIGHER 7698 BLAKE 7844 TURNER
7 12 7900 JAMES 5 5 HIGHER 7698 BLAKE 7900 JAMES
8 8 7499 ALLEN 0 0 STRT 7499 ALLEN
9 9 7521 WARD 0 0 STRT 7521 WARD
10 10 7654 MARTIN 0 0 STRT 7654 MARTIN
11 11 7844 TURNER 0 0 STRT 7844 TURNER
12 12 7900 JAMES 0 0 STRT 7900 JAMES
13 13 7782 CLARK 0 1 STRT 7782 CLARK
13 14 7934 MILLER 1 1 HIGHER 7782 CLARK 7934 MILLER
14 14 7934 MILLER 0 0 STRT 7934 MILLER

In the MEASURES clause here we use MATCH_NUMBER to show us a unique number for each match that the MATCH_RECOGNIZE clause has found. And CLASSIFIER tells us which part of the PATTERN that each row of the match has matched.

So we can see that match number 1 consists of RN=1 to 14, where RN=1 is a STRT row and the 13 rest of the rows are HIGHER rows. Match number 2 consists of RN=2 to 6, where RN=2 is a STRT row and the other 4 are HIGHER rows. And so on.

Notice that when we use ALL ROWS instead of ONE ROW, the expression COUNT(HIGHER.LVL) becomes a rolling count. It now works similar to the analytic ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. To get the total count, we here need to add the FINAL keyword, which makes it work similar to analytic ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

In all rows of the match, using STRT.xxx in the MEASURES clause gives us the values of the single STRT row. Using HIGHER.xxx shows us values from the actual HIGHER row, therefore NULL for the STRT rows in the output.

We can visualize the rows of each of the 14 matches by pivoting:

with hierarchy as (
select lvl, empno, ename, rownum as rn
from (
select level as lvl, empno, ename
from emp
start with mgr is null
connect by mgr = prior empno
order siblings by empno
)
)
select rn, empno, ename
, case "1" when 1 then 'XX' end "1"
, case "2" when 1 then 'XX' end "2"
, case "3" when 1 then 'XX' end "3"
, case "4" when 1 then 'XX' end "4"
, case "5" when 1 then 'XX' end "5"
, case "6" when 1 then 'XX' end "6"
, case "7" when 1 then 'XX' end "7"
, case "8" when 1 then 'XX' end "8"
, case "9" when 1 then 'XX' end "9"
, case "10" when 1 then 'XX' end "10"
, case "11" when 1 then 'XX' end "11"
, case "12" when 1 then 'XX' end "12"
, case "13" when 1 then 'XX' end "13"
, case "14" when 1 then 'XX' end "14"
from (
select mn
, rn
, empno
, lpad(' ', (lvl-1)*2) || ename as ename
from hierarchy
match_recognize (
order by rn
measures
match_number() as mn
all rows per match
after match skip to next row
pattern (
strt higher*
)
define
higher as higher.lvl > strt.lvl
)
)
pivot (
count(*)
for mn in (
1,2,3,4,5,6,7,8,9,10,11,12,13,14
)
)
order by rn
/

 RN EMPNO ENAME        1  2  3  4  5  6  7  8  9  10 11 12 13 14
--- ----- ------------ -- -- -- -- -- -- -- -- -- -- -- -- -- --
1 7839 KING XX
2 7566 JONES XX XX
3 7788 SCOTT XX XX XX
4 7876 ADAMS XX XX XX XX
5 7902 FORD XX XX XX
6 7369 SMITH XX XX XX XX
7 7698 BLAKE XX XX
8 7499 ALLEN XX XX XX
9 7521 WARD XX XX XX
10 7654 MARTIN XX XX XX
11 7844 TURNER XX XX XX
12 7900 JAMES XX XX XX
13 7782 CLARK XX XX
14 7934 MILLER XX XX XX

14 columns showing which of the rows are part of each of the 14 matches.

Even if we go back to using ONE ROW PER MATCH, we can still include some information from the detail rows in an „aggregate“ manner:

with hierarchy as (
select lvl, empno, ename, rownum as rn
from (
select level as lvl, empno, ename
from emp
start with mgr is null
connect by mgr = prior empno
order siblings by empno
)
)
select empno
, lpad(' ', (lvl-1)*2) || ename as ename
, subs
, hifrom
, hito
, himax
from hierarchy
match_recognize (
order by rn
measures
strt.rn as rn
, strt.lvl as lvl
, strt.empno as empno
, strt.ename as ename
, count(higher.lvl) as subs
, first(higher.ename) as hifrom
, last(higher.ename) as hito
, max(higher.lvl) as himax
one row per match
after match skip to next row
pattern (
strt higher*
)
define
higher as higher.lvl > strt.lvl
)
order by rn
/

EMPNO ENAME        SUBS HIFROM HITO   HIMAX
----- ------------ ---- ------ ------ -----
7839 KING 13 JONES MILLER 4
7566 JONES 4 SCOTT SMITH 4
7788 SCOTT 1 ADAMS ADAMS 4
7876 ADAMS 0
7902 FORD 1 SMITH SMITH 4
7369 SMITH 0
7698 BLAKE 5 ALLEN JAMES 3
7499 ALLEN 0
7521 WARD 0
7654 MARTIN 0
7844 TURNER 0
7900 JAMES 0
7782 CLARK 1 MILLER MILLER 3
7934 MILLER 0

FIRST and LAST can be applied to HIGHER.xxx to give us the from and to of the HIGHER rows of the match.

And just like we have used aggregate function COUNT to get number of subordinates, we can also use aggregate function MAX to find the highest LVL of the subordinates.

Suppose we only want to see those employees that have subordinates? In other words those where SUBS > 0.

Sure, we could do it by wrapping the query above in an inline view and use a WHERE SUBS > 0 clause. But there’s a simpler way with the PATTERN clause as alternative:

with hierarchy as (
select lvl, empno, ename, rownum as rn
from (
select level as lvl, empno, ename
from emp
start with mgr is null
connect by mgr = prior empno
order siblings by empno
)
)
select empno
, lpad(' ', (lvl-1)*2) || ename as ename
, subs
from hierarchy
match_recognize (
order by rn
measures
strt.rn as rn
, strt.lvl as lvl
, strt.empno as empno
, strt.ename as ename
, count(higher.lvl) as subs
one row per match
after match skip to next row
pattern (
strt higher+
)
define
higher as higher.lvl > strt.lvl
)
order by rn
/

EMPNO ENAME        SUBS
----- ------------ ----
7839 KING 13
7566 JONES 4
7788 SCOTT 1
7902 FORD 1
7698 BLAKE 5
7782 CLARK 1

By replacing HIGHER* with HIGHER+ we change definition from zero or more to one or more HIGHER rows, which explicitly states we only want to match where a STRT row is followed by at least one HIGHER row, meaning the employee has at least one subordinate.

But is all this really worth it when we have such a simple query as the one at the top of this page using a scalar subquery?

Well, let’s try to create a slightly bigger employee table – 14.000 subordinates + 1 at the very top of the pyramid:

create table bigemp
as
select 1 empno
, 'LARRY' ename
, cast(null as number) mgr
from dual
union all
select dum.dum*10000+empno empno
, ename || '#' || dum.dum ename
, coalesce(
dum.dum*10000+mgr
, 1
) mgr
from emp
cross join (
select level dum
from dual
connect by level <= 1000
) dum
/

create index bigemp_mgr_eno on bigemp (mgr, empno)
/

begin
dbms_stats.gather_table_stats(USER,'BIGEMP',cascade=>true);
end;
/

So let’s try the pattern matching method:

set autotrace traceonly
set timing on

with hierarchy as (
select lvl, empno, ename, rownum as rn
from (
select level as lvl, empno, ename
from bigemp
start with mgr is null
connect by mgr = prior empno
order siblings by empno
)
)
select empno
, lpad(' ', (lvl-1)*2) || ename as ename
, subs
from hierarchy
match_recognize (
order by rn
measures
strt.rn as rn
, strt.lvl as lvl
, strt.empno as empno
, strt.ename as ename
, count(higher.lvl) as subs
one row per match
after match skip to next row
pattern (
strt higher*
)
define
higher as higher.lvl > strt.lvl
)
order by rn
/


14001 rows selected.

Elapsed: 00:00:00.35

Execution Plan
----------------------------------------------------------
Plan hash value: 443729768

----------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 3 | 237 | 20 (15)| 00:00:01 |
| 1 | SORT ORDER BY | | 3 | 237 | 20 (15)| 00:00:01 |
| 2 | VIEW | | 3 | 237 | 19 (11)| 00:00:01 |
| 3 | MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTOMATON| | 3 | 198 | 19 (11)| 00:00:01 |
| 4 | VIEW | | 3 | 198 | 18 (6)| 00:00:01 |
| 5 | COUNT | | | | | |
| 6 | VIEW | | 3 | 159 | 18 (6)| 00:00:01 |
|* 7 | CONNECT BY NO FILTERING WITH START-WITH | | | | | |
| 8 | TABLE ACCESS FULL | BIGEMP | 14001 | 300K| 17 (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

7 - access("MGR"=PRIOR "EMPNO")
filter("MGR" IS NULL)


Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
55 consistent gets
0 physical reads
0 redo size
435280 bytes sent via SQL*Net to client
10763 bytes received via SQL*Net from client
935 SQL*Net roundtrips to/from client
4 sorts (memory)
0 sorts (disk)
14001 rows processed

Nice, we process all 14.001 rows in less than half a second with a single full table access of BIGEMP.

So how about the scalar subquery method:

select empno
, lpad(' ', (level-1)*2) || ename as ename
, (
select count(*)
from bigemp sub
start with sub.mgr = bigemp.empno
connect by sub.mgr = prior sub.empno
) subs
from bigemp
start with mgr is null
connect by mgr = prior empno
order siblings by empno
/

14001 rows selected.

Elapsed: 00:00:11.61

Execution Plan
----------------------------------------------------------
Plan hash value: 2938165771

----------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 3 | 198 | 27 (4)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 26 | | |
|* 2 | CONNECT BY WITH FILTERING | | | | | |
|* 3 | INDEX RANGE SCAN | BIGEMP_MGR_ENO | 2 | 24 | 2 (0)| 00:00:01 |
| 4 | NESTED LOOPS | | 5 | 125 | 4 (0)| 00:00:01 |
| 5 | CONNECT BY PUMP | | | | | |
|* 6 | INDEX RANGE SCAN | BIGEMP_MGR_ENO | 2 | 24 | 1 (0)| 00:00:01 |
|* 7 | CONNECT BY NO FILTERING WITH START-WITH| | | | | |
| 8 | TABLE ACCESS FULL | BIGEMP | 14001 | 300K| 17 (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("SUB"."MGR"=PRIOR "SUB"."EMPNO")
3 - access("SUB"."MGR"=:B1)
6 - access("SUB"."MGR"="connect$_by$_pump$_009"."prior sub.empno ")
7 - access("MGR"=PRIOR "EMPNO")
filter("MGR" IS NULL)

Note
-----
- this is an adaptive plan


Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
465005 consistent gets
0 physical reads
0 redo size
435280 bytes sent via SQL*Net to client
10763 bytes received via SQL*Net from client
935 SQL*Net roundtrips to/from client
37008 sorts (memory)
0 sorts (disk)
14001 rows processed

Well, well… More than 11 seconds – 30 times slower. One full table access and then a lot of CONNECT BY with INDEX RANGE SCAN in the scalar subquery.

And notice 465.005 consistent gets versus 55 – and 37.008 sorts versus 3 !

I think the numbers speak for themselves – I will go with MATCH_RECOGNIZE for this…

Details on the syntax can be found in the chapter on SQL for Pattern Matching in the 12c Data Warehousing Guide.