This post was originally published on this site

Collation in Oracle 12cR2 gives some more finegrained possibilities for setting up how your data is to sorted and compared linguistically. A lot we could do before with NLS_SORT and NLS_COMP and the likes, but the collation features can both be simpler to use as well as offer more detailed control.

But do some testing for your specific language cases – you might find some small surprises like I did when trying it out with danish.

Here’s how I tried danish column level collation…

To explain a bit to the international readers: the danish alphabet has 29 letters, the standard 26 A-Z followed by three special characters Æ, Ø and Å. In the old days, we didn’t have the Å and instead the same sound was written AA (double-A). Today words in the language all use Å, but quite a few names (people or places) still use AA where others use Å, and you can’t hear the difference when someone pronounces his name whether he spells it Vestergård or Vestergaard.

So for a proper danish sorting of names, AA should be considered identical to the letter Å. Linguistic NLS sorting in Oracle has been capable of this for many versions, collation does not bring any news on this front.

But you can often have a table with some name columns that need to be linguistically sorted and compared, and at the same time some alphabetic key columns (primary or foreign) that need to be binary sorted and compared. You could solve that by putting NLS_SORT in the queries, but with collation you can specify right down to column level how you want the data sorted and compared, so the queries don’t need to do anything but just query the data.

For collation to work, I’ve had to set the database parameter MAX_STRING_SIZE to EXTENDED.

Before the demonstration, I’m first making certain my session is not using Danish NLS sorting (just to be sure not to be accused of cheating 😉 :

alter session set nls_sort = binary
/

For collation, I can use the values that are valid for NLS sorting too. So I can find the possibilities for Danish like this:

select value
from v$nls_valid_values
where parameter = 'SORT'
and value like '%DANISH%'
order by value
/

VALUE                                                          
----------------------------------------------------------------
DANISH
DANISH_M
UCA0610_DANISH
UCA0620_DANISH
UCA0700_DANISH
XDANISH

A little testing showed I want the DANISH_M value (later I’ll show the difference to the DANISH value.)

I create a table specifying the default collation for the table to be BINARY_CI and the specific collation for column CITY to be DANISH_M_CI – this would typify cases where an alphabetic key is used and treated as „generic ascii characters“ without any linguistic properties, while the information carrying column is to be considered in a specific language. In both cases the suffix _CI indicates to the database that I want Case Insensitive collation (since data has been entered in a horrible mishmash of upper and lower case.)

create table stores (
store_id varchar2(5 char) primary key
, city varchar2(20 char) collate danish_m_ci
)
default collation binary_ci
/

insert into stores values ('AA001', 'København')
/
insert into stores values ('AA002', 'Korsør')
/
insert into stores values ('AB001', 'Aarhus')
/
insert into stores values ('AB002', 'Ålestrup')
/
insert into stores values ('BA001', 'karup')
/
insert into stores values ('BA002', 'KYBEHUSE')
/
insert into stores values ('BB001', 'AALBORG')
/
insert into stores values ('BB002', 'andst')
/
insert into stores values ('ÅÅ001', 'AUNING')
/
insert into stores values ('ÅÅ002', 'Kaastrup')
/
commit
/

Column STORE_ID inherits the table default collation, so when I order by STORE_ID, it’s sorted with BINARY_CI collation:

select store_id, city
from stores
order by store_id
/

STORE CITY               
----- --------------------
AA001 København
AA002 Korsør
AB001 Aarhus
AB002 Ålestrup
BA001 karup
BA002 KYBEHUSE
BB001 AALBORG
BB002 andst
ÅÅ001 AUNING
ÅÅ002 Kaastrup

But column CITY has a different collation specified, so when I order by CITY, it’s sorted with DANISH_M_CI collation. The sorting is case insensitive, and the double-A’s are sorted like they were an Å:

select store_id, city
from stores
order by city
/

STORE CITY               
----- --------------------
BB002 andst
ÅÅ001 AUNING
BA001 karup
AA002 Korsør
BA002 KYBEHUSE
AA001 København
ÅÅ002 Kaastrup
BB001 AALBORG
AB002 Ålestrup
AB001 Aarhus

Three cities have been affected by double-A considered identical to Å:

  • Kaastrup sorted as Kåstrup
  • AALBORG sorted as ÅLBORG
  • Aarhus sorted as ÅRHUS

We can inspect the collation used for the columns with COLLATION function:

select store_id, city
, collation(store_id) store_coll
, collation(city) city_coll
from stores
order by store_id
/

STORE CITY                 STORE_COL CITY_COLL 
----- -------------------- --------- -----------
AA001 København BINARY_CI DANISH_M_CI
AA002 Korsør BINARY_CI DANISH_M_CI
AB001 Aarhus BINARY_CI DANISH_M_CI
AB002 Ålestrup BINARY_CI DANISH_M_CI
BA001 karup BINARY_CI DANISH_M_CI
BA002 KYBEHUSE BINARY_CI DANISH_M_CI
BB001 AALBORG BINARY_CI DANISH_M_CI
BB002 andst BINARY_CI DANISH_M_CI
ÅÅ001 AUNING BINARY_CI DANISH_M_CI
ÅÅ002 Kaastrup BINARY_CI DANISH_M_CI

The collation on the column can be overruled with the COLLATE operator (as an alternative to sorting by NLSSORT function.) So we can sort by the STORE_ID column in a DANISH_M_CI collation, so the double-A’s are sorted like they were a single Å:

select store_id, city
from stores
order by store_id collate danish_m_ci
/

STORE CITY               
----- --------------------
AB001 Aarhus
AB002 Ålestrup
BA001 karup
BA002 KYBEHUSE
BB001 AALBORG
BB002 andst
AA001 København
AA002 Korsør
ÅÅ001 AUNING
ÅÅ002 Kaastrup

Overruling does not change the collation of the column itself, it only is the collation of the expression that is affected:

select store_id, city
, collation(store_id) store_coll
, collation(store_id collate danish_m_ci) expr_coll
from stores
order by store_id collate danish_m_ci
/

STORE CITY                 STORE_COL EXPR_COLL 
----- -------------------- --------- -----------
AB001 Aarhus BINARY_CI DANISH_M_CI
AB002 Ålestrup BINARY_CI DANISH_M_CI
BA001 karup BINARY_CI DANISH_M_CI
BA002 KYBEHUSE BINARY_CI DANISH_M_CI
BB001 AALBORG BINARY_CI DANISH_M_CI
BB002 andst BINARY_CI DANISH_M_CI
AA001 København BINARY_CI DANISH_M_CI
AA002 Korsør BINARY_CI DANISH_M_CI
ÅÅ001 AUNING BINARY_CI DANISH_M_CI
ÅÅ002 Kaastrup BINARY_CI DANISH_M_CI

An expression with a COLLATE operator can be given column alias and the alias keeps the collation information:

select store_id, city, store_id_m
, collation(store_id) store_coll
, collation(store_id_m) store_m_coll
from (
select store_id, city
, store_id collate danish_m_ci as store_id_m
from stores
)
order by store_id_m
/

STORE CITY                 STORE STORE_COL STORE_M_COL
----- -------------------- ----- --------- -----------
AB001 Aarhus AB001 BINARY_CI DANISH_M_CI
AB002 Ålestrup AB002 BINARY_CI DANISH_M_CI
BA001 karup BA001 BINARY_CI DANISH_M_CI
BA002 KYBEHUSE BA002 BINARY_CI DANISH_M_CI
BB001 AALBORG BB001 BINARY_CI DANISH_M_CI
BB002 andst BB002 BINARY_CI DANISH_M_CI
AA001 København AA001 BINARY_CI DANISH_M_CI
AA002 Korsør AA002 BINARY_CI DANISH_M_CI
ÅÅ001 AUNING ÅÅ001 BINARY_CI DANISH_M_CI
ÅÅ002 Kaastrup ÅÅ002 BINARY_CI DANISH_M_CI

Using the COLLATE operator to overrule a column collation allows us to see, that the difference between DANISH_CI and DANISH_M_CI is, that although both are case insensitive due to the _CI prefix, DANISH sorts double-A’s as two single A’s (unlike DANISH_M that considers a double-A to be the same as an Å):

select store_id, city
from stores
order by city collate danish_ci
/

STORE CITY               
----- --------------------
BB001 AALBORG
AB001 Aarhus
BB002 andst
ÅÅ001 AUNING
ÅÅ002 Kaastrup
BA001 karup
AA002 Korsør
BA002 KYBEHUSE
AA001 København
AB002 Ålestrup

The COLLATE operator also allows certain pseudo-collations to be used. Unlike a named collation like DANISH_M_CI, the collation USING_NLS_SORT is not a specific collation, but tells the database to use the collation of the currently active NLS settings, in this case BINARY:

select store_id, city
from stores
order by city collate using_nls_sort
/

STORE CITY               
----- --------------------
BB001 AALBORG
ÅÅ001 AUNING
AB001 Aarhus
BA002 KYBEHUSE
ÅÅ002 Kaastrup
AA002 Korsør
AA001 København
BB002 andst
BA001 karup
AB002 Ålestrup

The collations are not just used for sorting, but also for comparison (similar to NLS_COMP.) So when we look for cities containing a U, this comparison is case insensitive:

select store_id, city
from stores
where city like '%U%'
order by city
/

STORE CITY               
----- --------------------
ÅÅ001 AUNING
BA001 karup
BA002 KYBEHUSE
ÅÅ002 Kaastrup
AB002 Ålestrup
AB001 Aarhus

Not just LIKE but other functions as well use the collation information, for example INSTR:

select store_id, city
from stores
where instr(city, 'h') > 0
order by city
/

STORE CITY               
----- --------------------
BA002 KYBEHUSE
AA001 København
AB001 Aarhus

And some comparison operators as well (the sequence of Å’s following the A is because Å is the last letter of the danish alphabet.) This finds cities beginning with an A, and since the column has DANISH_M_CI comparison, double-A’s are not consider as two single-A’s, so those two cities are not found here:

select store_id, city
from stores
where city >= 'a' and city <= 'aååååååååååååååååååå'
order by city
/

STORE CITY               
----- --------------------
BB002 andst
ÅÅ001 AUNING

But using >= and <= comparison to look for cities beginning with an Å finds the two cities with double-A:

select store_id, city
from stores
where city >= 'å' and city <= 'åååååååååååååååååååå'
order by city
/

STORE CITY               
----- --------------------
BB001 AALBORG
AB002 Ålestrup
AB001 Aarhus

The interesting surprise I found, is a bit different behaviour in LIKE operator. If we use LIKE to search for cities beginning with A, we do not get those with double-A’s, so that’s OK:

select store_id, city
from stores
where city like 'a%'
order by city
/

STORE CITY               
----- --------------------
BB002 andst
ÅÅ001 AUNING

Then I would expect them to be part of the result when I use LIKE to search for cities beginning with Å – but no, they are not there either:

select store_id, city
from stores
where city like 'å%'
order by city
/

STORE CITY               
----- --------------------
AB002 Ålestrup

In order to find them with LIKE, I need explicitly to look for two A’s:

select store_id, city
from stores
where city like 'aa%'
order by city
/

STORE CITY               
----- --------------------
BB001 AALBORG
AB001 Aarhus

Using >= and <= involves ordering/sorting to identify what is greater than or less than, so that should work fine, and it does.

Using LIKE could conceptually be thought of as the same (in which case it should have given same result) or it could be thought of something like SUBSTR(city,1,1)=’a‘ (in which case the double-A’s should have been found when searching for cities beginning with A.) The weird part is that LIKE seems to do neither, but has a logic all of its own? I don’t know if that’s expected behaviour or perhaps sort of buggy…

Anyway, my general conclusion is that using the new collation features makes some things easier to do and implement than using NLS_SORT / NLS_COMP. I’d suggest looking into collation if you have an application with those needs.