In MySQL there is a collation utf8_general_ci which provides case-insensitive comparisons in a variety of languages. For example, these are all 1 (true):

SELECT 'ı' = 'I' COLLATE 'utf8_general_ci';
SELECT 'i' = 'I' COLLATE 'utf8_general_ci';
SELECT 'ä' = 'Ä' COLLATE 'utf8_general_ci';

Can I define a similar collation using PostgreSQL's ICU?

I tried

CREATE COLLATION "undefined_ci_nondet_old" (
  PROVIDER = 'icu',
  LOCALE = '@colStrength=secondary',
  DETERMINISTIC = false
);

But that doesn't seem to include the Turkish I/ı conversion:

SELECT 'ı' = 'I' COLLATE undefined_ci_nondet_old; -- false
  • Will you be using the collation with a query? Or is this something you want to set at the database level?
    – matigo
    Commented May 9, 2021 at 22:16
  • Check the post, and especially the comment, at dba.stackexchange.com/a/225570/190821; maybe this helps.
    – nbk
    Commented May 9, 2021 at 22:35
  • @matigo Probably query-time, but either is fine.
    – AndreKR
    Commented May 10, 2021 at 7:56

1 Answer

The dotless I is a special case. It's processed by the ICU collation service with rules that depend on the language.

If the locale referred to the Turkish or Azerbaijani language, it would produce the result that speakers of these languages expect: i and ı are two different letters, with İ and I as their respective uppercase counterparts, and cross-comparisons return false. Otherwise, i is normally treated as the lowercase version of I, whereas ı is not.

postgres=# CREATE COLLATION "undefined_ci_nondet_old" (
  PROVIDER = 'icu',
  LOCALE = 'tr@colStrength=secondary',
  DETERMINISTIC = false
);
CREATE COLLATION


postgres=# SELECT 'ı' = 'I' COLLATE undefined_ci_nondet_old;
 ?column? 
----------
 t
(1 row)

postgres=# select 'i'='İ' COLLATE undefined_ci_nondet_old;
 ?column? 
----------
 t
(1 row)

postgres=# select 'i'='I' COLLATE undefined_ci_nondet_old;
 ?column? 
----------
 f
(1 row)

postgres=# select 'ı'='İ' COLLATE undefined_ci_nondet_old;
 ?column? 
----------
 f
(1 row)
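
For contrast, here is a minimal sketch that puts a language-neutral collation next to the Turkish one. The names ci_root and ci_tr are made up for illustration, and the und-u-ks-level2 locale string is the case-insensitive example from the PostgreSQL documentation rather than something taken from the question; it assumes PostgreSQL 12 or later built with ICU support. The expected results in the comments are limited to what the question, this answer, and the first comment below report, plus the ordinary i/I case pair under the root locale.

```sql
-- Hypothetical collation names; both compare at secondary strength,
-- so case differences are ignored.
CREATE COLLATION ci_root (
  PROVIDER = icu,
  LOCALE = 'und-u-ks-level2',           -- root locale, no language-specific tailoring
  DETERMINISTIC = false
);
CREATE COLLATION ci_tr (
  PROVIDER = icu,
  LOCALE = 'tr@colStrength=secondary',  -- Turkish tailoring, as in the session above
  DETERMINISTIC = false
);

-- Each row runs the same comparison under both collations.
SELECT 'i' = 'I' COLLATE ci_root AS root_result,
       'i' = 'I' COLLATE ci_tr   AS tr_result;   -- root: t, tr: f

SELECT 'ı' = 'I' COLLATE ci_root AS root_result,
       'ı' = 'I' COLLATE ci_tr   AS tr_result;   -- root: f, tr: t

SELECT 'ä' = 'Ä' COLLATE ci_root AS root_result,
       'ä' = 'Ä' COLLATE ci_tr   AS tr_result;   -- root: t, tr: t
```

Read this way, neither collation on its own returns true for all three comparisons listed in the question: the root locale misses ı/I, and the Turkish locale misses i/I.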

  • This is a funny answer. At first I thought it was wrong because clearly switching to the tr locale cannot be the solution to be case-insensitive in all languages, right? But it turns out it is - the Turkish collation works not only for ı/I but also for example for ä/Ä (German), ø/Ø (Norwegian) and ж/Ж (Russian).
    – AndreKR
    Commented May 10, 2021 at 12:01
  • Could you maybe include those other characters in the example output, to make it immediately clear that the solution works not only for the dotless I?
    – AndreKR
    Commented May 10, 2021 at 12:04
  • @AndreKR: I've improved the answer a bit, including the 4 permutations of comparisons.
    Commented May 10, 2021 at 13:13
  • Oh, if 'i'='I' is false, then the collation does not fulfill the purpose of utf8_general_ci.
    – AndreKR
    Commented May 10, 2021 at 16:38
