Equivalent of utf8_general_ci in Postgres/ICU?

Question

In MySQL there is a collation utf8_general_ci which provides case-insensitive comparisons in a variety of languages. For example, these are all 1 (true):

SELECT 'ı' = 'I' COLLATE 'utf8_general_ci';
SELECT 'i' = 'I' COLLATE 'utf8_general_ci';
SELECT 'ä' = 'Ä' COLLATE 'utf8_general_ci';

Can I define a similar collation using PostgreSQL's ICU?

I tried

CREATE COLLATION "undefined_ci_nondet_old" (
  PROVIDER = 'icu',
  LOCALE = "@colStrength=secondary",
  DETERMINISTIC = false
);

But that doesn't seem to include the Turkish I/ı conversion:

SELECT 'ı' = 'I' COLLATE undefined_ci_nondet_old; -- false

Will you be using the collation with a query? Or is this something you want to set at the database level? — matigo, Commented May 9, 2021 at 22:16
Did you check this post, and especially the comment? dba.stackexchange.com/a/225570/190821 Maybe this helps. — nbk, Commented May 9, 2021 at 22:35

Daniel Vérité · Accepted Answer · 2021-05-10 13:12:08Z

The dotless I is a special case. It's processed by the ICU collation service with rules that depend on the language.

If the locale referred to the Turkish or Azerbaijani language, it would produce the result that speakers of these languages might expect: i and ı are two different letters, with İ and I being their respective uppercase counterparts, and cross-comparisons return false. Otherwise, the result is normally that i is the lowercase version of I, whereas ı is not.

postgres=# CREATE COLLATION "undefined_ci_nondet_old" (
  PROVIDER = 'icu',
  LOCALE = 'tr@colStrength=secondary',
  DETERMINISTIC = false
);
CREATE COLLATION


postgres=# SELECT 'ı' = 'I' COLLATE undefined_ci_nondet_old;
 ?column? 
----------
 t
(1 row)

postgres=# select 'i'='İ' COLLATE undefined_ci_nondet_old;
 ?column? 
----------
 t
(1 row)

postgres=# select 'i'='I' COLLATE undefined_ci_nondet_old;
 ?column? 
----------
 f
(1 row)

postgres=# select 'ı'='İ' COLLATE undefined_ci_nondet_old;
 ?column? 
----------
 f
(1 row)
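
To see both behaviours side by side, here is a minimal sketch, assuming PostgreSQL 12 or later built with ICU support and a UTF8 database. The collation names ci_root and ci_tr are made up for illustration. The locale strings 'und-u-ks-level2' and 'tr-u-ks-level2' are the language-tag spelling of secondary strength, assumed here to behave like the @colStrength=secondary keyword used above.

```sql
-- Hypothetical names: ci_root and ci_tr are illustrative only.
CREATE COLLATION ci_root (
  PROVIDER = icu,
  LOCALE = 'und-u-ks-level2',   -- root locale, secondary strength
  DETERMINISTIC = false
);

CREATE COLLATION ci_tr (
  PROVIDER = icu,
  LOCALE = 'tr-u-ks-level2',    -- Turkish tailoring, secondary strength
  DETERMINISTIC = false
);

-- Compare each pair under both collations, one row per pair.
SELECT a, b,
       a = b COLLATE ci_root AS equal_root,
       a = b COLLATE ci_tr   AS equal_tr
FROM (VALUES
  ('i', 'I'),
  ('ı', 'I'),
  ('i', 'İ'),
  ('ı', 'İ'),
  ('ä', 'Ä')
) AS pairs(a, b);
```

Going by the rules described above, one would expect the i/I row to match only under the root locale and the ı/I row only under the Turkish one, so neither collation on its own reproduces all three utf8_general_ci comparisons from the question.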

edited May 10, 2021 at 13:12

answered May 10, 2021 at 11:17

This is a funny answer. At first I thought it was wrong because clearly switching to the tr locale cannot be the solution to be case-insensitive in all languages, right? But it turns out it is - the Turkish collation works not only for ı/I but also for example for ä/Ä (German), ø/Ø (Norwegian) and ж/Ж (Russian). – AndreKR, Commented May 10, 2021 at 12:01
Could you maybe include those other characters in the example output, to make it immediately clear that the solution works not only for the dotless I? – AndreKR, Commented May 10, 2021 at 12:04
@AndreKR: I've improved the answer a bit, including the 4 permutations of comparisons. – Daniel Vérité, Commented May 10, 2021 at 13:13
Oh, if 'i'='I' is false, then the collation does not fulfill the purpose of utf8_general_ci. – AndreKR, Commented May 10, 2021 at 16:38
