I may have been a bit confused here - I thought where the question was "remove spaces" - I believed that it was multiple spaces that were being discussed. Anyway, I've done an answer so here it is - maybe it'll help people down the line.
Solution 1 - remove all commas, spaces and accents.
By far the simplest way of just removing all spaces and commas as well as accents in one pass is to use the TRANSLATE()
function (see fiddle here).
SELECT
TRANSLATE(str, 'áéíóú ,', 'aeiou')
FROM
test;
This solution has one advantage - sometimes employees/consultants are not permitted to install extenstion or the PostgreSQL instance could be cloud-based and the UNACCENT
extension not available - otherwise @ErwinBrandstetter's solution is more performant, even with a double REPLACE()
to eliminate the commas:
EXPLAIN (ANALYZE, BUFFERS, VERBOSE, SETTINGS)
SELECT
REPLACE(REPLACE(unaccent(str), ' ', ''), ',', '')
FROM test;
On my home machine, I created a test
table with 10M records and EB's solution was consistently ~ 25% faster than the TRANSLATE()
solution and ~ 40% faster than the REGEXP_REPLACE()
one - the results on db<>fiddle are normally in line with this, but sometimes (perhaps due a spike in usage), this can change. The perils of benchmarking on system where one is not in control.
Solution where any sequence of spaces is reduced to 1.
However, to remove multiple spaces - leaving one. In my experience, one should never underestimate just how stupid/lazy/distracted/bored people can be when entering free text into an application - by accident a phone is placed on a keyboard and you've 100 commas in no time! I had an address field once which said (something like - "Joe will meet you for a sandwich later.")!
So, what I did was the following (all of the code below is available on the fiddle here):
CREATE TABLE test
(
id INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY,
str TEXT NULL
);
Populate it with some strings:
INSERT INTO test (str) VALUES
('A normal string'), -- No commas, starts with a letter, unaccented
(' first , second , third '), -- starts with spaces, commas, more spaces, no accents
(',,, A Náid ,,, ,,, , a h-Aon , ,,, a Dó ,,,, A Trí , ' );
-- the last string has commas, accents and spaces all over the place
Now, if you can't and/or are not permitted to install PostgreSQL extensions, you can do the following using the REPLACE
(manual) function:
SELECT
REPLACE
(
REPLACE
(
REPLACE(str, 'á', 'a'),
'í', 'i'
), 'ó', 'o'
) -- &c. for all the accents in your languages
str
FROM
test;
Result (for the accented string):
A Naid ,,, ,,, , a h-Aon , ,,, a Do ,,,, A Tri
The accents are gone - needless to say, this is error-prone and tiresome. Way better to use the UNNACCENT
extension - it's part of the contrib modules anyway - that's almost "core"!
Then the following code is run:
SELECT
REGEXP_REPLACE
(
BTRIM(REPLACE(UNACCENT(str), ',', '')),
'[ ]{2,}', ' ', 'g'
),
LENGTH(
REGEXP_REPLACE
(
BTRIM(REPLACE(UNACCENT(str), ',', '')),
'[ ]{2,}', ' ', 'g'
))
FROM
test;
The reason that the REGEXP_REPLACE()
has to be run is because there could be more than one space (could be 2, 3 or more) in sequence (remember the remarks about free text entry above). The BTRIM()
function removes any space characters before or after the strings - the LENGTH()
function is just used to check the results.
regexp_replace length
A normal string 15
first second third 18
A Naid a h-Aon a Do A Tri 25
The regexp pattern explained:
[ ]
- the square brackets indicate a character class - in this case, it only contains the space character
{2,}', ' '
is the quantifier and the replacement - this is telling REGEXP_REPLACE()
to replace 2 or more space characters with a single space
'g'
specifies the global option - do this throughout the string
In the fiddle (not shown for brevity), I have used REPLACE()
to remove supplmentary spaces, but it doesn't work correctly as can be seen from the strings themselves (between the words) and with the lengths.
There is another way of doing this (a bit contrived - but useful to see some of PostgreSQL's other functionality):
WITH change1 AS
(
SELECT
id,
REGEXP_SPLIT_TO_TABLE
(
REPLACE(UNACCENT(BTRIM(str)), ',', ''), ' '
) AS str1
FROM
test
)
SELECT
id,
STRING_AGG(str1, ' ' ORDER BY id),
LENGTH(STRING_AGG(str1, ' ' ORDER BY id))
FROM change1
WHERE
LENGTH(str1) >= 1
GROUP BY id
ORDER BY id;
Result - same.
The two interesting functions here are REGEXP_SPLIT_TO_TABLE()
and STRING_AGG()
. Try taking apart the query and seeing how it all fits together.
It would appear from the performance analysis that I did that the first query is (unsurprisingly) faster than the second (longer) one - but YMMV - test on your own systems and data. The usual caveats about testing performance on a system with only 3 records and over which you have no control at any given moment apply here.