Remove spaces and replace characters with regexp_replace()

Question

I wish to ...

... remove spaces
... delete apostrophes
... replace 'é' and 'è' with 'e'

I use the function regexp_replace().

For the moment, I can delete the spaces but poorly. Indeed, when the attribute contains several spaces only one is deleted.

I can't process 1., 2., and 3. at the same time. Is this possible?

Below is a link to my code:

https://dbfiddle.uk/22ODtpNS

Please remember to include essential code as formatted text in the question. And always disclose your version of Postgres. — Erwin Brandstetter, Commented May 25 at 22:59

Erwin Brandstetter · Accepted Answer · 2024-05-25 22:52:39Z

Install the additional module unaccent, which (among few other things) provides the eponymous function unaccent(). It ...

removes accents (diacritic signs) from a given string

Then:

SELECT replace(unaccent(compar), ' ', '')
FROM   tbl;

replace() removes all occurrences of a pattern string, not just the first. No regexp magic needed here. See below.

Your UPDATE could be:

UPDATE tbl
SET    compar = replace(unaccent(compar), ' ', '')
WHERE  compar ~ ' ';  -- ignores strings with accents but no spaces (?!)

fiddle

Indeed, when the attribute contains several spaces only one is deleted.

You could modify the behavior of regexp_replace() by adding 'g' as fourth parameter, like:

regexp_replace(compar, ' ', '', 'g')

See:

Regex to remove multiple characters

For the simple case, that's just an expensive way of saying:

replace(compar, ' ', '')

Instead of replace() or regexp_replace(), you could use translate(..., ' ''', ''). — Laurenz Albe, Commented May 26 at 20:11

fcka · Accepted Answer · 2024-05-26 20:30:28Z

1

for the apostrophe:

select replace (compar,CHR(39),'_') dans my_table ;

works

answered May 26 at 20:30

fcka

1255 bronze badges

1

You could use the TRANSLATE() function to simultaneously substitute accented characters - see my revised answer above. You can vary the accented character list for your own language. The advantage it has is an extension is not required!
– Vérace
Commented May 27 at 17:46

Add a comment |

Vérace · Accepted Answer · 2024-05-27 17:44:29Z

I may have been a bit confused here - I thought where the question was "remove spaces" - I believed that it was multiple spaces that were being discussed. Anyway, I've done an answer so here it is - maybe it'll help people down the line.

Solution 1 - remove all commas, spaces and accents.

By far the simplest way of just removing all spaces and commas as well as accents in one pass is to use the TRANSLATE() function (see fiddle here).

SELECT
  TRANSLATE(str, 'áéíóú ,', 'aeiou')
FROM 
  test;

This solution has one advantage - sometimes employees/consultants are not permitted to install extenstion or the PostgreSQL instance could be cloud-based and the UNACCENT extension not available - otherwise @ErwinBrandstetter's solution is more performant, even with a double REPLACE() to eliminate the commas:

EXPLAIN (ANALYZE, BUFFERS, VERBOSE, SETTINGS)
SELECT 
  REPLACE(REPLACE(unaccent(str), ' ', ''), ',', '')
FROM  test;

On my home machine, I created a test table with 10M records and EB's solution was consistently ~ 25% faster than the TRANSLATE() solution and ~ 40% faster than the REGEXP_REPLACE() one - the results on db<>fiddle are normally in line with this, but sometimes (perhaps due a spike in usage), this can change. The perils of benchmarking on system where one is not in control.

Solution where any sequence of spaces is reduced to 1.

However, to remove multiple spaces - leaving one. In my experience, one should never underestimate just how stupid/lazy/distracted/bored people can be when entering free text into an application - by accident a phone is placed on a keyboard and you've 100 commas in no time! I had an address field once which said (something like - "Joe will meet you for a sandwich later.")!

So, what I did was the following (all of the code below is available on the fiddle here):

CREATE TABLE test
(
  id INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY,
  str TEXT NULL
);

Populate it with some strings:

INSERT INTO test (str) VALUES

('A normal string'),     -- No commas, starts with a letter, unaccented
('    first  ,  second   ,     third   '), -- starts with spaces, commas, more spaces, no accents
(',,,  A Náid ,,,   ,,,      ,    a h-Aon    , ,,, a Dó ,,,,   A Trí   ,  ' );

-- the last string has commas, accents and spaces all over the place

Now, if you can't and/or are not permitted to install PostgreSQL extensions, you can do the following using the REPLACE (manual) function:

SELECT
  REPLACE
  (
    REPLACE
    (
      REPLACE(str, 'á', 'a'), 
      'í', 'i'
    ), 'ó', 'o' 
  )  -- &c. for all the accents in your languages
  str
FROM 
  test;

Result (for the accented string):

A Naid ,,,   ,,,      ,    a h-Aon    , ,,, a Do ,,,,   A Tri

The accents are gone - needless to say, this is error-prone and tiresome. Way better to use the UNNACCENT extension - it's part of the contrib modules anyway - that's almost "core"!

Then the following code is run:

SELECT
  REGEXP_REPLACE 
  (
    BTRIM(REPLACE(UNACCENT(str), ',', '')), 
    '[ ]{2,}', ' ', 'g'
  ),

  LENGTH(
    REGEXP_REPLACE 
  (
    BTRIM(REPLACE(UNACCENT(str), ',', '')), 
    '[ ]{2,}', ' ', 'g'
  ))
FROM
  test;

The reason that the REGEXP_REPLACE() has to be run is because there could be more than one space (could be 2, 3 or more) in sequence (remember the remarks about free text entry above). The BTRIM() function removes any space characters before or after the strings - the LENGTH() function is just used to check the results.

regexp_replace              length
A normal string                 15
first second third              18
A Naid a h-Aon a Do A Tri       25

The regexp pattern explained:

[ ] - the square brackets indicate a character class - in this case, it only contains the space character
{2,}', ' ' is the quantifier and the replacement - this is telling REGEXP_REPLACE() to replace 2 or more space characters with a single space
'g' specifies the global option - do this throughout the string

In the fiddle (not shown for brevity), I have used REPLACE() to remove supplmentary spaces, but it doesn't work correctly as can be seen from the strings themselves (between the words) and with the lengths.

There is another way of doing this (a bit contrived - but useful to see some of PostgreSQL's other functionality):

WITH change1 AS 
(
  SELECT
    id,
    REGEXP_SPLIT_TO_TABLE
    (
      REPLACE(UNACCENT(BTRIM(str)), ',', ''), ' '
    ) AS str1
  FROM
    test
 
)
SELECT
  id,
  STRING_AGG(str1, ' ' ORDER BY id),
  LENGTH(STRING_AGG(str1, ' ' ORDER BY id))
FROM change1
WHERE 
  LENGTH(str1) >= 1
GROUP BY id
ORDER BY id;

Result - same.

The two interesting functions here are REGEXP_SPLIT_TO_TABLE() and STRING_AGG(). Try taking apart the query and seeing how it all fits together.

It would appear from the performance analysis that I did that the first query is (unsurprisingly) faster than the second (longer) one - but YMMV - test on your own systems and data. The usual caveats about testing performance on a system with only 3 records and over which you have no control at any given moment apply here.

Stack Exchange Network

Remove spaces and replace characters with regexp_replace()

3 Answers 3

Solution 1 - remove all commas, spaces and accents.

Solution where any sequence of spaces is reduced to 1.

Not the answer you're looking for? Browse other questions tagged
postgresql
regular-expression
string-manipulation
unaccent
or ask your own question.

Hot Network Questions

Remove spaces and replace characters with regexp_replace()

3 Answers 3

Solution 1 - remove all commas, spaces and accents.

Solution where any sequence of spaces is reduced to 1.

Not the answer you're looking for? Browse other questions tagged postgresqlregular-expressionstring-manipulationunaccent or ask your own question.

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
postgresql
regular-expression
string-manipulation
unaccent
or ask your own question.