I'm looking for a SQL statement to count the number of unique characters in a string.

e.g.

3333333333 -> returns 1
1113333333 -> returns 2
1112222444 -> returns 3

I ran some tests with REGEX and MySQL's string functions, but I couldn't find a solution.

  • Do you really think this is a job for MySQL?
    – N.B.
    Commented Apr 30, 2015 at 12:13
  • Maybe this post can help you sort it out. [1] stackoverflow.com/questions/12344795/…
    – avk
    Commented Apr 30, 2015 at 12:21
  • You need to write a block of code to do this; give me some time and I will do it.
    – Anvesh
    Commented Apr 30, 2015 at 12:29

7 Answers

This is for fun, right?

SQL is all about processing sets of rows, so if we can convert a 'word' into a set of characters as rows then we can use the 'group' functions to do useful stuff.

Using a 'relational database engine' to do simple character manipulation feels wrong. Still, is it possible to answer your question with just SQL? Yes it is...

I always keep a table with a single integer column holding the ascending sequence 1 .. 500 (about 500 rows). It is called 'integerseries'. It is a really small table that is used a lot, so it stays cached in memory. It is designed to replace the 'select 1 ... union ...' text in queries.

It is useful for generating sequential rows (a table) of anything that you can calculate from an integer, by using it in a cross join (or any inner join). I use it for generating the days of a year, parsing comma-delimited strings, etc.
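As an illustration of the 'sequential rows' idea, here is a minimal sketch that turns the integers into calendar days. It assumes the 'integerseries' table described above exists and is populated; the start date, the limit of 7 and the alias calendar_day are arbitrary choices for the example.

```sql
-- Assumes integerseries(id) holds 1, 2, 3, ... as described above.
-- Each integer becomes an offset in days from an arbitrary start date.
SELECT DATE '2015-01-01' + INTERVAL (s.id - 1) DAY AS calendar_day
FROM integerseries AS s
WHERE s.id <= 7
ORDER BY s.id;
```

The same pattern works for any value that can be computed from the row number, which is what the character query relies on.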

The SQL MID function can be used to return the character at a given position. By using the 'integerseries' table I can 'easily' convert a 'word' into a table of characters with one row per character, and then use the 'group' functions...

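SET @word = 'Hello World';

-- One row per character position, then group identical characters.
SELECT charAtIdx, COUNT(charAtIdx)
FROM (SELECT charIdx.id,
    MID(@word, charIdx.id, 1) AS charAtIdx
    FROM integerseries AS charIdx
    -- CHAR_LENGTH counts characters; LENGTH counts bytes and overshoots on multi-byte text
    WHERE charIdx.id <= CHAR_LENGTH(@word)
    ) wordLetters
GROUP BY
   wordLetters.charAtIdx
ORDER BY charAtIdx ASC;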

Output:

charAtIdx  count(charAtIdx)  
---------  ------------------
                            1
d                           1
e                           1
H                           1
l                           3
o                           2
r                           1
W                           1

Note: The number of rows in the output is the number of different characters in the string. So, if the number of output rows is counted then the number of 'different letters' will be known.

This observation is used in the final query.
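A quick way to check that observation on a single string is to let COUNT(DISTINCT ...) do the row counting. This is a minimal sketch, assuming the 'integerseries' table exists and holds at least as many rows as the string has characters; the alias distinctChars is made up for the example.

```sql
SET @word = 'Hello World';

-- One row per character position; COUNT(DISTINCT ...) collapses repeats.
SELECT COUNT(DISTINCT MID(@word, s.id, 1)) AS distinctChars
FROM integerseries AS s
WHERE s.id <= CHAR_LENGTH(@word);
```

For 'Hello World' the result should agree with the eight rows listed in the output above (the space counts as a character).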

The final query:

The interesting point here is to move the 'integerseries' 'cross join' restrictions (1 .. length(word)) into the actual 'join' rather than do it in the where clause. This provides the optimizer with clues as to how to restrict the data produced when doing the join.

SELECT 
   wordLetterCounts.wordId,
   wordLetterCounts.word,   
   COUNT(wordLetterCounts.wordId) AS letterCount
FROM 
     -- One row per (word, distinct character)
     (SELECT words.id AS wordId,
             words.word AS word,
             MIN(iseq.id) AS charPos,   -- first position of the character; keeps ONLY_FULL_GROUP_BY happy
             MID(words.word, iseq.id, 1) AS charAtPos,
             COUNT(MID(words.word, iseq.id, 1)) AS charAtPosCount
     FROM
          words
          JOIN integerseries AS iseq
               ON iseq.id BETWEEN 1 AND words.wordlen 
      GROUP BY
            words.id,
            words.word,
            MID(words.word, iseq.id, 1)
      ) AS wordLetterCounts
-- Counting those rows per word gives the number of distinct characters
GROUP BY
   wordLetterCounts.wordId,
   wordLetterCounts.word;

Output:

wordId  word                  letterCount  
------  --------------------  -------------
     1  3333333333                        1
     2  1113333333                        2
     3  1112222444                        3
     4  Hello World                       8
     5  funny - not so much?             13
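One caveat worth checking: the 'word' column in the table definition below uses utf8mb4_unicode_ci, a case-insensitive collation, so grouping treats 'A' and 'a' as the same character. If case should matter, a binary collation can be forced on the extracted character. The following is a sketch of that variation, not part of the original query; it also uses COUNT(DISTINCT ...) and CHAR_LENGTH instead of the stored 'wordlen' column, which are choices made for this example.

```sql
-- Assumes the words and integerseries tables from this answer.
SELECT w.id   AS wordId,
       w.word AS word,
       -- utf8mb4_bin compares code points, so 'A' and 'a' count separately
       COUNT(DISTINCT MID(w.word, s.id, 1) COLLATE utf8mb4_bin) AS letterCount
FROM words AS w
JOIN integerseries AS s
     ON s.id BETWEEN 1 AND CHAR_LENGTH(w.word)
GROUP BY w.id, w.word;
```

For the five sample rows the counts should be unchanged, since none of them contains the same letter in both upper and lower case.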

Word Table and Data:

CREATE TABLE `words` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `word` varchar(128) COLLATE utf8mb4_unicode_ci NOT NULL,
  `wordlen` int(11) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=6 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

/*Data for the table `words` */

insert  into `words`(`id`,`word`,`wordlen`) values (1,'3333333333',10);
insert  into `words`(`id`,`word`,`wordlen`) values (2,'1113333333',10);
insert  into `words`(`id`,`word`,`wordlen`) values (3,'1112222444',10);
insert  into `words`(`id`,`word`,`wordlen`) values (4,'Hello World',11);
insert  into `words`(`id`,`word`,`wordlen`) values (5,'funny - not so much?',20);

Integerseries table: range 1 .. 30 for this example.

CREATE TABLE `integerseries` (
  `id` int(11) unsigned NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=500 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
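The answer does not show how the 'integerseries' table gets its rows. One possible way, sketched here under the assumption that MySQL 8.0 or later is available (recursive common table expressions do not exist in earlier versions), is to generate the sequence in the INSERT itself. The CTE name seq and its column n are made up for the example.

```sql
-- Requires MySQL 8.0 or later for WITH RECURSIVE.
-- Fills integerseries with 1 .. 30; raise the bound for longer strings.
INSERT INTO integerseries (id)
WITH RECURSIVE seq (n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM seq WHERE n < 30
)
SELECT n FROM seq;
```

MySQL limits recursion through the cte_max_recursion_depth system variable (1000 by default), so a 500-row series fits without changing any settings.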

There is no direct or easy way of doing it. You may need to write a stored function that checks for every character you expect in the data. Here is an example for just digits, which could be extended to all the characters inside a stored function:

mysql> select * from test ;
+------------+
| val        |
+------------+
| 11111111   |
| 111222222  |
| 1113333222 |
+------------+


-- Each CASE contributes 1 when the digit occurs in val, otherwise 0.
-- No SUM/GROUP BY: duplicate val rows would otherwise be added together.
select 
val, 
case when locate('1',val) > 0 then 1 else 0 end
+ case when locate('2',val) > 0 then 1 else 0 end
+ case when locate('3',val) > 0 then 1 else 0 end
+ case when locate('4',val) > 0 then 1 else 0 end as occurence
from test;


+------------+-----------+
| val        | occurence |
+------------+-----------+
| 11111111   |         1 |
| 111222222  |         2 |
| 1113333222 |         3 |
+------------+-----------+

Or, if you have enough time, create a lookup table with all the characters you can think of, and write the query in two lines:

mysql> select * from test ;
+------------+
| val        |
+------------+
| 11111111   |
| 111222222  |
| 1113333222 |
+------------+
3 rows in set (0.00 sec)

mysql> select * from look_up ;
+------+------+
| id   | val  |
+------+------+
|    1 | 1    |
|    2 | 2    |
|    3 | 3    |
|    4 | 4    |
+------+------+
4 rows in set (0.00 sec)

select 
t1.val, 
sum(case when locate(t2.val,t1.val) > 0 then 1 else 0 end ) as occ 
from test t1 cross join look_up t2 
group by t1.val ;

+------------+------+
| val        | occ  |
+------------+------+
| 11111111   |    1 |
| 111222222  |    2 |
| 1113333222 |    3 |
+------------+------+
  • Too much hardcoding in the query! What if the string has [A-Z], [a-z], [@#$%^], etc.?
    – Vikrant
    Commented Apr 30, 2015 at 12:26
  • @Vikrant So what do you think would be the easiest way to get the job done in MySQL? Post it as an answer.
    Commented Apr 30, 2015 at 12:28
  • This is not a solution. We need to write a block of code to get this result. Your solution is not valid because it requires hardcoding.
    – Anvesh
    Commented Apr 30, 2015 at 12:28
  • @Anvesh As I mentioned in my answer, there are no functions available in MySQL to do this job; however, if you still want to do it in MySQL, here is one way. If you know a better way, add it. Note that the better way is not to use PHP or some other language, for which I could give hundreds of examples; it should be in MySQL.
    Commented Apr 30, 2015 at 12:30
  • The whole premise is dumb. I don't even know why you guys criticize Abhik here for providing a solution for the exact input parameters given by the OP. MySQL is not a tool for this job; there's not even a discussion to be had about it.
    – N.B.
    Commented Apr 30, 2015 at 12:34

One thing you can do is have a table of all your characters, such as:

mysql> select * from chars;
+----+------+
| id | c    |
+----+------+
|  1 | 1    |
|  2 | 2    |
|  3 | 3    |
|  4 | 4    |
+----+------+

If your table of words looks like this:

mysql> select * from words;
+----+-----------+
| id | word      |
+----+-----------+
|  1 | 111222333 |
|  2 | 11111111  |
|  3 | 2222111   |
|  4 | 5555555   |
+----+-----------+

You can then join these tables on the condition of the character being in the word, and get the count, like this:

mysql> select word, count(c) from words w inner join chars c on locate(c.c, word) group by word;
+-----------+----------+
| word      | count(c) |
+-----------+----------+
| 11111111  |        1 |
| 111222333 |        3 |
| 2222111   |        2 |
+-----------+----------+
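Note that '5555555' is missing from the result: the inner join drops it because '5' is not in the chars table. A sketch of a variation that keeps such rows, assuming the same two tables as above (the alias distinct_chars is made up here):

```sql
-- LEFT JOIN keeps words with no matching character;
-- COUNT(c.c) ignores the NULLs produced for them.
SELECT w.word, COUNT(c.c) AS distinct_chars
FROM words AS w
LEFT JOIN chars AS c
       ON LOCATE(c.c, w.word) > 0
GROUP BY w.id, w.word;
```

With the sample data, '5555555' would then appear with a count of 0, which signals that the chars table is missing a character rather than that the word is empty.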
  • It's a way to do it. Your stored procedure is broken, so stop trying to belittle other, working answers.
    – pala_
    Commented Apr 30, 2015 at 13:04

I think this is not a job for MySQL, but you can do anything if you try hard enough ;)

I don't like this answer, but it works, and it's not too ugly if you have only numbers:

SELECT 
    CASE WHEN yourcolumn LIKE '%1%' THEN 1 ELSE 0 END + 
    CASE WHEN yourcolumn LIKE '%2%' THEN 1 ELSE 0 END +
    CASE WHEN yourcolumn LIKE '%3%' THEN 1 ELSE 0 END + 
    CASE WHEN yourcolumn LIKE '%4%' THEN 1 ELSE 0 END +
    CASE WHEN yourcolumn LIKE '%5%' THEN 1 ELSE 0 END +
    CASE WHEN yourcolumn LIKE '%6%' THEN 1 ELSE 0 END +
    CASE WHEN yourcolumn LIKE '%7%' THEN 1 ELSE 0 END +
    CASE WHEN yourcolumn LIKE '%8%' THEN 1 ELSE 0 END +
    CASE WHEN yourcolumn LIKE '%9%' THEN 1 ELSE 0 END +
    CASE WHEN yourcolumn LIKE '%0%' THEN 1 ELSE 0 END
FROM yourtable
I'm very new to declaring MySQL functions, but this might point you in the right direction:

DROP FUNCTION IF EXISTS test.count_chrs;
-- DELIMITER is a mysql client command; it lets the body contain semicolons
DELIMITER //
CREATE DEFINER=`test`@`localhost` FUNCTION test.count_chrs(s VARCHAR(100)) RETURNS INT
  DETERMINISTIC
  BEGIN 
    DECLARE string_length INT;
    -- Characters seen so far; VARCHAR so that a stored space character is preserved
    DECLARE unique_string VARCHAR(100) DEFAULT '';
    DECLARE count_unique INT DEFAULT 0;
    DECLARE current_char INT DEFAULT 1;
    SET string_length = CHAR_LENGTH(s);

    WHILE current_char <= string_length DO
      -- LOCATE returns 0 when the character has not been seen yet
      IF LOCATE(SUBSTR(s, current_char, 1), unique_string) = 0 THEN
        SET count_unique = count_unique + 1;
        SET unique_string = CONCAT(unique_string, SUBSTR(s, current_char, 1));
      END IF;

      SET current_char = current_char + 1;
    END WHILE;

    RETURN count_unique; 
  END //
DELIMITER ;
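A short usage sketch, assuming the function has been created in the test schema as above. The three inputs are the examples from the question; the column aliases are arbitrary.

```sql
-- Assumes test.count_chrs exists (see the definition above).
SELECT test.count_chrs('3333333333') AS a,
       test.count_chrs('1113333333') AS b,
       test.count_chrs('1112222444') AS c;
```

If the function works as intended, the three columns should come back as 1, 2 and 3, matching the question.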

  • I can't be bothered testing it, but this looks like it's on the right track. The only thing I'd say is that the function returns an integer, not a char as you state in the definition.
    – pala_
    Commented Apr 30, 2015 at 13:19

There are a few levels of subquery, which may put some people off, and it would need to be extended for columns with longer strings, but by turning the data on its side using UNPIVOT it's quite straightforward.

declare @Data table (RowID nvarchar(1), StringData nvarchar(10))
insert into @Data values (N'1', N'3333333333'),(N'2', N'1113333333'),(N'3', N'1112222444')

select  t1.StringData, cast(t2.CharCount as nvarchar) as 'Unique Characters in String'
from    @Data t1
        inner join (
            select  RowID,count(*) as 'CharCount'
            from    (
                    select  distinct RowID, [char]
                    from    (
                        select  RowID,
                            substring(StringData,1,1) as '1',
                            substring(StringData,2,1) as '2',
                            substring(StringData,3,1) as '3',
                            substring(StringData,4,1) as '4',
                            substring(StringData,5,1) as '5',
                            substring(StringData,6,1) as '6',
                            substring(StringData,7,1) as '7',
                            substring(StringData,8,1) as '8',
                            substring(StringData,9,1) as '9',
                            substring(StringData,10,1) as '10'
                        from    @Data
                        ) Unpivd
                    unpivot ( [char] for chars in ([1],[2],[3],[4],[5],[6],[7],[8],[9],[10])) unpiv
                    where [char] <> ''
                ) CharCounter
            group by RowID
            ) t2
            on t2.RowID = t1.RowID

This returns:

StringData  Unique Characters in String
3333333333  1
1113333333  2
1112222444  3
  • Spotted after the fact that this was a MySQL question, not T-SQL as per my example.
    Commented May 25, 2018 at 16:04

There isn't a single statement that I know of that will count the distinct characters in a string, but the code below achieves the desired outcome. The value stored in @passedstr is the string to be examined. The result of the query contains a number in the 'cnt' column, which is the number of distinct characters in the string. The examples below are written for MySQL and SQL Server. You could use this idea in a function to have one statement that performs the count.

MySQL:

-- Requires MySQL 8.0 or later for WITH RECURSIVE.
set @passedstr = 'abbcdeffffff';

with recursive ccount (idx, pstr) AS (
    select 1, substring(@passedstr, 1, 1) union all
    select idx + 1, substring(@passedstr, idx + 1, 1) from ccount where idx < char_length(@passedstr))

select count(distinct (pstr)) cnt from ccount;

SQL Server:

declare @passedstr as nvarchar(4000) = 'abbcdeffffff';

with ccount as (
    select 1 idx, substring(@passedstr, 1, 1) pstr union all
    select idx + 1, substring(@passedstr, idx + 1, 1) from ccount where idx < len(@passedstr))

select count(distinct pstr) cnt from ccount
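To apply the MySQL version to a whole column rather than a single variable, the recursion can carry a row identifier along. This is a minimal sketch, assuming MySQL 8.0 or later; the inline 'sample' rows stand in for a real table, and all names other than ccount, idx, pstr and cnt are invented for the example.

```sql
WITH RECURSIVE
sample (id, word) AS (
    -- Stand-in data: the three strings from the question
    SELECT 1, '3333333333' UNION ALL
    SELECT 2, '1113333333' UNION ALL
    SELECT 3, '1112222444'
),
ccount (id, word, idx, pstr) AS (
    -- Anchor: first character of every non-empty word
    SELECT id, word, 1, SUBSTRING(word, 1, 1)
    FROM sample
    WHERE CHAR_LENGTH(word) > 0
    UNION ALL
    -- Step: next character, until the end of that word
    SELECT id, word, idx + 1, SUBSTRING(word, idx + 1, 1)
    FROM ccount
    WHERE idx < CHAR_LENGTH(word)
)
SELECT id, word, COUNT(DISTINCT pstr) AS cnt
FROM ccount
GROUP BY id, word
ORDER BY id;
```

For the three sample strings the cnt column should read 1, 2 and 3. Strings longer than the cte_max_recursion_depth setting (1000 by default) would need that limit raised.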