103

I've migrated a database from MySQL to SQL Server (for political reasons); the original MySQL database used UTF-8.

Now I read at https://dba.stackexchange.com/questions/7346/sql-server-2005-2008-utf-8-collation-charset that SQL Server 2008 doesn't support UTF-8. Is this a joke?

The SQL Server instance hosts multiple databases, mostly using Latin collations. Since the migrated database is intended for web publishing, I want to keep the UTF-8 encoding. Have I missed something, or do I need to encode/decode at the application level?

3
  • Actually there is no problem with using UTF-8: you can create a SQLCLR UDF, and the code can be downloaded from Microsoft. Check this link: technet.microsoft.com/en-us/library/ms160893(v=sql.90).aspx Commented Jan 5, 2015 at 23:26
  • 1
    Worth noting that Java, JavaScript, .NET, and Windows all use UTF-16 internally, so if your website is to be coded in any of those, then you are saving a conversion to UTF-16.
    – Ben
    Commented Feb 4, 2016 at 12:20
  • 2
    If it helps, you can pretend that SQL Server internally stores text as UTF-8, and the driver converts it back to UTF-16 when you retrieve it. You could also pretend that SQL Server stores the text as UTF-32. The internal storage format is an irrelevant implementation detail. What is important is that databases return Unicode character data as UTF-16 (matching your programming environment).
    – Ian Boyd
    Commented Mar 17, 2016 at 20:45

5 Answers

47

UTF-8 is not a character set; it's an encoding. The character set for UTF-8 is Unicode. If you want to store Unicode text, you use the nvarchar data type.

Even if the database used UTF-8 to store text internally, you still would not get the text out as UTF-8 encoded data; you would get it out as decoded text.

You can easily store UTF-8 encoded text in the database, but then you don't store it as text; you store it as binary data (varbinary).
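A minimal sketch of the two options, with hypothetical table and column names:

CREATE TABLE dbo.Messages
(
    BodyText NVARCHAR(MAX),  -- stored as text; SQL Server handles the internal encoding
    BodyUtf8 VARBINARY(MAX)  -- raw UTF-8 bytes, opaque to SQL Server
);

-- The application does the UTF-8 encoding before writing the binary column:
INSERT INTO dbo.Messages (BodyText, BodyUtf8)
VALUES (N'Dauðalogn', 0x446175C3B0616C6F676E);  -- the same string, UTF-8 encoded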

5
  • Thanks for the input. There's more to learn around this; see for example stackoverflow.com/questions/3951722/…
    – Teson
    Commented Sep 2, 2015 at 9:47
    I can't get my head around this. "The character set for UTF-8 is Unicode"?? Isn't UTF-8 way broader than Unicode? Saving Dauðalogn in Unicode vs. UTF-8 gives different results: (EF BB BF) 44 61 75 C3 B0 61 6C 6F 67 6E vs \u0044\u0061\u0075\u00f0\u0061\u006c\u006f\u0067\u006e
    – Teson
    Commented Sep 3, 2015 at 8:31
  • 3
    @user247245: UTF-8 is the encoding and Unicode is the character set. UTF-8 is one way of saving Unicode. What you have used to represent the Unicode is escape codes used in string literals; that's not normally how you represent Unicode as a file. UTF-32 would be the closest direct translation from Unicode to a file format, where each character code is saved as a 32-bit number.
    – Guffa
    Commented Sep 3, 2015 at 10:10
  • Can you please explain why the third letter in the example above is represented in UTF8 as C3 B0 and in unicode as simply F0 . Thanks for your help.
    – Teson
    Commented Sep 4, 2015 at 9:56
  • 4
    @user247245: Character codes between 8 and 11 bits are encoded as 110xxxxx 10xxxxxx in UTF-8 (where x represents data bits), so the character code F0 (00011110000 as 11 bits) is encoded as 11000011 10110000 (putting 00011 from the character code in the first byte and 110000 in the second), which is C3 B0. (Sketched in T-SQL after this comment list.)
    – Guffa
    Commented Sep 4, 2015 at 10:18
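The arithmetic from the last comment, sketched in T-SQL (integer division and modulo stand in for the bit shifts; this covers only the two-byte case):

DECLARE @cp INT = UNICODE(N'ð');  -- code point U+00F0 = 240
SELECT CONVERT(BINARY(1), 192 + @cp / 64) AS LeadByte,   -- 110xxxxx -> 0xC3
       CONVERT(BINARY(1), 128 + @cp % 64) AS TrailByte;  -- 10xxxxxx -> 0xB0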
32

Looks like this will finally be supported in SQL Server 2019! See: SQL Server 2019 - what's new?

From BOL:

UTF-8 support

Full support for the widely used UTF-8 character encoding as an import or export encoding, or as database-level or column-level collation for text data. UTF-8 is allowed in the CHAR and VARCHAR datatypes, and is enabled when creating or changing an object’s collation to a collation with the UTF8 suffix.

For example, LATIN1_GENERAL_100_CI_AS_SC to Latin1_General_100_CI_AS_KS_SC_UTF8. UTF-8 is only available to Windows collations that support supplementary characters, as introduced in SQL Server 2012. NCHAR and NVARCHAR allow UTF-16 encoding only, and remain unchanged.

This feature may provide significant storage savings, depending on the character set in use. For example, changing an existing column data type with ASCII strings from NCHAR(10) to CHAR(10) using a UTF-8 enabled collation translates into a nearly 50% reduction in storage requirements. This reduction is because NCHAR(10) requires 22 bytes for storage, whereas CHAR(10) requires 12 bytes for the same Unicode string.
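If I read the BOL excerpt correctly, a minimal sketch on SQL Server 2019 would look like this (table and column names are mine):

CREATE TABLE dbo.Utf8Demo
(
    Comment VARCHAR(100) COLLATE Latin1_General_100_CI_AS_SC_UTF8
);

INSERT INTO dbo.Utf8Demo (Comment) VALUES (N'Dauðalogn');

-- 'ð' takes two bytes in UTF-8, so the 9-character string occupies 10 bytes:
SELECT Comment, DATALENGTH(Comment) AS Bytes FROM dbo.Utf8Demo;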

2019-05-14 update:

The documentation seems to be updated now and explains our options starting in MSSQL 2019 in the section "Collation and Unicode Support".

2019-07-24 update:

Article by Pedro Lopes (Senior Program Manager at Microsoft) introducing UTF-8 support for Azure SQL Database.

27

No! It's not a joke.

Take a look here: http://msdn.microsoft.com/en-us/library/ms186939.aspx

Character data types that are either fixed-length, nchar, or variable-length, nvarchar, Unicode data and use the UNICODE UCS-2 character set.

And also here: http://en.wikipedia.org/wiki/UTF-16

The older UCS-2 (2-byte Universal Character Set) is a similar character encoding that was superseded by UTF-16 in version 2.0 of the Unicode standard in July 1996.
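A small illustration of what UCS-2 vs. UTF-16 means in practice: a character outside the BMP is stored in NVARCHAR as a surrogate pair, i.e. two 2-byte code units (constructed by hand below):

DECLARE @s NVARCHAR(10) = NCHAR(0xD83D) + NCHAR(0xDE00);  -- U+1F600 as a surrogate pair
SELECT DATALENGTH(@s) AS Bytes,     -- 4: two 2-byte UTF-16 code units
       LEN(@s)        AS CodeUnits; -- 2 under a non-SC collation, which counts code units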

4
  • OK. Can the MSSQL client translate to the outside UTF-8 world?
    – Teson
    Commented Sep 20, 2012 at 14:42
  • The MSSQL client can be anything: Java, .NET, C, PHP, etc. What do you mean by client?
    – edze
    Commented Sep 20, 2012 at 14:53
  • 1
    Client: the sqlsrv extension under PHP. Robert puts it in clear text here: social.msdn.microsoft.com/Forums/en/sqldriverforphp/thread/…; will evaluate and post results.
    – Teson
    Commented Sep 25, 2012 at 11:43
  • 2
    Hi again, sorry for the delay, but thanks for the reps. Using sqlsrv_connect(..., array("CharacterSet" => "UTF-8"))... in the connection options works fine. PDO is out, right?
    – Teson
    Commented Nov 27, 2013 at 23:56
9

Two UDFs to deal with UTF-8 in T-SQL:

-- Converts UTF-16 text to a UTF-8 byte string held in a varchar.
-- Handles code points up to U+FFFF (1- to 3-byte sequences);
-- supplementary characters (surrogate pairs) are not handled correctly.
CREATE Function UcsToUtf8(@src nvarchar(MAX)) returns varchar(MAX) as
begin
    -- @pi matches the first character outside the ASCII range 0-127
    declare @res varchar(MAX)='', @pi char(8)='%[^'+char(0)+'-'+char(127)+']%', @i int, @j int
    select @i=patindex(@pi,@src collate Latin1_General_BIN)
    while @i>0
    begin
        select @j=unicode(substring(@src,@i,1))
        -- two-byte sequence 110xxxxx 10xxxxxx for code points below U+0800
        if @j<0x800     select @res=@res+left(@src,@i-1)+char((@j&1984)/64+192)+char((@j&63)+128)
        -- three-byte sequence 1110xxxx 10xxxxxx 10xxxxxx otherwise
        else            select @res=@res+left(@src,@i-1)+char((@j&61440)/4096+224)+char((@j&4032)/64+128)+char((@j&63)+128)
        select @src=substring(@src,@i+1,datalength(@src)-1), @i=patindex(@pi,@src collate Latin1_General_BIN)
    end
    select @res=@res+@src
    return @res
end

-- Converts a UTF-8 byte string held in a varchar back to UTF-16 text.
-- The bracket ranges are code page 1252 bytes under a binary collation:
-- à-ï = lead bytes 0xE0-0xEF, Â-ß = lead bytes 0xC2-0xDF, €-¿ = continuation bytes 0x80-0xBF.
CREATE Function Utf8ToUcs(@src varchar(MAX)) returns nvarchar(MAX) as
begin
    declare @i int, @res nvarchar(MAX)=@src, @pi varchar(18)
    -- first pass: decode every three-byte sequence
    select @pi='%[à-ï][€-¿][€-¿]%',@i=patindex(@pi,@src collate Latin1_General_BIN)
    while @i>0 select @res=stuff(@res,@i,3,nchar(((ascii(substring(@src,@i,1))&31)*4096)+((ascii(substring(@src,@i+1,1))&63)*64)+(ascii(substring(@src,@i+2,1))&63))), @src=stuff(@src,@i,3,'.'), @i=patindex(@pi,@src collate Latin1_General_BIN)
    -- second pass: decode every two-byte sequence
    select @pi='%[Â-ß][€-¿]%',@i=patindex(@pi,@src collate Latin1_General_BIN)
    while @i>0 select @res=stuff(@res,@i,2,nchar(((ascii(substring(@src,@i,1))&31)*64)+(ascii(substring(@src,@i+1,1))&63))), @src=stuff(@src,@i,2,'.'),@i=patindex(@pi,@src collate Latin1_General_BIN)
    return @res
end
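A round trip through the two functions might look like this (the mojibake in the first result assumes a code page 1252 client):

SELECT dbo.UcsToUtf8(N'Dauðalogn');                 -- 'DauÃ°alogn': ð became the two bytes 0xC3 0xB0
SELECT dbo.Utf8ToUcs(dbo.UcsToUtf8(N'Dauðalogn'));  -- N'Dauðalogn' again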
4

Note that as of Microsoft SQL Server 2016, UTF-8 is supported by bcp, BULK INSERT, and OPENROWSET.
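For example, BULK INSERT accepts UTF-8 input via code page 65001 (file path and target table here are hypothetical):

BULK INSERT dbo.Messages
FROM 'C:\data\messages.csv'
WITH (CODEPAGE = '65001', FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');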

Addendum 2016-12-21: SQL Server 2016 SP1 now enables Unicode compression (and most other previously Enterprise-only features) in all editions of MS SQL, including Standard and Express. This is not the same as UTF-8 support, but it yields a similar benefit if the goal is disk space reduction for Western alphabets.
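Unicode compression applies automatically to NCHAR/NVARCHAR columns once row (or page) compression is enabled; a sketch with a hypothetical table:

ALTER TABLE dbo.Messages REBUILD WITH (DATA_COMPRESSION = ROW);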

1
  • 1
    But not OPENQUERY? I wonder if this is why I'm having issues migrating CLOB data from Oracle using OPENQUERY. Commented Jun 10, 2019 at 14:38
