Non US-ASCII characters dropped from full (profile) URL

Question

I have characters which are outside 7-bit ASCII in my username, "Jakub Narębski". Characters outside US-ASCII are dropped from the full profile URL: https://meta.stackoverflow.com/users/130454/jakub-narbski (observe that it is 'narbski', not 'narebski' or 'narębski').

I'm not sure whether this is a bug. I think the last part of the URL is purely informational: https://meta.stackoverflow.com/users/130454 works as well, presumably so that users can change their name field without breaking existing links.

Added 2009-07-30:
This feature request was implemented in the form of stripping diacritical marks (status-completed). Wouldn't it be a good idea to separate out this Unicode transliteration code and publish it somewhere as an ASP.NET snippet or mini-library?

See also: the "This Is America, Take Your Unicode Somewhere Else" blog post by Ted Dziuba, which mentions the Text::Unidecode Perl module (which transliterates Unicode text to US-ASCII) and mentions Stack Overflow in passing.

Added 2009-08-30:
Perl 6 has the :ignoreaccent modifier, and there is also the Text::Unaccent Perl module (which uses the unac C library).

The right way to do accent-insensitive comparisons is to compare at the primary strength in the Unicode Collation Algorithm. That is what it is there for. You will never get them all otherwise.
– tchrist
Commented Apr 29, 2011 at 4:15
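As a rough illustration of the comment above: .NET does not name UCA strength levels directly, but `CompareInfo.Compare` with `CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase` is probably the closest built-in approximation. The sketch below assumes the invariant culture and a runtime with full globalization data; the class and method names are hypothetical. It only answers "are these two strings the same apart from accents and case?" and does not produce an ASCII slug.

```csharp
using System;
using System.Globalization;

public static class AccentInsensitive
{
    // Hypothetical helper: true when the two strings differ only by case
    // or by combining (non-spacing) marks under the invariant culture.
    public static bool SameBaseLetters(string a, string b)
    {
        CompareInfo ci = CultureInfo.InvariantCulture.CompareInfo;
        CompareOptions options = CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase;
        return ci.Compare(a, b, options) == 0;
    }
}
```

With this, `AccentInsensitive.SameBaseLetters("Narębski", "narebski")` should report a match. How letters without a canonical decomposition (ł, ð, ø) compare depends on the platform's collation data, so that case is left unasserted here.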

Jeff Atwood · Accepted Answer · 2011-07-19 10:04:14Z

41

Here's what we have in the substitution table:

public static string RemapInternationalCharToAscii(char c)
{
    string s = c.ToString().ToLowerInvariant();
    if ("àåáâäãåą".Contains(s))
    {
        return "a";
    }
    else if ("èéêëę".Contains(s))
    {
        return "e";
    }
    else if ("ìíîïı".Contains(s))
    {
        return "i";
    }
    else if ("òóôõöøőð".Contains(s))
    {
        return "o";
    }
    else if ("ùúûüŭů".Contains(s))
    {
        return "u";
    }
    else if ("çćčĉ".Contains(s))
    {
        return "c";
    }
    else if ("żźž".Contains(s))
    {
        return "z";
    }
    else if ("śşšŝ".Contains(s))
    {
        return "s";
    }
    else if ("ñń".Contains(s))
    {
        return "n";
    }
    else if ("ýÿ".Contains(s))
    {
        return "y";
    }
    else if ("ğĝ".Contains(s))
    {
        return "g";
    }
    else if (c == 'ř')
    {
        return "r";
    }
    else if (c == 'ł')
    {
        return "l";
    }
    else if (c == 'đ')
    {
        return "d";
    }
    else if (c == 'ß')
    {
        return "ss";
    }
    else if (c == 'Þ')
    {
        return "th";
    }
    else if (c == 'ĥ')
    {
        return "h";
    }
    else if (c == 'ĵ')
    {
        return "j";
    }
    else
    {
        return "";
    }
}

Updated a few times now; does this cover it?
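For context, a per-character remap like the one above is typically applied while building the URL slug. The following is only a sketch of that idea, not the site's actual code: `SlugSketch`, `ToSlug`, and the small `Remap` table are hypothetical, and the table is a deliberately tiny subset. It lowercases each character before the lookup, so uppercase variants such as Ł or Þ hit the same entry as their lowercase forms.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class SlugSketch
{
    // Illustrative subset of a substitution table; not the site's real table.
    private static readonly Dictionary<char, string> Remap = new Dictionary<char, string>
    {
        { 'ą', "a" }, { 'ę', "e" }, { 'ó', "o" }, { 'ł', "l" },
        { 'ń', "n" }, { 'ś', "s" }, { 'ż', "z" }, { 'ź', "z" },
        { 'ć', "c" }, { 'ß', "ss" }, { 'þ', "th" }
    };

    public static string ToSlug(string name)
    {
        var sb = new StringBuilder(name.Length);
        foreach (char raw in name)
        {
            // Lowercase first so uppercase variants share the lowercase entry.
            char c = char.ToLowerInvariant(raw);
            if ((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9'))
            {
                sb.Append(c);
            }
            else if (Remap.TryGetValue(c, out string ascii))
            {
                sb.Append(ascii);
            }
            else if (char.IsWhiteSpace(c) || c == '-')
            {
                // Collapse runs of separators into a single hyphen.
                if (sb.Length > 0 && sb[sb.Length - 1] != '-') sb.Append('-');
            }
            // Any other character is dropped; without a table entry,
            // this is how "narębski" turns into "narbski".
        }
        return sb.ToString().TrimEnd('-');
    }
}
```

For example, `SlugSketch.ToSlug("Jakub Narębski")` yields `jakub-narebski`, and `SlugSketch.ToSlug("Łódź")` yields `lodz`.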

edited Jul 19, 2011 at 10:04

answered Jul 21, 2009 at 8:09

Jeff Atwood


1

Māori has ā. There's also æ and œ (--> ae and oe) though I'm not sure if they turn up in names. You could add Þ for when we get our first Anglo-Saxon users :-)
– John Fouhy
Commented Jul 30, 2009 at 22:57
I also can't find the uppercase variants of L with stroke (Ł) and D with stroke (Đ).
– Marcel Korpel
Commented Aug 31, 2010 at 12:02
@John: There's @aether, which turns a single character into "ae" according to meta.stackoverflow.com/questions/38288/…
– Andrew Grimm
Commented Apr 28, 2011 at 2:14
26

@Jeff, why in the world are you hardcoding all those? Use the Unicode Collation Algorithm and compare at the primary strength, which works even better than canonically decomposing and then removing Grapheme_Extend characters. The UCA1("ð") is the same as UCA1("d"), etc. Otherwise you fight a losing battle. Use the Unicode facilities; don’t hardcode stuff.
– tchrist
Commented Apr 29, 2011 at 4:12
@tchrist this algorithm is quite fast, and it needs to be, since it is potentially called hundreds of times per page. Do you have any C# source code examples?
– Jeff Atwood
Commented Apr 29, 2011 at 5:07
1

@Jeff: Well, the UCA is supposed to be fast, too; there are big lookup tables, which may be a memory issue. But I don’t do C#; I'm not a Microsoft guy. I could have probably winged it in Java if you’d asked since I do Java stuff from time to time, but my primary language is Perl. I agree you’re doing better than the naïve stripping of Mark characters that I normally see done. It just seems painfully ad hoc, and we have tables for this stuff.
– tchrist
Commented Apr 29, 2011 at 12:08
1

@Jeff: is there a reason you went with If-Else-If vs Switch? (did you test/compare?) posts here (e.g. stackoverflow.com/questions/445067/if-vs-switch-speed) and elsewhere (e.g. blackwasp.co.uk/SpeedTestIfElseSwitch.aspx) indicate that switch is generally faster.
– Faust
Commented Jun 8, 2011 at 11:26
2

@faust kind of irrelevant for a number of reasons, the primary reason is that the sequence of if..then here is ordered by frequency that those letters occur in English. So it hits the most common cases first, and almost never goes all the way to the bottom.
– Jeff Atwood
Commented Jul 13, 2011 at 12:07
4

@Jeff I've got a version using a unicode collation function in .net on this answer : stackoverflow.com/questions/25259/… - I haven't measured the performance impact, but coverage is better, although I still need an exceptions table. I also delay load the hyphens which might be useful even if you prefer the hardcode lookup table for perf.
– DanH
Commented Jul 19, 2011 at 12:59
@dan great tip, I was looking for that code! I'll check it out.
– Jeff Atwood
Commented Jul 20, 2011 at 5:15
@Jeff will be v. interested in any results if you benchmark it, especially as you have real data to work against. I'm pretty curious if the delay hyphen makes any diff for you by itself.
– DanH
Commented Jul 22, 2011 at 11:02
That doesn't work on my name. Please include uppercase versions of those characters (for example: Ç).
– Murat Çorlu
Commented Jan 10, 2013 at 13:04
Using ToLowerInvariant might result in some strange behavior in character normalization, as "İ".ToLowerInvariant() gives us "İ" (Yes, the same character: Unicode Character 'LATIN CAPITAL LETTER I WITH DOT ABOVE' (U+0130)) The issue seems to be related to yet another Turkish Locale Horror (love the article, btw.), and according to this article, using ToUpperInvariant for string normalization is the recommended approach. What do you think?
– mono blaine
Commented Dec 27, 2013 at 23:46
2

I see this problem persists: my comment from three years ago still applies in full.
– tchrist
Commented May 4, 2014 at 20:55
1

@tchrist, I've read some of your answers on Unicode here on Stack Overflow. I doubt that many people share your awareness of the problem or fully grasp the concept of Unicode. Many just see it as "international ASCII" or a set of translation tables. The real solution would be to use IRIs instead of URIs and not do any translation at all. C# even has that integrated.
– Sebastian Godelet
Commented May 13, 2014 at 14:03


Pavel Sem · Accepted Answer · 2017-09-06 14:28:56Z

The code above could be simplified and generalized by using Unicode categories. The method below also handles diacritics not explicitly listed above (e.g. Czech also has ř, ď and ť, which become r, d and t). It could serve as a first step, after which the remaining characters, those that are standard national letters but are undesirable in a URL, could be replaced manually.

Note: I have not measured performance of this solution.

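using System.Globalization;
using System.Text;

public static string RemoveDiacritics(string s)
{
    // Decompose each accented letter into a base letter plus combining mark(s).
    s = s.Normalize(NormalizationForm.FormD);
    StringBuilder sb = new StringBuilder(s.Length);

    for (int i = 0; i < s.Length; i++)
    {
        // Keep everything except the combining (non-spacing) marks.
        if (CharUnicodeInfo.GetUnicodeCategory(s[i]) != UnicodeCategory.NonSpacingMark) sb.Append(s[i]);
    }

    return sb.ToString();
}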

RemoveDiacritics("àåáâäãåąèéêëęìíîïıòóôõöøőðùúûüŭůçćčĉżźžśşšŝñńýÿğĝřłđßÞĥĵřďť");
// returns "aaaaaaaaeeeeeiiiiıoooooøoðuuuuuucccczzzssssnnyyggrłđßÞhjrdt"
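The output above shows that letters without a canonical decomposition (ı, ø, ð, ł, đ, ß, Þ) pass through unchanged. A minimal sketch of the two-step idea described in this answer, normalization first and a small manual table second, might look like the following. The `AsciiFolding` class, the `Fold` method, and the contents of `Leftovers` are assumptions for illustration; the table is not exhaustive, and some mappings are a matter of taste (ð is mapped to "d" here, whereas the table in the first answer maps it to "o").

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class AsciiFolding
{
    // Letters that FormD normalization leaves untouched; illustrative, not exhaustive.
    private static readonly Dictionary<char, string> Leftovers = new Dictionary<char, string>
    {
        { 'ł', "l" }, { 'Ł', "L" }, { 'đ', "d" }, { 'Đ', "D" },
        { 'ø', "o" }, { 'Ø', "O" }, { 'ð', "d" }, { 'Ð', "D" },
        { 'ı', "i" }, { 'ß', "ss" }, { 'þ', "th" }, { 'Þ', "Th" },
        { 'æ', "ae" }, { 'Æ', "Ae" }, { 'œ', "oe" }, { 'Œ', "Oe" }
    };

    public static string Fold(string s)
    {
        // Step 1: decompose, so accented letters become base letter + combining mark(s).
        string decomposed = s.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            // Drop the combining marks themselves.
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue;

            // Step 2: replace the letters that normalization could not split.
            if (Leftovers.TryGetValue(c, out string ascii))
                sb.Append(ascii);
            else
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

With this sketch, `AsciiFolding.Fold("Łódź")` gives `Lodz` and `AsciiFolding.Fold("Straße")` gives `Strasse`. Characters outside both steps (for example non-Latin scripts) are passed through as-is, so a slug builder would still need to decide what to do with them.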

Original code, with comments on how it works (in Czech only)

Community · Accepted Answer · 2017-03-20 09:39:08Z

0

Those characters are often encoded in URLs, but it works just as well to drop them. That last part is purely informational (SEO ftw), so if you felt like it, you could even call yourself Mary, Queen of Scots and it would still go to the right page.

edited Mar 20, 2017 at 9:39

Community Bot

answered Jul 20, 2009 at 17:13

Eric


2

well, sort of. We realized that having lots of people link to a question as /12345/blah-foo-bar and /12345/baz-shell and /12345/extra-words was confusing Google in some cases, so we force-redirect to the 'correct' title now.
– Jeff Atwood
Commented Mar 15, 2011 at 5:11
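The force-redirect described in the comment above could be sketched roughly as follows. This is an assumption about the general shape of such a check, not the site's implementation: `CanonicalUrl`, `RedirectTarget`, and the `/id/slug` path layout are hypothetical, with the layout borrowed from the `/12345/blah-foo-bar` example in the comment.

```csharp
using System;

public static class CanonicalUrl
{
    // Returns the path to redirect to, or null when the requested slug
    // already matches the canonical one. The canonical slug would be
    // derived from the stored title or display name.
    public static string RedirectTarget(int id, string requestedSlug, string canonicalSlug)
    {
        if (string.Equals(requestedSlug, canonicalSlug, StringComparison.Ordinal))
            return null;

        // Only the numeric id identifies the page; the slug is rewritten.
        return "/" + id + "/" + canonicalSlug;
    }
}
```

Here `CanonicalUrl.RedirectTarget(12345, "baz-shell", "blah-foo-bar")` returns `/12345/blah-foo-bar`, while a request that already uses `blah-foo-bar` returns null and needs no redirect.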

Tagged: bug, status-completed, profile-page, unicode
