3

First, I am not too sure if this is “Workplace” specific, or a question best suited for the larger Stack Exchange world. If this question can be moved to a more appropriate place, please do so.

So this morning I came across this bizarre/stilted question that refers to events from 2010:

Beyond the very stilted wording & references to 2010, something else stood out to me: The text itself has tons of odd Unicode “gremlins” in it, mainly in what appear to be spaces but are actually something else. Perhaps non-breaking spaces? Who knows. It’s one of those things I bet I could dissect deeper, but I had better things to do today.

Anyway, my hunch that it was fake was confirmed by the user jonsca who discovered the question is lifted from some H.R. management text or test. Kudos to jonsca!

So anyway, knowing this text was so littered with an inhuman amount of Unicode cruft that any attempt to edit it was futile at best, is there any automated way the Stack Exchange filtering system can detect stuff like this & perhaps flag it right away? Or is that part of the CAPTCHA mechanism that pops up every now & then?

It seems to me that the clean text of a genuine question, typed straight into the form, should automatically trump the Unicode junk seen in this post. But it is unclear to me whether Stack Exchange actually detects stuff like this & lets it through anyway.

Basically: This was obviously not a real question, even just from the formatting. How can we combat this stuff? Considering the clear data discrepancies on a Unicode level, can better filtering be put in place to catch junk like this before it hits the system?
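To make the idea concrete, here is a rough Python sketch of the kind of check I have in mind. It is only an illustration, not anything Stack Exchange is known to run: the names `SUSPECT_SPACES`, `suspect_space_ratio`, and `looks_pasted`, the list of code points, and the 0.5 threshold are all my own assumptions.

```python
# Hypothetical set of space-like code points that rarely come from a keyboard.
# The choice of characters is an assumption made for illustration only.
SUSPECT_SPACES = {
    "\u00a0",  # NO-BREAK SPACE
    "\u2007",  # FIGURE SPACE
    "\u202f",  # NARROW NO-BREAK SPACE
    "\u200b",  # ZERO WIDTH SPACE
    "\ufeff",  # ZERO WIDTH NO-BREAK SPACE (byte order mark)
}


def suspect_space_ratio(text: str) -> float:
    """Return the share of space-like characters that are not plain U+0020."""
    plain = text.count(" ")
    suspect = sum(1 for ch in text if ch in SUSPECT_SPACES)
    total = plain + suspect
    return suspect / total if total else 0.0


def looks_pasted(text: str, threshold: float = 0.5) -> bool:
    """Flag text whose spaces are mostly unusual ones (threshold is arbitrary)."""
    return suspect_space_ratio(text) >= threshold
```

A post where nearly every gap is U+00A0 would score close to 1.0, while normally typed text would score 0.0. A flag like this would probably be better used to queue a post for human review than to block it outright.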

8
  • This is a copy-paste from somewhere else. We have had it before. Developers/Community Managers review our meta, so they will see this as-is. Don't worry!
    – jmac
    Commented May 27, 2014 at 1:31
  • @jmac Thanks! But my question is: given how inhuman the basic text formatting is (littered with gremlins), is there some way to automatically look out for this stuff? Commented May 27, 2014 at 1:34
  • 2
    Obviously they can search for certain character codes in submissions to filter stuff like this out. The question is whether or not that would also filter out content that does have value for the site. That's something for the community team to look at, as they probably won't share the nitty gritty of the spam algorithm for (hopefully) obvious reasons. At any rate, thanks for pointing it out, and the community team will see it.
    – jmac
    Commented May 27, 2014 at 1:38
  • @jmac Okay, obviously you are right. But just so you understand, pretty much every seemingly empty space in the original post was what I believe to be a Unicode U+00A0 non-breaking space. Commented May 27, 2014 at 1:41
  • 1
    I absolutely see the concern, but my guesstimation is that this wasn't done intentionally (in the sense of the person maliciously doing it to try to skirt spam filters or the like), but rather is due to the encoding used on the site it was copied from. If copying from sites that use non-ASCII characters results in weird Unicode spaces, any filter may end up punishing other legitimately copy-pasted resources in questions (which would be bad). Asian languages are...problematic...when it comes to encoding (this seems to be from Pakistan).
    – jmac
    Commented May 27, 2014 at 1:44
  • @jmac Understood. I worked on a Chinese website 5 years ago & was stunned how little I truly understood about character set issues before that. Commented May 27, 2014 at 1:51
  • While I'd love to discuss it further Jake, continuing the discussion of odd character sets would be better suited for The Workplace Chat. (Don't want to dilute the message to the super-important community team!)
    – jmac
    Commented May 27, 2014 at 1:53
  • 1
    meta.stackexchange.com/questions/213201/… We spoke about weird Unicode stuff in posts a while ago at Whiteboard chat at Programmers. Back then, guys brought up examples of legitimate posts with Unicode; if memory serves, these were from Islam.SE and some other smaller SE sites.
    – gnat
    Commented May 27, 2014 at 8:16
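
On the false-positive concern raised in the comments above: rather than rejecting a post, a gentler option might be to normalize the odd spaces on submission so the text stays editable and legitimate non-ASCII content is left alone. A minimal Python sketch, assuming a hypothetical `normalize_spaces` helper and an arbitrary choice of which code points to touch (nothing here reflects how Stack Exchange actually handles posts):

```python
import re

# Assumed list of visible but unusual spaces to turn into a plain space.
_ODD_SPACES = re.compile("[\u00a0\u2007\u202f]")
# Assumed list of invisible characters to drop entirely.
_ZERO_WIDTH = re.compile("[\u200b\ufeff]")


def normalize_spaces(text: str) -> str:
    """Replace unusual spaces with U+0020 and remove zero-width characters."""
    text = _ODD_SPACES.sub(" ", text)
    return _ZERO_WIDTH.sub("", text)
```

Because only the listed space characters are touched, Arabic, Urdu, or Chinese text in a post passes through unchanged.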

1 Answer

3

It is not a huge issue that really requires a solution beyond the normal process that the SE framework provides.

In this case the question was off topic. It was more appropriate for a business policy SE than here. The users identified the question and handled it in the appropriate manner, and it has been deleted.
