7

I have a website built off my Q & A posted in Stack Exchange sites such as; Ask Ubuntu, Stack Overflow, SE Security, etc.

Data Explorer is used to scrape ALL Stack Exchange Q & A I've posted and download the query results for filtering by a python script and publishing to GitHub Pages. There are 310 Questions and 2,181 Answers, of which 54 and 1,135, respectively, are worthy of publication.

Data Explorer has been working fine for a few years but today it crashed with error:

Line 11: Cannot resolve collation conflict between "SQL_Latin1_General_CP1_CI_AS" and "SQL_Latin1_General_CP1_CS_AS" in CASE operator occurring in GROUP BY statement column 4.

Here is the data explorer query, last changed on December 12, 2021.

How do I debug this error?


Problem fixed

A week after posting question the problem was fixed:

enter image description here

Many thanks for the answers posted below!

4
  • 4
    IMO, this is a task for which using the SE API is more appropriate. SEDE is not intended for programmatic access, while the SE API is. While there's no SE API request to get just the posts you want, it's trivial to get your answers on the two sites and then filter the results locally based on the "is accepted" and ">= 2 upvotes" criteria. For example, the SE API request for your posts on AU could be this. Admittedly, you'd need to get several pages of results.
    – Makyen
    Commented Apr 28 at 17:48
  • I've limited the above request to return a small subset of the data which is available for each answer in order to more clearly show getting a list of answers with the specific criteria you're interested in. However, the request, using a different filter can return complete data on each answer, if that's what you're wanting. Also of note is that if what you're really wanting is >= 2 score, rather than >= 2 upvotes, then the /search/excerpts endpoint can be used to return a much more narrow set of results.
    – Makyen
    Commented Apr 28 at 17:53
  • @Makyen I'm guilty of putting in TMI and not enough info at the same time. My program stack-to-blog.py reads ALL answers and questions and then selects the good ones for the website. It also ignores self-accept answers if up votes < 2. The spirit of the question was how to fix the SEDE bug, not how to filter Q&A. As stated it's been working fine for a few years. I'm thinking there is a specific Question or Answer it is choking on and then I could simply edit it in SE and rerun the query. Commented Apr 28 at 18:35
  • 2
    I understood the spirit of your question. That's why I posted a comment, not an answer. I accept that your selection criteria is more complex than you first presented, but that changes nothing. In reality, you are using SEDE as just a source for all the data on your answers. That's something the SE API is specifically designed to do. I'd argue that what you're actually doing is even more a task for the SE API, which is intended for such programmatic access and can easily accomplish downloading as JSON the data you're currently getting from SEDE as a CSV file (which you then have to parse).
    – Makyen
    Commented Apr 29 at 1:29

2 Answers 2

8

I've found the issue. Over the last few weeks the weekly refresh times got improved, a lot. From 6 to 8 hours we're now down to well under an hour.

One of the improvements was revealed to me by Aaron and involves pre-staging the PostHistory table. With that in mind I looked at your query to see if PostHistory was invovled. I then ran this query to see on which columns there were collation mismatches.

You can see both SQL_Latin1_General_CP1_CI_AS and SQL_Latin1_General_CP1_CS_AS are used. The former is the database default, the latter how most columns in the schema were created. Most is the critical point here. It looks like the tables and views and their columns that are involved in one way or the other with the new and much needed improvements now have a different collation, or to be precise: the database default collation. See also: Let the default collation of the SEDE databases match the dominant collation used for the nvarchar columns

To solve the issue with your specific query for now we have to sprinkle a few COLLATE hints in the query to tell SQL Server how to deal with these differences in collation.

In this query I've made these changes:

...
  coalesce(Users.DisplayName, PostHistory.UserDisplayName collate SQL_Latin1_General_CP1_CS_AS),
...

in both the projection and the group by of the UserName column inside the Editor's CTE.

One of the debug option is to run each select of the CTE separately, for example by running this:

select
        PostHistory.PostId as PostId,
        PostHistory.UserId as UserId,
        coalesce(Users.DisplayName, PostHistory.UserDisplayName /* collate SQL_Latin1_General_CP1_CS_AS */) as UserName,
        min(PostHistory.CreationDate) as FirstEditDate
      from
      (  select Id as PostId, ParentId, PostTypeId
         from Posts
         where OwnerUserId = 811) PostsByUser
        join PostHistory on PostHistory.PostId = PostsByUser.PostId
        left join Users on Users.Id = PostHistory.UserId
      where
        PostHistory.PostHistoryTypeId in (4,5)  -- edit body/title
        and (PostHistory.UserId != 811 or PostHistory.UserId is null)
      group by
        PostHistory.PostId,
        PostHistory.UserId,
        coalesce(Users.DisplayName, PostHistory.UserDisplayName /* collate SQL_Latin1_General_CP1_CS_AS */)

You'll get the error:

Line 1: Cannot resolve collation conflict between "SQL_Latin1_General_CP1_CI_AS" and "SQL_Latin1_General_CP1_CS_AS" in CASE operator occurring in GROUP BY statement column 4.

The line number indicator isn't very helpful here, but at least you can now have localized the issue to only this query.

Keep in mind SEDE is updated once a week on Sunday. Use the awesome SEDE Tutorial written by the unforgettable Monica Cellio. Say "Hi" in SEDE chat.

1
  • Yes Monica is truly unforgettable. Thank you for testing and finding the source of the problem so quickly Rene. Your work helped pave the way for Aaron's quick upgrade today. Commented Apr 30 at 2:22
7

I did intentionally avoid case-sensitive collations in some of the recent improvements I've been working on, in order to avoid forcing you to use COLLATE clauses when joining to the views I created.

But I perhaps over-reached, and I did not test cross-database joins against Stack Overflow specifically after the most recent changes, so that's on me.

For next week

I've added explicit collation to the two tables that are pre-staged (which means they get the database's collation unless otherwise specified) instead of using SELECT INTO (which means they inherit the source collation). This will resolve the issue for the next data dump this coming Sunday.

In the meantime

You can use an explicit COLLATE clause in a couple of places. The least intrusive way is like rene described, but you'll also want one here:

coalesce(WikiTags.TagName, 
   Posts.Tags COLLATE SQL_Latin1_General_CP1_CS_AS , 
  Parent.Tags COLLATE SQL_Latin1_General_CP1_CS_AS ) as [Tags],

I wrote an alternative version of your query that adds this in the right places using a variable, making the workaround easy to remove them once it is no longer needed (e.g. in about 6 days). It also avoids all the build-up of #Sites data from parsing an answer from this site by using the new dbo.sede_ineachdb procedure. It returned data for me (6,800+ posts) in about 9 seconds without caching, and for you in about 8 seconds, which demonstrates that most of the overhead is looping through each database, not the query itself. In any case, I don't know if I'd try it for the likes of Jon Skeet.

Asides

There was a long-standing suggestion to change the SEDE databases to have a case-sensitive collation, but that would cause worse problems. I'm not ready to apply yet, but we'd have to figure out a way to address all the existing queries that would suddenly break because they didn't use ThE riGHt cAsE for object and column names.

I will agree with others that this type of retrieval seems more like a job for the API than for SEDE. Find the posts you're after using SEDE, then get bodies and all that in a more scalable way. Or break the query into multiple chunks.

I noticed that magic formatting for [Tags] isn't very useful in cross-database queries, since they all link to the query context's site. This is something, too, that will be easier to massage outside of SEDE.

1
  • 2
    Thank you for your hard work and quick turn-around. I will wait until next Sunday's data dump and then refresh my website. If all goes well, the above question, which qualifies with 3 upvotes, will appear when I type data explorer into the search engine. Commented Apr 30 at 2:13

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .