Moderation Strike update: Data dumps, choosing representatives, GPT data, and where we’re holding


Note: As part of the strike organization, this is a mirror of a post on the network-wide Meta, MSE. Make sure to check out the answers and discussion there, too!

Introduction

Since our strike announcement, a number of new developments have occurred. Philippe, VP of Community, posted data they have regarding GPT content on the platform. Stack Exchange staff reached out to strike organizers and asked us to choose three moderator representatives for the strike. Also, a former DBA for Stack Exchange, Inc. disclosed that data dumps have been disabled. This post strives to speak on these developments from the perspective of strike participants.

Strike representatives

We have been coordinating action on our Discord server, as Stack Exchange, Inc. has instructed us in no uncertain terms that we cannot organize on their platform. Employees of Stack Exchange, Inc. reached out to us through this server and asked us to designate representatives for the strike. These representatives will later meet with Stack Exchange, Inc. representatives and negotiate on our behalf. As requested by the company, the representatives must be moderators, because some of the discussion points will involve information that is currently confidential and covered by the moderator agreement. Additionally, SE may disclose new data that is not cleared for public release and must be protected for privacy reasons.

In order to decide the representatives, a poll was organised which ended at midnight UTC on the 11th of June. As a result of that poll, the following users will be representing us in discussions with Stack Exchange, Inc.:

The full, anonymized results of the voting can be seen here.

The Data Dumps

A former database administrator (DBA) for the company has disclosed that Stack Exchange, Inc. quietly disabled the data dumps in March 2023, with a note that they should only be re-enabled with approval from senior leadership. Shortly after, the CTO, Jody Bailey, confirmed this to be the case, citing the need to “protect Stack Overflow data from being misused by companies building LLMs” and stating that the dumps have been stopped until “guardrails” are in place.

The Stack Exchange data dumps have been in place since 2009 and have been used to make network data available in an alternative format that allows people to take advantage of the open CC BY-SA license.

Disabling the data dumps in this manner is yet another example of poor communication with the very community of contributors that is at the heart of the network. The data dumps were turned off for several months, with no advance warning or communication until a user asked about it. And, even then, the company's decision was only revealed by a whistleblower, who effectively forced the CTO's hand to confirm it.

Perhaps more importantly, the data dumps serve to emphasize the very reason for the existence of the platform: guaranteed, free access to a repository of knowledge. The network was founded to be an alternative to a paywalled platform and to guarantee that information was freely distributed. The data dumps were insurance that, no matter what happened with the company in the future, the information shared on the platform would always be freely accessible to all. Disabling these dumps is a betrayal of the founding philosophy of the network.

It would be hard to put it better than one of the site's founders, Joel Spolsky, did, when speaking on the Stack Overflow podcast #84:

We created Stack Overflow to be against [expropriation of community content]. If there's anything that's more in the DNA of Stack Overflow than that, I don't know what it is. That's one of our most core things. You can see this all over the place in the design of Stack Overflow.

First of all, from day one, we use the CC-wiki license. And it's basically a license, it says that we don't own the content that's on there, which is why we make those database dumps that are available.

Because we wanted to make sure that if no matter what happens, literally no matter who we sell to, or raise money from, or turn the site over to, and even if they take Stack Overflow, and make it an evil site where you have to pay to look at things and there's pop-up ads and pop-under ads, and you know, dancing chariots of fire that cross the screen and punch the monkey, and, man, I can take so many evil things anyway. And it just becomes a big gigantic spam site.

Doesn't matter because just take the latest CC-wiki download that we provided and go start your own site saying, you know what, this is gonna be the clean version. And I think a lot of people will follow you. We very, very deliberately built Stack Overflow in a way that there wouldn't be any chance of locking and we're pretty much doing the same thing with Stack Exchange.

Beyond promoting openness, in the spirit of the CC BY-SA license, and serving as the final bulwark of the community and its contributors against a company that ever turns "evil", the data dumps were also key to innovative uses of Stack Overflow's knowledge base in environments like prisons and Arctic research labs, where no Internet connection is available to access the live site. You can read more about these initiatives, called the "Overflow Offline" Project, on the Stack Overflow Blog and in a Verge article by Mitchell Clark.

The impact so far

Stack Exchange, Inc. has claimed in a statement to the press that 11% of Stack Exchange moderators are participating in this strike. We would like to clarify that, while this was technically an accurate statement at the time that it was made, it was, even then, a misrepresentation of the actual percentage of moderator workload that had gone on strike (the moderators and flaggers who had gone on strike and suspended their activity were drawn disproportionately from those who were actively raising and handling flags). Therefore, this cited number failed to put the strike's effects into their proper perspective.

On Monday, June 5th, the notice about the strike was posted to Meta Stack Exchange, and the strike kicked into effect. The open letter, however, had been available to sign beforehand, as organization and coordination required. Some moderators signed the letter before it went "live" on June 5th (although their signatures publicly display as the 5th due to the strike not starting before then). The 11% of moderators cited is the percentage of moderators who had signed the letter before it even went live.

Currently, the vast majority of moderators on Stack Overflow have suspended their activity. The pending flag queue has grown from just over 130 pending flags, prior to Stack Exchange posting the moderator-private version of the AI-generated content policy, to more than 3,000, even while many of the most active flag-raising users have also ceased raising flags.

On multiple other sites (Super User, Software Engineering, Math, Academia, etc.) the majority of—or all—site moderators are on strike.

As of the time of writing, 113 out of 538 total Stack Exchange network moderators have signed the open strike letter, a percentage of 21%, and this number continues to grow.

The GPT data analysis

Stack Exchange, Inc. has released some of the data behind their decision to override community consensus and prohibit moderators from handling AI-generated content. This data analysis has several flaws and unverifiable underlying assumptions, which have been examined in detail in the answers to that post. We do not believe that this data sufficiently backs up the perceived need to implement such a total prohibition on moderating AI-generated content, nor does it excuse the manner in which Stack Exchange, Inc. went about doing so.

One of the issues that bears specific mention is the company's focus on the accuracy of GPT detectors. This is a red herring. Although Philippe continues to characterize this as the basis of the strike to the media, moderators do not rely blindly on GPT detectors. As has been noted repeatedly, moderators have long known about and warned flaggers about the inaccuracy of these detectors. We have no objection to ceasing reliance on the detectors, since we already were not relying upon them.

To summarize points raised in just a selection of answers:

Stack Exchange claims to have a reliable method of detecting GPT posts through draft count. This method has been called into question: it does not appear to consider ways this detection method could fail through trivial action. It also does not match reality as observed by multiple commentators. Many of the remaining conclusions depend on this method being accurate. That is, if the conclusion that GPT posts have fallen based on this detection method is inaccurate, many following conclusions are invalidated. (discussed by Mithical, Gilles, Kevin, CodeCaster, etc.)
The methodology for handling appeals from suspended users is in question; moderators could have been consulted, for example. (Ryan M, Chris)
Multiple answers question or disprove the validity of the data and the claims drawn from it by Staff as a whole. (starball, kaya3, CodeCaster)
Staff appear to have identified a problem (declining user activity rates), but answers provide alternative explanations for the data displayed that staff appear not to have considered. (Bryan Krause, starball)

…and this is just a brief overview of some of the first half of the first page of responses.

Our conditions for ending the strike

In both the open letter and the Meta post, we have issued several conditions that must be met in order for the strike to end. In light of the developments mentioned above, we wanted to reiterate them here:

The prohibition on moderating GPT content must be retracted.

This is the immediate, first action that Stack Exchange, Inc. must take in order to begin resolving this issue. This is a non-negotiable, fundamental requirement.

The private policy on GPT content that was issued to moderators must be revealed publicly.

Stack Exchange, Inc. has put moderators in the untenable position of having a private policy dictating how to handle flags and moderate content that differs from the public version of this policy. The private policy must be retracted and revealed publicly so that the public knows what restrictions moderators were placed under.

The data dumps must be re-enabled, and SEDE and API access guaranteed.

The data dumps of Stack Exchange content serve to further the goals of free knowledge-sharing. The content posted to the Stack Exchange network was posted to further that goal and with the understanding that it would be freely distributed to anyone seeking knowledge. The data dumps safeguard that collected knowledge and must be continued.

The Stack Exchange API and Data Explorer both serve as major parts of moderation. Userscripts, queries, bots, and others are used to find, identify, and improve content across the Stack Exchange network. Access to these resources and the data dumps must be allowed to continue unimpeded.

Stack Exchange, Inc. must communicate, gather feedback, and act on that feedback before making major policy or software changes to the public platform.

Stack Exchange, Inc. has consistently made harmful changes to both policies and the software running the public platform that run counter to the knowledge-sharing goal of the network. Moving forwards, Stack Exchange, Inc. must consult with the community to gather feedback in order to safeguard the goals of the platform.

We continue to hope for a speedy resolution to this conflict. We look forward to Stack Exchange, Inc. taking the steps required in order for the network to return to its normal operations, focused on building and maintaining a repository of freely accessible, high-quality information in the form of questions and answers.

I wonder if the desire to lock down and restrict community content in SO Inc mirrors the tensions that Reddit is currently going through with its own communities. The worry seems to be that "someone might make money from our data", but why is this causing such a backlash with UGC tech companies? This is not a new phenomenon. The data dumps for Stack Overflow from a few months back are still widely available, and could easily be used to train an LLM. I doubt there is anything new in the latest quarterly snapshot that would improve an LLM significantly. — halfer, Commented Jun 13, 2023 at 13:34
@halfer My guess is that after the collapse of Silicon Valley Bank, venture capitalists have become far more reticent to continue indefinitely funding massive websites like reddit and SO without seeing a return on investment. They don't seem to understand that when you started off giving their product away, any sort of added cost is an anathema to users. It's going to be very interesting to see what happens when, not if, the VCs blink; or if they're willing to see things go the way of digg. — Ian Kemp, Commented Jun 13, 2023 at 14:06
@Ian: I don't doubt the VCs would sell their grandmothers if they could be effectively packaged and monetised. But Big Tech's role presumably is to explain to VCs that they're asking for a lot of locks on the UGC stable doors after the data horse has firmly bolted. Plus of course if new data really is desirable, someone will scrape it, either directly or from the search engines. That's my confusion - I can't see why SO isn't siding with the volunteers, because the VC request is essentially unworkable or pointless. — halfer, Commented Jun 13, 2023 at 15:29
@halfer As far as the Reddit boycott, I had the same thought as far as training an LLM. The tech is changing all the time though, and if the LLMs aren't trained on examples from the latest tech then people might think of it as less capable. I think it's completely possible that Stack is looking to break into the LLM market, or at least the language data market. — Brock Brown, Commented Jun 13, 2023 at 15:52
SE needs to remember that they have exactly the same license, a copyleft one, to our contributions as everyone else does (and that if they build an LLM derived from those contributions, said model is CC BY-SA) — Ben Voigt, Commented Jun 13, 2023 at 16:22
@halfer I suspect it's a continuation of that thought along the lines "someone might make money from our data and then we won't be able to do the same" — mbrig, Commented Jun 13, 2023 at 17:12
Maybe an unanswerable question @CodyGray but, how unified are the mods in having the restoration of the data dumps as a condition? While I fully respect it, it seems like less of a moderation specific thing and more a general community issue, compared to "our secret mod policy stops you from modding properly". — mbrig, Commented Jun 13, 2023 at 17:18
@SupportUkraine The impact will show up in site quality, as spam and other content tend to stay up longer now. I would think support tickets might see longer delays too, as some CMs have had to visit some sites to "help" handle flags — yagmoth555, Commented Jun 13, 2023 at 19:11
@yagmoth555 If you have that experience please post an answer on meta.stackoverflow.com/questions/425115/… — Support Ukraine, Commented Jun 13, 2023 at 19:21
Re data dumps being back on by Friday, see this answer from Philippe. — T.J. Crowder, Commented Jun 14, 2023 at 5:49
@mbrig Virtually all mods who originally supported the strike were also in support of adding the condition of getting the data dumps back. Notably, the strike is not meant to be just about moderators' concerns; it is about the broader community, including but not limited to moderators, who has invested countless hours into contributing and curating content on these sites, creating the value that they have today, and who maintain a passion for these sites and wish not to see them destroyed. Fortunately, the data dump plank is now moot. — Cody Gray - on strike, Commented Jun 14, 2023 at 7:07
@ThomasWeller generally, just continue to be respectful while cutting back on curation/moderation. (signing on to the letter helps too) — Kevin B, Commented Jun 14, 2023 at 16:00
@CodyGray-onstrike Is there any progress being made, even if you can't say what it is? (Other than data dumps, of course.) — T.J. Crowder, Commented Jun 22, 2023 at 7:52
Moderators care about CC licenses of content on the main sites for the same reason that everyone cares about them, @Evan. The Teachers' Lounge contains private, confidential info, so, no, its contents are not dumped publicly. I agree it'd be great if they could be, but, unfortunately, that's just not tenable. Furthermore, the goal of the Teachers' Lounge is not to build a repository of knowledge. We have no objection to removing content that offers no redeeming value from the data dump. That's the same reason we delete it from the main site. — Cody Gray - on strike, Commented Jun 23, 2023 at 0:03
It's a condition for us volunteers continuing to devote hundreds of thousands of unpaid (free) labor to the billion-dollar company and its shareholders, @bad_coder. I don't think it's unreasonable or unrealistic. Millions of dollars of free labor has some cost. Donations often come with conditions or stipulations. These are ours. We would argue that not appreciating the millions of dollars worth of free labor (not to mention content, the very content that makes this site worth visiting for anyone) that volunteer moderators and community members provide is the part that is "out of touch". — Cody Gray - on strike, Commented Jun 23, 2023 at 0:05
