70

In a meeting with some moderators last week, I committed to releasing the data sets from our initial studies around the efficacy and false positive rates of ChatGPT detectors to them. Tuesday afternoon, we did so. This post contains as much information from that discussion as we are able to share in public, and represents the joint efforts of my community leadership team and other members of the community team.

We will open this post with a discussion of the baseline error rates of GPT detectors. Then we will discuss some data coming out of Stack Overflow that can rule out many hypotheses about the root causes of the contraction in participation.

We have done our best to look through the data and understand what hypotheses could hold water. That said, we are engineers and community managers, not scientists. We will not claim these are formal studies that produce scientific data, nor will we claim that the data sources are perfect or incontestable. Rather, we claim that they are operational data sufficient to support engineering decisions: the same standard that we use as a Community Management team. We’d like to be clear: nothing in this post is intended to cast aspersions on the moderator team of any site. We respect the work that they are doing to address this problem, as challenging as it may be.

This post will be a lot to take in. And even then, it’s a fraction of the work put into analyzing this question across multiple teams in the company. However, it’s a piece of the work we believe may help clarify some facts that have heretofore been obfuscated.

The tl;dr: Summarizing the results

  • The actual rate at which GPT posts are made on Stack Exchange has fallen continuously since its release, and is now very small.
  • The rate of suspension for frequent answerers rose roughly 16-fold, from ~0.4% to ~6.6%, after GPT’s release, and has held steady since.
  • If every GPT poster posted exactly three answers and were suspended within three weeks, this would imply a minimum GPT post rate of 330 answers per week on Stack Overflow; in practice, we would expect a significantly greater quantity. Measurements of GPT occurrences on the platform imply fewer than 100 GPT answers per week, well below this rate, implying the existence of many false positive detections.
  • The suspensions issued appear to have a significant and measurable impact on the demography and volume of answerers on the site, preferentially excluding frequent answerers.
  • GPT detectors continue to be ineffective for detecting GPT answers on the platform.

Automated GPT detectors have unusable error rates on the platform

In order for us to consider using detectors of any kind (automated or human) on the platform at these volumes, we’d need to see less than a 1-in-50 false positive rate from them. We’ve selected this rate as a ballpark estimate for acceptability.

At this rate, we would still expect to see around 150 incorrectly issued suspensions on Stack Overflow over the last six months. This value is still too high for comfort, and ideally, we’d see better rates than this. However, at this level of precision, conversations about how we might put such a system into practice can begin.

HuggingFace’s GPT detector assigns each post a score from 0 to 1 representing how likely it is to have been authored by GPT; a detection threshold is then applied to that score. Based on a random sample of 500 answers from before GPT’s release, each answer being no less than 400 characters long, the false positive rates are as follows: at the 0.50 detection threshold, around 1-in-5.5 posts are falsely detected; at the 0.90 detection threshold, around 1-in-13 posts are falsely detected.

A chart showing the false positive rate of HuggingFace's GPT detector by threshold.

Keen readers will note that 1-in-5.5 is a fair bit better than the 1-in-3 false detection rate we originally noticed. This is because we used two detectors during the original survey: ZeroGPT and HuggingFace’s detector. The new value of 18% +/- 3.4% false positives from HuggingFace falls within the 95% CI of the original smaller survey (27% +/- 8.9%).

While it is theoretically possible to achieve better baseline error rates than 1-in-20 by picking higher thresholds, the efficacy of the detector may fall off considerably. A detector that does not produce false positives is no good if it also produces no true positives.

For a 1-in-20 false positive rate, the detector threshold needs to be 0.97. Satisfactory rates (less than 1-in-50 false detections) could not be achieved in this test until the detection threshold was set to 0.9975, but the error due to low sample size at this threshold is considerable. Narrowing this window to a more precise value would require significantly more data points than we collected in this survey. At this point, we can’t endorse usage of this service either as a tool for discriminating AI-generated posts or as a tool for validating suspicions.
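For readers who want to reproduce this kind of estimate on their own sample, here is a minimal sketch (not our internal tooling) of how a false positive rate and its confidence interval can be computed from detector scores on known pre-GPT posts. The `scores` list is a random placeholder; in practice it would hold one detector score per sampled answer.

```python
# Minimal sketch: estimate a detector's false positive rate at several thresholds,
# with a normal-approximation 95% confidence interval. The `scores` list below is a
# random placeholder; in practice it would hold one detector score per pre-GPT answer.
import math
import random

random.seed(0)
scores = [random.random() for _ in range(500)]  # placeholder scores for 500 known-human answers

def false_positive_rate(scores, threshold):
    """Fraction of known-human posts flagged at this threshold, with a ~95% confidence interval."""
    n = len(scores)
    flagged = sum(s >= threshold for s in scores)
    p = flagged / n
    margin = 1.96 * math.sqrt(p * (1 - p) / n)  # normal approximation; unreliable when `flagged` is tiny
    return p, margin

for threshold in (0.50, 0.90, 0.97, 0.9975):
    p, margin = false_positive_rate(scores, threshold)
    print(f"threshold {threshold}: FPR = {p:.1%} +/- {margin:.1%}")
```

As a sanity check, an 18% rate measured on 500 samples gives a margin of roughly +/- 3.4% under this approximation, matching the figure quoted above; at thresholds where only a handful of posts are flagged, the interval becomes unreliable and far more data would be needed, as noted.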


We thought it would be helpful to include some more discussion and context. Rather a lot of discussion and context, actually. Over the last few months, folks within the company have been working to answer the question, “What has been taking place in the data coming out of Stack Overflow since GPT’s release?” Considerable changes have taken place in demography: users use Stack Overflow at different rates than they did before; different sorts of users use Stack Overflow; and questions and answers reach different outcomes.

What follows is a single piece of the puzzle. The broader picture informs the company’s operational decision-making processes. This piece has a part to play, but only a part. It is not a full explanation of the question, nor is it a blame-first investigation. Neither can it explain every change in the data on Stack Overflow that we see. However, what it reveals calls into question whether GPT detection is possible or effective on the platform, and by extension, whether a high false positive rate may be partially responsible for a decline in answerers and answerer retention on the platform.

The volume of users who post 3 or more answers per week has dropped rapidly since GPT’s release

While this value has been dropping slowly over time, it has been dropping at a well-characterized rate, one that has held consistent since late 2016, with some fluctuation, of course (particularly during the onset of COVID). Ordinarily we would expect this behavior to continue; however, after the rise of GPT, the slope inflects downward and shows no recovery. In total, the rate at which frequent answerers leave the site quadrupled since GPT’s release.

A chart showing the relative change in the number of users who posted 3 or more answers a week, normalized to October 2022.

It is worth noting that, early in the release of GPT, we changed the Stack Overflow rules to require new users to wait 30 minutes between first posts, instead of 3 minutes as was originally set for abuse prevention. If this change were causative, we would expect to see a sudden jump to a new lower level, and a return to the prior well-established rate of decrease. However, we do not see this, which is a strong indicator of deepening attrition. (We would also expect to see a discontinuity in other metrics not listed – this point is established by a confluence of metrics.)

The number of high-volume answerers has seen a -2.4% average week-over-week decrease since December. In total, there has been a -42% contraction in high-volume answerers since the release of GPT. While users may go to ChatGPT to ask their questions, they are obviously not going there to answer questions. Therefore we can consider that…

The total volume of questions available to frequent answerers continues to rise

The alternative hypothesis for the above chart is that the number of questions available for users to answer has simply fallen, on account of question rates falling. This claim is hard to swallow given current data. What follows is the # of available questions posted per week, divided by the number of users who post 3 or more answers in a week. If this hypothesis were true, then this value should be falling, or at least not rising as quickly, as active users are crunched by a collapsing question rate.

However, even though the question rate is collapsing, there are still plenty of available questions for users to answer. It’s hard to tell exactly what the carrying capacity is for this number, but we certainly know that it is somewhere between 3 and today’s value of ~18. In other words, if users were willing and able to answer more questions, the evidence strongly suggests that they could in fact do so. Of course, on short timescales, factors like % of questions closed and % of questions deleted could cause fluctuations in this value. However, in the long view, the trend is clear and not violated in the post-GPT region.

A chart showing the total number of questions asked per week, divided by the number of users who answer 3 or more questions per week.
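As a concrete illustration of the metric in that chart, here is a minimal sketch; the weekly counts below are made-up placeholders, not real platform data.

```python
# Illustration only: the weekly question and frequent-answerer counts below are invented
# placeholders. The metric is simply questions posted that week divided by the number of
# users who posted three or more answers that week.
weekly_counts = {
    "2022-10-03": (66_000, 6_000),   # (questions posted, users with 3+ answers) - hypothetical
    "2023-02-06": (54_000, 3_800),
    "2023-05-29": (45_000, 2_500),
}

for week, (questions, frequent_answerers) in weekly_counts.items():
    print(f"{week}: {questions / frequent_answerers:.1f} available questions per frequent answerer")
```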

This leaves one question remaining: Where are they going?

7% of the people who post 3 or more answers in a week are suspended within three weeks

…and this value has held reasonably stationary since these suspensions were enacted.

In the 16 weeks before we enabled moderators to suspend users for GPT usage, around 0.4% of users who posted >2 answers per week were suspended within three weeks. After we allowed GPT suspensions on first offense, 6.6% of users who posted >2 answers in a given week were suspended within three weeks, a 16-fold increase.

(Note that suspensions issued for users who posted fewer than three answers are not counted here; nor are suspensions issued three weeks after the user answered several questions.)

A chart showing the percent of users who post 3 or more answers in a given week, and are suspended within 21 days.

In a given week, 32% of the people who post 3 or more answers also did so during one of the last eight weeks. Supposing that these suspensions are distributed across users regardless of GPT usage, we should see (by rough order of magnitude) a ~2.2% decrease in the actual volume of users who post three or more answers in a given week. And indeed, real data (2.4%) are quite close to the theoretical percentage.
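For transparency, the rough arithmetic behind that comparison follows as a sketch; the inputs are the figures quoted above, and the 23-week span is our assumption about the December-to-June window.

```python
# Back-of-envelope check of the attrition argument above, using the figures quoted in the text.
suspension_rate = 0.066   # share of weekly frequent answerers suspended within three weeks
returning_share = 0.32    # share of weekly frequent answerers who were also frequent answerers recently

# If suspensions fall on returning frequent answerers at the same rate as on everyone else,
# this is the expected weekly loss of frequent answerers attributable to suspensions alone:
expected_weekly_decline = suspension_rate * returning_share
print(f"expected decline: ~{expected_weekly_decline:.1%} per week")   # ~2.1%, the same ballpark as the ~2.2% above

observed_weekly_decline = 0.024   # measured average week-over-week decrease since December
weeks_since_december = 23         # assumption: early December to early June

# Compounding the observed weekly rate reproduces the overall contraction quoted earlier:
total_contraction = 1 - (1 - observed_weekly_decline) ** weeks_since_december
print(f"compounded over {weeks_since_december} weeks: ~{total_contraction:.0%}")   # ~43%, close to the -42% above
```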

Instead suppose that no more than 1-in-50 of the people who were suspended for GPT usage were not actually using GPT. In order for this to be true, a large volume of users would have needed to immediately convert from being regular users to ChatGPT users; and then, a high rate of conversion would have to be sustained over time, long after the release of ChatGPT, in order to sustain present suspension rates.

Regardless of the above, no Community Manager will tell you that removing 7% of the users who try to actively participate in a community per week is remotely tenable for a healthy community. Supposing every suspension is accurate, the magnitude raises serious concerns about long-term sustainability for the site.

It is worth acknowledging that we did give explicit permission to suspend on 1st offense for GPT usage. However, even in the absence of these policies, this value alone rings a deafening number of alarm bells for potential false positive detections and contributor loss alike. If there are false positive detections, even removing users’ content incorrectly could prove harmful.

Users who post 3 or more answers in a given week produce about half the answers

First, a short detour. We are going to focus on users who post 3 or more answers in a given week for much of this post. While it may seem a bit odd to look only at this segment of users (and it is, of course, not the only segment of users we investigated), there is a rationale for doing so.

Users who post answers more than twice a week used to produce about half of the answer content on Stack Overflow. However, since the advent of GPT, the % of content produced by frequent answerers has started to collapse unexpectedly. Given the absence of question scarcity as a factor for answerers (note the above chart), the clear inference is that a large portion of frequent answerers are leaving the site, or that the site has suddenly become ineffective at retaining new frequent answerers.

A chart showing the percentage of answers posted by users who post three or more answers that week. It is relatively flat until late November 2022, when it starts to fall steeply.

(It is worth noting that GPT messages and suspensions disproportionately skew towards users who have posted more than two answers in a week, but we can’t discuss this in more detail publicly without revealing the details of how GPT posts are detected on the platform.)

Yet, at the same time, actual GPT posts on the site have fallen continuously since release

What follows is the internal ‘gold standard’ for how we measure GPT posts on the platform. It produces a coarse estimate, and can’t be used to decide whether a given post or person is posting using GPT. However, in aggregate, it can offer us insight into the ‘true’ rate of GPT posts on the platform.

This metric is based around the number of drafts a user has saved before posting their answer. Stack Exchange systems automatically save a draft copy of a user’s post to a cache location several seconds after they stop typing, with no further user input necessary. In principle, if people are copying and pasting answers out of services like GPT, then they won’t save as many drafts as people who write answers within Stack Exchange. In practice, many users save few drafts routinely (for example, because some users copy and paste the answer in from a separate doc, or because they don’t stop writing until they’re ready to post), so it’s the ratio of large draft saves to small draft saves that actually lets us measure volume in practice.

This metric is sensitive to noise, but was validated against other metrics early on at the peak of the GPT answer rate. Additionally, it matches the volume of actual actions taken on the platform early on after the release of GPT.

This allows us to get a good, albeit coarse, understanding of the overall population trends in GPT answers. However, it does not allow us to identify which specific answers were drafted by or with assistance of GPT, because quite a lot of answers are ‘normally’ posted after saving very few drafts.
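To make the idea concrete, here is a heavily simplified sketch of this kind of ratio-based estimator. It is not our actual implementation, and every number in it is hypothetical; the point is only that a pre-GPT baseline ratio of low-draft to high-draft answers lets you attribute any weekly excess of low-draft answers to copy-pasting.

```python
# Heavily simplified sketch of a ratio-based estimator (not the internal implementation).
# Idea: before GPT, a stable fraction of answers were posted with very few draft saves.
# After GPT, any excess of low-draft answers beyond that baseline is attributed to copy-pasting.

def estimate_gpt_answers(low_draft_answers, high_draft_answers, baseline_ratio):
    """Estimate GPT-suspect answers in one week.

    low_draft_answers  -- answers posted that week with very few draft saves
    high_draft_answers -- answers posted that week with a normal number of draft saves
    baseline_ratio     -- pre-GPT ratio of low-draft to high-draft answers
    """
    expected_low = baseline_ratio * high_draft_answers   # low-draft answers we'd expect anyway
    return low_draft_answers - expected_low              # excess attributed to GPT (can be negative: noise)

# Hypothetical numbers, purely for illustration:
baseline = 2_000 / 8_000   # 0.25 low-draft answers per high-draft answer before GPT's release
print(estimate_gpt_answers(2_100, 8_000, baseline))   # ->  100 GPT-suspect answers that week
print(estimate_gpt_answers(1_950, 8_000, baseline))   # ->  -50: noise around a true value near zero
```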

Between January 2023 and March 2023, it appeared that we were going to hit a floor of around 700 GPT-generated posts per week and stay there, which would have been quite a bad outcome for the site’s general health. However, as time progressed, it became clear that the actual volume of GPT answers was falling precipitously - even as a percentage of total answers (which, yes, is also falling comparably).

The number of GPT posts created week-over-week.

Some folks have asked us why this metric is capable of reporting negative numbers. The condensed answer is that the metric has noise. If the true value is zero, sometimes it will report a value higher than zero, and sometimes a value lower than zero. Since we know how much noise the metric has, we know what the largest value for GPT-suspect posts should be.
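One simple way such a ceiling can be derived is sketched below; this is not necessarily the exact internal calculation, and the weekly readings are hypothetical.

```python
# Sketch: bounding the true GPT-answer rate from a noisy estimator. The weekly readings
# below are hypothetical; the internal calculation may differ.
import statistics

weekly_estimates = [-10, 130, 60, 95, 40, 150, 70, 105]   # recent weekly GPT-suspect counts (hypothetical)

mean = statistics.mean(weekly_estimates)
spread = statistics.stdev(weekly_estimates)

# If the readings scatter around a small true value, that value is unlikely to exceed the
# sample mean by much more than two standard errors:
upper_bound = mean + 2 * spread / len(weekly_estimates) ** 0.5
print(f"mean ~{mean:.0f}/week, plausible ceiling ~{upper_bound:.0f}/week")
```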

At the initial holding point, GPT answers accounted for around 2.5% of the answers posted on Stack Overflow. Rates hovered around this level for several weeks, but this level was not sustained and the percentage of posts authored by GPT began to fall. The following chart shows the expected % of answers posted in a given week that are GPT-suspect.

These days, however, it’s clear that the rate of GPT answers on Stack Overflow is extremely small. It is, in fact, so small that it is difficult to estimate the true number of GPT answers posted on Stack Overflow. Based on the data, we would hazard a guess that Stack Overflow currently sees 10-15 GPT answers on a typical day, or 70-100 answers per week. There is room for error due to the inherent uncertainty in the measurement method, but not room for magnitudes of error. We can therefore say that the rate of GPT posts is far less than it was two months ago, and that, in turn, is less than it was two months before that.

So, under what conditions could it be the case that roughly 7% of frequent answerers on the site are still posting via ChatGPT? If this were the case, the site should be seeing at least 330 GPT answers per week, but the rate estimate is nowhere near that. This also assumes that every user who posts GPT answers is caught, and that each GPT answerer posts exactly three answers in a given week. In practice, the site would need to be seeing significantly more than 330 GPT answers per week to support this suspension rate.
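To make that arithmetic explicit, here is a back-of-envelope sketch; the weekly frequent-answerer count below is a hypothetical figure back-solved from the 330 number, not a published statistic.

```python
# Back-of-envelope version of the argument above. The frequent-answerer count is a
# hypothetical value chosen so the arithmetic reproduces the 330/week figure in the text.
frequent_answerers_per_week = 1_666     # hypothetical
suspension_rate = 0.066                 # ~7% of weekly frequent answerers suspended
answers_per_suspected_user = 3          # minimum by definition of the cohort

implied_floor = frequent_answerers_per_week * suspension_rate * answers_per_suspected_user
print(f"implied minimum: ~{implied_floor:.0f} GPT answers/week")       # ~330

measured_low, measured_high = 70, 100   # draft-based estimate quoted above
print(f"measured: {measured_low}-{measured_high} GPT answers/week")
# If every suspension were a genuine GPT poster, the implied floor would be several times
# the measured rate; that gap is the disagreement described above.
```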

It is possible that this measurement is wrong, either due to severe measurement error or due to an unexpected change in user behavior that hides GPT usage from this method. But the evidence for this viewpoint does not appear strong.

Many of the GPT appeals sent to the Stack Exchange support inbox could not be reasonably substantiated

While we could not recover all of the GPT suspension appeals sent to the Stack Exchange inbox, we could characterize some of them.

As a platform, we have an obligation to ensure that moderation actions taken on the Stack Exchange network are accurate and can be verified upon review if we need to do so. We need to be able to see, understand, and assess whether the actions taken were correct. It therefore needs to be said that we are very rarely, if ever, in the position where we cannot do so. In all other areas for which we receive suspension appeals, moderator actions are easily verified and double-checked the overwhelming majority of the time. It is rare and notable if we are ever in the position of overturning a moderator’s decision due to insufficient or contradictory evidence.

So, when we say that many of the GPT appeals we receive could not be substantiated on review, please keep in mind that our baseline value for this is zero, and it’s been that way for years. It is exceptionally strange for us to look at a moderator’s action and find ourselves unable to verify it – yet this is the situation we are frequently in with respect to GPT.

It is worth noting that we don’t believe this discrepancy is due to moderator misconduct or malfeasance. Our goal here is not to accuse moderators of wrongdoing or poor judgment. We respect the fact that they were, and are, working under difficult circumstances to achieve a goal we appreciate. Rather, it is important to remember that the company has a strong need to ensure moderator actions are verifiable and justifiable. And, on this point, we need to seriously consider whether these processes, in whatever form they are taking, do what they should.

True false positive rate of moderator actions

We want to clear up a particularly important point. We are, as of right now, operating under the evidence-backed assumption that humans cannot identify the difference between GPT and non-GPT posts with sufficient operational accuracy to apply these methods to the platform as a whole.

Under this assumption, it is impossible for us to generate a list of cases where we know moderators have made a mistake. If we were to do so, it would imply that we have a method by which we can know that the incorrect action was taken. If we could do this, we would share our methods in a heartbeat in the form of guidance to the moderator teams across the network, and then we’d carry on with things as normal. Instead, the most we can do is state that we just can’t tell. We lack the tools to verify wrongdoing on the part of a user who has been removed, messaged, or had their content deleted, and this is a serious problem.

This is a critical opportunity for us to inspect the processes that are contributing to the outlook for the site, and contextualize them in the overall state and progression of the network.

Stringing it all together

Look, the truth is, there are going to be many hypotheses about what could be taking place in the data on Stack Overflow. Heck, we’ve had several large analytics teams pore over the SO and SE data for any sorts of anomalies or possibilities when it comes to participation contraction. While it is absolutely true that Stack Overflow is losing users faster than it is gaining users, none of the hypotheses generated by the company can explain away the relationship between % of frequent answerer suspensions and the decrease in frequent answerers, in the context of falling actual GPT post rates.

In the Community Management industry, it is a well-known fact that removing a person from a community, even for a short time, has an outsize impact on the contributor community. The scalar factor here varies considerably from community to community, but it must be taken into account here as well. Deleting their contributions for reasons they feel unjust, or commenting on behavior that is not present, appears to have a similar effect to being suspended, and differences are often small when it comes to long-term user outcomes (such as the user leaving, and/or potentially other people leaving with them).

Is it still possible that the proportion of false positives is small? Maybe so – it can’t be completely eliminated at this time. Direct causative data are not possible to obtain for this problem. But for that to be true, it would require some very strange user behavior en masse around answering, by users who were otherwise answering questions normally. These are behaviors we do not have an organic explanation for after months of exploration, even under the scenario where question demand contracts at a rapid pace. Again, it’s possible, but the evidence is not favorable. We are forced to look at all possible exogenous causes of user attrition - including whether logic internal to the network needs to be altered (and we have made changes here, as well).

Finally, it needs to be said that the analysis presented here is far from the complete analysis we have conducted internally. These are pieces of a larger puzzle, selected because they best express our meaning (and contain no proprietary data). Seeking a set of root causes for the contraction in the network’s community size, and an explanation for how it affects different sites/community segments around the network, has been the object of study for dozens of people, and for many months now.

We hope by now the ultimate point here is clear: Suppose we are right in this assessment and GPT removal actions are not reasonably defensible. How long can we afford to wait? To what extent can we continue to risk the network’s short-term integrity against the long-term risks of the current environment? Any good community management policy must balance the risk of action against the risk of inaction in any given situation, and the evidence does not presently favor inaction.

What we know, right now, is that the current situation is untenable. We have real, justified concerns for the survival of the network. We’re not saying this to invoke despair, or to imply that these decisions are overly rushed. There are silver linings here, and there is significant potential for future growth and a more stable long-term path. The onus is on us to find a way forward that protects the communities and network as a whole.

If there is no future for a network where mods can’t assess posts on the basis of GPT authorship (where we would be after this policy), and if there is no future for a network where mods can assess posts on the basis of GPT authorship (where we are today), then there is no future for the network at all. Yet somehow, our adjacent communities are making this work. In this deadlock, something has to give. At the end of the day, it all boils down to this: We have to walk the middle path.

  • 121
    I see we're once again ignoring requests internally pointing out severe errors in the methodology Commented Jun 7, 2023 at 19:33
  • 80
    Why didn't you expose your metric for GPT-ness (the number of drafts) to the moderators and let them factor it in to their own decisions instead of declaring it to be the be-all end-all and override everyone's judgement?
    – pppery
    Commented Jun 7, 2023 at 19:47
  • 56
    Although it is an interesting analysis, something is missing. I asked about this on the Stack Moderators Team and didn't get an answer. What, exactly, is the problem you are attempting to solve? There may be more than one problem. For each problem, can you express that problem in a single sentence or question? What questions are you trying to answer or what problems are you solving? My current thinking is that there may be assumptions inherent in the question that do not hold true. (For staff/mods, the Team post I reference.) Commented Jun 7, 2023 at 19:55
  • 64
    Could you please explain why you didn't mention the actual numbers or percentages of appeals that you could not verify? I don't think "many" cuts it here, I can imagine users coming up with wildly different percentages here when interpreting this. Commented Jun 7, 2023 at 19:56
  • 86
    Although mods have made clear several times that they are not relying on detection software like this, and several have stated that they never use the software at all, I challenge the characterization that this analysis shows the detector is not useful. It shows that an acceptable false-positive rate can be achieved when used on single questions as long as the threshold is set to 0.9975. That's a high threshold, but also one frequently exceeded by AI-generated content fed to the tool. If you tested instead sets of, say, 3 posts by the same person, you might find it improves substantially. Commented Jun 7, 2023 at 19:58
  • 120
    Thank you very much for posting this publicly. Doing so will, hopefully, improve the overall discussion. It will allow a much wider audience to review the procedures used and conclusions drawn, which is a good thing overall, regardless of any position or opinion which we might or might not individually have.
    – Makyen
    Commented Jun 7, 2023 at 20:08
  • 138
    I've found one user who has posted 14 obviously ChatGPT answers in the last hour. So that pretty much accounts for your entire daily quota even if no-one else posts anything.
    – DavidW
    Commented Jun 7, 2023 at 20:35
  • 41
    It's very strange that you released your internal metric (thus ensuring that you can't use it anymore reliably), but decided to censor the actual number of suspensions in question. Are you worried that the public won't think it's large enough to be as big a deal as you are making it?
    – Chris
    Commented Jun 7, 2023 at 20:50
  • 74
    Philippe, I was really hoping you would show some data and linked outcomes, as per all the requests in TL, but you have really just stuck to your list of items that you are somehow assuming are correlated, while not providing any evidence of any linkage. I would be more than happy to attest that every post I deleted, and every account suspended for misuse of ChatGPT was accurate. Further I am confident the quality of those posts was terrible. Sure, I don't moderate SO, but I do moderate a large number of smaller sites - and your new policy will only cause damage.
    – Rory Alsop
    Commented Jun 7, 2023 at 21:28
  • 63
    "7% of the people who post 3 or more answers in a week are suspended within three weeks" <- I think one big piece of data needed here is how many of these are new accounts. Right now it fits both the "overzealous mods are banning longtime community members" and "a few bored GPT fans keep making new SE burner accounts". Ban evasion is... a common thing online
    – Kaia
    Commented Jun 7, 2023 at 22:11
  • 141
    Thanks for posting this. I'll post detailed thoughts once the June data dump is released and I can sanity-check some of these charts, but for now I would strongly recommend y'all work on better record keeping - I was NOT expecting to read, "we could not recover all of the GPT suspension appeals sent to the Stack Exchange inbox". Y'all are the backstop for these things - it is crucial that you take them seriously. If you're losing appeals, that is extremely worrying. Also, unless moderators have gotten a lot better in 3 years, a baseline of 0 is unrealistic - which also suggests data loss. 😬
    – Shog9
    Commented Jun 7, 2023 at 23:32
  • 42
    @Shog9 - to clarify the language used here, “recover” does not mean the remediation of a data loss in this case. “Recover” means “one of my staff went through our support inbox and attempted to pull out all the ones related to GPT”. To the very best of my knowledge, there are no “lost appeals” and there has been no data loss. (Our data protection and retention policies are pretty strict, and there’s no question that regulators and lawyers would probably be involved in that case.)
    – Philippe StaffMod
    Commented Jun 8, 2023 at 1:27
  • 48
    That's a relief, @Philippe - might wanna rephrase that section, I'm sure I'm not the only one to interpret it that way! FWIW, should be possible to query mod messages for replies - that's not going to give you the same information, but I've found it to be a pretty good indicator of exceptional circumstances. Threads with more than 1 reply nearly always warrant a look...
    – Shog9
    Commented Jun 8, 2023 at 2:56
  • 56
    "The actual rate at which GPT posts are made on Stack Exchange has fallen continuously since its release, and is now very small." (1) How do you know they are GPT posts? Did you use tools to check? How do you know the tools are accurate? (2) Did you include those GPT posts with added spam links in your data? Many GPT posts were deleted and accounts were suspended because they were spam posts, not because of GPT.
    – Nobody
    Commented Jun 8, 2023 at 7:22
  • 30
    Question I need to answer now is whether I still care, @Chris. A walled garden isn't the Stack Overflow I signed onto almost 15 years ago; it is the antithesis of that. I'm gonna take some time to reflect.
    – Shog9
    Commented Jun 9, 2023 at 18:50

37 Answers

295

What makes you so confident that your methods of identifying GPT posts are more accurate than how moderators have been identifying them?

So far, all of the data I've seen hinges on the assumption that your methods of detecting GPT posts - i.e. the "gold standard" - are inherently a better method of detection than the heuristics that moderators have been using, which involve evaluating the writing style, the frequency of posts, and the context of the user, among other methods. However, I have not seen any proof of this assumption.

Moderators have offered to go through blind testing, with known-good and known-bad data being used in order to see if mods can actually identify GPT posts. Moderators have asked you to identify cases where suspensions were issued in error.

We have yet to see any examples of cases where GPT suspensions were issued in error, with a very small number of cases being mentioned as "unverifiable appeals", and Stack has declined to test the moderators' abilities in this manner.

There has been no coordination or communication with the people issuing these suspensions and deleting content before this policy change was pushed through in the worst way possible. Nobody reached out to mention concerns of too many false-positives; nobody asked moderators to cut back on suspending for first-offences; and nobody discussed changing policy before the change was announced.
Moderators were given no opportunity to explain that the detectors are known to be inaccurate and are not relied on in moderation, or to compare identification methodology.

These are some severe gaps in the data and how it was gathered, and I cannot trust in its accuracy without some sort of proof being presented.

  • 65
    I'm deeply unimpressed with the idea that "number of drafts per answer" is a substantially more accurate measurement of GPT use than "number of answers deleted for suspected use of GPT." I simply find it incredible that the CM team is even considering the former as a more accurate metric than the latter.
    – Kevin
    Commented Jun 8, 2023 at 0:57
  • 2
    @Kevin If my (unverified) theory is correct, it should've been a good predictor before ChatGPT was released, back when people were using document-based interfaces like Write With Transformer. Now that we're seeing chatbot-style generators being used in earnest, I expect the metric to have become decoupled.
    – wizzwizz4
    Commented Jun 8, 2023 at 1:04
  • 12
    @Kevin I think what Philippe is driving at, but not stating plainly, is that they are uncomfortable because they cannot independently verify whether the mod's action was defensible. If they could find a reliable ChatGPT detector or other method to verify whether the mod did what they were supposed to do, they could pretty much automate the handling of appeals (but then of course they might as well deploy those as automated checks on the site itself in the first place).
    – tripleee
    Commented Jun 8, 2023 at 4:44
  • 34
    @tripleee: Unfortunately, life is like that sometimes. You can't possibly always know who's right and who's wrong. If their internal processes have no room for shades of gray, the problem is with those processes, not the moderators.
    – Kevin
    Commented Jun 8, 2023 at 4:56
  • 3
    No disagreement. It's more of me thinking out loud.
    – tripleee
    Commented Jun 8, 2023 at 4:57
  • 10
    @tripleee there is a small possibility that this is indeed an overblown issue and what the company was trying to do was to find a more verifiable workflow for suspension with no ill intent, maybe because someone did actually call in to the GDPR rule about fully automated actions. Yet even if this was the case, they made every possible mistake one could make while trying to do something that could have been achieved sitting in a bar while drinking tea together if they asked the right way. Public shaming on media, hidden information, no room for discussion all while advertising AI on blog etc. Commented Jun 8, 2023 at 15:10
  • "I have not seen any proof of this assumption" To be fair there seldom is proof of anything that leaves no reasonable doubt in such things. I guess one can also not prove the opposite, so it comes down to trust in the end and weighing the evidence. And even a blind test is not perfect, for example the used test setup could differ from reality. Commented Jun 9, 2023 at 10:41
  • Further, some platforms don't support drafts. iOS 12 for instance. Commented Jun 12, 2023 at 1:53
  • 6
    @tripleee Philippe is pretty clear: Participation is falling at a rate that poses an existential threat to the site and the company. Management and staff are in red alarm mode, and rightly so. One reason answering is going down is an incredibly high suspension rate of frequent answerers (at around 10%/month, an order of magnitude higher than baseline) which did not subside in lockstep with the number of actual GPT answers estimated through other channels; conclusion: Most valuable users are driven off the site at high numbers, contributing to the existential threat. Commented Jun 12, 2023 at 12:21
  • 7
    @tripleee The compounding factor here is that AIs have mastered a subsection or specialized version of the Turing test: They are indistinguishable from idiots when they are discussing technical issues. I fully believe Rory that he/she is convinced that each and every suspension was justified. It is just that this may be compatible with human authorship. Commented Jun 12, 2023 at 12:28
  • Without access to the actual suspension data, this is hard to assess. A reasonable null hypothesis is that these were not valuable, long-term answerers who were actually capable of bringing value to the site in the first place.
    – tripleee
    Commented Jun 12, 2023 at 12:31
  • 13
    I'm not a regular asker on SO but last week I had my first GPT answer. It was well formatted and tidy, if tonally a little heavy on the obsequious politeness of ChatGPT but the real giveaway was that it referenced methods and properties that simply don't exist and the code it gave was useless. If the company wants a way to drive users away, bad answers that look good is a great way to do that. In almost every non-trivial case that is what a LLM will offer.
    – glenatron
    Commented Jun 13, 2023 at 11:08
  • "in the Community Management industry, it is a well-known fact that removing a person from a community, even for a short time, has an outsize impact on the contributor community." and what is the impact on the contributor community of repeatedly overruling, denigrating and undervaluing your volunteer moderation staff, who as elected representatives of the community can be assumed to have a lot of community support and respect?
    – Ty Hayes
    Commented Jun 16, 2023 at 7:40
  • 1
    @Peter-ReinstateMonica It seems to me that the users driven off the site by the LLM ban were mostly spammers who want to rep farm, while the users driven off the site by prohibiting all LLM moderation are power users, SMEs and moderators who want to curate a high-quality repository of accurate information about programming. The company believes the value proposition of uninhibited LLM spam beats the 15 years of curation and expertise that built the site up to this point.
    – ggorlen
    Commented Jun 20, 2023 at 20:12
  • 2
    "What makes you so confident that your methods of identifying GPT posts are more accurate than how moderators have been identifying them?" - This is a silly question. Overzealous mods were suspending almost 7% of the site's active users every three-week window. That's both unsustainable and prima facie inaccurate. What evidence have moderators provided that their methods for detecting GPT posts are accurate? They're the ones suspending people; they're the ones who have to prove they're doing it justifiably.
    – aroth
    Commented Jun 26, 2023 at 3:50
261

Thank you for posting the methodology! That's a step forward.

Unfortunately, there's a critical flaw in your methodology: you assume that

In principle, if people are copying and pasting answers out of services like GPT, then they won’t save as many drafts as people who write answers within Stack Exchange.

If, over time, the rate at which GPT posters save drafts changed, your analysis breaks down. And since the number of drafts is the only argument you're giving, this brings the conclusion into doubt.

This would be a reasonable assumption for isolated posters who are not going to invest any effort in bypassing detection. But if a significant fraction of GPT posters are organized, it is not a reasonable assumption: you can expect them to evolve to counter your countermeasures.

I know from experience that you sometimes get a captcha if you paste an answer too quickly after loading a page. I'd expect automated posters to want to bypass that captcha, either by using an automated captcha solver or by arranging not to post too quickly. I have no a priori knowledge of which.


So I googled “how to post on stackoverflow with gpt”, and looked for relevant results dating from after the initial rush in December 2022. The first relevant result starting from February 2023, on page 2 of the results, is a Chrome extension, released on 28 February 2023. Going by the demo video, this extension posts the answer a few characters at a time, at the rate a human would type. Someone using this extension would generate a similar number of drafts as a human.

The existence of such a tool invalidates your metric. And since it's your only metric, it invalidates your conclusion.

I am much more inclined to trust moderators to remove bad content than statistics that do good math on irrelevant data.

  • 19
    I mostly agree with your conclusion but unless I'm missing something, the demo of that extension shows the GPT output being printed in real time (not in the answer box) in the same way GPT prints its output in chat mode. So, this doesn't affect the number of drafts. That being said, this doesn't mean that there are no other tools, tutorials, etc. that allows users to avoid captchas, bypass GPT detection systems, and so on.
    – 41686d6564
    Commented Jun 7, 2023 at 21:04
  • 51
    @41686d6564standsw.Palestine: IMHO the most straightforward explanation is that the GPT users are editing a word here, a word there, etc., to try and make it sound a bit less like an AI, and incidentally generating drafts. If I were trying to attack SO in this manner, that's literally the first thing I would try.
    – Kevin
    Commented Jun 7, 2023 at 22:00
  • 14
    @Kevin The first thing I would try is direct copypaste and then I'd run into the captcha. The second thing I'd try is to avoid running into the captcha, because I know captcha makers are going to try to break whatever captcha breaker I might use. And for that simulating a human's typing is a very natural solution. I don't know if GPT posters are doing that, but to me it's a more plausible hypothesis than moderators having suddenly dramatically increased their rate of unwarranted deletions. Commented Jun 7, 2023 at 22:05
  • 17
    @Gilles'SO-stopbeingevil': While I think that is entirely plausible, I prefer an explanation that doesn't require significant coordination on the part of the attackers, because I imagine many of them to be unsophisticated users who are playing with a shiny new toy. I'm sure some of them are organized as you describe, but I would tend to expect that some of them are just not sophisticated enough to reason about how CAPTCHAs work.
    – Kevin
    Commented Jun 7, 2023 at 22:07
  • 35
    There is a market for high-reputation SO accounts, although it's a very small market compared to buying political influence on Facebook and Twitter. In the 2010s, for SO, the main methods were: gaining access to an abandoned account (presumably by accessing the email account), voting rings (but I think those are efficiently combatted so it's never really worked), and posting plagiarized content (which still requires effort and is easy to verify). I expect GPT to replace plagiarism because it's so much more effort to filter. The incentive is there to develop efficient tools for GPT posting. Commented Jun 7, 2023 at 22:10
  • 39
    @Kevin The shiny new toy was the Dec 2022 spike. Now we're seeing the second wave who aren't just playing. Commented Jun 7, 2023 at 22:11
  • 11
    @Kevin "I imagine many of them to be unsophisticated users who are playing with a shiny new toy" -- I have seen several on MSE lately who became more sophisticated in disguising GPT contributions. Not all of them match the stereotype of new-account-dozen-answers-then-gone, and I don't know what their end game is, but it certainly happens nowadays.
    – dxiv
    Commented Jun 7, 2023 at 23:48
  • 4
    What about formatting? I don't use CGPT but from what I see, answers contain formatting and code boxes with a copy code button. Can you do a ctrl-A/ctrl-C and capture the whole text+code in a format that works in meta-wiki directly? How many formatting steps are required if you're trying to reproduce the formatting provided by the AI? Also, does that vary whether you're inside the openAI interface or some other interface, such as Bing? I don't know, but it may be people still want nice looking documents despite using the AI? Some topics might generate more formatting? Commented Jun 8, 2023 at 2:14
  • 2
    In terms of saving drafts... one thing I had considered recently was using a text editor to work on my answers offline, then pasting the finished product into the edit box. If there's anyone out there who does it that way, aren't they going to be a GPT false positive due to their own lower number of drafts? Sure, there'll be some saved as they correct errors in the Markdown etc. but even so.
    – AJM
    Commented Jun 8, 2023 at 9:16
  • 17
    @AJM They'd be false positives if the GPT rate was considered to be the low-draft rate, but it's not. There are many non-GPT posts that don't make drafts for various reasons: very short posts, posts made from older browsers where the draft code doesn't work, etc. We'd expect that rate to be mostly constant over time, so the changes to that rate are likely related to GPT. The idea is that the baseline is from before Dec 2022, then there's a GPT spike, and then the situation is with GPT posters whose behavior may or may not have changed over time. Commented Jun 8, 2023 at 9:48
  • 3
    Upvoted only because you went as far as to verify what I had only guessed: obviously users of chatGPT quickly catch up on the fact that posting an answer too fast is a clear tell-by and developed countermeasures that undermine any assumption the company made in their study. Commented Jun 8, 2023 at 14:48
  • @Gilles, I somewhat feel relieved that the Chrome extension you speak of does not work in dark mode for some reason. Maybe the extension itself was written from "AI" input. Commented Jun 8, 2023 at 20:50
  • I like to think of spammers as basically people like you or me that simply have a ChatGPT tab open and copy and paste from there. My first assumption would be that their behavior is simple and does not change over time. But of course it's possible that I'm too naive there and rep farming by posting GPT output on SO is a highly (profitable?) organized crime nowadays. Maybe there are any other indications for that? Commented Jun 9, 2023 at 6:43
  • 2
    For this theory to be true, it needs to presume that the answerers posting GPT-generated answers are lazy enough to "just ask chatgpt and paste the answer", but diligent enough to automate the process. That kind of presumption supposes a very high bar to clear to even demonstrate. Most people give up posting at the first hurdle, and this is something that even SE can observe.
    – Braiam
    Commented Jun 15, 2023 at 18:52
  • @Braiam It only takes one person to automate the process. Then a large pool of unqualified people can do the grunt work. This happens whenver there's money at stake. I confess that I don't know what money is at stake at the moment in pushing AI-generated content on SO: highish-rep SO profiles aren't that much of a market. Commented Jun 15, 2023 at 20:18
231

Appeals of moderator actions that cannot be validated need to be shown to the moderator team

When taking action against a user, we need to have strong evidence that we are correct. That evidence should be, whenever possible, documented in a form that allows a second person to double check that the action was correct.

When reviewing appeals, the standard should not be whether a moderator action can be proven wrong. We hold ourselves to a higher standard than that. If the reviewer can't find sufficient evidence to demonstrate that the action was correct, then we should be contacted to supply any additional details and the results of our original investigation. If the reviewer still feels the action hasn't been demonstrated to be correct, then the action should be reversed.

I can only find 3 instances when the Stack Overflow moderator team was contacted by Community Managers about appeals related to ChatGPT/AI suspensions. Of those:

  • One poster admitted that they had used ChatGPT, but wanted some answers undeleted that they had written themselves and posted among the GPT answers. (To be clear, those posts were recognized as non-GPT but still deleted for another reason. The Community Manager had been informed in detail about that, and did not object.)
  • One set of posts was affirmatively confirmed to be ChatGPT to the satisfaction of the CM.
  • One set of posts was not specifically confirmed to be AI-generated, but a CM agreed that it was unlikely that this user was writing their answers specifically for each question they answered. That suggests that there were, at the very least, quality issues (the handling moderator noted that there was a lot of dupe answering).

However, your post strongly implies that there are many appeals that we have not seen. More specific numbers were discussed internally, leading us to believe that we have not seen the vast majority of appeals that could not be validated. We need to know if we're making mistakes so that we can figure out how it happened and prevent it from happening again. And if we're not making mistakes, then perhaps we need to document our work better. Either way, we'll have more information on possible root causes for the problem you're trying to solve.

Please share with the moderator team these and any future appeals that cannot be validated (regardless of the reason for the action).

  • 42
    Good Lord but this needs to be higher. Amid so much speculation, this is actionable. And it addresses a genuinely bad thing that's kinda hidden in all the details of the post: mods are flagging stuff as ChatGPT that reviewers think isn't ChatGPT! (Or maybe it's only those that lead to suspension? Maybe that reconciles the gap between people who have found 10,000+ GPT answers and the smaller number in this post?) Regardless, if I were a mod and I was misidentifying answers as GPT, I would want to know.
    – JonathanZ
    Commented Jun 8, 2023 at 13:18
  • 29
    I have been - and continue to be - impressed by and grateful for the clarity of thought and consistency of message coming from our volunteer S.E. moderators and community leaders (this is just one example). Commented Jun 8, 2023 at 15:51
200

Notes about my following SEDE queries: upon a new SEDE refresh, you may have to run the cross-site queries a few times before they finally complete. Ignore the data points for the current month (cropped out of the screenshots); that data is not yet complete, e.g. things like the Roomba have "lag".


The alternative hypothesis for the above chart is that the number of questions available for users to answer has simply fallen, on account of question rates falling. This claim is hard to swallow given current data.

The total volume of questions available to frequent answerers continues to rise

You're kidding me, right?

Your assertion that the number of available questions is rising is completely contrary to reality and relies on a warped (IMO) view of the data. Here's a network-wide query of new contributions per month since 2018.

network-wide query of new contributions per month since 2018

Here's a site-specific version of the query if you're interested.

Just in case it's not obvious enough, question influx rate is dropping. A lot. And so is answer influx rate.

Even if frequent answerers are leaving as you claim (more on that later), all else being constant, that has no effect on the number of questions available to other answerers. If there are N new questions per day, and one answerer leaves, there are still N new questions per day. Each of those N questions is still available to each answerer. There will likely be more unanswered questions per day, but that's not your argument, and even if it were, that's a pretty vapid statement. And all else is not constant: question influx rate is declining rapidly.

  1. Did you forget that in your new Code of Conduct, you state:

    Broadly speaking, we do not allow and may remove misleading information that:

    • Misrepresents, denies, or makes unsubstantiated claims about historical or newsworthy events or the people involved in them
  2. Why aren't we instead talking about the problem of loss of askers and questions? There are no answers without questions. Losing questions is logically a more root problem. Why not give attention to the more root problem?


The way I see it, answer rates are dropping as a result of question rates dropping - not because of the wrongful suspensions that you are trying so hard to believe are happening.

I have two reasons to believe that answers are dropping because of questions dropping:

  1. Some pretty simple logic: Again, there can be no answers if there are no questions. Answers are to questions, and answerability is largely dependent on question characteristics.

  2. That's certainly what the consistent proportion of new answers to new questions per month suggests to me.

Though there was a (not at all surprising) uptick in new deleted answers per new (deleted or not) question last December with the ChatGPT release and ban policy on SO, the proportion of new (non-deleted) answers to new questions (deleted or not) hovers fairly constantly around 0.74 and has not experienced any dramatic change in the several years before the release of ChatGPT, or since. It has been fluctuating smoothly, with a gradual decline from an average of ~0.8 in 2020. In that sense, from where I stand, I think you are missing the forest for the trees. Here's a network-wide query of average new answers per new question per month since 2018.

network-wide query of average new answers per new question per month since 2018

Notice how the trend for non-deleted answers is smooth from Nov 2022 to Dec 2022 to Jan 2023. Here's a site-specific version of the query if you're interested.

The smoothness of the monthly new non-deleted answers in proportion to new questions, and the noticeable jump in deleted ones, lead me to believe that - largely speaking (I did see some high-rep users violate the policy) - we didn't significantly lose our consistently-answer-contributing userbase. Instead, we rapidly gained a bunch of gold reputation prospectors trying to get rich quick and easy with the fun new toy, without any regard for possible damage done to the ecosystem. And we were able (thanks to mods and flaggers, and CMs who cooperated with mods in blessing the ban policy) to keep the current and long-term damage under control, protecting the Stack Exchange network's reputation as a trusted source of information.

Let me add to that my own experience: My rate of answering on SO has been in decline recently, and it's not because I'm spending less time or effort looking for things to answer. There was a period earlier this year when I was averaging ~10 answers per day. Now it's more like ~3. It's largely because I've been getting fewer questions in my tags. A few months ago, I used to wake up to over a page of new questions, and now I often wake up to less than a page.

I'm very much inclined to believe that you're pointing your fingers at the wrong place and the wrong people.


The volume of users who post 3 or more answers per week has dropped rapidly since GPT’s release

In your obsession with answerers, you are missing that, according to your own methodology and logic, askers seem to be "leaving" even faster. Continuing with your weird focus on answers instead of questions, here's a SEDE query I made of SO graphing the number of contributors who asked three or more questions or wrote three or more answers in a week since October 2011 (basically extending your graph to actually show the historical trend and put question-askers in the picture as well):

number of contributors who asked three or more questions or wrote three or more answers in a week since October 2011

Here's a SO query for the range from 2016-01-01 to 2022-11-30, and here's a SO query for the range from 2022-12-01 to today (2023-06-12 at the time of this writing). Here's the Google Sheet where I got the fit lines.

            2016-01-01 to 2022-11-30   2022-12-01 to 2023-06-12   change factor
answerers   -1.143154562               -4.850514565               4.243096014
askers      -0.3719258563              -1.875632304               5.043027454
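For what it's worth, here is a sketch of how such fit-line slopes and change factors can be computed; the weekly counts below are placeholders, and the real series come from my linked SEDE queries.

```python
# Sketch of the fit-line / change-factor computation behind the table above.
# The weekly counts are placeholders; the real series come from the linked SEDE queries.
import numpy as np

def weekly_slope(counts):
    """Least-squares slope of a weekly series (change in contributor count per week)."""
    weeks = np.arange(len(counts))
    slope, _intercept = np.polyfit(weeks, counts, 1)
    return slope

answerers_before_gpt = [6200, 6150, 6180, 6100, 6050, 6020]   # placeholder weekly counts
answerers_after_gpt = [5200, 5000, 4820, 4700, 4550, 4400]    # placeholder weekly counts

before = weekly_slope(answerers_before_gpt)
after = weekly_slope(answerers_after_gpt)
print(f"slope before: {before:.1f}/week, slope after: {after:.1f}/week, change factor: {after / before:.2f}")
```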

I.e. using your approach and logic, your "dedicated asker-base" has been "leaving" 1.188525416 times faster than your "dedicated answerer-base". How did you not notice that? Were you too busy trying to find a scapegoat and look good to whoever you need to impress that isn't us? You're making much ado about a smaller problem and ignoring a bigger one, and destroying your relationship with the more dedicated part of your userbase in the process...

I see no data to support that that problem is any fault of the moderators or curators in enacting ChatGPT policies. If anything, from what I've been reading on Reddit, people are going to ChatGPT to ask their questions instead of Stack Overflow.


In total, the rate at which frequent answerers leave the site quadrupled since GPT’s release.

You've skipped an important step (and it's why I've been putting "leaving" in double-quotes). What does writing fewer answers have to do with leaving the site? Again, there are fewer questions. I'm still here. Again, I'm writing fewer answers largely because there are fewer questions.


If you're concerned about traffic dropping with the rise of ChatGPT, I just can't understand why the first thing you'd think to do is basically allow ChatGPT answers instead of looking at things like improving your platform and making it more usable, improving user experience / guidance / onboarding, fixing bugs, and looking at highly-requested features.

When I read people's comments on Reddit about why they're leaving, it's often a case of What about the community is "toxic" to new users?. See Mithical's answer to that question - it's why I bring up improving user guidance and onboarding. In fact, you have a project to work on that: Introducing new user onboarding project. What happened to that project?

Like - c'mon. Stick to your guns. Stack Exchange succeeded because its whole approach to being a Q&A platform was valuable, and I'm convinced that it can and will continue being valuable without allowing LLM-generated answers.


7% of the people who post 3 or more answers in a week are suspended within three weeks

I don't see 7% as significant compared to the roughly 25% decline in overall post contributions between November 2022 and May 2023. Again, I'm so confused about why we're all sitting here talking about this mouse when there's a whole elephant in the same room.


no Community Manager will tell you that removing 7% of the users who try to actively participate in a community per week is remotely tenable for a healthy community.

:/ Just 6 months ago you were quite supportive of SO's ChatGPT policy, for reasons of community and platform health.

You wrote:

https://stackoverflow.com/help/answering-limit

We slow down new user contributions in order to ensure the integrity of the site and that users take the time they need to craft a good answer.

https://stackoverflow.com/help/gpt-policy

Stack Overflow is a community built upon trust. The community trusts that users are submitting answers that reflect what they actually know to be accurate and that they and their peers have the knowledge and skill set to verify and validate those answers. The system relies on users to verify and validate contributions by other users with the tools we offer, including responsible use of upvotes and downvotes. Currently, contributions generated by GPT most often do not meet these standards and therefore are not contributing to a trustworthy environment. This trust is broken when users copy and paste information into answers without validating that the answer provided by GPT is correct, ensuring that the sources used in the answer are properly cited (a service GPT does not provide), and verifying that the answer provided by GPT clearly and concisely answers the question asked.

[...] In order for Stack Overflow to maintain a strong standard as a reliable source for correct and verified information, such answers must be edited or replaced. However, because GPT is good enough to convince users of the site that the answer holds merit, signals the community typically use to determine the legitimacy of their peers’ contributions frequently fail to detect severe issues with GPT-generated answers. As a result, information that is objectively wrong makes its way onto the site. In its current state, GPT risks breaking readers’ trust that our site provides answers written by subject-matter experts.


Supposing every suspension is accurate, the magnitude raises serious concerns about long-term sustainability for the site

In the quote just above, you were concerned about the long-term sustainability of the site in support of suspensions for ChatGPT-generated content:

Moderators are empowered (at their discretion) to issue immediate suspensions of up to 30 days to users who are copying and pasting GPT content onto the site, with or without prior notice or warning.

It's so confusing and frustrating for me to watch you suddenly do a 180 in this direction, contradicting yourself. From my point of view, it's just inexplicable. I'm not quite angry, but I'm certainly flabbergasted.


However, since the advent of GPT, the % of content produced by frequent answerers has started to collapse unexpectedly. Given the absence of question scarcity as a factor for answerers (note the above chart), the clear inference is that a large portion of frequent answerers are leaving the site, or the site is suddenly not effective at retaining new frequent answerers.

Again, the data does not support this. The rate of incoming questions is dropping dramatically, and with it, the rate of answers to those questions. And that inference of yours is not clear to me. Again, what do fewer answers have to do with people leaving? Until I see actual data that directly supports that conclusion, I'll be here pressing X.


In your draft approach, I'd like to see data focusing on the sizes (number of characters) of initial drafts instead of the number of drafts.

Like - why have you ruled out that people are just getting craftier about evading detection: pasting into the answer input and then editing there to make the content look less like it was ChatGPT/LLM-generated? Because that's totally what I'd expect to be happening over time - especially with suspensions being handed out. And that's the very same reason I'm doubtful of the following claim of yours:

These days, however, it’s clear that the rate of GPT answers on Stack Overflow is extremely small.

Stray question: Do the drafts you're looking at include content that gets posted quickly enough that there is no intermediate draft?


Some folks have asked us why this metric is capable of reporting negative numbers. The condensed answer is that the metric has noise. If the true value is zero, sometimes it will report a value higher than zero, and sometimes a value lower than zero. Since we know how much noise the metric has, we know what the largest value for GPT-suspect posts should be.

Can you clarify and expand on this?

  • In the graph, what's the light blue line, and the dark blue line?
  • How/why does the metric have noise?
  • What does "true value" mean? Are you saying you've found a 100% accurate way to detect ChatGPT answers? (obviously not, but I can't understand what else "true value" would mean. Are you referring to your own technique? Because as I've already explained, I think it has flaws (unless I've misunderstood your explanation of it).)

While we could not recover all of the GPT suspension appeals sent to the Stack Exchange inbox, we could characterize some of them.

... what do you mean "could not recover"? You mean you lost them? If so... how? (This was since clarified here.)


We are, as of right now, operating under the evidence-backed assumption that humans cannot identify the difference between GPT and non-GPT posts with sufficient operational accuracy to apply these methods to the platform as a whole.

Have you talked with NotTheDr01ds and sideshowbarker? For example, see NotTheDr01ds' post here, and sideshowbarker's post here. I'm sure they'd beg to differ. Also, this statement seems to lose the nuance between true positive rate and false positive rate. I think humans can achieve an excellent true-positive rate with some basic heuristics while erring on "the side of caution".

Under this assumption, it is impossible for us to generate a list of cases where we know moderators have made a mistake. If we were to do so, it would imply that we have a method by which we can know that the incorrect action was taken.

Again, have you talked with mods like sideshowbarker and users like NotTheDr01ds? I'm sure they could give you some practical, useful, concrete heuristics.


none of the hypotheses generated by the company can explain away the relationship between % of frequent answerer suspensions and the decrease in frequent answerers, in the context of falling actual GPT post rates.

Again, I think you're missing something obvious - namely, that questions are coming in at a lower rate as people go to ChatGPT instead of Stack Exchange to ask their questions. I'm inclined to see the relation you're staring at as a spurious one. I don't get why you're ignoring the relation between the decreased rate of incoming questions and the decreased rate of answers. As I've shown, the answer influx rate has remained consistently proportional to the question influx rate for well over 5 years.


Suppose we are right in this assessment and GPT removal actions are not reasonably defensible. How long can we afford to wait? To what extent can we continue to risk the network’s short-term integrity against the long-term risks of the current environment? Any good community management policy must balance the risk of action against the risk of inaction in any given situation, and the evidence does not presently favor inaction.

Again, I find it funny and sad how consistent that is with the GPT policy Help Center page, if you just replace "GPT removal actions" with "GPT addition actions".


if there is no future for a network where mods can assess posts on the basis of GPT authorship (where we are today)

... why? I don't follow the reasoning. Again, as I've said, Stack Exchange succeeded because its approach provided real value, and I don't see how that value is gone with the rise of ChatGPT. I plan to write up a new Q&A specifically about this, or reuse ones where I've already written answers, such as here and here.

23
  • 7
    Worth noticing that a line like "Supposing every suspension is accurate, the magnitude raises serious concerns about long-term sustainability for the site" seems to imply that, since it is not sustainable to suspend every accurate detection, we should stop doing so. Which is quite in line with the picture the mods apparently got in the private version of the policy. Commented Jun 8, 2023 at 8:42
  • 29
    I gotta say, it's quite interesting to me that the trend of new answer decline happens to start around the same time as the fallout from the Monica debacle. Of course, the pandemic started to ramp up around the same time as well, but I have a lot of uncertainty there. It's hard to suss from the scale of your chart, but it seems like the decline started just before COVID hit its stride. Any thoughts?
    – BryKKan
    Commented Jun 8, 2023 at 13:48
  • 12
    @BryKKan I'd have expected that traffic to Stack Overflow would increase with COVID. Software development can largely be done from home, so the questions would keep coming. And answers too, because with no place to go after work, you might as well answer a few questions on SO. Commented Jun 9, 2023 at 6:05
  • 7
    I'd say the graph agrees with my expectations around Monicagate (late 2019) and COVID-19 (US lockdowns starting March 2020). There was a weak downward trend before COVID, then a huge uptick, then a stronger downward trend; the Monica incident might have contributed, but then in non-obvious ways which are hard to impossible to decouple from the COVID effects. Finally, starting in 2023 or late 2022, a steeper still downward trend, apparently due to ChatGPT.
    – tripleee
    Commented Jun 9, 2023 at 6:25
  • 2
    The first 2 statements do not necessarily contradict each other - there can be more unanswered questions than ever right now, some of them old. Commented Jun 9, 2023 at 11:22
  • 3
    Very instructive answer. In the graph with the ratio of average questions and average answers there seems to be a roughly yearly periodic effect. Are incoming students with lots of questions the reason? Also, in spring and summer 2014 SO took a huge hit before stabilizing itself. I wonder why the number of new questions went down so dramatically back then. GPT wasn't invented yet. Maybe we can learn something from that period. The situation looks a bit similar. Commented Jun 9, 2023 at 22:17
  • 4
    I agree with a lot of your comments but I'd also point out Philippe's statement (emphasis added) was "The total volume of questions available to frequent answerers continues to rise". If the number of frequent answerers is falling (which Philippe states is happening) faster than the number of asked questions, then this statement is still true. Commented Jun 10, 2023 at 12:42
  • 5
    relation between decreased rate of incoming unique questions, and decrease rate of answers. - I have an 8yo answer that I've linked to twenty seven times. It wasn't GPT, it was getting to 20k, now seeing all the deleted garbage making it look like the rest of the internet. And there's quite a few of my peers who joined around the same time as me who are now ~100k because they keep answering dupes, instead of doing it right.
    – Mazura
    Commented Jun 10, 2023 at 18:48
  • 2
    @Zhaph-BenDuguid how so? I'm a frequent answerer. All else being constant, other frequent answerers answering less doesn't mean there are more questions for me to answer. It could mean there are more questions with no answers for me to answer, but that's not part of the argument Philippe presented. And all else is not constant here. As I've shown, question influx is declining rapidly. Either I'm still missing something (and if so, I apologize), or I think you've fallen for a pretty obvious logic trap.
    – starball
    Commented Jun 10, 2023 at 20:34
  • 6
    Eh, zhaph is correct. "Your assertion that the number of available questions is rising is completely contrary to reality" - Philippe never made that assertion. I'm all for pointing out flaws in the theory, but let's at least make sure we're pointing out flaws in the theory presented, not one that wasn't.
    – Kevin B
    Commented Jun 13, 2023 at 14:19
  • 1
    @Zhaph-BenDuguid "The total volume of questions available to frequent answerers continues to rise" - this observation from Philippe is not so relevant. Answerers predominantly focus on questions not older than 7 days. Older questions just hang around and don't seem to be that interesting fruit. Looking at the number of new questions, as starball does, is the more interesting quantity. The company unfortunately is looking at the wrong thing. Commented Jun 14, 2023 at 4:35
  • 4
    @starball Yes, that's exactly the quote I think you're misrepresenting. The first sentence in that section is "The alternative hypothesis for the above chart is that the number of questions available for users to answer has simply fallen, on account of question rates falling." which more aligns with my interpretation, as does the data he presented to support it. He clearly agrees here that question rates are falling in general. He's trying to disprove that that is why we're seeing a reduction in activity from frequent answerers.
    – Kevin B
    Commented Jun 14, 2023 at 4:43
  • 4
    put another way, he's literally stating the same data you're claiming disproves his statement in the first sentence... how does that make any sense if he's trying to argue that question rates aren't falling? he literally asserts that they are.
    – Kevin B
    Commented Jun 14, 2023 at 4:49
  • 3
    @starball yes, the data he provides next shows that there are more questions coming in than there are answerers, on average, on a day-to-day basis. He's claiming the falling rate of questions isn't as steep as the falling rate of active answerers, and therefore the falling rate of questions doesn't explain the falling rate of active answerers. I agree that this is a weak argument, but it doesn't change that your data in the first chart doesn't in any way disprove it. He accepts that data as fact right out of the gate.
    – Kevin B
    Commented Jun 14, 2023 at 4:52
  • 2
    Effectively, his data shows that, despite the question rate going down by x%, the number of "active answerers" (answerers who post 3 or more answers a month) dropped not by x%, but by x% + n, i.e. more than he expected it to if the drop were solely due to the question rate dropping. There are certainly better arguments for why these rates differ than "mods are suspending them", but the data doesn't prove or disprove that it's simply due to fewer questions being asked either. That's just another hypothesis.
    – Kevin B
    Commented Jun 14, 2023 at 5:06
142

I came off of read-only mode for this.

My initial take on this is that the analysis tries to affirm that user retention is valued more highly than user quality, but the data points are still reasonable enough to be worth asking questions about.

The actual rate at which GPT posts are made on Stack Exchange has fallen continuously since its release, and is now very small.

How are you determining if a post was made with ChatGPT at all? What's the selection criterion for that? I mean, even today I've managed to find one just sitting on the front page that very clearly has a ChatGPT response.

If you're saying that you have a way to reliably detect ChatGPT responses, but the mods don't, that's prime disconnect number one. Otherwise all of this data is kinda pointless, isn't it?

The rate of suspension for frequent answerers rose by a factor of 16 from ~0.4% to ~6.6% since the rise of GPT, and has held steady since its release.

I wonder if this is a side effect.

You only tie HuggingFace to this as an example but there are other known detectors out there which simply aren't mentioned.

There's also this sentiment - the standing orders were for diamond mods to issue suspensions to users if they used ChatGPT. Some users did so quite flagrantly during the winter, and even wanted to promote it as a new kind of workflow.

But perhaps, just perhaps, more people are eager to use ChatGPT and want to gain notoriety as experts.

I'll never be able to speak to the accuracy of detection methods, which is... always going to be a moving target. I don't think we can confidently identify all of them, but it does feel like wanting high-quality content on the site comes at the cost of users being suspended more.

But I suppose I'd need to challenge y'all on this. Yes, the rise is high. But is it justified? Is there more to this story than "wait, people aren't posting as much ChatGPT stuff, but they're getting banished a lot more often."

One additional note - the bans aren't permanent, and I don't think they're even that long. It's the case that someone did a bad thing and gets to go away for a week or so. It's up to them if they come back or not and want to be a part of the community... From a vanilla moderation perspective, that's not functionally different than someone posting a whole bunch of low-quality questions or answers and getting rate-limited, or someone deciding to inflame tensions in Chat and getting a time-out for a few days. The action's on them, not us.

If every GPT poster posted exactly three answers and were suspended within three weeks, this would imply a minimum GPT post rate of 330 answers per week on Stack Overflow; in practice, we would expect a significantly greater quantity. Measurements of GPT occurrences on the platform imply fewer than 100 GPT answers per week, in disagreement with this rate, implying the existence of many false positive detections.

Again, I want to know how you're detecting ChatGPT posts to begin with such that you got this number.

The suspensions issued appear to have a significant and measurable impact on the demography and volume of answerers on the site, preferentially excluding frequent answerers.

So the impact of this statement reads to me as, "people who post often are getting suspended more often, and that's bad". This only raises the question of what quality their posts were to begin with. As in, is this demographic of people who frequently answer adding any measurable value to the site, even before all of this ChatGPT nonsense?

GPT detectors continue to be ineffective for detecting GPT answers on the platform.

Sounds like y'all got a good one. Mind sharing?


I suppose my ire in all of this is really centered around the timing and optics, too. ChatGPT hit the network like a runaway freight train. It's still a problem if we want to pride ourselves on having expert-answered content on the site.

Then we have the company pushing more and more AI on the community, as if we're OK with going along with it through every twist and turn. And sure, there's some skeptical optimism, but nowhere near as much as the press the CEO puts out implies.

So instead of engaging with this data first, you took the really heavy-handed approach of telling our diamond moderators that they can't do what they need to do to get rid of AI-generated content, which - wait for it - goes against the notion of having expert-answered content.

The story I tell myself from that is that you value people being here more than the quality of the content. And if that story is true, that doesn't work for me, someone who likes the quality that Stack Overflow outputs. That implies that I, an SME in Java, Python and other technologies, have no place here, or nothing to contribute to the company as an unequal but not insignificant contributor.

I hope my story is just a story. But given the posturing taken from this whole strike-thing and even the tone of this post, I'm not so sure.

(No, don't try to go on a diatribe saying you're committed to quality. If you were, the people who keep posting ChatGPT stuff would be unceremoniously banned from the site and we wouldn't care about it.)

I'll give you some credit though - you did at least bring us the data we've wanted. It's just... a day late and a dollar short, and hell, you've already made decisions on this data, which could be flawed from an outsider's perspective.

20
  • 36
    The part about the optics is exactly where my head is at - if quality no longer matters it's unclear why I'm here. And I'm really struggling to read it any other way.
    – Flexo
    Commented Jun 7, 2023 at 21:32
  • 22
    New users, if banned temporarily, will rarely return. Banning someone for a week is pretty much the same as banning them forever
    – Richard
    Commented Jun 7, 2023 at 21:34
  • 2
    @Richard: That's a human behavior problem, then. To be fair we've had literally decades of this style of moderation to explain why, from the venerable bulletin boards to IRC to phpBB to Reddit and the mainstream social networks. Only in the last few years have strong operating policies or laws required communities to start caring more about this fact.
    – Makoto
    Commented Jun 7, 2023 at 21:43
  • 12
    Sure, but saying 'the bans aren't permanent, and I don't think they're even that long' is incorrect if most bans result in the person leaving entirely.
    – Richard
    Commented Jun 7, 2023 at 21:51
  • 24
    @Richard: Do not conflate a prohibition from participating with an unwillingness to participate. I don't disagree that banning someone carries the likelihood that they'll not come back. But once the ban expires, they can participate as normal here. It's the choice of the individual to come back or not, hopefully after reflection on what they did wrong.
    – Makoto
    Commented Jun 7, 2023 at 21:53
  • 25
    @Makoto "you value people being here more than the quality of the content" Wish I could upvote this more. The writing on the wall was there when I posted this more than a month ago: "If that's the choice, I expect such a change of direction to alienate and drive away many of the longtime good-faith contributors, myself included.".
    – dxiv
    Commented Jun 7, 2023 at 23:57
  • 10
    @Makoto I think the part you're missing is that if it's an unjust suspension - and especially if they never receive any kind of recognition or apology as to that - then that choice isn't really neutral. If they want to participate in a platform that treats them with respect, then the moderator who banned them effectively selected SO out of the candidate list. Of course, this is somewhat a "devil's advocate" position. In my experience, moderators mostly tend to engage more proactively than that, so I question how widespread a problem this actually was.
    – BryKKan
    Commented Jun 8, 2023 at 4:16
  • 9
    I guess the OP is too long? Philippe is pretty clear about their metric for estimating the number of ChatGPT posts. Other answers here demonstrate that the metric is seriously flawed, but it is described in the OP. Commented Jun 8, 2023 at 15:22
  • OP in these quarters means original poster. Commented Jun 8, 2023 at 16:34
  • @CrisLuengo I'm not convinced that the metric is sound though. They are presuming multiple edits automatically means non-GPT, but since the policy was announced publicly, there seems (to me) to be sufficient motivation for GPT users to try to edit their copy-paste answers so as to not appear to be GPT-sourced. This seems to be the only metric being used to determine false positives, and for the total count they seem to be relying on moderator bans. So there isn't actually any true empirical data here - just conjectures with little evidence
    – vbnet3d
    Commented Jun 8, 2023 at 17:50
  • 4
    @vbnet3d Yes, I never said their metric was good, I just said that it’s described. This answer asks what the metric is, I was replying to that. Plenty of other answers here poke holes in the metric, the data, the analysis and the results. Commented Jun 8, 2023 at 19:21
  • 2
    @CrisLuengo: Maybe in this context you're missing the forest for the trees. If Stack Overflow had this approach in mind to detect ChatGPT posts, why not share it with the moderators instead of handing them yet another rubber mallet and tasking them to level El Capitan, and even worse, say that they shouldn't use those mallets since it's causing too much sediment to go away?
    – Makoto
    Commented Jun 8, 2023 at 19:28
  • 2
    @Makoto Philippe is not saying that they can tell a ChatGPT post from a regular one. He's saying that it gives them a statistic that they correlate to the number of ChatGPT posts. I think it is pretty clearly described what they do. Of course the decrease in their estimated number of ChatGPT posts is due to people changing how they copy-paste the posts, invalidating their metric. I'm not defending them. I just find your answer either purposefully misrepresents what is written up top, or was written without reading what was written up top. Commented Jun 8, 2023 at 19:36
  • 1
    @BryKKan: I went back through the post and couldn't find any instance where Philippe said the suspensions were unjust, only that they were abnormally more frequent. Who knows, the mods may have found more ChatGPT posts that largely predate the "big wave". Moderation is across a spectrum of time, not a fixed point in time.
    – Makoto
    Commented Jun 8, 2023 at 20:45
  • 1
    @DavidRoberts I should have said "that they think correlates to the number of ChatGPT posts." Anyway, I am only pointing out that they described their metric, I'm not defending the metric. There are answers already on this page clearly pointing out why the metric is broken. Please stop arguing with me about the metric. Commented Jun 9, 2023 at 7:34
125

Copying my response from Teams, with some small edits to remove things that you chose not to reveal publicly for some reason:

I am skeptical about several points in your analysis. But most troubling of all is this bit:

So, when I say there are [small number that was removed from public post, for some reason] appeals that we cannot identify as correct, please keep in mind that our baseline value for this is zero, and it’s been that way for years. It is exceptionally strange for us to look at a moderator’s action and find ourselves unable to verify it – yet this is the situation we are frequently in with respect to GPT.

Moderators are human, and we make mistakes and have disagreements. This rate should never have been zero - if you are actually giving appeals a fair shake, you should at least find yourself contacting the moderators in question for clarification from time to time.

When I tell someone I've just suspended to use the "Contact Us" button for appeal, it's with the belief that appeals are given a thorough evaluation. Have I just been telling them to spit into the wind all this time?

So, now you have [same small number again] suspension appeals you aren't sure about. So talk to the moderators in question! Maybe there are hallucinations that it takes a subject expert to identify. Maybe the moderators in question are using a heuristic you are unfamiliar with. Maybe some of them are actually wrong.

You'll never be able to find out any of this if you don't talk about it with the moderators. Please work with us to develop heuristics that work and identify ones that don't, rather than just giving up and forbidding us from moderating at all.

Side note:

mods can’t assess posts on the basis of GPT authorship (where we would be after this policy)

Not sure you meant to post this publicly, but I'm glad you're finally admitting publicly what was said privately- that mods aren't allowed to moderate ChatGPT posts at all.

12
  • 7
    Was the post saying that they never overturn moderator decisions on appeal? I read it as saying that they never have to overturn it due to being unable to verify something. I.e. there may well be plenty of times where they can review a moderator's actions in other areas and declare that the moderator was wrong, but with ChatGPT appeals they simply have no evidence to work with.
    – Alex
    Commented Jun 8, 2023 at 2:22
  • 12
    @Alex "It is rare and notable if we are ever in the position of overturning a moderator’s decision due to insufficient or contradictory evidence." Seems pretty clear to me. And now they have a whopping two dozen cases and it's too much? Give me a break. At least talk to the moderators in question before you freak out.
    – Chris
    Commented Jun 8, 2023 at 2:41
  • 8
    "due to insufficient or contradictory evidence" I take that to mean that it is possible in general to present sufficient evidence that a moderator erred (or acted maliciously), and then the Community Managers can overrule it. The issue with ChatGPT appeals is that there is no evidence on either side, so there is no way for the Community Managers to review a decision.
    – Alex
    Commented Jun 8, 2023 at 3:41
  • 12
    @Alex You skipped "rare and notable." And "no evidence on either side" is ridiculous when the CMs haven't even discussed the suspensions in question with the moderators.
    – Chris
    Commented Jun 8, 2023 at 4:00
  • 1
    Have I just been telling them to spit into the wind all this time? Yeah, probably. I can't really imagine that there wouldn't be at least an occasional case where you changed your mind upon seeing new information. I would expect this also extends to reevaluating the quality or applicability of other evidence within a new context. A fresh set of eyes, looking sincerely and impartially, ought to discover such cases even more often. I don't see how a zero reversal rate, even if just "0 for insufficient evidence", is anything other than an admission they aren't taking all appeals seriously.
    – BryKKan
    Commented Jun 8, 2023 at 4:30
  • 5
    “rare and notable” is part of the same sentence. It doesn’t say that it’s rare and notable to overturn a moderator’s decision. It says that it’s rare and notable to overturn a decision due to insufficient or contradictory evidence.
    – Alex
    Commented Jun 8, 2023 at 4:32
  • Though actually reading it again, I’m not sure if the insufficient evidence is referring to the moderator or the appealer.
    – Alex
    Commented Jun 8, 2023 at 4:34
  • 4
    Either way, in context, I think his point is the same. The difference between a regular appeal and a ChatGPT appeal is that in the former, the reviewer can see the evidence while in the latter the reviewer cannot.
    – Alex
    Commented Jun 8, 2023 at 4:37
  • @Alex Why else would they overturn a suspension? Because the evidence was too good?
    – Chris
    Commented Jun 8, 2023 at 4:42
  • 3
    The appealer might present evidence to support their appeal. Or the moderator’s action might have been unambiguously wrong.
    – Alex
    Commented Jun 8, 2023 at 4:55
  • 9
    I can actually see why the base rate would have been 0 before. The majority of suspensions are for voting fraud or CoC breaches (being abusive, basically). I can imagine that the majority of appeals would be for voting fraud issues, because users have no visibility on how much evidence we have for voting fraud. I've actually been asked by CMs to share any further evidence in a few such cases, and they invariably are from people that think that they had covered their tracks sufficiently to have gotten away with it. In all those cases I was involved with, the moderator decision was upheld. Commented Jun 8, 2023 at 11:26
  • 8
    There have also been a few escalations of plagiarism suspensions, and again the decisions were upheld. I've seen the CMs re-verify the evidence for specific posts each time, and also initiate talks with specific tech companies whose freelance support teams have been plagiarising widely in the name of "providing tech support" via SO. But, for ChatGPT, the picture is different, and without having experienced what John Ericson calls the "Barnum Effect", it can be bewildering and hard. The evidence is not easily displayed as metrics on a screen. Commented Jun 8, 2023 at 11:30
122

What percentage of users who have been suspended (for any amount of time) for GPT use were active answerers as you described them immediately prior to ChatGPT's introduction?


I wasn't a very heavy flagger of GPT posts, but of the 4 users I flagged... 3 users were previously not active users and the 4th was previously and currently active. The 4th had suddenly started posting well written answers with much quicker turnaround than normal (4 in an hour) that greatly deviated from their normal "Here's some code"-like answers. My interpretation, having seen GPT-generated text many times elsewhere (not just here on SO) was that all of these cases were GPT-generated and they were all subsequently acted on by moderators.

Given I found all of these cases within minutes of opening the bounty tab, I find it hard to believe that these are as rare as your draft-based data seems to suggest. Is it not possible that GPT users simply altered the way they were posting answers? If I open the bounty tab today will I be able to quickly find a few more? (yep, on my first click, new user, 6 GPT answers on bountied questions. No detector necessary.)


The sudden drop is certainly troubling, but I question whether or not giving up is better than continuing to fight against under-verified content flooding the network. Neither alone will win back the users who are leaving the platform in droves. Should we sacrifice quality despite the fact that it won't bring back the users who are leaving, or should we begin to address why they are leaving?

11
  • 4
    A potential way they could alter their process is by manually typing the content over from the chatGPT website, which would lead to a high number of revisions.
    – mousetail
    Commented Jun 7, 2023 at 20:42
  • 8
    I don't recall a single one from those I flagged (that would be active in any reasonable way prior to the bad post). Primarily brand new accounts, then some that didn't answer anything for 2 years (and even then it was just few posts). One case where the user seems to have learned and started posting self-written answers.
    – Dan Mašek
    Commented Jun 7, 2023 at 20:44
  • 3
    Most code-containing instances posted using the old editor would at minimum result in a few revisions, if the user cares to turn the code into proper code blocks. But I'd also assume the draft mechanic would over time become less useful as a larger and larger percentage of GPT answerers get caught and then attempt to avoid it through modifications. None of this, of course, touches on why we're losing our regular answerers. Could they perhaps have felt like they were being replaced?
    – Kevin B
    Commented Jun 7, 2023 at 20:44
  • 58
    Regarding the decrease in answerers... I can just speak for myself, but for a while I've been decreasing my answering, partly due to my dissatisfaction with how the corporation was handling the site and treating the community, partly due to the ever increasing influx of junk questions (no research, blatant duplicates). I definitely don't feel like being replaced -- for the questions and topics I'm actually interested answering, that notion is absurd. But seeing mentions of "I tried ChatGPT and got..." definitely make me walk back slowly -- sorry, I ain't touching that, got better things to do.
    – Dan Mašek
    Commented Jun 7, 2023 at 22:16
  • 2
    @DanMašek Fully agree. Since the Monica incident I have stopped posting coding answers that require actual engagement - nowadays my presence on the site is limited to short answers on hobby topics that either require low effort or are fun to research. Commented Jun 8, 2023 at 13:24
    I more meant: why are answerers, as they measured/described them, leaving the platform - not why have people been leaving the platform for 10 years, etc. Sure, we've had a long-term trend of answerers leaving, but why are answerers leaving now at such a greater rate, given that they still have plenty of questions to answer? The theory SO is testing is that they're leaving due to suspensions. I'm effectively trying to challenge that theory. It's certainly possible both that new answerers, who in the past would have been answering without GPT, are now using GPT and aren't sticking around as long,
    – Kevin B
    Commented Jun 8, 2023 at 14:33
    and also that they aren't sticking around as long because they're getting suspended. If that's the case... we're kinda screwed if we want to continue striving for quality.
    – Kevin B
    Commented Jun 8, 2023 at 14:36
    The question of why answerers are leaving is an interesting one. A lot of things are happening at the same time (the new plagiarism mechanism, job market uncertainty ("better not do that during work hours"), and ChatGPT drawing most of the attention for a while (and perhaps irreversibly causing people to leave (a COVID moment - "Why the <censored> am I in this hamster wheel?? Let me find a more interesting hamster wheel."))). Commented Jun 8, 2023 at 15:57
  • 8
    Having submitted 20+ helpful GPT-related flags myself, I can attest to this. The majority of users either have never posted an answer before, or have not posted an answer in many months (or even years in some cases). For the ones that do have previous answers, it is very clear that their answers from before ChatGPT was released were significantly different in content from their recent answers, which at the very least is extremely suspicious. I personally have not seen any exceptions to this.
    – Jesse
    Commented Jun 8, 2023 at 17:05
  • 6
    @KevinB I can concoct a whole list of hypotheses: 1. It's a coincidence and has nothing to do with ChatGPT, which should always be the default hypothesis. 2. The leaving answerers are those that previously primarily answered questions that were rather trivial. Now that those questions are directed at ChatGPT, these answerers have nothing to answer. 3. The answerers are demotivated to answer due to ChatGPT, either because they feel like they're being cheated out of points, or they think they are being made obsolete.
    – Passer By
    Commented Jun 9, 2023 at 5:19
  • 1
    #1 is most likely the 30-minute limit; #2, I'd assume people giving up because they see GPT answers scooping up rep that'd ordinarily go to people who actually care.
    – Kevin B
    Commented Jun 19, 2023 at 20:50
87

What about everything else?

We've known for a long time that detectors (HuggingFace is the one I'm most familiar with, but others as well) have an extremely high rate of false positives. Having data that backs that up is great. More information and data is good*.

Some moderators have stated that they have handled many thousands of ChatGPT flags without ever using a detector. What is the rationale behind the overall ban, versus just banning detectors specifically and allowing other signals, such as writing style, certain phrases, repeated answers, etc., to be used?

You also stated (in the policy announcement on MSE) that "we also suspect that there have been biases for or against residents of specific countries as a potential result of the heuristics being applied [...]". Aside from the fact that a suspicion is a terrible reason for a blanket ban, do you have data to support that claim? If so, share that data.

If that data is hard or impossible to get, and if suspensions shouldn't be based on hunches (which they are not), why aren't you following your own belief instead of making a policy based on a hunch?

You also wrote "[...] we've asked moderators to apply a very strict standard of evidence to determining whether a post is AI-authored when deciding to suspend a user. This standard [...] excludes the use of moderators' best guesses based on users' writing styles and behavioral indicators, because we could not validate that these indicators are [...] successfully identifying AI-generated posts when they are written. [...]".

Why are you not sharing the moderator requirements publicly?

It is terrible to have policies about suspending users that are secret. You mention what the standard excludes. Please share the standard itself.


*Note to SE: I do have some concerns regarding methodology for HuggingFace specifically (there's a clarification I'd like to request). Unfortunately, I'm not sure if that can be publicly shared, so if you'd like, you or a CM can ping me in a private chat room.

5
  • 16
    Very much agreed that the still privately held part of the moderation policy continues to feel very problematic, but isn't your point on "detector-ban vs. all out ban" covered by the post? If internal metrics imply that the rate of ChatGPT answers has fallen off, but that suspensions for ChatGPT answers have not, does that not cast at least potentially reasonable doubt on the suspensions themselves, regardless of the mechanisms used? I don't have the data expertise to properly analyze their methodology, but the logic used here sounds relatively sound to me on the surface. Do you disagree?
    – zcoop98
    Commented Jun 7, 2023 at 20:16
  • 26
    @zcoop98 - We're lacking any proof that the methods that Stack are using are inherently any more accurate than the methods that moderators are using. Stack has so far refused to compare detection reliability using known-good and known-bad data (i.e. creating the data specifically to test), so all we have is "take our word for it that our methods are more reliable". It's a "my word vs your word" situation with whose detection methods are more accurate.
    – Mithical
    Commented Jun 7, 2023 at 20:44
  • 6
    @zcoop98 that's one option. The other is that the predictive power of the metrics has changed. It seems likely that the subset of accounts posting generated answers would change over time. For instance, you might see that only relatively new accounts do it now, and they might post more or less often than the previous average. It's also very likely that even rudimentary evasion measures - like editing in place after pasting, or copying select output in segments from successive runs/variations - would skew the stats they're measuring. Even a minor increase in average sophistication blows it.
    – BryKKan
    Commented Jun 8, 2023 at 4:00
  • 1
    @zcoop98 What you said is certainly possibly correct. Unfortunately, my concern regarding it didn't really fit into a comment, so I've explained my concern with that over in Chat (you'll have to expand some of the messages)
    – cocomac
    Commented Jun 8, 2023 at 4:29
  • 3
    Just a little nitpicking - HuggingFace is not a detector, it's just a site that provides hosting and a community for AI-related technologies. So what people usually mean is that they used a detector sample hosted on Hugging Face. Commented Jun 8, 2023 at 13:20
85

The "actual GPT posts have fallen" graph is the lynchpin of your entire analysis. Without that graph, you've proven exactly nothing (aside from "engagement is down and the CM team has difficulty verifying GPT bans" - which are both worrying, but neither is a moderation problem). Unfortunately, as several other answers have pointed out, your "number of drafts" metric is completely arbitrary and not supported by any evidence (that you have included in this post).

Anyone who doesn't want their GPT answer deleted can easily just decide to edit a word here, a word there, and end up with just as many drafts as a legitimate user would (on average) have. Nothing in your analysis even attempts to engage with the possibility that most of the attackers are now doing that sort of editing. Worse, this is by far the most obvious countermeasure that an attacker could try in response to moderation, and so it is entirely reasonable to assume that many attackers started doing it at around the same time, even without coordination. If we allow for attackers intentionally coordinating with one another, then more elaborate explanations become possible (though whether you consider that plausible as an explanation is perhaps a different question).

The sensible course of action would have been to talk about this with the moderators. Show them your unverifiable suspensions, and ask them to help substantiate them. Talk to each other, and develop some sort of rubric or standard for evaluating GPT flags. It doesn't have to be public, and it doesn't have to be perfect, but it really ought to exist. You should not just point to statistics and claim that you know better than the moderators who are actually doing the work.

Releasing the data was a good start, but real progress is going to require collaboration with the community. The overall tone of your post is still very much in the "we're going to talk and you're going to listen" voice, and it really isn't helping your case.

6
  • Re "who doesn't want their GPT answer deleted": Yes, but fortunately the whole point is to do as little work as possible. They have the minimum-effort attitude and the minimum amount of work is the top priority. There is a reason old-school plagiarism was replaced by ChatGPT. Only a tool (that automated the process) could make it widespread, I think. Commented Jun 8, 2023 at 17:46
  • 3
    Obfuscating AI-generated text isn't driven by people posting on SE. It's driven by all the other places where submitting AI-generated content as your own work is unacceptable (e.g. schools), which, in aggregate, are orders of magnitude larger than SE. For those, tools have been and continue to be created which perform various obfuscations for people automatically. Obviously, not all users obfuscate, but the tools exist and usage of them is increasing. So such things do have coordination outside of SE; it's just that the coordination isn't necessarily SE-specific.
    – Makyen
    Commented Jun 8, 2023 at 17:47
  • 6
    @This_is_NOT_a_forum Don't forget about the 30-minute answer rate limit for users. This forces them to wait, which means at least some are bored, staring at an already-created answer. Might as well spend some of that forced time making it harder to detect.
    – Makyen
    Commented Jun 8, 2023 at 17:49
  • 3
    In the weeks after the suspensions started, I started seeing a couple of recurring patterns: 1) Users would still post ChatGPT content, but make efforts to carefully mark all the code correctly (which ChatGPT did not (yet?) do). 2) Users would post ChatGPT content, but make an effort to remove the telltale ChatGPT introduction/conclusion phrases (in most cases replacing them with their own, rife with spelling/capitalization/punctuation errors). 3) Users would post ChatGPT content but remove all prose and only post the (commented) code. (Verifiable by asking the same question to ChatGPT.) Commented Jun 9, 2023 at 5:38
  • "develop some sort of rubric or standard for evaluating GPT flags" But even then the result could still have been that it's not repairable and not acting on GPT flags is the best. In your model, who decides in case of a difference in opinion? Commented Jun 9, 2023 at 11:42
  • @Trilarion: Consensus decision making is hard. Ideally, they would talk it out. Realistically, they would try to reach some sort of compromise that leaves everyone a bit unhappy.
    – Kevin
    Commented Jun 9, 2023 at 16:45
84

OK, finally there is something to actually discuss. This discussion should have happened before the policy was decided, and the policy should have been decided with community-elected moderators' input rather than imposed by fiat ─ and the real policy remains secret ─ so let's not celebrate too much.

But now at least we can have a discussion about the factual basis for the policy, while not forgetting that the strike is about the imposition of the policy against the community's wishes and without our feedback, the differing public and private versions of the policy, and the slander against moderators, not about disputes over what this data implies.


At the 0.50 detection threshold, around 1-in-5.5 posts are falsely detected. At the 0.90 detection threshold, around 1-in-13 posts are falsely detected.

OK. How many moderators are trusting this tool at thresholds of 50% or 90%? I think pretty much everyone knows that these tools are completely useless at such low thresholds.

While it is theoretically possible to achieve better baseline error rates than 1-in-20 by picking higher thresholds, the efficacy of the detector may fall off considerably. A detector that does not produce false positives is no good if it also produces no true positives.

"Theoretically" is a strange word to use here. It is possible to achieve better baseline error rates than 1-in-20 by picking higher thresholds, i.e. thresholds above 97%. The data you have presented shows that this is empirically true, not just theoretically.

Whether or not a threshold of 97+% means the tool misses a lot of true positives is irrelevant. It makes no sense to forbid the use of a tool just because it misses a lot of true positives. Lateral flow tests for COVID-19 can miss 20-80% of true positive cases; that just means we can't (and don't) rely solely on LFTs. It doesn't mean we should ban LFTs.

Also, moderators have said clearly that they do not rely exclusively on this tool, or any GPT detection tool. We should expect that the false positive rate for moderator decisions, which are based on multiple kinds of evidence, should be significantly lower than the false positive rate for just one kind of evidence.
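
To make that last point concrete, here's a toy illustration of my own (it assumes the signals are independent, which real heuristics won't quite be, but the direction of the effect holds): requiring several weak signals to agree before acting shrinks the false positive rate multiplicatively.

```python
# Toy numbers: each signal alone wrongly flags 10% of legitimate posts.
# Acting only when all signals agree multiplies the rates together
# (assuming independence, which is optimistic but illustrates the point).
fp_single = 0.10

for n_signals in range(1, 5):
    combined = fp_single ** n_signals
    print(f"{n_signals} signal(s): false positive rate ~ {combined:.2%}")
# 1 signal(s): false positive rate ~ 10.00%
# 2 signal(s): false positive rate ~ 1.00%
# 3 signal(s): false positive rate ~ 0.10%
# 4 signal(s): false positive rate ~ 0.01%
```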

At this point, we can’t endorse usage of this service either as a tool for discriminating AI-generated posts or as a tool for validating suspicions.

Nobody is asking you to endorse it. What we're asking is for you to let us choose how our own communities are moderated.


Over the last few months, folks within the company have been working to answer the question, “What has been taking place in the data coming out of Stack Overflow since GPT’s release?”

(Emphasis mine.) Stack Overflow is the largest Stack Exchange site, but that means it is not representative of other Stack Exchange sites. Even if this data really does show a need for an extremely permissive policy which de facto allows users to plagiarise AI answers, it at most shows that need for Stack Overflow, not all Stack Exchange sites.

In total, the rate at which frequent answerers leave the site quadrupled since GPT’s release.

This graph looks pretty noisy, so I assume there are some wide error bars around that "quadrupled" figure. For the sake of argument let's say it's roughly correct, though. So why are the more active users leaving?

Are there just fewer questions? ─ Yes, there are. You attempt to rule this out as a factor, because the "number of questions per frequent answerer" has gone up, not down. But this is not surprising, and doesn't imply anything.

Imagine there are 100 questions per day, 50 nerds who answer one question per day, and 10 hypernerds who answer five each. Now imagine the number of questions goes down to 60 per day. The hypernerds are more affected by the question shortage, so suppose they're more likely to leave; say 30% of the nerds and 50% of the hypernerds leave. There are now 35 nerds and 5 hypernerds, and all 60 questions per day are still getting answered. So the number of questions per hypernerd has gone up from 10 to 12, but there aren't more questions available to be answered. So this number rising simply doesn't imply that there are enough "available" questions to retain more of the hypernerds.
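
(For anyone who wants to check the arithmetic in that toy example, here it is spelled out:)

```python
questions_before, questions_after = 100, 60
nerds, hypernerds = 50, 10                  # answering 1/day and 5/day respectively

nerds_after = round(nerds * 0.7)            # 30% of nerds leave -> 35 remain
hypernerds_after = round(hypernerds * 0.5)  # 50% of hypernerds leave -> 5 remain

print(nerds_after * 1 + hypernerds_after * 5)  # 60: all remaining questions still get answered
print(questions_before / hypernerds)           # 10.0 questions per hypernerd before
print(questions_after / hypernerds_after)      # 12.0 questions per hypernerd after
```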

The only assumption I've made is that hypernerds are more likely to become uninterested if there aren't enough questions for them, whereas regular nerds aren't there only to answer questions so a question shortage is less likely to motivate them to leave. Seems pretty plausible to me, and it's consistent with the trends in your data (on questions per hypernerd, and proportion of answers written by hypernerds), so we can't rule it out using this data.

Additionally, there is a flaw in your framing: the idea that what matters is the ratio of questions per answerer, rather than the actual number of questions. The ratio probably matters much less than you give weight to it, because each question can be answered by more than one person, so an answered question is still often "available" to be answered again. In practice, many hypernerds are interested in some specific tag, and they may be one of a small number of people answering questions on that tag, or even the only one. What matters to these users is the total number of answerable questions being asked in that tag. And if they leave the site, the questions on that tag don't become "available" to other users, because the other users aren't experts on that topic and aren't interested in those questions.

Has the quality of the questions gone down? ─ Perhaps the hypernerds are more motivated by having interesting questions to answer, whereas the regular nerds just answer a question every now and then if they happen to notice it. Unfortunately your data says nothing about question quality, but anecdotally this is the reason I have stopped writing so many answers on Stack Overflow. I could speculate on a few reasons why question quality might have fallen in the advent of ChatGPT ─ perhaps people who know how to write a good question are more likely to get a satisfactory response from ChatGPT themselves, and therefore don't need to ask Stack Overflow ─ but anyway a drop in question quality can't be blamed on the AI moderation policy.


After we allowed GPT suspensions on first offense, 6.6% of users who posted >2 answers in a given week were suspended within three weeks [...] no Community Manager will tell you that removing 7% of the users who try to actively participate in a community per week is remotely tenable for a healthy community.

This is a non-sequitur, because you include AI plagiarists among the group of people "trying to actively participate in" Stack Overflow, but plagiarising answers from an AI is not what it means to participate in Stack Overflow. If all of those users who get suspended are indeed AI plagiarists, then the correct percentage of suspensions among people who are "trying to actively participate" in the community is 0%, not 7%. So your argument here is wholly uncompelling.

These suspensions only negatively affect the community if they are false positives. You have given some data on the false positive rate of AI tools, but not about the false positives for moderator decisions to suspend users (which, again, are made on the basis of multiple kinds of evidence).

Additionally, this metric ─ users who posted >2 answers in a given week, and were suspended within three weeks ─ seems suspiciously precise. Why is >2 answers the cutoff? Why is 1 week the period in which those answers were written, and why is 3 weeks the period in which they were suspended? It smells of cherry-picking to me. How robust is this finding to changes in the metric?

Instead suppose that no more than 1-in-50 of the people who were suspended for GPT usage were not actually using GPT. In order for this to be true, a large volume of users would have needed to immediately convert from being regular users to ChatGPT users;

No, that does not follow. It can be true if there is a steady flow of new users who use ChatGPT, or if users suspended for ChatGPT use return to the site after their 7-day suspension and then post more ChatGPT answers, or if not everyone who uses ChatGPT gets caught immediately. All three are very plausible.
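
A quick back-of-the-envelope sketch of the first possibility, with made-up numbers of my own (nothing here is taken from the post): a steady inflow of brand-new ChatGPT posters is enough to keep the weekly suspension share high indefinitely, with no need for existing regulars to "convert".

```python
# Made-up steady-state numbers: each week a fresh cohort of new accounts posts
# >2 ChatGPT answers, and each cohort is caught and suspended within ~3 weeks.
new_gpt_posters_per_week = 100       # hypothetical inflow of new ChatGPT posters
frequent_answerers_per_week = 1500   # hypothetical count of users posting >2 answers that week

# In steady state, roughly one cohort's worth of suspensions lands every week,
# so the suspended share of ">2 answers/week" users stays near:
print(f"{new_gpt_posters_per_week / frequent_answerers_per_week:.1%}")  # ~6.7%
```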

this value alone rings a deafening number of alarm bells for potential false positive detections and contributor loss alike.

This 7% figure implies nothing about the rate of false positives, because it uses the wrong denominator. Many users who don't use ChatGPT and don't get suspended for suspected ChatGPT use, are not in the ">2 answers per week" category, but the number of those users obviously matters for the false positive rate.
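
To illustrate why the denominator matters, here is a worked example with numbers I've invented purely for shape: the share of frequent answerers who get suspended and the false positive rate of those suspensions are different quantities, and the first does not determine the second.

```python
# Invented numbers, purely to show that the two quantities are different.
frequent_answerers = 5000          # users posting >2 answers in a given week
suspended = 350                    # 7% of them suspended within three weeks
correctly_suspended = 340          # suppose nearly all of those suspensions were justified

false_positives = suspended - correctly_suspended

print(suspended / frequent_answerers)        # 0.07  -> the "7%" headline figure
print(false_positives / suspended)           # ~0.03 -> false positive rate among suspensions
print(false_positives / frequent_answerers)  # 0.002 -> wrongly suspended share of frequent answerers
```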

Likewise, this figure doesn't imply anything about loss of legitimate contributors unless we already accept the doubtful claim about false positives.
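
To illustrate the denominator problem with made-up numbers (none of these figures come from the post): the same count of wrongful suspensions produces very different "false positive rates" depending on which population you divide by, and neither can be computed from the 7% suspension figure alone.

    # Invented figures, for illustration only.
    wrongful_suspensions = 50      # hypothetical count of innocent users suspended
    frequent_answerers   = 5_000   # hypothetical ">2 answers in a week" cohort
    all_answerers        = 80_000  # hypothetical weekly answering population

    print(wrongful_suspensions / frequent_answerers)  # 0.01     -- cohort as denominator
    print(wrongful_suspensions / all_answerers)        # 0.000625 -- all non-suspended users matter too
    # Neither number follows from the 7% figure; you would need to know how many
    # of the suspended users were actually innocent, which was not measured.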

What follows is the internal ‘gold standard’ for how we measure GPT posts on the platform [...] In principle, if people are copying and pasting answers out of services like GPT, then they won’t save as many drafts as people who write answers within Stack Exchange.

This metric doesn't seem fit for the present purpose. Firstly, it's an absolute number, whereas we already know that the absolute numbers of questions and answers have been falling, so automatically the absolute number of ChatGPT answers is expected to fall alongside that. To support your argument about false positive rates, we need the proportion of answers written by ChatGPT, not the total number.
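
A trivial numeric check of that point, with invented figures: if the proportion of GPT answers stayed constant while total answer volume fell, the absolute GPT count would fall too, so a falling absolute count on its own tells us nothing.

    # Invented figures: constant share, falling totals.
    gpt_share = 0.05   # hypothetical constant share of answers written with GPT
    for total_answers_per_week in (60_000, 45_000, 30_000):
        print(total_answers_per_week, round(gpt_share * total_answers_per_week))
    # 60000 3000
    # 45000 2250
    # 30000 1500   <- absolute count halves with no change in behaviour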

Additionally, it assumes that the behaviour of ChatGPT plagiarists has not changed over time. But this is an unreasonable assumption, because as Stack Overflow's policy on ChatGPT answers became more widely known ─ and people's 7-day suspensions for posting ChatGPT answers expired ─ we should expect that the AI plagiarists' behaviour changed to try to avoid getting caught. More sophisticated plagiarists will change a few words, delete "fluff" sentences which don't contribute to answering the question, introduce intentional spelling or grammar mistakes, and so on; these are smaller edits, so they will tend to make answers look "less ChatGPT-like" in the ratio of small to big edits.

So this metric going down does not really indicate that ChatGPT use has gone down.

This metric is sensitive to noise, but was validated against other metrics early on at the peak of the GPT answer rate.

The fact that it was validated early does not mean it remains valid, since AI plagiarists' behaviour should be expected to change over time.

The following chart shows the expected % of answers posted in a given week that are GPT-suspect.

There is no "following chart" ─ perhaps you intended to include a chart here but it got lost while editing?

Based on the data, we would hazard a guess that Stack Overflow currently sees 10-15 GPT answers in the typical day, or 70-100 answers per week. [...] could it be the case that roughly 7% of frequent answerers on the site are still posting via ChatGPT? If this were the case, the site should be seeing at least 330 GPT answers per week, but the rate estimate is not close.

The estimate of 70-100 per week is probably a significant underestimate for the previously-stated reasons. It's quite plausible that 330 per week is the correct number. It's also plausible that the correct number is much higher, and moderators aren't catching all of them.
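
For what it's worth, the post's own arithmetic is easy to re-derive and is quite sensitive to its hidden assumptions. A rough sketch (the size of the frequent-answerer cohort is not stated in the post; 1,700 is invented here purely so that the headline number lands near the quoted 330):

    # Hedged re-derivation with an invented cohort size.
    cohort = 1_700                 # hypothetical users posting >2 answers in a week
    suspension_rate = 0.066        # the quoted 6.6%
    answers_per_plagiarist = 3     # the post's minimal assumption

    for catch_rate in (1.0, 0.5, 0.25):   # fraction of GPT users actually caught
        implied = cohort * suspension_rate * answers_per_plagiarist / catch_rate
        print(catch_rate, round(implied))
    # 1.0  -> ~337  (the post's implicit assumption: everyone is caught)
    # 0.5  -> ~673  (if only half are caught, the true volume is far above 70-100)
    # 0.25 -> ~1346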

Even so, if you were right that there are only 70 to 100 AI answers being added to Stack Overflow per week, that would still be a bad thing and it would still be necessary to moderate them. If left on the site, those answers will accumulate over time, getting upvotes from people who don't know better (because they resemble high-quality answers), and Stack Exchange, Inc. has clearly recognised the harm that would be caused by leaving these answers up on the site.

So even if all of your analysis were correct ─ and to be clear, it isn't ─ it still wouldn't logically support the new policy of de facto allowing AI plagiarism.


It is exceptionally strange for us to look at a moderator’s action and find ourselves unable to verify it – yet this is the situation we are frequently in with respect to GPT.

You haven't said why you were unable to verify these moderator actions. Is it because moderators have not provided sufficient information about how they have made their judgements, or is it because you don't agree with those judgements?

If it's the former, that would be a basis for you to tell moderators that they need to provide more information when they suspend users for suspected ChatGPT plagiarism; if it's the latter, then that would again be something to discuss with moderators. Either way, it doesn't support the new policy which practically forbids almost all such suspensions.

Instead, the most we can do is state that we just can’t tell. We lack the tools to verify wrongdoing on the part of a user who has been removed, messaged, or had their content deleted, and this is a serious problem.

You may not be able to tell, but I remain entirely unconvinced that the moderators issuing these suspensions can't tell. The data you've presented here doesn't support that conclusion.

Is it still possible that the proportion of false positives is small? Maybe so – it can’t be completely eliminated at this time. [... But] it would require some very strange user behavior en masse around answering, by users who were otherwise answering questions normally. These are behaviors we do not have an organic explanation for after months of exploration

Perhaps if you had discussed this with the moderators, they might have been able to offer some explanations which you failed to consider.

It looks really bad for you to impose a policy by fiat, keep the justification for the policy secret for a week, and then admit that the basis for the policy is that you couldn't think of any other explanations for this data ─ when you never involved the people with direct experience of the issue in your attempt to understand that data. I'm sorry, but that is not behaviour I associate with a good-faith effort to reach the truth.

What we know, right now, is that the current situation is untenable. We have real, justified concerns for the survival of the network.

It's a shame that in this sentence, the "current situation" you're referring to is the rate of users leaving Stack Overflow, rather than the behaviour of Stack Exchange, Inc. which has caused this strike action.

It's our community, so we are at least as concerned as you about threats to the community's continuing viability. Unfortunately, the behaviour of Stack Exchange, Inc. is currently the greatest threat to our community's continued existence, and while publishing this data and your analysis is welcome, it neither demonstrates an understanding of why we are on strike, nor does it address any of the strikers' demands.

9
  • 22
    Has the quality of the questions gone down? — I think here it could actually be the other way around. All the easiest mega-dupe questions are now being asked to ChatGPT, as it will answer anything right away vs. crafting a question here that has any hope to stay open. So fast-gun and low-effort answerers have fewer questions they can answer quickly and serially. Not necessarily implying that question quality has gone up: just that there are fewer questions that many people can answer fast.
    – blackgreen
    Commented Jun 8, 2023 at 1:41
  • 11
    we should expect that the AI plagiarists' behaviour changed to try to avoid getting caught — and this is the key point. It's entirely reasonable to expect the "drafts saved" metric to decrease in accuracy over time, i.e. to have a dropping recall, as people adapt to the ban.
    – blackgreen
    Commented Jun 8, 2023 at 1:45
  • @blackgreenonstrike Interesting idea, but it's also possible that both are true. The kinds of question that might no longer be appearing on SO due to askers trying ChatGPT first and being satisfied with the answer they get, include both the questions that are written well enough to be definitely answerable, and the megadupe questions that ChatGPT has a lot of training data for; and both of those would appeal to different kinds of very-active users. I think I'd need some evidence before I accepted the premise that there are fewer megadupe questions being asked on SO now, though.
    – kaya3
    Commented Jun 8, 2023 at 2:12
  • 3
    I'm not so sure about the "hypernerd" explanation. If anything, I'd think that they'd be the ones to more proportionately find reason to stay / write more answers: I'd expect that if question askers are flocking to ChatGPT, the only reason they'd come back is if they get something that ChatGPT can't help them with- probably something harder- something that a "regular nerd" would be proportionately less likely to be able to answer.
    – starball
    Commented Jun 8, 2023 at 7:31
  • 1
    @starball The numerical observation isn't related to question quality; the ratio of questions per hypernerd will automatically go up if hypernerds are more likely to leave when there aren't enough questions, so the "number of questions per hypernerd" being high doesn't mean there are available questions ─ Phillipe's use of the word "available" here is a statistical fallacy. We can't conclude from this statistic that there are "enough" questions for answerers to stay around for; the decline in the quantity of questions can't be ruled out as a factor despite Phillipe trying to do so.
    – kaya3
    Commented Jun 8, 2023 at 15:12
  • Your argument is also missing a step ─ you're saying that ChatGPT causes question quality to go up, and this should cause better retention of hypernerds. But has question quality actually gone up? As I said, in my judgement it has gone down; I'm willing to consider data which shows the opposite, but not hypothetical arguments which reach a conclusion contrary to my observations.
    – kaya3
    Commented Jun 8, 2023 at 15:21
  • The issue of "questions per answerer" is discussed in more depth in this other answer.
    – kaya3
    Commented Jun 8, 2023 at 15:49
  • @kaya3 hehe I wrote that other answer :P (so yes, I'm very much in agreement that question availability is going down rapidly and that the "POV" of the data Phillipe presented about that is heavily warped). I wasn't making an argument about quality- more about increasing complexity / difficulty.
    – starball
    Commented Jun 8, 2023 at 17:40
  • 1
    What a thoughtful and nuanced answer. I especially like the term hypernerd. I think it would be worthwhile to study more how questions and answers fared, i.e. take the score into account. Maybe voting has gone down too and that demotivates the nerds or the hypernerds. Commented Jun 9, 2023 at 18:23
74

We are, as of right now, operating under the evidence-backed assumption that humans cannot identify the difference between GPT and non-GPT posts with sufficient operational accuracy to apply these methods to the platform as a whole.

What "evidence" do you have for that? Have you actually tested this with one of the Stack Overflow Mods who have reviewed thousands of GPT posts?

What if you:

  • Take two of the Stack Overflow moderators who have handled thousands of AI flags.
  • Take 100 random, pre-2020 questions from Stack Overflow that have existing answers
  • Paste the markdown of the question into ChatGPT
  • Take the unedited ChatGPT results and ask the Mods to identify the ChatGPT answer (or whether they believe it is inconclusive).
  • Then tell us how many GPT answers were identified correctly and how many were "wrongly suspended" ;-)

Have you done anything close to this to support your "evidence-backed assumption"?

That's certainly the "easy" version of the test, since most GPT answers do (IMHO) have edits at this point. But if you don't believe it's possible for a human to identify a GPT answer at all, let's start with this easy test.

If you want to go a little more "difficult":

  • Take two of the Stack Overflow moderators who have handled thousands of AI flags.
  • Take 100 random, pre-2020 questions from Stack Overflow that have existing answers
  • Take a random 50 of those
  • Paste the markdown of the question into ChatGPT
  • Add the unedited ChatGPT results to the pool of answers for those 50 questions
  • Place all 100 original questions before the Mods and ask them to identify whether there's a GPT answer (and which one), no GPT answer, or inconclusive.
  • Then tell us how many GPT answers were identified correctly and how many were "wrongly suspended" ;-)

I'll put my money where my mouth is and happily take the challenge as well. That way you can even determine the rate at which a non-Moderator is potentially flagging false-positives.

I'd be interested in the results if it were run twice, once with the subjects using automated detectors to verify their suspicions, and once with no automated detectors allowed.
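
Scoring such a test would be straightforward. A minimal sketch, with entirely hypothetical verdicts and ground truth, just to show which two error rates the experiment would yield:

    # Hypothetical data: ground truth for each answer in the pool, and one
    # moderator's verdicts ("gpt", "human", or "inconclusive").
    truth    = {"a1": True, "a2": False, "a3": True, "a4": False}   # True = ChatGPT-generated
    verdicts = {"a1": "gpt", "a2": "human", "a3": "inconclusive", "a4": "gpt"}

    # "inconclusive" is counted as declining to flag.
    tp = sum(v == "gpt" and truth[k] for k, v in verdicts.items())
    fp = sum(v == "gpt" and not truth[k] for k, v in verdicts.items())
    fn = sum(v != "gpt" and truth[k] for k, v in verdicts.items())
    tn = sum(v != "gpt" and not truth[k] for k, v in verdicts.items())

    print("false positive rate:", fp / (fp + tn))   # humans wrongly called GPT
    print("recall:", tp / (tp + fn))                # GPT answers actually caught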

12
  • 14
    If you want to add an additional level of realism, first, lay them in front of a subject-matter expert (to substitute for the flagger) to identify possible use of ChatGPT. Then, give those results to a mod to do a second check. I'd imagine your approach would still lead to vastly more incorrect suspensions than what's actually happening; when I flag in the tags I follow, I'm 100% confident it's not a human.
    – Erik A
    Commented Jun 7, 2023 at 22:02
  • 2
    I second the non-moderator test. I'm pretty sure there are many users who have seen enough GPT answers to have a fairly low error rate in detecting them. I would be quite curious to see how well I do.
    – Esther
    Commented Jun 7, 2023 at 22:12
  • 4
    @ErikA To be honest, I'm not a SME in most of the answers I'm identifying as coming from ChatGPT. I've just seen thousands of them personally, and I feel I'm fairly good at pattern recognition (obviously a biased statement, but I'm willing to put it to the test). And when I say thousands: today I just added my 2,000th answer to my "ChatGPT Output" "Saves" list on SO. I have another 144 on a similar list on Ask Ubuntu. But I absolutely agree that for code answers, you must be a SME to identify the issues. Commented Jun 7, 2023 at 22:26
  • 5
    Even if we may misidentify some posts as AI-generated, the damage of allowing them is much larger than the damage of removing them. It's better to lower the punishment, and start with just a warning, and put the user on a watchlist for future offenses, which times out. Commented Jun 8, 2023 at 0:14
  • 1
    A more realistic scenario would be to have groups of three-ish answers from the same user, some groups from real users and some from ChatGPT. Then ask to differentiate which are which.
    – Ryan M
    Commented Jun 8, 2023 at 4:27
  • 1
    A better test would have a small fraction of the 100 posts be ChatGPT-generated, and the mod should not know how many there actually are. Commented Jun 8, 2023 at 15:19
  • I proposed a similar test (in moderator-only space), however using pre-GPT answers that score highly on Huggingface (or whatever) vs. GPT-generated answers. This makes the test considerably more difficult of course, but otherwise you have the problem that the vast majority of human answers do not look GPT-generated at all and thus the GPT answers stand out. This wouldn’t reflect a collection of answers flagged for being machine-generated in practical usage.
    – Wrzlprmft
    Commented Jun 8, 2023 at 16:16
  • All good scenarios as well. Just wondering why nothing like this seems to have been done since SE claims this is an "evidence-backed" assumption. Commented Jun 8, 2023 at 18:20
  • 1
    Unfortunately, most of these scenarios appear to overlook several of the input metrics that various mods have said they take into account (at least in anything but the most glaringly obvious cases) — things like prior posting history and style, just to pick one example (but by no means the only one). Commented Jun 9, 2023 at 1:22
  • ^That. I for one rely heavily on user history to find GPT plagiarists. It starts by smelling something fishy with the post, and ends in confirmation with user history.
    – Passer By
    Commented Jun 9, 2023 at 9:45
  • 1
    @JoelAelwyn Absolutely, the chances of us finding GPT usage are even higher when we have more data to go on. The point here is simply that some of us do feel that we can identify at a very high rate just based on the unedited GPT output. Commented Jun 9, 2023 at 10:43
  • @NotTheDr01ds Fair enough, just wanted to make sure it wasn't being overlooked. Commented Jun 9, 2023 at 16:22
72

My summary of your nine-google-doc-page question

  • You have tested how the ChatGPT-detectors work on human written posts and you did not like the accuracy.
  • You came up with a heuristic that you use to determine the number of possible ChatGPT-posts on the site.
  • Your heuristic did not show a surge of ChatGPT-posts, but the number of suspended users of a certain segment increased.
  • The number of answers continued to decrease, although the outflow of questions did not.

Multiplying one by the other, you concluded that the moderators incorrectly suspended 16 times more users than they should have.

My thoughts

If the community is fighting something intentionally, then it will find far more instances of it than if users merely stumble upon the problematic posts by accident

For the past few months, the community has been specifically looking for ChatGPT-posts. When we deliberately look for something, we will find much more of it by definition. Look at the data for any other community-driven initiatives. For example, look at the data for the plagiarism detection initiative that the community held recently.

Users get suspended for low quality content, plagiarism and more, not only for ChatGPT-posts

When the community deliberately works on something, other types of activities usually increase as well. By searching for potential ChatGPT-posts, users will definitely find plagiarism, low quality posts, and so on. Much of this will result in additional suspensions. Moreover, we will see users suspended “now” for their “old” posts.

The number of answers usually lags behind the number of questions

Questions and answers are like eggs and chickens. It takes a while for an egg to become a chicken. In addition, the number of answer givers has declined faster over the last few years than the number of question askers, especially among those who post many answers.

Your heuristic may not be correct. And it is definitely not correct during the period of "active cleaning" of the site

According to the information in your question, all conclusions are based on some heuristic about the time it takes for a user to create a post. Have you considered that it may be incorrect or not very accurate, especially while the community is proactively working on cleaning up the site?

Moreover, have you tried at least normalizing it by the number of processed flags over a period and presenting the data not by the suspension date, but by the date of posting the content that violates the rules? I think you would see a different picture.
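
A minimal sketch of that re-presentation, assuming a hypothetical table with one row per handled GPT suspension (the column names and numbers below are invented, not the company's actual data):

    import pandas as pd

    # Hypothetical data: one row per handled GPT suspension.
    suspensions = pd.DataFrame({
        "suspension_date": pd.to_datetime(["2023-04-03", "2023-04-10", "2023-04-10", "2023-04-17"]),
        "post_date":       pd.to_datetime(["2023-02-20", "2023-03-27", "2023-04-05", "2023-04-06"]),
    })

    # Count by the week the offending content was posted, not the week of suspension...
    by_post_week = suspensions.groupby(suspensions["post_date"].dt.to_period("W")).size()

    # ...and normalise by how many flags were processed in each of those weeks
    # (invented numbers; in practice this would come from the flag queue).
    flags_processed = pd.Series([40, 120, 90], index=by_post_week.index)
    print(by_post_week / flags_processed)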

With ChatGPT, people who would otherwise not be included in the group of "active users" will end up there

And when they are suspended, they will create the very anomaly that you write about.

The term “false-positive” has a specific definition

To talk about a false-positive rate, you need to know exactly how many posts the moderators processed incorrectly out of the total number. But you didn't analyze their work! There is no way you can talk about the “false-positive rate”; you simply do not have the data for that.

To be honest, I spent an hour and a half reading your post, but did not find the meaning of many of your “scientific” dictums.

What I see

Your data is a set of facts taken out of context, which are in no way connected with each other, and even more so with your conclusions.

8
  • 17
    Hi! Good to hear from you. Commented Jun 14, 2023 at 10:57
  • 23
    "The number of answers continued to decrease, although the outflow of questions did not." It is not even clear how this conclusion follows from the evidence that is presented. If you look at the public (25k+) analytics, it is patently obvious that the decrease in the number of questions exactly corresponds to the decrease in the number of answers, down to the smallest of trends. Not only is the number of questions decreasing, which is a confounding variable, but the number of answers is decreasing in a pattern that exactly matches that for questions. Commented Jun 14, 2023 at 11:20
  • 6
    @CodyGray-onstrike Totally agree. data.stackexchange.com/stackoverflow/query/1753665#graph I guess that this statement was made either about the segment of 3+ answer givers or about the month prior to the post (when the number of questions stopped declining). Commented Jun 14, 2023 at 11:27
  • 1
    It should be noted that the mods also have no clue what their own false positive rate is.
    – Era
    Commented Jun 22, 2023 at 22:58
  • 2
    @Era could you please clarify what this implies? Commented Jun 23, 2023 at 6:51
  • 2
    "the community has been specifically looking for ChatGPT-posts. When we deliberately look for something, we will find much more of it by definition" Careful, because when you go looking for something you might misidentify something else for that. Humans brains are fantastic at finding patterns on random datapoints.
    – Braiam
    Commented Jul 12, 2023 at 15:37
  • 1
    @NicolasChabanovsky It means mods could be doing very well or they could be doing poorly with respect to false positives and they can't distinguish between those two situations from the information available to them. It is natural that they would feel they are doing very well, but they don't actually have evidence for that.
    – Era
    Commented Sep 17, 2023 at 16:09
  • * "If the community is fighting something intentionally..." This does not seem to explain the observation made, specifically the observation of a rapid decrease of users who answer 3+ questions per week. You may improve your answer by proposing a testable hypothesis explaining that observation.
    – fiktor
    Commented Jan 20 at 23:11
71

I already posted an answer to this question, in which I try to assert why people create posts using GPT and why others do not find that behavior healthy for the community. I still wanted to post another one, as I am unable to move past what we can read in this quote from the question (emphasis mine):

Based on the data, we would hazard a guess that Stack Overflow currently [June 7th?] sees 10-15 GPT answers in the typical day, or 70-100 answers per week. There is room for error due to the inherent uncertainty in the measurement method, but not room for magnitudes of error. We can therefore say that the rate of GPT posts is far less than it was two months ago, and then it is less than it was two months before that.

So, under what conditions could it be the case that roughly 7% of frequent answerers on the site are still posting via ChatGPT? If this were the case, the site should be seeing at least 330 GPT answers per week, but the rate estimate is not close. This also assumes every user who posts GPT answers are caught, and that GPT answerers post no more or less than three answers in a given week. In practice, the site should be seeing significantly more than 330 GPT answers per week to support this suspension rate.

It could be possible, either due to severe measurement error or due to an unexpected change in user behavior that obfuscates GPT usage using this method. But the evidence for this viewpoint does not appear strong.

Stop taking us for fools

I know for a fact how wrong you are and I find it extremely offensive towards the entire community (moderators, curators, flaggers, askers, answerers and readers) that:

  1. You've dug in your heels so deeply that you base your entire policy on this belief, and
  2. You don't respond to any of the overwhelming amount of constructive criticism shown in the answers to this question.

Stop trusting your tools

Whatever your methodology is, stop using it. Start listening to your moderators and users. You are off by orders of magnitudes, you are ignoring users who slightly edit and mark up GPT-generated text in the answer box, you are somehow severely erring in your measurements, you are seeing (or rather not seeing) at least hundreds of GPT-generated posts per week. You have been for half a year already.

My methodology

I am an English as a second language person. My native tongue is Dutch, and I have been hearing, speaking, reading and writing English for over three decades already. I love actively working with language, researching etymology, rewriting sentences twenty times to have the perfect cadence, throwing in a joke or two here and there, and so on. According to multiple people, I'm doing a terrific job posing as a native English speaker (though not audibly; it is apparently pretty hard to get rid of a Dutch accent on your own).

As stated in my other answer, I have read and vetted literally tens of thousands of posts on this network, mainly on Stack Overflow. Besides that I am an avid reader of and poster on Reddit and Tweakers (a Dutch tech news site and forum).

Having played with ChatGPT for a couple of months, there are some telltale signs about the prose it generates, especially in the context of asking for help with programming problems.

One of the biggest signs is that it will almost never tell you that what you're trying to do is a stupid idea, something developers can't hear often enough. It will do exactly what you ask for, and produce very readable code, often with comments explaining what that code does, and then proceed to repeat that code in the form of an explanation thereof in English. But the code will be conceptually and/or idiomatically wrong, if not syntactically.

This all is leading up to this: I know, through my 30+ years of being very active online, how users behave on forums and Stack Overflow, and how they write in English and Dutch. Sure, doing things a lot doesn't necessarily mean you're good at it; you'll just have to take my word for all of this.

For someone like me, it is not hard to recognize the obvious signs of ChatGPT-generated text and code. And there are dozens of us. Dozens! (There is room for magnitudes of error in this statement.)

Case studies

#1: 10 answers in 30 minutes

On May 19th, right about where the GPT-suspect posts graph converges to zero, I have flagged a post from a now-deleted user with the message:

Someone has discovered ChatGPT. Their last 10 or so answers within 30 minutes have been generated.

To clarify: they posted this answer (screenshot with moderator name and red haze removed using F12 for sub-10K users) that has obvious GPT signs, eight minutes after the question was asked, along with at least 9 other answers, in a 30-minute timespan, after an answering-hiatus of 6 years, ignoring a comment by the OP stating "It does not produces expected results", ignoring my comment that GPT-generated answers aren't accepted, linking to Temporary policy: ChatGPT is banned and https://stackoverflow.com/help/gpt-policy (left three minutes after the answer was posted), leaving all their generated answers up until a moderator deleted them more than 1.5 hours later.

You will be unable to convince me that this person, with their 10 answers showing typical GPT-generated prose in 30 minutes and all other signs, was not using GPT to craft their (incorrect) answers. I did not use tools to detect this, I used the heuristics outlined above (i.e. my brain).

#2: 10 answers in 10 minutes

Two days earlier, May 17th, I encountered an answer which I flagged with:

10 answers in 10 minutes? ChatGPT.

They answered the first revision of a question, basically stating:

[When] the server has an outage [...] my page [...] gets a Http 503 'service unavailable' returned, and that's how it stays [...] it's on a unmanned PC [...]

do I need [...] to redirect to a local page if, on refresh, http 503 (or other error) is returned?

They need code that refreshes their user-interaction-less web page when the server serving that page returns an error.

The answerer posting 10 GPT-generated answers in 10 minutes (an inhuman feat to begin with) after a nearly 2-year answering gap produced this answer (screenshot) which is basically a very generic rehash of every "how to set up error handling for ASP.NET" tutorial, entirely ignoring both the premise of the question and the question itself.

Again: extremely unlikely that this was a false positive, and that this was the only user on that day posting GPT-generated cruft.

Resuming

I'm not in the SOCVR chatroom nor do I do review queues anymore, so I encounter all my questions and answers organically: from searches related to things I'm working on, or from trawling the frontpage while I'm not working.

I have (only) 21 more handled flags on ChatGPT answers, which are older and usually don't point to as many answers as these two flags I expanded on above. Ironically enough, most of those remaining flags were on answers posted mid April, just when your graph starts sharply going towards zero.

If I encounter users with such an egregious display of GPT usage in my casual browsing during the day, in the very three weeks for which you claim the upper bound to be 10-15 answers a day, then one of us must be wrong. And by now I'm pretty confident it's not me.

Mea Culpa

I was in fact wrong at least once. I shouldn't have flagged this answer, accusing it and this other one of being GPT-generated. Both of those answers remain up and the writer unpunished, and rightfully so. Their writing style is what threw me off.

The moderator who handled my flag correctly noted that this was a false positive – clear evidence if there ever was that moderators were not simply rubber-stamping flags or triggering on insufficient evidence.

In conclusion

I'm not claiming to have superhuman properties. I'm not saying I'm free from faults. I'm merely saying that there's something in your analyses causing you to severely under-report GPT-generated content, causing unsubstantiated policy changes that the community rightfully disagrees with.

I don't know what's causing your data to be that off, and I don't know how to express my (and others') heuristics into definitive rules (nor whether we should publicly do so), but you need to reconsider your criteria.

Please listen to your moderators and users.

11
  • 4
    Devil's advocate: What's wrong with keeping the valid GPT answers and only deleting the bad answers? Commented Jun 10, 2023 at 21:11
  • 14
    See my other answer. We don't have enough experienced people to evaluate the correctness of human-typed answers, let alone the generated ones.
    – CodeCaster
    Commented Jun 10, 2023 at 21:12
  • 2
    @JonEricson In principle, nothing. The problem is that the policy would be "ChatGPT can be used as an information source if you verify it for correctness and completeness, and rewrite the answers to remove the tangents it often inserts", which is awfully vague, and for most experts writing an answer from scratch would be faster and easier. You're going to end up with a lot of arguments over a vague policy from people providing answers on topics they perhaps shouldn't be answering questions on, so a blanket "ChatGPT is banned" is a lot simpler and clearer for everyone. Commented Jun 11, 2023 at 2:05
  • 7
    Aside: merely "verify the information is correct" is NOT enough, as I found it somewhat frequently gives correct but woefully incomplete answers missing very obvious things. "Barack Obama is a best-selling author and senator from Illinois" is technically correct, but also so incomplete it's basically just wrong. I found these types of answers the most misleading, and sometimes confused me even on topics where I was an expert. Commented Jun 11, 2023 at 2:06
  • 1
    Re the two false positives: That is where posts from before December 2022 come in (there isn't a significant difference). There also wasn't a hiatus of several years in answering questions (but it is a weaker signal). An even weaker signal is whether they are late answers (by several years) or not. Commented Jun 11, 2023 at 11:32
  • cont' - Late answers are much more likely to be plagiarised or generated by ChatGPT (as they want to avoid scrutiny (read downvotes), hoping for stray upvotes). Commented Jun 11, 2023 at 11:38
  • 2
    "Stop taking us for fools" - That's damn right. If anything, the active participants and volunteer moderators might in a position to have a much better understanding of what's actually going on in the Network than staff do. Commented Jun 17, 2023 at 5:55
  • These comments seem to ignore the existence of low quality human answers. The issue is one of volume of low quality answers, none of the actual issues present in CGPT answers are new.
    – Era
    Commented Jun 22, 2023 at 22:50
  • "there are some telltale signs about the prose it generates" Sure, for the average person using CGPT that's true. You can't use this to identify LLM responses, nor even to identify ChatGPT, since someone could just make a local instance that has a different prose style. It would be incredibly easy to do. This is not a generally reliable method!
    – Era
    Commented Jun 22, 2023 at 22:54
  • The thing about ChatGPT -- which I use regularly when I have to do things like, say, search for a windows registry setting -- is that it is inferior to technical documentation, which is inferior to stack exchange. At best it is a synoptical search that provides key words nested in grammatically correct language. Why should the most inferior solution be allowed to pollute and obscure the most superior solution? 1/2
    – Chris
    Commented Aug 18, 2023 at 21:43
  • Secondly, how is ChatGPT different than a plagiarism database constructed through embedding rather than CRUD operations? It is just a novel angle of interpolative plagiarism (not that it isn't useful) that would have precedent if humans could rapidly memorize and copy-paste content from the breadth of the internet. Why does SO need to allow this glorified linguistic covariance matrix into its space, especially when it short circuits traffic to authoritative sources (human beings' free content whose traffic provides value) that OpenAI has effectively plundered and repackaged for gain?
    – Chris
    Commented Aug 18, 2023 at 21:48
69

The one-size-fits-all policy is a problem

The data you present is somewhat compelling. An answer with many cached versions, showing the answer being fleshed out over time, is a pretty reasonable defense against ChatGPT. I imagine you will eventually turn this heuristic into a metric that mods have access to. I'm not sure I completely buy all your conclusions, but after a quick skim, the data is more compelling than I thought it would be. Certainly none of us mods want to suspend people who don't deserve it.
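
If that heuristic were surfaced to moderators, one can imagine it as a simple per-answer score. A purely speculative sketch with invented field names and numbers; nothing here reflects how the company actually computes its internal metric:

    # Speculative sketch: score how gradually an answer grew in the editor,
    # based on the lengths of its saved drafts. All names/numbers are invented.
    def draft_signal(draft_lengths, final_length):
        """Return a 0-1 score; higher means the text accumulated in small increments."""
        if final_length == 0 or not draft_lengths:
            return 0.0
        increments = [b - a for a, b in zip([0] + draft_lengths[:-1], draft_lengths)]
        # One huge jump (a paste) dominates; many small jumps (typing) do not.
        return 1.0 - max(increments) / final_length

    print(draft_signal([120, 480, 950, 1400], 1500))  # grew gradually       -> ~0.69
    print(draft_signal([1450], 1500))                 # arrived in one paste -> ~0.03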

But different sites are different. On my site (Academia.SE), many of our askers are at a very vulnerable stage in their lives, and they're looking for career guidance from people who have been there before. This is not a programming site where they can say "thanks, but I tried it and it didn't work"; they are often facing life-changing decisions that can only be made once. It is unacceptable that they would make such decisions (unknowingly) based on auto-generated answers. In other words: on Academia.SE, I would much rather turn away someone who is posting content indistinguishable from ChatGPT (which we don't really want anyway) than let someone base a difficult decision on some auto-generated nonsense.

Our community can figure this out. We don't need you to tell us what the answer is. We need you to tell us what the problem is (which you have now done) and then we'll come up with the answer. Our goals are aligned: we don't want to suspend innocent users, and I'm sure you don't want bad advice to run rampant. So please retract your policy (at least for the smaller sites) and let us mods get back to work: we'll start a conversation with our community on meta and decide how we can better balance these two priorities.

15
  • 6
    This needs to be seen, and I hope it will. Thank you for the beautiful delivery.
    – Levente
    Commented Jun 8, 2023 at 1:49
  • 22
    I don't think your characterization of wrong Stack Overflow answers being shrugged off with a "thanks but it didn't work" is accurate. Perhaps sometimes, but low quality answers that contain security problems or bugs that cause mass outages do have a huge impact on the well-being of the world too. An answer like that can be very dangerous if the search algorithms start favoring it.
    – mason
    Commented Jun 8, 2023 at 2:55
  • 11
    Perhaps you're trying to show we (staff +community) are on the same team by saying "I'm sure you don't want bad advice to run rampant" - but I'm not sure that all the SE staff believes bad advice is a problem. Particularly at the CEO and decision making levels. If they actually cared about quality, things would look quite different. They chase quantity of users over quality of content, to the detriment of both.
    – mason
    Commented Jun 8, 2023 at 2:59
  • 3
    @mason, I tend to disagree: personal consequences are worse, specifically for people in a difficult or vulnerable position. Consequences of bad code are generally less impactful, plus many layers of protection are usually in place: code review, testing, quality gates and so on.
    – markalex
    Commented Jun 8, 2023 at 9:33
  • 2
    @markalex With all those protections in place, big breaches that affect thousands or millions must never happen, right? Bank of America, Equifax, Ashley Madison - any of those names ring a bell? I'm not saying quality isn't important on Academia.SE, just pointing out that the consequences of low quality answers can be equally devastating on many other sites in the SE network.
    – mason
    Commented Jun 8, 2023 at 12:40
  • 3
    @mason, I mean that if a bad answer from SO leads to some failure in a major tech project, said bad answer is a single point in a whole sequence of bad events: all the mentioned protections failed, the redundancy failed. The failure must overstep the margin of safety of all layers. For a single person this usually is not the case.
    – markalex
    Commented Jun 8, 2023 at 12:47
  • @mason, and to be clear, I agree about "many other sites in the SE network", just disagreeing on specifically SO's part.
    – markalex
    Commented Jun 8, 2023 at 12:48
  • 9
    You worded this well, and I really like the sentiment; I think this answer is really important. I tend to agree with you that, from a CEO/ business perspective, it's probably highly tempting to see the Network as "Stack Overflow plus some", rather than a bunch of different sites with a wide variety of needs. I agree with your argument that SO has a different tolerance and level of importance for wrong answers than e.g. Acadamia.SE, and that this should absolutely be taken into account... hadn't really thought about that explicitly before, but it makes a lot of sense to me.
    – zcoop98
    Commented Jun 8, 2023 at 16:02
  • @mason Your example of security breaches is just ridiculous. Unless you have any evidence that the breaches were related to specific SO answers, I invite you to retract this comment.
    – Era
    Commented Jun 22, 2023 at 22:57
    @Era you've never seen anyone post code containing a vulnerability on Stack Overflow? Recommend a NuGet package that contains a vulnerability? Suggest it's okay to keep developing on a platform that doesn't receive security updates? Sorry, I've seen it happen far too many times. Stack Overflow is people's first stop when they have a programming question. It is inevitable that people pick up vulnerabilities from there and implement them in their apps, and that it has led to breaches. It's just not all the time your postmortem traces it all the way back to SO.
    – mason
    Commented Jun 22, 2023 at 23:59
  • @mason Yes, but you named specific large companies, you did not express a general concern. Do you see the difference?
    – Era
    Commented Jun 23, 2023 at 0:02
    @Era the large companies I mentioned are to point out that despite lots of resources and presumably layers of protection in place, breaches caused by implementing bad code still happen. So we should do our best to keep Stack Overflow clean of code that contains security flaws, instead of just trusting that something else is going to be in place to prevent a security breach.
    – mason
    Commented Jun 23, 2023 at 0:06
  • @mason I don't disagree with your conclusion, but you're giving examples that aren't examples in a way that tends to promote alarmism. There's no connection unless you can establish one. The fact that the Therac-25 software bug killed people in the 1980s doesn't imply that every SE user is potentially putting people's lives at risk if they make a mistake in an answer. It's still the developer's responsibility and no one else's. SE does absolutely need to maintain high quality, but not because of breaches like this specifically. (If you could establish a link...)
    – Era
    Commented Jun 23, 2023 at 0:15
  • @Era You're asking for something that's difficult to establish. I don't have the tooling and insight into past breaches to trace back the code they wrote to where they originally got it from. And you know that - what you're asking for is ridiculous. It's entirely reasonable that vulnerable code that appears on Stack Overflow will end up in apps, and that it could lead to a data breach. And that's sufficient to prove my point that the content we create on Stack Overflow or any other SE site have serious real world implications.
    – mason
    Commented Jun 23, 2023 at 0:23
  • @mason I know it's hard to establish, that's my point. If it's hard to establish, then it's hard to know whether it's true. You're making claims that are difficult to evaluate either way, however the burden of proof remains on you as the person claiming something specific is happening. I agree SO has real world effects; I disagree with your asserting specifically what those effects are without any evidence.
    – Era
    Commented Sep 17, 2023 at 16:16
64

Thanks for elaborating on your reasoning. However, I still have questions.

How was urgently overriding the moderators' mandate a conclusion?

Your exposition does not at all reveal how such an outrageous action could be the result of your analysis.

Why not merely halt suspensions?

If this is the part you were actually having trouble with, I'm sure temporarily allowing the users who posted AI-generated content to remain on the site would have been a more appropriate solution to the actual problem you were apparently trying to solve, and somewhat more palatable to the moderators, if not outright embraced by them.

If you also wanted to allow alleged AI-generated content to remain on the site, even that could have been more acceptable than what you ended up with. We already have mechanisms for marking content as contested; for example, disabling voting on the reported posts would prevent the OP from using these posts to improve (or ruin) their reputation.

Again, why is the accuracy of automatic detectors important?

Reading between the lines, I guess you are looking for ways to validate the moderators' actions. But why? You don't need a "rudeness estimation tool" to validate suspensions for rude behavior, or proof that promotional posts are genuinely spam, as opposed to honest mistakes by over-eager marketers (actually a really thorny problem in its own right).

You rely on the moderators to make these calls every day, and on the CMs to handle appeals for all of these other cases. Why is this process not acceptable for AI-generated answers? Because it's harder? That's precisely why the community regards them as particularly problematic, you know.

What's with the alleged cultural bias?

You originally claimed the ChatGPT suspensions might have "biases for or against residents of specific countries". In spite of requests to clarify this, I have yet to see any attempt to explain this allegation.

The best speculations I have seen are that some users who are not fluent in English might have been suspended because they used ChatGPT to create their posts. Is that what this is about? How is that a bias for or against residents of specific countries? I would think it would indiscriminately penalize anybody who is incapable of writing coherent sentences in standard English (which includes a portion of the native speakers of English as well).


More tangentially, I have some speculations about the dynamics of the ChatGPT eruption.

Where did we go?

Did users who previously posted answers actually stay on the site?

In particular, did they stop posting answers, but spend more time curating content?

You should be able to see which users stopped visiting vs which users merely stopped contributing answers, vs perhaps also which users flagged posts more often than previously.

Did ChatGPT users get better at evading detection?

Like in many adversarial scenarios, you would expect both sides to evolve.

In the first wave, you would expect many users to get the same bright idea, get caught, and learn from the experience.

While some would simply learn that the community didn't like their attempt at gaming the system, and stop doing that, others would take this as a new challenge to overcome.

My concrete speculation is that this is what accounts, at least partially, for the falling rate of detected ChatGPT answers in recent weeks.

Isn't it great if beginners got helped by ChatGPT?

If you are as enthusiastic as your CEO about the potential of AI, isn't it actually a good thing that Stack Overflow no longer receives mundane "where is the missing closing parenthesis?" questions? This translates to less traffic, but also fewer trivial typo / duplicate questions which are useless noise to everyone except the asker. Fewer junk bytes in your database, less curation time spent by your valuable volunteers on pointless rote content, and fewer newcomers who complain that their low-value, and in the worst case also low-quality, contributions got rightfully downvoted and closed. Thus, happier results all around, and more time for us to answer actually useful and unique questions.

3
  • 6
    Somehow ... I ... don't ... think ... those ... who ... misuse ... punctuation ... would ... actually ... try ... ChatGPT. But I could be wrong.
    – tripleee
    Commented Jun 8, 2023 at 12:17
  • 1
    I think you've forgotten that time they did actually make a rudeness evaluation tool. It was quite well-designed, too: they were using "AI tech" appropriately, without buying into the hype. I hope we can get back to that kind of thing some day.
    – wizzwizz4
    Commented Jun 8, 2023 at 14:58
  • @wizzwizz4 That's quite tangential here, though. The CM team doesn't need a tool to gauge whether someone who posted rude content was rightfully suspended.
    – tripleee
    Commented Jun 8, 2023 at 18:24
60

What follows is the internal ‘gold standard’ for how we measure GPT posts on the platform.

While I'm still digesting much of the data, you seem to be failing to consider alternative explanations for it.

The actual rate at which GPT posts are made on Stack Exchange has fallen continuously since its release, and is now very small.

Or are users editing their GPT posts more nowadays, rather than just copy/pasting, in an effort to evade detection? I've certainly seen the latter, although this week, with the floodgates opened, it's much more copy/paste once again. I cannot remotely imagine that you are seeing a "very small" number of GPT posts this week, as I've spotted more than 100 myself while barely looking. These are clear copy/paste.

5
  • How do people edit their GPT answers now? Adding content, just changing language, exchanging larger blocks, ...? It would also be interesting to see how their measure behaves since last week. Commented Jun 9, 2023 at 10:55
  • @Trilarion I've seen a variety of types of edits. See this one for example, where I receive very similar answers and style back from ChatGPT. However, the first and last sentence appear to be edited, of course. Commented Jun 9, 2023 at 15:31
  • @Trilarion As for how that has changed in the last week, I can't totally be sure. As a non-mod I can't easily find deleted answers from before this week. My gut feel is that the GPT-users this week are far more bold, since they aren't facing as many repercussions. Prior to this week, I've seen some very extensive edits to output that I believe were attempts to hide the ChatGPT usage. Or perhaps users just believe that if they "rewrite it", it's not plagiarism, but of course it still is. Commented Jun 9, 2023 at 15:37
  • It sounds like you are implying CGPT cannot be used in any capacity to write answers, which is not correct. CGPT can be used as a tool in certain ways just like an IDE autocomplete can. That's not plagiarism. In both cases you didn't physically type the characters.
    – Era
    Commented Jun 22, 2023 at 22:47
    @Era Just to be clear, I'm actually very in favor of ChatGPT and other AI being used responsibly here. However, as you can see in that answer, I feel the first requirement for responsible use is acknowledging it. This is simply required by Stack Exchange whenever someone writes an answer with information obtained elsewhere. It's not always possible to know where you originally learned something, but when using ChatGPT, they certainly know where it came from. Related reading. Commented Jun 23, 2023 at 0:24
59

there is no future for the network at all.

You might be closer to the truth than you think, or fear.

Regardless of the above, no Community Manager will tell you that removing 7% of the users who try to actively participate in a community per week is remotely tenable for a healthy community.

Is "trying to actively participate" the only bar a Community Manager sets for the members of a healthy community? You know what Stack Overflow is missing, partially by design? Because it is a "no chit-chat" site, because members aren't allowed to talk on-site, but only post answers, that's what you get: a bunch of uncoordinated (as far as it looks on-site) people with no common goal. You cannot throw hundreds of thousands of members together without hardly any guidance (literally nobody reads FAQs and Help Centers), and expect them to generate answers and to have those answers have something in common:

Quality

Quality comes in many forms. Attention to detail and language, for one. Eagerness to teach, as another. Providing readable code and reasonable examples. Using abstraction and experience to craft an answer that not only answers the explicitly asked question, but also addresses any implicit properties of the problem, and translate that into knowledge that applies to the question as asked but also serves as a useful resource for later visitors.

There is an abhorrent lack of such educators. I've asked about this on Meta Stack Overflow, three years ago: Where's the new boatload of experts who can explain stuff to me like I'm five?. It's very problematic for the network in the long run.

Apart from that, there's another problem on Stack Overflow:

Quality control

I mainly browse Stack Overflow from the homepage. It has a filtered view with questions that match my areas of interest and expertise. I click many questions a day, I often abstain from voting. In my 13 years and 4 months of membership I have posted 3,591 answers, averaging 5 per week. Apart from answering, I downvote unclear questions and incorrect answers. My voting stats (viewable for everyone on my profile):

  • 3,475 upvotes
  • 14,543 downvotes (80%)
  • 11,649 question votes
  • 6,369 answer votes

From this you can deduce that I have read and assessed tens of thousands of human-written questions and answers. I downvote answers that don't explain what they contain. I downvote when answers contain code that omits a glaring problem from the question. I downvote when an answer contains an incorrect claim, especially when there's easy-to-find documentation that proves otherwise.

People do not downvote enough. Worse, people counter-upvote downvotes as they see them come in. When I downvote an answer "too soon" after it gets posted, it sometimes immediately gets a counter-upvote, while being not that great at all or flatout wrong.

Even worse than that is that most posts do not get the experienced eyes viewing them that they deserve, neither for appropriate voting nor for answering or correcting them.

As time moves on, and as existing posts remain on the site and new ones keep being posted, the average quality of posts will keep declining. Allowing GPT-generated answers to stay will only accelerate this.

We need more knowledgeable, invested people, not more posts, nor more copy-pasters.

none of the hypotheses generated by the company can explain away the relationship between % of frequent answerer suspensions and the decrease in frequent answerers, in the context of falling actual GPT post rates.

I have one for you. It's actually pretty trivial. Most posts on Stack Overflow come from people who do not care about quality. Either through lack of knowledge, lack of interest, different education or upbringing, not enough experience, or any other combination of factors. Those people can answer only basic questions. They can do so unchecked, because nobody reads those posts anyway. Those questions now get asked to ChatGPT, instead of on the network. Now those people have nothing to answer anymore. So they venture into different tags, different topics, in which they actually have no experience, but GPT can generate nonsense resembling an answer. So that gets posted instead. Those people get suspended. Learn this is not the way they should behave. Leave.

This message from the moderators is akin to "Hey, you know what you're doing? We don't want that here", and for some users that might be the first time they ever hear that here.

Quality control in measurements

I have no idea how you came to this conclusion:

Based on the data, we would hazard a guess that Stack Overflow currently sees 10-15 GPT answers in the typical day, or 70-100 answers per week. There is room for error due to the inherent uncertainty in the measurement method, but not room for magnitudes of error.

But you could not be more wrong. Take this user. On May 30th, they posted 6 answers. I suspect all of them to be GPT-generated based on writing style. From two of their answers, I know they are GPT-generated because of what's in them:

  • How to set tags for Secret in Azure Key Vault using C#: they hallucinated a class and a method (SecretCreateOrUpdateOptions, CreateOrUpdateSecretAsync()) which don't exist anywhere on the internet except on that post.
  • write List<T> to File: they answered the question to a T and perfectly explained what they changed, but missed a glaring problem: OP is asking for an explanation about generics. What use does a generic method have when it accepts a List<T>, but then constrains T to be PERSON using an if statement? Only an AI could make that up and be so meticulously reasonable about it.

You believe this user to be single-handedly responsible for generating 50% of GPT answers for May 30th? Then it should be trivial to find the other user and ban them both.

You know what Stack Overflow can do? Change. Trust your community-elected moderators to do the right thing. They know what their sites stand for. They know what they, and their community members, do and don't want to happen on their site.

7
  • 1
    "Only an AI could make that up and be so meticulously reasonable about it." ─ I think this is definitely something a confused novice could produce; the modus operandi of many beginners who don't really understand the problem, is to take the original code and make random changes that they know how to make, until it works. But if their other answers stink of GPT (which they do) then there's a high chance this is GPT too.
    – kaya3
    Commented Jun 8, 2023 at 15:41
    Yes, a novice could produce that. They did, actually, through ChatGPT. It's not that hard to learn to recognize its writing style. I fear the future where more people will learn how to make it produce prose that's not so easily recognizable.
    – CodeCaster
    Commented Jun 8, 2023 at 16:14
  • 5
    Thank you for pointing to an instance of ChatGPT answers and calling out what you see that's wrong with them. As a user who only visits the site infrequently and who doesn't have enough rep to see deleted posts, I didn't have many opportunities to see what ChatGPT posts looked like before the mods got them.
    – Tim C
    Commented Jun 8, 2023 at 17:28
  • 4
    From the referenced MSE question: "...it is increasingly difficult to find good questions..." --Eric Lippert Commented Jun 8, 2023 at 20:26
  • 4
    @CodeCaster Finally found your post again. About the hallucinated method, you may find this twitter post VERY relevant. See here Commented Jun 13, 2023 at 16:28
  • 4
    "We need more knowledgeable, invested people, not more posts, nor more copy-pasters." Abso-freakin-lutely. Except. Less higher-quality content probably does not generate as much revenue as more lower-quality content does. Maybe the real problem is that what is good for the community is disconnected from what is good for the company. Commented Jun 17, 2023 at 6:00
  • "I think this is definitely something a confused novice could produce" - So, their answer gets deleted with a wrong classification: "AI" instead of "answer so bad it's unsalvageable". I don't know if the classification really matters. To any user "your answer is just as bad as what ChatGPT produces" should be understandable. It doesn't matter whether the answer was actually AI-derived. It's "AI-level bad" either way, isn't it? And moderation and review and all that is to get rid of bad answers, right? So, IMHO that's all right! Commented Aug 15, 2023 at 23:22
58

This is based on a question that I asked on the Stack Moderators Team. It wasn't addressed there; it may have gotten lost in the shuffle. I've trimmed it down to make it more appropriate for Meta Stack Exchange.

What, exactly, is the problem you are attempting to solve? There may be more than one problem. For each problem, can you express that problem in a single sentence or question? What questions are you trying to answer or what problems are you solving?

After reading this post, I see several different problems being expressed, such as, but not necessarily limited to:

  • GPT detectors have a false positive rate that is higher than is acceptable for use as a moderation tool.
  • User engagement is down.
  • The policy of suspending users for posting algorithmically-generated content is causing potential contributors to leave the site.

It seems like the initial theory is that these are all connected. That is, the problem statement appears to me to be something like: The tooling that moderators are using to detect algorithmically-generated content has a high false-positive rate, which is leading to a large amount of erroneous suspensions, which is causing users to not convert to contributing members of the community.

I don't understand why you would combine these problem statements into one. They feel like three separate issues, although they could be related. Until there is evidence that explicitly links them, I don't see why the assumption should be made that they are related. When you combine problems, you build upon the supposition that part of the problem is true when it may not be.

I do think that there are problems. However, we don't know what the formulation of the problem statement that led to gathering this data and doing this analysis was. And, although making the data and analysis public is a good step, it doesn't address concerns with the underlying methodology.

I would also encourage anyone interested in questions to check out The Art of Asking Questions: Ask Better Questions, Get Better Answers. It's really eye-opening and informative about how the phrasing of questions affects the answers people give. Unless we know the question, it's hard to assess the validity of the answer.

9
  • 22
    The clear message I got in terms of "the problem being solved" here is they are seeing negative user growth at an alarming rate and it coincides with ChatGPT suspensions becoming a standard practice. Therefore they are attributing that alarming negative growth (whether it is correct or not) to the ChatGPT suspension practice.
    – TylerH
    Commented Jun 7, 2023 at 21:31
  • 12
    @TylerH If the problem statement is something more like "the network is experiencing slow or negative growth", then I have company-provided analytics that show that the problem has existed for much longer than ChatGPT. That doesn't mean that ChatGPT isn't part of the trend, but it's a very recent contributing factor. Commented Jun 7, 2023 at 21:38
  • 4
    @TylerH What they haven't ruled out, which seems a fairly obvious hypothesis to me, is that the decrease in user growth and the increase in suspensions are co-morbid symptoms. I.e., they correlate because both the increase in suspensions and the decrease in user growth are directly caused by the growth in use of ChatGPT. Commented Jun 7, 2023 at 22:17
  • 3
    @ThomasOwens Yes, it is a long-term trend, but the post above mentions that, and focuses on how, since ChatGPT, and coincidentally the enactment of insta-banning ChatGPT users, the negative growth rate has become significantly worse. That's presumably why they are running around like their hair is on fire trying to fix it.
    – TylerH
    Commented Jun 7, 2023 at 22:27
  • 3
    @user1937198 FWIW I agree with the argument that some bad data analysis is being made here. Unfortunately Stack Overflow doesn't seem to employ data scientists anymore; by the OP's own admission, they are engineers and community managers trying to perform data science analysis of statistical data. My comment(s) above are simply trying to address Thomas' question of 'what are you trying to solve here', based on my interpretation of the question/announcement.
    – TylerH
    Commented Jun 7, 2023 at 22:28
  • 19
    Absolutely this. The OP even says quite clearly that the attrition inflection timing matches the introduction of GPT, not the ban policy: "In total, the rate at which frequent answerers leave the site quadrupled since GPT’s release." The simplest explanation that occurs to me is that these answerers don't want to compete with GPT posters (regardless of whether they're correct in their perception that they are so competing). I don't have any evidence for that, but neither does the OP disprove it.
    – jscs
    Commented Jun 7, 2023 at 23:11
  • @jscs: That is an interesting theory, but, like regular plagiarism, most ChatGPT answers are not on new questions, but on very old questions (mostly the popular ones with many existing answers) or relatively old (bountied questions). They want to avoid the scrutiny on the new questions (where most of the attention is)—read instant downvotes. They hope for the stray upvotes. And in some cases a virtuous circle (the first upvote can start the snowball rolling). Commented Jun 8, 2023 at 16:12
  • 2
    As I said, "regardless of whether they're correct in their perception that they are so competing". It doesn't matter what the GPT posters are actually doing, it matters what the leavers think.
    – jscs
    Commented Jun 8, 2023 at 18:43
  • @jcs I don't understand why your speculation is relevant.
    – Era
    Commented Jun 22, 2023 at 22:44
51

I have big doubts about this part of presented data:

Yet, at the same time, actual GPT posts on the site have fallen continuously since release

It's implied that posts with a small number of Drafts are occurring less and less often on the sites, and that this means we have fewer ChatGPT answers.

I would argue that the only conclusion that can be drawn is that we have fewer posts that are blind copies from ChatGPT, pasted without even reading them or editing them to apply basic formatting.

I've executed a small experiment: I've gone to one of my latest answers, copied the question to ChatGPT, and attempted to create an answer with its output. It took me three drafts to paste and reformat the answer so that it would be adequate to my personal standards.

I don't have the exact number of drafts required to create the initial answer, but since it consisted of three paragraphs of fairly modest length, plus ready-made code copied from an IDE, I'm pretty sure that it took me fewer than 6 Drafts to create the answer.

I'm open to the idea that while this study was conducted there were some additional internal indicators of answers being AI-made that were applied, but the published part, in my opinion, has major flaws regarding this point on drafts.

4
  • 6
    Good point. You may also want to add that, as awareness grows that copy-and-paste GPT posts are quickly identified, your average user will start editing their posts so that they are no longer straight copies, in an attempt to evade detection. So the fact that even GPT-only posters are starting to edit their posts before posting seems like just a natural "arms race" between the ones trying to detect generated content and the ones trying to post it Commented Jun 8, 2023 at 14:29
  • I do not believe the other conclusion is valid either. The baseline that is measured against, changes as people with previously little keylogger-observable input change their behaviour. E.g. right now the best text editor to reduce my spelling and grammar problems is in my browser, so that is where I type. I imagine with many new tools released, that recently also changed for some non-negligible fraction of the user base.
    – anx
    Commented Jun 8, 2023 at 19:58
    I would simply like to see the ratio of short-draft to long-draft answers over time for at least the last year. If there is indeed a sudden change at GPT onset and a return to the mean later, I'd conclude that there is an effect and that it's more likely due to fewer copy-and-paste answers. I don't believe much in people polishing their GPT answers more often, although that could happen, of course. Commented Jun 8, 2023 at 21:55
    @Trilarion, you make the same mistake as SE did in their analysis: it takes some time to apply formatting (code blocks, marked lists, and so on), so the number of drafts in this case is similar to that of an answer with a short description. As a result, your threshold for "obviously ChatGPT" will be very low. Plus, it completely misses all the tools that imitate human input by copying text into the textbox piece by piece.
    – markalex
    Commented Jun 9, 2023 at 2:45
48

Analogy and motivation

Let's say I'm a dairy farmer in Wisconsin. I look at some data about milk output from my dairy and find that output is down 50% from last year. This is a remarkable drop, and it must be explained! What on earth is causing this? I need to call a vet to check that there isn't some infection afflicting the cows. I need to send our feed for analysis and make sure it's properly balanced. I need to have our milking machinery serviced and checked. This is all going to be very time-consuming and expensive, but it's very important to get to the bottom of it.

But first, I should count how many cows I have, because if I have half as many cows as the year before, then all these expensive solutions which actually might be good for solving a "less milk per cow" problem are not the right solutions for a "fewer cows" problem.

If our Stack Exchange cows are our answerers, the analysis presented in the OP is suggesting that our very best cows, the ones that produce 3 or more answers in a given week, are being particularly affected. They're dropping faster than the overall answers! The proportion of answers by the >=3 answerers is down! But, before we go about an expensive way of figuring out what is targeting our best producers, is there any simpler explanation?

One could, perhaps, find these same effects due to an actual cause that is "answers in general are down" plus an analysis strategy based on thresholds that creates an illusion that the top answerers are most affected.

Recapitulating results

I've grabbed post data from SO in two weeks, one from Nov 2022 and one from April 2023. I'll refer to these as just Nov and Apr from now on, but note these are 1-week examples from each, not the whole month.

Details of the analysis are here: https://pastebin.com/mVEMVuFD I'll just include the results in the rest of the post.

33571 answers in Nov. 21001 answers in Apr, 62.6% of Nov.

2399 users with >=3 answers in Nov. 1360 users with >=3 answers in Apr. 56.7%

47.9% of answers in Nov are by users with >=3 answers. 39.6% of answers in April are by users with >=3 answers.

So far, these results are right in line with the data presented in the original post here: the number of users with >=3 answers has dropped off quite a bit. The proportion of answers by the top answerers is down to a lower fraction than overall answers (39.6% is much smaller than 62.6%).

Checking an alternative explanation

So, let's consider another possible explanation, which is that this result is really just a threshold phenomenon, introduced by the 3-answer threshold. If we want to know if the top answerers are really more affected than the general trend, we need to compare to a suitable null hypothesis.

We can simulate what the data would look like if the decrease is uniform by starting with our Nov data, and randomly tossing out 62.6% of the answers. Then, we can look again at the April results and check whether they are consistent with a simple across-the-board reduction (if they look like the simulated data), or whether the data are more consistent with an alternative hypothesis that the top answerers are being specifically targeted (if they don't).
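
For anyone who wants to reproduce the idea without reading the pastebin, here is a minimal sketch of the thinning simulation in Python. The CSV file name and the OwnerUserId column are stand-ins for however you export the SEDE data; the 0.626 retention figure is just the April/November answer ratio quoted above.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)

    # Hypothetical export: one row per November answer with the answering user's id.
    nov = pd.read_csv("answers_nov_week.csv")   # assumed column: OwnerUserId
    retention = 0.626                           # April answers / November answers

    def summarize(df):
        per_user = df.groupby("OwnerUserId").size()
        top_users = per_user[per_user >= 3].index
        return len(top_users), df["OwnerUserId"].isin(top_users).mean()

    def simulate_once(df):
        # Null hypothesis: every November answer survives to April
        # independently with probability 0.626.
        keep = rng.random(len(df)) < retention
        return summarize(df[keep])

    results = [simulate_once(nov) for _ in range(1000)]
    top_counts, top_fractions = zip(*results)
    print("mean users with >=3 answers:", np.mean(top_counts))
    print("mean fraction of answers by those users:", np.mean(top_fractions))

Where the observed April numbers fall within the distribution of the 1000 simulated values is then what tells you whether the drop in frequent answerers is larger than an across-the-board reduction alone would produce.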

I run the simulation 1000 times. The mean total answers in the simulation is 21000, 62.6% of the original count (good reality check there).

The mean number of answerers with >=3 is 1461, 60.9% of Nov.

In the simulation, the fraction of answers by users with >=3 answers is 40.9%.

It looks to me like indeed, the reduction in top answerers, at least comparing just these two weeks of data, is more than what you'd expect just from an across-the-board drop. The simulation predicts 1461 users with >=3 answers, but the actual April data only had 1360 such users. If you were just expecting the number to drop proportional to the number of answers, though, you would have predicted 2399 * 0.626 = 1502 users, so some of the drop is accounted for by the general effect filtered through the threshold, rather than the specific one.

Similarly, for the "fraction of answers by the top users", the simulation predicts 40.9%. The actual observation 39.6% doesn't seem practically different from the expectation (though it would be statistically significant if you used my simulations as an empirical null distribution).

April actual versus expected (simulated)

If we plot out the actual April versus simulated data we can get a better idea of what's going on. It looks like there are really no fewer answerers than predicted among the very top answerers, those producing over 20 answers in a week. Rather, it looks like there's a proportional drop in people posting 3-4 answers and an increase in those posting 1. I think these data would make me look for reasons that people aren't posting more than 1 answer, rather than only focusing on why people who post a lot of answers might be going away.

Some people have pointed to the 30-minute timeout as a possible cause. One might look at how many of the people who post 3-4 answers in a week have historically been new accounts posting in rapid succession, to guess at that impact. You might also look at how many people experience the 30-minute block in a week. If the number seeing the block is similar to or greater than the 100-150 missing people who'd normally post >=3 answers, that might support it as a cause.

Importantly, the simulation predicts about 5300 people who would have posted in November just don't post an answer at all in April, and of the 1000 fewer people posting at least 3 answers, all but about 100 are following a general pattern of fewer answers rather than a specific pattern.

Summary

In summary, it does look like you're losing more of the "3+ answer" crowd than you'd expect just from the drop in answers, so it's important to figure out where they're going. Importantly, though, it may not be that anyone is going anywhere, really, but just that users who before would post 1 answer aren't posting more. Overall, the discrepancy in frequent answerers is still quite small compared to the overall drop in answers, which likely originates in a drop in questions, as others have pointed out.

I am very open to feedback and criticism on this approach. I didn't have a lot of time to play around in SEDE, but if someone wants to make a data set that has week-by-week data instead of just one week extracted, I'd be happy to tweak the rest of my code to plot this out over time.

47

TL;DR: Your analysis that concludes “automated GPT detectors have unusable error rates on the platform” is not sound.

I can’t easily find what exactly Hugging Face’s percentage signifies, but I assume that it tells you how likely a random non-AI text (of comparable length) is to appear less AI-generated than the given text by its internal metric. In other words, the percentage is the true negative rate. This is the same test you have been exposing it to, just with a restricted dataset (namely only SO posts). Thus, ideally, its false-positive rate would look like the green line I added to your plot:

Expected false-positive rate added to original plot

First, given that the situation is not ideal as you used a restricted dataset (for a good reason), it is no surprise that the blue and green lines strongly deviate in parts, but that’s no problem per se.

In the region up to 90%, Hugging Face performs much better than advertised on your data. I mention this only because you seem to focus on this region for estimating a false-positive rate, although hopefully nobody uses it for detecting AI posts. In brief: If Hugging Face says that a post has a 50% chance of being AI-generated, of course I would expect it to be human-generated in roughly half of the cases.

Now, the really interesting region is above 99%. Here, Hugging Face performs worse than advertised, although not drastically so. Importantly, as you mention yourself, you need much more data to make reasonable estimates here: going by your numbers (500 posts in total), you probably have fewer than twenty posts in this region. Also, any results in this region are highly sensitive to confounding factors such as posts being generated or assisted with simpler machines than ChatGPT and SO being part of ChatGPT’s training data.
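
To make the sample-size point concrete, here is a minimal sketch of how the empirical false-positive rate at a threshold, and its uncertainty, could be computed from detector scores on known-human posts; the scores array here is randomly generated stand-in data, not the company's sample.

    import numpy as np
    from scipy.stats import beta

    # Stand-in scores for 500 answers known to be human-written (pre-ChatGPT).
    human_scores = np.random.default_rng(1).beta(2, 5, size=500)

    def false_positive_rate(scores, threshold):
        flagged = int((scores >= threshold).sum())
        n = len(scores)
        # Jeffreys interval: a reasonable choice when 'flagged' is tiny.
        lo, hi = beta.ppf([0.025, 0.975], flagged + 0.5, n - flagged + 0.5)
        return flagged / n, (lo, hi)

    for t in (0.50, 0.90, 0.99, 0.9975):
        rate, (lo, hi) = false_positive_rate(human_scores, t)
        print(f"threshold {t}: FPR ~ {rate:.3f} (95% CI {lo:.3f} to {hi:.3f})")

With only 500 posts, the count above any threshold near 99% is a handful at most, so the interval is wide; that is the quantitative version of the caveat above.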

Finally, I could be wrong about my initial assumption about what Hugging Face’s percentage means, but I fail to imagine any interpretation that would arrive at a different conclusion.

While it is theoretically possible to achieve better baseline error rates than 1-in-20 by picking higher thresholds, the efficacy of the detector may fall off considerably. A detector that does not produce false positives is no good if it also produces no true positives.

The only way I can make sense of this statement is if you assume that we feed Hugging Face with posts whose distribution of scores is the same as your pre-ChatGPT training data. But that’s hardly the case. Of course, ChatGPT exists now and some people use it to produce answers, and many of those score very high on Hugging Face. In recent months, there has been no shortage of posts that score far above 99%. If we moderate only by AI detectors (which we don’t) and set the threshold to 99.75%, we would still find plenty of posts to delete, which I would expect to be mostly true positives. In fact, the last moderated post I was involved with scored 99.98%.

6
  • 4
    @This: It’s a rate of false positives not a positive rate that is false.
    – Wrzlprmft
    Commented Jun 8, 2023 at 20:24
  • Maybe one could run this detector on old pre2023 data, on 2023 data and on purely GPT produced data, make a histogram of the scores and try to approximate the 2023 histogram as combination of both others. Commented Jun 9, 2023 at 10:50
  • 1
    @Trilarion: The problem with this approach is that the fraction of interesting posts (GPT-produced stuff) is very small, so you would have to run a lot of posts through Hugging Face and any trends will have a strong influence. It’s like trying to determine the number of five-person households with a car by subtracting the fraction of households with a car and four or fewer persons from the fraction of all households with a car, but measured in a different year. It’s essentially catastrophic cancellation, with some complexity on top.
    – Wrzlprmft
    Commented Jun 9, 2023 at 11:15
  • While a similar problem exists with the analysis of saved drafts, it can at least directly focus on all relevant posts.
    – Wrzlprmft
    Commented Jun 9, 2023 at 11:16
    I'm just trying to find alternative ways to estimate the amount of GPT contributions or as a sanity check. Short of requiring additional webcam live feeds for every question and answer, all reasonable information should be used. And why not plotting the average Huggingface score of all new answers per week for the last 12 months including deleted posts for example. If then there is a statistically significant change in Dec. we know what happened, and the direction of the time evolution of the score might also tell us if the problem got worse or better in the meantime. Commented Jun 9, 2023 at 11:38
  • @Trilarion: And why not plotting the average Huggingface score of all new answers per week for the last 12 months including deleted posts for example. – I guess Hugging Face and similar have rate limits for their services, which is why SE’s study only used 500 posts. If they could easily analyse all posts in the year before December 2022, they probably would have done so.
    – Wrzlprmft
    Commented Jun 9, 2023 at 11:49
45

The methodology presented is very interesting, but if the company really believes that human beings can't tell the difference between good quality posts that contribute to the site and poor quality or plagiarized posts (given a pattern of posts made by a particular user), then they might as well shut the doors, turn off the lights and go home.

This is not about sifting AI posts from non-AI posts and I can't understand why everyone is so fixated on that aspect. This is about humans judging content and recognizing patterns that indicate a user is trying to game the system and the moderation tools (for everyone, not just elected moderators) not scaling when it becomes easy for lots of people to generate low quality content that requires more than a glance to determine the quality problems.

It's disappointing that y'all are acting like this is some new problem that suddenly manifested because of new technology. SE has been struggling with moderating low quality content (incorrect, plagiarized, unsupported, irrelevant, outdated etc.) and onboarding new users for all of the years I participated on the network. The decline in frequent answerers is not going to be resolved by preventing moderators from moderating. Healthy communities require that spammers, bots and other bad actors be excluded.

The answer is not to tell the people working on the front lines that they can't use their judgement to try to keep things from falling apart. The volunteer moderators know more about how their communities work and what they need than the company will ever be able to determine by data mining (uh, don't get me wrong, data is still valuable). The LLM AI problem is very similar to the sock-puppeteer problem. It's really hard to detect a skilled puppeteer, and pretty easy to detect an unskilled or lazy one. We don't stop moderators from acting on sock puppets because they might alienate someone who is disrupting the site.

Either you trust moderators to use their judgement, or you don't. If you don't trust them, then you need to replace them with something or someone else that you do trust, because almost every aspect of what a moderator needs to do for a site requires a judgement call, especially now that the CoC prohibits "harmful political content" and other vague categories of speech that can only be moderated by human judgment. The answer (in my opinion) is to get back to Stack Exchange's core purpose, which is to crowd source credibility. With the current design, that means that the giant sites have to be segmented somehow into smaller gardens curated by communities of experts that are invested in them.

1
  • 1
    I cannot believe this post has no comments! It is so much shorter than most others, yet gets right down to the core of the matter, instead of its technicalities. Commented Jul 12, 2023 at 16:35
44

since the advent of GPT, the % of content produced by frequent answerers has started to collapse unexpectedly

The data you show, along with system changes introduced since the advent of GPT, suggest that the most likely cause of the content collapse was identified incorrectly.

A much more plausible explanation is that the collapse in content from frequent answerers is the result of introducing the new suspension reason.

Just think about it. Imagine a user who wants to gain rep points without having sufficient knowledge (for example, because in their region those points impact job offers).

Prior to the advent of the new suspension reason, such a user could post low-quality answers at a fairly high rate... well, practically indefinitely, give or take the minuscule impact of soft rate limiting on those of them who were particularly unlucky. Really, what could stop them? Moderators surely could not, because they don't intervene on low-quality answers. Regular users could not either, because of the rep penalty.

Since the advent of GPT, it is only natural to expect that many such users would see it as an opportunity to gain reputation with even less effort than they invested before. So instead of their prior low-quality answers, they started posting GPT dumps. The difference this time is that, because of the new suspension reason, moderators can now apply substantial throttling (a suspension) after their first few answers.

So what you've got since the advent of the new suspension reason is that the kind of users who previously could post tens if not hundreds of low-quality answers at a sufficiently high rate are now throttled by moderators much earlier - like, after posting just 5-10 GPT dumps.

There is your "% of content produced by frequent answerers has started to collapse", and it's not even close to happening "unexpectedly". (And if my assumption holds that such answerers historically tend to be from regions where rep points matter in the job market, there is also your observation about the increase in region-specific suspensions.)

GPT removal actions are not reasonably defensible

By the same token, the abrupt interruption of GPT-related moderation is not reasonably defensible.

Quite the opposite - a delay in such action is reasonably defensible. You have an official procedure for introducing policy changes, and this procedure justifies a reasonable delay. You could (should) use such a delay to properly discuss this matter with moderators and find out how they would prefer to address it.

3
  • 1
    Maybe someone speaking the language better than I do can add the words "representative sample" and "confounder" in here, because that seems to be the core of the arguments: GPT offenders being compared to totals as if they were just any other user. Yet we should absolutely expect them to stand out in statistics for reasons impacting both their human and their entirely made-up contributions, making this likely measurable.
    – anx
    Commented Jun 8, 2023 at 19:46
  • 1
    To check the ideas in this answer, one could for example look at the score of the answers. Maybe the fraction of positively scoring answers did increase. That would backup the idea that low quality answerers took refuge in using GPT. Commented Jun 9, 2023 at 11:44
  • @Trilarion raw score is heavily confounded by answer speed, which I expect is abnormal for tool-assisted speedrunners. It probably is measurable by score though, when applying a suitable filter on whose upvotes to look at.
    – anx
    Commented Jun 9, 2023 at 22:18
44

There are multiple things I have questions/concerns about, but I'd like to focus on just one. This still seems like throwing the baby out with the bathwater with regard to what the announcement/policy actually says.

The problem you have observed and are trying to correct here is negative user growth: active, answer-writing users are leaving the platform at a higher rate than you like, or that you think is sustainable, etc.

The cause you have attributed to that, based on this post, is suspensions for using ChatGPT, which have also increased by a marked amount in the same time frame. In fact, you even point out in the post, "since we greenlit the suspension on first offense..." (or something similar) to try and show a strong positive correlation.

However, your solution to this problem has a severe negative effect in terms of the underlying problem: ChatGPT content; it encourages ChatGPT content to grow and metastasize across the network, unfettered by moderation attempts. If moderators are no longer allowed to delete or suspend for ChatGPT at all, then we are tacitly (or even expressly) saying that ChatGPT-authored contributions are welcome on the site. As you say, Stack Overflow, at least as it exists today, cannot survive such a reality. It will absolutely become the next Quora (or whatever awful Microsoft forum iteration is around at any given time).


The CM team took the unusual step in concert with moderators at the advent of the ChatGPT problem of writing a site policy that outright banned the use of that tool in answering questions. The community largely agrees with this decision, as Stack Overflow and its sister sites across the network pride themselves in curated, expert content. As we all know, ChatGPT is neither expert nor curated. And because the bar for using it is so extremely low, and the quality of ChatGPT's English grammar and spelling is so extremely high, the absolute fire hose of ChatGPT-generated content on the network really mandated such an unusual response.

However, you also enabled a pretty severe enforcement option for moderators to mete out for violations of this new policy: suspensions for 1st-time offenses. If a new user gets suspended for their first or second post, I agree that it does make for a pretty unwelcoming experience (whether the suspension was warranted or not), and such users are not likely to stick around.

The metaphor of getting told to permanently or even quasi-permanently shut up as soon as you say something for the first time in a new place tends to have that effect.

So...

The problem I have here is that the new policy doesn't just revert the extreme suspension policy. It purports* to roll back the ChatGPT policy entirely. User retention is suffering, so you are going to undo the policy that helps protect the site's entire raison d'être: to provide free, expertly curated answers to every programming question there is.

Have you considered a half-measure somewhere between "no ChatGPT content allowed--you are banned the first time you use it" and "you get ChatGPT content, you get ChatGPT content, everyone gets ChatGPT content!"?

One thought is reducing the suspension threshold to only occur on the second or third offenses (based on severity, of course), similar to how they are for other infractions, while still allowing mods to delete the offending posts with a 'ChatGPT notice' and work out the details via mod mail if a user wants to appeal.

To reuse the same metaphor from before, this would be saying something for the first time in a new place and being told that, while you are allowed to speak, the specific thing you just said is not OK. A far better and more useful piece of feedback than just instantly suspending someone.

I'm glad the suspensions are sometimes (usually?) temporary, but let’s improve on that even more. We are OK with the user, so long as they aren't posting content that isn't allowed. Have moderators show some grace here, and perhaps users will stick around a bit longer. It would certainly be a less harmful iteration, at least, of the policy than what is looming on the horizon, in my opinion.

The main benefit for the site is that ChatGPT copy-and-paste content is still not allowed, and the main benefit for the company is that your metrics of number of new/frequent answerers getting suspended will decrease, likely leading to better (read: less bad) retention numbers. Win-win!


* - I say 'purports' because, as far as I am aware, the moderator team (and certainly not the users of the network writ large) doesn't have clear guidance on what they can or cannot allow on the site (a core tenet of Stack Exchange sites/communities, I'm sure I don't have to tell the CM team, has always been that each site gets to decide for itself what kind of questions it finds acceptable). This, along with the extreme speed at which the marching orders were foisted upon the moderator teams, understandably has a lot of people upset.

1
  • I think that their interpretation is that there is not a significant amount of GPT content being created and most suspensions are unwarranted and of human content creators. Therefore stopping to suspend or at least strongly reducing it might only do good not bad. I guess it all comes down to how high that effective false positive rate and the posted amount of GPT content really is. Commented Jun 9, 2023 at 11:12
37

In the Community Management industry, it is a well-known fact that removing a person from a community, even for a short time, has an outsize impact on the contributor community.

In other words - removing people from the community makes the community smaller? I guess that's important if the thing you care about most is having a large number of members in the community. But here's the secret that anyone who has spent much time contributing on SE knows: it's the quality of the content that draws people here. And they'll stay here even in the face of negativity and having rules forced on them that they don't understand, so long as this is the place they can come to for high quality Q&A.

Seeking a set of root causes for the contraction in the network’s community size, and an explanation for how it affects different sites/community segments around the network, has been the object of study for dozens of people, and for many months now.

I assume by "dozens of people....for many months now" you mean this is something that Stack Overflow Inc staff have been doing? That sounds like a massive waste of effort. High quality questions and answers, and tooling that enables us to promote high quality questions and answers - that's all it takes. No studies needed. Invest in developers, and have them work on implementing features the community has been begging for. Don't waste time seeking answers to questions about how to grow the community when we've been saying all along it's about the quality. If you build the quality, the quantity will naturally grow. If you spend time focusing on the quantity, then the quality will shrink, and thus the quantity will eventually shrink too until you collapse.

No wonder you have such an antagonistic relationship with the community: you don't ask the right questions, you use terrible methods of gathering data (number of cached drafts, really?!) and you don't bother consulting with the community or even the elected moderators before making sweeping devastating changes. If you carry on like this, you WILL lose your company. I get that you have all this pressure from investors to grow the business, but it's your job to advocate for the best way to do that. And right now, what you're advocating for will do the exact opposite of what's good for the platform, and ultimately what's good for the investors.

2
  • 4
    "If you build the quality, the quantity will naturally grow. If you spend time focusing on the quantity, then the quality will shrink" — I believe this simple rule stood on firmer legs before there was meaningful competition. Now however, when OpenAI and soon others (e.g. github's Copilot) are offering a sort of alternative, whatever SE does, the dwindling user engagement may remain...
    – Levente
    Commented Jun 8, 2023 at 2:56
  • 12
    @Levente while I can't speak for other SE sites, we are a long way from Chat GPT really understanding programming problems and coming up with the correct solution in all but the simplest questions. So, where do you go for your not-simple questions? Stack Overflow. And the day AI can answer all the questions and eliminate the need for Stack Overflow, well that's the day software development ceases as a profession anyways. Doesn't seem like it's happening soon, as these LLMs don't really understand the way a human can.
    – mason
    Commented Jun 8, 2023 at 3:06
37

Thank you for providing some details, but I have several issues, mainly statistical. First, there's an issue with

In order for us to consider using detectors of any kind (automated or human) on the platform at these volumes, we’d need to see less than a 1-in-50 false positive rate from them. We’ve selected this rate as a ballpark estimate for acceptability.

At this rate, we would still expect to see around 150 incorrectly-placed suspensions on Stack Overflow in the last six months. This value is still too high for comfort, and ideally, we’d see better rates than this. However, at this level of precision, conversations about how we may put such a system to practice can begin.

There is then a discussion about the false positive rate of HuggingFace’s GPT detector, based on its "threshold score". However, this is only an important issue if a tool like that detector was a significant factor in any suspension decisions, and I have not read about any diamond moderator doing this among the multiple related answers and comments. In fact, it seems many of them don't use automated checkers very often (and, if they do, they never rely on them alone), or don't use them at all (e.g., sideshowbarker's answer, where they state "Background from a striking SO mod who handled 10000+ GPT flags" and "In the thousands of cases I handled, I never ever used any of the detection tools"). Thus, since you've not stated what your actual determined number of false positives is (with the discussion about the GPT detector apparently being used as a proxy for this instead), it's definitely quite possible, perhaps even relatively likely, that the number of incorrectly-placed suspensions is considerably lower than 150.

One other important aspect is that you're assuming any diamond moderators using these automated tools are using them to help decide whether to suspend a user, rather than the opposite. For example, although I don't know if it actually applies to any particular person, a moderator may not use any automated tool to initially determine whether or not to suspend a member but, only after deciding they should probably be suspended, then use the automated tools as a check, with a suspension occurring only if the results indicate with sufficiently high confidence that the posts were generated by an AI. In this case, the use of these tools can only decrease the incidence of suspensions, including any false positives. Would this be an allowable use of those tools?

Note that I believe one important issue, which is a significant, underlying part of the problem here, is the relative importance assigned between having more answers, even at the expense of quality, versus ensuring their quality remains relatively high, even though the number of answers may decrease. The company seems to consider the first option to be more important (e.g., for it to generate more traffic, even at the expense of long-term quality), while I believe that many diamond moderators and relatively active curators (including myself) consider the second aspect to have higher relative importance (in particular, to help ensure the SE sites continue to be relatively high-quality, reliable sources of information).

Next, there's

What follows is the internal ‘gold standard’ for how we measure GPT posts on the platform. It produces a coarse estimate, and can’t be used to decide whether a given post or person is posting using GPT. However, in aggregate, it can offer us insight into the ‘true’ rate of GPT posts on the platform.

This metric is based around the number of drafts a user has saved before posting their answer. Stack Exchange systems automatically save a draft copy of a user’s post to a cache location several seconds after they stop typing, with no further user input necessary. In principle, if people are copying and pasting answers out of services like GPT, then they won’t save as many drafts as people who write answers within Stack Exchange. In practice, many users save few drafts routinely (for example, because some users copy and paste the answer in from a separate doc, or because they don’t stop writing until they’re ready to post), so it’s the ratio of large draft saves to small draft saves that actually lets us measure volume in practice.
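
To make the discussion below concrete, here is a minimal sketch of what one plausible reading of a "ratio of large draft saves to small draft saves" estimate could look like; the column names, the 3-draft cutoff, and the choice of pre-ChatGPT baseline window are all my assumptions, since the company has not published the actual method.

    import pandas as pd

    # Hypothetical export: one row per answer with its week and number of autosaved
    # drafts; week labels are assumed to be ISO dates so they sort chronologically,
    # with the first eight weeks predating ChatGPT.
    answers = pd.read_csv("answer_draft_counts.csv")   # assumed columns: week, draft_count

    FEW_DRAFTS = 3   # assumed cutoff for a "small number of drafts"

    weekly = answers.groupby("week")["draft_count"].agg(
        few=lambda s: (s <= FEW_DRAFTS).sum(),
        many=lambda s: (s > FEW_DRAFTS).sum(),
    )
    weekly["few_to_many_ratio"] = weekly["few"] / weekly["many"]

    # Take the pre-ChatGPT ratio as the organic baseline and attribute any
    # excess few-draft answers in later weeks to copy-pasted GPT output.
    baseline = weekly["few_to_many_ratio"].iloc[:8].mean()
    weekly["excess_few_draft_answers"] = (weekly["few"] - baseline * weekly["many"]).clip(lower=0)
    print(weekly.tail())

My objections below apply to exactly this kind of estimate: if users start editing and reformatting GPT output before posting, they migrate from the "few" bucket to the "many" bucket and the excess simply disappears from the measurement.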

Note that this metric may actually be quite misleading, since the number of drafts may not be that much different between using a ChatGPT result and writing an answer from scratch, as mentioned in markalex's answer:

I've executed a small experiment: I've gone to one of my latest answers, copied the question to ChatGPT, and attempted to create an answer with its output. It took me three drafts to paste and reformat the answer so that it would be adequate to my personal standards.

I don't have the exact number of drafts required to create the initial answer, but since it consisted of three paragraphs of fairly modest length, plus ready-made code copied from an IDE, I'm pretty sure that it took me fewer than 6 Drafts to create the answer.

Also, tripleee's answer indicates that at least some users will learn to get better at evading detection. This would involve things like making more changes to the text (e.g., to have it formatted better, look less like it was generated by AI, etc.), so there would then also be more drafts saved, thus affecting the validity of your metric.

In addition, with your "... was validated against other metrics early on at the peak of the GPT answer rate", the situation may have changed significantly since that initial start, as discussed in my several paragraphs above.

Finally, it seems to me that, overall, you're not appropriately taking into account that correlation does not imply causation.

2
  • 2
    Note that, as discussed in the comments on that answer, its characterization of the extension's functionality appears to be mistaken.
    – Ryan M
    Commented Jun 8, 2023 at 5:45
  • @RyanM-Regenerateresponse Note that I've now removed that part of my answer: although I added your info about the comment on that other answer indicating the Chrome extension doesn't actually appear to type into the answer box, it's unclear how this relates to the number of drafts saved. Commented Jun 8, 2023 at 13:29
36

It is worth noting that, early in the release of GPT, we changed the Stack Overflow rules to require new users to wait 30 minutes between first posts, instead of 3 minutes as was originally set for abuse prevention. If this change were causative, we would expect to see a sudden jump to a new lower level, and a return to the prior well-established rate of decrease. However, we do not see this, a strong indicator of deepening attrition. (We would also expect to see a discontinuity in other metrics not listed – this point is established by a confluence of metrics.)

(Emphasis added)

Can you elaborate on this further? I don't understand how you've been able to determine that this rate limit would cause a fixed decrease rather than a compounding impact on the rate of decline in the number of answers and in the number of answers posted per user.

From a very human perspective, I would think this limit would be enormously frustrating. Users cannot participate in almost anything on the site without having reputation. Earning reputation is hard and answering questions was previously one of the most effective ways of doing so. Starting from a 1-reputation account, you need to earn 124 reputation to have the rate limit lifted.

Assuming you earn 1 upvote per answer with no downvotes (which is honestly quite difficult to do in some tags and especially for new users) you would need 13 answers to get above the threshold. The minimum amount of time to post these answers would now be 6 hours (up from a minimum of 36 minutes prior to this change). I would think that this would have a lasting discouraging effect on users beyond a fixed rate decrease based solely on time.
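
To put that arithmetic in one place, here is a toy calculation using the same assumptions as above (one +10 upvote per answer, no downvotes, and the rate limit lifting once the needed reputation is earned):

    import math

    REP_NEEDED = 124          # reputation still needed, as stated above
    REP_PER_UPVOTE = 10       # one upvote per answer, no downvotes (assumption above)

    answers_needed = math.ceil(REP_NEEDED / REP_PER_UPVOTE)   # 13 answers
    gaps = answers_needed - 1

    for label, minutes_between_posts in (("old 3-minute limit", 3), ("new 30-minute limit", 30)):
        total_minutes = gaps * minutes_between_posts
        print(f"{label}: {answers_needed} answers take at least "
              f"{total_minutes} minutes ({total_minutes / 60:.1f} hours)")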

I understand that users "are obviously not going [to ChatGPT] to answer questions" but that doesn't mean that the rate limit is not frustrating or otherwise deterring our users from contributing on our sites.

2
  • 6
    Further, even if it did not drive the users off the site entirely, it would almost certainly reduce the number of answers they are posting.
    – Ryan M
    Commented Jun 8, 2023 at 4:29
  • As I understand it, this will affect the percentage of new users who stick around on the site, but it shouldn't lead to more users leaving? Commented Aug 8, 2023 at 19:41
36

I've read this twice, pondered for a while on a few of your more vague turns of phrase. Most of my specific constructive thoughts and criticisms have already been covered by other users. I found Kaya3's answer the most comprehensive in terms of breaking down the pitfalls in your data analysis, but there is really something for you to glean from almost all of these answers. For my part I'm going to focus on what seems to be the root drive of your policy:

You're experiencing user attrition, and particularly seeing erosion of "eager answerers". While this is understandably eye-opening from your perspective, I think you're very obviously drawing the wrong conclusions. I won't belabor the details, but suffice to say there is reasonable doubt as to whether the statistical group you're focused on is actually correlated with high-quality answers, or whether the "lost" posts were actually a benefit to the community.

Worse yet, your blundering and unnecessarily imperious attempt at "solving the problem" is actually inflaming the true cause of user attrition: waning community trust in the leadership of SE inc. If you want to reverse the tides, that's the area you need to focus on. The solution to all your woes is as simple as it is hard:

Humility.

You need to admit you were wrong. Acknowledge that the strike was justified by your error, and roll back your decision. Then reapproach this the right way, with transparency and community involvement from the outset.

Otherwise the dedicated and skilled users that make this site are just going to keep hemorrhaging away. Which is just a shame, honestly.

34

As to the "future of the site" part, I think it's imperative to be able to detect AI generated answers. Whilst it is true that voting can take care of hallucinations and plain incorrect answers, the problem lies in one of the core principles of the site: gamification through tying credibility to reputation.

Getting reputation for AI generated content is akin to getting reputation for plagiarised content: it is not your own. Reputation gives you on-site credibility and trust in the form of privileges. But not only that, more crucially, it gives you credibility to other users. An answer posted by a user with 100k reputation is trusted more by readers than when the exact same answer would be posted by a user with 1 reputation. Thus, at some point people would be seen as credible and trustworthy, when in fact it should be the AI getting the reputation.

This is in fact a known problem and has happened in a slightly different form, when a US teenager who did not speak any Scots started editing Wikipedia articles in Scots. They got moderation privileges, allowing them to e.g. roll back corrective edits on their posts. At some point, even translation algorithms, such as Google Translate, started to use their partly/mostly incorrect posts.

Thus, if anything, fully AI generated content should still be disallowed and a detection mechanism should be found.

Note that I am not completely against the use of AI in all forms. Especially for non-native speakers, having an AI clear up their language to make it more understandable and grammatically correct is a good use in my opinion (much like spellcheckers have been doing the past decades). Where the line should be drawn is up for discussion; e.g. is a user answering a programming question with their own code, but having an AI write the complete explanation, acceptable?

1
  • 16
    This is a good point. And not only does permitting cheating undermine the concept of reputation as a signal (or trust, or being good at the game or whatever), it corrodes existing community. Why should I bother playing if the game allows cheats to play? Commented Jun 8, 2023 at 8:09
33

This is basically the same thing Mithical already said, and it is the first thing that came to mind as soon as I started to read your very long post. That said, I am posting this version to give a slightly different approach to the issue.

Your assumptions so far:

  • ChatGPT detection tools have an unacceptable error rate. References:

    1- We ran an analysis and the ChatGPT detection tools have an alarmingly high rate of false positives
    2- Automated GPT detectors have unusable error rates on the platform - this very post

  • Moderators' own judgment has an unacceptable error rate too. References: multiple moderators pointing out that the private version of the policy, which, very conveniently for you, the userbase will never see, doesn't just forbid the use of tools. Since I would not be surprised to see retaliation actions against those individuals, you won't get a list here.

Logical conclusions from your premises:

  • Since neither automated nor human judgment works, it is impossible to determine if any post is the product of automated generation without the author disclosing that.

Your next claims:

  • The company was able to measure the rate of generated content on the site.
  • The company was able to estimate the expected number of false-positive bans on the site based on that rate

Both claims seem to be founded in the strategy you used to detect/estimate the number of AI generated posts.

This metric is based around the number of drafts a user has saved before posting their answer. Stack Exchange systems automatically save a draft copy of a user’s post to a cache location several seconds after they stop typing, with no further user input necessary. In principle, if people are copying and pasting answers out of services like GPT, then they won’t save as many drafts as people who write answers within Stack Exchange.

On paper, this may work. There are two problems though.

  • This idea, which based on your claims works better than any judgement the brains of more than a hundred mods managed to produce, is coincidentally the first thing that came to mind when I was trying to imagine what the tells of a copy-and-paste post could be. I therefore assume that, unless I am far smarter than you, the mods, and everyone else, it is also what most will have thought of too, if nothing else because this is exactly how the "are you a human" checks usually work on sites: if someone performs some action too fast, they are probably not a human.
  • Mods have been telling you that, quite often, users posting generated content try to edit it before posting to remove the more blatant, commonly seen signs of automated generation (for example, the usual final line that often sounds like "it is worth noting that the above is just ... and not an accurate ...."), and that obviously increases the time spent on a post.

For these two reasons alone, I am not really convinced that your data provides much more value than the mods' own analysis and actually imho looks like a pretty disingenuous oversimplification that tries to prove causality while ignoring a ton of other variables, but it is still better than no data.

I therefore once again propose you an experiment: take one of the posts that based on your analysis should be a false positive, identify the mod who claimed it was a real positive, post an anonymized version of the post so that users can also see what the content looked like, and then have both parties explain how they came to their conclusions.

Corollary

I appreciate the shift in tone in this post. Sadly, I still have to consider this as a part of a bigger picture so I still find it important to write the following.

It is worth noting that we don’t believe this discrepancy is due to moderator misconduct or malfeasance. Our goal here is not to accuse moderators of wrongdoing or poor judgment. We respect the fact that they were, and are, working under difficult circumstances to achieve a goal we appreciate.

The sentiment of this passage does not align at all with your actions. Once again you immediately jumped at the chance to post, to a press site, what your own volunteer moderators apparently see as a direct disparagement of their work. Those posts were not removed or rectified after multiple requests to do so, and therefore your current words feel as empty as they can possibly be. You don't know how to start to rectify the misrepresentation you gave because "once on the internet, it is forever"? YOUR PROBLEM.

Also, while we are at it, this is not the first time this happened. Remember someone called... Monica Cellio? How many years ago did that happen again?

Guess what? That [redacted] article... it is still there. Luckily, the Register apparently had the heart to follow up on that story, but neither article seems to be something the company requested in order to clarify things. It is my understanding, therefore, that you were perfectly fine with never taking action to tell the press that you were wrong.

I therefore can't help but ask: how is it that every time some "mishap" of this size happens, we end up with an article posted somewhere that manages to portray the company as the Heroes Of Light (tm) fighting for a better world against the evil forces of the Malevolent Community of Evils (tm)? I seriously doubt that the Register or any other site spends its days watching Stack just to post a new story, so I assume that YOU are the ones who contact them and request an article, not the opposite. This obviously opens another question: what do you hope to gain from those coincidentally wrong representations of the issue at hand and the mods?

2
  • 1
    Just a caveat – I don't think the post claimed at any point that the Company is "able to measure the rate of false positive bans on the site"; in fact, they say exactly the opposite: "Under this assumption, it is impossible for us to generate a list of cases where we know moderators have made a mistake. ... Instead, the most we can do is state that we just can’t tell.". I don't think it weakens your argument per se, but I also think it's important to identify theirs accurately.
    – zcoop98
    Commented Jun 8, 2023 at 16:26
  • 5
    I read their argument more as "We have reason to believe GPT posts have fallen off, but suspensions have not. That implies heavy false-positives." Which, as you point out, is sound, if the GPT-rate data is accurate, which seems to be the main gripe of most folks who feel it doesn't match their experience at all.
    – zcoop98
    Commented Jun 8, 2023 at 16:26
