Examples of ChatGPT being right and wrong on Stack Exchange?

Question

Some sites have prohibited the use of ChatGPT on their site, with the rationale that it produces "mostly wrong answers". I've not seen examples myself of answers of ChatGPT being flat out wrong, just one of the many claims being wrong, and that's pretty good considering that the competing answer contains less relevant information while not having wrong statements.

Do you have examples of ChatGPT answers being right, mostly right, wrong or flat out wrong? How many of those were deleted, edited, or handled in some other way?

Remember that it isn't just banned because it is terrible at answering certain types of question. It does tend to generate wrong answers on technical questions (because it is not designed to produce correct technical answers, it is designed to connect words together that match the types of sentences it sees in its corpus), it also lies about references (making them up) and is quite likely a plagiarism issue — Rory Alsop, Commented Mar 15, 2023 at 23:16
@RoryAlsop and I want examples of all of those. I'm surprised that even while it seems "prevalent", I seldom seen examples of those. It's like "yeah, we know it's wrong, but we won't show you the evidence we have", which is pretty suspicious behavior. — Braiam, Commented Mar 16, 2023 at 15:12
I don't understand. I'm asking about examples of posts on Stack Exchange. If that's not allowed to ask, why the heck we have meta? — Braiam, Commented Mar 16, 2023 at 15:13
@Braiam questions about ChatGPT that allude that ChatGPT is sometimes correct tend to get closed, e.g. How good should a generative question-answering system or a language model be to be allowed to write answers on Stack Exchange? — Franck Dernoncourt, Commented Mar 16, 2023 at 19:14
I've edited this a bit and voted to reopen because closing this as "Not about the software that powers SE" doesn't seem right to me. However, I would consider editing in your reasons for soliciting this information. What purpose do you have for asking for these AI-generated posts and how they were handled? Are you trying to form an opinion based on the software itself, or about various SE sites' decisions to ban/allow them? — Spevacus, Commented Mar 16, 2023 at 19:38
@Spevacus as it says in my question "I've not seen [...] ChatGPT being flat out wrong" in SE. It's somewhat right or I can see where it's coming from. I've seen however examples of being right. — Braiam, Commented Apr 17, 2023 at 13:24
@FranckDernoncourt Noticed that too. Really abuse of power of a couple of anti-ChatGPT moderators if you ask me... Any thoughts where all that negativity comes from? Seems to have devolved into some anti-AI religion or something... — Tom Wenseleers, Commented Jun 17, 2023 at 19:01
@TomWenseleers not sure, my guess it's a mixture of Luddism (some people don't like to think that their work could have been automated), following the anti-GPT trend, elitism of thinking only humans are deemed worthy to be allowed to write answers on SE, and fair concerns about output inaccuracies + lack of attributions of existing models (esp. compared to the other extreme group that keeps chanting on Twitter/tiktok/other clickbait areas about GPT/LLMs and overlooks all their limitations). Note it's not specific to mods. — Franck Dernoncourt, Commented Jun 17, 2023 at 19:08
@FranckDernoncourt Yes true also some very vocal and fanatical anti-GPT group of users. Seems I'm the only one here raving about AI & the potential of ChatGPT4, and that immediately seems to invite and angry mob of anti-AI users & moderators to go after me... Hehe... That alone makes me prefer ChatGPT now: always friendly & helpful. Not half as opinionated. Much more rational. Sometimes wrong, like human answers on SO. :-) — Tom Wenseleers, Commented Jun 17, 2023 at 19:15
@RoryAlsop If you look at my examples below I think it's doing a little more than just connecting some words. :-) — Tom Wenseleers, Commented Jun 17, 2023 at 21:08
@FranckDernoncourt Also noticed that if you post anything supportive of ChatGPT as a comment under stackoverflow.blog/2023/06/14/… that it gets labelled as "this comment is awaiting moderation", while anything negative gets published right away. Democracy on SO seems like a bit of a joke, hehe... — Tom Wenseleers, Commented Jun 18, 2023 at 8:56
I posted about this at Chinese.SE: Examples of an AI (Sage) answering questions at Chinese.SE. — Rebecca J. Stones, Commented Jul 1, 2023 at 23:44

ggorlen · Accepted Answer · 2023-07-02 14:34:25Z

12

Here's an example of a typical ChatGPT spammer on Stack Overflow (you'll need deletion view privileges on SO to see these):

Implement Flutter App Check Breaking Firebase storage on Android - answered May 16 at 22:24
- -4 score
- Response from another user who'd bountied the question: "Sorry dude, as you have no history with Flutter and this feels to me like a ChatGPT response I'm going to hold back the bounty. If this is a legit answer I do apologise, but I already had most of this suggested by ChatGPT anyway and no luck"
How to configure sanity preview to work on live? - answered May 16 at 21:53
- 0 score
- OP asked for follow up clarifying the question, but the GPT spammer didn't respond.
Send verification code via e-mail after keycloak login - answered May 22 at 14:04
- 0 score
Collections of different structs keyed by trait - answered May 22 at 16:48
- -1 score
How can I take a screenshot of a full page with Playwright (visual comparison)? - answered May 23 at 19:39
- -3 score
- When questioned about the post in the comments, the spammer calls the community "moronic" and tells someone "you should reconsider your developing career my guy". I've found this to be typical of the maturity level of ChatGPT rep farmers.
Is there a faster alternative to using dblink or fdw extension to create materialized views in Postgres? - answered May 23 at 12:05
- -1 score
- Spammer responded to a comment with another ChatGPT copy-paste they probably didn't understand.
Micronaut Declarative Client Streaming and error handling - answered May 23 at 13:49
- 0 score
Do I need Load Balancer for internal ReplicaSet - answered May 23 at 15:26
- 0 score
Parsing xml data and writing it on a csv file - answered May 23 at 16:45
- -1 score
- OP responded that it didn't work
How do I spilt my array so that it returns parts of the input and output arrays based on some condition? - answered May 23 at 20:55
- 0 score
- No response to OP's follow-up

I'm not a subject matter expert on all these topics (they're in totally random technologies, as is typically the case with LLM rep farmers), but the net negative impact here is clear based on users who did respond and the vote score, -10.

This spammer is not an outlier and I didn't cherry-pick them (just grabbed the most recent one in my flag history without bothering to look for others). I've flagged dozens of posts from dozens of users since the ChatGPT release. It's a serious problem that this is now allowed--the presence of some helpful LLM posts doesn't counteract the horrible signal-to-noise ratio posed by the rep farmers who could care less if their answers are correct or not and are outwardly hostile to the community rules as they existed at the time.

Here's a sample screenshot of this user in my flag queue, indicative of what LLMs have done to the site, which the company now endorses:

Apparently the above example isn't convincing to some people. Here's my whole LLM flag history in JSON. Quick analysis:

Total post flags I raised from 2018 (my first flag) to 5/2023: 503
Total post flags I raised from 12/2022 to 5/2023: ~148 (probably overcounting)
Total LLM flags I raised from 12/2022 to 5/2023: 114 (jq length)
Number of helpful LLM flags: 111 (jq '[.[] | select(.outcome == "Helpful")] | length')
Number of LLM flagged answers that were self-removed: 3 (jq '[.[] | select(.outcome == "Self-removed")] | length')
Total score of LLM flagged answers: -86 (jq '[.[] | .score] | add')
Number of flagged LLM posts with a positive score: 6 (jq '[.[] | select(.score > 0)] | length')
Number of flagged LLM posts that were accepted: 1 (jq '[.[] | select(.accepted)] | length')
Number of unique users I flagged as having posted LLM answers, omitting deleted accounts: 36 (jq '[.[] | .user] | unique | del(..|nulls) | length')

Note that I didn't try to find LLM answers (proof: I follow Python turtle, Playwright, Puppeteer, Cheerio tags and you'll see these represented in the flags relative to my scores in these tags), just flagged them as I saw them. To give an idea of my enagement, before the strike, I used to spend a couple hours a day on SO answering questions in my favorite tags and doing research. However, if I did identify an LLM answer, I would go through the poster's history and flag all of the other LLM-looking answers I found. On average, these LLM answers had been up for ~22 hours between posting and flagging, plenty of time to accrue upvotes and accepts. I flagged all LLM-looking answers regardless of whether I thought they were correct or not (since LLMs are banned).

edited Jul 2, 2023 at 14:34

answered Jun 18, 2023 at 15:29

ggorlen

6057 silver badges13 bronze badges

7

Giving OP the ability to instantly delete any answer is ripe for abuse. That's way too much control over the community resource. It's not a matter of not liking GPT. I use GPT. It's a matter of keeping it where it belongs so it doesn't undermine/overwhelm other avenues to knowledge. As I've said before, if you want a human answer, there should be a place for that. Instead, people like yourself seem to be interested in polluting any and all sources of information with GPT to the point where there's no avenue to access guaranteed human information any longer.
– ggorlen
Commented Jun 18, 2023 at 18:45
5

Again, the presence of a few good LLM-based answers doesn't justify the unmoderatable barrage of nonsensical crap I've witnessed since LLMs landed. Stand by: I'll be compiling a list of all the posts I've flagged, almost all of which are wrong, unmitigated garbage copy-pasted by someone who probably has no idea what the question is even asking. I don't mind losing the few false positives. We can get those from LLMs directly! I get that you like LLMs and have written good answers with them, but the rest I've seen are rep farming spammers like the featured user in this post.
– ggorlen
Commented Jun 18, 2023 at 18:57
6

As I said, OP shouldn't have unilateral say over what winds up in a thread. They don't even have say about deleting their own question once it's answered with an upvote, because it's now a community resource (think Wikipedia). Ostensibly, LLMs were allowed by the company because they're worried about abusive deletions, and letting OP be the arbiter of this is a sure way to achieve the exact opposite. Also, as I said in the post, I didn't cherry pick this. This is just the first user I came across in my flag history among dozens.
– ggorlen
Commented Jun 18, 2023 at 19:04
6

It's not cherry picked in the sense that the quantity of posts by users that obviously abuse the system outnumbers the useful GPT posts by a factor of hundreds, from what I've observed spending hours per day on SO since 11/30/22. Your posts are literally the first remotely useful GPT answers I've seen on the site out of hundreds. If that isn't cherry picking, I don't know what is. Like I said, I'll compile all of my GPT flags and post them when I have the chance, so there will be no accusations of cherry picking.
– ggorlen
Commented Jun 18, 2023 at 19:14
2

If it's undetectable, I don't mind as much. LLMs are useful tools. I use them. I'm not really a purist. But gosh are they ever obvious. Most answers I reference day-to-day are older than 11/30/22 anyway, and I'm considering adding the userscript to block any newer answers.
– ggorlen
Commented Jun 18, 2023 at 19:16
2

So with that last comment it makes it clear that your sample is entirely focused on the ones clearly engaging in abuse, whilst you have no idea what % of the good answers have been written whilst being assisted by ChatGPT. If abuse is obvious so much the better: that would also make it easy to take action in an automated way then by the SO system itself...
– Tom Wenseleers
Commented Jun 18, 2023 at 19:18
6

Believe me, I'd tell you if I thought there was a good GPT answer I've seen. I've seen none that I can tell. If these were even remotely apparent, there'd be no discussion. Given how obvious the copy-pasted ones are, I find it hard to believe that there'd be an equal quantity of GPT answers masquerading as humans that are so good, I can't identify them. In my SME tags, I know almost all of the "regulars" in the tag, and I've seen the quality of their work before LLMs, so most of the good answers are either posted pre-LLM, or on par with the work of trusted users established before LLMs.
– ggorlen
Commented Jun 18, 2023 at 19:21
8

Nobody's telling you or preventing you from doing that. Go ahead and use LLMs, just please don't feed it back into the human training data. Most people are doing a much worse job of it than you--they're overwhelmingly rep farming spammers rather than well-intentioned university professors, as far as I've seen.
– ggorlen
Commented Jun 18, 2023 at 19:32
4

@TomWenseleers "Definitely after a couple of prompts, asking some clarification on this or that, testing the code & quickly debugging it if need be, etc." this is definitely a good process leading to a good answer, you're doing well with that. What ggorlen is objecting here is that it's much easier to not do what you did with ChatGPT output and instead just copy paste it, and that's what people apparently (and unsurprisingly) did.
– justhalf
Commented Jun 19, 2023 at 16:35
2

The problem with this answer is mainly that you do not know (and here I'm quoting you "I'm not a subject matter expert on all these topics") but you take 0 scored answers as "being wrong". That kinda defeats the purpose of my question. You are just showing random examples of answers that use chatgpt, rather than from an objective point of view was this answer right or wrong for the question asked. I included an example on my question of chatgpt being right.
– Braiam
Commented Jun 19, 2023 at 20:48
3

@Braiam When did I say I score 0 as wrong? You're putting words in my mouth--I'm not claiming LLMs are never correct; I'm talking about an overall trend: -10 points, zero positive comments or upvotes, disrespectful comments. Clearly not a positive contribution, but this is literally the first user I found scrolling through my history of ChatGPT flags and it's indicative of what I've seen of ChatGPT usage on SO. I could have picked a "worse" example. Yes, LLMs are sometimes correct, and I'm sure a couple of these are (the Playwright one was--I'm the 8th highest rated user in that tag)
– ggorlen
Commented Jun 19, 2023 at 22:02
2

@Braiam If you don't believe me, here's my flags: stackoverflow.com/users/flag-summary/6243352. Scroll back from the start until you see a big chunk of red--that's what you see in this post here. Keep scrolling back and you'll see more, larger chunks of red.
– ggorlen
Commented Jun 19, 2023 at 22:02
2

@Braiam Also, my point about not being a SME in all these tags is there to illustrate that, quite clearly, users like this are pulling random posts in random tags and copy-pasting ChatGPT output. The answerer here has no idea what they're talking about, could not answer these questions on their own without AI help, and could care less whether or not the ChatGPT output is correct! And this behavior is now endorsed! By and large, LLMs are not being used as an assistant to preexisting expertise as suggested by this well-intentioned answer.
– ggorlen
Commented Jun 19, 2023 at 22:48
5

@Braiam I suspect people such as yourself are having such a rough time with an answer that disagrees with your point of view is that ChatGPT will rarely disagree with you. The conversation is completely artificial: "You're right. I apologize for my mistake in my last answer. Here's updated answer that doesn't challenge your viewpoint about the world whatsoever". ChatGPT seems fantastic for confirmation bias and will rarely let you know when you have an XY problem or push back on anything.
– ggorlen
Commented Jun 20, 2023 at 17:20
3

@Braiam Sorry, but I don't have the ability or time to give you a precise % accuracy--that's too much to ask, but hopefully someone has the time, talent and dataset to cook it up. Anecdotally based on my usage and from what I've seen on SO, maybe 25% correct, which is way too low. Even if it's 99% though, I still don't see a reason to post LLM answers on SO. Just ask the LLM directly if you want an LLM answer--simple, everyone wins. LLM flooding all platforms is pointless and redundant at best, and at worst drowns out existing human knowledge. And that's at best, which we're nowhere near.
– ggorlen
Commented Jun 21, 2023 at 15:27

| Show 27 more comments

Franck Dernoncourt · Accepted Answer · 2023-04-16 23:21:15Z

-1

Examples:

correct answers: 8-letter word that uses all letter keys on a telephone, https://sports.stackexchange.com/revisions/28585/3, https://webapps.stackexchange.com/a/170232/18147
deleted answer: https://travel.stackexchange.com/a/178066/1810

FYI:

How good should a generative question-answering system or a language model be to be allowed to write answers on Stack Exchange?
How good ChatGPT is at answering questions?

edited Apr 16, 2023 at 23:21

answered Mar 15, 2023 at 19:48

Franck Dernoncourt

41.3k5 gold badges60 silver badges163 bronze badges

2

A comment to the first says "It must have been a fluke of luck, because I asked the exact same question, and it repeated a letter. I pointed it out, it apologized and acknowledged that it really repeated it, and came up with another word instead... with the same mistake."
– This_is_NOT_a_forum
Commented Mar 16, 2023 at 10:05
1

@This_is_NOT_a_forum that's a thing with tools that are in active development, they tend to improve over time.
– Braiam
Commented Mar 16, 2023 at 15:16
2

@Braiam Stephen Wolfram has written some excellent articles about how ChatGPT works, its limitations, and the potential benefit of combining it with Wolfram | Alpha. Please see What Is ChatGPT Doing … and Why Does It Work?.
– PM 2Ring
Commented Mar 16, 2023 at 21:00
10

Yes, ChatGPT will continue to improve with continued training, but no quantity or quality of training data will allow it to overcome the intrinsic limitations inherent in its construction as a Large Language Model. By improving its training it won't stop it from uttering nonsense, but it will make it harder for humans to detect when it's spouting nonsense.
– PM 2Ring
Commented Mar 16, 2023 at 21:07
1

@PM2Ring it's an open debate spectrum.ieee.org/ai-hallucination For SE, the main question IMHO is How good should a generative question-answering system or a language model be to be allowed to write answers on Stack Exchange?
– Franck Dernoncourt
Commented Mar 16, 2023 at 21:32
6

@Franck Yes, it's currently a matter of debate how well a LLM can capture concepts by operating on language structures. But as Wolfram explains, that approach is inadequate for handling anything but rudimentary mathematics and logic, even if you give it orders of magnitude more complexity in its neural network and the data necessary to train that network. GPT-3 can't even reliably balance parentheses when the nesting gets deeper than 10 levels or so.
– PM 2Ring
Commented Mar 16, 2023 at 22:16
1

@Braiam Yes, it can generate plausible explanations of all sorts of things. But it doesn't have any way of evaluating if that explanation is correct. Sometimes it's right, sometimes it says obvious nonsense, and sometimes you need to be an expert to see that it's nonsense, eg chat.stackexchange.com/transcript/message/63191184#63191184
– PM 2Ring
Commented Mar 20, 2023 at 16:24
5

Here's a crude analogy. ChatGPT is like a person doing a jigsaw puzzle. You give it a prompt of a few joined pieces and it adds more pieces, one by one. The picture on the jigsaw consists of a bunch of words written in English, and the jigsaw pieces are cut so that when you join them together the result will be syntactically correct English. However, the guy doing the puzzle doesn't understand English. When we see the result, it looks impressive because we understand English, but ChatGPT just knows how to stick bits of jigsaw together.
– PM 2Ring
Commented Mar 20, 2023 at 16:33
5

@Braiam ChatGPT isn't really like Searle's Chinese Room. See ai.stackexchange.com/a/39313 At best, ChatGPT operates a little bit like a Chinese Room in the way that it can output correct information by following the rules it's developed for continuing a sequence of tokens. But ChatGPT isn't even trying to produce correct output. Its job is to create a statistically plausible continuation of its prompt, one token at a time.
– PM 2Ring
Commented Apr 14, 2023 at 23:03
2

@PM2Ring What do you think about my examples below?
– Tom Wenseleers
Commented Jun 17, 2023 at 21:49
2

@TomWenseleers I agree with GPT-4 that the bee nest question is fascinating. ;) Yes, GPT can be a useful assistant. The problems (mostly) happen when people who don't understand how it works or its limitations let it be the primary navigator instead of a copilot. Unfortunately, it's logistically difficult to allow responsible use of GPT on our network while preventing the large volume of irresponsible use.
– PM 2Ring
Commented Jun 18, 2023 at 4:20
3

@TomW Even when you do use GPT-4 responsibly, it can be tricky to catch all of its mistakes, even if you're a subject matter expert. Eg, Scott Aaronson (arguably one of the world's leading experts in Quantum Computing) initially missed one of the mistakes that GPT-4 made on his QC exam that I linked previously.
– PM 2Ring
Commented Jun 18, 2023 at 4:26
1

@TomW Under the SO policy banning ChatGPT, the correctness of the answer is irrelevant. The system was overwhelmed by the sudden influx of GPT answers, most of which were just raw copy & paste with no validation, so the mods decided that the only practical way to deal with the sheer volume of GPT content was to ban it all.
– PM 2Ring
Commented Jun 18, 2023 at 6:58
2

@PM2Ring Yes but that initial phase is already long past, based on the data on GPT based answers on SO, meta.stackexchange.com/questions/389928/…. And the reason them taking down my answers where I disclosed use of GPT had nothing to do with being overwhelmed. Some of the moderators that took it down were supposedly on strike. I always thought the ban was against plain copy & paste GPT answers. But the ban is constantly being edited to become more extreme. No single GPT line in answer allowed or Github Copilot code etc.. Ridiculous!
– Tom Wenseleers
Commented Jun 18, 2023 at 7:07
5

@TomWenseleers it's literally in the help center, and has been linked to you at least a dozen times in the past week.
– Kevin B
Commented Jun 20, 2023 at 14:57

| Show 9 more comments

Tom Wenseleers · Accepted Answer · 2023-06-20 14:52:49Z

Here an example of a fully correct almost verbatim ChatGPT4 answer (almost unedited, only with the output of the code included to show that the code works & was verifiably correct) : Ridge Regression in R where coefficients are penalized toward numbers other than zero This answer was left, though it received some unfair criticism in the comments section & was not upvoted relative to another less detailed answer to the question of the OP that did not have reproducible R code. I should note that as a domain expert I could immediately verify this answer to be correct. I also disclosed its use, as I should have. This was the output after a single prompt - the exact question asked in this case, plus two more prompts to ask for some extra details.

For another answer I had the OpenGL+SDL code correctly produced by ChatGPT4. This also just took three prompts - two where I pasted in some demo code of the nara and rdyncall packages (as they are not in the ChatGPT training set - they are only available on github) & then the actual question. The only bug in the original code was that the image came out upside down, but with the 4th prompt ChatGPT4 corrected that. I then manually added the benchmarks & wrote out the rest of the answer. This answer was deleted repeatedly by moderators, which is still not resolved, but in the meantime I reposted a somewhat edited answer: Performant 2D SDL or OpenGL graphics in R for fast display of raster image using rdyncall package and SDL/OpenGL calls. Here I had originally disclosed the use of ChatGPT4 & how I had verified the correctness of the code (by showing the benchmarks & correct output), but doing so invited the wrath of moderators who then promptly deleted the answer (3 times no less). This now forced me to slightly rewrite the answer & some bits in the code to evade the answer being deleted. A very bad outcome if you ask me. This question I had given 4 or 5 bounties before over the past 5 years and it received a high number of votes, but it never received an adequate answer - the other answer that is there I would count as a "hallucinated human-written answer" as it hallucinates that solution would solve my problem & would work (it should rather just be a poor suggestion, written as a comment at best). Yet, it is still more highly rated than my answer because moderators and users keep on downvoting my correct answer merely out of antipathy for GPT (even if in that case I wrote most of the answer myself AND it's my own question). Before, at one stage someone also posted a GPT3 or ChatGPT3.5 produced answer that was not working - this I detected as such & I downvoted that answer and it was taken down (if as the OP I would have had the power to do this I would have done this myself, that would decrease the burden on moderators significantly). Of note: the accuracy of ChatGPT4 for coding and maths (together with the Wolfram Alpha plugin) has become way better than for the free ChatGPT3.5. An entirely different kettle of fish... See here for some recent maths benchmarks: https://arxiv.org/abs/2306.08997.

For this answer I used GPT to translate some R code to Rcpp, making the code run 15x faster than my own pure R code (I am an Rcpp beginner, so that was useful, but I could of course confirm that the code was correct & working): Faster way to calculate the Hessian / Fisher Information Matrix of a nnet::multinom multinomial regression in R using Rcpp & Kronecker products (again I disclosed its use). The rest of the answer was manually written. Here some Rcpp expert would in principle have been able to write an answer, but in reality it is unlikely that anyone would ever have bothered to put in the required time.

In all 3 cases it would be dead easy for any moderator and the OP to verify that what is posted makes sense as reproducible code plus output or benchmarks are presented, which makes it clear that the code was checked. It would be much less work for any moderator to check that than to try to infer if someone had used ChatGPT4 or not if they wouldn't have disclosed it. So for programming related questions at least I don't see any problem. Things might be different for answers requiring particular factual knowledge & authoritative sources. There I wouldn't recommend ChatGPT just yet, even though the ScholarAI and web browser plugins can now also give sources, and pasting in a couple of references from Google Scholar manually of course would hardly be any work.

A fourth case, not posted here on Stack Overflow but on Twitter, but instructive nevertheless, was a case where I gave ChatGPT4 with the Wolfram Alpha plugin a verbal description of a biological problem (that had not been analysed yet in the available academic literature). It then correctly wrote down the corresponding differential equation system (using its LLM logic) and then passed this to Wolfram Alpha / Mathematica - there it made a slight mistake though not including a given constraint (S+P=1 in this example), which caused Wolfram Alpha to say the system was unsolvable. It then proceeded solving the differential equation system itself for the stable point, but made a slight mistake there when rearranging terms (much like we could make a mistake when working with pencil and paper). In this case, even though it didn't quite give the correct final result, it was easy to see for me where it went wrong & correct the mistake. It also successfully made various modifications to the model if I presented it with additional biological details that had to go in into the model & made correct modifications to the differential equation system. I will include the final model in an article that I plan to submit shortly. Here the input and part of the output:

For fun I also tried to see how good ChatGPT4 could be at rating Stack Overflow answers & I would say pretty good. For this question & the answers posted it gives good and accurate comments on each of the answers & it would give each of the answers a 6/10, while my answer it gave a 9/10. Based on my expert knowledge here that would seem about right. The replies are pasted below. ChatGPT4's own answer was also pretty good (certainly if edited a tiny bit, e.g. just retaining the best suggestion in terms of method - almost as good as my answer there right now). The one going on about transformations I would probably give a lower score, as that's kind of irrelevant here. Probably ChatGPT4 based moderation would be too expensive still to do at scale, but smaller, cheaper AI bots could probably be trained & do a lot of the moderation work on SO eventually if desired, if human moderator volunteers would not be able to keep up. And the original posters should in my mind also get more privileges to delete wrong or totally irrelevant answers. The OP should be best placed and have the highest incentive to evaluate the correctness of any answer & whether the proposed solutions actually solve his/her problem or not. This should not be the task of moderators. The nonworking RGL solution to my question e.g. I had always wanted to delete myself, but I can't unfortunately.

ChatGPT4 is also pretty good at redacting existing answers and adding more details, e.g. reproducible code or even authoritative sources. For example, taking the most accurate existing answers that lacks in sufficient detail, and asking ChatGPT4 to add essential details & authoritative sources gets you this - much better than what is currently there (which is saying little more than "pick up a book on ICA"), though still not perfect (e.g. quadratic programming problems have linear constraints too, just quadratic loss) :

These examples are drawn from my experience working daily with ChatGPT4 over the past months for work (in academia, my background is evolutionary biology & biostatistics). I could post 100s more examples, but I believe these examples are representative of the current quality of ChatGPT4.

I would like to see some formal benchmarks of the accuracy of ChatGPT4 relative to the accuracy of an average answer or the best accepted answer on Stack Overflow done in a fully blind way (this doesn't cut it): just give ChatGPT4 500 SO questions from after Sept. 2021 (GPT training date cutoff), get answers from ChatGPT4 & SO & have experts rate them blind. My subjective feeling, mainly working on statistical questions and programming in R, is that ChatGPT4 is on a par with an average Stack Overflow answer in terms of quality, but perhaps not even too far off from the quality of best accepted answers, at least for the field in which I work (can't tell for other fields). For me it doesn't bullshit any more than real humans, even domain experts. For the example above, the answer of a user named Carl is in fact misguided & irrelevant, but an unsuspecting visitor might have a hard time inferring this. ChatGPT4 also admits sometimes not knowing something about a given topic, that it doesn't have sufficient information, that you would better consult a domain expert, or will tell you if a particular suggestion is plain wrong. Sure, it's sometimes wrong or makes some small mistakes. So do humans. Many domain experts can sound convincing but quite frequently they are dead wrong too. With ChatGPT4 now scoring 155 on a regular IQ test and giving answers to university entrance exams that would be sufficient to get admitted to Stanford or MIT, making it land in the top10%, it's clear it's become pretty good.

Most academic journals, like PNAS, did not opt for a blanket ban on generative AI, but allow responsible use. All that a blanket ban will achieve is that users will stop disclosing the use of ChatGPT, making it hard or impossible to police (except to pick out the most blatant abuse), or that users will stop visiting SO altogether as ChatGPT can solve their problems faster, which could risk causing traffic to SO to decline further. And answers that can't be solved assisted by ChatGPT will most of the time also not end up receiving a human generated answer. In fact, right now only 8 out of my own 20 questions on Cross Validated were ever answered - 3 of which by myself - that's only a success rate of 25% to get useful help from others that solved my original problem. If I paste the 12 unanswered questions into ChatGPT4 I would count about 83% of the responses as useful & largely or entirely correct (i.e. I would have given them a checkmark, whereas the current answers are either beating about the bush, not solving my original problem or are plain wrong & for some there is no answer at all). And this is unredacted ChatGPT answers - which should not be allowed to be posted anyway - but with a bit of redaction (e.g. pasting in some authorative sources etc) such answers clearly would have been better than anything posted right now. As this concerns unanswered questions, posted a long time ago, this should be a representative, non-cherry picked set of examples. From this, it is clear that more complex problems that ChatGPT4 cannot solve likely would never be answered on SO either (e.g. because they would still require significant theoretical work to solve). In my mind, there is also a disconnect between the results above and the statements in the replies to the ban on generative AI announcement or articles like this one. Most of the time those answers and comments seem outdated & are not in line with the current quality of generative AI or published benchmarks.

To be clear: I do appreciate that some users abuse the system by posting unverified copy & paste answers from ChatGPT (including from the free, lower quality versions) and that this should not be allowed & countered. But the examples of obvious abuse that ggorlen cites in this thread were (1) almost always detected by the OP, who should have the privilege to delete such answers, which would in that way not cause any extra burden on moderators and (2) would be obvious to detect automatically & send a warning or kick out of the system, without moderators having to do this (typically new users that suddenly start posting loads of low quality posts in high succession & never get back to any comment or ever get their answers upvoted & post nonworking code). The examples cited are also just that: examples of clear abuse that would be easy enough to pick out. It says nothing about the % of posts where ChatGPT assisted in formulating an answer (or GPT Bing output, which for many will now be the default search engine) but where it was not disclosed, because they would be impossible to tell apart from any other good answer. And that's ultimately all that counts: is the answer any good? So what I strongly object to is that this whole issue of GPT on SO has devolved into some semi-religious anti-AI witchhunt. I agree with Philippe that this does run the risk of many more users just leaving the platform, which would be a great shame, as having an active user base is essential to keep this plaform going.

Great examples! ChatGPT4 never ceases to amaze me. Very ironic that mods deleted the answer you posted to your own question. I guess some people just prefer to waste human time. — Franck Dernoncourt, Commented Jun 17, 2023 at 22:00
Yes indeed - three times no less - maybe they wanted to show what a great, warm community Stack Overflow is, where you can benefit from all those lovely interactions with real humans. I think for now I'm going to stick to ChatGPT4 for a while. At least until I don't have to fear for being banned for posting a few lines of illegal GPT code. ChatGPT4 is really lovely. It never labels my questions as duplicates and always finds my questions fascinating! And always congratulates me on my thoroughness in challenging GPT if I ask for some extra clarification. :-) — Tom Wenseleers, Commented Jun 17, 2023 at 22:12
W.r.t. the ridge regression question, the approach described in the top-voted answer is more straightforward than the one described in yours, gives the same result, & moreover can be used with LASSO or the elastic net: it seems unfair to characterize it as "inferior"; or at least it's easy to see why people might prefer it. — Scortchi - Reinstate Monica, Commented Jun 20, 2023 at 13:36
@Scortchi-ReinstateMonica Perhaps - I personally like to solve ridge & nonzero centered ridge regression as a regular LS problem with augmented observations (as in stats.stackexchange.com/questions/69205/…). Also immediately would allow for inference and so on & also works to put ridge penalty in a GLM by modifying the weighted LS step of the IRLS algo. It's at least much more detailed than the other answers. Mainly took issue with the SDL+OpenGL answer being deleted several times... — Tom Wenseleers, Commented Jun 20, 2023 at 13:53
Yes, but the bulk of people just want to use a canned routine to do ridge etc. regression, and your answer isn't helpful to them, relatively speaking. It's like the fact that I drive a Miata with a manual transmission doesn't mean that I'm going to recommend a stick shift to someone who asks me for car recommendations; I know that only 1.5% of cars sold in America have manual transmissions and few people know how to use them, so I might mention it as a possibility, but it wouldn't be where my recommendation effort would lie. IMO your answer was interesting but only possibly useful. — jbowman, Commented Jun 20, 2023 at 16:21
@jbowman But my answer also just required have any linear least squares solver available right? Just a lm() call would do it in R, working with an augmented design matrix & outcome variable, but still only an lm. That is a canned routine. While you are assuming they have a package available to do ridge regression, like glmnet or the ridge package, which are both relatively more specialised... — Tom Wenseleers, Commented Jun 20, 2023 at 16:45
1. lm does not select an "optimal" $\lambda$, but, for example, glmnet will, via cross-validation, require one line of code to do so. Using lm alone requires hand-crafting some approach to doing so; it is not a complete solution by any means. 2. if they use SAS, R, or Python, they do have such a package available. They are hardly rarities these days. Suggesting a lengthy hand-crafted approach to avoid typing in library(glmnet) seems... counterproductive. — jbowman, Commented Jun 20, 2023 at 19:23
Here's an example of what you can get with glmnet in just a few lines of code: statology.org/ridge-regression-in-r — jbowman, Commented Jun 20, 2023 at 19:25
@jbowman Thanks - yes, I know glmnet very well of course... I use it all time. But I also use hand-crafted ridge regressions all the time, e.g. iterative adaptive ridge regressions to approximate best subset regression in github.com/tomwenseleers/L0glm/blob/master/benchmarks/…. Then one does need a bit of a deeper understanding of what's going on. Anyway I'm all cool with it. Seems the OP is no longer around to ask which solution would be most useful for him & suit him best... — Tom Wenseleers, Commented Jun 20, 2023 at 19:50
@jbowman My post was not meant to imply what you suggested was bad in any way, rather I wanted to show with these examples that ChatGPT4 produces much more accurate answers than how it is typically portrayed on SO... — Tom Wenseleers, Commented Jun 20, 2023 at 19:55
@jbowman And yes, taking a fixed pre-specified lambda is not what most people do in the ML community. Though Bayesians do it all the time: they hate the idea of tuning lambda / the SD of your Gaussian prior based on your data... :-) — Tom Wenseleers, Commented Jun 20, 2023 at 20:01

Stack Exchange Network

Examples of ChatGPT being right and wrong on Stack Exchange?

3 Answers 3

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged
discussion
statistics
chatgpt
.

Linked

Hot Network Questions

Examples of ChatGPT being right and wrong on Stack Exchange?

3 Answers 3

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged discussionstatisticschatgpt.

Linked

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
discussion
statistics
chatgpt
.