Something that I've noticed during my time on here is an excessive quantity of programming questions related to Pandas. Normally, given the package's popularity, this wouldn't raise any eyebrows. However, it seems that most of these questions are being asked by people that do not understand basic Python, in use-cases that would better suit built-in features.

Is there some popular guide all of these beginners are following that is leading them to use this package when in most cases, a basic list would suffice? If not, why is the Venn diagram of Python beginners and Pandas users nearing such a circle?

    A better question is why do high rep experienced users with gold badges continue to answer blatant duplicates instead of closing them :sigh: -.- Commented Nov 30, 2021 at 11:12
    @Nick That sweet, sweet reputation
    – Libra
    Commented Nov 30, 2021 at 11:13
    I mean, who doesn't like fluffy, dichromatic, bamboo-eating bears? They are obviously so much more friendly than slithering ambush predators who kill by constriction! Commented Nov 30, 2021 at 11:16
    @CodyGray With Anaconda you can have both at the same time. Somehow that doesn't fix the ambush predator problem though. Commented Nov 30, 2021 at 11:27
    duplicate of a theoretical question "why is there such a large quantity of beginner JavaScript questions that concern jQuery" :) Normally, if you are at least somewhat of a knowledgeable person in tech, you can find pretty much every answer to your questions without resorting to asking one on SO (and if one still does, the issue is usually beyond their control in the first place). However, if you are not, and you know that SO is the place to "Ask questions, get answers, no distractions" (cited from the Tour), you come here in hopes of free tutoring from "gurus". Hence, the flood. Commented Nov 30, 2021 at 12:04
    @OlegValter What I'm more curious about is not why they are here, but why in general it appears that a disproportionate number of users are learning things like basic for-loops while using a dataframe. There's some odd internet pipeline leading these people to use Pandas for ridiculously simple things. JavaScript and Python are the two tags I've sat on the most recently, and I honestly don't find the trends comparable
    – Libra
    Commented Nov 30, 2021 at 12:07
    @Laif oh, that's exactly what I referred to in my comment :) There is also a large market push for "you can become a data scientist in no time with no knowledge, passion, or idea of how things work". Programming, and especially Python as a language, are "trendy" now, so it is hardly surprising to see absolutely clueless people trying their hand, miserably failing, and running to the "toxic" SO to "get help". Commented Nov 30, 2021 at 12:09
    +1 I'm a Python programmer, and I have [pandas] on my ignore list due to the sheer volume. Do I ever browse the tag anyway? Only when I'm looking for a pile of low-quality posts to edit.
    – Nat Riddle
    Commented Nov 30, 2021 at 14:04
    I'd imagine that many of these people want to use Pandas as a tool to do data analysis. Pandas is, obviously, great for that purpose and it does a reasonable job of abstracting away pain points like file formats and databases. So people can get a long way knowing very little Python but quickly get lost if they have to step outside their comfort zone and so come here to ask their "beginner" questions. Commented Nov 30, 2021 at 14:47
    Pandas is the one tag that I've added to my ignore list because I was downvoting every single question I read with that tag and it just became depressing. Rarely ever do you see a pandas-tagged question with an MCVE or research.
    – Sayse
    Commented Dec 1, 2021 at 11:36
    Similar things happen in the R tag, where people ask the same questions about very simple operations but insist they can't use base R functions; they have to use one (or all 30+) of the tidyverse packages. The FAQ for the tidyverse tag is almost always ignored too
    – camille
    Commented Dec 1, 2021 at 16:39
    @SteveBennett It's not about what you find personally satisfying. It's what helps positively contribute to the quality of this site and its overall mission of becoming a resource of high-quality answers to programming questions. It helps no one but you when you duplicate content that is already available elsewhere. What do we need to do in order to persuade you to actually help improve the site, rather than satisfying your own ego? Commented Dec 2, 2021 at 5:43
    I would argue over-saturating solutions for the same problem is explicitly harmful to the community and its functionality
    – Libra
    Commented Dec 2, 2021 at 7:25
    @SteveBennett pay you to follow the stated guidelines of the site? If figuring out the answer to something is what's satisfying, you don't have to actually post it. I probably draft twice as many answers as I end up posting
    – camille
    Commented Dec 2, 2021 at 15:15

It's a combination of multiple reasons, in my opinion:

  1. Python is very popular, and Pandas is one of the most used libraries for data analysis and manipulation. Beginners mistake Python for Anaconda+Python+NumPy+Pandas+whatever. Basic university courses on data analysis or economics are almost guaranteed to touch Python and Pandas without students having ever heard of those before.

    Let me give you an example from my personal experience: I used to have a couple of online postings for Python programming lessons... I had to take them down because people didn't understand that I would teach them Python, and not Python+Pandas.

  2. The whole Pandas library has a lot of counter-intuitive elements and paradigms for beginners, especially if you couple it with NumPy's "symbolic" paradigm, and especially if those beginners are also beginners in Python.

    I am pretty good at Python, but I'm definitely not a data analyst, so I rarely use such libraries. The few times I had to use the library to do the most basic thing, I had to spend hours Googling or reading documentation.

  3. People are lazy: if I see 10 questions a day on that are "please debug and fix this simple piece of code for me", I suspect you will see 100+ on . Programming has become so easy that people stop and think about posting a question at the most minuscule problem, while posting a question should actually be your last resort after some good amount of effort has been spent on the problem.

  4. New users don't care about the quality of the content and have nothing to lose in posting a low-quality question.

  5. They are not even hesitant to do this because they actually have a decent chance to get an answer, as they keep seeing this kind of question answered on a daily basis. Filter by accepted answer and sort by new (link) and you'll see.

  6. The tag has a lot of users floating around. Low-rep users may not even know what voting to close a question means since they've never done it. They just see an easy to answer post, think "hey, I can help!", and jump into it.

  7. Even considering higher rep users, sometimes it is a lot simpler to answer the question with a one-liner than to look for a fitting duplicate (SO's search functionality doesn't help in this regard, and using Google with site:stackoverflow.com can be a pain to filter). So apart from the reputation gain, there is some sort of low effort on the end of the answerers too, but I don't know if I can blame them to be honest.

All in all, I think this is an issue that is present in all high-traffic tags. I definitely see it also happen in for example. What can we do? Not much, just keep dedicating our time to curating the content and vote to close obviously zero-effort duplicates. Unfortunately the askers greatly outnumber the answerers, and the amount of questions that get posted will inevitably start to get overwhelming at some point on any popular tag.

    "Basic university courses on data analysis or economics are almost guaranteed to touch Python and Pandas without students having ever heard of those before." I think this is probably the whole explanation. There are a lot of beginners using Python+Pandas who don't really understand programming, because Python+Pandas is being used for teaching data analysis to students with no prior programming experience. Even if Pandas was straightforward and paradigmatic, and the students were all capable self-learners, we'd still expect to see a lot of beginner Python+Pandas questions.
    – kaya3
    Commented Nov 30, 2021 at 17:04
    @kaya3 yeah that's why I put it as #1. Unfortunately teachers are also not programmers and this is ultimately what gets beginners to ask here instead of e.g. emailing them or waiting for the next class. Commented Nov 30, 2021 at 17:06
    Off-topic, but from time to time I find it entertaining to go back and take a look at the updated results of this Data SE query that I made last year comparing the average question scores of top languages vs Rust and Haskell. Commented Nov 30, 2021 at 17:32
    "Basic university courses on data analysis or economics are almost guaranteed to touch Python and Pandas without students having ever heard of those before." Unfortunately this makes a lot of sense. I've had senior engineers at work ask me for help with Python using something like Jupyter or Anaconda, and they were surprised when my first advice was to ditch those and open a text editor. I suppose the problem with having an endless amount of programming support is that not very much of it is going to lead to an effective learning curve.
    – Libra
    Commented Nov 30, 2021 at 21:04
    And don't forget that from high school if not earlier children are directed here to post questions about their homework, rather than ask their teachers or, god forbid, do a bit of thinking and research of their own. That's been going on for over a decade, and by now those children are college students who probably should never have gotten their high school degrees (or at least a larger than acceptable section of them).
    – jwenting
    Commented Dec 1, 2021 at 3:57
    All points are true for any language tag, unfortunately
    – Vega
    Commented Dec 1, 2021 at 12:41
    "Programming has become so easy" I don't know about that. It's just way easier to try and circumvent having to study and research (trying does not imply succeeding of course...). I couldn't way back when, I had to be stuck on a problem for hours and try to work it out myself. If I did have Stack Overflow back then... I wonder what I would have done.
    – Gimby
    Commented Dec 1, 2021 at 12:58
    And then there are the mid levels that answer "How to parse csv .." with "use pandas" - it is simple. And down you go into the ever shrinking rabbit hole Commented Dec 1, 2021 at 15:11
    @PatrickArtner yes, it's always annoying to see those answers. But tbh the python+csv tag is generally pretty poor quality, even discounting pandas :-( Commented Dec 1, 2021 at 16:51
    What does this all add up to? Continued ability to work as long as we want to! Rejoice! And have pity for the poor schmucks left over when we are all gone. Maybe it's a good thing our environmental legacy will put them all out of their misery. But seriously, we generally see the worst of things here, because the folks that'll really get the job done in the future generally aren't asking that many questions. Commented Dec 2, 2021 at 22:43
    "Big data for business" 13 lectures, no programming prerequisite. Teaches you Python, regression and deep learning. Commented Dec 3, 2021 at 3:20

I remember people raising concerns about question quality in the pandas tag before. The last time I checked it, it was indeed filled with very low-quality questions (not all beginner questions are like that, but these were), somewhat more than I was used to seeing in other tags, even popular ones.

It looked like a vicious circle of inferior quality questions hanging open for too long and because of that, getting a good chance to get answered, which in turn made tag visitors believe that it's okay to ask and answer like that, leading them to ask more of these inferior questions which were again hanging open and again getting answers and so on and so on.

Back then I even tried to somehow contribute to improving things by doing a bunch of close reviews filtered by pandas tag. This turned out to be a rather painful and fruitless effort and the main reason for this I think was a known issue of triage which blocks triaged questions from getting into the close queue for too long.

Thing is, it was not only me who noticed the low quality of these questions. The system also correctly identified these and pushed them to the triage queue, where they were hanging for many hours (up to a few days - go figure) while being blocked from getting timely closure in the close queue.

What I observed in the queue looked exactly like harm done by triage. To start with, I had to skip through multiple newer semi-decent questions that were simply "unlucky" enough to be of higher quality, which kept them from being protected by triage. You see, the close queue reasonably favors newer questions, but this reasoning breaks when triage comes into play and really blatantly close-worthy questions that have completed triage get lower priority because of their age.

Okay, skipping that much was annoying, but still a tolerable amount of effort and time, I can handle that. What really made me sad and what made me drop my attempt is when the queue eventually managed to get to older questions that went out of triage and I saw how many of them have answers (at that age and in such a popular tag, why wouldn't they really?).

You know, I generally don't mind closing answered questions - when it happens infrequently. FGITW is an old game and it's only natural that rep hunters sometimes manage to slip through, and I learned to live with that (after all, it's kind of the flipside to Stack Overflow being capable of providing reasonably quick answers to appropriate questions).

But seeing that many blatantly poor questions answered is a different thing. This made me feel like voting to close these is like peeing against the wind. Even if my votes help close 40-50 of these (mostly answered) questions, askers won't try to improve - they will just retry the same way that was proven to work for them and get answers.

Even if I keep doing this curation for weeks or months and maybe get lucky enough to have a few persistent askers of inappropriate (answered) questions banned, this won't help. At this point askers will know very well which way works for them to get answers, and they will just try to abuse the system to circumvent that ban with sock puppets, fraudulent voting and whatever else, making all my prior effort useless.

You see, the only thing I could do was to drop it and watch in despair how this swamp rots further and further.

Summing up, I think the issue with pandas is to a large extent made worse by a system that currently makes triage work in a direction opposite to its intended purpose - shielding inferior questions from getting closed in time and giving them good chances to get an answer.

    If memory serves, pandas was one of the tags I had in mind when suggesting this feature request, which asks to either add an option to filter triage by tags or allow triaged questions to be handled in the close queue
    – gnat
    Commented Dec 1, 2021 at 12:09
    Kudos for trying to bring order to that chaos! Commented Dec 1, 2021 at 12:12
    Keep up the valiant effort. But the best we can probably hope for is some subset of Stack Overflow questions on a separate domain to make using a search engine for research suck less. E.g., bestofstackoverflow.com. Perhaps even subdomains for the main tags for better search specificity (the "Related column" really hurts in this respect), like forth.bestofstackoverflow.com, dotnet.bestofstackoverflow.com, python.bestofstackoverflow.com, php.bestofstackoverflow.com, etc. Relatively simple heuristics could probably be used, along with some (AI-assisted) manual selection. Commented Dec 1, 2021 at 16:14
    'cont - Simple heuristics could be based on post age (an optimum post age (within a few years or months) could probably be determined for each tag or popular combinations of tags), view rate, vote rate, absolute number of votes, number of answers, length of answers (comprehensiveness), internal and external links in answers, the degree of broken English, number of comments on questions and answers, etc. Commented Dec 1, 2021 at 16:26

Python is a growing language. Depending on your survey of choice it has grown by 2-3x or more in the last five years. So based on that alone it would make sense that there are far more beginning questions in python.

Although I don't have concrete data to back this up, a common refrain is that this growth is due to a rise in the popularity of data science and machine learning. Both fields make heavy use of the pandas library. These are fields which do not have "creating software" as the primary goal. The goal is to build models, predictions, visualizations, and analytical papers.

The mindset of such developers is likely to be different. The nature of questions asked is thus likely to be different. The experience of such developers is likely to be different. Many of these developers don't want to program; programming is simply a tool to achieve a result.

"Is there some popular guide all of these beginners are following that is leading them to use this package when in most cases, a basic list would suffice?"

A lot of data science and machine learning operates at a scale where a basic list will not suffice. For example, if you want to calculate the standard deviation of 20 million floats using a list of 20 million Python objects, you are going to pay a heavy performance penalty. Pandas stores numeric arrays as dense contiguous vectors and pushes most analytical functions into C/C++. The performance difference for something as simple as calculating a standard deviation is going to be significant.
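To make the storage point concrete, here is a rough standard-library sketch. It does not use or measure pandas: `array.array("d")` stands in for the kind of dense, contiguous buffer described above, and the sample size and the helper name `list_footprint` are just assumptions for illustration. It only compares memory footprint and does not time anything.

```python
import math
import random
import statistics
import sys
from array import array


def list_footprint(values):
    """Bytes used by a list plus the separate float objects it points to."""
    return sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)


# Small sample so the sketch runs quickly; the scenario above is 20 million.
rng = random.Random(0)
as_list = [rng.random() for _ in range(10_000)]

# array("d") packs the same values as raw 8-byte doubles in one buffer.
as_dense = array("d", as_list)

list_bytes = list_footprint(as_list)
dense_bytes = sys.getsizeof(as_dense)  # size includes the packed buffer

# Same numbers in both containers, so the statistic itself is unchanged.
std_list = statistics.pstdev(as_list)
std_dense = statistics.pstdev(as_dense)

print(f"list of floats: {list_bytes:,} bytes")
print(f"dense array:    {dense_bytes:,} bytes")
print(f"std matches:    {math.isclose(std_list, std_dense)}")
```

On CPython the list version should come out several times larger, because every element is a full float object plus a pointer to it. With pandas, the computation would typically be a single call such as `pd.Series(values).std()`, which is where the C-level speed-up mentioned above comes from.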

    "So based on that alone it would make sense that there are far more beginning questions in python." Wasn't really a part of what I was asking
    – Libra
    Commented Dec 1, 2021 at 3:32
    Sorry, perhaps I could have worded it better. The circles of your Venn diagram are "a large number of new users", "users with a data science or machine learning background", and "users that rely on pandas for efficient numerical analysis". Those three circles have considerable overlap.
    – Pace
    Commented Dec 1, 2021 at 4:12

Short version: Pandas is the new jQuery.


Pandas handles data in such a cryptic and complicated way that only experts can handle it.

The documentation and its very short examples do not help with handling complex data structures.

So it is no wonder that today Pandas, CodeIgniter, Laravel, MySQL, SQL Server, PostgreSQL and so on, all systems that handle data, are heavily used, and that you quickly get to the point where you have to ask.

But it is normal that not all concepts can be grasped after a short time of learning, and so the questions pile up.

Since people can’t understand my point of view:

Pupils and students are forced to learn computer science regardless of their abilities. There are also those who want to learn but are not that gifted.

The former get bad teachers or professors with no time, and the latter get only bad videos and tutorials that cover just the basics. They still need to do their tasks and they fail, because the documentation is really, really bad. Stack Overflow helps with that somewhat, but because of the policy and downvotes, the questions will not be found. So they should ask a lot of questions.

    I don't think OP refers to people who don't quite get pandas but users who don't quite get the basics of Python - ifs, loops, etc.
    – VLAZ
    Commented Nov 30, 2021 at 15:51
    I answered some questions about pandas and Python: the questioner already noticed that they are often similar, and can't be found on SO or elsewhere. As I said, the documentation is bad and the problems go further than any documentation can handle, but the people behind pandas should extend their samples with much more complex examples in the hope that it will help others
    – nbk
    Commented Nov 30, 2021 at 18:54
    VLAZ is right, I was more so pointing out the correlation between the average skill level of people posting the question and the frequency that question is under the pandas tag.
    – Libra
    Commented Nov 30, 2021 at 21:07
    Even for long-time programmers, pandas is hard to grasp at first and its concepts are hard to understand, so it is no wonder that there are such questions. And as I pointed out, that is common to all data-related tools and programs
    – nbk
    Commented Nov 30, 2021 at 21:45
    Well, that's kind of irrelevant, since my question is about beginner programmers. I don't see anywhere near as many questions about pandas from programmers who seem to understand Python basics; that's the crux of the problem
    – Libra
    Commented Dec 1, 2021 at 1:54
    My answer makes a vital point that is relevant to all programming: the documentation has to change, and a vital part of that is adding more complex examples
    – nbk
    Commented Dec 1, 2021 at 8:28
    Yeah I think OP meant users who need to grasp the basics of python first before trying to ask about complex packages like pandas
    – user16612111
    Commented Dec 1, 2021 at 11:13
    But most questions are about simply moving data around in Pandas tables (duplicate beginner questions). A few canonical questions should be able to answer all such questions. Commented Dec 1, 2021 at 15:23
    As I said, most users can't understand the concepts and ideas behind a tool or language framework, and they lack a college or university education, where they would learn how to access such knowledge and get the basics. They can't understand the tutorials or documentation, which is why they post here in the hope that we will help them
    – nbk
    Commented Dec 1, 2021 at 15:35
    For the people downvoting, Pandas can be confusing and some parts of the API are pretty strange to get used to, ex. stackoverflow.com/questions/38886080/…
    – qwr
    Commented Dec 1, 2021 at 20:57

