10

I've been curious about why Python seems to be so popular amongst data scientists.

First, I needed to check that there was some truth to this assertion, so I wrote a query on the StackExchange data explorer:

https://data.stackexchange.com/stackoverflow/query/1555958/other-tags-on-python-posts

This shows common data processing and analysis tools in the list of tags also found on Python posts, such as numpy, matplotlib and tensorflow.

I also checked the other direction, finding other tags most common on posts tagged as 'data-science':

https://data.stackexchange.com/stackoverflow/query/1555978/other-tags-on-data-science-posts

Python is #1 on this list so I seem to be on to something.

Why and how has Python become so popular amongst data scientists?

  • 3
    There are almost two questions here. First, why has it always been common in scientific/engineering computing to provide a scripting/dynamic language for domain experts to use on top of the heavyweight libraries created by software engineers? I've seen the same pattern for my entire career since the late 80s; before Python achieved hegemony there were many other scripting languages, most of them in-house developments for particular systems. Second, why has Python, rather than companies rolling their own, become the most popular language for this in the last decade or so? Commented Feb 13, 2022 at 9:45
  • youtube.com/watch?v=4RSht_aV7AU is a talk on how scientific Python got started, by the author of SWIG. There are likely similar talks online by other major developers of the early systems (e.g. numarray/numeric/numpy, matplotlib, scipy). Commented Feb 19, 2022 at 6:44

4 Answers

11

I've been working in analysis / data science and leading data science teams for over a decade by now. Concerning the claim:

I find it a little counterintuitive that python would be so popular amongst data scientists, as I presume that people who are interested in the quality of data would also be interested in the quality of code, and a heavily dynamic language like Python doesn't lend itself to checking correctness.

It is possible to offer a number of reasons, but the key ones are concerned with ecosystem, learning curve and popularity i.e. "snowball factor".

Learning Curve

The challenge is that many data scientists come from a statistical/mathematical background with limited knowledge of programming, and learn the software development aspects of their roles on the job. In comparison to languages like Java or C++, Python doesn't have a steep learning curve, which makes the language particularly appealing from the perspective of professional development / training requirements.

Ecosystem

Valid points were made in this discussion concerning the rich ecosystem that the Python environment offers. When working on projects, the ability to leverage existing libraries to solve common data/analytical challenges is particularly appealing. In a commercial analytical setting, the biggest drain on time is development time, not computational effort. Arguably, the rich ecosystem that comes with Python shortens development time, making the language more appealing.

Exogenous factors

Data science managers have to consider external factors when creating analytical teams. Python is popular, which makes recruiting Python developers easy when compared to other languages. That is particularly appealing from a business continuity perspective. Building solutions on technology that is exceptionally good but unpopular can prove counterproductive.


Personal perspective

As a data science manager, I've led teams that worked in SAS, R, Python, Scala and a variety of other less popular solutions (say Stata, Matlab and so forth). I find it highly disputable whether one can argue that there is a best language as such. Haskell, with some beautiful concepts like monads, offers amazing possibilities. Haskell should be particularly appealing to companies working with large amounts of data; nevertheless, the number of jobs for Haskell developers is significantly smaller when compared to vacancies for Python developers. Julia proposed a number of very sensible paradigms that are promising, but it's not widely used in business. It is important to draw a distinction between the "technical quality" of a language and its business utility. Business utility is to a great degree determined by popularity, versatility, learning curve and a wider cost overhead.

Possibly your question could be rephrased: why did Python become popular in the first place? The rough answer could be that it is due to an appealing combination of simplicity and flexibility.

R vs. Python

In the UK, graduates with a statistical background are more frequently trained in R, whereas graduates trained in computer-related subjects tend to come with knowledge of Python. That divide translates into the industry, where consultancies focusing more on the statistical side of things tend to use R more often and businesses concerned more with the programming aspect of data science tend to gravitate towards Python. This is very much a soft distinction, as solutions like Matlab and SAS have relatively big deployment bases and the industry still has plenty of outfits that rely on those technologies.

Education

A postgraduate degree in data science lasts under two years. If you can teach people that in order to run a linear regression you have to type lm(y ~ x, data=df)¹, this is much more appealing than a common C++ implementation. The outcome of that approach is that data scientists come with a very rudimentary knowledge of programming techniques. The practicalities are that if the intention of a data science programme were to deliver traditional computer science training, covering the basics of memory management, algorithms and programming plus advanced statistical concepts, the programme would have to last ~5 years or more. That simply wouldn't work due to financial / market constraints. Universities choose not to focus too strongly on programming details and only use languages as vehicles for running statistical models.

This approach comes with a number of challenges. For instance, many graduates trained in R are unaware of the object-oriented programming capabilities available in R and struggle with computationally intensive tasks that experienced developers with a good understanding of common programming concepts can solve.


1 In R.

  • You are comparing a function that is available in some library with a complete implementation (which at first sight has some horrible weaknesses as well). Now the C++ implementation I can easily fix; the Python one I cannot see and check.
    – gnasher729
    Commented Feb 16, 2022 at 7:15
  • @gnasher729 Valid observation, I should probably expand my answer. My intention was to emphasise business factors that frequently induce managers to base their dev environments on a specific stack. Wise individuals will attempt to strike a balance between complexity and flexibility, but business pressures make this challenging. TBH, the same happens with the whole "low code / no code" dev environments that are bringing very little to the table but (sadly) sell well. The point I wanted to make is that a solution may be "technically subpar" but more appealing business-wise due to a number of factors.
    – Konrad
    Commented Feb 16, 2022 at 8:49
12

The simple answer is that Python has one of the two best ecosystems (along with R) of data science libraries. There is nothing with a similar breadth to scikit-learn in (say) Java or C#. Add to that the fact that all the deep learning frameworks have bindings to Python - although of course by this stage you're to some extent into self-reinforcement: data scientists use Python, so they want bindings in Python.

a heavily dynamic language like Python doesn't lend itself to checking correctness

I think this statement demonstrates a misunderstanding of how the Python data science ecosystem works; the vast majority of the actual "maths" is not written in pure Python (it would be far too slow), but is native code written in C, C++ or even Fortran. The Python layer is just a thin wrapper over the top which coordinates between the various libraries in play.
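The "thin wrapper over native code" pattern can be sketched with the standard library alone. This is a stand-in, not a benchmark of any data science library: it compares a loop executed as interpreted Python bytecode with the built-in `sum()`, which in CPython is implemented in C. The helper `python_loop_sum` and the input size are made up for illustration, and the timings will vary by machine.

```python
import timeit

values = list(range(1_000_000))


def python_loop_sum(numbers):
    """Add the numbers one at a time in interpreted Python bytecode."""
    total = 0
    for n in numbers:
        total += n
    return total


# Built-in sum() is implemented in C inside CPython, so its loop runs
# in native code; the Python layer only makes a single call.
loop_result = python_loop_sum(values)
native_result = sum(values)

loop_time = timeit.timeit(lambda: python_loop_sum(values), number=5)
native_time = timeit.timeit(lambda: sum(values), number=5)

print(f"pure-Python loop: {loop_time:.3f}s, built-in sum: {native_time:.3f}s")
```

Both calls return the same value; only where the loop executes differs. Libraries such as NumPy and SciPy apply the same idea to whole-array operations.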

(I'm trying hard not to get into a dynamic vs static flamewar here; let's just say it's perfectly possible to write code which checks correctness in Python)

  • Thanks! Do you know anything about how this came to be? I assume Python 1.0 didn't have the robust data science ecosystem, so presumably some people started developing them, but why choose Python at that point?
    – Ian Newson
    Commented Feb 12, 2022 at 22:03
  • 2
    @IanNewson Re, "why choose Python at that point?" If there wasn't any clear choice at that point then why not choose Python? I don't pretend to know anything about the history of data science libraries, but presumably, there was somebody who wrote something in Python, and then shared it with colleagues, and they shared it and improved it, and... A little bitty snowball rolling down a slope can turn into an avalanche. Why that snowball? Why not some other snowball? Maybe they were keen on Python because they wanted to get closer to somebody who was a member of a Python users' group. Commented Feb 12, 2022 at 22:35
  • 6
    @IanNewson SciPy started around 2001, Matplotlib in 2003, and NumPy brought a unified interface for array programming around 2005 (with NumPy, processing lots of numerical data is fast and memory-efficient, unlike with normal Python data structures). The rest of the Python data science ecosystem, such as Pandas and scikit-learn, coalesced around that in the following years. The Python stack was more powerful than that of competing scripting languages (e.g. Perl/PDL), much easier to use than C++ (though CERN's ROOT still has its niche in physics), cheaper than Matlab, less weird than R.
    – amon
    Commented Feb 12, 2022 at 23:07
  • 6
    @IanNewson I think you are massively underestimating how much easier Python is to use for data scientists than Java. No need to mess around with javac, JAR files, getting your classpath right and whatever else. I can write 20 lines of code in Python and it just works - in Java that would be 100 lines of boilerplate and hours of trying to get it to find its packages for the same functionality. Commented Feb 12, 2022 at 23:37
  • 7
    @IanNewson Java has two crucial problems: First, it's not a scripting language. Simple tasks like reading a file are much more difficult. Interactive, exploratory work styles are also more difficult with Java. In contrast, Python had IPython and now Jupyter. If you haven't tried Jupyter, I highly recommend the experience. Second, Java has its own virtual machine, which makes integration with high-performance numerical libraries more difficult. Large parts of SciPy are just bindings to highly optimized C or Fortran code that will easily outperform pure-Java implementations.
    – amon
    Commented Feb 12, 2022 at 23:40
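The "simple tasks like reading a file" point from the comments above can be illustrated with a short script. This is only a sketch under stated assumptions: the file name `measurements.csv`, its columns, and its values are made up, and the file is written to a temporary directory first purely so the example is self-contained.

```python
import csv
import statistics
import tempfile
from pathlib import Path

# Hypothetical sample file, created here only so the sketch is self-contained.
workdir = Path(tempfile.mkdtemp())
csv_path = workdir / "measurements.csv"
csv_path.write_text("sensor,reading\na,1.5\nb,2.5\nc,3.5\n")

# The actual "script": read the file and summarise one column.
with csv_path.open(newline="") as handle:
    readings = [float(row["reading"]) for row in csv.DictReader(handle)]

mean_reading = statistics.mean(readings)
print(f"{len(readings)} readings, mean = {mean_reading}")
```

There is no build step, class declaration, or classpath involved; the same lines can also be pasted into an interactive session or a Jupyter cell and run one at a time.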
5

I think it is key here to distinguish data scientists from data engineers. A data scientist's role is to take a dataset and work out how to derive interesting and valuable results from it. They are not primarily interested in taking the algorithm that derives those results and building a product that uses it.

Since a data scientist is primarily interested in the algorithm, the code can be considered a disposable prototype constructed for their own use, and only needs to be correct within the small scale of initial development. It does not need to be built so that other people can easily understand it and modify it many times in the future without risk of breaking it. This significantly changes the calculus in favor of tools that can be tried out manually with low initial overhead (like scripting languages), rather than traditional software development processes.

If you start looking at tooling for data engineering and large-scale, product-level data processing, you may find Scala and Java are much bigger players.

1

You say

I find it a little counterintuitive that python would be so popular amongst data scientists, as I presume that people who are interested in the quality of data would also be interested in the quality of code

But the quality of the data is not related to the quality of the code in any way. Performance, readability, maintainability: none of that is what the data scientist is looking for.

This is similar to how Perl got a reputation for being difficult to work with. The fact of the matter is, in both cases, in normal usage, the code is really unimportant and probably entirely transient. What matters is the output.

I don’t know Python and so haven’t written any, but I have written Perl code that was never even saved: it was just run in the IDE and thrown away when the window was closed without saving.

I don’t think this reflects badly on me as a professional programmer, or my dedication to code quality.

  • This is basically the answer I came here to write. Back in the 1980s, there was a strong focus on modelling schemes which could mathematically prove the behaviour of code, in the belief that coding errors were the biggest problem. They died without getting much traction, and all adherents like Jackson and Yourdon have long recognised they basically wasted everyone's time, because this is simply not true. The biggest problem is thinking about what to do, not the details of how to do it, because you can test the latter but not the former.
    – Graham
    Commented Feb 13, 2022 at 10:28
  • 1
    @Graham, I agree. "Thinking about what to do", and being enormously precise about what is done, and predetermining the outcome of every possible decision in every possible circumstance, is what makes the programming task difficult. Only a modest amount of error is made in the transcription from mind to paper. The main errors are all in the quality of the conception, before that conception has even left the programmer's mind. And the criteria for declaring it to be an erroneous conception largely exist only in the minds of others.
    – Steve
    Commented Feb 13, 2022 at 13:17
