197
$\begingroup$

Having recently graduated from my PhD program in statistics, I have spent the last couple of months searching for work in the field of statistics. Almost every company I considered had a job posting with the title "Data Scientist". In fact, it felt like the days of seeing job titles such as Statistical Scientist or Statistician were long gone. Had being a data scientist really replaced being a statistician, or were the titles synonymous, I wondered?

Well, most of the qualifications for the jobs felt like things that would qualify under the title of statistician. Most jobs wanted a PhD in statistics ($\checkmark$), and most required an understanding of experimental design ($\checkmark$), linear regression and ANOVA ($\checkmark$), generalized linear models ($\checkmark$), and other multivariate methods such as PCA ($\checkmark$), as well as knowledge of a statistical computing environment such as R or SAS ($\checkmark$). It sounds like "data scientist" is really just a code name for statistician.

However, every interview I went to started with the question: "So are you familiar with machine learning algorithms?" More often than not, I found myself trying to answer questions about big data, high performance computing, and topics such as neural networks, CART, support vector machines, boosted trees, unsupervised models, etc. Sure, I convinced myself that these were all statistical questions at heart, but at the end of every interview I couldn't help leaving with the feeling that I knew less and less about what a data scientist is.

I am a statistician, but am I a data scientist? I work on scientific problems so I must be a scientist! And also I work with data, so I must be a data scientist! And according to Wikipedia, most academics would agree with me (https://en.wikipedia.org/wiki/Data_science, etc.):

Although use of the term "data science" has exploded in business environments, many academics and journalists see no distinction between data science and statistics.

But if I am going on all these job interviews for a data scientist position, why does it feel like they are never asking me statistical questions?

Well, after my last interview I did what any good scientist would do and sought out data to solve this problem (hey, I am a data scientist after all). However, countless Google searches later, I ended up right where I started, once again grappling with the definition of a data scientist. I didn't know exactly what a data scientist was, since there were so many definitions of it (http://blog.udacity.com/2014/11/data-science-job-skills.html, http://www-01.ibm.com/software/data/infosphere/data-scientist/), but it seemed like everyone was telling me I wanted to be one:

Well at the end of the day, what I figured out was "what is a data scientist" is a very hard question to answer. Heck, there were two entire months in Amstat where they devoted time to trying to answer this question:

Well, for now I have to be a sexy statistician to be a data scientist, but hopefully the Cross Validated community can shed some light and help me understand what it means to be a data scientist. Aren't all statisticians data scientists?


(Edit/Update)

I thought this might spice up the conversation. I just received an email from the American Statistical Association about a job posting with Microsoft looking for a Data Scientist. Here is the link: Data Scientist Position. I think this is interesting because the role hits on a lot of the specific traits we have been talking about, but I think many of them require a very rigorous background in statistics, and they contradict many of the answers posted below. In case the link goes dead, here are the qualities Microsoft seeks in a data scientist:

Core Job Requirements and Skills:

Business Domain Experience using Analytics

  • Must have experience across several relevant business domains in the utilization of critical thinking skills to conceptualize complex business problems and their solutions using advanced analytics in large scale real-world business data sets
  • The candidate must be able to independently run analytic projects and help our internal clients understand the findings and translate them into action to benefit their business.

Predictive Modeling

  • Experience across industries in predictive modeling
  • Business problem definition and conceptual modeling with the client to elicit important relationships and to define the system scope

Statistics/Econometrics

  • Exploratory data analytics for continuous and categorical data
  • Specification and estimation of structural model equations for enterprise and consumer behavior, production cost, factor demand, discrete choice, and other technology relationships as needed
  • Advanced statistical techniques to analyze continuous and categorical data
  • Time series analysis and implementation of forecasting models
  • Knowledge and experience in working with multiple variables problems
  • Ability to assess model correctness and conduct diagnostic tests
  • Capability to interpret statistics or economic models
  • Knowledge and experience in building discrete event simulation, and dynamic simulation models

Data Management

  • Familiarity with use of T-SQL and analytics for data transformation and the application of exploratory data analysis techniques for very large real-world data sets
  • Attention to data integrity including data redundancy, data accuracy, abnormal or extreme values, data interactions and missing values.

Communication and Collaboration Skills

  • Work independently and able to work with a virtual project team that will research innovative solutions to challenging business problems
  • Collaborate with partners, apply critical thinking skills, and drive analytic projects end-to-end
  • Superior communication skills, both verbal and written
  • Visualization of analytic results in a form that is consumable by a diverse set of stakeholders

Software Packages

  • Advanced Statistical/Econometric software packages: Python, R, JMP, SAS, Eviews, SAS Enterprise Miner
  • Data exploration, visualization, and management: T-SQL, Excel, PowerBI, and equivalent tools

Qualifications:

  • Minimum 5+ years of related experience required
  • Post graduate degree in quantitative field is desirable.
$\endgroup$
44
  • 6
    $\begingroup$ Nice question! I have been wondering about this quite a lot lately. In my eyes it seems that jobs that include data scientist in the description are looking for people that can apply statistical/ML methods that scale well, not necessarily people that can deal with theory. I still think that there is some redundancy in these job descriptions. Requiring a PhD is probably often an overqualification and the HR people that make these job descriptions are heavily influenced by the buzz around big-data. Is a data scientist a statistician or vice versa is the main question I want to see answered. $\endgroup$
    – Gumeo
    Commented Feb 11, 2016 at 8:54
  • 4
    $\begingroup$ I think this is an excellent paper that kind of addresses this shift in cultures of being a statistician versus being a data scientist: projecteuclid.org/download/pdf_1/euclid.ss/1009213726 $\endgroup$ Commented Feb 11, 2016 at 9:08
  • 7
    $\begingroup$ "But if I am going on all these job interviews for a data scientist position, why does it feel like they are never asking me statistical questions"...story of my life...literally LOL!!! I think data science, statistics, econometrics, biostat,..etc. have considerable overlap but they all use different jargon which makes communication difficult (especially when you are being interviewed by an HR person who isn't knowledgeable and focuses on key words). Hopefully increased inter-disciplinary efforts and some much needed open-mindedness will change this in the future. $\endgroup$ Commented Feb 11, 2016 at 9:43
  • 9
    $\begingroup$ I've followed the "rise of the data scientist" ever since it became mainstream in about 2008. To me it was and is mostly a marketing term fuelling a hype - the disciplines statistics, machine learning, data engineering, data analysis all are the same with different emphasis. Paraphrasing G. Box: If asked questions such as "Are you a Bayesian, frequentist, data analyst, designer of experiments, data scientist?" Say "yes". $\endgroup$
    – Momo
    Commented Feb 11, 2016 at 11:37
  • 12
    $\begingroup$ @Momo: Nevertheless, if one opens one of the 600+ page textbooks called "Machine learning" (or similar) and one of the textbooks called "Statistics" (or similar), there will be very little overlap. My Bishop's Pattern Recognition and Machine Learning or Murphy's Machine Learning have almost zero intersection with Lehmann & Casella Theory of Point Estimation, Casella & Berger Statistical Inference, or Maxwell & Delaney Designing Experiments and Analyzing Data. They are so different that I think people familiar with one set of books might have trouble reading the other. $\endgroup$
    – amoeba
    Commented Feb 11, 2016 at 11:43

13 Answers

63
$\begingroup$

There are a few humorous definitions which were not yet given:

Data Scientist: Someone who does statistics on a Mac.

I like this one, as it plays nicely on the more-hype-than-substance angle.

Data Scientist: A Statistician who lives in San Francisco.

Similarly, this riffs on the West Coast flavour of all this.

Personally, I find the discussion (in general, and here) somewhat boring and repetitive. When I was thinking about what I wanted to do (maybe a quarter century or longer ago), I aimed for quantitative analyst. That is still what I do (and love!) and it mostly overlaps and covers what was given here in various answers.

(Note: There is an older source for quote two but I can't find it right now.)

$\endgroup$
7
  • 30
    $\begingroup$ +1. I find the discussion (in general, and here) somewhat boring and repetitive and vain talk of trifles or new buzzling words, I would add. I still can't differentiate afterwards between data scientists, christian scientists, and data scientologists. $\endgroup$
    – ttnphns
    Commented Feb 15, 2016 at 10:49
  • 1
    $\begingroup$ LOL @ data scientologists. $\endgroup$
    – dsaxton
    Commented Feb 15, 2016 at 14:17
  • 4
    $\begingroup$ And I tip my hat to the (of course anonymous) Very Serious Person who just came by, downvoted and didn't leave a reason. Hint: That ain't how the discussion improves. $\endgroup$ Commented Feb 15, 2016 at 14:44
  • 1
    $\begingroup$ Being a statistician in South San Francisco who is very actively fighting the title Data Scientist, the second definition hits too close to home (but I wasn't the downvoter). $\endgroup$
    – Cliff AB
    Commented Feb 15, 2016 at 19:11
  • 1
    $\begingroup$ (+1) @CliffAB I am actually a statistician in South San Francisco too. $\endgroup$ Commented Feb 16, 2016 at 1:28
91
$\begingroup$

People define Data Science differently, but I think that the common part is:

  • practical knowledge how to deal with data,
  • practical programming skills.

Contrary to its name, it's rarely "science". That is, in data science the emphasis is on practical results (as in engineering), not on the proofs, mathematical purity or rigor characteristic of academic science. Things need to work, and it makes little difference whether the solution is based on an academic paper, an existing library, your own code or an impromptu hack.

A statistician is not necessarily a programmer (they may use pen and paper and dedicated software). Also, some job postings in data science have nothing to do with statistics. E.g. the work is data engineering such as processing big data, even if the most advanced maths involved is calculating an average (personally I wouldn't call this activity "data science", though). Moreover, "data science" is hyped, so tangentially related jobs use this title, to lure applicants or to boost the ego of current workers.

I like the taxonomy from Michael Hochster's answer on Quora:

Type A Data Scientist: The A is for Analysis. This type is primarily concerned with making sense of data or working with it in a fairly static way. The Type A Data Scientist is very similar to a statistician (and may be one) but knows all the practical details of working with data that aren’t taught in the statistics curriculum: data cleaning, methods for dealing with very large data sets, visualization, deep knowledge of a particular domain, writing well about data, and so on.

Type B Data Scientist: The B is for Building. Type B Data Scientists share some statistical background with Type A, but they are also very strong coders and may be trained software engineers. The Type B Data Scientist is mainly interested in using data “in production.” They build models which interact with users, often serving recommendations (products, people you may know, ads, movies, search results).

In that sense, a Type A Data Scientist is a statistician who can program. But even for the quantitative part, there may be people with a background more in computer science (e.g. machine learning) than in regular statistics, or people focusing on, e.g., data visualization.

And The Data Science Venn Diagram (here: hacking ~ programming):

The Data Science Venn Diagram

See also alternative Venn diagrams (this and that). Or even a tweet which, while humorous, shows a balanced list of typical skills and activities of a data scientist:

a data scientist should be able to

See also this post: Data scientist - statistician, programmer, consultant and visualizer?.

$\endgroup$
12
  • 16
    $\begingroup$ I like the tweet. I'd add that he should also know how to bake pizza, grow own ecological vegetables, write poetry and dance salsa :) $\endgroup$
    – Tim
    Commented Feb 11, 2016 at 10:14
  • 4
    $\begingroup$ Minor quibble: not all "sciences" have emphasis on "proofs or mathematical purity". Think e.g. biology. $\endgroup$
    – amoeba
    Commented Feb 11, 2016 at 11:20
  • 2
    $\begingroup$ What does it mean to hack a p-value? It seems to me that someone (aka the client) has a specified p-value target and the data scientist is supposed to cut and dice the data so that the p-value target can be achieved. Or is it supposed to mean something different? $\endgroup$
    – emory
    Commented Feb 11, 2016 at 20:35
  • 2
    $\begingroup$ @emory This tweet is humorous (it's a pastiche of a paragraph from en.wikiquote.org/wiki/Time_Enough_for_Love, "A human being should be able to [list]. Specialization is for insects."). "Hack a p-value" is certainly a dark practice (sadly, prevalent in some academic disciplines), and (I hope) is here as a joke. $\endgroup$ Commented Feb 11, 2016 at 22:08
  • 4
    $\begingroup$ +1 for the remark about not calling someone a Data Scientist who calculates simplistic "statistics" on enormous datasets. I think we're moving out of a phase in Data Science where Computer Scientists who specialized in cluster computing (Hadoop, etc) were labeled "Data Scientists". I'm not looking down on those skills, but they aren't nearly as important as statistical/reasoning/investigation skills and the technology is moving beyond map-reduce. $\endgroup$
    – Wayne
    Commented Feb 14, 2016 at 15:17
43
$\begingroup$

There are a number of surveys of the data science field. I like this one, because it attempts to analyze the profiles of people who actually hold data science jobs. Instead of relying on anecdotal evidence or the author's biases, they use data science techniques to analyze data scientist DNA.

It's quite revealing to look at the skills listed by data scientists. Notice the top 20 skills contain a lot of IT skills.

In today’s world, a data scientist is expected to be a jack of all trades; a self-learner who has a solid quantitative foundation, an aptitude for programming, infinite intellectual curiosity, and great communication skills.

[Image: chart of the top 20 skills listed by data scientists]

UPDATE:

I am a statistician, but am I a data scientist? I work on scientific problems so I must be a scientist!

If you did a PhD, you're most likely a scientist already, especially if you have published papers and are doing active research. You don't need to be a scientist to be a data scientist, though. There are some roles at some firms, like Walmart (see below), where a PhD is required, but usually data scientists have BS and MS degrees, as you can see from the examples below.

As you can figure from the chart above, you'll most likely be required to have good programming and data handling skills. Also, data science is often associated with some level, often "deep", of expertise in machine learning. You certainly may call yourself a data scientist if you have a PhD in stats. However, computer science PhDs from top schools may be more competitive than stats graduates, because they may have quite strong applied statistics knowledge supplemented by strong programming skills, a combination sought after by employers. To counter them you have to acquire strong programming skills, so that on balance you'll be very competitive. What's interesting is that usually stats PhDs will have some programming experience, but in data science the requirement is often much higher than that: employers want advanced skills, knowledge of algorithms and data structures, distributed computing, etc.

To me the advantage of having a PhD in stats lies in the problem captured by the rest of the phrase "a jack of all trades" that is usually dropped: "a master of none". It's good to have people who know a little bit of everything, but I always look for folks who know something deeply too; whether it's stats or computer science is not so important. What matters is that the guy is capable of getting to the bottom of things; it's a handy quality when you need it.

The survey also lists the top employers of data scientists. Microsoft is at the top, apparently, which was surprising to me. If you want to get a better idea of what they're looking for, searching LinkedIn for "data science" in the Jobs section is helpful. Below are two excerpts from Microsoft's and Walmart's job postings on LinkedIn to make the point.

  • Microsoft Data Scientist

    • 5+ years of Software Development experience in building Data Processing Systems/Services
    • Bachelors or higher qualifications in Computer Science, EE, or Math with specialization in Statistics, Data Mining or Machine Learning.
    • Excellent Programming Skills (C#, Java, Python, Etc.) in manipulating large scale data
    • Working knowledge of Hadoop or other Big Data processing technology
    • Knowledge of analytics products (e.g. R, SQL AS, SAS, Mahout, etc.) is a plus.

Notice how knowing stat packages is just a plus, but excellent programming skills in a language such as Java are a requirement.

  • Walmart, Data Scientist

    • PhD in computer science or similar field or MS with at least 2-5 years of related experience
    • Good functional coding skills in C++ or Java (Java is highly preferred)
    • must be capable of spending up to 10% daily work day in writing production code in either C++/Java/Hadoop/Hive
    • Expert level knowledge of one of the scripting languages such as Python or Perl.
    • Experience working with large data sets and distributed computing tools a plus (Map/Reduce, Hadoop, Hive, Spark etc.)

Here, a PhD is preferred, but only the computer science major is named. Distributed computing with Hadoop or Spark is probably an unusual skill for a statistician, but some theoretical physicists and applied mathematicians use similar tools.

UPDATE 2:

"It’s Already Time to Kill the “Data Scientist” Title", says Thomas Davenport, who co-wrote the 2012 Harvard Business Review article titled "Data Scientist: The Sexiest Job of the 21st Century" that sort of started the data scientist craze:

What does it mean today to say you are, or want to be, or want to hire, a “data scientist?” Not much, unfortunately.

$\endgroup$
6
  • 3
    $\begingroup$ +1 for using data and linking to a nice data-driven report. But does the screenshot need a web browser interface? $\endgroup$ Commented Feb 12, 2016 at 10:40
  • $\begingroup$ @PiotrMigdal, I should learn to crop or stop being lazy $\endgroup$
    – Aksakal
    Commented Feb 12, 2016 at 14:19
  • 4
    $\begingroup$ I cropped it for you. $\endgroup$
    – amoeba
    Commented Feb 14, 2016 at 20:09
  • 1
    $\begingroup$ I am tempted to downvote after today's update: this thread is already very busy and having a gigantic wall of citations to scroll down is not very helpful in my opinion... Perhaps the links + brief summary could suffice? $\endgroup$
    – amoeba
    Commented Feb 15, 2016 at 19:44
  • 1
    $\begingroup$ @amoeba, I stripped down the list. It's a fair comment $\endgroup$
    – Aksakal
    Commented Feb 16, 2016 at 0:25
41
$\begingroup$

Somewhere I've read this (EDIT: Josh Wills explaining his tweet):

Data scientist is a person who is better at statistics than any programmer and better at programming than any statistician.

This quote can be briefly explained by this data science process. At first glance the scheme prompts the question "well, where is the programming part?", but if you have tons of data you have to be able to process it.

$\endgroup$
10
  • 11
    $\begingroup$ So probably every R contributor that is a statistician is a data scientist? ;) $\endgroup$
    – Tim
    Commented Feb 12, 2016 at 10:41
  • 15
    $\begingroup$ Wow, I was just strolling the site, wondering about this question (given that there is datascience) and then in passing learn that I have a friggin' Wikipedia page? That was news to me... And for what it is worth I trained in Econometrics, not statistics, but have worked as a 'quant' for 20+ years. That is effectively the same as data science... $\endgroup$ Commented Feb 14, 2016 at 3:05
  • 3
    $\begingroup$ -1. I downvote not because I don't like the quote (it was most probably tongue in cheek anyway), but because the answer is too brief and unsubstantial, in particular compared to many other answers here. I would suggest it is converted into a comment, unless perhaps you expand it somehow. $\endgroup$
    – amoeba
    Commented Feb 14, 2016 at 20:06
  • 3
    $\begingroup$ Here is an explanation of this quote by its author Josh Wills. The first three paragraphs after the quote are quite pertinent to this discussion. $\endgroup$
    – amoeba
    Commented Feb 15, 2016 at 16:43
  • 3
    $\begingroup$ @amoeba: I liked Josh Wills' article up until this point: "I suspect that we teach people advanced statistics in a way that tends to scare off computer scientists by focusing on parametric models that require a lot of calculus instead of non-parametric models that are primarily computational". Also, I do disagree with him that it's easier to teach advanced statistics to CS people than how to program well to statisticians (although I certainly agree that most statisticians are terrible programmers). $\endgroup$
    – Cliff AB
    Commented Feb 15, 2016 at 19:24
16
+400
$\begingroup$

I've written several answers and each time they got long and I eventually decided I was getting up on a soapbox. But I think that this conversation has not fully explored two important factors:

  1. The Science in Data Science. A scientific approach is one in which you try to destroy your own models, theories, features, technique choices, etc, and only when you cannot do so do you accept that your results might be useful. It's a mindset and many of the best Data Scientists I've met have hard-science backgrounds (chemistry, biology, engineering).

  2. Data Science is a broad field. A good Data Science outcome usually involves a small team of Data Scientists, each with their own speciality. For example, one team member is more rigorous and statistical, another is a better programmer with an engineering background, and another is a strong consultant with business savvy. All three are quick to learn the subject matter, and all three are curious and want to find the truth -- however painful -- and to do what's in the best interest of the (internal or external) customer, even if the customer doesn't understand.

The fad over the last few years -- now fading, I think -- is to recruit Computer Scientists who have mastered cluster technologies (Hadoop ecosystem, etc) and say that's the ideal Data Scientist. I think that's what the OP has encountered, and I'd advise the OP to push their strengths in rigor, correctness, and scientific thinking.

$\endgroup$
5
  • $\begingroup$ @RustyStatistician: You're welcome. I'd add that the consultancy I work for has PhDs (engineering, biology, astronomy, computer science), but in general views MS degrees -- often people with work experience who go back for an MS in Analytics -- as the sweet spot. That said, I'm thankful every day for my biology PhD coworker who is currently on a project where I'm the tech lead. Along with the project lead who has an Economics background (and an MS in Analytics), we're a great team! (My MS is in Artificial Intelligence.) $\endgroup$
    – Wayne
    Commented Feb 14, 2016 at 21:28
  • $\begingroup$ +1, but I am wondering about your first bullet point saying that [good] data science is a science. If so, it is a curious and perhaps a misleading (?) term because "data science" is not studying "data" in itself; it is using data to study something else, whatever is of interest in a given application. In contrast, e.g. "political science" is supposed to study politics and "neuroscience" is studying neurons, as the names suggest. $\endgroup$
    – amoeba
    Commented Feb 14, 2016 at 23:58
  • 1
    $\begingroup$ @amoeba: Actually, I meant that a Data Scientist must use the scientific method ala Richard Feynman as a part of how they understand and use data. (As you say, in pursuit of a particular application.) It's the statistician part of the job: "This variable seems highly significant -- is it a leak from the future?" Or "This model seems to be reasonable, but let's run CV on the entire model-making process, and then let's do some resampling on top of that." It's trying hard to disprove your model/theory and involving others in doing so. Not accepting "Green M&Ms cause cancer". $\endgroup$
    – Wayne
    Commented Feb 15, 2016 at 0:13
  • $\begingroup$ @Wayne is the only one who has mentioned the "scientific method" so far. This is so sad. $\endgroup$
    – jgomo3
    Commented Mar 14, 2016 at 17:15
  • $\begingroup$ An understanding of physics, especially units, is necessary for anyone trying to make sense of anything. However, in this brave new world of ours it is often enough to make heuristic observations that have sub-optimal predictive value as "gob-stoppers," but are not real solutions. $\endgroup$
    – Carl
    Commented Feb 17, 2018 at 17:34
15
$\begingroup$

I think Bitwise covers most of my answer but I am gonna add my 2c.

No, I am sorry but a statistician is not a data scientist, at least based on how most companies define the role today. Note that the definition has changed over time, and one challenge of the practitioners is to make sure they remain relevant.

I will share some common reasons on why we reject candidates for "Data Scientist" roles:

  • Expectations about the scope of the job. Typically the DS needs to be able to work independently. That means there's nobody else to create the dataset for him in order to solve the problem he was assigned. So, he needs to be able to find the data sources, query them, model a solution and then, often, also create a prototype that solves the problem. Many times that is simply the creation of a dashboard, an alarm, or a live report that constantly updates.
  • Communication. It seems that many statisticians have a hard time "simplifying" and "selling" their ideas to business people. Can you show just one graph and tell a story from the data in a way that everybody in the room can get it? Note that this comes after you have made sure that you can defend every bit of the analysis if challenged.
  • Coding skills. We don't need production level coding skills, since we have developers for that; however, we need her to be able to write a prototype and deploy it as a web service in an AWS EC2 instance. So, coding skills don't mean the ability to write R scripts. I can probably add fluency in Linux somewhere here. So, the bar is simply higher than what most statisticians tend to believe.
  • SQL and databases. No, he can't pick that up on the job, since we actually need him to adapt the basic SQL he already knows and learn how to query the multiple different DB systems we use across the org, including Redshift, HIVE and Presto, each of which uses its own flavour of SQL. Plus, learning SQL on the job means the candidate will create problems for every other analyst until they learn how to write efficient queries.
  • Machine Learning. Typically they have used Logistic Regression or a few other techniques to solve a problem based on a given dataset (Kaggle style). However, even though the interview starts with algorithms and methods, it soon focuses on topics such as feature generation (remember you need to create the dataset, there's nobody else to create it for you), maintainability, scalability and performance, as well as the related trade-offs. For some context you can check out a relevant paper from Google published in NIPS 2015.
  • Text Analysis. Not a must-have, but some experience in Natural Language Processing is good to have. After all, a big portion of the data is in textual format. As discussed, there's nobody else to make the transformations and clean up the text for you in order to make it consumable by an ML or other statistical approach. Also, note that today even CS grads have already done some project that ticks this box.

Of course, for a junior role you can't have all of the above. But how many of these skills can you afford to be missing and pick up on the job?

Finally, to clarify, the most common reason for rejecting non-statisticians is exactly the lack of even basic knowledge of stats. And somewhere there is the difference between a data engineer and a data scientist. Nevertheless, data engineers tend to apply for these roles, since many times they believe that "statistics" is just the average, the variance and the normal distribution. So, we may add a few relevant but scary statistical buzzwords in job descriptions in order to clarify what we mean by "statistics" and prevent the confusion.

$\endgroup$
2
  • 4
    $\begingroup$ Since 2006 I have taught applied statistics and data analysis courses in programs called "business informatics" at two universities, and this applies 100% to what my students learn. 1. They need to collect real, perhaps messy data from their business, the web, surveys, etc. 2. Clean, prepare and store the data in an SQL database for the course. 3. Do various statistical analyses on the data. 4. Prepare short 1-2 page executive briefs and write an in-depth report with literate programming (knitr or the like). From that, data science is business informatics with an additional statistics/ML course, no? $\endgroup$
    – Momo
    Commented Feb 12, 2016 at 13:35
  • 4
    $\begingroup$ Sure, your course covers many of the required skills. I suppose we can find many combinations, e.g., a Computer Science degree with some stats courses and a thesis/internship on a business ML-based problem. At the end of the day, what matters is the depth and breadth of the relevant skills the candidate brings to the table. $\endgroup$
    – iliasfl
    Commented Feb 12, 2016 at 14:04
11
$\begingroup$

Allow me to ignore the hype and buzzwords. I think "Data Scientist" (or whatever you want to call it) is a real thing and that it is distinct from a statistician. There are many types of positions that are effectively data scientists but are not given that name; one example is people working in genomics.

The way I see it, a data scientist is someone who has the skills and expertise to design and execute research on large amounts of complex data (e.g. high-dimensional data in which the underlying mechanisms are unknown and complex).

This means:

  • Programming: Being able to implement analysis and pipelines, often requiring some level of parallelization and interfacing with databases and high-performance computing resources.
  • Computer Science (algorithms): Designing/choosing efficient algorithms so that the chosen analysis is feasible and the error rate is controlled. Sometimes this may also require knowledge of numerical analysis, optimization, etc.
  • Computer science / statistics (usually emphasis on machine learning): Designing and implementing a framework in order to ask questions of the data or find "patterns" in it. This would include not only knowledge of different tests/tools/algorithms but also how to design proper holdout sets, cross-validation schemes, and so on.
  • Modelling: Often we would like to be able to produce some model that gives a simpler representation of the data such that we can both make useful predictions and gain insight into the mechanisms underlying the data. Probabilistic models are very popular for this.
  • Domain-specific expertise: One key aspect of successfully working with complex data is incorporating domain-specific insight. So I would say that it is critical that the data scientist either has expertise in the domain, can quickly learn new fields, or can interface well with experts in the field who can offer useful insights about how to approach the data.
$\endgroup$
13
  • 6
    $\begingroup$ And who is a statistician, in your opinion? How is this list of skills different from the skills that a "statistician" should have? $\endgroup$
    – amoeba
    Commented Feb 11, 2016 at 18:30
  • 4
    $\begingroup$ @amoeba I may be wrong, but many statisticians do not have some of these skills (e.g. extensive programming with massive datasets, graduate-level training in computer science). Also, some statistical skills are often irrelevant for a data scientist (some of the theory, some sub-fields). $\endgroup$
    – Bitwise
    Commented Feb 11, 2016 at 18:38
  • 4
    $\begingroup$ @rocinante: I strongly disagree that "programming with 'massive datasets' is not really a hinderance". I don't think I know anyone with the title "statistician" who could implement software that makes real-time decisions based on incoming packets on a server. Certainly not all data scientists could either, but the proportion is much higher. $\endgroup$
    – Cliff AB
    Commented Feb 12, 2016 at 1:23
  • 3
    $\begingroup$ @rocinante a good understanding of statistics is necessary but not sufficient in my view. Regarding the profoundness/difficulty of stats vs. other skills, I would argue that obtaining a good understanding of the computer science side is as profound/difficult, if not more so. Also, regarding the questions on that SE, you find those kinds of questions on any SE (including this one) - it doesn't mean anything except that some people want easy solutions without understanding. $\endgroup$
    – Bitwise
    Commented Feb 12, 2016 at 13:10
  • 7
    $\begingroup$ The one thing that gets tiring in these "data science vs. statistics" debates is the subtle implication that data scientists are like a superior breed of statistician. The fact is that as the breadth of your knowledge increases the depth goes down, and of the people who are better than clueless at all the tasks necessary to be a "data scientist," I would imagine their knowledge of most of these things to be pretty superficial. In general it is extremely difficult to even come close to being expert in any of the domains people expect these mythical data scientists to have mastered. $\endgroup$
    – dsaxton
    Commented Feb 15, 2016 at 1:59
7
$\begingroup$

All great answers. However, in my job-hunting experience I have noted that the term "data scientist" has been confounded with "junior data analyst" in the minds of the recruiters that I was in contact with. Thus many nice folks with no statistics experience apart from that introductory one-term course they did a couple of years ago now call themselves data scientists. As someone with a computer science background and years of experience as a data analyst, I did a PhD in Statistics later in my career, thinking it would help me stand out from the crowd; instead, I find myself in an unexpectedly large crowd of "data scientists". I think that I might revert to "statistician"!

$\endgroup$
1
  • 5
    $\begingroup$ I basically see the same thing. Any job that requires some work with data or some analysis is called "Data Science". I think a very similar thing happened to "Quant" in finance, where anybody who did some work with data was calling themselves a "Quant". $\endgroup$
    – Akavall
    Commented Feb 14, 2016 at 16:23
7
$\begingroup$

I'm a junior employee, but my job title is "data scientist." I think Bitwise's answer is an apt description of what I was hired to do, but I'd like to add one more point based on my day-to-day experience at work:

$$\text{Data Science} \neq \text{Statistics},$$ $$\text{Statistics} \in \text{Data Science}.$$

Science is a process of inquiry. When data is the means by which that inquiry is made, data science is happening. It doesn't mean that everyone who experiments or does research with data is necessarily a data scientist, in the same way that not everyone who experiments or does research with wiring is necessarily an electrical engineer. But it does mean that one can acquire enough training to become a professional "data inquirer," in the same way that one can acquire enough training to become a professional electrician. That training is more or less comprised of the points in Bitwise's answer, of which statistics is a component but not the entirety.

Piotr's answer is also a nice summary of all the things I need to do (or rather, wish I knew how to do) in a given week. My job so far has mostly been helping to undo the damage done by former employees who belonged to the "Danger Zone" component of the Venn diagram.

$\endgroup$
8
  • 2
    $\begingroup$ +1. I think it's very valuable in this thread to hear from people who are actually employed as "data scientists". $\endgroup$
    – amoeba
    Commented Feb 14, 2016 at 20:00
  • $\begingroup$ (+1) @amoeba I agree 100% with your sentiment. $\endgroup$
    Commented Feb 14, 2016 at 20:30
  • 8
    $\begingroup$ Just to nitpick a bit: I agree that $\text{Data Science} \ne \text{Statistics}$, but I disagree that $\text{Statistics} \subset \text{Data Science}$. I think you should say $\text{Statistics} \cap \text{Data Science} \ne \emptyset$ instead. $\endgroup$
    – caveman
    Commented Feb 15, 2016 at 7:54
  • 1
    $\begingroup$ @caveman it was actually $\in$ when I wrote the post. Probably better that way. $\endgroup$
    Commented Feb 15, 2016 at 13:56
  • 1
    $\begingroup$ @ssdecontrol, but now you are saying that all of $\text{Statistics}$ is a single element of $\text{Data Science}$. I am not sure if this is true. I have a question: is there anything in $\text{Statistics}$ that is not of interest in $\text{Data Science}$? For example, are there any kinds of theoretical proofs that $\text{Data Science}$ does not care about? Or any other example? $\endgroup$
    – caveman
    Commented Feb 15, 2016 at 16:05
3
$\begingroup$

I have also recently become interested in data science as a career, and when I compared what I had learnt about data science jobs with the numerous statistics courses that I took (and enjoyed!), I started to think of data scientists as computer scientists who turned their attention to data. In particular, I noted the following main differences. Note, though, that the differences may be moot. The following just reflects my subjective impressions, and I do not claim generality. Just my impressions!

  1. In statistics, you care a lot about distributions, probabilities, and inferential procedures (how to do hypothesis tests, what the underlying distributions are, etc.). From what I understand, data science is more often than not about prediction, and worries about inferential statements are to some extent absorbed by procedures from computer science, such as cross-validation.

  2. In statistics courses, I often just created my own data, or used some ready-made data that was available in a rather clean format. That means it is in a nice rectangular format, some Excel spreadsheet, or something like that, and it fits nicely into RAM. Data cleaning surely is involved, but I never had to deal with "extracting" data from the web, let alone from databases that had to be set up in order to hold an amount of data that does not fit into RAM anymore. My impression is that this computational aspect is much more dominant in data science.

  3. Maybe this reflects my ignorance about what statisticians do in typical statistical jobs, but before data science I never thought about building models into a larger product. There was an analysis to be done, a statistical problem to be solved, some parameter to be estimated, and that is it. In data science, it seems that often (though not always) predictive models are built into a larger something. For instance, you click somewhere, and within milliseconds, a predictive algorithm will have decided what is being shown as a result. So, while in statistics, I always wondered "what parameter can we estimate, and how do we do it elegantly", it seems that in data science the focus is more on "what can we predict that is potentially useful in a data product".

Again, the above does not try to give a general definition. I am just pointing out the major differences that I have perceived myself. I am not in data science yet, but I hope to transition in the next year. So take my two cents here with a grain of salt.

$\endgroup$
3
$\begingroup$

I always like to cut to the essence of the matter.

statistics - science + some computer stuff + hype = data science
$\endgroup$
1
  • 2
    $\begingroup$ That sounds like the impression I've formed of "machine learning", which I encapsulate as "learning how to operate a piece of software without understanding how it actually works" (unfair of course, but we see a lot of "machine learning" people coming out of school who understand nothing but what the tuning parameters of different kinds of neural nets represent.) $\endgroup$
    – jbowman
    Commented Feb 17, 2018 at 16:49
2
$\begingroup$

I would say that Data Scientist is a role in which one creates human-readable results for the business, using methods that make those results statistically sound (significant).

If any part of this definition is not met, we are talking about a developer, a true scientist / statistician, or a data engineer.

$\endgroup$
1
$\begingroup$

Data science is a multidisciplinary blend of data inference, algorithm development, and technology, used to solve analytically complex problems. Because of the dearth of data scientists, a career in data science can create numerous opportunities. However, organizations are looking for professionals certified by SAS, the Data Science Council of America (DASCA), Hortonworks, etc. Hope this information is useful!

$\endgroup$
