21
$\begingroup$

I am not sure where this question belongs: Cross Validated or The Workplace. But my question is at least vaguely related to statistics.

This question (or, I guess, these questions) arose during my work as a "data science intern". I was building a linear regression model and examining the residual plot, and I saw clear signs of heteroskedasticity. I remembered that heteroskedasticity distorts the usual standard errors, making confidence intervals and t-tests unreliable, so I used weighted least squares, following what I had learned in college. My manager saw that and advised me not to do it because "I was making things complicated", which was not a very convincing reason to me at all.
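For concreteness, here is a minimal sketch of the kind of thing I did (simulated data, not my actual work; the Breusch-Pagan check and the $1/x^2$ weights are just one reasonable choice under an assumed variance structure):

```python
# Illustrative sketch: detect heteroskedasticity, then refit with WLS.
# All data and numbers here are simulated for the example.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(1, 10, n)
# Error variance grows with x, so plain OLS standard errors are unreliable.
y = 2.0 + 0.5 * x + rng.normal(0, 0.3 * x, n)

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()

# Breusch-Pagan test on the OLS residuals: a small p-value
# is evidence of heteroskedasticity.
_, lm_pvalue, _, _ = het_breuschpagan(ols.resid, X)
print(f"Breusch-Pagan p-value: {lm_pvalue:.4g}")

# WLS with weights proportional to 1/variance. Here the variance
# structure (sd proportional to x) is known by construction; in
# practice it has to be estimated or modelled.
wls = sm.WLS(y, X, weights=1.0 / x**2).fit()
print("OLS standard errors:", ols.bse)
print("WLS standard errors:", wls.bse)
```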

Another example would be "removing an explanatory variable because its p-value is insignificant". To me, this advice just does not make sense from a logical point of view. According to what I have learned, an insignificant p-value can arise for many different reasons: chance, an underpowered sample, using the wrong model, violating the assumptions, etc.
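A quick simulation makes the "chance" point concrete (the effect size and sample size below are invented for illustration):

```python
# Minimal simulation: a genuinely relevant predictor can come out
# "insignificant" at the 5% level simply because the sample is small.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_sims, n, true_beta = 1000, 30, 0.3
missed = 0

for _ in range(n_sims):
    x = rng.normal(size=n)
    y = true_beta * x + rng.normal(size=n)
    fit = sm.OLS(y, sm.add_constant(x)).fit()
    if fit.pvalues[1] > 0.05:  # p-value of the slope
        missed += 1

# Even though the true effect is nonzero, a large share of
# samples fail to reach significance (low power).
print(f"insignificant in {missed / n_sims:.0%} of simulations")
```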

Yet another example: I used k-fold cross-validation to evaluate my models. According to the results, $CV_{\text{model 1}}$ is just way better than $CV_{\text{model 2}}$, but model 1 has a lower $R^2$, and the reason has to do with the intercept. My supervisor, though, seems to prefer model 2 because it has the higher $R^2$. His reasons (such as that $R^2$ is robust, or that cross-validation is a machine-learning approach rather than a statistical one) just do not seem convincing enough to change my mind.
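Here is a toy version of that disagreement (hypothetical models, simulated data; the particular cause differs from my intercept issue, but it shows how in-sample $R^2$ can favour the model that cross-validation rejects):

```python
# In-sample R^2 rewards flexibility; k-fold CV penalizes overfitting.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, (40, 1))
y = 0.5 * X.ravel() + rng.normal(0, 1.0, 40)  # truth is linear

model_1 = LinearRegression()                                         # simple
model_2 = make_pipeline(PolynomialFeatures(10), LinearRegression())  # flexible

for name, m in [("model 1", model_1), ("model 2", model_2)]:
    r2 = m.fit(X, y).score(X, y)  # in-sample R^2
    cv_mse = -cross_val_score(m, X, y, cv=10,
                              scoring="neg_mean_squared_error").mean()
    print(f"{name}: in-sample R^2 = {r2:.3f}, 10-fold CV MSE = {cv_mse:.3f}")

# model 2 wins on R^2 but loses badly on CV error.
```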

As someone who just graduated from college, I am very confused. I am very passionate about applying correct statistics to solve real-world problems, but I don't know which of the following is true:

  1. The statistics I learned is just wrong, so I am simply making mistakes.
  2. There is a huge difference between theoretical statistics and building models in companies, and although the statistical theory is right, people just don't follow it.
  3. The manager is not using statistics correctly.

Update (4/17/2017): I have decided to pursue a Ph.D. in statistics. Thank you all for your replies.

$\endgroup$
7
  • 1
    $\begingroup$ Related to your question are the comments (especially those at the end) below this answer: stats.stackexchange.com/questions/229193/… $\endgroup$
    – user83346
    Commented Sep 3, 2016 at 6:49
  • $\begingroup$ This discussion can also be relevant. In practice, you can sometimes use models whose required assumptions your data violate (e.g., Naive Bayes on dependent variables) and still get interesting results. But you must then be very careful about the conclusions you draw, and that's where the main problem is: most people just don't care about the meaning of your results as long as you get results. Publish or perish... $\endgroup$
    – gaborous
    Commented Sep 3, 2016 at 15:25
  • 1
    $\begingroup$ The answers saying "you are right and he is wrong" are probably correct and apply to your case. Still, beware that sometimes the answer can be "he is wrong, but his wrong way works for his purposes; maybe it even works better than the right way would for his non-statistical purpose of running a business". I think that happens often with all kinds of scientific knowledge, not just statistics. Maybe on SE Workplace they can give you non-statistical examples. $\endgroup$
    – Pere
    Commented Sep 3, 2016 at 17:44
  • 3
    $\begingroup$ @Aksakal: From what the OP describes, statistically he is more likely correct. Your personal anecdote is just an anecdote. I can counter it by saying I moved into a job where A/B testing would be done with just 30 samples; showing basic power calculations changed the team's whole mindset about sample sizes and decision making. Returning to the OP's question, I agree that what is described doesn't mean the OP's supervisor made a wrong call. Business workflows have a particular inertia associated with them, and the "new guy" has to prove himself as a preacher before becoming a prophet... $\endgroup$
    – usεr11852
    Commented Sep 4, 2016 at 0:07
  • 1
    $\begingroup$ @usεr11852, my comment was a rant :) but it has a point, methinks: for someone who's new to the field, it's safer to assume that the boss knows better. With experience he can relax this assumption, giving more weight to his own opinion and less to the boss's. For an intern, the weight on his own opinion should be close to ZERO. $\endgroup$
    – Aksakal
    Commented Sep 4, 2016 at 16:30

3 Answers

13
$\begingroup$

In a nutshell, you're right and he's wrong. The tragedy of data analysis is that a lot of people do it, but only a minority of people do it well, partly due to a weak education in data analysis and partly due to apathy. Turn a critical eye to most any published research article that doesn't have a statistician or a machine-learning expert on the author list and you'll quickly spot such elementary mistakes as interpreting $p$-values as the probability that the null hypothesis is true.
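To make that mistake concrete, here is a quick simulation (the 80% base rate of true nulls and the effect size are invented for illustration) showing that a p-value just under 0.05 is nothing like a 5% chance that the null is true:

```python
# Among tests that land near p = 0.045, what fraction of the nulls are
# true? That depends on the base rate of true nulls, which p-values
# know nothing about.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_sims, n, effect = 100_000, 50, 0.5
null_true = rng.random(n_sims) < 0.8           # 80% of hypotheses are null
means = np.where(null_true, 0.0, effect)
samples = rng.normal(means[:, None], 1.0, (n_sims, n))
p = stats.ttest_1samp(samples, 0.0, axis=1).pvalue

near_05 = (p > 0.04) & (p < 0.05)              # "just significant" results
print(f"true nulls among them: {null_true[near_05].mean():.2f}")
# Far above 0.05 (around 0.75 in this setup).
```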

I think the only thing to do, when confronted with this kind of situation, is to carefully explain what's wrong about the wrongheaded practice, with an example or two.

$\endgroup$
12
  • 3
    $\begingroup$ Thanks for the reply. I guess a "next-step question" is, is there any job out there that actually does correct statistics? I understand that data science is very popular nowadays, but somehow I have this impression that many "data scientists" do not really care about doing correct statistics... $\endgroup$
    – 3x89g2
    Commented Sep 3, 2016 at 5:33
  • 1
    $\begingroup$ @Misakov I think it really depends on the person or organization. But buzzwords like "data science", "analytics", and "business intelligence" are red flags. And don't forget that in a job interview, you're interviewing them, too. Asking detailed questions about how things are done doesn't just make you look good; it lets you see how serious they are about data analysis. $\endgroup$ Commented Sep 3, 2016 at 5:40
  • $\begingroup$ @Misakov You'd probably need to go into academia if you really want to do correct statistics. The vast majority (see my answer above) of industrial use will be wrong. $\endgroup$
    – Mooks
    Commented Sep 3, 2016 at 9:20
    $\begingroup$ @Kodiologist: I think you are taking a slightly "righteous" approach to this, and you are not helping the OP by just confirming his bias against industry statistics. Also, the idea of contradicting a senior member after he has given a direct decision ("Go with the higher $R^2$") is a bit naive... Given that the enterprise still exists, the manager's decisions can't be so wrong, and the over-simplification of some rules might not be too catastrophic within the context of their work. New people (like the OP) come on board and the team evolves; evolution is a Wiener process though, not a Lévy flight! $\endgroup$
    – usεr11852
    Commented Sep 3, 2016 at 23:56
  • 1
    $\begingroup$ @usεr11852 A good (i.e., non-pointy-haired) manager will defer to employees when they know better than he does. "Given that the enterprise still exists the manager's decisions aren't so wrong" — The race is not to the swift. $\endgroup$ Commented Sep 4, 2016 at 0:04
12
$\begingroup$

Kodiologist is right: you're right, he's wrong. Sadly, though, this is an even more commonplace problem than the one you're encountering. You're actually in an industry that's doing relatively well.

For example, I currently work in a field where specifications on products need to be set. This is nearly always done by monitoring the products/processes in some way, recording means and standard deviations, and then using the good old $\text{mean} + 3\sigma$.

Now, apart from the fact that this interval is not telling them what they actually need (they need a tolerance interval for that), it is computed blindly on parameters that hover near some maximum or minimum value, even though the parameter itself can never exceed those bounds. Because Excel will calculate what they need (yes, I said Excel), they set their specs according to that, despite the fact that the parameter is nowhere near normally distributed. These people have been taught basic statistics, but not Q-Q plots or the like. One of the biggest problems is that statistics will give you a number even when used inappropriately, so most people don't know when they have misused it.
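To see how badly this can go, here is a rough sketch with simulated skewed data (not a real product's numbers):

```python
# A "mean + 3*sigma" spec on bounded, right-skewed data can land below
# zero even though the quantity can never be negative.
import numpy as np

rng = np.random.default_rng(4)
# A bounded, right-skewed parameter hovering near its minimum of 0
x = rng.lognormal(mean=-1.0, sigma=0.9, size=200)

lower = x.mean() - 3 * x.std(ddof=1)
upper = x.mean() + 3 * x.std(ddof=1)
print(f"mean +/- 3*sigma: [{lower:.3f}, {upper:.3f}]")  # lower limit < 0

# mean + 3*sigma is implicitly claiming to be the 99.87th percentile,
# but for skewed data the actual percentile is very different:
print(f"actual 99.87th percentile: {np.quantile(x, 0.9987):.3f}")
# A Q-Q plot (e.g. scipy.stats.probplot(x, dist='norm')) shows the
# non-normality at a glance.
```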

In other words, the specifications on the vast majority of products, in the vast majority of industries, are nonsense.

One of the worst examples I have of people blindly following statistics without understanding them is the use of $C_{pk}$ in the automotive industry. One company spent about a year arguing with their supplier over a product, because they thought the supplier could control the product to a level that was simply not possible. They were setting only a maximum spec (no minimum) on a parameter and used $C_{pk}$ to justify their claim, until it was pointed out that their calculations, when used to set a theoretical minimum level (which they didn't want, so they had never checked), implied a massively negative value, on a parameter that could never go below 0. $C_{pk}$ assumes normality, and the process gave nowhere near normal data. It took a long time for that to sink in. All that time and money was wasted because people didn't understand what they were calculating, and it could have been a lot worse had it not been noticed. This might be a contributing factor to why there are regular recalls in the automotive industry!
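The arithmetic of that story is easy to reproduce (the numbers below are invented, not the company's):

```python
# With only an upper spec limit, the usual one-sided index is
#     Cpk = (USL - mean) / (3 * sigma).
# Back-solving for a lower limit at the same capability shows the absurdity.
mean, sigma, usl = 2.0, 1.5, 8.0  # hypothetical process values

cpk = (usl - mean) / (3 * sigma)
print(f"one-sided Cpk = {cpk:.2f}")  # 1.33: looks perfectly capable

# The lower limit that would give the same Cpk on the other side:
implied_lsl = mean - 3 * sigma * cpk
print(f"implied lower limit = {implied_lsl}")  # -4.0, impossible if x >= 0
# And all of this silently assumes normality in the first place.
```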

I myself come from a science background, and, frankly, the statistics teaching in science and engineering is shockingly insufficient. I'd never heard of most of what I need to use now; it's all been self-taught, and there are (compared to a proper statistician) massive gaps in my knowledge even now. For that reason, I don't begrudge people misusing statistics (I probably still do it regularly myself); it's down to poor education.

So, going back to your original question, it's really not easy. I would agree with Kodiologist's recommendation to try to gently explain these things so the right statistics are used. But, I would add an extra caveat to that and also advise you to pick your battles wisely, for the sake of your career.

It's unfortunate, but it's a fact that you won't be able to get everyone to do the best statistics every time. Choose to correct them when it really matters to the final overall conclusion (which sometimes means doing things two different ways to check). There are times (e.g., your model 1 vs. model 2 example) where using the "wrong" way might lead to the same conclusions. Avoid correcting too many people too frequently.

I know that's intellectually frustrating and the world should work differently - sadly it doesn't. To a degree you'll have to learn to judge your battles based on your colleagues' individual personalities. Your (career) goal is to be the expert they go to when they really need help, not the picky person always trying to correct them. And, in fact, if you become that person, that's probably where you'll have the most success getting people to listen and do things the right way. Good luck.

$\endgroup$
5
  • $\begingroup$ Excel is quite possibly the most widely used data analysis software. No need for the "yeah, I said it!" remark. Unless someone has never left academia (or maybe big pharma), they won't bat an eye at your original statement. (Nice answer, +1) $\endgroup$
    – usεr11852
    Commented Sep 3, 2016 at 23:31
  • 1
    $\begingroup$ It is the most widely used, and I think that highlights my original point. Excel has huge deficiencies for data analysis. If what you're doing is being done in Excel, you can't really call it data analysis, unless you're manually entering all the calculations yourself. Nothing against Excel as a spreadsheet, but it's a rudimentary data analysis tool at best. People don't know any better, because they're not taught any better. I don't come from a statistics background, but I was lucky that someone mentioned R to me for making better graphs, and that, coincidentally, led me into better stats. $\endgroup$
    – Mooks
    Commented Sep 4, 2016 at 8:55
  • $\begingroup$ "I would agree with Kodiologist's recommendation to try to gently explain these things so the right statistics are used." - I want to be a witness. An intern explaining to his employer how to do business. $\endgroup$
    – Aksakal
    Commented Sep 4, 2016 at 16:32
  • 1
    $\begingroup$ This will help, check #9. It's common advice that comes up in these sorts of lists all the time. First 100 days at a job: do not suggest changing things; first figure out why people are doing things the way they're doing them, because often there's a valid reason. Otherwise you'll make a fool of yourself, and I've seen this happen with new guys over and over. Just keep quiet and observe for a few months. $\endgroup$
    – Aksakal
    Commented Sep 4, 2016 at 16:43
  • $\begingroup$ @Aksakal What you said definitely makes sense. I'm acting a little bit "bold" in my situation mainly because I'm an intern and I know I'm leaving pretty soon anyway. $\endgroup$
    – 3x89g2
    Commented Sep 8, 2016 at 20:09
4
$\begingroup$

What you describe sounds like a somewhat bad experience. Nevertheless, it should not be something that causes you to immediately question your own educational background, nor the statistical judgement of your supervisor/manager.

Yes, very likely you are correct to suggest using CV instead of $R^2$ for model selection, for example. But you need to find out why this (potentially dodgy) methodology came to be, see how it is hurting the company down the line, and then offer solutions for that pain. Nobody consciously uses a wrong methodology unless there are reasons to do so. Saying that something is wrong (which may very well be true) without showing how the mistake affects actual work, rather than asymptotic behaviour somewhere in the future, does not mean much; people will be reluctant to accept it. Why spend energy changing when everything is (somewhat) working? Your manager is not necessarily wrong from a business perspective. He is responsible for the statistical as well as the business decisions of your department, and those decisions do not always coincide; quite likely they do not coincide on short-term deliverables (time constraints are a very important factor in industrial data analytics).

My advice is to stick to your (statistical) guns, but be open to what people do, be patient with people who may be detached from current statistical practice, and offer advice and opinions when asked; grow a thicker skin and learn from your environment. If you are doing the right things, this will slowly show: people will want your opinion because they will recognise that you can offer solutions where their current workflow cannot. Finally, sure, if after a reasonable amount of time (a couple of months at least) you feel that you are devalued and disrespected, just move on.

It goes without saying that, now that you are in industry, you cannot sit back and think your statistics education is finished. Predictive modelling, regression strategies, and clustering algorithms keep evolving. For example, using Gaussian process regression in an industrial setting was close to science fiction 10 years ago; now it is almost an off-the-shelf thing to try.
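For instance, here is a minimal sketch with scikit-learn and synthetic data (the kernel choice is just illustrative):

```python
# Gaussian process regression as an off-the-shelf tool: a few lines
# give a fit plus predictive uncertainty.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(5)
X = rng.uniform(0, 10, (40, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.2, 40)

# RBF kernel for the signal, WhiteKernel for the observation noise
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.04)
gpr = GaussianProcessRegressor(kernel=kernel, random_state=0).fit(X, y)

X_new = np.linspace(0, 10, 5).reshape(-1, 1)
mean, std = gpr.predict(X_new, return_std=True)  # uncertainty for free
print(np.round(mean, 2), np.round(std, 2))
```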

$\endgroup$
