Arum and Roksa (p. 7) say:
Research on course evaluations by Valen Johnson has convincingly demonstrated that "higher grades do lead to better course evaluations" and "student course evaluations are not very good indicators of how much students have learned."
I don't have access to Johnson's book, but a review states:
[Johnson] found the "grade-attribution" theory the most useful: "Students attribute success in academic work to themselves, but attribute failure to external sources" (96). Regardless of the reason, the analysis provides "conclusive evidence of a biasing effect of student grades on student evaluations of teaching" (118).
Johnson did his work in the US. If I'm understanding correctly based on the fairly brief descriptions I have available, he managed to get permission to track students' actions over time, so that he could detect not just correlations but also the time-ordering of events, which could help to tease apart questions of causation.
Johnson says that evaluations are "not very good" indicators of learning. My question is what the available evidence tells us about what "not very good" means. It's possible that someone could answer this simply by having access to Johnson's book and flipping to p. 118.
If "not very good" means a low correlation, then it would be interesting to know whether the correlation is statistically significantly different from zero, and, if so, what its sign is. My guess, which met with a very skeptical reaction in comments here, was that the correlation might be negative: improved learning might require higher standards, which would tend to result in lower grades and, given the grade effect Johnson describes, lower evaluations.
If the correlation is nonzero, it would also be interesting to understand whether one can infer that learning has any causal effect on evaluations. These two variables could be correlated due to the grade-attribution effect, but that wouldn't mean higher learning caused higher evaluations; it could just mean that better students learn more, and better students also give higher evaluations.
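To make the confounding worry concrete, here is a minimal simulation sketch. Everything in it is an assumption made up for illustration, not anything taken from Johnson's data: a hypothetical latent `ability` score drives both learning and evaluations, and learning has no causal effect on evaluations at all.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def simulate_confounded(n_students=5000, seed=0):
    """Hypothetical model: ability causes both variables; learning does
    not enter the formula for evaluation."""
    rng = random.Random(seed)
    learning, evaluation = [], []
    for _ in range(n_students):
        ability = rng.gauss(0, 1)                     # assumed latent trait
        learning.append(ability + rng.gauss(0, 1))    # ability + noise
        evaluation.append(ability + rng.gauss(0, 1))  # ability + independent noise
    return pearson(learning, evaluation)

r = simulate_confounded()
print(f"correlation between learning and evaluation: {r:.2f}")
```

With these assumed variances the theoretical correlation is 0.5, even though by construction learning causes nothing. A nonzero correlation in real data would therefore not settle the causal question by itself.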
If we had, for example, a study in which students were randomly assigned to different sections of a course, we might be able to tell whether differences between sections in learning were correlated with differences between sections in evaluations. However, my understanding is that most analyses of this "value added" type (which are often done in K-12 education) are statistically bogus. Basically, you're subtracting two measurements from one another, and the difference is very small compared to the random and systematic errors in each measurement.
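The subtraction problem can be sketched with a toy simulation. The numbers are assumptions chosen only for illustration: two sections of 30 students, test scores with a standard deviation of 1, and a true gap between sections of 0.1 standard deviations. Only random error is modeled; systematic error would make things worse.

```python
import random

def wrong_sign_fraction(true_gap=0.1, class_size=30, trials=2000, seed=1):
    """Fraction of simulated comparisons in which the better section
    (A, by true_gap) comes out looking worse than section B."""
    rng = random.Random(seed)
    wrong = 0
    for _ in range(trials):
        # Observed section means: true mean plus sampling noise.
        mean_a = sum(rng.gauss(true_gap, 1.0) for _ in range(class_size)) / class_size
        mean_b = sum(rng.gauss(0.0, 1.0) for _ in range(class_size)) / class_size
        if mean_a - mean_b < 0:
            wrong += 1
    return wrong / trials

# Standard error of the difference of two independent section means
# with per-student SD 1 and 30 students each: sqrt(1/30 + 1/30).
standard_error = (2 / 30) ** 0.5
frac = wrong_sign_fraction()
print(f"standard error of the difference: {standard_error:.2f}")
print(f"fraction of trials with the wrong sign: {frac:.2f}")
```

Under these assumptions the standard error of the difference (about 0.26) is more than twice the true gap, so the comparison should point the wrong way roughly a third of the time.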
My anecdotal experience is that when I first started teaching, I was a relatively easy grader, I got very high teaching evaluations, and my students did badly on an internationally standardized test that I gave at the end of the term. Over time, I got confident enough to raise my standards, my teaching evaluations went down, and my students' learning got dramatically better, as measured by this test.
References
Arum and Roksa, Academically Adrift: Limited Learning on College Campuses, 2011
Valen Johnson, Grade Inflation: A Crisis in College Education, 2003
related: Do teaching evaluations lead to lower standards in class?