
I am a lecturer in charge of a course with over 250 students and several TAs. The TAs are partly responsible for grading the homework assignments. Since these are programming assignments, no two submissions are identical, so it is impossible to cover every possible case in the grading guidelines, and the grading has some subjective element.

I noticed that one TA is consistently stricter than the others. For example, if the guidelines say that "code efficiency" is worth 20 points, then this TA would deduct 15 points when the code is inefficient, while the other TAs would deduct only 5 points for a similar issue. A potential problem here is that it might be unfair to the students in the strict TA's class, but this can be solved by allocating the grading "horizontally" (each TA grades all 250 submissions for some of the assignments) rather than "vertically" (each TA grades their own section's submissions for all assignments).

But I have a different question: I noticed that students who are graded more harshly take the feedback comments more seriously, and tend to become better programmers. So, rather than just being "fair", I would like all TAs to grade in a stricter way - for the sake of the students. The problem is that most TAs are not motivated to grade strictly - they gain nothing from it; all they get is having to handle students' complaints and appeals, and risking lower marks in the students' feedback (since the students do not understand that it is in their favor until after they graduate).

The TAs are not lazy - they do put a lot of effort into teaching and helping students; they just don't like being the "bad guys" who give low grades. How can I motivate them to give stricter grades?

CONCLUSION: Thanks a lot to all repliers. In addition to the excellent answers, two things that I did were:

  • I assigned myself to one of the TA sections (where the assignments are graded), in order to get a view of the grading task from the perspective of a TA. It was a very interesting and important experience, and helped me refine the rubric.
  • I introduced a static analysis tool (specifically clang-tidy, for C++) as part of the automatic grading (a sketch of this follows below). It was far stricter than both me and the TAs at detecting readability and code-quality issues. Students learned a lot just from trying to get their code through clang-tidy without warnings.
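
For anyone who wants to replicate this, here is a minimal sketch of how clang-tidy output could be folded into an autograder. The check set, the one-point-per-warning rule, and the cap are hypothetical choices for illustration, not our course's actual configuration:

```python
import subprocess

def clang_tidy_warnings(source_file: str) -> int:
    """Run clang-tidy on one submission and count the warnings it emits."""
    result = subprocess.run(
        ["clang-tidy", source_file,
         "-checks=readability-*,performance-*",  # check set is an assumption
         "--", "-std=c++17"],                    # compiler flags follow "--"
        capture_output=True, text=True)
    return sum(1 for line in result.stdout.splitlines() if " warning: " in line)

def style_deduction(source_file: str, cap: int = 5) -> int:
    """Hypothetical mapping: one point off per warning, capped at `cap`."""
    return min(clang_tidy_warnings(source_file), cap)
```
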
  • 3
    What sort of feedback do students get other than their "marks"? Do the TAs explain where the student went wrong or why points were deducted?
    – Buffy
    Commented Jan 24, 2021 at 12:57
  • 78
    I don’t have time to write a response, but it sounds like an issue with your rubric. It seems like it is unclear what 20 pts of efficiency looks like vs 15 vs 10 vs 5.
    – Dawn
    Commented Jan 24, 2021 at 14:52
  • 2
    @Dawn the issue is that every assignment is different, and it is hard to cover all 250 ways in which students write inefficient code (and similarly for other grading criteria). Commented Jan 24, 2021 at 16:17
  • 5
    I hear you, but I think you are not being creative enough about this. I would suggest you explore rubrics for grading essay assignments to get a better idea of good ways to go about this. Yes, it does take time and effort to come up with good rubrics, but with practice you will improve at it.
    – Dawn
    Commented Jan 24, 2021 at 17:07
  • 9
    For my own actions, I find that distinguishing "feedback" from "grading" is useful. One does not have to "grade strictly/harshly" to give feedback. Serious students don't need to be clubbed to understand critiques. Commented Jan 24, 2021 at 21:44

9 Answers

Answer (score 21)

A few things to add:

  • Be sure the TAs know you'll back them up and will be the bad guy. The students know the TAs are using your guidelines. If a TA used their judgement but was too harsh, you'll let both them and the student know it was your fault for not being clearer.

  • Emphasize the benefits of consistency to the TAs. Remind them that students compare scores. Let them know it's OK to ask another TA (or you) how they grade something. Let them know that if they give too many points, it causes problems for the other TAs.

  • Remind them that they had to work hard to pass this class (assuming they did). Students tend to be protective of their majors, especially TAs, and want to maintain standards.

  • Get TA buy-in for the grading criteria. This is similar to the last bullet. Suppose it's -50% for not using functions, even if the program works. Remind them that this is the "learn to use functions" assignment, that it said functions were required, and that you went over functions in class all last week. 50% off is generous.

I managed to go 5 years never having heard the word "rubric"; then it was 6 more months before I realized it's exactly the same thing as a grading key. I try to be somewhat detailed over a range:

Style:

  • -5: didn't try. nonsense var names, random indents, looks like garbage
  • -3: barely tried, and only in some places
  • -0: actually tried but still looks bad.

"Efficiency" seems way too vague. I try to list specific things they need to do:

Efficiency:

  • -5: no array loops, just lots of IF's.
  • -3: No nested if's
  • -0: at least 2 useful nested if's (even if others could also be nested)
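
A key this concrete can even be written down as data, so every grader applies exactly the same deduction for the same finding. A minimal sketch in Python, with categories mirroring the lists above (the wording of the findings is illustrative, not a complete rubric):

```python
# Deduction table mirroring the Style/Efficiency keys above.
RUBRIC = {
    "style": {
        "didn't try": -5,
        "barely tried": -3,
        "tried but still looks bad": 0,
    },
    "efficiency": {
        "no loops, just lots of ifs": -5,
        "no nested ifs": -3,
        "at least 2 useful nested ifs": 0,
    },
}

def deduction(category: str, finding: str) -> int:
    """Look up the fixed deduction for a finding; unknown findings escalate."""
    try:
        return RUBRIC[category][finding]
    except KeyError:
        raise ValueError(f"not in rubric - escalate: {category}/{finding}")
```
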

But (and I know this isn't what you asked) in a regular coding class they're often just trying to learn the new stuff and make it work. "Good style" is often too much to ask; "efficiency" can be even scarier and more confusing.

I've never done any training exercises with TAs - just gone over grading at the first meeting, then, at every other meeting, discussed the upcoming assignment and how to grade the one coming due.

5 comments:
  • 9
    I really like the fact that "actually tried but still looks bad" comes in at -0 in this key. Everyone grading the assignments is notably more competent in the domain than the students. I think spelling this out explicitly helps people to stop looking for perfect solutions for a perfect score.
    – m00am
    Commented Jan 25, 2021 at 7:17
  • The "efficiency" rubric looks weird to me comparing loops and ifs and somehow grading the nested ifs. It feels more like part of the Style. If it's just as an example, then fine, but I really hope no one uses it in practice. Also, one thing to note that the most efficient code is often much less readable ("looks bad"), than simply "good enough" code.
    – Dan M.
    Commented Jan 25, 2021 at 14:00
  • @DanM. I'll add a blank line. For if's I'm thinking of 5 lines like if(letterGrade==true && g>=70 && g<80) when you've just covered, and required where applicable, nested and cascading if's. Commented Jan 25, 2021 at 19:22
  • I like your response the best. I graded for CS (was NOT a CS major) and, with the prof, had a defined key - basically diff'd in/out and they ought to match. Then I'd go read their code for issues/comments. Given they were provided sample data and in/out diffs, there was no excuse for it not generating correct formatting. With that, any disagreements about style, comments, or... excessive duplication of another person's work... went to the professor.
    – J.Hirsch
    Commented Jan 26, 2021 at 21:59
  • 2
    The first four points address the psychological issues of TAs not wanting to give low grades, which is exactly the focus of my question. Thanks Commented Jan 27, 2021 at 16:21
Answer (score 68)

You have a bigger problem than encouraging stricter grading. You need to provide consistent grading. Otherwise your scheme is fundamentally unfair.

For starters, providing a proper rubric is not optional. If you aren't doing that, then you are failing the students. If it takes a lot of work, then you have a large task, but a required one.

You can make the rubric as strict as you like (though I don't really like the concept of narrow interpretations), but it has to be clear to your TAs and it has to be reasonable to your students.

One way to assure some consistency is to have more than one TA involved with each student's work. They need to agree with each other or appeal to you for a judgement. If they enter student grades into a spreadsheet, you can easily see the differences and can also use it for further TA training as needed - for example, when one TA is consistently "too" lenient.

For an exercise with lots of parts, it might be possible to have each TA responsible for only one part. This tends to work for final exams where students answer questions, but less well for programming assignments.

Another way to achieve a good rubric is to have yourself or a small team of advanced TAs scan the student work without grading it and use what they learn to refine the rubric to assure consistency. It is probably a mistake to use this trick to create the rubric in the first place, but it gives you an idea of where the students are going wrong and need correction - an overall view. Once that is in place, the actual grading can occur.

Another trick, though not very easily done in pandemic times or with a large pool of graders, is to bring everyone together in real time to grade all the papers. This could possibly be done online (Zoom), and you could be present to answer questions and make decisions.

But, again, consistency is a requirement. The rubric needs to be complete to assure that. The "strictness" is a secondary concern, but could be improved (your idea, not mine) with a proper rubric that everyone finds clear.


Moreover, if you try to grade fundamentally "fuzzy" things on a fixed scale, you have an impossible task. If you can define "efficiency" in your example, then fine. But if it is a fuzzy concept, then almost every rubric is likely to leave the grading to intuition. Give precise grades on the things that are precise. But for other things, judgement and a bit of compassion are probably needed.

In CS, some things are clear, of course. If a student uses bubble sort on a large array, it is clearly inefficient. But insertion sort is more efficient than quicksort at small scales, which is why library implementations of quicksort normally fall back to insertion sort for small subarrays.
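
For concreteness, here is a quick experiment one could run to see the clear-cut part of that claim (sizes and repetition counts are arbitrary):

```python
import random
import timeit

def bubble_sort(a):
    """Textbook bubble sort, O(n^2): fine for tiny inputs, hopeless at scale."""
    a = a[:]
    for i in range(len(a)):
        for j in range(len(a) - 1 - i):
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
    return a

for n in (16, 2048):
    data = [random.random() for _ in range(n)]
    t_bubble = timeit.timeit(lambda: bubble_sort(data), number=5)
    t_builtin = timeit.timeit(lambda: sorted(data), number=5)
    print(f"n={n}: bubble {t_bubble:.4f}s vs built-in {t_builtin:.4f}s")
```

At small n both finish in negligible time, so "inefficient" is a judgment call; at n=2048 the quadratic sort loses by orders of magnitude and no rubric ambiguity remains.
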

But judgments about "proper factoring" of code are inherently subjective. If your feeling is "I can't define it, but I know it when I see it", then it is nearly impossible to provide a rubric that a group of TAs will apply consistently.

5 comments:
  • Even in the sort case, insertion sort can often outperform the other sorts by a surprising amount on modern processors, due to its predictable control flow, despite executing significantly more instructions. So it can depend a lot on your definition of efficiency. Many standard libraries use it for small cases now. Commented Jan 25, 2021 at 10:49
  • 4
    I don’t see how this addresses the question asked. In the question, OP acknowledges this consistency issue, mentions they’re aware of the standard and adequate solution (assigning the grading “horizontally” among TAs rather than “vertically”), and then explains that they’re asking about a different (though related) issue.
    – PLL
    Commented Jan 25, 2021 at 15:53
  • 4
    @PLL a properly formulated rubric will, by its nature, set the level of "strictness" when it is applied. If it doesn't, then it is an imperfect rubric leaving too much to interpretation.
    – Buffy
    Commented Jan 25, 2021 at 15:59
  • 3
    +1 just for the first sentence. When I've had several competent TAs, one thing I've done is have them get together to come up with a rubric, and look at how the other TAs are grading at the beginning to make sure they're on the same page. Once they were, they would take turns writing the rubrics for the assignments.
    – Kimball
    Commented Jan 25, 2021 at 16:29
  • 2
    I agree that a rubric is important, but as you said, even a very detailed rubric leaves a certain subjective element to grading, and my question is specifically about this remaining subjective element. It is more about the psychology of not wanting to be the "bad guy". Commented Jan 27, 2021 at 16:20
Answer (score 16)

From my experience, the best way to ensure consistency is by setting simple, clear-cut rubrics.

This can be done via a moderation exercise: have all the teaching staff mark 10 scripts together and see where the disagreements lie.

Alternatively, if you have access to moderation tools like Gradescope, then this obviates the need to meet in person.

Bottom line - just be clear about your expectations. Explain to the TAs the purpose of the grading - whether harsh or lenient - and the need for consistency.

Answer (score 7)

Several suggestions, dealing with several aspects of the question.

Spend some time training the TAs. At the start of the semester, have everyone grade the same set of sample submissions from a previous semester. Meet in a group to discuss what you and they think matters when correcting inefficiencies and inelegancies. If you can reach consensus, fine. If not, make your own requirements clear to all. Perhaps repeat this exercise with the TAs after the first assignment.

Consider two separate marks for each assignment, one for correctness and one for style. Perhaps be strict on the style scale but weight that mark less.

Weight programming assignments at the end of the semester more heavily than those at the start, and make sure the students know this. That should mitigate the effect of strict standards at the start and show them what they need to do in order to do better later.
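
As a concrete (hypothetical) weighting scheme: give assignment k weight k, so early strictness costs little while late performance dominates. A minimal sketch:

```python
def course_grade(marks, weights=None):
    """Weighted average of assignment marks; weights rise over the semester."""
    if weights is None:
        weights = range(1, len(marks) + 1)  # 1, 2, 3, ... (an assumption)
    return sum(m * w for m, w in zip(marks, weights)) / sum(weights)

# A harsh 60 on the first assignment barely dents a grade dominated by later work:
print(course_grade([60, 80, 90, 95]))  # -> 87.0
```
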

1 comment:
  • Interesting advice, thanks. I particularly like the last one, which captures the idea that "a low grade now pays off well in the long run". Commented Jan 27, 2021 at 16:22
Answer (score 5)

For programming assignments, there is nothing better than a suite of automated tests that look at accuracy, performance, and code smells.

You can have tiers of tests:

  • Tier I - basic tests
  • Tier II - advanced/edge-case tests
  • Tier III - performance, code smells tests

Points are awarded according to which tiers a submission passes.

Of course, you can have TAs skim the solutions to adjust the points for clever solutions and for attempts to game the tests.
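
A minimal sketch of such a tier-based grader, assuming each tier is just a list of test callables; whether tiers score independently or cumulatively is a design choice (independent here), and all names and point values are illustrative:

```python
def tier_passes(tests) -> bool:
    """A tier passes only if every test function in it returns True."""
    return all(test() for test in tests)

def grade(tiers, tier_points=(50, 30, 20)) -> int:
    """Tier I: basic; Tier II: advanced/edge cases; Tier III: performance/smells."""
    return sum(points
               for tests, points in zip(tiers, tier_points)
               if tier_passes(tests))

# usage with trivial stand-in tests:
basic = [lambda: 1 + 1 == 2]
edge = [lambda: sorted([]) == []]
perf = [lambda: True]
print(grade([basic, edge, perf]))  # -> 100
```
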

3 comments:
  • 1
    We already have that; my question is about things that cannot be graded automatically, e.g. code quality and writing style. Commented Jan 25, 2021 at 21:02
  • 2
  • About code quality: most languages do have analyzers to automatically test code quality, the most common ones being linters. It is also possible to detect code smells with specialized tools using other metrics, such as cyclomatic complexity, coupling, cohesion, etc. The domain as a whole of analyzing code without running it is called static analysis. Commented Jan 26, 2021 at 19:21
  • @ErelSegal-Halevi - As a software developer I find that highly ironic because code quality in academia is abysmal. And like Felix said, in industry, we have a TON of tools for code quality.
    – Davor
    Commented Jan 27, 2021 at 14:51
Answer (score 4)

What I and the other TAs did on last semester's course to make sure we graded all students as evenly as possible:

  • Before the assignment was released to the students, we went through the rubric to try to poke holes in it. We wanted to make sure that what the students were told to do in the assignment, what the assignment told them they would be graded on, and the actual rubric we would use, were all in agreement. If they weren't consistent, we'd take it back to the lecturer and propose a refinement.
  • When results came in, we took a couple of submissions and graded them collectively, to calibrate between the TAs how we would apply the items in the rubric in practice. After the first assignment we had a good idea of who the strong and weak programmers were, so we'd pick a presumed-strong and a presumed-weak submission to calibrate at both ends of the scale.
  • When grading an assignment, we'd keep notes on which points were scored for each item in the rubric, so you could look up exactly why a student got the grade they got, and what they should improve for a re-sit.
  • We also tracked the average mark given by each TA, so that we could do an inspection if one TA's average marks were significantly higher or lower than the others' (a sketch of such a check follows below).

This approach worked out quite well.
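
For the last bullet, a minimal sketch of such an inspection trigger, assuming marks are exported as CSV rows of (ta, student, mark); the flagging threshold is an arbitrary choice:

```python
import csv
from collections import defaultdict
from statistics import mean, stdev

def flag_outlier_tas(path: str, threshold: float = 0.5):
    """Return TAs whose average mark deviates notably from the group."""
    marks = defaultdict(list)
    with open(path, newline="") as f:
        for ta, _student, mark in csv.reader(f):
            marks[ta].append(float(mark))
    ta_means = {ta: mean(ms) for ta, ms in marks.items()}
    overall = mean(ta_means.values())
    spread = stdev(ta_means.values()) if len(ta_means) > 1 else 0.0
    return [ta for ta, m in ta_means.items()
            if spread and abs(m - overall) > threshold * spread]
```
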

1 comment:
  • This answer is underrated. Writing a rubric that covers all cases is an incredibly difficult task. Marking a couple of scripts together is much more likely to actually produce consistent scores.
    – Clumsy cat
    Commented Jan 12, 2022 at 8:37
Answer (score 0)

If you want to reduce variance, you can use multi-pass marking.

Every deduction is a combination of a marked part of the code, a reason, and a point value.

Hand the marked part of the code and the reason off to another TA, who independently determines the deduction based on your rubric.

If they significantly disagree, send it to a 3rd TA.

If they still significantly disagree, escalate to you, and improve the rubric.

The first TA spends most of the time (looking for problems); the second TA only has to apply the rubric to an identified problem, so this should less than double the marking workload.

Because you are now comparing two (or three) TAs' determinations of how severe something is, you can bias the average toward the more severe rating. You can review the work of TAs who are regularly the least severe, and encourage them to be stricter where warranted.
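
A minimal sketch of that resolution step; the disagreement threshold and severity bias are assumptions, not prescriptions:

```python
from typing import Optional

def resolve_deduction(first: int, second: int,
                      threshold: int = 3,
                      severity_bias: float = 0.75) -> Optional[int]:
    """Combine two independent deductions for the same flagged code.

    Returns the agreed deduction, or None to signal escalation to a third
    marker (and ultimately the lecturer) when the two differ too much.
    """
    if abs(first - second) > threshold:
        return None  # significant disagreement: escalate
    low, high = sorted((first, second))
    # bias the combined mark toward the harsher of the two, as described above
    return round(low + severity_bias * (high - low))

print(resolve_deduction(5, 15))  # -> None (the question's 5-vs-15 case escalates)
print(resolve_deduction(4, 6))   # -> 6 (leans toward the stricter marker)
```
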

Note that you can even use such a mechanism for student-student code review (looking at what other people do wrong or right is of high value when learning): students are told to use the rubric to review other students' code and identify areas that violate the rules, and TAs then review the students' selections of issues and award marks that way.

That student-student review can also be used to spot problems that a TA is missing; if a TA consistently misses violations of the rubric that students catch (and that a third TA marks as good catches), that is a reason to consider talking to that TA about being more strict.

In short, you need visibility into the process. This is expensive to do yourself, so you need to have your large number of students and TAs provide useful cross-review, consume the resulting data to find the exceptions you want to deal with yourself, and then apply corrective action on those exceptions. That should efficiently move the marking toward being more consistent, and correctly harsh.

Answer (score -1)

It sounds to me as though one of the TAs is not calibrating their marks to reasonable expectations of the student cohort, but to absolute criteria, e.g. x points lost for each of a list of potential deficiencies. This can be unfair because the deficiencies are often correlated, rather than independent, which leads to a bimodal distribution of scores - students all either got it basically right, or basically wrong.

In the U.K. system, where we have first-class, upper- and lower-second class etc., I'd ask the TA to look at the score they have given and ask themselves whether it is consistent with the implied degree classification. For instance, if there were 20 marks for efficiency and a marker awarded only 5, that is saying the work is a borderline fail, according to that quality. Phrased that way, they might see that the score is incompatible with their subjective assessment of the work, rather than with the objective "tick-box" score. If another marker could have given it 15 marks, it clearly isn't in fail territory!

... of course, it could be that the other marker is too lenient, but again it is a calibration issue: they should ask themselves (in the U.K. system) whether the work was "first class" from an efficiency perspective, as that is what a score of 15 (75%) would imply.

How the TAs would moderate their original marks is another matter, but it is a sanity-check of their calibration.
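
As a sketch of that sanity check, using the standard UK honours boundaries (applying them to a single rubric item is this answer's heuristic, not an official rule):

```python
def implied_class(points: float, max_points: float) -> str:
    """Map a score fraction to the UK degree classification it implies."""
    pct = 100.0 * points / max_points
    if pct >= 70:
        return "first"
    if pct >= 60:
        return "upper second (2:1)"
    if pct >= 50:
        return "lower second (2:2)"
    if pct >= 40:
        return "third"
    return "fail"

print(implied_class(5, 20))   # -> 'fail'  (the strict marker's 5/20)
print(implied_class(15, 20))  # -> 'first' (the lenient marker's 15/20)
```
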

Programming has a large subjective element. The point is not just to write a program that computes the right answer (hopefully efficiently); it needs to be written in a way that is understandable by other human beings, so that it is maintainable by somebody else. Overly prescriptive rubrics can cause more problems than they are worth, because students sometimes come up with good solutions that don't fit your rubric, and they shouldn't be penalised for that. Students also enjoy a degree of freedom to be creative in programming assignments, and I think that makes them better programmers in the long run.

1 comment:
  • 1
    @downvoter - some feedback on why would be very welcome. Commented Jan 26, 2021 at 14:23
Answer (score -7)

Assuming "You" is is the university: Pay your TAs more to increase motivation. You cannot expect good quality work or higher motivation for bad quality pay. Generous grading is low-effort grading.

The other answers advocating rubrics and training are also correct.

11 comments:
  • Why do you assume "You" is the university? Can you explain how OP can influence their university to make a change?
    – user111388
    Commented Jan 25, 2021 at 7:39
  • It depends on the university. The "lecturer" role has different levels of power in different locations. Commented Jan 25, 2021 at 8:43
  • 5
    Note, there is no indication that the TAs are failing to do "good quality work" (the problem is only that OP feels their grading is too generous, not that the grading is inaccurate or unfair). Nor is there any evidence that the TAs' workload is disproportionate to their stipend. By this logic, any possible problem with a coworker could be attributed to inadequate salary. You may be opposed to long answers, but I am opposed to cryptic ones with unclear reasoning :-)
    – cag51
    Commented Jan 26, 2021 at 2:46
  • 1
    @cag51 I don't feel your criticism is honest. I'm sure you are aware nearly all TA pay is very low, so I am not claiming that "any possible problem with a coworker could be attributed to inadequate salary" as most industries pay more. You are also aware that grading too generously is, in fact, low quality work. Further, the question specified "motivation" was the problem, which is indicative of low quality work. Commented Jan 26, 2021 at 2:55
  • 1
    Thanks for reply. I don't entirely agree, but at least I understand your logic now. (You might consider adding your reply, especially the last two sentences, into the post so that others will understand).
    – cag51
    Commented Jan 26, 2021 at 3:12
