122

We've been talking a lot about the efficacy of community self-evaluations lately, both internally on the Community Team and out in the open on meta. Lots of ideas have been tossed around, but the underlying theme of these discussions has been that site self-evaluations are not useful in their current form. The review queue format hampers discussion, the meta thread posted by the Community user is trivially ignored, and the question sample is far too small to be useful, yet big enough to be non-trivial for participants. The evaluations also don't provide much in the way of useful information to community managers behind the scenes.

So we're going to shut them down this week, and we'll be using the time it saves us to digest the ideas y'all have thrown out in the previous discussions and figure out where to go from here.

I'll be going over the previous threads with a fine-toothed comb to try to suss out what worked well and what was broken, and to determine a more useful mechanism - if one is required at all. I'd also welcome continued suggestions from the communities about how to keep capturing what the evals accidentally did right: encouraging communities to stop and examine how they're doing at somewhat regular intervals.

So: if you have ideas on how to get communities to monitor their own Q&A quality and get a little introspective, feel free to sound off in the answers.

Update: with a creative application of per-site feature toggles, we've shut down evaluations except on a handful of sites where they're currently running. Once those evaluations run their course, no new ones will be started anywhere on the network.

  • 29
    I have no idea whether this is the Right^TM choice or not, but I'm glad that y'all are critically evaluating things that have been around for a long time as opposed to being chained down by tradition! Commented Jul 28, 2015 at 23:18
  • 48
    We did Community Evaluations of sites?
    – Robotnik
    Commented Jul 29, 2015 at 0:54
  • 6
    @Robotnik Every 6 months, and only on beta sites.
    – hairboat
    Commented Jul 29, 2015 at 0:58
  • 2
    @abbyhairboat - I know, it was a joke :P
    – Robotnik
    Commented Jul 29, 2015 at 1:00
  • 38
    @Robotnik I thought it might be, but I played it safe and ruined the joke. Because I hate fun.
    – hairboat
    Commented Jul 29, 2015 at 15:11
  • 5
    I agree - I don't believe the site self-evaluations are very pertinent, nor that they provide good results. When I've done them for the beta sites I'm on, I often wonder, "Who picked these questions? Are they random, or what?" Without thinking about it further, I go through the questions and do the best I can, trying to be honest in evaluating. Do I believe this does any good? No. This too can be gamed. I don't have a better solution, so I'm just leaving this as a comment. Come up with some better way to self-evaluate ... it's needed! Commented Jul 29, 2015 at 16:19
  • 6
    I'd ditch, from the outset, the idea that one size fits all. I know that's impractical at the beginning. But the fact is, some exchanges are never going to have a lot of traffic, yet still have a lot of value.
    – ouflak
    Commented Jul 30, 2015 at 12:53
  • @ouflak Yes, definitely - value and traffic do not always correlate directly. In fact, they frequently don't. Part of the reason we like this kind of self-evaluation (in concept, not the implementation we've had for the past few years) is that it encourages people to talk about less tangible quality and value outside the context of metrics like traffic, activity, and so on.
    – hairboat
    Commented Jul 30, 2015 at 20:04
  • 7
    A self evaluation on self evaluations... that's super meta... Commented Jul 31, 2015 at 0:05

4 Answers

76

Okay, I'll take a stab at site quality metrics, liberally borrowing some ideas that have been floated on MSE. Here are the broad sections that are, IMO, essential for judging the progress of a site on a continuing basis:

  • Users
  • Questions
  • Answers
  • Moderation

While some of the metrics have been implemented in SEDE queries, it makes sense to assemble them in one place for perusal by mods and CMs, and the general public, if possible.

Users

The most important bunch of metrics.

  • How many users have posted two or more upvoted answers in the last week? It is better to show the distribution rather than just the total. That's the expert core of the site [1]. (A rough sketch follows the graph below.)

Below is a graph of Space Exploration SE users as nodes, with an edge drawn from X to Y whenever user X posts a positively-scored answer to a positively-scored question from user Y. The dense tangle in the middle is actually the core.

[Graph: Space Exploration community structure]
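
To make the first bullet concrete, here is a rough SEDE (T-SQL) sketch of the weekly core count - not the linked implementation, just an illustration. It assumes the standard SEDE Posts schema (PostTypeId 2 = answer) and uses Score > 0 as a stand-in for "upvoted"; the alias CoreSize and the 7-day window are my own choices.

    -- Rough sketch: users with two or more positively-scored answers
    -- posted in the last 7 days (an approximation of the expert core).
    SELECT COUNT(*) AS CoreSize
    FROM (
        SELECT a.OwnerUserId
        FROM Posts a
        WHERE a.PostTypeId = 2                              -- answers only
          AND a.Score > 0                                   -- proxy for "upvoted"
          AND a.OwnerUserId IS NOT NULL
          AND a.CreationDate > DATEADD(DAY, -7, GETDATE())
        GROUP BY a.OwnerUserId
        HAVING COUNT(*) >= 2
    ) AS Core;

Grouping by OwnerUserId before counting also makes it easy to switch from the headline number to the distribution of answers per core user that the bullet asks for.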

  • How does the size of the core last week compare with the same metric six months ago (expert core growth)? (Ideally, a time-series graph is needed here.) An actual implementation, courtesy of Isaac Moses, is available here as a SEDE query/graph.

Shog9's critique of the query: old posts keep gathering votes over time and thus skew the statistics.


As an illustration, here are the core graphs for the original trilogy, for July 2015 and (to show that the trends persist) January 2016:

[Graphs: Stack Overflow core users, July 2015 and January 2016]

[Graphs: Server Fault core users, July 2015 and January 2016]

[Graphs: Super User core users, July 2015 and January 2016]

  • What proportion of the core users from a year ago is still present in last week's numbers? This measures (100% - expert burnout rate). If core users get bored quickly, it's a problem. (A rough sketch follows this list.)

SEDE query

(100% - yearly burnout rate, measured on week 15 to avoid the Northern Hemisphere's vacation period)

             2010   2011   2012   2013   2014   2015
SO          80.71  78.32  80.32  82.14  84.51  82.17
SF           Jeff  81.51  83.57  82.60  78.04  87.27
SU            N/A  90.90  84.80  86.72  91.07  89.88
Mathematics   N/A    N/A  63.30  69.84  69.55  71.39
U&L           N/A    N/A  70.00  88.09  74.07  80.70
  • Average number of votes per user per day, averaged over the whole user base last week. (voting engagement)
  • Same as above minus upvotes from users with 101 reputation. It's a correction for SuperCollider drive-by voting (corrected voting engagement).
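
As a companion to the burnout bullet above, here is a rough SEDE sketch of yearly core retention - an approximation, not a reproduction of the linked query. It reuses the same "two positively-scored answers in a week" definition of the core, compares last week with the corresponding week one year earlier, and the alias RetentionPct is my own.

    -- Rough sketch: share of last year's weekly core that still appears
    -- in this week's core (100% minus a crude yearly burnout rate).
    WITH CoreThen AS (
        SELECT a.OwnerUserId
        FROM Posts a
        WHERE a.PostTypeId = 2 AND a.Score > 0 AND a.OwnerUserId IS NOT NULL
          AND a.CreationDate >= DATEADD(DAY, -372, GETDATE())
          AND a.CreationDate <  DATEADD(DAY, -365, GETDATE())
        GROUP BY a.OwnerUserId
        HAVING COUNT(*) >= 2
    ),
    CoreNow AS (
        SELECT a.OwnerUserId
        FROM Posts a
        WHERE a.PostTypeId = 2 AND a.Score > 0 AND a.OwnerUserId IS NOT NULL
          AND a.CreationDate > DATEADD(DAY, -7, GETDATE())
        GROUP BY a.OwnerUserId
        HAVING COUNT(*) >= 2
    )
    SELECT 100.0 * COUNT(n.OwnerUserId) / NULLIF(COUNT(t.OwnerUserId), 0) AS RetentionPct
    FROM CoreThen t
    LEFT JOIN CoreNow n ON n.OwnerUserId = t.OwnerUserId;

Note that the numbers in the table above were measured on a fixed calendar week (week 15), while this sketch uses rolling windows relative to today.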

Questions

The self-evaluation conflated the assessment of questions with the assessment of answers. It's better to keep these separate.

  • Number of questions per day over the last week and the last month. Self-explanatory in view of the new graduation guidelines, but the numbers must be current; the Area51 QPD number is unclear and is not sufficient.
  • The ratio of closed, deleted, or non-positively scored questions to the total number of questions asked last month (noise ratio). (A rough sketch follows this list.)
  • Number of views per positively scored question (not closed or deleted) asked last month (the inverse of the view-to-question conversion rate).
  • (Tip of the hat to D.W.) Proportion of questions with a positive score where all answers have zero score, over the last month (the ratio of drive-by "demoralizing" questions).
  • Number of questions reaching the SuperCollider aka "Hot Network Questions" last month. Self-explanatory - rather than lament the algorithm and side effects, we have to measure the network-wide site visibility.
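
A rough SEDE sketch of the noise ratio mentioned above, with one important caveat: deleted posts are not visible in SEDE, so this version can only count closed and non-positively-scored questions. The 30-day window and the alias NoiseRatioPct are my own choices.

    -- Rough sketch: closed or non-positively-scored questions as a share
    -- of all questions asked in the last 30 days (deleted questions are
    -- excluded, since SEDE does not expose them).
    SELECT 100.0 * SUM(CASE WHEN q.ClosedDate IS NOT NULL OR q.Score <= 0
                            THEN 1 ELSE 0 END)
                 / NULLIF(COUNT(*), 0) AS NoiseRatioPct
    FROM Posts q
    WHERE q.PostTypeId = 1                                  -- questions only
      AND q.CreationDate > DATEADD(DAY, -30, GETDATE());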

Answers

  • Average length of answers, in words, averaged over the last month. If the experts are too lazy to type, it's a problem for the site. A distribution would be nice. (A rough sketch follows this list.)
  • Average number of hyperlinks/book or paper references per answer, for answers with positive score only, averaged over the last month. Not sure how to automate this one - refs come in all guises, and no single regex can capture the whole complexity. This metric is not for cross-site comparisons, but it is known that great answers make use of equations/pictures/schematics/hyperlinks/references/code snippets, and one picture is worth a thousand words. This also helps to retain readers' attention. For lack of a better term, I'd dub it the "rich content ratio".
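
SEDE stores post bodies as rendered HTML and has no word-count function, so the answer-length metric can only be approximated there. The sketch below is a rough illustration rather than a finished metric: it counts characters and estimates words by counting spaces, so both numbers are somewhat inflated by markup, and the aliases are my own.

    -- Rough sketch: average answer length over the last 30 days,
    -- in characters and in space-separated "words". The Body column
    -- contains HTML, so both figures overcount somewhat.
    SELECT AVG(CAST(LEN(a.Body) AS FLOAT))                                     AS AvgChars,
           AVG(CAST(LEN(a.Body) - LEN(REPLACE(a.Body, ' ', '')) + 1 AS FLOAT)) AS AvgWordsApprox
    FROM Posts a
    WHERE a.PostTypeId = 2
      AND a.CreationDate > DATEADD(DAY, -30, GETDATE());

The rich-content metric is harder to automate, as noted above; a first cut could count <a>, <img>, and <code> tags in the same Body column, accepting that references given as plain text would be missed.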

Moderation

  • Average size of moderation queues over last week. Samples are taken each hour. (key moderation metric)
  • Proportion of failed audits (if enabled) last month.
  • Proportion of split-vote moderation actions last week.
  • Ratio of actions resolved without a moderator's binding vote to total number of actions last week (1 - moderator escalation ratio - alternative names are welcome)
  • Total number of actions resolved last month by CMs.
  • Ratio of salvaged (closed, edited, reopened) questions to the total number of noise (closed and deleted) questions, averaged over the last month. Courtesy of TildalWave, and aptly named the Tildal Wave Ratio, or salvaged questions ratio. (A rough sketch follows this list.)
  • Number of non-moderator reviewers/closevoters with more than 5 actions last week (moderation core size). (SEDE query on edit reviews, both mods and non-mods)
  • How many members of the moderation core from six months ago were still actively moderating last week (again, 1 - moderation burnout rate)?
  • Ratio of new meta views last week to main site views during the same period (meta relevance ratio).
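
The Tildal Wave Ratio above can be approximated from SEDE's PostHistory table, with the usual caveat that deleted questions are invisible there: the denominator becomes closures only, and "salvaged" is reduced to "closed and later reopened". PostHistoryTypeId 10 and 11 are the closed/reopened events; the 30-day window and the alias SalvagedPct are my own choices.

    -- Rough sketch: questions closed in the last 30 days that were also
    -- reopened in that period, as a share of all closures.
    WITH Closed AS (
        SELECT DISTINCT ph.PostId
        FROM PostHistory ph
        WHERE ph.PostHistoryTypeId = 10                     -- post closed
          AND ph.CreationDate > DATEADD(DAY, -30, GETDATE())
    ),
    Reopened AS (
        SELECT DISTINCT ph.PostId
        FROM PostHistory ph
        WHERE ph.PostHistoryTypeId = 11                     -- post reopened
          AND ph.CreationDate > DATEADD(DAY, -30, GETDATE())
    )
    SELECT 100.0 * COUNT(r.PostId) / NULLIF(COUNT(c.PostId), 0) AS SalvagedPct
    FROM Closed c
    LEFT JOIN Reopened r ON r.PostId = c.PostId;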

Rationale for the choice of averaging periods

  • Stack Exchange activity changes a lot depending on the day of the week. A week is chosen as the averaging period to eliminate this effect and to maintain the interest of CMs, mods, and users in watching the numbers change (it's not fun when the number stays the same all the time).

  • Rare events are averaged over a month (last four calendar weeks, to be precise).


Footnotes:

  1. One can mentally call this number the site's bus factor.
28

Engineering SE is scheduled to begin a site self-evaluation in a few days. If they are to be shut down "this week" I hope that one of two things will occur:

  • They will be shut down before ours begins;
  • Or, if the trigger isn't pulled before ours begins, please allow an in-progress site self-eval to run to completion.

Either of these options would be less confusing for the community than having it pop up and then disappear.

  • 8
    As far as I understand it (and I'll check to make sure I'm right), currently running evals will run their course and just not get scheduled to recur in 6 months.
    – hairboat
    Commented Jul 29, 2015 at 15:10
  • 5
    @abbyhairboat Any in-progress evaluations would be immediately stopped.
    – Adam Lear StaffMod
    Commented Jul 29, 2015 at 17:26
  • 13
    Clarification for whoever flagged my comment: what I said is what will happen if/when the switch is flipped. I'm a dev and Abby is a CM (ish), so there's nowhere to escalate this to. :) She checked with me to see how it would behave. I corrected her comment.
    – Adam Lear StaffMod
    Commented Jul 29, 2015 at 23:51
  • 2
    And to close the loop entirely: I'm going to wait to pull the shutdown trigger until a moment when no evals are running, so we don't end up with a bunch of skeleton eval posts all over child metas.
    – hairboat
    Commented Jul 30, 2015 at 20:03
  • 7
    Let it never be said I know what I'm doing... Forgot we had network-level settings and also per-site overrides, so current evals will complete as scheduled and no new ones will be started (assuming we remember to turn off the 4 per-site overrides currently in place some time in the next few months).
    – Adam Lear StaffMod
    Commented Jul 31, 2015 at 5:40
11

I never participated in a review, but after reading the context and the examples linked there, I think the most important point is «the results were secret, emailed to a handful of people within the company and quickly forgotten about».

While statistics can be automated and published as Deer Hunter suggests, it's useful for the discussion to happen in public: even a chat room or a (public) mailing list (!) is better than private emails.

  • 1
    Yes - definitely. The ideal outcome of whatever system we could create would be to have lots of folks (mostly community members and moderators, with the occasional SE Inc. community manager weighing in) hashing out quality issues and other "how are we doing?" ideas out in the open on site metas.
    – hairboat
    Commented Jul 30, 2015 at 20:34
4

I have participated in site evaluations on a beta site, and I assumed that this was an aspect of progress toward "graduation" (which itself is undergoing a bit of rethinking). However, the results of the evaluation were shared on the site's Meta, so it wasn't so much "secret" as the case described above: "the meta thread posted by the Community user is trivially ignored."

Since participation in these site evaluations is voluntary, they will necessarily suffer from a bias towards folks (like myself) who have the energy and inclination to voice an opinion. Numerically, I thought the participation was reasonable given the amount of site activity.

That did not bother me as much as the conflation between evaluating answers (which the instructions more or less focused on) and the underlying selection of questions for such self-evaluations.

A re-do of this kind of exercise should at least separate the evaluations of the Questions and the Answers.

  • 4
    Good point. "How does this question compare with other answers you've found on the greater internet?" is a prompt that makes approximately no sense whatsoever.
    – hairboat
    Commented Jul 31, 2015 at 19:32
