45
$\begingroup$

A question that is too short will almost certainly lack context, while a question that is too long runs the risk that readers never find out what is actually being asked (some may not want to read the full question).

Therefore, I was wondering: what is the optimal length, in characters, for a question? I would like to evaluate this statistically, using the Stack Exchange Data Explorer (Data.SE).

To evaluate the community response, I suggest using the following metric:

$$\text{Percentage of upvotes}=\frac{\text{Total number of upvotes}}{\text{Total number of votes}}\cdot 100$$
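
As a quick illustration with made-up numbers, and taking the total number of votes to mean upvotes plus downvotes, a question that has received $18$ upvotes and $2$ downvotes would score:

$$\text{Percentage of upvotes}=\frac{18}{18+2}\cdot 100=90\%$$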

From experience, I think there would be a global maximum around $3000$ characters.

According to this post, the maximum number of characters per question is $30000$ (including spaces). Since most questions seem to be approximately $500$ characters long, I suggest that we represent the data on two different bar charts: one going from $0$ to $2500$ characters in intervals of $50$ characters, the other going from $0$ to $30000$ characters in intervals of $500$. Here is an example of what I mean by "intervals" (except that here I have used a table instead of a bar chart). Obviously, this data is made up:

$$\small\begin{array}{c|c}\text{number of characters}&\text{Percentage of Upvotes}\\\hline1-50&10\%\\51-100&15\%\\101-150&25\%\\ \vdots&\vdots\\2451-2500&90\% \end{array} \qquad \begin{array}{c|c}\text{number of characters}&\text{Percentage of Upvotes}\\\hline1-500&30\%\\501-1000&60\%\\1001-1500&77.5\%\\ \vdots&\vdots\\29501-30000&85\% \end{array}$$

I suggest that we put the number of characters (including spaces) on the horizontal axis and the percentage of upvotes on the vertical axis of the bar chart.


Of course, if you can think of a better way of representing this than a bar chart, feel free to write an answer. Similarly, if you can think of a better metric for evaluating community response, feel free to suggest one in the comments or write an answer using that metric.

Since I lack experience in programming and I have not seen any query which does this, I would appreciate it if you could show us the statistics and conclude with an optimal length.

$\endgroup$
5
  • 3
    $\begingroup$ This would also be helpful for new users, especially the ones that post three liners in the format of "Here's the question: ___ I couldn't do anything". $\endgroup$ Commented Apr 30, 2017 at 12:34
  • 2
    $\begingroup$ Glorfindel answered the question you asked (about upvote percentage), but it's worth noting that the answer seems to be a little different for question score. Basically, longer seems to always be better, but diminishing returns set in a lot later for score than they do for upvote percentage. $\endgroup$
    – Micah
    Commented Apr 30, 2017 at 22:18
  • $\begingroup$ @Micah Thank you for answering! I think this is a good metric to work with (since some posts are very popular but have approximately equal numbers of upvotes and downvotes). $\endgroup$ Commented May 1, 2017 at 5:24
  • 1
    $\begingroup$ This is a very interesting and useful question. But I think that, from an asker's perspective, the interesting metric is not the percentage of upvotes - it's the time until the question is answered. $\endgroup$ Commented May 5, 2017 at 5:45
  • $\begingroup$ @ErelSegal-Halevi In that case, questions with fewer words would probably be answered much sooner (though the quality of the answers is also likely to be worse). $\endgroup$ Commented May 5, 2017 at 10:02

1 Answer

38
$\begingroup$

TL;DR: the longer, the better.

Based on the current data, longer questions tend to get a higher ratio of upvotes than shorter questions. There is not enough data to determine an 'optimal' question length.

Full version

I constructed a SEDE query which lets you play a bit with the interval length, and the maximum considered length. Feel free to fork it to play around yourself.

[Figure: bar chart of 'Upvote %' against 'Post length']

(Note that the vertical axis does not start at zero: that is how SEDE draws the graph, it is not easy to change, and it allows for better comparison.)

We see that shorter questions definitely get a lower upvote percentage, but beyond about 2500 characters the trend seems to halt, or at least it is hard to see because of the 'noise' caused by the small number of questions in that range.
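
To get a rough feel for that noise, one could attach a standard error to each bar. The sketch below is not part of the SEDE query; the function name is hypothetical, and it assumes every vote in an interval is an independent draw, which real votes on the same question are not, so treat the result as a loose guide only.

```python
import math


def upvote_share_with_error(upvotes: int, downvotes: int) -> tuple[float, float]:
    """Return (upvote percentage, standard error in percentage points).

    Assumption: votes are treated as independent trials, so the standard
    error of the proportion p is sqrt(p * (1 - p) / n).
    """
    total = upvotes + downvotes
    if total == 0:
        raise ValueError("interval contains no votes")
    p = upvotes / total
    standard_error = math.sqrt(p * (1 - p) / total)
    return 100 * p, 100 * standard_error
```

With the same 90% share, an interval holding 100 votes has an error ten times larger than one holding 10000 votes, which is why the bars for very long questions jump around.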

For reference, here is the complete SEDE query:

-- Interval width and upper length limit, filled in through SEDE parameters
DECLARE @IntervalLength INT; SET @IntervalLength = ##IntervalLength:int##;
DECLARE @MaximumLength INT; SET @MaximumLength = ##MaximumLength:int##;

-- Integer division rounds each length down to the start of its interval
SELECT (LEN(p.Body) / @IntervalLength) * @IntervalLength AS 'Post length',
  100.0 * SUM(CASE v.VoteTypeId WHEN 2 THEN 1 ELSE 0 END) / COUNT(*) AS 'Upvote %'
  FROM Votes AS v
  INNER JOIN Posts AS p
    ON v.PostId = p.Id
  WHERE p.PostTypeId = 1 -- questions
    AND v.VoteTypeId IN (2, 3) -- up/downvotes
    AND LEN(p.Body) < @MaximumLength -- Body is the rendered HTML, so markup counts too
  GROUP BY (LEN(p.Body) / @IntervalLength) * @IntervalLength
  ORDER BY (LEN(p.Body) / @IntervalLength) * @IntervalLength
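
For readers who prefer to follow the grouping logic outside SEDE, here is a minimal Python sketch of the same idea. It is an illustration, not the query itself: the function name and the input format (one `(body_length, vote_type_id)` pair per vote) are assumptions made for this example.

```python
from collections import defaultdict

UPVOTE, DOWNVOTE = 2, 3  # VoteTypeId values used in the query


def upvote_percent_by_length(votes, interval_length, maximum_length):
    """Group votes by question length and return {interval start: upvote %}.

    votes: iterable of (body_length, vote_type_id) pairs, one per vote
    (a hypothetical input format chosen for this sketch).
    """
    ups = defaultdict(int)
    totals = defaultdict(int)
    for body_length, vote_type_id in votes:
        # Keep only up/downvotes on questions shorter than the maximum
        if vote_type_id not in (UPVOTE, DOWNVOTE) or body_length >= maximum_length:
            continue
        # Floor division rounds the length down to the start of its interval
        start = (body_length // interval_length) * interval_length
        totals[start] += 1
        if vote_type_id == UPVOTE:
            ups[start] += 1
    return {start: 100.0 * ups[start] / totals[start] for start in sorted(totals)}
```

Note that each interval is labelled by its lower bound, so with an interval length of 50 the key 100 covers lengths 100 to 149.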
$\endgroup$
18
  • 2
    $\begingroup$ Hm, the scale on the y axis surprises me on how many upvotes you can get on short answers. Indeed, since the graph does not start at zero, it is misleading, at least that's what I was taught in statistics. $\endgroup$ Commented Apr 30, 2017 at 15:55
  • 1
    $\begingroup$ @SimplyBeautifulArt that's what I get from SEDE (it's not easy to change, and it allows for better comparison). I added a warning below the graph. $\endgroup$
    – Glorfindel
    Commented Apr 30, 2017 at 15:59
  • 1
    $\begingroup$ @SimplyBeautifulArt I also missed the fact that the OP asked about questions, not answers. That has now been corrected. $\endgroup$
    – Glorfindel
    Commented Apr 30, 2017 at 16:18
  • 2
    $\begingroup$ Ah, yes :-). Looks a bit more stable, probably because people aren't so harsh on downvoting long questions versus downvoting long answers? Or perhaps people tend to put more effort into questions? $\endgroup$ Commented Apr 30, 2017 at 16:36
  • $\begingroup$ The upvote percent is more than 90% for length between 300 and 2000. $\endgroup$
    – user312097
    Commented May 1, 2017 at 13:57
  • 7
    $\begingroup$ Curiously, the pattern is very different on Stack Overflow, with questions over 2000 characters long showing a clear downward trend (and also, if you zoom in, with a clear peak around 50 to 70 chars or so, followed by a local minimum around 200 to 250 chars). Most other sites I tried your query (and my fork of it) on show a pattern similar to math.SE, however. $\endgroup$ Commented May 1, 2017 at 23:54
  • 4
    $\begingroup$ Looking at some of the biggest stack exchange sites... superuser shows a third pattern, and English shows no pattern. Mathoverflow is similar to MSE $\endgroup$ Commented May 5, 2017 at 1:17
  • 2
    $\begingroup$ Arqade similarly shows no correlation, and while serverfault shows a pattern, you'll notice that it's very noisy and the curve as a whole represents all questions being high scoring. I question whether this is really a generalizable pattern. What did you look at, @IlmariKaronen? $\endgroup$ Commented May 5, 2017 at 1:20
  • 2
    $\begingroup$ AskUbuntu also shows consistently very high %s, as does Ask Different, which shows an increasing correlation, but again every data point (besides one that's obviously hella noisy) scores above 90%. Together with Stack Overflow and MSE, these are the nine largest stack exchange sites. $\endgroup$ Commented May 5, 2017 at 1:23
  • 2
    $\begingroup$ To round out the top 10 largest sites, we have cross validated which shows a positive correlation and yet again even the very low data points are above 90%. In fact, MSE is the only example in the top 10 where the % is below 85% (ignoring two cases of extreme noisy data). Sorry for making many comments, but the URLs are long and make me hit the character limit very quickly. $\endgroup$ Commented May 5, 2017 at 1:29
  • 1
    $\begingroup$ My bad, the low end of Arqade looks like the low end of MSE, but slightly higher. So those two drop below 85% $\endgroup$ Commented May 5, 2017 at 2:03
  • 5
    $\begingroup$ This looks like it would make a great topic for the SE blog or podcast. $\endgroup$ Commented May 7, 2017 at 13:58
  • 1
    $\begingroup$ @JackM yes, the formula is explained in the question. I felt no need to repeat it in the answer. $\endgroup$
    – Glorfindel
    Commented May 8, 2017 at 14:54
  • 1
    $\begingroup$ The SEDE code you give still references 'answer length' in passing. If that's a relic of the original query, it should probably be modified as well. $\endgroup$ Commented May 11, 2017 at 13:42
  • 1
    $\begingroup$ Thanks, corrected. $\endgroup$
    – Glorfindel
    Commented May 11, 2017 at 13:43
