Project Reduplication of Deduplication Has Begun!

Question

While announcing the second iteration of the Stack Exchange quality project, I not-so-briefly alluded to a collaboration we're kicking off with the University of Melbourne.

The project presents some very interesting possibilities for us if their model validates as well as is hoped:

Knowing very quickly if a question is not a duplicate
Being able to quickly surface the most appropriate (by criteria we provide) duplicate with a high degree of accuracy.
Being able to surface high and low quality groups of questions that essentially ask the same thing that aren't currently being surfaced very well

Essentially, we might be able to alleviate two of the most frequent pain points for new and experienced users alike:

By showing more relevant probable duplicates while folks ask, we should stand to lessen the frequency of duplicate questions (particularly, those that tend to be of not-so-great quality)
We might be in a better position to more clearly identify something as not a duplicate, which could ease those "mean jerks closed my question as a duplicate and it completely wasn't!" moments.

The paper describing the methodology is short and worth reading if you're curious.

So, what do we need from you?

The researchers have already validated data in subject domains where they're knowledgeable enough to make a clear call on something being a duplicate or not. However, we cover a pretty vast group of academic and technical domains, and they simply need knowledgeable people from those communities to help them decide if their method made the right decision.

How are we going to do this?

Easily. They've built a system similar to our review system where you're presented with a pair of questions, and can indicate if they're a duplicate, strongly related, or not a duplicate. You can use your Stack Exchange account to sign in.
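To make the idea concrete, here is a minimal sketch of how reviewer verdicts on question pairs could be compared against a model's predictions. It is only an illustration, not the researchers' actual system: the `Verdict` enum, the `majority_verdict` and `agreement_rate` helpers, the question IDs, and the rule of skipping tied pairs are all assumptions made for this example.

```python
from collections import Counter
from enum import Enum


class Verdict(Enum):
    # The three choices a reviewer has for each pair of questions.
    DUPLICATE = "duplicate"
    STRONGLY_RELATED = "strongly related"
    NOT_DUPLICATE = "not a duplicate"


def majority_verdict(votes):
    """Return the most common verdict, or None when there is no clear winner."""
    top = Counter(votes).most_common(2)
    if not top:
        return None
    if len(top) == 2 and top[0][1] == top[1][1]:
        return None  # tie between the two leading verdicts
    return top[0][0]


def agreement_rate(model_predictions, reviewer_votes):
    """Fraction of pairs with a clear reviewer majority where the model agrees."""
    decided = 0
    agreed = 0
    for pair, predicted in model_predictions.items():
        verdict = majority_verdict(reviewer_votes.get(pair, []))
        if verdict is None:
            continue  # no votes or a tie: skip this pair (an assumed policy)
        decided += 1
        if verdict == predicted:
            agreed += 1
    return agreed / decided if decided else None


# Hypothetical question-ID pairs and votes, made up for illustration.
model_predictions = {
    (101, 202): Verdict.DUPLICATE,
    (103, 204): Verdict.NOT_DUPLICATE,
    (105, 206): Verdict.DUPLICATE,
}
reviewer_votes = {
    (101, 202): [Verdict.DUPLICATE, Verdict.DUPLICATE, Verdict.STRONGLY_RELATED],
    (103, 204): [Verdict.DUPLICATE, Verdict.DUPLICATE],
    (105, 206): [Verdict.DUPLICATE, Verdict.NOT_DUPLICATE],
}
rate = agreement_rate(model_predictions, reviewer_votes)
print(rate)
```

In this made-up data the third pair is tied and is left out, so the rate is computed over the two pairs that have a clear majority.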

Who is eligible?

Those with the ability to cast close votes on:

Android
English Language & Usage
GIS
Mathematica
Physics
Software Engineering
Stats
TeX
Unix & Linux
Webmasters
WordPress

In the rare cases where you have a gold tag badge, but have not yet unlocked close privileges, a gold tag badge will allow you to work on that subset of questions.

What about other sites? What about Stack Overflow?

We'll have to see how things go with this initial set. If everything looks really promising, then they'll take a look at running Stack Overflow through the model. That would be an enormous undertaking since it would entail every single question since the dawn of time, while accounting for duplicate targets being deleted later.

There might also be a need to run a few more smaller sites through - we just have to see how it goes.

Where do we go from here?

Look for a post on your meta site from Doris Hoogeveen letting folks know that we're ready for help if you care to provide it. She'll link to this post for the benefit of folks that missed it.

What if it doesn't pan out?

We're okay with that because we learn quite a bit either way. It's my personal opinion that getting the right duplicates in front of people as they ask is probably the most impactful way that we can elevate the experience of both new and experienced users. If something looks promising in that direction, we feel it's worth exploring.

Awesome. I wanted to sign in but I saw that neither Chem nor ELL are among the test sites. :/ — M.A.R., Commented Oct 26, 2016 at 17:06
(gnat feverishly trying to find a duplicate of this very post) — gnat, Commented Oct 26, 2016 at 17:12
@Andy It's from our public dumps and API. They reached out to us only once they were pretty sure they had something. — user50049, Commented Oct 26, 2016 at 17:34
"In the rare cases where you have a gold tag badge, but have not yet unlocked close privileges" does that even happen? — Cai, Commented Oct 26, 2016 at 17:47
@gnat I have two duplicates of this very Post. They are ages 11 and 2. — user50049, Commented Oct 26, 2016 at 18:29
@TimPost Are they really duplicates or just strongly related? — balpha, Commented Oct 26, 2016 at 18:32
Tim, but we generally don't close older Posts as duplicates of younger ones — gnat, Commented Oct 26, 2016 at 20:11
Do you think it would be possible to get the actual methodology they used to conduct this experiment? The paper mainly seems to detail the results and not how they found those results. — hichris123, Commented Oct 26, 2016 at 20:26
The paper is a very interesting read, curious to see how this pans out on a larger scale and for real-time suggestions — Cai, Commented Oct 26, 2016 at 20:29
We haven't seen a meta post for this on WordPress yet, so I've started it here. — Tim Malone, Commented Nov 6, 2016 at 22:11
The twelve sites are: Android Enthusiasts, TeX - LaTeX, Arqade, Software Engineering, English Language & Usage, Unix & Linux, Physics, Geographic Information Systems, Mathematica, Cross Validated, Webmasters, and WordPress Development. — Monozygotic, Commented Nov 11, 2016 at 3:09

Ian Ringrose · Accepted Answer · 2016-11-01 12:09:15Z

8

Thinking about Stack Overflow…

Does the learning system need to consider it as one site, or can for example the PHP tag be done in isolation?

Likewise we can define collection of tags, C#,VB.NET,.NET, ASP.NET for example that have few questions outside of the “tag group” and lots of questions crossing tags within the group. Then just look for duplications within the given group.
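A rough sketch of the tag-group idea, assuming each question is just an ID with a set of tags. The `TAG_GROUP` contents, the `candidate_pairs` helper, and the sample questions are hypothetical and only show how restricting the search to one group shrinks the number of pairs to compare; how the real model would pick candidates is not something this post describes.

```python
from itertools import combinations

# A hypothetical "tag group": tags that mostly co-occur with one another.
TAG_GROUP = frozenset({"c#", "vb.net", ".net", "asp.net"})


def in_group(tags, group):
    """True when the question carries at least one tag from the group."""
    return bool(set(tags) & group)


def candidate_pairs(questions, group):
    """Return ID pairs of questions that both belong to the tag group."""
    members = sorted(qid for qid, tags in questions.items() if in_group(tags, group))
    return list(combinations(members, 2))


# Made-up questions: ID -> set of tags.
questions = {
    1: {"c#", "linq"},
    2: {"asp.net", "c#"},
    3: {"php", "mysql"},
    4: {"vb.net"},
    5: {"python"},
}

pairs = candidate_pairs(questions, TAG_GROUP)
all_pairs = list(combinations(sorted(questions), 2))
print(len(pairs), "pairs inside the group versus", len(all_pairs), "pairs overall")
```

Only questions 1, 2, and 4 fall inside the group, so the duplicate check would look at 3 pairs instead of all 10.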

edited Nov 1, 2016 at 12:09

answered Nov 1, 2016 at 10:29


I believe it's going to need to work in a similar manner as our tag prediction engine - as soon as it can begin to hazard a guess about what you're talking about, it'll infer tags, and start looking. It kind of has to work that way because in the absence of code, a question asking how to connect to a database would be remarkably similar between C# and PHP. But the cool part is once we have the data sets validated, they can start trying all sorts of things with the model since they'll have 'known correct' outcomes they can measure against.
– user50049
Commented Nov 7, 2016 at 5:58


Community · Accepted Answer · 2017-03-16 15:44:09Z

I just posted my concerns about this on stats.stackexchange, so I will follow up with a summary here: I am really concerned that this will prevent people from asking questions or getting help on Stack Exchange. Here are my reasons:

Synthesis Questions: I am worried that this project will limit synthesis questions, which are questions re-asked for the purpose of getting a final answer. Usually these are asked when the original question 1) has multiple answers that have changed over the years or 2) has had conflicting answers over the years.
Potential for Over-fitting: The unstructured data from Stack Exchange posts is prone to overfitting, and we have to be careful of that.
Skewed toward/against the questions of new users: Depending on how you code it, the algorithm is going to be skewed toward or against new users in a significant way.
1. [Reason for the Skew Against] Basically, new users are more prone than older users to producing missing values in this text-mined dataset, because the reputation system promotes such behavior.
2. [Reason for the Skew Toward] On the other hand, new users are less familiar with how to navigate Stack Exchange and are thus more likely to ask duplicate questions.
Stack Overflow will be an outlier: This will be either very good or very bad for Stack Overflow because 1) asking how to code something means there are at least 5 valid duplicates asking how to code it in other languages, and 2) there is the dependency on tags as a classifier, cluster, or filter of questions that @IanRingrose mentioned.

The duplicate questions tag does not automatically delete questions. I am worried that this safeguard of the duplicate questions tag will be removed when Project Reduplication of Deduplication is released into the full wild of Stack Exchange. Again, I explain each of these reasons in more detail in the link below.

P.S. Also, I just posted an example of reason 1.2 (Synthesis Question) on Cross Validated.

I'm not convinced synthesis questions should ever be asked in the first place. If the question has lousy answers (conflicting and non-authoritative, changing, missing, wrong), then the correct way to fix it is to either make a useful edit in order to bump and give additional details, or add a bounty. — Nathan Tuggy, Commented Dec 10, 2016 at 7:53
