I wonder what the longest question duplicate chain is on SE. For example, Matching files with various extensions using for loop is closed a duplicate of Brace expansion with variable?, which is closed a duplicate of Variables in bash seq replacement ({1..10}) which is closed a duplicate of How do I iterate over a range of numbers defined by variables in Bash?. That's a question duplicate chain length of 3.
-
4Do duplicates chains that have since been re-opened (and broke the chain) count...? If so, I may have an answer for this– cocomacCommented Oct 8, 2023 at 20:29
-
@cocomac I'm guessing it was broken because of an infinite loop?– Franck DernoncourtCommented Oct 8, 2023 at 20:30
-
4Yep. See this question or search for "circular duplicates meta stack exchange". If those are fair game, then it'd be infinity most likely. For a proper answer, it's an interesting SEDE SQL question. Not sure about how I'd go about writing SQL to find that, though– cocomacCommented Oct 8, 2023 at 20:33
-
3@cocomac Thanks, good point, let's ignore circular duplicates.– Franck DernoncourtCommented Oct 8, 2023 at 20:35
-
5Here's a list of some of the weirder ones, though I don't know if any of them is the longest. (I once saw an answer by Glorfindel on a since-deleted question that linked an insanely long chain.)– CDRCommented Oct 8, 2023 at 20:54
-
4A recursive query like this will run on pretty much every site up to and including Server Fault, but unfortunately won't finish on Stack Overflow before timing out. If that could finish there it could be run network wide....– Henry EckerCommented Oct 9, 2023 at 0:31
-
2Honestly, a search like this might make a good community event...– RedzCommented Oct 10, 2023 at 10:29
1 Answer
TLDR; 10 Links1 starting from any of:
- https://apple.meta.stackexchange.com/q/455
- https://gaming.meta.stackexchange.com/q/1892
- https://drupal.meta.stackexchange.com/q/2159
- Is it possible to change the "integral-dependance" tag into "integral-dependence"?
Intro
Honestly, this was really fun to work through. I had to reevaluate quite a lot of assumptions through this process, like all sites have duplicate questions and that every question closed as a duplicate has a dupe target (both of which are incorrect).
SEDE
I started out on SEDE and made it through any number of variations which all ultimately did not do what was needed. I couldn't get any query to complete on Stack Overflow and couldn't create a query that would run network-wide. I was only able to pull limited sections of sites before it timed out. This query represents one of many many many attempts at solving this problem there and covers less than half of the available sites.
Ultimately, I decided it would be easier to just step away from SEDE and move to the data dumps. I used the 2023-09-12 Stack Exchange Data Dump which is the most current version at this time.
Data Dump Process
I pulled down all the data and pulled out just the PostLinks.xml file from every site. I renamed the files so that they follow the format of {site}-PostLinks.xml
and put them in a folder called "PostLinks." A really great benefit of using the data dumps instead of SQL for this is that I can solve it graphically as longest path problem instead of using a recursive lookup.
Approach
Ultimately, I ended up using networkx's dag_longest_path to find the "longest path" per site. Please note, that this function only returns one instance of the longest path so there may be several paths of that length which are not visible here. This was a huge time saver and does fine to answer the question of "What's the longest question duplicate chain on SE?"
Results
At this time, considering the information available the longest duplicate chain is 10 links. This occurs on multiple different sites.
See the following summary table of the top 15 values network wide (unfortunately the full result set is 85k characters and won't fit in an answer on SE, one of the main drawbacks of not using SEDE here):
Network-wide
Site Name | Number of links | Link to start | Path |
---|---|---|---|
apple.meta.stackexchange.com | 10 | https://apple.meta.stackexchange.com/q/455 | 455 -> 932 -> 1519 -> 1882 -> 2318 -> 2717 -> 2964 -> 3190 -> 3463 -> 3716 -> 4006 |
gaming.meta.stackexchange.com | 10 | https://gaming.meta.stackexchange.com/q/1892 | 1892 -> 3352 -> 5602 -> 8040 -> 10168 -> 11204 -> 12230 -> 12860 -> 13242 -> 14958 -> 15880 |
drupal.meta.stackexchange.com | 10 | https://drupal.meta.stackexchange.com/q/2159 | 2159 -> 2737 -> 3146 -> 3351 -> 3569 -> 3686 -> 3741 -> 3786 -> 3861 -> 3872 -> 3906 |
math.meta.stackexchange.com | 10 | Is it possible to change the "integral-dependance" tag into "integral-dependence"? | 16727 -> 12485 -> 19037 -> 22348 -> 25694 -> 27653 -> 29718 -> 31103 -> 32974 -> 34525 -> 35451 |
rpg.meta.stackexchange.com | 9 | https://rpg.meta.stackexchange.com/q/1497 | 1497 -> 2723 -> 3222 -> 5288 -> 6024 -> 6717 -> 7721 -> 8737 -> 9802 -> 11612 |
gis.meta.stackexchange.com | 9 | https://gis.meta.stackexchange.com/q/540 | 540 -> 1968 -> 3460 -> 3881 -> 4133 -> 4453 -> 4805 -> 5013 -> 5156 -> 5287 |
physics.meta.stackexchange.com | 9 | https://physics.meta.stackexchange.com/q/980 | 980 -> 2810 -> 5263 -> 6388 -> 7426 -> 9554 -> 10394 -> 11001 -> 12737 -> 13630 |
stackoverflow.com | 8 | https://stackoverflow.com/q/72146651 | 72146651 -> 72145571 -> 72142808 -> 72140680 -> 72128963 -> 72127464 -> 72117405 -> 72113318 -> 72104018 |
christianity.meta.stackexchange.com | 8 | https://christianity.meta.stackexchange.com/q/1750 | 1750 -> 3296 -> 4313 -> 5856 -> 6225 -> 6476 -> 6644 -> 6871 -> 7086 |
meta.stackexchange.com | 8* | Open Source Advertising - Sidebar - 1H 2010 | 31913 -> 53346 -> 74983 -> 93312 -> 114442 -> 132988 -> 162128 -> 195823 -> 210389 |
meta.stackoverflow.com | 7 | https://meta.stackoverflow.com/q/350845 | 350845 -> 323253 -> 303472 -> 278434 -> 260070 -> 250977 -> 258610 -> 252133 |
movies.meta.stackexchange.com | 7 | https://movies.meta.stackexchange.com/q/1695 | 1695 -> 1745 -> 2089 -> 2616 -> 4131 -> 4433 -> 4644 -> 4772 |
softwareengineering.meta.stackexchange.com | 7 | https://softwareengineering.meta.stackexchange.com/q/8380 | 8380 -> 8398 -> 8412 -> 8423 -> 8435 -> 8442 -> 8455 -> 8467 |
chemistry.meta.stackexchange.com | 6 | https://chemistry.meta.stackexchange.com/q/2696 | 2696 -> 3079 -> 3546 -> 4097 -> 4364 -> 4653 -> 5028 |
unix.stackexchange.com | 5 | https://unix.stackexchange.com/q/500560 | 500560 -> 414417 -> 17510 -> 13828 -> 9612 -> 120311 |
You'll likely notice rather quickly looking through these posts that the majority of duplicate chains this long are the result of year-to-year events which were closed as duplicates of the succeeding year and exist on meta sites.
The one here on MSE is a rather special case because it was migrated to MSO and the duplicate chain continues there. Which makes it technically 11 links (one of which is a migration) and 10 links between duplicate targets (this is the 1 in the TLDR and the asterisk in the table) though it is split between two sites. This approach wouldn't successfully track migrations across network, so it is possible there are longer chains than the above including migrated then closed questions.
Main sites only (excluding MSE)
By ignoring meta sites we can mostly exclude these types of year-to-year closures. Here are the top 15 results from main sites (excluding MSE):
Site Name | Number of links | Link to start | Path |
---|---|---|---|
stackoverflow.com | 8 | https://stackoverflow.com/q/72146651 | 72146651 -> 72145571 -> 72142808 -> 72140680 -> 72128963 -> 72127464 -> 72117405 -> 72113318 -> 72104018 |
unix.stackexchange.com | 5 | https://unix.stackexchange.com/q/500560 | 500560 -> 414417 -> 17510 -> 13828 -> 9612 -> 120311 |
math.stackexchange.com | 5 | https://math.stackexchange.com/q/1617654 | 1617654 -> 102107 -> 67068 -> 357063 -> 833451 -> 1490794 |
workplace.stackexchange.com | 5 | https://workplace.stackexchange.com/q/47560 | 47560 -> 47451 -> 34828 -> 23527 -> 9032 -> 5560 |
softwareengineering.stackexchange.com | 4 | https://softwareengineering.stackexchange.com/q/129951 | 129951 -> 125435 -> 66438 -> 135311 -> 6395 |
security.stackexchange.com | 4 | https://security.stackexchange.com/q/200847 | 200847 -> 130961 -> 121044 -> 107828 -> 25219 |
scifi.stackexchange.com | 4 | https://scifi.stackexchange.com/q/160020 | 160020 -> 153440 -> 150036 -> 54108 -> 2611 |
raspberrypi.stackexchange.com | 4 | https://raspberrypi.stackexchange.com/q/76129 | 76129 -> 74552 -> 73469 -> 55035 -> 8861 |
ell.stackexchange.com | 4 | https://ell.stackexchange.com/q/178587 | 178587 -> 178501 -> 178230 -> 178487 -> 178246 |
travel.stackexchange.com | 4 | https://travel.stackexchange.com/q/127025 | 127025 -> 100175 -> 37722 -> 29149 -> 66313 |
ru.stackoverflow.com | 4 | https://ru.stackoverflow.com/q/569802 | 569802 -> 563785 -> 494510 -> 422794 -> 506131 |
superuser.com | 4 | https://superuser.com/q/564968 | 564968 -> 207606 -> 504847 -> 85083 -> 22064 |
english.stackexchange.com | 4 | https://english.stackexchange.com/q/260200 | 260200 -> 252217 -> 105116 -> 4700 -> 152 |
puzzling.stackexchange.com | 4 | https://puzzling.stackexchange.com/q/66455 | 66455 -> 59281 -> 59199 -> 41421 -> 183 |
blender.stackexchange.com | 4 | https://blender.stackexchange.com/q/40642 | 40642 -> 19039 -> 15207 -> 7627 -> 1842 |
Again, there may be more than one chain of the same length on the site. This approach only found one instance of the longest path.
Infinite Cycles
While here I also found 103 circular duplicates (since they needed to be removed before finding a "longest path"). Technically all of the following are the longest path with length of infinite.
I also found 17 self duplicates which appear to be a remnant of a very old duplicate closure process wherein the question was just closed because the question already exists but relied on a user to either comment with a link or edit the link into the post themselves. This appears to have been programmatically stored as a self-referencing duplicate link. They all exist on Stack Overflow have been closed for ~14 years.
Here is the approach I used to find these:
from glob import glob
from pathlib import Path
from typing import List, Generator
import networkx as nx
import pandas as pd
from tqdm import tqdm
def make_df_from_xml(fp: Path) -> pd.DataFrame:
df = pd.read_xml(fp)
df = df.loc[
df['LinkTypeId'].eq(3), # Only care about duplicate link types
['PostId', 'RelatedPostId']
].copy(deep=True)
return df
def save_cyclic_deps(dupe_cycles: List[object]) -> None:
if not dupe_cycles:
return
p = Path('./raw_cycle_output.csv')
if not p.exists():
pd.DataFrame(dupe_cycles).to_csv(p, index=False)
else:
pd.concat([
pd.read_csv(p),
pd.DataFrame(dupe_cycles)
]).drop_duplicates().sort_values(
by='Cycle Length',
ascending=False
).to_csv(p, index=False)
def ids_list_to_str(ids: List[int]) -> str:
return ' -> '.join(map('{:.0f}'.format, ids))
def create_link_to_start(site: str, ids: List[int]) -> str:
if not ids:
return ''
return f'https://{site}/q/{ids[0]}' # Link to start of path/cycle
def process_individual_site(site_name: str, df: pd.DataFrame) -> List[int]:
DG: nx.DiGraph = nx.from_pandas_edgelist(df, source='PostId', target='RelatedPostId', create_using=nx.DiGraph())
dupe_cycles = []
# Remove any circular dependencies
for coord in list(nx.simple_cycles(DG)):
dupe_cycles.append({
'Site Name': site_name,
'Cycle Length': len(coord), # Technically infinite
'Link to start': create_link_to_start(site_name, coord),
'Cycle Path': ids_list_to_str(coord)
})
if len(coord) == 1:
# For Stack Overflow
# Duplicate closures used to just be "This question already has answers here"
# and a user would have to manually edit or comment with a link
# this appears to be stored programmatically as a self cycle
s, = coord
DG.remove_edge(s, s)
else:
s, *_, e = coord
# Remove back edge
DG.remove_edge(e, s)
save_cyclic_deps(dupe_cycles)
res = nx.dag_longest_path(DG)
return res
def get_longest_path_for_all_sites() -> Generator[dict]:
progress_glob = tqdm(glob(r'.\PostLinks\*-PostLinks.xml'))
for post_links in progress_glob:
p = Path(post_links)
[site, *_] = p.stem.split('-')
# Update progress with current site
progress_glob.set_postfix_str(site)
longest_path = process_individual_site(site, make_df_from_xml(p))
yield {
'Site Name': site,
'Number of links': len(longest_path) - 1, # Edge count instead of node count
'Link to start': create_link_to_start(site, longest_path),
'Path': ids_list_to_str(longest_path)
}
def main() -> None:
pd.DataFrame(
get_longest_path_for_all_sites()
).sort_values(
by='Path Length',
ascending=False
).to_csv('./raw_aggregated_output.csv', index=False)
if __name__ == '__main__':
main()
-
3RE the final section, about questions closed as duplicates without targets: prior to May 20, 2009, the system did not prompt for a target when users voted to close as duplicates, and prior to February 2013, targets weren't saved in the database (then were backfilled using info from the automatic edits). See this comment thread and its link. Commented Oct 10, 2023 at 4:33
-
Regarding some of the infinite loops: I looked at this question which is closed as a dupe of another question that seems to have been deleted ... Commented Oct 10, 2023 at 7:16