14

I wonder what the longest question duplicate chain is on SE. For example, Matching files with various extensions using for loop is closed a duplicate of Brace expansion with variable?, which is closed a duplicate of Variables in bash seq replacement ({1..10}) which is closed a duplicate of How do I iterate over a range of numbers defined by variables in Bash?. That's a question duplicate chain length of 3.

7
  • 4
    Do duplicates chains that have since been re-opened (and broke the chain) count...? If so, I may have an answer for this
    – cocomac
    Commented Oct 8, 2023 at 20:29
  • @cocomac I'm guessing it was broken because of an infinite loop? Commented Oct 8, 2023 at 20:30
  • 4
    Yep. See this question or search for "circular duplicates meta stack exchange". If those are fair game, then it'd be infinity most likely. For a proper answer, it's an interesting SEDE SQL question. Not sure about how I'd go about writing SQL to find that, though
    – cocomac
    Commented Oct 8, 2023 at 20:33
  • 3
    @cocomac Thanks, good point, let's ignore circular duplicates. Commented Oct 8, 2023 at 20:35
  • 5
    Here's a list of some of the weirder ones, though I don't know if any of them is the longest. (I once saw an answer by Glorfindel on a since-deleted question that linked an insanely long chain.)
    – CDR
    Commented Oct 8, 2023 at 20:54
  • 4
    A recursive query like this will run on pretty much every site up to and including Server Fault, but unfortunately won't finish on Stack Overflow before timing out. If that could finish there it could be run network wide.... Commented Oct 9, 2023 at 0:31
  • 2
    Honestly, a search like this might make a good community event...
    – Redz
    Commented Oct 10, 2023 at 10:29

1 Answer 1

14

TLDR; 10 Links1 starting from any of:


Intro

Honestly, this was really fun to work through. I had to reevaluate quite a lot of assumptions through this process, like all sites have duplicate questions and that every question closed as a duplicate has a dupe target (both of which are incorrect).

SEDE

I started out on SEDE and made it through any number of variations which all ultimately did not do what was needed. I couldn't get any query to complete on Stack Overflow and couldn't create a query that would run network-wide. I was only able to pull limited sections of sites before it timed out. This query represents one of many many many attempts at solving this problem there and covers less than half of the available sites.

Ultimately, I decided it would be easier to just step away from SEDE and move to the data dumps. I used the 2023-09-12 Stack Exchange Data Dump which is the most current version at this time.

Data Dump Process

I pulled down all the data and pulled out just the PostLinks.xml file from every site. I renamed the files so that they follow the format of {site}-PostLinks.xml and put them in a folder called "PostLinks." A really great benefit of using the data dumps instead of SQL for this is that I can solve it graphically as longest path problem instead of using a recursive lookup.

Approach

Ultimately, I ended up using networkx's dag_longest_path to find the "longest path" per site. Please note, that this function only returns one instance of the longest path so there may be several paths of that length which are not visible here. This was a huge time saver and does fine to answer the question of "What's the longest question duplicate chain on SE?"

Results

At this time, considering the information available the longest duplicate chain is 10 links. This occurs on multiple different sites.

See the following summary table of the top 15 values network wide (unfortunately the full result set is 85k characters and won't fit in an answer on SE, one of the main drawbacks of not using SEDE here):

Network-wide

Site Name Number of links Link to start Path
apple.meta.stackexchange.com 10 https://apple.meta.stackexchange.com/q/455 455 -> 932 -> 1519 -> 1882 -> 2318 -> 2717 -> 2964 -> 3190 -> 3463 -> 3716 -> 4006
gaming.meta.stackexchange.com 10 https://gaming.meta.stackexchange.com/q/1892 1892 -> 3352 -> 5602 -> 8040 -> 10168 -> 11204 -> 12230 -> 12860 -> 13242 -> 14958 -> 15880
drupal.meta.stackexchange.com 10 https://drupal.meta.stackexchange.com/q/2159 2159 -> 2737 -> 3146 -> 3351 -> 3569 -> 3686 -> 3741 -> 3786 -> 3861 -> 3872 -> 3906
math.meta.stackexchange.com 10 Is it possible to change the "integral-dependance" tag into "integral-dependence"? 16727 -> 12485 -> 19037 -> 22348 -> 25694 -> 27653 -> 29718 -> 31103 -> 32974 -> 34525 -> 35451
rpg.meta.stackexchange.com 9 https://rpg.meta.stackexchange.com/q/1497 1497 -> 2723 -> 3222 -> 5288 -> 6024 -> 6717 -> 7721 -> 8737 -> 9802 -> 11612
gis.meta.stackexchange.com 9 https://gis.meta.stackexchange.com/q/540 540 -> 1968 -> 3460 -> 3881 -> 4133 -> 4453 -> 4805 -> 5013 -> 5156 -> 5287
physics.meta.stackexchange.com 9 https://physics.meta.stackexchange.com/q/980 980 -> 2810 -> 5263 -> 6388 -> 7426 -> 9554 -> 10394 -> 11001 -> 12737 -> 13630
stackoverflow.com 8 https://stackoverflow.com/q/72146651 72146651 -> 72145571 -> 72142808 -> 72140680 -> 72128963 -> 72127464 -> 72117405 -> 72113318 -> 72104018
christianity.meta.stackexchange.com 8 https://christianity.meta.stackexchange.com/q/1750 1750 -> 3296 -> 4313 -> 5856 -> 6225 -> 6476 -> 6644 -> 6871 -> 7086
meta.stackexchange.com 8* Open Source Advertising - Sidebar - 1H 2010 31913 -> 53346 -> 74983 -> 93312 -> 114442 -> 132988 -> 162128 -> 195823 -> 210389
meta.stackoverflow.com 7 https://meta.stackoverflow.com/q/350845 350845 -> 323253 -> 303472 -> 278434 -> 260070 -> 250977 -> 258610 -> 252133
movies.meta.stackexchange.com 7 https://movies.meta.stackexchange.com/q/1695 1695 -> 1745 -> 2089 -> 2616 -> 4131 -> 4433 -> 4644 -> 4772
softwareengineering.meta.stackexchange.com 7 https://softwareengineering.meta.stackexchange.com/q/8380 8380 -> 8398 -> 8412 -> 8423 -> 8435 -> 8442 -> 8455 -> 8467
chemistry.meta.stackexchange.com 6 https://chemistry.meta.stackexchange.com/q/2696 2696 -> 3079 -> 3546 -> 4097 -> 4364 -> 4653 -> 5028
unix.stackexchange.com 5 https://unix.stackexchange.com/q/500560 500560 -> 414417 -> 17510 -> 13828 -> 9612 -> 120311

You'll likely notice rather quickly looking through these posts that the majority of duplicate chains this long are the result of year-to-year events which were closed as duplicates of the succeeding year and exist on meta sites.

The one here on MSE is a rather special case because it was migrated to MSO and the duplicate chain continues there. Which makes it technically 11 links (one of which is a migration) and 10 links between duplicate targets (this is the 1 in the TLDR and the asterisk in the table) though it is split between two sites. This approach wouldn't successfully track migrations across network, so it is possible there are longer chains than the above including migrated then closed questions.

Main sites only (excluding MSE)

By ignoring meta sites we can mostly exclude these types of year-to-year closures. Here are the top 15 results from main sites (excluding MSE):

Site Name Number of links Link to start Path
stackoverflow.com 8 https://stackoverflow.com/q/72146651 72146651 -> 72145571 -> 72142808 -> 72140680 -> 72128963 -> 72127464 -> 72117405 -> 72113318 -> 72104018
unix.stackexchange.com 5 https://unix.stackexchange.com/q/500560 500560 -> 414417 -> 17510 -> 13828 -> 9612 -> 120311
math.stackexchange.com 5 https://math.stackexchange.com/q/1617654 1617654 -> 102107 -> 67068 -> 357063 -> 833451 -> 1490794
workplace.stackexchange.com 5 https://workplace.stackexchange.com/q/47560 47560 -> 47451 -> 34828 -> 23527 -> 9032 -> 5560
softwareengineering.stackexchange.com 4 https://softwareengineering.stackexchange.com/q/129951 129951 -> 125435 -> 66438 -> 135311 -> 6395
security.stackexchange.com 4 https://security.stackexchange.com/q/200847 200847 -> 130961 -> 121044 -> 107828 -> 25219
scifi.stackexchange.com 4 https://scifi.stackexchange.com/q/160020 160020 -> 153440 -> 150036 -> 54108 -> 2611
raspberrypi.stackexchange.com 4 https://raspberrypi.stackexchange.com/q/76129 76129 -> 74552 -> 73469 -> 55035 -> 8861
ell.stackexchange.com 4 https://ell.stackexchange.com/q/178587 178587 -> 178501 -> 178230 -> 178487 -> 178246
travel.stackexchange.com 4 https://travel.stackexchange.com/q/127025 127025 -> 100175 -> 37722 -> 29149 -> 66313
ru.stackoverflow.com 4 https://ru.stackoverflow.com/q/569802 569802 -> 563785 -> 494510 -> 422794 -> 506131
superuser.com 4 https://superuser.com/q/564968 564968 -> 207606 -> 504847 -> 85083 -> 22064
english.stackexchange.com 4 https://english.stackexchange.com/q/260200 260200 -> 252217 -> 105116 -> 4700 -> 152
puzzling.stackexchange.com 4 https://puzzling.stackexchange.com/q/66455 66455 -> 59281 -> 59199 -> 41421 -> 183
blender.stackexchange.com 4 https://blender.stackexchange.com/q/40642 40642 -> 19039 -> 15207 -> 7627 -> 1842

Again, there may be more than one chain of the same length on the site. This approach only found one instance of the longest path.


Infinite Cycles

While here I also found 103 circular duplicates (since they needed to be removed before finding a "longest path"). Technically all of the following are the longest path with length of infinite.

Site Name Link to start Cycle Path
meta.stackexchange.com https://meta.stackexchange.com/q/44774 44774 -> 44983 -> 45085 -> 45278 -> 44818
magento.stackexchange.com https://magento.stackexchange.com/q/126728 126728 -> 126655 -> 126645
stackoverflow.com How do I redirect back to calling app using ios custom URL scheme? 16580058 -> 16582148
stackoverflow.com R Newbie - Histograms/Frequency of values incorporating a quantity Column 23836169 -> 23839177
stackoverflow.com https://stackoverflow.com/q/23976362 23976362 -> 23975948
stackoverflow.com Android 6.0: Get package name of current activity 34528283 -> 34506199
stackoverflow.com https://stackoverflow.com/q/32939977 32939977 -> 32909021
stackoverflow.com https://stackoverflow.com/q/31864539 31864539 -> 31845093
stackoverflow.com https://stackoverflow.com/q/29562920 29562920 -> 29564921
stackoverflow.com Go: serve static templates 27424801 -> 27413997
stackoverflow.com Python posting answer from function to entry box in GUI 26936780 -> 26935846
stackoverflow.com Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 1, Size: 1 26727788 -> 26707733
stackoverflow.com Compare two return values 36526872 -> 36526527
stackoverflow.com Hash table to getkey 17526151 -> 17526535
stackoverflow.com DllImport - An attempt was made to load a program with an incorrect format 24297616 -> 24314325
stackoverflow.com UNIX script to check if httpd service is on or off and if off send a email 36791290 -> 36791470
stackoverflow.com https://stackoverflow.com/q/22840688 22840688 -> 22847735
stackoverflow.com How to input integer value to an array, based preceeding row + column values? 21818440 -> 21817821
stackoverflow.com Not receiving eMail 20592292 -> 20592116
stackoverflow.com https://stackoverflow.com/q/20290648 20290648 -> 20290341
stackoverflow.com why MyTextView in android isn't aligned to right? 19905275 -> 19904749
stackoverflow.com Alternative to using span? 19085080 -> 5137465
stackoverflow.com why same code in two technology behaving different 17993032 -> 17993035
stackoverflow.com How to change this ajax code to long polling 15840016 -> 15838715
stackoverflow.com How to control the duration of a VideoWriter function in MatLab? 24761386 -> 24739382
crypto.stackexchange.com https://crypto.stackexchange.com/q/15053 15053 -> 15054
stackoverflow.com Target Smooth Following 37548099 -> 37551707
stackoverflow.com SQL order by date, time 16161627 -> 16163422
blender.stackexchange.com https://blender.stackexchange.com/q/39987 39987 -> 39956
ell.stackexchange.com https://ell.stackexchange.com/q/18904 18904 -> 18898
stackoverflow.com https://stackoverflow.com/q/4156176 4156176 -> 4156249
askubuntu.com bought computer with ubuntu program messed up 194357 -> 194350
askubuntu.com How to get desktop background picture changed periodically? 21787 -> 174460
meta.stackexchange.com https://meta.stackexchange.com/q/45196 45196 -> 45007
raspberrypi.stackexchange.com https://raspberrypi.stackexchange.com/q/22065 22065 -> 22061
askubuntu.com Why should my business choose Ubuntu? 100084 -> 93142
magento.stackexchange.com https://magento.stackexchange.com/q/35499 35499 -> 35510
math.stackexchange.com https://math.stackexchange.com/q/601602 601602 -> 604492
askubuntu.com https://askubuntu.com/q/136880 136880 -> 137210
stackoverflow.com Php sessions switch image 16209881 -> 16208732
stackoverflow.com https://stackoverflow.com/q/6642415 6642415 -> 6642023
meta.stackexchange.com https://meta.stackexchange.com/q/263936 263936 -> 263961
stackoverflow.com https://stackoverflow.com/q/6871993 6871993 -> 6815083
stackoverflow.com https://stackoverflow.com/q/924219 924219 -> 924222
stackoverflow.com System.NullReferenceException: Object reference not set to an instance of an object 6212629 -> 6213271
stackoverflow.com https://stackoverflow.com/q/6071696 6071696 -> 6071956
stackoverflow.com Question about pointers and strings in C 4380090 -> 12121759
stackoverflow.com https://stackoverflow.com/q/5942721 5942721 -> 5915159
stackoverflow.com https://stackoverflow.com/q/6608586 6608586 -> 14186573
stackoverflow.com https://stackoverflow.com/q/7840314 7840314 -> 3982878
stackoverflow.com https://stackoverflow.com/q/5778149 5778149 -> 5778167
stackoverflow.com get ip address of all connected pc in c# 5360352 -> 5360497
stackoverflow.com can anyone tell me how to use switch? 5639372 -> 5637799
travel.stackexchange.com https://travel.stackexchange.com/q/64432 64432 -> 64429
superuser.com assign router ip address to web server 201546 -> 201550
stackoverflow.com https://stackoverflow.com/q/693210 693210 -> 464287
stackoverflow.com https://stackoverflow.com/q/2793081 2793081 -> 2793091
stackoverflow.com php - strstr with an ending needle 12050673 -> 3969907
stackoverflow.com https://stackoverflow.com/q/616168 616168 -> 615843
stackoverflow.com https://stackoverflow.com/q/662607 662607 -> 661447
stackoverflow.com https://stackoverflow.com/q/6577914 6577914 -> 6577629
stackoverflow.com https://stackoverflow.com/q/790568 790568 -> 790533
stackoverflow.com User control javascript 2519251 -> 2602398
stackoverflow.com C code compiles as C++, but not as C 3143052 -> 3142420
stackoverflow.com https://stackoverflow.com/q/4746634 4746634 -> 4746780
stackoverflow.com HangMan RandomString Class 13818297 -> 13817886
stackoverflow.com ereg_replace, preg_replace and PHP 5.3.0 6269616 -> 6269693
stackoverflow.com height of the dropdown box display 5600982 -> 5600646
stackoverflow.com strange behaviour of live wallpaper in preview 12870547 -> 12872613
stackoverflow.com https://stackoverflow.com/q/12877232 12877232 -> 12878663
stackoverflow.com https://stackoverflow.com/q/11674473 11674473 -> 11668221
stackoverflow.com https://stackoverflow.com/q/11413173 11413173 -> 11412839
stackoverflow.com https://stackoverflow.com/q/11712020 11712020 -> 11712471
stackoverflow.com https://stackoverflow.com/q/10737474 10737474 -> 10737879
stackoverflow.com Finding whether time is in a defined range 10330922 -> 10329134
stackoverflow.com https://stackoverflow.com/q/9416521 9416521 -> 9415821
stackoverflow.com https://stackoverflow.com/q/9075482 9075482 -> 9075199
stackoverflow.com How can an app detect that it's going to be uninstalled? 18692571 -> 3013823
stackoverflow.com https://stackoverflow.com/q/7536055 7536055 -> 7536087
stackoverflow.com https://stackoverflow.com/q/8037684 8037684 -> 8012429
stackoverflow.com https://stackoverflow.com/q/6579336 6579336 -> 6593441
stackoverflow.com In Perl, how can I produce a PDF file using data in an XML file? 7823913 -> 7776629
askubuntu.com Desktop launcher documentation? 90345 -> 37627
stackoverflow.com How would I save a UIButton's properties and load with a button? 7338218 -> 7336594
stackoverflow.com get image height and width in file tag using javascript 6633197 -> 6633190
stackoverflow.com https://stackoverflow.com/q/6730757 6730757 -> 6731231
meta.stackexchange.com https://meta.stackexchange.com/q/44772 44772 -> 44790
stackoverflow.com How to checkbox id to the modal box? 6805540 -> 6805127
meta.stackexchange.com https://meta.stackexchange.com/q/28154 28154 -> 28126
meta.stackexchange.com https://meta.stackexchange.com/q/48307 48307 -> 48299
english.stackexchange.com https://english.stackexchange.com/q/129147 129147 -> 129143
chemistry.stackexchange.com https://chemistry.stackexchange.com/q/44934 44934 -> 44935
math.meta.stackexchange.com https://math.meta.stackexchange.com/q/19635 19635 -> 19636
math.stackexchange.com https://math.stackexchange.com/q/975148 975148 -> 974975
math.stackexchange.com https://math.stackexchange.com/q/705745 705745 -> 705650
electronics.stackexchange.com https://electronics.stackexchange.com/q/95170 95170 -> 95043
stackoverflow.com https://stackoverflow.com/q/311555 311555 -> 385620
wordpress.stackexchange.com https://wordpress.stackexchange.com/q/35299 35299 -> 35285
tex.stackexchange.com https://tex.stackexchange.com/q/192832 192832 -> 192636
superuser.com Compress pdf file? 207606 -> 504847
superuser.com cannot delete file on windows or mac 846634 -> 846635
superuser.com Would an old laptop run with converter SATA to IDE? 554419 -> 554452
superuser.com https://superuser.com/q/927977 927977 -> 927973

I also found 17 self duplicates which appear to be a remnant of a very old duplicate closure process wherein the question was just closed because the question already exists but relied on a user to either comment with a link or edit the link into the post themselves. This appears to have been programmatically stored as a self-referencing duplicate link. They all exist on Stack Overflow have been closed for ~14 years.

Site Name Link to start Cycle Path
stackoverflow.com https://stackoverflow.com/q/686052 686052
stackoverflow.com https://stackoverflow.com/q/351819 351819
stackoverflow.com https://stackoverflow.com/q/563626 563626
stackoverflow.com https://stackoverflow.com/q/807502 807502
stackoverflow.com https://stackoverflow.com/q/389971 389971
stackoverflow.com https://stackoverflow.com/q/730577 730577
stackoverflow.com https://stackoverflow.com/q/298164 298164
stackoverflow.com https://stackoverflow.com/q/887293 887293
stackoverflow.com https://stackoverflow.com/q/601361 601361
stackoverflow.com https://stackoverflow.com/q/318242 318242
stackoverflow.com https://stackoverflow.com/q/156872 156872
stackoverflow.com https://stackoverflow.com/q/268730 268730
stackoverflow.com https://stackoverflow.com/q/560291 560291
stackoverflow.com https://stackoverflow.com/q/450161 450161
stackoverflow.com https://stackoverflow.com/q/94558 94558
stackoverflow.com https://stackoverflow.com/q/552256 552256
stackoverflow.com https://stackoverflow.com/q/151375 151375

Here is the approach I used to find these:

from glob import glob
from pathlib import Path
from typing import List, Generator

import networkx as nx
import pandas as pd
from tqdm import tqdm


def make_df_from_xml(fp: Path) -> pd.DataFrame:
    df = pd.read_xml(fp)
    df = df.loc[
        df['LinkTypeId'].eq(3),  # Only care about duplicate link types
        ['PostId', 'RelatedPostId']
    ].copy(deep=True)
    return df


def save_cyclic_deps(dupe_cycles: List[object]) -> None:
    if not dupe_cycles:
        return

    p = Path('./raw_cycle_output.csv')
    if not p.exists():
        pd.DataFrame(dupe_cycles).to_csv(p, index=False)
    else:
        pd.concat([
            pd.read_csv(p),
            pd.DataFrame(dupe_cycles)
        ]).drop_duplicates().sort_values(
            by='Cycle Length',
            ascending=False
        ).to_csv(p, index=False)


def ids_list_to_str(ids: List[int]) -> str:
    return ' -> '.join(map('{:.0f}'.format, ids))


def create_link_to_start(site: str, ids: List[int]) -> str:
    if not ids:
        return ''
    return f'https://{site}/q/{ids[0]}'  # Link to start of path/cycle


def process_individual_site(site_name: str, df: pd.DataFrame) -> List[int]:
    DG: nx.DiGraph = nx.from_pandas_edgelist(df, source='PostId', target='RelatedPostId', create_using=nx.DiGraph())

    dupe_cycles = []

    # Remove any circular dependencies
    for coord in list(nx.simple_cycles(DG)):
        dupe_cycles.append({
            'Site Name': site_name,
            'Cycle Length': len(coord),  # Technically infinite
            'Link to start': create_link_to_start(site_name, coord),
            'Cycle Path': ids_list_to_str(coord)
        })
        if len(coord) == 1:
            # For Stack Overflow
            # Duplicate closures used to just be "This question already has answers here"
            # and a user would have to manually edit or comment with a link
            # this appears to be stored programmatically as a self cycle
            s, = coord
            DG.remove_edge(s, s)
        else:
            s, *_, e = coord
            # Remove back edge
            DG.remove_edge(e, s)

    save_cyclic_deps(dupe_cycles)

    res = nx.dag_longest_path(DG)
    return res


def get_longest_path_for_all_sites() -> Generator[dict]:
    progress_glob = tqdm(glob(r'.\PostLinks\*-PostLinks.xml'))

    for post_links in progress_glob:
        p = Path(post_links)
        [site, *_] = p.stem.split('-')

        # Update progress with current site
        progress_glob.set_postfix_str(site)

        longest_path = process_individual_site(site, make_df_from_xml(p))

        yield {
            'Site Name': site,
            'Number of links': len(longest_path) - 1,  # Edge count instead of node count
            'Link to start': create_link_to_start(site, longest_path),
            'Path': ids_list_to_str(longest_path)
        }


def main() -> None:
    pd.DataFrame(
        get_longest_path_for_all_sites()
    ).sort_values(
        by='Path Length',
        ascending=False
    ).to_csv('./raw_aggregated_output.csv', index=False)


if __name__ == '__main__':
    main()

2
  • 3
    RE the final section, about questions closed as duplicates without targets: prior to May 20, 2009, the system did not prompt for a target when users voted to close as duplicates, and prior to February 2013, targets weren't saved in the database (then were backfilled using info from the automatic edits). See this comment thread and its link. Commented Oct 10, 2023 at 4:33
  • Regarding some of the infinite loops: I looked at this question which is closed as a dupe of another question that seems to have been deleted ... Commented Oct 10, 2023 at 7:16

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .