
While CMs have been remarkably liberal in leaving even content extremely critical of Stack Overflow untouched here on Meta since Monica was removed as a moderator, it is not completely unreasonable to imagine a scenario in which all Monica-/CoC-related stuff (or even all of Meta, really) is nuked by fiat from above at some point, making it unavailable to anyone outside the company.

Is anybody here in the community scraping the relevant content to be prepared for this eventuality?

This could be in everyone's interest because:

  • There simply are a lot of wonderful, thoughtful, thought-provoking, opinion-changing perspectives here on the gender debates, people's experiences, etc., which are worth preserving for their own sake

  • There could be value in preserving this entire thing (as much of a sad tire fire as it is) as learning material for future community developers/managers

  • We who used to pour a lot of passion and energy into this place and were very active on Meta have a vital interest in having receipts of what was actually said, in case a "SO introduced a new CoC and the transphobes and misogynists were up in arms because they just hate kindness and diversity" type narrative is created (by the company or whoever else) at some point down the line. Such a narrative could harm all of our reputations, just as Monica's was harmed. It wouldn't be the first time this happened on the Internet.

I've been manually taking screenshots of some of the main discussions using Firefox's new screenshot feature, which can capture the whole page. It's just a lot of effort (you have to expand every comment section, etc.) and not efficient at all.

If this isn't already happening: could somebody better equipped than me make this a painless, automatic, perhaps even recurring process (using userscripts or a CLI scraper that can interpret JavaScript)? Ideally including the comment sections that are collapsed by default, and even more ideally the 10k+-only deleted content!


A recursive (daily? hourly?) scrape of the excellent list that Mari-Lou A is curating here would probably be more than enough.
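To illustrate, here is a minimal sketch (in Python, with the requests library) of such a recurring scrape; the URL list is a hypothetical placeholder, and note that plain HTML fetches will not include collapsed comments or 10k+-only content:

# Minimal sketch of a recurring local scrape. The URL list is a placeholder;
# in practice it would be filled from Mari-Lou A's curated list. Plain HTML
# fetches do NOT contain collapsed comments or 10k+-only deleted content.
import time
from datetime import datetime, timezone
from pathlib import Path

import requests

URLS = [
    "https://meta.stackexchange.com/questions/000000",  # placeholder ID
]

while True:
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    for url in URLS:
        html = requests.get(url, timeout=30).text
        question_id = url.rstrip("/").rsplit("/", 1)[-1]
        # one timestamped file per question per run
        Path(f"{question_id}-{stamp}.html").write_text(html, encoding="utf-8")
    time.sleep(24 * 60 * 60)  # daily; reduce for hourly snapshots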

  • If you've got time, you could create a quick script (Python is always a nice place to start) to grab the content of specific questions and all their answers, and save them locally; a minimal sketch follows these comments. The Wayback Machine is of course safer, but stuff can be taken down from there. Commented Nov 15, 2019 at 10:43
  • Good idea. The recent blog post ends with talk of "some new feedback mechanisms we will be announcing next week" which will be "an exciting start to working hand in hand with the community to build a better Stack Overflow". I wouldn't be at all surprised if what that really means is they're going to shut down meta completely and replace it with an announcements board where user input is limited to clicking on hearts to show how much we love our overlords. Commented Nov 15, 2019 at 10:56
  • There should be lots of tools/web services out there to help with scraping websites (maybe even with versioning), but I only found commercial ones. hongkiat.com/blog/web-scraping-tools Might be a good question for softwarerec.exchange Commented Nov 15, 2019 at 16:42
  • It may already have been answered, but: save your time with this screenshot thingy. Screenshots are worthless when push comes to shove. Give me one minute and I'll show you a screenshot where you said that you were the reincarnation of Elvis Presley. These are just pixels. (At least, things like the Wayback Machine are much harder to fake, unless someone who maintains it does a deep dive into their databases...)
    – Marco13
    Commented Nov 15, 2019 at 17:19
  • @Marco13 It's not necessarily about proof in a courtroom, but about not losing part of our shared history. Commented Nov 15, 2019 at 18:05
  • I've been archiving on archive.today every MSE post I read. It's not automated, but might still help someone. I'm highly worried that, like the Wayback Machine, those archives will eventually get taken down by SE too.
    – hftf
    Commented Nov 15, 2019 at 19:07
  • To those who want to look into what web archivers and crawlers are out there, here is a very nice GitHub page: github.com/iipc/awesome-web-archiving // I tried monolith and SingleFile (both the command-line version and the Chrome extension). Both of them saved the SE pages almost completely (I couldn't see the difference) in HTML form (everything packed into one file). However, the comments were not auto-expanded, nor was there a way to auto-archive pages (for multiple answers). This seems to be the problem with Wayback archives as well.
    – 286110
    Commented Nov 15, 2019 at 20:33
  • @286110 There's a script on Stack Apps that could be useful for people who are manually archiving pages. It automatically expands all comments. Commented Nov 15, 2019 at 21:55
  • @rockwalrus-stopharmingMonica Yeah, I tried that before posting my first comment. Tampermonkey found some error and Greasemonkey didn't run the script. I don't know much about user scripts, so I didn't dig into the errors. :(
    – 286110
    Commented Nov 15, 2019 at 22:15
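Following up on the quick-script suggestion in the comments above, a minimal sketch against the public Stack Exchange API; the question ID in the example is a placeholder, the API paginates at 100 items per page, and it does not expose deleted content:

# Minimal sketch: grab a question and all of its answers via the public
# Stack Exchange API (api.stackexchange.com) and save them as JSON.
import json

import requests

API = "https://api.stackexchange.com/2.3"
SITE = "meta.stackexchange"

def fetch_question(question_id):
    """Fetch a question plus all answers, with rendered HTML bodies."""
    common = {"site": SITE, "filter": "withbody", "pagesize": 100}
    question = requests.get(
        f"{API}/questions/{question_id}", params=common, timeout=30
    ).json()["items"][0]
    answers, page = [], 1
    while True:  # follow the API's has_more flag through all answer pages
        resp = requests.get(
            f"{API}/questions/{question_id}/answers",
            params={**common, "page": page}, timeout=30,
        ).json()
        answers.extend(resp["items"])
        if not resp.get("has_more"):
            break
        page += 1
    question["answers"] = answers
    return question

qid = 123456  # placeholder question ID; substitute the post you want saved
with open(f"question-{qid}.json", "w", encoding="utf-8") as f:
    json.dump(fetch_question(qid), f, ensure_ascii=False, indent=2)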

5 Answers


That particular Q&A is archived quite often in the Wayback Machine, about 40-60 times a day (!). It seems to be a combination of users manually archiving it and some kind of automated web crawler. It's archived even more often than, e.g., the Meta Stack Exchange homepage.

[Screenshot: the Wayback Machine's snapshot calendar for the question, showing dozens of captures per day]

An alternative archiving site is Archive Today, but it seems to be updated far less often.
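Anyone can add to those snapshots. A minimal sketch of triggering a capture yourself, via the Wayback Machine's public "Save Page Now" endpoint (unauthenticated use is rate-limited, so a recurring job should pause between requests; the example URL is a placeholder):

# Minimal sketch: ask the Wayback Machine for a fresh snapshot of a page via
# its public "Save Page Now" endpoint. Unauthenticated use is rate-limited.
import requests

def save_to_wayback(url):
    """Request a snapshot of `url`; the response redirects to the new copy."""
    resp = requests.get(f"https://web.archive.org/save/{url}", timeout=120)
    resp.raise_for_status()
    return resp.url  # final URL of the archived snapshot

# placeholder URL; substitute the page you want captured
print(save_to_wayback("https://meta.stackexchange.com/questions/000000"))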

  • That's way better than I expected; I didn't even bother to check! Good to know. It would still be cool to have the contents in the hands of private individuals from the community, though, just in case.
    – Pekka
    Commented Nov 15, 2019 at 10:31
  • Problem is, since the Wayback Machine accepts requests for deletion from content owners, we are only one piece of legal-department advice short of all these being removed. Commented Nov 15, 2019 at 10:31
  • @FrédéricHamidi Not exactly. The legal department can ask for a lot of things; you still need the means in the real world to make that happen. Remember that SE Inc. only has a few hundred employees, tops. Even if they all started to "moderate" content according to the lawyers' ideas, the community would outnumber them. Going down that path would easily ruin them within a few weeks. Of course, that would achieve the opposite of what we want: because then all content we created might be gone for good ...
    – GhostCat
    Commented Nov 15, 2019 at 10:33
  • @GhostCat, I was specifically referring to the Wayback Machine. Surely, sending a "please remove all archived copies of questions X, Y and Z" message does not require too many employees. Commented Nov 15, 2019 at 10:37
  • @GhostCatsaysReinstateMonica Er? SE can take down the entire Meta with the flip of a switch. And as Frédéric said, the Wayback Machine is likely to remove content if the company asks it to. Commented Nov 15, 2019 at 10:37
  • @ReinstateMonica Sure, that they could do. But then they would need to take down all the other meta sites, too; otherwise the storm would rage there. Which then means that all meta topics are gone too. I am pretty sure the platform wouldn't survive that.
    – GhostCat
    Commented Nov 15, 2019 at 10:40
  • "Which then means that all meta topics are gone too. I am pretty sure the platform wouldn't survive that." I'm certain that's their long-term bet, though: replacing Meta with a much more, um, "guided" feedback channel. See the last paragraph in the recent announcement re the voting change: "...along with some new feedback mechanisms we will be announcing next week are an exciting start to working hand in hand with the community..." Face it, that community is not us. 😄
    – Pekka
    Commented Nov 15, 2019 at 10:56
  • @GhostCatsaysReinstateMonica Sorry, but if you think they wouldn't just do that anyway, without having thought about the consequences, you haven't been paying attention. Commented Nov 15, 2019 at 11:00
  • @user56reinstatemonica8 On the other hand, it doesn't help to engage in speculation and prophecy. They do what they want to do, and when that happens, each one of us can decide how to continue.
    – GhostCat
    Commented Nov 15, 2019 at 11:31
  • Sure. But it would be a smart move to be prepared for them to completely nuke Meta from orbit and then look surprised when this proves to be an unpopular move with consequences they don't know how to react to, beyond yet more poorly planned lashing out. Commented Nov 15, 2019 at 11:33
  • @FrédéricHamidi Are we not the owners of the content we produce here, having merely given SE a royalty-free worldwide license to distribute it? If that's correct, I don't believe SE would have any justification for requesting deletion from the archive. Commented Nov 15, 2019 at 13:15
  • @Jeff, well, it has already happened, so clearly SE is in a position to do that. To be fair, this is their website being archived, so I believe that gives them all the agency they need to have archived content removed. Commented Nov 15, 2019 at 13:21
  • @JeffLambert But there isn't only that licence; there are also the terms and conditions between you and SE Inc. SE Inc. takes down harassment and hate speech, for example. Do you really think that SE Inc. doesn't have the right to turn to a mirror and ask them to delete content that (according to SE Inc.) violates their service agreement?!
    – GhostCat
    Commented Nov 15, 2019 at 14:42
  • If the site OWNERS request a backup be deleted, Archive.org will do so. (Also, archive.org is blocked at my workplace... grrr.) But a workaround was found for the JME-cons-explosion in Winter 2018: a blogspot kept a copy of everything (from multiple sites, too), and THAT was archived on the Wayback Machine, and that blog owner would not request anything to be deleted. owleyeview.blogspot.com (At the time of the most change, I was manually putting everything on archive.org, but I can't currently do that part.) Commented Nov 15, 2019 at 17:10
  • For anyone in doubt: archive.org has received requests from SE to remove pages in the past and complied with them. It's not hypothetical; it has been done. More than once.
    – Mast
    Commented Nov 26, 2019 at 13:50

I would like someone with 10k+ reputation (access to deleted content) to copy everything to a free blog or their own site, and then archive that.

An example of that being done (in a different situation) was here: Time to Name Drop and Protect Newbies

Brenna started this because people would often lock down/block information about what JME was really up to, and others sent her anonymous information about their own experiences. Because it was on BlogSpot, under her identity, Facebook blocking didn't apply.

I then made sure that things were updated on archive.org's Wayback Machine, daily or weekly during the peak of the updates; later I slowed down. Having copies/screenshots on Brenna's blog meant that someone involved in JME could not later limit access to them.

It may be slower (and ideally there would be multiple duplicates and mirrors, more independent than even BlogSpot or WordPress.com as hosts), but redundancy is the only security.


Technical notes:

  • Comments are easier than I expected: https://meta.stackexchange.com/posts/{postid}/comments returns content which can be inserted directly into a <ul></ul>
  • Question pages probably need an HTML soup parser. It's been a while since I did any real web dev, so I'm certainly out of date on the specs, but there are inline <script> elements which have unescaped, unclosed HTML tags in strings inside them, and I think that would break a standards-compliant parser.
  • If the goal is just to have the content available for manual processing, it's a bit simpler. Download the question page as https://meta.stackexchange.com/questions/{questionid}?page=1&tab=active and scan for strings of the form <a href="/posts/[0-9]+/edit" to identify answers. If more than a page's worth is found (the threshold is 30; or play it safe and use 1), increment the page parameter and repeat.
  • That might be the easiest non-API way of getting a list of answer IDs anyway. Then, to get the current markdown of a post, fetch https://meta.stackexchange.com/posts/{postid}/edit and look for the only <textarea>. A sketch of this flow follows below.
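A rough sketch of that flow, assuming a logged-in session (cookies copied from the browser; the /edit page needs it, and deleted posts additionally need 10k+ rep) and the current page markup. It's screen scraping, so treat it as fragile:

# Rough sketch of the non-API flow above. Assumes a logged-in session whose
# cookies have been pasted in below; selectors match the current page markup
# and may break without notice.
import html
import re

import requests

BASE = "https://meta.stackexchange.com"
cookies = {}  # paste your browser's session cookies here

def post_ids(question_id):
    """Scan the question page(s) for edit links. Note this also picks up
    the question's own post ID, not just the answers."""
    ids, page = [], 1
    while True:
        page_html = requests.get(
            f"{BASE}/questions/{question_id}",
            params={"page": page, "tab": "active"},
            cookies=cookies, timeout=30,
        ).text
        new = [i for i in re.findall(r'<a href="/posts/(\d+)/edit"', page_html)
               if i not in ids]
        if not new:  # playing it safe: stop when a page adds nothing new
            return ids
        ids.extend(new)
        page += 1

def comments_fragment(post_id):
    """The comments endpoint returns markup that drops into a <ul></ul>."""
    return requests.get(f"{BASE}/posts/{post_id}/comments",
                        cookies=cookies, timeout=30).text

def markdown_source(post_id):
    """Pull the current markdown out of the only <textarea> on the edit page."""
    page_html = requests.get(f"{BASE}/posts/{post_id}/edit",
                             cookies=cookies, timeout=30).text
    match = re.search(r"<textarea[^>]*>(.*?)</textarea>", page_html, re.DOTALL)
    return html.unescape(match.group(1)) if match else ""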

In addition to the Wayback Machine, there are the data dumps:

https://archive.org/download/stackexchange/meta.stackexchange.com.7z

Store them locally; it's only 280 MB. I just did it. Ideally this would be done regularly and incrementally. A minimal download sketch follows.
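The URL is the one above; scheduling the download regularly is left to cron or similar:

# Minimal sketch: stream the dump to disk so the ~280 MB file is never held
# in memory all at once.
import requests

URL = "https://archive.org/download/stackexchange/meta.stackexchange.com.7z"

with requests.get(URL, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    with open("meta.stackexchange.com.7z", "wb") as f:
        for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
            f.write(chunk)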

Unfortunately, the latest available data dump is from 2019-09-04, which is quite a while ago.

P.S.:

User "I am not the way you speak" mentioned in the comments the possibility to make a data dump via the data explorer, which is updated weekly.

Indeed, a simple query like

SELECT *
FROM PostsWithDeleted
ORDER BY
  Id DESC

on https://data.stackexchange.com/meta.stackexchange/query/new goes way back and contains lots of information that can be downloaded as a CSV file (I just did it; the sizes are reasonable).

With a bit more sophistication, all the tables could probably be downloaded in full (you can select at most 50k rows per single query; see the pagination sketch below), or just the content that is missing since the last update of the data dump. The output is easy to process and may be converted into something that resembles the web output here.
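The pagination sketch mentioned above uses keyset pagination: sort ascending by Id and resume each query after the largest Id of the previous batch. The Data Explorer has no official bulk-download API, so this just generates the SQL to paste in by hand:

# Sketch of working around the Data Explorer's 50k-row cap with keyset
# pagination. Paste each generated query into data.stackexchange.com and
# download the CSV; feed the largest Id from that CSV back in for the next.

QUERY = """
SELECT TOP 50000 *
FROM PostsWithDeleted
WHERE Id > {last_id}
ORDER BY Id ASC
"""

def next_query(last_id=0):
    """Build the query for the batch that follows `last_id`."""
    return QUERY.format(last_id=last_id)

print(next_query(0))  # first batch; then next_query(<max Id from the CSV>)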

  • Worth noting that those are only updated every few months; that one is from Sept 02, for example. But I agree, this is the most convenient way of grabbing everything for local archival. Commented Nov 15, 2019 at 16:44
  • @JonasCz-ReinstateMonica Thanks, I just saw that. I thought it was updated more regularly. Commented Nov 15, 2019 at 16:46
  • The latest data dump is from too long ago; this stuff hit around 27th Sep.
    – gbjbaanb
    Commented Nov 15, 2019 at 19:13
  • You can create your own dump through the SE Data Explorer at data.stackexchange.com, which is updated weekly (I believe). Commented Nov 15, 2019 at 20:14
  • @Iamnotthewayyouspeak Is there by chance a recipe for creating a data dump of a whole Stack Exchange site with the Data Explorer? I usually get only a maximum of 50,000 rows at a time. Commented Nov 20, 2019 at 22:33
  • @Trilarion I'm sorry, but I don't know. Commented Nov 21, 2019 at 12:44

The best you can hope for (imho): the Wayback Machine, or some other existing service, even the Google cache.

There is simply no way for an individual, or even a group of individuals, to easily pull off something better that works for the public. There are plenty of technical difficulties to solve, and that takes time and money. But the real issue is (imho) a legal one: how does it help the community if you create such an archive just for yourself? "But I am going to make my archive public!" Then let me ask you: do you have a good lawyer and the money to pay them?

Even when the technical problems can be resolved ... think about it: if you assume SE Inc. is now "evil enough" to pull the plug on MSE for good, then, sorry: what would stop them from sending their lawyers your way to end your "public mirror" of MSE content?!

Beyond that: the underlying point is something that every user who creates content on any third-party-hosted service needs to understand: that content doesn't live on your computer or your storage. It could be gone tomorrow. You might have rights to that content, but if that third party disappears tomorrow, so might "your" content.

Honestly, I look at this in a Zen way: you need to be ready to let things go. Do not get attached to "things", as they can easily be taken from you.

The true beauty of our interactions is the experience we had when reading or writing said content. Even if you can preserve the text, you can't preserve the emotion.

  • "There is simply no way for an individual, or even a group of individuals, to easily pull off something better": not sure I agree; it seems fairly easy to do this. I've been taking screenshots of some of the main discussions using Firefox's new screenshot feature that can capture the whole page. It's just a lot of effort (you have to expand all the comment sections) and not efficient at all. I'm sure userscripts or browser plugins exist that can make this an automated, streamlined process.
    – Pekka
    Commented Nov 15, 2019 at 10:26
  • @PekkasupportsGoFundMonica I am not saying it is impossible. I am saying that it will be hard to come up with something better than what a service like the Wayback Machine gives you for free.
    – GhostCat
    Commented Nov 15, 2019 at 10:31
  • The Wayback Machine is doing this well; that's good to know. There is something to be said in favour of having this content also in the hands of a couple of us, though. Not sure where archive.org stands on big tech players requesting erasure of stuff in their archives.
    – Pekka
    Commented Nov 15, 2019 at 10:33
  • The Wayback Machine does respond to companies requesting removal of their content, though. So it's not much safer there than it is here. Commented Nov 15, 2019 at 10:35
  • @ReinstateMonica And I doubt that any commercial service could offer you a "protected Wayback". When you make money by pulling other sites' content, you have to make sure you don't violate their rights, or they will pull the plug on you in no time.
    – GhostCat
    Commented Nov 15, 2019 at 10:37
  • You're overestimating the difficulty. There are only two mildly non-trivial issues here: the fact that some content is only loaded by JavaScript fetches, and the fact that some content is only visible when authenticated with an account with more than 10k rep. The rest is really, really easy. When Sun changed policy on off-topic discussions on the official Java forum in 2002 and provoked a war between the moderators and the most active users, it took me a couple of hours to knock up a tool which maintained a list of thread IDs and backed those threads up every hour. Commented Nov 15, 2019 at 11:52
  • @PeterTaylor Fine. But what comes next? You think you aren't open to being sued for putting that content up in public? And of course, this doesn't help with deleted comments, and it also doesn't help with content that was fully deleted by a CM, for example, before your script comes back. Next: how do you go about "merging" over time? But I guess I will update the question to emphasize the legal problem I see here.
    – GhostCat
    Commented Nov 15, 2019 at 11:58
  • "You think you aren't open to being sued for putting that content up in public?" I doubt there is any problem publishing/hosting the content. Worst comes to worst, remove the Stack Exchange logo and show only the stuff that is covered by the CC-Wiki license. But I honestly have a hard time seeing how they could attack even publishing literal screenshots of a publicly available resource.
    – Pekka
    Commented Nov 15, 2019 at 12:08
  • @PekkasupportsGoFundMonica "Before the courts and on the high seas, we are in God's hands", to use another German proverb. It is nice to know "most likely, I will win in court", but that assumption alone doesn't help you one bit if some SE Inc. lawyer decides to go after you, does it?
    – GhostCat
    Commented Nov 15, 2019 at 12:33
  • @GhostCatsaysReinstateMonica, on what basis could SE sue me, provided I respect the CC BY-SA licence (preferably both 3.0 and 4.0, to be safe)? Also note that I'm not based in the US, and I find it unlikely that SE has any lawyers on retainer in the less sue-happy country where I live. Commented Nov 15, 2019 at 13:20
  • @PeterTaylor Assuming you want to get at "deleted" content, you need to be an SE network user. This means you have to agree to stackoverflow.com/legal/terms-of-service/public ... are you 100% sure that nothing in there could become a problem? And yes, not being a US citizen might make many things easier.
    – GhostCat
    Commented Nov 15, 2019 at 13:30
  • @GhostCatsaysReinstateMonica The worst thing that may happen is account deletion as part of terminating the legal agreement. Commented Nov 15, 2019 at 23:06
