User:GreenC/WaybackMedic 2.5

WaybackMedic
by GreenC

Wayback Medic 2.5 is a bot that adds and maintains links from the list of known web archive services in use on the English Wikipedia.

Edits made after 2018-12-04 are by version 2.5

The bot operator is User:GreenC. The bot account is User:GreenC bot. The bot (software) is "WaybackMedic".


WaybackMedic Fixes
Fix number Function name Example edit Description Notes Date added
1 fixthespuriousone Example Remove spurious |1= in cite templates. August 2016
2 fixmissingprotocol Example 1. Add https if protocol missing from the archive.org URL.
2. Convert existing protocol http to https.
3. Add second-level domain web if missing (archive.org/web/ → web.archive.org/web/)
4. Add /web/ path (web.archive.org/2016/ → web.archive.org/web/2016/). In some URLs adding /web/ breaks the link, test for those.
HTTPS per RFC August 2016
3 fixemptyarchive Example 1. If |archiveurl= is empty or missing but |archivedate= has content, attempt to find a working archive URL based on the archive date, otherwise add {{dead link}} if appropriate.
2. If |archivedate= is empty or missing but |archiveurl= has content, generate date value based on timestamp in the archive URL.
3. If |archiveurl= and |archivedate= are empty, remove both and leave a {{dead link}} if appropriate.
August 2016
4 fixbadstatus Example Check all Wayback Machine URLs for response code errors (anything but 200s). If an error code, try for a better URL via the Wayback API – first using accessdate, then using the earliest date available. If none there, check WebCite API. Try Memento API which checks a few dozen other archives. Other techniques undocumented. If still none found, remove |archiveurl= and |archivedate= and add {{dead link}}. August 2016
5 Retired
6 fixemptywayback Example The wayback template is mangled in a certain way. Action: re-assemble. It won't delete multiple instances if they exist in the same ref (as in the Example). August 2016
7 fixencodedurl Example The URL was incorrectly encoded. Fully decode URL and re-encode. August 2016
8 fixdatemismatch Example 1. Ensure |archivedate= matches the snapshot date in the URL
2. Ensure date format matches dmy or mdy if set (retain ymd if in use)
August 2016
9 fixwebcitlong Example
Example
Convert WebCite URL's from short-form to long-form
Convert Freezepage.com URL's from short-form to long-form
WebCite Usage January 2017
10 fixstraydt Example Remove stray {{dead link}} template when an archive exists for the link January 2017
11 fixwam Example Merge {{wayback}} and {{webcite}} --> {{webarchive}}
Merge completed February 5, 2017
Webarchive TfM January 2017
12 fixiats Example archive url -> |archive-url) January 2017
13 fixswitchurl Example Move an archive.org URL from |url= to |archiveurl= and add |archivedate= if missing. January 2017
14 Retired
15 fixembway Example
Example
1. A {{wayback}} is embedded in a CS template.
2. A {{dead link}} is embedded in a CS template.
January 2017
16 <various> Example Timestamp and/or |archivedate= is 19700101 and/or out-of-bounds. January 2017
17 fixdoubleurl Example archive.org URLs are doubled, tripled, etc.. January 2017
18 fixemptywebarchive Example {{webarchive}} |date= is missing or empty value. January 2017
19 fixdoublewebarchive Example Remove duplicate {{webarchive}} instances. January 2017
20 fixembwebarchive Example A {{cite web}} is embedded in a {{webarchive}} January 2017
21 fixarchiveis Example
Example
1. Convert Archive.today URL's from short-form to long-form
2. Fix URL encoding of broken links
3. Normalize as "archive.today" see note
Archive.today Usage January 2017
22 fixitems Example Change "/items/" URLs that are using machine IDs BRFA January 2017
23 encodemag Example Convert MediaWiki encoding to url encoding in URLs (ie. {{!}} and {{=}}) RFC3986 January 2017
24 decodespace Example Convert %20 to +, + to %20, etc.. in URLs that can be repaired this way See also June 2017
25 waytree_trailgarb Example
Example
Example
Remove typical garbage characters found at the end of URLs: .,;:-"l(%XX)('') February 2018
26 fixcommentarchive Example Open-up commented-out archives and add a |deadurl= "yes" or "no" February 2018
27 waytree_x2encoding Example Repair double URL-encoding eg. %3A -> %253A February 2018
28 fixencodebug Example Repair missed URL-encoding of square brackets T186417 February 2018
29 fixiats Example
Example
Restore truncated Wayback URL February 2018
30 fixiats Example Convert |title={title} -> |title=Archived copy T203865 September 2018
31 urlchanger Example Move broken URL to a new working URL and undo previous archives. BOTREQ November 2018
32 cosmetic
Example
Example
Example
Example
Example
Edits that might be cosmetic. Only with other edits.
1. Del trailing # in URLs
2. Del empty archive fields
3. archive.is --> archive.today
4. Fix double fragments
5. Convert protocol-relative URLs
WP:PRURL, T214855, Archive.today January 2019
BotWikiAwk
Technical details
  • Changes to URLs are checked against the remote site to ensure they are working
  • Real-time link checks, no link database. However, links are checked over a 24 hour period before final upload of diff.
  • Supports many APIs including the Internet Archive, Memento, WebCite and "Timemap" APIs at individual services
  • Multiple HTTP header status code checks at the application (WaybackMedic) layer
  • Additional time-out and retries built-in to the web transfer libraries.
  • Additional operating-procedure level checks against network and other errors – bot is semi-supervised in known trouble areas.
  • Multiple redundant checks of the APIs using multiple dates to ensure a page really is unavailable
  • Accepts API results but then verifies by looking at page headers and/or contents
  • The bot is primarily written in Nim (compiles to C source) with support utilities in Awk. Libraries were custom made including a string primitives library for regex, a wiki template parsing library, OAuth library (in awk), a MediaWiki API interface library, a soft404 detector.
  • Due to the nature of the task, running the bot includes a fair amount of supervisory overhead so it requires operator training, though the steps are documented in the source package.

Running

edit

The bot takes requests at WP:URLREQ on a per-domain basis. You can request a domain name for the bot to process.

edit

  GreenC, in accordance with the Wikimedia Foundation's Terms of Use, discloses that he has been paid by the Internet Archive for his contributions to Wikipedia. This funding is for the ongoing development of WaybackMedic and a module of InternetArchiveBot related to books.

General sources

edit
  • GitHub is an old public repo. The most current version is not public. The bot is written in Nim and GNU awk.

Citations

edit
edit