Wikisource:Scriptorium/Help

The Scriptorium is Wikisource's community discussion page. This subpage is especially designated for requests for help from more experienced Wikisourcers. Feel free to ask questions or leave comments. You may join any current discussion or start a new one. Project members can often be found in the #wikisource IRC channel (a web client is available).

Have you seen our help pages and FAQs?



Tilted manually scanned pages

Righted tilted page

This book was manually scanned and the resulting text layer needs to be retyped manually.

I can straighten the page (one of the many), but then, so what? What options do I have? — ineuw (talk) 04:04, 16 June 2024 (UTC)Reply

@Ineuw: I don't understand the question. Why do you want to straighten this page? If you're asking how to straighten all pages in this scan then I advise against it: it's a lot of manual work, and the benefits are limited. Xover (talk) 06:28, 16 June 2024 (UTC)Reply
Thanks. That's what I thought, but needed an experienced opinion. I will retype it when needed. — ineuw (talk) 06:33, 16 June 2024 (UTC)Reply

Help fixing Index:Tropical Cyclone Report – Hurricane Katrina.pdf

I am unsure what I did, but the pagelist for the PDF is not displaying. I tried to display commons:File:Tropical Cyclone Report – Hurricane Katrina.pdf. WeatherWriter (talk) 16:07, 18 June 2024 (UTC)Reply

Fixed. In general, you can try purging the file page on enWS (e.g. File:Tropical Cyclone Report – Hurricane Katrina.pdf) by adding ?action=purge to the end of the URL, then purging the index page. —CalendulaAsteraceae (talkcontribs) 17:43, 18 June 2024 (UTC)Reply
I think it works better to purge it at commons, and also the ?action=purge only works (I think) if you're already in index.php, so it's simpler to use one of the gadgets that do that. — Alien333 (what I did & why I did it wrong) 07:12, 19 June 2024 (UTC)Reply
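Note: for anyone who wants to build the purge links by hand, here is a minimal sketch in Python. The helper name `purge_url`, the host names, and the `/w/index.php` script path are assumptions based on the usual Wikimedia URL layout; none of them come from the replies above. It only constructs the URLs and does not send any request.

```python
from urllib.parse import urlencode


def purge_url(title: str, host: str = "en.wikisource.org") -> str:
    """Build an index.php URL that asks MediaWiki to purge `title`.

    The default host and the /w/index.php path are assumptions
    (the usual Wikimedia layout), not something stated in the thread.
    """
    # MediaWiki treats spaces and underscores in titles as equivalent.
    query = urlencode({"title": title.replace(" ", "_"), "action": "purge"})
    return f"https://{host}/w/index.php?{query}"


file_title = "File:Tropical Cyclone Report – Hurricane Katrina.pdf"
index_title = "Index:Tropical Cyclone Report – Hurricane Katrina.pdf"

# Order suggested in the replies: purge the file page first
# (on Commons and/or enWS), then the index page.
for url in (
    purge_url(file_title, host="commons.wikimedia.org"),
    purge_url(file_title),
    purge_url(index_title),
):
    print(url)
```

Opening one of these URLs in a browser may show a confirmation prompt before the purge actually runs, which is one reason the purge gadgets are more convenient.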

Scan resolution (question for the technical people)

I'm getting frustrated with the poor quality of the scan image when proofreading A Dictionary of Hymnology. Have a look at Page:Dictionary of Hymnology 1908.pdf/44—the fine print is barely legible, even though I have increased the "Scan resolution in edit mode" to 2000. When viewing the PDF directly, the print is perfectly crisp.

I am guessing that the Wikimedia software takes the scan image at its default resolution, heavily JPG-compresses it, then increases the resolution of the compressed image, rather than scaling up before converting and compressing. This results in high-fidelity images of JPEG artefacts instead of actually usable scan images. I also have found a related task phab:T38597, to replace JPG with PNG in these images, which would presumably mitigate this issue—but this ticket is ten years old and hasn't been touched for years.

Anyway, my question is this: is there any way to improve the scan image inside ProofreadPage? Or do I just have to open the PDF in a separate window (which is what I have been doing)? —Beleg Tâl (talk) 18:06, 2 July 2024 (UTC)Reply

Don't know much about it, but there was a discussion a few months ago about the same problem and there the answer given was to use DjVu, not PDF. — Alien333 (what I did & why I did it wrong) 18:36, 2 July 2024 (UTC)Reply
Lol thanks, should have searched the archives first :D —Beleg Tâl (talk) 18:49, 2 July 2024 (UTC)Reply
Taking a quick look at the code, the PdfHandler extension generates jpgs which are then retrieved by us. Which jpg is retrieved might vary but it doesn't regenerate the images at a higher resolution if the original conversion is a poor representation. MarkLSteadman (talk) 21:31, 2 July 2024 (UTC)Reply
It may be possible to regenerate the pdf outside and then upload it such that the conversion goes smoother. MarkLSteadman (talk) 21:36, 2 July 2024 (UTC)Reply
I use User:Inductiveload/jump to file, which is a very useful workaround if the file is from one of the sources it supports, although it is a workaround rather than a proper fix. —CalendulaAsteraceae (talkcontribs) 02:00, 3 July 2024 (UTC)Reply
Let's do a little math…
The file as uploaded has 1796 pages and is 194.31 MB, which works out to about 110 kB per page. A modern smartphone photo averages about 6 MB, which means each page image here is somehow 1/55 the size of the photos your iPhone makes. How is that possible given bulk book-scanning rigs are literally a DSLR mounted over a plate with some lights and other gizmos? Well, if you go look at the raw scan images at IA you'll find they add up to 2.7 GB, and even the cropped and colour-corrected images are 1.3 GB. But surely that's non-compressed images? Oh no, these are JPEG 2000 (.jp2) wavelet-compressed images (i.e. the 1.3 GB is already compressed size), which works out to about 760 kB per page. This means the PDF that IA produces takes already compressed images and then compresses them a further 7x.
Then we get to the images on Commons. When a page image is requested, MediaWiki essentially uses Ghostscript to "print" that page out of the PDF and into a JPEG file. It does this by extracting the image data out of the glorified PostScript format (PDF is just PS with some sugar on top) into its own internal raster representation and then serializing that into the requested file format, in this case JPEG, including (lossy) compression. Proofread Page always requests thumbnails that are 1024 pixels wide (height is set automatically to preserve the original aspect ratio), which means that for page images that were originally less than 1024 pixels wide the extracted image is then decompressed, scaled up to 1024 pixels wide, and then recompressed before being sent to the web browser as a JPEG. Now the OpenSeadragon embedded in Proofread Page takes over and crams that image into the Page:-namespace viewer (OSD requests 1.5x, 2x, and 3x assets too, which complicates this a bit, but let's simplify for illustration purposes). This multiply-rescaled and recompressed image data is then what OSD and the web browser zoom in and out of and which you're trying to proofread from. That is, you're looking at an image that has been lossily recompressed at least 3 times and twice upscaled beyond the image data that was there to begin with.
So why is DjVu better? Well, the purely technical advantage isn't all that huge, but as it happens IA over-compresses their PDF files (they deliberately use very aggressive compression settings when making the PDF in order to achieve a small file size). When I make a DjVu file I grab the original scan images (the 1.3 GB zip), extract and convert the JPEG 2000 files to PPM (lossless), and then directly convert them to DjVu with moderate compression settings. That saves one recompression, and the compression settings are a lot less lossy. In addition, the DjVu compression algorithm (also a wavelet-based algorithm), designed specifically for scanned text (vs. JPEG, which was designed for general photos), does a lot better at preserving original image data (it's a lot less lossy for this case). And finally, instead of the awful Ghostscript-based method for PDFs, MediaWiki uses the native DjVuLibre tools to extract a single page image, and it does a much better job at extracting the page image. MediaWiki (Thumbor) still rescales the resulting image based on Proofread Page's request, but since the starting image is of much higher quality with fewer compression artefacts, the resulting output is usually also much better. There are pathological cases where the result is bad, but these are extremely rare (usually from some random web service that converts the IA PDF into DjVu, which only makes things worse).
So… When I say I strongly recommend using DjVu whenever possible I really mean "Come on people, why would you ever use the IA PDF?!?! Get with the program and use DjVu because even if you have to bend over backwards and jump through hoops to get that DjVu it's still going to be better!" And it's why I have an open invitation to anyone to ask me to make DjVu files for them, that I try to prioritise as much as I can (which isn't very much just now, but…). There are issues with lack of user-friendly end-user tools for DjVu (i.e. you can't view DjVu inline in web browsers any more), and there are big questions about the long-term viability of the format (there's no commercial backing and no significant community around it), but it is still a much much better choice than the current state of PDF and PDF tooling. Longer term (much longer) the new target is probably support for "Collections" on Commons so that we can upload the original JPEG 2000 scans (zero loss) but still get an atomic pseudo-"file" for Proofread Page to work on. But given the pace of development and lack of resources the WMF assigns to both Commons and Wikisource this is still a long way in the future so we still need the lesser evil in the meantime. Xover (talk) 08:40, 3 July 2024 (UTC)Reply
Thanks for the detailed info! I knew that IA highly compresses the PDF files, but since I am able to see the page clearly in a PDF viewer I would not have expected that to be the issue. In fact, Wikisource:DjVu vs. PDF claims that PDF has a higher resolution. Most discussions I have seen (here on enWS, and also on commons) seem to take the view that there is no longer any reason to use DJVU ...
Perhaps I'll need to update Wikisource:DjVu vs. PDF with some additional reasons why DJVU should be preferred where possible :D —Beleg Tâl (talk) 13:21, 3 July 2024 (UTC)Reply
Oh, and the "Scan resolution in edit mode" option in Index: pages… It's been a long time since I dug into what that actually did, so I'm very vague on the details, but as I recall its effect was essentially about how big to display the image in the web browser but the image generated was exactly the same. I.e. it's a kind of hard-to-use zoom that's been obsolete for years. I could be wrong, but my conclusion at the time was that the option was useless. Xover (talk) 08:45, 3 July 2024 (UTC)Reply
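Note: the per-page arithmetic in the "Let's do a little math…" comment above can be re-run with a few lines of Python. This is only a sketch: the input figures are the ones quoted in that comment, the variable names are made up for illustration, and decimal units (1 MB = 1000 kB) are an assumption, so the results differ slightly from the rounded numbers in the thread.

```python
# Figures quoted in the thread; decimal units (1 MB = 1000 kB) are an assumption.
PAGES = 1796
PDF_MB = 194.31          # size of the PDF as uploaded
PHONE_PHOTO_MB = 6.0     # rough size of a smartphone photo
JP2_GB = 1.3             # cropped, colour-corrected JPEG 2000 scans at IA

pdf_kb_per_page = PDF_MB * 1000 / PAGES
jp2_kb_per_page = JP2_GB * 1_000_000 / PAGES

# How many times smaller a PDF page image is than a phone photo.
phone_ratio = PHONE_PHOTO_MB * 1000 / pdf_kb_per_page

# How much further the PDF squeezes the already-compressed JPEG 2000 scans.
recompression_ratio = jp2_kb_per_page / pdf_kb_per_page

print(f"PDF page image:       {pdf_kb_per_page:.0f} kB")
print(f"JPEG 2000 page image: {jp2_kb_per_page:.0f} kB")
print(f"Phone photo / PDF page:   {phone_ratio:.0f}x")
print(f"JPEG 2000 / PDF page:     {recompression_ratio:.1f}x")
```

With these inputs the PDF comes out at a little under 110 kB per page and the further compression at a little under 7x, which is in line with the figures quoted above. The JPEG 2000 per-page size is the most sensitive to the unit convention and to rounding of the 1.3 GB total.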

Brackets for vocal + piano scores

I'm transcribing a score that follows the common pattern of one line of vocal music together with a treble and bass piano part, the piano parts marked with a curly bracket. At the moment, on the pages I've transcribed (2 and 3), the vocal line is also included in the bracket. Could someone fix this? —CalendulaAsteraceae (talkcontribs) 18:41, 4 July 2024 (UTC)Reply

You've got all three Staffs inside the PianoStaff. You need to nest them like this:
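% placeholder notes only; substitute the actual music
<<
  \new Staff { c''1 }              % vocal line, outside the brace
  \new PianoStaff <<
    \new Staff { c'1 }             % piano, treble
    \new Staff { \clef bass c1 }   % piano, bass
  >>
>>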
Beleg Tâl (talk) 22:24, 4 July 2024 (UTC)Reply
Great, thanks! —CalendulaAsteraceae (talkcontribs) 23:00, 4 July 2024 (UTC)Reply

Unfamiliar chord notations

Page:Hello Hello Who's Your Lady Friend.pdf/4 has what appear to be chords, but I'm not familiar with the notation, and I'd appreciate help from someone who is. (I expect once I've seen an example I'll be able to do subsequent pages myself.) —CalendulaAsteraceae (talkcontribs) 18:31, 6 July 2024 (UTC)Reply

Hi, this is sol-fa notation, which is used as an alternate way of representing the melody for those who don't read graphical music. d=doh or the tonic; r=re (supertonic); m=mi (mediant); &c. The lines and colons indicate how long to hold the note. There's currently no satisfactory way of representing this in LilyPond and I have quietly ignored it in the transcriptions I've done. Beeswaxcandle (talk) 02:43, 7 July 2024 (UTC)Reply
Cool, thanks! —CalendulaAsteraceae (talkcontribs) 13:54, 7 July 2024 (UTC)Reply
Follow-up question, what version of LilyPond are we on and does it support repeats with alternate endings? This is for Page:Hello Hello Who's Your Lady Friend.pdf/7. —CalendulaAsteraceae (talkcontribs) 02:14, 8 July 2024 (UTC)Reply
We're on 2.22.0. And yes, repeats with alternate endings are supported. Beeswaxcandle (talk) 06:59, 8 July 2024 (UTC)Reply

Footnote query

I can't figure out how to address the following footnote variation in Index:The Remains of Hesiod the Ascraean, including the Shield of Hercules - Elton (1815).djvu. On page 129 there is a footnote beginning Tis time to sow. This footnote continues onto the following page 130, where there is a footnote within the footnote. I have dealt with such footnotes before where both the main footnote and the sub footnote are on the same page but I can't figure this out. I have tried using the new-ish <refn> but it doesn't seem to support the use of 'name' and 'follow', unlike <ref>, which are needed to cover the spread of the main note over two pages. Any suggestions? Chrisguise (talk) 21:51, 6 July 2024 (UTC)Reply

@Chrisguise The documentation of refn may not look like it supports name and follow, but clicking edit on the template (without touching anything) seems to indicate that name and follow are parameters. I have attempted to use {{refn}} on said pages (and 131 for the additional follow on), and checking your transclusion, it looks like it is working. Thanks for all your efforts, and sorry for Wikisource's (lack of) documentation. Regards, TeysaKarlov (talk) 23:37, 6 July 2024 (UTC)Reply
Thanks for that. I did try refn with 'name' and 'follow' and couldn't get it to work. I guess I must have made a mistake somewhere. Thanks again. Chrisguise (talk) 07:43, 7 July 2024 (UTC)Reply

The first page of the Preface to the Second Edition (two pages below the title page) is missing in the index, but present in the Google Books scan[1]. Can someone who knows add it? I assume, also, that the pages following it in the Index should be moved +1, but I do not know how to do this. Mårtensås (talk) 22:36, 9 July 2024 (UTC)Reply