
Measurement data in publications is often provided only within figures, while the original data is not available. Several very useful tools exist for digitizing such data, such as the web application WebPlotDigitizer, the app Engauge Digitizer, or the digitizer built into Origin, but to my knowledge they only support raster images.

Since publications are usually available in digital form, and the figures therein are often embedded as vector graphics, a more accurate digitization would be desirable. Are there tools that allow one to digitize vector paths from figures directly (similar to the aforementioned tools)?

This question goes beyond precision (see the remarks below) and also concerns an efficient, semi-automated workflow.
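To illustrate what the raw material looks like: the path coordinates themselves can already be read out programmatically. Below is a minimal sketch (assuming PyMuPDF; the file name and page index are placeholders), which is still far from the workflow I am after, since it leaves axis calibration and the separation of individual graphs entirely to the user:

    # Minimal sketch: dump the vector path segments of one PDF page.
    # Assumes PyMuPDF (pip install pymupdf); "paper.pdf" and the page
    # index are placeholders for the publication and figure at hand.
    import fitz  # PyMuPDF

    doc = fitz.open("paper.pdf")
    page = doc[3]  # hypothetical page containing the figure

    for drawing in page.get_drawings():
        for item in drawing["items"]:
            if item[0] == "l":    # straight segment: ("l", p1, p2)
                p1, p2 = item[1], item[2]
                print(f"line ({p1.x:.2f}, {p1.y:.2f}) -> ({p2.x:.2f}, {p2.y:.2f})")
            elif item[0] == "c":  # cubic Bezier: ("c", p1, p2, p3, p4)
                print("curve", [(round(p.x, 2), round(p.y, 2)) for p in item[1:]])

The printed coordinates are in page space; mapping them to data values still requires identifying known tick positions on each axis, and nothing here distinguishes a data curve from axes, grid lines or frames.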


Remarks

  • The achievable accuracy of course depends on the quality of the figure, or more specifically on (i) how the figure was originally produced and (ii) how it was processed during the publication process. Since high-quality plotting tools are often used (e.g., ones that guarantee proper resampling) and journals don't always mangle figures, figures of adequate quality should be available now and then.

  • The problem goes beyond precision in the sense of reading out values (which could be addressed by rasterizing figures at high resolution and using the aforementioned tools). In complex figures, graphs can (i) cover each other up, (ii) overlap themselves due to scatter and line thickness, and (iii) have varying sampling rates. Hence, rasterizing risks misinterpreting the data. Preparing the figure in a vector graphics editor before rasterizing (e.g., hiding individual graphs) would help, but is time-consuming and only solves some of these problems.

  • First, it should of course be checked whether the original numeric data are available, as required by some journals (unfortunately not in many fields). The corresponding author could also be contacted, which often won't lead to success for many reasons, such as unavailability (of the data or the author, after some time) or unwillingness.

Thanks to Martin and Massimo Ortolano, whose contributions inspired some of the remarks.

Comments

  • academia.stackexchange.com/q/158356/13240 Just turn it into a bitmap with the required accuracy and then use the previously mentioned tools. Commented Jan 8, 2021 at 7:40
  • PostScript just is not meant to be converted into data tables. Commented Jan 8, 2021 at 7:42
  • Is your actual problem that the data points cover each other up? Commented Jan 8, 2021 at 8:38
  • I know you want an automated workflow in the end, but as a first step, have you tried looking at the PDF file in a text editor (one that can unzip the contents of a PDF) to see if there's anything that looks like a list of ordered pairs of numbers giving the coordinates of the points on the graph? (A rough sketch of this idea follows these comments.) Commented Jan 10, 2021 at 14:07
  • Well, I used to do this a lot: open in Illustrator -> clean up -> export the points. This works. But several times it would have been easier to grab the data from an image, since there is no guarantee that the data doesn't consist of millions of trimmed triangles with extra points and objects expanded into millions of pieces. But yeah, if it's a well-behaved graph then it's a 2-minute job.
    – joojaa
    Commented Jan 10, 2021 at 14:56
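
A rough sketch of the text-editor suggestion above, assuming the content streams are FlateDecode-compressed (the common case) and that "paper.pdf" stands in for the actual file:

    # Inflate the PDF's compressed streams and grep for path operators:
    # "x y m" (moveto) and "x y l" (lineto) carry the coordinate pairs.
    import re
    import zlib

    raw = open("paper.pdf", "rb").read()
    for m in re.finditer(rb"stream\r?\n(.*?)endstream", raw, re.S):
        try:
            text = zlib.decompress(m.group(1)).decode("latin-1")
        except zlib.error:
            continue  # not a FlateDecode stream (e.g. an embedded image)
        pairs = re.findall(r"([-+\d.]+)\s+([-+\d.]+)\s+[ml]\b", text)
        if len(pairs) > 10:  # long coordinate runs are likely plotted curves
            print(f"{len(pairs)} coordinate pairs, e.g. {pairs[:3]} ...")

The coordinates found this way are still in page units and in drawing order, so the axis mapping and the assignment of points to individual graphs remain manual steps.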

3 Answers

Answer 1 (score 3)

This is an amusing programming/hacking challenge, but my guess is that in 90% of cases the best solution lies in the realm of human affairs, and that’s simply to email the authors and ask for the data.

Why? Two reasons:

  1. When it works (and I expect it would most of the time), you'll know with certainty that the data you have is exactly what the authors were working with, rather than some approximation recovered by trying to reverse-engineer the unknown sequence of human and algorithmic processes that converted the raw data into a figure. (Vector graphics may offer the illusion of lossless encoding, but that assumes no lossy steps were applied by a human or by software anywhere along the way, a dangerous assumption to make in practice.)

  2. On the infrequent occasion when it doesn't work because the authors refuse to share the data, you'll still learn something useful about how trustworthy their data can be assumed to be (i.e., not at all). You can still try to fall back on a technological solution, but in most cases I'd just assume the data is invalid and not worth relying on.

Comments

  • About point 1, I'd be surprised if it works half of the time, for two reasons. First, many groups, especially in competitive fields, are not so willing to share their data. Second, there is the problem of data preservation. If you ask me for the data from my experiments of 25 years ago, or even those of 10 years ago, I have long since lost them. Commented Jan 10, 2021 at 9:20
  • I agree with @MassimoOrtolano but would add a third reason that assumes neither unwillingness nor inability: in some fields many publications are written by PhD students, who are simply no longer active in research and therefore unavailable.
    – Pontis
    Commented Jan 10, 2021 at 10:26
  • And sometimes the authors have passed away.
    – Anyon
    Commented Jan 10, 2021 at 17:53
  • @MassimoOrtolano yes, I was thinking more in the context of recent papers.
    – Dan Romik
    Commented Jan 10, 2021 at 18:20
Answer 2 (score 2)

(a) You will have to live with whatever graphic format, conversion, compression, and thus imprecision the data has been subjected to. Authors are supposed to proofread and confirm that the data is intact, at least on a visual level (or to correct it; I have had high-ranking journals mess up data for no obvious reason). That is, you may be able to recover the numeric data, though not at the original precision; hopefully this loss is negligible compared to the imprecision due to statistical sampling. You may not be able to recover "hidden" data such as higher harmonics.

(b) Consider that even the original data, before upload/conversion for the journal, may contain some drawing-related imprecision, e.g. if people use Excel-smoothed lines without showing the actually measured points (which I of course do not recommend for real data, only for symbolic illustration).

(c) In addition to precision, the second layer of complexity is overplotting: a "hidden" object may or may not be invisibly present in a vector graphic. Authors can mitigate this with techniques such as opacity (or jittering, e.g. for categorical data in "beeswarm plots" in R), but who knows what the authors have actually done?

In summary, if you are fine with the available level of precision, you will probably have to inspect the plot manually. I am not aware of a generic software solution intelligent enough to tell you which data points in a drawn line are measured and which are smoothed, whether data is jittered or overplotted, whether overplotted data must exist but is hidden beneath another object, or even what kind of plot it is in the first place (x/y scatter, categorical beeswarm where x does not matter, heatmap, ...), and, for some data symbols (e.g. numbers/letters), whether the numeric value is represented by the symbol's center or by some corner (depending on the plotting program and its settings).

The alternative is to ask the authors for the data (and tell them honestly what you are doing with it). Also, some journals require the upload of bulk numeric data (e.g. gene expression), so check the journal's policy.

Comments

  • Thanks for your efforts! I added some remarks to the question and upvoted your answer, but I hesitate to accept it, since manual inspection certainly helps (especially via a vector graphics editor) but is not the solution I was hoping for.
    – Pontis
    Commented Jan 9, 2021 at 20:17
Answer 3 (score -3)

TL;DR. Extraction doesn't work; rather than attempting extraction, ask the authors for the raw data.

Since publications are usually available in digital form and figures therein are often embedded as vector graphics

Although many publications are prepared in digital form with vector graphics, such as those prepared in LaTeX, the vector graphics are lost during conversion to camera-ready format, such as PDF, wherein vector graphics...are constructed with paths (source: Wikipedia). I believe precision is lost during conversion (but I'm uncertain). Further research provides evidence that precision is indeed lost:

what in PDF parlance is called an 'image', by definition always is a raster image. There's no such thing as a 'vector image'. Even if the original file which was converted to PDF included vector graphics, the converter program could have decided to include these as a raster image. If you extract this, you'll not get your vector graphics back, but a raster image. Vector graphics which are preserved inside a PDF as such cannot be extracted by pdfimages.

Thus,

Are there tools around which allow to directly digitize vector paths from figures?

The original vector graphics cannot be recovered if the conversion was lossy. Partial reconstruction should nonetheless be possible, with some loss of precision.

Even if the original vector graphics could be extracted, one could never be sure whether they are indeed the originals or some partial reconstruction, which is dangerous for science.
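
Whether a particular figure was rasterized or still consists of vector paths can at least be checked, e.g. with poppler's pdfimages (a minimal sketch; "paper.pdf" is a placeholder):

    # List the raster images embedded in the PDF, page by page.
    # If the figure's page shows an embedded image, the graphic was
    # rasterized and the original vector paths are gone; if it shows
    # none, the figure may still be present as vector paths.
    # Assumes poppler-utils (pdfimages) is installed.
    import subprocess

    listing = subprocess.run(["pdfimages", "-list", "paper.pdf"],
                             capture_output=True, text=True, check=True)
    print(listing.stdout)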

Comments

  • I didn't downvote, but this is not true. Depending on the PDF format, the vector paths are preserved. More often than not, you can import a PDF into Inkscape (for instance) and successfully recover all the paths and objects of the vector graphic. So, in practice, you can even edit a vector graphic stored in a PDF.
    – cinico
    Commented Jan 9, 2021 at 9:53
  • @cinico According to Wikipedia, vector graphics are converted. I had expected some conversion, because that's how PDF documents work. I haven't checked the precise details. I don't doubt that you can extract a vector graphic from a PDF; I don't believe the original vector graphic can be extracted. Are you claiming the contrary?
    – user2768
    Commented Jan 9, 2021 at 10:10
  • Yes, that's what I mean: you can extract the original vector graphic. You can easily try this yourself.
    – cinico
    Commented Jan 9, 2021 at 20:40
  • One thing about text in vector graphics: it's common, as good practice, to convert text to paths because it ensures correct visualization anywhere without needing the fonts. So it's possible that you didn't get the plain text because it was converted to paths. What I do, and it works, is import a PDF into Inkscape and then ungroup (a few times) the objects I've imported. Then I see the original objects with exactly all the nodes and paths of the original object. If you want to try, create a complex object in Inkscape, save it as PDF, and then import it again. (A scripted version of this route is sketched after these comments.)
    – cinico
    Commented Jan 10, 2021 at 17:05
  • Maybe I should add that, for the purpose of extracting data from a paper, you can be sure of what you have recovered, because either you can or you can't reconstruct the object. If you can reconstruct the object, then you know it's the original. What you cannot recover simply won't be reconstructed into vector objects, and you will know.
    – cinico
    Commented Jan 11, 2021 at 13:32
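
The Inkscape route described in these comments can also be scripted. A minimal sketch, assuming Inkscape >= 1.0 on the PATH and that the figure sits on page 4 of "paper.pdf" (both placeholders):

    # Convert one PDF page to SVG headlessly, then list the path data.
    import subprocess
    import xml.etree.ElementTree as ET

    subprocess.run(["inkscape", "paper.pdf", "--pdf-page=4",
                    "--export-type=svg", "--export-filename=fig.svg"],
                   check=True)

    root = ET.parse("fig.svg").getroot()
    for path in root.iter("{http://www.w3.org/2000/svg}path"):
        d = path.get("d") or ""
        print(d[:80])  # raw SVG path commands, still in page coordinates

Whatever ungrouping reveals in the Inkscape GUI shows up here as <path> elements, so the same caveats about text converted to outlines and purely decorative paths apply.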
