56

There are many times when I am faced with the task of extracting data from a published graph (usually a bitmap image in a paper). For example, a scatter plot from which I would like to get a list of individual (x, y) coordinates for the points.

One option is to ask the contact author for raw data. Most will do it, sometimes in nice ASCII format, sometimes in Excel files, sometimes in formats that I cannot open (chemists are fond of software like Origin or Igor Pro). Some authors never reply, or ask questions like “what do you want to do with it?”. In all cases, it takes time. Sometimes, it's not even possible (I can hardly email the author of a 1936 paper!).

The other option is to extract the data. I currently use g3data to do that, but for large scatter plots, having to click on every single point is tedious. Thus, I am looking for data extraction software that could recognize individual points automagically, and possibly filter them by point color or symbol used. Is that even something that exists? What other tools can you recommend to work around this issue?

I don't think it'd be appropriate to have extra requirements on the software, so I'm happy with free or commercial solutions, running on any OS. Of course, if given the choice, I'd prefer open source software running on Linux and Mac OS.

10
  • 15
    The problem with extracting the data from a printed graph is that the process will introduce errors. Then what can you really say about the data you have?
    Commented Feb 1, 2013 at 8:46
  • 9
    @DaveClarke Yes, the process introduces some uncertainty, but if the graph resolution is good, the uncertainty can be low. Also, sometimes there is no choice: I recently digitized data from a 1936 paper, I can hardly imagine emailing the author :)
    – F'x
    Commented Feb 1, 2013 at 8:52
  • 3
    An option you didn't mention in the question is to reproduce the experiment yourself. While in some cases it is a waste of time you'd like to avoid, depending on the nature of the experiment, it may be an interesting solution.
    – T. Verron
    Commented Feb 1, 2013 at 12:24
  • 9
    This has been asked on the stats forum, see Software needed to scrape data from graph.
    – Andy W
    Commented Feb 1, 2013 at 13:01
  • 4
    Edge detection in image processing is not easy; it gets harder if you have anything besides black and white. So the main difficulty is not in the "conversion to tabular" but the "finding the data points" part of the problem; you may have better luck asking on dsp.stackexchange.com
    Commented Feb 1, 2013 at 16:00

6 Answers

17

Here is a very good online tool: http://arohatgi.info/WebPlotDigitizer/app/

3
  • 3
    There's a now famous/infamous blog by a professor at Berkeley where he and his lab carefully read and dissect papers in bioinformatics. I saw him mention this tool. If this guy uses it, it's probably quality. liorpachter.wordpress.com/2014/02
    – vector07
    Commented Sep 24, 2014 at 1:40
  • 4
    The link in the answer is dead, here's the new one: automeris.io/WebPlotDigitizer
    Commented Jan 30, 2019 at 22:14
  • 1
    Automeris.io has become old and is glitchy. PlotDigitizer is what I often use nowadays, plus it looks more professional than Automeris.io.
    – Anonymous
    Commented Jan 8, 2021 at 16:17
12

A colleague suggested I use GraphClick, a Mac OS application that includes (according to its website):

  • Automatic detection of curves (solid, dotted or dashed), symbols, bar charts, or perimeters of areas
  • Frame-by-frame digitization of QuickTime movies

The latter is something I had not thought about, but might actually be useful for some teaching needs (analysis of motion from a video). My first experiences are good: the software is easy to use, includes a nice magnification UI, and automatic curve detection works fine if the graph is “clean”.


And here's a list of other possible software from this answer on Cross Validated (link thanks to @AndyW and @Paresh):

  • Engauge Digitizer (free software, GPL license) auto point / line recognition. Available in Ubuntu repository (engauge-digitizer)
  • Get Data (shareware, free trial version, $30 for personal license) has zoom window, auto point / line recognition
  • DigitizeIt (shareware, free trial version, $49 for personal license) auto point / line recognition
1
  • GraphClick was amazing software! Worked really well. Unfortunately it had problems on some Retina-based Macs, and is now dead as it's 32-bit and all modern OS X is 64-bit only. Parallels or VMware can run/emulate older OS X versions that run 32-bit software. Hopefully anyone attempting this isn't bitten by the zoom/Retina bug.
    – AllenH
    Commented Apr 24, 2020 at 20:23
8

I used DataThief years ago. From what I remember, it is not fully automated. You start by loading a digital image and identifying the axes, some tick marks, the axis limits and the scale (i.e., linear/log/polar). This lets it handle bad scans (e.g., rotation and warping). Once it knows the bounding box of the plot, you then tell it what to extract (curves, points, errorbars, etc.).
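The axis-identification step can be sketched in a few lines. This is only an illustration of the general idea, not DataThief's actual algorithm: it assumes the axes are aligned with the pixel grid (no rotation or warping), and the function name `make_axis_mapper` and all pixel positions below are made up for the example.

```python
import math


def make_axis_mapper(pixel_a, value_a, pixel_b, value_b, scale="linear"):
    """Return a function that maps a pixel coordinate to a data value.

    pixel_a and pixel_b are the pixel positions of two known tick marks
    on one axis; value_a and value_b are the data values printed at them.
    Assumes the axis is parallel to the pixel grid (hypothetical helper).
    """
    if pixel_a == pixel_b:
        raise ValueError("reference ticks must be at different pixels")
    if scale == "log":
        if value_a <= 0 or value_b <= 0:
            raise ValueError("log axes need positive reference values")
        # Interpolate in log10 space, then convert back at the end.
        value_a, value_b = math.log10(value_a), math.log10(value_b)
    elif scale != "linear":
        raise ValueError("scale must be 'linear' or 'log'")

    slope = (value_b - value_a) / (pixel_b - pixel_a)

    def to_data(pixel):
        value = value_a + slope * (pixel - pixel_a)
        return 10 ** value if scale == "log" else value

    return to_data


# Made-up calibration: x axis runs from 0 (pixel column 100) to 10
# (column 500); y axis is logarithmic, 1 at pixel row 400 and 1000 at
# row 100 (image rows grow downward, hence the reversed order).
x_of = make_axis_mapper(100, 0.0, 500, 10.0)
y_of = make_axis_mapper(400, 1.0, 100, 1000.0, scale="log")

# Convert a clicked pixel (column 300, row 250) to data coordinates.
point = (x_of(300), y_of(250))
```

Handling a rotated or warped scan would need more reference points and a 2D transform instead of two independent 1D mappings.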

It is written in Java, so it should run on most OSes. I believe it is free as in beer (and it might be open source).

3
  • This program only works for continuous series plotted as lines. Cool, though.
    Commented May 7, 2014 at 1:13
  • 1
    I pulled DataThief out and used it today on a series of dotted lines. You just need to extend the tracer leg over the length of a dot.
    – b degnan
    Commented May 14, 2016 at 15:33
  • Worth noting: it's shareware, with a registration fee (as of May 2017) of €23,65, and some features only available in the registered version. And as @WilliamGunn mentioned, it doesn't seem to be able to detect points, only lines.
    – Pont
    Commented May 16, 2017 at 11:32
7

We had a very similar problem at my old job: we had to scour a huge literature database containing literally thousands of papers for any data showing the solubility behavior of different species. A lot of this data was from the 1950s through 1970s, and was data we could not reproduce for a very large number of reasons (time and now safety regulations being chief among these).

The colleague who was responsible for collecting all of this data used a package called Data Thief to extract the data from graphs. It seemed to work well, but is also (from what I recall) commercial software (or rather shareware, but still technically not free). It is cross-platform and written in Java, so perhaps it satisfies a decent number of your criteria.

3

ScanIt does well. It is free of charge, albeit not open source; runs on Windows. It can automatically recognize points, and even distinguish between different symbols used as points:

ScanIt recognizing points
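To give an idea of what automatic point recognition with colour filtering involves, here is a minimal sketch. It is not how ScanIt itself works (I don't know its internals): it simply groups neighbouring pixels of a similar colour into blobs and reports each blob's centroid. The function name `find_points`, the tolerance value and the tiny synthetic image are all assumptions made for the example.

```python
from collections import deque


def find_points(image, target, tolerance=30):
    """Return centroids (x, y) of connected blobs close to a target colour.

    image is a list of rows, each row a list of (r, g, b) tuples.
    A pixel matches if every channel is within `tolerance` of `target`.
    Hypothetical helper; uses 4-connectivity and a breadth-first flood fill.
    """
    height = len(image)
    width = len(image[0]) if height else 0

    def matches(pixel):
        return all(abs(c - t) <= tolerance for c, t in zip(pixel, target))

    seen = [[False] * width for _ in range(height)]
    centroids = []
    for y in range(height):
        for x in range(width):
            if seen[y][x] or not matches(image[y][x]):
                continue
            # Start a new blob and collect every connected matching pixel.
            queue = deque([(x, y)])
            seen[y][x] = True
            xs, ys = [], []
            while queue:
                cx, cy = queue.popleft()
                xs.append(cx)
                ys.append(cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy),
                               (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not seen[ny][nx]
                            and matches(image[ny][nx])):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            centroids.append((sum(xs) / len(xs), sum(ys) / len(ys)))
    return centroids


# Tiny synthetic "scan": white background, one 2x2 red marker and
# one single-pixel blue marker.
WHITE, RED, BLUE = (255, 255, 255), (255, 0, 0), (0, 0, 255)
image = [[WHITE] * 6 for _ in range(5)]
for x, y in ((1, 1), (2, 1), (1, 2), (2, 2)):
    image[y][x] = RED
image[3][4] = BLUE

red_points = find_points(image, RED)    # only the red series
blue_points = find_points(image, BLUE)  # only the blue series
```

The centroids are in pixel coordinates and still have to be converted to data coordinates using the axis calibration. Overlapping markers, anti-aliasing and distinguishing symbols of the same colour by shape are harder and are not covered by this sketch.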

3
  • Welcome to the site. Link-only answers are discouraged here. Can you give us some idea of what ScanIt is capable of doing, some benefits and drawbacks, etc?
    – Fomite
    Commented Mar 14, 2015 at 7:43
  • @Fomite: This is not a link-only answer (see here). Nonetheless, it’s still a bad answer that can and should be improved by addressing your questions.
    – Wrzlprmft
    Commented Mar 14, 2015 at 9:55
  • @Wrzlprmft It's close enough to a link-only answer that I'm perfectly comfortable asking the poster for more information.
    – Fomite
    Commented Mar 15, 2015 at 8:56
2

Here I describe how it is possible to recover data from a vector graph in a PDF file with maximum exactness and even estimate the recovery error introduced. I show how it can be done in Mathematica, but the method shown is very basic and simple enough to be easily implemented in other systems.
