6

Short Version: Making the data behind figures openly available is undeniably a massive help to the research community. What systematic efforts (standards, file format requirements, tools to extract data from figures) are pushing for this openness? This is not about asking people to share their tabular data, but about using a figure file format from which the data is easily extractable.

Imagine a day when we can run a meta-analysis over a large number of scientific papers to build a big AI. Or a software that can extract all the figures and build a massive database of data that can put the research into a quick use by engineers or others. Perhaps a standard or framework for figures would help this happen faster (the same goes for the objectives and hypotheses of papers).

  • Are you aware of any effort in this direction?
  • Is there any framework for figure structure, or for the location of content within a figure, that journals mandate or suggest, in any field or for certain experiments, to allow better OCR?
  • For example, the data in MATLAB fig files (no matter how many curves or dimensions) can be accurately extracted. Is there any journal that asks for file formats like this, from which data can be easily extracted? Link here: https://www.mathworks.com/matlabcentral/answers/100687-how-do-i-extract-data-from-matlab-figures
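
To make concrete what "a figure file format that is easily extractable" could mean, here is a minimal sketch in Python. It is not the MATLAB fig format and not an existing standard: the layout, the function names `make_figure_record` and `extract_series`, and the sample values are all hypothetical and made up purely for illustration. The only point is that a figure stored together with its data can be read back exactly, with no OCR involved.

```python
import json


def make_figure_record(title, x_label, y_label, series):
    """Bundle plotted data with its labels in a hypothetical, tool-neutral layout."""
    return {
        "title": title,
        "axes": {"x": x_label, "y": y_label},
        "series": [
            {"name": name, "x": list(xs), "y": list(ys)}
            for name, (xs, ys) in series.items()
        ],
    }


def extract_series(text):
    """Recover every curve as a list of (x, y) points from the serialized record."""
    record = json.loads(text)
    return {s["name"]: list(zip(s["x"], s["y"])) for s in record["series"]}


# Made-up example values, only to show the round trip.
record = make_figure_record(
    "Stress vs. strain",
    "strain",
    "stress (MPa)",
    {"sample A": ([0.0, 0.1, 0.2], [0.0, 21.0, 39.5])},
)
serialized = json.dumps(record)
points = extract_series(serialized)
```

Anything that reads JSON could recover `points` from `serialized` without needing the plotting software that drew the figure.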

The figures below work well as a scientific report, but their data, if extracted, could help broader scientific developments. I doubt that an OCR-like algorithm could extract all of this data with 100% accuracy.

[Four example figures; images not preserved]

7
  • 19
    It seems to me that making the data behind the figure openly available would be much easier than trying to extract the data from the figure. There are many efforts to do the former. Commented Mar 31, 2017 at 2:01
  • 1
    What do you mean by "the data in Matlab fig files can be easily extracted"? Do you mean by OCR, like in your previous point? That seems false to me. And that format looks like opaque binary data. Commented Apr 1, 2017 at 15:00
  • 2
    @DavidKetcheson Looks like this could be a great answer. Commented Apr 1, 2017 at 19:14
  • @FedericoPoloni Fig files seem to be rich objects from which you can extract data with high fidelity. A great example is that you can extract z-values from a 2-dimensional image. The loss will be very low compared to OCR. mathworks.com/matlabcentral/answers/…
    – Amir
    Commented Apr 14, 2017 at 15:02
  • @DavidKetcheson I would agree, but I also know people who are very afraid of sharing their raw data. Can you point to some successful practices in this field?
    – Amir
    Commented Apr 14, 2017 at 15:03

1 Answer

5

There is no need for it.

Or a software that can extract all the figures and build a massive database of data that can put the research into a quick use by engineers or others.

This is pretty much what Information Retrieval systems do (yes, I'm looking at you, Google). I worked on interpreting images through AI in the IR area, and deciding whether something is a figure or not is easy.

First of all, there aren't many document formats: almost all papers are in either PDF or PostScript, both of which mark that something is an image or figure. The same holds for Word documents and even HTML. The question then simply becomes whether something is a figure (in the sense you mention in the question, i.e. something generated from data) or an image produced by a snapshot of reality (e.g. a photo).

A figure (in the sense above) will have few edges, arranged in an ordered fashion, whilst a photo will have many edges in an unordered fashion. Even a photo of a white table on a white background has enough noise to produce more edges.
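
The edge-count idea can be sketched in a few lines of Python. This is only a rough illustration, not the method any real IR system uses: the function names, the intensity threshold of 30, the 0.2 density cut-off, and the two synthetic test images (a hand-drawn "plot" and pure random noise standing in for a photo) are all assumptions chosen for the example.

```python
import random


def edge_density(img, threshold=30):
    """Fraction of pixels whose step to the right or downward neighbour exceeds threshold."""
    h, w = len(img), len(img[0])
    edges = 0
    for y in range(h - 1):
        for x in range(w - 1):
            dx = abs(img[y][x + 1] - img[y][x])
            dy = abs(img[y + 1][x] - img[y][x])
            if max(dx, dy) > threshold:
                edges += 1
    return edges / ((h - 1) * (w - 1))


def synthetic_plot(size=64):
    """White canvas with two axis lines and one diagonal 'curve' (assumed stand-in for a figure)."""
    img = [[255] * size for _ in range(size)]
    for i in range(size):
        img[size - 5][i] = 0       # x axis
        img[i][4] = 0              # y axis
        img[size - 1 - i][i] = 0   # diagonal data line
    return img


def synthetic_photo(size=64, seed=0):
    """Uniform random grey levels (assumed, very crude stand-in for a noisy photo)."""
    rng = random.Random(seed)
    return [[rng.randint(0, 255) for _ in range(size)] for _ in range(size)]


def looks_like_plot(img, max_density=0.2):
    """Classify as a data figure when only a small share of pixels sit on an edge."""
    return edge_density(img) <= max_density
```

With these assumed settings, `looks_like_plot(synthetic_plot())` is true and `looks_like_plot(synthetic_photo())` is false. A real classifier would also need to look at how ordered the edges are, which this sketch does not attempt.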

The interpretation of what the figure shows is a completely different thing. It is much, much, much harder to do with AI. But then again, a standardised way of placing the figures or placing content in the figure would not help with that. Interpreting a figure is something we humans have some difficulty doing. More often than not, looking only at the figures in a paper does not let you understand what the paper is about.

To produce a meaningful interpretation of figures in papers you would need to understand the paper first. Just like we humans do.


One note on MATLAB's .fig that I'd like to add: asking for specific file formats that are not publicly available may produce a lock-in problem. This is not a particularly good example of lock-in, since MATLAB provides discounted student (and sometimes general academic) licences. Yet there are some countries where these licences are not available, or in which even the discounted ones amount to a big sum due to monetary differences.

Therefore, requiring specific file formats may lock some researchers/students out of a journal, which would not be a good sign of the journal's openness. The tech wars, notably between file formats, are sometimes fiercer than the debates over well-discussed scientific topics. I'd prefer a journal that stays out of those (although guidelines on how an article should be formatted that are independent of the software used are definitely good).

7
  • 4
    MATLAB student licenses aren't free. Very favorably priced compared to professional licenses, yes, but not free.
    – Ben Voigt
    Commented Mar 31, 2017 at 3:39
  • @BenVoigt - Thanks for that. I just checked and it is 100% correct, the student licences are simply much cheaper (less than $30). I was misled by the fact that my institution always gave it to me. I've corrected the answer.
    – grochmal
    Commented Mar 31, 2017 at 18:44
  • @grochmal In the US... In other countries it is generally much more expensive, and institutional licenses are not common either.
    – Greg
    Commented Apr 2, 2017 at 6:29
  • 2
    Also, depending on your topic, you may not easily be able to make your figures using Matlab. In bioinformatics, many people would use R or python (+matplotlib), in cell biology it's common to work only with Excel, and sometimes Prism for more complicated stats...
    – Alexlok
    Commented Apr 3, 2017 at 18:15
  • A MATLAB license is not free, but what a fig object does is a good example of what people could submit to journals instead of flat images of their plots.
    – Amir
    Commented Apr 14, 2017 at 15:12
