I'm currently trying to get published journal-article information (author, title, date published, etc.) automatically from web pages. I'm trying to avoid scraping, since this is against most publishers' ToS. The best solution I have so far is the following (a rough sketch in code follows the list):

  1. Get <title> tag from page at URL
  2. Search CrossRef for matching title. Get DOI of top result.
  3. Get the rest of the information from CrossRef
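
For concreteness, here is a minimal sketch of steps 2 and 3, assuming Python with the requests library and the public CrossRef REST API (api.crossref.org); the field names follow CrossRef's works schema, and error handling is omitted:

```python
import requests

def crossref_lookup(page_title):
    """Search CrossRef for a title and return basic metadata for the top hit."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": page_title, "rows": 1},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    if not items:
        return None
    top = items[0]
    return {
        "doi": top.get("DOI"),
        "title": (top.get("title") or [""])[0],
        "authors": [
            f"{a.get('given', '')} {a.get('family', '')}".strip()
            for a in top.get("author", [])
        ],
        "issued": top.get("issued", {}).get("date-parts", [[None]])[0],
    }
```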

Unfortunately, the wrong journal article is sometimes chosen when searching CrossRef. This happens especially with newer publications, which have a DOI but don't yet seem to show up in CrossRef search.

I've also tried using a regex to find a DOI on the page, but this picks up all the DOIs listed in the references as well, so it doesn't help me.
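
For reference, that attempt looks roughly like the sketch below, using a commonly cited DOI pattern; it returns every DOI-like string on the page, reference-list entries included:

```python
import re

# A commonly cited DOI pattern (case-insensitive). It matches any DOI-like
# string in the HTML, so DOIs cited in the reference list are returned
# alongside the article's own DOI.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:a-z0-9]+", re.IGNORECASE)

def find_dois(html):
    """Return all distinct DOI-like strings found in the page source."""
    return sorted(set(DOI_PATTERN.findall(html)))
```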

Since the DOI resolves to the article's URL, is there any way to reverse the process and get the DOI from the URL?

  • Hm, good question. The CrossRef API does offer a "link - URL" field, but one cannot filter by that. Anyway, I have two questions: first, if you do that "automatically", why is it not "scraping"? It does sound like web scraping to me. Secondly, can you link us to an example? Is it confined to just a specific publisher, or are there many publishers and many journals you address? I think there should be another technical possibility to obtain the data you need.
    – anpami
    Commented May 14, 2021 at 14:45
  • It appears that you work for a UK university. In that case, you'll find that JISC's model journal subscription agreement releases you from some of the more onerous clauses in the Terms of Service as presented on the publisher website. But to be sure whether this includes the anti-scraping clause, I'm afraid you'll have to look at the full text of the model agreement.
    Commented May 14, 2021 at 15:43
  • Both are valid points, thank you. @anpami It's true that getting the title is scraping, but this is the minimum needed to get basic information (and acceptable to search engines), so this has been deemed okay. An example is this publication: cell.com/molecular-cell/fulltext/S1097-2765(21)00327-0, but note that the top result on CrossRef search is different: search.crossref.org/… @Daniel-Hatton The use case is not purely research (this will output to a website), therefore I'm being very careful about minimising scraping.
    Commented May 14, 2021 at 19:36

1 Answer


It turns out the process is far less trivial than one might expect.

There is a comprehensive write-up entitled "URLs and DOIs: a complicated relationship" describing all the intricacies.

It seems it should be feasible to resolve at least the URLs of scholarly articles to their corresponding DOIs, but even this is currently not deemed doable.

As the write-up puts it: "In order to find the Landing Page for every DOI, we would have to follow each and every individual link. This is not practical at scale, and impossible in some cases. We therefore can't, and don't attempt to discover every landing page URL ahead of time. Our process instead tries to look for possible landing pages and then connect them back to DOIs."

If publishers would refrain from obvious misbehaviours, such as indefinitely deploying internal redirects instead of updating the DOI resource URLs, or, worse, breaking basic HTTP functionality by using HTML redirects, or, worse still, using JavaScript-based redirects, requiring cookies, or breaking the web by some other bizarre means, then the task seems rather doable (deliberately ignoring the corner case of DOI aliases for the moment).
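
To make the forward direction concrete: the URL registered for a DOI can be read from the doi.org handle API, while the URL a plain HTTP client ends up at after following redirects is often different, and JavaScript- or cookie-based redirects are not followed at all. A rough sketch, assuming Python with requests:

```python
import requests

def registered_url(doi):
    """Read the resource URL registered for a DOI from the doi.org handle API."""
    resp = requests.get(f"https://doi.org/api/handles/{doi}", timeout=10)
    resp.raise_for_status()
    for value in resp.json().get("values", []):
        if value.get("type") == "URL":
            return value["data"]["value"]
    return None

def final_landing_url(doi):
    """Follow plain HTTP redirects from the DOI link and report where we land.
    JavaScript- or cookie-based redirects are not followed here."""
    resp = requests.get(f"https://doi.org/{doi}", allow_redirects=True, timeout=10)
    return resp.url

# When the two URLs differ, the reverse mapping (URL -> DOI) has no clean
# starting point, which is the problem the write-up describes.
```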
