
Recently, I added a new paper to my personal website, and it appeared on Google Scholar a couple of days later. On my website, all I did was write the name of the paper, together with the authors and the name of the conference, and then provide a link to the PDF. This information alone somehow informed Google Scholar that the text I added was actually a new paper. There is no other information about this paper on the web, so I know that Google only used my website to update Google Scholar.

What I am wondering is how Google knows what is a paper and what is just arbitrary text on my website. For example, if I had written only the name of the paper, without the authors, the conference, or the PDF, would it still have been detected?

On my website, the paper is listed on a webpage called "Publications", in a list with a load of other papers, but this is quite specific to the design of my own website. I'm wondering whether it has something to do with the PDF which I provided a link to. Perhaps it inspected the PDF and decided it was a paper, and if I hadn't added the PDF, it would not detect it as a paper. But then again, the HTML formatting does not necessarily indicate which text the PDF is actually associated with, even if it is obvious to a human on inspection of the webpage. Or perhaps Google Scholar just has some hand-engineered search which looks for instances of HTML where there is the name of a known conference, known authors, and a PDF nearby.

  • They have a huge basement full of little elves surfing the net like crazy.
    Commented Jun 1, 2016 at 12:14
  • Presumably the conference has a page, and Google knows that the conference (or journal) publishes papers. If that coincides with authors that it knows about, maybe that's enough.
    – Chris H
    Commented Jun 1, 2016 at 12:15
  • See Google Scholar's help page on inclusion. P.S. it has nothing to do with having it in a "known conference" - GS also indexes unpublished material. It basically looks for PDFs that have a title, an author list, and a references section. It keeps trying to index my (teaching) slides that are posted on my website, for example, because they have a title, author, and reference list at the end.
    – ff524
    Commented Jun 1, 2016 at 12:49
  • I think this is just a reflection of how awesome (and scary) Google is - it happened to spider your site in the last few days and found it through whatever voodoo (or elves) they use these days.
    – Jon Custer
    Commented Jun 1, 2016 at 12:51
  • @ff524: A few years ago (3 or 4) I happened to notice that if I searched for my name in Google Scholar, I would get a large number (over 20) of my teaching handouts (and tests and short quizzes) that I had archived in Math Forum posts back when you could do this (they seem to have eliminated this about a year ago). I haven't searched for this in a while, but in doing so just now I only see two such items ("Exotic Group Examples" and this), and a few others that I can't identify (e.g. "MATHEMATICS: THE UNIVERSAL"). Commented Jun 1, 2016 at 14:29

1 Answer


(Warning - gross oversimplifications incoming - if somebody doing research in Information Retrieval wants to add technical detail, be my guest!)

Fundamentally, Google finds all resources (HTML pages, images, as well as papers) on the web in the same way: periodically revisiting each and every resource it knows about, (re-)indexing it, and, for HTML content, following all links to other resources (rinse and repeat). Your web page is likely linked from your department web site, which Google definitely knows about, hence your web page is also in Google's database. Your web page links to your paper, hence Google will also know about your paper the next time the crawler checks your page. How long this will take is undefined, but Google has a lot of crawlers and is pretty smart about when to re-check certain types of pages, so it typically does not take very long.
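To make the "follow all links, rinse and repeat" idea concrete, here is a minimal sketch of a breadth-first crawl. It is only an illustration, not how Google's crawler is actually built: the `LINKS` graph, the `example.edu` URLs, and the `crawl` function are all made up, and the "web" is an in-memory dictionary so that nothing is fetched over the network.

```python
from collections import deque

# Hypothetical link graph standing in for the web: URL -> URLs it links to.
LINKS = {
    "https://dept.example.edu/": ["https://dept.example.edu/~me/"],
    "https://dept.example.edu/~me/": [
        "https://dept.example.edu/~me/publications.html",
    ],
    "https://dept.example.edu/~me/publications.html": [
        "https://dept.example.edu/~me/paper.pdf",
    ],
    "https://dept.example.edu/~me/paper.pdf": [],
}


def crawl(seeds, links):
    """Visit every reachable resource once, following links breadth-first."""
    seen = set(seeds)      # resources the crawler already knows about
    queue = deque(seeds)   # resources still waiting to be visited
    order = []             # order in which resources were visited
    while queue:
        url = queue.popleft()
        order.append(url)  # a real crawler would fetch and (re-)index here
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


# Starting from the department site, the PDF is eventually discovered.
discovered = crawl(["https://dept.example.edu/"], LINKS)
```

Starting from a page the crawler already knows (the department site), the paper is reached in three hops, which is why adding a single link on a publications page can be enough.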

Now, Google has specific heuristics in place to treat different types of resources differently. For instance, if an HTML page is added to the database, keywords will be extracted, links will be followed, etc., while an image will lead to completely different actions. Scientific papers are no different in that sense - as soon as Google finds a PDF or Word file that "looks like" a scientific paper to the automated process, Google will generate paper metadata (title, authors, venue, keywords, ...) by parsing the PDF text as well as it can and add it to its special Google Scholar database, and this is when the paper appears in your profile.
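As a rough illustration of what "looks like a paper" could mean, here is a toy check based on the signals mentioned in the comments on the question: a title, an author list, and a references section. Everything in it is an assumption for the sake of the example (the `looks_like_paper` name, the idea that the first line is the title and the second the authors, and the regular expressions); the real heuristics are not public in this detail and are certainly more elaborate.

```python
import re

# Assumed shape of one author name: two or more capitalised tokens.
NAME = r"[A-Z][A-Za-z.\-]*(?: [A-Z][A-Za-z.\-]*)+"
# Assumed shape of an author line: names joined by ", ", " and ", or ", and ".
AUTHORS = re.compile(rf"{NAME}(?:(?:, and |, | and ){NAME})*")
# A line consisting only of a typical bibliography heading.
REFERENCES = re.compile(r"references|bibliography", re.IGNORECASE)


def looks_like_paper(text):
    """Toy heuristic: title line, author line, and a references heading."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) < 3:
        return False
    has_title = len(lines[0].split()) >= 2
    has_authors = AUTHORS.fullmatch(lines[1]) is not None
    has_references = any(REFERENCES.fullmatch(ln) for ln in lines[2:])
    return has_title and has_authors and has_references
```

Note that a check like this would also accept lecture slides that carry a title, an author, and a reference list, which matches the false positives described in the comments.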

Google's own website goes into quite some detail on this process. It also has instructions for authors looking to get their papers indexed by Scholar.

  • I imagine Google isn't just extracting the PDF metadata, but is parsing the actual content of the PDF.
    – Ric
    Commented Jun 1, 2016 at 14:41
  • @Ric Yes, I expressed myself quite badly.
    – xLeitix
    Commented Jun 1, 2016 at 15:33
