14

I run a small web crawler and had to decide on what user agent to use for it. Lists of crawler agents as well as Wikipedia suggest the following format:

 examplebot/1.2 (+http://www.example.com/bot.html)

However, some bots omit the plus sign in front of the URL. I wonder what it means in the first place, but I couldn't find any explanation. RFC 2616 treats everything in parentheses as a comment and doesn't restrict its format. Browsers nevertheless commonly put a semicolon-separated list of tokens in the comment to advertise the browser's version and capabilities. I don't think this is standardized in any way beyond most browsers formatting it similarly, and I couldn't find anything concerning URLs in the comment.

My question is: Why the plus sign? Do I need it?

2 Answers

9

The earliest use of this convention I could find is in the Heritrix crawler. Its user manual says the following:

6.3.1.3.2. user-agent

The initial user-agent template you see when you first start heritrix will look something like the following:

Mozilla/5.0 (compatible; heritrix/0.11.0 +PROJECT_URL_HERE)

You must change at least the PROJECT_URL_HERE and put in place a website that webmasters can go to to view information on the organization or person running a crawl.

The user-agent string must adhere to the following format:

[optional-text] ([optional-text] +PROJECT_URL [optional-text]) [optional-text]

The parenthesis and plus sign before the URL must be present. Other examples of valid user agents would include:

my-heritrix-crawler (+http://mywebsite.com)

Mozilla/5.0 (compatible; bush-crawler +http://whitehouse.gov)

Mozilla/5.0 (compatible; os-heritrix/0.11.0 +http://loc.gov on behalf to the Library of Congress)
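
The format rule quoted above can be checked mechanically. The following is a minimal Python sketch, not Heritrix's own validation code: the pattern `HERITRIX_STYLE` and the function `has_plus_url_comment` are hypothetical names, and the sketch assumes that PROJECT_URL is an http(s) URL containing no whitespace or parentheses and that the string has exactly one parenthesized comment.

```python
import re

# Hypothetical checker for the format quoted from the manual:
#   [optional-text] ([optional-text] +PROJECT_URL [optional-text]) [optional-text]
# Assumption: PROJECT_URL is an http(s) URL without whitespace or parentheses,
# and the user agent contains exactly one parenthesized comment.
HERITRIX_STYLE = re.compile(r"^[^()]*\([^()]*\+https?://[^\s()]+[^()]*\)[^()]*$")


def has_plus_url_comment(user_agent: str) -> bool:
    """Return True if the comment in parentheses contains a +URL."""
    return HERITRIX_STYLE.match(user_agent) is not None


examples = [
    "my-heritrix-crawler (+http://mywebsite.com)",
    "Mozilla/5.0 (compatible; bush-crawler +http://whitehouse.gov)",
    "examplebot/1.2 (http://www.example.com/bot.html)",  # no plus sign
]
for ua in examples:
    print(has_plus_url_comment(ua), ua)
```

The first two examples come from the manual excerpt and match; the last one omits the plus sign and does not.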

7

I downloaded all the user agents from http://www.user-agents.org/ and ran a script to count the number of them that used the + style links vs plain links. I excluded the "non-standard" user agent strings that don't match RFC 2616.

Here are the results:

Total: 2471
Standard: 2064
Non-standard: 407
No link: 1391
With link: 673
Plus link: 145
Plain link: 528
Plus link only: 86
Plain link only: 174

So of the 673 user agents that include a link, only about 22% (145) include the plus. Of the 260 user agents whose comment is just a link (86 with the plus, 174 without), only 33% include the plus.

Based on this analysis, the plus is common, but most user agents that include a link leave it out. Omitting it is fine, and it is common enough that including it is fine too.
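
For reference, the shares can be recomputed from the counts listed above. This is only a small Python sketch of the arithmetic; the dictionary keys are made-up names for the corresponding result lines.

```python
# Counts copied from the results above; the key names are hypothetical.
counts = {
    "with_link": 673,
    "plus_link": 145,
    "plus_link_only": 86,
    "plain_link_only": 174,
}

# User agents whose comment consists of nothing but a link.
link_only_total = counts["plus_link_only"] + counts["plain_link_only"]

plus_share = counts["plus_link"] / counts["with_link"]
plus_only_share = counts["plus_link_only"] / link_only_total

print(f"Plus among all links:       {plus_share:.1%}")
print(f"Plus among link-only agents: {plus_only_share:.1%}")
```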

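Here is the Perl script that performed this analysis if you want to run it yourself. It reads the saved HTML of the user agent listing from standard input or from files named on the command line.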

#!/usr/bin/perl

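use strict;
use warnings;

# Read the whole HTML page (from STDIN or from files given as arguments).
my $doc="";

while(my $line = <>){
    $doc.=$line;
}

# Each user agent string sits in a <td class="left"> cell, terminated by &nbsp;
my @agents = $doc =~ /\<td class\=\"left\"\>[ \t\r\n]+(.*?)\&nbsp\;/gs;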

my $total = 0;
my $standard = 0;
my $nonStandard = 0;
my $noHttp = 0;
my $http = 0;
my $plusHttp = 0;
my $noPlusHttp = 0;
my $linkOnly = 0;
my $plusLinkOnly = 0;

for my $agent (@agents){
    $total++;
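    # Rough approximation of the RFC 2616 User-Agent grammar: one or more
    # product[/version] tokens, each optionally followed by a (comment).
    if ($agent =~ /^(?:[a-zA-Z0-9\.\-\_]+(?:\/[a-zA-Z0-9\.\-\_]+)?(?: \([^\)]+\))?[ ]*)+$/){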
        print "Standard: $agent\n";
        $standard++;
        if ($agent =~ /http/i){
            print "With link: $agent\n";
            $http++;
            if ($agent =~ /\+http/i){
                print "Plus link: $agent\n";
                $plusHttp++;
            } else {
                print "Plain link: $agent\n";
                $noPlusHttp++;
            }
            if ($agent =~ /\(http[^ ]+\)/i){
                print "Plain link only: $agent\n";
                $linkOnly++;
            } elsif ($agent =~ /\(\+http[^ ]+\)/i){
                print "Plus link only: $agent\n";
                $plusLinkOnly++;
            }
        } else {
            print "No link: $agent\n";
            $noHttp++;
        }
    } else {
        print "Non-standard: $agent\n";
        $nonStandard++;
    }
}

print "
Total: $total
Standard: $standard
Non-standard: $nonStandard
No link: $noHttp
With link: $http
Plus link: $plusHttp
Plain link: $noPlusHttp
Plus link only: $plusLinkOnly
Plain link only: $linkOnly
";
  • Very nice answer! I thought that the plus was more common, but apparently I was mistaken. This answers the question whether I need it, but not yet where it comes from.
    – jlh
    Commented Aug 15, 2013 at 15:42
  • My guess is that some very active spider like Googlebot started doing it and other developers copied the format. Googlebot certainly uses it, but it may not have been the first to do so. Commented Aug 15, 2013 at 17:36
  • great comment - thanks for the stats and the analysis Commented Aug 15, 2013 at 23:04
  • but, you didn't answer the question. Commented Aug 16, 2013 at 3:20
