14

I am trying to make a call to the import.io API. This call needs to have the following structure:

'https://extraction.import.io/query/extractor/{{crawler_id}}?_apikey=xxx&url=http://www.example.co.uk/items.php?sortby=Price_LH&per_page=96&size=1%2C12&page=35'

As you can see, the call must itself include a "url" parameter:

http://www.example.co.uk/items.php?sortby=Price_LH&per_page=96&size=1%2C12&page=35

This secondary URL also needs parameters of its own. But if I pass it as a plain string, as in the example above, the API response only includes the part up to the first ampersand:

http://www.example.co.uk/items.php?sortby=Price_LH

This is not correct; it looks as if the API is making the call with the truncated URL instead of the one I passed in.

I am using Python and requests to do the call in the following way:

import requests
import json

row_dict = {'url': u'http://www.example.co.uk/items.php?sortby=Price_LH&per_page=96&size=1%2C12&page=35', 'crawler_id': u'zzz'}
url_call = 'https://extraction.import.io/query/extractor/{0}?_apikey={1}&url={2}'.format(row_dict['crawler_id'], auth_key, row_dict['url'])
r = requests.get(url_call)
rr = json.loads(r.content)

And when I print the result:

"url" : "http://www.example.co.uk/items.php?sortby=Price_LH",

but when I print r.url:

https://extraction.import.io/query/extractor/zzz?_apikey=xxx&url=http://www.example.co.uk/items.php?sortby=Price_LH&per_page=96&size=1%2C12&page=35

So in the URL it all seems to be fine but not in the response.

I tried this with other URLs and they all get cut off after the first parameter.

1
  • It seems like import.io's api accepted the rest of arguments.
    – tanglong
    Commented Jul 20, 2016 at 9:11

2 Answers

30

The requests library will handle all of your URL encoding needs. This is the proper way to add parameters to a URL using requests:

import requests

base_url = "https://extraction.import.io/query/extractor/{{crawler_id}}"
params = dict()
params["_apikey"] = "xxx"
params["url"] = "http://www.example.co.uk/items.php?sortby=Price_LH&per_page=96&size=1%2C12&page=35"

r = requests.get(base_url, params=params)
print(r.url)

An arguably more readable way to format your parameters:

params = {
    "_apikey" : "xxx",
    "url" : "http://www.example.co.uk/items.php?sortby=Price_LH&per_page=96&size=1%2C12&page=35"
}
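
The effect of encoding the nested URL can be checked offline, without calling the API. The sketch below uses the standard library's urllib.parse.urlencode as a stand-in for the encoding step (an assumption: the exact string requests emits may differ in detail), then decodes the query again to confirm the nested URL survives intact.

```python
from urllib.parse import urlencode, parse_qs

params = {
    "_apikey": "xxx",
    "url": "http://www.example.co.uk/items.php?sortby=Price_LH&per_page=96&size=1%2C12&page=35",
}

# Percent-encode each value so that '&', '?' and '=' inside the nested URL
# can no longer be mistaken for separators of the outer query string.
query = urlencode(params)
print(query)

# Decoding the query again recovers the nested URL in full.
decoded = parse_qs(query)
print(decoded["url"][0])
```

After encoding, the outer query contains only two parameters, _apikey and url, which is what the API needs to see.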

Note that the {{crawler_id}} piece above is not a URL parameter but part of the base URL. Since Requests does not perform general string templating, something else has to fill in that placeholder before the call is made (see the comments below).
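
One possible way to fill in the crawler ID, sketched here under assumptions: the helper name extractor_url is hypothetical, and the ID is percent-encoded with urllib.parse.quote so that unusual characters cannot change the path. The resulting string would then be passed to requests.get together with params, as above.

```python
from urllib.parse import quote


def extractor_url(crawler_id):
    """Build the extractor endpoint for one crawler ID (hypothetical helper)."""
    # safe="" also escapes "/", so the ID cannot introduce extra path segments.
    return "https://extraction.import.io/query/extractor/{}".format(
        quote(crawler_id, safe="")
    )


base_url = extractor_url("zzz")
print(base_url)

# Then, with the params dict from above:
#   r = requests.get(base_url, params=params)
```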

3
  • This doesn't seem to work as of requests 2.24.0. r = requests.get("https://www.google.com/{{myparam}}", params={"myparam":"imghp"}) returns 404, and yet google.com/imghp is a perfectly valid URL. Is there anything else to add? Where is the {{crawler_id}} argument passed in your params? That's a path parameter. Commented Jul 2, 2020 at 10:52
  • The part {{crawler_id}} is not a URL parameter but rather part of the base URL. The requests package isn't performing general templating but rather handling the URL parameters & sanitizing. There are several options for general string templating, e.g. the standard library option and (my favorite) Jinja, which would work directly with the example above. I'll update my answer to address this.
    – Demitri
    Commented Jul 2, 2020 at 19:34
  • 6
    There's no such thing as a "URL parameter". There's no "base url" - that part of the URL is called "path", hence when it's parameterized, the parameter is usually called "path parameter" or "path variable". requests supports "query parameters", which are called as such because they're in the query. But, the OP has asked for some way to perform a call which includes path parameters and you used the placeholder LITERALLY! Your answer is wrong. Also, "general string templating" is a terrible idea for what you're doing, because it doesn't perform URL encoding. Can be both buggy and vulnerable. Commented Jul 7, 2020 at 20:10
9

You will need to URL-encode the URL you are sending to the API.

The reason is that the server interprets the ampersands as separators between the parameters of the outer URL https://extraction.import.io/query/extractor/XXX?

This is why everything after the first ampersand is stripped from the url value:

http://www.example.co.uk/items.php?sortby=Price_LH
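
The splitting can be reproduced locally. This sketch parses the unencoded call with Python's own query-string parser; that parser is only an assumption about how the import.io server behaves, but it produces the same truncated url value.

```python
from urllib.parse import urlsplit, parse_qs

# The call exactly as it was sent, with the nested URL left unencoded.
unencoded = (
    "https://extraction.import.io/query/extractor/zzz"
    "?_apikey=xxx&url=http://www.example.co.uk/items.php"
    "?sortby=Price_LH&per_page=96&size=1%2C12&page=35"
)

# Split the outer query string on '&', the way a typical server would.
received = parse_qs(urlsplit(unencoded).query)
for name, values in received.items():
    print(name, values)
```

The url parameter ends at the first ampersand, while per_page, size and page arrive as separate parameters of the outer call.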

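Try the following using urllib.quote(row_dict['url']) (in Python 3 the same function is urllib.parse.quote):

import requests
import json

try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2

auth_key = 'xxx'  # placeholder: your import.io API key

row_dict = {
  'url': u'http://www.example.co.uk/items.php?sortby=Price_LH&per_page=96&size=1%2C12&page=35',
  'crawler_id': u'zzz'}
# Percent-encode the nested URL so its '&' and '?' are not read as part of the outer query
url_call = 'https://extraction.import.io/query/extractor/{0}?_apikey={1}&url={2}'.format(
  row_dict['crawler_id'], auth_key, quote(row_dict['url']))
r = requests.get(url_call)
rr = json.loads(r.content)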
1
  • Is this answer still true for recent versions of python/urllib? urllib.quote doesn't seem to exist
    – baxx
    Commented Jan 5 at 11:21
