
The following script works perfectly with a file containing 2 rows, but when I tried a 2,500-row file I got 429 exceptions. So I increased the delay between queries to 5 seconds. I also filled in the user agent. After those unsuccessful attempts, I connected to a VPN to get a 'fresh' IP address, but I got 429 errors again. Is there something I am missing here? The Nominatim policy specifies no more than 1 connection per second, and I am doing one per 5 seconds... any help would be appreciated!

from geopy.geocoders import Nominatim
import pandas
from functools import partial

from geopy.extra.rate_limiter import RateLimiter

nom = Nominatim(user_agent="[email protected]")
geocode = RateLimiter(nom.geocode, min_delay_seconds=5)


df=pandas.read_csv('Book1.csv', engine='python')
df["ALL"] = df['Address'].apply(partial(nom.geocode, timeout=1000, language='en'))
df["Latitude"] = df["ALL"].apply(lambda x: x.latitude if x != None else None)
df["Longitude"] = df["ALL"].apply(lambda x: x.longitude if x != None else None)

writer = pandas.ExcelWriter('Book1.xlsx')
df.to_excel(writer, 'new_sheet')
writer.save()

Error message:

Traceback (most recent call last):
  File "C:\Users\u6022697\AppData\Local\Programs\Python\Python37\lib\site-packages\geopy\geocoders\base.py", line 355, in _call_geocoder
    page = requester(req, timeout=timeout, **kwargs)
  File "C:\Users\u6022697\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\Users\u6022697\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Users\u6022697\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "C:\Users\u6022697\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 503, in _call_chain
    result = func(*args)
  File "C:\Users\u6022697\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 429: Too Many Requests

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/u6022697/Documents/python work/Multiple GPS Nom Pandas.py", line 14, in <module>
    df["ALL"] = df['Address'].apply(partial(nom.geocode, timeout=1000, language='en'))
  File "C:\Users\u6022697\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\series.py", line 3849, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas\_libs\lib.pyx", line 2327, in pandas._libs.lib.map_infer
  File "C:\Users\u6022697\AppData\Local\Programs\Python\Python37\lib\site-packages\geopy\geocoders\osm.py", line 406, in geocode
    self._call_geocoder(url, timeout=timeout), exactly_one
  File "C:\Users\u6022697\AppData\Local\Programs\Python\Python37\lib\site-packages\geopy\geocoders\base.py", line 373, in _call_geocoder
    raise ERROR_CODE_MAP[code](message)
geopy.exc.GeocoderQuotaExceeded: HTTP Error 429: Too Many Requests

3 Answers


I've done reverse geocoding of ~10K different lat-lon combinations in less than a day. Nominatim doesn't like bulk queries, so the idea is to avoid looking like one. Here's what I suggest:

  1. Make sure that you only query unique items. I've found that repeated queries for the same lat-lon combination are blocked by Nominatim, and the same can be true for addresses. You can use unq_address = df['address'].unique() and then run the queries over that series; you may well end up with fewer addresses to look up. (A usage sketch combining this with point 2 follows after the code below.)

  2. The time between queries should be random. I also set the user_agent to a random value on every run. In my case, I use the following code:

    import logging
    from time import sleep
    from random import randint
    from geopy.geocoders import Nominatim
    from geopy.exc import GeocoderTimedOut, GeocoderServiceError
    
    user_agent = 'user_me_{}'.format(randint(10000,99999))
    geolocator = Nominatim(user_agent=user_agent)
    def reverse_geocode(geolocator, latlon, sleep_sec):
        try:
            return geolocator.reverse(latlon)
        except GeocoderTimedOut:
            logging.info('TIMED OUT: GeocoderTimedOut: Retrying...')
            sleep(randint(1*100,sleep_sec*100)/100)
            return reverse_geocode(geolocator, latlon, sleep_sec)
        except GeocoderServiceError as e:
            logging.info('CONNECTION REFUSED: GeocoderServiceError encountered.')
            logging.error(e)
            return None
        except Exception as e:
            logging.info('ERROR: Terminating due to exception {}'.format(e))
            return None
    

I find that the line sleep(randint(1*100,sleep_sec*100)/100) does the trick for me.
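
To tie both suggestions together, here is a minimal usage sketch that reuses the geolocator and reverse_geocode defined above. The Latitude/Longitude column names, the latlon helper column, and the 5-second ceiling for the random pause are assumptions for illustration, not part of the answer itself:

    import pandas as pd

    df = pd.read_csv('Book1.csv')
    # Suggestion 1: only query unique lat-lon combinations.
    df['latlon'] = list(zip(df['Latitude'], df['Longitude']))
    unique_latlon = df['latlon'].unique()

    results = {}
    for latlon in unique_latlon:
        results[latlon] = reverse_geocode(geolocator, latlon, sleep_sec=5)
        # Suggestion 2: a random pause between queries.
        sleep(randint(1 * 100, 5 * 100) / 100)

    # Map the unique results back onto the full DataFrame.
    df['location'] = df['latlon'].map(results)
    df['address'] = df['location'].apply(lambda loc: loc.address if loc is not None else None)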

  • Comment: This code is very impressive! Didn't test it, but kudos :)

After some research, it turns out Nominatim has a limit of 1,000 queries per day, so the script was trying to do more than that allows.

https://getlon.lat/
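
If you have to stay within that daily limit on the public service, one workaround is to geocode at most 1,000 unique addresses per run and persist the partial results, resuming the next day. A rough sketch, assuming the question's Book1.csv with an Address column, the RateLimiter-wrapped geocode from the question, and a hypothetical geocoded.csv cache file:

import pandas

df = pandas.read_csv('Book1.csv')
unique_addresses = pandas.Series(df['Address'].unique())

# Load any results from previous runs, or start an empty cache.
try:
    done = pandas.read_csv('geocoded.csv')
except FileNotFoundError:
    done = pandas.DataFrame(columns=['Address', 'Latitude', 'Longitude'])

# Geocode at most 1,000 addresses that are not cached yet.
todo = unique_addresses[~unique_addresses.isin(done['Address'])].head(1000)
rows = []
for address in todo:
    loc = geocode(address)  # RateLimiter-wrapped nom.geocode from the question
    rows.append({'Address': address,
                 'Latitude': loc.latitude if loc else None,
                 'Longitude': loc.longitude if loc else None})

pandas.concat([done, pandas.DataFrame(rows)]).to_csv('geocoded.csv', index=False)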


Regarding geocoding with the geopy RateLimiter and Nominatim, I have put together the following function, which works well. It breaks large files (in this case a pandas DataFrame) down into batches. There is also a try/except clause to catch errors; on failure it returns the partial dataset as well as the number of the batch where it failed, which you can then pass back in as the last function parameter, unique_array_pos, to resume from where it stopped. This is useful for keeping partial results.

Parameters:

  • wait_time_batch: wait time between batches
  • wait_time_retries: wait time between retries
  • data_df: pandas DataFrame with one column containing the full address
  • batch_size: the size of each batch
  • address_column: the column in the DataFrame where the address is stored
  • unique_array_pos: if the function errors out, the batch number from which to resume the geocoding (0 to start from the beginning)

Through trial and error I have found that a good batch size is 200; even for large datasets it keeps the geocoding from erroring out.

I am not using tqdm for two reasons: I could not get it to work in JupyterLab, and I find the row-by-row output just as effective.

Enjoy!

import math
import time
import numpy as np
import pandas as pd
from time import sleep
from random import randint
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

def batch_geocode(wait_time_batch, wait_time_retries, data_df, batch_size, address_column, unique_array_pos):
    unique = data_df[address_column].unique()   # get the unique addresses from the dataframe
    un_size = len(unique)
    n_iter = math.ceil(un_size / batch_size)    # number of batches needed
    start_time = time.perf_counter()            # time.clock() was removed in Python 3.8
    print('size: ' + str(un_size))
    print('n_iter: ' + str(n_iter))
    final = np.empty((0, 2), dtype=object)

    for i in range(unique_array_pos, n_iter, 1):
        try:
            start_iter = time.perf_counter()
            start = i * batch_size
            print('batch: ' + str(i) + ', row number: ' + str(start))
            # fresh geocoder and rate limiter per batch, with a randomised retry delay
            geolocator = Nominatim(user_agent='trial' + str(randint(0, 1000)))
            geocode = RateLimiter(geolocator.geocode, max_retries=3,
                                  error_wait_seconds=randint(1 * 100, wait_time_retries * 100) / 100)
            temp1 = unique[start:(i + 1) * batch_size]
            loc = np.array([geocode(x) for x in temp1])
            temp2 = np.c_[temp1, loc]               # pair each address with its location
            final = np.append(final, temp2, axis=0)
            sleep(randint(1 * 100, wait_time_batch * 100) / 100)   # random pause between batches
            print(f'iteration time: {time.perf_counter() - start_iter: .2f}'
                  + f', total time: {time.perf_counter() - start_time:.2f}')
        except Exception as e:
            print(e)
            print('failed execution, resume from batch: ' + str(i))
            return pd.DataFrame(final, columns=['address', 'location'])
    return pd.DataFrame(final, columns=['address', 'location'])
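
For reference, a call might look like this; the wait-time values, the Address column name, and the merge back onto the original DataFrame are assumptions for illustration, not part of the function above:

df = pd.read_csv('Book1.csv')

# Geocode the unique addresses in batches of 200, starting from the first batch.
results = batch_geocode(wait_time_batch=5, wait_time_retries=2, data_df=df,
                        batch_size=200, address_column='Address', unique_array_pos=0)

# Merge the geocoded locations back onto the full DataFrame.
df = df.merge(results, left_on='Address', right_on='address', how='left')
df['Latitude'] = df['location'].apply(lambda loc: loc.latitude if loc is not None else None)
df['Longitude'] = df['location'].apply(lambda loc: loc.longitude if loc is not None else None)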
