108

I have seen:

These relate to vanilla python and not pandas.

If I have the series:

ix   num  
0    1
1    6
2    4
3    5
4    2

And I input 3, how can I (efficiently) find:

  1. The index of 3, if it is found in the series
  2. The indexes of the values just below and above 3, if it is not found in the series

I.e., with the above series {1,6,4,5,2} and input 3, I should get the values (4,2) with indexes (2,4).

1
  • 3
    Note that the title asks about the closest values, but the question text itself asks about the (closest) lower- and upperbounds. As can be seen from the comments to one of the answers, these are not the same (most answers here appear to answer the title, not the question text).
    – 9769953
    Commented Aug 8, 2019 at 9:18

9 Answers

102

You could use argsort() on the absolute differences. Say the input is 3:

In [198]: input = 3

In [199]: df.iloc[(df['num']-input).abs().argsort()[:2]]
Out[199]:
   num
2    4
4    2

df_sort is the dataframe with 2 closest values.

In [200]: df_sort = df.iloc[(df['num']-input).abs().argsort()[:2]]

For index,

In [201]: df_sort.index.tolist()
Out[201]: [2, 4]

For values,

In [202]: df_sort['num'].tolist()
Out[202]: [4, 2]

For reference, the df used in the solution above was:

In [197]: df
Out[197]:
   num
0    1
1    6
2    4
3    5
4    2
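
As the comments below discuss, the two closest values by absolute difference are not always the closest value below and the closest value above. A minimal sketch of the difference, assuming a hypothetical input of 3.9 and using NumPy's argsort on the underlying array:

```python
import pandas as pd

df = pd.DataFrame({"num": [1, 6, 4, 5, 2]})
target = 3.9  # hypothetical input, chosen so that the two notions differ

# Two closest values by absolute difference (the approach in this answer).
order = (df["num"] - target).abs().to_numpy().argsort()
closest_two = df.iloc[order[:2]]

# Closest value strictly below and strictly above the target.
below_idx = df.loc[df["num"] < target, "num"].idxmax()
above_idx = df.loc[df["num"] > target, "num"].idxmin()

print(closest_two["num"].tolist())                   # both values lie above 3.9
print(df.loc[[below_idx, above_idx], "num"].tolist())  # one below, one above
```

For this input the absolute-difference approach picks 4 and 5, while the below/above pair is 2 and 4.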
8
  • 3
    does this find the closest below and above, or just the two closest?
    – Steve
    Commented May 7, 2015 at 22:18
  • What do you mean by below and above? Closest values are picked by absolute difference between them and the given input.
    – Zero
    Commented May 7, 2015 at 22:34
  • 3
    I needed to find a) the cloest number above, b) the closest number below. So on absolute difference wouldn't achieve this in all cases.
    – Steve
    Commented May 2, 2017 at 9:46
  • 3
    This gives the incorrect answer. I tried this on a more complex dataset. You must use .iloc instead of .ix and it works well (see @op1)
    – amc
    Commented Nov 9, 2017 at 17:28
  • 1
    Wouldn't this approach have a complexity of N log(N), since you require sorting? This does not seem extremely efficient, as it should be possible to perform the search in ~O(N)? Commented Nov 29, 2018 at 15:21
51

Apart from not completely answering the question, an extra disadvantage of the other algorithms discussed here is that they have to sort the entire list. This results in a complexity of ~N log(N).

However, it is possible to achieve the same results in ~N. This approach separates the dataframe into two subsets, one with values smaller and one with values larger than the desired value. The lower neighbour is then the largest value in the smaller dataframe, and vice versa for the upper neighbour.

This gives the following code snippet:

def find_neighbours(value, df, colname):
    exactmatch = df[df[colname] == value]
    if not exactmatch.empty:
        return exactmatch.index
    else:
        lowerneighbour_ind = df[df[colname] < value][colname].idxmax()
        upperneighbour_ind = df[df[colname] > value][colname].idxmin()
        return [lowerneighbour_ind, upperneighbour_ind] 

This approach is similar to using partition in pandas, which can be really useful when dealing with large datasets and complexity becomes an issue.
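
One caveat: if the value lies below the minimum or above the maximum of the column, one of the two subsets is empty and idxmax()/idxmin() raise an error. A possible sketch that guards each side separately (the name find_neighbours_safe and the choice to return None for a missing neighbour are assumptions, not part of the snippet above):

```python
import pandas as pd

def find_neighbours_safe(value, df, colname):
    """Hypothetical variant of find_neighbours that tolerates out-of-range values."""
    col = df[colname]
    exact = col[col == value]
    if not exact.empty:
        return exact.index.tolist()
    lower = col[col < value]
    upper = col[col > value]
    # idxmax/idxmin fail on an empty Series, so check each side first.
    lower_ind = lower.idxmax() if not lower.empty else None
    upper_ind = upper.idxmin() if not upper.empty else None
    return [lower_ind, upper_ind]

df = pd.DataFrame({"num": [1, 6, 4, 5, 2]})
print(find_neighbours_safe(3, df, "num"))  # neighbours on both sides
print(find_neighbours_safe(0, df, "num"))  # nothing below 0
print(find_neighbours_safe(4, df, "num"))  # exact match
```

Each call is still a handful of linear passes over the column, so the ~N behaviour is unchanged.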


Comparing both strategies shows that for large N, the partitioning strategy is indeed faster. For small N, the sorting strategy will be more efficient, as it is implemented at a much lower level. It is also a one-liner, which might increase code readability.

[Figure: Comparison of partitioning vs sorting]

The code to replicate this plot can be seen below:

from matplotlib import pyplot as plt
import pandas
import numpy
import timeit

value=3
sizes=numpy.logspace(2, 5, num=50, dtype=int)

sort_results, partition_results=[],[]
for size in sizes:
    df=pandas.DataFrame({"num":100*numpy.random.random(size)})
    
    sort_results.append(timeit.Timer("df.iloc[(df['num']-value).abs().argsort()[:2]].index",
                                     globals={'df':df,'value':value}).autorange())
    # find_neighbours expects its arguments as (value, df, colname)
    partition_results.append(timeit.Timer("find_neighbours(value, df, 'num')",
                                          globals={'find_neighbours':find_neighbours, 'df':df,'value':value}).autorange())
    
sort_time=[time/amount for amount,time in sort_results]
partition_time=[time/amount for amount,time in partition_results]

plt.plot(sizes, sort_time)
plt.plot(sizes, partition_time)
plt.legend(['Sorting','Partitioning'])
plt.title('Comparison of strategies')
plt.xlabel('Size of Dataframe')
plt.ylabel('Time in s')
plt.savefig('speed_comparison.png')
11
  • 1
    I like the idea! However, it seems to me there is a great chance that lower..._id and upper..._id are not next to each other if the index is not monotonic; this is what the other algorithms try to provide by sorting.
    – Joël
    Commented Dec 18, 2018 at 14:47
  • 1
    @Joël I think that I don't understand your point. They would have to be next to eachother, since lower... is the largest value smaller than value and upper... is the smallest value larger than lower. I don't see how there can be another value inbetween there? Or do you mean something else? Commented Dec 19, 2018 at 7:53
  • 1
    I think this answer is more correct, since it gives the closest lower- and upperbounds, not just the two closest values. It does have two (minor) mistakes: indentation on line 3, and the use of an unknown traversed variable (which probably should be value); as I'm about 90% sure on the errors, not 100%, I'm still a bit hesitant to edit and fix the answer.
    – 9769953
    Commented Aug 8, 2019 at 9:21
  • 2
    @ClaudiuCreanga That depends on the size of df. I added a plot and some code to illustrate this behaviour. Commented Aug 14, 2019 at 15:33
  • 1
    @Isaac Thanks for the suggestion, I approved the edit. The comments were helpful to understand your edit though. Commented Jan 23, 2020 at 12:00
23

I recommend using iloc with John Galt's answer, since it works even with an unsorted integer index. By contrast, .ix looks at the index labels first, so the positions returned by argsort() can be misread as labels.

df.iloc[(df['num']-input).abs().argsort()[:2]]
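
A small sketch of why the positional lookup matters, assuming a hypothetical frame whose integer labels are deliberately out of order (a stable sort is used here only to make the order of ties predictable):

```python
import pandas as pd

# Hypothetical frame: same values as the question, shuffled integer labels.
df = pd.DataFrame({"num": [1, 6, 4, 5, 2]}, index=[10, 3, 7, 0, 5])
target = 3

# argsort yields positions (0..n-1), not index labels.
diff = (df["num"] - target).abs().to_numpy()
positions = diff.argsort(kind="stable")[:2]

closest = df.iloc[positions]     # positional lookup picks the right rows
print(closest["num"].tolist())   # the two closest values
print(closest.index.tolist())    # their labels, which differ from the positions
```

Here the positions are 2 and 4 but the corresponding labels are 7 and 5, so a label-based lookup with the same numbers would fail or return the wrong rows.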
1
13

If the series is already sorted, an efficient method of finding the indexes is by using bisect functions. An example:

idx = bisect_left(df['num'].values, 3)

Let's consider that the column col of the dataframe df is sorted.

  • In the case where the value val is in the column, bisect_left will return the precise index of the value in the list and bisect_right will return the index of the next position.
  • In the case where the value is not in the list, both bisect_left and bisect_right will return the same index: the one where to insert the value to keep the list sorted.

Hence, to answer the question, the following code gives the index of val in col if it is found, and the indexes of the closest values otherwise. This solution works even when the values in the list are not unique.

from bisect import bisect_left, bisect_right
def get_closests(df, col, val):
    lower_idx = bisect_left(df[col].values, val)
    higher_idx = bisect_right(df[col].values, val)
    if higher_idx == lower_idx:      #val is not in the list
        return lower_idx - 1, lower_idx
    else:                            #val is in the list
        return lower_idx

Bisect algorithms are very efficient at finding the position of a specific value "val" in the dataframe column "col", or of its closest neighbours, but they require the column to be sorted.
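
Note that bisect returns positions in the sorted values rather than index labels. If the column is not sorted yet, one option is to sort first and map the positions back to labels. This is only a sketch: the helper name closest_labels is hypothetical, and the sort itself costs ~N log(N), so it pays off mainly when many lookups reuse the same sorted data.

```python
from bisect import bisect_left, bisect_right

import pandas as pd

def closest_labels(df, col, val):
    """Hypothetical helper: sort by col, bisect, then map positions to index labels."""
    ordered = df.sort_values(col)
    values = ordered[col].tolist()
    lo = bisect_left(values, val)
    hi = bisect_right(values, val)
    if lo != hi:
        # val is present: return the labels of every matching row.
        return ordered.index[lo:hi].tolist()
    # val is absent: neighbours sit at positions lo - 1 and lo, if they exist.
    below = ordered.index[lo - 1] if lo > 0 else None
    above = ordered.index[lo] if lo < len(values) else None
    return [below, above]

df = pd.DataFrame({"num": [1, 6, 4, 5, 2]})
print(closest_labels(df, "num", 3))  # labels of 2 and 4
print(closest_labels(df, "num", 5))  # label of the exact match
print(closest_labels(df, "num", 7))  # no upper neighbour
```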

3

You can use numpy.searchsorted. If your search column is not already sorted, you can make a DataFrame that is sorted and remember the mapping between the two with Series.argsort. (This is better than the above methods if you plan on finding the closest value more than once.)

Once it's sorted, find the closest values for your inputs like this:

indLeft = np.searchsorted(df['column'], input, side='left')
indRight = np.searchsorted(df['column'], input, side='right')

valLeft = df['column'][indLeft]
valRight = df['column'][indRight]
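
A sketch of the sort-once-and-remember-the-mapping idea, written with NumPy arrays. The helper name neighbours and the handling of the not-found case (taking the positions left - 1 and left) are assumptions added for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"column": [1, 6, 4, 5, 2]})
values = df["column"].to_numpy()

# Sort once and keep the mapping from sorted position to original position.
order = np.argsort(values, kind="stable")
sorted_values = values[order]

def neighbours(target):
    """Hypothetical helper: original positions of the match, or of the values below and above."""
    left = np.searchsorted(sorted_values, target, side="left")
    right = np.searchsorted(sorted_values, target, side="right")
    if left != right:
        return order[left:right].tolist()  # exact match(es)
    # Not found: left == right is the insertion point.
    below = int(order[left - 1]) if left > 0 else None
    above = int(order[left]) if left < len(order) else None
    return [below, above]

print(neighbours(3))  # original positions of 2 and 4
print(neighbours(6))  # original position of the exact match
```

After the one-time sort, each lookup is a binary search, which is where the benefit for repeated queries comes from.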
1
  • 1
    If the input does not exactly match an element in the column, then indLeft and indRight are equal i think. The request was for the two closest indices. Commented Jul 29, 2020 at 10:24
2

If your series is already sorted, you could use something like this.

def closest(df, col, val, direction):
    n = len(df[df[col] <= val])
    if(direction < 0):
        n -= 1
    if(n < 0 or n >= len(df)):
        print('err - value outside range')
        return None
    return df[col].iloc[n]  # positional lookup; .ix has been removed from pandas

df = pd.DataFrame(pd.Series(range(0,10,2)), columns=['num'])
for find in range(-1, 2):
    lc = closest(df, 'num', find, -1)
    hc = closest(df, 'num', find, 1)
    print('Closest to {} is {}, lower and {}, higher.'.format(find, lc, hc))


df:     num
    0   0
    1   2
    2   4
    3   6
    4   8
err - value outside range
Closest to -1 is None, lower and 0, higher.
Closest to 0 is 0, lower and 2, higher.
Closest to 1 is 0, lower and 2, higher.
2

The most intuitive way I've found to solve this sort of problem is to use the partition approach suggested by @ivo-merchiers but use nsmallest and nlargest. In addition to working on unsorted series, a benefit of this approach is that you can easily get several close values by setting k_matches to a number greater than 1.

import pandas as pd
source = pd.Series([1,6,4,5,2])
target = 3

def find_closest_values(target, source, k_matches=1):
    k_above = source[source >= target].nsmallest(k_matches)
    k_below = source[source < target].nlargest(k_matches)
    k_all = pd.concat([k_below, k_above]).sort_values()
    return k_all

find_closest_values(target, source, k_matches=1)

Output:

4    2
2    4
dtype: int64
0

If you need to find the closest value to obj_num in the 'num' column and there are multiple candidates, you can choose the best occurrence based on the values of other columns, for instance a second column 'num2'.

To do so, I would recommend creating a new column 'num_diff' and then using sort_values. Example: we want the closest value to 3 in the 'num' column and, if there are several occurrences, the one with the smallest value in the 'num2' column. Code as below:

import pandas as pd

obj_num = 3
df = pd.DataFrame({
    'num': [0, 1, 3, 3, 3, 4],
    'num2': [0, 0, 0, -1, 1, 0]
})

df_copy = df.loc[:, ['num', 'num2']].copy()
df_copy['num_diff'] = (df['num']-obj_num).abs()
df_copy.sort_values(
    by=['num_diff', 'num2'],
    axis=0,
    inplace=True
)
obj_num_idx = df_copy.index[0]

print(f'Objective row: \n{df.loc[obj_num_idx, :]}')

Here's a function to do the job using a dict of objective values and columns (it respects order of columns to use for sorting):

def colosest_row(df, obj):
    '''
    Sort df using specific columns given as obj keys.
    If a key has None value:
        sort column in ascending order.
    If a key has a float value:
        sort column from closest to farthest from the obj[key] value.

    Arguments
    ---------
    df: pd.DataFrame
        contains at least obj keys in its columns.
    obj: dict
        dict of objective columns.
    
    Return
    ------
    index of closest row to obj
    '''
    df_copy = df.loc[:, [*obj]].copy()

    special_cols = []
    obj_cols = []
    for key in obj:
        if obj[key] is None:
            obj_cols.append(key)
        else:
            special_cols.append(key)
            obj_cols.append(f'{key}_diff')

    for key in special_cols:
        df_copy[f'{key}_diff'] = (df[key]-obj[key]).abs()

    df_copy.sort_values(
        by=obj_cols,
        axis=0,
        ascending=True,
        inplace=True
    )

    return df_copy.index[0]

obj_num_idx = colosest_row(
    df=df,
    obj={
        "num": obj_num,
        "num2": None  # Sort using also 'num2'
    }
)
-3

There are a lot of answers here and many of them are quite good. None are accepted and @Zero 's answer is currently most highly rated. Another answer points out that it doesn't work when the index is not already sorted, but he/she recommends a solution that appears deprecated.

I found I could use the numpy version of argsort() on the values themselves in the following manner, which works even if the indexes are not sorted:

df.iloc[(df['num']-input).abs().values.argsort()[:2]]

See Zero's answer for context.

1
  • I think you're better off using argmin() here.
    – zylatis
    Commented May 20, 2021 at 5:57
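
If only the single closest value is needed, the argmin() suggestion from the comment could look roughly like this. It is a sketch with a hypothetical input of 4.4; it avoids sorting altogether but returns one row rather than two:

```python
import pandas as pd

df = pd.DataFrame({"num": [1, 6, 4, 5, 2]})
target = 4.4  # hypothetical input

diff = (df["num"] - target).abs()
closest_label = diff.idxmin()           # index label of the nearest value
closest_pos = diff.to_numpy().argmin()  # position of the nearest value

print(closest_label, df.loc[closest_label, "num"])
print(closest_pos, df["num"].iloc[closest_pos])
```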
