Often, we will want to get a specific row that marks the minimum or maximum of one of its columns. Let's suppose we have the SF Salaries dataset from Kaggle, and we want to find the name of the employee with the largest total pay plus benefits. Experience with writing NumPy/Pandas filter conditions will quickly lead us to the following version:
sal[sal['TotalPayBenefits'] == sal['TotalPayBenefits'].max()]['EmployeeName']
which is absolutely valid, but is it the only option? I could imagine that in a large dataset there would be quite a lot of comparison involved, plus the creation of a whole new series (the Boolean filter condition). Is there perhaps a more performant one? How about trying out idxmax() (or, for those coming from NumPy, argmax())? Given a column, idxmax() returns the index label of the row where that column is at its highest, which we can feed to loc; argmax() returns the integer position of that row (in older pandas versions it was an alias for idxmax(), but current versions return the position), so it belongs with iloc:
sal.loc[sal['TotalPayBenefits'].idxmax()]['EmployeeName']
sal.iloc[sal['TotalPayBenefits'].argmax()]['EmployeeName']
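To make the comparison concrete, here is a minimal, self-contained sketch. The frame is a made-up stand-in for the Kaggle data: the column names match the dataset, but the rows are invented.

import pandas as pd

# made-up stand-in for the SF Salaries frame (values are invented)
sal = pd.DataFrame({
    'EmployeeName': ['Alice Smith', 'Bob Jones', 'Carol Lee'],
    'TotalPayBenefits': [250000.0, 567595.43, 310000.0],
})

# filter condition: builds a Boolean mask over the whole column
print(sal[sal['TotalPayBenefits'] == sal['TotalPayBenefits'].max()]['EmployeeName'])

# idxmax(): index label of the maximum row, looked up with loc
print(sal.loc[sal['TotalPayBenefits'].idxmax()]['EmployeeName'])

# argmax(): integer position of the maximum row, looked up with iloc
print(sal.iloc[sal['TotalPayBenefits'].argmax()]['EmployeeName'])

All three spot 'Bob Jones'; note that the filter version returns a one-element Series (and would return several rows on a tie), while the idxmax()/argmax() versions return a single scalar.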
I did a quick performance check, and indeed, idxmax()/argmax() turn out to be roughly twice as fast:
# the original filter condition
815 µs ± 11.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# idxmax()
402 µs ± 7.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# argmax
404 µs ± 8.81 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
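For reference, this is roughly how such numbers can be collected in an IPython/Jupyter session with the %timeit magic (the exact figures will of course vary with the machine and pandas version):

# in an IPython/Jupyter session; %timeit is an IPython magic, not plain Python
%timeit sal[sal['TotalPayBenefits'] == sal['TotalPayBenefits'].max()]['EmployeeName']
%timeit sal.loc[sal['TotalPayBenefits'].idxmax()]['EmployeeName']
%timeit sal.iloc[sal['TotalPayBenefits'].argmax()]['EmployeeName']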