libc++ std::nth_element is quadratic, should be linear #52747
Comments
Any interest in working on this? I'm just trying to nerd-snipe people; don't feel any obligation. :-)
I'll take a look at this. @ldionne, I have been meaning to increase my participation in libc++ development. Feel free to send some part of the work my way.
I've been stealing the
I guess an introselect that falls back to median-of-medians or heap selection would be enough for most uses. Median-of-medians itself isn't terribly hard to implement in O(1) space, but it's probably better to have it as a doubly-recursive algorithm with quickselect. HeapSelect is probably the easiest solution, though, since heap operations are already in the standard library.
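To make the "heap operations are already in the standard library" point concrete, here is a minimal HeapSelect sketch. The names `heap_select` and `kth_smallest` are hypothetical and this is not libc++ code; it assumes random-access iterators and a stateless strict weak ordering. It uses O(N log K) comparisons with K = distance(first, nth) + 1, so it bounds the worst case but is not linear.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <iterator>
#include <vector>

// Hypothetical helper (not libc++ code): places the nth element as
// std::nth_element would, using only standard heap operations.
template <class RandomIt, class Compare = std::less<>>
void heap_select(RandomIt first, RandomIt nth, RandomIt last, Compare comp = Compare{}) {
    if (nth == last) return;
    assert(first <= nth && nth < last);
    RandomIt middle = std::next(nth);
    // Max-heap over the first K elements; its top is the largest candidate.
    std::make_heap(first, middle, comp);
    for (RandomIt it = middle; it != last; ++it) {
        if (comp(*it, *first)) {
            // *it belongs among the K smallest: evict the current maximum.
            std::pop_heap(first, middle, comp);   // maximum moves to position nth
            std::iter_swap(nth, it);              // swap it out of the candidate set
            std::push_heap(first, middle, comp);  // re-insert the new candidate
        }
    }
    // The heap top is now the K-th smallest element; move it to position nth.
    std::pop_heap(first, middle, comp);
}

// Hypothetical convenience wrapper: returns the k-th smallest value (0-based).
inline int kth_smallest(std::vector<int> v, std::size_t k) {
    assert(k < v.size());
    heap_select(v.begin(), v.begin() + static_cast<std::ptrdiff_t>(k), v.end());
    return v[k];
}
```

As a fallback inside quickselect this would only run once the recursion has gone too deep, so its extra log factor would not affect the average case.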
The author of miniselect here. I analyzed all existing nth_element implementations quite thoroughly, and the one from Alexandrescu seems the best at being standard compliant, fast, and predictable at the same time. I strongly recommend against a median-of-medians implementation, as it has a huge constant factor; see the benchmarks in the repository. The easiest way would be to implement HeapSelect as a fallback; the current nth_element is already quite good in terms of speed. See the benchmark section.
By the way, technically the current implementation is standard compliant.
Quickselect is linear on average.
Thank you so much!
The link you shared says that libstdc++'s implementation is good, but it says about the libc++ implementation:
So I think we definitely do want to improve the libc++ implementation, right?
Benchmarks showed that Floyd-Rivest is the best algorithm to use, but it requires floating-point arithmetic, and the libc++ implementation is much better on average than libstdc++ right now. The final solution depends on the goals. At Google we don't see much performance opportunity in nth_element, for example; not enough to spend even a month working on it. If we want to "fix" worst-case scenarios, then we can add HeapSelect and forget about it after spending several hours. If we want to be at the top of the industry, we can consider Alexandrescu's algorithm or the Floyd-Rivest one. And as I said, we are already standard compliant. Unlike std::sort, std::nth_element is required to be linear on average.
I don't agree, even if we go strictly by the standard, if we allow custom comparison functions. Take this example, where I shuffle the input before using std::nth_element:

```cpp
#include <algorithm>
#include <iostream>
#include <vector>
#include <random>

int quadratic(int size) {
    int num_solid = 0;
    int gas = size + 1;
    int comparison_count = 0;
    std::vector<int> indices(size);
    std::vector<int> values(size);
    for (int i = 0; i < size; ++i) {
        indices[i] = i;
        values[i] = gas;
    }
    std::random_device rd;
    std::mt19937 g(rd());
    std::shuffle(indices.begin(), indices.end(), g); // Enforce uniform input distribution!
    std::nth_element(indices.begin(), indices.end() - 2, indices.end(), [&](int x, int y) {
        // Invariant: gas always compares greater than solid.
        comparison_count += 1;
        if (values[x] == gas && values[y] == gas) {
            // We must solidify one of the two elements.
            // Note: still greater than any previous solids - doesn't violate order.
            values[x] = num_solid++;
            return true;
        } else if (values[x] == gas) {
            return false;
        } else if (values[y] == gas) {
            return true;
        } else {
            return values[x] < values[y];
        }
    });
    return comparison_count;
}

int main() {
    std::cout << "N: comparisons\n";
    for (int i = 100; i <= 6400; i *= 2) {
        std::cout << i << ": " << quadratic(i) << "\n";
    }
    return 0;
}
```

Output:
Mind you, this comparison function satisfies the requirements of a strict weak ordering and is perfectly valid. I don't think the standard claims to only be linear on average when used with
Good point, and self-referencing comparators definitely make the wording slightly ambiguous: alg.nth.element-3
alg.sorting#general-5
I can come up with a sequence of such invocations (from std::sort) where, from some point on, all of them return false with your comparator. And if those invocations are removed, nth_element is going to return the wrong value. What "if the whole range were sorted" means is ambiguous here. Does it mean sorted with std::sort, or in just any order? Or does it mean that std::is_sorted returns true? "On average" does not make sense here either, nor does the definition of a sorted range once the property starts to depend on the order of the checks themselves. We can bring this up for discussion on the C++ mailing list, but generally I try not to be a lawyer, and "on average" in the standard was most likely meant to permit implementing quickselect.
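For a comparator without side effects, the postcondition under discussion can be checked directly. The sketch below uses a hypothetical helper, `nth_postcondition_holds`, and assumes a pure strict weak ordering: nothing before position n may compare greater than the element at n, and nothing after it may compare less. With a stateful comparator like the one earlier in the thread, the answer could depend on the order in which the checks run, which is exactly the ambiguity raised here.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical checker (not part of any library): verifies the partition
// property that std::nth_element must establish around position n.
// Assumes comp is a pure (side-effect-free) strict weak ordering.
template <class T, class Compare = std::less<>>
bool nth_postcondition_holds(const std::vector<T>& v, std::size_t n, Compare comp = Compare{}) {
    // n == size corresponds to nth == last, for which nth_element has no effect.
    if (n >= v.size()) return n == v.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (comp(v[n], v[i])) return false;  // something before n is greater
    }
    for (std::size_t j = n + 1; j < v.size(); ++j) {
        if (comp(v[j], v[n])) return false;  // something after n is smaller
    }
    return true;
}
```

Under a strict weak ordering these two loops together imply that no element after position n compares less than any element before it, so the element at n is equivalent to the one a full sort would place there.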
I'm reading the introselect Wikipedia article again and surprised to see that it includes strategies that, as far as I can tell, haven't been explored in miniselect:
Also, if I understand your implementation in miniselect correctly, you implemented a raw median-of-medians, which doesn't attempt to pick the pivot any other way than with median-of-medians itself. The techniques proposed in the quote above might be worth exploring as a simple alternative to Alexandrescu's adaptive quickselect.
The reproduction program is almost the same as my 7-year-old one for std::sort that got fixed only recently:
The output:
Evidently quadratic, but the standard requires:
The problem is that pure quickselect is implemented without a fallback for worst cases, like median of medians.
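A fallback of the kind described can be sketched as an introselect: quickselect with a budget on partition rounds that switches to a heap-based selection once the budget runs out. Everything here is an assumption for illustration rather than the libc++ implementation: the name `introselect_sketch`, the budget of about 2·log2(N) rounds, the median-of-three pivot, and the use of `std::partial_sort` as the fallback. It bounds the worst case at O(N log N); a truly linear worst case would need something like median of medians.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <iterator>

// Hypothetical sketch: quickselect with a depth budget and a heap-based fallback.
// Assumes random-access iterators and a copyable value type.
template <class RandomIt, class Compare = std::less<>>
void introselect_sketch(RandomIt first, RandomIt nth, RandomIt last, Compare comp = Compare{}) {
    if (nth == last) return;
    assert(first <= nth && nth < last);
    // Allow roughly 2 * log2(N) partition rounds before giving up on quickselect.
    std::size_t budget = 0;
    for (auto n = last - first; n > 1; n >>= 1) budget += 2;

    while (last - first > 3) {
        if (budget == 0) {
            // Fallback: O(N log N) worst case instead of quadratic.
            std::partial_sort(first, std::next(nth), last, comp);
            return;
        }
        --budget;
        // Median-of-three: order *first, *mid, *back so that *mid is the median.
        RandomIt mid = first + (last - first) / 2;
        RandomIt back = last - 1;
        if (comp(*mid, *first)) std::iter_swap(mid, first);
        if (comp(*back, *first)) std::iter_swap(back, first);
        if (comp(*back, *mid)) std::iter_swap(back, mid);
        const auto pivot = *mid;  // copy, since partitioning moves elements
        // Three-way partition: [less than pivot | equivalent to pivot | greater].
        RandomIt lt = std::partition(first, last,
            [&](const auto& x) { return comp(x, pivot); });
        RandomIt ge = std::partition(lt, last,
            [&](const auto& x) { return !comp(pivot, x); });
        if (nth < lt) {
            last = lt;       // continue in the "less" part
        } else if (nth >= ge) {
            first = ge;      // continue in the "greater" part
        } else {
            return;          // nth lies among elements equivalent to the pivot
        }
    }
    std::sort(first, last, comp);  // at most three elements remain
}
```

The three-way partition guarantees progress even when many elements are equivalent, because the middle block always contains at least the pivot.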