-
Notifications
You must be signed in to change notification settings - Fork 462
investigate using boyer moore (for real) on longer query strings #408
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I'm not sure that this will ever be worth it. I implemented tuned Boyer Moore in a branch on my fork to see if I could produce a speedup that could live behind a heuristic for pattern length, but the only case where I could get my TBM to win consistently was on the random data and large needle case. Whenever the text looked like natural language or the needle was small, TBM lost. When I used a needle size of 1000 TBM won on both random bytes and lorum ipsum, but it still lost on Sherlock. I can't thing of any real use case where you are searching in random seeming data (maybe looking for a specific base64 encoded string in a big blob of base64 data). I would say genetic data is a use case, but according to J Moore BM and friends are not the right choice for that either. Alternatively, I could have missed some optimizations that seem clearer to someone with more experience writing fast string search algorithms. |
One case I can think of is when you're searching for a hash in a file that contains a list of hashes. Another case that might be worth benchmarking is a needle that contains only common characters. There might be a win there with TBM that doesn't use memchr at all (because memchr has some amount of overhead, so if you call it frequently enough, it should be self defeating). But to be clear, this is only my intuition, which could be very wrong!
Just from glancing over the code, it looks pretty good to me. Very nice work. Do you want to work on trying to get it merged? We could start conservatively by only using BM when the needle is over a certain size. I'm not sure if I'd want it to be a separate crate just yet, but we could certainly work toward that. |
That's a good point. I will write a benchmark for that to see if it is a win. Another nice thing about that is that it would be pretty easy to have a factory method that looks at the pattern and switches to TBM if it does not have a sufficiently rare char.
Thanks! I would certainly love to get it merged, but only if I'm convinced it will provide a real win. Unfortunately, even with a needle size of 100 memchr still won (or at least didn't loose), so I'm not sure that a length based heuristic would be small enough to be worth the maintenance burden.
Obviously, I'm fine with the code going wherever you would prefer. It does seem like a nice eventual result to be able to pull in fast string searching from its own crate, but right now people can just make a Regex out of a literal, so there is not a big need. It will take me a little to write that benchmark you brought up, but I should have new results for you relatively soon. |
I wrote up a benchmark to find the turnover point where TBM starts winning because the pattern contains all uncommon chars. I did a binary search by twiddling the Avoiding some pathological cases seems like a big enough win to be worth trying to merge this. If you want to move forward I can merge the BoyerMooreSearcher back into |
While the existing literal string searching algorithm leveraging memchr is quite fast, in some case more traditional approaches still make sense. This patch provides an implimentation of Tuned Boyer-Moore as laid out in Fast String Searching by Hume & Sunday. Some refinements to their work were gleened from the grep source. See: rust-lang#408 See: BurntSushi/ripgrep#617
While the existing literal string searching algorithm leveraging memchr is quite fast, in some case more traditional approaches still make sense. This patch provides an implimentation of Tuned Boyer-Moore as laid out in Fast String Searching by Hume & Sunday. Some refinements to their work were gleened from the grep source. See: rust-lang#408 See: BurntSushi/ripgrep#617
While the existing literal string searching algorithm leveraging memchr is quite fast, in some case more traditional approaches still make sense. This patch provides an implimentation of Tuned Boyer-Moore as laid out in Fast String Searching by Hume & Sunday. Some refinements to their work were gleened from the grep source. See: rust-lang#408 See: BurntSushi/ripgrep#617
While the existing literal string searching algorithm leveraging memchr is quite fast, in some case more traditional approaches still make sense. This patch provides an implimentation of Tuned Boyer-Moore as laid out in Fast String Searching by Hume & Sunday. Some refinements to their work were gleened from the grep source. See: rust-lang#408 See: BurntSushi/ripgrep#617
While the existing literal string searching algorithm leveraging memchr is quite fast, in some case more traditional approaches still make sense. This patch provides an implimentation of Tuned Boyer-Moore as laid out in Fast String Searching by Hume & Sunday. Some refinements to their work were gleened from the grep source. See: rust-lang#408 See: BurntSushi/ripgrep#617
It occurs to me that I should probably also add a minimum needle length to the heuristic (memchr is certainly going to be faster at finding "e" than TBM, even though 'e' is pretty common). I need to do a little more work to figure out what a good value for that is. |
Add an implimentation of Tuned Boyer-Moore. While the existing literal string searching algorithm leveraging memchr is quite fast, in some case more traditional approaches still make sense. This patch provides an implimentation of Tuned Boyer-Moore as laid out in Fast String Searching by Hume & Sunday. Some refinements to their work were gleened from the grep source. See: #408 See: [ripgrep-617](BurntSushi/ripgrep#617)
@ethanpailes Oh just seeing this now. Do you want me to kill the merge to let you work on it more, or would you prefer to do it in another PR? |
@BurntSushi, It looks like a bug in the windows build saved me. I'll add the length heurstic along with the windows fix. |
While the existing literal string searching algorithm leveraging memchr is quite fast, in some case more traditional approaches still make sense. This patch provides an implimentation of Tuned Boyer-Moore as laid out in Fast String Searching by Hume & Sunday. Some refinements to their work were gleened from the grep source. See: rust-lang#408 See: BurntSushi/ripgrep#617
While the existing literal string searching algorithm leveraging memchr is quite fast, in some case more traditional approaches still make sense. This patch provides an implimentation of Tuned Boyer-Moore as laid out in Fast String Searching by Hume & Sunday. Some refinements to their work were gleened from the grep source. See: rust-lang#408 See: BurntSushi/ripgrep#617
While the existing literal string searching algorithm leveraging memchr is quite fast, in some case more traditional approaches still make sense. This patch provides an implimentation of Tuned Boyer-Moore as laid out in Fast String Searching by Hume & Sunday. Some refinements to their work were gleened from the grep source. See: rust-lang#408 See: BurntSushi/ripgrep#617
While the existing literal string searching algorithm leveraging memchr is quite fast, in some case more traditional approaches still make sense. This patch provides an implimentation of Tuned Boyer-Moore as laid out in Fast String Searching by Hume & Sunday. Some refinements to their work were gleened from the grep source. See: rust-lang#408 See: BurntSushi/ripgrep#617
While the existing literal string searching algorithm leveraging memchr is quite fast, in some case more traditional approaches still make sense. This patch provides an implimentation of Tuned Boyer-Moore as laid out in Fast String Searching by Hume & Sunday. Some refinements to their work were gleened from the grep source. See: rust-lang#408 See: BurntSushi/ripgrep#617
While the existing literal string searching algorithm leveraging memchr is quite fast, in some case more traditional approaches still make sense. This patch provides an implimentation of Tuned Boyer-Moore as laid out in Fast String Searching by Hume & Sunday. Some refinements to their work were gleened from the grep source. See: rust-lang#408 See: BurntSushi/ripgrep#617
While the existing literal string searching algorithm leveraging memchr is quite fast, in some case more traditional approaches still make sense. This patch provides an implimentation of Tuned Boyer-Moore as laid out in Fast String Searching by Hume & Sunday. Some refinements to their work were gleened from the grep source. See: rust-lang#408 See: BurntSushi/ripgrep#617
While the existing literal string searching algorithm leveraging memchr is quite fast, in some case more traditional approaches still make sense. This patch provides an implimentation of Tuned Boyer-Moore as laid out in Fast String Searching by Hume & Sunday. Some refinements to their work were gleened from the grep source. See: rust-lang#408 See: BurntSushi/ripgrep#617
While the existing literal string searching algorithm leveraging memchr is quite fast, in some case more traditional approaches still make sense. This patch provides an implimentation of Tuned Boyer-Moore as laid out in Fast String Searching by Hume & Sunday. Some refinements to their work were gleened from the grep source. See: rust-lang#408 See: BurntSushi/ripgrep#617
While the existing literal string searching algorithm leveraging memchr is quite fast, in some case more traditional approaches still make sense. This patch provides an implimentation of Tuned Boyer-Moore as laid out in Fast String Searching by Hume & Sunday. Some refinements to their work were gleened from the grep source. See: rust-lang#408 See: BurntSushi/ripgrep#617
Add an implimentation of Tuned Boyer-Moore. While the existing literal string searching algorithm leveraging memchr is quite fast, in some case more traditional approaches still make sense. This patch provides an implimentation of Tuned Boyer-Moore as laid out in Fast String Searching by Hume & Sunday. Some refinements to their work were gleened from the grep source. See: #408 See: [ripgrep-617](BurntSushi/ripgrep#617)
@BurntSushi, is this issue good to be close? |
Ah yup! Your investigation was exquisite. :) |
Thanks! I definitely wouldn't have felt up to it if it wern't for your awesome maintainer-ship. |
See: BurntSushi/ripgrep#617
The text was updated successfully, but these errors were encountered: