investigate using boyer moore (for real) on longer query strings #408

BurntSushi · 2017-10-15T21:29:32Z

ethanpailes · 2017-11-25T19:47:36Z

I'm not sure that this will ever be worth it. I implemented tuned Boyer Moore in a branch on my fork to see if I could produce a speedup that could live behind a heuristic for pattern length, but the only case where I could get my TBM to win consistently was on the random data and large needle case. Whenever the text looked like natural language or the needle was small, TBM lost.

When I used a needle size of 1000 TBM won on both random bytes and lorum ipsum, but it still lost on Sherlock.

I can't thing of any real use case where you are searching in random seeming data (maybe looking for a specific base64 encoded string in a big blob of base64 data). I would say genetic data is a use case, but according to J Moore BM and friends are not the right choice for that either.

Alternatively, I could have missed some optimizations that seem clearer to someone with more experience writing fast string search algorithms.

BurntSushi · 2017-11-25T20:37:33Z

I can't thing of any real use case where you are searching in random seeming data (maybe looking for a specific base64 encoded string in a big blob of base64 data).

One case I can think of is when you're searching for a hash in a file that contains a list of hashes.

Another case that might be worth benchmarking is a needle that contains only common characters. There might be a win there with TBM that doesn't use memchr at all (because memchr has some amount of overhead, so if you call it frequently enough, it should be self defeating). But to be clear, this is only my intuition, which could be very wrong!

Alternatively, I could have missed some optimizations that seem clearer to someone with more experience writing fast string search algorithms.

Just from glancing over the code, it looks pretty good to me. Very nice work. Do you want to work on trying to get it merged? We could start conservatively by only using BM when the needle is over a certain size. I'm not sure if I'd want it to be a separate crate just yet, but we could certainly work toward that.

ethanpailes · 2017-11-25T22:03:13Z

Another case that might be worth benchmarking is a needle that contains only common characters.

That's a good point. I will write a benchmark for that to see if it is a win. Another nice thing about that is that it would be pretty easy to have a factory method that looks at the pattern and switches to TBM if it does not have a sufficiently rare char.

Just from glancing over the code, it looks pretty good to me. Very nice work. Do you want to work on trying to get it merged? We could start conservatively by only using BM when the needle is over a certain size.

Thanks! I would certainly love to get it merged, but only if I'm convinced it will provide a real win. Unfortunately, even with a needle size of 100 memchr still won (or at least didn't loose), so I'm not sure that a length based heuristic would be small enough to be worth the maintenance burden.

I'm not sure if I'd want it to be a separate crate just yet, but we could certainly work toward that.

Obviously, I'm fine with the code going wherever you would prefer. It does seem like a nice eventual result to be able to pull in fast string searching from its own crate, but right now people can just make a Regex out of a literal, so there is not a big need.

It will take me a little to write that benchmark you brought up, but I should have new results for you relatively soon.

ethanpailes · 2017-11-26T17:51:09Z

I wrote up a benchmark to find the turnover point where TBM starts winning because the pattern contains all uncommon chars. I did a binary search by twiddling the RANK var here to zero in on a good heuristic value. My results are
here. I used cargo benchcmp memchr_{RANK} bm_{RANK} to view the results. It looks like a RANK of 242 is a good turnover point.

Avoiding some pathological cases seems like a big enough win to be worth trying to merge this. If you want to move forward I can merge the BoyerMooreSearcher back into src/literal.rs, hide it behind a rank heuristic and then open a PR.

While the existing literal string searching algorithm leveraging memchr is quite fast, in some case more traditional approaches still make sense. This patch provides an implimentation of Tuned Boyer-Moore as laid out in Fast String Searching by Hume & Sunday. Some refinements to their work were gleened from the grep source. See: rust-lang#408 See: BurntSushi/ripgrep#617

ethanpailes · 2017-12-08T14:50:15Z

It occurs to me that I should probably also add a minimum needle length to the heuristic (memchr is certainly going to be faster at finding "e" than TBM, even though 'e' is pretty common). I need to do a little more work to figure out what a good value for that is.

Add an implimentation of Tuned Boyer-Moore. While the existing literal string searching algorithm leveraging memchr is quite fast, in some case more traditional approaches still make sense. This patch provides an implimentation of Tuned Boyer-Moore as laid out in Fast String Searching by Hume & Sunday. Some refinements to their work were gleened from the grep source. See: #408 See: [ripgrep-617](BurntSushi/ripgrep#617)

BurntSushi · 2017-12-08T14:53:26Z

@ethanpailes Oh just seeing this now. Do you want me to kill the merge to let you work on it more, or would you prefer to do it in another PR?

ethanpailes · 2017-12-08T15:52:01Z

@BurntSushi, It looks like a bug in the windows build saved me. I'll add the length heurstic along with the windows fix.

While the existing literal string searching algorithm leveraging memchr is quite fast, in some case more traditional approaches still make sense. This patch provides an implimentation of Tuned Boyer-Moore as laid out in Fast String Searching by Hume & Sunday. Some refinements to their work were gleened from the grep source. See: rust-lang#408 See: BurntSushi/ripgrep#617

Add an implimentation of Tuned Boyer-Moore. While the existing literal string searching algorithm leveraging memchr is quite fast, in some case more traditional approaches still make sense. This patch provides an implimentation of Tuned Boyer-Moore as laid out in Fast String Searching by Hume & Sunday. Some refinements to their work were gleened from the grep source. See: #408 See: [ripgrep-617](BurntSushi/ripgrep#617)

ethanpailes · 2018-01-29T15:23:48Z

@BurntSushi, is this issue good to be close?

BurntSushi · 2018-01-29T16:50:09Z

Ah yup! Your investigation was exquisite. :)

ethanpailes · 2018-01-29T18:56:00Z

Thanks! I definitely wouldn't have felt up to it if it wern't for your awesome maintainer-ship.

BurntSushi mentioned this issue Oct 15, 2017

Slower than grep: test case on 10M-line text file BurntSushi/ripgrep#617

Closed

ethanpailes mentioned this issue Nov 29, 2017

Add an implimentation of Tuned Boyer-Moore. #419

Merged

BurntSushi closed this as completed Jan 29, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

investigate using boyer moore (for real) on longer query strings #408

investigate using boyer moore (for real) on longer query strings #408

BurntSushi commented Oct 15, 2017

ethanpailes commented Nov 25, 2017 •

edited

Loading

BurntSushi commented Nov 25, 2017

ethanpailes commented Nov 25, 2017

ethanpailes commented Nov 26, 2017 •

edited

Loading

ethanpailes commented Dec 8, 2017

BurntSushi commented Dec 8, 2017

ethanpailes commented Dec 8, 2017

ethanpailes commented Jan 29, 2018

BurntSushi commented Jan 29, 2018

ethanpailes commented Jan 29, 2018

investigate using boyer moore (for real) on longer query strings #408

investigate using boyer moore (for real) on longer query strings #408

Comments

BurntSushi commented Oct 15, 2017

ethanpailes commented Nov 25, 2017 • edited Loading

BurntSushi commented Nov 25, 2017

ethanpailes commented Nov 25, 2017

ethanpailes commented Nov 26, 2017 • edited Loading

ethanpailes commented Dec 8, 2017

BurntSushi commented Dec 8, 2017

ethanpailes commented Dec 8, 2017

ethanpailes commented Jan 29, 2018

BurntSushi commented Jan 29, 2018

ethanpailes commented Jan 29, 2018

ethanpailes commented Nov 25, 2017 •

edited

Loading

ethanpailes commented Nov 26, 2017 •

edited

Loading