Skip to content

investigate using boyer moore (for real) on longer query strings #408

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
BurntSushi opened this issue Oct 15, 2017 · 10 comments
Closed

investigate using boyer moore (for real) on longer query strings #408

BurntSushi opened this issue Oct 15, 2017 · 10 comments

Comments

@BurntSushi
Copy link
Member

See: BurntSushi/ripgrep#617

@ethanpailes
Copy link
Contributor

ethanpailes commented Nov 25, 2017

I'm not sure that this will ever be worth it. I implemented tuned Boyer Moore in a branch on my fork to see if I could produce a speedup that could live behind a heuristic for pattern length, but the only case where I could get my TBM to win consistently was on the random data and large needle case. Whenever the text looked like natural language or the needle was small, TBM lost.

When I used a needle size of 1000 TBM won on both random bytes and lorum ipsum, but it still lost on Sherlock.

I can't thing of any real use case where you are searching in random seeming data (maybe looking for a specific base64 encoded string in a big blob of base64 data). I would say genetic data is a use case, but according to J Moore BM and friends are not the right choice for that either.

Alternatively, I could have missed some optimizations that seem clearer to someone with more experience writing fast string search algorithms.

@BurntSushi
Copy link
Member Author

I can't thing of any real use case where you are searching in random seeming data (maybe looking for a specific base64 encoded string in a big blob of base64 data).

One case I can think of is when you're searching for a hash in a file that contains a list of hashes.

Another case that might be worth benchmarking is a needle that contains only common characters. There might be a win there with TBM that doesn't use memchr at all (because memchr has some amount of overhead, so if you call it frequently enough, it should be self defeating). But to be clear, this is only my intuition, which could be very wrong!

Alternatively, I could have missed some optimizations that seem clearer to someone with more experience writing fast string search algorithms.

Just from glancing over the code, it looks pretty good to me. Very nice work. Do you want to work on trying to get it merged? We could start conservatively by only using BM when the needle is over a certain size. I'm not sure if I'd want it to be a separate crate just yet, but we could certainly work toward that.

@ethanpailes
Copy link
Contributor

Another case that might be worth benchmarking is a needle that contains only common characters.

That's a good point. I will write a benchmark for that to see if it is a win. Another nice thing about that is that it would be pretty easy to have a factory method that looks at the pattern and switches to TBM if it does not have a sufficiently rare char.

Just from glancing over the code, it looks pretty good to me. Very nice work. Do you want to work on trying to get it merged? We could start conservatively by only using BM when the needle is over a certain size.

Thanks! I would certainly love to get it merged, but only if I'm convinced it will provide a real win. Unfortunately, even with a needle size of 100 memchr still won (or at least didn't loose), so I'm not sure that a length based heuristic would be small enough to be worth the maintenance burden.

I'm not sure if I'd want it to be a separate crate just yet, but we could certainly work toward that.

Obviously, I'm fine with the code going wherever you would prefer. It does seem like a nice eventual result to be able to pull in fast string searching from its own crate, but right now people can just make a Regex out of a literal, so there is not a big need.

It will take me a little to write that benchmark you brought up, but I should have new results for you relatively soon.

@ethanpailes
Copy link
Contributor

ethanpailes commented Nov 26, 2017

I wrote up a benchmark to find the turnover point where TBM starts winning because the pattern contains all uncommon chars. I did a binary search by twiddling the RANK var here to zero in on a good heuristic value. My results are
here. I used cargo benchcmp memchr_{RANK} bm_{RANK} to view the results. It looks like a RANK of 242 is a good turnover point.

Avoiding some pathological cases seems like a big enough win to be worth trying to merge this. If you want to move forward I can merge the BoyerMooreSearcher back into src/literal.rs, hide it behind a rank heuristic and then open a PR.

ethanpailes pushed a commit to ethanpailes/regex that referenced this issue Nov 29, 2017
While the existing literal string searching algorithm
leveraging memchr is quite fast, in some case more
traditional approaches still make sense. This patch
provides an implimentation of Tuned Boyer-Moore as
laid out in Fast String Searching by Hume & Sunday.
Some refinements to their work were gleened from the
grep source.

See: rust-lang#408
See: BurntSushi/ripgrep#617
ethanpailes pushed a commit to ethanpailes/regex that referenced this issue Nov 29, 2017
While the existing literal string searching algorithm
leveraging memchr is quite fast, in some case more
traditional approaches still make sense. This patch
provides an implimentation of Tuned Boyer-Moore as
laid out in Fast String Searching by Hume & Sunday.
Some refinements to their work were gleened from the
grep source.

See: rust-lang#408
See: BurntSushi/ripgrep#617
ethanpailes pushed a commit to ethanpailes/regex that referenced this issue Nov 29, 2017
While the existing literal string searching algorithm
leveraging memchr is quite fast, in some case more
traditional approaches still make sense. This patch
provides an implimentation of Tuned Boyer-Moore as
laid out in Fast String Searching by Hume & Sunday.
Some refinements to their work were gleened from the
grep source.

See: rust-lang#408
See: BurntSushi/ripgrep#617
ethanpailes pushed a commit to ethanpailes/regex that referenced this issue Nov 29, 2017
While the existing literal string searching algorithm
leveraging memchr is quite fast, in some case more
traditional approaches still make sense. This patch
provides an implimentation of Tuned Boyer-Moore as
laid out in Fast String Searching by Hume & Sunday.
Some refinements to their work were gleened from the
grep source.

See: rust-lang#408
See: BurntSushi/ripgrep#617
ethanpailes pushed a commit to ethanpailes/regex that referenced this issue Nov 29, 2017
While the existing literal string searching algorithm
leveraging memchr is quite fast, in some case more
traditional approaches still make sense. This patch
provides an implimentation of Tuned Boyer-Moore as
laid out in Fast String Searching by Hume & Sunday.
Some refinements to their work were gleened from the
grep source.

See: rust-lang#408
See: BurntSushi/ripgrep#617
@ethanpailes
Copy link
Contributor

It occurs to me that I should probably also add a minimum needle length to the heuristic (memchr is certainly going to be faster at finding "e" than TBM, even though 'e' is pretty common). I need to do a little more work to figure out what a good value for that is.

bors added a commit that referenced this issue Dec 8, 2017
Add an implimentation of Tuned Boyer-Moore.

While the existing literal string searching algorithm
leveraging memchr is quite fast, in some case more
traditional approaches still make sense. This patch
provides an implimentation of Tuned Boyer-Moore as
laid out in Fast String Searching by Hume & Sunday.
Some refinements to their work were gleened from the
grep source.

See: #408
See: [ripgrep-617](BurntSushi/ripgrep#617)
@BurntSushi
Copy link
Member Author

@ethanpailes Oh just seeing this now. Do you want me to kill the merge to let you work on it more, or would you prefer to do it in another PR?

@ethanpailes
Copy link
Contributor

@BurntSushi, It looks like a bug in the windows build saved me. I'll add the length heurstic along with the windows fix.

ethanpailes pushed a commit to ethanpailes/regex that referenced this issue Dec 8, 2017
While the existing literal string searching algorithm
leveraging memchr is quite fast, in some case more
traditional approaches still make sense. This patch
provides an implimentation of Tuned Boyer-Moore as
laid out in Fast String Searching by Hume & Sunday.
Some refinements to their work were gleened from the
grep source.

See: rust-lang#408
See: BurntSushi/ripgrep#617
ethanpailes pushed a commit to ethanpailes/regex that referenced this issue Dec 8, 2017
While the existing literal string searching algorithm
leveraging memchr is quite fast, in some case more
traditional approaches still make sense. This patch
provides an implimentation of Tuned Boyer-Moore as
laid out in Fast String Searching by Hume & Sunday.
Some refinements to their work were gleened from the
grep source.

See: rust-lang#408
See: BurntSushi/ripgrep#617
ethanpailes pushed a commit to ethanpailes/regex that referenced this issue Dec 8, 2017
While the existing literal string searching algorithm
leveraging memchr is quite fast, in some case more
traditional approaches still make sense. This patch
provides an implimentation of Tuned Boyer-Moore as
laid out in Fast String Searching by Hume & Sunday.
Some refinements to their work were gleened from the
grep source.

See: rust-lang#408
See: BurntSushi/ripgrep#617
ethanpailes pushed a commit to ethanpailes/regex that referenced this issue Dec 8, 2017
While the existing literal string searching algorithm
leveraging memchr is quite fast, in some case more
traditional approaches still make sense. This patch
provides an implimentation of Tuned Boyer-Moore as
laid out in Fast String Searching by Hume & Sunday.
Some refinements to their work were gleened from the
grep source.

See: rust-lang#408
See: BurntSushi/ripgrep#617
ethanpailes pushed a commit to ethanpailes/regex that referenced this issue Dec 8, 2017
While the existing literal string searching algorithm
leveraging memchr is quite fast, in some case more
traditional approaches still make sense. This patch
provides an implimentation of Tuned Boyer-Moore as
laid out in Fast String Searching by Hume & Sunday.
Some refinements to their work were gleened from the
grep source.

See: rust-lang#408
See: BurntSushi/ripgrep#617
ethanpailes pushed a commit to ethanpailes/regex that referenced this issue Dec 9, 2017
While the existing literal string searching algorithm
leveraging memchr is quite fast, in some case more
traditional approaches still make sense. This patch
provides an implimentation of Tuned Boyer-Moore as
laid out in Fast String Searching by Hume & Sunday.
Some refinements to their work were gleened from the
grep source.

See: rust-lang#408
See: BurntSushi/ripgrep#617
ethanpailes pushed a commit to ethanpailes/regex that referenced this issue Dec 9, 2017
While the existing literal string searching algorithm
leveraging memchr is quite fast, in some case more
traditional approaches still make sense. This patch
provides an implimentation of Tuned Boyer-Moore as
laid out in Fast String Searching by Hume & Sunday.
Some refinements to their work were gleened from the
grep source.

See: rust-lang#408
See: BurntSushi/ripgrep#617
ethanpailes pushed a commit to ethanpailes/regex that referenced this issue Dec 9, 2017
While the existing literal string searching algorithm
leveraging memchr is quite fast, in some case more
traditional approaches still make sense. This patch
provides an implimentation of Tuned Boyer-Moore as
laid out in Fast String Searching by Hume & Sunday.
Some refinements to their work were gleened from the
grep source.

See: rust-lang#408
See: BurntSushi/ripgrep#617
ethanpailes pushed a commit to ethanpailes/regex that referenced this issue Dec 9, 2017
While the existing literal string searching algorithm
leveraging memchr is quite fast, in some case more
traditional approaches still make sense. This patch
provides an implimentation of Tuned Boyer-Moore as
laid out in Fast String Searching by Hume & Sunday.
Some refinements to their work were gleened from the
grep source.

See: rust-lang#408
See: BurntSushi/ripgrep#617
ethanpailes pushed a commit to ethanpailes/regex that referenced this issue Dec 9, 2017
While the existing literal string searching algorithm
leveraging memchr is quite fast, in some case more
traditional approaches still make sense. This patch
provides an implimentation of Tuned Boyer-Moore as
laid out in Fast String Searching by Hume & Sunday.
Some refinements to their work were gleened from the
grep source.

See: rust-lang#408
See: BurntSushi/ripgrep#617
bors added a commit that referenced this issue Dec 11, 2017
Add an implimentation of Tuned Boyer-Moore.

While the existing literal string searching algorithm
leveraging memchr is quite fast, in some case more
traditional approaches still make sense. This patch
provides an implimentation of Tuned Boyer-Moore as
laid out in Fast String Searching by Hume & Sunday.
Some refinements to their work were gleened from the
grep source.

See: #408
See: [ripgrep-617](BurntSushi/ripgrep#617)
@ethanpailes
Copy link
Contributor

@BurntSushi, is this issue good to be close?

@BurntSushi
Copy link
Member Author

Ah yup! Your investigation was exquisite. :)

@ethanpailes
Copy link
Contributor

Thanks! I definitely wouldn't have felt up to it if it wern't for your awesome maintainer-ship.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants