Skip to content

Poor performance in some cases compared to oniguruma #604

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
lopopolo opened this issue Aug 1, 2019 · 5 comments
Closed

Poor performance in some cases compared to oniguruma #604

lopopolo opened this issue Aug 1, 2019 · 5 comments

Comments

@lopopolo
Copy link
Contributor

lopopolo commented Aug 1, 2019

I'm looking at replacing oniguruma with regex in some situations for the Ruby that I'm building.

I am benchmarking the following three Regexps over this several megabyte text corpus:

bench('Email', '[\w\.+-]+@[\w\.-]+\.[\w\.-]+')
bench('URI', 'https?://(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)')
bench('IP', '\b(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\b')

For Email, regex is 10x faster than oniguruma. For URI, regex is 2x slower than oniguruma. For IP, regex is 20x slower than oniguruma.

regex performance

Email: 92 matches
..................................................

    compile: 85.52ms elapsed in 50 iterations (avg. 1.71ms / iteration)
    scan: 1569.22ms elapsed in 50 iterations (avg. 31.38ms / iteration)
    scan with block: 1025.49ms elapsed in 50 iterations (avg. 20.5ms / iteration)

URI: 5388 matches
..................................................

    compile: 71.84ms elapsed in 50 iterations (avg. 1.43ms / iteration)
    scan: 2336.46ms elapsed in 50 iterations (avg. 46.72ms / iteration)
    scan with block: 2045.79ms elapsed in 50 iterations (avg. 40.91ms / iteration)

IP: 6 matches
..................................................

    compile: 10.79ms elapsed in 50 iterations (avg. 0.21ms / iteration)
    scan: 25693.73ms elapsed in 50 iterations (avg. 513.87ms / iteration)
    scan with block: 25642.21ms elapsed in 50 iterations (avg. 512.84ms / iteration)

oniguruma performance (via rust-onig)

Email: 92 matches
..................................................

    compile: 5.89ms elapsed in 50 iterations (avg. 0.11ms / iteration)
    scan: 16335.45ms elapsed in 50 iterations (avg. 326.7ms / iteration)
    scan with block: 16228.96ms elapsed in 50 iterations (avg. 324.57ms / iteration)

URI: 5388 matches
..................................................

    compile: 1.68ms elapsed in 50 iterations (avg. 0.03ms / iteration)
    scan: 1366.95ms elapsed in 50 iterations (avg. 27.33ms / iteration)
    scan with block: 1349.82ms elapsed in 50 iterations (avg. 26.99ms / iteration)

IP: 6 matches
..................................................

    compile: 3.79ms elapsed in 50 iterations (avg. 0.07000000000000001ms / iteration)
    scan: 1465.14ms elapsed in 50 iterations (avg. 29.3ms / iteration)
    scan with block: 1431.35ms elapsed in 50 iterations (avg. 28.62ms / iteration)

If you're interested in doing so, you can invoke this benchmark in Artichoke with:

cargo run --release --bin string_scan_bench -- artichoke-frontend/ruby/benches/string_scan.rb

The benchmark on master (with oniguruma) is different than the benchmark on this branch because I've tweaked the Regexps to remove lookahead patterns.

@lopopolo
Copy link
Contributor Author

lopopolo commented Aug 1, 2019

Compile times are also significantly slower but that seems to be a known property: #452

@BurntSushi
Copy link
Member

Hiya! Thanks for doing an analysis and reporting this issue. I appreciate it. I have two high level thoughts off the bat:

  1. If Artichoke is truly meant to be a drop-in replacement for Ruby, you'll definitely want to be careful with when you use the regex crate. It could have subtly different matches in some cases.
  2. I don't know where to look to see how you're building these regexes. My first thought is that you have Unicode enabled with the regex crate (which is on by default), but do not have Unicode enabled in Oniguruma? I confess I am less familiar with Oniguruma, so I don't know what its defaults all, but for example, PCRE does not have Unicode support enabled by default. I looked at the onig crate docs, but found zero mentions of Unicode anywhere. Looking at the Oniguruma docs itself, there is clearly a Unicode and not-Unicode mode for classes like \w, although it does not say whether its word boundaries are Unicode aware. There's otherwise zero mention of Unicode in the API docs. Okay, so I found this, which seems to imply that the interpretation of \w is based on the selected encoding? (Which seems weird to me.) Either way, this needs to be figured out in order to do an accurate comparison.

With all that said, I can likely explain why the URI and IP benchmarks are so much slower. My guess is because you are using a Unicode word boundary on text that is not entirely ASCII (both of those conditions must be true). Specifically, this is one of the few explicit cases that the faster DFA just cannot handle. So when it has a Unicode word boundary and it sees a non-ASCII byte, the DFA quits and falls over to a much slower engine. This is explained in more detail in the PERFORMANCE.md guide.

So, if you actually need Unicode support here, then you ought to make sure your benchmarking with equivalent options. If the regex crate still turns up slower, then that's a known thing that is unlikely to be fixed any time soon. Otherwise, if you can use the ASCII-only interpretation of everything, then just disable Unicode mode (either with RegexBuilder or by adding (?-u) to the beginning of the regex pattern), and things should hopefully speed up.

Compile times are also significantly slower but that seems to be a known property: #452

It is known, yes, but #452 is the wrong thing to look at. The PR talks about parse time, which is typically an overall very small portion of total compilation time of a regex. Your regexes aren't exactly small, and you're using some fairly large Unicode classes in your Email benchmark, which is likely why it is the slowest to compile.

@lopopolo
Copy link
Contributor Author

lopopolo commented Aug 1, 2019

@BurntSushi thank you for the detailed response. I have an easy way of answering how \b behaves in Oniguruma: I can use MRI Ruby to match some Regexps.

MRI:

[2.6.0] > "\u3031".encoding
=> #<Encoding:UTF-8>
[2.6.0] > "\u3031"
=> "〱"
[2.6.0] > "\u3031".encoding
=> #<Encoding:UTF-8>
[2.6.0] > /\b/ =~ "\u3031"
=> 0
[2.6.0] > " α".encoding
=> #<Encoding:UTF-8>
[2.6.0] > /\b/ =~ " α"
=> 1

Artichoke with onig:

[Compiled with rustc 1.38.0-nightly 6e310f2 2019-07-07]
>>> /\b/ =~ "\u3031"
=> 0
>>> /\b/ =~ " α"
=> 1

I need to support non-ASCII word boundaries.

Thinking about how I can work around this pathological case:

  • I already was planning on bundling oniguruma because I need to support backreferences and lookahead.
  • Use regex crate as a fast path if it can successfully compile the regex and the pathological DFA conditions are not present. Otherwise fallback to oniguruma.
  • Presence of non-ascii characters is easy to detect with Rust String APIs.

Does Regex expose whether the compiled regex contains \b character classes or will I need to separately parse and walk the HIR? Would you be open to exposing this API given the known limitations of the DFA?

@BurntSushi
Copy link
Member

No, there's no way to tell whether the regex contains a Unicode \b other than by parsing it. I don't think I want to expose this as an API in the regex crate itself. It's really an implementation detail.

It sounds like there isn't anything to be done here on my end. Some day, I'd like to try and fix the Unicode \b issue, but it's hard.

@lopopolo
Copy link
Contributor Author

lopopolo commented Aug 1, 2019

Thank you @BurntSushi for helping me to find a path forward.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants