DFA misses an opportunity to short circuit with an anchored RegexSet #186

Closed
dprien opened this issue Mar 14, 2016 · 11 comments

@dprien

dprien commented Mar 14, 2016

While trying to implement a lexer, I noticed that RegexSet::matches exhibits very bad performance when used on long strings.
The following test case runs in 1 minute, 16 seconds (cargo build --release):

extern crate regex;
fn main() {
    let mut test_input = String::from("42 ");
    for _ in 0 .. 100_000 { test_input.push_str("XXXXXXXXXXXXXXXXXXXX"); } // Delete me

    let set = regex::RegexSet::new(&[r"^[0-9]+", r"^THIS_SHOULD_NEVER_MATCH"]).unwrap();
    for _ in 0 .. 10_000 { println!("{:?}", set.matches(&test_input)); }
}

Without the line marked with // Delete me, the program runs in about 50 ms.

If I understand correctly, matches should try to match only at the very beginning of the string, thanks to the ^ anchor. However, it would seem that matches traverses the whole string on each call.

@dprien dprien changed the title Performance of RegexSet::matches with ^-anchored regex on long strings Performance of RegexSet::matches with ^-anchored regular expressions on long strings Mar 14, 2016
@BurntSushi BurntSushi added the bug label Mar 14, 2016
@BurntSushi
Member

Yup, this looks like a bug. Playing around with it, it seems the anchoring is detected and a preceding .*? is omitted from the DFA (good). That should cause the DFA to reach a dead state and quit, but it does indeed appear to be continuing to the end of the input. Still looking...

@BurntSushi
Member

Thanks for the bug report! Should be fixed now and is on crates.io in 0.1.57.

@dprien
Author

dprien commented Mar 14, 2016

Thanks a lot for the quick response!

BurntSushi added a commit that referenced this issue Mar 15, 2016
The fix is totally bogus. We obviously can't do anything meaningful with
recording matches while computing cached states, since once a state is
cached, we won't actually record the match.

Stay tuned for greener pastures.

This reverts commit 3fb34e2.
@BurntSushi
Member

OK, so this is pretty tricky to fix properly. I think it will need to persist as a bug for a bit until I figure out how best to fix it. Out of curiosity, what were you trying to do that caused you to notice this behavior?


Leaving some breadcrumbs for when I can come back to this:

The fundamental problem is in how the DFA detects which regexes match. At the moment, it looks at the very last state that matched and inspects which NFA states it contains. Each NFA state that is a match has a pointer to the corresponding regex, and the DFA returns only those pointers. To make this work, the DFA has to carry the match instructions throughout matching, but this means the DFA never enters a dead state, since a match instruction is always available (assuming another regex has already matched, as in this case). Because the DFA never enters a dead state, it processes the entire haystack.
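The short circuit the DFA is missing here can be illustrated with a toy search loop. This is a hypothetical sketch, not the crate's actual implementation: once every pattern in the set is anchored, entering a dead state means no later byte can produce a new match, so the loop can stop immediately instead of scanning the rest of the haystack.

```rust
// Hypothetical sketch of a DFA search loop with a dead-state check;
// `DEAD`, `search`, and the table layout are illustrative, not the
// regex crate's real internals.
const DEAD: usize = usize::MAX;

/// `trans[s][b]` is the next state from state `s` on byte `b`;
/// `is_match[s]` says whether state `s` is a match state.
fn search(trans: &[[usize; 256]], is_match: &[bool], start: usize, haystack: &[u8]) -> bool {
    let mut state = start;
    let mut matched = is_match[state];
    for &b in haystack {
        state = trans[state][b as usize];
        if state == DEAD {
            // With anchored patterns, nothing later in the haystack
            // can match, so stop here instead of scanning the rest.
            return matched;
        }
        matched |= is_match[state];
    }
    matched
}

fn main() {
    // Tiny DFA for the anchored pattern ^[0-9]+:
    // state 0 is the start, state 1 is the match state,
    // and any non-digit byte leads to DEAD.
    let mut t0 = [DEAD; 256];
    let mut t1 = [DEAD; 256];
    for b in b'0'..=b'9' {
        t0[b as usize] = 1;
        t1[b as usize] = 1;
    }
    let trans = [t0, t1];
    let is_match = [false, true];
    // Returns after the space, without scanning the trailing X's.
    assert!(search(&trans, &is_match, 0, b"42 XXXXXXXXXXXXXXXX"));
    assert!(!search(&trans, &is_match, 0, b"no digits here"));
    println!("ok");
}
```

On the `42 XXX…` input from the original report, such a loop would abandon the scan at the first space rather than walking all two million bytes.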

I had attempted to fix this by instead keeping track of matches as the DFA states are created. This doesn't actually work because the DFA uses a cache of states, which means subsequent executions of the DFA may never create any states, and therefore no matches will be captured.

One way around this is to look for match instructions in DFA states that are match states during the search. We could then stop carrying them through all of the DFA states, which should permit short circuiting. The problem with this approach is that scanning all of the NFA states whenever a DFA match state is entered could be quite expensive, since the number of NFA states in any given DFA state could be quite large.

My next idea is to track all of the matches in the states themselves. Currently, each state only tracks whether it is a match state or not, which is sufficient for normal regexes. I think this could be generalized to record exactly which NFA states are match states. That could be used like the approach above, but without needing to scan all of the NFA states in DFA match states. Unfortunately, even this can be problematic if the regex set contains many regexes, since it could mean traversing a large list on each step of the DFA. Another problem with this approach is that it replaces a one-byte bool in State with a 24-byte Vec. That really hurts.
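The size cost mentioned above is easy to check, assuming a 64-bit target: a `Vec` is three machine words (pointer, capacity, length), while a `bool` is a single byte.

```rust
// On a 64-bit target, Vec<T> is three usize words (ptr, cap, len),
// regardless of T, while bool occupies one byte.
fn main() {
    assert_eq!(std::mem::size_of::<Vec<u32>>(), 24);
    assert_eq!(std::mem::size_of::<bool>(), 1);
    println!(
        "Vec: {} bytes, bool: {} byte",
        std::mem::size_of::<Vec<u32>>(),
        std::mem::size_of::<bool>()
    );
}
```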

The current implementation avoids this nastiness and only looks for match states at the very end of execution.

@BurntSushi BurntSushi reopened this Mar 15, 2016
@BurntSushi BurntSushi changed the title Performance of RegexSet::matches with ^-anchored regular expressions on long strings DFA misses an opportunity to short circuit with an anchored RegexSet Mar 15, 2016
@dprien
Author

dprien commented Mar 15, 2016

Out of curiosity, what were you trying to do that caused you to notice this behavior?

I was trying to implement a lexer like this:

impl Iterator for Lexer {
    type Item = Token;
    fn next(&mut self) -> Option<Token> {
        let slice: &[u8] = &self.input[self.pos ..];
        let matches = self.regex_set.matches(unsafe { str::from_utf8_unchecked(slice) });
        // ...
        self.pos += /* number of bytes consumed by the last token */;
        Some(the_token)
    }
}

The idea was to use RegexSet to quickly find the index of the matching regular expression, and then use .find with that regular expression to get the (start, end) tuple of the match, pretty much exactly as described in the documentation for RegexSet:

Other features like finding the location of successive matches or their sub-captures aren't supported. If you need this functionality, the recommended approach is to compile each regex in the set independently and selectively match them based on which regexes in the set matched.

@BurntSushi
Member

@dprien Interesting, that is a good use case! Do you happen to know the max length of a token? If so, you could shorten the slice that you're matching against.
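Bounding the slice can be a small helper. This is a sketch under the assumption that a maximum token length (`max_token_len`, a hypothetical bound) is known; it works on byte slices to sidestep the UTF-8 boundary panic that slicing a `&str` mid-character would raise.

```rust
// Sketch: cap how far the RegexSet has to scan by limiting the slice
// to the maximum possible token length (a hypothetical bound).
fn window(input: &[u8], pos: usize, max_token_len: usize) -> &[u8] {
    let end = input.len().min(pos + max_token_len);
    &input[pos..end]
}

fn main() {
    let input = b"42 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX";
    let w = window(input, 0, 8);
    assert_eq!(w, b"42 XXXXX");
    println!("window is {} bytes", w.len());
}
```

An anchored RegexSet run against this window then scans at most `max_token_len` bytes per call, regardless of how long the remaining input is.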

@dprien
Author

dprien commented Mar 15, 2016

Do you happen to know the max length of a token? If so, you could shorten the slice that you're matching against.

Unfortunately, one of the possible tokens is String, which can be of arbitrary length.
If I limit the slice length to the longest string of my test input, the performance is definitely much better.

@BurntSushi
Member

Drats. It would be quite cool if RegexSet could be made to work here. I'll keep noodling on it.

Yet another trick is to go back to a single regex with an alternate for each token, and then look at each capture group to determine what matched. That should quit correctly I think.

Also, I noticed the str::from_utf8_unchecked---if it's useful, you can now match on arbitrary bytes.

@dprien
Author

dprien commented Mar 15, 2016

Yet another trick is to go back to a single regex with an alternate for each token, and then look at each capture group to determine what matched.

Yep, that's what I do now.

Also, I noticed the str::from_utf8_unchecked---if it's useful, you can now match on arbitrary bytes.

Thanks! I didn't know about that.

@BurntSushi
Member

@dprien This should now be fixed and will be in the next release. The solution was so simple---I'm a complete blockhead for not having thought of it sooner.

@dprien
Author

dprien commented May 1, 2016

@BurntSushi Great! Thanks for the fix.
