DFA misses an opportunity to short circuit with an anchored RegexSet #186

Closed
dprien opened this issue Mar 14, 2016 · 11 comments

@dprien

dprien commented Mar 14, 2016

While trying to implement a lexer, I noticed that RegexSet::matches exhibits very bad performance when used on long strings.
The following test case runs in 1 minute, 16 seconds (cargo build --release):

extern crate regex;
fn main() {
    let mut test_input = String::from("42 ");
    for _ in 0 .. 100_000 { test_input.push_str("XXXXXXXXXXXXXXXXXXXX"); } // Delete me

    let set = regex::RegexSet::new(&[r"^[0-9]+", r"^THIS_SHOULD_NEVER_MATCH"]).unwrap();
    for _ in 0 .. 10_000 { println!("{:?}", set.matches(&test_input)); }
}

Without the line marked with // Delete me, the program runs in about 50 ms.

If I understand correctly, matches should try to match only at the very beginning of the string, thanks to the ^ anchor. However, it would seem that matches traverses the whole string on each call.

@dprien dprien changed the title Performance of RegexSet::matches with ^-anchored regex on long strings Performance of RegexSet::matches with ^-anchored regular expressions on long strings Mar 14, 2016
@BurntSushi BurntSushi added the bug label Mar 14, 2016
@BurntSushi
Member

Yup, this looks like a bug. Playing around with it, it seems the anchoring is detected and a preceding .*? is omitted from the DFA (good). That should cause the DFA to reach a dead state and quit, but it does indeed appear to be continuing to the end of the input. Still looking...

@BurntSushi
Member

Thanks for the bug report! Should be fixed now and is on crates.io in 0.1.57.

@dprien
Author

dprien commented Mar 14, 2016

Thanks a lot for the quick response!

BurntSushi added a commit that referenced this issue Mar 15, 2016
The fix is totally bogus. We obviously can't do anything meaningful with
recording matches while computing cached states, since once a state is
cached, we won't actually record the match.

Stay tuned for greener pastures.

This reverts commit 3fb34e2.
@BurntSushi
Member

OK, so this is pretty tricky to fix properly. I think it will need to persist as a bug for a bit until I figure out how best to fix it. Out of curiosity, what were you trying to do that caused you to notice this behavior?


Leaving some breadcrumbs for when I can come back to this:

The fundamental problem is in how the DFA detects which regexes match. At the moment, it looks at the very last state that matched and inspects which NFA states it contains. Each NFA state that is a match has a pointer to the corresponding regex, and the DFA returns only those pointers. To make this work, the DFA has to carry the match instructions throughout matching, but this means the DFA never enters a dead state, since a match instruction is always available (assuming another regex has already matched, as in this case). Because the DFA never enters a dead state, it processes the entire haystack.
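The short circuit the DFA is missing here can be illustrated with a toy search loop. This is a hypothetical sketch, not the crate's actual implementation: once every pattern in the set is anchored, entering a dead state means no later byte can produce a new match, so the loop can stop immediately instead of scanning the rest of the haystack.

```rust
// Hypothetical sketch of a DFA search loop with a dead-state check;
// `DEAD`, `search`, and the table layout are illustrative, not the
// regex crate's real internals.
const DEAD: usize = usize::MAX;

/// `trans[s][b]` is the next state from state `s` on byte `b`;
/// `is_match[s]` says whether state `s` is a match state.
fn search(trans: &[[usize; 256]], is_match: &[bool], start: usize, haystack: &[u8]) -> bool {
    let mut state = start;
    let mut matched = is_match[state];
    for &b in haystack {
        state = trans[state][b as usize];
        if state == DEAD {
            // With anchored patterns, nothing later in the haystack
            // can match, so stop here instead of scanning the rest.
            return matched;
        }
        matched |= is_match[state];
    }
    matched
}

fn main() {
    // Tiny DFA for the anchored pattern ^[0-9]+:
    // state 0 is the start, state 1 is the match state,
    // and any non-digit byte leads to DEAD.
    let mut t0 = [DEAD; 256];
    let mut t1 = [DEAD; 256];
    for b in b'0'..=b'9' {
        t0[b as usize] = 1;
        t1[b as usize] = 1;
    }
    let trans = [t0, t1];
    let is_match = [false, true];
    // Returns after the space, without scanning the trailing X's.
    assert!(search(&trans, &is_match, 0, b"42 XXXXXXXXXXXXXXXX"));
    assert!(!search(&trans, &is_match, 0, b"no digits here"));
    println!("ok");
}
```

On the `42 XXX…` input from the original report, such a loop would abandon the scan at the first space rather than walking all two million bytes.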

I had attempted to fix this by instead keeping track of matches as the DFA states are created. This doesn't actually work because the DFA uses a cache of states, which means subsequent executions of the DFA may never create any states, and therefore no matches will be captured.

One way around this is to look for match instructions in DFA states that are match states during the search. We could then stop carrying them through all of the DFA states, which should permit short circuiting. The problem with this approach is that scanning all of the NFA states whenever a DFA match state is entered could be quite expensive, since the number of NFA states in any given DFA state could be quite large.

My next idea is to track all of the matches in the states themselves. Currently, each state only tracks whether it is a match state or not, which is sufficient for normal regexes. I think this could be generalized to record exactly which NFA states are match states. That could be used like the approach above, but without needing to scan all of the NFA states in DFA match states. Unfortunately, even this can be problematic if the regex set contains many regexes, since it could mean traversing a large list on each step of the DFA. Another problem with this approach is that it replaces a one-byte bool in State with a 24-byte Vec. That really hurts.
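The size cost mentioned above is easy to check, assuming a 64-bit target: a `Vec` is three machine words (pointer, capacity, length), while a `bool` is a single byte.

```rust
// On a 64-bit target, Vec<T> is three usize words (ptr, cap, len),
// regardless of T, while bool occupies one byte.
fn main() {
    assert_eq!(std::mem::size_of::<Vec<u32>>(), 24);
    assert_eq!(std::mem::size_of::<bool>(), 1);
    println!(
        "Vec: {} bytes, bool: {} byte",
        std::mem::size_of::<Vec<u32>>(),
        std::mem::size_of::<bool>()
    );
}
```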

The current implementation avoids this nastiness and only looks for match states at the very end of execution.

@BurntSushi BurntSushi reopened this Mar 15, 2016
@BurntSushi BurntSushi changed the title Performance of RegexSet::matches with ^-anchored regular expressions on long strings DFA misses an opportunity to short circuit with an anchored RegexSet Mar 15, 2016
@dprien
Author

dprien commented Mar 15, 2016

Out of curiosity, what were you trying to do that caused you to notice this behavior?

I was trying to implement a lexer like this:

impl Iterator for Lexer {
    type Item = Token;
    fn next(&mut self) -> Option<Token> {
        let slice: &[u8] = &self.input[self.pos ..];
        let matches = self.regex_set.matches(unsafe { str::from_utf8_unchecked(slice) });
        // ...
        self.pos += /* number of bytes consumed by the last token */;
        Some(the_token)
    }
}

The idea was to use RegexSet to quickly find the index of the matching regular expression, and then use .find with that regular expression to get the (start, end) tuple of the match, pretty much exactly as described in the documentation for RegexSet:

Other features like finding the location of successive matches or their sub-captures aren't supported. If you need this functionality, the recommended approach is to compile each regex in the set independently and selectively match them based on which regexes in the set matched.

@BurntSushi
Member

@dprien Interesting, that is a good use case! Do you happen to know the max length of a token? If so, you could shorten the slice that you're matching against.
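Bounding the slice can be a small helper. This is a sketch under the assumption that a maximum token length (`max_token_len`, a hypothetical bound) is known; it works on byte slices to sidestep the UTF-8 boundary panic that slicing a `&str` mid-character would raise.

```rust
// Sketch: cap how far the RegexSet has to scan by limiting the slice
// to the maximum possible token length (a hypothetical bound).
fn window(input: &[u8], pos: usize, max_token_len: usize) -> &[u8] {
    let end = input.len().min(pos + max_token_len);
    &input[pos..end]
}

fn main() {
    let input = b"42 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX";
    let w = window(input, 0, 8);
    assert_eq!(w, b"42 XXXXX");
    println!("window is {} bytes", w.len());
}
```

An anchored RegexSet run against this window then scans at most `max_token_len` bytes per call, regardless of how long the remaining input is.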

@dprien
Author

dprien commented Mar 15, 2016

Do you happen to know the max length of a token? If so, you could shorten the slice that you're matching against.

Unfortunately, one of the possible tokens is String, which can be of arbitrary length.
If I limit the slice length to the longest string of my test input, the performance is definitely much better.

@BurntSushi
Member

Drats. It would be quite cool if RegexSet could be made to work here. I'll keep noodling on it.

Yet another trick is to go back to a single regex with an alternate for each token, and then look at each capture group to determine what matched. That should quit correctly I think.

Also, I noticed the str::from_utf8_unchecked---if it's useful, you can now match on arbitrary bytes.

@dprien
Author

dprien commented Mar 15, 2016

Yet another trick is to go back to a single regex with an alternate for each token, and then look at each capture group to determine what matched.

Yep, that's what I do now.

Also, I noticed the str::from_utf8_unchecked---if it's useful, you can now match on arbitrary bytes.

Thanks! I didn't know about that.

@BurntSushi
Member

@dprien This should now be fixed and will be in the next release. The solution was so simple---I'm a complete blockhead for not having thought of it sooner.

@dprien
Author

dprien commented May 1, 2016

@BurntSushi Great! Thanks for the fix.
