DFA misses an opportunity to short circuit with an anchored RegexSet #186
Comments
Yup, this looks like a bug. Playing around with it, it seems the anchoring is detected and a preceding …
Thanks for the bug report! Should be fixed now and is on crates.io in …
Thanks a lot for the quick response!
OK, so this is pretty tricky to fix properly. I think it will need to persist as a bug for a bit until I figure out how best to fix it. Out of curiosity, what were you trying to do that caused you to notice this behavior?

Leaving some breadcrumbs for when I can come back to this:

The fundamental problem is in how the DFA detects which regexes match. At the moment, it looks at the very last state that matched and inspects which NFA states it's in. Any NFA states that are matches each have a pointer to the corresponding regex, and the DFA returns only those pointers. To make this work, the DFA has to carry the match instructions around throughout matching, but this causes the DFA to never enter a dead state, since a match instruction is always available (assuming another regex has already matched, as in this case). Since the DFA never enters a dead state, it processes the entire haystack.

I had attempted to fix this by instead keeping track of matches as the DFA states are created. This doesn't actually work because the DFA uses a cache of states, which means subsequent executions of the DFA may never create any states, and therefore no matches will be captured.

One way around this is to look for match instructions in DFA states that are match states during the search. We could then stop carrying them through all of the DFA states, which should permit short circuiting. The problem with this approach is that scanning all of the NFA states whenever a DFA match state is entered could be quite expensive, since the number of NFA states in any given DFA state could be quite large.

My next idea is to track all of the matches in the states themselves. Currently, each state only tracks whether it is a match state or not, which is sufficient for normal regexes. I think this could be generalized to indicate exactly which NFA states are match states. This can be used like the above approach, but avoids needing to scan all of the NFA states in DFA match states. Unfortunately, even this can be problematic if the regex set contains lots of regexes, since it could mean traversing a large list on each step of the DFA. Another problem with this approach is that it removes a one byte …

The current implementation avoids this nastiness and only looks for match states at the very end of execution.
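A minimal sketch of that last idea, with made-up names (this is not the regex crate's actual internal representation):

```rust
/// Hypothetical DFA state: instead of a single "is this a match state?"
/// flag, record exactly which patterns in the set are in a match state
/// here. Names and layout are illustrative only.
struct DfaState {
    /// Transitions indexed by input byte.
    transitions: Box<[StateId; 256]>,
    /// Which patterns match in this state; empty means "not a match state".
    matching_patterns: Vec<PatternId>,
}

type StateId = usize;
type PatternId = usize;

impl DfaState {
    /// With per-pattern match info on every state, the search loop can
    /// report matches as it goes and stop early, instead of carrying match
    /// instructions through every state and only inspecting the final one.
    fn is_match(&self) -> bool {
        !self.matching_patterns.is_empty()
    }
}
```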
I was trying to implement a lexer like this:

```rust
impl Iterator for Lexer {
    type Item = Token;

    fn next(&mut self) -> Option<Token> {
        let slice: &[u8] = &self.input[self.pos..];
        let matches = self.regex_set.matches(unsafe { str::from_utf8_unchecked(slice) });
        // ...
        self.pos += /* number of bytes consumed by the last token */
        Some(the_token)
    }
}
```

The idea was to use …
@dprien Interesting, that is a good use case! Do you happen to know the max length of a token? If so, you could shorten the slice that you're matching against.
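For example (a hypothetical sketch; `MAX_TOKEN_LEN` and the helper function are assumptions, not code from this thread):

```rust
use regex::RegexSet;

/// Hypothetical cap on token length.
const MAX_TOKEN_LEN: usize = 64;

/// Match against a bounded window of the remaining input, so a scan that
/// fails to short circuit still touches at most MAX_TOKEN_LEN bytes.
/// Assumes `pos` and the window end land on UTF-8 character boundaries.
fn matches_at(set: &RegexSet, input: &str, pos: usize) -> Vec<usize> {
    let end = (pos + MAX_TOKEN_LEN).min(input.len());
    set.matches(&input[pos..end]).into_iter().collect()
}
```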
Unfortunately, one of the possible tokens is …
Dratz. It would be quite cool if …

Yet another trick is to go back to a single regex with an alternate for each token, and then look at each capture group to determine what matched. That should quit correctly, I think.

Also, I noticed the …
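A sketch of that capture-group trick, with made-up token patterns standing in for the real ones:

```rust
use regex::Regex;

/// Returns the index of the alternate that matched (i.e. the token kind)
/// and how many bytes it consumed. In real code the `Regex` would be
/// compiled once and reused rather than rebuilt on every call.
fn classify(input: &str) -> Option<(usize, usize)> {
    let re = Regex::new(r"^(?:(\d+)|([A-Za-z_]\w*)|(\s+))").unwrap();
    let caps = re.captures(input)?;
    (1..=3).find_map(|i| caps.get(i).map(|m| (i, m.end())))
}
```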
Yep, that's what I do now.
Thanks! I didn't know about that.
@dprien This should now be fixed and will be in the next release. The solution was so simple---I'm a complete blockhead for not having thought of it sooner.
@BurntSushi Great! Thanks for the fix.
While trying to implement a lexer, I noticed that `RegexSet::matches` exhibits very bad performance when used on long strings. The following test case runs in 1 minute, 16 seconds (`cargo build --release`):

…

Without the line marked with `// Delete me`, the program runs in about 50 ms.

If I understand correctly, `matches` should try to match only at the very beginning of the string, thanks to the `^` anchor. However, it would seem that `matches` traverses the whole string on each call.
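The original test case isn't shown above; a rough, hypothetical sketch of the kind of program described (the patterns, haystack size, and loop count are guesses, not the reporter's code) might look like:

```rust
use regex::RegexSet;

fn main() {
    // Hypothetical anchored patterns; the reporter's actual set isn't shown.
    let set = RegexSet::new(&[
        r"^[a-z]+",
        r"^[0-9]+",
    ]).unwrap();

    // A long haystack: a short token followed by lots of irrelevant input.
    let haystack = format!("abc {}", "x".repeat(10_000_000));

    // Even though every pattern is anchored at `^`, each call to `matches`
    // reportedly scans the entire haystack, so a loop like a lexer's
    // `next()` becomes very slow on long inputs.
    for _ in 0..100 {
        let matched: Vec<usize> = set.matches(&haystack).into_iter().collect();
        assert_eq!(matched, vec![0]);
    }
}
```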