
TokensRegex cannot detect rules across the period '.' #1396


Closed

lilyclemson opened this issue Nov 16, 2023 · 10 comments

Comments

@lilyclemson

lilyclemson commented Nov 16, 2023

The task is to detect the apartment number via TokensRegex.

Example sentence: I live in 123 Pretty RD, APT. #456.
Here is the rule used to detect the apartment number: { ruleType: "tokens", pattern: ( /APT/ /./ /#/ [{word:/[0-9]+/}] ), action: Annotate($0, ner, "APT#"), result: "APARTMENT NUMBER" }

The rule above fails to detect the pattern APT. #456. It looks like TokensRegex cannot correctly match the rule across the period '.'
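For context, running the rule through a pipeline might look like the following minimal sketch (the rules file name apt.rules and class name AptRepro are illustrative, not from the original report):

```java
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import java.util.Properties;

public class AptRepro {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex");
    props.setProperty("tokensregex.rules", "apt.rules"); // file containing the rule above
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    Annotation doc = new Annotation("I live in 123 Pretty RD, APT. #456.");
    pipeline.annotate(doc);

    // Print each token with its NER tag; "456" never receives the APT# tag.
    for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
      for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
        System.out.println(token.word() + "\t" + token.ner());
      }
    }
  }
}
```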

A guess is that a change at line 713 would do the trick …

https://github.com/stanfordnlp/CoreNLP/blob/main/src/edu/stanford/nlp/process/PTBLexer.flex

@AngledLuffa
Contributor

That's an excellent guess, and in fact that's the first place I went to when trying to diagnose this. The problem is clearly that the tokenizer is splitting this into two sentences rather than keeping it as one. However, despite having Apt with that exact capitalization in the abbreviation list there, the tokenizer is actually case-insensitive to those abbreviations. Here are a couple examples:

I lived at APT. 303 for many years     ... one sentence
I lived at APT. #303 for many years    ... two sentences
I lived at Apt. 303 for many years     ... one sentence
I lived at Apt. #303 for many years    ... two sentences

My belief is that we need to add # as a token that can follow ABBREV3, at which point it will work as desired. In the meantime, if you happen to be using data with one sentence per query, you can always use the option that forces the tokenizer to produce exactly one sentence per query (see the sketch below).

/* --- ABBREV3 abbreviations are allowed only before numbers. ---
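In concrete terms, the single-sentence workaround might look like this minimal sketch (assuming the standard pipeline properties; ssplit.isOneSentence is the sentence-splitter option that treats each annotated document as one sentence):

```java
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex");
// Treat each document as exactly one sentence, so the splitter
// can never break at "APT." regardless of abbreviation handling.
props.setProperty("ssplit.isOneSentence", "true");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
```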

@AngledLuffa
Contributor

It's a small change, but I hesitate to make such a change without running it by my PI @manning (who is unfortunately out of town for the next couple weeks). If you want to give the abbrev3_hash branch a try, it might work better for your purposes.
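For anyone who wants to try that branch from source, the steps would be roughly (a sketch; the build command depends on your setup):

```
git clone https://github.com/stanfordnlp/CoreNLP.git
cd CoreNLP
git checkout abbrev3_hash
# then build the jar with ant or Maven, per the repository's build instructions
```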

@lilyclemson
Author

Thank you for the fix and quick response!

@lilyclemson
Author

Hi @AngledLuffa, I wonder if @manning had a chance to see if we can merge the current fix? Thank you.

@AngledLuffa
Contributor

It's already merged. I can make a new release that includes it soon if you need, or you can just use the dev branch

@lilyclemson
Author

> It's already merged. I can make a new release that includes it soon if you need, or you can just use the dev branch

That would be great if we can have a release version of it. Thanks very much for the quick response!

@AngledLuffa
Contributor

It's not an official release yet, but I built a version here:

https://nlp.stanford.edu/software/stanford-corenlp-4.5.5b.zip

I'd like to make some more changes before making an official release

@lilyclemson
Author

lilyclemson commented Jan 23, 2024

- Thanks very much for the release!

- Looking forward to the official release, because our security policies only allow official releases of dependencies to move to production.

- Could you let me know when the official release will occur?

Thanks!

@AngledLuffa
Contributor

Now released in 4.5.6 (may take a little time to show up on Maven)
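For reference, the corresponding Maven coordinates would be the standard CoreNLP ones (a sketch; add the models artifact as needed):

```xml
<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>4.5.6</version>
</dependency>
```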

@lilyclemson
Author

Thanks so much! I appreciate it @AngledLuffa
