
TokensRegex cannot detect rules across the period '.' #1396


Closed

lilyclemson opened this issue Nov 16, 2023 · 10 comments

Comments

@lilyclemson

lilyclemson commented Nov 16, 2023

The task is to detect the apartment number via TokensRegex.

Example sentence: I live in 123 Pretty RD, APT. #456.
Here is the rule used to detect the apartment number: { ruleType: "tokens", pattern: ( /APT/ /./ /#/ [{word:/[0-9]+/}] ), action: Annotate($0, ner, "APT#"), result: "APARTMENT NUMBER" }

The rule above fails to detect the pattern APT. #456. It looks like TokensRegex cannot correctly match the rule across the period '.'
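For context, running the rule through a pipeline might look like the following minimal sketch (the rules file name apt.rules and class name AptRepro are illustrative, not from the original report):

```java
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import java.util.Properties;

public class AptRepro {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex");
    props.setProperty("tokensregex.rules", "apt.rules"); // file containing the rule above
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    Annotation doc = new Annotation("I live in 123 Pretty RD, APT. #456.");
    pipeline.annotate(doc);

    // Print each token with its NER tag; "456" never receives the APT# tag.
    for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
      for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
        System.out.println(token.word() + "\t" + token.ner());
      }
    }
  }
}
```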

A guess is that a change at line 713 would do the trick …

https://github.com/stanfordnlp/CoreNLP/blob/main/src/edu/stanford/nlp/process/PTBLexer.flex

@AngledLuffa
Contributor

That's an excellent guess, and in fact that's the first place I went to when trying to diagnose this. The problem is clearly that the tokenizer is splitting this into two sentences rather than keeping it as one. However, despite having Apt with that exact capitalization in the abbreviation list there, the tokenizer is actually case-insensitive to those abbreviations. Here are a couple examples:

I lived at APT. 303 for many years     ... one sentence
I lived at APT. #303 for many years    ... two sentences
I lived at Apt. 303 for many years     ... one sentence
I lived at Apt. #303 for many years    ... two sentences

My belief is that we need to add # as a token that can follow ABBREV3, at which point it will work as desired. In the meantime, if you happen to be using data with one sentence per query, you can always use the option that forces the tokenizer to produce exactly one sentence per query (see the sketch below).

/* --- ABBREV3 abbreviations are allowed only before numbers. ---
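In concrete terms, the single-sentence workaround might look like this minimal sketch (assuming the standard pipeline properties; ssplit.isOneSentence is the sentence-splitter option that treats each annotated document as one sentence):

```java
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex");
// Treat each document as exactly one sentence, so the splitter
// can never break at "APT." regardless of abbreviation handling.
props.setProperty("ssplit.isOneSentence", "true");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
```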

@AngledLuffa
Contributor

It's a small change, but I hesitate to make such a change without running it by my PI @manning (who is unfortunately out of town for the next couple weeks). If you want to give the abbrev3_hash branch a try, it might work better for your purposes.
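For anyone who wants to try that branch from source, the steps would be roughly (a sketch; the build command depends on your setup):

```
git clone https://github.com/stanfordnlp/CoreNLP.git
cd CoreNLP
git checkout abbrev3_hash
# then build the jar with ant or Maven, per the repository's build instructions
```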

@lilyclemson
Author

Thank you for the fix and quick response!

@lilyclemson
Author

Hi @AngledLuffa, I wonder if @manning had a chance to see if we can merge the current fix? Thank you.

@AngledLuffa
Contributor

It's already merged. I can make a new release that includes it soon if you need, or you can just use the dev branch

@lilyclemson
Author

> It's already merged. I can make a new release that includes it soon if you need, or you can just use the dev branch

That would be great if we can have a release version of it. Thanks very much for the quick response!

@AngledLuffa
Contributor

It's not an official release yet, but I built a version here:

https://nlp.stanford.edu/software/stanford-corenlp-4.5.5b.zip

I'd like to make some more changes before making an official release

@lilyclemson
Author

lilyclemson commented Jan 23, 2024

- Thanks very much for the release!

- Looking forward to the official release, because our security policies only allow official releases of dependencies to move to production.

- Could you let me know when the official release will occur?

Thanks!

@AngledLuffa
Contributor

Now released in 4.5.6 (may take a little time to show up on Maven)
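For reference, the corresponding Maven coordinates would be the standard CoreNLP ones (a sketch; add the models artifact as needed):

```xml
<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>4.5.6</version>
</dependency>
```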

@lilyclemson
Author

Thanks so much! I appreciate it @AngledLuffa
