Unexpected error thrown on tokenize #1298
Comments
To reproduce in Java:
However, is this a problem you have run into in the wild? Some of the characters you are adding with this are not valid text characters. Which version of CoreNLP, anyway?
@AngledLuffa
By that you mean a surrogate preceded by a space is not a valid character for CoreNLP, correct?
…rogates. Putting it in an | expression should make it so that full codepoints are preferred and half codepoints are only used in an emergency Addresses #1298
Well, two things... I don't think this character actually means anything by itself in any context, and more relevantly, it's currently not a valid character for CoreNLP, considering it causes a crash. However, I did just make a branch which I think has the fix for the problem. It doesn't crash any more, at least.
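For anyone who wants to check what kind of input this is, here is a rough Java sketch using only the standard library. It assumes "all possible characters" means every UTF-16 code unit from 0x0000 to 0xFFFF (the original snippet may have built the string differently), and the class and method names are made up for illustration. It counts surrogate code units that are not part of a valid high+low pair.

```java
class SurrogateCheck {
    /** Builds a string of every UTF-16 code unit in [0, 0xFFFF], optionally separated by spaces. */
    static String allChars(boolean spaced) {
        StringBuilder sb = new StringBuilder();
        for (int c = 0; c <= 0xFFFF; c++) {
            if (spaced && c > 0) {
                sb.append(' ');
            }
            sb.append((char) c);
        }
        return sb.toString();
    }

    /** Counts surrogate code units that are not part of a valid high+low pair. */
    static int countUnpairedSurrogates(String s) {
        int unpaired = 0;
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (Character.isHighSurrogate(ch)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of a valid pair
                } else {
                    unpaired++;
                }
            } else if (Character.isLowSurrogate(ch)) {
                unpaired++;
            }
        }
        return unpaired;
    }

    public static void main(String[] args) {
        // With spaces, every one of the 2048 surrogate code units stands alone.
        System.out.println(countUnpairedSurrogates(allChars(true)));
        // Without spaces, only U+DBFF followed by U+DC00 happens to form a valid pair.
        System.out.println(countUnpairedSurrogates(allChars(false)));
    }
}
```

Under that assumption, both inputs contain lone surrogates, so this only shows that the test string is not valid text. It does not by itself explain why the space-separated variant is the one that crashes.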
…rogates. Putting it in an | expression should make it so that full codepoints are preferred and half codepoints are only used in an emergency Addresses #1298 Add a debug line for the fallthrough rule Add a couple tests of the half codepoint fix
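The "prefer full codepoints, fall back to half codepoints" idea from the commit message can be illustrated outside the lexer with plain Java. This is only a sketch of the policy, not the actual CoreNLP tokenizer rule; `CodePointSplitter` and `split` are hypothetical names. `String.codePointAt` already behaves this way: it returns the full code point when a high surrogate is followed by a low surrogate, and otherwise returns the lone code unit.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointSplitter {
    /**
     * Splits a string into one-code-point pieces. A valid surrogate pair is
     * kept together; an unpaired surrogate becomes its own one-char piece
     * instead of causing an error.
     */
    static List<String> split(String s) {
        List<String> pieces = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);          // full code point if paired, else the lone unit
            int len = Character.charCount(cp);  // 2 for a valid pair, 1 otherwise
            pieces.add(s.substring(i, i + len));
            i += len;
        }
        return pieces;
    }

    public static void main(String[] args) {
        // "a", a valid pair (U+1F600), a space, a lone high surrogate, a space, "b"
        List<String> pieces = split("a\uD83D\uDE00 \uD83D b");
        System.out.println(pieces.size());
    }
}
```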
4.5.1 is available on GitHub and Maven.
@AngledLuffa awesome, thank you!
Description:
edu.stanford.nlp.pipeline.StanfordCoreNLP throws an error if you try to tokenize a string of all possible characters separated by spaces ("... a b c d ..."). It is probably also worth mentioning that the same string without spaces between the characters ("...abcd...") is tokenized successfully.
Prerequisites:
openjdk 17.0.2 2022-01-18
2.13.8
ivy"edu.stanford.nlp:stanford-corenlp:4.5.0"
Minimal example:
Error: