Commit 1b12faa

Make the fallthrough character tokenization also capture unpaired surrogates. Putting it in an | expression should make it so that full codepoints are preferred and half codepoints are only used in an emergency.

Addresses #1298
Add a debug line for the fallthrough rule
Add a couple tests of the half codepoint fix
1 parent 63fda49 commit 1b12faa

3 files changed, +57005 -57323 lines changed

src/edu/stanford/nlp/process/PTBLexer.flex

+2 -1
@@ -1583,7 +1583,8 @@ CP1252_MISC_SYMBOL = [\u0086\u0087\u0089\u0095\u0098\u0099]
 prevWordAfter.append(yytext());
 }
 }
-. { String str = yytext();
+. | [^] { String str = yytext();
+if (DEBUG) { logger.info("Fallthrough character rule: |" + str + "|"); }
 int first = str.codePointAt(0);
 String msg = String.format("Untokenizable: %s (U+%s, decimal: %s)",
 yytext(), Integer.toHexString(first).toUpperCase(), Integer.toString(first));
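
To illustrate what the action body reports for a full code point versus an unpaired surrogate, here is a minimal standalone Java sketch. It does not reproduce JFlex's matching. It only mirrors the `codePointAt(0)` call and the message format visible in the diff. The class `FallthroughDemo` and the methods `describe` and `isUnpairedSurrogate` are hypothetical names introduced for this example and are not part of PTBLexer.

```java
// Hypothetical demo class; not part of PTBLexer or CoreNLP.
class FallthroughDemo {

  /** Builds a message in the same format as the fallthrough action in the diff. */
  static String describe(String str) {
    int first = str.codePointAt(0);
    return String.format("Untokenizable: %s (U+%s, decimal: %s)",
        str, Integer.toHexString(first).toUpperCase(), Integer.toString(first));
  }

  /** True when the text is a single UTF-16 code unit that is one half of a surrogate pair. */
  static boolean isUnpairedSurrogate(String str) {
    return str.length() == 1 && Character.isSurrogate(str.charAt(0));
  }

  public static void main(String[] args) {
    // A supplementary character stored as a high + low surrogate pair.
    String full = new String(Character.toChars(0x1F600));
    // Only the high half of that pair: an unpaired surrogate.
    String half = full.substring(0, 1);

    // codePointAt(0) combines the pair into 0x1F600 for the full string,
    // and returns the surrogate code unit itself (0xD83D) for the half.
    System.out.println(describe(full));
    System.out.println(describe(half));
    System.out.println("half is unpaired surrogate: " + isUnpairedSurrogate(half));
  }
}
```

For an unpaired surrogate, `codePointAt(0)` returns the surrogate code unit itself, so the message would report a value in the U+D800 to U+DFFF range rather than a real character.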
