[pool-2-thread-1] WARN CoreNLP - java.lang.RuntimeException: Ate the whole text without matching. Expected is ' CD-SS3.4', ate 'CD-SS3.4a' #1052

Closed
jiangweiatgithub opened this issue Jun 20, 2022 · 6 comments

@jiangweiatgithub

I started a server on an Ubuntu Hyper-V VM hosted on Windows Server 2016, using the following command line:

java -Xmx16g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties StanfordCoreNLP-chinese.properties -port 9009 -timeout 150000

When I run the following Python code:

requests.post(
    'http://192.168.1.5:9009/tregex?pattern=NP|NN&filter=False'
    '&properties={"annotators":"tokenize,ssplit,pos,ner,depparse,parse","outputFormat":"json"}',
    data={'data': tgt0},
    headers={'Connection': 'close'},
).json()

The request hangs, and I see the following message on the server:

[main] INFO CoreNLP - --- StanfordCoreNLPServer#main() called ---
[main] INFO CoreNLP - Server default properties:
(Note: unspecified annotator properties are English defaults)
annotators = tokenize, ssplit, pos, lemma, ner, parse, coref
coref.algorithm = hybrid
coref.calculateFeatureImportance = false
coref.defaultPronounAgreement = true
coref.input.type = raw
coref.language = zh
coref.md.liberalChineseMD = false
coref.md.type = RULE
coref.path.word2vec =
coref.postprocessing = true
coref.print.md.log = false
coref.sieves = ChineseHeadMatch, ExactStringMatch, PreciseConstructs, StrictHeadMatch1, StrictHeadMatch2, StrictHeadMatch3, StrictHeadMatch4, PronounMatch
coref.useConstituencyTree = true
coref.useSemantics = false
coref.zh.dict = edu/stanford/nlp/models/dcoref/zh-attributes.txt.gz
depparse.language = chinese
depparse.model = edu/stanford/nlp/models/parser/nndep/UD_Chinese.gz
entitylink.wikidict = edu/stanford/nlp/models/kbp/chinese/wikidict_chinese.tsv.gz
inputFormat = text
kbp.language = zh
kbp.model = none
kbp.semgrex = edu/stanford/nlp/models/kbp/chinese/semgrex
kbp.tokensregex = edu/stanford/nlp/models/kbp/chinese/tokensregex
ner.applyNumericClassifiers = true
ner.fine.regexner.mapping = edu/stanford/nlp/models/kbp/chinese/gazetteers/cn_regexner_mapping.tab
ner.fine.regexner.noDefaultOverwriteLabels = CITY,COUNTRY,STATE_OR_PROVINCE
ner.language = chinese
ner.model = edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz
ner.useSUTime = false
outputFormat = json
parse.model = edu/stanford/nlp/models/srparser/chineseSR.ser.gz
pos.model = edu/stanford/nlp/models/pos-tagger/chinese-distsim.tagger
prettyPrint = false
segment.model = edu/stanford/nlp/models/segmenter/chinese/ctb.gz
segment.serDictionary = edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
segment.sighanCorporaDict = edu/stanford/nlp/models/segmenter/chinese
segment.sighanPostProcessing = true
ssplit.boundaryTokenRegex = [.。]|[!?!?]+
tokenize.language = zh
[main] INFO CoreNLP - Threads: 12
[main] INFO CoreNLP - Starting server...
[main] INFO CoreNLP - StanfordCoreNLPServer listening at /0:0:0:0:0:0:0:0:9009
[pool-2-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[pool-2-thread-1] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/segmenter/chinese/ctb.gz ... done [21.0 sec].
[pool-2-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[pool-2-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[pool-2-thread-1] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/chinese-distsim.tagger ... done [1.3 sec].
[pool-2-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator depparse
[pool-2-thread-1] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Loading depparse model: edu/stanford/nlp/models/parser/nndep/UD_Chinese.gz ... Time elapsed: 1.7 sec
[pool-2-thread-1] INFO edu.stanford.nlp.parser.nndep.Classifier - PreComputed 20000 vectors, elapsed Time: 2.301 sec
[pool-2-thread-1] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Initializing dependency parser ... done [4.0 sec].
[pool-2-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[pool-2-thread-1] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/srparser/chineseSR.ser.gz ... done [5.7 sec].
[pool-2-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[pool-2-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
[pool-2-thread-1] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz ... done [3.2 sec].
[pool-2-thread-1] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.fine.regexner: Read 21238 unique entries out of 21249 from edu/stanford/nlp/models/kbp/chinese/gazetteers/cn_regexner_mapping.tab, 0 TokensRegex patterns.
[pool-2-thread-1] INFO edu.stanford.nlp.pipeline.NERCombinerAnnotator - numeric classifiers: true; SUTime: false [no docDate]; fine grained: true
[pool-2-thread-1] INFO edu.stanford.nlp.wordseg.ChineseDictionary - Loading Chinese dictionaries from 1 file:
[pool-2-thread-1] INFO edu.stanford.nlp.wordseg.ChineseDictionary - edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
[pool-2-thread-1] INFO edu.stanford.nlp.wordseg.ChineseDictionary - Done. Unique words in ChineseDictionary is: 423200.
[pool-2-thread-1] INFO edu.stanford.nlp.wordseg.CorpusChar - Loading character dictionary file from edu/stanford/nlp/models/segmenter/chinese/dict/character_list [done].
[pool-2-thread-1] INFO edu.stanford.nlp.wordseg.AffixDictionary - Loading affix dictionary from edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb [done].
[pool-2-thread-1] INFO edu.stanford.nlp.wordseg.CorpusChar - Loading character dictionary file from edu/stanford/nlp/models/segmenter/chinese/dict/character_list [done].
[pool-2-thread-1] INFO edu.stanford.nlp.wordseg.AffixDictionary - Loading affix dictionary from edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb [done].
[pool-2-thread-1] WARN CoreNLP - java.lang.RuntimeException: Ate the whole text without matching. Expected is ' CD-SS3.4', ate 'CD-SS3.4a'
edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator.advancePos(ChineseSegmenterAnnotator.java:296)
edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator.runSegmentation(ChineseSegmenterAnnotator.java:407)
edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator.doOneSentence(ChineseSegmenterAnnotator.java:133)
edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator.annotate(ChineseSegmenterAnnotator.java:127)
edu.stanford.nlp.pipeline.TokenizerAnnotator.annotate(TokenizerAnnotator.java:379)
edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:76)
edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:680)
edu.stanford.nlp.pipeline.StanfordCoreNLPServer$TregexHandler.lambda$handle$7(StanfordCoreNLPServer.java:1332)
java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
java.base/java.lang.Thread.run(Thread.java:829)

@jiangweiatgithub
Author

In some cases, the warning message on the server looks like the following instead:
[pool-2-thread-2] WARN CoreNLP - java.lang.RuntimeException: Ate the whole text without matching. Expected is ' CLL6.4', ate 'CLL6.4e'

@jiangweiatgithub
Author

jiangweiatgithub commented Jun 21, 2022

It seems that the server chokes on certain strings, such as the following Chinese sentence, which also contains an ASCII token:
讨论或批准的其他主题可通过 BoardDocs 获得。

The server message: [screenshot of the warning]

You can reproduce it at https://corenlp.run/ by selecting Chinese as the language.
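For reference, the failing request can be reconstructed with only the standard library (a sketch: the localhost hostname and port 9009 mirror the original launch command, and the reduced annotator list is an assumption to keep the example small):

```python
import json
import urllib.parse
import urllib.request

# The sentence that triggered the warning; note the space before the
# ASCII token "BoardDocs" (in the failing input it was U+00A0).
text = "讨论或批准的其他主题可通过 BoardDocs 获得。"

props = json.dumps({"annotators": "tokenize,ssplit", "outputFormat": "json"})
url = "http://localhost:9009/?" + urllib.parse.urlencode({"properties": props})
req = urllib.request.Request(url, data=text.encode("utf-8"), method="POST")
# urllib.request.urlopen(req) would send it; against an affected server
# the request hangs and the RuntimeException warning appears in the log.
print(req.full_url)
```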

@AngledLuffa
Collaborator

Found it. Debugging these things is a giant PITA, by the way. Obviously it needs to be fixed, though, so thanks for pointing it out.

Also, I learned that our segmenter doesn't properly keep 蓝妹妹 as a single word. It's probably for the best, considering the test cases I was giving it.

stanfordnlp/CoreNLP#1279

If you can replace NBSP with spaces for a few days, I'll try to get out a new release next week. We can't do it this week because our PI wants to make some upgrades to the tokenizer and Stanford is on fire.
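The suggested stopgap, replacing NBSP with plain spaces before sending text to the server, can be sketched as follows (the sample string assumes the original text used U+00A0 where plain spaces now appear):

```python
def normalize_nbsp(text: str) -> str:
    """Replace non-breaking spaces (U+00A0) with plain ASCII spaces.

    Temporary workaround: before the fix, the Chinese segmenter tripped
    over NBSP adjacent to Latin tokens.
    """
    return text.replace("\u00a0", " ")

sample = "讨论或批准的其他主题可通过\u00a0BoardDocs\u00a0获得。"
print(normalize_nbsp(sample))  # 讨论或批准的其他主题可通过 BoardDocs 获得。
```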

@AngledLuffa
Collaborator

It has since come to my attention that the Chinese segmentation standard may very well split the last name from the first name, and 蓝 妹妹 is in fact the correct segmentation.

@AngledLuffa
Collaborator

A CoreNLP release with this fix has been made, so it should no longer be a problem for Stanza.

@AngledLuffa
Collaborator

AngledLuffa commented Oct 11, 2022 via email
