[pool-2-thread-1] WARN CoreNLP - java.lang.RuntimeException: Ate the whole text without matching. Expected is ' CD-SS3.4', ate 'CD-SS3.4a' #1052

Closed
jiangweiatgithub opened this issue Jun 20, 2022 · 6 comments

@jiangweiatgithub

I started a server on an Ubuntu Hyper-V VM hosted on Windows Server 2016, using the following command line:

java -Xmx16g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties StanfordCoreNLP-chinese.properties -port 9009 -timeout 150000

When I run the following Python code:

requests.post(
    'http://192.168.1.5:9009/tregex?pattern=NP|NN&filter=False'
    '&properties={"annotators":"tokenize,ssplit,pos,ner,depparse,parse","outputFormat":"json"}',
    data={'data': tgt0},
    headers={'Connection': 'close'},
).json()

The request hangs, and I see the following message on the server:

[main] INFO CoreNLP - --- StanfordCoreNLPServer#main() called ---
[main] INFO CoreNLP - Server default properties:
(Note: unspecified annotator properties are English defaults)
annotators = tokenize, ssplit, pos, lemma, ner, parse, coref
coref.algorithm = hybrid
coref.calculateFeatureImportance = false
coref.defaultPronounAgreement = true
coref.input.type = raw
coref.language = zh
coref.md.liberalChineseMD = false
coref.md.type = RULE
coref.path.word2vec =
coref.postprocessing = true
coref.print.md.log = false
coref.sieves = ChineseHeadMatch, ExactStringMatch, PreciseConstructs, StrictHeadMatch1, StrictHeadMatch2, StrictHeadMatch3, StrictHeadMatch4, PronounMatch
coref.useConstituencyTree = true
coref.useSemantics = false
coref.zh.dict = edu/stanford/nlp/models/dcoref/zh-attributes.txt.gz
depparse.language = chinese
depparse.model = edu/stanford/nlp/models/parser/nndep/UD_Chinese.gz
entitylink.wikidict = edu/stanford/nlp/models/kbp/chinese/wikidict_chinese.tsv.gz
inputFormat = text
kbp.language = zh
kbp.model = none
kbp.semgrex = edu/stanford/nlp/models/kbp/chinese/semgrex
kbp.tokensregex = edu/stanford/nlp/models/kbp/chinese/tokensregex
ner.applyNumericClassifiers = true
ner.fine.regexner.mapping = edu/stanford/nlp/models/kbp/chinese/gazetteers/cn_regexner_mapping.tab
ner.fine.regexner.noDefaultOverwriteLabels = CITY,COUNTRY,STATE_OR_PROVINCE
ner.language = chinese
ner.model = edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz
ner.useSUTime = false
outputFormat = json
parse.model = edu/stanford/nlp/models/srparser/chineseSR.ser.gz
pos.model = edu/stanford/nlp/models/pos-tagger/chinese-distsim.tagger
prettyPrint = false
segment.model = edu/stanford/nlp/models/segmenter/chinese/ctb.gz
segment.serDictionary = edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
segment.sighanCorporaDict = edu/stanford/nlp/models/segmenter/chinese
segment.sighanPostProcessing = true
ssplit.boundaryTokenRegex = [.。]|[!?!?]+
tokenize.language = zh
[main] INFO CoreNLP - Threads: 12
[main] INFO CoreNLP - Starting server...
[main] INFO CoreNLP - StanfordCoreNLPServer listening at /0:0:0:0:0:0:0:0:9009
[pool-2-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[pool-2-thread-1] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/segmenter/chinese/ctb.gz ... done [21.0 sec].
[pool-2-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[pool-2-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[pool-2-thread-1] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/chinese-distsim.tagger ... done [1.3 sec].
[pool-2-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator depparse
[pool-2-thread-1] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Loading depparse model: edu/stanford/nlp/models/parser/nndep/UD_Chinese.gz ... Time elapsed: 1.7 sec
[pool-2-thread-1] INFO edu.stanford.nlp.parser.nndep.Classifier - PreComputed 20000 vectors, elapsed Time: 2.301 sec
[pool-2-thread-1] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Initializing dependency parser ... done [4.0 sec].
[pool-2-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[pool-2-thread-1] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/srparser/chineseSR.ser.gz ... done [5.7 sec].
[pool-2-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[pool-2-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
[pool-2-thread-1] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz ... done [3.2 sec].
[pool-2-thread-1] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.fine.regexner: Read 21238 unique entries out of 21249 from edu/stanford/nlp/models/kbp/chinese/gazetteers/cn_regexner_mapping.tab, 0 TokensRegex patterns.
[pool-2-thread-1] INFO edu.stanford.nlp.pipeline.NERCombinerAnnotator - numeric classifiers: true; SUTime: false [no docDate]; fine grained: true
[pool-2-thread-1] INFO edu.stanford.nlp.wordseg.ChineseDictionary - Loading Chinese dictionaries from 1 file:
[pool-2-thread-1] INFO edu.stanford.nlp.wordseg.ChineseDictionary - edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
[pool-2-thread-1] INFO edu.stanford.nlp.wordseg.ChineseDictionary - Done. Unique words in ChineseDictionary is: 423200.
[pool-2-thread-1] INFO edu.stanford.nlp.wordseg.CorpusChar - Loading character dictionary file from edu/stanford/nlp/models/segmenter/chinese/dict/character_list [done].
[pool-2-thread-1] INFO edu.stanford.nlp.wordseg.AffixDictionary - Loading affix dictionary from edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb [done].
[pool-2-thread-1] INFO edu.stanford.nlp.wordseg.CorpusChar - Loading character dictionary file from edu/stanford/nlp/models/segmenter/chinese/dict/character_list [done].
[pool-2-thread-1] INFO edu.stanford.nlp.wordseg.AffixDictionary - Loading affix dictionary from edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb [done].
[pool-2-thread-1] WARN CoreNLP - java.lang.RuntimeException: Ate the whole text without matching. Expected is ' CD-SS3.4', ate 'CD-SS3.4a'
edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator.advancePos(ChineseSegmenterAnnotator.java:296)
edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator.runSegmentation(ChineseSegmenterAnnotator.java:407)
edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator.doOneSentence(ChineseSegmenterAnnotator.java:133)
edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator.annotate(ChineseSegmenterAnnotator.java:127)
edu.stanford.nlp.pipeline.TokenizerAnnotator.annotate(TokenizerAnnotator.java:379)
edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:76)
edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:680)
edu.stanford.nlp.pipeline.StanfordCoreNLPServer$TregexHandler.lambda$handle$7(StanfordCoreNLPServer.java:1332)
java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
java.base/java.lang.Thread.run(Thread.java:829)

@jiangweiatgithub
Author

In some cases, the warning message on the server looks like the following instead:
[pool-2-thread-2] WARN CoreNLP - java.lang.RuntimeException: Ate the whole text without matching. Expected is ' CLL6.4', ate 'CLL6.4e'

@jiangweiatgithub
Author

jiangweiatgithub commented Jun 21, 2022

It seems that the server chokes on certain strings, such as the following Chinese sentence, which also contains an ASCII token:
讨论或批准的其他主题可通过 BoardDocs 获得。

The server message: [screenshot of the warning]

You can reproduce it at https://corenlp.run/ by selecting Chinese as the language.
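For reference, the failing request can be reconstructed with only the standard library (a sketch: the localhost hostname and port 9009 mirror the original launch command, and the reduced annotator list is an assumption to keep the example small):

```python
import json
import urllib.parse
import urllib.request

# The sentence that triggered the warning; note the space before the
# ASCII token "BoardDocs" (in the failing input it was U+00A0).
text = "讨论或批准的其他主题可通过 BoardDocs 获得。"

props = json.dumps({"annotators": "tokenize,ssplit", "outputFormat": "json"})
url = "http://localhost:9009/?" + urllib.parse.urlencode({"properties": props})
req = urllib.request.Request(url, data=text.encode("utf-8"), method="POST")
# urllib.request.urlopen(req) would send it; against an affected server
# the request hangs and the RuntimeException warning appears in the log.
print(req.full_url)
```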

@AngledLuffa
Collaborator

Found it. Debugging these things is a giant PITA, by the way. Obviously it needs to be fixed, though, so thanks for pointing it out.

Also, I learned that our segmenter doesn't properly keep 蓝妹妹 as a single word. It's probably for the best, considering the test cases I was giving it.

stanfordnlp/CoreNLP#1279

If you can replace NBSP with spaces for a few days, I'll try to get out a new release next week. We can't do it this week because our PI wants to make some upgrades to the tokenizer and Stanford is on fire.
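The suggested stopgap, replacing NBSP with plain spaces before sending text to the server, can be sketched as follows (the sample string assumes the original text used U+00A0 where plain spaces now appear):

```python
def normalize_nbsp(text: str) -> str:
    """Replace non-breaking spaces (U+00A0) with plain ASCII spaces.

    Temporary workaround: before the fix, the Chinese segmenter tripped
    over NBSP adjacent to Latin tokens.
    """
    return text.replace("\u00a0", " ")

sample = "讨论或批准的其他主题可通过\u00a0BoardDocs\u00a0获得。"
print(normalize_nbsp(sample))  # 讨论或批准的其他主题可通过 BoardDocs 获得。
```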

@AngledLuffa
Collaborator

It has since come to my attention that the Chinese segmentation standard may very well split the last name from the first name, and 蓝 妹妹 is in fact the correct segmentation.

@AngledLuffa
Collaborator

A CoreNLP release with this fix has been made, so it should no longer be a problem for Stanza.

@AngledLuffa
Collaborator

AngledLuffa commented Oct 11, 2022 via email
