Releases: stanfordnlp/CoreNLP

v4.5.9 - Security Updates and Semgrex / Ssurgeon features

07 Apr 15:04

Security updates

  • Removed the ability to specify an external library for deserialization of annotations in the server. We believe this should not be necessary given the complete nature of the protobuf format, and it was reported as a potential security vulnerability: https://github.com/stanfordnlp/CoreNLP/security/advisories/GHSA-wv35-hv9v-526p. If you have a use case for this feature, please file an issue on GitHub.
  • Removed the naturalli demo, which is unsupported and likely not used anywhere given its Stanford-specific components

Semgrex / Ssurgeon features

  • Semgrex can now search on negated attributes of a node using !: as the syntax: 7399e9b
  • Semgrex can now search on maps (especially morphological features) with the :{feature:value} syntax, as well as search for negative matches with {feature!:value}: 84ac932 ff1d903 3c30b3b
  • Ssurgeon can now reindex nodes with ReindexGraph, such as in cases where a sentence was manually split in a conllu file: 156fad1
  • Ssurgeon can remove a feature with EditNode using the -remove option: 8e7d121

Other minor updates

  • The lemmatizer now supports additional demonyms, taken from LinES and ParTUT: 4f15b08
  • Output lemmas, whenever available, when training a tagger with -outputLemmas set, even if not in verbose mode: 94739c7

v4.5.8 - Package updates and minor bug fixes

29 Dec 08:15
  • Update German UD POS tagger to UD 2.14 data

  • Add Austrian German month names to the German tokenizer: #1454 Thank you @j3ernhard

  • Improve the constituency to dependency converter to remove quite a few validation errors. This includes adding the PTB Corrector as an earlier step when operating specifically on PTB data #1445

  • Ssurgeon feature to split one word into multiple words: 13ede5a

  • Unravel recursion in SemanticGraph - 05804a3 Fixes one server crash observed in #1461

  • Package updates: protobuf -> 3.25.5, javax -> 1.1.6: #1465. Unfortunately, updating Lucene to fix all dependency security issues would require dropping Java 8 support

  • Fix the server caching of tokenizer annotators to include segmenter properties as well. Previously, the server could ignore a request for a different segmentation model: 6f6eb93
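
The recursion fix above follows a general pattern: replace a recursive graph walk with an explicit stack, so that a very deep graph cannot overflow the call stack. The following is a minimal sketch of that pattern, not CoreNLP's SemanticGraph code. The adjacency-map graph and the names `IterativeTraversal` and `reachable` are illustrative assumptions.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class IterativeTraversal {
    /** Returns every node reachable from root, using an explicit stack instead of recursion. */
    public static Set<Integer> reachable(Map<Integer, List<Integer>> children, int root) {
        Set<Integer> seen = new HashSet<>();
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            int node = stack.pop();
            if (!seen.add(node)) {
                continue;  // already visited; this also protects against cycles
            }
            List<Integer> next = children.getOrDefault(node, Collections.<Integer>emptyList());
            for (int child : next) {
                stack.push(child);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // Hypothetical worst case: a chain 0 -> 1 -> ... -> 199999.
        // A recursive walk of this depth would likely exhaust a default-sized thread stack.
        Map<Integer, List<Integer>> chain = new HashMap<>();
        for (int i = 0; i < 199_999; i++) {
            chain.put(i, Collections.singletonList(i + 1));
        }
        System.out.println(reachable(chain, 0).size());
    }
}
```

The stack lives on the heap, so the depth of the graph is bounded by available memory rather than by the thread's stack size.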

v4.5.7 - Constituency to Dependency Converter Upgrades

28 Apr 05:36

UD converter upgrades

Inspired by UniversalDependencies/docs#717, although the work is not finished

  • Add an option to use the PTBCorrector, which fixes many (although not all) incorrect POS tags 5e57eab
  • Treat "sort of" the same as "kind of" bc4acf1
  • "en masse" is flat cb338cd
  • "dinna" is an MWT 1dd746c
  • Use AUX as the POS in the converter when appropriate 30f2f8e
  • Fix (heh) "all but" and "whether or not" 2513676
  • Dependency dep -> ccomp for fronted say verbs a76a854

Parser evaluation improvements

  • Include the F1 scores of each tree when scoring a constituency dataset 2725b06
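
For readers unfamiliar with per-tree scoring, the sketch below shows one common way to compute a bracket F1 for a single tree. It is a simplified illustration, not CoreNLP's evaluation code. The class `BracketF1` and the span strings such as `NP:0-2` are assumptions. It also uses sets, so duplicate brackets (for example from unary chains) are counted once, which a full evaluator would handle differently.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class BracketF1 {
    /** F1 over labeled spans such as "NP:0-2"; returns 0.0 when no bracket matches. */
    public static double f1(Set<String> gold, Set<String> guess) {
        Set<String> matched = new HashSet<>(gold);
        matched.retainAll(guess);
        if (matched.isEmpty()) {
            return 0.0;
        }
        double precision = (double) matched.size() / guess.size();
        double recall = (double) matched.size() / gold.size();
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        Set<String> gold = new HashSet<>(Arrays.asList("S:0-5", "NP:0-2", "VP:2-5", "NP:3-5"));
        Set<String> guess = new HashSet<>(Arrays.asList("S:0-5", "NP:0-2", "VP:3-5"));
        // 2 of 3 guessed brackets are correct (P = 2/3); 2 of 4 gold brackets are found (R = 1/2).
        // F1 = 2PR / (P + R) = 4/7.
        System.out.println(f1(gold, guess));
    }
}
```

Reporting this value per tree, alongside the corpus-level score, makes it easier to find the individual sentences a parser handles worst.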

v4.5.6: Lemmatizer & Tokenizer bugfixes

01 Feb 20:39

English Lemmatizer upgrades

  • enroll, appall as American spellings, instead of enrol & appal. de- as a verb prefix, blog and xfer as double letter exceptions 8adcbfe
  • cowritten 2dd08da
  • elder / eldest 9b5bec8
  • Yazidi as a demonym 2852da8

Tokenizer upgrades

  • Treat #number as a single token after an abbreviation #1396 ad37f2a

UD Processing upgrades

  • 'twas and 'tis as MWT in the UD converter b9f19a6
  • Sort morpho features in alphabetical order when writing out UD f77a9b4
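
As an illustration of the feature-sorting item, the sketch below re-orders a UD FEATS string by feature name. It is not the converter's actual code. The class `FeatureSorter` is hypothetical, and the case-insensitive ordering is an assumption about the intended sort.

```java
import java.util.Map;
import java.util.TreeMap;

public class FeatureSorter {
    /** Re-orders a FEATS string such as "Number=Sing|Case=Nom" alphabetically by feature name. */
    public static String sortFeatures(String feats) {
        if (feats == null || feats.isEmpty() || feats.equals("_")) {
            return "_";  // "_" marks an empty FEATS column in CoNLL-U
        }
        // Assumption: feature names are compared case-insensitively
        Map<String, String> sorted = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        for (String pair : feats.split("\\|")) {
            String[] kv = pair.split("=", 2);
            if (kv.length != 2) {
                throw new IllegalArgumentException("Expected name=value but got: " + pair);
            }
            sorted.put(kv[0], kv[1]);
        }
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> entry : sorted.entrySet()) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append(entry.getKey()).append('=').append(entry.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(sortFeatures("Number=Sing|Case=Nom|Gender=Masc"));
        // prints Case=Nom|Gender=Masc|Number=Sing
    }
}
```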

Other Bugfixes

  • Crash when deleting the endpoints of an IntervalTree #1405 6d17c23
  • Find and remove extraneous uses of yield, which became a keyword: e5c9d44 b084233

Minor API change

  • Updating the text on a CoreLabel no longer wipes out the Lemma c03522b
  • Update to more recent Jakarta Servlet 8a671fd

Ssurgeon

  • UpdateMorphoFeatures edit 27c6703
  • Lemmatize operation (only works on English) c26b25e

v4.5.5: further Ssurgeon upgrades, SceneGraph server module, security bugfix

06 Sep 20:46

Ssurgeon updates beyond the capabilities listed in the GURT paper

  • MergeNodes operation: combine two words into one word in a graph. One word must be a leaf headed by the other for this to work 0660fa9
  • CombineMWT operation: mark MWT on two or more words. Stanza will treat these as a Token 010a955
  • DeleteLeaf operation: remove a leaf, renumber the subsequent words 429f61a

Bugfixes

  • fix graph serialization for sentences longer than 128 words (IdentityHashSet doesn't work for integers beyond 128) d8d9d9f
  • fix valueOf for SemanticGraph if a word is just a dash 203eb06
  • fix memory usage of evaluating a PCFG model, which would run out of memory because it was saving all of the charts while evaluating b2e67b0
  • Tregex pattern would not correctly display when using optional patterns: a9965b2 8659653
  • Tregex would infinite loop on certain optional patterns which were theoretically legal cc7983e
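
The serialization bug above comes from a general Java pitfall: an identity-based set compares boxed integers by reference, and the JVM only guarantees shared `Integer` objects for values from -128 to 127. The sketch below reproduces the effect with the JDK's `IdentityHashMap` standing in for CoreNLP's IdentityHashSet. The class and method names are illustrative.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Set;

public class IdentityPitfall {
    /** Adds the indices 0..limit-1, then counts how many are found again after re-boxing. */
    public static int countFound(Set<Integer> set, int limit) {
        for (int i = 0; i < limit; i++) {
            set.add(Integer.valueOf(i));
        }
        int found = 0;
        for (int i = 0; i < limit; i++) {
            // Outside the cached range, valueOf may return a different object each time
            if (set.contains(Integer.valueOf(i))) {
                found++;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Set<Integer> identity = Collections.newSetFromMap(new IdentityHashMap<Integer, Boolean>());
        Set<Integer> byEquals = new HashSet<>();
        // With default JVM settings the identity set typically finds only the cached values
        System.out.println("identity set: " + countFound(identity, 200));
        System.out.println("equals set:   " + countFound(byEquals, 200));
    }
}
```

An `equals`-based collection such as `HashSet` finds all 200 indices, which is why such a set is the safer choice for word indices.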

Security fixes

English dependency converter fixes

  • addressing issue #1363
  • fix (QP up to ...) 8c46648 9a86ece
  • fix "up to 1700 kilograms" if misparsed in a predictable manner 6e14527
  • better LST coverage 5745de5
  • vmod/acl when the parser misinterprets NP vs NML ad4556d
  • treat lists of NML as repeated modifiers of a noun, instead of a list, as that is the likely meaning of NML. Example from PTB: "a 72-game, three-month season" 61ef545 5e748dc

Server features

  • Scenegraph endpoint 8b40947 #1346
  • remove one json library to reduce number of json libraries we depend on 357b1bb

Small changes

  • allow fourty as a number in SUTime 7fbb7b8
  • capture forty (40) days as a duration in SUTime b3c47a0
  • feature to print out the feature index of an NER model as a text file f636673
  • clarify the INTJ rule for the ChineseHeadFinder 56cd6bb
  • consider { } as punctuation when scoring English constituency treebanks a606afa
  • fix error in test case, from @tanloong #1373 #1372
  • dead code cleanup 86b6a03

v4.5.4: Minor Ssurgeon updates

16 Mar 01:23
  • Minor Ssurgeon bugfixes (make it harder to infinite loop with EditNode or RelabelNamedEdge)
  • Add a ReattachNamedEdge which is a combination of RemoveNamedEdge and AddEdge with new endpoints
  • Include the Morphology CLI for using the CoreNLP lemmatizer from elsewhere, such as Python

v4.5.3: Ssurgeon interface, Collinizer fixes

11 Mar 05:40

Mostly changes to Semgrex, along with adding Ssurgeon to the download package for general consumption. This involved quite a few changes to classes such as AnnotationLookup. The released version should now match the Semgrex/Ssurgeon paper published at GURT 2023.

Ssurgeon / Semgrex

Bugfixes

  • Fix "Could not match" errors which occurred when scoring treebanks using a tagger that produces non-gold punct tags: #1344
  • Fix typo in KBP children rules: dbdb55b

Minor features

  • Add the choice of dependency graph to output to the TextOutputter 33e6c42 #1339
  • Hopefully minor interface change: make relation in SemanticGraphEdge final, get rid of setRelation e7a7657

v4.5.2: package dependencies, CLI additions

11 Mar 05:32

Bugfixes

  • Tokenize c'mon and $$$ 1e216de
  • Tokenize 'email' 76b5a6b #1316
  • Return empty mentions for empty document da08664 #1322
  • Fix CLI protobuf tools running too fast for some network conditions: 412da5c

CLI protobuf tools

  • Add output of lemmatizer to words 71bc95d
  • Convert constituency trees to dependencies b118082

Dependency updates

  • Protobuf 3.19.6 0439b62
  • xom 1.3.8, which no longer automatically includes xalan 3ded6f0

Semgraph / Semgrex improvements

  • Allow reuse of indices in SemanticGraph.valueOf cf97e36
  • Add Semgrex relations to match the capabilities introduced in spaCy 98be52a

v4.5.1: Bugfixes

30 Aug 04:13

CoreNLP 4.5.1

Bugfixes!

  • Fix tokenizer regression: 4.5.0 tokenized ",5" as one word 974383a
  • Use a LinkedHashMap in the PTBTokenizer instead of Properties. Keeps the option processing order predictable. #1289 6550188
  • Fix \r\n not being properly processed on Windows: #1291 9889f4e
  • Handle one half of a surrogate character pair in the tokenizer without crashing #1298 1b12faa
  • Attempt to fix semgrex "Unknown vertex" errors which have plagued CoreNLP for years in hard to track down circumstances: #1296 #1229 #1169 f99b5ab
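
To illustrate the LinkedHashMap item: `java.util.Properties` is hash-based and does not promise any iteration order, whereas a `LinkedHashMap` iterates in insertion order. The sketch below is a minimal option parser built on that idea, not the PTBTokenizer's own code. The class `OptionOrder`, the comma-separated input format, and the rule that a bare key means `true` are assumptions of this example, and the option names are used only as sample input.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OptionOrder {
    /** Parses "key=value,key=value" options, keeping the order in which they were written. */
    public static Map<String, String> parse(String options) {
        Map<String, String> parsed = new LinkedHashMap<>();
        for (String option : options.split(",")) {
            String trimmed = option.trim();
            if (trimmed.isEmpty()) {
                continue;  // ignore empty segments such as a trailing comma
            }
            String[] kv = trimmed.split("=", 2);
            // Assumption of this sketch: a bare key is shorthand for key=true
            parsed.put(kv[0], kv.length > 1 ? kv[1] : "true");
        }
        return parsed;
    }

    public static void main(String[] args) {
        Map<String, String> options = parse("ptb3Escaping=false,invertible,normalizeParentheses=true");
        // Iteration follows the written order, so a later option can reliably override an earlier one
        for (Map.Entry<String, String> entry : options.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```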

v4.5.0

22 Jul 23:21

CoreNLP 4.5.0

The main features are improved lemmatization of English, improved tokenization of both English and non-English flex-based languages, and some updates to tregex, tsurgeon, and semgrex.

  • All PTB and German tokens normalized now in PTBLexer (previously only German umlauts).
    This makes the tokenizer 2% slower, but should avoid issues with resume' for example
    d46fecd

  • log4j removed entirely from public CoreNLP (internal "research" branch still has a use)
    f05cb54

  • Fix NumberFormatException showing up in NER models: #547 5ee2c39

  • Fix "seconds" in the lemmatizer: e7a073b

  • Fix double escaping of & in the online demos: 8413fa1

  • Report the cause of an error if "tregex" is asked for but no parse annotator is added: 4db80c0

  • Merge ssplit and cleanxml into the tokenize annotator (done in a backwards compatible manner): #1259

  • Custom tregex pattern, ROOT tregex pattern, and tsurgeon operation for simultaneously moving a subtree and pruning anything left behind, used for processing the Italian VIT treebank in stanza: #1263

  • Refactor tokenization of punctuation, filenames, and other entities common to all languages, not just English: 3c40ba3 58a2288 8b97d64

  • Improved tokenization of number patterns, names with apostrophes such as Sh'reyan, non-American phone numbers, invisible commas 9476a8e 6193934 afb1ea8 7c84960

  • Significant lemmatizer improvements: adjectives & adverbs, along with some various other special cases #1266

  • Include graph & semgrex indices in the results for a semgrex query (will make the results more usable) 45b47e2

  • Trim words in the NER training process. Spaces can still be inside a word, but stray whitespace won't ruin the performance of the models 0d9e9c8

  • Fix NBSP in the Chinese segmenter stanfordnlp/stanza#1052 #1279