Development plans for CoreNLP – we're really moving to jdk 11 in 2025 #1488
manning announced in Announcements
CoreNLP is by now a “mature” product. It has not been actively used for research or development at Stanford NLP for about five years. While CoreNLP uses (early generation) artificial neural network models for several components, it does not seem a good move to try to provide transformer-based NLP models in CoreNLP; we think CoreNLP is best remaining a library that runs well on a single CPU. Nevertheless, we have continued to actively maintain CoreNLP and to add new features, including improving Semgrex and Ssurgeon, adding support for Italian and Hungarian, adding better emoji support, and moving to UDv2 (or “new” Penn Treebank) tokenization for English.
We plan to continue maintaining CoreNLP, but we would like to do so in a slightly more sustainable way.
Move to jdk 11: For a long time, we have continued to target jdk 8, since many Java users (including ours) continued to use it. But by 2024, over 70% of Java users were on jdk 11 or above, and we'd like to move to it. Our plan to move to jdk 11 in 2024 did not happen, but we did release what we regard as the “final” Java 8 release (modulo any new security or showstopper problems materializing). And we've got fresh energy to make this happen for 2025. This will have a number of advantages for maintainability, including:
Remove basically unused components: The core of CoreNLP is its pipeline from tokenization and sentence splitting through NER, parsing, coreference, and knowledge base population. But during the years that it was actively used as the main Stanford NLP codebase, it also accreted some other components. We believe that these are not actively used by anyone but disproportionately cause security flaws in CoreNLP, since they're the components that expand the footprint of required libraries. These include:
One old dependency we cannot move off is joda-time, since SUTime relies vitally on its notion of Partial times and dates, which has no equivalent in java.time.
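To make the joda-time point concrete: java.time ships a fixed set of partial types (such as `YearMonth` and `MonthDay`) but no type holding an arbitrary set of fields, such as "Friday at 17:00" with no date. The sketch below is only an illustration of that gap using the standard `TemporalAccessor` interface. The class name `FieldBag` and its `with` method are hypothetical; this is not how SUTime or joda-time's `Partial` is implemented, and it is not a proposed replacement.

```java
import java.time.temporal.ChronoField;
import java.time.temporal.TemporalAccessor;
import java.time.temporal.TemporalField;
import java.time.temporal.UnsupportedTemporalTypeException;
import java.util.EnumMap;
import java.util.Map;

/**
 * Hypothetical stand-in for an arbitrary "partial" date/time.
 * Not part of java.time, joda-time, or CoreNLP.
 */
final class FieldBag implements TemporalAccessor {
    // Only the fields that are actually known are stored.
    private final Map<ChronoField, Long> fields = new EnumMap<>(ChronoField.class);

    /** Records a field value after range-checking it (e.g. HOUR_OF_DAY must be 0-23). */
    FieldBag with(ChronoField field, long value) {
        field.checkValidValue(value);
        fields.put(field, value);
        return this;
    }

    @Override
    public boolean isSupported(TemporalField field) {
        return fields.containsKey(field);
    }

    @Override
    public long getLong(TemporalField field) {
        Long value = fields.get(field);
        if (value == null) {
            throw new UnsupportedTemporalTypeException("Unsupported field: " + field);
        }
        return value;
    }
}
```

For example, `new FieldBag().with(ChronoField.DAY_OF_WEEK, 5).with(ChronoField.HOUR_OF_DAY, 17)` represents "Friday at 17:00" without committing to a year, month, or day, which none of the built-in java.time value types can express.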
Upgrade libraries and models: Finally, we plan to update the remaining Java libraries that we do use and to update some of the NLP models, such as the caseless models and the NER models.
If you have any concerns, or suggestions of other things to get rid of, let us know. 😊