Improve rule-based "Plain citations parser" #12893

bwakkie · 2025-04-07T07:59:30Z

JabRef version

Other (please describe below)

Operating system

GNU / Linux

Details on version and operating system

JabRef 5.16--2024-07-25--771c4cd Linux 6.12.20-2-manjaro amd64 Java 21.0.2 JavaFX 22.0.2+4

Checked with the latest development build (copy version output from About dialog)

I made a backup of my libraries before testing the latest development version.
I have tested the latest development version and the problem persists

Steps to reproduce the behaviour

JabRef 5.16

There is a problem with the text parser: it changes the citations into completely unrelated ones.
I created a test case; see the two attached files.

testcase.txt

testcase.bib.txt

How come these totally different citations are matched? Is there a way to parse the strings without using Grobid? I think blindly trusting Grobid is wrong. At least verifying the whole title string would already help to see that something is not right.
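One way to read the "verify the whole title string" idea is a sanity check that the title returned by a parser actually occurs in the plain citation the user pasted. The sketch below is only an illustration of that reading, not JabRef code; the function names `normalize` and `title_appears_in_citation` are hypothetical, and the example citation is taken from the anystyle benchmark data quoted in this issue.

```python
import re


def normalize(text: str) -> str:
    """Lowercase the text and reduce every run of non-alphanumerics to one space.

    Note: this simple rule also drops non-ASCII letters, which a real
    implementation would need to handle.
    """
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def title_appears_in_citation(parsed_title: str, raw_citation: str) -> bool:
    """Return True if the parsed title occurs, word for word, in the raw citation."""
    title = normalize(parsed_title)
    if not title:
        return False
    # Pad with spaces so that matches start and end on word boundaries.
    return f" {title} " in f" {normalize(raw_citation)} "


raw = (
    "W. H. Enright. Improving the efficiency of matrix operations in the "
    "numerical solution of stiff ordinary differential equations. "
    "ACM Trans. Math. Softw., 4(2), 127-136, June 1978."
)

# A title that really comes from the input passes the check.
print(title_appears_in_citation(
    "Improving the efficiency of matrix operations in the numerical "
    "solution of stiff ordinary differential equations", raw))  # True

# An unrelated title (as a parser might wrongly return) is flagged.
print(title_appears_in_citation(
    "Application of hash to data base machine and its architecture", raw))  # False
```

A check like this cannot tell whether the title was split at the right place, but it would catch results that have nothing to do with the input.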

More test cases

At https://github.com/inukshuk/anystyle/blob/main/spec/benchmark.rb, anystyle has the following tests for benchmarking:

data = <<-END_REFERENCES
<author> A. Cau, R. Kuiper, and W.-P. de Roever. </author> <title> Formalising Dijkstra's development strategy within Stark's formalism. </title> <editor> In C. B. Jones, R. C. Shaw, and T. Denvir, editors, </editor> <container-title> Proc. 5th. BCS-FACS Refinement Workshop, </container-title> <date> 1992. </date>
<author> M. Kitsuregawa, H. Tanaka, and T. Moto-oka. </author> <title> Application of hash to data base machine and its architecture. </title> <journal> New Generation Computing, </journal> <volume> 1(1), </volume> <date> 1983. </date>
<author> Alexander Vrchoticky. </author> <title> Modula/R language definition. </title> <tech> Technical Report TU Wien rr-02-92, version 2.0, </tech> <institution> Dept. for Real-Time Systems, Technical University of Vienna, </institution> <date> May 1993. </date>
<author> Marc Shapiro and Susan Horwitz. </author> <title> Fast and accurate flow-insensitive points-to analysis. </title> <container-title> In Proceedings of the 24th Annual ACM Symposium on Principles of Programming Languages, </container-title> <date> January 1997. </date>
<author> W. Landi and B. G. Ryder. </author> <title> Aliasing with and without pointers: A problem taxonomy. </title> <institution> Center for Computer Aids for Industrial Productivity </institution> <tech> Technical Report CAIP-TR-125, </tech> <institution> Rutgers University, </institution> <date> September 1990. </date>
<author> W. H. Enright. </author> <title> Improving the efficiency of matrix operations in the numerical solution of stiff ordinary differential equations. </title> <journal> ACM Trans. Math. Softw., </journal> <volume> 4(2), </volume> <pages> 127-136, </pages> <date> June 1978. </date>
<author> Gmytrasiewicz, P. J., Durfee, E. H., & Wehe, D. K. </author> <date> (1991a). </date> <title> A decision theoretic approach to coordinating multiagent interaction. </title> <container-title> In Proceedings of the Twelfth International Joint Conference on Artificial Intelligence, </container-title> <pages> pp. 62-68 </pages> <location> Sydney, Australia. </location>
<author> A. Bookstein and S. T. Klein, </author> <title> Detecting content-bearing words by serial clustering, </title> <container-title> Proceedings of the Nineteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, </container-title> <pages> pp. 319327, </pages> <date> 1995. </date>
<author> U. Dayal, H. Garcia-Molina, M. Hsu, B. Kao, and M.- C. Shan. </author> <title> Third generation TP monitors: A database challenge. </title> <container-title> In ACM SIGMOD Conference on Management of Data, </container-title> <pages> pages 393-397, </pages> <location> Washington, D. C., </location> <date> May 1993. </date>
<author> C. Qiao and R. Melhem, </author> <title> "Reducing Communication Latency with Path Multiplexing in Optically Interconnected Multiprocessor Systems", </title> <container-title> Proc. of HPCA-1, </container-title> <date> 1995. </date>
END_REFERENCES

We could re-use those for our RegEx tests.
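To reuse the tagged lines as test data, each line has to be split into the plain citation (the parser input) and the expected fields. The following is a minimal sketch of that step, assuming every field is wrapped in a matching `<tag> ... </tag>` pair as in the data above; `parse_tagged_reference` is a hypothetical helper name, and how the anystyle tags map to BibTeX fields is left open.

```python
from __future__ import annotations

import re

# One "<tag> value </tag>" pair; \1 forces the closing tag to match the opening one.
TAGGED_FIELD = re.compile(r"<([a-z-]+)>\s*(.*?)\s*</\1>")


def parse_tagged_reference(line: str) -> tuple[str, list[tuple[str, str]]]:
    """Split one tagged benchmark line into (plain citation, [(tag, value), ...]).

    A list of pairs is used instead of a dict because a tag can occur more
    than once in a line (for example <institution>).
    """
    fields = TAGGED_FIELD.findall(line)
    plain = " ".join(value for _, value in fields)
    return plain, fields


line = (
    "<author> M. Kitsuregawa, H. Tanaka, and T. Moto-oka. </author> "
    "<title> Application of hash to data base machine and its architecture. </title> "
    "<journal> New Generation Computing, </journal> "
    "<volume> 1(1), </volume> <date> 1983. </date>"
)

plain, fields = parse_tagged_reference(line)
print(plain)   # the string a plain citation parser would receive
for tag, value in fields:
    print(f"{tag}: {value}")  # the expected content per field
```

The values still carry the trailing punctuation from the benchmark data, so a test would have to decide whether to strip it before comparing.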


InAnYan · 2025-04-07T09:31:39Z

Hi! Thanks for checking out JabRef and its text parser! I have worked on this feature for some time.

TL;DR: Plain citation parsing in JabRef does not work very well: the rule-based parser is underdeveloped (rule-based algorithms are hard to develop), Grobid returns irrelevant entries, and the LLM option is hard to set up.

We call this feature the plain citations parser (sometimes the plain references parser).

In JabRef 5.16 there are two ways to parse citations: rule-based and Grobid. As we've found in many experiments (see ¹, ², ³), neither parser gives good results. Grobid quite often returns irrelevant entries (in fact, JabRef's Grobid instance was down for some time ⁴), which is what you might be experiencing.

In the new version of JabRef (6.0-alpha), we have added warnings about confabulations in plain citation parsing ⁵ and expanded the documentation ⁶. One can also use an LLM to parse citations ⁷; people (including me) say it works well enough ⁸, ⁹.

Bha2912 · 2025-04-07T09:33:07Z

Hello! I’m interested in working on this issue. Let me know if I can take it up.

InAnYan · 2025-04-07T10:14:09Z

@Bha2912, this issue is not marked as a good first issue (or with other labels), and we currently don't have a plan for solving it (it's in the discussion state).

So, for now, you can look at other issues.

bwakkie · 2025-04-07T14:21:58Z

Hi @InAnYan, I changed the title based on your comments.

I know how difficult it is, as I have been doing my best with vim and regexes myself for years. But what Grobid returns at the moment is, in my opinion, not worth it.
I know that Grobid can be trained, but how to do that is a bit over my head for now. I have my own Grobid server running and it gives similar problems, hence I was looking for an alternative solution.

I used the JabRef development version for the next test, based on the original text input above. Grobid got just 10% correct and 90% was complete garbage, so I don't trust the Grobid parser one bit for now.

For the rule-based result, I see that on each line the previous author is pasted back into the author fields.

testcase_rulebased.bib.txt

anystyle.io helped in my case. Wouldn't this way of dealing with plain citations be a better idea?
e.g. parse -> show user -> user fixes -> parser learns -> add to a library

InAnYan · 2025-04-09T13:55:33Z

Thanks for suggesting anystyle.io and giving ideas on how to improve plain citation parsing!

I'll close this issue as 1) it's known (infamous I would say 😅), 2) in favor of #12915. Don't think I'm closing this without any future attention, no-no, I made a "feature request" out of this 😃

InAnYan · 2025-04-09T14:30:08Z

Actually, after speaking with the team, we decided to keep this issue open, as it seems that we've closed all other issues connected to this topic. One should still be open and referred to from other duplicate issues. Also, we won't forget about this: your issue is quite detailed, and you've included your own examples and results.

Sorry for the inconvenience.

ThiloteE · 2025-04-09T19:44:29Z

How to reproduce:

Add the entries from the original post mentioned in the comments above via

https://docs.jabref.org/collect/newentryfromplaintext (preferred for deterministic reproduction)
https://docs.jabref.org/collect/findunlinkedfiles (less preferred for deterministic reproduction, as multiple methods will be tried, and if one fails, there is a fallback to other methods. The user will not see which method was used. Different code!)

Check the original (expected) bibtex data vs the actual metadata that ends up in JabRef.

To Do:

Add the test cases for Rule-Based citation-parser.
If the tests fail, adapt our existing RegExes in JabRef's code (what RegEx engine is used by JabRef?) so that the tests no longer fail.
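The test-then-adapt loop from the To Do list can be sketched as follows. This is a toy illustration only: the two expressions and the names `toy_rule_based_parse` and `field_mismatches` are made up for this example and are not JabRef's actual rules, and Python's `re` module stands in for whatever RegEx engine JabRef uses.

```python
import re

# Toy rules for illustration only; these are NOT JabRef's actual expressions.
YEAR = re.compile(r"\b((?:18|19|20)\d{2})[a-z]?\b")   # e.g. "1978" or "1991a"
PAGES = re.compile(r"\b(\d+)\s*-\s*(\d+)\b")          # e.g. "127-136"


def toy_rule_based_parse(citation: str) -> dict:
    """Extract the few fields the toy rules know about from a plain citation."""
    entry = {}
    year = YEAR.search(citation)
    if year:
        entry["year"] = year.group(1)
    pages = PAGES.search(citation)
    if pages:
        entry["pages"] = f"{pages.group(1)}--{pages.group(2)}"
    return entry


def field_mismatches(expected: dict, actual: dict) -> dict:
    """Map each field that differs to an (expected, actual) pair."""
    return {
        name: (value, actual.get(name))
        for name, value in expected.items()
        if actual.get(name) != value
    }


citation = (
    "W. H. Enright. Improving the efficiency of matrix operations in the "
    "numerical solution of stiff ordinary differential equations. "
    "ACM Trans. Math. Softw., 4(2), 127-136, June 1978."
)
expected = {"year": "1978", "pages": "127--136", "journal": "ACM Trans. Math. Softw."}

actual = toy_rule_based_parse(citation)
# The journal is reported as a mismatch because no rule extracts it yet;
# that is the signal to add or adapt a rule and rerun the test.
print(field_mismatches(expected, actual))
```

Rules this simple also produce false positives: the page rule, for instance, would match "02-92" inside the report number "rr-02-92" from the benchmark data, which is the kind of case such tests are meant to surface.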

Bonus points for communicating with and providing the test cases to improve upstream projects, such as

Btw., for inspiration: in an older version of JabRef, this dialog looked like this:

koppor · 2025-04-10T06:23:59Z

As far as I understand the issue - This is NOT org.jabref.logic.importer.fileformat.pdf.PdfContentImporter, because it is NOT about PDF to BibTeX

It is about all classes implementing org.jabref.logic.importer.plaincitation.PlainCitationParser.

The RegEx-based is this one: org.jabref.logic.importer.plaincitation.RuleBasedPlainCitationParser

TBH, I totally forgot about that when implementing #11156 -> org.jabref.logic.importer.fileformat.BibliographyFromPdfImporter

Thus, the first action is:

Extract the citation parsing logic from BibliographyFromPdfImporter into RuleBasedPlainCitationParserV2
Compare the functionality of RuleBasedPlainCitationParser with RuleBasedPlainCitationParserV2
Merge the two versions (if possible)

bwakkie changed the title ~~Text parser giving unrelated citations back. How come? Attached a test case~~ Plain citations parser giving unrelated citations back. How come? Attached a test case Apr 7, 2025

InAnYan mentioned this issue Apr 9, 2025

Integrate anystyle.io (or other plain citation parsers) to JabRef #12915

Open

InAnYan closed this as completed Apr 9, 2025

InAnYan reopened this Apr 9, 2025

InAnYan added the component: import-load label Apr 9, 2025

ThiloteE changed the title ~~Plain citations parser giving unrelated citations back. How come? Attached a test case~~ Improve "Plain citations parser" Apr 13, 2025

ThiloteE changed the title ~~Improve "Plain citations parser"~~ Improve rule-based "Plain citations parser" Apr 13, 2025
