Improve rule-based "Plain citations parser" #12893

Open
2 tasks done
bwakkie opened this issue Apr 7, 2025 · 8 comments
@bwakkie

bwakkie commented Apr 7, 2025

JabRef version

Other (please describe below)

Operating system

GNU / Linux

Details on version and operating system

JabRef 5.16--2024-07-25--771c4cd Linux 6.12.20-2-manjaro amd64 Java 21.0.2 JavaFX 22.0.2+4

Checked with the latest development build (copy version output from About dialog)

  • I made a backup of my libraries before testing the latest development version.
  • I have tested the latest development version and the problem persists

Steps to reproduce the behaviour

JabRef 5.16

There is a problem with the text parser: it turns the citations into completely unrelated citations.
I created a test case; see the two attached files.

testcase.txt

testcase.bib.txt

How come these totally different citations are matched? Is there a way to parse the strings without using Grobid? I think blindly trusting Grobid is wrong. Verifying at least the whole title string would already help to show that something is not right.
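The title check asked for here could look roughly like the sketch below. It is not JabRef code: the class TitleSanityCheck and its methods are hypothetical names chosen for illustration. The idea is to normalize both strings and accept a parsed entry only if its title literally occurs in the pasted citation.

```java
import java.util.Locale;

// Hypothetical helper, not part of JabRef: flags parser results whose
// title does not occur in the citation text the user pasted.
final class TitleSanityCheck {

    // Lower-case the text, replace every run of non-letters/non-digits by one space, trim.
    static String normalize(String text) {
        return text.toLowerCase(Locale.ROOT)
                   .replaceAll("[^\\p{L}\\p{N}]+", " ")
                   .trim();
    }

    // True if the normalized title is non-empty and occurs as a whole-word
    // sequence inside the normalized citation.
    static boolean titleOccursIn(String plainCitation, String parsedTitle) {
        String title = normalize(parsedTitle);
        if (title.isEmpty()) {
            return false;
        }
        // Padding with spaces prevents matches that start or end in the middle of a word.
        return (" " + normalize(plainCitation) + " ").contains(" " + title + " ");
    }

    public static void main(String[] args) {
        String citation = "W. H. Enright. Improving the efficiency of matrix operations in the "
                + "numerical solution of stiff ordinary differential equations. "
                + "ACM Trans. Math. Softw., 4(2), 127-136, June 1978.";
        System.out.println(titleOccursIn(citation,
                "Improving the efficiency of matrix operations in the numerical solution "
                + "of stiff ordinary differential equations"));
        System.out.println(titleOccursIn(citation,
                "Fast and accurate flow-insensitive points-to analysis"));
    }
}
```

A strict containment test like this would reject legitimate results whose title was corrected or expanded by the parser, so it is probably more useful as a warning than as a hard filter.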

More test cases

At https://github.com/inukshuk/anystyle/blob/main/spec/benchmark.rb, anystyle has the following tests for benchmarking:

data = <<-END_REFERENCES
<author> A. Cau, R. Kuiper, and W.-P. de Roever. </author> <title> Formalising Dijkstra's development strategy within Stark's formalism. </title> <editor> In C. B. Jones, R. C. Shaw, and T. Denvir, editors, </editor> <container-title> Proc. 5th. BCS-FACS Refinement Workshop, </container-title> <date> 1992. </date>
<author> M. Kitsuregawa, H. Tanaka, and T. Moto-oka. </author> <title> Application of hash to data base machine and its architecture. </title> <journal> New Generation Computing, </journal> <volume> 1(1), </volume> <date> 1983. </date>
<author> Alexander Vrchoticky. </author> <title> Modula/R language definition. </title> <tech> Technical Report TU Wien rr-02-92, version 2.0, </tech> <institution> Dept. for Real-Time Systems, Technical University of Vienna, </institution> <date> May 1993. </date>
<author> Marc Shapiro and Susan Horwitz. </author> <title> Fast and accurate flow-insensitive points-to analysis. </title> <container-title> In Proceedings of the 24th Annual ACM Symposium on Principles of Programming Languages, </container-title> <date> January 1997. </date>
<author> W. Landi and B. G. Ryder. </author> <title> Aliasing with and without pointers: A problem taxonomy. </title> <institution> Center for Computer Aids for Industrial Productivity </institution> <tech> Technical Report CAIP-TR-125, </tech> <institution> Rutgers University, </institution> <date> September 1990. </date>
<author> W. H. Enright. </author> <title> Improving the efficiency of matrix operations in the numerical solution of stiff ordinary differential equations. </title> <journal> ACM Trans. Math. Softw., </journal> <volume> 4(2), </volume> <pages> 127-136, </pages> <date> June 1978. </date>
<author> Gmytrasiewicz, P. J., Durfee, E. H., & Wehe, D. K. </author> <date> (1991a). </date> <title> A decision theoretic approach to coordinating multiagent interaction. </title> <container-title> In Proceedings of the Twelfth International Joint Conference on Artificial Intelligence, </container-title> <pages> pp. 62-68 </pages> <location> Sydney, Australia. </location>
<author> A. Bookstein and S. T. Klein, </author> <title> Detecting content-bearing words by serial clustering, </title> <container-title> Proceedings of the Nineteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, </container-title> <pages> pp. 319327, </pages> <date> 1995. </date>
<author> U. Dayal, H. Garcia-Molina, M. Hsu, B. Kao, and M.- C. Shan. </author> <title> Third generation TP monitors: A database challenge. </title> <container-title> In ACM SIGMOD Conference on Management of Data, </container-title> <pages> pages 393-397, </pages> <location> Washington, D. C., </location> <date> May 1993. </date>
<author> C. Qiao and R. Melhem, </author> <title> "Reducing Communication Latency with Path Multiplexing in Optically Interconnected Multiprocessor Systems", </title> <container-title> Proc. of HPCA-1, </container-title> <date> 1995. </date>
END_REFERENCES

We could re-use those for our RegEx tests.
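One way to reuse the lines above: each tagged line carries both the plain citation (the text without tags) and the expected fields (the tag contents). The sketch below splits a line into both. TaggedReference is a hypothetical helper, not an existing JabRef or anystyle class, and mapping the anystyle tag names to BibTeX fields is left open.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper: turns one anystyle-style tagged line into test data.
final class TaggedReference {

    // One "<tag> value </tag>" segment; the back-reference \1 forces the closing tag to match.
    private static final Pattern SEGMENT = Pattern.compile("<([\\w-]+)>\\s*(.*?)\\s*</\\1>");

    // Returns the (tag, value) pairs in order of appearance. A list is used
    // instead of a map because a tag such as <institution> can occur twice in one line.
    static List<Map.Entry<String, String>> segments(String taggedLine) {
        List<Map.Entry<String, String>> result = new ArrayList<>();
        Matcher matcher = SEGMENT.matcher(taggedLine);
        while (matcher.find()) {
            result.add(Map.entry(matcher.group(1), matcher.group(2)));
        }
        return result;
    }

    // Rebuilds the plain citation string that would be pasted into the parser.
    static String plainCitation(String taggedLine) {
        return segments(taggedLine).stream()
                                   .map(Map.Entry::getValue)
                                   .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String line = "<author> M. Kitsuregawa, H. Tanaka, and T. Moto-oka. </author> "
                + "<title> Application of hash to data base machine and its architecture. </title> "
                + "<journal> New Generation Computing, </journal> <volume> 1(1), </volume> "
                + "<date> 1983. </date>";
        System.out.println(plainCitation(line));
        segments(line).forEach(s -> System.out.println(s.getKey() + " = " + s.getValue()));
    }
}
```

The values keep their trailing punctuation ("1983.", "1(1),"), so a test would still need to decide how to compare them against the cleaned fields a parser returns.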

@InAnYan
Member

InAnYan commented Apr 7, 2025

Hi! Thanks for checking out JabRef and its text parser! I have worked on this feature for some time.

TL;DR: Plain citation parsing in JabRef does not work very well: the rule-based parser is underdeveloped (rule-based algorithms are hard to develop), Grobid gives irrelevant entries, and the LLM is hard to set up.

We call this feature the plain citations parser (sometimes the plain references parser).

In JabRef 5.16 there are two ways to parse citations: rule-based and Grobid. As we have seen in many experiments (see 1, 2, 3), both parsers give not-so-good results. Grobid quite often gives irrelevant entries (and for some time JabRef's Grobid instance was down 4), which is probably what you are experiencing.

In the new version of JabRef (6.0-alpha), we have added warnings about confabulations in plain citation parsing 5 and expanded the documentation 6. One can also use an LLM to parse citations 7; people (including me) say it works well enough 8, 9.

Footnotes

  1. https://github.com/JabRef/jabref/issues/11805

  2. https://github.com/JabRef/jabref/issues/12211

  3. https://github.com/JabRef/jabref/issues/6672

  4. https://github.com/JabRef/jabref/issues/12211

  5. https://github.com/JabRef/jabref/issues/11825

  6. https://docs.jabref.org/collect/newentryfromplaintext

  7. https://docs.jabref.org/collect/newentryfromplaintext#llm

  8. https://github.com/JabRef/jabref/issues/11805#issuecomment-2445963839

  9. https://github.com/JabRef/jabref/issues/12211#issuecomment-2484934246

@Bha2912

Bha2912 commented Apr 7, 2025

Hello! I’m interested in working on this issue. Let me know if I can take it up.

@InAnYan
Member

InAnYan commented Apr 7, 2025

@Bha2912, this issue is not marked as a good first issue (or with other labels), and we currently have no plan for solving it (it is still under discussion).

So, for now, you can look at other issues.

@bwakkie bwakkie changed the title from "Text parser giving unrelated citations back. How come? Attached a test case" to "Plain citations parser giving unrelated citations back. How come? Attached a test case" on Apr 7, 2025
@bwakkie
Author

bwakkie commented Apr 7, 2025

Hi @InAnYan, I changed the title based on your comments.

I know how difficult this is, as I have been doing my best with vim and regexes myself for years. But what Grobid returns at the moment is, in my opinion, not worth it.
I know that Grobid can be trained, but how to do that is a bit over my head for now. I run my own Grobid server and it has similar problems, hence I was looking for an alternative solution.

For the next test I used the JabRef development version with the original text input from above. Grobid got just 10% correct and 90% was complete garbage, so for now I do not trust the Grobid parser one bit.

In the rule-based result, I see that for each line the author of the previous line is pasted back into the author field:

testcase_rulebased.bib.txt

anystyle.io helped in my case. Would its way of dealing with plain citations not be a better idea?
E.g. parse -> show to user -> user fixes -> parser learns -> add to a library

@InAnYan
Member

InAnYan commented Apr 9, 2025

Thanks for suggesting anystyle.io and giving ideas on how to improve plain citation parsing!

I'll close this issue because 1) it's known (infamous, I would say 😅) and 2) it is superseded by #12915. Don't think I'm closing this without any future attention; no, I made a "feature request" out of it 😃

@InAnYan InAnYan closed this as completed Apr 9, 2025
@InAnYan
Member

InAnYan commented Apr 9, 2025

Actually, after speaking with the team, we decided to keep this issue open, since it seems we have closed all other issues connected to this topic. One should remain open so that duplicate issues can refer to it. Also, we won't forget about this: your issue is quite detailed, and you've included your own examples and results.

Sorry for the inconvenience.

@ThiloteE
Member

ThiloteE commented Apr 9, 2025

How to reproduce:

  1. Add the entries from original OP mentioned in the comments above via
  2. Check the original (expected) BibTeX data against the actual metadata that ends up in JabRef.

To Do:

  • Add the test cases for the rule-based citation parser.
  • If the tests fail, adapt the existing RegExes in JabRef's code (which RegEx engine does JabRef use?) until the tests pass.
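On the RegEx engine question: JabRef is a Java application, so the JDK's java.util.regex is available without extra dependencies; whether the existing rules rely on anything else is not stated in this thread. As a minimal sketch of what one rule and its test input might look like, the hypothetical class below handles only the "Authors. Title. Venue, Year." layout. The pattern is illustrative and is not taken from RuleBasedPlainCitationParser.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical single-rule parser for citations shaped like
// "Authors. Title. Venue, Year." (illustration only, not JabRef's rules).
final class SimpleCitationRule {

    private static final Pattern RULE = Pattern.compile(
            "(?<author>.+?\\p{Ll}{2})\\.\\s+" // authors end at a period after a word, not after an initial
            + "(?<title>\\p{Lu}.+?)\\.\\s+"   // title runs to the next period followed by whitespace
            + "(?<rest>.+?)\\s*"              // venue, volume, pages, month, ...
            + "(?<year>\\d{4})\\.?");         // four-digit year at the very end

    // Returns the extracted fields, or empty if the citation does not fit the layout.
    static Optional<Map<String, String>> parse(String plainCitation) {
        Matcher matcher = RULE.matcher(plainCitation.trim());
        if (!matcher.matches()) {
            return Optional.empty();
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("author", matcher.group("author"));
        fields.put("title", matcher.group("title"));
        fields.put("rest", matcher.group("rest"));
        fields.put("year", matcher.group("year"));
        return Optional.of(fields);
    }

    public static void main(String[] args) {
        System.out.println(parse("M. Kitsuregawa, H. Tanaka, and T. Moto-oka. Application of hash "
                + "to data base machine and its architecture. New Generation Computing, 1(1), 1983."));
    }
}
```

A citation that does not end with a year, such as the APA-style Gmytrasiewicz entry above, does not match this rule at all, which illustrates why a single RegEx is not enough and why each layout would need its own rule and test case.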

Bonus points for communicating with and providing the test cases to improve upstream projects, such as

Btw., for inspiration, in an older version of JabRef, this dialog looked like this:
[Image: screenshot of the dialog in an older JabRef version]

@koppor
Member

koppor commented Apr 10, 2025

As far as I understand the issue, this is NOT org.jabref.logic.importer.fileformat.pdf.PdfContentImporter, because it is NOT about PDF to BibTeX.

It is about all classes implementing org.jabref.logic.importer.plaincitation.PlainCitationParser.

The RegEx-based is this one: org.jabref.logic.importer.plaincitation.RuleBasedPlainCitationParser

TBH, I totally forgot about that when implementing #11156 -> org.jabref.logic.importer.fileformat.BibliographyFromPdfImporter

Thus, the first action is:

  • Extract the citation parsing logic from BibliographyFromPdfImporter into RuleBasedPlainCitationParserV2
  • Compare the functionality of RuleBasedPlainCitationParser with RuleBasedPlainCitationParserV2
  • Merge the two versions (if possible)
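The comparison step could be driven by a small harness such as the sketch below. The method signature of PlainCitationParser is not shown in this thread, so both parsers are modelled as plain java.util.function.Function stand-ins; ParserComparison is a hypothetical name, and adapting the real classes to this shape is left as an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.function.Function;

// Hypothetical harness: lists the inputs on which two parser variants disagree.
final class ParserComparison {

    // R is whatever result type the parsers return; it only needs a meaningful equals().
    static <R> List<String> disagreements(List<String> citations,
                                          Function<String, R> first,
                                          Function<String, R> second) {
        List<String> differing = new ArrayList<>();
        for (String citation : citations) {
            // Objects.equals also handles parsers that return null for unparsable input.
            if (!Objects.equals(first.apply(citation), second.apply(citation))) {
                differing.add(citation);
            }
        }
        return differing;
    }
}
```

Running both variants over the same list of plain citations and inspecting only the disagreements would show where the two implementations differ before any merge is attempted.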

@ThiloteE ThiloteE changed the title from "Plain citations parser giving unrelated citations back. How come? Attached a test case" to "Improve "Plain citations parser"" on Apr 13, 2025
@ThiloteE ThiloteE changed the title from "Improve "Plain citations parser"" to "Improve rule-based "Plain citations parser"" on Apr 13, 2025