#grobid

Christian Boulanger @cmboulanger@sciences.social
2025-06-11

We use #grobid and the plos1000 #goldstandard as a baseline to compare the performance of LLM-based solutions.

Takeaways:

- Grobid is still the better choice for literature similar to the type it was trained on (mostly English-language STEM scholarship), since it is much faster and less resource-intensive
- For footnoted literature, experiments with LLamore/#Gemini show 3x better performance than Grobid
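A comparison against a gold standard like this usually comes down to precision, recall, and F1 over the extracted references. The following is only a minimal sketch of that idea, not the evaluation code used in these experiments: the function names `normalize` and `score` are hypothetical, and exact matching after whitespace and case normalization is a simplifying assumption (real evaluations typically compare field by field or use fuzzy matching).

```python
def normalize(ref: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are not counted as errors."""
    return " ".join(ref.lower().split())


def score(extracted: list[str], gold: list[str]) -> tuple[float, float, float]:
    """Return (precision, recall, f1) for extracted references against a gold standard.

    Assumption: a reference counts as correct only if it matches a gold
    reference exactly after normalization.
    """
    extracted_set = {normalize(r) for r in extracted}
    gold_set = {normalize(r) for r in gold}
    true_positives = len(extracted_set & gold_set)

    precision = true_positives / len(extracted_set) if extracted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example data, purely for illustration.
gold_refs = ["A. Smith, Law and Society, 1990", "B. Jones, Courts, 2001"]
extracted_refs = ["a. smith,  law and society, 1990", "C. Wrong, Something Else, 2020"]
precision, recall, f1 = score(extracted_refs, gold_refs)
```

Running the same `score` function over the output of each tool on the same gold standard gives numbers that can be compared directly.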

Public Knowledge Project @PublicKnowledgeProject
2024-12-16

Reminder: PKP invites communities to register for its Software Development Update webinar on December 16th, 2024, at 8 AM PST.

Topics

> () / (OMP) / (OPS) version 3.5.0 preview and release timeline

> Typesetting workflow
> Tasks and Discussions
> Receiving emails in OJS
> Breaking the upload / download pattern with
> Pre-filling metadata automatically with

Registration: pkp.sfu.ca/2024/11/28/pkp-soft

Public Knowledge Project @PublicKnowledgeProject
2024-11-28

⚙️ You are invited to PKP's next Software Development Update webinar!

December 16, 2024, 8 AM PST

Topics

* () / (OMP) / (OPS) v3.5.0 preview and release timeline

* Typesetting workflow

* Tasks / Discussions

* Receiving emails in OJS

* Breaking upload / download pattern with

* Pre-filling automatically with

Details and registration:

pkp.sfu.ca/2024/11/28/pkp-soft

Hope to meet you there!

Christian Boulanger @cmboulanger@sciences.social
2024-09-16

@osma @storytracer Hi, I just found this old thread. We're working on a #referenceextraction & #evaluation workflow involving #LLMs, measuring their performance against a hand-annotated dataset of older scholarly articles with #footnotes. Untrained #GROBID performs very badly on it, but that does not mean it will still do so once properly trained with a good dataset.

Christian Boulanger @cmboulanger@sciences.social
2024-09-16

Do you want to run the #GROBID PDF-to-#TEI conversion library/server with #Apptainer, for example for #ReferenceExtraction? There was a problem converting the #Docker image, but here's how to solve the problem: github.com/kermitt2/grobid/iss

2023-10-11

Curious surprise!

Grobid has started using LaTeXML for processing LaTeX inputs (I think just recently), as part of its TEI-based pipeline.

Details at:
grobid.readthedocs.io/en/lates

#TeXLaTeX #latexml #grobid #TEI

The Grobid technical architecture diagram. Three possible inputs - PDF, XML, LaTeX, where the LaTeX input is mapped via LaTeXML into TEI.

The right-hand side of the diagram shows 9 downstream applications - faceted search, bibliometrics, knowledge bases, discovery, augmented document, document summarization, corpus creation, accessible document and LLM pre-training.

Osma Suominen @osma@sigmoid.social
2023-01-17

Has anyone used large language models for extracting (#bibliographic style, e.g. #DublinCore) #metadata from fulltext (PDF) documents? I tried this with a fine-tuned #OpenAI #GPT3 Curie model and the results were outrageously good, at least for doctoral theses. Much better than traditional NLP methods like #GROBID.

#AI #machinelearning #LLM

2022-02-01

[#BeautifulSoup #Pandas] Parsing TEI XML documents [from #grobid] with Python | Data, code and science komax.github.io/blog/text/pyth
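For readers who want a feel for what parsing such TEI output involves, here is a minimal sketch. It is not the code from the linked blog post (which uses BeautifulSoup and Pandas); it uses only Python's standard library, the embedded `SAMPLE_TEI` document is a made-up, heavily reduced stand-in for real Grobid output, and the function name `parse_references` is hypothetical.

```python
import xml.etree.ElementTree as ET

# TEI documents live in this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Made-up, heavily reduced example of a TEI bibliography section.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><back><div type="references"><listBibl>
    <biblStruct>
      <analytic>
        <title level="a" type="main">An example article</title>
        <author><persName>
          <forename type="first">Ada</forename><surname>Example</surname>
        </persName></author>
      </analytic>
      <monogr>
        <title level="j">Journal of Examples</title>
        <imprint><date type="published" when="2020"/></imprint>
      </monogr>
    </biblStruct>
  </listBibl></div></back></text>
</TEI>"""


def parse_references(tei_xml: str) -> list[dict]:
    """Collect title, author surnames, and year from each biblStruct element."""
    root = ET.fromstring(tei_xml)
    references = []
    for bibl in root.iterfind(".//tei:listBibl/tei:biblStruct", TEI_NS):
        # The first title in document order (here: the article title).
        title = bibl.find(".//tei:title", TEI_NS)
        date = bibl.find(".//tei:imprint/tei:date", TEI_NS)
        surnames = [
            s.text for s in bibl.iterfind(".//tei:author//tei:surname", TEI_NS)
        ]
        references.append(
            {
                "title": title.text if title is not None else None,
                "authors": surnames,
                "year": date.get("when") if date is not None else None,
            }
        )
    return references


refs = parse_references(SAMPLE_TEI)
```

The resulting list of dictionaries can be passed to `pandas.DataFrame(refs)` if a tabular view is wanted.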

Nemo_bis 🌈 @nemobis@mamot.fr
2020-02-13
