#Docling

2025-05-27

#Docling sounds pretty interesting on their website (docling-project.github.io/docl), but after having played around with it for a bit, I found the JSON/Markdown/HTML results pretty disappointing.

OCR was mediocre to bad, table/heading/list recognition too. It didn't even add line breaks between the lines in the address part of a letter.

But I'm using the defaults. Any suggestions on, like, different models or engines and stuff?

2025-05-27

Yeah, colors and formatting in CLI tools are usually a good thing, but if your --help looks like this, you probably need to take a step back.

#docling #CLI #Python

The output of "uvx docling --help". It starts with a simple (yellow/white and bold) "Usage: docling [OPTIONS] source".

Below that, a full-width, but one line high rounded box titled "Arguments" that says: "* input_sources source PDF files to convert. Can be local file / directory paths or URL. [default: None] [required]".

So far, so good, but now starts a full-width rounded box titled "Options" that's taller than my vertical screen. It's a table with four columns, all of equal width:

The first column lists the actual option in cyan, e.g. --ocr, or --allow-external-plugins.

The second column is empty half of the time. It contains the negation option, if available. So for example, if the first column has --allow-external-plugins, the second one has --no-allow-external-plugins, but if the first column has --from, the second column is empty.

The third column contains the argument or enum to that option and is also empty half of the time. So for the --output option the column says "PATH", and for --table-mode it says "[fast|accurate]".

The fourth column explains what the option actually does and also lists the default in a new line after that, e.g. for --enrich-code it says "Enable the code enrichment model in the pipeline. [default: no-enrich-code]".

The fourth column being only ¼ of the screen but full of text, while the rest of the table contains a lot of horizontal and vertical whitespace (because column 4 usually spans multiple lines) makes it pretty hard to read.

2025-05-13

Check out the #Docling Meetup during the week of #RHSummit in #Boston - a great chance for the community to connect with Docling experts and dive into the latest innovations!

Tuesday, May 20, 2025
6:00 to 8:00 PM EDT

Details & RSVP: ibm.biz/DoclingCommunity

#opensource #redhat #ibm #genAI

olеg lаvrоvsky loleg@hachyderm.io
2025-05-09

Taking part in the #Docling workshop at the #OpenSource AI conference. This is a project I heard about at #DINAconCH a few months ago, and it seems to have since exploded in popularity on PyPI and GitHub - in part thanks to the #CHopen community ⛹️‍♂️

There are strong overlaps with what I've been doing at #ProxeusApp - my notes from the Docling deep-dive have been posted here: log.alets.ch/105/

Peter Starr explains the basic problem of extracting data from richly formatted (usually PDF) documents in the intro of the workshop.
InstructLab
2025-04-30

Check out the sessions in the AI track on Community Day!

events.experiences.redhat.com/

We have topics ranging from to , inferencing to feature stores, topped with your favourite tools and models. Register and add the sessions to your schedule!

2025-03-22

How I won the RAG Challenge: from zero to SoTA in a single competition

When a newcomer tries to build their first question-answering LLM system, they quickly learn that basic RAG is for kids and needs to be "leveled up" with trendy techniques: Hybrid Search, Parent Document Retrieval, Reranking, and dozens of other obscure terms. Your eyes dart around, choice paralysis sets in, your palms sweat. But what if you tried them all? I decided to spend 200+ hours preparing for the competition and personally test each of these techniques. It worked out so well that I won the competition in every category. Now I'm sharing which techniques turned out to be useful, which didn't, and how to reproduce my result.

habr.com/ru/articles/893356/

#RAG #Docling #векторный_поиск #retrieval_augmented_generation #question_answering #LLM #FAISS #GPT #ChatGPT #парсинг_PDF

Václav Vančura vancura@mastodon.design
2025-03-19

Seeing the #Docling Actor (that @netmilk and I wrote) listed on the Docling Featured Integrations page is pretty cool! docling-project.github.io/docl

Václav Vančura vancura@mastodon.design
2025-02-14

Check out my latest work, the #Docling Actor, on @apify: apify.com/vancura/docling

This new tool may make your life easier if you need to convert documents to JSON, HTML, or Markdown. It converts PDFs, DOCX files, and images into clean, structured formats with built-in OCR support, making it ideal for text extraction, metadata analysis, and custom tagging. (1/2)

An illustration showing how the Docling Actor works. On the left, there's a bubble with "Input: PDF, image, DOCS, other text-based formats"; in the middle, a bubble with "Docling: Process complex documents (PDF, DOCX, images) and convert them into structured formats (Markdown, JSON, HTML, Text, or DocTags) with optional OCR support." On the right, there's the last bubble with "Output: Markdown, JSON, HTML, Plain Text and DocTags". All bubbles are connected with arrows running from left to right, showing how the process works from beginning to end.

Discovering Docling

A couple of years back, I started using an RSS reader again. I find it really useful for compiling notes for the CAT newsletter I’ve been editing and publishing most weeks, but I also end up with loads of links that are interesting, but do not fit.

This is the first of what will hopefully be a series of small “just a link” posts about stuff I find, as I clear out my newsreader backlog. Starting with an interesting-looking OSS project, Docling.

What is Docling?

Found via Simon Willison’s blog, Docling is a Python project from IBM that appears to use a series of small ML models working together to parse PDF documents more effectively, making it possible to pull meaningful information out of them. There’s a technical report explaining it in detail on arXiv, and it’s on GitHub too.

I’d most likely find this useful at work, where I maintain a platform to aggregate sustainability data from providers of managed and hosted digital services, like WordPress hosting, virtual machines, storage, and so on.

Here’s the blurb:

Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.

In addition to being able to ‘read’ PDFs, it’s also able to output the content in helpful chunks, which makes it suitable for the RAG-style ‘asking questions of a document’ analysis that we now see.
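
As a rough illustration of what ‘chunks’ means here, the sketch below splits Markdown text into heading-delimited sections, each of which could then be indexed and retrieved separately. It is a simplified stand-in written for this post, not Docling’s own chunker; the names `Chunk` and `chunk_by_heading` are hypothetical.

```python
import re
from dataclasses import dataclass

# A Markdown ATX heading: one to six '#' characters followed by whitespace.
HEADING = re.compile(r"#{1,6}\s+(.*)")


@dataclass
class Chunk:
    heading: str  # nearest heading above the text ("" if there is none)
    text: str     # body text that belongs to that heading


def chunk_by_heading(markdown: str) -> list[Chunk]:
    """Split Markdown into one chunk per heading, skipping empty sections."""
    chunks: list[Chunk] = []
    heading = ""
    body: list[str] = []

    def flush() -> None:
        # Store the section collected so far, unless it has no body text.
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk(heading, text))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading = match.group(1).strip()
            body = []
        else:
            body.append(line)
    flush()
    return chunks


if __name__ == "__main__":
    doc = "Intro\n\n# Energy\nWe used less.\n\n# Water\nWe used more.\n"
    for chunk in chunk_by_heading(doc):
        print(repr(chunk.heading), "->", chunk.text)
```

Splitting on headings keeps each chunk tied to the section it came from, which is the kind of context a question-answering step would need when pointing back to a place in a report.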

Why I’m interested in it

I’m particularly interested in how well smarter PDF processing software like this handles published ‘CSRD’ reports from companies that, for the first time ever, should be publishing information in standardised, comparable formats, making it possible to make meaningful comparisons between them.

I’ve written a bit already about how the CSRD, and more specifically the ESRS, make it possible to fetch specific datapoints out of compliant reports. But it looks like not every report in 2025 will be digitally tagged, or follow the standardised file format required by the law (the ESEF, a flavour of iXBRL, which itself is a flavour of XHTML), as the requirement appears to be being phased in.

Until these reports are published in that ESEF format, I’m curious about whether it’s possible to parse PDF reports to make somewhat comparable queries with docling, to the ones I might make against an iXBRL file using a dedicated tool like Arelle.

Trying it out

I just tried running it from the command line in my terminal on my 2020 MacBook M1 with 16 GB of RAM:

uvx docling Netcompany_Annual-Report_2024.pdf --to json --to md

The Netcompany Annual Report is apparently one of the first reports published in 2025 that follows the CSRD, so it seemed a good target report to try out.

This command took about 14 minutes on my machine, to:

  1. run uv to download dependencies
  2. download the various ML models used to parse the PDF
  3. parse the 50 MB PDF
  4. generate a 65 MB Markdown file with embedded data-encoded images, along with a 325 MB JSON file containing the same data-encoded images.
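
To get a feel for how much of those file sizes comes from the embedded images rather than the text, a minimal Python sketch like the one below could strip the images from the Markdown and compare lengths. It assumes the images are embedded as standard `data:image/...;base64,...` URIs inside Markdown image links; the function names `strip_data_uris` and `image_share` and the placeholder text are made up for illustration and are not part of Docling.

```python
import re

# Matches a Markdown image whose target is a base64 data URI, e.g.
# ![Image](data:image/png;base64,iVBORw0...)
DATA_URI_IMAGE = re.compile(
    r"!\[[^\]]*\]\(data:image/[^;)]+;base64,[A-Za-z0-9+/=\s]*\)"
)


def strip_data_uris(markdown: str, placeholder: str = "<!-- image -->") -> str:
    """Replace every embedded base64 image with a short placeholder."""
    return DATA_URI_IMAGE.sub(placeholder, markdown)


def image_share(markdown: str) -> float:
    """Fraction of the text (in characters) taken up by embedded images."""
    if not markdown:
        return 0.0
    without_images = DATA_URI_IMAGE.sub("", markdown)
    return 1 - len(without_images) / len(markdown)


if __name__ == "__main__":
    sample = "# Report\n\n![Image](data:image/png;base64,AAAA)\n\nRevenue grew.\n"
    print(strip_data_uris(sample))
    print(f"{image_share(sample):.0%} of the sample is image data")
```

For the real output, the generated Markdown file would first be read into a string and passed to these functions; the stripped text is what any later querying would need to work on.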

The second run took about 12 minutes to carry out steps 3 and 4. So there’s data there. Can it be easily queried to look for ESRS style datapoints?

That’s the next challenge.

#AI #docling #pdfs

2025-01-14

Join Michele Dolfi and me at #RedHat Summit Connect Zurich tomorrow to hear about #Docling, #InstructLab, and more!

Details and RSVP: redhat.com/en/summit/connect/e

Speaker card for the event that says "I'm speaking at Red Hat Summit Connect Zurich, January 15, 2025." with the Red Hat logo on the left half; and "Track 3, AI & DevX" with hashtags #AI and #DeveloperProductivity on the right half.
InstructLab
2025-01-08

Join us at : Connect Zürich on Jan 15 to hear from Michele Dolfi and @cybette about powered by !

Full agenda and registration: redhat.com/en/summit/connect/e

Banner for Red Hat Summit Connect indicating location and date (Zürich, January 15, 2025) plus the Red Hat logo.
2025-01-03

Wrestling with PDF files today… delighted to find #Docling ds4sd.github.io/docling/

It’s a solid CLI for parsing documents. It was annoying to install, but works well. I still have manual cleanup to do, but way easier than manual and higher quality than other AI options

2024-12-12

🔥 Docling: AI-powered document processing! 🔥
🤯 PDFs & DOCXs in your AI workflow? Docling makes it easy! Converts to markdown & JSON for RAG and more. Blazing fast! 🚀

youtu.be/zSCxbqgqeJ8?si=mede5e

#AI #Docling #DocumentProcessing #OpenSource #RAG #LlamaIndex #Efficiency #Workflows

2024-11-16

Docling, IBM’s new open-source toolkit, is designed to more easily unearth that information for generative AI applications. The toolkit streamlines the process of turning unstructured documents into JSON and Markdown files that are easy for large language models (LLMs) and other foundation models to digest.

github.com/DS4SD/docling
#docling #aiml #ml #genai

2024-11-15

Great to see #Docling generating much well-deserved buzz and trending on GitHub! This #opensource document ingestion tool by #IBMResearch is already in use by #InstructLab (and soon #RHELAI). Exciting stuff!

redhat.com/en/blog/docling-mis
