#OCRmyPDF

2026-02-05

Old fuzzy pages, still tricky at 1600 dpi with #XSane and #OCRmyPDF on unix.
=
Hooper Ranch Bookkeeper ............... cobhebgeneaneen Lisa Salkov, Mana Diaz (alt.)
=
1. Mana should have been Maria.

2. And a bunch of dots got hallucinated into random letters. Same as it ever was: back to gibberish that reads like an encoded "Bacon wrote Shakespeare" cipher.

Otherwise, damned decent!

Liane M. Dubowy lmd@social.heise.de
2026-01-06

@WorziArmin A colleague once presented tools for #Dokumentenmanagement (document management). But I fear that requires even more discipline. #OCRmyPDF can't solve the problem; it just scans things in and does the text recognition. For anyone who can't be bothered to sort, I actually recommend #Recoll. Index the hard drive and it finds almost everything. But the chaos on the disk would drive me crazy.

2025-11-20

¯\_(ツ)_/¯ *meh
The Homebrew pillow 12.0.0 upgrade breaks my PDF workflow :(
But I can't downgrade to 11.3.0 because of dependencies
And because Homebrew doesn't list the old version?

Hmpf

#homebrew #python #ocrmypdf

Schlaf ist überbewertet halbwach@wandzeitung.xyz
2025-11-17

I'm not usually the type for #Software and recommendations....

But this one is an absolute must if you want to add a text layer to masses of PDF files after the fact.

Bulk-scanning into a single file and having it split automatically during text recognition is just one highlight...

A must-have!
Github:
github.com/digidigital/OCRthyP

#ocrthypdf #ocr #ocrmypdf #ubuntu #foss

A screenshot of OCRthyPDF version 0.7.0 at work.
Only the read folder and the write folder were selected; everything else was left at its default setting.
Be sure to read the Git repository!
Tim Schlotfeldt ⚓🏳️‍🌈 ts-new@hub.tschlotfeldt.de
2025-10-28
@Martin Seeger Ah, naming really is a topic. And then again it isn't. My naming scheme for files is date-type-creator.

I don't use #paperless, though; I do it by hand with #ocrmypdf. I sort the files into a directory structure. And thanks to OCR, #Recoll then finds everything again for me. @Bastian
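
A date-type-creator scheme like the one described above could be generated with a small helper. This is only a sketch: the function name `archive_name`, the ISO date format, and the underscore substitution for awkward characters are assumptions for illustration, not the poster's actual convention.

```python
import re
from datetime import date


def archive_name(day: date, doc_type: str, creator: str, ext: str = ".pdf") -> str:
    """Build a 'date-type-creator' file name (hypothetical helper)."""

    def slug(text: str) -> str:
        # Replace every run of non-word characters with one underscore,
        # so the hyphen stays reserved as the field separator.
        return re.sub(r"[^\w]+", "_", text.strip()).strip("_")

    return f"{day.isoformat()}-{slug(doc_type)}-{slug(creator)}{ext}"


print(archive_name(date(2025, 10, 28), "Invoice", "ACME Corp"))
# prints 2025-10-28-Invoice-ACME_Corp.pdf
```

Putting the ISO date first means a plain alphabetical listing is also chronological, which helps when the files are sorted into directories by hand.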
Victor Forberger vforberger@fosstodon.org
2025-10-11

@D_J_Nathanson

#pdftk for terminal
@libreoffice draw
#masterpdf v4 is free; current version is paid
#ocrmypdf
#pdfunite etc

I can send you various aliases I have created. Also, see various pdf posts at linuxatty.wordpress.com.

Jonathan Kamens 86 47 jik@federate.social
2025-09-09

Editing or redacting a #PDF using #LibreOffice Draw is far superior to the commonly used method of converting the PDF's pages into images and editing the images, because the latter results in a PDF that is many times larger and doesn't render as well. Also, text copy and paste is lost, which you can recover from to some extent with a tool like #OCRmyPDF, but you'll never get the text quality back to as high as it was before you converted the PDF to images.
#FOSS

2025-09-05

Have you ever needed to extract text from images embedded in a #PDF? I can highly recommend the open source #CLI tool #OCRmyPDF, which is easy to automate in, for example, a #DataPipeline.

It uses #Tesseract #OCR under the hood and has many options to experiment with to get the best possible accuracy for your language and PDF content.

You can get started with just a few commands:

samuelplumppu.se/blog/automate
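
As a rough idea of what automating the CLI in a pipeline can look like, here is a minimal Python sketch. It is not taken from the linked post: the helper names `build_ocr_command` and `ocr_folder` and the folder layout are hypothetical, and it assumes the `ocrmypdf` executable is on the PATH. The `-l`, `--deskew` and `--skip-text` options are standard OCRmyPDF flags.

```python
import subprocess
from pathlib import Path


def build_ocr_command(src, dst, language="eng", deskew=True):
    """Return the argument list for one OCRmyPDF run (hypothetical helper)."""
    cmd = ["ocrmypdf", "-l", language]  # -l selects the Tesseract language
    if deskew:
        cmd.append("--deskew")          # straighten slightly crooked scans
    cmd.append("--skip-text")           # leave pages that already have text alone
    cmd += [str(src), str(dst)]
    return cmd


def ocr_folder(inbox, outbox, language="eng"):
    """OCR every PDF in `inbox` and write the results to `outbox`."""
    outbox = Path(outbox)
    outbox.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(inbox).glob("*.pdf")):
        # check=True raises CalledProcessError if OCRmyPDF reports a failure
        subprocess.run(build_ocr_command(src, outbox / src.name, language), check=True)
```

Keeping the command construction separate from the `subprocess.run` call makes it easy to log or unit-test the exact arguments before anything is executed.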

2025-08-24

Adding an OCR layer and other PDF transformations

When documents are scanned and saved as PDF, they often end up stored as raster images. This is inconvenient because it makes full-text search of the content impossible. The OCRmyPDF utility solves this problem: with a single console command it adds an OCR layer of recognized text to a PDF document. Several other useful PDF parsing tools are mentioned below, including ones for converting complex mathematical PDF documents into the Markdown text format.

habr.com/ru/companies/globalsi

#pdf #syntax #markitdown #конвертация #ocrmypdf #ocr

Dustin Dustwin
2025-03-20

2/2 re

All three were set to and . None rotated the page that was sideways, but they all pages that needed it. Kofax was the speediest of the bunch, then not far behind and was by far the slowest.

File size: Foxit produced the smallest file size, created files double the original. OCRmyPDF struggled here, ballooning files to at least 6 times the original size.

Dustin Dustwin
2025-03-20

1/2 re OCR

I got to do a fun test at work with . Phantom with AbbyFineReader from 2013, Power PDF from 2020 and via with as the OCR engine.

The best OCR results came from OCRmyPDF. Second was Kofax; lagging behind was the over-10-year-old Foxit. OCRmyPDF performed great and picked up a few more characters, especially in fuzzy scanned text, plus it got some handwritten text.

2025-03-06

Good grief, aren't we up to date again today:
#PDF #OCR #CLI #Stapelverarbeitung ... as if it were 2005 ;)

Then again: since documents are increasingly being "scanned" with a phone, (after-the-fact) text recognition could actually be quite topical :)

tutonaut.de/pdf-texterkennung-

#Opensource #OCRmyPDF

2025-03-02

A digital archive with full-text search, including across PDFs and images

Over the years, everyone accumulates a pile of paper documents that are hard to sort through or search. The problem is even more pressing for organizations. The open-source program Paperless-ngx is positioned as the optimal solution for building a digital archive. With a built-in optical character recognition (OCR) system and learning from previously scanned documents, it creates a searchable repository where any document can be found quickly. All documents are assigned tags, so they can appear in several thematic categories, which is more convenient than sorting them into folders. Paperless-ngx can be installed on a home server, and documents can be uploaded through a browser from any device.

habr.com/ru/companies/globalsi

#Paperless #Paperlessngx #цифровое_хранилище #электронный_документооборот #сканирование_документов #OCRmyPDF

Dustin Dustwin
2025-02-20

is nice because I can use on . I set it up to watch a folder for any new then automatically then to a "done" folder. It is very nice to have it done automatically in the background. No more opening, clicking to OCR and waiting on the software and unable to open other PDFs. Plus, this process is way lighter on resources. Man, I love .
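
A watch-folder workflow of this kind can be approximated with a simple polling loop. The sketch below is an assumption-laden illustration, not the poster's setup: the names `process_once` and `watch`, the ten-second interval, and the choice to delete the original after a successful run are all hypothetical, and it assumes `ocrmypdf` is on the PATH.

```python
import subprocess
import time
from pathlib import Path


def ocr_with_cli(src: Path, dst: Path) -> None:
    # Assumes the `ocrmypdf` executable is installed and on PATH.
    subprocess.run(["ocrmypdf", str(src), str(dst)], check=True)


def process_once(inbox: Path, done: Path, ocr=ocr_with_cli) -> list:
    """OCR every PDF currently in `inbox`, writing the result into `done`."""
    done.mkdir(parents=True, exist_ok=True)
    finished = []
    for src in sorted(inbox.glob("*.pdf")):
        dst = done / src.name
        ocr(src, dst)   # raises on failure, so the original is kept
        src.unlink()    # assumption: drop the original once OCR succeeded
        finished.append(dst)
    return finished


def watch(inbox: Path, done: Path, interval: float = 10.0) -> None:
    """Poll `inbox` forever; a real setup should also skip files still being written."""
    while True:
        process_once(inbox, done)
        time.sleep(interval)
```

Passing the OCR step in as a callable keeps the folder logic testable without OCRmyPDF installed; calling something like `watch(Path("scans/inbox"), Path("scans/done"))` would start the loop.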

2025-02-14

Lifehacker: This Free Tool Can Help You Search and Copy (Nearly) Any PDF. “There’s nothing worse than opening a PDF and realizing you can’t use the search function or even highlight text. This typically happens when a PDF was created by scanning a paper document — it’s just a series of images. Most modern scanning software uses Optical Character Recognition (OCR) so that words are both […]

https://rbfirehose.com/2025/02/14/lifehacker-this-free-tool-can-help-you-search-and-copy-nearly-any-pdf/

2024-09-08

For my server backend, I used a #python script to handle the requests.

Basically, it makes use of three components:

#CV2 (open computer vision). This is a Swiss Army knife for image manipulation. I use it to reduce an image to a b/w format which only contains the text. Ideal for a quick copy

#imagemagick to apply sigmoidal contrast, removing most problems due to poor illumination

#OCRmyPDF to make it more convenient to work with the scanned sheet
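
The three components could be chained roughly as follows. This is a sketch under several assumptions rather than the poster's script: the function names, the order of the steps, the contrast parameters, the 300 dpi value and the ImageMagick 7 `magick` command (older installs use `convert`) are all illustrative choices. The OpenCV import is deferred so the command helpers work even where `cv2` is not installed.

```python
import subprocess
from pathlib import Path


def contrast_command(src, dst, contrast=5, midpoint="50%"):
    # ImageMagick's -sigmoidal-contrast takes "<contrast>x<midpoint>".
    return ["magick", str(src), "-sigmoidal-contrast", f"{contrast}x{midpoint}", str(dst)]


def ocr_command(image, pdf, dpi=300):
    # --image-dpi tells OCRmyPDF which resolution to assume for an image input.
    return ["ocrmypdf", "--image-dpi", str(dpi), str(image), str(pdf)]


def binarize(src, dst):
    """Reduce an image to black and white with Otsu thresholding."""
    import cv2  # lazy import: only needed for this step

    gray = cv2.imread(str(src), cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(src)
    _, bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    cv2.imwrite(str(dst), bw)


def scan_to_pdf(photo, pdf, workdir):
    """Contrast fix, then binarize, then OCR (the order is an assumption)."""
    workdir = Path(workdir)
    contrasted = workdir / "contrast.png"
    bw = workdir / "bw.png"
    subprocess.run(contrast_command(photo, contrasted), check=True)
    binarize(contrasted, bw)
    subprocess.run(ocr_command(bw, pdf), check=True)
```

Otsu thresholding picks the black/white cut-off automatically, which tends to work best after the lighting has been evened out, hence the contrast step first.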

R. L. Dane :debian: :openbsd: RL_Dane@fosstodon.org
2024-07-24

@giantspacesquid

I'm just guessing that #ocrmypdf applied some compression options that the scanning software (#KDE Skanpage) didn't.

Elias Probst eliasp
2024-07-03

@nielso may I preach to you of our Lord & Savior (who also multiplies the wonders of for the benefit of his disciples)? 😅

bujabuja
2024-06-16

@nyx@im-in.space I just stumbled over this post and it got me thinking...

If you're still on the lookout... perhaps can help you.

github.com/ocrmypdf/OCRmyPDF

You would have to convert your documents to pdf and then throw them at OCRmyPDF to create searchable pdf files.

If you want a nice web UI, uses OCRmyPDF internally. You can run it in a , upload your PDF files into it and, when it has done its thing, enjoy your searchable PDFs.

Peter Vágner pvagner@fedi.ml
2024-03-17
@meatbag I'm on linux and the best I have found working for me is #ocrmypdf github.com/ocrmypdf/OCRmyPDF
It uses #tesseract under the hood and for static text it's okay. For tables and other material that is difficult to parse it's not useful.
When a PDF already has text, the tools I use for reading it include #firefox and #evince
