#PyMuPDF

2025-02-03

After struggling to get #python #PyMuPDF to work and being close to the deadline, I shifted to using a combination of other commands.

First, I used the #linux #pdftohtml command, which is much faster than PyMuPDF and packages the result much like a saved website.

Next, with #NeoVim and #RegEx, I formatted the #HTML file so that it could be quickly processed with #NodeJs #cheerio and eventually saved, via #json, in #sqlite.

Is it elegant and automatic? No, though it works!

#JavaScript

2025-02-01

Continuing my attempt to extract and format data from PDFs using #python #PyMuPDF.

I was trying to create a perfect chain of functions that would format all the edge cases into the final desired #HTML format. This is where I quickly realized that running every tweaked version of the functions on the 100-page PDF is quite time-consuming.

Instead, I can run it once and save the results in a #sqlite database. Then I can write #sql queries to post-process the edge cases, which also gives me a good enough way to inspect the contents of each page, compared with the previous method of printing the output to the #terminal and scrolling to the desired page. And in the end, I am one step closer to having the data in a #csv file, which is easily exported with #Dbeaver.
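
A minimal sketch of the run-once-then-query idea, not the code from the post. The table name `pages`, its columns, and the helper names are assumptions made for illustration; `extract_pages` assumes a recent PyMuPDF that can be imported as `pymupdf`.

```python
import sqlite3


def extract_pages(pdf_path):
    """Return the plain text of every page (the slow step, run once)."""
    import pymupdf  # imported lazily; older releases expose it as `fitz`

    with pymupdf.open(pdf_path) as doc:
        return [page.get_text("text") for page in doc]


def save_pages(conn, page_texts):
    """Store one row per page so later tweaks never re-read the PDF."""
    # Hypothetical schema: 1-based page number plus the raw page text.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (page_no INTEGER PRIMARY KEY, content TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO pages (page_no, content) VALUES (?, ?)",
        enumerate(page_texts, start=1),
    )
    conn.commit()


def pages_matching(conn, like_pattern):
    """Find the pages that contain an edge case, using plain SQL."""
    rows = conn.execute(
        "SELECT page_no FROM pages WHERE content LIKE ? ORDER BY page_no",
        (like_pattern,),
    )
    return [page_no for (page_no,) in rows]
```

Usage would be something like `save_pages(sqlite3.connect("pages.db"), extract_pages("input.pdf"))` once, followed by as many `pages_matching(...)` queries as needed; both file names are placeholders.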

2025-02-01

Currently trying to extract and format data from PDFs using #python #PyMuPDF.

Initially used the `get_text(value)` method with the `"text"` value, only to learn that I could have potentially saved time directly using the `"html"` value, since I have been creating pattern matchers to format the text into #HTML.

After investigating: although the `"html"` option exists, post-processing its output is more strenuous than my initial approach.

My fascination with the `get_text(value)` method is that each value packages the data differently. Whereas `"html"` puts the text in `<p><span>text</span></p>`, `"xhtml"` puts it instead in `<h1>text</h1>`.
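
A small sketch for comparing how each value packages the same page. The helper names `text_in_formats` and `compare_first_page` are made up for illustration; the only PyMuPDF call relied on is `page.get_text(value)` with the `"text"`, `"html"` and `"xhtml"` values mentioned above, and the exact markup you see will depend on the PDF.

```python
def text_in_formats(page, formats=("text", "html", "xhtml")):
    """Return {format: output} for one page so the packaging can be compared."""
    return {fmt: page.get_text(fmt) for fmt in formats}


def compare_first_page(pdf_path):
    """Print the start of each format's output for the first page."""
    import pymupdf  # lazy import; older releases use `import fitz`

    with pymupdf.open(pdf_path) as doc:
        for fmt, output in text_in_formats(doc[0]).items():
            print(f"--- {fmt} ---")
            print(output[:300])  # the first 300 characters are enough to see the wrapping
```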

Johan van der Knijff bitsgalore@digipres.club
2025-01-21

I just updated my 2023 post on extracting text from #EPUB files in #Python, and added an evaluation of #PyMuPDF (which also supports EPUB!). Includes link to demo script.

bitsgalore.org/2023/03/09/extr

André Ourednik andre_ourednik
2024-12-19

Ever felt the need to convert a PDF into a fixed-layout EPUB that preserves the table of contents, internal cross-references and hyperlinks? Finding no out-of-the-box solution, I've developed one myself using Python and the PyMuPDF library. Here it is, open source, and ready for use:

github.com/aourednik/pdf2epub3

My script is particularly suitable for the conversion of complex layout PDFs generated with variants of .
Enjoy!

Alexandre B A Villares 🐍 villares@ciberlandia.pt
2024-12-15

Today I managed to cobble up a #Python script to remove your name from #PDF annotations using #PyMuPDF and #FreeSimpleGUI, then I tried #pyinstaller and I have something that seems to run on Linux... so many steps!!!

It never ceases to amaze me how hard it is to provide software for other people to run!

If you think it could be useful to you or someone, I AGPL licensed it here:

github.com/villares/anonymize-

UPDATE: @Introscopia built a Windows .exe version for me, also using pyinstaller, yay!

screenshot of the tool panel with fields:
"Input file", "Output file" and "Change names in notes to"
A checkbox to remove the name from main metadata, a "Create modified PDF" button and a "CLOSE/EXIT" button.
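
For readers curious how this kind of renaming can be done, here is a rough sketch. It is not the code of the tool linked above, and the function name and parameters are invented for illustration. It assumes PyMuPDF's `Annot.info` dictionary (where `"title"` holds the author name), `Annot.set_info()`, `Annot.update()` and `Document.set_metadata()`.

```python
def rename_annotation_authors(doc, new_name, replace_metadata_author=False):
    """Replace the author ("title" field) of every annotation; return the count."""
    changed = 0
    for page in doc:
        for annot in page.annots():
            if annot.info.get("title"):  # skip annotations that carry no author
                annot.set_info(title=new_name)
                annot.update()
                changed += 1
    if replace_metadata_author:
        # Also overwrite the author stored in the document-level metadata.
        metadata = dict(doc.metadata or {})
        metadata["author"] = new_name
        doc.set_metadata(metadata)
    return changed
```

With a real file this would be called on an opened document, for example `doc = pymupdf.open("in.pdf")`, then `rename_annotation_authors(doc, "Reviewer")` and `doc.save("out.pdf")`; the file names are placeholders.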
Johan van der Knijff bitsgalore@digipres.club
2024-12-13

New blog post for #fileformatfriday - #PDF Quality assessment for #digitisation batches with #Python, #PyMuPDF and #Pillow. This introduces the new #Pdfquad tool, which might be useful for others as well:

bitsgalore.org/2024/12/13/pdf-

Johan van der Knijff bitsgalore@digipres.club
2024-11-12

Here's a sneak peek at a #PDF Quality Assessment tool I'm working on for digitisation batches, mostly based on #PyMuPDF, #pillow and #Schematron:

github.com/KBNLresearch/pdfqua

(Wouldn't recommend this for production yet, as it's not completely finished, and I'm still changing some things around.)

#wtfPDF

:rss: Qiita - Popular Articles qiita@rss-mstdn.studiofreesia.com
2024-10-27
IB Teguh TM teguhteja
2024-10-20

Discover how to extract images from PDFs using PyMuPDF in Python. This comprehensive guide covers installation, code explanation, best practices, and tips for efficient PDF image extraction.

teguhteja.id/pdf-image-extract
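
As a rough idea of what such an extraction looks like (a sketch, not the code from the linked guide): the generator below assumes PyMuPDF's `page.get_images(full=True)`, whose entries start with the image xref, and `doc.extract_image(xref)`, which returns a dictionary with `"ext"` and `"image"` keys. The function name is made up.

```python
def extract_images(doc):
    """Yield (page_number, xref, extension, image_bytes) for each embedded image."""
    seen = set()
    for page in doc:
        for item in page.get_images(full=True):
            xref = item[0]
            if xref in seen:  # the same image can be referenced from several pages
                continue
            seen.add(xref)
            info = doc.extract_image(xref)
            yield page.number, xref, info["ext"], info["image"]
```

Each yielded tuple can then be written to disk, for example as `f"page{number}-{xref}.{ext}"`; that naming scheme is only a suggestion.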

2024-09-22

#python #linguistics #NLP #pymupdf

Let's say I have a raw text that I got from a pdf, where the authors of said pdf are too boomer to release it as a structured text.

But there are Keywords and chapters.

Do you have good advice or a good resource for how to get that structure back from the content?

(I'm going into it with the agenda to prove that they are badly written, so if I can't identify what a paragraph is about that's "good")

Peter Bittner peterbittner
2024-09-18

Convert a document from or to PDF using Python? If doesn't have you covered try PyMuPDF. Fantastic features for data extraction and conversion. pymupdf.readthedocs.io/

:rss: Qiita - Popular Articles qiita@rss-mstdn.studiofreesia.com
2024-05-26

Which parser is best for preprocessing PDFs before analyzing them with an LLM? (pdfminer, PyMuPDF, pypdf, Unstructured)
qiita.com/cyberBOSE/items/142c

#qiita #Python #pdfminer #PyMuPDF #pyPDF #Unstructured

2024-04-21

Our latest episode covered: #Dangerzone 0.6.0 simplifies its codebase with #PyMuPDF and changes its license

theradio.cc/blog/2024/04/14/ll

#PDF #security #itsecurity

2023-08-08

I love and hate what my job makes me do.

I do just love #python

Learned how to extract geometry and text info from pdf drawings with #pymupdf

"... sufficiently advanced technology" indeed.

Alexandre B A Villares 🐍 villares@ciberlandia.pt
2023-08-02

I'm really very impressed with #PyMuPDF ("import fitz") pymupdf.readthedocs.io/en/late

It looks like the most powerful library currently available for manipulating #PDF with #Python

(Soon I'll post my #fanzine impositionator-tabajara, which uses PyMuPDF. WIP: gist.github.com/villares/0402a)

Make a PDF A3 zine combining 8 pages from a source PDF

WIP: I'd like to have an option for the 9th page to be a poster in the backside.
-------------------------------------------------------------------------------
License: GNU GPL V3
(c) 2023 Alexandre Villares

Based on https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/examples/combine-pages/combine.py
(c) 2018 Jorj X. McKie

Dependencies
------------
PyMuPDF (the fitz engine)

Usage
-----
python combine.py input.pdf

[start of ASCII diagram]

|----|----|----|----|
| 0d | 7d | 6d | 5d |
|----|----|----|----|
| 1u | 2u | 3u | 4u |
|----|----|----|----|

|----|----|----|----|
|                   |
|        9s         |
|                   |
|----|----|----|----|

[end of ASCII diagram]

N = page 0-based 
u = normal orientation
d = upside down (rotated 180)
s = big poster (rotated 90)

Example image of the zine layout on landscape A3:

On the top half of the first sheet are pages 0 (cover), 7, 6, and 5 of the zine, upside down. On the bottom half are pages 1, 2, 3, and 4.

On the second sheet is the poster (page 8, the ninth page, filling the whole sheet, rotated 90 degrees).
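
The first-sheet layout described above can be sketched roughly as follows. This is an illustration, not the code from the gist: `zine_layout` and `build_first_sheet` are hypothetical names, and the second function assumes PyMuPDF's `paper_rect("a3-l")`, `new_page()` and `show_pdf_page()` with its `rotate` argument. The poster sheet is left out.

```python
def zine_layout(sheet_width, sheet_height):
    """Return (source_page, (x0, y0, x1, y1), rotation) for the 8 small pages.

    Top row holds pages 0, 7, 6, 5 upside down; bottom row holds 1, 2, 3, 4.
    Coordinates use a top-left origin, as PyMuPDF rectangles do.
    """
    cell_w, cell_h = sheet_width / 4, sheet_height / 2
    rows = [((0, 7, 6, 5), 180), ((1, 2, 3, 4), 0)]
    layout = []
    for row_index, (pages, rotation) in enumerate(rows):
        for col_index, page_no in enumerate(pages):
            x0, y0 = col_index * cell_w, row_index * cell_h
            layout.append((page_no, (x0, y0, x0 + cell_w, y0 + cell_h), rotation))
    return layout


def build_first_sheet(src_path, out_path):
    """Place the 8 pages on one landscape A3 sheet (requires PyMuPDF)."""
    import pymupdf  # lazy import; older releases use `import fitz`

    src = pymupdf.open(src_path)
    out = pymupdf.open()  # new, empty PDF
    sheet = pymupdf.paper_rect("a3-l")
    page = out.new_page(width=sheet.width, height=sheet.height)
    for page_no, box, rotation in zine_layout(sheet.width, sheet.height):
        page.show_pdf_page(pymupdf.Rect(*box), src, page_no, rotate=rotation)
    out.save(out_path)
```

Keeping the geometry in a pure function makes the page order easy to check without opening any PDF.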
Alexandre B A Villares 🐍 villares@ciberlandia.pt
2023-08-01

I used to be able to point my #ThonnyIDE to a #conda env, but on this other computer I can't seem to make it work anymore :((

Update 1: Well, it runs, but some libraries seem to break :((

(maybe the lib is not well behaved, but I don't have the energy to chase this right now)

Update 2: #PyMuPDF I'm looking at you!

(runs fine from the command line or from Thonny's other env, go figure)
