
After struggling to get #python #PyMuPDF to work and being close to the deadline, I shifted to using a combination of other commands.

First I used the #linux #pdftohtml command, which is so much faster than PyMuPDF and packages the result much like a saved website.

Next, with #NeoVim and #RegEx, I formatted the #HTML file so that it can be quickly processed with #NodeJs #cheerio and eventually saved, via #json, in #sqlite.

Is it elegant and automatic? No, though it works!

#JavaScript
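
A rough sketch of the HTML-to-JSON step described above. The post used NodeJs and cheerio; Python's standard `html.parser` is swapped in here purely for illustration. It assumes the cleaned-up HTML keeps each text run in a `<p>` element, which the post does not state, and `ParagraphCollector` and `html_to_json` are hypothetical names.

```python
import json
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of every non-empty <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while we are outside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested tags (e.g. <b>) still lands in the open paragraph.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:
                self.paragraphs.append(text)
            self._buffer = None


def html_to_json(html):
    """Return the paragraph texts of `html` as a JSON array string."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return json.dumps(parser.paragraphs, ensure_ascii=False)
```

The resulting JSON string could then be written to a file or a database column, whichever suits the later steps.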

Further, while trying to extract and format data from PDFs using #python #PyMuPDF:

I was trying to create a perfect chain of functions that would format all the edge cases into the final desired #HTML format. This is where I quickly realized that running every tweaked version of the functions on the 100-page PDF is quite time-consuming.

Instead, I can run it once and save the results in a #sqlite database. Then I can write #sql queries to post-process the edge cases, with a good enough way to inspect the contents of each page compared with the previous method of printing the output to the #terminal and scrolling to the desired page. And in the end, I am one step closer to having the data in a #csv file, which is easily exported with #Dbeaver.
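
A minimal sketch of the "extract once, then fix things in SQL" idea. This is not the author's code. The `pages` table, the helper names, and the ligature fix used as an example edge case are all assumptions made for illustration. `extract_pages` needs PyMuPDF installed (recent releases import as `pymupdf`, older ones as `fitz`).

```python
import sqlite3


def extract_pages(pdf_path):
    """Yield (page_number, text) for each page; this is the slow step, run once."""
    import pymupdf  # imported lazily so the SQL helpers work without it

    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            yield page.number, page.get_text("text")


def store_pages(conn, pages):
    """Save (page_number, text) pairs in a hypothetical `pages` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages "
        "(page_number INTEGER PRIMARY KEY, content TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO pages (page_number, content) VALUES (?, ?)",
        pages,
    )
    conn.commit()


def fix_ligatures(conn):
    """Example post-processing query: replace the 'fi' ligature everywhere."""
    conn.execute("UPDATE pages SET content = replace(content, 'ﬁ', 'fi')")
    conn.commit()


def page_content(conn, page_number):
    """Look at a single page without scrolling through terminal output."""
    row = conn.execute(
        "SELECT content FROM pages WHERE page_number = ?", (page_number,)
    ).fetchone()
    return row[0] if row else None
```

Typical use would be `store_pages(conn, extract_pages("input.pdf"))` once, followed by as many SQL tweaks as needed.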

Currently trying to extract and format data from PDFs using #python #PyMuPDF.

Initially used the `get_text(value)` method with the `"text"` value, only to learn that I could have potentially saved time directly using the `"html"` value, since I have been creating pattern matchers to format the text into #HTML.

After investigation, although the html option exists, the post processing is more strenuous than the initial approach.

My fascination with the `get_text(value)` method is that each value packages the data differently. Whereas `"html"` puts the text in `<p><span>text</span></p>`, `"xhtml"` puts it in `<h1>text</h1>` instead.
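
One way to compare the flavours side by side is a small helper like the one below. It is only a sketch: `sample_outputs` is a hypothetical name, and it works on any object that offers a `get_text(value)` method, such as a PyMuPDF page.

```python
def sample_outputs(page, formats=("text", "html", "xhtml"), limit=200):
    """Return the first `limit` characters of each get_text() flavour."""
    return {fmt: page.get_text(fmt)[:limit] for fmt in formats}


# Possible usage with PyMuPDF installed (the file name is a placeholder):
#
#   import pymupdf  # older releases: import fitz
#   doc = pymupdf.open("input.pdf")
#   for fmt, snippet in sample_outputs(doc[0]).items():
#       print(f"--- {fmt} ---")
#       print(snippet)
```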

I just updated my 2023 post on extracting text from #EPUB files in #Python, and added an evaluation of #PyMuPDF (which also supports EPUB!). Includes link to demo script.

https://www.bitsgalore.org/2023/03/09/extracting-text-from-epub-files-in-python

Ever felt the need to convert a #PDF into a fixed-layout #EPUB that preserves the table of contents, internal cross-references and hyperlinks? Finding no out-of-the-box solution, I've developed one myself using #Python and the #PyMuPDF library. Here it is, open source, and ready for use:

https://github.com/aourednik/pdf2epub3fixed

My script is particularly suitable for the conversion of complex layout PDFs generated with variants of #TeXLaTeX.
Enjoy!

Today I managed to cobble together a #Python script to remove your name from #PDF annotations using #PyMuPDF and #FreeSimpleGUI, then I tried #pyinstaller and I have something that seems to run on Linux... so many steps!!!

It never ceases to amaze me how hard it is to provide software for other people to run!

If you think it could be useful to you or someone, I AGPL licensed it here:

https://github.com/villares/anonymize-pdf-annotations

UPDATE: @Introscopia built a Windows .exe version for me, also using pyinstaller, yay!

screenshot of the tool panel with fields:
"Input file", "Output file" and "Change names in notes to"
A checkbox to remove the name from main metadata, a "Create modified PDF" button and a "CLOSE/EXIT" button.
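
For readers curious about the core idea, here is a hedged sketch. It is not the code from the linked repository. It assumes the name is stored in each annotation's `title` field (which is where PyMuPDF exposes the annotation author) and in the document's `author` metadata entry. `anonymize_annotations` is a hypothetical helper name.

```python
def anonymize_annotations(doc, new_name="", clear_author=False):
    """Overwrite the author name on every annotation; return how many changed."""
    changed = 0
    for page in doc:
        for annot in page.annots():
            if annot.info.get("title") != new_name:
                annot.set_info(title=new_name)
                annot.update()  # apply the change to the annotation
                changed += 1
    if clear_author:
        # Also blank the author entry in the main document metadata.
        metadata = dict(doc.metadata or {})
        metadata["author"] = ""
        doc.set_metadata(metadata)
    return changed


# Possible usage with PyMuPDF installed (file names are placeholders):
#
#   import pymupdf  # older releases: import fitz
#   doc = pymupdf.open("input.pdf")
#   anonymize_annotations(doc, "Reviewer", clear_author=True)
#   doc.save("output.pdf")
```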

New blog post for #fileformatfriday - #PDF Quality assessment for #digitisation batches with #Python, #PyMuPDF and #Pillow. This introduces the new #Pdfquad tool, which might be useful for others as well:

https://www.bitsgalore.org/2024/12/13/pdf-quality-assessment-for-digitisation-batches-with-python-pymupdf-and-pillow

Here's a sneak peek at a #PDF Quality Assessment tool I'm working on for digitisation batches, mostly based on #PyMuPDF, #pillow and #Schematron:

https://github.com/KBNLresearch/pdfquad

(Wouldn't recommend this for production yet, as it's not completely finished, and I'm still changing some things around.)

#wtfPDF

PyMuPDF has evolved! A super-powerful library for PDF data extraction is born: "PyMuPDF4LLM"
https://qiita.com/ryosuke_ohori/items/a21637537dfdd6d209a9?utm_campaign=popular_items&utm_medium=feed&utm_source=popular_items

#qiita #Python #PDF #AI #PyMuPDF #LLM

Discover how to extract images from PDFs using PyMuPDF in Python. This comprehensive guide covers installation, code explanation, best practices, and tips for efficient PDF image extraction. #PDFimageextraction #Python #PyMuPDF

https://teguhteja.id/pdf-image-extraction-with-pymupdf-tutorial-guide/
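
As a rough idea of what such an extraction can look like (a sketch, not taken from the linked guide): PyMuPDF lists the images on a page with `page.get_images()` and returns the raw bytes with `doc.extract_image(xref)`. The helper name `save_page_images` and the output file naming are assumptions.

```python
from pathlib import Path


def save_page_images(doc, out_dir):
    """Write every embedded image once and return the saved paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved, seen = [], set()
    for page in doc:
        for item in page.get_images(full=True):
            xref = item[0]  # first entry is the image's cross-reference number
            if xref in seen:  # the same image can be reused on several pages
                continue
            seen.add(xref)
            image = doc.extract_image(xref)
            target = out_dir / f"image_{xref}.{image['ext']}"
            target.write_bytes(image["image"])
            saved.append(target)
    return saved


# Possible usage with PyMuPDF installed (paths are placeholders):
#
#   import pymupdf  # older releases: import fitz
#   save_page_images(pymupdf.open("input.pdf"), "extracted_images")
```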

[PyMuPDF] Extracting everything except the tables from a PDF
https://qiita.com/hats0902/items/08fb9a594f98cb4e5b0e?utm_campaign=popular_items&utm_medium=feed&utm_source=popular_items

#qiita #Python #PDF #PyMuPDF

#python #linguistics #NLP #pymupdf

Let's say I have raw text that I got from a PDF, where the authors of said PDF are too boomer to release it as structured text.

But there are Keywords and chapters.

Do you have good advice or a good resource for how to get that structure back from the content?

(I'm going into it with the agenda to prove that they are badly written, so if I can't identify what a paragraph is about that's "good")

Convert a document from or to PDF using Python? If #PyPandoc doesn't have you covered try #PyMuPDF. Fantastic features for data extraction and conversion. https://pymupdf.readthedocs.io/ #python #pdf #ocr #document #extraction #conversion #pandoc #mupdf

Preprocessing for RAG/LLM: converting PDF to Markdown using PyMuPDF4LLM
https://qiita.com/cyberBOSE/items/c276d273bfc20881adfc?utm_campaign=popular_items&utm_medium=feed&utm_source=popular_items

#qiita #Python #rag #PyMuPDF #LLM #PyMuPDF4LLM

Which parser is best for preprocessing PDFs before analysing them with an LLM? (pdfminer, PyMuPDF, pypdf, Unstructured)
https://qiita.com/cyberBOSE/items/142cdf91e0ee20b3114f?utm_campaign=popular_items&utm_medium=feed&utm_source=popular_items

#qiita #Python #pdfminer #PyMuPDF #pyPDF #Unstructured

Our latest episode covered: #Dangerzone 0.6.0 simplifies its codebase with #PyMuPDF and changes its license

https://theradio.cc/blog/2024/04/14/ll280-maintainers-chained-by-supply/

#PDF #security #itsecurity

Getting data from tables in a PDF (PyMuPDF)
https://qiita.com/alice37308108/items/c9859a66981956e1dad1?utm_campaign=popular_items&utm_medium=feed&utm_source=popular_items
#qiita #Python #PDF #PyMuPDF

I love and hate what my job makes me do.

I do just love #python

Learned how to extract geometry and text info from pdf drawings with #pymupdf
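
A small sketch of what that can look like, not the poster's actual code: PyMuPDF exposes vector paths through `page.get_drawings()` and positioned words through `page.get_text("words")`. The helper name `drawing_summary` and the shape of its return value are assumptions.

```python
def drawing_summary(page):
    """Return (paths, words): path bounding boxes plus words with positions."""
    paths = [
        {
            "type": drawing["type"],  # "f" fill, "s" stroke, "fs" both
            "rect": tuple(drawing["rect"]),  # bounding box of the path
            "n_items": len(drawing["items"]),  # lines, curves, rects, ...
        }
        for drawing in page.get_drawings()
    ]
    # Each word tuple starts with x0, y0, x1, y1 followed by the word itself.
    words = [(word[4], tuple(word[:4])) for word in page.get_text("words")]
    return paths, words
```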

"... sufficiently advanced technology" indeed.

I'm really very impressed with #PyMuPDF "import fitz" https://pymupdf.readthedocs.io/en/latest/

It looks like the most powerful library currently available for manipulating #PDF with #Python

(Soon I'll post my #fanzine impositionator-tabajara using PyMuPDF. WIP: https://gist.github.com/villares/0402a1c9033e6f4baf55554c16d25f4e)

Make a PDF A3 zine combining 8 pages from a source PDF

WIP: I'd like to have an option for the 9th page to be a poster in the backside.
-------------------------------------------------------------------------------
License: GNU GPL V3
(c) 2023 Alexandre Villares

Based on https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/examples/combine-pages/combine.py
(c) 2018 Jorj X. McKie

Dependencies
------------
PyMuPDF (the fitz engine)

Usage
-----
python combine.py input.pdf

[start of ASCII diagram]

|----|----|----|----|
| 0d | 7d | 6d | 5d |
|----|----|----|----|
| 1u | 2u | 3u | 4u |
|----|----|----|----|

|----|----|----|----|
|                   |
|        9s         |
|                   |
|----|----|----|----|

[end of ASCII diagram]

N = page 0-based
u = normal orientation
d = upside down (rotated 180)
s = big poster (rotated 90)

example image of the zine layout on landscape A3:

in the top half of the first sheet, zine pages 0 (cover), 7, 6 and 5, upside down. In the bottom half, pages 1, 2, 3 and 4.

On the second sheet, the poster (page 8, the ninth page, filling the whole sheet, rotated 90 degrees).
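
The first-sheet layout described above can be sketched as follows. This is an illustration under stated assumptions, not the code from the gist: `zine_slots` and `impose` are hypothetical names, the sheet size of 1191 x 842 points approximates landscape A3, and the poster sheet is left out. `impose` needs PyMuPDF installed.

```python
def zine_slots(sheet_width, sheet_height):
    """Return (page_number, (x0, y0, x1, y1), rotation) for the 8 zine pages."""
    cell_w = sheet_width / 4
    cell_h = sheet_height / 2
    top_row = [0, 7, 6, 5]  # upside down (rotated 180)
    bottom_row = [1, 2, 3, 4]  # normal orientation
    slots = []
    for col, pno in enumerate(top_row):
        slots.append((pno, (col * cell_w, 0, (col + 1) * cell_w, cell_h), 180))
    for col, pno in enumerate(bottom_row):
        rect = (col * cell_w, cell_h, (col + 1) * cell_w, sheet_height)
        slots.append((pno, rect, 0))
    return slots


def impose(src_path, out_path, sheet_size=(1191, 842)):
    """Place the first 8 pages of src_path on one sheet and save it."""
    import pymupdf  # older releases: import fitz

    src = pymupdf.open(src_path)
    out = pymupdf.open()  # new, empty PDF
    sheet = out.new_page(width=sheet_size[0], height=sheet_size[1])
    for pno, rect, rotation in zine_slots(*sheet_size):
        if pno < src.page_count:  # skip slots the source cannot fill
            sheet.show_pdf_page(pymupdf.Rect(rect), src, pno, rotate=rotation)
    out.save(out_path)
```

Keeping the slot arithmetic in its own function makes the layout easy to check without opening any PDF.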

I used to be able to point my #ThonnyIDE to a #conda env, but on this other computer I can't seem to make it work anymore :((

Update 1: Well, it runs, but some libraries seem to break :((

(maybe the lib is not well behaved, but I don't have the energy to chase this right now)

Update 2: #PyMuPDF I'm looking at you!

(runs fine from the command line or from Thonny's other env, go figure)

#PyMuPDF
