#wtfpdf

2025-10-29

I’m adding a couple photos on file structure. #icabarcelona2025 #wtfpdf #digipres #odfa

Slide on the file structure of OFD/A. It should be machine readable if downloaded.
Slide on the double-layered file structure of OFD/A. It should be machine readable if downloaded.
2025-10-29

Has anyone worked with OFD/A? Do you have it in your corpus? I found it interesting that one of the reasons for developing OFD/A was issues around Chinese characters in PDF (example: difficulties in copying from PDF). #digipres @wtfpdf #icabarcelona2025 #wtfpdf

2025-10-21

#archivtagAT #archivtag2025 Andreas Rauber shows an example of a PDF that also contains HTML or can be stored as a virtual machine, which then offers extended functionality. What happens when such a PDF is normalized or migrated? #wtfPDF

2025-06-11

pdfalyze --help
(github.com/michelcrypt4d4mus/p)

outputs:

"Explore PDF's inner data structure with absurdly large and in depth visualizations. Track the control flow of her darker impulses, scan rivers of her binary data for signs of evil sorcery, and generally peer deep into the dark heart of the Portable Document Format. Just make sure you also forgive her - she knows not what she does."

#wtfpdf

2025-06-11

On the effects of the useful

`mutool clean`

command to "repair" PDFs.

If you take this PDF openreview.net/pdf?id=CSJYz1Zo and apply the command to it... the 27-page annex is cut off from the "cleaned" output (still in the PDF, but unreferenced, so not displayed).

So use it with care!

#wtfpdf
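
A minimal sketch of a sanity check for this, assuming the loss shows up as a lower page count in the cleaned file. The helper names are made up for illustration, and the page counter assumes PyMuPDF is installed; only `mutool clean <input> <output>` comes from the post itself.

```python
import subprocess
from typing import Callable, Tuple


def pymupdf_page_count(path: str) -> int:
    """Count pages with PyMuPDF (assumed to be installed)."""
    try:
        import pymupdf  # newer PyMuPDF releases
    except ImportError:
        import fitz as pymupdf  # older releases use this module name
    doc = pymupdf.open(path)
    try:
        return doc.page_count
    finally:
        doc.close()


def clean_and_check(
    src: str,
    dst: str,
    count_pages: Callable[[str], int] = pymupdf_page_count,
    run: Callable = subprocess.run,
) -> Tuple[int, int, int]:
    """Run `mutool clean src dst`, then compare page counts.

    Returns (pages_before, pages_after, pages_lost).
    """
    run(["mutool", "clean", src, dst], check=True)
    before = count_pages(src)
    after = count_pages(dst)
    return before, after, before - after
```

Something like `clean_and_check("paper.pdf", "paper-clean.pdf")` (hypothetical file names) would then flag a problem whenever the third value is greater than zero.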

2025-05-21

Well, I've finally made a blog post, but on the #OPF website!

openpreservation.org/blogs/val

I'm walking you through the more complex of the two PDF repair processes I've made. Any input is welcome!

#digipres #wtfpdf

Johan van der Knijff bitsgalore@digipres.club
2025-04-07

Today's #wtfPDF moment: #PDFs with images that are encoded as #JPEG, where the JPEG data stream in turn is ascii85 encoded.

WHY?!? (The ascii85 encoding only inflates the JPEG data streams by 25% and doesn't offer any benefits).

github.com/KBNLresearch/pdfqua
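
The 25% figure follows from ASCII85 turning every 4 input bytes into 5 output characters. A quick sketch with Python's standard library, where arbitrary bytes stand in for a JPEG stream (the function name is made up for illustration):

```python
import base64


def ascii85_overhead(payload: bytes) -> float:
    """Relative size increase caused by ASCII85-encoding the payload."""
    encoded = base64.a85encode(payload)
    return len(encoded) / len(payload) - 1.0


# Stand-in data: 4096 bytes with no all-zero 4-byte groups, because
# ASCII85 shortens such groups to a single "z" and would skew the ratio.
sample = bytes(range(256)) * 16
print(f"{ascii85_overhead(sample):.0%}")  # prints 25%
```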

2025-02-01
Johan van der Knijff bitsgalore@digipres.club
2025-01-24

@edsu @quinnanya @bitarchivist @Literature_Geek This is a GREAT resource and I'd love to give this a boost, but WHY OH WHY is this published as a PDF where each page is a #!@@** JPEG image, which makes it 100% inaccessible (also, good luck copying and typing in all those hyperlinks) #wtfPDF

Johan van der Knijff bitsgalore@digipres.club
2024-11-15

ICYMI - are "octal escape sequences" in #PDF strings really a preservation risk, as claimed by the authors of the recent "The Phantom 👻 of a PDF File" blog post?

Some quick tests I did with eight different PDF processing tools suggest they're not, and #JHOVE's inability to handle them really seems to be the exception here #wtfPDF #fileformatfriday

bitsgalore.org/2024/11/14/esca
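
For anyone unfamiliar with the term: in a PDF literal string, a backslash followed by one to three octal digits stands for a single byte. A rough sketch of a decoder, not taken from any of the tested tools; it expects the bytes between the outer parentheses, skips end-of-line normalisation, and the function name is invented here:

```python
import re

# Backslash followed by 1-3 octal digits, a line break, or any single byte.
_ESCAPE = re.compile(rb"\\([0-7]{1,3}|\r\n|[\r\n]|.)", re.DOTALL)

_SIMPLE = {
    b"n": b"\n", b"r": b"\r", b"t": b"\t", b"b": b"\b", b"f": b"\f",
    b"(": b"(", b")": b")", b"\\": b"\\",
}


def decode_pdf_literal(body: bytes) -> bytes:
    """Resolve escape sequences inside a PDF literal string body."""

    def replace(match):
        esc = match.group(1)
        if esc[:1] in b"01234567":
            # Octal escape: keep only the low 8 bits on overflow.
            return bytes([int(esc, 8) & 0xFF])
        if esc in (b"\r\n", b"\r", b"\n"):
            return b""  # backslash + line break is a line continuation
        # Known single-character escape, else just drop the backslash.
        return _SIMPLE.get(esc, esc)

    return _ESCAPE.sub(replace, body)


print(decode_pdf_literal(rb"\101\102C"))  # prints b'ABC'
```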

Johan van der Knijff bitsgalore@digipres.club
2024-11-14

The authors of the recent "The Phantom of a PDF File" blog post argue that "octal escape sequences" in #PDF strings are a potential preservation risk.

But some quick tests with 8 different PDF tools suggest that #JHOVE is really the only tool that can't handle them!

Details in my new blog post "Escape from the phantom of the PDF" #wtfPDF 👻 :

bitsgalore.org/2024/11/14/esca

Johan van der Knijff bitsgalore@digipres.club
2024-11-14

Update on the "Phantom of the #PDF" blog of a few weeks ago (link: digitalpreservation.fi/en/2024).

I did a little test of the authors' claim that "#JHOVE probably is not the only software that will get confused" by octal escape sequences* in metadata strings.

So I read the file with 8 different PDF tools/libraries:

github.com/openpreserve/jhove/

Turns out JHOVE actually *is* the only software that gets confused by this #wtfPDF!

*) The authors describe this as "dual encodings", but see Peter Wyatt's comment!

Johan van der Knijff bitsgalore@digipres.club
2024-11-12

Here's a sneak peek at a #PDF Quality Assessment tool I'm working on for digitisation batches, mostly based on #PyMuPDF, #pillow and #Schematron:

github.com/KBNLresearch/pdfqua

(Wouldn't recommend this for production yet, as it's not completely finished, and I'm still changing some things around.)

#wtfPDF

2024-11-07

OMG, I played this game dpconline.org/blog/wdpd/blog-f and my file format fling was PDF!?! #WTFPDF @wtfpdf #digipres #wdpd2024

2024-09-13

Got my first PDF 2.0 file into my repository. Didn’t know Word was saving to the new format! #digipres #fileformats #wtfPDF
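
A quick way to spot these in a corpus is to read the version from the `%PDF-` header. This is only a sketch: the catalog's `/Version` entry can override the header, and the function name and the 1024-byte search window are choices made here for illustration.

```python
import re
from typing import Optional

_HEADER = re.compile(rb"%PDF-(\d\.\d)")


def header_version(data: bytes) -> Optional[str]:
    """Return the version from the %PDF- header, or None if absent."""
    match = _HEADER.search(data[:1024])  # header sits near the file start
    return match.group(1).decode("ascii") if match else None


print(header_version(b"%PDF-2.0\n"))  # prints 2.0
```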

at the cost of giving away my anonymous response to this form ... here's a lesson on "read all options before answering" .... #wtfPDF

Form with two questions. Question 1: "Finally, what's your favorite file format?" Answer: "plain PDF (but it's a love-hate relationship)". Question 2: "... and why is not PDF?!" The answer options are a scale from 1 to 5 with a :rofl: emoji at 1 and a :cry: emoji at 5. 5 is selected.
Johan van der Knijff bitsgalore@digipres.club
2024-07-09

Next-level #wtfPDF shenanigans in this piece on exploiting #PDF browser rendering discrepancies:

"In this article, we will show you how to create a hybrid PDF that abuses widget annotations to create render discrepancies, and share the code so you can generate your own."

portswigger.net/research/fickl

(via @decalage on the former birdsite)
