I’m adding a couple photos on file structure. #icabarcelona2025 #wtfpdf #digipres #odfa
Has anyone worked with ODF/A? Have it in your corpus? I found it interesting that one of the reasons for developing ODF/A was issues around Chinese characters in PDF (example: difficulties in copying from PDF). #digipres @wtfpdf #icabarcelona2025 #wtfpdf
#archivtagAT #archivtag2025 Andreas Rauber shows an example of a PDF that also contains HTML, or that can be stored as a virtual machine, which then has extended functionality. What happens when a PDF like this is normalized or migrated? #wtfPDF
pdfalyze --help
(https://github.com/michelcrypt4d4mus/pdfalyzer)
outputs:
"Explore PDF's inner data structure with absurdly large and in depth visualizations. Track the control flow of her darker impulses, scan rivers of her binary data for signs of evil sorcery, and generally peer deep into the dark heart of the Portable Document Format. Just make sure you also forgive her - she knows not what she does."
On the effects of the useful
`mutool clean`
command to "repair" PDFs.
If you take this PDF https://openreview.net/pdf?id=CSJYz1Zovj and apply the command to it... the 27-page annex is cut off from the "cleaned" output (still in the PDF, but unreferenced, so not displayed).
So use it with care!
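A rough sanity check for this kind of loss, as a sketch only: run `mutool clean`, then compare page counts before and after. The helper names (`clean_pdf`, `pages_lost`) are made up for illustration, and the default page counter assumes PyMuPDF is installed and importable as `pymupdf`. The counter is pluggable because PyMuPDF uses the same MuPDF engine, so a different tool may be the better judge of the original file.

```python
import subprocess
from typing import Callable, Optional


def clean_pdf(src: str, dst: str) -> None:
    """Run `mutool clean src dst`; raises CalledProcessError on failure."""
    subprocess.run(["mutool", "clean", src, dst], check=True)


def _count_pages_pymupdf(path: str) -> int:
    # Assumption: PyMuPDF is installed; imported lazily so the rest of
    # the sketch works without it when a custom counter is supplied.
    import pymupdf

    doc = pymupdf.open(path)
    try:
        return doc.page_count
    finally:
        doc.close()


def pages_lost(
    original: str,
    cleaned: str,
    count_pages: Optional[Callable[[str], int]] = None,
) -> int:
    """Return how many pages the cleaned file has lost (0 means none)."""
    counter = count_pages or _count_pages_pymupdf
    return counter(original) - counter(cleaned)
```

Typical use would be `clean_pdf("in.pdf", "out.pdf")` followed by `pages_lost("in.pdf", "out.pdf")`, treating any non-zero result as a reason to inspect the output by hand.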
Well, I've finally made a blog post, but on the #OPF website!
I'm walking you through the most complex (out of 2) PDF repair processes I've made. Any input is welcome!
The smallest (valid) #PDF in the world is here:
https://pdfa.org/download-area/smallest-possible-pdf/smallest-possible-pdf-1.0.pdf
@edsu @quinnanya @bitarchivist @Literature_Geek This is a GREAT resource and I'd love to give this a boost, but WHY OH WHY is this published as a PDF where each page is a #!@@** JPEG image, which makes it 100% inaccessible (also, good luck copying and typing in all those hyperlinks) #wtfPDF
ICYMI - are "octal escape sequences" in #PDF strings really a preservation risk, as claimed by the authors of the recent "The Phantom 👻 of a PDF File" blog post?
Some quick tests I did with eight different PDF processing tools suggest they're not, and #JHOVE's inability to handle them really seems to be the exception here #wtfPDF #fileformatfriday
https://www.bitsgalore.org/2024/11/14/escape-from-the-phantom-of-the-pdf
The authors of the recent "The Phantom of a PDF File" blog post argue that "octal escape sequences" in #PDF strings are a potential preservation risk.
But some quick tests with 8 different PDF tools suggest that #JHOVE is really the only tool that can't handle them!
Details in my new blog post "Escape from the phantom of the PDF" #wtfPDF 👻 :
https://www.bitsgalore.org/2024/11/14/escape-from-the-phantom-of-the-pdf
Update on the "Phantom of the #PDF" blog of a few weeks ago (link: https://digitalpreservation.fi/en/2024-phantom-pdf-file).
I did a little test of the authors' claim that "#JHOVE probably is not the only software that will get confused" by octal escape sequences* in metadata strings.
So I read the file with 8 different PDF tools/libraries:
https://github.com/openpreserve/jhove/issues/927#issuecomment-2465947326
Turns out JHOVE actually *is* the only software that gets confused by this #wtfPDF!
*) The authors describe this as "dual encodings", but see Peter Wyatt's comment!
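For anyone wondering what these octal escapes look like in practice: inside a PDF literal string, a backslash followed by one to three octal digits stands for a single byte. Below is a minimal sketch of a decoder for the string body (without the outer parentheses). It is not the code of JHOVE or of any of the tested tools; the function name is made up for illustration, and normalisation of unescaped line endings is left out.

```python
# Single-character escapes defined for PDF literal strings.
_SIMPLE_ESCAPES = {
    ord("n"): 0x0A,
    ord("r"): 0x0D,
    ord("t"): 0x09,
    ord("b"): 0x08,
    ord("f"): 0x0C,
    ord("("): ord("("),
    ord(")"): ord(")"),
    ord("\\"): ord("\\"),
}


def decode_literal_string(raw: bytes) -> bytes:
    """Resolve backslash escapes in the body of a PDF literal string."""
    out = bytearray()
    i = 0
    while i < len(raw):
        byte = raw[i]
        if byte != 0x5C:  # anything but a backslash is copied as-is
            out.append(byte)
            i += 1
            continue
        i += 1  # skip the backslash
        if i >= len(raw):
            break  # lone trailing backslash: ignore it
        nxt = raw[i]
        if 0x30 <= nxt <= 0x37:
            # Octal escape: take up to three octal digits.
            j = i
            while j < len(raw) and j < i + 3 and 0x30 <= raw[j] <= 0x37:
                j += 1
            # High-order overflow (values above 0o377) is discarded.
            out.append(int(raw[i:j].decode("ascii"), 8) & 0xFF)
            i = j
        elif nxt in _SIMPLE_ESCAPES:
            out.append(_SIMPLE_ESCAPES[nxt])
            i += 1
        elif nxt in (0x0A, 0x0D):
            # Backslash + end-of-line is a line continuation: emit nothing.
            i += 1
            if nxt == 0x0D and i < len(raw) and raw[i] == 0x0A:
                i += 1
        else:
            out.append(nxt)  # unknown escape: drop the backslash only
            i += 1
    return bytes(out)
```

For example, `decode_literal_string(rb"\376\377\000A")` yields the bytes `FE FF 00 41`, i.e. a UTF-16BE byte order mark followed by the letter "A", which is how a metadata string can end up written entirely in octal escapes.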
Here's a sneak peek at a #PDF Quality Assessment tool I'm working on for digitisation batches, mostly based on #PyMuPDF, #pillow and #Schematron:
https://github.com/KBNLresearch/pdfquad
(Wouldn't recommend this for production yet, as it's not completely finished, and I'm still changing some things around.)
OMG, I played this game https://www.dpconline.org/blog/wdpd/blog-fff-game-wdpd2024 and my file format fling was PDF!?! #WTFPDF @wtfpdf #digipres #wdpd2024
Got my first PDF 2.0 file into my repository. Didn’t know Word was saving to the new format! #digipres #fileformats #wtfPDF
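A quick way to spot such files in a repository is to look at the `%PDF-x.y` header comment near the start of the file. The sketch below does only that; the function name is made up for illustration. Note that a `/Version` entry in the document catalog can override the header version, so a result of "2.0" here is a hint rather than a guarantee.

```python
import re
from typing import Optional


def header_version(path: str) -> Optional[str]:
    """Return the version from the %PDF-x.y header, or None if absent."""
    with open(path, "rb") as fh:
        head = fh.read(1024)  # the header is expected near the file start
    match = re.search(rb"%PDF-(\d\.\d)", head)
    return match.group(1).decode("ascii") if match else None
```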
at the cost of giving away my anonymous response to this form ... here's a lesson on "read all options before answering" .... #wtfPDF
Next-level #wtfPDF shenanigans in this piece on exploiting #PDF browser rendering discrepancies:
"In this article, we will show you how to create a hybrid PDF that abuses widget annotations to create render discrepancies, and share the code so you can generate your own."
https://portswigger.net/research/fickle-pdfs-exploiting-browser-rendering-discrepancies
(via @decalage on the former birdsite)