Removing "/Subtype /Watermark" images from a PDF using Linux
https://shkspr.mobi/blog/2026/01/removing-subtype-watermark-images-from-a-pdf-using-linux/
Problem: I've received a PDF which has a large "watermark" obscuring every page.
Investigating: Opening the PDF in LibreOffice Draw allowed me to see that the watermark was a separate image floating above the others.
Manual Solution: Hit page down, select image, delete, repeat 500 times. BORING!
Further Investigating: Using pdftk, it's possible to decompress a PDF. That makes it easier to look through manually.
pdftk input.pdf output output.pdf uncompress
Hey presto! A PDF you can open in a text editor! Deep joy!
Searching: On a hunch, I searched for "watermark" and found several lines like this:
<<
/Length 548
>>
stream
/Figure <</MCID 0 >>BDC q 0 0 477 733.464 re W n q /GS0 gs 479.2799893 0 0 735.5999836 -1.0800002 -1.0559941 cm /Im0 Do Q EMC
/Figure <</MCID 1 >>BDC Q q 28.333 300.661 420.334 126.141 re W n q /GS0 gs 420.3339603 0 0 126.1418879 28.3330078 300.6610601 cm /Im1 Do Q EMC
/Figure <</MCID 2 >>BDC Q q 16.106 0 444.787 215.464 re W n q /GS0 gs 444.7874274 0 0 216.5921386 16.1062775 -1.1281493 cm /Im2 Do Q EMC
/Artifact <</Subtype /Watermark /Type /Pagination >>BDC Q q 0.7361145 0 0 0.7361145 113.3616638 240.8575745 cm /GS1 gs /Fm0 Do Q EMC
endstream
endobj
Those are Marked Content Blocks: everything between a BDC operator and its matching EMC. In theory you can just chop out the line with /Subtype /Watermark, but each stream has a /Length entry recording its size in bytes - so you'd also need to adjust that to account for what you've changed - otherwise the layout goes all screwy.
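To see why the length matters, here is a minimal sketch using a cut-down copy of the stream above. The `stream` bytes and variable names are illustrative rather than taken from a real file; the point is only that deleting the watermark block changes the byte count, so the number stored in the object's dictionary no longer matches.

```python
import re

# Illustrative content stream: one figure plus one watermark artifact,
# shortened from the example above.
stream = (
    b"/Figure <</MCID 0 >>BDC q /GS0 gs /Im0 Do Q EMC\n"
    b"/Artifact <</Subtype /Watermark /Type /Pagination >>BDC "
    b"Q q 0.7361145 0 0 0.7361145 113.3616638 240.8575745 cm /GS1 gs /Fm0 Do Q EMC\n"
)

# Match one /Artifact ... BDC ... EMC block whose property list
# contains /Subtype /Watermark.
pattern = re.compile(
    rb"/Artifact\s*<<[^>]*?/Subtype\s*/Watermark[^>]*?>>\s*BDC.*?EMC",
    re.DOTALL,
)

cleaned, removed = pattern.subn(b"", stream)

print(f"Blocks removed: {removed}")
print(f"Old length: {len(stream)}, new length: {len(cleaned)}")
```

If you edit the file by hand, everything after the cut also shifts position in the file, so the length is unlikely to be the only number that needs fixing. A library which rewrites the stream for you should take care of that bookkeeping.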
That led me to PyMuPDF which claimed to solve the problem. But running that code only removed some of the watermarks. It got stuck in an infinite loop on certain pages.
So, now that I had more detailed knowledge, I managed to get an LLM to construct something which mostly seems to work.
Does it work with every PDF? I don't know. Does it contain subtle implementation bugs? Probably. Is there an easier way to do this? Not that I can find.
import re
import pymupdf

# Open the PDF
doc = pymupdf.open("output.pdf")

# Regex of the watermarks
pattern = re.compile(
    rb"/Artifact\s*<<[^>]*?/Subtype\s*/Watermark[^>]*?>>BDC.*?EMC",
    re.DOTALL
)

# Loop through the PDF's pages
for page_num, page in enumerate(doc, start=1):
    print(f"Processing page {page_num}")
    # Each page can have several content streams
    xrefs = page.get_contents()
    for xref in xrefs:
        cont = doc.xref_stream(xref)
        new_cont, n = pattern.subn(b"", cont)
        if n > 0:
            print(f" Removed {n} watermark block(s)")
            doc.update_stream(xref, new_cont)

doc.save("no-watermarks.pdf")
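One limitation of that regex: `.*?EMC` stops at the first `EMC` it meets, so if a watermark block has another marked-content block nested inside it, the match ends too early and a stray `EMC` is left behind. Whether that happens will depend on the PDF. A rough sketch of a depth-counting alternative is below. The name `strip_watermarks` is made up for illustration, and it assumes that no string or inline image in the stream happens to contain `BDC`, `BMC` or `EMC` as a standalone token.

```python
import re

# Start of a watermark artifact: /Artifact << ... /Subtype /Watermark ... >> BDC
START = re.compile(rb"/Artifact\s*<<[^>]*?/Subtype\s*/Watermark[^>]*?>>\s*BDC")
# Marked-content operators as standalone tokens: BDC and BMC open, EMC closes.
OPERATOR = re.compile(rb"(?<![\w/])(BDC|BMC|EMC)(?!\w)")


def strip_watermarks(stream: bytes) -> tuple[bytes, int]:
    """Remove watermark artifacts, following nested BDC/BMC ... EMC pairs."""
    out = []
    pos = 0
    removed = 0
    while True:
        start = START.search(stream, pos)
        if start is None:
            break
        # The opening BDC has been consumed, so we begin one level deep.
        depth = 1
        end = None
        for op in OPERATOR.finditer(stream, start.end()):
            depth += -1 if op.group(1) == b"EMC" else 1
            if depth == 0:
                end = op.end()
                break
        if end is None:
            # Unbalanced block: leave the rest of the stream untouched.
            break
        out.append(stream[pos:start.start()])
        pos = end
        removed += 1
    out.append(stream[pos:])
    return b"".join(out), removed


# Illustrative stream with a marked-content block nested in the watermark.
sample = (
    b"/Figure <</MCID 0 >>BDC q /Im0 Do Q EMC\n"
    b"/Artifact <</Subtype /Watermark /Type /Pagination >>BDC\n"
    b"q /Span <</MCID 1 >>BDC /Fm0 Do EMC Q\n"
    b"EMC\n"
    b"/Figure <</MCID 2 >>BDC q /Im1 Do Q EMC\n"
)
cleaned, count = strip_watermarks(sample)
print(f"Removed {count} watermark block(s)")
print(cleaned.decode("ascii"))
```

In the loop above, this would take the place of the `pattern.subn(b"", cont)` call, i.e. `new_cont, n = strip_watermarks(cont)`. It is still pattern matching rather than a real content-stream parser, so treat it with the same suspicion as the original.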
One of the (many) problems with Vibe Coding is that trying to get an LLM to spit out something useful depends massively on how well you know the subject area. I'm proud to say I know vanishingly little about the baroque PDF specification - which meant that most of my attempts to use various "AI" tools consisted of me saying "No, that doesn't work" and the accurs'd machine saying back "Golly-gee! You're right! Let me fix that!" and then breaking something else.
I'm not sure this is the future we wanted, but it looks like the future we've got.
#LLM #pdf #python