#ApacheTika

Tim Allison (@tallison)
2025-12-10

On #ApacheTika, we're moving entirely to JSON for configuration in 4.x.

If you use tika-server and are interested in runtime configuration, please take a look and offer feedback:

lists.apache.org/thread/jlt8jv

Please repost for reach.

Offensive Sequence (@offseq@infosec.exchange)
2025-12-06

⚠️ CRITICAL XXE bug (CVE-2025-66516, CVSS 10.0) in Apache Tika (tika-core, tika-pdf-module, tika-parsers). Exploitation via crafted PDFs can lead to file disclosure & RCE. Upgrade to 3.2.2+ ASAP! radar.offseq.com/threat/critic #OffSeq #ApacheTika #XXE #Security

Critical threat: Critical XXE Bug CVE-2025-66516 (CVSS 10.0) Hits Apache Tika, Requires Urgent Patch
Offensive Sequence (@offseq@infosec.exchange)
2025-12-05

🚨 CVE-2025-66516 CRITICAL: XXE in Apache Tika core (v1.13–3.2.1), tika-pdf-module, tika-parsers. Exploitable via crafted PDF XFA files — risks data exfil & DoS. Patch to 3.2.2+ now! radar.offseq.com/threat/cve-20 #OffSeq #ApacheTika #XXE #Vuln
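
A quick way to triage an inventory against the range quoted in this post is a version comparison. This is a minimal sketch, not an official checker: it assumes plain dotted version strings, uses only the tika-core range given above (1.13 through 3.2.1, fixed in 3.2.2), and the function names are hypothetical. Ranges for the other modules are not stated here, so check the advisory for those.

```python
import re

# Range as quoted in the post above for tika-core: 1.13 through 3.2.1.
FIRST_AFFECTED = (1, 13, 0)
FIRST_FIXED = (3, 2, 2)


def parse_version(version: str) -> tuple:
    """Turn '3.2.1' or '1.13' into a 3-tuple of ints, padding with zeros.

    Suffixes such as '-SNAPSHOT' are ignored (a simplifying assumption).
    """
    numbers = [int(n) for n in re.findall(r"\d+", version)[:3]]
    return tuple(numbers + [0] * (3 - len(numbers)))


def is_affected(version: str) -> bool:
    """True if the version falls in the half-open range [1.13, 3.2.2)."""
    return FIRST_AFFECTED <= parse_version(version) < FIRST_FIXED


if __name__ == "__main__":
    for v in ("1.12", "1.13", "2.9.0", "3.2.1", "3.2.2"):
        print(v, "affected" if is_affected(v) else "not affected")
```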

Critical threat: CVE-2025-66516: CWE-611 Improper Restriction of XML External Entity Reference in Apache Software Fou
Tim Allison (@tallison)
2025-11-12

RE: mastodon.social/@tallison/1154

Please join me tomorrow, November 13 at noon EST, to chat #ApacheTika.

Please DM me for the connection info.

Tim Allison (@tallison)
2025-10-30

LOL.. given that I'm going to be a remote presenter, I taped my Digital Preservation Bake-off talk last night in case I have wifi-problems during the session.

I really wish conferences would require 3 or 4 videos of the talk before I'm allowed to speak.

Tim Allison (@tallison)
2025-10-28

In belated celebration of World Digital Preservation Day, I'm throwing a "What's new with Apache Tika/Office hours" meetup at noon on November 13 EST.

This is intended for anyone interested in files, from search to digital preservation to file forensics and reverse engineering.

meetup.com/apache-tika-communi

Tim Allison (@tallison)
2025-10-27

If I hosted an #ApacheTika demo/office hours on Thursday, Nov 6 at noon EST, would that time work?

Tim Allison (@tallison)
2025-10-27

@mutanthumb

Maybe I should throw a demo/office hours for on ?

Tim Allison (@tallison)
2025-10-24

@mutanthumb

Y, #ApacheTika will extract what the PDF alleges it is.

These are some of the fields that I'll focus on in the

These include PDF/A and PDF/X. pdf:hasMarkedContent suggests PDF/UA.

List of some useful keys for PDF files that Tika extracts into the metadata:
pdf:totalUnmappedUnicodeChars
pdf:overallPercentageUnmappedUnicodeChars
pdf:containsNonEmbeddedFont
pdf:containsDamagedFont
pdf:hasAcroFormFields
pdf:hasCollection
pdf:hasMarkedContent
pdf:hasXFA
pdf:hasXMP
pdf:PDFExtensionVersion
pdf:PDFVersion
pdf:producer
pdfa:PDFVersion
pdfaid:conformance
pdfx:conformance
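
As a rough illustration of how these keys might be used downstream, here is a minimal Python sketch that turns a metadata dictionary into a few triage flags. It assumes the metadata has already been obtained as a flat dict of string values (for example, from JSON output); the function name, the flag names, and the sample values are hypothetical and not taken from a real file.

```python
def pdf_flags(metadata: dict) -> dict:
    """Derive simple hints from a few of the metadata keys listed above."""

    def is_true(key: str) -> bool:
        # Assumes boolean values arrive as the strings "true"/"false".
        return str(metadata.get(key, "")).lower() == "true"

    return {
        # The file *claims* conformance; this is not validation.
        "claims_pdfa": "pdfaid:conformance" in metadata
        or "pdfa:PDFVersion" in metadata,
        "claims_pdfx": "pdfx:conformance" in metadata,
        # Marked content suggests, but does not prove, PDF/UA.
        "may_be_pdfua": is_true("pdf:hasMarkedContent"),
        "has_xfa": is_true("pdf:hasXFA"),
        "font_problems": is_true("pdf:containsNonEmbeddedFont")
        or is_true("pdf:containsDamagedFont"),
    }


if __name__ == "__main__":
    # Hypothetical metadata, for illustration only.
    sample = {
        "pdf:PDFVersion": "1.7",
        "pdf:hasXFA": "true",
        "pdf:hasMarkedContent": "false",
        "pdfaid:conformance": "B",
    }
    print(pdf_flags(sample))
```

The flags only report what the file alleges about itself, in line with the note above.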
Tim Allison (@tallison)
2025-10-15

I recently added fully recursive extraction of embedded files to Apache Tika's command line.

This will also extract earlier versions of PDFs available through incremental updates.
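
For readers unfamiliar with incremental updates: each saved revision of a PDF is appended to the file and normally ends with its own %%EOF marker, so earlier versions can often be recovered by truncating at an earlier marker. The Python sketch below shows that idea only. It is a naive heuristic and makes no claim about how Tika implements the feature; the function name is hypothetical, and %%EOF can also appear in places that do not mark a revision boundary.

```python
EOF_MARKER = b"%%EOF"


def earlier_revisions(pdf_bytes: bytes) -> list:
    """Return the byte prefixes ending at each %%EOF marker except the last.

    Each prefix is a candidate earlier revision of the document.
    """
    ends = []
    start = 0
    while True:
        idx = pdf_bytes.find(EOF_MARKER, start)
        if idx == -1:
            break
        end = idx + len(EOF_MARKER)
        ends.append(end)
        start = end
    # The final marker closes the current version, so it is skipped.
    return [pdf_bytes[:end] for end in ends[:-1]]


if __name__ == "__main__":
    # Toy bytes standing in for a PDF with one incremental update.
    fake = b"%PDF-1.7\nrevision one\n%%EOF\nrevision two\n%%EOF\n"
    for i, rev in enumerate(earlier_revisions(fake), start=1):
        print(f"revision {i}: {len(rev)} bytes")
```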

This feature is still in beta. Let us know what you think.

Details in next toot.


Tim Allison (@tallison)
2025-10-13

@Thorsted @mickylindlar

Thank you for sharing! I confirmed that #ApacheTika is correctly throwing an EncryptedDocumentException on that file. 🎉

Tim Allison (@tallison)
2025-10-13
Johannes Rabauer (@rabauer@mastodon.online)
2025-10-07

🧠 Open-source & still evolving:
github.com/JohannesRabauer/qua

What would you add next: smarter summaries, multi-agent explorers, or something wild?
#LangChain4j #Quarkus #Ollama #pgvector #ApacheTika #AI #Java #LLaMA3

Tim Allison (@tallison)
2025-09-26

Took the day "off" and made some progress on this for #ApacheTika.

Files changed: 220 🤣

I think this is the last major change for Tika 4.x.

[Image: screenshot of a GitHub pull request, "TIKA-4334 - move tika-pipes out of core" (#2339). tballison wants to merge 2 commits into main from TIKA-4334. Files changed: 220.]

Tim Allison (@tallison)
2025-09-26

Moving tika-pipes out of tika-core in 4.x.

Please take a look and let me know what you think:

github.com/apache/tika/pull/23

Ref: issues.apache.org/jira/browse/

Tim Allison (@tallison)
2025-09-16

I just noticed there are 1.3 million pulls of tika-server on Docker Hub per month.

That's A LOT of files parsed!

Happy parsing!

Tim Allison (@tallison)
2025-09-16

Just submitted an "Intro to #ApacheTika" talk to DistrictCon, Year 1. 🤞

There will be PDFs!🤣

@DistrictCon
sessionize.com/districtcon/

Tim Allison (@tallison)
2025-09-11

3.2.3 release candidate #1 is up for vote!

This is a bugfix release that fixes a bug in processing XFA within PDFs via tika-server.

lists.apache.org/thread/px1stb

Tim Allison (@tallison)
2025-09-06

I just learned about @DistrictCon's CFP; the deadline is Sep 28.

Anyone interested in #ApacheTika for file deep dives?

No matter the answer, please consider submitting your talks!

sessionize.com/districtcon

Tim Allison (@tallison)
2025-09-05

This one is particularly rewarding for me w.r.t. #ApacheTika.

This shows the CRS trying to trigger a zip slip in our existing unpacker code. It couldn't, so it eventually found the vulnerable harness (and new class) that I added for the competition for this "entry level zip slip" challenge.
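
For context, a zip slip happens when an archive entry name such as ../../etc/passwd is joined to the extraction directory without checking where the result lands. The Python sketch below shows the usual guard, as an illustration only. It is not Tika's unpacker code; the function name is hypothetical and POSIX-style paths are assumed.

```python
import os


def is_safe_entry(target_dir: str, entry_name: str) -> bool:
    """True if extracting entry_name would stay inside target_dir."""
    root = os.path.realpath(target_dir)
    # Joining with an absolute entry name discards root, which the
    # containment check below then rejects.
    dest = os.path.realpath(os.path.join(root, entry_name))
    try:
        return os.path.commonpath([root, dest]) == root
    except ValueError:
        # Raised when the paths cannot be compared (e.g. different drives).
        return False


if __name__ == "__main__":
    for name in ("docs/a.txt", "../../etc/passwd", "/etc/passwd"):
        verdict = "ok" if is_safe_entry("/tmp/out", name) else "rejected"
        print(name, verdict)
```

Resolving the joined path before comparing is the important step; a plain string prefix check on the unresolved path can be bypassed.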

theori-io.github.io/aixcc-publ
