🗓️ The May 2025 arXiv articles are now in ar5iv.
I am happy that our university hosting of the #ar5iv dataset has reached 100 verified downloaders, about 1 year after release.
This is peanuts in a HuggingFace world, but my research group had several earlier attempts at distributing HTML5+MathML and this one went well.
P.S. Don't get me started on the zillions of crawls ar5iv has had apart from that though, sigh...
Thanks to everyone for using the dataset when you need bulk data!
@norbu is presenting on #OpenAccess, trying to include that aspect at #arxiv.
I've been following the amazing work by @dginev towards #ar5iv for quite a while, in case you missed it:
https://ar5iv.labs.arxiv.org/
There's still a lot of work left, but I really love seeing progress towards accessibility within the #TeX / #TeXLaTeX and, more generally, the #ScientificPublishing community. Especially as it shows arXiv that it's possible to improve on a huge scale.
Announcing our new dataset:
ar5iv 04.2024
🔹2.1 million HTML documents
🔹1 billion formulas in MathML
https://sigmathling.kwarc.info/resources/ar5iv-dataset-2024/
🗓️ ar5iv has brand new HTML today.
Regenerated with LaTeXML v0.8.8, which resolved 30+ reported issues.
Success rate is at 75.33%, and HTML exists for 97.74% of articles.
New trade-off: we are experimenting with lower image quality, which reduces our HDD use from 4.8 TB to 2.7 TB.
The total ar5iv collection now comprises 2,152,821 HTML pages and contains over a billion formulas.
More still to come, with the usual monthly update on April 5th. Enjoy!
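For a rough sense of what the image-quality trade-off above buys, here is a minimal back-of-the-envelope sketch using only the figures quoted in the post. It assumes decimal units (1 TB = 1,000,000 MB) and attributes the whole disk footprint to the HTML pages, which may not match how the storage is actually accounted; the variable names are illustrative.

```python
# Figures quoted in the post above.
hdd_before_tb = 4.8
hdd_after_tb = 2.7
html_pages = 2_152_821

# Absolute and relative disk savings from the lower image quality.
saved_tb = hdd_before_tb - hdd_after_tb
reduction_pct = 100 * saved_tb / hdd_before_tb

# Assumption: decimal units, and all disk use is attributed to HTML pages.
avg_mb_per_page = hdd_after_tb * 1_000_000 / html_pages

print(f"saved: {saved_tb:.1f} TB (about {reduction_pct:.0f}% less disk)")
print(f"average footprint: {avg_mb_per_page:.2f} MB per page")
```

Under these assumptions the reduction is a bit under half of the previous disk use, with an average footprint a little over 1 MB per page.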