Lmst

WARNING

TIL that bunzip2 on Linux / Debian / MX Linux deletes, I repeat, DELETES the original archive when you run it in its vanilla form.

`bunzip2 dfly-x86_64-6.4.2_REL.iso.bz2`

resulted in the ISO being unpacked and the original deleted WITHOUT WARNING.

Since I'm on expensive LTE+ 4G internet, that is significant.

I have not used bunzip2 in years, but should have remembered this hostile default. It was not that way, IIRC.
Do I now need to read the manpages of commands I have not used in years on Linux? Why was the default changed?

Luckily I have copies of the bzip2 ISO on multiple HDD and SSD partitions.

#bzip2 #bunzip2 #sh #bash #warning #TIL #Linux #OpenSource #POSIX
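
For reference, removing the input file after a successful run is the documented default of bzip2 and bunzip2 (gzip behaves the same way), and the `-k` / `--keep` flag keeps the archive. The same "decompress but keep the original" behaviour can be sketched with Python's standard `bz2` module. The helper name `bunzip2_keep` is made up for illustration and is not part of any tool.

```python
import bz2
import shutil
from pathlib import Path


def bunzip2_keep(archive: str) -> Path:
    """Decompress `archive` next to itself and leave the .bz2 file in place.

    Hypothetical helper for illustration only.
    """
    src = Path(archive)
    if src.suffix != ".bz2":
        raise ValueError(f"expected a .bz2 file, got {src.name}")
    dst = src.with_suffix("")  # e.g. image.iso.bz2 -> image.iso
    # Stream in chunks so a large ISO never has to fit in memory.
    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

Calling `bunzip2_keep("dfly-x86_64-6.4.2_REL.iso.bz2")` would write the `.iso` beside the archive and never touch the `.bz2` itself.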

Ok, #Rust #bzip2 implementation done … in principle 😄 The compression test ran on a #LoremIpsum text file. A more realistic sample would probably have resulted in a much worse ratio. The BWT encoding step is ultra slow because I chose a naive approach instead of a faster but more complex algorithm. I'm also producing only one #Huffman tree for the whole file, which will significantly degrade compression on longer and less homogeneous inputs.

Screenshot of a program output in the terminal. We can see an input text file has been compressed with a ratio of over 85%.
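
To make "naive BWT" concrete, here is a minimal sketch in Python (not the poster's Rust code, which is not shown in the thread). It assumes the textbook formulation: sort all rotations of the input and keep the last column plus the row index of the original string. The function names are made up for illustration.

```python
def bwt_naive(data: bytes) -> tuple[bytes, int]:
    """Return (last column, row of the original string among sorted rotations)."""
    n = len(data)
    if n == 0:
        return b"", 0
    doubled = data + data  # rotation i is doubled[i:i + n]
    # Sorting by full rotations costs O(n^2 log n) comparisons in the worst case,
    # which is why this approach is slow on large blocks.
    order = sorted(range(n), key=lambda i: doubled[i:i + n])
    last = bytes(doubled[i + n - 1] for i in order)
    return last, order.index(0)


def bwt_inverse(last: bytes, primary: int) -> bytes:
    """Rebuild the original bytes from the last column and the primary index."""
    n = len(last)
    if n == 0:
        return b""
    # A stable sort of positions by byte value maps first-column rows to last-column rows.
    order = sorted(range(n), key=lambda i: last[i])
    out = bytearray()
    row = order[primary]
    for _ in range(n):
        out.append(last[row])
        row = order[row]
    return bytes(out)
```

For example, `bwt_naive(b"banana")` gives `(b"nnbaaa", 3)`, and `bwt_inverse(b"nnbaaa", 3)` returns `b"banana"`. Suffix-array based constructions avoid materialising and comparing whole rotations.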

Ok this is fun!

My #bzip2 implementation in #Rust is coming along nicely, and I've been loosely following a #TestDrivenDevelopment approach, which makes working with LLMs a breeze. If an AI-assisted rework turns tests red, you can immediately start debugging, tuning and optimizing, which is nice and focused.

I'm done with everything including the Move-To-Front Transform encoding and am already getting text file sizes down by 25 to 30%.

Next up: Huff huff huff!
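
For readers who have not met the Move-To-Front transform, a minimal Python sketch follows (again, not the poster's Rust code). Starting from the full 0..255 byte alphabet is a simplification made for this sketch, and the function names are made up for illustration.

```python
def mtf_encode(data: bytes) -> list[int]:
    """Replace each byte by its current position in a recency list."""
    alphabet = list(range(256))  # assumption: full byte alphabet, in numeric order
    out = []
    for b in data:
        idx = alphabet.index(b)
        out.append(idx)
        # Move the byte just seen to the front, so repeats encode as 0.
        alphabet.insert(0, alphabet.pop(idx))
    return out


def mtf_decode(indices: list[int]) -> bytes:
    """Invert mtf_encode by replaying the same list updates."""
    alphabet = list(range(256))
    out = bytearray()
    for idx in indices:
        b = alphabet.pop(idx)
        out.append(b)
        alphabet.insert(0, b)
    return bytes(out)
```

Applied to BWT output such as `b"nnbaaa"`, `mtf_encode` yields `[110, 0, 99, 99, 0, 0]`: runs of equal bytes turn into zeros, which is what the later Huffman stage benefits from.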

How to get back into a programming language?

"Do small hobby projects", they said.

"It will be fun!", they said.

So here I am, reading university lecture notes about how to build suffix arrays in O(n) so I can optimize a Burrows-Wheeler Transform for the #bzip2 implementation I inexcusably started writing so I could get back into #Rust.

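🌘 Writing a competitive BZip2 encoder in Ada from scratch - Part 3: entropy coding (with AI/machine learning!)
➤ The key to optimizing BZip2 compression with machine learning: smart clustering for Huffman trees
✤ https://gautiersblog.blogspot.com/2025/09/writing-competitive-bzip2-encoder-in.html
This article is the third part of a series on writing a BZip2 encoder in Ada and focuses on the entropy coding stage. The author explains the flexibility the BZip2 format allows in entropy coding and, using actual compression figures from the Calgary and Canterbury corpora, shows the differences between BZip2 implementations. The article goes on to explain that these differences stem mainly from how the Huffman trees are initially assigned. The author brings in the k-means clustering algorithm from machine learning and shows how it can be applied to the initial grouping of BZip2 symbols, aiming to find a better Huffman tree configuration and so improve compression. Simple geometric figures and an analogy with political parties illustrate how much the initial grouping matters for the final result, especially at high…
#CompressionAlgorithms #Ada programming #EntropyCoding #MachineLearning #BZip2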

Ah yes, because what the world desperately needs is yet another #BZip2 encoder, but this time dressed up in #Ada and sprinkled with the ✨ magic ✨ of AIMachineLearning™. It's the classic tale: boy meets algorithm, algorithm meets Ada, and everyone lives happily ever after in a world of compressed bits nobody asked for. 🤷‍♂️💾
https://gautiersblog.blogspot.com/2025/09/writing-competitive-bzip2-encoder-in.html #AIMachineLearning #Compression #TechHumor #HackerNews #ngated

Writing a competitive BZip2 encoder in Ada from scratch in a few days – part 3

https://gautiersblog.blogspot.com/2025/09/writing-competitive-bzip2-encoder-in.html

#HackerNews #Writing #BZip2 #Ada #Encoder #Competitive #Coding #Part3

Xz format inadequate for general use
https://www.nongnu.org/lzip/xz_inadequate.html
#ycombinator #lzip #LZMA #bzip2 #gzip #data_compression #long_term_archiving

🎉 Behold, the groundbreaking revelation: #Xz is not the Holy Grail of data formats! 🚀 Apparently, using xz for digital preservation is like using a sieve as a bucket—bound to fail. Who knew? 🤦‍♂️ Stick to #bzip2, #gzip, or #lzip if you want actual functionality and avoid sinking your data into the abyss of inadequacy. 🔍💾
https://www.nongnu.org/lzip/xz_inadequate.html #dataformats #digitalpreservation #HackerNews #ngated

🚀 Oh, the riveting #saga continues! Witness as #Ada, the language nobody asked for, takes on yet another #unnecessary feat: building a #BZip2 #encoder that absolutely nobody needed – in #record time! 🤯 Part 2, because once wasn't enough! 🤡
https://gautiersblog.blogspot.com/2025/07/writing-bzip2-encoder-in-ada-from.html #Feat #Time #Part2 #HackerNews #ngated

Writing a competitive BZip2 encoder in Ada from scratch in a few days – part 2

https://gautiersblog.blogspot.com/2025/07/writing-bzip2-encoder-in-ada-from.html

#HackerNews #Writing #BZip2 #Encoder #Ada #Programming #Competitive #Coding #Part2

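🌘 Gautier's blog: Writing a competitive BZip2 encoder in Ada from scratch in a few days - Part 2
➤ The power of Ada: building an efficient BZip2 encoder by hand
✤ https://gautiersblog.blogspot.com/2025/07/writing-bzip2-encoder-in-ada-from.html
This is the second part of a series on Gautier's blog, describing how he used Ada to build, from scratch and in just a few days, a tool whose performance can compete with existing BZip2 encoders. The author goes into the technical details of the implementation, including data structures, algorithm choices, and code optimization strategies, aiming to show Ada's potential for high-performance systems development.
+ Finishing such a complex project in Ada within a few days is truly impressive! The author has deep technical skill.
+ For anyone who wants to understand BZip2's inner workings and Ada's performance, this article offers very valuable insights.
#Programming #Algorithms #DataCompression #BZip2 #Ada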

How to write a bzip2 archiver in Python: a look at the Burrows-Wheeler transform

Hi! I'm Roma, a Python backend developer at KTS. This is the second article in my series on the bzip2 compression algorithm. The first one can be read here, but it isn't required to follow today's topic. Below I'll walk through the Burrows-Wheeler transform, the key stage of bzip2 compression.

https://habr.com/ru/companies/kts/articles/937554/

#archivers #archiving #data_compression #algorithms #bzip2archiver #bzip2 #bwt

@ermo

I'm very slowly creeping towards having checksum files auto-built. There's a missing part that needs to be done.

But I'm at least over one initial hurdle of switching from pax -z to pax -j. Not that that helps in the #FreeBSD 10 case because FreeBSD 10's pax does not have -j.

(Make an archive with -z and it isn't idempotent: the bytes differ from run to run, because the #gzip header carries a timestamp.)

So there's still the installing #GhostBSD mountain to climb, and seeing whether that has pax -j yet. (-:

#bzip2 #pax
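
The timestamp point can be illustrated with Python's standard `gzip` and `bz2` modules rather than pax itself. This is only a sketch of the underlying format difference: the payload is made up, and whether a given pax build exposes a way to pin the gzip timestamp is not something this example covers.

```python
import bz2
import gzip

payload = b"checksum me\n" * 100  # made-up sample data

# The gzip header stores a 4-byte modification time, so two archives of the
# same bytes differ whenever the timestamp differs.
gz_a = gzip.compress(payload, mtime=1)
gz_b = gzip.compress(payload, mtime=2)

# Pinning the timestamp makes the gzip output repeatable.
gz_fixed_1 = gzip.compress(payload, mtime=0)
gz_fixed_2 = gzip.compress(payload, mtime=0)

# A bzip2 stream has no timestamp field, so identical input gives identical output.
bz_a = bz2.compress(payload)
bz_b = bz2.compress(payload)

print(gz_a == gz_b, gz_fixed_1 == gz_fixed_2, bz_a == bz_b)
```

That difference is why a checksum file built over bzip2-compressed archives stays stable across rebuilds, while one built over gzip archives does not unless the timestamp is fixed.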

You'll find this benchmarking adventure in its own blog post "Performance lessons of implementing lbzcat in Rust" https://anisse.astier.eu/lbzip2-rs.html

#RustLang #lbzip2 #bzip2 #benchmarking #performance

lbzip2 internally implements a full task-scheduling runtime, and splits tasks into much smaller increments; it supports bit-aligned blocks (which are standard in the bzip2 format), while my Rust implementation purposefully doesn't: I wanted to rely on the bzip2 crate, which only supports byte-aligned buffers, and keep the code simple (which I failed at, IMHO). FIN 15/15

#lbzip2 #bzip2

That's it for the benchmarking! You can find my implementation at http://github.com/anisse/lbzip2-rs/ ; it's very much PoC-quality code, so use at your own risk! I chose to manually spawn threads instead of using rayon or an async runtime; there are other things I'm not proud of, like busy-waiting instead of using a condvar, for example. 14/N

#lbzip2 #bzip2 #RustLang #async #rayon

We've been running benchmarks on single CPU cores since the start. What if we unleash the parallel mode? Here are the results: lbzip2 is still much faster on the 8 cores; my implementation holds up fine, but is only 80% faster than bzip2 while running on 8 cores. On bigger files, though, it starts to pay off, being up to 6.3x faster, while lbzip2 can go up to 7.7x. 13/N

#lbzip2 #bzip2

$ hyperfine -N -L program bzcat,lbzcat,./target/release/lbzcat "{program} readmes.tar.bz2"
Benchmark 1: bzcat readmes.tar.bz2
Time (mean ± σ): 74.8 ms ± 19.1 ms [User: 73.5 ms, System: 0.6 ms]
Range (min … max): 56.0 ms … 104.3 ms 50 runs

Benchmark 2: lbzcat readmes.tar.bz2
Time (mean ± σ): 29.3 ms ± 3.6 ms [User: 64.7 ms, System: 2.7 ms]
Range (min … max): 16.1 ms … 40.1 ms 85 runs

Benchmark 3: ./target/release/lbzcat readmes.tar.bz2
Time (mean ± σ): 44.1 ms ± 2.8 ms [User: 117.7 ms, System: 2.7 ms]
Range (min … max): 32.9 ms … 47.7 ms 63 runs

Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
lbzcat readmes.tar.bz2 ran
1.51 ± 0.21 times faster than ./target/release/lbzcat readmes.tar.bz2
2.56 ± 0.73 times faster than bzcat readmes.tar.bz2

$ hyperfine -m 1 -N -L program bzcat,./target/release/lbzcat,lbzcat "{program} contains-Q484170.json.bz2"
Benchmark 1: bzcat contains-Q484170.json.bz2
Time (abs ≡): 27.194 s [User: 27.055 s, System: 0.071 s]

Benchmark 2: ./target/release/lbzcat contains-Q484170.json.bz2
Time (abs ≡): 4.237 s [User: 33.433 s, System: 0.188 s]

Benchmark 3: lbzcat contains-Q484170.json.bz2
Time (abs ≡): 3.513 s [User: 26.614 s, System: 0.165 s]

Summary
lbzcat contains-Q484170.json.bz2 ran
1.21 times faster than ./target/release/lbzcat contains-Q484170.json.bz2
7.74 times faster than bzcat contains-Q484170.json.bz2

Overall, my Rust implementation (using the bzip2-rs crate) is (much) slower than lbzip2, and faster than bzip2. For some reason, it also sees a huge performance boost on performance cores, most likely due to better IPC and branch prediction. 12/N

#lbzip2 #bzip2

#BZip2
