#tokenizers

2025-05-24

#dailyreport #tokenizers #huggingface #rust #gentoo
#ebuild #secops #cargo
I compiled the HF 🤗 tokenizers library from source and
improved the Gentoo ebuild file to allow reproducible
installation from source.
I removed optional dependencies and disabled the HTTP
requirements to improve security.

I wrote very simple tests for tokenizers, safetensors, and
transformers, plus an integration test for them, because
the tokenizers tests require the HF Hub, which I disabled.

It was a hard but good experience with Cargo, Rust's package
manager. The main problems were due to strange
cfg flags that Gentoo should have set automatically; for
example, target_os=linux was not set. "cfg" is an
abomination: you can't add or change it safely.

I didn't find a working solution for managing "cfg", so
I just patched the Cargo.toml files of the dependencies by
commenting out lines.
(∠・ω )⌒
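
A minimal sketch of the "patch Cargo.toml by commenting out lines" step. The post does not say which lines were commented out, so the table header, the crate names, and the helper comment_out_table are all hypothetical. The sketch comments out a whole table rather than just its header, because removing only the header would silently move the entries into the previous table.

```python
def comment_out_table(toml_text: str, header: str) -> str:
    """Comment out one TOML table: its header and every entry up to the next header.

    Simplification (assumption): any line whose first non-blank character is "["
    is treated as a table header, so nested arrays that start a line with "["
    are not handled.
    """
    out = []
    inside = False
    for line in toml_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # A new table begins; we are "inside" only if it is the target table.
            inside = stripped == header
        if inside and stripped:
            out.append("# " + line)
        else:
            out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical manifest of a dependency; the crate names are made up.
CARGO_TOML = '''[dependencies]
serde = "1"

[target.'cfg(target_os = "linux")'.dependencies]
example-linux-only = "0.1"

[dev-dependencies]
tempfile = "3"
'''

patched = comment_out_table(
    CARGO_TOML, "[target.'cfg(target_os = \"linux\")'.dependencies]"
)
print(patched)
```

In an ebuild, a step like this would probably run during source preparation, before Cargo reads the manifest; a plain sed call would do the same job.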

2024-10-01

🔧 #code2prompt: A command-line tool for converting codebases to #LLM prompts

Key features:
• 📁 Generates well-formatted #Markdown prompts with source tree structure
• 🛠️ Customizable #Handlebars templates for versatile prompt generation
• 🔍 Respects .gitignore and supports file filtering with glob patterns
• 🔢 Displays token count using various #tokenizers (cl100k, p50k, r50k_base)
• 📊 #Git diff integration for commit messages and #PullRequest descriptions
• 📋 Automatic clipboard copy and option to save output to file

Additional capabilities:
• 🔢 Line numbering for source code blocks
• 🔀 JSON output option for structured data
• 🚫 Exclusion of files/folders from source tree
• 📝 Support for user-defined variables in templates

#opensource project written in #Rust, available on #crates_io and #AUR

Useful for:
• Quick #LLM prompt generation from codebases
• Code documentation and analysis
• Bug finding and security vulnerability assessment
• Performance optimization suggestions

github.com/mufeedvh/code2promp
