#Tokenizer

N-gated Hacker News (ngate)
2025-06-30

🚀 Wow, a #tokenizer that's 2-4x faster than OpenAI's! Is it powered by caffeine or just another star chaser? 🤔 Meanwhile, the rest of us are still trying to figure out how to navigate GitHub's labyrinthine menu without a PhD in UI 🤷‍♂️.
github.com/M4THYOU/TokenDagger

2025-01-20

Counting the number of LLM tokens in the Linux kernel sources and beyond…

This article about Titan, Google's new extension of the transformer architecture that pushes the LLM context limit to 2 million tokens, made me curious how many LLM-ready tokens the source code of truly colossal software contains. The open-source software we are going to "dissect":

habr.com/ru/articles/875022/

#llm #ai #tokenizer #token #fun #openai #tiktoken
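
A rough sketch of the kind of count the post above describes: walk a source tree and total the tokens per file. The function names, the default extension filter, and the regex-based `naive_tokenize` stand-in are all assumptions made for illustration; the article itself is tagged #tiktoken, and a real count would plug that library's encoder in instead.

```python
import os
import re
from typing import Callable, Dict, List, Tuple

# Stand-in tokenizer (an assumption, not what the article uses): word-like
# runs and single punctuation characters each count as one token.
_NAIVE_TOKEN = re.compile(r"\w+|[^\w\s]")


def naive_tokenize(text: str) -> List[str]:
    return _NAIVE_TOKEN.findall(text)


def count_tokens_in_tree(
    root: str,
    tokenize: Callable[[str], list] = naive_tokenize,
    extensions: Tuple[str, ...] = (".c", ".h"),
) -> Dict[str, int]:
    """Return {path relative to root: token count} for matching files."""
    counts: Dict[str, int] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            # Source trees sometimes contain non-UTF-8 bytes; replace them
            # instead of failing on the first odd file.
            with open(path, encoding="utf-8", errors="replace") as fh:
                text = fh.read()
            counts[os.path.relpath(path, root)] = len(tokenize(text))
    return counts
```

Calling `sum(count_tokens_in_tree("linux").values())` would give a total for a hypothetical checkout in `./linux`. With tiktoken installed, passing `tiktoken.get_encoding("cl100k_base").encode` as `tokenize` should give model-specific counts, which will differ from the naive numbers.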

Idea: Audio-to-StableDiffusion #tokenizer that naively translates #audio chunks to #tokens recognized by #StableDiffusion and generates 1 frame per 1/24th second of audio, then strings the results together. Add a temporal cohesion mechanism to taste.

I wonder what it would look like. 🤔
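
One very naive reading of the chunking step, as a sketch only: slice the audio into 1/24-second windows and map each window's loudness to a token ID. Everything here is hypothetical (the function name, the RMS-to-ID mapping, the `vocab_size` default); it does not use any real Stable Diffusion vocabulary, and the image generation and temporal cohesion parts are left out entirely.

```python
import math
from typing import List, Sequence


def audio_to_frame_tokens(
    samples: Sequence[float],
    sample_rate: int,
    fps: int = 24,
    vocab_size: int = 1024,  # hypothetical vocabulary size
) -> List[int]:
    """Map each 1/fps-second chunk of audio to one token ID.

    Samples are assumed to be floats in [-1.0, 1.0]. The mapping (RMS
    loudness scaled into the ID range) is a placeholder for whatever
    audio-to-token translation one would actually want.
    """
    chunk = sample_rate // fps
    if chunk <= 0:
        raise ValueError("sample_rate must be at least fps")
    tokens: List[int] = []
    # Only full chunks are used; a trailing partial chunk is dropped.
    for start in range(0, len(samples) - chunk + 1, chunk):
        window = samples[start:start + chunk]
        rms = math.sqrt(sum(s * s for s in window) / chunk)
        level = min(rms, 1.0)
        tokens.append(min(int(level * vocab_size), vocab_size - 1))
    return tokens
```

One second of audio at 48 kHz yields 24 tokens, one per video frame; each token would then seed the prompt for that frame.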

2024-10-24

Getting to grips with Vespa. Part 2

In this article you will learn: 1) What Document and Query Processing are. 2) How Vespa processes text, and what tokenization and stemming are. 3) Which text processor works best for Russian. 4) How to run a text search. 5) How results are ranked.

habr.com/ru/companies/sportmas

#java #vespa #stemming #tokenizer #bm25 #docker
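
To make the tokenize-then-rank pipeline in the post above concrete, here is a small BM25 scorer. It is a generic sketch, not Vespa's implementation: the tokenizer is a plain lowercase word split with no stemming, the IDF uses the common "+1 inside the logarithm" variant, and the `k1` and `b` defaults are conventional values assumed for illustration.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    # Lowercased word runs; \w is Unicode-aware, so Cyrillic text works too.
    return re.findall(r"\w+", text.lower())


def bm25_scores(
    query: str, docs: List[str], k1: float = 1.2, b: float = 0.75
) -> List[float]:
    """Return one BM25 score per document for the given query."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents each term occurs.
    df: Counter = Counter()
    for toks in tokenized:
        df.update(set(toks))

    query_terms = tokenize(query)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents are penalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Because there is no stemming, a query for "foxes" would not match a document containing "fox"; that gap is exactly what the stemming step discussed in the article is meant to close.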

今日視界🤖 (ichirokato)
2024-06-22

ByteDance Doubao's brand-new image Tokenizer: generating an image takes as few as 32 tokens, with a speedup of up to 410x
headline01.com/a/LEo3EhIpGmI1Q

Jeroen Habets (jeroen@habets.dev)
2024-02-24

Detailed #explanation of #AI #LLM using my favourite #database #PostgreSQL by Alex Bolenok quassnoi

Nicely describes how all the constituent pieces of an LLM come together:
#Tokenizer, #Embeddings, #Attention/#Masking, #Feedforward, #temperature, #Inference

explainextended.com/2023/12/31

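GenAI (genai)
2024-02-23

Andrej has started teaching since leaving OpenAI. This two-hour video is about the text tokenizer. Interestingly, he starts from the byte-based approach. ChatGPT's tokenizer merges runs of spaces, which makes it more efficient on Python code. Arguably the most detailed tokenizer tutorial ever; recommended for anyone who has not come across the concept before.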

youtube.com/watch?v=zduSFxRajkE
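
A toy byte-level BPE, sketched here to illustrate the two points in the post above: training starts from raw bytes, and frequent runs of spaces end up merged into single tokens. This is not the video's code or any production tokenizer; real GPT tokenizers also pre-split text with a regex, which this sketch leaves out, and all function names are made up for the example.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[int, int]


def most_common_pair(ids: List[int]) -> Pair:
    """Most frequent adjacent pair (ties go to the pair seen first)."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get)


def merge(ids: List[int], pair: Pair, new_id: int) -> List[int]:
    """Replace every non-overlapping occurrence of pair with new_id."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text: str, num_merges: int) -> Dict[Pair, int]:
    """Learn merges over UTF-8 bytes; new token IDs start at 256."""
    ids = list(text.encode("utf-8"))
    merges: Dict[Pair, int] = {}
    for step in range(num_merges):
        if len(ids) < 2:
            break
        pair = most_common_pair(ids)
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges


def encode(text: str, merges: Dict[Pair, int]) -> List[int]:
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # apply in the order learned
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids: List[int], merges: Dict[Pair, int]) -> str:
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8")
```

Trained for two merges on a snippet indented with four spaces, such as `"    x = 1\n    y = 2\n    z = 3\n"`, the first merge joins two spaces and the second joins two of those pairs, so a four-space indent becomes one token instead of four.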

Reference to the Future<Void> (rttf@techhub.social)
2024-02-21

GripNews (GripNews)
2023-06-13

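🌘 GitHub - belladoreai/llama-tokenizer-js: a JS tokenizer for LLaMA
➤ For counting tokens on the client side
github.com/belladoreai/llama-t
This is a JavaScript tokenizer for LLaMA that runs in the browser and is used to count tokens on the client side. It is easy to use and compatible with most LLaMA-based models. Both its runtime and its bundle size have been optimized, and it can be used as an npm package or as an ES6 module.
+ This is a very useful tool, and I am glad to be able to use it in the browser.
+ I really like this tokenizer: it is easy to use and very fast.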

lorddimwit is now @rk@well.com (lorddimwit)
2023-05-12

Last night I got on a tear and wrote a complete #tokenizer for the Manatee programming language in C. I started at…9ish and finished at 1 in the morning

(It is, AFAIK, completely compliant except that I didn’t bother with Unicode. I suppose I could relatively easily augment it to use wchars…which aren’t *necessarily* Unicode but if we stick to standard C we gotta make sacrifices. __STDC_ISO_10646__ FTW I suppose.)

I suppose I should probably write a over the weekend, time permitting

2022-12-19

#JohnHorgan isn't #Premier of #BritishColumbia anymore but that doesn't stop me from saying he's a #millionaire due to being a #corrupt #CasinoLobbyist for years & his #Kingmaker #minion who is now our sitting Premier never bothered looking into #corruption from #HorganizedCrimes before he became Premier. I literally met this #lying #greedy #scumbucket at a #lobbyist convention that I was hired to shoot in the 90s. He's been #sleazy for decades.

#POS #tokenizer #colonizer #Racist #Bigot #vanisle
