#FullText

2025-05-12

A fast full-text search algorithm without tokenization

My name is Dmitry Olshansky, and I am a lead engineer at T-Bank. I'll describe a new (as far as I know) algorithm for searching text by pattern. The problem came up in the Sage project, T-Bank's observability platform, for which we are building SageDB, a new backend for structured logs.

habr.com/ru/companies/tbank/ar

#поиск_текста #алгоритмы_поиска #fulltext

gmrstudiosgmrstudios
2025-04-13
God is an Iron
black lipstick on your flight controls vyr@princess.industries
2025-03-04
  1. advanced search operators prototype. status: not quite ready for prime time.
    • has a bunch of goofy operators nobody but me will ever use, such as is:article
    • still missing some classics like lang:, domain:, before:, and after:, and some oddballs like is:bot (would require extra join) and sort: (would break ID-based paging)
    • needs docs, although i know where Past Vyr basically already wrote them: https://github.com/VyrCossont/mastodon/pull/8 😇
  2. indexed full text search prototype. status: heretical.
    • only works on PostgreSQL: SQLite's full-text search is much fussier and requires using a "virtual table" and frankly i can't be bothered, at least tonight
    • direct port of https://github.com/VyrCossont/mastodon/pull/3 and has the same limitations: HTML isn't stripped, and media alt text and poll options aren't indexed
    • fixing that would start by adding a tsvector column that concatenates (with record separators? as an array?) the contents of filterableFields for a status, updates it every time the status or its attachments are edited, and GIN-indexes that column
    • ignores the whole issue of matching posts to language tags and language tags to PG text search configurations by assuming that everything is English
    • still massively faster than unindexed ILIKE that vanilla GTS uses
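
A rough illustration of why an indexed search beats an unindexed scan, assuming a toy in-memory post store. This is not GTS or PostgreSQL code: `linear_search` stands in for an ILIKE-style scan, and `build_index` stands in for a GIN-indexed tsvector column, minus stemming, stop words, and language configurations. All names here are hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Lowercase and split on non-alphanumerics (no stemming, no stop words).
    return re.findall(r"[a-z0-9]+", text.lower())

def linear_search(posts, term):
    # Unindexed scan: touches every post, like a case-insensitive substring match.
    term = term.lower()
    return [post_id for post_id, text in posts.items() if term in text.lower()]

def build_index(posts):
    # Inverted index: token -> set of IDs of the posts that contain it.
    index = defaultdict(set)
    for post_id, text in posts.items():
        for token in tokenize(text):
            index[token].add(post_id)
    return index

def indexed_search(index, query):
    # AND-match all query tokens by intersecting their posting sets.
    tokens = tokenize(query)
    if not tokens:
        return []
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return sorted(result)

posts = {
    1: "Full text search on PostgreSQL",
    2: "SQLite needs a virtual table for full text search",
    3: "Unrelated post about cats",
}
index = build_index(posts)
```

Note the semantic difference: the scan matches substrings (`postgre` finds post 1), while the index only matches whole tokens, which is roughly how lexeme-based search behaves too.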

edit: fixed a backwards flag in has:media and related operators

#GoToSocial #GTS #FullText #FullTextSearch

black lipstick on your flight controls vyr@princess.industries
2025-02-08

ok, here you go, updated GTS search patches for 0.18.0rc1. notice how they're on my repo? these are completely unofficial. do not bug anyone but me about them.

  1. improved hashtag search. status: upstreamable, mostly.
    • doesn't require # prefix to search hashtags
    • searches for matches anywhere in a hashtag: Mac now matches VintageMac as well as MacOS
    • includes hashtags when not specifically searching for accounts or statuses, like most Mastodon-compatibles
    • doesn't change existing tag sorting. popularity and/or recency might be more useful
  2. offset paging for searches. status: not upstreamable yet.
    • more compatible: many clients can't do ID paging
    • allows paging hashtag search results: Mastodon API has no concept of IDs for hashtags, so ID paging can't work for those anyway
    • possible performance issues: see comments on why main doesn't have it already. personally, i haven't noticed and i run this instance on a tiny VPS
  3. remove search restrictions. status: heretical.
    • searches any post on your instance (except other accounts' private/direct posts, and accounts that have you blocked)
    • includes public, unlisted, your own private and DM posts, and private and DM posts that are replies to you
    • expanded search is default: revert to standard GTS behavior by adding scope:classic or in:library operator to search query
    • definite performance issues: this means searching more posts! GTS does not use either PG full-text indexes/operators or SQLite full-text virtual tables, and this patch doesn't change that.
    • doesn't include alt text of media attachments, or polls, because main doesn't
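
The difference between the two paging styles in item 2 can be sketched like this. It is a simplified in-memory model, not the GTS patch: real status IDs are strings, and the integer IDs and function names below are assumptions for illustration only.

```python
def page_by_offset(results, offset, limit):
    # Offset paging: works for any ordered list, with or without IDs.
    return results[offset:offset + limit]

def page_by_max_id(results, max_id, limit):
    # ID paging: results are sorted newest first; return items older than max_id.
    # Only possible when every result has a comparable ID.
    older = [r for r in results if max_id is None or r["id"] < max_id]
    return older[:limit]

# Statuses have IDs, so both styles work.
statuses = [{"id": i} for i in (50, 40, 30, 20, 10)]

# Hashtags are modeled as bare names with no IDs, so only offset paging applies.
hashtags = ["MacOS", "VintageMac", "Macintosh"]
```

With ID paging a client passes the last ID it saw, so new posts arriving between requests do not shift the pages. Offset paging is simpler for clients but can skip or repeat items when the result set changes.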

i may add more patches to this list in the medium future as i add more functionality to my own instance, for example, date range operators (before:date, after:date), post property operators (has:image, has:poll, has:cw, is:sensitive, visibility:public), threading operators (to:user@instance.tld, is:reply, -is:reply), sort operators (sort:oldest, sort:newest, sort:favs) and maybe PG full-text indexing if i have a really good day (i really don't wanna figure out SQLite's weird shit! someone else do it!)
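
A minimal sketch of how operators like these could be split from plain search terms. This is not the patch's parser: the regex, the `KNOWN_OPERATORS` set, and the tuple layout are all assumptions made for illustration.

```python
import re

# Hypothetical operator names, taken from the examples in the post above.
KNOWN_OPERATORS = {
    "has", "is", "before", "after", "sort", "to",
    "visibility", "scope", "in", "lang", "domain",
}

# Optional leading "-" for negation, then name:value with no spaces.
OPERATOR_RE = re.compile(r"^(-?)([a-z]+):(\S+)$")

def parse_query(query):
    # Returns (plain_terms, operators); each operator is (name, value, negated).
    terms, operators = [], []
    for token in query.split():
        match = OPERATOR_RE.match(token)
        if match and match.group(2) in KNOWN_OPERATORS:
            negated, name, value = match.groups()
            operators.append((name, value, negated == "-"))
        else:
            # Unknown prefixes (for example URLs) stay ordinary search terms.
            terms.append(token)
    return terms, operators
```

Checking names against a known set keeps something like `https://example.com` from being misread as an operator called `https`.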

randos don't debate me about Fedi search. my clients can't set per-post interaction controls yet so i'll just block you.

#GoToSocial #GTS #FullText #FullTextSearch

Neighbourhoodie Software neighbourhoodie@toot.berlin
2024-11-01

Nouveau is the new #fulltext search for @couchdb that makes fuzzy search, facets, counts, and ranges even more flexible.

In our latest guide, we show you how to install, set up, and use Nouveau to find, well, all the things!

We also cover disk usage and performance. You’ll love how much faster it is 🐇

Check it out on our blog: neighbourhood.ie/blog/2024/10/

Graph showing “Number of Documents” (x-axis) against “Response Time in milliseconds” (y-axis). Queries of up to 250,000 documents respond in under 30 ms; and up to 1,000,000 in under 40 ms. It demonstrates that response times increase at a significantly slower rate than document growth.
Rpsu (326 ppm) rpsu@mas.to
2024-10-27

Wouldn't it be nice if PDF files could contain the unhyphenated word even when the word is hyphenated across a line break? leader-<newline>ship vs. leadership, as an example.

Regardless, I praise OCR (optical character recognition), which makes photocopied pages searchable. If the text is not OCR'd yet, you can run #OCR and make the PDF searchable! There are several online tools, and several #MacOS command-line tools via #Homebrew, too.
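
For the hyphenation case, one workaround is to post-process the extracted text rather than the PDF itself. Here is a small sketch, assuming the text has already been extracted to a plain string; `dehyphenate` is a hypothetical helper, not part of any PDF tool.

```python
import re

def dehyphenate(text):
    # Join a word split by a hyphen at a line break, so that
    # "leader-" + newline + "ship" becomes "leadership".
    return re.sub(r"(\w)-\r?\n\s*(\w)", r"\1\2", text)
```

The catch is that a genuinely hyphenated compound that happens to break at its hyphen also loses it, so "full-" + newline + "text" becomes "fulltext". A dictionary check would be needed to tell the two cases apart.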

#pdf #search #fulltext

2023-12-02

Ha! #Fulltext #Search in the Fediverse doesn't work. Especially regarding URLs of climate papers, I find this very, very sad. But now I found a way to make at least those posts searchable that follow my new rule: replace https:// with a #, or just do the whole shebang with a #?
Testing this URL pnas.org/doi/full/10.1073/pnas

#https://www.pnas.org/doi/full/10.1073/pnas.2019672118

#www.pnas.org/doi/full/10.1073/pnas.2019672118

Hope the / aren't a party stopper for my new rule.

Edit: tested it. Doesn't work. probably the / . Grmpf.

#<FirstAuthor> would work. Okay. Then I'll do that for now.
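
A likely reason the URL variants fail can be sketched with a deliberately simplified hashtag rule. This assumes a hashtag is `#` followed by letters, digits, and underscores only. It is not Mastodon's actual pattern, and `#Smith2021` is a made-up author tag.

```python
import re

# Simplified assumption: a hashtag ends at the first character that is
# not a letter, digit, or underscore.
HASHTAG_RE = re.compile(r"#(\w+)")

def extract_hashtags(text):
    return HASHTAG_RE.findall(text)
```

Under this rule `#www.pnas.org/doi/full/10.1073/pnas.2019672118` yields only the tag `www`, because the tag already stops at the first `.`, before any `/` is reached. A plain `#Smith2021` survives intact.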

2023-10-23

Hi @buercher. Now that #Mastodon 4.2 includes the option to "Include Public Posts In Search results" right in the user config, are you planning to use the *indexable* flag for #TootFinder, instead of relying on tags in the profile description? #FullTextSearch #Fulltext #Search

2023-10-11

Now that we have proper search permissions in 4.2, is anyone working on a full fediverse search engine like @r000t did with stealthward.xyz? I don't mean just searching the posts that your server knows about, I mean a separate project to search posts from all known good servers. #FullTextSearch #Mastodon #Fulltext #Search

2023-09-28

People on social.sdf.org (really wish I could do an instance-only post, one of the reasons I'm moving to a glitch instance soon™ - ironically where I have the same issue, so the question is also for people on tilde.zone):

Does #fulltext #search work here?
I cannot get any result, no matter what I search. I imagine the opt-in rate might be small but...something? Anything?

2023-09-21

wandering.shop and a lot of other mastodon instances now support #fulltext #search!

If your server is updated to 4.2.0 and you want to opt into full text search:

1. Log into your server's website
2. Click your profile icon
3. Click "Edit Profile"
4. Click tab "Privacy & Reach"
5. Tick box "Include Public Posts In Search results"
6. Click "Save Changes"
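
The same opt-in can probably be scripted. The sketch below only builds the request and sends nothing. It assumes the `indexable` parameter of Mastodon's `update_credentials` endpoint (version 4.2.0 or later) and an access token with the `write:accounts` scope. The helper name, instance URL, and token are placeholders.

```python
import urllib.parse
import urllib.request

def build_indexable_request(instance_url, access_token, indexable=True):
    # Assumed endpoint and parameter; check your server's API docs first.
    body = urllib.parse.urlencode(
        {"indexable": "true" if indexable else "false"}
    ).encode()
    return urllib.request.Request(
        f"{instance_url}/api/v1/accounts/update_credentials",
        data=body,
        method="PATCH",
        headers={"Authorization": f"Bearer {access_token}"},
    )

# Placeholder values; nothing is sent until urllib.request.urlopen(req) is called.
req = build_indexable_request("https://example.social", "YOUR_TOKEN")
```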

mstdn.social/@feditips/1111042

2023-09-02

before i forget. #fulltext #search on #mastodon is amazing.

Éric Freyssinet ericfreyss
2023-08-27

Full-text search (over the content of toots) is available in the very latest version of the software (currently in beta; rollout to the stable branch is planned for September, according to @renchap).

If your server has it, go to your profile settings, open the "Privacy and reach" tab, and enable it in the "Search" section (it is opt-in only by default).

Mastodon.social enabled it during last night's update: mastodon.social/settings/priva

FlorianFfangohr
2023-08-26

is now working for me on mastodon.social in Ivory. YMMV with your server and client.

2023-08-25

Looks like Mastodon v4.2.0 is adding opt-in #fulltext search, but that functionality is up to individual instance owners and the account posting information to make themselves searchable. Probably a benefit for the #weather and #emergency tooters over here who are trying to get information out to anyone. I'll turn on searchability for all the bots.

2023-08-12

I’m not really following or otherwise knowledgeable about the details/issues of the design discussions around the upcoming #mastodon #fulltext #search implementation, but having the ability to limit opt-in to “covenant compliant” servers would be a powerful way to reinforce that covenant.

2023-08-01

Find the full text here. Please note that this is just OCR but the document is tagged wrong so acrobat and related tools think it's been processed correctly and all you see is graphics. I pulled this using JAWS. dropbox.com/scl/fi/tifk7aj7533 #accessibility #TrumpIndictment #fulltext
