#FulltextSearch

2025-06-19
@Ben Pate 🤘🏻 Allow me to take a look at this from a Hubzilla/(streams)/Forte point of view.

The Sin of Overwhelming Complexity: Instance Selection Paralysis


The only way to really combat this effectively is by hiding the whole concept of servers/instances at first, railroading everyone to a server and only letting them know about decentralisation and servers/instances after the fact.

In theory, this could be doable with Hubzilla, (streams) and Forte, and even better than with Mastodon with its themed servers. It wouldn't make sense to offer Hubzilla, (streams) or Forte servers for certain topics or target audiences, seeing as the whole thing would become moot the very moment when you make your first clone on another server. Simply build a kind of "automatic on-boarder" that sends everyone to the geographically closest open-registration server.

In practice, that'd be a bad idea, but for a different reason than on Mastodon: these servers tend to be very different from one another. Not in topic. Not in target audiences. Not in rules. But in features. Hubzilla is modular, (streams) is modular, Forte is modular, and each admin decides differently on which "apps" to activate. So you may want to join Hubzilla for one cool feature, but the on-boarder railroads you to a server where that very feature isn't even activated.

Sure, the on-boarder could include the option to select certain features that you absolutely must have in your new home and then pick a server that has them. But that'd be extra hassle and extra confusing.
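A minimal sketch of what such a feature-aware on-boarder might do, in Python. Everything in it is hypothetical: the `Server` record, the app names, the host names and the coordinates are made up for illustration, and it assumes a registry of open servers and their activated apps that nothing above says exists.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt


@dataclass(frozen=True)
class Server:
    """Hypothetical registry entry; no such public registry is assumed to exist."""
    name: str
    lat: float
    lon: float
    open_registration: bool
    apps: frozenset  # "apps" the admin has activated


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance via the haversine formula."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def pick_server(servers, lat, lon, required_apps=frozenset()):
    """Return the closest open server that has every required app, or None."""
    candidates = [
        s for s in servers
        if s.open_registration and required_apps <= s.apps
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda s: distance_km(lat, lon, s.lat, s.lon))


# Made-up example data.
servers = [
    Server("hub-a.example", 52.5, 13.4, True, frozenset({"wiki", "calendar"})),
    Server("hub-b.example", 48.1, 11.6, True, frozenset({"calendar"})),
    Server("hub-c.example", 48.2, 11.5, False, frozenset({"wiki", "calendar"})),
]
```

Without any required apps, the closest open server wins. As soon as a must-have feature is selected, the newcomer may be sent much farther away, or nowhere at all, which is exactly the extra hassle described above.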

Besides, where'd you put that on-boarder? On the official Hubzilla website? Haha, no can do. The official Hubzilla website is a webpage on a Hubzilla channel itself. It's all just dumb old static HTML with CSS. If it's even HTML and not Markdown or BBcode, that is. You couldn't add scripts to it if you tried.

Oh, and (streams) and Forte don't even have official websites. And (streams) will never have one, seeing as it's officially and intentionally nameless, brandless and totally not even a project. Their "websites" are readme files in their code repositories on Codeberg.

The Sin of Inconsistent Navigation: Timeline Turmoil


The streams on Hubzilla, (streams) and Forte are quite a bit different from Mastodon timelines.

First of all, what you usually don't have on public servers is the counterpart to Mastodon's local timeline and Mastodon's federated timeline. On all three, this would be only one stream, the "public stream" or "pubstream". It can be switched by the admin to either what'd be local or what'd be federated. However, public servers usually have it off entirely. Unavailable even to local users. That's because the admins don't want to be held liable for what's happening on the pubstream.

Technically speaking, you only have one stream on a public server, and that's your channel stream. It's much more efficient than a Mastodon timeline because it always shows entire conversations by default instead of detached single-message piecemeal, and because it has a counter for unread messages which even lists these unread messages for you to directly go to the corresponding conversation. But that's another story.

However, your channel stream can be viewed on your channel page, conversation by conversation, or it can be viewed on the stream page as an actual stream with all conversations shown in a feed/timeline-like fashion, one upon another, and with its own set of built-in filters such as "only my own messages" or "only conversations started by members of one particular privacy group/access list" or "only conversations from one particular group actor". It's actually much more convenient than any Mastodon timeline, but for those who want a Twitter clone for dumb-dumbs, it can be very overwhelming.

Yes, Hubzilla, (streams) and Forte are much more complex in handling than, say, snac2. But they're also much more complex in features than snac2. That power is their USP. And that power must be harnessed somehow.

The Sin of Remote Interaction Purgatory: Federation Gymnastics


Sure, Hubzilla, (streams) and Forte have some of the best built-in search systems in the whole Fediverse. They can pull almost everything onto your channel stream just by searching for it. And if it has replies, chances are they pull these in as well.

But still, they're geared towards desktop users. They still require copy-paste. Phone users don't copy-paste. Most of them don't even know the very concept of copy-paste. For most of those who do, copy-paste is much too fumbly if the input device available to them is a 6" touch screen.

You can't blame them, though. This is next to impossible to do any differently. I mean, you won't see a button magically appear with which you can pull in just that one post or comment you want to pull in.

Rather, the issue is that they can only reel in almost everything. Sometimes the search returns nothing, like a void. Sometimes the search runs indefinitely without any kind of result. This may be because someone has blocked your channel, because someone has blocked your entire server, because the server someone is on has blocked you or your entire server, because Hubzilla/(streams)/Forte doesn't understand the URI pasted into the search field or whatever.

So this is made worse by Hubzilla, (streams) and Forte not knowing what they can search for, what they can't and why not.

Connecting with someone whom you encounter on your channel stream is fairly easy. Connections can be initiated with only two clicks. Either you click their long name, and you're taken to a pretty much distraction-less local "intermediate page" with a striking green button that's labelled "+ Connect". Or if you don't want to leave the channel page, you hover your mouse cursor over their profile picture, click on the little white arrow that appears, and you get a small menu that offers you the "Connect" option as well. Granted, even some veterans don't know the latter trick because it isn't immediately advertised on the channel page.

Also, sure, you don't simply follow them right off the bat with nothing else to do like on Mastodon. You're taken to your Connections page, and you have to configure the connection (you don't have to do that on Mastodon because you can't configure connections on Mastodon).

Following accounts/channels from the directory is a bit easier. The green "+ Connect" button is there right away (unless you're already connected). However, Hubzilla's directory only lists channels based on the Nomad protocol, i.e. Hubzilla and (streams) channels, because ActivityPub is only implemented in an optional, off-by-default-for-new-channels add-on whereas it's in the core and on by default on (streams) and the only available protocol on Forte.

Importing contents or following actors when seeing them locally on other servers without copy-pasting and searching can be done. It requires OpenWebAuth magic single sign-on, however, and it requires it to be implemented on all servers of all Fediverse server applications from Mastodon to WordPress to Ghost to Flipboard. Hubzilla, (streams) and Forte are the only Fediverse server applications with full (client-side and server-side) OpenWebAuth implementations. But that's of little use if the rest of the Fediverse doesn't have server-side implementations, and Mastodon has even silently rejected a mere client-side implementation that had already been developed into a pull request two years ago.

The Sin of DM Disasters Waiting to Happen


I think this is less of an issue on Hubzilla, (streams) and Forte because they handle DMs differently from Mastodon (which "the Fediverse" actually refers to in the article).

On all three, DMs are integrated into their extensive, fine-grained permissions system in which everything is only public if it's really public. The difference between a post and a DM is not just a switch.

If I want to DM you, I can either tag you @!{benpate@mastodon.social} rather than @[url=https://mastodon.social/@benpate]Ben Pate 🤘🏻[/url]. Then you're a) the only one to whom the message is sent (it literally doesn't even go out to any other server than mastodon.social plus my clone on hub.hubzilla.de as can be seen in the delivery report) and b) the only one who is granted permission to view the message.

Or I can use the padlock icon and select you from the opening list as the sole recipient. The very moment that I select certain recipients, the post I'm composing quits being public, and the padlock icon switches from open to closed. This isn't a one-click or two-click toggle. You don't do that casually. It's basically configuration. It requires so many mouse clicks that you do it consciously and intentionally. If you want to post in private, you have to really want to post in private.

Better yet: You can default to posting only to a certain limited target audience. In fact, by default on a brand-new channel, you only post to the members of one privacy group/access list (which is a Mastodon list on coke and 'roids). You have to manually reconfigure your new channel if you want to post to the general public by default.

If you preview your post, you can see whether it's a direct message to one or multiple single connections (envelope icon next to your long name), a limited-permissions message to one or multiple privacy groups/access lists/group actors (closed padlock icon) or actually public (no icon).

Even better yet: Posts to group actors generally aren't public. Posts to at least Friendica groups, Hubzilla forums, (streams) groups and Forte groups are never public. They do not go out to your followers as well unless they're connected to the same group. And this is independent from whether a group is public or private. You can't accidentally post to a group actor in public, and if you do, you don't post to that group actor at all, at least not in a way that makes the group actor forward your post to its other connections.

Granted, what does not happen is your background switching from your background colour or background image (which can be user-configured) to red #800000 or a yellow-and-black chevron pattern when you change visibility and permissions to something that isn't public.

The Sin of Ghost Conversations and Phantom Follower Counts


And again, when @Tim Chambers says, "the Fediverse", he almost exclusively means Mastodon. He writes as if the entire Fediverse handled conversations as terribly as Mastodon, as if the entire Fediverse was as blissfully unaware of enclosed conversations as Mastodon. Which is not the case.

Hubzilla, (streams) and Forte, as well as their ancestor Friendica, handle conversations in ways that exceed Mastodon users' imaginations and wildest dreams by magnitudes. Unlike Mastodon, they know threaded conversations, and they see them as enclosed objects where only the start post counts as a post, and everything else counts as a comment.

This means that once you've received a post on your stream, you will also receive all comments on that post, regardless of whether or not you follow the commenters, regardless of whether or not they mention you. That's because all four reel in the comments not from the commenters, but from the original poster who is perceived as the owner of the thread. Only blocks or channel-wide filters can prevent comments from coming in.
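As a rough illustration of that ownership model, here is a minimal Python sketch. The `Thread` class and `comments_for_recipient` are hypothetical names invented for this example and say nothing about how Friendica, Hubzilla, (streams) or Forte store conversations internally. The sketch also leaves out channel-wide filters and only models blocks.

```python
from dataclasses import dataclass, field


@dataclass
class Thread:
    """A conversation as an enclosed object owned by the original poster."""
    owner: str
    start_post: str
    comments: list = field(default_factory=list)  # (author, text) pairs

    def add_comment(self, author, text):
        # Comments are collected by the thread owner, not by each commenter's followers.
        self.comments.append((author, text))


def comments_for_recipient(thread, blocked=frozenset()):
    """Comments that reach anyone who received the start post.

    Following the commenter or being mentioned plays no role here;
    only the recipient's blocks keep a comment out.
    """
    return [(author, text) for author, text in thread.comments if author not in blocked]
```

The point of the design is that the recipient's relationship to the thread owner is the only thing that matters for delivery.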

Beyond that, (streams) was the first to introduce Conversation Containers. Forte inherited them from (streams), and when they were defined in FEP-171b, Hubzilla implemented them, too.

Here on Hubzilla, I can see all comments in this thread because my channel has fetched them directly from @Johannes Ernst. And I can actually see them right away because that's the default view here on Hubzilla, rather than Mastodon's piecemeal.

Even if you import a post manually using the search feature (and you'd better import the actual start post), AFAIK existing comments will eventually be backfilled. Comments that come in after importing will definitely end up on your stream as part of the thread.

So this is not a shortcoming of the Fediverse. The Fediverse has been able to do better for 15 years. It's a shortcoming of Mastodon.

The only "issue" here may be that it sometimes takes some time for a comment to show up for some reason. But unless there are blocks or filters in play, it eventually will.

The Sin of Invisible Discovery: The Content Mirage


I'm not going to pick on the audacious implication that "Eugen and team" invented the Fediverse.

But Tim writes like literally everyone wants "the Fediverse" (read, actually Mastodon) to be literally Twitter without Musk.

Also:
  • Friendica has had full-blown full-text search since its inception as early as 2010. Five and a half years longer than Mastodon has even existed.
  • Hubzilla has had full-blown full-text search since its inception as early as 2011 when it was forked from Free-Friendika. It has inherited full-text search from Friendica.
  • (streams) and Forte have had full-blown full-text search since their respective inceptions in 2021 and 2024, both having inherited it in turn.

Oh, and none of them has an explicit opt-in switch to soothe panicking Twitter converts because panicking Twitter converts have never been the primary target audience of either of them.

Instead, on Hubzilla, whether someone can find your content depends on whether they've got permission to view it in the first place ("Can view my channel stream and posts"). If it's public, they have it. Full stop. Public is public is public. Stop whining. You've made it public, now deal with everything being able to see it.

(streams) and Forte behave the same. In addition, they have an extra permission: "Grant search access to your channel stream and posts". This controls who may search your channel stream using your own local search feature while visiting your channel locally. Something that isn't even possible on Mastodon.
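A minimal sketch of how these two permissions could interact, in Python. The three permission levels and the `Channel` record are simplifying assumptions made for illustration; the real permission systems are far more fine-grained.

```python
from dataclasses import dataclass, field

# Simplified, assumed permission levels.
PUBLIC = "public"            # anyone
CONNECTIONS = "connections"  # only approved connections
NOBODY = "nobody"            # only the channel owner


@dataclass
class Channel:
    owner: str
    connections: set = field(default_factory=set)
    view_stream: str = PUBLIC    # "Can view my channel stream and posts"
    search_stream: str = PUBLIC  # "Grant search access to your channel stream and posts"


def _allowed(channel, viewer, level):
    if viewer == channel.owner:
        return True
    if level == PUBLIC:
        return True
    if level == CONNECTIONS:
        return viewer in channel.connections
    return False


def can_view(channel, viewer):
    return _allowed(channel, viewer, channel.view_stream)


def can_search_locally(channel, viewer):
    # Searching presupposes being allowed to view the stream in the first place.
    return can_view(channel, viewer) and _allowed(channel, viewer, channel.search_stream)
```

With `view_stream` public and `search_stream` restricted to connections, a stranger can still read the stream but cannot run a local search over it.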

As for not having any content on my channel stream before I connect to anyone: I, for one, do not want some algorithm to force content upon me that I'm not interested in. Full. Frigging. Stop. I want to have full and exclusive control over what I see and what I don't.

The Sin of User Discovery Hell


Can it really be that Mastodon's directory is so much worse than Friendica's, Hubzilla's, (streams)' and Forte's directories? I guess it is because it really only lists local accounts on that one particular server. A side-effect of Mastodon being a microblogging service and Twitter clone. And not a full-blown, fully-featured social network and Facebook alternative. No, seriously, it isn't that.

Friendica is. It was designed as such. It was designed to take Facebook's place, and not by aping and cloning Facebook, but by being better than Facebook.

The directory on each node is decentralised. It lists all actors known to that node. What's outright unimaginable from a Mastodon point of view: It takes the keywords in the profiles into account. Better even: It ranks suggestions by the number of matching keywords.
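Ranking by matching keywords can be sketched in a few lines of Python. This is only an illustration of the idea, not Friendica's actual code; the function name, the case-insensitive matching and the tie-break by name are assumptions.

```python
def rank_by_keywords(my_keywords, profiles):
    """Rank profiles by how many keywords they share with mine.

    profiles: mapping of actor name -> iterable of profile keywords
    Returns (name, match_count) pairs, best match first; actors without
    any shared keyword are left out.
    """
    mine = {k.lower() for k in my_keywords}
    scored = []
    for name, keywords in profiles.items():
        matches = len(mine & {k.lower() for k in keywords})
        if matches:
            scored.append((name, matches))
    # Sort by match count (descending), then by name for a stable order.
    return sorted(scored, key=lambda item: (-item[1], item[0]))
```

An actor who shares three keywords with you ends up above one who shares two, and actors with no overlap are not suggested at all.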

Want something centralised instead? Try the Friendica Directory. Looking for people? Looking for news accounts? Looking for groups? There are specialised tabs for that. Friendica can tell them apart, and so can the Friendica Directory.

Caveat: The Friendica Directory only lists Friendica accounts. Friendica's built-in directory should list everything it knows. I haven't used Friendica in many years, but I guess this even includes diaspora* accounts because why not?

Hubzilla has indirectly inherited its directory from Friendica. This is the directory on Netzgemeinde, the biggest Hubzilla hub.

Again, it lists local as well as federated channels. You can choose whether to see only local channels ("This Website Only") or federated channels as well. You can choose whether channels flagged NSFW shall be listed or not ("Safe Mode"). You can choose to only have group actors listed that let themselves be listed ("Public Forums Only"). You have a cloud of keywords from the keyword lists in the profiles that you can filter by (Mastodon doesn't even have keyword lists in profiles). You have full-text search for names and keywords. There's even a Facebook-style suggestion mode that proposes connections to you with a ranking based on your keywords and their keywords as well as the number of common connections, and that still has the same filters.

Caveat this time: Hubzilla's directory only supports the one sole protocol built into Hubzilla's core. And that's Zot6. This means that Hubzilla's directory only lists Hubzilla and (streams) channels because Hubzilla and (streams) are the only Fediverse server applications that support Zot6.

(streams) and Forte have inherited their directories again. And they probably have the most powerful decentralised directories in the entire Fediverse. I'd give you a link, but (streams) directories generally aren't public; only local channels can access them.

These directories are similar to the ones on Hubzilla. You see local and federated actors, and you can choose to only see local actors ("This Website Only"). You can choose to only see group actors ("Groups Only"). You can choose to not see channels flagged NSFW ("Safe Mode"). What's new: Inactive actors can be kept out, too ("Recently Updated").
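The four toggles can be pictured as simple predicates over directory entries. The Python sketch below is an illustration only: the `Actor` record is hypothetical, and the 90-day cut-off for "Recently Updated" is an assumed value, since the text above does not say what counts as inactive.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Actor:
    """Hypothetical directory entry."""
    address: str
    is_local: bool
    is_group: bool
    nsfw: bool
    last_updated: datetime


def filter_directory(actors, now, *, local_only=False, groups_only=False,
                     safe_mode=True, recently_updated=False,
                     max_age=timedelta(days=90)):
    """Apply the directory toggles; max_age is an assumed cut-off."""
    shown = []
    for actor in actors:
        if local_only and not actor.is_local:
            continue  # "This Website Only"
        if groups_only and not actor.is_group:
            continue  # "Groups Only"
        if safe_mode and actor.nsfw:
            continue  # "Safe Mode"
        if recently_updated and now - actor.last_updated > max_age:
            continue  # "Recently Updated"
        shown.append(actor)
    return shown
```

The filters combine freely, so "Groups Only" plus "Recently Updated" would list only group actors that have shown recent activity.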

Now it comes: (streams) has ActivityPub built into its core, and it's on by default on new channels. Forte is entirely based on ActivityPub.

This means that their directories can list anything from anywhere that uses ActivityPub. "Groups Only" gives you Guppe groups, Lemmy communities, /kbin and Mbin magazines, PieFed communities, Mobilizon groups, Flipboard magazines, Friendica groups, Hubzilla forums, (streams) groups, Forte groups etc., all on one list.

(streams) has a slight edge over Forte here because it also lists Hubzilla and (streams) channels that have ActivityPub off such as the Streams Users Tea Garden where ActivityPub was turned off with the very intention to keep Mastodon out.

If there was a gigantic Forte server, as big as mastodon.social, and its directory was accessible to the public, that directory would be the best directory in the Fediverse for anything really. If it was on (streams), it would list more, but it would confuse some users of e.g. Mastodon who'd try to follow Hubzilla or (streams) channels that have ActivityPub off. Forte simply doesn't list these because it can't find them.

A global directory of everything sounds like a good idea, but it's next to impossible to implement.

Either the directory would go look for actors itself. In order to do that, it would have to know within a split-second not only whenever a new actor is created somewhere so it can index that actor right away, but also whenever a new server is spun up so that the admin actor can be indexed, and that server can be watched. How is it supposed to know all that?

Well, or the directory, a single, monolithic, centralised website, would have to be hard-coded into all Fediverse server software. That way, each server could immediately report newly created actors to the central directory upon their creation.

For starters, this would make the whole Fediverse depend on one single centralised website under the control of, if bad comes to worse, one person.

Besides, this would be a privacy nightmare. Let's suppose I create a new (streams) channel that's supposed to be private. Its existence and all its properties would be sent to the central directory before I can set it to private and restrict its permissions. This wouldn't be so bad on Hubzilla because I'd make the channel private before I turn on PubCrawl and make the channel accessible to the directory in the first place because the directory would only understand ActivityPub.

Of course, the directory would mostly be built against Mastodon. It would not understand the permissions systems implemented on Hubzilla, (streams) and Forte, and it might happily siphon off the profiles of channels where access to the profile is restricted and make them publicly accessible. On the other hand, this is likely to mean that the directory couldn't read most of Hubzilla's, (streams)' and Forte's profile text fields anyway because Mastodon doesn't have them.

But such a centralised directory wouldn't make connecting to other users that much easier and more convenient. You'd still have to copy and paste URLs or IDs into your local search and search for them (unless you're on Friendica, Hubzilla, (streams) or Forte where you can connect to URLs directly). At the very least, you should be able to go to the centralised directory and follow anyone just by clicking or tapping them. That, however, would require OpenWebAuth support on both your home server and that directory.

Ideally, that directory would be firmly built into all instances of all Fediverse software from snac2 to Mastodon to Hubzilla, even replacing any existing directory to confuse people less. But that would make the Fediverse even more dependent on one central website and its owner, something which should be avoided at all cost.

Lastly, nothing can ever be built into all instances of all Fediverse software. Remember that there's software with living instances that's barely being developed such as Plume. There's even software with living instances that's been officially pronounced dead such as Calckey, Firefish or /kbin. How are Firefish servers supposed to implement such a feature if nobody maintains Firefish anymore, and even the code repository was deleted?

CC: @Risotto Bias

#Long #LongPost #CWLong #CWLongPost #FediMeta #FediverseMeta #CWFediMeta #CWFediverseMeta #Fediverse #Friendica #Hubzilla #Streams #(streams) #Forte #OpenWebAuth #SingleSignOn #NomadicIdentity #Search #FullTextSearch #Directory #Permissions #Privacy #Conversations #ThreadedConversations #FEP_171b #ConversationContainers
Kevin Karhan :verified:kkarhan@infosec.space
2025-04-27

@ezra please complain at @Gargron and other Mastodon devs...

Doug Ortizdougortiz
2025-04-18

Did you know PostgreSQL has its own built-in full-text search engine? 🔎

-- Create a tsvector column with GIN index
ALTER TABLE articles ADD COLUMN search_vector tsvector;
CREATE INDEX articles_search_idx ON articles USING GIN (search_vector);

-- Populate the column; coalesce() keeps a NULL title or content from nulling the whole vector
UPDATE articles SET search_vector =
  to_tsvector('english', coalesce(title, '') || ' ' || coalesce(content, ''));

-- Search
SELECT title FROM articles
WHERE search_vector @@ to_tsquery('english', 'postgresql & search');

Kevin Karhan :verified:kkarhan@infosec.space
2025-04-08

@bogdan complain at #Mastodon #developers so they get #FullTextSearch working and I don't have to use hashtags!

#thxbye #next #EOD

Kevin Karhan :verified:kkarhan@infosec.space
2025-04-06

@sunsetdriver just optimizing #SEO / #findability because #Mastodon #devs can't be assed to implement working #FullTextSearch or any other #QualityOfLife #features...

black lipstick on your flight controlsvyr@princess.industries
2025-03-04
  1. advanced search operators prototype. status: not quite ready for prime time.
    • has a bunch of goofy operators nobody but me will ever use, such as is:article
    • still missing some classics like lang:, domain:, before:, and after:, and some oddballs like is:bot (would require extra join) and sort: (would break ID-based paging)
    • needs docs, although i know where Past Vyr basically already wrote them: https://github.com/VyrCossont/mastodon/pull/8 😇
  2. indexed full text search prototype. status: heretical.
    • only works on PostgreSQL: SQLite's full-text search is much fussier and requires using a "virtual table" and frankly i can't be bothered, at least tonight
    • direct port of https://github.com/VyrCossont/mastodon/pull/3 and has the same limitations: HTML isn't stripped, and media alt text and poll options aren't indexed
    • fixing that would start by adding a tsvector column that concatenates (with record separators? as an array?) the contents of filterableFields for a status, updates it every time the status or its attachments are edited, and GIN-indexes that column
    • ignores the whole issue of matching posts to language tags and language tags to PG text search configurations by assuming that everything is English
    • still massively faster than unindexed ILIKE that vanilla GTS uses
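The fix outlined in the list (strip the HTML, then concatenate the searchable fields into one text to index) might look roughly like this in Python. It is a sketch under assumptions, not GoToSocial code, which is written in Go: the function names are invented, the newline separator is one arbitrary answer to the open "record separators? as an array?" question, and the set of block-level tags is deliberately small.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text content and treats a few block-level tags as word breaks."""
    BLOCK_TAGS = {"p", "br", "div", "li"}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html):
    """Return the plain text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def search_document(content_html, alt_texts=(), poll_options=()):
    """Concatenate everything that should be searchable for one status."""
    fields = [strip_html(content_html), *alt_texts, *poll_options]
    return "\n".join(f for f in fields if f)
```

The resulting string is what would be fed to `to_tsvector` and stored in the GIN-indexed column, and it would have to be rebuilt whenever the status or its attachments are edited.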

edit: fixed a backwards flag in has:media and related operators

#GoToSocial #GTS #FullText #FullTextSearch

2025-02-09

Step 1.
Create an Index
db.personnel.createSearchIndex("idx_personnel_1", {
  mappings: {
    dynamic: false,
    fields: {
      skillset: { type: "string" }
    }
  }
})

Step 2.
Write an aggregation query
db.personnel.aggregate([
  { $search: {
    index: "idx_personnel_1",
    text: {
      query: "Ruby",
      path: "skillset"
    }
  }}
])

black lipstick on your flight controlsvyr@princess.industries
2025-02-08

ok, here you go, updated GTS search patches for 0.18.0rc1. notice how they're on my repo? these are completely unofficial. do not bug anyone but me about them.

  1. improved hashtag search. status: upstreamable, mostly.
    • doesn't require # prefix to search hashtags
    • searches for matches anywhere in a hashtag: Mac now matches VintageMac as well as MacOS
    • includes hashtags when not specifically searching for accounts or statuses, like most Mastodon-compatibles
    • doesn't change existing tag sorting. popularity and/or recency might be more useful
  2. offset paging for searches. status: not upstreamable yet.
    • more compatible: many clients can't do ID paging
    • allows paging hashtag search results: Mastodon API has no concept of IDs for hashtags, so ID paging can't work for those anyway
    • possible performance issues: see comments on why main doesn't have it already. personally, i haven't noticed and i run this instance on a tiny VPS
  3. remove search restrictions. status: heretical.
    • searches any post on your instance (except other accounts' private/direct posts, and accounts that have you blocked)
    • includes public, unlisted, your own private and DM posts, and private and DM posts that are replies to you
    • expanded search is default: revert to standard GTS behavior by adding scope:classic or in:library operator to search query
    • definite performance issues: this means searching more posts! GTS does not use either PG full-text indexes/operators or SQLite full-text virtual tables, and this patch doesn't change that.
    • doesn't include alt text of media attachments, or polls, because main doesn't
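The first two patches are easy to picture in miniature. The Python sketch below is an illustration only, not the patch code (GoToSocial is written in Go); the function name and the default page size are assumptions. It matches the query anywhere in a hashtag, makes the # prefix optional, keeps the incoming tag order, and pages by offset rather than by ID.

```python
def search_hashtags(query, hashtags, offset=0, limit=20):
    """Case-insensitive substring search over hashtags with offset paging.

    The order of `hashtags` is preserved, mirroring "doesn't change
    existing tag sorting".
    """
    needle = query.lstrip("#").lower()
    if not needle:
        return []
    matches = [tag for tag in hashtags if needle in tag.lower()]
    # Offset paging: no per-item ID is needed, which is why it also works for hashtags.
    return matches[offset:offset + limit]
```

Searching for `Mac` this way returns both `VintageMac` and `MacOS`, and a client that cannot do ID paging can simply ask for the next offset.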

i may add more patches to this list in the medium future as i add more functionality to my own instance, for example, date range operators (before:date, after:date), post property operators (has:image,has:poll, has:cw, is:sensitive, visibility:public), threading operators (to:user@instance.tld, is:reply, -is:reply), sort operators (sort:oldest, sort:newest, sort:favs) and maybe PG full-text indexing if i have a really good day (i really don't wanna figure out SQLite's weird shit! someone else do it!)
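For a feel of what parsing such operators involves, here is a minimal Python sketch. It is not the author's implementation: the `parse_query` name, the tuple layout and the rule that unknown `word:value` tokens stay ordinary search terms are all assumptions. The operator names are the ones listed above.

```python
import re

# Operator names mentioned in the posts; anything else with a colon stays a plain term.
KNOWN_OPERATORS = {"before", "after", "has", "is", "visibility", "to", "sort", "scope", "in"}
_TOKEN = re.compile(r"(-?)(\w+):(\S+)")


def parse_query(query):
    """Split a search query into plain terms and (name, value, negated) filters."""
    terms, filters = [], []
    for token in query.split():
        match = _TOKEN.fullmatch(token)
        if match and match.group(2) in KNOWN_OPERATORS:
            negated, name, value = match.groups()
            filters.append((name, value, bool(negated)))
        else:
            terms.append(token)
    return terms, filters
```

Checking the operator name against a known set keeps pasted URLs such as `https://example.com` from being mistaken for an operator called `https`.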

randos don't debate me about Fedi search. my clients can't set per-post interaction controls yet so i'll just block you.

#GoToSocial #GTS #FullText #FullTextSearch

2025-01-29

It looks like the #Universeodon relay has gone down due to an issue on the host it's running on. Apologies other #MastoAdmin 's out there. It's rebooting now and I need to do some digging to see why the host restarted in the first place. Apologies for any disruption.

It's also taken down our #FullTextSearch on #MastodonAppUK and some other core services on MastodonAppUK so I'm looking into this all now.

2025-01-21

@sylv_a Once people come to Mastodon, how do we encourage them to opt into #FullTextSearch? One reason why I still occasionally use X since moving to PMG in fourth quarter 2022 is that X still lets me #search for what others are saying about a particular topic, even if I fail to #GuessTheHashtag.

sb arms & legssb@metroholografix.ca
2025-01-08

I just came across this interesting project - an #ElasticSearch #OpenSearch replacement written in #rust. RUST! This has got to be an order of magnitude more memory efficient than its #java counterparts.

If so, this could be a game changer for small instances. Opensearch is by FAR the worst - greediest - and most finicky process in my rack.

Is anyone on the #fediverse using this already? I'd love to hear your thoughts!

github.com/quickwit-oss/quickw

#FullTextSearch #NoMoreJavaOnServers

2024-12-16
Kevin Karhan :verified:kkarhan@infosec.space
2024-12-02
:mima_rule: Mima-samamima@makai.chaotic.ninja
2024-12-01

@jukkan@mstdn.social Worse is that a lot of this is self-inflicted (#fulltextsearch becoming "opt-in" in #Mastodon which is ignored in other #fediverse software like #Misskey which indexes everyone's posts anyway), and the #ActivityPub spec has specifically dealt with this issue which is to recursively forward every remote post an instance receives to every other remote instance it knows of. This doesn't appear to be done for as:Public in the cc target. :sagume_think:

2024-11-06

🔍 Tackling complex tasks like full-text search?

On Dead Code, Andrew Atkinson breaks down how search complexities, like tokenizing input and working with lexemes, are inherent to the domain, whether you're using Elasticsearch or Postgres. These challenges come with the territory, but Postgres offers powerful tools to keep it all under one roof.

See how Postgres handles these specialized data tasks! 👉 shows.acast.com/dead-code/epis #Postgres #FullTextSearch #DatabaseComplexity #DeadCodePodcast
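To make "tokenizing input and working with lexemes" concrete, here is a toy Python sketch of the pipeline that any full-text engine performs in some form. It is not how Postgres or Elasticsearch implement it: the stop-word list is arbitrary and the suffix stripping is a deliberately crude stand-in for a real stemmer.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "with"}


def to_lexemes(text):
    """Tokenize, lowercase, drop stop words and strip a few suffixes."""
    lexemes = []
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if token in STOP_WORDS:
            continue
        for suffix in ("ing", "es", "s"):
            # Only strip when at least three characters remain.
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                break
        lexemes.append(token)
    return lexemes


def build_index(documents):
    """Build an inverted index: lexeme -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for lexeme in to_lexemes(text):
            index[lexeme].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every lexeme of the query (AND)."""
    lexemes = to_lexemes(query)
    if not lexemes:
        return set()
    return set.intersection(*(index.get(lexeme, set()) for lexeme in lexemes))
```

Queries and documents must go through the same normalisation, otherwise "indexes" in a document would never match "index" in a query. That requirement holds regardless of which engine does the work.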

2024-11-05

What is he seeking in a distant land? How to find the meaning of life with PostgreSQL

This article grew out of a pair of lectures I gave to students as part of a course on machine learning. Why PostgreSQL in particular? Why vectors? Over the past two years, the topic of language models has become incredibly popular, and with it has come a multitude of tools accessible even to a novice engineer who wants to get acquainted with the world of text analysis. The availability of these technologies opens up boundless possibilities for applying them in all kinds of fields: from knowledge management systems to "copilots" that help analyse patients' medical histories more thoroughly, or information kiosks that let you put together the ideal basket of goods for a picnic. This work can hardly boast completeness or depth, but I hope it provides those "good" entry points that, as you dive into the details, let you discover many new interesting and useful topics for research and engineering projects. Let's uncover the hidden meanings

habr.com/ru/articles/855712/

#postgresql #postgres #pgvector #vectorization #fulltextsearch #fulltext_search #hnsw #python #java #Knowledge_Management_Systems

2024-10-24

Faster than a bullet: how to find happiness with PostgreSQL

In this article we explain how to implement full-text search efficiently with PostgreSQL. Learn how to improve the speed and accuracy of searching text data using tools such as tsvector, tsquery and GIN indexes, and how these capabilities can significantly boost your application's performance.

habr.com/ru/articles/853124/

#fulltextsearch #полнотекстовый_поиск #postgresql #gin #базы_данных #базы_данных #индекс #индексация

Kenneth J. Jaegerkjjaeger@fosstodon.org
2024-07-27

Is there such a thing as using too many #hashtags in #Mastodon or in the #fediverse in general? I guess if you are doing it on every word that might be considered #spam or could violate your instance's rules. Perhaps I am worried over #netiquette too much, but as long as #FullTextSearch is not the default in Mastodon, I will tend toward using more hashtags rather than fewer.

2024-07-25

ElasticSearch: searching for a sequence in text

Hi! This is Arkady from T-Bank. We are still building TQM, and in this article I'll show how we solved the problem of searching for sequences in the text of communications. It works both on simple chains of phrases in order and on complex cases involving phrase timing and the "client/operator" channel. We still work with ElasticSearch, leaving room to bolt things like RAG, LLMs and other fashionable technologies onto text search. A few constraints for today's task: - Query complexity grows non-linearly as the number of phrases increases, so our limit is 4. - We chose a timing step of 5 seconds. After each phrase we place a timestamp, or several timestamps if the phrase took longer than 5 seconds. Making the step too fine would allow more precise searching, but it would clutter our field with timestamps. This seems to be the point where it is better to agree on requirements in advance. And now for the most interesting part. Read on below the cut!

habr.com/ru/companies/tbank/ar

#elasticsearch #fulltextsearch #полнотекстовый_поиск
