#GeneralWebSearch

Doc Edward Morbius ⭕​dredmorbius@toot.cat
2023-06-19

@darnell General Web Search is ... sort of its own thing. That's manageable through robots.txt or permissive / exclusive in-page tags.

(Those will generally prevent content from being presented, but may not prevent crawling, and in the case of on-page headers cannot, by the mechanism through which they work: the spider has to crawl and read the header to determine what's being said.)
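
To make the difference concrete, here is a minimal sketch in Python using only the standard library. The robots.txt rules, the page markup, the bot name "ExampleBot", and the example.org URLs are all made up for illustration. It shows that a robots.txt rule can be checked before a page is requested, while an in-page robots tag is only visible once the page has already been fetched and parsed.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical page carrying an in-page robots tag.
PAGE_HTML = """\
<html><head><meta name="robots" content="noindex, nofollow"></head>
<body>Hello</body></html>
"""


def allowed_by_robots_txt(robots_txt, user_agent, url):
    """Check a URL against robots.txt rules without fetching the URL itself."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def meta_robots_directives(html):
    """Return in-page robots directives; the page body must already be in hand."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# robots.txt can be consulted before any page request is made.
print(allowed_by_robots_txt(ROBOTS_TXT, "ExampleBot", "https://example.org/private/page"))
print(allowed_by_robots_txt(ROBOTS_TXT, "ExampleBot", "https://example.org/public/page"))

# The in-page tag is only readable after the page has been crawled.
print(sorted(meta_robots_directives(PAGE_HTML)))
```

Either way, compliance is voluntary: nothing in this sketch stops a crawler that simply skips both checks.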

There are groups such as the #ArchiveTeam who explicitly ignore robots.txt: wiki.archiveteam.org/index.php

Then there's the somewhat newly recognised issue of AI LLM training data and derived works.

Other than those, what is your threat model here?

  • What risks do you see?
  • What are you trying to avoid?
  • What would you specifically like to see?

My view is that online content is ... online. It's published, in the sense of public. If you want closed content you need to find some way of disclosing to a limited group. That has tremendous impacts on reach and influence.

That is contrasted with community and interaction, and a Fediverse which is crawled by Google is very different from one that is interfaced by Google and Facebook, parallel with their existing social networks (FB, Instagram, YouTube, Blogger, say).

@jwildeboer

#Meta #Metablock #DefederateMeta #ThreatModels #Risk #GeneralWebSearch #LLM #ArtificialIntelligence #TrainingData

Doc Edward Morbius ⭕​dredmorbius@toot.cat
2023-06-15

It's not often that I get to point out Internet Monopolies being sharecroppers.

The real dish here is that Reddit was one of the few domains in which the ad-fed #enshittification and #SidamTouch (ad-centric media turns everything to shit, reverse of Midas) wasn't ... overly dominant.

And now, courtesy of mismanagement by #spez, #Reddit, #AdvanceMedia, and the Reddit board, #GeneralWebSearch, which has been in a death spiral for years, is suddenly getting far, far worse.

I've commented multiple times that I rely far more on traditional media (mostly books and magazine articles) these days than the Web. Sites/services such as #SciHub, #LibraryGenesis, and #ZLibrary have been absolutely vital for this, and despite much of the online world getting markedly worse, these are bright spots.

(Internet Archive, Wikipedia / Wikimedia, Project Gutenberg, and a handful of other sites/services are among the other bright spots which happen to operate inside the law, though the fact that useful sites have to violate law says a hell of a lot about how corrupt and societally-failing the law is these days.)

My #ResearchMethods for #ContentDiscovery now are based strongly on library research techniques I'd learned in the 1980s: research topics of interest, find major works and the authors of those works, read those, and if the same names or works keep turning up then find and read those. I'll also make heavy use of podcasts, especially those reviewing books and/or interviewing authors (particularly on academic topics), most notably the #NewBooksNetwork.

This may not lead you to truth, but it will virtually always point you to the foundations of present understanding and orthodoxy.

Truly principled authors will note conflicting / contradictory viewpoints --- #PatrickOphuls is excellent in this regard. Even unprincipled authors will often point out key voices in opposition to them, though usually by trash-talking and belittling them. (I'd found a wonderful example of this in a Reason review of Conway & Oreskes's latest book, The Big Myth.)
