Okay, back-of-the-napkin math:
- There are probably 100 million sites and 1.5 billion pages worth indexing in a #search engine
- It takes about 1TB to #index 30 million pages.
- We only care about text on a page.
I define a page as worth indexing if:
- It is not a FAANG site
- It has at least one referrer (no DD Web)
- It's active
So this means we need roughly 40-50TB of fast storage to make a good index of the internet. That's not "runs locally" sized, but it is nonprofit sized.
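Quick sanity check on that total (plain arithmetic from the two estimates above; the ~40TB figure used below is the same math with the per-page size rounded down to ~30KB):

```python
# Napkin check: 1 TB per 30M pages, 1.5B pages worth indexing.
bytes_per_page = 1e12 / 30e6        # ~33 KB per page
pages = 1.5e9
total_tb = pages * bytes_per_page / 1e12

print(f"~{bytes_per_page / 1e3:.0f} KB per page")   # ~33 KB
print(f"~{total_tb:.0f} TB for the whole index")    # ~50 TB; ~30 KB/page gives ~45 TB
```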
My size assumptions basically come down to storing, per page:
- #URL
- #TFIDF information
- Text #Embeddings
- Snippet
We can store a page's index entry in about 30KB. So with 40TB we can store a full internet index. That's about $500 in storage.
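For concreteness, here's one way a ~30KB-per-page record could break down. The split between fields is my guess for illustration (only the ~0.5KB quantized embedding is implied by the RAM numbers below), and the per-TB price is an assumed bulk hard-drive figure:

```python
from dataclasses import dataclass

# Hypothetical per-page index record. The ~30 KB total is the post's figure;
# the per-field sizes are illustrative assumptions, not measurements.
@dataclass
class PageRecord:
    url: str            # ~0.1 KB
    tfidf_terms: bytes  # top terms + weights, ~4 KB (assumed)
    embedding: bytes    # quantized text embedding, ~0.5 KB (see RAM math below)
    snippet: str        # extracted page text for snippets, ~25 KB (assumed; read from disk at query time)

assumed_kb_per_page = 0.1 + 4 + 0.5 + 25              # ~30 KB
pages = 1.5e9
total_tb = pages * assumed_kb_per_page * 1e3 / 1e12   # ~44 TB
usd_per_tb = 13                                        # assumed bulk hard-drive price
print(f"~{total_tb:.0f} TB, ~${total_tb * usd_per_tb:.0f} in drives")  # ballpark of ~40 TB / ~$500
```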
Access time becomes a problem. TF-IDF for the whole internet can easily fit in RAM. Even with #quantized embeddings, though, you can only fit about 2 million per GB of RAM.
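For context, the 2-million-per-GB figure implies roughly 512 bytes per quantized embedding (the exact encoding, e.g. a product-quantized code or an int8 vector, is my assumption). The RAM math then lines up neatly with the six-machine setup below:

```python
# RAM math for quantized embeddings; ~512 bytes per vector is the assumption
# that makes "2 million per GB" come out right.
bytes_per_embedding = 512
per_gb = 1e9 / bytes_per_embedding                   # ~2 million embeddings per GB
pages = 1.5e9
ram_gb = pages * bytes_per_embedding / 1e9           # RAM to hold every embedding

print(f"~{per_gb / 1e6:.0f}M embeddings per GB of RAM")
print(f"~{ram_gb:.0f} GB to hold all 1.5B embeddings")  # 768 GB = six 128 GB machines
```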
Assuming you had enough RAM, it could be fast: TF-IDF to get 100 million candidates, #FAISS to sort those, load snippets dynamically, potentially adjust rank by referrers, etc.
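Here's a minimal, toy-scale sketch of that two-stage pipeline (scikit-learn's TfidfVectorizer for recall, FAISS for the re-rank; the documents, random stand-in embeddings, and tiny candidate counts are placeholders, not the real 1.5B-page setup):

```python
import numpy as np
import faiss
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "how to bake sourdough bread at home",
    "sourdough starter feeding schedule",
    "framework desktop review and benchmarks",
    "building a nonprofit web search index",
]
snippets = docs  # stand-in; real snippets would be loaded from disk on demand

# Stage 1: TF-IDF recall via sparse dot products.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

def tfidf_candidates(query, k=3):
    scores = (tfidf @ vectorizer.transform([query]).T).toarray().ravel()
    return np.argsort(scores)[::-1][:k]

# Stage 2: re-rank the candidates by embedding similarity with FAISS.
# Random vectors stand in for real (quantized) text embeddings.
dim = 64
rng = np.random.default_rng(0)
doc_vecs = rng.standard_normal((len(docs), dim)).astype("float32")
faiss.normalize_L2(doc_vecs)

def search(query, k=2):
    cand = tfidf_candidates(query)
    sub = faiss.IndexFlatIP(dim)                 # inner product = cosine after L2-normalizing
    sub.add(doc_vecs[cand])
    qv = rng.standard_normal((1, dim)).astype("float32")  # stand-in query embedding
    faiss.normalize_L2(qv)
    scores, idx = sub.search(qv, min(k, len(cand)))
    # (a referrer-count boost could be folded into these scores here)
    return [(int(cand[i]), snippets[cand[i]]) for i in idx[0]]

print(search("sourdough bread"))
```

At full scale the flat index would be swapped for a quantized one (e.g. IVF-PQ) and the candidates would come from a proper inverted index rather than a dense TF-IDF matrix.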
Six 128GB #Framework #desktops, each with 5TB drives (plus one Raspberry Pi to merge the final candidates from the six machines), are enough to replace #Google. That's about $15k.
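The Raspberry Pi's job in that setup is just a k-way merge of six short, already-sorted result lists, which is trivial. A sketch of that step (shard results and names are hypothetical):

```python
import heapq
import itertools

def merge_shard_results(shard_results, k=10):
    # Each shard returns its own top-k as (score, doc_id), sorted descending;
    # the coordinator only has to merge six short lists.
    merged = heapq.merge(*shard_results, key=lambda pair: pair[0], reverse=True)
    return list(itertools.islice(merged, k))

shard_a = [(0.94, "doc_17"), (0.80, "doc_3")]
shard_b = [(0.91, "doc_42"), (0.77, "doc_8")]
print(merge_shard_results([shard_a, shard_b], k=3))
# [(0.94, 'doc_17'), (0.91, 'doc_42'), (0.8, 'doc_3')]
```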
In two to three years this will be doable on a single machine for around $3k.
By the end of the decade it should run as an app on a powerful desktop.
Three years after that it can run on a #laptop.
Three years after that it can run on a #cellphone.
By #2040 it's a background process on your cellphone.