Dan Luu
Dan Luu (@danluu)
2025-05-26

Interesting story about Google publishing someone's phone number in search results for them after they gave the number to Google for account verification/security:

danq.me/2025/05/21/google-shar

Reminds me of the time a company I worked for (AFAIK) accidentally used phone numbers obtained the same way for ad targeting and got fined $150M

After the first three such calls this month, I was really starting to wonder what had changed. Had we accidentally published my phone number, somewhere? So when the fourth tech support call came through, today (which began with a confusing exchange when I didn’t recognise the name of the caller’s charity, and he didn’t get my name right, and I initially figured it must be a wrong number), I had to ask: where did you find this number?

“When I Google ‘Three Rings login’, it’s right there!” he said.
Dan Luu (@danluu)
2025-04-28

Exercises in benchmarking and experimental design, part 4:

patreon.com/posts/127627543

19. You have a Sandy Bridge processor. You measure the memory latency and find that a read from memory takes 60ns. We have the following loop, where ITERS is some large number:

  for (size_t i = 0; i < ITERS; i += CACHE_LINE_SIZE) {
      cnt += buf[i];
  }

A. What parameters do we need to know to estimate how long this loop will take to run (assuming everything is warmed up and/or we're observing some kind of steady-state behavior), and what are their values? [answer at the end to avoid spoilers; feel free to skip to the end if you don't want to search up these parameters]

B. How fast does the loop run?

C. Without trying it out, how accurate do you think ChatGPT, Gemini, etc., are at answering this question?
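
If you want to try part B empirically before checking the answers, here's a minimal sketch of a timing harness for the loop, assuming Linux-style clock_gettime and a 64-byte cache line; the 1 GiB buffer, the page-touching pass, and the volatile sink are my additions, not part of the original exercise, and running this measures the answer rather than spoiling the reasoning behind it.

  #include <stdio.h>
  #include <stdlib.h>
  #include <stdint.h>
  #include <time.h>

  #define CACHE_LINE_SIZE 64
  /* Assumption: 1 GiB, far larger than the last-level cache, so reads go to DRAM */
  #define ITERS (1ULL << 30)

  int main(void) {
      uint8_t *buf = malloc(ITERS);
      if (!buf) return 1;
      /* touch every page up front so we time memory reads, not page faults */
      for (size_t i = 0; i < ITERS; i += 4096) buf[i] = (uint8_t)i;

      uint64_t cnt = 0;
      struct timespec t0, t1;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (size_t i = 0; i < ITERS; i += CACHE_LINE_SIZE) {
          cnt += buf[i];
      }
      clock_gettime(CLOCK_MONOTONIC, &t1);
      volatile uint64_t sink = cnt; /* keep the loop from being optimized away */
      (void)sink;

      double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
      printf("%.2f ns per cache line over %llu lines\n",
             ns / (double)(ITERS / CACHE_LINE_SIZE),
             (unsigned long long)(ITERS / CACHE_LINE_SIZE));
      free(buf);
      return 0;
  }

Compile with something like cc -O2 membench.c (the filename is made up) and run on an otherwise idle machine; the interesting part of the exercise is predicting the per-line number before you run it, which is exactly what questions A and B are probing.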
Dan Luu boosted:
John Regehr (@regehr)
2025-04-22

undefined behavior is pretty well understood at this point, but a piece of the puzzle that has always been missing is "how well could a compiler like LLVM optimize, without leaning on UB"

here's a very cool new paper that takes a crack at answering this, for LLVM:

web.ist.utl.pt/nuno.lopes/pubs

Dan Luu (@danluu)
2025-04-02

@ericswpark It's already the case that, without any kind of proof of work check, a lot of sites take 5s+ to load on a slow Android device.

I'm not saying companies are going to get serious about this, but if they were to: most companies don't seem to care about people who aren't on high-end devices, and I don't see why they would suddenly start caring now.

Dan Luu (@danluu)
2025-04-02

Is one of the big drivers of consumer demand for compute going to be solving the expensive challenges served by proof-of-work proxies?

I'm still semi-banned from reddit (mastodon.social/@danluu/113426) and, ironically, had to start browsing it with a scraper because the scraper has good anti-blocking techniques and I don't, much like how SEO spammers are better at ranking than people who make legit content.

With the resources AI companies have, proof of work can't "win", but it seems like the least ineffective technique?

I'm sad to say that we're following the lead of many others and putting proof-of-work proxies into place to protect ourselves against "AI" crawler bots. Yes, I hate this as much as you, but all other options are currently worse (such as locking us into specific vendors).

We'll be rolling it out on lore.kernel.org and git.kernel.org in the next week or so.
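
For context on what these proof-of-work proxies actually make clients do: the server hands out a random challenge string and a difficulty, the client grinds through nonces until the hash of challenge plus nonce has enough leading zero bits, and the server verifies the winning nonce with a single hash. Here's a toy sketch of that idea; the FNV-1a hash and every name in it are illustrative stand-ins (real deployments use a cryptographic hash), not how any particular proxy is implemented.

  #include <stdio.h>
  #include <stdint.h>

  /* Toy 64-bit FNV-1a hash of challenge + nonce; a stand-in for a real
     cryptographic hash, which a serious deployment would use instead */
  static uint64_t pow_hash(const char *challenge, uint64_t nonce) {
      uint64_t h = 14695981039346656037ULL;
      for (const char *p = challenge; *p; p++)
          h = (h ^ (uint8_t)*p) * 1099511628211ULL;
      for (int i = 0; i < 8; i++)
          h = (h ^ (uint8_t)(nonce >> (8 * i))) * 1099511628211ULL;
      return h;
  }

  /* Client side: expected cost is about 2^difficulty hash evaluations */
  static uint64_t solve(const char *challenge, int difficulty) {
      uint64_t nonce = 0;
      while (pow_hash(challenge, nonce) >> (64 - difficulty) != 0)
          nonce++;
      return nonce;
  }

  /* Server side: verifying a solution costs a single hash */
  static int verify(const char *challenge, uint64_t nonce, int difficulty) {
      return pow_hash(challenge, nonce) >> (64 - difficulty) == 0;
  }

  int main(void) {
      const char *challenge = "made-up-session-token"; /* hypothetical value */
      int difficulty = 20;                             /* ~1M expected hashes */
      uint64_t nonce = solve(challenge, difficulty);
      printf("nonce=%llu valid=%d\n",
             (unsigned long long)nonce, verify(challenge, nonce, difficulty));
      return 0;
  }

The asymmetry is the point: solving is expensive, verifying is nearly free, and turning up the difficulty taxes a crawler fetching millions of pages far more than a human loading one, though, as the reply to @ericswpark above notes, the tax also falls hardest on legitimate users with slow devices.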
Dan Luu (@danluu)
2025-04-01

@timrice Interesting idea. I don't have a good intuition for how effective that would be. I think Rotten Tomatoes / Metacritic-style aggregation is fundamentally broken for things that can be objectively evaluated because most reviews are wrong, so you need very unequal review weighting.

In principle, what you're describing could work, but I would guess that, today, LLMs wouldn't be great at evaluating credibility? But it could work fairly well with a human in the loop?

Dan Luu (@danluu)
2025-03-31

The review-industrial complex

patreon.com/posts/125570961

Pretty much every time I go looking for a review of something, I find that the most popular reviewers leave out key information as they repeat the manufacturer's talking points. Worse yet, they're frequently just plain wrong. Just for example, I ordered a pair of Apple Powerbeats Pro 2 headphones the day they were released in Canada, only to cancel the order after spending an hour looking at reviews. Had I just looked at the most highly recommended reviews, the reviews from the most popular reviewers, I probably would have bought the headphones and been disappointed.

If you dig around for a bit, you'll find that the features I might care about either don't work or are poorly implemented, but the top review is a stellar recommendation from Marques Brownlee titled "PowerBeats Pro 2 Review: Still Better than AirPods!" that has no information useful to me that you couldn't get by reading a press release. Part of Brownlee's case for why these are better than the AirPods Pro is that they're the AirPods Pro, but with extra features; but if you look at reviews that actually evaluate the features, some features don't work and some work much worse than on the AirPods Pro.

One big feature that, judging from other reviewers' comments, appears to be worse is the active noise cancellation (ANC). Brownlee tells you the press release information (that it samples the world 200 times a second and uses the H2 chip) and says the noise cancellation is "really good".
Dan Luu (@danluu)
2025-01-23

@rwv37 IIRC, this is a side effect of an issue with Hugo. If I add a title, it breaks something else.

If I update Hugo, it breaks many different things. I've been avoiding putting time into Hugo fixes since I should just migrate off, but I'm not sure it's really worth the effort when the whole thing kinda sorta works, so I have issues like this. Sorry!

Dan Luu (@danluu)
2025-01-09

And, sure, the LLM-generated code isn't great, but if I compare the LLM-generated game and "AI" in danluu.com/codenames/ to commercially successful online board game implementations that were created by human programmers, the LLM-generated version is faster and less buggy.

Per the argument in danluu.com/customer-service/, AI doesn't have to be very good to replace humans in a lot of roles because, in practice, humans often aren't all that good (see also, danluu.com/p95-skill/).

Dan Luu (@danluu)
2025-01-09

This exchange between some programmers and a non-programmer typifies what I was getting at in danluu.com/codenames/#appendix

Programmer 1: GPT-4o and Claude Haiku are useless for programming
Programmer 2: Claude Sonnet is useless for programming
Non-programmer: What do you mean GPT-4o is useless? I don't know how to program and created an app that makes $10k/mo with GPT-4o

LLMs have allowed non-programmers to produce apps for years and programmers are calling these things useless for programming.

antirez:

About "people still thinking LLMs are quite useless", I still believe that the problem is that most people are exposed to ChatGPT 4o that at this point for my use case (programming / design partner) is basically a useless toy. And I guess that in tech many folks try LLMs for the same use cases. Try Claude Sonnet 3.5 (not Haiku!) and tell me if, while still flawed, is not helpful.

But there is more: a key thing with LLMs is that their ability to help, as a tool, changes vastly based on your communication ability. The prompt is the king to make those models 10x better than they are with the lazy one-liner question. Drop your files in the context window; ask very precise questions explaining the background. They work great to explore what is at the borders of your knowledge. They are also great at doing boring tasks for which you can provide perfect guidance (but that still would take you hours). The best LLMs (in my case just Claude Sonnet 3.5, I must admit) out there are able to accelerate you.

duped:

> Try Claude Sonnet 3.5 (not Haiku!) and tell me if, while still flawed, is not helpful.

It's not as helpful as Google was ten years ago. It's more helpful than Google today, because Google search has slowly been corrupted by garbage SEO and other LLM spam, including their own suggestions.

mvkel:

I'm surprised at the description that it's "useless" as a programming /...

I tried using two different AI assistants, Storytell and Cursor, to write the code for this post. I didn't use them the way a programmer would; I used them more the way a non-programmer would use them to write a program. Overall, I find AI assistants to be amazingly good at some tasks while being hilariously bad at other tasks. That was the case here as well.

I basically asked them to write code, ran it to see if it worked, and then told the assistant what was wrong and had it rewrite the code until it looked like it basically worked. Even using the assistants in this very naive way, where I deliberately avoided understanding the code and was only looking to get output that worked, I don't think it took too much longer to get working code than it would've taken if I'd just coded up the entire thing by hand with no assistance. I'm going to guess that it took about twice as long, but programmer estimates are notoriously inaccurate and for all I know it was a comparable amount of time. I have much less confidence that the code is correct and I'd probably have to take quite a bit more time to be as confident as I'd be if I'd written the code, but I still find it fairly impressive that you can just prompt these AI assistants and get code that basically works in not all that much more time than it would take a programmer to write the code. These tools are certainly much cheaper than hiring a programmer and, if you're using one of these tools as a programmer an...
Dan Luu (@danluu)
2025-01-08

@eighty0n3 Sorry about that.

IIRC, I changed that when I saw that someone posted an old archive.org version of mine somewhere because they didn't like my "new" formatting. I often add fixes / corrections, and in my testing the current formatting performs better than the old formatting (at the time, I only had the before/after data, but I've also done some A/B testing since then).

Dan Luu (@danluu)
2024-12-25

Exercises in benchmarking and experimental design, part 3:

patreon.com/posts/118646029

We're going to continue this series on benchmarking and experimental design. For context, my feeling is that, if you can evaluate and run decent benchmarks, this really helps you with everything else you need to know about performance, but most benchmarks people do have some basic issues so, for anyone who's interested in writing code with decent performance, it makes sense to start with benchmarking. And there's a high degree of overlap between benchmark design and general experimental design, so the knowledge is also helpful for evaluating other kinds of experimental designs. And now, back to some exercises!

14. Geekbench 5

In Geekbench 5, the documentation for the compilation benchmark says "The Clang workload compiles a 1,094 line C source file (of which 729 lines are code)". In Geekbench 4, we have "The LLVM workload processes an LLVM IR (intermediate representation) file through the LLVM optimizer and code-generation routines. The LLVM IR file was generated from a 3,900 line C source file using Clang"

What's wrong with Geekbench's compilation benchmarks?
Dan Luu (@danluu)
2024-11-22

A version of Missile Command for the Commodore 64 where the bottom of your screen is the game state in memory and missiles cause memory corruption, which eventually causes you to lose: csdb.dk/release/?id=135463.

In the video below, a missile broke my controls and caused my cursor to move down and to the left so I couldn't stop other missiles.

Dan Luu (@danluu)
2024-11-18

@jefftk It's possible, but then I'd wonder why the only two EVs on the most-fatal-models list are Teslas. If a base Tesla counts as fast, then a base E-tron or I-Pace is fast too, the Bolt EV and some of the cheap Hyundai/Kia EVs on sale at the time aren't slow, and the base Taycan is very fast (3.4s measured 0-60)

It seems somewhat weird that, if it's something general about EV acceleration, only Teslas make the list of most fatal models, esp. when there are much faster EVs (because they don't sell a slow version)

Dan Luu (@danluu)
2024-11-18

@jefftk I haven't seen a more detailed analysis, but I don't think acceleration should be the primary driver because:

1. Only two fast car models are in the top 20 list.

2. If we look at how fast a typical model is for a luxury brand, a base BMW 3-series has a measured 0-60 of 5.2s (vs. 5.6s for a base Tesla Model 3) and a base BMW X5 has a measured 0-60 of 3.9s (vs. 3.9s for a Tesla Model X LR; can't find base numbers). These luxury brands that sell fast cars do not have high fatality rates.

Dan Luu (@danluu)
2024-11-16

I find it really interesting/surprising that Tesla topped the fatalities per mile ranking from 2018-2022.

Fatality rate is strongly negatively correlated with price and weight, and Teslas are much more expensive and heavier than average.

The commentary I've seen says that the cars are safe so it must be about Tesla drivers, but per danluu.com/car-safety/, maybe it's also about the cars. The most fatal manufacturers that were rated there (Kia/Hyundai, Dodge, Tesla) all rank poorly as well.

[Image: Table showing that Tesla has the most fatalities per mile driven]
[Image: Table showing the Tesla Model Y and Tesla Model S are among the most fatal cars per mile driven]
[Image: Plot showing Tesla has the highest average selling price of all tracked car manufacturers]

The ranking below is mainly based on how well vehicles scored when the driver-side small overlap test was added in 2012 and how well models scored when they were modified to improve test results.

    Tier 1: good without modifications
        Volvo
    Tier 2: mediocre without modifications; good with modifications
        None
    Tier 3: poor without modifications; good with modifications
        Mercedes
        BMW
    Tier 4: poor without modifications; mediocre with modifications
        Honda
        Toyota
        Subaru
        Chevrolet
        Tesla
        Ford
    Tier 5: poor with modifications or modifications not made
        Hyundai
        Dodge
        Nissan
        Jeep
        Volkswagen

These descriptions are approximations. Honda, Ford, and Tesla are the poorest fits: Ford arguably sits halfway between Tier 4 and Tier 5, but also arguably is better than Tier 4 and doesn't fit the classification at all, while Honda and Tesla don't properly fit into any category (their assigned tiers are just the closest fits). Some others are also imperfect. Details below.
Dan Luu (@danluu)
2024-11-16

@againsthimself @lmk @HalvarFlake Yeah, I'd like to see some different questions. For a while, I thought, in terms of answer accuracy on my polls, Mastodon > Twitter > Threads, but then I realized it was mainly that, on pessimism, Mastodon > Twitter > Threads and I could get the opposite accuracy rating by asking questions where the correct answer is optimistic.

Here, if you ask for E[median lifetime income percentile] for student loan forgiveness beneficiaries, maybe you'd see the opposite result.

Dan Luu (@danluu)
2024-11-15

Would the world be better or worse if car horns didn't exist?

Dan Luu (@danluu)
2024-11-15

@lmk @HalvarFlake Not sure why I don't get the exact same thing, but I get a fairly analogous result.

BTW, I find graphs 4 and 5 of ipsos.com/en-us/link-between-m even funnier.

Dan Luu (@danluu)
2024-11-08

@graydon @waffles I'm not sure what this is a response to since I don't think anyone here has said there's no difference. Some of those stats are even in the body of the post.
