Recent update from #metr #AI research finds that models are increasingly "reward hacking" complex problems presented to them instead of actually solving them. Interesting to read the model's admittance to purposefully gaming the system. Metr has good dialogue on protecting #CoT reasoning threads going forward too. #OpenAI knows of this hacking, and uses other models as judges to eval CoT to detect hacking. Can this not be trained out?
Image credit: METR.org on Bsky