That Apple paper about #ai is a cogent argument for code execution in the chain of thought. You try writing down the solution to 10-disk Towers of Hanoi; soon enough your verbal reasoning checks out for a coffee break, and you're an instruction-following automaton. Indeed it's not a job fit for an LLM, strictly speaking.
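For scale, here's a minimal sketch (peg labels and the `hanoi` name are mine): the optimal algorithm is three lines of recursion, but enumerating its output for 10 disks means writing out 2**10 - 1 = 1023 moves, every one of them pure bookkeeping.

```python
def hanoi(n, src="A", aux="B", dst="C"):
    """Yield the optimal move sequence for n disks, src -> dst."""
    if n == 0:
        return
    yield from hanoi(n - 1, src, dst, aux)  # park n-1 disks on the spare peg
    yield (src, dst)                        # move the largest disk
    yield from hanoi(n - 1, aux, src, dst)  # stack the n-1 disks back on top

moves = list(hanoi(10))
print(len(moves))  # 1023 moves to transcribe, zero insight per move
```

The recursion is trivial to state and brutal to transcribe by hand, which is exactly the gap between reasoning about the solution and grinding out its execution.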
Just the other day I created what I consider an intermediate-level CTF exercise, which involved correctly determining the values of 25 bytes under very specific constraints. o3 one-shotted it -- exactly by heavily abusing Python in the chain of thought. If you want to say "that's not the LLM reasoning, that's -- " then fine, let's go with whatever other terminology. o3 basket-weaved the problem into a solved state. We're looking at a future full of this kind of basket-weaving.