Syntax Hacking: Researchers Discover Sentence Structure Can Bypass AI Safety Rules
Researchers from MIT, Northeastern University and Meta recently released a paper suggesting that large language models [LLMs] similar to those that power ChatGPT may sometimes prioritize sentence structure over meaning when answering questions.
⁉️The findings reveal a weakness in how these models process instructions, one that may shed light on why some prompt-injection and jailbreaking approaches work. The researchers caution, though, that their analysis of some production models remains speculative, since the training data details of prominent commercial AI models are not publicly available.⁉️
https://arxiv.org/abs/2509.21155v2
#ai #llm #prompt #injection #it #security #privacy #engineer #media #tech #news
![[ImageSource: Shaib et al.]
Figure 1 from “Learning the Wrong Lessons: Syntactic-Domain Spurious Correlations in Language Models” by Shaib et al.
⁉️The team, led by Chantal Shaib and Vinith M. Suriyakumar, tested this by asking models questions with preserved grammatical patterns but nonsensical words. For example, when prompted with “Quickly sit Paris clouded?” [mimicking the structure of “Where is Paris located?”], models still answered “France.”⁉️
This suggests models absorb both meaning and syntactic patterns, but they can over-rely on structural shortcuts when those shortcuts correlate strongly with specific domains in the training data, which sometimes lets a familiar pattern override semantic understanding in edge cases.
👾As a refresher, syntax describes sentence structure: how words are arranged grammatically and which parts of speech they use. Semantics describes the actual meaning those words convey, which can vary even when the grammatical structure stays the same. Semantics depends heavily on context, and navigating context is what makes LLMs work. Turning an input [your prompt] into an output [an LLM answer] involves a complex chain of pattern matching against encoded training data; a small sketch of this syntax-versus-semantics contrast follows below.👾](https://files.mastodon.social/cache/media_attachments/files/116/098/102/919/197/918/small/2da653641116b3a1.jpeg)
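To make the distinction concrete, here is a minimal sketch [assuming spaCy and its "en_core_web_sm" model are installed; this is not the paper's code] that reduces both the real question and the nonsense probe to part-of-speech templates. It is this kind of grammatical skeleton, not the words themselves, that can end up spuriously correlated with a domain.

```python
import spacy

# Small English pipeline; install with:
#   pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# A real geography question and the paper's nonsense probe that mimics its structure.
prompts = ["Where is Paris located?", "Quickly sit Paris clouded?"]

for prompt in prompts:
    doc = nlp(prompt)
    # Coarse part-of-speech sequence, i.e. the "syntactic template" of the prompt.
    template = " ".join(token.pos_ for token in doc)
    print(f"{prompt!r:32} -> {template}")

# Exact tags depend on the tagger version, but both prompts collapse to a
# short sequence of grammatical labels that carries no hint of the meaning.
```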
![[ImageSource: Shaib et al.]
Figure 4 from “Learning the Wrong Lessons: Syntactic-Domain Spurious Correlations in Language Models” by Shaib et al.
⁉️To investigate when and how this pattern-matching can go wrong, the researchers designed a controlled experiment. They built a synthetic dataset in which each subject area had its own grammatical template based on part-of-speech patterns: geography questions followed one structural pattern, for instance, while questions about creative works followed another. They then trained Allen AI’s OLMo models on this data and tested whether the models could distinguish syntax from semantics [a toy version of this setup is sketched below the figure].⁉️
👾In layperson’s terms, the research shows that AI language models can become overly fixated on the style of a question rather than its actual meaning. Imagine someone who learned that questions starting with “Where is…” are always about geography, so when you ask “Where is the best pizza in Chicago?”, they answer “Illinois” instead of recommending restaurants. They’re responding to the grammatical pattern [“Where is…”] rather than understanding that you’re asking about food.👾](https://files.mastodon.social/cache/media_attachments/files/116/098/102/945/449/113/small/8bfd1b23a010b907.jpeg)
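As a rough illustration of that experimental design [the templates and topics below are invented for illustration, not taken from the paper], here is how one might build a tiny synthetic dataset in which each domain has exactly one grammatical template, so syntax and domain become perfectly correlated:

```python
import random

# Hypothetical domain -> (template, slot fillers) pairs; each domain gets a
# single grammatical pattern, so structure alone predicts the domain.
TEMPLATES = {
    "geography": ("Where is {place} located?",
                  {"place": ["Paris", "Nairobi", "Osaka"]}),
    "creative_works": ("Name the author who wrote {work}.",
                       {"work": ["Dune", "Beloved", "Ulysses"]}),
}

def make_examples(n_per_domain: int = 3, seed: int = 0) -> list[dict]:
    """Generate prompts whose grammatical template is fully determined by their domain."""
    rng = random.Random(seed)
    rows = []
    for domain, (template, fillers) in TEMPLATES.items():
        for _ in range(n_per_domain):
            slots = {slot: rng.choice(values) for slot, values in fillers.items()}
            rows.append({"domain": domain, "prompt": template.format(**slots)})
    return rows

for row in make_examples():
    print(row)
```

Training on data like this bakes in the spurious correlation; the question the paper then asks is whether the model keys on the template or on the meaning once the two are pulled apart.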
![[ImageSource: Shaib et al.]
Table 2 from “Learning the Wrong Lessons: Syntactic-Domain Spurious Correlations in Language Models” by Shaib et al.
To test whether these patterns occur in production models, the team developed a benchmarking method using the FlanV2 instruction-tuning dataset. They extracted grammatical templates from the training data and checked whether models maintained performance when those templates were applied to different subject areas [a rough sketch of this cross-domain check follows below the figure].
<https://huggingface.co/datasets/SirNeural/flan_v2>
⁉️Tests on OLMo-2-7B, GPT-4o and GPT-4o-mini revealed similar drops in cross-domain performance. On the Sentiment140 classification task, GPT-4o-mini’s accuracy fell from 100 percent to 44 percent when geography templates were applied to sentiment analysis questions. GPT-4o dropped from 69 percent to 36 percent. The researchers found comparable patterns in other datasets.⁉️
<https://huggingface.co/datasets/stanfordnlp/sentiment140>
👾The team also documented a security vulnerability stemming from this behavior, which you might call a form of syntax hacking. By prepending grammatical patterns from benign training domains to their prompts, they bypassed safety filters in OLMo-2-7B-Instruct. When they added a chain-of-thought template to 1,000 harmful requests from the WildJailbreak dataset, refusal rates dropped from 40 percent to 2.5 percent. A rough sketch of that measurement also follows below.👾](https://files.mastodon.social/cache/media_attachments/files/116/098/102/973/139/775/small/68fffbf3f795a559.jpeg)
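Here is a minimal sketch of the cross-domain check, with a placeholder `ask_model` function standing in for whichever chat or completion API you use; the two templates are illustrative rather than the actual FlanV2-derived ones.

```python
def ask_model(prompt: str) -> str:
    """Placeholder: send `prompt` to your LLM of choice and return its text reply."""
    raise NotImplementedError

# A couple of Sentiment140-style examples with gold labels.
TWEETS = [
    ("I love this phone, best purchase all year!", "positive"),
    ("Worst customer service I have ever dealt with.", "negative"),
]

# The task phrased in its native wording versus forced into a geography-flavored template.
NATIVE = 'What is the sentiment of this tweet: "{tweet}"? Answer positive or negative.'
SWAPPED = 'Where is the sentiment located in this tweet: "{tweet}"? Answer positive or negative.'

def accuracy(template: str) -> float:
    """Score the model on the tweets under a given prompt template."""
    correct = 0
    for tweet, label in TWEETS:
        reply = ask_model(template.format(tweet=tweet)).lower()
        correct += int(label in reply)
    return correct / len(TWEETS)

# If the model leans on syntax, accuracy under SWAPPED drops sharply even
# though the underlying task is unchanged, the kind of gap reported above.
```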
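And a rough sketch of how such a refusal-rate measurement could be set up [not the paper's exact protocol: the chain-of-thought prefix below is an invented stand-in for the template the authors used, `generate` is a placeholder for an OLMo-2-7B-Instruct call, and the refusal check is a crude keyword heuristic]:

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "i am sorry")

def generate(prompt: str) -> str:
    """Placeholder: run the instruction-tuned model on `prompt` and return its reply."""
    raise NotImplementedError

def refusal_rate(requests: list[str], prefix: str = "") -> float:
    """Fraction of requests the model refuses, with an optional syntactic prefix prepended."""
    refusals = 0
    for request in requests:
        reply = generate(prefix + request).lower()
        refusals += any(marker in reply for marker in REFUSAL_MARKERS)
    return refusals / len(requests)

# Invented benign-looking chain-of-thought prefix, standing in for a template
# drawn from a harmless training domain.
COT_PREFIX = "Let's think step by step. First, restate the question carefully. Then, "

# Usage, given a list of evaluation prompts:
# baseline = refusal_rate(harmful_requests)                     # ~40% reported in the paper
# attacked = refusal_rate(harmful_requests, prefix=COT_PREFIX)  # ~2.5% reported in the paper
```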



