So What?
Have you ever sat through a security briefing, heard the words, “This CVE has a critical CVSS score of 9.8!” and thought to yourself, “Okay, great… but what does that actually mean for us?” You’re definitely not alone.
Great question, greater song
Throughout my career as a CISO, I’ve spent a large chunk of my time asking exactly this question. Let’s face it: CVSS scores are helpful, but they’re also generic. They don’t account for the specifics of your enterprise — your infrastructure, your configurations, or your security posture. Essentially, they’re like a weather forecaster predicting rain in Texas. Helpful-ish, but you still don’t know whether you’ll need an umbrella.
This frustration is exactly why I decided to build an AI-powered risk assessment agent using synthetic data to simulate a mid-to-large enterprise environment. Because at the end of the day, cybersecurity isn’t about reacting to generic alarms; it’s about understanding your risks in your context and making clear, informed decisions based on reality, not theory. I didn’t want another tool that simply echoed what public databases already told me. I wanted something that could reason, prioritize, and reflect the unique fingerprint of a real-world enterprise. Something that could finally answer the question every overwhelmed security team secretly asks: “Out of everything that’s happening, what actually matters right now?”
Meet My Blind Yet All-Seeing AI Sidekick
When I first kicked off this project, I had a basic question: Could I ask an AI what a CVE actually means to the company, instead of reading endless vendor bulletins that assume every system is exposed to the internet and ready to be set on fire?
The Graeae see nothing, yet see all
At the time, it seemed simple enough. I threw together some Python scripts, used an LLM to generate some synthetic network configurations for simulation purposes, and piped CVE summaries straight into an LLM with a “what does this mean” prompt.
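For the curious, that first naive pass looked roughly like the sketch below. Everything here is illustrative: call_llm is a stand-in for whichever chat-completion client you have wired up, and the CVE feed is assumed to be a local JSON file.

```python
# Naive first attempt: pipe a CVE summary straight into an LLM and hope.
import json

def call_llm(prompt: str) -> str:
    # Placeholder: wire this to whatever model API you actually use.
    raise NotImplementedError

def naive_cve_summary(cve: dict) -> str:
    prompt = (
        "You are a security analyst. Explain what this CVE means for a "
        "mid-to-large enterprise:\n\n"
        f"{cve['id']}: {cve['description']}"
    )
    return call_llm(prompt)

with open("cve_feed.json") as f:
    for cve in json.load(f):
        print(naive_cve_summary(cve))
```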
The first results? They were…useless. I had to work through a series of LLMs to understand their strengths and weaknesses first. Then, when I finally settled on a few that worked for different portions of the project, the results were, let’s say, enthusiastic. Long-winded, overly cautious, and about as useful as an airport announcement that says “a flight is delayed” without mentioning which one. The AI could talk about “potential risk” and “hypothetical impacts,” but it was like asking a Magic 8-Ball for incident response advice (now that I think about it… <shake> “very doubtful”).
Clearly, if I wanted real insights, I’d need to teach it to think more like a security analyst — breaking down context, assessing technical fit, and prioritizing risk based on reality, not worst-case fantasy.
That kicked off a lot of trial and error (and a lot of coffee).
One of the things I’ve learned over the years working with enterprise applications is that enterprise data doesn’t come neatly gift-wrapped. It’s messy, inconsistent, and often spread across spreadsheets, PDFs, exported scan results, policy documents, and firewall configs hastily copied and pasted into Notepad. So, I made a decision: the system had to be document-agnostic. Whether the input was a structured CMDB export, a raw Qualys scan CSV, a Word document full of access policies, or a block of firewall rules saved as plain text, the agent needed to ingest, normalize, and chunk it into usable pieces automatically. That way, I wouldn’t have to waste time hand-massaging inputs — I could just drop whatever artifacts I had, and the AI would do the heavy lifting to turn them into meaningful context for analysis. It wasn’t glamorous work, but it’s the difference between a system that works in a demo and a system that works under pressure in real-world environments.
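A rough sketch of that ingestion layer is below, under a few assumptions: python-docx for Word files, the csv module for scan exports, and a plain-text fallback for everything else. The chunker here is a simplified fixed-size splitter with overlap, standing in for the semantic chunking described next.

```python
# Sketch of a document-agnostic ingestion step: read whatever artifact is
# dropped in, normalize it to plain text, and split it into chunks.
from pathlib import Path
import csv

def load_text(path: Path) -> str:
    """Normalize a CMDB export, scan CSV, Word doc, or raw config to text."""
    if path.suffix.lower() == ".csv":
        with path.open(newline="", encoding="utf-8", errors="ignore") as f:
            return "\n".join(", ".join(row) for row in csv.reader(f))
    if path.suffix.lower() == ".docx":
        from docx import Document  # pip install python-docx
        return "\n".join(p.text for p in Document(str(path)).paragraphs)
    return path.read_text(encoding="utf-8", errors="ignore")

def chunk(text: str, size: int = 1200, overlap: int = 200) -> list[str]:
    """Fixed-size chunking with overlap so context isn't cut mid-thought."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

corpus = [c for p in Path("artifacts").glob("*") for c in chunk(load_text(p))]
```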
Then, I realized the model needed more than just the CVE description. It needed to understand my simulated environment — the servers, the cloud zones, the endpoint devices, the policies in place. I built a semantic document chunking system that splits large artifacts into digestible pieces and indexed them using embeddings so the AI could “search” and “retrieve” the most relevant ones.
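Here is a minimal version of that index-and-retrieve step, assuming the sentence-transformers library for embeddings (any embedding model would do) and plain cosine similarity in NumPy instead of a real vector database.

```python
# Minimal semantic retrieval: embed every chunk once, then pull the chunks
# most similar to a query (a CVE description or its extracted keywords).
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(chunks: list[str]) -> np.ndarray:
    return model.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, chunks: list[str], index: np.ndarray, k: int = 5) -> list[str]:
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = index @ q  # cosine similarity, since vectors are normalized
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]
```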
Keyword generation became the next big unlock. Rather than blindly guessing which documents to pull, I trained a secondary step where the model reads the CVE, extracts important concepts (“Apache,” “Log4j,” “remote code execution,” etc.), and uses those as retrieval anchors. That alone boosted the signal-to-noise ratio dramatically.
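The keyword step is just another narrow prompt with a strict output contract. A sketch, again with call_llm standing in for whichever model handles it:

```python
# Keyword generation: ask the model for retrieval anchors instead of guessing.
import json

KEYWORD_PROMPT = (
    "Extract the 5-10 most important technical concepts from this CVE "
    "(products, components, attack techniques). Return a JSON array of "
    "strings only.\n\nCVE: {cve_text}"
)

def extract_keywords(cve_text: str, call_llm) -> list[str]:
    raw = call_llm(KEYWORD_PROMPT.format(cve_text=cve_text))
    try:
        return [str(k) for k in json.loads(raw)]
    except json.JSONDecodeError:
        return []  # fall back to using the raw CVE text as the retrieval query

# keywords = extract_keywords(cve["description"], call_llm)
# context = retrieve(" ".join(keywords), corpus, index)
```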
Once I had the right context, I ran into another wall: the model’s tendency to “blob” everything together in one giant answer. It needed structure. So I built a prompt-chaining system — first summarizing the CVE, then identifying impacted systems, then scoring risk, and finally suggesting remediations.
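In code, that chain is nothing exotic: a sequence of focused prompts, each fed the previous step’s output. A condensed sketch (the real prompts are much longer):

```python
# Prompt chaining: one narrow question per step instead of one giant ask.
def assess_cve(cve_text: str, context: list[str], call_llm) -> dict:
    ctx = "\n---\n".join(context)
    summary = call_llm(f"Summarize this CVE in 3 sentences:\n{cve_text}")
    impacted = call_llm(
        f"Given this environment context:\n{ctx}\n\n"
        f"Which systems could be affected by:\n{summary}"
    )
    risk = call_llm(
        f"Score the real-world risk (low/medium/high/critical) and justify:\n"
        f"CVE: {summary}\nImpacted systems: {impacted}"
    )
    remediation = call_llm(
        f"Suggest practical remediations given:\nRisk: {risk}\nSystems: {impacted}"
    )
    return {"summary": summary, "impacted": impacted,
            "risk": risk, "remediation": remediation}
```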
Breaking the problem into bite-sized reasoning steps made a night-and-day difference in output quality.
Along the way, I layered in sampling controls — letting the pipeline randomly select, stratify, or cluster document samples depending on the risk appetite. I wired in tunables like temperature (creativity vs. precision) and top-p values (how adventurous the sampling is) so that depending on the need, I could dial up a “paint inside the lines” analysis or let it freewheel a bit when exploring remediation strategies.
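Those controls ended up living in small per-task profiles, roughly like the sketch below. The numbers are illustrative defaults, not recommendations, and the cluster-sampling path is omitted for brevity.

```python
# Per-task sampling profiles: precise for scoring, exploratory for remediation.
import random

PROFILES = {
    "risk_scoring": {"temperature": 0.1, "top_p": 0.8},   # paint inside the lines
    "remediation":  {"temperature": 0.7, "top_p": 0.95},  # allowed to freewheel a bit
}

def sample_documents(chunks: list[str], n: int = 20, strategy: str = "random") -> list[str]:
    """Pick which evidence chunks make it into the prompt window."""
    if strategy == "random" or len(chunks) <= n:
        return random.sample(chunks, min(n, len(chunks)))
    # "stratified": take an even slice across the corpus so no single source dominates
    step = max(1, len(chunks) // n)
    return chunks[::step][:n]
```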
Of course, having a risk score pop out at the end is great…unless it’s wrong. So I also built a confidence scoring model. It looks at how tightly the evidence matches the CVE, whether the system is internet-exposed, whether there’s an existing patching policy, and other environmental factors. Then it generates a confidence rating alongside the risk assessment — helping me separate “this is critical” from “this might be critical, but we’re guessing.”
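The confidence layer is deliberately simple: a handful of environmental signals rolled into one rating. The weights and thresholds below are illustrative assumptions, not calibrated values.

```python
# Confidence scoring: how much should we trust the risk assessment?
# Weights and thresholds are illustrative, not calibrated.
def confidence_score(evidence_similarity: float, internet_exposed: bool | None,
                     patch_policy_exists: bool | None) -> tuple[float, str]:
    score = 0.5 * evidence_similarity  # how tightly retrieved evidence matches the CVE
    score += 0.25 if internet_exposed is not None else 0.0      # exposure is actually known
    score += 0.25 if patch_policy_exists is not None else 0.0   # policy coverage is known
    label = "high" if score >= 0.75 else "medium" if score >= 0.5 else "low (we're guessing)"
    return round(score, 2), label
```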
Technology-wise, I wanted flexibility, not lock-in. So I designed the engine to be model-agnostic: I can hit frontier models like GPT-4 Turbo over an API when I want the big guns, or I can call a locally hosted LLM through Ollama when I want speed, privacy, or just to avoid burning API credits. It also made it easy to test different models and architectures without rewriting the entire system each time.
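The model-agnostic part is just a thin dispatcher, and it finally fills in the call_llm placeholder from earlier. This sketch uses the OpenAI Python SDK for the hosted path and Ollama’s local REST API for the private one; the model names are examples.

```python
# Model-agnostic dispatch: same call_llm() signature, two very different backends.
import requests

def call_llm(prompt: str, backend: str = "local") -> str:
    if backend == "openai":
        from openai import OpenAI  # pip install openai
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        resp = client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
    # Local path: Ollama's HTTP API on its default port.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3", "prompt": prompt, "stream": False},
        timeout=120,
    )
    return resp.json()["response"]
```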
While I designed this application for flexibility in a personal project setting, enterprise deployment would require proper governance, API security, and operational controls.
Honestly, this project taught me more about building reliable AI pipelines than any article or tutorial ever could. Every “small thing” — prompt design, chunk sizing, keyword filtering, sampling methods, temperature tuning, scoring logic — mattered. Miss one piece and the whole illusion of “smart AI” collapses into a pile of generic advice, random babbling, or my prompt being fed back to me, reworded.
Today, the agent doesn’t just tell me “this CVE has a 9.8 CVSS score.”
It tells me “this vulnerability could affect five critical systems in your PCI environment, two of which are internet-facing, patching is overdue on one, and based on our policies, your exposure window is about 14 days unless mitigated.”
It feels less like asking a Magic 8-Ball and more like having a junior analyst who’s fast, smart, and (mercifully) never asks for PTO.
The “So What?” Factor
One of the biggest lessons I’ve learned the hard way in cybersecurity is that volume is not the same as insight. Anyone can generate a wall of “critical vulnerabilities” and “urgent alerts.” But the ability to know which fires matter — and which ones are just smoke — is what separates chaos from control.
That was the real test for this AI agent. Could it help me get past “everything is bad” and tell me what matters, when it matters, to whom it matters?
At first, even after all the fancy retrieval, keywording, and prompt-chaining work, the outputs still felt a little…well, panicked. Models (especially when left to their own devices) have a tendency to be overly cautious. Everything becomes DEFCON 1. Every CVE is a crisis. Every server is a ticking time bomb.
I realized the agent needed some proportionality: it had to communicate risk in terms that were realistic for the environment and the risk tolerances of the business.
This is where the risk scoring and confidence layering came into play:
- Stage 1: Analyze the CVE independently, focusing purely on the vulnerability’s technical impact.
- Stage 2: Contextualize against the retrieved environment data.
- Stage 3: Identify asset exposure (internal-only, DMZ, internet-facing, etc.).
- Stage 4: Layer department/business criticality.
- Stage 5: Generate a real-world risk score specific to the environment.
- Stage 6: Attach a confidence rating based on evidence strength and exposure clarity.
The result was focused, realistic output I could actually use to decide how urgently my team needed to act.
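Stitched together, those six stages look roughly like the sketch below, reusing retrieve(), call_llm(), and confidence_score() from the earlier snippets. The prompts are condensed and the confidence inputs are hardcoded purely for illustration.

```python
# The six-stage flow, end to end (condensed).
def run_pipeline(cve: dict, chunks: list[str], index) -> dict:
    # Stage 1: technical impact of the CVE on its own
    technical = call_llm(f"Describe the technical impact of:\n{cve['description']}")
    # Stage 2: contextualize against retrieved environment data
    context = retrieve(cve["description"], chunks, index, k=8)
    # Stage 3: asset exposure (internal-only, DMZ, internet-facing)
    exposure = call_llm(
        "Classify exposure of the affected assets (internal / DMZ / internet-facing):\n"
        + "\n".join(context)
    )
    # Stage 4: layer department/business criticality
    criticality = call_llm(f"Rate the business criticality of what's described here:\n{exposure}")
    # Stage 5: environment-specific risk score
    risk = call_llm(
        f"Give a real-world risk score for this environment:\n"
        f"{technical}\n{exposure}\n{criticality}"
    )
    # Stage 6: confidence rating from evidence strength and exposure clarity
    score, label = confidence_score(0.8, True, True)  # illustrative inputs
    return {"risk": risk, "confidence": label}
```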
I added a ton of documentation and help along the way — mostly to remind myself how to use my own app.
Instead of massive spreadsheets screaming “9.8!!!”, I get summaries like:
- “Critical for payment processing; exposed; patch now.”
- “Affects internal systems; covered by segmentation; patch during maintenance.”
- “Not present in environment; no action needed.”
It turns theoretical chaos into navigable risk management.
Fewer sirens, more signal.
The Limits and Promise of AI in Risk Analysis
If there’s one thing building this agent taught me, it’s this: Large Language Models aren’t magic.
They’re not going to replace human cybersecurity expertise, no matter how many VC pitches or keynote slides try to tell you otherwise.
What they can do — and where they shine — is accelerating the grunt work that slows down human decision-making. They can connect dots faster than a junior analyst, summarize mountains of documentation in seconds, and offer “best guess” risk prioritization that can be validated and refined by actual practitioners.
In other words, AI is shaping up to be what early dot-com dreamers once promised “decision support systems” would be — only this time, it might actually work.
But it’s crucial to understand the limits:
- Contextual Errors: Models often miss subtle nuances without tight context retrieval.
- Overconfidence: Without proper guardrails, they hallucinate with swagger.
- Blind Spots: They only know what they’re given; missing data equals missing judgment.
- Integrity Risks: They can fabricate plausible but incorrect “facts”.
In short: AI can help us move faster, but it still can’t tell us when we’re sprinting in the wrong direction.
The critical judgment still belongs to human experts.
If we treat AI as a partner — a capable but imperfect junior analyst — we can unlock enormous value.
If we treat it as a replacement for judgment, we’re setting ourselves up for failure.
What I’m seeing is that LLMs and AI are not, at this point, fantasy replacements for people; they’re amplifiers for skilled decision-makers.
And we’re going to need every bit of that amplification, because in cybersecurity, the real fight hasn’t even started yet.
My next article will be about how we, as an industry and occupation, are wholly unprepared for, and misaligned with, what is potentially coming.
#agenticai #agents #ai #ArtificialIntelligence #infosec #risk #security