[tl;dr sec] #325 - Dissecting Mythos, The $0 Security Stack, GitHub Action Red Team Framework
Replicating Mythos bugs with public models and more, building a useful security program for free, new post-exploitation framework for CI/CD pipelines that can replicate the full TeamPCP attack kill chain
Hey there,
I hope you’ve been doing well!
😅 Bug Hunters Be Like
I was going to open with a fun, personal story, but then I got caught up trying to cover a round-up of what a bunch of folks are saying about Mythos and the frontier of LLM-driven vulnerability discovery, and now it’s past midnight 😅
So for now I leave you this meme, H/T buherator:

Sponsor
📣 (Free!) Community Edition: Ready your attack surface for AI with runZero
Gearing up for a deluge of AI-powered exploits? You’re gonna need fast, accurate visibility into all your assets.
Created by HD Moore (the mind behind Metasploit), runZero delivers unrivaled discovery and exposure detection across your entire internal and external attack surfaces. No agents, no credentials, and no appliances required. runZero finds everything, including unknown, unmanaged, and "I didn't know that was plugged in" devices — along with broad classes of exposures from CVEs to default credentials and bad configs.
Pro Tip: When your trial ends, downshift to our free Community Edition. It’s perfect for home labs (up to 100 devices!).
I’ve met HD Moore a few times: super nice and sharp guy. I tried not to fanboy out, and he was just really friendly. I’ve heard great things about runZero, which is not surprising given the people behind it.
AppSec
thomaspreece/GitHub-Token-Tester
Tool by Thomas Preece that enumerates the exact permissions of various GitHub tokens. For Classic PATs and OAuth tokens, it checks the scopes header from the user endpoint, while for App Installation Tokens, App User Access Tokens, and Fine-grained PATs, it brute-forces each permission individually against a repository where the user has admin access.
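For classic tokens, the scope check boils down to reading one response header: GitHub echoes a token’s granted scopes in the `X-OAuth-Scopes` header of any authenticated API call. A minimal sketch of that check (my illustration, not the tool’s actual code):

```python
import urllib.request


def parse_scopes(header: str) -> set[str]:
    # The header is a comma-separated list, e.g. "repo, read:org, gist"
    return {s.strip() for s in header.split(",") if s.strip()}


def get_token_scopes(token: str) -> set[str]:
    # Works for Classic PATs and OAuth tokens; fine-grained tokens don't
    # expose scopes this way, hence the per-permission brute-forcing
    # described above.
    req = urllib.request.Request(
        "https://api.github.com/user",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return parse_scopes(resp.headers.get("X-OAuth-Scopes", ""))
```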
The $0 security stack
Oblique’s Maya Kaczorowski describes their $0 security stack that’s gotten them through their first SOC2 audit period: Semgrep (free for up to 10 contributors) for SAST/SCA with AI-based triage, TruffleHog for secret scanning on commits, RunReveal's Community tier (5 data sources) as their SIEM ingesting GCP/Cloudflare/GitHub logs, and Sublime Security's Core tier (free for up to 100 mailboxes) for email security integrated with Google Workspace. They also deployed Apple Business for MDM to enforce disk encryption, password locks, and forced updates, noting that Mac devices include XProtect anti-malware that can't be disabled. The stack caught misconfigurations within 24 hours and provides built-in detections routed to a dedicated Slack channel.
Mutation testing for the agentic era
Mutation testing is a pretty neat approach in which you introduce bugs (mutants) and check if your tests catch them, flagging hot spots where code is insufficiently tested. Trail of Bits' Bo Henderson announces MuTON and mewt, two new mutation testing tools optimized for agentic use, along with a configuration optimization skill to help agents set up campaigns efficiently. MuTON provides first-class support for TON blockchain languages (FunC, Tolk, and Tact), while mewt is the language-agnostic core that also supports Solidity, Rust, Go, and more.
The tools use Tree-sitter parsers to systematically introduce bugs into code and verify tests catch them, and implement mutant prioritization (high-severity mutations replace statements with reverts, medium-severity comment out lines, low-severity swap operators) to reduce campaign runtime.
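To make the idea concrete, here’s a toy illustration of a low-severity operator-swap mutant and the boundary test that kills it (illustrative only, not MuTON/mewt output):

```python
def is_adult(age: int) -> bool:
    # Original function under test
    return age >= 18


def is_adult_mutant(age: int) -> bool:
    # Low-severity "operator swap" mutant: >= becomes >
    return age > 18


def boundary_test(fn) -> bool:
    # A test that exercises the boundary kills the mutant;
    # a sloppier test like fn(30) would let it survive undetected.
    return fn(18) is True


print(boundary_test(is_adult))         # True - original passes
print(boundary_test(is_adult_mutant))  # False - mutant is caught ("killed")
```

If every mutant survives your test suite, that’s the hot spot the campaign flags.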
Sponsor
📣 Map your MCP attack surface across four risk categories
When MCP agents read data from connected systems and decide autonomously what to do next, the attack surface is the data, not the code. This framework maps risks across four categories (content injection, supply chain, config/governance, and ops), covers the agent-as-inadvertent-adversary problem, and provides a defense matrix mapped to the OWASP MCP Top 10.
Lots of companies are working on securing their MCP usage, nice to see a risk framework with practical defenses and governance controls 👍️
Cloud Security
The Invisible Footprint: How Anonymous S3 Requests Evade AWS Logging
Maya Parizer describes how Varonis Threat Labs discovered that anonymous S3 requests made through VPC endpoints weren't logged in CloudTrail Network Activity events, allowing attackers within compromised VPCs to exfiltrate data to external buckets with zero visibility. When anonymous requests targeted external S3 buckets and the VPC endpoint policy denied access, no events were created in either the source or target account's CloudTrail logs (management, data, or Network Activity events). AWS has since patched this issue to log all anonymous API requests to external S3 buckets as CloudTrail network activity events in the VPC endpoint owner's account.
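An anonymous S3 request is just an unsigned HTTPS call to the bucket’s virtual-hosted-style endpoint, with no Authorization header and no SigV4 signature, which is what made the gap easy to abuse from inside a compromised VPC. A hedged sketch (the bucket name is hypothetical):

```python
import urllib.request


def s3_url(bucket: str, key: str, region: str = "us-east-1") -> str:
    # Virtual-hosted-style S3 endpoint; no credentials involved
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"


def anonymous_get(bucket: str, key: str) -> bytes:
    # No Authorization header and no SigV4 signature = an "anonymous"
    # request. Before the fix, such requests sent through a VPC gateway
    # endpoint left no CloudTrail trace when the endpoint policy denied them.
    req = urllib.request.Request(s3_url(bucket, key))
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```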
Cracks in the Bedrock: Agent God Mode
Palo Alto Networks’ Ori Hadad discovered that AWS Bedrock AgentCore's starter toolkit generates overly permissive IAM roles with wildcard permissions, enabling a "God Mode" attack where compromising one agent grants access to all others in the account. The post describes a multi-stage attack chain in which an attacker who compromises one agent could exfiltrate proprietary ECR images, access other agents’ memories, invoke every code interpreter, and extract sensitive data. Following responsible disclosure, AWS updated documentation to warn that auto-generated roles are for development/testing only and should never be used in production.
💡Every time a major platform provider releases an insecure by default product or feature, a fairy loses its wings, and the tl;dr sec robot sheds a single, oily tear.
Wiz Custom Rules: Measuring the Cloud Security Coverage Gap
TrustOnCloud’s Jonathan Rault announces the open sourcing of their Wiz Custom Configuration Rule (CCR) packages written in Rego for AWS S3, Azure Storage, and GCP BigQuery. They mapped TrustOnCloud controls across those 3 major cloud services and compared them against the default Wiz coverage, and found Wiz’s default coverage was around 34%.
The packages include mappings between TrustOnCloud controls and default Wiz rules, plus custom CCRs for gaps (missing high-severity controls including S3 account-level Block Public Access verification, VPC endpoint IAM restrictions, and BigQuery row-level access authorization checks). Each control is weighted using a CVSS-based scoring system that factors threat mitigation, impact, and control difficulty. TrustOnCloud is exploring Sigma rules to convert their detective controls into deployable code across broader security tooling beyond Wiz.
Supply Chain
sadreck/Butler
By Pavel Tsakalidis: Butler scans every repo for workflows, actions, secrets/variables, third-party actions, and produces HTML and CSV outputs to assist with security reviews, third-party dependency audits, and workflow management.
Fork Commit Detector
Client-side tool by Rami McCarthy that detects GitHub fork commits (commits accessible via a repo's namespace but not in any of its branches), which attackers can exploit to inject malicious code into CI/CD pipelines by referencing them directly via SHA or tags.
The tool queries GitHub's API to check if a commit is HEAD of any branch, merged via PR, or referenced by signed tags, flagging unsigned fork commits as potential imposters and highlighting the highest-risk scenario: imposter tags, where upstream tags point to attacker-controlled fork commits. This attack has been used in the wild, for example in the TeamPCP supply chain attack (Trivy, KICS) and Shai Hulud 2.0/AsyncAPI.
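The branch-HEAD part of that check maps to a single GitHub REST endpoint. A simplified sketch of the classification logic (assumption: the real tool combines more signals than shown here):

```python
import json
import urllib.request


def branches_where_head(owner: str, repo: str, sha: str) -> list[str]:
    # Branches whose HEAD is this commit; empty for a commit that is
    # reachable under the repo's namespace only because of a fork.
    url = (f"https://api.github.com/repos/{owner}/{repo}"
           f"/commits/{sha}/branches-where-head")
    with urllib.request.urlopen(url) as resp:
        return [b["name"] for b in json.load(resp)]


def classify(branch_heads: list[str], merged_via_pr: bool,
             signed_tag: bool) -> str:
    # Simplified decision: anything not anchored to the repo's own history
    # and not covered by a signed tag is suspect.
    if branch_heads or merged_via_pr:
        return "in-repo"
    return "signed-tag" if signed_tag else "possible imposter"
```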
SmokedMeat: A Red Team Tool to Hack Your Pipelines First
Boost Security’s François Proulx announces SmokedMeat, a post-exploitation framework for CI/CD pipelines that demonstrates the full attack kill chain TeamPCP used to compromise Trivy, LiteLLM, KICS, and more. Video overview. The framework includes:
Reconnaissance - Scans GitHub Actions for injection flaws and overpermissive tokens.
Exploitation - Auto-crafts payloads and deploys them via PR/issue/comment.
Post-exploit - Sweeps runner process memory for secrets, enumerates token permissions, collects loot.
Pivot - Exchanges OIDC tokens for AWS/GCP/Azure access, discovers private repos with stolen PATs and runs the embedded Gitleaks, probes SSH deploy keys, and maps the full blast radius in a live visual attack graph.
💡 Cool tool 🔥
Blue Team
pandaadir05/snoop
By Adir Shitrit: A modern syscall tracer built on eBPF. Think strace, but with a real TUI, smart filters, TLS decryption, and output that's actually readable.
How A Roblox Cheat Download Triggered A $2 Million Hack At Vercel
An employee at Context.ai tried to download a Roblox cheat but got hit by Lumma Stealer instead, which exfiltrated every credential in the victim’s browser. The attacker used those credentials to breach Context, steal the OAuth tokens of its customers, and pivot into the Google Workspace of a Vercel employee who had signed up for Context’s product and granted it “Allow All” permissions on their enterprise account. From there, the attacker moved into Vercel's internal systems and lifted customer environment variables that had not been flagged as sensitive. Vercel’s security bulletin.
💡 Oof, the transitive trust in this example is tough. I feel like most companies probably wouldn’t have had the hardening or visibility to detect or prevent this.
Measuring What We’re Missing
George Chen proposes a framework for measuring detection effectiveness by testing security assumptions and tracking false negatives across control areas (Identity & Access, Endpoint, Network, Cloud, Data Protection). He maps adversarial testing results (red team, purple team, BAS, threat hunting) to specific control areas, calculating detection rates as: Detected / (Detected + Missed under test conditions), then applies organizational weightings based on critical business services and attack paths.
George recommends tracking two separate metrics: an effectiveness score from tested scenarios and a discovery count of gaps found outside testing, to avoid penalizing proactive discovery while measuring how quickly teams find gaps, adapt detections, and improve coverage over time.
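The effectiveness half of the idea is easy to prototype. A minimal sketch of the weighted score (control areas, counts, and weights are made up for illustration; the separate “gaps discovered outside testing” metric is just a tally and needs no formula):

```python
def detection_rate(detected: int, missed: int) -> float:
    # Detected / (Detected + Missed under test conditions)
    return detected / (detected + missed)


def weighted_effectiveness(areas: dict[str, tuple[int, int, float]]) -> float:
    # areas: control area -> (detected, missed, org-specific weight)
    total_weight = sum(w for _, _, w in areas.values())
    return sum(w * detection_rate(d, m)
               for d, m, w in areas.values()) / total_weight


areas = {
    "Identity & Access": (8, 2, 0.40),  # illustrative numbers
    "Endpoint":          (6, 4, 0.35),
    "Cloud":             (3, 2, 0.25),
}
print(round(weighted_effectiveness(areas), 2))  # → 0.68
```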
💡 I really like the idea of separating a) measuring how we’re doing today and b) new things we discovered we were missing before. In many security domains, if you’re calculating security metrics naively, when you, say, add a new security scanning tool, you start surfacing vulnerabilities that were already there — but from an “open vulnerabilities” or “new vulns per unit time” perspective, you look worse in the short term. But really you just have fewer unknown unknowns / security gaps.
AI + Security
Quicklinks
Gadi Evron - So you’d like to get started finding 0days with LLMs?
Open Source SaaS is Dead; Long Live Open Source - In the current debate around security threats to open source code, the real danger is not new AI tools, but reliance on a broken SaaS model. Security through obscurity was never an option.*
RSA 2026: The Great Cooking - A 🌶️ roasting of which vendors are GPT wrappers, cooked, or actually hard. I don’t agree with all of the analyses, but it’s also an interesting overview of the product categories, sponsor tiers, etc.
*Sponsored by Authentik
Mythos Quicklinks
Mythos found 271 vulnerabilities in Firefox - “Defenders finally have a chance to win, decisively… The defects are finite, and we are entering a world where we can finally find them all.” LFG security fam 🤘
Davi Ottenheimer dug into the details a bit, I’m not sure what to think yet - Mythos Mystery in Mozilla Numbers: How 22 Vulns Became 271 or Maybe 3 in April
Mythos accessed by unauthorized users - Mercor (a training contractor) got breached, which leaked Anthropic’s model naming conventions; some hackers guessed the URL pattern, and contractor credentials still worked.
Zvi Mowshowitz - Claude Mythos #2: Cybersecurity and Project Glasswing - Mega post covering a bunch of related context and discussions
Exploits don't cause cyberattacks
Joshua Saxe argues that predictions of exponential cyberattack growth from AI vulnerability discovery capabilities (like Claude Mythos) are overblown, pointing to how previous AI capabilities (cheap Turing Test-passing chat, voice dialogue, and deepfakes) didn't cause the predicted tsunami of social engineering attacks, phone scams, or misinformation despite being free and trivially accessible. Attackers choose the easiest path to achieve their goals, and since most attacker groups already accomplish their objectives through simple phishing, credential stuffing, and known CVE exploitation, they're not blocked by vulnerability research limitations.
Joshua recommends actor-centric threat modeling by asking which attacker constituencies would actually be unblocked by AI vulnerability research capabilities, what new goals they could pursue, and what cultural/organizational barriers might slow attacker AI adoption.
💡 Joshua Saxe always has thoughtful, measured takes, I highly recommend his posts.
Myth & Mythos: Where Do We Go From Here?
Joe Slowik examines the marketing hype around Mythos, but the unique perspective he adds is a critique of Project Glasswing and the consortium, which includes major tech companies (Microsoft, Apple, Linux Foundation, Cisco, Palo Alto) but excludes critical OT/ICS vendors like Siemens, Rockwell, and Schneider Electric, as well as major non-US networking providers like Juniper, Ericsson, and Huawei. Currently, there’s a very “tech-company” focus, when ideally there’d also be investment in defending legacy infrastructure in critical sectors like healthcare, utilities, and industrial facilities.
💡 I agree with Joe’s emphasis on the broader set of companies and industries that foundation model labs should be helping as well. I expect Anthropic will, they just haven’t gotten there yet.
Our evaluation of Claude Mythos Preview’s cyber capabilities
The UK's AI Security Institute evaluated Mythos on cybersecurity tasks, finding it achieved 73% success on expert-level CTF challenges and became the first model to complete "The Last Ones" (TLO), a 32-step simulated corporate network attack requiring an estimated 20 hours of human expert time, succeeding in 3 out of 10 attempts and averaging 22 out of 32 steps completed.
The evaluation used progressively harder benchmarks including CTF challenges and cyber ranges, with Mythos demonstrating the ability to autonomously discover and exploit vulnerabilities in multi-stage attacks on vulnerable networks, though it struggled with operational technology environments and was tested without realistic defensive measures like active defenders, EDR, or security monitoring. Performance continued scaling with increased token budgets up to the 100M token limit tested, suggesting further improvements are possible with additional inference compute.
“There are also no penalties for the model for undertaking actions that would trigger security alerts. This means we cannot say for sure whether Mythos Preview would be able to attack well-defended systems.”
The Boy That Cried Mythos: Verification is Collapsing Trust in Anthropic
This post by Davi Ottenheimer is probably the most critical take on Mythos I’ve found. Davi argues that Anthropic's Claude Mythos Preview system card claims "thousands of zero-days," yet the 244-page document dedicates only 7 pages to cybersecurity and never quantifies vulnerabilities with CVEs or CVSS scores. Its flagship Firefox demonstration collapses under scrutiny: the model exploited two bugs already found by Claude Opus 4.6, in already-patched Firefox 147, in a test harness with sandboxing and defenses stripped out, achieving 72.4% exploit success that drops to 4.4% when those two bugs are removed.
Anthropic's own cyber range tests admitted the model "failed against a properly configured sandbox with modern patches" and cannot compromise operational technology environments. The $100M Project Glasswing "defensive initiative" is actually $4M in donations plus $100M in API credits to use Mythos itself, with few partner-confirmed findings, no comparison to existing fuzzers (AFL, libFuzzer, OSS-Fuzz), no false-positive rates, and a 90-day report promise with no delivery yet.
💡 This post is overall more negative about Mythos than I personally feel is justified, but it’s good to read critical analyses of published results, as it gives us examples of analyzing blog claims more thoughtfully and critically. Worth reading, it pulls out some nuances I haven’t seen discussed elsewhere.
AI Cybersecurity After Mythos: The Jagged Frontier
I have thoughts about this post. AISLE’s Stanislav Fort describes attempting to replicate the Mythos showcase vulnerabilities on eight small, cheap, open-weights models. He argues the moat is the system (targeting, iterative deepening, validation, triage, maintainer trust) rather than the model itself.
Dawid Moczadlo pointed out that the prompts are quite specific about how exactly the vulnerability occurred (“Consider the behavior of the SEQ_LT/SEQ_GT macros with sequence number wraparound.”). Michael Bleigh and julia commented on giving the models just the code they need to analyze, vs. scanning the whole repo from scratch. See also the LinkedIn comments on the posts here and here.
💡 First off: I think the AISLE team has clearly found and reported a number of high impact vulnerabilities in open source, which is great. However, I feel this post is highly misleading from an experimental design point of view, and the claims do not match up with what was tested:
They gave the small models just the context needed to validate the vulnerabilities; the models didn’t need to search the code base, which is a core part of the problem.
The prompts given were tailored to the specific vulnerabilities (not even vulnerability class) being evaluated.
False positive rates weren’t discussed, nor costs if the whole code base were to be scanned.
System Over Model: Zero-Day Discovery at the Jagged Frontier
Follow-up post by AISLE’s Stanislav Fort in which they open sourced nano-analyzer, a deliberately simple single-file Python scanner that uses cheap models (gpt-5.4-nano at $0.20/M tokens) to brute-force scan entire codebases in parallel, detecting Anthropic's flagship Mythos FreeBSD RCE 2/3 times with models as small as 3.6B active parameters at ~100-800x lower cost than Mythos. The three-stage pipeline (context generation, vulnerability scanning, skeptical triage with grep access) found maintainer-confirmed bugs in FreeBSD's NFS RPCsec_gss subsystem and a responsibly-disclosed 26-year-old memory corruption bug. They scanned the full FreeBSD kernel (35K files, 7.5M lines) in 10 hours for under $100 in API costs. Takeaway: adequate intelligence deployed with massive parallelism can surface real zero-days without hand-scoped snippets.
Sean Heelan: “Conventionally, if you want to test if an LLM can find a bug where the root cause is a memcpy into a statically sized stack buffer, you would not put exactly that in the prompt as an example.” More.
💡 The methodology in this post overall seems much better, though I need to read it in more detail. It’s still not totally clear to me what the false positive rate was, though.
Overall I think it’s great that folks are analyzing what level of model intelligence + scaffolding can replicate the claims of frontier labs using nonpublic models.
Needles and haystacks: Can open-source & flagship models do what Mythos did?
Semgrep’s Kurt Boberg benchmarked Claude Opus 4.6, GPT 5.4, Gemini 3.1-pro, Deepseek R1-0528, and Qwen 3.6-plus against two vulnerabilities from the Mythos blog post (OpenBSD TCP SACK and FreeBSD NFS RCE), and found that no models reliably identified vulnerabilities when analyzing full files without extremely specific hints. When scope was narrowed to individual functions, performance improved dramatically. Kurt found that using LLMs as "hotspot interrogators" paired with deterministic pre-filtering to surface interesting targets consistently outperformed naive whole-file prompting. Reproduction GitHub repo here.
💡 I like how this post emphasizes the importance of the experimental design in replicating Mythos’ findings. Yes, if you pull the needle (vulnerability) out of the haystack (large code base), small models can find it. Also, false positive rates and token costs matter.
Great Venn-diagram-ish visualization at the top comparing the experiment choices of this analysis vs. Anthropic’s vs. AISLE’s. The key takeaway: there is a lot to think about when measuring how good a model, tool, or product is at finding vulnerabilities.
We Reproduced Anthropic's Mythos Findings with Public Models
Vidoc’s Dawid Moczadło et al. describe their attempts to reproduce Anthropic's Mythos vulnerability findings using the publicly available GPT-5.4 and Claude Opus 4.6 models in opencode. Both Opus 4.6 and GPT-5.4 reproduced the Botan and FreeBSD findings, only Opus 4.6 reproduced OpenBSD, and both models had partial success on FFmpeg and wolfSSL. Across all of the scans, the cost to scan a single file stayed below $30.
“If there is still a real gap between Mythos and public models here, it looks much more like exploit construction and operationalization than basic discovery of the underlying bug.”
Misc
Humor
Misc
Inside the stealthy startup that pitched brainless human clones - Building clones of yourself that you can then harvest for parts. Makes sense from a scientific point of view but yikes 😬
Notes from the SF Peptide Scene - Insane story.
Alex Hormozi - How Acquisition.com Makes Money
OZ PEARLMAN: Sundae Conversation with Caleb Pressley - Impressive magic that hurts my brain 😂
Orson Scott Card - You don’t need advice from editors on rejected manuscripts - Some stories about Ender’s Game getting rejected multiple times
AI + Product Releases, Interviews
Codex for (almost) everything - Codex can now operate your computer, generate images, remember your preferences, learn from previous actions, and take on ongoing and repeatable work. The Codex app also now includes deeper support for developer workflows, like reviewing PRs, viewing multiple files & terminals, connecting to remote devboxes via SSH, and an in-app browser to make it faster to iterate on frontend designs, apps, and games.
Alice Hunsberger - Resources for upskilling trust and safety and fraud teams on AI
Latent Space - Extreme Harness Engineering: 1M LOC, 1B toks/day, 0% human code or review — Ryan Lopopolo, OpenAI
AI Engineer - How AI is changing Software Engineering: A Conversation with Gergely Orosz
✉️ Wrapping Up
Have questions, comments, or feedback? Just reply directly, I’d love to hear from you.
If you find this newsletter useful and know other people who would too, I'd really appreciate if you'd forward it to them 🙏
Thanks for reading!
Cheers,
Clint
P.S. Feel free to connect with me on LinkedIn 👋