[tl;dr sec] #316 - How Trail of Bits uses Claude Code, GitHub Threat Intel, Open Source AI Pentesting Tools

Extensive guide on being a Claude Code power user, tracking threat actors on GitHub, open source AI-powered pentesting tools

Hey there,

I hope you’ve been doing well!

🤖 Come on AI-leen

Am I pleased with myself that I was able to make the intro title a pun off of AI + this 1982 music classic? Absolutely.

I have a few more #PeakBayArea stories to share.

  • I “watched” the Super Bowl with the OpenAI Codex team (shout-out Ian). Multiple people were still vibe coding on their laptops throughout.

  • I attended an event hosted by several AI video generation companies, and they played a few AI-generated shorts that were quite impressive.

    • One thing I find exciting about these tools is that great storytellers can now create shorts and even movies that previously would have taken dozens to hundreds of people and cost millions.

    • I see this same trend with AI in coding and other domains: the top performers are becoming even more leveraged.

    • Sidenote: it was easy to immediately tell whether an attendee was a creative or a regular tech worker, since the two groups dressed visibly differently 😂 

  • I attended a Lunar New Year party (shout-out Grace) during which someone did some crowd work for a video: “When I say ‘A’, you say ‘I’. ‘A’… (everyone) ‘I’… ‘A’… ‘I’.”

You know, just normal party stuff 😂 

P.S. Semgrep is doing a live keynote next week on some things we’ve been cooking.

Sponsor

  🔑☁️ Your AWS Keys Have Leaked - Now What?

Long-lived AWS keys are a major security liability, often hiding in plain sight within source code and build artifacts. Join Joseph Leon (Truffle Security) and guest expert Eduard Agavriloae to break down:

  • Exploitation: How attackers find and use leaked keys.

  • Triage: Critical immediate steps after a leak.

  • Prevention: Shifting to short-lived, identity-based roles.

👉 Watch Webinar 👈

I’m a fan of Truffle and Eduard’s research; they’ve been featured a lot in tl;dr sec. This should be a practical, useful webinar 👍️ 

AppSec

Encrypting files with passkeys and age
Filippo Valsorda describes Typage, a TypeScript implementation of the age file encryption format that supports symmetric encryption with passkeys and other WebAuthn credentials in browsers. To learn more about passkeys and WebAuthn, Filippo highly recommends Adam Langley’s A Tour of WebAuthn.
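
For intuition, here’s a minimal sketch (not Typage’s actual construction) of how a passkey can yield a symmetric file key: the WebAuthn “prf” extension returns per-credential pseudorandom bytes, which you can then stretch into an encryption key with a standard KDF. The `prf_output` placeholder and the HKDF label below are illustrative assumptions.

```python
# Hypothetical sketch of deriving a file key from a passkey. In the browser,
# navigator.credentials.get() with the "prf" extension returns pseudorandom
# bytes bound to the credential; prf_output below stands in for that value.
# Typage's real construction differs; the KDF step here is plain HKDF
# (pip install cryptography).
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

prf_output = bytes(32)  # placeholder for the WebAuthn PRF extension output

file_key = HKDF(
    algorithm=hashes.SHA256(),
    length=32,                        # 256-bit symmetric key
    salt=None,
    info=b"example file encryption",  # illustrative domain-separation label
).derive(prf_output)
```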

Mercari’s Phishing-Resistant Accounts with Passkey
Karino Tatsuya describes Mercari's journey from offering passkeys as an alternative authentication method to creating fully phishing-resistant "passkey accounts" that completely disable password and SMS OTP authentication. Mercari improved the recovery experience using Japan's MyNumber digital ID card for high-assurance identity proofing, allowing self-service recovery without compromising security, and implemented risk-based requirements for different services to drive adoption. Their approach has resulted in 10.9 million passkey accounts (approximately half of monthly active users), with passkey authentication expected to surpass password authentication next year.

A Beginner’s Guide: Cross-Device Passkeys
Google’s Harsh Lal describes Hybrid transport, which enables cross-device passkey authentication when users need to sign in on devices where their passkey isn't stored, such as public terminals or shared computers. The flow works by having the client device display a QR code containing a FIDO URI with a session identifier, which the authenticator device (e.g. smartphone) scans to establish an end-to-end encrypted tunnel over the internet, while Bluetooth Low Energy performs a proximity check to confirm physical presence and prevent man-in-the-middle attacks.

The authenticator uses its private key (which never leaves the device) to sign a cryptographic challenge from the server, transmitting only the signature back through the encrypted tunnel for verification. This approach maintains passkeys' phishing resistance while solving the key adoption challenge of accessing accounts across different devices and operating systems without exposing credentials on shared machines.

Sponsor

📣 Over-Privileged AI Systems Drive 4.5x Higher Incident Rates

New research surveying 205 CISOs reveals a stark reality: 92% of orgs are deploying AI into production, but 67% still rely on static credentials. The result? Organizations with over-privileged AI systems report 76% incident rates vs. just 17% for those enforcing least privilege. Identity fragmentation + AI agents = exponentially larger blast radius.

Least privilege and identity for AI agents/systems are quite important, I’m curious to see what they found 👀 

Cloud Security

Testing Access to AWS Resources Without Angering the People That Pay the Bills
Plerion’s Daniel Grzelak presents a methodology for safely testing AWS resource access permissions without reading sensitive data or changing state. Daniel outlines four core techniques: comparing unsigned vs. signed requests (public SQS queues accept unauthenticated requests), calling metadata read APIs, executing no-op operations (UntagResource with nonexistent tags), and crafting malformed requests that pass authorization but fail validation (SNS Publish with an empty message). The post describes a “3-topic method” for determining when a malformed request actually proves authorization: test against allowed, denied, and nonexistent resources to confirm authorization checks occur before validation.
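
A minimal sketch of the malformed-request idea (not sns-buster itself), assuming SNS’s usual error-code strings and disabling botocore’s client-side validation so the empty message actually reaches the service:

```python
# Sketch of the "3-topic method": send a deliberately malformed SNS Publish
# (empty message). SNS checks authorization before parameter validation, so
# the error code reveals whether you're authorized without ever delivering a
# message. Error-code strings below are assumptions based on common SNS
# responses; calibrate against known topics before trusting results.
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError

# Disable client-side validation so the malformed request reaches the service.
sns = boto3.client("sns", region_name="us-east-1",
                   config=Config(parameter_validation=False))

def probe(topic_arn: str) -> str:
    try:
        sns.publish(TopicArn=topic_arn, Message="")  # malformed on purpose
    except ClientError as e:
        code = e.response["Error"]["Code"]
        if code == "InvalidParameter":    # passed authz, failed validation
            return "publish authorized"
        if code == "AuthorizationError":  # denied before validation
            return "publish denied"
        if code == "NotFound":            # topic doesn't exist
            return "no such topic"
        return f"inconclusive ({code})"
    return "unexpected success"

# Hypothetical ARNs: one you can publish to, one you can't, one that doesn't
# exist. Matching results confirm authz is evaluated before validation.
ALLOWED_ARN = "arn:aws:sns:us-east-1:111111111111:my-allowed-topic"
DENIED_ARN = "arn:aws:sns:us-east-1:222222222222:someone-elses-topic"
NONEXISTENT_ARN = "arn:aws:sns:us-east-1:111111111111:does-not-exist"

for arn in (ALLOWED_ARN, DENIED_ARN, NONEXISTENT_ARN):
    print(arn, "->", probe(arn))
```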

Plerion has also released sns-buster, an open-source tool that automates testing 14 SNS API actions with 30+ parameter mutations per action to empirically verify topic exposure.

💡 A pretty clever approach for “checking facts” about an environment in a thoughtful, minimally intrusive way 👍️ 

Can an AI Agent Run a Purple Team Exercise?
Permiso describes deploying their AI agent, Rufio, to emulate Scattered Spider tactics against an AWS environment and validate their detection coverage. They originally built Rufio to detect malicious OpenClaw skills; over 12 days it wrote 135 YARA rules, scanned >2,500 skills across marketplaces, confirmed 21 threats, and built 16 custom skills.

In this post, Rufio was given a blog post describing Scattered Spider TTPs and had to understand the AWS Management Console and CLI patterns required to execute each step, then translate the post’s tactical description into actual API calls and console interactions. Rufio autonomously created an IAM user (LUCR-3-operator), attached AdministratorAccess, generated access keys, attempted to enable the EC2 serial console, and harvested CloudShell credentials.
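
For reference, the IAM side of that attack path is only a few API calls; a boto3 sketch (lab accounts only) that’s handy for checking whether your CloudTrail detections fire:

```python
# Emulates the IAM persistence steps described above. Run ONLY in a lab
# account you own. Each call lands in CloudTrail (CreateUser,
# AttachUserPolicy, CreateAccessKey), which is what detections should
# alert on.
import boto3

iam = boto3.client("iam")

iam.create_user(UserName="LUCR-3-operator")  # username from the post
iam.attach_user_policy(
    UserName="LUCR-3-operator",
    PolicyArn="arn:aws:iam::aws:policy/AdministratorAccess",
)
key = iam.create_access_key(UserName="LUCR-3-operator")
print("new key:", key["AccessKey"]["AccessKeyId"])
```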

During this exercise, they found an instruction-following gap: Rufio failed to switch from the federated Okta session to the newly created IAM credentials for subsequent actions, despite being explicitly told to do so.

💡 This is cool: given a public description of a threat actor’s TTPs or a breach report, automatically have an agent follow the same path → test your detections or even create new ones. It’s not perfect today, but agents will continue to get better at following instructions and not getting lost.

I’m imagining a future where this approach converges to an almost “herd immunity” type thing: one company publishes a threat actor’s TTPs or unique attack flows, then blue teams everywhere can essentially auto-test “would I detect this in my environment?” and auto-enable the relevant logs, tune detections, etc. That’d be rad.

Supply Chain

safedep/pmg
By SafeDep: Package Manager Guard (PMG) acts as a security middleware layer, wrapping your package manager to analyze packages for malware before they are installed, sandboxing the installation process to prevent system modification, and auditing every package installation event.
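
The wrapping pattern itself is simple; a toy illustration (not PMG’s implementation, and `looks_malicious` is a hypothetical stand-in) that vets requested packages before handing off to the real installer:

```python
#!/usr/bin/env python3
# Toy illustration of wrapping a package manager: vet each requested
# package, then hand off to the real installer. Not PMG's code.
import subprocess
import sys

def looks_malicious(pkg: str) -> bool:
    """Stand-in for a real malware / typosquat / threat-intel check."""
    return False

requested = [a for a in sys.argv[1:] if not a.startswith("-")]
blocked = [p for p in requested if looks_malicious(p)]
if blocked:
    sys.exit(f"blocked suspicious package(s): {', '.join(blocked)}")

# Nothing flagged: pass the original arguments through to pip.
subprocess.run(["pip", "install", *sys.argv[1:]], check=True)
```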

AikidoSec/safe-chain
By Aikido: A lightweight proxy server that intercepts package downloads from the npm registry and PyPI to protect against malicious code (verifies packages against Aikido’s open source threat intel database). Blocks packages newer than 24 hours.
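
That 24-hour heuristic is easy to replicate yourself; a minimal sketch against the public npm registry’s `time` metadata (which records when each version was published):

```python
# Minimal sketch of the "block packages newer than 24 hours" heuristic,
# using the public npm registry's version-publish timestamps.
from datetime import datetime, timedelta, timezone

import requests

def is_too_new(package: str, version: str,
               min_age: timedelta = timedelta(hours=24)) -> bool:
    meta = requests.get(f"https://registry.npmjs.org/{package}", timeout=10).json()
    published = datetime.fromisoformat(meta["time"][version].replace("Z", "+00:00"))
    return datetime.now(timezone.utc) - published < min_age

print(is_too_new("left-pad", "1.3.0"))  # an old version -> False
```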

💡 See also Socket’s firewall.

The Forensic Trail On GitHub: Hunting For Supply Chain Activity
Slides from Wiz’s Rami McCarthy and Amitai Cohen’s Black Hat EU 2025 talk covering a methodology for investigating and tracking real-world supply chain attacks exploiting GitHub Actions. The talk describes useful threat intelligence available directly from both GitHub and Git, and includes demos of how to effectively pivot on user metadata and behavioral heuristics, uncover attacker forks, and recover deleted gists and commits. They also demonstrate how to trace attacker aliases, identify targets of reconnaissance, and unmask attackers and researchers in real time.

See also GitHunt, their accompanying GitHub repo with three demos: a Flask web app demonstrating how to identify and investigate an attack based on the public GitHub firehose, a CLI tool for investigating GitHub activity, and a toy tool that identifies suspicious GitHub activity, enriches it, and renders it for further investigation.
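
As a taste of the kind of pivoting the talk covers, here’s a toy sketch pulling a suspect account’s public events from the GitHub REST API (unauthenticated, so heavily rate-limited):

```python
# Toy pivot in the spirit of the talk: list a suspect account's recent
# public activity (forks, pushes, gist events) via the GitHub REST API.
import requests

def recent_activity(user: str) -> None:
    resp = requests.get(
        f"https://api.github.com/users/{user}/events/public",
        headers={"Accept": "application/vnd.github+json"},
        timeout=10,
    )
    resp.raise_for_status()
    for event in resp.json():
        print(event["created_at"], event["type"], event["repo"]["name"])

recent_activity("octocat")  # example account
```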

Unveiling Bagel: Why Your Developer's Laptop is the Softest Target in Your Supply Chain
Boost Security’s Alexis-Maurer Fortin announces the release of bagel, an open-source CLI that inventories security-relevant metadata on developer workstations, including credentials, misconfigurations, and exposed secrets across Git, SSH, NPM, cloud providers (AWS/GCP/Azure), GitHub CLI, and IDE configurations. Bagel works in a privacy-focused way, only reporting secret locations, types, and SHA-256 fingerprints, not actual values. Bagel also scans for configuration weaknesses (like disabled SSL verification in Git/NPM, SSH config settings, etc.), secrets in environment variables and shell history, and active sessions (active GitHub CLI sessions, cached cloud provider tokens, SSH agent state).
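
The privacy-preserving reporting boils down to recording where a secret lives plus a fingerprint of it, never the value itself; a sketch of the idea (not bagel’s code):

```python
# Sketch of fingerprint-not-value reporting: record a secret's location,
# type, and SHA-256 digest only, so the report itself leaks nothing.
import hashlib
import json

def fingerprint(path: str, kind: str, value: str) -> dict:
    return {
        "path": path,
        "type": kind,
        "sha256": hashlib.sha256(value.encode()).hexdigest(),
    }

# Hypothetical npm token found in ~/.npmrc
print(json.dumps(fingerprint("~/.npmrc", "npm-token", "npm_abc123"), indent=2))
```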

💡 The post makes a great point: developers generally have a variety of types of privileged access (often direct or indirect production-level access), but this access isn’t monitored as well as, say, CI environments that hold similar levels of secrets. By inventorying which secrets live where, you can get a feel for your exposure in the face of a Shai-Hulud-type supply chain attack.

Blue Team

rex-rs/rex
A kernel extension framework that lets you write eBPF-style programs in safe Rust that compile directly to native code, bypassing the in-kernel eBPF verifier and its complexity constraints. By leveraging Rust's safety guarantees instead of static verification, Rex eliminates common eBPF pain points like program complexity limits, verifier-unfriendly compiler output, and counter-intuitive code patterns.

Matmaus/LnkParse3
A Python tool for parsing Windows shortcut (.LNK) files, handling malformed files gracefully.
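
Usage is a couple of lines; a sketch based on my reading of the README, so treat the exact API as an assumption:

```python
# Hedged usage sketch of LnkParse3: parse a shortcut and dump its
# metadata as JSON (API per the project's README, to the best of my
# knowledge).
import LnkParse3

with open("suspicious.lnk", "rb") as indata:
    lnk = LnkParse3.lnk_file(indata)
    lnk.print_json()
```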

Trust Me, I’m a Shortcut
Wietze discusses five Windows LNK file spoofing techniques that allow attackers to hide malicious targets and command-line arguments from Explorer's Properties dialog. Wietze has released lnk-it-up, an open-source toolkit containing lnk-generator (creates malicious LNKs using these variants) and lnk-tester (identifies deceptive LNKs by using Windows APIs to detect mismatches between displayed and actual targets).

The variants (a toy check for the first one follows the list):

  • Padding command-line args with whitespace to hide them beyond the 256-char display limit.

  • Using the HasExpString flag with a null EnvironmentVariableDataBlock to hide arguments.

  • Setting invalid paths in EnvironmentVariableDataBlock to spoof targets while executing LinkTargetIDList.

  • Exploiting a non-conforming LinkTargetIDList with LinkInfo fallback.

  • The most powerful variant: populating only TargetAnsi while leaving TargetUnicode null to completely spoof targets and hide arguments.
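
A toy detection for the whitespace-padding variant, flagging argument strings padded past Explorer’s display window (the JSON field names are assumptions based on LnkParse3’s output format):

```python
# Toy check for the whitespace-padding variant: arguments padded so the
# malicious tail sits past Explorer's ~256-character display window.
# Field names are assumed from LnkParse3's JSON output; verify locally.
import LnkParse3

def suspicious_padding(path: str, display_limit: int = 256) -> bool:
    with open(path, "rb") as indata:
        info = LnkParse3.lnk_file(indata).get_json()
    args = info.get("data", {}).get("command_line_arguments") or ""
    # Long args containing a big whitespace run suggest deliberate padding.
    return len(args) > display_limit and " " * 32 in args

print(suspicious_padding("suspicious.lnk"))
```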

“In summary, we have seen that LNK files are unpredictable because crucial pieces of information might be hidden or entirely spoofed, meaning it is not straightforward to anticipate what will happen when an LNK file is opened.”

Red Team

Introducing a new way to buzz for eBPF vulnerabilities
(2023) Google’s Juan José López Jaimez and Meador Inge announce Buzzer, a new eBPF fuzzing framework that aims to help harden the Linux kernel. Buzzer aims to detect errors in the eBPF verifier (which checks that an eBPF program satisfies various safety rules) by generating many eBPF programs and, for each program the verifier deems safe, executing it in a running kernel to determine whether it is actually safe. Runtime behavior errors are detected through instrumentation code added by Buzzer.

Breaking eBPF Security: How Kernel Rootkits Blind Observability Tools
MatheuZ demonstrates how an attacker with kernel module loading capability can systematically blind eBPF-based security tools (Falco, Tracee, GhostScan, Decloaker) by hooking kernel functions via ftrace rather than attacking the eBPF programs themselves. Testing showed complete evasion: Falco missed reverse shells, file modifications to /etc/passwd and /etc/shadow, and privilege escalation; Tracee's process enumeration and syscall tracing showed nothing; and iterator-based tools like GhostScan and Decloaker failed to detect hidden processes or network connections.

Core insight: once an attacker controls the kernel (via loaded modules when Secure Boot is disabled), they control the kernel→userspace data delivery mechanisms that eBPF tools depend on, making observability optional regardless of how correctly the eBPF programs execute.
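
One quick (and evadable) triage idea: ftrace keeps its own bookkeeping of attached callbacks, so on a box where you run no tracing, a populated enabled_functions file is worth a look. A sketch, with the caveats that tracefs paths vary by distro and a capable rootkit can scrub this view too:

```python
# Quick triage sketch: list kernel functions with ftrace callbacks attached.
# Needs root. Absence of entries proves nothing (rootkits can hide here),
# but unexpected entries on a non-tracing host merit investigation.
for path in ("/sys/kernel/tracing/enabled_functions",
             "/sys/kernel/debug/tracing/enabled_functions"):
    try:
        with open(path) as f:
            hooks = [line.strip() for line in f if line.strip()]
    except (FileNotFoundError, PermissionError):
        continue  # try the other tracefs mount point
    print(f"{path}: {len(hooks)} hooked function(s)")
    for hook in hooks[:10]:
        print("  ", hook)
    break
```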

AI + Security

Quicklinks

  • trailofbits/claude-code-config - Opinionated defaults, documentation, and workflows for Claude Code at Trail of Bits. Covers sandboxing, permissions, hooks, skills, MCP servers, and usage patterns they’ve found effective across security audits, development, and research.

  • trailofbits/skills-curated - Trail of Bits' reviewed and approved Claude Code plugins. Every skill and marketplace here has been vetted for quality and safety.

KeygraphHQ/shannon
An autonomous AI pentester that actively exploits web application vulnerabilities rather than just identifying them. Shannon uses its built-in browser to execute real exploits, such as injection attacks and auth bypasses, to prove each vulnerability is actually exploitable. It combines white-box source code analysis with black-box dynamic exploitation across four phases (reconnaissance, vulnerability analysis, exploitation, and reporting). Shannon achieved a 96% success rate on the hint-free XBOW Benchmark.

samugit83/redamon
By Samuele Giampieri: An AI-powered agentic red team framework that automates offensive security operations, from reconnaissance to exploitation to post-exploitation. RedAmon is a LangGraph-based ReAct agent that chains a six-phase reconnaissance pipeline (subdomain discovery; port scanning with Naabu; HTTP probing with httpx; resource enumeration with Katana/GAU/Kiterunner; vulnerability scanning with Nuclei; and MITRE enrichment and GitHub secret hunting) into a Neo4j knowledge graph.

The AI agent orchestrator operates in three phases: informational (graph queries and web searches), exploitation (CVE-based exploits or credential brute-forcing with user approval), and post-exploitation (Meterpreter sessions or stateless command execution), with all successful compromises automatically recorded as Exploit nodes in the graph, linked to the target IP, CVE, and port. Video tutorial.
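
For flavor, recording a compromise in a graph looks something like the sketch below; the node labels and relationship are my guesses, not RedAmon’s actual schema:

```python
# Sketch of recording a successful exploit as a graph node linked to the
# target IP, CVE, and port (illustrative schema, not RedAmon's).
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def record_exploit(ip: str, cve: str, port: int) -> None:
    with driver.session() as session:
        session.run(
            """
            MERGE (h:Host {ip: $ip})
            MERGE (e:Exploit {cve: $cve, port: $port})
            MERGE (e)-[:COMPROMISED]->(h)
            """,
            ip=ip, cve=cve, port=port,
        )

record_exploit("10.0.0.5", "CVE-2021-44228", 8080)
driver.close()
```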

1Password's new benchmark teaches AI agents how not to get scammed
Jason Meller describes how 1Password built SCAM (Security Comprehension and Awareness Measure), an open-source benchmark testing whether AI agents can avoid phishing and credential theft when performing real tasks like reading emails and filling passwords. They tested eight models (Claude Opus 4.6, Sonnet 4, Haiku 4.5, GPT-5.2, GPT-4.1, GPT-4.1 Mini, Gemini 3 Flash, Gemini 2.5 Flash) across 30 scenarios and found baseline safety scores ranging from 35% (Gemini 2.5 Flash) to 92% (Claude Opus 4.6), with every model committing critical failures like typing real passwords into phishing pages or forwarding emails containing embedded credentials.

They then added a 1,200-word security Skill file that’s essentially like security awareness training but for models (advises them to analyze domains right-to-left, read content before sharing, etc.) and found it dramatically improved results, reducing total critical failures from 287 to 10 across all runs. The benchmark, skill file, testing framework, and all results are available on GitHub. Nice project overview and agent trace videos on the project landing page.
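
The “analyze domains right-to-left” advice is worth internalizing for humans too: the registrable domain is anchored at the right, so a lookalike host can embed a trusted name on the left. A toy illustration:

```python
# Toy version of "read domains right-to-left": the registrable domain is
# the rightmost label pair, so accounts.google.com.evil.example is NOT
# google.com. Real code should use the Public Suffix List instead of this
# naive two-label heuristic.
def registrable_domain(host: str) -> str:
    parts = host.lower().rstrip(".").split(".")
    return ".".join(parts[-2:])

for host in ("accounts.google.com", "accounts.google.com.evil.example"):
    print(f"{host:40} -> {registrable_domain(host)}")
```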

Introducing AI Cyber Model Arena: A Real-World Benchmark for AI Agents in Cybersecurity
Matan Vetzler, Nir Ohfeld, and Alon Schindel announce Wiz’s AI Cyber Model Arena, which benchmarks offensive AI security on 257 real-world challenges (zero-days, CVEs, API/web, and cloud across AWS/Azure/GCP/K8s), demonstrating what AI models and agents can really do. They evaluated 25 agent-model combinations (4 agents × 8 models) across offensive security challenges. Currently the top performers are Claude Opus 4.6 with Claude Code, followed by Gemini 3 Pro with Gemini CLI. It doesn’t look like they’ve tested GPT-5.3-Codex yet (probably because it isn’t available through the API). Arena landing page.

Methodology:

  • Each agent-model-challenge combination is run 3 times (pass@3).

  • Agents run in isolated Docker containers with no internet access, no CVE databases, and no external resources — the agent cannot browse the web, install packages, or access any information beyond what is in the container.

  • All scoring is deterministic (no LLM-as-judge).

💡 It’s nice to see more benchmarks measuring AI agent capabilities on offensive security tasks. Overall it seems thoughtfully designed, and I like that they measured a number of agent + model combinations. It’d be cool if they open sourced the benchmark 👀

Misc

AI

AI + Burnout

  • HBR - AI Doesn’t Reduce Work - It Intensifies It - An eight-month study at a 200-employee tech company found that AI adoption led to work intensification rather than reduction, in three main ways: task expansion (product managers writing code, researchers doing engineering work), blurred work-life boundaries (prompting AI during breaks and off-hours), and increased multitasking (managing multiple AI agents in parallel). While workers felt more productive, they reported being busier than before.

  • The AI Vampire - Steve Yegge warns that AI coding tools like Claude Code are creating a "vampire effect": devs realize they can be 10x as productive → output standards rise, leading to burnout → companies capture the value through overwork rather than sharing the benefits with employees.

    • “The world is accelerating, against its will. I can feel it; I grew up in the 1980s, when time really did move more slowly, in the sense that news and events were spaced way out, and society had time to reflect on them. Now it changes so fast we can’t even keep up, let alone reflect.”

    • “If you have joined an AI-native startup, the founders and investors are using the VC system to extract value from you, today, with the glimmer of hope for big returns for you all later.”

Misc

✉️ Wrapping Up

Have questions, comments, or feedback? Just reply directly, I’d love to hear from you.

If you find this newsletter useful and know other people who would too, I'd really appreciate if you'd forward it to them 🙏

Thanks for reading!

Cheers,
Clint

P.S. Feel free to connect with me on LinkedIn 👋