YARA-X 规则编写指南：恶意软件检测规则优化与最佳实践

yara-rule-authoring by trailofbits/skills

安装命令

npx skills add https://github.com/trailofbits/skills --skill yara-rule-authoring

YARA-X 规则编写

编写能够检测恶意软件且不会产生大量误报的检测规则。

此技能针对 YARA-X，这是基于 Rust 的旧版 YARA 的继任者。YARA-X 为 VirusTotal 的生产系统提供支持，是推荐的实现方案。如果您已有现有规则，请参阅从旧版 YARA 迁移。

核心原则

字符串必须生成良好的原子 — YARA 提取 4 字节的子序列以进行快速匹配。包含重复字节、常见序列或少于 4 字节的字符串会迫使 YARA 对过多文件进行缓慢的字节码验证。
针对特定家族，而非类别 — "检测勒索软件" 会匹配所有内容，但实际上一无所获。"检测 LockBit 3.0 配置提取例程" 才能捕获您想要的目标。
部署前在良性软件上测试 — 在 Windows 系统文件上触发的规则是无用的。使用 VirusTotal 的良性软件语料库或您自己的干净文件集进行验证。
先进行廉价检查以短路求值 — 将 filesize < 10MB and uint16(0) == 0x5A4D 放在昂贵的字符串搜索或模块调用之前。
元数据即文档 — 未来的您（以及您的团队）需要知道此规则捕获什么、为什么捕获以及样本来源。

使用时机

为恶意软件检测编写新的 YARA-X 规则
审查现有规则的质量或性能问题
优化运行缓慢的规则集
将 IOC 或威胁情报转换为检测签名
调试误报问题
为生产部署准备规则
将旧版 YARA 规则迁移到 YARA-X
分析 Chrome 扩展（crx 模块）
分析 Android 应用（dex 模块）

避免使用时机

需要反汇编的静态分析 → 使用 Ghidra/IDA 技能

平台	魔数	不良字符串	良好字符串
Windows PE	`uint16(0) == 0x5A4D`	API 名称，Windows 路径	互斥体名称，PDB 路径
macOS Mach-O	`uint32(0) == 0xFEEDFACE` (32位), `0xFEEDFACF` (64位), `0xCAFEBABE` (通用)	常见 Obj-C 方法	键盘记录器字符串，持久化路径
JavaScript/Node	(无需)	`require`, `fetch`, `axios`	混淆器特征，eval+解码链
npm/pip 包	(无需)	`postinstall`, `dependencies`	可疑包名，数据外传 URL
Office 文档	`uint32be(0) == 0x504B0304`	VBA 关键字	宏自动执行，编码载荷
VS Code 扩展	(无需)	`vscode.workspace`	不常见的 activationEvents，隐藏文件访问
Chrome 扩展	使用 `crx` 模块	常见 Chrome API	权限滥用，清单异常
Android 应用	使用 `dex` 模块	标准 DEX 结构	混淆类，可疑权限

macOS 恶意软件检测

YARA-X 提供了 macho 模块用于结构化检查，但魔数检查 + 字符串模式通常已经足够：

// Mach-O 32-bit
uint32(0) == 0xFEEDFACE
// Mach-O 64-bit
uint32(0) == 0xFEEDFACF
// Universal binary (fat binary)
uint32(0) == 0xCAFEBABE or uint32(0) == 0xBEBAFECA

macOS 恶意软件的优良指标：

键盘记录器痕迹：CGEventTapCreate, kCGEventKeyDown
SSH 隧道字符串：ssh -D, tunnel, socks
持久化路径：~/Library/LaunchAgents, /Library/LaunchDaemons
凭据窃取：security find-generic-password, keychain

来自 Airbnb BinaryAlert 的示例模式：

rule SUSP_Mac_ProtonRAT
{
    strings:
        // Library indicators
        $lib1 = "SRWebSocket" ascii
        $lib2 = "SocketRocket" ascii

        // Behavioral indicators
        $behav1 = "SSH tunnel not launched" ascii
        $behav2 = "Keylogger" ascii

    condition:
        (uint32(0) == 0xFEEDFACF or uint32(0) == 0xCAFEBABE or uint32(0) == 0xBEBAFECA) and
        any of ($lib*) and any of ($behav*)
}

JavaScript 检测决策树

Writing a JavaScript rule?
├─ npm package?
│  ├─ Check package.json patterns
│  ├─ Look for postinstall/preinstall hooks
│  └─ Target exfil patterns: fetch + env access + credential paths
├─ Browser extension?
│  ├─ Chrome: Use crx module
│  └─ Others: Target manifest patterns, background script behaviors
├─ Standalone JS file?
│  ├─ Look for obfuscation markers: eval+atob, fromCharCode chains
│  ├─ Target unique function/variable names (often survive minification)
│  └─ Check for packed/encoded payloads
└─ Minified/webpack bundle?
   ├─ Target unique strings that survive bundling (URLs, magic values)
   └─ Avoid function names (will be mangled)

JavaScript 特有的良好字符串：

以太坊函数选择器：{ 70 a0 82 31 } (balanceOf)
零宽度字符（隐写术）：{ E2 80 8B E2 80 8C }
混淆器特征：_0x, var _0x
特定 C2 模式：域名，webhook URL

JavaScript 特有的不良字符串：

require, fetch, axios — 太常见
Buffer, crypto — 到处都有合法用途
单独的 process.env — 需要特定的环境变量名

工具	用途
yarGen	提取候选字符串：`yarGen.py -m samples/ --excludegood` → 用 `yr check` 验证
FLOSS	提取混淆/栈字符串：`floss sample.exe` (当 yarGen 失败时)
yr CLI	验证：`yr check`，扫描：`yr scan -s`，检查：`yr dump -m pe`
signature-base	学习优质示例
YARA-CI	部署前进行良性软件语料库测试

掌握这五个工具。不要被工具目录分散注意力。

应拒绝的合理化理由

当您发现自己有这些想法时，请停下来重新考虑。

合理化理由	专家回应
"这个通用字符串足够独特"	先在良性软件上测试。您的直觉是错的。
"yarGen 给了我这些字符串"	yarGen 是建议，您需要验证。手动检查每一个。
"它在我的 10 个样本上有效"	10 个样本 ≠ 生产环境。使用 VirusTotal 良性软件语料库。
"一条规则捕获所有变种"	会导致误报泛滥。针对特定家族。
"如果我们收到误报，我会让它更具体"	一开始就编写严格的规则。误报会消耗信任。
"这个十六进制模式是唯一的"	在一个样本中唯一 ≠ 在整个恶意软件生态系统中唯一。
"性能不重要"	一条慢规则会拖慢整个规则集。优化原子。
"PEiD 规则仍然有效"	已过时。32 位加壳器已不相关。
"我稍后会添加更多条件"	部署弱规则 = 造成损害。
"这只是用于狩猎"	狩猎规则会变成检测规则。质量标准相同。
"这个 API 名称使其具有恶意"	合法软件使用相同的 API。需要行为上下文。
"对于这些常见字符串，any of 没问题"	常见字符串 + any = 误报泛滥。仅对单独唯一的字符串使用 `any of`。
"这个正则表达式足够具体"	`/fetch.*token/` 匹配所有认证代码。添加数据外传目的地要求。
"这个 JavaScript 看起来很干净"	攻击者会向合法代码中注入恶意代码。检查 eval+解码链。
"我会用 .* 来保持灵活性"	无界正则表达式 = 性能灾难 + 内存爆炸。使用 `.{0,30}`。
"我会到处使用 --relaxed-re-syntax"	掩盖了真正的错误。修复正则表达式，而不是隐藏问题。

这个字符串足够好吗？

Is this string good enough?
├─ Less than 4 bytes?
│  └─ NO — find longer string
├─ Contains repeated bytes (0000, 9090)?
│  └─ NO — add surrounding context
├─ Is an API name (VirtualAlloc, CreateRemoteThread)?
│  └─ NO — use hex pattern of call site instead
├─ Appears in Windows system files?
│  └─ NO — too generic, find something unique
├─ Is it a common path (C:\Windows\, cmd.exe)?
│  └─ NO — find malware-specific paths
├─ Unique to this malware family?
│  └─ YES — use it
└─ Appears in other malware too?
   └─ MAYBE — combine with family-specific marker

何时使用 "all of" 与 "any of"

Should I require all strings or allow any?
├─ Strings are individually unique to malware?
│  └─ any of them (each alone is suspicious)
├─ Strings are common but combination is suspicious?
│  └─ all of them (require the full pattern)
├─ Strings have different confidence levels?
│  └─ Group: all of ($core_*) and any of ($variant_*)
└─ Seeing many false positives?
   └─ Tighten: switch any → all, add more required strings

生产经验教训： 使用 any of ($network_*) 的规则，其中字符串包含 "fetch"、"axios"、"http"，几乎匹配了所有 Web 应用程序。改为要求凭据路径 AND 网络调用 AND 数据外传目的地后，消除了误报。

何时放弃一种规则方法

在以下情况时停止并转向：

yarGen 仅返回 API 名称和路径 → 参见当字符串失效时，转向结构
找不到 3 个唯一字符串 → 可能已加壳。针对解压后的版本或检测加壳器本身。
规则匹配良性软件文件 → 字符串不够独特。1-2 个匹配 = 调查并收紧；3-5 个匹配 = 寻找不同的指标；6+ 个匹配 = 重新开始。
即使优化后性能仍然很差 → 架构问题。拆分为多个聚焦的规则或添加严格的预过滤器。
难以编写描述 → 规则太模糊。如果您无法解释它捕获了什么，那它捕获的东西就太多了。

FP Investigation Flow:
│
├─ 1. Which string matched?
│     Run: yr scan -s rule.yar false_positive.exe
│
├─ 2. Is it in a legitimate library?
│     └─ Add: not $fp_vendor_string exclusion
│
├─ 3. Is it a common development pattern?
│     └─ Find more specific indicator, replace the string
│
├─ 4. Are multiple generic strings matching together?
│     └─ Tighten to require all + add unique marker
│
└─ 5. Is the malware using common techniques?
      └─ Target malware-specific implementation details, not the technique

十六进制 vs 文本 vs 正则表达式

What string type should I use?
│
├─ Exact ASCII/Unicode text?
│  └─ TEXT: $s = "MutexName" ascii wide
│
├─ Specific byte sequence?
│  └─ HEX: $h = { 4D 5A 90 00 }
│
├─ Byte sequence with variation?
│  └─ HEX with wildcards: { 4D 5A ?? ?? 50 45 }
│
├─ Pattern with structure (URLs, paths)?
│  └─ BOUNDED REGEX: /https:\/\/[a-z]{5,20}\.onion/
│
└─ Unknown encoding (XOR, base64)?
   └─ TEXT with modifier: $s = "config" xor(0x00-0xFF)

样本是否已加壳？（首先检查）

在编写任何基于字符串的规则之前：

Is the sample packed?
├─ Entropy > 7.0?
│  └─ Likely packed — find unpacked layer first
├─ Few/no readable strings?
│  └─ Likely packed — use entropy, PE structure, or packer signatures
├─ UPX/MPRESS/custom packer detected?
│  └─ Target the unpacked payload OR detect the packer itself
└─ Readable strings available?
   └─ Proceed with string-based detection

专家指导： 不要针对加壳层编写规则。加壳方式会变；载荷不会。

当字符串失效时，转向结构

如果 yarGen 仅返回 API 名称和通用路径：

String extraction failed — what now?
├─ High entropy sections?
│  └─ Use math.entropy() on specific sections
├─ Unusual imports pattern?
│  └─ Use pe.imphash() for import hash clustering
├─ Consistent PE structure anomalies?
│  └─ Target section names, sizes, characteristics
├─ Metadata present?
│  └─ Target version info, timestamps, resources
└─ Nothing unique?
   └─ This sample may not be detectable with YARA alone

专家指导： "可以尝试使用其他文件属性，例如元数据、熵、导入哈希或其他保持恒定的数据。" — Kaspersky Applied YARA Training

专家启发式方法

字符串选择： 互斥体名称是黄金；C2 路径是白银；错误消息是青铜。栈字符串几乎总是唯一的。如果您需要超过 6 个字符串，那就是过度拟合。

条件设计： 从 filesize < 开始，然后是魔数，接着是字符串，最后是模块。如果超过 5 行，拆分为多个规则。

质量信号： yarGen 输出需要过滤掉 80%。匹配变种少于 50% 的规则太窄；匹配良性软件的规则太宽。

修饰符纪律：

切勿推测性地使用 nocase 或 wide — 仅当您有确凿证据表明样本中大小写/编码存在变化时才使用
nocase 会使原子生成翻倍；wide 会使字符串匹配翻倍 — 两者都有实际成本
"如果您没有明确的理由使用这些修饰符，就不要用" — Kaspersky Applied YARA

正则表达式锚定：

没有 4+ 字节字面量子串的正则表达式 会在每个文件偏移处求值 — 灾难性的性能
始终将正则表达式锚定到一个独特的字面量：/mshta\.exe http:\/\/.../ 而不是 /http:\/\/.../
如果无法锚定，考虑改用带通配符的十六进制模式

始终用 filesize 限制循环：filesize < 100KB and for all i in (1..#a) : ...
无界的 #a 在大文件中可能成千上万，循环体对每次匹配都会执行一次，成本随匹配次数增长

YARA-X 技巧： $_unused 用于抑制警告；private $s 用于隐藏输出；每次提交前使用 yr check + yr fmt。

何时使用模块 vs 字节检查

Should I use a module or raw bytes?
├─ Need imphash/rich header/authenticode?
│  └─ Use PE module — too complex to replicate
├─ Just checking magic bytes or simple offsets?
│  └─ Use uint16/uint32 — faster, no module overhead
├─ Checking section names/sizes?
│  └─ PE module is cleaner, but add magic bytes filter FIRST
├─ Checking Chrome extension permissions?
│  └─ Use crx module — string parsing is fragile
└─ Checking LNK target paths?
   └─ Use lnk module — LNK format is complex

专家指导： "避免使用 magic 模块 — 改用显式的十六进制检查" — Neo23x0。应用这个原则：如果能用 uint32() 完成，就不要加载模块。

近期版本的关键新增功能：

私有模式 (v1.3.0+): private $helper = "pattern" — 匹配但隐藏输出
警告抑制 (v1.4.0+): // suppress: slow_pattern 内联注释
数字下划线 (v1.5.0+): filesize < 10_000_000 以提高可读性
内置格式化程序 : yr fmt rules/ 以标准化格式
NDJSON 输出 : yr scan --output-format ndjson 用于工具集成

YARA-X 工具工作流

YARA-X 提供了旧版 YARA 所缺乏的诊断工具：

规则开发周期：

# 1. Write initial rule
# 2. Check syntax with detailed errors
yr check rule.yar

# 3. Format consistently
yr fmt -w rule.yar

# 4. Dump module output to inspect file structure (no dummy rule needed)
yr dump -m pe sample.exe --output-format yaml

# 5. Scan with timing info
time yr scan -s rule.yar corpus/

何时使用 yr dump：

调查可用的 PE/ELF/Mach-O 字段
调试模块条件为何不匹配
在编写规则前探索新模块（crx、lnk、dotnet）

YARA-X 诊断优势： 错误信息包含精确的源代码位置。如果 yr check 指向第 15 行，问题确实在第 15 行（不像旧版 YARA）。

Chrome 扩展分析 (crx 模块)

crx 模块支持检测恶意的 Chrome 扩展。需要 YARA-X v1.5.0+（基本功能），v1.11.0+ 支持 permhash()。

关键 API： crx.is_crx, crx.permissions, crx.permhash()

危险信号： nativeMessaging + downloads, debugger 权限，在 <all_urls> 上的内容脚本

import "crx"

rule SUSP_CRX_HighRiskPerms {
    condition:
        crx.is_crx and
        for any perm in crx.permissions : (perm == "debugger")
}

完整 API 参考、权限风险评估和示例规则，请参阅 crx-module.md。

Android DEX 分析 (dex 模块)

dex 模块支持检测 Android 恶意软件。需要 YARA-X v1.11.0+。与旧版 YARA 的 dex 模块不兼容 — API 完全不同。

关键 API： dex.is_dex, dex.contains_class(), dex.contains_method(), dex.contains_string()

危险信号： 单字母类名（混淆），DexClassLoader 反射，加密资产

import "dex"

rule SUSP_DEX_DynamicLoading {
    condition:
        dex.is_dex and
        dex.contains_class("Ldalvik/system/DexClassLoader;")
}

完整 API 参考、混淆检测和示例规则，请参阅 dex-module.md。

从旧版 YARA 迁移

YARA-X 具有 99% 的规则兼容性，但执行更严格的验证。

yr check --relaxed-re-syntax rules/  # Identify issues
# Fix each issue, then:
yr check rules/  # Verify without relaxed mode

问题	旧版	YARA-X 修复
正则表达式中的字面量 `{`	`/{/`	`/\{/`
无效转义	`\R` 静默视为字面量	`\\R` 或 `R`
Base64 字符串	任意长度	需要 3+ 字符
负索引	`@a[-1]`	`@a[#a - 1]`
重复修饰符	允许	移除重复项

注意： 仅将 --relaxed-re-syntax 用作诊断工具。修复问题，而不是依赖宽松模式。

{CATEGORY}_{PLATFORM}_{FAMILY}_{VARIANT}_{DATE}

常见前缀： MAL_ (恶意软件), HKTL_ (黑客工具), WEBSHELL_, EXPL_, SUSP_ (可疑), GEN_ (通用)

平台： Win_, Lnx_, Mac_, Android_, CRX_

示例： MAL_Win_Emotet_Loader_Jan25

完整约定、元数据要求和命名示例，请参阅 style-guide.md。

每条规则都需要：description (以 "Detects" 开头), author, reference, date。

meta:
    description = "Detects Example malware via unique mutex and C2 path"
    author = "Your Name <email@example.com>"
    reference = "https://example.com/analysis"
    date = "2025-01-29"

良好： 互斥体名称，PDB 路径，C2 路径，栈字符串，配置标记 不良： API 名称，常见可执行文件，格式说明符，通用路径

完整的决策树和示例，请参阅 strings.md。

为短路求值排序条件：

filesize < 10MB (即时)
uint16(0) == 0x5A4D (几乎即时)
字符串匹配 (廉价)
模块检查 (昂贵)

详细的优化模式，请参阅 performance.md。

收集样本 — 多个样本；单样本规则很脆弱
提取候选 — yarGen -m samples/ --excludegood
验证质量 — 使用决策树；yarGen 输出需要过滤 80%
编写初始规则 — 遵循模板并包含适当的元数据
代码检查和测试 — yr check, yr fmt, linter 脚本
良性软件验证 — VirusTotal 语料库或本地干净文件
部署 — 添加到仓库并包含完整元数据，监控误报

详细的验证工作流和误报调查，请参阅 testing.md。

涵盖从样本收集到部署所有阶段的综合分步指南，请参阅 rule-development.md。

错误	不良	良好
将 API 名称作为指标	`"VirtualAlloc"`	调用点的十六进制模式 + 唯一互斥体
无界正则表达式	`/https?:\/\/.*/`	`/https?:\/\/[a-z0-9]{8,12}\.onion/`
缺少文件类型过滤器	首先 `pe.imports(...)`	首先 `uint16(0) == 0x5A4D and filesize < 10MB`
短字符串	`"abc"` (3 字节)	`"abcdef"` (4+ 字节)
未转义的大括号 (YARA-X)	`/config{key}/`	`/config\{key\}/`

快速见效： 将 filesize 放在首位，避免 nocase，使用有界正则表达式 {1,100}，优先使用十六进制而非正则表达式。

危险信号： 字符串 <4 字节，无界正则表达式 (.*)，没有文件类型过滤器的模块。

原子理论和优化细节，请参阅 performance.md。

主题	文档
命名和元数据约定	style-guide.md
性能和原子优化	performance.md
字符串类型和判断	strings.md
测试和验证	testing.md
Chrome 扩展模块 (crx)	crx-module.md
Android DEX 模块 (dex)	dex-module.md

主题	文档
完整的规则开发流程	rule-development.md

examples/ 目录包含真实的、注明出源的规则，展示了最佳实践：

示例	展示内容	来源
MAL_Win_Remcos_Jan25.yar	PE 恶意软件：分级字符串计数，每个家族多条规则	Elastic Security
MAL_Mac_ProtonRAT_Jan25.yar	macOS：Mach-O 魔数，多类别分组	Airbnb BinaryAlert
MAL_NPM_SupplyChain_Jan25.yar	npm 供应链：真实攻击模式，ERC-20 选择器	Stairwell Research
SUSP_JS_Obfuscation_Jan25.yar	JavaScript：混淆器检测，基于密度的匹配	imp0rtp3, Nils Kuhnert
SUSP_CRX_SuspiciousPermissions.yar	Chrome 扩展：crx 模块，权限	Educational

uv run {baseDir}/scripts/yara_lint.py rule.yar      # Validate style/metadata
uv run {baseDir}/scripts/atom_analyzer.py rule.yar  # Check string quality

详细的脚本文档，请参阅 README.md。

部署任何规则前：

名称遵循 {CATEGORY}_{PLATFORM}_{FAMILY}_{VARIANT}_{DATE} 格式
描述以 "Detects" 开头并解释捕获什么/如何捕获
所有必需元数据齐全（作者、参考、日期）
字符串是唯一的（不是 API 名称、常见路径或格式字符串）
所有字符串都有 4+ 字节且具有良好的原子潜力
Base64 修饰符仅用于 3+ 字符的字符串
正则表达式模式已转义 { 并具有有效的转义序列
条件以廉价检查开始（filesize、魔数）
规则匹配所有目标样本
规则在良性软件语料库上产生零匹配
yr check 通过且无错误
yr fmt --check 通过（格式一致）
Linter 通过且无错误
同行评审已完成

优质 YARA 规则仓库

从生产规则中学习。这些仓库包含经过良好测试、正确注明出源的规则：

仓库	重点	维护者
Neo23x0/signature-base	17,000+ 生产规则，多平台	Florian Roth
Elastic/protections-artifacts	1,000+ 经过端点测试的规则	Elastic Security
reversinglabs/reversinglabs-yara-rules	威胁研究规则	ReversingLabs
imp0rtp3/js-yara-rules	JavaScript/浏览器恶意软件	imp0rtp3
InQuest/awesome-yara	精选资源索引	InQuest

风格与性能指南

指南	用途
YARA Style Guide	命名约定、元数据、字符串前缀
YARA Performance Guidelines	原子优化、正则表达式边界
Kaspersky Applied YARA Training	来自生产使用的专家技术

工具	用途
yarGen	从样本中提取候选字符串
FLOSS	提取混淆和栈字符串
YARA-CI	自动化良性软件测试
YaraDbg	基于 Web 的规则调试器

资源	用途
Apple XProtect	位于 `/System/Library/CoreServices/XProtect.bundle/` 的生产 macOS 规则
objective-see	macOS 恶意软件研究和样本
macOS Security Tools	参考列表

多指标聚类模式

生产规则通常按类型分组指标：

strings:
    // Category A: Library indicators
    $a1 = "SRWebSocket" ascii
    $a2 = "SocketRocket" ascii

    // Category B: Behavioral indicators
    $b1 = "SSH tunnel" ascii
    $b2 = "keylogger" ascii nocase

    // Category C: C2 patterns
    $c1 = /https:\/\/[a-z0-9]{8,16}\.onion/

condition:
    filesize < 10MB and
    any of ($a*) and any of ($b*)  // Require evidence from BOTH categories

为何有效： 不同类型的指标具有不同的置信度。单个 C2 域名可能是决定性的，而您需要多个库导入才能有信心。通过 $a*、$b*、$c* 分组可以让您表达分级要求。

YARA-X Rule Authoring

Write detection rules that catch malware without drowning in false positives.

This skill targets YARA-X, the Rust-based successor to legacy YARA. YARA-X powers VirusTotal's production systems and is the recommended implementation. See Migrating from Legacy YARA if you have existing rules.

Core Principles

Strings must generate good atoms — YARA extracts 4-byte subsequences for fast matching. Strings with repeated bytes, common sequences, or under 4 bytes force slow bytecode verification on too many files.
Target specific families, not categories — "Detects ransomware" catches everything and nothing. "Detects LockBit 3.0 configuration extraction routine" catches what you want.
Test against goodware before deployment — A rule that fires on Windows system files is useless. Validate against VirusTotal's goodware corpus or your own clean file set.
Short-circuit with cheap checks first — Put filesize < 10MB and uint16(0) == 0x5A4D before expensive string searches or module calls.
Metadata is documentation — Future you (and your team) need to know what this catches, why, and where the sample came from.

When to Use

Writing new YARA-X rules for malware detection
Reviewing existing rules for quality or performance issues
Optimizing slow-running rulesets
Converting IOCs or threat intel into detection signatures
Debugging false positive issues
Preparing rules for production deployment
Migrating legacy YARA rules to YARA-X
Analyzing Chrome extensions (crx module)
Analyzing Android apps (dex module)

When NOT to Use

Static analysis requiring disassembly → use Ghidra/IDA skills
Dynamic malware analysis → use sandbox analysis skills
Network-based detection → use Suricata/Snort skills
Memory forensics with Volatility → use memory forensics skills
Simple hash-based detection → just use hash lists

YARA-X Overview

YARA-X is the Rust-based successor to legacy YARA: 5-10x faster regex, better errors, built-in formatter, stricter validation, new modules (crx, dex), 99% rule compatibility.

Install: brew install yara-x (macOS) or cargo install yara-x

Essential commands: yr scan, yr check, yr fmt, yr dump

Platform Considerations

YARA works on any file type. Adapt patterns to your target:

Platform	Magic Bytes	Bad Strings	Good Strings
Windows PE	`uint16(0) == 0x5A4D`	API names, Windows paths	Mutex names, PDB paths
macOS Mach-O	`uint32(0) == 0xFEEDFACE` (32-bit), `0xFEEDFACF` (64-bit), `0xCAFEBABE` (universal)	Common Obj-C methods	Keylogger strings, persistence paths
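JavaScript/Node	(none needed)	`require`, `fetch`, `axios`	Obfuscator signatures, eval+decode chains
npm/pip packages	(none needed)	`postinstall`, `dependencies`	Suspicious package names, exfil URLs
Office documents	`uint32be(0) == 0x504B0304`	VBA keywords	Macro auto-exec, encoded payloads
VS Code extensions	(none needed)	`vscode.workspace`	Uncommon activationEvents, hidden file access
Chrome extensions	Use `crx` module	Common Chrome APIs	Permission abuse, manifest anomalies
Android apps	Use `dex` module	Standard DEX structure	Obfuscated classes, suspicious permissions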

macOS Malware Detection

YARA-X ships a macho module for structural checks, but magic byte checks plus string patterns are often enough:

Magic bytes:

// Mach-O 32-bit
uint32(0) == 0xFEEDFACE
// Mach-O 64-bit
uint32(0) == 0xFEEDFACF
// Universal binary (fat binary)
uint32(0) == 0xCAFEBABE or uint32(0) == 0xBEBAFECA
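
The two universal-binary constants above are the same four bytes read in different byte orders: `uint32()` reads little-endian, so a fat header stored on disk as `CA FE BA BE` compares equal to `0xBEBAFECA`. A minimal sketch of a reusable filter follows. The rule name is hypothetical, and `uint32be()` is used only to make the on-disk byte order explicit:

```yara
// Hypothetical helper rule, not taken from the source material.
rule INFO_Mac_MachO_Magic
{
    condition:
        // Thin Mach-O built for little-endian CPUs (x86_64, arm64):
        // on-disk bytes CE FA ED FE (32-bit) or CF FA ED FE (64-bit)
        uint32(0) == 0xFEEDFACE or
        uint32(0) == 0xFEEDFACF or
        // Fat/universal header is stored big-endian: on-disk bytes CA FE BA BE
        uint32be(0) == 0xCAFEBABE
}
```

Java class files also begin with `CA FE BA BE`, so this filter is best paired with Mach-O-specific strings rather than used alone.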

Good indicators for macOS malware:

Keylogger artifacts: CGEventTapCreate, kCGEventKeyDown
SSH tunnel strings: ssh -D, tunnel, socks
Persistence paths: ~/Library/LaunchAgents, /Library/LaunchDaemons
Credential theft: security find-generic-password, keychain

Example pattern from Airbnb BinaryAlert:

rule SUSP_Mac_ProtonRAT
{
    strings:
        // Library indicators
        $lib1 = "SRWebSocket" ascii
        $lib2 = "SocketRocket" ascii

        // Behavioral indicators
        $behav1 = "SSH tunnel not launched" ascii
        $behav2 = "Keylogger" ascii

    condition:
        (uint32(0) == 0xFEEDFACF or uint32(0) == 0xCAFEBABE or uint32(0) == 0xBEBAFECA) and
        any of ($lib*) and any of ($behav*)
}

JavaScript Detection Decision Tree

Writing a JavaScript rule?
├─ npm package?
│  ├─ Check package.json patterns
│  ├─ Look for postinstall/preinstall hooks
│  └─ Target exfil patterns: fetch + env access + credential paths
├─ Browser extension?
│  ├─ Chrome: Use crx module
│  └─ Others: Target manifest patterns, background script behaviors
├─ Standalone JS file?
│  ├─ Look for obfuscation markers: eval+atob, fromCharCode chains
│  ├─ Target unique function/variable names (often survive minification)
│  └─ Check for packed/encoded payloads
└─ Minified/webpack bundle?
   ├─ Target unique strings that survive bundling (URLs, magic values)
   └─ Avoid function names (will be mangled)

JavaScript-specific good strings:

Ethereum function selectors: { 70 a0 82 31 } (balanceOf)
Zero-width characters (steganography): { E2 80 8B E2 80 8C }
Obfuscator signatures: _0x, var _0x
Specific C2 patterns: domain names, webhook URLs

JavaScript-specific bad strings:

require, fetch, axios — too common
Buffer, crypto — legitimate uses everywhere
process.env alone — need specific env var names
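
As a rough illustration of the good and bad string lists above, the sketch below combines an obfuscator signature, a specific environment variable name, and a webhook destination. The rule name, the `NPM_TOKEN` variable, and the Discord webhook path are assumptions chosen for the example, not indicators from the source:

```yara
// Hypothetical rule: every name and string here is illustrative.
rule SUSP_JS_Obfuscated_WebhookExfil
{
    meta:
        description = "Detects obfuscated JavaScript that reads an npm token and references a Discord webhook"
    strings:
        // Obfuscator signature; the literal "var _0x" gives the regex a usable atom
        $obf = /var _0x[a-f0-9]{4,6}=/
        // A specific env var name, not bare process.env
        $env = "process.env.NPM_TOKEN" ascii
        // A specific exfil destination instead of generic fetch/axios
        $hook = "discord.com/api/webhooks/" ascii
    condition:
        filesize < 2MB and all of them
}
```

Requiring all three keeps generic calls such as `fetch` out of the rule entirely.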

Essential Toolkit

Tool	Purpose
yarGen	Extract candidate strings: `yarGen.py -m samples/ --excludegood` → validate with `yr check`
FLOSS	Extract obfuscated/stack strings: `floss sample.exe` (when yarGen fails)
yr CLI	Validate: `yr check`, scan: `yr scan -s`, inspect: `yr dump -m pe`
signature-base	Study quality examples
YARA-CI	Goodware corpus testing before deployment

Master these five. Don't get distracted by tool catalogs.

Rationalizations to Reject

When you catch yourself thinking these, stop and reconsider.

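Rationalization	Expert Response
"This generic string is unique enough"	Test against goodware first. Your intuition is wrong.
"yarGen gave me these strings"	yarGen suggests, you validate. Check each one manually.
"It works on my 10 samples"	10 samples ≠ production. Use VirusTotal goodware corpus.
"One rule to catch all variants"	Causes FP floods. Target specific families.
"I'll make it more specific if we get FPs"	Write tight rules upfront. FPs burn trust.
"This hex pattern is unique"	Unique in one sample ≠ unique across malware ecosystem.
"Performance doesn't matter"	One slow rule slows entire ruleset. Optimize atoms.
"PEiD rules still work"	Obsolete. 32-bit packers aren't relevant.
"I'll add more conditions later"	Weak rules deployed = damage done.
"This is just for hunting"	Hunting rules become detection rules. Same quality bar.
"The API name makes it malicious"	Legitimate software uses same APIs. Need behavioral context.
"any of is fine for these common strings"	Common strings + any = FP flood. Use `any of` only for individually unique strings.
"This regex is specific enough"	`/fetch.*token/` matches all auth code. Add an exfil destination requirement.
"The JavaScript looks clean"	Attackers inject malicious code into legitimate code. Check for eval+decode chains.
"I'll use .* for flexibility"	Unbounded regex = performance disaster + memory explosion. Use `.{0,30}`.
"I'll use --relaxed-re-syntax everywhere"	Masks real bugs. Fix the regex instead of hiding the problem.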

Decision Trees

Is This String Good Enough?

Is this string good enough?
├─ Less than 4 bytes?
│  └─ NO — find longer string
├─ Contains repeated bytes (0000, 9090)?
│  └─ NO — add surrounding context
├─ Is an API name (VirtualAlloc, CreateRemoteThread)?
│  └─ NO — use hex pattern of call site instead
├─ Appears in Windows system files?
│  └─ NO — too generic, find something unique
├─ Is it a common path (C:\Windows\, cmd.exe)?
│  └─ NO — find malware-specific paths
├─ Unique to this malware family?
│  └─ YES — use it
└─ Appears in other malware too?
   └─ MAYBE — combine with family-specific marker

When to Use "all of" vs "any of"

Should I require all strings or allow any?
├─ Strings are individually unique to malware?
│  └─ any of them (each alone is suspicious)
├─ Strings are common but combination is suspicious?
│  └─ all of them (require the full pattern)
├─ Strings have different confidence levels?
│  └─ Group: all of ($core_*) and any of ($variant_*)
└─ Seeing many false positives?
   └─ Tighten: switch any → all, add more required strings

Lesson from production: Rules using any of ($network_*) where strings included "fetch", "axios", "http" matched virtually all web applications. Switching to require credential path AND network call AND exfil destination eliminated FPs.
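
The production fix described above can be sketched as three groups that must each contribute at least one match. The specific credential paths and exfil destinations are hypothetical stand-ins, and the rule name is invented for the example:

```yara
// Hypothetical rule illustrating grouped requirements.
rule SUSP_JS_CredentialExfil_Grouped
{
    strings:
        // Group 1: credential paths (illustrative choices)
        $cred_1 = ".aws/credentials" ascii
        $cred_2 = ".ssh/id_rsa" ascii
        // Group 2: network calls; common on their own, never sufficient alone
        $net_1 = "fetch(" ascii
        $net_2 = "axios.post(" ascii
        // Group 3: exfil destinations (illustrative choices)
        $dst_1 = "discord.com/api/webhooks/" ascii
        $dst_2 = "api.telegram.org/bot" ascii
    condition:
        filesize < 5MB and
        any of ($cred_*) and
        any of ($net_*) and
        any of ($dst_*)
}
```

A file that only makes network calls no longer matches, because the credential and destination groups are both mandatory.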

When to Abandon a Rule Approach

Stop and pivot when:

yarGen returns only API names and paths → See When Strings Fail, Pivot to Structure
Can't find 3 unique strings → Probably packed. Target the unpacked version or detect the packer.
Rule matches goodware files → Strings aren't unique enough. 1-2 matches = investigate and tighten; 3-5 matches = find different indicators; 6+ matches = start over.
Performance is terrible even after optimization → Architecture problem. Split into multiple focused rules or add strict pre-filters.
Description is hard to write → The rule is too vague. If you can't explain what it catches, it catches too much.

Debugging False Positives

FP Investigation Flow:
│
├─ 1. Which string matched?
│     Run: yr scan -s rule.yar false_positive.exe
│
├─ 2. Is it in a legitimate library?
│     └─ Add: not $fp_vendor_string exclusion
│
├─ 3. Is it a common development pattern?
│     └─ Find more specific indicator, replace the string
│
├─ 4. Are multiple generic strings matching together?
│     └─ Tighten to require all + add unique marker
│
└─ 5. Is the malware using common techniques?
      └─ Target malware-specific implementation details, not the technique
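
Step 2 of the flow can be sketched as follows, assuming the false positive came from a vendor library that carries a distinctive product string. All strings and the rule name are hypothetical:

```yara
// Hypothetical rule showing a vendor-string exclusion.
rule MAL_Win_Example_Loader_FPExcluded
{
    strings:
        $mutex = "Global\\ExampleLoaderMutex_7f3a" ascii
        $c2 = "/gate/checkin.php?id=" ascii
        // Seen only in the benign vendor library that triggered the FP
        $fp_vendor_string = "Example Vendor Update Service" ascii
    condition:
        filesize < 10MB and
        uint16(0) == 0x5A4D and
        all of ($mutex, $c2) and
        not $fp_vendor_string
}
```

An exclusion like this only treats the symptom, so it is worth re-checking whether `$mutex` and `$c2` are specific enough in the first place.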

Hex vs Text vs Regex

What string type should I use?
│
├─ Exact ASCII/Unicode text?
│  └─ TEXT: $s = "MutexName" ascii wide
│
├─ Specific byte sequence?
│  └─ HEX: $h = { 4D 5A 90 00 }
│
├─ Byte sequence with variation?
│  └─ HEX with wildcards: { 4D 5A ?? ?? 50 45 }
│
├─ Pattern with structure (URLs, paths)?
│  └─ BOUNDED REGEX: /https:\/\/[a-z]{5,20}\.onion/
│
└─ Unknown encoding (XOR, base64)?
   └─ TEXT with modifier: $s = "config" xor(0x00-0xFF)

Is the Sample Packed? (Check First)

Before writing any string-based rule:

Is the sample packed?
├─ Entropy > 7.0?
│  └─ Likely packed — find unpacked layer first
├─ Few/no readable strings?
│  └─ Likely packed — use entropy, PE structure, or packer signatures
├─ UPX/MPRESS/custom packer detected?
│  └─ Target the unpacked payload OR detect the packer itself
└─ Readable strings available?
   └─ Proceed with string-based detection

Expert guidance: Don't write rules against packed layers. The packing changes; the payload doesn't.

When Strings Fail, Pivot to Structure

If yarGen returns only API names and generic paths:

String extraction failed — what now?
├─ High entropy sections?
│  └─ Use math.entropy() on specific sections
├─ Unusual imports pattern?
│  └─ Use pe.imphash() for import hash clustering
├─ Consistent PE structure anomalies?
│  └─ Target section names, sizes, characteristics
├─ Metadata present?
│  └─ Target version info, timestamps, resources
└─ Nothing unique?
   └─ This sample may not be detectable with YARA alone

Expert guidance: "One can try to use other file properties, such as metadata, entropy, import hashes or other data which stays constant." — Kaspersky Applied YARA Training
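
A minimal sketch of the structural pivot, combining an import hash with a per-section entropy check. The rule name is hypothetical and the imphash value is a placeholder, not a real cluster hash:

```yara
import "pe"
import "math"

// Hypothetical rule; replace the placeholder hash with one observed in your samples.
rule SUSP_Win_HighEntropy_KnownImphash
{
    meta:
        description = "Detects PE files with a high-entropy section and an import hash shared by a sample cluster"
    condition:
        filesize < 10MB and
        uint16(0) == 0x5A4D and
        pe.imphash() == "00000000000000000000000000000000" and
        for any section in pe.sections : (
            section.raw_data_size > 0 and
            math.entropy(section.raw_data_offset, section.raw_data_size) > 7.0
        )
}
```

The cheap `filesize` and magic checks come first so the module work only runs on plausible PE files.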

Expert Heuristics

String selection: Mutex names are gold; C2 paths silver; error messages bronze. Stack strings are almost always unique. If you need >6 strings, you're over-fitting.

Condition design: Start with filesize <, then magic bytes, then strings, then modules. If >5 lines, split into multiple rules.

Quality signals: yarGen output needs 80% filtering. Rules matching <50% of variants are too narrow; rules matching goodware are too broad.

Modifier discipline:

Never use nocase or wide speculatively; use them only when you have confirmed evidence that the case or encoding varies across samples
nocase doubles atom generation; wide doubles string matching — both have real costs
"If you don't have a clear reason for using those modifiers, don't do it" — Kaspersky Applied YARA

Regex anchoring:

Regex without a 4+ byte literal substring evaluates at every file offset — catastrophic performance
Always anchor regex to a distinctive literal: /mshta\.exe http:\/\/.../ not /http:\/\/.../
If you can't anchor, consider hex pattern with wildcards instead
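
To make the anchoring advice concrete, here is a sketch of a bounded regex that starts with a long literal. The rule name and the exact URL shape are assumptions for illustration:

```yara
// Hypothetical rule demonstrating a literal-anchored, bounded regex.
rule SUSP_Win_Mshta_RemoteScript
{
    strings:
        // The literal "mshta.exe http://" gives the scanner a long atom,
        // and every repetition is bounded.
        $anchored = /mshta\.exe http:\/\/[a-z0-9.]{4,40}\/[a-z0-9]{1,20}\.hta/
    condition:
        filesize < 5MB and $anchored
}
```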

Loop discipline:

Always bound loops with filesize: filesize < 100KB and for all i in (1..#a) : ...
Unbounded #a can reach thousands in large files, and the loop body runs once per match, so the cost grows with every hit
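
A sketch of a loop bounded both by file size and by match count. The hex pattern stands for a hypothetical record header and the marker string is invented for the example:

```yara
// Hypothetical rule demonstrating loop bounds.
rule DEMO_BoundedLoop
{
    strings:
        // Hypothetical record header: "CFG1" followed by two version bytes
        $a = { 43 46 47 31 00 01 }
        $b = "ExampleConfigMarker" ascii
    condition:
        filesize < 100KB and
        #a < 50 and
        // Require the marker within 256 bytes after some header occurrence
        for any i in (1..#a) : ( $b in (@a[i]..@a[i] + 256) )
}
```

The explicit `#a < 50` cap is an extra guard on top of the `filesize` bound, so a file stuffed with headers is rejected before the loop runs.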

YARA-X tips: $_unused to suppress warnings; private $s to hide from output; yr check + yr fmt before every commit.

When to Use Modules vs. Byte Checks

Should I use a module or raw bytes?
├─ Need imphash/rich header/authenticode?
│  └─ Use PE module — too complex to replicate
├─ Just checking magic bytes or simple offsets?
│  └─ Use uint16/uint32 — faster, no module overhead
├─ Checking section names/sizes?
│  └─ PE module is cleaner, but add magic bytes filter FIRST
├─ Checking Chrome extension permissions?
│  └─ Use crx module — string parsing is fragile
└─ Checking LNK target paths?
   └─ Use lnk module — LNK format is complex

Expert guidance: "Avoid the magic module — use explicit hex checks instead" — Neo23x0. Apply this principle: if you can do it with uint32(), don't load a module.

YARA-X New Features

Key additions from recent releases:

Private patterns (v1.3.0+): private $helper = "pattern" — matches but hidden from output
Warning suppression (v1.4.0+): // suppress: slow_pattern inline comments
Numeric underscores (v1.5.0+): filesize < 10_000_000 for readability
Built-in formatter: yr fmt rules/ to standardize formatting
NDJSON output: yr scan --output-format ndjson for tooling

YARA-X Tooling Workflow

YARA-X provides diagnostic tools legacy YARA lacks:

Rule development cycle:

# 1. Write initial rule
# 2. Check syntax with detailed errors
yr check rule.yar

# 3. Format consistently
yr fmt -w rule.yar

# 4. Dump module output to inspect file structure (no dummy rule needed)
yr dump -m pe sample.exe --output-format yaml

# 5. Scan with timing info
time yr scan -s rule.yar corpus/

When to use yr dump:

Investigating what PE/ELF/Mach-O fields are available
Debugging why module conditions aren't matching
Exploring new modules (crx, lnk, dotnet) before writing rules

YARA-X diagnostic advantage: Error messages include precise source locations. If yr check points to line 15, the issue is actually on line 15 (unlike legacy YARA).

Chrome Extension Analysis (crx module)

The crx module enables detection of malicious Chrome extensions. Requires YARA-X v1.5.0+ (basic), v1.11.0+ for permhash().

Key APIs: crx.is_crx, crx.permissions, crx.permhash()

Red flags: nativeMessaging + downloads, debugger permission, content scripts on <all_urls>

import "crx"

rule SUSP_CRX_HighRiskPerms {
    condition:
        crx.is_crx and
        for any perm in crx.permissions : (perm == "debugger")
}
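
The first red flag, nativeMessaging combined with downloads, could be expressed with the same `crx.permissions` iteration used above. This is a sketch under the assumption that both permissions appear as plain strings in that array; the rule name is hypothetical:

```yara
import "crx"

// Hypothetical rule: flags extensions requesting both permissions.
rule SUSP_CRX_NativeMessaging_Downloads
{
    condition:
        crx.is_crx and
        (for any perm in crx.permissions : (perm == "nativeMessaging")) and
        (for any perm in crx.permissions : (perm == "downloads"))
}
```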

See crx-module.md for complete API reference, permission risk assessment, and example rules.

Android DEX Analysis (dex module)

The dex module enables detection of Android malware. Requires YARA-X v1.11.0+. Not compatible with legacy YARA's dex module — API is completely different.

Key APIs: dex.is_dex, dex.contains_class(), dex.contains_method(), dex.contains_string()

Red flags: Single-letter class names (obfuscation), DexClassLoader reflection, encrypted assets

import "dex"

rule SUSP_DEX_DynamicLoading {
    condition:
        dex.is_dex and
        dex.contains_class("Ldalvik/system/DexClassLoader;")
}

See dex-module.md for complete API reference, obfuscation detection, and example rules.

Migrating from Legacy YARA

YARA-X has 99% rule compatibility, but enforces stricter validation.

Quick migration:

yr check --relaxed-re-syntax rules/  # Identify issues
# Fix each issue, then:
yr check rules/  # Verify without relaxed mode

Common fixes:

Issue	Legacy	YARA-X Fix
Literal `{` in regex	`/{/`	`/\{/`
Invalid escapes	`\R` silently literal	`\\R` or `R`
Base64 strings	Any length	3+ chars required
Negative indexing	`@a[-1]`	`@a[#a - 1]`
Duplicate modifiers	Allowed	Remove duplicates

Note: Use --relaxed-re-syntax only as a diagnostic tool. Fix issues rather than relying on relaxed mode.

Quick Reference

Naming Convention

{CATEGORY}_{PLATFORM}_{FAMILY}_{VARIANT}_{DATE}

Common prefixes: MAL_ (malware), HKTL_ (hacking tool), WEBSHELL_, EXPL_, SUSP_ (suspicious), GEN_ (generic)

Platforms: Win_, Lnx_, Mac_, Android_, CRX_

Example: MAL_Win_Emotet_Loader_Jan25

See style-guide.md for full conventions, metadata requirements, and naming examples.

Required Metadata

Every rule needs: description (starts with "Detects"), author, reference, date.

meta:
    description = "Detects Example malware via unique mutex and C2 path"
    author = "Your Name <email@example.com>"
    reference = "https://example.com/analysis"
    date = "2025-01-29"

String Selection

Good: Mutex names, PDB paths, C2 paths, stack strings, configuration markers Bad: API names, common executables, format specifiers, generic paths

See strings.md for the full decision tree and examples.

Condition Patterns

Order conditions for short-circuit:

filesize < 10MB (instant)
uint16(0) == 0x5A4D (nearly instant)
String matches (cheap)
Module checks (expensive)
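
Put together, the ordering above might look like the following sketch. The rule name, strings, and section-count threshold are hypothetical; the metadata values reuse the placeholders from the Required Metadata section:

```yara
import "pe"

// Hypothetical rule showing cheap-to-expensive condition ordering.
rule MAL_Win_Example_Loader_Jan25
{
    meta:
        description = "Detects Example loader via unique mutex and C2 path"
        author = "Your Name <email@example.com>"
        reference = "https://example.com/analysis"
        date = "2025-01-29"
    strings:
        $mutex = "Global\\ExampleLoaderMutex_7f3a" ascii
        $c2_path = "/gate/checkin.php?id=" ascii
    condition:
        filesize < 10MB and          // 1. instant
        uint16(0) == 0x5A4D and      // 2. nearly instant
        all of them and              // 3. string matches
        pe.number_of_sections < 10   // 4. module check last
}
```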

See performance.md for detailed optimization patterns.

Workflow

Gather samples — Multiple samples; single-sample rules are brittle
Extract candidates — yarGen -m samples/ --excludegood
Validate quality — Use decision tree; yarGen needs 80% filtering
Write initial rule — Follow template with proper metadata
Lint and test — yr check, yr fmt, linter script
Goodware validation — VirusTotal corpus or local clean files
Deploy — Add to repo with full metadata, monitor for FPs

See testing.md for detailed validation workflow and FP investigation.

For a comprehensive step-by-step guide covering all phases from sample collection to deployment, see rule-development.md.

Common Mistakes

Mistake	Bad	Good
API names as indicators	`"VirtualAlloc"`	Hex pattern of call site + unique mutex
Unbounded regex	`/https?:\/\/.*/`	`/https?:\/\/[a-z0-9]{8,12}\.onion/`
Missing file type filter	`pe.imports(...)` first	`uint16(0) == 0x5A4D and filesize < 10MB` first
Short strings	`"abc"` (3 bytes)	`"abcdef"` (4+ bytes)
Unescaped braces (YARA-X)	`/config{key}/`	`/config\{key\}/`

Performance Optimization

Quick wins: Put filesize first, avoid nocase, bounded regex {1,100}, prefer hex over regex.

Red flags: Strings <4 bytes, unbounded regex (.*), modules without file-type filter.

See performance.md for atom theory and optimization details.

Reference Documents

Topic	Document
Naming and metadata conventions	style-guide.md
Performance and atom optimization	performance.md
String types and judgment	strings.md
Testing and validation	testing.md
Chrome extension module (crx)	crx-module.md
Android DEX module (dex)	dex-module.md

Workflows

Topic	Document
Complete rule development process	rule-development.md

Example Rules

The examples/ directory contains real, attributed rules demonstrating best practices:

Example	Demonstrates	Source
MAL_Win_Remcos_Jan25.yar	PE malware: graduated string counts, multiple rules per family	Elastic Security
MAL_Mac_ProtonRAT_Jan25.yar	macOS: Mach-O magic bytes, multi-category grouping	Airbnb BinaryAlert
MAL_NPM_SupplyChain_Jan25.yar	npm supply chain: real attack patterns, ERC-20 selectors	Stairwell Research
SUSP_JS_Obfuscation_Jan25.yar	JavaScript: obfuscator detection, density-based matching	imp0rtp3, Nils Kuhnert
SUSP_CRX_SuspiciousPermissions.yar

Scripts

uv run {baseDir}/scripts/yara_lint.py rule.yar      # Validate style/metadata
uv run {baseDir}/scripts/atom_analyzer.py rule.yar  # Check string quality

See README.md for detailed script documentation.

Quality Checklist

Before deploying any rule:

Name follows {CATEGORY}_{PLATFORM}_{FAMILY}_{VARIANT}_{DATE} format
Description starts with "Detects" and explains what/how
All required metadata present (author, reference, date)
Strings are unique (not API names, common paths, or format strings)
All strings have 4+ bytes with good atom potential
Base64 modifier only on strings with 3+ characters
Regex patterns have escaped { and valid escape sequences
Condition starts with cheap checks (filesize, magic bytes)
Rule matches all target samples
Rule produces zero matches on goodware corpus
yr check passes with no errors
yr fmt --check passes (consistent formatting)
Linter passes with no errors
Peer review completed
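The first checklist item can be automated with a small pattern match. This is a sketch under stated assumptions: `parse_rule_name` is a hypothetical helper, the variant segment is treated as optional, and the date is assumed to use the `Jan25` style seen in the example file names; adjust the pattern to whatever style-guide.md actually specifies.

```python
import re
from typing import Dict, Optional

# Assumed shape: CATEGORY_Platform_Family[_Variant]_MonYY, e.g. MAL_Win_Remcos_Jan25
RULE_NAME = re.compile(
    r"^(?P<category>[A-Z]+)"
    r"_(?P<platform>[A-Za-z0-9]+)"
    r"_(?P<family>[A-Za-z0-9]+)"
    r"(?:_(?P<variant>[A-Za-z0-9]+))?"
    r"_(?P<date>[A-Z][a-z]{2}\d{2})$"
)


def parse_rule_name(name: str) -> Optional[Dict[str, Optional[str]]]:
    """Split a rule name into its parts, or return None if it does not fit."""
    match = RULE_NAME.match(name)
    return match.groupdict() if match else None
```

Under these assumptions `MAL_Win_Remcos_Jan25` parses with an empty variant, while a name without a date suffix is rejected.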

Resources

Quality YARA Rule Repositories

Learn from production rules. These repositories contain well-tested, properly attributed rules:

Repository	Focus	Maintainer
Neo23x0/signature-base	17,000+ production rules, multi-platform	Florian Roth
Elastic/protections-artifacts	1,000+ endpoint-tested rules	Elastic Security
reversinglabs/reversinglabs-yara-rules	Threat research rules	ReversingLabs
imp0rtp3/js-yara-rules	JavaScript/browser malware	imp0rtp3
InQuest/awesome-yara	Curated index of resources	InQuest

Style & Performance Guides

Guide	Purpose
YARA Style Guide	Naming conventions, metadata, string prefixes
YARA Performance Guidelines	Atom optimization, regex bounds
Kaspersky Applied YARA Training	Expert techniques from production use

Tools

Tool	Purpose
yarGen	Extract candidate strings from samples
FLOSS	Extract obfuscated and stack strings
YARA-CI	Automated goodware testing
YaraDbg	Web-based rule debugger

macOS-Specific Resources

Resource	Purpose
Apple XProtect	Production macOS rules at `/System/Library/CoreServices/XProtect.bundle/`
objective-see	macOS malware research and samples
macOS Security Tools	Reference list

Multi-Indicator Clustering Pattern

Production rules often group indicators by type:

strings:
    // Category A: Library indicators
    $a1 = "SRWebSocket" ascii
    $a2 = "SocketRocket" ascii

    // Category B: Behavioral indicators
    $b1 = "SSH tunnel" ascii
    $b2 = "keylogger" ascii nocase

    // Category C: C2 patterns
    $c1 = /https:\/\/[a-z0-9]{8,16}\.onion/

condition:
    filesize < 10MB and
    (
        (any of ($a*) and any of ($b*)) or  // library AND behavioral evidence together
        any of ($c*)                        // a C2 pattern alone is treated as definitive
    )

Why this works: Different indicator types have different confidence levels. A single C2 domain might be definitive, while you need multiple library imports to be confident. Grouping by $a*, $b*, $c* lets you express graduated requirements.
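The graduated-requirement idea can be prototyped outside YARA before committing to a condition. The sketch below is illustrative only: `count_hits` and `meets_requirements` are hypothetical helpers, matching is a plain case-sensitive substring test (modifiers such as nocase are not modeled), and the per-category minimums are example values.

```python
from typing import Dict, List


def count_hits(data: bytes, groups: Dict[str, List[bytes]]) -> Dict[str, int]:
    """Count how many indicators from each category occur in the data."""
    return {
        name: sum(1 for needle in needles if needle in data)
        for name, needles in groups.items()
    }


def meets_requirements(hits: Dict[str, int], minimums: Dict[str, int]) -> bool:
    """True when every category reaches its required number of hits."""
    return all(hits.get(name, 0) >= needed for name, needed in minimums.items())


groups = {
    "a": [b"SRWebSocket", b"SocketRocket"],  # library indicators
    "b": [b"SSH tunnel", b"keylogger"],      # behavioral indicators
}
sample = b"...SRWebSocket...keylogger..."
hits = count_hits(sample, groups)
print(meets_requirements(hits, {"a": 1, "b": 1}))  # like: any of ($a*) and any of ($b*)
```

Raising a minimum to 2 corresponds to writing `2 of ($a*)` in the rule, which is how a low-confidence category can be made to demand more evidence.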

First Seen

Jan 30, 2026
