regex-vs-llm-structured-text by affaan-m/everything-claude-code
npx skills add https://github.com/affaan-m/everything-claude-code --skill regex-vs-llm-structured-text
A practical decision framework for parsing structured text (quizzes, forms, invoices, documents). The key insight: regex handles 95-98% of cases cheaply and deterministically. Reserve expensive LLM calls for the remaining edge cases.
Is the text format consistent and repeating?
├── Yes (>90% follows a pattern) → Start with Regex
│ ├── Regex handles 95%+ → Done, no LLM needed
│ └── Regex handles <95% → Add LLM for edge cases only
└── No (free-form, highly variable) → Use LLM directly
Source Text
│
▼
[Regex Parser] ─── Extracts structure (95-98% accuracy)
│
▼
[Text Cleaner] ─── Removes noise (markers, page numbers, artifacts)
│
▼
[Confidence Scorer] ─── Flags low-confidence extractions
│
├── High confidence (≥0.95) → Direct output
│
└── Low confidence (<0.95) → [LLM Validator] → Output
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ParsedItem:
    id: str
    text: str
    choices: tuple[str, ...]
    answer: str
    confidence: float = 1.0


def parse_structured_text(content: str) -> list[ParsedItem]:
    """Parse structured text using regex patterns."""
    pattern = re.compile(
        r"(?P<id>\d+)\.\s*(?P<text>.+?)\n"
        r"(?P<choices>(?:[A-D]\..+?\n)+)"
        r"Answer:\s*(?P<answer>[A-D])",
        re.MULTILINE | re.DOTALL,
    )
    items = []
    for match in pattern.finditer(content):
        choices = tuple(
            c.strip() for c in re.findall(r"[A-D]\.\s*(.+)", match.group("choices"))
        )
        items.append(ParsedItem(
            id=match.group("id"),
            text=match.group("text").strip(),
            choices=choices,
            answer=match.group("answer"),
        ))
    return items
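A quick sketch of what the pattern captures, run against a tiny hypothetical quiz snippet (the sample text and question are made up for illustration):

```python
import re

# Same pattern as in parse_structured_text above.
pattern = re.compile(
    r"(?P<id>\d+)\.\s*(?P<text>.+?)\n"
    r"(?P<choices>(?:[A-D]\..+?\n)+)"
    r"Answer:\s*(?P<answer>[A-D])",
    re.MULTILINE | re.DOTALL,
)

sample = (
    "1. What does DNS resolve?\n"
    "A. Domain names\n"
    "B. IP routes\n"
    "C. MAC addresses\n"
    "D. Port numbers\n"
    "Answer: A\n"
)

m = pattern.search(sample)
print(m.group("id"), m.group("answer"))  # → 1 A
# The choices group holds all four option lines; a second findall splits them:
choices = re.findall(r"[A-D]\.\s*(.+)", m.group("choices"))
print(choices[0])  # → Domain names
```

Note that the lazy `.+?` quantifiers matter here: with `re.DOTALL` in effect, a greedy `.+` in the text group would swallow the choice lines as well.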
Flag items that may need LLM review:
@dataclass(frozen=True)
class ConfidenceFlag:
    item_id: str
    score: float
    reasons: tuple[str, ...]


def score_confidence(item: ParsedItem) -> ConfidenceFlag:
    """Score extraction confidence and flag issues."""
    reasons = []
    score = 1.0
    if len(item.choices) < 3:
        reasons.append("few_choices")
        score -= 0.3
    if not item.answer:
        reasons.append("missing_answer")
        score -= 0.5
    if len(item.text) < 10:
        reasons.append("short_text")
        score -= 0.2
    return ConfidenceFlag(
        item_id=item.id,
        score=max(0.0, score),
        reasons=tuple(reasons),
    )


def identify_low_confidence(
    items: list[ParsedItem],
    threshold: float = 0.95,
) -> list[ConfidenceFlag]:
    """Return items below the confidence threshold."""
    flags = [score_confidence(item) for item in items]
    return [f for f in flags if f.score < threshold]
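A self-contained sketch of how the additive penalties play out, using a minimal stand-in for `ParsedItem` (the `Item` class, `score` helper, and sample data are hypothetical; the penalty values mirror `score_confidence` above):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:  # minimal stand-in for ParsedItem
    id: str
    text: str
    choices: tuple
    answer: str


def score(item: Item) -> float:
    """Apply the same additive penalties as score_confidence."""
    s = 1.0
    if len(item.choices) < 3:   # too few options extracted
        s -= 0.3
    if not item.answer:         # answer line missing
        s -= 0.5
    if len(item.text) < 10:     # suspiciously short question text
        s -= 0.2
    return max(0.0, s)


good = Item("1", "What does DNS resolve into?", ("A", "B", "C", "D"), "A")
bad = Item("2", "DNS?", ("A",), "")
print(score(good), score(bad))  # → 1.0 0.0
```

With a 0.95 threshold, any single penalty is enough to route an item to the LLM validator, which is the intended behavior: the penalties mark structural symptoms of a failed regex match, not graded quality.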
import json


def validate_with_llm(
    item: ParsedItem,
    original_text: str,
    client,
) -> ParsedItem:
    """Use an LLM to fix a low-confidence extraction."""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # cheapest model for validation
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": (
                f"Extract the question, choices, and answer from this text.\n\n"
                f"Text: {original_text}\n\n"
                f"Current extraction: {item}\n\n"
                f"Return corrected JSON if needed, or 'CORRECT' if accurate."
            ),
        }],
    )
    reply = response.content[0].text.strip()
    if reply == "CORRECT":
        return item
    # Otherwise the reply is corrected JSON; fall back to the original
    # fields for anything the model omits.
    data = json.loads(reply)
    return ParsedItem(
        id=data.get("id", item.id),
        text=data.get("text", item.text),
        choices=tuple(data.get("choices", item.choices)),
        answer=data.get("answer", item.answer),
    )
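Because the validator only touches `client.messages.create(...)` and reads `response.content[0].text`, it can be exercised offline with a stub that mimics that shape (the `StubClient` class and canned reply are hypothetical test scaffolding, not part of any SDK):

```python
import json
from types import SimpleNamespace


class StubClient:
    """Offline stand-in matching the client.messages.create(...) call shape."""

    def __init__(self, reply: str):
        self._reply = reply
        self.messages = SimpleNamespace(create=self._create)

    def _create(self, **kwargs):
        # Mimic an Anthropic-style response object: .content[0].text
        return SimpleNamespace(content=[SimpleNamespace(text=self._reply)])


client = StubClient('{"answer": "B"}')
resp = client.messages.create(model="stub", max_tokens=10, messages=[])
print(json.loads(resp.content[0].text)["answer"])  # → B
```

This keeps the LLM fallback path under test coverage without spending tokens, which matters when the whole point of the design is that the fallback fires rarely.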
def process_document(
    content: str,
    *,
    llm_client=None,
    confidence_threshold: float = 0.95,
) -> list[ParsedItem]:
    """Full pipeline: regex -> confidence check -> LLM for edge cases."""
    # Step 1: regex extraction (handles 95-98%)
    items = parse_structured_text(content)
    # Step 2: confidence scoring
    low_confidence = identify_low_confidence(items, confidence_threshold)
    if not low_confidence or llm_client is None:
        return items
    # Step 3: LLM validation (only for flagged items)
    low_conf_ids = {f.item_id for f in low_confidence}
    result = []
    for item in items:
        if item.id in low_conf_ids:
            result.append(validate_with_llm(item, content, llm_client))
        else:
            result.append(item)
    return result
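The routing step at the heart of the pipeline reduces to a small dispatch: flagged items go through the validator, everything else passes through untouched. A minimal sketch with hypothetical stand-in data:

```python
def dispatch(items, low_conf_ids, validate):
    """Route flagged items through the validator; pass the rest through."""
    return [validate(i) if i["id"] in low_conf_ids else i for i in items]


items = [{"id": "1", "ok": True}, {"id": "2", "ok": False}]
fixed = dispatch(items, {"2"}, lambda i: {**i, "ok": True})
print([i["ok"] for i in fixed])  # → [True, True]
```

The important property is that the expensive path is opt-in per item: with 2% of items flagged, LLM cost scales with the flagged count, not the document size.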
From a production quiz parsing pipeline (410 items):
| Metric | Value |
|---|---|
| Regex success rate | 98.0% |
| Low confidence items | 8 (2.0%) |
| LLM calls needed | ~5 |
| Cost savings vs all-LLM | ~95% |
| Test coverage | 93% |
Weekly Installs: 447
GitHub Stars: 69.1K
First Seen: Feb 17, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: codex (397), opencode (387), gemini-cli (377), github-copilot (376), cursor (364), kimi-cli (363)