regex-vs-llm-structured-text by affaan-m/everything-claude-code
npx skills add https://github.com/affaan-m/everything-claude-code --skill regex-vs-llm-structured-text
A practical decision framework for parsing structured text (quizzes, forms, invoices, documents). The key insight: regex handles 95-98% of cases cheaply and deterministically. Reserve expensive LLM calls for the remaining edge cases.
Is the text format consistent and repeating?
├── Yes (>90% follows a pattern) → Start with Regex
│ ├── Regex handles 95%+ → Done, no LLM needed
│ └── Regex handles <95% → Add LLM for edge cases only
└── No (free-form, highly variable) → Use LLM directly
Source Text
│
▼
[Regex Parser] ─── Extracts structure (95-98% accuracy)
│
▼
[Text Cleaner] ─── Removes noise (markers, page numbers, artifacts)
│
▼
[Confidence Scorer] ─── Flags low-confidence extractions
│
├── High confidence (≥0.95) → Direct output
│
└── Low confidence (<0.95) → [LLM Validator] → Output
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ParsedItem:
    id: str
    text: str
    choices: tuple[str, ...]
    answer: str
    confidence: float = 1.0


def parse_structured_text(content: str) -> list[ParsedItem]:
    """Parse structured text using regex patterns."""
    pattern = re.compile(
        r"(?P<id>\d+)\.\s*(?P<text>.+?)\n"
        r"(?P<choices>(?:[A-D]\..+?\n)+)"
        r"Answer:\s*(?P<answer>[A-D])",
        re.MULTILINE | re.DOTALL,
    )
    items = []
    for match in pattern.finditer(content):
        choices = tuple(
            c.strip() for c in re.findall(r"[A-D]\.\s*(.+)", match.group("choices"))
        )
        items.append(ParsedItem(
            id=match.group("id"),
            text=match.group("text").strip(),
            choices=choices,
            answer=match.group("answer"),
        ))
    return items
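A quick sketch of what the pattern captures, run against a tiny hypothetical quiz snippet (the sample text and question are made up for illustration):

```python
import re

# Same pattern as in parse_structured_text above.
pattern = re.compile(
    r"(?P<id>\d+)\.\s*(?P<text>.+?)\n"
    r"(?P<choices>(?:[A-D]\..+?\n)+)"
    r"Answer:\s*(?P<answer>[A-D])",
    re.MULTILINE | re.DOTALL,
)

sample = (
    "1. What does DNS resolve?\n"
    "A. Domain names\n"
    "B. IP routes\n"
    "C. MAC addresses\n"
    "D. Port numbers\n"
    "Answer: A\n"
)

m = pattern.search(sample)
print(m.group("id"), m.group("answer"))  # → 1 A
# The choices group holds all four option lines; a second findall splits them:
choices = re.findall(r"[A-D]\.\s*(.+)", m.group("choices"))
print(choices[0])  # → Domain names
```

Note that the lazy `.+?` quantifiers matter here: with `re.DOTALL` in effect, a greedy `.+` in the text group would swallow the choice lines as well.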
Flag items that may need LLM review:
@dataclass(frozen=True)
class ConfidenceFlag:
    item_id: str
    score: float
    reasons: tuple[str, ...]


def score_confidence(item: ParsedItem) -> ConfidenceFlag:
    """Score extraction confidence and flag issues."""
    reasons = []
    score = 1.0
    if len(item.choices) < 3:
        reasons.append("few_choices")
        score -= 0.3
    if not item.answer:
        reasons.append("missing_answer")
        score -= 0.5
    if len(item.text) < 10:
        reasons.append("short_text")
        score -= 0.2
    return ConfidenceFlag(
        item_id=item.id,
        score=max(0.0, score),
        reasons=tuple(reasons),
    )


def identify_low_confidence(
    items: list[ParsedItem],
    threshold: float = 0.95,
) -> list[ConfidenceFlag]:
    """Return items below the confidence threshold."""
    flags = [score_confidence(item) for item in items]
    return [f for f in flags if f.score < threshold]
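A self-contained sketch of how the additive penalties play out, using a minimal stand-in for `ParsedItem` (the `Item` class, `score` helper, and sample data are hypothetical; the penalty values mirror `score_confidence` above):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:  # minimal stand-in for ParsedItem
    id: str
    text: str
    choices: tuple
    answer: str


def score(item: Item) -> float:
    """Apply the same additive penalties as score_confidence."""
    s = 1.0
    if len(item.choices) < 3:   # too few options extracted
        s -= 0.3
    if not item.answer:         # answer line missing
        s -= 0.5
    if len(item.text) < 10:     # suspiciously short question text
        s -= 0.2
    return max(0.0, s)


good = Item("1", "What does DNS resolve into?", ("A", "B", "C", "D"), "A")
bad = Item("2", "DNS?", ("A",), "")
print(score(good), score(bad))  # → 1.0 0.0
```

With a 0.95 threshold, any single penalty is enough to route an item to the LLM validator, which is the intended behavior: the penalties mark structural symptoms of a failed regex match, not graded quality.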
import json


def validate_with_llm(
    item: ParsedItem,
    original_text: str,
    client,
) -> ParsedItem:
    """Use an LLM to fix a low-confidence extraction."""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # cheapest model for validation
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": (
                f"Extract the question, choices, and answer from this text.\n\n"
                f"Text: {original_text}\n\n"
                f"Current extraction: {item}\n\n"
                f"Return corrected JSON if needed, or 'CORRECT' if accurate."
            ),
        }],
    )
    reply = response.content[0].text.strip()
    if reply == "CORRECT":
        return item
    # Otherwise the reply is corrected JSON; fall back to the original
    # fields for anything the model omits.
    data = json.loads(reply)
    return ParsedItem(
        id=data.get("id", item.id),
        text=data.get("text", item.text),
        choices=tuple(data.get("choices", item.choices)),
        answer=data.get("answer", item.answer),
    )
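Because the validator only touches `client.messages.create(...)` and reads `response.content[0].text`, it can be exercised offline with a stub that mimics that shape (the `StubClient` class and canned reply are hypothetical test scaffolding, not part of any SDK):

```python
import json
from types import SimpleNamespace


class StubClient:
    """Offline stand-in matching the client.messages.create(...) call shape."""

    def __init__(self, reply: str):
        self._reply = reply
        self.messages = SimpleNamespace(create=self._create)

    def _create(self, **kwargs):
        # Mimic an Anthropic-style response object: .content[0].text
        return SimpleNamespace(content=[SimpleNamespace(text=self._reply)])


client = StubClient('{"answer": "B"}')
resp = client.messages.create(model="stub", max_tokens=10, messages=[])
print(json.loads(resp.content[0].text)["answer"])  # → B
```

This keeps the LLM fallback path under test coverage without spending tokens, which matters when the whole point of the design is that the fallback fires rarely.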
def process_document(
    content: str,
    *,
    llm_client=None,
    confidence_threshold: float = 0.95,
) -> list[ParsedItem]:
    """Full pipeline: regex -> confidence check -> LLM for edge cases."""
    # Step 1: regex extraction (handles 95-98%)
    items = parse_structured_text(content)
    # Step 2: confidence scoring
    low_confidence = identify_low_confidence(items, confidence_threshold)
    if not low_confidence or llm_client is None:
        return items
    # Step 3: LLM validation (only for flagged items)
    low_conf_ids = {f.item_id for f in low_confidence}
    result = []
    for item in items:
        if item.id in low_conf_ids:
            result.append(validate_with_llm(item, content, llm_client))
        else:
            result.append(item)
    return result
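The routing step at the heart of the pipeline reduces to a small dispatch: flagged items go through the validator, everything else passes through untouched. A minimal sketch with hypothetical stand-in data:

```python
def dispatch(items, low_conf_ids, validate):
    """Route flagged items through the validator; pass the rest through."""
    return [validate(i) if i["id"] in low_conf_ids else i for i in items]


items = [{"id": "1", "ok": True}, {"id": "2", "ok": False}]
fixed = dispatch(items, {"2"}, lambda i: {**i, "ok": True})
print([i["ok"] for i in fixed])  # → [True, True]
```

The important property is that the expensive path is opt-in per item: with 2% of items flagged, LLM cost scales with the flagged count, not the document size.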
From a production quiz parsing pipeline (410 items):
| Metric | Value |
|---|---|
| Regex success rate | 98.0% |
| Low confidence items | 8 (2.0%) |
| LLM calls needed | ~5 |
| Cost savings vs all-LLM | ~95% |
| Test coverage | 93% |
Weekly Installs: 447
GitHub Stars: 69.1K
First Seen: Feb 17, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: codex (397), opencode (387), gemini-cli (377), github-copilot (376), cursor (364), kimi-cli (363)