content-hash-cache-pattern by affaan-m/everything-claude-code
npx skills add https://github.com/affaan-m/everything-claude-code --skill content-hash-cache-pattern
Cache expensive file processing results (PDF parsing, text extraction, image analysis) using SHA-256 content hashes as cache keys. Unlike path-based caching, this approach survives file moves/renames and auto-invalidates when content changes.
Give the CLI a `--cache`/`--no-cache` option, and use file content (not path) as the cache key:
```python
import hashlib
from pathlib import Path

_HASH_CHUNK_SIZE = 65536  # 64KB chunks for large files

def compute_file_hash(path: Path) -> str:
    """SHA-256 of file contents (chunked for large files)."""
    if not path.is_file():
        raise FileNotFoundError(f"File not found: {path}")
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(_HASH_CHUNK_SIZE)
            if not chunk:
                break
            sha256.update(chunk)
    return sha256.hexdigest()
```
Why content hash? File rename/move = cache hit. Content change = automatic invalidation. No index file needed.
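A quick way to see both properties on a throwaway file (the inline hash below is equivalent to `compute_file_hash` above, minus the chunking, which only matters for large files):

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    # Same result as compute_file_hash, unchunked for brevity
    return hashlib.sha256(path.read_bytes()).hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "report.pdf"
    src.write_bytes(b"%PDF-1.7 fake content")
    before = sha256_of(src)

    # Rename: content unchanged, so the cache key is unchanged
    moved = src.rename(Path(tmp) / "renamed.pdf")
    assert sha256_of(moved) == before  # still a cache hit

    # Edit: content changed, so the old cache entry is never found again
    moved.write_bytes(b"%PDF-1.7 new content")
    assert sha256_of(moved) != before  # automatic invalidation
```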
```python
from dataclasses import dataclass

@dataclass(frozen=True, slots=True)
class CacheEntry:
    file_hash: str
    source_path: str
    document: ExtractedDocument  # The cached result
```
Each cache entry is stored as {hash}.json — O(1) lookup by hash, no index file required.
```python
import json
from typing import Any

def write_cache(cache_dir: Path, entry: CacheEntry) -> None:
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{entry.file_hash}.json"
    data = serialize_entry(entry)
    cache_file.write_text(json.dumps(data, ensure_ascii=False), encoding="utf-8")

def read_cache(cache_dir: Path, file_hash: str) -> CacheEntry | None:
    cache_file = cache_dir / f"{file_hash}.json"
    if not cache_file.is_file():
        return None
    try:
        raw = cache_file.read_text(encoding="utf-8")
        data = json.loads(raw)
        return deserialize_entry(data)
    except (json.JSONDecodeError, ValueError, KeyError):
        return None  # Treat corruption as cache miss
```
Keep the processing function pure. Add caching as a separate service layer.
```python
def extract_with_cache(
    file_path: Path,
    *,
    cache_enabled: bool = True,
    cache_dir: Path = Path(".cache"),
) -> ExtractedDocument:
    """Service layer: cache check -> extraction -> cache write."""
    if not cache_enabled:
        return extract_text(file_path)  # Pure function, no cache knowledge

    file_hash = compute_file_hash(file_path)

    # Check cache
    cached = read_cache(cache_dir, file_hash)
    if cached is not None:
        logger.info("Cache hit: %s (hash=%s)", file_path.name, file_hash[:12])
        return cached.document

    # Cache miss -> extract -> store
    logger.info("Cache miss: %s (hash=%s)", file_path.name, file_hash[:12])
    doc = extract_text(file_path)
    entry = CacheEntry(file_hash=file_hash, source_path=str(file_path), document=doc)
    write_cache(cache_dir, entry)
    return doc
```
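One way (not the only one) to surface the `cache_enabled` toggle on a CLI: `argparse.BooleanOptionalAction` (Python 3.9+) generates the `--cache`/`--no-cache` pair from a single argument definition:

```python
import argparse
from pathlib import Path

parser = argparse.ArgumentParser(description="Extract text with content-hash caching")
parser.add_argument("file", type=Path)
parser.add_argument(
    "--cache",
    action=argparse.BooleanOptionalAction,  # registers both --cache and --no-cache
    default=True,
    help="read/write the content-hash cache (default: enabled)",
)
parser.add_argument("--cache-dir", type=Path, default=Path(".cache"))

args = parser.parse_args(["doc.pdf", "--no-cache"])
# args.cache is now False; pass it straight through to the service layer:
# extract_with_cache(args.file, cache_enabled=args.cache, cache_dir=args.cache_dir)
```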
| Decision | Rationale |
|---|---|
| SHA-256 content hash | Path-independent, auto-invalidates on content change |
| `{hash}.json` file naming | O(1) lookup, no index file needed |
| Service layer wrapper | SRP: extraction stays pure, cache is a separate concern |
| Manual JSON serialization | Full control over frozen dataclass serialization |
| Corruption returns `None` | Graceful degradation, re-processes on next run |
| `cache_dir.mkdir(parents=True)` | Lazy directory creation on first write |
```python
# BAD: Path-based caching (breaks on file move/rename)
cache = {"/path/to/file.pdf": result}

# BAD: Adding cache logic inside the processing function (SRP violation)
def extract_text(path, *, cache_enabled=False, cache_dir=None):
    if cache_enabled:  # Now this function has two responsibilities
        ...

# BAD: Using dataclasses.asdict() with nested frozen dataclasses
# (can cause issues with complex nested types)
data = dataclasses.asdict(entry)  # Use manual serialization instead
```
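Putting it together, a runnable miss-then-hit sketch. A stub `extract_text` (with a call counter) stands in for expensive parsing, and the cached result is a plain dict so the example stays self-contained; the hash, `{hash}.json` store, and service layer mirror the snippets above:

```python
import hashlib
import json
import tempfile
from pathlib import Path

calls = {"extract": 0}

def extract_text(path: Path) -> dict:
    # Stub for the expensive pure function
    calls["extract"] += 1
    return {"text": path.read_text(encoding="utf-8")}

def extract_with_cache(path: Path, cache_dir: Path) -> dict:
    file_hash = hashlib.sha256(path.read_bytes()).hexdigest()
    cache_file = cache_dir / f"{file_hash}.json"
    if cache_file.is_file():
        return json.loads(cache_file.read_text(encoding="utf-8"))  # hit
    doc = extract_text(path)  # miss -> extract
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(doc), encoding="utf-8")
    return doc

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "a.txt"
    src.write_text("hello", encoding="utf-8")
    cache_dir = Path(tmp) / ".cache"

    extract_with_cache(src, cache_dir)       # miss: extract_text runs
    extract_with_cache(src, cache_dir)       # hit: served from {hash}.json
    assert calls["extract"] == 1

    moved = src.rename(Path(tmp) / "b.txt")  # move/rename the file
    extract_with_cache(moved, cache_dir)     # still a hit: same content hash
    assert calls["extract"] == 1
```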
Wire `cache_enabled` through to `--cache`/`--no-cache` options on the CLI tool.

Weekly Installs: 455
GitHub Stars: 69.1K
First Seen: Feb 17, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: codex (405), opencode (395), gemini-cli (386), github-copilot (384), kimi-cli (371), cursor (371)