nowait-reasoning-optimizer by davila7/claude-code-templates
npx skills add https://github.com/davila7/claude-code-templates --skill nowait-reasoning-optimizer
Implements the NOWAIT technique from the paper "Wait, We Don't Need to 'Wait'! Removing Thinking Tokens Improves Reasoning Efficiency" (Wang et al., 2025).
NOWAIT is a training-free inference-time intervention that suppresses self-reflection tokens (e.g., "Wait", "Hmm", "Alternatively") during generation, reducing chain-of-thought (CoT) trajectory length by 27-51% without compromising model utility.
| Model Series | Type | Token Reduction |
|---|---|---|
| QwQ-32B | RL-based | 16-31% |
| Phi4-Reasoning-Plus | RL-based | 23-28% |
| Qwen3-32B | RL-based | 13-16% |
| Kimi-VL-A3B | Multimodal | 40-60% |
| QvQ-72B-Preview | Multimodal | 20-30% |
Important: NOWAIT works best with RL-based models. Distilled models (Qwen3-4B/8B/14B) show degraded performance when reflection tokens are suppressed.
```python
from scripts.nowait_processor import NOWAITLogitProcessor

# Initialize the processor for your model's tokenizer
processor = NOWAITLogitProcessor(tokenizer)

# Use during generation
outputs = model.generate(
    inputs,
    logits_processor=[processor],
    max_new_tokens=32768,
)
```
See references/keywords.md for the complete list. Core keywords:
```
wait, alternatively, hmm, but, however, check,
double-check, maybe, verify, again, oh, ah
```
| Token | Logit (before) | Logit (after) |
|---|---|---|
| Wait | 0.8 | -inf |
| First | 0.6 | 0.6 |
| Hmm | 0.5 | -inf |
| Let | 0.4 | 0.4 |
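The masking step illustrated above can be sketched in plain Python. This is a minimal sketch of the idea only; the internals of the shipped `NOWAITLogitProcessor` are an assumption here, and real tokenizers split words into subword ids rather than whole tokens:

```python
# Hypothetical sketch: at each decoding step, token ids whose surface form
# is a reflection keyword get their logit forced to -inf, so the sampler
# can never pick them.
REFLECTION_KEYWORDS = {
    "wait", "alternatively", "hmm", "but", "however", "check",
    "double-check", "maybe", "verify", "again", "oh", "ah",
}

def banned_ids(vocab, keywords=REFLECTION_KEYWORDS):
    """Ids of vocab entries matching a keyword, ignoring case and the
    leading-space marker (\u0120) used by BPE tokenizers."""
    return {i for tok, i in vocab.items()
            if tok.lstrip("\u0120 ").lower() in keywords}

def mask_logits(logits, banned):
    """Copy of `logits` with every banned id set to -inf."""
    return [float("-inf") if i in banned else x for i, x in enumerate(logits)]

# Toy vocabulary mirroring the table above
vocab = {"Wait": 0, "First": 1, "Hmm": 2, "Let": 3}
masked = mask_logits([0.8, 0.6, 0.5, 0.4], banned_ids(vocab))
```

"Wait" and "Hmm" are suppressed while neutral tokens like "First" and "Let" keep their original logits, which is what allows the model to keep reasoning without re-entering reflection loops.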
| Model Type | NOWAIT Effect | Recommendation |
|---|---|---|
| RL-based (QwQ, Phi4, Qwen3-32B) | Stable accuracy, significant token reduction | ✅ Recommended |
| Distilled (Qwen3-4B/8B/14B) | Accuracy degradation on hard tasks | ⚠️ Use with caution |
Distilled models rely heavily on CoT structure from training data—removing reflection tokens disrupts their reasoning patterns.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from scripts.nowait_processor import NOWAITLogitProcessor

model = AutoModelForCausalLM.from_pretrained("Qwen/QwQ-32B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/QwQ-32B")
processor = NOWAITLogitProcessor(tokenizer)

prompt = "..."  # your reasoning prompt
response = model.generate(
    tokenizer(prompt, return_tensors="pt").input_ids,
    logits_processor=[processor],
    max_new_tokens=32768,
    do_sample=True,
    temperature=0.7,
)
```
```python
from vllm import LLM, SamplingParams
from scripts.nowait_processor import get_nowait_bad_words_ids

llm = LLM(model="Qwen/QwQ-32B")
bad_words_ids = get_nowait_bad_words_ids(llm.get_tokenizer())

sampling_params = SamplingParams(
    max_tokens=32768,
    bad_words_ids=bad_words_ids,
)
```
| Task Type | Original Tokens | NOWAIT Tokens | Reduction |
|---|---|---|---|
| Math (AIME) | 15,000 | 10,500 | 30% |
| Visual QA (MMMU) | 2,900 | 1,450 | 50% |
| Video QA (MMVU) | 1,700 | 1,250 | 27% |
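The "Reduction" column above follows directly from `1 - nowait / original`, e.g. for the math and visual-QA rows:

```python
# Reduction figures are computed as: reduction = 1 - nowait / original.
def token_reduction(original, nowait):
    """Fractional CoT-token reduction achieved by NOWAIT on one task."""
    return 1 - nowait / original

aime = token_reduction(15_000, 10_500)  # Math (AIME)
mmmu = token_reduction(2_900, 1_450)    # Visual QA (MMMU)
```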
Files: references/keywords.md, scripts/nowait_processor.py