The Agent Skills Directory
npx skills add https://code.deepline.com
Discover differential signals between Closed Won and Closed Lost accounts by extracting multi-page website content and job listings, then computing Laplace-smoothed lift scores to identify what distinguishes buyers from non-buyers.
Required: deepline enrich (Deepline CLI).
NOT required: separate API keys for exa, crustdata, etc.
Credits: Enrichment consumes Deepline credits (~6 credits/company for exa + crustdata). Discovery prompts with deeplineagent are additional paid calls. Always get user approval before running paid enrichment.
Always use Deepline CLI (deepline enrich, deepline tools, deepline playground) for enrichment, data extraction, and batch operations. Deepline provides:
Use deepline enrich for all enrichment steps. Use deepline tools execute for one-off tool calls. Use deepline playground for inspecting results. Refer to the gtm-meta-skill for Deepline command patterns and provider playbooks.
0. Discover target company (what they sell, who they sell to, differentiation)
0.5. Discover ecosystem (competitors, tech stack category, buyer personas)
1. Prepare input CSV (domain, status) — deduplicate domains that appear in both won and lost
1.5. Generate vertical-specific configs (keywords, tools, job roles)
2. Multi-page website extraction + job listings (Deepline enrichment)
3. Quality gate — verify file completeness + coverage (>80% with content)
3.5. Review generated configs (validate against enriched data)
4. Run differential analysis (scripts/analyze_signals.py)
5. Generate report (using references/report-template.md)
6. Review with signal interpretation rules (references/signal-interpretation.md)
Not all signals are equal. From actual runs across multiple verticals, signals follow a clear reliability order:
| Rank | Signal Source | Reliability | Why |
|---|---|---|---|
| 1 | Job listings (hiring for domain-related roles) | Highest | Active budget + acknowledged pain. A company hiring 3 AEs is a stronger signal than "sales" on their website. |
| 2 | Analyst validation (Gartner, Forrester mentions) | Very High | Enterprise maturity + category awareness. Typically 4-7x lift, rarely appears in lost group. |
| 3 | Compliance infrastructure (SOC2, GDPR, ISO) | High | Procurement maturity + enterprise readiness. Companies with compliance pages have formal approval processes. |
| 4 | Buyer pain language (on careers/blog pages) | High | Operational awareness of the problem — e.g., "fragmented tools" at 3-6x lift for creative ops targets. |
| 5 | Tech stack tools (niche SaaS specific to persona) | Medium | Infrastructure readiness — niche SaaS tools at 2-4x lift for vertical-specific buyers. |
| 6 | Website product/marketing content | Variable | Can indicate buyer OR competitor — source context is everything. |
When website signals fail: For B2B infrastructure tools (AR automation, billing, compliance), buyers DON'T publish their pain on public websites. A wholesale distributor talks about their products on their website, not accounts receivable challenges. For these verticals, prioritize job listings, tech stack, and firmographic signals over website keyword matching.
CRITICAL: Do this FIRST before any enrichment or config generation.
Use deeplineagent to understand what the target company sells and who they sell to:
# Example prompt
deeplineagent: "Research {{company-domain}}. Summarize what the company sells, who they sell to, what makes them different, and any example customers. Use Deepline-managed tools if needed."
Document the following:
Why this matters: The entire pipeline (exa query, keywords, tech stack, job roles) adapts based on this discovery. Skipping this step results in generic/irrelevant signals.
Use deeplineagent to discover the competitive landscape and buyer ecosystem:
Competitor Discovery:
deeplineagent: "Research the competitive landscape for {product category}. List 3-5 relevant software companies or alternatives."
Example: For a creative ops/DAM tool, ask for the main "{product category} software alternatives competitors".
Tech Stack Discovery:
deeplineagent: "Research the common software stack for {buyer persona}. Group the tools by category."
Example: For creative teams, ask for the common "creative teams software stack".
Job Role Discovery:
deeplineagent: "Research the common job titles, responsibilities, and hiring patterns for {buyer persona}. Return 10-15 role variants."
Example: For creative ops, ask for "creative operations job titles creative director content manager".
Document findings:
Create output/{{company}}-icp-input.csv:
domain,status
customer1.com,won
customer2.com,won
non-customer1.com,lost
non-customer2.com,lost
CRITICAL — Deduplicate before enrichment. If the same domain appears in both won and lost groups (same company, multiple CRM deals), Deepline may only fetch job listings once (for the first row). The duplicate domain's content is identical in both groups — including it pollutes lift scores and causes won_with_jobs to be undercounted. Always check and remove duplicate domains before running enrichment:
# Check for duplicates after building the input CSV
from collections import Counter
domain_counts = Counter(r['domain'] for r in rows)
duplicate_domains = {d for d, c in domain_counts.items() if c > 1}
if duplicate_domains:
print(f"WARNING: {len(duplicate_domains)} domains appear in both won and lost:")
for d in sorted(duplicate_domains):
print(f" {{d}}")
print("Remove these rows before enrichment — they pollute lift scores.")
If duplicates exist, remove ALL rows for those domains (not just one copy). The same company with different deal outcomes tells us nothing about what distinguishes buyers from non-buyers.
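The "remove ALL rows" rule can be sketched as a small filter. This is a minimal illustration, not part of the skill's scripts; it assumes `rows` is the list of dicts read from the input CSV:

```python
def drop_cross_group_domains(rows):
    """Drop every row whose domain appears in BOTH the won and lost
    groups; keeping either copy would pollute lift scores."""
    won = {r["domain"] for r in rows if r["status"] == "won"}
    lost = {r["domain"] for r in rows if r["status"] == "lost"}
    conflicted = won & lost
    return [r for r in rows if r["domain"] not in conflicted]

rows = [
    {"domain": "customer1.com", "status": "won"},
    {"domain": "dupe.com", "status": "won"},
    {"domain": "dupe.com", "status": "lost"},
    {"domain": "non-customer1.com", "status": "lost"},
]
clean = drop_cross_group_domains(rows)
# dupe.com is removed from both groups; the other rows survive
```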
Using the discovery from Steps 0 and 0.5, create three JSON config files.
See references/keyword-catalog.md for JSON format and generation guidance.
# Create config files in output/{{company}}/
output/{{company}}-keywords.json # keyword categories
output/{{company}}-tools.json # tech stack tools by category
output/{{company}}-job-roles.json # job role categories
Generation approach:
Keywords — Mix of:
Tech Stack — Tools from ecosystem discovery:
Job Roles — Titles from role discovery:
See references/keyword-catalog.md for multi-vertical examples (creative ops, AR automation, sales engagement, developer tools). Each example includes keywords, tools, and job roles for that vertical.
Validation: Do the generated configs match the target's vertical and buyer persona? If not, refine based on Step 0/0.5 findings.
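As a shape sketch only: the real schema and category guidance live in references/keyword-catalog.md, and the company name ("acme") and every term below are hypothetical placeholders for a creative-ops target. Writing the three config files might look like:

```python
import json
import os

os.makedirs("output", exist_ok=True)

# Hypothetical placeholder configs; see references/keyword-catalog.md
# for the actual format and per-vertical examples.
configs = {
    "acme-keywords.json": {
        "buyer_pain": ["fragmented tools", "asset versioning"],
        "compliance": ["SOC2", "GDPR"],
    },
    "acme-tools.json": {
        "creative_stack": ["Figma", "Frame.io"],
    },
    "acme-job-roles.json": {
        "creative_ops": ["Creative Operations Manager", "Creative Director"],
    },
}
for name, payload in configs.items():
    with open(os.path.join("output", name), "w") as f:
        json.dump(payload, f, indent=2)
```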
CRITICAL: Never scrape just the homepage. Use exa_search with contents.text to discover AND scrape ~8 pages per domain in a single API call.
Generate exa query dynamically based on target's product category:
# Generic query (works for most B2B SaaS selling to marketing/sales/product teams)
QUERY="company product features integrations customers security pricing careers about case-studies"
# For tools selling to back-office teams (finance, HR, legal):
# Buyers don't publish pain on marketing pages — add compliance/audit pages where signals live
QUERY="company product features integrations customers security pricing careers compliance audit regulatory about"
# For developer tools:
# Add documentation/API pages — these reveal infrastructure maturity and integration readiness
QUERY="company product features documentation api changelog github integrations security pricing careers about"
# For creative/marketing tools:
QUERY="company product features portfolio use cases creative workflow customers integrations security pricing careers about"
# For sales tools:
QUERY="company product features playbooks outbound pipeline customers integrations security pricing careers about"
Example:
deepline enrich \
--input output/{{company}}-icp-input.csv \
--output output/{{company}}-enriched.csv \
--with '{"alias":"website","tool":"exa_search","payload":{"query":"{{exa-query-from-above}}","numResults":8,"type":"auto","includeDomains":["{{domain}}"],"contents":{"text":{"maxCharacters":3000,"verbosity":"compact","includeSections":["body"]}}}}' \
  --with '{"alias":"jobs","tool":"crustdata_job_listings","payload":{"companyDomains":"{{domain}}","limit":50}}'
Why exa_search with contents (not parallel_extract)?
Get user credit approval before running. Example: "60 companies x 6 credits = ~360 credits."
CRITICAL — Verify file completeness BEFORE running analysis. deepline enrich returns control to the terminal before OS buffers fully flush to disk. Running the analysis script immediately after enrichment completes can read a partially-written file where job columns for the last N rows haven't synced yet — resulting in won_with_jobs: 0 or severely undercounted job data. Always verify:
# 1. Check row count matches input
INPUT_ROWS=$(wc -l < output/{{company}}-icp-input.csv)
OUTPUT_ROWS=$(wc -l < output/{{company}}-enriched.csv)
echo "Input: $INPUT_ROWS rows, Output: $OUTPUT_ROWS rows"
# Output should equal input (both include header)
# 2. Spot-check job data for a known won account with job listings
python3 -c "
import csv, json, sys
csv.field_size_limit(sys.maxsize)
with open('output/{{company}}-enriched.csv') as f:
rows = list(csv.DictReader(f))
won_rows = [r for r in rows if r.get('status') == 'won']
jobs_col = 'jobs' # or use column index
has_jobs = sum(1 for r in won_rows if r.get(jobs_col, '').strip() not in ('', '{}', 'null'))
print(f'Won rows with job data: {{has_jobs}}/{len(won_rows)}')
# If this is 0 and you know won accounts should have listings, wait and re-run
"
If won_with_jobs is 0 but you expect job data:
Wait a few seconds and re-run the analysis (no re-enrichment needed); the file may still be flushing. Also check whether the enriched CSV exposes the website and jobs alias column names rather than __dl_full_result__, and pass the --website-col N --jobs-col N overrides if so.

After file verification, check coverage:
If coverage is poor, re-run failed domains with --rows targeting specific rows.
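The >80% coverage gate from Step 3 can be computed directly. A minimal sketch, assuming the website content lives in a column aliased `website` (adjust to the actual column name); in practice `rows` would come from `csv.DictReader` over the enriched CSV:

```python
def content_coverage(rows, col="website"):
    """Share of rows whose enrichment column holds real content
    (empty strings, '{}' and 'null' count as failed fetches)."""
    if not rows:
        return 0.0
    filled = sum(
        1 for r in rows
        if (r.get(col) or "").strip() not in ("", "{}", "null")
    )
    return filled / len(rows)

# Illustrative in-memory rows; real usage reads the enriched CSV.
rows = [
    {"domain": "a.com", "website": '{"data": {"results": []}}'},
    {"domain": "b.com", "website": ""},  # failed fetch
]
cov = content_coverage(rows)
if cov < 0.8:
    print(f"Coverage {cov:.0%} is below the 80% gate; re-run failed domains")
```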
If customer domains came from automated extraction (CRM exports, Exa API, case study scraping) rather than a manually verified list, validate that domains actually belong to the named companies. From actual runs: up to 53% of auto-extracted customers can be false positives — competitors selling the same product, domain mismatches, and unrelated companies.
# Check for suspicious domain patterns
python3 -c "
import csv, sys
csv.field_size_limit(sys.maxsize)
with open('output/{{company}}-enriched.csv') as f:
rows = list(csv.DictReader(f))
for r in rows:
domain = r.get('domain', '')
# Flag content platforms used as source URLs, not company domains
if any(x in domain for x in ['blog.', 'medium.com', 'substack.', 'wordpress.']):
print(f'WARNING: {{domain}} looks like a content platform, not a company domain')
# Flag very short domains that might be generic
if len(domain.split('.')[0]) <= 2:
print(f'CHECK: {{domain}} — very short domain, verify it belongs to the expected company')
"
Red flags for false positives:
Before running analysis, spot-check the generated configs against enriched data:
# Sample a few enriched companies
deepline playground output/{{company}}-enriched.csv
# In playground UI, check:
# - Do website pages mention the keywords in keywords.json?
# - Do job listings mention the roles in job-roles.json?
# - Do integrations/tech stack pages mention the tools in tools.json?
Red flags:
Fix and re-generate configs if needed.
Run the analysis script with the config files:
python3 scripts/analyze_signals.py \
--input output/{{company}}-enriched.csv \
--keywords output/{{company}}-keywords.json \
--tools output/{{company}}-tools.json \
--job-roles output/{{company}}-job-roles.json \
--output output/{{company}}-analysis.json
The script auto-detects __dl_full_result__ columns for website and jobs data. Override with --website-col N --jobs-col N if needed.
What the script computes:
Lift = ((won + 0.5) / (won_total + 1)) / ((lost + 0.5) / (lost_total + 1))

Read references/report-template.md for the full report structure and quality rules.
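The Laplace-smoothed lift can be checked by hand. A sketch with hypothetical counts (a keyword in 6 of 37 won companies vs 1 of 18 lost companies), showing that the 0.5 smoothing also keeps zero lost-side counts from dividing by zero:

```python
def laplace_lift(won, won_total, lost, lost_total):
    """Laplace-smoothed lift, matching the formula above."""
    won_rate = (won + 0.5) / (won_total + 1)
    lost_rate = (lost + 0.5) / (lost_total + 1)
    return won_rate / lost_rate

# Hypothetical keyword: 6/37 won vs 1/18 lost
lift = laplace_lift(6, 37, 1, 18)       # ~2.17x
# Same keyword with zero lost-side hits: finite, not infinite
lift_zero = laplace_lift(6, 37, 0, 18)  # 6.5x
```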
Report structure overview:
Signal Strength Bar scale (use in Section 0.2):
≥10x → 🟩🟩🟩🟩🟩🟩 ≥4x → 🟩🟩🟩🟩🟩 ≥2.5x → 🟩🟩🟩🟩
≥2.0x → 🟩🟩🟩 ≥1.5x → 🟩🟩 ≥1.0x → 🟩
≥0.4x → 🟥🟥 ≥0.25x → 🟥🟥🟥 ≥0.15x → 🟥🟥🟥🟥
≥0.07x → 🟥🟥🟥🟥🟥 <0.07x → 🟥🟥🟥🟥🟥🟥
Apollo URL format (use in Section 0.3 and 0.4):
People: https://app.apollo.io/#/people?personTitles[]=Title+One&personTitles[]=Title+Two&personSeniorities[]=vp&personSeniorities[]=director&qOrganizationKeywordTags[]=vertical&organizationLocations[]=United+States&page=1
Companies: https://app.apollo.io/#/companies?qOrganizationKeywordTags[]=keyword&organizationLocations[]=United+States&organizationNumEmployeesRanges[]=201-500&page=1
Use qOrganizationKeywordTags[] for keyword filters (not hardcoded industry tag IDs).
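The bracketed array-parameter style above can be assembled programmatically. A sketch using only the parameter names shown in the examples (the title, tag, and location values here are hypothetical):

```python
from urllib.parse import urlencode

def apollo_people_url(titles, seniorities, keyword_tags, locations):
    """Build an Apollo people-search URL with repeated []-suffixed
    parameters; safe='[]' keeps the brackets literal, and quote_plus
    (urlencode's default) turns spaces into '+' as in the examples."""
    params = []
    params += [("personTitles[]", t) for t in titles]
    params += [("personSeniorities[]", s) for s in seniorities]
    params += [("qOrganizationKeywordTags[]", k) for k in keyword_tags]
    params += [("organizationLocations[]", loc) for loc in locations]
    params.append(("page", "1"))
    return "https://app.apollo.io/#/people?" + urlencode(params, safe="[]")

url = apollo_people_url(
    ["Creative Operations Manager"], ["vp", "director"],
    ["creative ops"], ["United States"],
)
```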
Key quality rules for all sections:
Raw counts always: 15% (6), not just 15%
Sample sizes in headers: Won (n=37), Lost (n=18)
Bold only lift > 2x AND count >= 3 companies — a signal in 1 company with 10x lift is less reliable than a signal in 4 companies with 3x lift
Flag n=1 signals: If a signal appears in only 1 won company, add a note: *(single company — verify before using in scoring)*. In the scoring model, give n=1 signals 0.3x weight vs n=3+ signals.
Source breakdown for ALL keyword tables: Add a Source column showing 3w / 20j / 2both format (3 website-only, 20 jobs-only, 2 from both). This is critical for distinguishing website-only signals (lower confidence) from job-listing signals (higher confidence).
| Keyword | Won (n=X) | Lost (n=Y) | Lift | Source (w/j/both) | Interpretation |
Source evidence required: After each keyword table, add exact quotes with linked sources for the top 3 keywords. The analysis script outputs evidence per keyword with company, source_type, quote, url, and page_title/job_title. Format:
> **Evidence — "keyword":**
> - [company.com](url) (page title): "...exact quote with keyword..."
> - [company.com](url) (job: "Job Title"): "...exact quote from listing..."
Niche tech stack tools: Report specific SaaS tools by category, not generic keywords. "AWS", "GitHub", "Slack" appear on most B2B sites — these aren't differentiating.
Anti-fit signals in separate section
Interpretation column required: Explains WHY each signal matters for the target company
Vendor-adjacent evidence annotation: When citing evidence quotes, annotate each with ✅ (clear buyer signal) or ⚠️ (vendor-adjacent — e.g., the company's own product/pricing page mentions the keyword because they sell something similar). This prevents treating competitor evidence as buyer evidence.
Scoring reconciliation: Section 0.5 (Lead Scoring Cheatsheet) and Section 6 (Scoring Model) MUST have matching point values. After writing both sections, cross-check every signal's point allocation. Mismatches confuse users who reference both.
Dataset caveat: If the dataset uses Lookalikes as Won, has small sample sizes, or other limitations, add a "Dataset Caveat" subsection to the Executive Summary explaining what the limitations are and how they affect interpretation.
Read references/signal-interpretation.md before writing interpretation columns. Key rules:
Website data: __dl_full_result__ column containing exa_search response.
data.results[].text — page content
data.results[].url — page URL
data.results[].title — page title
Job listings: __dl_full_result__ column containing crustdata response.
data.listings[].title — job title (NOT "job_title")
data.listings[].description — job description (NOT "job_description")
data.listings[].category — job category
data.listings[].url — listing URL
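Pulling page text and job titles out of those two JSON envelopes can be sketched as follows; this assumes the `data.results[]` / `data.listings[]` shapes described above, and the sample cells are hypothetical:

```python
import json

def extract_signals(website_cell, jobs_cell):
    """Parse the __dl_full_result__ JSON envelopes into
    (url, text) page pairs and (title, description) job pairs."""
    pages, jobs = [], []
    if website_cell:
        site = json.loads(website_cell)
        pages = [(r.get("url"), r.get("text", ""))
                 for r in site.get("data", {}).get("results", [])]
    if jobs_cell:
        listing = json.loads(jobs_cell)
        jobs = [(j.get("title"), j.get("description", ""))
                for j in listing.get("data", {}).get("listings", [])]
    return pages, jobs

# Hypothetical cell contents matching the documented shapes
website_cell = '{"data": {"results": [{"url": "https://a.com/pricing", "text": "SOC2 compliant", "title": "Pricing"}]}}'
jobs_cell = '{"data": {"listings": [{"title": "Account Executive", "description": "Own pipeline", "category": "Sales", "url": "https://a.com/jobs/1"}]}}'
pages, jobs = extract_signals(website_cell, jobs_cell)
```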
| Step | Credits per row | Total (60 companies) |
|---|---|---|
| exa_search with contents | ~5 | ~300 |
| crustdata_job_listings | ~1 | ~60 |
| Total | ~6 | ~360 |
Always get user approval before running paid enrichment steps.
Duplicate domains in both won and lost undercount won_with_jobs. Always deduplicate in Step 1.
deepline enrich returns to terminal before OS buffers flush. Run the file completeness check in Step 3 before executing analyze_signals.py. A won_with_jobs: 0 result when you expect data is the symptom; re-running the analysis (without re-enriching) fixes it.
Before writing interpretations, read references/signal-interpretation.md.
These patterns have been validated across multiple customer analyses spanning creative ops, sales engagement, AR automation, legal tech, and GTM tools. Use them as a starting point when interpreting results — but always validate against the specific target's vertical.
| Signal Pattern | Typical Lift | Validated For | What It Means |
|---|---|---|---|
| Analyst validation (Gartner, Forrester) | 4.5x-6.5x | Enterprise B2B SaaS | Company has evaluated the category, has enterprise procurement process |
| Hiring for ICP-related roles | 3.8x-5.5x | All verticals | Active budget + acknowledged pain — highest-intent signal |
| Published case studies | 3.7x | Product-led + sales-assist | Mature marketing org, values proof points, vendor-friendly |
| Compliance infrastructure (GDPR, SOC2, ISO) | 2.1x-6.5x | Enterprise tools | Formal approval processes, security reviews, higher close rates |
| Buyer pain language (e.g., "fragmented tools") | 2.9x-5.2x | Creative ops, MarTech | Operational awareness of the specific problem the target solves |
| SDK/webhook/API presence | 2.5x-3.5x | Developer-adjacent tools | Developer culture, integrates tools programmatically |
| Contact sales / sales-led GTM | 2.2x-5.5x | Enterprise sales tools | Human-led sales motion = AE-dependent = sales engagement tool buyer |
| Niche tech stack (Figma, Frame.io, NetSuite) | 1.5x-5.5x | Vertical-specific | Infrastructure readiness for the target's integration ecosystem |
| Signal Pattern | Typical Lift | What It Means |
|---|---|---|
| Consumer signals (shopper, checkout, cancel, debit) | 0.2x | B2C company, not B2B sales org |
| Retention/churn language | 0.2x-0.4x | Consumer subscription model, not enterprise buying |
| Selling same product category | 0.1x-0.3x | Competitor, not buyer — they SELL the solution |
| No job listings in 12+ months | N/A | Not growing, no hiring budget |
From actual runs, a 0-100 point model with three tiers works well:
Score thresholds: 60+ = Tier 1 immediate outreach, 35-59 = Tier 2 trigger-based, <35 = nurture or skip.
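The threshold logic above, combined with the 0.3x down-weighting of n=1 signals from the report quality rules, can be sketched like this; the signal names and point values are hypothetical:

```python
def tier(score):
    """Map a 0-100 account score to the outreach tiers above."""
    if score >= 60:
        return "Tier 1: immediate outreach"
    if score >= 35:
        return "Tier 2: trigger-based"
    return "Nurture or skip"

# Hypothetical weighted scoring: (signal, points, won-company count).
# n=1 signals get 0.3x weight per the report quality rules.
signals = [("hiring AEs", 25, 4), ("Gartner mention", 20, 1)]
score = sum(pts * (1.0 if n >= 3 else 0.3) for _, pts, n in signals)
# 25*1.0 + 20*0.3 = 31.0, which lands below the Tier 2 cutoff
```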
Weekly installs: 364. First seen: Mar 2, 2026. Installed on: codex (364), cursor (363), gemini-cli (363), github-copilot (363), amp (363), cline (363).