npx skills add https://github.com/noizai/skills --skill chat-with-anyone
Clone a real person's voice from online video, or design a voice from a photo, then roleplay as that person with TTS.
This skill synthesizes speech that imitates real voices. Before proceeding, the agent must:
If the user's intent appears harmful, refuse politely and explain why.
| Dependency | Type | How to verify |
|---|---|---|
| ffmpeg | System binary | `ffmpeg -version` |
| yt-dlp | System binary | `yt-dlp --version` |
| tts skill | Cursor skill | `ls skills/tts/scripts/tts.py` |
| NOIZ_API_KEY | Env var or file | `python3 skills/tts/scripts/tts.py config --show` |
Before the first run, verify all dependencies are present:
ffmpeg -version && yt-dlp --version && ls skills/tts/scripts/tts.py
If yt-dlp is missing, install it:
uv pip install yt-dlp
If the Noiz API key is not configured:
python3 skills/tts/scripts/tts.py config --set-api-key YOUR_KEY
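The table's checks can also be collapsed into one preflight sketch (`check_deps` is a hypothetical helper, using only the paths shown above):

```python
import os
import shutil

def check_deps():
    """Return a {dependency: present?} map for the skill's prerequisites."""
    return {
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "yt-dlp": shutil.which("yt-dlp") is not None,
        "tts skill": os.path.exists("skills/tts/scripts/tts.py"),
    }

# Print one line per dependency so a missing one is obvious at a glance.
for name, ok in check_deps().items():
    print(("ok: " if ok else "missing: ") + name)
```

The NOIZ_API_KEY check is omitted here because only `tts.py config --show` knows whether the key is stored in a file rather than the environment.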
Track progress with this checklist:
- [ ] A1. Disambiguate character
- [ ] A2. Find reference video
- [ ] A3. Download audio + subtitles
- [ ] A4. Extract best reference segment
- [ ] A5. Generate speech
If the character is ambiguous (e.g. "US President", "Spider-Man actor"), ask the user to specify the exact person before proceeding.
Use web search to find a YouTube (or Bilibili) video of the person speaking clearly. Best candidates: interviews, speeches, press conferences. Avoid videos with heavy background music.
Search queries to try:
- {CHARACTER_NAME} interview / {CHARACTER_NAME} 采访
- {CHARACTER_NAME} speech / {CHARACTER_NAME} 演讲
- {CHARACTER_NAME} press conference

Create a working directory, then download the audio and subtitles:
mkdir -p "tmp/chat_with_anyone/{CHARACTER_NAME}"
yt-dlp -x --audio-format mp3 \
--write-subs --write-auto-subs --sub-langs "en,zh-Hans" \
--convert-subs srt \
-o "tmp/chat_with_anyone/{CHARACTER_NAME}/%(title)s.%(ext)s" \
"{VIDEO_URL}"
After download, list the output directory to identify the audio file and SRT subtitle file:
ls tmp/chat_with_anyone/{CHARACTER_NAME}/
Expected output: a .mp3 audio file and one or more .srt subtitle files.
If no subtitle files appear: try a different video that has auto-generated captions, or adjust --sub-langs for the target language.
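Identifying the two files can be scripted; a minimal sketch (the `find_download` helper is illustrative, not part of the skill):

```python
from pathlib import Path

def find_download(outdir):
    """Return (audio, srt) paths from a yt-dlp output directory.

    Either entry is None if no matching file was downloaded.
    """
    d = Path(outdir)
    audio = next(iter(sorted(d.glob("*.mp3"))), None)
    # yt-dlp may write several .srt files (one per subtitle language);
    # take the first alphabetically and adjust if a specific track is needed.
    srt = next(iter(sorted(d.glob("*.srt"))), None)
    return audio, srt
```

If `srt` comes back as None, that is the "no subtitle files" case described above.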
Use the automated extraction script — it parses the SRT, finds the densest 3-12 second speech window, and extracts it as a WAV:
python3 skills/chat-with-anyone/scripts/extract_ref_segment.py \
--srt "tmp/chat_with_anyone/{CHARACTER_NAME}/{SRT_FILE}" \
--audio "tmp/chat_with_anyone/{CHARACTER_NAME}/{AUDIO_FILE}" \
-o "tmp/chat_with_anyone/{CHARACTER_NAME}/ref.wav"
The script prints the selected time range and saves the reference WAV. Verify the output exists and is non-empty before proceeding.
If the script reports no suitable segment: try --min-duration 2 for shorter clips, or download a different video.
Write a response in character, then synthesize it:
python3 skills/tts/scripts/tts.py \
-t "{RESPONSE_TEXT}" \
--ref-audio "tmp/chat_with_anyone/{CHARACTER_NAME}/ref.wav" \
-o "tmp/chat_with_anyone/{CHARACTER_NAME}/reply.wav"
Present the generated audio file to the user along with the text. For subsequent messages, reuse the same --ref-audio path.
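To keep earlier turns instead of overwriting reply.wav each time, one option is to number the outputs. The `next_reply` helper below is an assumption for illustration, not part of the skill:

```python
from pathlib import Path

def next_reply(outdir):
    """Return the first unused reply_N.wav path in outdir."""
    n = 1
    while (Path(outdir) / f"reply_{n}.wav").exists():
        n += 1
    return str(Path(outdir) / f"reply_{n}.wav")
```

Generate the path first, then pass it to tts.py as the -o argument for that turn.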
Track progress with this checklist:
- [ ] B1. Analyze image
- [ ] B2. Design voice
- [ ] B3. Preview (optional)
- [ ] B4. Generate speech
Use your vision capability to examine the image and derive a voice description from it. Pass both the image and the description to the voice-design script:
python3 skills/chat-with-anyone/scripts/voice_design.py \
--picture "{IMAGE_PATH}" \
--voice-description "{VOICE_DESCRIPTION}" \
-o "tmp/chat_with_anyone/voice_design"
The script outputs voice_id.txt, containing the best voice ID. Read the voice ID:
cat tmp/chat_with_anyone/voice_design/voice_id.txt
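Downstream calls need the ID as a clean token; a small sketch (the `read_voice_id` helper is an assumption, and it assumes the ID itself contains no whitespace):

```python
from pathlib import Path

def read_voice_id(path):
    """Read the stored voice ID, stripping surrounding whitespace and newlines."""
    return Path(path).read_text(encoding="utf-8").strip()
```

Use the returned string as the --voice-id argument in B4.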
Present the preview audio files from the output directory so the user can hear the voice. If unsatisfied, re-run B2 with adjusted --voice-description or --guidance-scale.
python3 skills/tts/scripts/tts.py \
-t "{RESPONSE_TEXT}" \
--voice-id "{VOICE_ID}" \
-o "tmp/chat_with_anyone/voice_design/reply.wav"
For subsequent messages, keep using the same --voice-id for consistency.
User: I want to chat with Trump and have him tell me a bedtime story.
Agent steps:
1. Search "Donald Trump speech youtube" and find a clear speech video.
2. yt-dlp -x --audio-format mp3 --write-subs --write-auto-subs --sub-langs "en" --convert-subs srt -o "tmp/chat_with_anyone/trump/%(title)s.%(ext)s" "https://youtube.com/watch?v=..."
3. python3 skills/chat-with-anyone/scripts/extract_ref_segment.py --srt "tmp/chat_with_anyone/trump/....srt" --audio "tmp/chat_with_anyone/trump/....mp3" -o "tmp/chat_with_anyone/trump/ref.wav"
4. python3 skills/tts/scripts/tts.py -t "Let me tell you a tremendous bedtime story..." --ref-audio "tmp/chat_with_anyone/trump/ref.wav" -o "tmp/chat_with_anyone/trump/reply.wav"
5. Present reply.wav and the story text to the user.

User: [uploads photo.jpg] I want to chat with the person in this picture.
Agent steps:
1. python3 skills/chat-with-anyone/scripts/voice_design.py --picture "photo.jpg" --voice-description "A young Chinese woman around 25, gentle and warm voice, friendly tone" -o "tmp/chat_with_anyone/voice_design"
2. Read the voice ID from tmp/chat_with_anyone/voice_design/voice_id.txt.
3. python3 skills/tts/scripts/tts.py -t "你好呀!很高兴认识你!" --voice-id "{VOICE_ID}" -o "tmp/chat_with_anyone/voice_design/reply.wav"
4. Continue the roleplay with the same --voice-id.

| Problem | Solution |
|---|---|
| yt-dlp download fails or video unavailable | Try a different video URL; some regions/videos are restricted. Run `yt-dlp -U` to update |
| No SRT subtitle files | Re-download with `--sub-langs "en,zh-Hans"`; if still none, try a different video with auto-captions |
| `extract_ref_segment.py` finds no suitable window | Use `--min-duration 2` for shorter clips, or try a different video |
| Voice design returns error | Check the Noiz API key; ensure the image is a clear photo of a person |
| TTS output sounds wrong | For Workflow A, try a different reference video; for Workflow B, adjust `--voice-description` |
Weekly Installs: 1.5K
GitHub Stars: 402
First Seen: Mar 3, 2026
Security Audits: Gen Agent Trust Hub: Pass; Socket: Pass; Snyk: Warn
Installed on: opencode (1.5K), gemini-cli (1.5K), kimi-cli (1.5K), cursor (1.5K), cline (1.5K), codex (1.5K)