talking-head-production by inferen-sh/skills
npx skills add https://github.com/inferen-sh/skills --skill talking-head-production
Create talking head videos with AI avatars and lipsync via the inference.sh CLI.
Requires the inference.sh CLI (infsh); see the install instructions.
infsh login
# Generate dialogue audio
infsh app run falai/dia-tts --input '{
"prompt": "[S1] Welcome to our product tour. Today I will show you three features that will save you hours every week."
}'
# Create talking head video with OmniHuman
infsh app run bytedance/omnihuman-1-5 --input '{
"image": "path/to/portrait.png",
"audio": "path/to/dialogue.mp3"
}'
The source portrait image is critical. Poor portraits = poor video output.
| Requirement | Why | Spec |
|---|---|---|
| Center-framed | Avatar needs face in predictable position | Face centered in frame |
| Head and shoulders | Body visible for natural gestures | Crop below chest |
| Eyes to camera | Creates connection with viewer | Direct frontal gaze |
| Neutral expression | Starting point for animation | Slight smile OK, not laughing/frowning |
| Clear face | Model needs to detect features | No sunglasses, heavy shadows, or obstructions |
| High resolution | Detail preservation | Min 512x512 face region, ideally 1024x1024+ |
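The resolution thresholds in the table can be checked before a portrait is submitted. The following is a minimal sketch, not part of the infsh CLI: `face_region_grade` is a hypothetical helper, and it assumes you already know the pixel dimensions of the face region (for example from a crop tool). It does not detect faces itself.

```bash
#!/usr/bin/env bash
# Hypothetical helper: grade a face region against the table's thresholds.
# Arguments: width and height of the face region in pixels.
face_region_grade() {
  local w=$1 h=$2
  # Use the smaller side so a narrow crop cannot pass on one long edge.
  local min=$(( w < h ? w : h ))
  if (( min >= 1024 )); then
    echo "ideal"
  elif (( min >= 512 )); then
    echo "ok"
  else
    echo "too-small"
  fi
}

face_region_grade 1200 1100
face_region_grade 640 700
face_region_grade 300 512
```

The three calls print `ideal`, `ok`, and `too-small` in that order.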
| Type | When to Use |
|---|---|
| Solid color | Professional, clean, easy to composite |
| Soft bokeh | Natural, lifestyle feel |
| Office/studio | Business context |
| Transparent (via bg removal) | Compositing into other scenes |
# Generate a professional portrait with a clean background
infsh app run falai/flux-dev-lora --input '{
"prompt": "professional headshot photograph of a friendly business person, soft studio lighting, clean grey background, head and shoulders, direct eye contact, neutral pleasant expression, high quality portrait photography"
}'
# Or remove background from existing portrait
infsh app run <bg-removal-app> --input '{
"image": "path/to/portrait-with-background.png"
}'
Audio quality directly impacts lipsync accuracy. Clean audio = accurate lip movement.
| Parameter | Target | Why |
|---|---|---|
| Background noise | None/minimal | Noise confuses lipsync timing |
| Volume | Consistent throughout | Prevents sync drift |
| Sample rate | 44.1kHz or 48kHz | Standard quality |
| Format | MP3 128kbps+ or WAV | Compatible with all tools |
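The sample-rate and format targets above lend themselves to a quick pre-flight check. This sketch assumes you supply the values yourself; `audio_spec_ok` and `probe_sample_rate` are hypothetical helper names, and `probe_sample_rate` assumes FFmpeg's `ffprobe` is installed (it is not part of infsh). Noise and volume consistency are not checked here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: compare audio properties with the table's targets.
# Arguments: sample rate in Hz, format ("mp3" or "wav"), bitrate in kbps (ignored for wav).
audio_spec_ok() {
  local rate=$1 format=$2 kbps=${3:-0}
  # The table asks for 44.1 kHz or 48 kHz.
  if [[ $rate != 44100 && $rate != 48000 ]]; then
    echo "bad sample rate: $rate"
    return 1
  fi
  case $format in
    wav) ;;
    mp3)
      if (( kbps < 128 )); then
        echo "mp3 bitrate below 128kbps: $kbps"
        return 1
      fi
      ;;
    *)
      echo "format not listed in the table: $format"
      return 1
      ;;
  esac
  echo "ok"
}

# Read the first audio stream's sample rate with ffprobe (assumes FFmpeg is installed).
probe_sample_rate() {
  ffprobe -v error -select_streams a:0 -show_entries stream=sample_rate \
    -of default=noprint_wrappers=1:nokey=1 "$1"
}
```

A typical use would be `audio_spec_ok "$(probe_sample_rate narration.wav)" wav`, which prints `ok` when the file matches the targets.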
# Simple narration
infsh app run falai/dia-tts --input '{
"prompt": "[S1] Hi there! I am excited to share something with you today. We have been working on a feature that our users have been requesting for months... and it is finally here."
}'
# With emotion and pacing
infsh app run falai/dia-tts --input '{
"prompt": "[S1] You know what is frustrating? Spending hours on tasks that should take minutes. (sighs) We have all been there. But what if I told you... there is a better way?"
}'
| Model | App ID | Best For | Max Duration |
|---|---|---|---|
| OmniHuman 1.5 | bytedance/omnihuman-1-5 | Multi-character, gestures, high quality | ~30s per clip |
| OmniHuman 1.0 | bytedance/omnihuman-1-0 | Single character, simpler | ~30s per clip |
| PixVerse Lipsync | falai/pixverse-lipsync | Quick lipsync on existing video | Short clips |
| Fabric | falai/fabric-1-0 | Cloth/fabric animation on portraits | Short clips |
# 1. Generate or prepare audio
infsh app run falai/dia-tts --input '{
"prompt": "[S1] Your narration script here."
}'
# 2. Generate talking head
infsh app run bytedance/omnihuman-1-5 --input '{
"image": "portrait.png",
"audio": "narration.mp3"
}'
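When the image and audio paths live in shell variables, the `--input` payload can be assembled rather than typed by hand. A minimal sketch under stated assumptions: `omnihuman_input` is a hypothetical helper, the field names `image` and `audio` are the ones used in the commands above, and the paths are assumed to contain no double quotes or backslashes because no JSON escaping is performed.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the JSON payload for an OmniHuman run.
# Assumes paths contain no double quotes or backslashes (no JSON escaping is done).
omnihuman_input() {
  local image=$1 audio=$2
  printf '{"image": "%s", "audio": "%s"}\n' "$image" "$audio"
}

omnihuman_input "portrait.png" "narration.mp3"

# The payload can then be passed to the same command shown above:
#   infsh app run bytedance/omnihuman-1-5 --input "$(omnihuman_input portrait.png narration.mp3)"
```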
# 1-2. Same as above
# 3. Add captions to the talking head video
infsh app run infsh/caption-videos --input '{
"video": "talking-head.mp4",
"caption_file": "captions.srt"
}'
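The `captions.srt` file referenced above has to come from somewhere. As a rough sketch, the helpers below emit cues in the SubRip layout (index, `HH:MM:SS,mmm --> HH:MM:SS,mmm`, text, blank line). `srt_time` and `srt_cue` are hypothetical names, and the timings are illustrative; real cue times should come from the narration audio.

```bash
#!/usr/bin/env bash
# Hypothetical helper: format milliseconds as an SRT timestamp (HH:MM:SS,mmm).
srt_time() {
  local ms=$1
  printf '%02d:%02d:%02d,%03d' \
    $(( ms / 3600000 )) $(( ms / 60000 % 60 )) $(( ms / 1000 % 60 )) $(( ms % 1000 ))
}

# Hypothetical helper: print one SRT cue.
# Arguments: cue index, start in ms, end in ms, caption text.
srt_cue() {
  local index=$1 start_ms=$2 end_ms=$3 text=$4
  printf '%d\n%s --> %s\n%s\n\n' \
    "$index" "$(srt_time "$start_ms")" "$(srt_time "$end_ms")" "$text"
}

# Illustrative timings; redirect the output to captions.srt when they are final.
srt_cue 1 0 2500 "Welcome to our product tour."
srt_cue 2 2500 7000 "Today I will show you three features."
```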
For content longer than 30 seconds, split into segments:
# Generate audio segments
infsh app run falai/dia-tts --input '{"prompt": "[S1] Segment one script."}' --no-wait
infsh app run falai/dia-tts --input '{"prompt": "[S1] Segment two script."}' --no-wait
infsh app run falai/dia-tts --input '{"prompt": "[S1] Segment three script."}' --no-wait
# Generate talking head for each segment (same portrait for consistency)
infsh app run bytedance/omnihuman-1-5 --input '{"image": "portrait.png", "audio": "segment1.mp3"}' --no-wait
infsh app run bytedance/omnihuman-1-5 --input '{"image": "portrait.png", "audio": "segment2.mp3"}' --no-wait
infsh app run bytedance/omnihuman-1-5 --input '{"image": "portrait.png", "audio": "segment3.mp3"}' --no-wait
# Merge all segments
infsh app run infsh/media-merger --input '{
"media": ["segment1.mp4", "segment2.mp4", "segment3.mp4"]
}'
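The segment count and the merge payload follow mechanically from the script length. The sketch below assumes the ~30 second limit stated above and the `segmentN.mp4` naming used in the example; `segments_needed` and `merge_input` are hypothetical helpers, not infsh features.

```bash
#!/usr/bin/env bash
# Hypothetical helper: number of clips needed for a given duration.
# Arguments: total duration in seconds, optional max clip length (default 30).
segments_needed() {
  local duration=$1 max=${2:-30}
  # Integer ceiling division.
  echo $(( (duration + max - 1) / max ))
}

# Hypothetical helper: build the media-merger payload for N segment files.
merge_input() {
  local n=$1 list="" i
  for (( i = 1; i <= n; i++ )); do
    list+="\"segment${i}.mp4\""
    if (( i < n )); then
      list+=", "
    fi
  done
  printf '{"media": [%s]}\n' "$list"
}

count=$(segments_needed 95)
echo "$count"
merge_input "$count"
```

For a 95 second script this prints `4`, followed by a payload listing `segment1.mp4` through `segment4.mp4`.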
OmniHuman 1.5 supports up to 2 characters:
# 1. Generate dialogue with two speakers
infsh app run falai/dia-tts --input '{
"prompt": "[S1] So tell me about the new feature. [S2] Sure! We built a dashboard that shows real-time analytics. [S1] That sounds great. How long did it take? [S2] About two weeks from concept to launch."
}'
# 2. Create video with two characters
infsh app run bytedance/omnihuman-1-5 --input '{
"image": "two-person-portrait.png",
"audio": "dialogue.mp3"
}'
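The two-speaker prompt above alternates `[S1]` and `[S2]` tags. For longer scripts, the tags can be added automatically. A small sketch, assuming strict turn-taking starting with speaker 1; `dialogue_prompt` is a hypothetical helper, and lines containing double quotes would still need JSON escaping before being placed in `--input`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: join lines into a prompt with alternating [S1]/[S2] tags.
# Each argument is one turn; speaker 1 always goes first.
dialogue_prompt() {
  local out="" speaker=1 line
  for line in "$@"; do
    out+="[S${speaker}] ${line} "
    speaker=$(( 3 - speaker ))   # alternate 1 -> 2 -> 1
  done
  # Strip the trailing space added after the last turn.
  printf '%s\n' "${out% }"
}

dialogue_prompt \
  "So tell me about the new feature." \
  "Sure! We built a dashboard that shows real-time analytics." \
  "That sounds great. How long did it take?"
```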
┌─────────────────────────────────┐
│ Headroom (minimal) │
│ ┌───────────────────────────┐ │
│ │ │ │
│ │ ● ─ ─ Eyes at 1/3 ─ ─│─ │ ← Eyes at top 1/3 line
│ │ /|\ │ │
│ │ | Head & shoulders │ │
│ │ / \ visible │ │
│ │ │ │
│ └───────────────────────────┘ │
│ Crop below chest │
└─────────────────────────────────┘
| Mistake | Problem | Fix |
|---|---|---|
| Low-res portrait | Blurry face, poor lipsync | Use 1024x1024+ face region |
| Profile/side angle | Lipsync can't track mouth well | Use frontal or near-frontal |
| Noisy audio | Lipsync drifts, looks unnatural | Record clean or use TTS |
| Too-long clips | Quality degrades after 30s | Split into segments, stitch |
| Sunglasses/obstruction | Face features hidden | Clear face required |
| Inconsistent lighting | Uncanny when animated | Even, soft lighting |
| No captions | Loses silent/mobile viewers | Always add captions |
npx skills add inference-sh/skills@ai-avatar-video
npx skills add inference-sh/skills@ai-video-generation
npx skills add inference-sh/skills@text-to-speech
Browse all apps: infsh app list