ElevenLabs AI Voice Generation and Remotion Video Composition Tool - Professional Voiceover Production

elevenlabs-remotion by maartenlouis/elevenlabs-remotion-skill

293 weekly installs

2 GitHub Stars

GitHub

Install command

npx skills add https://github.com/maartenlouis/elevenlabs-remotion-skill --skill elevenlabs-remotion

Content creation, automation, audio processing

🇨🇳 Chinese Introduction

ElevenLabs Voiceover Generation

Generate professional AI voiceovers for Remotion videos using the ElevenLabs API.

Prerequisites

Set ELEVENLABS_API_KEY in the .env.local file.

Quick Start

# Generate voiceover from text
node .claude/skills/elevenlabs-remotion-skill/generate.js --text "Your text here" --output public/audio/voiceover.mp3

# Generate with narrator style (more natural)
node .claude/skills/elevenlabs-remotion-skill/generate.js --text "Your text" --character narrator --output voiceover.mp3

# Generate scenes with request stitching
node .claude/skills/elevenlabs-remotion-skill/generate.js --scenes remotion/scenes.json --output-dir public/audio/project/

# Regenerate a single scene
node .claude/skills/elevenlabs-remotion-skill/generate.js --scenes scenes.json --scene scene2 --new-text "Updated text"

# List available voices and character presets
node .claude/skills/elevenlabs-remotion-skill/generate.js --list-voices
node .claude/skills/elevenlabs-remotion-skill/generate.js --list-characters

Character Presets

Use character presets for more natural voiceovers instead of reading on-screen text literally:

Scene-Based Generation with Request Stitching

Generate multiple scenes with consistent prosody using ElevenLabs request stitching:

{
  "name": "product-demo",
  "voice": "George",
  "character": "narrator",
  "scenes": [
    {
      "id": "scene1",
      "text": "Generic text-to-speech sounds robotic. Your brand deserves better.",
      "duration": 4.5,
      "character": "dramatic"
    },
    {
      "id": "scene2",
      "text": "With voice cloning, you can use your own voice for unlimited content.",
      "duration": 5.5
    },
    {
      "id": "scene3",
      "text": "Record a short sample. Clone it. Create professional voiceovers in minutes.",
      "duration": 6,
      "delay": 0.3
    }
  ]
}

node .claude/skills/elevenlabs-remotion-skill/generate.js \
  --scenes remotion/product-demo-scenes.json \
  --output-dir public/audio/product-demo/

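This creates:

product-demo-scene1.mp3 through sceneN.mp3
product-demo-combined.mp3 (all scenes stitched)
product-demo-info.json (metadata with durations)

Single Scene Regeneration

If a scene starts too early, has wrong timing, or needs different text:

# Regenerate scene2 with new text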
node .claude/skills/elevenlabs-remotion-skill/generate.js \
  --scenes remotion/scenes.json \
  --scene scene2 \
  --new-text "Updated scene 2 text" \
  --output-dir public/audio/project/

# Regenerate scene3 with a different character
node .claude/skills/elevenlabs-remotion-skill/generate.js \
  --scenes remotion/scenes.json \
  --scene scene3 \
  --character salesperson \
  --output-dir public/audio/project/

# Just regenerate (same text, same character)
node .claude/skills/elevenlabs-remotion-skill/generate.js \
  --scenes remotion/scenes.json \
  --scene scene1 \
  --output-dir public/audio/project/

# Embed a thumbnail into an MP4 video
node .claude/skills/elevenlabs-remotion-skill/generate.js \
  --embed-thumbnail public/videos/my-video.mp4 \
  --thumbnail public/videos/my-thumbnail.png \
  --output public/videos/my-video-with-thumb.mp4

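The tool automatically:

Uses request stitching from previous scenes for consistent prosody
Updates the info.json file with new metadata
Updates scenes.json if --new-text is provided

Thumbnail Embedding

Embed a thumbnail image into MP4 videos so platforms like Twitter, YouTube, and video players display your custom thumbnail instead of the first frame.

Embed Thumbnail into Video

# Basic usage - outputs to video-thumb.mp4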
node .claude/skills/elevenlabs-remotion-skill/generate.js \
  --embed-thumbnail public/videos/promo.mp4 \
  --thumbnail public/videos/thumbnail.png

# Custom output path
node .claude/skills/elevenlabs-remotion-skill/generate.js \
  --embed-thumbnail public/videos/promo.mp4 \
  --thumbnail public/videos/thumbnail.png \
  --output public/videos/promo-final.mp4

Workflow with Remotion

# 1. Render your video
npx remotion render MyVideo public/videos/my-video.mp4

# 2. Render your thumbnail (use a Still composition)
npx remotion still MyVideoThumbnail public/videos/my-thumbnail.png

# 3. Embed the thumbnail
node .claude/skills/elevenlabs-remotion-skill/generate.js \
  --embed-thumbnail public/videos/my-video.mp4 \
  --thumbnail public/videos/my-thumbnail.png \
  --output public/videos/my-video-final.mp4

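Supported Formats

Video: MP4 (H.264/H.265)
Thumbnail: PNG, JPG, JPEG

The embedding uses ffmpeg's -disposition:v:1 attached_pic flag to set the thumbnail as an attached picture, which most video players and platforms recognize.

Timing Validation

The skill automatically validates timing after generation using ffprobe:

What It Checks

Check	Threshold	Description
Duration mismatch	>15%	Warns if actual duration differs from expected duration
Leading silence	>200ms	Audio starts late (voiceover delayed)
Trailing silence	>500ms	Unnecessary silence at end
Speaking rate	2-4.5 wps	Optimal ~3 words/second

Validate Existing Audio

# Validate all scenes in a project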
node .claude/skills/elevenlabs-remotion-skill/generate.js --validate public/audio/product-demo/

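Output example:

🔍 Validating product-demo (6 scenes)

❌ scene1: 3.00s (expected: 4.5s)
   ❌ Audio 1.50s shorter than expected
   👍 8 words @ 3.1 words/sec
⚠️ scene2: 6.35s (expected: 5.5s)
   ⚠️ Leading silence: 235ms (may start late)
   🐢 10 words @ 1.8 words/sec
✅ scene4: 4.36s (expected: 4s)
   👍 9 words @ 2.3 words/sec

📊 Total duration: 30.80s (expected: 30.00s)

Updated info.json

After validation, the info.json includes actual measurements: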

{
  "scenes": [
    {
      "id": "scene1",
      "duration": 4.5,
      "actualDuration": 3.0,
      "leadingSilence": 0.05,
      "wordsPerSecond": 3.1
    }
  ]
}

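Use actualDuration in your Remotion composition for precise sync.

Options

Option	Description	Default
`--text`, `-t`	Text to convert to speech	Required (or --file/--scenes)
`--file`, `-f`	Read text from file	-
`--output`, `-o`	Output file path	`output.mp3`
`--output-dir`	Output directory for scenes	`public/audio`
`--voice`, `-v`	Voice name or ID	`George`
`--model`, `-m`	Model ID	`eleven_multilingual_v2`
`--character`, `-c`	Character preset	`literal`
`--scenes`	JSON file containing scenes	-
`--scene`	Scene ID to regenerate individually	-
`--new-text`	New text for scene regeneration	-
`--validate`	Validate an existing audio directory	-
`--skip-validation`	Skip automatic validation	false
`--embed-thumbnail`	Video file to embed the thumbnail into	-
`--thumbnail`	Thumbnail image file (PNG/JPG)	-
`--stability`	Voice stability (0-1)	Varies by character
`--similarity`	Voice similarity (0-1)	Varies by character
`--style`	Style exaggeration (0-1)	Varies by character
`--no-combined`	Skip the combined file	false

Recommended Voices

Voice	Style	Best For
`George`	Warm, captivating British	Narration, explainers
`Antoni`	Professional, warm	Legal content, tutorials
`Arnold`	Authoritative, deep	Corporate, serious topics
`Josh`	Friendly, conversational	Marketing, casual content

Integration with Remotion

After generating scene voiceovers, use them in your composition:

import React from "react";
import { Audio, Sequence, staticFile, useVideoConfig } from "remotion";

// Use individual scene audio files for precise sync
const SCENE_DURATIONS = {
  scene1: 4.5,  // From info.json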
  scene2: 5.5,
  scene3: 8.0,
};

export const VideoWithVoiceover: React.FC = () => {
  const { fps } = useVideoConfig();

  const scene1Frames = Math.round(SCENE_DURATIONS.scene1 * fps);
  const scene2Frames = Math.round(SCENE_DURATIONS.scene2 * fps);

  return (
    <>
      <Sequence from={0} durationInFrames={scene1Frames}>
        <Audio src={staticFile("audio/project/project-scene1.mp3")} />
        <Scene1Visual />
      </Sequence>

      <Sequence from={scene1Frames} durationInFrames={scene2Frames}>
        <Audio src={staticFile("audio/project/project-scene2.mp3")} />
        <Scene2Visual />
      </Sequence>
    </>
  );
};

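Tips for Best Results

Use character presets: Don't read screen text literally - use narrator or expert for natural flow
Punctuation matters: Use periods for pauses, commas for brief breaks
Numbers: Write out numbers ("five hundred" not "500") for natural speech
Abbreviations: Write full words ("twenty-four hours" not "24h")
Scene-by-scene: Different scenes can have different characters (dramatic intro, calm CTA)
Fine-tune: Use --scene to regenerate individual scenes without redoing everything
Request stitching: Keeps voice consistent across all scenes

Workflow Example

# 1. Create scenes.json with your script
# 2. Generate all scenes with narrator style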
node .claude/skills/elevenlabs-remotion-skill/generate.js \
  --scenes remotion/my-video-scenes.json \
  --character narrator \
  --output-dir public/audio/my-video/

# 3. Preview in Remotion, notice scene2 starts too early
# 4. Regenerate just scene2 with updated text
node .claude/skills/elevenlabs-remotion-skill/generate.js \
  --scenes remotion/my-video-scenes.json \
  --scene scene2 \
  --new-text "Slightly longer text to fill the visual timing" \
  --output-dir public/audio/my-video/

# 5. Update video composition with new duration from info.json
# 6. Repeat until timing is perfect

🇺🇸English

ElevenLabs Voiceover Generation

Generate professional AI voiceovers for Remotion videos using ElevenLabs API.

Prerequisites

Set ELEVENLABS_API_KEY in the .env.local file.

Quick Start

# Generate voiceover from text
node .claude/skills/elevenlabs-remotion-skill/generate.js --text "Your text here" --output public/audio/voiceover.mp3

# Generate with narrator style (more natural)
node .claude/skills/elevenlabs-remotion-skill/generate.js --text "Your text" --character narrator --output voiceover.mp3

# Generate scenes with request stitching
node .claude/skills/elevenlabs-remotion-skill/generate.js --scenes remotion/scenes.json --output-dir public/audio/project/

# Regenerate a single scene
node .claude/skills/elevenlabs-remotion-skill/generate.js --scenes scenes.json --scene scene2 --new-text "Updated text"

# List available voices and character presets
node .claude/skills/elevenlabs-remotion-skill/generate.js --list-voices
node .claude/skills/elevenlabs-remotion-skill/generate.js --list-characters

Character Presets

Use character presets for more natural voiceovers instead of literal screen text reading:

Character	Description	Best For
`literal`	Reads text exactly as written	Screen text, quotes
`narrator`	Professional storyteller, smooth, engaging	Explainers, documentaries
`salesperson`	Enthusiastic, persuasive, energetic	Marketing, ads
`expert`	Authoritative, confident, knowledgeable	Legal content, tutorials
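`conversational`	Casual, friendly, natural	Social media, casual content
`dramatic`	Intense, emotional, impactful	Opening hooks, problem statements
`calm`	Soothing, reassuring, gentle	Trust-building, conclusions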

# Use narrator style globally
node .claude/skills/elevenlabs-remotion-skill/generate.js --scenes scenes.json --character narrator --output-dir public/audio/

# Or set per-scene in scenes.json
{
  "scenes": [
    { "id": "scene1", "text": "Problem statement", "character": "dramatic" },
    { "id": "scene2", "text": "Solution", "character": "calm" }
  ]
}
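
The per-scene override can be pictured as a simple fallback chain. The sketch below is only an illustration: the `Scene` and `ScenesFile` types and the `resolveCharacters` helper are hypothetical, and the precedence (scene-level value, then file-level value, then the `literal` default) is an assumption. How the `--character` flag interacts with these values is not specified here.

```typescript
// Hypothetical types mirroring the scenes.json fields used in this document.
interface Scene {
  id: string;
  text: string;
  duration?: number;
  character?: string;
}

interface ScenesFile {
  name?: string;
  voice?: string;
  character?: string;
  scenes: Scene[];
}

// Assumed precedence: scene-level character, then file-level character,
// then "literal" as the fallback.
function resolveCharacters(file: ScenesFile): Record<string, string> {
  const result: Record<string, string> = {};
  for (const scene of file.scenes) {
    result[scene.id] = scene.character ?? file.character ?? "literal";
  }
  return result;
}

const resolved = resolveCharacters({
  character: "narrator",
  scenes: [
    { id: "scene1", text: "Problem statement", character: "dramatic" },
    { id: "scene2", text: "Solution" },
  ],
});
console.log(resolved);
```

Here scene1 keeps its own `dramatic` preset while scene2 falls back to the file-level `narrator`.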

Scene-Based Generation with Request Stitching

Generate multiple scenes with consistent prosody using ElevenLabs request stitching:

scenes.json Format

{
  "name": "product-demo",
  "voice": "George",
  "character": "narrator",
  "scenes": [
    {
      "id": "scene1",
      "text": "Generic text-to-speech sounds robotic. Your brand deserves better.",
      "duration": 4.5,
      "character": "dramatic"
    },
    {
      "id": "scene2",
      "text": "With voice cloning, you can use your own voice for unlimited content.",
      "duration": 5.5
    },
    {
      "id": "scene3",
      "text": "Record a short sample. Clone it. Create professional voiceovers in minutes.",
      "duration": 6,
      "delay": 0.3
    }
  ]
}

Generate All Scenes

node .claude/skills/elevenlabs-remotion-skill/generate.js \
  --scenes remotion/product-demo-scenes.json \
  --output-dir public/audio/product-demo/

This creates:

product-demo-scene1.mp3 through sceneN.mp3
product-demo-combined.mp3 (all scenes stitched)
product-demo-info.json (metadata with durations)
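
For readers curious how request stitching might look at the API level, here is a minimal sketch. It assumes the ElevenLabs text-to-speech endpoint accepts a `previous_request_ids` array (up to three IDs, each taken from an earlier response) in the JSON body. The `buildStitchedBody` helper and the `req-N` IDs are hypothetical, and this is not necessarily how generate.js builds its requests.

```typescript
// Subset of the JSON body sent to the text-to-speech endpoint (assumed shape).
interface TtsRequestBody {
  text: string;
  model_id: string;
  previous_request_ids?: string[];
}

// Builds the body for one scene, passing along at most the three most
// recent request IDs from earlier scenes so prosody can carry over.
function buildStitchedBody(
  text: string,
  earlierRequestIds: string[],
  modelId: string = "eleven_multilingual_v2",
): TtsRequestBody {
  const body: TtsRequestBody = { text, model_id: modelId };
  const recent = earlierRequestIds.slice(-3);
  if (recent.length > 0) {
    body.previous_request_ids = recent;
  }
  return body;
}

// The first scene has no history; a later scene references earlier requests.
const firstBody = buildStitchedBody("Scene one text.", []);
const fifthBody = buildStitchedBody("Scene five text.", [
  "req-1",
  "req-2",
  "req-3",
  "req-4",
]);
console.log(JSON.stringify(firstBody), JSON.stringify(fifthBody));
```

Keeping the body construction in a pure function makes it easy to check the history window without calling the API.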

Single Scene Regeneration

If a scene starts too early, has wrong timing, or needs different text:

# Regenerate scene2 with new text
node .claude/skills/elevenlabs-remotion-skill/generate.js \
  --scenes remotion/scenes.json \
  --scene scene2 \
  --new-text "Updated scene 2 text" \
  --output-dir public/audio/project/

# Regenerate scene3 with different character
node .claude/skills/elevenlabs-remotion-skill/generate.js \
  --scenes remotion/scenes.json \
  --scene scene3 \
  --character salesperson \
  --output-dir public/audio/project/

# Just regenerate (same text, same character)
node .claude/skills/elevenlabs-remotion-skill/generate.js \
  --scenes remotion/scenes.json \
  --scene scene1 \
  --output-dir public/audio/project/

# Embed a thumbnail into an MP4 video
node .claude/skills/elevenlabs-remotion-skill/generate.js \
  --embed-thumbnail public/videos/my-video.mp4 \
  --thumbnail public/videos/my-thumbnail.png \
  --output public/videos/my-video-with-thumb.mp4

The tool automatically:

Uses request stitching from previous scenes for consistent prosody
Updates the info.json file with new metadata
Updates scenes.json if --new-text is provided

Thumbnail Embedding

Embed a thumbnail image into MP4 videos so platforms like Twitter, YouTube, and video players display your custom thumbnail instead of the first frame.

Embed Thumbnail into Video

# Basic usage - outputs to video-thumb.mp4
node .claude/skills/elevenlabs-remotion-skill/generate.js \
  --embed-thumbnail public/videos/promo.mp4 \
  --thumbnail public/videos/thumbnail.png

# Custom output path
node .claude/skills/elevenlabs-remotion-skill/generate.js \
  --embed-thumbnail public/videos/promo.mp4 \
  --thumbnail public/videos/thumbnail.png \
  --output public/videos/promo-final.mp4

Workflow with Remotion

# 1. Render your video
npx remotion render MyVideo public/videos/my-video.mp4

# 2. Render your thumbnail (use Still composition)
npx remotion still MyVideoThumbnail public/videos/my-thumbnail.png

# 3. Embed the thumbnail
node .claude/skills/elevenlabs-remotion-skill/generate.js \
  --embed-thumbnail public/videos/my-video.mp4 \
  --thumbnail public/videos/my-thumbnail.png \
  --output public/videos/my-video-final.mp4

Supported Formats

Video: MP4 (H.264/H.265)
Thumbnail: PNG, JPG, JPEG

The embedding uses ffmpeg's -disposition:v:1 attached_pic flag to set the thumbnail as an attached picture, which most video players and platforms recognize.
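
As a rough illustration of that flag, the sketch below assembles an ffmpeg argument list that maps the video and the image into one output and marks the image stream as an attached picture. The `buildEmbedThumbnailArgs` helper is hypothetical, and the use of `-c copy` (no re-encoding) is an assumption; the exact command the skill runs may differ.

```typescript
// Builds ffmpeg arguments that add an image as cover art to an MP4.
// Assumes the input video has exactly one video stream, so the image
// becomes the second video stream (v:1) in the output.
function buildEmbedThumbnailArgs(
  video: string,
  thumbnail: string,
  output: string,
): string[] {
  return [
    "-i", video,        // input 0: the rendered video
    "-i", thumbnail,    // input 1: the thumbnail image
    "-map", "0",        // keep all streams from the video
    "-map", "1",        // add the image stream
    "-c", "copy",       // copy streams without re-encoding (assumption)
    "-disposition:v:1", "attached_pic",
    output,
  ];
}

const embedArgs = buildEmbedThumbnailArgs(
  "public/videos/promo.mp4",
  "public/videos/thumbnail.png",
  "public/videos/promo-final.mp4",
);
console.log(["ffmpeg", ...embedArgs].join(" "));
```

The resulting array can be handed to a process spawner as-is, which avoids shell quoting problems with paths that contain spaces.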

Timing Validation

The skill automatically validates timing after generation using ffprobe:

What It Checks

Check	Threshold	Description
Duration mismatch	>15%	Warns if actual differs from expected duration
Leading silence	>200ms	Audio starts late (voiceover delayed)
Trailing silence	>500ms	Unnecessary silence at end
Speaking rate	2-4.5 wps	Optimal ~3 words/second
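
The thresholds in the table can be expressed as a small check function. This is a sketch under stated assumptions: the `TimingInput` shape and `checkTiming` helper are hypothetical, and speaking rate is computed here as words divided by total duration, which may not match how the skill measures it (for example, it might exclude silence).

```typescript
// Hypothetical measurements for one scene, all durations in seconds.
interface TimingInput {
  expectedDuration: number;
  actualDuration: number;
  leadingSilence: number;
  trailingSilence: number;
  wordCount: number;
}

// Returns the names of the checks that fail, using the table's thresholds.
function checkTiming(t: TimingInput): string[] {
  const warnings: string[] = [];
  const mismatch =
    Math.abs(t.actualDuration - t.expectedDuration) / t.expectedDuration;
  if (mismatch > 0.15) warnings.push("duration-mismatch");
  if (t.leadingSilence > 0.2) warnings.push("leading-silence");
  if (t.trailingSilence > 0.5) warnings.push("trailing-silence");
  const wordsPerSecond = t.wordCount / t.actualDuration;
  if (wordsPerSecond < 2 || wordsPerSecond > 4.5) {
    warnings.push("speaking-rate");
  }
  return warnings;
}

const shortScene = checkTiming({
  expectedDuration: 4.5,
  actualDuration: 3.0,
  leadingSilence: 0.05,
  trailingSilence: 0,
  wordCount: 8,
});
console.log(shortScene);
```

In this example only the duration check fails, because 3.0s is about 33% shorter than the expected 4.5s.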

Validate Existing Audio

# Validate all scenes in a project
node .claude/skills/elevenlabs-remotion-skill/generate.js --validate public/audio/product-demo/

Output example:

🔍 Validating product-demo (6 scenes)

❌ scene1: 3.00s (expected: 4.5s)
   ❌ Audio 1.50s shorter than expected
   👍 8 words @ 3.1 words/sec
⚠️ scene2: 6.35s (expected: 5.5s)
   ⚠️ Leading silence: 235ms (may start late)
   🐢 10 words @ 1.8 words/sec
✅ scene4: 4.36s (expected: 4s)
   👍 9 words @ 2.3 words/sec

📊 Total duration: 30.80s (expected: 30.00s)

Updated info.json

After validation, the info.json includes actual measurements:

{
  "scenes": [
    {
      "id": "scene1",
      "duration": 4.5,
      "actualDuration": 3.0,
      "leadingSilence": 0.05,
      "wordsPerSecond": 3.1
    }
  ]
}

Use actualDuration in your Remotion composition for precise sync.
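
One way to apply this is to turn the info.json entries into frame offsets before building the composition. The sketch below assumes the fields shown in the example above (`duration`, `actualDuration`); the `SceneInfo` and `SceneTiming` types and the `toSceneTimings` helper are hypothetical names.

```typescript
// Fields taken from the info.json example; actualDuration may be absent
// if validation has not run.
interface SceneInfo {
  id: string;
  duration: number;
  actualDuration?: number;
}

interface SceneTiming {
  id: string;
  from: number;
  durationInFrames: number;
}

// Converts scene durations (seconds) into consecutive frame ranges,
// preferring the measured duration over the planned one.
function toSceneTimings(scenes: SceneInfo[], fps: number): SceneTiming[] {
  let from = 0;
  return scenes.map((scene) => {
    const seconds = scene.actualDuration ?? scene.duration;
    const durationInFrames = Math.max(1, Math.round(seconds * fps));
    const timing: SceneTiming = { id: scene.id, from, durationInFrames };
    from += durationInFrames;
    return timing;
  });
}

const timings = toSceneTimings(
  [
    { id: "scene1", duration: 4.5, actualDuration: 3.0 },
    { id: "scene2", duration: 5.5 },
  ],
  30,
);
console.log(timings);
```

Each entry maps directly onto the `from` and `durationInFrames` props of a Remotion `Sequence`.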

Options

Option	Description	Default
`--text`, `-t`	Text to convert to speech	Required (or --file/--scenes)
`--file`, `-f`	Read text from file	-
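`--output`, `-o`	Output file path	`output.mp3`
`--output-dir`	Output directory for scenes	`public/audio`
`--voice`, `-v`	Voice name or ID	`George`
`--model`, `-m`	Model ID	`eleven_multilingual_v2`
`--character`, `-c`	Character preset	`literal`
`--scenes`	JSON file containing scenes	-
`--scene`	Scene ID to regenerate individually	-
`--new-text`	New text for scene regeneration	-
`--validate`	Validate an existing audio directory	-
`--skip-validation`	Skip automatic validation	false
`--embed-thumbnail`	Video file to embed the thumbnail into	-
`--thumbnail`	Thumbnail image file (PNG/JPG)	-
`--stability`	Voice stability (0-1)	Varies by character
`--similarity`	Voice similarity (0-1)	Varies by character
`--style`	Style exaggeration (0-1)	Varies by character
`--no-combined`	Skip the combined file	false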

Recommended Voices

Voice	Style	Best For
`George`	Warm, captivating British	Narration, explainers
`Antoni`	Professional, warm	Legal content, tutorials
`Arnold`	Authoritative, deep	Corporate, serious topics
`Josh`	Friendly, conversational	Marketing, casual content

Integration with Remotion

After generating scene voiceovers, use them in your composition:

import React from "react";
import { Audio, Sequence, staticFile, useVideoConfig } from "remotion";

// Use individual scene audio files for precise sync
const SCENE_DURATIONS = {
  scene1: 4.5,  // From info.json
  scene2: 5.5,
  scene3: 8.0,
};

export const VideoWithVoiceover: React.FC = () => {
  const { fps } = useVideoConfig();

  const scene1Frames = Math.round(SCENE_DURATIONS.scene1 * fps);
  const scene2Frames = Math.round(SCENE_DURATIONS.scene2 * fps);

  return (
    <>
      <Sequence from={0} durationInFrames={scene1Frames}>
        <Audio src={staticFile("audio/project/project-scene1.mp3")} />
        <Scene1Visual />
      </Sequence>

      <Sequence from={scene1Frames} durationInFrames={scene2Frames}>
        <Audio src={staticFile("audio/project/project-scene2.mp3")} />
        <Scene2Visual />
      </Sequence>
    </>
  );
};

Tips for Best Results

Use character presets: Don't read screen text literally - use narrator or expert for natural flow
Punctuation matters: Use periods for pauses, commas for brief breaks
Numbers: Write out numbers ("five hundred" not "500") for natural speech
Abbreviations: Write full words ("twenty-four hours" not "24h")
Scene-by-scene: Different scenes can have different characters (dramatic intro, calm CTA)
Fine-tune: Use --scene to regenerate individual scenes without redoing everything
Request stitching: Keeps voice consistent across all scenes

Workflow Example

# 1. Create scenes.json with your script
# 2. Generate all scenes with narrator style
node .claude/skills/elevenlabs-remotion-skill/generate.js \
  --scenes remotion/my-video-scenes.json \
  --character narrator \
  --output-dir public/audio/my-video/

# 3. Preview in Remotion, notice scene2 starts too early
# 4. Regenerate just scene2 with updated text
node .claude/skills/elevenlabs-remotion-skill/generate.js \
  --scenes remotion/my-video-scenes.json \
  --scene scene2 \
  --new-text "Slightly longer text to fill the visual timing" \
  --output-dir public/audio/my-video/

# 5. Update video composition with new duration from info.json
# 6. Repeat until timing is perfect

First Seen

Jan 23, 2026

Security Audits

Gen Agent Trust Hub: Pass
Socket: Pass
Snyk: Warn

Installed on

opencode: 250
claude-code: 234
gemini-cli: 226
codex: 226
cursor: 204
github-copilot: 199

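Character	Description	Best For
`literal`	Reads text exactly as written	Screen text, quotes
`narrator`	Professional storyteller, smooth, engaging	Explainers, documentaries
`salesperson`	Enthusiastic, persuasive, energetic	Marketing, ads
`expert`	Authoritative, confident, knowledgeable	Legal content, tutorials
`conversational`	Casual, friendly, natural	Social media, casual content
`dramatic`	Intense, emotional, impactful	Opening hooks, problem statements
`calm`	Soothing, reassuring, gentle	Trust-building, conclusions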