elasticsearch-file-ingest by elastic/agent-skills
npx skills add https://github.com/elastic/agent-skills --skill elasticsearch-file-ingest
基于流式处理的大型数据文件(NDJSON、CSV、Parquet、Arrow IPC)导入与转换到 Elasticsearch。
--file 支持通配符,可一次匹配多个文件(例如 logs/*.json)。此技能是自包含的:scripts/ 文件夹和 package.json 位于此技能目录中,请从此目录运行所有命令;引用位于其他位置的数据文件时,请使用绝对路径。
首次使用前,请安装依赖项:
广告位招租
在这里展示您的产品或服务
触达数万 AI 开发者,精准高效
npm install
Elasticsearch 连接通过环境变量配置。当提供 CLI 标志 --node、--api-key、--username 和 --password 时,它们会覆盖环境变量。
export ELASTICSEARCH_CLOUD_ID="deployment-name:base64encodedcloudid"
export ELASTICSEARCH_API_KEY="base64encodedapikey"
export ELASTICSEARCH_URL="https://elasticsearch:9200"
export ELASTICSEARCH_API_KEY="base64encodedapikey"
export ELASTICSEARCH_URL="https://elasticsearch:9200"
export ELASTICSEARCH_USERNAME="elastic"
export ELASTICSEARCH_PASSWORD="changeme"
对于本地开发和测试,使用 start-local 通过 Docker 或 Podman 快速启动 Elasticsearch 和 Kibana:
curl -fsSL https://elastic.co/start-local | sh
安装完成后,加载生成的 .env 文件:
source elastic-start-local/.env
export ELASTICSEARCH_URL="$ES_LOCAL_URL"
export ELASTICSEARCH_API_KEY="$ES_LOCAL_API_KEY"
export ELASTICSEARCH_INSECURE="true"
node scripts/ingest.js --file /absolute/path/to/data.json --target my-index
# NDJSON
cat /absolute/path/to/data.ndjson | node scripts/ingest.js --stdin --target my-index
# CSV
cat /absolute/path/to/data.csv | node scripts/ingest.js --stdin --source-format csv --target my-index
node scripts/ingest.js --file /absolute/path/to/users.csv --source-format csv --target users
node scripts/ingest.js --file /absolute/path/to/users.parquet --source-format parquet --target users
node scripts/ingest.js --file /absolute/path/to/users.arrow --source-format arrow --target users
# csv-options.json
# {
# "columns": true,
# "delimiter": ";",
# "trim": true
# }
node scripts/ingest.js --file /absolute/path/to/users.csv --source-format csv --csv-options csv-options.json --target users
使用 --infer-mappings 时,不要与 --source-format csv 结合使用。推断功能会将原始样本发送到 Elasticsearch 的 _text_structure/find_structure 端点,该端点会返回映射和一个包含 CSV 处理器的导入管道。如果同时设置了 --source-format csv,CSV 将在客户端和服务器端进行解析,导致索引为空。让 --infer-mappings 处理所有事情:
node scripts/ingest.js --file /absolute/path/to/users.csv --infer-mappings --target users
# infer-options.json
# {
# "sampleBytes": 200000,
# "lines_to_sample": 2000
# }
node scripts/ingest.js --file /absolute/path/to/users.csv --infer-mappings --infer-mappings-options infer-options.json --target users
node scripts/ingest.js --file /absolute/path/to/data.json --target my-index --mappings mappings.json
node scripts/ingest.js --file /absolute/path/to/data.json --target my-index --transform transform.js
node scripts/ingest.js --source-index old-index --target new-index
node scripts/ingest.js --source-index logs \
--node https://es8.example.com:9200 --api-key es8-key \
--target new-logs \
--target-node https://es9.example.com:9200 --target-api-key es9-key
--target <index> # 目标索引名称
--file <path> # 源文件(支持通配符,例如 logs/*.json)
--source-index <name> # 源 Elasticsearch 索引
--stdin # 从 stdin 读取 NDJSON/CSV
--node <url> # ES 节点 URL(默认:http://localhost:9200)
--api-key <key> # API 密钥身份验证
--username <user> # 基本身份验证用户名
--password <pass> # 基本身份验证密码
--target-node <url> # 目标 ES 节点 URL(如果未指定则使用 --node)
--target-api-key <key> # 目标 API 密钥
--target-username <user> # 目标用户名
--target-password <pass> # 目标密码
--mappings <file.json> # 映射文件(重建索引时自动从源复制)
--infer-mappings # 从文件/流推断映射/管道(请勿与 --source-format 结合使用)
--infer-mappings-options <file> # 推断选项(JSON 文件)
--delete-index # 如果目标索引存在则删除
--pipeline <name> # 导入管道名称
--transform <file.js> # 转换函数(导出为 default 或 module.exports)
--query <file.json> # 查询文件,用于过滤源文档
--source-format <fmt> # 源格式:ndjson|csv|parquet|arrow(默认:ndjson)
--csv-options <file> # CSV 解析器选项(JSON 文件)
--skip-header # 跳过第一行(例如 CSV 表头)
--buffer-size <kb> # 缓冲区大小,单位 KB(默认:5120)
--search-size <n> # 重建索引时每次搜索的文档数(默认:100)
--total-docs <n> # 进度条的总文档数(文件/流)
--stall-warn-seconds <n> # 停滞警告阈值(默认:30)
--progress-mode <mode> # 进度输出:auto|line|newline(默认:auto)
--debug-events # 记录暂停/恢复/停滞事件
--quiet # 禁用进度条
转换函数允许您在导入过程中修改文档。创建一个导出转换函数的 JavaScript 文件:
// ES module export (the default style)
export default function transform(doc) {
  // Copy the source document, then attach the derived fields.
  const stamped = {
    ...doc,
    full_name: `${doc.first_name} ${doc.last_name}`,
  };
  stamped.timestamp = new Date().toISOString();
  return stamped;
}
// 或 CommonJS
module.exports = function transform(doc) {
return {
...doc,
full_name: `${doc.first_name} ${doc.last_name}`,
};
};
返回 null 或 undefined 以跳过文档:
export default function transform(doc) {
  // Drop documents that lack a plausible email address.
  const hasEmail = Boolean(doc.email) && doc.email.includes("@");
  return hasEmail ? doc : null;
}
返回一个数组,从一个源文档创建多个目标文档:
export default function transform(doc) {
  // Fan out: emit one document per hashtag found in the tweet text.
  const tags = doc.text.match(/#\w+/g) || [];
  const docs = [];
  for (const tag of tags) {
    docs.push({
      hashtag: tag,
      tweet_id: doc.id,
      created_at: doc.created_at,
    });
  }
  return docs;
}
重建索引时,映射会自动从源索引复制:
node scripts/ingest.js --source-index old-logs --target new-logs
{
"properties": {
"@timestamp": { "type": "date" },
"message": { "type": "text" },
"user": {
"properties": {
"name": { "type": "keyword" },
"email": { "type": "keyword" }
}
}
}
}
node scripts/ingest.js --file /absolute/path/to/data.json --target my-index --mappings mappings.json
在重建索引时使用查询文件过滤源文档:
{
"range": {
"@timestamp": {
"gte": "2024-01-01",
"lt": "2024-02-01"
}
}
}
node scripts/ingest.js \
--source-index logs \
--target filtered-logs \
--query filter.json
切勿在未经用户明确确认的情况下使用 --delete-index 标志或删除现有索引和数据。不要将 --infer-mappings 与 --source-format 结合使用:推断功能会创建一个服务器端导入管道来处理解析(例如 CSV 处理器),若同时使用 --source-format csv,客户端也会进行解析,导致双重解析和空索引。需要自动检测时,请单独使用 --infer-mappings;需要手动控制(已知字段类型的客户端 CSV 解析)时,请使用 --source-format csv 配合显式的 --mappings。以下情况请考虑替代方案:
每周安装次数
157
代码仓库
GitHub 星标数
89
首次出现
11 天前
安全审计
已安装于
cursor137
github-copilot127
opencode126
gemini-cli126
codex126
kimi-cli125
Stream-based ingestion and transformation of large data files (NDJSON, CSV, Parquet, Arrow IPC) into Elasticsearch.
The --file flag supports wildcards for matching multiple files (e.g., logs/*.json). This skill is self-contained: the scripts/ folder and package.json live in this skill's directory. Run all commands from this directory, and use absolute paths when referencing data files located elsewhere.
Before first use, install dependencies:
npm install
Elasticsearch connection is configured via environment variables. The CLI flags --node, --api-key, --username, and --password override environment variables when provided.
export ELASTICSEARCH_CLOUD_ID="deployment-name:base64encodedcloudid"
export ELASTICSEARCH_API_KEY="base64encodedapikey"
export ELASTICSEARCH_URL="https://elasticsearch:9200"
export ELASTICSEARCH_API_KEY="base64encodedapikey"
export ELASTICSEARCH_URL="https://elasticsearch:9200"
export ELASTICSEARCH_USERNAME="elastic"
export ELASTICSEARCH_PASSWORD="changeme"
For local development and testing, use start-local to quickly spin up Elasticsearch and Kibana using Docker or Podman:
curl -fsSL https://elastic.co/start-local | sh
After installation completes, source the generated .env file:
source elastic-start-local/.env
export ELASTICSEARCH_URL="$ES_LOCAL_URL"
export ELASTICSEARCH_API_KEY="$ES_LOCAL_API_KEY"
export ELASTICSEARCH_INSECURE="true"
node scripts/ingest.js --file /absolute/path/to/data.json --target my-index
# NDJSON
cat /absolute/path/to/data.ndjson | node scripts/ingest.js --stdin --target my-index
# CSV
cat /absolute/path/to/data.csv | node scripts/ingest.js --stdin --source-format csv --target my-index
node scripts/ingest.js --file /absolute/path/to/users.csv --source-format csv --target users
node scripts/ingest.js --file /absolute/path/to/users.parquet --source-format parquet --target users
node scripts/ingest.js --file /absolute/path/to/users.arrow --source-format arrow --target users
# csv-options.json
# {
# "columns": true,
# "delimiter": ";",
# "trim": true
# }
node scripts/ingest.js --file /absolute/path/to/users.csv --source-format csv --csv-options csv-options.json --target users
When using --infer-mappings, do not combine it with --source-format csv. Inference sends a raw sample to Elasticsearch's _text_structure/find_structure endpoint, which returns both mappings and an ingest pipeline with a CSV processor. If --source-format csv is also set, CSV is parsed client-side and server-side, resulting in an empty index. Let --infer-mappings handle everything:
node scripts/ingest.js --file /absolute/path/to/users.csv --infer-mappings --target users
# infer-options.json
# {
# "sampleBytes": 200000,
# "lines_to_sample": 2000
# }
node scripts/ingest.js --file /absolute/path/to/users.csv --infer-mappings --infer-mappings-options infer-options.json --target users
node scripts/ingest.js --file /absolute/path/to/data.json --target my-index --mappings mappings.json
node scripts/ingest.js --file /absolute/path/to/data.json --target my-index --transform transform.js
node scripts/ingest.js --source-index old-index --target new-index
node scripts/ingest.js --source-index logs \
--node https://es8.example.com:9200 --api-key es8-key \
--target new-logs \
--target-node https://es9.example.com:9200 --target-api-key es9-key
--target <index> # Target index name
--file <path> # Source file (supports wildcards, e.g., logs/*.json)
--source-index <name> # Source Elasticsearch index
--stdin # Read NDJSON/CSV from stdin
--node <url> # ES node URL (default: http://localhost:9200)
--api-key <key> # API key authentication
--username <user> # Basic auth username
--password <pass> # Basic auth password
--target-node <url> # Target ES node URL (uses --node if not specified)
--target-api-key <key> # Target API key
--target-username <user> # Target username
--target-password <pass> # Target password
--mappings <file.json> # Mappings file (auto-copy from source if reindexing)
--infer-mappings # Infer mappings/pipeline from file/stream (do NOT combine with --source-format)
--infer-mappings-options <file> # Options for inference (JSON file)
--delete-index # Delete target index if exists
--pipeline <name> # Ingest pipeline name
--transform <file.js> # Transform function (export as default or module.exports)
--query <file.json> # Query file to filter source documents
--source-format <fmt> # Source format: ndjson|csv|parquet|arrow (default: ndjson)
--csv-options <file> # CSV parser options (JSON file)
--skip-header # Skip first line (e.g., CSV header)
--buffer-size <kb> # Buffer size in KB (default: 5120)
--search-size <n> # Docs per search when reindexing (default: 100)
--total-docs <n> # Total docs for progress bar (file/stream)
--stall-warn-seconds <n> # Stall warning threshold (default: 30)
--progress-mode <mode> # Progress output: auto|line|newline (default: auto)
--debug-events # Log pause/resume/stall events
--quiet # Disable progress bars
Transform functions let you modify documents during ingestion. Create a JavaScript file that exports a transform function:
// ES modules (default)
export default function transform(doc) {
  // Build the display name first, then merge it with an ingest timestamp.
  const fullName = `${doc.first_name} ${doc.last_name}`;
  return Object.assign({}, doc, {
    full_name: fullName,
    timestamp: new Date().toISOString(),
  });
}
// Or CommonJS
module.exports = function transform(doc) {
return {
...doc,
full_name: `${doc.first_name} ${doc.last_name}`,
};
};
Return null or undefined to skip a document:
export default function transform(doc) {
  // Only keep documents whose email field looks valid.
  if (doc.email && doc.email.includes("@")) {
    return doc;
  }
  return null;
}
Return an array to create multiple target documents from one source:
export default function transform(doc) {
  // Emit one document per hashtag extracted from the tweet text.
  const matches = doc.text.match(/#\w+/g);
  if (!matches) {
    return [];
  }
  return matches.map(function (tag) {
    return {
      hashtag: tag,
      tweet_id: doc.id,
      created_at: doc.created_at,
    };
  });
}
When reindexing, mappings are automatically copied from the source index:
node scripts/ingest.js --source-index old-logs --target new-logs
{
"properties": {
"@timestamp": { "type": "date" },
"message": { "type": "text" },
"user": {
"properties": {
"name": { "type": "keyword" },
"email": { "type": "keyword" }
}
}
}
}
node scripts/ingest.js --file /absolute/path/to/data.json --target my-index --mappings mappings.json
Filter source documents during reindexing with a query file:
{
"range": {
"@timestamp": {
"gte": "2024-01-01",
"lt": "2024-02-01"
}
}
}
node scripts/ingest.js \
--source-index logs \
--target filtered-logs \
--query filter.json
Never use the --delete-index flag (or otherwise delete existing indices and data) without explicit user confirmation. Do not combine --infer-mappings with --source-format: inference creates a server-side ingest pipeline that handles parsing (e.g., a CSV processor), and using --source-format csv parses client-side as well, causing double-parsing and an empty index. Use --infer-mappings alone when you want Elasticsearch to detect the format, infer field types, and create an ingest pipeline automatically; use --source-format csv with explicit --mappings when you want client-side CSV parsing with known field types. Consider alternatives for:
Weekly Installs
157
Repository
GitHub Stars
89
First Seen
11 days ago
Security Audits
Gen Agent Trust Hub: Pass · Socket: Pass · Snyk: Fail
Installed on
cursor137
github-copilot127
opencode126
gemini-cli126
codex126
kimi-cli125
Apify Actor 输出模式生成工具 - 自动化创建 dataset_schema.json 与 output_schema.json
1,000 周安装
SuperDesign - AI 驱动设计助手:自动分析代码库,生成设计草稿与设计系统
3,200 周安装
Flutter原生视图与网页嵌入指南:Android/iOS平台视图集成与Web应用嵌入
3,300 周安装
Python资源管理:上下文管理器自动释放数据库连接、文件句柄和网络套接字
3,300 周安装
Tushare自然语言财经数据工具:股票行情、财务分析、宏观数据一键查询与导出
3,200 周安装
Vercel React 最佳实践指南:45条性能优化规则与代码示例
3,200 周安装
Ralphmode权限配置文件 - 跨平台自动化工作流安全解决方案
3,200 周安装