Stata Data Cleaning Guide: Essential Data Preparation and Cleaning Skills for Economists

stata-data-cleaning by meleantonio/awesome-econ-ai-stuff

109 weekly installs

307 GitHub Stars

Install command

npx skills add https://github.com/meleantonio/awesome-econ-ai-stuff --skill stata-data-cleaning

Tags: data analysis, research tools, data processing


Stata Data Cleaning

Purpose

This skill helps economists clean, transform, and prepare datasets for analysis in Stata. It emphasizes reproducibility, proper documentation, and handling common data quality issues found in economic research.

When to Use

Cleaning raw survey or administrative data
Merging multiple data sources
Handling missing values, duplicates, and outliers
Creating analysis-ready panel datasets
Documenting data transformations for replication
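
As a rough sketch of the merging and panel use cases listed above, the block below builds two toy datasets, verifies the merge key on each side, merges them, and declares the panel structure. The variable names (hh_id, year, income, state) and all values are hypothetical and not part of the original skill.

```stata
* Hypothetical household-level attributes (one row per household)
clear
input hh_id str2 state
1 "NY"
2 "CA"
end
isid hh_id                      // the using file must be unique on the merge key
tempfile attrs
save `attrs'

* Hypothetical household-year panel
clear
input hh_id year income
1 2019 100
1 2020 110
2 2019 200
2 2020 .
end
isid hh_id year                 // unit of observation: household-year

* Many panel rows per household, one attribute row per household;
* assert(match) stops the do-file if any row fails to match
merge m:1 hh_id using `attrs', assert(match) nogenerate

* Declare the panel structure and summarize its pattern
xtset hh_id year
xtdescribe
```

Running isid on both sides before the merge makes a violated uniqueness assumption fail loudly instead of silently multiplying rows.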

Instructions

Step 1: Understand the Data

Before generating code, ask the user:

What is the data source? (survey, administrative, API, etc.)
What is the unit of observation?
What are the key variables needed for analysis?
Are there known data quality issues to address?
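
Some of these questions can be partly answered from the data itself. The following is a minimal sketch, assuming Stata's bundled auto dataset as a stand-in for the user's raw file; the variable make plays the role of the candidate identifier.

```stata
* Stand-in for the raw file: Stata's bundled example dataset
sysuse auto, clear

* Size and structure at a glance
describe, short

* Is the candidate identifier unique? (unit of observation: one row per car model)
duplicates report make
isid make

* A variable that does NOT identify observations makes isid fail
capture isid foreign
if _rc {
    display as text "foreign alone does not identify observations"
}

* Which variables contain missing values, and how many?
misstable summarize
```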

Step 2: Generate Cleaning Pipeline

Create a Stata do-file that:

Has a clear header with project info and date
Sets up the environment (clear all, set more off, open a log)
Loads and inspects raw data
Documents each transformation with comments
Creates a codebook for the final dataset

Step 3: Follow Best Practices

Use assert statements to verify data integrity
Create labeled variables with label variable
Use value labels for categorical variables
Generate a log file for reproducibility
Save intermediate files when appropriate

Example Output

/*==============================================================================
    Project:    Economic Analysis Data Cleaning
    Author:     [Your Name]
    Date:       [Date]
    Purpose:    Clean raw survey data for regression analysis
    Input:      raw_survey_data.dta
    Output:     cleaned_analysis_data.dta
==============================================================================*/

* ============================================
* 1. SETUP
* ============================================

clear all
set more off
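* Set working directory first so the relative paths below resolve correctly
cd "/path/to/project"

* Open a dated log (the logs/ folder must already exist)
cap log close
log using "logs/data_cleaning_`c(current_date)'.log", replace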

* Define globals for paths
global raw_data "data/raw"
global clean_data "data/clean"
global output "output"

* ============================================
* 2. LOAD AND INSPECT RAW DATA
* ============================================

use "${raw_data}/raw_survey_data.dta", clear

* Basic inspection
describe
summarize
codebook, compact

* Check for duplicates: report them, then flag and list duplicated IDs
duplicates report id_var
duplicates tag id_var, generate(dup_flag)
list id_var if dup_flag > 0
drop dup_flag

* ============================================
* 3. VARIABLE CLEANING
* ============================================

* --- Rename variables for clarity ---
rename q1 age
rename q2 income_reported
rename q3 education_level

* --- Clean numeric variables ---
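* Replace missing value codes with . (education_level is included so that
* negative codes are not binned as "Less than HS" further down)
mvdecode age income_reported education_level, mv(-99 -88 -77)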

* Cap outliers at 99th percentile
qui sum income_reported, detail
replace income_reported = r(p99) if income_reported > r(p99) & !mi(income_reported)

* --- Clean string variables ---
* Standardize state names
replace state = upper(trim(state))
replace state = "NEW YORK" if inlist(state, "NY", "N.Y.", "N Y")

* --- Create categorical variables ---
gen education_cat = .
replace education_cat = 1 if education_level < 12
replace education_cat = 2 if education_level == 12
replace education_cat = 3 if education_level > 12 & education_level <= 16
replace education_cat = 4 if education_level > 16 & !mi(education_level)

label define edu_lbl 1 "Less than HS" 2 "High School" 3 "College" 4 "Graduate"
label values education_cat edu_lbl

* ============================================
* 4. HANDLE MISSING DATA
* ============================================

* Create missing indicator variables
gen mi_income = mi(income_reported)

* Document missingness
tab mi_income

* Count non-missing key variables and flag complete cases
egen n_nonmiss = rownonmiss(age income_reported education_cat)
gen complete_case = (n_nonmiss == 3)
tab complete_case

* ============================================
* 5. CREATE DERIVED VARIABLES
* ============================================

* Age groups
gen age_group = .
replace age_group = 1 if age >= 18 & age < 30
replace age_group = 2 if age >= 30 & age < 50
replace age_group = 3 if age >= 50 & age < 65
replace age_group = 4 if age >= 65 & !mi(age)

label define age_lbl 1 "18-29" 2 "30-49" 3 "50-64" 4 "65+"
label values age_group age_lbl

* Log income
gen log_income = ln(income_reported + 1)

* ============================================
* 6. DATA VALIDATION
* ============================================

* Assert expected ranges
assert age >= 18 & age <= 120 if !mi(age)
assert income_reported >= 0 if !mi(income_reported)

* Check that the identifier is non-missing and unique
assert !mi(id_var)
isid id_var  // Verify unique identifier

* ============================================
* 7. LABEL VARIABLES
* ============================================

label variable age "Age in years"
label variable income_reported "Annual income (USD)"
label variable education_cat "Education category"
label variable log_income "Log of (annual income + 1)"
label variable mi_income "Missing income indicator"

* ============================================
* 8. FINAL CHECKS AND SAVE
* ============================================

* Keep relevant variables
keep id_var age age_group income_reported log_income ///
     education_cat mi_income state year

* Order variables logically
order id_var year state age age_group income_reported ///
      log_income education_cat mi_income

* Compress to minimize file size
compress

* Save cleaned data
save "${clean_data}/cleaned_analysis_data.dta", replace

* Create codebook
codebook, compact

* Close log
log close

* ============================================
* END OF FILE
* ============================================

Requirements

Software

Stata 15+ (some commands require newer versions)

Recommended User-Written Commands

ssc install unique     // For unique value checking
ssc install mdesc      // For missing data patterns
ssc install labutil    // For label manipulation
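
One possible way to make these installs safe to rerun is to install a package only when its command is not found. This is a sketch, not the skill author's setup: it assumes internet access to SSC, and it checks for labmask because labutil ships several commands rather than a command named labutil. The bundled auto dataset is used only for illustration.

```stata
* Install each package only if its command is missing
capture which unique
if _rc ssc install unique

capture which mdesc
if _rc ssc install mdesc

capture which labmask           // labmask is one of the commands in labutil
if _rc ssc install labutil

* Illustrative usage on a bundled dataset
sysuse auto, clear
unique make                     // number of distinct values of make
mdesc rep78 price               // missing counts and shares per variable
```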

Best Practices

Always start with clear all to ensure a clean environment
Use log files to document all transformations
Comment extensively - explain WHY, not just WHAT
Use assert statements to catch data errors early
Create a data dictionary alongside your cleaned data
Version your do-files and datasets
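
For the data dictionary, one option among several is to turn the output of describe into a dataset and export it. The sketch below assumes the bundled auto dataset in place of the cleaned file; the output file name data_dictionary.csv is hypothetical and is written to the current working directory.

```stata
* Stand-in for the cleaned dataset
sysuse auto, clear

preserve
    * Replace the data in memory with one row per variable
    describe, replace clear

    * Keep the columns most useful in a dictionary
    keep position name type vallab varlab

    * Hypothetical output file name
    export delimited using "data_dictionary.csv", replace
restore
```

Because the dictionary is generated from the dataset's own variable and value labels, it stays in sync with the cleaned file each time the do-file is rerun.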

Common Pitfalls

❌ Not checking for duplicates before merging
❌ Forgetting to handle missing value codes (-99, -88, etc.)
❌ Not labeling variables and values
❌ Overwriting raw data files
❌ Not documenting data transformations
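
To illustrate the missing-value-code pitfall, the sketch below recodes survey codes into extended missing values so the reason for missingness is preserved. The data are a toy example, and the meanings assigned to -99 and -88 are assumptions made for illustration; the real meanings should come from the survey's codebook.

```stata
* Toy data with survey missing-value codes
clear
input id income
1 52000
2 -99
3 -88
4 61000
end

* Map each code to its own extended missing value
mvdecode income, mv(-99=.a \ -88=.b)

* Label the extended missing values (assumed meanings)
label define income_mv .a "Refused" .b "Don't know"
label values income income_mv

* Tabulate including missing categories, then validate the remaining values
tab income, missing
assert income >= 0 if !mi(income)
```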

References

Changelog

v1.0.0

Initial release with comprehensive cleaning template

Repository: meleantonio/awe…ai-stuff
GitHub Stars: 291
First Seen: Jan 27, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: opencode (78), codex (74), gemini-cli (73), github-copilot (72), cursor (71), kimi-cli (66)
