stata-data-cleaning by meleantonio/awesome-econ-ai-stuff
npx skills add https://github.com/meleantonio/awesome-econ-ai-stuff --skill stata-data-cleaning
This skill helps economists clean, transform, and prepare datasets for analysis in Stata. It emphasizes reproducibility, proper documentation, and handling common data quality issues found in economic research.
Before generating code, ask the user:
Create a Stata do-file that:
- Includes assert statements to verify data integrity
- Uses label variable to create labeled variables

/*==============================================================================
Project: Economic Analysis Data Cleaning
Author: [Your Name]
Date: [Date]
Purpose: Clean raw survey data for regression analysis
Input: raw_survey_data.dta
Output: cleaned_analysis_data.dta
==============================================================================*/
* ============================================
* 1. SETUP
* ============================================
clear all
set more off
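* Set working directory first so the relative log path resolves
cd "/path/to/project"
* Open a dated log; c(current_date) contains spaces, so swap them for underscores
cap log close
local today = subinstr(trim("`c(current_date)'"), " ", "_", .)
log using "logs/data_cleaning_`today'.log", replace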
* Define globals for paths
global raw_data "data/raw"
global clean_data "data/clean"
global output "output"
* ============================================
* 2. LOAD AND INSPECT RAW DATA
* ============================================
use "${raw_data}/raw_survey_data.dta", clear
* Basic inspection
describe
summarize
codebook, compact
* Check for duplicates (tag them first so there is a variable to filter on)
duplicates report id_var
duplicates tag id_var, generate(dup_flag)
list id_var if dup_flag > 0
* ============================================
* 3. VARIABLE CLEANING
* ============================================
* --- Rename variables for clarity ---
rename q1 age
rename q2 income_reported
rename q3 education_level
* --- Clean numeric variables ---
* Replace missing value codes with .
mvdecode age income_reported, mv(-99 -88 -77)
* Cap outliers at 99th percentile
qui sum income_reported, detail
replace income_reported = r(p99) if income_reported > r(p99) & !mi(income_reported)
* --- Clean string variables ---
* Standardize state names
replace state = upper(trim(state))
replace state = "NEW YORK" if inlist(state, "NY", "N.Y.", "N Y")
* --- Create categorical variables ---
gen education_cat = .
replace education_cat = 1 if education_level < 12
replace education_cat = 2 if education_level == 12
replace education_cat = 3 if education_level > 12 & education_level <= 16
replace education_cat = 4 if education_level > 16 & !mi(education_level)
label define edu_lbl 1 "Less than HS" 2 "High School" 3 "College" 4 "Graduate"
label values education_cat edu_lbl
* ============================================
* 4. HANDLE MISSING DATA
* ============================================
* Create missing indicator variables
gen mi_income = mi(income_reported)
* Document missingness
tab mi_income
* Flag complete cases (all three analysis variables non-missing)
egen n_nonmiss = rownonmiss(age income_reported education_cat)
gen complete_case = (n_nonmiss == 3)
tab complete_case
* ============================================
* 5. CREATE DERIVED VARIABLES
* ============================================
* Age groups
gen age_group = .
replace age_group = 1 if age >= 18 & age < 30
replace age_group = 2 if age >= 30 & age < 50
replace age_group = 3 if age >= 50 & age < 65
replace age_group = 4 if age >= 65 & !mi(age)
label define age_lbl 1 "18-29" 2 "30-49" 3 "50-64" 4 "65+"
label values age_group age_lbl
* Log income
gen log_income = ln(income_reported + 1)
* ============================================
* 6. DATA VALIDATION
* ============================================
* Assert expected ranges
assert age >= 18 & age <= 120 if !mi(age)
assert income_reported >= 0 if !mi(income_reported)
* Check that the identifier is non-missing and unique
assert !mi(id_var)
isid id_var // Verify unique identifier
* ============================================
* 7. LABEL VARIABLES
* ============================================
label variable age "Age in years"
label variable age_group "Age group"
label variable income_reported "Annual income (USD)"
label variable education_cat "Education category"
label variable log_income "Log of (annual income + 1)"
label variable mi_income "Missing income indicator"
* ============================================
* 8. FINAL CHECKS AND SAVE
* ============================================
* Keep relevant variables
keep id_var age age_group income_reported log_income ///
education_cat mi_income state year
* Order variables logically
order id_var year state age age_group income_reported ///
log_income education_cat mi_income
* Compress to minimize file size
compress
* Save cleaned data
save "${clean_data}/cleaned_analysis_data.dta", replace
* Create codebook
codebook, compact
* Close log
log close
* ============================================
* END OF FILE
* ============================================
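
The duplicate check and the assertions in the template can be tried in isolation. The sketch below is illustrative only: it assumes Stata's bundled auto dataset (loaded with sysuse auto) and treats make as a stand-in identifier. The variable name dup_flag is hypothetical.

```stata
* Sketch only: uses Stata's bundled auto.dta, not the survey data above
sysuse auto, clear

* Tag duplicate values of the stand-in identifier (make)
duplicates tag make, generate(dup_flag)
assert dup_flag == 0
isid make

* capture assert lets the do-file report a failure before stopping
capture assert price > 0 & !mi(price)
if _rc != 0 {
    display as error "price has non-positive or missing values"
    exit 9
}
```

A bare assert stops the do-file with a generic "assertion is false" error. Wrapping it in capture and checking _rc allows a more descriptive message before exiting.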
ssc install unique // For unique value checking
ssc install mdesc // For missing data patterns
ssc install labutil // For label manipulation
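
Running ssc install on every execution is unnecessary once the packages are present. The following is a minimal sketch, assuming internet access to SSC and that each package ships a command with the same name as the package. That holds for unique and mdesc. labutil is left out of the loop on the assumption that it installs several commands rather than one named labutil.

```stata
* Install unique and mdesc only if they are not already on the adopath
foreach pkg in unique mdesc {
    capture which `pkg'
    if _rc != 0 {
        ssc install `pkg'
    }
}

* Illustration on Stata's bundled auto dataset (not the survey data)
sysuse auto, clear
mdesc           // missing-value counts and percentages by variable
unique make     // distinct values of make versus number of records
```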
- Start with clear all to ensure a clean environment
- Use assert statements to catch data errors early

Weekly Installs: 83
Repository: meleantonio/awesome-econ-ai-stuff
GitHub Stars: 291
First Seen: Jan 27, 2026
Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
Installed on: opencode (78), codex (74), gemini-cli (73), github-copilot (72), cursor (71), kimi-cli (66)