Credit Risk Data Analysis: An Automated Data Cleaning and Variable Screening Pipeline for Better Model Stability

datanalysis-credit-risk by github/awesome-copilot

5,600 weekly installs

26,900 GitHub Stars


Install command

npx skills add https://github.com/github/awesome-copilot --skill datanalysis-credit-risk

AI/Machine Learning, Data Analysis


Data Cleaning and Variable Screening

Quick Start

# Run the complete data cleaning pipeline
python ".github/skills/datanalysis-credit-risk/scripts/example.py"

Complete Process Description

The data cleaning pipeline consists of the following 11 steps, each executed independently without deleting the original data:

1. Get Data - Load and format the raw data
2. Organization Sample Analysis - Compute the sample count and bad-sample rate for each organization
3. Separate OOS Data - Split out-of-sample (OOS) samples from the modeling samples
4. Filter Abnormal Months - Remove months with too few bad samples or too few total samples
5. Calculate Missing Rate - Compute overall and organization-level missing rates for each feature
6. Drop High Missing Rate Features - Remove features whose overall missing rate exceeds the threshold
7. Drop Low IV Features - Remove features whose overall IV is too low or whose IV is too low in too many organizations
8. Drop High PSI Features - Remove features with unstable PSI
9. Null Importance Denoising - Remove noise features using the label permutation method
10. Drop High Correlation Features - Remove highly correlated features based on original gain
11. Export Report - Generate an Excel report with the details and statistics of every step

Core Functions

Function	Purpose	Module
`get_dataset()`	Load and format data	references.func
`org_analysis()`	Organization sample analysis	references.func
`missing_check()`	Calculate missing rate	references.func
`drop_abnormal_ym()`	Filter abnormal months	references.analysis
`drop_highmiss_features()`	Drop high missing rate features

Parameter Description

Data Loading Parameters

DATA_PATH: Data file path (Parquet format recommended)
DATE_COL: Date column name
Y_COL: Label column name
ORG_COL: Organization column name
KEY_COLS: Primary key column name list

OOS Organization Configuration

OOS_ORGS: Out-of-sample organization list
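
As a rough illustration of how an `OOS_ORGS` list could drive the split, the sketch below separates rows by organization membership with pandas. The helper name `split_oos` and the toy column names are hypothetical; the skill's own implementation may differ.

```python
import pandas as pd

def split_oos(df, org_col, oos_orgs):
    """Return (modeling_df, oos_df); the input frame is left untouched."""
    is_oos = df[org_col].isin(oos_orgs)
    return df[~is_oos].copy(), df[is_oos].copy()

# Toy data: organization "C" is held out as out-of-sample.
df = pd.DataFrame({"org": ["A", "A", "B", "C"], "y": [0, 1, 0, 1]})
model_df, oos_df = split_oos(df, org_col="org", oos_orgs=["C"])
print(len(model_df), len(oos_df))
```

Returning copies rather than views keeps the original frame intact, in line with the pipeline's stated goal of never deleting the original data.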

Abnormal Month Filtering Parameters

min_ym_bad_sample: Minimum bad sample count per month (default 10)
min_ym_sample: Minimum total sample count per month (default 500)
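
The two thresholds above can be read as a per-month filter. The following is a minimal sketch of that idea, assuming the month is derived from the date column as a `YYYYMM` string and that the label is 1 for a bad sample; `drop_sparse_months` is a hypothetical name, not the skill's `drop_abnormal_ym()`.

```python
import pandas as pd

def drop_sparse_months(df, date_col, y_col, min_bad=10, min_total=500):
    """Drop months with fewer than min_bad bad samples or min_total rows."""
    ym = pd.to_datetime(df[date_col]).dt.strftime("%Y%m")
    # "count" counts rows with a non-null label; "sum" counts bad samples.
    stats = df.groupby(ym)[y_col].agg(total="count", bad="sum")
    abnormal = stats[(stats["bad"] < min_bad) | (stats["total"] < min_total)]
    kept = df[~ym.isin(abnormal.index)].copy()
    return kept, abnormal

# Toy data with small thresholds: February has 3 rows and 1 bad sample.
df = pd.DataFrame({
    "apply_date": ["2024-01-05"] * 6 + ["2024-02-10"] * 3,
    "y": [1, 1, 0, 0, 0, 0, 1, 0, 0],
})
kept, abnormal = drop_sparse_months(df, "apply_date", "y", min_bad=2, min_total=5)
print(abnormal)
```

Returning the table of removed months alongside the filtered data makes it straightforward to log what was dropped and why.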

Missing Rate Parameters

missing_ratio: Overall missing rate threshold (default 0.6)
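
A minimal sketch of the missing-rate logic, assuming missing values are represented as NaN and that a feature is dropped only when its overall missing rate strictly exceeds `missing_ratio`. The helper names here are hypothetical and are not the skill's `missing_check()` or `drop_highmiss_features()`.

```python
import pandas as pd

def missing_rates(df, features, org_col):
    """Overall and per-organization share of missing values per feature."""
    is_na = df[features].isna()
    return is_na.mean(), is_na.groupby(df[org_col]).mean()

def drop_high_missing(df, features, missing_ratio=0.6):
    """Return a copy without features whose overall missing rate is too high."""
    overall = df[features].isna().mean()
    to_drop = overall[overall > missing_ratio].index.tolist()
    return df.drop(columns=to_drop), to_drop

df = pd.DataFrame({
    "org": ["A", "A", "B", "B"],
    "f1": [1.0, None, None, None],   # 75% missing overall
    "f2": [1.0, 2.0, None, 4.0],     # 25% missing overall
})
overall, by_org = missing_rates(df, ["f1", "f2"], "org")
cleaned, to_drop = drop_high_missing(df, ["f1", "f2"])
print(to_drop)
```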

IV Parameters

overall_iv_threshold: Overall IV threshold (default 0.1)
org_iv_threshold: Per-organization IV threshold (default 0.1)
max_org_threshold: Maximum number of organizations allowed to have low IV (default 2)
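
To make the three IV parameters concrete, here is a sketch that computes information value with quantile bins and then applies one plausible reading of the rule: drop a feature if its overall IV is below `overall_iv_threshold`, or if more than `max_org_threshold` organizations have IV below `org_iv_threshold`. The binning scheme, the smoothing constant `eps`, the strict comparisons, and the function names are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

def information_value(x, y, bins=5, eps=1e-6):
    """IV of one feature against a 0/1 label, using quantile bins."""
    # Missing values get their own bin (code -1) instead of being dropped.
    codes = pd.qcut(x, q=bins, labels=False, duplicates="drop").fillna(-1)
    bad = y.groupby(codes).sum()
    good = y.groupby(codes).count() - bad
    # eps keeps empty cells from producing log(0) or division by zero.
    bad_dist = (bad + eps) / (bad.sum() + eps)
    good_dist = (good + eps) / (good.sum() + eps)
    return float(((good_dist - bad_dist) * np.log(good_dist / bad_dist)).sum())

def low_iv_features(df, features, y_col, org_col, overall_iv_threshold=0.1,
                    org_iv_threshold=0.1, max_org_threshold=2):
    """Features failing the overall rule or the per-organization rule."""
    dropped = []
    for f in features:
        overall = information_value(df[f], df[y_col])
        low_orgs = [org for org, g in df.groupby(org_col)
                    if information_value(g[f], g[y_col]) < org_iv_threshold]
        if overall < overall_iv_threshold or len(low_orgs) > max_org_threshold:
            dropped.append(f)
    return dropped

# Deterministic toy data: "signal" separates the classes, "noise" does not.
i = np.arange(400)
df = pd.DataFrame({
    "org": np.where((i // 2) % 2 == 0, "A", "B"),
    "y": i % 2,
    "signal": (i % 2) * 10 + (i // 2) * 0.01,
    "noise": i // 2,
})
dropped = low_iv_features(df, ["signal", "noise"], "y", "org")
print(dropped)
```

In the toy data every value of `noise` occurs once with each label, so its good and bad distributions are identical in every bin and its IV is zero.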

PSI Parameters

psi_threshold: PSI threshold (default 0.1)
max_months_ratio: Maximum unstable month ratio (default 1/3)
max_orgs: Maximum unstable organization count (default 6)
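
The PSI parameters can be illustrated with the standard population stability index, PSI = sum over bins of (a - e) * ln(a / e), where e and a are the bin shares in the baseline and comparison samples. The sketch below makes several assumptions not stated in the source: bins come from baseline quantiles, the first month serves as the baseline, and a month is unstable when its PSI exceeds `psi_threshold`. It covers a single organization only; counting unstable organizations against `max_orgs` would sit on top of it.

```python
import numpy as np

def psi(expected, actual, bins=10, eps=1e-6):
    """Population stability index of `actual` against the `expected` baseline."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Interior cut points from baseline quantiles; the outer bins are open-ended.
    edges = np.unique(np.quantile(expected, np.linspace(0, 1, bins + 1)))
    inner = edges[1:-1]
    n_bins = len(inner) + 1
    e = np.bincount(np.searchsorted(inner, expected, side="right"), minlength=n_bins)
    a = np.bincount(np.searchsorted(inner, actual, side="right"), minlength=n_bins)
    # Clip empty bins so the logarithm stays finite.
    e = np.clip(e / len(expected), eps, None)
    a = np.clip(a / len(actual), eps, None)
    return float(np.sum((a - e) * np.log(a / e)))

def unstable_month_ratio(monthly_values, psi_threshold=0.1):
    """Share of later months whose PSI against the first month is too high."""
    base, rest = monthly_values[0], monthly_values[1:]
    flags = [psi(base, m) > psi_threshold for m in rest]
    return sum(flags) / len(flags)

base = np.arange(1000)
shifted = base + 500          # half of the mass moves out of the lower bins
ratio = unstable_month_ratio([base, base, shifted])
print(ratio)
```

Under these assumptions, an organization would be flagged as unstable for the feature when `ratio` exceeds `max_months_ratio`.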

Null Importance Parameters

n_estimators: Number of trees (default 100)
max_depth: Maximum tree depth (default 5)
gain_threshold: Gain difference threshold (default 50)
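
The label permutation idea behind null importance can be shown without a tree library. The sketch below swaps in a much simpler importance measure, the best single-split Gini gain per feature, in place of the gain from a tree ensemble, so `n_estimators` and `max_depth` do not appear and the gain scale is not comparable to the default threshold of 50. All names are hypothetical; only the compare-against-permuted-labels logic is the point.

```python
import numpy as np

def stump_gain(x, y):
    """Largest Gini impurity reduction from one threshold split on x.

    Simplification: every position in the sorted order is treated as a
    valid cut, so ties in x are not handled specially.
    """
    y_sorted = y[np.argsort(x, kind="stable")].astype(float)
    n = len(y_sorted)
    left_n = np.arange(1, n)
    left_bad = np.cumsum(y_sorted)[:-1]
    right_n = n - left_n
    right_bad = y_sorted.sum() - left_bad

    def gini(bad, cnt):
        p = bad / cnt
        return 2 * p * (1 - p)

    children = (left_n * gini(left_bad, left_n) + right_n * gini(right_bad, right_n)) / n
    return float(np.max(gini(y_sorted.sum(), n) - children))

def null_importance_keep(X, y, n_runs=20, gain_threshold=0.0, seed=0):
    """Keep features whose real gain beats the mean gain under shuffled labels."""
    rng = np.random.default_rng(seed)
    cols = range(X.shape[1])
    actual = np.array([stump_gain(X[:, j], y) for j in cols])
    null = np.array([[stump_gain(X[:, j], y_perm) for j in cols]
                     for y_perm in (rng.permutation(y) for _ in range(n_runs))])
    return (actual - null.mean(axis=0)) > gain_threshold

# Column 0 separates the classes; column 1 alternates labels in sorted order.
i = np.arange(200)
y = i % 2
X = np.column_stack([y * 1000 + i, i])
keep = null_importance_keep(X, y)
print(keep)
```

A feature that carries real signal keeps a large gain only with the true labels, while a noise feature scores about the same, or worse, than it does against shuffled labels.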

High Correlation Parameters

max_corr: Correlation threshold (default 0.9)
top_n_keep: Keep top N features by original gain ranking (default 20)
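
One common way to use gain when removing correlated features is a greedy pass: rank features by gain, then keep each feature only if its absolute correlation with every already-kept feature stays at or below `max_corr`. The sketch below assumes Pearson correlation and a hypothetical `gains` dictionary; it does not model `top_n_keep`, whose exact semantics are not spelled out in the description.

```python
import pandas as pd

def drop_correlated(df, gains, max_corr=0.9):
    """Greedy selection: higher-gain features win over correlated rivals."""
    corr = df[list(gains)].corr().abs()
    ranked = sorted(gains, key=gains.get, reverse=True)
    kept, dropped = [], []
    for f in ranked:
        if any(corr.loc[f, k] > max_corr for k in kept):
            dropped.append(f)
        else:
            kept.append(f)
    return kept, dropped

df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.5],   # nearly a linear copy of "a"
    "c": [5, 1, 4, 2, 3],      # weakly related to both
})
gains = {"a": 30.0, "b": 80.0, "c": 10.0}   # hypothetical original gains
kept, dropped = drop_correlated(df, gains)
print(kept, dropped)
```

Because `b` has the higher gain, it is kept and `a` is removed as its near duplicate.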

Output Report

The generated Excel report contains the following sheets:

汇总 (Summary) - Summary of all steps, including operation results and conditions
机构样本统计 (Organization sample statistics) - Sample count and bad-sample rate for each organization
分离OOS数据 (Separate OOS data) - OOS sample and modeling sample counts
Step4-异常月份处理 (Abnormal month handling) - Abnormal months that were removed
缺失率明细 (Missing rate details) - Overall and organization-level missing rates for each feature
Step5-有值率分布统计 (Non-missing rate distribution) - Number of features in each non-missing-rate range
Step6-高缺失率处理 (High missing rate handling) - High missing rate features that were removed
Step7-IV明细 (IV details) - IV of each feature, per organization and overall
Step7-IV处理 (IV handling) - Features that do not meet the IV conditions, and the organizations with low IV
Step7-IV分布统计 (IV distribution) - Number of features in each IV range
Step8-PSI明细 (PSI details) - PSI of each feature, per organization and per month
Step8-PSI处理 (PSI handling) - Features that do not meet the PSI conditions, and the unstable organizations
Step8-PSI分布统计 (PSI distribution) - Number of features in each PSI range
Step9-null importance处理 (Null importance handling) - Noise features that were removed
Step10-高相关性剔除 (High correlation removal) - Highly correlated features that were removed

Features

Interactive Input: Parameters can be entered before each step runs, with default values supported
Independent Execution: Each step runs independently without deleting the original data, which makes comparative analysis easier
Complete Report: Generates a complete Excel report containing details, statistics, and distributions
Multi-process Support: IV and PSI calculations support multi-process acceleration
Organization-level Analysis: Supports organization-level statistics and the distinction between modeling and OOS samples

First Seen

Mar 2, 2026

Security Audits

Gen Agent Trust Hub: Warn
Socket: Pass
Snyk: Pass

Installed on

codex: 5.5K
gemini-cli: 5.5K
opencode: 5.5K
cursor: 5.5K
github-copilot: 5.5K
kimi-cli: 5.5K
