data-journalism by jamditis/claude-skills-journalism
npx skills add https://github.com/jamditis/claude-skills-journalism --skill data-journalism

Systematic approaches for finding, analyzing and presenting data in journalism.
The framework for data journalism was established by Philip Meyer, a Knight-Ridder journalist, Harvard Nieman Fellow and professor at UNC-Chapel Hill. In *The New Precision Journalism*, the book that lays out his thinking, Meyer encourages journalists to adopt the scientific method and treat journalism "as if it were a science":
- Make observations / formulate a question
- Research the question / collect, store and retrieve data
- Formulate a hypothesis
- Test the hypothesis using both qualitative (interviews, documents, etc.) and quantitative (data analysis, etc.) methods
- Analyze the results and reduce them to the most important findings
- Present them to the audience
This process should be thought of as iterative rather than sequential.
## The data story arc
### 1. The hook (nut graf)
- What's the key finding?
- Why should readers care?
- What's the human impact?
### 2. The evidence
- Show the data
- Explain the methodology
- Acknowledge limitations
### 3. The context
- How does this compare to the past?
- How does it compare to elsewhere?
- What's the trend?
### 4. The human element
- Individual examples that illustrate the data
- Expert interpretation
- Voices of those affected
### 5. The implications
- What does this mean going forward?
- What questions remain open?
- What actions could result?
### 6. The methodology box
- Where did the data come from?
- How was it analyzed?
- What are the limitations?
- How can readers explore further?
## How we did this analysis
### Data sources
[List all data sources with links and access dates]
### Time period
[Specify exactly what time period is covered]
### Definitions
[Define key terms and how you operationalized them]
### Analysis steps
1. [First step of analysis]
2. [Second step]
3. [Continue...]
### Limitations
- [Limitation 1]
- [Limitation 2]
### What we excluded and why
- [Excluded category]: [Reason]
### Verification
[How findings were verified/checked]
### Code and data availability
[Link to GitHub repo if sharing code/data]
### Contact
[How readers can reach you with questions]
## Federal data sources
### General
- Data.gov - Federal open data portal
- Census Bureau (census.gov) - Demographics, economic data
- Bureau of Labor Statistics (bls.gov) - Employment, inflation, wages
- Bureau of Economic Analysis (bea.gov) - GDP, economic accounts
- Federal Reserve (federalreserve.gov) - Financial data
- SEC EDGAR - Corporate filings
### Specific domains
- EPA (epa.gov/data) - Environmental data
- FDA (fda.gov/data) - Drug approvals, recalls, adverse events
- CDC WONDER - Health statistics
- NHTSA - Vehicle safety data
- DOT - Transportation statistics
- FEC - Campaign finance
- USASpending.gov - Federal contracts and grants
### State and local
- State open data portals (search: "[state] open data")
- Socrata-powered sites (many cities/states)
- OpenStreets, municipal GIS portals
- State comptroller/auditor reports
## Getting data that isn't public
### Public records requests (e.g. FOIA) for datasets
- Request databases, not just documents
- Ask for the data dictionary/schema
- Request data in its native format (CSV, SQL dump)
- Specify field-level needs
### Building your own dataset
- Scraping public information
- Crowdsourcing from readers
- Systematic document review
- Surveys (with proper methodology)
### Commercial data sources (for newsrooms)
- LexisNexis
- Refinitiv
- Bloomberg
- Industry-specific databases
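Scraping public information is one way to build a dataset when no download exists. Below is a minimal sketch using only the Python standard library that pulls the rows out of an HTML table; the `scrape_table` helper and the sample markup are illustrative, not taken from any specific site. In practice you would fetch the page with a library like requests and check the site's terms of service and robots.txt first.

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""
    def __init__(self):
        super().__init__()
        self.rows = []        # Completed rows
        self._row = []        # Cells of the row being parsed
        self._cell = []       # Text fragments of the cell being parsed
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in ('td', 'th'):
            self._in_cell = True
            self._cell = []
        elif tag == 'tr':
            self._row = []

    def handle_endtag(self, tag):
        if tag in ('td', 'th'):
            self._in_cell = False
            self._row.append(''.join(self._cell).strip())
        elif tag == 'tr' and self._row:
            self.rows.append(self._row)

    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)

def scrape_table(html: str) -> list[list[str]]:
    """Return table rows as lists of cell strings."""
    parser = TableScraper()
    parser.feed(html)
    return parser.rows
```

From here the rows can be handed straight to `pd.DataFrame(rows[1:], columns=rows[0])` for cleaning.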
from typing import Any
import pandas as pd
import numpy as np
from rapidfuzz import fuzz
from itertools import combinations
# Inflation adjustment
import cpi
import wbdata
def standardize_name(name: Any) -> str | None:
    """Standardize name format to 'First Last'."""
    if pd.isna(name):
        return None
    name = str(name).strip().upper()
    # Handle "LAST, FIRST" format
    if ',' in name:
        parts = name.split(',')
        name = f"{parts[1].strip()} {parts[0].strip()}"
    return name

def parse_date(date_str: Any) -> pd.Timestamp | None:
    """Parse dates in various formats."""
    if pd.isna(date_str):
        return None
    formats = [
        '%m/%d/%Y', '%Y-%m-%d', '%B %d, %Y',
        '%d-%b-%y', '%m-%d-%Y', '%Y/%m/%d'
    ]
    for fmt in formats:
        try:
            return pd.to_datetime(date_str, format=fmt)
        except (ValueError, TypeError):
            continue
    # Fall back to pandas parser
    try:
        return pd.to_datetime(date_str)
    except (ValueError, TypeError):
        return None
def handle_missing(df: pd.DataFrame, thresh: int | None = None,
                   per_thresh: float | None = None,
                   required_col: str | None = None) -> pd.DataFrame:
    """Drop rows missing the required column when the DataFrame exceeds a
    user-defined missing-value threshold (absolute count or percentage)."""
    if required_col is None or (thresh is None and per_thresh is None):
        return df
    total_missing = df.isna().sum().sum()
    if thresh and total_missing >= thresh:
        return df.dropna(subset=[required_col]).reset_index(drop=True)
    if per_thresh and (total_missing / df.size * 100) >= per_thresh:
        return df.dropna(subset=[required_col]).reset_index(drop=True)
    return df

def handle_duplicates(df: pd.DataFrame, thresh: int | None = None) -> pd.DataFrame:
    """Drop duplicate rows once their count reaches the user-defined threshold."""
    if thresh and df.duplicated().sum() >= thresh:
        return df.drop_duplicates().reset_index(drop=True)
    return df
def flag_similar_names(df: pd.DataFrame, name_col: str, threshold: int = 85) -> pd.DataFrame:
    """Flag rows that have potential duplicate names using pairwise fuzzy comparison."""
    names = df[name_col].dropna().unique()
    # Use combinations() to avoid a nested loop and duplicate comparisons
    dup_names: set[Any] = {
        name
        for name1, name2 in combinations(names, 2)
        if fuzz.ratio(str(name1).lower(), str(name2).lower()) >= threshold
        for name in (name1, name2)
    }
    df['has_similar_name'] = df[name_col].isin(dup_names)
    return df

def flag_outliers(series: pd.Series, method: str = 'iqr', threshold: float = 1.5) -> pd.Series:
    """Flag statistical outliers."""
    if method == 'iqr':
        Q1 = series.quantile(0.25)
        Q3 = series.quantile(0.75)
        IQR = Q3 - Q1
        lower = Q1 - threshold * IQR
        upper = Q3 + threshold * IQR
        return (series < lower) | (series > upper)
    elif method == 'zscore':
        z_scores = np.abs((series - series.mean()) / series.std())
        return z_scores > threshold
    raise ValueError(f"Unknown method: {method}")
# use descriptive variable names and chain methods
data_clean = (pd
    # Load messy data (raw_data.csv is a placeholder)
    # Be sure to use the right reader for the filetype
    .read_csv('../data/raw/raw_data.csv')
    # DATA TYPE CORRECTIONS
    # Ensure proper types for analysis
    .assign(
        # Convert to numeric (coercing bad values to NaN)
        amount=lambda x: pd.to_numeric(x['amount'], errors='coerce'),
        # Convert to categorical (saves memory, enables ordering)
        status=lambda x: x['status'].astype('category'))
    .assign(
        # INCONSISTENT FORMATTING
        # Problem: Names in different formats
        # e.g. "SMITH, JOHN" vs "John Smith" vs "smith john"
        name_clean=lambda x: x['name'].map(standardize_name),
        # DATE INCONSISTENCIES
        # Problem: Dates in multiple formats
        # e.g. "01/15/2024", "2024-01-15", "January 15, 2024", "15-Jan-24"
        date_clean=lambda x: x['date'].map(parse_date),
        # OUTLIERS
        # Identify potential data entry errors
        amount_outlier=lambda x: flag_outliers(x['amount']),
    )
    # Fuzzy duplicates (similar but not identical)
    # Use record linkage or manual review
    .pipe(flag_similar_names, name_col='name_clean', threshold=85)
    # MISSING VALUES
    # Strategy depends on context
    # First check missing value patterns
    .pipe(handle_missing, thresh=None, per_thresh=None)
    # DUPLICATES: find and handle duplicate rows
    .pipe(handle_duplicates, thresh=None)
    .reset_index(drop=True)
    .copy())
## Pre-analysis data validation
### Structural checks
- [ ] Row count matches expectations
- [ ] Column count and names are correct
- [ ] Data types are appropriate
- [ ] No unexpected null columns
### Content checks
- [ ] Date ranges make sense
- [ ] Numeric values fall within expected bounds
- [ ] Categorical values match expected options
- [ ] Geographic data resolves correctly
- [ ] IDs are unique where expected
### Consistency checks
- [ ] Totals add up to expected values
- [ ] Cross-tabulations balance
- [ ] Related fields are consistent
- [ ] Time series is continuous
### Source verification
- [ ] Can be traced back to the original source
- [ ] Methodology is documented
- [ ] Known limitations are noted
- [ ] Update frequency is understood
# Essential statistics for any dataset
def describe_for_journalism(df: pd.DataFrame, col: str) -> pd.Series:
    """Generate journalist-friendly statistics."""
    stats = df[col].describe(percentiles=[0.25, 0.5, 0.75, 0.9, 0.99])
    # Rename percentile labels so they read well in copy
    stats = stats.rename({'50%': 'median', '90%': '90th_percentile',
                          '99%': '99th_percentile'})
    # Add skewness to the describe() output
    stats['skewness'] = df[col].skew()
    return stats

# Example interpretation
stats = describe_for_journalism(salaries, 'salary')
print(f"""
ANALYSIS
---------------
We analyzed {stats['count']:,.0f} salary records.
The median salary is ${stats['median']:,.0f}, meaning half of workers
earn more and half earn less.
The average salary is ${stats['mean']:,.0f}, which is
{'higher' if stats['mean'] > stats['median'] else 'lower'} than the median,
indicating the distribution is {'right-skewed (pulled up by high earners)'
if stats['skewness'] > 0 else 'left-skewed'}.
The top 10% of earners make at least ${stats['90th_percentile']:,.0f}.
The top 1% make at least ${stats['99th_percentile']:,.0f}.
""")
# Calculate change metrics for a column
def calculate_change(df: pd.DataFrame, col: str, periods: int = 1) -> pd.DataFrame:
    """Add change metrics to a DataFrame using built-in pandas methods.
    Args:
        df: Input DataFrame
        col: Column to calculate changes for
        periods: Number of rows to look back (1=previous row, 12=year-over-year for monthly)
    """
    return df.assign(
        absolute_change=df[col].diff(periods),
        percent_change=df[col].pct_change(periods) * 100,
        direction=np.sign(df[col].diff(periods)).map({1: 'increased', -1: 'decreased', 0: 'unchanged'})
    )
# Usage:
# changes = data_clean.pipe(calculate_change, 'revenue', periods=12) # Year-over-year for monthly data
# Per capita calculations (essential for fair comparisons)
def per_capita(value: float, population: float, multiplier: int = 100000) -> float:
    """Calculate per capita rate."""
    return (value / population) * multiplier  # Per 100,000 is standard
# Example: Crime rates
city_a = {'crimes': 5000, 'population': 100000}
city_b = {'crimes': 8000, 'population': 500000}
rate_a = per_capita(city_a['crimes'], city_a['population'])
rate_b = per_capita(city_b['crimes'], city_b['population'])
print(f"City A: {rate_a:.1f} crimes per 100,000 residents")
print(f"City B: {rate_b:.1f} crimes per 100,000 residents")
# City A actually has higher crime rate despite fewer total crimes!
def adjust_for_inflation(
    amount: float | pd.Series,
    from_year: int | pd.Series,
    to_year: int,
    country: str = 'US'
) -> float | pd.Series:
    """Adjust dollar amounts for inflation. Works with scalars or Series for .assign().
    Args:
        amount: Value(s) to adjust
        from_year: Original year(s) of the amount
        to_year: Target year to adjust to
        country: ISO 2-letter country code (default 'US'). US uses BLS data via cpi package,
            others use World Bank CPI data (FP.CPI.TOTL indicator)
    """
    if country == 'US':
        # Use cpi package for US (more accurate, from BLS)
        if isinstance(from_year, pd.Series):
            return pd.Series([cpi.inflate(amt, yr, to=to_year)
                              for amt, yr in zip(amount, from_year)], index=amount.index)
        return cpi.inflate(amount, from_year, to=to_year)
    else:
        # Use World Bank data for other countries
        cpi_data = wbdata.get_dataframe(
            {'FP.CPI.TOTL': 'cpi'},
            country=country
        )['cpi'].to_dict()
        from_cpi = pd.Series(from_year).map(cpi_data) if isinstance(from_year, pd.Series) else cpi_data[from_year]
        to_cpi = cpi_data[to_year]
        return amount * (to_cpi / from_cpi)
# Usage:
# adjust_for_inflation(100, 2020, 2024) # US by default
# adjust_for_inflation(100, 2020, 2024, country='GB') # UK
# df.assign(inf_adjust24=lambda x: adjust_for_inflation(x['amount'], x['year'], 2024, country='DE'))
# Always adjust when comparing dollars across years!
## Reporting correlations responsibly
### What you CAN say
- "X and Y are correlated"
- "As X increases, Y tends to increase"
- "Areas with higher X also tend to have higher Y"
- "X is associated with Y"
### What you CANNOT say (without more evidence)
- "X causes Y"
- "X leads to Y"
- "Y happens because of X"
### Questions to ask before implying causation
1. Is there a plausible mechanism?
2. Does the timing make sense (cause before effect)?
3. Is there a dose-response relationship?
4. Has the finding been replicated?
5. Have confounding variables been controlled for?
6. Are there alternative explanations?
### Red flags for spurious correlations
- Extremely high correlation (r > 0.95) with unrelated things
- No logical connection between the variables
- A third variable could explain both
- Small sample size with high variance
## Choosing the right chart
### Comparison
- **Bar chart**: Compare categories
- **Grouped bar**: Compare categories across groups
- **Bullet chart**: Actual vs. target
### Change over time
- **Line chart**: Trends over time
- **Area chart**: Cumulative totals over time
- **Slope chart**: Change between two points
### Distribution
- **Histogram**: Distribution of one variable
- **Box plot**: Compare distributions across groups
- **Violin plot**: Detailed distribution shape
### Relationship
- **Scatter plot**: Relationship between two variables
- **Bubble chart**: Three variables (x, y, size)
- **Connected scatter**: Change in a relationship over time
### Composition
- **Pie chart**: Parts of a whole (almost never use; max 5 slices, prefer a donut chart)
- **Donut chart**: Parts of a whole
- **Stacked bar**: Parts of a whole across categories
- **Treemap**: Hierarchical composition
### Geographic
- **Choropleth map**: Values by region (use normalized data!)
- **Dot map**: Individual locations
- **Proportional symbol map**: Magnitude at locations
import plotly.express as px
import plotly.graph_objects as go
# Set default template for all charts
px.defaults.template = 'simple_white'

def create_bar_chart(
    data: pd.DataFrame,
    x_val: str,
    y_val: str,
    title: str,
    source: str,
    desc: str = '',
    x_lab: str | None = None,
    y_lab: str | None = None
) -> go.Figure:
    """Create a bar chart with a source note."""
    fig = px.bar(
        data,
        x=x_val,
        y=y_val,
        title=title,
        labels={x_val: (x_lab if x_lab else x_val), y_val: (y_lab if y_lab else y_val)}
    )
    # Description and source attribution below the plot area
    fig.add_annotation(text=f"{desc} Source: {source}".strip(),
                       xref='paper', yref='paper', x=0, y=-0.2,
                       showarrow=False, font=dict(size=10))
    return fig

# Example
fig = create_bar_chart(
    data,
    title='Annual Widget Production',
    source='Department of Widgets, 2024',
    desc='The widget department increased its production dramatically starting in 2014.',
    x_val='year',
    y_val='widgets_prod',
    x_lab='Year',
    y_lab='Units produced'
)
fig.show() # Interactive display
import pandas as pd
import datawrapper as dw
# Authentication: Set DATAWRAPPER_ACCESS_TOKEN environment variable,
# or read from file and pass to create()
with open('datawrapper_api_key.txt', 'r') as f:
    api_key = f.read().strip()
# Read in your data
data = pd.read_csv('../data/raw/data.csv')
# Create a bar chart using the new OOP API
chart = dw.BarChart(
    title='My Bar Chart Title',
    intro='Subtitle or description text',
    data=data,
    # Formatting options
    value_label_format=dw.NumberFormat.ONE_DECIMAL,
    show_value_labels=True,
    value_label_alignment='left',
    sort_bars=True,  # Sort by value
    reverse_order=False,
    # Source attribution
    source_name='Your Data Source',
    source_url='https://example.com',
    byline='Your Name',
    # Optional: custom base color
    base_color='#1d81a2'
)
# Create and publish (uses DATAWRAPPER_ACCESS_TOKEN env var, or pass token)
chart.create(access_token=api_key)
chart.publish()
# Get chart URL and embed code
print(f"Chart ID: {chart.chart_id}")
print(f"Chart URL: https://datawrapper.dwcdn.net/{chart.chart_id}")
iframe_code = chart.get_iframe_code(responsive=True)
# Update existing chart with new data (for live-updating charts)
existing_chart = dw.get_chart('YOUR_CHART_ID') # retrieve by ID
existing_chart.data = new_df # assign new DataFrame
existing_chart.title = 'Updated Title' # modify properties
existing_chart.update() # push changes to Datawrapper
existing_chart.publish() # republish to make live
# Optional — Export chart as image
chart.export(filepath='chart.png', width=800, height=600)
#view chart
chart
## Chart integrity checklist
### Axes
- [ ] Y-axis starts at zero (for bar charts)
- [ ] Axis labels are clear
- [ ] Scale is appropriate (not truncated to exaggerate)
- [ ] Both axes are labeled with units
### Data representation
- [ ] All data points are visible
- [ ] Colors are distinguishable (including for color blindness)
- [ ] Proportions are accurate
- [ ] No 3D effects distorting perception
### Context
- [ ] Title describes what is shown, not a conclusion
- [ ] Time period is clearly stated
- [ ] Source is cited
- [ ] Sample size/methodology noted where relevant
- [ ] Uncertainty shown where appropriate
### Honesty
- [ ] No cherry-picked date ranges
- [ ] Outliers are explained, not hidden
- [ ] Dual axes are justified (usually avoid them)
- [ ] Annotations don't mislead
## Geocoding addresses
### U.S. Census batch geocoder
Best for: U.S. addresses only. Returns census geography (tract, block, FIPS codes) along with coordinates, which is essential for joining to census demographic data.
Pros: Completely free, no API key required. The census geography (state/county FIPS, tract, block) lets you join to ACS/decennial census data. Good match rates for standard U.S. addresses.
Cons: Maximum 10,000 addresses per batch. U.S. addresses only. Slower than commercial alternatives. Lower match rates for non-standard addresses (PO boxes, rural routes, new construction).
When to use: You need to geocode well-formatted U.S. addresses, or you have no budget for a paid service.
# pip install censusbatchgeocoder
import censusbatchgeocoder
import pandas as pd
# DataFrame must have columns: id, address, city, state, zipcode
# (state and zipcode are optional but improve match rates)
def census_geocode(
    df: pd.DataFrame,
    id_col: str = 'id',
    address_col: str = 'address',
    city_col: str = 'city',
    state_col: str = 'state',
    zipcode_col: str = 'zipcode',
    chunk_size: int = 9999
) -> pd.DataFrame:
    """
    Geocode a DataFrame using the U.S. Census batch geocoder.
    Automatically handles datasets larger than 10,000 rows by chunking.
    Returns DataFrame with: latitude, longitude, state_fips, county_fips,
    tract, block, is_match, is_exact, returned_address, geocoded_address
    """
    # Rename columns to the expected format
    col_map = {id_col: 'id', address_col: 'address', city_col: 'city'}
    if state_col and state_col in df.columns:
        col_map[state_col] = 'state'
    if zipcode_col and zipcode_col in df.columns:
        col_map[zipcode_col] = 'zipcode'
    renamed_df = df.rename(columns=col_map)
    records = renamed_df.to_dict('records')
    # Small dataset: geocode directly
    if len(records) <= chunk_size:
        results = censusbatchgeocoder.geocode(records)
        return pd.DataFrame(results)
    # Large dataset: process in chunks to stay under the 10,000-row limit
    all_results = []
    for i in range(0, len(records), chunk_size):
        chunk = records[i:i + chunk_size]
        print(f"Geocoding rows {i:,} to {i + len(chunk):,} of {len(records):,}...")
        try:
            results = censusbatchgeocoder.geocode(chunk)
            all_results.extend(results)
        except Exception as e:
            print(f"Error on chunk starting at {i}: {e}")
            for record in chunk:
                all_results.append({**record, 'is_match': 'No_Match', 'latitude': None, 'longitude': None})
    return pd.DataFrame(all_results)
# Usage:
geocoded = (pd
    .read_csv('../data/raw/addresses.csv')
    .assign(id=lambda x: x.index)
    .pipe(census_geocode,
          id_col='id',
          address_col='street',
          city_col='city',
          state_col='state',
          zipcode_col='zip'))
### Google Maps Geocoding API
Best for: international addresses, high match rates, and messy or non-standard address formats.
Pros: Excellent match rates even for badly formatted addresses. Works worldwide. Fast and reliable. Returns rich metadata (place types, address components, place IDs).
Cons: Costs money ($5 per 1,000 requests beyond the free tier). Requires an API key and a billing account. Does not return census geography; you need a separate spatial join for that.
When to use: You need to geocode international addresses, have messy address data the Census geocoder can't match, or need the highest possible match rate and have the budget for it.
import googlemaps
import pandas as pd

def geocode_address_google(address: str, api_key: str) -> dict | None:
    """
    Geocode address using Google Maps API.
    Requires API key with Geocoding API enabled.
    """
    gmaps = googlemaps.Client(key=api_key)
    result = gmaps.geocode(address)
    if result:
        location = result[0]['geometry']['location']
        return {
            'formatted_address': result[0]['formatted_address'],
            'lat': location['lat'],
            'lon': location['lng'],
            'place_id': result[0]['place_id']
        }
    return None

# Batch geocode a DataFrame
def batch_geocode(df: pd.DataFrame, address_col: str, api_key: str) -> pd.DataFrame:
    gmaps = googlemaps.Client(key=api_key)
    results = []
    for address in df[address_col]:
        try:
            result = gmaps.geocode(address)
            if result:
                loc = result[0]['geometry']['location']
                results.append({'lat': loc['lat'], 'lon': loc['lng']})
            else:
                results.append({'lat': None, 'lon': None})
        except Exception:
            results.append({'lat': None, 'lon': None})
    return pd.concat([df, pd.DataFrame(results)], axis=1)
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point
# Read data from various formats
gdf = gpd.read_file('data.geojson') # GeoJSON
gdf = gpd.read_file('data.shp') # Shapefile
gdf = gpd.read_file('https://example.com/data.geojson') # From URL
gdf = gpd.read_parquet('data.parquet') # GeoParquet (fast!)
# Transform DataFrame with lat/lon to GeoDataFrame
df = pd.read_csv('locations.csv')
geometry = [Point(xy) for xy in zip(df['longitude'], df['latitude'])]
gdf = gpd.GeoDataFrame(df, geometry=geometry)
# Set CRS (Coordinate Reference System)
# EPSG:4326 = WGS84 (standard latitude, longitude)
gdf = gdf.set_crs('EPSG:4326')
# Transform to different CRS (for area/distance calculations, use projected CRS)
gdf_projected = gdf.to_crs('EPSG:3857') # Web Mercator, for distance in meters
# Basic spatial operations
# Find the area of a shape (use a projected CRS so units are meters)
gdf['area'] = gdf_projected.geometry.area
# Find the center of a shape (also best computed in a projected CRS)
gdf['centroid'] = gdf_projected.geometry.centroid
# Draw a 1km buffer around a point
gdf['buffer_1km'] = gdf_projected.geometry.buffer(1000)  # Meters when CRS is EPSG:3857
# Spatial join: find points within polygons
points = gpd.read_file('points.geojson')
polygons = gpd.read_file('boundaries.geojson')
joined = gpd.sjoin(points, polygons, predicate='within')
# Dissolve: merge geometries by attribute
dissolved = gdf.dissolve(by='state', aggfunc='sum')
# Export to various formats
gdf.to_parquet('output.parquet')  # GeoParquet (recommended)
gdf.to_file('output.geojson', driver='GeoJSON')  # For tools that don't support GeoParquet
## Geographic visualization with .explore(), lonboard and Datawrapper
### .explore()
Best for: quick exploration and prototyping during data analysis.
Pros: Built into GeoPandas, so it works on any GeoDataFrame. Great for exploratory analysis: checking that the data looks right, exploring spatial patterns, and iterating quickly on map design.
Cons: Slows down on large datasets (>100k features). Limited customization compared to dedicated mapping libraries. Requires extra dependencies.
When to use: You are mid-analysis and want to visualize your GeoDataFrame quickly without switching tools.
Required dependencies:
pip install folium mapclassify matplotlib
folium - required for .explore() to work (it renders the interactive map)
mapclassify - required for classification via the scheme= parameter (e.g. 'naturalbreaks', 'quantiles', 'equalinterval')
matplotlib - required for colormap support (cmap=)
import geopandas as gpd
gdf.explore()
gdf.explore(
    column='population',             # Column for color scale
    cmap='YlOrRd',                   # Matplotlib colormap
    scheme='naturalbreaks',          # Classification scheme (needs mapclassify)
    k=5,                             # Number of bins
    legend=True,
    tooltip=['name', 'population'],  # Columns to show on hover
    popup=True,                      # Show all columns on click
    tiles='CartoDB positron',        # Background tiles
    style_kwds={'color': 'black', 'weight': 0.5}  # Border style
)
### lonboard
Best for: large datasets and high-performance visualization in Jupyter notebooks.
Pros: GPU-accelerated rendering via deck.gl handles millions of points smoothly. Excellent interactivity: panning, zooming and hovering stay smooth even on massive datasets. Native support for the GeoArrow format enables efficient data transfer.
Cons: Requires a separate install (pip install lonboard). Styling options are more technical (RGBA arrays, deck.gl conventions).
When to use: You have large point datasets (crime incidents, sensor readings, business locations) or need smooth interaction with 100k+ features.
import geopandas as gpd
from lonboard import viz, Map, ScatterplotLayer, PolygonLayer
# Quick visualization (auto-detects geometry type)
viz(gdf)
# Custom ScatterplotLayer for points
layer = ScatterplotLayer.from_geopandas(
    gdf,
    get_radius=100,
    get_fill_color=[255, 0, 0, 200],  # RGBA
    pickable=True
)
m = Map(layer)
m
# PolygonLayer with color based on column
from lonboard.colormap import apply_continuous_cmap
import matplotlib.pyplot as plt
colors = apply_continuous_cmap(gdf['value'], plt.cm.viridis)
layer = PolygonLayer.from_geopandas(
    gdf,
    get_fill_color=colors,
    get_line_color=[0, 0, 0, 100],
    pickable=True
)
Map(layer)
### Datawrapper
Best for: publication-ready choropleth and symbol maps for articles and reports.
Pros: Beautiful, professional defaults out of the box. Produces embeddable, responsive iframes that work in any CMS. Readers get interactivity (hover, click) without running any code. Accessible and mobile-friendly. Easy to update data programmatically.
Cons: Requires a Datawrapper account (free tier available). Limited to the boundary files Datawrapper supports; you cannot use arbitrary geometries. Less flexibility for custom visualizations.
When to use: You need a polished map for publication. Ideal for choropleths of statistics by region (unemployment by state, COVID cases by county, election results by district). Your audience will view the map in a browser, not a notebook.
Unlike .explore() or lonboard, you don't pass raw geometries; instead you match your data to Datawrapper's built-in boundary files using standard codes (FIPS, ISO, etc.).
import datawrapper as dw
import pandas as pd
# Read API key
with open('datawrapper_api_key.txt', 'r') as f:
    api_key = f.read().strip()
# Prepare data with location codes that match Datawrapper's boundaries
# For US states: use 2-letter abbreviations or FIPS codes
# For countries: use ISO 3166-1 alpha-2 codes
df = pd.DataFrame({
    'state': ['AL', 'AK', 'AZ', 'AR', 'CA'],  # State abbreviations
    'unemployment_rate': [4.9, 3.2, 7.1, 4.2, 5.8]
})
# Create a choropleth map
chart = dw.ChoroplethMap(
    title='Unemployment Rate by State',
    intro='Percentage of labor force unemployed, 2024',
    data=df,
    # Map configuration
    basemap='us-states',  # Built-in US states boundaries
    basemap_key='state'   # Column in your data with location codes
)
# Create and publish, as with the bar chart above
chart.create(access_token=api_key)
chart.publish()
Systematic approaches for finding, analyzing and presenting data in journalism.
The framework for data journalism was established by Philip Meyer, a journalist for Knight-Ridder, Harvard Nieman Fellow and professor at UNC-Chapel Hill. In his book <i>The New Precision Journalism</i>, which outlines his ideas, Meyer encourages journalists to treat journalism "as if it were a science" by adopting the scientific method:
- Making observation(s) / formulating a questiom
- Researching the question / Collect, store and retrieve data
- Formulate a hypothesis
- Test the hypothesis, using both qualitative (interviews, documents etc.) and quantitative (data analysis etc.) methods
- Analyze the results and reduce them to the most important findings
- Present them to the audience
This process should be thought of as iterative, rather than sequential.
## The data story arc
### 1. The hook (nut graf)
- What's the key finding(s)?
- Why should readers care?
- What's the human impact?
### 2. The evidence
- Show the data
- Explain the methodology
- Acknowledge limitations
### 3. The context
- How does this compare to past?
- How does this compare to elsewhere?
- What's the trend?
### 4. The human element
- Individual examples that illustrate the data
- Expert interpretation
- Affected voices
### 5. The implications
- What does this mean going forward?
- What questions remain?
- What actions could result?
### 6. The methodology box
- Where did data come from?
- How was it analyzed?
- What are the limitations?
- How can readers explore further?
## How we did this analysis
### Data sources
[List all data sources with links and access dates]
### Time period
[Specify exactly what time period is covered]
### Definitions
[Define key terms and how you operationalized them]
### Analysis steps
1. [First step of analysis]
2. [Second step]
3. [Continue...]
### Limitations
- [Limitation 1]
- [Limitation 2]
### What we excluded and why
- [Excluded category]: [Reason]
### Verification
[How findings were verified/checked]
### Code and data availability
[Link to GitHub repo if sharing code/data]
### Contact
[How readers can reach you with questions]
## Federal data sources
### General
- Data.gov - Federal open data portal
- Census Bureau (census.gov) - Demographics, economic data
- BLS (bls.gov) - Employment, inflation, wages
- BEA (bea.gov) - GDP, economic accounts
- Federal Reserve (federalreserve.gov) - Financial data
- SEC EDGAR - Corporate filings
### Specific domains
- EPA (epa.gov/data) - Environmental data
- FDA (fda.gov/data) - Drug approvals, recalls, adverse events
- CDC WONDER - Health statistics
- NHTSA - Vehicle safety data
- DOT - Transportation statistics
- FEC - Campaign finance
- USASpending.gov - Federal contracts and grants
### State and local
- State open data portals (search: "[state] open data")
- Socrata-powered sites (many cities/states)
- OpenStreets, municipal GIS portals
- State comptroller/auditor reports
## Getting data that isn't public
### Public records request (ie. FOIA) for datasets
- Request databases, not just documents
- Ask for data dictionary/schema
- Request in native format (CSV, SQL dump)
- Specify field-level needs
### Building your own dataset
- Scraping public information
- Crowdsourcing from readers
- Systematic document review
- Surveys (with proper methodology)
### Commercial data sources (for newsrooms)
- LexisNexis
- Refinitiv
- Bloomberg
- Industry-specific databases
from typing import Any
import pandas as pd
import numpy as np
from rapidfuzz import fuzz
from itertools import combinations
# Inflation adjustment
import cpi
import wbdata
def standardize_name(name: Any) -> str | None:
"""Standardize name format to 'First Last'."""
if pd.isna(name):
return None
name = str(name).strip().upper()
# Handle "LAST, FIRST" format
if ',' in name:
parts = name.split(',')
name = f"{parts[1].strip()} {parts[0].strip()}"
return name
def parse_date(date_str: Any) -> pd.Timestamp | None:
"""Parse dates in various formats."""
if pd.isna(date_str):
return None
formats = [
'%m/%d/%Y', '%Y-%m-%d', '%B %d, %Y',
'%d-%b-%y', '%m-%d-%Y', '%Y/%m/%d'
]
for fmt in formats:
try:
return pd.to_datetime(date_str, format=fmt)
except:
continue
# Fall back to pandas parser
try:
return pd.to_datetime(date_str)
except:
return None
def handle_missing(df:pd.DataFrame, thresh:int | None, per_thresh:float | None, required_col:str | None) -> pd.DataFrame:
'''Handles Dataframes with too many missing values, defined by the user.'''
if thresh and data_clean.isna().sum() >= thresh:
return df.dropna(subset=[required_col]).reset_index(drop=True).copy()
elif per_thresh and (data_clean.isna().sum() / len(data_clean) * 100) >= per_thresh:
return df.dropna(subset=[required_col]).reset_index(drop=True).copy()
else:
return df
def handle_duplicates(df:pd.DataFrame, thresh=int | None)
'''Handle duplicate rows of data.'''
if thresh and df.duplicated().sum() >= thresh:
return df.drop_duplicates().reset_index(drop=True).copy()
else:
return df
def flag_similar_names(df: pd.DataFrame, name_col: str, threshold: int = 85) -> pd.DataFrame:
"""Flag rows that have potential duplicate names using vectorized comparison."""
names = df[name_col].dropna().unique()
# Use combinations() to avoid nested loop and duplicate comparisons
dup_names: set[Any] = {
name
for name1, name2 in combinations(names, 2)
if fuzz.ratio(str(name1).lower(), str(name2).lower()) >= threshold
for name in (name1, name2)
}
df['has_similar_name'] = df[name_col].isin(dup_names)
return df
def flag_outliers(series: pd.Series, method: str = 'iqr', threshold: float = 1.5) -> pd.Series:
"""Flag statistical outliers."""
if method == 'iqr':
Q1 = series.quantile(0.25)
Q3 = series.quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - threshold * IQR
upper = Q3 + threshold * IQR
return (series < lower) | (series > upper)
elif method == 'zscore':
z_scores = np.abs((series - series.mean()) / series.std())
return z_scores > threshold
# use descriptive variable names and chain methods
data_clean = (pd
# Load messy data — raw_data is a placeholder
# Be sure to use the right reader for the filetype
.read_csv('..data/raw/raw_data.csv')
# DATA TYPE CORRECTIONS
# Ensure proper types for analysis
.assign(# Convert to numeric (handling errors)
amount = lambda x: pd.to_numeric(x['amount'], errors='coerce'),
# Convert to categorical (saves memory, enables ordering)
status = lambda x: pd.to_Categorical(x['status']))
.assign(
# INCONSISTENT FORMATTING
# Problem: Names in different formats
# ie. "SMITH, JOHN" vs "John Smith" vs "smith john"
name_clean = lambda x: standaridize_name(x['name']),
# DATE INCONSISTENCIES
# Problem: Dates in multiple formats
# ie. "01/15/2024", "2024-01-15", "January 15, 2024", "15-Jan-24"
parse_date = lambda x: parse_date(x['date']),
# OUTLIERS
# Identify potential data entry errors
amount_outlier = lambda x: flag_outliers(x['amount']),
)
# Fuzzy duplicates (similar but not identical)
# Use record linkage or manual review
.pipe(find_similar_names, name_col='name_clean', threshold=85)
# MISSING VALUES
# Strategy depends on context
# First check missing value patterns
.pipe(handle_missing, thresh=None, per_thresh=None)
# DUPLICATES — Find and handle duplicates
.pipe(handle_duplicates, thresh=None)
.reset_index(drop=True)
.copy())
## Pre-analysis data validation
### Structural checks
- [ ] Row count matches expected
- [ ] Column count and names correct
- [ ] Data types appropriate
- [ ] No unexpected null columns
### Content checks
- [ ] Date ranges make sense
- [ ] Numeric values within expected bounds
- [ ] Categorical values match expected options
- [ ] Geographic data resolves correctly
- [ ] IDs are unique where expected
### Consistency checks
- [ ] Totals add up to expected values
- [ ] Cross-tabulations balance
- [ ] Related fields are consistent
- [ ] Time series is continuous
### Source verification
- [ ] Can trace back to original source
- [ ] Methodology documented
- [ ] Known limitations noted
- [ ] Update frequency understood
# Essential statistics for any dataset
def describe_for_journalism(df: pd.DataFrame, col: str) -> pd.DataFrame:
"""Generate journalist-friendly statistics."""
stats = df[col].describe(percentiles=[0.25, 0.5, 0.75, 0.9, 0.99])
# Add skewness to the describe() output
stats['skewness'] = df[col].skew()
return stats.to_frame(name=col)
# Example interpretation
stats = describe_for_journalism(salaries, 'salary')
print(f"""
ANALYSIS
---------------
We analyzed {stats['count']:,} salary records.
The median salary is ${stats['median']:,.0f}, meaning half of workers
earn more and half earn less.
The average salary is ${stats['mean']:,.0f}, which is
{'higher' if stats['mean'] > stats['median'] else 'lower'} than the median,
indicating the distribution is {'right-skewed (pulled up by high earners)'
if stats['skewness'] > 0 else 'left-skewed'}.
The top 10% of earners make at least ${stats['90th_percentile']:,.0f}.
The top 1% make at least ${stats['99th_percentile']:,.0f}.
""")
# Calculate change metrics for a column
def calculate_change(df: pd.DataFrame, col: str, periods: int = 1) -> pd.DataFrame:
"""Add change metrics to a DataFrame using built-in pandas methods.
Args:
df: Input DataFrame
col: Column to calculate changes for
periods: Number of rows to look back (1=previous row, 12=year-over-year for monthly)
"""
return df.assign(
absolute_change=df[col].diff(periods),
percent_change=df[col].pct_change(periods) * 100,
direction=np.sign(df[col].diff(periods)).map({1: 'increased', -1: 'decreased', 0: 'unchanged'})
)
# Usage:
# changes = data_clean.pipe(calculate_change, 'revenue', periods=12) # Year-over-year for monthly data
# Per capita calculations (essential for fair comparisons)
def per_capita(value: float, population: float, multiplier: int = 100000) -> float:
"""Calculate per capita rate."""
return (value / population) * multiplier # Per 100,000 is standard
# Example: Crime rates
city_a = {'crimes': 5000, 'population': 100000}
city_b = {'crimes': 8000, 'population': 500000}
rate_a = per_capita(city_a['crimes'], city_a['population'])
rate_b = per_capita(city_b['crimes'], city_b['population'])
print(f"City A: {rate_a:.1f} crimes per 100,000 residents")
print(f"City B: {rate_b:.1f} crimes per 100,000 residents")
# City A actually has higher crime rate despite fewer total crimes!
def adjust_for_inflation(
amount: float | pd.Series,
from_year: int | pd.Series,
to_year: int,
country: str = 'US'
) -> float | pd.Series:
"""Adjust dollar amounts for inflation. Works with scalars or Series for .assign().
Args:
amount: Value(s) to adjust
from_year: Original year(s) of the amount
to_year: Target year to adjust to
country: ISO 2-letter country code (default 'US'). US uses BLS data via cpi package,
others use World Bank CPI data (FP.CPI.TOTL indicator)
"""
if country == 'US':
# Use cpi package for US (more accurate, from BLS)
if isinstance(from_year, pd.Series):
return pd.Series([cpi.inflate(amt, yr, to=to_year)
for amt, yr in zip(amount, from_year)], index=amount.index)
return cpi.inflate(amount, from_year, to=to_year)
else:
# Use World Bank data for other countries
cpi_data = wbdata.get_dataframe(
{'FP.CPI.TOTL': 'cpi'},
country=country
)['cpi'].to_dict()
from_cpi = pd.Series(from_year).map(cpi_data) if isinstance(from_year, pd.Series) else cpi_data[from_year]
to_cpi = cpi_data[to_year]
return amount * (to_cpi / from_cpi)
# Usage:
# adjust_for_inflation(100, 2020, 2024) # US by default
# adjust_for_inflation(100, 2020, 2024, country='GB') # UK
# df.assign(inf_adjust24=lambda x: adjust_for_inflation(x['amount'], x['year'], 2024, country='DE'))
# Always adjust when comparing dollars across years!
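Both branches above reduce to the same CPI-ratio arithmetic. A dependency-free sketch of that core step (the index values here are illustrative, not real CPI figures):

```python
def inflate_by_cpi(amount: float, cpi_from: float, cpi_to: float) -> float:
    """Adjust an amount using the ratio of two CPI index values."""
    return amount * (cpi_to / cpi_from)

# Hypothetical indices: 100 in the base year, 110 in the target year
adjusted = inflate_by_cpi(100.0, cpi_from=100.0, cpi_to=110.0)  # about 110.0
```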
## Reporting correlations responsibly
### What you CAN say
- "X and Y are correlated"
- "As X increases, Y tends to increase"
- "Areas with higher X also tend to have higher Y"
- "X is associated with Y"
### What you CANNOT say (without more evidence)
- "X causes Y"
- "X leads to Y"
- "Y happens because of X"
### Questions to ask before implying causation
1. Is there a plausible mechanism?
2. Does the timing make sense (cause before effect)?
3. Is there a dose-response relationship?
4. Has the finding been replicated?
5. Have confounding variables been controlled?
6. Are there alternative explanations?
### Red flags for spurious correlations
- Extremely high correlation (r > 0.95) with unrelated things
- No logical connection between variables
- Third variable could explain both
- Small sample size with high variance
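Before writing any of the phrases above, report the coefficient together with its sample size; a small r on a tiny n is not a finding. A minimal pandas sketch (the helper name and inputs are illustrative):

```python
import pandas as pd

def describe_correlation(x: pd.Series, y: pd.Series) -> dict:
    """Pearson r plus the sample size you should report alongside it."""
    pairs = pd.concat([x, y], axis=1).dropna()  # keep only complete pairs
    r = pairs.iloc[:, 0].corr(pairs.iloc[:, 1])
    return {'r': round(r, 3), 'n': len(pairs)}

result = describe_correlation(
    pd.Series([1, 2, 3, 4, None]),
    pd.Series([2, 4, 6, 8, 10]),
)
```

Dropping incomplete pairs before computing r matters: a correlation quietly computed on a fraction of your rows can mislead both you and your readers.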
## Choosing the right chart
### Comparison
- **Bar chart**: Compare categories
- **Grouped bar**: Compare categories across groups
- **Bullet chart**: Actual vs target
### Change over time
- **Line chart**: Trends over time
- **Area chart**: Cumulative totals over time
- **Slope chart**: Change between two points
### Distribution
- **Histogram**: Distribution of one variable
- **Box plot**: Compare distributions across groups
- **Violin plot**: Detailed distribution shape
### Relationship
- **Scatter plot**: Relationship between two variables
- **Bubble chart**: Three variables (x, y, size)
- **Connected scatter**: Change in relationship over time
### Composition
- **Pie chart**: Parts of a whole (use sparingly: no more than 5 slices; a donut chart is usually clearer)
- **Donut chart**: Parts of a whole
- **Stacked bar**: Parts of whole across categories
- **Treemap**: Hierarchical composition
### Geographic
- **Choropleth**: Values by region (use normalized data!)
- **Dot map**: Individual locations
- **Proportional symbol**: Magnitude at locations
import plotly.express as px
# Set default template for all charts
px.defaults.template = 'simple_white'
import plotly.graph_objects as go

def create_bar_chart(
    data: pd.DataFrame,
    title: str,
    source: str,
    x_val: str,
    y_val: str,
    desc: str = '',
    x_lab: str | None = None,
    y_lab: str | None = None
) -> go.Figure:
    """Create a bar chart with a title, optional description, and source line."""
    fig = px.bar(
        data,
        x=x_val,
        y=y_val,
        title=f'{title}<br><sup>{desc}</sup>' if desc else title,
        labels={x_val: x_lab or x_val, y_val: y_lab or y_val}
    )
    # Source attribution below the plot area
    fig.add_annotation(
        text=f'Source: {source}', xref='paper', yref='paper',
        x=0, y=-0.15, showarrow=False, font={'size': 11, 'color': 'gray'}
    )
    return fig
# Example
fig = create_bar_chart(
data,
title='Annual Widget Production',
source='Department of Widgets, 2024',
desc='The widget department increased its production dramatically starting in 2014.',
x_val='year',
y_val='widgets_prod',
x_lab='Year',
    y_lab='Units produced'
)
fig.show() # Interactive display
import pandas as pd
import datawrapper as dw
# Authentication: Set DATAWRAPPER_ACCESS_TOKEN environment variable,
# or read from file and pass to create()
with open('datawrapper_api_key.txt', 'r') as f:
api_key = f.read().strip()
# read in your data
data = pd.read_csv('../data/raw/data.csv')
# Create a bar chart using the new OOP API
chart = dw.BarChart(
title='My Bar Chart Title',
intro='Subtitle or description text',
data=data,
# Formatting options
value_label_format=dw.NumberFormat.ONE_DECIMAL,
show_value_labels=True,
value_label_alignment='left',
sort_bars=True, # sort by value
reverse_order=False,
# Source attribution
source_name='Your Data Source',
source_url='https://example.com',
byline='Your Name',
# Optional: custom base color
base_color='#1d81a2'
)
# Create and publish (uses DATAWRAPPER_ACCESS_TOKEN env var, or pass token)
chart.create(access_token=api_key)
chart.publish()
# Get chart URL and embed code
print(f"Chart ID: {chart.chart_id}")
print(f"Chart URL: https://datawrapper.dwcdn.net/{chart.chart_id}")
iframe_code = chart.get_iframe_code(responsive=True)
# Update existing chart with new data (for live-updating charts)
existing_chart = dw.get_chart('YOUR_CHART_ID') # retrieve by ID
existing_chart.data = new_df # assign new DataFrame
existing_chart.title = 'Updated Title' # modify properties
existing_chart.update() # push changes to Datawrapper
existing_chart.publish() # republish to make live
# Optional — Export chart as image
chart.export(filepath='chart.png', width=800, height=600)
# View the chart inline (e.g., in a notebook)
chart
## Chart integrity checklist
### Axes
- [ ] Y-axis starts at zero (for bar charts)
- [ ] Axis labels are clear
- [ ] Scale is appropriate (not truncated to exaggerate)
- [ ] Both axes labeled with units
### Data representation
- [ ] All data points visible
- [ ] Colors are distinguishable (including colorblind)
- [ ] Proportions are accurate
- [ ] 3D effects not distorting perception
### Context
- [ ] Title describes what's shown, not conclusion
- [ ] Time period clearly stated
- [ ] Source cited
- [ ] Sample size/methodology noted if relevant
- [ ] Uncertainty shown where appropriate
### Honesty
- [ ] Cherry-picking dates avoided
- [ ] Outliers explained, not hidden
- [ ] Dual axes justified (usually avoid)
- [ ] Annotations don't mislead
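A few of these checks can be automated before publication. A sketch that flags a truncated y-axis, assuming a Plotly-style figure dictionary such as the one returned by `fig.to_dict()`:

```python
def yaxis_starts_at_zero(fig_dict: dict) -> bool:
    """True if the y-axis range is unset (auto) or explicitly starts at zero."""
    yaxis = fig_dict.get('layout', {}).get('yaxis', {})
    rng = yaxis.get('range')
    return rng is None or rng[0] == 0

honest = {'layout': {'yaxis': {'range': [0, 100]}}}
truncated = {'layout': {'yaxis': {'range': [80, 100]}}}
```

Running this over every bar chart in a story catches the most common form of accidental exaggeration.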
### U.S. Census batch geocoder
Best for: U.S. addresses only. Returns Census geography (tract, block, FIPS codes) along with coordinates—essential for joining with Census demographic data.
Pros: Completely free with no API key required. Returns Census geographies (state/county FIPS, tract, block) that let you join with ACS/decennial Census data. Good match rates for standard U.S. addresses.
Cons: Limited to 10,000 addresses per batch. U.S. addresses only. Slower than commercial alternatives. Lower match rates for non-standard addresses (PO boxes, rural routes, new construction).
Use when: You need to geocode well-formatted U.S. addresses, or you don't have budget for a paid service.
# pip install censusbatchgeocoder
import censusbatchgeocoder
import pandas as pd
# DataFrame must have columns: id, address, city, state, zipcode
# (state and zipcode are optional but improve match rates)
def census_geocode(
df: pd.DataFrame,
id_col: str = 'id',
address_col: str = 'address',
city_col: str = 'city',
state_col: str = 'state',
zipcode_col: str = 'zipcode',
chunk_size: int = 9999
) -> pd.DataFrame:
"""
Geocode a DataFrame using the U.S. Census batch geocoder.
Automatically handles datasets larger than 10,000 rows by chunking.
Returns DataFrame with: latitude, longitude, state_fips, county_fips,
tract, block, is_match, is_exact, returned_address, geocoded_address
"""
# Rename columns to expected format
col_map = {id_col: 'id', address_col: 'address', city_col: 'city'}
if state_col and state_col in df.columns:
col_map[state_col] = 'state'
if zipcode_col and zipcode_col in df.columns:
col_map[zipcode_col] = 'zipcode'
renamed_df = df.rename(columns=col_map)
records = renamed_df.to_dict('records')
# Small dataset: geocode directly
if len(records) <= chunk_size:
results = censusbatchgeocoder.geocode(records)
return pd.DataFrame(results)
# Large dataset: process in chunks to stay under 10,000 limit
all_results = []
for i in range(0, len(records), chunk_size):
chunk = records[i:i + chunk_size]
print(f"Geocoding rows {i:,} to {i + len(chunk):,} of {len(records):,}...")
try:
results = censusbatchgeocoder.geocode(chunk)
all_results.extend(results)
except Exception as e:
print(f"Error on chunk starting at {i}: {e}")
for record in chunk:
all_results.append({**record, 'is_match': 'No_Match', 'latitude': None, 'longitude': None})
return pd.DataFrame(all_results)
# Usage:
geocoded = (pd
.read_csv('../data/raw/addresses.csv')
.assign(id=lambda x: x.index)
.pipe(census_geocode,
id_col='id',
address_col='street',
        city_col='city',
state_col='state',
zipcode_col='zip'))
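After any batch geocode, report the match rate before mapping or joining the results. A small sketch assuming the `is_match` column returned above (the Census geocoder reports values such as 'Match' and 'No_Match'):

```python
import pandas as pd

def match_rate(geocoded: pd.DataFrame, match_col: str = 'is_match') -> float:
    """Fraction of rows the geocoder successfully matched (0.0 to 1.0)."""
    return float((geocoded[match_col] == 'Match').mean())

results = pd.DataFrame({'is_match': ['Match', 'Match', 'No_Match', 'Match']})
rate = match_rate(results)  # 0.75
```

If the match rate is low, investigate the unmatched rows before publishing: failures often cluster in specific neighborhoods or address types, which can bias a map.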
### Google Maps Geocoding API
Best for: International addresses, high match rates, and messy/non-standard address formats.
Pros: Excellent match rates even for poorly formatted addresses. Works worldwide. Fast and reliable. Returns rich metadata (place types, address components, place IDs).
Cons: Costs money ($5 per 1,000 requests after free tier). Requires API key and billing account. Does not return Census geography—you'd need to do a separate spatial join.
Use when: You need to geocode international addresses, have messy address data that the Census geocoder can't match, or need the highest possible match rate and have budget for it.
import googlemaps
import pandas as pd
from typing import Optional
def geocode_address_google(address: str, api_key: str) -> Optional[dict]:
"""
Geocode address using Google Maps API.
Requires API key with Geocoding API enabled.
"""
gmaps = googlemaps.Client(key=api_key)
result = gmaps.geocode(address)
if result:
location = result[0]['geometry']['location']
return {
'formatted_address': result[0]['formatted_address'],
'lat': location['lat'],
'lon': location['lng'],
'place_id': result[0]['place_id']
}
return None
# Batch geocode a DataFrame
def batch_geocode(df: pd.DataFrame, address_col: str, api_key: str) -> pd.DataFrame:
gmaps = googlemaps.Client(key=api_key)
results = []
for address in df[address_col]:
try:
result = gmaps.geocode(address)
if result:
loc = result[0]['geometry']['location']
results.append({'lat': loc['lat'], 'lon': loc['lng']})
else:
results.append({'lat': None, 'lon': None})
except Exception:
results.append({'lat': None, 'lon': None})
return pd.concat([df, pd.DataFrame(results)], axis=1)
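Commercial geocoders bill per request, so de-duplicate addresses before hitting the API. A generic memoizing wrapper, sketched to work with any single-address geocode function (such as `geocode_address_google` above; the normalization rule is an assumption you may want to adjust):

```python
from typing import Callable, Optional

def cached_geocoder(
    geocode_fn: Callable[[str], Optional[dict]]
) -> Callable[[str], Optional[dict]]:
    """Wrap a geocode function so repeated addresses hit a local cache."""
    cache: dict = {}

    def wrapper(address: str) -> Optional[dict]:
        key = address.strip().lower()  # normalize before lookup
        if key not in cache:
            cache[key] = geocode_fn(address)
        return cache[key]

    return wrapper
```

Wrapping the client this way means a column with 40% duplicate addresses costs 40% less, and reruns during development are free.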
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point
# Read data from various formats
gdf = gpd.read_file('data.geojson') # GeoJSON
gdf = gpd.read_file('data.shp') # Shapefile
gdf = gpd.read_file('https://example.com/data.geojson') # From URL
gdf = gpd.read_parquet('data.parquet') # GeoParquet (fast!)
# Transform DataFrame with lat/lon to GeoDataFrame
df = pd.read_csv('locations.csv')
geometry = [Point(xy) for xy in zip(df['longitude'], df['latitude'])]
gdf = gpd.GeoDataFrame(df, geometry=geometry)
# Set CRS (Coordinate Reference System)
# EPSG:4326 = WGS84 (standard latitude, longitude)
gdf = gdf.set_crs('EPSG:4326')
# Transform to different CRS (for area/distance calculations, use projected CRS)
gdf_projected = gdf.to_crs('EPSG:3857') # Web Mercator, for distance in meters
# Basic spatial operations
# Area of each shape (use a projected CRS so units are meaningful, e.g., square meters)
gdf['area'] = gdf_projected.geometry.area
# Center of each shape (also best computed in a projected CRS)
gdf['centroid'] = gdf_projected.geometry.centroid
# Draw a 1 km buffer around each geometry (EPSG:3857 units are meters)
gdf['buffer_1km'] = gdf_projected.geometry.buffer(1000)
# Spatial join: find points within polygons
points = gpd.read_file('points.geojson')
polygons = gpd.read_file('boundaries.geojson')
joined = gpd.sjoin(points, polygons, predicate='within')
# Dissolve: merge geometries by attribute
dissolved = gdf.dissolve(by='state', aggfunc='sum')
# Export to various formats
gdf.to_parquet('output.parquet') # GeoParquet (recommended)
gdf.to_file('output.geojson', driver='GeoJSON')  # for tools that don't support GeoParquet
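When you just need a quick distance between two coordinate pairs without reprojecting, the haversine formula is enough. A stdlib-only sketch using a mean Earth radius of 6,371 km:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))

# New York City to Los Angeles: roughly 3,940 km
nyc_to_la = haversine_km(40.7128, -74.0060, 34.0522, -118.2437)
```

For publication-grade measurements (areas, buffers, legal distances), still reproject with `to_crs()` as shown above; haversine is for sanity checks and sorting by proximity.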
## Mapping: .explore(), lonboard, and Datawrapper
### .explore()
Best for: Quick exploration and prototyping during data analysis.
Pros: Built into GeoPandas—method is available on any GeoDataFrame. Great for exploratory data analysis—checking that your data looks right, exploring spatial patterns, and iterating quickly on map designs.
Cons: Becomes slow with large datasets (>100k features). Limited customization compared to dedicated mapping libraries. Requires extra dependencies to be installed.
Use when: You're in the middle of analysis and want to quickly visualize your GeoDataFrame without switching tools.
Required dependencies:
pip install folium mapclassify matplotlib
- folium: required for .explore() to work at all (renders the interactive map)
- mapclassify: required when using the scheme= parameter for classification (e.g., 'naturalbreaks', 'quantiles', 'equalinterval')
- matplotlib: required for colormap (cmap=) support
import geopandas as gpd
gdf.explore()
gdf.explore(
    column='population',             # Column for color scale
    cmap='YlOrRd',                   # Matplotlib colormap
    scheme='naturalbreaks',          # Classification scheme (needs mapclassify)
    k=5,                             # Number of bins
    legend=True,
    tooltip=['name', 'population'],  # Columns to show on hover
    popup=True,                      # Show all columns on click
    tiles='CartoDB positron',        # Background tiles
    style_kwds={'color': 'black', 'weight': 0.5}  # Border style
)
### lonboard
Best for: Large datasets and high-performance visualization in Jupyter notebooks.
Pros: GPU-accelerated rendering via deck.gl can handle millions of points smoothly. Excellent interactivity—pan, zoom, and hover work fluidly even with massive datasets. Native support for GeoArrow format for efficient data transfer.
Cons: Requires separate installation (pip install lonboard). Styling options are more technical (RGBA arrays, deck.gl conventions).
Use when: You have large point datasets (crime incidents, sensor readings, business locations) or need smooth interactivity with 100k+ features.
import geopandas as gpd
from lonboard import viz, Map, ScatterplotLayer, PolygonLayer
# Quick visualization (auto-detects geometry type)
viz(gdf)
# Custom ScatterplotLayer for points
layer = ScatterplotLayer.from_geopandas(
gdf,
get_radius=100,
get_fill_color=[255, 0, 0, 200], # RGBA
pickable=True
)
m = Map(layer)
m
# PolygonLayer with color based on column
from lonboard.colormap import apply_continuous_cmap
import matplotlib.pyplot as plt
colors = apply_continuous_cmap(gdf['value'], plt.cm.viridis)
layer = PolygonLayer.from_geopandas(
gdf,
get_fill_color=colors,
get_line_color=[0, 0, 0, 100],
pickable=True
)
Map(layer)
### Datawrapper
Best for: Publication-ready choropleth and proportional symbol maps for articles and reports.
Pros: Beautiful, professional defaults out of the box. Generates embeddable, responsive iframes that work in any CMS. Readers can interact (hover, click) without running any code. Accessible and mobile-friendly. Easy to update data programmatically.
Cons: Requires a Datawrapper account (free tier available). Limited to Datawrapper's supported boundary files—you can't bring arbitrary geometries. Less flexibility for custom visualizations.
Use when: You need a polished map for publication. Ideal for choropleth maps showing statistics by region (unemployment by state, COVID cases by county, election results by district). Your audience will view the map in a browser, not a notebook.
Unlike .explore() or lonboard, you don't pass raw geometry—instead you match your data to Datawrapper's built-in boundary files using standard codes (FIPS, ISO, etc.).
import datawrapper as dw
import pandas as pd
# Read API key
with open('datawrapper_api_key.txt', 'r') as f:
api_key = f.read().strip()
# Prepare data with location codes that match Datawrapper's boundaries
# For US states: use 2-letter abbreviations or FIPS codes
# For countries: use ISO 3166-1 alpha-2 codes
df = pd.DataFrame({
'state': ['AL', 'AK', 'AZ', 'AR', 'CA'], # State abbreviations
'unemployment_rate': [4.9, 3.2, 7.1, 4.2, 5.8]
})
# Create a choropleth map
chart = dw.ChoroplethMap(
title='Unemployment Rate by State',
intro='Percentage of labor force unemployed, 2024',
data=df,
# Map configuration
basemap='us-states', # Built-in US states boundaries
basemap_key='state', # Column in your data with location codes
value_column='unemployment_rate',
# Styling
color_palette='YlOrRd', # Color scheme
legend_title='Unemployment %',
# Attribution
source_name='Bureau of Labor Statistics',
source_url='https://www.bls.gov/',
byline='Your Name'
)
# Create and publish
chart.create(access_token=api_key)
chart.publish()
# Get embed code for your article
iframe = chart.get_iframe_code(responsive=True)
print(f"Chart URL: https://datawrapper.dwcdn.net/{chart.chart_id}")
# Update with new data (for live-updating maps)
new_df = pd.DataFrame({...}) # Updated data
existing_chart = dw.get_chart('YOUR_CHART_ID')
existing_chart.data = new_df
existing_chart.update()
existing_chart.publish()
Available Datawrapper basemaps include:
- us-states, us-counties, us-congressional-districts
- world, europe, africa, asia
- germany-states, uk-constituencies