⚠️

重要前提

安装AI Skills的关键前提是：必须科学上网，且开启TUN模式，这一点至关重要，直接决定安装能否顺利完成，在此郑重提醒三遍：科学上网，科学上网，科学上网。查看完整安装教程 →

数据新闻方法论：从数据获取到故事呈现的完整指南与工具

data-journalism by jamditis/claude-skills-journalism

100 周安装量

107 GitHub Stars

GitHub

安装命令

npx skills add https://github.com/jamditis/claude-skills-journalism --skill data-journalism

方法论内容创作数据分析

🇨🇳中文介绍

数据新闻方法论

在新闻工作中寻找、分析和呈现数据的系统性方法。

数据新闻的故事结构

数据新闻框架

数据新闻框架由奈特-里德报系记者、哈佛大学尼曼研究员、北卡罗来纳大学教堂山分校教授菲利普·迈耶建立。在其阐述其思想的著作《新精确新闻学》中，迈耶鼓励记者采用科学方法，将新闻工作"当作一门科学"来对待：
- 进行观察 / 提出问题
- 研究问题 / 收集、存储和检索数据
- 提出假设
- 使用定性（访谈、文件等）和定量（数据分析等）方法检验假设
- 分析结果并将其提炼为最重要的发现
- 向受众呈现

这个过程应被视为迭代的，而非线性的。

## 数据故事弧

### 1. 引子（核心段落）
- 关键发现是什么？
- 读者为何要关心？
- 对人类有何影响？

### 2. 证据
- 展示数据
- 解释方法
- 承认局限性

### 3. 背景
- 与过去相比如何？
- 与其他地方相比如何？
- 趋势是什么？

### 4. 人文元素
- 说明数据的个体案例
- 专家解读
- 受影响者的声音

### 5. 影响
- 这对未来意味着什么？
- 还有哪些问题未解决？
- 可能导致什么行动？

### 6. 方法说明框
- 数据从何而来？
- 如何分析的？
- 有哪些局限性？
- 读者如何进一步探索？

方法文档模板

## 我们如何进行此项分析

### 数据来源
[列出所有数据来源，附链接和访问日期]

### 时间范围
[明确说明涵盖的时间段]

### 定义
[定义关键术语及其操作化方式]

### 分析步骤
1. [分析第一步]
2. [第二步]
3. [继续...]

### 局限性
- [局限性 1]
- [局限性 2]

### 我们排除了什么及原因
- [排除类别]：[原因]

### 验证
[发现如何被验证/检查]

### 代码和数据可用性
[如果共享代码/数据，提供 GitHub 仓库链接]

### 联系方式
[读者如何联系您提问]

数据获取

公共数据源

## 联邦数据源

### 通用
- Data.gov - 联邦开放数据门户
- 人口普查局 (census.gov) - 人口统计、经济数据
- 劳工统计局 (bls.gov) - 就业、通胀、工资
- 经济分析局 (bea.gov) - GDP、经济账户
- 联邦储备系统 (federalreserve.gov) - 金融数据
- SEC EDGAR - 公司申报文件

### 特定领域
- 环境保护局 (epa.gov/data) - 环境数据
- 食品药品监督管理局 (fda.gov/data) - 药品批准、召回、不良事件
- CDC WONDER - 健康统计
- 国家公路交通安全管理局 - 车辆安全数据
- 交通部 - 交通统计
- 联邦选举委员会 - 竞选资金
- USASpending.gov - 联邦合同和拨款

### 州和地方
- 州开放数据门户（搜索："[州名] open data"）
- Socrata 支持的网站（许多城市/州）
- OpenStreets、市政 GIS 门户
- 州审计长/审计员报告

广告位招租

在这里展示您的产品或服务

触达数万 AI 开发者，精准高效

联系我们

数据清洗和准备

from typing import Any

import pandas as pd
import numpy as np
from rapidfuzz import fuzz
from itertools import combinations

# Inflation adjustment
import cpi
import wbdata

def standardize_name(name: Any) -> str | None:
    """Standardize name format to 'First Last'."""
    if pd.isna(name):
        return None
    name = str(name).strip().upper()
    # Handle "LAST, FIRST" format
    if ',' in name:
        parts = name.split(',')
        name = f"{parts[1].strip()} {parts[0].strip()}"
    return name

def parse_date(date_str: Any) -> pd.Timestamp | None:
    """Parse dates in various formats."""
    if pd.isna(date_str):
        return None

    formats = [
        '%m/%d/%Y', '%Y-%m-%d', '%B %d, %Y',
        '%d-%b-%y', '%m-%d-%Y', '%Y/%m/%d'
    ]

    for fmt in formats:
        try:
            return pd.to_datetime(date_str, format=fmt)
        except:
            continue

    # Fall back to pandas parser
    try:
        return pd.to_datetime(date_str)
    except:
        return None


def handle_missing(df:pd.DataFrame, thresh:int | None, per_thresh:float | None, required_col:str | None) -> pd.DataFrame:
    '''Handles Dataframes with too many missing values, defined by the user.''' 
    if thresh and data_clean.isna().sum() >= thresh:
        return df.dropna(subset=[required_col]).reset_index(drop=True).copy()
    
    elif per_thresh and (data_clean.isna().sum() / len(data_clean) * 100) >= per_thresh:
        return df.dropna(subset=[required_col]).reset_index(drop=True).copy()
    
    else:
        return df


def handle_duplicates(df:pd.DataFrame, thresh=int | None)
    '''Handle duplicate rows of data.'''
    if thresh and df.duplicated().sum() >= thresh:
        return df.drop_duplicates().reset_index(drop=True).copy()
    else:
        return df


def flag_similar_names(df: pd.DataFrame, name_col: str, threshold: int = 85) -> pd.DataFrame:
    """Flag rows that have potential duplicate names using vectorized comparison."""
    
    names = df[name_col].dropna().unique()
    
    # Use combinations() to avoid nested loop and duplicate comparisons
    dup_names: set[Any] = {
        name
        for name1, name2 in combinations(names, 2)
        if fuzz.ratio(str(name1).lower(), str(name2).lower()) >= threshold
        for name in (name1, name2)
    }
    
    df['has_similar_name'] = df[name_col].isin(dup_names)
    return df


def flag_outliers(series: pd.Series, method: str = 'iqr', threshold: float = 1.5) -> pd.Series:
    """Flag statistical outliers."""
    if method == 'iqr':
        Q1 = series.quantile(0.25)
        Q3 = series.quantile(0.75)
        IQR = Q3 - Q1
        lower = Q1 - threshold * IQR
        upper = Q3 + threshold * IQR
        return (series < lower) | (series > upper)
    elif method == 'zscore':
        z_scores = np.abs((series - series.mean()) / series.std())
        return z_scores > threshold



# use descriptive variable names and chain methods
data_clean = (pd

            # Load messy data — raw_data is a placeholder
            # Be sure to use the right reader for the filetype
            .read_csv('..data/raw/raw_data.csv')

            # DATA TYPE CORRECTIONS
            # Ensure proper types for analysis
            .assign(# Convert to numeric (handling errors)
                    amount = lambda x: pd.to_numeric(x['amount'], errors='coerce'),
                    
                    # Convert to categorical (saves memory, enables ordering)
                    status = lambda x: pd.to_Categorical(x['status'])) 
            
            .assign(
                    # INCONSISTENT FORMATTING
                    # Problem: Names in different formats
                    # ie. "SMITH, JOHN" vs "John Smith" vs "smith john"
                    name_clean = lambda x: standaridize_name(x['name']),
                    
                    # DATE INCONSISTENCIES
                    # Problem: Dates in multiple formats
                    # ie. "01/15/2024", "2024-01-15", "January 15, 2024", "15-Jan-24"
                    parse_date = lambda x: parse_date(x['date']),
                    
                    # OUTLIERS
                    # Identify potential data entry errors
                    amount_outlier = lambda x: flag_outliers(x['amount']),
                    
                    )
            
            # Fuzzy duplicates (similar but not identical)
            # Use record linkage or manual review
            .pipe(find_similar_names, name_col='name_clean', threshold=85)

            # MISSING VALUES
            # Strategy depends on context
            # First check missing value patterns
            .pipe(handle_missing, thresh=None, per_thresh=None)

            # DUPLICATES — Find and handle duplicates
            .pipe(handle_duplicates, thresh=None)
            
            .reset_index(drop=True)
            .copy())

## 分析前数据验证

### 结构检查
- [ ] 行数与预期相符
- [ ] 列数和列名正确
- [ ] 数据类型合适
- [ ] 没有意外的空列

### 内容检查
- [ ] 日期范围合理
- [ ] 数值在预期范围内
- [ ] 分类值与预期选项匹配
- [ ] 地理数据解析正确
- [ ] 预期唯一的 ID 确实唯一

### 一致性检查
- [ ] 总计符合预期值
- [ ] 交叉表平衡
- [ ] 相关字段一致
- [ ] 时间序列连续

### 来源验证
- [ ] 可追溯至原始来源
- [ ] 方法已记录
- [ ] 已知局限性已注明
- [ ] 更新频率已了解

新闻统计分析方法

带背景的基本统计

# 任何数据集的基本统计
def describe_for_journalism(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Generate journalist-friendly statistics."""
    stats = df[col].describe(percentiles=[0.25, 0.5, 0.75, 0.9, 0.99])
    
    # Add skewness to the describe() output
    stats['skewness'] = df[col].skew()
    
    return stats.to_frame(name=col)

# Example interpretation
stats = describe_for_journalism(salaries, 'salary')

print(f"""
ANALYSIS
---------------
We analyzed {stats['count']:,} salary records.

The median salary is ${stats['median']:,.0f}, meaning half of workers
earn more and half earn less.

The average salary is ${stats['mean']:,.0f}, which is
{'higher' if stats['mean'] > stats['median'] else 'lower'} than the median,
indicating the distribution is {'right-skewed (pulled up by high earners)'
if stats['skewness'] > 0 else 'left-skewed'}.

The top 10% of earners make at least ${stats['90th_percentile']:,.0f}.
The top 1% make at least ${stats['99th_percentile']:,.0f}.
""")

# Calculate change metrics for a column
def calculate_change(df: pd.DataFrame, col: str, periods: int = 1) -> pd.DataFrame:
    """Add change metrics to a DataFrame using built-in pandas methods.
    
    Args:
        df: Input DataFrame
        col: Column to calculate changes for
        periods: Number of rows to look back (1=previous row, 12=year-over-year for monthly)
    """
    return df.assign(
        absolute_change=df[col].diff(periods),
        percent_change=df[col].pct_change(periods) * 100,
        direction=np.sign(df[col].diff(periods)).map({1: 'increased', -1: 'decreased', 0: 'unchanged'})
    )

# Usage:
# changes = data_clean.pipe(calculate_change, 'revenue', periods=12) # Year-over-year for monthly data

# Per capita calculations (essential for fair comparisons)
def per_capita(value: float, population: float, multiplier: int = 100000) -> float:
    """Calculate per capita rate."""
    return (value / population) * multiplier  # Per 100,000 is standard

# Example: Crime rates
city_a = {'crimes': 5000, 'population': 100000}
city_b = {'crimes': 8000, 'population': 500000}

rate_a = per_capita(city_a['crimes'], city_a['population'])
rate_b = per_capita(city_b['crimes'], city_b['population'])

print(f"City A: {rate_a:.1f} crimes per 100,000 residents")
print(f"City B: {rate_b:.1f} crimes per 100,000 residents")
# City A actually has higher crime rate despite fewer total crimes!


def adjust_for_inflation(
    amount: float | pd.Series, 
    from_year: int | pd.Series, 
    to_year: int,
    country: str = 'US'
) -> float | pd.Series:
    """Adjust dollar amounts for inflation. Works with scalars or Series for .assign().
    
    Args:
        amount: Value(s) to adjust
        from_year: Original year(s) of the amount
        to_year: Target year to adjust to
        country: ISO 2-letter country code (default 'US'). US uses BLS data via cpi package,
                 others use World Bank CPI data (FP.CPI.TOTL indicator)
    """
    if country == 'US':
        # Use cpi package for US (more accurate, from BLS)
        if isinstance(from_year, pd.Series):
            return pd.Series([cpi.inflate(amt, yr, to=to_year) 
                            for amt, yr in zip(amount, from_year)], index=amount.index)
        return cpi.inflate(amount, from_year, to=to_year)
    else:
        # Use World Bank data for other countries
        cpi_data = wbdata.get_dataframe(
            {'FP.CPI.TOTL': 'cpi'}, 
            country=country
        )['cpi'].to_dict()
        
        from_cpi = pd.Series(from_year).map(cpi_data) if isinstance(from_year, pd.Series) else cpi_data[from_year]
        to_cpi = cpi_data[to_year]
        return amount * (to_cpi / from_cpi)

# Usage:
# adjust_for_inflation(100, 2020, 2024)  # US by default
# adjust_for_inflation(100, 2020, 2024, country='GB')  # UK
# df.assign(inf_adjust24=lambda x: adjust_for_inflation(x['amount'], x['year'], 2024, country='DE'))

# Always adjust when comparing dollars across years!

使用 Plotly Express 进行探索性交互可视化

import plotly.express as px

# Set default template for all charts
px.defaults.template = 'simple_white'

def create_bar_chart(
    data: pd.DataFrame, 
    title: str, 
    source: str,
    desc: str  = '', 
    x_val: str, 
    y_val: str,
    x_lab: str | None,
    y_lab: str | None
) -> px.bar:
    """Create a bar chart."""
    
    fig = px.bar(
        data, 
        x=x_val, 
        y=y_val,
        text=desc,
        title=title,
        labels={'category': (x_lab if x_lab else x_val), 'value': (y_lab if y_lab else y_val)}
    )
    
    return fig

# Example
fig = create_bar_chart(
    data,
    title='Annual Widget Production',
    source='Department of Widgets, 2024',
    desc='The widget department increased its production dramatically starting in 2014.',
    x_val='year',
    y_val='widgets_prod',
    x_lab='Year',
    y_label='Units produced'
)

fig.show()  # Interactive display

使用 Datawrapper 制作可发布的自动化数据可视化

import pandas as pd
import datawrapper as dw

# Authentication: Set DATAWRAPPER_ACCESS_TOKEN environment variable, 
# or read from file and pass to create()
with open('datawrapper_api_key.txt', 'r') as f:
    api_key = f.read().strip()

# read in your data
data = pd.read_csv('../data/raw/data.csv')

# Create a bar chart using the new OOP API
chart = dw.BarChart(
    title='My Bar Chart Title',
    intro='Subtitle or description text',
    data=data,

    # Formatting options
    value_label_format=dw.NumberFormat.ONE_DECIMAL,
    show_value_labels=True,
    value_label_alignment='left',
    sort_bars=True,  # sort by value
    reverse_order=False,

    # Source attribution
    source_name='Your Data Source',
    source_url='https://example.com',
    byline='Your Name',

    # Optional: custom base color
    base_color='#1d81a2'
)

# Create and publish (uses DATAWRAPPER_ACCESS_TOKEN env var, or pass token)
chart.create(access_token=api_key)
chart.publish()

# Get chart URL and embed code
print(f"Chart ID: {chart.chart_id}")
print(f"Chart URL: https://datawrapper.dwcdn.net/{chart.chart_id}")
iframe_code = chart.get_iframe_code(responsive=True)

# Update existing chart with new data (for live-updating charts)
existing_chart = dw.get_chart('YOUR_CHART_ID')  # retrieve by ID
existing_chart.data = new_df  # assign new DataFrame
existing_chart.title = 'Updated Title'  # modify properties
existing_chart.update()  # push changes to Datawrapper
existing_chart.publish()  # republish to make live

# Optional — Export chart as image
chart.export(filepath='chart.png', width=800, height=600)

#view chart
chart

避免误导性可视化

## 图表完整性检查清单

### 坐标轴
- [ ] Y 轴从零开始（对于条形图）
- [ ] 坐标轴标签清晰
- [ ] 比例合适（未截断以夸大）
- [ ] 两个坐标轴都标有单位

### 数据表示
- [ ] 所有数据点可见
- [ ] 颜色可区分（包括色盲）
- [ ] 比例准确
- [ ] 3D 效果未扭曲感知

### 背景
- [ ] 标题描述显示内容，而非结论
- [ ] 时间段明确说明
- [ ] 来源引用
- [ ] 如相关，注明样本量/方法
- [ ] 在适当时显示不确定性

### 诚实性
- [ ] 避免选择性使用日期
- [ ] 异常值已解释，未隐藏
- [ ] 双坐标轴的使用有正当理由（通常避免）
- [ ] 注释不误导

处理地理空间数据

美国人口普查地理编码器

最适合： 仅限美国地址。返回人口普查地理信息（普查区、街区、FIPS 代码）以及坐标——这对于与人口普查人口统计数据连接至关重要。

优点： 完全免费，无需 API 密钥。返回人口普查地理信息（州/县 FIPS、普查区、街区），可让您与 ACS/十年人口普查数据连接。对于标准美国地址匹配率良好。

缺点： 每批最多 10,000 个地址。仅限美国地址。比商业替代方案慢。对于非标准地址（邮政信箱、乡村路线、新建建筑）匹配率较低。

使用时机： 您需要地理编码格式良好的美国地址，或者没有预算使用付费服务。

# pip install censusbatchgeocoder
import censusbatchgeocoder
import pandas as pd

# DataFrame must have columns: id, address, city, state, zipcode
# (state and zipcode are optional but improve match rates)

def census_geocode(
    df: pd.DataFrame,
    id_col: str = 'id',
    address_col: str = 'address',
    city_col: str = 'city',
    state_col: str = 'state',
    zipcode_col: str = 'zipcode',
    chunk_size: int = 9999
) -> pd.DataFrame:
    """
    Geocode a DataFrame using the U.S. Census batch geocoder.
    Automatically handles datasets larger than 10,000 rows by chunking.
    
    Returns DataFrame with: latitude, longitude, state_fips, county_fips, 
    tract, block, is_match, is_exact, returned_address, geocoded_address
    """
    # Rename columns to expected format
    col_map = {id_col: 'id', address_col: 'address', city_col: 'city'}
    if state_col and state_col in df.columns:
        col_map[state_col] = 'state'
    if zipcode_col and zipcode_col in df.columns:
        col_map[zipcode_col] = 'zipcode'
    
    renamed_df = df.rename(columns=col_map)
    records = renamed_df.to_dict('records')
    
    # Small dataset: geocode directly
    if len(records) <= chunk_size:
        results = censusbatchgeocoder.geocode(records)
        return pd.DataFrame(results)
    
    # Large dataset: process in chunks to stay under 10,000 limit
    all_results = []
    for i in range(0, len(records), chunk_size):
        chunk = records[i:i + chunk_size]
        print(f"Geocoding rows {i:,} to {i + len(chunk):,} of {len(records):,}...")
        
        try:
            results = censusbatchgeocoder.geocode(chunk)
            all_results.extend(results)
        except Exception as e:
            print(f"Error on chunk starting at {i}: {e}")
            for record in chunk:
                all_results.append({**record, 'is_match': 'No_Match', 'latitude': None, 'longitude': None})
    
    return pd.DataFrame(all_results)

# Usage:
geocoded = (pd
              .read_csv('../data/raw/addresses.csv')
              .assign(id=lambda x: x.index)
              .pipe(census_geocode, 
                    id_col='id', 
                    address_col='street', 
                    city_col='city'.
                    state_col='state',
                    zipcode_col='zip'))

Google 地图地理编码器

最适合： 国际地址、高匹配率以及混乱/非标准地址格式。

优点： 即使对于格式不佳的地址，匹配率也非常出色。全球可用。快速可靠。返回丰富的元数据（地点类型、地址组件、地点 ID）。

缺点： 需要付费（超出免费额度后每 1,000 次请求 5 美元）。需要 API 密钥和结算账户。不返回人口普查地理信息——您需要进行单独的空间连接。

使用时机： 您需要地理编码国际地址、拥有美国人口普查地理编码器无法匹配的混乱地址数据，或者需要最高可能的匹配率且有预算支持。

import googlemaps
from typing import Optional

def geocode_address_google(address: str, api_key: str) -> Optional[dict]:
    """
    Geocode address using Google Maps API.
    Requires API key with Geocoding API enabled.
    """
    gmaps = googlemaps.Client(key=api_key)
    result = gmaps.geocode(address)
    
    if result:
        location = result[0]['geometry']['location']
        return {
            'formatted_address': result[0]['formatted_address'],
            'lat': location['lat'],
            'lon': location['lng'],
            'place_id': result[0]['place_id']
        }
    return None

# Batch geocode a DataFrame
def batch_geocode(df: pd.DataFrame, address_col: str, api_key: str) -> pd.DataFrame:
    gmaps = googlemaps.Client(key=api_key)
    
    results = []
    for address in df[address_col]:
        try:
            result = gmaps.geocode(address)
            if result:
                loc = result[0]['geometry']['location']
                results.append({'lat': loc['lat'], 'lon': loc['lng']})
            else:
                results.append({'lat': None, 'lon': None})
        except Exception:
            results.append({'lat': None, 'lon': None})
    
    return pd.concat([df, pd.DataFrame(results)], axis=1)

import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

# Read data from various formats
gdf = gpd.read_file('data.geojson')                    # GeoJSON
gdf = gpd.read_file('data.shp')                         # Shapefile
gdf = gpd.read_file('https://example.com/data.geojson') # From URL
gdf = gpd.read_parquet('data.parquet')                  # GeoParquet (fast!)

# Transform DataFrame with lat/lon to GeoDataFrame
df = pd.read_csv('locations.csv')
geometry = [Point(xy) for xy in zip(df['longitude'], df['latitude'])]
gdf = gpd.GeoDataFrame(df, geometry=geometry)

# Set CRS (Coordinate Reference System)
# EPSG:4326 = WGS84 (standard latitude, longitude)
gdf = gdf.set_crs('EPSG:4326')

# Transform to different CRS (for area/distance calculations, use projected CRS)
gdf_projected = gdf.to_crs('EPSG:3857')  # Web Mercator, for distance in meters

# Basic spatial operations

#Find the area of a shape
gdf['area'] = gdf_projected.geometry.area 

#Find the center of a shape
gdf['centroid'] = gdf.geometry.centroid

#Draw a 1km boundary around a point
gdf['buffer_1km'] = gdf_projected.geometry.buffer(1000) #when set to CRS 3857

# Spatial join: find points within polygons
points = gpd.read_file('points.geojson')
polygons = gpd.read_file('boundaries.geojson')
joined = gpd.sjoin(points, polygons, predicate='within')

# Dissolve: merge geometries by attribute
dissolved = gdf.dissolve(by='state', aggfunc='sum')

# Export to various formats
gdf.to_parquet('output.parquet')          # GeoParquet (recommended)
gdf.to_file('output.geojson', driver='GeoJSON') #for tools that dont support GeoParquet

使用 `.explore()`、`lonboard` 和 Datawrapper 进行地理可视化

最适合： 数据分析期间的快速探索和原型设计。

优点： 内置于 GeoPandas 中——任何 GeoDataFrame 都可用此方法。非常适合探索性数据分析——检查数据是否正确、探索空间模式以及快速迭代地图设计。

缺点： 处理大型数据集（>10 万个要素）时变慢。与专用地图库相比，自定义选项有限。需要安装额外的依赖项。

使用时机： 您正在分析过程中，希望在不切换工具的情况下快速可视化您的 GeoDataFrame。

所需依赖项：

pip install folium mapclassify matplotlib

folium - .explore() 正常工作所必需（渲染交互式地图）
mapclassify - 使用 scheme= 参数进行分类（例如 'naturalbreaks'、'quantiles'、'equalinterval'）时所必需
matplotlib - 支持色彩映射（cmap=）所必需

import geopandas as gpd

folium, mapclassify, and matplotlib must be installed but don't need to be imported

geopandas imports them automatically when you call .explore()

Basic interactive map (uses folium under the hood)

gdf.explore()

Choropleth map with customization

(requires mapclassify for scheme parameter)

gdf.explore( column='population', # Column for color scale cmap='YlOrRd', # Matplotlib colormap scheme='naturalbreaks', # Classification scheme (needs mapclassify) k=5, # Number of bins legend=True, tooltip=['name', 'population'], # Columns to show on hover popup=True, # Show all columns on click tiles='CartoDB positron', # Background tiles style_kwds={'color': 'black', 'weight': 0.5} # Border style )

最适合： Jupyter 笔记本中的大型数据集和高性能可视化。

优点： 通过 deck.gl 进行 GPU 加速渲染，可以流畅处理数百万个点。出色的交互性——即使处理海量数据集，平移、缩放和悬停也能流畅工作。原生支持 GeoArrow 格式，可实现高效数据传输。

缺点： 需要单独安装（pip install lonboard）。样式选项更技术化（RGBA 数组、deck.gl 约定）。

使用时机： 您拥有大型点数据集（犯罪事件、传感器读数、商业位置）或需要与 10 万+要素进行流畅交互。

import geopandas as gpd
from lonboard import viz, Map, ScatterplotLayer, PolygonLayer

# Quick visualization (auto-detects geometry type)
viz(gdf)

# Custom ScatterplotLayer for points
layer = ScatterplotLayer.from_geopandas(
    gdf,
    get_radius=100,
    get_fill_color=[255, 0, 0, 200],  # RGBA
    pickable=True
)
m = Map(layer)
m

# PolygonLayer with color based on column
from lonboard.colormap import apply_continuous_cmap
import matplotlib.pyplot as plt

colors = apply_continuous_cmap(gdf['value'], plt.cm.viridis)
layer = PolygonLayer.from_geopandas(
    gdf,
    get_fill_color=colors,
    get_line_color=[0, 0, 0, 100],
    pickable=True
)
Map(layer)

最适合： 用于文章和报告的可发布分级统计图和比例符号图。

优点： 开箱即用的精美、专业默认设置。生成可在任何 CMS 中使用的可嵌入、响应式 iframe。读者无需运行任何代码即可交互（悬停、点击）。可访问且移动友好。易于以编程方式更新数据。

缺点： 需要 Datawrapper 账户（提供免费套餐）。仅限于 Datawrapper 支持的边界文件——您不能使用任意几何图形。自定义可视化灵活性较低。

使用时机： 您需要用于发布的精美地图。非常适合显示按区域统计的分级统计图（按州划分的失业率、按县划分的 COVID 病例、按选区划分的选举结果）。您的受众将在浏览器中查看地图，而不是在笔记本中。

与 .explore() 或 lonboard 不同，您不传递原始几何图形——而是使用标准代码（FIPS、ISO 等）将您的数据与 Datawrapper 的内置边界文件进行匹配。

import datawrapper as dw
import pandas as pd

# Read API key
with open('datawrapper_api_key.txt', 'r') as f:
    api_key = f.read().strip()

# Prepare data with location codes that match Datawrapper's boundaries
# For US states: use 2-letter abbreviations or FIPS codes
# For countries: use ISO 3166-1 alpha-2 codes
df = pd.DataFrame({
    'state': ['AL', 'AK', 'AZ', 'AR', 'CA'],  # State abbreviations
    'unemployment_rate': [4.9, 3.2, 7.1, 4.2, 5.8]
})

# Create a choropleth map
chart = dw.ChoroplethMap(
    title='Unemployment Rate by State',
    intro='Percentage of labor force unemployed, 2024',
    data=df,

    # Map configuration
    basemap='us-states',           # Built-in US states boundaries
    basemap_key='state',           # Column in your data with location

🇺🇸English

Data journalism methodology

Systematic approaches for finding, analyzing and presenting data in journalism.

Story structure for data journalism

Data journalism framework

The framework for data journalism was established by Philip Meyer, a journalist for Knight-Ridder, Harvard Nieman Fellow and professor at UNC-Chapel Hill. In his book <i>The New Precision Journalism</i>, which outlines his ideas, Meyer encourages journalists to treat journalism "as if it were a science" by adopting the scientific method:
- Making observation(s) / formulating a questiom
- Researching the question / Collect, store and retrieve data
- Formulate a hypothesis
- Test the hypothesis, using both qualitative (interviews, documents etc.) and quantitative (data analysis etc.) methods
- Analyze the results and reduce them to the most important findings
- Present them to the audience

This process should be thought of as iterative, rather than sequential.

## The data story arc

### 1. The hook (nut graf)
- What's the key finding(s)?
- Why should readers care?
- What's the human impact?

### 2. The evidence
- Show the data
- Explain the methodology
- Acknowledge limitations

### 3. The context
- How does this compare to past?
- How does this compare to elsewhere?
- What's the trend?

### 4. The human element
- Individual examples that illustrate the data
- Expert interpretation
- Affected voices

### 5. The implications
- What does this mean going forward?
- What questions remain?
- What actions could result?

### 6. The methodology box
- Where did data come from?
- How was it analyzed?
- What are the limitations?
- How can readers explore further?

Methodology documentation template

## How we did this analysis

### Data sources
[List all data sources with links and access dates]

### Time period
[Specify exactly what time period is covered]

### Definitions
[Define key terms and how you operationalized them]

### Analysis steps
1. [First step of analysis]
2. [Second step]
3. [Continue...]

### Limitations
- [Limitation 1]
- [Limitation 2]

### What we excluded and why
- [Excluded category]: [Reason]

### Verification
[How findings were verified/checked]

### Code and data availability
[Link to GitHub repo if sharing code/data]

### Contact
[How readers can reach you with questions]

Data acquisition

Public data sources

## Federal data sources

### General
- Data.gov - Federal open data portal
- Census Bureau (census.gov) - Demographics, economic data
- BLS (bls.gov) - Employment, inflation, wages
- BEA (bea.gov) - GDP, economic accounts
- Federal Reserve (federalreserve.gov) - Financial data
- SEC EDGAR - Corporate filings

### Specific domains
- EPA (epa.gov/data) - Environmental data
- FDA (fda.gov/data) - Drug approvals, recalls, adverse events
- CDC WONDER - Health statistics
- NHTSA - Vehicle safety data
- DOT - Transportation statistics
- FEC - Campaign finance
- USASpending.gov - Federal contracts and grants

### State and local
- State open data portals (search: "[state] open data")
- Socrata-powered sites (many cities/states)
- OpenStreets, municipal GIS portals
- State comptroller/auditor reports

Data request strategies

## Getting data that isn't public

### Public records request (ie. FOIA) for datasets
- Request databases, not just documents
- Ask for data dictionary/schema
- Request in native format (CSV, SQL dump)
- Specify field-level needs

### Building your own dataset
- Scraping public information
- Crowdsourcing from readers
- Systematic document review
- Surveys (with proper methodology)

### Commercial data sources (for newsrooms)
- LexisNexis
- Refinitiv
- Bloomberg
- Industry-specific databases

Data cleaning and preparation

Common data problems

from typing import Any

import pandas as pd
import numpy as np
from rapidfuzz import fuzz
from itertools import combinations

# Inflation adjustment
import cpi
import wbdata

def standardize_name(name: Any) -> str | None:
    """Standardize name format to 'First Last'."""
    if pd.isna(name):
        return None
    name = str(name).strip().upper()
    # Handle "LAST, FIRST" format
    if ',' in name:
        parts = name.split(',')
        name = f"{parts[1].strip()} {parts[0].strip()}"
    return name

def parse_date(date_str: Any) -> pd.Timestamp | None:
    """Parse dates in various formats."""
    if pd.isna(date_str):
        return None

    formats = [
        '%m/%d/%Y', '%Y-%m-%d', '%B %d, %Y',
        '%d-%b-%y', '%m-%d-%Y', '%Y/%m/%d'
    ]

    for fmt in formats:
        try:
            return pd.to_datetime(date_str, format=fmt)
        except:
            continue

    # Fall back to pandas parser
    try:
        return pd.to_datetime(date_str)
    except:
        return None


def handle_missing(df:pd.DataFrame, thresh:int | None, per_thresh:float | None, required_col:str | None) -> pd.DataFrame:
    '''Handles Dataframes with too many missing values, defined by the user.''' 
    if thresh and data_clean.isna().sum() >= thresh:
        return df.dropna(subset=[required_col]).reset_index(drop=True).copy()
    
    elif per_thresh and (data_clean.isna().sum() / len(data_clean) * 100) >= per_thresh:
        return df.dropna(subset=[required_col]).reset_index(drop=True).copy()
    
    else:
        return df


def handle_duplicates(df:pd.DataFrame, thresh=int | None)
    '''Handle duplicate rows of data.'''
    if thresh and df.duplicated().sum() >= thresh:
        return df.drop_duplicates().reset_index(drop=True).copy()
    else:
        return df


def flag_similar_names(df: pd.DataFrame, name_col: str, threshold: int = 85) -> pd.DataFrame:
    """Flag rows that have potential duplicate names using vectorized comparison."""
    
    names = df[name_col].dropna().unique()
    
    # Use combinations() to avoid nested loop and duplicate comparisons
    dup_names: set[Any] = {
        name
        for name1, name2 in combinations(names, 2)
        if fuzz.ratio(str(name1).lower(), str(name2).lower()) >= threshold
        for name in (name1, name2)
    }
    
    df['has_similar_name'] = df[name_col].isin(dup_names)
    return df


def flag_outliers(series: pd.Series, method: str = 'iqr', threshold: float = 1.5) -> pd.Series:
    """Flag statistical outliers."""
    if method == 'iqr':
        Q1 = series.quantile(0.25)
        Q3 = series.quantile(0.75)
        IQR = Q3 - Q1
        lower = Q1 - threshold * IQR
        upper = Q3 + threshold * IQR
        return (series < lower) | (series > upper)
    elif method == 'zscore':
        z_scores = np.abs((series - series.mean()) / series.std())
        return z_scores > threshold



# use descriptive variable names and chain methods
data_clean = (pd

            # Load messy data — raw_data is a placeholder
            # Be sure to use the right reader for the filetype
            .read_csv('..data/raw/raw_data.csv')

            # DATA TYPE CORRECTIONS
            # Ensure proper types for analysis
            .assign(# Convert to numeric (handling errors)
                    amount = lambda x: pd.to_numeric(x['amount'], errors='coerce'),
                    
                    # Convert to categorical (saves memory, enables ordering)
                    status = lambda x: pd.to_Categorical(x['status'])) 
            
            .assign(
                    # INCONSISTENT FORMATTING
                    # Problem: Names in different formats
                    # ie. "SMITH, JOHN" vs "John Smith" vs "smith john"
                    name_clean = lambda x: standaridize_name(x['name']),
                    
                    # DATE INCONSISTENCIES
                    # Problem: Dates in multiple formats
                    # ie. "01/15/2024", "2024-01-15", "January 15, 2024", "15-Jan-24"
                    parse_date = lambda x: parse_date(x['date']),
                    
                    # OUTLIERS
                    # Identify potential data entry errors
                    amount_outlier = lambda x: flag_outliers(x['amount']),
                    
                    )
            
            # Fuzzy duplicates (similar but not identical)
            # Use record linkage or manual review
            .pipe(find_similar_names, name_col='name_clean', threshold=85)

            # MISSING VALUES
            # Strategy depends on context
            # First check missing value patterns
            .pipe(handle_missing, thresh=None, per_thresh=None)

            # DUPLICATES — Find and handle duplicates
            .pipe(handle_duplicates, thresh=None)
            
            .reset_index(drop=True)
            .copy())

Data validation checklist

## Pre-analysis data validation

### Structural checks
- [ ] Row count matches expected
- [ ] Column count and names correct
- [ ] Data types appropriate
- [ ] No unexpected null columns

### Content checks
- [ ] Date ranges make sense
- [ ] Numeric values within expected bounds
- [ ] Categorical values match expected options
- [ ] Geographic data resolves correctly
- [ ] IDs are unique where expected

### Consistency checks
- [ ] Totals add up to expected values
- [ ] Cross-tabulations balance
- [ ] Related fields are consistent
- [ ] Time series is continuous

### Source verification
- [ ] Can trace back to original source
- [ ] Methodology documented
- [ ] Known limitations noted
- [ ] Update frequency understood

Statistical analysis for journalism

Basic statistics with context

# Essential statistics for any dataset
def describe_for_journalism(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Generate journalist-friendly statistics."""
    stats = df[col].describe(percentiles=[0.25, 0.5, 0.75, 0.9, 0.99])
    
    # Add skewness to the describe() output
    stats['skewness'] = df[col].skew()
    
    return stats.to_frame(name=col)

# Example interpretation
stats = describe_for_journalism(salaries, 'salary')

print(f"""
ANALYSIS
---------------
We analyzed {stats['count']:,} salary records.

The median salary is ${stats['median']:,.0f}, meaning half of workers
earn more and half earn less.

The average salary is ${stats['mean']:,.0f}, which is
{'higher' if stats['mean'] > stats['median'] else 'lower'} than the median,
indicating the distribution is {'right-skewed (pulled up by high earners)'
if stats['skewness'] > 0 else 'left-skewed'}.

The top 10% of earners make at least ${stats['90th_percentile']:,.0f}.
The top 1% make at least ${stats['99th_percentile']:,.0f}.
""")

Comparisons and context

# Calculate change metrics for a column
def calculate_change(df: pd.DataFrame, col: str, periods: int = 1) -> pd.DataFrame:
    """Add change metrics to a DataFrame using built-in pandas methods.
    
    Args:
        df: Input DataFrame
        col: Column to calculate changes for
        periods: Number of rows to look back (1=previous row, 12=year-over-year for monthly)
    """
    return df.assign(
        absolute_change=df[col].diff(periods),
        percent_change=df[col].pct_change(periods) * 100,
        direction=np.sign(df[col].diff(periods)).map({1: 'increased', -1: 'decreased', 0: 'unchanged'})
    )

# Usage:
# changes = data_clean.pipe(calculate_change, 'revenue', periods=12) # Year-over-year for monthly data

# Per capita calculations (essential for fair comparisons)
def per_capita(value: float, population: float, multiplier: int = 100000) -> float:
    """Calculate per capita rate."""
    return (value / population) * multiplier  # Per 100,000 is standard

# Example: Crime rates
city_a = {'crimes': 5000, 'population': 100000}
city_b = {'crimes': 8000, 'population': 500000}

rate_a = per_capita(city_a['crimes'], city_a['population'])
rate_b = per_capita(city_b['crimes'], city_b['population'])

print(f"City A: {rate_a:.1f} crimes per 100,000 residents")
print(f"City B: {rate_b:.1f} crimes per 100,000 residents")
# City A actually has higher crime rate despite fewer total crimes!


def adjust_for_inflation(
    amount: float | pd.Series, 
    from_year: int | pd.Series, 
    to_year: int,
    country: str = 'US'
) -> float | pd.Series:
    """Adjust dollar amounts for inflation. Works with scalars or Series for .assign().
    
    Args:
        amount: Value(s) to adjust
        from_year: Original year(s) of the amount
        to_year: Target year to adjust to
        country: ISO 2-letter country code (default 'US'). US uses BLS data via cpi package,
                 others use World Bank CPI data (FP.CPI.TOTL indicator)
    """
    if country == 'US':
        # Use cpi package for US (more accurate, from BLS)
        if isinstance(from_year, pd.Series):
            return pd.Series([cpi.inflate(amt, yr, to=to_year) 
                            for amt, yr in zip(amount, from_year)], index=amount.index)
        return cpi.inflate(amount, from_year, to=to_year)
    else:
        # Use World Bank data for other countries
        cpi_data = wbdata.get_dataframe(
            {'FP.CPI.TOTL': 'cpi'}, 
            country=country
        )['cpi'].to_dict()
        
        from_cpi = pd.Series(from_year).map(cpi_data) if isinstance(from_year, pd.Series) else cpi_data[from_year]
        to_cpi = cpi_data[to_year]
        return amount * (to_cpi / from_cpi)

# Usage:
# adjust_for_inflation(100, 2020, 2024)  # US by default
# adjust_for_inflation(100, 2020, 2024, country='GB')  # UK
# df.assign(inf_adjust24=lambda x: adjust_for_inflation(x['amount'], x['year'], 2024, country='DE'))

# Always adjust when comparing dollars across years!

Correlation vs causation

## Reporting correlations responsibly

### What you CAN say
- "X and Y are correlated"
- "As X increases, Y tends to increase"
- "Areas with higher X also tend to have higher Y"
- "X is associated with Y"

### What you CANNOT say (without more evidence)
- "X causes Y"
- "X leads to Y"
- "Y happens because of X"

### Questions to ask before implying causation
1. Is there a plausible mechanism?
2. Does the timing make sense (cause before effect)?
3. Is there a dose-response relationship?
4. Has the finding been replicated?
5. Have confounding variables been controlled?
6. Are there alternative explanations?

### Red flags for spurious correlations
- Extremely high correlation (r > 0.95) with unrelated things
- No logical connection between variables
- Third variable could explain both
- Small sample size with high variance

Data visualization

Chart selection guide

## Choosing the right chart

### Comparison
- **Bar chart**: Compare categories
- **Grouped bar**: Compare categories across groups
- **Bullet chart**: Actual vs target

### Change over time
- **Line chart**: Trends over time
- **Area chart**: Cumulative totals over time
- **Slope chart**: Change between two points

### Distribution
- **Histogram**: Distribution of one variable
- **Box plot**: Compare distributions across groups
- **Violin plot**: Detailed distribution shape

### Relationship
- **Scatter plot**: Relationship between two variables
- **Bubble chart**: Three variables (x, y, size)
- **Connected scatter**: Change in relationship over time

### Composition
- **Pie chart**: Parts of a whole (almost never use, max 5 slices, prefer donut charts)
- **Donut chart**: Parts of a whole
- **Stacked bar**: Parts of whole across categories
- **Treemap**: Hierarchical composition

### Geographic
- **Choropleth**: Values by region (use normalized data!)
- **Dot map**: Individual locations
- **Proportional symbol**: Magnitude at locations

Exploratory interactive visualizations with Plotly Express

import plotly.express as px

# Set default template for all charts
px.defaults.template = 'simple_white'

def create_bar_chart(
    data: pd.DataFrame, 
    title: str, 
    source: str,
    desc: str  = '', 
    x_val: str, 
    y_val: str,
    x_lab: str | None,
    y_lab: str | None
) -> px.bar:
    """Create a bar chart."""
    
    fig = px.bar(
        data, 
        x=x_val, 
        y=y_val,
        text=desc,
        title=title,
        labels={'category': (x_lab if x_lab else x_val), 'value': (y_lab if y_lab else y_val)}
    )
    
    return fig

# Example
fig = create_bar_chart(
    data,
    title='Annual Widget Production',
    source='Department of Widgets, 2024',
    desc='The widget department increased its production dramatically starting in 2014.',
    x_val='year',
    y_val='widgets_prod',
    x_lab='Year',
    y_label='Units produced'
)

fig.show()  # Interactive display

Publication-ready automated data visualizations with Datawrapper

import pandas as pd
import datawrapper as dw

# Authentication: Set DATAWRAPPER_ACCESS_TOKEN environment variable, 
# or read from file and pass to create()
with open('datawrapper_api_key.txt', 'r') as f:
    api_key = f.read().strip()

# read in your data
data = pd.read_csv('../data/raw/data.csv')

# Create a bar chart using the new OOP API
chart = dw.BarChart(
    title='My Bar Chart Title',
    intro='Subtitle or description text',
    data=data,

    # Formatting options
    value_label_format=dw.NumberFormat.ONE_DECIMAL,
    show_value_labels=True,
    value_label_alignment='left',
    sort_bars=True,  # sort by value
    reverse_order=False,

    # Source attribution
    source_name='Your Data Source',
    source_url='https://example.com',
    byline='Your Name',

    # Optional: custom base color
    base_color='#1d81a2'
)

# Create and publish (uses DATAWRAPPER_ACCESS_TOKEN env var, or pass token)
chart.create(access_token=api_key)
chart.publish()

# Get chart URL and embed code
print(f"Chart ID: {chart.chart_id}")
print(f"Chart URL: https://datawrapper.dwcdn.net/{chart.chart_id}")
iframe_code = chart.get_iframe_code(responsive=True)

# Update existing chart with new data (for live-updating charts)
existing_chart = dw.get_chart('YOUR_CHART_ID')  # retrieve by ID
existing_chart.data = new_df  # assign new DataFrame
existing_chart.title = 'Updated Title'  # modify properties
existing_chart.update()  # push changes to Datawrapper
existing_chart.publish()  # republish to make live

# Optional — Export chart as image
chart.export(filepath='chart.png', width=800, height=600)

#view chart
chart

Avoiding misleading visualizations

## Chart integrity checklist

### Axes
- [ ] Y-axis starts at zero (for bar charts)
- [ ] Axis labels are clear
- [ ] Scale is appropriate (not truncated to exaggerate)
- [ ] Both axes labeled with units

### Data representation
- [ ] All data points visible
- [ ] Colors are distinguishable (including colorblind)
- [ ] Proportions are accurate
- [ ] 3D effects not distorting perception

### Context
- [ ] Title describes what's shown, not conclusion
- [ ] Time period clearly stated
- [ ] Source cited
- [ ] Sample size/methodology noted if relevant
- [ ] Uncertainty shown where appropriate

### Honesty
- [ ] Cherry-picking dates avoided
- [ ] Outliers explained, not hidden
- [ ] Dual axes justified (usually avoid)
- [ ] Annotations don't mislead

Working with geospatial data

Geocoding data

U.S. Census Geocoder

Best for: U.S. addresses only. Returns Census geography (tract, block, FIPS codes) along with coordinates—essential for joining with Census demographic data.

Pros: Completely free with no API key required. Returns Census geographies (state/county FIPS, tract, block) that let you join with ACS/decennial Census data. Good match rates for standard U.S. addresses.

Cons: Limited to 10,000 addresses per batch. U.S. addresses only. Slower than commercial alternatives. Lower match rates for non-standard addresses (PO boxes, rural routes, new construction).

Use when: You need to geocode nicely formatted U.S. addresses or you don't have budget for a paid service.

# pip install censusbatchgeocoder
import censusbatchgeocoder
import pandas as pd

# DataFrame must have columns: id, address, city, state, zipcode
# (state and zipcode are optional but improve match rates)

def census_geocode(
    df: pd.DataFrame,
    id_col: str = 'id',
    address_col: str = 'address',
    city_col: str = 'city',
    state_col: str = 'state',
    zipcode_col: str = 'zipcode',
    chunk_size: int = 9999
) -> pd.DataFrame:
    """
    Geocode a DataFrame using the U.S. Census batch geocoder.
    Automatically handles datasets larger than 10,000 rows by chunking.
    
    Returns DataFrame with: latitude, longitude, state_fips, county_fips, 
    tract, block, is_match, is_exact, returned_address, geocoded_address
    """
    # Rename columns to expected format
    col_map = {id_col: 'id', address_col: 'address', city_col: 'city'}
    if state_col and state_col in df.columns:
        col_map[state_col] = 'state'
    if zipcode_col and zipcode_col in df.columns:
        col_map[zipcode_col] = 'zipcode'
    
    renamed_df = df.rename(columns=col_map)
    records = renamed_df.to_dict('records')
    
    # Small dataset: geocode directly
    if len(records) <= chunk_size:
        results = censusbatchgeocoder.geocode(records)
        return pd.DataFrame(results)
    
    # Large dataset: process in chunks to stay under 10,000 limit
    all_results = []
    for i in range(0, len(records), chunk_size):
        chunk = records[i:i + chunk_size]
        print(f"Geocoding rows {i:,} to {i + len(chunk):,} of {len(records):,}...")
        
        try:
            results = censusbatchgeocoder.geocode(chunk)
            all_results.extend(results)
        except Exception as e:
            print(f"Error on chunk starting at {i}: {e}")
            for record in chunk:
                all_results.append({**record, 'is_match': 'No_Match', 'latitude': None, 'longitude': None})
    
    return pd.DataFrame(all_results)

# Usage:
geocoded = (pd
              .read_csv('../data/raw/addresses.csv')
              .assign(id=lambda x: x.index)
              .pipe(census_geocode, 
                    id_col='id', 
                    address_col='street', 
                    city_col='city'.
                    state_col='state',
                    zipcode_col='zip'))

Google Maps Geocoder

Best for: International addresses, high match rates, and messy/non-standard address formats.

Pros: Excellent match rates even for poorly formatted addresses. Works worldwide. Fast and reliable. Returns rich metadata (place types, address components, place IDs).

Cons: Costs money ($5 per 1,000 requests after free tier). Requires API key and billing account. Does not return Census geography—you'd need to do a separate spatial join.

Use when: You need to geocode international addresses, have messy address data that the Census geocoder can't match, or need the highest possible match rate and have budget for it.

import googlemaps
from typing import Optional

def geocode_address_google(address: str, api_key: str) -> Optional[dict]:
    """
    Geocode address using Google Maps API.
    Requires API key with Geocoding API enabled.
    """
    gmaps = googlemaps.Client(key=api_key)
    result = gmaps.geocode(address)
    
    if result:
        location = result[0]['geometry']['location']
        return {
            'formatted_address': result[0]['formatted_address'],
            'lat': location['lat'],
            'lon': location['lng'],
            'place_id': result[0]['place_id']
        }
    return None

# Batch geocode a DataFrame
def batch_geocode(df: pd.DataFrame, address_col: str, api_key: str) -> pd.DataFrame:
    gmaps = googlemaps.Client(key=api_key)
    
    results = []
    for address in df[address_col]:
        try:
            result = gmaps.geocode(address)
            if result:
                loc = result[0]['geometry']['location']
                results.append({'lat': loc['lat'], 'lon': loc['lng']})
            else:
                results.append({'lat': None, 'lon': None})
        except Exception:
            results.append({'lat': None, 'lon': None})
    
    return pd.concat([df, pd.DataFrame(results)], axis=1)

Geopandas

import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

# Read data from various formats
gdf = gpd.read_file('data.geojson')                    # GeoJSON
gdf = gpd.read_file('data.shp')                         # Shapefile
gdf = gpd.read_file('https://example.com/data.geojson') # From URL
gdf = gpd.read_parquet('data.parquet')                  # GeoParquet (fast!)

# Transform DataFrame with lat/lon to GeoDataFrame
df = pd.read_csv('locations.csv')
geometry = [Point(xy) for xy in zip(df['longitude'], df['latitude'])]
gdf = gpd.GeoDataFrame(df, geometry=geometry)

# Set CRS (Coordinate Reference System)
# EPSG:4326 = WGS84 (standard latitude, longitude)
gdf = gdf.set_crs('EPSG:4326')

# Transform to different CRS (for area/distance calculations, use projected CRS)
gdf_projected = gdf.to_crs('EPSG:3857')  # Web Mercator, for distance in meters

# Basic spatial operations

#Find the area of a shape
gdf['area'] = gdf_projected.geometry.area 

#Find the center of a shape
gdf['centroid'] = gdf.geometry.centroid

#Draw a 1km boundary around a point
gdf['buffer_1km'] = gdf_projected.geometry.buffer(1000) #when set to CRS 3857

# Spatial join: find points within polygons
points = gpd.read_file('points.geojson')
polygons = gpd.read_file('boundaries.geojson')
joined = gpd.sjoin(points, polygons, predicate='within')

# Dissolve: merge geometries by attribute
dissolved = gdf.dissolve(by='state', aggfunc='sum')

# Export to various formats
gdf.to_parquet('output.parquet')          # GeoParquet (recommended)
gdf.to_file('output.geojson', driver='GeoJSON') #for tools that dont support GeoParquet

Geo-Visualization with `.explore()`, `lonboard` and Datawrapper

`.explore()`

Best for: Quick exploration and prototyping during data analysis.

Pros: Built into GeoPandas—method is available on any GeoDataFrame. Great for exploratory data analysis—checking that your data looks right, exploring spatial patterns, and iterating quickly on map designs.

Cons: Becomes slow with large datasets (>100k features). Limited customization compared to dedicated mapping libraries. Requires extra dependencies to be installed.

Use when: You're in the middle of analysis and want to quickly visualize your GeoDataFrame without switching tools.

Required dependencies:

pip install folium mapclassify matplotlib

folium - Required for .explore() to work at all (renders the interactive map)
mapclassify - Required when using scheme= parameter for classification (e.g., 'naturalbreaks', 'quantiles', 'equalinterval')
matplotlib - Required for colormap (cmap=) support

import geopandas as gpd

folium, mapclassify, and matplotlib must be installed but don't need to be imported

geopandas imports them automatically when you call .explore()

Basic interactive map (uses folium under the hood)

gdf.explore()

Choropleth map with customization

(requires mapclassify for scheme parameter)

gdf.explore( column='population', # Column for color scale cmap='YlOrRd', # Matplotlib colormap scheme='naturalbreaks', # Classification scheme (needs mapclassify) k=5, # Number of bins legend=True, tooltip=['name', 'population'], # Columns to show on hover popup=True, # Show all columns on click tiles='CartoDB positron', # Background tiles style_kwds={'color': 'black', 'weight': 0.5} # Border style )

`lonboard`

Best for: Large datasets and high-performance visualization in Jupyter notebooks.

Pros: GPU-accelerated rendering via deck.gl can handle millions of points smoothly. Excellent interactivity—pan, zoom, and hover work fluidly even with massive datasets. Native support for GeoArrow format for efficient data transfer.

Cons: Requires separate installation (pip install lonboard). Styling options are more technical (RGBA arrays, deck.gl conventions).

Use when: You have large point datasets (crime incidents, sensor readings, business locations) or need smooth interactivity with 100k+ features.

import geopandas as gpd
from lonboard import viz, Map, ScatterplotLayer, PolygonLayer

# Quick visualization (auto-detects geometry type)
viz(gdf)

# Custom ScatterplotLayer for points
layer = ScatterplotLayer.from_geopandas(
    gdf,
    get_radius=100,
    get_fill_color=[255, 0, 0, 200],  # RGBA
    pickable=True
)
m = Map(layer)
m

# PolygonLayer with color based on column
from lonboard.colormap import apply_continuous_cmap
import matplotlib.pyplot as plt

colors = apply_continuous_cmap(gdf['value'], plt.cm.viridis)
layer = PolygonLayer.from_geopandas(
    gdf,
    get_fill_color=colors,
    get_line_color=[0, 0, 0, 100],
    pickable=True
)
Map(layer)

Datawrapper

Best for: Publication-ready choropleth and proportional symbol maps for articles and reports.

Pros: Beautiful, professional defaults out of the box. Generates embeddable, responsive iframes that work in any CMS. Readers can interact (hover, click) without running any code. Accessible and mobile-friendly. Easy to update data programmatically for updating data.

Cons: Requires a Datawrapper account (free tier available). Limited to Datawrapper's supported boundary files—you can't bring arbitrary geometries. Less flexibility for custom visualizations.

Use when: You need a polished map for publication. Ideal for choropleth maps showing statistics by region (unemployment by state, COVID cases by county, election results by district). Your audience will view the map in a browser, not a notebook.

Unlike .explore() or lonboard, you don't pass raw geometry—instead you match your data to Datawrapper's built-in boundary files using standard codes (FIPS, ISO, etc.).

import datawrapper as dw
import pandas as pd

# Read API key
with open('datawrapper_api_key.txt', 'r') as f:
    api_key = f.read().strip()

# Prepare data with location codes that match Datawrapper's boundaries
# For US states: use 2-letter abbreviations or FIPS codes
# For countries: use ISO 3166-1 alpha-2 codes
df = pd.DataFrame({
    'state': ['AL', 'AK', 'AZ', 'AR', 'CA'],  # State abbreviations
    'unemployment_rate': [4.9, 3.2, 7.1, 4.2, 5.8]
})

# Create a choropleth map
chart = dw.ChoroplethMap(
    title='Unemployment Rate by State',
    intro='Percentage of labor force unemployed, 2024',
    data=df,

    # Map configuration
    basemap='us-states',           # Built-in US states boundaries
    basemap_key='state',           # Column in your data with location codes
    value_column='unemployment_rate',

    # Styling
    color_palette='YlOrRd',        # Color scheme
    legend_title='Unemployment %',

    # Attribution
    source_name='Bureau of Labor Statistics',
    source_url='https://www.bls.gov/',
    byline='Your Name'
)

# Create and publish
chart.create(access_token=api_key)
chart.publish()

# Get embed code for your article
iframe = chart.get_iframe_code(responsive=True)
print(f"Chart URL: https://datawrapper.dwcdn.net/{chart.chart_id}")

# Update with new data (for live-updating maps)
new_df = pd.DataFrame({...})  # Updated data
existing_chart = dw.get_chart('YOUR_CHART_ID')
existing_chart.data = new_df
existing_chart.update()
existing_chart.publish()

Available Datawrapper basemaps include:

us-states, us-counties, us-congressional-districts
world, europe, africa, asia
Country-specific maps (e.g., germany-states, uk-constituencies)

Learning resources

NICAR (Investigative Reporters & Editors)
Knight Center for Journalism in the Americas
Data Journalism Handbook (datajournalism.com)
Flowing Data (flowingdata.com)
The Pudding (pudding.cool) - examples
Sigma Awards (https://www.sigmaawards.org/) - examples

Weekly Installs

Repository

jamditis/claude…urnalism

GitHub Stars

First Seen

Jan 21, 2026

Security Audits

Gen Agent Trust HubPass SocketPass SnykWarn

Installed on

opencode73

codex72

gemini-cli71

cursor67

github-copilot65

claude-code62

前端代码审计工具 - 自动化检测可访问性、性能、响应式设计、主题化与反模式

53,100 周安装

数据新闻方法论：从数据获取到故事呈现的完整指南与工具

🇨🇳中文介绍

数据新闻方法论

数据新闻的故事结构

数据新闻框架

方法文档模板

数据获取

公共数据源

相关 Skills

数据请求策略

数据清洗和准备

常见数据问题

数据验证清单

新闻统计分析方法

带背景的基本统计

比较和背景

相关性与因果关系

数据可视化

图表选择指南

使用 Plotly Express 进行探索性交互可视化

使用 Datawrapper 制作可发布的自动化数据可视化

避免误导性可视化

处理地理空间数据

地理编码数据

美国人口普查地理编码器

Google 地图地理编码器

Geopandas

使用 .explore()、lonboard 和 Datawrapper 进行地理可视化

.explore()

folium, mapclassify, and matplotlib must be installed but don't need to be imported

geopandas imports them automatically when you call .explore()

Basic interactive map (uses folium under the hood)

Choropleth map with customization

(requires mapclassify for scheme parameter)

lonboard

Datawrapper

🇺🇸English

Data journalism methodology

Story structure for data journalism

Data journalism framework

Methodology documentation template

Data acquisition

Public data sources

Data request strategies

Data cleaning and preparation

Common data problems

Data validation checklist

Statistical analysis for journalism

Basic statistics with context

Comparisons and context

Correlation vs causation

Data visualization

Chart selection guide

Exploratory interactive visualizations with Plotly Express

Publication-ready automated data visualizations with Datawrapper

Avoiding misleading visualizations

Working with geospatial data

Geocoding data

U.S. Census Geocoder

Google Maps Geocoder

Geopandas

Geo-Visualization with .explore(), lonboard and Datawrapper

.explore()

folium, mapclassify, and matplotlib must be installed but don't need to be imported

geopandas imports them automatically when you call .explore()

Basic interactive map (uses folium under the hood)

Choropleth map with customization

(requires mapclassify for scheme parameter)

lonboard

Datawrapper

Learning resources

最新 Skills

使用 `.explore()`、`lonboard` 和 Datawrapper 进行地理可视化

`.explore()`

`lonboard`

Geo-Visualization with `.explore()`, `lonboard` and Datawrapper

`.explore()`

`lonboard`