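Pandas Time Resampling

Time resampling converts a time series from one frequency to another. The resample() method in Pandas is the core tool for this task: it handles both downsampling (high frequency to low frequency) and upsampling (low frequency to high frequency), and it combines with aggregation and interpolation methods to cover most frequency conversions.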

📌 Preparing sample data

import pandas as pd
import numpy as np

# Build minute-level data
dates = pd.date_range('2024-01-01 00:00', periods=1000, freq='min')  # 'min' = minute (the older 'T' alias is deprecated since pandas 2.2)
df = pd.DataFrame({
    'value': np.random.randn(1000).cumsum(),  # simulated cumulative value (random walk)
    'volume': np.random.randint(100, 1000, 1000)
}, index=dates)

print(df.head())

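🔍 Basic concepts of resampling

resample() groups rows by their time index and then applies an aggregation or transformation. The workflow is: specify the target frequency → group rows into time bins → apply a function.

Downsampling

Aggregates high-frequency data into low-frequency data (for example, minutes → hours). It normally requires an aggregation function such as sum or mean.

Upsampling

Converts low-frequency data into high-frequency data (for example, hours → minutes). It requires interpolation or a fill method to supply values for the newly created timestamps.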
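The two directions can be seen side by side in a minimal sketch. The toy series below is a hypothetical example, not data from this tutorial; it only illustrates how the row count shrinks when downsampling and grows (with NaN placeholders) when upsampling.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: 6 hours of minute-level readings 0, 1, 2, ...
idx = pd.date_range("2024-01-01", periods=360, freq="min")
minute = pd.Series(np.arange(360, dtype=float), index=idx)

# Downsampling: 360 minute rows collapse into 6 hourly rows via an aggregation.
hourly = minute.resample("h").mean()

# Upsampling: the 6 hourly rows are spread over a 30-minute grid;
# asfreq() leaves the newly created slots as NaN.
half_hourly = hourly.resample("30min").asfreq()

print(len(minute), len(hourly), len(half_hourly))  # 360 6 11
print(half_hourly.isna().sum())                    # 5 new, unfilled slots
```

The NaN slots are exactly the timestamps that did not exist in the hourly data, which is why upsampling is usually paired with a fill or interpolation step.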

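📊 Common frequency aliases

Alias	Meaning	Example
B	Business day	resample('B')
D	Calendar day	resample('D')
W	Weekly (anchored on Sunday by default)	resample('W-MON') anchors on Monday
ME	Month end	resample('ME')
QE	Quarter end	resample('QE')
YE	Year end	resample('YE')
h	Hour	resample('h')
min	Minute	resample('5min') every 5 minutes
s	Second	resample('30s') every 30 seconds
ms	Millisecond	resample('500ms') every 500 milliseconds
us	Microsecond	resample('1000us')

Since pandas 2.2 the older spellings M, Q, A/Y, H, T, S, L and U are deprecated in favor of the aliases above. On earlier versions, ME, QE and YE are not recognized, so use M, Q and Y there.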
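As a small illustration of how these aliases compose, the sketch below parses a multiplied alias with to_offset and resamples with an anchored weekly alias. The daily series of ones is a hypothetical example chosen so the weekly sums are easy to check by hand.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset

# A multiplier in front of an alias scales the bin width.
five_min = to_offset("5min")
print(five_min.n)  # 5

# Anchored weekly alias: each bin ends on a Monday, so every label is a Monday.
# 2024-01-01 is a Monday; the series covers 28 consecutive days.
idx = pd.date_range("2024-01-01", periods=28, freq="D")
daily_ones = pd.Series(1, index=idx)
weekly = daily_ones.resample("W-MON").sum()
print(weekly)
```

Weekly bins are closed and labeled on the right by default, which is why the first bin contains only Monday 2024-01-01 itself.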

📉 Downsampling

Aggregating high-frequency data into low-frequency data requires an aggregation function.

# Downsample minute data to hourly data: mean and sum per hour
hourly_mean = df.resample('h').mean()
hourly_sum = df.resample('h').sum()
print(hourly_mean.head())

# Apply a different aggregation to each column
hourly_agg = df.resample('h').agg({
    'value': 'mean',
    'volume': 'sum'
})
print(hourly_agg.head())

# Use a custom function: the spread between the highest and lowest value in each bin
def price_range(x):
    return x.max() - x.min()

hourly_range = df['value'].resample('h').agg(price_range)
print(hourly_range.head())
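One way to build intuition for what a downsample does is to reproduce it with an ordinary groupby. The sketch below, on a hypothetical random series, floors each timestamp to the hour and groups on the result; for data with no empty hours this should match resample('h').mean().

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2024-01-01", periods=180, freq="min")
s = pd.Series(rng.normal(size=180), index=idx)

# Built-in downsampling with the default closed='left', label='left'.
by_resample = s.resample("h").mean()

# Manual equivalent: label every row with the start of its hour, then group.
by_floor = s.groupby(s.index.floor("h")).mean()

print(np.allclose(by_resample.to_numpy(), by_floor.to_numpy()))
```

The two differ when an hour contains no rows: resample() still emits that bin (as NaN for mean), while the groupby version simply has no row for it.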

📌 OHLC resampling (common in finance)

The ohlc() method computes the open, high, low and close of each bin in one call.

# Simulate a stock price (random walk around 100)
prices = pd.Series(np.cumsum(np.random.randn(1000)) + 100,
                   index=pd.date_range('2024-01-01', periods=1000, freq='min'))

# Compute 5-minute OHLC bars
ohlc_5min = prices.resample('5min').ohlc()
print(ohlc_5min.head())
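In practice price bars usually carry a volume column as well. The following sketch, using a hypothetical tick table with price and volume columns, builds OHLCV bars and checks that ohlc() agrees with the generic first/max/min/last aggregations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
idx = pd.date_range("2024-01-01", periods=60, freq="min")
# Hypothetical tick data: a random-walk price and a random traded volume.
ticks = pd.DataFrame({
    "price": 100 + rng.normal(size=60).cumsum(),
    "volume": rng.integers(100, 1000, size=60),
}, index=idx)

# OHLC for the price, plus summed volume for the same 5-minute bins.
bars = ticks["price"].resample("5min").ohlc()
bars["volume"] = ticks["volume"].resample("5min").sum()

# The same four price columns expressed as generic aggregations.
manual = ticks["price"].resample("5min").agg(["first", "max", "min", "last"])

print(bars.head())
```

Both resample calls use the same rule and index, so the volume column aligns with the OHLC rows without any extra joining.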

📈 Upsampling

Converting low-frequency data to high-frequency data creates new timestamps, which normally need to be filled or interpolated.

# Build daily data
daily = pd.Series([100, 150, 130, 200],
                  index=pd.date_range('2024-01-01', periods=4, freq='D'))

# Upsample to hourly with forward fill
hourly_ffill = daily.resample('h').ffill()
print(hourly_ffill.head(24))

# Upsample to hourly with linear interpolation
hourly_interp = daily.resample('h').interpolate(method='linear')
print(hourly_interp.head(24))

# Other options: bfill() (backward fill) and asfreq() (no filling, new slots are NaN).
# The pad() alias of ffill() was removed in pandas 2.0.
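To make the difference between the fill strategies concrete, the sketch below applies four of them to a hypothetical two-point daily series upsampled to a 12-hour grid, so there is exactly one new slot to compare.

```python
import pandas as pd

# Two daily observations; upsampling to 12h creates one new slot at noon.
two_days = pd.Series([100, 150],
                     index=pd.date_range("2024-01-01", periods=2, freq="D"))
r = two_days.resample("12h")

compare = pd.DataFrame({
    "asfreq": r.asfreq(),                       # leaves the new slot as NaN
    "ffill": r.ffill(),                         # repeats the previous value
    "bfill": r.bfill(),                         # pulls back the next value
    "linear": r.interpolate(method="linear"),   # midpoint of the neighbors
})
print(compare)
# Noon row: asfreq=NaN, ffill=100, bfill=150, linear=125
```

Only ffill() builds the noon value from data that was already known at noon; bfill() and linear interpolation both depend on the next day's observation.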

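⚙️ Parameters in detail

rule: the target frequency, as an alias string or offset (for example 'h' or '5min')

axis: 0 or 1; deprecated since pandas 2.1

closed: which side of each bin interval is closed ('left' or 'right')

label: whether each bin is labeled with its left or right edge

loffset: shifted the result labels; removed in pandas 2.0 (shift the result index instead)

kind: whether the result index holds timestamps or periods; deprecated since pandas 2.2

convention: for a PeriodIndex only, whether to use the start or the end of each period when upsampling

origin: the timestamp on which the bin edges are aligned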

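# Adjust bin closure and labels
# The default is closed='left', label='left' for most frequencies;
# month-end, quarter-end, year-end and weekly aliases default to 'right'
df.resample('h', closed='right', label='right').mean()

# Shifting labels: the loffset argument has been removed, so shift the result index instead
df.resample('h').mean().shift(1, freq='h')  # manual offset of one hour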
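The effect of closed, label and offset is easiest to see on a series small enough to add up by hand. The five-point series below is a hypothetical example with observations every 30 minutes.

```python
import pandas as pd

# Timestamps: 00:00, 00:30, 01:00, 01:30, 02:00
idx = pd.date_range("2024-01-01 00:00", periods=5, freq="30min")
s = pd.Series([1, 2, 3, 4, 5], index=idx)

# Default for hourly bins: [00:00, 01:00), [01:00, 02:00), [02:00, 03:00)
left = s.resample("h").sum()
print(left.tolist())   # [3, 7, 5]

# Right-closed, right-labeled: (23:00, 00:00], (00:00, 01:00], (01:00, 02:00]
right = s.resample("h", closed="right", label="right").sum()
print(right.tolist())  # [1, 5, 9]

# offset moves the bin edges themselves: bins now start at half past the hour.
shifted = s.resample("h", offset="30min").sum()
print(shifted.tolist())   # [1, 5, 9]
print(shifted.index[0])   # 2023-12-31 23:30:00
```

Note that a value sitting exactly on a bin edge (such as 01:00) moves to a different bin when closed changes, which is the usual source of off-by-one-period discrepancies between reports.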

🧪 Complex aggregation

Each column can receive its own set of aggregation functions, including custom functions.

# Apply several aggregations to several columns at once (the result has MultiIndex columns)
result = df.resample('h').agg({
    'value': ['mean', 'max', 'min'],
    'volume': ['sum', 'std']
})
print(result.head())

# Named aggregation: output_name=(column, function), the syntax introduced in pandas 0.25
result_named = df.resample('h').agg(
    avg_value=('value', 'mean'),
    max_value=('value', 'max'),
    total_volume=('volume', 'sum')
)
print(result_named.head())
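When a dict of lists is passed to agg(), the result carries two-level column labels, which can be awkward downstream. A possible way to flatten them, sketched on a hypothetical two-hour frame, is to join the two levels with an underscore.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
idx = pd.date_range("2024-01-01", periods=120, freq="min")
frame = pd.DataFrame({
    "value": rng.normal(size=120).cumsum(),
    "volume": rng.integers(100, 1000, size=120),
}, index=idx)

# Dict-of-lists aggregation yields MultiIndex columns such as ('value', 'mean').
multi = frame.resample("h").agg({"value": ["mean", "max"], "volume": ["sum"]})

# Flatten ('value', 'mean') into 'value_mean', and so on.
flat = multi.copy()
flat.columns = ["_".join(col) for col in multi.columns]
print(flat.columns.tolist())  # ['value_mean', 'value_max', 'volume_sum']
```

Named aggregation avoids the extra step by producing flat column names directly, at the cost of spelling out every output column.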

🔄 Grouped resampling

Resampling can be combined with grouping on other columns, for example by product or by region.

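# Add a categorical column
df['category'] = np.random.choice(['A', 'B'], size=len(df))

# Group by category first, then resample each group over time.
# Selecting the column before resample() keeps the grouping column out of the aggregation.
grouped_resampled = df.groupby('category')['value'].resample('h').mean()
print(grouped_resampled.head(10))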
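An alternative formulation, shown here as a sketch on hypothetical data, puts the time bins into the groupby itself with pd.Grouper. For data where every category appears in every hour, the two approaches should give the same numbers.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
idx = pd.date_range("2024-01-01", periods=240, freq="min")
events = pd.DataFrame({
    "value": rng.normal(size=240),
    "category": rng.choice(["A", "B"], size=240),
}, index=idx)

# Time bins as a grouping key alongside the category column.
via_grouper = events.groupby(["category", pd.Grouper(freq="h")])["value"].mean()

# The groupby-then-resample formulation.
via_resample = events.groupby("category")["value"].resample("h").mean()

print(np.allclose(via_grouper.to_numpy(), via_resample.to_numpy()))
```

The practical difference shows up with sparse data: groupby().resample() builds a continuous time grid per group and reports empty bins, whereas the pd.Grouper form only returns combinations that actually occur.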

🧩 Handling missing values

Upsampling usually produces missing values, so an appropriate fill method has to be chosen.

# Build daily data that already contains a gap
daily = pd.Series([1, 2, np.nan, 4], index=pd.date_range('2024-01-01', periods=4, freq='D'))

# Upsample to hourly, then interpolate
hourly = daily.resample('h').asfreq()  # new slots are NaN
hourly_interp = hourly.interpolate(method='time')  # interpolate in proportion to elapsed time

print(hourly_interp.head(48))
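Two details of gap filling are worth a small sketch on hypothetical series: limit caps how far a fill may propagate, and method='time' differs from method='linear' as soon as the timestamps are unevenly spaced.

```python
import numpy as np
import pandas as pd

# limit: forward-fill at most one new slot after each known value.
two_days = pd.Series([10, 20],
                     index=pd.date_range("2024-01-01", periods=2, freq="D"))
capped = two_days.resample("6h").ffill(limit=1)
print(capped)  # 10, 10, NaN, NaN, 20

# Irregular spacing: the gap sits 1 day after the first point, 3 days before the last.
irregular = pd.Series(
    [0.0, np.nan, 10.0],
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05"]),
)
by_position = irregular.interpolate(method="linear")  # treats rows as equally spaced
by_time = irregular.interpolate(method="time")        # weights by elapsed time
print(by_position.iloc[1], by_time.iloc[1])  # 5.0 2.5
```

On a regular grid, such as the output of resample().asfreq(), the two interpolation methods coincide; the distinction matters when interpolating the original irregular observations.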

🧪 Worked example: sales data analysis

import pandas as pd
import numpy as np

# Simulate sales data at 15-minute granularity
dates = pd.date_range('2024-01-01', periods=500, freq='15min')
sales = pd.DataFrame({
    'store': np.random.choice(['Store A', 'Store B'], 500),
    'sales': np.random.randint(10, 100, 500),
    'customers': np.random.randint(1, 20, 500)
}, index=dates)

print("Raw data (15-minute granularity):")
print(sales.head())

# Requirement 1: hourly totals for sales and customers
hourly_sales = sales.resample('h').agg({
    'sales': 'sum',
    'customers': 'sum'
})
print("\nHourly summary:")
print(hourly_sales.head())

# Requirement 2: daily total and average sales, and daily total customers
daily = sales.resample('D').agg({
    'sales': ['sum', 'mean'],
    'customers': 'sum'
})
print("\nDaily report:")
print(daily.head())

# Requirement 3: daily totals per store (one column per store)
store_daily = sales.groupby('store')['sales'].resample('D').sum().unstack('store')
print("\nDaily sales per store:")
print(store_daily.head())

# Requirement 4: 7-day moving average of the daily totals.
# 500 periods of 15 minutes span only about 6 days, so min_periods=1 allows
# partial windows; without it every value would be NaN.
store_daily_ma7 = store_daily.rolling(window=7, min_periods=1).mean()
print("\n7-day moving average:")
print(store_daily_ma7.head(10))
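A useful habit with this kind of report is to check that summed totals survive every level of aggregation. The sketch below regenerates a hypothetical sales table of the same shape (with a fixed seed) and verifies the totals, and it also shows why a 7-row window needs min_periods on a sample this short.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
idx = pd.date_range("2024-01-01", periods=500, freq="15min")
sales_check = pd.DataFrame({
    "store": rng.choice(["Store A", "Store B"], size=500),
    "sales": rng.integers(10, 100, size=500),
}, index=idx)

total = int(sales_check["sales"].sum())
hourly = sales_check["sales"].resample("h").sum()
daily_total = sales_check["sales"].resample("D").sum()
per_store = (sales_check.groupby("store")["sales"]
             .resample("D").sum().unstack("store"))

# Sums must be preserved no matter how the rows are binned.
print(int(hourly.sum()) == total, int(daily_total.sum()) == total)
print(int(per_store.sum().sum()) == total)

# 500 x 15 minutes covers 6 calendar days, fewer than the 7-row window.
print(len(daily_total))                                           # 6
print(daily_total.rolling(7).mean().isna().all())                 # True
print(daily_total.rolling(7, min_periods=1).mean().notna().all()) # True
```

The same check does not apply to means, since the mean of hourly means generally differs from the overall mean when bins hold different numbers of rows.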

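⚠️ Caveats

The index must be datetime-like: before calling resample(), make sure the DataFrame or Series index is a DatetimeIndex (a PeriodIndex or TimedeltaIndex also works), or point resample() at a datetime column with on=. Otherwise a TypeError is raised.

Bin boundaries: closed and label control which edge each bin includes and which edge labels it. Choose them to match how the business defines each period.

Data volume after upsampling: going from low to high frequency multiplies the row count (daily to minute-level is a factor of 1,440), so watch memory usage.

Missing values: remember to fill or interpolate after upsampling, since many models cannot handle NaN.

Best practices:

Before downsampling, decide which aggregation matches the business logic (sum, mean, or OHLC).
When upsampling, prefer ffill() if the result must not contain future information; interpolate() and bfill() both draw on later observations and therefore leak future values into earlier timestamps.
For large datasets, consider resample(...).asfreq() followed by interpolate(), which creates the NaN slots first and then fills them in one vectorized pass.
When working with time series, confirm the time zone and frequency of the data first to avoid offset errors.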
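The first caveat and the last best practice can be sketched together. The small table below is a hypothetical example in which the timestamps arrive as strings in a regular column, which is a common situation after reading a CSV file.

```python
import pandas as pd

raw = pd.DataFrame({
    "ts": ["2024-01-01 00:00", "2024-01-01 00:20",
           "2024-01-01 00:40", "2024-01-01 01:10"],
    "value": [1, 2, 3, 4],
})

# Strings are not datetime-like; convert the column first.
raw["ts"] = pd.to_datetime(raw["ts"])

# Option 1: move the timestamps into the index.
by_index = raw.set_index("ts")["value"].resample("h").sum()

# Option 2: keep the column and tell resample() where the timestamps live.
by_on = raw.resample("h", on="ts")["value"].sum()

print(by_index.tolist(), by_on.tolist())  # [6, 4] [6, 4]

# Make the time zone explicit (UTC is an assumption for this example).
localized = raw.set_index("ts").tz_localize("UTC")
print(localized.index.tz)
```

Localizing before resampling matters for daily or coarser bins, because the bin edges are midnight in whatever time zone the index carries.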
