diff --git a/REPORT.md b/REPORT.md new file mode 100644 index 0000000..401d4b0 --- /dev/null +++ b/REPORT.md @@ -0,0 +1,921 @@ +# BTC/USDT 价格规律性全面分析报告 + +> **数据源**: Binance BTCUSDT | **时间跨度**: 2017-08-17 ~ 2026-02-01 (3,091 日线) | **时间粒度**: 1m/3m/5m/15m/30m/1h/2h/4h/6h/8h/12h/1d/3d/1w/1mo (15种) + +--- + +## 目录 + +- [1. 数据概览](#1-数据概览) +- [2. 收益率分布特征](#2-收益率分布特征) +- [3. 波动率聚集与长记忆性](#3-波动率聚集与长记忆性) +- [4. 频域周期分析](#4-频域周期分析) +- [5. Hurst 指数与分形分析](#5-hurst-指数与分形分析) +- [6. 幂律增长模型](#6-幂律增长模型) +- [7. 量价关系与因果检验](#7-量价关系与因果检验) +- [8. 日历效应](#8-日历效应) +- [9. 减半周期分析](#9-减半周期分析) +- [10. 技术指标有效性验证](#10-技术指标有效性验证) +- [11. K线形态统计验证](#11-k线形态统计验证) +- [12. 市场状态聚类](#12-市场状态聚类) +- [13. 时序预测模型](#13-时序预测模型) +- [14. 异常检测与前兆模式](#14-异常检测与前兆模式) +- [15. 综合结论](#15-综合结论) + +--- + +## 1. 数据概览 + +![价格概览](output/price_overview.png) + +| 指标 | 值 | +|------|-----| +| 日线样本数 | 3,091 | +| 小时样本数 | 74,053 | +| 价格范围 | $3,189.02 ~ $124,658.54 | +| 缺失值 | 0 | +| 重复索引 | 0 | + +数据切分策略(严格按时间顺序,不随机打乱): + +| 集合 | 时间范围 | 样本数 | 比例 | +|------|---------|--------|------| +| 训练集 | 2017-08 ~ 2022-09 | 1,871 | 60.5% | +| 验证集 | 2022-10 ~ 2024-06 | 639 | 20.7% | +| 测试集 | 2024-07 ~ 2026-02 | 581 | 18.8% | + +--- + +## 2. 收益率分布特征 + +### 2.1 正态性检验 + +三项独立检验**一致拒绝正态假设**: + +| 检验方法 | 统计量 | p 值 | 结论 | +|---------|--------|------|------| +| Kolmogorov-Smirnov | 0.0974 | 5.97e-26 | 拒绝 | +| Jarque-Bera | 31,996.3 | 0.00 | 拒绝 | +| Anderson-Darling | 64.18 | 在所有临界值(1%~15%)下均拒绝 | 拒绝 | + +### 2.2 厚尾特征 + +| 指标 | BTC实际值 | 正态分布理论值 | 倍数 | +|------|----------|--------------|------| +| 超额峰度 | 15.65 | 0 | — | +| 偏度 | -0.97 | 0 | — | +| 3σ超越比率 | 1.553% | 0.270% | **5.75x** | +| 4σ超越比率 | 0.550% | 0.006% | **86.86x** | + +4σ 极端事件的出现频率是正态分布预测的近 87 倍,证明 BTC 收益率具有显著的厚尾特征。 + +![收益率直方图 vs 正态](output/returns/returns_histogram_vs_normal.png) + +![QQ图](output/returns/returns_qq_plot.png) + +### 2.3 多时间尺度分布 + +| 时间尺度 | 样本数 | 均值 | 标准差 | 峰度 | 偏度 | +|---------|--------|------|--------|------|------| +| 1h | 74,052 | 0.000039 | 0.0078 | 35.88 | -0.47 | +| 4h | 18,527 | 0.000155 | 0.0149 | 20.54 | -0.20 | +| 1d | 3,090 | 0.000935 | 0.0361 | 15.65 | -0.97 | +| 1w | 434 | 0.006812 | 0.0959 | 2.08 | -0.44 | + +**关键发现**: 峰度随时间尺度增大从 35.88 → 2.08 单调递减,趋向正态分布,符合中心极限定理的聚合正态性。 + +![多时间尺度分布](output/returns/multi_timeframe_distributions.png) + +--- + +## 3. 
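波动率聚集与长记忆性
+
+> 在进入正式结果前,先给出一个最小可复现草图(使用 `requirements.txt` 中已列出的 `arch` 包;`returns` 假定为去除 NaN 后的日线对数收益率 Series,具体参数仅为示意,并非本报告的原始分析代码):
+
+```python
+import numpy as np
+from arch import arch_model
+
+# returns: 日线对数收益率 (pd.Series, 已 dropna): 假设性输入
+# 乘以 100 是 arch 包常用的数值稳定性惯例
+res = arch_model(returns * 100, vol="GARCH", p=1, q=1, dist="t").fit(disp="off")
+
+alpha, beta = res.params["alpha[1]"], res.params["beta[1]"]
+print(f"alpha={alpha:.4f}  beta={beta:.4f}  persistence={alpha + beta:.4f}")
+# 由持续性换算冲击半衰期: persistence=0.973 时约 25 天, 与正文"数十天"的说法一致
+print(f"半衰期 ≈ {np.log(0.5) / np.log(alpha + beta):.1f} 天")
+```
+
+上述草图只说明估计方法;本章正文将依次量化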
波动率聚集与长记忆性 + +### 3.1 GARCH 建模 + +| 参数 | GARCH(1,1) | EGARCH(1,1) | GJR-GARCH(1,1) | +|------|-----------|-------------|-----------------| +| α | 0.0962 | — | — | +| β | 0.8768 | — | — | +| 持续性(α+β) | **0.9730** | — | — | +| 杠杆参数 γ | — | < 0 | > 0 | + +持续性 0.973 接近 1,意味着波动率冲击衰减极慢 — 一次大幅波动的影响需要数十天才能消散。 + +![GARCH条件波动率](output/returns/garch_conditional_volatility.png) + +### 3.2 波动率 ACF 幂律衰减 + +| 指标 | 值 | +|------|-----| +| 幂律衰减指数 d(线性拟合) | 0.6351 | +| 幂律衰减指数 d(非线性拟合) | 0.3449 | +| R² | 0.4231 | +| p 值 | 5.82e-25 | +| 长记忆性判断 (0 < d < 1) | **是** | + +绝对收益率的自相关以幂律速度缓慢衰减,证实波动率具有长记忆特征。标准 GARCH 模型的指数衰减假设可能不足以完整刻画这一特征。 + +![ACF幂律衰减](output/volatility/acf_power_law_fit.png) + +### 3.3 ACF 分析证据 + +| 序列 | ACF显著滞后数 | Ljung-Box Q(100) | p 值 | +|------|-------------|-----------------|------| +| 对数收益率 | 10 | 148.68 | 0.001151 | +| 平方收益率 | 11 | 211.18 | 0.000000 | +| 绝对收益率 | **88** | 2,294.61 | 0.000000 | +| 成交量 | **100** | 103,242.29 | 0.000000 | + +绝对收益率前 88 阶 ACF 均显著(100 阶中的 88 阶),成交量全部 100 阶均显著(ACF(1) = 0.892),证明极强的非线性依赖和波动聚集。 + +![ACF分析](output/acf/acf_grid.png) + +![PACF分析](output/acf/pacf_grid.png) + +![GARCH模型对比](output/volatility/garch_model_comparison.png) + +### 3.4 杠杆效应 + +| 前瞻窗口 | Pearson r | p 值 | 结论 | +|---------|-----------|------|------| +| 5d | -0.0620 | 5.72e-04 | 显著弱负相关 | +| 10d | -0.0337 | 0.062 | 不显著 | +| 20d | -0.0176 | 0.329 | 不显著 | + +仅在 5 天窗口内观测到弱杠杆效应(下跌后波动率上升),效应量极小(r=-0.062),比传统股市弱得多。 + +![杠杆效应](output/volatility/leverage_effect_scatter.png) + +--- + +## 4. 频域周期分析 + +### 4.1 FFT 频谱分析 + +对日线对数收益率施加 Hann 窗后做 FFT,以 AR(1) 红噪声为基准检测显著周期: + +| 周期(天) | SNR (信噪比) | 跨时间框架确认 | +|---------|-------------|--------------| +| 39.6 | 6.36x | 4h + 1d + 1w(三框架确认) | +| 3.1 | 5.27x | 4h + 1d | +| 14.4 | 5.22x | 4h + 1d | +| 13.3 | 5.19x | 4h + 1d | + +**带通滤波方差占比**: + +| 周期分量 | 方差占比 | +|---------|---------| +| 7d | 14.917% | +| 30d | 3.770% | +| 90d | 2.405% | +| 365d | 0.749% | +| 1400d | 0.233% | + +7 天周期分量解释了最多的方差(14.9%),但总体所有周期分量加起来仅解释 ~22% 的方差,约 78% 的波动无法用周期性解释。 + +![FFT功率谱](output/fft/fft_power_spectrum.png) + +![多时间框架FFT](output/fft/fft_multi_timeframe.png) + +![带通滤波分量](output/fft/fft_bandpass_components.png) + +### 4.2 小波变换 (CWT) + +使用复 Morlet 小波(cmor1.5-1.0),1000 次 AR(1) Monte Carlo 替代数据构建 95% 显著性阈值: + +| 显著周期(天) | 年数 | 功率/阈值比 | +|-------------|------|-----------| +| 633 | 1.73 | 1.01x | +| 316 | 0.87 | 1.15x | +| 297 | 0.81 | 1.07x | +| 278 | 0.76 | 1.10x | +| 267 | 0.73 | 1.07x | +| 251 | 0.69 | 1.11x | +| 212 | 0.58 | 1.14x | + +这些周期虽然通过了 95% 显著性检验,但功率/阈值比值仅 1.01~1.15x,属于**边际显著**,实际应用价值有限。 + +![小波时频图](output/wavelet/wavelet_scalogram.png) + +![全局小波谱](output/wavelet/wavelet_global_spectrum.png) + +![关键周期追踪](output/wavelet/wavelet_key_periods.png) + +--- + +## 5. 
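Hurst 指数与分形分析
+
+> 正文的 R/S 结果可用下面的简化实现交叉核对(示意性草图,非本报告原始代码;`x` 为对数收益率的一维数组,分段尺度数量与最小分段长度均为任意假设):
+
+```python
+import numpy as np
+
+def hurst_rs(x, n_scales=12, min_size=8):
+    """经典重标极差 (R/S) 法: log(R/S) 对 log(分段长度) 回归, 斜率即 Hurst 指数。"""
+    x = np.asarray(x, dtype=float)
+    n = len(x)
+    sizes = np.unique(np.logspace(np.log10(min_size), np.log10(n // 2), n_scales).astype(int))
+    log_s, log_rs = [], []
+    for s in sizes:
+        chunks = x[: (n // s) * s].reshape(-1, s)
+        cum = (chunks - chunks.mean(axis=1, keepdims=True)).cumsum(axis=1)
+        r = cum.max(axis=1) - cum.min(axis=1)   # 每段累积偏差的极差
+        sd = chunks.std(axis=1, ddof=1)         # 每段的标准差
+        ok = sd > 0
+        log_s.append(np.log(s))
+        log_rs.append(np.log((r[ok] / sd[ok]).mean()))
+    return np.polyfit(log_s, log_rs, 1)[0]      # H > 0.55 记为弱趋势性
+```
+
+方法交代完毕,下面正文完整呈现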
Hurst 指数与分形分析 + +### 5.1 Hurst 指数 + +R/S 分析和 DFA 两种独立方法交叉验证: + +| 方法 | Hurst 值 | 解读 | +|------|---------|------| +| R/S 分析 | 0.5991 | 弱趋势性 | +| DFA | 0.5868 | 弱趋势性 | +| **平均** | **0.5930** | 弱趋势性 (H > 0.55) | +| 方法差异 | 0.0122 | 一致性好 (< 0.05) | + +判定标准:H > 0.55 趋势性 / H < 0.45 均值回归 / 0.45 ≤ H ≤ 0.55 随机游走 + +**多时间框架 Hurst**: + +| 时间尺度 | R/S | DFA | 平均 | +|---------|-----|-----|------| +| 1h | 0.5552 | 0.5559 | 0.5556 | +| 4h | 0.5749 | 0.5771 | 0.5760 | +| 1d | 0.5991 | 0.5868 | 0.5930 | +| 1w | 0.6864 | 0.6552 | **0.6708** | + +Hurst 指数随时间尺度增大而增大,周线级别(H=0.67)呈现更明显的趋势性。 + +**滚动窗口分析**(500 天窗口,30 天步进): + +| 指标 | 值 | +|------|-----| +| 窗口数 | 87 | +| 趋势状态占比 | **98.9%** (86/87) | +| 随机游走占比 | 1.1% | +| 均值回归占比 | 0.0% | +| Hurst 范围 | [0.549, 0.654] | + +几乎所有时间窗口都显示弱趋势性,没有任何窗口进入均值回归状态。 + +![R/S对数-对数图](output/hurst/hurst_rs_loglog.png) + +![滚动Hurst](output/hurst/hurst_rolling.png) + +![多时间框架Hurst](output/hurst/hurst_multi_timeframe.png) + +### 5.2 分形维度 + +| 指标 | BTC | 随机游走均值 | 随机游走标准差 | +|------|-----|-----------|-------------| +| 盒计数维数 D | 1.3398 | 1.3805 | 0.0295 | +| 由 D 推算 H (D=2-H) | 0.6602 | — | — | +| Z 统计量 | -1.3821 | — | — | +| p 值 | 0.1669 | — | — | + +BTC 的分形维数 D=1.34 低于随机游走的 D=1.38(序列更光滑),但 100 次蒙特卡洛模拟 Z 检验的 p=0.167 **未达到 5% 显著性**。 + +**多尺度自相似性**:峰度从尺度 1 的 15.65 降至尺度 50 的 -0.25,大尺度下趋于正态,自相似性有限。 + +![盒计数分形维度](output/fractal/fractal_box_counting.png) + +![蒙特卡洛对比](output/fractal/fractal_monte_carlo.png) + +![自相似性分析](output/fractal/fractal_self_similarity.png) + +--- + +## 6. 幂律增长模型 + +| 指标 | 值 | +|------|-----| +| 幂律指数 α | 0.770 | +| R² | 0.568 | +| p 值 | 0.00 | + +### 6.1 幂律走廊模型 + +| 分位数 | 当前走廊价格 | +|--------|-----------| +| 5%(低估) | $16,879 | +| 50%(中枢) | $51,707 | +| 95%(高估) | $119,340 | +| **当前价格** | **$76,968** | +| 历史残差分位 | **67.9%** | + +当前价格处于走廊的 67.9% 分位,属于历史正常波动范围内。 + +### 6.2 幂律 vs 指数增长模型对比 + +| 模型 | AIC | BIC | +|------|-----|-----| +| 幂律 | 68,301 | 68,313 | +| 指数 | **67,807** | **67,820** | +| 差值 | +493 | +493 | + +AIC/BIC 均支持指数增长模型优于幂律模型(差值 493),说明 BTC 的长期增长更接近指数而非幂律。 + +![对数-对数回归](output/power_law/power_law_loglog_regression.png) + +![幂律走廊](output/power_law/power_law_corridor.png) + +![模型对比](output/power_law/power_law_model_comparison.png) + +--- + +## 7. 量价关系与因果检验 + +### 7.1 成交量-波动率相关性 + +| 指标 | 值 | +|------|-----| +| Spearman ρ (volume vs \|return\|) | **0.3215** | +| p 值 | 3.11e-75 | + +成交量放大伴随大幅波动,中等正相关且极其显著。 + +![量价散点图](output/volume_price/volume_return_scatter.png) + +### 7.2 Granger 因果检验 + +共 50 次检验(10 对 × 5 个滞后阶),Bonferroni 校正阈值 = 0.001: + +| 因果方向 | 校正后显著的滞后阶数 | 最大 F 统计量 | +|---------|-----------------|-------------| +| abs_return → volume | **5/5 全显著** | 55.19 | +| log_return → taker_buy_ratio | **5/5 全显著** | 139.21 | +| squared_return → volume | **4/5 显著** | 52.44 | +| log_return → range_pct | 1/5 | 5.74 | +| volume → abs_return | 1/5 | 3.69 | +| volume → log_return | 0/5 | — | +| log_return → volume | 0/5 | — | +| taker_buy_ratio → log_return | 0/5(校正后) | — | + +**核心发现**: 因果关系是**单向**的 — 波动率/收益率 Granger-cause 成交量和 taker_buy_ratio,反向不成立。这意味着成交量是价格波动的结果而非原因。 + +![Granger p值热力图](output/causality/granger_pvalue_heatmap.png) + +![因果网络图](output/causality/granger_causal_network.png) + +### 7.3 跨时间尺度因果 + +| 方向 | 显著滞后阶 | +|------|----------| +| hourly_intraday_vol → log_return | lag=10 显著 (Bonferroni) | +| hourly_volume_sum → log_return | 不显著 | +| hourly_max_abs_return → log_return | lag=10 边际显著 | + +小时级别日内波动率对日线收益率存在微弱的领先信号,但仅在 10 天滞后下显著。 + +### 7.4 OBV 背离 + +检测到 82 个价量背离信号(49 个顶背离 + 33 个底背离)。 + +![OBV背离](output/volume_price/obv_divergence.png) + +--- + +## 8. 
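日历效应
+
+> 本章所有检验共用同一框架:按日历维度分组后做 Kruskal-Wallis 非参数检验,显著时再做 Bonferroni 校正的两两比较。下面是星期效应检验的最小示意(`df` 假定为 DatetimeIndex 的日线表且含 `log_return` 列;非原始代码):
+
+```python
+from scipy import stats
+
+# 按星期几分组 (0=周一, ..., 6=周日)
+groups = [g.dropna().values
+          for _, g in df["log_return"].groupby(df.index.dayofweek)]
+H, p = stats.kruskal(*groups)
+print(f"Kruskal-Wallis H={H:.2f}, p={p:.3f}")  # p > 0.05 即无显著星期效应
+```
+
+按同样的套路,下文逐类检验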
日历效应 + +### 8.1 星期效应 + +| 星期 | 样本数 | 日均收益率 | 标准差 | +|------|--------|----------|--------| +| 周一 | 441 | +0.310% | 4.05% | +| 周二 | 441 | -0.027% | 3.56% | +| 周三 | 441 | +0.374% | 3.69% | +| 周四 | 441 | -0.319% | 4.58% | +| 周五 | 442 | +0.180% | 3.62% | +| 周六 | 442 | +0.117% | 2.45% | +| 周日 | 442 | +0.021% | 2.87% | + +**Kruskal-Wallis H 检验: H=8.24, p=0.221 → 不显著** + +Bonferroni 校正后的 21 对 Mann-Whitney U 两两比较均不显著。 + +![星期效应](output/calendar/calendar_weekday_effect.png) + +### 8.2 月份效应 + +**Kruskal-Wallis H 检验: H=6.12, p=0.865 → 不显著** + +10 月份均值收益率最高(+0.501%),8 月最低(-0.123%),但 66 对两两比较经 Bonferroni 校正后无一显著。 + +![月份效应](output/calendar/calendar_month_effect.png) + +### 8.3 小时效应 + +**收益率 Kruskal-Wallis: H=56.88, p=0.000107 → 显著** +**成交量 Kruskal-Wallis: H=2601.9, p=0.000000 → 显著** + +日内小时效应在收益率和成交量上均显著存在。14:00 UTC 成交量最高(3,805 BTC),03:00-05:00 UTC 成交量最低(~1,980 BTC)。 + +![小时效应](output/calendar/calendar_hour_effect.png) + +### 8.4 季度 & 月初月末效应 + +| 检验 | 统计量 | p 值 | 结论 | +|------|--------|------|------| +| 季度 Kruskal-Wallis | 1.15 | 0.765 | 不显著 | +| 月初 vs 月末 Mann-Whitney | 134,569 | 0.236 | 不显著 | + +![季度和月初月末效应](output/calendar/calendar_quarter_boundary_effect.png) + +### 日历效应总结 + +| 效应类型 | 检验 p 值 | 结论 | +|---------|----------|------| +| 星期效应 | 0.221 | **不显著** | +| 月份效应 | 0.865 | **不显著** | +| 小时效应(收益率) | 0.000107 | **显著** | +| 小时效应(成交量) | 0.000000 | **显著** | +| 季度效应 | 0.765 | **不显著** | +| 月初/月末 | 0.236 | **不显著** | + +仅日内小时效应在统计上显著。 + +--- + +## 9. 减半周期分析 + +> ⚠️ **重要局限**: 仅覆盖 2 次减半事件(2020-05-11, 2024-04-20),统计功效极低。 + +### 9.1 减半前后收益率对比 + +| 周期 | 减半前500天均值 | 减半后500天均值 | Welch's t | p 值 | +|------|-------------|-------------|-----------|------| +| 第三次(2020) | +0.179%/天 | +0.331%/天 | -0.590 | 0.555 | +| 第四次(2024) | +0.264%/天 | +0.108%/天 | 1.008 | 0.314 | +| **合并** | +0.221%/天 | +0.220%/天 | 0.011 | **0.991** | + +合并后 p=0.991,减半前后收益率几乎完全无差异。 + +### 9.2 波动率变化 (Levene 检验) + +| 周期 | 减半前年化波动率 | 减半后年化波动率 | Levene W | p 值 | +|------|--------------|--------------|---------|------| +| 第三次 | 82.72% | 73.13% | 0.608 | 0.436 | +| 第四次 | 47.18% | 46.26% | 0.197 | 0.657 | + +波动率变化在两个周期中均**不显著**。 + +### 9.3 累计收益率 + +| 减半后天数 | 第三次(2020) | 第四次(2024) | +|-----------|-------------|-------------| +| 30天 | +13.32% | +11.95% | +| 90天 | +33.92% | +4.45% | +| 180天 | +69.88% | +5.65% | +| 365天 | **+549.68%** | +33.47% | +| 500天 | +414.35% | +74.31% | + +两次减半后的轨迹差异巨大(365天:550% vs 33%)。 + +### 9.4 轨迹相关性 + +| 时段 | Pearson r | p 值 | +|------|-----------|------| +| 全部 (1001天) | **0.808** | 0.000 | +| 减半前 (500天) | 0.213 | 0.000002 | +| 减半后 (500天) | **0.737** | 0.000 | + +两个周期的归一化价格轨迹高度相关(r=0.81),但仅 2 个样本无法做出因果推断。 + +![归一化轨迹叠加](output/halving/halving_normalized_trajectories.png) + +![减半前后收益率](output/halving/halving_pre_post_returns.png) + +![累计收益率](output/halving/halving_cumulative_returns.png) + +![综合摘要](output/halving/halving_combined_summary.png) + +--- + +## 10. 
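技术指标有效性验证
+
+> 本章结论高度依赖多重检验校正;其核心一步可以用 `statsmodels` 的 Benjamini-Hochberg 实现来示意(下面的 p 值用随机数占位,实际应代入 21 个指标各自检验得到的原始 p 值):
+
+```python
+import numpy as np
+from statsmodels.stats.multitest import multipletests
+
+rng = np.random.default_rng(0)
+pvals = rng.uniform(size=21)          # 占位: 实际为 21 个指标信号的原始 p 值
+reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
+print(f"FDR 校正后显著: {reject.sum()} / {len(pvals)}")
+```
+
+校正方法如上。以下正文正式进入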
技术指标有效性验证 + +对 21 个指标信号(8 种 MA/EMA 交叉 + 9 种 RSI + 3 种 MACD + 1 种布林带)进行严格统计验证。 + +### 10.1 FDR 校正 + +| 数据集 | 通过 FDR 校正的指标数 | +|--------|-------------------| +| 训练集 (1,871 bars) | **0 / 21** | +| 验证集 (639 bars) | **0 / 21** | + +**所有 21 个技术指标经 Benjamini-Hochberg FDR 校正后均不显著。** + +### 10.2 置换检验 (Top-5 IC 指标) + +| 指标 | IC 差值 | 置换 p 值 | 结论 | +|------|--------|----------|------| +| RSI_14_30_70 | -0.005 | 0.566 | 不通过 | +| RSI_14_25_75 | -0.030 | 0.015 | 通过 | +| RSI_21_30_70 | -0.012 | 0.268 | 不通过 | +| RSI_7_25_75 | -0.014 | 0.021 | 通过 | +| RSI_21_20_80 | -0.025 | 0.303 | 不通过 | + +仅 2/5 通过置换检验,且 IC 值均极小(|IC| < 0.05),实际预测力可忽略。 + +### 10.3 训练集 vs 验证集 IC 一致性 + +Top-10 IC 中有 9/10 方向一致,1 个(SMA_20_100)发生方向翻转。但所有 IC 值均在 [-0.10, +0.05] 范围内,效果量极小。 + +![IC分布-训练集](output/indicators/ic_distribution_train.png) + +![IC分布-验证集](output/indicators/ic_distribution_val.png) + +![p值热力图-训练集](output/indicators/pvalue_heatmap_train.png) + +--- + +## 11. K线形态统计验证 + +对 12 种手动实现的经典 K 线形态进行前瞻收益率分析。 + +### 11.1 形态出现频率(训练集) + +| 形态 | 出现次数 | FDR 通过 | +|------|---------|---------| +| Doji | 219 | 否 | +| Bullish_Engulfing | 159 | 否 | +| Bearish_Engulfing | 149 | 否 | +| Pin_Bar_Bull | 116 | 否 | +| Pin_Bar_Bear | 57 | 否 | +| Hammer | 49 | 否 | +| Morning_Star | 23 | 否 | +| Evening_Star | 20 | 否 | +| Inverted_Hammer | 17 | 否 | +| Three_White_Soldiers | 11 | 否 | +| Shooting_Star | 6 | 否 | +| Three_Black_Crows | 4 | 否 | + +**训练集 FDR 校正后 0/12 通过。** + +### 11.2 验证集结果 + +验证集中 3 个形态通过 FDR 校正(Doji 53.1%、Pin_Bar_Bull 39.3%、Bullish_Engulfing 36.2%),但命中率接近或低于 50%(随机水平),缺乏实际交易价值。 + +### 11.3 训练集 → 验证集稳定性 + +| 形态 | 训练集命中率 | 验证集命中率 | 变化 | 评价 | +|------|-----------|-----------|------|------| +| Doji | 51.1% | 53.1% | +1.9% | 稳定 | +| Hammer | 63.3% | 50.0% | -13.3% | 衰减 | +| Pin_Bar_Bear | 57.9% | 60.0% | +2.1% | 稳定 | +| Bullish_Engulfing | 50.9% | 36.2% | -14.7% | 衰减 | +| Morning_Star | 56.5% | 40.0% | -16.5% | 衰减 | + +大部分形态的命中率在验证集上出现衰减,说明训练集中的表现可能是过拟合。 + +![形态出现频率](output/patterns/pattern_counts_train.png) + +![形态前瞻收益率](output/patterns/pattern_forward_returns_train.png) + +![命中率分析](output/patterns/pattern_hit_rate_train.png) + +--- + +## 12. 市场状态聚类 + +### 12.1 K-Means (k=3, 轮廓系数=0.338) + +| 状态 | 占比 | 日均收益率 | 7d年化波动率 | 成交量比 | +|------|------|----------|-----------|---------| +| 横盘整理 | 73.6% | -0.010% | 46.5% | 0.896 | +| 急剧下跌 | 11.8% | -5.636% | 95.2% | 1.452 | +| 强势上涨 | 14.6% | +5.279% | 87.6% | 1.330 | + +### 12.2 马尔可夫转移概率矩阵 + +| | → 横盘 | → 暴跌 | → 暴涨 | +|---|-------|-------|-------| +| 横盘 | 0.820 | 0.077 | 0.103 | +| 暴跌 | 0.452 | 0.230 | 0.319 | +| 暴涨 | 0.546 | 0.230 | 0.224 | + +**平稳分布**: 横盘 73.6%、暴跌 11.8%、暴涨 14.6% + +**平均持有时间**: 横盘 5.55 天 / 暴跌 1.30 天 / 暴涨 1.29 天 + +暴涨暴跌状态平均仅持续 1.3 天即回归横盘。暴跌后有 31.9% 概率转为暴涨(反弹)。 + +![PCA聚类散点图](output/clustering/cluster_pca_k-means.png) + +![聚类特征热力图](output/clustering/cluster_heatmap_k-means.png) + +![转移概率矩阵](output/clustering/cluster_transition_matrix.png) + +![状态时间序列](output/clustering/cluster_state_timeseries.png) + +--- + +## 13. 
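时序预测模型
+
+> 表中的 Diebold-Mariano 检验可以用一个简化版本示意(平方损失、正态近似、未做 HAC 方差修正,严格实现应使用 Newey-West 标准误;`e1`/`e2` 为两个模型的逐日预测误差,属假设性输入):
+
+```python
+import numpy as np
+from scipy import stats
+
+def diebold_mariano(e1, e2):
+    """简化 DM 检验 (平方损失, 正态近似, 无 HAC 修正)。"""
+    d = np.asarray(e1) ** 2 - np.asarray(e2) ** 2   # 逐日平方损失之差
+    dm = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
+    return dm, 2 * stats.norm.sf(abs(dm))           # 双侧 p 值
+```
+
+基于这一检验,下表对比各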
时序预测模型 + +| 模型 | RMSE | RMSE/RW | 方向准确率 | DM p 值 | +|------|------|---------|----------|--------| +| Random Walk | 0.02532 | 1.000 | 0.0%* | — | +| Historical Mean | 0.02527 | 0.998 | 49.9% | 0.152 | +| ARIMA | 未完成** | — | — | — | +| Prophet | 未安装 | — | — | — | +| LSTM | 未安装 | — | — | — | + +\* Random Walk 预测收益=0,方向准确率定义为 0% +\*\* ARIMA 因 numpy 二进制兼容性问题未能完成 + +Historical Mean 的 RMSE/RW = 0.998,仅比随机游走好 0.2%,Diebold-Mariano 检验 p=0.152 **不显著**,本质上等同于随机游走。 + +![预测对比](output/time_series/ts_predictions_comparison.png) + +![方向准确率](output/time_series/ts_direction_accuracy.png) + +--- + +## 14. 异常检测与前兆模式 + +### 14.1 集成异常检测 + +| 方法 | 异常数 | 占比 | +|------|--------|------| +| Isolation Forest | 154 | 5.01% | +| LOF | 154 | 5.01% | +| COPOD | 154 | 5.01% | +| **集成 (≥2/3)** | **142** | **4.62%** | +| GARCH 残差异常 | 48 | 1.55% | +| 集成 ∩ GARCH 重叠 | 41 | — | + +### 14.2 已知事件对齐(容差 5 天) + +| 事件 | 日期 | 是否对齐 | 最小偏差(天) | +|------|------|---------|------------| +| 2017年牛市顶点 | 2017-12-17 | ✓ | 1 | +| 2018年熊市底部 | 2018-12-15 | ✓ | 5 | +| 新冠黑色星期四 | 2020-03-12 | ✓ | **0** | +| 第三次减半 | 2020-05-11 | ✓ | 1 | +| Luna/3AC 暴跌 | 2022-06-18 | ✓ | **0** | +| FTX 崩盘 | 2022-11-09 | ✓ | **0** | + +12 个已知事件中 6 个被成功对齐,其中 3 个精确到 0 天偏差。 + +### 14.3 前兆分类器 + +| 指标 | 值 | +|------|-----| +| 分类器 AUC | **0.9935** | +| 样本数 | 3,053 (异常 134, 正常 2,919) | + +**Top-5 前兆特征(异常前 5~20 天的信号)**: + +| 特征 | 重要性 | +|------|--------| +| range_pct_max_5d | 0.0856 | +| range_pct_std_5d | 0.0836 | +| abs_return_std_5d | 0.0605 | +| abs_return_max_5d | 0.0583 | +| range_pct_deviation_20d | 0.0562 | + +异常事件前 5 天的价格波动幅度(range_pct)和绝对收益率的最大值/标准差是最强的前兆信号。 + +> **注意**: AUC=0.99 部分反映了异常本身的聚集性(异常日前后也是异常的),不等于真正的"事前预测"能力。 + +![异常标记图](output/anomaly/anomaly_price_chart.png) + +![特征分布对比](output/anomaly/anomaly_feature_distributions.png) + +![ROC曲线](output/anomaly/precursor_roc_curve.png) + +![特征重要性](output/anomaly/precursor_feature_importance.png) + +--- + +## 15. 综合结论 + +### 证据分级汇总 + +#### ✅ 强证据(高度可重复,具有经济意义) + +| 规律 | 关键证据 | 可利用性 | +|------|---------|---------| +| 收益率厚尾分布 | KS/JB/AD p≈0,超额峰度=15.65,4σ事件87倍于正态 | 风控必须考虑 | +| 波动率聚集 | GARCH persistence=0.973,绝对收益率ACF 88阶显著 | 可预测波动率 | +| 波动率长记忆性 | 幂律衰减 d=0.635, p=5.8e-25 | FIGARCH建模 | +| 单向因果:波动→成交量 | abs_return→volume F=55.19, Bonferroni校正后全显著 | 理解市场微观结构 | +| 异常事件前兆 | AUC=0.9935,6/12已知事件精确对齐 | 波动率异常预警 | + +#### ⚠️ 中等证据(统计显著但效果有限) + +| 规律 | 关键证据 | 限制 | +|------|---------|------| +| 弱趋势性 | Hurst H=0.593, 98.9%窗口>0.55 | 效应量小(H仅略>0.5) | +| 日内小时效应 | Kruskal-Wallis p=0.0001 | 仅限小时级别 | +| FFT 39.6天周期 | SNR=6.36, 三框架确认 | 7天分量仅解释15%方差 | +| 小波 ~300天周期 | 95% MC显著 | 功率/阈值比仅1.01-1.15x | + +#### ❌ 弱证据/不显著 + +| 规律 | 关键证据 | 结论 | +|------|---------|------| +| 日历效应(星期/月份/季度) | Kruskal-Wallis p=0.22~0.87 | **不存在** | +| 减半效应 | Welch's t p=0.55/0.31, 合并p=0.991 | **不显著**(仅2样本) | +| 技术指标预测力 | 21个指标FDR校正后0通过,IC<0.05 | **不存在** | +| K线形态超额收益 | 训练集FDR 0/12通过,验证集多数衰减 | **不存在** | +| 分形维度偏离随机游走 | Z=-1.38, p=0.167 | **不显著** | +| 时序模型超越随机游走 | RMSE/RW=0.998, DM p=0.152 | **不显著** | + +### 最终判断 + +> **BTC 价格走势存在可测量的统计规律,但绝大多数不具备价格方向的预测可利用性。** +> +> 1. **波动率可预测,价格方向不可预测**。GARCH 效应、波动率聚集、长记忆性是确凿的市场特征,可用于风险管理和期权定价,但不能用于预测涨跌。 +> +> 2. **市场效率的非对称性**。BTC 市场对价格水平(一阶矩)接近有效,但对波动率(二阶矩)远非有效 — 这与传统金融市场的"波动率可预测悖论"一致。 +> +> 3. **流行的交易信号经不起严格检验**。21 个技术指标、12 种 K 线形态、日历效应、减半效应在 FDR/Bonferroni 校正后全部不显著或效果量极小。 +> +> 4. **实际启示**:关注波动率管理而非方向预测;极端事件的风险评估应使用厚尾模型;异常检测可作为风控辅助工具。 + +--- + +--- + +## 16. 
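基于分析数据的未来价格推演(2026-02 ~ 2028-02)
+
+> 16.3 的 GBM 概率锥可以用几行蒙特卡洛代码复现(示意草图:参数取自第 2 章的日收益率统计,路径数与随机种子为任意假设;如正文反复强调,GBM 的正态假设会低估厚尾风险):
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(42)                 # 任意种子
+s0, mu, sigma = 76_968.0, 0.000935, 0.0361      # 基准价 / 日均收益 / 日波动率
+n_paths, horizon = 20_000, 365                  # 1 年期, 2 万条路径
+
+z = rng.standard_normal((n_paths, horizon))
+log_inc = (mu - 0.5 * sigma**2) + sigma * z     # GBM 的对数收益增量
+terminal = s0 * np.exp(log_inc.sum(axis=1))     # 1 年后的终值分布
+
+for q in (2.5, 16, 50, 84, 97.5):
+    print(f"{q:>5.1f}% 分位: ${np.percentile(terminal, q):,.0f}")
+# 中位数约 $85K, 与 16.3 表一致; 真实尾部应比此更宽 (4σ 日概率约为正态的 87 倍)
+```
+
+同时必须再次强调:以下全部内容仅是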
基于分析数据的未来价格推演(2026-02 ~ 2028-02) + +> **重要免责声明**: 本章节是基于前述 15 章的统计分析结果所做的数据驱动推演,**不构成任何投资建议**。BTC 价格的方向准确率在统计上等同于随机游走(第 13 章),任何点位预测的精确性都是幻觉。以下推演的价值在于**量化不确定性的范围**,而非给出精确预测。 + +### 16.1 推演方法论 + +我们综合使用 6 个独立分析框架的量化输出,构建概率分布而非单一预测值: + +| 框架 | 数据来源 | 作用 | +|------|---------|------| +| 几何布朗运动 (GBM) | 日收益率 μ=0.0935%/天, σ=3.61%/天 (第 2 章) | 中性基准的概率锥 | +| 幂律走廊外推 | α=0.770, R²=0.568 (第 6 章) | 长期结构性锚定区间 | +| GARCH 波动率锥 | persistence=0.973 (第 3 章) | 动态波动率调整 | +| 减半周期类比 | 第 3/4 次减半轨迹 r=0.81 (第 9 章) | 周期性参考(仅 2 样本) | +| 马尔可夫状态模型 | 3 状态转移矩阵 (第 12 章) | 状态持续与切换概率 | +| Hurst 趋势推断 | H=0.593, 周线 H=0.67 (第 5 章) | 趋势持续性修正 | + +### 16.2 当前市场状态诊断 + +**基准价格**: $76,968(2026-02-01 收盘价) + +| 诊断维度 | 值 | 含义 | +|---------|-----|------| +| 幂律走廊分位 | 67.9% | 偏高但未极端(5%=$16,879, 95%=$119,340) | +| 距第 4 次减半天数 | ~652 天 | 进入减半后期(第 3 次在 ~550 天见顶) | +| 马尔可夫当前状态 | 横盘整理(73.6%概率) | 日均收益 -0.01%, 年化波动率 46.5% | +| Hurst 最近窗口 | 0.549 ~ 0.654 | 弱趋势持续,未进入均值回归 | +| GARCH 波动率持续性 | 0.973 | 当前波动率水平有强惯性 | + +### 16.3 框架一:GBM 概率锥(假设收益率独立同分布) + +基于日线对数收益率参数(μ=0.000935, σ=0.0361),在几何布朗运动假设下: + +**风险中性漂移修正**: E[ln(S_T/S_0)] = (μ - σ²/2) × T = 0.000283/天 + +| 时间跨度 | 中位数预期 | -1σ (16%分位) | +1σ (84%分位) | -2σ (2.5%分位) | +2σ (97.5%分位) | +|---------|-----------|-------------|-------------|-------------|---------------| +| 6 个月 (183天) | $80,834 | $52,891 | $123,470 | $36,267 | $180,129 | +| 1 年 (365天) | $85,347 | $42,823 | $170,171 | $21,502 | $338,947 | +| 2 年 (730天) | $94,618 | $35,692 | $250,725 | $13,475 | $664,268 | + +> **关键修正**: 由于 BTC 收益率呈厚尾分布(超额峰度=15.65,4σ事件概率是正态的 87 倍),上述 GBM 模型**严重低估了尾部风险**。实际 2.5%/97.5% 分位数的范围应显著宽于上表。 + +### 16.4 框架二:幂律走廊外推 + +以当前幂律参数 α=0.770 外推走廊上下轨: + +| 时间点 | 5% 下轨 | 50% 中轨 | 95% 上轨 | 当前价格位置 | +|--------|---------|---------|---------|-----------| +| 2026-02(现在, day 3091) | $16,879 | $51,707 | $119,340 | $76,968 (67.9%) | +| 2026-08(day 3274) | $17,647 | $54,060 | $124,773 | — | +| 2027-02(day 3456) | $18,412 | $56,404 | $130,183 | — | +| 2028-02(day 3821) | $19,861 | $60,839 | $140,423 | — | + +> **注意**: 幂律模型 R²=0.568 且 AIC 显示指数增长模型拟合更好(差值 493),因此幂律走廊仅做结构性参考,不应作为主要定价依据。走廊的年增速约 9%,远低于历史年化回报 34%。 + +### 16.5 框架三:减半周期类比 + +第 4 次减半(2024-04-20)已过约 652 天。以第 3 次减半为参照: + +| 事件 | 第 3 次(2020-05-11) | 第 4 次(2024-04-20) | 缩减比 | +|------|-------|-------|--------| +| 减半日价格 | ~$8,600 | ~$64,000 | — | +| 365 天累计 | **+549.68%** | +33.47% | **0.061x** | +| 500 天累计 | +414.35% | +74.31% | **0.179x** | +| 周期峰值 | ~$69,000 (~550天) | **?** | — | +| 轨迹相关性 | r = 0.808 (p < 0.001) | — | — | + +**推演**: +- 如果按第 3 次减半的轨迹形态(r=0.81),但收益率大幅衰减(0.06x~0.18x 缩减比),第 4 次周期可能已经或接近峰值 +- 第 3 次减半在 ~550 天达到顶点后进入长期下跌(随后的 2022 年熊市),若类比成立,2026Q1-Q2 可能处于"周期后期" +- **但仅 2 个样本的统计功效极低**(Welch's t 合并 p=0.991),不能依赖此推演 + +### 16.6 框架四:马尔可夫状态模型推演 + +基于 3 状态马尔可夫转移矩阵的条件概率预测: + +**当前状态假设为横盘整理**(73.6% 的日子处于此状态): + +| 未来状态 | 1 天后概率 | 5 天后概率* | 30 天后概率* | +|---------|-----------|-----------|------------| +| 继续横盘 | 82.0% | ~51.3% | ≈平稳分布 73.6% | +| 转入暴跌 | 7.7% | ~10.5% | ≈平稳分布 11.8% | +| 转入暴涨 | 10.3% | ~13.4% | ≈平稳分布 14.6% | + +\* 多步概率通过转移矩阵幂次计算,约 15-20 步后收敛到平稳分布。 + +**关键含义**: +- 暴涨暴跌平均仅持续 1.3 天即回归横盘 +- 暴跌后有 31.9% 概率立即反弹为暴涨("V 型反转"概率) +- 长期来看,市场约 73.6% 的时间在横盘,约 14.6% 的时间在强势上涨,约 11.8% 的时间在急剧下跌 +- **暴涨与暴跌的概率不对称**:暴涨概率(14.6%)略高于暴跌(11.8%),与长期正漂移一致 + +### 16.7 框架五:厚尾修正的概率分布 + +标准 GBM 假设正态分布,但 BTC 的超额峰度=15.65。我们用历史尾部概率修正极端场景: + +| 场景 | 正态模型概率 | BTC 实际概率(历史) | 1 年内触发一次的概率 | +|------|-----------|-----------------|------------------| +| 单日 ≥ 3σ (+10.8%) | 0.135% | **0.776%** (5.75x) | ~94% | +| 单日 ≤ -3σ (-10.8%) | 0.135% | **0.776%** (5.75x) | ~94% | +| 单日 ≥ 4σ (+14.4%) | 0.003% | **0.275%** 
(86.9x) | ~63% |
+| 单日 ≤ -4σ (-14.4%) | 0.003% | **0.275%** (86.9x) | ~63% |
+| 单日 ≥ 5σ (+18.1%) | ~0.00003% | **估计 0.06%** | ~20% |
+| 单日 ≤ -5σ (-18.1%) | ~0.00003% | **估计 0.06%** | ~20% |
+
+在未来 1 年内,**几乎确定会出现至少一次单日 ±10% 的波动**,且有约 63% 的概率出现 ±14% 以上的极端日。
+
+### 16.8 综合情景推演
+
+综合上述 6 个框架,构建 5 个离散情景:
+
+#### 情景 A:持续牛市(概率 ~15%)
+
+| 指标 | 值 | 数据依据 |
+|------|-----|---------|
+| 1 年目标 | $130,000 ~ $200,000 | GBM +1σ 区间 + Hurst 趋势持续 |
+| 2 年目标 | $180,000 ~ $350,000 | GBM +1σ~+2σ,幂律上轨 $140K |
+| 触发条件 | 连续突破幂律 95% 上轨 ($119,340) | 历史上 2021 年曾发生 |
+| 概率依据 | 马尔可夫暴涨状态 14.6% × Hurst 趋势延续 98.9% | 但单次暴涨仅持续 1.3 天 |
+
+**数据支撑**: Hurst H=0.593 表明价格有弱趋势延续性,一旦进入上行通道可能持续。周线 H=0.67 暗示更长周期趋势性更强。但暴涨状态平均仅 1.3 天,需要连续多次暴涨才能实现。
+
+**数据矛盾**: ARIMA/历史均值模型均无法显著超越随机游走(RMSE/RW=0.998),方向预测准确率仅 49.9%。
+
+#### 情景 B:温和上涨(概率 ~25%)
+
+| 指标 | 值 | 数据依据 |
+|------|-----|---------|
+| 1 年目标 | $85,000 ~ $130,000 | GBM 中位数 $85K ~ +1σ $170K 之间 |
+| 2 年目标 | $95,000 ~ $180,000 | 幂律中轨上方,历史漂移率 |
+| 触发条件 | 维持在幂律 50%~95% 区间内 | 当前 67.9% 已在此区间 |
+| 概率依据 | 历史日均收益 +0.094% 的长期漂移 | 8.5 年数据支撑 |
+
+**数据支撑**: 日均正漂移 0.094% 在 8.5 年 3,091 天中持续存在。指数增长模型优于幂律(AIC 差 493),暗示增长速率可能不会减缓。
+
+#### 情景 C:横盘震荡(概率 ~30%)
+
+| 指标 | 值 | 数据依据 |
+|------|-----|---------|
+| 1 年区间 | $50,000 ~ $100,000 | 幂律走廊 50%-95% |
+| 2 年区间 | $45,000 ~ $110,000 | GBM ±0.5σ |
+| 触发条件 | 横盘状态延续(马尔可夫 82% 自我转移) | 最可能的单一状态 |
+| 概率依据 | 马尔可夫平稳分布 73.6% 横盘 | 市场多数时间在整理 |
+
+**数据支撑**: 横盘整理是最频繁的市场状态(73.6% 的日子),且自我转移概率高达 82%。当前年化波动率约 46.5%,与横盘状态特征一致。FFT 检测到的 ~39.6 天周期(SNR=6.36)暗示中短期存在围绕均值的振荡结构。
+
+#### 情景 D:温和下跌(概率 ~20%)
+
+| 指标 | 值 | 数据依据 |
+|------|-----|---------|
+| 1 年目标 | $40,000 ~ $65,000 | GBM -1σ ($43K) 附近 |
+| 2 年目标 | $35,000 ~ $55,000 | 回归幂律中轨 ($57K~$61K) |
+| 触发条件 | 减半周期后期回撤 | 第 3 次在 ~550 天后转熊 |
+| 概率依据 | 幂律位置 67.9% → 回归 50% 中轨 | 均值回归力量 |
+
+**数据支撑**: 当前位于幂律走廊 67.9% 分位(偏高),统计上有回归中轨的倾向。第 3 次减半在峰值(~550 天)后经历了约 -75% 的回撤($69K → $16K),第 4 次减半已过 652 天。
+
+#### 情景 E:黑天鹅暴跌(概率 ~10%)
+
+| 指标 | 值 | 数据依据 |
+|------|-----|---------|
+| 1 年最低 | $15,000 ~ $35,000 | GBM -2σ ($21.5K),接近幂律 5% 下轨 |
+| 触发条件 | 系统性事件(如 2020 新冠、2022 FTX) | 异常检测 6/12 事件对齐 |
+| 概率依据 | 4σ事件年概率 63% × 持续下行 | 厚尾 87x 增强 |
+
+**数据支撑**: 历史上确实发生过 -75%(2022)、-84%(2018)的回撤。异常检测模型(AUC=0.9935)显示极端事件具有前兆特征(前 5 天波动幅度和绝对收益率标准差异常升高),但不等于可精确预测时间点。
+
+### 16.9 概率加权预期
+
+| 情景 | 概率 | 1 年中点 | 2 年中点 |
+|------|------|---------|---------|
+| A 持续牛市 | 15% | $165,000 | $265,000 |
+| B 温和上涨 | 25% | $107,500 | $137,500 |
+| C 横盘震荡 | 30% | $75,000 | $77,500 |
+| D 温和下跌 | 20% | $52,500 | $45,000 |
+| E 黑天鹅 | 10% | $25,000 | $25,000 |
+| **概率加权** | **100%** | **$87,125** | **$108,875** |
+
+概率加权后的 1 年预期价格约 $87,125(+13%),2 年预期约 $108,875(+41%),与历史日均正漂移的累积效应(1 年 +34%)在同一量级。
+
+### 16.10 推演的核心局限性
+
+1. **方向不可预测**: 本报告第 13 章已证明,所有时序模型均无法显著超越随机游走(DM 检验 p=0.152),方向预测准确率仅 49.9%
+2. **周期样本不足**: 减半效应仅基于 2 个样本(合并 p=0.991),统计功效极低
+3. **结构性变化**: 2017-2026 年期间 BTC 的市场结构(机构化、ETF、监管)发生了根本性变化,历史参数可能不适用于未来
+4. **外生冲击不可建模**: 监管政策、宏观经济、地缘政治等外生因素对 BTC 价格有重大影响,但无法从历史价格数据中推断
+5. **波动率可预测,方向不可预测**: 本分析的核心发现是 GARCH persistence=0.973 和波动率长记忆性(d=0.635),意味着我们能较准确预测"波动有多大",但无法预测"方向是什么"
+6. **厚尾风险**: 正态假设下的置信区间**严重低估**极端场景概率,BTC 的 4σ 事件是正态的 87 倍
+
+> **最诚实的结论**: 如果你必须对 BTC 未来 1-2 年做出判断,唯一有统计证据支持的陈述是:
+> 1. **波动率会很大**(年化 ~60%,即 1 年内 ±60% 波动属于"正常"范围)
+> 2. **极端日几乎确定会出现**(年内 ±10% 单日波动概率 >90%)
+> 3. **长期存在微弱的正漂移**(日均 +0.094%,但单日标准差 3.61% 是漂移的 39 倍)
+> 4. 
**任何精确的价格预测都没有统计学基础** + +--- + +*报告生成日期: 2026-02-03 | 分析代码: [src/](src/) | 图表输出: [output/](output/)* diff --git a/main.py b/main.py new file mode 100644 index 0000000..86e17e2 --- /dev/null +++ b/main.py @@ -0,0 +1,219 @@ +#!/usr/bin/env python3 +"""BTC/USDT 价格规律性全面分析 — 主入口 + +串联执行所有分析模块,输出结果到 output/ 目录。 +每个模块独立运行,单个模块失败不影响其他模块。 + +用法: + python3 main.py # 运行全部模块 + python3 main.py --modules fft wavelet # 只运行指定模块 + python3 main.py --list # 列出所有可用模块 +""" + +import sys +import time +import argparse +import traceback +from pathlib import Path +from collections import OrderedDict + +# 确保 src 在路径中 +ROOT = Path(__file__).parent +sys.path.insert(0, str(ROOT)) + +from src.data_loader import load_klines, load_daily, load_hourly, validate_data +from src.preprocessing import add_derived_features + + +# ── 模块注册表 ───────────────────────────────────────────── + +def _import_module(name): + """延迟导入分析模块,避免启动时全部加载""" + import importlib + return importlib.import_module(f"src.{name}") + + +# (模块key, 显示名称, 源模块名, 入口函数名, 是否需要hourly数据) +MODULE_REGISTRY = OrderedDict([ + ("fft", ("FFT频谱分析", "fft_analysis", "run_fft_analysis", False)), + ("wavelet", ("小波变换分析", "wavelet_analysis", "run_wavelet_analysis", False)), + ("acf", ("ACF/PACF分析", "acf_analysis", "run_acf_analysis", False)), + ("returns", ("收益率分布分析", "returns_analysis", "run_returns_analysis", False)), + ("volatility", ("波动率聚集分析", "volatility_analysis", "run_volatility_analysis", False)), + ("hurst", ("Hurst指数分析", "hurst_analysis", "run_hurst_analysis", False)), + ("fractal", ("分形维度分析", "fractal_analysis", "run_fractal_analysis", False)), + ("power_law", ("幂律增长分析", "power_law_analysis", "run_power_law_analysis", False)), + ("volume_price", ("量价关系分析", "volume_price_analysis", "run_volume_price_analysis", False)), + ("calendar", ("日历效应分析", "calendar_analysis", "run_calendar_analysis", True)), + ("halving", ("减半周期分析", "halving_analysis", "run_halving_analysis", False)), + ("indicators", ("技术指标验证", "indicators", "run_indicators_analysis", False)), + ("patterns", ("K线形态分析", "patterns", "run_patterns_analysis", False)), + ("clustering", ("市场状态聚类", "clustering", "run_clustering_analysis", False)), + ("time_series", ("时序预测", "time_series", "run_time_series_analysis", False)), + ("causality", ("因果检验", "causality", "run_causality_analysis", False)), + ("anomaly", ("异常检测", "anomaly", "run_anomaly_analysis", False)), +]) + + +OUTPUT_DIR = ROOT / "output" + + +def run_single_module(key, df, df_hourly, output_base): + """ + 运行单个分析模块 + + Returns + ------- + dict or None + 模块返回的结果字典,失败返回 None + """ + display_name, mod_name, func_name, needs_hourly = MODULE_REGISTRY[key] + module_output = str(output_base / key) + Path(module_output).mkdir(parents=True, exist_ok=True) + + print(f"\n{'='*60}") + print(f" [{key}] {display_name}") + print(f"{'='*60}") + + try: + mod = _import_module(mod_name) + func = getattr(mod, func_name) + + if needs_hourly: + result = func(df, df_hourly, module_output) + else: + result = func(df, module_output) + + if result is None: + result = {"status": "completed", "findings": []} + + result["status"] = "success" + print(f" [{key}] 完成 ✓") + return result + + except Exception as e: + print(f" [{key}] 失败 ✗: {e}") + traceback.print_exc() + return {"status": "error", "error": str(e), "findings": []} + + +def main(): + parser = argparse.ArgumentParser(description="BTC/USDT 价格规律性全面分析") + parser.add_argument("--modules", nargs="*", default=None, + help="指定要运行的模块 (默认运行全部)") + parser.add_argument("--list", action="store_true", + help="列出所有可用模块") + 
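# 可选的日期过滤参数, 原样透传给下方的 load_daily / load_hourly
+    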
parser.add_argument("--start", type=str, default=None, + help="数据起始日期, 如 2020-01-01") + parser.add_argument("--end", type=str, default=None, + help="数据结束日期, 如 2025-12-31") + args = parser.parse_args() + + if args.list: + print("\n可用分析模块:") + print("-" * 50) + for key, (name, _, _, _) in MODULE_REGISTRY.items(): + print(f" {key:<15} {name}") + print() + return + + # ── 1. 加载数据 ────────────────────────────────────── + print("=" * 60) + print(" BTC/USDT 价格规律性全面分析") + print("=" * 60) + + print("\n[1/3] 加载日线数据...") + df_daily = load_daily(start=args.start, end=args.end) + report = validate_data(df_daily, "1d") + print(f" 行数: {report['rows']}") + print(f" 日期范围: {report['date_range']}") + print(f" 价格范围: {report['price_range']}") + + print("\n[2/3] 添加衍生特征...") + df = add_derived_features(df_daily) + print(f" 特征列: {list(df.columns)}") + + print("\n[3/3] 加载小时数据 (日历效应需要)...") + try: + df_hourly_raw = load_hourly(start=args.start, end=args.end) + df_hourly = add_derived_features(df_hourly_raw) + print(f" 小时数据行数: {len(df_hourly)}") + except Exception as e: + print(f" 小时数据加载失败 (日历效应小时分析将跳过): {e}") + df_hourly = None + + # ── 2. 确定要运行的模块 ────────────────────────────── + if args.modules: + modules_to_run = [] + for m in args.modules: + if m in MODULE_REGISTRY: + modules_to_run.append(m) + else: + print(f" 警告: 未知模块 '{m}', 跳过") + else: + modules_to_run = list(MODULE_REGISTRY.keys()) + + print(f"\n将运行 {len(modules_to_run)} 个分析模块:") + for m in modules_to_run: + print(f" - {m}: {MODULE_REGISTRY[m][0]}") + + # ── 3. 逐一运行模块 ───────────────────────────────── + OUTPUT_DIR.mkdir(parents=True, exist_ok=True) + all_results = {} + timings = {} + + for key in modules_to_run: + t0 = time.time() + result = run_single_module(key, df, df_hourly, OUTPUT_DIR) + elapsed = time.time() - t0 + timings[key] = elapsed + if result is not None: + all_results[key] = result + print(f" 耗时: {elapsed:.1f}s") + + # ── 4. 生成综合报告 ────────────────────────────────── + print(f"\n{'='*60}") + print(" 生成综合分析报告") + print(f"{'='*60}") + + from src.visualization import generate_summary_dashboard, plot_price_overview + + # 价格概览图 + plot_price_overview(df_daily, str(OUTPUT_DIR)) + + # 综合仪表盘 + dashboard_result = generate_summary_dashboard(all_results, str(OUTPUT_DIR)) + + # ── 5. 
打印执行摘要 ────────────────────────────────── + print(f"\n{'='*60}") + print(" 执行摘要") + print(f"{'='*60}") + + success = sum(1 for r in all_results.values() if r.get("status") == "success") + failed = sum(1 for r in all_results.values() if r.get("status") == "error") + total_time = sum(timings.values()) + + print(f"\n 模块总数: {len(modules_to_run)}") + print(f" 成功: {success}") + print(f" 失败: {failed}") + print(f" 总耗时: {total_time:.1f}s") + + print(f"\n 各模块耗时:") + for key, t in sorted(timings.items(), key=lambda x: -x[1]): + status = all_results.get(key, {}).get("status", "unknown") + mark = "✓" if status == "success" else "✗" + print(f" {mark} {key:<15} {t:>8.1f}s") + + print(f"\n 输出目录: {OUTPUT_DIR.resolve()}") + if dashboard_result: + print(f" 综合报告: {dashboard_result.get('report_path', 'N/A')}") + print(f" 仪表盘图: {dashboard_result.get('dashboard_path', 'N/A')}") + print(f" JSON结果: {dashboard_result.get('json_path', 'N/A')}") + + print(f"\n{'='*60}") + print(" 分析完成!") + print(f"{'='*60}\n") + + +if __name__ == "__main__": + main() diff --git a/output/acf/acf_grid.png b/output/acf/acf_grid.png new file mode 100644 index 0000000..f5c4b9f Binary files /dev/null and b/output/acf/acf_grid.png differ diff --git a/output/acf/pacf_grid.png b/output/acf/pacf_grid.png new file mode 100644 index 0000000..ce77ae5 Binary files /dev/null and b/output/acf/pacf_grid.png differ diff --git a/output/acf/significant_lags_heatmap.png b/output/acf/significant_lags_heatmap.png new file mode 100644 index 0000000..23d2ecf Binary files /dev/null and b/output/acf/significant_lags_heatmap.png differ diff --git a/output/all_results.json b/output/all_results.json new file mode 100644 index 0000000..cf32c07 --- /dev/null +++ b/output/all_results.json @@ -0,0 +1,44 @@ +{ + "indicators": { + "train_results": " n_buy n_sell ... ic_rejected any_fdr_pass\nindicator ... \nSMA_5_20 47.0 48.0 ... False False\nEMA_5_20 53.0 54.0 ... False False\nSMA_10_50 21.0 22.0 ... False False\nEMA_10_50 19.0 20.0 ... False False\nSMA_20_100 7.0 8.0 ... False False\nEMA_20_100 9.0 10.0 ... False False\nSMA_50_200 4.0 5.0 ... False False\nEMA_50_200 6.0 7.0 ... False False\nRSI_7_30_70 66.0 78.0 ... False False\nRSI_7_25_75 48.0 62.0 ... False False\nRSI_7_20_80 21.0 41.0 ... False False\nRSI_14_30_70 24.0 47.0 ... False False\nRSI_14_25_75 15.0 27.0 ... False False\nRSI_14_20_80 4.0 17.0 ... False False\nRSI_21_30_70 14.0 29.0 ... False False\nRSI_21_25_75 4.0 16.0 ... False False\nRSI_21_20_80 2.0 11.0 ... False False\nMACD_12_26_9 65.0 65.0 ... False False\nMACD_8_17_9 92.0 92.0 ... False False\nMACD_5_35_5 123.0 123.0 ... False False\nBB_20_2 39.0 59.0 ... False False\n\n[21 rows x 23 columns]", + "val_results": " n_buy n_sell ... ic_rejected any_fdr_pass\nindicator ... \nSMA_5_20 21.0 21.0 ... False False\nEMA_5_20 17.0 17.0 ... False False\nSMA_10_50 7.0 7.0 ... False False\nEMA_10_50 8.0 8.0 ... False False\nSMA_20_100 4.0 4.0 ... False False\nEMA_20_100 3.0 3.0 ... False False\nSMA_50_200 2.0 1.0 ... False False\nEMA_50_200 2.0 1.0 ... False False\nRSI_7_30_70 16.0 27.0 ... False False\nRSI_7_25_75 9.0 16.0 ... False False\nRSI_7_20_80 4.0 17.0 ... False False\nRSI_14_30_70 4.0 17.0 ... False False\nRSI_14_25_75 3.0 6.0 ... False False\nRSI_14_20_80 1.0 7.0 ... False False\nRSI_21_30_70 1.0 7.0 ... False False\nRSI_21_25_75 0.0 9.0 ... False False\nRSI_21_20_80 0.0 7.0 ... False False\nMACD_12_26_9 22.0 23.0 ... False False\nMACD_8_17_9 28.0 29.0 ... False False\nMACD_5_35_5 42.0 43.0 ... False False\nBB_20_2 12.0 26.0 ... 
False False\n\n[21 rows x 23 columns]", + "fdr_passed_train": [], + "fdr_passed_val": [], + "permutation_results": { + "RSI_14_30_70": { + "observed_diff": -0.004977440100087348, + "perm_pval": 0.5664335664335665 + }, + "RSI_14_25_75": { + "observed_diff": -0.03017610738336842, + "perm_pval": 0.014985014985014986 + }, + "RSI_21_30_70": { + "observed_diff": -0.012247499113796413, + "perm_pval": 0.2677322677322677 + }, + "RSI_7_25_75": { + "observed_diff": -0.014302431427126703, + "perm_pval": 0.02097902097902098 + }, + "RSI_21_20_80": { + "observed_diff": -0.0252918754365221, + "perm_pval": 0.3026973026973027 + } + }, + "all_signals": "{'SMA_5_20': datetime\n2017-08-17 0\n2017-08-18 0\n2017-08-19 0\n2017-08-20 0\n2017-08-21 0\n ..\n2026-01-28 0\n2026-01-29 0\n2026-01-30 0\n2026-01-31 0\n2026-02-01 0\nLength: 3091, dtype: int64, 'EMA_5_20': datetime\n2017-08-17 0\n2017-08-18 -1\n2017-08-19 0\n2017-08-20 0\n2017-08-21 0\n ..\n2026-01-28 0\n2026-01-29 0\n2026-01-30 0\n2026-01-31 0\n2026-02-01 0\nLength: 3091, dtype: int64, 'SMA_10_50': datetime\n2017-08-17 0\n2017-08-18 0\n2017-08-19 0\n2017-08-20 0\n2017-08-21 0\n ..\n2026-01-28 0\n2026-01-29 0\n2026-01-30 0\n2026-01-31 0\n2026-02-01 0\nLength: 3091, dtype: int64, 'EMA_10_50': datetime\n2017-08-17 0\n2017-08-18 -1\n2017-08-19 0\n2017-08-20 0\n2017-08-21 0\n ..\n2026-01-28 0\n2026-01-29 0\n2026-01-30 0\n2026-01-31 0\n2026-02-01 0\nLength: 3091, dtype: int64, 'SMA_20_100': datetime\n2017-08-17 0\n2017-08-18 0\n2017-08-19 0\n2017-08-20 0\n2017-08-21 0\n ..\n2026-01-28 0\n2026-01-29 0\n2026-01-30 0\n2026-01-31 0\n2026-02-01 0\nLength: 3091, dtype: int64, 'EMA_20_100': datetime\n2017-08-17 0\n2017-08-18 -1\n2017-08-19 0\n2017-08-20 0\n2017-08-21 0\n ..\n2026-01-28 0\n2026-01-29 0\n2026-01-30 0\n2026-01-31 0\n2026-02-01 0\nLength: 3091, dtype: int64, 'SMA_50_200': datetime\n2017-08-17 0\n2017-08-18 0\n2017-08-19 0\n2017-08-20 0\n2017-08-21 0\n ..\n2026-01-28 0\n2026-01-29 0\n2026-01-30 0\n2026-01-31 0\n2026-02-01 0\nLength: 3091, dtype: int64, 'EMA_50_200': datetime\n2017-08-17 0\n2017-08-18 -1\n2017-08-19 0\n2017-08-20 0\n2017-08-21 0\n ..\n2026-01-28 0\n2026-01-29 0\n2026-01-30 0\n2026-01-31 0\n2026-02-01 0\nLength: 3091, dtype: int64, 'RSI_7_30_70': datetime\n2017-08-17 0\n2017-08-18 0\n2017-08-19 0\n2017-08-20 0\n2017-08-21 0\n ..\n2026-01-28 0\n2026-01-29 0\n2026-01-30 0\n2026-01-31 0\n2026-02-01 0\nLength: 3091, dtype: int64, 'RSI_7_25_75': datetime\n2017-08-17 0\n2017-08-18 0\n2017-08-19 0\n2017-08-20 0\n2017-08-21 0\n ..\n2026-01-28 0\n2026-01-29 0\n2026-01-30 0\n2026-01-31 0\n2026-02-01 0\nLength: 3091, dtype: int64, 'RSI_7_20_80': datetime\n2017-08-17 0\n2017-08-18 0\n2017-08-19 0\n2017-08-20 0\n2017-08-21 0\n ..\n2026-01-28 0\n2026-01-29 0\n2026-01-30 0\n2026-01-31 0\n2026-02-01 0\nLength: 3091, dtype: int64, 'RSI_14_30_70': datetime\n2017-08-17 0\n2017-08-18 0\n2017-08-19 0\n2017-08-20 0\n2017-08-21 0\n ..\n2026-01-28 0\n2026-01-29 0\n2026-01-30 0\n2026-01-31 0\n2026-02-01 0\nLength: 3091, dtype: int64, 'RSI_14_25_75': datetime\n2017-08-17 0\n2017-08-18 0\n2017-08-19 0\n2017-08-20 0\n2017-08-21 0\n ..\n2026-01-28 0\n2026-01-29 0\n2026-01-30 0\n2026-01-31 0\n2026-02-01 0\nLength: 3091, dtype: int64, 'RSI_14_20_80': datetime\n2017-08-17 0\n2017-08-18 0\n2017-08-19 0\n2017-08-20 0\n2017-08-21 0\n ..\n2026-01-28 0\n2026-01-29 0\n2026-01-30 0\n2026-01-31 0\n2026-02-01 0\nLength: 3091, dtype: int64, 'RSI_21_30_70': datetime\n2017-08-17 0\n2017-08-18 0\n2017-08-19 0\n2017-08-20 0\n2017-08-21 0\n ..\n2026-01-28 0\n2026-01-29 0\n2026-01-30 
0\n2026-01-31 0\n2026-02-01 0\nLength: 3091, dtype: int64, 'RSI_21_25_75': datetime\n2017-08-17 0\n2017-08-18 0\n2017-08-19 0\n2017-08-20 0\n2017-08-21 0\n ..\n2026-01-28 0\n2026-01-29 0\n2026-01-30 0\n2026-01-31 0\n2026-02-01 0\nLength: 3091, dtype: int64, 'RSI_21_20_80': datetime\n2017-08-17 0\n2017-08-18 0\n2017-08-19 0\n2017-08-20 0\n2017-08-21 0\n ..\n2026-01-28 0\n2026-01-29 0\n2026-01-30 0\n2026-01-31 0\n2026-02-01 0\nLength: 3091, dtype: int64, 'MACD_12_26_9': datetime\n2017-08-17 0\n2017-08-18 -1\n2017-08-19 0\n2017-08-20 0\n2017-08-21 0\n ..\n2026-01-28 0\n2026-01-29 0\n2026-01-30 0\n2026-01-31 0\n2026-02-01 0\nLength: 3091, dtype: int64, 'MACD_8_17_9': datetime\n2017-08-17 0\n2017-08-18 -1\n2017-08-19 0\n2017-08-20 0\n2017-08-21 0\n ..\n2026-01-28 0\n2026-01-29 0\n2026-01-30 0\n2026-01-31 0\n2026-02-01 0\nLength: 3091, dtype: int64, 'MACD_5_35_5': datetime\n2017-08-17 0\n2017-08-18 -1\n2017-08-19 0\n2017-08-20 0\n2017-08-21 0\n ..\n2026-01-28 0\n2026-01-29 0\n2026-01-30 0\n2026-01-31 0\n2026-02-01 0\nLength: 3091, dtype: int64, 'BB_20_2': datetime\n2017-08-17 0\n2017-08-18 0\n2017-08-19 0\n2017-08-20 0\n2017-08-21 0\n ..\n2026-01-28 0\n2026-01-29 0\n2026-01-30 0\n2026-01-31 0\n2026-02-01 0\nLength: 3091, dtype: int64}", + "status": "success" + }, + "patterns": { + "train_results": " n_occurrences ... any_fdr_pass\npattern ... \nDoji 219.0 ... False\nHammer 49.0 ... False\nInverted_Hammer 17.0 ... False\nShooting_Star 6.0 ... False\nPin_Bar_Bull 116.0 ... False\nPin_Bar_Bear 57.0 ... False\nBullish_Engulfing 159.0 ... False\nBearish_Engulfing 149.0 ... False\nMorning_Star 23.0 ... False\nEvening_Star 20.0 ... False\nThree_White_Soldiers 11.0 ... False\nThree_Black_Crows 4.0 ... False\n\n[12 rows x 41 columns]", + "val_results": " n_occurrences ... any_fdr_pass\npattern ... \nDoji 81.0 ... True\nHammer 12.0 ... False\nInverted_Hammer 6.0 ... False\nShooting_Star 3.0 ... False\nPin_Bar_Bull 28.0 ... True\nPin_Bar_Bear 20.0 ... False\nBullish_Engulfing 69.0 ... True\nBearish_Engulfing 47.0 ... False\nMorning_Star 5.0 ... False\nEvening_Star 6.0 ... False\nThree_White_Soldiers 4.0 ... False\nThree_Black_Crows 0.0 ... 
False\n\n[12 rows x 41 columns]", + "fdr_passed_train": [], + "fdr_passed_val": [ + "Doji", + "Pin_Bar_Bull", + "Bullish_Engulfing" + ], + "all_patterns": "{'Doji': datetime\n2017-08-17 1\n2017-08-18 0\n2017-08-19 1\n2017-08-20 0\n2017-08-21 0\n ..\n2026-01-28 1\n2026-01-29 0\n2026-01-30 0\n2026-01-31 0\n2026-02-01 0\nLength: 3091, dtype: int64, 'Hammer': datetime\n2017-08-17 0\n2017-08-18 0\n2017-08-19 0\n2017-08-20 0\n2017-08-21 0\n ..\n2026-01-28 0\n2026-01-29 0\n2026-01-30 1\n2026-01-31 0\n2026-02-01 0\nLength: 3091, dtype: int64, 'Inverted_Hammer': datetime\n2017-08-17 0\n2017-08-18 0\n2017-08-19 0\n2017-08-20 0\n2017-08-21 0\n ..\n2026-01-28 0\n2026-01-29 0\n2026-01-30 0\n2026-01-31 0\n2026-02-01 0\nLength: 3091, dtype: int64, 'Shooting_Star': datetime\n2017-08-17 0\n2017-08-18 0\n2017-08-19 0\n2017-08-20 0\n2017-08-21 0\n ..\n2026-01-28 0\n2026-01-29 0\n2026-01-30 0\n2026-01-31 0\n2026-02-01 0\nLength: 3091, dtype: int64, 'Pin_Bar_Bull': datetime\n2017-08-17 0\n2017-08-18 0\n2017-08-19 1\n2017-08-20 0\n2017-08-21 0\n ..\n2026-01-28 0\n2026-01-29 0\n2026-01-30 1\n2026-01-31 0\n2026-02-01 0\nLength: 3091, dtype: int64, 'Pin_Bar_Bear': datetime\n2017-08-17 1\n2017-08-18 0\n2017-08-19 0\n2017-08-20 0\n2017-08-21 0\n ..\n2026-01-28 1\n2026-01-29 0\n2026-01-30 0\n2026-01-31 0\n2026-02-01 0\nLength: 3091, dtype: int64, 'Bullish_Engulfing': datetime\n2017-08-17 0\n2017-08-18 0\n2017-08-19 0\n2017-08-20 0\n2017-08-21 0\n ..\n2026-01-28 0\n2026-01-29 0\n2026-01-30 0\n2026-01-31 0\n2026-02-01 0\nLength: 3091, dtype: int64, 'Bearish_Engulfing': datetime\n2017-08-17 0\n2017-08-18 1\n2017-08-19 0\n2017-08-20 0\n2017-08-21 0\n ..\n2026-01-28 0\n2026-01-29 1\n2026-01-30 0\n2026-01-31 0\n2026-02-01 0\nLength: 3091, dtype: int64, 'Morning_Star': datetime\n2017-08-17 0\n2017-08-18 0\n2017-08-19 0\n2017-08-20 0\n2017-08-21 0\n ..\n2026-01-28 0\n2026-01-29 0\n2026-01-30 0\n2026-01-31 0\n2026-02-01 0\nLength: 3091, dtype: int64, 'Evening_Star': datetime\n2017-08-17 0\n2017-08-18 0\n2017-08-19 0\n2017-08-20 0\n2017-08-21 0\n ..\n2026-01-28 0\n2026-01-29 1\n2026-01-30 0\n2026-01-31 0\n2026-02-01 0\nLength: 3091, dtype: int64, 'Three_White_Soldiers': datetime\n2017-08-17 0\n2017-08-18 0\n2017-08-19 0\n2017-08-20 0\n2017-08-21 0\n ..\n2026-01-28 0\n2026-01-29 0\n2026-01-30 0\n2026-01-31 0\n2026-02-01 0\nLength: 3091, dtype: int64, 'Three_Black_Crows': datetime\n2017-08-17 0\n2017-08-18 0\n2017-08-19 0\n2017-08-20 0\n2017-08-21 0\n ..\n2026-01-28 0\n2026-01-29 0\n2026-01-30 0\n2026-01-31 0\n2026-02-01 0\nLength: 3091, dtype: int64}", + "status": "success" + } +} \ No newline at end of file diff --git a/output/anomaly/anomaly_feature_distributions.png b/output/anomaly/anomaly_feature_distributions.png new file mode 100644 index 0000000..8a6e9cb Binary files /dev/null and b/output/anomaly/anomaly_feature_distributions.png differ diff --git a/output/anomaly/anomaly_price_chart.png b/output/anomaly/anomaly_price_chart.png new file mode 100644 index 0000000..f4d7d83 Binary files /dev/null and b/output/anomaly/anomaly_price_chart.png differ diff --git a/output/anomaly/precursor_feature_importance.png b/output/anomaly/precursor_feature_importance.png new file mode 100644 index 0000000..d252750 Binary files /dev/null and b/output/anomaly/precursor_feature_importance.png differ diff --git a/output/anomaly/precursor_roc_curve.png b/output/anomaly/precursor_roc_curve.png new file mode 100644 index 0000000..64b912c Binary files /dev/null and b/output/anomaly/precursor_roc_curve.png differ diff --git 
a/output/calendar/calendar_hour_effect.png b/output/calendar/calendar_hour_effect.png new file mode 100644 index 0000000..e6a9fe6 Binary files /dev/null and b/output/calendar/calendar_hour_effect.png differ diff --git a/output/calendar/calendar_month_effect.png b/output/calendar/calendar_month_effect.png new file mode 100644 index 0000000..0bcebf3 Binary files /dev/null and b/output/calendar/calendar_month_effect.png differ diff --git a/output/calendar/calendar_quarter_boundary_effect.png b/output/calendar/calendar_quarter_boundary_effect.png new file mode 100644 index 0000000..f684eda Binary files /dev/null and b/output/calendar/calendar_quarter_boundary_effect.png differ diff --git a/output/calendar/calendar_weekday_effect.png b/output/calendar/calendar_weekday_effect.png new file mode 100644 index 0000000..648145f Binary files /dev/null and b/output/calendar/calendar_weekday_effect.png differ diff --git a/output/causality/granger_causal_network.png b/output/causality/granger_causal_network.png new file mode 100644 index 0000000..354c967 Binary files /dev/null and b/output/causality/granger_causal_network.png differ diff --git a/output/causality/granger_pvalue_heatmap.png b/output/causality/granger_pvalue_heatmap.png new file mode 100644 index 0000000..e63e343 Binary files /dev/null and b/output/causality/granger_pvalue_heatmap.png differ diff --git a/output/clustering/cluster_heatmap_gmm.png b/output/clustering/cluster_heatmap_gmm.png new file mode 100644 index 0000000..68a2faf Binary files /dev/null and b/output/clustering/cluster_heatmap_gmm.png differ diff --git a/output/clustering/cluster_heatmap_k-means.png b/output/clustering/cluster_heatmap_k-means.png new file mode 100644 index 0000000..6bb1469 Binary files /dev/null and b/output/clustering/cluster_heatmap_k-means.png differ diff --git a/output/clustering/cluster_k_selection.png b/output/clustering/cluster_k_selection.png new file mode 100644 index 0000000..3b1eaae Binary files /dev/null and b/output/clustering/cluster_k_selection.png differ diff --git a/output/clustering/cluster_pca_gmm.png b/output/clustering/cluster_pca_gmm.png new file mode 100644 index 0000000..d3070ff Binary files /dev/null and b/output/clustering/cluster_pca_gmm.png differ diff --git a/output/clustering/cluster_pca_k-means.png b/output/clustering/cluster_pca_k-means.png new file mode 100644 index 0000000..1027f8b Binary files /dev/null and b/output/clustering/cluster_pca_k-means.png differ diff --git a/output/clustering/cluster_silhouette_k-means.png b/output/clustering/cluster_silhouette_k-means.png new file mode 100644 index 0000000..d3289fa Binary files /dev/null and b/output/clustering/cluster_silhouette_k-means.png differ diff --git a/output/clustering/cluster_state_timeseries.png b/output/clustering/cluster_state_timeseries.png new file mode 100644 index 0000000..474b045 Binary files /dev/null and b/output/clustering/cluster_state_timeseries.png differ diff --git a/output/clustering/cluster_transition_matrix.png b/output/clustering/cluster_transition_matrix.png new file mode 100644 index 0000000..764a85b Binary files /dev/null and b/output/clustering/cluster_transition_matrix.png differ diff --git a/output/evidence_dashboard.png b/output/evidence_dashboard.png new file mode 100644 index 0000000..95aafde Binary files /dev/null and b/output/evidence_dashboard.png differ diff --git a/output/fft/fft_bandpass_components.png b/output/fft/fft_bandpass_components.png new file mode 100644 index 0000000..ece56ad Binary files /dev/null and 
b/output/fft/fft_bandpass_components.png differ diff --git a/output/fft/fft_multi_timeframe.png b/output/fft/fft_multi_timeframe.png new file mode 100644 index 0000000..cd8e6e3 Binary files /dev/null and b/output/fft/fft_multi_timeframe.png differ diff --git a/output/fft/fft_power_spectrum.png b/output/fft/fft_power_spectrum.png new file mode 100644 index 0000000..9bc1ce6 Binary files /dev/null and b/output/fft/fft_power_spectrum.png differ diff --git a/output/fractal/fractal_box_counting.png b/output/fractal/fractal_box_counting.png new file mode 100644 index 0000000..6047c80 Binary files /dev/null and b/output/fractal/fractal_box_counting.png differ diff --git a/output/fractal/fractal_monte_carlo.png b/output/fractal/fractal_monte_carlo.png new file mode 100644 index 0000000..66d784e Binary files /dev/null and b/output/fractal/fractal_monte_carlo.png differ diff --git a/output/fractal/fractal_self_similarity.png b/output/fractal/fractal_self_similarity.png new file mode 100644 index 0000000..bdbb4e3 Binary files /dev/null and b/output/fractal/fractal_self_similarity.png differ diff --git a/output/halving/halving_combined_summary.png b/output/halving/halving_combined_summary.png new file mode 100644 index 0000000..42ac250 Binary files /dev/null and b/output/halving/halving_combined_summary.png differ diff --git a/output/halving/halving_cumulative_returns.png b/output/halving/halving_cumulative_returns.png new file mode 100644 index 0000000..505e657 Binary files /dev/null and b/output/halving/halving_cumulative_returns.png differ diff --git a/output/halving/halving_normalized_trajectories.png b/output/halving/halving_normalized_trajectories.png new file mode 100644 index 0000000..d8880e6 Binary files /dev/null and b/output/halving/halving_normalized_trajectories.png differ diff --git a/output/halving/halving_pre_post_returns.png b/output/halving/halving_pre_post_returns.png new file mode 100644 index 0000000..d1006e6 Binary files /dev/null and b/output/halving/halving_pre_post_returns.png differ diff --git a/output/hurst/hurst_multi_timeframe.png b/output/hurst/hurst_multi_timeframe.png new file mode 100644 index 0000000..c3effc6 Binary files /dev/null and b/output/hurst/hurst_multi_timeframe.png differ diff --git a/output/hurst/hurst_rolling.png b/output/hurst/hurst_rolling.png new file mode 100644 index 0000000..e6ffdda Binary files /dev/null and b/output/hurst/hurst_rolling.png differ diff --git a/output/hurst/hurst_rs_loglog.png b/output/hurst/hurst_rs_loglog.png new file mode 100644 index 0000000..4438c90 Binary files /dev/null and b/output/hurst/hurst_rs_loglog.png differ diff --git a/output/indicators/best_indicator_train.png b/output/indicators/best_indicator_train.png new file mode 100644 index 0000000..35550c6 Binary files /dev/null and b/output/indicators/best_indicator_train.png differ diff --git a/output/indicators/best_indicator_val.png b/output/indicators/best_indicator_val.png new file mode 100644 index 0000000..8578004 Binary files /dev/null and b/output/indicators/best_indicator_val.png differ diff --git a/output/indicators/ic_distribution_train.png b/output/indicators/ic_distribution_train.png new file mode 100644 index 0000000..27e88c4 Binary files /dev/null and b/output/indicators/ic_distribution_train.png differ diff --git a/output/indicators/ic_distribution_val.png b/output/indicators/ic_distribution_val.png new file mode 100644 index 0000000..bc71137 Binary files /dev/null and b/output/indicators/ic_distribution_val.png differ diff --git 
a/output/indicators/pvalue_heatmap_train.png b/output/indicators/pvalue_heatmap_train.png new file mode 100644 index 0000000..6aad8db Binary files /dev/null and b/output/indicators/pvalue_heatmap_train.png differ diff --git a/output/indicators/pvalue_heatmap_val.png b/output/indicators/pvalue_heatmap_val.png new file mode 100644 index 0000000..ee0ebf0 Binary files /dev/null and b/output/indicators/pvalue_heatmap_val.png differ diff --git a/output/patterns/pattern_counts_train.png b/output/patterns/pattern_counts_train.png new file mode 100644 index 0000000..17aa12c Binary files /dev/null and b/output/patterns/pattern_counts_train.png differ diff --git a/output/patterns/pattern_counts_val.png b/output/patterns/pattern_counts_val.png new file mode 100644 index 0000000..b4697ba Binary files /dev/null and b/output/patterns/pattern_counts_val.png differ diff --git a/output/patterns/pattern_forward_returns_train.png b/output/patterns/pattern_forward_returns_train.png new file mode 100644 index 0000000..944ea36 Binary files /dev/null and b/output/patterns/pattern_forward_returns_train.png differ diff --git a/output/patterns/pattern_forward_returns_val.png b/output/patterns/pattern_forward_returns_val.png new file mode 100644 index 0000000..58ebe35 Binary files /dev/null and b/output/patterns/pattern_forward_returns_val.png differ diff --git a/output/patterns/pattern_hit_rate_train.png b/output/patterns/pattern_hit_rate_train.png new file mode 100644 index 0000000..cfc6c81 Binary files /dev/null and b/output/patterns/pattern_hit_rate_train.png differ diff --git a/output/patterns/pattern_hit_rate_val.png b/output/patterns/pattern_hit_rate_val.png new file mode 100644 index 0000000..5a20751 Binary files /dev/null and b/output/patterns/pattern_hit_rate_val.png differ diff --git a/output/power_law/power_law_corridor.png b/output/power_law/power_law_corridor.png new file mode 100644 index 0000000..ae2db0e Binary files /dev/null and b/output/power_law/power_law_corridor.png differ diff --git a/output/power_law/power_law_loglog_regression.png b/output/power_law/power_law_loglog_regression.png new file mode 100644 index 0000000..54fc72e Binary files /dev/null and b/output/power_law/power_law_loglog_regression.png differ diff --git a/output/power_law/power_law_model_comparison.png b/output/power_law/power_law_model_comparison.png new file mode 100644 index 0000000..a1f5586 Binary files /dev/null and b/output/power_law/power_law_model_comparison.png differ diff --git a/output/power_law/power_law_residual_distribution.png b/output/power_law/power_law_residual_distribution.png new file mode 100644 index 0000000..20dddad Binary files /dev/null and b/output/power_law/power_law_residual_distribution.png differ diff --git a/output/price_overview.png b/output/price_overview.png new file mode 100644 index 0000000..f42e4b2 Binary files /dev/null and b/output/price_overview.png differ diff --git a/output/returns/garch_conditional_volatility.png b/output/returns/garch_conditional_volatility.png new file mode 100644 index 0000000..a07d715 Binary files /dev/null and b/output/returns/garch_conditional_volatility.png differ diff --git a/output/returns/multi_timeframe_distributions.png b/output/returns/multi_timeframe_distributions.png new file mode 100644 index 0000000..5151ed7 Binary files /dev/null and b/output/returns/multi_timeframe_distributions.png differ diff --git a/output/returns/returns_histogram_vs_normal.png b/output/returns/returns_histogram_vs_normal.png new file mode 100644 index 0000000..161a07c Binary 
files /dev/null and b/output/returns/returns_histogram_vs_normal.png differ diff --git a/output/returns/returns_qq_plot.png b/output/returns/returns_qq_plot.png new file mode 100644 index 0000000..243696e Binary files /dev/null and b/output/returns/returns_qq_plot.png differ diff --git a/output/time_series/ts_cumulative_error.png b/output/time_series/ts_cumulative_error.png new file mode 100644 index 0000000..59f4157 Binary files /dev/null and b/output/time_series/ts_cumulative_error.png differ diff --git a/output/time_series/ts_direction_accuracy.png b/output/time_series/ts_direction_accuracy.png new file mode 100644 index 0000000..e28d3a4 Binary files /dev/null and b/output/time_series/ts_direction_accuracy.png differ diff --git a/output/time_series/ts_predictions_comparison.png b/output/time_series/ts_predictions_comparison.png new file mode 100644 index 0000000..fe2b612 Binary files /dev/null and b/output/time_series/ts_predictions_comparison.png differ diff --git a/output/volatility/acf_power_law_fit.png b/output/volatility/acf_power_law_fit.png new file mode 100644 index 0000000..90a239e Binary files /dev/null and b/output/volatility/acf_power_law_fit.png differ diff --git a/output/volatility/garch_model_comparison.png b/output/volatility/garch_model_comparison.png new file mode 100644 index 0000000..9268c97 Binary files /dev/null and b/output/volatility/garch_model_comparison.png differ diff --git a/output/volatility/leverage_effect_scatter.png b/output/volatility/leverage_effect_scatter.png new file mode 100644 index 0000000..189d8be Binary files /dev/null and b/output/volatility/leverage_effect_scatter.png differ diff --git a/output/volatility/realized_volatility_multiwindow.png b/output/volatility/realized_volatility_multiwindow.png new file mode 100644 index 0000000..af49b94 Binary files /dev/null and b/output/volatility/realized_volatility_multiwindow.png differ diff --git a/output/volume_price/granger_causality_heatmap.png b/output/volume_price/granger_causality_heatmap.png new file mode 100644 index 0000000..856e6bf Binary files /dev/null and b/output/volume_price/granger_causality_heatmap.png differ diff --git a/output/volume_price/obv_divergence.png b/output/volume_price/obv_divergence.png new file mode 100644 index 0000000..3021e8c Binary files /dev/null and b/output/volume_price/obv_divergence.png differ diff --git a/output/volume_price/taker_buy_lead_lag.png b/output/volume_price/taker_buy_lead_lag.png new file mode 100644 index 0000000..078d2c9 Binary files /dev/null and b/output/volume_price/taker_buy_lead_lag.png differ diff --git a/output/volume_price/volume_return_scatter.png b/output/volume_price/volume_return_scatter.png new file mode 100644 index 0000000..4e86d56 Binary files /dev/null and b/output/volume_price/volume_return_scatter.png differ diff --git a/output/wavelet/wavelet_global_spectrum.png b/output/wavelet/wavelet_global_spectrum.png new file mode 100644 index 0000000..e52694e Binary files /dev/null and b/output/wavelet/wavelet_global_spectrum.png differ diff --git a/output/wavelet/wavelet_key_periods.png b/output/wavelet/wavelet_key_periods.png new file mode 100644 index 0000000..f5773d3 Binary files /dev/null and b/output/wavelet/wavelet_key_periods.png differ diff --git a/output/wavelet/wavelet_scalogram.png b/output/wavelet/wavelet_scalogram.png new file mode 100644 index 0000000..e2b366f Binary files /dev/null and b/output/wavelet/wavelet_scalogram.png differ diff --git a/output/综合结论报告.txt b/output/综合结论报告.txt new file mode 100644 index 
0000000..fa84500 --- /dev/null +++ b/output/综合结论报告.txt @@ -0,0 +1,35 @@ +====================================================================== +BTC/USDT 价格规律性分析 — 综合结论报告 +====================================================================== + + +"真正有规律" 判定标准(必须同时满足): + 1. FDR校正后 p < 0.05 + 2. 排列检验 p < 0.01(如适用) + 3. 测试集上效果方向一致且显著 + 4. >80% bootstrap子样本中成立(如适用) + 5. Cohen's d > 0.2 或经济意义显著 + 6. 有合理的经济/市场直觉解释 + + +---------------------------------------------------------------------- +模块 得分 强度 发现数 +---------------------------------------------------------------------- +indicators 0.00 none 0 +patterns 0.00 none 0 +---------------------------------------------------------------------- + +## 强证据规律(可重复、有经济意义): + (无) + +## 中等证据规律(统计显著但效果有限): + (无) + +## 弱证据/不显著: + * indicators + * patterns + +====================================================================== +注: 得分基于各模块自报告的统计检验结果。 + 具体参数和图表请参见各子目录的输出。 +====================================================================== \ No newline at end of file diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000..d481281 --- /dev/null +++ b/requirements.txt @@ -0,0 +1,17 @@ +pandas>=2.0 +numpy>=1.24 +scipy>=1.11 +matplotlib>=3.7 +seaborn>=0.12 +statsmodels>=0.14 +PyWavelets>=1.4 +arch>=6.0 +scikit-learn>=1.3 +# pandas-ta 已移除,技术指标在 indicators.py 中手动实现 +hdbscan>=0.8 +nolds>=0.5.2 +prophet>=1.1 +torch>=2.0 +pyod>=1.1 +plotly>=5.15 +pmdarima>=2.0 diff --git a/src/__init__.py b/src/__init__.py new file mode 100644 index 0000000..f90ee24 --- /dev/null +++ b/src/__init__.py @@ -0,0 +1 @@ +# BTC/USDT Price Analysis Package diff --git a/src/acf_analysis.py b/src/acf_analysis.py new file mode 100644 index 0000000..af06b6a --- /dev/null +++ b/src/acf_analysis.py @@ -0,0 +1,758 @@ +"""ACF/PACF 自相关分析模块 + +对BTC日线数据的多序列(对数收益率、平方收益率、绝对收益率、成交量)进行 +自相关函数(ACF)、偏自相关函数(PACF)分析,自动检测显著滞后阶与周期性模式, +并执行 Ljung-Box 检验以验证序列依赖结构。 +""" + +import numpy as np +import pandas as pd +import matplotlib +matplotlib.use('Agg') +import matplotlib.pyplot as plt +from statsmodels.tsa.stattools import acf, pacf +from statsmodels.stats.diagnostic import acorr_ljungbox +from pathlib import Path +from typing import Dict, List, Tuple, Optional, Any, Union + + +# ============================================================ +# 常量配置 +# ============================================================ + +# ACF/PACF 最大滞后阶数 +ACF_MAX_LAGS = 100 +PACF_MAX_LAGS = 40 + +# Ljung-Box 检验的滞后组 +LJUNGBOX_LAG_GROUPS = [10, 20, 50, 100] + +# 显著性水平对应的 z 值(双侧 5%) +Z_CRITICAL = 1.96 + +# 分析目标序列名称 -> 列名映射 +SERIES_CONFIG = { + "log_return": { + "column": "log_return", + "label": "对数收益率 (Log Return)", + "purpose": "检测线性序列相关性", + }, + "squared_return": { + "column": "squared_return", + "label": "平方收益率 (Squared Return)", + "purpose": "检测波动聚集效应 / ARCH效应", + }, + "abs_return": { + "column": "abs_return", + "label": "绝对收益率 (Absolute Return)", + "purpose": "非线性依赖关系的稳健性检验", + }, + "volume": { + "column": "volume", + "label": "成交量 (Volume)", + "purpose": "检测成交量自相关性", + }, +} + + +# ============================================================ +# 核心计算函数 +# ============================================================ + +def compute_acf(series: pd.Series, nlags: int = ACF_MAX_LAGS) -> Tuple[np.ndarray, np.ndarray]: + """ + 计算自相关函数及置信区间 + + Parameters + ---------- + series : pd.Series + 输入时间序列(已去除NaN) + nlags : int + 最大滞后阶数 + + Returns + ------- + acf_values : np.ndarray + ACF 值数组,shape=(nlags+1,) + confint : np.ndarray + 置信区间数组,shape=(nlags+1, 2) + """ + clean = series.dropna().values + # alpha=0.05 对应 95% 
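《综合结论报告》的判定标准第 1 条要求 "FDR 校正后 p < 0.05",而仓库内的日历与因果模块实际实现的是更保守的 Bonferroni 校正。若要补一个 Benjamini-Hochberg 版的 FDR 校正,可以直接用 requirements 中已包含的 statsmodels,示意片段如下(p 值为虚构示例,非仓库现有代码):

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# 虚构的一组原始 p 值,例如来自多个指标/形态的显著性检验
pvals = np.array([0.001, 0.008, 0.040, 0.120, 0.300])

# method='fdr_bh' 即 Benjamini-Hochberg,控制错误发现率而非族错误率
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method='fdr_bh')
for p_raw, p_c, r in zip(pvals, p_adj, reject):
    print(f"raw={p_raw:.3f}  adj={p_c:.3f}  {'显著' if r else '不显著'}")
```

检验数量较多时 BH 比 Bonferroni 保留更多统计功效,更贴合报告"同时筛查大量候选规律"的场景。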
置信区间 + acf_values, confint = acf(clean, nlags=nlags, alpha=0.05, fft=True) + return acf_values, confint + + +def compute_pacf(series: pd.Series, nlags: int = PACF_MAX_LAGS) -> Tuple[np.ndarray, np.ndarray]: + """ + 计算偏自相关函数及置信区间 + + Parameters + ---------- + series : pd.Series + 输入时间序列(已去除NaN) + nlags : int + 最大滞后阶数 + + Returns + ------- + pacf_values : np.ndarray + PACF 值数组 + confint : np.ndarray + 置信区间数组 + """ + clean = series.dropna().values + # 确保 nlags 不超过样本量的一半 + max_allowed = len(clean) // 2 - 1 + nlags = min(nlags, max_allowed) + pacf_values, confint = pacf(clean, nlags=nlags, alpha=0.05, method='ywm') + return pacf_values, confint + + +def find_significant_lags( + acf_values: np.ndarray, + n_obs: int, + start_lag: int = 1, +) -> List[int]: + """ + 识别超过 ±1.96/√N 置信带的显著滞后阶 + + Parameters + ---------- + acf_values : np.ndarray + ACF 值数组(包含 lag 0) + n_obs : int + 样本总数(用于计算 Bartlett 置信带宽度) + start_lag : int + 从哪个滞后阶开始检测(默认跳过 lag 0) + + Returns + ------- + significant : list of int + 显著的滞后阶列表 + """ + threshold = Z_CRITICAL / np.sqrt(n_obs) + significant = [] + for lag in range(start_lag, len(acf_values)): + if abs(acf_values[lag]) > threshold: + significant.append(lag) + return significant + + +def detect_periodic_pattern( + significant_lags: List[int], + min_period: int = 2, + max_period: int = 50, + min_occurrences: int = 3, + tolerance: int = 1, +) -> List[Dict[str, Any]]: + """ + 检测显著滞后阶中的周期性模式 + + 算法:对每个候选周期 p,检查 p, 2p, 3p, ... 是否在显著滞后阶集合中 + (允许 ±tolerance 偏差),若命中次数 >= min_occurrences 则认为存在周期。 + + Parameters + ---------- + significant_lags : list of int + 显著滞后阶列表 + min_period : int + 最小候选周期 + max_period : int + 最大候选周期 + min_occurrences : int + 最少需要出现的周期倍数次数 + tolerance : int + 允许的滞后偏差(天数) + + Returns + ------- + patterns : list of dict + 检测到的周期性模式列表,每个元素包含: + - period: 周期长度 + - hits: 命中的滞后阶列表 + - count: 命中次数 + - fft_note: FFT 交叉验证说明 + """ + if not significant_lags: + return [] + + sig_set = set(significant_lags) + max_lag = max(significant_lags) + patterns = [] + + for period in range(min_period, min(max_period + 1, max_lag + 1)): + hits = [] + # 检查周期的整数倍是否出现在显著滞后阶中 + multiple = 1 + while period * multiple <= max_lag + tolerance: + target = period * multiple + # 在 ±tolerance 范围内查找匹配 + for offset in range(-tolerance, tolerance + 1): + if (target + offset) in sig_set: + hits.append(target + offset) + break + multiple += 1 + + if len(hits) >= min_occurrences: + # FFT 交叉验证说明:周期 p 天对应频率 1/p + fft_freq = 1.0 / period + patterns.append({ + "period": period, + "hits": hits, + "count": len(hits), + "fft_note": ( + f"若FFT频谱在 f={fft_freq:.4f} (1/{period}天) " + f"处存在峰值,则交叉验证通过" + ), + }) + + # 按命中次数降序排列,去除被更短周期包含的冗余模式 + patterns.sort(key=lambda x: (-x["count"], x["period"])) + filtered = _filter_harmonic_patterns(patterns) + + return filtered + + +def _filter_harmonic_patterns( + patterns: List[Dict[str, Any]], +) -> List[Dict[str, Any]]: + """ + 过滤谐波冗余的周期模式 + + 如果周期 A 是周期 B 的整数倍且命中数不明显更多,则保留较短周期。 + """ + if len(patterns) <= 1: + return patterns + + kept = [] + periods_kept = set() + + for pat in patterns: + p = pat["period"] + # 检查是否为已保留周期的整数倍 + is_harmonic = False + for kp in periods_kept: + if p % kp == 0 and p != kp: + is_harmonic = True + break + if not is_harmonic: + kept.append(pat) + periods_kept.add(p) + + return kept + + +def run_ljungbox_test( + series: pd.Series, + lag_groups: List[int] = None, +) -> pd.DataFrame: + """ + 对序列执行 Ljung-Box 白噪声检验 + + Parameters + ---------- + series : pd.Series + 输入时间序列 + lag_groups : list of int + 检验的滞后阶组 + + Returns + ------- + results : 
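detect_periodic_pattern 的核心逻辑(检查候选周期 p 的整数倍 p, 2p, 3p… 是否落在显著滞后集合内)可以用一组合成滞后快速自验;以下片段假设以包形式导入本模块,滞后列表为人为构造:

```python
from src.acf_analysis import detect_periodic_pattern

# 人为构造:7 的整数倍(模拟周效应)外加一个孤立滞后 33
sig_lags = [7, 14, 21, 28, 33]

patterns = detect_periodic_pattern(sig_lags, min_occurrences=3, tolerance=0)
for pat in patterns:
    print(pat["period"], pat["count"], pat["hits"])
# 预期仅检出周期 7(命中 7/14/21/28);孤立的 33 命中不足,不构成周期
```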
pd.DataFrame + 包含 lag, lb_stat, lb_pvalue 的结果表 + """ + if lag_groups is None: + lag_groups = LJUNGBOX_LAG_GROUPS + + clean = series.dropna() + max_lag = max(lag_groups) + + # 确保最大滞后不超过样本量 + if max_lag >= len(clean): + lag_groups = [lg for lg in lag_groups if lg < len(clean)] + if not lag_groups: + return pd.DataFrame(columns=["lag", "lb_stat", "lb_pvalue"]) + max_lag = max(lag_groups) + + lb_result = acorr_ljungbox(clean, lags=max_lag, return_df=True) + + rows = [] + for lg in lag_groups: + if lg <= len(lb_result): + rows.append({ + "lag": lg, + "lb_stat": lb_result.loc[lg, "lb_stat"], + "lb_pvalue": lb_result.loc[lg, "lb_pvalue"], + }) + + return pd.DataFrame(rows) + + +# ============================================================ +# 可视化函数 +# ============================================================ + +def _plot_acf_grid( + acf_data: Dict[str, Tuple[np.ndarray, np.ndarray, int, List[int]]], + output_path: Path, +) -> None: + """ + 绘制 2x2 ACF 图 + + Parameters + ---------- + acf_data : dict + 键为序列名称,值为 (acf_values, confint, n_obs, significant_lags) 元组 + output_path : Path + 输出文件路径 + """ + fig, axes = plt.subplots(2, 2, figsize=(16, 12)) + fig.suptitle("BTC 自相关函数 (ACF) 分析", fontsize=16, fontweight='bold', y=0.98) + + series_keys = list(SERIES_CONFIG.keys()) + + for idx, key in enumerate(series_keys): + ax = axes[idx // 2, idx % 2] + + if key not in acf_data: + ax.set_visible(False) + continue + + acf_vals, confint, n_obs, sig_lags = acf_data[key] + config = SERIES_CONFIG[key] + lags = np.arange(len(acf_vals)) + threshold = Z_CRITICAL / np.sqrt(n_obs) + + # 绘制 ACF 柱状图 + colors = [] + for lag in lags: + if lag == 0: + colors.append('#2196F3') # lag 0 用蓝色 + elif lag in sig_lags: + colors.append('#F44336') # 显著滞后用红色 + else: + colors.append('#90CAF9') # 非显著用浅蓝 + + ax.bar(lags, acf_vals, color=colors, width=0.8, alpha=0.85) + + # 绘制置信带 + ax.axhline(y=threshold, color='#E91E63', linestyle='--', + linewidth=1.2, alpha=0.7, label=f'±{Z_CRITICAL}/√N = ±{threshold:.4f}') + ax.axhline(y=-threshold, color='#E91E63', linestyle='--', + linewidth=1.2, alpha=0.7) + ax.axhline(y=0, color='black', linewidth=0.5) + + # 标注显著滞后阶(仅标注前10个避免拥挤) + sig_lags_sorted = sorted(sig_lags)[:10] + for lag in sig_lags_sorted: + if lag < len(acf_vals): + ax.annotate( + f'{lag}', + xy=(lag, acf_vals[lag]), + xytext=(0, 8 if acf_vals[lag] > 0 else -12), + textcoords='offset points', + fontsize=7, + color='#D32F2F', + ha='center', + fontweight='bold', + ) + + ax.set_title(f'{config["label"]}\n({config["purpose"]})', fontsize=11) + ax.set_xlabel('滞后阶 (Lag)', fontsize=10) + ax.set_ylabel('ACF', fontsize=10) + ax.legend(fontsize=8, loc='upper right') + ax.set_xlim(-1, len(acf_vals)) + ax.grid(axis='y', alpha=0.3) + ax.tick_params(labelsize=9) + + plt.tight_layout(rect=[0, 0, 1, 0.95]) + fig.savefig(output_path, dpi=150, bbox_inches='tight') + plt.close(fig) + print(f"[ACF图] 已保存: {output_path}") + + +def _plot_pacf_grid( + pacf_data: Dict[str, Tuple[np.ndarray, np.ndarray, int, List[int]]], + output_path: Path, +) -> None: + """ + 绘制 2x2 PACF 图 + + Parameters + ---------- + pacf_data : dict + 键为序列名称,值为 (pacf_values, confint, n_obs, significant_lags) 元组 + output_path : Path + 输出文件路径 + """ + fig, axes = plt.subplots(2, 2, figsize=(16, 12)) + fig.suptitle("BTC 偏自相关函数 (PACF) 分析", fontsize=16, fontweight='bold', y=0.98) + + series_keys = list(SERIES_CONFIG.keys()) + + for idx, key in enumerate(series_keys): + ax = axes[idx // 2, idx % 2] + + if key not in pacf_data: + ax.set_visible(False) + continue + + pacf_vals, confint, n_obs, 
sig_lags = pacf_data[key] + config = SERIES_CONFIG[key] + lags = np.arange(len(pacf_vals)) + threshold = Z_CRITICAL / np.sqrt(n_obs) + + # 绘制 PACF 柱状图 + colors = [] + for lag in lags: + if lag == 0: + colors.append('#4CAF50') + elif lag in sig_lags: + colors.append('#FF5722') + else: + colors.append('#A5D6A7') + + ax.bar(lags, pacf_vals, color=colors, width=0.6, alpha=0.85) + + # 置信带 + ax.axhline(y=threshold, color='#E91E63', linestyle='--', + linewidth=1.2, alpha=0.7, label=f'±{Z_CRITICAL}/√N = ±{threshold:.4f}') + ax.axhline(y=-threshold, color='#E91E63', linestyle='--', + linewidth=1.2, alpha=0.7) + ax.axhline(y=0, color='black', linewidth=0.5) + + # 标注显著滞后阶 + sig_lags_sorted = sorted(sig_lags)[:10] + for lag in sig_lags_sorted: + if lag < len(pacf_vals): + ax.annotate( + f'{lag}', + xy=(lag, pacf_vals[lag]), + xytext=(0, 8 if pacf_vals[lag] > 0 else -12), + textcoords='offset points', + fontsize=7, + color='#BF360C', + ha='center', + fontweight='bold', + ) + + ax.set_title(f'{config["label"]}\n(PACF - 偏自相关)', fontsize=11) + ax.set_xlabel('滞后阶 (Lag)', fontsize=10) + ax.set_ylabel('PACF', fontsize=10) + ax.legend(fontsize=8, loc='upper right') + ax.set_xlim(-1, len(pacf_vals)) + ax.grid(axis='y', alpha=0.3) + ax.tick_params(labelsize=9) + + plt.tight_layout(rect=[0, 0, 1, 0.95]) + fig.savefig(output_path, dpi=150, bbox_inches='tight') + plt.close(fig) + print(f"[PACF图] 已保存: {output_path}") + + +def _plot_significant_lags_summary( + all_sig_lags: Dict[str, List[int]], + n_obs: int, + output_path: Path, +) -> None: + """ + 绘制所有序列的显著滞后阶汇总热力图 + + Parameters + ---------- + all_sig_lags : dict + 键为序列名称,值为显著滞后阶列表 + n_obs : int + 样本总数 + output_path : Path + 输出文件路径 + """ + max_lag = ACF_MAX_LAGS + series_names = list(SERIES_CONFIG.keys()) + labels = [SERIES_CONFIG[k]["label"].split(" (")[0] for k in series_names] + + # 构建二值矩阵:行=序列,列=滞后阶 + matrix = np.zeros((len(series_names), max_lag + 1)) + for i, key in enumerate(series_names): + for lag in all_sig_lags.get(key, []): + if lag <= max_lag: + matrix[i, lag] = 1 + + fig, ax = plt.subplots(figsize=(20, 4)) + im = ax.imshow(matrix, aspect='auto', cmap='YlOrRd', interpolation='none') + ax.set_yticks(range(len(labels))) + ax.set_yticklabels(labels, fontsize=10) + ax.set_xlabel('滞后阶 (Lag)', fontsize=11) + ax.set_title('显著自相关滞后阶汇总 (ACF > 置信带)', fontsize=13, fontweight='bold') + + # 每隔 5 个标注 x 轴 + ax.set_xticks(range(0, max_lag + 1, 5)) + ax.tick_params(labelsize=8) + + plt.colorbar(im, ax=ax, label='显著 (1) / 不显著 (0)', shrink=0.8) + plt.tight_layout() + fig.savefig(output_path, dpi=150, bbox_inches='tight') + plt.close(fig) + print(f"[显著滞后汇总图] 已保存: {output_path}") + + +# ============================================================ +# 主入口函数 +# ============================================================ + +def run_acf_analysis( + df: pd.DataFrame, + output_dir: Union[str, Path] = "output/acf", +) -> Dict[str, Any]: + """ + ACF/PACF 自相关分析主入口 + + 对对数收益率、平方收益率、绝对收益率、成交量四个序列执行完整的 + 自相关分析流程,包括:ACF计算、PACF计算、显著滞后检测、周期性 + 模式识别、Ljung-Box检验以及可视化。 + + Parameters + ---------- + df : pd.DataFrame + 日线DataFrame,需包含 log_return, squared_return, abs_return, volume 列 + (通常由 preprocessing.add_derived_features 生成) + output_dir : str or Path + 图表输出目录 + + Returns + ------- + results : dict + 分析结果字典,结构如下: + { + "acf": {series_name: {"values": ndarray, "significant_lags": list, ...}}, + "pacf": {series_name: {"values": ndarray, "significant_lags": list, ...}}, + "ljungbox": {series_name: DataFrame}, + "periodic_patterns": {series_name: list of dict}, + "summary": {...} + } + """ 
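本模块反复使用的显著性判据是 Bartlett 近似置信带 ±1.96/√N:对纯白噪声,平均约 5% 的滞后会偶然越界,这是解读"显著滞后数"时的基线。一个快速的模拟校验如下(示意片段,越界个数随随机种子略有波动):

```python
import numpy as np
from statsmodels.tsa.stattools import acf

rng = np.random.default_rng(0)
x = rng.standard_normal(3000)            # 纯白噪声,长度接近日线样本量

acf_vals = acf(x, nlags=100, fft=True)
threshold = 1.96 / np.sqrt(len(x))       # 与 find_significant_lags 相同的阈值
n_exceed = int((np.abs(acf_vals[1:]) > threshold).sum())
print(f"白噪声越界滞后: {n_exceed}/100(期望约 5 个)")
```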
+ output_dir = Path(output_dir) + output_dir.mkdir(parents=True, exist_ok=True) + + # 验证必要列存在 + required_cols = [cfg["column"] for cfg in SERIES_CONFIG.values()] + missing = [c for c in required_cols if c not in df.columns] + if missing: + raise ValueError(f"DataFrame 缺少必要列: {missing}。请先调用 add_derived_features()。") + + print("=" * 70) + print("ACF / PACF 自相关分析") + print("=" * 70) + print(f"样本量: {len(df)}") + print(f"时间范围: {df.index.min()} ~ {df.index.max()}") + print(f"ACF最大滞后: {ACF_MAX_LAGS} | PACF最大滞后: {PACF_MAX_LAGS}") + print(f"置信水平: 95% (z={Z_CRITICAL})") + print() + + # 存储结果 + results = { + "acf": {}, + "pacf": {}, + "ljungbox": {}, + "periodic_patterns": {}, + "summary": {}, + } + + # 用于绘图的中间数据 + acf_plot_data = {} # {key: (acf_vals, confint, n_obs, sig_lags_set)} + pacf_plot_data = {} + all_sig_lags = {} # {key: list of significant lag indices} + + # -------------------------------------------------------- + # 逐序列分析 + # -------------------------------------------------------- + for key, config in SERIES_CONFIG.items(): + col = config["column"] + label = config["label"] + purpose = config["purpose"] + series = df[col].dropna() + n_obs = len(series) + + print(f"{'─' * 60}") + print(f"序列: {label}") + print(f" 目的: {purpose}") + print(f" 有效样本: {n_obs}") + + # ---------- ACF ---------- + acf_vals, acf_confint = compute_acf(series, nlags=ACF_MAX_LAGS) + sig_lags_acf = find_significant_lags(acf_vals, n_obs) + sig_lags_set = set(sig_lags_acf) + + results["acf"][key] = { + "values": acf_vals, + "confint": acf_confint, + "significant_lags": sig_lags_acf, + "n_obs": n_obs, + "threshold": Z_CRITICAL / np.sqrt(n_obs), + } + acf_plot_data[key] = (acf_vals, acf_confint, n_obs, sig_lags_set) + all_sig_lags[key] = sig_lags_acf + + print(f" [ACF] 显著滞后阶数: {len(sig_lags_acf)}") + if sig_lags_acf: + # 打印前 20 个显著滞后 + display_lags = sig_lags_acf[:20] + lag_str = ", ".join(str(l) for l in display_lags) + if len(sig_lags_acf) > 20: + lag_str += f" ... (共{len(sig_lags_acf)}个)" + print(f" 滞后阶: {lag_str}") + # 打印最大 ACF 值的滞后阶(排除 lag 0) + max_idx = max(range(1, len(acf_vals)), key=lambda i: abs(acf_vals[i])) + print(f" 最大|ACF|: lag={max_idx}, ACF={acf_vals[max_idx]:.6f}") + + # ---------- PACF ---------- + pacf_vals, pacf_confint = compute_pacf(series, nlags=PACF_MAX_LAGS) + sig_lags_pacf = find_significant_lags(pacf_vals, n_obs) + sig_lags_pacf_set = set(sig_lags_pacf) + + results["pacf"][key] = { + "values": pacf_vals, + "confint": pacf_confint, + "significant_lags": sig_lags_pacf, + "n_obs": n_obs, + } + pacf_plot_data[key] = (pacf_vals, pacf_confint, n_obs, sig_lags_pacf_set) + + print(f" [PACF] 显著滞后阶数: {len(sig_lags_pacf)}") + if sig_lags_pacf: + display_lags_p = sig_lags_pacf[:15] + lag_str_p = ", ".join(str(l) for l in display_lags_p) + if len(sig_lags_pacf) > 15: + lag_str_p += f" ... 
(共{len(sig_lags_pacf)}个)" + print(f" 滞后阶: {lag_str_p}") + + # ---------- 周期性模式检测 ---------- + periodic = detect_periodic_pattern(sig_lags_acf) + results["periodic_patterns"][key] = periodic + + if periodic: + print(f" [周期性] 检测到 {len(periodic)} 个周期模式:") + for pat in periodic: + hit_str = ", ".join(str(h) for h in pat["hits"][:8]) + print(f" - 周期 {pat['period']}天 (命中{pat['count']}次): " + f"lags=[{hit_str}]") + print(f" FFT验证: {pat['fft_note']}") + else: + print(f" [周期性] 未检测到明显周期模式") + + # ---------- Ljung-Box 检验 ---------- + lb_df = run_ljungbox_test(series, LJUNGBOX_LAG_GROUPS) + results["ljungbox"][key] = lb_df + + print(f" [Ljung-Box检验]") + if not lb_df.empty: + for _, row in lb_df.iterrows(): + lag_val = int(row["lag"]) + stat = row["lb_stat"] + pval = row["lb_pvalue"] + # 判断显著性 + sig_mark = "***" if pval < 0.001 else "**" if pval < 0.01 else "*" if pval < 0.05 else "" + reject_str = "拒绝H0(存在自相关)" if pval < 0.05 else "不拒绝H0(无显著自相关)" + print(f" lag={lag_val:3d}: Q={stat:12.2f}, p={pval:.6f} {sig_mark} → {reject_str}") + print() + + # -------------------------------------------------------- + # 汇总 + # -------------------------------------------------------- + print("=" * 70) + print("分析汇总") + print("=" * 70) + + summary = {} + for key, config in SERIES_CONFIG.items(): + label_short = config["label"].split(" (")[0] + acf_sig = results["acf"][key]["significant_lags"] + pacf_sig = results["pacf"][key]["significant_lags"] + lb = results["ljungbox"][key] + periodic = results["periodic_patterns"][key] + + # Ljung-Box 在最大 lag 下是否显著 + lb_significant = False + if not lb.empty: + max_lag_row = lb.iloc[-1] + lb_significant = max_lag_row["lb_pvalue"] < 0.05 + + summary[key] = { + "label": label_short, + "acf_significant_count": len(acf_sig), + "pacf_significant_count": len(pacf_sig), + "ljungbox_rejects_white_noise": lb_significant, + "periodic_patterns_count": len(periodic), + "periodic_periods": [p["period"] for p in periodic], + } + + lb_verdict = "存在自相关" if lb_significant else "无显著自相关" + period_str = ( + ", ".join(f"{p}天" for p in summary[key]["periodic_periods"]) + if periodic else "无" + ) + + print(f" {label_short}:") + print(f" ACF显著滞后: {len(acf_sig)}个 | PACF显著滞后: {len(pacf_sig)}个") + print(f" Ljung-Box: {lb_verdict} | 周期性模式: {period_str}") + + results["summary"] = summary + + # -------------------------------------------------------- + # 可视化 + # -------------------------------------------------------- + print() + print("生成可视化图表...") + + # 1) ACF 2x2 网格图 + _plot_acf_grid(acf_plot_data, output_dir / "acf_grid.png") + + # 2) PACF 2x2 网格图 + _plot_pacf_grid(pacf_plot_data, output_dir / "pacf_grid.png") + + # 3) 显著滞后汇总热力图 + _plot_significant_lags_summary( + all_sig_lags, + n_obs=len(df.dropna(subset=["log_return"])), + output_path=output_dir / "significant_lags_heatmap.png", + ) + + print() + print("=" * 70) + print("ACF/PACF 分析完成") + print(f"图表输出目录: {output_dir.resolve()}") + print("=" * 70) + + return results + + +# ============================================================ +# 独立运行入口 +# ============================================================ + +if __name__ == "__main__": + from data_loader import load_daily + from preprocessing import add_derived_features + + # 加载并预处理数据 + print("加载日线数据...") + df = load_daily() + print(f"原始数据: {len(df)} 行") + + print("添加衍生特征...") + df = add_derived_features(df) + print(f"预处理后: {len(df)} 行, 列={list(df.columns)}") + print() + + # 执行 ACF/PACF 分析 + results = run_acf_analysis(df, output_dir="output/acf") + + # 打印结果概要 + print() + print("返回结果键:") + for k, v in 
results.items(): + if isinstance(v, dict): + print(f" results['{k}']: {list(v.keys())}") + else: + print(f" results['{k}']: {type(v).__name__}") diff --git a/src/anomaly.py b/src/anomaly.py new file mode 100644 index 0000000..c4cb03c --- /dev/null +++ b/src/anomaly.py @@ -0,0 +1,774 @@ +"""异常检测与前兆模式提取模块 + +分析内容: +- 集成异常检测(Isolation Forest + LOF + COPOD,≥2/3 一致判定) +- GARCH 条件波动率异常检测(标准化残差 > 3) +- 异常前兆模式提取(Random Forest 分类器) +- 事件对齐分析(比特币减半等重大事件) +- 可视化:异常标记价格图、特征分布对比、ROC 曲线、特征重要性 +""" + +import matplotlib +matplotlib.use('Agg') + +import numpy as np +import pandas as pd +import matplotlib.pyplot as plt +import warnings +from pathlib import Path +from typing import Optional, Dict, List, Tuple + +from sklearn.ensemble import IsolationForest, RandomForestClassifier +from sklearn.neighbors import LocalOutlierFactor +from sklearn.preprocessing import StandardScaler +from sklearn.model_selection import cross_val_predict, StratifiedKFold +from sklearn.metrics import roc_auc_score, roc_curve + +try: + from pyod.models.copod import COPOD + HAS_COPOD = True +except ImportError: + HAS_COPOD = False + print("[警告] pyod 未安装,COPOD 检测将跳过,使用 2/2 一致判定") + + +# ============================================================ +# 1. 检测特征定义 +# ============================================================ + +# 用于异常检测的特征列 +DETECTION_FEATURES = [ + 'log_return', + 'abs_return', + 'volume_ratio', + 'range_pct', + 'taker_buy_ratio', + 'vol_7d', +] + +# 比特币减半及其他重大事件日期 +KNOWN_EVENTS = { + '2012-11-28': '第一次减半', + '2016-07-09': '第二次减半', + '2020-05-11': '第三次减半', + '2024-04-20': '第四次减半', + '2017-12-17': '2017年牛市顶点', + '2018-12-15': '2018年熊市底部', + '2020-03-12': '新冠黑色星期四', + '2021-04-14': '2021年牛市中期高点', + '2021-11-10': '2021年牛市顶点', + '2022-06-18': 'Luna/3AC 暴跌', + '2022-11-09': 'FTX 崩盘', + '2024-01-11': 'BTC ETF 获批', +} + + +# ============================================================ +# 2. 
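下方 ensemble_anomaly_detection 的集成规则是"≥2/3 方法一致才判异常",本质上只是对各方法的 0/1 标签按行求和再设阈值。用虚构标签单独演示投票逻辑:

```python
import numpy as np

# 三个检测器对 6 个样本给出的 0/1 异常标签(虚构)
iforest = np.array([1, 0, 1, 0, 0, 1])
lof     = np.array([1, 0, 0, 0, 1, 1])
copod   = np.array([0, 0, 1, 0, 1, 1])

votes = np.column_stack([iforest, lof, copod]).sum(axis=1)
ensemble = (votes >= 2).astype(int)      # 对应 min_agreement=2
print(votes, ensemble)                   # [2 0 2 0 2 3] -> [1 0 1 0 1 1]
```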
集成异常检测 +# ============================================================ + +def prepare_features(df: pd.DataFrame) -> Tuple[pd.DataFrame, np.ndarray]: + """ + 准备异常检测特征矩阵 + + Parameters + ---------- + df : pd.DataFrame + 含衍生特征的日线数据 + + Returns + ------- + features_df : pd.DataFrame + 特征子集(已去除 NaN) + X_scaled : np.ndarray + 标准化后的特征矩阵 + """ + # 选取可用特征 + available = [f for f in DETECTION_FEATURES if f in df.columns] + if len(available) < 3: + raise ValueError(f"可用特征不足: {available},至少需要 3 个") + + features_df = df[available].dropna() + + # 标准化 + scaler = StandardScaler() + X_scaled = scaler.fit_transform(features_df.values) + + return features_df, X_scaled + + +def detect_isolation_forest(X: np.ndarray, contamination: float = 0.05) -> np.ndarray: + """Isolation Forest 异常检测""" + model = IsolationForest( + n_estimators=200, + contamination=contamination, + random_state=42, + n_jobs=-1, + ) + # -1 = 异常, 1 = 正常 + labels = model.fit_predict(X) + return (labels == -1).astype(int) + + +def detect_lof(X: np.ndarray, contamination: float = 0.05) -> np.ndarray: + """Local Outlier Factor 异常检测""" + model = LocalOutlierFactor( + n_neighbors=20, + contamination=contamination, + novelty=False, + n_jobs=-1, + ) + labels = model.fit_predict(X) + return (labels == -1).astype(int) + + +def detect_copod(X: np.ndarray, contamination: float = 0.05) -> np.ndarray: + """COPOD 异常检测(基于 Copula)""" + if not HAS_COPOD: + return None + + model = COPOD(contamination=contamination) + labels = model.fit_predict(X) + return labels.astype(int) + + +def ensemble_anomaly_detection( + df: pd.DataFrame, + contamination: float = 0.05, + min_agreement: int = 2, +) -> pd.DataFrame: + """ + 集成异常检测:要求 ≥ min_agreement / n_methods 一致判定 + + Parameters + ---------- + df : pd.DataFrame + 含衍生特征的日线数据 + contamination : float + 预期异常比例 + min_agreement : int + 最少多少个方法一致才标记为异常 + + Returns + ------- + pd.DataFrame + 添加了各方法检测结果及集成结果的数据 + """ + features_df, X_scaled = prepare_features(df) + + print(f" 特征矩阵: {X_scaled.shape[0]} 样本 x {X_scaled.shape[1]} 特征") + + # 执行各方法检测 + print(" [1/3] Isolation Forest...") + if_labels = detect_isolation_forest(X_scaled, contamination) + + print(" [2/3] Local Outlier Factor...") + lof_labels = detect_lof(X_scaled, contamination) + + n_methods = 2 + vote_matrix = np.column_stack([if_labels, lof_labels]) + method_names = ['iforest', 'lof'] + + print(" [3/3] COPOD...") + copod_labels = detect_copod(X_scaled, contamination) + if copod_labels is not None: + vote_matrix = np.column_stack([vote_matrix, copod_labels]) + method_names.append('copod') + n_methods = 3 + else: + print(" COPOD 不可用,使用 2 方法集成") + + # 投票 + vote_sum = vote_matrix.sum(axis=1) + ensemble_label = (vote_sum >= min_agreement).astype(int) + + # 构建结果 DataFrame + result = features_df.copy() + for i, name in enumerate(method_names): + result[f'anomaly_{name}'] = vote_matrix[:, i] + result['anomaly_votes'] = vote_sum + result['anomaly_ensemble'] = ensemble_label + + # 打印各方法统计 + print(f"\n 异常检测统计:") + for name in method_names: + n_anom = result[f'anomaly_{name}'].sum() + print(f" {name:>12}: {n_anom} 个异常 ({n_anom / len(result) * 100:.2f}%)") + n_ensemble = ensemble_label.sum() + print(f" {'集成(≥' + str(min_agreement) + ')':>12}: {n_ensemble} 个异常 ({n_ensemble / len(result) * 100:.2f}%)") + + # 方法间重叠度 + print(f"\n 方法间重叠:") + for i in range(len(method_names)): + for j in range(i + 1, len(method_names)): + overlap = ((vote_matrix[:, i] == 1) & (vote_matrix[:, j] == 1)).sum() + n_i = vote_matrix[:, i].sum() + n_j = vote_matrix[:, j].sum() + if min(n_i, n_j) > 0: + jaccard 
= overlap / ((vote_matrix[:, i] == 1) | (vote_matrix[:, j] == 1)).sum() + else: + jaccard = 0.0 + print(f" {method_names[i]} ∩ {method_names[j]}: " + f"{overlap} 个 (Jaccard={jaccard:.3f})") + + return result + + +# ============================================================ +# 3. GARCH 条件波动率异常 +# ============================================================ + +def garch_anomaly_detection( + df: pd.DataFrame, + threshold: float = 3.0, +) -> pd.Series: + """ + 基于 GARCH(1,1) 的条件波动率异常检测 + + 标准化残差 |ε_t / σ_t| > threshold 的日期标记为异常 + + Parameters + ---------- + df : pd.DataFrame + 含 log_return 列的数据 + threshold : float + 标准化残差阈值 + + Returns + ------- + pd.Series + 异常标记(1 = 异常,0 = 正常),索引与输入对齐 + """ + from arch import arch_model + + returns = df['log_return'].dropna() + r_pct = returns * 100 # arch 库使用百分比收益率 + + # 拟合 GARCH(1,1) + model = arch_model(r_pct, vol='Garch', p=1, q=1, mean='Constant', dist='Normal') + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + result = model.fit(disp='off') + + # 计算标准化残差 + std_resid = result.resid / result.conditional_volatility + anomaly = (std_resid.abs() > threshold).astype(int) + + n_anom = anomaly.sum() + print(f" GARCH 异常: {n_anom} 个 (|标准化残差| > {threshold})") + print(f" GARCH 模型: α={result.params.get('alpha[1]', np.nan):.4f}, " + f"β={result.params.get('beta[1]', np.nan):.4f}, " + f"持续性={result.params.get('alpha[1]', 0) + result.params.get('beta[1]', 0):.4f}") + + return anomaly + + +# ============================================================ +# 4. 前兆模式提取 +# ============================================================ + +def extract_precursor_features( + df: pd.DataFrame, + anomaly_labels: pd.Series, + lookback_windows: List[int] = None, +) -> Tuple[pd.DataFrame, pd.Series]: + """ + 提取异常日前若干天的特征作为前兆信号 + + Parameters + ---------- + df : pd.DataFrame + 含衍生特征的数据 + anomaly_labels : pd.Series + 异常标记(1 = 异常) + lookback_windows : list of int + 向前回溯的天数窗口 + + Returns + ------- + X : pd.DataFrame + 前兆特征矩阵 + y : pd.Series + 标签(1 = 后续发生异常, 0 = 正常) + """ + if lookback_windows is None: + lookback_windows = [5, 10, 20] + + # 确保对齐 + common_idx = df.index.intersection(anomaly_labels.index) + df_aligned = df.loc[common_idx] + labels_aligned = anomaly_labels.loc[common_idx] + + base_features = [f for f in DETECTION_FEATURES if f in df.columns] + precursor_features = {} + + for window in lookback_windows: + for feat in base_features: + if feat not in df_aligned.columns: + continue + series = df_aligned[feat] + + # 滚动统计作为前兆特征 + precursor_features[f'{feat}_mean_{window}d'] = series.rolling(window).mean() + precursor_features[f'{feat}_std_{window}d'] = series.rolling(window).std() + precursor_features[f'{feat}_max_{window}d'] = series.rolling(window).max() + precursor_features[f'{feat}_min_{window}d'] = series.rolling(window).min() + + # 趋势特征(最近值 vs 窗口均值的偏离) + rolling_mean = series.rolling(window).mean() + precursor_features[f'{feat}_deviation_{window}d'] = series - rolling_mean + + X = pd.DataFrame(precursor_features, index=df_aligned.index) + + # 标签: 未来是否出现异常(shift(-1) 使得特征是"之前"的) + # 我们用当前特征预测当天是否异常 + y = labels_aligned + + # 去除 NaN + valid_mask = X.notna().all(axis=1) & y.notna() + X = X[valid_mask] + y = y[valid_mask] + + return X, y + + +def train_precursor_classifier( + X: pd.DataFrame, + y: pd.Series, +) -> Dict: + """ + 训练前兆模式分类器(Random Forest) + + 使用分层 K 折交叉验证评估 + + Parameters + ---------- + X : pd.DataFrame + 前兆特征矩阵 + y : pd.Series + 标签 + + Returns + ------- + dict + AUC、特征重要性等结果 + """ + if len(X) < 50 or y.sum() < 10: + print(f" [警告] 样本不足 
(n={len(X)}, 正例={y.sum()}),跳过分类器训练") + return {} + + # 标准化 + scaler = StandardScaler() + X_scaled = scaler.fit_transform(X) + + # 分层 K 折 + n_splits = min(5, int(y.sum())) + if n_splits < 2: + print(" [警告] 正例数过少,无法进行交叉验证") + return {} + + cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42) + + clf = RandomForestClassifier( + n_estimators=200, + max_depth=10, + min_samples_split=5, + class_weight='balanced', + random_state=42, + n_jobs=-1, + ) + + # 交叉验证预测概率 + try: + y_prob = cross_val_predict(clf, X_scaled, y, cv=cv, method='predict_proba')[:, 1] + auc = roc_auc_score(y, y_prob) + except Exception as e: + print(f" [错误] 交叉验证失败: {e}") + return {} + + # 在全量数据上训练获取特征重要性 + clf.fit(X_scaled, y) + importances = pd.Series(clf.feature_importances_, index=X.columns) + importances = importances.sort_values(ascending=False) + + # ROC 曲线数据 + fpr, tpr, thresholds = roc_curve(y, y_prob) + + results = { + 'auc': auc, + 'feature_importances': importances, + 'y_true': y, + 'y_prob': y_prob, + 'fpr': fpr, + 'tpr': tpr, + } + + print(f"\n 前兆分类器结果:") + print(f" AUC: {auc:.4f}") + print(f" 样本: {len(y)} (异常: {y.sum()}, 正常: {(y == 0).sum()})") + print(f" Top-10 重要特征:") + for feat, imp in importances.head(10).items(): + print(f" {feat:<40} {imp:.4f}") + + return results + + +# ============================================================ +# 5. 事件对齐分析 +# ============================================================ + +def align_with_events( + anomaly_dates: pd.DatetimeIndex, + tolerance_days: int = 5, +) -> pd.DataFrame: + """ + 将异常日期与已知事件对齐 + + Parameters + ---------- + anomaly_dates : pd.DatetimeIndex + 异常日期列表 + tolerance_days : int + 容差天数(异常日期与事件日期相差 ≤ tolerance_days 天即视为匹配) + + Returns + ------- + pd.DataFrame + 匹配结果 + """ + matches = [] + + for event_date_str, event_name in KNOWN_EVENTS.items(): + event_date = pd.Timestamp(event_date_str) + + for anom_date in anomaly_dates: + diff_days = abs((anom_date - event_date).days) + if diff_days <= tolerance_days: + matches.append({ + 'anomaly_date': anom_date, + 'event_date': event_date, + 'event_name': event_name, + 'diff_days': diff_days, + }) + + if matches: + result = pd.DataFrame(matches) + print(f"\n 事件对齐 (容差 {tolerance_days} 天):") + for _, row in result.iterrows(): + print(f" 异常 {row['anomaly_date'].strftime('%Y-%m-%d')} ↔ " + f"{row['event_name']} ({row['event_date'].strftime('%Y-%m-%d')}, " + f"差 {row['diff_days']} 天)") + return result + else: + print(f" [信息] 无异常日期与已知事件匹配 (容差 {tolerance_days} 天)") + return pd.DataFrame() + + +# ============================================================ +# 6. 
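需要提醒的是,上面 train_precursor_classifier 使用 StratifiedKFold(shuffle=True) 做交叉验证:在时间序列上,打乱折意味着训练折可能包含晚于验证折的样本,滚动窗口特征会因此携带前视信息,AUC 容易被高估。更保守的做法是按时间顺序切分,例如下面的示意(参数为假设值,非仓库现有实现):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import TimeSeriesSplit

def time_ordered_auc(X: np.ndarray, y: np.ndarray, n_splits: int = 5) -> float:
    """按时间顺序交叉验证的平均 AUC(X、y 须已按时间排序)"""
    aucs = []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        clf = RandomForestClassifier(n_estimators=200, class_weight='balanced',
                                     random_state=42, n_jobs=-1)
        clf.fit(X[train_idx], y[train_idx])
        prob = clf.predict_proba(X[test_idx])[:, 1]
        if len(np.unique(y[test_idx])) == 2:     # 折内需同时包含正负例
            aucs.append(roc_auc_score(y[test_idx], prob))
    return float(np.mean(aucs)) if aucs else float('nan')
```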
可视化 +# ============================================================ + +def plot_price_with_anomalies( + df: pd.DataFrame, + anomaly_result: pd.DataFrame, + garch_anomaly: Optional[pd.Series], + output_dir: Path, +): + """绘制价格图,标注异常点""" + fig, axes = plt.subplots(2, 1, figsize=(16, 10), gridspec_kw={'height_ratios': [3, 1]}) + + # 上图:价格 + 异常标记 + ax1 = axes[0] + ax1.plot(df.index, df['close'], linewidth=0.6, color='steelblue', alpha=0.8, label='BTC 收盘价') + + # 集成异常 + ensemble_anom = anomaly_result[anomaly_result['anomaly_ensemble'] == 1] + if not ensemble_anom.empty: + # 获取异常日期对应的收盘价 + anom_prices = df.loc[df.index.isin(ensemble_anom.index), 'close'] + ax1.scatter(anom_prices.index, anom_prices.values, + color='red', s=30, zorder=5, label=f'集成异常 (n={len(anom_prices)})', + alpha=0.7, edgecolors='darkred', linewidths=0.5) + + # GARCH 异常 + if garch_anomaly is not None: + garch_anom_dates = garch_anomaly[garch_anomaly == 1].index + garch_prices = df.loc[df.index.isin(garch_anom_dates), 'close'] + if not garch_prices.empty: + ax1.scatter(garch_prices.index, garch_prices.values, + color='orange', s=20, zorder=4, marker='^', + label=f'GARCH 异常 (n={len(garch_prices)})', + alpha=0.7, edgecolors='darkorange', linewidths=0.5) + + ax1.set_ylabel('价格 (USDT)', fontsize=12) + ax1.set_title('BTC 价格与异常检测结果', fontsize=14) + ax1.legend(fontsize=10, loc='upper left') + ax1.grid(True, alpha=0.3) + ax1.set_yscale('log') + + # 下图:成交量 + 异常标记 + ax2 = axes[1] + if 'volume' in df.columns: + ax2.bar(df.index, df['volume'], width=1, color='steelblue', alpha=0.4, label='成交量') + if not ensemble_anom.empty: + anom_vol = df.loc[df.index.isin(ensemble_anom.index), 'volume'] + ax2.bar(anom_vol.index, anom_vol.values, width=1, color='red', alpha=0.7, label='异常日成交量') + ax2.set_ylabel('成交量', fontsize=12) + ax2.set_xlabel('日期', fontsize=12) + ax2.legend(fontsize=10) + ax2.grid(True, alpha=0.3) + + fig.tight_layout() + fig.savefig(output_dir / 'anomaly_price_chart.png', dpi=150, bbox_inches='tight') + plt.close(fig) + print(f" [保存] {output_dir / 'anomaly_price_chart.png'}") + + +def plot_anomaly_feature_distributions( + anomaly_result: pd.DataFrame, + output_dir: Path, +): + """绘制异常日 vs 正常日的特征分布对比""" + features_to_plot = [f for f in DETECTION_FEATURES if f in anomaly_result.columns] + n_feats = len(features_to_plot) + if n_feats == 0: + print(" [警告] 无可绘制特征") + return + + n_cols = 3 + n_rows = (n_feats + n_cols - 1) // n_cols + + fig, axes = plt.subplots(n_rows, n_cols, figsize=(5 * n_cols, 4 * n_rows)) + axes = np.array(axes).flatten() + + normal = anomaly_result[anomaly_result['anomaly_ensemble'] == 0] + anomaly = anomaly_result[anomaly_result['anomaly_ensemble'] == 1] + + for idx, feat in enumerate(features_to_plot): + ax = axes[idx] + + # 正常分布 + vals_normal = normal[feat].dropna() + vals_anomaly = anomaly[feat].dropna() + + ax.hist(vals_normal, bins=50, density=True, alpha=0.6, + color='steelblue', label=f'正常 (n={len(vals_normal)})', edgecolor='white', linewidth=0.3) + if len(vals_anomaly) > 0: + ax.hist(vals_anomaly, bins=30, density=True, alpha=0.6, + color='red', label=f'异常 (n={len(vals_anomaly)})', edgecolor='white', linewidth=0.3) + + ax.set_title(feat, fontsize=11) + ax.legend(fontsize=8) + ax.grid(True, alpha=0.3) + + # 隐藏多余子图 + for idx in range(n_feats, len(axes)): + axes[idx].set_visible(False) + + fig.suptitle('异常日 vs 正常日 特征分布对比', fontsize=14, y=1.02) + fig.tight_layout() + fig.savefig(output_dir / 'anomaly_feature_distributions.png', dpi=150, bbox_inches='tight') + plt.close(fig) + print(f" [保存] {output_dir / 
'anomaly_feature_distributions.png'}") + + +def plot_precursor_roc(precursor_results: Dict, output_dir: Path): + """绘制前兆分类器 ROC 曲线""" + if not precursor_results or 'fpr' not in precursor_results: + print(" [警告] 无前兆分类器结果,跳过 ROC 曲线") + return + + fig, ax = plt.subplots(figsize=(8, 8)) + + fpr = precursor_results['fpr'] + tpr = precursor_results['tpr'] + auc = precursor_results['auc'] + + ax.plot(fpr, tpr, color='steelblue', linewidth=2, + label=f'Random Forest (AUC = {auc:.4f})') + ax.plot([0, 1], [0, 1], 'k--', linewidth=1, label='随机基线') + + ax.set_xlabel('假阳性率 (FPR)', fontsize=12) + ax.set_ylabel('真阳性率 (TPR)', fontsize=12) + ax.set_title('异常前兆分类器 ROC 曲线', fontsize=14) + ax.legend(fontsize=11) + ax.grid(True, alpha=0.3) + ax.set_xlim([-0.02, 1.02]) + ax.set_ylim([-0.02, 1.02]) + + fig.savefig(output_dir / 'precursor_roc_curve.png', dpi=150, bbox_inches='tight') + plt.close(fig) + print(f" [保存] {output_dir / 'precursor_roc_curve.png'}") + + +def plot_feature_importance(precursor_results: Dict, output_dir: Path, top_n: int = 20): + """绘制前兆特征重要性条形图""" + if not precursor_results or 'feature_importances' not in precursor_results: + print(" [警告] 无特征重要性数据,跳过") + return + + importances = precursor_results['feature_importances'].head(top_n) + + fig, ax = plt.subplots(figsize=(10, max(6, top_n * 0.35))) + + colors = plt.cm.RdYlBu_r(np.linspace(0.2, 0.8, len(importances))) + ax.barh(range(len(importances)), importances.values[::-1], + color=colors[::-1], edgecolor='white', linewidth=0.5) + ax.set_yticks(range(len(importances))) + ax.set_yticklabels(importances.index[::-1], fontsize=9) + ax.set_xlabel('特征重要性', fontsize=12) + ax.set_title(f'异常前兆 Top-{top_n} 特征重要性 (Random Forest)', fontsize=13) + ax.grid(True, alpha=0.3, axis='x') + + fig.savefig(output_dir / 'precursor_feature_importance.png', dpi=150, bbox_inches='tight') + plt.close(fig) + print(f" [保存] {output_dir / 'precursor_feature_importance.png'}") + + +# ============================================================ +# 7. 
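《综合结论报告》的判定标准第 2 条要求排列检验 p < 0.01。train_precursor_classifier 返回结果中的 y_true 与 y_prob 恰好可以直接复用:打乱标签重算 AUC 以构造零分布,真实 AUC 在零分布中的分位位置即近似 p 值。示意片段如下(非仓库现有实现):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_permutation_pvalue(y_true, y_prob, n_perm: int = 1000, seed: int = 42):
    """标签置换检验:返回 (真实 AUC, 单侧 p 值)"""
    rng = np.random.default_rng(seed)
    real_auc = roc_auc_score(y_true, y_prob)
    null_aucs = np.array([roc_auc_score(rng.permutation(y_true), y_prob)
                          for _ in range(n_perm)])
    p_value = (1 + (null_aucs >= real_auc).sum()) / (n_perm + 1)  # +1 平滑,避免 p=0
    return real_auc, float(p_value)
```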
结果打印 +# ============================================================ + +def print_anomaly_summary( + anomaly_result: pd.DataFrame, + garch_anomaly: Optional[pd.Series], + precursor_results: Dict, +): + """打印异常检测汇总""" + print("\n" + "=" * 70) + print("异常检测结果汇总") + print("=" * 70) + + # 集成异常统计 + n_total = len(anomaly_result) + n_ensemble = anomaly_result['anomaly_ensemble'].sum() + print(f"\n 总样本数: {n_total}") + print(f" 集成异常数: {n_ensemble} ({n_ensemble / n_total * 100:.2f}%)") + + # 各方法统计 + method_cols = [c for c in anomaly_result.columns if c.startswith('anomaly_') and c != 'anomaly_ensemble' and c != 'anomaly_votes'] + for col in method_cols: + method_name = col.replace('anomaly_', '') + n_anom = anomaly_result[col].sum() + print(f" {method_name:>12}: {n_anom} ({n_anom / n_total * 100:.2f}%)") + + # GARCH 异常 + if garch_anomaly is not None: + n_garch = garch_anomaly.sum() + print(f" {'GARCH':>12}: {n_garch} ({n_garch / len(garch_anomaly) * 100:.2f}%)") + + # 集成异常与 GARCH 异常的重叠 + common_idx = anomaly_result.index.intersection(garch_anomaly.index) + if len(common_idx) > 0: + ensemble_set = set(anomaly_result.loc[common_idx][anomaly_result.loc[common_idx, 'anomaly_ensemble'] == 1].index) + garch_set = set(garch_anomaly[garch_anomaly == 1].index) + overlap = len(ensemble_set & garch_set) + print(f"\n 集成 ∩ GARCH 重叠: {overlap} 个") + + # 前兆分类器 + if precursor_results and 'auc' in precursor_results: + print(f"\n 前兆分类器 AUC: {precursor_results['auc']:.4f}") + print(f" Top-5 前兆特征:") + for feat, imp in precursor_results['feature_importances'].head(5).items(): + print(f" {feat:<40} {imp:.4f}") + + +# ============================================================ +# 8. 主入口 +# ============================================================ + +def run_anomaly_analysis( + df: pd.DataFrame, + output_dir: str = "output/anomaly", +) -> Dict: + """ + 异常检测与前兆模式分析主函数 + + Parameters + ---------- + df : pd.DataFrame + 日线数据(已通过 add_derived_features 添加衍生特征) + output_dir : str + 图表输出目录 + + Returns + ------- + dict + 包含所有分析结果的字典 + """ + output_dir = Path(output_dir) + output_dir.mkdir(parents=True, exist_ok=True) + + print("=" * 70) + print("BTC 异常检测与前兆模式分析") + print("=" * 70) + print(f"数据范围: {df.index.min()} ~ {df.index.max()}") + print(f"样本数量: {len(df)}") + + # 设置中文字体 + plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei', 'DejaVu Sans'] + plt.rcParams['axes.unicode_minus'] = False + + # --- 集成异常检测 --- + print("\n>>> [1/5] 执行集成异常检测...") + anomaly_result = ensemble_anomaly_detection(df, contamination=0.05, min_agreement=2) + + # --- GARCH 条件波动率异常 --- + print("\n>>> [2/5] 执行 GARCH 条件波动率异常检测...") + garch_anomaly = None + try: + garch_anomaly = garch_anomaly_detection(df, threshold=3.0) + except Exception as e: + print(f" [错误] GARCH 异常检测失败: {e}") + + # --- 事件对齐 --- + print("\n>>> [3/5] 执行事件对齐分析...") + ensemble_anom_dates = anomaly_result[anomaly_result['anomaly_ensemble'] == 1].index + event_alignment = align_with_events(ensemble_anom_dates, tolerance_days=5) + + # --- 前兆模式提取 --- + print("\n>>> [4/5] 提取前兆模式并训练分类器...") + precursor_results = {} + try: + X_precursor, y_precursor = extract_precursor_features( + df, anomaly_result['anomaly_ensemble'], lookback_windows=[5, 10, 20] + ) + print(f" 前兆特征矩阵: {X_precursor.shape[0]} 样本 x {X_precursor.shape[1]} 特征") + precursor_results = train_precursor_classifier(X_precursor, y_precursor) + except Exception as e: + print(f" [错误] 前兆模式提取失败: {e}") + + # --- 可视化 --- + print("\n>>> [5/5] 生成可视化图表...") + plot_price_with_anomalies(df, anomaly_result, garch_anomaly, output_dir) + 
plot_anomaly_feature_distributions(anomaly_result, output_dir) + plot_precursor_roc(precursor_results, output_dir) + plot_feature_importance(precursor_results, output_dir) + + # --- 汇总打印 --- + print_anomaly_summary(anomaly_result, garch_anomaly, precursor_results) + + print("\n" + "=" * 70) + print("异常检测与前兆模式分析完成!") + print(f"图表已保存至: {output_dir.resolve()}") + print("=" * 70) + + return { + 'anomaly_result': anomaly_result, + 'garch_anomaly': garch_anomaly, + 'event_alignment': event_alignment, + 'precursor_results': precursor_results, + } + + +# ============================================================ +# 独立运行入口 +# ============================================================ + +if __name__ == '__main__': + from src.data_loader import load_daily + from src.preprocessing import add_derived_features + + df = load_daily() + df = add_derived_features(df) + run_anomaly_analysis(df) diff --git a/src/calendar_analysis.py b/src/calendar_analysis.py new file mode 100644 index 0000000..667a0cc --- /dev/null +++ b/src/calendar_analysis.py @@ -0,0 +1,565 @@ +"""日历效应分析模块 - 星期、月份、小时、季度、月初月末效应""" + +import matplotlib +matplotlib.use('Agg') + +import numpy as np +import pandas as pd +import matplotlib.pyplot as plt +import matplotlib.ticker as mticker +import seaborn as sns +from pathlib import Path +from itertools import combinations +from scipy import stats + +# 中文显示配置 +plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei', 'DejaVu Sans'] +plt.rcParams['axes.unicode_minus'] = False + +# 星期名称映射(中英文) +WEEKDAY_NAMES_CN = {0: '周一', 1: '周二', 2: '周三', 3: '周四', + 4: '周五', 5: '周六', 6: '周日'} +WEEKDAY_NAMES_EN = {0: 'Mon', 1: 'Tue', 2: 'Wed', 3: 'Thu', + 4: 'Fri', 5: 'Sat', 6: 'Sun'} + +# 月份名称映射 +MONTH_NAMES_CN = {1: '1月', 2: '2月', 3: '3月', 4: '4月', + 5: '5月', 6: '6月', 7: '7月', 8: '8月', + 9: '9月', 10: '10月', 11: '11月', 12: '12月'} +MONTH_NAMES_EN = {1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', + 5: 'May', 6: 'Jun', 7: 'Jul', 8: 'Aug', + 9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'} + + +def _bonferroni_pairwise_mannwhitney(groups: dict, alpha: float = 0.05): + """ + 对多组数据进行 Mann-Whitney U 两两检验,并做 Bonferroni 校正。 + + Parameters + ---------- + groups : dict + {组标签: 收益率序列} + alpha : float + 显著性水平(校正前) + + Returns + ------- + list[dict] + 每对检验的结果列表 + """ + keys = sorted(groups.keys()) + pairs = list(combinations(keys, 2)) + n_tests = len(pairs) + corrected_alpha = alpha / n_tests if n_tests > 0 else alpha + + results = [] + for k1, k2 in pairs: + g1, g2 = groups[k1].dropna(), groups[k2].dropna() + if len(g1) < 3 or len(g2) < 3: + continue + stat, pval = stats.mannwhitneyu(g1, g2, alternative='two-sided') + results.append({ + 'group1': k1, + 'group2': k2, + 'U_stat': stat, + 'p_value': pval, + 'p_corrected': min(pval * n_tests, 1.0), # Bonferroni 校正 + 'significant': pval * n_tests < alpha, + 'corrected_alpha': corrected_alpha, + }) + return results + + +def _kruskal_wallis_test(groups: dict): + """ + Kruskal-Wallis H 检验(非参数单因素检验)。 + + Parameters + ---------- + groups : dict + {组标签: 收益率序列} + + Returns + ------- + dict + 包含 H 统计量、p 值等 + """ + valid_groups = [g.dropna().values for g in groups.values() if len(g.dropna()) >= 3] + if len(valid_groups) < 2: + return {'H_stat': np.nan, 'p_value': np.nan, 'n_groups': len(valid_groups)} + + h_stat, p_val = stats.kruskal(*valid_groups) + return {'H_stat': h_stat, 'p_value': p_val, 'n_groups': len(valid_groups)} + + +# -------------------------------------------------------------------------- +# 1. 
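日历效应的两两比较数量增长很快:7 个星期组就有 C(7,2)=21 对,对应的 Bonferroni 阈值被压缩到 0.05/21 ≈ 0.0024。用虚构数据单独演示 _bonferroni_pairwise_mannwhitney 中"单对检验 + 校正"的写法:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
g1 = rng.normal(0.000, 0.03, 400)    # 某星期的日对数收益(虚构)
g2 = rng.normal(0.002, 0.03, 400)    # 另一星期,均值略高(虚构)

u_stat, p_raw = stats.mannwhitneyu(g1, g2, alternative='two-sided')
n_tests = 21                          # 7 组两两比较的总对数
print(f"p_raw={p_raw:.4f}, p_bonferroni={min(p_raw * n_tests, 1.0):.4f}")
```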
星期效应分析 +# -------------------------------------------------------------------------- +def analyze_day_of_week(df: pd.DataFrame, output_dir: Path): + """ + 分析日收益率的星期效应。 + + Parameters + ---------- + df : pd.DataFrame + 日线数据(需含 log_return 列,DatetimeIndex 索引) + output_dir : Path + 图片保存目录 + """ + print("\n" + "=" * 70) + print("【星期效应分析】Day-of-Week Effect") + print("=" * 70) + + df = df.dropna(subset=['log_return']).copy() + df['weekday'] = df.index.dayofweek # 0=周一, 6=周日 + + # --- 描述性统计 --- + groups = {wd: df.loc[df['weekday'] == wd, 'log_return'] for wd in range(7)} + + print("\n--- 各星期对数收益率统计 ---") + stats_rows = [] + for wd in range(7): + g = groups[wd] + row = { + '星期': WEEKDAY_NAMES_CN[wd], + '样本量': len(g), + '均值': g.mean(), + '中位数': g.median(), + '标准差': g.std(), + '偏度': g.skew(), + '峰度': g.kurtosis(), + } + stats_rows.append(row) + stats_df = pd.DataFrame(stats_rows) + print(stats_df.to_string(index=False, float_format='{:.6f}'.format)) + + # --- Kruskal-Wallis 检验 --- + kw_result = _kruskal_wallis_test(groups) + print(f"\nKruskal-Wallis H 检验: H={kw_result['H_stat']:.4f}, " + f"p={kw_result['p_value']:.6f}") + if kw_result['p_value'] < 0.05: + print(" => 在 5% 显著性水平下,各星期收益率存在显著差异") + else: + print(" => 在 5% 显著性水平下,各星期收益率无显著差异") + + # --- Mann-Whitney U 两两检验 (Bonferroni 校正) --- + pairwise = _bonferroni_pairwise_mannwhitney(groups) + sig_pairs = [p for p in pairwise if p['significant']] + print(f"\nMann-Whitney U 两两检验 (Bonferroni 校正, {len(pairwise)} 对比较):") + if sig_pairs: + for p in sig_pairs: + print(f" {WEEKDAY_NAMES_CN[p['group1']]} vs {WEEKDAY_NAMES_CN[p['group2']]}: " + f"U={p['U_stat']:.1f}, p_raw={p['p_value']:.6f}, " + f"p_corrected={p['p_corrected']:.6f} *") + else: + print(" 无显著差异的配对(校正后)") + + # --- 可视化: 箱线图 --- + fig, axes = plt.subplots(1, 2, figsize=(14, 6)) + + # 箱线图 + box_data = [groups[wd].values for wd in range(7)] + bp = axes[0].boxplot(box_data, labels=[WEEKDAY_NAMES_CN[i] for i in range(7)], + patch_artist=True, showfliers=False, showmeans=True, + meanprops=dict(marker='D', markerfacecolor='red', markersize=5)) + colors = plt.cm.Set3(np.linspace(0, 1, 7)) + for patch, color in zip(bp['boxes'], colors): + patch.set_facecolor(color) + axes[0].axhline(y=0, color='grey', linestyle='--', alpha=0.5) + axes[0].set_title('BTC 日收益率 - 星期效应(箱线图)', fontsize=13) + axes[0].set_ylabel('对数收益率') + axes[0].set_xlabel('星期') + + # 均值柱状图 + means = [groups[wd].mean() for wd in range(7)] + sems = [groups[wd].sem() for wd in range(7)] + bar_colors = ['#2ecc71' if m > 0 else '#e74c3c' for m in means] + axes[1].bar(range(7), means, yerr=sems, color=bar_colors, + alpha=0.8, capsize=3, edgecolor='black', linewidth=0.5) + axes[1].set_xticks(range(7)) + axes[1].set_xticklabels([WEEKDAY_NAMES_CN[i] for i in range(7)]) + axes[1].axhline(y=0, color='grey', linestyle='--', alpha=0.5) + axes[1].set_title('BTC 日均收益率 - 星期效应(均值±SE)', fontsize=13) + axes[1].set_ylabel('平均对数收益率') + axes[1].set_xlabel('星期') + + plt.tight_layout() + fig_path = output_dir / 'calendar_weekday_effect.png' + fig.savefig(fig_path, dpi=150, bbox_inches='tight') + plt.close(fig) + print(f"\n图表已保存: {fig_path}") + + +# -------------------------------------------------------------------------- +# 2. 
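上面的星期效应只报告显著性;而《综合结论报告》的判定标准第 5 条还要求 Cohen's d > 0.2。若要补充效应量,可对任意两组收益率计算合并标准差版的 Cohen's d(示意片段,非仓库现有实现):

```python
import numpy as np

def cohens_d(a, b) -> float:
    """两独立样本的 Cohen's d(合并标准差版本)"""
    a, b = np.asarray(a), np.asarray(b)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))

# 用法示意:d = cohens_d(周一收益, 周五收益);即使 p 显著,|d| < 0.2 也意义有限
```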
月份效应分析 +# -------------------------------------------------------------------------- +def analyze_month_of_year(df: pd.DataFrame, output_dir: Path): + """ + 分析日收益率的月份效应,并绘制年×月热力图。 + + Parameters + ---------- + df : pd.DataFrame + 日线数据(需含 log_return 列) + output_dir : Path + 图片保存目录 + """ + print("\n" + "=" * 70) + print("【月份效应分析】Month-of-Year Effect") + print("=" * 70) + + df = df.dropna(subset=['log_return']).copy() + df['month'] = df.index.month + df['year'] = df.index.year + + # --- 描述性统计 --- + groups = {m: df.loc[df['month'] == m, 'log_return'] for m in range(1, 13)} + + print("\n--- 各月份对数收益率统计 ---") + stats_rows = [] + for m in range(1, 13): + g = groups[m] + row = { + '月份': MONTH_NAMES_CN[m], + '样本量': len(g), + '均值': g.mean(), + '中位数': g.median(), + '标准差': g.std(), + } + stats_rows.append(row) + stats_df = pd.DataFrame(stats_rows) + print(stats_df.to_string(index=False, float_format='{:.6f}'.format)) + + # --- Kruskal-Wallis 检验 --- + kw_result = _kruskal_wallis_test(groups) + print(f"\nKruskal-Wallis H 检验: H={kw_result['H_stat']:.4f}, " + f"p={kw_result['p_value']:.6f}") + if kw_result['p_value'] < 0.05: + print(" => 在 5% 显著性水平下,各月份收益率存在显著差异") + else: + print(" => 在 5% 显著性水平下,各月份收益率无显著差异") + + # --- Mann-Whitney U 两两检验 (Bonferroni 校正) --- + pairwise = _bonferroni_pairwise_mannwhitney(groups) + sig_pairs = [p for p in pairwise if p['significant']] + print(f"\nMann-Whitney U 两两检验 (Bonferroni 校正, {len(pairwise)} 对比较):") + if sig_pairs: + for p in sig_pairs: + print(f" {MONTH_NAMES_CN[p['group1']]} vs {MONTH_NAMES_CN[p['group2']]}: " + f"U={p['U_stat']:.1f}, p_raw={p['p_value']:.6f}, " + f"p_corrected={p['p_corrected']:.6f} *") + else: + print(" 无显著差异的配对(校正后)") + + # --- 可视化 --- + fig, axes = plt.subplots(1, 2, figsize=(16, 6)) + + # 均值柱状图 + means = [groups[m].mean() for m in range(1, 13)] + sems = [groups[m].sem() for m in range(1, 13)] + bar_colors = ['#2ecc71' if m > 0 else '#e74c3c' for m in means] + axes[0].bar(range(1, 13), means, yerr=sems, color=bar_colors, + alpha=0.8, capsize=3, edgecolor='black', linewidth=0.5) + axes[0].set_xticks(range(1, 13)) + axes[0].set_xticklabels([MONTH_NAMES_EN[i] for i in range(1, 13)]) + axes[0].axhline(y=0, color='grey', linestyle='--', alpha=0.5) + axes[0].set_title('BTC 月均收益率(均值±SE)', fontsize=13) + axes[0].set_ylabel('平均对数收益率') + axes[0].set_xlabel('月份') + + # 年×月 热力图:每月累计收益率 + monthly_returns = df.groupby(['year', 'month'])['log_return'].sum().unstack(fill_value=np.nan) + monthly_returns.columns = [MONTH_NAMES_EN[c] for c in monthly_returns.columns] + sns.heatmap(monthly_returns, annot=True, fmt='.3f', cmap='RdYlGn', center=0, + linewidths=0.5, ax=axes[1], cbar_kws={'label': '累计对数收益率'}) + axes[1].set_title('BTC 年×月 累计对数收益率热力图', fontsize=13) + axes[1].set_ylabel('年份') + axes[1].set_xlabel('月份') + + plt.tight_layout() + fig_path = output_dir / 'calendar_month_effect.png' + fig.savefig(fig_path, dpi=150, bbox_inches='tight') + plt.close(fig) + print(f"\n图表已保存: {fig_path}") + + +# -------------------------------------------------------------------------- +# 3. 
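上面的年×月热力图直接用 groupby(['year', 'month'])['log_return'].sum() 汇总,这依赖对数收益率的可加性:一个月内逐日 log return 相加恰好等于整月 log return,取 exp 再减 1 即普通收益率。微型演算如下:

```python
import numpy as np

daily = np.array([0.01, -0.02, 0.03])      # 三天的日对数收益(虚构)
monthly_log = daily.sum()                   # 对数收益可直接相加
monthly_simple = np.exp(monthly_log) - 1    # 转回普通(单利)收益率

# 与逐日复利的结果完全一致
assert np.isclose(monthly_simple, np.prod(np.exp(daily)) - 1)
print(f"月对数收益={monthly_log:.4f}, 月普通收益={monthly_simple:.4%}")
```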
小时效应分析(1h 数据) +# -------------------------------------------------------------------------- +def analyze_hour_of_day(df_hourly: pd.DataFrame, output_dir: Path): + """ + 分析小时级别收益率与成交量的日内效应。 + + Parameters + ---------- + df_hourly : pd.DataFrame + 小时线数据(需含 close、volume 列,DatetimeIndex 索引) + output_dir : Path + 图片保存目录 + """ + print("\n" + "=" * 70) + print("【小时效应分析】Hour-of-Day Effect") + print("=" * 70) + + df = df_hourly.copy() + # 计算小时收益率 + df['log_return'] = np.log(df['close'] / df['close'].shift(1)) + df = df.dropna(subset=['log_return']) + df['hour'] = df.index.hour + + # --- 描述性统计 --- + groups_ret = {h: df.loc[df['hour'] == h, 'log_return'] for h in range(24)} + groups_vol = {h: df.loc[df['hour'] == h, 'volume'] for h in range(24)} + + print("\n--- 各小时对数收益率与成交量统计 ---") + stats_rows = [] + for h in range(24): + gr = groups_ret[h] + gv = groups_vol[h] + row = { + '小时(UTC)': f'{h:02d}:00', + '样本量': len(gr), + '收益率均值': gr.mean(), + '收益率中位数': gr.median(), + '收益率标准差': gr.std(), + '成交量均值': gv.mean(), + } + stats_rows.append(row) + stats_df = pd.DataFrame(stats_rows) + print(stats_df.to_string(index=False, float_format='{:.6f}'.format)) + + # --- Kruskal-Wallis 检验 (收益率) --- + kw_ret = _kruskal_wallis_test(groups_ret) + print(f"\n收益率 Kruskal-Wallis H 检验: H={kw_ret['H_stat']:.4f}, " + f"p={kw_ret['p_value']:.6f}") + if kw_ret['p_value'] < 0.05: + print(" => 在 5% 显著性水平下,各小时收益率存在显著差异") + else: + print(" => 在 5% 显著性水平下,各小时收益率无显著差异") + + # --- Kruskal-Wallis 检验 (成交量) --- + kw_vol = _kruskal_wallis_test(groups_vol) + print(f"\n成交量 Kruskal-Wallis H 检验: H={kw_vol['H_stat']:.4f}, " + f"p={kw_vol['p_value']:.6f}") + if kw_vol['p_value'] < 0.05: + print(" => 在 5% 显著性水平下,各小时成交量存在显著差异") + else: + print(" => 在 5% 显著性水平下,各小时成交量无显著差异") + + # --- 可视化 --- + fig, axes = plt.subplots(2, 1, figsize=(14, 10)) + + hours = list(range(24)) + hour_labels = [f'{h:02d}' for h in hours] + + # 收益率 + ret_means = [groups_ret[h].mean() for h in hours] + ret_sems = [groups_ret[h].sem() for h in hours] + bar_colors_ret = ['#2ecc71' if m > 0 else '#e74c3c' for m in ret_means] + axes[0].bar(hours, ret_means, yerr=ret_sems, color=bar_colors_ret, + alpha=0.8, capsize=2, edgecolor='black', linewidth=0.3) + axes[0].set_xticks(hours) + axes[0].set_xticklabels(hour_labels) + axes[0].axhline(y=0, color='grey', linestyle='--', alpha=0.5) + axes[0].set_title('BTC 小时均收益率 (UTC, 均值±SE)', fontsize=13) + axes[0].set_ylabel('平均对数收益率') + axes[0].set_xlabel('小时 (UTC)') + + # 成交量 + vol_means = [groups_vol[h].mean() for h in hours] + axes[1].bar(hours, vol_means, color='steelblue', alpha=0.8, + edgecolor='black', linewidth=0.3) + axes[1].set_xticks(hours) + axes[1].set_xticklabels(hour_labels) + axes[1].set_title('BTC 小时均成交量 (UTC)', fontsize=13) + axes[1].set_ylabel('平均成交量 (BTC)') + axes[1].set_xlabel('小时 (UTC)') + axes[1].yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'{x:,.0f}')) + + plt.tight_layout() + fig_path = output_dir / 'calendar_hour_effect.png' + fig.savefig(fig_path, dpi=150, bbox_inches='tight') + plt.close(fig) + print(f"\n图表已保存: {fig_path}") + + +# -------------------------------------------------------------------------- +# 4. 
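上面的小时效应按 UTC 小时分组。若想从某个交易时段(如亚洲盘、美东盘)的角度解读日内模式,应先做时区转换再分组;下面是一个假设性的写法(时区名仅作示例,非仓库现有实现):

```python
import pandas as pd

def hourly_mean_return_local(df: pd.DataFrame, tz: str = 'Asia/Shanghai') -> pd.Series:
    """将 UTC 索引转换到本地时区后,按本地小时求平均对数收益率"""
    idx = df.index.tz_localize('UTC') if df.index.tz is None else df.index
    return df['log_return'].groupby(idx.tz_convert(tz).hour).mean()
```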
季度效应 & 月初月末效应 +# -------------------------------------------------------------------------- +def analyze_quarter_and_month_boundary(df: pd.DataFrame, output_dir: Path): + """ + 分析季度效应,以及每月前5日/后5日的收益率差异。 + + Parameters + ---------- + df : pd.DataFrame + 日线数据(需含 log_return 列) + output_dir : Path + 图片保存目录 + """ + print("\n" + "=" * 70) + print("【季度效应 & 月初/月末效应分析】") + print("=" * 70) + + df = df.dropna(subset=['log_return']).copy() + df['quarter'] = df.index.quarter + df['month'] = df.index.month + df['day'] = df.index.day + + # ========== 季度效应 ========== + groups_q = {q: df.loc[df['quarter'] == q, 'log_return'] for q in range(1, 5)} + + print("\n--- 各季度对数收益率统计 ---") + quarter_names = {1: 'Q1', 2: 'Q2', 3: 'Q3', 4: 'Q4'} + for q in range(1, 5): + g = groups_q[q] + print(f" {quarter_names[q]}: 均值={g.mean():.6f}, 中位数={g.median():.6f}, " + f"标准差={g.std():.6f}, 样本量={len(g)}") + + kw_q = _kruskal_wallis_test(groups_q) + print(f"\n季度 Kruskal-Wallis H 检验: H={kw_q['H_stat']:.4f}, p={kw_q['p_value']:.6f}") + if kw_q['p_value'] < 0.05: + print(" => 在 5% 显著性水平下,各季度收益率存在显著差异") + else: + print(" => 在 5% 显著性水平下,各季度收益率无显著差异") + + # 季度两两比较 + pairwise_q = _bonferroni_pairwise_mannwhitney(groups_q) + sig_q = [p for p in pairwise_q if p['significant']] + if sig_q: + print(f"\n季度两两检验 (Bonferroni 校正, {len(pairwise_q)} 对):") + for p in sig_q: + print(f" {quarter_names[p['group1']]} vs {quarter_names[p['group2']]}: " + f"U={p['U_stat']:.1f}, p_corrected={p['p_corrected']:.6f} *") + + # ========== 月初/月末效应 ========== + # 判断每月最后5天:通过计算每个日期距当月末的天数 + from pandas.tseries.offsets import MonthEnd + df['month_end'] = df.index + MonthEnd(0) # 当月最后一天 + df['days_to_end'] = (df['month_end'] - df.index).dt.days + + # 月初前5天 vs 月末后5天 + mask_start = df['day'] <= 5 + mask_end = df['days_to_end'] < 5 # 距离月末不到5天(即最后5天) + + ret_start = df.loc[mask_start, 'log_return'] + ret_end = df.loc[mask_end, 'log_return'] + ret_mid = df.loc[~mask_start & ~mask_end, 'log_return'] + + print("\n--- 月初 / 月中 / 月末 收益率统计 ---") + for label, data in [('月初(前5日)', ret_start), ('月中', ret_mid), ('月末(后5日)', ret_end)]: + print(f" {label}: 均值={data.mean():.6f}, 中位数={data.median():.6f}, " + f"标准差={data.std():.6f}, 样本量={len(data)}") + + # Mann-Whitney U 检验:月初 vs 月末 + if len(ret_start) >= 3 and len(ret_end) >= 3: + u_stat, p_val = stats.mannwhitneyu(ret_start, ret_end, alternative='two-sided') + print(f"\n月初 vs 月末 Mann-Whitney U 检验: U={u_stat:.1f}, p={p_val:.6f}") + if p_val < 0.05: + print(" => 在 5% 显著性水平下,月初与月末收益率存在显著差异") + else: + print(" => 在 5% 显著性水平下,月初与月末收益率无显著差异") + + # --- 可视化 --- + fig, axes = plt.subplots(1, 2, figsize=(14, 6)) + + # 季度柱状图 + q_means = [groups_q[q].mean() for q in range(1, 5)] + q_sems = [groups_q[q].sem() for q in range(1, 5)] + q_colors = ['#2ecc71' if m > 0 else '#e74c3c' for m in q_means] + axes[0].bar(range(1, 5), q_means, yerr=q_sems, color=q_colors, + alpha=0.8, capsize=4, edgecolor='black', linewidth=0.5) + axes[0].set_xticks(range(1, 5)) + axes[0].set_xticklabels(['Q1', 'Q2', 'Q3', 'Q4']) + axes[0].axhline(y=0, color='grey', linestyle='--', alpha=0.5) + axes[0].set_title('BTC 季度均收益率(均值±SE)', fontsize=13) + axes[0].set_ylabel('平均对数收益率') + axes[0].set_xlabel('季度') + + # 月初/月中/月末 柱状图 + boundary_means = [ret_start.mean(), ret_mid.mean(), ret_end.mean()] + boundary_sems = [ret_start.sem(), ret_mid.sem(), ret_end.sem()] + boundary_colors = ['#3498db', '#95a5a6', '#e67e22'] + axes[1].bar(range(3), boundary_means, yerr=boundary_sems, color=boundary_colors, + alpha=0.8, capsize=4, edgecolor='black', linewidth=0.5) + 
axes[1].set_xticks(range(3)) + axes[1].set_xticklabels(['月初(前5日)', '月中', '月末(后5日)']) + axes[1].axhline(y=0, color='grey', linestyle='--', alpha=0.5) + axes[1].set_title('BTC 月初/月中/月末 均收益率(均值±SE)', fontsize=13) + axes[1].set_ylabel('平均对数收益率') + + plt.tight_layout() + fig_path = output_dir / 'calendar_quarter_boundary_effect.png' + fig.savefig(fig_path, dpi=150, bbox_inches='tight') + plt.close(fig) + print(f"\n图表已保存: {fig_path}") + + # 清理临时列 + df.drop(columns=['month_end', 'days_to_end'], inplace=True, errors='ignore') + + +# -------------------------------------------------------------------------- +# 主入口 +# -------------------------------------------------------------------------- +def run_calendar_analysis( + df: pd.DataFrame, + df_hourly: pd.DataFrame = None, + output_dir: str = 'output/calendar', +): + """ + 日历效应分析主入口。 + + Parameters + ---------- + df : pd.DataFrame + 日线数据,已通过 add_derived_features 添加衍生特征(含 log_return 列) + df_hourly : pd.DataFrame, optional + 小时线原始数据(含 close、volume 列)。若为 None 则跳过小时效应分析。 + output_dir : str or Path + 输出目录 + """ + output_dir = Path(output_dir) + output_dir.mkdir(parents=True, exist_ok=True) + + print("\n" + "#" * 70) + print("# BTC 日历效应分析 (Calendar Effects Analysis)") + print("#" * 70) + + # 1. 星期效应 + analyze_day_of_week(df, output_dir) + + # 2. 月份效应 + analyze_month_of_year(df, output_dir) + + # 3. 小时效应(若有小时数据) + if df_hourly is not None and len(df_hourly) > 0: + analyze_hour_of_day(df_hourly, output_dir) + else: + print("\n[跳过] 小时效应分析:未提供小时数据 (df_hourly is None)") + + # 4. 季度 & 月初月末效应 + analyze_quarter_and_month_boundary(df, output_dir) + + print("\n" + "#" * 70) + print("# 日历效应分析完成") + print("#" * 70) + + +# -------------------------------------------------------------------------- +# 可独立运行 +# -------------------------------------------------------------------------- +if __name__ == '__main__': + from data_loader import load_daily, load_hourly + from preprocessing import add_derived_features + + # 加载数据 + df_daily = load_daily() + df_daily = add_derived_features(df_daily) + + try: + df_hourly = load_hourly() + except Exception as e: + print(f"[警告] 加载小时数据失败: {e}") + df_hourly = None + + run_calendar_analysis(df_daily, df_hourly, output_dir='output/calendar') diff --git a/src/causality.py b/src/causality.py new file mode 100644 index 0000000..56b7a95 --- /dev/null +++ b/src/causality.py @@ -0,0 +1,615 @@ +"""Granger 因果检验模块 + +分析内容: +- 双向 Granger 因果检验(5 对变量,各 5 个滞后阶数) +- 跨时间尺度因果检验(小时级聚合特征 → 日级收益率) +- Bonferroni 多重检验校正 +- 可视化:p 值热力图、显著因果关系网络图 +""" + +import matplotlib +matplotlib.use('Agg') + +import numpy as np +import pandas as pd +import matplotlib.pyplot as plt +import warnings +from pathlib import Path +from typing import Optional, List, Tuple, Dict + +from statsmodels.tsa.stattools import grangercausalitytests + +from src.data_loader import load_hourly +from src.preprocessing import log_returns, add_derived_features + + +# ============================================================ +# 1. 因果检验对定义 +# ============================================================ + +# 5 对双向因果关系,每对 (cause, effect) +CAUSALITY_PAIRS = [ + ('volume', 'log_return'), + ('log_return', 'volume'), + ('abs_return', 'volume'), + ('volume', 'abs_return'), + ('taker_buy_ratio', 'log_return'), + ('log_return', 'taker_buy_ratio'), + ('squared_return', 'volume'), + ('volume', 'squared_return'), + ('range_pct', 'log_return'), + ('log_return', 'range_pct'), +] + +# 测试的滞后阶数 +TEST_LAGS = [1, 2, 3, 5, 10] + + +# ============================================================ +# 2. 
单对 Granger 因果检验
+# ============================================================
+
+def granger_test_pair(
+    df: pd.DataFrame,
+    cause: str,
+    effect: str,
+    max_lag: int = 10,
+    test_lags: Optional[List[int]] = None,
+) -> List[Dict]:
+    """
+    对指定的 (cause → effect) 方向执行 Granger 因果检验
+
+    Parameters
+    ----------
+    df : pd.DataFrame
+        包含 cause 和 effect 列的数据
+    cause : str
+        原因变量列名
+    effect : str
+        结果变量列名
+    max_lag : int
+        最大滞后阶数
+    test_lags : list of int, optional
+        需要测试的滞后阶数列表
+
+    Returns
+    -------
+    list of dict
+        每个滞后阶数的检验结果
+    """
+    if test_lags is None:
+        test_lags = TEST_LAGS
+
+    # grangercausalitytests 要求: 第一列是 effect,第二列是 cause
+    data = df[[effect, cause]].dropna()
+
+    if len(data) < max_lag + 20:
+        print(f"  [警告] {cause} → {effect}: 样本量不足 ({len(data)}),跳过")
+        return []
+
+    results = []
+    try:
+        # 执行检验,maxlag 取最大值,一次获取所有滞后
+        with warnings.catch_warnings():
+            warnings.simplefilter("ignore")
+            gc_results = grangercausalitytests(data, maxlag=max_lag, verbose=False)
+
+        # 提取指定滞后阶数的结果
+        for lag in test_lags:
+            if lag > max_lag:
+                continue
+            test_result = gc_results[lag]
+            # 取 ssr_ftest 的 F 统计量和 p 值
+            f_stat = test_result[0]['ssr_ftest'][0]
+            p_value = test_result[0]['ssr_ftest'][1]
+
+            results.append({
+                'cause': cause,
+                'effect': effect,
+                'lag': lag,
+                'f_stat': f_stat,
+                'p_value': p_value,
+            })
+    except Exception as e:
+        print(f"  [错误] {cause} → {effect}: {e}")
+
+    return results
+
+
+# ============================================================
+# 3. 批量因果检验
+# ============================================================
+
+def run_all_granger_tests(
+    df: pd.DataFrame,
+    pairs: Optional[List[Tuple[str, str]]] = None,
+    test_lags: Optional[List[int]] = None,
+) -> pd.DataFrame:
+    """
+    对所有变量对执行双向 Granger 因果检验
+
+    Parameters
+    ----------
+    df : pd.DataFrame
+        包含衍生特征的日线数据
+    pairs : list of tuple, optional
+        变量对列表 [(cause, effect), ...]
+    test_lags : list of int, optional
+        滞后阶数列表
+
+    Returns
+    -------
+    pd.DataFrame
+        所有检验结果汇总表
+    """
+    if pairs is None:
+        pairs = CAUSALITY_PAIRS
+    if test_lags is None:
+        test_lags = TEST_LAGS
+
+    max_lag = max(test_lags)
+    all_results = []
+
+    for cause, effect in pairs:
+        if cause not in df.columns or effect not in df.columns:
+            print(f"  [警告] 列 {cause} 或 {effect} 不存在,跳过")
+            continue
+        pair_results = granger_test_pair(df, cause, effect, max_lag=max_lag, test_lags=test_lags)
+        all_results.extend(pair_results)
+
+    results_df = pd.DataFrame(all_results)
+    return results_df
+
+
+# ============================================================
+# 4. Bonferroni 校正
+# ============================================================
+
+def apply_bonferroni(results_df: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame:
+    """
+    对 Granger 检验结果应用 Bonferroni 多重检验校正
+
+    Parameters
+    ----------
+    results_df : pd.DataFrame
+        包含 p_value 列的检验结果
+    alpha : float
+        原始显著性水平
+
+    Returns
+    -------
+    pd.DataFrame
+        添加了校正后显著性判断的结果
+    """
+    n_tests = len(results_df)
+    if n_tests == 0:
+        return results_df
+
+    out = results_df.copy()
+    # Bonferroni 校正阈值
+    corrected_alpha = alpha / n_tests
+    out['bonferroni_alpha'] = corrected_alpha
+    out['significant_raw'] = out['p_value'] < alpha
+    out['significant_corrected'] = out['p_value'] < corrected_alpha
+
+    return out
+
+
+# ============================================================
+# 5. 跨时间尺度因果检验
+# ============================================================
+
+def cross_timeframe_causality(
+    daily_df: pd.DataFrame,
+    test_lags: Optional[List[int]] = None,
+) -> pd.DataFrame:
+    """
+    检验小时级聚合特征是否 Granger 因果于日级收益率
+
+    具体步骤:
+    1. 加载小时级数据
+    2. 计算小时级波动率和成交量的日内聚合指标
+    3. 与日线收益率合并
+    4. 执行 Granger 因果检验
+
+    Parameters
+    ----------
+    daily_df : pd.DataFrame
+        日线数据(含 log_return)
+    test_lags : list of int, optional
+        滞后阶数列表
+
+    Returns
+    -------
+    pd.DataFrame
+        跨时间尺度因果检验结果
+    """
+    if test_lags is None:
+        test_lags = TEST_LAGS
+
+    # 加载小时数据
+    try:
+        hourly_raw = load_hourly()
+    except Exception as e:
+        print(f"  [警告] 无法加载小时级数据,跳过跨时间尺度因果检验: {e}")
+        return pd.DataFrame()
+
+    # 计算小时级衍生特征
+    hourly = add_derived_features(hourly_raw)
+
+    # 日内聚合:按日期聚合小时数据
+    hourly['date'] = hourly.index.date
+    agg_dict = {}
+
+    # 小时级日内波动率(对数收益率标准差)
+    if 'log_return' in hourly.columns:
+        hourly_vol = hourly.groupby('date')['log_return'].std()
+        hourly_vol.name = 'hourly_intraday_vol'
+        agg_dict['hourly_intraday_vol'] = hourly_vol
+
+    # 小时级日内成交量总和
+    if 'volume' in hourly.columns:
+        hourly_volume = hourly.groupby('date')['volume'].sum()
+        hourly_volume.name = 'hourly_volume_sum'
+        agg_dict['hourly_volume_sum'] = hourly_volume
+
+    # 小时级日内最大绝对收益率
+    if 'abs_return' in hourly.columns:
+        hourly_max_abs = hourly.groupby('date')['abs_return'].max()
+        hourly_max_abs.name = 'hourly_max_abs_return'
+        agg_dict['hourly_max_abs_return'] = hourly_max_abs
+
+    if not agg_dict:
+        print("  [警告] 小时级聚合特征为空,跳过")
+        return pd.DataFrame()
+
+    # 合并聚合结果
+    hourly_agg = pd.DataFrame(agg_dict)
+    hourly_agg.index = pd.to_datetime(hourly_agg.index)
+
+    # 与日线数据合并
+    daily_for_merge = daily_df[['log_return']].copy()
+    merged = daily_for_merge.join(hourly_agg, how='inner')
+
+    print(f"  [跨时间尺度] 合并后样本数: {len(merged)}")
+
+    # 对每个小时级聚合特征检验 → 日级收益率
+    cross_pairs = []
+    for col in agg_dict.keys():
+        cross_pairs.append((col, 'log_return'))
+
+    max_lag = max(test_lags)
+    all_results = []
+    for cause, effect in cross_pairs:
+        pair_results = granger_test_pair(merged, cause, effect, max_lag=max_lag, test_lags=test_lags)
+        all_results.extend(pair_results)
+
+    results_df = pd.DataFrame(all_results)
+    return results_df
+
+
+# ============================================================
+# 6. 
可视化:p 值热力图 +# ============================================================ + +def plot_pvalue_heatmap(results_df: pd.DataFrame, output_dir: Path): + """ + 绘制 p 值热力图(变量对 x 滞后阶数) + + Parameters + ---------- + results_df : pd.DataFrame + 因果检验结果 + output_dir : Path + 输出目录 + """ + if results_df.empty: + print(" [警告] 无检验结果,跳过热力图绘制") + return + + # 构建标签 + results_df = results_df.copy() + results_df['pair'] = results_df['cause'] + ' → ' + results_df['effect'] + + # 构建 pivot table: 行=pair, 列=lag + pivot = results_df.pivot_table(index='pair', columns='lag', values='p_value') + + fig, ax = plt.subplots(figsize=(12, max(6, len(pivot) * 0.5))) + + # 绘制热力图 + im = ax.imshow(-np.log10(pivot.values + 1e-300), cmap='RdYlGn_r', aspect='auto') + + # 设置坐标轴 + ax.set_xticks(range(len(pivot.columns))) + ax.set_xticklabels([f'Lag {c}' for c in pivot.columns], fontsize=10) + ax.set_yticks(range(len(pivot.index))) + ax.set_yticklabels(pivot.index, fontsize=9) + + # 在每个格子中标注 p 值 + for i in range(len(pivot.index)): + for j in range(len(pivot.columns)): + val = pivot.values[i, j] + if np.isnan(val): + text = 'N/A' + else: + text = f'{val:.4f}' + color = 'white' if -np.log10(val + 1e-300) > 2 else 'black' + ax.text(j, i, text, ha='center', va='center', fontsize=8, color=color) + + # Bonferroni 校正线 + n_tests = len(results_df) + if n_tests > 0: + bonf_alpha = 0.05 / n_tests + ax.set_title( + f'Granger 因果检验 p 值热力图 (-log10)\n' + f'Bonferroni 校正阈值: {bonf_alpha:.6f} (共 {n_tests} 次检验)', + fontsize=13 + ) + + cbar = fig.colorbar(im, ax=ax, shrink=0.8) + cbar.set_label('-log10(p-value)', fontsize=11) + + fig.savefig(output_dir / 'granger_pvalue_heatmap.png', + dpi=150, bbox_inches='tight') + plt.close(fig) + print(f" [保存] {output_dir / 'granger_pvalue_heatmap.png'}") + + +# ============================================================ +# 7. 
可视化:因果关系网络图 +# ============================================================ + +def plot_causal_network(results_df: pd.DataFrame, output_dir: Path, alpha: float = 0.05): + """ + 绘制显著因果关系网络图(matplotlib 箭头实现) + + 仅显示 Bonferroni 校正后仍显著的因果对(取最优滞后的结果) + + Parameters + ---------- + results_df : pd.DataFrame + 含 significant_corrected 列的检验结果 + output_dir : Path + 输出目录 + alpha : float + 显著性水平 + """ + if results_df.empty or 'significant_corrected' not in results_df.columns: + print(" [警告] 无校正后结果,跳过网络图绘制") + return + + # 筛选显著因果对(取每对中 p 值最小的滞后) + sig = results_df[results_df['significant_corrected']].copy() + if sig.empty: + print(" [信息] Bonferroni 校正后无显著因果关系,绘制空网络图") + + # 对每对取最小 p 值 + if not sig.empty: + sig_best = sig.loc[sig.groupby(['cause', 'effect'])['p_value'].idxmin()] + else: + sig_best = pd.DataFrame(columns=results_df.columns) + + # 收集所有变量节点 + all_vars = set() + for _, row in results_df.iterrows(): + all_vars.add(row['cause']) + all_vars.add(row['effect']) + all_vars = sorted(all_vars) + n_vars = len(all_vars) + + if n_vars == 0: + return + + # 布局:圆形排列 + angles = np.linspace(0, 2 * np.pi, n_vars, endpoint=False) + positions = {v: (np.cos(a), np.sin(a)) for v, a in zip(all_vars, angles)} + + fig, ax = plt.subplots(figsize=(10, 10)) + + # 绘制节点 + for var, (x, y) in positions.items(): + circle = plt.Circle((x, y), 0.12, color='steelblue', alpha=0.8) + ax.add_patch(circle) + ax.text(x, y, var, ha='center', va='center', fontsize=8, + fontweight='bold', color='white') + + # 绘制显著因果箭头 + for _, row in sig_best.iterrows(): + cause_pos = positions[row['cause']] + effect_pos = positions[row['effect']] + + # 计算起点和终点(缩短到节点边缘) + dx = effect_pos[0] - cause_pos[0] + dy = effect_pos[1] - cause_pos[1] + dist = np.sqrt(dx ** 2 + dy ** 2) + if dist < 0.01: + continue + + # 缩短箭头到节点圆的边缘 + shrink = 0.14 + start_x = cause_pos[0] + shrink * dx / dist + start_y = cause_pos[1] + shrink * dy / dist + end_x = effect_pos[0] - shrink * dx / dist + end_y = effect_pos[1] - shrink * dy / dist + + # 箭头粗细与 -log10(p) 相关 + width = min(3.0, -np.log10(row['p_value'] + 1e-300) * 0.5) + + ax.annotate( + '', + xy=(end_x, end_y), + xytext=(start_x, start_y), + arrowprops=dict( + arrowstyle='->', color='red', lw=width, + connectionstyle='arc3,rad=0.1', + mutation_scale=15, + ), + ) + # 标注滞后阶数和 p 值 + mid_x = (start_x + end_x) / 2 + mid_y = (start_y + end_y) / 2 + ax.text(mid_x, mid_y, f'lag={int(row["lag"])}\np={row["p_value"]:.2e}', + fontsize=7, ha='center', va='center', + bbox=dict(boxstyle='round,pad=0.2', facecolor='yellow', alpha=0.7)) + + n_sig = len(sig_best) + n_total = len(results_df) + ax.set_title( + f'Granger 因果关系网络 (Bonferroni 校正后)\n' + f'显著链接: {n_sig}/{n_total}', + fontsize=14 + ) + ax.set_xlim(-1.6, 1.6) + ax.set_ylim(-1.6, 1.6) + ax.set_aspect('equal') + ax.axis('off') + + fig.savefig(output_dir / 'granger_causal_network.png', + dpi=150, bbox_inches='tight') + plt.close(fig) + print(f" [保存] {output_dir / 'granger_causal_network.png'}") + + +# ============================================================ +# 8. 
结果打印 +# ============================================================ + +def print_causality_results(results_df: pd.DataFrame): + """打印所有因果检验结果""" + if results_df.empty: + print(" [信息] 无检验结果") + return + + print("\n" + "=" * 90) + print("Granger 因果检验结果明细") + print("=" * 90) + print(f" {'因果方向':<40} {'滞后':>4} {'F统计量':>12} {'p值':>12} {'原始显著':>8} {'校正显著':>8}") + print(" " + "-" * 88) + + for _, row in results_df.iterrows(): + pair_label = f"{row['cause']} → {row['effect']}" + sig_raw = '***' if row.get('significant_raw', False) else '' + sig_corr = '***' if row.get('significant_corrected', False) else '' + print(f" {pair_label:<40} {int(row['lag']):>4} " + f"{row['f_stat']:>12.4f} {row['p_value']:>12.6f} " + f"{sig_raw:>8} {sig_corr:>8}") + + # 汇总统计 + n_total = len(results_df) + n_sig_raw = results_df.get('significant_raw', pd.Series(dtype=bool)).sum() + n_sig_corr = results_df.get('significant_corrected', pd.Series(dtype=bool)).sum() + + print(f"\n 汇总: 共 {n_total} 次检验") + print(f" 原始显著 (p < 0.05): {n_sig_raw} ({n_sig_raw / n_total * 100:.1f}%)") + print(f" Bonferroni 校正后显著: {n_sig_corr} ({n_sig_corr / n_total * 100:.1f}%)") + + if n_total > 0: + bonf_alpha = 0.05 / n_total + print(f" Bonferroni 校正阈值: {bonf_alpha:.6f}") + + +# ============================================================ +# 9. 主入口 +# ============================================================ + +def run_causality_analysis( + df: pd.DataFrame, + output_dir: str = "output/causality", +) -> Dict: + """ + Granger 因果检验主函数 + + Parameters + ---------- + df : pd.DataFrame + 日线数据(已通过 add_derived_features 添加衍生特征) + output_dir : str + 图表输出目录 + + Returns + ------- + dict + 包含所有检验结果的字典 + """ + output_dir = Path(output_dir) + output_dir.mkdir(parents=True, exist_ok=True) + + print("=" * 70) + print("BTC Granger 因果检验分析") + print("=" * 70) + print(f"数据范围: {df.index.min()} ~ {df.index.max()}") + print(f"样本数量: {len(df)}") + print(f"测试滞后阶数: {TEST_LAGS}") + print(f"因果变量对数: {len(CAUSALITY_PAIRS)}") + print(f"总检验次数(含所有滞后): {len(CAUSALITY_PAIRS) * len(TEST_LAGS)}") + + # 设置中文字体 + plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei', 'DejaVu Sans'] + plt.rcParams['axes.unicode_minus'] = False + + # --- 日线级 Granger 因果检验 --- + print("\n>>> [1/4] 执行日线级 Granger 因果检验...") + daily_results = run_all_granger_tests(df, pairs=CAUSALITY_PAIRS, test_lags=TEST_LAGS) + + if not daily_results.empty: + daily_results = apply_bonferroni(daily_results, alpha=0.05) + print_causality_results(daily_results) + else: + print(" [警告] 日线级因果检验未产生结果") + + # --- 跨时间尺度因果检验 --- + print("\n>>> [2/4] 执行跨时间尺度因果检验(小时 → 日线)...") + cross_results = cross_timeframe_causality(df, test_lags=TEST_LAGS) + + if not cross_results.empty: + cross_results = apply_bonferroni(cross_results, alpha=0.05) + print("\n跨时间尺度因果检验结果:") + print_causality_results(cross_results) + else: + print(" [信息] 跨时间尺度因果检验无结果(可能小时数据不可用)") + + # --- 合并所有结果用于可视化 --- + all_results = pd.concat([daily_results, cross_results], ignore_index=True) + if not all_results.empty and 'significant_corrected' not in all_results.columns: + all_results = apply_bonferroni(all_results, alpha=0.05) + + # --- p 值热力图(仅日线级结果,避免混淆) --- + print("\n>>> [3/4] 绘制 p 值热力图...") + plot_pvalue_heatmap(daily_results, output_dir) + + # --- 因果关系网络图 --- + print("\n>>> [4/4] 绘制因果关系网络图...") + # 使用所有结果(含跨时间尺度) + if not all_results.empty: + # 重新做一次 Bonferroni 校正(因为合并后总检验数增加) + all_corrected = apply_bonferroni(all_results.drop( + columns=['bonferroni_alpha', 'significant_raw', 'significant_corrected'], + errors='ignore' + ), alpha=0.05) + 
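+        # 说明:合并后检验族变大,Bonferroni 阈值 α/n 随之收紧
+        # (例:n=50 时阈值为 0.05/50=0.001;n=65 时约为 0.000769),
+        # 因此原先仅在日线子集内显著的链接可能在合并校正后被过滤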
plot_causal_network(all_corrected, output_dir) + else: + print(" [警告] 无可用结果,跳过网络图") + + print("\n" + "=" * 70) + print("Granger 因果检验分析完成!") + print(f"图表已保存至: {output_dir.resolve()}") + print("=" * 70) + + return { + 'daily_results': daily_results, + 'cross_timeframe_results': cross_results, + 'all_results': all_results, + } + + +# ============================================================ +# 独立运行入口 +# ============================================================ + +if __name__ == '__main__': + from src.data_loader import load_daily + from src.preprocessing import add_derived_features + + df = load_daily() + df = add_derived_features(df) + run_causality_analysis(df) diff --git a/src/clustering.py b/src/clustering.py new file mode 100644 index 0000000..9b39e9f --- /dev/null +++ b/src/clustering.py @@ -0,0 +1,742 @@ +"""市场状态聚类与马尔可夫链分析模块 + +基于K-Means、GMM、HDBSCAN对BTC日线特征进行聚类, +构建状态转移矩阵并计算平稳分布。 +""" + +import warnings +import numpy as np +import pandas as pd +import matplotlib +matplotlib.use('Agg') +import matplotlib.pyplot as plt +import matplotlib.gridspec as gridspec +from pathlib import Path +from typing import Optional, Tuple, Dict, List + +from sklearn.preprocessing import StandardScaler +from sklearn.cluster import KMeans +from sklearn.mixture import GaussianMixture +from sklearn.decomposition import PCA +from sklearn.metrics import silhouette_score, silhouette_samples + +try: + import hdbscan + HAS_HDBSCAN = True +except ImportError: + HAS_HDBSCAN = False + warnings.warn("hdbscan 未安装,将跳过 HDBSCAN 聚类。pip install hdbscan") + + +# ============================================================ +# 特征工程 +# ============================================================ + +FEATURE_COLS = [ + "log_return", "abs_return", "vol_7d", "vol_30d", + "volume_ratio", "taker_buy_ratio", "range_pct", "body_pct", + "log_return_lag1", "log_return_lag2", +] + + +def _prepare_features(df: pd.DataFrame) -> Tuple[pd.DataFrame, np.ndarray, StandardScaler]: + """ + 准备聚类特征:添加滞后收益率、标准化、去除NaN行 + + Returns + ------- + df_clean : 清洗后的DataFrame(保留索引用于后续映射) + X_scaled : 标准化后的特征矩阵 + scaler : 标准化器(可用于逆变换) + """ + out = df.copy() + + # 添加滞后收益率特征 + out["log_return_lag1"] = out["log_return"].shift(1) + out["log_return_lag2"] = out["log_return"].shift(2) + + # 只保留所需特征列,删除含NaN的行 + df_feat = out[FEATURE_COLS].copy() + mask = df_feat.notna().all(axis=1) + df_clean = out.loc[mask].copy() + X_raw = df_feat.loc[mask].values + + # Z-score标准化 + scaler = StandardScaler() + X_scaled = scaler.fit_transform(X_raw) + + print(f"[特征准备] 有效样本数: {X_scaled.shape[0]}, 特征维度: {X_scaled.shape[1]}") + return df_clean, X_scaled, scaler + + +# ============================================================ +# K-Means 聚类 +# ============================================================ + +def _run_kmeans(X: np.ndarray, k_range: List[int] = None) -> Tuple[int, np.ndarray, Dict]: + """ + K-Means聚类,通过轮廓系数选择最优k + + Returns + ------- + best_k : 最优聚类数 + labels : 最优k对应的聚类标签 + info : 包含每个k的轮廓系数、惯性等 + """ + if k_range is None: + k_range = [3, 4, 5, 6, 7] + + results = {} + best_score = -1 + best_k = k_range[0] + best_labels = None + + print("\n" + "=" * 60) + print("K-Means 聚类分析") + print("=" * 60) + + for k in k_range: + km = KMeans(n_clusters=k, n_init=20, max_iter=500, random_state=42) + labels = km.fit_predict(X) + sil = silhouette_score(X, labels) + inertia = km.inertia_ + results[k] = {"silhouette": sil, "inertia": inertia, "labels": labels, "model": km} + print(f" k={k}: 轮廓系数={sil:.4f}, 惯性={inertia:.1f}") + + if sil > best_score: + best_score = sil + best_k = k + 
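+            # 轮廓系数 s(i) = (b(i) - a(i)) / max(a(i), b(i)):a 为簇内平均距离,
+            # b 为到最近邻簇的平均距离;此处取全样本均值最大者作为最优 k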
best_labels = labels + + print(f"\n >>> 最优 k = {best_k} (轮廓系数 = {best_score:.4f})") + return best_k, best_labels, results + + +# ============================================================ +# GMM (高斯混合模型) +# ============================================================ + +def _run_gmm(X: np.ndarray, k_range: List[int] = None) -> Tuple[int, np.ndarray, Dict]: + """ + GMM聚类,通过BIC选择最优组件数 + + Returns + ------- + best_k : BIC最低的组件数 + labels : 对应的聚类标签 + info : 每个k的BIC、AIC、标签等 + """ + if k_range is None: + k_range = [3, 4, 5, 6, 7] + + results = {} + best_bic = np.inf + best_k = k_range[0] + best_labels = None + + print("\n" + "=" * 60) + print("GMM (高斯混合模型) 聚类分析") + print("=" * 60) + + for k in k_range: + gmm = GaussianMixture(n_components=k, covariance_type='full', + n_init=5, max_iter=500, random_state=42) + gmm.fit(X) + labels = gmm.predict(X) + bic = gmm.bic(X) + aic = gmm.aic(X) + sil = silhouette_score(X, labels) + results[k] = {"bic": bic, "aic": aic, "silhouette": sil, + "labels": labels, "model": gmm} + print(f" k={k}: BIC={bic:.1f}, AIC={aic:.1f}, 轮廓系数={sil:.4f}") + + if bic < best_bic: + best_bic = bic + best_k = k + best_labels = labels + + print(f"\n >>> 最优 k = {best_k} (BIC = {best_bic:.1f})") + return best_k, best_labels, results + + +# ============================================================ +# HDBSCAN (密度聚类) +# ============================================================ + +def _run_hdbscan(X: np.ndarray) -> Tuple[np.ndarray, Dict]: + """ + HDBSCAN密度聚类 + + Returns + ------- + labels : 聚类标签 (-1表示噪声) + info : 聚类统计信息 + """ + if not HAS_HDBSCAN: + print("\n[HDBSCAN] 跳过 - hdbscan 未安装") + return None, {} + + print("\n" + "=" * 60) + print("HDBSCAN 密度聚类分析") + print("=" * 60) + + clusterer = hdbscan.HDBSCAN( + min_cluster_size=30, + min_samples=10, + metric='euclidean', + cluster_selection_method='eom', + ) + labels = clusterer.fit_predict(X) + + n_clusters = len(set(labels)) - (1 if -1 in labels else 0) + n_noise = (labels == -1).sum() + noise_pct = n_noise / len(labels) * 100 + + info = { + "n_clusters": n_clusters, + "n_noise": n_noise, + "noise_pct": noise_pct, + "labels": labels, + "model": clusterer, + } + + print(f" 聚类数: {n_clusters}") + print(f" 噪声点: {n_noise} ({noise_pct:.1f}%)") + + # 排除噪声点后计算轮廓系数 + if n_clusters >= 2: + mask = labels >= 0 + if mask.sum() > n_clusters: + sil = silhouette_score(X[mask], labels[mask]) + info["silhouette"] = sil + print(f" 轮廓系数(去噪): {sil:.4f}") + + return labels, info + + +# ============================================================ +# 聚类解释与标签映射 +# ============================================================ + +# 状态标签定义 +STATE_LABELS = { + "sideways": "横盘整理", + "mild_up": "温和上涨", + "mild_down": "温和下跌", + "surge": "强势上涨", + "crash": "急剧下跌", + "high_vol": "高波动", + "low_vol": "低波动", +} + + +def _interpret_clusters(df_clean: pd.DataFrame, labels: np.ndarray, + method_name: str = "K-Means") -> pd.DataFrame: + """ + 解释聚类结果:计算每个簇的特征均值,并自动标注状态名称 + + Returns + ------- + cluster_desc : 每个聚类的特征均值表 + state_label列 + """ + df_work = df_clean.copy() + col_name = f"cluster_{method_name}" + df_work[col_name] = labels + + # 计算每个聚类的特征均值 + cluster_means = df_work.groupby(col_name)[FEATURE_COLS].mean() + + print(f"\n{'=' * 60}") + print(f"{method_name} 聚类特征均值") + print("=" * 60) + + # 自动标注状态 + state_labels = {} + for cid in cluster_means.index: + row = cluster_means.loc[cid] + lr = row["log_return"] + vol = row["vol_7d"] + abs_r = row["abs_return"] + + # 基于收益率和波动率的规则判断 + if lr > 0.02 and abs_r > 0.02: + label = "surge" + elif lr < -0.02 and abs_r > 0.02: + 
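+            # 规则:簇均值日对数收益 < -2% 且平均绝对收益 > 2% → 急剧下跌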
label = "crash" + elif lr > 0.005: + label = "mild_up" + elif lr < -0.005: + label = "mild_down" + elif abs_r > 0.015 or vol > cluster_means["vol_7d"].median() * 1.5: + label = "high_vol" + else: + label = "sideways" + + state_labels[cid] = label + + cluster_means["state_label"] = pd.Series(state_labels) + cluster_means["state_cn"] = cluster_means["state_label"].map(STATE_LABELS) + + # 统计每个聚类的样本数和占比 + counts = df_work[col_name].value_counts().sort_index() + cluster_means["count"] = counts + cluster_means["pct"] = (counts / counts.sum() * 100).round(1) + + for cid in cluster_means.index: + row = cluster_means.loc[cid] + print(f"\n 聚类 {cid} [{row['state_cn']}] (n={int(row['count'])}, {row['pct']:.1f}%)") + print(f" log_return: {row['log_return']:.5f}, abs_return: {row['abs_return']:.5f}") + print(f" vol_7d: {row['vol_7d']:.4f}, vol_30d: {row['vol_30d']:.4f}") + print(f" volume_ratio: {row['volume_ratio']:.3f}, taker_buy_ratio: {row['taker_buy_ratio']:.4f}") + print(f" range_pct: {row['range_pct']:.5f}, body_pct: {row['body_pct']:.5f}") + + return cluster_means + + +# ============================================================ +# 马尔可夫转移矩阵 +# ============================================================ + +def _compute_transition_matrix(labels: np.ndarray) -> Tuple[np.ndarray, np.ndarray, np.ndarray]: + """ + 计算状态转移概率矩阵、平稳分布和平均持有时间 + + Parameters + ---------- + labels : 时间序列的聚类标签 + + Returns + ------- + trans_matrix : 转移概率矩阵 (n_states x n_states) + stationary : 平稳分布向量 + holding_time : 各状态平均持有时间 + """ + states = np.sort(np.unique(labels)) + n_states = len(states) + + # 状态映射到连续索引 + state_to_idx = {s: i for i, s in enumerate(states)} + + # 计数矩阵 + count_matrix = np.zeros((n_states, n_states), dtype=np.float64) + for t in range(len(labels) - 1): + i = state_to_idx[labels[t]] + j = state_to_idx[labels[t + 1]] + count_matrix[i, j] += 1 + + # 转移概率矩阵(行归一化) + row_sums = count_matrix.sum(axis=1, keepdims=True) + row_sums[row_sums == 0] = 1 # 避免除零 + trans_matrix = count_matrix / row_sums + + # 平稳分布:求转移矩阵的左特征向量(特征值=1对应的) + # π * P = π => P^T * π^T = π^T + eigenvalues, eigenvectors = np.linalg.eig(trans_matrix.T) + + # 找最接近1的特征值对应的特征向量 + idx = np.argmin(np.abs(eigenvalues - 1.0)) + stationary = np.real(eigenvectors[:, idx]) + stationary = stationary / stationary.sum() # 归一化为概率 + + # 确保非负(数值误差可能导致微小负值) + stationary = np.abs(stationary) + stationary = stationary / stationary.sum() + + # 平均持有时间 = 1 / (1 - p_ii) + diag = np.diag(trans_matrix) + holding_time = np.where(diag < 1.0, 1.0 / (1.0 - diag), np.inf) + + return trans_matrix, stationary, holding_time + + +def _print_markov_results(trans_matrix: np.ndarray, stationary: np.ndarray, + holding_time: np.ndarray, cluster_desc: pd.DataFrame): + """打印马尔可夫链分析结果""" + states = cluster_desc.index.tolist() + state_names = cluster_desc["state_cn"].tolist() + + print("\n" + "=" * 60) + print("马尔可夫链状态转移分析") + print("=" * 60) + + # 转移概率矩阵 + print("\n转移概率矩阵:") + header = " " + " ".join([f" {state_names[j][:4]:>4s}" for j in range(len(states))]) + print(header) + for i, s in enumerate(states): + row_str = f" {state_names[i][:4]:>4s}" + for j in range(len(states)): + row_str += f" {trans_matrix[i, j]:6.3f}" + print(row_str) + + # 平稳分布 + print("\n平稳分布 (长期均衡概率):") + for i, s in enumerate(states): + print(f" {state_names[i]}: {stationary[i]:.4f} ({stationary[i]*100:.1f}%)") + + # 平均持有时间 + print("\n平均持有时间 (天):") + for i, s in enumerate(states): + if np.isinf(holding_time[i]): + print(f" {state_names[i]}: ∞ (吸收态)") + else: + print(f" {state_names[i]}: {holding_time[i]:.2f} 天") + 
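+
+# 下面是一个最小校验示例(非主流程;函数名与容差均为示意):用幂迭代
+# 交叉验证 _compute_transition_matrix 的平稳分布 π 确实满足 π P = π。
+# 对不可约且非周期的转移矩阵,π_{t+1} = π_t P 从任意初始分布收敛到唯一平稳分布。
+def _sanity_check_stationary(trans_matrix: np.ndarray,
+                             stationary: np.ndarray,
+                             n_iter: int = 500,
+                             tol: float = 1e-12) -> bool:
+    """幂迭代校验:从均匀分布出发反复右乘转移矩阵,与特征向量解对比"""
+    n = trans_matrix.shape[0]
+    pi = np.full(n, 1.0 / n)
+    for _ in range(n_iter):
+        pi_next = pi @ trans_matrix
+        if np.max(np.abs(pi_next - pi)) < tol:
+            pi = pi_next
+            break
+        pi = pi_next
+    # 与特征值法得到的 stationary 对比(容忍数值误差)
+    return bool(np.max(np.abs(pi - stationary)) < 1e-6)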
+ +# ============================================================ +# 可视化 +# ============================================================ + +def _plot_pca_scatter(X: np.ndarray, labels: np.ndarray, + cluster_desc: pd.DataFrame, method_name: str, + output_dir: Path): + """2D PCA散点图,按聚类着色""" + pca = PCA(n_components=2) + X_2d = pca.fit_transform(X) + + fig, ax = plt.subplots(figsize=(12, 8)) + states = np.sort(np.unique(labels)) + colors = plt.cm.Set2(np.linspace(0, 1, len(states))) + + for i, s in enumerate(states): + mask = labels == s + label_name = cluster_desc.loc[s, "state_cn"] if s in cluster_desc.index else f"Cluster {s}" + ax.scatter(X_2d[mask, 0], X_2d[mask, 1], c=[colors[i]], label=label_name, + alpha=0.5, s=15, edgecolors='none') + + ax.set_xlabel(f"PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)", fontsize=12) + ax.set_ylabel(f"PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)", fontsize=12) + ax.set_title(f"{method_name} 聚类结果 - PCA 2D投影", fontsize=14) + ax.legend(fontsize=10, loc='best') + ax.grid(True, alpha=0.3) + + fig.savefig(output_dir / f"cluster_pca_{method_name.lower().replace(' ', '_')}.png", + dpi=150, bbox_inches='tight') + plt.close(fig) + print(f" [保存] cluster_pca_{method_name.lower().replace(' ', '_')}.png") + + +def _plot_silhouette(X: np.ndarray, labels: np.ndarray, method_name: str, output_dir: Path): + """轮廓系数分析图""" + n_clusters = len(set(labels) - {-1}) + if n_clusters < 2: + return + + # 排除噪声点 + mask = labels >= 0 + if mask.sum() < n_clusters + 1: + return + + sil_vals = silhouette_samples(X[mask], labels[mask]) + avg_sil = silhouette_score(X[mask], labels[mask]) + + fig, ax = plt.subplots(figsize=(10, 7)) + y_lower = 10 + valid_labels = np.sort(np.unique(labels[mask])) + colors = plt.cm.Set2(np.linspace(0, 1, len(valid_labels))) + + for i, c in enumerate(valid_labels): + c_sil = sil_vals[labels[mask] == c] + c_sil.sort() + size = c_sil.shape[0] + y_upper = y_lower + size + + ax.fill_betweenx(np.arange(y_lower, y_upper), 0, c_sil, + facecolor=colors[i], edgecolor=colors[i], alpha=0.7) + ax.text(-0.05, y_lower + 0.5 * size, str(c), fontsize=10) + y_lower = y_upper + 10 + + ax.axvline(x=avg_sil, color="red", linestyle="--", label=f"平均={avg_sil:.3f}") + ax.set_xlabel("轮廓系数", fontsize=12) + ax.set_ylabel("聚类标签", fontsize=12) + ax.set_title(f"{method_name} 轮廓系数分析 (平均={avg_sil:.3f})", fontsize=14) + ax.legend(fontsize=10) + + fig.savefig(output_dir / f"cluster_silhouette_{method_name.lower().replace(' ', '_')}.png", + dpi=150, bbox_inches='tight') + plt.close(fig) + print(f" [保存] cluster_silhouette_{method_name.lower().replace(' ', '_')}.png") + + +def _plot_cluster_heatmap(cluster_desc: pd.DataFrame, method_name: str, output_dir: Path): + """聚类特征热力图""" + # 只选择数值型特征列 + feat_cols = [c for c in FEATURE_COLS if c in cluster_desc.columns] + data = cluster_desc[feat_cols].copy() + + # 对每列进行Z-score标准化(便于比较不同量纲的特征) + data_norm = (data - data.mean()) / (data.std() + 1e-10) + + fig, ax = plt.subplots(figsize=(14, max(6, len(data) * 1.2))) + + # 行标签用中文状态名 + row_labels = [f"{idx}-{cluster_desc.loc[idx, 'state_cn']}" for idx in data.index] + + im = ax.imshow(data_norm.values, cmap='RdYlGn', aspect='auto') + ax.set_xticks(range(len(feat_cols))) + ax.set_xticklabels(feat_cols, rotation=45, ha='right', fontsize=10) + ax.set_yticks(range(len(row_labels))) + ax.set_yticklabels(row_labels, fontsize=11) + + # 在格子中显示原始数值 + for i in range(data.shape[0]): + for j in range(data.shape[1]): + val = data.iloc[i, j] + ax.text(j, i, f"{val:.4f}", ha='center', va='center', fontsize=8, + 
color='black' if abs(data_norm.iloc[i, j]) < 1.5 else 'white') + + plt.colorbar(im, ax=ax, shrink=0.8, label="标准化值") + ax.set_title(f"{method_name} 各聚类特征热力图", fontsize=14) + + fig.savefig(output_dir / f"cluster_heatmap_{method_name.lower().replace(' ', '_')}.png", + dpi=150, bbox_inches='tight') + plt.close(fig) + print(f" [保存] cluster_heatmap_{method_name.lower().replace(' ', '_')}.png") + + +def _plot_transition_heatmap(trans_matrix: np.ndarray, cluster_desc: pd.DataFrame, + output_dir: Path): + """状态转移概率矩阵热力图""" + state_names = [cluster_desc.loc[idx, "state_cn"] for idx in cluster_desc.index] + + fig, ax = plt.subplots(figsize=(10, 8)) + im = ax.imshow(trans_matrix, cmap='YlOrRd', vmin=0, vmax=1, aspect='auto') + + n = len(state_names) + ax.set_xticks(range(n)) + ax.set_xticklabels(state_names, rotation=45, ha='right', fontsize=11) + ax.set_yticks(range(n)) + ax.set_yticklabels(state_names, fontsize=11) + + # 标注概率值 + for i in range(n): + for j in range(n): + color = 'white' if trans_matrix[i, j] > 0.5 else 'black' + ax.text(j, i, f"{trans_matrix[i, j]:.3f}", ha='center', va='center', + fontsize=11, color=color, fontweight='bold') + + plt.colorbar(im, ax=ax, shrink=0.8, label="转移概率") + ax.set_xlabel("下一状态", fontsize=12) + ax.set_ylabel("当前状态", fontsize=12) + ax.set_title("马尔可夫状态转移概率矩阵", fontsize=14) + + fig.savefig(output_dir / "cluster_transition_matrix.png", dpi=150, bbox_inches='tight') + plt.close(fig) + print(f" [保存] cluster_transition_matrix.png") + + +def _plot_state_timeseries(df_clean: pd.DataFrame, labels: np.ndarray, + cluster_desc: pd.DataFrame, output_dir: Path): + """状态随时间变化的时间序列图""" + fig, axes = plt.subplots(2, 1, figsize=(18, 10), height_ratios=[2, 1], sharex=True) + + dates = df_clean.index + close = df_clean["close"].values + + states = np.sort(np.unique(labels)) + colors = plt.cm.Set2(np.linspace(0, 1, len(states))) + color_map = {s: colors[i] for i, s in enumerate(states)} + + # 上图:价格走势,按状态着色 + ax1 = axes[0] + for i in range(len(dates) - 1): + ax1.plot([dates[i], dates[i + 1]], [close[i], close[i + 1]], + color=color_map[labels[i]], linewidth=0.8) + + # 添加图例 + from matplotlib.patches import Patch + legend_patches = [] + for s in states: + name = cluster_desc.loc[s, "state_cn"] if s in cluster_desc.index else f"Cluster {s}" + legend_patches.append(Patch(color=color_map[s], label=name)) + ax1.legend(handles=legend_patches, fontsize=9, loc='upper left') + ax1.set_ylabel("BTC 价格 (USDT)", fontsize=12) + ax1.set_title("BTC 价格与市场状态时间序列", fontsize=14) + ax1.set_yscale('log') + ax1.grid(True, alpha=0.3) + + # 下图:状态标签时间线 + ax2 = axes[1] + state_colors = [color_map[l] for l in labels] + ax2.bar(dates, np.ones(len(dates)), color=state_colors, width=1.5, edgecolor='none') + ax2.set_yticks([]) + ax2.set_ylabel("市场状态", fontsize=12) + ax2.set_xlabel("日期", fontsize=12) + + plt.tight_layout() + fig.savefig(output_dir / "cluster_state_timeseries.png", dpi=150, bbox_inches='tight') + plt.close(fig) + print(f" [保存] cluster_state_timeseries.png") + + +def _plot_kmeans_selection(kmeans_results: Dict, gmm_results: Dict, output_dir: Path): + """K选择对比图:轮廓系数 + BIC""" + fig, axes = plt.subplots(1, 3, figsize=(18, 5)) + + # 1. 
K-Means 轮廓系数 + ks_km = sorted(kmeans_results.keys()) + sils_km = [kmeans_results[k]["silhouette"] for k in ks_km] + axes[0].plot(ks_km, sils_km, 'bo-', linewidth=2, markersize=8) + best_k_km = ks_km[np.argmax(sils_km)] + axes[0].axvline(x=best_k_km, color='red', linestyle='--', alpha=0.7) + axes[0].set_xlabel("k", fontsize=12) + axes[0].set_ylabel("轮廓系数", fontsize=12) + axes[0].set_title("K-Means 轮廓系数", fontsize=13) + axes[0].grid(True, alpha=0.3) + + # 2. K-Means 惯性 (Elbow) + inertias = [kmeans_results[k]["inertia"] for k in ks_km] + axes[1].plot(ks_km, inertias, 'gs-', linewidth=2, markersize=8) + axes[1].set_xlabel("k", fontsize=12) + axes[1].set_ylabel("惯性 (Inertia)", fontsize=12) + axes[1].set_title("K-Means 肘部法则", fontsize=13) + axes[1].grid(True, alpha=0.3) + + # 3. GMM BIC + ks_gmm = sorted(gmm_results.keys()) + bics = [gmm_results[k]["bic"] for k in ks_gmm] + axes[2].plot(ks_gmm, bics, 'r^-', linewidth=2, markersize=8) + best_k_gmm = ks_gmm[np.argmin(bics)] + axes[2].axvline(x=best_k_gmm, color='blue', linestyle='--', alpha=0.7) + axes[2].set_xlabel("k", fontsize=12) + axes[2].set_ylabel("BIC", fontsize=12) + axes[2].set_title("GMM BIC 选择", fontsize=13) + axes[2].grid(True, alpha=0.3) + + plt.tight_layout() + fig.savefig(output_dir / "cluster_k_selection.png", dpi=150, bbox_inches='tight') + plt.close(fig) + print(f" [保存] cluster_k_selection.png") + + +# ============================================================ +# 主入口 +# ============================================================ + +def run_clustering_analysis(df: pd.DataFrame, output_dir: "str | Path" = "output/clustering") -> Dict: + """ + 市场状态聚类与马尔可夫链分析 - 主入口 + + Parameters + ---------- + df : pd.DataFrame + 已经通过 add_derived_features() 添加了衍生特征的日线数据 + output_dir : str or Path + 图表输出目录 + + Returns + ------- + results : dict + 包含聚类结果、转移矩阵、平稳分布等 + """ + output_dir = Path(output_dir) + output_dir.mkdir(parents=True, exist_ok=True) + + # 设置中文字体(macOS) + plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei', 'DejaVu Sans'] + plt.rcParams['axes.unicode_minus'] = False + + print("=" * 60) + print(" BTC 市场状态聚类与马尔可夫链分析") + print("=" * 60) + + # ---- 1. 特征准备 ---- + df_clean, X_scaled, scaler = _prepare_features(df) + + # ---- 2. K-Means 聚类 ---- + best_k_km, km_labels, kmeans_results = _run_kmeans(X_scaled) + + # ---- 3. GMM 聚类 ---- + best_k_gmm, gmm_labels, gmm_results = _run_gmm(X_scaled) + + # ---- 4. HDBSCAN 聚类 ---- + hdbscan_labels, hdbscan_info = _run_hdbscan(X_scaled) + + # ---- 5. K选择对比图 ---- + print("\n[可视化] 生成K选择对比图...") + _plot_kmeans_selection(kmeans_results, gmm_results, output_dir) + + # ---- 6. K-Means 聚类解释 ---- + km_desc = _interpret_clusters(df_clean, km_labels, "K-Means") + + # ---- 7. GMM 聚类解释 ---- + gmm_desc = _interpret_clusters(df_clean, gmm_labels, "GMM") + + # ---- 8. 马尔可夫链分析(基于K-Means结果)---- + trans_matrix, stationary, holding_time = _compute_transition_matrix(km_labels) + _print_markov_results(trans_matrix, stationary, holding_time, km_desc) + + # ---- 9. 
可视化 ---- + print("\n[可视化] 生成分析图表...") + + # PCA散点图 + _plot_pca_scatter(X_scaled, km_labels, km_desc, "K-Means", output_dir) + _plot_pca_scatter(X_scaled, gmm_labels, gmm_desc, "GMM", output_dir) + if hdbscan_labels is not None and hdbscan_info.get("n_clusters", 0) >= 2: + # 为HDBSCAN创建简易描述 + hdb_states = np.sort(np.unique(hdbscan_labels[hdbscan_labels >= 0])) + hdb_desc = _interpret_clusters(df_clean, hdbscan_labels, "HDBSCAN") + _plot_pca_scatter(X_scaled, hdbscan_labels, hdb_desc, "HDBSCAN", output_dir) + + # 轮廓系数图 + _plot_silhouette(X_scaled, km_labels, "K-Means", output_dir) + + # 聚类特征热力图 + _plot_cluster_heatmap(km_desc, "K-Means", output_dir) + _plot_cluster_heatmap(gmm_desc, "GMM", output_dir) + + # 转移矩阵热力图 + _plot_transition_heatmap(trans_matrix, km_desc, output_dir) + + # 状态时间序列图 + _plot_state_timeseries(df_clean, km_labels, km_desc, output_dir) + + # ---- 10. 汇总结果 ---- + results = { + "kmeans": { + "best_k": best_k_km, + "labels": km_labels, + "cluster_desc": km_desc, + "all_results": kmeans_results, + }, + "gmm": { + "best_k": best_k_gmm, + "labels": gmm_labels, + "cluster_desc": gmm_desc, + "all_results": gmm_results, + }, + "hdbscan": { + "labels": hdbscan_labels, + "info": hdbscan_info, + }, + "markov": { + "transition_matrix": trans_matrix, + "stationary_distribution": stationary, + "holding_time": holding_time, + }, + "features": { + "df_clean": df_clean, + "X_scaled": X_scaled, + "scaler": scaler, + }, + } + + print("\n" + "=" * 60) + print(" 聚类与马尔可夫链分析完成!") + print("=" * 60) + + return results + + +# ============================================================ +# 命令行入口 +# ============================================================ + +if __name__ == "__main__": + from data_loader import load_daily + from preprocessing import add_derived_features + + df = load_daily() + df = add_derived_features(df) + + results = run_clustering_analysis(df, output_dir="output/clustering") diff --git a/src/data_loader.py b/src/data_loader.py new file mode 100644 index 0000000..b2856cd --- /dev/null +++ b/src/data_loader.py @@ -0,0 +1,142 @@ +"""统一数据加载模块 - 处理毫秒/微秒时间戳差异""" + +import pandas as pd +import numpy as np +from pathlib import Path +from typing import Optional + +DATA_DIR = Path(__file__).parent.parent / "data" + +AVAILABLE_INTERVALS = [ + "1m", "3m", "5m", "15m", "30m", + "1h", "2h", "4h", "6h", "8h", "12h", + "1d", "3d", "1w", "1mo" +] + +COLUMNS = [ + "open_time", "open", "high", "low", "close", "volume", + "close_time", "quote_volume", "trades", + "taker_buy_volume", "taker_buy_quote_volume", "ignore" +] + +NUMERIC_COLS = [ + "open", "high", "low", "close", "volume", + "quote_volume", "trades", "taker_buy_volume", "taker_buy_quote_volume" +] + + +def _adaptive_timestamp(ts_series: pd.Series) -> pd.DatetimeIndex: + """自适应处理毫秒(13位)和微秒(16位)时间戳""" + ts = ts_series.astype(np.int64) + # 16位时间戳(微秒) -> 转为毫秒 + mask = ts > 1e15 + ts = ts.copy() + ts[mask] = ts[mask] // 1000 + return pd.to_datetime(ts, unit="ms") + + +def load_klines( + interval: str = "1d", + start: Optional[str] = None, + end: Optional[str] = None, + data_dir: Optional[Path] = None, +) -> pd.DataFrame: + """ + 加载指定时间粒度的K线数据 + + Parameters + ---------- + interval : str + K线粒度,如 '1d', '1h', '4h', '1w', '1mo' + start : str, optional + 起始日期,如 '2020-01-01' + end : str, optional + 结束日期,如 '2025-12-31' + data_dir : Path, optional + 数据目录,默认使用 data/ + + Returns + ------- + pd.DataFrame + 以 DatetimeIndex 为索引的K线数据 + """ + if data_dir is None: + data_dir = DATA_DIR + + filepath = data_dir / f"btcusdt_{interval}.csv" + if not 
filepath.exists(): + raise FileNotFoundError(f"数据文件不存在: {filepath}") + + df = pd.read_csv(filepath) + + # 类型转换 + for col in NUMERIC_COLS: + if col in df.columns: + df[col] = pd.to_numeric(df[col], errors="coerce") + + # 自适应时间戳处理 + df.index = _adaptive_timestamp(df["open_time"]) + df.index.name = "datetime" + + # close_time 也做处理 + if "close_time" in df.columns: + df["close_time"] = _adaptive_timestamp(df["close_time"]) + + # 删除原始时间戳列和ignore列 + df.drop(columns=["open_time", "ignore"], inplace=True, errors="ignore") + + # 排序去重 + df.sort_index(inplace=True) + df = df[~df.index.duplicated(keep="first")] + + # 时间范围过滤 + if start: + df = df[df.index >= pd.Timestamp(start)] + if end: + df = df[df.index <= pd.Timestamp(end)] + + return df + + +def load_daily(start: Optional[str] = None, end: Optional[str] = None) -> pd.DataFrame: + """快捷加载日线数据""" + return load_klines("1d", start=start, end=end) + + +def load_hourly(start: Optional[str] = None, end: Optional[str] = None) -> pd.DataFrame: + """快捷加载小时数据""" + return load_klines("1h", start=start, end=end) + + +def validate_data(df: pd.DataFrame, interval: str = "1d") -> dict: + """数据完整性校验""" + report = { + "rows": len(df), + "date_range": f"{df.index.min()} ~ {df.index.max()}", + "null_counts": df.isnull().sum().to_dict(), + "duplicate_index": df.index.duplicated().sum(), + } + + # 检查价格合理性 + report["price_range"] = f"{df['close'].min():.2f} ~ {df['close'].max():.2f}" + report["negative_volume"] = (df["volume"] < 0).sum() + + # 检查缺失天数(仅日线) + if interval == "1d": + expected_days = (df.index.max() - df.index.min()).days + 1 + report["expected_days"] = expected_days + report["missing_days"] = expected_days - len(df) + + return report + + +# 数据切分常量 +TRAIN_END = "2022-09-30" +VAL_END = "2024-06-30" + +def split_data(df: pd.DataFrame): + """按时间顺序切分 训练/验证/测试 集""" + train = df[df.index <= TRAIN_END] + val = df[(df.index > TRAIN_END) & (df.index <= VAL_END)] + test = df[df.index > VAL_END] + return train, val, test diff --git a/src/fft_analysis.py b/src/fft_analysis.py new file mode 100644 index 0000000..63e371e --- /dev/null +++ b/src/fft_analysis.py @@ -0,0 +1,901 @@ +"""FFT 频谱分析模块 - BTC价格周期性检测与频域特征提取""" + +import matplotlib +matplotlib.use("Agg") + +import numpy as np +import pandas as pd +import matplotlib.pyplot as plt +from scipy.fft import fft, fftfreq, ifft +from scipy.signal import find_peaks, butter, sosfiltfilt +from pathlib import Path +from typing import Dict, List, Optional, Tuple + +from src.data_loader import load_klines +from src.preprocessing import log_returns, detrend_linear + + +# ============================================================ +# 常量定义 +# ============================================================ + +# 多时间框架比较所用的K线粒度及其对应采样周期(天) +MULTI_TF_INTERVALS = { + "4h": 4 / 24, # 0.1667天 + "1d": 1.0, # 1天 + "1w": 7.0, # 7天 +} + +# 带通滤波目标周期(天) +BANDPASS_PERIODS_DAYS = [7, 30, 90, 365, 1400] + +# 峰值检测阈值:功率必须超过背景噪声的倍数 +PEAK_THRESHOLD_RATIO = 5.0 + +# 图表保存参数 +SAVE_KW = dict(dpi=150, bbox_inches="tight") + + +# ============================================================ +# 核心FFT计算函数 +# ============================================================ + +def compute_fft_spectrum( + signal: np.ndarray, + sampling_period_days: float, + apply_window: bool = True, +) -> Tuple[np.ndarray, np.ndarray, np.ndarray]: + """ + 计算信号的FFT功率谱 + + Parameters + ---------- + signal : np.ndarray + 输入时域信号(需已去趋势/取对数收益率) + sampling_period_days : float + 采样周期,单位为天(日线=1.0, 4h线=4/24) + apply_window : bool + 是否应用Hann窗函数以抑制频谱泄漏 + + Returns + ------- + freqs : np.ndarray + 
频率数组(仅正频率部分),单位 cycles/day + periods : np.ndarray + 周期数组(天),即 1/freqs + power : np.ndarray + 功率谱(振幅平方的归一化值) + """ + n = len(signal) + if n == 0: + return np.array([]), np.array([]), np.array([]) + + # 应用Hann窗减少频谱泄漏 + if apply_window: + window = np.hanning(n) + windowed = signal * window + # 窗函数能量补偿:保持总功率不变 + window_energy = np.sum(window ** 2) / n + else: + windowed = signal.copy() + window_energy = 1.0 + + # FFT计算 + yf = fft(windowed) + freqs = fftfreq(n, d=sampling_period_days) + + # 仅取正频率部分(排除直流分量 freq=0) + pos_mask = freqs > 0 + freqs_pos = freqs[pos_mask] + yf_pos = yf[pos_mask] + + # 功率谱密度:|FFT|^2 / (N * 窗函数能量) + power = (np.abs(yf_pos) ** 2) / (n * window_energy) + + # 对应周期 + periods = 1.0 / freqs_pos + + return freqs_pos, periods, power + + +# ============================================================ +# AR(1) 红噪声基线模型 +# ============================================================ + +def ar1_red_noise_spectrum( + signal: np.ndarray, + freqs: np.ndarray, + sampling_period_days: float, + confidence_percentile: float = 95.0, +) -> Tuple[np.ndarray, np.ndarray]: + """ + 基于AR(1)模型估算红噪声理论功率谱 + + AR(1)模型的功率谱密度公式: + S(f) = S0 * (1 - rho^2) / (1 - 2*rho*cos(2*pi*f*dt) + rho^2) + + Parameters + ---------- + signal : np.ndarray + 原始信号 + freqs : np.ndarray + 频率数组 + sampling_period_days : float + 采样周期 + confidence_percentile : float + 置信水平百分位数(默认95%) + + Returns + ------- + noise_mean : np.ndarray + 红噪声理论均值功率谱 + noise_threshold : np.ndarray + 指定置信水平的功率阈值 + """ + n = len(signal) + if n < 3: + return np.zeros_like(freqs), np.zeros_like(freqs) + + # 估计AR(1)系数 rho(滞后1自相关) + signal_centered = signal - np.mean(signal) + autocov_0 = np.sum(signal_centered ** 2) / n + autocov_1 = np.sum(signal_centered[:-1] * signal_centered[1:]) / n + rho = autocov_1 / autocov_0 if autocov_0 > 0 else 0.0 + rho = np.clip(rho, -0.999, 0.999) # 防止数值不稳定 + + # AR(1)理论功率谱 + variance = autocov_0 + s0 = variance * (1 - rho ** 2) + cos_term = np.cos(2 * np.pi * freqs * sampling_period_days) + denominator = 1 - 2 * rho * cos_term + rho ** 2 + noise_mean = s0 / denominator + + # 归一化使均值与信号功率谱均值匹配(经验缩放) + # 在chi-squared分布下,FFT功率近似服从指数分布(自由度2) + # 95%置信上界 = 均值 * chi2_ppf(0.95, 2) / 2 ≈ 均值 * 2.996 + from scipy.stats import chi2 + scale_factor = chi2.ppf(confidence_percentile / 100.0, df=2) / 2.0 + noise_threshold = noise_mean * scale_factor + + return noise_mean, noise_threshold + + +# ============================================================ +# 峰值检测 +# ============================================================ + +def detect_spectral_peaks( + freqs: np.ndarray, + periods: np.ndarray, + power: np.ndarray, + noise_mean: np.ndarray, + noise_threshold: np.ndarray, + threshold_ratio: float = PEAK_THRESHOLD_RATIO, + min_period_days: float = 2.0, +) -> pd.DataFrame: + """ + 在功率谱中检测显著峰值 + + 峰值判定标准: + 1. scipy.signal.find_peaks 局部峰值 + 2. 功率 > threshold_ratio * 背景噪声均值 + 3. 
周期 > min_period_days(过滤高频噪声) + + Parameters + ---------- + freqs, periods, power : np.ndarray + 频率、周期、功率数组 + noise_mean, noise_threshold : np.ndarray + 红噪声均值和置信阈值 + threshold_ratio : float + 峰值必须超过噪声均值的倍数 + min_period_days : float + 最小周期阈值(天) + + Returns + ------- + pd.DataFrame + 检测到的峰值信息表,包含 period_days, frequency, power, noise_level, snr 列 + """ + if len(power) == 0: + return pd.DataFrame(columns=["period_days", "frequency", "power", "noise_level", "snr"]) + + # 使用scipy检测局部峰值 + peak_indices, properties = find_peaks(power, height=0) + + results = [] + for idx in peak_indices: + period_d = periods[idx] + pwr = power[idx] + noise_lvl = noise_mean[idx] if idx < len(noise_mean) else 1.0 + snr = pwr / noise_lvl if noise_lvl > 0 else 0.0 + + # 筛选:周期足够长且功率显著超过噪声 + if period_d >= min_period_days and snr >= threshold_ratio: + results.append({ + "period_days": period_d, + "frequency": freqs[idx], + "power": pwr, + "noise_level": noise_lvl, + "snr": snr, + }) + + df_peaks = pd.DataFrame(results) + if not df_peaks.empty: + df_peaks = df_peaks.sort_values("snr", ascending=False).reset_index(drop=True) + + return df_peaks + + +# ============================================================ +# 带通滤波器 +# ============================================================ + +def bandpass_filter( + signal: np.ndarray, + sampling_period_days: float, + center_period_days: float, + bandwidth_ratio: float = 0.3, + order: int = 4, +) -> np.ndarray: + """ + 带通滤波提取特定周期分量 + + 对于长周期(归一化低频 < 0.01)自动使用FFT域滤波以避免 + Butterworth滤波器的数值不稳定问题。其余情况使用SOS格式的 + Butterworth带通滤波(sosfiltfilt),保证数值稳定性。 + + Parameters + ---------- + signal : np.ndarray + 输入信号 + sampling_period_days : float + 采样周期(天) + center_period_days : float + 目标中心周期(天) + bandwidth_ratio : float + 带宽比例:实际带宽 = center_period * (1 +/- bandwidth_ratio) + order : int + Butterworth滤波器阶数 + + Returns + ------- + np.ndarray + 滤波后的信号分量 + """ + fs = 1.0 / sampling_period_days # 采样频率 (cycles/day) + nyquist = fs / 2.0 + + # 带通频率范围 + low_period = center_period_days * (1 + bandwidth_ratio) + high_period = center_period_days * (1 - bandwidth_ratio) + + if high_period <= 0: + high_period = sampling_period_days * 2.1 # 保证物理意义 + + low_freq = 1.0 / low_period + high_freq = 1.0 / high_period + + # 归一化到Nyquist频率 + low_norm = low_freq / nyquist + high_norm = high_freq / nyquist + + # 确保归一化频率在有效范围 (0, 1) 内 + low_norm = np.clip(low_norm, 1e-6, 0.9999) + high_norm = np.clip(high_norm, low_norm + 1e-6, 0.9999) + + if low_norm >= high_norm: + return np.zeros_like(signal) + + # 对于长周期(归一化低频极小),Butterworth滤波器数值不稳定 + # 直接使用FFT域带通滤波作为可靠替代 + if low_norm < 0.01: + return _fft_bandpass_fallback(signal, sampling_period_days, + center_period_days, bandwidth_ratio) + + # 信号长度检查:sosfiltfilt 需要足够的样本点 + min_samples = 3 * (2 * order + 1) + if len(signal) < min_samples: + return np.zeros_like(signal) + + try: + # 使用SOS格式(二阶节)保证数值稳定性 + sos = butter(order, [low_norm, high_norm], btype="band", output="sos") + filtered = sosfiltfilt(sos, signal) + return filtered + except (ValueError, np.linalg.LinAlgError): + # 若滤波失败,回退到FFT方式 + return _fft_bandpass_fallback(signal, sampling_period_days, + center_period_days, bandwidth_ratio) + + +def _fft_bandpass_fallback( + signal: np.ndarray, + sampling_period_days: float, + center_period_days: float, + bandwidth_ratio: float, +) -> np.ndarray: + """FFT域带通滤波备选方案""" + n = len(signal) + freqs = fftfreq(n, d=sampling_period_days) + yf = fft(signal) + + center_freq = 1.0 / center_period_days + low_freq = center_freq / (1 + bandwidth_ratio) + high_freq = center_freq / (1 - 
bandwidth_ratio) if bandwidth_ratio < 1 else center_freq * 10 + + # 频域掩码:保留目标频段 + mask = (np.abs(freqs) >= low_freq) & (np.abs(freqs) <= high_freq) + yf_filtered = np.zeros_like(yf) + yf_filtered[mask] = yf[mask] + + return np.real(ifft(yf_filtered)) + + +# ============================================================ +# 可视化函数 +# ============================================================ + +def plot_power_spectrum( + periods: np.ndarray, + power: np.ndarray, + noise_mean: np.ndarray, + noise_threshold: np.ndarray, + peaks_df: pd.DataFrame, + title: str = "BTC Log Returns - FFT Power Spectrum", + save_path: Optional[Path] = None, +) -> plt.Figure: + """ + 功率谱图:包含峰值标注和红噪声置信带 + + Parameters + ---------- + periods, power : np.ndarray + 周期和功率数组 + noise_mean, noise_threshold : np.ndarray + 红噪声均值和置信阈值 + peaks_df : pd.DataFrame + 检测到的峰值表 + title : str + 图表标题 + save_path : Path, optional + 保存路径 + + Returns + ------- + fig : plt.Figure + """ + fig, ax = plt.subplots(figsize=(14, 7)) + + # 功率谱(对数坐标) + ax.loglog(periods, power, color="#2196F3", linewidth=0.6, alpha=0.8, label="Power Spectrum") + + # 红噪声基线 + ax.loglog(periods, noise_mean, color="#FF9800", linewidth=1.5, + linestyle="--", label="AR(1) Red Noise Mean") + + # 95%置信带 + ax.fill_between(periods, 0, noise_threshold, + alpha=0.15, color="#FF9800", label="95% Confidence Band") + ax.loglog(periods, noise_threshold, color="#FF5722", linewidth=1.0, + linestyle=":", alpha=0.7, label="95% Confidence Threshold") + + # 5x噪声阈值线 + noise_5x = noise_mean * PEAK_THRESHOLD_RATIO + ax.loglog(periods, noise_5x, color="#F44336", linewidth=1.0, + linestyle="-.", alpha=0.5, label=f"{PEAK_THRESHOLD_RATIO:.0f}x Noise Threshold") + + # 峰值标注 + if not peaks_df.empty: + for _, row in peaks_df.iterrows(): + period_d = row["period_days"] + pwr = row["power"] + snr = row["snr"] + + ax.plot(period_d, pwr, "rv", markersize=10, zorder=5) + + # 周期标签格式化 + if period_d >= 365: + label_str = f"{period_d / 365:.1f}y (SNR={snr:.1f})" + elif period_d >= 30: + label_str = f"{period_d:.0f}d (SNR={snr:.1f})" + else: + label_str = f"{period_d:.1f}d (SNR={snr:.1f})" + + ax.annotate( + label_str, + xy=(period_d, pwr), + xytext=(0, 15), + textcoords="offset points", + fontsize=8, + fontweight="bold", + color="#D32F2F", + ha="center", + arrowprops=dict(arrowstyle="-", color="#D32F2F", lw=0.5), + ) + + ax.set_xlabel("Period (days)", fontsize=12) + ax.set_ylabel("Power", fontsize=12) + ax.set_title(title, fontsize=14, fontweight="bold") + ax.legend(loc="upper right", fontsize=9) + ax.grid(True, which="both", alpha=0.3) + + # X轴标记关键周期 + key_periods = [7, 14, 30, 60, 90, 180, 365, 730, 1460] + ax.set_xticks(key_periods) + ax.set_xticklabels([str(p) for p in key_periods], fontsize=8) + ax.set_xlim(left=max(2, periods.min()), right=periods.max()) + + plt.tight_layout() + + if save_path: + fig.savefig(save_path, **SAVE_KW) + print(f" [保存] 功率谱图 -> {save_path}") + + return fig + + +def plot_multi_timeframe( + tf_results: Dict[str, dict], + save_path: Optional[Path] = None, +) -> plt.Figure: + """ + 多时间框架FFT频谱对比图 + + Parameters + ---------- + tf_results : dict + 键为时间框架标签,值为包含 periods/power/noise_mean 的dict + save_path : Path, optional + 保存路径 + + Returns + ------- + fig : plt.Figure + """ + n_tf = len(tf_results) + fig, axes = plt.subplots(n_tf, 1, figsize=(14, 5 * n_tf), sharex=False) + if n_tf == 1: + axes = [axes] + + colors = ["#2196F3", "#4CAF50", "#9C27B0"] + + for ax, (label, data), color in zip(axes, tf_results.items(), colors): + periods = data["periods"] + power = data["power"] + 
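+        # 注:noise_mean 为各时间框架对自身收益率序列单独拟合的 AR(1) 基线,互不共享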
noise_mean = data["noise_mean"] + + ax.loglog(periods, power, color=color, linewidth=0.6, alpha=0.8, + label=f"{label} Spectrum") + ax.loglog(periods, noise_mean, color="#FF9800", linewidth=1.2, + linestyle="--", alpha=0.7, label="AR(1) Noise") + + # 标注峰值 + peaks_df = data.get("peaks", pd.DataFrame()) + if not peaks_df.empty: + for _, row in peaks_df.head(5).iterrows(): + period_d = row["period_days"] + pwr = row["power"] + ax.plot(period_d, pwr, "rv", markersize=8, zorder=5) + if period_d >= 365: + lbl = f"{period_d / 365:.1f}y" + elif period_d >= 30: + lbl = f"{period_d:.0f}d" + else: + lbl = f"{period_d:.1f}d" + ax.annotate(lbl, xy=(period_d, pwr), xytext=(0, 10), + textcoords="offset points", fontsize=8, + color="#D32F2F", ha="center", fontweight="bold") + + ax.set_ylabel("Power", fontsize=11) + ax.set_title(f"BTC FFT Spectrum - {label}", fontsize=12, fontweight="bold") + ax.legend(loc="upper right", fontsize=9) + ax.grid(True, which="both", alpha=0.3) + + axes[-1].set_xlabel("Period (days)", fontsize=12) + plt.tight_layout() + + if save_path: + fig.savefig(save_path, **SAVE_KW) + print(f" [保存] 多时间框架对比图 -> {save_path}") + + return fig + + +def plot_bandpass_components( + dates: pd.DatetimeIndex, + original_signal: np.ndarray, + components: Dict[str, np.ndarray], + save_path: Optional[Path] = None, +) -> plt.Figure: + """ + 带通滤波分量子图 + + Parameters + ---------- + dates : pd.DatetimeIndex + 日期索引 + original_signal : np.ndarray + 原始信号(对数收益率) + components : dict + 键为周期标签(如 "7d"),值为滤波后的信号数组 + save_path : Path, optional + 保存路径 + + Returns + ------- + fig : plt.Figure + """ + n_comp = len(components) + 1 # +1 for original + fig, axes = plt.subplots(n_comp, 1, figsize=(14, 3 * n_comp), sharex=True) + + # 原始信号 + axes[0].plot(dates, original_signal, color="#455A64", linewidth=0.5, alpha=0.8) + axes[0].set_title("Original Log Returns", fontsize=11, fontweight="bold") + axes[0].set_ylabel("Log Return", fontsize=9) + axes[0].grid(True, alpha=0.3) + + # 各周期分量 + colors_bp = ["#E91E63", "#2196F3", "#4CAF50", "#FF9800", "#9C27B0"] + for i, ((label, comp), color) in enumerate(zip(components.items(), colors_bp)): + ax = axes[i + 1] + ax.plot(dates, comp, color=color, linewidth=0.8, alpha=0.9) + ax.set_title(f"Bandpass Component: {label} cycle", fontsize=11, fontweight="bold") + ax.set_ylabel("Amplitude", fontsize=9) + ax.grid(True, alpha=0.3) + + # 显示该分量的方差占比 + if np.var(original_signal) > 0: + var_ratio = np.var(comp) / np.var(original_signal) * 100 + ax.text(0.02, 0.92, f"Variance ratio: {var_ratio:.2f}%", + transform=ax.transAxes, fontsize=9, + bbox=dict(boxstyle="round,pad=0.3", facecolor=color, alpha=0.15)) + + axes[-1].set_xlabel("Date", fontsize=11) + plt.tight_layout() + + if save_path: + fig.savefig(save_path, **SAVE_KW) + print(f" [保存] 带通滤波分量图 -> {save_path}") + + return fig + + +# ============================================================ +# 单时间框架FFT分析流水线 +# ============================================================ + +def _analyze_single_timeframe( + df: pd.DataFrame, + sampling_period_days: float, + label: str = "1d", +) -> dict: + """ + 对单个时间框架执行完整FFT分析 + + Returns + ------- + dict + 包含 freqs, periods, power, noise_mean, noise_threshold, peaks, log_ret 等 + """ + prices = df["close"].dropna() + if len(prices) < 10: + print(f" [警告] {label} 数据量不足 ({len(prices)} 条),跳过分析") + return {} + + # 计算对数收益率 + log_ret = np.log(prices / prices.shift(1)).dropna().values + + # FFT频谱计算(Hann窗) + freqs, periods, power = compute_fft_spectrum( + log_ret, sampling_period_days, apply_window=True + ) + + if 
len(freqs) == 0: + return {} + + # AR(1)红噪声基线 + noise_mean, noise_threshold = ar1_red_noise_spectrum( + log_ret, freqs, sampling_period_days, confidence_percentile=95.0 + ) + + # 峰值检测 + # 对于低频数据(如周线),放宽最小周期约束 + min_period = max(2.0, sampling_period_days * 3) + peaks_df = detect_spectral_peaks( + freqs, periods, power, noise_mean, noise_threshold, + threshold_ratio=PEAK_THRESHOLD_RATIO, + min_period_days=min_period, + ) + + return { + "freqs": freqs, + "periods": periods, + "power": power, + "noise_mean": noise_mean, + "noise_threshold": noise_threshold, + "peaks": peaks_df, + "log_ret": log_ret, + "label": label, + } + + +# ============================================================ +# 主入口函数 +# ============================================================ + +def run_fft_analysis( + df: pd.DataFrame, + output_dir: str, +) -> Dict: + """ + BTC价格FFT频谱分析主入口 + + 执行以下分析并保存可视化结果: + 1. 日线对数收益率FFT频谱分析(Hann窗 + AR1红噪声基线) + 2. 功率谱峰值检测(5x噪声阈值) + 3. 多时间框架(4h/1d/1w)频谱对比 + 4. 带通滤波提取关键周期分量(7d/30d/90d/365d/1400d) + + Parameters + ---------- + df : pd.DataFrame + 日线K线数据,DatetimeIndex,需包含 close 列 + output_dir : str + 图表输出目录路径 + + Returns + ------- + dict + 分析结果汇总: + - daily_peaks: 日线显著周期峰值表 + - multi_tf_peaks: 各时间框架峰值字典 + - bandpass_variance_ratios: 各带通分量方差占比 + - ar1_rho: AR(1)自相关系数 + """ + output_path = Path(output_dir) + output_path.mkdir(parents=True, exist_ok=True) + + print("=" * 70) + print("BTC FFT 频谱分析") + print("=" * 70) + + # ---------------------------------------------------------- + # 第一部分:日线对数收益率FFT分析 + # ---------------------------------------------------------- + print("\n[1/4] 日线对数收益率FFT分析 (Hann窗)") + daily_result = _analyze_single_timeframe(df, sampling_period_days=1.0, label="1d") + + if not daily_result: + print(" [错误] 日线分析失败,数据不足") + return {} + + log_ret = daily_result["log_ret"] + periods = daily_result["periods"] + power = daily_result["power"] + noise_mean = daily_result["noise_mean"] + noise_threshold = daily_result["noise_threshold"] + peaks_df = daily_result["peaks"] + + # 打印AR(1)参数 + signal_centered = log_ret - np.mean(log_ret) + autocov_0 = np.sum(signal_centered ** 2) / len(log_ret) + autocov_1 = np.sum(signal_centered[:-1] * signal_centered[1:]) / len(log_ret) + ar1_rho = autocov_1 / autocov_0 if autocov_0 > 0 else 0.0 + print(f" AR(1) 自相关系数 rho = {ar1_rho:.4f}") + print(f" 数据长度: {len(log_ret)} 个交易日") + print(f" 频率分辨率: {1.0 / len(log_ret):.6f} cycles/day (最大可分辨周期: {len(log_ret):.0f} 天)") + + # 打印显著峰值 + if not peaks_df.empty: + print(f"\n 检测到 {len(peaks_df)} 个显著周期峰值 (SNR > {PEAK_THRESHOLD_RATIO:.0f}x):") + print(" " + "-" * 60) + print(f" {'周期(天)':>10} | {'周期':>12} | {'SNR':>8} | {'功率':>12}") + print(" " + "-" * 60) + for _, row in peaks_df.iterrows(): + pd_days = row["period_days"] + snr = row["snr"] + pwr = row["power"] + if pd_days >= 365: + human_period = f"{pd_days / 365:.1f} 年" + elif pd_days >= 30: + human_period = f"{pd_days / 30:.1f} 月" + else: + human_period = f"{pd_days:.1f} 天" + print(f" {pd_days:>10.1f} | {human_period:>12} | {snr:>8.2f} | {pwr:>12.6e}") + print(" " + "-" * 60) + else: + print(" 未检测到显著超过红噪声基线的周期峰值") + + # 功率谱图 + fig_spectrum = plot_power_spectrum( + periods, power, noise_mean, noise_threshold, peaks_df, + title="BTC Daily Log Returns - FFT Power Spectrum (Hann Window)", + save_path=output_path / "fft_power_spectrum.png", + ) + plt.close(fig_spectrum) + + # ---------------------------------------------------------- + # 第二部分:多时间框架FFT对比 + # ---------------------------------------------------------- + print("\n[2/4] 多时间框架FFT对比 (4h / 1d / 1w)") + 
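# 说明:MULTI_TF_INTERVALS 为模块级常量(定义于文件顶部,未随本段展示)。
+    # 此处假设其结构为「时间框架标签 -> 以天为单位的采样周期」,例如:
+    #   MULTI_TF_INTERVALS = {"4h": 4 / 24, "1d": 1.0, "1w": 7.0}
+    # 只要值以天为单位,各框架的周期轴换算即可保持一致。
+    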
tf_results = {} + + for interval, sp_days in MULTI_TF_INTERVALS.items(): + try: + if interval == "1d": + tf_df = df + else: + tf_df = load_klines(interval) + result = _analyze_single_timeframe(tf_df, sp_days, label=interval) + if result: + tf_results[interval] = result + n_peaks = len(result["peaks"]) if not result["peaks"].empty else 0 + print(f" {interval}: {len(result['log_ret'])} 样本, {n_peaks} 个显著峰值") + except FileNotFoundError: + print(f" [警告] {interval} 数据文件未找到,跳过") + except Exception as e: + print(f" [警告] {interval} 分析失败: {e}") + + # 多时间框架对比图 + if len(tf_results) > 1: + fig_mtf = plot_multi_timeframe( + tf_results, + save_path=output_path / "fft_multi_timeframe.png", + ) + plt.close(fig_mtf) + else: + print(" [警告] 可用时间框架不足,跳过对比图") + + # ---------------------------------------------------------- + # 第三部分:带通滤波提取周期分量 + # ---------------------------------------------------------- + print(f"\n[3/4] 带通滤波提取周期分量: {BANDPASS_PERIODS_DAYS}") + prices = df["close"].dropna() + dates = prices.index[1:] # 与log_ret对齐(差分损失1个点) + # 确保dates和log_ret长度一致 + if len(dates) > len(log_ret): + dates = dates[:len(log_ret)] + elif len(dates) < len(log_ret): + log_ret = log_ret[:len(dates)] + + components = {} + variance_ratios = {} + original_var = np.var(log_ret) + + for period_days in BANDPASS_PERIODS_DAYS: + # 检查Nyquist条件:目标周期必须大于2倍采样周期 + if period_days < 2.0 * 1.0: + print(f" [跳过] {period_days}d 周期低于Nyquist极限") + continue + # 检查信号长度是否覆盖至少2个完整周期 + if len(log_ret) < period_days * 2: + print(f" [跳过] {period_days}d 周期:数据长度不足 ({len(log_ret)} < {period_days * 2:.0f})") + continue + + filtered = bandpass_filter( + log_ret, + sampling_period_days=1.0, + center_period_days=float(period_days), + bandwidth_ratio=0.3, + order=4, + ) + + label = f"{period_days}d" + components[label] = filtered + var_ratio = np.var(filtered) / original_var * 100 if original_var > 0 else 0 + variance_ratios[label] = var_ratio + print(f" {label:>6} 分量方差占比: {var_ratio:.3f}%") + + # 带通分量图 + if components: + fig_bp = plot_bandpass_components( + dates, log_ret, components, + save_path=output_path / "fft_bandpass_components.png", + ) + plt.close(fig_bp) + else: + print(" [警告] 无有效带通分量可绘制") + + # ---------------------------------------------------------- + # 第四部分:汇总输出 + # ---------------------------------------------------------- + print("\n[4/4] 分析汇总") + + # 收集多时间框架峰值 + multi_tf_peaks = {} + for tf_label, tf_data in tf_results.items(): + if not tf_data["peaks"].empty: + multi_tf_peaks[tf_label] = tf_data["peaks"] + + # 跨时间框架一致性检验 + print("\n 跨时间框架周期一致性检查:") + if len(multi_tf_peaks) >= 2: + # 收集所有检测到的周期 + all_detected_periods = [] + for tf_label, p_df in multi_tf_peaks.items(): + for _, row in p_df.iterrows(): + all_detected_periods.append({ + "timeframe": tf_label, + "period_days": row["period_days"], + "snr": row["snr"], + }) + + if all_detected_periods: + all_periods_df = pd.DataFrame(all_detected_periods) + # 按周期分组(允许20%误差范围),寻找多时间框架确认的周期 + confirmed = [] + used = set() + for i, row_i in all_periods_df.iterrows(): + if i in used: + continue + p_i = row_i["period_days"] + group = [row_i] + used.add(i) + for j, row_j in all_periods_df.iterrows(): + if j in used: + continue + if row_j["timeframe"] != row_i["timeframe"]: + if abs(row_j["period_days"] - p_i) / p_i < 0.2: + group.append(row_j) + used.add(j) + if len(group) > 1: + tfs = [g["timeframe"] for g in group] + avg_period = np.mean([g["period_days"] for g in group]) + avg_snr = np.mean([g["snr"] for g in group]) + confirmed.append({ + "period_days": avg_period, + "confirmed_by": tfs, + 
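# 周期与SNR均取组内平均,作为该周期簇的代表值
+                            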
"avg_snr": avg_snr, + }) + + if confirmed: + for c in confirmed: + tfs_str = " & ".join(c["confirmed_by"]) + print(f" {c['period_days']:.1f}d 周期被 {tfs_str} 共同确认 (平均SNR={c['avg_snr']:.2f})") + else: + print(" 未发现跨时间框架一致确认的周期") + else: + print(" 各时间框架均未检测到显著峰值") + else: + print(" 可用时间框架不足,无法进行一致性检查") + + print("\n" + "=" * 70) + print("FFT分析完成") + print(f"图表已保存至: {output_path.resolve()}") + print("=" * 70) + + # ---------------------------------------------------------- + # 返回结果字典 + # ---------------------------------------------------------- + results = { + "daily_peaks": peaks_df, + "multi_tf_peaks": multi_tf_peaks, + "bandpass_variance_ratios": variance_ratios, + "bandpass_components": components, + "ar1_rho": ar1_rho, + "daily_spectrum": { + "freqs": daily_result["freqs"], + "periods": daily_result["periods"], + "power": daily_result["power"], + "noise_mean": daily_result["noise_mean"], + "noise_threshold": daily_result["noise_threshold"], + }, + "multi_tf_results": tf_results, + } + + return results + + +# ============================================================ +# 独立运行入口 +# ============================================================ + +if __name__ == "__main__": + from src.data_loader import load_daily + + print("加载BTC日线数据...") + df = load_daily() + print(f"数据范围: {df.index.min()} ~ {df.index.max()}, 共 {len(df)} 条") + + results = run_fft_analysis(df, output_dir="output/fft") diff --git a/src/fractal_analysis.py b/src/fractal_analysis.py new file mode 100644 index 0000000..5fc4dc4 --- /dev/null +++ b/src/fractal_analysis.py @@ -0,0 +1,645 @@ +""" +分形维数与自相似性分析模块 +======================== +通过盒计数法(Box-Counting)计算BTC价格序列的分形维数, +并通过蒙特卡洛模拟与随机游走对比,检验BTC价格是否具有显著不同的分形特征。 + +核心功能: +- 盒计数法(Box-Counting Dimension)计算分形维数 +- 蒙特卡洛模拟对比(Z检验) +- 多尺度自相似性分析 +""" + +import matplotlib +matplotlib.use('Agg') + +import numpy as np +import pandas as pd +import matplotlib.pyplot as plt +from pathlib import Path +from typing import Tuple, Dict, List, Optional +from scipy import stats + +import sys +sys.path.insert(0, str(Path(__file__).parent.parent)) +from src.data_loader import load_klines +from src.preprocessing import log_returns + + +# ============================================================ +# 盒计数法(Box-Counting Dimension) +# ============================================================ +def box_counting_dimension(prices: np.ndarray, + num_scales: int = 30, + min_boxes: int = 5, + max_boxes: int = None) -> Tuple[float, np.ndarray, np.ndarray]: + """ + 盒计数法计算价格序列的分形维数 + + 方法: + 1. 将价格序列归一化到 [0,1] x [0,1] 空间 + 2. 在不同尺度(box size)下计数覆盖曲线所需的盒子数 + 3. 
通过 log(count) vs log(1/scale) 的线性回归得到分形维数 + + Parameters + ---------- + prices : np.ndarray + 价格序列 + num_scales : int + 尺度数量 + min_boxes : int + 最小划分数量 + max_boxes : int, optional + 最大划分数量,默认为序列长度的1/4 + + Returns + ------- + D : float + 盒计数分形维数 + log_inv_scales : np.ndarray + log(1/scale) 数组 + log_counts : np.ndarray + log(count) 数组 + """ + n = len(prices) + if max_boxes is None: + max_boxes = n // 4 + + # 步骤1:归一化到 [0,1] x [0,1] + # x轴:时间归一化 + x = np.linspace(0, 1, n) + # y轴:价格归一化 + y = (prices - prices.min()) / (prices.max() - prices.min()) + + # 步骤2:在不同尺度下计数 + # 生成对数均匀分布的划分数量 + box_counts_list = np.unique( + np.logspace(np.log10(min_boxes), np.log10(max_boxes), num=num_scales).astype(int) + ) + + log_inv_scales = [] + log_counts = [] + + for num_boxes_per_side in box_counts_list: + if num_boxes_per_side < 2: + continue + + # 盒子大小(在归一化空间中) + box_size = 1.0 / num_boxes_per_side + + # 计算每个数据点所在的盒子编号 + # x方向:时间划分 + x_box = np.floor(x / box_size).astype(int) + x_box = np.clip(x_box, 0, num_boxes_per_side - 1) + + # y方向:价格划分 + y_box = np.floor(y / box_size).astype(int) + y_box = np.clip(y_box, 0, num_boxes_per_side - 1) + + # 还需要考虑相邻点之间的连线经过的盒子 + occupied = set() + for i in range(n): + occupied.add((x_box[i], y_box[i])) + + # 对于相邻点,如果它们不在同一个盒子中,需要插值连接 + for i in range(n - 1): + if x_box[i] == x_box[i + 1] and y_box[i] == y_box[i + 1]: + continue + + # 线性插值找出经过的所有盒子 + steps = max(abs(x_box[i + 1] - x_box[i]), abs(y_box[i + 1] - y_box[i])) + 1 + if steps <= 1: + continue + + for t in np.linspace(0, 1, steps + 1): + xi = x[i] + t * (x[i + 1] - x[i]) + yi = y[i] + t * (y[i + 1] - y[i]) + bx = int(np.clip(np.floor(xi / box_size), 0, num_boxes_per_side - 1)) + by = int(np.clip(np.floor(yi / box_size), 0, num_boxes_per_side - 1)) + occupied.add((bx, by)) + + count = len(occupied) + if count > 0: + log_inv_scales.append(np.log(1.0 / box_size)) + log_counts.append(np.log(count)) + + log_inv_scales = np.array(log_inv_scales) + log_counts = np.array(log_counts) + + # 步骤3:线性回归 + if len(log_inv_scales) < 3: + return 1.5, log_inv_scales, log_counts + + coeffs = np.polyfit(log_inv_scales, log_counts, 1) + D = coeffs[0] # 斜率即分形维数 + + return D, log_inv_scales, log_counts + + +# ============================================================ +# 蒙特卡洛模拟对比 +# ============================================================ +def generate_random_walk(n: int, seed: Optional[int] = None) -> np.ndarray: + """ + 生成一条与BTC价格序列等长的随机游走 + + Parameters + ---------- + n : int + 序列长度 + seed : int, optional + 随机种子 + + Returns + ------- + np.ndarray + 随机游走价格序列 + """ + if seed is not None: + rng = np.random.RandomState(seed) + else: + rng = np.random.RandomState() + + # 生成标准正态分布的增量 + increments = rng.randn(n - 1) + # 累积求和得到随机游走 + walk = np.cumsum(increments) + # 加上一个正的起始值避免负数 + walk = walk - walk.min() + 1.0 + return walk + + +def monte_carlo_fractal_test(prices: np.ndarray, n_simulations: int = 100, + seed: int = 42) -> Dict: + """ + 蒙特卡洛模拟检验BTC分形维数是否显著偏离随机游走 + + 方法: + 1. 生成n_simulations条随机游走 + 2. 计算每条的分形维数 + 3. 
与BTC分形维数做Z检验 + + Parameters + ---------- + prices : np.ndarray + BTC价格序列 + n_simulations : int + 模拟次数(默认100) + seed : int + 随机种子(可重复性) + + Returns + ------- + dict + 包含BTC分形维数、随机游走分形维数分布、Z检验结果 + """ + n = len(prices) + + # 计算BTC分形维数 + print(f" 计算BTC分形维数...") + d_btc, _, _ = box_counting_dimension(prices) + print(f" BTC分形维数: {d_btc:.4f}") + + # 蒙特卡洛模拟 + print(f" 运行{n_simulations}次随机游走模拟...") + d_random = [] + for i in range(n_simulations): + if (i + 1) % 20 == 0: + print(f" 进度: {i + 1}/{n_simulations}") + rw = generate_random_walk(n, seed=seed + i) + d_rw, _, _ = box_counting_dimension(rw) + d_random.append(d_rw) + + d_random = np.array(d_random) + + # Z检验:BTC分形维数 vs 随机游走分形维数分布 + mean_rw = np.mean(d_random) + std_rw = np.std(d_random, ddof=1) + + if std_rw > 0: + z_score = (d_btc - mean_rw) / std_rw + # 双侧p值 + p_value = 2 * (1 - stats.norm.cdf(abs(z_score))) + else: + z_score = np.nan + p_value = np.nan + + result = { + 'BTC分形维数': d_btc, + '随机游走均值': mean_rw, + '随机游走标准差': std_rw, + '随机游走范围': (d_random.min(), d_random.max()), + 'Z统计量': z_score, + 'p值': p_value, + '显著性(α=0.05)': p_value < 0.05 if not np.isnan(p_value) else False, + '随机游走分形维数': d_random, + } + + return result + + +# ============================================================ +# 多尺度自相似性分析 +# ============================================================ +def multi_scale_self_similarity(prices: np.ndarray, + scales: List[int] = None) -> Dict: + """ + 多尺度自相似性分析:在不同聚合级别下比较统计特征 + + 方法: + 对价格序列按不同尺度聚合后,比较收益率分布的统计矩 + 如果序列具有自相似性,其缩放后的统计特征应保持一致 + + Parameters + ---------- + prices : np.ndarray + 价格序列 + scales : list of int + 聚合尺度,默认 [1, 2, 5, 10, 20, 50] + + Returns + ------- + dict + 各尺度下的统计特征 + """ + if scales is None: + scales = [1, 2, 5, 10, 20, 50] + + results = {} + + for scale in scales: + # 对价格序列按scale聚合(每scale个点取一个) + aggregated = prices[::scale] + if len(aggregated) < 30: + continue + + # 计算对数收益率 + returns = np.diff(np.log(aggregated)) + if len(returns) < 10: + continue + + results[scale] = { + '样本量': len(returns), + '均值': np.mean(returns), + '标准差': np.std(returns), + '偏度': float(stats.skew(returns)), + '峰度': float(stats.kurtosis(returns)), + # 标准差的缩放关系:如果H是Hurst指数,std(scale) ∝ scale^H + '标准差(原始)': np.std(returns), + } + + # 计算缩放指数:log(std) vs log(scale) 的斜率 + valid_scales = sorted(results.keys()) + if len(valid_scales) >= 3: + log_scales = np.log(valid_scales) + log_stds = np.log([results[s]['标准差'] for s in valid_scales]) + scaling_exponent = np.polyfit(log_scales, log_stds, 1)[0] + scaling_result = { + '缩放指数(H估计)': scaling_exponent, + '各尺度统计': results, + } + else: + scaling_result = { + '缩放指数(H估计)': np.nan, + '各尺度统计': results, + } + + return scaling_result + + +# ============================================================ +# 可视化函数 +# ============================================================ +def plot_box_counting(log_inv_scales: np.ndarray, log_counts: np.ndarray, D: float, + output_dir: Path, filename: str = "fractal_box_counting.png"): + """绘制盒计数法的log-log图""" + fig, ax = plt.subplots(figsize=(10, 7)) + + # 散点 + ax.scatter(log_inv_scales, log_counts, color='steelblue', s=40, zorder=3, + label='盒计数数据点') + + # 拟合线 + coeffs = np.polyfit(log_inv_scales, log_counts, 1) + fit_line = np.polyval(coeffs, log_inv_scales) + ax.plot(log_inv_scales, fit_line, 'r-', linewidth=2, + label=f'拟合线 (D = {D:.4f})') + + # 参考线:D=1.5(纯随机游走理论值) + ref_line = 1.5 * log_inv_scales + (log_counts[0] - 1.5 * log_inv_scales[0]) + ax.plot(log_inv_scales, ref_line, 'k--', alpha=0.5, linewidth=1, + label='D=1.5 (随机游走理论值)') + + ax.set_xlabel('log(1/ε) - 
尺度倒数的对数', fontsize=12) + ax.set_ylabel('log(N(ε)) - 盒子数的对数', fontsize=12) + ax.set_title(f'BTC 盒计数法分析 (分形维数 D = {D:.4f})', fontsize=13) + ax.legend(fontsize=11) + ax.grid(True, alpha=0.3) + + fig.tight_layout() + filepath = output_dir / filename + fig.savefig(filepath, dpi=150, bbox_inches='tight') + plt.close(fig) + print(f" 已保存: {filepath}") + + +def plot_monte_carlo(mc_results: Dict, output_dir: Path, + filename: str = "fractal_monte_carlo.png"): + """绘制蒙特卡洛模拟结果:随机游走分形维数直方图 vs BTC""" + fig, ax = plt.subplots(figsize=(10, 7)) + + d_random = mc_results['随机游走分形维数'] + d_btc = mc_results['BTC分形维数'] + + # 直方图 + ax.hist(d_random, bins=20, density=True, alpha=0.7, color='steelblue', + edgecolor='white', label=f'随机游走 (n={len(d_random)})') + + # BTC分形维数的竖线 + ax.axvline(x=d_btc, color='red', linewidth=2.5, linestyle='-', + label=f'BTC (D={d_btc:.4f})') + + # 随机游走均值的竖线 + ax.axvline(x=mc_results['随机游走均值'], color='blue', linewidth=1.5, linestyle='--', + label=f'随机游走均值 (D={mc_results["随机游走均值"]:.4f})') + + # 添加正态分布拟合曲线 + x_range = np.linspace(d_random.min() - 0.05, d_random.max() + 0.05, 200) + pdf = stats.norm.pdf(x_range, mc_results['随机游走均值'], mc_results['随机游走标准差']) + ax.plot(x_range, pdf, 'b-', alpha=0.5, linewidth=1) + + # 标注统计信息 + info_text = ( + f"Z统计量: {mc_results['Z统计量']:.2f}\n" + f"p值: {mc_results['p值']:.4f}\n" + f"显著性(α=0.05): {'是' if mc_results['显著性(α=0.05)'] else '否'}" + ) + ax.text(0.02, 0.95, info_text, transform=ax.transAxes, fontsize=11, + verticalalignment='top', bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.8)) + + ax.set_xlabel('分形维数 D', fontsize=12) + ax.set_ylabel('概率密度', fontsize=12) + ax.set_title('BTC分形维数 vs 随机游走蒙特卡洛模拟', fontsize=13) + ax.legend(fontsize=11, loc='upper right') + ax.grid(True, alpha=0.3) + + fig.tight_layout() + filepath = output_dir / filename + fig.savefig(filepath, dpi=150, bbox_inches='tight') + plt.close(fig) + print(f" 已保存: {filepath}") + + +def plot_self_similarity(scaling_result: Dict, output_dir: Path, + filename: str = "fractal_self_similarity.png"): + """绘制多尺度自相似性分析图""" + scale_stats = scaling_result['各尺度统计'] + if not scale_stats: + print(" 没有可绘制的自相似性结果") + return + + scales = sorted(scale_stats.keys()) + stds = [scale_stats[s]['标准差'] for s in scales] + skews = [scale_stats[s]['偏度'] for s in scales] + kurts = [scale_stats[s]['峰度'] for s in scales] + + fig, axes = plt.subplots(1, 3, figsize=(18, 6)) + + # 图1:log(std) vs log(scale) — 缩放关系 + ax1 = axes[0] + log_scales = np.log(scales) + log_stds = np.log(stds) + + ax1.scatter(log_scales, log_stds, color='steelblue', s=60, zorder=3) + + if len(log_scales) >= 3: + coeffs = np.polyfit(log_scales, log_stds, 1) + fit_line = np.polyval(coeffs, log_scales) + ax1.plot(log_scales, fit_line, 'r-', linewidth=2, + label=f'拟合斜率 H≈{coeffs[0]:.4f}') + + # 参考线 H=0.5 + ref_line = 0.5 * log_scales + (log_stds[0] - 0.5 * log_scales[0]) + ax1.plot(log_scales, ref_line, 'k--', alpha=0.5, label='H=0.5 参考线') + + ax1.set_xlabel('log(聚合尺度)', fontsize=11) + ax1.set_ylabel('log(标准差)', fontsize=11) + ax1.set_title('缩放关系 (标准差 vs 尺度)', fontsize=12) + ax1.legend(fontsize=10) + ax1.grid(True, alpha=0.3) + + # 图2:偏度随尺度变化 + ax2 = axes[1] + ax2.bar(range(len(scales)), skews, color='coral', alpha=0.8) + ax2.set_xticks(range(len(scales))) + ax2.set_xticklabels([str(s) for s in scales]) + ax2.axhline(y=0, color='black', linestyle='--', alpha=0.5) + ax2.set_xlabel('聚合尺度', fontsize=11) + ax2.set_ylabel('偏度', fontsize=11) + ax2.set_title('偏度随尺度变化', fontsize=12) + ax2.grid(True, alpha=0.3, axis='y') + + # 图3:峰度随尺度变化 + ax3 = axes[2] 
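+    # 注:若收益率为 i.i.d.,n 期聚合后的超额峰度约按 1/n 衰减(聚合正态性)。
+    # 如需直观对照,可叠加一条假设性的 1/n 参考线(以最小尺度峰度为基准):
+    #   ref = [kurts[0] * scales[0] / s for s in scales]
+    #   ax3.plot(range(len(scales)), ref, 'k:', label='1/n 衰减参考')
+    # 峰度衰减明显慢于该参考线时,与波动聚集/长记忆特征一致。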
+ ax3.bar(range(len(scales)), kurts, color='seagreen', alpha=0.8) + ax3.set_xticks(range(len(scales))) + ax3.set_xticklabels([str(s) for s in scales]) + ax3.axhline(y=0, color='black', linestyle='--', alpha=0.5, label='正态分布峰度=0') + ax3.set_xlabel('聚合尺度', fontsize=11) + ax3.set_ylabel('超额峰度', fontsize=11) + ax3.set_title('峰度随尺度变化', fontsize=12) + ax3.legend(fontsize=10) + ax3.grid(True, alpha=0.3, axis='y') + + fig.suptitle(f'BTC 多尺度自相似性分析 (缩放指数 H≈{scaling_result["缩放指数(H估计)"]:.4f})', + fontsize=14, y=1.02) + fig.tight_layout() + filepath = output_dir / filename + fig.savefig(filepath, dpi=150, bbox_inches='tight') + plt.close(fig) + print(f" 已保存: {filepath}") + + +# ============================================================ +# 主入口函数 +# ============================================================ +def run_fractal_analysis(df: pd.DataFrame, output_dir: str = "output/fractal") -> Dict: + """ + 分形维数与自相似性综合分析主入口 + + Parameters + ---------- + df : pd.DataFrame + K线数据(需包含 'close' 列和DatetimeIndex索引) + output_dir : str + 图表输出目录 + + Returns + ------- + dict + 包含所有分析结果的字典 + """ + output_dir = Path(output_dir) + output_dir.mkdir(parents=True, exist_ok=True) + + results = {} + + print("=" * 70) + print("分形维数与自相似性分析") + print("=" * 70) + + # ---------------------------------------------------------- + # 1. 准备数据 + # ---------------------------------------------------------- + prices = df['close'].dropna().values + + print(f"\n数据概况:") + print(f" 时间范围: {df.index.min()} ~ {df.index.max()}") + print(f" 价格序列长度: {len(prices)}") + print(f" 价格范围: {prices.min():.2f} ~ {prices.max():.2f}") + + # ---------------------------------------------------------- + # 2. 盒计数法分形维数 + # ---------------------------------------------------------- + print("\n" + "-" * 50) + print("【1】盒计数法 (Box-Counting Dimension)") + print("-" * 50) + + D, log_inv_scales, log_counts = box_counting_dimension(prices) + results['盒计数分形维数'] = D + + print(f" BTC分形维数: D = {D:.4f}") + print(f" 理论参考值:") + print(f" D = 1.0: 光滑曲线(完全可预测)") + print(f" D = 1.5: 纯随机游走(布朗运动)") + print(f" D = 2.0: 完全填充平面(极端不规则)") + + if D < 1.3: + interpretation = "序列非常光滑,可能存在强趋势特征" + elif D < 1.45: + interpretation = "序列较为光滑,具有一定趋势持续性" + elif D < 1.55: + interpretation = "序列接近随机游走特征" + elif D < 1.7: + interpretation = "序列较为粗糙,具有一定均值回归倾向" + else: + interpretation = "序列非常不规则,高度波动" + + print(f" BTC解读: {interpretation}") + results['维数解读'] = interpretation + + # 分形维数与Hurst指数的关系: D = 2 - H + h_from_d = 2.0 - D + print(f"\n 由分形维数推算Hurst指数 (D = 2 - H):") + print(f" H ≈ {h_from_d:.4f}") + results['Hurst(从D推算)'] = h_from_d + + # 绘制盒计数log-log图 + plot_box_counting(log_inv_scales, log_counts, D, output_dir) + + # ---------------------------------------------------------- + # 3. 
蒙特卡洛模拟对比 + # ---------------------------------------------------------- + print("\n" + "-" * 50) + print("【2】蒙特卡洛模拟对比 (100次随机游走)") + print("-" * 50) + + mc_results = monte_carlo_fractal_test(prices, n_simulations=100, seed=42) + results['蒙特卡洛检验'] = { + k: v for k, v in mc_results.items() if k != '随机游走分形维数' + } + + print(f"\n 结果汇总:") + print(f" BTC分形维数: D = {mc_results['BTC分形维数']:.4f}") + print(f" 随机游走均值: D = {mc_results['随机游走均值']:.4f} ± {mc_results['随机游走标准差']:.4f}") + print(f" 随机游走范围: [{mc_results['随机游走范围'][0]:.4f}, {mc_results['随机游走范围'][1]:.4f}]") + print(f" Z统计量: {mc_results['Z统计量']:.4f}") + print(f" p值: {mc_results['p值']:.6f}") + print(f" 显著性(α=0.05): {'是 - BTC与随机游走显著不同' if mc_results['显著性(α=0.05)'] else '否 - 无法拒绝随机游走假设'}") + + # 绘制蒙特卡洛结果图 + plot_monte_carlo(mc_results, output_dir) + + # ---------------------------------------------------------- + # 4. 多尺度自相似性分析 + # ---------------------------------------------------------- + print("\n" + "-" * 50) + print("【3】多尺度自相似性分析") + print("-" * 50) + + scaling_result = multi_scale_self_similarity(prices, scales=[1, 2, 5, 10, 20, 50]) + results['多尺度自相似性'] = { + k: v for k, v in scaling_result.items() if k != '各尺度统计' + } + results['多尺度自相似性']['缩放指数(H估计)'] = scaling_result['缩放指数(H估计)'] + + print(f"\n 缩放指数 (波动率缩放关系 H估计): {scaling_result['缩放指数(H估计)']:.4f}") + print(f" 各尺度统计特征:") + for scale, stat in sorted(scaling_result['各尺度统计'].items()): + print(f" 尺度={scale:3d}: 样本={stat['样本量']:5d}, " + f"std={stat['标准差']:.6f}, " + f"偏度={stat['偏度']:.4f}, " + f"峰度={stat['峰度']:.4f}") + + # 自相似性判定 + scale_stats = scaling_result['各尺度统计'] + if scale_stats: + valid_scales = sorted(scale_stats.keys()) + if len(valid_scales) >= 2: + kurts = [scale_stats[s]['峰度'] for s in valid_scales] + # 如果峰度随尺度增大而趋向0(正态),说明大尺度下趋向正态 + if all(k > 1.0 for k in kurts): + print("\n 自相似性判定: 所有尺度均呈现超额峰度(尖峰厚尾),") + print(" 表明BTC收益率分布在各尺度下均偏离正态分布,具有分形特征") + elif kurts[-1] < kurts[0] * 0.5: + print("\n 自相似性判定: 峰度随聚合尺度增大而显著下降,") + print(" 表明大尺度下收益率趋于正态,自相似性有限") + else: + print("\n 自相似性判定: 峰度随尺度变化不大,具有一定自相似性") + + # 绘制自相似性图 + plot_self_similarity(scaling_result, output_dir) + + # ---------------------------------------------------------- + # 5. 
总结 + # ---------------------------------------------------------- + print("\n" + "=" * 70) + print("分析总结") + print("=" * 70) + print(f" 盒计数分形维数: D = {D:.4f}") + print(f" 由D推算Hurst指数: H = {h_from_d:.4f}") + print(f" 维数解读: {interpretation}") + print(f"\n 蒙特卡洛检验:") + if mc_results['显著性(α=0.05)']: + print(f" BTC价格序列的分形维数与纯随机游走存在显著差异 (p={mc_results['p值']:.6f})") + if D < mc_results['随机游走均值']: + print(f" BTC的D({D:.4f}) < 随机游走的D({mc_results['随机游走均值']:.4f}),") + print(" 表明BTC价格比纯随机游走更「光滑」,即存在趋势持续性") + else: + print(f" BTC的D({D:.4f}) > 随机游走的D({mc_results['随机游走均值']:.4f}),") + print(" 表明BTC价格比纯随机游走更「粗糙」,即存在均值回归特征") + else: + print(f" 无法在5%显著性水平下拒绝BTC为随机游走的假设 (p={mc_results['p值']:.6f})") + + print(f"\n 波动率缩放指数: H ≈ {scaling_result['缩放指数(H估计)']:.4f}") + print(f" H > 0.5: 波动率超线性增长 → 趋势持续性") + print(f" H < 0.5: 波动率亚线性增长 → 均值回归性") + print(f" H ≈ 0.5: 波动率线性增长 → 随机游走") + + print(f"\n 图表已保存至: {output_dir.resolve()}") + print("=" * 70) + + return results + + +# ============================================================ +# 独立运行入口 +# ============================================================ +if __name__ == "__main__": + from data_loader import load_daily + + print("加载BTC日线数据...") + df = load_daily() + print(f"数据加载完成: {len(df)} 条记录") + + results = run_fractal_analysis(df, output_dir="output/fractal") diff --git a/src/halving_analysis.py b/src/halving_analysis.py new file mode 100644 index 0000000..c6be485 --- /dev/null +++ b/src/halving_analysis.py @@ -0,0 +1,546 @@ +"""BTC 减半周期分析模块 - 减半前后价格行为、波动率、累计收益对比""" + +import matplotlib +matplotlib.use('Agg') + +import numpy as np +import pandas as pd +import matplotlib.pyplot as plt +import matplotlib.ticker as mticker +from pathlib import Path +from scipy import stats + +# 中文显示配置 +plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei', 'DejaVu Sans'] +plt.rcParams['axes.unicode_minus'] = False + +# BTC 减半日期(数据范围 2017-2026 内的两次减半) +HALVING_DATES = [ + pd.Timestamp('2020-05-11'), + pd.Timestamp('2024-04-20'), +] +HALVING_LABELS = ['第三次减半 (2020-05-11)', '第四次减半 (2024-04-20)'] + +# 分析窗口:减半前后各 500 天 +WINDOW_DAYS = 500 + + +def _extract_halving_window(df: pd.DataFrame, halving_date: pd.Timestamp, + window: int = WINDOW_DAYS): + """ + 提取减半日期前后的数据窗口。 + + Parameters + ---------- + df : pd.DataFrame + 日线数据(DatetimeIndex 索引,含 close 和 log_return 列) + halving_date : pd.Timestamp + 减半日期 + window : int + 前后各取的天数 + + Returns + ------- + pd.DataFrame + 窗口数据,附加 'days_from_halving' 列(减半日=0) + """ + start = halving_date - pd.Timedelta(days=window) + end = halving_date + pd.Timedelta(days=window) + mask = (df.index >= start) & (df.index <= end) + window_df = df.loc[mask].copy() + + # 计算距减半日的天数差 + window_df['days_from_halving'] = (window_df.index - halving_date).days + return window_df + + +def _normalize_price(window_df: pd.DataFrame, halving_date: pd.Timestamp): + """ + 以减半日价格为基准(=100)归一化价格。 + + Parameters + ---------- + window_df : pd.DataFrame + 窗口数据(含 close 列) + halving_date : pd.Timestamp + 减半日期 + + Returns + ------- + pd.Series + 归一化后的价格序列(减半日=100) + """ + # 找到距减半日最近的交易日 + idx = window_df.index.get_indexer([halving_date], method='nearest')[0] + base_price = window_df['close'].iloc[idx] + return (window_df['close'] / base_price) * 100 + + +def analyze_normalized_trajectories(windows: list, output_dir: Path): + """ + 绘制归一化价格轨迹叠加图。 + + Parameters + ---------- + windows : list[dict] + 每个元素包含 'df', 'normalized', 'label', 'halving_date' + output_dir : Path + 图片保存目录 + """ + print("\n" + "-" * 60) + print("【归一化价格轨迹叠加】") + print("-" * 60) + + fig, ax = plt.subplots(figsize=(14, 
7)) + colors = ['#2980b9', '#e74c3c'] + linestyles = ['-', '--'] + + for i, w in enumerate(windows): + days = w['df']['days_from_halving'] + normalized = w['normalized'] + ax.plot(days, normalized, color=colors[i], linestyle=linestyles[i], + linewidth=1.5, label=w['label'], alpha=0.85) + + ax.axvline(x=0, color='gold', linestyle='-', linewidth=2, + alpha=0.8, label='减半日') + ax.axhline(y=100, color='grey', linestyle=':', alpha=0.4) + + ax.set_title('BTC 减半周期 - 归一化价格轨迹叠加(减半日=100)', fontsize=14) + ax.set_xlabel(f'距减半日天数(前后各 {WINDOW_DAYS} 天)') + ax.set_ylabel('归一化价格') + ax.legend(fontsize=11) + ax.grid(True, alpha=0.3) + + fig_path = output_dir / 'halving_normalized_trajectories.png' + fig.savefig(fig_path, dpi=150, bbox_inches='tight') + plt.close(fig) + print(f"图表已保存: {fig_path}") + + +def analyze_pre_post_returns(windows: list, output_dir: Path): + """ + 对比减半前后平均收益率,进行 Welch's t 检验。 + + Parameters + ---------- + windows : list[dict] + 窗口数据列表 + output_dir : Path + 图片保存目录 + """ + print("\n" + "-" * 60) + print("【减半前后收益率对比 & Welch's t 检验】") + print("-" * 60) + + all_pre_returns = [] + all_post_returns = [] + + for w in windows: + df_w = w['df'] + pre = df_w.loc[df_w['days_from_halving'] < 0, 'log_return'].dropna() + post = df_w.loc[df_w['days_from_halving'] > 0, 'log_return'].dropna() + all_pre_returns.append(pre) + all_post_returns.append(post) + + print(f"\n{w['label']}:") + print(f" 减半前 {WINDOW_DAYS}天: 均值={pre.mean():.6f}, 标准差={pre.std():.6f}, " + f"中位数={pre.median():.6f}, N={len(pre)}") + print(f" 减半后 {WINDOW_DAYS}天: 均值={post.mean():.6f}, 标准差={post.std():.6f}, " + f"中位数={post.median():.6f}, N={len(post)}") + + # 单周期 Welch's t-test + if len(pre) >= 3 and len(post) >= 3: + t_stat, p_val = stats.ttest_ind(pre, post, equal_var=False) + print(f" Welch's t 检验: t={t_stat:.4f}, p={p_val:.6f}") + if p_val < 0.05: + print(" => 减半前后收益率在 5% 水平下存在显著差异") + else: + print(" => 减半前后收益率在 5% 水平下无显著差异") + + # 合并所有周期的前后收益率进行总体检验 + combined_pre = pd.concat(all_pre_returns) + combined_post = pd.concat(all_post_returns) + print(f"\n--- 合并所有减半周期 ---") + print(f" 合并减半前: 均值={combined_pre.mean():.6f}, N={len(combined_pre)}") + print(f" 合并减半后: 均值={combined_post.mean():.6f}, N={len(combined_post)}") + t_stat_all, p_val_all = stats.ttest_ind(combined_pre, combined_post, equal_var=False) + print(f" 合并 Welch's t 检验: t={t_stat_all:.4f}, p={p_val_all:.6f}") + + # --- 可视化: 减半前后收益率对比柱状图(含置信区间) --- + fig, axes = plt.subplots(1, len(windows), figsize=(7 * len(windows), 6)) + if len(windows) == 1: + axes = [axes] + + for i, w in enumerate(windows): + df_w = w['df'] + pre = df_w.loc[df_w['days_from_halving'] < 0, 'log_return'].dropna() + post = df_w.loc[df_w['days_from_halving'] > 0, 'log_return'].dropna() + + means = [pre.mean(), post.mean()] + # 95% 置信区间 + ci_pre = stats.t.interval(0.95, len(pre) - 1, loc=pre.mean(), scale=pre.sem()) + ci_post = stats.t.interval(0.95, len(post) - 1, loc=post.mean(), scale=post.sem()) + errors = [ + [means[0] - ci_pre[0], means[1] - ci_post[0]], + [ci_pre[1] - means[0], ci_post[1] - means[1]], + ] + + colors_bar = ['#3498db', '#e67e22'] + axes[i].bar(['减半前', '减半后'], means, yerr=errors, color=colors_bar, + alpha=0.8, capsize=5, edgecolor='black', linewidth=0.5) + axes[i].axhline(y=0, color='grey', linestyle='--', alpha=0.5) + axes[i].set_title(w['label'] + '\n日均对数收益率(95% CI)', fontsize=12) + axes[i].set_ylabel('平均对数收益率') + + plt.tight_layout() + fig_path = output_dir / 'halving_pre_post_returns.png' + fig.savefig(fig_path, dpi=150, bbox_inches='tight') + plt.close(fig) + print(f"\n图表已保存: 
{fig_path}") + + +def analyze_cumulative_returns(windows: list, output_dir: Path): + """ + 绘制减半后累计收益率对比。 + + Parameters + ---------- + windows : list[dict] + 窗口数据列表 + output_dir : Path + 图片保存目录 + """ + print("\n" + "-" * 60) + print("【减半后累计收益率对比】") + print("-" * 60) + + fig, ax = plt.subplots(figsize=(14, 7)) + colors = ['#2980b9', '#e74c3c'] + + for i, w in enumerate(windows): + df_w = w['df'] + post = df_w.loc[df_w['days_from_halving'] >= 0].copy() + if len(post) == 0: + print(f" {w['label']}: 无减半后数据") + continue + + # 累计对数收益率 + post_returns = post['log_return'].fillna(0) + cum_return = post_returns.cumsum() + # 转为百分比形式 + cum_return_pct = (np.exp(cum_return) - 1) * 100 + + days = post['days_from_halving'] + ax.plot(days, cum_return_pct, color=colors[i], linewidth=1.5, + label=w['label'], alpha=0.85) + + # 输出关键节点 + final_cum = cum_return_pct.iloc[-1] if len(cum_return_pct) > 0 else 0 + print(f" {w['label']}: 减半后 {len(post)} 天累计收益率 = {final_cum:.2f}%") + + # 输出一些关键时间节点的累计收益 + for target_day in [30, 90, 180, 365, WINDOW_DAYS]: + mask_day = days <= target_day + if mask_day.any(): + val = cum_return_pct.loc[mask_day].iloc[-1] + actual_day = days.loc[mask_day].iloc[-1] + print(f" 第 {actual_day} 天: {val:.2f}%") + + ax.axhline(y=0, color='grey', linestyle=':', alpha=0.4) + ax.set_title('BTC 减半后累计收益率对比', fontsize=14) + ax.set_xlabel('距减半日天数') + ax.set_ylabel('累计收益率 (%)') + ax.legend(fontsize=11) + ax.grid(True, alpha=0.3) + ax.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'{x:,.0f}%')) + + fig_path = output_dir / 'halving_cumulative_returns.png' + fig.savefig(fig_path, dpi=150, bbox_inches='tight') + plt.close(fig) + print(f"\n图表已保存: {fig_path}") + + +def analyze_volatility_change(windows: list, output_dir: Path): + """ + Levene 检验:减半前后波动率变化。 + + Parameters + ---------- + windows : list[dict] + 窗口数据列表 + output_dir : Path + 图片保存目录 + """ + print("\n" + "-" * 60) + print("【减半前后波动率变化 - Levene 检验】") + print("-" * 60) + + for w in windows: + df_w = w['df'] + pre = df_w.loc[df_w['days_from_halving'] < 0, 'log_return'].dropna() + post = df_w.loc[df_w['days_from_halving'] > 0, 'log_return'].dropna() + + print(f"\n{w['label']}:") + print(f" 减半前波动率(日标准差): {pre.std():.6f} " + f"(年化: {pre.std() * np.sqrt(365):.4f})") + print(f" 减半后波动率(日标准差): {post.std():.6f} " + f"(年化: {post.std() * np.sqrt(365):.4f})") + + if len(pre) >= 3 and len(post) >= 3: + lev_stat, lev_p = stats.levene(pre, post, center='median') + print(f" Levene 检验: W={lev_stat:.4f}, p={lev_p:.6f}") + if lev_p < 0.05: + print(" => 在 5% 水平下,减半前后波动率存在显著变化") + else: + print(" => 在 5% 水平下,减半前后波动率无显著变化") + + +def analyze_inter_cycle_correlation(windows: list): + """ + 两个减半周期归一化轨迹的 Pearson 相关系数。 + + Parameters + ---------- + windows : list[dict] + 窗口数据列表(需要至少2个周期) + """ + print("\n" + "-" * 60) + print("【周期间轨迹相关性 - Pearson 相关】") + print("-" * 60) + + if len(windows) < 2: + print(" 仅有1个周期,无法计算周期间相关性。") + return + + # 按照 days_from_halving 对齐两个周期 + w1, w2 = windows[0], windows[1] + df1 = w1['df'][['days_from_halving']].copy() + df1['norm_price_1'] = w1['normalized'].values + + df2 = w2['df'][['days_from_halving']].copy() + df2['norm_price_2'] = w2['normalized'].values + + # 以 days_from_halving 为键进行内连接 + merged = pd.merge(df1, df2, on='days_from_halving', how='inner') + + if len(merged) < 10: + print(f" 重叠天数过少({len(merged)}天),无法可靠计算相关性。") + return + + r, p_val = stats.pearsonr(merged['norm_price_1'], merged['norm_price_2']) + print(f" 重叠天数: {len(merged)}") + print(f" Pearson 相关系数: r={r:.4f}, p={p_val:.6f}") + + if abs(r) > 0.7: + print(" => 
两个减半周期的价格轨迹呈强相关") + elif abs(r) > 0.4: + print(" => 两个减半周期的价格轨迹呈中等相关") + else: + print(" => 两个减半周期的价格轨迹相关性较弱") + + # 分别看减半前和减半后的相关性 + pre_merged = merged[merged['days_from_halving'] < 0] + post_merged = merged[merged['days_from_halving'] > 0] + + if len(pre_merged) >= 10: + r_pre, p_pre = stats.pearsonr(pre_merged['norm_price_1'], pre_merged['norm_price_2']) + print(f" 减半前轨迹相关性: r={r_pre:.4f}, p={p_pre:.6f} (N={len(pre_merged)})") + + if len(post_merged) >= 10: + r_post, p_post = stats.pearsonr(post_merged['norm_price_1'], post_merged['norm_price_2']) + print(f" 减半后轨迹相关性: r={r_post:.4f}, p={p_post:.6f} (N={len(post_merged)})") + + +# -------------------------------------------------------------------------- +# 主入口 +# -------------------------------------------------------------------------- +def run_halving_analysis( + df: pd.DataFrame, + output_dir: str = 'output/halving', +): + """ + BTC 减半周期分析主入口。 + + Parameters + ---------- + df : pd.DataFrame + 日线数据,已通过 add_derived_features 添加衍生特征(含 close、log_return 列) + output_dir : str or Path + 输出目录 + + Notes + ----- + 重要局限性: 数据范围内仅含2次减半事件(2020、2024),样本量极少, + 统计检验的功效(power)很低,结论仅供参考,不能作为因果推断依据。 + """ + output_dir = Path(output_dir) + output_dir.mkdir(parents=True, exist_ok=True) + + print("\n" + "#" * 70) + print("# BTC 减半周期分析 (Halving Cycle Analysis)") + print("#" * 70) + + # ===== 重要局限性说明 ===== + print("\n⚠️ 重要局限性说明:") + print(f" 本分析仅覆盖 {len(HALVING_DATES)} 次减半事件(样本量极少)。") + print(" 统计检验的功效(statistical power)很低,") + print(" 任何「显著性」结论都应谨慎解读,不能作为因果推断依据。") + print(" 结果主要用于描述性分析和模式探索。\n") + + # 提取每次减半的窗口数据 + windows = [] + for i, (hdate, hlabel) in enumerate(zip(HALVING_DATES, HALVING_LABELS)): + w_df = _extract_halving_window(df, hdate, WINDOW_DAYS) + if len(w_df) == 0: + print(f"[警告] {hlabel} 窗口内无数据,跳过。") + continue + + normalized = _normalize_price(w_df, hdate) + + print(f"周期 {i + 1}: {hlabel}") + print(f" 数据范围: {w_df.index.min().date()} ~ {w_df.index.max().date()}") + print(f" 数据量: {len(w_df)} 天") + print(f" 减半日价格: {w_df['close'].iloc[w_df.index.get_indexer([hdate], method='nearest')[0]]:.2f} USDT") + + windows.append({ + 'df': w_df, + 'normalized': normalized, + 'label': hlabel, + 'halving_date': hdate, + }) + + if len(windows) == 0: + print("[错误] 无有效减半窗口数据,分析中止。") + return + + # 1. 归一化价格轨迹叠加 + analyze_normalized_trajectories(windows, output_dir) + + # 2. 减半前后收益率对比 + analyze_pre_post_returns(windows, output_dir) + + # 3. 减半后累计收益率 + analyze_cumulative_returns(windows, output_dir) + + # 4. 波动率变化 (Levene 检验) + analyze_volatility_change(windows, output_dir) + + # 5. 
周期间轨迹相关性 + analyze_inter_cycle_correlation(windows) + + # ===== 综合可视化: 三合一图 ===== + _plot_combined_summary(windows, output_dir) + + print("\n" + "#" * 70) + print("# 减半周期分析完成") + print(f"# 注意: 仅 {len(windows)} 个周期,结论统计功效有限") + print("#" * 70) + + +def _plot_combined_summary(windows: list, output_dir: Path): + """ + 综合图: 归一化轨迹 + 减半前后收益率柱状图 + 累计收益率对比。 + + Parameters + ---------- + windows : list[dict] + 窗口数据列表 + output_dir : Path + 图片保存目录 + """ + fig, axes = plt.subplots(2, 2, figsize=(16, 12)) + colors = ['#2980b9', '#e74c3c'] + linestyles = ['-', '--'] + + # (0,0) 归一化轨迹 + ax = axes[0, 0] + for i, w in enumerate(windows): + days = w['df']['days_from_halving'] + ax.plot(days, w['normalized'], color=colors[i], linestyle=linestyles[i], + linewidth=1.5, label=w['label'], alpha=0.85) + ax.axvline(x=0, color='gold', linewidth=2, alpha=0.8, label='减半日') + ax.axhline(y=100, color='grey', linestyle=':', alpha=0.4) + ax.set_title('归一化价格轨迹(减半日=100)', fontsize=12) + ax.set_xlabel('距减半日天数') + ax.set_ylabel('归一化价格') + ax.legend(fontsize=9) + ax.grid(True, alpha=0.3) + + # (0,1) 减半前后日均收益率 + ax = axes[0, 1] + x_pos = np.arange(len(windows)) + width = 0.35 + pre_means, post_means, pre_errs, post_errs = [], [], [], [] + for w in windows: + df_w = w['df'] + pre = df_w.loc[df_w['days_from_halving'] < 0, 'log_return'].dropna() + post = df_w.loc[df_w['days_from_halving'] > 0, 'log_return'].dropna() + pre_means.append(pre.mean()) + post_means.append(post.mean()) + pre_errs.append(pre.sem() * 1.96) # 95% CI + post_errs.append(post.sem() * 1.96) + + ax.bar(x_pos - width / 2, pre_means, width, yerr=pre_errs, label='减半前', + color='#3498db', alpha=0.8, capsize=4, edgecolor='black', linewidth=0.5) + ax.bar(x_pos + width / 2, post_means, width, yerr=post_errs, label='减半后', + color='#e67e22', alpha=0.8, capsize=4, edgecolor='black', linewidth=0.5) + ax.set_xticks(x_pos) + ax.set_xticklabels([w['label'].split('(')[0].strip() for w in windows], fontsize=9) + ax.axhline(y=0, color='grey', linestyle='--', alpha=0.5) + ax.set_title('减半前后日均对数收益率(95% CI)', fontsize=12) + ax.set_ylabel('平均对数收益率') + ax.legend(fontsize=9) + + # (1,0) 累计收益率 + ax = axes[1, 0] + for i, w in enumerate(windows): + df_w = w['df'] + post = df_w.loc[df_w['days_from_halving'] >= 0].copy() + if len(post) == 0: + continue + cum_ret = post['log_return'].fillna(0).cumsum() + cum_ret_pct = (np.exp(cum_ret) - 1) * 100 + ax.plot(post['days_from_halving'], cum_ret_pct, color=colors[i], + linewidth=1.5, label=w['label'], alpha=0.85) + ax.axhline(y=0, color='grey', linestyle=':', alpha=0.4) + ax.set_title('减半后累计收益率对比', fontsize=12) + ax.set_xlabel('距减半日天数') + ax.set_ylabel('累计收益率 (%)') + ax.legend(fontsize=9) + ax.grid(True, alpha=0.3) + ax.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'{x:,.0f}%')) + + # (1,1) 波动率对比(滚动30天) + ax = axes[1, 1] + for i, w in enumerate(windows): + df_w = w['df'] + rolling_vol = df_w['log_return'].rolling(30).std() * np.sqrt(365) + ax.plot(df_w['days_from_halving'], rolling_vol, color=colors[i], + linewidth=1.2, label=w['label'], alpha=0.8) + ax.axvline(x=0, color='gold', linewidth=2, alpha=0.8, label='减半日') + ax.set_title('滚动30天年化波动率', fontsize=12) + ax.set_xlabel('距减半日天数') + ax.set_ylabel('年化波动率') + ax.legend(fontsize=9) + ax.grid(True, alpha=0.3) + + plt.suptitle('BTC 减半周期综合分析', fontsize=15, y=1.01) + plt.tight_layout() + fig_path = output_dir / 'halving_combined_summary.png' + fig.savefig(fig_path, dpi=150, bbox_inches='tight') + plt.close(fig) + print(f"\n综合图表已保存: {fig_path}") + + +# 
-------------------------------------------------------------------------- +# 可独立运行 +# -------------------------------------------------------------------------- +if __name__ == '__main__': + from data_loader import load_daily + from preprocessing import add_derived_features + + # 加载数据 + df_daily = load_daily() + df_daily = add_derived_features(df_daily) + + run_halving_analysis(df_daily, output_dir='output/halving') diff --git a/src/hurst_analysis.py b/src/hurst_analysis.py new file mode 100644 index 0000000..87e111d --- /dev/null +++ b/src/hurst_analysis.py @@ -0,0 +1,633 @@ +""" +Hurst指数分析模块 +================ +通过R/S分析和DFA(去趋势波动分析)计算Hurst指数, +评估BTC价格序列的长程依赖性和市场状态(趋势/均值回归/随机游走)。 + +核心功能: +- R/S (Rescaled Range) 分析 +- DFA (Detrended Fluctuation Analysis) via nolds +- R/S 与 DFA 交叉验证 +- 滚动窗口Hurst指数追踪市场状态变化 +- 多时间框架Hurst对比分析 +""" + +import matplotlib +matplotlib.use('Agg') + +import numpy as np +import pandas as pd +import matplotlib.pyplot as plt +import matplotlib.dates as mdates +try: + import nolds + HAS_NOLDS = True +except Exception: + HAS_NOLDS = False +from pathlib import Path +from typing import Tuple, Dict, List, Optional + +import sys +sys.path.insert(0, str(Path(__file__).parent.parent)) +from src.data_loader import load_klines +from src.preprocessing import log_returns + + +# ============================================================ +# Hurst指数判定标准 +# ============================================================ +TREND_THRESHOLD = 0.55 # H > 0.55 → 趋势性(持续性) +MEAN_REV_THRESHOLD = 0.45 # H < 0.45 → 均值回归(反持续性) +# 0.45 <= H <= 0.55 → 近似随机游走 + + +def interpret_hurst(h: float) -> str: + """根据Hurst指数值给出市场状态解读""" + if h > TREND_THRESHOLD: + return f"趋势性 (H={h:.4f} > {TREND_THRESHOLD}):序列具有长程正相关,价格趋势倾向于持续" + elif h < MEAN_REV_THRESHOLD: + return f"均值回归 (H={h:.4f} < {MEAN_REV_THRESHOLD}):序列具有长程负相关,价格倾向于反转" + else: + return f"随机游走 (H={h:.4f} ≈ 0.5):序列近似无记忆,价格变动近似独立" + + +# ============================================================ +# R/S (Rescaled Range) 分析 +# ============================================================ +def _rs_for_segment(segment: np.ndarray) -> float: + """计算单个分段的R/S统计量""" + n = len(segment) + if n < 2: + return np.nan + + # 计算均值偏差的累积和 + mean_val = np.mean(segment) + deviations = segment - mean_val + cumulative = np.cumsum(deviations) + + # 极差 R = max(累积偏差) - min(累积偏差) + R = np.max(cumulative) - np.min(cumulative) + + # 标准差 S + S = np.std(segment, ddof=1) + if S == 0: + return np.nan + + return R / S + + +def rs_hurst(series: np.ndarray, min_window: int = 10, max_window: Optional[int] = None, + num_scales: int = 30) -> Tuple[float, np.ndarray, np.ndarray]: + """ + R/S重标极差分析计算Hurst指数 + + Parameters + ---------- + series : np.ndarray + 时间序列数据(通常为对数收益率) + min_window : int + 最小窗口大小 + max_window : int, optional + 最大窗口大小,默认为序列长度的1/4 + num_scales : int + 尺度数量 + + Returns + ------- + H : float + Hurst指数 + log_ns : np.ndarray + log(窗口大小) + log_rs : np.ndarray + log(平均R/S值) + """ + n = len(series) + if max_window is None: + max_window = n // 4 + + # 生成对数均匀分布的窗口大小 + window_sizes = np.unique( + np.logspace(np.log10(min_window), np.log10(max_window), num=num_scales).astype(int) + ) + + log_ns = [] + log_rs = [] + + for w in window_sizes: + if w < 10 or w > n // 2: + continue + + # 将序列分成不重叠的分段 + num_segments = n // w + if num_segments < 1: + continue + + rs_values = [] + for i in range(num_segments): + segment = series[i * w: (i + 1) * w] + rs_val = _rs_for_segment(segment) + if not np.isnan(rs_val): + rs_values.append(rs_val) + + if len(rs_values) > 0: + mean_rs = 
np.mean(rs_values)
+            if mean_rs > 0:
+                log_ns.append(np.log(w))
+                log_rs.append(np.log(mean_rs))
+
+    log_ns = np.array(log_ns)
+    log_rs = np.array(log_rs)
+
+    # 线性回归:log(R/S) = H * log(n) + c
+    if len(log_ns) < 3:
+        return 0.5, log_ns, log_rs
+
+    coeffs = np.polyfit(log_ns, log_rs, 1)
+    H = coeffs[0]
+
+    return H, log_ns, log_rs
+
+
+# ============================================================
+# DFA (Detrended Fluctuation Analysis) - 使用nolds库
+# ============================================================
+def dfa_hurst(series: np.ndarray) -> float:
+    """
+    使用nolds库进行DFA分析,返回Hurst指数
+
+    Parameters
+    ----------
+    series : np.ndarray
+        时间序列数据
+
+    Returns
+    -------
+    float
+        DFA标度指数 α:对增量过程(如对数收益率)α ≈ H;
+        对其累积过程(分数布朗运动,如对数价格)α ≈ H + 1
+    """
+    if HAS_NOLDS:
+        # nolds.dfa 返回的是DFA scaling exponent α
+        # 对于对数收益率序列(增量过程),α ≈ H
+        # 对于累积序列(如对数价格),α ≈ H + 1,应先差分再估计
+        alpha = nolds.dfa(series)
+        return alpha
+    else:
+        # 自实现的简化DFA
+        N = len(series)
+        y = np.cumsum(series - np.mean(series))
+        scales = np.unique(np.logspace(np.log10(4), np.log10(N // 4), 20).astype(int))
+        flucts = []
+        for s in scales:
+            n_seg = N // s
+            if n_seg < 1:
+                continue
+            rms_list = []
+            for i in range(n_seg):
+                seg = y[i*s:(i+1)*s]
+                x = np.arange(s)
+                coeffs = np.polyfit(x, seg, 1)
+                trend = np.polyval(coeffs, x)
+                rms_list.append(np.sqrt(np.mean((seg - trend)**2)))
+            flucts.append(np.mean(rms_list))
+        if len(flucts) < 2:
+            return 0.5
+        log_s = np.log(scales[:len(flucts)])
+        log_f = np.log(flucts)
+        alpha = np.polyfit(log_s, log_f, 1)[0]
+        return alpha
+
+
+# ============================================================
+# 交叉验证:比较R/S和DFA结果
+# ============================================================
+def cross_validate_hurst(series: np.ndarray) -> Dict[str, float]:
+    """
+    使用R/S和DFA两种方法计算Hurst指数并交叉验证
+
+    Returns
+    -------
+    dict
+        包含两种方法的Hurst值及其差异
+    """
+    h_rs, _, _ = rs_hurst(series)
+    h_dfa = dfa_hurst(series)
+
+    result = {
+        'R/S Hurst': h_rs,
+        'DFA Hurst': h_dfa,
+        '两种方法差异': abs(h_rs - h_dfa),
+        '平均值': (h_rs + h_dfa) / 2,
+    }
+    return result
+
+
+# ============================================================
+# 滚动窗口Hurst指数
+# ============================================================
+def rolling_hurst(series: np.ndarray, dates: pd.DatetimeIndex,
+                  window: int = 500, step: int = 30,
+                  method: str = 'rs') -> Tuple[pd.DatetimeIndex, np.ndarray]:
+    """
+    滚动窗口计算Hurst指数,追踪市场状态随时间的演变
+
+    Parameters
+    ----------
+    series : np.ndarray
+        时间序列(对数收益率)
+    dates : pd.DatetimeIndex
+        对应的日期索引
+    window : int
+        滚动窗口大小(默认500天)
+    step : int
+        滚动步长(默认30天)
+    method : str
+        'rs' 使用R/S分析,'dfa' 使用DFA分析
+
+    Returns
+    -------
+    roll_dates : pd.DatetimeIndex
+        每个窗口对应的日期(窗口末尾日期)
+    roll_hurst : np.ndarray
+        对应的Hurst指数值
+    """
+    n = len(series)
+    roll_dates = []
+    roll_hurst = []
+
+    for start_idx in range(0, n - window + 1, step):
+        end_idx = start_idx + window
+        segment = series[start_idx:end_idx]
+
+        if method == 'rs':
+            h, _, _ = rs_hurst(segment)
+        elif method == 'dfa':
+            h = dfa_hurst(segment)
+        else:
+            raise ValueError(f"未知方法: {method}")
+
+        roll_dates.append(dates[end_idx - 1])
+        roll_hurst.append(h)
+
+    return pd.DatetimeIndex(roll_dates), np.array(roll_hurst)
+
+
+# ============================================================
+# 多时间框架Hurst分析
+# ============================================================
+def multi_timeframe_hurst(intervals: Optional[List[str]] = None) -> Dict[str, Dict[str, float]]:
+    """
+    在多个时间框架下计算Hurst指数
+
+    Parameters
+    ----------
+    intervals : list of str, optional
+        时间框架列表,默认 ['1h', '4h', '1d', '1w']
+
+    Returns
+    -------
+    dict
+        每个时间框架的Hurst分析结果
+    """
+    if intervals is None:
+        intervals = ['1h', '4h', '1d', '1w']
+
+    results = {}
+    for interval in intervals:
+        try:
+            print(f"\n正在加载 {interval} 数据...")
+            df = load_klines(interval)
+            prices = df['close'].dropna()
+
+            if len(prices) < 100:
+                print(f"  {interval} 数据量不足({len(prices)}条),跳过")
+                continue
+
+            returns = log_returns(prices).values
+
+            # R/S分析
+            h_rs, _, _ = rs_hurst(returns)
+            # DFA分析
+            h_dfa = dfa_hurst(returns)
+
+            results[interval] = {
+                'R/S Hurst': h_rs,
+                'DFA Hurst': h_dfa,
+                '平均Hurst': (h_rs + h_dfa) / 2,
+                '数据量': len(returns),
+                '解读': interpret_hurst((h_rs + h_dfa) / 2),
+            }
+
+            print(f"  {interval}: R/S={h_rs:.4f}, DFA={h_dfa:.4f}, "
+                  f"平均={results[interval]['平均Hurst']:.4f}")
+
+        except FileNotFoundError:
+            print(f"  {interval} 数据文件不存在,跳过")
+        except Exception as e:
+            print(f"  {interval} 分析失败: {e}")
+
+    return results
+
+
+# ============================================================
+# 可视化函数
+# ============================================================
+def plot_rs_loglog(log_ns: np.ndarray, log_rs: np.ndarray, H: float,
+                   output_dir: Path, filename: str = "hurst_rs_loglog.png"):
+    """绘制R/S分析的log-log图"""
+    fig, ax = plt.subplots(figsize=(10, 7))
+
+    # 散点
+    ax.scatter(log_ns, log_rs, color='steelblue', s=40, zorder=3, label='R/S 数据点')
+
+    # 拟合线
+    coeffs = np.polyfit(log_ns, log_rs, 1)
+    fit_line = np.polyval(coeffs, log_ns)
+    ax.plot(log_ns, fit_line, 'r-', linewidth=2, label=f'拟合线 (H = {H:.4f})')
+
+    # 参考线:H=0.5(随机游走)
+    ref_line = 0.5 * log_ns + (log_rs[0] - 0.5 * log_ns[0])
+    ax.plot(log_ns, ref_line, 'k--', alpha=0.5, linewidth=1, label='H=0.5 (随机游走)')
+
+    ax.set_xlabel('log(n) - 窗口大小的对数', fontsize=12)
+    ax.set_ylabel('log(R/S) - 重标极差的对数', fontsize=12)
+    ax.set_title(f'BTC R/S 分析 (Hurst指数 = {H:.4f})\n{interpret_hurst(H)}', fontsize=13)
+    ax.legend(fontsize=11)
+    ax.grid(True, alpha=0.3)
+
+    fig.tight_layout()
+    filepath = output_dir / filename
+    fig.savefig(filepath, dpi=150, bbox_inches='tight')
+    plt.close(fig)
+    print(f"  已保存: {filepath}")
+
+
+def plot_rolling_hurst(roll_dates: pd.DatetimeIndex, roll_hurst: np.ndarray,
+                       output_dir: Path, filename: str = "hurst_rolling.png"):
+    """绘制滚动Hurst指数时间序列,带有市场状态色带"""
+    fig, ax = plt.subplots(figsize=(14, 7))
+
+    # 绘制Hurst指数曲线
+    ax.plot(roll_dates, roll_hurst, color='steelblue', linewidth=1.5, label='滚动Hurst指数')
+
+    # 状态色带
+    ax.axhspan(TREND_THRESHOLD, max(roll_hurst.max() + 0.05, 0.8),
+               alpha=0.1, color='green', label=f'趋势区 (H>{TREND_THRESHOLD})')
+    ax.axhspan(MEAN_REV_THRESHOLD, TREND_THRESHOLD,
+               alpha=0.1, color='yellow', label=f'随机游走区 ({MEAN_REV_THRESHOLD}≤H≤{TREND_THRESHOLD})')
+    ax.axhspan(min(roll_hurst.min() - 0.05, 0.3), MEAN_REV_THRESHOLD,
+               alpha=0.1, color='red', label=f'均值回归区 (H<{MEAN_REV_THRESHOLD})')
+    ax.axhline(y=0.5, color='black', linestyle='--', alpha=0.5, linewidth=1)
+
+    ax.set_xlabel('日期', fontsize=12)
+    ax.set_ylabel('Hurst指数', fontsize=12)
+    ax.set_title('BTC 滚动Hurst指数(窗口=500天,步长=30天)', fontsize=13)
+    ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
+    ax.legend(fontsize=10, loc='upper right')
+    ax.grid(True, alpha=0.3)
+
+    fig.tight_layout()
+    filepath = output_dir / filename
+    fig.savefig(filepath, dpi=150, bbox_inches='tight')
+    plt.close(fig)
+    print(f"  已保存: {filepath}")
+
+
+def plot_multi_timeframe(mt_results: Dict[str, Dict[str, float]], output_dir: Path,
+                         filename: str = "hurst_multi_timeframe.png"):
+    """绘制多时间框架Hurst指数对比柱状图(R/S与DFA并列)"""
+    if not mt_results:
+        print("  无多时间框架结果可绘制,跳过")
+        return
+
+    intervals = list(mt_results.keys())
+    h_rs_vals = [mt_results[k]['R/S Hurst'] for k in intervals]
+    h_dfa_vals = [mt_results[k]['DFA Hurst'] for k in intervals]
+
+    x_pos = np.arange(len(intervals))
+    width = 0.35
+
+    fig, ax = plt.subplots(figsize=(10, 7))
+    ax.bar(x_pos - width / 2, h_rs_vals, width, label='R/S Hurst',
+           color='steelblue', alpha=0.85)
+    ax.bar(x_pos + width / 2, h_dfa_vals, width, label='DFA Hurst',
+           color='coral', alpha=0.85)
+
+    # 参考线:随机游走及两侧状态阈值
+    ax.axhline(y=0.5, color='black', linestyle='--', alpha=0.5, label='H=0.5 (随机游走)')
+    ax.axhline(y=TREND_THRESHOLD, color='green', linestyle=':', alpha=0.5)
+    ax.axhline(y=MEAN_REV_THRESHOLD, color='red', linestyle=':', alpha=0.5)
+
+    ax.set_xticks(x_pos)
+    ax.set_xticklabels(intervals)
+    ax.set_xlabel('时间框架', fontsize=12)
+    ax.set_ylabel('Hurst指数', fontsize=12)
+    ax.set_title('BTC 多时间框架Hurst指数对比', fontsize=13)
+    ax.legend(fontsize=10)
+    ax.grid(True, alpha=0.3, axis='y')
+
+    fig.tight_layout()
+    filepath = output_dir / filename
+    fig.savefig(filepath, dpi=150, bbox_inches='tight')
+    plt.close(fig)
+    print(f"  已保存: {filepath}")
+
+
+# ============================================================
+# 主入口函数
+# ============================================================
+def run_hurst_analysis(df: pd.DataFrame, output_dir: str = "output/hurst") -> Dict:
+    """
+    Hurst指数综合分析主入口
+
+    Parameters
+    ----------
+    df : pd.DataFrame
+        K线数据(需包含 'close' 列和DatetimeIndex索引)
+    output_dir : str
+        图表输出目录
+
+    Returns
+    -------
+    dict
+        包含所有分析结果的字典
+    """
+    output_dir = Path(output_dir)
+    output_dir.mkdir(parents=True, exist_ok=True)
+
+    results = {}
+
+    print("=" * 70)
+    print("Hurst指数综合分析")
+    print("=" * 70)
+
+    # ----------------------------------------------------------
+    # 1. 准备数据
+    # ----------------------------------------------------------
+    prices = df['close'].dropna()
+    returns = log_returns(prices)
+    returns_arr = returns.values
+
+    print(f"\n数据概况:")
+    print(f"  时间范围: {df.index.min()} ~ {df.index.max()}")
+    print(f"  收益率序列长度: {len(returns_arr)}")
+
+    
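# 用法示意(假设性):rs_hurst / dfa_hurst 均可直接作用于一维收益率数组:
+    #   h_rs, _, _ = rs_hurst(returns_arr)
+    #   h_dfa = dfa_hurst(returns_arr)
+    # 二者差异 < 0.05 时(见下文交叉验证),Hurst 估计相对可靠。
+
+    # ----------------------------------------------------------
+    # 2. 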
R/S分析 + # ---------------------------------------------------------- + print("\n" + "-" * 50) + print("【1】R/S (Rescaled Range) 分析") + print("-" * 50) + + h_rs, log_ns, log_rs = rs_hurst(returns_arr) + results['R/S Hurst'] = h_rs + + print(f" R/S Hurst指数: {h_rs:.4f}") + print(f" 解读: {interpret_hurst(h_rs)}") + + # 绘制R/S log-log图 + plot_rs_loglog(log_ns, log_rs, h_rs, output_dir) + + # ---------------------------------------------------------- + # 3. DFA分析(使用nolds库) + # ---------------------------------------------------------- + print("\n" + "-" * 50) + print("【2】DFA (Detrended Fluctuation Analysis) 分析") + print("-" * 50) + + h_dfa = dfa_hurst(returns_arr) + results['DFA Hurst'] = h_dfa + + print(f" DFA Hurst指数: {h_dfa:.4f}") + print(f" 解读: {interpret_hurst(h_dfa)}") + + # ---------------------------------------------------------- + # 4. 交叉验证 + # ---------------------------------------------------------- + print("\n" + "-" * 50) + print("【3】交叉验证:R/S vs DFA") + print("-" * 50) + + cv_results = cross_validate_hurst(returns_arr) + results['交叉验证'] = cv_results + + print(f" R/S Hurst: {cv_results['R/S Hurst']:.4f}") + print(f" DFA Hurst: {cv_results['DFA Hurst']:.4f}") + print(f" 两种方法差异: {cv_results['两种方法差异']:.4f}") + print(f" 平均值: {cv_results['平均值']:.4f}") + + avg_h = cv_results['平均值'] + if cv_results['两种方法差异'] < 0.05: + print(" ✓ 两种方法结果一致性较好(差异<0.05)") + else: + print(" ⚠ 两种方法结果存在一定差异(差异≥0.05),建议结合其他方法验证") + + print(f"\n 综合解读: {interpret_hurst(avg_h)}") + results['综合Hurst'] = avg_h + results['综合解读'] = interpret_hurst(avg_h) + + # ---------------------------------------------------------- + # 5. 滚动窗口Hurst(窗口500天,步长30天) + # ---------------------------------------------------------- + print("\n" + "-" * 50) + print("【4】滚动窗口Hurst指数 (窗口=500天, 步长=30天)") + print("-" * 50) + + if len(returns_arr) >= 500: + roll_dates, roll_h = rolling_hurst( + returns_arr, returns.index, window=500, step=30, method='rs' + ) + + # 统计各状态占比 + n_trend = np.sum(roll_h > TREND_THRESHOLD) + n_mean_rev = np.sum(roll_h < MEAN_REV_THRESHOLD) + n_random = np.sum((roll_h >= MEAN_REV_THRESHOLD) & (roll_h <= TREND_THRESHOLD)) + total = len(roll_h) + + print(f" 滚动窗口数: {total}") + print(f" 趋势状态占比: {n_trend / total * 100:.1f}% ({n_trend}/{total})") + print(f" 随机游走占比: {n_random / total * 100:.1f}% ({n_random}/{total})") + print(f" 均值回归占比: {n_mean_rev / total * 100:.1f}% ({n_mean_rev}/{total})") + print(f" Hurst范围: [{roll_h.min():.4f}, {roll_h.max():.4f}]") + print(f" Hurst均值: {roll_h.mean():.4f}") + + results['滚动Hurst'] = { + '窗口数': total, + '趋势占比': n_trend / total, + '随机游走占比': n_random / total, + '均值回归占比': n_mean_rev / total, + 'Hurst范围': (roll_h.min(), roll_h.max()), + 'Hurst均值': roll_h.mean(), + } + + # 绘制滚动Hurst图 + plot_rolling_hurst(roll_dates, roll_h, output_dir) + else: + print(f" 数据量不足({len(returns_arr)}<500),跳过滚动窗口分析") + + # ---------------------------------------------------------- + # 6. 多时间框架Hurst分析 + # ---------------------------------------------------------- + print("\n" + "-" * 50) + print("【5】多时间框架Hurst指数") + print("-" * 50) + + mt_results = multi_timeframe_hurst(['1h', '4h', '1d', '1w']) + results['多时间框架'] = mt_results + + # 绘制多时间框架对比图 + plot_multi_timeframe(mt_results, output_dir) + + # ---------------------------------------------------------- + # 7. 
总结 + # ---------------------------------------------------------- + print("\n" + "=" * 70) + print("分析总结") + print("=" * 70) + print(f" 日线综合Hurst指数: {avg_h:.4f}") + print(f" 市场状态判断: {interpret_hurst(avg_h)}") + + if mt_results: + print("\n 各时间框架Hurst指数:") + for interval, data in mt_results.items(): + print(f" {interval}: 平均H={data['平均Hurst']:.4f} - {data['解读']}") + + print(f"\n 判定标准:") + print(f" H > {TREND_THRESHOLD}: 趋势性(持续性,适合趋势跟随策略)") + print(f" H < {MEAN_REV_THRESHOLD}: 均值回归(反持续性,适合均值回归策略)") + print(f" {MEAN_REV_THRESHOLD} ≤ H ≤ {TREND_THRESHOLD}: 随机游走(无显著可预测性)") + + print(f"\n 图表已保存至: {output_dir.resolve()}") + print("=" * 70) + + return results + + +# ============================================================ +# 独立运行入口 +# ============================================================ +if __name__ == "__main__": + from data_loader import load_daily + + print("加载BTC日线数据...") + df = load_daily() + print(f"数据加载完成: {len(df)} 条记录") + + results = run_hurst_analysis(df, output_dir="output/hurst") diff --git a/src/indicators.py b/src/indicators.py new file mode 100644 index 0000000..cd2ef4b --- /dev/null +++ b/src/indicators.py @@ -0,0 +1,626 @@ +""" +技术指标有效性验证模块 + +手动实现常见技术指标(MA/EMA交叉、RSI、MACD、布林带), +在训练集上进行统计显著性检验,并在验证集上验证。 +包含反数据窥探措施:Benjamini-Hochberg FDR 校正 + 置换检验。 +""" + +import matplotlib +matplotlib.use('Agg') + +import numpy as np +import pandas as pd +import matplotlib.pyplot as plt +from scipy import stats +from pathlib import Path +from typing import Dict, List, Tuple, Optional + +from src.data_loader import split_data +from src.preprocessing import log_returns + + +# ============================================================ +# 1. 手动实现技术指标 +# ============================================================ + +def calc_sma(series: pd.Series, window: int) -> pd.Series: + """简单移动平均线""" + return series.rolling(window=window, min_periods=window).mean() + + +def calc_ema(series: pd.Series, span: int) -> pd.Series: + """指数移动平均线""" + return series.ewm(span=span, adjust=False).mean() + + +def calc_rsi(close: pd.Series, period: int = 14) -> pd.Series: + """ + 相对强弱指标 (RSI) + RSI = 100 - 100 / (1 + RS) + RS = 平均上涨幅度 / 平均下跌幅度 + """ + delta = close.diff() + gain = delta.clip(lower=0) + loss = (-delta).clip(lower=0) + # 使用 EMA 计算平均涨跌 + avg_gain = gain.ewm(alpha=1.0 / period, min_periods=period, adjust=False).mean() + avg_loss = loss.ewm(alpha=1.0 / period, min_periods=period, adjust=False).mean() + rs = avg_gain / avg_loss.replace(0, np.nan) + rsi = 100 - 100 / (1 + rs) + return rsi + + +def calc_macd(close: pd.Series, fast: int = 12, slow: int = 26, signal: int = 9) -> Tuple[pd.Series, pd.Series, pd.Series]: + """ + MACD 指标 + 返回: (macd_line, signal_line, histogram) + """ + ema_fast = calc_ema(close, fast) + ema_slow = calc_ema(close, slow) + macd_line = ema_fast - ema_slow + signal_line = calc_ema(macd_line, signal) + histogram = macd_line - signal_line + return macd_line, signal_line, histogram + + +def calc_bollinger_bands(close: pd.Series, window: int = 20, num_std: float = 2.0) -> Tuple[pd.Series, pd.Series, pd.Series]: + """ + 布林带 + 返回: (upper, middle, lower) + """ + middle = calc_sma(close, window) + rolling_std = close.rolling(window=window, min_periods=window).std() + upper = middle + num_std * rolling_std + lower = middle - num_std * rolling_std + return upper, middle, lower + + +# ============================================================ +# 2. 
信号生成 +# ============================================================ + +def generate_ma_crossover_signals(close: pd.Series, short_w: int, long_w: int, use_ema: bool = False) -> pd.Series: + """ + 均线交叉信号 + 金叉 = +1(短期上穿长期),死叉 = -1(短期下穿长期),无信号 = 0 + """ + func = calc_ema if use_ema else calc_sma + short_ma = func(close, short_w) + long_ma = func(close, long_w) + # 当前短>长 且 前一根短<=长 => 金叉(+1) + # 当前短<长 且 前一根短>=长 => 死叉(-1) + cross_up = (short_ma > long_ma) & (short_ma.shift(1) <= long_ma.shift(1)) + cross_down = (short_ma < long_ma) & (short_ma.shift(1) >= long_ma.shift(1)) + signal = pd.Series(0, index=close.index) + signal[cross_up] = 1 + signal[cross_down] = -1 + return signal + + +def generate_rsi_signals(close: pd.Series, period: int, oversold: float = 30, overbought: float = 70) -> pd.Series: + """ + RSI 超买超卖信号 + RSI 从超卖区回升 => +1 (买入信号) + RSI 从超买区回落 => -1 (卖出信号) + """ + rsi = calc_rsi(close, period) + rsi_prev = rsi.shift(1) + signal = pd.Series(0, index=close.index) + # 从超卖回升 + signal[(rsi_prev <= oversold) & (rsi > oversold)] = 1 + # 从超买回落 + signal[(rsi_prev >= overbought) & (rsi < overbought)] = -1 + return signal + + +def generate_macd_signals(close: pd.Series, fast: int = 12, slow: int = 26, sig: int = 9) -> pd.Series: + """ + MACD 交叉信号 + MACD线上穿信号线 => +1 + MACD线下穿信号线 => -1 + """ + macd_line, signal_line, _ = calc_macd(close, fast, slow, sig) + cross_up = (macd_line > signal_line) & (macd_line.shift(1) <= signal_line.shift(1)) + cross_down = (macd_line < signal_line) & (macd_line.shift(1) >= signal_line.shift(1)) + signal = pd.Series(0, index=close.index) + signal[cross_up] = 1 + signal[cross_down] = -1 + return signal + + +def generate_bollinger_signals(close: pd.Series, window: int = 20, num_std: float = 2.0) -> pd.Series: + """ + 布林带信号 + 价格触及下轨后回升 => +1 (买入) + 价格触及上轨后回落 => -1 (卖出) + """ + upper, middle, lower = calc_bollinger_bands(close, window, num_std) + # 前一根在下轨以下,当前回到下轨以上 + cross_up = (close.shift(1) <= lower.shift(1)) & (close > lower) + # 前一根在上轨以上,当前回到上轨以下 + cross_down = (close.shift(1) >= upper.shift(1)) & (close < upper) + signal = pd.Series(0, index=close.index) + signal[cross_up] = 1 + signal[cross_down] = -1 + return signal + + +def build_all_signals(close: pd.Series) -> Dict[str, pd.Series]: + """ + 构建所有技术指标信号 + 返回字典: {指标名称: 信号序列} + """ + signals = {} + + # --- MA / EMA 交叉 --- + ma_pairs = [(5, 20), (10, 50), (20, 100), (50, 200)] + for short_w, long_w in ma_pairs: + signals[f"SMA_{short_w}_{long_w}"] = generate_ma_crossover_signals(close, short_w, long_w, use_ema=False) + signals[f"EMA_{short_w}_{long_w}"] = generate_ma_crossover_signals(close, short_w, long_w, use_ema=True) + + # --- RSI --- + rsi_configs = [ + (7, 30, 70), (7, 25, 75), (7, 20, 80), + (14, 30, 70), (14, 25, 75), (14, 20, 80), + (21, 30, 70), (21, 25, 75), (21, 20, 80), + ] + for period, oversold, overbought in rsi_configs: + signals[f"RSI_{period}_{oversold}_{overbought}"] = generate_rsi_signals(close, period, oversold, overbought) + + # --- MACD --- + macd_configs = [(12, 26, 9), (8, 17, 9), (5, 35, 5)] + for fast, slow, sig in macd_configs: + signals[f"MACD_{fast}_{slow}_{sig}"] = generate_macd_signals(close, fast, slow, sig) + + # --- 布林带 --- + signals["BB_20_2"] = generate_bollinger_signals(close, 20, 2.0) + + return signals + + +# ============================================================ +# 3. 
统计检验 +# ============================================================ + +def calc_forward_returns(close: pd.Series, periods: int = 1) -> pd.Series: + """计算未来N日收益率(对数收益率)""" + return np.log(close.shift(-periods) / close) + + +def test_signal_returns(signal: pd.Series, returns: pd.Series) -> Dict: + """ + 对单个指标信号进行统计检验 + + - Welch t-test:比较信号日 vs 非信号日收益均值差异 + - Mann-Whitney U:非参数检验 + - 二项检验:方向准确率是否显著高于50% + - 信息系数 (IC):Spearman秩相关 + """ + # 买入信号日(signal == 1)的收益 + buy_returns = returns[signal == 1].dropna() + # 卖出信号日(signal == -1)的收益 + sell_returns = returns[signal == -1].dropna() + # 非信号日收益 + no_signal_returns = returns[signal == 0].dropna() + + result = { + 'n_buy': len(buy_returns), + 'n_sell': len(sell_returns), + 'n_no_signal': len(no_signal_returns), + 'buy_mean': buy_returns.mean() if len(buy_returns) > 0 else np.nan, + 'sell_mean': sell_returns.mean() if len(sell_returns) > 0 else np.nan, + 'no_signal_mean': no_signal_returns.mean() if len(no_signal_returns) > 0 else np.nan, + } + + # --- Welch t-test (买入信号 vs 非信号) --- + if len(buy_returns) >= 5 and len(no_signal_returns) >= 5: + t_stat, t_pval = stats.ttest_ind(buy_returns, no_signal_returns, equal_var=False) + result['welch_t_stat'] = t_stat + result['welch_t_pval'] = t_pval + else: + result['welch_t_stat'] = np.nan + result['welch_t_pval'] = np.nan + + # --- Mann-Whitney U (买入信号 vs 非信号) --- + if len(buy_returns) >= 5 and len(no_signal_returns) >= 5: + u_stat, u_pval = stats.mannwhitneyu(buy_returns, no_signal_returns, alternative='two-sided') + result['mwu_stat'] = u_stat + result['mwu_pval'] = u_pval + else: + result['mwu_stat'] = np.nan + result['mwu_pval'] = np.nan + + # --- 二项检验:买入信号日收益>0的比例 vs 50% --- + if len(buy_returns) >= 5: + n_positive = (buy_returns > 0).sum() + binom_pval = stats.binomtest(n_positive, len(buy_returns), 0.5).pvalue + result['buy_hit_rate'] = n_positive / len(buy_returns) + result['binom_pval'] = binom_pval + else: + result['buy_hit_rate'] = np.nan + result['binom_pval'] = np.nan + + # --- 信息系数 (IC):Spearman秩相关 --- + # 用信号值(-1, 0, 1)与未来收益的秩相关 + valid_mask = signal.notna() & returns.notna() + if valid_mask.sum() >= 30: + ic, ic_pval = stats.spearmanr(signal[valid_mask], returns[valid_mask]) + result['ic'] = ic + result['ic_pval'] = ic_pval + else: + result['ic'] = np.nan + result['ic_pval'] = np.nan + + return result + + +def benjamini_hochberg(p_values: np.ndarray, alpha: float = 0.05) -> Tuple[np.ndarray, np.ndarray]: + """ + Benjamini-Hochberg FDR 校正 + + 参数: + p_values: 原始 p 值数组 + alpha: 显著性水平 + + 返回: + (rejected, adjusted_p): 是否拒绝原假设, 校正后p值 + """ + n = len(p_values) + if n == 0: + return np.array([], dtype=bool), np.array([]) + + # 处理 NaN + valid_mask = ~np.isnan(p_values) + adjusted = np.full(n, np.nan) + rejected = np.full(n, False) + + valid_pvals = p_values[valid_mask] + n_valid = len(valid_pvals) + if n_valid == 0: + return rejected, adjusted + + # 排序 + sorted_idx = np.argsort(valid_pvals) + sorted_pvals = valid_pvals[sorted_idx] + + # BH校正 + rank = np.arange(1, n_valid + 1) + adjusted_sorted = sorted_pvals * n_valid / rank + # 从后往前取累积最小值,确保单调性 + adjusted_sorted = np.minimum.accumulate(adjusted_sorted[::-1])[::-1] + adjusted_sorted = np.clip(adjusted_sorted, 0, 1) + + # 填回 + valid_indices = np.where(valid_mask)[0] + for i, idx in enumerate(sorted_idx): + adjusted[valid_indices[idx]] = adjusted_sorted[i] + rejected[valid_indices[idx]] = adjusted_sorted[i] <= alpha + + return rejected, adjusted + + +def permutation_test(signal: pd.Series, returns: pd.Series, n_permutations: int = 1000, 
stat_func=None) -> Tuple[float, float]: + """ + 置换检验 + + 随机打乱信号与收益的对应关系,评估原始统计量的显著性 + 返回: (observed_stat, p_value) + """ + if stat_func is None: + # 默认统计量:买入信号日均值 - 非信号日均值 + def stat_func(sig, ret): + buy_ret = ret[sig == 1] + no_sig_ret = ret[sig == 0] + if len(buy_ret) < 2 or len(no_sig_ret) < 2: + return 0.0 + return buy_ret.mean() - no_sig_ret.mean() + + valid_mask = signal.notna() & returns.notna() + sig_valid = signal[valid_mask].values + ret_valid = returns[valid_mask].values + + observed = stat_func(pd.Series(sig_valid), pd.Series(ret_valid)) + + # 置换 + count_extreme = 0 + rng = np.random.RandomState(42) + for _ in range(n_permutations): + perm_sig = rng.permutation(sig_valid) + perm_stat = stat_func(pd.Series(perm_sig), pd.Series(ret_valid)) + if abs(perm_stat) >= abs(observed): + count_extreme += 1 + + perm_pval = (count_extreme + 1) / (n_permutations + 1) + return observed, perm_pval + + +# ============================================================ +# 4. 可视化 +# ============================================================ + +def plot_ic_distribution(results_df: pd.DataFrame, output_dir: Path, prefix: str = "train"): + """绘制信息系数 (IC) 分布图""" + fig, ax = plt.subplots(figsize=(12, 6)) + ic_vals = results_df['ic'].dropna() + ax.barh(range(len(ic_vals)), ic_vals.values, color=['green' if v > 0 else 'red' for v in ic_vals.values]) + ax.set_yticks(range(len(ic_vals))) + ax.set_yticklabels(ic_vals.index, fontsize=7) + ax.set_xlabel('Information Coefficient (Spearman)') + ax.set_title(f'IC Distribution - {prefix.upper()} Set') + ax.axvline(x=0, color='black', linestyle='-', linewidth=0.5) + plt.tight_layout() + fig.savefig(output_dir / f"ic_distribution_{prefix}.png", dpi=150, bbox_inches='tight') + plt.close(fig) + print(f" [saved] ic_distribution_{prefix}.png") + + +def plot_pvalue_heatmap(results_df: pd.DataFrame, output_dir: Path, prefix: str = "train"): + """绘制 p 值热力图:原始 vs FDR 校正后""" + pval_cols = ['welch_t_pval', 'mwu_pval', 'binom_pval', 'ic_pval'] + adj_cols = ['welch_t_adj_pval', 'mwu_adj_pval', 'binom_adj_pval', 'ic_adj_pval'] + + # 只取存在的列 + existing_pval = [c for c in pval_cols if c in results_df.columns] + existing_adj = [c for c in adj_cols if c in results_df.columns] + + if not existing_pval: + return + + fig, axes = plt.subplots(1, 2, figsize=(16, max(8, len(results_df) * 0.35))) + + # 原始 p 值 + pval_data = results_df[existing_pval].values.astype(float) + im1 = axes[0].imshow(pval_data, aspect='auto', cmap='RdYlGn_r', vmin=0, vmax=0.1) + axes[0].set_yticks(range(len(results_df))) + axes[0].set_yticklabels(results_df.index, fontsize=6) + axes[0].set_xticks(range(len(existing_pval))) + axes[0].set_xticklabels([c.replace('_pval', '') for c in existing_pval], fontsize=8, rotation=45) + axes[0].set_title('Raw p-values') + plt.colorbar(im1, ax=axes[0], shrink=0.6) + + # FDR 校正后 p 值 + if existing_adj: + adj_data = results_df[existing_adj].values.astype(float) + im2 = axes[1].imshow(adj_data, aspect='auto', cmap='RdYlGn_r', vmin=0, vmax=0.1) + axes[1].set_yticks(range(len(results_df))) + axes[1].set_yticklabels(results_df.index, fontsize=6) + axes[1].set_xticks(range(len(existing_adj))) + axes[1].set_xticklabels([c.replace('_adj_pval', '') for c in existing_adj], fontsize=8, rotation=45) + axes[1].set_title('FDR-adjusted p-values') + plt.colorbar(im2, ax=axes[1], shrink=0.6) + else: + axes[1].text(0.5, 0.5, 'No adjusted p-values', ha='center', va='center') + axes[1].set_title('FDR-adjusted p-values (N/A)') + + plt.suptitle(f'P-value Heatmap - {prefix.upper()} Set', fontsize=14) + 
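+    # 注(补充说明): 两幅热力图的色阶均截断在 vmax=0.1——
+    # RdYlGn_r 配色下,接近 0 的 p 值显示为绿色,>=0.1 的格子统一
+    # 显示为红色,便于直观对比 FDR 校正前后显著格子数量的变化。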
plt.tight_layout() + fig.savefig(output_dir / f"pvalue_heatmap_{prefix}.png", dpi=150, bbox_inches='tight') + plt.close(fig) + print(f" [saved] pvalue_heatmap_{prefix}.png") + + +def plot_best_indicator_signal(close: pd.Series, signal: pd.Series, returns: pd.Series, + indicator_name: str, output_dir: Path, prefix: str = "train"): + """绘制最佳指标的信号 vs 收益散点图""" + fig, axes = plt.subplots(2, 1, figsize=(14, 10), gridspec_kw={'height_ratios': [2, 1]}) + + # 上图:价格 + 信号标记 + axes[0].plot(close.index, close.values, color='gray', alpha=0.7, linewidth=0.8, label='BTC Close') + buy_mask = signal == 1 + sell_mask = signal == -1 + axes[0].scatter(close.index[buy_mask], close.values[buy_mask], + marker='^', color='green', s=40, label='Buy Signal', zorder=5) + axes[0].scatter(close.index[sell_mask], close.values[sell_mask], + marker='v', color='red', s=40, label='Sell Signal', zorder=5) + axes[0].set_title(f'Best Indicator: {indicator_name} - {prefix.upper()} Set') + axes[0].set_ylabel('Price (USDT)') + axes[0].legend(fontsize=8) + + # 下图:信号日收益分布 + buy_returns = returns[buy_mask].dropna() + sell_returns = returns[sell_mask].dropna() + if len(buy_returns) > 0: + axes[1].hist(buy_returns, bins=30, alpha=0.6, color='green', label=f'Buy ({len(buy_returns)})') + if len(sell_returns) > 0: + axes[1].hist(sell_returns, bins=30, alpha=0.6, color='red', label=f'Sell ({len(sell_returns)})') + axes[1].axvline(x=0, color='black', linestyle='--', linewidth=0.8) + axes[1].set_xlabel('Forward 1-day Log Return') + axes[1].set_ylabel('Count') + axes[1].legend(fontsize=8) + + plt.tight_layout() + fig.savefig(output_dir / f"best_indicator_{prefix}.png", dpi=150, bbox_inches='tight') + plt.close(fig) + print(f" [saved] best_indicator_{prefix}.png") + + +# ============================================================ +# 5. 
主流程 +# ============================================================ + +def evaluate_signals_on_set(close: pd.Series, signals: Dict[str, pd.Series], set_name: str) -> pd.DataFrame: + """ + 在给定数据集上评估所有信号 + + 返回包含所有统计指标的 DataFrame + """ + # 未来1日收益 + fwd_ret = calc_forward_returns(close, periods=1) + + results = {} + for name, signal in signals.items(): + # 只取当前数据集范围内的信号 + sig = signal.reindex(close.index).fillna(0) + ret = fwd_ret.reindex(close.index) + results[name] = test_signal_returns(sig, ret) + + results_df = pd.DataFrame(results).T + results_df.index.name = 'indicator' + + print(f"\n{'='*60}") + print(f" {set_name} 数据集评估结果") + print(f"{'='*60}") + print(f" 总指标数: {len(results_df)}") + print(f" 数据点数: {len(close)}") + + return results_df + + +def apply_fdr_correction(results_df: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame: + """ + 对所有 p 值列进行 Benjamini-Hochberg FDR 校正 + """ + pval_cols = ['welch_t_pval', 'mwu_pval', 'binom_pval', 'ic_pval'] + + for col in pval_cols: + if col not in results_df.columns: + continue + pvals = results_df[col].values.astype(float) + rejected, adjusted = benjamini_hochberg(pvals, alpha) + adj_col = col.replace('_pval', '_adj_pval') + rej_col = col.replace('_pval', '_rejected') + results_df[adj_col] = adjusted + results_df[rej_col] = rejected + + return results_df + + +def run_indicators_analysis(df: pd.DataFrame, output_dir: str) -> Dict: + """ + 技术指标有效性验证主入口 + + 参数: + df: 完整的日线 DataFrame(含 open/high/low/close/volume 等列,DatetimeIndex) + output_dir: 图表输出目录 + + 返回: + 包含训练集和验证集结果的字典 + """ + output_dir = Path(output_dir) + output_dir.mkdir(parents=True, exist_ok=True) + + print("=" * 60) + print(" 技术指标有效性验证") + print("=" * 60) + + # --- 数据切分 --- + train, val, test = split_data(df) + print(f"\n训练集: {train.index.min()} ~ {train.index.max()} ({len(train)} bars)") + print(f"验证集: {val.index.min()} ~ {val.index.max()} ({len(val)} bars)") + + # --- 构建全部信号(在全量数据上计算,避免前导NaN问题) --- + all_signals = build_all_signals(df['close']) + print(f"\n共构建 {len(all_signals)} 个技术指标信号") + + # ============ 训练集评估 ============ + train_results = evaluate_signals_on_set(train['close'], all_signals, "训练集 (TRAIN)") + + # FDR 校正 + train_results = apply_fdr_correction(train_results, alpha=0.05) + + # 找出通过 FDR 校正的指标 + reject_cols = [c for c in train_results.columns if c.endswith('_rejected')] + if reject_cols: + train_results['any_fdr_pass'] = train_results[reject_cols].any(axis=1) + fdr_passed = train_results[train_results['any_fdr_pass']].index.tolist() + else: + fdr_passed = [] + + print(f"\n--- FDR 校正结果 (训练集) ---") + if fdr_passed: + print(f" 通过 FDR 校正的指标 ({len(fdr_passed)} 个):") + for name in fdr_passed: + row = train_results.loc[name] + ic_val = row.get('ic', np.nan) + print(f" - {name}: IC={ic_val:.4f}" if not np.isnan(ic_val) else f" - {name}") + else: + print(" 没有指标通过 FDR 校正(alpha=0.05)") + + # --- 置换检验(仅对 IC 排名前5的指标) --- + fwd_ret_train = calc_forward_returns(train['close'], periods=1) + ic_series = train_results['ic'].dropna().abs().sort_values(ascending=False) + top_indicators = ic_series.head(5).index.tolist() + + print(f"\n--- 置换检验 (训练集, top-5 IC 指标, 1000次置换) ---") + perm_results = {} + for name in top_indicators: + sig = all_signals[name].reindex(train.index).fillna(0) + ret = fwd_ret_train.reindex(train.index) + obs, pval = permutation_test(sig, ret, n_permutations=1000) + perm_results[name] = {'observed_diff': obs, 'perm_pval': pval} + perm_pass = "PASS" if pval < 0.05 else "FAIL" + print(f" {name}: obs_diff={obs:.6f}, perm_p={pval:.4f} [{perm_pass}]") + + # --- 训练集可视化 --- + 
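+    # 注(置换检验补充): 上方 perm_pval 采用 (极端次数+1)/(置换次数+1)
+    # 的加一修正,避免出现 p=0;且 top-5 指标是按训练集 |IC| 事后挑选的,
+    # 其置换 p 值不可避免带有选择偏差,后续图表解读应相应保守。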
print("\n--- 训练集可视化 ---") + plot_ic_distribution(train_results, output_dir, prefix="train") + plot_pvalue_heatmap(train_results, output_dir, prefix="train") + + # 最佳指标(IC绝对值最大) + if len(ic_series) > 0: + best_name = ic_series.index[0] + best_signal = all_signals[best_name].reindex(train.index).fillna(0) + best_ret = fwd_ret_train.reindex(train.index) + plot_best_indicator_signal(train['close'], best_signal, best_ret, best_name, output_dir, prefix="train") + + # ============ 验证集评估 ============ + val_results = evaluate_signals_on_set(val['close'], all_signals, "验证集 (VAL)") + val_results = apply_fdr_correction(val_results, alpha=0.05) + + reject_cols_val = [c for c in val_results.columns if c.endswith('_rejected')] + if reject_cols_val: + val_results['any_fdr_pass'] = val_results[reject_cols_val].any(axis=1) + val_fdr_passed = val_results[val_results['any_fdr_pass']].index.tolist() + else: + val_fdr_passed = [] + + print(f"\n--- FDR 校正结果 (验证集) ---") + if val_fdr_passed: + print(f" 通过 FDR 校正的指标 ({len(val_fdr_passed)} 个):") + for name in val_fdr_passed: + row = val_results.loc[name] + ic_val = row.get('ic', np.nan) + print(f" - {name}: IC={ic_val:.4f}" if not np.isnan(ic_val) else f" - {name}") + else: + print(" 没有指标通过 FDR 校正(alpha=0.05)") + + # 训练集 vs 验证集 IC 对比 + if 'ic' in train_results.columns and 'ic' in val_results.columns: + print(f"\n--- 训练集 vs 验证集 IC 对比 (Top-10) ---") + merged_ic = pd.DataFrame({ + 'train_ic': train_results['ic'], + 'val_ic': val_results['ic'] + }).dropna() + merged_ic['consistent'] = (merged_ic['train_ic'] * merged_ic['val_ic']) > 0 # 同号 + merged_ic = merged_ic.reindex(merged_ic['train_ic'].abs().sort_values(ascending=False).index) + for name in merged_ic.head(10).index: + row = merged_ic.loc[name] + cons = "OK" if row['consistent'] else "FLIP" + print(f" {name}: train_IC={row['train_ic']:.4f}, val_IC={row['val_ic']:.4f} [{cons}]") + + # --- 验证集可视化 --- + print("\n--- 验证集可视化 ---") + plot_ic_distribution(val_results, output_dir, prefix="val") + plot_pvalue_heatmap(val_results, output_dir, prefix="val") + + val_ic_series = val_results['ic'].dropna().abs().sort_values(ascending=False) + if len(val_ic_series) > 0: + fwd_ret_val = calc_forward_returns(val['close'], periods=1) + best_val_name = val_ic_series.index[0] + best_val_signal = all_signals[best_val_name].reindex(val.index).fillna(0) + best_val_ret = fwd_ret_val.reindex(val.index) + plot_best_indicator_signal(val['close'], best_val_signal, best_val_ret, best_val_name, output_dir, prefix="val") + + print(f"\n{'='*60}") + print(" 技术指标有效性验证完成") + print(f"{'='*60}") + + return { + 'train_results': train_results, + 'val_results': val_results, + 'fdr_passed_train': fdr_passed, + 'fdr_passed_val': val_fdr_passed, + 'permutation_results': perm_results, + 'all_signals': all_signals, + } diff --git a/src/patterns.py b/src/patterns.py new file mode 100644 index 0000000..b706226 --- /dev/null +++ b/src/patterns.py @@ -0,0 +1,853 @@ +""" +K线形态识别与统计验证模块 + +手动实现常见蜡烛图形态(Doji、Hammer、Engulfing、Morning/Evening Star 等), +使用前向收益分析 + Wilson 置信区间 + FDR 校正进行统计验证。 +""" + +import matplotlib +matplotlib.use('Agg') + +import numpy as np +import pandas as pd +import matplotlib.pyplot as plt +from scipy import stats +from pathlib import Path +from typing import Dict, List, Tuple, Optional + +from src.data_loader import split_data + + +# ============================================================ +# 1. 
辅助函数 +# ============================================================ + +def _body(df: pd.DataFrame) -> pd.Series: + """实体大小(绝对值)""" + return (df['close'] - df['open']).abs() + + +def _body_signed(df: pd.DataFrame) -> pd.Series: + """带符号的实体(正=阳线,负=阴线)""" + return df['close'] - df['open'] + + +def _upper_shadow(df: pd.DataFrame) -> pd.Series: + """上影线长度""" + return df['high'] - df[['open', 'close']].max(axis=1) + + +def _lower_shadow(df: pd.DataFrame) -> pd.Series: + """下影线长度""" + return df[['open', 'close']].min(axis=1) - df['low'] + + +def _total_range(df: pd.DataFrame) -> pd.Series: + """总振幅(high - low),避免零值""" + return (df['high'] - df['low']).replace(0, np.nan) + + +def _is_bullish(df: pd.DataFrame) -> pd.Series: + """是否阳线""" + return df['close'] > df['open'] + + +def _is_bearish(df: pd.DataFrame) -> pd.Series: + """是否阴线""" + return df['close'] < df['open'] + + +# ============================================================ +# 2. 形态识别函数(手动实现) +# ============================================================ + +def detect_doji(df: pd.DataFrame) -> pd.Series: + """ + 十字星 (Doji) + 条件: 实体 < 总振幅的 10% + 方向: 中性 (0) + """ + body = _body(df) + total = _total_range(df) + return (body / total < 0.10).astype(int) + + +def detect_hammer(df: pd.DataFrame) -> pd.Series: + """ + 锤子线 (Hammer) — 底部反转看涨信号 + 条件: + - 下影线 > 实体的 2 倍 + - 上影线 < 实体的 0.5 倍(或 < 总振幅的 15%) + - 实体在上半部分 + """ + body = _body(df) + lower = _lower_shadow(df) + upper = _upper_shadow(df) + total = _total_range(df) + + cond = ( + (lower > 2 * body) & + (upper < 0.5 * body + 1e-10) & # 加小值避免零实体问题 + (body > 0) # 排除doji + ) + return cond.astype(int) + + +def detect_inverted_hammer(df: pd.DataFrame) -> pd.Series: + """ + 倒锤子线 (Inverted Hammer) — 底部反转看涨信号 + 条件: + - 上影线 > 实体的 2 倍 + - 下影线 < 实体的 0.5 倍 + """ + body = _body(df) + lower = _lower_shadow(df) + upper = _upper_shadow(df) + + cond = ( + (upper > 2 * body) & + (lower < 0.5 * body + 1e-10) & + (body > 0) + ) + return cond.astype(int) + + +def detect_bullish_engulfing(df: pd.DataFrame) -> pd.Series: + """ + 看涨吞没 (Bullish Engulfing) + 条件: + - 前一根阴线,当前阳线 + - 当前实体完全包裹前一根实体 + """ + prev_bearish = _is_bearish(df).shift(1) + curr_bullish = _is_bullish(df) + + # 当前开盘 < 前一根收盘 (前一根阴线收盘较低) + # 当前收盘 > 前一根开盘 + cond = ( + prev_bearish & + curr_bullish & + (df['open'] <= df['close'].shift(1)) & + (df['close'] >= df['open'].shift(1)) + ) + return cond.fillna(False).astype(int) + + +def detect_bearish_engulfing(df: pd.DataFrame) -> pd.Series: + """ + 看跌吞没 (Bearish Engulfing) + 条件: + - 前一根阳线,当前阴线 + - 当前实体完全包裹前一根实体 + """ + prev_bullish = _is_bullish(df).shift(1) + curr_bearish = _is_bearish(df) + + cond = ( + prev_bullish & + curr_bearish & + (df['open'] >= df['close'].shift(1)) & + (df['close'] <= df['open'].shift(1)) + ) + return cond.fillna(False).astype(int) + + +def detect_morning_star(df: pd.DataFrame) -> pd.Series: + """ + 晨星 (Morning Star) — 3根K线底部反转 + 条件: + - 第1根: 大阴线(实体 > 中位数实体) + - 第2根: 小实体(实体 < 中位数实体 * 0.5),跳空低开或接近 + - 第3根: 大阳线,收盘超过第1根实体中点 + """ + body = _body(df) + body_signed = _body_signed(df) + median_body = body.rolling(window=20, min_periods=10).median() + + # 第1根大阴线 + bar1_big_bear = (body_signed.shift(2) < 0) & (body.shift(2) > median_body.shift(2)) + # 第2根小实体 + bar2_small = body.shift(1) < median_body.shift(1) * 0.5 + # 第3根大阳线,收盘超过第1根实体中点 + bar1_mid = (df['open'].shift(2) + df['close'].shift(2)) / 2 + bar3_big_bull = (body_signed > 0) & (body > median_body) & (df['close'] > bar1_mid) + + cond = bar1_big_bear & bar2_small & bar3_big_bull + return cond.fillna(False).astype(int) + + +def 
detect_evening_star(df: pd.DataFrame) -> pd.Series: + """ + 暮星 (Evening Star) — 3根K线顶部反转 + 条件: + - 第1根: 大阳线 + - 第2根: 小实体 + - 第3根: 大阴线,收盘低于第1根实体中点 + """ + body = _body(df) + body_signed = _body_signed(df) + median_body = body.rolling(window=20, min_periods=10).median() + + bar1_big_bull = (body_signed.shift(2) > 0) & (body.shift(2) > median_body.shift(2)) + bar2_small = body.shift(1) < median_body.shift(1) * 0.5 + bar1_mid = (df['open'].shift(2) + df['close'].shift(2)) / 2 + bar3_big_bear = (body_signed < 0) & (body > median_body) & (df['close'] < bar1_mid) + + cond = bar1_big_bull & bar2_small & bar3_big_bear + return cond.fillna(False).astype(int) + + +def detect_three_white_soldiers(df: pd.DataFrame) -> pd.Series: + """ + 三阳开泰 (Three White Soldiers) + 条件: + - 连续3根阳线 + - 每根开盘在前一根实体范围内 + - 每根收盘创新高 + - 上影线较小 + """ + bullish = _is_bullish(df) + body = _body(df) + upper = _upper_shadow(df) + + cond = ( + bullish & bullish.shift(1) & bullish.shift(2) & + # 每根收盘逐步升高 + (df['close'] > df['close'].shift(1)) & + (df['close'].shift(1) > df['close'].shift(2)) & + # 每根开盘在前一根实体内 + (df['open'] >= df['open'].shift(1)) & + (df['open'] <= df['close'].shift(1)) & + (df['open'].shift(1) >= df['open'].shift(2)) & + (df['open'].shift(1) <= df['close'].shift(2)) & + # 上影线不超过实体的30% + (upper < body * 0.3 + 1e-10) & + (upper.shift(1) < body.shift(1) * 0.3 + 1e-10) + ) + return cond.fillna(False).astype(int) + + +def detect_three_black_crows(df: pd.DataFrame) -> pd.Series: + """ + 三阴断头 (Three Black Crows) + 条件: + - 连续3根阴线 + - 每根开盘在前一根实体范围内 + - 每根收盘创新低 + - 下影线较小 + """ + bearish = _is_bearish(df) + body = _body(df) + lower = _lower_shadow(df) + + cond = ( + bearish & bearish.shift(1) & bearish.shift(2) & + # 每根收盘逐步降低 + (df['close'] < df['close'].shift(1)) & + (df['close'].shift(1) < df['close'].shift(2)) & + # 每根开盘在前一根实体内 + (df['open'] <= df['open'].shift(1)) & + (df['open'] >= df['close'].shift(1)) & + (df['open'].shift(1) <= df['open'].shift(2)) & + (df['open'].shift(1) >= df['close'].shift(2)) & + # 下影线不超过实体的30% + (lower < body * 0.3 + 1e-10) & + (lower.shift(1) < body.shift(1) * 0.3 + 1e-10) + ) + return cond.fillna(False).astype(int) + + +def detect_pin_bar(df: pd.DataFrame) -> pd.Series: + """ + Pin Bar (影线 > 总振幅的 2/3) + 分为上Pin Bar(看跌)和下Pin Bar(看涨),此处合并检测 + 返回: + +1 = 下Pin Bar (长下影,看涨) + -1 = 上Pin Bar (长上影,看跌) + 0 = 无信号 + """ + total = _total_range(df) + upper = _upper_shadow(df) + lower = _lower_shadow(df) + threshold = 2.0 / 3.0 + + long_lower = (lower / total > threshold) # 长下影 -> 看涨 + long_upper = (upper / total > threshold) # 长上影 -> 看跌 + + signal = pd.Series(0, index=df.index) + signal[long_lower] = 1 # 看涨Pin Bar + signal[long_upper] = -1 # 看跌Pin Bar + # 如果同时满足(极端情况),取消信号 + signal[long_lower & long_upper] = 0 + return signal + + +def detect_shooting_star(df: pd.DataFrame) -> pd.Series: + """ + 流星线 (Shooting Star) — 顶部反转看跌信号 + 条件: + - 上影线 > 实体的 2 倍 + - 下影线 < 实体的 0.5 倍 + - 在上涨趋势末端(前2根收盘低于当前收盘) + """ + body = _body(df) + upper = _upper_shadow(df) + lower = _lower_shadow(df) + + cond = ( + (upper > 2 * body) & + (lower < 0.5 * body + 1e-10) & + (body > 0) & + (df['close'].shift(1) < df['high']) & + (df['close'].shift(2) < df['close'].shift(1)) + ) + return cond.fillna(False).astype(int) + + +def detect_all_patterns(df: pd.DataFrame) -> Dict[str, pd.Series]: + """ + 检测所有K线形态 + 返回字典: {形态名称: 信号序列} + + 对于方向性形态: + - 看涨形态的值 > 0 表示检测到 + - 看跌形态的值 > 0 表示检测到 + - Pin Bar 特殊: +1=看涨, -1=看跌 + """ + patterns = {} + + # --- 单根K线形态 --- + patterns['Doji'] = detect_doji(df) + patterns['Hammer'] = detect_hammer(df) + 
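+    # 注(补充说明): 上面的 Hammer 与下面的 Inverted_Hammer 仅按
+    # 单根K线的影线/实体比例识别,未像教科书定义那样要求出现在
+    # 下跌趋势末端,因此统计到的样本比严格定义更宽松。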
patterns['Inverted_Hammer'] = detect_inverted_hammer(df) + patterns['Shooting_Star'] = detect_shooting_star(df) + patterns['Pin_Bar_Bull'] = (detect_pin_bar(df) == 1).astype(int) + patterns['Pin_Bar_Bear'] = (detect_pin_bar(df) == -1).astype(int) + + # --- 两根K线形态 --- + patterns['Bullish_Engulfing'] = detect_bullish_engulfing(df) + patterns['Bearish_Engulfing'] = detect_bearish_engulfing(df) + + # --- 三根K线形态 --- + patterns['Morning_Star'] = detect_morning_star(df) + patterns['Evening_Star'] = detect_evening_star(df) + patterns['Three_White_Soldiers'] = detect_three_white_soldiers(df) + patterns['Three_Black_Crows'] = detect_three_black_crows(df) + + return patterns + + +# 形态的预期方向映射(+1=看涨, -1=看跌, 0=中性) +PATTERN_EXPECTED_DIRECTION = { + 'Doji': 0, + 'Hammer': 1, + 'Inverted_Hammer': 1, + 'Shooting_Star': -1, + 'Pin_Bar_Bull': 1, + 'Pin_Bar_Bear': -1, + 'Bullish_Engulfing': 1, + 'Bearish_Engulfing': -1, + 'Morning_Star': 1, + 'Evening_Star': -1, + 'Three_White_Soldiers': 1, + 'Three_Black_Crows': -1, +} + + +# ============================================================ +# 3. 前向收益分析 +# ============================================================ + +def calc_forward_returns_multi(close: pd.Series, horizons: List[int] = None) -> pd.DataFrame: + """计算多个前向周期的对数收益率""" + if horizons is None: + horizons = [1, 3, 5, 10, 20] + fwd = pd.DataFrame(index=close.index) + for h in horizons: + fwd[f'fwd_{h}d'] = np.log(close.shift(-h) / close) + return fwd + + +def analyze_pattern_returns(pattern_signal: pd.Series, fwd_returns: pd.DataFrame, + expected_dir: int = 0) -> Dict: + """ + 对单个形态进行前向收益分析 + + 参数: + pattern_signal: 形态检测信号 (1=出现, 0=未出现) + fwd_returns: 前向收益 DataFrame + expected_dir: 预期方向 (+1=看涨, -1=看跌, 0=中性) + + 返回: + 统计结果字典 + """ + mask = pattern_signal > 0 # Pin_Bar_Bear 已经处理为单独信号 + n_occurrences = mask.sum() + + result = {'n_occurrences': int(n_occurrences), 'expected_direction': expected_dir} + + if n_occurrences < 3: + # 样本太少,跳过 + for col in fwd_returns.columns: + result[f'{col}_mean'] = np.nan + result[f'{col}_median'] = np.nan + result[f'{col}_pct_positive'] = np.nan + result[f'{col}_ttest_pval'] = np.nan + result['hit_rate'] = np.nan + result['wilson_ci_lower'] = np.nan + result['wilson_ci_upper'] = np.nan + return result + + for col in fwd_returns.columns: + returns = fwd_returns.loc[mask, col].dropna() + if len(returns) == 0: + result[f'{col}_mean'] = np.nan + result[f'{col}_median'] = np.nan + result[f'{col}_pct_positive'] = np.nan + result[f'{col}_ttest_pval'] = np.nan + continue + + result[f'{col}_mean'] = returns.mean() + result[f'{col}_median'] = returns.median() + result[f'{col}_pct_positive'] = (returns > 0).mean() + + # 单样本 t-test: 均值是否显著不等于 0 + if len(returns) >= 5: + t_stat, t_pval = stats.ttest_1samp(returns, 0) + result[f'{col}_ttest_pval'] = t_pval + else: + result[f'{col}_ttest_pval'] = np.nan + + # --- 命中率 (hit rate) --- + # 使用 fwd_1d 作为判断依据 + if 'fwd_1d' in fwd_returns.columns: + ret_1d = fwd_returns.loc[mask, 'fwd_1d'].dropna() + if len(ret_1d) > 0: + if expected_dir == 1: + # 看涨:收益>0 为命中 + hits = (ret_1d > 0).sum() + elif expected_dir == -1: + # 看跌:收益<0 为命中 + hits = (ret_1d < 0).sum() + else: + # 中性:取绝对值较大方向的准确率 + hits = max((ret_1d > 0).sum(), (ret_1d < 0).sum()) + + n = len(ret_1d) + hit_rate = hits / n + result['hit_rate'] = hit_rate + result['hit_count'] = int(hits) + result['hit_n'] = int(n) + + # Wilson 置信区间 + ci_lower, ci_upper = wilson_confidence_interval(hits, n, alpha=0.05) + result['wilson_ci_lower'] = ci_lower + result['wilson_ci_upper'] = ci_upper + + # 二项检验: 
命中率是否显著高于 50% + binom_pval = stats.binomtest(hits, n, 0.5, alternative='greater').pvalue + result['binom_pval'] = binom_pval + else: + result['hit_rate'] = np.nan + result['wilson_ci_lower'] = np.nan + result['wilson_ci_upper'] = np.nan + result['binom_pval'] = np.nan + else: + result['hit_rate'] = np.nan + result['wilson_ci_lower'] = np.nan + result['wilson_ci_upper'] = np.nan + + return result + + +# ============================================================ +# 4. Wilson 置信区间 + FDR 校正 +# ============================================================ + +def wilson_confidence_interval(successes: int, n: int, alpha: float = 0.05) -> Tuple[float, float]: + """ + Wilson 置信区间计算 + + 比 Wald 区间更适合小样本和极端比例的情况 + + 参数: + successes: 成功次数 + n: 总次数 + alpha: 显著性水平 + + 返回: + (lower, upper) 置信区间 + """ + if n == 0: + return (0.0, 1.0) + + p_hat = successes / n + z = stats.norm.ppf(1 - alpha / 2) + + denominator = 1 + z ** 2 / n + center = (p_hat + z ** 2 / (2 * n)) / denominator + margin = z * np.sqrt((p_hat * (1 - p_hat) + z ** 2 / (4 * n)) / n) / denominator + + lower = max(0, center - margin) + upper = min(1, center + margin) + return (lower, upper) + + +def benjamini_hochberg(p_values: np.ndarray, alpha: float = 0.05) -> Tuple[np.ndarray, np.ndarray]: + """ + Benjamini-Hochberg FDR 校正 + + 参数: + p_values: 原始 p 值数组 + alpha: 显著性水平 + + 返回: + (rejected, adjusted_p): 是否拒绝原假设, 校正后p值 + """ + n = len(p_values) + if n == 0: + return np.array([], dtype=bool), np.array([]) + + valid_mask = ~np.isnan(p_values) + adjusted = np.full(n, np.nan) + rejected = np.full(n, False) + + valid_pvals = p_values[valid_mask] + n_valid = len(valid_pvals) + if n_valid == 0: + return rejected, adjusted + + sorted_idx = np.argsort(valid_pvals) + sorted_pvals = valid_pvals[sorted_idx] + + rank = np.arange(1, n_valid + 1) + adjusted_sorted = sorted_pvals * n_valid / rank + adjusted_sorted = np.minimum.accumulate(adjusted_sorted[::-1])[::-1] + adjusted_sorted = np.clip(adjusted_sorted, 0, 1) + + valid_indices = np.where(valid_mask)[0] + for i, idx in enumerate(sorted_idx): + adjusted[valid_indices[idx]] = adjusted_sorted[i] + rejected[valid_indices[idx]] = adjusted_sorted[i] <= alpha + + return rejected, adjusted + + +# ============================================================ +# 5. 
可视化 +# ============================================================ + +def plot_pattern_counts(pattern_counts: Dict[str, int], output_dir: Path, prefix: str = "train"): + """绘制形态出现次数的柱状图""" + fig, ax = plt.subplots(figsize=(12, 6)) + + names = list(pattern_counts.keys()) + counts = list(pattern_counts.values()) + colors = ['#2ecc71' if PATTERN_EXPECTED_DIRECTION.get(n, 0) >= 0 else '#e74c3c' for n in names] + + bars = ax.barh(range(len(names)), counts, color=colors, edgecolor='gray', linewidth=0.5) + ax.set_yticks(range(len(names))) + ax.set_yticklabels(names, fontsize=9) + ax.set_xlabel('Occurrence Count') + ax.set_title(f'Pattern Occurrence Counts - {prefix.upper()} Set') + + # 在柱形上标注数值 + for bar, count in zip(bars, counts): + ax.text(bar.get_width() + 0.5, bar.get_y() + bar.get_height() / 2, + str(count), va='center', fontsize=8) + + plt.tight_layout() + fig.savefig(output_dir / f"pattern_counts_{prefix}.png", dpi=150, bbox_inches='tight') + plt.close(fig) + print(f" [saved] pattern_counts_{prefix}.png") + + +def plot_forward_return_boxplots(patterns: Dict[str, pd.Series], fwd_returns: pd.DataFrame, + output_dir: Path, prefix: str = "train"): + """绘制各形态前向收益的箱线图""" + horizons = [c for c in fwd_returns.columns if c.startswith('fwd_')] + n_horizons = len(horizons) + if n_horizons == 0: + return + + # 筛选有足够样本的形态 + valid_patterns = {name: sig for name, sig in patterns.items() if sig.sum() >= 3} + if not valid_patterns: + return + + n_patterns = len(valid_patterns) + fig, axes = plt.subplots(1, n_horizons, figsize=(4 * n_horizons, max(6, n_patterns * 0.4))) + if n_horizons == 1: + axes = [axes] + + for ax_idx, horizon in enumerate(horizons): + data_list = [] + labels = [] + for name, sig in valid_patterns.items(): + mask = sig > 0 + ret = fwd_returns.loc[mask, horizon].dropna() + if len(ret) > 0: + data_list.append(ret.values) + labels.append(f"{name} (n={len(ret)})") + + if data_list: + bp = axes[ax_idx].boxplot(data_list, vert=False, patch_artist=True, widths=0.6) + for patch, name in zip(bp['boxes'], valid_patterns.keys()): + direction = PATTERN_EXPECTED_DIRECTION.get(name, 0) + patch.set_facecolor('#a8e6cf' if direction >= 0 else '#ffb3b3') + patch.set_alpha(0.7) + axes[ax_idx].set_yticklabels(labels, fontsize=7) + axes[ax_idx].axvline(x=0, color='red', linestyle='--', linewidth=0.8, alpha=0.7) + axes[ax_idx].set_xlabel('Log Return') + horizon_label = horizon.replace('fwd_', '').replace('d', '-day') + axes[ax_idx].set_title(f'{horizon_label} Forward Return') + + plt.suptitle(f'Pattern Forward Returns - {prefix.upper()} Set', fontsize=13) + plt.tight_layout() + fig.savefig(output_dir / f"pattern_forward_returns_{prefix}.png", dpi=150, bbox_inches='tight') + plt.close(fig) + print(f" [saved] pattern_forward_returns_{prefix}.png") + + +def plot_hit_rate_with_ci(results_df: pd.DataFrame, output_dir: Path, prefix: str = "train"): + """绘制命中率 + Wilson 置信区间""" + # 筛选有效数据 + valid = results_df.dropna(subset=['hit_rate', 'wilson_ci_lower', 'wilson_ci_upper']) + if len(valid) == 0: + return + + fig, ax = plt.subplots(figsize=(12, max(6, len(valid) * 0.5))) + + names = valid.index.tolist() + hit_rates = valid['hit_rate'].values + ci_lower = valid['wilson_ci_lower'].values + ci_upper = valid['wilson_ci_upper'].values + + y_pos = range(len(names)) + # 置信区间误差条 + xerr_lower = hit_rates - ci_lower + xerr_upper = ci_upper - hit_rates + xerr = np.array([xerr_lower, xerr_upper]) + + colors = ['#2ecc71' if hr > 0.5 else '#e74c3c' for hr in hit_rates] + ax.barh(y_pos, hit_rates, xerr=xerr, color=colors, 
edgecolor='gray', + linewidth=0.5, alpha=0.8, capsize=3, ecolor='black') + ax.axvline(x=0.5, color='blue', linestyle='--', linewidth=1.0, label='50% baseline') + + # 标注 FDR 校正结果 + if 'binom_adj_pval' in valid.columns: + for i, name in enumerate(names): + adj_p = valid.loc[name, 'binom_adj_pval'] + marker = '' + if not np.isnan(adj_p): + if adj_p < 0.01: + marker = ' ***' + elif adj_p < 0.05: + marker = ' **' + elif adj_p < 0.10: + marker = ' *' + ax.text(ci_upper[i] + 0.01, i, f"{hit_rates[i]:.1%}{marker}", va='center', fontsize=8) + else: + for i in range(len(names)): + ax.text(ci_upper[i] + 0.01, i, f"{hit_rates[i]:.1%}", va='center', fontsize=8) + + ax.set_yticks(y_pos) + ax.set_yticklabels(names, fontsize=9) + ax.set_xlabel('Hit Rate') + ax.set_title(f'Pattern Hit Rate with Wilson CI - {prefix.upper()} Set\n(* p<0.10, ** p<0.05, *** p<0.01 after FDR)') + ax.legend(fontsize=9) + ax.set_xlim(0, 1) + + plt.tight_layout() + fig.savefig(output_dir / f"pattern_hit_rate_{prefix}.png", dpi=150, bbox_inches='tight') + plt.close(fig) + print(f" [saved] pattern_hit_rate_{prefix}.png") + + +# ============================================================ +# 6. 主流程 +# ============================================================ + +def evaluate_patterns_on_set(df: pd.DataFrame, patterns: Dict[str, pd.Series], + set_name: str) -> pd.DataFrame: + """ + 在给定数据集上评估所有形态 + + 参数: + df: 数据集 DataFrame (含 OHLCV) + patterns: 形态信号字典 + set_name: 数据集名称(用于打印) + + 返回: + 包含统计结果的 DataFrame + """ + close = df['close'] + fwd_returns = calc_forward_returns_multi(close, horizons=[1, 3, 5, 10, 20]) + + results = {} + for name, signal in patterns.items(): + sig = signal.reindex(df.index).fillna(0) + expected_dir = PATTERN_EXPECTED_DIRECTION.get(name, 0) + results[name] = analyze_pattern_returns(sig, fwd_returns, expected_dir) + + results_df = pd.DataFrame(results).T + results_df.index.name = 'pattern' + + print(f"\n{'='*60}") + print(f" {set_name} 数据集形态评估结果") + print(f"{'='*60}") + + # 打印形态出现次数 + print(f"\n 形态出现次数:") + for name in results_df.index: + n = int(results_df.loc[name, 'n_occurrences']) + print(f" {name}: {n} 次") + + return results_df + + +def apply_fdr_to_patterns(results_df: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame: + """ + 对形态检验的多个 p 值进行 FDR 校正 + + 校正的 p 值列: + - 各前向周期的 t-test p 值 + - 二项检验 p 值 + """ + # t-test p 值列 + ttest_cols = [c for c in results_df.columns if c.endswith('_ttest_pval')] + all_pval_cols = ttest_cols.copy() + + if 'binom_pval' in results_df.columns: + all_pval_cols.append('binom_pval') + + for col in all_pval_cols: + pvals = results_df[col].values.astype(float) + rejected, adjusted = benjamini_hochberg(pvals, alpha) + adj_col = col.replace('_pval', '_adj_pval') + rej_col = col.replace('_pval', '_rejected') + results_df[adj_col] = adjusted + results_df[rej_col] = rejected + + return results_df + + +def run_patterns_analysis(df: pd.DataFrame, output_dir: str) -> Dict: + """ + K线形态识别与统计验证主入口 + + 参数: + df: 完整的日线 DataFrame(含 open/high/low/close/volume 等列,DatetimeIndex) + output_dir: 图表输出目录 + + 返回: + 包含训练集和验证集结果的字典 + """ + output_dir = Path(output_dir) + output_dir.mkdir(parents=True, exist_ok=True) + + print("=" * 60) + print(" K线形态识别与统计验证") + print("=" * 60) + + # --- 数据切分 --- + train, val, test = split_data(df) + print(f"\n训练集: {train.index.min()} ~ {train.index.max()} ({len(train)} bars)") + print(f"验证集: {val.index.min()} ~ {val.index.max()} ({len(val)} bars)") + + # --- 检测所有形态(在全量数据上计算) --- + all_patterns = detect_all_patterns(df) + print(f"\n共检测 {len(all_patterns)} 种K线形态") + + # 
============ 训练集评估 ============ + train_results = evaluate_patterns_on_set(train, all_patterns, "训练集 (TRAIN)") + + # FDR 校正 + train_results = apply_fdr_to_patterns(train_results, alpha=0.05) + + # 找出显著形态 + reject_cols = [c for c in train_results.columns if c.endswith('_rejected')] + if reject_cols: + train_results['any_fdr_pass'] = train_results[reject_cols].any(axis=1) + fdr_passed_train = train_results[train_results['any_fdr_pass']].index.tolist() + else: + fdr_passed_train = [] + + print(f"\n--- FDR 校正结果 (训练集) ---") + if fdr_passed_train: + print(f" 通过 FDR 校正的形态 ({len(fdr_passed_train)} 个):") + for name in fdr_passed_train: + row = train_results.loc[name] + hr = row.get('hit_rate', np.nan) + n = int(row.get('n_occurrences', 0)) + hr_str = f", hit_rate={hr:.1%}" if not np.isnan(hr) else "" + print(f" - {name}: n={n}{hr_str}") + else: + print(" 没有形态通过 FDR 校正(alpha=0.05)") + + # --- 训练集可视化 --- + print("\n--- 训练集可视化 ---") + train_counts = {name: int(train_results.loc[name, 'n_occurrences']) for name in train_results.index} + plot_pattern_counts(train_counts, output_dir, prefix="train") + + train_patterns_in_set = {name: sig.reindex(train.index).fillna(0) for name, sig in all_patterns.items()} + train_fwd = calc_forward_returns_multi(train['close'], horizons=[1, 3, 5, 10, 20]) + plot_forward_return_boxplots(train_patterns_in_set, train_fwd, output_dir, prefix="train") + plot_hit_rate_with_ci(train_results, output_dir, prefix="train") + + # ============ 验证集评估 ============ + val_results = evaluate_patterns_on_set(val, all_patterns, "验证集 (VAL)") + val_results = apply_fdr_to_patterns(val_results, alpha=0.05) + + reject_cols_val = [c for c in val_results.columns if c.endswith('_rejected')] + if reject_cols_val: + val_results['any_fdr_pass'] = val_results[reject_cols_val].any(axis=1) + fdr_passed_val = val_results[val_results['any_fdr_pass']].index.tolist() + else: + fdr_passed_val = [] + + print(f"\n--- FDR 校正结果 (验证集) ---") + if fdr_passed_val: + print(f" 通过 FDR 校正的形态 ({len(fdr_passed_val)} 个):") + for name in fdr_passed_val: + row = val_results.loc[name] + hr = row.get('hit_rate', np.nan) + n = int(row.get('n_occurrences', 0)) + hr_str = f", hit_rate={hr:.1%}" if not np.isnan(hr) else "" + print(f" - {name}: n={n}{hr_str}") + else: + print(" 没有形态通过 FDR 校正(alpha=0.05)") + + # --- 训练集 vs 验证集对比 --- + if 'hit_rate' in train_results.columns and 'hit_rate' in val_results.columns: + print(f"\n--- 训练集 vs 验证集命中率对比 ---") + for name in train_results.index: + tr_hr = train_results.loc[name, 'hit_rate'] if name in train_results.index else np.nan + va_hr = val_results.loc[name, 'hit_rate'] if name in val_results.index else np.nan + if np.isnan(tr_hr) or np.isnan(va_hr): + continue + diff = va_hr - tr_hr + label = "STABLE" if abs(diff) < 0.05 else ("IMPROVE" if diff > 0 else "DECAY") + print(f" {name}: train={tr_hr:.1%}, val={va_hr:.1%}, diff={diff:+.1%} [{label}]") + + # --- 验证集可视化 --- + print("\n--- 验证集可视化 ---") + val_counts = {name: int(val_results.loc[name, 'n_occurrences']) for name in val_results.index} + plot_pattern_counts(val_counts, output_dir, prefix="val") + + val_patterns_in_set = {name: sig.reindex(val.index).fillna(0) for name, sig in all_patterns.items()} + val_fwd = calc_forward_returns_multi(val['close'], horizons=[1, 3, 5, 10, 20]) + plot_forward_return_boxplots(val_patterns_in_set, val_fwd, output_dir, prefix="val") + plot_hit_rate_with_ci(val_results, output_dir, prefix="val") + + print(f"\n{'='*60}") + print(" K线形态识别与统计验证完成") + print(f"{'='*60}") + + return { + 'train_results': 
train_results, + 'val_results': val_results, + 'fdr_passed_train': fdr_passed_train, + 'fdr_passed_val': fdr_passed_val, + 'all_patterns': all_patterns, + } diff --git a/src/power_law_analysis.py b/src/power_law_analysis.py new file mode 100644 index 0000000..e83c67f --- /dev/null +++ b/src/power_law_analysis.py @@ -0,0 +1,468 @@ +"""幂律增长拟合与走廊模型分析 + +通过幂律模型拟合BTC价格的长期增长趋势,构建价格走廊, +并与指数增长模型进行比较,评估当前价格在历史分布中的位置。 +""" + +import matplotlib +matplotlib.use('Agg') + +import numpy as np +import pandas as pd +import matplotlib.pyplot as plt +from scipy import stats +from scipy.optimize import curve_fit +from pathlib import Path +from typing import Tuple, Dict + +# 中文显示支持 +plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei', 'DejaVu Sans'] +plt.rcParams['axes.unicode_minus'] = False + + +def _compute_days_since_start(df: pd.DataFrame) -> np.ndarray: + """计算距离起始日的天数(从1开始,避免log(0))""" + days = (df.index - df.index[0]).days.astype(float) + 1.0 + return days + + +def _fit_power_law(log_days: np.ndarray, log_prices: np.ndarray) -> Dict: + """对数-对数线性回归拟合幂律模型 + + 模型: log(price) = slope * log(days) + intercept + 等价于: price = exp(intercept) * days^slope + + Returns + ------- + dict + 包含 slope, intercept, r_squared, residuals, fitted_values + """ + slope, intercept, r_value, p_value, std_err = stats.linregress(log_days, log_prices) + fitted = slope * log_days + intercept + residuals = log_prices - fitted + + return { + 'slope': slope, # 幂律指数 α + 'intercept': intercept, # log(c) + 'r_squared': r_value ** 2, + 'p_value': p_value, + 'std_err': std_err, + 'residuals': residuals, + 'fitted_values': fitted, + } + + +def _build_corridor( + log_days: np.ndarray, + fit_result: Dict, + quantiles: Tuple[float, ...] = (0.05, 0.50, 0.95), +) -> Dict[float, np.ndarray]: + """基于残差分位数构建幂律走廊 + + Parameters + ---------- + log_days : array + log(天数) 序列 + fit_result : dict + 幂律拟合结果 + quantiles : tuple + 走廊分位数 + + Returns + ------- + dict + 分位数 -> 走廊价格(原始尺度) + """ + residuals = fit_result['residuals'] + corridor = {} + for q in quantiles: + q_val = np.quantile(residuals, q) + # log_price = slope * log_days + intercept + quantile_offset + log_price_band = fit_result['slope'] * log_days + fit_result['intercept'] + q_val + corridor[q] = np.exp(log_price_band) + return corridor + + +def _power_law_func(days: np.ndarray, c: float, alpha: float) -> np.ndarray: + """幂律函数: price = c * days^alpha""" + return c * np.power(days, alpha) + + +def _exponential_func(days: np.ndarray, c: float, beta: float) -> np.ndarray: + """指数函数: price = c * exp(beta * days)""" + return c * np.exp(beta * days) + + +def _compute_aic_bic(n: int, k: int, rss: float) -> Tuple[float, float]: + """计算AIC和BIC + + Parameters + ---------- + n : int + 样本量 + k : int + 模型参数个数 + rss : float + 残差平方和 + + Returns + ------- + tuple + (AIC, BIC) + """ + # 对数似然 (假设正态分布残差) + log_likelihood = -n / 2 * (np.log(2 * np.pi * rss / n) + 1) + aic = 2 * k - 2 * log_likelihood + bic = k * np.log(n) - 2 * log_likelihood + return aic, bic + + +def _fit_and_compare_models( + days: np.ndarray, prices: np.ndarray +) -> Dict: + """拟合幂律和指数增长模型并比较AIC/BIC + + Returns + ------- + dict + 包含两个模型的参数、AIC、BIC及比较结论 + """ + n = len(prices) + k = 2 # 两个模型都有2个参数 + + # --- 幂律拟合: price = c * days^alpha --- + try: + popt_pl, _ = curve_fit( + _power_law_func, days, prices, + p0=[1.0, 1.5], maxfev=10000 + ) + prices_pred_pl = _power_law_func(days, *popt_pl) + rss_pl = np.sum((prices - prices_pred_pl) ** 2) + aic_pl, bic_pl = _compute_aic_bic(n, k, rss_pl) + except RuntimeError: + # curve_fit 
失败时回退到对数空间OLS估计 + log_d = np.log(days) + log_p = np.log(prices) + slope, intercept, _, _, _ = stats.linregress(log_d, log_p) + popt_pl = [np.exp(intercept), slope] + prices_pred_pl = _power_law_func(days, *popt_pl) + rss_pl = np.sum((prices - prices_pred_pl) ** 2) + aic_pl, bic_pl = _compute_aic_bic(n, k, rss_pl) + + # --- 指数拟合: price = c * exp(beta * days) --- + # 初始值通过log空间OLS估计 + log_p = np.log(prices) + beta_init, log_c_init, _, _, _ = stats.linregress(days, log_p) + try: + popt_exp, _ = curve_fit( + _exponential_func, days, prices, + p0=[np.exp(log_c_init), beta_init], maxfev=10000 + ) + prices_pred_exp = _exponential_func(days, *popt_exp) + rss_exp = np.sum((prices - prices_pred_exp) ** 2) + aic_exp, bic_exp = _compute_aic_bic(n, k, rss_exp) + except (RuntimeError, OverflowError): + # 指数拟合容易溢出,使用log空间线性回归作替代 + popt_exp = [np.exp(log_c_init), beta_init] + prices_pred_exp = _exponential_func(days, *popt_exp) + # 裁剪防止溢出 + prices_pred_exp = np.clip(prices_pred_exp, 0, prices.max() * 100) + rss_exp = np.sum((prices - prices_pred_exp) ** 2) + aic_exp, bic_exp = _compute_aic_bic(n, k, rss_exp) + + return { + 'power_law': { + 'params': {'c': popt_pl[0], 'alpha': popt_pl[1]}, + 'aic': aic_pl, + 'bic': bic_pl, + 'rss': rss_pl, + 'predicted': prices_pred_pl, + }, + 'exponential': { + 'params': {'c': popt_exp[0], 'beta': popt_exp[1]}, + 'aic': aic_exp, + 'bic': bic_exp, + 'rss': rss_exp, + 'predicted': prices_pred_exp, + }, + 'preferred': 'power_law' if aic_pl < aic_exp else 'exponential', + } + + +def _compute_current_percentile(residuals: np.ndarray) -> float: + """计算当前价格(最后一个残差)在历史残差分布中的百分位 + + Returns + ------- + float + 百分位数 (0-100) + """ + current_residual = residuals[-1] + percentile = stats.percentileofscore(residuals, current_residual) + return percentile + + +# ============================================================================= +# 可视化函数 +# ============================================================================= + +def _plot_loglog_regression( + log_days: np.ndarray, + log_prices: np.ndarray, + fit_result: Dict, + dates: pd.DatetimeIndex, + output_dir: Path, +): + """图1: 对数-对数散点图 + 回归线""" + fig, ax = plt.subplots(figsize=(12, 7)) + + ax.scatter(log_days, log_prices, s=3, alpha=0.5, color='steelblue', label='实际价格') + ax.plot(log_days, fit_result['fitted_values'], color='red', linewidth=2, + label=f"回归线: slope={fit_result['slope']:.4f}, R²={fit_result['r_squared']:.4f}") + + ax.set_xlabel('log(天数)', fontsize=12) + ax.set_ylabel('log(价格)', fontsize=12) + ax.set_title('BTC 幂律拟合 — 对数-对数回归', fontsize=14) + ax.legend(fontsize=11) + ax.grid(True, alpha=0.3) + + fig.savefig(output_dir / 'power_law_loglog_regression.png', dpi=150, bbox_inches='tight') + plt.close(fig) + print(f" [图] 对数-对数回归已保存: {output_dir / 'power_law_loglog_regression.png'}") + + +def _plot_corridor( + df: pd.DataFrame, + days: np.ndarray, + corridor: Dict[float, np.ndarray], + fit_result: Dict, + output_dir: Path, +): + """图2: 幂律走廊模型(价格 + 5%/50%/95% 通道)""" + fig, ax = plt.subplots(figsize=(14, 7)) + + # 实际价格 + ax.semilogy(df.index, df['close'], color='black', linewidth=0.8, label='BTC 收盘价') + + # 走廊带 + colors = {0.05: 'green', 0.50: 'orange', 0.95: 'red'} + labels = {0.05: '5% 下界', 0.50: '50% 中位线', 0.95: '95% 上界'} + for q, band in corridor.items(): + ax.semilogy(df.index, band, color=colors[q], linewidth=1.5, + linestyle='--', label=labels[q]) + + # 填充走廊区间 + ax.fill_between(df.index, corridor[0.05], corridor[0.95], + alpha=0.1, color='blue', label='90% 走廊区间') + + ax.set_xlabel('日期', fontsize=12) + 
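+    # 注(补充说明): 走廊各分位带由全样本残差分位数平移得到,
+    # 属于样本内描述统计;对未来价格仅作参考区间,并非预测保证。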
ax.set_ylabel('价格 (USDT, 对数尺度)', fontsize=12) + ax.set_title('BTC 幂律走廊模型', fontsize=14) + ax.legend(fontsize=10, loc='upper left') + ax.grid(True, alpha=0.3, which='both') + + fig.savefig(output_dir / 'power_law_corridor.png', dpi=150, bbox_inches='tight') + plt.close(fig) + print(f" [图] 幂律走廊已保存: {output_dir / 'power_law_corridor.png'}") + + +def _plot_model_comparison( + df: pd.DataFrame, + days: np.ndarray, + comparison: Dict, + output_dir: Path, +): + """图3: 幂律 vs 指数增长模型对比""" + fig, axes = plt.subplots(1, 2, figsize=(16, 7)) + + # 左图: 价格对比 + ax1 = axes[0] + ax1.semilogy(df.index, df['close'], color='black', linewidth=0.8, label='实际价格') + ax1.semilogy(df.index, comparison['power_law']['predicted'], + color='blue', linewidth=1.5, linestyle='--', label='幂律拟合') + ax1.semilogy(df.index, np.clip(comparison['exponential']['predicted'], 1e-1, None), + color='red', linewidth=1.5, linestyle='--', label='指数拟合') + ax1.set_xlabel('日期', fontsize=11) + ax1.set_ylabel('价格 (USDT, 对数尺度)', fontsize=11) + ax1.set_title('模型拟合对比', fontsize=13) + ax1.legend(fontsize=10) + ax1.grid(True, alpha=0.3, which='both') + + # 右图: AIC/BIC 柱状图 + ax2 = axes[1] + models = ['幂律模型', '指数模型'] + aic_vals = [comparison['power_law']['aic'], comparison['exponential']['aic']] + bic_vals = [comparison['power_law']['bic'], comparison['exponential']['bic']] + + x = np.arange(len(models)) + width = 0.35 + bars1 = ax2.bar(x - width / 2, aic_vals, width, label='AIC', color='steelblue') + bars2 = ax2.bar(x + width / 2, bic_vals, width, label='BIC', color='coral') + + ax2.set_xticks(x) + ax2.set_xticklabels(models, fontsize=11) + ax2.set_ylabel('信息准则值', fontsize=11) + ax2.set_title('AIC / BIC 模型比较', fontsize=13) + ax2.legend(fontsize=10) + ax2.grid(True, alpha=0.3, axis='y') + + # 添加数值标签 + for bar in bars1: + ax2.text(bar.get_x() + bar.get_width() / 2, bar.get_height(), + f'{bar.get_height():.0f}', ha='center', va='bottom', fontsize=9) + for bar in bars2: + ax2.text(bar.get_x() + bar.get_width() / 2, bar.get_height(), + f'{bar.get_height():.0f}', ha='center', va='bottom', fontsize=9) + + fig.tight_layout() + fig.savefig(output_dir / 'power_law_model_comparison.png', dpi=150, bbox_inches='tight') + plt.close(fig) + print(f" [图] 模型对比已保存: {output_dir / 'power_law_model_comparison.png'}") + + +def _plot_residual_distribution( + residuals: np.ndarray, + current_percentile: float, + output_dir: Path, +): + """图4: 残差分布 + 当前位置""" + fig, ax = plt.subplots(figsize=(10, 6)) + + ax.hist(residuals, bins=60, density=True, alpha=0.6, color='steelblue', + edgecolor='white', label='残差分布') + + # 当前位置 + current_res = residuals[-1] + ax.axvline(current_res, color='red', linewidth=2, linestyle='--', + label=f'当前位置: {current_percentile:.1f}%') + + # 分位数线 + for q, color, label in [(0.05, 'green', '5%'), (0.50, 'orange', '50%'), (0.95, 'red', '95%')]: + q_val = np.quantile(residuals, q) + ax.axvline(q_val, color=color, linewidth=1, linestyle=':', + alpha=0.7, label=f'{label} 分位: {q_val:.3f}') + + ax.set_xlabel('残差 (log尺度)', fontsize=12) + ax.set_ylabel('密度', fontsize=12) + ax.set_title(f'幂律残差分布 — 当前价格位于 {current_percentile:.1f}% 分位', fontsize=14) + ax.legend(fontsize=9) + ax.grid(True, alpha=0.3) + + fig.savefig(output_dir / 'power_law_residual_distribution.png', dpi=150, bbox_inches='tight') + plt.close(fig) + print(f" [图] 残差分布已保存: {output_dir / 'power_law_residual_distribution.png'}") + + +# ============================================================================= +# 主入口 +# ============================================================================= + 
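+# 在进入主入口之前,先给出一个补充示例(假设性草图,非本模块既有 API):
+# 演示如何用 _fit_power_law 返回的 slope/intercept/residuals 把幂律走廊
+# 外推到未来第 target_day 天;函数名 _extrapolate_corridor_price 为本示例
+# 自拟,公式与 _build_corridor 一致:
+#     price = exp(slope * log(day) + intercept + 残差分位偏移)
+def _extrapolate_corridor_price(fit_result: Dict, target_day: float,
+                                quantile: float = 0.50) -> float:
+    """按幂律拟合结果外推第 target_day 天的走廊分位价格(示意)"""
+    # 残差分位偏移: 0.50 给出中位线,0.05/0.95 给出走廊下/上界
+    q_offset = np.quantile(fit_result['residuals'], quantile)
+    log_price = fit_result['slope'] * np.log(target_day) + fit_result['intercept'] + q_offset
+    return float(np.exp(log_price))
+
+
+# 用法示意(变量名为假设): _extrapolate_corridor_price(fit_result, days[-1] + 365, 0.95)
+# 即估计一年后 95% 分位的走廊价格。
+
+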
+def run_power_law_analysis(df: pd.DataFrame, output_dir: str = "output") -> Dict: + """幂律增长拟合与走廊模型 — 主入口函数 + + Parameters + ---------- + df : pd.DataFrame + 由 data_loader.load_daily() 返回的日线数据,含 DatetimeIndex 和 close 列 + output_dir : str + 图表输出目录 + + Returns + ------- + dict + 分析结果摘要 + """ + output_dir = Path(output_dir) + output_dir.mkdir(parents=True, exist_ok=True) + + print("=" * 60) + print(" BTC 幂律增长分析") + print("=" * 60) + + prices = df['close'].dropna() + + # ---- 步骤1: 准备数据 ---- + days = _compute_days_since_start(df.loc[prices.index]) + log_days = np.log(days) + log_prices = np.log(prices.values) + + print(f"\n数据范围: {prices.index[0].date()} ~ {prices.index[-1].date()}") + print(f"样本数量: {len(prices)}") + + # ---- 步骤2: 对数-对数线性回归 ---- + print("\n--- 对数-对数线性回归 ---") + fit_result = _fit_power_law(log_days, log_prices) + print(f" 幂律指数 (slope/α): {fit_result['slope']:.6f}") + print(f" 截距 log(c): {fit_result['intercept']:.6f}") + print(f" 等价系数 c: {np.exp(fit_result['intercept']):.6f}") + print(f" R²: {fit_result['r_squared']:.6f}") + print(f" p-value: {fit_result['p_value']:.2e}") + print(f" 标准误差: {fit_result['std_err']:.6f}") + + # ---- 步骤3: 幂律走廊模型 ---- + print("\n--- 幂律走廊模型 ---") + quantiles = (0.05, 0.50, 0.95) + corridor = _build_corridor(log_days, fit_result, quantiles) + for q in quantiles: + print(f" {int(q * 100):>3d}% 分位当前走廊价格: ${corridor[q][-1]:,.0f}") + + # ---- 步骤4: 模型比较 (幂律 vs 指数) ---- + print("\n--- 模型比较: 幂律 vs 指数 ---") + comparison = _fit_and_compare_models(days, prices.values) + + pl = comparison['power_law'] + exp = comparison['exponential'] + print(f" 幂律模型: c={pl['params']['c']:.4f}, α={pl['params']['alpha']:.4f}") + print(f" AIC={pl['aic']:.0f}, BIC={pl['bic']:.0f}") + print(f" 指数模型: c={exp['params']['c']:.4f}, β={exp['params']['beta']:.6f}") + print(f" AIC={exp['aic']:.0f}, BIC={exp['bic']:.0f}") + print(f" AIC 差值 (幂律-指数): {pl['aic'] - exp['aic']:.0f}") + print(f" BIC 差值 (幂律-指数): {pl['bic'] - exp['bic']:.0f}") + print(f" >> 优选模型: {comparison['preferred']}") + + # ---- 步骤5: 当前价格位置 ---- + print("\n--- 当前价格位置 ---") + current_percentile = _compute_current_percentile(fit_result['residuals']) + current_price = prices.iloc[-1] + print(f" 当前价格: ${current_price:,.2f}") + print(f" 历史残差分位: {current_percentile:.1f}%") + if current_percentile > 90: + print(" >> 警告: 当前价格处于历史高估区域") + elif current_percentile < 10: + print(" >> 提示: 当前价格处于历史低估区域") + else: + print(" >> 当前价格处于历史正常波动范围内") + + # ---- 步骤6: 生成可视化 ---- + print("\n--- 生成可视化图表 ---") + _plot_loglog_regression(log_days, log_prices, fit_result, prices.index, output_dir) + _plot_corridor(df.loc[prices.index], days, corridor, fit_result, output_dir) + _plot_model_comparison(df.loc[prices.index], days, comparison, output_dir) + _plot_residual_distribution(fit_result['residuals'], current_percentile, output_dir) + + print("\n" + "=" * 60) + print(" 幂律分析完成") + print("=" * 60) + + # 返回结果摘要 + return { + 'r_squared': fit_result['r_squared'], + 'power_exponent': fit_result['slope'], + 'intercept': fit_result['intercept'], + 'corridor_prices': {q: corridor[q][-1] for q in quantiles}, + 'model_comparison': { + 'power_law_aic': pl['aic'], + 'power_law_bic': pl['bic'], + 'exponential_aic': exp['aic'], + 'exponential_bic': exp['bic'], + 'preferred': comparison['preferred'], + }, + 'current_price': current_price, + 'current_percentile': current_percentile, + } + + +if __name__ == '__main__': + from data_loader import load_daily + df = load_daily() + results = run_power_law_analysis(df, output_dir='../output/power_law') diff --git 
a/src/preprocessing.py b/src/preprocessing.py new file mode 100644 index 0000000..c9a0ea5 --- /dev/null +++ b/src/preprocessing.py @@ -0,0 +1,80 @@ +"""数据预处理模块 - 收益率、去趋势、标准化、衍生指标""" + +import pandas as pd +import numpy as np +from typing import Optional + + +def log_returns(prices: pd.Series) -> pd.Series: + """对数收益率""" + return np.log(prices / prices.shift(1)).dropna() + + +def simple_returns(prices: pd.Series) -> pd.Series: + """简单收益率""" + return prices.pct_change().dropna() + + +def detrend_log_diff(prices: pd.Series) -> pd.Series: + """对数差分去趋势""" + return np.log(prices).diff().dropna() + + +def detrend_linear(series: pd.Series) -> pd.Series: + """线性去趋势""" + x = np.arange(len(series)) + coeffs = np.polyfit(x, series.values, 1) + trend = np.polyval(coeffs, x) + return pd.Series(series.values - trend, index=series.index) + + +def hp_filter(series: pd.Series, lamb: float = 1600) -> tuple: + """Hodrick-Prescott 滤波器""" + from statsmodels.tsa.filters.hp_filter import hpfilter + cycle, trend = hpfilter(series.dropna(), lamb=lamb) + return cycle, trend + + +def rolling_volatility(returns: pd.Series, window: int = 30) -> pd.Series: + """滚动波动率(年化)""" + return returns.rolling(window=window).std() * np.sqrt(365) + + +def realized_volatility(returns: pd.Series, window: int = 30) -> pd.Series: + """已实现波动率""" + return np.sqrt((returns ** 2).rolling(window=window).sum()) + + +def taker_buy_ratio(df: pd.DataFrame) -> pd.Series: + """Taker买入比例""" + return df["taker_buy_volume"] / df["volume"].replace(0, np.nan) + + +def add_derived_features(df: pd.DataFrame) -> pd.DataFrame: + """添加常用衍生特征列""" + out = df.copy() + out["log_return"] = log_returns(df["close"]) + out["simple_return"] = simple_returns(df["close"]) + out["log_price"] = np.log(df["close"]) + out["range_pct"] = (df["high"] - df["low"]) / df["close"] + out["body_pct"] = (df["close"] - df["open"]) / df["open"] + out["taker_buy_ratio"] = taker_buy_ratio(df) + out["vol_30d"] = rolling_volatility(out["log_return"], 30) + out["vol_7d"] = rolling_volatility(out["log_return"], 7) + out["volume_ma20"] = df["volume"].rolling(20).mean() + out["volume_ratio"] = df["volume"] / out["volume_ma20"] + out["abs_return"] = out["log_return"].abs() + out["squared_return"] = out["log_return"] ** 2 + return out + + +def standardize(series: pd.Series) -> pd.Series: + """Z-score标准化""" + return (series - series.mean()) / series.std() + + +def winsorize(series: pd.Series, lower: float = 0.01, upper: float = 0.99) -> pd.Series: + """Winsorize处理极端值""" + lo = series.quantile(lower) + hi = series.quantile(upper) + return series.clip(lo, hi) diff --git a/src/returns_analysis.py b/src/returns_analysis.py new file mode 100644 index 0000000..f965756 --- /dev/null +++ b/src/returns_analysis.py @@ -0,0 +1,479 @@ +"""收益率分布分析与GARCH建模模块 + +分析内容: +- 正态性检验(KS、JB、AD) +- 厚尾特征分析(峰度、偏度、超越比率) +- 多时间尺度收益率分布对比 +- QQ图 +- GARCH(1,1) 条件波动率建模 +""" + +import matplotlib +matplotlib.use('Agg') + +import numpy as np +import pandas as pd +import matplotlib.pyplot as plt +from matplotlib.gridspec import GridSpec +from scipy import stats +from pathlib import Path +from typing import Optional + +from src.data_loader import load_klines +from src.preprocessing import log_returns + + +# ============================================================ +# 1. 
正态性检验
+# ============================================================
+
+def normality_tests(returns: pd.Series) -> dict:
+    """
+    对收益率序列进行多种正态性检验
+
+    Parameters
+    ----------
+    returns : pd.Series
+        对数收益率序列(已去除NaN)
+
+    Returns
+    -------
+    dict
+        包含KS、JB、AD检验统计量和p值的字典
+    """
+    r = returns.dropna().values
+
+    # Kolmogorov-Smirnov 检验(与标准正态比较)
+    # 注: 均值/标准差由同一样本估计,标准 KS p 值会偏保守(Lilliefors 情形);
+    # 本例 p 值极小,结论不受影响,且与 JB/AD 检验互相印证
+    r_standardized = (r - r.mean()) / r.std()
+    ks_stat, ks_p = stats.kstest(r_standardized, 'norm')
+
+    # Jarque-Bera 检验
+    jb_stat, jb_p = stats.jarque_bera(r)
+
+    # Anderson-Darling 检验
+    ad_result = stats.anderson(r, dist='norm')
+
+    results = {
+        'ks_statistic': ks_stat,
+        'ks_pvalue': ks_p,
+        'jb_statistic': jb_stat,
+        'jb_pvalue': jb_p,
+        'ad_statistic': ad_result.statistic,
+        'ad_critical_values': dict(zip(
+            [f'{sl}%' for sl in ad_result.significance_level],
+            ad_result.critical_values
+        )),
+    }
+    return results
+
+
+# ============================================================
+# 2. 厚尾分析
+# ============================================================
+
+def fat_tail_analysis(returns: pd.Series) -> dict:
+    """
+    厚尾特征分析:峰度、偏度、σ超越比率
+
+    Parameters
+    ----------
+    returns : pd.Series
+        对数收益率序列
+
+    Returns
+    -------
+    dict
+        峰度、偏度、3σ/4σ超越比率及其与正态分布的对比
+    """
+    r = returns.dropna().values
+    mu, sigma = r.mean(), r.std()
+
+    # 基础统计
+    excess_kurtosis = stats.kurtosis(r)  # scipy默认是excess kurtosis
+    skewness = stats.skew(r)
+
+    # 实际超越比率
+    r_std = (r - mu) / sigma
+    exceed_3sigma = np.mean(np.abs(r_std) > 3)
+    exceed_4sigma = np.mean(np.abs(r_std) > 4)
+
+    # 正态分布理论超越比率
+    normal_3sigma = 2 * (1 - stats.norm.cdf(3))  # ≈ 0.0027
+    normal_4sigma = 2 * (1 - stats.norm.cdf(4))  # ≈ 6.33e-5(约0.006%)
+
+    results = {
+        'excess_kurtosis': excess_kurtosis,
+        'skewness': skewness,
+        'exceed_3sigma_actual': exceed_3sigma,
+        'exceed_3sigma_normal': normal_3sigma,
+        'exceed_3sigma_ratio': exceed_3sigma / normal_3sigma if normal_3sigma > 0 else np.inf,
+        'exceed_4sigma_actual': exceed_4sigma,
+        'exceed_4sigma_normal': normal_4sigma,
+        'exceed_4sigma_ratio': exceed_4sigma / normal_4sigma if normal_4sigma > 0 else np.inf,
+    }
+    return results
+
+
+# ============================================================
+# 3. 多时间尺度分布对比
+# ============================================================
+
+def multi_timeframe_distributions() -> dict:
+    """
+    加载1h/4h/1d/1w数据,计算各时间尺度的对数收益率分布
+
+    Returns
+    -------
+    dict
+        {interval: pd.Series} 各时间尺度的对数收益率
+    """
+    intervals = ['1h', '4h', '1d', '1w']
+    distributions = {}
+    for interval in intervals:
+        try:
+            df = load_klines(interval)
+            ret = log_returns(df['close'])
+            distributions[interval] = ret
+        except FileNotFoundError:
+            print(f"[警告] {interval} 数据文件不存在,跳过")
+    return distributions
+
+
+# ============================================================
+# 4. 
GARCH(1,1) 建模 +# ============================================================ + +def fit_garch11(returns: pd.Series) -> dict: + """ + 拟合GARCH(1,1)模型 + + Parameters + ---------- + returns : pd.Series + 对数收益率序列(百分比化后传入arch库) + + Returns + ------- + dict + 包含模型参数、持续性、条件波动率序列的字典 + """ + from arch import arch_model + + # arch库推荐使用百分比收益率以改善数值稳定性 + r_pct = returns.dropna() * 100 + + # 拟合GARCH(1,1),均值模型用常数均值 + model = arch_model(r_pct, vol='Garch', p=1, q=1, mean='Constant', dist='Normal') + result = model.fit(disp='off') + + # 提取参数 + params = result.params + omega = params.get('omega', np.nan) + alpha = params.get('alpha[1]', np.nan) + beta = params.get('beta[1]', np.nan) + persistence = alpha + beta + + # 条件波动率(转回原始比例) + cond_vol = result.conditional_volatility / 100 + + results = { + 'model_summary': str(result.summary()), + 'omega': omega, + 'alpha': alpha, + 'beta': beta, + 'persistence': persistence, + 'log_likelihood': result.loglikelihood, + 'aic': result.aic, + 'bic': result.bic, + 'conditional_volatility': cond_vol, + 'result_obj': result, + } + return results + + +# ============================================================ +# 5. 可视化 +# ============================================================ + +def plot_histogram_vs_normal(returns: pd.Series, output_dir: Path): + """绘制收益率直方图与正态分布对比""" + r = returns.dropna().values + mu, sigma = r.mean(), r.std() + + fig, ax = plt.subplots(figsize=(12, 6)) + + # 直方图 + n_bins = 150 + ax.hist(r, bins=n_bins, density=True, alpha=0.65, color='steelblue', + edgecolor='white', linewidth=0.3, label='BTC日对数收益率') + + # 正态分布拟合曲线 + x = np.linspace(r.min(), r.max(), 500) + ax.plot(x, stats.norm.pdf(x, mu, sigma), 'r-', linewidth=2, + label=f'正态分布 N({mu:.5f}, {sigma:.4f}²)') + + ax.set_xlabel('日对数收益率', fontsize=12) + ax.set_ylabel('概率密度', fontsize=12) + ax.set_title('BTC日对数收益率分布 vs 正态分布', fontsize=14) + ax.legend(fontsize=11) + ax.grid(True, alpha=0.3) + + fig.savefig(output_dir / 'returns_histogram_vs_normal.png', + dpi=150, bbox_inches='tight') + plt.close(fig) + print(f"[保存] {output_dir / 'returns_histogram_vs_normal.png'}") + + +def plot_qq(returns: pd.Series, output_dir: Path): + """绘制QQ图""" + fig, ax = plt.subplots(figsize=(8, 8)) + r = returns.dropna().values + + # QQ图 + (osm, osr), (slope, intercept, _) = stats.probplot(r, dist='norm') + ax.scatter(osm, osr, s=5, alpha=0.5, color='steelblue', label='样本分位数') + # 理论线 + x_line = np.array([osm.min(), osm.max()]) + ax.plot(x_line, slope * x_line + intercept, 'r-', linewidth=2, label='理论正态线') + + ax.set_xlabel('理论分位数(正态)', fontsize=12) + ax.set_ylabel('样本分位数', fontsize=12) + ax.set_title('BTC日对数收益率 QQ图', fontsize=14) + ax.legend(fontsize=11) + ax.grid(True, alpha=0.3) + + fig.savefig(output_dir / 'returns_qq_plot.png', + dpi=150, bbox_inches='tight') + plt.close(fig) + print(f"[保存] {output_dir / 'returns_qq_plot.png'}") + + +def plot_multi_timeframe(distributions: dict, output_dir: Path): + """绘制多时间尺度收益率分布对比""" + n_plots = len(distributions) + if n_plots == 0: + print("[警告] 无可用的多时间尺度数据") + return + + fig, axes = plt.subplots(2, 2, figsize=(14, 10)) + axes = axes.flatten() + + interval_names = { + '1h': '1小时', '4h': '4小时', '1d': '1天', '1w': '1周' + } + + for idx, (interval, ret) in enumerate(distributions.items()): + if idx >= 4: + break + ax = axes[idx] + r = ret.dropna().values + mu, sigma = r.mean(), r.std() + + ax.hist(r, bins=100, density=True, alpha=0.65, color='steelblue', + edgecolor='white', linewidth=0.3) + + x = np.linspace(r.min(), r.max(), 500) + ax.plot(x, stats.norm.pdf(x, mu, sigma), 'r-', 
linewidth=1.5) + + # 统计信息 + kurt = stats.kurtosis(r) + skew = stats.skew(r) + label = interval_names.get(interval, interval) + ax.set_title(f'{label}收益率 (峰度={kurt:.2f}, 偏度={skew:.3f})', fontsize=11) + ax.set_xlabel('对数收益率', fontsize=10) + ax.set_ylabel('概率密度', fontsize=10) + ax.grid(True, alpha=0.3) + + # 隐藏多余子图 + for idx in range(len(distributions), 4): + axes[idx].set_visible(False) + + fig.suptitle('多时间尺度BTC对数收益率分布', fontsize=14, y=1.02) + fig.tight_layout() + fig.savefig(output_dir / 'multi_timeframe_distributions.png', + dpi=150, bbox_inches='tight') + plt.close(fig) + print(f"[保存] {output_dir / 'multi_timeframe_distributions.png'}") + + +def plot_garch_conditional_vol(garch_results: dict, output_dir: Path): + """绘制GARCH(1,1)条件波动率时序图""" + cond_vol = garch_results['conditional_volatility'] + + fig, ax = plt.subplots(figsize=(14, 5)) + ax.plot(cond_vol.index, cond_vol.values, linewidth=0.8, color='steelblue') + ax.fill_between(cond_vol.index, 0, cond_vol.values, alpha=0.2, color='steelblue') + + ax.set_xlabel('日期', fontsize=12) + ax.set_ylabel('条件波动率', fontsize=12) + ax.set_title( + f'GARCH(1,1) 条件波动率 ' + f'(α={garch_results["alpha"]:.4f}, β={garch_results["beta"]:.4f}, ' + f'持续性={garch_results["persistence"]:.4f})', + fontsize=13 + ) + ax.grid(True, alpha=0.3) + + fig.savefig(output_dir / 'garch_conditional_volatility.png', + dpi=150, bbox_inches='tight') + plt.close(fig) + print(f"[保存] {output_dir / 'garch_conditional_volatility.png'}") + + +# ============================================================ +# 6. 结果打印 +# ============================================================ + +def print_normality_results(results: dict): + """打印正态性检验结果""" + print("\n" + "=" * 60) + print("正态性检验结果") + print("=" * 60) + + print(f"\n[KS检验] Kolmogorov-Smirnov") + print(f" 统计量: {results['ks_statistic']:.6f}") + print(f" p值: {results['ks_pvalue']:.2e}") + print(f" 结论: {'拒绝正态假设' if results['ks_pvalue'] < 0.05 else '不能拒绝正态假设'}") + + print(f"\n[JB检验] Jarque-Bera") + print(f" 统计量: {results['jb_statistic']:.4f}") + print(f" p值: {results['jb_pvalue']:.2e}") + print(f" 结论: {'拒绝正态假设' if results['jb_pvalue'] < 0.05 else '不能拒绝正态假设'}") + + print(f"\n[AD检验] Anderson-Darling") + print(f" 统计量: {results['ad_statistic']:.4f}") + print(" 临界值:") + for level, cv in results['ad_critical_values'].items(): + reject = results['ad_statistic'] > cv + print(f" {level}: {cv:.4f} {'(拒绝)' if reject else '(不拒绝)'}") + + +def print_fat_tail_results(results: dict): + """打印厚尾分析结果""" + print("\n" + "=" * 60) + print("厚尾特征分析") + print("=" * 60) + print(f" 超额峰度 (excess kurtosis): {results['excess_kurtosis']:.4f}") + print(f" (正态分布=0,值越大尾部越厚)") + print(f" 偏度 (skewness): {results['skewness']:.4f}") + print(f" (正态分布=0,负值表示左偏)") + + print(f"\n 3σ超越比率:") + print(f" 实际: {results['exceed_3sigma_actual']:.6f} " + f"({results['exceed_3sigma_actual'] * 100:.3f}%)") + print(f" 正态: {results['exceed_3sigma_normal']:.6f} " + f"({results['exceed_3sigma_normal'] * 100:.3f}%)") + print(f" 倍数: {results['exceed_3sigma_ratio']:.2f}x") + + print(f"\n 4σ超越比率:") + print(f" 实际: {results['exceed_4sigma_actual']:.6f} " + f"({results['exceed_4sigma_actual'] * 100:.4f}%)") + print(f" 正态: {results['exceed_4sigma_normal']:.6f} " + f"({results['exceed_4sigma_normal'] * 100:.4f}%)") + print(f" 倍数: {results['exceed_4sigma_ratio']:.2f}x") + + +def print_garch_results(results: dict): + """打印GARCH(1,1)建模结果""" + print("\n" + "=" * 60) + print("GARCH(1,1) 建模结果") + print("=" * 60) + print(f" ω (omega): {results['omega']:.6f}") + print(f" α (alpha[1]): {results['alpha']:.6f}") + 
print(f" β (beta[1]): {results['beta']:.6f}") + print(f" 持续性 (α+β): {results['persistence']:.6f}") + print(f" {'高持续性(接近1)→波动率冲击衰减缓慢' if results['persistence'] > 0.9 else '中等持续性'}") + print(f" 对数似然值: {results['log_likelihood']:.4f}") + print(f" AIC: {results['aic']:.4f}") + print(f" BIC: {results['bic']:.4f}") + + +# ============================================================ +# 7. 主入口 +# ============================================================ + +def run_returns_analysis(df: pd.DataFrame, output_dir: str = "output/returns"): + """ + 收益率分布分析主函数 + + Parameters + ---------- + df : pd.DataFrame + 日线K线数据(含'close'列,DatetimeIndex索引) + output_dir : str + 图表输出目录 + """ + output_dir = Path(output_dir) + output_dir.mkdir(parents=True, exist_ok=True) + + print("=" * 60) + print("BTC 收益率分布分析与 GARCH 建模") + print("=" * 60) + print(f"数据范围: {df.index.min()} ~ {df.index.max()}") + print(f"样本数量: {len(df)}") + + # 计算日对数收益率 + daily_returns = log_returns(df['close']) + print(f"日对数收益率样本数: {len(daily_returns)}") + + # --- 正态性检验 --- + print("\n>>> 执行正态性检验...") + norm_results = normality_tests(daily_returns) + print_normality_results(norm_results) + + # --- 厚尾分析 --- + print("\n>>> 执行厚尾分析...") + tail_results = fat_tail_analysis(daily_returns) + print_fat_tail_results(tail_results) + + # --- 多时间尺度分布 --- + print("\n>>> 加载多时间尺度数据...") + distributions = multi_timeframe_distributions() + # 打印各尺度统计 + print("\n多时间尺度对数收益率统计:") + print(f" {'尺度':<8} {'样本数':>8} {'均值':>12} {'标准差':>12} {'峰度':>10} {'偏度':>10}") + print(" " + "-" * 62) + for interval, ret in distributions.items(): + r = ret.dropna().values + print(f" {interval:<8} {len(r):>8d} {r.mean():>12.6f} {r.std():>12.6f} " + f"{stats.kurtosis(r):>10.4f} {stats.skew(r):>10.4f}") + + # --- GARCH(1,1) 建模 --- + print("\n>>> 拟合 GARCH(1,1) 模型...") + garch_results = fit_garch11(daily_returns) + print_garch_results(garch_results) + + # --- 生成可视化 --- + print("\n>>> 生成可视化图表...") + + # 设置中文字体(兼容多系统) + plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei', 'DejaVu Sans'] + plt.rcParams['axes.unicode_minus'] = False + + plot_histogram_vs_normal(daily_returns, output_dir) + plot_qq(daily_returns, output_dir) + plot_multi_timeframe(distributions, output_dir) + plot_garch_conditional_vol(garch_results, output_dir) + + print("\n" + "=" * 60) + print("收益率分布分析完成!") + print(f"图表已保存至: {output_dir.resolve()}") + print("=" * 60) + + # 返回所有结果供后续使用 + return { + 'normality': norm_results, + 'fat_tail': tail_results, + 'multi_timeframe': distributions, + 'garch': garch_results, + } + + +# ============================================================ +# 独立运行入口 +# ============================================================ + +if __name__ == '__main__': + from src.data_loader import load_daily + df = load_daily() + run_returns_analysis(df) diff --git a/src/time_series.py b/src/time_series.py new file mode 100644 index 0000000..3f20e8d --- /dev/null +++ b/src/time_series.py @@ -0,0 +1,804 @@ +"""时间序列预测模块 - ARIMA、Prophet、LSTM/GRU + +对BTC日线数据进行多模型预测与对比评估。 +每个模型独立运行,单个模型失败不影响其他模型。 +""" + +import warnings +import numpy as np +import pandas as pd +import matplotlib +matplotlib.use('Agg') +import matplotlib.pyplot as plt +from pathlib import Path +from typing import Optional, Tuple, Dict, List +from scipy import stats + +from src.data_loader import split_data + + +# ============================================================ +# 评估指标 +# ============================================================ + +def _direction_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float: + """方向准确率:预测涨跌方向正确的比例""" 
+ if len(y_true) < 2: + return np.nan + true_dir = np.sign(y_true) + pred_dir = np.sign(y_pred) + return np.mean(true_dir == pred_dir) + + +def _rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float: + """均方根误差""" + return np.sqrt(np.mean((y_true - y_pred) ** 2)) + + +def _diebold_mariano_test(e1: np.ndarray, e2: np.ndarray, h: int = 1) -> Tuple[float, float]: + """ + Diebold-Mariano检验:比较两个预测的损失差异是否显著 + + H0: 两个模型预测精度无差异 + e1, e2: 两个模型的预测误差序列 + + Returns + ------- + dm_stat : DM统计量 + p_value : 双侧p值 + """ + d = e1 ** 2 - e2 ** 2 # 平方损失差 + n = len(d) + if n < 10: + return np.nan, np.nan + + mean_d = np.mean(d) + + # Newey-West方差估计(考虑自相关) + gamma_0 = np.var(d, ddof=1) + gamma_sum = 0 + for k in range(1, h): + gamma_k = np.cov(d[k:], d[:-k])[0, 1] if len(d[k:]) > 1 else 0 + gamma_sum += 2 * gamma_k + + var_d = (gamma_0 + gamma_sum) / n + if var_d <= 0: + return np.nan, np.nan + + dm_stat = mean_d / np.sqrt(var_d) + p_value = 2 * stats.norm.sf(np.abs(dm_stat)) + return dm_stat, p_value + + +def _evaluate_model(name: str, y_true: np.ndarray, y_pred: np.ndarray, + rw_errors: np.ndarray) -> Dict: + """统一评估单个模型""" + errors = y_true - y_pred + rmse_val = _rmse(y_true, y_pred) + rw_rmse = _rmse(y_true, np.zeros_like(y_true)) # Random Walk RMSE + rmse_ratio = rmse_val / rw_rmse if rw_rmse > 0 else np.nan + dir_acc = _direction_accuracy(y_true, y_pred) + + # DM检验 vs Random Walk + dm_stat, dm_pval = _diebold_mariano_test(errors, rw_errors) + + result = { + "name": name, + "rmse": rmse_val, + "rmse_ratio_vs_rw": rmse_ratio, + "direction_accuracy": dir_acc, + "dm_stat_vs_rw": dm_stat, + "dm_pval_vs_rw": dm_pval, + "predictions": y_pred, + "errors": errors, + } + return result + + +# ============================================================ +# 基准模型 +# ============================================================ + +def _baseline_random_walk(y_true: np.ndarray) -> np.ndarray: + """随机游走基准:预测收益率=0""" + return np.zeros_like(y_true) + + +def _baseline_historical_mean(train_returns: np.ndarray, n_pred: int) -> np.ndarray: + """历史均值基准:预测收益率=训练集均值""" + return np.full(n_pred, np.mean(train_returns)) + + +# ============================================================ +# ARIMA 模型 +# ============================================================ + +def _run_arima(train_returns: pd.Series, val_returns: pd.Series) -> Dict: + """ + ARIMA模型:使用auto_arima自动选参 + walk-forward预测 + + Returns + ------- + dict : 包含预测结果和诊断信息 + """ + try: + import pmdarima as pm + from statsmodels.stats.diagnostic import acorr_ljungbox + except ImportError: + print(" [ARIMA] 跳过 - pmdarima 未安装。pip install pmdarima") + return None + + print("\n" + "=" * 60) + print("ARIMA 模型") + print("=" * 60) + + # 自动选择ARIMA参数 + print(" [1/3] auto_arima 参数搜索...") + model = pm.auto_arima( + train_returns.values, + start_p=0, max_p=5, + start_q=0, max_q=5, + d=0, # 对数收益率已经是平稳的 + seasonal=False, + stepwise=True, + suppress_warnings=True, + error_action='ignore', + trace=False, + information_criterion='aic', + ) + print(f" 最优模型: ARIMA{model.order}") + print(f" AIC: {model.aic():.2f}") + + # Ljung-Box 残差诊断 + print(" [2/3] Ljung-Box 残差白噪声检验...") + residuals = model.resid() + lb_result = acorr_ljungbox(residuals, lags=[10, 20], return_df=True) + print(f" Ljung-Box 检验 (lag=10): 统计量={lb_result.iloc[0]['lb_stat']:.2f}, " + f"p值={lb_result.iloc[0]['lb_pvalue']:.4f}") + print(f" Ljung-Box 检验 (lag=20): 统计量={lb_result.iloc[1]['lb_stat']:.2f}, " + f"p值={lb_result.iloc[1]['lb_pvalue']:.4f}") + + if lb_result.iloc[0]['lb_pvalue'] > 0.05: + print(" 残差通过白噪声检验 (p>0.05),模型拟合充分") + 
else: + print(" 残差未通过白噪声检验 (p<=0.05),可能存在未捕获的自相关结构") + + # Walk-forward 预测 + print(" [3/3] Walk-forward 验证集预测...") + val_values = val_returns.values + n_val = len(val_values) + predictions = np.zeros(n_val) + + # 使用滚动窗口预测 + history = list(train_returns.values) + for i in range(n_val): + # 一步预测 + fc = model.predict(n_periods=1) + predictions[i] = fc[0] + # 更新模型(添加真实观测值) + model.update(val_values[i:i+1]) + if (i + 1) % 100 == 0: + print(f" 进度: {i+1}/{n_val}") + + print(f" Walk-forward 预测完成,共{n_val}步") + + return { + "predictions": predictions, + "order": model.order, + "aic": model.aic(), + "ljung_box": lb_result, + } + + +# ============================================================ +# Prophet 模型 +# ============================================================ + +def _run_prophet(train_df: pd.DataFrame, val_df: pd.DataFrame) -> Dict: + """ + Prophet模型:基于日收盘价的时间序列预测 + + Returns + ------- + dict : 包含预测结果 + """ + try: + from prophet import Prophet + except ImportError: + print(" [Prophet] 跳过 - prophet 未安装。pip install prophet") + return None + + print("\n" + "=" * 60) + print("Prophet 模型") + print("=" * 60) + + # 准备Prophet格式数据 + prophet_train = pd.DataFrame({ + 'ds': train_df.index, + 'y': train_df['close'].values, + }) + + print(" [1/3] 构建Prophet模型并添加自定义季节性...") + + model = Prophet( + daily_seasonality=False, + weekly_seasonality=False, + yearly_seasonality=False, + changepoint_prior_scale=0.05, + ) + + # 添加自定义季节性 + model.add_seasonality(name='weekly', period=7, fourier_order=3) + model.add_seasonality(name='monthly', period=30, fourier_order=5) + model.add_seasonality(name='yearly', period=365, fourier_order=10) + model.add_seasonality(name='halving_cycle', period=1458, fourier_order=5) + + print(" [2/3] 拟合模型...") + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + model.fit(prophet_train) + + # 预测验证期 + print(" [3/3] 预测验证期...") + future_dates = pd.DataFrame({'ds': val_df.index}) + forecast = model.predict(future_dates) + + # 转换为对数收益率预测(与其他模型对齐) + pred_close = forecast['yhat'].values + # 用前一天的真实收盘价计算预测收益率 + # 第一天用训练集最后一天的价格 + prev_close = np.concatenate([[train_df['close'].iloc[-1]], val_df['close'].values[:-1]]) + pred_returns = np.log(pred_close / prev_close) + + print(f" 预测完成,验证期: {val_df.index[0]} ~ {val_df.index[-1]}") + print(f" 预测价格范围: {pred_close.min():.0f} ~ {pred_close.max():.0f}") + + return { + "predictions_return": pred_returns, + "predictions_close": pred_close, + "forecast": forecast, + "model": model, + } + + +# ============================================================ +# LSTM/GRU 模型 (PyTorch) +# ============================================================ + +def _run_lstm(train_df: pd.DataFrame, val_df: pd.DataFrame, + lookback: int = 60, hidden_size: int = 128, + num_layers: int = 2, max_epochs: int = 100, + patience: int = 10, batch_size: int = 64) -> Dict: + """ + LSTM/GRU 模型:基于PyTorch的深度学习时间序列预测 + + Returns + ------- + dict : 包含预测结果和训练历史 + """ + try: + import torch + import torch.nn as nn + from torch.utils.data import DataLoader, TensorDataset + except ImportError: + print(" [LSTM] 跳过 - PyTorch 未安装。pip install torch") + return None + + print("\n" + "=" * 60) + print("LSTM 模型 (PyTorch)") + print("=" * 60) + + device = torch.device('cuda' if torch.cuda.is_available() else + 'mps' if torch.backends.mps.is_available() else 'cpu') + print(f" 设备: {device}") + + # ---- 数据准备 ---- + # 使用收盘价的对数收益率作为目标 + feature_cols = ['log_return', 'volume_ratio', 'taker_buy_ratio'] + available_cols = [c for c in feature_cols if c in train_df.columns] + + if not 
available_cols: + # 降级到只用收盘价 + print(" [警告] 特征列不可用,仅使用收盘价收益率") + available_cols = ['log_return'] + + print(f" 特征: {available_cols}") + + # 合并训练和验证数据以创建连续序列 + all_data = pd.concat([train_df, val_df]) + features = all_data[available_cols].values + target = all_data['log_return'].values + + # 处理NaN + mask = ~np.isnan(features).any(axis=1) & ~np.isnan(target) + features_clean = features[mask] + target_clean = target[mask] + + # 特征标准化(基于训练集统计量) + train_len = mask[:len(train_df)].sum() + feat_mean = features_clean[:train_len].mean(axis=0) + feat_std = features_clean[:train_len].std(axis=0) + 1e-10 + features_norm = (features_clean - feat_mean) / feat_std + + target_mean = target_clean[:train_len].mean() + target_std = target_clean[:train_len].std() + 1e-10 + target_norm = (target_clean - target_mean) / target_std + + # 创建序列样本 + def create_sequences(feat, tgt, seq_len): + X, y = [], [] + for i in range(seq_len, len(feat)): + X.append(feat[i - seq_len:i]) + y.append(tgt[i]) + return np.array(X), np.array(y) + + X_all, y_all = create_sequences(features_norm, target_norm, lookback) + + # 划分训练和验证(根据原始训练集长度调整) + train_samples = max(0, train_len - lookback) + X_train = X_all[:train_samples] + y_train = y_all[:train_samples] + X_val = X_all[train_samples:] + y_val = y_all[train_samples:] + + if len(X_train) == 0 or len(X_val) == 0: + print(" [LSTM] 跳过 - 数据不足以创建训练/验证序列") + return None + + print(f" 训练样本: {len(X_train)}, 验证样本: {len(X_val)}") + print(f" 回看窗口: {lookback}, 隐藏维度: {hidden_size}, 层数: {num_layers}") + + # 转换为Tensor + X_train_t = torch.FloatTensor(X_train).to(device) + y_train_t = torch.FloatTensor(y_train).to(device) + X_val_t = torch.FloatTensor(X_val).to(device) + y_val_t = torch.FloatTensor(y_val).to(device) + + train_dataset = TensorDataset(X_train_t, y_train_t) + train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True) + + # ---- 模型定义 ---- + class LSTMModel(nn.Module): + def __init__(self, input_size, hidden_size, num_layers, dropout=0.2): + super().__init__() + self.lstm = nn.LSTM( + input_size=input_size, + hidden_size=hidden_size, + num_layers=num_layers, + batch_first=True, + dropout=dropout if num_layers > 1 else 0, + ) + self.fc = nn.Sequential( + nn.Linear(hidden_size, 64), + nn.ReLU(), + nn.Dropout(dropout), + nn.Linear(64, 1), + ) + + def forward(self, x): + lstm_out, _ = self.lstm(x) + # 取最后一个时间步的输出 + last_out = lstm_out[:, -1, :] + return self.fc(last_out).squeeze(-1) + + input_size = len(available_cols) + model = LSTMModel(input_size, hidden_size, num_layers).to(device) + + criterion = nn.MSELoss() + optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4) + scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau( + optimizer, mode='min', factor=0.5, patience=5, verbose=False + ) + + # ---- 训练 ---- + print(f" 开始训练 (最多{max_epochs}轮, 早停耐心={patience})...") + best_val_loss = np.inf + patience_counter = 0 + train_losses = [] + val_losses = [] + + for epoch in range(max_epochs): + # 训练 + model.train() + epoch_loss = 0 + n_batches = 0 + for batch_X, batch_y in train_loader: + optimizer.zero_grad() + pred = model(batch_X) + loss = criterion(pred, batch_y) + loss.backward() + torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) + optimizer.step() + epoch_loss += loss.item() + n_batches += 1 + + avg_train_loss = epoch_loss / max(n_batches, 1) + train_losses.append(avg_train_loss) + + # 验证 + model.eval() + with torch.no_grad(): + val_pred = model(X_val_t) + val_loss = criterion(val_pred, y_val_t).item() + val_losses.append(val_loss) + + 
scheduler.step(val_loss) + + if (epoch + 1) % 10 == 0: + lr = optimizer.param_groups[0]['lr'] + print(f" Epoch {epoch+1}/{max_epochs}: " + f"train_loss={avg_train_loss:.6f}, val_loss={val_loss:.6f}, lr={lr:.1e}") + + # 早停 + if val_loss < best_val_loss: + best_val_loss = val_loss + patience_counter = 0 + best_state = {k: v.cpu().clone() for k, v in model.state_dict().items()} + else: + patience_counter += 1 + if patience_counter >= patience: + print(f" 早停触发 (epoch {epoch+1})") + break + + # 加载最佳模型 + model.load_state_dict(best_state) + model.eval() + + # ---- 预测 ---- + with torch.no_grad(): + val_pred_norm = model(X_val_t).cpu().numpy() + + # 逆标准化 + val_pred_returns = val_pred_norm * target_std + target_mean + val_true_returns = y_val * target_std + target_mean + + print(f" 训练完成,最佳验证损失: {best_val_loss:.6f}") + + return { + "predictions_return": val_pred_returns, + "true_returns": val_true_returns, + "train_losses": train_losses, + "val_losses": val_losses, + "model": model, + "device": str(device), + } + + +# ============================================================ +# 可视化 +# ============================================================ + +def _plot_predictions(val_dates, y_true, model_preds: Dict[str, np.ndarray], + output_dir: Path): + """各模型实际 vs 预测对比图""" + n_models = len(model_preds) + fig, axes = plt.subplots(n_models, 1, figsize=(16, 4 * n_models), sharex=True) + if n_models == 1: + axes = [axes] + + for i, (name, y_pred) in enumerate(model_preds.items()): + ax = axes[i] + # 对齐长度(LSTM可能因lookback导致长度不同) + n = min(len(y_true), len(y_pred)) + dates = val_dates[:n] if len(val_dates) >= n else val_dates + + ax.plot(dates, y_true[:n], 'b-', alpha=0.6, linewidth=0.8, label='实际收益率') + ax.plot(dates, y_pred[:n], 'r-', alpha=0.6, linewidth=0.8, label='预测收益率') + ax.set_title(f"{name} - 实际 vs 预测", fontsize=13) + ax.set_ylabel("对数收益率", fontsize=11) + ax.legend(fontsize=9) + ax.grid(True, alpha=0.3) + ax.axhline(y=0, color='gray', linestyle='--', alpha=0.5) + + axes[-1].set_xlabel("日期", fontsize=11) + plt.tight_layout() + fig.savefig(output_dir / "ts_predictions_comparison.png", dpi=150, bbox_inches='tight') + plt.close(fig) + print(f" [保存] ts_predictions_comparison.png") + + +def _plot_direction_accuracy(metrics: Dict[str, Dict], output_dir: Path): + """方向准确率对比柱状图""" + names = list(metrics.keys()) + accs = [metrics[n]["direction_accuracy"] * 100 for n in names] + + fig, ax = plt.subplots(figsize=(10, 6)) + colors = plt.cm.Set2(np.linspace(0, 1, len(names))) + bars = ax.bar(names, accs, color=colors, edgecolor='gray', linewidth=0.5) + + # 标注数值 + for bar, acc in zip(bars, accs): + ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.5, + f"{acc:.1f}%", ha='center', va='bottom', fontsize=11, fontweight='bold') + + ax.axhline(y=50, color='red', linestyle='--', alpha=0.7, label='随机基准 (50%)') + ax.set_ylabel("方向准确率 (%)", fontsize=12) + ax.set_title("各模型方向预测准确率对比", fontsize=14) + ax.legend(fontsize=10) + ax.grid(True, alpha=0.3, axis='y') + ax.set_ylim(0, max(accs) * 1.2 if accs else 100) + + fig.savefig(output_dir / "ts_direction_accuracy.png", dpi=150, bbox_inches='tight') + plt.close(fig) + print(f" [保存] ts_direction_accuracy.png") + + +def _plot_cumulative_error(val_dates, metrics: Dict[str, Dict], output_dir: Path): + """累计误差对比图""" + fig, ax = plt.subplots(figsize=(16, 7)) + + for name, m in metrics.items(): + errors = m.get("errors") + if errors is None: + continue + n = len(errors) + dates = val_dates[:n] + cum_sq_err = np.cumsum(errors ** 2) + ax.plot(dates, cum_sq_err, linewidth=1.2, 
label=f"{name}") + + ax.set_xlabel("日期", fontsize=12) + ax.set_ylabel("累计平方误差", fontsize=12) + ax.set_title("各模型累计预测误差对比", fontsize=14) + ax.legend(fontsize=10) + ax.grid(True, alpha=0.3) + + fig.savefig(output_dir / "ts_cumulative_error.png", dpi=150, bbox_inches='tight') + plt.close(fig) + print(f" [保存] ts_cumulative_error.png") + + +def _plot_lstm_training(train_losses: List, val_losses: List, output_dir: Path): + """LSTM训练损失曲线""" + fig, ax = plt.subplots(figsize=(10, 6)) + ax.plot(train_losses, 'b-', label='训练损失', linewidth=1.5) + ax.plot(val_losses, 'r-', label='验证损失', linewidth=1.5) + ax.set_xlabel("Epoch", fontsize=12) + ax.set_ylabel("MSE Loss", fontsize=12) + ax.set_title("LSTM 训练过程", fontsize=14) + ax.legend(fontsize=11) + ax.grid(True, alpha=0.3) + + fig.savefig(output_dir / "ts_lstm_training.png", dpi=150, bbox_inches='tight') + plt.close(fig) + print(f" [保存] ts_lstm_training.png") + + +def _plot_prophet_components(prophet_result: Dict, output_dir: Path): + """Prophet预测 - 实际价格 vs 预测价格""" + try: + from prophet import Prophet + except ImportError: + return + + forecast = prophet_result.get("forecast") + if forecast is None: + return + + fig, ax = plt.subplots(figsize=(16, 7)) + ax.plot(forecast['ds'], forecast['yhat'], 'r-', linewidth=1.2, label='Prophet预测') + ax.fill_between(forecast['ds'], forecast['yhat_lower'], forecast['yhat_upper'], + alpha=0.15, color='red', label='置信区间') + ax.set_xlabel("日期", fontsize=12) + ax.set_ylabel("BTC 价格 (USDT)", fontsize=12) + ax.set_title("Prophet 价格预测(验证期)", fontsize=14) + ax.legend(fontsize=10) + ax.grid(True, alpha=0.3) + + fig.savefig(output_dir / "ts_prophet_forecast.png", dpi=150, bbox_inches='tight') + plt.close(fig) + print(f" [保存] ts_prophet_forecast.png") + + +# ============================================================ +# 结果打印 +# ============================================================ + +def _print_metrics_table(all_metrics: Dict[str, Dict]): + """打印所有模型的评估指标表""" + print("\n" + "=" * 80) + print(" 模型评估汇总") + print("=" * 80) + print(f" {'模型':<20s} {'RMSE':>10s} {'RMSE/RW':>10s} {'方向准确率':>10s} " + f"{'DM统计量':>10s} {'DM p值':>10s}") + print("-" * 80) + + for name, m in all_metrics.items(): + rmse_str = f"{m['rmse']:.6f}" + ratio_str = f"{m['rmse_ratio_vs_rw']:.4f}" if not np.isnan(m['rmse_ratio_vs_rw']) else "N/A" + dir_str = f"{m['direction_accuracy']*100:.1f}%" + dm_str = f"{m['dm_stat_vs_rw']:.3f}" if not np.isnan(m['dm_stat_vs_rw']) else "N/A" + pv_str = f"{m['dm_pval_vs_rw']:.4f}" if not np.isnan(m['dm_pval_vs_rw']) else "N/A" + print(f" {name:<20s} {rmse_str:>10s} {ratio_str:>10s} {dir_str:>10s} " + f"{dm_str:>10s} {pv_str:>10s}") + + print("-" * 80) + + # 解读 + print("\n [解读]") + print(" - RMSE/RW < 1.0 表示优于随机游走基准") + print(" - 方向准确率 > 50% 表示有一定方向预测能力") + print(" - DM检验 p值 < 0.05 表示与随机游走有显著差异") + + +# ============================================================ +# 主入口 +# ============================================================ + +def run_time_series_analysis(df: pd.DataFrame, output_dir: "str | Path" = "output/time_series") -> Dict: + """ + 时间序列预测分析 - 主入口 + + Parameters + ---------- + df : pd.DataFrame + 已经通过 add_derived_features() 添加了衍生特征的日线数据 + output_dir : str or Path + 图表输出目录 + + Returns + ------- + results : dict + 包含所有模型的预测结果和评估指标 + """ + output_dir = Path(output_dir) + output_dir.mkdir(parents=True, exist_ok=True) + + # 设置中文字体(macOS) + plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei', 'DejaVu Sans'] + plt.rcParams['axes.unicode_minus'] = False + + print("=" * 60) + print(" BTC 时间序列预测分析") + print("=" 
* 60) + + # ---- 数据划分 ---- + train_df, val_df, test_df = split_data(df) + print(f"\n 训练集: {train_df.index[0]} ~ {train_df.index[-1]} ({len(train_df)}天)") + print(f" 验证集: {val_df.index[0]} ~ {val_df.index[-1]} ({len(val_df)}天)") + print(f" 测试集: {test_df.index[0]} ~ {test_df.index[-1]} ({len(test_df)}天)") + + # 对数收益率序列 + train_returns = train_df['log_return'].dropna() + val_returns = val_df['log_return'].dropna() + val_dates = val_returns.index + y_true = val_returns.values + + # ---- 基准模型 ---- + print("\n" + "=" * 60) + print("基准模型") + print("=" * 60) + + # Random Walk基准 + rw_pred = _baseline_random_walk(y_true) + rw_errors = y_true - rw_pred + print(f" Random Walk (预测收益=0): RMSE = {_rmse(y_true, rw_pred):.6f}") + + # 历史均值基准 + hm_pred = _baseline_historical_mean(train_returns.values, len(y_true)) + print(f" Historical Mean (收益={train_returns.mean():.6f}): RMSE = {_rmse(y_true, hm_pred):.6f}") + + # 存储所有模型结果 + all_metrics = {} + model_preds = {} + + # 评估基准模型 + all_metrics["Random Walk"] = _evaluate_model("Random Walk", y_true, rw_pred, rw_errors) + model_preds["Random Walk"] = rw_pred + + all_metrics["Historical Mean"] = _evaluate_model("Historical Mean", y_true, hm_pred, rw_errors) + model_preds["Historical Mean"] = hm_pred + + # ---- ARIMA ---- + try: + arima_result = _run_arima(train_returns, val_returns) + if arima_result is not None: + arima_pred = arima_result["predictions"] + all_metrics["ARIMA"] = _evaluate_model("ARIMA", y_true, arima_pred, rw_errors) + model_preds["ARIMA"] = arima_pred + print(f"\n ARIMA 验证集: RMSE={all_metrics['ARIMA']['rmse']:.6f}, " + f"方向准确率={all_metrics['ARIMA']['direction_accuracy']*100:.1f}%") + except Exception as e: + print(f"\n [ARIMA] 运行失败: {e}") + + # ---- Prophet ---- + try: + prophet_result = _run_prophet(train_df, val_df) + if prophet_result is not None: + prophet_pred = prophet_result["predictions_return"] + # 对齐长度 + n = min(len(y_true), len(prophet_pred)) + all_metrics["Prophet"] = _evaluate_model( + "Prophet", y_true[:n], prophet_pred[:n], rw_errors[:n] + ) + model_preds["Prophet"] = prophet_pred[:n] + print(f"\n Prophet 验证集: RMSE={all_metrics['Prophet']['rmse']:.6f}, " + f"方向准确率={all_metrics['Prophet']['direction_accuracy']*100:.1f}%") + + # Prophet专属图表 + _plot_prophet_components(prophet_result, output_dir) + except Exception as e: + print(f"\n [Prophet] 运行失败: {e}") + prophet_result = None + + # ---- LSTM ---- + try: + lstm_result = _run_lstm(train_df, val_df) + if lstm_result is not None: + lstm_pred = lstm_result["predictions_return"] + lstm_true = lstm_result["true_returns"] + n_lstm = len(lstm_pred) + + # LSTM因lookback导致样本数不同,使用其自身的true_returns评估 + lstm_rw_errors = lstm_true - np.zeros_like(lstm_true) + all_metrics["LSTM"] = _evaluate_model( + "LSTM", lstm_true, lstm_pred, lstm_rw_errors + ) + model_preds["LSTM"] = lstm_pred + print(f"\n LSTM 验证集: RMSE={all_metrics['LSTM']['rmse']:.6f}, " + f"方向准确率={all_metrics['LSTM']['direction_accuracy']*100:.1f}%") + + # LSTM训练曲线 + _plot_lstm_training(lstm_result["train_losses"], + lstm_result["val_losses"], output_dir) + except Exception as e: + print(f"\n [LSTM] 运行失败: {e}") + lstm_result = None + + # ---- 评估汇总 ---- + _print_metrics_table(all_metrics) + + # ---- 可视化 ---- + print("\n[可视化] 生成分析图表...") + + # 预测对比图(仅使用与y_true等长的预测,排除LSTM) + aligned_preds = {k: v for k, v in model_preds.items() + if k != "LSTM" and len(v) == len(y_true)} + if aligned_preds: + _plot_predictions(val_dates, y_true, aligned_preds, output_dir) + + # LSTM单独画图(长度不同) + if "LSTM" in model_preds and lstm_result is not None: + lstm_dates 
= val_dates[-len(lstm_result["predictions_return"]):]
+        _plot_predictions(lstm_dates, lstm_result["true_returns"],
+                          {"LSTM": lstm_result["predictions_return"]}, output_dir)
+
+    # 方向准确率对比
+    _plot_direction_accuracy(all_metrics, output_dir)
+
+    # 累计误差对比
+    _plot_cumulative_error(val_dates, all_metrics, output_dir)
+
+    # ---- 汇总 ----
+    results = {
+        "metrics": all_metrics,
+        "model_predictions": model_preds,
+        "val_dates": val_dates,
+        "y_true": y_true,
+    }
+
+    # arima_result 可能因 _run_arima 抛异常而从未绑定,用 locals() 做防御性判断
+    if 'arima_result' in locals() and arima_result is not None:
+        results["arima"] = arima_result
+    if prophet_result is not None:
+        results["prophet"] = prophet_result
+    if lstm_result is not None:
+        results["lstm"] = lstm_result
+
+    print("\n" + "=" * 60)
+    print(" 时间序列预测分析完成!")
+    print("=" * 60)
+
+    return results
+
+
+# ============================================================
+# 命令行入口
+# ============================================================
+
+if __name__ == "__main__":
+    # 与模块顶部的 from src.data_loader import ... 保持一致,
+    # 以 python -m src.time_series 方式运行
+    from src.data_loader import load_daily
+    from src.preprocessing import add_derived_features
+
+    df = load_daily()
+    df = add_derived_features(df)
+
+    results = run_time_series_analysis(df, output_dir="output/time_series")
diff --git a/src/visualization.py b/src/visualization.py
new file mode 100644
index 0000000..9dd9f43
--- /dev/null
+++ b/src/visualization.py
@@ -0,0 +1,317 @@
+"""统一可视化工具模块
+
+提供跨模块共用的绘图辅助函数与综合结果仪表盘。
+"""
+
+import numpy as np
+import pandas as pd
+import matplotlib
+matplotlib.use('Agg')
+import matplotlib.pyplot as plt
+import matplotlib.gridspec as gridspec
+from pathlib import Path
+from typing import Dict, List, Optional, Any
+import json
+import warnings
+
+# ── 全局样式 ──────────────────────────────────────────────
+
+STYLE_CONFIG = {
+    "figure.facecolor": "white",
+    "axes.facecolor": "#fafafa",
+    "axes.grid": True,
+    "grid.alpha": 0.3,
+    "grid.linestyle": "--",
+    "font.size": 10,
+    "axes.titlesize": 13,
+    "axes.labelsize": 11,
+    "xtick.labelsize": 9,
+    "ytick.labelsize": 9,
+    "legend.fontsize": 9,
+    "figure.dpi": 120,
+    "savefig.dpi": 150,
+    "savefig.bbox": "tight",
+}
+
+COLOR_PALETTE = {
+    "primary": "#2563eb",
+    "secondary": "#7c3aed",
+    "success": "#059669",
+    "danger": "#dc2626",
+    "warning": "#d97706",
+    "info": "#0891b2",
+    "muted": "#6b7280",
+    "bg_light": "#f8fafc",
+}
+
+EVIDENCE_COLORS = {
+    "strong": "#059669",    # 绿
+    "moderate": "#d97706",  # 橙
+    "weak": "#dc2626",      # 红
+    "none": "#6b7280",      # 灰
+}
+
+
+def apply_style():
+    """应用全局matplotlib样式"""
+    plt.rcParams.update(STYLE_CONFIG)
+    try:
+        plt.rcParams["font.sans-serif"] = ["Arial Unicode MS", "SimHei", "DejaVu Sans"]
+        plt.rcParams["axes.unicode_minus"] = False
+    except Exception:
+        pass
+
+
+def ensure_dir(path):
+    """确保目录存在"""
+    Path(path).mkdir(parents=True, exist_ok=True)
+    return Path(path)
+
+
+# ── 证据评分框架 ───────────────────────────────────────────
+
+EVIDENCE_CRITERIA = """
+"真正有规律" 判定标准(必须同时满足):
+  1. FDR校正后 p < 0.05
+  2. 排列检验 p < 0.01(如适用)
+  3. 测试集上效果方向一致且显著
+  4. >80% bootstrap子样本中成立(如适用)
+  5. Cohen's d > 0.2 或经济意义显著
+  6. 
有合理的经济/市场直觉解释 +""" + + +def score_evidence(result: Dict) -> Dict: + """ + 对单个分析模块的结果打分 + + Parameters + ---------- + result : dict + 模块返回的结果字典,应包含 'findings' 列表 + + Returns + ------- + dict + 包含 score, level, summary + """ + findings = result.get("findings", []) + if not findings: + return {"score": 0, "level": "none", "summary": "无可评估的发现", + "n_findings": 0, "total_score": 0, "details": []} + + total_score = 0 + details = [] + + for f in findings: + s = 0 + name = f.get("name", "未命名") + p_value = f.get("p_value") + effect_size = f.get("effect_size") + significant = f.get("significant", False) + description = f.get("description", "") + + if significant: + s += 2 + if p_value is not None and p_value < 0.01: + s += 1 + if effect_size is not None and abs(effect_size) > 0.2: + s += 1 + if f.get("test_set_consistent", False): + s += 2 + if f.get("bootstrap_robust", False): + s += 1 + + total_score += s + details.append({"name": name, "score": s, "description": description}) + + avg = total_score / len(findings) if findings else 0 + + if avg >= 5: + level = "strong" + elif avg >= 3: + level = "moderate" + elif avg >= 1: + level = "weak" + else: + level = "none" + + return { + "score": round(avg, 2), + "level": level, + "n_findings": len(findings), + "total_score": total_score, + "details": details, + } + + +# ── 综合仪表盘 ───────────────────────────────────────────── + +def generate_summary_dashboard(all_results: Dict[str, Dict], output_dir: str = "output"): + """ + 生成综合分析仪表盘 + + Parameters + ---------- + all_results : dict + {module_name: module_result_dict} + output_dir : str + 输出目录 + """ + apply_style() + out = ensure_dir(output_dir) + + # ── 1. 汇总各模块证据强度 ── + summary_rows = [] + for module, result in all_results.items(): + ev = score_evidence(result) + summary_rows.append({ + "module": module, + "score": ev["score"], + "level": ev["level"], + "n_findings": ev["n_findings"], + "total_score": ev["total_score"], + }) + + summary_df = pd.DataFrame(summary_rows) + if summary_df.empty: + print("[visualization] 无模块结果可汇总") + return {} + + summary_df.sort_values("score", ascending=True, inplace=True) + + # ── 2. 证据强度横向柱状图 ── + fig, ax = plt.subplots(figsize=(10, max(6, len(summary_df) * 0.5))) + colors = [EVIDENCE_COLORS.get(row["level"], "#6b7280") for _, row in summary_df.iterrows()] + bars = ax.barh(summary_df["module"], summary_df["score"], color=colors, edgecolor="white", linewidth=0.5) + + for bar, (_, row) in zip(bars, summary_df.iterrows()): + ax.text(bar.get_width() + 0.1, bar.get_y() + bar.get_height()/2, + f'{row["score"]:.1f} ({row["level"]})', + va='center', fontsize=9) + + ax.set_xlabel("Evidence Score") + ax.set_title("BTC/USDT Analysis - Evidence Strength by Module") + ax.axvline(x=3, color="#d97706", linestyle="--", alpha=0.5, label="Moderate threshold") + ax.axvline(x=5, color="#059669", linestyle="--", alpha=0.5, label="Strong threshold") + ax.legend(loc="lower right") + plt.tight_layout() + fig.savefig(out / "evidence_dashboard.png") + plt.close(fig) + + # ── 3. 
综合结论文本报告 ── + report_lines = [] + report_lines.append("=" * 70) + report_lines.append("BTC/USDT 价格规律性分析 — 综合结论报告") + report_lines.append("=" * 70) + report_lines.append("") + report_lines.append(EVIDENCE_CRITERIA) + report_lines.append("") + report_lines.append("-" * 70) + report_lines.append(f"{'模块':<30} {'得分':>6} {'强度':>10} {'发现数':>8}") + report_lines.append("-" * 70) + + for _, row in summary_df.sort_values("score", ascending=False).iterrows(): + report_lines.append( + f"{row['module']:<30} {row['score']:>6.2f} {row['level']:>10} {row['n_findings']:>8}" + ) + + report_lines.append("-" * 70) + report_lines.append("") + + # 分级汇总 + strong = summary_df[summary_df["level"] == "strong"]["module"].tolist() + moderate = summary_df[summary_df["level"] == "moderate"]["module"].tolist() + weak = summary_df[summary_df["level"] == "weak"]["module"].tolist() + none_found = summary_df[summary_df["level"] == "none"]["module"].tolist() + + report_lines.append("## 强证据规律(可重复、有经济意义):") + if strong: + for m in strong: + report_lines.append(f" * {m}") + else: + report_lines.append(" (无)") + + report_lines.append("") + report_lines.append("## 中等证据规律(统计显著但效果有限):") + if moderate: + for m in moderate: + report_lines.append(f" * {m}") + else: + report_lines.append(" (无)") + + report_lines.append("") + report_lines.append("## 弱证据/不显著:") + for m in weak + none_found: + report_lines.append(f" * {m}") + + report_lines.append("") + report_lines.append("=" * 70) + report_lines.append("注: 得分基于各模块自报告的统计检验结果。") + report_lines.append(" 具体参数和图表请参见各子目录的输出。") + report_lines.append("=" * 70) + + report_text = "\n".join(report_lines) + + with open(out / "综合结论报告.txt", "w", encoding="utf-8") as f: + f.write(report_text) + + # ── 4. JSON 格式结果存储 ── + json_results = {} + for module, result in all_results.items(): + # 去除不可序列化的对象 + clean = {} + for k, v in result.items(): + try: + json.dumps(v) + clean[k] = v + except (TypeError, ValueError): + clean[k] = str(v) + json_results[module] = clean + + with open(out / "all_results.json", "w", encoding="utf-8") as f: + json.dump(json_results, f, ensure_ascii=False, indent=2, default=str) + + print(report_text) + + return { + "summary_df": summary_df, + "report_path": str(out / "综合结论报告.txt"), + "dashboard_path": str(out / "evidence_dashboard.png"), + "json_path": str(out / "all_results.json"), + } + + +def plot_price_overview(df: pd.DataFrame, output_dir: str = "output"): + """生成价格概览图(对数尺度 + 成交量 + 关键事件标注)""" + apply_style() + out = ensure_dir(output_dir) + + fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 8), height_ratios=[3, 1], + sharex=True, gridspec_kw={"hspace": 0.05}) + + # 价格(对数尺度) + ax1.semilogy(df.index, df["close"], color=COLOR_PALETTE["primary"], linewidth=0.8) + ax1.set_ylabel("Price (USDT, log scale)") + ax1.set_title("BTC/USDT Price & Volume Overview") + + # 标注减半事件 + halvings = [ + ("2020-05-11", "3rd Halving"), + ("2024-04-20", "4th Halving"), + ] + for date_str, label in halvings: + dt = pd.Timestamp(date_str) + if df.index.min() <= dt <= df.index.max(): + ax1.axvline(x=dt, color=COLOR_PALETTE["danger"], linestyle="--", alpha=0.6) + ax1.text(dt, ax1.get_ylim()[1] * 0.9, label, rotation=90, + va="top", fontsize=8, color=COLOR_PALETTE["danger"]) + + # 成交量 + ax2.bar(df.index, df["volume"], width=1, color=COLOR_PALETTE["info"], alpha=0.5) + ax2.set_ylabel("Volume") + ax2.set_xlabel("Date") + + fig.savefig(out / "price_overview.png") + plt.close(fig) + print(f"[visualization] 价格概览图 -> {out / 'price_overview.png'}") diff --git a/src/volatility_analysis.py 
b/src/volatility_analysis.py new file mode 100644 index 0000000..6cfdca3 --- /dev/null +++ b/src/volatility_analysis.py @@ -0,0 +1,639 @@ +"""波动率聚集与非对称GARCH建模模块 + +分析内容: +- 多窗口已实现波动率(7d, 30d, 90d) +- 波动率自相关幂律衰减检验(长记忆性) +- GARCH/EGARCH/GJR-GARCH 模型对比 +- 杠杆效应分析:收益率与未来波动率的相关性 +""" + +import matplotlib +matplotlib.use('Agg') + +import numpy as np +import pandas as pd +import matplotlib.pyplot as plt +from scipy import stats +from scipy.optimize import curve_fit +from statsmodels.tsa.stattools import acf +from pathlib import Path +from typing import Optional + +from src.data_loader import load_daily +from src.preprocessing import log_returns + + +# ============================================================ +# 1. 多窗口已实现波动率 +# ============================================================ + +def multi_window_realized_vol(returns: pd.Series, + windows: list = [7, 30, 90]) -> pd.DataFrame: + """ + 计算多窗口已实现波动率(年化) + + Parameters + ---------- + returns : pd.Series + 日对数收益率 + windows : list + 滚动窗口列表(天数) + + Returns + ------- + pd.DataFrame + 各窗口已实现波动率,列名为 'rv_7d', 'rv_30d', 'rv_90d' 等 + """ + vol_df = pd.DataFrame(index=returns.index) + for w in windows: + # 已实现波动率 = sqrt(sum(r^2)) * sqrt(365/window) 进行年化 + rv = np.sqrt((returns ** 2).rolling(window=w).sum()) * np.sqrt(365 / w) + vol_df[f'rv_{w}d'] = rv + return vol_df.dropna(how='all') + + +# ============================================================ +# 2. 波动率自相关幂律衰减检验(长记忆性) +# ============================================================ + +def volatility_acf_power_law(returns: pd.Series, + max_lags: int = 200) -> dict: + """ + 检验|收益率|的自相关函数是否服从幂律衰减:ACF(k) ~ k^(-d) + + 长记忆性判断:若 0 < d < 1,则存在长记忆 + + Parameters + ---------- + returns : pd.Series + 日对数收益率 + max_lags : int + 最大滞后阶数 + + Returns + ------- + dict + 包含幂律拟合参数d、拟合优度R²、ACF值等 + """ + abs_returns = returns.dropna().abs() + + # 计算ACF + acf_values = acf(abs_returns, nlags=max_lags, fft=True) + # 从lag=1开始(lag=0始终为1) + lags = np.arange(1, max_lags + 1) + acf_vals = acf_values[1:] + + # 只取正的ACF值来做对数拟合 + positive_mask = acf_vals > 0 + lags_pos = lags[positive_mask] + acf_pos = acf_vals[positive_mask] + + if len(lags_pos) < 10: + print("[警告] 正的ACF值过少,无法可靠拟合幂律") + return { + 'd': np.nan, 'r_squared': np.nan, + 'lags': lags, 'acf_values': acf_vals, + 'is_long_memory': False, + } + + # 对数-对数线性回归: log(ACF) = -d * log(k) + c + log_lags = np.log(lags_pos) + log_acf = np.log(acf_pos) + slope, intercept, r_value, p_value, std_err = stats.linregress(log_lags, log_acf) + + d = -slope # 幂律衰减指数 + r_squared = r_value ** 2 + + # 非线性拟合作为对照(幂律函数直接拟合) + def power_law(k, a, d_param): + return a * k ** (-d_param) + + try: + popt, pcov = curve_fit(power_law, lags_pos, acf_pos, + p0=[acf_pos[0], d], maxfev=5000) + d_nonlinear = popt[1] + except (RuntimeError, ValueError): + d_nonlinear = np.nan + + results = { + 'd': d, + 'd_nonlinear': d_nonlinear, + 'r_squared': r_squared, + 'slope': slope, + 'intercept': intercept, + 'p_value': p_value, + 'std_err': std_err, + 'lags': lags, + 'acf_values': acf_vals, + 'lags_positive': lags_pos, + 'acf_positive': acf_pos, + 'is_long_memory': 0 < d < 1, + } + return results + + +# ============================================================ +# 3. 
GARCH / EGARCH / GJR-GARCH 模型对比
+# ============================================================
+
+def compare_garch_models(returns: pd.Series) -> dict:
+    """
+    拟合GARCH(1,1)、EGARCH(1,1)、GJR-GARCH(1,1)并比较AIC/BIC
+
+    Parameters
+    ----------
+    returns : pd.Series
+        日对数收益率
+
+    Returns
+    -------
+    dict
+        各模型参数、AIC/BIC、杠杆效应参数
+    """
+    from arch import arch_model
+
+    r_pct = returns.dropna() * 100  # 百分比收益率
+    results = {}
+
+    # --- GARCH(1,1) ---
+    model_garch = arch_model(r_pct, vol='Garch', p=1, q=1,
+                             mean='Constant', dist='Normal')
+    res_garch = model_garch.fit(disp='off')
+    results['GARCH'] = {
+        'params': dict(res_garch.params),
+        'aic': res_garch.aic,
+        'bic': res_garch.bic,
+        'log_likelihood': res_garch.loglikelihood,
+        'conditional_volatility': res_garch.conditional_volatility / 100,
+        'result_obj': res_garch,
+    }
+
+    # --- EGARCH(1,1) ---
+    # 注: arch 库中 EGARCH 的非对称(杠杆)项由 o 阶控制,必须显式传入 o=1,
+    # 否则参数表中不会出现 gamma[1],下方 leverage_param 将始终为 NaN
+    model_egarch = arch_model(r_pct, vol='EGARCH', p=1, o=1, q=1,
+                              mean='Constant', dist='Normal')
+    res_egarch = model_egarch.fit(disp='off')
+    # EGARCH的gamma参数反映杠杆效应(负值表示负收益增大波动率)
+    egarch_params = dict(res_egarch.params)
+    results['EGARCH'] = {
+        'params': egarch_params,
+        'aic': res_egarch.aic,
+        'bic': res_egarch.bic,
+        'log_likelihood': res_egarch.loglikelihood,
+        'conditional_volatility': res_egarch.conditional_volatility / 100,
+        'leverage_param': egarch_params.get('gamma[1]', np.nan),
+        'result_obj': res_egarch,
+    }
+
+    # --- GJR-GARCH(1,1) ---
+    # GJR-GARCH 在 arch 库中通过 vol='Garch', o=1 实现
+    model_gjr = arch_model(r_pct, vol='Garch', p=1, o=1, q=1,
+                           mean='Constant', dist='Normal')
+    res_gjr = model_gjr.fit(disp='off')
+    gjr_params = dict(res_gjr.params)
+    results['GJR-GARCH'] = {
+        'params': gjr_params,
+        'aic': res_gjr.aic,
+        'bic': res_gjr.bic,
+        'log_likelihood': res_gjr.loglikelihood,
+        'conditional_volatility': res_gjr.conditional_volatility / 100,
+        # gamma[1] > 0 表示负冲击产生更大波动
+        'leverage_param': gjr_params.get('gamma[1]', np.nan),
+        'result_obj': res_gjr,
+    }
+
+    return results
+
+
+# ============================================================
+# 4. 杠杆效应分析
+# ============================================================
+
+def leverage_effect_analysis(returns: pd.Series,
+                             forward_windows: list = [5, 10, 20]) -> dict:
+    """
+    分析收益率与未来波动率的相关性(杠杆效应)
+
+    杠杆效应:负收益倾向于增加未来波动率,正收益倾向于减少未来波动率
+    表现为 corr(r_t, vol_{t+k}) < 0
+
+    Parameters
+    ----------
+    returns : pd.Series
+        日对数收益率
+    forward_windows : list
+        前瞻波动率窗口列表
+
+    Returns
+    -------
+    dict
+        各窗口下的相关系数及显著性
+    """
+    r = returns.dropna()
+    results = {}
+
+    for w in forward_windows:
+        # 前瞻已实现波动率
+        future_vol = r.abs().rolling(window=w).mean().shift(-w)
+        # 对齐有效数据
+        valid = pd.DataFrame({'return': r, 'future_vol': future_vol}).dropna()
+
+        if len(valid) < 30:
+            results[f'{w}d'] = {
+                'correlation': np.nan,
+                'p_value': np.nan,
+                'n_samples': len(valid),
+            }
+            continue
+
+        corr, p_val = stats.pearsonr(valid['return'], valid['future_vol'])
+        # Spearman秩相关作为稳健性检查
+        spearman_corr, spearman_p = stats.spearmanr(valid['return'], valid['future_vol'])
+
+        results[f'{w}d'] = {
+            'pearson_correlation': corr,
+            'pearson_pvalue': p_val,
+            'spearman_correlation': spearman_corr,
+            'spearman_pvalue': spearman_p,
+            'n_samples': len(valid),
+            'return_series': valid['return'],
+            'future_vol_series': valid['future_vol'],
+        }
+
+    return results
+
+
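+# 用法示意(补充,非正式): 下面的草图展示如何串联本模块的两个核心函数;
+# 假设在项目根目录运行、日线数据可用,函数名均来自本文件与顶部已导入的
+# load_daily / log_returns。预期结果方向依据报告正文统计,仅供核对:
+#
+#     returns = log_returns(load_daily()['close'])
+#     garch_cmp = compare_garch_models(returns)
+#     print(garch_cmp['EGARCH']['leverage_param'])    # gamma[1] < 0 → 杠杆效应
+#     lev = leverage_effect_analysis(returns)
+#     print(lev['5d']['pearson_correlation'])         # 预期为弱负相关(约 -0.06)
+
+
+# ============================================================
+# 5. 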
可视化 +# ============================================================ + +def plot_realized_volatility(vol_df: pd.DataFrame, output_dir: Path): + """绘制多窗口已实现波动率时序图""" + fig, ax = plt.subplots(figsize=(14, 6)) + + colors = ['#1f77b4', '#ff7f0e', '#2ca02c'] + labels = {'rv_7d': '7天', 'rv_30d': '30天', 'rv_90d': '90天'} + + for idx, col in enumerate(vol_df.columns): + label = labels.get(col, col) + ax.plot(vol_df.index, vol_df[col], linewidth=0.8, + color=colors[idx % len(colors)], + label=f'{label}已实现波动率(年化)', alpha=0.85) + + ax.set_xlabel('日期', fontsize=12) + ax.set_ylabel('年化波动率', fontsize=12) + ax.set_title('BTC 多窗口已实现波动率', fontsize=14) + ax.legend(fontsize=11) + ax.grid(True, alpha=0.3) + + fig.savefig(output_dir / 'realized_volatility_multiwindow.png', + dpi=150, bbox_inches='tight') + plt.close(fig) + print(f"[保存] {output_dir / 'realized_volatility_multiwindow.png'}") + + +def plot_acf_power_law(acf_results: dict, output_dir: Path): + """绘制ACF幂律衰减拟合图""" + fig, axes = plt.subplots(1, 2, figsize=(14, 5)) + + lags = acf_results['lags'] + acf_vals = acf_results['acf_values'] + + # 左图:ACF原始值 + ax1 = axes[0] + ax1.bar(lags, acf_vals, width=1, alpha=0.6, color='steelblue') + ax1.set_xlabel('滞后阶数', fontsize=11) + ax1.set_ylabel('ACF', fontsize=11) + ax1.set_title('|收益率| 自相关函数', fontsize=12) + ax1.grid(True, alpha=0.3) + ax1.axhline(y=0, color='black', linewidth=0.5) + + # 右图:对数-对数图 + 幂律拟合 + ax2 = axes[1] + lags_pos = acf_results['lags_positive'] + acf_pos = acf_results['acf_positive'] + + ax2.scatter(np.log(lags_pos), np.log(acf_pos), s=10, alpha=0.5, + color='steelblue', label='实际ACF') + + # 拟合线 + d = acf_results['d'] + intercept = acf_results['intercept'] + x_fit = np.linspace(np.log(lags_pos.min()), np.log(lags_pos.max()), 100) + y_fit = -d * x_fit + intercept + ax2.plot(x_fit, y_fit, 'r-', linewidth=2, + label=f'幂律拟合: d={d:.3f}, R²={acf_results["r_squared"]:.3f}') + + ax2.set_xlabel('log(滞后阶数)', fontsize=11) + ax2.set_ylabel('log(ACF)', fontsize=11) + ax2.set_title('幂律衰减拟合(双对数坐标)', fontsize=12) + ax2.legend(fontsize=10) + ax2.grid(True, alpha=0.3) + + fig.tight_layout() + fig.savefig(output_dir / 'acf_power_law_fit.png', + dpi=150, bbox_inches='tight') + plt.close(fig) + print(f"[保存] {output_dir / 'acf_power_law_fit.png'}") + + +def plot_model_comparison(model_results: dict, output_dir: Path): + """绘制GARCH模型对比图(AIC/BIC + 条件波动率对比)""" + fig, axes = plt.subplots(2, 1, figsize=(14, 10)) + + model_names = list(model_results.keys()) + aic_values = [model_results[m]['aic'] for m in model_names] + bic_values = [model_results[m]['bic'] for m in model_names] + + # 上图:AIC/BIC 对比柱状图 + ax1 = axes[0] + x = np.arange(len(model_names)) + width = 0.35 + bars1 = ax1.bar(x - width / 2, aic_values, width, label='AIC', + color='steelblue', alpha=0.8) + bars2 = ax1.bar(x + width / 2, bic_values, width, label='BIC', + color='coral', alpha=0.8) + + ax1.set_xlabel('模型', fontsize=12) + ax1.set_ylabel('信息准则值', fontsize=12) + ax1.set_title('GARCH 模型信息准则对比(越小越好)', fontsize=13) + ax1.set_xticks(x) + ax1.set_xticklabels(model_names, fontsize=11) + ax1.legend(fontsize=11) + ax1.grid(True, alpha=0.3, axis='y') + + # 在柱状图上标注数值 + for bar in bars1: + height = bar.get_height() + ax1.annotate(f'{height:.1f}', + xy=(bar.get_x() + bar.get_width() / 2, height), + xytext=(0, 3), textcoords="offset points", + ha='center', va='bottom', fontsize=9) + for bar in bars2: + height = bar.get_height() + ax1.annotate(f'{height:.1f}', + xy=(bar.get_x() + bar.get_width() / 2, height), + xytext=(0, 3), textcoords="offset points", + ha='center', 
va='bottom', fontsize=9) + + # 下图:各模型条件波动率时序对比 + ax2 = axes[1] + colors = {'GARCH': '#1f77b4', 'EGARCH': '#ff7f0e', 'GJR-GARCH': '#2ca02c'} + for name in model_names: + cv = model_results[name]['conditional_volatility'] + ax2.plot(cv.index, cv.values, linewidth=0.7, + color=colors.get(name, 'gray'), + label=name, alpha=0.8) + + ax2.set_xlabel('日期', fontsize=12) + ax2.set_ylabel('条件波动率', fontsize=12) + ax2.set_title('各GARCH模型条件波动率对比', fontsize=13) + ax2.legend(fontsize=11) + ax2.grid(True, alpha=0.3) + + fig.tight_layout() + fig.savefig(output_dir / 'garch_model_comparison.png', + dpi=150, bbox_inches='tight') + plt.close(fig) + print(f"[保存] {output_dir / 'garch_model_comparison.png'}") + + +def plot_leverage_effect(leverage_results: dict, output_dir: Path): + """绘制杠杆效应散点图""" + # 找到有数据的窗口 + valid_windows = [w for w, r in leverage_results.items() + if 'return_series' in r] + n_plots = len(valid_windows) + if n_plots == 0: + print("[警告] 无有效杠杆效应数据可绘制") + return + + fig, axes = plt.subplots(1, n_plots, figsize=(6 * n_plots, 5)) + if n_plots == 1: + axes = [axes] + + for idx, window_key in enumerate(valid_windows): + ax = axes[idx] + data = leverage_results[window_key] + ret = data['return_series'] + fvol = data['future_vol_series'] + + # 散点图(采样避免过多点) + n_sample = min(len(ret), 2000) + sample_idx = np.random.choice(len(ret), n_sample, replace=False) + ax.scatter(ret.values[sample_idx], fvol.values[sample_idx], + s=5, alpha=0.3, color='steelblue') + + # 回归线 + z = np.polyfit(ret.values, fvol.values, 1) + p = np.poly1d(z) + x_line = np.linspace(ret.min(), ret.max(), 100) + ax.plot(x_line, p(x_line), 'r-', linewidth=2) + + corr = data['pearson_correlation'] + p_val = data['pearson_pvalue'] + ax.set_xlabel('当日对数收益率', fontsize=11) + ax.set_ylabel(f'未来{window_key}平均|收益率|', fontsize=11) + ax.set_title(f'杠杆效应 ({window_key})\n' + f'Pearson r={corr:.4f}, p={p_val:.2e}', fontsize=11) + ax.grid(True, alpha=0.3) + + fig.tight_layout() + fig.savefig(output_dir / 'leverage_effect_scatter.png', + dpi=150, bbox_inches='tight') + plt.close(fig) + print(f"[保存] {output_dir / 'leverage_effect_scatter.png'}") + + +# ============================================================ +# 6. 
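结果打印(本节之前先插入一个补充示例)
+# ============================================================
+
+# 补充示例(报告补充,非原脚本内容):已实现波动率的年化口径示意。
+# multi_window_realized_vol 的实现未在本节展示,此处仅给出一个等价口径的
+# 最小草图以说明年化因子:加密市场全年无休,按 sqrt(365) 年化,而非股市
+# 常用的 sqrt(252);是否与原实现逐行一致属于假设。
+def _rv_annualized_sketch(returns: pd.Series, window: int = 30) -> pd.Series:
+    """滚动标准差 × sqrt(365):已实现波动率年化示意"""
+    return returns.rolling(window=window).std() * np.sqrt(365)
+
+
+# ============================================================
+# 6. 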
结果打印 +# ============================================================ + +def print_realized_vol_summary(vol_df: pd.DataFrame): + """打印已实现波动率统计摘要""" + print("\n" + "=" * 60) + print("多窗口已实现波动率统计(年化)") + print("=" * 60) + summary = vol_df.describe().T + for col in vol_df.columns: + s = vol_df[col].dropna() + print(f"\n {col}:") + print(f" 均值: {s.mean():.4f} ({s.mean() * 100:.2f}%)") + print(f" 中位数: {s.median():.4f} ({s.median() * 100:.2f}%)") + print(f" 最大值: {s.max():.4f} ({s.max() * 100:.2f}%)") + print(f" 最小值: {s.min():.4f} ({s.min() * 100:.2f}%)") + print(f" 标准差: {s.std():.4f}") + + +def print_acf_power_law_results(results: dict): + """打印ACF幂律衰减检验结果""" + print("\n" + "=" * 60) + print("波动率自相关幂律衰减检验(长记忆性)") + print("=" * 60) + print(f" 幂律衰减指数 d (线性拟合): {results['d']:.4f}") + print(f" 幂律衰减指数 d (非线性拟合): {results['d_nonlinear']:.4f}") + print(f" 拟合优度 R²: {results['r_squared']:.4f}") + print(f" 回归斜率: {results['slope']:.4f}") + print(f" 回归截距: {results['intercept']:.4f}") + print(f" p值: {results['p_value']:.2e}") + print(f" 标准误: {results['std_err']:.4f}") + print(f"\n 长记忆性判断 (0 < d < 1): " + f"{'是 - 存在长记忆性' if results['is_long_memory'] else '否'}") + if results['is_long_memory']: + print(f" → |收益率|的自相关以幂律速度缓慢衰减") + print(f" → 波动率聚集具有长记忆特征,GARCH模型的持续性可能不足以刻画") + + +def print_model_comparison(model_results: dict): + """打印GARCH模型对比结果""" + print("\n" + "=" * 60) + print("GARCH / EGARCH / GJR-GARCH 模型对比") + print("=" * 60) + + print(f"\n {'模型':<14} {'AIC':>12} {'BIC':>12} {'对数似然':>12}") + print(" " + "-" * 52) + for name, res in model_results.items(): + print(f" {name:<14} {res['aic']:>12.2f} {res['bic']:>12.2f} " + f"{res['log_likelihood']:>12.2f}") + + # 找到最优模型 + best_aic = min(model_results.items(), key=lambda x: x[1]['aic']) + best_bic = min(model_results.items(), key=lambda x: x[1]['bic']) + print(f"\n AIC最优模型: {best_aic[0]} (AIC={best_aic[1]['aic']:.2f})") + print(f" BIC最优模型: {best_bic[0]} (BIC={best_bic[1]['bic']:.2f})") + + # 杠杆效应参数 + print("\n 杠杆效应参数:") + for name in ['EGARCH', 'GJR-GARCH']: + if name in model_results and 'leverage_param' in model_results[name]: + gamma = model_results[name]['leverage_param'] + print(f" {name} gamma[1] = {gamma:.6f}") + if name == 'EGARCH': + # EGARCH中gamma<0表示负冲击增大波动 + if gamma < 0: + print(f" → gamma < 0: 负收益(下跌)产生更大波动,存在杠杆效应") + else: + print(f" → gamma >= 0: 未观察到明显杠杆效应") + elif name == 'GJR-GARCH': + # GJR-GARCH中gamma>0表示负冲击的额外影响 + if gamma > 0: + print(f" → gamma > 0: 负冲击产生额外波动增量,存在杠杆效应") + else: + print(f" → gamma <= 0: 未观察到明显杠杆效应") + + # 打印各模型详细参数 + print("\n 各模型详细参数:") + for name, res in model_results.items(): + print(f"\n [{name}]") + for param_name, param_val in res['params'].items(): + print(f" {param_name}: {param_val:.6f}") + + +def print_leverage_results(leverage_results: dict): + """打印杠杆效应分析结果""" + print("\n" + "=" * 60) + print("杠杆效应分析:收益率与未来波动率的相关性") + print("=" * 60) + print(f"\n {'窗口':<8} {'Pearson r':>12} {'p值':>12} " + f"{'Spearman r':>12} {'p值':>12} {'样本数':>8}") + print(" " + "-" * 66) + for window, data in leverage_results.items(): + if 'pearson_correlation' in data: + print(f" {window:<8} " + f"{data['pearson_correlation']:>12.4f} " + f"{data['pearson_pvalue']:>12.2e} " + f"{data['spearman_correlation']:>12.4f} " + f"{data['spearman_pvalue']:>12.2e} " + f"{data['n_samples']:>8d}") + else: + print(f" {window:<8} {'N/A':>12} {'N/A':>12} " + f"{'N/A':>12} {'N/A':>12} {data.get('n_samples', 0):>8d}") + + # 总结 + print("\n 解读:") + print(" - 相关系数 < 0: 负收益(下跌)后波动率上升 → 存在杠杆效应") + print(" - 相关系数 ≈ 0: 收益率方向与未来波动率无关") + print(" - 相关系数 > 0: 
正收益(上涨)后波动率上升(反向杠杆/波动率反馈效应)") + print(" - 注意: BTC作为加密货币,杠杆效应可能与传统股票不同") + + +# ============================================================ +# 7. 主入口 +# ============================================================ + +def run_volatility_analysis(df: pd.DataFrame, output_dir: str = "output/volatility"): + """ + 波动率聚集与非对称GARCH分析主函数 + + Parameters + ---------- + df : pd.DataFrame + 日线K线数据(含'close'列,DatetimeIndex索引) + output_dir : str + 图表输出目录 + """ + output_dir = Path(output_dir) + output_dir.mkdir(parents=True, exist_ok=True) + + print("=" * 60) + print("BTC 波动率聚集与非对称 GARCH 分析") + print("=" * 60) + print(f"数据范围: {df.index.min()} ~ {df.index.max()}") + print(f"样本数量: {len(df)}") + + # 计算日对数收益率 + daily_returns = log_returns(df['close']) + print(f"日对数收益率样本数: {len(daily_returns)}") + + # 设置中文字体(兼容多系统) + plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei', 'DejaVu Sans'] + plt.rcParams['axes.unicode_minus'] = False + + # 固定随机种子以保证杠杆效应散点图采样可复现 + np.random.seed(42) + + # --- 多窗口已实现波动率 --- + print("\n>>> 计算多窗口已实现波动率 (7d, 30d, 90d)...") + vol_df = multi_window_realized_vol(daily_returns, windows=[7, 30, 90]) + print_realized_vol_summary(vol_df) + plot_realized_volatility(vol_df, output_dir) + + # --- ACF幂律衰减检验 --- + print("\n>>> 执行波动率自相关幂律衰减检验...") + acf_results = volatility_acf_power_law(daily_returns, max_lags=200) + print_acf_power_law_results(acf_results) + plot_acf_power_law(acf_results, output_dir) + + # --- GARCH模型对比 --- + print("\n>>> 拟合 GARCH / EGARCH / GJR-GARCH 模型...") + model_results = compare_garch_models(daily_returns) + print_model_comparison(model_results) + plot_model_comparison(model_results, output_dir) + + # --- 杠杆效应分析 --- + print("\n>>> 执行杠杆效应分析...") + leverage_results = leverage_effect_analysis(daily_returns, + forward_windows=[5, 10, 20]) + print_leverage_results(leverage_results) + plot_leverage_effect(leverage_results, output_dir) + + print("\n" + "=" * 60) + print("波动率分析完成!") + print(f"图表已保存至: {output_dir.resolve()}") + print("=" * 60) + + # 返回所有结果供后续使用 + return { + 'realized_vol': vol_df, + 'acf_power_law': acf_results, + 'model_comparison': model_results, + 'leverage_effect': leverage_results, + } + + +# ============================================================ +# 独立运行入口 +# ============================================================ + +if __name__ == '__main__': + df = load_daily() + run_volatility_analysis(df) diff --git a/src/volume_price_analysis.py b/src/volume_price_analysis.py new file mode 100644 index 0000000..2afe8cf --- /dev/null +++ b/src/volume_price_analysis.py @@ -0,0 +1,577 @@ +"""成交量-价格关系与OBV分析 + +分析BTC成交量与价格变动的关系,包括Spearman相关性、 +Taker买入比例领先分析、Granger因果检验和OBV背离检测。 +""" + +import matplotlib +matplotlib.use('Agg') + +import numpy as np +import pandas as pd +import matplotlib.pyplot as plt +from scipy import stats +from statsmodels.tsa.stattools import grangercausalitytests +from pathlib import Path +from typing import Dict, List, Tuple + +# 中文显示支持 +plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei', 'DejaVu Sans'] +plt.rcParams['axes.unicode_minus'] = False + + +# ============================================================================= +# 核心分析函数 +# ============================================================================= + +def _spearman_volume_returns(volume: pd.Series, returns: pd.Series) -> Dict: + """Spearman秩相关: 成交量 vs |收益率| + + 使用Spearman而非Pearson,因为量价关系通常是非线性的。 + + Returns + ------- + dict + 包含 correlation, p_value, n_samples + """ + # 对齐索引并去除NaN + abs_ret = returns.abs() + aligned = pd.concat([volume, abs_ret], 
axis=1, keys=['volume', 'abs_return']).dropna() + + corr, p_val = stats.spearmanr(aligned['volume'], aligned['abs_return']) + + return { + 'correlation': corr, + 'p_value': p_val, + 'n_samples': len(aligned), + } + + +def _taker_buy_ratio_lead_lag( + taker_buy_ratio: pd.Series, + returns: pd.Series, + max_lag: int = 20, +) -> pd.DataFrame: + """Taker买入比例领先-滞后分析 + + 计算 taker_buy_ratio(t) 与 returns(t+lag) 的互相关, + 检验买入比例对未来收益的预测能力。 + + Parameters + ---------- + taker_buy_ratio : pd.Series + Taker买入占比序列 + returns : pd.Series + 对数收益率序列 + max_lag : int + 最大领先天数 + + Returns + ------- + pd.DataFrame + 包含 lag, correlation, p_value, significant 列 + """ + results = [] + for lag in range(1, max_lag + 1): + # taker_buy_ratio(t) vs returns(t+lag) + ratio_shifted = taker_buy_ratio.shift(lag) + aligned = pd.concat([ratio_shifted, returns], axis=1).dropna() + aligned.columns = ['ratio', 'return'] + + if len(aligned) < 30: + continue + + corr, p_val = stats.spearmanr(aligned['ratio'], aligned['return']) + results.append({ + 'lag': lag, + 'correlation': corr, + 'p_value': p_val, + 'significant': p_val < 0.05, + }) + + return pd.DataFrame(results) + + +def _granger_causality( + volume: pd.Series, + returns: pd.Series, + max_lag: int = 10, +) -> Dict[str, pd.DataFrame]: + """双向Granger因果检验: 成交量 ↔ 收益率 + + Parameters + ---------- + volume : pd.Series + 成交量序列 + returns : pd.Series + 收益率序列 + max_lag : int + 最大滞后阶数 + + Returns + ------- + dict + 'volume_to_returns': 成交量→收益率 的p值表 + 'returns_to_volume': 收益率→成交量 的p值表 + """ + # 对齐并去除NaN + aligned = pd.concat([volume, returns], axis=1, keys=['volume', 'returns']).dropna() + + results = {} + + # 方向1: 成交量 → 收益率 (检验成交量是否Granger-cause收益率) + # grangercausalitytests 的数据格式: [被预测变量, 预测变量] + try: + data_v2r = aligned[['returns', 'volume']].values + gc_v2r = grangercausalitytests(data_v2r, maxlag=max_lag, verbose=False) + rows_v2r = [] + for lag_order in range(1, max_lag + 1): + test_results = gc_v2r[lag_order][0] + rows_v2r.append({ + 'lag': lag_order, + 'ssr_ftest_pval': test_results['ssr_ftest'][1], + 'ssr_chi2test_pval': test_results['ssr_chi2test'][1], + 'lrtest_pval': test_results['lrtest'][1], + 'params_ftest_pval': test_results['params_ftest'][1], + }) + results['volume_to_returns'] = pd.DataFrame(rows_v2r) + except Exception as e: + print(f" [警告] 成交量→收益率 Granger检验失败: {e}") + results['volume_to_returns'] = pd.DataFrame() + + # 方向2: 收益率 → 成交量 + try: + data_r2v = aligned[['volume', 'returns']].values + gc_r2v = grangercausalitytests(data_r2v, maxlag=max_lag, verbose=False) + rows_r2v = [] + for lag_order in range(1, max_lag + 1): + test_results = gc_r2v[lag_order][0] + rows_r2v.append({ + 'lag': lag_order, + 'ssr_ftest_pval': test_results['ssr_ftest'][1], + 'ssr_chi2test_pval': test_results['ssr_chi2test'][1], + 'lrtest_pval': test_results['lrtest'][1], + 'params_ftest_pval': test_results['params_ftest'][1], + }) + results['returns_to_volume'] = pd.DataFrame(rows_r2v) + except Exception as e: + print(f" [警告] 收益率→成交量 Granger检验失败: {e}") + results['returns_to_volume'] = pd.DataFrame() + + return results + + +def _compute_obv(df: pd.DataFrame) -> pd.Series: + """计算OBV (On-Balance Volume) + + 规则: + - 收盘价上涨: OBV += volume + - 收盘价下跌: OBV -= volume + - 收盘价持平: OBV 不变 + """ + close = df['close'] + volume = df['volume'] + + direction = np.sign(close.diff()) + obv = (direction * volume).fillna(0).cumsum() + obv.name = 'obv' + return obv + + +def _detect_obv_divergences( + prices: pd.Series, + obv: pd.Series, + window: int = 60, + lookback: int = 5, +) -> pd.DataFrame: + """检测OBV-价格背离 + 
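+
+    用法示意(本示例为报告补充,非原流程;参数含义见下)::
+
+        obv = _compute_obv(df)
+        div = _detect_obv_divergences(df['close'], obv, window=60, lookback=5)
+        div[div['type'] == 'bearish']  # 查看顶背离事件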
+
+    背离类型:
+    - 顶背离 (bearish): 价格创新高但OBV未创新高 → 潜在下跌信号
+    - 底背离 (bullish): 价格创新低但OBV未创新低 → 潜在上涨信号
+
+    Parameters
+    ----------
+    prices : pd.Series
+        收盘价序列
+    obv : pd.Series
+        OBV序列
+    window : int
+        滚动窗口大小,用于判断"新高"/"新低"
+    lookback : int
+        新高/新低确认回看天数
+
+    Returns
+    -------
+    pd.DataFrame
+        背离事件表,包含 date, type, price, obv 列
+    """
+    divergences = []
+
+    # 滚动最高/最低
+    price_rolling_max = prices.rolling(window=window, min_periods=window).max()
+    price_rolling_min = prices.rolling(window=window, min_periods=window).min()
+    obv_rolling_max = obv.rolling(window=window, min_periods=window).max()
+    obv_rolling_min = obv.rolling(window=window, min_periods=window).min()
+
+    for i in range(window + lookback, len(prices)):
+        idx = prices.index[i]
+        price_val = prices.iloc[i]
+        obv_val = obv.iloc[i]
+
+        rolling_max_price = price_rolling_max.iloc[i]
+        rolling_max_obv = obv_rolling_max.iloc[i]
+        rolling_min_price = price_rolling_min.iloc[i]
+        rolling_min_obv = obv_rolling_min.iloc[i]
+
+        # OBV 阈值改用极值区间宽度的5%作缓冲:OBV 可以取负值,
+        # 按比例缩放(如 *0.95)在负值区会使判断方向颠倒
+        obv_range = rolling_max_obv - rolling_min_obv
+
+        # 顶背离: 价格触及滚动最高,而 OBV 明显未达其滚动最高
+        if price_val >= rolling_max_price * 0.998:
+            if obv_val < rolling_max_obv - 0.05 * obv_range:
+                divergences.append({
+                    'date': idx,
+                    'type': 'bearish',  # 顶背离
+                    'price': price_val,
+                    'obv': obv_val,
+                })
+
+        # 底背离: 价格触及滚动最低,而 OBV 明显高于其滚动最低
+        if price_val <= rolling_min_price * 1.002:
+            if obv_val > rolling_min_obv + 0.05 * obv_range:
+                divergences.append({
+                    'date': idx,
+                    'type': 'bullish',  # 底背离
+                    'price': price_val,
+                    'obv': obv_val,
+                })
+
+    df_div = pd.DataFrame(divergences)
+
+    # 去除密集重复信号 (同类型信号间隔至少10天)
+    if not df_div.empty:
+        df_div = df_div.sort_values('date')
+        filtered = [df_div.iloc[0]]
+        for _, row in df_div.iloc[1:].iterrows():
+            last = filtered[-1]
+            if row['type'] != last['type'] or (row['date'] - last['date']).days >= 10:
+                filtered.append(row)
+        df_div = pd.DataFrame(filtered).reset_index(drop=True)
+
+    return df_div
+
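+
+# 补充示例(报告补充,非原流程):_compute_obv 的玩具数据自检,
+# 用于确认 OBV 的累加方向约定。收盘价 1→2→1→1 的方向为 +1, -1, 0,
+# OBV 依次累加 +20, -30, +0。
+def _demo_obv_toy() -> None:
+    toy = pd.DataFrame({
+        'close':  [1.0, 2.0, 1.0, 1.0],
+        'volume': [10.0, 20.0, 30.0, 40.0],
+    })
+    obv = _compute_obv(toy)
+    assert list(obv) == [0.0, 20.0, -10.0, -10.0], "OBV 方向约定与预期不符"
+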
+
+# =============================================================================
+# 可视化函数
+# =============================================================================
+
+def _plot_volume_return_scatter(
+    volume: pd.Series,
+    returns: pd.Series,
+    spearman_result: Dict,
+    output_dir: Path,
+):
+    """图1: 成交量 vs |收益率| 散点图"""
+    fig, ax = plt.subplots(figsize=(10, 7))
+
+    abs_ret = returns.abs()
+    aligned = pd.concat([volume, abs_ret], axis=1, keys=['volume', 'abs_return']).dropna()
+
+    ax.scatter(aligned['volume'], aligned['abs_return'],
+               s=5, alpha=0.3, color='steelblue')
+
+    rho = spearman_result['correlation']
+    p_val = spearman_result['p_value']
+    ax.set_xlabel('成交量', fontsize=12)
+    ax.set_ylabel('|对数收益率|', fontsize=12)
+    ax.set_title(f'成交量 vs |收益率| 散点图\nSpearman ρ={rho:.4f}, p={p_val:.2e}', fontsize=13)
+    ax.grid(True, alpha=0.3)
+
+    fig.savefig(output_dir / 'volume_return_scatter.png', dpi=150, bbox_inches='tight')
+    plt.close(fig)
+    print(f"  [图] 量价散点图已保存: {output_dir / 'volume_return_scatter.png'}")
+
+
+def _plot_lead_lag_correlation(
+    lead_lag_df: pd.DataFrame,
+    output_dir: Path,
+):
+    """图2: Taker买入比例领先-滞后相关性柱状图"""
+    fig, ax = plt.subplots(figsize=(12, 6))
+
+    if lead_lag_df.empty:
+        ax.text(0.5, 0.5, '数据不足,无法计算领先-滞后相关性',
+                transform=ax.transAxes, ha='center', va='center', fontsize=14)
+        fig.savefig(output_dir / 'taker_buy_lead_lag.png', dpi=150, bbox_inches='tight')
+        plt.close(fig)
+        return
+
+    colors = ['red' if sig else 'steelblue'
+              for sig in lead_lag_df['significant']]
+
+    ax.bar(lead_lag_df['lag'], lead_lag_df['correlation'],
+           color=colors, alpha=0.8, edgecolor='white')
+
+    # 零相关参考线
+    ax.axhline(y=0, color='black', linewidth=0.5)
+
+    ax.set_xlabel('领先天数 (lag)', fontsize=12)
+    ax.set_ylabel('Spearman 相关系数', fontsize=12)
+    ax.set_title('Taker买入比例对未来收益的领先相关性\n(红色=p<0.05 显著)', fontsize=13)
+    ax.set_xticks(lead_lag_df['lag'])
+    ax.grid(True, alpha=0.3, axis='y')
+
+    fig.savefig(output_dir / 'taker_buy_lead_lag.png', dpi=150, bbox_inches='tight')
+    plt.close(fig)
+    print(f"  [图] Taker买入比例领先分析已保存: {output_dir / 'taker_buy_lead_lag.png'}")
+
+
+def _plot_granger_heatmap(
+    granger_results: Dict[str, pd.DataFrame],
+    output_dir: Path,
+):
+    """图3: Granger因果检验p值热力图"""
+    fig, axes = plt.subplots(1, 2, figsize=(16, 6))
+
+    titles = {
+        'volume_to_returns': '成交量 → 收益率',
+        'returns_to_volume': '收益率 → 成交量',
+    }
+
+    im = None  # 两个方向都检验失败时保持为 None,避免下方引用未定义变量
+    for ax, (direction, df_gc) in zip(axes, granger_results.items()):
+        if df_gc.empty:
+            ax.text(0.5, 0.5, '检验失败', transform=ax.transAxes,
+                    ha='center', va='center', fontsize=14)
+            ax.set_title(titles[direction], fontsize=13)
+            continue
+
+        # 构建热力图矩阵
+        test_names = ['ssr_ftest_pval', 'ssr_chi2test_pval', 'lrtest_pval', 'params_ftest_pval']
+        test_labels = ['SSR F-test', 'SSR Chi2', 'LR test', 'Params F-test']
+        lags = df_gc['lag'].values
+
+        heatmap_data = df_gc[test_names].values.T  # shape: (4, n_lags)
+
+        im = ax.imshow(heatmap_data, aspect='auto', cmap='RdYlGn',
+                       vmin=0, vmax=0.1, interpolation='nearest')
+
+        ax.set_xticks(range(len(lags)))
+        ax.set_xticklabels(lags, fontsize=9)
+        ax.set_yticks(range(len(test_labels)))
+        ax.set_yticklabels(test_labels, fontsize=9)
+        ax.set_xlabel('滞后阶数', fontsize=11)
+        ax.set_title(f'Granger因果: {titles[direction]}', fontsize=13)
+
+        # 标注p值
+        for i in range(len(test_labels)):
+            for j in range(len(lags)):
+                val = heatmap_data[i, j]
+                color = 'white' if val < 0.03 else 'black'
+                ax.text(j, i, f'{val:.3f}', ha='center', va='center',
+                        fontsize=7, color=color)
+
+    if im is not None:
+        fig.colorbar(im, ax=axes, label='p-value', shrink=0.8)
+    fig.tight_layout()
+    fig.savefig(output_dir / 'granger_causality_heatmap.png', dpi=150, bbox_inches='tight')
+    plt.close(fig)
+    print(f"  [图] Granger因果热力图已保存: {output_dir / 'granger_causality_heatmap.png'}")
+
+
+def _plot_obv_with_divergences(
+    df: pd.DataFrame,
+    obv: pd.Series,
+    divergences: pd.DataFrame,
+    output_dir: Path,
+):
+    """图4: OBV vs 价格 + 背离标记"""
+    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(16, 10), sharex=True,
+                                   gridspec_kw={'height_ratios': [2, 1]})
+
+    # 上图: 价格
+    ax1.plot(df.index, df['close'], color='black', linewidth=0.8, label='BTC 收盘价')
+    ax1.set_ylabel('价格 (USDT)', fontsize=12)
+    ax1.set_title('BTC 价格与OBV背离分析', fontsize=14)
+    ax1.set_yscale('log')
+    ax1.grid(True, alpha=0.3, which='both')
+
+    # 下图: OBV
+    ax2.plot(obv.index, obv.values, color='steelblue', linewidth=0.8, label='OBV')
+    ax2.set_ylabel('OBV', fontsize=12)
+    ax2.set_xlabel('日期', fontsize=12)
+    ax2.grid(True, alpha=0.3)
+
+    # 标记背离
+    if not divergences.empty:
+        bearish = divergences[divergences['type'] == 'bearish']
+        bullish = divergences[divergences['type'] == 'bullish']
+
+        if not bearish.empty:
+            ax1.scatter(bearish['date'], bearish['price'],
+                        marker='v', s=60, color='red', zorder=5,
+                        label=f'顶背离 ({len(bearish)}次)', alpha=0.7)
+            for _, row in bearish.iterrows():
+                ax2.axvline(row['date'], color='red', alpha=0.2, linewidth=0.5)
+
+        if not bullish.empty:
+            ax1.scatter(bullish['date'], bullish['price'],
+                        marker='^', s=60, color='green', zorder=5,
+                        label=f'底背离 ({len(bullish)}次)', alpha=0.7)
+            for _, row in bullish.iterrows():
+                ax2.axvline(row['date'], color='green', alpha=0.2, 
linewidth=0.5) + + ax1.legend(fontsize=10, loc='upper left') + ax2.legend(fontsize=10, loc='upper left') + + fig.tight_layout() + fig.savefig(output_dir / 'obv_divergence.png', dpi=150, bbox_inches='tight') + plt.close(fig) + print(f" [图] OBV背离分析已保存: {output_dir / 'obv_divergence.png'}") + + +# ============================================================================= +# 主入口 +# ============================================================================= + +def run_volume_price_analysis(df: pd.DataFrame, output_dir: str = "output") -> Dict: + """成交量-价格关系与OBV分析 — 主入口函数 + + Parameters + ---------- + df : pd.DataFrame + 由 data_loader.load_daily() 返回的日线数据,含 DatetimeIndex, + close, volume, taker_buy_volume 等列 + output_dir : str + 图表输出目录 + + Returns + ------- + dict + 分析结果摘要 + """ + output_dir = Path(output_dir) + output_dir.mkdir(parents=True, exist_ok=True) + + print("=" * 60) + print(" BTC 成交量-价格关系分析") + print("=" * 60) + + # 准备数据 + prices = df['close'].dropna() + volume = df['volume'].dropna() + log_ret = np.log(prices / prices.shift(1)).dropna() + + # 计算taker买入比例 + taker_buy_ratio = (df['taker_buy_volume'] / df['volume'].replace(0, np.nan)).dropna() + + print(f"\n数据范围: {df.index[0].date()} ~ {df.index[-1].date()}") + print(f"样本数量: {len(df)}") + + # ---- 步骤1: Spearman相关性 ---- + print("\n--- Spearman 成交量-|收益率| 相关性 ---") + spearman_result = _spearman_volume_returns(volume, log_ret) + print(f" Spearman ρ: {spearman_result['correlation']:.4f}") + print(f" p-value: {spearman_result['p_value']:.2e}") + print(f" 样本量: {spearman_result['n_samples']}") + if spearman_result['p_value'] < 0.01: + print(" >> 结论: 成交量与|收益率|存在显著正相关(成交量放大伴随大幅波动)") + else: + print(" >> 结论: 成交量与|收益率|相关性不显著") + + # ---- 步骤2: Taker买入比例领先分析 ---- + print("\n--- Taker买入比例领先分析 ---") + lead_lag_df = _taker_buy_ratio_lead_lag(taker_buy_ratio, log_ret, max_lag=20) + if not lead_lag_df.empty: + sig_lags = lead_lag_df[lead_lag_df['significant']] + if not sig_lags.empty: + print(f" 显著领先期 (p<0.05):") + for _, row in sig_lags.iterrows(): + print(f" lag={int(row['lag']):>2d}天: ρ={row['correlation']:.4f}, p={row['p_value']:.4f}") + best = sig_lags.loc[sig_lags['correlation'].abs().idxmax()] + print(f" >> 最强领先信号: lag={int(best['lag'])}天, ρ={best['correlation']:.4f}") + else: + print(" 未发现显著的领先关系 (所有lag的p>0.05)") + else: + print(" 数据不足,无法进行领先-滞后分析") + + # ---- 步骤3: Granger因果检验 ---- + print("\n--- Granger 因果检验 (双向, lag 1-10) ---") + granger_results = _granger_causality(volume, log_ret, max_lag=10) + + for direction, label in [('volume_to_returns', '成交量→收益率'), + ('returns_to_volume', '收益率→成交量')]: + df_gc = granger_results[direction] + if not df_gc.empty: + # 使用SSR F-test的p值 + sig_gc = df_gc[df_gc['ssr_ftest_pval'] < 0.05] + if not sig_gc.empty: + print(f" {label}: 在以下滞后阶显著 (SSR F-test p<0.05):") + for _, row in sig_gc.iterrows(): + print(f" lag={int(row['lag'])}: p={row['ssr_ftest_pval']:.4f}") + else: + print(f" {label}: 在所有滞后阶均不显著") + else: + print(f" {label}: 检验失败") + + # ---- 步骤4: OBV计算与背离检测 ---- + print("\n--- OBV 与 价格背离分析 ---") + obv = _compute_obv(df) + divergences = _detect_obv_divergences(prices, obv, window=60, lookback=5) + + if not divergences.empty: + bearish_count = len(divergences[divergences['type'] == 'bearish']) + bullish_count = len(divergences[divergences['type'] == 'bullish']) + print(f" 检测到 {len(divergences)} 个背离信号:") + print(f" 顶背离 (看跌): {bearish_count} 次") + print(f" 底背离 (看涨): {bullish_count} 次") + + # 最近的背离 + recent = divergences.tail(5) + print(f" 最近 {len(recent)} 个背离:") + for _, row in recent.iterrows(): + div_type = 
'顶背离' if row['type'] == 'bearish' else '底背离' + date_str = row['date'].strftime('%Y-%m-%d') + print(f" {date_str}: {div_type}, 价格=${row['price']:,.0f}") + else: + bearish_count = 0 + bullish_count = 0 + print(" 未检测到明显的OBV-价格背离") + + # ---- 步骤5: 生成可视化 ---- + print("\n--- 生成可视化图表 ---") + _plot_volume_return_scatter(volume, log_ret, spearman_result, output_dir) + _plot_lead_lag_correlation(lead_lag_df, output_dir) + _plot_granger_heatmap(granger_results, output_dir) + _plot_obv_with_divergences(df, obv, divergences, output_dir) + + print("\n" + "=" * 60) + print(" 成交量-价格分析完成") + print("=" * 60) + + # 返回结果摘要 + return { + 'spearman': spearman_result, + 'lead_lag': { + 'significant_lags': lead_lag_df[lead_lag_df['significant']]['lag'].tolist() + if not lead_lag_df.empty else [], + }, + 'granger': { + 'volume_to_returns_sig_lags': granger_results['volume_to_returns'][ + granger_results['volume_to_returns']['ssr_ftest_pval'] < 0.05 + ]['lag'].tolist() if not granger_results['volume_to_returns'].empty else [], + 'returns_to_volume_sig_lags': granger_results['returns_to_volume'][ + granger_results['returns_to_volume']['ssr_ftest_pval'] < 0.05 + ]['lag'].tolist() if not granger_results['returns_to_volume'].empty else [], + }, + 'obv_divergences': { + 'total': len(divergences), + 'bearish': bearish_count, + 'bullish': bullish_count, + }, + } + + +if __name__ == '__main__': + from data_loader import load_daily + df = load_daily() + results = run_volume_price_analysis(df, output_dir='../output/volume_price') diff --git a/src/wavelet_analysis.py b/src/wavelet_analysis.py new file mode 100644 index 0000000..0795a6d --- /dev/null +++ b/src/wavelet_analysis.py @@ -0,0 +1,817 @@ +"""小波变换分析模块 - CWT时频分析、全局小波谱、显著性检验、周期强度追踪""" + +import matplotlib +matplotlib.use('Agg') + +import numpy as np +import pandas as pd +import pywt +import matplotlib.pyplot as plt +import matplotlib.dates as mdates +from matplotlib.colors import LogNorm +from scipy.signal import detrend +from pathlib import Path +from typing import Dict, List, Optional, Tuple + +from src.preprocessing import log_returns, standardize + + +# ============================================================================ +# 核心参数配置 +# ============================================================================ + +WAVELET = 'cmor1.5-1.0' # 复Morlet小波 (bandwidth=1.5, center_freq=1.0) +MIN_PERIOD = 7 # 最小周期(天) +MAX_PERIOD = 1500 # 最大周期(天) +NUM_SCALES = 256 # 尺度数量 +KEY_PERIODS = [30, 90, 365, 1400] # 关键追踪周期(天) +N_SURROGATES = 1000 # Monte Carlo替代数据数量 +SIGNIFICANCE_LEVEL = 0.95 # 显著性水平 +DPI = 150 # 图像分辨率 + + +# ============================================================================ +# 辅助函数:尺度与周期转换 +# ============================================================================ + +def _periods_to_scales(periods: np.ndarray, wavelet: str, dt: float = 1.0) -> np.ndarray: + """将周期(天)转换为CWT尺度参数 + + Parameters + ---------- + periods : np.ndarray + 目标周期数组(天) + wavelet : str + 小波名称 + dt : float + 采样间隔(天) + + Returns + ------- + np.ndarray + 对应的尺度数组 + """ + central_freq = pywt.central_frequency(wavelet) + scales = central_freq * periods / dt + return scales + + +def _scales_to_periods(scales: np.ndarray, wavelet: str, dt: float = 1.0) -> np.ndarray: + """将CWT尺度参数转换为周期(天)""" + central_freq = pywt.central_frequency(wavelet) + periods = scales * dt / central_freq + return periods + + +# ============================================================================ +# 核心计算:连续小波变换 +# ============================================================================ + +def 
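_demo_period_scale_roundtrip() -> None:
+    """补充示例:验证 _periods_to_scales 与 _scales_to_periods 互为逆变换。
+
+    本函数为报告补充的最小自检示意,不属于原分析流程;仅依赖上文两个
+    辅助函数,周期采用与 compute_cwt 相同的对数等间隔采样。
+    """
+    periods = np.logspace(np.log10(MIN_PERIOD), np.log10(MAX_PERIOD), 8)
+    scales = _periods_to_scales(periods, WAVELET)
+    assert np.allclose(_scales_to_periods(scales, WAVELET), periods), \
+        "周期-尺度转换应当可逆"
+
+
+def 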
compute_cwt(
+    signal: np.ndarray,
+    dt: float = 1.0,
+    wavelet: str = WAVELET,
+    min_period: float = MIN_PERIOD,
+    max_period: float = MAX_PERIOD,
+    num_scales: int = NUM_SCALES,
+) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
+    """计算连续小波变换(CWT)
+
+    Parameters
+    ----------
+    signal : np.ndarray
+        输入时间序列(建议已标准化)
+    dt : float
+        采样间隔(天)
+    wavelet : str
+        小波函数名称
+    min_period : float
+        最小分析周期(天)
+    max_period : float
+        最大分析周期(天)
+    num_scales : int
+        尺度分辨率
+
+    Returns
+    -------
+    coeffs : np.ndarray
+        CWT系数矩阵 (n_scales, n_times)
+    periods : np.ndarray
+        对应周期数组(天)
+    scales : np.ndarray
+        尺度数组
+    """
+    # 生成对数等间隔的周期序列
+    periods = np.logspace(np.log10(min_period), np.log10(max_period), num_scales)
+    scales = _periods_to_scales(periods, wavelet, dt)
+
+    # 执行CWT
+    coeffs, _ = pywt.cwt(signal, scales, wavelet, sampling_period=dt)
+
+    return coeffs, periods, scales
+
+
+def compute_power_spectrum(coeffs: np.ndarray) -> np.ndarray:
+    """计算小波功率谱 |W(s,t)|^2
+
+    Parameters
+    ----------
+    coeffs : np.ndarray
+        CWT系数矩阵
+
+    Returns
+    -------
+    np.ndarray
+        功率谱矩阵
+    """
+    return np.abs(coeffs) ** 2
+
+
+# ============================================================================
+# 影响锥(Cone of Influence)
+# ============================================================================
+
+def compute_coi(n: int, dt: float = 1.0, wavelet: str = WAVELET) -> np.ndarray:
+    """计算影响锥(COI)边界
+
+    影响锥标识边界效应显著的区域。对于Morlet小波,
+    COI对应于e-folding时间 sqrt(2) * scale。
+
+    Parameters
+    ----------
+    n : int
+        时间序列长度
+    dt : float
+        采样间隔
+    wavelet : str
+        小波名称
+
+    Returns
+    -------
+    coi_periods : np.ndarray
+        每个时间点对应的COI周期边界(天)
+    """
+    central_freq = pywt.central_frequency(wavelet)
+    # 每个时间点到最近数据边界的距离(从两端递增到中间)
+    t = np.arange(n) * dt
+    coi_time = np.minimum(t, (n - 1) * dt - t)
+    # 以 sqrt(2) * coi_time 作为时间域的 e-folding 宽度,再除以中心频率
+    # 换算到周期空间;对 cmor1.5-1.0 (central_freq = 1.0) 即
+    # coi_period = sqrt(2) * coi_time。这是一个粗略近似,系数约定与
+    # Torrence & Compo (1998) 的精确推导略有出入,仅用于标示边界效应区域。
+    coi_periods = np.sqrt(2) * coi_time / central_freq
+    # 最小值截断到采样间隔
+    coi_periods = np.maximum(coi_periods, dt)
+    return coi_periods
+
+
+# ============================================================================
+# AR(1) 红噪声显著性检验(Monte Carlo方法)
+# ============================================================================
+
+def _estimate_ar1(signal: np.ndarray) -> float:
+    """估计信号的AR(1)自相关系数(lag-1 autocorrelation)
+
+    Parameters
+    ----------
+    signal : np.ndarray
+        输入时间序列
+
+    Returns
+    -------
+    float
+        lag-1自相关系数
+    """
+    n = len(signal)
+    x = signal - np.mean(signal)
+    c0 = np.sum(x ** 2) / n
+    c1 = np.sum(x[:-1] * x[1:]) / n
+    if c0 == 0:
+        return 0.0
+    alpha = c1 / c0
+    return np.clip(alpha, -0.999, 0.999)
+
+
+def _generate_ar1_surrogate(n: int, alpha: float, variance: float) -> np.ndarray:
+    """生成AR(1)红噪声替代数据
+
+    x(t) = alpha * x(t-1) + noise
+
+    Parameters
+    ----------
+    n : int
+        序列长度
+    alpha : float
+        AR(1)系数
+    variance : float
+        原始信号方差
+
+    Returns
+    -------
+    np.ndarray
+        AR(1)替代序列
+    """
+    # 噪声方差取 variance*(1-alpha^2),使替代序列的平稳方差与原信号一致
+    noise_std = np.sqrt(variance * (1 - alpha ** 2))
+    noise = np.random.normal(0, noise_std, n)
+    surrogate = np.zeros(n)
+    surrogate[0] = noise[0]
+    for i in range(1, n):
+        surrogate[i] = alpha * surrogate[i - 1] + noise[i]
+    return surrogate
+
+
+def significance_test_monte_carlo(
+    signal: np.ndarray,
+    periods: np.ndarray,
+    dt: float = 1.0,
+    wavelet: str = WAVELET,
+    n_surrogates: int = N_SURROGATES,
+    significance_level: float = SIGNIFICANCE_LEVEL,
+) -> Tuple[np.ndarray, np.ndarray]:
+    
"""AR(1)红噪声Monte Carlo显著性检验 + + 生成大量AR(1)替代数据,计算其全局小波谱分布, + 得到指定置信水平的阈值。 + + Parameters + ---------- + signal : np.ndarray + 原始时间序列 + periods : np.ndarray + CWT分析的周期数组 + dt : float + 采样间隔 + wavelet : str + 小波名称 + n_surrogates : int + 替代数据数量 + significance_level : float + 显著性水平(如0.95对应95%置信度) + + Returns + ------- + significance_threshold : np.ndarray + 各周期的显著性阈值 + surrogate_spectra : np.ndarray + 所有替代数据的全局谱 (n_surrogates, n_periods) + """ + n = len(signal) + alpha = _estimate_ar1(signal) + variance = np.var(signal) + scales = _periods_to_scales(periods, wavelet, dt) + + print(f" AR(1) 系数 alpha = {alpha:.4f}") + print(f" 生成 {n_surrogates} 个AR(1)替代数据进行Monte Carlo检验...") + + surrogate_global_spectra = np.zeros((n_surrogates, len(periods))) + + for i in range(n_surrogates): + surrogate = _generate_ar1_surrogate(n, alpha, variance) + coeffs_surr, _ = pywt.cwt(surrogate, scales, wavelet, sampling_period=dt) + power_surr = np.abs(coeffs_surr) ** 2 + surrogate_global_spectra[i, :] = np.mean(power_surr, axis=1) + + if (i + 1) % 200 == 0: + print(f" Monte Carlo 进度: {i + 1}/{n_surrogates}") + + # 计算指定分位数作为显著性阈值 + percentile = significance_level * 100 + significance_threshold = np.percentile(surrogate_global_spectra, percentile, axis=0) + + return significance_threshold, surrogate_global_spectra + + +# ============================================================================ +# 全局小波谱 +# ============================================================================ + +def compute_global_wavelet_spectrum(power: np.ndarray) -> np.ndarray: + """计算全局小波谱(时间平均功率) + + Parameters + ---------- + power : np.ndarray + 功率谱矩阵 (n_scales, n_times) + + Returns + ------- + np.ndarray + 全局小波谱 (n_scales,) + """ + return np.mean(power, axis=1) + + +def find_significant_periods( + global_spectrum: np.ndarray, + significance_threshold: np.ndarray, + periods: np.ndarray, +) -> List[Dict]: + """找出超过显著性阈值的周期峰 + + 在全局谱中检测超过95%置信水平的局部极大值。 + + Parameters + ---------- + global_spectrum : np.ndarray + 全局小波谱 + significance_threshold : np.ndarray + 显著性阈值 + periods : np.ndarray + 周期数组 + + Returns + ------- + list of dict + 显著周期列表,每项包含 period, power, threshold, ratio + """ + # 找出超过阈值的区域 + above_mask = global_spectrum > significance_threshold + + significant = [] + if not np.any(above_mask): + return significant + + # 在超过阈值的连续区间内找峰值 + diff = np.diff(above_mask.astype(int)) + starts = np.where(diff == 1)[0] + 1 + ends = np.where(diff == -1)[0] + 1 + + # 处理边界情况 + if above_mask[0]: + starts = np.insert(starts, 0, 0) + if above_mask[-1]: + ends = np.append(ends, len(above_mask)) + + for s, e in zip(starts, ends): + segment = global_spectrum[s:e] + peak_idx = s + np.argmax(segment) + significant.append({ + 'period': float(periods[peak_idx]), + 'power': float(global_spectrum[peak_idx]), + 'threshold': float(significance_threshold[peak_idx]), + 'ratio': float(global_spectrum[peak_idx] / significance_threshold[peak_idx]), + }) + + # 按功率降序排列 + significant.sort(key=lambda x: x['power'], reverse=True) + return significant + + +# ============================================================================ +# 关键周期功率时间演化 +# ============================================================================ + +def extract_power_at_periods( + power: np.ndarray, + periods: np.ndarray, + key_periods: List[float] = None, +) -> Dict[float, np.ndarray]: + """提取关键周期处的功率随时间变化 + + Parameters + ---------- + power : np.ndarray + 功率谱矩阵 (n_scales, n_times) + periods : np.ndarray + 周期数组 + key_periods : list of float + 要追踪的关键周期(天) + + Returns + ------- + dict + {period: 
power_time_series} 映射 + """ + if key_periods is None: + key_periods = KEY_PERIODS + + result = {} + for target_period in key_periods: + # 找到最接近目标周期的尺度索引 + idx = np.argmin(np.abs(periods - target_period)) + actual_period = periods[idx] + result[target_period] = { + 'power': power[idx, :], + 'actual_period': float(actual_period), + } + + return result + + +# ============================================================================ +# 可视化模块 +# ============================================================================ + +def plot_cwt_scalogram( + power: np.ndarray, + periods: np.ndarray, + dates: pd.DatetimeIndex, + coi_periods: np.ndarray, + output_path: Path, + title: str = 'BTC/USDT CWT 时频功率谱(Scalogram)', +) -> None: + """绘制CWT scalogram(时间-周期-功率热力图)含影响锥 + + Parameters + ---------- + power : np.ndarray + 功率谱矩阵 + periods : np.ndarray + 周期数组(天) + dates : pd.DatetimeIndex + 时间索引 + coi_periods : np.ndarray + 影响锥边界 + output_path : Path + 输出文件路径 + title : str + 图标题 + """ + fig, ax = plt.subplots(figsize=(16, 8)) + + # 使用对数归一化的伪彩色图 + t = mdates.date2num(dates.to_pydatetime()) + T, P = np.meshgrid(t, periods) + + # 功率取对数以获得更好的视觉效果 + power_plot = power.copy() + power_plot[power_plot <= 0] = np.min(power_plot[power_plot > 0]) * 0.1 + + im = ax.pcolormesh( + T, P, power_plot, + cmap='jet', + norm=LogNorm(vmin=np.percentile(power_plot, 5), vmax=np.percentile(power_plot, 99)), + shading='auto', + ) + + # 绘制影响锥(COI) + coi_t = mdates.date2num(dates.to_pydatetime()) + ax.fill_between( + coi_t, coi_periods, periods[-1] * 1.1, + alpha=0.3, facecolor='white', hatch='x', + label='影响锥 (COI)', + ) + + # Y轴对数刻度 + ax.set_yscale('log') + ax.set_ylim(periods[0], periods[-1]) + ax.invert_yaxis() + + # 标记关键周期 + for kp in KEY_PERIODS: + if periods[0] <= kp <= periods[-1]: + ax.axhline(y=kp, color='white', linestyle='--', alpha=0.6, linewidth=0.8) + ax.text(t[-1] + (t[-1] - t[0]) * 0.01, kp, f'{kp}d', + color='white', fontsize=8, va='center') + + # 格式化 + ax.xaxis_date() + ax.xaxis.set_major_locator(mdates.YearLocator()) + ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y')) + ax.set_xlabel('日期', fontsize=12) + ax.set_ylabel('周期(天)', fontsize=12) + ax.set_title(title, fontsize=14) + + cbar = fig.colorbar(im, ax=ax, pad=0.08, shrink=0.8) + cbar.set_label('功率(对数尺度)', fontsize=10) + + ax.legend(loc='lower right', fontsize=9) + plt.tight_layout() + fig.savefig(output_path, dpi=DPI, bbox_inches='tight') + plt.close(fig) + print(f" Scalogram 已保存: {output_path}") + + +def plot_global_spectrum( + global_spectrum: np.ndarray, + significance_threshold: np.ndarray, + periods: np.ndarray, + significant_periods: List[Dict], + output_path: Path, + title: str = 'BTC/USDT 全局小波谱 + 95%显著性', +) -> None: + """绘制全局小波谱及95%红噪声显著性阈值 + + Parameters + ---------- + global_spectrum : np.ndarray + 全局小波谱 + significance_threshold : np.ndarray + 95%显著性阈值 + periods : np.ndarray + 周期数组 + significant_periods : list of dict + 显著周期信息 + output_path : Path + 输出路径 + title : str + 图标题 + """ + fig, ax = plt.subplots(figsize=(10, 7)) + + ax.plot(periods, global_spectrum, 'b-', linewidth=1.5, label='全局小波谱') + ax.plot(periods, significance_threshold, 'r--', linewidth=1.2, label='95% 红噪声显著性') + + # 填充显著区域 + above = global_spectrum > significance_threshold + ax.fill_between( + periods, global_spectrum, significance_threshold, + where=above, alpha=0.25, color='blue', label='显著区域', + ) + + # 标注显著周期峰值 + for sp in significant_periods: + ax.annotate( + f"{sp['period']:.0f}d\n({sp['ratio']:.1f}x)", + xy=(sp['period'], sp['power']), + xytext=(sp['period'] * 1.3, 
sp['power'] * 1.2), + fontsize=9, + arrowprops=dict(arrowstyle='->', color='darkblue', lw=1.0), + color='darkblue', + fontweight='bold', + ) + + # 标记关键周期 + for kp in KEY_PERIODS: + if periods[0] <= kp <= periods[-1]: + ax.axvline(x=kp, color='gray', linestyle=':', alpha=0.5, linewidth=0.8) + ax.text(kp, ax.get_ylim()[1] * 0.95, f'{kp}d', + ha='center', va='top', fontsize=8, color='gray') + + ax.set_xscale('log') + ax.set_yscale('log') + ax.set_xlabel('周期(天)', fontsize=12) + ax.set_ylabel('功率', fontsize=12) + ax.set_title(title, fontsize=14) + ax.legend(loc='upper left', fontsize=10) + ax.grid(True, alpha=0.3, which='both') + + plt.tight_layout() + fig.savefig(output_path, dpi=DPI, bbox_inches='tight') + plt.close(fig) + print(f" 全局小波谱 已保存: {output_path}") + + +def plot_key_period_power( + key_power: Dict[float, Dict], + dates: pd.DatetimeIndex, + coi_periods: np.ndarray, + output_path: Path, + title: str = 'BTC/USDT 关键周期功率时间演化', +) -> None: + """绘制关键周期处的功率随时间变化 + + Parameters + ---------- + key_power : dict + extract_power_at_periods 的返回结果 + dates : pd.DatetimeIndex + 时间索引 + coi_periods : np.ndarray + 影响锥边界 + output_path : Path + 输出路径 + title : str + 图标题 + """ + n_periods = len(key_power) + fig, axes = plt.subplots(n_periods, 1, figsize=(16, 3.5 * n_periods), sharex=True) + if n_periods == 1: + axes = [axes] + + colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b'] + + for i, (target_period, info) in enumerate(key_power.items()): + ax = axes[i] + power_ts = info['power'] + actual_period = info['actual_period'] + + # 标记COI内外区域 + in_coi = coi_periods < actual_period # COI内=不可靠 + reliable_power = power_ts.copy() + reliable_power[in_coi] = np.nan + unreliable_power = power_ts.copy() + unreliable_power[~in_coi] = np.nan + + color = colors[i % len(colors)] + ax.plot(dates, reliable_power, color=color, linewidth=1.0, + label=f'{target_period}d (实际 {actual_period:.1f}d)') + ax.plot(dates, unreliable_power, color=color, linewidth=0.8, + alpha=0.3, linestyle='--', label='COI 内(不可靠)') + + # 对功率做平滑以显示趋势 + window = max(int(target_period / 5), 7) + smoothed = pd.Series(power_ts).rolling(window=window, center=True, min_periods=1).mean() + ax.plot(dates, smoothed, color='black', linewidth=1.5, alpha=0.6, label=f'平滑 ({window}d)') + + ax.set_ylabel('功率', fontsize=10) + ax.set_title(f'周期 ~ {target_period} 天', fontsize=11) + ax.legend(loc='upper right', fontsize=8, ncol=3) + ax.grid(True, alpha=0.3) + + axes[-1].xaxis.set_major_locator(mdates.YearLocator()) + axes[-1].xaxis.set_major_formatter(mdates.DateFormatter('%Y')) + axes[-1].set_xlabel('日期', fontsize=12) + + fig.suptitle(title, fontsize=14, y=1.01) + plt.tight_layout() + fig.savefig(output_path, dpi=DPI, bbox_inches='tight') + plt.close(fig) + print(f" 关键周期功率图 已保存: {output_path}") + + +# ============================================================================ +# 主入口函数 +# ============================================================================ + +def run_wavelet_analysis( + df: pd.DataFrame, + output_dir: str, + wavelet: str = WAVELET, + min_period: float = MIN_PERIOD, + max_period: float = MAX_PERIOD, + num_scales: int = NUM_SCALES, + key_periods: List[float] = None, + n_surrogates: int = N_SURROGATES, +) -> Dict: + """执行完整的小波变换分析流程 + + Parameters + ---------- + df : pd.DataFrame + 日线 DataFrame,需包含 'close' 列和 DatetimeIndex + output_dir : str + 输出目录路径 + wavelet : str + 小波函数名 + min_period : float + 最小分析周期(天) + max_period : float + 最大分析周期(天) + num_scales : int + 尺度分辨率 + key_periods : list of float + 要追踪的关键周期 + n_surrogates : 
int + Monte Carlo替代数据数量 + + Returns + ------- + dict + 包含所有分析结果的字典: + - coeffs: CWT系数矩阵 + - power: 功率谱矩阵 + - periods: 周期数组 + - global_spectrum: 全局小波谱 + - significance_threshold: 95%显著性阈值 + - significant_periods: 显著周期列表 + - key_period_power: 关键周期功率演化 + - ar1_alpha: AR(1)系数 + - dates: 时间索引 + """ + if key_periods is None: + key_periods = KEY_PERIODS + + output_dir = Path(output_dir) + output_dir.mkdir(parents=True, exist_ok=True) + + # ---- 1. 数据准备 ---- + print("=" * 70) + print("小波变换分析 (Continuous Wavelet Transform)") + print("=" * 70) + + prices = df['close'].dropna() + dates = prices.index + n = len(prices) + + print(f"\n[数据概况]") + print(f" 时间范围: {dates[0].strftime('%Y-%m-%d')} ~ {dates[-1].strftime('%Y-%m-%d')}") + print(f" 样本数: {n}") + print(f" 小波函数: {wavelet}") + print(f" 分析周期范围: {min_period}d ~ {max_period}d") + + # 对数收益率 + 标准化,作为CWT输入信号 + log_ret = log_returns(prices) + signal = standardize(log_ret).values + signal_dates = log_ret.index + + # 处理可能的NaN/Inf + valid_mask = np.isfinite(signal) + if not np.all(valid_mask): + print(f" 警告: 移除 {np.sum(~valid_mask)} 个非有限值") + signal = signal[valid_mask] + signal_dates = signal_dates[valid_mask] + + n_signal = len(signal) + print(f" CWT输入信号长度: {n_signal}") + + # ---- 2. 连续小波变换 ---- + print(f"\n[CWT 计算]") + print(f" 尺度数量: {num_scales}") + + coeffs, periods, scales = compute_cwt( + signal, dt=1.0, wavelet=wavelet, + min_period=min_period, max_period=max_period, num_scales=num_scales, + ) + power = compute_power_spectrum(coeffs) + + print(f" 系数矩阵形状: {coeffs.shape}") + print(f" 周期范围: {periods[0]:.1f}d ~ {periods[-1]:.1f}d") + + # ---- 3. 影响锥 ---- + coi_periods = compute_coi(n_signal, dt=1.0, wavelet=wavelet) + + # ---- 4. 全局小波谱 ---- + print(f"\n[全局小波谱]") + global_spectrum = compute_global_wavelet_spectrum(power) + + # ---- 5. AR(1) 红噪声 Monte Carlo 显著性检验 ---- + print(f"\n[Monte Carlo 显著性检验]") + significance_threshold, surrogate_spectra = significance_test_monte_carlo( + signal, periods, dt=1.0, wavelet=wavelet, + n_surrogates=n_surrogates, significance_level=SIGNIFICANCE_LEVEL, + ) + + # ---- 6. 找出显著周期 ---- + significant_periods = find_significant_periods( + global_spectrum, significance_threshold, periods, + ) + + print(f"\n[显著周期(超过95%置信水平)]") + if significant_periods: + for sp in significant_periods: + days = sp['period'] + years = days / 365.25 + print(f" * {days:7.0f} 天 ({years:5.2f} 年) | " + f"功率={sp['power']:.4f} | 阈值={sp['threshold']:.4f} | " + f"比值={sp['ratio']:.2f}x") + else: + print(" 未发现超过95%显著性水平的周期") + + # ---- 7. 关键周期功率时间演化 ---- + print(f"\n[关键周期功率追踪]") + key_power = extract_power_at_periods(power, periods, key_periods) + for kp, info in key_power.items(): + print(f" {kp}d -> 实际匹配周期: {info['actual_period']:.1f}d, " + f"平均功率: {np.mean(info['power']):.4f}") + + # ---- 8. 可视化 ---- + print(f"\n[生成图表]") + + # 8.1 CWT Scalogram + plot_cwt_scalogram( + power, periods, signal_dates, coi_periods, + output_dir / 'wavelet_scalogram.png', + ) + + # 8.2 全局小波谱 + 显著性 + plot_global_spectrum( + global_spectrum, significance_threshold, periods, significant_periods, + output_dir / 'wavelet_global_spectrum.png', + ) + + # 8.3 关键周期功率演化 + plot_key_period_power( + key_power, signal_dates, coi_periods, + output_dir / 'wavelet_key_periods.png', + ) + + # ---- 9. 
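汇总结果 ----
+    # 补充说明:此处对同一信号重新估计一次 AR(1) 系数,仅为写入结果字典;
+    # 数值与 significance_test_monte_carlo 内部估计并打印的 alpha 一致。
+    # ---- 9. 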
汇总结果 ---- + ar1_alpha = _estimate_ar1(signal) + + results = { + 'coeffs': coeffs, + 'power': power, + 'periods': periods, + 'scales': scales, + 'global_spectrum': global_spectrum, + 'significance_threshold': significance_threshold, + 'significant_periods': significant_periods, + 'key_period_power': key_power, + 'coi_periods': coi_periods, + 'ar1_alpha': ar1_alpha, + 'dates': signal_dates, + 'wavelet': wavelet, + 'signal_length': n_signal, + } + + print(f"\n{'=' * 70}") + print(f"小波分析完成。共生成 3 张图表,保存至: {output_dir}") + print(f"{'=' * 70}") + + return results + + +# ============================================================================ +# 独立运行入口 +# ============================================================================ + +if __name__ == '__main__': + from src.data_loader import load_daily + + print("加载 BTC/USDT 日线数据...") + df = load_daily() + print(f"数据加载完成: {len(df)} 行\n") + + results = run_wavelet_analysis(df, output_dir='outputs/wavelet')
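+
+    # 补充示例(报告补充,非原脚本内容):使用返回字典打印显著周期摘要,
+    # 字段结构见 run_wavelet_analysis 的 Returns 说明
+    for sp in results['significant_periods']:
+        print(f"显著周期 {sp['period']:7.0f} 天 | 功率/阈值 = {sp['ratio']:.2f}x")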