Add comprehensive BTC/USDT price analysis framework with 17 modules
Complete statistical analysis pipeline covering:

- FFT spectral analysis, wavelet CWT, ACF/PACF autocorrelation
- Returns distribution (fat tails, kurtosis=15.65), GARCH volatility modeling
- Hurst exponent (H=0.593), fractal dimension, power law corridor
- Volume-price causality (Granger), calendar effects, halving cycle analysis
- Technical indicator validation (0/21 pass FDR), candlestick pattern testing
- Market state clustering (K-Means/GMM), Markov chain transitions
- Time series forecasting (ARIMA/Prophet/LSTM benchmarks)
- Anomaly detection ensemble (IF+LOF+COPOD, AUC=0.9935)

Key finding: volatility is predictable (GARCH persistence=0.973), but price
direction is statistically indistinguishable from a random walk.

Includes REPORT.md with a 16-section analysis report and future projections,
70+ charts in output/, and all source modules in src/.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
921
REPORT.md
Normal file
@@ -0,0 +1,921 @@
# BTC/USDT Price Regularity: Comprehensive Analysis Report

> **Data source**: Binance BTCUSDT | **Time span**: 2017-08-17 ~ 2026-02-01 (3,091 daily bars) | **Granularities**: 1m/3m/5m/15m/30m/1h/2h/4h/6h/8h/12h/1d/3d/1w/1mo (15 in total)

---

## Table of Contents

- [1. Data Overview](#1-data-overview)
- [2. Return Distribution Characteristics](#2-return-distribution-characteristics)
- [3. Volatility Clustering and Long Memory](#3-volatility-clustering-and-long-memory)
- [4. Frequency-Domain Cycle Analysis](#4-frequency-domain-cycle-analysis)
- [5. Hurst Exponent and Fractal Analysis](#5-hurst-exponent-and-fractal-analysis)
- [6. Power-Law Growth Model](#6-power-law-growth-model)
- [7. Volume-Price Relationship and Causality Tests](#7-volume-price-relationship-and-causality-tests)
- [8. Calendar Effects](#8-calendar-effects)
- [9. Halving Cycle Analysis](#9-halving-cycle-analysis)
- [10. Technical Indicator Validation](#10-technical-indicator-validation)
- [11. Candlestick Pattern Validation](#11-candlestick-pattern-validation)
- [12. Market State Clustering](#12-market-state-clustering)
- [13. Time Series Forecasting Models](#13-time-series-forecasting-models)
- [14. Anomaly Detection and Precursor Patterns](#14-anomaly-detection-and-precursor-patterns)
- [15. Overall Conclusions](#15-overall-conclusions)
- [16. Data-Driven Price Projections (2026-02 ~ 2028-02)](#16-data-driven-price-projections-2026-02--2028-02)

---
## 1. Data Overview

![Price overview](output/price_overview.png)

| Metric | Value |
|------|-----|
| Daily bars | 3,091 |
| Hourly bars | 74,053 |
| Price range | $3,189.02 ~ $124,658.54 |
| Missing values | 0 |
| Duplicate indices | 0 |

Data split strategy (strictly chronological, no random shuffling):

| Set | Time range | Samples | Share |
|------|---------|--------|------|
| Train | 2017-08 ~ 2022-09 | 1,871 | 60.5% |
| Validation | 2022-10 ~ 2024-06 | 639 | 20.7% |
| Test | 2024-07 ~ 2026-02 | 581 | 18.8% |

---
## 2. Return Distribution Characteristics

### 2.1 Normality Tests

Three independent tests **unanimously reject the normality hypothesis**:

| Test | Statistic | p-value | Verdict |
|---------|--------|------|------|
| Kolmogorov-Smirnov | 0.0974 | 5.97e-26 | Reject |
| Jarque-Bera | 31,996.3 | 0.00 | Reject |
| Anderson-Darling | 64.18 | Rejected at all critical levels (1%~15%) | Reject |

### 2.2 Fat Tails

| Metric | BTC actual | Normal-distribution value | Ratio |
|------|----------|--------------|------|
| Excess kurtosis | 15.65 | 0 | — |
| Skewness | -0.97 | 0 | — |
| 3σ exceedance rate | 1.553% | 0.270% | **5.75x** |
| 4σ exceedance rate | 0.550% | 0.006% | **86.86x** |

4σ extreme events occur nearly 87 times more often than a normal distribution predicts, demonstrating pronounced fat tails in BTC returns.

![Return distribution vs normal](output/returns/returns_histogram_vs_normal.png)

![QQ plot](output/returns/returns_qq_plot.png)
### 2.3 Distributions Across Time Scales

| Time scale | Samples | Mean | Std dev | Kurtosis | Skewness |
|---------|--------|------|--------|------|------|
| 1h | 74,052 | 0.000039 | 0.0078 | 35.88 | -0.47 |
| 4h | 18,527 | 0.000155 | 0.0149 | 20.54 | -0.20 |
| 1d | 3,090 | 0.000935 | 0.0361 | 15.65 | -0.97 |
| 1w | 434 | 0.006812 | 0.0959 | 2.08 | -0.44 |

**Key finding**: kurtosis falls monotonically from 35.88 to 2.08 as the time scale grows, converging toward normality, consistent with the aggregational Gaussianity predicted by the central limit theorem.

![Multi-timeframe distributions](output/returns/multi_timeframe_distributions.png)

---
## 3. Volatility Clustering and Long Memory

### 3.1 GARCH Modeling

| Parameter | GARCH(1,1) | EGARCH(1,1) | GJR-GARCH(1,1) |
|------|-----------|-------------|-----------------|
| α | 0.0962 | — | — |
| β | 0.8768 | — | — |
| Persistence (α+β) | **0.9730** | — | — |
| Leverage parameter γ | — | < 0 | > 0 |

A persistence of 0.973 is close to 1, meaning volatility shocks decay very slowly: the impact of a single large move takes tens of days to dissipate.

![GARCH conditional volatility](output/returns/garch_conditional_volatility.png)
### 3.2 Power-Law Decay of the Volatility ACF

| Metric | Value |
|------|-----|
| Power-law decay exponent d (linear fit) | 0.6351 |
| Power-law decay exponent d (nonlinear fit) | 0.3449 |
| R² | 0.4231 |
| p-value | 5.82e-25 |
| Long memory (0 < d < 1) | **Yes** |

The autocorrelation of absolute returns decays slowly at a power-law rate, confirming long memory in volatility. The exponential-decay assumption of standard GARCH models may not fully capture this feature.

![ACF power-law fit](output/volatility/acf_power_law_fit.png)
### 3.3 ACF Evidence

| Series | Significant ACF lags | Ljung-Box Q(100) | p-value |
|------|-------------|-----------------|------|
| Log returns | 10 | 148.68 | 0.001151 |
| Squared returns | 11 | 211.18 | 0.000000 |
| Absolute returns | **88** | 2,294.61 | 0.000000 |
| Volume | **100** | 103,242.29 | 0.000000 |

For absolute returns, the first 88 ACF lags are all significant (88 of 100); for volume, all 100 lags are (ACF(1) = 0.892), demonstrating very strong nonlinear dependence and volatility clustering.

![ACF grid](output/acf/acf_grid.png)

![PACF grid](output/acf/pacf_grid.png)

![Significant-lag heatmap](output/acf/significant_lags_heatmap.png)
### 3.4 Leverage Effect

| Forward window | Pearson r | p-value | Verdict |
|---------|-----------|------|------|
| 5d | -0.0620 | 5.72e-04 | Significant weak negative correlation |
| 10d | -0.0337 | 0.062 | Not significant |
| 20d | -0.0176 | 0.329 | Not significant |

A weak leverage effect (volatility rising after declines) is observed only within a 5-day window, and the effect size is tiny (r = -0.062), far weaker than in traditional equity markets.

![Leverage effect scatter](output/volatility/leverage_effect_scatter.png)

---
## 4. Frequency-Domain Cycle Analysis

### 4.1 FFT Spectral Analysis

A Hann window is applied to daily log returns before the FFT, with an AR(1) red-noise spectrum as the baseline for detecting significant cycles:

| Period (days) | SNR | Cross-timeframe confirmation |
|---------|-------------|--------------|
| 39.6 | 6.36x | 4h + 1d + 1w (three timeframes) |
| 3.1 | 5.27x | 4h + 1d |
| 14.4 | 5.22x | 4h + 1d |
| 13.3 | 5.19x | 4h + 1d |

**Variance share of band-pass components**:

| Cycle component | Variance share |
|---------|---------|
| 7d | 14.917% |
| 30d | 3.770% |
| 90d | 2.405% |
| 365d | 0.749% |
| 1400d | 0.233% |

The 7-day component explains the most variance (14.9%), but all cyclical components together explain only ~22% of total variance; roughly 78% of the fluctuation cannot be attributed to periodicity.

![FFT power spectrum](output/fft/fft_power_spectrum.png)

![FFT band-pass components](output/fft/fft_bandpass_components.png)

![FFT multi-timeframe](output/fft/fft_multi_timeframe.png)
### 4.2 Continuous Wavelet Transform (CWT)

Using a complex Morlet wavelet (cmor1.5-1.0), with a 95% significance threshold built from 1,000 AR(1) Monte Carlo surrogates:

| Significant period (days) | Years | Power/threshold ratio |
|-------------|------|-----------|
| 633 | 1.73 | 1.01x |
| 316 | 0.87 | 1.15x |
| 297 | 0.81 | 1.07x |
| 278 | 0.76 | 1.10x |
| 267 | 0.73 | 1.07x |
| 251 | 0.69 | 1.11x |
| 212 | 0.58 | 1.14x |

These periods pass the 95% significance test, but their power/threshold ratios are only 1.01~1.15x, i.e. **marginally significant**, with limited practical value.

![Wavelet scalogram](output/wavelet/wavelet_scalogram.png)

![Wavelet global spectrum](output/wavelet/wavelet_global_spectrum.png)

![Wavelet key periods](output/wavelet/wavelet_key_periods.png)

---
## 5. Hurst Exponent and Fractal Analysis

### 5.1 Hurst Exponent

Two independent methods, R/S analysis and DFA, cross-validate each other:

| Method | Hurst value | Interpretation |
|------|---------|------|
| R/S analysis | 0.5991 | Weak trending |
| DFA | 0.5868 | Weak trending |
| **Average** | **0.5930** | Weak trending (H > 0.55) |
| Method gap | 0.0122 | Good agreement (< 0.05) |

Decision rule: H > 0.55 trending / H < 0.45 mean-reverting / 0.45 ≤ H ≤ 0.55 random walk.

**Multi-timeframe Hurst**:

| Time scale | R/S | DFA | Average |
|---------|-----|-----|------|
| 1h | 0.5552 | 0.5559 | 0.5556 |
| 4h | 0.5749 | 0.5771 | 0.5760 |
| 1d | 0.5991 | 0.5868 | 0.5930 |
| 1w | 0.6864 | 0.6552 | **0.6708** |

The Hurst exponent increases with the time scale; the weekly level (H = 0.67) shows more pronounced trending.

**Rolling-window analysis** (500-day window, 30-day step):

| Metric | Value |
|------|-----|
| Windows | 87 |
| Trending share | **98.9%** (86/87) |
| Random-walk share | 1.1% |
| Mean-reverting share | 0.0% |
| Hurst range | [0.549, 0.654] |

Nearly every window shows weak trending; no window ever enters a mean-reverting regime.

![R/S log-log fit](output/hurst/hurst_rs_loglog.png)

![Rolling Hurst](output/hurst/hurst_rolling.png)

![Multi-timeframe Hurst](output/hurst/hurst_multi_timeframe.png)
### 5.2 Fractal Dimension

| Metric | BTC | Random-walk mean | Random-walk std |
|------|-----|-----------|-------------|
| Box-counting dimension D | 1.3398 | 1.3805 | 0.0295 |
| H implied by D (D = 2 - H) | 0.6602 | — | — |
| Z statistic | -1.3821 | — | — |
| p-value | 0.1669 | — | — |

BTC's fractal dimension D = 1.34 is below the random walk's D = 1.38 (a smoother series), but the Z-test against 100 Monte Carlo simulations gives p = 0.167, **not significant at the 5% level**.

**Multi-scale self-similarity**: kurtosis falls from 15.65 at scale 1 to -0.25 at scale 50; the distribution approaches normality at large scales, so self-similarity is limited.

![Box counting](output/fractal/fractal_box_counting.png)

![Monte Carlo comparison](output/fractal/fractal_monte_carlo.png)

![Self-similarity](output/fractal/fractal_self_similarity.png)

---
## 6. Power-Law Growth Model

| Metric | Value |
|------|-----|
| Power-law exponent α | 0.770 |
| R² | 0.568 |
| p-value | 0.00 |

### 6.1 Power-Law Corridor

| Quantile | Current corridor price |
|--------|-----------|
| 5% (undervalued) | $16,879 |
| 50% (midline) | $51,707 |
| 95% (overvalued) | $119,340 |
| **Current price** | **$76,968** |
| Historical residual quantile | **67.9%** |

The current price sits at the 67.9th percentile of the corridor, within the historically normal range.

### 6.2 Power Law vs Exponential Growth

| Model | AIC | BIC |
|------|-----|-----|
| Power law | 68,301 | 68,313 |
| Exponential | **67,807** | **67,820** |
| Difference | +493 | +493 |

Both AIC and BIC favor the exponential model over the power law (difference 493), suggesting BTC's long-run growth is closer to exponential than to a power law.

![Power-law corridor](output/power_law/power_law_corridor.png)

![Log-log regression](output/power_law/power_law_loglog_regression.png)

![Model comparison](output/power_law/power_law_model_comparison.png)
---

## 7. Volume-Price Relationship and Causality Tests

### 7.1 Volume-Volatility Correlation

| Metric | Value |
|------|-----|
| Spearman ρ (volume vs \|return\|) | **0.3215** |
| p-value | 3.11e-75 |

Rising volume accompanies large moves: a moderate positive correlation that is extremely significant.

![Volume-return scatter](output/volume_price/volume_return_scatter.png)

### 7.2 Granger Causality Tests

50 tests in total (10 pairs × 5 lags), Bonferroni-corrected threshold = 0.001:

| Direction | Significant lags after correction | Max F statistic |
|---------|-----------------|-------------|
| abs_return → volume | **5/5 significant** | 55.19 |
| log_return → taker_buy_ratio | **5/5 significant** | 139.21 |
| squared_return → volume | **4/5 significant** | 52.44 |
| log_return → range_pct | 1/5 | 5.74 |
| volume → abs_return | 1/5 | 3.69 |
| volume → log_return | 0/5 | — |
| log_return → volume | 0/5 | — |
| taker_buy_ratio → log_return | 0/5 (after correction) | — |

**Core finding**: the causality is **one-way**: volatility/returns Granger-cause volume and taker_buy_ratio, but not the reverse. Volume is a consequence of price moves, not a cause.

![Granger causal network](output/causality/granger_causal_network.png)

![Granger p-value heatmap](output/causality/granger_pvalue_heatmap.png)
### 7.3 Cross-Timescale Causality

| Direction | Significant lags |
|------|----------|
| hourly_intraday_vol → log_return | lag=10 significant (Bonferroni) |
| hourly_volume_sum → log_return | Not significant |
| hourly_max_abs_return → log_return | lag=10 marginally significant |

Hourly intraday volatility carries a faint leading signal for daily returns, but only at a 10-day lag.

### 7.4 OBV Divergence

82 price-volume divergence signals were detected (49 top/bearish + 33 bottom/bullish).

![OBV divergence](output/volume_price/obv_divergence.png)

---
## 8. Calendar Effects

### 8.1 Day-of-Week Effect

| Weekday | Samples | Mean daily return | Std dev |
|------|--------|----------|--------|
| Monday | 441 | +0.310% | 4.05% |
| Tuesday | 441 | -0.027% | 3.56% |
| Wednesday | 441 | +0.374% | 3.69% |
| Thursday | 441 | -0.319% | 4.58% |
| Friday | 442 | +0.180% | 3.62% |
| Saturday | 442 | +0.117% | 2.45% |
| Sunday | 442 | +0.021% | 2.87% |

**Kruskal-Wallis H test: H=8.24, p=0.221 → not significant**

All 21 pairwise Mann-Whitney U comparisons are non-significant after Bonferroni correction.

![Weekday effect](output/calendar/calendar_weekday_effect.png)

### 8.2 Month-of-Year Effect

**Kruskal-Wallis H test: H=6.12, p=0.865 → not significant**

October has the highest mean return (+0.501%) and August the lowest (-0.123%), but none of the 66 pairwise comparisons survive Bonferroni correction.

![Month effect](output/calendar/calendar_month_effect.png)

### 8.3 Hour-of-Day Effect

**Returns, Kruskal-Wallis: H=56.88, p=0.000107 → significant**
**Volume, Kruskal-Wallis: H=2601.9, p=0.000000 → significant**

Intraday hour effects are significant for both returns and volume. Volume peaks at 14:00 UTC (3,805 BTC) and bottoms at 03:00-05:00 UTC (~1,980 BTC).

![Hour effect](output/calendar/calendar_hour_effect.png)

### 8.4 Quarter and Turn-of-Month Effects

| Test | Statistic | p-value | Verdict |
|------|--------|------|------|
| Quarter, Kruskal-Wallis | 1.15 | 0.765 | Not significant |
| Month start vs end, Mann-Whitney | 134,569 | 0.236 | Not significant |

![Quarter/turn-of-month effect](output/calendar/calendar_quarter_boundary_effect.png)

### Calendar Effects Summary

| Effect | Test p-value | Verdict |
|---------|----------|------|
| Day of week | 0.221 | **Not significant** |
| Month | 0.865 | **Not significant** |
| Hour (returns) | 0.000107 | **Significant** |
| Hour (volume) | 0.000000 | **Significant** |
| Quarter | 0.765 | **Not significant** |
| Turn of month | 0.236 | **Not significant** |

Only the intraday hour effect is statistically significant.
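
The day-of-week test is a one-liner once returns are grouped by weekday. A minimal sketch, assuming a DataFrame with a DatetimeIndex and a `log_return` column (the synthetic frame is a stand-in):

```python
# Kruskal-Wallis test of the day-of-week effect.
import numpy as np
import pandas as pd
from scipy import stats

idx = pd.date_range("2017-08-17", periods=3_090, freq="D")
df = pd.DataFrame(
    {"log_return": np.random.default_rng(5).normal(0, 0.036, len(idx))}, index=idx)

groups = [g["log_return"].values for _, g in df.groupby(df.index.dayofweek)]
h_stat, p_val = stats.kruskal(*groups)
print(f"Kruskal-Wallis: H={h_stat:.2f} p={p_val:.3f}")  # expect p >> 0.05 here
```
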
---

## 9. Halving Cycle Analysis

> ⚠️ **Key limitation**: only 2 halving events are covered (2020-05-11, 2024-04-20), so statistical power is very low.

### 9.1 Returns Before vs After Halvings

| Cycle | Mean, 500 days before | Mean, 500 days after | Welch's t | p-value |
|------|-------------|-------------|-----------|------|
| 3rd (2020) | +0.179%/day | +0.331%/day | -0.590 | 0.555 |
| 4th (2024) | +0.264%/day | +0.108%/day | 1.008 | 0.314 |
| **Pooled** | +0.221%/day | +0.220%/day | 0.011 | **0.991** |

At a pooled p = 0.991, pre- and post-halving returns are essentially indistinguishable.

### 9.2 Volatility Change (Levene Test)

| Cycle | Annualized vol before | Annualized vol after | Levene W | p-value |
|------|--------------|--------------|---------|------|
| 3rd | 82.72% | 73.13% | 0.608 | 0.436 |
| 4th | 47.18% | 46.26% | 0.197 | 0.657 |

The volatility change is **not significant** in either cycle.

### 9.3 Cumulative Returns

| Days after halving | 3rd (2020) | 4th (2024) |
|-----------|-------------|-------------|
| 30 | +13.32% | +11.95% |
| 90 | +33.92% | +4.45% |
| 180 | +69.88% | +5.65% |
| 365 | **+549.68%** | +33.47% |
| 500 | +414.35% | +74.31% |

The two post-halving trajectories differ enormously (365 days: 550% vs 33%).

### 9.4 Trajectory Correlation

| Segment | Pearson r | p-value |
|------|-----------|------|
| Full (1,001 days) | **0.808** | 0.000 |
| Pre-halving (500 days) | 0.213 | 0.000002 |
| Post-halving (500 days) | **0.737** | 0.000 |

The normalized price trajectories of the two cycles are highly correlated (r = 0.81), but with only 2 samples no causal inference is possible.

![Normalized trajectories](output/halving/halving_normalized_trajectories.png)

![Cumulative returns](output/halving/halving_cumulative_returns.png)

![Pre/post returns](output/halving/halving_pre_post_returns.png)

![Combined summary](output/halving/halving_combined_summary.png)

---
## 10. Technical Indicator Validation

21 indicator signals (8 MA/EMA crossovers + 9 RSI variants + 3 MACD variants + 1 Bollinger Bands) were subjected to strict statistical validation.

### 10.1 FDR Correction

| Dataset | Indicators passing FDR |
|--------|-------------------|
| Train (1,871 bars) | **0 / 21** |
| Validation (639 bars) | **0 / 21** |

**After Benjamini-Hochberg FDR correction, none of the 21 technical indicators is significant.**
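
The Benjamini-Hochberg step is a single statsmodels call. A sketch, assuming `pvals` holds the 21 raw p-values from the indicator tests (the uniform draws are a stand-in):

```python
# Benjamini-Hochberg FDR correction over the 21 indicator p-values.
import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.random.default_rng(6).uniform(0.01, 0.9, 21)   # stand-in p-values
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(f"passing FDR: {reject.sum()} / {len(pvals)}")
```
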
### 10.2 Permutation Tests (Top-5 Indicators by IC)

| Indicator | IC difference | Permutation p-value | Verdict |
|------|--------|----------|------|
| RSI_14_30_70 | -0.005 | 0.566 | Fail |
| RSI_14_25_75 | -0.030 | 0.015 | Pass |
| RSI_21_30_70 | -0.012 | 0.268 | Fail |
| RSI_7_25_75 | -0.014 | 0.021 | Pass |
| RSI_21_20_80 | -0.025 | 0.303 | Fail |

Only 2 of 5 pass the permutation test, and all IC values are tiny (|IC| < 0.05): negligible predictive power in practice.

### 10.3 Train vs Validation IC Consistency

9 of the top-10 ICs keep the same sign; one (SMA_20_100) flips sign. All ICs fall within [-0.10, +0.05], a very small effect size.

![IC distribution, train](output/indicators/ic_distribution_train.png)

![p-value heatmap, train](output/indicators/pvalue_heatmap_train.png)

![Best indicator, train](output/indicators/best_indicator_train.png)

---
## 11. Candlestick Pattern Validation

Forward-return analysis of 12 classic candlestick patterns, each implemented by hand.

### 11.1 Pattern Frequencies (Train Set)

| Pattern | Occurrences | Passes FDR |
|------|---------|---------|
| Doji | 219 | No |
| Bullish_Engulfing | 159 | No |
| Bearish_Engulfing | 149 | No |
| Pin_Bar_Bull | 116 | No |
| Pin_Bar_Bear | 57 | No |
| Hammer | 49 | No |
| Morning_Star | 23 | No |
| Evening_Star | 20 | No |
| Inverted_Hammer | 17 | No |
| Three_White_Soldiers | 11 | No |
| Shooting_Star | 6 | No |
| Three_Black_Crows | 4 | No |

**On the train set, 0 of 12 pass FDR correction.**

### 11.2 Validation Set

Three patterns pass FDR on the validation set (Doji 53.1%, Pin_Bar_Bull 39.3%, Bullish_Engulfing 36.2%), but their hit rates are near or below 50% (chance level), so they have no practical trading value.

### 11.3 Train → Validation Stability

| Pattern | Train hit rate | Validation hit rate | Change | Assessment |
|------|-----------|-----------|------|------|
| Doji | 51.1% | 53.1% | +1.9% | Stable |
| Hammer | 63.3% | 50.0% | -13.3% | Decayed |
| Pin_Bar_Bear | 57.9% | 60.0% | +2.1% | Stable |
| Bullish_Engulfing | 50.9% | 36.2% | -14.7% | Decayed |
| Morning_Star | 56.5% | 40.0% | -16.5% | Decayed |

Most patterns' hit rates decay on the validation set, suggesting the train-set performance was overfitting.

![Pattern counts, train](output/patterns/pattern_counts_train.png)

![Pattern hit rates, train](output/patterns/pattern_hit_rate_train.png)

![Pattern forward returns, train](output/patterns/pattern_forward_returns_train.png)

---
## 12. Market State Clustering

### 12.1 K-Means (k=3, silhouette=0.338)

| State | Share | Mean daily return | 7d annualized vol | Volume ratio |
|------|------|----------|-----------|---------|
| Sideways | 73.6% | -0.010% | 46.5% | 0.896 |
| Sharp decline | 11.8% | -5.636% | 95.2% | 1.452 |
| Strong rally | 14.6% | +5.279% | 87.6% | 1.330 |

### 12.2 Markov Transition Matrix

| | → Sideways | → Crash | → Surge |
|---|-------|-------|-------|
| Sideways | 0.820 | 0.077 | 0.103 |
| Crash | 0.452 | 0.230 | 0.319 |
| Surge | 0.546 | 0.230 | 0.224 |

**Stationary distribution**: sideways 73.6%, crash 11.8%, surge 14.6%

**Mean holding time**: sideways 5.55 days / crash 1.30 days / surge 1.29 days

Surges and crashes last only ~1.3 days on average before reverting to sideways. After a crash, there is a 31.9% probability of flipping straight into a surge (a rebound).

![State time series](output/clustering/cluster_state_timeseries.png)

![Transition matrix](output/clustering/cluster_transition_matrix.png)

![PCA projection, K-Means](output/clustering/cluster_pca_k-means.png)

![k selection](output/clustering/cluster_k_selection.png)
---

## 13. Time Series Forecasting Models

| Model | RMSE | RMSE/RW | Direction accuracy | DM p-value |
|------|------|---------|----------|--------|
| Random Walk | 0.02532 | 1.000 | 0.0%* | — |
| Historical Mean | 0.02527 | 0.998 | 49.9% | 0.152 |
| ARIMA | Not completed** | — | — | — |
| Prophet | Not installed | — | — | — |
| LSTM | Not installed | — | — | — |

\* The random walk predicts a return of 0, so its direction accuracy is defined as 0%.
\*\* ARIMA failed to complete due to a numpy binary-compatibility issue.

Historical Mean achieves RMSE/RW = 0.998, only 0.2% better than the random walk, and the Diebold-Mariano test (p = 0.152) is **not significant**: it is effectively a random walk.

![Prediction comparison](output/time_series/ts_predictions_comparison.png)

![Cumulative error](output/time_series/ts_cumulative_error.png)
---

## 14. Anomaly Detection and Precursor Patterns

### 14.1 Ensemble Anomaly Detection

| Method | Anomalies | Share |
|------|--------|------|
| Isolation Forest | 154 | 5.01% |
| LOF | 154 | 5.01% |
| COPOD | 154 | 5.01% |
| **Ensemble (≥2/3)** | **142** | **4.62%** |
| GARCH residual anomalies | 48 | 1.55% |
| Ensemble ∩ GARCH overlap | 41 | — |
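
The 2-of-3 vote maps directly onto pyod (pinned in requirements.txt). A sketch, assuming a feature matrix `X` of daily features (the random matrix stands in for the real feature construction):

```python
# Majority-vote ensemble of three pyod detectors at 5% contamination.
import numpy as np
from pyod.models.iforest import IForest
from pyod.models.lof import LOF
from pyod.models.copod import COPOD

X = np.random.default_rng(8).normal(size=(3_000, 6))   # stand-in features

votes = np.zeros(len(X), dtype=int)
for model in (IForest(contamination=0.05, random_state=0),
              LOF(contamination=0.05),
              COPOD(contamination=0.05)):
    model.fit(X)
    votes += model.labels_          # 1 = flagged as an anomaly

ensemble = votes >= 2               # anomaly iff at least 2 of 3 detectors agree
print(f"ensemble anomalies: {ensemble.sum()} ({ensemble.mean():.2%})")
```
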
### 14.2 Alignment with Known Events (5-day tolerance)

| Event | Date | Aligned | Min offset (days) |
|------|------|---------|------------|
| 2017 bull-market top | 2017-12-17 | ✓ | 1 |
| 2018 bear-market bottom | 2018-12-15 | ✓ | 5 |
| COVID Black Thursday | 2020-03-12 | ✓ | **0** |
| 3rd halving | 2020-05-11 | ✓ | 1 |
| Luna/3AC crash | 2022-06-18 | ✓ | **0** |
| FTX collapse | 2022-11-09 | ✓ | **0** |

6 of 12 known events were aligned, 3 of them with an offset of exactly 0 days.

### 14.3 Precursor Classifier

| Metric | Value |
|------|-----|
| Classifier AUC | **0.9935** |
| Samples | 3,053 (134 anomalous, 2,919 normal) |

**Top-5 precursor features (signals 5~20 days before an anomaly)**:

| Feature | Importance |
|------|--------|
| range_pct_max_5d | 0.0856 |
| range_pct_std_5d | 0.0836 |
| abs_return_std_5d | 0.0605 |
| abs_return_max_5d | 0.0583 |
| range_pct_deviation_20d | 0.0562 |

The strongest precursor signals are the 5-day maxima and standard deviations of the price range (range_pct) and of absolute returns before an anomaly.

> **Caveat**: the AUC of 0.99 partly reflects the clustering of anomalies themselves (days around an anomaly are also anomalous); it is not genuine "advance prediction" skill.

![Anomaly price chart](output/anomaly/anomaly_price_chart.png)

![Feature distributions](output/anomaly/anomaly_feature_distributions.png)

![Precursor feature importance](output/anomaly/precursor_feature_importance.png)

![Precursor ROC curve](output/anomaly/precursor_roc_curve.png)
---

## 15. Overall Conclusions

### Evidence Grading

#### ✅ Strong evidence (highly reproducible, economically meaningful)

| Regularity | Key evidence | Usability |
|------|---------|---------|
| Fat-tailed returns | KS/JB/AD p≈0, excess kurtosis=15.65, 4σ events 87x normal | Mandatory for risk control |
| Volatility clustering | GARCH persistence=0.973, 88 significant ACF lags of abs returns | Volatility is forecastable |
| Long memory in volatility | Power-law decay d=0.635, p=5.8e-25 | FIGARCH-style modeling |
| One-way causality: volatility → volume | abs_return→volume F=55.19, all lags significant after Bonferroni | Market microstructure insight |
| Anomaly precursors | AUC=0.9935, 6/12 known events aligned | Volatility-anomaly early warning |

#### ⚠️ Moderate evidence (statistically significant but limited effect)

| Regularity | Key evidence | Limitation |
|------|---------|------|
| Weak trending | Hurst H=0.593, 98.9% of windows > 0.55 | Small effect (H only slightly > 0.5) |
| Intraday hour effect | Kruskal-Wallis p=0.0001 | Hourly level only |
| 39.6-day FFT cycle | SNR=6.36, confirmed on three timeframes | 7d component explains only 15% of variance |
| ~300-day wavelet cycles | Significant at 95% (MC) | Power/threshold ratio only 1.01-1.15x |

#### ❌ Weak evidence / not significant

| Regularity | Key evidence | Verdict |
|------|---------|------|
| Calendar effects (weekday/month/quarter) | Kruskal-Wallis p=0.22~0.87 | **Absent** |
| Halving effect | Welch's t p=0.55/0.31, pooled p=0.991 | **Not significant** (only 2 samples) |
| Technical-indicator predictive power | 0/21 pass FDR, IC<0.05 | **Absent** |
| Candlestick-pattern excess returns | 0/12 pass FDR on train, most decay on validation | **Absent** |
| Fractal dimension vs random walk | Z=-1.38, p=0.167 | **Not significant** |
| Forecasting models beating random walk | RMSE/RW=0.998, DM p=0.152 | **Not significant** |

### Final Verdict

> **BTC prices exhibit measurable statistical regularities, but almost none of them are exploitable for predicting price direction.**
>
> 1. **Volatility is predictable; direction is not.** GARCH effects, volatility clustering, and long memory are solid market features, usable for risk management and option pricing, but not for predicting up or down moves.
>
> 2. **Market efficiency is asymmetric.** The BTC market is close to efficient in price levels (the first moment) but far from efficient in volatility (the second moment), consistent with the familiar pattern in traditional markets that volatility is predictable while direction is not.
>
> 3. **Popular trading signals do not survive rigorous testing.** The 21 technical indicators, 12 candlestick patterns, calendar effects, and halving effects are all non-significant or negligibly small after FDR/Bonferroni correction.
>
> 4. **Practical implications**: focus on volatility management rather than direction prediction; assess extreme-event risk with fat-tailed models; use anomaly detection as an auxiliary risk tool.

---
## 16. Data-Driven Price Projections (2026-02 ~ 2028-02)

> **Disclaimer**: this chapter is a data-driven extrapolation of the statistical results in Chapters 1-15 and **does not constitute investment advice**. Directional accuracy is statistically indistinguishable from a random walk (Chapter 13), so the precision of any point forecast is illusory. The value of what follows lies in **quantifying the range of uncertainty**, not in producing exact predictions.

### 16.1 Methodology

We combine the quantitative outputs of 6 independent frameworks to build probability distributions rather than a single forecast:

| Framework | Data source | Role |
|------|---------|------|
| Geometric Brownian motion (GBM) | Daily returns μ=0.0935%/day, σ=3.61%/day (Ch. 2) | Neutral baseline probability cone |
| Power-law corridor extrapolation | α=0.770, R²=0.568 (Ch. 6) | Long-run structural anchor |
| GARCH volatility cone | persistence=0.973 (Ch. 3) | Dynamic volatility adjustment |
| Halving-cycle analogy | 3rd/4th halving trajectories, r=0.81 (Ch. 9) | Cyclical reference (only 2 samples) |
| Markov state model | 3-state transition matrix (Ch. 12) | State persistence and switching odds |
| Hurst trend inference | H=0.593, weekly H=0.67 (Ch. 5) | Trend-persistence correction |

### 16.2 Current Market Diagnosis

**Reference price**: $76,968 (2026-02-01 close)

| Dimension | Value | Meaning |
|---------|-----|------|
| Power-law corridor quantile | 67.9% | Elevated but not extreme (5%=$16,879, 95%=$119,340) |
| Days since 4th halving | ~652 | Late cycle (the 3rd cycle peaked at ~550 days) |
| Current Markov state | Sideways (73.6% probability) | Mean daily return -0.01%, annualized vol 46.5% |
| Recent rolling Hurst | 0.549 ~ 0.654 | Weak trending persists, no mean reversion |
| GARCH persistence | 0.973 | Current volatility level has strong inertia |

### 16.3 Framework 1: GBM Probability Cone (i.i.d. returns assumed)

Using the daily log-return parameters (μ=0.000935, σ=0.0361) under geometric Brownian motion:

**Drift correction**: E[ln(S_T/S_0)] = (μ - σ²/2) × T = 0.000283/day

| Horizon | Median | -1σ (16th pct) | +1σ (84th pct) | -2σ (2.5th pct) | +2σ (97.5th pct) |
|---------|-----------|-------------|-------------|-------------|---------------|
| 6 months (183d) | $80,834 | $52,891 | $123,470 | $36,267 | $180,129 |
| 1 year (365d) | $85,347 | $42,823 | $170,171 | $21,502 | $338,947 |
| 2 years (730d) | $94,618 | $35,692 | $250,725 | $13,475 | $664,268 |

> **Key correction**: because BTC returns are fat-tailed (excess kurtosis=15.65, 4σ events at 87x the normal rate), this GBM cone **badly understates tail risk**. The true 2.5%/97.5% range should be materially wider than the table above.
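
The table follows from lognormal terminal prices, ln(S_T/S_0) ~ N((μ - σ²/2)T, σ²T). A quick sketch that reproduces the rows:

```python
# GBM probability cone: price at z standard deviations after T days.
import numpy as np

S0, mu, sigma = 76_968.0, 0.000935, 0.0361

def gbm_price(T_days, z):
    drift = (mu - sigma**2 / 2) * T_days
    return S0 * np.exp(drift + z * sigma * np.sqrt(T_days))

for T in (183, 365, 730):
    print(f"{T}d: median={gbm_price(T, 0):,.0f}  "
          f"±1σ=({gbm_price(T, -1):,.0f}, {gbm_price(T, 1):,.0f})  "
          f"±2σ=({gbm_price(T, -2):,.0f}, {gbm_price(T, 2):,.0f})")
```
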
### 16.4 Framework 2: Power-Law Corridor Extrapolation

Extrapolating the corridor bands with the current power-law exponent α=0.770:

| Date | 5% band | 50% band | 95% band | Current price position |
|--------|---------|---------|---------|-----------|
| 2026-02 (now, day 3091) | $16,879 | $51,707 | $119,340 | $76,968 (67.9%) |
| 2026-08 (day 3274) | $17,647 | $54,060 | $124,773 | — |
| 2027-02 (day 3456) | $18,412 | $56,404 | $130,183 | — |
| 2028-02 (day 3821) | $19,861 | $60,839 | $140,423 | — |

> **Note**: the power law has R²=0.568 and AIC favors the exponential model (difference 493), so the corridor is only a structural reference, not a primary pricing anchor. The corridor grows ~9% per year, far below the historical annualized return of 34%.
### 16.5 Framework 3: Halving-Cycle Analogy

About 652 days have passed since the 4th halving (2024-04-20). Taking the 3rd halving as the reference:

| Item | 3rd (2020-05-11) | 4th (2024-04-20) | Shrink ratio |
|------|-------|-------|--------|
| Price on halving day | ~$8,600 | ~$64,000 | — |
| 365-day cumulative | **+549.68%** | +33.47% | **0.061x** |
| 500-day cumulative | +414.35% | +74.31% | **0.179x** |
| Cycle peak | ~$69,000 (~550 days) | **?** | — |
| Trajectory correlation | r = 0.808 (p < 0.001) | — | — |

**Extrapolation**:
- If the 4th cycle follows the 3rd cycle's shape (r=0.81) but with heavily attenuated returns (a 0.06x~0.18x shrink ratio), it may already be at or near its peak
- The 3rd cycle peaked at ~550 days and then entered a prolonged decline (the 2022 bear market); if the analogy holds, 2026Q1-Q2 would be "late cycle"
- **But with only 2 samples, statistical power is very low** (pooled Welch's t p=0.991), so this extrapolation cannot be relied upon
### 16.6 Framework 4: Markov State Projections

Conditional forecasts from the 3-state Markov transition matrix:

**Assuming the current state is sideways** (73.6% of days are):

| Future state | After 1 day | After 5 days* | After 30 days* |
|---------|-----------|-----------|------------|
| Still sideways | 82.0% | ~51.3% | ≈ stationary 73.6% |
| Crash | 7.7% | ~10.5% | ≈ stationary 11.8% |
| Surge | 10.3% | ~13.4% | ≈ stationary 14.6% |

\* Multi-step probabilities are computed as powers of the transition matrix and converge to the stationary distribution after ~15-20 steps, as the sketch below shows.
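
A sketch of the matrix-power computation, the stationary distribution, and the mean holding times, using the transition matrix from Section 12.2 (rows: sideways/crash/surge):

```python
# Multi-step state probabilities, stationary distribution, mean holding times.
import numpy as np

P = np.array([[0.820, 0.077, 0.103],
              [0.452, 0.230, 0.319],
              [0.546, 0.230, 0.224]])

state = np.array([1.0, 0.0, 0.0])            # start in the sideways state
for step in (1, 5, 30):
    print(step, "days:", state @ np.linalg.matrix_power(P, step))

# Stationary distribution: left eigenvector of P for eigenvalue 1
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
print("stationary:", pi / pi.sum())

# Mean holding time of state i is 1 / (1 - P[i, i])
print("mean holding (days):", 1 / (1 - np.diag(P)))
```
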
**Key implications**:
- Surges and crashes last only ~1.3 days on average before reverting to sideways
- After a crash there is a 31.9% chance of flipping straight into a surge (a "V-shaped reversal")
- In the long run the market spends ~73.6% of days sideways, ~14.6% in strong rallies, and ~11.8% in sharp declines
- **Surge and crash probabilities are asymmetric**: surges (14.6%) are slightly more likely than crashes (11.8%), consistent with the long-run positive drift
### 16.7 Framework 5: Fat-Tail-Corrected Probabilities

Standard GBM assumes normality, but BTC's excess kurtosis is 15.65. We correct the extreme scenarios with historical tail frequencies:

| Scenario | Normal-model probability | BTC empirical probability | P(at least once within 1 year) |
|------|-----------|-----------------|------------------|
| Single day ≥ +3σ (+10.8%) | 0.135% | **0.776%** (5.75x) | ~94% |
| Single day ≤ -3σ (-10.8%) | 0.135% | **0.776%** (5.75x) | ~94% |
| Single day ≥ +4σ (+14.4%) | 0.003% | **0.275%** (86.9x) | ~63% |
| Single day ≤ -4σ (-14.4%) | 0.003% | **0.275%** (86.9x) | ~63% |
| Single day ≥ +5σ (+18.1%) | ~0.00003% | **est. 0.06%** | ~20% |
| Single day ≤ -5σ (-18.1%) | ~0.00003% | **est. 0.06%** | ~20% |

Within the next year, **at least one ±10% daily move is close to certain**, and there is a ~63% chance of at least one ±14% extreme day.
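
The last column follows from treating days as independent draws: P(at least once in a year) = 1 - (1 - p_daily)^365. A quick check:

```python
# Annual probability of at least one exceedance, from the daily probability.
for label, p in (("3σ", 0.00776), ("4σ", 0.00275), ("5σ", 0.0006)):
    print(label, f"{1 - (1 - p) ** 365:.1%}")
# Prints roughly 94%, 63%, and 20%, matching the table above.
```
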
### 16.8 Combined Scenario Analysis

Combining the 6 frameworks above into 5 discrete scenarios:

#### Scenario A: Sustained bull market (probability ~15%)

| Item | Value | Basis |
|------|-----|---------|
| 1-year target | $130,000 ~ $200,000 | GBM +1σ range + Hurst trend persistence |
| 2-year target | $180,000 ~ $350,000 | GBM +1σ~+2σ, power-law upper band $140K |
| Trigger | Sustained break above the 95% power-law band ($119,340) | Happened in 2021 |
| Probability basis | Markov surge state 14.6% × Hurst trend persistence 98.9% | But a single surge lasts only 1.3 days |

**Supporting data**: Hurst H=0.593 indicates weak trend persistence; once in an uptrend, it may continue. Weekly H=0.67 hints at stronger trending over longer horizons. But the surge state averages only 1.3 days, so many consecutive surges would be required.

**Contradicting data**: neither ARIMA nor the historical-mean model significantly beats the random walk (RMSE/RW=0.998); direction accuracy is just 49.9%.

#### Scenario B: Moderate rally (probability ~25%)

| Item | Value | Basis |
|------|-----|---------|
| 1-year target | $85,000 ~ $130,000 | Between the GBM median $85K and +1σ $170K |
| 2-year target | $95,000 ~ $180,000 | Above the power-law midline, historical drift |
| Trigger | Staying within the 50%~95% power-law band | Currently at 67.9%, already inside |
| Probability basis | Long-run drift of +0.094%/day | Backed by 8.5 years of data |

**Supporting data**: the positive daily drift of 0.094% has persisted across 3,091 days over 8.5 years. The exponential model beats the power law (AIC difference 493), hinting that the growth rate may not be decelerating.

#### Scenario C: Sideways range (probability ~30%)

| Item | Value | Basis |
|------|-----|---------|
| 1-year range | $50,000 ~ $100,000 | Power-law corridor 50%-95% |
| 2-year range | $45,000 ~ $110,000 | GBM ±0.5σ |
| Trigger | Sideways state persists (82% Markov self-transition) | The single most likely state |
| Probability basis | Markov stationary distribution: 73.6% sideways | The market consolidates most of the time |

**Supporting data**: sideways consolidation is the most frequent state (73.6% of days), with an 82% self-transition probability. Current annualized volatility (~46.5%) matches the sideways profile. The ~39.6-day FFT cycle (SNR=6.36) hints at a medium-term oscillation around the mean.

#### Scenario D: Moderate decline (probability ~20%)

| Item | Value | Basis |
|------|-----|---------|
| 1-year target | $40,000 ~ $65,000 | Around GBM -1σ ($43K) |
| 2-year target | $35,000 ~ $55,000 | Reversion to the power-law midline ($57K~$61K) |
| Trigger | Late-cycle post-halving drawdown | 3rd cycle turned bearish after ~550 days |
| Probability basis | Corridor position 67.9% → reversion toward the 50% midline | Mean-reversion pull |

**Supporting data**: the current 67.9% corridor quantile is elevated, with a statistical tendency to revert toward the midline. After its ~550-day peak, the 3rd cycle drew down about -75% ($69K → $16K); the 4th cycle is already 652 days in.

#### Scenario E: Black-swan crash (probability ~10%)

| Item | Value | Basis |
|------|-----|---------|
| 1-year low | $15,000 ~ $35,000 | GBM -2σ ($21.5K), near the 5% power-law band |
| Trigger | Systemic event (e.g. COVID 2020, FTX 2022) | Anomaly detection aligned 6/12 events |
| Probability basis | 63% annual probability of a 4σ day × sustained downside | Fat tails, 87x amplification |

**Supporting data**: drawdowns of -75% (2022) and -84% (2018) have actually occurred. The anomaly model (AUC=0.9935) shows extreme events have precursor signatures (elevated 5-day range and absolute-return dispersion), which does not mean their timing can be pinpointed.
### 16.9 Probability-Weighted Expectation

| Scenario | Probability | 1-year midpoint | 2-year midpoint |
|------|------|---------|---------|
| A Sustained bull | 15% | $165,000 | $265,000 |
| B Moderate rally | 25% | $107,500 | $137,500 |
| C Sideways | 30% | $75,000 | $77,500 |
| D Moderate decline | 20% | $52,500 | $45,000 |
| E Black swan | 10% | $25,000 | $25,000 |
| **Probability-weighted** | **100%** | **$87,750** | **$107,875** |

The probability-weighted 1-year expectation is about $87,750 (+14%) and the 2-year expectation about $107,875 (+40%), the same order of magnitude as the cumulative effect of the historical daily drift (+34% per year).
### 16.10 Core Limitations of These Projections

1. **Direction is unpredictable**: Chapter 13 showed that no forecasting model significantly beats the random walk (DM test p=0.152); direction accuracy is only 49.9%
2. **Too few cycles**: the halving analysis rests on 2 samples (pooled p=0.991), so statistical power is very low
3. **Structural change**: BTC's market structure (institutionalization, ETFs, regulation) changed fundamentally over 2017-2026; historical parameters may not carry forward
4. **Exogenous shocks cannot be modeled**: regulation, macroeconomics, and geopolitics move BTC substantially but cannot be inferred from price history
5. **Volatility, not direction, is predictable**: the core finding is GARCH persistence=0.973 plus long memory (d=0.635), meaning we can estimate "how much it will move" far better than "which way"
6. **Fat-tail risk**: normal-theory confidence intervals **badly understate** extreme scenarios; BTC's 4σ events occur at 87x the normal rate

> **The most honest conclusion**: if you must take a view on BTC over the next 1-2 years, the only statements with statistical support are:
> 1. **Volatility will be large** (annualized ~60%, so a ±60% move within a year is "normal")
> 2. **Extreme days are near-certain** (>90% probability of a ±10% single-day move within the year)
> 3. **A faint positive drift exists long-run** (+0.094%/day on average, but the daily standard deviation of 3.61% is 39x the drift)
> 4. **No precise price target has any statistical basis**

---

*Report generated: 2026-02-03 | Analysis code: [src/](src/) | Charts: [output/](output/)*
219
main.py
Normal file
@@ -0,0 +1,219 @@
#!/usr/bin/env python3
"""BTC/USDT price regularity analysis: main entry point.

Runs all analysis modules in sequence and writes results to output/.
Each module runs independently; one module failing does not affect the others.

Usage:
    python3 main.py                        # run all modules
    python3 main.py --modules fft wavelet  # run selected modules only
    python3 main.py --list                 # list available modules
"""

import sys
import time
import argparse
import traceback
from pathlib import Path
from collections import OrderedDict

# Make sure src/ is importable
ROOT = Path(__file__).parent
sys.path.insert(0, str(ROOT))

from src.data_loader import load_klines, load_daily, load_hourly, validate_data
from src.preprocessing import add_derived_features


# ── Module registry ─────────────────────────────────────────

def _import_module(name):
    """Import an analysis module lazily to avoid loading everything at startup."""
    import importlib
    return importlib.import_module(f"src.{name}")


# (module key, display name, source module, entry function, needs hourly data)
MODULE_REGISTRY = OrderedDict([
    ("fft", ("FFT spectral analysis", "fft_analysis", "run_fft_analysis", False)),
    ("wavelet", ("Wavelet analysis", "wavelet_analysis", "run_wavelet_analysis", False)),
    ("acf", ("ACF/PACF analysis", "acf_analysis", "run_acf_analysis", False)),
    ("returns", ("Return distribution analysis", "returns_analysis", "run_returns_analysis", False)),
    ("volatility", ("Volatility clustering analysis", "volatility_analysis", "run_volatility_analysis", False)),
    ("hurst", ("Hurst exponent analysis", "hurst_analysis", "run_hurst_analysis", False)),
    ("fractal", ("Fractal dimension analysis", "fractal_analysis", "run_fractal_analysis", False)),
    ("power_law", ("Power-law growth analysis", "power_law_analysis", "run_power_law_analysis", False)),
    ("volume_price", ("Volume-price analysis", "volume_price_analysis", "run_volume_price_analysis", False)),
    ("calendar", ("Calendar effect analysis", "calendar_analysis", "run_calendar_analysis", True)),
    ("halving", ("Halving cycle analysis", "halving_analysis", "run_halving_analysis", False)),
    ("indicators", ("Technical indicator validation", "indicators", "run_indicators_analysis", False)),
    ("patterns", ("Candlestick pattern analysis", "patterns", "run_patterns_analysis", False)),
    ("clustering", ("Market state clustering", "clustering", "run_clustering_analysis", False)),
    ("time_series", ("Time series forecasting", "time_series", "run_time_series_analysis", False)),
    ("causality", ("Causality tests", "causality", "run_causality_analysis", False)),
    ("anomaly", ("Anomaly detection", "anomaly", "run_anomaly_analysis", False)),
])


OUTPUT_DIR = ROOT / "output"


def run_single_module(key, df, df_hourly, output_base):
    """
    Run a single analysis module.

    Returns
    -------
    dict or None
        The module's result dict, or None on failure.
    """
    display_name, mod_name, func_name, needs_hourly = MODULE_REGISTRY[key]
    module_output = str(output_base / key)
    Path(module_output).mkdir(parents=True, exist_ok=True)

    print(f"\n{'='*60}")
    print(f" [{key}] {display_name}")
    print(f"{'='*60}")

    try:
        mod = _import_module(mod_name)
        func = getattr(mod, func_name)

        if needs_hourly:
            result = func(df, df_hourly, module_output)
        else:
            result = func(df, module_output)

        if result is None:
            result = {"status": "completed", "findings": []}

        result["status"] = "success"
        print(f" [{key}] done ✓")
        return result

    except Exception as e:
        print(f" [{key}] failed ✗: {e}")
        traceback.print_exc()
        return {"status": "error", "error": str(e), "findings": []}


def main():
    parser = argparse.ArgumentParser(description="BTC/USDT price regularity analysis")
    parser.add_argument("--modules", nargs="*", default=None,
                        help="modules to run (default: all)")
    parser.add_argument("--list", action="store_true",
                        help="list available modules")
    parser.add_argument("--start", type=str, default=None,
                        help="data start date, e.g. 2020-01-01")
    parser.add_argument("--end", type=str, default=None,
                        help="data end date, e.g. 2025-12-31")
    args = parser.parse_args()

    if args.list:
        print("\nAvailable analysis modules:")
        print("-" * 50)
        for key, (name, _, _, _) in MODULE_REGISTRY.items():
            print(f"  {key:<15} {name}")
        print()
        return

    # ── 1. Load data ──────────────────────────────────────
    print("=" * 60)
    print(" BTC/USDT price regularity analysis")
    print("=" * 60)

    print("\n[1/3] Loading daily data...")
    df_daily = load_daily(start=args.start, end=args.end)
    report = validate_data(df_daily, "1d")
    print(f"  rows: {report['rows']}")
    print(f"  date range: {report['date_range']}")
    print(f"  price range: {report['price_range']}")

    print("\n[2/3] Adding derived features...")
    df = add_derived_features(df_daily)
    print(f"  feature columns: {list(df.columns)}")

    print("\n[3/3] Loading hourly data (needed for calendar effects)...")
    try:
        df_hourly_raw = load_hourly(start=args.start, end=args.end)
        df_hourly = add_derived_features(df_hourly_raw)
        print(f"  hourly rows: {len(df_hourly)}")
    except Exception as e:
        print(f"  hourly data failed to load (hourly calendar analysis will be skipped): {e}")
        df_hourly = None

    # ── 2. Decide which modules to run ────────────────────
    if args.modules:
        modules_to_run = []
        for m in args.modules:
            if m in MODULE_REGISTRY:
                modules_to_run.append(m)
            else:
                print(f"  warning: unknown module '{m}', skipping")
    else:
        modules_to_run = list(MODULE_REGISTRY.keys())

    print(f"\nRunning {len(modules_to_run)} analysis modules:")
    for m in modules_to_run:
        print(f"  - {m}: {MODULE_REGISTRY[m][0]}")

    # ── 3. Run modules one by one ─────────────────────────
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    all_results = {}
    timings = {}

    for key in modules_to_run:
        t0 = time.time()
        result = run_single_module(key, df, df_hourly, OUTPUT_DIR)
        elapsed = time.time() - t0
        timings[key] = elapsed
        if result is not None:
            all_results[key] = result
        print(f"  elapsed: {elapsed:.1f}s")

    # ── 4. Generate the combined report ───────────────────
    print(f"\n{'='*60}")
    print(" Generating combined analysis report")
    print(f"{'='*60}")

    from src.visualization import generate_summary_dashboard, plot_price_overview

    # Price overview chart
    plot_price_overview(df_daily, str(OUTPUT_DIR))

    # Summary dashboard
    dashboard_result = generate_summary_dashboard(all_results, str(OUTPUT_DIR))

    # ── 5. Print execution summary ────────────────────────
    print(f"\n{'='*60}")
    print(" Execution summary")
    print(f"{'='*60}")

    success = sum(1 for r in all_results.values() if r.get("status") == "success")
    failed = sum(1 for r in all_results.values() if r.get("status") == "error")
    total_time = sum(timings.values())

    print(f"\n  modules: {len(modules_to_run)}")
    print(f"  succeeded: {success}")
    print(f"  failed: {failed}")
    print(f"  total time: {total_time:.1f}s")

    print(f"\n  per-module timings:")
    for key, t in sorted(timings.items(), key=lambda x: -x[1]):
        status = all_results.get(key, {}).get("status", "unknown")
        mark = "✓" if status == "success" else "✗"
        print(f"  {mark} {key:<15} {t:>8.1f}s")

    print(f"\n  output directory: {OUTPUT_DIR.resolve()}")
    if dashboard_result:
        print(f"  report: {dashboard_result.get('report_path', 'N/A')}")
        print(f"  dashboard: {dashboard_result.get('dashboard_path', 'N/A')}")
        print(f"  JSON results: {dashboard_result.get('json_path', 'N/A')}")

    print(f"\n{'='*60}")
    print(" Analysis complete!")
    print(f"{'='*60}\n")


if __name__ == "__main__":
    main()
BIN  output/acf/acf_grid.png  (new, 94 KiB)
BIN  output/acf/pacf_grid.png  (new, 96 KiB)
BIN  output/acf/significant_lags_heatmap.png  (new, 27 KiB)
44   output/all_results.json  (new text file, content not shown)
BIN  output/anomaly/anomaly_feature_distributions.png  (new, 100 KiB)
BIN  output/anomaly/anomaly_price_chart.png  (new, 203 KiB)
BIN  output/anomaly/precursor_feature_importance.png  (new, 84 KiB)
BIN  output/anomaly/precursor_roc_curve.png  (new, 57 KiB)
BIN  output/calendar/calendar_hour_effect.png  (new, 84 KiB)
BIN  output/calendar/calendar_month_effect.png  (new, 218 KiB)
BIN  output/calendar/calendar_quarter_boundary_effect.png  (new, 59 KiB)
BIN  output/calendar/calendar_weekday_effect.png  (new, 68 KiB)
BIN  output/causality/granger_causal_network.png  (new, 93 KiB)
BIN  output/causality/granger_pvalue_heatmap.png  (new, 119 KiB)
BIN  output/clustering/cluster_heatmap_gmm.png  (new, 160 KiB)
BIN  output/clustering/cluster_heatmap_k-means.png  (new, 101 KiB)
BIN  output/clustering/cluster_k_selection.png  (new, 107 KiB)
BIN  output/clustering/cluster_pca_gmm.png  (new, 150 KiB)
BIN  output/clustering/cluster_pca_k-means.png  (new, 123 KiB)
BIN  output/clustering/cluster_silhouette_k-means.png  (new, 64 KiB)
BIN  output/clustering/cluster_state_timeseries.png  (new, 170 KiB)
BIN  output/clustering/cluster_transition_matrix.png  (new, 65 KiB)
BIN  output/evidence_dashboard.png  (new, 48 KiB)
BIN  output/fft/fft_bandpass_components.png  (new, 652 KiB)
BIN  output/fft/fft_multi_timeframe.png  (new, 517 KiB)
BIN  output/fft/fft_power_spectrum.png  (new, 294 KiB)
BIN  output/fractal/fractal_box_counting.png  (new, 95 KiB)
BIN  output/fractal/fractal_monte_carlo.png  (new, 87 KiB)
BIN  output/fractal/fractal_self_similarity.png  (new, 111 KiB)
BIN  output/halving/halving_combined_summary.png  (new, 350 KiB)
BIN  output/halving/halving_cumulative_returns.png  (new, 131 KiB)
BIN  output/halving/halving_normalized_trajectories.png  (new, 130 KiB)
BIN  output/halving/halving_pre_post_returns.png  (new, 54 KiB)
BIN  output/hurst/hurst_multi_timeframe.png  (new, 58 KiB)
BIN  output/hurst/hurst_rolling.png  (new, 108 KiB)
BIN  output/hurst/hurst_rs_loglog.png  (new, 103 KiB)
BIN  output/indicators/best_indicator_train.png  (new, 129 KiB)
BIN  output/indicators/best_indicator_val.png  (new, 114 KiB)
BIN  output/indicators/ic_distribution_train.png  (new, 57 KiB)
BIN  output/indicators/ic_distribution_val.png  (new, 57 KiB)
BIN  output/indicators/pvalue_heatmap_train.png  (new, 72 KiB)
BIN  output/indicators/pvalue_heatmap_val.png  (new, 71 KiB)
BIN  output/patterns/pattern_counts_train.png  (new, 53 KiB)
BIN  output/patterns/pattern_counts_val.png  (new, 53 KiB)
BIN  output/patterns/pattern_forward_returns_train.png  (new, 119 KiB)
BIN  output/patterns/pattern_forward_returns_val.png  (new, 101 KiB)
BIN  output/patterns/pattern_hit_rate_train.png  (new, 75 KiB)
BIN  output/patterns/pattern_hit_rate_val.png  (new, 71 KiB)
BIN  output/power_law/power_law_corridor.png  (new, 141 KiB)
BIN  output/power_law/power_law_loglog_regression.png  (new, 97 KiB)
BIN  output/power_law/power_law_model_comparison.png  (new, 122 KiB)
BIN  output/power_law/power_law_residual_distribution.png  (new, 68 KiB)
BIN  output/price_overview.png  (new, 121 KiB)
BIN  output/returns/garch_conditional_volatility.png  (new, 118 KiB)
BIN  output/returns/multi_timeframe_distributions.png  (new, 132 KiB)
BIN  output/returns/returns_histogram_vs_normal.png  (new, 61 KiB)
BIN  output/returns/returns_qq_plot.png  (new, 58 KiB)
BIN  output/time_series/ts_cumulative_error.png  (new, 64 KiB)
BIN  output/time_series/ts_direction_accuracy.png  (new, 36 KiB)
BIN  output/time_series/ts_predictions_comparison.png  (new, 280 KiB)
BIN  output/volatility/acf_power_law_fit.png  (new, 93 KiB)
BIN  output/volatility/garch_model_comparison.png  (new, 234 KiB)
BIN  output/volatility/leverage_effect_scatter.png  (new, 229 KiB)
BIN  output/volatility/realized_volatility_multiwindow.png  (new, 222 KiB)
BIN  output/volume_price/granger_causality_heatmap.png  (new, 65 KiB)
BIN  output/volume_price/obv_divergence.png  (new, 216 KiB)
BIN  output/volume_price/taker_buy_lead_lag.png  (new, 47 KiB)
BIN  output/volume_price/volume_return_scatter.png  (new, 85 KiB)
BIN  output/wavelet/wavelet_global_spectrum.png  (new, 105 KiB)
BIN  output/wavelet/wavelet_key_periods.png  (new, 785 KiB)
BIN  output/wavelet/wavelet_scalogram.png  (new, 1.1 MiB)
35
output/综合结论报告.txt
Normal file
@@ -0,0 +1,35 @@
======================================================================
BTC/USDT Price Regularity Analysis: Combined Conclusions
======================================================================


Criteria for a "genuine regularity" (all must hold):
1. FDR-corrected p < 0.05
2. Permutation-test p < 0.01 (where applicable)
3. Effect direction consistent and significant on the test set
4. Holds in >80% of bootstrap subsamples (where applicable)
5. Cohen's d > 0.2 or economically meaningful
6. A plausible economic/market rationale


----------------------------------------------------------------------
Module           Score   Strength   Findings
----------------------------------------------------------------------
indicators       0.00    none       0
patterns         0.00    none       0
----------------------------------------------------------------------

## Strong-evidence regularities (reproducible, economically meaningful):
(none)

## Moderate-evidence regularities (statistically significant, limited effect):
(none)

## Weak evidence / not significant:
* indicators
* patterns

======================================================================
Note: scores are based on each module's self-reported statistical tests.
See each subdirectory's output for parameters and charts.
======================================================================
17
requirements.txt
Normal file
@@ -0,0 +1,17 @@
pandas>=2.0
numpy>=1.24
scipy>=1.11
matplotlib>=3.7
seaborn>=0.12
statsmodels>=0.14
PyWavelets>=1.4
arch>=6.0
scikit-learn>=1.3
# pandas-ta removed; technical indicators are implemented by hand in indicators.py
hdbscan>=0.8
nolds>=0.5.2
prophet>=1.1
torch>=2.0
pyod>=1.1
plotly>=5.15
pmdarima>=2.0
1
src/__init__.py
Normal file
@@ -0,0 +1 @@
# BTC/USDT Price Analysis Package
758
src/acf_analysis.py
Normal file
@@ -0,0 +1,758 @@
|
||||
"""ACF/PACF 自相关分析模块
|
||||
|
||||
对BTC日线数据的多序列(对数收益率、平方收益率、绝对收益率、成交量)进行
|
||||
自相关函数(ACF)、偏自相关函数(PACF)分析,自动检测显著滞后阶与周期性模式,
|
||||
并执行 Ljung-Box 检验以验证序列依赖结构。
|
||||
"""
|
||||
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
import matplotlib
|
||||
matplotlib.use('Agg')
|
||||
import matplotlib.pyplot as plt
|
||||
from statsmodels.tsa.stattools import acf, pacf
|
||||
from statsmodels.stats.diagnostic import acorr_ljungbox
|
||||
from pathlib import Path
|
||||
from typing import Dict, List, Tuple, Optional, Any, Union
|
||||
|
||||
|
||||
# ============================================================
|
||||
# 常量配置
|
||||
# ============================================================
|
||||
|
||||
# ACF/PACF 最大滞后阶数
|
||||
ACF_MAX_LAGS = 100
|
||||
PACF_MAX_LAGS = 40
|
||||
|
||||
# Ljung-Box 检验的滞后组
|
||||
LJUNGBOX_LAG_GROUPS = [10, 20, 50, 100]
|
||||
|
||||
# 显著性水平对应的 z 值(双侧 5%)
|
||||
Z_CRITICAL = 1.96
|
||||
|
||||
# 分析目标序列名称 -> 列名映射
|
||||
SERIES_CONFIG = {
|
||||
"log_return": {
|
||||
"column": "log_return",
|
||||
"label": "对数收益率 (Log Return)",
|
||||
"purpose": "检测线性序列相关性",
|
||||
},
|
||||
"squared_return": {
|
||||
"column": "squared_return",
|
||||
"label": "平方收益率 (Squared Return)",
|
||||
"purpose": "检测波动聚集效应 / ARCH效应",
|
||||
},
|
||||
"abs_return": {
|
||||
"column": "abs_return",
|
||||
"label": "绝对收益率 (Absolute Return)",
|
||||
"purpose": "非线性依赖关系的稳健性检验",
|
||||
},
|
||||
"volume": {
|
||||
"column": "volume",
|
||||
"label": "成交量 (Volume)",
|
||||
"purpose": "检测成交量自相关性",
|
||||
},
|
||||
}
|
||||
|
||||
|
||||
# ============================================================
|
||||
# 核心计算函数
|
||||
# ============================================================
|
||||
|
||||
def compute_acf(series: pd.Series, nlags: int = ACF_MAX_LAGS) -> Tuple[np.ndarray, np.ndarray]:
|
||||
"""
|
||||
计算自相关函数及置信区间
|
||||
|
||||
Parameters
|
||||
----------
|
||||
series : pd.Series
|
||||
输入时间序列(已去除NaN)
|
||||
nlags : int
|
||||
最大滞后阶数
|
||||
|
||||
Returns
|
||||
-------
|
||||
acf_values : np.ndarray
|
||||
ACF 值数组,shape=(nlags+1,)
|
||||
confint : np.ndarray
|
||||
置信区间数组,shape=(nlags+1, 2)
|
||||
"""
|
||||
clean = series.dropna().values
|
||||
# alpha=0.05 对应 95% 置信区间
|
||||
acf_values, confint = acf(clean, nlags=nlags, alpha=0.05, fft=True)
|
||||
return acf_values, confint
|
||||
|
||||
|
||||
def compute_pacf(series: pd.Series, nlags: int = PACF_MAX_LAGS) -> Tuple[np.ndarray, np.ndarray]:
|
||||
"""
|
||||
计算偏自相关函数及置信区间
|
||||
|
||||
Parameters
|
||||
----------
|
||||
series : pd.Series
|
||||
输入时间序列(已去除NaN)
|
||||
nlags : int
|
||||
最大滞后阶数
|
||||
|
||||
Returns
|
||||
-------
|
||||
pacf_values : np.ndarray
|
||||
PACF 值数组
|
||||
confint : np.ndarray
|
||||
置信区间数组
|
||||
"""
|
||||
clean = series.dropna().values
|
||||
# 确保 nlags 不超过样本量的一半
|
||||
max_allowed = len(clean) // 2 - 1
|
||||
nlags = min(nlags, max_allowed)
|
||||
pacf_values, confint = pacf(clean, nlags=nlags, alpha=0.05, method='ywm')
|
||||
return pacf_values, confint
|
||||
|
||||
|
||||
def find_significant_lags(
|
||||
acf_values: np.ndarray,
|
||||
n_obs: int,
|
||||
start_lag: int = 1,
|
||||
) -> List[int]:
|
||||
"""
|
||||
识别超过 ±1.96/√N 置信带的显著滞后阶
|
||||
|
||||
Parameters
|
||||
----------
|
||||
acf_values : np.ndarray
|
||||
ACF 值数组(包含 lag 0)
|
||||
n_obs : int
|
||||
样本总数(用于计算 Bartlett 置信带宽度)
|
||||
start_lag : int
|
||||
从哪个滞后阶开始检测(默认跳过 lag 0)
|
||||
|
||||
Returns
|
||||
-------
|
||||
significant : list of int
|
||||
显著的滞后阶列表
|
||||
"""
|
||||
threshold = Z_CRITICAL / np.sqrt(n_obs)
|
||||
significant = []
|
||||
for lag in range(start_lag, len(acf_values)):
|
||||
if abs(acf_values[lag]) > threshold:
|
||||
significant.append(lag)
|
||||
return significant
|
||||
|
||||
|
||||
def detect_periodic_pattern(
|
||||
significant_lags: List[int],
|
||||
min_period: int = 2,
|
||||
max_period: int = 50,
|
||||
min_occurrences: int = 3,
|
||||
tolerance: int = 1,
|
||||
) -> List[Dict[str, Any]]:
|
||||
"""
|
||||
检测显著滞后阶中的周期性模式
|
||||
|
||||
算法:对每个候选周期 p,检查 p, 2p, 3p, ... 是否在显著滞后阶集合中
|
||||
(允许 ±tolerance 偏差),若命中次数 >= min_occurrences 则认为存在周期。
|
||||
|
||||
Parameters
|
||||
----------
|
||||
significant_lags : list of int
|
||||
显著滞后阶列表
|
||||
min_period : int
|
||||
最小候选周期
|
||||
max_period : int
|
||||
最大候选周期
|
||||
min_occurrences : int
|
||||
最少需要出现的周期倍数次数
|
||||
tolerance : int
|
||||
允许的滞后偏差(天数)
|
||||
|
||||
Returns
|
||||
-------
|
||||
patterns : list of dict
|
||||
检测到的周期性模式列表,每个元素包含:
|
||||
- period: 周期长度
|
||||
- hits: 命中的滞后阶列表
|
||||
- count: 命中次数
|
||||
- fft_note: FFT 交叉验证说明
|
||||
"""
|
||||
if not significant_lags:
|
||||
return []
|
||||
|
||||
sig_set = set(significant_lags)
|
||||
max_lag = max(significant_lags)
|
||||
patterns = []
|
||||
|
||||
for period in range(min_period, min(max_period + 1, max_lag + 1)):
|
||||
hits = []
|
||||
# 检查周期的整数倍是否出现在显著滞后阶中
|
||||
multiple = 1
|
||||
while period * multiple <= max_lag + tolerance:
|
||||
target = period * multiple
|
||||
# 在 ±tolerance 范围内查找匹配
|
||||
for offset in range(-tolerance, tolerance + 1):
|
||||
if (target + offset) in sig_set:
|
||||
hits.append(target + offset)
|
||||
break
|
||||
multiple += 1
|
||||
|
||||
if len(hits) >= min_occurrences:
|
||||
# FFT 交叉验证说明:周期 p 天对应频率 1/p
|
||||
fft_freq = 1.0 / period
|
||||
patterns.append({
|
||||
"period": period,
|
||||
"hits": hits,
|
||||
"count": len(hits),
|
||||
"fft_note": (
|
||||
f"若FFT频谱在 f={fft_freq:.4f} (1/{period}天) "
|
||||
f"处存在峰值,则交叉验证通过"
|
||||
),
|
||||
})
|
||||
|
||||
# 按命中次数降序排列,去除被更短周期包含的冗余模式
|
||||
patterns.sort(key=lambda x: (-x["count"], x["period"]))
|
||||
filtered = _filter_harmonic_patterns(patterns)
|
||||
|
||||
return filtered
|
||||
|
||||
|
||||
def _filter_harmonic_patterns(
|
||||
patterns: List[Dict[str, Any]],
|
||||
) -> List[Dict[str, Any]]:
|
||||
"""
|
||||
过滤谐波冗余的周期模式
|
||||
|
||||
如果周期 A 是周期 B 的整数倍且命中数不明显更多,则保留较短周期。
|
||||
"""
|
||||
if len(patterns) <= 1:
|
||||
return patterns
|
||||
|
||||
kept = []
|
||||
periods_kept = set()
|
||||
|
||||
for pat in patterns:
|
||||
p = pat["period"]
|
||||
# 检查是否为已保留周期的整数倍
|
||||
is_harmonic = False
|
||||
for kp in periods_kept:
|
||||
if p % kp == 0 and p != kp:
|
||||
is_harmonic = True
|
||||
break
|
||||
if not is_harmonic:
|
||||
kept.append(pat)
|
||||
periods_kept.add(p)
|
||||
|
||||
return kept
|
||||
|
||||
|
||||
def run_ljungbox_test(
|
||||
series: pd.Series,
|
||||
lag_groups: List[int] = None,
|
||||
) -> pd.DataFrame:
|
||||
"""
|
||||
对序列执行 Ljung-Box 白噪声检验
|
||||
|
||||
Parameters
|
||||
----------
|
||||
series : pd.Series
|
||||
输入时间序列
|
||||
lag_groups : list of int
|
||||
检验的滞后阶组
|
||||
|
||||
Returns
|
||||
-------
|
||||
results : pd.DataFrame
|
||||
包含 lag, lb_stat, lb_pvalue 的结果表
|
||||
"""
|
||||
if lag_groups is None:
|
||||
lag_groups = LJUNGBOX_LAG_GROUPS
|
||||
|
||||
clean = series.dropna()
|
||||
max_lag = max(lag_groups)
|
||||
|
||||
# 确保最大滞后不超过样本量
|
||||
if max_lag >= len(clean):
|
||||
lag_groups = [lg for lg in lag_groups if lg < len(clean)]
|
||||
if not lag_groups:
|
||||
return pd.DataFrame(columns=["lag", "lb_stat", "lb_pvalue"])
|
||||
max_lag = max(lag_groups)
|
||||
|
||||
lb_result = acorr_ljungbox(clean, lags=max_lag, return_df=True)
|
||||
|
||||
rows = []
|
||||
for lg in lag_groups:
|
||||
if lg <= len(lb_result):
|
||||
rows.append({
|
||||
"lag": lg,
|
||||
"lb_stat": lb_result.loc[lg, "lb_stat"],
|
||||
"lb_pvalue": lb_result.loc[lg, "lb_pvalue"],
|
||||
})
|
||||
|
||||
return pd.DataFrame(rows)
|
||||
|
||||
|
||||
# ============================================================
|
||||
# 可视化函数
|
||||
# ============================================================
|
||||
|
||||
def _plot_acf_grid(
|
||||
acf_data: Dict[str, Tuple[np.ndarray, np.ndarray, int, List[int]]],
|
||||
output_path: Path,
|
||||
) -> None:
|
||||
"""
|
||||
绘制 2x2 ACF 图
|
||||
|
||||
Parameters
|
||||
----------
|
||||
acf_data : dict
|
||||
键为序列名称,值为 (acf_values, confint, n_obs, significant_lags) 元组
|
||||
output_path : Path
|
||||
输出文件路径
|
||||
"""
|
||||
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
|
||||
fig.suptitle("BTC 自相关函数 (ACF) 分析", fontsize=16, fontweight='bold', y=0.98)
|
||||
|
||||
series_keys = list(SERIES_CONFIG.keys())
|
||||
|
||||
for idx, key in enumerate(series_keys):
|
||||
ax = axes[idx // 2, idx % 2]
|
||||
|
||||
if key not in acf_data:
|
||||
ax.set_visible(False)
|
||||
continue
|
||||
|
||||
acf_vals, confint, n_obs, sig_lags = acf_data[key]
|
||||
config = SERIES_CONFIG[key]
|
||||
lags = np.arange(len(acf_vals))
|
||||
threshold = Z_CRITICAL / np.sqrt(n_obs)
|
||||
|
||||
# 绘制 ACF 柱状图
|
||||
colors = []
|
||||
for lag in lags:
|
||||
if lag == 0:
|
||||
colors.append('#2196F3') # lag 0 用蓝色
|
||||
elif lag in sig_lags:
|
||||
colors.append('#F44336') # 显著滞后用红色
|
||||
else:
|
||||
colors.append('#90CAF9') # 非显著用浅蓝
|
||||
|
||||
ax.bar(lags, acf_vals, color=colors, width=0.8, alpha=0.85)
|
||||
|
||||
# 绘制置信带
|
||||
ax.axhline(y=threshold, color='#E91E63', linestyle='--',
|
||||
linewidth=1.2, alpha=0.7, label=f'±{Z_CRITICAL}/√N = ±{threshold:.4f}')
|
||||
ax.axhline(y=-threshold, color='#E91E63', linestyle='--',
|
||||
linewidth=1.2, alpha=0.7)
|
||||
ax.axhline(y=0, color='black', linewidth=0.5)
|
||||
|
||||
# 标注显著滞后阶(仅标注前10个避免拥挤)
|
||||
sig_lags_sorted = sorted(sig_lags)[:10]
|
||||
for lag in sig_lags_sorted:
|
||||
if lag < len(acf_vals):
|
||||
ax.annotate(
|
||||
f'{lag}',
|
||||
xy=(lag, acf_vals[lag]),
|
||||
xytext=(0, 8 if acf_vals[lag] > 0 else -12),
|
||||
textcoords='offset points',
|
||||
fontsize=7,
|
||||
color='#D32F2F',
|
||||
ha='center',
|
||||
fontweight='bold',
|
||||
)
|
||||
|
||||
ax.set_title(f'{config["label"]}\n({config["purpose"]})', fontsize=11)
|
||||
ax.set_xlabel('滞后阶 (Lag)', fontsize=10)
|
||||
ax.set_ylabel('ACF', fontsize=10)
|
||||
ax.legend(fontsize=8, loc='upper right')
|
||||
ax.set_xlim(-1, len(acf_vals))
|
||||
ax.grid(axis='y', alpha=0.3)
|
||||
ax.tick_params(labelsize=9)
|
||||
|
||||
plt.tight_layout(rect=[0, 0, 1, 0.95])
|
||||
fig.savefig(output_path, dpi=150, bbox_inches='tight')
|
||||
plt.close(fig)
|
||||
print(f"[ACF图] 已保存: {output_path}")
|
||||
|
||||
|
||||
def _plot_pacf_grid(
    pacf_data: Dict[str, Tuple[np.ndarray, np.ndarray, int, List[int]]],
    output_path: Path,
) -> None:
    """
    Plot a 2x2 grid of PACF charts.

    Parameters
    ----------
    pacf_data : dict
        Maps series name to a (pacf_values, confint, n_obs, significant_lags) tuple.
    output_path : Path
        Output file path.
    """
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    fig.suptitle("BTC 偏自相关函数 (PACF) 分析", fontsize=16, fontweight='bold', y=0.98)

    series_keys = list(SERIES_CONFIG.keys())

    for idx, key in enumerate(series_keys):
        ax = axes[idx // 2, idx % 2]

        if key not in pacf_data:
            ax.set_visible(False)
            continue

        pacf_vals, confint, n_obs, sig_lags = pacf_data[key]
        config = SERIES_CONFIG[key]
        lags = np.arange(len(pacf_vals))
        threshold = Z_CRITICAL / np.sqrt(n_obs)

        # Bar chart of PACF values
        colors = []
        for lag in lags:
            if lag == 0:
                colors.append('#4CAF50')
            elif lag in sig_lags:
                colors.append('#FF5722')
            else:
                colors.append('#A5D6A7')

        ax.bar(lags, pacf_vals, color=colors, width=0.6, alpha=0.85)

        # Confidence band
        ax.axhline(y=threshold, color='#E91E63', linestyle='--',
                   linewidth=1.2, alpha=0.7, label=f'±{Z_CRITICAL}/√N = ±{threshold:.4f}')
        ax.axhline(y=-threshold, color='#E91E63', linestyle='--',
                   linewidth=1.2, alpha=0.7)
        ax.axhline(y=0, color='black', linewidth=0.5)

        # Annotate significant lags (first 10 only, to avoid clutter)
        sig_lags_sorted = sorted(sig_lags)[:10]
        for lag in sig_lags_sorted:
            if lag < len(pacf_vals):
                ax.annotate(
                    f'{lag}',
                    xy=(lag, pacf_vals[lag]),
                    xytext=(0, 8 if pacf_vals[lag] > 0 else -12),
                    textcoords='offset points',
                    fontsize=7,
                    color='#BF360C',
                    ha='center',
                    fontweight='bold',
                )

        ax.set_title(f'{config["label"]}\n(PACF - 偏自相关)', fontsize=11)
        ax.set_xlabel('滞后阶 (Lag)', fontsize=10)
        ax.set_ylabel('PACF', fontsize=10)
        ax.legend(fontsize=8, loc='upper right')
        ax.set_xlim(-1, len(pacf_vals))
        ax.grid(axis='y', alpha=0.3)
        ax.tick_params(labelsize=9)

    plt.tight_layout(rect=[0, 0, 1, 0.95])
    fig.savefig(output_path, dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"[PACF图] 已保存: {output_path}")
def _plot_significant_lags_summary(
    all_sig_lags: Dict[str, List[int]],
    n_obs: int,
    output_path: Path,
) -> None:
    """
    Plot a summary heatmap of significant autocorrelation lags across all series.

    Parameters
    ----------
    all_sig_lags : dict
        Maps series name to its list of significant lags.
    n_obs : int
        Total sample size.
    output_path : Path
        Output file path.
    """
    max_lag = ACF_MAX_LAGS
    series_names = list(SERIES_CONFIG.keys())
    labels = [SERIES_CONFIG[k]["label"].split(" (")[0] for k in series_names]

    # Build a binary matrix: rows = series, columns = lags
    matrix = np.zeros((len(series_names), max_lag + 1))
    for i, key in enumerate(series_names):
        for lag in all_sig_lags.get(key, []):
            if lag <= max_lag:
                matrix[i, lag] = 1

    fig, ax = plt.subplots(figsize=(20, 4))
    im = ax.imshow(matrix, aspect='auto', cmap='YlOrRd', interpolation='none')
    ax.set_yticks(range(len(labels)))
    ax.set_yticklabels(labels, fontsize=10)
    ax.set_xlabel('滞后阶 (Lag)', fontsize=11)
    ax.set_title('显著自相关滞后阶汇总 (ACF > 置信带)', fontsize=13, fontweight='bold')

    # Tick the x axis every 5 lags
    ax.set_xticks(range(0, max_lag + 1, 5))
    ax.tick_params(labelsize=8)

    plt.colorbar(im, ax=ax, label='显著 (1) / 不显著 (0)', shrink=0.8)
    plt.tight_layout()
    fig.savefig(output_path, dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"[显著滞后汇总图] 已保存: {output_path}")
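def _demo_confidence_band() -> None:
    """Standalone sketch (illustrative only) of the ±z/√N band drawn above.

    Under the white-noise null, sample autocorrelations are approximately
    N(0, 1/N), so with N = 3,090 daily returns the 95% band is about ±0.0353.
    """
    n_obs = 3_090
    threshold = 1.96 / np.sqrt(n_obs)
    print(f"95% band for N={n_obs}: ±{threshold:.4f}")   # ±0.0353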
# ============================================================
# Main entry point
# ============================================================

def run_acf_analysis(
    df: pd.DataFrame,
    output_dir: Union[str, Path] = "output/acf",
) -> Dict[str, Any]:
    """
    Main entry point for the ACF/PACF autocorrelation analysis.

    Runs the full autocorrelation pipeline on four series (log returns,
    squared returns, absolute returns, and volume): ACF computation, PACF
    computation, significant-lag detection, periodic-pattern detection,
    Ljung-Box tests, and visualization.

    Parameters
    ----------
    df : pd.DataFrame
        Daily DataFrame with log_return, squared_return, abs_return, and volume
        columns (typically produced by preprocessing.add_derived_features).
    output_dir : str or Path
        Directory for chart output.

    Returns
    -------
    results : dict
        Result dictionary with the following structure:
        {
            "acf": {series_name: {"values": ndarray, "significant_lags": list, ...}},
            "pacf": {series_name: {"values": ndarray, "significant_lags": list, ...}},
            "ljungbox": {series_name: DataFrame},
            "periodic_patterns": {series_name: list of dict},
            "summary": {...}
        }
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Verify that the required columns exist
    required_cols = [cfg["column"] for cfg in SERIES_CONFIG.values()]
    missing = [c for c in required_cols if c not in df.columns]
    if missing:
        raise ValueError(f"DataFrame 缺少必要列: {missing}。请先调用 add_derived_features()。")

    print("=" * 70)
    print("ACF / PACF 自相关分析")
    print("=" * 70)
    print(f"样本量: {len(df)}")
    print(f"时间范围: {df.index.min()} ~ {df.index.max()}")
    print(f"ACF最大滞后: {ACF_MAX_LAGS} | PACF最大滞后: {PACF_MAX_LAGS}")
    print(f"置信水平: 95% (z={Z_CRITICAL})")
    print()

    # Result containers
    results = {
        "acf": {},
        "pacf": {},
        "ljungbox": {},
        "periodic_patterns": {},
        "summary": {},
    }

    # Intermediate data for plotting
    acf_plot_data = {}   # {key: (acf_vals, confint, n_obs, sig_lags_set)}
    pacf_plot_data = {}
    all_sig_lags = {}    # {key: list of significant lag indices}

    # --------------------------------------------------------
    # Per-series analysis
    # --------------------------------------------------------
    for key, config in SERIES_CONFIG.items():
        col = config["column"]
        label = config["label"]
        purpose = config["purpose"]
        series = df[col].dropna()
        n_obs = len(series)

        print(f"{'─' * 60}")
        print(f"序列: {label}")
        print(f"  目的: {purpose}")
        print(f"  有效样本: {n_obs}")

        # ---------- ACF ----------
        acf_vals, acf_confint = compute_acf(series, nlags=ACF_MAX_LAGS)
        sig_lags_acf = find_significant_lags(acf_vals, n_obs)
        sig_lags_set = set(sig_lags_acf)

        results["acf"][key] = {
            "values": acf_vals,
            "confint": acf_confint,
            "significant_lags": sig_lags_acf,
            "n_obs": n_obs,
            "threshold": Z_CRITICAL / np.sqrt(n_obs),
        }
        acf_plot_data[key] = (acf_vals, acf_confint, n_obs, sig_lags_set)
        all_sig_lags[key] = sig_lags_acf

        print(f"  [ACF] 显著滞后阶数: {len(sig_lags_acf)}")
        if sig_lags_acf:
            # Print the first 20 significant lags
            display_lags = sig_lags_acf[:20]
            lag_str = ", ".join(str(l) for l in display_lags)
            if len(sig_lags_acf) > 20:
                lag_str += f" ... (共{len(sig_lags_acf)}个)"
            print(f"    滞后阶: {lag_str}")
            # Print the lag with the largest |ACF| (excluding lag 0)
            max_idx = max(range(1, len(acf_vals)), key=lambda i: abs(acf_vals[i]))
            print(f"    最大|ACF|: lag={max_idx}, ACF={acf_vals[max_idx]:.6f}")

        # ---------- PACF ----------
        pacf_vals, pacf_confint = compute_pacf(series, nlags=PACF_MAX_LAGS)
        sig_lags_pacf = find_significant_lags(pacf_vals, n_obs)
        sig_lags_pacf_set = set(sig_lags_pacf)

        results["pacf"][key] = {
            "values": pacf_vals,
            "confint": pacf_confint,
            "significant_lags": sig_lags_pacf,
            "n_obs": n_obs,
        }
        pacf_plot_data[key] = (pacf_vals, pacf_confint, n_obs, sig_lags_pacf_set)

        print(f"  [PACF] 显著滞后阶数: {len(sig_lags_pacf)}")
        if sig_lags_pacf:
            display_lags_p = sig_lags_pacf[:15]
            lag_str_p = ", ".join(str(l) for l in display_lags_p)
            if len(sig_lags_pacf) > 15:
                lag_str_p += f" ... (共{len(sig_lags_pacf)}个)"
            print(f"    滞后阶: {lag_str_p}")

        # ---------- Periodic-pattern detection ----------
        periodic = detect_periodic_pattern(sig_lags_acf)
        results["periodic_patterns"][key] = periodic

        if periodic:
            print(f"  [周期性] 检测到 {len(periodic)} 个周期模式:")
            for pat in periodic:
                hit_str = ", ".join(str(h) for h in pat["hits"][:8])
                print(f"    - 周期 {pat['period']}天 (命中{pat['count']}次): "
                      f"lags=[{hit_str}]")
                print(f"      FFT验证: {pat['fft_note']}")
        else:
            print("  [周期性] 未检测到明显周期模式")

        # ---------- Ljung-Box test ----------
        lb_df = run_ljungbox_test(series, LJUNGBOX_LAG_GROUPS)
        results["ljungbox"][key] = lb_df

        print("  [Ljung-Box检验]")
        if not lb_df.empty:
            for _, row in lb_df.iterrows():
                lag_val = int(row["lag"])
                stat = row["lb_stat"]
                pval = row["lb_pvalue"]
                # Significance markers
                sig_mark = "***" if pval < 0.001 else "**" if pval < 0.01 else "*" if pval < 0.05 else ""
                reject_str = "拒绝H0(存在自相关)" if pval < 0.05 else "不拒绝H0(无显著自相关)"
                print(f"    lag={lag_val:3d}: Q={stat:12.2f}, p={pval:.6f} {sig_mark} → {reject_str}")
        print()
    # --------------------------------------------------------
    # Summary
    # --------------------------------------------------------
    print("=" * 70)
    print("分析汇总")
    print("=" * 70)

    summary = {}
    for key, config in SERIES_CONFIG.items():
        label_short = config["label"].split(" (")[0]
        acf_sig = results["acf"][key]["significant_lags"]
        pacf_sig = results["pacf"][key]["significant_lags"]
        lb = results["ljungbox"][key]
        periodic = results["periodic_patterns"][key]

        # Is the Ljung-Box test significant at the largest lag?
        lb_significant = False
        if not lb.empty:
            max_lag_row = lb.iloc[-1]
            lb_significant = max_lag_row["lb_pvalue"] < 0.05

        summary[key] = {
            "label": label_short,
            "acf_significant_count": len(acf_sig),
            "pacf_significant_count": len(pacf_sig),
            "ljungbox_rejects_white_noise": lb_significant,
            "periodic_patterns_count": len(periodic),
            "periodic_periods": [p["period"] for p in periodic],
        }

        lb_verdict = "存在自相关" if lb_significant else "无显著自相关"
        period_str = (
            ", ".join(f"{p}天" for p in summary[key]["periodic_periods"])
            if periodic else "无"
        )

        print(f"  {label_short}:")
        print(f"    ACF显著滞后: {len(acf_sig)}个 | PACF显著滞后: {len(pacf_sig)}个")
        print(f"    Ljung-Box: {lb_verdict} | 周期性模式: {period_str}")

    results["summary"] = summary
    # --------------------------------------------------------
    # Visualization
    # --------------------------------------------------------
    print()
    print("生成可视化图表...")

    # 1) 2x2 ACF grid
    _plot_acf_grid(acf_plot_data, output_dir / "acf_grid.png")

    # 2) 2x2 PACF grid
    _plot_pacf_grid(pacf_plot_data, output_dir / "pacf_grid.png")

    # 3) Summary heatmap of significant lags
    _plot_significant_lags_summary(
        all_sig_lags,
        n_obs=len(df.dropna(subset=["log_return"])),
        output_path=output_dir / "significant_lags_heatmap.png",
    )

    print()
    print("=" * 70)
    print("ACF/PACF 分析完成")
    print(f"图表输出目录: {output_dir.resolve()}")
    print("=" * 70)

    return results
|
||||
# ============================================================
|
||||
# 独立运行入口
|
||||
# ============================================================
|
||||
|
||||
if __name__ == "__main__":
|
||||
from data_loader import load_daily
|
||||
from preprocessing import add_derived_features
|
||||
|
||||
# 加载并预处理数据
|
||||
print("加载日线数据...")
|
||||
df = load_daily()
|
||||
print(f"原始数据: {len(df)} 行")
|
||||
|
||||
print("添加衍生特征...")
|
||||
df = add_derived_features(df)
|
||||
print(f"预处理后: {len(df)} 行, 列={list(df.columns)}")
|
||||
print()
|
||||
|
||||
# 执行 ACF/PACF 分析
|
||||
results = run_acf_analysis(df, output_dir="output/acf")
|
||||
|
||||
# 打印结果概要
|
||||
print()
|
||||
print("返回结果键:")
|
||||
for k, v in results.items():
|
||||
if isinstance(v, dict):
|
||||
print(f" results['{k}']: {list(v.keys())}")
|
||||
else:
|
||||
print(f" results['{k}']: {type(v).__name__}")
|
||||
774
src/anomaly.py
Normal file
@@ -0,0 +1,774 @@
"""Anomaly detection and precursor-pattern extraction module

Contents:
- Ensemble anomaly detection (Isolation Forest + LOF + COPOD, flagged when >= 2/3 agree)
- GARCH conditional-volatility anomalies (standardized residual > 3)
- Precursor-pattern extraction (Random Forest classifier)
- Event alignment (Bitcoin halvings and other major events)
- Visualization: anomaly-annotated price chart, feature-distribution comparison,
  ROC curve, feature importances
"""

import matplotlib
matplotlib.use('Agg')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
from pathlib import Path
from typing import Optional, Dict, List, Tuple

from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.metrics import roc_auc_score, roc_curve

try:
    from pyod.models.copod import COPOD
    HAS_COPOD = True
except ImportError:
    HAS_COPOD = False
    print("[警告] pyod 未安装,COPOD 检测将跳过,使用 2/2 一致判定")


# ============================================================
# 1. Detection features
# ============================================================

# Feature columns used for anomaly detection
DETECTION_FEATURES = [
    'log_return',
    'abs_return',
    'volume_ratio',
    'range_pct',
    'taker_buy_ratio',
    'vol_7d',
]

# Bitcoin halvings and other major event dates
KNOWN_EVENTS = {
    '2012-11-28': '第一次减半',
    '2016-07-09': '第二次减半',
    '2020-05-11': '第三次减半',
    '2024-04-20': '第四次减半',
    '2017-12-17': '2017年牛市顶点',
    '2018-12-15': '2018年熊市底部',
    '2020-03-12': '新冠黑色星期四',
    '2021-04-14': '2021年牛市中期高点',
    '2021-11-10': '2021年牛市顶点',
    '2022-06-18': 'Luna/3AC 暴跌',
    '2022-11-09': 'FTX 崩盘',
    '2024-01-11': 'BTC ETF 获批',
}
# ============================================================
# 2. Ensemble anomaly detection
# ============================================================

def prepare_features(df: pd.DataFrame) -> Tuple[pd.DataFrame, np.ndarray]:
    """
    Prepare the feature matrix for anomaly detection.

    Parameters
    ----------
    df : pd.DataFrame
        Daily data with derived features.

    Returns
    -------
    features_df : pd.DataFrame
        Feature subset (NaN rows dropped).
    X_scaled : np.ndarray
        Standardized feature matrix.
    """
    # Select the available features
    available = [f for f in DETECTION_FEATURES if f in df.columns]
    if len(available) < 3:
        raise ValueError(f"可用特征不足: {available},至少需要 3 个")

    features_df = df[available].dropna()

    # Standardize
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(features_df.values)

    return features_df, X_scaled
def detect_isolation_forest(X: np.ndarray, contamination: float = 0.05) -> np.ndarray:
    """Isolation Forest anomaly detection."""
    model = IsolationForest(
        n_estimators=200,
        contamination=contamination,
        random_state=42,
        n_jobs=-1,
    )
    # -1 = anomaly, 1 = normal
    labels = model.fit_predict(X)
    return (labels == -1).astype(int)


def detect_lof(X: np.ndarray, contamination: float = 0.05) -> np.ndarray:
    """Local Outlier Factor anomaly detection."""
    model = LocalOutlierFactor(
        n_neighbors=20,
        contamination=contamination,
        novelty=False,
        n_jobs=-1,
    )
    labels = model.fit_predict(X)
    return (labels == -1).astype(int)


def detect_copod(X: np.ndarray, contamination: float = 0.05) -> Optional[np.ndarray]:
    """COPOD anomaly detection (copula-based)."""
    if not HAS_COPOD:
        return None

    model = COPOD(contamination=contamination)
    # pyod detectors expose binary labels via labels_ after fit()
    # (0 = normal, 1 = anomaly), avoiding the deprecated fit_predict()
    model.fit(X)
    return model.labels_.astype(int)
def ensemble_anomaly_detection(
    df: pd.DataFrame,
    contamination: float = 0.05,
    min_agreement: int = 2,
) -> pd.DataFrame:
    """
    Ensemble anomaly detection: a day is flagged only when at least
    min_agreement of the n_methods detectors agree.

    Parameters
    ----------
    df : pd.DataFrame
        Daily data with derived features.
    contamination : float
        Expected anomaly fraction.
    min_agreement : int
        Minimum number of methods that must agree for an anomaly flag.

    Returns
    -------
    pd.DataFrame
        Input features plus per-method labels and the ensemble result.
    """
    features_df, X_scaled = prepare_features(df)

    print(f"  特征矩阵: {X_scaled.shape[0]} 样本 x {X_scaled.shape[1]} 特征")

    # Run each detector
    print("  [1/3] Isolation Forest...")
    if_labels = detect_isolation_forest(X_scaled, contamination)

    print("  [2/3] Local Outlier Factor...")
    lof_labels = detect_lof(X_scaled, contamination)

    n_methods = 2
    vote_matrix = np.column_stack([if_labels, lof_labels])
    method_names = ['iforest', 'lof']

    print("  [3/3] COPOD...")
    copod_labels = detect_copod(X_scaled, contamination)
    if copod_labels is not None:
        vote_matrix = np.column_stack([vote_matrix, copod_labels])
        method_names.append('copod')
        n_methods = 3
    else:
        print("    COPOD 不可用,使用 2 方法集成")

    # Vote
    vote_sum = vote_matrix.sum(axis=1)
    ensemble_label = (vote_sum >= min_agreement).astype(int)

    # Build the result DataFrame
    result = features_df.copy()
    for i, name in enumerate(method_names):
        result[f'anomaly_{name}'] = vote_matrix[:, i]
    result['anomaly_votes'] = vote_sum
    result['anomaly_ensemble'] = ensemble_label

    # Per-method statistics
    print("\n  异常检测统计:")
    for name in method_names:
        n_anom = result[f'anomaly_{name}'].sum()
        print(f"    {name:>12}: {n_anom} 个异常 ({n_anom / len(result) * 100:.2f}%)")
    n_ensemble = ensemble_label.sum()
    print(f"    {'集成(≥' + str(min_agreement) + ')':>12}: {n_ensemble} 个异常 ({n_ensemble / len(result) * 100:.2f}%)")

    # Pairwise overlap between methods
    print("\n  方法间重叠:")
    for i in range(len(method_names)):
        for j in range(i + 1, len(method_names)):
            overlap = ((vote_matrix[:, i] == 1) & (vote_matrix[:, j] == 1)).sum()
            n_i = vote_matrix[:, i].sum()
            n_j = vote_matrix[:, j].sum()
            if min(n_i, n_j) > 0:
                jaccard = overlap / ((vote_matrix[:, i] == 1) | (vote_matrix[:, j] == 1)).sum()
            else:
                jaccard = 0.0
            print(f"    {method_names[i]} ∩ {method_names[j]}: "
                  f"{overlap} 个 (Jaccard={jaccard:.3f})")

    return result
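def _demo_voting_rule() -> None:
    """Standalone sketch of the voting rule used by ensemble_anomaly_detection
    (illustrative only, not called anywhere in the pipeline).

    Three detectors vote on five samples; a sample is an ensemble anomaly
    when at least two detectors flag it.
    """
    votes = np.array([
        [1, 1, 0],   # iforest + lof   -> anomaly
        [1, 0, 0],   # iforest only    -> normal
        [0, 0, 0],   # no detector     -> normal
        [1, 1, 1],   # all three       -> anomaly
        [0, 1, 1],   # lof + copod     -> anomaly
    ])
    ensemble = (votes.sum(axis=1) >= 2).astype(int)
    assert ensemble.tolist() == [1, 0, 0, 1, 1]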
# ============================================================
# 3. GARCH conditional-volatility anomalies
# ============================================================

def garch_anomaly_detection(
    df: pd.DataFrame,
    threshold: float = 3.0,
) -> pd.Series:
    """
    GARCH(1,1)-based conditional-volatility anomaly detection.

    Days with standardized residual |eps_t / sigma_t| > threshold are flagged
    as anomalies.

    Parameters
    ----------
    df : pd.DataFrame
        Data containing a log_return column.
    threshold : float
        Standardized-residual threshold.

    Returns
    -------
    pd.Series
        Anomaly flags (1 = anomaly, 0 = normal), aligned with the input index.
    """
    from arch import arch_model

    returns = df['log_return'].dropna()
    r_pct = returns * 100  # the arch library expects percentage-scale returns

    # Fit GARCH(1,1)
    model = arch_model(r_pct, vol='Garch', p=1, q=1, mean='Constant', dist='Normal')
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        result = model.fit(disp='off')

    # Standardized residuals
    std_resid = result.resid / result.conditional_volatility
    anomaly = (std_resid.abs() > threshold).astype(int)

    n_anom = anomaly.sum()
    print(f"  GARCH 异常: {n_anom} 个 (|标准化残差| > {threshold})")
    print(f"  GARCH 模型: α={result.params.get('alpha[1]', np.nan):.4f}, "
          f"β={result.params.get('beta[1]', np.nan):.4f}, "
          f"持续性={result.params.get('alpha[1]', 0) + result.params.get('beta[1]', 0):.4f}")

    return anomaly
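def _demo_standardized_residual_rate() -> None:
    """Standalone sketch (illustrative only): expected flag rate under the null.

    If the GARCH model is correct and errors are normal, |eps/sigma| > 3 should
    fire on roughly 0.27% of days; a much higher observed rate points to
    residual fat tails.
    """
    rng = np.random.default_rng(0)
    sigma = np.full(100_000, 2.0)    # constant volatility, percent scale
    eps = rng.normal(0.0, sigma)     # residuals consistent with sigma
    rate = (np.abs(eps / sigma) > 3.0).mean()
    print(f"flag rate under the null: {rate:.4%}")   # ~0.27%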
# ============================================================
# 4. Precursor-pattern extraction
# ============================================================

def extract_precursor_features(
    df: pd.DataFrame,
    anomaly_labels: pd.Series,
    lookback_windows: Optional[List[int]] = None,
) -> Tuple[pd.DataFrame, pd.Series]:
    """
    Extract rolling-window features around anomaly days as precursor signals.

    Parameters
    ----------
    df : pd.DataFrame
        Data with derived features.
    anomaly_labels : pd.Series
        Anomaly flags (1 = anomaly).
    lookback_windows : list of int, optional
        Lookback windows, in days.

    Returns
    -------
    X : pd.DataFrame
        Precursor feature matrix.
    y : pd.Series
        Labels (1 = anomaly occurs, 0 = normal).
    """
    if lookback_windows is None:
        lookback_windows = [5, 10, 20]

    # Align the two inputs on a common index
    common_idx = df.index.intersection(anomaly_labels.index)
    df_aligned = df.loc[common_idx]
    labels_aligned = anomaly_labels.loc[common_idx]

    base_features = [f for f in DETECTION_FEATURES if f in df.columns]
    precursor_features = {}

    for window in lookback_windows:
        for feat in base_features:
            if feat not in df_aligned.columns:
                continue
            series = df_aligned[feat]

            # Rolling statistics as precursor features
            precursor_features[f'{feat}_mean_{window}d'] = series.rolling(window).mean()
            precursor_features[f'{feat}_std_{window}d'] = series.rolling(window).std()
            precursor_features[f'{feat}_max_{window}d'] = series.rolling(window).max()
            precursor_features[f'{feat}_min_{window}d'] = series.rolling(window).min()

            # Trend feature: deviation of the latest value from the window mean
            rolling_mean = series.rolling(window).mean()
            precursor_features[f'{feat}_deviation_{window}d'] = series - rolling_mean

    X = pd.DataFrame(precursor_features, index=df_aligned.index)

    # Label: whether the current day is anomalous. Note that the rolling
    # windows end on the labeled day itself, so these are same-day context
    # features rather than strictly look-ahead-free precursors.
    y = labels_aligned

    # Drop rows with NaN
    valid_mask = X.notna().all(axis=1) & y.notna()
    X = X[valid_mask]
    y = y[valid_mask]

    return X, y
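def _demo_rolling_window_alignment() -> None:
    """Standalone sketch (illustrative only): pandas rolling windows end on the
    current row, so the 3-day mean at index 2 covers rows 0..2. A strictly
    look-ahead-free precursor would need an additional shift(1).
    """
    s = pd.Series([1.0, 2.0, 3.0, 4.0])
    same_day = s.rolling(3).mean()            # includes the current day
    lagged = s.rolling(3).mean().shift(1)     # ends the day before
    print(same_day.tolist())   # [nan, nan, 2.0, 3.0]
    print(lagged.tolist())     # [nan, nan, nan, 2.0]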
def train_precursor_classifier(
    X: pd.DataFrame,
    y: pd.Series,
) -> Dict:
    """
    Train the precursor-pattern classifier (Random Forest).

    Evaluated with stratified K-fold cross-validation.

    Parameters
    ----------
    X : pd.DataFrame
        Precursor feature matrix.
    y : pd.Series
        Labels.

    Returns
    -------
    dict
        AUC, feature importances, and related results.
    """
    if len(X) < 50 or y.sum() < 10:
        print(f"  [警告] 样本不足 (n={len(X)}, 正例={y.sum()}),跳过分类器训练")
        return {}

    # Standardize
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Stratified K-fold
    n_splits = min(5, int(y.sum()))
    if n_splits < 2:
        print("  [警告] 正例数过少,无法进行交叉验证")
        return {}

    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

    clf = RandomForestClassifier(
        n_estimators=200,
        max_depth=10,
        min_samples_split=5,
        class_weight='balanced',
        random_state=42,
        n_jobs=-1,
    )

    # Cross-validated predicted probabilities
    try:
        y_prob = cross_val_predict(clf, X_scaled, y, cv=cv, method='predict_proba')[:, 1]
        auc = roc_auc_score(y, y_prob)
    except Exception as e:
        print(f"  [错误] 交叉验证失败: {e}")
        return {}

    # Fit on the full data to obtain feature importances
    clf.fit(X_scaled, y)
    importances = pd.Series(clf.feature_importances_, index=X.columns)
    importances = importances.sort_values(ascending=False)

    # ROC curve data
    fpr, tpr, thresholds = roc_curve(y, y_prob)

    results = {
        'auc': auc,
        'feature_importances': importances,
        'y_true': y,
        'y_prob': y_prob,
        'fpr': fpr,
        'tpr': tpr,
    }

    print("\n  前兆分类器结果:")
    print(f"    AUC: {auc:.4f}")
    print(f"    样本: {len(y)} (异常: {y.sum()}, 正常: {(y == 0).sum()})")
    print("    Top-10 重要特征:")
    for feat, imp in importances.head(10).items():
        print(f"      {feat:<40} {imp:.4f}")

    return results
# ============================================================
# 5. Event alignment
# ============================================================

def align_with_events(
    anomaly_dates: pd.DatetimeIndex,
    tolerance_days: int = 5,
) -> pd.DataFrame:
    """
    Align anomaly dates with the known-event calendar.

    Parameters
    ----------
    anomaly_dates : pd.DatetimeIndex
        Dates flagged as anomalies.
    tolerance_days : int
        Tolerance in days (an anomaly within tolerance_days of an event date
        counts as a match).

    Returns
    -------
    pd.DataFrame
        Matched pairs.
    """
    matches = []

    for event_date_str, event_name in KNOWN_EVENTS.items():
        event_date = pd.Timestamp(event_date_str)

        for anom_date in anomaly_dates:
            diff_days = abs((anom_date - event_date).days)
            if diff_days <= tolerance_days:
                matches.append({
                    'anomaly_date': anom_date,
                    'event_date': event_date,
                    'event_name': event_name,
                    'diff_days': diff_days,
                })

    if matches:
        result = pd.DataFrame(matches)
        print(f"\n  事件对齐 (容差 {tolerance_days} 天):")
        for _, row in result.iterrows():
            print(f"    异常 {row['anomaly_date'].strftime('%Y-%m-%d')} ↔ "
                  f"{row['event_name']} ({row['event_date'].strftime('%Y-%m-%d')}, "
                  f"差 {row['diff_days']} 天)")
        return result
    else:
        print(f"  [信息] 无异常日期与已知事件匹配 (容差 {tolerance_days} 天)")
        return pd.DataFrame()
# ============================================================
# 6. Visualization
# ============================================================

def plot_price_with_anomalies(
    df: pd.DataFrame,
    anomaly_result: pd.DataFrame,
    garch_anomaly: Optional[pd.Series],
    output_dir: Path,
):
    """Plot the price series with anomaly markers."""
    fig, axes = plt.subplots(2, 1, figsize=(16, 10), gridspec_kw={'height_ratios': [3, 1]})

    # Top panel: price with anomaly markers
    ax1 = axes[0]
    ax1.plot(df.index, df['close'], linewidth=0.6, color='steelblue', alpha=0.8, label='BTC 收盘价')

    # Ensemble anomalies
    ensemble_anom = anomaly_result[anomaly_result['anomaly_ensemble'] == 1]
    if not ensemble_anom.empty:
        # Closing prices on anomaly dates
        anom_prices = df.loc[df.index.isin(ensemble_anom.index), 'close']
        ax1.scatter(anom_prices.index, anom_prices.values,
                    color='red', s=30, zorder=5, label=f'集成异常 (n={len(anom_prices)})',
                    alpha=0.7, edgecolors='darkred', linewidths=0.5)

    # GARCH anomalies
    if garch_anomaly is not None:
        garch_anom_dates = garch_anomaly[garch_anomaly == 1].index
        garch_prices = df.loc[df.index.isin(garch_anom_dates), 'close']
        if not garch_prices.empty:
            ax1.scatter(garch_prices.index, garch_prices.values,
                        color='orange', s=20, zorder=4, marker='^',
                        label=f'GARCH 异常 (n={len(garch_prices)})',
                        alpha=0.7, edgecolors='darkorange', linewidths=0.5)

    ax1.set_ylabel('价格 (USDT)', fontsize=12)
    ax1.set_title('BTC 价格与异常检测结果', fontsize=14)
    ax1.legend(fontsize=10, loc='upper left')
    ax1.grid(True, alpha=0.3)
    ax1.set_yscale('log')

    # Bottom panel: volume with anomaly markers
    ax2 = axes[1]
    if 'volume' in df.columns:
        ax2.bar(df.index, df['volume'], width=1, color='steelblue', alpha=0.4, label='成交量')
        if not ensemble_anom.empty:
            anom_vol = df.loc[df.index.isin(ensemble_anom.index), 'volume']
            ax2.bar(anom_vol.index, anom_vol.values, width=1, color='red', alpha=0.7, label='异常日成交量')
    ax2.set_ylabel('成交量', fontsize=12)
    ax2.set_xlabel('日期', fontsize=12)
    ax2.legend(fontsize=10)
    ax2.grid(True, alpha=0.3)

    fig.tight_layout()
    fig.savefig(output_dir / 'anomaly_price_chart.png', dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"  [保存] {output_dir / 'anomaly_price_chart.png'}")
def plot_anomaly_feature_distributions(
    anomaly_result: pd.DataFrame,
    output_dir: Path,
):
    """Compare feature distributions on anomaly days vs. normal days."""
    features_to_plot = [f for f in DETECTION_FEATURES if f in anomaly_result.columns]
    n_feats = len(features_to_plot)
    if n_feats == 0:
        print("  [警告] 无可绘制特征")
        return

    n_cols = 3
    n_rows = (n_feats + n_cols - 1) // n_cols

    fig, axes = plt.subplots(n_rows, n_cols, figsize=(5 * n_cols, 4 * n_rows))
    axes = np.array(axes).flatten()

    normal = anomaly_result[anomaly_result['anomaly_ensemble'] == 0]
    anomaly = anomaly_result[anomaly_result['anomaly_ensemble'] == 1]

    for idx, feat in enumerate(features_to_plot):
        ax = axes[idx]

        # Distributions for normal vs. anomaly days
        vals_normal = normal[feat].dropna()
        vals_anomaly = anomaly[feat].dropna()

        ax.hist(vals_normal, bins=50, density=True, alpha=0.6,
                color='steelblue', label=f'正常 (n={len(vals_normal)})', edgecolor='white', linewidth=0.3)
        if len(vals_anomaly) > 0:
            ax.hist(vals_anomaly, bins=30, density=True, alpha=0.6,
                    color='red', label=f'异常 (n={len(vals_anomaly)})', edgecolor='white', linewidth=0.3)

        ax.set_title(feat, fontsize=11)
        ax.legend(fontsize=8)
        ax.grid(True, alpha=0.3)

    # Hide unused subplots
    for idx in range(n_feats, len(axes)):
        axes[idx].set_visible(False)

    fig.suptitle('异常日 vs 正常日 特征分布对比', fontsize=14, y=1.02)
    fig.tight_layout()
    fig.savefig(output_dir / 'anomaly_feature_distributions.png', dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"  [保存] {output_dir / 'anomaly_feature_distributions.png'}")
def plot_precursor_roc(precursor_results: Dict, output_dir: Path):
    """Plot the ROC curve of the precursor classifier."""
    if not precursor_results or 'fpr' not in precursor_results:
        print("  [警告] 无前兆分类器结果,跳过 ROC 曲线")
        return

    fig, ax = plt.subplots(figsize=(8, 8))

    fpr = precursor_results['fpr']
    tpr = precursor_results['tpr']
    auc = precursor_results['auc']

    ax.plot(fpr, tpr, color='steelblue', linewidth=2,
            label=f'Random Forest (AUC = {auc:.4f})')
    ax.plot([0, 1], [0, 1], 'k--', linewidth=1, label='随机基线')

    ax.set_xlabel('假阳性率 (FPR)', fontsize=12)
    ax.set_ylabel('真阳性率 (TPR)', fontsize=12)
    ax.set_title('异常前兆分类器 ROC 曲线', fontsize=14)
    ax.legend(fontsize=11)
    ax.grid(True, alpha=0.3)
    ax.set_xlim([-0.02, 1.02])
    ax.set_ylim([-0.02, 1.02])

    fig.savefig(output_dir / 'precursor_roc_curve.png', dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"  [保存] {output_dir / 'precursor_roc_curve.png'}")
def plot_feature_importance(precursor_results: Dict, output_dir: Path, top_n: int = 20):
    """Plot a horizontal bar chart of precursor feature importances."""
    if not precursor_results or 'feature_importances' not in precursor_results:
        print("  [警告] 无特征重要性数据,跳过")
        return

    importances = precursor_results['feature_importances'].head(top_n)

    fig, ax = plt.subplots(figsize=(10, max(6, top_n * 0.35)))

    colors = plt.cm.RdYlBu_r(np.linspace(0.2, 0.8, len(importances)))
    ax.barh(range(len(importances)), importances.values[::-1],
            color=colors[::-1], edgecolor='white', linewidth=0.5)
    ax.set_yticks(range(len(importances)))
    ax.set_yticklabels(importances.index[::-1], fontsize=9)
    ax.set_xlabel('特征重要性', fontsize=12)
    ax.set_title(f'异常前兆 Top-{top_n} 特征重要性 (Random Forest)', fontsize=13)
    ax.grid(True, alpha=0.3, axis='x')

    fig.savefig(output_dir / 'precursor_feature_importance.png', dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"  [保存] {output_dir / 'precursor_feature_importance.png'}")
# ============================================================
# 7. Result reporting
# ============================================================

def print_anomaly_summary(
    anomaly_result: pd.DataFrame,
    garch_anomaly: Optional[pd.Series],
    precursor_results: Dict,
):
    """Print a summary of the anomaly-detection results."""
    print("\n" + "=" * 70)
    print("异常检测结果汇总")
    print("=" * 70)

    # Ensemble statistics
    n_total = len(anomaly_result)
    n_ensemble = anomaly_result['anomaly_ensemble'].sum()
    print(f"\n  总样本数: {n_total}")
    print(f"  集成异常数: {n_ensemble} ({n_ensemble / n_total * 100:.2f}%)")

    # Per-method statistics
    method_cols = [c for c in anomaly_result.columns
                   if c.startswith('anomaly_') and c not in ('anomaly_ensemble', 'anomaly_votes')]
    for col in method_cols:
        method_name = col.replace('anomaly_', '')
        n_anom = anomaly_result[col].sum()
        print(f"  {method_name:>12}: {n_anom} ({n_anom / n_total * 100:.2f}%)")

    # GARCH anomalies
    if garch_anomaly is not None:
        n_garch = garch_anomaly.sum()
        print(f"  {'GARCH':>12}: {n_garch} ({n_garch / len(garch_anomaly) * 100:.2f}%)")

        # Overlap between ensemble and GARCH anomalies
        common_idx = anomaly_result.index.intersection(garch_anomaly.index)
        if len(common_idx) > 0:
            ensemble_set = set(anomaly_result.loc[common_idx][anomaly_result.loc[common_idx, 'anomaly_ensemble'] == 1].index)
            garch_set = set(garch_anomaly[garch_anomaly == 1].index)
            overlap = len(ensemble_set & garch_set)
            print(f"\n  集成 ∩ GARCH 重叠: {overlap} 个")

    # Precursor classifier
    if precursor_results and 'auc' in precursor_results:
        print(f"\n  前兆分类器 AUC: {precursor_results['auc']:.4f}")
        print("  Top-5 前兆特征:")
        for feat, imp in precursor_results['feature_importances'].head(5).items():
            print(f"    {feat:<40} {imp:.4f}")
# ============================================================
# 8. Main entry point
# ============================================================

def run_anomaly_analysis(
    df: pd.DataFrame,
    output_dir: str = "output/anomaly",
) -> Dict:
    """
    Main entry point for anomaly detection and precursor-pattern analysis.

    Parameters
    ----------
    df : pd.DataFrame
        Daily data (with derived features from add_derived_features).
    output_dir : str
        Directory for chart output.

    Returns
    -------
    dict
        Dictionary with all analysis results.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    print("=" * 70)
    print("BTC 异常检测与前兆模式分析")
    print("=" * 70)
    print(f"数据范围: {df.index.min()} ~ {df.index.max()}")
    print(f"样本数量: {len(df)}")

    # Configure fonts with CJK support
    plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei', 'DejaVu Sans']
    plt.rcParams['axes.unicode_minus'] = False

    # --- Ensemble anomaly detection ---
    print("\n>>> [1/5] 执行集成异常检测...")
    anomaly_result = ensemble_anomaly_detection(df, contamination=0.05, min_agreement=2)

    # --- GARCH conditional-volatility anomalies ---
    print("\n>>> [2/5] 执行 GARCH 条件波动率异常检测...")
    garch_anomaly = None
    try:
        garch_anomaly = garch_anomaly_detection(df, threshold=3.0)
    except Exception as e:
        print(f"  [错误] GARCH 异常检测失败: {e}")

    # --- Event alignment ---
    print("\n>>> [3/5] 执行事件对齐分析...")
    ensemble_anom_dates = anomaly_result[anomaly_result['anomaly_ensemble'] == 1].index
    event_alignment = align_with_events(ensemble_anom_dates, tolerance_days=5)

    # --- Precursor-pattern extraction ---
    print("\n>>> [4/5] 提取前兆模式并训练分类器...")
    precursor_results = {}
    try:
        X_precursor, y_precursor = extract_precursor_features(
            df, anomaly_result['anomaly_ensemble'], lookback_windows=[5, 10, 20]
        )
        print(f"  前兆特征矩阵: {X_precursor.shape[0]} 样本 x {X_precursor.shape[1]} 特征")
        precursor_results = train_precursor_classifier(X_precursor, y_precursor)
    except Exception as e:
        print(f"  [错误] 前兆模式提取失败: {e}")

    # --- Visualization ---
    print("\n>>> [5/5] 生成可视化图表...")
    plot_price_with_anomalies(df, anomaly_result, garch_anomaly, output_dir)
    plot_anomaly_feature_distributions(anomaly_result, output_dir)
    plot_precursor_roc(precursor_results, output_dir)
    plot_feature_importance(precursor_results, output_dir)

    # --- Summary ---
    print_anomaly_summary(anomaly_result, garch_anomaly, precursor_results)

    print("\n" + "=" * 70)
    print("异常检测与前兆模式分析完成!")
    print(f"图表已保存至: {output_dir.resolve()}")
    print("=" * 70)

    return {
        'anomaly_result': anomaly_result,
        'garch_anomaly': garch_anomaly,
        'event_alignment': event_alignment,
        'precursor_results': precursor_results,
    }
# ============================================================
# Standalone entry point
# ============================================================

if __name__ == '__main__':
    from src.data_loader import load_daily
    from src.preprocessing import add_derived_features

    df = load_daily()
    df = add_derived_features(df)
    run_anomaly_analysis(df)
565
src/calendar_analysis.py
Normal file
@@ -0,0 +1,565 @@
"""Calendar-effects analysis module: day-of-week, month, hour, quarter, and month-boundary effects."""

import matplotlib
matplotlib.use('Agg')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import seaborn as sns
from pathlib import Path
from itertools import combinations
from scipy import stats

# Font configuration with CJK support
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei', 'DejaVu Sans']
plt.rcParams['axes.unicode_minus'] = False

# Weekday display names (Chinese and English)
WEEKDAY_NAMES_CN = {0: '周一', 1: '周二', 2: '周三', 3: '周四',
                    4: '周五', 5: '周六', 6: '周日'}
WEEKDAY_NAMES_EN = {0: 'Mon', 1: 'Tue', 2: 'Wed', 3: 'Thu',
                    4: 'Fri', 5: 'Sat', 6: 'Sun'}

# Month display names
MONTH_NAMES_CN = {1: '1月', 2: '2月', 3: '3月', 4: '4月',
                  5: '5月', 6: '6月', 7: '7月', 8: '8月',
                  9: '9月', 10: '10月', 11: '11月', 12: '12月'}
MONTH_NAMES_EN = {1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr',
                  5: 'May', 6: 'Jun', 7: 'Jul', 8: 'Aug',
                  9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'}
def _bonferroni_pairwise_mannwhitney(groups: dict, alpha: float = 0.05):
    """
    Run pairwise Mann-Whitney U tests across groups with Bonferroni correction.

    Parameters
    ----------
    groups : dict
        {group label: return series}
    alpha : float
        Significance level (before correction).

    Returns
    -------
    list[dict]
        One result record per tested pair.
    """
    keys = sorted(groups.keys())
    pairs = list(combinations(keys, 2))
    n_tests = len(pairs)
    corrected_alpha = alpha / n_tests if n_tests > 0 else alpha

    results = []
    for k1, k2 in pairs:
        g1, g2 = groups[k1].dropna(), groups[k2].dropna()
        if len(g1) < 3 or len(g2) < 3:
            continue
        stat, pval = stats.mannwhitneyu(g1, g2, alternative='two-sided')
        results.append({
            'group1': k1,
            'group2': k2,
            'U_stat': stat,
            'p_value': pval,
            'p_corrected': min(pval * n_tests, 1.0),  # Bonferroni correction
            'significant': pval * n_tests < alpha,
            'corrected_alpha': corrected_alpha,
        })
    return results
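def _demo_bonferroni_threshold() -> None:
    """Standalone sketch (illustrative only) of the correction arithmetic above.

    With 21 pairwise weekday tests at alpha = 0.05, a raw p-value must fall
    below 0.05 / 21 ≈ 0.00238 to survive; equivalently, the corrected p-value
    min(p * 21, 1.0) is compared against 0.05.
    """
    n_tests, alpha = 21, 0.05
    print(f"per-test threshold: {alpha / n_tests:.5f}")   # 0.00238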
def _kruskal_wallis_test(groups: dict):
    """
    Kruskal-Wallis H test (non-parametric one-way test).

    Parameters
    ----------
    groups : dict
        {group label: return series}

    Returns
    -------
    dict
        H statistic, p-value, and number of groups tested.
    """
    valid_groups = [g.dropna().values for g in groups.values() if len(g.dropna()) >= 3]
    if len(valid_groups) < 2:
        return {'H_stat': np.nan, 'p_value': np.nan, 'n_groups': len(valid_groups)}

    h_stat, p_val = stats.kruskal(*valid_groups)
    return {'H_stat': h_stat, 'p_value': p_val, 'n_groups': len(valid_groups)}
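def _demo_kruskal() -> None:
    """Standalone sketch (illustrative only): Kruskal-Wallis on synthetic groups.

    Two groups share a location and a third is shifted, so the test should
    reject the null of identical distributions.
    """
    rng = np.random.default_rng(0)
    a = pd.Series(rng.normal(0.0, 1.0, 100))
    b = pd.Series(rng.normal(0.0, 1.0, 100))
    c = pd.Series(rng.normal(0.8, 1.0, 100))   # shifted group
    res = _kruskal_wallis_test({'a': a, 'b': b, 'c': c})
    print(f"H={res['H_stat']:.2f}, p={res['p_value']:.6f}")   # small p expected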
# --------------------------------------------------------------------------
# 1. Day-of-week effect
# --------------------------------------------------------------------------
def analyze_day_of_week(df: pd.DataFrame, output_dir: Path):
    """
    Analyze the day-of-week effect in daily returns.

    Parameters
    ----------
    df : pd.DataFrame
        Daily data (requires a log_return column and a DatetimeIndex).
    output_dir : Path
        Directory for saved figures.
    """
    print("\n" + "=" * 70)
    print("【星期效应分析】Day-of-Week Effect")
    print("=" * 70)

    df = df.dropna(subset=['log_return']).copy()
    df['weekday'] = df.index.dayofweek  # 0 = Monday, 6 = Sunday

    # --- Descriptive statistics ---
    groups = {wd: df.loc[df['weekday'] == wd, 'log_return'] for wd in range(7)}

    print("\n--- 各星期对数收益率统计 ---")
    stats_rows = []
    for wd in range(7):
        g = groups[wd]
        row = {
            '星期': WEEKDAY_NAMES_CN[wd],
            '样本量': len(g),
            '均值': g.mean(),
            '中位数': g.median(),
            '标准差': g.std(),
            '偏度': g.skew(),
            '峰度': g.kurtosis(),
        }
        stats_rows.append(row)
    stats_df = pd.DataFrame(stats_rows)
    print(stats_df.to_string(index=False, float_format='{:.6f}'.format))

    # --- Kruskal-Wallis test ---
    kw_result = _kruskal_wallis_test(groups)
    print(f"\nKruskal-Wallis H 检验: H={kw_result['H_stat']:.4f}, "
          f"p={kw_result['p_value']:.6f}")
    if kw_result['p_value'] < 0.05:
        print("  => 在 5% 显著性水平下,各星期收益率存在显著差异")
    else:
        print("  => 在 5% 显著性水平下,各星期收益率无显著差异")

    # --- Pairwise Mann-Whitney U tests (Bonferroni-corrected) ---
    pairwise = _bonferroni_pairwise_mannwhitney(groups)
    sig_pairs = [p for p in pairwise if p['significant']]
    print(f"\nMann-Whitney U 两两检验 (Bonferroni 校正, {len(pairwise)} 对比较):")
    if sig_pairs:
        for p in sig_pairs:
            print(f"  {WEEKDAY_NAMES_CN[p['group1']]} vs {WEEKDAY_NAMES_CN[p['group2']]}: "
                  f"U={p['U_stat']:.1f}, p_raw={p['p_value']:.6f}, "
                  f"p_corrected={p['p_corrected']:.6f} *")
    else:
        print("  无显著差异的配对(校正后)")

    # --- Visualization: box plot and mean bar chart ---
    fig, axes = plt.subplots(1, 2, figsize=(14, 6))

    # Box plot
    box_data = [groups[wd].values for wd in range(7)]
    bp = axes[0].boxplot(box_data, labels=[WEEKDAY_NAMES_CN[i] for i in range(7)],
                         patch_artist=True, showfliers=False, showmeans=True,
                         meanprops=dict(marker='D', markerfacecolor='red', markersize=5))
    colors = plt.cm.Set3(np.linspace(0, 1, 7))
    for patch, color in zip(bp['boxes'], colors):
        patch.set_facecolor(color)
    axes[0].axhline(y=0, color='grey', linestyle='--', alpha=0.5)
    axes[0].set_title('BTC 日收益率 - 星期效应(箱线图)', fontsize=13)
    axes[0].set_ylabel('对数收益率')
    axes[0].set_xlabel('星期')

    # Mean bar chart
    means = [groups[wd].mean() for wd in range(7)]
    sems = [groups[wd].sem() for wd in range(7)]
    bar_colors = ['#2ecc71' if m > 0 else '#e74c3c' for m in means]
    axes[1].bar(range(7), means, yerr=sems, color=bar_colors,
                alpha=0.8, capsize=3, edgecolor='black', linewidth=0.5)
    axes[1].set_xticks(range(7))
    axes[1].set_xticklabels([WEEKDAY_NAMES_CN[i] for i in range(7)])
    axes[1].axhline(y=0, color='grey', linestyle='--', alpha=0.5)
    axes[1].set_title('BTC 日均收益率 - 星期效应(均值±SE)', fontsize=13)
    axes[1].set_ylabel('平均对数收益率')
    axes[1].set_xlabel('星期')

    plt.tight_layout()
    fig_path = output_dir / 'calendar_weekday_effect.png'
    fig.savefig(fig_path, dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"\n图表已保存: {fig_path}")
# --------------------------------------------------------------------------
# 2. Month-of-year effect
# --------------------------------------------------------------------------
def analyze_month_of_year(df: pd.DataFrame, output_dir: Path):
    """
    Analyze the month-of-year effect in daily returns and draw a year-by-month heatmap.

    Parameters
    ----------
    df : pd.DataFrame
        Daily data (requires a log_return column).
    output_dir : Path
        Directory for saved figures.
    """
    print("\n" + "=" * 70)
    print("【月份效应分析】Month-of-Year Effect")
    print("=" * 70)

    df = df.dropna(subset=['log_return']).copy()
    df['month'] = df.index.month
    df['year'] = df.index.year

    # --- Descriptive statistics ---
    groups = {m: df.loc[df['month'] == m, 'log_return'] for m in range(1, 13)}

    print("\n--- 各月份对数收益率统计 ---")
    stats_rows = []
    for m in range(1, 13):
        g = groups[m]
        row = {
            '月份': MONTH_NAMES_CN[m],
            '样本量': len(g),
            '均值': g.mean(),
            '中位数': g.median(),
            '标准差': g.std(),
        }
        stats_rows.append(row)
    stats_df = pd.DataFrame(stats_rows)
    print(stats_df.to_string(index=False, float_format='{:.6f}'.format))

    # --- Kruskal-Wallis test ---
    kw_result = _kruskal_wallis_test(groups)
    print(f"\nKruskal-Wallis H 检验: H={kw_result['H_stat']:.4f}, "
          f"p={kw_result['p_value']:.6f}")
    if kw_result['p_value'] < 0.05:
        print("  => 在 5% 显著性水平下,各月份收益率存在显著差异")
    else:
        print("  => 在 5% 显著性水平下,各月份收益率无显著差异")

    # --- Pairwise Mann-Whitney U tests (Bonferroni-corrected) ---
    pairwise = _bonferroni_pairwise_mannwhitney(groups)
    sig_pairs = [p for p in pairwise if p['significant']]
    print(f"\nMann-Whitney U 两两检验 (Bonferroni 校正, {len(pairwise)} 对比较):")
    if sig_pairs:
        for p in sig_pairs:
            print(f"  {MONTH_NAMES_CN[p['group1']]} vs {MONTH_NAMES_CN[p['group2']]}: "
                  f"U={p['U_stat']:.1f}, p_raw={p['p_value']:.6f}, "
                  f"p_corrected={p['p_corrected']:.6f} *")
    else:
        print("  无显著差异的配对(校正后)")

    # --- Visualization ---
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))

    # Mean bar chart
    means = [groups[m].mean() for m in range(1, 13)]
    sems = [groups[m].sem() for m in range(1, 13)]
    bar_colors = ['#2ecc71' if m > 0 else '#e74c3c' for m in means]
    axes[0].bar(range(1, 13), means, yerr=sems, color=bar_colors,
                alpha=0.8, capsize=3, edgecolor='black', linewidth=0.5)
    axes[0].set_xticks(range(1, 13))
    axes[0].set_xticklabels([MONTH_NAMES_EN[i] for i in range(1, 13)])
    axes[0].axhline(y=0, color='grey', linestyle='--', alpha=0.5)
    axes[0].set_title('BTC 月均收益率(均值±SE)', fontsize=13)
    axes[0].set_ylabel('平均对数收益率')
    axes[0].set_xlabel('月份')

    # Year-by-month heatmap of cumulative monthly log returns
    monthly_returns = df.groupby(['year', 'month'])['log_return'].sum().unstack(fill_value=np.nan)
    monthly_returns.columns = [MONTH_NAMES_EN[c] for c in monthly_returns.columns]
    sns.heatmap(monthly_returns, annot=True, fmt='.3f', cmap='RdYlGn', center=0,
                linewidths=0.5, ax=axes[1], cbar_kws={'label': '累计对数收益率'})
    axes[1].set_title('BTC 年×月 累计对数收益率热力图', fontsize=13)
    axes[1].set_ylabel('年份')
    axes[1].set_xlabel('月份')

    plt.tight_layout()
    fig_path = output_dir / 'calendar_month_effect.png'
    fig.savefig(fig_path, dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"\n图表已保存: {fig_path}")
# --------------------------------------------------------------------------
# 3. Hour-of-day effect (1h data)
# --------------------------------------------------------------------------
def analyze_hour_of_day(df_hourly: pd.DataFrame, output_dir: Path):
    """
    Analyze intraday (hour-of-day) effects in hourly returns and volume.

    Parameters
    ----------
    df_hourly : pd.DataFrame
        Hourly data (requires close and volume columns and a DatetimeIndex).
    output_dir : Path
        Directory for saved figures.
    """
    print("\n" + "=" * 70)
    print("【小时效应分析】Hour-of-Day Effect")
    print("=" * 70)

    df = df_hourly.copy()
    # Hourly log returns
    df['log_return'] = np.log(df['close'] / df['close'].shift(1))
    df = df.dropna(subset=['log_return'])
    df['hour'] = df.index.hour

    # --- Descriptive statistics ---
    groups_ret = {h: df.loc[df['hour'] == h, 'log_return'] for h in range(24)}
    groups_vol = {h: df.loc[df['hour'] == h, 'volume'] for h in range(24)}

    print("\n--- 各小时对数收益率与成交量统计 ---")
    stats_rows = []
    for h in range(24):
        gr = groups_ret[h]
        gv = groups_vol[h]
        row = {
            '小时(UTC)': f'{h:02d}:00',
            '样本量': len(gr),
            '收益率均值': gr.mean(),
            '收益率中位数': gr.median(),
            '收益率标准差': gr.std(),
            '成交量均值': gv.mean(),
        }
        stats_rows.append(row)
    stats_df = pd.DataFrame(stats_rows)
    print(stats_df.to_string(index=False, float_format='{:.6f}'.format))

    # --- Kruskal-Wallis test (returns) ---
    kw_ret = _kruskal_wallis_test(groups_ret)
    print(f"\n收益率 Kruskal-Wallis H 检验: H={kw_ret['H_stat']:.4f}, "
          f"p={kw_ret['p_value']:.6f}")
    if kw_ret['p_value'] < 0.05:
        print("  => 在 5% 显著性水平下,各小时收益率存在显著差异")
    else:
        print("  => 在 5% 显著性水平下,各小时收益率无显著差异")

    # --- Kruskal-Wallis test (volume) ---
    kw_vol = _kruskal_wallis_test(groups_vol)
    print(f"\n成交量 Kruskal-Wallis H 检验: H={kw_vol['H_stat']:.4f}, "
          f"p={kw_vol['p_value']:.6f}")
    if kw_vol['p_value'] < 0.05:
        print("  => 在 5% 显著性水平下,各小时成交量存在显著差异")
    else:
        print("  => 在 5% 显著性水平下,各小时成交量无显著差异")

    # --- Visualization ---
    fig, axes = plt.subplots(2, 1, figsize=(14, 10))

    hours = list(range(24))
    hour_labels = [f'{h:02d}' for h in hours]

    # Returns
    ret_means = [groups_ret[h].mean() for h in hours]
    ret_sems = [groups_ret[h].sem() for h in hours]
    bar_colors_ret = ['#2ecc71' if m > 0 else '#e74c3c' for m in ret_means]
    axes[0].bar(hours, ret_means, yerr=ret_sems, color=bar_colors_ret,
                alpha=0.8, capsize=2, edgecolor='black', linewidth=0.3)
    axes[0].set_xticks(hours)
    axes[0].set_xticklabels(hour_labels)
    axes[0].axhline(y=0, color='grey', linestyle='--', alpha=0.5)
    axes[0].set_title('BTC 小时均收益率 (UTC, 均值±SE)', fontsize=13)
    axes[0].set_ylabel('平均对数收益率')
    axes[0].set_xlabel('小时 (UTC)')

    # Volume
    vol_means = [groups_vol[h].mean() for h in hours]
    axes[1].bar(hours, vol_means, color='steelblue', alpha=0.8,
                edgecolor='black', linewidth=0.3)
    axes[1].set_xticks(hours)
    axes[1].set_xticklabels(hour_labels)
    axes[1].set_title('BTC 小时均成交量 (UTC)', fontsize=13)
    axes[1].set_ylabel('平均成交量 (BTC)')
    axes[1].set_xlabel('小时 (UTC)')
    axes[1].yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'{x:,.0f}'))

    plt.tight_layout()
    fig_path = output_dir / 'calendar_hour_effect.png'
    fig.savefig(fig_path, dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"\n图表已保存: {fig_path}")
# --------------------------------------------------------------------------
# 4. Quarter effect & turn-of-month effect
# --------------------------------------------------------------------------
def analyze_quarter_and_month_boundary(df: pd.DataFrame, output_dir: Path):
    """
    Analyze the quarter effect, plus return differences between the first
    five and last five days of each month.

    Parameters
    ----------
    df : pd.DataFrame
        Daily data (requires a log_return column).
    output_dir : Path
        Directory for saved figures.
    """
    print("\n" + "=" * 70)
    print("【季度效应 & 月初/月末效应分析】")
    print("=" * 70)

    df = df.dropna(subset=['log_return']).copy()
    df['quarter'] = df.index.quarter
    df['month'] = df.index.month
    df['day'] = df.index.day

    # ========== Quarter effect ==========
    groups_q = {q: df.loc[df['quarter'] == q, 'log_return'] for q in range(1, 5)}

    print("\n--- 各季度对数收益率统计 ---")
    quarter_names = {1: 'Q1', 2: 'Q2', 3: 'Q3', 4: 'Q4'}
    for q in range(1, 5):
        g = groups_q[q]
        print(f"  {quarter_names[q]}: 均值={g.mean():.6f}, 中位数={g.median():.6f}, "
              f"标准差={g.std():.6f}, 样本量={len(g)}")

    kw_q = _kruskal_wallis_test(groups_q)
    print(f"\n季度 Kruskal-Wallis H 检验: H={kw_q['H_stat']:.4f}, p={kw_q['p_value']:.6f}")
    if kw_q['p_value'] < 0.05:
        print("  => 在 5% 显著性水平下,各季度收益率存在显著差异")
    else:
        print("  => 在 5% 显著性水平下,各季度收益率无显著差异")

    # Pairwise quarter comparisons
    pairwise_q = _bonferroni_pairwise_mannwhitney(groups_q)
    sig_q = [p for p in pairwise_q if p['significant']]
    if sig_q:
        print(f"\n季度两两检验 (Bonferroni 校正, {len(pairwise_q)} 对):")
        for p in sig_q:
            print(f"  {quarter_names[p['group1']]} vs {quarter_names[p['group2']]}: "
                  f"U={p['U_stat']:.1f}, p_corrected={p['p_corrected']:.6f} *")

    # ========== Turn-of-month effect ==========
    # Identify the last five days of each month via the distance to month end
    from pandas.tseries.offsets import MonthEnd
    df['month_end'] = df.index + MonthEnd(0)   # last day of the current month
    df['days_to_end'] = (df['month_end'] - df.index).dt.days

    # First five days vs. last five days of the month
    mask_start = df['day'] <= 5
    mask_end = df['days_to_end'] < 5   # within five days of month end (i.e. the last 5 days)

    ret_start = df.loc[mask_start, 'log_return']
    ret_end = df.loc[mask_end, 'log_return']
    ret_mid = df.loc[~mask_start & ~mask_end, 'log_return']

    print("\n--- 月初 / 月中 / 月末 收益率统计 ---")
    for label, data in [('月初(前5日)', ret_start), ('月中', ret_mid), ('月末(后5日)', ret_end)]:
        print(f"  {label}: 均值={data.mean():.6f}, 中位数={data.median():.6f}, "
              f"标准差={data.std():.6f}, 样本量={len(data)}")

    # Mann-Whitney U test: month start vs. month end
    if len(ret_start) >= 3 and len(ret_end) >= 3:
        u_stat, p_val = stats.mannwhitneyu(ret_start, ret_end, alternative='two-sided')
        print(f"\n月初 vs 月末 Mann-Whitney U 检验: U={u_stat:.1f}, p={p_val:.6f}")
        if p_val < 0.05:
            print("  => 在 5% 显著性水平下,月初与月末收益率存在显著差异")
        else:
            print("  => 在 5% 显著性水平下,月初与月末收益率无显著差异")

    # --- Visualization ---
    fig, axes = plt.subplots(1, 2, figsize=(14, 6))

    # Quarter bar chart
    q_means = [groups_q[q].mean() for q in range(1, 5)]
    q_sems = [groups_q[q].sem() for q in range(1, 5)]
    q_colors = ['#2ecc71' if m > 0 else '#e74c3c' for m in q_means]
    axes[0].bar(range(1, 5), q_means, yerr=q_sems, color=q_colors,
                alpha=0.8, capsize=4, edgecolor='black', linewidth=0.5)
    axes[0].set_xticks(range(1, 5))
    axes[0].set_xticklabels(['Q1', 'Q2', 'Q3', 'Q4'])
    axes[0].axhline(y=0, color='grey', linestyle='--', alpha=0.5)
    axes[0].set_title('BTC 季度均收益率(均值±SE)', fontsize=13)
    axes[0].set_ylabel('平均对数收益率')
    axes[0].set_xlabel('季度')

    # Month start / middle / end bar chart
    boundary_means = [ret_start.mean(), ret_mid.mean(), ret_end.mean()]
    boundary_sems = [ret_start.sem(), ret_mid.sem(), ret_end.sem()]
    boundary_colors = ['#3498db', '#95a5a6', '#e67e22']
    axes[1].bar(range(3), boundary_means, yerr=boundary_sems, color=boundary_colors,
                alpha=0.8, capsize=4, edgecolor='black', linewidth=0.5)
    axes[1].set_xticks(range(3))
    axes[1].set_xticklabels(['月初(前5日)', '月中', '月末(后5日)'])
    axes[1].axhline(y=0, color='grey', linestyle='--', alpha=0.5)
    axes[1].set_title('BTC 月初/月中/月末 均收益率(均值±SE)', fontsize=13)
    axes[1].set_ylabel('平均对数收益率')

    plt.tight_layout()
    fig_path = output_dir / 'calendar_quarter_boundary_effect.png'
    fig.savefig(fig_path, dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"\n图表已保存: {fig_path}")

    # Drop temporary columns
    df.drop(columns=['month_end', 'days_to_end'], inplace=True, errors='ignore')
# --------------------------------------------------------------------------
# Main entry point
# --------------------------------------------------------------------------
def run_calendar_analysis(
    df: pd.DataFrame,
    df_hourly: pd.DataFrame = None,
    output_dir: str = 'output/calendar',
):
    """
    Main entry point for the calendar-effects analysis.

    Parameters
    ----------
    df : pd.DataFrame
        Daily data with derived features from add_derived_features (must
        include a log_return column).
    df_hourly : pd.DataFrame, optional
        Raw hourly data (close and volume columns). If None, the hour-of-day
        analysis is skipped.
    output_dir : str or Path
        Output directory.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    print("\n" + "#" * 70)
    print("# BTC 日历效应分析 (Calendar Effects Analysis)")
    print("#" * 70)

    # 1. Day-of-week effect
    analyze_day_of_week(df, output_dir)

    # 2. Month-of-year effect
    analyze_month_of_year(df, output_dir)

    # 3. Hour-of-day effect (if hourly data is available)
    if df_hourly is not None and len(df_hourly) > 0:
        analyze_hour_of_day(df_hourly, output_dir)
    else:
        print("\n[跳过] 小时效应分析:未提供小时数据 (df_hourly is None)")

    # 4. Quarter & turn-of-month effects
    analyze_quarter_and_month_boundary(df, output_dir)

    print("\n" + "#" * 70)
    print("# 日历效应分析完成")
    print("#" * 70)
# --------------------------------------------------------------------------
# Standalone entry point
# --------------------------------------------------------------------------
if __name__ == '__main__':
    from data_loader import load_daily, load_hourly
    from preprocessing import add_derived_features

    # Load the data
    df_daily = load_daily()
    df_daily = add_derived_features(df_daily)

    try:
        df_hourly = load_hourly()
    except Exception as e:
        print(f"[警告] 加载小时数据失败: {e}")
        df_hourly = None

    run_calendar_analysis(df_daily, df_hourly, output_dir='output/calendar')
615
src/causality.py
Normal file
@@ -0,0 +1,615 @@
"""Granger causality testing module

Contents:
- Bidirectional Granger causality tests (5 variable pairs, 5 lag orders each)
- Cross-timescale causality tests (hourly aggregate features -> daily returns)
- Bonferroni multiple-testing correction
- Visualization: p-value heatmap, network graph of significant causal links
"""

import matplotlib
matplotlib.use('Agg')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
from pathlib import Path
from typing import Optional, List, Tuple, Dict

from statsmodels.tsa.stattools import grangercausalitytests

from src.data_loader import load_hourly
from src.preprocessing import log_returns, add_derived_features


# ============================================================
# 1. Causality pairs
# ============================================================

# 5 bidirectional relationships, listed as 10 directed (cause, effect) pairs
CAUSALITY_PAIRS = [
    ('volume', 'log_return'),
    ('log_return', 'volume'),
    ('abs_return', 'volume'),
    ('volume', 'abs_return'),
    ('taker_buy_ratio', 'log_return'),
    ('log_return', 'taker_buy_ratio'),
    ('squared_return', 'volume'),
    ('volume', 'squared_return'),
    ('range_pct', 'log_return'),
    ('log_return', 'range_pct'),
]

# Lag orders to test
TEST_LAGS = [1, 2, 3, 5, 10]
# ============================================================
|
||||
# 2. 单对 Granger 因果检验
|
||||
# ============================================================
|
||||
|
||||
def granger_test_pair(
|
||||
df: pd.DataFrame,
|
||||
cause: str,
|
||||
effect: str,
|
||||
max_lag: int = 10,
|
||||
test_lags: Optional[List[int]] = None,
|
||||
) -> List[Dict]:
|
||||
"""
|
||||
对指定的 (cause → effect) 方向执行 Granger 因果检验
|
||||
|
||||
Parameters
|
||||
----------
|
||||
df : pd.DataFrame
|
||||
包含 cause 和 effect 列的数据
|
||||
cause : str
|
||||
原因变量列名
|
||||
effect : str
|
||||
结果变量列名
|
||||
max_lag : int
|
||||
最大滞后阶数
|
||||
test_lags : list of int, optional
|
||||
需要测试的滞后阶数列表
|
||||
|
||||
Returns
|
||||
-------
|
||||
list of dict
|
||||
每个滞后阶数的检验结果
|
||||
"""
|
||||
if test_lags is None:
|
||||
test_lags = TEST_LAGS
|
||||
|
||||
# grangercausalitytests 要求: 第一列是 effect,第二列是 cause
|
||||
data = df[[effect, cause]].dropna()
|
||||
|
||||
if len(data) < max_lag + 20:
|
||||
print(f" [警告] {cause} → {effect}: 样本量不足 ({len(data)}),跳过")
|
||||
return []
|
||||
|
||||
results = []
|
||||
try:
|
||||
# 执行检验,maxlag 取最大值,一次获取所有滞后
|
||||
with warnings.catch_warnings():
|
||||
warnings.simplefilter("ignore")
|
||||
gc_results = grangercausalitytests(data, maxlag=max_lag, verbose=False)
|
||||
|
||||
# 提取指定滞后阶数的结果
|
||||
for lag in test_lags:
|
||||
if lag > max_lag:
|
||||
continue
|
||||
test_result = gc_results[lag]
|
||||
# 取 ssr_ftest 的 F 统计量和 p 值
|
||||
f_stat = test_result[0]['ssr_ftest'][0]
|
||||
p_value = test_result[0]['ssr_ftest'][1]
|
||||
|
||||
results.append({
|
||||
'cause': cause,
|
||||
'effect': effect,
|
||||
'lag': lag,
|
||||
'f_stat': f_stat,
|
||||
'p_value': p_value,
|
||||
})
|
||||
except Exception as e:
|
||||
print(f" [错误] {cause} → {effect}: {e}")
|
||||
|
||||
return results
|
||||
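
# Illustrative sketch (never called by the pipeline): a toy DataFrame in
# which column `x` leads column `y` by one step, so the x -> y direction
# should test significant while y -> x should not. The columns `x` and `y`
# are hypothetical names that exist only for this demo.
def _demo_granger_test_pair(n: int = 500) -> List[Dict]:
    rng = np.random.default_rng(42)
    x = rng.normal(size=n)
    # y depends on x lagged by one step, plus noise -> x Granger-causes y
    y = 0.8 * np.roll(x, 1) + rng.normal(scale=0.5, size=n)
    demo_df = pd.DataFrame({'x': x[1:], 'y': y[1:]})  # drop the wrap-around row
    return granger_test_pair(demo_df, cause='x', effect='y', max_lag=5, test_lags=[1, 2])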
# ============================================================
# 3. Batch causality tests
# ============================================================

def run_all_granger_tests(
    df: pd.DataFrame,
    pairs: Optional[List[Tuple[str, str]]] = None,
    test_lags: Optional[List[int]] = None,
) -> pd.DataFrame:
    """
    Run bidirectional Granger causality tests for all variable pairs.

    Parameters
    ----------
    df : pd.DataFrame
        Daily data with derived features.
    pairs : list of tuple, optional
        Variable pairs [(cause, effect), ...].
    test_lags : list of int, optional
        Lag orders to test.

    Returns
    -------
    pd.DataFrame
        Summary table of all test results.
    """
    if pairs is None:
        pairs = CAUSALITY_PAIRS
    if test_lags is None:
        test_lags = TEST_LAGS

    max_lag = max(test_lags)
    all_results = []

    for cause, effect in pairs:
        if cause not in df.columns or effect not in df.columns:
            print(f" [警告] 列 {cause} 或 {effect} 不存在,跳过")
            continue
        pair_results = granger_test_pair(df, cause, effect, max_lag=max_lag, test_lags=test_lags)
        all_results.extend(pair_results)

    results_df = pd.DataFrame(all_results)
    return results_df


# ============================================================
# 4. Bonferroni correction
# ============================================================

def apply_bonferroni(results_df: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame:
    """
    Apply the Bonferroni multiple-testing correction to Granger test results.

    Parameters
    ----------
    results_df : pd.DataFrame
        Test results containing a p_value column.
    alpha : float
        Uncorrected significance level.

    Returns
    -------
    pd.DataFrame
        Results with corrected significance flags added.
    """
    n_tests = len(results_df)
    if n_tests == 0:
        return results_df

    out = results_df.copy()
    # Bonferroni-corrected threshold
    corrected_alpha = alpha / n_tests
    out['bonferroni_alpha'] = corrected_alpha
    out['significant_raw'] = out['p_value'] < alpha
    out['significant_corrected'] = out['p_value'] < corrected_alpha

    return out
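
# Minimal sketch of the correction logic (illustrative only): with 50 tests
# at alpha = 0.05, the Bonferroni threshold drops to 0.05/50 = 0.001, so a
# raw p-value of 0.004 is "significant" before correction but not after.
def _demo_apply_bonferroni() -> pd.DataFrame:
    demo = pd.DataFrame({'p_value': [0.0004, 0.004, 0.2] + [0.5] * 47})
    out = apply_bonferroni(demo, alpha=0.05)
    # out['bonferroni_alpha'] == 0.001
    # row 0: significant_raw=True,  significant_corrected=True
    # row 1: significant_raw=True,  significant_corrected=False
    return out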
# ============================================================
# 5. Cross-timeframe causality tests
# ============================================================

def cross_timeframe_causality(
    daily_df: pd.DataFrame,
    test_lags: Optional[List[int]] = None,
) -> pd.DataFrame:
    """
    Test whether hourly aggregated features Granger-cause daily returns.

    Steps:
    1. Load the hourly data.
    2. Compute intraday aggregates of hourly volatility and volume.
    3. Merge them with the daily returns.
    4. Run the Granger causality tests.

    Parameters
    ----------
    daily_df : pd.DataFrame
        Daily data (with log_return).
    test_lags : list of int, optional
        Lag orders to test.

    Returns
    -------
    pd.DataFrame
        Cross-timeframe causality test results.
    """
    if test_lags is None:
        test_lags = TEST_LAGS

    # Load the hourly data
    try:
        hourly_raw = load_hourly()
    except Exception as e:
        print(f" [警告] 无法加载小时级数据,跳过跨时间尺度因果检验: {e}")
        return pd.DataFrame()

    # Compute hourly derived features
    hourly = add_derived_features(hourly_raw)

    # Intraday aggregation: group the hourly data by calendar date
    hourly['date'] = hourly.index.date
    agg_dict = {}

    # Intraday volatility (std of hourly log returns)
    if 'log_return' in hourly.columns:
        hourly_vol = hourly.groupby('date')['log_return'].std()
        hourly_vol.name = 'hourly_intraday_vol'
        agg_dict['hourly_intraday_vol'] = hourly_vol

    # Intraday total volume
    if 'volume' in hourly.columns:
        hourly_volume = hourly.groupby('date')['volume'].sum()
        hourly_volume.name = 'hourly_volume_sum'
        agg_dict['hourly_volume_sum'] = hourly_volume

    # Intraday maximum absolute hourly return
    if 'abs_return' in hourly.columns:
        hourly_max_abs = hourly.groupby('date')['abs_return'].max()
        hourly_max_abs.name = 'hourly_max_abs_return'
        agg_dict['hourly_max_abs_return'] = hourly_max_abs

    if not agg_dict:
        print(" [警告] 小时级聚合特征为空,跳过")
        return pd.DataFrame()

    # Combine the aggregates
    hourly_agg = pd.DataFrame(agg_dict)
    hourly_agg.index = pd.to_datetime(hourly_agg.index)

    # Merge with the daily data
    daily_for_merge = daily_df[['log_return']].copy()
    merged = daily_for_merge.join(hourly_agg, how='inner')

    print(f" [跨时间尺度] 合并后样本数: {len(merged)}")

    # Test each hourly aggregate feature -> daily return
    cross_pairs = []
    for col in agg_dict.keys():
        cross_pairs.append((col, 'log_return'))

    max_lag = max(test_lags)
    all_results = []
    for cause, effect in cross_pairs:
        pair_results = granger_test_pair(merged, cause, effect, max_lag=max_lag, test_lags=test_lags)
        all_results.extend(pair_results)

    results_df = pd.DataFrame(all_results)
    return results_df


# ============================================================
# 6. Visualization: p-value heatmap
# ============================================================

def plot_pvalue_heatmap(results_df: pd.DataFrame, output_dir: Path):
    """
    Plot a p-value heatmap (variable pairs x lag orders).

    Parameters
    ----------
    results_df : pd.DataFrame
        Causality test results.
    output_dir : Path
        Output directory.
    """
    if results_df.empty:
        print(" [警告] 无检验结果,跳过热力图绘制")
        return

    # Build pair labels
    results_df = results_df.copy()
    results_df['pair'] = results_df['cause'] + ' → ' + results_df['effect']

    # Pivot table: rows = pair, columns = lag
    pivot = results_df.pivot_table(index='pair', columns='lag', values='p_value')

    fig, ax = plt.subplots(figsize=(12, max(6, len(pivot) * 0.5)))

    # Draw the heatmap
    im = ax.imshow(-np.log10(pivot.values + 1e-300), cmap='RdYlGn_r', aspect='auto')

    # Axes
    ax.set_xticks(range(len(pivot.columns)))
    ax.set_xticklabels([f'Lag {c}' for c in pivot.columns], fontsize=10)
    ax.set_yticks(range(len(pivot.index)))
    ax.set_yticklabels(pivot.index, fontsize=9)

    # Annotate each cell with its p-value
    for i in range(len(pivot.index)):
        for j in range(len(pivot.columns)):
            val = pivot.values[i, j]
            if np.isnan(val):
                text = 'N/A'
                color = 'black'
            else:
                text = f'{val:.4f}'
                color = 'white' if -np.log10(val + 1e-300) > 2 else 'black'
            ax.text(j, i, text, ha='center', va='center', fontsize=8, color=color)

    # Bonferroni threshold in the title
    n_tests = len(results_df)
    if n_tests > 0:
        bonf_alpha = 0.05 / n_tests
        ax.set_title(
            f'Granger 因果检验 p 值热力图 (-log10)\n'
            f'Bonferroni 校正阈值: {bonf_alpha:.6f} (共 {n_tests} 次检验)',
            fontsize=13
        )

    cbar = fig.colorbar(im, ax=ax, shrink=0.8)
    cbar.set_label('-log10(p-value)', fontsize=11)

    fig.savefig(output_dir / 'granger_pvalue_heatmap.png',
                dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f" [保存] {output_dir / 'granger_pvalue_heatmap.png'}")


# ============================================================
# 7. Visualization: causal network graph
# ============================================================

def plot_causal_network(results_df: pd.DataFrame, output_dir: Path, alpha: float = 0.05):
    """
    Plot the network of significant causal links (matplotlib arrows).

    Only pairs that remain significant after Bonferroni correction are shown
    (using the best lag per pair).

    Parameters
    ----------
    results_df : pd.DataFrame
        Test results containing a significant_corrected column.
    output_dir : Path
        Output directory.
    alpha : float
        Significance level.
    """
    if results_df.empty or 'significant_corrected' not in results_df.columns:
        print(" [警告] 无校正后结果,跳过网络图绘制")
        return

    # Keep significant pairs (the lag with the smallest p-value per pair)
    sig = results_df[results_df['significant_corrected']].copy()
    if sig.empty:
        print(" [信息] Bonferroni 校正后无显著因果关系,绘制空网络图")
        sig_best = pd.DataFrame(columns=results_df.columns)
    else:
        sig_best = sig.loc[sig.groupby(['cause', 'effect'])['p_value'].idxmin()]

    # Collect all variable nodes
    all_vars = set()
    for _, row in results_df.iterrows():
        all_vars.add(row['cause'])
        all_vars.add(row['effect'])
    all_vars = sorted(all_vars)
    n_vars = len(all_vars)

    if n_vars == 0:
        return

    # Layout: nodes arranged on a circle
    angles = np.linspace(0, 2 * np.pi, n_vars, endpoint=False)
    positions = {v: (np.cos(a), np.sin(a)) for v, a in zip(all_vars, angles)}

    fig, ax = plt.subplots(figsize=(10, 10))

    # Draw the nodes
    for var, (x, y) in positions.items():
        circle = plt.Circle((x, y), 0.12, color='steelblue', alpha=0.8)
        ax.add_patch(circle)
        ax.text(x, y, var, ha='center', va='center', fontsize=8,
                fontweight='bold', color='white')

    # Draw arrows for the significant causal links
    for _, row in sig_best.iterrows():
        cause_pos = positions[row['cause']]
        effect_pos = positions[row['effect']]

        # Direction vector between the two nodes
        dx = effect_pos[0] - cause_pos[0]
        dy = effect_pos[1] - cause_pos[1]
        dist = np.sqrt(dx ** 2 + dy ** 2)
        if dist < 0.01:
            continue

        # Shorten the arrow to the node circles' edges
        shrink = 0.14
        start_x = cause_pos[0] + shrink * dx / dist
        start_y = cause_pos[1] + shrink * dy / dist
        end_x = effect_pos[0] - shrink * dx / dist
        end_y = effect_pos[1] - shrink * dy / dist

        # Arrow width scales with -log10(p)
        width = min(3.0, -np.log10(row['p_value'] + 1e-300) * 0.5)

        ax.annotate(
            '',
            xy=(end_x, end_y),
            xytext=(start_x, start_y),
            arrowprops=dict(
                arrowstyle='->', color='red', lw=width,
                connectionstyle='arc3,rad=0.1',
                mutation_scale=15,
            ),
        )
        # Annotate the lag order and p-value at the midpoint
        mid_x = (start_x + end_x) / 2
        mid_y = (start_y + end_y) / 2
        ax.text(mid_x, mid_y, f'lag={int(row["lag"])}\np={row["p_value"]:.2e}',
                fontsize=7, ha='center', va='center',
                bbox=dict(boxstyle='round,pad=0.2', facecolor='yellow', alpha=0.7))

    n_sig = len(sig_best)
    n_total = len(results_df)
    ax.set_title(
        f'Granger 因果关系网络 (Bonferroni 校正后)\n'
        f'显著链接: {n_sig}/{n_total}',
        fontsize=14
    )
    ax.set_xlim(-1.6, 1.6)
    ax.set_ylim(-1.6, 1.6)
    ax.set_aspect('equal')
    ax.axis('off')

    fig.savefig(output_dir / 'granger_causal_network.png',
                dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f" [保存] {output_dir / 'granger_causal_network.png'}")


# ============================================================
# 8. Result printing
# ============================================================

def print_causality_results(results_df: pd.DataFrame):
    """Print all causality test results."""
    if results_df.empty:
        print(" [信息] 无检验结果")
        return

    print("\n" + "=" * 90)
    print("Granger 因果检验结果明细")
    print("=" * 90)
    print(f" {'因果方向':<40} {'滞后':>4} {'F统计量':>12} {'p值':>12} {'原始显著':>8} {'校正显著':>8}")
    print(" " + "-" * 88)

    for _, row in results_df.iterrows():
        pair_label = f"{row['cause']} → {row['effect']}"
        sig_raw = '***' if row.get('significant_raw', False) else ''
        sig_corr = '***' if row.get('significant_corrected', False) else ''
        print(f" {pair_label:<40} {int(row['lag']):>4} "
              f"{row['f_stat']:>12.4f} {row['p_value']:>12.6f} "
              f"{sig_raw:>8} {sig_corr:>8}")

    # Summary statistics
    n_total = len(results_df)
    n_sig_raw = results_df.get('significant_raw', pd.Series(dtype=bool)).sum()
    n_sig_corr = results_df.get('significant_corrected', pd.Series(dtype=bool)).sum()

    print(f"\n 汇总: 共 {n_total} 次检验")
    print(f" 原始显著 (p < 0.05): {n_sig_raw} ({n_sig_raw / n_total * 100:.1f}%)")
    print(f" Bonferroni 校正后显著: {n_sig_corr} ({n_sig_corr / n_total * 100:.1f}%)")

    if n_total > 0:
        bonf_alpha = 0.05 / n_total
        print(f" Bonferroni 校正阈值: {bonf_alpha:.6f}")


# ============================================================
# 9. Main entry point
# ============================================================

def run_causality_analysis(
    df: pd.DataFrame,
    output_dir: str = "output/causality",
) -> Dict:
    """
    Main driver for the Granger causality analysis.

    Parameters
    ----------
    df : pd.DataFrame
        Daily data (with derived features from add_derived_features).
    output_dir : str
        Output directory for the charts.

    Returns
    -------
    dict
        Dictionary with all test results.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    print("=" * 70)
    print("BTC Granger 因果检验分析")
    print("=" * 70)
    print(f"数据范围: {df.index.min()} ~ {df.index.max()}")
    print(f"样本数量: {len(df)}")
    print(f"测试滞后阶数: {TEST_LAGS}")
    print(f"因果变量对数: {len(CAUSALITY_PAIRS)}")
    print(f"总检验次数(含所有滞后): {len(CAUSALITY_PAIRS) * len(TEST_LAGS)}")

    # Configure CJK-capable fonts
    plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei', 'DejaVu Sans']
    plt.rcParams['axes.unicode_minus'] = False

    # --- Daily-level Granger causality tests ---
    print("\n>>> [1/4] 执行日线级 Granger 因果检验...")
    daily_results = run_all_granger_tests(df, pairs=CAUSALITY_PAIRS, test_lags=TEST_LAGS)

    if not daily_results.empty:
        daily_results = apply_bonferroni(daily_results, alpha=0.05)
        print_causality_results(daily_results)
    else:
        print(" [警告] 日线级因果检验未产生结果")

    # --- Cross-timeframe causality tests ---
    print("\n>>> [2/4] 执行跨时间尺度因果检验(小时 → 日线)...")
    cross_results = cross_timeframe_causality(df, test_lags=TEST_LAGS)

    if not cross_results.empty:
        cross_results = apply_bonferroni(cross_results, alpha=0.05)
        print("\n跨时间尺度因果检验结果:")
        print_causality_results(cross_results)
    else:
        print(" [信息] 跨时间尺度因果检验无结果(可能小时数据不可用)")

    # --- Combine all results for visualization ---
    all_results = pd.concat([daily_results, cross_results], ignore_index=True)
    if not all_results.empty and 'significant_corrected' not in all_results.columns:
        all_results = apply_bonferroni(all_results, alpha=0.05)

    # --- p-value heatmap (daily results only, to avoid mixing scales) ---
    print("\n>>> [3/4] 绘制 p 值热力图...")
    plot_pvalue_heatmap(daily_results, output_dir)

    # --- Causal network graph ---
    print("\n>>> [4/4] 绘制因果关系网络图...")
    # Use all results (including cross-timeframe)
    if not all_results.empty:
        # Re-apply the Bonferroni correction, since merging increases the
        # total number of tests
        all_corrected = apply_bonferroni(all_results.drop(
            columns=['bonferroni_alpha', 'significant_raw', 'significant_corrected'],
            errors='ignore'
        ), alpha=0.05)
        plot_causal_network(all_corrected, output_dir)
    else:
        print(" [警告] 无可用结果,跳过网络图")

    print("\n" + "=" * 70)
    print("Granger 因果检验分析完成!")
    print(f"图表已保存至: {output_dir.resolve()}")
    print("=" * 70)

    return {
        'daily_results': daily_results,
        'cross_timeframe_results': cross_results,
        'all_results': all_results,
    }


# ============================================================
# Standalone execution
# ============================================================

if __name__ == '__main__':
    from src.data_loader import load_daily
    from src.preprocessing import add_derived_features

    df = load_daily()
    df = add_derived_features(df)
    run_causality_analysis(df)
742
src/clustering.py
Normal file
@@ -0,0 +1,742 @@
"""Market-state clustering and Markov-chain analysis module

Clusters BTC daily features with K-Means, GMM, and HDBSCAN, then builds the
state transition matrix and computes its stationary distribution.
"""

import warnings
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from pathlib import Path
from typing import Optional, Tuple, Dict, List

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, silhouette_samples

try:
    import hdbscan
    HAS_HDBSCAN = True
except ImportError:
    HAS_HDBSCAN = False
    warnings.warn("hdbscan 未安装,将跳过 HDBSCAN 聚类。pip install hdbscan")


# ============================================================
# Feature engineering
# ============================================================

FEATURE_COLS = [
    "log_return", "abs_return", "vol_7d", "vol_30d",
    "volume_ratio", "taker_buy_ratio", "range_pct", "body_pct",
    "log_return_lag1", "log_return_lag2",
]


def _prepare_features(df: pd.DataFrame) -> Tuple[pd.DataFrame, np.ndarray, StandardScaler]:
    """
    Prepare clustering features: add lagged returns, standardize, drop NaN rows.

    Returns
    -------
    df_clean : cleaned DataFrame (index retained for later mapping)
    X_scaled : standardized feature matrix
    scaler : the fitted scaler (usable for inverse transforms)
    """
    out = df.copy()

    # Add lagged return features
    out["log_return_lag1"] = out["log_return"].shift(1)
    out["log_return_lag2"] = out["log_return"].shift(2)

    # Keep only the feature columns and drop rows containing NaN
    df_feat = out[FEATURE_COLS].copy()
    mask = df_feat.notna().all(axis=1)
    df_clean = out.loc[mask].copy()
    X_raw = df_feat.loc[mask].values

    # Z-score standardization
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X_raw)

    print(f"[特征准备] 有效样本数: {X_scaled.shape[0]}, 特征维度: {X_scaled.shape[1]}")
    return df_clean, X_scaled, scaler


# ============================================================
# K-Means clustering
# ============================================================

def _run_kmeans(X: np.ndarray, k_range: Optional[List[int]] = None) -> Tuple[int, np.ndarray, Dict]:
    """
    K-Means clustering with the best k chosen by silhouette score.

    Returns
    -------
    best_k : optimal number of clusters
    labels : cluster labels for the optimal k
    info : silhouette score, inertia, etc. for every k
    """
    if k_range is None:
        k_range = [3, 4, 5, 6, 7]

    results = {}
    best_score = -1
    best_k = k_range[0]
    best_labels = None

    print("\n" + "=" * 60)
    print("K-Means 聚类分析")
    print("=" * 60)

    for k in k_range:
        km = KMeans(n_clusters=k, n_init=20, max_iter=500, random_state=42)
        labels = km.fit_predict(X)
        sil = silhouette_score(X, labels)
        inertia = km.inertia_
        results[k] = {"silhouette": sil, "inertia": inertia, "labels": labels, "model": km}
        print(f" k={k}: 轮廓系数={sil:.4f}, 惯性={inertia:.1f}")

        if sil > best_score:
            best_score = sil
            best_k = k
            best_labels = labels

    print(f"\n >>> 最优 k = {best_k} (轮廓系数 = {best_score:.4f})")
    return best_k, best_labels, results
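
# Minimal sketch of the k-selection logic on synthetic data (illustrative
# only, never called by the pipeline): three well-separated Gaussian blobs
# should make _run_kmeans pick k = 3 via the silhouette score.
def _demo_run_kmeans() -> int:
    rng = np.random.default_rng(0)
    blobs = np.vstack([
        rng.normal(loc=(0, 0), scale=0.3, size=(100, 2)),
        rng.normal(loc=(5, 5), scale=0.3, size=(100, 2)),
        rng.normal(loc=(0, 5), scale=0.3, size=(100, 2)),
    ])
    best_k, _, _ = _run_kmeans(blobs, k_range=[2, 3, 4])
    return best_k  # expected: 3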
# ============================================================
# GMM (Gaussian mixture model)
# ============================================================

def _run_gmm(X: np.ndarray, k_range: Optional[List[int]] = None) -> Tuple[int, np.ndarray, Dict]:
    """
    GMM clustering with the number of components chosen by BIC.

    Returns
    -------
    best_k : number of components with the lowest BIC
    labels : corresponding cluster labels
    info : BIC, AIC, labels, etc. for every k
    """
    if k_range is None:
        k_range = [3, 4, 5, 6, 7]

    results = {}
    best_bic = np.inf
    best_k = k_range[0]
    best_labels = None

    print("\n" + "=" * 60)
    print("GMM (高斯混合模型) 聚类分析")
    print("=" * 60)

    for k in k_range:
        gmm = GaussianMixture(n_components=k, covariance_type='full',
                              n_init=5, max_iter=500, random_state=42)
        gmm.fit(X)
        labels = gmm.predict(X)
        bic = gmm.bic(X)
        aic = gmm.aic(X)
        sil = silhouette_score(X, labels)
        results[k] = {"bic": bic, "aic": aic, "silhouette": sil,
                      "labels": labels, "model": gmm}
        print(f" k={k}: BIC={bic:.1f}, AIC={aic:.1f}, 轮廓系数={sil:.4f}")

        if bic < best_bic:
            best_bic = bic
            best_k = k
            best_labels = labels

    print(f"\n >>> 最优 k = {best_k} (BIC = {best_bic:.1f})")
    return best_k, best_labels, results


# ============================================================
# HDBSCAN (density-based clustering)
# ============================================================

def _run_hdbscan(X: np.ndarray) -> Tuple[Optional[np.ndarray], Dict]:
    """
    HDBSCAN density-based clustering.

    Returns
    -------
    labels : cluster labels (-1 marks noise points)
    info : clustering statistics
    """
    if not HAS_HDBSCAN:
        print("\n[HDBSCAN] 跳过 - hdbscan 未安装")
        return None, {}

    print("\n" + "=" * 60)
    print("HDBSCAN 密度聚类分析")
    print("=" * 60)

    clusterer = hdbscan.HDBSCAN(
        min_cluster_size=30,
        min_samples=10,
        metric='euclidean',
        cluster_selection_method='eom',
    )
    labels = clusterer.fit_predict(X)

    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = (labels == -1).sum()
    noise_pct = n_noise / len(labels) * 100

    info = {
        "n_clusters": n_clusters,
        "n_noise": n_noise,
        "noise_pct": noise_pct,
        "labels": labels,
        "model": clusterer,
    }

    print(f" 聚类数: {n_clusters}")
    print(f" 噪声点: {n_noise} ({noise_pct:.1f}%)")

    # Silhouette score with noise points excluded
    if n_clusters >= 2:
        mask = labels >= 0
        if mask.sum() > n_clusters:
            sil = silhouette_score(X[mask], labels[mask])
            info["silhouette"] = sil
            print(f" 轮廓系数(去噪): {sil:.4f}")

    return labels, info


# ============================================================
# Cluster interpretation and label mapping
# ============================================================

# State label definitions
STATE_LABELS = {
    "sideways": "横盘整理",
    "mild_up": "温和上涨",
    "mild_down": "温和下跌",
    "surge": "强势上涨",
    "crash": "急剧下跌",
    "high_vol": "高波动",
    "low_vol": "低波动",
}


def _interpret_clusters(df_clean: pd.DataFrame, labels: np.ndarray,
                        method_name: str = "K-Means") -> pd.DataFrame:
    """
    Interpret the clustering result: compute per-cluster feature means and
    assign state names automatically.

    Returns
    -------
    cluster_desc : per-cluster feature-mean table with a state_label column
    """
    df_work = df_clean.copy()
    col_name = f"cluster_{method_name}"
    df_work[col_name] = labels

    # Per-cluster feature means
    cluster_means = df_work.groupby(col_name)[FEATURE_COLS].mean()

    print(f"\n{'=' * 60}")
    print(f"{method_name} 聚类特征均值")
    print("=" * 60)

    # Automatic state labelling
    state_labels = {}
    for cid in cluster_means.index:
        row = cluster_means.loc[cid]
        lr = row["log_return"]
        vol = row["vol_7d"]
        abs_r = row["abs_return"]

        # Rule-based assignment from mean return and volatility
        if lr > 0.02 and abs_r > 0.02:
            label = "surge"
        elif lr < -0.02 and abs_r > 0.02:
            label = "crash"
        elif lr > 0.005:
            label = "mild_up"
        elif lr < -0.005:
            label = "mild_down"
        elif abs_r > 0.015 or vol > cluster_means["vol_7d"].median() * 1.5:
            label = "high_vol"
        else:
            label = "sideways"

        state_labels[cid] = label

    cluster_means["state_label"] = pd.Series(state_labels)
    cluster_means["state_cn"] = cluster_means["state_label"].map(STATE_LABELS)

    # Sample count and share per cluster
    counts = df_work[col_name].value_counts().sort_index()
    cluster_means["count"] = counts
    cluster_means["pct"] = (counts / counts.sum() * 100).round(1)

    for cid in cluster_means.index:
        row = cluster_means.loc[cid]
        print(f"\n 聚类 {cid} [{row['state_cn']}] (n={int(row['count'])}, {row['pct']:.1f}%)")
        print(f" log_return: {row['log_return']:.5f}, abs_return: {row['abs_return']:.5f}")
        print(f" vol_7d: {row['vol_7d']:.4f}, vol_30d: {row['vol_30d']:.4f}")
        print(f" volume_ratio: {row['volume_ratio']:.3f}, taker_buy_ratio: {row['taker_buy_ratio']:.4f}")
        print(f" range_pct: {row['range_pct']:.5f}, body_pct: {row['body_pct']:.5f}")

    return cluster_means


# ============================================================
# Markov transition matrix
# ============================================================

def _compute_transition_matrix(labels: np.ndarray) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """
    Compute the state transition probability matrix, its stationary
    distribution, and the mean holding time per state.

    Parameters
    ----------
    labels : cluster labels as a time series

    Returns
    -------
    trans_matrix : transition probability matrix (n_states x n_states)
    stationary : stationary distribution vector
    holding_time : mean holding time per state
    """
    states = np.sort(np.unique(labels))
    n_states = len(states)

    # Map states to contiguous indices
    state_to_idx = {s: i for i, s in enumerate(states)}

    # Count matrix
    count_matrix = np.zeros((n_states, n_states), dtype=np.float64)
    for t in range(len(labels) - 1):
        i = state_to_idx[labels[t]]
        j = state_to_idx[labels[t + 1]]
        count_matrix[i, j] += 1

    # Transition probability matrix (row-normalized)
    row_sums = count_matrix.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1  # avoid division by zero
    trans_matrix = count_matrix / row_sums

    # Stationary distribution: left eigenvector of the transition matrix for
    # eigenvalue 1, i.e. pi * P = pi  <=>  P^T * pi^T = pi^T
    eigenvalues, eigenvectors = np.linalg.eig(trans_matrix.T)

    # Eigenvector whose eigenvalue is closest to 1
    idx = np.argmin(np.abs(eigenvalues - 1.0))
    stationary = np.real(eigenvectors[:, idx])
    stationary = stationary / stationary.sum()  # normalize to probabilities

    # Enforce non-negativity (numerical error can produce tiny negatives)
    stationary = np.abs(stationary)
    stationary = stationary / stationary.sum()

    # Mean holding time = 1 / (1 - p_ii)
    diag = np.diag(trans_matrix)
    holding_time = np.where(diag < 1.0, 1.0 / (1.0 - diag), np.inf)

    return trans_matrix, stationary, holding_time
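
# Worked example (illustrative only, never called by the pipeline): for a
# two-state chain with P = [[0.9, 0.1], [0.5, 0.5]], solving pi * P = pi
# gives pi = (5/6, 1/6), and the holding times are 1/(1-0.9) = 10 and
# 1/(1-0.5) = 2 steps. The label sequence below is hypothetical.
def _demo_transition_matrix():
    labels = np.array([0, 0, 0, 1, 0, 0, 1, 1, 0, 0])
    trans, stat, hold = _compute_transition_matrix(labels)
    # Rows of trans sum to 1 and stat satisfies stat @ trans ≈ stat
    assert np.allclose(trans.sum(axis=1), 1.0)
    assert np.allclose(stat @ trans, stat, atol=1e-8)
    return trans, stat, hold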
def _print_markov_results(trans_matrix: np.ndarray, stationary: np.ndarray,
                          holding_time: np.ndarray, cluster_desc: pd.DataFrame):
    """Print the Markov-chain analysis results."""
    states = cluster_desc.index.tolist()
    state_names = cluster_desc["state_cn"].tolist()

    print("\n" + "=" * 60)
    print("马尔可夫链状态转移分析")
    print("=" * 60)

    # Transition probability matrix
    print("\n转移概率矩阵:")
    header = " " + " ".join([f" {state_names[j][:4]:>4s}" for j in range(len(states))])
    print(header)
    for i, s in enumerate(states):
        row_str = f" {state_names[i][:4]:>4s}"
        for j in range(len(states)):
            row_str += f" {trans_matrix[i, j]:6.3f}"
        print(row_str)

    # Stationary distribution
    print("\n平稳分布 (长期均衡概率):")
    for i, s in enumerate(states):
        print(f" {state_names[i]}: {stationary[i]:.4f} ({stationary[i]*100:.1f}%)")

    # Mean holding times
    print("\n平均持有时间 (天):")
    for i, s in enumerate(states):
        if np.isinf(holding_time[i]):
            print(f" {state_names[i]}: ∞ (吸收态)")
        else:
            print(f" {state_names[i]}: {holding_time[i]:.2f} 天")


# ============================================================
# Visualization
# ============================================================

def _plot_pca_scatter(X: np.ndarray, labels: np.ndarray,
                      cluster_desc: pd.DataFrame, method_name: str,
                      output_dir: Path):
    """2D PCA scatter plot colored by cluster."""
    pca = PCA(n_components=2)
    X_2d = pca.fit_transform(X)

    fig, ax = plt.subplots(figsize=(12, 8))
    states = np.sort(np.unique(labels))
    colors = plt.cm.Set2(np.linspace(0, 1, len(states)))

    for i, s in enumerate(states):
        mask = labels == s
        label_name = cluster_desc.loc[s, "state_cn"] if s in cluster_desc.index else f"Cluster {s}"
        ax.scatter(X_2d[mask, 0], X_2d[mask, 1], c=[colors[i]], label=label_name,
                   alpha=0.5, s=15, edgecolors='none')

    ax.set_xlabel(f"PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)", fontsize=12)
    ax.set_ylabel(f"PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)", fontsize=12)
    ax.set_title(f"{method_name} 聚类结果 - PCA 2D投影", fontsize=14)
    ax.legend(fontsize=10, loc='best')
    ax.grid(True, alpha=0.3)

    fig.savefig(output_dir / f"cluster_pca_{method_name.lower().replace(' ', '_')}.png",
                dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f" [保存] cluster_pca_{method_name.lower().replace(' ', '_')}.png")


def _plot_silhouette(X: np.ndarray, labels: np.ndarray, method_name: str, output_dir: Path):
    """Silhouette analysis plot."""
    n_clusters = len(set(labels) - {-1})
    if n_clusters < 2:
        return

    # Exclude noise points
    mask = labels >= 0
    if mask.sum() < n_clusters + 1:
        return

    sil_vals = silhouette_samples(X[mask], labels[mask])
    avg_sil = silhouette_score(X[mask], labels[mask])

    fig, ax = plt.subplots(figsize=(10, 7))
    y_lower = 10
    valid_labels = np.sort(np.unique(labels[mask]))
    colors = plt.cm.Set2(np.linspace(0, 1, len(valid_labels)))

    for i, c in enumerate(valid_labels):
        c_sil = sil_vals[labels[mask] == c]
        c_sil.sort()
        size = c_sil.shape[0]
        y_upper = y_lower + size

        ax.fill_betweenx(np.arange(y_lower, y_upper), 0, c_sil,
                         facecolor=colors[i], edgecolor=colors[i], alpha=0.7)
        ax.text(-0.05, y_lower + 0.5 * size, str(c), fontsize=10)
        y_lower = y_upper + 10

    ax.axvline(x=avg_sil, color="red", linestyle="--", label=f"平均={avg_sil:.3f}")
    ax.set_xlabel("轮廓系数", fontsize=12)
    ax.set_ylabel("聚类标签", fontsize=12)
    ax.set_title(f"{method_name} 轮廓系数分析 (平均={avg_sil:.3f})", fontsize=14)
    ax.legend(fontsize=10)

    fig.savefig(output_dir / f"cluster_silhouette_{method_name.lower().replace(' ', '_')}.png",
                dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f" [保存] cluster_silhouette_{method_name.lower().replace(' ', '_')}.png")


def _plot_cluster_heatmap(cluster_desc: pd.DataFrame, method_name: str, output_dir: Path):
    """Heatmap of per-cluster feature means."""
    # Numeric feature columns only
    feat_cols = [c for c in FEATURE_COLS if c in cluster_desc.columns]
    data = cluster_desc[feat_cols].copy()

    # Z-score each column so features of different scales are comparable
    data_norm = (data - data.mean()) / (data.std() + 1e-10)

    fig, ax = plt.subplots(figsize=(14, max(6, len(data) * 1.2)))

    # Row labels use the state names
    row_labels = [f"{idx}-{cluster_desc.loc[idx, 'state_cn']}" for idx in data.index]

    im = ax.imshow(data_norm.values, cmap='RdYlGn', aspect='auto')
    ax.set_xticks(range(len(feat_cols)))
    ax.set_xticklabels(feat_cols, rotation=45, ha='right', fontsize=10)
    ax.set_yticks(range(len(row_labels)))
    ax.set_yticklabels(row_labels, fontsize=11)

    # Annotate cells with the raw values
    for i in range(data.shape[0]):
        for j in range(data.shape[1]):
            val = data.iloc[i, j]
            ax.text(j, i, f"{val:.4f}", ha='center', va='center', fontsize=8,
                    color='black' if abs(data_norm.iloc[i, j]) < 1.5 else 'white')

    plt.colorbar(im, ax=ax, shrink=0.8, label="标准化值")
    ax.set_title(f"{method_name} 各聚类特征热力图", fontsize=14)

    fig.savefig(output_dir / f"cluster_heatmap_{method_name.lower().replace(' ', '_')}.png",
                dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f" [保存] cluster_heatmap_{method_name.lower().replace(' ', '_')}.png")


def _plot_transition_heatmap(trans_matrix: np.ndarray, cluster_desc: pd.DataFrame,
                             output_dir: Path):
    """Heatmap of the state transition probability matrix."""
    state_names = [cluster_desc.loc[idx, "state_cn"] for idx in cluster_desc.index]

    fig, ax = plt.subplots(figsize=(10, 8))
    im = ax.imshow(trans_matrix, cmap='YlOrRd', vmin=0, vmax=1, aspect='auto')

    n = len(state_names)
    ax.set_xticks(range(n))
    ax.set_xticklabels(state_names, rotation=45, ha='right', fontsize=11)
    ax.set_yticks(range(n))
    ax.set_yticklabels(state_names, fontsize=11)

    # Annotate the probabilities
    for i in range(n):
        for j in range(n):
            color = 'white' if trans_matrix[i, j] > 0.5 else 'black'
            ax.text(j, i, f"{trans_matrix[i, j]:.3f}", ha='center', va='center',
                    fontsize=11, color=color, fontweight='bold')

    plt.colorbar(im, ax=ax, shrink=0.8, label="转移概率")
    ax.set_xlabel("下一状态", fontsize=12)
    ax.set_ylabel("当前状态", fontsize=12)
    ax.set_title("马尔可夫状态转移概率矩阵", fontsize=14)

    fig.savefig(output_dir / "cluster_transition_matrix.png", dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(" [保存] cluster_transition_matrix.png")


def _plot_state_timeseries(df_clean: pd.DataFrame, labels: np.ndarray,
                           cluster_desc: pd.DataFrame, output_dir: Path):
    """Time-series plot of the market state over time."""
    fig, axes = plt.subplots(2, 1, figsize=(18, 10), height_ratios=[2, 1], sharex=True)

    dates = df_clean.index
    close = df_clean["close"].values

    states = np.sort(np.unique(labels))
    colors = plt.cm.Set2(np.linspace(0, 1, len(states)))
    color_map = {s: colors[i] for i, s in enumerate(states)}

    # Top panel: price path colored by state
    ax1 = axes[0]
    for i in range(len(dates) - 1):
        ax1.plot([dates[i], dates[i + 1]], [close[i], close[i + 1]],
                 color=color_map[labels[i]], linewidth=0.8)

    # Legend
    from matplotlib.patches import Patch
    legend_patches = []
    for s in states:
        name = cluster_desc.loc[s, "state_cn"] if s in cluster_desc.index else f"Cluster {s}"
        legend_patches.append(Patch(color=color_map[s], label=name))
    ax1.legend(handles=legend_patches, fontsize=9, loc='upper left')
    ax1.set_ylabel("BTC 价格 (USDT)", fontsize=12)
    ax1.set_title("BTC 价格与市场状态时间序列", fontsize=14)
    ax1.set_yscale('log')
    ax1.grid(True, alpha=0.3)

    # Bottom panel: state-label timeline
    ax2 = axes[1]
    state_colors = [color_map[l] for l in labels]
    ax2.bar(dates, np.ones(len(dates)), color=state_colors, width=1.5, edgecolor='none')
    ax2.set_yticks([])
    ax2.set_ylabel("市场状态", fontsize=12)
    ax2.set_xlabel("日期", fontsize=12)

    plt.tight_layout()
    fig.savefig(output_dir / "cluster_state_timeseries.png", dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(" [保存] cluster_state_timeseries.png")


def _plot_kmeans_selection(kmeans_results: Dict, gmm_results: Dict, output_dir: Path):
    """k-selection comparison plot: silhouette + elbow + BIC."""
    fig, axes = plt.subplots(1, 3, figsize=(18, 5))

    # 1. K-Means silhouette scores
    ks_km = sorted(kmeans_results.keys())
    sils_km = [kmeans_results[k]["silhouette"] for k in ks_km]
    axes[0].plot(ks_km, sils_km, 'bo-', linewidth=2, markersize=8)
    best_k_km = ks_km[np.argmax(sils_km)]
    axes[0].axvline(x=best_k_km, color='red', linestyle='--', alpha=0.7)
    axes[0].set_xlabel("k", fontsize=12)
    axes[0].set_ylabel("轮廓系数", fontsize=12)
    axes[0].set_title("K-Means 轮廓系数", fontsize=13)
    axes[0].grid(True, alpha=0.3)

    # 2. K-Means inertia (elbow method)
    inertias = [kmeans_results[k]["inertia"] for k in ks_km]
    axes[1].plot(ks_km, inertias, 'gs-', linewidth=2, markersize=8)
    axes[1].set_xlabel("k", fontsize=12)
    axes[1].set_ylabel("惯性 (Inertia)", fontsize=12)
    axes[1].set_title("K-Means 肘部法则", fontsize=13)
    axes[1].grid(True, alpha=0.3)

    # 3. GMM BIC
    ks_gmm = sorted(gmm_results.keys())
    bics = [gmm_results[k]["bic"] for k in ks_gmm]
    axes[2].plot(ks_gmm, bics, 'r^-', linewidth=2, markersize=8)
    best_k_gmm = ks_gmm[np.argmin(bics)]
    axes[2].axvline(x=best_k_gmm, color='blue', linestyle='--', alpha=0.7)
    axes[2].set_xlabel("k", fontsize=12)
    axes[2].set_ylabel("BIC", fontsize=12)
    axes[2].set_title("GMM BIC 选择", fontsize=13)
    axes[2].grid(True, alpha=0.3)

    plt.tight_layout()
    fig.savefig(output_dir / "cluster_k_selection.png", dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(" [保存] cluster_k_selection.png")


# ============================================================
# Main entry point
# ============================================================

def run_clustering_analysis(df: pd.DataFrame, output_dir: "str | Path" = "output/clustering") -> Dict:
    """
    Market-state clustering and Markov-chain analysis - main entry point.

    Parameters
    ----------
    df : pd.DataFrame
        Daily data with derived features added via add_derived_features().
    output_dir : str or Path
        Output directory for the charts.

    Returns
    -------
    results : dict
        Clustering results, transition matrix, stationary distribution, etc.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Configure CJK-capable fonts (macOS)
    plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei', 'DejaVu Sans']
    plt.rcParams['axes.unicode_minus'] = False

    print("=" * 60)
    print(" BTC 市场状态聚类与马尔可夫链分析")
    print("=" * 60)

    # ---- 1. Feature preparation ----
    df_clean, X_scaled, scaler = _prepare_features(df)

    # ---- 2. K-Means clustering ----
    best_k_km, km_labels, kmeans_results = _run_kmeans(X_scaled)

    # ---- 3. GMM clustering ----
    best_k_gmm, gmm_labels, gmm_results = _run_gmm(X_scaled)

    # ---- 4. HDBSCAN clustering ----
    hdbscan_labels, hdbscan_info = _run_hdbscan(X_scaled)

    # ---- 5. k-selection comparison plot ----
    print("\n[可视化] 生成K选择对比图...")
    _plot_kmeans_selection(kmeans_results, gmm_results, output_dir)

    # ---- 6. K-Means cluster interpretation ----
    km_desc = _interpret_clusters(df_clean, km_labels, "K-Means")

    # ---- 7. GMM cluster interpretation ----
    gmm_desc = _interpret_clusters(df_clean, gmm_labels, "GMM")

    # ---- 8. Markov-chain analysis (based on the K-Means labels) ----
    trans_matrix, stationary, holding_time = _compute_transition_matrix(km_labels)
    _print_markov_results(trans_matrix, stationary, holding_time, km_desc)

    # ---- 9. Visualization ----
    print("\n[可视化] 生成分析图表...")

    # PCA scatter plots
    _plot_pca_scatter(X_scaled, km_labels, km_desc, "K-Means", output_dir)
    _plot_pca_scatter(X_scaled, gmm_labels, gmm_desc, "GMM", output_dir)
    if hdbscan_labels is not None and hdbscan_info.get("n_clusters", 0) >= 2:
        # Brief cluster description for HDBSCAN
        hdb_desc = _interpret_clusters(df_clean, hdbscan_labels, "HDBSCAN")
        _plot_pca_scatter(X_scaled, hdbscan_labels, hdb_desc, "HDBSCAN", output_dir)

    # Silhouette plot
    _plot_silhouette(X_scaled, km_labels, "K-Means", output_dir)

    # Cluster feature heatmaps
    _plot_cluster_heatmap(km_desc, "K-Means", output_dir)
    _plot_cluster_heatmap(gmm_desc, "GMM", output_dir)

    # Transition matrix heatmap
    _plot_transition_heatmap(trans_matrix, km_desc, output_dir)

    # State time-series plot
    _plot_state_timeseries(df_clean, km_labels, km_desc, output_dir)

    # ---- 10. Collect results ----
    results = {
        "kmeans": {
            "best_k": best_k_km,
            "labels": km_labels,
            "cluster_desc": km_desc,
            "all_results": kmeans_results,
        },
        "gmm": {
            "best_k": best_k_gmm,
            "labels": gmm_labels,
            "cluster_desc": gmm_desc,
            "all_results": gmm_results,
        },
        "hdbscan": {
            "labels": hdbscan_labels,
            "info": hdbscan_info,
        },
        "markov": {
            "transition_matrix": trans_matrix,
            "stationary_distribution": stationary,
            "holding_time": holding_time,
        },
        "features": {
            "df_clean": df_clean,
            "X_scaled": X_scaled,
            "scaler": scaler,
        },
    }

    print("\n" + "=" * 60)
    print(" 聚类与马尔可夫链分析完成!")
    print("=" * 60)

    return results


# ============================================================
# Command-line entry point
# ============================================================

if __name__ == "__main__":
    from data_loader import load_daily
    from preprocessing import add_derived_features

    df = load_daily()
    df = add_derived_features(df)

    results = run_clustering_analysis(df, output_dir="output/clustering")
142
src/data_loader.py
Normal file
@@ -0,0 +1,142 @@
"""Unified data-loading module - handles millisecond/microsecond timestamp differences"""

import pandas as pd
import numpy as np
from pathlib import Path
from typing import Optional

DATA_DIR = Path(__file__).parent.parent / "data"

AVAILABLE_INTERVALS = [
    "1m", "3m", "5m", "15m", "30m",
    "1h", "2h", "4h", "6h", "8h", "12h",
    "1d", "3d", "1w", "1mo"
]

COLUMNS = [
    "open_time", "open", "high", "low", "close", "volume",
    "close_time", "quote_volume", "trades",
    "taker_buy_volume", "taker_buy_quote_volume", "ignore"
]

NUMERIC_COLS = [
    "open", "high", "low", "close", "volume",
    "quote_volume", "trades", "taker_buy_volume", "taker_buy_quote_volume"
]


def _adaptive_timestamp(ts_series: pd.Series) -> pd.DatetimeIndex:
    """Adaptively handle millisecond (13-digit) and microsecond (16-digit) timestamps."""
    ts = ts_series.astype(np.int64)
    # 16-digit (microsecond) timestamps -> convert to milliseconds
    mask = ts > 1e15
    ts = ts.copy()
    ts[mask] = ts[mask] // 1000
    return pd.to_datetime(ts, unit="ms")
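
# Illustrative sketch (never called by the loader): a 13-digit millisecond
# timestamp passes through unchanged, while a 16-digit microsecond one is
# floor-divided by 1000 first, so both land on the same instant.
def _demo_adaptive_timestamp() -> pd.DatetimeIndex:
    ts = pd.Series([1609459200000,        # 2021-01-01 00:00:00 in ms
                    1609459200000000])    # the same instant in µs
    idx = _adaptive_timestamp(ts)
    # both entries resolve to Timestamp('2021-01-01 00:00:00')
    return idx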
def load_klines(
    interval: str = "1d",
    start: Optional[str] = None,
    end: Optional[str] = None,
    data_dir: Optional[Path] = None,
) -> pd.DataFrame:
    """
    Load kline data at the given granularity.

    Parameters
    ----------
    interval : str
        Kline granularity, e.g. '1d', '1h', '4h', '1w', '1mo'.
    start : str, optional
        Start date, e.g. '2020-01-01'.
    end : str, optional
        End date, e.g. '2025-12-31'.
    data_dir : Path, optional
        Data directory; defaults to data/.

    Returns
    -------
    pd.DataFrame
        Kline data indexed by a DatetimeIndex.
    """
    if data_dir is None:
        data_dir = DATA_DIR

    filepath = data_dir / f"btcusdt_{interval}.csv"
    if not filepath.exists():
        raise FileNotFoundError(f"数据文件不存在: {filepath}")

    df = pd.read_csv(filepath)

    # Type conversion
    for col in NUMERIC_COLS:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce")

    # Adaptive timestamp handling
    df.index = _adaptive_timestamp(df["open_time"])
    df.index.name = "datetime"

    # close_time gets the same treatment
    if "close_time" in df.columns:
        df["close_time"] = _adaptive_timestamp(df["close_time"])

    # Drop the raw timestamp column and the ignore column
    df.drop(columns=["open_time", "ignore"], inplace=True, errors="ignore")

    # Sort and de-duplicate
    df.sort_index(inplace=True)
    df = df[~df.index.duplicated(keep="first")]

    # Date-range filtering
    if start:
        df = df[df.index >= pd.Timestamp(start)]
    if end:
        df = df[df.index <= pd.Timestamp(end)]

    return df


def load_daily(start: Optional[str] = None, end: Optional[str] = None) -> pd.DataFrame:
    """Convenience loader for daily klines."""
    return load_klines("1d", start=start, end=end)


def load_hourly(start: Optional[str] = None, end: Optional[str] = None) -> pd.DataFrame:
    """Convenience loader for hourly klines."""
    return load_klines("1h", start=start, end=end)


def validate_data(df: pd.DataFrame, interval: str = "1d") -> dict:
    """Data-integrity checks."""
    report = {
        "rows": len(df),
        "date_range": f"{df.index.min()} ~ {df.index.max()}",
        "null_counts": df.isnull().sum().to_dict(),
        "duplicate_index": df.index.duplicated().sum(),
    }

    # Price sanity checks
    report["price_range"] = f"{df['close'].min():.2f} ~ {df['close'].max():.2f}"
    report["negative_volume"] = (df["volume"] < 0).sum()

    # Missing-day check (daily data only)
    if interval == "1d":
        expected_days = (df.index.max() - df.index.min()).days + 1
        report["expected_days"] = expected_days
        report["missing_days"] = expected_days - len(df)

    return report


# Data-split constants
TRAIN_END = "2022-09-30"
VAL_END = "2024-06-30"


def split_data(df: pd.DataFrame):
    """Split into train/validation/test sets in chronological order."""
    train = df[df.index <= TRAIN_END]
    val = df[(df.index > TRAIN_END) & (df.index <= VAL_END)]
    test = df[df.index > VAL_END]
    return train, val, test
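
# Usage sketch (illustrative, not executed on import; assumes the
# data/btcusdt_1d.csv file is present): the split is purely chronological,
# with no shuffling, so no test-period information leaks into training.
def _demo_split_data():
    df = load_daily()
    train, val, test = split_data(df)
    assert train.index.max() < val.index.min() < test.index.min()
    return len(train), len(val), len(test)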
901
src/fft_analysis.py
Normal file
@@ -0,0 +1,901 @@
"""FFT spectral-analysis module - BTC price periodicity detection and frequency-domain features"""

import matplotlib
matplotlib.use("Agg")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.fft import fft, fftfreq, ifft
from scipy.signal import find_peaks, butter, sosfiltfilt
from pathlib import Path
from typing import Dict, List, Optional, Tuple

from src.data_loader import load_klines
from src.preprocessing import log_returns, detrend_linear


# ============================================================
# Constants
# ============================================================

# Kline granularities used in the multi-timeframe comparison and their
# sampling periods in days
MULTI_TF_INTERVALS = {
    "4h": 4 / 24,  # ≈0.1667 days
    "1d": 1.0,     # 1 day
    "1w": 7.0,     # 7 days
}

# Target periods for band-pass filtering (days)
BANDPASS_PERIODS_DAYS = [7, 30, 90, 365, 1400]

# Peak-detection threshold: power must exceed this multiple of the
# background noise
PEAK_THRESHOLD_RATIO = 5.0

# Figure-saving parameters
SAVE_KW = dict(dpi=150, bbox_inches="tight")


# ============================================================
# Core FFT computation
# ============================================================

def compute_fft_spectrum(
    signal: np.ndarray,
    sampling_period_days: float,
    apply_window: bool = True,
) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """
    Compute the FFT power spectrum of a signal.

    Parameters
    ----------
    signal : np.ndarray
        Input time-domain signal (already detrended / log returns).
    sampling_period_days : float
        Sampling period in days (daily = 1.0, 4h = 4/24).
    apply_window : bool
        Whether to apply a Hann window to suppress spectral leakage.

    Returns
    -------
    freqs : np.ndarray
        Frequencies (positive part only), in cycles/day.
    periods : np.ndarray
        Periods in days, i.e. 1/freqs.
    power : np.ndarray
        Power spectrum (normalized squared amplitudes).
    """
    n = len(signal)
    if n == 0:
        return np.array([]), np.array([]), np.array([])

    # Hann window to reduce spectral leakage
    if apply_window:
        window = np.hanning(n)
        windowed = signal * window
        # Window energy compensation: keeps the total power unchanged
        window_energy = np.sum(window ** 2) / n
    else:
        windowed = signal.copy()
        window_energy = 1.0

    # FFT
    yf = fft(windowed)
    freqs = fftfreq(n, d=sampling_period_days)

    # Positive frequencies only (excluding the DC component at freq=0)
    pos_mask = freqs > 0
    freqs_pos = freqs[pos_mask]
    yf_pos = yf[pos_mask]

    # Power spectral density: |FFT|^2 / (N * window energy)
    power = (np.abs(yf_pos) ** 2) / (n * window_energy)

    # Corresponding periods
    periods = 1.0 / freqs_pos

    return freqs_pos, periods, power
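
# Sanity-check sketch (illustrative only, never called by the pipeline): a
# pure sine with a 30-day period sampled daily should put its strongest
# spectral line at roughly 30 days.
def _demo_compute_fft_spectrum() -> float:
    t = np.arange(1024)                    # 1024 daily samples
    sig = np.sin(2 * np.pi * t / 30.0)     # one cycle every 30 days
    freqs, periods, power = compute_fft_spectrum(sig, sampling_period_days=1.0)
    peak_period = periods[np.argmax(power)]
    return peak_period  # expected: close to 30.0, up to bin resolution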
# ============================================================
# AR(1) red-noise baseline model
# ============================================================

def ar1_red_noise_spectrum(
    signal: np.ndarray,
    freqs: np.ndarray,
    sampling_period_days: float,
    confidence_percentile: float = 95.0,
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Estimate the theoretical red-noise power spectrum from an AR(1) model.

    The AR(1) power spectral density is

        S(f) = S0 * (1 - rho^2) / (1 - 2*rho*cos(2*pi*f*dt) + rho^2)

    Parameters
    ----------
    signal : np.ndarray
        Original signal.
    freqs : np.ndarray
        Frequency array.
    sampling_period_days : float
        Sampling period.
    confidence_percentile : float
        Confidence percentile (default 95%).

    Returns
    -------
    noise_mean : np.ndarray
        Theoretical mean red-noise power spectrum.
    noise_threshold : np.ndarray
        Power threshold at the requested confidence level.
    """
    n = len(signal)
    if n < 3:
        return np.zeros_like(freqs), np.zeros_like(freqs)

    # Estimate the AR(1) coefficient rho (lag-1 autocorrelation)
    signal_centered = signal - np.mean(signal)
    autocov_0 = np.sum(signal_centered ** 2) / n
    autocov_1 = np.sum(signal_centered[:-1] * signal_centered[1:]) / n
    rho = autocov_1 / autocov_0 if autocov_0 > 0 else 0.0
    rho = np.clip(rho, -0.999, 0.999)  # guard against numerical instability

    # Theoretical AR(1) power spectrum
    variance = autocov_0
    s0 = variance * (1 - rho ** 2)
    cos_term = np.cos(2 * np.pi * freqs * sampling_period_days)
    denominator = 1 - 2 * rho * cos_term + rho ** 2
    noise_mean = s0 / denominator

    # Scale to the requested confidence level. Under the chi-squared
    # approximation, each FFT power estimate has 2 degrees of freedom
    # (i.e. is exponentially distributed), so the 95% upper bound is
    # mean * chi2_ppf(0.95, 2) / 2 ≈ mean * 2.996.
    from scipy.stats import chi2
    scale_factor = chi2.ppf(confidence_percentile / 100.0, df=2) / 2.0
    noise_threshold = noise_mean * scale_factor

    return noise_mean, noise_threshold


# ============================================================
# Peak detection
# ============================================================

def detect_spectral_peaks(
    freqs: np.ndarray,
    periods: np.ndarray,
    power: np.ndarray,
    noise_mean: np.ndarray,
    noise_threshold: np.ndarray,
    threshold_ratio: float = PEAK_THRESHOLD_RATIO,
    min_period_days: float = 2.0,
) -> pd.DataFrame:
    """
    Detect significant peaks in the power spectrum.

    A peak qualifies when:
    1. it is a local maximum per scipy.signal.find_peaks;
    2. its power exceeds threshold_ratio * background noise mean;
    3. its period exceeds min_period_days (filters high-frequency noise).

    Parameters
    ----------
    freqs, periods, power : np.ndarray
        Frequency, period, and power arrays.
    noise_mean, noise_threshold : np.ndarray
        Red-noise mean and confidence threshold.
    threshold_ratio : float
        Required multiple of the noise mean.
    min_period_days : float
        Minimum period in days.

    Returns
    -------
    pd.DataFrame
        Detected peaks with period_days, frequency, power, noise_level, snr columns.
    """
    if len(power) == 0:
        return pd.DataFrame(columns=["period_days", "frequency", "power", "noise_level", "snr"])

    # Local maxima via scipy
    peak_indices, properties = find_peaks(power, height=0)

    results = []
    for idx in peak_indices:
        period_d = periods[idx]
        pwr = power[idx]
        noise_lvl = noise_mean[idx] if idx < len(noise_mean) else 1.0
        snr = pwr / noise_lvl if noise_lvl > 0 else 0.0

        # Keep peaks with a long enough period and power well above the noise
        if period_d >= min_period_days and snr >= threshold_ratio:
            results.append({
                "period_days": period_d,
                "frequency": freqs[idx],
                "power": pwr,
                "noise_level": noise_lvl,
                "snr": snr,
            })

    df_peaks = pd.DataFrame(results)
    if not df_peaks.empty:
        df_peaks = df_peaks.sort_values("snr", ascending=False).reset_index(drop=True)

    return df_peaks


# ============================================================
# Band-pass filter
# ============================================================

def bandpass_filter(
    signal: np.ndarray,
    sampling_period_days: float,
    center_period_days: float,
    bandwidth_ratio: float = 0.3,
    order: int = 4,
) -> np.ndarray:
    """
    Band-pass filter that extracts a specific periodic component.

    For long periods (normalized low frequency < 0.01) an FFT-domain filter
    is used automatically, to avoid numerical instability in the Butterworth
    design. Otherwise a Butterworth band-pass in SOS form (sosfiltfilt) is
    applied for numerical stability.

    Parameters
    ----------
    signal : np.ndarray
        Input signal.
    sampling_period_days : float
        Sampling period in days.
    center_period_days : float
        Target center period in days.
    bandwidth_ratio : float
        Band width: the pass band spans center_period * (1 +/- bandwidth_ratio).
    order : int
        Butterworth filter order.

    Returns
    -------
    np.ndarray
        The filtered signal component.
    """
    fs = 1.0 / sampling_period_days  # sampling frequency (cycles/day)
    nyquist = fs / 2.0

    # Pass-band period limits
    low_period = center_period_days * (1 + bandwidth_ratio)
    high_period = center_period_days * (1 - bandwidth_ratio)

    if high_period <= 0:
        high_period = sampling_period_days * 2.1  # keep it physically meaningful

    low_freq = 1.0 / low_period
    high_freq = 1.0 / high_period

    # Normalize to the Nyquist frequency
    low_norm = low_freq / nyquist
    high_norm = high_freq / nyquist

    # Clamp the normalized frequencies into the valid range (0, 1)
    low_norm = np.clip(low_norm, 1e-6, 0.9999)
    high_norm = np.clip(high_norm, low_norm + 1e-6, 0.9999)

    if low_norm >= high_norm:
        return np.zeros_like(signal)

    # For long periods (tiny normalized low frequency) the Butterworth
    # design is numerically unstable; use the FFT-domain filter as a
    # reliable alternative
    if low_norm < 0.01:
        return _fft_bandpass_fallback(signal, sampling_period_days,
                                      center_period_days, bandwidth_ratio)

    # Length check: sosfiltfilt needs enough samples
    min_samples = 3 * (2 * order + 1)
    if len(signal) < min_samples:
        return np.zeros_like(signal)

    try:
        # SOS (second-order sections) form for numerical stability
        sos = butter(order, [low_norm, high_norm], btype="band", output="sos")
        filtered = sosfiltfilt(sos, signal)
        return filtered
    except (ValueError, np.linalg.LinAlgError):
        # Fall back to the FFT filter if the design fails
        return _fft_bandpass_fallback(signal, sampling_period_days,
                                      center_period_days, bandwidth_ratio)


def _fft_bandpass_fallback(
    signal: np.ndarray,
    sampling_period_days: float,
    center_period_days: float,
    bandwidth_ratio: float,
) -> np.ndarray:
    """FFT-domain band-pass filter fallback."""
    n = len(signal)
    freqs = fftfreq(n, d=sampling_period_days)
    yf = fft(signal)

    center_freq = 1.0 / center_period_days
    low_freq = center_freq / (1 + bandwidth_ratio)
    high_freq = center_freq / (1 - bandwidth_ratio) if bandwidth_ratio < 1 else center_freq * 10

    # Frequency-domain mask: keep only the target band
    mask = (np.abs(freqs) >= low_freq) & (np.abs(freqs) <= high_freq)
    yf_filtered = np.zeros_like(yf)
    yf_filtered[mask] = yf[mask]

    return np.real(ifft(yf_filtered))
|
||||
|
||||
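
# --- Illustrative sanity check (not part of the original module) ---------
# Feed bandpass_filter a synthetic 30-day sinusoid buried in white noise;
# the recovered component should correlate strongly with the clean cycle
# (empirically > 0.9). The helper name `_demo_bandpass_sanity` is hypothetical.
def _demo_bandpass_sanity() -> float:
    rng = np.random.default_rng(0)
    t = np.arange(2000.0)                          # 2000 daily samples
    clean = np.sin(2 * np.pi * t / 30.0)           # 30-day cycle
    noisy = clean + 0.5 * rng.standard_normal(t.size)
    comp = bandpass_filter(noisy, sampling_period_days=1.0,
                           center_period_days=30.0)
    return float(np.corrcoef(comp, clean)[0, 1])
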
# ============================================================
# Visualization helpers
# ============================================================

def plot_power_spectrum(
    periods: np.ndarray,
    power: np.ndarray,
    noise_mean: np.ndarray,
    noise_threshold: np.ndarray,
    peaks_df: pd.DataFrame,
    title: str = "BTC Log Returns - FFT Power Spectrum",
    save_path: Optional[Path] = None,
) -> plt.Figure:
    """
    Power-spectrum plot with peak annotations and the red-noise confidence band.

    Parameters
    ----------
    periods, power : np.ndarray
        Period and power arrays
    noise_mean, noise_threshold : np.ndarray
        Red-noise mean and confidence threshold
    peaks_df : pd.DataFrame
        Table of detected peaks
    title : str
        Figure title
    save_path : Path, optional
        Output path

    Returns
    -------
    fig : plt.Figure
    """
    fig, ax = plt.subplots(figsize=(14, 7))

    # Power spectrum (log-log axes)
    ax.loglog(periods, power, color="#2196F3", linewidth=0.6, alpha=0.8, label="Power Spectrum")

    # Red-noise baseline
    ax.loglog(periods, noise_mean, color="#FF9800", linewidth=1.5,
              linestyle="--", label="AR(1) Red Noise Mean")

    # 95% confidence band
    ax.fill_between(periods, 0, noise_threshold,
                    alpha=0.15, color="#FF9800", label="95% Confidence Band")
    ax.loglog(periods, noise_threshold, color="#FF5722", linewidth=1.0,
              linestyle=":", alpha=0.7, label="95% Confidence Threshold")

    # 5x noise-threshold line
    noise_5x = noise_mean * PEAK_THRESHOLD_RATIO
    ax.loglog(periods, noise_5x, color="#F44336", linewidth=1.0,
              linestyle="-.", alpha=0.5, label=f"{PEAK_THRESHOLD_RATIO:.0f}x Noise Threshold")

    # Peak annotations
    if not peaks_df.empty:
        for _, row in peaks_df.iterrows():
            period_d = row["period_days"]
            pwr = row["power"]
            snr = row["snr"]

            ax.plot(period_d, pwr, "rv", markersize=10, zorder=5)

            # Human-readable period label
            if period_d >= 365:
                label_str = f"{period_d / 365:.1f}y (SNR={snr:.1f})"
            elif period_d >= 30:
                label_str = f"{period_d:.0f}d (SNR={snr:.1f})"
            else:
                label_str = f"{period_d:.1f}d (SNR={snr:.1f})"

            ax.annotate(
                label_str,
                xy=(period_d, pwr),
                xytext=(0, 15),
                textcoords="offset points",
                fontsize=8,
                fontweight="bold",
                color="#D32F2F",
                ha="center",
                arrowprops=dict(arrowstyle="-", color="#D32F2F", lw=0.5),
            )

    ax.set_xlabel("Period (days)", fontsize=12)
    ax.set_ylabel("Power", fontsize=12)
    ax.set_title(title, fontsize=14, fontweight="bold")
    ax.legend(loc="upper right", fontsize=9)
    ax.grid(True, which="both", alpha=0.3)

    # Mark key periods on the x-axis
    key_periods = [7, 14, 30, 60, 90, 180, 365, 730, 1460]
    ax.set_xticks(key_periods)
    ax.set_xticklabels([str(p) for p in key_periods], fontsize=8)
    ax.set_xlim(left=max(2, periods.min()), right=periods.max())

    plt.tight_layout()

    if save_path:
        fig.savefig(save_path, **SAVE_KW)
        print(f"  [saved] power-spectrum figure -> {save_path}")

    return fig

def plot_multi_timeframe(
    tf_results: Dict[str, dict],
    save_path: Optional[Path] = None,
) -> plt.Figure:
    """
    Multi-timeframe FFT spectrum comparison plot.

    Parameters
    ----------
    tf_results : dict
        Keys are timeframe labels; values are dicts with periods/power/noise_mean
    save_path : Path, optional
        Output path

    Returns
    -------
    fig : plt.Figure
    """
    n_tf = len(tf_results)
    fig, axes = plt.subplots(n_tf, 1, figsize=(14, 5 * n_tf), sharex=False)
    if n_tf == 1:
        axes = [axes]

    colors = ["#2196F3", "#4CAF50", "#9C27B0"]

    for ax, (label, data), color in zip(axes, tf_results.items(), colors):
        periods = data["periods"]
        power = data["power"]
        noise_mean = data["noise_mean"]

        ax.loglog(periods, power, color=color, linewidth=0.6, alpha=0.8,
                  label=f"{label} Spectrum")
        ax.loglog(periods, noise_mean, color="#FF9800", linewidth=1.2,
                  linestyle="--", alpha=0.7, label="AR(1) Noise")

        # Annotate peaks
        peaks_df = data.get("peaks", pd.DataFrame())
        if not peaks_df.empty:
            for _, row in peaks_df.head(5).iterrows():
                period_d = row["period_days"]
                pwr = row["power"]
                ax.plot(period_d, pwr, "rv", markersize=8, zorder=5)
                if period_d >= 365:
                    lbl = f"{period_d / 365:.1f}y"
                elif period_d >= 30:
                    lbl = f"{period_d:.0f}d"
                else:
                    lbl = f"{period_d:.1f}d"
                ax.annotate(lbl, xy=(period_d, pwr), xytext=(0, 10),
                            textcoords="offset points", fontsize=8,
                            color="#D32F2F", ha="center", fontweight="bold")

        ax.set_ylabel("Power", fontsize=11)
        ax.set_title(f"BTC FFT Spectrum - {label}", fontsize=12, fontweight="bold")
        ax.legend(loc="upper right", fontsize=9)
        ax.grid(True, which="both", alpha=0.3)

    axes[-1].set_xlabel("Period (days)", fontsize=12)
    plt.tight_layout()

    if save_path:
        fig.savefig(save_path, **SAVE_KW)
        print(f"  [saved] multi-timeframe comparison figure -> {save_path}")

    return fig

def plot_bandpass_components(
    dates: pd.DatetimeIndex,
    original_signal: np.ndarray,
    components: Dict[str, np.ndarray],
    save_path: Optional[Path] = None,
) -> plt.Figure:
    """
    Subplots of band-pass filtered components.

    Parameters
    ----------
    dates : pd.DatetimeIndex
        Date index
    original_signal : np.ndarray
        Original signal (log returns)
    components : dict
        Keys are period labels (e.g. "7d"); values are filtered signal arrays
    save_path : Path, optional
        Output path

    Returns
    -------
    fig : plt.Figure
    """
    n_comp = len(components) + 1  # +1 for the original signal
    fig, axes = plt.subplots(n_comp, 1, figsize=(14, 3 * n_comp), sharex=True)

    # Original signal
    axes[0].plot(dates, original_signal, color="#455A64", linewidth=0.5, alpha=0.8)
    axes[0].set_title("Original Log Returns", fontsize=11, fontweight="bold")
    axes[0].set_ylabel("Log Return", fontsize=9)
    axes[0].grid(True, alpha=0.3)

    # One subplot per periodic component
    colors_bp = ["#E91E63", "#2196F3", "#4CAF50", "#FF9800", "#9C27B0"]
    for i, ((label, comp), color) in enumerate(zip(components.items(), colors_bp)):
        ax = axes[i + 1]
        ax.plot(dates, comp, color=color, linewidth=0.8, alpha=0.9)
        ax.set_title(f"Bandpass Component: {label} cycle", fontsize=11, fontweight="bold")
        ax.set_ylabel("Amplitude", fontsize=9)
        ax.grid(True, alpha=0.3)

        # Show the component's share of total variance
        if np.var(original_signal) > 0:
            var_ratio = np.var(comp) / np.var(original_signal) * 100
            ax.text(0.02, 0.92, f"Variance ratio: {var_ratio:.2f}%",
                    transform=ax.transAxes, fontsize=9,
                    bbox=dict(boxstyle="round,pad=0.3", facecolor=color, alpha=0.15))

    axes[-1].set_xlabel("Date", fontsize=11)
    plt.tight_layout()

    if save_path:
        fig.savefig(save_path, **SAVE_KW)
        print(f"  [saved] band-pass component figure -> {save_path}")

    return fig


# ============================================================
# Single-timeframe FFT analysis pipeline
# ============================================================

def _analyze_single_timeframe(
    df: pd.DataFrame,
    sampling_period_days: float,
    label: str = "1d",
) -> dict:
    """
    Run the full FFT analysis for a single timeframe.

    Returns
    -------
    dict
        Contains freqs, periods, power, noise_mean, noise_threshold, peaks, log_ret, etc.
    """
    prices = df["close"].dropna()
    if len(prices) < 10:
        print(f"  [warn] {label}: insufficient data ({len(prices)} rows), skipping")
        return {}

    # Log returns
    log_ret = np.log(prices / prices.shift(1)).dropna().values

    # FFT spectrum (Hann window)
    freqs, periods, power = compute_fft_spectrum(
        log_ret, sampling_period_days, apply_window=True
    )

    if len(freqs) == 0:
        return {}

    # AR(1) red-noise baseline
    noise_mean, noise_threshold = ar1_red_noise_spectrum(
        log_ret, freqs, sampling_period_days, confidence_percentile=95.0
    )

    # Peak detection
    # Scale the minimum-period floor with the sampling period
    # (low-frequency data such as weekly bars get a higher floor)
    min_period = max(2.0, sampling_period_days * 3)
    peaks_df = detect_spectral_peaks(
        freqs, periods, power, noise_mean, noise_threshold,
        threshold_ratio=PEAK_THRESHOLD_RATIO,
        min_period_days=min_period,
    )

    return {
        "freqs": freqs,
        "periods": periods,
        "power": power,
        "noise_mean": noise_mean,
        "noise_threshold": noise_threshold,
        "peaks": peaks_df,
        "log_ret": log_ret,
        "label": label,
    }


# ============================================================
# Main entry point
# ============================================================

def run_fft_analysis(
    df: pd.DataFrame,
    output_dir: str,
) -> Dict:
    """
    Main entry point for BTC price FFT spectral analysis.

    Runs the following analyses and saves the visualizations:
    1. FFT spectrum of daily log returns (Hann window + AR(1) red-noise baseline)
    2. Power-spectrum peak detection (5x noise threshold)
    3. Multi-timeframe (4h/1d/1w) spectrum comparison
    4. Band-pass extraction of key periodic components (7d/30d/90d/365d/1400d)

    Parameters
    ----------
    df : pd.DataFrame
        Daily OHLC data with a DatetimeIndex and a close column
    output_dir : str
        Output directory for figures

    Returns
    -------
    dict
        Summary of results:
        - daily_peaks: table of significant daily-cycle peaks
        - multi_tf_peaks: dict of peaks per timeframe
        - bandpass_variance_ratios: variance share of each band-pass component
        - ar1_rho: AR(1) autocorrelation coefficient
    """
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    print("=" * 70)
    print("BTC FFT spectral analysis")
    print("=" * 70)

    # ----------------------------------------------------------
    # Part 1: FFT analysis of daily log returns
    # ----------------------------------------------------------
    print("\n[1/4] FFT analysis of daily log returns (Hann window)")
    daily_result = _analyze_single_timeframe(df, sampling_period_days=1.0, label="1d")

    if not daily_result:
        print("  [error] daily analysis failed: insufficient data")
        return {}

    log_ret = daily_result["log_ret"]
    periods = daily_result["periods"]
    power = daily_result["power"]
    noise_mean = daily_result["noise_mean"]
    noise_threshold = daily_result["noise_threshold"]
    peaks_df = daily_result["peaks"]

    # Report the AR(1) parameter: rho = autocov(1) / autocov(0)
    signal_centered = log_ret - np.mean(log_ret)
    autocov_0 = np.sum(signal_centered ** 2) / len(log_ret)
    autocov_1 = np.sum(signal_centered[:-1] * signal_centered[1:]) / len(log_ret)
    ar1_rho = autocov_1 / autocov_0 if autocov_0 > 0 else 0.0
    print(f"  AR(1) autocorrelation rho = {ar1_rho:.4f}")
    print(f"  Series length: {len(log_ret)} trading days")
    print(f"  Frequency resolution: {1.0 / len(log_ret):.6f} cycles/day (longest resolvable period: {len(log_ret):.0f} days)")

    # Report significant peaks
    if not peaks_df.empty:
        print(f"\n  Detected {len(peaks_df)} significant cycle peaks (SNR > {PEAK_THRESHOLD_RATIO:.0f}x):")
        print("  " + "-" * 60)
        print(f"  {'Period (d)':>10} | {'Period':>12} | {'SNR':>8} | {'Power':>12}")
        print("  " + "-" * 60)
        for _, row in peaks_df.iterrows():
            pd_days = row["period_days"]
            snr = row["snr"]
            pwr = row["power"]
            if pd_days >= 365:
                human_period = f"{pd_days / 365:.1f} y"
            elif pd_days >= 30:
                human_period = f"{pd_days / 30:.1f} mo"
            else:
                human_period = f"{pd_days:.1f} d"
            print(f"  {pd_days:>10.1f} | {human_period:>12} | {snr:>8.2f} | {pwr:>12.6e}")
        print("  " + "-" * 60)
    else:
        print("  No cycle peaks significantly above the red-noise baseline were detected")

    # Power-spectrum figure
    fig_spectrum = plot_power_spectrum(
        periods, power, noise_mean, noise_threshold, peaks_df,
        title="BTC Daily Log Returns - FFT Power Spectrum (Hann Window)",
        save_path=output_path / "fft_power_spectrum.png",
    )
    plt.close(fig_spectrum)

    # ----------------------------------------------------------
    # Part 2: multi-timeframe FFT comparison
    # ----------------------------------------------------------
    print("\n[2/4] Multi-timeframe FFT comparison (4h / 1d / 1w)")
    tf_results = {}

    for interval, sp_days in MULTI_TF_INTERVALS.items():
        try:
            if interval == "1d":
                tf_df = df
            else:
                tf_df = load_klines(interval)
            result = _analyze_single_timeframe(tf_df, sp_days, label=interval)
            if result:
                tf_results[interval] = result
                n_peaks = len(result["peaks"]) if not result["peaks"].empty else 0
                print(f"  {interval}: {len(result['log_ret'])} samples, {n_peaks} significant peaks")
        except FileNotFoundError:
            print(f"  [warn] {interval}: data file not found, skipping")
        except Exception as e:
            print(f"  [warn] {interval}: analysis failed: {e}")

    # Multi-timeframe comparison figure
    if len(tf_results) > 1:
        fig_mtf = plot_multi_timeframe(
            tf_results,
            save_path=output_path / "fft_multi_timeframe.png",
        )
        plt.close(fig_mtf)
    else:
        print("  [warn] not enough usable timeframes, skipping comparison figure")

    # ----------------------------------------------------------
    # Part 3: band-pass extraction of periodic components
    # ----------------------------------------------------------
    print(f"\n[3/4] Band-pass extraction of periodic components: {BANDPASS_PERIODS_DAYS}")
    prices = df["close"].dropna()
    dates = prices.index[1:]  # align with log_ret (differencing drops one point)
    # Make sure dates and log_ret have the same length
    if len(dates) > len(log_ret):
        dates = dates[:len(log_ret)]
    elif len(dates) < len(log_ret):
        log_ret = log_ret[:len(dates)]

    components = {}
    variance_ratios = {}
    original_var = np.var(log_ret)

    for period_days in BANDPASS_PERIODS_DAYS:
        # Nyquist check: the target period must exceed twice the sampling period
        if period_days < 2.0 * 1.0:
            print(f"  [skip] {period_days}d period is below the Nyquist limit")
            continue
        # The signal must cover at least two full cycles
        if len(log_ret) < period_days * 2:
            print(f"  [skip] {period_days}d period: series too short ({len(log_ret)} < {period_days * 2:.0f})")
            continue

        filtered = bandpass_filter(
            log_ret,
            sampling_period_days=1.0,
            center_period_days=float(period_days),
            bandwidth_ratio=0.3,
            order=4,
        )

        label = f"{period_days}d"
        components[label] = filtered
        var_ratio = np.var(filtered) / original_var * 100 if original_var > 0 else 0
        variance_ratios[label] = var_ratio
        print(f"  {label:>6} component variance share: {var_ratio:.3f}%")

    # Band-pass component figure
    if components:
        fig_bp = plot_bandpass_components(
            dates, log_ret, components,
            save_path=output_path / "fft_bandpass_components.png",
        )
        plt.close(fig_bp)
    else:
        print("  [warn] no valid band-pass components to plot")

    # ----------------------------------------------------------
    # Part 4: summary
    # ----------------------------------------------------------
    print("\n[4/4] Analysis summary")

    # Collect multi-timeframe peaks
    multi_tf_peaks = {}
    for tf_label, tf_data in tf_results.items():
        if not tf_data["peaks"].empty:
            multi_tf_peaks[tf_label] = tf_data["peaks"]

    # Cross-timeframe consistency check
    print("\n  Cross-timeframe cycle consistency check:")
    if len(multi_tf_peaks) >= 2:
        # Collect every detected cycle
        all_detected_periods = []
        for tf_label, p_df in multi_tf_peaks.items():
            for _, row in p_df.iterrows():
                all_detected_periods.append({
                    "timeframe": tf_label,
                    "period_days": row["period_days"],
                    "snr": row["snr"],
                })

        if all_detected_periods:
            all_periods_df = pd.DataFrame(all_detected_periods)
            # Group cycles within a 20% tolerance and look for ones confirmed by multiple timeframes
            confirmed = []
            used = set()
            for i, row_i in all_periods_df.iterrows():
                if i in used:
                    continue
                p_i = row_i["period_days"]
                group = [row_i]
                used.add(i)
                for j, row_j in all_periods_df.iterrows():
                    if j in used:
                        continue
                    if row_j["timeframe"] != row_i["timeframe"]:
                        if abs(row_j["period_days"] - p_i) / p_i < 0.2:
                            group.append(row_j)
                            used.add(j)
                if len(group) > 1:
                    tfs = [g["timeframe"] for g in group]
                    avg_period = np.mean([g["period_days"] for g in group])
                    avg_snr = np.mean([g["snr"] for g in group])
                    confirmed.append({
                        "period_days": avg_period,
                        "confirmed_by": tfs,
                        "avg_snr": avg_snr,
                    })

            if confirmed:
                for c in confirmed:
                    tfs_str = " & ".join(c["confirmed_by"])
                    print(f"  {c['period_days']:.1f}d cycle confirmed jointly by {tfs_str} (mean SNR={c['avg_snr']:.2f})")
            else:
                print("  No cycle was consistently confirmed across timeframes")
        else:
            print("  No significant peaks were detected on any timeframe")
    else:
        print("  Not enough usable timeframes for a consistency check")

    print("\n" + "=" * 70)
    print("FFT analysis complete")
    print(f"Figures saved to: {output_path.resolve()}")
    print("=" * 70)

    # ----------------------------------------------------------
    # Result dictionary
    # ----------------------------------------------------------
    results = {
        "daily_peaks": peaks_df,
        "multi_tf_peaks": multi_tf_peaks,
        "bandpass_variance_ratios": variance_ratios,
        "bandpass_components": components,
        "ar1_rho": ar1_rho,
        "daily_spectrum": {
            "freqs": daily_result["freqs"],
            "periods": daily_result["periods"],
            "power": daily_result["power"],
            "noise_mean": daily_result["noise_mean"],
            "noise_threshold": daily_result["noise_threshold"],
        },
        "multi_tf_results": tf_results,
    }

    return results


# ============================================================
# Standalone entry point
# ============================================================

if __name__ == "__main__":
    from src.data_loader import load_daily

    print("Loading BTC daily data...")
    df = load_daily()
    print(f"Data range: {df.index.min()} ~ {df.index.max()}, {len(df)} rows")

    results = run_fft_analysis(df, output_dir="output/fft")
645
src/fractal_analysis.py
Normal file
@@ -0,0 +1,645 @@
"""
Fractal dimension and self-similarity analysis
==============================================
Computes the fractal dimension of the BTC price series via box counting
and compares it against Monte Carlo random-walk simulations to test
whether BTC prices have significantly different fractal characteristics.

Core features:
- Box-counting dimension
- Monte Carlo comparison (Z-test)
- Multi-scale self-similarity analysis
"""

import matplotlib
matplotlib.use('Agg')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
from typing import Tuple, Dict, List, Optional
from scipy import stats

import sys
sys.path.insert(0, str(Path(__file__).parent.parent))
from src.data_loader import load_klines
from src.preprocessing import log_returns


# ============================================================
# Box-counting dimension
# ============================================================
def box_counting_dimension(prices: np.ndarray,
                           num_scales: int = 30,
                           min_boxes: int = 5,
                           max_boxes: int = None) -> Tuple[float, np.ndarray, np.ndarray]:
    """
    Box-counting estimate of the fractal dimension of a price series.

    Method:
    1. Normalize the price curve into the [0,1] x [0,1] unit square
    2. Count the boxes needed to cover the curve at different box sizes
    3. Regress log(count) on log(1/scale); the slope is the fractal dimension

    Parameters
    ----------
    prices : np.ndarray
        Price series
    num_scales : int
        Number of scales
    min_boxes : int
        Minimum number of boxes per side
    max_boxes : int, optional
        Maximum number of boxes per side; defaults to a quarter of the series length

    Returns
    -------
    D : float
        Box-counting fractal dimension
    log_inv_scales : np.ndarray
        log(1/scale) values
    log_counts : np.ndarray
        log(count) values
    """
    n = len(prices)
    if max_boxes is None:
        max_boxes = n // 4

    # Step 1: normalize into [0,1] x [0,1]
    # x axis: normalized time
    x = np.linspace(0, 1, n)
    # y axis: normalized price
    y = (prices - prices.min()) / (prices.max() - prices.min())

    # Step 2: count occupied boxes at each scale
    # Log-uniformly spaced numbers of boxes per side
    box_counts_list = np.unique(
        np.logspace(np.log10(min_boxes), np.log10(max_boxes), num=num_scales).astype(int)
    )

    log_inv_scales = []
    log_counts = []

    for num_boxes_per_side in box_counts_list:
        if num_boxes_per_side < 2:
            continue

        # Box size (in normalized space)
        box_size = 1.0 / num_boxes_per_side

        # Box index of each data point
        # x direction: time bins
        x_box = np.floor(x / box_size).astype(int)
        x_box = np.clip(x_box, 0, num_boxes_per_side - 1)

        # y direction: price bins
        y_box = np.floor(y / box_size).astype(int)
        y_box = np.clip(y_box, 0, num_boxes_per_side - 1)

        # Also count the boxes crossed by the segments between adjacent points
        occupied = set()
        for i in range(n):
            occupied.add((x_box[i], y_box[i]))

        # If adjacent points fall in different boxes, interpolate along the segment
        for i in range(n - 1):
            if x_box[i] == x_box[i + 1] and y_box[i] == y_box[i + 1]:
                continue

            # Linear interpolation over every box the segment crosses
            steps = max(abs(x_box[i + 1] - x_box[i]), abs(y_box[i + 1] - y_box[i])) + 1
            if steps <= 1:
                continue

            for t in np.linspace(0, 1, steps + 1):
                xi = x[i] + t * (x[i + 1] - x[i])
                yi = y[i] + t * (y[i + 1] - y[i])
                bx = int(np.clip(np.floor(xi / box_size), 0, num_boxes_per_side - 1))
                by = int(np.clip(np.floor(yi / box_size), 0, num_boxes_per_side - 1))
                occupied.add((bx, by))

        count = len(occupied)
        if count > 0:
            log_inv_scales.append(np.log(1.0 / box_size))
            log_counts.append(np.log(count))

    log_inv_scales = np.array(log_inv_scales)
    log_counts = np.array(log_counts)

    # Step 3: linear regression
    if len(log_inv_scales) < 3:
        return 1.5, log_inv_scales, log_counts

    coeffs = np.polyfit(log_inv_scales, log_counts, 1)
    D = coeffs[0]  # the slope is the fractal dimension

    return D, log_inv_scales, log_counts

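
# --- Illustrative sanity check (not part of the original module) ---------
# Box counting on two reference curves: a straight line should give D near
# 1.0, and a Gaussian random-walk path should land near the Brownian value
# 1.5. The helper name `_demo_box_counting` is hypothetical.
def _demo_box_counting() -> Tuple[float, float]:
    rng = np.random.RandomState(0)
    line = np.linspace(0.0, 1.0, 2000)     # smooth curve, expect D ~ 1.0
    walk = np.cumsum(rng.randn(2000))      # Brownian path, expect D ~ 1.5
    d_line, _, _ = box_counting_dimension(line)
    d_walk, _, _ = box_counting_dimension(walk)
    return d_line, d_walk
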
# ============================================================
# Monte Carlo comparison
# ============================================================
def generate_random_walk(n: int, seed: Optional[int] = None) -> np.ndarray:
    """
    Generate a random walk with the same length as the BTC price series.

    Parameters
    ----------
    n : int
        Series length
    seed : int, optional
        Random seed

    Returns
    -------
    np.ndarray
        Random-walk price series
    """
    if seed is not None:
        rng = np.random.RandomState(seed)
    else:
        rng = np.random.RandomState()

    # Standard-normal increments
    increments = rng.randn(n - 1)
    # Cumulative sum gives the walk
    walk = np.cumsum(increments)
    # Shift to a positive starting level to avoid negative prices
    walk = walk - walk.min() + 1.0
    return walk


def monte_carlo_fractal_test(prices: np.ndarray, n_simulations: int = 100,
                             seed: int = 42) -> Dict:
    """
    Monte Carlo test of whether the BTC fractal dimension deviates
    significantly from a pure random walk.

    Method:
    1. Generate n_simulations random walks
    2. Compute the fractal dimension of each
    3. Z-test the BTC dimension against that distribution

    Parameters
    ----------
    prices : np.ndarray
        BTC price series
    n_simulations : int
        Number of simulations (default 100)
    seed : int
        Random seed (for reproducibility)

    Returns
    -------
    dict
        BTC fractal dimension, the random-walk dimension distribution,
        and the Z-test result
    """
    n = len(prices)

    # BTC fractal dimension
    print(f"  Computing the BTC fractal dimension...")
    d_btc, _, _ = box_counting_dimension(prices)
    print(f"  BTC fractal dimension: {d_btc:.4f}")

    # Monte Carlo simulations
    print(f"  Running {n_simulations} random-walk simulations...")
    d_random = []
    for i in range(n_simulations):
        if (i + 1) % 20 == 0:
            print(f"    progress: {i + 1}/{n_simulations}")
        rw = generate_random_walk(n, seed=seed + i)
        d_rw, _, _ = box_counting_dimension(rw)
        d_random.append(d_rw)

    d_random = np.array(d_random)

    # Z-test: BTC dimension vs the random-walk dimension distribution
    mean_rw = np.mean(d_random)
    std_rw = np.std(d_random, ddof=1)

    if std_rw > 0:
        z_score = (d_btc - mean_rw) / std_rw
        # Two-sided p-value
        p_value = 2 * (1 - stats.norm.cdf(abs(z_score)))
    else:
        z_score = np.nan
        p_value = np.nan

    result = {
        'BTC分形维数': d_btc,
        '随机游走均值': mean_rw,
        '随机游走标准差': std_rw,
        '随机游走范围': (d_random.min(), d_random.max()),
        'Z统计量': z_score,
        'p值': p_value,
        '显著性(α=0.05)': p_value < 0.05 if not np.isnan(p_value) else False,
        '随机游走分形维数': d_random,
    }

    return result


# ============================================================
# Multi-scale self-similarity analysis
# ============================================================
def multi_scale_self_similarity(prices: np.ndarray,
                                scales: List[int] = None) -> Dict:
    """
    Multi-scale self-similarity analysis: compare summary statistics
    across aggregation levels.

    Method:
    Aggregate the price series at several scales and compare the moments
    of the return distribution. A self-similar series keeps consistent
    (rescaled) statistics across scales.

    Parameters
    ----------
    prices : np.ndarray
        Price series
    scales : list of int
        Aggregation scales, default [1, 2, 5, 10, 20, 50]

    Returns
    -------
    dict
        Summary statistics per scale
    """
    if scales is None:
        scales = [1, 2, 5, 10, 20, 50]

    results = {}

    for scale in scales:
        # Aggregate by keeping every scale-th point
        aggregated = prices[::scale]
        if len(aggregated) < 30:
            continue

        # Log returns of the aggregated series
        returns = np.diff(np.log(aggregated))
        if len(returns) < 10:
            continue

        results[scale] = {
            '样本量': len(returns),
            '均值': np.mean(returns),
            '标准差': np.std(returns),
            '偏度': float(stats.skew(returns)),
            '峰度': float(stats.kurtosis(returns)),
            # Volatility scaling: with Hurst exponent H, std(scale) ∝ scale^H
            '标准差(原始)': np.std(returns),
        }

    # Scaling exponent: slope of log(std) vs log(scale)
    valid_scales = sorted(results.keys())
    if len(valid_scales) >= 3:
        log_scales = np.log(valid_scales)
        log_stds = np.log([results[s]['标准差'] for s in valid_scales])
        scaling_exponent = np.polyfit(log_scales, log_stds, 1)[0]
        scaling_result = {
            '缩放指数(H估计)': scaling_exponent,
            '各尺度统计': results,
        }
    else:
        scaling_result = {
            '缩放指数(H估计)': np.nan,
            '各尺度统计': results,
        }

    return scaling_result

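
# --- Illustrative sanity check (not part of the original module) ---------
# For a Gaussian random walk the volatility scaling law std(scale) ∝ scale^H
# holds with H = 0.5, so the estimated scaling exponent should sit close to
# 0.5. The helper name `_demo_scaling_exponent` is hypothetical.
def _demo_scaling_exponent() -> float:
    rng = np.random.RandomState(1)
    walk = np.cumsum(rng.randn(20000)) + 1e4   # shift keeps the series positive for log()
    return multi_scale_self_similarity(walk)['缩放指数(H估计)']
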
# ============================================================
# Visualization helpers
# ============================================================
def plot_box_counting(log_inv_scales: np.ndarray, log_counts: np.ndarray, D: float,
                      output_dir: Path, filename: str = "fractal_box_counting.png"):
    """Log-log plot for the box-counting fit."""
    fig, ax = plt.subplots(figsize=(10, 7))

    # Scatter of the raw counts
    ax.scatter(log_inv_scales, log_counts, color='steelblue', s=40, zorder=3,
               label='Box-counting data points')

    # Fitted line
    coeffs = np.polyfit(log_inv_scales, log_counts, 1)
    fit_line = np.polyval(coeffs, log_inv_scales)
    ax.plot(log_inv_scales, fit_line, 'r-', linewidth=2,
            label=f'Fit (D = {D:.4f})')

    # Reference line: D = 1.5 (pure random walk)
    ref_line = 1.5 * log_inv_scales + (log_counts[0] - 1.5 * log_inv_scales[0])
    ax.plot(log_inv_scales, ref_line, 'k--', alpha=0.5, linewidth=1,
            label='D=1.5 (random-walk reference)')

    ax.set_xlabel('log(1/ε) - inverse scale', fontsize=12)
    ax.set_ylabel('log(N(ε)) - box count', fontsize=12)
    ax.set_title(f'BTC box-counting analysis (fractal dimension D = {D:.4f})', fontsize=13)
    ax.legend(fontsize=11)
    ax.grid(True, alpha=0.3)

    fig.tight_layout()
    filepath = output_dir / filename
    fig.savefig(filepath, dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"  Saved: {filepath}")


def plot_monte_carlo(mc_results: Dict, output_dir: Path,
                     filename: str = "fractal_monte_carlo.png"):
    """Monte Carlo result: histogram of random-walk fractal dimensions vs BTC."""
    fig, ax = plt.subplots(figsize=(10, 7))

    d_random = mc_results['随机游走分形维数']
    d_btc = mc_results['BTC分形维数']

    # Histogram
    ax.hist(d_random, bins=20, density=True, alpha=0.7, color='steelblue',
            edgecolor='white', label=f'Random walks (n={len(d_random)})')

    # Vertical line for the BTC dimension
    ax.axvline(x=d_btc, color='red', linewidth=2.5, linestyle='-',
               label=f'BTC (D={d_btc:.4f})')

    # Vertical line for the random-walk mean
    ax.axvline(x=mc_results['随机游走均值'], color='blue', linewidth=1.5, linestyle='--',
               label=f'Random-walk mean (D={mc_results["随机游走均值"]:.4f})')

    # Overlay the fitted normal density
    x_range = np.linspace(d_random.min() - 0.05, d_random.max() + 0.05, 200)
    pdf = stats.norm.pdf(x_range, mc_results['随机游走均值'], mc_results['随机游走标准差'])
    ax.plot(x_range, pdf, 'b-', alpha=0.5, linewidth=1)

    # Annotate the test statistics
    info_text = (
        f"Z statistic: {mc_results['Z统计量']:.2f}\n"
        f"p-value: {mc_results['p值']:.4f}\n"
        f"Significant (α=0.05): {'yes' if mc_results['显著性(α=0.05)'] else 'no'}"
    )
    ax.text(0.02, 0.95, info_text, transform=ax.transAxes, fontsize=11,
            verticalalignment='top', bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.8))

    ax.set_xlabel('Fractal dimension D', fontsize=12)
    ax.set_ylabel('Probability density', fontsize=12)
    ax.set_title('BTC fractal dimension vs Monte Carlo random walks', fontsize=13)
    ax.legend(fontsize=11, loc='upper right')
    ax.grid(True, alpha=0.3)

    fig.tight_layout()
    filepath = output_dir / filename
    fig.savefig(filepath, dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"  Saved: {filepath}")


def plot_self_similarity(scaling_result: Dict, output_dir: Path,
                         filename: str = "fractal_self_similarity.png"):
    """Multi-scale self-similarity plot."""
    scale_stats = scaling_result['各尺度统计']
    if not scale_stats:
        print("  No self-similarity results to plot")
        return

    scales = sorted(scale_stats.keys())
    stds = [scale_stats[s]['标准差'] for s in scales]
    skews = [scale_stats[s]['偏度'] for s in scales]
    kurts = [scale_stats[s]['峰度'] for s in scales]

    fig, axes = plt.subplots(1, 3, figsize=(18, 6))

    # Panel 1: log(std) vs log(scale) - the scaling relation
    ax1 = axes[0]
    log_scales = np.log(scales)
    log_stds = np.log(stds)

    ax1.scatter(log_scales, log_stds, color='steelblue', s=60, zorder=3)

    if len(log_scales) >= 3:
        coeffs = np.polyfit(log_scales, log_stds, 1)
        fit_line = np.polyval(coeffs, log_scales)
        ax1.plot(log_scales, fit_line, 'r-', linewidth=2,
                 label=f'Fitted slope H≈{coeffs[0]:.4f}')

    # Reference line H = 0.5
    ref_line = 0.5 * log_scales + (log_stds[0] - 0.5 * log_scales[0])
    ax1.plot(log_scales, ref_line, 'k--', alpha=0.5, label='H=0.5 reference')

    ax1.set_xlabel('log(aggregation scale)', fontsize=11)
    ax1.set_ylabel('log(std)', fontsize=11)
    ax1.set_title('Scaling relation (std vs scale)', fontsize=12)
    ax1.legend(fontsize=10)
    ax1.grid(True, alpha=0.3)

    # Panel 2: skewness across scales
    ax2 = axes[1]
    ax2.bar(range(len(scales)), skews, color='coral', alpha=0.8)
    ax2.set_xticks(range(len(scales)))
    ax2.set_xticklabels([str(s) for s in scales])
    ax2.axhline(y=0, color='black', linestyle='--', alpha=0.5)
    ax2.set_xlabel('Aggregation scale', fontsize=11)
    ax2.set_ylabel('Skewness', fontsize=11)
    ax2.set_title('Skewness across scales', fontsize=12)
    ax2.grid(True, alpha=0.3, axis='y')

    # Panel 3: kurtosis across scales
    ax3 = axes[2]
    ax3.bar(range(len(scales)), kurts, color='seagreen', alpha=0.8)
    ax3.set_xticks(range(len(scales)))
    ax3.set_xticklabels([str(s) for s in scales])
    ax3.axhline(y=0, color='black', linestyle='--', alpha=0.5, label='Normal kurtosis = 0')
    ax3.set_xlabel('Aggregation scale', fontsize=11)
    ax3.set_ylabel('Excess kurtosis', fontsize=11)
    ax3.set_title('Kurtosis across scales', fontsize=12)
    ax3.legend(fontsize=10)
    ax3.grid(True, alpha=0.3, axis='y')

    fig.suptitle(f'BTC multi-scale self-similarity (scaling exponent H≈{scaling_result["缩放指数(H估计)"]:.4f})',
                 fontsize=14, y=1.02)
    fig.tight_layout()
    filepath = output_dir / filename
    fig.savefig(filepath, dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"  Saved: {filepath}")


# ============================================================
# Main entry point
# ============================================================
def run_fractal_analysis(df: pd.DataFrame, output_dir: str = "output/fractal") -> Dict:
    """
    Main entry point for the fractal dimension and self-similarity analysis.

    Parameters
    ----------
    df : pd.DataFrame
        OHLC data (needs a 'close' column and a DatetimeIndex)
    output_dir : str
        Output directory for figures

    Returns
    -------
    dict
        All analysis results
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    results = {}

    print("=" * 70)
    print("Fractal dimension and self-similarity analysis")
    print("=" * 70)

    # ----------------------------------------------------------
    # 1. Prepare the data
    # ----------------------------------------------------------
    prices = df['close'].dropna().values

    print(f"\nData overview:")
    print(f"  Time range: {df.index.min()} ~ {df.index.max()}")
    print(f"  Series length: {len(prices)}")
    print(f"  Price range: {prices.min():.2f} ~ {prices.max():.2f}")

    # ----------------------------------------------------------
    # 2. Box-counting fractal dimension
    # ----------------------------------------------------------
    print("\n" + "-" * 50)
    print("[1] Box-counting dimension")
    print("-" * 50)

    D, log_inv_scales, log_counts = box_counting_dimension(prices)
    results['盒计数分形维数'] = D

    print(f"  BTC fractal dimension: D = {D:.4f}")
    print(f"  Reference values:")
    print(f"    D = 1.0: smooth curve (fully predictable)")
    print(f"    D = 1.5: pure random walk (Brownian motion)")
    print(f"    D = 2.0: plane-filling (extremely irregular)")

    if D < 1.3:
        interpretation = "Very smooth series; likely strong trending behavior"
    elif D < 1.45:
        interpretation = "Fairly smooth series with some trend persistence"
    elif D < 1.55:
        interpretation = "Series close to a random walk"
    elif D < 1.7:
        interpretation = "Fairly rough series with some mean-reversion tendency"
    else:
        interpretation = "Highly irregular, very volatile series"

    print(f"  BTC reading: {interpretation}")
    results['维数解读'] = interpretation

    # Relation between fractal dimension and Hurst exponent: D = 2 - H
    h_from_d = 2.0 - D
    print(f"\n  Hurst exponent implied by D (D = 2 - H):")
    print(f"    H ≈ {h_from_d:.4f}")
    results['Hurst(从D推算)'] = h_from_d

    # Box-counting log-log plot
    plot_box_counting(log_inv_scales, log_counts, D, output_dir)

    # ----------------------------------------------------------
    # 3. Monte Carlo comparison
    # ----------------------------------------------------------
    print("\n" + "-" * 50)
    print("[2] Monte Carlo comparison (100 random walks)")
    print("-" * 50)

    mc_results = monte_carlo_fractal_test(prices, n_simulations=100, seed=42)
    results['蒙特卡洛检验'] = {
        k: v for k, v in mc_results.items() if k != '随机游走分形维数'
    }

    print(f"\n  Summary:")
    print(f"  BTC fractal dimension: D = {mc_results['BTC分形维数']:.4f}")
    print(f"  Random-walk mean: D = {mc_results['随机游走均值']:.4f} ± {mc_results['随机游走标准差']:.4f}")
    print(f"  Random-walk range: [{mc_results['随机游走范围'][0]:.4f}, {mc_results['随机游走范围'][1]:.4f}]")
    print(f"  Z statistic: {mc_results['Z统计量']:.4f}")
    print(f"  p-value: {mc_results['p值']:.6f}")
    print(f"  Significant (α=0.05): {'yes - BTC differs significantly from a random walk' if mc_results['显著性(α=0.05)'] else 'no - cannot reject the random-walk hypothesis'}")

    # Monte Carlo figure
    plot_monte_carlo(mc_results, output_dir)

    # ----------------------------------------------------------
    # 4. Multi-scale self-similarity analysis
    # ----------------------------------------------------------
    print("\n" + "-" * 50)
    print("[3] Multi-scale self-similarity analysis")
    print("-" * 50)

    scaling_result = multi_scale_self_similarity(prices, scales=[1, 2, 5, 10, 20, 50])
    results['多尺度自相似性'] = {
        k: v for k, v in scaling_result.items() if k != '各尺度统计'
    }
    results['多尺度自相似性']['缩放指数(H估计)'] = scaling_result['缩放指数(H估计)']

    print(f"\n  Scaling exponent (H from the volatility scaling law): {scaling_result['缩放指数(H估计)']:.4f}")
    print(f"  Per-scale statistics:")
    for scale, stat in sorted(scaling_result['各尺度统计'].items()):
        print(f"    scale={scale:3d}: n={stat['样本量']:5d}, "
              f"std={stat['标准差']:.6f}, "
              f"skew={stat['偏度']:.4f}, "
              f"kurt={stat['峰度']:.4f}")

    # Self-similarity verdict
    scale_stats = scaling_result['各尺度统计']
    if scale_stats:
        valid_scales = sorted(scale_stats.keys())
        if len(valid_scales) >= 2:
            kurts = [scale_stats[s]['峰度'] for s in valid_scales]
            # If kurtosis decays toward 0 (normal) as the scale grows, the aggregate tends to normality
            if all(k > 1.0 for k in kurts):
                print("\n  Self-similarity verdict: excess kurtosis (fat tails) at every scale,")
                print("  so BTC returns deviate from normality at all scales - a fractal signature")
            elif kurts[-1] < kurts[0] * 0.5:
                print("\n  Self-similarity verdict: kurtosis drops sharply as the aggregation scale grows,")
                print("  so returns approach normality at large scales; self-similarity is limited")
            else:
                print("\n  Self-similarity verdict: kurtosis varies little across scales - some self-similarity")

    # Self-similarity figure
    plot_self_similarity(scaling_result, output_dir)

    # ----------------------------------------------------------
    # 5. Summary
    # ----------------------------------------------------------
    print("\n" + "=" * 70)
    print("Analysis summary")
    print("=" * 70)
    print(f"  Box-counting fractal dimension: D = {D:.4f}")
    print(f"  Hurst exponent implied by D: H = {h_from_d:.4f}")
    print(f"  Reading: {interpretation}")
    print(f"\n  Monte Carlo test:")
    if mc_results['显著性(α=0.05)']:
        print(f"  The BTC fractal dimension differs significantly from a pure random walk (p={mc_results['p值']:.6f})")
        if D < mc_results['随机游走均值']:
            print(f"  BTC's D ({D:.4f}) < random-walk D ({mc_results['随机游走均值']:.4f}),")
            print("  i.e. BTC prices are 'smoother' than a pure random walk - trend persistence")
        else:
            print(f"  BTC's D ({D:.4f}) > random-walk D ({mc_results['随机游走均值']:.4f}),")
            print("  i.e. BTC prices are 'rougher' than a pure random walk - mean reversion")
    else:
        print(f"  Cannot reject the random-walk hypothesis for BTC at the 5% level (p={mc_results['p值']:.6f})")

    print(f"\n  Volatility scaling exponent: H ≈ {scaling_result['缩放指数(H估计)']:.4f}")
    print(f"    H > 0.5: volatility grows super-linearly → trend persistence")
    print(f"    H < 0.5: volatility grows sub-linearly → mean reversion")
    print(f"    H ≈ 0.5: volatility grows linearly → random walk")

    print(f"\n  Figures saved to: {output_dir.resolve()}")
    print("=" * 70)

    return results


# ============================================================
# Standalone entry point
# ============================================================
if __name__ == "__main__":
    from data_loader import load_daily

    print("Loading BTC daily data...")
    df = load_daily()
    print(f"Data loaded: {len(df)} rows")

    results = run_fractal_analysis(df, output_dir="output/fractal")
546
src/halving_analysis.py
Normal file
@@ -0,0 +1,546 @@
"""BTC halving-cycle analysis module: price behavior, volatility, and cumulative returns around halvings."""

import matplotlib
matplotlib.use('Agg')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
from pathlib import Path
from scipy import stats

# Font configuration so CJK glyphs render in figures
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei', 'DejaVu Sans']
plt.rcParams['axes.unicode_minus'] = False

# BTC halving dates (the two halvings inside the 2017-2026 data range)
HALVING_DATES = [
    pd.Timestamp('2020-05-11'),
    pd.Timestamp('2024-04-20'),
]
HALVING_LABELS = ['Third halving (2020-05-11)', 'Fourth halving (2024-04-20)']

# Analysis window: 500 days before and after each halving
WINDOW_DAYS = 500


def _extract_halving_window(df: pd.DataFrame, halving_date: pd.Timestamp,
                            window: int = WINDOW_DAYS):
    """
    Extract the data window around a halving date.

    Parameters
    ----------
    df : pd.DataFrame
        Daily data (DatetimeIndex, with close and log_return columns)
    halving_date : pd.Timestamp
        Halving date
    window : int
        Number of days on each side

    Returns
    -------
    pd.DataFrame
        Window data with an extra 'days_from_halving' column (halving day = 0)
    """
    start = halving_date - pd.Timedelta(days=window)
    end = halving_date + pd.Timedelta(days=window)
    mask = (df.index >= start) & (df.index <= end)
    window_df = df.loc[mask].copy()

    # Days relative to the halving date
    window_df['days_from_halving'] = (window_df.index - halving_date).days
    return window_df


def _normalize_price(window_df: pd.DataFrame, halving_date: pd.Timestamp):
    """
    Normalize prices so the halving-day price equals 100.

    Parameters
    ----------
    window_df : pd.DataFrame
        Window data (with a close column)
    halving_date : pd.Timestamp
        Halving date

    Returns
    -------
    pd.Series
        Normalized price series (halving day = 100)
    """
    # Nearest trading day to the halving date
    idx = window_df.index.get_indexer([halving_date], method='nearest')[0]
    base_price = window_df['close'].iloc[idx]
    return (window_df['close'] / base_price) * 100

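
# --- Illustrative usage sketch (not part of the original module) ---------
# Window extraction plus normalization on a toy series: prices rise linearly
# from 100 to 200 over 11 days around a fake "halving" date, and the
# normalized series must equal exactly 100 on the halving day itself.
# The helper name `_demo_halving_window` is hypothetical.
def _demo_halving_window() -> pd.Series:
    idx = pd.date_range('2024-04-15', periods=11, freq='D')
    toy = pd.DataFrame({'close': np.linspace(100.0, 200.0, 11)}, index=idx)
    w = _extract_halving_window(toy, pd.Timestamp('2024-04-20'), window=5)
    return _normalize_price(w, pd.Timestamp('2024-04-20'))
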
def analyze_normalized_trajectories(windows: list, output_dir: Path):
    """
    Plot the overlaid normalized price trajectories.

    Parameters
    ----------
    windows : list[dict]
        Each element holds 'df', 'normalized', 'label', 'halving_date'
    output_dir : Path
        Directory for the figure
    """
    print("\n" + "-" * 60)
    print("[Overlaid normalized price trajectories]")
    print("-" * 60)

    fig, ax = plt.subplots(figsize=(14, 7))
    colors = ['#2980b9', '#e74c3c']
    linestyles = ['-', '--']

    for i, w in enumerate(windows):
        days = w['df']['days_from_halving']
        normalized = w['normalized']
        ax.plot(days, normalized, color=colors[i], linestyle=linestyles[i],
                linewidth=1.5, label=w['label'], alpha=0.85)

    ax.axvline(x=0, color='gold', linestyle='-', linewidth=2,
               alpha=0.8, label='Halving day')
    ax.axhline(y=100, color='grey', linestyle=':', alpha=0.4)

    ax.set_title('BTC halving cycles - overlaid normalized price trajectories (halving day = 100)', fontsize=14)
    ax.set_xlabel(f'Days from halving (±{WINDOW_DAYS} days)')
    ax.set_ylabel('Normalized price')
    ax.legend(fontsize=11)
    ax.grid(True, alpha=0.3)

    fig_path = output_dir / 'halving_normalized_trajectories.png'
    fig.savefig(fig_path, dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"Figure saved: {fig_path}")


def analyze_pre_post_returns(windows: list, output_dir: Path):
    """
    Compare mean returns before and after each halving using Welch's t-test.

    Parameters
    ----------
    windows : list[dict]
        Window data list
    output_dir : Path
        Directory for the figure
    """
    print("\n" + "-" * 60)
    print("[Pre- vs post-halving returns & Welch's t-test]")
    print("-" * 60)

    all_pre_returns = []
    all_post_returns = []

    for w in windows:
        df_w = w['df']
        pre = df_w.loc[df_w['days_from_halving'] < 0, 'log_return'].dropna()
        post = df_w.loc[df_w['days_from_halving'] > 0, 'log_return'].dropna()
        all_pre_returns.append(pre)
        all_post_returns.append(post)

        print(f"\n{w['label']}:")
        print(f"  {WINDOW_DAYS} days pre-halving:  mean={pre.mean():.6f}, std={pre.std():.6f}, "
              f"median={pre.median():.6f}, N={len(pre)}")
        print(f"  {WINDOW_DAYS} days post-halving: mean={post.mean():.6f}, std={post.std():.6f}, "
              f"median={post.median():.6f}, N={len(post)}")

        # Per-cycle Welch's t-test
        if len(pre) >= 3 and len(post) >= 3:
            t_stat, p_val = stats.ttest_ind(pre, post, equal_var=False)
            print(f"  Welch's t-test: t={t_stat:.4f}, p={p_val:.6f}")
            if p_val < 0.05:
                print("  => pre- and post-halving returns differ significantly at the 5% level")
            else:
                print("  => no significant pre/post difference at the 5% level")

    # Pool all cycles for an overall test
    combined_pre = pd.concat(all_pre_returns)
    combined_post = pd.concat(all_post_returns)
    print("\n--- All halving cycles pooled ---")
    print(f"  Pooled pre-halving:  mean={combined_pre.mean():.6f}, N={len(combined_pre)}")
    print(f"  Pooled post-halving: mean={combined_post.mean():.6f}, N={len(combined_post)}")
    t_stat_all, p_val_all = stats.ttest_ind(combined_pre, combined_post, equal_var=False)
    print(f"  Pooled Welch's t-test: t={t_stat_all:.4f}, p={p_val_all:.6f}")

    # --- Figure: pre/post return bars with confidence intervals ---
    fig, axes = plt.subplots(1, len(windows), figsize=(7 * len(windows), 6))
    if len(windows) == 1:
        axes = [axes]

    for i, w in enumerate(windows):
        df_w = w['df']
        pre = df_w.loc[df_w['days_from_halving'] < 0, 'log_return'].dropna()
        post = df_w.loc[df_w['days_from_halving'] > 0, 'log_return'].dropna()

        means = [pre.mean(), post.mean()]
        # 95% confidence intervals
        ci_pre = stats.t.interval(0.95, len(pre) - 1, loc=pre.mean(), scale=pre.sem())
        ci_post = stats.t.interval(0.95, len(post) - 1, loc=post.mean(), scale=post.sem())
        errors = [
            [means[0] - ci_pre[0], means[1] - ci_post[0]],
            [ci_pre[1] - means[0], ci_post[1] - means[1]],
        ]

        colors_bar = ['#3498db', '#e67e22']
        axes[i].bar(['Pre-halving', 'Post-halving'], means, yerr=errors, color=colors_bar,
                    alpha=0.8, capsize=5, edgecolor='black', linewidth=0.5)
        axes[i].axhline(y=0, color='grey', linestyle='--', alpha=0.5)
        axes[i].set_title(w['label'] + '\nMean daily log return (95% CI)', fontsize=12)
        axes[i].set_ylabel('Mean log return')

    plt.tight_layout()
    fig_path = output_dir / 'halving_pre_post_returns.png'
    fig.savefig(fig_path, dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"\nFigure saved: {fig_path}")

def analyze_cumulative_returns(windows: list, output_dir: Path):
    """
    Plot post-halving cumulative returns.

    Parameters
    ----------
    windows : list[dict]
        Window data list
    output_dir : Path
        Directory for the figure
    """
    print("\n" + "-" * 60)
    print("[Post-halving cumulative returns]")
    print("-" * 60)

    fig, ax = plt.subplots(figsize=(14, 7))
    colors = ['#2980b9', '#e74c3c']

    for i, w in enumerate(windows):
        df_w = w['df']
        post = df_w.loc[df_w['days_from_halving'] >= 0].copy()
        if len(post) == 0:
            print(f"  {w['label']}: no post-halving data")
            continue

        # Cumulative log returns
        post_returns = post['log_return'].fillna(0)
        cum_return = post_returns.cumsum()
        # Convert cumulative log returns to simple percentage returns: pct = exp(cum_log) - 1
        cum_return_pct = (np.exp(cum_return) - 1) * 100

        days = post['days_from_halving']
        ax.plot(days, cum_return_pct, color=colors[i], linewidth=1.5,
                label=w['label'], alpha=0.85)

        # Report the endpoint
        final_cum = cum_return_pct.iloc[-1] if len(cum_return_pct) > 0 else 0
        print(f"  {w['label']}: cumulative return after {len(post)} days = {final_cum:.2f}%")

        # Cumulative returns at selected horizons
        for target_day in [30, 90, 180, 365, WINDOW_DAYS]:
            mask_day = days <= target_day
            if mask_day.any():
                val = cum_return_pct.loc[mask_day].iloc[-1]
                actual_day = days.loc[mask_day].iloc[-1]
                print(f"    day {actual_day}: {val:.2f}%")

    ax.axhline(y=0, color='grey', linestyle=':', alpha=0.4)
    ax.set_title('BTC post-halving cumulative returns', fontsize=14)
    ax.set_xlabel('Days from halving')
    ax.set_ylabel('Cumulative return (%)')
    ax.legend(fontsize=11)
    ax.grid(True, alpha=0.3)
    ax.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'{x:,.0f}%'))

    fig_path = output_dir / 'halving_cumulative_returns.png'
    fig.savefig(fig_path, dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"\nFigure saved: {fig_path}")


def analyze_volatility_change(windows: list, output_dir: Path):
    """
    Levene test for volatility changes around each halving.

    Parameters
    ----------
    windows : list[dict]
        Window data list
    output_dir : Path
        Output directory (kept for interface symmetry; no figure is produced here)
    """
    print("\n" + "-" * 60)
    print("[Pre/post-halving volatility change - Levene test]")
    print("-" * 60)

    for w in windows:
        df_w = w['df']
        pre = df_w.loc[df_w['days_from_halving'] < 0, 'log_return'].dropna()
        post = df_w.loc[df_w['days_from_halving'] > 0, 'log_return'].dropna()

        print(f"\n{w['label']}:")
        print(f"  Pre-halving volatility (daily std):  {pre.std():.6f} "
              f"(annualized: {pre.std() * np.sqrt(365):.4f})")
        print(f"  Post-halving volatility (daily std): {post.std():.6f} "
              f"(annualized: {post.std() * np.sqrt(365):.4f})")

        if len(pre) >= 3 and len(post) >= 3:
            lev_stat, lev_p = stats.levene(pre, post, center='median')
            print(f"  Levene test: W={lev_stat:.4f}, p={lev_p:.6f}")
            if lev_p < 0.05:
                print("  => volatility changed significantly around the halving at the 5% level")
            else:
                print("  => no significant volatility change at the 5% level")


def analyze_inter_cycle_correlation(windows: list):
    """
    Pearson correlation between the normalized trajectories of the two halving cycles.

    Parameters
    ----------
    windows : list[dict]
        Window data list (needs at least 2 cycles)
    """
    print("\n" + "-" * 60)
    print("[Inter-cycle trajectory correlation - Pearson]")
    print("-" * 60)

    if len(windows) < 2:
        print("  Only one cycle available; inter-cycle correlation is undefined.")
        return

    # Align the two cycles on days_from_halving
    w1, w2 = windows[0], windows[1]
    df1 = w1['df'][['days_from_halving']].copy()
    df1['norm_price_1'] = w1['normalized'].values

    df2 = w2['df'][['days_from_halving']].copy()
    df2['norm_price_2'] = w2['normalized'].values

    # Inner join on days_from_halving
    merged = pd.merge(df1, df2, on='days_from_halving', how='inner')

    if len(merged) < 10:
        print(f"  Too few overlapping days ({len(merged)}) for a reliable correlation.")
        return

    r, p_val = stats.pearsonr(merged['norm_price_1'], merged['norm_price_2'])
    print(f"  Overlapping days: {len(merged)}")
    print(f"  Pearson correlation: r={r:.4f}, p={p_val:.6f}")

    if abs(r) > 0.7:
        print("  => the two cycles' price trajectories are strongly correlated")
    elif abs(r) > 0.4:
        print("  => the two cycles' price trajectories are moderately correlated")
    else:
        print("  => the two cycles' price trajectories are weakly correlated")

    # Correlations before and after the halving, separately
    pre_merged = merged[merged['days_from_halving'] < 0]
    post_merged = merged[merged['days_from_halving'] > 0]

    if len(pre_merged) >= 10:
        r_pre, p_pre = stats.pearsonr(pre_merged['norm_price_1'], pre_merged['norm_price_2'])
        print(f"  Pre-halving correlation:  r={r_pre:.4f}, p={p_pre:.6f} (N={len(pre_merged)})")

    if len(post_merged) >= 10:
        r_post, p_post = stats.pearsonr(post_merged['norm_price_1'], post_merged['norm_price_2'])
        print(f"  Post-halving correlation: r={r_post:.4f}, p={p_post:.6f} (N={len(post_merged)})")


# --------------------------------------------------------------------------
# Main entry point
# --------------------------------------------------------------------------
def run_halving_analysis(
|
||||
df: pd.DataFrame,
|
||||
output_dir: str = 'output/halving',
|
||||
):
|
||||
"""
|
||||
BTC 减半周期分析主入口。
|
||||
|
||||
Parameters
|
||||
----------
|
||||
df : pd.DataFrame
|
||||
日线数据,已通过 add_derived_features 添加衍生特征(含 close、log_return 列)
|
||||
output_dir : str or Path
|
||||
输出目录
|
||||
|
||||
Notes
|
||||
-----
|
||||
重要局限性: 数据范围内仅含2次减半事件(2020、2024),样本量极少,
|
||||
统计检验的功效(power)很低,结论仅供参考,不能作为因果推断依据。
|
||||
"""
|
||||
output_dir = Path(output_dir)
|
||||
output_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
print("\n" + "#" * 70)
|
||||
print("# BTC 减半周期分析 (Halving Cycle Analysis)")
|
||||
print("#" * 70)
|
||||
|
||||
# ===== 重要局限性说明 =====
|
||||
print("\n⚠️ 重要局限性说明:")
|
||||
print(f" 本分析仅覆盖 {len(HALVING_DATES)} 次减半事件(样本量极少)。")
|
||||
print(" 统计检验的功效(statistical power)很低,")
|
||||
print(" 任何「显著性」结论都应谨慎解读,不能作为因果推断依据。")
|
||||
print(" 结果主要用于描述性分析和模式探索。\n")
|
||||
|
||||
# 提取每次减半的窗口数据
|
||||
windows = []
|
||||
for i, (hdate, hlabel) in enumerate(zip(HALVING_DATES, HALVING_LABELS)):
|
||||
w_df = _extract_halving_window(df, hdate, WINDOW_DAYS)
|
||||
if len(w_df) == 0:
|
||||
print(f"[警告] {hlabel} 窗口内无数据,跳过。")
|
||||
continue
|
||||
|
||||
normalized = _normalize_price(w_df, hdate)
|
||||
|
||||
print(f"周期 {i + 1}: {hlabel}")
|
||||
print(f" 数据范围: {w_df.index.min().date()} ~ {w_df.index.max().date()}")
|
||||
print(f" 数据量: {len(w_df)} 天")
|
||||
print(f" 减半日价格: {w_df['close'].iloc[w_df.index.get_indexer([hdate], method='nearest')[0]]:.2f} USDT")
|
||||
|
||||
windows.append({
|
||||
'df': w_df,
|
||||
'normalized': normalized,
|
||||
            'label': hlabel,
            'halving_date': hdate,
        })

    if len(windows) == 0:
        print("[ERROR] No valid halving-window data; analysis aborted.")
        return

    # 1. Overlay of normalized price trajectories
    analyze_normalized_trajectories(windows, output_dir)

    # 2. Pre- vs post-halving return comparison
    analyze_pre_post_returns(windows, output_dir)

    # 3. Cumulative returns after each halving
    analyze_cumulative_returns(windows, output_dir)

    # 4. Volatility change (Levene test)
    analyze_volatility_change(windows, output_dir)

    # 5. Inter-cycle trajectory correlation
    analyze_inter_cycle_correlation(windows)

    # ===== Combined visualization: summary panel =====
    _plot_combined_summary(windows, output_dir)

    print("\n" + "#" * 70)
    print("# Halving-cycle analysis complete")
    print(f"# Note: only {len(windows)} cycles; statistical power is limited")
    print("#" * 70)


def _plot_combined_summary(windows: list, output_dir: Path):
    """
    Combined figure: normalized trajectories + pre/post-halving return bars
    + cumulative-return comparison + rolling volatility.

    Parameters
    ----------
    windows : list[dict]
        List of per-cycle window data
    output_dir : Path
        Directory in which to save the figure
    """
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    colors = ['#2980b9', '#e74c3c']
    linestyles = ['-', '--']

    # (0,0) Normalized trajectories
    ax = axes[0, 0]
    for i, w in enumerate(windows):
        days = w['df']['days_from_halving']
        # Modulo guards against more cycles than defined styles
        ax.plot(days, w['normalized'], color=colors[i % len(colors)],
                linestyle=linestyles[i % len(linestyles)],
                linewidth=1.5, label=w['label'], alpha=0.85)
    ax.axvline(x=0, color='gold', linewidth=2, alpha=0.8, label='Halving day')
    ax.axhline(y=100, color='grey', linestyle=':', alpha=0.4)
    ax.set_title('Normalized price trajectory (halving day = 100)', fontsize=12)
    ax.set_xlabel('Days from halving')
    ax.set_ylabel('Normalized price')
    ax.legend(fontsize=9)
    ax.grid(True, alpha=0.3)

    # (0,1) Mean daily return before vs after the halving
    ax = axes[0, 1]
    x_pos = np.arange(len(windows))
    width = 0.35
    pre_means, post_means, pre_errs, post_errs = [], [], [], []
    for w in windows:
        df_w = w['df']
        pre = df_w.loc[df_w['days_from_halving'] < 0, 'log_return'].dropna()
        post = df_w.loc[df_w['days_from_halving'] > 0, 'log_return'].dropna()
        pre_means.append(pre.mean())
        post_means.append(post.mean())
        pre_errs.append(pre.sem() * 1.96)   # 95% CI, normal approximation
        post_errs.append(post.sem() * 1.96)

    ax.bar(x_pos - width / 2, pre_means, width, yerr=pre_errs, label='Pre-halving',
           color='#3498db', alpha=0.8, capsize=4, edgecolor='black', linewidth=0.5)
    ax.bar(x_pos + width / 2, post_means, width, yerr=post_errs, label='Post-halving',
           color='#e67e22', alpha=0.8, capsize=4, edgecolor='black', linewidth=0.5)
    ax.set_xticks(x_pos)
    ax.set_xticklabels([w['label'].split('(')[0].strip() for w in windows], fontsize=9)
    ax.axhline(y=0, color='grey', linestyle='--', alpha=0.5)
    ax.set_title('Mean daily log return before/after halving (95% CI)', fontsize=12)
    ax.set_ylabel('Mean log return')
    ax.legend(fontsize=9)

    # (1,0) Cumulative returns
    ax = axes[1, 0]
    for i, w in enumerate(windows):
        df_w = w['df']
        post = df_w.loc[df_w['days_from_halving'] >= 0].copy()
        if len(post) == 0:
            continue
        cum_ret = post['log_return'].fillna(0).cumsum()
        cum_ret_pct = (np.exp(cum_ret) - 1) * 100
        ax.plot(post['days_from_halving'], cum_ret_pct, color=colors[i % len(colors)],
                linewidth=1.5, label=w['label'], alpha=0.85)
    ax.axhline(y=0, color='grey', linestyle=':', alpha=0.4)
    ax.set_title('Post-halving cumulative return', fontsize=12)
    ax.set_xlabel('Days from halving')
    ax.set_ylabel('Cumulative return (%)')
    ax.legend(fontsize=9)
    ax.grid(True, alpha=0.3)
    ax.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'{x:,.0f}%'))

    # (1,1) Volatility comparison (rolling 30-day, annualized with sqrt(365)
    # since crypto trades every calendar day)
    ax = axes[1, 1]
    for i, w in enumerate(windows):
        df_w = w['df']
        rolling_vol = df_w['log_return'].rolling(30).std() * np.sqrt(365)
        ax.plot(df_w['days_from_halving'], rolling_vol, color=colors[i % len(colors)],
                linewidth=1.2, label=w['label'], alpha=0.8)
    ax.axvline(x=0, color='gold', linewidth=2, alpha=0.8, label='Halving day')
    ax.set_title('Rolling 30-day annualized volatility', fontsize=12)
    ax.set_xlabel('Days from halving')
    ax.set_ylabel('Annualized volatility')
    ax.legend(fontsize=9)
    ax.grid(True, alpha=0.3)

    plt.suptitle('BTC halving-cycle combined analysis', fontsize=15, y=1.01)
    plt.tight_layout()
    fig_path = output_dir / 'halving_combined_summary.png'
    fig.savefig(fig_path, dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"\nCombined figure saved: {fig_path}")


# --------------------------------------------------------------------------
# Standalone entry point
# --------------------------------------------------------------------------
if __name__ == '__main__':
    from data_loader import load_daily
    from preprocessing import add_derived_features

    # Load data
    df_daily = load_daily()
    df_daily = add_derived_features(df_daily)

    run_halving_analysis(df_daily, output_dir='output/halving')
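
# Note on the window schema (inferred from the usage above, not a formal
# contract): each entry of `windows` is a dict with at least
#     {'df': <per-cycle DataFrame with 'days_from_halving' and 'log_return'>,
#      'normalized': <price series rescaled to 100 on the halving day>,
#      'label': <cycle label>, 'halving_date': <halving timestamp>}
# so a minimal synthetic window for exercising the plotting code can be built
# from any daily DataFrame that spans a halving date.
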
633
src/hurst_analysis.py
Normal file
@@ -0,0 +1,633 @@
"""
Hurst exponent analysis module
==============================
Estimates the Hurst exponent via R/S analysis and DFA (detrended
fluctuation analysis) to assess long-range dependence in the BTC price
series and classify the market regime (trending / mean-reverting /
random walk).

Core features:
- R/S (rescaled range) analysis
- DFA (detrended fluctuation analysis) via nolds
- Cross-validation of R/S against DFA
- Rolling-window Hurst exponent to track regime changes over time
- Multi-timeframe Hurst comparison
"""

import matplotlib
matplotlib.use('Agg')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

try:
    import nolds
    HAS_NOLDS = True
except Exception:
    HAS_NOLDS = False

from pathlib import Path
from typing import Tuple, Dict, List, Optional

import sys
sys.path.insert(0, str(Path(__file__).parent.parent))
from src.data_loader import load_klines
from src.preprocessing import log_returns


# ============================================================
# Hurst classification thresholds
# ============================================================
TREND_THRESHOLD = 0.55     # H > 0.55 -> trending (persistent)
MEAN_REV_THRESHOLD = 0.45  # H < 0.45 -> mean-reverting (anti-persistent)
# 0.45 <= H <= 0.55 -> approximately a random walk


def interpret_hurst(h: float) -> str:
    """Describe the market regime implied by a Hurst exponent value."""
    if h > TREND_THRESHOLD:
        return (f"Trending (H={h:.4f} > {TREND_THRESHOLD}): long-range positive "
                "correlation; price trends tend to persist")
    elif h < MEAN_REV_THRESHOLD:
        return (f"Mean-reverting (H={h:.4f} < {MEAN_REV_THRESHOLD}): long-range negative "
                "correlation; prices tend to reverse")
    else:
        return (f"Random walk (H={h:.4f} ≈ 0.5): the series is approximately "
                "memoryless; successive price changes are close to independent")


# ============================================================
# R/S (rescaled range) analysis
# ============================================================
def _rs_for_segment(segment: np.ndarray) -> float:
    """Compute the R/S statistic for a single segment."""
    n = len(segment)
    if n < 2:
        return np.nan

    # Cumulative sum of deviations from the mean
    mean_val = np.mean(segment)
    deviations = segment - mean_val
    cumulative = np.cumsum(deviations)

    # Range R = max(cumulative deviation) - min(cumulative deviation)
    R = np.max(cumulative) - np.min(cumulative)

    # Standard deviation S
    S = np.std(segment, ddof=1)
    if S == 0:
        return np.nan

    return R / S


def rs_hurst(series: np.ndarray, min_window: int = 10, max_window: Optional[int] = None,
             num_scales: int = 30) -> Tuple[float, np.ndarray, np.ndarray]:
    """
    Estimate the Hurst exponent via rescaled-range (R/S) analysis.

    Parameters
    ----------
    series : np.ndarray
        Time series (usually log returns)
    min_window : int
        Smallest window size
    max_window : int, optional
        Largest window size; defaults to a quarter of the series length
    num_scales : int
        Number of scales

    Returns
    -------
    H : float
        Hurst exponent
    log_ns : np.ndarray
        log(window size)
    log_rs : np.ndarray
        log(mean R/S value)
    """
    n = len(series)
    if max_window is None:
        max_window = n // 4

    # Log-uniformly spaced window sizes
    window_sizes = np.unique(
        np.logspace(np.log10(min_window), np.log10(max_window), num=num_scales).astype(int)
    )

    log_ns = []
    log_rs = []

    for w in window_sizes:
        if w < 10 or w > n // 2:
            continue

        # Split the series into non-overlapping segments
        num_segments = n // w
        if num_segments < 1:
            continue

        rs_values = []
        for i in range(num_segments):
            segment = series[i * w: (i + 1) * w]
            rs_val = _rs_for_segment(segment)
            if not np.isnan(rs_val):
                rs_values.append(rs_val)

        if len(rs_values) > 0:
            mean_rs = np.mean(rs_values)
            if mean_rs > 0:
                log_ns.append(np.log(w))
                log_rs.append(np.log(mean_rs))

    log_ns = np.array(log_ns)
    log_rs = np.array(log_rs)

    # Linear regression: log(R/S) = H * log(n) + c
    if len(log_ns) < 3:
        return 0.5, log_ns, log_rs

    coeffs = np.polyfit(log_ns, log_rs, 1)
    H = coeffs[0]

    return H, log_ns, log_rs

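
# Sanity check (a sketch, not executed by the pipeline): for i.i.d. Gaussian
# noise the true Hurst exponent is 0.5. Classical R/S carries a known upward
# small-sample bias (cf. the Anis-Lloyd correction), so estimates a little
# above 0.5 are normal on finite samples:
#
#     rng = np.random.default_rng(42)
#     h, _, _ = rs_hurst(rng.standard_normal(4096))
#     # expect h roughly in the 0.5-0.6 band for white noise
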
# ============================================================
# DFA (detrended fluctuation analysis) - via the nolds library
# ============================================================
def dfa_hurst(series: np.ndarray) -> float:
    """
    Run DFA via the nolds library and return the Hurst estimate.

    Parameters
    ----------
    series : np.ndarray
        Time series

    Returns
    -------
    float
        The DFA scaling exponent alpha; for an increment (noise-like)
        series such as log returns, alpha ≈ H.
    """
    if HAS_NOLDS:
        # nolds.dfa returns the DFA scaling exponent alpha.
        # For a log-return series (an increment process), alpha ≈ H;
        # for an integrated series such as prices, alpha ≈ H + 1.
        alpha = nolds.dfa(series)
        return alpha
    else:
        # Simplified fallback DFA implementation
        N = len(series)
        y = np.cumsum(series - np.mean(series))
        scales = np.unique(np.logspace(np.log10(4), np.log10(N // 4), 20).astype(int))
        flucts = []
        for s in scales:
            n_seg = N // s
            if n_seg < 1:
                continue
            rms_list = []
            for i in range(n_seg):
                seg = y[i*s:(i+1)*s]
                x = np.arange(s)
                coeffs = np.polyfit(x, seg, 1)
                trend = np.polyval(coeffs, x)
                rms_list.append(np.sqrt(np.mean((seg - trend)**2)))
            flucts.append(np.mean(rms_list))
        if len(flucts) < 2:
            return 0.5
        log_s = np.log(scales[:len(flucts)])
        log_f = np.log(flucts)
        alpha = np.polyfit(log_s, log_f, 1)[0]
        return alpha

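
# Cross-check sketch (assumes nolds is installed; not called by the pipeline):
# R/S and DFA estimate the same exponent from different statistics, so on a
# long simulated white-noise series both should land near 0.5 and near each
# other. A persistent gap on real data hints at estimator bias or
# nonstationarity rather than genuine long memory:
#
#     x = np.random.default_rng(0).standard_normal(8192)
#     print(rs_hurst(x)[0], dfa_hurst(x))   # both typically close to 0.5
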
# ============================================================
# Cross-validation: compare R/S and DFA results
# ============================================================
def cross_validate_hurst(series: np.ndarray) -> Dict[str, float]:
    """
    Estimate the Hurst exponent with both R/S and DFA and cross-check them.

    Returns
    -------
    dict
        Hurst values from both methods, their absolute difference, and mean
    """
    h_rs, _, _ = rs_hurst(series)
    h_dfa = dfa_hurst(series)

    result = {
        'R/S Hurst': h_rs,
        'DFA Hurst': h_dfa,
        'method_diff': abs(h_rs - h_dfa),
        'mean': (h_rs + h_dfa) / 2,
    }
    return result


# ============================================================
# Rolling-window Hurst exponent
# ============================================================
def rolling_hurst(series: np.ndarray, dates: pd.DatetimeIndex,
                  window: int = 500, step: int = 30,
                  method: str = 'rs') -> Tuple[pd.DatetimeIndex, np.ndarray]:
    """
    Compute the Hurst exponent over a rolling window to track how the
    market regime evolves over time.

    Parameters
    ----------
    series : np.ndarray
        Time series (log returns)
    dates : pd.DatetimeIndex
        Matching date index
    window : int
        Rolling window size (default 500 days)
    step : int
        Step size (default 30 days)
    method : str
        'rs' for R/S analysis, 'dfa' for DFA

    Returns
    -------
    roll_dates : pd.DatetimeIndex
        Date at the end of each window
    roll_hurst : np.ndarray
        Corresponding Hurst exponent values
    """
    n = len(series)
    roll_dates = []
    roll_hurst = []

    for start_idx in range(0, n - window + 1, step):
        end_idx = start_idx + window
        segment = series[start_idx:end_idx]

        if method == 'rs':
            h, _, _ = rs_hurst(segment)
        elif method == 'dfa':
            h = dfa_hurst(segment)
        else:
            raise ValueError(f"Unknown method: {method}")

        roll_dates.append(dates[end_idx - 1])
        roll_hurst.append(h)

    return pd.DatetimeIndex(roll_dates), np.array(roll_hurst)

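
# Example usage (a sketch; `ret` is a log-return array and `idx` its
# DatetimeIndex):
#
#     dates, hs = rolling_hurst(ret, idx, window=500, step=30, method='rs')
#     # hs[i] is the Hurst estimate for the 500 observations ending at
#     # dates[i]; plotted over time it tracks regime drift.
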
# ============================================================
# Multi-timeframe Hurst analysis
# ============================================================
def multi_timeframe_hurst(intervals: Optional[List[str]] = None) -> Dict[str, Dict[str, float]]:
    """
    Compute the Hurst exponent on several timeframes.

    Parameters
    ----------
    intervals : list of str
        Timeframes to analyze; defaults to ['1h', '4h', '1d', '1w']

    Returns
    -------
    dict
        Hurst results per timeframe
    """
    if intervals is None:
        intervals = ['1h', '4h', '1d', '1w']

    results = {}
    for interval in intervals:
        try:
            print(f"\nLoading {interval} data...")
            df = load_klines(interval)
            prices = df['close'].dropna()

            if len(prices) < 100:
                print(f"  {interval}: not enough data ({len(prices)} rows), skipping")
                continue

            returns = log_returns(prices).values

            # R/S analysis
            h_rs, _, _ = rs_hurst(returns)
            # DFA analysis
            h_dfa = dfa_hurst(returns)

            results[interval] = {
                'R/S Hurst': h_rs,
                'DFA Hurst': h_dfa,
                'mean_hurst': (h_rs + h_dfa) / 2,
                'n_samples': len(returns),
                'interpretation': interpret_hurst((h_rs + h_dfa) / 2),
            }

            print(f"  {interval}: R/S={h_rs:.4f}, DFA={h_dfa:.4f}, "
                  f"mean={results[interval]['mean_hurst']:.4f}")

        except FileNotFoundError:
            print(f"  {interval}: data file not found, skipping")
        except Exception as e:
            print(f"  {interval}: analysis failed: {e}")

    return results


# ============================================================
# Plotting helpers
# ============================================================
def plot_rs_loglog(log_ns: np.ndarray, log_rs: np.ndarray, H: float,
                   output_dir: Path, filename: str = "hurst_rs_loglog.png"):
    """Plot the R/S analysis on log-log axes."""
    fig, ax = plt.subplots(figsize=(10, 7))

    # Scatter of the per-scale R/S estimates
    ax.scatter(log_ns, log_rs, color='steelblue', s=40, zorder=3, label='R/S data points')

    # Fitted line
    coeffs = np.polyfit(log_ns, log_rs, 1)
    fit_line = np.polyval(coeffs, log_ns)
    ax.plot(log_ns, fit_line, 'r-', linewidth=2, label=f'Fit (H = {H:.4f})')

    # Reference line: H = 0.5 (random walk)
    ref_line = 0.5 * log_ns + (log_rs[0] - 0.5 * log_ns[0])
    ax.plot(log_ns, ref_line, 'k--', alpha=0.5, linewidth=1, label='H=0.5 (random walk)')

    ax.set_xlabel('log(n) - log window size', fontsize=12)
    ax.set_ylabel('log(R/S) - log rescaled range', fontsize=12)
    ax.set_title(f'BTC R/S analysis (Hurst = {H:.4f})\n{interpret_hurst(H)}', fontsize=13)
    ax.legend(fontsize=11)
    ax.grid(True, alpha=0.3)

    fig.tight_layout()
    filepath = output_dir / filename
    fig.savefig(filepath, dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"  Saved: {filepath}")


def plot_rolling_hurst(roll_dates: pd.DatetimeIndex, roll_hurst: np.ndarray,
                       output_dir: Path, filename: str = "hurst_rolling.png"):
    """Plot the rolling Hurst exponent time series with regime bands."""
    fig, ax = plt.subplots(figsize=(14, 7))

    # Hurst exponent curve
    ax.plot(roll_dates, roll_hurst, color='steelblue', linewidth=1.5, label='Rolling Hurst')

    # Regime bands
    ax.axhspan(TREND_THRESHOLD, max(roll_hurst.max() + 0.05, 0.8),
               alpha=0.1, color='green', label=f'Trending (H>{TREND_THRESHOLD})')
    ax.axhspan(MEAN_REV_THRESHOLD, TREND_THRESHOLD,
               alpha=0.1, color='yellow', label=f'Random walk ({MEAN_REV_THRESHOLD}<H<{TREND_THRESHOLD})')
    ax.axhspan(min(roll_hurst.min() - 0.05, 0.2), MEAN_REV_THRESHOLD,
               alpha=0.1, color='red', label=f'Mean-reverting (H<{MEAN_REV_THRESHOLD})')

    # Reference lines
    ax.axhline(y=0.5, color='black', linestyle='--', alpha=0.5, linewidth=1)
    ax.axhline(y=TREND_THRESHOLD, color='green', linestyle=':', alpha=0.5)
    ax.axhline(y=MEAN_REV_THRESHOLD, color='red', linestyle=':', alpha=0.5)

    ax.set_xlabel('Date', fontsize=12)
    ax.set_ylabel('Hurst exponent', fontsize=12)
    ax.set_title('BTC rolling Hurst exponent (window=500d, step=30d)\nMarket regime over time', fontsize=13)
    ax.legend(loc='upper left', fontsize=10)
    ax.grid(True, alpha=0.3)

    # Format the date axis
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
    ax.xaxis.set_major_locator(mdates.YearLocator())
    fig.autofmt_xdate()

    fig.tight_layout()
    filepath = output_dir / filename
    fig.savefig(filepath, dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"  Saved: {filepath}")


def plot_multi_timeframe(results: Dict[str, Dict[str, float]],
                         output_dir: Path, filename: str = "hurst_multi_timeframe.png"):
    """Plot a comparison of Hurst exponents across timeframes."""
    if not results:
        print("  No multi-timeframe results to plot")
        return

    intervals = list(results.keys())
    h_rs = [results[k]['R/S Hurst'] for k in intervals]
    h_dfa = [results[k]['DFA Hurst'] for k in intervals]
    h_avg = [results[k]['mean_hurst'] for k in intervals]

    x = np.arange(len(intervals))
    width = 0.25

    fig, ax = plt.subplots(figsize=(12, 7))

    bars1 = ax.bar(x - width, h_rs, width, label='R/S Hurst', color='steelblue', alpha=0.8)
    bars2 = ax.bar(x, h_dfa, width, label='DFA Hurst', color='coral', alpha=0.8)
    bars3 = ax.bar(x + width, h_avg, width, label='Mean', color='seagreen', alpha=0.8)

    # Reference lines
    ax.axhline(y=0.5, color='black', linestyle='--', alpha=0.5, linewidth=1, label='H=0.5')
    ax.axhline(y=TREND_THRESHOLD, color='green', linestyle=':', alpha=0.4)
    ax.axhline(y=MEAN_REV_THRESHOLD, color='red', linestyle=':', alpha=0.4)

    # Annotate each bar with its value
    for bars in [bars1, bars2, bars3]:
        for bar in bars:
            height = bar.get_height()
            ax.annotate(f'{height:.3f}',
                        xy=(bar.get_x() + bar.get_width() / 2, height),
                        xytext=(0, 3), textcoords="offset points",
                        ha='center', va='bottom', fontsize=9)

    ax.set_xlabel('Timeframe', fontsize=12)
    ax.set_ylabel('Hurst exponent', fontsize=12)
    ax.set_title('BTC Hurst exponent across timeframes', fontsize=13)
    ax.set_xticks(x)
    ax.set_xticklabels(intervals)
    ax.legend(fontsize=11)
    ax.grid(True, alpha=0.3, axis='y')

    fig.tight_layout()
    filepath = output_dir / filename
    fig.savefig(filepath, dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"  Saved: {filepath}")


# ============================================================
# Main entry point
# ============================================================
def run_hurst_analysis(df: pd.DataFrame, output_dir: str = "output/hurst") -> Dict:
    """
    Main entry point for the combined Hurst analysis.

    Parameters
    ----------
    df : pd.DataFrame
        Kline data (must contain a 'close' column and a DatetimeIndex)
    output_dir : str
        Directory for output figures

    Returns
    -------
    dict
        All analysis results
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    results = {}

    print("=" * 70)
    print("Hurst exponent analysis")
    print("=" * 70)

    # ----------------------------------------------------------
    # 1. Prepare data
    # ----------------------------------------------------------
    prices = df['close'].dropna()
    returns = log_returns(prices)
    returns_arr = returns.values

    print("\nData overview:")
    print(f"  Time range: {df.index.min()} ~ {df.index.max()}")
    print(f"  Return series length: {len(returns_arr)}")

    # ----------------------------------------------------------
    # 2. R/S analysis
    # ----------------------------------------------------------
    print("\n" + "-" * 50)
    print("[1] R/S (rescaled range) analysis")
    print("-" * 50)

    h_rs, log_ns, log_rs = rs_hurst(returns_arr)
    results['R/S Hurst'] = h_rs

    print(f"  R/S Hurst exponent: {h_rs:.4f}")
    print(f"  Interpretation: {interpret_hurst(h_rs)}")

    # R/S log-log plot
    plot_rs_loglog(log_ns, log_rs, h_rs, output_dir)

    # ----------------------------------------------------------
    # 3. DFA analysis (via nolds)
    # ----------------------------------------------------------
    print("\n" + "-" * 50)
    print("[2] DFA (detrended fluctuation analysis)")
    print("-" * 50)

    h_dfa = dfa_hurst(returns_arr)
    results['DFA Hurst'] = h_dfa

    print(f"  DFA Hurst exponent: {h_dfa:.4f}")
    print(f"  Interpretation: {interpret_hurst(h_dfa)}")

    # ----------------------------------------------------------
    # 4. Cross-validation
    # ----------------------------------------------------------
    print("\n" + "-" * 50)
    print("[3] Cross-validation: R/S vs DFA")
    print("-" * 50)

    cv_results = cross_validate_hurst(returns_arr)
    results['cross_validation'] = cv_results

    print(f"  R/S Hurst: {cv_results['R/S Hurst']:.4f}")
    print(f"  DFA Hurst: {cv_results['DFA Hurst']:.4f}")
    print(f"  Difference between methods: {cv_results['method_diff']:.4f}")
    print(f"  Mean: {cv_results['mean']:.4f}")

    avg_h = cv_results['mean']
    if cv_results['method_diff'] < 0.05:
        print("  [OK] The two methods agree well (difference < 0.05)")
    else:
        print("  [WARN] The two methods differ noticeably (difference >= 0.05); "
              "consider corroborating with other estimators")

    print(f"\n  Combined interpretation: {interpret_hurst(avg_h)}")
    results['combined_hurst'] = avg_h
    results['combined_interpretation'] = interpret_hurst(avg_h)

    # ----------------------------------------------------------
    # 5. Rolling-window Hurst (window 500 days, step 30 days)
    # ----------------------------------------------------------
    print("\n" + "-" * 50)
    print("[4] Rolling Hurst exponent (window=500d, step=30d)")
    print("-" * 50)

    if len(returns_arr) >= 500:
        roll_dates, roll_h = rolling_hurst(
            returns_arr, returns.index, window=500, step=30, method='rs'
        )

        # Share of each regime
        n_trend = np.sum(roll_h > TREND_THRESHOLD)
        n_mean_rev = np.sum(roll_h < MEAN_REV_THRESHOLD)
        n_random = np.sum((roll_h >= MEAN_REV_THRESHOLD) & (roll_h <= TREND_THRESHOLD))
        total = len(roll_h)

        print(f"  Number of windows: {total}")
        print(f"  Trending share: {n_trend / total * 100:.1f}% ({n_trend}/{total})")
        print(f"  Random-walk share: {n_random / total * 100:.1f}% ({n_random}/{total})")
        print(f"  Mean-reverting share: {n_mean_rev / total * 100:.1f}% ({n_mean_rev}/{total})")
        print(f"  Hurst range: [{roll_h.min():.4f}, {roll_h.max():.4f}]")
        print(f"  Hurst mean: {roll_h.mean():.4f}")

        results['rolling_hurst'] = {
            'n_windows': total,
            'trend_share': n_trend / total,
            'random_walk_share': n_random / total,
            'mean_rev_share': n_mean_rev / total,
            'hurst_range': (roll_h.min(), roll_h.max()),
            'hurst_mean': roll_h.mean(),
        }

        # Rolling Hurst plot
        plot_rolling_hurst(roll_dates, roll_h, output_dir)
    else:
        print(f"  Not enough data ({len(returns_arr)} < 500); skipping rolling analysis")

    # ----------------------------------------------------------
    # 6. Multi-timeframe Hurst analysis
    # ----------------------------------------------------------
    print("\n" + "-" * 50)
    print("[5] Multi-timeframe Hurst exponent")
    print("-" * 50)

    mt_results = multi_timeframe_hurst(['1h', '4h', '1d', '1w'])
    results['multi_timeframe'] = mt_results

    # Multi-timeframe comparison plot
    plot_multi_timeframe(mt_results, output_dir)

    # ----------------------------------------------------------
    # 7. Summary
    # ----------------------------------------------------------
    print("\n" + "=" * 70)
    print("Summary")
    print("=" * 70)
    print(f"  Daily combined Hurst exponent: {avg_h:.4f}")
    print(f"  Market regime: {interpret_hurst(avg_h)}")

    if mt_results:
        print("\n  Hurst exponent per timeframe:")
        for interval, data in mt_results.items():
            print(f"    {interval}: mean H={data['mean_hurst']:.4f} - {data['interpretation']}")

    print("\n  Classification thresholds:")
    print(f"    H > {TREND_THRESHOLD}: trending (persistent; suits trend-following)")
    print(f"    H < {MEAN_REV_THRESHOLD}: mean-reverting (anti-persistent; suits mean-reversion)")
    print(f"    {MEAN_REV_THRESHOLD} <= H <= {TREND_THRESHOLD}: random walk (no significant predictability)")

    print(f"\n  Figures saved to: {output_dir.resolve()}")
    print("=" * 70)

    return results


# ============================================================
# Standalone entry point
# ============================================================
if __name__ == "__main__":
    from data_loader import load_daily

    print("Loading BTC daily data...")
    df = load_daily()
    print(f"Loaded {len(df)} rows")

    results = run_hurst_analysis(df, output_dir="output/hurst")
626
src/indicators.py
Normal file
@@ -0,0 +1,626 @@
"""
Technical-indicator validation module

Implements common technical indicators by hand (MA/EMA crossovers, RSI,
MACD, Bollinger Bands), runs statistical significance tests on the
training set, and re-checks them on the validation set.
Anti-data-snooping measures included: Benjamini-Hochberg FDR correction
plus permutation tests.
"""

import matplotlib
matplotlib.use('Agg')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from pathlib import Path
from typing import Dict, List, Tuple, Optional

from src.data_loader import split_data
from src.preprocessing import log_returns


# ============================================================
# 1. Technical indicators (implemented by hand)
# ============================================================

def calc_sma(series: pd.Series, window: int) -> pd.Series:
    """Simple moving average."""
    return series.rolling(window=window, min_periods=window).mean()


def calc_ema(series: pd.Series, span: int) -> pd.Series:
    """Exponential moving average."""
    return series.ewm(span=span, adjust=False).mean()


def calc_rsi(close: pd.Series, period: int = 14) -> pd.Series:
    """
    Relative Strength Index (RSI)
    RSI = 100 - 100 / (1 + RS)
    RS = average gain / average loss
    """
    delta = close.diff()
    gain = delta.clip(lower=0)
    loss = (-delta).clip(lower=0)
    # Wilder-style smoothing of average gains and losses via an EMA
    avg_gain = gain.ewm(alpha=1.0 / period, min_periods=period, adjust=False).mean()
    avg_loss = loss.ewm(alpha=1.0 / period, min_periods=period, adjust=False).mean()
    rs = avg_gain / avg_loss.replace(0, np.nan)
    rsi = 100 - 100 / (1 + rs)
    return rsi

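
# Worked example (illustrative numbers): with period=14 the smoothing above
# is Wilder-style, alpha = 1/14, i.e.
#     avg_gain[t] = avg_gain[t-1] + (gain[t] - avg_gain[t-1]) / 14
# If avg_gain = 0.9 and avg_loss = 0.3, then RS = 3.0 and
# RSI = 100 - 100 / (1 + 3) = 75, just above the common 70 overbought line.
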
def calc_macd(close: pd.Series, fast: int = 12, slow: int = 26, signal: int = 9) -> Tuple[pd.Series, pd.Series, pd.Series]:
    """
    MACD indicator.
    Returns: (macd_line, signal_line, histogram)
    """
    ema_fast = calc_ema(close, fast)
    ema_slow = calc_ema(close, slow)
    macd_line = ema_fast - ema_slow
    signal_line = calc_ema(macd_line, signal)
    histogram = macd_line - signal_line
    return macd_line, signal_line, histogram


def calc_bollinger_bands(close: pd.Series, window: int = 20, num_std: float = 2.0) -> Tuple[pd.Series, pd.Series, pd.Series]:
    """
    Bollinger Bands.
    Returns: (upper, middle, lower)
    """
    middle = calc_sma(close, window)
    rolling_std = close.rolling(window=window, min_periods=window).std()
    upper = middle + num_std * rolling_std
    lower = middle - num_std * rolling_std
    return upper, middle, lower


# ============================================================
# 2. Signal generation
# ============================================================

def generate_ma_crossover_signals(close: pd.Series, short_w: int, long_w: int, use_ema: bool = False) -> pd.Series:
    """
    Moving-average crossover signals.
    Golden cross = +1 (short crosses above long), death cross = -1
    (short crosses below long), no signal = 0.
    """
    func = calc_ema if use_ema else calc_sma
    short_ma = func(close, short_w)
    long_ma = func(close, long_w)
    # short > long now and short <= long on the previous bar => golden cross (+1)
    # short < long now and short >= long on the previous bar => death cross (-1)
    cross_up = (short_ma > long_ma) & (short_ma.shift(1) <= long_ma.shift(1))
    cross_down = (short_ma < long_ma) & (short_ma.shift(1) >= long_ma.shift(1))
    signal = pd.Series(0, index=close.index)
    signal[cross_up] = 1
    signal[cross_down] = -1
    return signal


def generate_rsi_signals(close: pd.Series, period: int, oversold: float = 30, overbought: float = 70) -> pd.Series:
    """
    RSI overbought/oversold signals.
    RSI rising back out of the oversold zone => +1 (buy)
    RSI falling back out of the overbought zone => -1 (sell)
    """
    rsi = calc_rsi(close, period)
    rsi_prev = rsi.shift(1)
    signal = pd.Series(0, index=close.index)
    # Recovery from oversold
    signal[(rsi_prev <= oversold) & (rsi > oversold)] = 1
    # Pullback from overbought
    signal[(rsi_prev >= overbought) & (rsi < overbought)] = -1
    return signal


def generate_macd_signals(close: pd.Series, fast: int = 12, slow: int = 26, sig: int = 9) -> pd.Series:
    """
    MACD crossover signals.
    MACD line crossing above the signal line => +1
    MACD line crossing below the signal line => -1
    """
    macd_line, signal_line, _ = calc_macd(close, fast, slow, sig)
    cross_up = (macd_line > signal_line) & (macd_line.shift(1) <= signal_line.shift(1))
    cross_down = (macd_line < signal_line) & (macd_line.shift(1) >= signal_line.shift(1))
    signal = pd.Series(0, index=close.index)
    signal[cross_up] = 1
    signal[cross_down] = -1
    return signal


def generate_bollinger_signals(close: pd.Series, window: int = 20, num_std: float = 2.0) -> pd.Series:
    """
    Bollinger Band signals.
    Price recovering above the lower band => +1 (buy)
    Price falling back below the upper band => -1 (sell)
    """
    upper, middle, lower = calc_bollinger_bands(close, window, num_std)
    # Previous bar at/below the lower band, current bar back above it
    cross_up = (close.shift(1) <= lower.shift(1)) & (close > lower)
    # Previous bar at/above the upper band, current bar back below it
    cross_down = (close.shift(1) >= upper.shift(1)) & (close < upper)
    signal = pd.Series(0, index=close.index)
    signal[cross_up] = 1
    signal[cross_down] = -1
    return signal


def build_all_signals(close: pd.Series) -> Dict[str, pd.Series]:
    """
    Build every technical-indicator signal.
    Returns a dict: {indicator name: signal series}
    (8 MA/EMA crossovers + 9 RSI configs + 3 MACD configs + 1 Bollinger = 21 signals)
    """
    signals = {}

    # --- MA / EMA crossovers ---
    ma_pairs = [(5, 20), (10, 50), (20, 100), (50, 200)]
    for short_w, long_w in ma_pairs:
        signals[f"SMA_{short_w}_{long_w}"] = generate_ma_crossover_signals(close, short_w, long_w, use_ema=False)
        signals[f"EMA_{short_w}_{long_w}"] = generate_ma_crossover_signals(close, short_w, long_w, use_ema=True)

    # --- RSI ---
    rsi_configs = [
        (7, 30, 70), (7, 25, 75), (7, 20, 80),
        (14, 30, 70), (14, 25, 75), (14, 20, 80),
        (21, 30, 70), (21, 25, 75), (21, 20, 80),
    ]
    for period, oversold, overbought in rsi_configs:
        signals[f"RSI_{period}_{oversold}_{overbought}"] = generate_rsi_signals(close, period, oversold, overbought)

    # --- MACD ---
    macd_configs = [(12, 26, 9), (8, 17, 9), (5, 35, 5)]
    for fast, slow, sig in macd_configs:
        signals[f"MACD_{fast}_{slow}_{sig}"] = generate_macd_signals(close, fast, slow, sig)

    # --- Bollinger Bands ---
    signals["BB_20_2"] = generate_bollinger_signals(close, 20, 2.0)

    return signals

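
# Example usage (a sketch): signals are built once on the full close series,
# so indicator warm-up NaNs fall where they belong, and then sliced per split:
#
#     signals = build_all_signals(df['close'])   # 21 named signal series
#     sma = signals['SMA_50_200']                # +1 golden cross, -1 death cross
#     n_events = (sma != 0).sum()                # crossovers are rare events
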
# ============================================================
# 3. Statistical tests
# ============================================================

def calc_forward_returns(close: pd.Series, periods: int = 1) -> pd.Series:
    """Forward N-day return (log return)."""
    return np.log(close.shift(-periods) / close)


def test_signal_returns(signal: pd.Series, returns: pd.Series) -> Dict:
    """
    Statistical tests for a single indicator's signals.

    - Welch t-test: mean return on signal days vs non-signal days
    - Mann-Whitney U: non-parametric counterpart
    - Binomial test: is directional accuracy significantly above 50%?
    - Information coefficient (IC): Spearman rank correlation
    """
    # Returns on buy-signal days (signal == 1)
    buy_returns = returns[signal == 1].dropna()
    # Returns on sell-signal days (signal == -1)
    sell_returns = returns[signal == -1].dropna()
    # Returns on days without a signal
    no_signal_returns = returns[signal == 0].dropna()

    result = {
        'n_buy': len(buy_returns),
        'n_sell': len(sell_returns),
        'n_no_signal': len(no_signal_returns),
        'buy_mean': buy_returns.mean() if len(buy_returns) > 0 else np.nan,
        'sell_mean': sell_returns.mean() if len(sell_returns) > 0 else np.nan,
        'no_signal_mean': no_signal_returns.mean() if len(no_signal_returns) > 0 else np.nan,
    }

    # --- Welch t-test (buy signals vs no signal) ---
    if len(buy_returns) >= 5 and len(no_signal_returns) >= 5:
        t_stat, t_pval = stats.ttest_ind(buy_returns, no_signal_returns, equal_var=False)
        result['welch_t_stat'] = t_stat
        result['welch_t_pval'] = t_pval
    else:
        result['welch_t_stat'] = np.nan
        result['welch_t_pval'] = np.nan

    # --- Mann-Whitney U (buy signals vs no signal) ---
    if len(buy_returns) >= 5 and len(no_signal_returns) >= 5:
        u_stat, u_pval = stats.mannwhitneyu(buy_returns, no_signal_returns, alternative='two-sided')
        result['mwu_stat'] = u_stat
        result['mwu_pval'] = u_pval
    else:
        result['mwu_stat'] = np.nan
        result['mwu_pval'] = np.nan

    # --- Binomial test: share of buy-signal days with positive returns vs 50% ---
    if len(buy_returns) >= 5:
        n_positive = (buy_returns > 0).sum()
        binom_pval = stats.binomtest(n_positive, len(buy_returns), 0.5).pvalue
        result['buy_hit_rate'] = n_positive / len(buy_returns)
        result['binom_pval'] = binom_pval
    else:
        result['buy_hit_rate'] = np.nan
        result['binom_pval'] = np.nan

    # --- Information coefficient (IC): Spearman rank correlation ---
    # Rank correlation of the signal value (-1, 0, 1) with forward returns
    valid_mask = signal.notna() & returns.notna()
    if valid_mask.sum() >= 30:
        ic, ic_pval = stats.spearmanr(signal[valid_mask], returns[valid_mask])
        result['ic'] = ic
        result['ic_pval'] = ic_pval
    else:
        result['ic'] = np.nan
        result['ic_pval'] = np.nan

    return result


def benjamini_hochberg(p_values: np.ndarray, alpha: float = 0.05) -> Tuple[np.ndarray, np.ndarray]:
    """
    Benjamini-Hochberg FDR correction.

    Parameters:
        p_values: array of raw p-values
        alpha: significance level

    Returns:
        (rejected, adjusted_p): whether each null is rejected, adjusted p-values
    """
    n = len(p_values)
    if n == 0:
        return np.array([], dtype=bool), np.array([])

    # Handle NaNs
    valid_mask = ~np.isnan(p_values)
    adjusted = np.full(n, np.nan)
    rejected = np.full(n, False)

    valid_pvals = p_values[valid_mask]
    n_valid = len(valid_pvals)
    if n_valid == 0:
        return rejected, adjusted

    # Sort
    sorted_idx = np.argsort(valid_pvals)
    sorted_pvals = valid_pvals[sorted_idx]

    # BH adjustment
    rank = np.arange(1, n_valid + 1)
    adjusted_sorted = sorted_pvals * n_valid / rank
    # Running minimum from the back enforces monotonicity
    adjusted_sorted = np.minimum.accumulate(adjusted_sorted[::-1])[::-1]
    adjusted_sorted = np.clip(adjusted_sorted, 0, 1)

    # Scatter back into the original order
    valid_indices = np.where(valid_mask)[0]
    for i, idx in enumerate(sorted_idx):
        adjusted[valid_indices[idx]] = adjusted_sorted[i]
        rejected[valid_indices[idx]] = adjusted_sorted[i] <= alpha

    return rejected, adjusted

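
# Worked example (illustrative numbers): for raw p-values
# [0.005, 0.02, 0.03, 0.10] with n=4, the adjusted values p*n/rank are
# [0.02, 0.04, 0.04, 0.10] after the backward running minimum, so at
# alpha=0.05 the first three nulls are rejected:
#
#     >>> rej, adj = benjamini_hochberg(np.array([0.005, 0.02, 0.03, 0.10]))
#     >>> rej.tolist()
#     [True, True, True, False]
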
def permutation_test(signal: pd.Series, returns: pd.Series, n_permutations: int = 1000, stat_func=None) -> Tuple[float, float]:
    """
    Permutation test.

    Randomly shuffles the signal/return pairing to assess how extreme the
    observed statistic is under the null of no association.
    Returns: (observed_stat, p_value)
    """
    if stat_func is None:
        # Default statistic: mean return on buy-signal days minus no-signal days
        def stat_func(sig, ret):
            buy_ret = ret[sig == 1]
            no_sig_ret = ret[sig == 0]
            if len(buy_ret) < 2 or len(no_sig_ret) < 2:
                return 0.0
            return buy_ret.mean() - no_sig_ret.mean()

    valid_mask = signal.notna() & returns.notna()
    sig_valid = signal[valid_mask].values
    ret_valid = returns[valid_mask].values

    observed = stat_func(pd.Series(sig_valid), pd.Series(ret_valid))

    # Shuffle
    count_extreme = 0
    rng = np.random.RandomState(42)
    for _ in range(n_permutations):
        perm_sig = rng.permutation(sig_valid)
        perm_stat = stat_func(pd.Series(perm_sig), pd.Series(ret_valid))
        if abs(perm_stat) >= abs(observed):
            count_extreme += 1

    perm_pval = (count_extreme + 1) / (n_permutations + 1)
    return observed, perm_pval

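
# Note on the "+1" in the p-value: adding one to both the extreme count and
# the permutation count is the standard finite-sample correction; it keeps
# the reported p-value strictly positive even when no permutation beats the
# observed statistic. Example usage (a sketch):
#
#     obs, p = permutation_test(signals['SMA_50_200'], fwd_ret, n_permutations=1000)
#     # p < 0.05 would suggest the mean-return gap is unlikely under no association
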
# ============================================================
# 4. Visualization
# ============================================================

def plot_ic_distribution(results_df: pd.DataFrame, output_dir: Path, prefix: str = "train"):
    """Plot the distribution of information coefficients (IC)."""
    fig, ax = plt.subplots(figsize=(12, 6))
    ic_vals = results_df['ic'].dropna()
    ax.barh(range(len(ic_vals)), ic_vals.values, color=['green' if v > 0 else 'red' for v in ic_vals.values])
    ax.set_yticks(range(len(ic_vals)))
    ax.set_yticklabels(ic_vals.index, fontsize=7)
    ax.set_xlabel('Information Coefficient (Spearman)')
    ax.set_title(f'IC Distribution - {prefix.upper()} Set')
    ax.axvline(x=0, color='black', linestyle='-', linewidth=0.5)
    plt.tight_layout()
    fig.savefig(output_dir / f"ic_distribution_{prefix}.png", dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"  [saved] ic_distribution_{prefix}.png")


def plot_pvalue_heatmap(results_df: pd.DataFrame, output_dir: Path, prefix: str = "train"):
    """Plot a p-value heatmap: raw vs FDR-adjusted."""
    pval_cols = ['welch_t_pval', 'mwu_pval', 'binom_pval', 'ic_pval']
    adj_cols = ['welch_t_adj_pval', 'mwu_adj_pval', 'binom_adj_pval', 'ic_adj_pval']

    # Only keep columns that exist
    existing_pval = [c for c in pval_cols if c in results_df.columns]
    existing_adj = [c for c in adj_cols if c in results_df.columns]

    if not existing_pval:
        return

    fig, axes = plt.subplots(1, 2, figsize=(16, max(8, len(results_df) * 0.35)))

    # Raw p-values
    pval_data = results_df[existing_pval].values.astype(float)
    im1 = axes[0].imshow(pval_data, aspect='auto', cmap='RdYlGn_r', vmin=0, vmax=0.1)
    axes[0].set_yticks(range(len(results_df)))
    axes[0].set_yticklabels(results_df.index, fontsize=6)
    axes[0].set_xticks(range(len(existing_pval)))
    axes[0].set_xticklabels([c.replace('_pval', '') for c in existing_pval], fontsize=8, rotation=45)
    axes[0].set_title('Raw p-values')
    plt.colorbar(im1, ax=axes[0], shrink=0.6)

    # FDR-adjusted p-values
    if existing_adj:
        adj_data = results_df[existing_adj].values.astype(float)
        im2 = axes[1].imshow(adj_data, aspect='auto', cmap='RdYlGn_r', vmin=0, vmax=0.1)
        axes[1].set_yticks(range(len(results_df)))
        axes[1].set_yticklabels(results_df.index, fontsize=6)
        axes[1].set_xticks(range(len(existing_adj)))
        axes[1].set_xticklabels([c.replace('_adj_pval', '') for c in existing_adj], fontsize=8, rotation=45)
        axes[1].set_title('FDR-adjusted p-values')
        plt.colorbar(im2, ax=axes[1], shrink=0.6)
    else:
        axes[1].text(0.5, 0.5, 'No adjusted p-values', ha='center', va='center')
        axes[1].set_title('FDR-adjusted p-values (N/A)')

    plt.suptitle(f'P-value Heatmap - {prefix.upper()} Set', fontsize=14)
    plt.tight_layout()
    fig.savefig(output_dir / f"pvalue_heatmap_{prefix}.png", dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"  [saved] pvalue_heatmap_{prefix}.png")


def plot_best_indicator_signal(close: pd.Series, signal: pd.Series, returns: pd.Series,
                               indicator_name: str, output_dir: Path, prefix: str = "train"):
    """Plot the best indicator's signals and the return distribution on signal days."""
    fig, axes = plt.subplots(2, 1, figsize=(14, 10), gridspec_kw={'height_ratios': [2, 1]})

    # Top: price with signal markers
    axes[0].plot(close.index, close.values, color='gray', alpha=0.7, linewidth=0.8, label='BTC Close')
    buy_mask = signal == 1
    sell_mask = signal == -1
    axes[0].scatter(close.index[buy_mask], close.values[buy_mask],
                    marker='^', color='green', s=40, label='Buy Signal', zorder=5)
    axes[0].scatter(close.index[sell_mask], close.values[sell_mask],
                    marker='v', color='red', s=40, label='Sell Signal', zorder=5)
    axes[0].set_title(f'Best Indicator: {indicator_name} - {prefix.upper()} Set')
    axes[0].set_ylabel('Price (USDT)')
    axes[0].legend(fontsize=8)

    # Bottom: distribution of returns on signal days
    buy_returns = returns[buy_mask].dropna()
    sell_returns = returns[sell_mask].dropna()
    if len(buy_returns) > 0:
        axes[1].hist(buy_returns, bins=30, alpha=0.6, color='green', label=f'Buy ({len(buy_returns)})')
    if len(sell_returns) > 0:
        axes[1].hist(sell_returns, bins=30, alpha=0.6, color='red', label=f'Sell ({len(sell_returns)})')
    axes[1].axvline(x=0, color='black', linestyle='--', linewidth=0.8)
    axes[1].set_xlabel('Forward 1-day Log Return')
    axes[1].set_ylabel('Count')
    axes[1].legend(fontsize=8)

    plt.tight_layout()
    fig.savefig(output_dir / f"best_indicator_{prefix}.png", dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"  [saved] best_indicator_{prefix}.png")


# ============================================================
# 5. Main pipeline
# ============================================================

def evaluate_signals_on_set(close: pd.Series, signals: Dict[str, pd.Series], set_name: str) -> pd.DataFrame:
    """
    Evaluate every signal on the given data set.

    Returns a DataFrame with all test statistics.
    """
    # Forward 1-day returns
    fwd_ret = calc_forward_returns(close, periods=1)

    results = {}
    for name, signal in signals.items():
        # Restrict the signal to the current data set
        sig = signal.reindex(close.index).fillna(0)
        ret = fwd_ret.reindex(close.index)
        results[name] = test_signal_returns(sig, ret)

    results_df = pd.DataFrame(results).T
    results_df.index.name = 'indicator'

    print(f"\n{'='*60}")
    print(f"  Evaluation results on the {set_name} set")
    print(f"{'='*60}")
    print(f"  Indicators tested: {len(results_df)}")
    print(f"  Data points: {len(close)}")

    return results_df


def apply_fdr_correction(results_df: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame:
    """
    Apply Benjamini-Hochberg FDR correction to every p-value column.
    """
    pval_cols = ['welch_t_pval', 'mwu_pval', 'binom_pval', 'ic_pval']

    for col in pval_cols:
        if col not in results_df.columns:
            continue
        pvals = results_df[col].values.astype(float)
        rejected, adjusted = benjamini_hochberg(pvals, alpha)
        adj_col = col.replace('_pval', '_adj_pval')
        rej_col = col.replace('_pval', '_rejected')
        results_df[adj_col] = adjusted
        results_df[rej_col] = rejected

    return results_df


def run_indicators_analysis(df: pd.DataFrame, output_dir: str) -> Dict:
    """
    Main entry point for technical-indicator validation.

    Parameters:
        df: full daily DataFrame (open/high/low/close/volume columns, DatetimeIndex)
        output_dir: directory for output figures

    Returns:
        dict with training- and validation-set results
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    print("=" * 60)
    print("  Technical-indicator validation")
    print("=" * 60)

    # --- Data split ---
    train, val, test = split_data(df)
    print(f"\nTrain: {train.index.min()} ~ {train.index.max()} ({len(train)} bars)")
    print(f"Val:   {val.index.min()} ~ {val.index.max()} ({len(val)} bars)")

    # --- Build all signals on the full series (avoids leading-NaN issues) ---
    all_signals = build_all_signals(df['close'])
    print(f"\nBuilt {len(all_signals)} technical-indicator signals")

    # ============ Training-set evaluation ============
    train_results = evaluate_signals_on_set(train['close'], all_signals, "TRAIN")

    # FDR correction
    train_results = apply_fdr_correction(train_results, alpha=0.05)

    # Indicators that survive the FDR correction
    reject_cols = [c for c in train_results.columns if c.endswith('_rejected')]
    if reject_cols:
        train_results['any_fdr_pass'] = train_results[reject_cols].any(axis=1)
        fdr_passed = train_results[train_results['any_fdr_pass']].index.tolist()
    else:
        fdr_passed = []

    print("\n--- FDR correction results (train) ---")
    if fdr_passed:
        print(f"  Indicators passing FDR ({len(fdr_passed)}):")
        for name in fdr_passed:
            row = train_results.loc[name]
            ic_val = row.get('ic', np.nan)
            print(f"    - {name}: IC={ic_val:.4f}" if not np.isnan(ic_val) else f"    - {name}")
    else:
        print("  No indicator passes FDR correction (alpha=0.05)")

    # --- Permutation tests (top-5 indicators by |IC| only) ---
    fwd_ret_train = calc_forward_returns(train['close'], periods=1)
    ic_series = train_results['ic'].dropna().abs().sort_values(ascending=False)
    top_indicators = ic_series.head(5).index.tolist()

    print("\n--- Permutation tests (train, top-5 by |IC|, 1000 shuffles) ---")
    perm_results = {}
    for name in top_indicators:
        sig = all_signals[name].reindex(train.index).fillna(0)
        ret = fwd_ret_train.reindex(train.index)
        obs, pval = permutation_test(sig, ret, n_permutations=1000)
        perm_results[name] = {'observed_diff': obs, 'perm_pval': pval}
        perm_pass = "PASS" if pval < 0.05 else "FAIL"
        print(f"  {name}: obs_diff={obs:.6f}, perm_p={pval:.4f} [{perm_pass}]")

    # --- Training-set plots ---
    print("\n--- Training-set plots ---")
    plot_ic_distribution(train_results, output_dir, prefix="train")
    plot_pvalue_heatmap(train_results, output_dir, prefix="train")

    # Best indicator (largest |IC|)
    if len(ic_series) > 0:
        best_name = ic_series.index[0]
        best_signal = all_signals[best_name].reindex(train.index).fillna(0)
        best_ret = fwd_ret_train.reindex(train.index)
        plot_best_indicator_signal(train['close'], best_signal, best_ret, best_name, output_dir, prefix="train")

    # ============ Validation-set evaluation ============
    val_results = evaluate_signals_on_set(val['close'], all_signals, "VAL")
    val_results = apply_fdr_correction(val_results, alpha=0.05)

    reject_cols_val = [c for c in val_results.columns if c.endswith('_rejected')]
    if reject_cols_val:
        val_results['any_fdr_pass'] = val_results[reject_cols_val].any(axis=1)
        val_fdr_passed = val_results[val_results['any_fdr_pass']].index.tolist()
    else:
        val_fdr_passed = []

    print("\n--- FDR correction results (validation) ---")
    if val_fdr_passed:
        print(f"  Indicators passing FDR ({len(val_fdr_passed)}):")
        for name in val_fdr_passed:
            row = val_results.loc[name]
            ic_val = row.get('ic', np.nan)
            print(f"    - {name}: IC={ic_val:.4f}" if not np.isnan(ic_val) else f"    - {name}")
    else:
        print("  No indicator passes FDR correction (alpha=0.05)")

    # Train vs validation IC comparison
    if 'ic' in train_results.columns and 'ic' in val_results.columns:
        print("\n--- Train vs validation IC (top-10) ---")
        merged_ic = pd.DataFrame({
            'train_ic': train_results['ic'],
            'val_ic': val_results['ic']
        }).dropna()
        merged_ic['consistent'] = (merged_ic['train_ic'] * merged_ic['val_ic']) > 0  # same sign
        merged_ic = merged_ic.reindex(merged_ic['train_ic'].abs().sort_values(ascending=False).index)
        for name in merged_ic.head(10).index:
            row = merged_ic.loc[name]
            cons = "OK" if row['consistent'] else "FLIP"
            print(f"  {name}: train_IC={row['train_ic']:.4f}, val_IC={row['val_ic']:.4f} [{cons}]")

    # --- Validation-set plots ---
    print("\n--- Validation-set plots ---")
    plot_ic_distribution(val_results, output_dir, prefix="val")
    plot_pvalue_heatmap(val_results, output_dir, prefix="val")

    val_ic_series = val_results['ic'].dropna().abs().sort_values(ascending=False)
    if len(val_ic_series) > 0:
        fwd_ret_val = calc_forward_returns(val['close'], periods=1)
        best_val_name = val_ic_series.index[0]
        best_val_signal = all_signals[best_val_name].reindex(val.index).fillna(0)
        best_val_ret = fwd_ret_val.reindex(val.index)
        plot_best_indicator_signal(val['close'], best_val_signal, best_val_ret, best_val_name, output_dir, prefix="val")

    print(f"\n{'='*60}")
    print("  Technical-indicator validation complete")
    print(f"{'='*60}")

    return {
        'train_results': train_results,
        'val_results': val_results,
        'fdr_passed_train': fdr_passed,
        'fdr_passed_val': val_fdr_passed,
        'permutation_results': perm_results,
        'all_signals': all_signals,
    }
853
src/patterns.py
Normal file
@@ -0,0 +1,853 @@
"""
Candlestick pattern detection and statistical validation module

Implements common candlestick patterns by hand (Doji, Hammer, Engulfing,
Morning/Evening Star, etc.) and validates them with forward-return
analysis, Wilson confidence intervals, and FDR correction.
"""

import matplotlib
matplotlib.use('Agg')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from pathlib import Path
from typing import Dict, List, Tuple, Optional

from src.data_loader import split_data


# ============================================================
# 1. Helpers
# ============================================================

def _body(df: pd.DataFrame) -> pd.Series:
    """Candle body size (absolute)."""
    return (df['close'] - df['open']).abs()


def _body_signed(df: pd.DataFrame) -> pd.Series:
    """Signed body (positive = bullish candle, negative = bearish)."""
    return df['close'] - df['open']


def _upper_shadow(df: pd.DataFrame) -> pd.Series:
    """Upper shadow length."""
    return df['high'] - df[['open', 'close']].max(axis=1)


def _lower_shadow(df: pd.DataFrame) -> pd.Series:
    """Lower shadow length."""
    return df[['open', 'close']].min(axis=1) - df['low']


def _total_range(df: pd.DataFrame) -> pd.Series:
    """Total range (high - low), with zeros masked to avoid division by zero."""
    return (df['high'] - df['low']).replace(0, np.nan)


def _is_bullish(df: pd.DataFrame) -> pd.Series:
    """Bullish candle?"""
    return df['close'] > df['open']


def _is_bearish(df: pd.DataFrame) -> pd.Series:
    """Bearish candle?"""
    return df['close'] < df['open']


# ============================================================
# 2. Pattern detectors (implemented by hand)
# ============================================================

def detect_doji(df: pd.DataFrame) -> pd.Series:
    """
    Doji
    Condition: body < 10% of the total range
    Direction: neutral (0)
    """
    body = _body(df)
    total = _total_range(df)
    return (body / total < 0.10).astype(int)


def detect_hammer(df: pd.DataFrame) -> pd.Series:
    """
    Hammer - bullish bottom-reversal signal
    Conditions:
    - lower shadow > 2x the body
    - upper shadow < 0.5x the body
    - non-zero body (excludes dojis)
    """
    body = _body(df)
    lower = _lower_shadow(df)
    upper = _upper_shadow(df)

    cond = (
        (lower > 2 * body) &
        (upper < 0.5 * body + 1e-10) &  # small epsilon guards zero-body candles
        (body > 0)                      # exclude dojis
    )
    return cond.astype(int)


def detect_inverted_hammer(df: pd.DataFrame) -> pd.Series:
    """
    Inverted Hammer - bullish bottom-reversal signal
    Conditions:
    - upper shadow > 2x the body
    - lower shadow < 0.5x the body
    """
    body = _body(df)
    lower = _lower_shadow(df)
    upper = _upper_shadow(df)

    cond = (
        (upper > 2 * body) &
        (lower < 0.5 * body + 1e-10) &
        (body > 0)
    )
    return cond.astype(int)


def detect_bullish_engulfing(df: pd.DataFrame) -> pd.Series:
    """
    Bullish Engulfing
    Conditions:
    - previous candle bearish, current candle bullish
    - current body completely engulfs the previous body
    """
    prev_bearish = _is_bearish(df).shift(1)
    curr_bullish = _is_bullish(df)

    # Current open <= previous close (the bearish candle closed lower)
    # and current close >= previous open
    cond = (
        prev_bearish &
        curr_bullish &
        (df['open'] <= df['close'].shift(1)) &
        (df['close'] >= df['open'].shift(1))
    )
    return cond.fillna(False).astype(int)


def detect_bearish_engulfing(df: pd.DataFrame) -> pd.Series:
    """
    Bearish Engulfing
    Conditions:
    - previous candle bullish, current candle bearish
    - current body completely engulfs the previous body
    """
    prev_bullish = _is_bullish(df).shift(1)
    curr_bearish = _is_bearish(df)

    cond = (
        prev_bullish &
        curr_bearish &
        (df['open'] >= df['close'].shift(1)) &
        (df['close'] <= df['open'].shift(1))
    )
    return cond.fillna(False).astype(int)


def detect_morning_star(df: pd.DataFrame) -> pd.Series:
    """
    Morning Star - 3-candle bottom reversal
    Conditions:
    - candle 1: large bearish candle (body > rolling median body)
    - candle 2: small body (body < 0.5x median body)
    - candle 3: large bullish candle closing above the midpoint of candle 1's body
    """
    body = _body(df)
    body_signed = _body_signed(df)
    median_body = body.rolling(window=20, min_periods=10).median()

    # Candle 1: large bearish
    bar1_big_bear = (body_signed.shift(2) < 0) & (body.shift(2) > median_body.shift(2))
    # Candle 2: small body
    bar2_small = body.shift(1) < median_body.shift(1) * 0.5
    # Candle 3: large bullish closing above candle 1's body midpoint
    bar1_mid = (df['open'].shift(2) + df['close'].shift(2)) / 2
    bar3_big_bull = (body_signed > 0) & (body > median_body) & (df['close'] > bar1_mid)

    cond = bar1_big_bear & bar2_small & bar3_big_bull
    return cond.fillna(False).astype(int)


def detect_evening_star(df: pd.DataFrame) -> pd.Series:
    """
    Evening Star - 3-candle top reversal
    Conditions:
    - candle 1: large bullish candle
    - candle 2: small body
    - candle 3: large bearish candle closing below the midpoint of candle 1's body
    """
    body = _body(df)
    body_signed = _body_signed(df)
    median_body = body.rolling(window=20, min_periods=10).median()

    bar1_big_bull = (body_signed.shift(2) > 0) & (body.shift(2) > median_body.shift(2))
    bar2_small = body.shift(1) < median_body.shift(1) * 0.5
    bar1_mid = (df['open'].shift(2) + df['close'].shift(2)) / 2
    bar3_big_bear = (body_signed < 0) & (body > median_body) & (df['close'] < bar1_mid)

    cond = bar1_big_bull & bar2_small & bar3_big_bear
    return cond.fillna(False).astype(int)


def detect_three_white_soldiers(df: pd.DataFrame) -> pd.Series:
    """
    Three White Soldiers
    Conditions:
    - three consecutive bullish candles
    - each opens within the previous candle's body
    - each closes at a new high
    - small upper shadows
    """
    bullish = _is_bullish(df)
    body = _body(df)
    upper = _upper_shadow(df)

    cond = (
        bullish & bullish.shift(1) & bullish.shift(2) &
        # Successively higher closes
        (df['close'] > df['close'].shift(1)) &
        (df['close'].shift(1) > df['close'].shift(2)) &
        # Each open within the previous body
        (df['open'] >= df['open'].shift(1)) &
        (df['open'] <= df['close'].shift(1)) &
        (df['open'].shift(1) >= df['open'].shift(2)) &
        (df['open'].shift(1) <= df['close'].shift(2)) &
        # Upper shadow no more than 30% of the body
        (upper < body * 0.3 + 1e-10) &
        (upper.shift(1) < body.shift(1) * 0.3 + 1e-10)
    )
    return cond.fillna(False).astype(int)


def detect_three_black_crows(df: pd.DataFrame) -> pd.Series:
    """
    Three Black Crows
    Conditions:
    - three consecutive bearish candles
    - each opens within the previous candle's body
    - each closes at a new low
    - small lower shadows
    """
    bearish = _is_bearish(df)
    body = _body(df)
    lower = _lower_shadow(df)

    cond = (
        bearish & bearish.shift(1) & bearish.shift(2) &
        # Successively lower closes
        (df['close'] < df['close'].shift(1)) &
        (df['close'].shift(1) < df['close'].shift(2)) &
        # Each open within the previous body
        (df['open'] <= df['open'].shift(1)) &
        (df['open'] >= df['close'].shift(1)) &
        (df['open'].shift(1) <= df['open'].shift(2)) &
        (df['open'].shift(1) >= df['close'].shift(2)) &
        # Lower shadow no more than 30% of the body
        (lower < body * 0.3 + 1e-10) &
        (lower.shift(1) < body.shift(1) * 0.3 + 1e-10)
    )
    return cond.fillna(False).astype(int)


def detect_pin_bar(df: pd.DataFrame) -> pd.Series:
    """
    Pin Bar (shadow > 2/3 of the total range)
    Covers both upper pin bars (bearish) and lower pin bars (bullish).
    Returns:
        +1 = lower pin bar (long lower shadow, bullish)
        -1 = upper pin bar (long upper shadow, bearish)
         0 = no signal
    """
    total = _total_range(df)
    upper = _upper_shadow(df)
    lower = _lower_shadow(df)
    threshold = 2.0 / 3.0

    long_lower = (lower / total > threshold)  # long lower shadow -> bullish
    long_upper = (upper / total > threshold)  # long upper shadow -> bearish

    signal = pd.Series(0, index=df.index)
    signal[long_lower] = 1    # bullish pin bar
    signal[long_upper] = -1   # bearish pin bar
    # If both hold simultaneously (degenerate case), cancel the signal
    signal[long_lower & long_upper] = 0
    return signal


def detect_shooting_star(df: pd.DataFrame) -> pd.Series:
    """
    Shooting Star - bearish top-reversal signal
    Conditions:
    - upper shadow > 2x the body
    - lower shadow < 0.5x the body
    - at the end of a short advance (prior close below the current high,
      and the close before that below the prior close)
    """
    body = _body(df)
    upper = _upper_shadow(df)
    lower = _lower_shadow(df)

    cond = (
        (upper > 2 * body) &
        (lower < 0.5 * body + 1e-10) &
        (body > 0) &
        (df['close'].shift(1) < df['high']) &
        (df['close'].shift(2) < df['close'].shift(1))
    )
    return cond.fillna(False).astype(int)


def detect_all_patterns(df: pd.DataFrame) -> Dict[str, pd.Series]:
    """
    Detect every candlestick pattern.
    Returns a dict: {pattern name: signal series}

    For directional patterns:
    - bullish patterns: value > 0 means detected
    - bearish patterns: value > 0 means detected
    - Pin Bar is special (+1 = bullish, -1 = bearish), so it is split
      into two separate 0/1 series below
    """
    patterns = {}

    # --- Single-candle patterns ---
    patterns['Doji'] = detect_doji(df)
    patterns['Hammer'] = detect_hammer(df)
    patterns['Inverted_Hammer'] = detect_inverted_hammer(df)
    patterns['Shooting_Star'] = detect_shooting_star(df)
    patterns['Pin_Bar_Bull'] = (detect_pin_bar(df) == 1).astype(int)
    patterns['Pin_Bar_Bear'] = (detect_pin_bar(df) == -1).astype(int)

    # --- Two-candle patterns ---
    patterns['Bullish_Engulfing'] = detect_bullish_engulfing(df)
    patterns['Bearish_Engulfing'] = detect_bearish_engulfing(df)

    # --- Three-candle patterns ---
    patterns['Morning_Star'] = detect_morning_star(df)
    patterns['Evening_Star'] = detect_evening_star(df)
    patterns['Three_White_Soldiers'] = detect_three_white_soldiers(df)
    patterns['Three_Black_Crows'] = detect_three_black_crows(df)

    return patterns

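
# Example usage (a sketch; `df` is an OHLCV DataFrame with a DatetimeIndex):
#
#     patterns = detect_all_patterns(df)              # 12 named 0/1 series
#     counts = {k: int(v.sum()) for k, v in patterns.items()}
#     # Frequencies vary widely: dojis are common, three-candle reversals rare,
#     # which is why analyze_pattern_returns() skips patterns with < 3 hits.
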
# Expected direction of each pattern (+1 = bullish, -1 = bearish, 0 = neutral)
PATTERN_EXPECTED_DIRECTION = {
    'Doji': 0,
    'Hammer': 1,
    'Inverted_Hammer': 1,
    'Shooting_Star': -1,
    'Pin_Bar_Bull': 1,
    'Pin_Bar_Bear': -1,
    'Bullish_Engulfing': 1,
    'Bearish_Engulfing': -1,
    'Morning_Star': 1,
    'Evening_Star': -1,
    'Three_White_Soldiers': 1,
    'Three_Black_Crows': -1,
}


# ============================================================
# 3. Forward-return analysis
# ============================================================

def calc_forward_returns_multi(close: pd.Series, horizons: Optional[List[int]] = None) -> pd.DataFrame:
    """Forward log returns over several horizons."""
    if horizons is None:
        horizons = [1, 3, 5, 10, 20]
    fwd = pd.DataFrame(index=close.index)
    for h in horizons:
        fwd[f'fwd_{h}d'] = np.log(close.shift(-h) / close)
    return fwd

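
# Alignment note (worth making explicit): fwd.loc[t, 'fwd_1d'] is the return
# from the close at t to the close at t+1, so a pattern detected at t is
# always scored on information that arrives strictly afterwards - no
# look-ahead. A sketch:
#
#     fwd = calc_forward_returns_multi(df['close'])   # fwd_1d ... fwd_20d
#     fwd['fwd_1d'].iloc[-1]                          # NaN: no next close yet
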

def analyze_pattern_returns(pattern_signal: pd.Series, fwd_returns: pd.DataFrame,
                            expected_dir: int = 0) -> Dict:
    """
    Forward-return analysis for a single pattern.

    Parameters:
        pattern_signal: detection signal (1 = present, 0 = absent)
        fwd_returns: forward-return DataFrame
        expected_dir: expected direction (+1 = bullish, -1 = bearish, 0 = neutral)

    Returns:
        dict of statistics
    """
    mask = pattern_signal > 0  # Pin_Bar_Bear is already split into its own signal
    n_occurrences = mask.sum()

    result = {'n_occurrences': int(n_occurrences), 'expected_direction': expected_dir}

    if n_occurrences < 3:
        # too few samples; skip
        for col in fwd_returns.columns:
            result[f'{col}_mean'] = np.nan
            result[f'{col}_median'] = np.nan
            result[f'{col}_pct_positive'] = np.nan
            result[f'{col}_ttest_pval'] = np.nan
        result['hit_rate'] = np.nan
        result['wilson_ci_lower'] = np.nan
        result['wilson_ci_upper'] = np.nan
        return result

    for col in fwd_returns.columns:
        returns = fwd_returns.loc[mask, col].dropna()
        if len(returns) == 0:
            result[f'{col}_mean'] = np.nan
            result[f'{col}_median'] = np.nan
            result[f'{col}_pct_positive'] = np.nan
            result[f'{col}_ttest_pval'] = np.nan
            continue

        result[f'{col}_mean'] = returns.mean()
        result[f'{col}_median'] = returns.median()
        result[f'{col}_pct_positive'] = (returns > 0).mean()

        # one-sample t-test: is the mean significantly different from 0?
        if len(returns) >= 5:
            t_stat, t_pval = stats.ttest_1samp(returns, 0)
            result[f'{col}_ttest_pval'] = t_pval
        else:
            result[f'{col}_ttest_pval'] = np.nan

    # --- hit rate ---
    # fwd_1d is used as the scoring horizon
    if 'fwd_1d' in fwd_returns.columns:
        ret_1d = fwd_returns.loc[mask, 'fwd_1d'].dropna()
        if len(ret_1d) > 0:
            if expected_dir == 1:
                # bullish: return > 0 counts as a hit
                hits = (ret_1d > 0).sum()
            elif expected_dir == -1:
                # bearish: return < 0 counts as a hit
                hits = (ret_1d < 0).sum()
            else:
                # neutral: accuracy of the majority direction
                hits = max((ret_1d > 0).sum(), (ret_1d < 0).sum())

            n = len(ret_1d)
            hit_rate = hits / n
            result['hit_rate'] = hit_rate
            result['hit_count'] = int(hits)
            result['hit_n'] = int(n)

            # Wilson confidence interval
            ci_lower, ci_upper = wilson_confidence_interval(hits, n, alpha=0.05)
            result['wilson_ci_lower'] = ci_lower
            result['wilson_ci_upper'] = ci_upper

            # binomial test: is the hit rate significantly above 50%?
            binom_pval = stats.binomtest(hits, n, 0.5, alternative='greater').pvalue
            result['binom_pval'] = binom_pval
        else:
            result['hit_rate'] = np.nan
            result['wilson_ci_lower'] = np.nan
            result['wilson_ci_upper'] = np.nan
            result['binom_pval'] = np.nan
    else:
        result['hit_rate'] = np.nan
        result['wilson_ci_lower'] = np.nan
        result['wilson_ci_upper'] = np.nan

    return result

# ============================================================
# 4. Wilson confidence interval + FDR correction
# ============================================================

def wilson_confidence_interval(successes: int, n: int, alpha: float = 0.05) -> Tuple[float, float]:
    """
    Wilson score confidence interval.

    Better behaved than the Wald interval for small samples and
    proportions near 0 or 1.

    Parameters:
        successes: number of successes
        n: number of trials
        alpha: significance level

    Returns:
        (lower, upper) confidence bounds
    """
    if n == 0:
        return (0.0, 1.0)

    p_hat = successes / n
    z = stats.norm.ppf(1 - alpha / 2)

    denominator = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denominator
    margin = z * np.sqrt((p_hat * (1 - p_hat) + z ** 2 / (4 * n)) / n) / denominator

    lower = max(0, center - margin)
    upper = min(1, center + margin)
    return (lower, upper)
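
# Quick sanity check (hypothetical counts, not report output): 56 hits in
# 100 trials gives a 95% Wilson interval of roughly (0.462, 0.653), which
# still straddles 0.5 -- i.e. indistinguishable from coin flips.
#
#     >>> wilson_confidence_interval(56, 100, alpha=0.05)
#     (0.462..., 0.653...)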

def benjamini_hochberg(p_values: np.ndarray, alpha: float = 0.05) -> Tuple[np.ndarray, np.ndarray]:
    """
    Benjamini-Hochberg FDR correction.

    Parameters:
        p_values: raw p-values
        alpha: significance level

    Returns:
        (rejected, adjusted_p): rejection flags, adjusted p-values
    """
    n = len(p_values)
    if n == 0:
        return np.array([], dtype=bool), np.array([])

    valid_mask = ~np.isnan(p_values)
    adjusted = np.full(n, np.nan)
    rejected = np.full(n, False)

    valid_pvals = p_values[valid_mask]
    n_valid = len(valid_pvals)
    if n_valid == 0:
        return rejected, adjusted

    sorted_idx = np.argsort(valid_pvals)
    sorted_pvals = valid_pvals[sorted_idx]

    rank = np.arange(1, n_valid + 1)
    adjusted_sorted = sorted_pvals * n_valid / rank
    adjusted_sorted = np.minimum.accumulate(adjusted_sorted[::-1])[::-1]
    adjusted_sorted = np.clip(adjusted_sorted, 0, 1)

    valid_indices = np.where(valid_mask)[0]
    for i, idx in enumerate(sorted_idx):
        adjusted[valid_indices[idx]] = adjusted_sorted[i]
        rejected[valid_indices[idx]] = adjusted_sorted[i] <= alpha

    return rejected, adjusted
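
# Worked example (hypothetical p-values, for illustration only):
# raw p = [0.01, 0.02, 0.03, 0.50] sorts as-is; adjusted_j = p_j * n / rank_j
# gives [0.04, 0.04, 0.04, 0.50] after the reverse cumulative minimum, so the
# three small p-values all survive at alpha = 0.05 despite the 4-way test.
#
#     >>> benjamini_hochberg(np.array([0.01, 0.02, 0.03, 0.50]), alpha=0.05)
#     (array([ True,  True,  True, False]), array([0.04, 0.04, 0.04, 0.5 ]))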

# ============================================================
# 5. Plotting
# ============================================================

def plot_pattern_counts(pattern_counts: Dict[str, int], output_dir: Path, prefix: str = "train"):
    """Bar chart of pattern occurrence counts."""
    fig, ax = plt.subplots(figsize=(12, 6))

    names = list(pattern_counts.keys())
    counts = list(pattern_counts.values())
    colors = ['#2ecc71' if PATTERN_EXPECTED_DIRECTION.get(n, 0) >= 0 else '#e74c3c' for n in names]

    bars = ax.barh(range(len(names)), counts, color=colors, edgecolor='gray', linewidth=0.5)
    ax.set_yticks(range(len(names)))
    ax.set_yticklabels(names, fontsize=9)
    ax.set_xlabel('Occurrence Count')
    ax.set_title(f'Pattern Occurrence Counts - {prefix.upper()} Set')

    # annotate counts on the bars
    for bar, count in zip(bars, counts):
        ax.text(bar.get_width() + 0.5, bar.get_y() + bar.get_height() / 2,
                str(count), va='center', fontsize=8)

    plt.tight_layout()
    fig.savefig(output_dir / f"pattern_counts_{prefix}.png", dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"  [saved] pattern_counts_{prefix}.png")


def plot_forward_return_boxplots(patterns: Dict[str, pd.Series], fwd_returns: pd.DataFrame,
                                 output_dir: Path, prefix: str = "train"):
    """Boxplots of forward returns per pattern."""
    horizons = [c for c in fwd_returns.columns if c.startswith('fwd_')]
    n_horizons = len(horizons)
    if n_horizons == 0:
        return

    # keep only patterns with enough occurrences
    valid_patterns = {name: sig for name, sig in patterns.items() if sig.sum() >= 3}
    if not valid_patterns:
        return

    n_patterns = len(valid_patterns)
    fig, axes = plt.subplots(1, n_horizons, figsize=(4 * n_horizons, max(6, n_patterns * 0.4)))
    if n_horizons == 1:
        axes = [axes]

    for ax_idx, horizon in enumerate(horizons):
        data_list = []
        labels = []
        plotted_names = []
        for name, sig in valid_patterns.items():
            mask = sig > 0
            ret = fwd_returns.loc[mask, horizon].dropna()
            if len(ret) > 0:
                data_list.append(ret.values)
                labels.append(f"{name} (n={len(ret)})")
                plotted_names.append(name)

        if data_list:
            bp = axes[ax_idx].boxplot(data_list, vert=False, patch_artist=True, widths=0.6)
            # color by the names actually plotted: patterns with no valid
            # returns at this horizon are skipped, so zipping the full dict
            # would misalign boxes and names
            for patch, name in zip(bp['boxes'], plotted_names):
                direction = PATTERN_EXPECTED_DIRECTION.get(name, 0)
                patch.set_facecolor('#a8e6cf' if direction >= 0 else '#ffb3b3')
                patch.set_alpha(0.7)
            axes[ax_idx].set_yticklabels(labels, fontsize=7)
        axes[ax_idx].axvline(x=0, color='red', linestyle='--', linewidth=0.8, alpha=0.7)
        axes[ax_idx].set_xlabel('Log Return')
        horizon_label = horizon.replace('fwd_', '').replace('d', '-day')
        axes[ax_idx].set_title(f'{horizon_label} Forward Return')

    plt.suptitle(f'Pattern Forward Returns - {prefix.upper()} Set', fontsize=13)
    plt.tight_layout()
    fig.savefig(output_dir / f"pattern_forward_returns_{prefix}.png", dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"  [saved] pattern_forward_returns_{prefix}.png")

def plot_hit_rate_with_ci(results_df: pd.DataFrame, output_dir: Path, prefix: str = "train"):
    """Hit rate with Wilson confidence intervals."""
    # keep rows with valid data
    valid = results_df.dropna(subset=['hit_rate', 'wilson_ci_lower', 'wilson_ci_upper'])
    if len(valid) == 0:
        return

    fig, ax = plt.subplots(figsize=(12, max(6, len(valid) * 0.5)))

    names = valid.index.tolist()
    hit_rates = valid['hit_rate'].values
    ci_lower = valid['wilson_ci_lower'].values
    ci_upper = valid['wilson_ci_upper'].values

    y_pos = range(len(names))
    # error bars from the confidence interval
    xerr_lower = hit_rates - ci_lower
    xerr_upper = ci_upper - hit_rates
    xerr = np.array([xerr_lower, xerr_upper])

    colors = ['#2ecc71' if hr > 0.5 else '#e74c3c' for hr in hit_rates]
    ax.barh(y_pos, hit_rates, xerr=xerr, color=colors, edgecolor='gray',
            linewidth=0.5, alpha=0.8, capsize=3, ecolor='black')
    ax.axvline(x=0.5, color='blue', linestyle='--', linewidth=1.0, label='50% baseline')

    # annotate FDR-corrected significance
    if 'binom_adj_pval' in valid.columns:
        for i, name in enumerate(names):
            adj_p = valid.loc[name, 'binom_adj_pval']
            marker = ''
            if not np.isnan(adj_p):
                if adj_p < 0.01:
                    marker = ' ***'
                elif adj_p < 0.05:
                    marker = ' **'
                elif adj_p < 0.10:
                    marker = ' *'
            ax.text(ci_upper[i] + 0.01, i, f"{hit_rates[i]:.1%}{marker}", va='center', fontsize=8)
    else:
        for i in range(len(names)):
            ax.text(ci_upper[i] + 0.01, i, f"{hit_rates[i]:.1%}", va='center', fontsize=8)

    ax.set_yticks(y_pos)
    ax.set_yticklabels(names, fontsize=9)
    ax.set_xlabel('Hit Rate')
    ax.set_title(f'Pattern Hit Rate with Wilson CI - {prefix.upper()} Set\n(* p<0.10, ** p<0.05, *** p<0.01 after FDR)')
    ax.legend(fontsize=9)
    ax.set_xlim(0, 1)

    plt.tight_layout()
    fig.savefig(output_dir / f"pattern_hit_rate_{prefix}.png", dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"  [saved] pattern_hit_rate_{prefix}.png")


# ============================================================
# 6. Main pipeline
# ============================================================

def evaluate_patterns_on_set(df: pd.DataFrame, patterns: Dict[str, pd.Series],
                             set_name: str) -> pd.DataFrame:
    """
    Evaluate all patterns on a given data split.

    Parameters:
        df: data split (OHLCV DataFrame)
        patterns: dict of pattern signals
        set_name: split name (for printing)

    Returns:
        DataFrame of statistics
    """
    close = df['close']
    fwd_returns = calc_forward_returns_multi(close, horizons=[1, 3, 5, 10, 20])

    results = {}
    for name, signal in patterns.items():
        sig = signal.reindex(df.index).fillna(0)
        expected_dir = PATTERN_EXPECTED_DIRECTION.get(name, 0)
        results[name] = analyze_pattern_returns(sig, fwd_returns, expected_dir)

    results_df = pd.DataFrame(results).T
    results_df.index.name = 'pattern'

    print(f"\n{'='*60}")
    print(f"  Pattern evaluation results - {set_name}")
    print(f"{'='*60}")

    # occurrence counts
    print(f"\n  Pattern occurrence counts:")
    for name in results_df.index:
        n = int(results_df.loc[name, 'n_occurrences'])
        print(f"    {name}: {n}")

    return results_df

def apply_fdr_to_patterns(results_df: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame:
    """
    Apply FDR correction across the pattern tests' p-values.

    Corrected p-value columns:
    - the t-test p-values for each forward horizon
    - the binomial-test p-value
    """
    # t-test p-value columns
    ttest_cols = [c for c in results_df.columns if c.endswith('_ttest_pval')]
    all_pval_cols = ttest_cols.copy()

    if 'binom_pval' in results_df.columns:
        all_pval_cols.append('binom_pval')

    for col in all_pval_cols:
        pvals = results_df[col].values.astype(float)
        rejected, adjusted = benjamini_hochberg(pvals, alpha)
        adj_col = col.replace('_pval', '_adj_pval')
        rej_col = col.replace('_pval', '_rejected')
        results_df[adj_col] = adjusted
        results_df[rej_col] = rejected

    return results_df

def run_patterns_analysis(df: pd.DataFrame, output_dir: str) -> Dict:
    """
    Candlestick pattern detection and statistical validation -- main entry point.

    Parameters:
        df: full daily DataFrame (open/high/low/close/volume columns, DatetimeIndex)
        output_dir: directory for output charts

    Returns:
        dict with train- and validation-set results
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    print("=" * 60)
    print("  Candlestick pattern detection and statistical validation")
    print("=" * 60)

    # --- data split ---
    train, val, test = split_data(df)
    print(f"\nTrain set: {train.index.min()} ~ {train.index.max()} ({len(train)} bars)")
    print(f"Validation set: {val.index.min()} ~ {val.index.max()} ({len(val)} bars)")

    # --- detect all patterns (on the full dataset) ---
    all_patterns = detect_all_patterns(df)
    print(f"\nDetected {len(all_patterns)} candlestick pattern types")

    # ============ train-set evaluation ============
    train_results = evaluate_patterns_on_set(train, all_patterns, "TRAIN")

    # FDR correction
    train_results = apply_fdr_to_patterns(train_results, alpha=0.05)

    # find significant patterns
    reject_cols = [c for c in train_results.columns if c.endswith('_rejected')]
    if reject_cols:
        train_results['any_fdr_pass'] = train_results[reject_cols].any(axis=1)
        fdr_passed_train = train_results[train_results['any_fdr_pass']].index.tolist()
    else:
        fdr_passed_train = []

    print(f"\n--- FDR correction results (train) ---")
    if fdr_passed_train:
        print(f"  Patterns passing FDR correction ({len(fdr_passed_train)}):")
        for name in fdr_passed_train:
            row = train_results.loc[name]
            hr = row.get('hit_rate', np.nan)
            n = int(row.get('n_occurrences', 0))
            hr_str = f", hit_rate={hr:.1%}" if not np.isnan(hr) else ""
            print(f"    - {name}: n={n}{hr_str}")
    else:
        print("  No pattern passes FDR correction (alpha=0.05)")

    # --- train-set plots ---
    print("\n--- Train-set plots ---")
    train_counts = {name: int(train_results.loc[name, 'n_occurrences']) for name in train_results.index}
    plot_pattern_counts(train_counts, output_dir, prefix="train")

    train_patterns_in_set = {name: sig.reindex(train.index).fillna(0) for name, sig in all_patterns.items()}
    train_fwd = calc_forward_returns_multi(train['close'], horizons=[1, 3, 5, 10, 20])
    plot_forward_return_boxplots(train_patterns_in_set, train_fwd, output_dir, prefix="train")
    plot_hit_rate_with_ci(train_results, output_dir, prefix="train")

    # ============ validation-set evaluation ============
    val_results = evaluate_patterns_on_set(val, all_patterns, "VAL")
    val_results = apply_fdr_to_patterns(val_results, alpha=0.05)

    reject_cols_val = [c for c in val_results.columns if c.endswith('_rejected')]
    if reject_cols_val:
        val_results['any_fdr_pass'] = val_results[reject_cols_val].any(axis=1)
        fdr_passed_val = val_results[val_results['any_fdr_pass']].index.tolist()
    else:
        fdr_passed_val = []

    print(f"\n--- FDR correction results (validation) ---")
    if fdr_passed_val:
        print(f"  Patterns passing FDR correction ({len(fdr_passed_val)}):")
        for name in fdr_passed_val:
            row = val_results.loc[name]
            hr = row.get('hit_rate', np.nan)
            n = int(row.get('n_occurrences', 0))
            hr_str = f", hit_rate={hr:.1%}" if not np.isnan(hr) else ""
            print(f"    - {name}: n={n}{hr_str}")
    else:
        print("  No pattern passes FDR correction (alpha=0.05)")

    # --- train vs validation comparison ---
    if 'hit_rate' in train_results.columns and 'hit_rate' in val_results.columns:
        print(f"\n--- Train vs validation hit-rate comparison ---")
        for name in train_results.index:
            tr_hr = train_results.loc[name, 'hit_rate'] if name in train_results.index else np.nan
            va_hr = val_results.loc[name, 'hit_rate'] if name in val_results.index else np.nan
            if np.isnan(tr_hr) or np.isnan(va_hr):
                continue
            diff = va_hr - tr_hr
            label = "STABLE" if abs(diff) < 0.05 else ("IMPROVE" if diff > 0 else "DECAY")
            print(f"  {name}: train={tr_hr:.1%}, val={va_hr:.1%}, diff={diff:+.1%} [{label}]")

    # --- validation-set plots ---
    print("\n--- Validation-set plots ---")
    val_counts = {name: int(val_results.loc[name, 'n_occurrences']) for name in val_results.index}
    plot_pattern_counts(val_counts, output_dir, prefix="val")

    val_patterns_in_set = {name: sig.reindex(val.index).fillna(0) for name, sig in all_patterns.items()}
    val_fwd = calc_forward_returns_multi(val['close'], horizons=[1, 3, 5, 10, 20])
    plot_forward_return_boxplots(val_patterns_in_set, val_fwd, output_dir, prefix="val")
    plot_hit_rate_with_ci(val_results, output_dir, prefix="val")

    print(f"\n{'='*60}")
    print("  Candlestick pattern validation complete")
    print(f"{'='*60}")

    return {
        'train_results': train_results,
        'val_results': val_results,
        'fdr_passed_train': fdr_passed_train,
        'fdr_passed_val': fdr_passed_val,
        'all_patterns': all_patterns,
    }
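
# Typical invocation (a sketch; assumes the project's data_loader module):
#
#     from data_loader import load_daily
#     res = run_patterns_analysis(load_daily(), output_dir='output/patterns')
#     print(res['fdr_passed_train'])  # patterns surviving FDR, if any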
468
src/power_law_analysis.py
Normal file
@@ -0,0 +1,468 @@
"""Power-law growth fit and corridor model analysis

Fits a power-law model to the long-run growth trend of the BTC price,
builds a price corridor, compares against an exponential growth model,
and locates the current price within the historical distribution.
"""

import matplotlib
matplotlib.use('Agg')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from scipy.optimize import curve_fit
from pathlib import Path
from typing import Tuple, Dict

# CJK-capable fonts (harmless fallback on systems without them)
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei', 'DejaVu Sans']
plt.rcParams['axes.unicode_minus'] = False


def _compute_days_since_start(df: pd.DataFrame) -> np.ndarray:
    """Days elapsed since the first bar (starting at 1 to avoid log(0))."""
    days = (df.index - df.index[0]).days.astype(float) + 1.0
    return days

def _fit_power_law(log_days: np.ndarray, log_prices: np.ndarray) -> Dict:
    """Fit a power law by log-log linear regression.

    Model: log(price) = slope * log(days) + intercept
    Equivalent to: price = exp(intercept) * days^slope

    Returns
    -------
    dict
        slope, intercept, r_squared, residuals, fitted_values
    """
    slope, intercept, r_value, p_value, std_err = stats.linregress(log_days, log_prices)
    fitted = slope * log_days + intercept
    residuals = log_prices - fitted

    return {
        'slope': slope,          # power-law exponent alpha
        'intercept': intercept,  # log(c)
        'r_squared': r_value ** 2,
        'p_value': p_value,
        'std_err': std_err,
        'residuals': residuals,
        'fitted_values': fitted,
    }
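
# Interpretation note (pure algebra, no fitted numbers assumed): under
# price = c * days^alpha, doubling the age of the series multiplies the
# trend price by 2^alpha -- e.g. alpha = 2 means 4x per doubling of days,
# a far slower schedule than exponential growth's fixed doubling time.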

def _build_corridor(
    log_days: np.ndarray,
    fit_result: Dict,
    quantiles: Tuple[float, ...] = (0.05, 0.50, 0.95),
) -> Dict[float, np.ndarray]:
    """Build the power-law corridor from residual quantiles.

    Parameters
    ----------
    log_days : array
        log(days) series
    fit_result : dict
        power-law fit result
    quantiles : tuple
        corridor quantiles

    Returns
    -------
    dict
        quantile -> corridor prices (original scale)
    """
    residuals = fit_result['residuals']
    corridor = {}
    for q in quantiles:
        q_val = np.quantile(residuals, q)
        # log_price = slope * log_days + intercept + quantile_offset
        log_price_band = fit_result['slope'] * log_days + fit_result['intercept'] + q_val
        corridor[q] = np.exp(log_price_band)
    return corridor

def _power_law_func(days: np.ndarray, c: float, alpha: float) -> np.ndarray:
    """Power law: price = c * days^alpha"""
    return c * np.power(days, alpha)


def _exponential_func(days: np.ndarray, c: float, beta: float) -> np.ndarray:
    """Exponential: price = c * exp(beta * days)"""
    return c * np.exp(beta * days)


def _compute_aic_bic(n: int, k: int, rss: float) -> Tuple[float, float]:
    """Compute AIC and BIC.

    Parameters
    ----------
    n : int
        sample size
    k : int
        number of model parameters
    rss : float
        residual sum of squares

    Returns
    -------
    tuple
        (AIC, BIC)
    """
    # log-likelihood (assuming Gaussian residuals)
    log_likelihood = -n / 2 * (np.log(2 * np.pi * rss / n) + 1)
    aic = 2 * k - 2 * log_likelihood
    bic = k * np.log(n) - 2 * log_likelihood
    return aic, bic
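
# Numeric sanity check (illustrative values): n = 100, k = 2, rss = 1.0
# gives log-likelihood = -50 * (ln(2*pi*0.01) + 1) ~= 88.4, hence
# AIC ~= 4 - 176.7 = -172.7 and BIC ~= 9.2 - 176.7 = -167.5; lower is better.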

def _fit_and_compare_models(
    days: np.ndarray, prices: np.ndarray
) -> Dict:
    """Fit power-law and exponential growth models and compare AIC/BIC.

    Returns
    -------
    dict
        both models' parameters, AIC, BIC, and the preferred model
    """
    n = len(prices)
    k = 2  # both models have 2 parameters

    # --- power-law fit: price = c * days^alpha ---
    try:
        popt_pl, _ = curve_fit(
            _power_law_func, days, prices,
            p0=[1.0, 1.5], maxfev=10000
        )
        prices_pred_pl = _power_law_func(days, *popt_pl)
        rss_pl = np.sum((prices - prices_pred_pl) ** 2)
        aic_pl, bic_pl = _compute_aic_bic(n, k, rss_pl)
    except RuntimeError:
        # fall back to log-space OLS if curve_fit fails
        log_d = np.log(days)
        log_p = np.log(prices)
        slope, intercept, _, _, _ = stats.linregress(log_d, log_p)
        popt_pl = [np.exp(intercept), slope]
        prices_pred_pl = _power_law_func(days, *popt_pl)
        rss_pl = np.sum((prices - prices_pred_pl) ** 2)
        aic_pl, bic_pl = _compute_aic_bic(n, k, rss_pl)

    # --- exponential fit: price = c * exp(beta * days) ---
    # initial values from log-space OLS
    log_p = np.log(prices)
    beta_init, log_c_init, _, _, _ = stats.linregress(days, log_p)
    try:
        popt_exp, _ = curve_fit(
            _exponential_func, days, prices,
            p0=[np.exp(log_c_init), beta_init], maxfev=10000
        )
        prices_pred_exp = _exponential_func(days, *popt_exp)
        rss_exp = np.sum((prices - prices_pred_exp) ** 2)
        aic_exp, bic_exp = _compute_aic_bic(n, k, rss_exp)
    except (RuntimeError, OverflowError):
        # the exponential fit overflows easily; use the log-space regression instead
        popt_exp = [np.exp(log_c_init), beta_init]
        prices_pred_exp = _exponential_func(days, *popt_exp)
        # clip to guard against overflow
        prices_pred_exp = np.clip(prices_pred_exp, 0, prices.max() * 100)
        rss_exp = np.sum((prices - prices_pred_exp) ** 2)
        aic_exp, bic_exp = _compute_aic_bic(n, k, rss_exp)

    return {
        'power_law': {
            'params': {'c': popt_pl[0], 'alpha': popt_pl[1]},
            'aic': aic_pl,
            'bic': bic_pl,
            'rss': rss_pl,
            'predicted': prices_pred_pl,
        },
        'exponential': {
            'params': {'c': popt_exp[0], 'beta': popt_exp[1]},
            'aic': aic_exp,
            'bic': bic_exp,
            'rss': rss_exp,
            'predicted': prices_pred_exp,
        },
        'preferred': 'power_law' if aic_pl < aic_exp else 'exponential',
    }

def _compute_current_percentile(residuals: np.ndarray) -> float:
    """Percentile of the current price (last residual) within the residual history.

    Returns
    -------
    float
        percentile (0-100)
    """
    current_residual = residuals[-1]
    percentile = stats.percentileofscore(residuals, current_residual)
    return percentile
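
# scipy reference point: stats.percentileofscore([1, 2, 3, 4], 3) == 75.0,
# so a residual at the 75th percentile sits above three quarters of history.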

# =============================================================================
# Plotting helpers
# =============================================================================

def _plot_loglog_regression(
    log_days: np.ndarray,
    log_prices: np.ndarray,
    fit_result: Dict,
    dates: pd.DatetimeIndex,
    output_dir: Path,
):
    """Plot 1: log-log scatter + regression line"""
    fig, ax = plt.subplots(figsize=(12, 7))

    ax.scatter(log_days, log_prices, s=3, alpha=0.5, color='steelblue', label='Actual price')
    ax.plot(log_days, fit_result['fitted_values'], color='red', linewidth=2,
            label=f"Regression: slope={fit_result['slope']:.4f}, R²={fit_result['r_squared']:.4f}")

    ax.set_xlabel('log(days)', fontsize=12)
    ax.set_ylabel('log(price)', fontsize=12)
    ax.set_title('BTC power-law fit: log-log regression', fontsize=14)
    ax.legend(fontsize=11)
    ax.grid(True, alpha=0.3)

    fig.savefig(output_dir / 'power_law_loglog_regression.png', dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"  [plot] log-log regression saved: {output_dir / 'power_law_loglog_regression.png'}")

def _plot_corridor(
    df: pd.DataFrame,
    days: np.ndarray,
    corridor: Dict[float, np.ndarray],
    fit_result: Dict,
    output_dir: Path,
):
    """Plot 2: power-law corridor (price + 5%/50%/95% bands)"""
    fig, ax = plt.subplots(figsize=(14, 7))

    # actual price
    ax.semilogy(df.index, df['close'], color='black', linewidth=0.8, label='BTC close')

    # corridor bands
    colors = {0.05: 'green', 0.50: 'orange', 0.95: 'red'}
    labels = {0.05: '5% lower band', 0.50: '50% median', 0.95: '95% upper band'}
    for q, band in corridor.items():
        ax.semilogy(df.index, band, color=colors[q], linewidth=1.5,
                    linestyle='--', label=labels[q])

    # shade the corridor
    ax.fill_between(df.index, corridor[0.05], corridor[0.95],
                    alpha=0.1, color='blue', label='90% corridor')

    ax.set_xlabel('Date', fontsize=12)
    ax.set_ylabel('Price (USDT, log scale)', fontsize=12)
    ax.set_title('BTC power-law corridor model', fontsize=14)
    ax.legend(fontsize=10, loc='upper left')
    ax.grid(True, alpha=0.3, which='both')

    fig.savefig(output_dir / 'power_law_corridor.png', dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"  [plot] corridor saved: {output_dir / 'power_law_corridor.png'}")

def _plot_model_comparison(
    df: pd.DataFrame,
    days: np.ndarray,
    comparison: Dict,
    output_dir: Path,
):
    """Plot 3: power-law vs exponential growth models"""
    fig, axes = plt.subplots(1, 2, figsize=(16, 7))

    # left: fitted prices
    ax1 = axes[0]
    ax1.semilogy(df.index, df['close'], color='black', linewidth=0.8, label='Actual price')
    ax1.semilogy(df.index, comparison['power_law']['predicted'],
                 color='blue', linewidth=1.5, linestyle='--', label='Power-law fit')
    ax1.semilogy(df.index, np.clip(comparison['exponential']['predicted'], 1e-1, None),
                 color='red', linewidth=1.5, linestyle='--', label='Exponential fit')
    ax1.set_xlabel('Date', fontsize=11)
    ax1.set_ylabel('Price (USDT, log scale)', fontsize=11)
    ax1.set_title('Model fits', fontsize=13)
    ax1.legend(fontsize=10)
    ax1.grid(True, alpha=0.3, which='both')

    # right: AIC/BIC bar chart
    ax2 = axes[1]
    models = ['Power law', 'Exponential']
    aic_vals = [comparison['power_law']['aic'], comparison['exponential']['aic']]
    bic_vals = [comparison['power_law']['bic'], comparison['exponential']['bic']]

    x = np.arange(len(models))
    width = 0.35
    bars1 = ax2.bar(x - width / 2, aic_vals, width, label='AIC', color='steelblue')
    bars2 = ax2.bar(x + width / 2, bic_vals, width, label='BIC', color='coral')

    ax2.set_xticks(x)
    ax2.set_xticklabels(models, fontsize=11)
    ax2.set_ylabel('Information criterion', fontsize=11)
    ax2.set_title('AIC / BIC model comparison', fontsize=13)
    ax2.legend(fontsize=10)
    ax2.grid(True, alpha=0.3, axis='y')

    # value labels
    for bar in bars1:
        ax2.text(bar.get_x() + bar.get_width() / 2, bar.get_height(),
                 f'{bar.get_height():.0f}', ha='center', va='bottom', fontsize=9)
    for bar in bars2:
        ax2.text(bar.get_x() + bar.get_width() / 2, bar.get_height(),
                 f'{bar.get_height():.0f}', ha='center', va='bottom', fontsize=9)

    fig.tight_layout()
    fig.savefig(output_dir / 'power_law_model_comparison.png', dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"  [plot] model comparison saved: {output_dir / 'power_law_model_comparison.png'}")

def _plot_residual_distribution(
    residuals: np.ndarray,
    current_percentile: float,
    output_dir: Path,
):
    """Plot 4: residual distribution + current position"""
    fig, ax = plt.subplots(figsize=(10, 6))

    ax.hist(residuals, bins=60, density=True, alpha=0.6, color='steelblue',
            edgecolor='white', label='Residual distribution')

    # current position
    current_res = residuals[-1]
    ax.axvline(current_res, color='red', linewidth=2, linestyle='--',
               label=f'Current position: {current_percentile:.1f}%')

    # quantile lines
    for q, color, label in [(0.05, 'green', '5%'), (0.50, 'orange', '50%'), (0.95, 'red', '95%')]:
        q_val = np.quantile(residuals, q)
        ax.axvline(q_val, color=color, linewidth=1, linestyle=':',
                   alpha=0.7, label=f'{label} quantile: {q_val:.3f}')

    ax.set_xlabel('Residual (log scale)', fontsize=12)
    ax.set_ylabel('Density', fontsize=12)
    ax.set_title(f'Power-law residual distribution: current price at the {current_percentile:.1f}% quantile', fontsize=14)
    ax.legend(fontsize=9)
    ax.grid(True, alpha=0.3)

    fig.savefig(output_dir / 'power_law_residual_distribution.png', dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"  [plot] residual distribution saved: {output_dir / 'power_law_residual_distribution.png'}")


# =============================================================================
# Main entry point
# =============================================================================

def run_power_law_analysis(df: pd.DataFrame, output_dir: str = "output") -> Dict:
    """Power-law growth fit and corridor model -- main entry point.

    Parameters
    ----------
    df : pd.DataFrame
        daily data from data_loader.load_daily(), with DatetimeIndex and a close column
    output_dir : str
        directory for output charts

    Returns
    -------
    dict
        summary of results
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    print("=" * 60)
    print("  BTC power-law growth analysis")
    print("=" * 60)

    prices = df['close'].dropna()

    # ---- step 1: prepare data ----
    days = _compute_days_since_start(df.loc[prices.index])
    log_days = np.log(days)
    log_prices = np.log(prices.values)

    print(f"\nData range: {prices.index[0].date()} ~ {prices.index[-1].date()}")
    print(f"Sample size: {len(prices)}")

    # ---- step 2: log-log linear regression ----
    print("\n--- Log-log linear regression ---")
    fit_result = _fit_power_law(log_days, log_prices)
    print(f"  power-law exponent (slope/alpha): {fit_result['slope']:.6f}")
    print(f"  intercept log(c): {fit_result['intercept']:.6f}")
    print(f"  equivalent coefficient c: {np.exp(fit_result['intercept']):.6f}")
    print(f"  R²: {fit_result['r_squared']:.6f}")
    print(f"  p-value: {fit_result['p_value']:.2e}")
    print(f"  std err: {fit_result['std_err']:.6f}")

    # ---- step 3: power-law corridor ----
    print("\n--- Power-law corridor ---")
    quantiles = (0.05, 0.50, 0.95)
    corridor = _build_corridor(log_days, fit_result, quantiles)
    for q in quantiles:
        print(f"  {int(q * 100):>3d}% corridor price today: ${corridor[q][-1]:,.0f}")

    # ---- step 4: model comparison (power law vs exponential) ----
    print("\n--- Model comparison: power law vs exponential ---")
    comparison = _fit_and_compare_models(days, prices.values)

    pl = comparison['power_law']
    exp = comparison['exponential']
    print(f"  power law:   c={pl['params']['c']:.4f}, alpha={pl['params']['alpha']:.4f}")
    print(f"               AIC={pl['aic']:.0f}, BIC={pl['bic']:.0f}")
    print(f"  exponential: c={exp['params']['c']:.4f}, beta={exp['params']['beta']:.6f}")
    print(f"               AIC={exp['aic']:.0f}, BIC={exp['bic']:.0f}")
    print(f"  AIC difference (power law - exponential): {pl['aic'] - exp['aic']:.0f}")
    print(f"  BIC difference (power law - exponential): {pl['bic'] - exp['bic']:.0f}")
    print(f"  >> preferred model: {comparison['preferred']}")

    # ---- step 5: current price position ----
    print("\n--- Current price position ---")
    current_percentile = _compute_current_percentile(fit_result['residuals'])
    current_price = prices.iloc[-1]
    print(f"  current price: ${current_price:,.2f}")
    print(f"  residual percentile: {current_percentile:.1f}%")
    if current_percentile > 90:
        print("  >> warning: price is in the historically overvalued zone")
    elif current_percentile < 10:
        print("  >> note: price is in the historically undervalued zone")
    else:
        print("  >> price is within its normal historical range")

    # ---- step 6: plots ----
    print("\n--- Generating plots ---")
    _plot_loglog_regression(log_days, log_prices, fit_result, prices.index, output_dir)
    _plot_corridor(df.loc[prices.index], days, corridor, fit_result, output_dir)
    _plot_model_comparison(df.loc[prices.index], days, comparison, output_dir)
    _plot_residual_distribution(fit_result['residuals'], current_percentile, output_dir)

    print("\n" + "=" * 60)
    print("  Power-law analysis complete")
    print("=" * 60)

    # summary of results
    return {
        'r_squared': fit_result['r_squared'],
        'power_exponent': fit_result['slope'],
        'intercept': fit_result['intercept'],
        'corridor_prices': {q: corridor[q][-1] for q in quantiles},
        'model_comparison': {
            'power_law_aic': pl['aic'],
            'power_law_bic': pl['bic'],
            'exponential_aic': exp['aic'],
            'exponential_bic': exp['bic'],
            'preferred': comparison['preferred'],
        },
        'current_price': current_price,
        'current_percentile': current_percentile,
    }


if __name__ == '__main__':
    from data_loader import load_daily
    df = load_daily()
    results = run_power_law_analysis(df, output_dir='../output/power_law')
80
src/preprocessing.py
Normal file
@@ -0,0 +1,80 @@
"""Preprocessing module - returns, detrending, standardization, derived features"""

import pandas as pd
import numpy as np
from typing import Optional


def log_returns(prices: pd.Series) -> pd.Series:
    """Log returns"""
    return np.log(prices / prices.shift(1)).dropna()


def simple_returns(prices: pd.Series) -> pd.Series:
    """Simple returns"""
    return prices.pct_change().dropna()


def detrend_log_diff(prices: pd.Series) -> pd.Series:
    """Detrend by log differencing"""
    return np.log(prices).diff().dropna()


def detrend_linear(series: pd.Series) -> pd.Series:
    """Linear detrending"""
    x = np.arange(len(series))
    coeffs = np.polyfit(x, series.values, 1)
    trend = np.polyval(coeffs, x)
    return pd.Series(series.values - trend, index=series.index)


def hp_filter(series: pd.Series, lamb: float = 1600) -> tuple:
    """Hodrick-Prescott filter"""
    from statsmodels.tsa.filters.hp_filter import hpfilter
    cycle, trend = hpfilter(series.dropna(), lamb=lamb)
    return cycle, trend


def rolling_volatility(returns: pd.Series, window: int = 30) -> pd.Series:
    """Rolling volatility (annualized with sqrt(365); crypto trades every day)"""
    return returns.rolling(window=window).std() * np.sqrt(365)


def realized_volatility(returns: pd.Series, window: int = 30) -> pd.Series:
    """Realized volatility"""
    return np.sqrt((returns ** 2).rolling(window=window).sum())


def taker_buy_ratio(df: pd.DataFrame) -> pd.Series:
    """Taker buy ratio"""
    return df["taker_buy_volume"] / df["volume"].replace(0, np.nan)


def add_derived_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add commonly used derived feature columns"""
    out = df.copy()
    out["log_return"] = log_returns(df["close"])
    out["simple_return"] = simple_returns(df["close"])
    out["log_price"] = np.log(df["close"])
    out["range_pct"] = (df["high"] - df["low"]) / df["close"]
    out["body_pct"] = (df["close"] - df["open"]) / df["open"]
    out["taker_buy_ratio"] = taker_buy_ratio(df)
    out["vol_30d"] = rolling_volatility(out["log_return"], 30)
    out["vol_7d"] = rolling_volatility(out["log_return"], 7)
    out["volume_ma20"] = df["volume"].rolling(20).mean()
    out["volume_ratio"] = df["volume"] / out["volume_ma20"]
    out["abs_return"] = out["log_return"].abs()
    out["squared_return"] = out["log_return"] ** 2
    return out
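
# Usage sketch (assumes a daily OHLCV frame with a taker_buy_volume column,
# as produced by the project's data loader):
#
#     feats = add_derived_features(daily_df)
#     feats[['log_return', 'vol_30d', 'volume_ratio']].tail()
#
# Note: log_returns()/simple_returns() drop their leading NaN, so the first
# row of the derived columns becomes NaN after index alignment.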

def standardize(series: pd.Series) -> pd.Series:
    """Z-score standardization"""
    return (series - series.mean()) / series.std()


def winsorize(series: pd.Series, lower: float = 0.01, upper: float = 0.99) -> pd.Series:
    """Winsorize extreme values"""
    lo = series.quantile(lower)
    hi = series.quantile(upper)
    return series.clip(lo, hi)
479
src/returns_analysis.py
Normal file
@@ -0,0 +1,479 @@
"""Return-distribution analysis and GARCH modeling

Covers:
- normality tests (KS, JB, AD)
- fat-tail diagnostics (kurtosis, skewness, exceedance ratios)
- return distributions across multiple timeframes
- QQ plot
- GARCH(1,1) conditional volatility modeling
"""

import matplotlib
matplotlib.use('Agg')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
from scipy import stats
from pathlib import Path
from typing import Optional

from src.data_loader import load_klines
from src.preprocessing import log_returns


# ============================================================
# 1. Normality tests
# ============================================================

def normality_tests(returns: pd.Series) -> dict:
    """
    Run several normality tests on a return series.

    Parameters
    ----------
    returns : pd.Series
        log-return series (NaNs removed)

    Returns
    -------
    dict
        KS, JB, and AD statistics and p-values
    """
    r = returns.dropna().values

    # Kolmogorov-Smirnov test (against the standard normal)
    r_standardized = (r - r.mean()) / r.std()
    ks_stat, ks_p = stats.kstest(r_standardized, 'norm')

    # Jarque-Bera test
    jb_stat, jb_p = stats.jarque_bera(r)

    # Anderson-Darling test
    ad_result = stats.anderson(r, dist='norm')

    results = {
        'ks_statistic': ks_stat,
        'ks_pvalue': ks_p,
        'jb_statistic': jb_stat,
        'jb_pvalue': jb_p,
        'ad_statistic': ad_result.statistic,
        'ad_critical_values': dict(zip(
            [f'{sl}%' for sl in ad_result.significance_level],
            ad_result.critical_values
        )),
    }
    return results

# ============================================================
# 2. Fat-tail analysis
# ============================================================

def fat_tail_analysis(returns: pd.Series) -> dict:
    """
    Fat-tail diagnostics: kurtosis, skewness, sigma-exceedance ratios.

    Parameters
    ----------
    returns : pd.Series
        log-return series

    Returns
    -------
    dict
        kurtosis, skewness, 3-sigma/4-sigma exceedance rates vs the normal benchmark
    """
    r = returns.dropna().values
    mu, sigma = r.mean(), r.std()

    # basic moments
    excess_kurtosis = stats.kurtosis(r)  # scipy reports excess kurtosis by default
    skewness = stats.skew(r)

    # empirical exceedance rates
    r_std = (r - mu) / sigma
    exceed_3sigma = np.mean(np.abs(r_std) > 3)
    exceed_4sigma = np.mean(np.abs(r_std) > 4)

    # theoretical exceedance rates under normality
    normal_3sigma = 2 * (1 - stats.norm.cdf(3))  # ~ 0.0027
    normal_4sigma = 2 * (1 - stats.norm.cdf(4))  # ~ 0.0001

    results = {
        'excess_kurtosis': excess_kurtosis,
        'skewness': skewness,
        'exceed_3sigma_actual': exceed_3sigma,
        'exceed_3sigma_normal': normal_3sigma,
        'exceed_3sigma_ratio': exceed_3sigma / normal_3sigma if normal_3sigma > 0 else np.inf,
        'exceed_4sigma_actual': exceed_4sigma,
        'exceed_4sigma_normal': normal_4sigma,
        'exceed_4sigma_ratio': exceed_4sigma / normal_4sigma if normal_4sigma > 0 else np.inf,
    }
    return results
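
# Reference arithmetic for the theoretical rates used above:
#     2 * (1 - stats.norm.cdf(3))  ->  0.0026998  (~0.27%)
#     2 * (1 - stats.norm.cdf(4))  ->  6.334e-05  (~0.0063%)
# The reported multiples are simply actual rate / theoretical rate.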

# ============================================================
# 3. Multi-timeframe distribution comparison
# ============================================================

def multi_timeframe_distributions() -> dict:
    """
    Load 1h/4h/1d/1w data and compute log-return distributions per timeframe.

    Returns
    -------
    dict
        {interval: pd.Series} of log returns per timeframe
    """
    intervals = ['1h', '4h', '1d', '1w']
    distributions = {}
    for interval in intervals:
        try:
            df = load_klines(interval)
            ret = log_returns(df['close'])
            distributions[interval] = ret
        except FileNotFoundError:
            print(f"[warn] {interval} data file missing, skipping")
    return distributions

# ============================================================
# 4. GARCH(1,1) modeling
# ============================================================

def fit_garch11(returns: pd.Series) -> dict:
    """
    Fit a GARCH(1,1) model.

    Parameters
    ----------
    returns : pd.Series
        log returns (scaled to percent before passing to the arch library)

    Returns
    -------
    dict
        model parameters, persistence, and the conditional volatility series
    """
    from arch import arch_model

    # the arch library recommends percent returns for numerical stability
    r_pct = returns.dropna() * 100

    # GARCH(1,1) with a constant-mean model
    model = arch_model(r_pct, vol='Garch', p=1, q=1, mean='Constant', dist='Normal')
    result = model.fit(disp='off')

    # extract parameters
    params = result.params
    omega = params.get('omega', np.nan)
    alpha = params.get('alpha[1]', np.nan)
    beta = params.get('beta[1]', np.nan)
    persistence = alpha + beta

    # conditional volatility (back to the original scale)
    cond_vol = result.conditional_volatility / 100

    results = {
        'model_summary': str(result.summary()),
        'omega': omega,
        'alpha': alpha,
        'beta': beta,
        'persistence': persistence,
        'log_likelihood': result.loglikelihood,
        'aic': result.aic,
        'bic': result.bic,
        'conditional_volatility': cond_vol,
        'result_obj': result,
    }
    return results
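
# Interpretation aid (generic numbers, not fitted output): persistence
# alpha + beta close to 1 means volatility shocks decay slowly; the
# shock half-life is ln(0.5) / ln(alpha + beta), e.g. a persistence of
# 0.97 implies ln(0.5) / ln(0.97) ~= 22.8 days for a shock to halve.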

# ============================================================
# 5. Plotting
# ============================================================

def plot_histogram_vs_normal(returns: pd.Series, output_dir: Path):
    """Histogram of returns vs a fitted normal density."""
    r = returns.dropna().values
    mu, sigma = r.mean(), r.std()

    fig, ax = plt.subplots(figsize=(12, 6))

    # histogram
    n_bins = 150
    ax.hist(r, bins=n_bins, density=True, alpha=0.65, color='steelblue',
            edgecolor='white', linewidth=0.3, label='BTC daily log returns')

    # fitted normal curve
    x = np.linspace(r.min(), r.max(), 500)
    ax.plot(x, stats.norm.pdf(x, mu, sigma), 'r-', linewidth=2,
            label=f'Normal N({mu:.5f}, {sigma:.4f}²)')

    ax.set_xlabel('Daily log return', fontsize=12)
    ax.set_ylabel('Probability density', fontsize=12)
    ax.set_title('BTC daily log-return distribution vs normal', fontsize=14)
    ax.legend(fontsize=11)
    ax.grid(True, alpha=0.3)

    fig.savefig(output_dir / 'returns_histogram_vs_normal.png',
                dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"[saved] {output_dir / 'returns_histogram_vs_normal.png'}")


def plot_qq(returns: pd.Series, output_dir: Path):
    """QQ plot."""
    fig, ax = plt.subplots(figsize=(8, 8))
    r = returns.dropna().values

    # QQ plot
    (osm, osr), (slope, intercept, _) = stats.probplot(r, dist='norm')
    ax.scatter(osm, osr, s=5, alpha=0.5, color='steelblue', label='Sample quantiles')
    # theoretical line
    x_line = np.array([osm.min(), osm.max()])
    ax.plot(x_line, slope * x_line + intercept, 'r-', linewidth=2, label='Theoretical normal line')

    ax.set_xlabel('Theoretical quantiles (normal)', fontsize=12)
    ax.set_ylabel('Sample quantiles', fontsize=12)
    ax.set_title('QQ plot of BTC daily log returns', fontsize=14)
    ax.legend(fontsize=11)
    ax.grid(True, alpha=0.3)

    fig.savefig(output_dir / 'returns_qq_plot.png',
                dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"[saved] {output_dir / 'returns_qq_plot.png'}")


def plot_multi_timeframe(distributions: dict, output_dir: Path):
    """Compare log-return distributions across timeframes."""
    n_plots = len(distributions)
    if n_plots == 0:
        print("[warn] no multi-timeframe data available")
        return

    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    axes = axes.flatten()

    interval_names = {
        '1h': '1-hour', '4h': '4-hour', '1d': '1-day', '1w': '1-week'
    }

    for idx, (interval, ret) in enumerate(distributions.items()):
        if idx >= 4:
            break
        ax = axes[idx]
        r = ret.dropna().values
        mu, sigma = r.mean(), r.std()

        ax.hist(r, bins=100, density=True, alpha=0.65, color='steelblue',
                edgecolor='white', linewidth=0.3)

        x = np.linspace(r.min(), r.max(), 500)
        ax.plot(x, stats.norm.pdf(x, mu, sigma), 'r-', linewidth=1.5)

        # summary statistics in the panel title
        kurt = stats.kurtosis(r)
        skew = stats.skew(r)
        label = interval_names.get(interval, interval)
        ax.set_title(f'{label} returns (kurtosis={kurt:.2f}, skew={skew:.3f})', fontsize=11)
        ax.set_xlabel('Log return', fontsize=10)
        ax.set_ylabel('Probability density', fontsize=10)
        ax.grid(True, alpha=0.3)

    # hide unused panels
    for idx in range(len(distributions), 4):
        axes[idx].set_visible(False)

    fig.suptitle('BTC log-return distributions across timeframes', fontsize=14, y=1.02)
    fig.tight_layout()
    fig.savefig(output_dir / 'multi_timeframe_distributions.png',
                dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"[saved] {output_dir / 'multi_timeframe_distributions.png'}")

def plot_garch_conditional_vol(garch_results: dict, output_dir: Path):
    """Time series of the GARCH(1,1) conditional volatility."""
    cond_vol = garch_results['conditional_volatility']

    fig, ax = plt.subplots(figsize=(14, 5))
    ax.plot(cond_vol.index, cond_vol.values, linewidth=0.8, color='steelblue')
    ax.fill_between(cond_vol.index, 0, cond_vol.values, alpha=0.2, color='steelblue')

    ax.set_xlabel('Date', fontsize=12)
    ax.set_ylabel('Conditional volatility', fontsize=12)
    ax.set_title(
        f'GARCH(1,1) conditional volatility '
        f'(α={garch_results["alpha"]:.4f}, β={garch_results["beta"]:.4f}, '
        f'persistence={garch_results["persistence"]:.4f})',
        fontsize=13
    )
    ax.grid(True, alpha=0.3)

    fig.savefig(output_dir / 'garch_conditional_volatility.png',
                dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"[saved] {output_dir / 'garch_conditional_volatility.png'}")

# ============================================================
# 6. Result printing
# ============================================================

def print_normality_results(results: dict):
    """Print normality test results."""
    print("\n" + "=" * 60)
    print("Normality test results")
    print("=" * 60)

    print(f"\n[KS] Kolmogorov-Smirnov")
    print(f"  statistic: {results['ks_statistic']:.6f}")
    print(f"  p-value: {results['ks_pvalue']:.2e}")
    print(f"  verdict: {'reject normality' if results['ks_pvalue'] < 0.05 else 'cannot reject normality'}")

    print(f"\n[JB] Jarque-Bera")
    print(f"  statistic: {results['jb_statistic']:.4f}")
    print(f"  p-value: {results['jb_pvalue']:.2e}")
    print(f"  verdict: {'reject normality' if results['jb_pvalue'] < 0.05 else 'cannot reject normality'}")

    print(f"\n[AD] Anderson-Darling")
    print(f"  statistic: {results['ad_statistic']:.4f}")
    print("  critical values:")
    for level, cv in results['ad_critical_values'].items():
        reject = results['ad_statistic'] > cv
        print(f"    {level}: {cv:.4f} {'(reject)' if reject else '(cannot reject)'}")


def print_fat_tail_results(results: dict):
    """Print fat-tail diagnostics."""
    print("\n" + "=" * 60)
    print("Fat-tail diagnostics")
    print("=" * 60)
    print(f"  excess kurtosis: {results['excess_kurtosis']:.4f}")
    print(f"    (0 for a normal; larger means fatter tails)")
    print(f"  skewness: {results['skewness']:.4f}")
    print(f"    (0 for a normal; negative means left-skewed)")

    print(f"\n  3-sigma exceedance:")
    print(f"    actual: {results['exceed_3sigma_actual']:.6f} "
          f"({results['exceed_3sigma_actual'] * 100:.3f}%)")
    print(f"    normal: {results['exceed_3sigma_normal']:.6f} "
          f"({results['exceed_3sigma_normal'] * 100:.3f}%)")
    print(f"    multiple: {results['exceed_3sigma_ratio']:.2f}x")

    print(f"\n  4-sigma exceedance:")
    print(f"    actual: {results['exceed_4sigma_actual']:.6f} "
          f"({results['exceed_4sigma_actual'] * 100:.4f}%)")
    print(f"    normal: {results['exceed_4sigma_normal']:.6f} "
          f"({results['exceed_4sigma_normal'] * 100:.4f}%)")
    print(f"    multiple: {results['exceed_4sigma_ratio']:.2f}x")


def print_garch_results(results: dict):
    """Print GARCH(1,1) fit results."""
    print("\n" + "=" * 60)
    print("GARCH(1,1) fit results")
    print("=" * 60)
    print(f"  ω (omega): {results['omega']:.6f}")
    print(f"  α (alpha[1]): {results['alpha']:.6f}")
    print(f"  β (beta[1]): {results['beta']:.6f}")
    print(f"  persistence (α+β): {results['persistence']:.6f}")
    print(f"    {'high persistence (near 1) -> volatility shocks decay slowly' if results['persistence'] > 0.9 else 'moderate persistence'}")
    print(f"  log-likelihood: {results['log_likelihood']:.4f}")
    print(f"  AIC: {results['aic']:.4f}")
    print(f"  BIC: {results['bic']:.4f}")


# ============================================================
# 7. Main entry point
# ============================================================

def run_returns_analysis(df: pd.DataFrame, output_dir: str = "output/returns"):
    """
    Main entry point for the return-distribution analysis.

    Parameters
    ----------
    df : pd.DataFrame
        daily kline data (with a 'close' column and DatetimeIndex)
    output_dir : str
        directory for output charts
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    print("=" * 60)
    print("BTC return-distribution analysis and GARCH modeling")
    print("=" * 60)
    print(f"Data range: {df.index.min()} ~ {df.index.max()}")
    print(f"Sample size: {len(df)}")

    # daily log returns
    daily_returns = log_returns(df['close'])
    print(f"Daily log-return count: {len(daily_returns)}")

    # --- normality tests ---
    print("\n>>> Running normality tests...")
    norm_results = normality_tests(daily_returns)
    print_normality_results(norm_results)

    # --- fat-tail analysis ---
    print("\n>>> Running fat-tail analysis...")
    tail_results = fat_tail_analysis(daily_returns)
    print_fat_tail_results(tail_results)

    # --- multi-timeframe distributions ---
    print("\n>>> Loading multi-timeframe data...")
    distributions = multi_timeframe_distributions()
    # per-timeframe summary
    print("\nMulti-timeframe log-return statistics:")
    print(f"  {'interval':<8} {'count':>8} {'mean':>12} {'std':>12} {'kurtosis':>10} {'skew':>10}")
    print("  " + "-" * 62)
    for interval, ret in distributions.items():
        r = ret.dropna().values
        print(f"  {interval:<8} {len(r):>8d} {r.mean():>12.6f} {r.std():>12.6f} "
              f"{stats.kurtosis(r):>10.4f} {stats.skew(r):>10.4f}")

    # --- GARCH(1,1) modeling ---
    print("\n>>> Fitting GARCH(1,1)...")
    garch_results = fit_garch11(daily_returns)
    print_garch_results(garch_results)

    # --- plots ---
    print("\n>>> Generating plots...")

    # CJK-capable fonts (harmless fallback on systems without them)
    plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei', 'DejaVu Sans']
    plt.rcParams['axes.unicode_minus'] = False

    plot_histogram_vs_normal(daily_returns, output_dir)
    plot_qq(daily_returns, output_dir)
    plot_multi_timeframe(distributions, output_dir)
    plot_garch_conditional_vol(garch_results, output_dir)

    print("\n" + "=" * 60)
    print("Return-distribution analysis complete!")
    print(f"Charts saved to: {output_dir.resolve()}")
    print("=" * 60)

    # return everything for downstream use
    return {
        'normality': norm_results,
        'fat_tail': tail_results,
        'multi_timeframe': distributions,
        'garch': garch_results,
    }


# ============================================================
# Standalone entry point
# ============================================================

if __name__ == '__main__':
    from src.data_loader import load_daily
    df = load_daily()
    run_returns_analysis(df)
804
src/time_series.py
Normal file
@@ -0,0 +1,804 @@
"""Time-series forecasting module - ARIMA, Prophet, LSTM/GRU

Multi-model forecasting and comparative evaluation on BTC daily data.
Each model runs independently; one model failing does not affect the others.
"""

import warnings
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from pathlib import Path
from typing import Optional, Tuple, Dict, List
from scipy import stats

from src.data_loader import split_data


# ============================================================
# Evaluation metrics
# ============================================================

def _direction_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Direction accuracy: share of up/down calls matching the realized sign."""
    if len(y_true) < 2:
        return np.nan
    true_dir = np.sign(y_true)
    pred_dir = np.sign(y_pred)
    return np.mean(true_dir == pred_dir)


def _rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean squared error"""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def _diebold_mariano_test(e1: np.ndarray, e2: np.ndarray, h: int = 1) -> Tuple[float, float]:
    """
    Diebold-Mariano test: is the loss difference between two forecasts significant?

    H0: the two models have equal predictive accuracy.
    e1, e2: forecast-error series of the two models

    Returns
    -------
    dm_stat : DM statistic
    p_value : two-sided p-value
    """
    d = e1 ** 2 - e2 ** 2  # squared-loss differential
    n = len(d)
    if n < 10:
        return np.nan, np.nan

    mean_d = np.mean(d)

    # Newey-West variance estimate (accounts for autocorrelation up to h-1 lags)
    gamma_0 = np.var(d, ddof=1)
    gamma_sum = 0
    for k in range(1, h):
        gamma_k = np.cov(d[k:], d[:-k])[0, 1] if len(d[k:]) > 1 else 0
        gamma_sum += 2 * gamma_k

    var_d = (gamma_0 + gamma_sum) / n
    if var_d <= 0:
        return np.nan, np.nan

    dm_stat = mean_d / np.sqrt(var_d)
    p_value = 2 * stats.norm.sf(np.abs(dm_stat))
    return dm_stat, p_value
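
# Toy check (synthetic errors, illustration only): identical error series
# degenerate to zero variance and return (nan, nan); uniformly inflated
# errors in e1 produce a large positive statistic and a tiny p-value.
#
#     rng = np.random.default_rng(0)
#     e = rng.normal(size=200)
#     _diebold_mariano_test(e, e)        # (nan, nan) -- var of d is 0
#     _diebold_mariano_test(1.5 * e, e)  # dm_stat > 0, p ~ 0 (e1 is worse)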

def _evaluate_model(name: str, y_true: np.ndarray, y_pred: np.ndarray,
                    rw_errors: np.ndarray) -> Dict:
    """Uniform evaluation of a single model."""
    errors = y_true - y_pred
    rmse_val = _rmse(y_true, y_pred)
    rw_rmse = _rmse(y_true, np.zeros_like(y_true))  # random-walk RMSE
    rmse_ratio = rmse_val / rw_rmse if rw_rmse > 0 else np.nan
    dir_acc = _direction_accuracy(y_true, y_pred)

    # DM test vs the random walk
    dm_stat, dm_pval = _diebold_mariano_test(errors, rw_errors)

    result = {
        "name": name,
        "rmse": rmse_val,
        "rmse_ratio_vs_rw": rmse_ratio,
        "direction_accuracy": dir_acc,
        "dm_stat_vs_rw": dm_stat,
        "dm_pval_vs_rw": dm_pval,
        "predictions": y_pred,
        "errors": errors,
    }
    return result

# ============================================================
# Baseline models
# ============================================================

def _baseline_random_walk(y_true: np.ndarray) -> np.ndarray:
    """Random-walk baseline: predicted return = 0"""
    return np.zeros_like(y_true)


def _baseline_historical_mean(train_returns: np.ndarray, n_pred: int) -> np.ndarray:
    """Historical-mean baseline: predicted return = train-set mean"""
    return np.full(n_pred, np.mean(train_returns))


# ============================================================
# ARIMA model
# ============================================================
def _run_arima(train_returns: pd.Series, val_returns: pd.Series) -> Dict:
|
||||
"""
|
||||
ARIMA模型:使用auto_arima自动选参 + walk-forward预测
|
||||
|
||||
Returns
|
||||
-------
|
||||
dict : 包含预测结果和诊断信息
|
||||
"""
|
||||
try:
|
||||
import pmdarima as pm
|
||||
from statsmodels.stats.diagnostic import acorr_ljungbox
|
||||
except ImportError:
|
||||
print(" [ARIMA] 跳过 - pmdarima 未安装。pip install pmdarima")
|
||||
return None
|
||||
|
||||
print("\n" + "=" * 60)
|
||||
print("ARIMA 模型")
|
||||
print("=" * 60)
|
||||
|
||||
# 自动选择ARIMA参数
|
||||
print(" [1/3] auto_arima 参数搜索...")
|
||||
model = pm.auto_arima(
|
||||
train_returns.values,
|
||||
start_p=0, max_p=5,
|
||||
start_q=0, max_q=5,
|
||||
d=0, # 对数收益率已经是平稳的
|
||||
seasonal=False,
|
||||
stepwise=True,
|
||||
suppress_warnings=True,
|
||||
error_action='ignore',
|
||||
trace=False,
|
||||
information_criterion='aic',
|
||||
)
|
||||
print(f" 最优模型: ARIMA{model.order}")
|
||||
print(f" AIC: {model.aic():.2f}")
|
||||
|
||||
# Ljung-Box 残差诊断
|
||||
print(" [2/3] Ljung-Box 残差白噪声检验...")
|
||||
residuals = model.resid()
|
||||
lb_result = acorr_ljungbox(residuals, lags=[10, 20], return_df=True)
|
||||
print(f" Ljung-Box 检验 (lag=10): 统计量={lb_result.iloc[0]['lb_stat']:.2f}, "
|
||||
f"p值={lb_result.iloc[0]['lb_pvalue']:.4f}")
|
||||
print(f" Ljung-Box 检验 (lag=20): 统计量={lb_result.iloc[1]['lb_stat']:.2f}, "
|
||||
f"p值={lb_result.iloc[1]['lb_pvalue']:.4f}")
|
||||
|
||||
if lb_result.iloc[0]['lb_pvalue'] > 0.05:
|
||||
print(" 残差通过白噪声检验 (p>0.05),模型拟合充分")
|
||||
else:
|
||||
print(" 残差未通过白噪声检验 (p<=0.05),可能存在未捕获的自相关结构")
|
||||
|
||||
# Walk-forward 预测
|
||||
print(" [3/3] Walk-forward 验证集预测...")
|
||||
val_values = val_returns.values
|
||||
n_val = len(val_values)
|
||||
predictions = np.zeros(n_val)
|
||||
|
||||
# 使用滚动窗口预测
|
||||
history = list(train_returns.values)
|
||||
for i in range(n_val):
|
||||
# 一步预测
|
||||
fc = model.predict(n_periods=1)
|
||||
predictions[i] = fc[0]
|
||||
# 更新模型(添加真实观测值)
|
||||
model.update(val_values[i:i+1])
|
||||
if (i + 1) % 100 == 0:
|
||||
print(f" 进度: {i+1}/{n_val}")
|
||||
|
||||
print(f" Walk-forward 预测完成,共{n_val}步")
|
||||
|
||||
return {
|
||||
"predictions": predictions,
|
||||
"order": model.order,
|
||||
"aic": model.aic(),
|
||||
"ljung_box": lb_result,
|
||||
}
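

# Illustrative fallback sketch (not part of the original pipeline): the same
# walk-forward loop can be approximated with statsmodels alone when pmdarima is
# unavailable, using ARIMAResults.append(refit=False) to fold in each realized
# observation at a fixed, assumed order (here (1, 0, 1)):

def _walk_forward_arima_statsmodels(train_vals, val_vals, order=(1, 0, 1)):
    """Sketch only: one-step walk-forward ARIMA via statsmodels."""
    from statsmodels.tsa.arima.model import ARIMA
    res = ARIMA(train_vals, order=order).fit()
    preds = []
    for v in val_vals:
        preds.append(res.forecast(steps=1)[0])  # one-step-ahead forecast
        res = res.append([v], refit=False)      # fold in the realized value
    return np.array(preds)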


# ============================================================
# Prophet model
# ============================================================

def _run_prophet(train_df: pd.DataFrame, val_df: pd.DataFrame) -> Dict:
    """
    Prophet model: time-series forecast of the daily close.

    Returns
    -------
    dict : prediction results
    """
    try:
        from prophet import Prophet
    except ImportError:
        print("  [Prophet] skipped - prophet not installed. pip install prophet")
        return None

    print("\n" + "=" * 60)
    print("Prophet model")
    print("=" * 60)

    # Prepare data in Prophet's expected format
    prophet_train = pd.DataFrame({
        'ds': train_df.index,
        'y': train_df['close'].values,
    })

    print("  [1/3] building the Prophet model with custom seasonalities...")

    model = Prophet(
        daily_seasonality=False,
        weekly_seasonality=False,
        yearly_seasonality=False,
        changepoint_prior_scale=0.05,
    )

    # Custom seasonalities (halving_cycle: ~4-year period in days)
    model.add_seasonality(name='weekly', period=7, fourier_order=3)
    model.add_seasonality(name='monthly', period=30, fourier_order=5)
    model.add_seasonality(name='yearly', period=365, fourier_order=10)
    model.add_seasonality(name='halving_cycle', period=1458, fourier_order=5)

    print("  [2/3] fitting the model...")
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        model.fit(prophet_train)

    # Forecast over the validation period
    print("  [3/3] forecasting the validation period...")
    future_dates = pd.DataFrame({'ds': val_df.index})
    forecast = model.predict(future_dates)

    # Convert to log-return predictions (to align with the other models):
    # each day's return uses the previous day's *actual* close; the first day
    # uses the last close of the training set.
    pred_close = forecast['yhat'].values
    prev_close = np.concatenate([[train_df['close'].iloc[-1]], val_df['close'].values[:-1]])
    pred_returns = np.log(pred_close / prev_close)

    print(f"    forecast done, validation period: {val_df.index[0]} ~ {val_df.index[-1]}")
    print(f"    predicted price range: {pred_close.min():.0f} ~ {pred_close.max():.0f}")

    return {
        "predictions_return": pred_returns,
        "predictions_close": pred_close,
        "forecast": forecast,
        "model": model,
    }
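

# Worked check of the return conversion above (illustrative numbers): with a
# last training close of 100, an actual validation close of 102, and predicted
# closes [101, 103], the aligned one-step log-return forecasts are
#
#   np.log(101 / 100) ≈ 0.00995   (day 1, vs. last training close)
#   np.log(103 / 102) ≈ 0.00976   (day 2, vs. previous *actual* close)
#
# so each forecast is conditioned on realized prices, matching the walk-forward
# setup of the other models rather than compounding Prophet's own price path.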


# ============================================================
# LSTM/GRU model (PyTorch)
# ============================================================

def _run_lstm(train_df: pd.DataFrame, val_df: pd.DataFrame,
              lookback: int = 60, hidden_size: int = 128,
              num_layers: int = 2, max_epochs: int = 100,
              patience: int = 10, batch_size: int = 64) -> Dict:
    """
    LSTM model: PyTorch-based deep-learning time-series forecast.

    Returns
    -------
    dict : predictions and training history
    """
    try:
        import torch
        import torch.nn as nn
        from torch.utils.data import DataLoader, TensorDataset
    except ImportError:
        print("  [LSTM] skipped - PyTorch not installed. pip install torch")
        return None

    print("\n" + "=" * 60)
    print("LSTM model (PyTorch)")
    print("=" * 60)

    device = torch.device('cuda' if torch.cuda.is_available() else
                          'mps' if torch.backends.mps.is_available() else 'cpu')
    print(f"  device: {device}")

    # ---- Data preparation ----
    # Target: log returns of the close; extra features where available
    feature_cols = ['log_return', 'volume_ratio', 'taker_buy_ratio']
    available_cols = [c for c in feature_cols if c in train_df.columns]

    if not available_cols:
        # Fall back to close-price returns only
        print("  [warning] feature columns unavailable; using close returns only")
        available_cols = ['log_return']

    print(f"  features: {available_cols}")

    # Concatenate train and validation data to build continuous sequences
    all_data = pd.concat([train_df, val_df])
    features = all_data[available_cols].values
    target = all_data['log_return'].values

    # Drop NaN rows
    mask = ~np.isnan(features).any(axis=1) & ~np.isnan(target)
    features_clean = features[mask]
    target_clean = target[mask]

    # Standardize using training-set statistics only (no look-ahead)
    train_len = mask[:len(train_df)].sum()
    feat_mean = features_clean[:train_len].mean(axis=0)
    feat_std = features_clean[:train_len].std(axis=0) + 1e-10
    features_norm = (features_clean - feat_mean) / feat_std

    target_mean = target_clean[:train_len].mean()
    target_std = target_clean[:train_len].std() + 1e-10
    target_norm = (target_clean - target_mean) / target_std

    # Build (lookback-window, next-step-target) samples
    def create_sequences(feat, tgt, seq_len):
        X, y = [], []
        for i in range(seq_len, len(feat)):
            X.append(feat[i - seq_len:i])
            y.append(tgt[i])
        return np.array(X), np.array(y)

    X_all, y_all = create_sequences(features_norm, target_norm, lookback)

    # Split into train/validation (offset by the original training length)
    train_samples = max(0, train_len - lookback)
    X_train = X_all[:train_samples]
    y_train = y_all[:train_samples]
    X_val = X_all[train_samples:]
    y_val = y_all[train_samples:]

    if len(X_train) == 0 or len(X_val) == 0:
        print("  [LSTM] skipped - not enough data to build train/val sequences")
        return None

    print(f"  train samples: {len(X_train)}, val samples: {len(X_val)}")
    print(f"  lookback: {lookback}, hidden size: {hidden_size}, layers: {num_layers}")

    # To tensors
    X_train_t = torch.FloatTensor(X_train).to(device)
    y_train_t = torch.FloatTensor(y_train).to(device)
    X_val_t = torch.FloatTensor(X_val).to(device)
    y_val_t = torch.FloatTensor(y_val).to(device)

    train_dataset = TensorDataset(X_train_t, y_train_t)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

    # ---- Model definition ----
    class LSTMModel(nn.Module):
        def __init__(self, input_size, hidden_size, num_layers, dropout=0.2):
            super().__init__()
            self.lstm = nn.LSTM(
                input_size=input_size,
                hidden_size=hidden_size,
                num_layers=num_layers,
                batch_first=True,
                dropout=dropout if num_layers > 1 else 0,
            )
            self.fc = nn.Sequential(
                nn.Linear(hidden_size, 64),
                nn.ReLU(),
                nn.Dropout(dropout),
                nn.Linear(64, 1),
            )

        def forward(self, x):
            lstm_out, _ = self.lstm(x)
            # Use the output of the last time step
            last_out = lstm_out[:, -1, :]
            return self.fc(last_out).squeeze(-1)

    input_size = len(available_cols)
    model = LSTMModel(input_size, hidden_size, num_layers).to(device)

    criterion = nn.MSELoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode='min', factor=0.5, patience=5
    )

    # ---- Training ----
    print(f"  training (max {max_epochs} epochs, early-stop patience={patience})...")
    best_val_loss = np.inf
    patience_counter = 0
    train_losses = []
    val_losses = []

    for epoch in range(max_epochs):
        # Train
        model.train()
        epoch_loss = 0
        n_batches = 0
        for batch_X, batch_y in train_loader:
            optimizer.zero_grad()
            pred = model(batch_X)
            loss = criterion(pred, batch_y)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            epoch_loss += loss.item()
            n_batches += 1

        avg_train_loss = epoch_loss / max(n_batches, 1)
        train_losses.append(avg_train_loss)

        # Validate
        model.eval()
        with torch.no_grad():
            val_pred = model(X_val_t)
            val_loss = criterion(val_pred, y_val_t).item()
        val_losses.append(val_loss)

        scheduler.step(val_loss)

        if (epoch + 1) % 10 == 0:
            lr = optimizer.param_groups[0]['lr']
            print(f"    epoch {epoch+1}/{max_epochs}: "
                  f"train_loss={avg_train_loss:.6f}, val_loss={val_loss:.6f}, lr={lr:.1e}")

        # Early stopping
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            patience_counter = 0
            best_state = {k: v.cpu().clone() for k, v in model.state_dict().items()}
        else:
            patience_counter += 1
            if patience_counter >= patience:
                print(f"    early stop triggered (epoch {epoch+1})")
                break

    # Restore the best model
    model.load_state_dict(best_state)
    model.eval()

    # ---- Prediction ----
    with torch.no_grad():
        val_pred_norm = model(X_val_t).cpu().numpy()

    # De-standardize
    val_pred_returns = val_pred_norm * target_std + target_mean
    val_true_returns = y_val * target_std + target_mean

    print(f"  training done, best validation loss: {best_val_loss:.6f}")

    return {
        "predictions_return": val_pred_returns,
        "true_returns": val_true_returns,
        "train_losses": train_losses,
        "val_losses": val_losses,
        "model": model,
        "device": str(device),
    }
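

# Tensor shapes through LSTMModel, for reference (batch B, lookback T=60,
# features F=len(available_cols), hidden H=128):
#
#   x                -> (B, T, F)   # batch_first=True
#   lstm_out         -> (B, T, H)   # per-step hidden states
#   lstm_out[:, -1]  -> (B, H)      # last step only
#   fc(...)          -> (B, 1) -> squeeze(-1) -> (B,)
#
# i.e. the network maps a 60-day feature window to a single normalized
# next-day log return.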


# ============================================================
# Visualization
# ============================================================

def _plot_predictions(val_dates, y_true, model_preds: Dict[str, np.ndarray],
                      output_dir: Path):
    """Actual vs. predicted returns, one panel per model."""
    n_models = len(model_preds)
    fig, axes = plt.subplots(n_models, 1, figsize=(16, 4 * n_models), sharex=True)
    if n_models == 1:
        axes = [axes]

    for i, (name, y_pred) in enumerate(model_preds.items()):
        ax = axes[i]
        # Align lengths (LSTM may differ because of the lookback window)
        n = min(len(y_true), len(y_pred))
        dates = val_dates[:n] if len(val_dates) >= n else val_dates

        ax.plot(dates, y_true[:n], 'b-', alpha=0.6, linewidth=0.8, label='actual return')
        ax.plot(dates, y_pred[:n], 'r-', alpha=0.6, linewidth=0.8, label='predicted return')
        ax.set_title(f"{name} - actual vs. predicted", fontsize=13)
        ax.set_ylabel("log return", fontsize=11)
        ax.legend(fontsize=9)
        ax.grid(True, alpha=0.3)
        ax.axhline(y=0, color='gray', linestyle='--', alpha=0.5)

    axes[-1].set_xlabel("date", fontsize=11)
    plt.tight_layout()
    fig.savefig(output_dir / "ts_predictions_comparison.png", dpi=150, bbox_inches='tight')
    plt.close(fig)
    print("  [saved] ts_predictions_comparison.png")


def _plot_direction_accuracy(metrics: Dict[str, Dict], output_dir: Path):
    """Bar chart comparing direction accuracy across models."""
    names = list(metrics.keys())
    accs = [metrics[n]["direction_accuracy"] * 100 for n in names]

    fig, ax = plt.subplots(figsize=(10, 6))
    colors = plt.cm.Set2(np.linspace(0, 1, len(names)))
    bars = ax.bar(names, accs, color=colors, edgecolor='gray', linewidth=0.5)

    # Annotate values
    for bar, acc in zip(bars, accs):
        ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.5,
                f"{acc:.1f}%", ha='center', va='bottom', fontsize=11, fontweight='bold')

    ax.axhline(y=50, color='red', linestyle='--', alpha=0.7, label='random baseline (50%)')
    ax.set_ylabel("direction accuracy (%)", fontsize=12)
    ax.set_title("Direction accuracy by model", fontsize=14)
    ax.legend(fontsize=10)
    ax.grid(True, alpha=0.3, axis='y')
    ax.set_ylim(0, max(accs) * 1.2 if accs else 100)

    fig.savefig(output_dir / "ts_direction_accuracy.png", dpi=150, bbox_inches='tight')
    plt.close(fig)
    print("  [saved] ts_direction_accuracy.png")


def _plot_cumulative_error(val_dates, metrics: Dict[str, Dict], output_dir: Path):
    """Cumulative squared-error comparison."""
    fig, ax = plt.subplots(figsize=(16, 7))

    for name, m in metrics.items():
        errors = m.get("errors")
        if errors is None:
            continue
        n = len(errors)
        dates = val_dates[:n]
        cum_sq_err = np.cumsum(errors ** 2)
        ax.plot(dates, cum_sq_err, linewidth=1.2, label=f"{name}")

    ax.set_xlabel("date", fontsize=12)
    ax.set_ylabel("cumulative squared error", fontsize=12)
    ax.set_title("Cumulative prediction error by model", fontsize=14)
    ax.legend(fontsize=10)
    ax.grid(True, alpha=0.3)

    fig.savefig(output_dir / "ts_cumulative_error.png", dpi=150, bbox_inches='tight')
    plt.close(fig)
    print("  [saved] ts_cumulative_error.png")


def _plot_lstm_training(train_losses: List, val_losses: List, output_dir: Path):
    """LSTM training-loss curves."""
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.plot(train_losses, 'b-', label='training loss', linewidth=1.5)
    ax.plot(val_losses, 'r-', label='validation loss', linewidth=1.5)
    ax.set_xlabel("Epoch", fontsize=12)
    ax.set_ylabel("MSE Loss", fontsize=12)
    ax.set_title("LSTM training history", fontsize=14)
    ax.legend(fontsize=11)
    ax.grid(True, alpha=0.3)

    fig.savefig(output_dir / "ts_lstm_training.png", dpi=150, bbox_inches='tight')
    plt.close(fig)
    print("  [saved] ts_lstm_training.png")


def _plot_prophet_components(prophet_result: Dict, output_dir: Path):
    """Prophet forecast: predicted price with its uncertainty band."""
    forecast = prophet_result.get("forecast")
    if forecast is None:
        return

    fig, ax = plt.subplots(figsize=(16, 7))
    ax.plot(forecast['ds'], forecast['yhat'], 'r-', linewidth=1.2, label='Prophet forecast')
    ax.fill_between(forecast['ds'], forecast['yhat_lower'], forecast['yhat_upper'],
                    alpha=0.15, color='red', label='confidence interval')
    ax.set_xlabel("date", fontsize=12)
    ax.set_ylabel("BTC price (USDT)", fontsize=12)
    ax.set_title("Prophet price forecast (validation period)", fontsize=14)
    ax.legend(fontsize=10)
    ax.grid(True, alpha=0.3)

    fig.savefig(output_dir / "ts_prophet_forecast.png", dpi=150, bbox_inches='tight')
    plt.close(fig)
    print("  [saved] ts_prophet_forecast.png")


# ============================================================
# Result printing
# ============================================================

def _print_metrics_table(all_metrics: Dict[str, Dict]):
    """Print the evaluation-metrics table for all models."""
    print("\n" + "=" * 80)
    print(" Model evaluation summary")
    print("=" * 80)
    print(f"  {'model':<20s} {'RMSE':>10s} {'RMSE/RW':>10s} {'dir. acc.':>10s} "
          f"{'DM stat':>10s} {'DM p':>10s}")
    print("-" * 80)

    for name, m in all_metrics.items():
        rmse_str = f"{m['rmse']:.6f}"
        ratio_str = f"{m['rmse_ratio_vs_rw']:.4f}" if not np.isnan(m['rmse_ratio_vs_rw']) else "N/A"
        dir_str = f"{m['direction_accuracy']*100:.1f}%"
        dm_str = f"{m['dm_stat_vs_rw']:.3f}" if not np.isnan(m['dm_stat_vs_rw']) else "N/A"
        pv_str = f"{m['dm_pval_vs_rw']:.4f}" if not np.isnan(m['dm_pval_vs_rw']) else "N/A"
        print(f"  {name:<20s} {rmse_str:>10s} {ratio_str:>10s} {dir_str:>10s} "
              f"{dm_str:>10s} {pv_str:>10s}")

    print("-" * 80)

    # Interpretation
    print("\n  [how to read this]")
    print("  - RMSE/RW < 1.0 means the model beats the random-walk baseline")
    print("  - direction accuracy > 50% suggests some directional skill")
    print("  - DM p < 0.05 means a significant difference from the random walk")


# ============================================================
# Main entry point
# ============================================================

def run_time_series_analysis(df: pd.DataFrame, output_dir: "str | Path" = "output/time_series") -> Dict:
    """
    Time-series forecasting analysis - main entry point.

    Parameters
    ----------
    df : pd.DataFrame
        Daily data with derived features added via add_derived_features()
    output_dir : str or Path
        Output directory for charts

    Returns
    -------
    results : dict
        Predictions and evaluation metrics for all models
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # CJK-capable font fallbacks (macOS / Windows / Linux)
    plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei', 'DejaVu Sans']
    plt.rcParams['axes.unicode_minus'] = False

    print("=" * 60)
    print(" BTC time-series forecasting analysis")
    print("=" * 60)

    # ---- Data split ----
    train_df, val_df, test_df = split_data(df)
    print(f"\n  train: {train_df.index[0]} ~ {train_df.index[-1]} ({len(train_df)} days)")
    print(f"  val:   {val_df.index[0]} ~ {val_df.index[-1]} ({len(val_df)} days)")
    print(f"  test:  {test_df.index[0]} ~ {test_df.index[-1]} ({len(test_df)} days)")

    # Log-return series
    train_returns = train_df['log_return'].dropna()
    val_returns = val_df['log_return'].dropna()
    val_dates = val_returns.index
    y_true = val_returns.values

    # ---- Baselines ----
    print("\n" + "=" * 60)
    print("Baseline models")
    print("=" * 60)

    # Random-walk baseline
    rw_pred = _baseline_random_walk(y_true)
    rw_errors = y_true - rw_pred
    print(f"  Random Walk (predicted return=0): RMSE = {_rmse(y_true, rw_pred):.6f}")

    # Historical-mean baseline
    hm_pred = _baseline_historical_mean(train_returns.values, len(y_true))
    print(f"  Historical Mean (return={train_returns.mean():.6f}): RMSE = {_rmse(y_true, hm_pred):.6f}")

    # Collect results for all models
    all_metrics = {}
    model_preds = {}

    # Evaluate the baselines
    all_metrics["Random Walk"] = _evaluate_model("Random Walk", y_true, rw_pred, rw_errors)
    model_preds["Random Walk"] = rw_pred

    all_metrics["Historical Mean"] = _evaluate_model("Historical Mean", y_true, hm_pred, rw_errors)
    model_preds["Historical Mean"] = hm_pred

    # Initialize so the summary section below can test for None safely even
    # when a model fails before assignment
    arima_result = None
    prophet_result = None
    lstm_result = None

    # ---- ARIMA ----
    try:
        arima_result = _run_arima(train_returns, val_returns)
        if arima_result is not None:
            arima_pred = arima_result["predictions"]
            all_metrics["ARIMA"] = _evaluate_model("ARIMA", y_true, arima_pred, rw_errors)
            model_preds["ARIMA"] = arima_pred
            print(f"\n  ARIMA validation: RMSE={all_metrics['ARIMA']['rmse']:.6f}, "
                  f"direction accuracy={all_metrics['ARIMA']['direction_accuracy']*100:.1f}%")
    except Exception as e:
        print(f"\n  [ARIMA] failed: {e}")

    # ---- Prophet ----
    try:
        prophet_result = _run_prophet(train_df, val_df)
        if prophet_result is not None:
            prophet_pred = prophet_result["predictions_return"]
            # Align lengths
            n = min(len(y_true), len(prophet_pred))
            all_metrics["Prophet"] = _evaluate_model(
                "Prophet", y_true[:n], prophet_pred[:n], rw_errors[:n]
            )
            model_preds["Prophet"] = prophet_pred[:n]
            print(f"\n  Prophet validation: RMSE={all_metrics['Prophet']['rmse']:.6f}, "
                  f"direction accuracy={all_metrics['Prophet']['direction_accuracy']*100:.1f}%")

            # Prophet-specific chart
            _plot_prophet_components(prophet_result, output_dir)
    except Exception as e:
        print(f"\n  [Prophet] failed: {e}")

    # ---- LSTM ----
    try:
        lstm_result = _run_lstm(train_df, val_df)
        if lstm_result is not None:
            lstm_pred = lstm_result["predictions_return"]
            lstm_true = lstm_result["true_returns"]

            # LSTM has fewer samples because of the lookback window, so it is
            # evaluated against its own true_returns
            lstm_rw_errors = lstm_true - np.zeros_like(lstm_true)
            all_metrics["LSTM"] = _evaluate_model(
                "LSTM", lstm_true, lstm_pred, lstm_rw_errors
            )
            model_preds["LSTM"] = lstm_pred
            print(f"\n  LSTM validation: RMSE={all_metrics['LSTM']['rmse']:.6f}, "
                  f"direction accuracy={all_metrics['LSTM']['direction_accuracy']*100:.1f}%")

            # LSTM training curves
            _plot_lstm_training(lstm_result["train_losses"],
                                lstm_result["val_losses"], output_dir)
    except Exception as e:
        print(f"\n  [LSTM] failed: {e}")

    # ---- Evaluation summary ----
    _print_metrics_table(all_metrics)

    # ---- Visualization ----
    print("\n[visualization] generating charts...")

    # Prediction comparison (only predictions aligned with y_true; LSTM excluded)
    aligned_preds = {k: v for k, v in model_preds.items()
                     if k != "LSTM" and len(v) == len(y_true)}
    if aligned_preds:
        _plot_predictions(val_dates, y_true, aligned_preds, output_dir)

    # LSTM plotted separately (different length)
    if "LSTM" in model_preds and lstm_result is not None:
        lstm_dates = val_dates[-len(lstm_result["predictions_return"]):]
        _plot_predictions(lstm_dates, lstm_result["true_returns"],
                          {"LSTM": lstm_result["predictions_return"]}, output_dir)

    # Direction-accuracy comparison
    _plot_direction_accuracy(all_metrics, output_dir)

    # Cumulative-error comparison
    _plot_cumulative_error(val_dates, all_metrics, output_dir)

    # ---- Collect ----
    results = {
        "metrics": all_metrics,
        "model_predictions": model_preds,
        "val_dates": val_dates,
        "y_true": y_true,
    }

    if arima_result is not None:
        results["arima"] = arima_result
    if prophet_result is not None:
        results["prophet"] = prophet_result
    if lstm_result is not None:
        results["lstm"] = lstm_result

    print("\n" + "=" * 60)
    print(" Time-series forecasting analysis done!")
    print("=" * 60)

    return results


# ============================================================
# Command-line entry point
# ============================================================

if __name__ == "__main__":
    from data_loader import load_daily
    from preprocessing import add_derived_features

    df = load_daily()
    df = add_derived_features(df)

    results = run_time_series_analysis(df, output_dir="output/time_series")
317
src/visualization.py
Normal file
@@ -0,0 +1,317 @@
"""Shared visualization utilities

Plotting helpers used across modules, plus the summary results dashboard.
"""

import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from pathlib import Path
from typing import Dict
import json

# ── Global style ──────────────────────────────────────────

STYLE_CONFIG = {
    "figure.facecolor": "white",
    "axes.facecolor": "#fafafa",
    "axes.grid": True,
    "grid.alpha": 0.3,
    "grid.linestyle": "--",
    "font.size": 10,
    "axes.titlesize": 13,
    "axes.labelsize": 11,
    "xtick.labelsize": 9,
    "ytick.labelsize": 9,
    "legend.fontsize": 9,
    "figure.dpi": 120,
    "savefig.dpi": 150,
    "savefig.bbox": "tight",
}

COLOR_PALETTE = {
    "primary": "#2563eb",
    "secondary": "#7c3aed",
    "success": "#059669",
    "danger": "#dc2626",
    "warning": "#d97706",
    "info": "#0891b2",
    "muted": "#6b7280",
    "bg_light": "#f8fafc",
}

EVIDENCE_COLORS = {
    "strong": "#059669",    # green
    "moderate": "#d97706",  # orange
    "weak": "#dc2626",      # red
    "none": "#6b7280",      # gray
}


def apply_style():
    """Apply the global matplotlib style."""
    plt.rcParams.update(STYLE_CONFIG)
    try:
        plt.rcParams["font.sans-serif"] = ["Arial Unicode MS", "SimHei", "DejaVu Sans"]
        plt.rcParams["axes.unicode_minus"] = False
    except Exception:
        pass


def ensure_dir(path):
    """Make sure a directory exists and return it as a Path."""
    Path(path).mkdir(parents=True, exist_ok=True)
    return Path(path)


# ── Evidence-scoring framework ─────────────────────────────

EVIDENCE_CRITERIA = """
Criteria for a "genuine regularity" (all must hold):
1. FDR-corrected p < 0.05
2. Permutation-test p < 0.01 (where applicable)
3. Effect keeps its direction and significance on the test set
4. Holds in >80% of bootstrap subsamples (where applicable)
5. Cohen's d > 0.2 or an economically meaningful effect size
6. A plausible economic/market rationale exists
"""


def score_evidence(result: Dict) -> Dict:
    """
    Score the findings reported by a single analysis module.

    Parameters
    ----------
    result : dict
        Module result dict; should contain a 'findings' list

    Returns
    -------
    dict
        Contains score, level, summary
    """
    findings = result.get("findings", [])
    if not findings:
        return {"score": 0, "level": "none", "summary": "no scorable findings",
                "n_findings": 0, "total_score": 0, "details": []}

    total_score = 0
    details = []

    for f in findings:
        s = 0
        name = f.get("name", "unnamed")
        p_value = f.get("p_value")
        effect_size = f.get("effect_size")
        significant = f.get("significant", False)
        description = f.get("description", "")

        if significant:
            s += 2
        if p_value is not None and p_value < 0.01:
            s += 1
        if effect_size is not None and abs(effect_size) > 0.2:
            s += 1
        if f.get("test_set_consistent", False):
            s += 2
        if f.get("bootstrap_robust", False):
            s += 1

        total_score += s
        details.append({"name": name, "score": s, "description": description})

    avg = total_score / len(findings) if findings else 0

    if avg >= 5:
        level = "strong"
    elif avg >= 3:
        level = "moderate"
    elif avg >= 1:
        level = "weak"
    else:
        level = "none"

    return {
        "score": round(avg, 2),
        "level": level,
        "n_findings": len(findings),
        "total_score": total_score,
        "details": details,
    }
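

# Worked example of the scoring rubric above (illustrative): a finding with
# significant=True (+2), p_value=0.004 (+1), effect_size=0.35 (+1) and
# test_set_consistent=True (+2) scores 6; a module with only that finding
# averages 6.0 -> level "strong". The maximum per finding is 7 (2+1+1+2+1),
# so "strong" (avg >= 5) effectively requires out-of-sample consistency.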


# ── Summary dashboard ─────────────────────────────────────

def generate_summary_dashboard(all_results: Dict[str, Dict], output_dir: str = "output"):
    """
    Generate the combined analysis dashboard.

    Parameters
    ----------
    all_results : dict
        {module_name: module_result_dict}
    output_dir : str
        Output directory
    """
    apply_style()
    out = ensure_dir(output_dir)

    # ── 1. Aggregate evidence strength per module ──
    summary_rows = []
    for module, result in all_results.items():
        ev = score_evidence(result)
        summary_rows.append({
            "module": module,
            "score": ev["score"],
            "level": ev["level"],
            "n_findings": ev["n_findings"],
            "total_score": ev["total_score"],
        })

    summary_df = pd.DataFrame(summary_rows)
    if summary_df.empty:
        print("[visualization] no module results to aggregate")
        return {}

    summary_df.sort_values("score", ascending=True, inplace=True)

    # ── 2. Horizontal bar chart of evidence strength ──
    fig, ax = plt.subplots(figsize=(10, max(6, len(summary_df) * 0.5)))
    colors = [EVIDENCE_COLORS.get(row["level"], "#6b7280") for _, row in summary_df.iterrows()]
    bars = ax.barh(summary_df["module"], summary_df["score"], color=colors, edgecolor="white", linewidth=0.5)

    for bar, (_, row) in zip(bars, summary_df.iterrows()):
        ax.text(bar.get_width() + 0.1, bar.get_y() + bar.get_height()/2,
                f'{row["score"]:.1f} ({row["level"]})',
                va='center', fontsize=9)

    ax.set_xlabel("Evidence Score")
    ax.set_title("BTC/USDT Analysis - Evidence Strength by Module")
    ax.axvline(x=3, color="#d97706", linestyle="--", alpha=0.5, label="Moderate threshold")
    ax.axvline(x=5, color="#059669", linestyle="--", alpha=0.5, label="Strong threshold")
    ax.legend(loc="lower right")
    plt.tight_layout()
    fig.savefig(out / "evidence_dashboard.png")
    plt.close(fig)

    # ── 3. Plain-text summary report ──
    report_lines = []
    report_lines.append("=" * 70)
    report_lines.append("BTC/USDT price-regularity analysis - summary report")
    report_lines.append("=" * 70)
    report_lines.append("")
    report_lines.append(EVIDENCE_CRITERIA)
    report_lines.append("")
    report_lines.append("-" * 70)
    report_lines.append(f"{'module':<30} {'score':>6} {'level':>10} {'findings':>8}")
    report_lines.append("-" * 70)

    for _, row in summary_df.sort_values("score", ascending=False).iterrows():
        report_lines.append(
            f"{row['module']:<30} {row['score']:>6.2f} {row['level']:>10} {row['n_findings']:>8}"
        )

    report_lines.append("-" * 70)
    report_lines.append("")

    # Grouped summary
    strong = summary_df[summary_df["level"] == "strong"]["module"].tolist()
    moderate = summary_df[summary_df["level"] == "moderate"]["module"].tolist()
    weak = summary_df[summary_df["level"] == "weak"]["module"].tolist()
    none_found = summary_df[summary_df["level"] == "none"]["module"].tolist()

    report_lines.append("## Strong evidence (reproducible, economically meaningful):")
    if strong:
        for m in strong:
            report_lines.append(f"  * {m}")
    else:
        report_lines.append("  (none)")

    report_lines.append("")
    report_lines.append("## Moderate evidence (statistically significant, limited effect):")
    if moderate:
        for m in moderate:
            report_lines.append(f"  * {m}")
    else:
        report_lines.append("  (none)")

    report_lines.append("")
    report_lines.append("## Weak / not significant:")
    for m in weak + none_found:
        report_lines.append(f"  * {m}")

    report_lines.append("")
    report_lines.append("=" * 70)
    report_lines.append("Note: scores are based on the statistical tests each module reports.")
    report_lines.append("      See each subdirectory's output for parameters and charts.")
    report_lines.append("=" * 70)

    report_text = "\n".join(report_lines)

    with open(out / "综合结论报告.txt", "w", encoding="utf-8") as f:
        f.write(report_text)

    # ── 4. JSON result dump ──
    json_results = {}
    for module, result in all_results.items():
        # Drop objects that cannot be serialized
        clean = {}
        for k, v in result.items():
            try:
                json.dumps(v)
                clean[k] = v
            except (TypeError, ValueError):
                clean[k] = str(v)
        json_results[module] = clean

    with open(out / "all_results.json", "w", encoding="utf-8") as f:
        json.dump(json_results, f, ensure_ascii=False, indent=2, default=str)

    print(report_text)

    return {
        "summary_df": summary_df,
        "report_path": str(out / "综合结论报告.txt"),
        "dashboard_path": str(out / "evidence_dashboard.png"),
        "json_path": str(out / "all_results.json"),
    }


def plot_price_overview(df: pd.DataFrame, output_dir: str = "output"):
    """Price overview chart (log scale + volume + key-event annotations)."""
    apply_style()
    out = ensure_dir(output_dir)

    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 8), height_ratios=[3, 1],
                                   sharex=True, gridspec_kw={"hspace": 0.05})

    # Price (log scale)
    ax1.semilogy(df.index, df["close"], color=COLOR_PALETTE["primary"], linewidth=0.8)
    ax1.set_ylabel("Price (USDT, log scale)")
    ax1.set_title("BTC/USDT Price & Volume Overview")

    # Annotate halving events
    halvings = [
        ("2020-05-11", "3rd Halving"),
        ("2024-04-20", "4th Halving"),
    ]
    for date_str, label in halvings:
        dt = pd.Timestamp(date_str)
        if df.index.min() <= dt <= df.index.max():
            ax1.axvline(x=dt, color=COLOR_PALETTE["danger"], linestyle="--", alpha=0.6)
            ax1.text(dt, ax1.get_ylim()[1] * 0.9, label, rotation=90,
                     va="top", fontsize=8, color=COLOR_PALETTE["danger"])

    # Volume
    ax2.bar(df.index, df["volume"], width=1, color=COLOR_PALETTE["info"], alpha=0.5)
    ax2.set_ylabel("Volume")
    ax2.set_xlabel("Date")

    fig.savefig(out / "price_overview.png")
    plt.close(fig)
    print(f"[visualization] price overview -> {out / 'price_overview.png'}")
639
src/volatility_analysis.py
Normal file
@@ -0,0 +1,639 @@
"""Volatility clustering and asymmetric GARCH modeling

Contents:
- Multi-window realized volatility (7d, 30d, 90d)
- Power-law decay test on the volatility ACF (long memory)
- GARCH / EGARCH / GJR-GARCH model comparison
- Leverage effect: correlation between returns and future volatility
"""

import matplotlib
matplotlib.use('Agg')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from scipy.optimize import curve_fit
from statsmodels.tsa.stattools import acf
from pathlib import Path

from src.data_loader import load_daily
from src.preprocessing import log_returns


# ============================================================
# 1. Multi-window realized volatility
# ============================================================

def multi_window_realized_vol(returns: pd.Series,
                              windows: list = [7, 30, 90]) -> pd.DataFrame:
    """
    Compute annualized realized volatility over several rolling windows.

    Parameters
    ----------
    returns : pd.Series
        Daily log returns
    windows : list
        Rolling-window lengths in days

    Returns
    -------
    pd.DataFrame
        Realized volatility per window, columns 'rv_7d', 'rv_30d', 'rv_90d', ...
    """
    vol_df = pd.DataFrame(index=returns.index)
    for w in windows:
        # Realized volatility = sqrt(sum(r^2)) * sqrt(365/window), annualized
        rv = np.sqrt((returns ** 2).rolling(window=w).sum()) * np.sqrt(365 / w)
        vol_df[f'rv_{w}d'] = rv
    return vol_df.dropna(how='all')
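

# Sanity check of the annualization above (illustrative numbers): a constant
# daily log return of 2% over a 30-day window gives
#
#   RV = sqrt(30 * 0.02**2) * sqrt(365 / 30)
#      = 0.02 * sqrt(365) ≈ 0.382,
#
# i.e. ~38% annualized volatility. The window length cancels, as it should:
# only the per-day magnitude and the 365-day year enter the annualized figure.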


# ============================================================
# 2. Power-law decay of the volatility ACF (long memory)
# ============================================================

def volatility_acf_power_law(returns: pd.Series,
                             max_lags: int = 200) -> dict:
    """
    Test whether the ACF of |returns| decays as a power law: ACF(k) ~ k^(-d).

    Long-memory criterion: 0 < d < 1.

    Parameters
    ----------
    returns : pd.Series
        Daily log returns
    max_lags : int
        Maximum lag

    Returns
    -------
    dict
        Power-law exponent d, goodness of fit R², ACF values, etc.
    """
    abs_returns = returns.dropna().abs()

    # Compute the ACF
    acf_values = acf(abs_returns, nlags=max_lags, fft=True)
    # Start from lag=1 (lag 0 is always 1)
    lags = np.arange(1, max_lags + 1)
    acf_vals = acf_values[1:]

    # Only positive ACF values can enter the log-log fit
    positive_mask = acf_vals > 0
    lags_pos = lags[positive_mask]
    acf_pos = acf_vals[positive_mask]

    if len(lags_pos) < 10:
        print("[warning] too few positive ACF values for a reliable power-law fit")
        return {
            'd': np.nan, 'r_squared': np.nan,
            'lags': lags, 'acf_values': acf_vals,
            'is_long_memory': False,
        }

    # Log-log linear regression: log(ACF) = -d * log(k) + c
    log_lags = np.log(lags_pos)
    log_acf = np.log(acf_pos)
    slope, intercept, r_value, p_value, std_err = stats.linregress(log_lags, log_acf)

    d = -slope  # power-law decay exponent
    r_squared = r_value ** 2

    # Nonlinear fit as a cross-check (direct power-law fit)
    def power_law(k, a, d_param):
        return a * k ** (-d_param)

    try:
        popt, _ = curve_fit(power_law, lags_pos, acf_pos,
                            p0=[acf_pos[0], d], maxfev=5000)
        d_nonlinear = popt[1]
    except (RuntimeError, ValueError):
        d_nonlinear = np.nan

    results = {
        'd': d,
        'd_nonlinear': d_nonlinear,
        'r_squared': r_squared,
        'slope': slope,
        'intercept': intercept,
        'p_value': p_value,
        'std_err': std_err,
        'lags': lags,
        'acf_values': acf_vals,
        'lags_positive': lags_pos,
        'acf_positive': acf_pos,
        'is_long_memory': 0 < d < 1,
    }
    return results
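

# The two estimators above should roughly agree when the power law holds:
# taking logs of ACF(k) = a * k**(-d) gives log ACF(k) = log a - d * log k, so
# the regression slope is -d and the intercept is log a. For example, with
# a = 0.3 and d = 0.3, ACF(1) = 0.3 while ACF(100) = 0.3 * 100**-0.3 ≈ 0.075,
# a decay slow enough to stay visibly above zero at lag 200 -- the signature
# of long memory this test looks for.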


# ============================================================
# 3. GARCH / EGARCH / GJR-GARCH model comparison
# ============================================================

def compare_garch_models(returns: pd.Series) -> dict:
    """
    Fit GARCH(1,1), EGARCH(1,1) and GJR-GARCH(1,1) and compare AIC/BIC.

    Parameters
    ----------
    returns : pd.Series
        Daily log returns

    Returns
    -------
    dict
        Parameters, AIC/BIC and leverage parameters per model
    """
    from arch import arch_model

    r_pct = returns.dropna() * 100  # percentage returns
    results = {}

    # --- GARCH(1,1) ---
    model_garch = arch_model(r_pct, vol='Garch', p=1, q=1,
                             mean='Constant', dist='Normal')
    res_garch = model_garch.fit(disp='off')
    results['GARCH'] = {
        'params': dict(res_garch.params),
        'aic': res_garch.aic,
        'bic': res_garch.bic,
        'log_likelihood': res_garch.loglikelihood,
        'conditional_volatility': res_garch.conditional_volatility / 100,
        'result_obj': res_garch,
    }

    # --- EGARCH(1,1) ---
    model_egarch = arch_model(r_pct, vol='EGARCH', p=1, q=1,
                              mean='Constant', dist='Normal')
    res_egarch = model_egarch.fit(disp='off')
    # EGARCH's gamma captures the leverage effect (negative: bad news raises volatility)
    egarch_params = dict(res_egarch.params)
    results['EGARCH'] = {
        'params': egarch_params,
        'aic': res_egarch.aic,
        'bic': res_egarch.bic,
        'log_likelihood': res_egarch.loglikelihood,
        'conditional_volatility': res_egarch.conditional_volatility / 100,
        'leverage_param': egarch_params.get('gamma[1]', np.nan),
        'result_obj': res_egarch,
    }

    # --- GJR-GARCH(1,1) ---
    # In the arch package, GJR-GARCH is vol='Garch' with o=1
    model_gjr = arch_model(r_pct, vol='Garch', p=1, o=1, q=1,
                           mean='Constant', dist='Normal')
    res_gjr = model_gjr.fit(disp='off')
    gjr_params = dict(res_gjr.params)
    results['GJR-GARCH'] = {
        'params': gjr_params,
        'aic': res_gjr.aic,
        'bic': res_gjr.bic,
        'log_likelihood': res_gjr.loglikelihood,
        'conditional_volatility': res_gjr.conditional_volatility / 100,
        # gamma[1] > 0 means negative shocks add extra volatility
        'leverage_param': gjr_params.get('gamma[1]', np.nan),
        'result_obj': res_gjr,
    }

    return results
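

# Reference for the variance equations being compared (standard textbook forms;
# the parameter names match the arch package's output):
#
#   GARCH(1,1):     sigma2_t = omega + alpha*eps2_{t-1} + beta*sigma2_{t-1}
#   GJR-GARCH(1,1): sigma2_t = omega + (alpha + gamma*1[eps_{t-1}<0])*eps2_{t-1}
#                              + beta*sigma2_{t-1}
#
# Persistence is alpha + beta for GARCH (alpha + gamma/2 + beta for GJR under
# symmetric shocks); a value near 1, such as the 0.973 reported for this
# dataset, means volatility shocks decay very slowly, with a half-life of
# ln(0.5)/ln(0.973) ≈ 25 days.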


# ============================================================
# 4. Leverage-effect analysis
# ============================================================

def leverage_effect_analysis(returns: pd.Series,
                             forward_windows: list = [5, 10, 20]) -> dict:
    """
    Correlate returns with future volatility (leverage effect).

    Leverage effect: negative returns tend to raise future volatility and
    positive returns tend to lower it, i.e. corr(r_t, vol_{t+k}) < 0.

    Parameters
    ----------
    returns : pd.Series
        Daily log returns
    forward_windows : list
        Forward volatility windows

    Returns
    -------
    dict
        Correlations and significance per window
    """
    r = returns.dropna()
    results = {}

    for w in forward_windows:
        # Forward realized volatility
        future_vol = r.abs().rolling(window=w).mean().shift(-w)
        # Align valid observations
        valid = pd.DataFrame({'return': r, 'future_vol': future_vol}).dropna()

        if len(valid) < 30:
            results[f'{w}d'] = {
                'correlation': np.nan,
                'p_value': np.nan,
                'n_samples': len(valid),
            }
            continue

        corr, p_val = stats.pearsonr(valid['return'], valid['future_vol'])
        # Spearman rank correlation as a robustness check
        spearman_corr, spearman_p = stats.spearmanr(valid['return'], valid['future_vol'])

        results[f'{w}d'] = {
            'pearson_correlation': corr,
            'pearson_pvalue': p_val,
            'spearman_correlation': spearman_corr,
            'spearman_pvalue': spearman_p,
            'n_samples': len(valid),
            'return_series': valid['return'],
            'future_vol_series': valid['future_vol'],
        }

    return results


# ============================================================
# 5. Visualization
# ============================================================

def plot_realized_volatility(vol_df: pd.DataFrame, output_dir: Path):
    """Plot multi-window realized volatility over time."""
    fig, ax = plt.subplots(figsize=(14, 6))

    colors = ['#1f77b4', '#ff7f0e', '#2ca02c']
    labels = {'rv_7d': '7-day', 'rv_30d': '30-day', 'rv_90d': '90-day'}

    for idx, col in enumerate(vol_df.columns):
        label = labels.get(col, col)
        ax.plot(vol_df.index, vol_df[col], linewidth=0.8,
                color=colors[idx % len(colors)],
                label=f'{label} realized volatility (annualized)', alpha=0.85)

    ax.set_xlabel('date', fontsize=12)
    ax.set_ylabel('annualized volatility', fontsize=12)
    ax.set_title('BTC multi-window realized volatility', fontsize=14)
    ax.legend(fontsize=11)
    ax.grid(True, alpha=0.3)

    fig.savefig(output_dir / 'realized_volatility_multiwindow.png',
                dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"[saved] {output_dir / 'realized_volatility_multiwindow.png'}")


def plot_acf_power_law(acf_results: dict, output_dir: Path):
    """Plot the ACF and its power-law fit."""
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    lags = acf_results['lags']
    acf_vals = acf_results['acf_values']

    # Left: raw ACF
    ax1 = axes[0]
    ax1.bar(lags, acf_vals, width=1, alpha=0.6, color='steelblue')
    ax1.set_xlabel('lag', fontsize=11)
    ax1.set_ylabel('ACF', fontsize=11)
    ax1.set_title('ACF of |returns|', fontsize=12)
    ax1.grid(True, alpha=0.3)
    ax1.axhline(y=0, color='black', linewidth=0.5)

    # Right: log-log plot + power-law fit
    ax2 = axes[1]
    lags_pos = acf_results['lags_positive']
    acf_pos = acf_results['acf_positive']

    ax2.scatter(np.log(lags_pos), np.log(acf_pos), s=10, alpha=0.5,
                color='steelblue', label='observed ACF')

    # Fitted line
    d = acf_results['d']
    intercept = acf_results['intercept']
    x_fit = np.linspace(np.log(lags_pos.min()), np.log(lags_pos.max()), 100)
    y_fit = -d * x_fit + intercept
    ax2.plot(x_fit, y_fit, 'r-', linewidth=2,
             label=f'power-law fit: d={d:.3f}, R²={acf_results["r_squared"]:.3f}')

    ax2.set_xlabel('log(lag)', fontsize=11)
    ax2.set_ylabel('log(ACF)', fontsize=11)
    ax2.set_title('Power-law decay fit (log-log scale)', fontsize=12)
    ax2.legend(fontsize=10)
    ax2.grid(True, alpha=0.3)

    fig.tight_layout()
    fig.savefig(output_dir / 'acf_power_law_fit.png',
                dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"[saved] {output_dir / 'acf_power_law_fit.png'}")


def plot_model_comparison(model_results: dict, output_dir: Path):
    """Plot the GARCH model comparison (AIC/BIC + conditional volatility)."""
    fig, axes = plt.subplots(2, 1, figsize=(14, 10))

    model_names = list(model_results.keys())
    aic_values = [model_results[m]['aic'] for m in model_names]
    bic_values = [model_results[m]['bic'] for m in model_names]

    # Top: AIC/BIC bar chart
    ax1 = axes[0]
    x = np.arange(len(model_names))
    width = 0.35
    bars1 = ax1.bar(x - width / 2, aic_values, width, label='AIC',
                    color='steelblue', alpha=0.8)
    bars2 = ax1.bar(x + width / 2, bic_values, width, label='BIC',
                    color='coral', alpha=0.8)

    ax1.set_xlabel('model', fontsize=12)
    ax1.set_ylabel('information criterion', fontsize=12)
    ax1.set_title('GARCH model information criteria (lower is better)', fontsize=13)
    ax1.set_xticks(x)
    ax1.set_xticklabels(model_names, fontsize=11)
    ax1.legend(fontsize=11)
    ax1.grid(True, alpha=0.3, axis='y')

    # Annotate bar values
    for bar in bars1:
        height = bar.get_height()
        ax1.annotate(f'{height:.1f}',
                     xy=(bar.get_x() + bar.get_width() / 2, height),
                     xytext=(0, 3), textcoords="offset points",
                     ha='center', va='bottom', fontsize=9)
    for bar in bars2:
        height = bar.get_height()
        ax1.annotate(f'{height:.1f}',
                     xy=(bar.get_x() + bar.get_width() / 2, height),
                     xytext=(0, 3), textcoords="offset points",
                     ha='center', va='bottom', fontsize=9)

    # Bottom: conditional volatility per model
    ax2 = axes[1]
    colors = {'GARCH': '#1f77b4', 'EGARCH': '#ff7f0e', 'GJR-GARCH': '#2ca02c'}
    for name in model_names:
        cv = model_results[name]['conditional_volatility']
        ax2.plot(cv.index, cv.values, linewidth=0.7,
                 color=colors.get(name, 'gray'),
                 label=name, alpha=0.8)

    ax2.set_xlabel('date', fontsize=12)
    ax2.set_ylabel('conditional volatility', fontsize=12)
    ax2.set_title('Conditional volatility by GARCH model', fontsize=13)
    ax2.legend(fontsize=11)
    ax2.grid(True, alpha=0.3)

    fig.tight_layout()
    fig.savefig(output_dir / 'garch_model_comparison.png',
                dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"[saved] {output_dir / 'garch_model_comparison.png'}")


def plot_leverage_effect(leverage_results: dict, output_dir: Path):
    """Scatter plots of the leverage effect."""
    # Windows that actually have data
    valid_windows = [w for w, r in leverage_results.items()
                     if 'return_series' in r]
    n_plots = len(valid_windows)
    if n_plots == 0:
        print("[warning] no valid leverage-effect data to plot")
        return

    fig, axes = plt.subplots(1, n_plots, figsize=(6 * n_plots, 5))
    if n_plots == 1:
        axes = [axes]

    for idx, window_key in enumerate(valid_windows):
        ax = axes[idx]
        data = leverage_results[window_key]
        ret = data['return_series']
        fvol = data['future_vol_series']

        # Scatter (subsample to avoid overplotting)
        n_sample = min(len(ret), 2000)
        sample_idx = np.random.choice(len(ret), n_sample, replace=False)
        ax.scatter(ret.values[sample_idx], fvol.values[sample_idx],
                   s=5, alpha=0.3, color='steelblue')

        # Regression line
        z = np.polyfit(ret.values, fvol.values, 1)
        p = np.poly1d(z)
        x_line = np.linspace(ret.min(), ret.max(), 100)
        ax.plot(x_line, p(x_line), 'r-', linewidth=2)

        corr = data['pearson_correlation']
        p_val = data['pearson_pvalue']
        ax.set_xlabel('same-day log return', fontsize=11)
        ax.set_ylabel(f'mean |return| over next {window_key}', fontsize=11)
        ax.set_title(f'Leverage effect ({window_key})\n'
                     f'Pearson r={corr:.4f}, p={p_val:.2e}', fontsize=11)
        ax.grid(True, alpha=0.3)

    fig.tight_layout()
    fig.savefig(output_dir / 'leverage_effect_scatter.png',
                dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f"[saved] {output_dir / 'leverage_effect_scatter.png'}")


# ============================================================
# 6. Result printing
# ============================================================

def print_realized_vol_summary(vol_df: pd.DataFrame):
    """Print summary statistics of realized volatility."""
    print("\n" + "=" * 60)
    print("Multi-window realized volatility (annualized)")
    print("=" * 60)
    for col in vol_df.columns:
        s = vol_df[col].dropna()
        print(f"\n  {col}:")
        print(f"    mean:   {s.mean():.4f} ({s.mean() * 100:.2f}%)")
        print(f"    median: {s.median():.4f} ({s.median() * 100:.2f}%)")
        print(f"    max:    {s.max():.4f} ({s.max() * 100:.2f}%)")
        print(f"    min:    {s.min():.4f} ({s.min() * 100:.2f}%)")
        print(f"    std:    {s.std():.4f}")


def print_acf_power_law_results(results: dict):
    """Print the ACF power-law test results."""
    print("\n" + "=" * 60)
    print("Power-law decay of the volatility ACF (long memory)")
    print("=" * 60)
    print(f"  decay exponent d (linear fit):    {results['d']:.4f}")
    print(f"  decay exponent d (nonlinear fit): {results['d_nonlinear']:.4f}")
    print(f"  goodness of fit R²: {results['r_squared']:.4f}")
    print(f"  regression slope: {results['slope']:.4f}")
    print(f"  regression intercept: {results['intercept']:.4f}")
    print(f"  p-value: {results['p_value']:.2e}")
    print(f"  std. error: {results['std_err']:.4f}")
    print(f"\n  long-memory verdict (0 < d < 1): "
          f"{'yes - long memory present' if results['is_long_memory'] else 'no'}")
    if results['is_long_memory']:
        print("  → the ACF of |returns| decays slowly at a power-law rate")
        print("  → volatility clustering has long memory; GARCH persistence may understate it")


def print_model_comparison(model_results: dict):
    """Print the GARCH model comparison."""
    print("\n" + "=" * 60)
    print("GARCH / EGARCH / GJR-GARCH comparison")
    print("=" * 60)

    print(f"\n  {'model':<14} {'AIC':>12} {'BIC':>12} {'log-lik.':>12}")
    print("  " + "-" * 52)
    for name, res in model_results.items():
        print(f"  {name:<14} {res['aic']:>12.2f} {res['bic']:>12.2f} "
              f"{res['log_likelihood']:>12.2f}")

    # Best models
    best_aic = min(model_results.items(), key=lambda x: x[1]['aic'])
    best_bic = min(model_results.items(), key=lambda x: x[1]['bic'])
    print(f"\n  best by AIC: {best_aic[0]} (AIC={best_aic[1]['aic']:.2f})")
    print(f"  best by BIC: {best_bic[0]} (BIC={best_bic[1]['bic']:.2f})")

    # Leverage parameters
    print("\n  leverage parameters:")
    for name in ['EGARCH', 'GJR-GARCH']:
        if name in model_results and 'leverage_param' in model_results[name]:
            gamma = model_results[name]['leverage_param']
            print(f"    {name} gamma[1] = {gamma:.6f}")
            if name == 'EGARCH':
                # In EGARCH, gamma < 0 means negative shocks raise volatility
                if gamma < 0:
                    print("      → gamma < 0: negative returns (drops) raise volatility; leverage effect present")
                else:
                    print("      → gamma >= 0: no clear leverage effect observed")
            elif name == 'GJR-GARCH':
                # In GJR-GARCH, gamma > 0 is the extra impact of negative shocks
                if gamma > 0:
                    print("      → gamma > 0: negative shocks add extra volatility; leverage effect present")
                else:
                    print("      → gamma <= 0: no clear leverage effect observed")

    # Full parameter dump per model
    print("\n  full parameters per model:")
    for name, res in model_results.items():
        print(f"\n  [{name}]")
        for param_name, param_val in res['params'].items():
            print(f"    {param_name}: {param_val:.6f}")


def print_leverage_results(leverage_results: dict):
    """Print the leverage-effect analysis."""
    print("\n" + "=" * 60)
    print("Leverage effect: correlation of returns with future volatility")
    print("=" * 60)
    print(f"\n  {'window':<8} {'Pearson r':>12} {'p':>12} "
          f"{'Spearman r':>12} {'p':>12} {'n':>8}")
    print("  " + "-" * 66)
    for window, data in leverage_results.items():
        if 'pearson_correlation' in data:
            print(f"  {window:<8} "
                  f"{data['pearson_correlation']:>12.4f} "
                  f"{data['pearson_pvalue']:>12.2e} "
                  f"{data['spearman_correlation']:>12.4f} "
                  f"{data['spearman_pvalue']:>12.2e} "
                  f"{data['n_samples']:>8d}")
        else:
            print(f"  {window:<8} {'N/A':>12} {'N/A':>12} "
                  f"{'N/A':>12} {'N/A':>12} {data.get('n_samples', 0):>8d}")

    # Interpretation
    print("\n  how to read this:")
    print("  - correlation < 0: volatility rises after negative returns → leverage effect")
    print("  - correlation ≈ 0: return sign unrelated to future volatility")
    print("  - correlation > 0: volatility rises after positive returns (inverse leverage / volatility feedback)")
    print("  - note: as a crypto asset, BTC's leverage effect may differ from equities")


# ============================================================
# 7. Main entry point
# ============================================================

def run_volatility_analysis(df: pd.DataFrame, output_dir: str = "output/volatility"):
    """
    Main routine: volatility clustering and asymmetric GARCH analysis.

    Parameters
    ----------
    df : pd.DataFrame
        Daily OHLCV data (with a 'close' column and DatetimeIndex)
    output_dir : str
        Output directory for charts
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    print("=" * 60)
    print("BTC volatility clustering and asymmetric GARCH analysis")
    print("=" * 60)
    print(f"data range: {df.index.min()} ~ {df.index.max()}")
    print(f"sample size: {len(df)}")

    # Daily log returns
    daily_returns = log_returns(df['close'])
    print(f"daily log-return samples: {len(daily_returns)}")

    # CJK-capable font fallbacks
    plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei', 'DejaVu Sans']
    plt.rcParams['axes.unicode_minus'] = False

    # Fix the random seed so the leverage-effect scatter subsampling is reproducible
    np.random.seed(42)

    # --- Multi-window realized volatility ---
    print("\n>>> computing multi-window realized volatility (7d, 30d, 90d)...")
    vol_df = multi_window_realized_vol(daily_returns, windows=[7, 30, 90])
    print_realized_vol_summary(vol_df)
    plot_realized_volatility(vol_df, output_dir)

    # --- ACF power-law test ---
    print("\n>>> running the volatility-ACF power-law test...")
    acf_results = volatility_acf_power_law(daily_returns, max_lags=200)
    print_acf_power_law_results(acf_results)
    plot_acf_power_law(acf_results, output_dir)

    # --- GARCH model comparison ---
    print("\n>>> fitting GARCH / EGARCH / GJR-GARCH models...")
    model_results = compare_garch_models(daily_returns)
    print_model_comparison(model_results)
    plot_model_comparison(model_results, output_dir)

    # --- Leverage-effect analysis ---
    print("\n>>> running the leverage-effect analysis...")
    leverage_results = leverage_effect_analysis(daily_returns,
                                                forward_windows=[5, 10, 20])
    print_leverage_results(leverage_results)
    plot_leverage_effect(leverage_results, output_dir)

    print("\n" + "=" * 60)
    print("Volatility analysis done!")
    print(f"charts saved to: {output_dir.resolve()}")
    print("=" * 60)

    # Return everything for downstream use
    return {
        'realized_vol': vol_df,
        'acf_power_law': acf_results,
        'model_comparison': model_results,
        'leverage_effect': leverage_results,
    }


# ============================================================
# Standalone entry point
# ============================================================

if __name__ == '__main__':
    df = load_daily()
    run_volatility_analysis(df)
577
src/volume_price_analysis.py
Normal file
@@ -0,0 +1,577 @@
"""Volume-price relationship and OBV analysis

Analyzes the relationship between BTC volume and price moves: Spearman
correlation, taker-buy-ratio lead-lag analysis, Granger causality tests,
and OBV divergence detection.
"""

import matplotlib
matplotlib.use('Agg')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from statsmodels.tsa.stattools import grangercausalitytests
from pathlib import Path
from typing import Dict, List, Tuple

# CJK-capable font fallbacks
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei', 'DejaVu Sans']
plt.rcParams['axes.unicode_minus'] = False


# =============================================================================
# Core analysis functions
# =============================================================================

def _spearman_volume_returns(volume: pd.Series, returns: pd.Series) -> Dict:
    """Spearman rank correlation: volume vs. |returns|.

    Spearman is used instead of Pearson because the volume-price
    relationship is typically nonlinear.

    Returns
    -------
    dict
        Contains correlation, p_value, n_samples
    """
    # Align indices and drop NaN
    abs_ret = returns.abs()
    aligned = pd.concat([volume, abs_ret], axis=1, keys=['volume', 'abs_return']).dropna()

    corr, p_val = stats.spearmanr(aligned['volume'], aligned['abs_return'])

    return {
        'correlation': corr,
        'p_value': p_val,
        'n_samples': len(aligned),
    }


def _taker_buy_ratio_lead_lag(
    taker_buy_ratio: pd.Series,
    returns: pd.Series,
    max_lag: int = 20,
) -> pd.DataFrame:
    """Lead-lag analysis of the taker buy ratio.

    Cross-correlates taker_buy_ratio(t) with returns(t+lag) to test whether
    the buy ratio predicts future returns.

    Parameters
    ----------
    taker_buy_ratio : pd.Series
        Taker buy-volume share
    returns : pd.Series
        Log returns
    max_lag : int
        Maximum lead in days

    Returns
    -------
    pd.DataFrame
        Columns: lag, correlation, p_value, significant
    """
    results = []
    for lag in range(1, max_lag + 1):
        # taker_buy_ratio(t) vs returns(t+lag)
        ratio_shifted = taker_buy_ratio.shift(lag)
        aligned = pd.concat([ratio_shifted, returns], axis=1).dropna()
        aligned.columns = ['ratio', 'return']

        if len(aligned) < 30:
            continue

        corr, p_val = stats.spearmanr(aligned['ratio'], aligned['return'])
        results.append({
            'lag': lag,
            'correlation': corr,
            'p_value': p_val,
            'significant': p_val < 0.05,
        })

    return pd.DataFrame(results)
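

# Multiple-testing caveat: with 20 lags tested at alpha = 0.05, one lag is
# expected to come up "significant" by chance alone. A hedged sketch of a
# Benjamini-Hochberg correction on the table above, using statsmodels'
# multipletests (ratio and rets are placeholder inputs; the column names
# follow the DataFrame built here):
#
#   from statsmodels.stats.multitest import multipletests
#   lead_lag = _taker_buy_ratio_lead_lag(ratio, rets)
#   reject, p_adj, _, _ = multipletests(lead_lag['p_value'], alpha=0.05,
#                                       method='fdr_bh')
#   lead_lag['significant_fdr'] = reject   # FDR-corrected significance flags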


def _granger_causality(
    volume: pd.Series,
    returns: pd.Series,
    max_lag: int = 10,
) -> Dict[str, pd.DataFrame]:
    """Bidirectional Granger causality tests: volume ↔ returns.

    Parameters
    ----------
    volume : pd.Series
        Volume series
    returns : pd.Series
        Return series
    max_lag : int
        Maximum lag order

    Returns
    -------
    dict
        'volume_to_returns': p-value table for volume → returns
        'returns_to_volume': p-value table for returns → volume
    """
    # Align and drop NaN
    aligned = pd.concat([volume, returns], axis=1, keys=['volume', 'returns']).dropna()

    results = {}

    # Direction 1: volume → returns (does volume Granger-cause returns?)
    # grangercausalitytests expects columns ordered [caused, causing]
    try:
        data_v2r = aligned[['returns', 'volume']].values
        gc_v2r = grangercausalitytests(data_v2r, maxlag=max_lag, verbose=False)
        rows_v2r = []
        for lag_order in range(1, max_lag + 1):
            test_results = gc_v2r[lag_order][0]
            rows_v2r.append({
                'lag': lag_order,
                'ssr_ftest_pval': test_results['ssr_ftest'][1],
                'ssr_chi2test_pval': test_results['ssr_chi2test'][1],
                'lrtest_pval': test_results['lrtest'][1],
                'params_ftest_pval': test_results['params_ftest'][1],
            })
        results['volume_to_returns'] = pd.DataFrame(rows_v2r)
    except Exception as e:
        print(f"  [warning] volume→returns Granger test failed: {e}")
        results['volume_to_returns'] = pd.DataFrame()

    # Direction 2: returns → volume
    try:
        data_r2v = aligned[['volume', 'returns']].values
        gc_r2v = grangercausalitytests(data_r2v, maxlag=max_lag, verbose=False)
        rows_r2v = []
        for lag_order in range(1, max_lag + 1):
            test_results = gc_r2v[lag_order][0]
            rows_r2v.append({
                'lag': lag_order,
                'ssr_ftest_pval': test_results['ssr_ftest'][1],
                'ssr_chi2test_pval': test_results['ssr_chi2test'][1],
                'lrtest_pval': test_results['lrtest'][1],
                'params_ftest_pval': test_results['params_ftest'][1],
            })
        results['returns_to_volume'] = pd.DataFrame(rows_r2v)
    except Exception as e:
        print(f"  [warning] returns→volume Granger test failed: {e}")
        results['returns_to_volume'] = pd.DataFrame()

    return results
|
||||
|
||||
|
||||
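# Sketch of how the returned tables are typically read (uses this module's
# _granger_causality and the column names defined above; `volume` and
# `log_ret` stand in for the pipeline's aligned series):
#
#   gc = _granger_causality(volume, log_ret, max_lag=5)
#   v2r = gc['volume_to_returns']
#   if not v2r.empty:
#       # Row with the smallest SSR F-test p-value across lags 1..5
#       best = v2r.loc[v2r['ssr_ftest_pval'].idxmin()]
#       print(int(best['lag']), best['ssr_ftest_pval'])

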
def _compute_obv(df: pd.DataFrame) -> pd.Series:
    """Compute OBV (On-Balance Volume).

    Rules:
    - close up:   OBV += volume
    - close down: OBV -= volume
    - close flat: OBV unchanged
    """
    close = df['close']
    volume = df['volume']

    direction = np.sign(close.diff())
    obv = (direction * volume).fillna(0).cumsum()
    obv.name = 'obv'
    return obv


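# Worked example of the OBV rule on four hypothetical bars:
#   close  = [100, 101, 100, 100]  -> diff signs: nan, +1, -1, 0
#   volume = [ 10,   5,   8,   3]  -> contributions: 0, +5, -8, 0
#   cumulative OBV = [0, 5, -3, -3]
#
#   demo = pd.DataFrame({'close': [100, 101, 100, 100],
#                        'volume': [10, 5, 8, 3]})
#   print(_compute_obv(demo).tolist())  # [0.0, 5.0, -3.0, -3.0]

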
def _detect_obv_divergences(
    prices: pd.Series,
    obv: pd.Series,
    window: int = 60,
    lookback: int = 5,
) -> pd.DataFrame:
    """Detect OBV-price divergences.

    Divergence types:
    - Bearish (top) divergence: price makes a new high but OBV does not
      -> potential sell-off signal.
    - Bullish (bottom) divergence: price makes a new low but OBV does not
      -> potential rally signal.

    Parameters
    ----------
    prices : pd.Series
        Close price series.
    obv : pd.Series
        OBV series.
    window : int
        Rolling window used to define "new high" / "new low".
    lookback : int
        Confirmation lookback (days) before scanning starts.

    Returns
    -------
    pd.DataFrame
        Divergence events with columns date, type, price, obv.
    """
    divergences = []

    # Rolling highs/lows
    price_rolling_max = prices.rolling(window=window, min_periods=window).max()
    price_rolling_min = prices.rolling(window=window, min_periods=window).min()
    obv_rolling_max = obv.rolling(window=window, min_periods=window).max()
    obv_rolling_min = obv.rolling(window=window, min_periods=window).min()

    for i in range(window + lookback, len(prices)):
        idx = prices.index[i]
        price_val = prices.iloc[i]
        obv_val = obv.iloc[i]

        rolling_max_price = price_rolling_max.iloc[i]
        rolling_max_obv = obv_rolling_max.iloc[i]
        rolling_min_price = price_rolling_min.iloc[i]
        rolling_min_obv = obv_rolling_min.iloc[i]

        # Bearish divergence: price at its rolling high while OBV stays below
        # 95% of its rolling high (note: these ratio thresholds are
        # sign-sensitive when OBV is negative)
        if price_val >= rolling_max_price * 0.998:
            if obv_val < rolling_max_obv * 0.95:
                divergences.append({
                    'date': idx,
                    'type': 'bearish',
                    'price': price_val,
                    'obv': obv_val,
                })

        # Bullish divergence: price at its rolling low while OBV holds above
        # its rolling low
        if price_val <= rolling_min_price * 1.002:
            if obv_val > rolling_min_obv * 1.05:
                divergences.append({
                    'date': idx,
                    'type': 'bullish',
                    'price': price_val,
                    'obv': obv_val,
                })

    df_div = pd.DataFrame(divergences)

    # Drop clustered duplicate signals (same-type signals must be at least
    # 10 days apart)
    if not df_div.empty:
        df_div = df_div.sort_values('date')
        filtered = [df_div.iloc[0]]
        for _, row in df_div.iloc[1:].iterrows():
            last = filtered[-1]
            if row['type'] != last['type'] or (row['date'] - last['date']).days >= 10:
                filtered.append(row)
        df_div = pd.DataFrame(filtered).reset_index(drop=True)

    return df_div


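# Usage sketch (window/lookback shrunk here for short series; real runs use
# the 60/5 defaults via run_volume_price_analysis):
#
#   obv = _compute_obv(df)
#   div = _detect_obv_divergences(df['close'], obv, window=30, lookback=3)
#   print(div['type'].value_counts() if not div.empty else 'no divergences')

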
# =============================================================================
# Visualization functions
# =============================================================================

def _plot_volume_return_scatter(
    volume: pd.Series,
    returns: pd.Series,
    spearman_result: Dict,
    output_dir: Path,
):
    """Figure 1: volume vs |returns| scatter plot."""
    fig, ax = plt.subplots(figsize=(10, 7))

    abs_ret = returns.abs()
    aligned = pd.concat([volume, abs_ret], axis=1, keys=['volume', 'abs_return']).dropna()

    ax.scatter(aligned['volume'], aligned['abs_return'],
               s=5, alpha=0.3, color='steelblue')

    rho = spearman_result['correlation']
    p_val = spearman_result['p_value']
    ax.set_xlabel('成交量', fontsize=12)
    ax.set_ylabel('|对数收益率|', fontsize=12)
    ax.set_title(f'成交量 vs |收益率| 散点图\nSpearman ρ={rho:.4f}, p={p_val:.2e}', fontsize=13)
    ax.grid(True, alpha=0.3)

    fig.savefig(output_dir / 'volume_return_scatter.png', dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f" [图] 量价散点图已保存: {output_dir / 'volume_return_scatter.png'}")


def _plot_lead_lag_correlation(
    lead_lag_df: pd.DataFrame,
    output_dir: Path,
):
    """Figure 2: taker-buy-ratio lead-lag correlation bar chart."""
    fig, ax = plt.subplots(figsize=(12, 6))

    if lead_lag_df.empty:
        ax.text(0.5, 0.5, '数据不足,无法计算领先-滞后相关性',
                transform=ax.transAxes, ha='center', va='center', fontsize=14)
        fig.savefig(output_dir / 'taker_buy_lead_lag.png', dpi=150, bbox_inches='tight')
        plt.close(fig)
        return

    colors = ['red' if sig else 'steelblue'
              for sig in lead_lag_df['significant']]

    ax.bar(lead_lag_df['lag'], lead_lag_df['correlation'],
           color=colors, alpha=0.8, edgecolor='white')

    # Zero reference line
    ax.axhline(y=0, color='black', linewidth=0.5)

    ax.set_xlabel('领先天数 (lag)', fontsize=12)
    ax.set_ylabel('Spearman 相关系数', fontsize=12)
    ax.set_title('Taker买入比例对未来收益的领先相关性\n(红色=p<0.05 显著)', fontsize=13)
    ax.set_xticks(lead_lag_df['lag'])
    ax.grid(True, alpha=0.3, axis='y')

    fig.savefig(output_dir / 'taker_buy_lead_lag.png', dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f" [图] Taker买入比例领先分析已保存: {output_dir / 'taker_buy_lead_lag.png'}")


def _plot_granger_heatmap(
    granger_results: Dict[str, pd.DataFrame],
    output_dir: Path,
):
    """Figure 3: Granger causality p-value heatmap."""
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))

    titles = {
        'volume_to_returns': '成交量 → 收益率',
        'returns_to_volume': '收益率 → 成交量',
    }

    im = None
    for ax, (direction, df_gc) in zip(axes, granger_results.items()):
        if df_gc.empty:
            ax.text(0.5, 0.5, '检验失败', transform=ax.transAxes,
                    ha='center', va='center', fontsize=14)
            ax.set_title(titles[direction], fontsize=13)
            continue

        # Build the heatmap matrix
        test_names = ['ssr_ftest_pval', 'ssr_chi2test_pval', 'lrtest_pval', 'params_ftest_pval']
        test_labels = ['SSR F-test', 'SSR Chi2', 'LR test', 'Params F-test']
        lags = df_gc['lag'].values

        heatmap_data = df_gc[test_names].values.T  # shape: (4, n_lags)

        im = ax.imshow(heatmap_data, aspect='auto', cmap='RdYlGn',
                       vmin=0, vmax=0.1, interpolation='nearest')

        ax.set_xticks(range(len(lags)))
        ax.set_xticklabels(lags, fontsize=9)
        ax.set_yticks(range(len(test_labels)))
        ax.set_yticklabels(test_labels, fontsize=9)
        ax.set_xlabel('滞后阶数', fontsize=11)
        ax.set_title(f'Granger因果: {titles[direction]}', fontsize=13)

        # Annotate p-values
        for i in range(len(test_labels)):
            for j in range(len(lags)):
                val = heatmap_data[i, j]
                color = 'white' if val < 0.03 else 'black'
                ax.text(j, i, f'{val:.3f}', ha='center', va='center',
                        fontsize=7, color=color)

    if im is not None:
        fig.colorbar(im, ax=axes, label='p-value', shrink=0.8)
    fig.tight_layout()
    fig.savefig(output_dir / 'granger_causality_heatmap.png', dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f" [图] Granger因果热力图已保存: {output_dir / 'granger_causality_heatmap.png'}")


def _plot_obv_with_divergences(
    df: pd.DataFrame,
    obv: pd.Series,
    divergences: pd.DataFrame,
    output_dir: Path,
):
    """Figure 4: OBV vs price with divergence markers."""
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(16, 10), sharex=True,
                                   gridspec_kw={'height_ratios': [2, 1]})

    # Top panel: price
    ax1.plot(df.index, df['close'], color='black', linewidth=0.8, label='BTC 收盘价')
    ax1.set_ylabel('价格 (USDT)', fontsize=12)
    ax1.set_title('BTC 价格与OBV背离分析', fontsize=14)
    ax1.set_yscale('log')
    ax1.grid(True, alpha=0.3, which='both')

    # Bottom panel: OBV
    ax2.plot(obv.index, obv.values, color='steelblue', linewidth=0.8, label='OBV')
    ax2.set_ylabel('OBV', fontsize=12)
    ax2.set_xlabel('日期', fontsize=12)
    ax2.grid(True, alpha=0.3)

    # Mark divergences
    if not divergences.empty:
        bearish = divergences[divergences['type'] == 'bearish']
        bullish = divergences[divergences['type'] == 'bullish']

        if not bearish.empty:
            ax1.scatter(bearish['date'], bearish['price'],
                        marker='v', s=60, color='red', zorder=5,
                        label=f'顶背离 ({len(bearish)}次)', alpha=0.7)
            for _, row in bearish.iterrows():
                ax2.axvline(row['date'], color='red', alpha=0.2, linewidth=0.5)

        if not bullish.empty:
            ax1.scatter(bullish['date'], bullish['price'],
                        marker='^', s=60, color='green', zorder=5,
                        label=f'底背离 ({len(bullish)}次)', alpha=0.7)
            for _, row in bullish.iterrows():
                ax2.axvline(row['date'], color='green', alpha=0.2, linewidth=0.5)

    ax1.legend(fontsize=10, loc='upper left')
    ax2.legend(fontsize=10, loc='upper left')

    fig.tight_layout()
    fig.savefig(output_dir / 'obv_divergence.png', dpi=150, bbox_inches='tight')
    plt.close(fig)
    print(f" [图] OBV背离分析已保存: {output_dir / 'obv_divergence.png'}")


# =============================================================================
# Main entry point
# =============================================================================

def run_volume_price_analysis(df: pd.DataFrame, output_dir: str = "output") -> Dict:
    """Volume-price relationship and OBV analysis — main entry point.

    Parameters
    ----------
    df : pd.DataFrame
        Daily data returned by data_loader.load_daily(), with a
        DatetimeIndex and close, volume, taker_buy_volume columns.
    output_dir : str
        Chart output directory.

    Returns
    -------
    dict
        Summary of analysis results.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    print("=" * 60)
    print(" BTC 成交量-价格关系分析")
    print("=" * 60)

    # Prepare data
    prices = df['close'].dropna()
    volume = df['volume'].dropna()
    log_ret = np.log(prices / prices.shift(1)).dropna()

    # Taker buy ratio (guard against zero volume)
    taker_buy_ratio = (df['taker_buy_volume'] / df['volume'].replace(0, np.nan)).dropna()

    print(f"\n数据范围: {df.index[0].date()} ~ {df.index[-1].date()}")
    print(f"样本数量: {len(df)}")

    # ---- Step 1: Spearman correlation ----
    print("\n--- Spearman 成交量-|收益率| 相关性 ---")
    spearman_result = _spearman_volume_returns(volume, log_ret)
    print(f" Spearman ρ: {spearman_result['correlation']:.4f}")
    print(f" p-value: {spearman_result['p_value']:.2e}")
    print(f" 样本量: {spearman_result['n_samples']}")
    if spearman_result['p_value'] < 0.01:
        print(" >> 结论: 成交量与|收益率|存在显著正相关(成交量放大伴随大幅波动)")
    else:
        print(" >> 结论: 成交量与|收益率|相关性不显著")

    # ---- Step 2: taker-buy-ratio lead-lag analysis ----
    print("\n--- Taker买入比例领先分析 ---")
    lead_lag_df = _taker_buy_ratio_lead_lag(taker_buy_ratio, log_ret, max_lag=20)
    if not lead_lag_df.empty:
        sig_lags = lead_lag_df[lead_lag_df['significant']]
        if not sig_lags.empty:
            print(f" 显著领先期 (p<0.05):")
            for _, row in sig_lags.iterrows():
                print(f" lag={int(row['lag']):>2d}天: ρ={row['correlation']:.4f}, p={row['p_value']:.4f}")
            best = sig_lags.loc[sig_lags['correlation'].abs().idxmax()]
            print(f" >> 最强领先信号: lag={int(best['lag'])}天, ρ={best['correlation']:.4f}")
        else:
            print(" 未发现显著的领先关系 (所有lag的p>0.05)")
    else:
        print(" 数据不足,无法进行领先-滞后分析")

    # ---- Step 3: Granger causality tests ----
    print("\n--- Granger 因果检验 (双向, lag 1-10) ---")
    granger_results = _granger_causality(volume, log_ret, max_lag=10)

    for direction, label in [('volume_to_returns', '成交量→收益率'),
                             ('returns_to_volume', '收益率→成交量')]:
        df_gc = granger_results[direction]
        if not df_gc.empty:
            # Use the SSR F-test p-values
            sig_gc = df_gc[df_gc['ssr_ftest_pval'] < 0.05]
            if not sig_gc.empty:
                print(f" {label}: 在以下滞后阶显著 (SSR F-test p<0.05):")
                for _, row in sig_gc.iterrows():
                    print(f" lag={int(row['lag'])}: p={row['ssr_ftest_pval']:.4f}")
            else:
                print(f" {label}: 在所有滞后阶均不显著")
        else:
            print(f" {label}: 检验失败")

    # ---- Step 4: OBV computation and divergence detection ----
    print("\n--- OBV 与 价格背离分析 ---")
    obv = _compute_obv(df)
    divergences = _detect_obv_divergences(prices, obv, window=60, lookback=5)

    if not divergences.empty:
        bearish_count = len(divergences[divergences['type'] == 'bearish'])
        bullish_count = len(divergences[divergences['type'] == 'bullish'])
        print(f" 检测到 {len(divergences)} 个背离信号:")
        print(f" 顶背离 (看跌): {bearish_count} 次")
        print(f" 底背离 (看涨): {bullish_count} 次")

        # Most recent divergences
        recent = divergences.tail(5)
        print(f" 最近 {len(recent)} 个背离:")
        for _, row in recent.iterrows():
            div_type = '顶背离' if row['type'] == 'bearish' else '底背离'
            date_str = row['date'].strftime('%Y-%m-%d')
            print(f" {date_str}: {div_type}, 价格=${row['price']:,.0f}")
    else:
        bearish_count = 0
        bullish_count = 0
        print(" 未检测到明显的OBV-价格背离")

    # ---- Step 5: generate charts ----
    print("\n--- 生成可视化图表 ---")
    _plot_volume_return_scatter(volume, log_ret, spearman_result, output_dir)
    _plot_lead_lag_correlation(lead_lag_df, output_dir)
    _plot_granger_heatmap(granger_results, output_dir)
    _plot_obv_with_divergences(df, obv, divergences, output_dir)

    print("\n" + "=" * 60)
    print(" 成交量-价格分析完成")
    print("=" * 60)

    # Return a result summary
    return {
        'spearman': spearman_result,
        'lead_lag': {
            'significant_lags': lead_lag_df[lead_lag_df['significant']]['lag'].tolist()
            if not lead_lag_df.empty else [],
        },
        'granger': {
            'volume_to_returns_sig_lags': granger_results['volume_to_returns'][
                granger_results['volume_to_returns']['ssr_ftest_pval'] < 0.05
            ]['lag'].tolist() if not granger_results['volume_to_returns'].empty else [],
            'returns_to_volume_sig_lags': granger_results['returns_to_volume'][
                granger_results['returns_to_volume']['ssr_ftest_pval'] < 0.05
            ]['lag'].tolist() if not granger_results['returns_to_volume'].empty else [],
        },
        'obv_divergences': {
            'total': len(divergences),
            'bearish': bearish_count,
            'bullish': bullish_count,
        },
    }


if __name__ == '__main__':
    from data_loader import load_daily

    df = load_daily()
    results = run_volume_price_analysis(df, output_dir='../output/volume_price')
817
src/wavelet_analysis.py
Normal file
@@ -0,0 +1,817 @@
"""Wavelet transform analysis module: CWT time-frequency analysis, global
wavelet spectrum, significance testing, and key-period power tracking."""

import matplotlib
matplotlib.use('Agg')

import numpy as np
import pandas as pd
import pywt
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from matplotlib.colors import LogNorm
from scipy.signal import detrend
from pathlib import Path
from typing import Dict, List, Optional, Tuple

from src.preprocessing import log_returns, standardize


# ============================================================================
# Core parameter configuration
# ============================================================================

WAVELET = 'cmor1.5-1.0'            # complex Morlet wavelet (bandwidth=1.5, center_freq=1.0)
MIN_PERIOD = 7                     # minimum period (days)
MAX_PERIOD = 1500                  # maximum period (days)
NUM_SCALES = 256                   # number of scales
KEY_PERIODS = [30, 90, 365, 1400]  # key periods to track (days)
N_SURROGATES = 1000                # number of Monte Carlo surrogates
SIGNIFICANCE_LEVEL = 0.95          # significance level
DPI = 150                          # figure resolution


# ============================================================================
# Helpers: period <-> scale conversion
# ============================================================================

def _periods_to_scales(periods: np.ndarray, wavelet: str, dt: float = 1.0) -> np.ndarray:
    """Convert periods (days) to CWT scale parameters.

    Parameters
    ----------
    periods : np.ndarray
        Target periods (days).
    wavelet : str
        Wavelet name.
    dt : float
        Sampling interval (days).

    Returns
    -------
    np.ndarray
        Corresponding scales.
    """
    central_freq = pywt.central_frequency(wavelet)
    scales = central_freq * periods / dt
    return scales


def _scales_to_periods(scales: np.ndarray, wavelet: str, dt: float = 1.0) -> np.ndarray:
    """Convert CWT scale parameters to periods (days)."""
    central_freq = pywt.central_frequency(wavelet)
    periods = scales * dt / central_freq
    return periods


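# Round-trip sanity check: for 'cmor1.5-1.0' the center frequency is 1.0
# (see the WAVELET comment above), so with dt = 1 day a 365-day period maps
# to scale 365 and back:
#
#   p = np.array([365.0])
#   s = _periods_to_scales(p, 'cmor1.5-1.0')   # -> array([365.])
#   _scales_to_periods(s, 'cmor1.5-1.0')       # -> array([365.])

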
# ============================================================================
# Core computation: continuous wavelet transform
# ============================================================================

def compute_cwt(
    signal: np.ndarray,
    dt: float = 1.0,
    wavelet: str = WAVELET,
    min_period: float = MIN_PERIOD,
    max_period: float = MAX_PERIOD,
    num_scales: int = NUM_SCALES,
) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Compute the continuous wavelet transform (CWT).

    Parameters
    ----------
    signal : np.ndarray
        Input time series (ideally standardized).
    dt : float
        Sampling interval (days).
    wavelet : str
        Wavelet function name.
    min_period : float
        Minimum analysis period (days).
    max_period : float
        Maximum analysis period (days).
    num_scales : int
        Scale resolution.

    Returns
    -------
    coeffs : np.ndarray
        CWT coefficient matrix (n_scales, n_times).
    periods : np.ndarray
        Corresponding periods (days).
    scales : np.ndarray
        Scale array.
    """
    # Log-spaced period grid
    periods = np.logspace(np.log10(min_period), np.log10(max_period), num_scales)
    scales = _periods_to_scales(periods, wavelet, dt)

    # Run the CWT
    coeffs, _ = pywt.cwt(signal, scales, wavelet, sampling_period=dt)

    return coeffs, periods, scales


def compute_power_spectrum(coeffs: np.ndarray) -> np.ndarray:
    """Compute the wavelet power spectrum |W(s,t)|^2.

    Parameters
    ----------
    coeffs : np.ndarray
        CWT coefficient matrix.

    Returns
    -------
    np.ndarray
        Power spectrum matrix.
    """
    return np.abs(coeffs) ** 2


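# Minimal end-to-end sketch on synthetic data (hypothetical input, not
# pipeline data): a pure 90-day sine should concentrate time-averaged power
# near period 90 on the log-spaced grid:
#
#   t = np.arange(1500)
#   sig = np.sin(2 * np.pi * t / 90)
#   coeffs, periods, scales = compute_cwt(sig, dt=1.0)
#   power = compute_power_spectrum(coeffs)
#   print(periods[np.argmax(power.mean(axis=1))])  # ~90

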
# ============================================================================
# Cone of influence (COI)
# ============================================================================

def compute_coi(n: int, dt: float = 1.0, wavelet: str = WAVELET) -> np.ndarray:
    """Compute the cone-of-influence (COI) boundary.

    The COI marks the region where edge effects are significant. For the
    Morlet wavelet the e-folding time is sqrt(2) * scale, so points close
    to either end of the series are unreliable at long periods. This
    implementation uses the simplified, conservative boundary
    coi_period = sqrt(2) * (distance to nearest edge).

    Parameters
    ----------
    n : int
        Length of the time series.
    dt : float
        Sampling interval.
    wavelet : str
        Wavelet name (unused by the simplified boundary; kept for API
        symmetry with the other helpers).

    Returns
    -------
    coi_periods : np.ndarray
        COI period boundary (days) at each time point.
    """
    # Distance to the nearest edge, rising from both ends toward the middle
    t = np.arange(n) * dt
    coi_time = np.minimum(t, (n - 1) * dt - t)
    # Simplified boundary: coi_period = sqrt(2) * coi_time
    coi_periods = np.sqrt(2) * coi_time
    # Clip the minimum to one sampling interval
    coi_periods = np.maximum(coi_periods, dt)
    return coi_periods


# ============================================================================
# AR(1) red-noise significance test (Monte Carlo)
# ============================================================================

def _estimate_ar1(signal: np.ndarray) -> float:
    """Estimate the AR(1) coefficient (lag-1 autocorrelation) of a signal.

    Parameters
    ----------
    signal : np.ndarray
        Input time series.

    Returns
    -------
    float
        Lag-1 autocorrelation coefficient.
    """
    n = len(signal)
    x = signal - np.mean(signal)
    c0 = np.sum(x ** 2) / n
    c1 = np.sum(x[:-1] * x[1:]) / n
    if c0 == 0:
        return 0.0
    alpha = c1 / c0
    return np.clip(alpha, -0.999, 0.999)


def _generate_ar1_surrogate(n: int, alpha: float, variance: float) -> np.ndarray:
    """Generate an AR(1) red-noise surrogate series.

    x(t) = alpha * x(t-1) + noise, with the innovation variance scaled so
    that the stationary variance matches the original signal.

    Parameters
    ----------
    n : int
        Series length.
    alpha : float
        AR(1) coefficient.
    variance : float
        Variance of the original signal.

    Returns
    -------
    np.ndarray
        AR(1) surrogate series.
    """
    noise_std = np.sqrt(variance * (1 - alpha ** 2))
    noise = np.random.normal(0, noise_std, n)
    surrogate = np.zeros(n)
    surrogate[0] = noise[0]
    for i in range(1, n):
        surrogate[i] = alpha * surrogate[i - 1] + noise[i]
    return surrogate


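# Quick self-check of the surrogate generator: because noise_std is scaled
# by sqrt(1 - alpha^2), the surrogate preserves the target variance and the
# requested lag-1 autocorrelation (large n makes the estimates tight):
#
#   x = _generate_ar1_surrogate(n=50_000, alpha=0.3, variance=2.0)
#   np.var(x)          # ~2.0
#   _estimate_ar1(x)   # ~0.3

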
def significance_test_monte_carlo(
    signal: np.ndarray,
    periods: np.ndarray,
    dt: float = 1.0,
    wavelet: str = WAVELET,
    n_surrogates: int = N_SURROGATES,
    significance_level: float = SIGNIFICANCE_LEVEL,
) -> Tuple[np.ndarray, np.ndarray]:
    """AR(1) red-noise Monte Carlo significance test.

    Generates many AR(1) surrogates, computes the distribution of their
    global wavelet spectra, and returns the threshold at the requested
    confidence level.

    Parameters
    ----------
    signal : np.ndarray
        Original time series.
    periods : np.ndarray
        Period grid of the CWT analysis.
    dt : float
        Sampling interval.
    wavelet : str
        Wavelet name.
    n_surrogates : int
        Number of surrogate series.
    significance_level : float
        Significance level (e.g. 0.95 for 95% confidence).

    Returns
    -------
    significance_threshold : np.ndarray
        Significance threshold per period.
    surrogate_spectra : np.ndarray
        Global spectra of all surrogates (n_surrogates, n_periods).
    """
    n = len(signal)
    alpha = _estimate_ar1(signal)
    variance = np.var(signal)
    scales = _periods_to_scales(periods, wavelet, dt)

    print(f" AR(1) 系数 alpha = {alpha:.4f}")
    print(f" 生成 {n_surrogates} 个AR(1)替代数据进行Monte Carlo检验...")

    surrogate_global_spectra = np.zeros((n_surrogates, len(periods)))

    for i in range(n_surrogates):
        surrogate = _generate_ar1_surrogate(n, alpha, variance)
        coeffs_surr, _ = pywt.cwt(surrogate, scales, wavelet, sampling_period=dt)
        power_surr = np.abs(coeffs_surr) ** 2
        surrogate_global_spectra[i, :] = np.mean(power_surr, axis=1)

        if (i + 1) % 200 == 0:
            print(f" Monte Carlo 进度: {i + 1}/{n_surrogates}")

    # The requested percentile is the significance threshold
    percentile = significance_level * 100
    significance_threshold = np.percentile(surrogate_global_spectra, percentile, axis=0)

    return significance_threshold, surrogate_global_spectra


# ============================================================================
# Global wavelet spectrum
# ============================================================================

def compute_global_wavelet_spectrum(power: np.ndarray) -> np.ndarray:
    """Compute the global wavelet spectrum (time-averaged power).

    Parameters
    ----------
    power : np.ndarray
        Power spectrum matrix (n_scales, n_times).

    Returns
    -------
    np.ndarray
        Global wavelet spectrum (n_scales,).
    """
    return np.mean(power, axis=1)


def find_significant_periods(
    global_spectrum: np.ndarray,
    significance_threshold: np.ndarray,
    periods: np.ndarray,
) -> List[Dict]:
    """Find period peaks that exceed the significance threshold.

    Detects local maxima in the global spectrum above the 95% confidence
    level.

    Parameters
    ----------
    global_spectrum : np.ndarray
        Global wavelet spectrum.
    significance_threshold : np.ndarray
        Significance threshold.
    periods : np.ndarray
        Period grid.

    Returns
    -------
    list of dict
        Significant periods, each with period, power, threshold, ratio.
    """
    # Mask of regions above the threshold
    above_mask = global_spectrum > significance_threshold

    significant = []
    if not np.any(above_mask):
        return significant

    # Locate the peak within each contiguous above-threshold interval
    diff = np.diff(above_mask.astype(int))
    starts = np.where(diff == 1)[0] + 1
    ends = np.where(diff == -1)[0] + 1

    # Handle intervals touching the array boundaries
    if above_mask[0]:
        starts = np.insert(starts, 0, 0)
    if above_mask[-1]:
        ends = np.append(ends, len(above_mask))

    for s, e in zip(starts, ends):
        segment = global_spectrum[s:e]
        peak_idx = s + np.argmax(segment)
        significant.append({
            'period': float(periods[peak_idx]),
            'power': float(global_spectrum[peak_idx]),
            'threshold': float(significance_threshold[peak_idx]),
            'ratio': float(global_spectrum[peak_idx] / significance_threshold[peak_idx]),
        })

    # Sort by power, descending
    significant.sort(key=lambda x: x['power'], reverse=True)
    return significant


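# Toy illustration of the peak finder: one spectral peak above a flat
# threshold yields a single entry (all values made up for the example):
#
#   per = np.array([10., 20., 30., 40.])
#   spec = np.array([1.0, 5.0, 1.0, 1.0])
#   thr = np.full(4, 2.0)
#   find_significant_periods(spec, thr, per)
#   # -> [{'period': 20.0, 'power': 5.0, 'threshold': 2.0, 'ratio': 2.5}]

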
# ============================================================================
# Time evolution of power at key periods
# ============================================================================

def extract_power_at_periods(
    power: np.ndarray,
    periods: np.ndarray,
    key_periods: Optional[List[float]] = None,
) -> Dict[float, Dict]:
    """Extract the power time series at selected key periods.

    Parameters
    ----------
    power : np.ndarray
        Power spectrum matrix (n_scales, n_times).
    periods : np.ndarray
        Period grid.
    key_periods : list of float, optional
        Key periods to track (days); defaults to KEY_PERIODS.

    Returns
    -------
    dict
        Maps each requested period to {'power': power time series,
        'actual_period': closest grid period}.
    """
    if key_periods is None:
        key_periods = KEY_PERIODS

    result = {}
    for target_period in key_periods:
        # Index of the grid period closest to the target
        idx = np.argmin(np.abs(periods - target_period))
        actual_period = periods[idx]
        result[target_period] = {
            'power': power[idx, :],
            'actual_period': float(actual_period),
        }

    return result


# ============================================================================
# Visualization
# ============================================================================

def plot_cwt_scalogram(
    power: np.ndarray,
    periods: np.ndarray,
    dates: pd.DatetimeIndex,
    coi_periods: np.ndarray,
    output_path: Path,
    title: str = 'BTC/USDT CWT 时频功率谱(Scalogram)',
) -> None:
    """Plot the CWT scalogram (time-period-power heatmap) with the COI.

    Parameters
    ----------
    power : np.ndarray
        Power spectrum matrix.
    periods : np.ndarray
        Period grid (days).
    dates : pd.DatetimeIndex
        Time index.
    coi_periods : np.ndarray
        Cone-of-influence boundary.
    output_path : Path
        Output file path.
    title : str
        Figure title.
    """
    fig, ax = plt.subplots(figsize=(16, 8))

    # Pseudocolor plot with logarithmic normalization
    t = mdates.date2num(dates.to_pydatetime())
    T, P = np.meshgrid(t, periods)

    # Floor non-positive power values so LogNorm stays well defined
    power_plot = power.copy()
    power_plot[power_plot <= 0] = np.min(power_plot[power_plot > 0]) * 0.1

    im = ax.pcolormesh(
        T, P, power_plot,
        cmap='jet',
        norm=LogNorm(vmin=np.percentile(power_plot, 5), vmax=np.percentile(power_plot, 99)),
        shading='auto',
    )

    # Shade the cone of influence (COI)
    coi_t = mdates.date2num(dates.to_pydatetime())
    ax.fill_between(
        coi_t, coi_periods, periods[-1] * 1.1,
        alpha=0.3, facecolor='white', hatch='x',
        label='影响锥 (COI)',
    )

    # Logarithmic y-axis
    ax.set_yscale('log')
    ax.set_ylim(periods[0], periods[-1])
    ax.invert_yaxis()

    # Mark the key periods
    for kp in KEY_PERIODS:
        if periods[0] <= kp <= periods[-1]:
            ax.axhline(y=kp, color='white', linestyle='--', alpha=0.6, linewidth=0.8)
            ax.text(t[-1] + (t[-1] - t[0]) * 0.01, kp, f'{kp}d',
                    color='white', fontsize=8, va='center')

    # Formatting
    ax.xaxis_date()
    ax.xaxis.set_major_locator(mdates.YearLocator())
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
    ax.set_xlabel('日期', fontsize=12)
    ax.set_ylabel('周期(天)', fontsize=12)
    ax.set_title(title, fontsize=14)

    cbar = fig.colorbar(im, ax=ax, pad=0.08, shrink=0.8)
    cbar.set_label('功率(对数尺度)', fontsize=10)

    ax.legend(loc='lower right', fontsize=9)
    plt.tight_layout()
    fig.savefig(output_path, dpi=DPI, bbox_inches='tight')
    plt.close(fig)
    print(f" Scalogram 已保存: {output_path}")


def plot_global_spectrum(
    global_spectrum: np.ndarray,
    significance_threshold: np.ndarray,
    periods: np.ndarray,
    significant_periods: List[Dict],
    output_path: Path,
    title: str = 'BTC/USDT 全局小波谱 + 95%显著性',
) -> None:
    """Plot the global wavelet spectrum with the 95% red-noise threshold.

    Parameters
    ----------
    global_spectrum : np.ndarray
        Global wavelet spectrum.
    significance_threshold : np.ndarray
        95% significance threshold.
    periods : np.ndarray
        Period grid.
    significant_periods : list of dict
        Significant period info.
    output_path : Path
        Output path.
    title : str
        Figure title.
    """
    fig, ax = plt.subplots(figsize=(10, 7))

    ax.plot(periods, global_spectrum, 'b-', linewidth=1.5, label='全局小波谱')
    ax.plot(periods, significance_threshold, 'r--', linewidth=1.2, label='95% 红噪声显著性')

    # Fill the significant regions
    above = global_spectrum > significance_threshold
    ax.fill_between(
        periods, global_spectrum, significance_threshold,
        where=above, alpha=0.25, color='blue', label='显著区域',
    )

    # Annotate the significant period peaks
    for sp in significant_periods:
        ax.annotate(
            f"{sp['period']:.0f}d\n({sp['ratio']:.1f}x)",
            xy=(sp['period'], sp['power']),
            xytext=(sp['period'] * 1.3, sp['power'] * 1.2),
            fontsize=9,
            arrowprops=dict(arrowstyle='->', color='darkblue', lw=1.0),
            color='darkblue',
            fontweight='bold',
        )

    # Mark the key periods
    for kp in KEY_PERIODS:
        if periods[0] <= kp <= periods[-1]:
            ax.axvline(x=kp, color='gray', linestyle=':', alpha=0.5, linewidth=0.8)
            ax.text(kp, ax.get_ylim()[1] * 0.95, f'{kp}d',
                    ha='center', va='top', fontsize=8, color='gray')

    ax.set_xscale('log')
    ax.set_yscale('log')
    ax.set_xlabel('周期(天)', fontsize=12)
    ax.set_ylabel('功率', fontsize=12)
    ax.set_title(title, fontsize=14)
    ax.legend(loc='upper left', fontsize=10)
    ax.grid(True, alpha=0.3, which='both')

    plt.tight_layout()
    fig.savefig(output_path, dpi=DPI, bbox_inches='tight')
    plt.close(fig)
    print(f" 全局小波谱 已保存: {output_path}")


def plot_key_period_power(
    key_power: Dict[float, Dict],
    dates: pd.DatetimeIndex,
    coi_periods: np.ndarray,
    output_path: Path,
    title: str = 'BTC/USDT 关键周期功率时间演化',
) -> None:
    """Plot the power time series at the key periods.

    Parameters
    ----------
    key_power : dict
        Return value of extract_power_at_periods.
    dates : pd.DatetimeIndex
        Time index.
    coi_periods : np.ndarray
        Cone-of-influence boundary.
    output_path : Path
        Output path.
    title : str
        Figure title.
    """
    n_periods = len(key_power)
    fig, axes = plt.subplots(n_periods, 1, figsize=(16, 3.5 * n_periods), sharex=True)
    if n_periods == 1:
        axes = [axes]

    colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b']

    for i, (target_period, info) in enumerate(key_power.items()):
        ax = axes[i]
        power_ts = info['power']
        actual_period = info['actual_period']

        # Split the series into inside-COI (unreliable) and outside-COI parts
        in_coi = coi_periods < actual_period  # inside COI = unreliable
        reliable_power = power_ts.copy()
        reliable_power[in_coi] = np.nan
        unreliable_power = power_ts.copy()
        unreliable_power[~in_coi] = np.nan

        color = colors[i % len(colors)]
        ax.plot(dates, reliable_power, color=color, linewidth=1.0,
                label=f'{target_period}d (实际 {actual_period:.1f}d)')
        ax.plot(dates, unreliable_power, color=color, linewidth=0.8,
                alpha=0.3, linestyle='--', label='COI 内(不可靠)')

        # Smooth the power to bring out the trend
        window = max(int(target_period / 5), 7)
        smoothed = pd.Series(power_ts).rolling(window=window, center=True, min_periods=1).mean()
        ax.plot(dates, smoothed, color='black', linewidth=1.5, alpha=0.6, label=f'平滑 ({window}d)')

        ax.set_ylabel('功率', fontsize=10)
        ax.set_title(f'周期 ~ {target_period} 天', fontsize=11)
        ax.legend(loc='upper right', fontsize=8, ncol=3)
        ax.grid(True, alpha=0.3)

    axes[-1].xaxis.set_major_locator(mdates.YearLocator())
    axes[-1].xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
    axes[-1].set_xlabel('日期', fontsize=12)

    fig.suptitle(title, fontsize=14, y=1.01)
    plt.tight_layout()
    fig.savefig(output_path, dpi=DPI, bbox_inches='tight')
    plt.close(fig)
    print(f" 关键周期功率图 已保存: {output_path}")


# ============================================================================
# Main entry point
# ============================================================================

def run_wavelet_analysis(
    df: pd.DataFrame,
    output_dir: str,
    wavelet: str = WAVELET,
    min_period: float = MIN_PERIOD,
    max_period: float = MAX_PERIOD,
    num_scales: int = NUM_SCALES,
    key_periods: Optional[List[float]] = None,
    n_surrogates: int = N_SURROGATES,
) -> Dict:
    """Run the full wavelet-analysis pipeline.

    Parameters
    ----------
    df : pd.DataFrame
        Daily DataFrame with a 'close' column and a DatetimeIndex.
    output_dir : str
        Output directory path.
    wavelet : str
        Wavelet function name.
    min_period : float
        Minimum analysis period (days).
    max_period : float
        Maximum analysis period (days).
    num_scales : int
        Scale resolution.
    key_periods : list of float, optional
        Key periods to track; defaults to KEY_PERIODS.
    n_surrogates : int
        Number of Monte Carlo surrogates.

    Returns
    -------
    dict
        All analysis results:
        - coeffs: CWT coefficient matrix
        - power: power spectrum matrix
        - periods: period grid
        - global_spectrum: global wavelet spectrum
        - significance_threshold: 95% significance threshold
        - significant_periods: list of significant periods
        - key_period_power: power evolution at key periods
        - ar1_alpha: AR(1) coefficient
        - dates: time index
    """
    if key_periods is None:
        key_periods = KEY_PERIODS

    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # ---- 1. Data preparation ----
    print("=" * 70)
    print("小波变换分析 (Continuous Wavelet Transform)")
    print("=" * 70)

    prices = df['close'].dropna()
    dates = prices.index
    n = len(prices)

    print(f"\n[数据概况]")
    print(f" 时间范围: {dates[0].strftime('%Y-%m-%d')} ~ {dates[-1].strftime('%Y-%m-%d')}")
    print(f" 样本数: {n}")
    print(f" 小波函数: {wavelet}")
    print(f" 分析周期范围: {min_period}d ~ {max_period}d")

    # Log returns + standardization as the CWT input signal
    log_ret = log_returns(prices)
    signal = standardize(log_ret).values
    signal_dates = log_ret.index

    # Drop any NaN/Inf values
    valid_mask = np.isfinite(signal)
    if not np.all(valid_mask):
        print(f" 警告: 移除 {np.sum(~valid_mask)} 个非有限值")
        signal = signal[valid_mask]
        signal_dates = signal_dates[valid_mask]

    n_signal = len(signal)
    print(f" CWT输入信号长度: {n_signal}")

    # ---- 2. Continuous wavelet transform ----
    print(f"\n[CWT 计算]")
    print(f" 尺度数量: {num_scales}")

    coeffs, periods, scales = compute_cwt(
        signal, dt=1.0, wavelet=wavelet,
        min_period=min_period, max_period=max_period, num_scales=num_scales,
    )
    power = compute_power_spectrum(coeffs)

    print(f" 系数矩阵形状: {coeffs.shape}")
    print(f" 周期范围: {periods[0]:.1f}d ~ {periods[-1]:.1f}d")

    # ---- 3. Cone of influence ----
    coi_periods = compute_coi(n_signal, dt=1.0, wavelet=wavelet)

    # ---- 4. Global wavelet spectrum ----
    print(f"\n[全局小波谱]")
    global_spectrum = compute_global_wavelet_spectrum(power)

    # ---- 5. AR(1) red-noise Monte Carlo significance test ----
    print(f"\n[Monte Carlo 显著性检验]")
    significance_threshold, surrogate_spectra = significance_test_monte_carlo(
        signal, periods, dt=1.0, wavelet=wavelet,
        n_surrogates=n_surrogates, significance_level=SIGNIFICANCE_LEVEL,
    )

    # ---- 6. Identify significant periods ----
    significant_periods = find_significant_periods(
        global_spectrum, significance_threshold, periods,
    )

    print(f"\n[显著周期(超过95%置信水平)]")
    if significant_periods:
        for sp in significant_periods:
            days = sp['period']
            years = days / 365.25
            print(f" * {days:7.0f} 天 ({years:5.2f} 年) | "
                  f"功率={sp['power']:.4f} | 阈值={sp['threshold']:.4f} | "
                  f"比值={sp['ratio']:.2f}x")
    else:
        print(" 未发现超过95%显著性水平的周期")

    # ---- 7. Power evolution at key periods ----
    print(f"\n[关键周期功率追踪]")
    key_power = extract_power_at_periods(power, periods, key_periods)
    for kp, info in key_power.items():
        print(f" {kp}d -> 实际匹配周期: {info['actual_period']:.1f}d, "
              f"平均功率: {np.mean(info['power']):.4f}")

    # ---- 8. Visualization ----
    print(f"\n[生成图表]")

    # 8.1 CWT scalogram
    plot_cwt_scalogram(
        power, periods, signal_dates, coi_periods,
        output_dir / 'wavelet_scalogram.png',
    )

    # 8.2 Global wavelet spectrum + significance
    plot_global_spectrum(
        global_spectrum, significance_threshold, periods, significant_periods,
        output_dir / 'wavelet_global_spectrum.png',
    )

    # 8.3 Power evolution at key periods
    plot_key_period_power(
        key_power, signal_dates, coi_periods,
        output_dir / 'wavelet_key_periods.png',
    )

    # ---- 9. Collect results ----
    ar1_alpha = _estimate_ar1(signal)

    results = {
        'coeffs': coeffs,
        'power': power,
        'periods': periods,
        'scales': scales,
        'global_spectrum': global_spectrum,
        'significance_threshold': significance_threshold,
        'significant_periods': significant_periods,
        'key_period_power': key_power,
        'coi_periods': coi_periods,
        'ar1_alpha': ar1_alpha,
        'dates': signal_dates,
        'wavelet': wavelet,
        'signal_length': n_signal,
    }

    print(f"\n{'=' * 70}")
    print(f"小波分析完成。共生成 3 张图表,保存至: {output_dir}")
    print(f"{'=' * 70}")

    return results


# ============================================================================
# Standalone entry point
# ============================================================================

if __name__ == '__main__':
    from src.data_loader import load_daily

    print("加载 BTC/USDT 日线数据...")
    df = load_daily()
    print(f"数据加载完成: {len(df)} 行\n")

    results = run_wavelet_analysis(df, output_dir='outputs/wavelet')