
RAFT: Why Your Time Series Model Needs a Memory

Time series forecasting has been dominated by increasingly complex architectures. From CNNs to RNNs, from Transformers to MLP-mixers, we've thrown everything at the problem of predicting the future from past data. Yet a fundamental question remains: why are we forcing models to memorize patterns when we could simply look them up?

Enter RAFT (Retrieval-Augmented Forecasting of Time-series), a refreshingly simple yet powerful approach from researchers at KAIST, the Max Planck Institute, and Google Cloud AI. Instead of relying solely on learned parameters, RAFT retrieves relevant historical patterns from the training data at inference time, and the results are striking.

The Problem: Memorization vs. Understanding

Real-world time series are messy. They exhibit:
  • Non-stationary patterns that shift over time
  • Rare events that appear infrequently but matter enormously
  • Complex, non-deterministic processes lacking clear temporal correlations

Traditional deep learning models must internalize all of this into their weights. As the authors note, "the advantages of indiscriminately memorizing all patterns, including noisy and uncorrelated ones, are questionable in terms of both generalizability and efficiency."

Consider this scenario: your model encounters a pattern it has seen only twice in training. Can it reliably predict what will come next? Probably not, unless it can retrieve those exact historical instances and use them as references.

The RAFT Solution: Retrieval as an Inductive Bias

Figure: RAFT retrieves similar historical patterns and uses their subsequent values to inform predictions.

RAFT's architecture elegantly combines three components:

1. Multi-Period Downsampling

The model generates multiple views of the input series by downsampling at different periods (typically 1×, 2×, and 4×). This captures both short-term fluctuations and long-term trends, much like how humans analyze charts at different zoom levels.
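As a rough illustration (not necessarily the authors' exact operator), downsampling a window at period p can be as simple as averaging every p consecutive steps; the period set {1, 2, 4} mirrors the values mentioned above.

```python
import numpy as np

def downsample(x: np.ndarray, period: int) -> np.ndarray:
    """Average every `period` consecutive steps of a 1-D series.

    period=1 returns the series unchanged; periods 2 and 4 give
    progressively smoother, lower-resolution views.
    """
    usable = len(x) - (len(x) % period)          # drop the ragged tail
    return x[:usable].reshape(-1, period).mean(axis=1)

# One input window viewed at three zoom levels.
x = np.sin(np.linspace(0, 8 * np.pi, 96)) + 0.1 * np.random.randn(96)
views = {p: downsample(x, p) for p in (1, 2, 4)}
print({p: v.shape for p, v in views.items()})    # {1: (96,), 2: (48,), 4: (24,)}
```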

2. The Retrieval Module

For each period, RAFT:
  • Extracts key patches from the historical training data using sliding windows
  • Computes Pearson correlation between the current input and all historical keys (after removing offsets)
  • Selects the top-m most similar patches and retrieves their corresponding value patches (what happened next historically)
  • Aggregates these values using softmax-weighted attention
The choice of similarity metric matters: Pearson correlation ignores offset and scale variations, focusing purely on shape, which is crucial for time series where amplitude may vary but the pattern structure remains predictive. A minimal sketch of this retrieval step follows.
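The sketch below assumes a univariate numpy series and illustrative patch lengths and temperature; the function names and exact normalization are mine, not the official implementation's.

```python
import numpy as np

def build_key_value_patches(series, key_len, value_len, stride=1):
    """Slide over the training series: each key patch of length `key_len`
    is paired with the `value_len` steps that followed it historically."""
    keys, values = [], []
    for start in range(0, len(series) - key_len - value_len + 1, stride):
        keys.append(series[start:start + key_len])
        values.append(series[start + key_len:start + key_len + value_len])
    return np.stack(keys), np.stack(values)

def retrieve(query, keys, values, m=10, tau=0.1):
    """Score keys against the query with Pearson correlation (offset- and
    scale-invariant), keep the top-m, and blend their value patches with
    softmax weights."""
    q = (query - query.mean()) / (query.std() + 1e-8)
    k = (keys - keys.mean(axis=1, keepdims=True)) / (keys.std(axis=1, keepdims=True) + 1e-8)
    sims = k @ q / len(query)            # Pearson correlation per historical key
    top = np.argsort(sims)[-m:]          # indices of the m most similar patches
    weights = np.exp(sims[top] / tau)
    weights /= weights.sum()             # softmax over the top-m similarity scores
    return weights @ values[top]         # weighted blend of "what happened next"

# Toy usage: retrieve a 24-step continuation hint for the last 48 observed steps.
train = np.sin(np.linspace(0, 40 * np.pi, 2000))
keys, values = build_key_value_patches(train, key_len=48, value_len=24)
print(retrieve(train[-48:], keys, values, m=5).shape)   # (24,)
```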

3. Lightweight Prediction Head

The retrieved information is projected, aggregated across periods, and concatenated with the processed input. A simple linear layer then produces the final forecast.

Key insight: The model doesn't need to learn rare patterns; it just needs to recognize when they reappear and fetch the appropriate historical continuation. A toy version of this head is sketched below.
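The sketch assumes the retrieved values have already been aggregated per period (as in the retrieval snippet above); the layer sizes and concatenation scheme are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class RetrievalForecastHead(nn.Module):
    """Illustrative head: project the per-period retrieved patches, concatenate
    them with the input window, and map everything to the forecast horizon
    with a single linear layer."""
    def __init__(self, lookback: int, horizon: int, n_periods: int = 3):
        super().__init__()
        self.proj = nn.Linear(horizon, horizon)                    # project retrieved patches
        self.head = nn.Linear(lookback + n_periods * horizon, horizon)

    def forward(self, x: torch.Tensor, retrieved: torch.Tensor) -> torch.Tensor:
        # x: (batch, lookback); retrieved: (batch, n_periods, horizon)
        r = self.proj(retrieved).flatten(start_dim=1)              # (batch, n_periods * horizon)
        return self.head(torch.cat([x, r], dim=-1))                # (batch, horizon)

head = RetrievalForecastHead(lookback=96, horizon=24)
print(head(torch.randn(8, 96), torch.randn(8, 3, 24)).shape)       # torch.Size([8, 24])
```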

Results: 86% Win Rate Across 10 Benchmarks

| Dataset | RAFT | Best Baseline | Improvement |
|---|---|---|---|
| ETTh1 | **0.420** | 0.447 (TimeMixer) | 6.0% |
| ETTh2 | **0.359** | 0.364 (TimeMixer) | 1.4% |
| ETTm1 | **0.348** | 0.381 (TimeMixer) | 8.7% |
| ETTm2 | **0.254** | 0.275 (TimeMixer) | 7.6% |
| Electricity | **0.160** | 0.182 (TimeMixer) | 12.1% |
| Exchange | 0.441 | **0.386 (TimeMixer)** | -14.3% |
| Illness | 2.097 | **1.480 (PatchTST)** | -41.7% |
| Solar | 0.231 | **0.216 (TimeMixer)** | -7.0% |
| Traffic | **0.434** | 0.484 (TimeMixer) | 10.3% |
| Weather | 0.241 | **0.240 (TimeMixer)** | -0.4% |

Average MSE across forecasting horizons (96, 192, 336, 720). Bold indicates the best performance.

RAFT achieves state-of-the-art or second-best performance on 8 out of 10 datasets, with an average win ratio of 86% against contemporary baselines including TimeMixer, PatchTST, and DLinear.

Notably, RAFT accomplishes this with a shallow MLP architecture: no Transformers, no complex attention mechanisms, no decomposition tricks. The retrieval module does the heavy lifting.

When Does Retrieval Help Most?

The authors conducted rigorous ablation studies using synthetic data to understand RAFT's strengths.

Scenario 1: Rare Patterns

Using autoregressive-generated short-term patterns with varying frequencies:

| Pattern Occurrences | RAFT (no retrieval) | RAFT (with retrieval) | Improvement |
|---|---|---|---|
| 1 (rarest) | 0.2590 | 0.2209 | 14.7% ↓ |
| 2 | 0.2310 | 0.2064 | 10.7% ↓ |
| 4 | 0.2344 | 0.2128 | 9.2% ↓ |

Lower MSE is better. Table adapted from the paper.

Finding: As patterns become rarer, retrieval provides a greater benefit. The model simply cannot memorize what it rarely sees, but it can retrieve it.

Scenario 2: Temporally Uncorrelated Patterns

Replacing autoregressive patterns with random walks (where the next step depends only on noise, not history):

| Pattern Occurrences | Improvement with Retrieval |
|---|---|
| 1 | 31.5% MSE reduction |
| 2 | 31.4% MSE reduction |
| 4 | 16.0% MSE reduction |

When patterns lack temporal correlation, making them nearly impossible to learn through gradient descent, retrieval becomes even more critical: the improvement roughly doubles compared to temporally correlated patterns.

Beyond MLPs: Retrieval as a Universal Enhancer

Can retrieval help existing architectures? The authors tested adding their retrieval module to AutoFormer, a Transformer-based model:

| Dataset | AutoFormer | AutoFormer + Retrieval |
|---|---|---|
| ETTh1 | 0.496 | 0.471 |
| ETTm2 | 0.450 | 0.444 |
| ETTh1 | 0.588 | 0.454 |
| ETTm2 | 0.327 | 0.326 |

Consistent improvements across the board, suggesting retrieval augmentation could become a plug-and-play enhancement for time series models regardless of base architecture.
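In that spirit, here is a minimal sketch of the plug-and-play idea: wrap any lookback-to-horizon forecaster with a retrieval branch and learn how to mix the two forecasts. The fusion layer is an assumption for illustration, not the integration the authors used with AutoFormer.

```python
import torch
import torch.nn as nn

class RetrievalAugmented(nn.Module):
    """Wrap an arbitrary base forecaster with a retrieval branch and learn
    a simple fusion of the two forecasts (illustrative, not the paper's)."""
    def __init__(self, base_model: nn.Module, horizon: int):
        super().__init__()
        self.base = base_model
        self.mix = nn.Linear(2 * horizon, horizon)    # learned fusion of both branches

    def forward(self, x: torch.Tensor, retrieved: torch.Tensor) -> torch.Tensor:
        # x: (batch, lookback); retrieved: (batch, horizon) from a retrieval module
        base_forecast = self.base(x)                  # (batch, horizon)
        return self.mix(torch.cat([base_forecast, retrieved], dim=-1))

# Any lookback -> horizon model can be the base; a plain linear layer stands in here.
model = RetrievalAugmented(nn.Linear(96, 24), horizon=24)
print(model(torch.randn(8, 96), torch.randn(8, 24)).shape)   # torch.Size([8, 24])
```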

Practical Considerations

Computational Efficiency

  • Pre-computation: Key-value pairs are extracted once during training (O(N²), where N is the series length)
  • Training: 7.3 seconds per epoch on ETTm1
  • Inference: 1.9 seconds total
  • Optimization: Increasing the stride from 1 to 8 reduces pre-computation time by 89% with minimal performance degradation (see the quick calculation below)
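The stride trade-off is easy to see in the size of the retrieval index: with a larger stride, far fewer key-value patches are extracted and scored. The series length and patch sizes below are illustrative, not the benchmark's exact settings.

```python
# Number of key/value patches extracted from a series of length n with a given
# stride; a larger stride shrinks the retrieval index (and the similarity work
# done during pre-computation) roughly in proportion.
def num_patches(n: int, key_len: int, value_len: int, stride: int) -> int:
    return max(0, (n - key_len - value_len) // stride + 1)

n, key_len, value_len = 50_000, 96, 96   # illustrative sizes
for stride in (1, 8):
    print(f"stride={stride}: {num_patches(n, key_len, value_len, stride):,} patches")
# stride=1: 49,809 patches; stride=8: 6,227 patches -- roughly 8x fewer,
# which is why a larger stride cuts pre-computation cost so sharply.
```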

Key Hyperparameters

  • Look-back window (L): Larger windows generally improve performance (consistent with findings for linear models)
  • Number of retrievals (m): 1, 5, 10, or 20 (dataset dependent)
  • Temperature (τ): 0.1 provides good calibration of attention weights
  • Periods: {1, 2, 4} captures multi-scale patterns effectively
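One convenient way to keep these knobs together is a small config object; the defaults mirror the values listed above, but the field names and structure are mine, not the official implementation's.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RAFTConfig:
    """Hyperparameters discussed above (field names are illustrative)."""
    lookback: int = 512                  # look-back window L; larger generally helps
    horizon: int = 96                    # forecast length
    num_retrievals: int = 10             # m: typically 1, 5, 10, or 20, dataset dependent
    temperature: float = 0.1             # tau for the softmax over similarity scores
    periods: List[int] = field(default_factory=lambda: [1, 2, 4])

cfg = RAFTConfig(num_retrievals=5)
print(cfg)
```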

The Broader Implication: RAG for Numbers

RAFT represents a philosophical shift in time series modeling, paralleling the Retrieval-Augmented Generation (RAG) revolution in NLP. Just as LLMs benefit from retrieving relevant documents rather than hallucinating facts, time series models benefit from retrieving relevant historical patterns rather than extrapolating from limited learned representations.

This approach addresses fundamental limitations:
  • Catastrophic forgetting → Historical patterns remain accessible rather than fading from the weights
  • Distribution shift → Retrieval adapts to reappearing patterns without retraining
  • Data efficiency → Rare patterns don't require massive datasets to learn

Looking Forward

The authors identify exciting directions for future work:
  • Advanced similarity metrics: Moving beyond Pearson correlation to learned similarity functions
  • Selective retrieval: Determining when retrieval helps versus when learned patterns suffice
  • Nonlinear dependencies: Capturing complex similarity measures for non-stationary characteristics
  • External data integration: Retrieving from related time series or auxiliary features

Takeaways for Practitioners

  • Don't just build bigger models: consider what information should be learned versus retrieved
  • Rare patterns are retrievable: if your domain has infrequent but important events, retrieval augmentation is compelling
  • Simplicity scales: RAFT's MLP backbone trains faster and infers more cheaply than Transformer alternatives
  • Retrieval is complementary: it can enhance existing models without architectural overhauls

Code & Resources

The official implementation is available at: https://github.com/archon159/RAFT

For those working with time series in production, RAFT offers a pragmatic path forward: augment, don't just memorize. What patterns in your data might be too rare to learn but perfect for retrieval?

About the Research: Sungwon Han, Seungeon Lee, Meeyoung Cha, Sercan Ö. Arik, and Jinsung Yoon. "Retrieval Augmented Time Series Forecasting." ICML 2025.
