Accepted at ICML 2026 Seoul, South Korea

QuITE: Query-Based Irregular Time Series Embedding

Adapt any state-of-the-art MTS model to irregular time series — with no architecture change.

JungHoon Lim 1
1 SK Shieldus, Seongnam, Republic of Korea
The idea

The bottleneck in irregular time series
isn't the backbone —
it's the input embedding.

Modern multivariate time series models silently assume uniform sampling at their input layer. QuITE replaces that layer with a tiny, plug-and-play module — and the rest of the model just works on irregular data.

Problem

Existing methods trade off

Specialized IMTS architectures lose backbone flexibility. Interpolation introduces artificial values that distort dynamics.

Idea

Learnable query tokens

A single self-attention layer lets a small set of queries aggregate irregular observations into a fixed-shape embedding — like [CLS] in BERT.

Result

Drop-in gain on every backbone

Average relative gains of up to 54.7% in forecasting MSE and 15.8% in classification, often with fewer parameters than the original backbone.

Full abstract

Irregular Multivariate Time Series (IMTS) are common in practice, yet their irregular sampling complicates effective modeling. Existing approaches typically either (i) design specialized architectures that limit the reuse of proven MTS models, or (ii) map IMTS onto regular temporal grids through interpolation, which may distort temporal dynamics by introducing artificial values. To address these limitations, we propose a new input-embedding-based approach. We identify that the key bottleneck lies not in the backbone architecture, but in conventional embedding layers that assume uniform sampling. We introduce QuITE, a simple yet effective plug-and-play embedding module for IMTS that employs learnable query tokens to aggregate irregular observations through a single self-attention layer, directly producing backbone-compatible latent representations without artificial value generation or architectural modification. Extensive experiments show that QuITE consistently improves MTS models, yielding average relative gains of up to 54.7% in forecasting and 15.8% in classification across diverse datasets and backbone architectures.

Why a new approach?

Two existing camps, each with a cost.

Prior work for irregular time series falls cleanly into two groups — neither preserves both backbone flexibility and faithful temporal dynamics.

Table 1 · Comparison of approaches
Only the input-embedding-based approach satisfies both criteria.

| Approach | No artificial value | Model flexibility |
|---|---|---|
| Architecture-based | ✓ | ✗ |
| Data-based | ✗ | ✓ |
| Input-embedding-based (Ours) | ✓ | ✓ |
Approach 01 · Architecture

Build a new model for IMTS

RNNs with decay gates, neural ODEs, graph networks designed from scratch for irregular data.

No artificial values
Cannot reuse proven MTS models
GRU-D · Latent-ODE · Raindrop · Hi-Patch
Approach 02 · Data

Resample to a regular grid

Interpolate IMTS to a uniform timeline at the raw-data or representation level, then use a standard MTS model.

Backbone flexibility
Distorts dynamics with fake values
IP-Nets · mTAND
Ours
Approach 03 · Embedding

Replace the input embedding

Keep the backbone untouched; rewrite only the embedding layer with learnable query tokens that aggregate irregular observations.

No artificial values
Full backbone flexibility
QuITE (this work)
Method

Tokenize. Then aggregate.

Two stages, one self-attention layer — that's all QuITE adds to the backbone.

Figure 1. QuITE framework — irregular observations are tokenized into (time, value) embeddings, then a small set of learnable query tokens aggregate them via masked self-attention.
A · Tokenization

Each observation becomes one token

Every value–time–mask triplet $(x_{n,i}, t_{n,i}, m_{n,i})$ is encoded as the sum of a harmonic time embedding and a learned value projection. No interpolation — the irregular set is preserved exactly.

$$\phi(t)[k] = \begin{cases} \omega_0 t + \alpha_0, & k = 0 \\ \sin(\omega_k t + \alpha_k), & k > 0 \end{cases}$$
$$\mathbf{z}_{n,i} = f_{\text{val}}(x_{n,i}) + \phi(t_{n,i})$$
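The two tokenization formulas can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes `f_val` is a plain linear map from a scalar to $\mathbb{R}^D$, and the names `time_embedding`, `tokenize`, `w_val`, and `b_val` are hypothetical. In a real model `omega`, `alpha`, and the value projection would be learned; here they are random.

```python
import numpy as np

def time_embedding(t, omega, alpha):
    """phi(t): linear term at index k = 0, sinusoids at indices k > 0."""
    t = np.asarray(t, dtype=float)[..., None]   # (L, 1)
    phase = omega * t + alpha                   # (L, D)
    emb = np.sin(phase)
    emb[..., 0] = phase[..., 0]                 # k = 0 stays linear in t
    return emb

def tokenize(x, t, w_val, b_val, omega, alpha):
    """z = f_val(x) + phi(t); f_val is ASSUMED to be linear here."""
    x = np.asarray(x, dtype=float)[..., None]   # (L, 1)
    return x * w_val + b_val + time_embedding(t, omega, alpha)

rng = np.random.default_rng(0)
D = 8
omega, alpha = rng.normal(size=D), rng.normal(size=D)
w_val, b_val = rng.normal(size=D), np.zeros(D)

# Four observations of one variable at irregular timestamps.
t = np.array([0.0, 0.13, 0.90, 2.40])
x = np.array([1.2, -0.4, 0.7, 0.1])
Z = tokenize(x, t, w_val, b_val, omega, alpha)
print(Z.shape)  # (4, 8): one token per observation, no resampling
```

The number of tokens equals the number of observations, so nothing is interpolated onto a grid.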
B · Aggregation

One learnable query summarizes them all

A learnable query token $\mathbf{q}_n$ is prepended to the observation tokens, and a single masked self-attention layer lets the query summarize all observed entries. Two flavors: variable-level (variate-token models) and patch-level (patch-token models).

$$\mathbf{H}_n = \text{SelfAttn}([\mathbf{q}_n;\mathbf{Z}_n],\ [1\,|\,\mathbf{m}_n])$$
$$\mathbf{e}_n = \mathbf{H}_n[0] \in \mathbb{R}^D \quad\Rightarrow\quad \mathbf{E}^{\text{var}}\in\mathbb{R}^{N\times D}$$
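As a rough sketch of the variable-level aggregation, the snippet below prepends a query to the observation tokens of each variable, runs one masked self-attention pass, and keeps only the output at the query position. It is deliberately simplified and makes several assumptions not stated above: a single head, no residual connection, layer norm, or output projection, and projection matrices shared across variables. The names `query_aggregate`, `Wq`, `Wk`, and `Wv` are illustrative.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def query_aggregate(Z, m, q, Wq, Wk, Wv):
    """Single-head masked self-attention over [q; Z]; returns H[0]."""
    tokens = np.concatenate([q[None, :], Z], axis=0)     # (1 + L, D)
    keep = np.concatenate([[True], m.astype(bool)])      # key mask [1 | m]
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])              # (1 + L, 1 + L)
    scores = np.where(keep[None, :], scores, -np.inf)    # hide unobserved keys
    H = softmax(scores, axis=-1) @ V
    return H[0]                                          # embedding e_n

rng = np.random.default_rng(0)
N, L, D = 3, 5, 8
Z = rng.normal(size=(N, L, D))                 # padded observation tokens
m = np.array([[1, 1, 0, 0, 0],
              [1, 1, 1, 1, 0],
              [0, 0, 0, 0, 0]])                # 1 = observed, 0 = padding
queries = rng.normal(size=(N, D))              # one query q_n per variable
Wq, Wk, Wv = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))

E_var = np.stack([query_aggregate(Z[n], m[n], queries[n], Wq, Wk, Wv)
                  for n in range(N)])
print(E_var.shape)  # (3, 8), i.e. N x D regardless of how many entries were observed
```

Because the query position is always unmasked, a variable with no observations still yields a well-defined embedding (the query attends only to itself), and the output shape stays fixed at $N \times D$.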
Extension

QuITE++ — a forecasting architecture built natively around queries

Hierarchical encoder interleaving patch- and variable-level attention, with a cross-attention decoder over future timestamps.

Figure 2. QuITE++ — patch-level intra-variable attention, variable-level inter-variable attention, and a cross-attention decoder.

Patch-level attention

Aggregates information across temporal patches within each variable.

Variable-level attention

Models cross-variable dependencies along the variate axis.

Cross-attention decoder

Future time embeddings query the encoder for arbitrary forecast horizons.
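To illustrate why a cross-attention decoder can handle arbitrary horizons, here is a small sketch in which embeddings of the requested future timestamps act as queries over the encoder output. Everything beyond that idea is an assumption made for illustration: the single head, the scalar linear head `w_out`, the shape of `memory`, and the function name `cross_attention_decode` are not taken from the paper.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_decode(t_future, memory, omega, alpha, Wq, Wk, Wv, w_out):
    """Future time embeddings (queries) attend over encoder states (keys/values)."""
    phase = omega * np.asarray(t_future, dtype=float)[:, None] + alpha
    queries = np.sin(phase)
    queries[:, 0] = phase[:, 0]                  # same phi(t) as in tokenization
    Q, K, V = queries @ Wq, memory @ Wk, memory @ Wv
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return (attn @ V) @ w_out                    # one prediction per timestamp

rng = np.random.default_rng(0)
D, M = 8, 5
memory = rng.normal(size=(M, D))                 # stand-in for encoder output
omega, alpha = rng.normal(size=D), rng.normal(size=D)
Wq, Wk, Wv = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))
w_out = rng.normal(size=D)

args = (memory, omega, alpha, Wq, Wk, Wv, w_out)
short = cross_attention_decode(np.array([1.0, 2.5]), *args)
long = cross_attention_decode(np.linspace(0.5, 6.0, 7), *args)
print(short.shape, long.shape)  # (2,) (7,)
```

Each future timestamp is decoded independently of the others, so the same trained weights can be queried at any number of irregularly spaced times.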

Experimental setup

Seven IMTS benchmarks across three domains.

Forecasting
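| Dataset | Samples | Vars | Avg L | Missing |
|---|---|---|---|---|
| Human Activity | 5,400 | 12 | 120 | 75.0% |
| USHCN | 26,736 | 5 | 163 | 77.9% |
| PhysioNet | 12,000 | 36 | 74 | 88.4% |
| MIMIC-III | 23,457 | 96 | 46 | 96.7% |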
Classification
| Dataset | Samples | Vars | Classes | Missing |
|---|---|---|---|---|
| P19 | 38,803 | 34 | 2 | 94.9% |
| P12 | 11,988 | 36 | 2 | 88.4% |
| PAM | 5,333 | 17 | 8 | 60.0% |
5 random seeds · NVIDIA RTX A6000 · hidden dim = 64 for QuITE-equipped models
Headline results

Every backbone gets better.
Every benchmark, every metric.

Six representative MTS backbones (PatchTST, PatchMixer, TMix, iTransformer, S-Mamba, TimeXer) across four forecasting and three classification benchmarks.

Relative improvement per backbone after adding QuITE. Forecasting is measured by MSE (lower is better, ↓ up to 54.7%); classification by score (higher is better, ↑ up to 15.8%).

| Backbone | Forecasting MSE improvement | Classification gain |
|---|---|---|
| iTransformer | ↓ 54.7% | ↑ 5.3% |
| S-Mamba | ↓ 25.0% | ↑ 6.5% |
| PatchTST | ↓ 16.2% | ↑ 6.3% |
| TimeXer | ↓ 12.0% | ↑ 5.3% |
| TMix | ↓ 11.4% | ↑ 13.8% |
| PatchMixer | ↓ 5.1% | ↑ 15.8% |
Forecasting comparison

All forecasting models, side by side.

Table 4 compares 11 IMTS-specific baselines, 6 QuITE-equipped MTS backbones, and QuITE++ across 4 datasets × 3 horizons × 2 metrics (MSE, MAE) = 24 settings. QuITE++ achieves the best result in 20 of the 24.

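Human Activity · history → horizon in ms

| Model | 3000→1000 MSE | 3000→1000 MAE | 2000→2000 MSE | 2000→2000 MAE | 1000→3000 MSE | 1000→3000 MAE |
|---|---|---|---|---|---|---|
| IMTS-specific baselines | | | | | | |
| Warpformer | 2.61 | 3.12 | 3.60 | 3.81 | 4.26 | 4.26 |
| Raindrop | 4.42 | 4.65 | 5.57 | 5.15 | 5.75 | 5.37 |
| GRU-D | 3.94 | 4.37 | 5.93 | 5.66 | 6.14 | 5.75 |
| tPatchGNN | 2.79 | 3.24 | 3.71 | 3.89 | 4.56 | 4.32 |
| GraFITi | 3.03 | 3.45 | 4.59 | 4.45 | 4.91 | 4.62 |
| CRU | 3.03 | 3.60 | 4.12 | 4.43 | 4.85 | 4.86 |
| mTAND | 3.14 | 3.71 | 4.38 | 4.59 | 5.29 | 5.12 |
| NeuralFlow | 4.29 | 4.61 | 5.47 | 5.35 | 6.01 | 5.66 |
| Latent-ODE | 3.32 | 3.91 | 5.04 | 5.11 | 5.48 | 5.33 |
| HyperIMTS | 2.49 | 3.02 | 3.15 | 3.58 | 4.00 | 4.13 |
| Hi-Patch | 2.56 | 3.12 | 3.26 | 3.67 | 4.20 | 4.22 |
| MTS + QuITE | | | | | | |
| PatchTST + QuITE | 2.76 | 3.14 | 3.62 | 3.74 | 4.69 | 4.29 |
| PatchMixer + QuITE | 2.78 | 3.13 | 3.67 | 3.75 | 4.71 | 4.31 |
| TMix + QuITE | 2.77 | 3.15 | 3.66 | 3.74 | 4.75 | 4.38 |
| iTransformer + QuITE | 2.58 | 3.12 | 3.25 | 3.65 | 4.10 | 4.18 |
| S-Mamba + QuITE | 2.72 | 3.24 | 3.37 | 3.72 | 4.22 | 4.22 |
| TimeXer + QuITE | 2.53 | 3.04 | 3.19 | 3.57 | 4.04 | 4.03 |
| QuITE++ | 2.46 | 2.92 | 3.11 | 3.49 | 3.96 | 4.04 |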
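USHCN · history → horizon in months

| Model | 24→1 MSE | 24→1 MAE | 24→6 MSE | 24→6 MAE | 24→12 MSE | 24→12 MAE |
|---|---|---|---|---|---|---|
| IMTS-specific baselines | | | | | | |
| Warpformer | 5.09 | 3.10 | 5.12 | 3.13 | 5.10 | 3.13 |
| Raindrop | 5.64 | 3.29 | 7.01 | 4.24 | 7.61 | 4.61 |
| GRU-D | 5.17 | 3.21 | 5.29 | 3.34 | 5.36 | 3.25 |
| tPatchGNN | 5.00 | 3.07 | 5.23 | 3.24 | 6.23 | 3.83 |
| GraFITi | 5.07 | 2.97 | 5.12 | 3.09 | 5.01 | 3.14 |
| CRU | 5.15 | 3.18 | 6.77 | 4.11 | 6.64 | 4.08 |
| mTAND | 5.03 | 3.00 | 5.16 | 3.10 | 5.07 | 3.09 |
| NeuralFlow | 5.41 | 3.35 | 5.52 | 3.46 | 5.48 | 3.56 |
| Latent-ODE | 5.16 | 3.21 | 5.18 | 3.36 | 5.23 | 3.35 |
| HyperIMTS | 4.96 | 3.00 | 4.99 | 3.10 | 4.97 | 3.08 |
| Hi-Patch | 5.00 | 3.03 | 5.13 | 3.05 | 5.04 | 3.03 |
| MTS + QuITE | | | | | | |
| PatchTST + QuITE | 5.07 | 3.01 | 5.06 | 3.17 | 5.04 | 3.00 |
| PatchMixer + QuITE | 5.11 | 3.06 | 5.02 | 3.04 | 5.00 | 3.07 |
| TMix + QuITE | 5.18 | 3.10 | 5.52 | 3.39 | 5.03 | 3.05 |
| iTransformer + QuITE | 5.06 | 3.06 | 4.86 | 2.96 | 4.94 | 3.01 |
| S-Mamba + QuITE | 5.04 | 3.06 | 4.93 | 3.04 | 4.93 | 3.02 |
| TimeXer + QuITE | 4.97 | 2.97 | 4.98 | 3.05 | 4.93 | 2.97 |
| QuITE++ | 4.84 | 2.92 | 4.81 | 2.94 | 4.81 | 2.93 |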
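PhysioNet · history → horizon in hours

| Model | 12→36 MSE | 12→36 MAE | 24→24 MSE | 24→24 MAE | 36→12 MSE | 36→12 MAE |
|---|---|---|---|---|---|---|
| IMTS-specific baselines | | | | | | |
| Warpformer | 6.51 | 4.24 | 5.04 | 3.72 | 4.17 | 3.38 |
| Raindrop | 10.24 | 5.83 | 10.63 | 6.02 | 10.67 | 5.87 |
| GRU-D | 7.80 | 5.13 | 5.76 | 4.53 | 6.85 | 4.88 |
| tPatchGNN | 6.45 | 4.24 | 5.06 | 3.75 | 4.22 | 3.38 |
| GraFITi | 6.30 | 4.38 | 5.11 | 3.96 | 4.58 | 3.65 |
| CRU | 7.66 | 4.97 | 6.43 | 4.51 | 6.74 | 4.82 |
| mTAND | 7.46 | 4.85 | 6.18 | 4.44 | 5.61 | 4.15 |
| NeuralFlow | 7.98 | 5.08 | 7.68 | 4.84 | 8.87 | 5.43 |
| Latent-ODE | 7.28 | 4.83 | 6.85 | 4.77 | 6.99 | 4.74 |
| HyperIMTS | 6.11 | 4.16 | 4.65 | 3.56 | 3.99 | 3.21 |
| Hi-Patch | 6.39 | 4.10 | 5.07 | 3.63 | 4.27 | 3.30 |
| MTS + QuITE | | | | | | |
| PatchTST + QuITE | 17.47 | 7.15 | 10.62 | 5.17 | 8.87 | 4.43 |
| PatchMixer + QuITE | 17.52 | 7.22 | 11.88 | 5.62 | 9.06 | 4.55 |
| TMix + QuITE | 17.48 | 7.21 | 10.72 | 5.22 | 8.96 | 4.50 |
| iTransformer + QuITE | 6.32 | 4.15 | 4.99 | 3.65 | 4.33 | 3.34 |
| S-Mamba + QuITE | 6.26 | 4.11 | 5.11 | 3.67 | 4.11 | 3.27 |
| TimeXer + QuITE | 6.18 | 4.08 | 4.91 | 3.64 | 4.06 | 3.27 |
| QuITE++ | 6.08 | 3.99 | 4.99 | 3.62 | 3.81 | 3.18 |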
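MIMIC-III · history → horizon in hours

| Model | 12→36 MSE | 12→36 MAE | 24→24 MSE | 24→24 MAE | 36→12 MSE | 36→12 MAE |
|---|---|---|---|---|---|---|
| IMTS-specific baselines | | | | | | |
| Warpformer | 2.32 | 8.14 | 1.76 | 7.27 | 1.45 | 6.74 |
| Raindrop | 2.36 | 8.63 | 2.31 | 8.61 | 2.21 | 9.17 |
| GRU-D | 2.39 | 8.43 | 2.35 | 8.34 | 2.03 | 8.14 |
| tPatchGNN | 2.35 | 8.23 | 1.97 | 7.76 | 1.44 | 6.78 |
| GraFITi | 2.22 | 8.13 | 1.76 | 7.28 | 1.61 | 7.16 |
| CRU | 2.34 | 8.32 | 2.23 | 7.99 | 2.00 | 8.16 |
| mTAND | 2.29 | 8.38 | 2.15 | 8.00 | 2.01 | 8.13 |
| NeuralFlow | 2.26 | 8.29 | 2.34 | 8.09 | 1.97 | 8.39 |
| Latent-ODE | 2.38 | 8.35 | 2.11 | 7.76 | 1.90 | 7.92 |
| HyperIMTS | 1.85 | 7.71 | 1.68 | 6.92 | 1.52 | 6.68 |
| Hi-Patch | 1.88 | 7.95 | 1.70 | 7.18 | 1.56 | 6.74 |
| MTS + QuITE | | | | | | |
| PatchTST + QuITE | 5.37 | 17.01 | 3.90 | 12.35 | 3.82 | 11.84 |
| PatchMixer + QuITE | 5.37 | 16.96 | 3.94 | 12.34 | 3.87 | 11.51 |
| TMix + QuITE | 5.41 | 16.94 | 3.98 | 12.32 | 3.87 | 11.63 |
| iTransformer + QuITE | 1.83 | 7.64 | 1.67 | 6.93 | 1.56 | 6.78 |
| S-Mamba + QuITE | 1.82 | 7.56 | 1.64 | 6.90 | 1.52 | 6.64 |
| TimeXer + QuITE | 1.84 | 7.67 | 1.68 | 7.12 | 1.55 | 6.73 |
| QuITE++ | 1.80 | 7.54 | 1.63 | 6.83 | 1.48 | 6.56 |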
Ablation · Embedding methods

What if we replace QuITE with something simpler?

Compared against four standard input adaptations — Add, Concat, mTAND interpolation, Mean Pool — on PatchTST, iTransformer, and QuITE++ (Table 5).

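PatchTST

| Method | Metric | Activity | USHCN | PhysioNet | MIMIC-III |
|---|---|---|---|---|---|
| Add | MSE | 4.00 | 5.23 | 13.79 | 4.71 |
| Add | MAE | 4.03 | 3.17 | 6.54 | 14.74 |
| Concat | MSE | 3.90 | 5.21 | 12.97 | 4.52 |
| Concat | MAE | 3.91 | 3.18 | 5.96 | 14.13 |
| mTAND | MSE | 3.74 | 5.21 | 13.38 | 4.39 |
| mTAND | MAE | 3.76 | 3.20 | 6.03 | 13.80 |
| Mean Pool | MSE | 3.75 | 5.14 | 12.84 | 4.43 |
| Mean Pool | MAE | 3.77 | 3.19 | 5.95 | 13.84 |
| QuITE | MSE | 3.69 | 5.06 | 12.32 | 4.36 |
| QuITE | MAE | 3.72 | 3.06 | 5.58 | 13.73 |

iTransformer

| Method | Metric | Activity | USHCN | PhysioNet | MIMIC-III |
|---|---|---|---|---|---|
| Add | MSE | 4.98 | 6.26 | 18.26 | 6.34 |
| Add | MAE | 4.84 | 3.80 | 8.01 | 19.73 |
| Concat | MSE | 5.77 | 6.10 | 18.27 | 6.48 |
| Concat | MAE | 5.21 | 3.67 | 8.01 | 20.28 |
| mTAND | MSE | 3.50 | 5.23 | 13.11 | 4.34 |
| mTAND | MAE | 3.65 | 3.15 | 5.96 | 13.72 |
| Mean Pool | MSE | 3.59 | 5.08 | 12.15 | 4.33 |
| Mean Pool | MAE | 3.72 | 3.11 | 5.70 | 13.61 |
| QuITE | MSE | 3.31 | 4.95 | 5.21 | 1.69 |
| QuITE | MAE | 3.65 | 3.01 | 3.71 | 7.12 |

QuITE++

| Method | Metric | Activity | USHCN | PhysioNet | MIMIC-III |
|---|---|---|---|---|---|
| Add | MSE | 3.44 | 5.05 | 5.34 | 1.71 |
| Add | MAE | 3.76 | 3.04 | 3.86 | 7.31 |
| Concat | MSE | 3.35 | 4.99 | 5.43 | 1.75 |
| Concat | MAE | 3.71 | 3.02 | 3.90 | 7.28 |
| mTAND | MSE | 3.34 | 5.01 | 4.96 | 1.71 |
| mTAND | MAE | 3.64 | 3.07 | 3.60 | 7.17 |
| Mean Pool | MSE | 3.31 | 4.94 | 5.26 | 1.69 |
| Mean Pool | MAE | 3.64 | 3.01 | 3.89 | 7.23 |
| QuITE | MSE | 3.18 | 4.82 | 4.96 | 1.64 |
| QuITE | MAE | 3.48 | 2.93 | 3.60 | 6.98 |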
QuITE MSE wins out of 12 settings: vs. Add 12 / 12 · vs. Concat 12 / 12 · vs. mTAND 11 / 12 · vs. Mean Pool 12 / 12
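The win counts can be checked directly from the MSE rows of Table 5. The short script below copies those twelve values per method (PatchTST, iTransformer, QuITE++, each on Activity, USHCN, PhysioNet, MIMIC-III, in that order) and counts strict wins. Treating a tie as "not a win" is an assumption, but it is the reading that matches the reported 11 / 12 against mTAND.

```python
# MSE values copied from Table 5, ordered as
# PatchTST x 4 datasets, iTransformer x 4 datasets, QuITE++ x 4 datasets.
quite_mse = [3.69, 5.06, 12.32, 4.36, 3.31, 4.95, 5.21, 1.69, 3.18, 4.82, 4.96, 1.64]
baseline_mse = {
    "Add":       [4.00, 5.23, 13.79, 4.71, 4.98, 6.26, 18.26, 6.34, 3.44, 5.05, 5.34, 1.71],
    "Concat":    [3.90, 5.21, 12.97, 4.52, 5.77, 6.10, 18.27, 6.48, 3.35, 4.99, 5.43, 1.75],
    "mTAND":     [3.74, 5.21, 13.38, 4.39, 3.50, 5.23, 13.11, 4.34, 3.34, 5.01, 4.96, 1.71],
    "Mean Pool": [3.75, 5.14, 12.84, 4.43, 3.59, 5.08, 12.15, 4.33, 3.31, 4.94, 5.26, 1.69],
}

# A "win" means QuITE has strictly lower MSE than the baseline in that setting.
wins = {name: sum(q < b for q, b in zip(quite_mse, values))
        for name, values in baseline_mse.items()}
print(wins)  # {'Add': 12, 'Concat': 12, 'mTAND': 11, 'Mean Pool': 12}
```

The single non-win is the tie with mTAND on PhysioNet under QuITE++ (4.96 vs. 4.96).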
Analysis
A · Embedding quality

Tighter, more separable clusters

t-SNE projection of learned embeddings on PAM (8 activity classes). Across patch-, variable-, and hybrid-token backbones, QuITE produces more compact and clearly separated clusters.

Figure 4 (paper) · t-SNE visualization on PAM. Panels: PatchTST (patch-token), iTransformer (variable-token), and TimeXer (hybrid), each shown without (w/o) and with (w/) QuITE.
Figure 3. Robustness to sparsity: additional random observation removal of 0–75% on top of already-sparse benchmarks.
B · Observation sparsity

Stable up to 50% extra missingness

On benchmarks that are already 75–97% sparse, accuracy degrades only marginally up to 50% additional removal. Even at 75% removal, the model still produces usable predictions on most datasets.
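One way to set up this kind of stress test is to switch off a random fraction of the entries that are still observed in the mask. The helper below is a sketch under assumptions: the name `drop_observations` is hypothetical, and whether the removal is drawn per sample, per variable, or globally is not specified here, so this version simply draws over the whole mask.

```python
import numpy as np

def drop_observations(mask, frac, rng):
    """Turn off a random `frac` of the currently observed entries."""
    mask = np.asarray(mask, dtype=bool)
    observed = np.flatnonzero(mask)                 # flat indices of observed entries
    n_drop = int(round(frac * observed.size))
    dropped = rng.choice(observed, size=n_drop, replace=False)
    out = mask.copy().ravel()
    out[dropped] = False                            # only removes, never adds
    return out.reshape(mask.shape)

rng = np.random.default_rng(0)
mask = rng.random((4, 50)) < 0.1                    # roughly 90% missing to start with
for frac in (0.0, 0.25, 0.5, 0.75):
    sparser = drop_observations(mask, frac, rng)
    print(frac, int(mask.sum()), int(sparser.sum()))
```

Since QuITE consumes only observed entries through the attention mask, the thinned mask can be passed to the model as-is, with no imputation step.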

Figure 5. MSE across hyperparameter sweeps (hidden dim 32, 64; layers 1, 2, 3; heads 1, 2, 4, 8) on Activity, USHCN, PhysioNet, and MIMIC-III. All sensitivity curves are essentially flat.
C · Hyperparameter sensitivity

Robust to hyperparameter choices

Grid sweep over hidden dim ∈ {32, 64}, layers ∈ {1, 2, 3}, heads ∈ {1, 2, 4, 8} on the four forecasting benchmarks. QuITE++ remains generally robust across different hyperparameter choices.
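For reference, the sweep values above can be enumerated as a full Cartesian grid. The key names are illustrative, and whether every combination or only one-at-a-time variations were actually run is not stated here, so treat the full product as an assumption.

```python
from itertools import product

grid = {
    "hidden_dim": [32, 64],
    "layers": [1, 2, 3],
    "heads": [1, 2, 4, 8],
}

# One dict per configuration in the full grid.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))  # 24
print(configs[0])    # {'hidden_dim': 32, 'layers': 1, 'heads': 1}
```

Every listed hidden dimension is divisible by every listed head count, so all 24 combinations are valid multi-head attention settings.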

D · Query initialization

Robust to initialization

QuITE remains robust across initialization schemes (Xavier / Uniform / Zero / Random), with only minor performance differences.

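| Dataset | Metric | Xavier | Uniform | Zero | Random |
|---|---|---|---|---|---|
| Activity 3000→1000 | MSE | 2.46 | 2.45 | 2.46 | 2.46 |
| Activity 3000→1000 | MAE | 3.00 | 2.99 | 3.01 | 2.92 |
| USHCN 24→12 | MSE | 4.83 | 4.86 | 4.87 | 4.81 |
| USHCN 24→12 | MAE | 2.99 | 2.95 | 2.98 | 2.93 |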
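The four schemes might look like the following for a single query vector of dimension D. The exact distributions and ranges used in the experiments are not given here, so every constant below is an assumption chosen from common defaults: Glorot-style uniform bounds for "Xavier", a small symmetric range for "Uniform", and a standard normal for "Random". The function name `init_query` is hypothetical.

```python
import numpy as np

def init_query(scheme, D, rng):
    """Return an initial query vector of shape (D,) for the named scheme."""
    if scheme == "zero":
        return np.zeros(D)
    if scheme == "uniform":
        return rng.uniform(-0.1, 0.1, size=D)      # ASSUMED range
    if scheme == "random":
        return rng.normal(0.0, 1.0, size=D)        # ASSUMED: standard normal
    if scheme == "xavier":
        bound = np.sqrt(6.0 / (1 + D))             # Glorot uniform for a (1, D) tensor
        return rng.uniform(-bound, bound, size=D)
    raise ValueError(f"unknown scheme: {scheme}")

rng = np.random.default_rng(0)
D = 64
for scheme in ("xavier", "uniform", "zero", "random"):
    q = init_query(scheme, D, rng)
    print(scheme, q.shape, float(np.abs(q).max()))
```

Even the zero-initialized query is usable as a starting point: the observation tokens are projected separately, so attention outputs and gradients are not identically zero.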
E · Computational efficiency

Favorable accuracy–complexity trade-off

Computationally, QuITE-equipped models often use fewer parameters and generally require fewer FLOPs than their base counterparts, indicating a favorable accuracy–complexity trade-off.

PhysioNet 24h → 24h · excerpted from Tables E.1–E.2
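| Backbone | Params | FLOPs | MSE |
|---|---|---|---|
| PatchTST | 1.73M | 75.2G | 15.28 |
| + QuITE | 127K | 10.6G | 10.62 |
| iTransformer | 1.77M | 107G | 16.48 |
| + QuITE | 129K | 13.8G | 4.99 |
| S-Mamba | 2.52M | 63.2G | 6.93 |
| + QuITE | 190K | 8.5G | 5.11 |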
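The trade-off in this excerpt can be made concrete with a little arithmetic: relative MSE reduction is (base − QuITE) / base, and the size ratios are base / QuITE. The script below applies that to the three rows above. These are single-setting numbers (PhysioNet 24h → 24h), so they need not match the averaged headline gains reported elsewhere on this page.

```python
# (params, FLOPs, MSE) copied from the PhysioNet 24h -> 24h excerpt.
rows = {
    "PatchTST":     {"base": (1.73e6, 75.2e9, 15.28), "quite": (127e3, 10.6e9, 10.62)},
    "iTransformer": {"base": (1.77e6, 107e9, 16.48),  "quite": (129e3, 13.8e9, 4.99)},
    "S-Mamba":      {"base": (2.52e6, 63.2e9, 6.93),  "quite": (190e3, 8.5e9, 5.11)},
}

def relative_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (before - after) / before

summary = {}
for name, r in rows.items():
    (p0, f0, m0), (p1, f1, m1) = r["base"], r["quite"]
    summary[name] = (relative_reduction(m0, m1), p0 / p1, f0 / f1)
    mse_drop, param_ratio, flop_ratio = summary[name]
    print(f"{name}: MSE -{mse_drop:.1f}%, "
          f"params {param_ratio:.1f}x smaller, FLOPs {flop_ratio:.1f}x smaller")
```

In all three rows the QuITE-equipped model has over 13 times fewer parameters and over 7 times fewer FLOPs, while MSE drops by roughly 26% to 70%.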
Conclusion

Bridging irregular data and proven backbones.

QuITE offers a powerful and flexible input-embedding module that bridges the gap between irregular time series data and existing, validated MTS backbones — enabling their effective application to challenging IMTS tasks without architectural modifications or artificial value generation. By aggregating irregular observations through learnable query tokens at the input stage, QuITE preserves backbone flexibility while avoiding the distortion of interpolation. Built on the same principle, QuITE++ extends this idea into a full hierarchical forecasting architecture and achieves the best result in 20 out of 24 settings across diverse IMTS benchmarks.

Cite this work

BibTeX

@inproceedings{lim2026quite,
  title     = {QuITE: Query-Based Irregular Time Series Embedding},
  author    = {Lim, JungHoon},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year      = {2026}
}
Correspondence: junghoon9502@gmail.com
Acknowledgements

We thank Daheen Kim and Seunghan Lee for insightful discussions and feedback. We are also grateful to Prof. Changhee Lee, Dr. Jaeho Kim, Seongjun Lee, and Seokhyun Lee of Korea University, and Prof. Kyungwoo Song of Yonsei University. Finally, we thank the anonymous ICML reviewers for their constructive comments.