Temporal Diffuser: Timing Scale-Aware Modulation for Sign Language Production

(a) Faculty of Information Technology, Ton Duc Thang University; (b) Department of Computer Engineering, Sejong University;
(c) Department of Software, Sejong University

Left: the sign sequence generated by SignSAM. Right: the ground truth.

Abstract

Recent advances in Sign Language Production (SLP) highlight denoising diffusion models as promising alternatives to traditional autoregressive methods. Most existing approaches follow a two-stage pipeline that encodes sign motion into discrete latent codes, often sacrificing space-time fidelity and requiring gloss annotations or complex codebooks. Transformer-based models aim to simplify this pipeline, but often produce overly smooth, unnatural motions. We introduce Sign Language Production with Scale-Aware Modulation (SignSAM), a novel single-stage, gloss-free SLP framework that synthesizes motion directly in continuous space, preserving fine temporal details. At its core is a Space-Time U-Net that learns compact temporal features by jointly downscaling the frame and sign feature dimensions, which reduces computational cost compared with a U-Net without a pyramid or a pyramid U-Net whose dimensions are not scaled consistently. To further enhance temporal precision, we propose a Timing Scale-Aware Modulation module that fuses multiple temporal resolutions for better motion coherence. Experiments on PHOENIX14T and How2Sign show that SignSAM achieves state-of-the-art (SOTA) fluency, accuracy, and naturalness, offering an efficient and expressive solution for SLP.
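
To make the idea of jointly downscaling the frame and sign feature dimensions concrete, the following is a minimal NumPy sketch, not the SignSAM implementation. It assumes a motion sequence stored as a (frames, features) array and uses plain average pooling in place of whatever learned layers the model actually uses. The function names `downscale_jointly` and `pyramid_shapes`, the pooling factor of 2, and the 64 x 128 input size are all hypothetical choices made for illustration.

```python
import numpy as np


def downscale_jointly(x: np.ndarray, factor: int = 2) -> np.ndarray:
    """Average-pool a (frames, features) array by `factor` along both axes.

    Illustrative stand-in only: average pooling is an assumption here,
    not the layer used in the paper.
    """
    t, d = x.shape
    t_out, d_out = t // factor, d // factor
    # Drop trailing rows/columns that do not fill a complete pooling window.
    x = x[: t_out * factor, : d_out * factor]
    # Group entries into factor-by-factor blocks and average each block.
    return x.reshape(t_out, factor, d_out, factor).mean(axis=(1, 3))


def pyramid_shapes(t: int, d: int, levels: int, factor: int = 2) -> list:
    """List the (frames, features) shape at each level of a joint pyramid."""
    shapes = [(t, d)]
    for _ in range(levels):
        t, d = t // factor, d // factor
        shapes.append((t, d))
    return shapes


# Hypothetical input: 64 frames, 128 features per frame.
motion = np.arange(64 * 128, dtype=np.float64).reshape(64, 128)
level1 = downscale_jointly(motion)
shapes = pyramid_shapes(64, 128, levels=3)
```

Under these assumptions, each level shrinks both axes together, so the number of entries a later layer has to process drops by the square of the pooling factor per level. Scaling only one axis would give a linear reduction instead.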

Qualitative evaluation

PHOENIX14T

Columns, left to right: GT (ground truth), Ours (SignSAM), SinMDM, MDM.
Nach einer kurzen wetterberuhigung erreicht uns morgen von westen ein neuer tiefausläufer.
(After a brief period of calm weather, a new low pressure system will reach us from the west tomorrow.)
Am freitag mal sonne mal wolken und nur einzelne schauer stellenweise zeigt sich die sonne auch für längere zeit.
(On Friday sometimes sun sometimes clouds and only isolated showers in places the sun will also appear for longer periods.)
Und so erwartet uns eine mischung aus teilweise zähen nebelfeldern wolken und sonnenschein.
(And so we can expect a mixture of partly thick fog fields, clouds and sunshine.)

How2Sign

Columns, left to right: GT (ground truth), Ours (SignSAM), SinMDM, MDM.
Now this here, although it looks like a guitar, it's still in the guitar family of instruments, this is called ukulele not technically a guitar.
Each has a unique feel, and there's no one particular one that's right for everyone, it's a highly personal choice.
So they get very confused or upset when their brand new piercing of three of four months, after the initial healing period, they're not cleaning it anymore and it should be healed enough, but they take it out overnight and it's gone.

BibTeX

@article{kha2026temporal,
  title={Temporal diffuser: Timing scale-aware modulation for sign language production},
  author={Kha, Kim-Thuy and Vo, Anh H and Le, Van-Vang and Song, Oh-Young and Kim, Yong-Guk},
  journal={Engineering Applications of Artificial Intelligence},
  volume={163},
  pages={112739},
  year={2026},
  publisher={Elsevier}
}