LumiVid: HDR Video Generation via Latent Alignment with Logarithmic Encoding

Naomi Ken Korem¹   Mohamed Oumoumad²   Harel Cain¹   Matan Ben Yosef¹
Urska Jelercic¹   Ofir Bibi¹   Yaron Inger¹   Or Patashnik³   Daniel Cohen-Or³
¹Lightricks    ²Gear Productions    ³Tel Aviv University

Interactive Exposure Comparison

Drag the exposure slider to reveal how our HDR output preserves detail across the full dynamic range, while the SDR input clips to white or black. Use the arrow keys or click on EV ticks for fine control.

[Interactive viewer: SDR Input vs. HDR Output (Ours), with an exposure slider spanning EV −4 to EV +4]
Negative EV reveals highlight detail · Positive EV reveals shadow detail

Before / After

Drag the divider to compare the SDR input with our HDR-graded output. The HDR version recovers highlight and shadow detail that is permanently lost in the SDR input.

Airport — Sunrise: SDR vs. HDR (color graded)
Sunset — Golden Hour: SDR vs. HDR (color graded)

Side-by-Side Results

Full video comparisons showing SDR input alongside our HDR output, tone-mapped to reveal the extended dynamic range.

Each comparison shows the SDR input on top and our HDR output on the bottom:
Carousel — Night Glow · Airport Silhouettes — Sunset · Boy — Cozy Room · Dandelion Field — Sunset

Abstract

High dynamic range (HDR) imagery offers a rich and faithful representation of scene radiance, but remains challenging for generative models due to its mismatch with the bounded, perceptually compressed data on which these models are trained. In this work, we show that HDR generation can be achieved in a much simpler way by leveraging the strong visual priors already captured by pretrained generative models. We observe that a logarithmic encoding widely used in cinematic pipelines maps HDR imagery into a distribution that is naturally aligned with the latent space of these models, enabling direct adaptation via lightweight fine-tuning without retraining an encoder. To recover details that are not directly observable in the input, we further introduce a training strategy based on camera-mimicking degradations that encourages the model to infer missing high dynamic range content from its learned priors. Combining these insights, we demonstrate high-quality HDR video generation using a pretrained video model with minimal adaptation, achieving strong results across diverse scenes and challenging lighting conditions.
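The logarithmic encoding referenced above can be illustrated with the ARRI LogC3 curve (used here with the published EI 800 constants; the paper's exact parameterization is not specified, so treat these values as an assumption). The key property is that scene-linear radiance, which can span several stops above diffuse white, is compressed into roughly the [0, 1] range that SDR-trained models expect:

```python
import numpy as np

# ARRI LogC3 (EI 800) constants — assumed to match the paper's LogC3 variant.
CUT, A, B, C, D, E, F = 0.010591, 5.555556, 0.052272, 0.247190, 0.385537, 5.367655, 0.092809

def logc3_encode(x):
    """Map scene-linear radiance to log-encoded values, mostly within [0, 1].

    Above CUT the curve is logarithmic; below it, a linear toe is used.
    """
    x = np.asarray(x, dtype=np.float32)
    log_part = C * np.log10(A * x + B) + D
    lin_part = E * x + F
    return np.where(x > CUT, log_part, lin_part)

# 18% gray lands near mid-range, much like an SDR signal:
mid_gray = logc3_encode(0.18)   # ≈ 0.391
# A highlight at 8x diffuse white still fits inside the encoded range:
highlight = logc3_encode(8.0)   # ≈ 0.793
```

This is what "naturally aligned with the latent space" means in practice: the encoded HDR frames look statistically like the bounded, perceptually compressed images the VAE was trained on, so the frozen VAE can be reused as-is.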

LumiVid training pipeline
Training. Scene-linear HDR frames are compressed via LogC3 and encoded by the frozen VAE to produce target latents. The same HDR frames are tonemapped to SDR and degraded (MP4 compression, blur, contrast perturbation) to produce reference latents. Both are concatenated and fed to the DiT with LoRA adapters; only the LoRA weights are trained, using a flow-matching loss.
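The camera-mimicking degradations can be sketched as follows. This is a hypothetical, dependency-free stand-in: the parameter ranges are assumptions, Gaussian blur is approximated by a separable box blur, and the MP4 codec step is stubbed out since it requires an external encoder such as ffmpeg:

```python
import numpy as np

def degrade_sdr(frame, rng):
    """Apply camera-mimicking degradations to an SDR frame in [0, 1].

    Sketch of the degradations named in the paper (MP4 compression,
    blur, contrast); ranges and kernel size are illustrative assumptions.
    """
    # Random contrast perturbation around mid-gray (assumed range).
    c = rng.uniform(0.8, 1.2)
    frame = np.clip((frame - 0.5) * c + 0.5, 0.0, 1.0)

    # Separable 3-tap box blur as a stand-in for Gaussian blur.
    kernel = np.ones(3) / 3.0
    for axis in (0, 1):
        frame = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="same"), axis, frame)

    # MP4 compression would be applied here via a codec round-trip
    # (e.g. ffmpeg encode/decode); omitted to keep the sketch standalone.
    return frame

rng = np.random.default_rng(0)
sdr = np.linspace(0.0, 1.0, 64, dtype=np.float32).reshape(8, 8)
degraded = degrade_sdr(sdr, rng)
```

The point of these degradations is that the model cannot simply copy the input: clipped, blurred, and compressed regions force it to hallucinate plausible high-dynamic-range content from its learned priors.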
LumiVid inference pipeline
Inference. An SDR video is VAE-encoded, concatenated with noise, and denoised by the DiT+LoRA. The output latents are VAE-decoded and decompressed via the inverse LogC3 transfer (LogC3⁻¹) to produce a scene-linear float16 EXR. The VAE and DiT remain frozen; only the LoRA adapters (<1% of parameters) are trained.
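The final decompression step inverts the log encoding to recover scene-linear radiance before the float16 EXR write. A minimal sketch, again assuming the ARRI LogC3 EI 800 constants (the EXR writer itself is omitted):

```python
import numpy as np

# ARRI LogC3 (EI 800) constants — assumed to match the paper's LogC3 variant.
CUT, A, B, C, D, E, F = 0.010591, 5.555556, 0.052272, 0.247190, 0.385537, 5.367655, 0.092809

def logc3_decode(t):
    """Invert LogC3: log-encoded values -> scene-linear radiance.

    Mirrors the encoder: the log branch applies above the encoded
    breakpoint E*CUT + F, and the linear toe applies below it.
    """
    t = np.asarray(t, dtype=np.float32)
    log_branch = (np.power(10.0, (t - D) / C) - B) / A
    lin_branch = (t - F) / E
    return np.where(t > E * CUT + F, log_branch, lin_branch)

# Decoded frames are cast to float16 before writing EXR (writer omitted):
linear = logc3_decode(np.array([0.391, 0.793])).astype(np.float16)
# linear[0] ≈ 0.18 (mid-gray), linear[1] ≈ 8.0 (8x diffuse white)
```

Casting to float16 matches the half-float EXR convention; since the values are scene-linear rather than perceptually compressed, the format's extended range carries the recovered highlights directly.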