LumiVid: HDR Video Generation via Latent Alignment with Logarithmic Encoding

Naomi Ken Korem¹, Mohamed Oumoumad², Matan Ben Yosef¹, Amir Gam¹, Harel Cain¹,
Urska Jelercic¹, Ofir Bibi¹, Yaron Inger¹, Or Patashnik³, Daniel Cohen-Or³
¹Lightricks · ²Gear Productions · ³Tel Aviv University

Interactive Exposure Comparison

Drag the exposure slider to reveal how our HDR output preserves detail across the full dynamic range, while the SDR input clips to white or black. Use the arrow keys or click on EV ticks for fine control.

Interactive viewer: SDR Input vs. HDR Output (Ours), with an exposure slider from EV −3 to EV +3 (ticks at −3, −1.5, 0, +1.5, +3).
Negative EV reveals highlight detail · Positive EV reveals shadow detail
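
The slider's effect can be sketched in a few lines. This is a minimal illustration, not the page's actual viewer code: it assumes the viewer multiplies scene-linear values by 2^EV and then hard-clips to the displayable range [0, 1], with any display transfer curve omitted. The function name `preview` is hypothetical.

```python
def preview(linear_value: float, ev: float) -> float:
    """Scale a scene-linear value by 2**ev, then clip to the display range [0, 1].

    Assumption: a plain exposure gain followed by a hard clip. A real viewer
    would also apply a display transfer curve, which is omitted here.
    """
    exposed = linear_value * (2.0 ** ev)
    return min(max(exposed, 0.0), 1.0)


# A highlight four times brighter than display white, and a deep shadow.
highlight, shadow = 4.0, 0.01

highlight_at_ev0 = preview(highlight, 0.0)       # clipped to display white
highlight_at_minus3 = preview(highlight, -3.0)   # 4.0 / 8 = 0.5, detail visible
shadow_at_plus3 = preview(shadow, 3.0)           # 0.01 * 8 = 0.08, lifted
```

Under the same assumptions, an SDR pixel that already clipped at 1.0 simply becomes a flat 0.125 at EV −3, which is why lowering exposure reveals no extra highlight detail in the SDR input.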

Text-to-HDR Generation

LumiVid can also generate HDR video directly from text prompts. The output is scene-linear EXR with real dynamic range; drag the exposure slider to explore highlights and shadows that a standard video cannot capture.

Interactive viewer: HDR Output (Text-to-HDR), with an exposure slider from EV −3 to EV +3 (ticks at −3, −1.5, 0, +1.5, +3).
Negative EV reveals highlight detail · Positive EV reveals shadow detail

Before / After

Drag the divider to compare the SDR input with our color-graded HDR output. The HDR version recovers highlight and shadow detail that is permanently lost in the SDR input.

Airport — Sunrise
SDR · HDR (Color Graded)
Sunset — Golden Hour
SDR · HDR (Color Graded)

Side-by-Side Results

Full video comparisons showing SDR input alongside our HDR output, tone-mapped to reveal the extended dynamic range.

Carousel — Night Glow Top: SDR · Bottom: HDR
Airport Silhouettes — Sunset Top: SDR · Bottom: HDR
Boy — Cozy Room Top: SDR · Bottom: HDR
Dandelion Field — Sunset Top: SDR · Bottom: HDR

Abstract

High dynamic range (HDR) imagery provides a rich representation of scene radiance, but remains challenging for diffusion models trained on bounded, perceptually compressed imagery. A natural approach is to learn a mapping from HDR data into the latent space of a pretrained diffusion model. However, this requires large HDR datasets and substantial additional training. In this work, we present a framework for SDR-to-HDR video translation and text-to-HDR video generation, leveraging the visual priors of pretrained diffusion models. We observe that applying a logarithmic encoding, commonly used in cinematic pipelines, to HDR videos produces representations that are naturally aligned with the latent space of these models. This alignment enables adapting pretrained diffusion models for HDR generation through lightweight fine-tuning, without modifying the latent space in which they operate or requiring an explicit HDR-to-latent mapping. To encourage the model to infer missing HDR content from its learned priors, we augment SDR-to-HDR training with camera-mimicking degradations that require recovering lost details. Using only lightweight adaptation of a pretrained video diffusion model, we demonstrate high-quality HDR video generation from both text and SDR video across diverse scenes and challenging lighting conditions. Our results show that HDR can be effectively modeled when its representation is aligned with the model's learned priors.
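
To make the logarithmic encoding concrete, the sketch below implements a LogC3 curve and its inverse. It uses the commonly published ARRI LogC3 constants for EI 800; which LogC3 variant and exposure index LumiVid actually uses is not stated here, so treat the constants as an assumption. The point it illustrates is that unbounded scene-linear values map into a bounded, roughly perceptual range similar to the imagery a pretrained VAE was trained on.

```python
import math

# Commonly published ARRI LogC3 (EI 800) constants. Assumption: the source
# does not state which LogC3 variant or exposure index LumiVid uses.
CUT, A, B = 0.010591, 5.555556, 0.052272
C, D, E, F = 0.247190, 0.385537, 5.367655, 0.092809


def logc3_encode(x: float) -> float:
    """Map a scene-linear value to a LogC3 code value."""
    if x > CUT:
        return C * math.log10(A * x + B) + D
    return E * x + F  # linear toe near black


def logc3_decode(t: float) -> float:
    """Inverse of logc3_encode: LogC3 code value back to scene-linear."""
    if t > E * CUT + F:
        return (10.0 ** ((t - D) / C) - B) / A
    return (t - F) / E


mid_gray = logc3_encode(0.18)                  # 18% gray lands near 0.391
bright = logc3_encode(16.0)                    # about 6.5 stops above mid gray
round_trip = logc3_decode(logc3_encode(16.0))  # recovers the linear value
```

A linear value of 16.0, far outside the [0, 1] range of SDR, still encodes to a code value below 1, and decoding returns the original radiance up to floating-point error.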

LumiVid training pipeline
LumiVid Training Overview. Scene-linear HDR frames are compressed with LogC3 and encoded by the frozen VAE to produce target latents z_tgt. The gray optional block is used only for SDR-to-HDR training, where HDR frames are tone-mapped and passed through camera-mimicking degradations to produce reference latents z_ref. In the SDR-to-HDR case, these reference latents are concatenated with the noisy target latents before being fed to the Video DiT. For Text-to-HDR synthesis, there are no reference tokens to concatenate, and the DiT operates directly on noisy target latents conditioned on text prompts. In both cases the VAE remains frozen, and only the lightweight LoRA adapters are trained with a flow-matching loss.
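
The training step described in the caption can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: latents are treated as (tokens, channels) arrays, the noisy latent is a linear interpolation between data and Gaussian noise, the regression target is the velocity `noise - z_tgt`, reference tokens are concatenated along the token axis, and the loss covers target tokens only. The function names are hypothetical, and the random arrays stand in for real VAE latents.

```python
import numpy as np

rng = np.random.default_rng(0)


def flow_matching_example(z_tgt, z_ref=None):
    """Build one training example: DiT input tokens and regression target.

    Assumptions: linear interpolation path, velocity target, and token-axis
    concatenation of reference latents. The source specifies none of these.
    """
    noise = rng.standard_normal(z_tgt.shape)
    t = rng.uniform()                      # random time in [0, 1)
    z_t = (1.0 - t) * z_tgt + t * noise    # noisy target latents
    velocity = noise - z_tgt               # flow-matching regression target
    if z_ref is None:                      # Text-to-HDR: no reference tokens
        tokens = z_t
    else:                                  # SDR-to-HDR: append reference tokens
        tokens = np.concatenate([z_t, z_ref], axis=0)
    return tokens, velocity, t


def flow_matching_loss(pred_velocity, velocity):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((pred_velocity - velocity) ** 2))


z_tgt = rng.standard_normal((8, 4))   # stand-in for VAE(LogC3(HDR frames))
z_ref = rng.standard_normal((8, 4))   # stand-in for VAE(degraded SDR frames)

tokens_s2h, target, t = flow_matching_example(z_tgt, z_ref)  # SDR-to-HDR
tokens_t2h, _, _ = flow_matching_example(z_tgt)              # Text-to-HDR
```

In the SDR-to-HDR case the model sees twice as many tokens, but the reference tokens are passed through clean and only the target half is noised.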
LumiVid inference pipeline
LumiVid Inference Overview. Our pipeline generates scene-linear HDR video from either text prompts or SDR references. For SDR-to-HDR translation, the input video follows the gray optional path: it is VAE-encoded to z_ref and concatenated with noise to provide spatial conditioning. In the Text-to-HDR case, this concatenation step is bypassed because there are no reference tokens; the DiT operates directly on noise and text embeddings. In both modes, the Video DiT uses the trained LoRA adapters to denoise the latents, which are then VAE-decoded and decompressed with the inverse LogC3 transform (LogC3⁻¹) to produce scene-linear float16 EXR files.
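
The inference path can be sketched in the same spirit. Everything model-specific here is a stand-in: `velocity_fn` is a hypothetical placeholder for the LoRA-adapted Video DiT, the sampler is a plain Euler integrator with an arbitrary step count, text conditioning is omitted, the VAE decode is replaced by the identity, and the LogC3 constants are the commonly published EI 800 values rather than ones confirmed by the source. Writing the actual EXR files is left out.

```python
import numpy as np

# Commonly published ARRI LogC3 (EI 800) constants (assumed variant).
CUT, A, B = 0.010591, 5.555556, 0.052272
C, D, E, F = 0.247190, 0.385537, 5.367655, 0.092809


def logc3_decode(t):
    """Vectorized inverse LogC3: code values -> scene-linear."""
    t = np.asarray(t, dtype=np.float64)
    log_part = (10.0 ** ((t - D) / C) - B) / A
    lin_part = (t - F) / E
    return np.where(t > E * CUT + F, log_part, lin_part)


def sample(velocity_fn, shape, z_ref=None, steps=8, seed=0):
    """Euler-integrate latents from noise (t=1) toward data (t=0).

    velocity_fn(tokens, t) is a hypothetical stand-in for the adapted DiT.
    Only the velocities of the first shape[0] (target) tokens are used.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(shape)
    n = shape[0]
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        # SDR-to-HDR appends clean reference tokens; Text-to-HDR skips this.
        tokens = z if z_ref is None else np.concatenate([z, z_ref], axis=0)
        z = z - dt * velocity_fn(tokens, t)[:n]
    return z


def toy_velocity(tokens, t):
    # Ideal velocity when every clean latent equals 0.5 (toy stand-in).
    return (tokens - 0.5) / t


z_ref = np.zeros((4, 3))                       # stand-in for VAE-encoded SDR
latents = sample(toy_velocity, (4, 3), z_ref=z_ref)
code_values = latents                          # hypothetical identity VAE decode
linear = logc3_decode(code_values)
hdr = np.clip(linear, 0.0, 65504.0).astype(np.float16)  # float16 max is 65504
```

The final clip guards against overflow when casting to float16, whose largest finite value is 65504.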