LumiVid: HDR Video Generation via Latent Alignment with Logarithmic Encoding

Abstract

High dynamic range (HDR) imagery provides a rich representation of scene radiance, but remains challenging for diffusion models trained on bounded, perceptually compressed imagery. A natural approach is to learn a mapping from HDR data into the latent space of a pretrained diffusion model. However, this requires large HDR datasets and substantial additional training. In this work, we present a framework for SDR-to-HDR video translation and text-to-HDR video generation, leveraging the visual priors of pretrained diffusion models. We observe that applying a logarithmic encoding, commonly used in cinematic pipelines, to HDR videos produces representations that are naturally aligned with the latent space of these models. This alignment enables adapting pretrained diffusion models for HDR generation through lightweight fine-tuning, without modifying the latent space in which they operate or requiring an explicit HDR-to-latent mapping. To encourage the model to infer missing HDR content from its learned priors, we augment SDR-to-HDR training with camera-mimicking degradations that require recovering lost details. Using only lightweight adaptation of a pretrained video diffusion model, we demonstrate high-quality HDR video generation from both text and SDR video across diverse scenes and challenging lighting conditions. Our results show that HDR can be effectively modeled when its representation is aligned with the model's learned priors.

LumiVid Training Overview. Scene-linear HDR frames are compressed with LogC3 and encoded by the frozen VAE to produce target latents z_tgt. The gray optional block is used only for SDR-to-HDR training, where HDR frames are tone-mapped, passed through camera-mimicking degradations, and VAE-encoded to produce reference latents z_ref. In the SDR-to-HDR case, these reference latents are concatenated with the noisy target latents before being fed to the Video DiT. For Text-to-HDR synthesis, there are no reference tokens to concatenate, and the DiT operates directly on noisy target latents conditioned on text prompts. In both settings, the VAE remains frozen, and only the lightweight LoRA adapters are trained with a flow matching loss.
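The training data path described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the constants are the published ARRI LogC3 values for EI 800 (the source only says "LogC3"), the noisy latent follows a rectified-flow style interpolation, and the reference tokens are placed before the noisy target tokens. The names logc3_encode and flow_matching_sample are hypothetical, and latents are modeled as plain lists of token vectors rather than VAE outputs.

```python
import math
import random

# ARRI LogC3 constants for EI 800 (assumed; the source does not state the EI).
CUT, A, B = 0.010591, 5.555556, 0.052272
C, D, E, F = 0.247190, 0.385537, 5.367655, 0.092809


def logc3_encode(x: float) -> float:
    """Map a scene-linear value to a LogC3 code value."""
    if x > CUT:
        return C * math.log10(A * x + B) + D
    return E * x + F  # linear segment near black


def flow_matching_sample(z_tgt, z_ref, t, rng):
    """Build one training example (assumed rectified-flow convention).

    z_tgt: target latent tokens (list of lists of floats).
    z_ref: reference latent tokens, or None for text-to-HDR.
    Returns (model_input_tokens, velocity_target).
    """
    noise = [[rng.gauss(0.0, 1.0) for _ in tok] for tok in z_tgt]
    # Noisy target: linear interpolation between clean latent (t=0) and noise (t=1).
    z_t = [[(1.0 - t) * z + t * n for z, n in zip(zt, nt)]
           for zt, nt in zip(z_tgt, noise)]
    # Velocity the model is asked to predict on the target tokens.
    target = [[n - z for z, n in zip(zt, nt)] for zt, nt in zip(z_tgt, noise)]
    # Token-wise concatenation; the ordering is an assumption.
    tokens = z_t if z_ref is None else list(z_ref) + z_t
    return tokens, target


if __name__ == "__main__":
    for linear in (0.0, 0.18, 1.0, 10.0, 55.0):
        print(f"linear {linear:6.2f} -> LogC3 {logc3_encode(linear):.4f}")
    rng = random.Random(0)
    tokens, target = flow_matching_sample([[0.1, 0.2]], [[1.0, 1.0]], 0.5, rng)
    print(len(tokens), len(target))
```

With these constants, 18% gray encodes to about 0.391 and a scene-linear value near 55 encodes to roughly 1.0, so several stops of highlight headroom fit inside the bounded range the VAE was trained on.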

LumiVid Inference Overview. Our pipeline generates scene-linear HDR video from either text prompts or SDR reference videos. For SDR-to-HDR translation, the input video follows the gray optional path: it is VAE-encoded to z_ref and concatenated with noise to provide spatial conditioning. In the Text-to-HDR case, this concatenation step is bypassed because there are no reference tokens; the DiT operates directly on noise and text embeddings. In both modes, the Video DiT uses the trained LoRA adapters to denoise the latents, which are then VAE-decoded and decompressed via LogC3⁻¹ to produce scene-linear float16 EXR files.
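The final LogC3⁻¹ step can be illustrated with a short Python sketch. It assumes the ARRI LogC3 constants for EI 800 (the source does not state the EI) and adds a clamp to the finite float16 range before the values would be written out, which is an assumption and not something the source describes. The helper names logc3_decode and decode_frame are hypothetical, and actual EXR writing is omitted.

```python
import math

# ARRI LogC3 constants for EI 800 (assumed; the source does not state the EI).
CUT, A, B = 0.010591, 5.555556, 0.052272
C, D, E, F = 0.247190, 0.385537, 5.367655, 0.092809

FLOAT16_MAX = 65504.0  # largest finite half-precision value


def logc3_decode(t: float) -> float:
    """Invert the LogC3 curve: code value -> scene-linear value."""
    if t > E * CUT + F:  # above the breakpoint, undo the log segment
        return (10.0 ** ((t - D) / C) - B) / A
    return (t - F) / E  # below it, undo the linear segment


def decode_frame(code_values):
    """Decode a flat list of LogC3 code values and clamp to the float16 range.

    The clamp is an assumption made here so that results stay finite in half precision.
    """
    return [max(-FLOAT16_MAX, min(FLOAT16_MAX, logc3_decode(t)))
            for t in code_values]


if __name__ == "__main__":
    for code in (0.092809, 0.391007, 0.75, 1.0):
        print(f"LogC3 {code:.4f} -> linear {logc3_decode(code):.4f}")
```

Because the curve is exponential above the breakpoint, small overshoots of the decoded code values above 1.0 translate into large linear values, which is why a range check before export may be worth having.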
