Zero-shot Generalization to FLUX and SD-3
Text Slider can be directly applied to transformer-based diffusion models such as FLUX.1-schnell and SD-3 without retraining, further demonstrating the strong generalizability of our method.
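Because the slider lives entirely in the CLIP text encoder that these architectures share, the same adapter can be attached without any retraining. Below is a minimal sketch for FLUX.1-schnell, assuming a slider LoRA saved with PEFT for the CLIP-L text encoder (the adapter directory is hypothetical; the training sketch further down this page shows how such an adapter could be produced). SD-3 can be handled analogously through its CLIP text encoders.

import torch
from diffusers import FluxPipeline
from peft import PeftModel

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16).to("cuda")

# Attach a Text Slider LoRA that was trained on the CLIP-L text encoder of an
# SD/SD-XL pipeline; FLUX.1-schnell uses the same encoder, so no retraining is needed.
# "age_slider_clip_l_lora" is a hypothetical adapter directory.
pipe.text_encoder = PeftModel.from_pretrained(pipe.text_encoder, "age_slider_clip_l_lora")
pipe.text_encoder.to(device="cuda", dtype=torch.bfloat16)

image = pipe("a photo of a person", height=512, width=512,
             num_inference_steps=4, guidance_scale=0.0).images[0]
image.save("flux_age_slider.png")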
Text Slider injects low-rank parameters ∆θ into the pre-trained text encoder τθ(·) of a text-guided diffusion model and fine-tunes them using contrastive prompts (e.g., c_t: person, c_+: old person, and c_-: young person) derived from concept representations. This enables continuous control over visual attributes across diverse model architectures, supporting both image and video synthesis tasks.
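As a concrete illustration, the sketch below injects a LoRA adapter into a CLIP text encoder with PEFT and optimizes it so that the adapted embedding of the neutral prompt moves along the direction from the negative to the positive prompt. The objective, rank, learning rate, and guidance strength are simplified stand-ins for the paper's actual training setup, chosen only to show the mechanism.

import torch
import torch.nn.functional as F
from transformers import CLIPTextModel, CLIPTokenizer
from peft import LoraConfig, get_peft_model

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Inject low-rank parameters (Delta-theta) into the attention projections of tau_theta.
lora_cfg = LoraConfig(r=4, lora_alpha=4,
                      target_modules=["q_proj", "k_proj", "v_proj", "out_proj"])
text_encoder = get_peft_model(text_encoder, lora_cfg)

def embed(prompt):
    ids = tokenizer(prompt, padding="max_length",
                    max_length=tokenizer.model_max_length,
                    truncation=True, return_tensors="pt").input_ids
    return text_encoder(input_ids=ids).last_hidden_state

c_t, c_plus, c_minus = "person", "old person", "young person"

# Frozen reference embeddings from the original encoder (LoRA branch disabled).
with torch.no_grad(), text_encoder.disable_adapter():
    e_t0 = embed(c_t)
    direction = embed(c_plus) - embed(c_minus)

target = e_t0 + 1.0 * direction  # guidance strength of 1.0 is an illustrative choice

optimizer = torch.optim.AdamW(
    [p for p in text_encoder.parameters() if p.requires_grad], lr=1e-4)

for step in range(500):
    loss = F.mse_loss(embed(c_t), target)  # simplified slider-style objective
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

text_encoder.save_pretrained("age_slider_clip_l_lora")  # hypothetical output path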
Combining SD-XL with Text Slider enables continuous attribute manipulation across diverse object categories; the attribute intensity is controlled simply by adjusting the inference-time scale.
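For reference, a minimal diffusers-style sketch of this inference-time control is given below. It assumes a slider checkpoint saved in the diffusers LoRA format (the path and adapter name are hypothetical) and uses the adapter weight as the slider scale.

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")

# Hypothetical path to a trained Text Slider (text-encoder LoRA) checkpoint.
pipe.load_lora_weights("age_slider_sdxl.safetensors", adapter_name="age")

prompt = "a photo of a person"
for scale in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    # The adapter weight acts as the inference-time slider scale.
    pipe.set_adapters(["age"], adapter_weights=[scale])
    image = pipe(prompt, num_inference_steps=30).images[0]
    image.save(f"person_age_{scale:+.1f}.png")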
Integrating AnimateDiff with Text Slider enables fine-grained and continuous attribute control across diverse concepts, such as person, hair, car, style, and scene, while preserving structural consistency throughout the video. For each video, representative frames are sampled to illustrate the gradual progression of attribute intensity over time.
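A minimal sketch of the video setting with diffusers' AnimateDiffPipeline is shown below; the motion adapter and base model follow the standard diffusers AnimateDiff examples, and the slider checkpoint path is hypothetical.

import torch
from diffusers import AnimateDiffPipeline, MotionAdapter, DDIMScheduler
from diffusers.utils import export_to_gif

adapter = MotionAdapter.from_pretrained(
    "guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16)
pipe = AnimateDiffPipeline.from_pretrained(
    "emilianJR/epiCRealism", motion_adapter=adapter, torch_dtype=torch.float16).to("cuda")
pipe.scheduler = DDIMScheduler.from_pretrained(
    "emilianJR/epiCRealism", subfolder="scheduler",
    clip_sample=False, beta_schedule="linear", timestep_spacing="linspace")

# Hypothetical Text Slider checkpoint (text-encoder LoRA in diffusers format).
pipe.load_lora_weights("hair_length_slider.safetensors", adapter_name="hair")
pipe.set_adapters(["hair"], adapter_weights=[1.5])  # slider scale for this clip

frames = pipe("a woman walking in a park",
              num_frames=16, num_inference_steps=25).frames[0]
export_to_gif(frames, "hair_slider_1.5.gif")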
We demonstrate the composability of Text Slider in both text-to-image (left) and text-to-video (right) generation by sequentially manipulating different attributes. The proposed approach preserves structural consistency while enabling fine-grained control over the target concepts at each editing stage.
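To make the composition concrete, the sketch below loads two independently trained sliders (hypothetical checkpoints) into one SD-XL pipeline and re-renders the same seed while activating them one stage at a time; the video case works analogously with the AnimateDiff pipeline above.

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")

# Hypothetical paths to two independently trained Text Slider checkpoints.
pipe.load_lora_weights("age_slider_sdxl.safetensors", adapter_name="age")
pipe.load_lora_weights("smile_slider_sdxl.safetensors", adapter_name="smile")

# Each stage activates one more slider, mirroring sequential attribute editing.
stages = [(["age"], [1.5]),
          (["age", "smile"], [1.5, 0.8])]

for i, (names, weights) in enumerate(stages):
    pipe.set_adapters(names, adapter_weights=weights)
    image = pipe("a photo of a person", num_inference_steps=30,
                 generator=torch.Generator("cuda").manual_seed(0)).images[0]
    image.save(f"composed_stage_{i}.png")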
By inverting real images with ReNoise and applying our method, we achieve fine-grained attribute control on real images.
@inproceedings{chiu2026textslider,
title={Text Slider: Efficient and Plug-and-Play Continuous Concept Control for Image/Video Synthesis via LoRA Adapters},
author={Pin-Yen Chiu and I-Sheng Fang and Jun-Cheng Chen},
booktitle={IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
year={2026}
}