dual use text encoder/kv fast edits #13853
Replies: 1 comment
Hi @kwal559, this is a solid workflow. The part that actually makes it work on one 24GB card is loading only the text encoder first, using it as an LLM for prompt expansion and then for embeddings, and dumping it before the transformer and VAE come in. That matches how diffusers already thinks about offload order anyway; you're just taking it further. Since Klein's text encoder is Qwen3, calling generate and then encode_prompt on the same weights isn't a hack. That's the model.

The KV transformer plus the resized base image as reference is what keeps the character consistent across the grid. You pay for enhancement once, lock the look at 1024, then batch cheap edits at 256.

One thing I'd try is batching encode_prompt with the full variant list instead of looping one prompt at a time. That should help a lot once you crank variations up to 50 or 100.

Also, your system prompt asks for the final line in quotes, but the parser mostly looks for thinking tags. If those don't show up, you might get extra junk in the embed. Pulling the last quoted string with a simple regex fallback would probably make that step more reliable.

Did you try Flux2KleinPipeline instead of Flux2Pipeline here? It might be a cleaner fit, since you're already on distilled 4-step settings and passing prompt_embeds plus image directly.
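A minimal sketch of the regex fallback suggested above, not the script's actual parser. It assumes the thinking tags are Qwen3-style `<think>...</think>` and that the final prompt is wrapped in straight double quotes; the function name `extract_final_prompt` is hypothetical.

```python
import re

# Assumption: thinking is wrapped in Qwen3-style <think>...</think> tags.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)
# Assumption: the final prompt is a single-line string in straight double quotes.
QUOTED = re.compile(r'"([^"\n]+)"')


def extract_final_prompt(raw: str) -> str:
    """Return the enhanced prompt from raw LLM output (hypothetical helper)."""
    # Remove every complete thinking block.
    text = THINK_BLOCK.sub("", raw)
    # If only a closing tag survived, keep what follows the last one.
    if "</think>" in text:
        text = text.rsplit("</think>", 1)[1]
    # Preferred path: the last quoted string, as the system prompt requests.
    quoted = QUOTED.findall(text)
    if quoted:
        return quoted[-1].strip()
    # Fallback: the last non-empty line, or an empty string if there is none.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return lines[-1] if lines else ""
```

With this order of checks, quoted drafts inside a thinking block are discarded before the search, so only text after the reasoning can end up in the embedding.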
This script explores FLUX.2 klein 9b-kv. Pass a prompt directly to the text encoder to enhance it, let it think, and capture its final output. Then we create variation prompts and feed them back to the same text encoder for embeds. One component with dual use saves memory. Quantize it if you want to save time. We include the initial image and pile on the variation prompts, with batch generation for speed. The result is a grid of consistent characters in different poses/image challenges. If you want to see magic, load up an SVDQ or similar small transformer and set the image count to 100. On an RTX 4090, 100 pics (128x128) generate in less than 10 seconds. Each image is unique and the character remains.
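The grid assembly is not part of the excerpt, so here is a rough sketch of how a near-square layout for the generated images could be computed. The helper names `grid_shape` and `cell_origin` are hypothetical and only do the arithmetic; pasting the tiles (for example with PIL's `Image.paste`) is left out.

```python
import math


def grid_shape(num_images: int) -> tuple[int, int]:
    """Return (cols, rows) for a near-square grid that fits num_images."""
    if num_images < 1:
        raise ValueError("num_images must be at least 1")
    # Integer ceil(sqrt(n)) without floating-point rounding issues.
    cols = math.isqrt(num_images - 1) + 1
    rows = math.ceil(num_images / cols)
    return cols, rows


def cell_origin(index: int, cols: int, cell_size: int) -> tuple[int, int]:
    """Top-left pixel (x, y) of the index-th tile, filling row by row."""
    return (index % cols) * cell_size, (index // cols) * cell_size
```

For the 100-image case at 128x128 this gives a 10 by 10 grid, i.e. a 1280x1280 canvas.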
import torch, diffusers, gc, time, psutil, random
from PIL import Image

def flush():
    gc.collect()
    torch.cuda.empty_cache()
    print(f"🧹✂️ {torch.cuda.memory_reserved()/1024**3:.1f}GB")
    # 24 is the card's total VRAM in GB; index 3 of virtual_memory() is the `used` field
    print(f"VRAM: {24 - torch.cuda.mem_get_info()[0]/1024**3:.2f}GB | RAM: {psutil.virtual_memory()[3]/1024**3:.1f}GB")
model_id, kv_tran = "black-forest-labs/FLUX.2-klein-9B", "black-forest-labs/FLUX.2-klein-9b-kv"

def enhance_and_embed(user_concept, num_prompts=20):
    time_1 = time.time()
    print("🧠 Text Encode + Enhance")
    # Skip the transformer, VAE and scheduler so only the text-encoding components are loaded
    pipe = diffusers.DiffusionPipeline.from_pretrained(model_id, transformer=None, vae=None, scheduler=None, torch_dtype=torch.bfloat16).to("cuda")
def generate_images(init_embeddings, prompt_embeddings, num_prompts=20):
    print("\n🚀 Loading Image Generation Models...")
    vae = diffusers.AutoencoderKLFlux2.from_pretrained("black-forest-labs/FLUX.2-small-decoder", torch_dtype=torch.bfloat16)
    transformer = diffusers.AutoModel.from_pretrained(kv_tran, subfolder="transformer", torch_dtype=torch.bfloat16)
    # Store weights in FP8 and upcast them layer by layer at compute time
    transformer.enable_layerwise_casting(storage_dtype=torch.float8_e4m3fn)
# EXECUTE PIPELINE
if __name__ == "__main__":
    USER_CONCEPT = "Portrait of a ghoul"
    NUM_VARIATIONS = 20
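The excerpt does not show how the variation prompts are built or handed to the encoder, so the following is only a sketch under stated assumptions: the enhanced prompt is combined with a caller-supplied list of variation phrases, and the result is split into chunks so each chunk can be passed to `encode_prompt` as one list rather than one prompt per call. `build_variation_prompts` and `chunked` are hypothetical helpers, not part of the script or of diffusers.

```python
import random


def build_variation_prompts(base_prompt, variations, num_prompts, seed=None):
    """Combine the enhanced prompt with variation phrases (hypothetical helper)."""
    if not variations:
        raise ValueError("variations must not be empty")
    rng = random.Random(seed)
    # Cycle through the phrases so every one is used before any repeats.
    picks = [variations[i % len(variations)] for i in range(num_prompts)]
    rng.shuffle(picks)
    return [f"{base_prompt}, {phrase}" for phrase in picks]


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each chunk would then go to the text encoder in a single call; the chunk size is a knob to trade VRAM against the number of encoder passes.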