Happy Horse rewards structure over verbosity. The model has what its prompt guide calls a "prompt budget": past roughly 60 words, faces go generic, motion gets mushy, and lip-sync drifts. The fix is the six-part formula, the same skeleton Alibaba's ATH team built the model around.
This guide adapts that formula for AI influencer UGC video specifically: talking-head Reels, sponsored lip-sync ads, multilingual variants, multi-shot mini-stories, and atmospheric mood pieces. Every template is copy-paste ready and built to slot into the OmniGems AI pipeline alongside GPT-Image-2 persona anchors.
For background on what Happy Horse is and why we run it as the default video model, see the Happy Horse pillar guide.
The Six-Part Formula
Every Happy Horse prompt has six blocks. Order matters. Block-by-block:
- Subject – who or what is on screen, with persona invariants restated
- Action – what they do, as a single fluid motion phrase
- Environment – setting, lighting, time of day
- Style/Composition – aspect ratio, framing, visual tone
- Camera Motion – explicit move or static framing
- Audio – voiceover script, language, ambient bed
Skip a block and the model fills it with a generic default. Always provide all six, even if the answer is "static, no camera motion" or "no voiceover, ambient only."
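The six blocks can be assembled mechanically. Here is a minimal Python sketch of that assembly; `build_prompt` and `BLOCKS` are illustrative names, not part of any official Happy Horse SDK, and the helper refuses to emit a prompt with a missing block rather than let the model fill it with a default:

```python
# Illustrative six-block prompt builder (hypothetical helper, not an official API).
BLOCKS = ("Subject", "Action", "Environment", "Style", "Camera", "Audio")

def build_prompt(**blocks: str) -> str:
    """Join the six blocks in formula order; fail loudly on gaps."""
    missing = [b for b in BLOCKS if not blocks.get(b.lower())]
    if missing:
        # A skipped block would be filled with a generic default by the model.
        raise ValueError(f"missing blocks: {missing}")
    return " ".join(f"{b}: {blocks[b.lower()].rstrip('.')}." for b in BLOCKS)

prompt = build_prompt(
    subject="Same persona as reference, same face, same hair",
    action="Speaking directly to camera, natural blinks",
    environment="Sunlit Brooklyn cafe window seat, golden hour",
    style="9:16 vertical, casual iPhone-style",
    camera="Locked-off medium close-up, eye level",
    audio="No voiceover, ambient cafe sound only",
)
```

Because the helper validates all six keys, "static, no camera motion" and "no voiceover, ambient only" still have to be written out, which is exactly the discipline the formula asks for.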
Why Block Order Matters
The model parses prompts left-to-right and weights early blocks higher. Subject and Action carry the most quality budget. If you bury the persona invariants under decorative environment description, the persona drifts. Lead with who and what; let environment, style, and camera fall into place after.
The Prompt Budget
Aim for 40–60 words total across all six blocks. Twenty is too thin (the model fills gaps unpredictably). Eighty is too dense (quality dilutes across blocks).
The discipline that gets you there: one specific noun and one specific adjective per block. Not "a beautiful young woman with stunning features in a lovely outfit" – that's four adjectives doing the work of one good noun. Try "26-year-old, olive skin, cream turtleneck": three concrete details, one modifier each, done.
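A budget check is trivial to automate. This sketch uses a plain whitespace word count against the 40–60 thresholds from this guide; `check_budget` is a hypothetical helper, and a real counter might want to ignore block labels:

```python
def check_budget(prompt: str, low: int = 40, high: int = 60) -> str:
    """Flag prompts outside the 40-60 word sweet spot."""
    n = len(prompt.split())  # naive whitespace word count
    if n < low:
        return f"too thin ({n} words): the model fills gaps unpredictably"
    if n > high:
        return f"too dense ({n} words): quality dilutes across blocks"
    return f"within budget ({n} words)"
```

Run it on every prompt before generating; a clip is cheaper to fix in text than in render credits.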
Template 1: Talking-Head Reel
The bread and butter. Persona speaks to camera, 9:16, 8–12 seconds, single shot, conversational tone.
Subject: Same persona as reference image, same face, same hair. Action: Speaking directly to camera, slight head movement, natural blinks. Environment: Sunlit Brooklyn café window seat, soft golden hour light. Style: 9:16 vertical, casual iPhone-style, slight handheld drift. Camera: Locked-off medium close-up, eye level. Audio: Female voiceover, English, conversational tone – "Honestly? This one product changed my whole morning routine."
49 words. Within budget. Every block has one specific noun and one specific modifier. Pass the GPT-Image-2 persona anchor as the reference image and the model holds the face.
What to Vary
- Audio script – swap the line, keep everything else
- Environment – swap "Brooklyn café" for "Tokyo subway platform" or "Seoul rooftop at night"
- Time of day – swap "golden hour" for "blue hour" or "harsh midday"
- Wardrobe – restate the wardrobe in Subject if you're swapping it; the model needs the cue
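Treating the template as a dict makes the "swap one block, keep everything else" move mechanical. In this sketch, `swap_block` is a hypothetical helper and the environments come from the list above:

```python
# Base talking-head template as a dict of the six blocks.
base = {
    "subject": "Same persona as reference image, same face, same hair",
    "action": "Speaking directly to camera, slight head movement, natural blinks",
    "environment": "Sunlit Brooklyn cafe window seat, soft golden hour light",
    "style": "9:16 vertical, casual iPhone-style, slight handheld drift",
    "camera": "Locked-off medium close-up, eye level",
    "audio": "Female voiceover, English, conversational tone",
}

def swap_block(template: dict, block: str, value: str) -> dict:
    """Return a copy of the template with exactly one block replaced."""
    if block not in template:
        raise KeyError(f"unknown block: {block}")
    return {**template, block: value}

variants = [
    swap_block(base, "environment", env)
    for env in ("Tokyo subway platform", "Seoul rooftop at night")
]
```

Every variant differs from the base in exactly one key, which keeps output comparisons meaningful.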
Template 2: Sponsored UGC Ad with Lip-Sync
The format brands actually pay for. Persona on camera, holding the product, delivering the brand line.
Subject: Same persona as reference, same face, holding [product reference image] in right hand. Action: Showing product to camera, smiling, speaking the brand line. Environment: Bright kitchen counter, morning natural light through window. Style: 9:16 vertical, polished UGC, slight handheld. Camera: Medium close-up, locked, eye level. Audio: Female voiceover, English, warm and confident – "Three weeks in and I'm not going back."
53 words. Pass two reference images (persona anchor + product still). The model handles multi-image input cleanly.
Lip-Sync Tips
- Quote the script verbatim in the Audio block – paraphrasing it in the prompt produces drifted lip-sync
- Specify the language explicitly even if it's English – the model uses it to select phoneme-level lip patterns
- For brand names with unusual pronunciation, write them phonetically in a parenthetical:
"Try our new Nuance (NEW-AHNS) cream"
Template 3: Multilingual Localized Variant
Same persona, same scene, different language. This is where Happy Horse compounds – generate four language variants of one ad from one prompt skeleton.
Subject: Same persona as reference, same face, same wardrobe. Action: Speaking directly to camera, holding product, light smile. Environment: Same kitchen counter as English variant, morning light. Style: 9:16 vertical, polished UGC. Camera: Medium close-up, locked. Audio: Female voiceover, Japanese, warm and confident – "三週間使って、もう戻れない。" ("Three weeks in and I'm not going back.")
The only blocks that change between language variants are the script inside Audio and the language label. Subject, Action, Environment, Style, Camera stay identical. This is why one Happy Horse generation per language replaces an entire reshoot.
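The fan-out is a one-liner once the skeleton is a dict. In this sketch, `base_ad`, `SCRIPTS`, and `localize` are illustrative names rather than an official API; the English and Japanese lines come from the templates above, and the German line is my own translation of the English script:

```python
# One ad skeleton, fanned out into language variants. Only Audio changes.
base_ad = {
    "subject": "Same persona as reference, same face, same wardrobe",
    "action": "Speaking directly to camera, holding product, light smile",
    "environment": "Kitchen counter, morning natural light",
    "style": "9:16 vertical, polished UGC",
    "camera": "Medium close-up, locked",
    "audio": "",  # filled in per language below
}

SCRIPTS = {
    "English": '"Three weeks in and I\'m not going back."',
    "Japanese": '"三週間使って、もう戻れない。"',
    "German": '"Drei Wochen dabei, und ich gehe nicht mehr zurück."',
}

def localize(skeleton: dict, language: str) -> dict:
    """Swap in the localized script and language label; lock everything else."""
    audio = f"Female voiceover, {language}, warm and confident: {SCRIPTS[language]}"
    return {**skeleton, "audio": audio}

variants = {lang: localize(base_ad, lang) for lang in SCRIPTS}
```

One generation per entry in `SCRIPTS` replaces a reshoot per market.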
Supported Languages with Strong Lip-Sync
English, Mandarin, Cantonese, Japanese, Korean, German, French. For other languages the model still generates audio, but lip-sync quality degrades – see the Happy Horse vs Sora 2 vs Veo 3 breakdown.
Template 4: Multi-Shot Mini-Story
A 15-second beat with setup → action → payoff. Compress the sequence into a single fluid motion phrase in the Action block – multi-step prose breaks the cuts.
Subject: Same persona as reference, casual loungewear. Action: Opens fridge, pours iced matcha into glass, walks to window, looks at camera with raised eyebrow. Environment: Sunlit Brooklyn loft, late morning. Style: 9:16 vertical, three-shot cut, polished UGC. Camera: Shot 1 wide on fridge, shot 2 medium on pour, shot 3 close on look-to-camera. Audio: Ambient morning kitchen sounds, no voiceover, soft lo-fi music bed.
68 words – slightly over budget, but multi-shot inherently needs more. The trick: enumerate the shots inside Camera, not Action. Action describes the persona's continuous motion; Camera describes how the camera observes it.
Why This Works
Happy Horse trains on multi-shot sequences but parses the persona's motion as one trajectory. If you split the trajectory across multiple sentences in Action, the model treats each sentence as an independent generation request and continuity breaks. One Action sentence, one persona motion, one continuous beat, even when the camera cuts.
Template 5: Atmospheric Mood Piece
Slower, cinematic, non-speaking. Used for brand-establishing posts and influencer-launch announcements.
Subject: Same persona as reference, charcoal turtleneck, contemplative. Action: Walking slowly through coffee shop, pausing at window, gazing out. Environment: Tokyo coffee shop, blue hour, neon reflections in puddles outside. Style: 9:16 vertical, cinematic, color-graded teal-and-amber. Camera: Steadicam glide following persona, slow dolly-in to medium close-up at window. Audio: Ambient café sound, distant rain, lo-fi instrumental – no voiceover.
64 words. This format leans into Happy Horse's strengths: atmospherics, fabric dynamics, geometric consistency in reflections, cinema-grade color grading.
When to Use
- Influencer launch posts (introducing the persona to the feed)
- Campaign opening clips (set the mood before the talking-head ad lands)
- Sponsored brand films where the persona is the subject of the cinematography, not the speaker
Common Prompt Mistakes
- Bloated Subject blocks – "a beautiful young woman with cascading auburn hair, piercing blue eyes, a warm smile, wearing a stunning cream-colored turtleneck" eats half the budget. Compress: "26-year-old, auburn hair, cream turtleneck."
- Multi-step Action prose – "She opens the door, walks to the table, sits down, picks up a book, then opens it" produces broken cuts. Compress: "Opens door, sits at table reading."
- Decorative cinematography – "stunning, breathtaking, professional film look" is noise. The model wants concrete cinematography vocabulary: "locked-off medium close-up, eye level, slight handheld drift."
- Skipping Audio – if you don't specify it, you get random ambient. Always describe at least the audio bed, even on non-speaking clips: "ambient café sound, no voiceover."
- Vague script references – "speaking the brand line" with no quoted script in the Audio block produces TTS-quality lip-sync. Always quote the script verbatim and label the language explicitly.
- Restating the persona anchor description in text – pass the anchor as a reference image; in Subject, just write "Same persona as reference, same face, same hair." The image carries the heavy load.
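Two of these mistakes, the missing verbatim script and the missing language label, are mechanically checkable before you spend a generation. This is a hypothetical pre-flight lint sketch, not a real Happy Horse tool; the language list mirrors the strong lip-sync set above:

```python
import re

# Languages with strong lip-sync support, per the list earlier in this guide.
STRONG_LIPSYNC = ("English", "Mandarin", "Cantonese", "Japanese",
                  "Korean", "German", "French")

def lint_audio(audio: str) -> list[str]:
    """Return a list of problems found in an Audio block (empty = clean)."""
    problems = []
    if not re.search(r'"[^"]+"', audio):
        problems.append("no verbatim script in quotes")
    if not any(lang in audio for lang in STRONG_LIPSYNC):
        problems.append("no explicit language label")
    return problems
```

A non-empty return means fix the prompt, not the output.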
Prompt Iteration Workflow
The single-change-per-pass discipline that works for image generation works for video too:
- Generate the base clip with the full six-block prompt
- Lock five blocks; vary one
- Compare output to base; keep what works
- Move to next block; vary that one
- Stop iterating when you have a clip that ships
This is how series content stays coherent across 30+ daily Reels. Same persona anchor, same prompt skeleton, one variable at a time. Trying to vary three blocks at once produces unpredictable output and a folder of unusable takes.
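The lock-five-vary-one loop can be written down directly. In this sketch, `iterate_block` is a hypothetical helper, and `generate` and `score` are stubs standing in for the real render call and your own review of the clip:

```python
def iterate_block(base: dict, block: str, candidates: list,
                  generate, score) -> dict:
    """Lock five blocks, vary one, keep the best-scoring variant."""
    best, best_score = base, score(generate(base))
    for value in candidates:
        trial = {**base, block: value}   # exactly one block differs from base
        s = score(generate(trial))
        if s > best_score:
            best, best_score = trial, s  # keep what works, discard the rest
    return best

# Stub demo: "generate" returns the prompt itself; "score" prefers blue hour.
demo_base = {"environment": "soft golden hour light", "camera": "locked-off"}
best = iterate_block(
    demo_base, "environment",
    ["blue hour", "harsh midday"],
    generate=lambda p: p,
    score=lambda clip: "blue" in clip["environment"],
)
```

In practice `score` is you watching the takes; the point is that only one variable moves per pass, so you always know what caused a change.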
How OmniGems AI Uses This Formula
Inside the OmniGems AI Studio, the influencer's persona brief auto-generates the Subject block. The creator's content schedule defines the Action and Audio blocks. Style and Camera defaults are set per platform (9:16 for Reels/TikTok/Shorts, 16:9 for YouTube long-form). The creator only writes the Action and Audio variation; the rest is templated.
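That templating pattern can be sketched as per-platform defaults merged with creator-supplied blocks. Everything here is an assumption about how such a pipeline might be wired, not OmniGems's actual implementation; the default values are drawn from the templates in this guide:

```python
# Hypothetical per-platform Style/Camera defaults (illustrative values).
PLATFORM_DEFAULTS = {
    "reels":   {"style": "9:16 vertical, casual iPhone-style",
                "camera": "Locked-off medium close-up, eye level"},
    "tiktok":  {"style": "9:16 vertical, polished UGC",
                "camera": "Medium close-up, locked"},
    "youtube": {"style": "16:9 widescreen, cinematic",
                "camera": "Slow dolly-in to medium close-up"},
}

def fill_template(platform: str, subject: str, action: str,
                  environment: str, audio: str) -> dict:
    """Merge creator-supplied blocks with the platform's Style/Camera defaults."""
    return {"subject": subject, "action": action, "environment": environment,
            **PLATFORM_DEFAULTS[platform], "audio": audio}
```

The creator's daily input shrinks to Action and Audio; Subject comes from the persona brief and the rest is configuration.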
This is what turns Happy Horse from a powerful video model into a content-pipeline component. Discipline at the prompt level scales the discipline at the persona level.
Next Steps
- For why we picked Happy Horse over Sora 2 and Veo 3, see Happy Horse vs Sora 2 vs Veo 3
- For the persona anchor workflow that feeds image-to-video, see GPT-Image-2 for AI Influencers
- For aspect ratios and platform formats, see Best Aspect Ratios for Social Platforms
- For image-side prompt structure, see How to Write Prompts for AI Influencer Content
Start Generating
Try the six-part formula inside the OmniGems AI Studio. Persona anchor handled, video pipeline integrated, model routing per clip available, posting agent and token launch in the same flow.