Field Notes

How to Write Happy Horse Prompts: Six-Part Formula for AI Influencer Video

The six-part Happy Horse prompt formula adapted for AI influencer UGC: copy-paste templates for talking-head Reels, sponsored ads, multilingual lip-sync, and multi-shot mini-stories.

May 2, 2026 · 7 min read
happy-horse · prompt engineering · AI influencers · UGC video

Happy Horse rewards structure over verbosity. The model has what its prompt guide calls a "prompt budget": past roughly 60 words, faces go generic, motion gets mushy, and lip-sync drifts. The fix is the six-part formula, the same skeleton Alibaba's ATH team built the model around.

This guide adapts that formula for AI influencer UGC video specifically: talking-head Reels, sponsored lip-sync ads, multilingual variants, multi-shot mini-stories, and atmospheric mood pieces. Every template is copy-paste ready and built to slot into the OmniGems AI pipeline alongside GPT-Image-2 persona anchors.

For background on what Happy Horse is and why we run it as the default video model, see the Happy Horse pillar guide.

The Six-Part Formula

Every Happy Horse prompt has six blocks. Order matters. Block-by-block:

  1. Subject: who or what is on screen, with persona invariants restated
  2. Action: what they do, as a single fluid motion phrase
  3. Environment: setting, lighting, time of day
  4. Style/Composition: aspect ratio, framing, visual tone
  5. Camera Motion: explicit move or static framing
  6. Audio: voiceover script, language, ambient bed

Skip a block and the model fills it with a generic default. Always provide all six, even if the answer is "static, no camera motion" or "no voiceover, ambient only."
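The all-six-blocks rule is easy to enforce in code. A minimal sketch, assuming you assemble prompts as plain strings before sending them to your video API (the `SixBlockPrompt` class and its `render` method are illustrative names, not part of any Happy Horse SDK):

```python
from dataclasses import dataclass

@dataclass
class SixBlockPrompt:
    """One field per block, kept in the order the model weights them."""
    subject: str
    action: str
    environment: str
    style: str
    camera: str
    audio: str

    def render(self) -> str:
        # Emit blocks in formula order; every block must be non-empty,
        # even if the answer is "static, no camera motion".
        blocks = [
            ("Subject", self.subject), ("Action", self.action),
            ("Environment", self.environment), ("Style", self.style),
            ("Camera", self.camera), ("Audio", self.audio),
        ]
        missing = [name for name, text in blocks if not text.strip()]
        if missing:
            raise ValueError(f"empty blocks get generic defaults: {missing}")
        return " ".join(f"{name}: {text}" for name, text in blocks)
```

Raising on an empty block is the point: a blank field is exactly the case where the model silently fills in a generic default.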

Why Block Order Matters

The model parses prompts left-to-right and weights early blocks higher. Subject and Action carry the most quality budget. If you bury the persona invariants under decorative environment description, the persona drifts. Lead with who and what; let environment, style, and camera fall into place after.

The Prompt Budget

Aim for 40–60 words total across all six blocks. Twenty is too thin (the model fills gaps unpredictably). Eighty is too dense (quality dilutes across blocks). Forty to sixty is the sweet spot.

The discipline that gets you there: one specific noun and one specific adjective per block. Not "a beautiful young woman with stunning features in a lovely outfit"; that's six adjectives doing the work of one noun. Try "26-year-old, olive skin, cream turtleneck." Three nouns, three modifiers, done.
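The budget is checkable before you spend a generation. A rough sketch (the `word_budget` helper and its regex tokenizer are assumptions for illustration; the model's real token accounting may differ from a plain word count):

```python
import re

def word_budget(prompt: str, low: int = 40, high: int = 60) -> tuple[int, str]:
    """Count words across all blocks and report where the prompt
    sits relative to the 40-60 word sweet spot."""
    words = re.findall(r"[\w'-]+", prompt)
    n = len(words)
    if n < low:
        verdict = "too thin"      # model fills gaps unpredictably
    elif n > high:
        verdict = "too dense"     # quality dilutes across blocks
    else:
        verdict = "in budget"
    return n, verdict
```

Run it on every template before generating; trimming to budget is cheaper than re-rolling a mushy clip.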

Template 1: Talking-Head Reel

The bread and butter. Persona speaks to camera, 9:16, 8–12 seconds, single shot, conversational tone.

Subject: Same persona as reference image, same face, same hair. Action: Speaking directly to camera, slight head movement, natural blinks. Environment: Sunlit Brooklyn café window seat, soft golden hour light. Style: 9:16 vertical, casual iPhone-style, slight handheld drift. Camera: Locked-off medium close-up, eye level. Audio: Female voiceover, English, conversational tone: "Honestly? This one product changed my whole morning routine."

49 words. Within budget. Every block has one specific noun and one specific modifier. Pass the GPT-Image-2 persona anchor as the reference image and the model holds the face.

What to Vary

  • Audio script: swap the line, keep everything else
  • Environment: swap "Brooklyn café" for "Tokyo subway platform" or "Seoul rooftop at night"
  • Time of day: swap "golden hour" for "blue hour" or "harsh midday"
  • Wardrobe: restate the wardrobe in Subject if you're swapping it; the model needs the cue

Template 2: Sponsored UGC Ad with Lip-Sync

The format brands actually pay for. Persona on camera, holding the product, delivering the brand line.

Subject: Same persona as reference, same face, holding [product reference image] in right hand. Action: Showing product to camera, smiling, speaking the brand line. Environment: Bright kitchen counter, morning natural light through window. Style: 9:16 vertical, polished UGC, slight handheld. Camera: Medium close-up, locked, eye level. Audio: Female voiceover, English, warm and confident: "Three weeks in and I'm not going back."

53 words. Pass two reference images (persona anchor + product still). The model handles multi-image input cleanly.

Lip-Sync Tips

  • Quote the script verbatim in the Audio block; paraphrasing it in the prompt produces drifted lip-sync
  • Specify the language explicitly even if it's English; the model uses it to select phoneme-level lip patterns
  • For brand names with unusual pronunciation, write them phonetically in a parenthetical: "Try our new Nuance (NEW-AHNS) cream"

Template 3: Multilingual Localized Variant

Same persona, same scene, different language. This is where Happy Horse compounds: generate four language variants of one ad from one prompt skeleton.

Subject: Same persona as reference, same face, same wardrobe. Action: Speaking directly to camera, holding product, light smile. Environment: Same kitchen counter as English variant, morning light. Style: 9:16 vertical, polished UGC. Camera: Medium close-up, locked. Audio: Female voiceover, Japanese, warm and confident: "三週間使って、もう戻れない。" ("Three weeks in and I'm not going back.")

The only blocks that change between language variants are the script inside Audio and the language label. Subject, Action, Environment, Style, Camera stay identical. This is why one Happy Horse generation per language replaces an entire reshoot.
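The swap-only-Audio discipline can be sketched as a small loop (`SKELETON`, `SCRIPTS`, and `localized_prompts` are illustrative names; the two scripts are the ones quoted in this article, and you would supply your own localized copy per language):

```python
# Build localized variants by swapping only the Audio block; every
# other block stays byte-identical so the scene matches across languages.
SKELETON = {
    "Subject": "Same persona as reference, same face, same wardrobe.",
    "Action": "Speaking directly to camera, holding product, light smile.",
    "Environment": "Same kitchen counter as English variant, morning light.",
    "Style": "9:16 vertical, polished UGC.",
    "Camera": "Medium close-up, locked.",
}

SCRIPTS = {  # illustrative scripts; add one entry per target language
    "English": "Three weeks in and I'm not going back.",
    "Japanese": "三週間使って、もう戻れない。",
}

def localized_prompts(skeleton: dict, scripts: dict) -> dict:
    out = {}
    for lang, line in scripts.items():
        blocks = dict(skeleton)  # copy; never mutate the shared skeleton
        blocks["Audio"] = f'Female voiceover, {lang}, warm and confident: "{line}"'
        out[lang] = " ".join(f"{k}: {v}" for k, v in blocks.items())
    return out
```

One skeleton in, N localized prompts out; the five shared blocks are guaranteed identical because they come from the same dict.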

Supported Languages with Strong Lip-Sync

English, Mandarin, Cantonese, Japanese, Korean, German, French. For other languages the model still generates audio, but lip-sync quality degrades; see the Happy Horse vs Sora 2 vs Veo 3 breakdown.

Template 4: Multi-Shot Mini-Story

15-second beat with setup → action → payoff. Compress the sequence into a single fluid motion phrase in the Action block; multi-step prose breaks the cuts.

Subject: Same persona as reference, casual loungewear. Action: Opens fridge, pours iced matcha into glass, walks to window, looks at camera with raised eyebrow. Environment: Sunlit Brooklyn loft, late morning. Style: 9:16 vertical, three-shot cut, polished UGC. Camera: Shot 1 wide on fridge, shot 2 medium on pour, shot 3 close on look-to-camera. Audio: Ambient morning kitchen sounds, no voiceover, soft lo-fi music bed.

68 words, slightly over budget, but multi-shot inherently needs more. The trick: enumerate the shots inside Camera, not Action. Action describes the persona's continuous motion; Camera describes how the camera observes it.

Why This Works

Happy Horse trains on multi-shot sequences but parses the persona's motion as one trajectory. If you split the trajectory across multiple sentences in Action, the model treats each sentence as an independent generation request and continuity breaks. One Action sentence, one persona motion, one continuous beat, even when the camera cuts.

Template 5: Atmospheric Mood Piece

Slower, cinematic, non-speaking. Used for brand-establishing posts and influencer-launch announcements.

Subject: Same persona as reference, charcoal turtleneck, contemplative. Action: Walking slowly through coffee shop, pausing at window, gazing out. Environment: Tokyo coffee shop, blue hour, neon reflections in puddles outside. Style: 9:16 vertical, cinematic, color-graded teal-and-amber. Camera: Steadicam glide following persona, slow dolly-in to medium close-up at window. Audio: Ambient café sound, distant rain, lo-fi instrumental, no voiceover.

64 words. This format leans into Happy Horse's strengths: atmospherics, fabric dynamics, geometric consistency in reflections, cinema-grade color grading.

When to Use

  • Influencer launch posts (introducing the persona to the feed)
  • Campaign opening clips (set the mood before the talking-head ad lands)
  • Sponsored brand films where the persona is the subject of the cinematography, not the speaker

Common Prompt Mistakes

  • Bloated Subject blocks: "a beautiful young woman with cascading auburn hair, piercing blue eyes, a warm smile, wearing a stunning cream-colored turtleneck" eats half the budget. Compress: "26-year-old, auburn hair, cream turtleneck."
  • Multi-step Action prose: "She opens the door, walks to the table, sits down, picks up a book, then opens it" produces broken cuts. Compress: "Opens door, sits at table reading."
  • Decorative cinematography: "stunning, breathtaking, professional film look" is noise. The model wants concrete cinematography vocabulary: "locked-off medium close-up, eye level, slight handheld drift."
  • Skipping Audio: if you don't specify it, you get random ambient. Always describe at least the audio bed, even on non-speaking clips: "ambient café sound, no voiceover."
  • Vague language tags: "speaking the brand line" without an Audio block produces TTS-quality lip-sync. Always quote the script verbatim and label the language explicitly.
  • Restating the persona anchor description in text: pass the anchor as a reference image; in Subject, just write "Same persona as reference, same face, same hair." The image carries the heavy load.

Prompt Iteration Workflow

The single-change-per-pass discipline that works for image generation works for video too:

  1. Generate the base clip with the full six-block prompt
  2. Lock five blocks; vary one
  3. Compare output to base; keep what works
  4. Move to next block; vary that one
  5. Stop iterating when you have a clip that ships

This is how series content stays coherent across 30+ daily Reels. Same persona anchor, same prompt skeleton, one variable at a time. Trying to vary three blocks at once produces unpredictable output and a folder of unusable takes.
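The five-step loop can be sketched as a helper that holds five blocks fixed and varies exactly one (`iterate_block`, `BASE`, and the `generate` callable are hypothetical names for illustration; wire in whatever video-generation API your pipeline actually calls):

```python
# Single-change-per-pass: hold five blocks fixed, vary exactly one.
from typing import Callable

BASE = {
    "Subject": "Same persona as reference image, same face, same hair.",
    "Action": "Speaking directly to camera, natural blinks.",
    "Environment": "Sunlit Brooklyn café window seat, golden hour.",
    "Style": "9:16 vertical, casual iPhone-style.",
    "Camera": "Locked-off medium close-up, eye level.",
    "Audio": "Female voiceover, English, conversational.",
}

def iterate_block(base: dict, block: str, candidates: list,
                  generate: Callable[[str], object]) -> list:
    """Return one (value, clip) pair per candidate, varying only `block`."""
    if block not in base:
        raise KeyError(block)
    takes = []
    for value in candidates:
        variant = {**base, block: value}  # five blocks locked, one swapped
        prompt = " ".join(f"{k}: {v}" for k, v in variant.items())
        takes.append((value, generate(prompt)))
    return takes
```

Compare the takes against the base clip, promote the winner's value into `BASE`, then move to the next block.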

How OmniGems AI Uses This Formula

Inside the OmniGems AI Studio, the influencer's persona brief auto-generates the Subject block. The creator's content schedule defines the Action and Audio blocks. Style and Camera defaults are set per platform (9:16 for Reels/TikTok/Shorts, 16:9 for YouTube long-form). The creator only writes the Action and Audio variation; the rest is templated.

This is what turns Happy Horse from a powerful video model into a content-pipeline component. Discipline at the prompt level scales the discipline at the persona level.

Next Steps

  • For why we picked Happy Horse over Sora 2 and Veo 3, see Happy Horse vs Sora 2 vs Veo 3
  • For the persona anchor workflow that feeds image-to-video, see GPT-Image-2 for AI Influencers
  • For aspect ratios and platform formats, see Best Aspect Ratios for Social Platforms
  • For image-side prompt structure, see How to Write Prompts for AI Influencer Content

Start Generating

Try the six-part formula inside the OmniGems AI Studio. Persona anchor handled, video pipeline integrated, model routing per clip available, posting agent and token launch in the same flow.
