# Sora
# TL;DR
- Sora is OpenAI's text-to-video model; its core concerns are the controllability of video generation, physical consistency, and safety.
- Prompts for models like Sora read more like a storyboard: the more explicit the subject, setting, camera, lighting, style, duration, and shot transitions, the more controllable the output.
- Practical advice: use iterative prompting (ask for a rough cut first, then add constraints step by step), and review outputs for safety and compliance. A sketch of this workflow follows this list.
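Since a storyboard-style prompt is just structured text, the workflow above can be made concrete. Below is a minimal Python sketch, assuming a hypothetical `build_video_prompt` helper (not part of any Sora API), that assembles a prompt from storyboard fields and shows the rough-cut-then-refine loop.

```python
# A minimal sketch of storyboard-style prompt assembly (hypothetical helper,
# not an official Sora API). Each field is optional; omitted fields are skipped.

def build_video_prompt(
    subject: str,
    setting: str = "",
    camera: str = "",
    lighting: str = "",
    style: str = "",
    duration: str = "",
    transitions: str = "",
) -> str:
    """Join the storyboard fields into a single text-to-video prompt."""
    parts = [subject, setting, camera, lighting, style, duration, transitions]
    return ". ".join(p.strip().rstrip(".") for p in parts if p) + "."

# Iterative prompting: start with a rough cut, then add constraints.
rough_cut = build_video_prompt(subject="A stylish woman walks down a Tokyo street")
refined = build_video_prompt(
    subject="A stylish woman walks down a Tokyo street filled with neon signage",
    camera="tracking shot at eye level",
    lighting="warm glowing neon reflected on damp pavement",
    style="cinematic, shot on 35mm film",
)
print(rough_cut)
print(refined)
```

Keeping each storyboard field separate makes it easy to tighten one constraint at a time between generations, rather than rewriting the whole prompt.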
# Reading Guide
When reading this page, pay attention to:
- Capabilities: stability in complex scenes (multiple characters, backgrounds, motion)
- Methods: design ideas such as patch/token representations and Transformer/diffusion architectures
- Limitations: common failure modes such as physics, cause-and-effect, and camera trajectory
If you use Sora, evaluate outputs along these dimensions (a minimal scoring sketch follows this list):
- prompt adherence (does the output respect the key constraints?)
- temporal consistency (are characters and objects consistent across shots?)
- safety policy (filtering of sensitive content)
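To make these dimensions actionable, here is a minimal Python sketch of a manual review rubric. The `VideoReview` class, its 0-2 scales, and the pass threshold are illustrative assumptions, not an official evaluation protocol.

```python
# A minimal sketch of a manual review rubric for generated videos, scoring the
# three dimensions named above. The field names, 0-2 scale, and threshold are
# assumptions for illustration, not an official evaluation protocol.

from dataclasses import dataclass

@dataclass
class VideoReview:
    prompt_adherence: int      # 0 = ignored constraints, 2 = followed all key constraints
    temporal_consistency: int  # 0 = characters/objects drift, 2 = consistent across shots
    safety_pass: bool          # did the output clear the sensitive-content review?

    def acceptable(self, threshold: int = 3) -> bool:
        """A clip passes only if it is safe and its quality scores meet the threshold."""
        return self.safety_pass and (self.prompt_adherence + self.temporal_consistency) >= threshold

review = VideoReview(prompt_adherence=2, temporal_consistency=1, safety_pass=True)
print(review.acceptable())  # True
```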
# Original (English)
OpenAI introduces Sora, its new text-to-video AI model. Sora can create videos up to a minute long of realistic and imaginative scenes from text instructions.
OpenAI reports that its vision is to build AI systems that understand and simulate the physical world in motion, and to train models that solve problems requiring real-world interaction.
# Capabilities
Sora can generate videos that maintain high visual quality and adhere to the user's prompt. It can also generate complex scenes with multiple characters, different motion types, and backgrounds, and it understands how these elements relate to each other. Other capabilities include creating multiple shots within a single video while keeping characters and visual style persistent. Below are a few examples of videos generated by Sora.
Prompt:
```
A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.
```
Prompt:
```
A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors.
```
Video source: https://openai.com/sora
# Methods
Sora is reported to be a diffusion model that can generate entire videos at once or extend already-generated videos. It also uses a Transformer architecture, which gives it strong scaling performance. Videos and images are represented as patches, similar to tokens in GPT, which unifies the video generation system and enables longer durations, higher resolutions, and varied aspect ratios. Sora uses the recaptioning technique from DALL·E 3, which enables it to follow text instructions more closely. Sora can also generate videos from a given image, which enables the system to accurately animate that image.
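To illustrate the patch representation described above, here is a minimal Python/NumPy sketch that splits a video tensor into flattened spacetime patches, analogous to a token sequence. The patch sizes and tensor layout are illustrative assumptions; OpenAI has not published Sora's exact patchification.

```python
# A minimal sketch of representing a video as spacetime patches, analogous to
# tokens in GPT. The patch sizes and (T, H, W, C) layout are illustrative
# assumptions; OpenAI has not published Sora's exact patchification.

import numpy as np

def patchify(video: np.ndarray, pt: int = 2, ph: int = 16, pw: int = 16) -> np.ndarray:
    """Split a (T, H, W, C) video into a (num_patches, pt*ph*pw*C) sequence."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0, "dims must divide patch sizes"
    v = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)       # group the patch-grid dims first
    return v.reshape(-1, pt * ph * pw * C)     # flatten each patch into one token

video = np.random.rand(16, 256, 256, 3)        # 16 frames of 256x256 RGB
tokens = patchify(video)
print(tokens.shape)  # (2048, 1536): 8*16*16 patches, each 2*16*16*3 values
```

Because the patch grid can be built for any duration, resolution, or aspect ratio whose dimensions divide the patch sizes, a single sequence model can consume all of them, which is the unification the paragraph above describes.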
# Limitations and Safety
The reported limitations of Sora include difficulty simulating physics and a lack of cause-and-effect understanding. Sora also sometimes misunderstands spatial details and events described in the prompt (e.g., a specified camera trajectory). OpenAI reports that it is making Sora available to red teamers and creators to assess harms and capabilities.
Prompt:
```
Step-printing scene of a person running, cinematic film shot in 35mm.
```
Video source: https://openai.com/sora
Find more examples of videos generated by the Sora model here: https://openai.com/sora