GPT Image 2 Team Revealed: Led by a Wuxi Talent, 13 People Achieved Breakthroughs in 4 Months

GPT Image 2 has gone viral, but why are the results so good? The research lead, Chen Boyuan, revealed that the underlying architecture has been completely rebuilt. However, he declined to say whether it uses diffusion or autoregressive techniques, describing it only as a "general model" or a "GPT for the image field."

Chen Boyuan's tweet also revealed that it only took four months to achieve such significant improvements, starting from GPT Image 1.5 last December.

This groundbreaking achievement was accomplished by a core team of only 13 people.

The team leader, Gabriel Goh, shared a team photo of the members as AI avatars.

In the comments, some netizens exclaimed: "Why are they all Asians?"

Chen Boyuan: From not understanding Python to Research Lead

What is the architecture of GPT Image 2?

OpenAI is unlikely to disclose it any time soon, but some clues can be gleaned from the academic backgrounds of the core team members.

Chen Boyuan is the team's Research Lead. As a PhD student at MIT, he shared an advisor, Vincent Sitzmann, with another team member, Kiwhan Song.

His doctoral work, "Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion," was accepted at NeurIPS 2024.

This research proposes a new training paradigm for sequence generation called Diffusion Forcing. It combines diffusion that assigns an independent noise level to each token with causal next-token prediction, uniting the variable-length generation of autoregressive models with the long-range guidance of full-sequence diffusion models.
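
The difference between a shared noise level and per-token noise levels can be illustrated with a minimal sketch. It covers only the noising step, not the causal denoising model, and it is not the paper's or OpenAI's implementation. The linear schedule, the function names, and the scalar "tokens" are all assumptions made for illustration.

```python
import math
import random

def alpha_bar(k, num_levels):
    """Fraction of signal kept at noise level k (0 = clean, num_levels = pure noise).
    A linear schedule, assumed here only for illustration."""
    return 1.0 - k / num_levels

def sample_noise_levels(seq_len, num_levels, mode, rng):
    """Return one noise level per token."""
    if mode == "full_sequence":
        # Full-sequence diffusion: every token shares a single noise level.
        k = rng.randint(0, num_levels)
        return [k] * seq_len
    if mode == "independent":
        # Diffusion Forcing-style: each token draws its own noise level.
        return [rng.randint(0, num_levels) for _ in range(seq_len)]
    raise ValueError(f"unknown mode: {mode}")

def add_noise(tokens, levels, num_levels, rng):
    """Mix each scalar token with Gaussian noise according to its own level."""
    noisy = []
    for x, k in zip(tokens, levels):
        a = alpha_bar(k, num_levels)
        eps = rng.gauss(0.0, 1.0)
        noisy.append(math.sqrt(a) * x + math.sqrt(1.0 - a) * eps)
    return noisy

rng = random.Random(0)
tokens = [0.5, -1.0, 2.0, 0.0]
levels = sample_noise_levels(len(tokens), 10, "independent", rng)
noisy_tokens = add_noise(tokens, levels, 10, rng)
```

In this toy setup, a token at level 0 stays clean, so keeping past tokens at level 0 while noising only the next one resembles ordinary next-token prediction, and sharing one level across all tokens resembles full-sequence diffusion.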

During his internship at Google, he also published SpatialVLM as a co-first author.

SpatialVLM automatically constructs an internet-scale VQA dataset for 3D spatial reasoning (10 million images, 2 billion QA pairs). Training on it gives vision-language models both quantitative and qualitative spatial reasoning abilities, so they can output metric distances, dimensions, orientations, and other precise numerical values from a single 2D image.
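
To give a feel for how question-answer pairs might be produced from 3D information, here is a small sketch of the final templating step only. It assumes object centers are already available in meters, which in practice requires lifting the 2D image into 3D first. The function name, the templates, and the camera-frame convention are hypothetical and are not taken from the SpatialVLM pipeline.

```python
import math

def make_spatial_qa(name_a, pos_a, name_b, pos_b):
    """Build template QA pairs from two object centers given in meters.
    Positions are (x, y, z) in a camera frame where +x points right (assumed)."""
    distance = math.dist(pos_a, pos_b)
    side = "left" if pos_a[0] < pos_b[0] else "right"
    return [
        {
            "question": f"How far is the {name_a} from the {name_b}?",
            "answer": f"About {distance:.2f} meters.",
        },
        {
            "question": f"Is the {name_a} to the left or right of the {name_b}?",
            "answer": f"The {name_a} is to the {side} of the {name_b}.",
        },
    ]

qa_pairs = make_spatial_qa("cup", (0.0, 0.0, 0.0), "book", (3.0, 4.0, 0.0))
```

The first pair is quantitative (a metric distance) and the second is qualitative (a relative position), mirroring the two kinds of reasoning described above.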

This research applies chain-of-thought spatial reasoning to embodied intelligence.

An instruction fine-tuning technique he developed during his Google internship was later adopted by Gemini 2.0.

He met Xia Fei, a senior researcher at Google DeepMind, at a science summer camp in high school, at a time when he did not even understand the basics of Python. Xia Fei introduced him to the world of AI.

Xia Fei later invited him to DeepMind for two internships, which gave Chen Boyuan engineering experience in large-scale model training and a valuable perspective on the data requirements of multimodal systems.

After completing his doctorate, Chen Boyuan joined OpenAI in June 2025 and quickly became one of the five core members responsible for all training of the GPT image generation models. He is also a member of the Sora video generation team.

In the demonstration, he made a poster for his hometown of Wuxi. Then he made a Korean poster for teammates from Seoul, and a Bengali poster for teammates from Bangladesh. The text rendering in each one was accurate.

Jianfeng Wang of USTC: Enabling Generative AI to Understand World Knowledge

Jianfeng Wang, a PhD graduate of USTC, is responsible for another striking capability of GPT Image 2: instruction following and understanding of the world.

Older models always drew clocks showing 10:10, because clock advertisements on the internet almost always show 10:10.

Manufacturers favor that time because, reportedly on the basis of experiments by psychologists, they believe it helps stimulate consumers' desire to buy.

He had the new model draw 2:25, 3:30, 9:10, and 7:45, and it rendered all of them accurately.

This is just an appetizer.

He then tried a more complex spatial layout: an apple in the center, a cup on the right, a book on top, a camera on the left, and a basketball on the bottom. The model placed everything accurately.

Before joining OpenAI, he worked at Microsoft for nearly 9 years, during which he collaborated with the OpenAI team on DALL-E 3.

He has published numerous academic papers in computer vision, with research that may cover image classification, object detection, semantic segmentation, and visual representation learning.

The significant improvement in world knowledge understanding means the semantic content and functional structure of objects are correctly understood.

Jianfeng Wang said at the end of the demonstration video that GPT Image 2 is eliminating the gap between your intentions and the model's output.

It truly gives you what you want.

Yuguang Yang: Generating High-Precision Complex Infographics

Yuguang Yang demonstrated generating infographics and slide decks during the GPT Image 2 launch event.

He dragged the entire 75-page GPT-3 paper into ChatGPT, which then automatically generated 7 slides.

His background is arguably the most varied on the team: every job change took him into a different field, yet always with a focus on machine learning.

He studied engineering as an undergraduate at Zhejiang University's Chu Kochen Honors College, then earned his doctorate at Johns Hopkins University, working on computational chemical physics and machine learning.

His first full-time job was as a quantitative analyst. During a visiting research position at Tsinghua University, he focused on reinforcement learning and control algorithms for nanorobots.

He later worked at Amazon on Alexa voice research, and then at Microsoft on query understanding and retrieval for Bing search and on document understanding.

After joining OpenAI in early 2025, he participated in the ChatGPT agent project in addition to image generation.

On his personal account, he introduced GPT Image 2's infographic generation capabilities, which can save researchers a great deal of time.

He also repeatedly reminded everyone to select the thinking mode when making infographics.

From DALL-E to GPT Image 2.0

According to team member Kenji Hata's self-introduction, GPT Image 1.0 is the image generation component of GPT-4o.

One person has been involved in OpenAI's multimodal series research from the very beginning, starting with DALL-E.

He is Gabriel Goh, the leader of the GPT Image 2.0 team.

He joined OpenAI in 2019, and his early research was more theoretical, focusing on interpretability and convex optimization, etc.

He gradually shifted to image generation starting with DALL-E.

The research résumé of another team member, Weixin Liang, hints at another aspect of GPT Image 2's technical foundation.

His representative work, Mixture-of-Transformers, completed during an internship at Meta, decouples model parameters by modality, including the feed-forward networks and attention projections, while keeping attention over the full sequence. This significantly reduces the computational cost of pre-training multimodal models.

He received his doctorate from Stanford and his undergraduate degree from Zhejiang University's Chu Kochen Honors College, where he studied several years after Yuguang Yang.
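
The idea of modality-decoupled parameters with shared attention can be sketched in a few lines. This is a toy illustration, not the Mixture-of-Transformers code and not a description of GPT Image 2: the "feed-forward network" is reduced to a per-modality scale and bias, attention uses no learned projections at all, and every name below is hypothetical.

```python
import math

def global_attention(tokens):
    """Unprojected self-attention over the full mixed-modality sequence."""
    outputs = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) for key in tokens]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * value[i] for w, value in zip(weights, tokens)) / total
            for i in range(len(query))
        ])
    return outputs

def modality_ffn(tokens, modalities, params):
    """Apply each token's own modality-specific parameters (a scale and a bias here)."""
    outputs = []
    for vector, modality in zip(tokens, modalities):
        scale, bias = params[modality]
        outputs.append([max(0.0, scale * x + bias) for x in vector])
    return outputs

def mot_layer(tokens, modalities, params):
    """One toy layer: all tokens attend to each other, then are routed by modality."""
    mixed = global_attention(tokens)
    return modality_ffn(mixed, modalities, params)

params = {"text": (2.0, 0.0), "image": (3.0, 0.0)}
layer_out = mot_layer([[1.0, 0.0], [1.0, 0.0]], ["text", "image"], params)
```

Because routing depends only on each token's known modality, no learned router is needed, which is one way this differs from a standard mixture-of-experts layer.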

Weixin Liang, like Chen Boyuan, joined OpenAI immediately after graduating with his doctorate in 2025 and quickly became a core member of the team.

Other GPT Image 2.0 team members include:

Ayaan Haque previously worked at Luma AI, where he took part in training Dream Machine, Luma's video generation foundation model.

Bing Liang worked at Google for more than 5 years on Imagen 3, Veo, and Gemini Multimodal, then moved to OpenAI in 2025 to do image generation research.

Mengchao Zhong earned his undergraduate degree at Shanghai Jiao Tong University and a master's degree at Texas A&M University. He has worked as a software engineer at Pinterest and Airtable, and is responsible for multimodal product engineering at OpenAI.

Dibya Bhattacharjee, of Yale University, was a bronze medalist at the 2015 IPhO and earned the highest global scores in CIE A-Level Mathematics and Biology.

Kiwhan Song was the most recent to join, in October 2025. Besides doing research, he is also the team's prompt master, and many of the official demonstration images are his work.

...

From the earliest DALL-E to today's GPT Image 2.0, this team has solved, one after another, the problems of drawing at all, drawing clearly, drawing beautifully, and drawing accurately.

Despite heavy talent turnover in recent years, OpenAI remains a company that continues to attract a wide range of individual talent. It does not restrict people by major, welcomes cross-disciplinary work, and believes in bottom-up, emergent research.

A project starts with a small team, and after it achieves breakthroughs the company invests more resources, until it changes the world.

One More Thing

Not long ago, Ghibli-style avatars made with GPT-4o image generation swept the world.

Now, the members of the GPT Image 2.0 team have all changed their avatars to this quirky long-neck style.

What is the prompt for this style? The team members also revealed it:

Use my photo only for identity. Redraw me as a very simple surreal Japanese sticker-style caricature: long thin neck, small deadpan face, minimal black outline, flat light coloring, almost no shading, very few facial details, simplified hair shape, lots of white space, plain white background, slightly awkward and funny. Ultratall 1:3 image.

Reference links:

[1] https://x.com/gabeeegoooh/status/2046674385407512687?s=20
[2] https://venturebeat.com/technology/openais-chatgpt-images-2-0-is-here-and-it-does-multilingual-text-full-infographics-slides-maps-even-manga-seemingly-flawlessly