In the mind’s eye

A video still from the Sora prompt: Reflections in the window of a train travelling through the Tokyo suburbs. Image: OpenAI

The architect as magician: Like many, I was captivated by Junya Ishigami’s presentation at this year’s in:situ conference, a virtuoso display of gravity-defying tricks, such as his impossibly thin (12mm) steel roof sagging across a 90m span to create a semi-covered plaza at the Kanagawa Institute of Technology (KAIT) in Japan.

Ishigami was a keynote speaker at the conference run by Te Kāhui Whaihanga New Zealand Institute of Architects in February. I was particularly fascinated by the security-camera footage Ishigami showed of people weaving their way through the diverse arrangement of some 305 slim steel columns dotted throughout the 2000sqm glass-walled Workshop building, also at KAIT. So, too, was one of the next presenters, Jean Pierre Crousse of Barclay & Crousse in Peru, who said, “Look, we do it like this” as he took a crazily weaving path around the furniture to the stage.

The video vividly captured the architectural ambiguity Ishigami likes to play with. What looks like a random scattering is, in fact, carefully thought out: groups of columns are arranged to divide different spaces, and there is a startling array of circulation paths for individuals and groups to pass through, as though wending their way through a forest of tree trunks.

I thought again of Ishigami’s mesmerising video sequences about a week later, when OpenAI announced Sora, an AI text-to-video generator that’s apparently next level. Sora can generate a 60-second photorealistic, high-definition video from a written description, reportedly creating synthetic video (minus audio, at present) with greater fidelity and consistency than any other text-to-video model currently available.

I’ve been largely sceptical about the impact of AI on architecture. Sora might be changing my mind. It created a video from the prompt: “Reflections in the window of a train travelling through the Tokyo suburbs”. The result was not only highly believable but strangely compelling. Similarly, the prompt asking for “a beautiful homemade video showing the people of Lagos, Nigeria, in the year 2056” resulted in a complex visualisation of scenes, beginning with a group at a table at an outdoor restaurant, then panning to an open-air market and cityscape.

What gets me interested is that Sora seems to be not just manipulating pixels but conceptualising three-dimensional scenes that unfold in time. In other words, it’s doing something like what architects do when they design — picturing spaces, people and movement in scenes and places in their mind’s eye, imagining not just how they look but what they are. No doubt Ishigami visualised people weaving among his columns as he was designing his KAIT Workshop.

As Joshua Rothman describes it in The New Yorker, Sora doesn’t make recordings; it renders ideas: “Sora isn’t Photoshop — it contains knowledge about what it shows us”. Where does this all lead? Rothman argues Sora’s overall comprehension of the objects and spaces it conjures means that it isn’t just a system for generating video. It’s a step, as OpenAI puts it, “towards building general purpose simulators of the physical world”. Which is a little bit terrifying. If AI can simulate everything in the physical world, then it can probably control it.

Sora is a diffusion transformer: given ‘noisy’ patches (and conditioning information, such as a text prompt), it is trained to predict the original ‘clean’ patches. Image: OpenAI

According to New Scientist, to achieve this higher level of realism, Sora combines two different AI approaches: “The first is a diffusion model similar to those used in AI image generators such as DALL-E. These models learn to gradually convert randomised image pixels into a coherent image. The second AI technique is called ‘transformer architecture’ and is used to contextualise and piece together sequential data.” This is the same architecture behind the ‘large language models’ that power AI programs like ChatGPT, assembling words into generally comprehensible sentences. For Sora, OpenAI broke video clips down into visual “spacetime patches” that its transformer architecture could process.
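To make that description concrete, here is a deliberately simplified sketch, in Python with PyTorch, of the two ideas it combines: a video is cut into flattened ‘spacetime patches’, random noise is mixed in, and a small transformer is trained to recover the clean patches from the noisy ones, conditioned on a stand-in for a text-prompt embedding. This is emphatically not OpenAI’s code; the names, sizes and the single fixed noise level are illustrative assumptions, where a real diffusion model uses a full schedule of noise levels and a vastly larger network.

```python
# A toy illustration of a diffusion transformer over "spacetime patches".
# NOT Sora's code; shapes, names and the single noise level are invented
# purely to show the structure of the idea.
import torch
import torch.nn as nn

def to_spacetime_patches(video, pt=4, ph=8, pw=8):
    """Cut a video tensor (frames, channels, height, width) into flattened
    spacetime patches: each patch spans pt frames and a ph x pw pixel tile."""
    f, c, h, w = video.shape
    patches = (video
               .unfold(0, pt, pt)    # tile along time
               .unfold(2, ph, ph)    # tile along height
               .unfold(3, pw, pw))   # tile along width
    # shape is now (f//pt, c, h//ph, w//pw, pt, ph, pw); flatten each patch
    return patches.permute(0, 2, 3, 1, 4, 5, 6).reshape(-1, c * pt * ph * pw)

class PatchDenoiser(nn.Module):
    """A tiny diffusion transformer: given noisy patches plus a conditioning
    token (standing in for an encoded text prompt), predict the clean patches."""
    def __init__(self, patch_dim, d_model=128):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, patch_dim)

    def forward(self, noisy_patches, prompt_token):
        tokens = torch.cat([prompt_token, self.embed(noisy_patches)], dim=1)
        return self.out(self.transformer(tokens)[:, 1:])  # drop the prompt token

# One illustrative training step on random data.
video = torch.rand(16, 3, 64, 64)                    # 16 frames of 64x64 RGB "video"
clean = to_spacetime_patches(video).unsqueeze(0)     # (batch, num_patches, patch_dim)
noisy = 0.7 * clean + 0.3 * torch.randn_like(clean)  # one fixed noise level, for brevity
prompt = torch.zeros(1, 1, 128)                      # stand-in for a text-prompt embedding
model = PatchDenoiser(patch_dim=clean.shape[-1])
loss = nn.functional.mse_loss(model(noisy, prompt), clean)
loss.backward()                                      # repeated over huge video datasets
print(f"denoising loss: {loss.item():.3f}")
```

Repeat that denoising objective over an enormous corpus of video and, the argument goes, the network is forced to absorb something about how objects, light and motion behave, which is what gives the generated clips their uncanny coherence.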

At the time of writing, Sora is still in the research preview stage and generating quite a bit of speculation. Benj Edwards in Ars Technica notes OpenAI has not revealed its dataset but says it’s likely Sora is using “synthetic video data generated in a video game engine in addition to sources of real video (say, scraped from YouTube or licensed from stock video libraries)”.

The article quotes computer scientist Jim Fan, a specialist in training AI with synthetic data: “If you think OpenAI Sora is a creative toy like DALL-E…, think again. Sora is a data-driven physics engine. It is a simulation of many worlds, real or fantastical. The simulator learns intricate rendering, ‘intuitive’ physics, long-horizon reasoning, and semantic grounding, all by some denoising and gradient maths.”

Is there a downside? Edwards: “Very soon, every photorealistic video you see online could be 100 percent false in every way. Moreover, every historical video you see could also be false.” Truth and fiction in the media could become indistinguishable, all of the time. It sounds like a nightmare. But as a tool to help an architect’s imagination conceptualise 3D scenes unfolding in time, and to render real worlds, even if they seem fantastical, into existence, Sora-like AI seems like a magic trick that will be hard to resist.

