Apple today announced its iPhone 15 lineup of smartphones, including the iPhone 15 Pro, which will be the company’s first phone to capture spatial video for immersive viewing on Vision Pro.

While Apple Vision Pro itself works as a spatial camera, allowing users to capture immersive photos and videos, I think we can all agree that wearing a camera on your head isn’t the most convenient way to capture content.

Image courtesy Apple

Apple seems to feel the same way. Today during the company’s iPhone 15 announcement, it was revealed that the new iPhone 15 Pro will be capable of capturing spatial video which can be viewed immersively on the company’s upcoming Vision Pro headset. The base versions of the phone, the iPhone 15 and iPhone 15 Plus, won’t have the spatial capture capability.

Details on exactly how this function works are slim for the time being.

“We use the ultrawide and main cameras together to create a three-dimensional video,” the company said during its announcement. But it isn’t clear if “three-dimensional” means stereoscopic footage with a fixed viewpoint, or some kind of depth projection with a bit of 6DOF wiggle room.

Given that the iPhone 15 Pro cameras are so close together—not offering enough distance between the two views for straightforward stereo capture—it seems that some kind of depth projection or scene reconstruction will be necessary.

Image courtesy Apple

Apple didn’t specifically say whether the phone’s depth-sensor was involved, but considering the phone uses it for other camera functions, we wouldn’t be surprised to find that it has some role to play. Curiously, Apple didn’t mention spatial photo capture, but ostensibly this should be possible as well.

While users will be able to watch their immersive videos on Vision Pro, Apple also said they’ll be able to share the footage with others who can watch on their own headset.

While the new iPhone 15 lineup will launch on September 22nd, Apple says the spatial capture capability won’t be available until “later this year”—which is curious considering the company also said today that Vision Pro is “on track to launch in early 2024.” Perhaps the company plans to allow creators to access the spatial video files for editing and use outside of Apple’s platform?

Ben is the world's most senior professional analyst solely dedicated to the XR industry, having founded Road to VR in 2011—a year before the Oculus Kickstarter sparked a resurgence that led to the modern XR landscape. He has authored more than 3,000 articles chronicling the evolution of the XR industry over more than a decade. With that unique perspective, Ben has been consistently recognized as one of the most influential voices in XR, giving keynotes and joining panel and podcast discussions at key industry events. He is a self-described "journalist and analyst, not evangelist."
  • Brian Elliott Tate

    It seems so weird to call it “Spatial Video” if it’s just the kind of 3D video we’ve been able to do for years (and which you can still easily do by taping together two old iPhones with half-decent cameras and hitting record at the same time).

    Aka, I really hope they do have something with a little of the old “Apple magic” that does a great job of adding just a bit of 6DOF to it.

    • Christian Schildwaechter

      TL;DR: Spatial Video ≠ Stereoscopic Video ≠ 3D Video. Spatial video includes true depth data, stereoscopic video includes none, but software can try to reconstruct it from the parallax.

      These are not (necessarily) the same.

      3D video is a very broad term that mostly means “not just 2D”, but can mean anything from adding a second, slightly moved perspective in post-processing to full holographic video that shows a different perspective depending on the view direction even for multiple users, with all the information actually being encoded in the medium. While the first 3D movies actually used two slightly shifted cameras (stereoscopy), most 3D movies today are conversions from 2D, with software “guessing” the depth information from other cues like shadows, which works sufficiently well at much lower costs.

      Stereoscopic video is what most people assume 3D video is, and what could be done with two old iPhones: basically two 2D images recorded in parallel and played back to two eyes. They don’t contain any actual 3D information which could be used later, e.g. to mask out objects. Software can (try to) extract depth information by comparing the two 2D images, which works decently well, esp. with video, but fails in situations lacking visual differences, for example a white cup in front of a white wall, or anything involving mirrors.
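
      A minimal sketch of that comparison step using OpenCV block matching, assuming a rectified stereo pair and a simple pinhole model (the filenames, focal length and baseline are illustrative, not anything Apple has described):

      ```python
      import cv2
      import numpy as np

      # Load a rectified stereo pair as grayscale (illustrative filenames).
      left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
      right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

      # Block matching compares local patches between the two views to find
      # each pixel's horizontal shift (disparity).
      stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
      disparity = stereo.compute(left, right).astype(np.float32) / 16.0  # output is fixed-point

      # Triangulation: depth = focal_length * baseline / disparity.
      focal_length_px = 700.0  # assumed focal length in pixels
      baseline_m = 0.065       # assumed distance between the two cameras
      valid = disparity > 0    # low-contrast regions (white cup, white wall) end up invalid here
      depth_m = np.where(valid, focal_length_px * baseline_m / np.maximum(disparity, 1e-6), 0.0)
      ```

      The `valid` mask is exactly where the failure cases described above show up: no detectable edge shift, no depth.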

      Spatial video actually includes 3D depth information encoded in the data. This usually requires an extra depth sensor, of which there are several types that work/fail in different situations. Though there is no guarantee that Apple uses the term this way, they have had 3D depth sensors on the high-end phones for some time, and these can utilize and store depth information in addition to the regular pixels. This can be used, for example, to change the focus in post-processing while keeping all objects at the appropriate distance, and it helps with reconstructing actual spatial data.

      So if you have a picture of a white cup with a black spot in front of a white wall, stereoscopic analysis might fail to tell whether the spot is on the wall or on the cup (for short distances and unknown curvature of the cup). If you look at that picture in VR and move your head sideways, the spot may appear to stick to the wall, while a spatial image would contain the actual depth of the spot, and it would therefore vanish if you look behind the cup in VR.

      That’s not “magical” enough to show what is on the back of the cup, as no sensor could record that in a single image, but it is enough to let you move your head around and get a proper 6DoF “feel”. Smart software may use the extra information to create full 3D scenes from a regular spatial video, similar to photogrammetry, and one day this could allow you to record a spatial video on your phone and then later walk through that room in VR, with everything that was recorded in at least one frame in the right position. A certain part of YouTube is currently going crazy about “3D Gaussian Splatting”, a technique that allows such scenes to be rendered very fast (with tons of VRAM), and I’d say that’s something we can expect to get from spatial video in a few years; it wouldn’t work with stereoscopic video, or would at least require multiple passes to record everything from several directions.

      • Dragon Marble

        Photogrammetry is very different: you move the camera around to capture a still scene — kind of the opposite of a video, where you hold the camera still to capture a moving scene.

        • Christian Schildwaechter

          I am referring to using the stream from a moving camera that sort of “accidentally” catches most of the room, like someone just recording with a phone while walking around the room. That’s not actually new; creating hi-res static images from combined lower-resolution video frames has been done since the 1980s.

          So the idea is basically like photogrammetry, but instead of moving around the object and taking a large number of static pictures with significant overlap to allow comparing and “stitching” them together, a much smaller number of quasi-random images is sufficient, as each spatial image contains information allowing its exact position and rotation to be determined, and each pixel already comes with the proper depth information to turn it into a 3D object. That of course doesn’t work if you hold the camera still to capture a moving scene.
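
          As a rough sketch of that fusion idea, assuming each frame comes with a per-pixel depth map and a known camera pose (the pinhole intrinsics and the 4x4 camera-to-world pose format are assumptions for illustration):

          ```python
          import numpy as np

          def unproject(depth, fx, fy, cx, cy):
              """Turn a depth map into 3D points in camera coordinates (pinhole model)."""
              h, w = depth.shape
              u, v = np.meshgrid(np.arange(w), np.arange(h))
              x = (u - cx) * depth / fx
              y = (v - cy) * depth / fy
              return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

          def fuse_frames(frames, fx, fy, cx, cy):
              """Accumulate the points of many posed depth frames into one world-space cloud."""
              cloud = []
              for depth, pose in frames:  # pose: assumed 4x4 camera-to-world matrix per frame
                  pts_cam = unproject(depth, fx, fy, cx, cy)
                  pts_hom = np.concatenate([pts_cam, np.ones((len(pts_cam), 1))], axis=1)
                  cloud.append((pts_hom @ pose.T)[:, :3])  # move into a shared world frame
              return np.concatenate(cloud)
          ```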

          • Dragon Marble

            Understood. But that, I assume, would only work “accidentally”.

          • Christian Schildwaechter

            Depends on what you use for recording. We currently create lots of “accidental” scans of the room just by using a VR HMD, which uses them to create 3D maps of the environment. Like the first generations of Roombas, which didn’t actually follow a path or create a map and instead just randomly changed direction: do that often enough and you still get full coverage.

            Now very few people randomly film the room with their phone for a long time, and people do not tend to wear VR HMDs while memorable things are happening in the room. But assuming we one day get actual AR glasses that people wear all day, it should be pretty easy to generate enough accidental spatial images to later reconstruct the room in 3D with a very high level of detail. AR glasses would probably have 6DoF tracking, so they would actually know where they have already taken pictures and which directions are still missing, reducing the randomness.

            It would be interesting to see how much of a room one would capture by just starting the recording and then keeping the phone in a shirt pocket while cleaning. That would create mostly redundant data, trading less user effort against higher computational needs, but I’m usually fine with my computers doing more work if that means I have to do less.

      • Brian Elliott Tate

        Yes, exactly, that’s why I said what I did. Apple has yet to confirm that it’s actually “spatial video.” For all we know, it could just be stereoscopic 3D.

        Still waiting for Apple to give some actual details about what it is.

  • Ad

    Depth in photos has been a thing for a while; I can take them on my iPhone 11 with some dedicated apps, although they’re all a bit weird in how they function. There needs to be some NeRF-esque functionality to make it work right at new angles.

  • Foreign Devil

    F*** Apple’s renaming of tech jargon with their own cult speak. So the cameras can capture 3D stereoscopic video, or VR 180 video? That’s cool!

    • Ben Lang

      Unclear. From the article:

      “We use the ultrawide and main cameras together to create a three-dimensional video,” the company said during its announcement. But it isn’t clear if “three-dimensional” means stereoscopic footage with a fixed viewpoint, or some kind of depth projection with a bit of 6DOF wiggle room.

      Given that the iPhone 15 Pro cameras are so close together—not offering enough distance between the two views for straightforward stereo capture—it seems that some kind of depth projection or scene reconstruction will be necessary.

      Apple didn’t specifically say whether the phone’s depth-sensor was involved, but considering the phone uses it for other camera functions, we wouldn’t be surprised to find that it has some role to play. Curiously, Apple didn’t mention spatial photo capture, but ostensibly this should be possible as well.

      • Christian Schildwaechter

        TL;DR: It doesn’t capture 3D stereoscopic or VR 180 video, it captures spatial/true depth information, which is better and can be displayed as 3D stereoscopic or VR 180 video.

        The naming is somewhat arbitrary, but given that the high-end iPhones have included true-depth sensors for a while, already storing their information alongside the regular pixels, and that Apple uses “spatial video” to describe it, their “three-dimensional” should not mean “only” stereoscopic images with no actual depth information, but true spatial data, where the 2D pixels are associated with depth (at a much lower resolution).

        It should always be possible to create a stereoscopic/dual 2D image from a spatial/actual 3D image (with some limits), but not the other way around. Creating depth information from stereoscopic images boils down to comparing the edges of the two images, interpreting shifts between them as a result of parallax and deriving a depth from them. This only works with a sufficient distance between the cameras and enough contrast in the images to detect edge shifts, and therefore fails with objects of the same color, bright scenes lacking shadows, and curved objects.

        But if you have actual depth information, you don’t need the parallax, and therefore you also don’t need two camera images recorded at roughly the human IPD apart, or even a second image at all. You can turn a monoscopic spatial image into a 6DoF representation you can walk around in, with obviously the backside of everything missing. The only advantage of stereoscopy here would be that you get some pixels from the sides of objects that are right in front of the viewer, with one or both sides located between the eyes, but it comes with several disadvantages when trying to determine the depth of complex shapes.
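
        A naive sketch of how a second eye’s view could be synthesized from a single image plus depth, by unprojecting every pixel and reprojecting it from a camera shifted sideways by half an IPD (the intrinsics and shift are assumed values, and occlusion handling is omitted):

        ```python
        import numpy as np

        def synthesize_shifted_view(rgb, depth, fx, cx, shift_m=0.032):
            """Forward-warp an RGB-D image to a camera moved sideways by shift_m meters."""
            h, w, _ = rgb.shape
            out = np.zeros_like(rgb)
            u, v = np.meshgrid(np.arange(w), np.arange(h))
            # Unproject to camera space, move the camera, reproject.
            x = (u - cx) * depth / fx
            u_new = np.round((x - shift_m) * fx / np.maximum(depth, 1e-6) + cx).astype(int)
            valid = (depth > 0) & (u_new >= 0) & (u_new < w)
            out[v[valid], u_new[valid]] = rgb[valid]  # holes remain where nothing was recorded
            return out
        ```

        The holes in the warped image are exactly the occlusion limit discussed below: a single spatial image records nothing about what was hidden behind foreground objects.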

        So the iPhone most likely will not record stereoscopic video, but spatial video that can be displayed/projected on a stereoscopic display with the correct depth. I’d expect them to merge the images from the normal and wide-angle lenses instead of storing them both, possibly recovering some of the missing image information from the sides of small objects that way. In contrast to stereoscopic video, this spatial video should allow for a lot of “6DoF wiggle room” in situations where stereoscopic footage fails or creates a lot of depth artifacts because the images don’t contain enough visible edges to correctly reconstruct the parallax.

        Stereoscopy is somewhat “overrated” anyway, because we now associate stereoscopic VR with 3D vision. But the resolution of the eyes/displays only allows parallax to be perceived out to a few dozen meters; beyond that our depth perception relies on other cues like shadows, size, object occlusion, haze and others. The brain is easily fooled by optical illusions, because much of “seeing” is actually “interpreting based on experience”. You can use VR HMDs with one eye (closed); the ability to look around provides most of the actual immersion, and the stereoscopic view helps mostly with hand-eye coordination and estimating close, ambiguous distances.
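
        A back-of-the-envelope check on that range limit, assuming a 63 mm IPD and a headset resolving roughly 20 pixels per degree (both assumed round numbers):

        ```python
        import math

        ipd_m = 0.063            # assumed interpupillary distance
        pixels_per_degree = 20   # assumed angular resolution of the display
        smallest_angle_rad = math.radians(1 / pixels_per_degree)

        # The binocular disparity between a point at distance d and the far background
        # is roughly IPD / d radians; it drops below one display pixel at:
        max_useful_distance_m = ipd_m / smallest_angle_rad
        print(round(max_useful_distance_m, 1))  # ~72 m under these assumptions, i.e. a few dozen meters
        ```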

        • Dragon Marble

          “It should always be possible to create a stereoscopic/dual 2D image from a spatial/actual 3D image”.

          Is that so? I have my Quest on the table in front of me. I put one hand up such that it completely blocks the Quest from my left eye’s view, but my right eye can still see it fully. Now imagine I take a “spatial picture” from the left eye’s position. The Quest would be missing from the reconstructed 2D image because the camera doesn’t see it. And there could be even worse artifacts. A true stereoscopic image will not have this problem.

          The most compelling 3D images are those that not only have depth, but depth contrast. And that, I assume, is where the spatial pictures completely break down.

          • Christian Schildwaechter

            Human stereoscopic vision has a 75% overlap between the two eyes, or about 120°, meaning the situation where one eye sees an object and the other doesn’t only happens in the extreme periphery, beyond 120° FoV, where the cells on your retina can only detect rough brightness changes but not really see a sharp image. The same is true for stereoscopic imaging that is supposed to be seen by a human, making this pretty much a non-issue.

            So yes, it should always be possible to create a stereoscopic image from a spatial image, as the depth information allows the perspective shift between the rather closely spaced eyes to be calculated. This would only become a problem if we start building VR HMDs for prey species like deer, with eyes pointing to the sides, which sacrifice depth perception (only about 60° overlap) for a very large 310° FoV without moving the eyes.
            Bad for VR, but very good for spotting approaching tigers.

            And proper reproduction of depth contrast should be much easier with actual depth information than with an approximation derived from analyzing stereoscopic images, since the latter cannot reliably differentiate between color changes caused by lighting, depth, or actual object color; color spots may accidentally be interpreted as dents, causing a projection of the object in VR under different lighting to show surface artifacts.

            The resolution of current depth sensors is rather low, something like 16 bit over 20 m, so there is still enough room for error there too, meaning most solutions will fuse data from multiple sensors, e.g. the iPhone 15 Pro correcting the depth sensor data with a stereoscopic analysis of data from the regular and wide-angle lens sensors, and vice versa.

          • Dragon Marble

            No, it has nothing to do with FOV. It happens at the center of your vision. Try it: hold one hand up at arm’s length and try to read this post on your TV 6-8 feet away, closing one eye at a time, and see that each eye can only read half of the sentence. It’s simply because the two eyes are looking at it from different angles.

          • Christian Schildwaechter

            Okay, got it, you are talking about obstructing objects. That’s basically the edge case mentioned where stereoscopy can see the sides of / around the corners of an object placed between the eyes, while a single, centered camera cannot. You should get a similar effect on the Quest Pro, where the b/w cameras can see some parts that are hidden from the hi-res color camera, leading to a small colorless “shadow” surrounding objects.

            There is no direct way around physical obstruction, just like there is no way to record the backside of an object from the front. The only options are to add more distant cameras, live with the artifacts or construct a real time 3D map of the room and try to add information currently hidden from the camera that the eyes would be able to see from that 3D map.

  • ViRGiN

    I believe it’s going to be stereoscopic video, but nothing spectacular, just cardboard-cutout-style depth. Cool, but that isn’t anywhere near immersive video content.
    Rather pointless. They should have done some magic and made the phone capture near-360 pictures. It has always surprised me that no company ever pulled that off.

    I loved the first 360 cameras, but that trend has long since died down. Now they are just more universal cameras that can capture 360, but are mostly used to edit the footage down into flat content.

  • g-man

    Sounds cool, despite being artificially limited to the 15 Pro. I wonder how long it will be before the format is handled on other headsets. You know, ones people can actually afford.

  • Till Eulenspiegel

    The iPhone has LiDAR, can’t they use it to map the video in 3D like photogrammetry, but in video format? That way it will be truly spatial: it will have depth and you can move closer to the objects, unlike with current VR video.

    • Christian Schildwaechter

      That’s pretty much exactly what they already do with photos and what they will most likely do for video too. The depth information is basically just another (very low-res) video stream. The depth sensor in the iPhone 15 has a resolution of 140*170 pixels, most likely with 16 bit of depth information per pixel. The MP4 container format used for pretty much every type of video allows for several streams, so they don’t even need to define a new format.

      They can simply add the depth information as a secondary stream to normal videos. Most software will just ignore that second video stream, but spatially aware apps can use the extra depth information for numerous nifty things, including allowing users to move their head to look around. This is a significant advantage over stereoscopic video, which is neither compatible with regular screens nor allows for all the extra options of spatial video, e.g. removing everything further away than 3 m, or playing back only what happened directly in front of the camera with the current room still showing in the background.
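
      Purely as an illustration of the multi-stream container idea (not Apple’s actual format, and the filenames are hypothetical), a second video track can be muxed next to the color track with stock ffmpeg:

      ```python
      import subprocess

      # Mux a low-res depth video as a second track alongside the color track.
      # Players that don't understand the extra stream will simply ignore it.
      subprocess.run([
          "ffmpeg",
          "-i", "color.mp4",   # main RGB video
          "-i", "depth.mp4",   # per-frame depth encoded as a grayscale video
          "-map", "0:v:0",     # first output stream: color
          "-map", "1:v:0",     # second output stream: depth
          "-c", "copy",        # no re-encoding, just container-level muxing
          "spatial.mp4",
      ], check=True)
      ```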

      • Till Eulenspiegel

        A depth map is just a greyscale image for every frame of the video – even 8 bit is good enough, that’s 256 shades of grey. Apple can use an AI-upscaler to enlarge the depth map.

        The depth map can be used to generate geometry, then a projection-mapping technique can be applied to project the video onto that geometry. Similar to those light shows projected onto buildings.

        It takes a lot of processing to do this in real-time.
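
        A minimal sketch of that “depth map to geometry” step, treating the depth map as a grid of pixels unprojected through an assumed pinhole camera and connected into triangles (a real renderer would do this on the GPU):

        ```python
        import numpy as np

        def depth_to_mesh(depth, fx, fy, cx, cy):
            """Build vertices and triangle indices from a depth map (pinhole unprojection)."""
            h, w = depth.shape
            u, v = np.meshgrid(np.arange(w), np.arange(h))
            verts = np.stack([(u - cx) * depth / fx,
                              (v - cy) * depth / fy,
                              depth], axis=-1).reshape(-1, 3)

            # Two triangles per grid cell; the video frame can then be projected onto
            # this mesh as a texture using the original (u, v) pixel coordinates.
            idx = np.arange(h * w).reshape(h, w)
            a, b = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()
            c, d = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()
            faces = np.concatenate([np.stack([a, b, c], axis=1),
                                    np.stack([b, d, c], axis=1)])
            return verts, faces
        ```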

        • Christian Schildwaechter

          Depth maps used in computer graphics are 8-bit by convention; the resolution is enough because the object the depth map applies to also has its own world coordinate. So a far-away landscape and a very close wood carving will both look sufficiently detailed.

          But the data from the depth sensor is always in world coordinates, so if you have a sensor covering up to 20 m and the scale is linear, the smallest depth difference you can express with the 256 possible 8-bit values is 78 mm, about the width of a hand, which would make it pretty useless for detecting finger positions for hand tracking.

          To increase the resolution for hand tracking, you can either increase the bit count or switch to a logarithmic scale. Switching to a logarithmic scale for 8-bit PCM audio signals with e.g. µ-law increases the perceived dynamic range to that of ~12-bit PCM, though it is not clear how well that would translate to depth values, as our visual perception works very differently from hearing. With a perceived 12 bit you’d basically get fine enough finger resolution for tracking, but a basketball at the other side of the room would become a cube.
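
          The trade-off can be made concrete with a quick comparison of the quantization step size for an 8-bit code over an assumed 0.1-20 m range, linear vs. logarithmic (the numbers are illustrative, not from any sensor datasheet):

          ```python
          import numpy as np

          near, far, levels = 0.1, 20.0, 256   # assumed sensor range and an 8-bit code

          def step_at(d, encode, decode):
              """Depth error of a single quantization step around distance d."""
              code = np.round(encode(d))
              return abs(decode(code + 1) - decode(code))

          # Linear: equal ~78 mm steps everywhere across the 20 m range.
          lin_enc = lambda d: (d - near) / (far - near) * (levels - 1)
          lin_dec = lambda c: near + c / (levels - 1) * (far - near)

          # Logarithmic: fine steps up close (hands), coarse steps far away (basketball).
          log_enc = lambda d: np.log(d / near) / np.log(far / near) * (levels - 1)
          log_dec = lambda c: near * (far / near) ** (c / (levels - 1))

          for d in (0.5, 2.0, 15.0):
              print(d, round(step_at(d, lin_enc, lin_dec), 3), round(step_at(d, log_enc, log_dec), 3))
          ```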

          AFAIK depth sensors therefore do both: store values on a logarithmic scale, because the depth information is most relevant for close objects that we can grab, and increase the stored number of bits. Information about that is astonishingly hard to find; e.g. Sony specified the exact X and Y resolution of the IMX611 sensor used in the iPhone 15 Pro, but didn’t say a word about the Z resolution, which I would assume is pretty relevant for a depth sensor. So all I have is the statement that the “typical” resolution for depth sensors is 10-16 bit, which led to my “with most likely 16 bit of depth information”, as according to Sony the IMX611 has top-of-the-class photon detection efficiency, which should allow both for working better under worse conditions and for finer resolution.

          The projection of spatial video would work somewhat differently. What you described is the typical projection of textures onto geometry, but such a texture is assumed to be a flat, top-down image. In spatial video, the non-depth pixel information is already the result of a projection, so it is distorted. You can do it sort of the other way around: extract the actual geometry from the depth information, then calculate what flat texture would have been needed to create the projection that was recorded in the video, thereby allowing you to e.g. move the camera upwards and look down on an object originally recorded from the front, though this would create a lot of artifacts for larger movements, because the spatial video has no information about anything occluded in the front view.

  • ViRGiNCRUSHER

    Apple saying they care about the environment… while having yearly releases and non-replaceable batteries…