This post will be a quick but comprehensive summary of the technology behind AR and some of its applications as of September 2020. We will investigate Microsoft HoloLens, Facebook Spark AR, Google ARCore, and Apple ARKit. After all, that’s what we’ve mostly worked with since 2016 at R2U (at the time called Real2U).
Before proceeding, it’s important to make it clear what exactly counts as AR technology. Briefly speaking, Augmented Reality is a combination of computer vision and motion-processing techniques that fuse information from the device’s camera and sensors in order to overlay 2D or 3D content on the real world. The creators of Pokémon Go have one of the best definitions I could find:
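To make the “camera plus sensors” idea concrete, here is a minimal, hypothetical sketch of sensor fusion: a complementary filter that blends a gyroscope’s integrated angle (accurate short-term, drifts long-term) with an accelerometer’s gravity-based tilt (noisy, but drift-free). The 0.98 weight is purely illustrative, not taken from any particular SDK.

```python
ALPHA = 0.98  # weight given to the gyro estimate (illustrative value)

def complementary_filter(prev_angle, gyro_rate, accel_angle, dt):
    """Return a fused tilt angle in degrees.

    prev_angle:  previous fused estimate (degrees)
    gyro_rate:   angular velocity from the gyroscope (degrees/second)
    accel_angle: tilt computed from the accelerometer (degrees)
    dt:          time step (seconds)
    """
    gyro_angle = prev_angle + gyro_rate * dt  # integrate the gyro
    # Blend: trust the gyro short-term, let the accelerometer
    # slowly pull the estimate back and cancel the drift.
    return ALPHA * gyro_angle + (1 - ALPHA) * accel_angle
```

Real AR stacks fuse many more signals (visual-inertial odometry), but the principle of combining complementary sensors is the same.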
Most of the augmented reality technology today actually comes from the Artificial Intelligence [Robotics] background, because for Augmented Reality there’s the reality part — you need to understand reality in order to augment it (Ross Finman, Head of AR, Niantic)
Perhaps one of the first consumer applications of computer vision in augmented reality, Target Tracking was developed 10 years ago as a way to place 3D content on top of a 2D plane with specific characteristics, such as a QR Code, often referred to as a Marker.
Image Tracking (also called Image Anchor or Augmented Image) is the evolution of the QR Code. Instead of placing an object on top of a dull black-and-white marker, we can now use any image we want, provided it has enough feature points so that it is detected with clarity by the device.
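As an illustration of why feature points matter, here is a toy, deliberately naive corner counter in pure Python. Real trackers use far more robust detectors (FAST, ORB, and the like), but the core idea of requiring strong intensity gradients in both directions is the same:

```python
def count_feature_points(img, threshold=50):
    """Count pixels that look corner-like: strong gradients in BOTH axes.

    img: 2D list of grayscale values (0-255).
    threshold: minimum gradient magnitude to count (illustrative value).
    """
    h, w = len(img), len(img[0])
    count = 0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = abs(img[y][x + 1] - img[y][x - 1])  # horizontal gradient
            gy = abs(img[y + 1][x] - img[y - 1][x])  # vertical gradient
            if gx > threshold and gy > threshold:
                count += 1
    return count
```

A flat or low-contrast image yields few such points, which is exactly why it makes a poor Image Anchor.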
Spatial mapping provides a detailed representation of real-world surfaces in the environment around the user. This is more evident in powerful devices such as the HoloLens, where the world mesh is comprehensive enough to allow complex applications such as physics simulation and interaction of virtual objects with the real world.
Spatial Awareness or Scene Geometry is real-world environmental awareness in augmented reality applications. It takes spatial mapping a step further by actually understanding what has been mapped and attaching some kind of label to the objects encountered.
Now that we can map and understand the world around us, let’s keep track of horizontal and vertical planes, such as floors and walls, so that we can place virtual objects there and be sure they will stay fixed relative to the real world as the device moves.
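Placing an object on a detected plane usually boils down to casting a ray from the camera and intersecting it with the plane. A minimal Python sketch, assuming a horizontal plane at a known height:

```python
def place_on_plane(ray_origin, ray_dir, plane_y):
    """Intersect a camera ray with a horizontal plane at height plane_y.

    Returns the (x, y, z) hit point, or None if the ray is parallel
    to the plane or points away from it.
    """
    ox, oy, oz = ray_origin
    dx, dy, dz = ray_dir
    if abs(dy) < 1e-9:
        return None  # ray parallel to the plane
    t = (plane_y - oy) / dy
    if t < 0:
        return None  # plane is behind the camera
    return (ox + t * dx, oy + t * dy, oz + t * dz)
```

SDK raycast APIs do essentially this against every detected plane (and, with scene meshes, against arbitrary geometry).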
After mapping our surroundings with a device, we can track any arbitrary position and rotation in that space, not only planes. That specific pose will be used as an anchor, just like our old friend the QR Code, to store where the virtual object is located.
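A sketch of the anchor idea in Python, assuming we store the object’s offset in the anchor’s local frame and, for simplicity, only a yaw rotation (real SDKs use full quaternions or 4x4 transform matrices):

```python
import math

def anchor_to_world(anchor_pos, anchor_yaw, local_offset):
    """Convert an offset stored relative to an anchor into world space.

    anchor_pos:   (x, y, z) anchor position in world space
    anchor_yaw:   anchor rotation around the vertical axis, in radians
    local_offset: (x, y, z) offset of the object in the anchor's frame
    """
    lx, ly, lz = local_offset
    c, s = math.cos(anchor_yaw), math.sin(anchor_yaw)
    # Rotate the offset by the anchor's yaw, then translate.
    wx = anchor_pos[0] + c * lx + s * lz
    wy = anchor_pos[1] + ly
    wz = anchor_pos[2] - s * lx + c * lz
    return (wx, wy, wz)
```

When tracking refines the anchor’s pose, re-running this transform is what keeps the virtual object glued to the same real-world spot.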
Cloud Anchors (also called Shared Experiences) are Spatial Anchors powered by the cloud. They enable the interaction between multiple devices and platforms, allowing HoloLens, iPhones, and Android devices to see the same mixed reality world.
Popularized by Snapchat Landmarkers in 2019, Location Anchors are when Augmented Reality experiences are tied to a specific geolocation. They are now coming to ARKit 4, so we’ll probably see more applications of this tech in the near future.
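A hypothetical trigger for a Location Anchor can be as simple as a great-circle distance check against the landmark’s coordinates; here is a sketch using the haversine formula (the 50 m radius is an arbitrary example, not from any SDK):

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in metres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def should_show_experience(user, landmark, radius_m=50.0):
    """Activate the AR experience when the user is within radius_m of it."""
    return haversine_m(*user, *landmark) <= radius_m
```

Production systems refine the coarse GPS fix with visual localization, but geofencing like this is typically the first gate.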
Object Tracking is very similar to Image Tracking in the sense that you first need to scan a physical marker in order for it to be recognized by the device. This time, the marker is a 3D object in the real world, which can be detected after its spatial features are recorded.
Beginning with Hand Tracking, we exit the realm of positioning 2D or 3D content on top of static/fixed things such as floors and walls and start interacting with the human body. From here on, the technology becomes much more platform-dependent and we see a big gap in feature support between different SDKs.
For example, while Microsoft’s HoloLens can map your hand to a great degree of accuracy (which can even be boosted with external sensors), Facebook’s SparkAR only recognizes the open palm of your hand facing the camera.
In order to provide an accurate interpretation of Hand Gestures, you first need precise tracking of hand movement. This is why only the HoloLens currently has good support for this technology, allowing the user to touch, grab, point to, and focus on a target.
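A toy version of one such gesture, the pinch, can be classified from tracked fingertip positions alone. This sketch assumes positions in metres and an illustrative 3 cm threshold:

```python
import math

def is_pinching(thumb_tip, index_tip, threshold=0.03):
    """Classify a pinch from two tracked fingertip positions.

    thumb_tip, index_tip: (x, y, z) positions in metres.
    threshold: fingertips closer than this count as a pinch
               (3 cm is an illustrative value).
    """
    return math.dist(thumb_tip, index_tip) < threshold
```

Richer gestures (grab, point) are built the same way, from geometric relationships between tracked joints, which is why tracking accuracy gates gesture quality.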
While mobile devices have generally poor support for Hand Tracking capabilities, they surely excel when it comes to Face Tracking. Facebook’s SparkAR is perhaps the one with the largest userbase, surpassing even Apple and Google’s Augmented Faces, which are more restricted in terms of device support.
In the same way that it’s possible to understand Hand Gestures, we can also understand Face Gestures. The Facebook SDK can tell you when a user blinks, raises or lowers their eyebrows, moves their head, or opens their mouth. It can even understand the general context of the facial expression and tell whether you have a happy, kissing, smiley, or surprised face, something we could maybe call Face Awareness, borrowing and extending the definition from Spatial Awareness.
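For instance, a blink can be derived from a handful of eye landmarks via an aspect-ratio test. This is a simplified sketch (real SDKs use many more landmarks per eye), with an illustrative 0.2 threshold:

```python
def eye_aspect_ratio(upper, lower, left, right):
    """Ratio of eye opening height to eye width from four 2D landmarks."""
    height = abs(upper[1] - lower[1])
    width = abs(right[0] - left[0])
    return height / width if width else 0.0

def is_blinking(upper, lower, left, right, threshold=0.2):
    """An eye that is much wider than it is open counts as closed."""
    return eye_aspect_ratio(upper, lower, left, right) < threshold
```

Expression classification (“happy”, “surprised”) is typically a learned model on top of such landmark geometry rather than hand-written rules.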
Perhaps more common in the VR world than in AR, Eye Tracking provides developers with the ability to use information about what the user is looking at. On HoloLens 2, you can track the attention heatmap of the eye movement and even interact with the application by just looking at the UI.
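An attention heatmap like the one mentioned above can be sketched by binning gaze samples into a grid. A minimal Python version, assuming gaze points already normalized to screen coordinates in [0, 1):

```python
def gaze_heatmap(gaze_points, grid_w, grid_h):
    """Accumulate normalized gaze samples into a grid of hit counts.

    gaze_points: iterable of (x, y) with coordinates in [0, 1).
    Returns a grid_h x grid_w grid; hotter cells got more attention.
    """
    grid = [[0] * grid_w for _ in range(grid_h)]
    for x, y in gaze_points:
        col = min(int(x * grid_w), grid_w - 1)
        row = min(int(y * grid_h), grid_h - 1)
        grid[row][col] += 1
    return grid
```

Gaze-driven interaction ("look to select") is then a matter of checking whether the current sample falls inside a UI element’s bounds for long enough.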
Body Tracking can be used to follow a person in the physical environment and visualize their motion by applying the same body movements to a virtual character. The current number of joints detected is much lower than what we see on the HoloLens hand tracking, but it’s good enough for most consumer applications.
SparkAR is the platform that has popularized Background Segmentation both on Facebook and on Instagram. It’s like having a green screen behind you in the real world and then applying any kind of animation to the environment.
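Under the hood, segmentation effects boil down to per-pixel compositing with a person mask produced by the segmentation model. A toy sketch with single-channel pixels:

```python
def composite(frame, background, mask):
    """Keep person pixels, replace everything else with a new background.

    frame, background: 2D grids of pixel values (same dimensions).
    mask: 2D grid where 1 marks a person pixel and 0 marks background.
    """
    return [
        [frame[y][x] if mask[y][x] else background[y][x]
         for x in range(len(frame[0]))]
        for y in range(len(frame))
    ]
```

The “green screen” analogy is apt: the mask plays the role of the chroma key, except it is predicted per frame instead of keyed by color.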
An Occlusion is an event that occurs when one object is hidden by another object that passes between it and the observer. In Augmented Reality, it happens when a 3D object blends seamlessly in the environment where it’s been placed. Depending on the platform, it can be achieved through a variety of different technologies, such as with Depth APIs, Depth Maps, or Depth Images.
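Whatever the source of depth, the core of occlusion is a per-pixel depth comparison between the virtual render and the sensed real world. A minimal sketch:

```python
def should_draw_virtual(virtual_depth, real_depth):
    """Draw a virtual pixel only if nothing real is closer to the camera."""
    return virtual_depth <= real_depth

def occlude(virtual_depths, real_depths):
    """Per-pixel visibility mask: True where the virtual object shows.

    virtual_depths: 2D grid of the virtual object's depth per pixel.
    real_depths:    2D grid from the depth map/API, same dimensions.
    """
    return [
        [should_draw_virtual(v, r) for v, r in zip(vrow, rrow)]
        for vrow, rrow in zip(virtual_depths, real_depths)
    ]
```

In practice this runs in the renderer’s depth test, with the sensed depth written into the depth buffer before virtual content is drawn.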
People Occlusion covers an app’s virtual content with people perceived in the camera feed. It is currently only available on iOS, on the iPhone XS or more recent devices. So, pretty powerful, but not widely used yet.
I don’t really like to say that Instant Placement (or Instant AR) is a new technology per se, since it isn’t something you actually use in your augmented reality application. It’s rather the technological progress of AR itself (meaning computer vision and sensor fusion), which will inevitably get better over time.
In 2018, we were happy to wait 10 seconds for a virtual chair to appear on the floor, but now we complain if it takes more than 3 seconds. It will eventually be as fast as lightning, so do we really need yet another fancy name every time technology improves? Apple and Google think we do.
Light Estimation APIs on AR applications analyze a given image for discrete visual cues (such as shadows, ambient light, shading, specular highlights, and reflections) and provide detailed information about the lighting in a given scene. Developers can then use this information when rendering virtual objects to light them under the same conditions as the scene they’re placed in, making these objects feel more realistic and enhancing the immersive experience for users.
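A drastically simplified sketch of the ambient-intensity part: average the frame’s luminance with the standard luma weights, then use it to tint a virtual object’s base color. Real APIs also estimate light direction, color temperature, and even spherical-harmonic environment probes:

```python
def estimate_ambient_intensity(pixels):
    """Average luminance of a camera frame, normalized to [0, 1].

    pixels: flat list of (r, g, b) tuples with 0-255 channels.
    Uses the Rec. 601 luma weights (0.299, 0.587, 0.114).
    """
    if not pixels:
        return 0.0
    total = sum(0.299 * r + 0.587 * g + 0.114 * b for r, g, b in pixels)
    return total / (len(pixels) * 255.0)

def lit_color(albedo, intensity):
    """Scale a virtual object's base color by the estimated intensity."""
    return tuple(round(c * intensity) for c in albedo)
```

So a virtual chair placed in a dim room gets rendered darker, which is a large part of what makes it “sit” in the scene.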
Spatial Sound breathes life into holograms and gives them a presence in the world, so that if users happen to lose sight of the virtual objects, they can find them with the help of sounds positioned in 3D space. This is still restricted to headsets such as the HoloLens, but hopefully one day we’ll see more integration with glasses and other wearable devices.
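A toy sketch of the underlying math, combining inverse-distance attenuation with a simple left/right pan based on the source’s horizontal offset. Both formulas are illustrative; real spatial audio engines use head-related transfer functions (HRTFs) instead of plain panning:

```python
import math

def spatial_gains(listener_pos, source_pos):
    """Stereo gains for a sound source at a 3D position.

    Returns (left_gain, right_gain): louder overall when the source
    is near, and biased toward the ear the source is closer to.
    """
    dx = source_pos[0] - listener_pos[0]
    dist = max(math.dist(listener_pos, source_pos), 1.0)
    attenuation = 1.0 / dist              # inverse-distance falloff
    pan = max(-1.0, min(1.0, dx / dist))  # -1 = full left, +1 = full right
    left = attenuation * (1.0 - pan) / 2.0
    right = attenuation * (1.0 + pan) / 2.0
    return (left, right)
```

This is what lets a user turn toward an off-screen hologram just by hearing which ear it is louder in.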
Augmented Reality is very much a reality in our daily lives. Whether we shop for furniture online or we use face filters on Instagram, it happens so seamlessly that we barely notice what’s going on under the hood. The idea of this post was to provide an overview of some of its applications, so please let me know in the comments if something’s missing 🚀