How Apple Performs Person Recognition Without Photos Leaving Your iPhone
Learn how contrastive learning, embeddings, and clustering recognize people across diverse demographics.
To read more on this topic, see the references section at the bottom.
With 2 billion active devices, Apple has cemented its place in the consumer marketplace.
More than 4.7 billion photos are taken daily, with the average person taking around 20 photos [5].
With so many photos, Apple and Google have developed machine learning-powered features to help users navigate the photos captured on their smart devices.
0. The Rise of Smartphone Photography
Smartphones are the go-to device for photography enthusiasts because they offer convenience and an entire photo studio in your pocket. You can snap, edit, and share photos instantly.
Smartphones make photography effortless, leading to a massive amount of user-generated content. To handle this data effectively, machine learning (ML) powered features need to be robust and on the cutting edge of tech.
Apple and Google have built features to automatically organize photos, create albums, and remove duplicates. One such feature that uses ML at its core is recognizing people using visual cues from images.
1. Scoping The Problem
Apple uses on-device machine learning to recognize people in your photos, powering exploration and automation features such as:
Browse by Person: Tap a person's face in a photo to see all their pictures.
Search by Name: Type a person's name to find photos of them.
Memories: Automatically create videos showcasing moments with important people in your life.
All this happens privately and on the device, keeping your photos secure.
Several factors make this a hard problem:
Varying Photographic Styles: People capture moments differently depending on age and culture. Younger users are more likely to use photos for self-expression and sharing on social media, while older users use photos more for personal archiving [6]. The algorithm needs to work equally well on selfies and family photos.
Changing Appearances: A more challenging part is the varying appearance of the same subject across photos. A single subject can appear in different poses, sizes, lighting, clothing, hairstyles, and makeup. With multiple subjects involved, there are even more moving parts. Visual appearance also varies widely with age, gender, and skin color, making the problem harder.
Benchmarks mimicking real-world usage: With billions of users worldwide, Apple can't test the system on everyone. Instead, they create "benchmarks": curated sets of photos representing diverse demographics and usage patterns. Think of a college student's photo library compared to that of a parent of three young children. These benchmarks ensure the system works well across user groups.
Apart from these, since Apple is serious about user privacy, they have additional constraints on the deployed solution:
Private, on-device, efficient: A model that is lightweight and efficient, with minimal memory and battery consumption.
Scalable: Works well on both small and large numbers of images in a user’s library.
Incorporate user feedback: No system is perfect, but its accuracy can be improved with a human in the loop: users can provide manual labels or correct wrong suggestions made by the model. This additional supervision is worth its weight in gold.
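To make the feedback idea concrete, here is a minimal sketch of how user corrections could update person clusters. This is a toy illustration, not Apple's implementation: the class, method names, and the merge-on-same-name rule are all assumptions.

```python
class PersonClusters:
    """Toy store of person clusters (cluster id -> set of photo ids).

    Illustrative only: the data layout and merge rule are assumptions,
    not a description of Apple's actual on-device system.
    """

    def __init__(self):
        self.clusters = {}  # cluster_id -> set of photo ids
        self.names = {}     # cluster_id -> user-assigned name

    def name_cluster(self, cluster_id, name):
        """Record a user-provided label for a cluster.

        If the user gives two clusters the same name, treat that as
        ground truth that they are the same person and merge them.
        Returns the id of the cluster that now holds the photos.
        """
        for cid, existing in list(self.names.items()):
            if existing == name and cid != cluster_id:
                self.clusters[cid] |= self.clusters.pop(cluster_id, set())
                return cid
        self.names[cluster_id] = name
        return cluster_id

    def reject_photo(self, cluster_id, photo_id):
        """A user correction: remove a wrongly suggested photo."""
        self.clusters.get(cluster_id, set()).discard(photo_id)
```

The key design point is that explicit labels outrank the model's own similarity scores: a user merge or rejection is treated as ground truth the clustering must respect.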
2. Faces and Beyond
The ML system looks not just at faces but also at bodies. Object detection is performed to find (1) the face and (2) the upper body of every subject in a photo [3].
The upper body provides strong visual cues when the face is occluded or turned away from the camera. Assuming consistency of clothing and appearance cues (such as a beard, hairstyle, headgear, etc.), the "appearance" of the upper body helps recognize and connect the same subject across multiple photos.
Faces and upper bodies are matched using overlap between bounding box pairs.
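A minimal sketch of this pairing step, using intersection-over-union (IoU) as the overlap measure. The overlap metric, the greedy matching strategy, and the threshold value are assumptions for illustration; the paper [3] only states that overlap between bounding box pairs is used.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_faces_to_bodies(faces, bodies, threshold=0.1):
    """Greedily pair each face box with the best-overlapping upper-body box.

    The greedy strategy and the threshold are illustrative assumptions.
    Returns a dict mapping face index -> body index.
    """
    matches = {}
    used = set()
    for i, face in enumerate(faces):
        best_j, best_score = None, threshold
        for j, body in enumerate(bodies):
            if j in used:
                continue
            score = iou(face, body)
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None:
            matches[i] = best_j
            used.add(best_j)
    return matches
```

Since a face sits inside its subject's upper-body region, a face box and its matching body box overlap substantially, while boxes belonging to different subjects typically do not.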