In recent years, increasingly mature facial recognition technologies have come to significantly outperform humans at identifying faces, and they have been widely used in projects such as "smart city" and "safe city". In practical applications, however, cameras cannot always capture clear images of faces. In addition, cameras have limited range, and in real-world scenarios the areas captured by multiple cameras often do not overlap.
Therefore, it becomes necessary to identify and find a person using information about his or her whole body - tracking a person across cameras by using the overall features of the person as an important supplement to facial information. Thus, scientists in the field of computer vision have gradually begun their study on "ReID" technology.
ReID: Enormous Practical Significance and Reliance on Manual Labor
As its name would suggest, Person Re-Identification (ReID) is the re-identification of persons by establishing correspondence between images of persons captured by different cameras that have no overlapping views. When the areas captured by different cameras do not overlap, it will be much more difficult to perform a retrieval due to a lack of sequential information. Therefore, ReID emphasizes the retrieval of a specific person in videos captured by different cameras.
ReID compares the features of a person from an image with the features of another person in different images, and determines whether they are the same person.
If person detection is to determine whether there is a person in an image, then ReID requires a machine to recognize all images of a particular person shot by different cameras. Specifically, it is a person comparison technology implemented based on the overall features of a person by finding one or more images of a person based on a given image of such person.
ReID has wide applications in criminal investigation in public security and image retrieval. In addition, ReID can help mobile phone users achieve image clustering and help retailers and supermarkets get customer trajectories and create commercial value. However, the precision of ReID is not high enough to be profitable currently, and much work still needs to be done manually.
Breaking the Industry Record of ReID and Surpassing Human Experts for the First Time
Research on ReID is very challenging due to the uncertainty in time and location when images are captured. In addition, differences in lighting, camera angle, and pose, as well as occlusion, strongly affect recognition accuracy.
Thanks to the development of deep learning in recent years, ReID has become technologically mature. For the two most commonly used ReID test sets, Market1501 and CUHK03, the rank-1 identification accuracy has reached 89.9% and 91.8% respectively.
However, there is still a gap between these results and those achievable by human beings.
To test the ReID ability of human beings, the researchers assembled 10 professional labelers to carry out the test. Experiments show that the rank-1 accuracy of a skilled labeler on Market1501 and CUHK03 can reach 93.5% and 95.7%, respectively. This is a level that the ReID methods described above could not achieve.
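The rank-1 accuracy quoted above can be illustrated with a minimal sketch: for each query image, the gallery image with the smallest feature distance is retrieved, and the query counts as a hit if that image shows the same person. The function names and toy features below are hypothetical, and the sketch omits the per-camera filtering that the benchmark protocols apply.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank1_accuracy(query_feats, query_ids, gallery_feats, gallery_ids):
    """Fraction of queries whose nearest gallery image has the same person ID.

    Simplified: real benchmarks also exclude gallery images taken by the
    same camera as the query, which is not modeled here.
    """
    hits = 0
    for feat, pid in zip(query_feats, query_ids):
        # Index of the gallery image closest to this query.
        nearest = min(range(len(gallery_feats)),
                      key=lambda g: euclidean(feat, gallery_feats[g]))
        if gallery_ids[nearest] == pid:
            hits += 1
    return hits / len(query_feats)

# Toy data (made up for illustration): three gallery images, three queries.
gallery_feats = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
gallery_ids = [1, 2, 3]
query_feats = [[0.1, 0.0], [4.8, 5.1], [1.2, 0.9]]
query_ids = [1, 3, 1]  # the third query is closest to person 2, so it is a miss

accuracy = rank1_accuracy(query_feats, query_ids, gallery_feats, gallery_ids)
```

In this toy example two of the three queries retrieve the correct person first, so the rank-1 accuracy is two thirds.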
Not long ago, Face++ (Megvii) made exciting progress in this research. In a paper titled "AlignedReID", published by the research team at Megvii's research institute, the authors proposed a new approach characterized by Dynamic Alignment, Mutual Learning and Re-Ranking, which raised the rank-1 accuracy of machines on Market1501 and CUHK03 to 94.0% and 96.1% respectively. This is also the first time that machines have outperformed human experts in ReID, setting a record in the industry.
Machines have surpassed human beings in the more complex field of ReID in addition to facial recognition! This offers powerful technology that can be used to comprehend human images or videos.
Sun Jian, chief scientist and head of Megvii, said: "With the revival of deep learning methods in recent years, we have seen that machines have surpassed human beings in solving more and more image perception issues, from facial recognition in 2014 to ImageNet image classification in 2015. I remember that, not long ago, when I talked with my mentor, Dr. Shen Xiangyang (former Global Executive Vice President of Microsoft), I boasted that most perception issues would be resolved in 5-10 years. Today, I am very pleased to see another image perception issue, which is difficult and has great potential for application, has been solved by the algorithm developed by the Megvii team."
Multiple Networks Automatically Learning the Alignment of Human Features and Learning from Each Other
So how did the author achieve this?
Similar to other ReID methods based on deep learning, the author also used a deep convolutional neural network to extract features and used Triplet Loss after Hard Sample Mining as the loss function, and took the Euclidean distance of features as the similarity of two images.
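As a rough illustration of a triplet loss with hard sample mining, the sketch below picks, for every anchor in a batch, the farthest image of the same person and the closest image of a different person, using the Euclidean distance of features. This is the common "batch hard" variant and is an assumption here, as are the function names and the margin value of 0.3; the article does not give the exact formulation.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def batch_hard_triplet_loss(feats, ids, margin=0.3):
    """Mean triplet loss over a batch with hardest-positive/negative mining.

    margin=0.3 is an illustrative hyperparameter, not a value from the article.
    """
    losses = []
    for i, (anchor, pid) in enumerate(zip(feats, ids)):
        # Distances to other images of the same person (positives).
        pos = [euclidean(anchor, f)
               for j, (f, p) in enumerate(zip(feats, ids)) if p == pid and j != i]
        # Distances to images of other persons (negatives).
        neg = [euclidean(anchor, f) for f, p in zip(feats, ids) if p != pid]
        if not pos or not neg:
            continue  # no valid triplet for this anchor
        # Hardest positive is the farthest; hardest negative is the closest.
        losses.append(max(0.0, margin + max(pos) - min(neg)))
    return sum(losses) / len(losses) if losses else 0.0

# Toy batch: two persons, two images each, placed on a line.
well_separated = batch_hard_triplet_loss(
    [[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]], [1, 1, 2, 2])
interleaved = batch_hard_triplet_loss(
    [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0], [3.0, 0.0]], [1, 1, 2, 2])
```

When the two persons are well separated the loss is zero; when their images interleave, every anchor violates the margin and the loss is positive.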
The difference is that the author considered the alignment of the human body when measuring the similarity of images. Others had considered this before, for example by dividing the human body into the head, the torso, and the legs, or by estimating the human skeleton and aligning the body based on it. However, the latter approach introduces another difficult problem or requires additional labeling. The idea of the author of AlignedReID [1] is to introduce an end-to-end approach that allows the network to automatically learn how to align the human body to improve performance.
In AlignedReID, deep convolutional neural networks extract both global features and local information. The distance between any pair of local information in two images is calculated to generate a distance matrix. Then the shortest path from the upper left corner to the lower right corner of the matrix is calculated through dynamic programming. An edge of the shortest path corresponds to the matching of a pair of local features, which gives a way of aligning the human body. The total distance of this alignment is the shortest when ensuring the relative order of the different parts of the body. During training, the length of the shortest path is added to the loss function to aid in the study of the overall features of a person.
As shown in the figure, some edges of this shortest path are redundant, such as the first edge in the figure. Why not just look for the matching edges? The author explained, "Local information should not only be matched; the alignment of the entire human body should also be taken into account. In order to match the human body from head to foot, it is necessary to have some redundant matchings. In addition, the local distance function is designed so that these redundant matchings contribute little to the length of the entire shortest path."
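The shortest-path alignment described above can be sketched as a small dynamic program over the local distance matrix, moving only right or down so that the top-to-bottom order of body parts is preserved. The function names are hypothetical. The squashing of each local distance into the range [0, 1) with tanh(d / 2) is an assumption made here to show how a bounded local distance keeps poorly matched (redundant) pairs from dominating the path length; the article does not state the exact function.

```python
import math

def local_distance_matrix(parts_a, parts_b):
    """Pairwise distances between local features of two images.

    Each raw Euclidean distance d is squashed with tanh(d / 2), an assumed
    choice that bounds every entry below 1.
    """
    def euclidean(u, v):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
    return [[math.tanh(euclidean(pa, pb) / 2.0) for pb in parts_b]
            for pa in parts_a]

def shortest_path_distance(dist):
    """Length of the cheapest path from the top-left to the bottom-right
    cell of `dist`, moving only right or down."""
    rows, cols = len(dist), len(dist[0])
    best = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            if i == 0 and j == 0:
                best[i][j] = dist[i][j]
            elif i == 0:
                best[i][j] = best[i][j - 1] + dist[i][j]
            elif j == 0:
                best[i][j] = best[i - 1][j] + dist[i][j]
            else:
                best[i][j] = min(best[i - 1][j], best[i][j - 1]) + dist[i][j]
    return best[-1][-1]

# Toy 2x2 distance matrix: the diagonal pairs match well.
toy_length = shortest_path_distance([[0.1, 0.9], [0.8, 0.2]])

# Two images with identical local features (head, legs), made up for illustration.
same_parts = [[0.0, 0.0], [1.0, 0.0]]
aligned_length = shortest_path_distance(
    local_distance_matrix(same_parts, same_parts))
```

Because the path may only move right or down, a path through a square matrix must include some off-diagonal cells, which is why redundant matchings appear even when the body parts line up perfectly, as in the second example.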
In addition to automatically aligning the body structure during training, the author also mentioned that the precision of the model can be effectively improved by training two networks simultaneously and making them learn from each other. This training method is common in classification problems, and the author made some improvements so that it can be applied to Metric Learning.
In the training process shown in the figure above, both networks trained at the same time include a branch for classification and a branch for Metric Learning. The two branches for classification learn from each other through KL divergence; the two branches for Metric Learning learn from each other through the metric mutual loss proposed by the author. As mentioned above, the branch for metric learning consists of two sub-branches, one for global features and one for local features. Interestingly, once training is completed, both the classification branch and the local-feature sub-branch are discarded, and only the sub-branch for global features is retained for ReID. In other words, both person classification and the learning of local features through the alignment of the human body serve to obtain better global features of images.
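To make the mutual learning of the two classification branches concrete, here is a minimal sketch of a KL-divergence term between the class probabilities predicted by two networks for the same image. The direction of each term (each network is pulled toward the other's prediction) follows the usual mutual-learning setup and is an assumption, as are the function names; the metric mutual loss proposed by the author is not reproduced here.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mutual_classification_losses(logits_a, logits_b):
    """Return (loss for network A, loss for network B).

    Assumed convention: A is penalized by KL(p_b || p_a), B by KL(p_a || p_b).
    In training, each term would be added to that network's own loss.
    """
    p_a, p_b = softmax(logits_a), softmax(logits_b)
    return kl_divergence(p_b, p_a), kl_divergence(p_a, p_b)

# Networks that agree incur no mutual loss; networks that disagree do.
agree_a, agree_b = mutual_classification_losses([2.0, 0.0, 0.0], [2.0, 0.0, 0.0])
differ_a, differ_b = mutual_classification_losses([2.0, 0.0, 0.0], [0.0, 2.0, 0.0])
```

The loss is zero when both networks predict the same distribution and grows as their predictions diverge, which is what lets each network act as a soft teacher for the other.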
Finally, the author also used k-reciprocal encoding, proposed in earlier work, to re-rank the retrieval results.
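The core idea behind k-reciprocal re-ranking can be sketched as follows: two images are k-reciprocal neighbors if each appears in the other's k nearest neighbors, and images whose k-reciprocal neighbor sets overlap heavily are treated as more likely to show the same person. The sketch below is a simplified illustration with hypothetical function names; the full method also expands the neighbor sets and blends the resulting Jaccard distance with the original distance, which is omitted here.

```python
def k_nearest(dist, i, k):
    """Indices of the k images closest to image i (image i itself included)."""
    order = sorted(range(len(dist)), key=lambda j: dist[i][j])
    return set(order[:k])

def k_reciprocal(dist, i, k):
    """Images j that are in i's k nearest neighbors and have i in theirs."""
    return {j for j in k_nearest(dist, i, k) if i in k_nearest(dist, j, k)}

def jaccard_distance(dist, i, j, k):
    """1 minus the overlap ratio of the k-reciprocal neighbor sets of i and j."""
    a, b = k_reciprocal(dist, i, k), k_reciprocal(dist, j, k)
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)

# Toy data: five images on a line, forming two clusters {0, 1, 2} and {3, 4}.
positions = [0.0, 1.0, 2.0, 10.0, 11.0]
dist = [[abs(p - q) for q in positions] for p in positions]

neighbors_of_3 = k_reciprocal(dist, 3, k=3)
same_cluster = jaccard_distance(dist, 0, 1, k=3)
cross_cluster = jaccard_distance(dist, 2, 3, k=3)
```

In this toy example image 2 is among the three nearest neighbors of image 3, but image 3 is not among those of image 2, so the pair is not reciprocal and their Jaccard distance stays at its maximum.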
Conclusion
The first row in the figure above shows the persons to be looked for. The rows below it are the results produced by a human tester and by the machine. Which row of images corresponds to the result generated by the machine? (The answer will be revealed at the end of this article.)
The approach presented in this article allows ReID technology to show better performance. However, at the end of this article, the author also pointed out that although machines outperform human beings in the two common datasets, it cannot be concluded that the task of ReID has been well resolved. In practical applications, human testers, especially those who are professionally trained, can be more accurate in evaluating images with crowds or in a dim environment. This is typically based on experience, intuition, the environment, and context. Therefore, people still have great advantages over machines in extreme conditions. In future practice, more efforts are needed to address and implement ReID.
Zhang Chi, one of the authors of AlignedReID, said: "When we started to study ReID in 2016, a rank-1 accuracy of 60% could be considered state of the art. However, businesses normally require an accuracy of at least 90% for it to be practical. Even though we have outperformed human beings on the two common datasets, this is just our first step towards real-world application. There will be many challenges to deal with in real-world scenarios. We hope that, with its development, ReID technology can make our society safer and smarter."
Finally, let's announce the answer to the previous question. The third row shows the results generated by the machine.