Facial recognition, how do you know it's the same person?

Hello. My curiosity led me to test an API that claims to do face detection and facial recognition.

Exploring a little more with my ESP32-CAM, I took three pictures of my face, keeping the same position, same lighting, etc.

The API responded in JSON:

{"results":[{"status":{"code":"ok","message":"Success"},"name":"esp32-cam.jpg","md5":"9d69b87ca4f638f685f85cd70f4feccf","entities":[{"kind":"objects","name":"face-detector","objects":[{"box":[0.26976995218709976,0.12491494323036217,0.5152465488728709,0.6869953984971613],"entities":[{"kind":"classes","name":"face","classes":{"face":0.9388834238052368}},{"kind":"namedpoints","name":"face-landmarks","namedpoints":{"left-eye":[0.37281060695648194,0.3989172077178955],"right-eye":[0.5729453086853027,0.3699918079376221],"nose-tip":[0.451428747177124,0.5087626254558564],"mouth-left-corner":[0.435301628112793,0.6622309684753418],"mouth-right-corner":[0.5814676094055176,0.6382978916168214]}}]}]}]}]}

{"results":[{"status":{"code":"ok","message":"Success"},"name":"esp32-cam.jpg","md5":"b22d28c4ec09ea54b05e689564b2f130","entities":[{"kind":"objects","name":"face-detector","objects":[{"box":[0.21273358143976484,0.19003698955126802,0.5770508424915641,0.7694011233220854],"entities":[{"kind":"classes","name":"face","classes":{"face":0.9846720099449158}},{"kind":"namedpoints","name":"face-landmarks","namedpoints":{"left-eye":[0.3557454586029053,0.4791991996765136],"right-eye":[0.5900873565673829,0.47020770072937007],"nose-tip":[0.4579240131378174,0.6371343088150024],"mouth-left-corner":[0.3957681941986084,0.7933987998962402],"mouth-right-corner":[0.5702394294738771,0.7867423629760741]}}]}]}]}]}

{"results":[{"status":{"code":"ok","message":"Success"},"name":"esp32-cam.jpg","md5":"0095a1904e0e979483c53d69ed40b3e9","entities":[{"kind":"objects","name":"face-detector","objects":[{"box":[0.2333206442123697,0.13479227484942086,0.5248686450469403,0.6998248600625872],"entities":[{"kind":"classes","name":"face","classes":{"face":0.9846428632736206}},{"kind":"namedpoints","name":"face-landmarks","namedpoints":{"left-eye":[0.3545965671539307,0.4072417259216308],"right-eye":[0.5614657402038574,0.3929729652404785],"nose-tip":[0.43846559524536133,0.5404448652267456],"mouth-left-corner":[0.3985099267959595,0.681445655822754],"mouth-right-corner":[0.549906005859375,0.6702262496948242]}}]}]}]}]}

But in all fields the data is different. How do you know it's the same person?

I try to answer this question by imagining that, in the code (on the Arduino, for example), I would store the fields of values for the nose tip, the left and right mouth corners, etc., average them, and build a String to compare against the next readings.

Is this how it's done?

Because I noticed very different values in the three API responses. Unless the variation is minimal, and it is that minimal variation that differentiates one person from another.

Anyway, does anyone have a piece of code that I can test here, to turn on the ESP32-CAM's flash LED only if the face is mine?

Thanks

The ESP32 does not have enough power to differentiate between faces.

If you want to recognize particular faces, a tensor can be made of each face and a TensorFlow ML model can be used to tell different faces apart. A Raspberry Pi or a BeagleBone-AI64 can do the job.

Wikipedia has a whole page about it.

A quick search gave me this.

https://maker.pro/arduino/projects/how-to-build-an-esp32-based-facial-recognition-system

That's not what the topic is about. But thank you.


Thank you.

They are numeric values. Why would you convert them to Strings? I'd treat each reply as an N-dimensional vector. Then, determine the difference (error) between subsequent result vectors in a "Mean Squared Error" sense. Then, compare that error against a threshold to determine if there's a match.

That looks like a "face detection" result rather than a "face recognition" result. It's telling you where in the image it found a face and where it thinks the points of interest are. That is probably insufficient to recognize a face.

The example ESP32-CAM sketch doesn't enable face recognition on a basic ESP32. See the comment in the CameraWebServer example sketch:

// Face Recognition takes upward from 15 seconds per frame on chips other than ESP32S3
// Makes no sense to have it enabled for them

You should upgrade to an ESP32S3-based board.


Oops, this one is for Johnwasser:

But isn't that what this API claims to do?

We send it a photo and it even labels person #1, person #2, etc.?

I tested it with three pictures of my own face. But I didn't quite understand how to create a comparator code.

Or did I confuse myself about this API's functionality?

Now I get it. Embeddings. It's their algorithm:

Query parameter: embeddings

The embeddings query parameter allows a client to enable/disable embeddings calculation. If a client passes True value then the service will perform a calculation of embeddings for each face detected in an image. Otherwise, if a client passes False value then embeddings will not be calculated.

Embeddings calculation is disabled by default.

Note: If you want to skip face detection and just calculate embeddings for the whole image, use the following combination of flags: detection=False&embeddings=True.

And their system works on images with more than one face, where it just differentiates face 1, face 2, face 3.

But... would it be very difficult to create your own filter that allows you to differentiate face 1 from face 2?

I think it would be enough to do some calculations with that response data and create an X value for each face, as if it were a standard average.

But of course I don't even know how to start doing that.

The 'embeddings' seem to represent the face as a point in a 512-dimensional 'face space'. My guess is that you would find the 'distance' between two points to see how close the two points are in face space.

For each of the 512 values in the two 'embeddings' vectors, subtract A from B and square it. Average the 512 squares and then take the square root.

Who dares to try ?

Although you don't need to take the final square root as long as you compare the Mean Squared Error against the proper threshold.

Indeed.

Let's have faith.

Someone will show up who will help and post the snippet of the filter code using those 5 results that the API delivers, so that we can continue with the tests and see if it is even possible to differentiate faces with moderate precision using the ESP32-CAM + API.

I left a photo of my face for testing:

https://www.linkpicture.com/view.php?img=LPic63f27e2e77f1c569955444

  // Sum the squared differences between the two face vectors.
  double errorSum = 0;
  for (uint8_t i = 0; i < 5; i++) {
    double error = faceVector1[i] - faceVector2[i];
    errorSum += error * error;
  }
  if (errorSum < matchThreshold) {
    Serial.println("Faces Match");
  }

I know it's easier to find a pair of dwarf twins than to get you to post code. thanks++;

But what would faceVector1[i] and faceVector2[i] be? And matchThreshold?

faceVector1 & faceVector2 would be populated from the API's reply message. You'd have to determine matchThreshold heuristically, based on your tolerance for false positives versus false negatives.

You have to turn on the calculation of 'embeddings', which are 512-element vectors. You then have to compare the 512 elements for each face against the 512 elements for every other face to determine the distance.

Hint: Add "embeddings=True" to the end of the query URL.