
Person Tracking

Person tracking requires the robot to identify a person and follow them as they move around the environment. This capability is needed in several tasks, such as Carry My Luggage and GPSR (General Purpose Service Robot). Therefore, computer vision is used to detect, identify and track a person.

Person detection

To detect people, the YOLOv8 model is used to obtain a bounding box for each person in each frame. On top of it, the ByteTrack algorithm assigns a track_id to each person who remains visible from the previous frame. However, the default YOLO tracker alone was not consistent enough: people who pass behind large objects, or behind other people, are assigned new IDs, since the tracker does not keep track of people who leave and re-enter the frame. This made a re-identification model necessary, one that can recognize people who appeared in previous frames, especially because the tracked person may exit the frame at any moment and must be identified again to continue following them.

# Get the results from the YOLOv8 model
results = self.model.track(frame, persist=True, tracker='bytetrack.yaml', classes=0, verbose=False)
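The tracker output can then be unpacked into per-person bounding boxes and track IDs. The snippet below is a minimal sketch of that step, assuming the Ultralytics results layout (results[0].boxes); the loop body is a placeholder.

# Unpack bounding boxes and track IDs from the tracker output
boxes = results[0].boxes
if boxes.id is not None:
    track_ids = boxes.id.int().cpu().tolist()
    for track_id, (x1, y1, x2, y2) in zip(track_ids, boxes.xyxy.cpu().numpy()):
        # Each box can now be cropped and matched against the tracked person
        pass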

Person re-identification

To re-identify people, the Person_reID_baseline_pytorch repository was used as a base to train several models on the Market-1501 dataset. Specifically, the ResNet50, Swin, PCB and DenseNet models were trained and tested, and the DenseNet model was eventually chosen as it was the lightest. With the trained model, a feature vector can be obtained from a cropped image of a person.

# Crop the image using the bbox coordinates
cropped_image = frame[y1:y2, x1:x2]

# Convert the image array to a PIL image
pil_image = PILImage.fromarray(cropped_image)

# Get feature
with torch.no_grad():
    new_feature = extract_feature_from_img(pil_image, self.model_reid)

Nonetheless, extracting embeddings for every person in every frame is not efficient, so an array of previous detections is kept to check whether a detected person is new. Only in that case are the embeddings extracted and compared against the embeddings of the tracked person, using a threshold on cosine similarity.

from scipy.spatial.distance import cosine

def compare_images(features1, features2, threshold=0.6):
    # Both inputs must be one-dimensional feature vectors
    if features1.ndim != 1 or features2.ndim != 1:
        return False

    # Compute cosine similarity between feature vectors (scipy's cosine is a distance)
    similarity_score = 1 - cosine(features1, features2)

    # Same person if the similarity score reaches the threshold
    return similarity_score >= threshold
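The bookkeeping described above is not shown in the snippet; the sketch below illustrates one way it could be implemented, assuming a dictionary keyed by the ByteTrack track_id and a stored feature vector for the person being followed (prev_features, is_tracked_person and tracked_feature are hypothetical names).

import torch

# Hypothetical cache: track_id -> feature vector of people already embedded
prev_features = {}

def is_tracked_person(track_id, cropped_pil_image, model_reid, tracked_feature):
    # Only extract an embedding the first time a track_id is seen
    if track_id not in prev_features:
        with torch.no_grad():
            prev_features[track_id] = extract_feature_from_img(cropped_pil_image, model_reid)

    # Compare against the embedding of the person being followed
    return compare_images(prev_features[track_id], tracked_feature)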

MediaPipe implementation

An issue encountered with the detection and re-identification process was that the YOLO model would sometimes label parts of people as a whole person, so a bounding box might contain only hands or a head, for example. This became problematic because the re-identification model could return unexpected results for these cases. Therefore, the MediaPipe library is used to verify that a person's chest is visible, and only then is the re-identification model applied.

import cv2
import mediapipe as mp

pose_model = mp.solutions.pose.Pose(min_detection_confidence=0.8)

def check_visibility(poseModel, image):
    # MediaPipe expects RGB images
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    # Process the image
    results = poseModel.process(image)

    # No landmarks detected: the crop does not contain a usable person
    if results.pose_landmarks is None:
        return False

    # Landmark 11 is the left shoulder, used as a proxy for the chest
    chest_x = results.pose_landmarks.landmark[11].x
    chest_y = results.pose_landmarks.landmark[11].y
    chest_visibility = results.pose_landmarks.landmark[11].visibility

    # Reject the crop if the landmark lies outside the frame and its visibility is low
    if (chest_x < 0 or chest_x > 1 or chest_y < 0 or chest_y > 1) and chest_visibility < 0.95:
        return False
    return True
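In practice, check_visibility acts as a gate in front of the re-identification step. The sketch below shows how the pieces from the previous snippets could fit together; the bounding-box coordinates come from the detector, as before.

# Only run re-identification on crops where the chest is visible
cropped_image = frame[y1:y2, x1:x2]
if check_visibility(pose_model, cropped_image):
    pil_image = PILImage.fromarray(cropped_image)
    with torch.no_grad():
        new_feature = extract_feature_from_img(pil_image, self.model_reid)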

Node structure

To start tracking, the service must be called with a boolean that starts or stops the process. Once the service is called to begin tracking, the detected person with the largest bounding-box area is assigned as the tracked person. While tracking is enabled, the node publishes the coordinates of the person in image space (in pixels) whenever the tracked person is in frame; otherwise nothing is published. A minimal sketch of the node skeleton follows the topic list below.

The ROS node is structured in the following way:

Subscriber topics

  • /zed2/zed_node/left/image_rect_color: ZED camera topic

Publisher topics

  • /vision/person_detection: Publishes a Point with the x and y coordinates of the tracked person. (Not published when the tracked person is not visible)
  • /vision/img_tracking: Publishes an annotated image for debugging and display purposes.

Service topics

  • /vision/change_person_tracker_state: Of type SetBool to enable or disable tracking.
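Putting the interface together, the skeleton below is a minimal sketch of the node described above, assuming rospy with std_srvs/SetBool and geometry_msgs/Point; the detection and tracking logic inside the image callback is a placeholder.

import rospy
from std_srvs.srv import SetBool, SetBoolResponse
from geometry_msgs.msg import Point
from sensor_msgs.msg import Image

class PersonTrackingNode:
    def __init__(self):
        self.tracking_enabled = False
        self.point_pub = rospy.Publisher('/vision/person_detection', Point, queue_size=1)
        self.debug_pub = rospy.Publisher('/vision/img_tracking', Image, queue_size=1)
        rospy.Subscriber('/zed2/zed_node/left/image_rect_color', Image, self.image_callback)
        rospy.Service('/vision/change_person_tracker_state', SetBool, self.change_state)

    def change_state(self, req):
        # Enable or disable tracking; the largest bounding box is picked on the next frame
        self.tracking_enabled = req.data
        return SetBoolResponse(success=True, message='tracking enabled' if req.data else 'tracking disabled')

    def image_callback(self, msg):
        if not self.tracking_enabled:
            return
        # Detection, tracking and re-identification run here; when the tracked
        # person is found, their pixel coordinates are published:
        # self.point_pub.publish(Point(x=cx, y=cy, z=0))

if __name__ == '__main__':
    rospy.init_node('person_tracking')
    PersonTrackingNode()
    rospy.spin()

Once the node is running, tracking can be toggled from the command line with rosservice call /vision/change_person_tracker_state "data: true".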

Launch file

For easier execution, the node can be launched using the following command:

roslaunch vision receptionist.launch