Processes and systems are directed to training a neural network of an object recognition system. The processes and systems record video streams of people. Sequences of object images are extracted from each video stream, each sequence of object images corresponding to one of the people. For each sequence of object images, a triplet of feature vectors is formed, the triplet comprising an anchor feature vector and a positive feature vector of the same object and a negative feature vector of a different object. The anchor, positive, and negative feature vectors of each triplet are separately input to the neural network to compute corresponding output anchor, positive, and negative vectors. A triplet loss function value is computed from the output anchor, positive, and negative vectors. When the triplet loss function value is greater than a threshold, the neural network is retrained using the anchor and positive feature vectors of the sequences of object images.
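
The loss computation and threshold test described above can be illustrated with a minimal sketch. The abstract does not state the form of the triplet loss function, so this example assumes the commonly used margin-based form with squared Euclidean distance, summed over all triplets. The names `squared_distance`, `triplet_loss`, and `needs_retraining`, the `margin` parameter and its default value are all hypothetical and chosen only for illustration.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor_out, positive_out, negative_out, margin=0.2):
    """Margin-based triplet loss for one triplet of network outputs.

    Assumed form (not stated in the source):
        max(d(anchor, positive) - d(anchor, negative) + margin, 0)
    The loss is zero once the negative is farther from the anchor
    than the positive by at least `margin`.
    """
    d_pos = squared_distance(anchor_out, positive_out)
    d_neg = squared_distance(anchor_out, negative_out)
    return max(d_pos - d_neg + margin, 0.0)


def needs_retraining(output_triplets, threshold, margin=0.2):
    """Return True when the summed triplet loss exceeds the threshold.

    `output_triplets` is an iterable of (anchor, positive, negative)
    output vectors. Summing over triplets is an assumption.
    """
    total = sum(triplet_loss(a, p, n, margin) for a, p, n in output_triplets)
    return total > threshold
```

In this sketch, a triplet whose positive output lies farther from the anchor than its negative output produces a positive loss, which pushes the total toward the retraining threshold. A triplet that is already well separated contributes nothing.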