Object tracking is one of the core applications of computer vision. It builds pixel-level or instance-level correspondences between frames and outputs trajectories in the form of boxes or masks. There are four main categories of object tracking: Single Object Tracking (SOT), Video Object Segmentation (VOS), Multiple Object Tracking (MOT), and Multi-Object Tracking and Segmentation (MOTS). Most existing approaches address only one of these categories, so a unified model that handles all four at once is desirable.
This work proposes two core components to address this issue: the pixel-wise correspondence and the target prior. The pixel-wise correspondence represents the similarity between all pairs of points from the reference frame and the current frame. The target prior acts as a switch between the four tracking categories and as an additional input to the detection head. Building on these two components, the study introduces Unicorn, a single network architecture that handles all four tracking categories.
The contributions of this work are: 1) Unicorn unifies the learning paradigm and the network architecture across the four tracking categories. 2) It bridges the gap between the methods used for the four tracking categories by means of the pixel-wise correspondence and the target prior. 3) Unicorn achieves state-of-the-art performance on eight challenging tracking benchmarks using the same model parameters. The comparison between Unicorn and task-specific techniques is shown in Figure 1.
Unicorn is a unified method for object tracking that consists of three major components: unified inputs and backbone, unified embedding, and unified head. Given the current frame, the reference frame, and the reference targets, Unicorn's unified network predicts the states of the tracked targets in the current frame. The architecture of the proposed Unicorn method is shown in Figure 2.
Unicorn takes the entire image as input. The current frame and the reference frame are passed through a weight-sharing backbone to obtain feature pyramid representations (FPN). The pixel-wise correspondence is computed by multiplying the spatially flattened reference frame embedding by the current frame embedding. A transformer is used to enhance the original feature representation because it captures long-range dependencies. The unified embedding is optimized end-to-end using Dice Loss for SOT and VOS and Cross-Entropy Loss for MOT and MOTS. Unicorn also takes an additional input, the reference target map, which identifies the target in the reference frame. Given the reference target map, the propagated target map provides reliable prior knowledge about the state of the tracked target, which underlines the value of the target prior. The target prior and the original FPN feature are the two inputs to the unified head. Unicorn fuses them with a broadcast sum and passes the fused feature to the original detection head.
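The correspondence, propagation, and fusion steps described above can be illustrated with a small sketch. It is not the authors' implementation: the function names are hypothetical, the softmax normalization and the propagation by a weighted sum are assumptions that this summary does not spell out, and plain Python lists stand in for real feature tensors.

```python
import math


def correspondence(ref_emb, cur_emb):
    """Similarity of every reference pixel to every current pixel.

    Both inputs are spatially flattened embeddings: a list of H*W vectors,
    each of length C. Returns an (H*W) x (H*W) matrix whose rows sum to 1.
    """
    result = []
    for ref_vec in ref_emb:
        # Dot product of one reference pixel with all current pixels.
        scores = [sum(r * c for r, c in zip(ref_vec, cur_vec)) for cur_vec in cur_emb]
        # Softmax over current pixels (assumed normalization, not from the article).
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def propagate_target(corr, ref_target):
    """Carry a flattened reference target map over to the current frame."""
    num_cur = len(corr[0])
    return [
        sum(corr[i][j] * ref_target[i] for i in range(len(corr)))
        for j in range(num_cur)
    ]


def broadcast_sum(features, prior):
    """Add the one-channel prior to every channel of each pixel's feature."""
    return [[f + p for f in vec] for vec, p in zip(features, prior)]


# Toy 2-pixel frames with 2-channel embeddings: the target sits on
# reference pixel 0, which looks like current pixel 1.
ref_emb = [[1.0, 0.0], [0.0, 1.0]]
cur_emb = [[0.0, 1.0], [1.0, 0.0]]
ref_target = [1.0, 0.0]

corr = correspondence(ref_emb, cur_emb)
prior = propagate_target(corr, ref_target)
fused = broadcast_sum(cur_emb, prior)
```

In this toy case the propagated prior is larger at current pixel 1 than at pixel 0, which is the behavior the target prior is meant to provide: it points the head at the location in the current frame that matches the reference target.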
Training is divided into two stages. In the first stage, the network is optimized end-to-end with the correspondence loss and the detection loss using data from SOT and MOT. In the second stage, a mask branch is added and optimized with the mask loss using data from VOS and MOTS, while all other parameters are kept fixed. At test time, Unicorn selects the box or mask with the highest confidence score as the final tracking result, without hyperparameter-sensitive post-processing such as a cosine window.
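The test-time selection rule amounts to taking the top-scoring candidate. A minimal sketch follows, assuming each candidate carries a box and a score; the `Candidate` type and `select_target` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    """Hypothetical container for one predicted box and its score."""
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2)
    score: float


def select_target(candidates: List[Candidate]) -> Candidate:
    """Return the highest-scoring candidate, with no window penalty or re-ranking."""
    if not candidates:
        raise ValueError("no candidates to select from")
    return max(candidates, key=lambda c: c.score)


candidates = [
    Candidate(box=(10, 10, 50, 50), score=0.42),
    Candidate(box=(12, 11, 53, 49), score=0.91),
    Candidate(box=(200, 80, 240, 130), score=0.30),
]
best = select_target(candidates)
```

Because the rule depends only on the predicted scores, there is no window size or penalty weight to tune per dataset.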
Unicorn is evaluated with three backbones: ConvNeXt-Large, ConvNeXt-Tiny, and ResNet-50. The input image size is 800 x 1280 pixels. The model was trained on 16 NVIDIA Tesla A100 GPUs with a global batch size of 32. Each training stage consists of 15 epochs, with 200,000 pairs of frames in every epoch.
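These figures imply a few derived quantities, worked out below. The per-iteration numbers assume that one iteration consumes one global batch of frame pairs, which the article does not state explicitly.

```python
num_gpus = 16
global_batch = 32
pairs_per_epoch = 200_000
epochs_per_stage = 15  # passes over the data in each training stage

# Samples handled by each GPU per iteration.
per_gpu_batch = global_batch // num_gpus

# Frame pairs seen in one training stage.
pairs_per_stage = pairs_per_epoch * epochs_per_stage

# Assumption: one iteration consumes one global batch of frame pairs.
iters_per_epoch = pairs_per_epoch // global_batch
iters_per_stage = iters_per_epoch * epochs_per_stage

print(per_gpu_batch, pairs_per_stage, iters_per_epoch, iters_per_stage)
```

Under that assumption, each GPU processes 2 samples per step, and a stage covers 3,000,000 frame pairs in 93,750 iterations.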
SOT performance was evaluated on LaSOT and TrackingNet, using success, precision, and normalized precision as metrics. MOT performance was evaluated on MOT17 and BDD100K, using Multi-Object Tracking Accuracy (MOTA), Identity F1 Score (IDF1), the percentage of Mostly Tracked trajectories (MT), the percentage of Mostly Lost trajectories (ML), False Negatives (FN), False Positives (FP), and Identity Switches (IDS). VOS was evaluated on the DAVIS 2016 and DAVIS 2017 datasets, using region similarity, contour accuracy, and the average of the two as metrics. MOTS performance was evaluated on MOTS20 and BDD100K, using sMOTSA and mMOTSA in addition to the metrics used for MOT. ConvNeXt-Tiny was used as the backbone for the ablation analysis.
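Two of these metrics have compact standard definitions that are easy to show. The sketch below uses the usual MOTA formula, 1 - (FN + FP + IDS) / GT, and computes region similarity as the intersection-over-union of the predicted and ground-truth masks. The function names and all numbers are made up for illustration and are not results from the paper.

```python
def mota(fn: int, fp: int, ids: int, num_gt: int) -> float:
    """Multi-Object Tracking Accuracy over a sequence with num_gt ground-truth boxes."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (fn + fp + ids) / num_gt


def region_similarity(pred: set, gt: set) -> float:
    """Jaccard index (IoU) between two masks given as sets of (row, col) pixels."""
    union = pred | gt
    if not union:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return len(pred & gt) / len(union)


def j_and_f(j: float, f: float) -> float:
    """Average of region similarity (J) and contour accuracy (F)."""
    return (j + f) / 2.0


# Illustrative numbers only.
example_mota = mota(fn=10, fp=5, ids=5, num_gt=100)
pred_mask = {(0, 0), (0, 1), (1, 1)}
gt_mask = {(0, 1), (1, 1), (1, 0)}
example_j = region_similarity(pred_mask, gt_mask)
example_jf = j_and_f(example_j, 0.7)
```

MOTA can become negative when the total number of errors exceeds the number of ground-truth boxes, which is why it is usually reported together with IDF1 and the per-error counts.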
In summary, this paper introduces Unicorn, a unified approach that handles all four tracking categories with the same model parameters. On eight challenging benchmarks, the proposed model performs on par with or better than methods designed for a single category.
This article is a summary written by Marktechpost staff based on the research paper 'Towards Grand Unification of Object Tracking'. All credit for this research goes to the researchers on this project.