Pose Recognition with Cascade Transformers [Eng]
Ke Li et al. / Pose Recognition with Cascade Transformers / CVPR 2021
Contemporary Human Pose Recognition techniques are generally represented by two main methods:
Heatmap-based
Regression-based
While the former perform better on specific tasks, they mostly rely on hand-crafted heuristics and contain non-differentiable steps. The main aim of this paper is to present a regression-based architecture capable of producing competitive results in pose recognition. To this end, an encoder-decoder Transformer architecture is used for every detection subtask.
The pose recognition task is composed of two fundamental subtasks: in-frame person detection and keypoint detection. The paper proposes two alternative top-down approaches, a two-stage one and an end-to-end one. As mentioned earlier, both leverage Transformers to produce predictions.
The first one uses the general-purpose DEtection TRansformer (DETR)[3] to detect people in a scene. The bounding box of each detected person is then cropped and passed to a keypoint extractor.
Meanwhile, the second one uses a Spatial Transformer Network (STN)[4] to generate a sampling grid and passes the sampled image features on to output the joints' coordinates. Both alternatives are described in detail below.
Lastly, the authors developed a visualization method to illustrate the internal workings of Transformers.
Although the two approaches may seem completely different, they share the overall architecture. First, the input image is passed to a backbone model for feature extraction. This model is usually a CNN architecture, chosen to best fit the required parameters. The extracted image features are passed to the first-stage Transformer, the person detector. Image patches containing the detected people are then passed to the second Transformer, which is responsible for keypoint identification.
The main difference between the alternatives lies in how image patches are extracted and passed between the two Transformer models.
[Figure: 1] Two-stage Architecture
In this architecture, the input image is first fed to the backbone model; the resulting features are flattened, combined with absolute positional encoding, and passed to the object detector. Because a Transformer, unlike a recurrent model, has no built-in notion of order, the positional encoding is needed to preserve spatial information before the features are fed into it. DETR[3] was chosen as the object detector for its performance on general-purpose tasks.
Since the main task of the Transformer in this stage is to extract meaningful person features, the natural question arises of whether any people are present in the frame. To answer it, a classification model is placed on top of the Transformer. It determines whether an object produced by the Transformer is indeed a person. An object that belongs to the background is marked as the empty set (∅). An object identified as a person is then passed to a regression model, which produces a 4-dimensional vector with the box's coordinates.
With the bounding boxes from the first Transformer, the image is cropped and the patches are passed to the second block, where another backbone model takes them as input. The resulting image features are combined with relative positional encoding and fed into the keypoint-detection Transformer.
Since the Transformer produces more queries than there are joints, the two sets are matched with the Hungarian algorithm. It finds the optimal bipartite matching, i.e. an injective function $\sigma: [J] \to [Q]$ that assigns one of the $Q$ queries to each of the $J$ joints. The matching cost is $C_i = -\hat{p}_{\sigma(i)}(c_i) + \lVert b_i - \hat{b}_{\sigma(i)} \rVert$, where $\hat{p}_{\sigma(i)}(c_i)$ is the class probability of the query $\sigma(i)$ and $b_i$, $\hat{b}_{\sigma(i)}$ correspond to the ground-truth and predicted coordinates. Since ground-truth values are not available during inference, the cost reduces to $C_i = -\hat{p}_{\sigma(i)}(c_i)$. Finally, the loss function has the same form as the matching cost, with the probability term replaced by its negative log-likelihood.
The matching reduces the set of queries to one per joint, which serve as inputs to the keypoint classifier. This model determines whether the chosen feature is background. If it is not, the feature is passed to the coordinate regressor, which produces the final coordinates.
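The matching step can be illustrated with a small sketch. It is not the authors' code: the function name `match_queries_to_joints` and the toy data are hypothetical, the coordinate term is assumed to be an L1 distance, and an exhaustive search stands in for the Hungarian algorithm (which finds the same optimum in polynomial time).

```python
from itertools import permutations

def match_queries_to_joints(class_probs, pred_coords, gt_coords):
    """Return, for every joint, the index of the query matched to it.

    class_probs[q][j] : predicted probability that query q is joint type j
    pred_coords[q]    : (x, y) predicted by query q
    gt_coords[j]      : ground-truth (x, y) of joint j
    """
    num_queries, num_joints = len(class_probs), len(gt_coords)

    def cost(j, q):
        # Assumed cost: negative class probability plus L1 coordinate distance.
        l1 = sum(abs(a - b) for a, b in zip(gt_coords[j], pred_coords[q]))
        return -class_probs[q][j] + l1

    best_assignment, best_total = None, float("inf")
    # Exhaustive search over injective mappings joints -> queries.
    # Only feasible for tiny inputs; the Hungarian algorithm scales far better.
    for assignment in permutations(range(num_queries), num_joints):
        total = sum(cost(j, q) for j, q in enumerate(assignment))
        if total < best_total:
            best_assignment, best_total = assignment, total
    return list(best_assignment)

# Two joints, three queries: query 2 fits joint 0 and query 0 fits joint 1.
class_probs = [[0.1, 0.8], [0.3, 0.3], [0.9, 0.05]]
pred_coords = [(0.7, 0.7), (0.5, 0.5), (0.2, 0.2)]
gt_coords = [(0.2, 0.25), (0.7, 0.65)]
matching = match_queries_to_joints(class_probs, pred_coords, gt_coords)
```

Query 1 stays unmatched, which corresponds to a query that would be treated as background.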
Similarly to the two-stage architecture, the input image is processed to output its features. A Spatial Transformer Network is used to crop the features without losing the end-to-end nature of the model. To generate the grid that selects the image region, the positions of its points need to be found. For a grid of size $w \times h$, they are computed from the bounding box coordinates, resulting in: $x_i = \frac{w-i}{w} x_{left} + \frac{i}{w} x_{right}$, $y_j = \frac{h-j}{h} y_{top} + \frac{j}{h} y_{bottom}$.
With this data, it is possible to calculate the sampled feature map: $V_{ij} = \sum_{n} \sum_{m} U_{nm} \max(0, 1 - |x_i - m|) \max(0, 1 - |y_j - n|)$, where $V$ is the output map and $U$ is the source feature map. This process is applied at the end of each stage of the backbone.
As soon as the cropped feature maps are obtained, they are passed to the second Transformer, and the process is identical to the one described above.
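The grid construction and sampling can be sketched in plain Python. This is an illustration under assumptions, not the paper's implementation: `make_grid` and `bilinear_sample` are hypothetical names, box edges are given in pixel units of the source map, and the sampling uses the bilinear kernel from the STN paper[4].

```python
def make_grid(lo, hi, size):
    """Return `size` positions interpolated linearly from edge `lo` toward edge `hi`."""
    return [(size - i) / size * lo + i / size * hi for i in range(size)]

def bilinear_sample(source, xs, ys):
    """Sample source[n][m] at every grid point (x, y) with a bilinear kernel."""
    out = []
    for y in ys:
        row = []
        for x in xs:
            value = 0.0
            for n, src_row in enumerate(source):
                for m, u in enumerate(src_row):
                    # Only the (up to four) neighbouring pixels get a non-zero weight.
                    value += u * max(0.0, 1 - abs(x - m)) * max(0.0, 1 - abs(y - n))
            row.append(value)
        out.append(row)
    return out

source = [[0.0, 1.0, 2.0],
          [3.0, 4.0, 5.0],
          [6.0, 7.0, 8.0]]
xs = make_grid(0.0, 2.0, 2)  # x positions between the left and right box edges
ys = make_grid(0.5, 1.5, 2)  # y positions between the top and bottom box edges
crop = bilinear_sample(source, xs, ys)
```

Because every output value is a weighted sum of source values, the crop stays differentiable with respect to both the features and the box coordinates, which is what keeps the pipeline end-to-end.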
The implementation of Sequential Model can be found below:
The most essential imports are given below:
The framework used for this work is PyTorch.
The main class representing the model is PrtrSequential. It provides insight into the way the model operates. It takes the following parameters as inputs:
backbone - CNN backbone model that returns a set of feature maps, which are then used in the sampling process
transformer - object detection Transformer that produces the bounding boxes' features
transformer_kpt - keypoint detection Transformer that outputs the coordinates of the joints
level - set of layers used in the sampling process
x_res, y_res - width and height of the produced feature maps fed into the keypoint Transformer
class_embed - linear classifier predicting whether the input object is a person
bbox_embed - bounding box embedding layer, implemented as a multi-layer perceptron (MLP)
MLP layer - feed-forward neural network composed of 3 (the last argument) Linear layers followed by ReLU activation functions
query_embed - embedding layer that transforms an input to be fed into the Transformer model
input_proj - input projection layer
x_interpolate, y_interpolate - bounding box editing layers, used to enlarge the bounding box dimensions by 25%
mask - the filter grid necessary to extract an image patch
In this code, the positional embeddings of the feature maps are learnt, while the positional embedding for the keypoint Transformer is built with sinusoidal functions. Once both are available, the built and learnt embeddings can be used together. The same positional embedding is used for every bounding box that contains a person.
Initially, the mask set in the model is negated. Afterwards, the x- and y-axis embeddings are produced by applying a cumulative sum over axes 1 and 2, and the results are normalized. The dimension temperature is set according to the number of positional features. The positions are then passed through the sine function, concatenated and saved in a buffer. Finally, the row and column embeddings are uniformly initialized.
The learnt positional embeddings are simply reshaped to fit the x_res and y_res values set initially.
With all the necessary structural components initialized, it is now possible to proceed to the training process.
First, people are detected. The backbone model outputs image features and their absolute positions, which are passed to the first Transformer. The Transformer produces person predictions, stored in hs. These are sent to the class embedding layer, which determines whether each prediction is a person or a no-object. Simultaneously, hs is passed to the bounding box embedding layer, which produces a bounding box for each object. Lastly, the outputs are stored in a dictionary.
To prepare for the STN, the number of people per bounding box is identified, and the heights and widths are repeated accordingly. With the feature space prepared, features are extracted from the backbone outputs at the specified layers. Finally, the coordinates of the corresponding bounding boxes are identified and duplicated for scenes with multiple people per box.
STN cropping is performed by matrix multiplication of the bounding box editing layers with the related coordinates. This is done without masks, since all inputs are of the same size. Once the grids are known, they can be applied at every level, together with the mask specified in the initialization phase. Finally, the keypoint Transformer produces coordinates and logits.
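The steps above can be sketched for a single axis in plain Python. This is a simplified illustration rather than the original PyTorch code: `sine_position_embedding` is a hypothetical name, and the default temperature of 10000 and the alternation of sine and cosine across channels are assumptions borrowed from common DETR-style implementations.

```python
import math

def sine_position_embedding(mask_row, num_pos_feats=4, temperature=10000.0):
    """Build a sinusoidal embedding for one row of a padding mask (True = padded)."""
    # 1. Negate the mask: valid positions become 1, padded positions 0.
    not_mask = [0 if padded else 1 for padded in mask_row]
    # 2. A cumulative sum turns the valid flags into increasing positions.
    positions, running = [], 0
    for flag in not_mask:
        running += flag
        positions.append(running)
    # 3. Normalize the positions to the range (0, 2*pi].
    eps = 1e-6
    last = positions[-1]
    positions = [p / (last + eps) * 2 * math.pi for p in positions]
    # 4. One temperature per channel, depending on the number of features.
    dim_t = [temperature ** (2 * (k // 2) / num_pos_feats)
             for k in range(num_pos_feats)]
    # 5. Sine on even channels, cosine on odd channels (assumed layout).
    return [[math.sin(p / t) if k % 2 == 0 else math.cos(p / t)
             for k, t in enumerate(dim_t)]
            for p in positions]

embedding = sine_position_embedding([False, False, False, True])
```

The padded last position receives the same embedding as the last valid one, because the cumulative sum does not advance over masked entries.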
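How the class and box outputs might be combined into person detections is sketched below. The names `detect_people`, `person_index` and `threshold`, the two-class layout [person, no-object], and the threshold value are all assumptions made for illustration; the original code works on batched PyTorch tensors.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def detect_people(class_logits, boxes, person_index=0, threshold=0.9):
    """Keep the boxes of queries classified as a person with high probability."""
    detections = []
    for logits, box in zip(class_logits, boxes):
        person_prob = softmax(logits)[person_index]
        if person_prob > threshold:
            detections.append({"score": person_prob, "box": box})
    return detections

# Three queries, each with [person, no-object] logits and a (cx, cy, w, h) box.
class_logits = [[2.0, -1.0], [-3.0, 1.0], [0.5, 0.4]]
boxes = [(0.5, 0.5, 0.2, 0.6), (0.1, 0.1, 0.1, 0.1), (0.8, 0.4, 0.3, 0.5)]
people = detect_people(class_logits, boxes)
```

Only the first query survives here: the second is clearly a no-object, and the third is too uncertain to pass the threshold.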
To infer coordinates, the output keypoint coordinates are used. Like forward, infer takes images as input, although it selects only the 17 keypoints as predictions. To select the necessary keypoints, the second Transformer creates a node for each query. Sets of nodes are grouped by keypoint type. Types and node sets are connected with the help of logits. The connections with the highest logit value are chosen as predictions.
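The selection rule can be sketched as a per-type argmax over the query logits. The function `select_keypoints` and the toy numbers are hypothetical; the layout with one extra background column after the keypoint types is also an assumption.

```python
def select_keypoints(logits, coords, num_kpts=17):
    """For each keypoint type, return the coordinates of the highest-scoring query.

    logits[q][k] : score that query q represents keypoint type k
                   (columns beyond num_kpts, e.g. background, are ignored)
    coords[q]    : (x, y) predicted by query q
    """
    selected = []
    for k in range(num_kpts):
        # The query with the largest logit for type k wins that keypoint.
        best_query = max(range(len(logits)), key=lambda q: logits[q][k])
        selected.append(coords[best_query])
    return selected

# Toy example with 2 keypoint types, 3 queries and a trailing background column.
logits = [[0.1, 2.0, 0.0],
          [3.0, 0.2, 0.1],
          [0.0, 0.1, 4.0]]
coords = [(0.3, 0.4), (0.5, 0.6), (0.9, 0.9)]
keypoints = select_keypoints(logits, coords, num_kpts=2)
```

The third query scores highest on the background column, so it is never chosen for either keypoint type.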
The Transformer that produces keypoints is defined below. It takes the following parameters as inputs:
transformer - Transformer model for keypoint detection
num_kpts - number of keypoints per person (17)
num_queries - number of queries
input_dim - the size of the image feature dimension coming from the first Transformer
It is important to note that all bounding boxes are upscaled by 25% and the coordinates are relative to the whole image.
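A minimal sketch of what this note implies, under stated assumptions: boxes are taken to be in (cx, cy, w, h) format, and the keypoint is taken to be predicted relative to the enlarged crop before being mapped back. The helper names `enlarge_box` and `to_image_coords` are hypothetical and do not appear in the original code.

```python
def enlarge_box(box, factor=1.25):
    """Scale the width and height of a (cx, cy, w, h) box around its center."""
    cx, cy, w, h = box
    return (cx, cy, w * factor, h * factor)

def to_image_coords(kx, ky, box):
    """Map a keypoint given relative to the crop (0..1) to whole-image coordinates."""
    cx, cy, w, h = box
    return (cx - w / 2 + kx * w, cy - h / 2 + ky * h)

person_box = (0.5, 0.5, 0.4, 0.8)       # detected person, relative to the image
crop_box = enlarge_box(person_box)      # 25% larger crop around the same center
center = to_image_coords(0.5, 0.5, crop_box)
top_left = to_image_coords(0.0, 0.0, crop_box)
```

Enlarging the box gives the keypoint Transformer some context around the person, which helps when the detected box cuts off a limb.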
Enver Menadjiev
KAIST AI
First-year Graduate Student at KAIST University
B.Sc. in Information and Communication Engineering at Inha University in Tashkent
enver@kaist.ac.kr
[Figure: 2] Sequential Architecture