transformer
Papers with tag transformer
2022
- VTP: Volumetric Transformer for Multi-view Multi-person 3D Pose Estimation. Yuxing Chen, Renshu Gu, Ouhan Huang, and Gangyong Jia. 2022.
This paper presents Volumetric Transformer Pose estimator (VTP), the first 3D volumetric transformer framework for multi-view multi-person 3D human pose estimation. VTP aggregates features from 2D keypoints in all camera views and directly learns the spatial relationships in the 3D voxel space in an end-to-end fashion. The aggregated 3D features are passed through 3D convolutions before being flattened into sequential embeddings and fed into a transformer. A residual structure is designed to further improve the performance. In addition, sparse Sinkhorn attention is employed to reduce the memory cost, which is a major bottleneck for volumetric representations, while also achieving excellent performance. The output of the transformer is again combined with the 3D convolutional features through a residual design. The proposed VTP framework integrates the high performance of the transformer with volumetric representations, and can serve as a good alternative to convolutional backbones. Experiments on the Shelf, Campus and CMU Panoptic benchmarks show promising results in terms of both Mean Per Joint Position Error (MPJPE) and Percentage of Correctly estimated Parts (PCP). Our code will be available.
Regresses human joint coordinates with a 3D volumetric transformer; see the sketch below.
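To make the pipeline concrete, here is a minimal sketch of a VTP-style block, assuming a (B, C, D, H, W) voxel feature volume and hypothetical module names; it uses standard dense self-attention where the paper uses sparse Sinkhorn attention, and omits the full backbone.

```python
# Minimal sketch of a VTP-style volumetric transformer block (assumed shapes and
# module names; dense self-attention stands in for the paper's sparse Sinkhorn
# attention, which is not reproduced here).
import torch
import torch.nn as nn

class VolumetricTransformerBlock(nn.Module):
    def __init__(self, channels=64, num_heads=4, depth=2):
        super().__init__()
        # 3D convolution over the aggregated voxel feature volume
        self.conv3d = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(channels),
            nn.ReLU(inplace=True),
        )
        # Plain transformer encoder over flattened voxel tokens
        layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, voxel_feats):
        # voxel_feats: (B, C, D, H, W) features aggregated from the 2D keypoint
        # predictions of all camera views into the 3D voxel space
        x = self.conv3d(voxel_feats)
        b, c, d, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)        # (B, D*H*W, C)
        tokens = self.encoder(tokens)                # attention over voxels
        out = tokens.transpose(1, 2).reshape(b, c, d, h, w)
        return out + x                               # residual with conv features

# Example: a coarse 16^3 voxel grid for one person proposal
feats = torch.randn(1, 64, 16, 16, 16)
print(VolumetricTransformerBlock()(feats).shape)     # torch.Size([1, 64, 16, 16, 16])
```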
- Multi-view Human Body Mesh Translator. Xiangjian Jiang, Xuecheng Nie, Zitian Wang, Luoqi Liu, and Si Liu. 2022.
Existing methods for human mesh recovery mainly focus on single-view frameworks, but they often fail to produce accurate results due to the ill-posed setup. Considering the maturity of multi-view motion capture systems, in this paper we propose to solve the prior ill-posed problem by leveraging multiple images from different views, thus significantly enhancing the quality of recovered meshes. In particular, we present a novel Multi-view human body Mesh Translator (MMT) model for estimating human body mesh with the help of a vision transformer. Specifically, MMT takes multi-view images as input and translates them to targeted meshes in a single-forward manner. MMT fuses features of different views in both encoding and decoding phases, leading to representations embedded with global information. Additionally, to ensure the tokens are intensively focused on the human pose and shape, MMT conducts cross-view alignment at the feature level by projecting 3D keypoint positions to each view and enforcing their consistency with geometry constraints. Comprehensive experiments demonstrate that MMT outperforms existing single- or multi-view models by a large margin on the human mesh recovery task, notably a 28.8% improvement in MPVE over the current state-of-the-art method on the challenging HUMBI dataset. Qualitative evaluation also verifies the effectiveness of MMT in reconstructing high-quality human mesh. Codes will be made available upon acceptance.
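The cross-view alignment step can be illustrated with a short sketch: project the predicted 3D keypoints into every view with its camera matrix and penalise disagreement with the per-view 2D estimates. Shapes, function names, and the L1 penalty below are assumptions, not MMT's exact formulation.

```python
# Illustrative sketch of cross-view alignment: project predicted 3D keypoints
# into each view and compare with that view's 2D keypoint estimates.
import torch

def project(points_3d, cam):
    """points_3d: (J, 3); cam: (3, 4) projection matrix -> (J, 2) pixel coords."""
    homo = torch.cat([points_3d, torch.ones(points_3d.shape[0], 1)], dim=1)  # (J, 4)
    uvw = homo @ cam.T                                                       # (J, 3)
    return uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)

def cross_view_alignment_loss(points_3d, cams, keypoints_2d):
    """cams: list of (3, 4) matrices; keypoints_2d: list of (J, 2) per-view estimates."""
    losses = [torch.abs(project(points_3d, cam) - kp).mean()
              for cam, kp in zip(cams, keypoints_2d)]
    return torch.stack(losses).mean()
```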
- Poseur: Direct Human Pose Regression with Transformers. Weian Mao, Yongtao Ge, Chunhua Shen, Zhi Tian, Xinlong Wang, Zhibin Wang, and Anton van den Hengel. ECCV 2022.
We propose a direct, regression-based approach to 2D human pose estimation from single images. We formulate the problem as a sequence prediction task, which we solve using a Transformer network. This network directly learns a regression mapping from images to the keypoint coordinates, without resorting to intermediate representations such as heatmaps. This approach avoids much of the complexity associated with heatmap-based approaches. To overcome the feature misalignment issues of previous regression-based methods, we propose an attention mechanism that adaptively attends to the features that are most relevant to the target keypoints, considerably improving the accuracy. Importantly, our framework is end-to-end differentiable, and naturally learns to exploit the dependencies between keypoints. Experiments on MS-COCO and MPII, two predominant pose-estimation datasets, demonstrate that our method significantly improves upon the state-of-the-art in regression-based pose estimation. More notably, ours is the first regression-based approach to perform favorably compared to the best heatmap-based pose estimation methods.
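A rough sketch of the direct-regression idea: each joint is a learnable query that attends to flattened image features, and a linear head regresses normalised (x, y) coordinates. This uses a vanilla transformer decoder with assumed dimensions; Poseur's specific attention design is not reproduced.

```python
# Sketch of regression-style keypoint decoding with learnable joint queries.
import torch
import torch.nn as nn

class KeypointRegressionHead(nn.Module):
    def __init__(self, num_joints=17, dim=256, num_heads=8, depth=3):
        super().__init__()
        # One learnable query per keypoint
        self.queries = nn.Parameter(torch.randn(num_joints, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)
        self.coord_head = nn.Linear(dim, 2)    # normalised (x, y) per joint

    def forward(self, image_tokens):
        # image_tokens: (B, N, dim) flattened backbone features
        b = image_tokens.shape[0]
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        q = self.decoder(q, image_tokens)       # queries attend to image tokens
        return self.coord_head(q).sigmoid()     # (B, num_joints, 2)

tokens = torch.randn(2, 16 * 12, 256)           # e.g. a 16x12 feature map
print(KeypointRegressionHead()(tokens).shape)   # torch.Size([2, 17, 2])
```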
2021
- Direct Multi-view Multi-person 3D Pose Estimation. Tao Wang, Jianfeng Zhang, Yujun Cai, Shuicheng Yan, and Jiashi Feng. 2021.
We present Multi-view Pose transformer (MvP) for estimating multi-person 3D poses from multi-view images. Instead of estimating 3D joint locations from a costly volumetric representation or reconstructing the per-person 3D pose from multiple detected 2D poses as in previous methods, MvP directly regresses the multi-person 3D poses in a clean and efficient way, without relying on intermediate tasks. Specifically, MvP represents skeleton joints as learnable query embeddings and lets them progressively attend to and reason over the multi-view information from the input images to directly regress the actual 3D joint locations. To improve the accuracy of such a simple pipeline, MvP presents a hierarchical scheme to concisely represent query embeddings of multi-person skeleton joints and introduces an input-dependent query adaptation approach. Further, MvP designs a novel geometrically guided attention mechanism, called projective attention, to more precisely fuse the cross-view information for each joint. MvP also introduces a RayConv operation to integrate the view-dependent camera geometry into the feature representations for augmenting the projective attention. We show experimentally that our MvP model outperforms the state-of-the-art methods on several benchmarks while being much more efficient. Notably, it achieves 92.3% AP25 on the challenging Panoptic dataset, improving upon the previous best approach [36] by 9.8%. MvP is general and also extendable to recovering human mesh represented by the SMPL model, and is thus useful for modeling multi-person body shapes. Code and models are available at https://github.com/sail-sg/mvp.
Multi-view features are aggregated directly by a transformer; see the sketch below.
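The projective-attention idea can be sketched as projecting a joint's current 3D estimate into each camera view and sampling the feature map at the projected pixel. The simple averaging fusion, names, and shapes below are assumptions; MvP's learned attention weights and RayConv are omitted.

```python
# Simplified sketch of projecting a 3D joint into each view and sampling
# per-view features at the projected pixel (assumed interface; averaging
# replaces MvP's learned fusion).
import torch
import torch.nn.functional as F

def projective_sample(joint_3d, cams, feature_maps, image_size):
    """joint_3d: (3,); cams: (V, 3, 4); feature_maps: (V, C, H, W);
    image_size: (width, height) in pixels. Returns a fused (C,) feature."""
    homo = torch.cat([joint_3d, torch.ones(1)])                 # (4,)
    uvw = torch.einsum('vij,j->vi', cams, homo)                 # (V, 3)
    uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)               # pixel coords
    # Normalise to [-1, 1] for grid_sample
    wh = torch.tensor(image_size, dtype=torch.float32)
    grid = (2.0 * uv / wh - 1.0).view(-1, 1, 1, 2)              # (V, 1, 1, 2)
    sampled = F.grid_sample(feature_maps, grid, align_corners=False)
    return sampled.view(feature_maps.shape[0], -1).mean(dim=0)  # (C,)

# Example with 5 views and 256x256 images
feat = projective_sample(torch.tensor([0.1, 0.2, 3.0]),
                         torch.randn(5, 3, 4),
                         torch.randn(5, 128, 64, 64),
                         image_size=(256, 256))
print(feat.shape)  # torch.Size([128])
```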
- Adaptive Multi-view and Temporal Fusing Transformer for 3D Human Pose Estimation. Hui Shuai, Lele Wu, and Qingshan Liu. 2021.
This paper proposes a unified framework dubbed Multi-view and Temporal Fusing Transformer (MTF-Transformer) to adaptively handle varying view numbers and video lengths without camera calibration in 3D Human Pose Estimation (HPE). It consists of a Feature Extractor, a Multi-view Fusing Transformer (MFT), and a Temporal Fusing Transformer (TFT). The Feature Extractor estimates the 2D pose from each image and fuses the predictions according to confidence. It provides pose-focused feature embeddings and makes subsequent modules computationally lightweight. MFT fuses the features of a varying number of views with a novel Relative-Attention block. It adaptively measures the implicit relative relationship between each pair of views and reconstructs more informative features. TFT aggregates the features of the whole sequence and predicts the 3D pose via a transformer. It adaptively deals with videos of arbitrary length and fully utilizes the temporal information. The migration of transformers enables our model to learn spatial geometry better and preserve robustness across varying application scenarios. We report quantitative and qualitative results on Human3.6M, TotalCapture, and KTH Multiview Football II. Compared with state-of-the-art methods that require camera parameters, MTF-Transformer obtains competitive results and generalizes well to dynamic capture with an arbitrary number of unseen views.
A transformer for multi-view feature fusion plus a transformer for temporal fusion; see the sketch below.
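A small sketch of calibration-free view fusion: treat each view's pose feature as a token and let self-attention mix them, so the same module accepts any number of views. This is a plain attention block with assumed dimensions standing in for the Relative-Attention design, not MTF-Transformer's exact module.

```python
# Sketch of fusing a variable number of views without camera calibration.
import torch
import torch.nn as nn

class ViewFusion(nn.Module):
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, view_feats):
        # view_feats: (B, V, dim), where V may differ between scenes
        fused, _ = self.attn(view_feats, view_feats, view_feats)
        fused = self.norm(view_feats + fused)      # residual + norm
        return fused.mean(dim=1)                   # (B, dim) scene-level feature

fusion = ViewFusion()
print(fusion(torch.randn(2, 3, 256)).shape)   # 3 views -> torch.Size([2, 256])
print(fusion(torch.randn(2, 7, 256)).shape)   # 7 views -> torch.Size([2, 256])
```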