1p
Papers with tag 1p
2022
- Learnable human mesh triangulation for 3D human pose and shape estimation. Sungho Chun, Sungbum Park, and Ju Yong Chang. In 2022
Compared to joint position, the accuracy of joint rotation and shape estimation has received relatively little attention in the skinned multi-person linear model (SMPL)-based human mesh reconstruction from multi-view images. The work in this field is broadly classified into two categories. The first approach performs joint estimation and then produces SMPL parameters by fitting SMPL to the resultant joints. The second approach regresses SMPL parameters directly from the input images through a convolutional neural network (CNN)-based model. However, these approaches suffer from the lack of information for resolving the ambiguity of joint rotation and shape reconstruction and the difficulty of network learning. To solve the aforementioned problems, we propose a two-stage method. The proposed method first estimates the coordinates of mesh vertices through a CNN-based model from input images, and acquires SMPL parameters by fitting the SMPL model to the estimated vertices. Estimated mesh vertices provide sufficient information for determining joint rotation and shape, and are easier to learn than SMPL parameters. According to experiments using the Human3.6M and MPI-INF-3DHP datasets, the proposed method significantly outperforms previous works in terms of joint rotation and shape estimation, and achieves competitive performance in terms of joint location estimation.
Performs per-view visibility estimation before feature fusion, followed by a fitting module (a minimal SMPL-fitting sketch follows below).
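As a rough illustration of the second stage, the sketch below fits SMPL pose/shape parameters to a set of predicted mesh vertices by gradient descent. It assumes the `smplx` package and a local SMPL model file at `model_path`, and a hypothetical `target_vertices` tensor of shape (1, 6890, 3); none of these names or hyperparameters come from the paper.

```python
# Hedged sketch: fit SMPL parameters to predicted vertices (assumes smplx is installed
# and an SMPL model file exists at model_path; target_vertices is hypothetical).
import torch
import smplx

def fit_smpl_to_vertices(target_vertices, model_path, iters=300, lr=0.05):
    model = smplx.create(model_path, model_type="smpl")        # neutral SMPL body model
    betas = torch.zeros(1, 10, requires_grad=True)             # shape coefficients
    body_pose = torch.zeros(1, 69, requires_grad=True)         # 23 body joints, axis-angle
    global_orient = torch.zeros(1, 3, requires_grad=True)      # root orientation
    transl = torch.zeros(1, 3, requires_grad=True)             # root translation

    optim = torch.optim.Adam([betas, body_pose, global_orient, transl], lr=lr)
    for _ in range(iters):
        optim.zero_grad()
        out = model(betas=betas, body_pose=body_pose,
                    global_orient=global_orient, transl=transl)
        loss = ((out.vertices - target_vertices) ** 2).mean()  # per-vertex L2
        loss.backward()
        optim.step()
    return betas.detach(), body_pose.detach(), global_orient.detach(), transl.detach()
```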
- On Triangulation as a Form of Self-Supervision for 3D Human Pose Estimation. Soumava Kumar Roy, Leonardo Citraro, Sina Honari, and Pascal Fua. In 2022
Supervised approaches to 3D pose estimation from single images are remarkably effective when labeled data is abundant. However, as the acquisition of ground-truth 3D labels is labor intensive and time consuming, recent attention has shifted towards semi- and weakly-supervised learning. Generating an effective form of supervision with few annotations still poses a major challenge in crowded scenes. In this paper we propose to impose multi-view geometrical constraints by means of a weighted differentiable triangulation and use it as a form of self-supervision when no labels are available. We therefore train a 2D pose estimator in such a way that its predictions correspond to the re-projection of the triangulated 3D pose, and train an auxiliary network on them to produce the final 3D poses. We complement the triangulation with a weighting mechanism that alleviates the impact of noisy predictions caused by self-occlusion or occlusion from other subjects. We demonstrate the effectiveness of our semi-supervised approach on the Human3.6M and MPI-INF-3DHP datasets, as well as on a new multi-view multi-person dataset that features occlusion.
Uses multi-view triangulation as a form of self-supervision (see the weighted triangulation sketch below).
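A minimal sketch of confidence-weighted differentiable triangulation via the direct linear transform (DLT), the kind of building block this line of work relies on. Tensor names and shapes (`proj_mats`, `points_2d`, `weights`) are my assumptions, not the authors' code.

```python
# Hedged sketch: confidence-weighted DLT triangulation of one joint from V views.
# proj_mats: (V, 3, 4) camera projection matrices; points_2d: (V, 2) pixel detections;
# weights: (V,) per-view confidences. All names/shapes are illustrative assumptions.
import torch

def weighted_triangulate(proj_mats, points_2d, weights):
    rows = []
    for P, (u, v), w in zip(proj_mats, points_2d, weights):
        # Each view contributes two rows of the homogeneous linear system A X = 0.
        rows.append(w * (u * P[2] - P[0]))
        rows.append(w * (v * P[2] - P[1]))
    A = torch.stack(rows)                       # (2V, 4)
    # The solution is the right singular vector with the smallest singular value.
    _, _, Vh = torch.linalg.svd(A)
    X = Vh[-1]
    return X[:3] / X[3]                         # 3D point in world coordinates
```

Because the SVD is differentiable, gradients can flow back into both the 2D detections and the confidence weights, which is what makes this usable as a self-supervision signal.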
- PPT: token-Pruned Pose Transformer for monocular and multi-view human pose estimation. Haoyu Ma, Zhe Wang, Yifei Chen, Deying Kong, Liangjian Chen, Xingwei Liu, Xiangyi Yan, Hao Tang, and Xiaohui Xie. In 2022
Recently, the vision transformer and its variants have played an increasingly important role in both monocular and multi-view human pose estimation. Considering image patches as tokens, transformers can model the global dependencies within the entire image or across images from other views. However, global attention is computationally expensive. As a consequence, it is difficult to scale up these transformer-based methods to high-resolution features and many views. In this paper, we propose the token-Pruned Pose Transformer (PPT) for 2D human pose estimation, which can locate a rough human mask and performs self-attention only within selected tokens. Furthermore, we extend our PPT to multi-view human pose estimation. Built upon PPT, we propose a new cross-view fusion strategy, called human area fusion, which considers all human foreground pixels as corresponding candidates. Experimental results on COCO and MPII demonstrate that our PPT can match the accuracy of previous pose transformer methods while reducing the computation. Moreover, experiments on Human3.6M and Ski-Pose demonstrate that our Multi-view PPT can efficiently fuse cues from multiple views and achieve new state-of-the-art results.
Uses human body regions for cross-view fusion (a token-pruning sketch follows below).
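A rough sketch of the token-pruning idea: score each patch token with a lightweight head, keep only the top-k tokens (roughly the human foreground), and run self-attention over the kept tokens only. The scoring head and the keep ratio below are placeholders of mine, not details from the paper.

```python
# Hedged sketch of token pruning before self-attention (shapes/names are illustrative).
import torch
import torch.nn as nn

class PrunedSelfAttention(nn.Module):
    def __init__(self, dim=256, heads=8, keep_ratio=0.3):
        super().__init__()
        self.score = nn.Linear(dim, 1)                 # per-token "human-ness" score (placeholder)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.keep_ratio = keep_ratio

    def forward(self, tokens):                         # tokens: (B, N, dim)
        B, N, _ = tokens.shape
        k = max(1, int(N * self.keep_ratio))
        scores = self.score(tokens).squeeze(-1)        # (B, N)
        idx = scores.topk(k, dim=1).indices            # indices of kept tokens
        kept = torch.gather(tokens, 1,
                            idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
        out, _ = self.attn(kept, kept, kept)           # attention only among kept tokens
        return out, idx
```

The attention cost drops from O(N^2) to O(k^2), which is what makes scaling to many views and higher resolutions tractable.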
- Learned Vertex Descent: A New Direction for 3D Human Model Fitting. Enric Corona, Gerard Pons-Moll, Guillem Alenyà, and Francesc Moreno-Noguer. In 2022
We propose a novel optimization-based paradigm for 3D human model fitting on images and scans. In contrast to existing approaches that directly regress the parameters of a low-dimensional statistical body model (e.g. SMPL) from input images, we train an ensemble of per-vertex neural fields. The network predicts, in a distributed manner, the vertex descent direction towards the ground truth, based on neural features extracted at the current vertex projection. At inference, we employ this network, dubbed LVD, within a gradient-descent optimization pipeline until its convergence, which typically occurs in a fraction of a second even when initializing all vertices into a single point. An exhaustive evaluation demonstrates that our approach is able to capture the underlying body of clothed people with very different body shapes, achieving a significant improvement compared to the state of the art. LVD is also applicable to 3D model fitting of humans and hands, for which we show a significant improvement over the SOTA with a much simpler and faster method.
A new optimization-based framework for human pose and model fitting (an iterative vertex-descent sketch follows below).
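The core inference loop of LVD can be pictured as repeatedly querying a network for a per-vertex displacement and moving the vertices accordingly. The sketch below assumes a callable `descent_net(image_feats, verts)` returning (B, V, 3) displacements; that interface is my simplification, not the released code.

```python
# Hedged sketch of learned-vertex-descent inference (interfaces are assumed).
import torch

@torch.no_grad()
def lvd_inference(descent_net, image_feats, num_vertices=6890, steps=50, step_size=1.0):
    # Start all vertices at a single point, which the paper notes is sufficient.
    verts = torch.zeros(1, num_vertices, 3)
    for _ in range(steps):
        delta = descent_net(image_feats, verts)   # predicted descent direction per vertex
        verts = verts + step_size * delta         # move each vertex along its direction
    return verts
```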
- Learning Visibility for Robust Dense Human Body Estimation. Chun-Han Yao, Jimei Yang, Duygu Ceylan, Yi Zhou, Yang Zhou, and Ming-Hsuan Yang. In ECCV 2022
Estimating 3D human pose and shape from 2D images is a crucial yet challenging task. While prior methods with model-based representations can perform reasonably well on whole-body images, they often fail when parts of the body are occluded or outside the frame. Moreover, these results usually do not faithfully capture the human silhouettes due to the limited representation power of deformable models (e.g., representing only the naked body). An alternative approach is to estimate dense vertices of a predefined template body in the image space. Such representations are effective in localizing vertices within an image but cannot handle out-of-frame body parts. In this work, we learn dense human body estimation that is robust to partial observations. We explicitly model the visibility of human joints and vertices in the x, y, and z axes separately. The visibility in the x and y axes helps distinguish out-of-frame cases, and the visibility in the depth axis corresponds to occlusions (either self-occlusions or occlusions by other objects). We obtain pseudo ground truths of visibility labels from dense UV correspondences and train a neural network to predict visibility along with 3D coordinates. We show that visibility can serve as 1) an additional signal to resolve depth ordering ambiguities of self-occluded vertices and 2) a regularization term when fitting a human body model to the predictions. Extensive experiments on multiple 3D human datasets demonstrate that visibility modeling significantly improves the accuracy of human body estimation, especially for partial-body cases. Our project page with code is at: https://github.com/chhankyao/visdb.
Estimates SMPL while explicitly accounting for occlusion (a visibility-label sketch follows below).
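A simplified picture of per-vertex visibility flags along x, y, and z: x/y visibility from whether the projected vertex lands inside the image, and z visibility by comparing the vertex depth against a depth buffer. The depth-buffer comparison stands in for the paper's UV-correspondence-based labels and is my assumption.

```python
# Hedged sketch of per-vertex visibility pseudo-labels (simplified; the paper derives
# its labels from dense UV correspondences, not from a depth buffer).
import torch

def visibility_labels(verts_2d, verts_depth, depth_buffer, img_w, img_h, eps=0.02):
    # verts_2d: (V, 2) projected pixel coords; verts_depth: (V,) camera-space depth;
    # depth_buffer: (H, W) depth of the closest surface per pixel (assumed given).
    u, v = verts_2d[:, 0], verts_2d[:, 1]
    vis_x = (u >= 0) & (u < img_w)                 # inside the frame horizontally
    vis_y = (v >= 0) & (v < img_h)                 # inside the frame vertically
    ui = u.clamp(0, img_w - 1).long()
    vi = v.clamp(0, img_h - 1).long()
    front = depth_buffer[vi, ui]                   # depth of whatever is visible at that pixel
    vis_z = verts_depth <= front + eps             # not occluded by a closer surface
    return vis_x, vis_y, vis_z
```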
2021
- Learning Temporal 3D Human Pose Estimation with Pseudo-Labels. Arij Bouazizi, Ulrich Kressel, and Vasileios Belagiannis. In 2021
We present a simple, yet effective, approach for self-supervised 3D human pose estimation. Unlike the prior work, we explore the temporal information next to the multi-view self-supervision. During training, we rely on triangulating 2D body pose estimates of a multiple-view camera system. A temporal convolutional neural network is trained with the generated 3D ground truth and the geometric multi-view consistency loss, imposing geometrical constraints on the predicted 3D body skeleton. During inference, our model receives a sequence of 2D body pose estimates from a single view to predict the 3D body pose for each of them. An extensive evaluation shows that our method achieves state-of-the-art performance on the Human3.6M and MPI-INF-3DHP benchmarks. Our code and models are publicly available at https://github.com/vru2020/TM_HPE/.
Takes a sequence of 2D keypoints as input, outputs 3D poses, and is supervised through multi-view consistency (a consistency-loss sketch follows below).
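One way to picture a geometric multi-view consistency term: bring per-view 3D predictions into a common world frame using the known extrinsics and penalize their disagreement. The exact loss used in the paper may differ; the pairwise-to-consensus MSE form and all names below are assumptions.

```python
# Hedged sketch of a multi-view consistency loss over per-view 3D pose predictions.
# poses_cam: (V, J, 3) predictions in each camera's frame; R: (V, 3, 3), t: (V, 3)
# camera-to-world extrinsics. Names and the consensus-MSE form are assumptions.
import torch

def multiview_consistency_loss(poses_cam, R, t):
    # Map every per-view prediction into the shared world frame.
    poses_world = torch.einsum("vij,vkj->vki", R, poses_cam) + t[:, None, :]
    mean_pose = poses_world.mean(dim=0, keepdim=True)
    # Penalize deviation of each view's prediction from the consensus pose.
    return ((poses_world - mean_pose) ** 2).mean()
```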
- Generalizable Human Pose Triangulation. Kristijan Bartol, David Bojanić, Tomislav Petković, and Tomislav Pribanić. In 2021
We address the problem of generalizability for multi-view 3D human pose estimation. The standard approach is to first detect 2D keypoints in images and then apply triangulation from multiple views. Even though the existing methods achieve remarkably accurate 3D pose estimation on public benchmarks, most of them are limited to a single spatial camera arrangement and number of cameras. Several methods address this limitation but demonstrate significantly degraded performance on novel views. We propose a stochastic framework for human pose triangulation and demonstrate a superior generalization across different camera arrangements on two public datasets. In addition, we apply the same approach to the fundamental matrix estimation problem, showing that the proposed method can successfully apply to other computer vision problems. The stochastic framework achieves more than 8.8% improvement on the 3D pose estimation task, compared to the state of the art, and more than 30% improvement for fundamental matrix estimation, compared to a standard algorithm.
Proposes a stochastic framework to address generalizable triangulation.
- Adaptive Multi-view and Temporal Fusing Transformer for 3D Human Pose Estimation. Hui Shuai, Lele Wu, and Qingshan Liu. In 2021
This paper proposes a unified framework dubbed Multi-view and Temporal Fusing Transformer (MTF-Transformer) to adaptively handle varying view numbers and video lengths without camera calibration in 3D Human Pose Estimation (HPE). It consists of a Feature Extractor, a Multi-view Fusing Transformer (MFT), and a Temporal Fusing Transformer (TFT). The Feature Extractor estimates the 2D pose from each image and fuses the predictions according to their confidence. It provides pose-focused feature embeddings and makes subsequent modules computationally lightweight. MFT fuses the features of a varying number of views with a novel Relative-Attention block. It adaptively measures the implicit relative relationship between each pair of views and reconstructs more informative features. TFT aggregates the features of the whole sequence and predicts the 3D pose via a transformer. It adaptively deals with videos of arbitrary length and fully utilizes the temporal information. The migration of transformers enables our model to learn spatial geometry better and preserve robustness for varying application scenarios. We report quantitative and qualitative results on Human3.6M, TotalCapture, and KTH Multiview Football II. Compared with state-of-the-art methods that use camera parameters, MTF-Transformer obtains competitive results and generalizes well to dynamic capture with an arbitrary number of unseen views.
A transformer for multi-view feature fusion combined with a transformer for temporal fusion.
- Semi-supervised Dense Keypoints using Unlabeled Multiview Images. Zhixuan Yu, Haozheng Yu, Long Sha, Sujoy Ganguly, and Hyun Soo Park. In 2021
This paper presents a new end-to-end semi-supervised framework to learn a dense keypoint detector using unlabeled multiview images. A key challenge lies in finding the exact correspondences between the dense keypoints in multiple views, since the inverse of the keypoint mapping can be neither analytically derived nor differentiated. This limits applying existing multiview supervision approaches for sparse keypoint detection that rely on exact correspondences. To address this challenge, we derive a new probabilistic epipolar constraint that encodes two desired properties. (1) Soft correspondence: we define a matchability, which measures the likelihood of a point matching the other image's corresponding point, thus relaxing the requirement of exact correspondences. (2) Geometric consistency: every point in the continuous correspondence fields must satisfy multiview consistency collectively. We formulate a probabilistic epipolar constraint using a weighted average of epipolar errors through the matchability, thereby generalizing the point-to-point geometric error to a field-to-field geometric error. This generalization facilitates learning a geometrically coherent dense keypoint detection model by utilizing a large number of unlabeled multiview images. Additionally, to prevent degenerate cases, we employ a distillation-based regularization using a pretrained model. Finally, we design a new neural network architecture, made of twin networks, that effectively minimizes the probabilistic epipolar errors of all possible correspondences between two view images by building affinity matrices. Our method shows superior performance compared to existing methods, including non-differentiable bootstrapping, in terms of keypoint accuracy, multiview consistency, and 3D reconstruction accuracy.
Proposes soft correspondences constrained by epipolar geometric consistency (a weighted epipolar-error sketch follows below).
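A minimal sketch of a matchability-weighted epipolar error: weight the point-to-epipolar-line distance of every candidate point in the second view by a softmax matchability and average. The distance-based residual and softmax weighting are my simplifications of the paper's formulation.

```python
# Hedged sketch of a matchability-weighted epipolar error (simplified formulation).
# x1: (3,) homogeneous point in view 1; cands: (N, 3) homogeneous candidates in view 2;
# logits: (N,) matchability scores; F: (3, 3) fundamental matrix mapping view 1 to view 2.
import torch

def weighted_epipolar_error(x1, cands, logits, F):
    line = F @ x1                                   # epipolar line of x1 in view 2
    # Point-to-line distance of each candidate to that epipolar line.
    dists = (cands @ line).abs() / line[:2].norm().clamp(min=1e-8)
    weights = torch.softmax(logits, dim=0)          # soft correspondence (matchability)
    return (weights * dists).sum()                  # expected epipolar error
```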
- FLEX: Extrinsic Parameters-free Multi-view 3D Human Motion Reconstruction. Brian Gordon, Sigal Raab, Guy Azov, Raja Giryes, and Daniel Cohen-Or. In ECCV 2021
The increasing availability of video recordings made by multiple cameras has offered new means for mitigating occlusion and depth ambiguities in pose and motion reconstruction methods. Yet, multi-view algorithms strongly depend on camera parameters; particularly, the relative transformations between the cameras. Such a dependency becomes a hurdle once shifting to dynamic capture in uncontrolled settings. We introduce FLEX (Free muLti-view rEconstruXion), an end-to-end extrinsic parameter-free multi-view model. FLEX is extrinsic parameter-free (dubbed ep-free) in the sense that it does not require extrinsic camera parameters. Our key idea is that the 3D angles between skeletal parts, as well as bone lengths, are invariant to the camera position. Hence, learning 3D rotations and bone lengths rather than locations allows predicting common values for all camera views. Our network takes multiple video streams, learns fused deep features through a novel multi-view fusion layer, and reconstructs a single consistent skeleton with temporally coherent joint rotations. We demonstrate quantitative and qualitative results on three public datasets, and on synthetic multi-person video streams captured by dynamic cameras. We compare our model to state-of-the-art methods that are not ep-free and show that in the absence of camera parameters, we outperform them by a large margin while obtaining comparable results when camera parameters are available. Code, trained models, and other materials are available on our project page.
Takes multi-view sequences of 2D estimates as input and predicts foot-contact labels, bone lengths, and 3D joint rotations; camera parameters are not required.
2020
- AdaFuse: Adaptive Multiview Fusion for Accurate Human Pose Estimation in the Wild. Zhe Zhang, Chunyu Wang, Weichao Qiu, Wenhu Qin, and Wenjun Zeng. In 2020
Occlusion is probably the biggest challenge for human pose estimation in the wild. Typical solutions often rely on intrusive sensors such as IMUs to detect occluded joints. To make the task truly unconstrained, we present AdaFuse, an adaptive multiview fusion method, which can enhance the features in occluded views by leveraging those in visible views. The core of AdaFuse is to determine the point-point correspondence between two views, which we solve effectively by exploring the sparsity of the heatmap representation. We also learn an adaptive fusion weight for each camera view to reflect its feature quality, in order to reduce the chance that good features are undesirably corrupted by "bad" views. The fusion model is trained end-to-end with the pose estimation network, and can be directly applied to new camera configurations without additional adaptation. We extensively evaluate the approach on three public datasets including Human3.6M, Total Capture and CMU Panoptic. It outperforms the state of the art on all of them. We also create a large-scale synthetic dataset, Occlusion-Person, which allows us to perform numerical evaluation on the occluded joints, as it provides occlusion labels for every joint in the images. The dataset and code are released at https://github.com/zhezh/adafuse-3d-human-pose.
Takes images from all views simultaneously, estimates keypoint heatmaps, and outputs 2D keypoints (an adaptive heatmap-fusion sketch follows below).
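The spirit of adaptive cross-view fusion can be sketched as a per-view weighted sum of heatmaps warped into the reference view. How the warp is built (epipolar geometry in the paper) is abstracted away here into a provided `warp_to_ref` callable; the learned scalar weights and all names are my assumptions, not the released implementation.

```python
# Hedged sketch of adaptive cross-view heatmap fusion (the epipolar warp is abstracted
# into `warp_to_ref`, assumed to be provided; per-view weights are learned scalars).
import torch
import torch.nn as nn

class AdaptiveHeatmapFusion(nn.Module):
    def __init__(self, num_views):
        super().__init__()
        # One learnable quality weight per source view (a simplification of the paper's
        # per-view weighting network).
        self.raw_weights = nn.Parameter(torch.zeros(num_views))

    def forward(self, heatmaps, warp_to_ref, ref_idx):
        # heatmaps: list of (B, J, H, W) tensors, one per view.
        fused = heatmaps[ref_idx].clone()
        weights = torch.sigmoid(self.raw_weights)
        for v, hm in enumerate(heatmaps):
            if v == ref_idx:
                continue
            fused = fused + weights[v] * warp_to_ref(hm, v, ref_idx)
        return fused
```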
2019
- Learnable Triangulation of Human Pose. Karim Iskakov, Egor Burkov, Victor Lempitsky, and Yury Malkov. In 2019
We present two novel solutions for multi-view 3D human pose estimation based on new learnable triangulation methods that combine 3D information from multiple 2D views. The first (baseline) solution is a basic differentiable algebraic triangulation with an addition of confidence weights estimated from the input images. The second solution is based on a novel method of volumetric aggregation from intermediate 2D backbone feature maps. The aggregated volume is then refined via 3D convolutions that produce final 3D joint heatmaps and allow modelling a human pose prior. Crucially, both approaches are end-to-end differentiable, which allows us to directly optimize the target metric. We demonstrate transferability of the solutions across datasets and considerably improve the multi-view state of the art on the Human3.6M dataset. Video demonstration, annotations and additional materials will be posted on our project page (https://saic-violet.github.io/learnable-triangulation).
Features from multiple views are unprojected into a shared 3D volume, and a 3D network produces the final output (a voxel-unprojection sketch follows below).
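The volumetric branch can be pictured as projecting each voxel center into every view, bilinearly sampling the 2D feature maps there, and aggregating over views before the 3D CNN. The sketch below uses simple mean aggregation and illustrative names; the actual code uses its own conventions.

```python
# Hedged sketch of unprojecting per-view 2D feature maps into a shared voxel volume.
# feats: (V, C, H, W) feature maps; proj_mats: (V, 3, 4); voxel_coords: (N, 3) world
# coordinates of voxel centers. Shapes, names, and mean aggregation are illustrative.
import torch
import torch.nn.functional as F

def unproject_to_volume(feats, proj_mats, voxel_coords, img_w, img_h):
    V, C, H, W = feats.shape
    N = voxel_coords.shape[0]
    homog = torch.cat([voxel_coords, torch.ones(N, 1)], dim=1)        # (N, 4)
    volume = feats.new_zeros(V, C, N)
    for v in range(V):
        uvw = homog @ proj_mats[v].T                                  # project voxels, (N, 3)
        uv = uvw[:, :2] / uvw[:, 2:].clamp(min=1e-6)                  # pixel coordinates
        # Normalize to [-1, 1] for grid_sample (x = width axis, y = height axis).
        grid = torch.stack([2 * uv[:, 0] / (img_w - 1) - 1,
                            2 * uv[:, 1] / (img_h - 1) - 1], dim=-1)
        grid = grid.view(1, 1, N, 2)
        sampled = F.grid_sample(feats[v:v + 1], grid, align_corners=True)  # (1, C, 1, N)
        volume[v] = sampled[0, :, 0, :]
    return volume.mean(dim=0)                                         # (C, N), averaged over views
```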
- Cross View Fusion for 3D Human Pose Estimation. Haibo Qiu, Chunyu Wang, Jingdong Wang, Naiyan Wang, and Wenjun Zeng. In 2019
We present an approach to recover absolute 3D human poses from multi-view images by incorporating multi-view geometric priors in our model. It consists of two separate steps: (1) estimating the 2D poses in multi-view images and (2) recovering the 3D poses from the multi-view 2D poses. First, we introduce a cross-view fusion scheme into the CNN to jointly estimate 2D poses for multiple views. Consequently, the 2D pose estimation for each view already benefits from other views. Second, we present a recursive Pictorial Structure Model to recover the 3D pose from the multi-view 2D poses. It gradually improves the accuracy of the 3D pose with affordable computational cost. We test our method on two public datasets, H36M and Total Capture. The Mean Per Joint Position Errors on the two datasets are 26mm and 29mm, which outperforms the state of the art remarkably (26mm vs 52mm, 29mm vs 35mm). Our code is released at https://github.com/microsoft/multiview-human-pose-estimation-pytorch.
Direct cross-view fusion of 2D heatmaps.
2017
- Harvesting Multiple Views for Marker-less 3D Human Pose Annotations. Georgios Pavlakos, Xiaowei Zhou, Konstantinos G. Derpanis, and Kostas Daniilidis. In 2017
Recent advances with Convolutional Networks (ConvNets) have shifted the bottleneck for many computer vision tasks to annotated data collection. In this paper, we present a geometry-driven approach to automatically collect annotations for human pose prediction tasks. Starting from a generic ConvNet for 2D human pose, and assuming a multi-view setup, we describe an automatic way to collect accurate 3D human pose annotations. We capitalize on constraints offered by the 3D geometry of the camera setup and the 3D structure of the human body to probabilistically combine per-view 2D ConvNet predictions into a globally optimal 3D pose. This 3D pose is used as the basis for harvesting annotations. The benefit of the annotations produced automatically with our approach is demonstrated in two challenging settings: (i) fine-tuning a generic ConvNet-based 2D pose predictor to capture the discriminative aspects of a subject's appearance (i.e., "personalization"), and (ii) training a ConvNet from scratch for single-view 3D human pose prediction without leveraging 3D pose ground truth. The proposed multi-view pose estimator achieves state-of-the-art results on standard benchmarks, demonstrating the effectiveness of our method in exploiting the available multi-view information.
Multi-view heatmaps are lifted into 3D features, and a 3D pictorial structure model recovers the skeleton joint locations.