Paper Reading
2022
- Learnable human mesh triangulation for 3D human pose and shape estimation. Sungho Chun, Sungbum Park, and Ju Yong Chang. In 2022.
Compared to joint position, the accuracy of joint rotation and shape estimation has received relatively little attention in the skinned multi-person linear model (SMPL)-based human mesh reconstruction from multi-view images. The work in this field is broadly classified into two categories. The first approach performs joint estimation and then produces SMPL parameters by fitting SMPL to the resultant joints. The second approach regresses SMPL parameters directly from the input images through a convolutional neural network (CNN)-based model. However, these approaches suffer from the lack of information for resolving the ambiguity of joint rotation and shape reconstruction and the difficulty of network learning. To solve the aforementioned problems, we propose a two-stage method. The proposed method first estimates the coordinates of mesh vertices through a CNN-based model from input images, and acquires SMPL parameters by fitting the SMPL model to the estimated vertices. Estimated mesh vertices provide sufficient information for determining joint rotation and shape, and are easier to learn than SMPL parameters. According to experiments using the Human3.6M and MPI-INF-3DHP datasets, the proposed method significantly outperforms the previous works in terms of joint rotation and shape estimation, and achieves competitive performance in terms of joint location estimation.
Performs a per-view visibility check before fusing features, and attaches a fitting module at the end.
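The paper does not spell out the fitting stage here; a minimal sketch of the second stage (fitting SMPL parameters to CNN-predicted vertices), assuming a hypothetical differentiable layer `smpl_forward(pose, shape) -> vertices`, might look like:

```python
import torch

def fit_smpl_to_vertices(smpl_forward, target_vertices, iters=200, lr=0.05):
    """Fit SMPL pose/shape parameters to vertices predicted by the first stage.

    smpl_forward:    hypothetical differentiable SMPL layer, (pose, shape) -> (V, 3).
    target_vertices: (V, 3) tensor of mesh vertices estimated by the CNN.
    """
    pose = torch.zeros(72, requires_grad=True)   # 24 joints x 3 axis-angle parameters
    shape = torch.zeros(10, requires_grad=True)  # low-dimensional shape coefficients
    optim = torch.optim.Adam([pose, shape], lr=lr)
    for _ in range(iters):
        optim.zero_grad()
        pred = smpl_forward(pose, shape)
        loss = ((pred - target_vertices) ** 2).mean()  # per-vertex L2 fitting loss
        loss.backward()
        optim.step()
    return pose.detach(), shape.detach()
```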
- On Triangulation as a Form of Self-Supervision for 3D Human Pose Estimation. Soumava Kumar Roy, Leonardo Citraro, Sina Honari, and Pascal Fua. In 2022.
Supervised approaches to 3D pose estimation from single images are remarkably effective when labeled data is abundant. However, as the acquisition of ground-truth 3D labels is labor intensive and time consuming, recent attention has shifted towards semi- and weakly-supervised learning. Generating an effective form of supervision with little annotations still poses a major challenge in crowded scenes. In this paper we propose to impose multi-view geometrical constraints by means of a weighted differentiable triangulation and use it as a form of self-supervision when no labels are available. We therefore train a 2D pose estimator in such a way that its predictions correspond to the re-projection of the triangulated 3D pose and train an auxiliary network on them to produce the final 3D poses. We complement the triangulation with a weighting mechanism that alleviates the impact of noisy predictions caused by self-occlusion or occlusion from other subjects. We demonstrate the effectiveness of our semi-supervised approach on the Human3.6M and MPI-INF-3DHP datasets, as well as on a new multi-view multi-person dataset that features occlusion.
Uses multi-view triangulation as self-supervision.
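The paper's exact weighting scheme is not reproduced here; a minimal sketch of confidence-weighted DLT triangulation, the generic building block the abstract refers to, assuming known projection matrices and per-view detector confidences:

```python
import numpy as np

def weighted_triangulate(proj_mats, points_2d, weights):
    """Triangulate one 3D point from N views via confidence-weighted DLT.

    proj_mats: (N, 3, 4) camera projection matrices.
    points_2d: (N, 2) detected 2D keypoint in each view.
    weights:   (N,) per-view confidences (e.g. from the 2D detector).
    """
    rows = []
    for P, (u, v), w in zip(proj_mats, points_2d, weights):
        # Each view contributes two rows of the homogeneous system A X = 0,
        # scaled by its confidence so noisy views have less influence.
        rows.append(w * (u * P[2] - P[0]))
        rows.append(w * (v * P[2] - P[1]))
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # de-homogenize
```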
- PPT: token-Pruned Pose Transformer for monocular and multi-view human pose estimation. Haoyu Ma, Zhe Wang, Yifei Chen, Deying Kong, Liangjian Chen, Xingwei Liu, Xiangyi Yan, Hao Tang, and Xiaohui Xie. In 2022.
Recently, the vision transformer and its variants have played an increasingly important role in both monocular and multi-view human pose estimation. Considering image patches as tokens, transformers can model the global dependencies within the entire image or across images from other views. However, global attention is computationally expensive. As a consequence, it is difficult to scale up these transformer-based methods to high-resolution features and many views. In this paper, we propose the token-Pruned Pose Transformer (PPT) for 2D human pose estimation, which can locate a rough human mask and performs self-attention only within selected tokens. Furthermore, we extend our PPT to multi-view human pose estimation. Built upon PPT, we propose a new cross-view fusion strategy, called human area fusion, which considers all human foreground pixels as corresponding candidates. Experimental results on COCO and MPII demonstrate that our PPT can match the accuracy of previous pose transformer methods while reducing the computation. Moreover, experiments on Human 3.6M and Ski-Pose demonstrate that our Multi-view PPT can efficiently fuse cues from multiple views and achieve new state-of-the-art results.
Uses the human body region for cross-view fusion.
- VTP: Volumetric Transformer for Multi-view Multi-person 3D Pose Estimation. Yuxing Chen, Renshu Gu, Ouhan Huang, and Gangyong Jia. In 2022.
This paper presents the Volumetric Transformer Pose estimator (VTP), the first 3D volumetric transformer framework for multi-view multi-person 3D human pose estimation. VTP aggregates features from 2D keypoints in all camera views and directly learns the spatial relationships in the 3D voxel space in an end-to-end fashion. The aggregated 3D features are passed through 3D convolutions before being flattened into sequential embeddings and fed into a transformer. A residual structure is designed to further improve the performance. In addition, the sparse Sinkhorn attention is employed to reduce the memory cost, which is a major bottleneck for volumetric representations, while also achieving excellent performance. The output of the transformer is again concatenated with 3D convolutional features by a residual design. The proposed VTP framework integrates the high performance of the transformer with volumetric representations, which can be used as a good alternative to convolutional backbones. Experiments on the Shelf, Campus and CMU Panoptic benchmarks show promising results in terms of both Mean Per Joint Position Error (MPJPE) and Percentage of Correctly estimated Parts (PCP). Our code will be available.
Uses a 3D volumetric transformer to regress human body coordinates.
- Learned Vertex Descent: A New Direction for 3D Human Model Fitting. Enric Corona, Gerard Pons-Moll, Guillem Alenyà, and Francesc Moreno-Noguer. In 2022.
We propose a novel optimization-based paradigm for 3D human model fitting on images and scans. In contrast to existing approaches that directly regress the parameters of a low-dimensional statistical body model (e.g. SMPL) from input images, we train an ensemble of per-vertex neural fields networks. The network predicts, in a distributed manner, the vertex descent direction towards the ground truth, based on neural features extracted at the current vertex projection. At inference, we employ this network, dubbed LVD, within a gradient-descent optimization pipeline until its convergence, which typically occurs in a fraction of a second even when initializing all vertices into a single point. An exhaustive evaluation demonstrates that our approach is able to capture the underlying body of clothed people with very different body shapes, achieving a significant improvement compared to state-of-the-art. LVD is also applicable to 3D model fitting of humans and hands, for which we show a significant improvement to the SOTA with a much simpler and faster method.
A new framework for pose estimation and model fitting.
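A minimal sketch of the inference loop the abstract describes, where `lvd_net`, `sample_features`, and `project` are hypothetical stand-ins for the trained per-vertex network, the feature sampling step, and the camera projection:

```python
import torch

def lvd_inference(lvd_net, sample_features, project, num_steps=50, num_vertices=6890):
    """Iteratively move mesh vertices along network-predicted descent directions.

    lvd_net:         trained net mapping per-vertex features and positions -> (V, 3) offsets.
    sample_features: function mapping (V, 2) image coords -> (V, C) sampled CNN features.
    project:         function mapping (V, 3) vertices -> (V, 2) image coordinates.
    All three are assumed components, not the paper's actual implementation.
    """
    # All vertices may start from a single point; the paper reports convergence
    # in a fraction of a second even from this initialization.
    verts = torch.zeros(num_vertices, 3)
    for _ in range(num_steps):
        uv = project(verts)            # project current vertex estimates to the image
        feat = sample_features(uv)     # neural features at those projections
        delta = lvd_net(feat, verts)   # predicted per-vertex descent direction
        verts = verts + delta          # take the step
    return verts
```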
- State of the Art in Dense Monocular Non-Rigid 3D Reconstruction. Edith Tretschk, Navami Kairanda, Mallikarjun B R, Rishabh Dabral, Adam Kortylewski, Bernhard Egger, Marc Habermann, Pascal Fua, Christian Theobalt, and Vladislav Golyanik. In 2022.
3D reconstruction of deformable (or non-rigid) scenes from a set of monocular 2D image observations is a long-standing and actively researched area of computer vision and graphics. It is an ill-posed inverse problem, since, without additional prior assumptions, it permits infinitely many solutions leading to accurate projection to the input 2D images. Non-rigid reconstruction is a foundational building block for downstream applications like robotics, AR/VR, or visual content creation. The key advantage of using monocular cameras is their omnipresence and availability to the end users, as well as their ease of use compared to more sophisticated camera set-ups such as stereo or multi-view systems. This survey focuses on state-of-the-art methods for dense non-rigid 3D reconstruction of various deformable objects and composite scenes from monocular videos or sets of monocular views. It reviews the fundamentals of 3D reconstruction and deformation modeling from 2D image observations. We then start from general methods, which handle arbitrary scenes and make only a few prior assumptions, and proceed towards techniques making stronger assumptions about the observed objects and types of deformations (e.g. human faces, bodies, hands, and animals). A significant part of this STAR is also devoted to classification and a high-level comparison of the methods, as well as an overview of the datasets for training and evaluation of the discussed techniques. We conclude by discussing open challenges in the field and the social aspects associated with the usage of the reviewed methods.
A survey of monocular non-rigid reconstruction that is worth reading.
- SUPR: A Sparse Unified Part-Based Human Representation. Ahmed A. A. Osman, Timo Bolkart, Dimitrios Tzionas, and Michael J. Black. In 2022.
Statistical 3D shape models of the head, hands, and full body are widely used in computer vision and graphics. Despite their wide use, we show that existing models of the head and hands fail to capture the full range of motion for these parts. Moreover, existing work largely ignores the feet, which are crucial for modeling human movement and have applications in biomechanics, animation, and the footwear industry. The problem is that previous body part models are trained using 3D scans that are isolated to the individual parts. Such data does not capture the full range of motion for such parts, e.g. the motion of the head relative to the neck. Our observation is that full-body scans provide important information about the motion of the body parts. Consequently, we propose a new learning scheme that jointly trains a full-body model and specific part models using a federated dataset of full-body and body-part scans. Specifically, we train an expressive human body model called SUPR (Sparse Unified Part-Based Human Representation), where each joint strictly influences a sparse set of model vertices. The factorized representation enables separating SUPR into an entire suite of body part models. Note that the feet have received little attention and existing 3D body models have highly under-actuated feet. Using novel 4D scans of feet, we train a model with an extended kinematic tree that captures the range of motion of the toes. Additionally, feet deform due to ground contact. To model this, we include a novel non-linear deformation function that predicts foot deformation conditioned on the foot pose, shape, and ground contact. We train SUPR on an unprecedented number of scans: 1.2 million body, head, hand and foot scans. We quantitatively compare SUPR and the separated body parts and find that our suite of models generalizes better than existing models. SUPR is available at http://supr.is.tue.mpg.de
A part-based human body model.
- FIND: An Unsupervised Implicit 3D Model of Articulated Human Feet. Oliver Boyne, James Charles, and Roberto Cipolla. In 2022.
In this paper we present a high fidelity and articulated 3D human foot model. The model is parameterised by a disentangled latent code in terms of shape, texture and articulated pose. While high fidelity models are typically created with strong supervision such as 3D keypoint correspondences or pre-registration, we focus on the difficult case of little to no annotation. To this end, we make the following contributions: (i) we develop a Foot Implicit Neural Deformation field model, named FIND, capable of tailoring explicit meshes at any resolution, i.e. for low or high powered devices; (ii) an approach for training our model in various modes of weak supervision with progressively better disentanglement as more labels, such as pose categories, are provided; (iii) a novel unsupervised part-based loss for fitting our model to 2D images which is better than traditional photometric or silhouette losses; (iv) finally, we release a new dataset of high resolution 3D human foot scans, Foot3D. On this dataset, we show our model outperforms a strong PCA implementation trained on the same data in terms of shape quality and part correspondences, and that our novel unsupervised part-based loss improves inference on images.
Trains an implicit representation of the foot with self-supervision from RGB images.
- HDHumans: A Hybrid Approach for High-fidelity Digital Humans. Marc Habermann, Lingjie Liu, Weipeng Xu, Gerard Pons-Moll, Michael Zollhoefer, and Christian Theobalt. In 2022.
Photo-real digital human avatars are of enormous importance in graphics, as they enable immersive communication over the globe, improve gaming and entertainment experiences, and can be particularly beneficial for AR and VR settings. However, current avatar generation approaches either fall short in high-fidelity novel view synthesis, generalization to novel motions, reproduction of loose clothing, or they cannot render characters at the high resolution offered by modern displays. To this end, we propose HDHumans, which is the first method for HD human character synthesis that jointly produces an accurate and temporally coherent 3D deforming surface and highly photo-realistic images of arbitrary novel views and of motions not seen at training time. At the technical core, our method tightly integrates a classical deforming character template with neural radiance fields (NeRF). Our method is carefully designed to achieve a synergy between classical surface deformation and NeRF. First, the template guides the NeRF, which allows synthesizing novel views of a highly dynamic and articulated character and even enables the synthesis of novel motions. Second, we also leverage the dense point clouds resulting from NeRF to further improve the deforming surface via 3D-to-3D supervision. We outperform the state of the art quantitatively and qualitatively in terms of synthesis quality and resolution, as well as the quality of 3D surface reconstruction.
An extension of DeepCap.
- JRDB-Pose: A Large-scale Dataset for Multi-Person Pose Estimation and Tracking. Edward Vendrow, Duy Tho Le, and Hamid Rezatofighi. In 2022.
Autonomous robotic systems operating in human environments must understand their surroundings to make accurate and safe decisions. In crowded human scenes with close-up human-robot interaction and robot navigation, a deep understanding requires reasoning about human motion and body dynamics over time with human body pose estimation and tracking. However, existing datasets either do not provide pose annotations or include scene types unrelated to robotic applications. Many datasets also lack the diversity of poses and occlusions found in crowded human scenes. To address this limitation we introduce JRDB-Pose, a large-scale dataset and benchmark for multi-person pose estimation and tracking using videos captured from a social navigation robot. The dataset contains challenging scenes with crowded indoor and outdoor locations and a diverse range of scales and occlusion types. JRDB-Pose provides human pose annotations with per-keypoint occlusion labels and track IDs consistent across the scene. A public evaluation server is made available for fair evaluation on a held-out test set. JRDB-Pose is available at https://jrdb.erc.monash.edu/ .
Uses a panoramic camera rather than an ordinary multi-view camera rig.
- Bootstrapping Human Optical Flow and Pose. Aritro Roy Arko, James J. Little, and Kwang Moo Yi. In 2022.
We propose a bootstrapping framework to enhance human optical flow and pose. We show that, for videos involving humans in scenes, we can improve both the optical flow and the pose estimation quality of humans by considering the two tasks at the same time. We enhance optical flow estimates by fine-tuning them to fit the human pose estimates and vice versa. In more detail, we optimize the pose and optical flow networks to, at inference time, agree with each other. We show that this results in state-of-the-art results on the Human 3.6M and 3D Poses in the Wild datasets, as well as a human-related subset of the Sintel dataset, both in terms of pose estimation accuracy and the optical flow accuracy at human joint locations. Code available at https://github.com/ubc-vision/bootstrapping-human-optical-flow-and-pose
Uses pose to enhance optical flow, and vice versa.
- PoseScript: 3D Human Poses from Natural Language. Ginger Delmas, Philippe Weinzaepfel, Thomas Lucas, Francesc Moreno-Noguer, and Grégory Rogez. In 2022.
Natural language is leveraged in many computer vision tasks such as image captioning, cross-modal retrieval or visual question answering, to provide fine-grained semantic information. While human pose is key to human understanding, current 3D human pose datasets lack detailed language descriptions. In this work, we introduce the PoseScript dataset, which pairs a few thousand 3D human poses from AMASS with rich human-annotated descriptions of the body parts and their spatial relationships. To increase the size of this dataset to a scale compatible with typical data hungry learning algorithms, we propose an elaborate captioning process that generates automatic synthetic descriptions in natural language from given 3D keypoints. This process extracts low-level pose information (the posecodes) using a set of simple but generic rules on the 3D keypoints. The posecodes are then combined into higher level textual descriptions using syntactic rules. Automatic annotations substantially increase the amount of available data, and make it possible to effectively pretrain deep models for finetuning on human captions. To demonstrate the potential of annotated poses, we show applications of the PoseScript dataset to retrieval of relevant poses from large-scale datasets and to synthetic pose generation, both based on a textual pose description.
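The abstract only names the posecode idea; a toy sketch of one such rule is shown below. The joint names and the angle threshold are illustrative assumptions, not the paper's actual posecodes:

```python
import numpy as np

def knee_bend_posecode(joints, side="left", bent_thresh_deg=140.0):
    """Toy example of a low-level 'posecode' extracted from 3D keypoints.

    joints: dict mapping joint names to 3D positions (np.array of shape (3,)).
    Returns a categorical code that a syntactic rule could later turn into text,
    e.g. "the left knee is bent".
    """
    hip, knee, ankle = (joints[f"{side}_hip"], joints[f"{side}_knee"],
                        joints[f"{side}_ankle"])
    a, b = hip - knee, ankle - knee
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return "bent" if angle < bent_thresh_deg else "straight"
```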
- Multi-view Tracking Using Weakly Supervised Human Motion Prediction. Martin Engilberge, Weizhe Liu, and Pascal Fua. In 2022.
Multi-view approaches to people-tracking have the potential to better handle occlusions than single-view ones in crowded scenes. They often rely on the tracking-by-detection paradigm, which involves detecting people first and then connecting the detections. In this paper, we argue that an even more effective approach is to predict people motion over time and infer people's presence in individual frames from these. This enables enforcing consistency both over time and across views of a single temporal frame. We validate our approach on the PETS2009 and WILDTRACK datasets and demonstrate that it outperforms state-of-the-art methods.
- ARAH: Animatable Volume Rendering of Articulated Human SDFs. Shaofei Wang, Katja Schwarz, Andreas Geiger, and Siyu Tang. In 2022.
Combining human body models with differentiable rendering has recently enabled animatable avatars of clothed humans from sparse sets of multi-view RGB videos. While state-of-the-art approaches achieve realistic appearance with neural radiance fields (NeRF), the inferred geometry often lacks detail due to missing geometric constraints. Further, animating avatars in out-of-distribution poses is not yet possible because the mapping from observation space to canonical space does not generalize faithfully to unseen poses. In this work, we address these shortcomings and propose a model to create animatable clothed human avatars with detailed geometry that generalize well to out-of-distribution poses. To achieve detailed geometry, we combine an articulated implicit surface representation with volume rendering. For generalization, we propose a novel joint root-finding algorithm for simultaneous ray-surface intersection search and correspondence search. Our algorithm enables efficient point sampling and accurate point canonicalization while generalizing well to unseen poses. We demonstrate that our proposed pipeline can generate clothed avatars with high-quality pose-dependent geometry and appearance from a sparse set of multi-view RGB videos. Our method achieves state-of-the-art performance on geometry and appearance reconstruction while creating animatable avatars that generalize well to out-of-distribution poses beyond the small number of training poses.
- Human Body Measurement Estimation with Adversarial Augmentation. Nataniel Ruiz, Miriam Bellver, Timo Bolkart, Ambuj Arora, Ming C. Lin, Javier Romero, and Raja Bala. In 2022.
We present a Body Measurement network (BMnet) for estimating 3D anthropomorphic measurements of the human body shape from silhouette images. Training of BMnet is performed on data from real human subjects, and augmented with a novel adversarial body simulator (ABS) that finds and synthesizes challenging body shapes. ABS is based on the skinned multi-person linear (SMPL) body model, and aims to maximize BMnet measurement prediction error with respect to latent SMPL shape parameters. ABS is fully differentiable with respect to these parameters, and trained end-to-end via backpropagation with BMnet in the loop. Experiments show that ABS effectively discovers adversarial examples, such as bodies with extreme body mass indices (BMI), consistent with the rarity of extreme-BMI bodies in BMnet's training set. Thus ABS is able to reveal gaps in training data and potential failures in predicting under-represented body shapes. Results show that training BMnet with ABS improves measurement prediction accuracy on real bodies by up to 10%, when compared to no augmentation or random body shape sampling. Furthermore, our method significantly outperforms SOTA measurement estimation methods by as much as 3x. Finally, we release BodyM, the first challenging, large-scale dataset of photo silhouettes and body measurements of real human subjects, to further promote research in this area. Project website: https://adversarialbodysim.github.io
- HiFECap: Monocular High-Fidelity and Expressive Capture of Human Performances. Yue Jiang, Marc Habermann, Vladislav Golyanik, and Christian Theobalt. In 2022.
Monocular 3D human performance capture is indispensable for many applications in computer graphics and vision for enabling immersive experiences. However, detailed capture of humans requires tracking of multiple aspects, including the skeletal pose, the dynamic surface, which includes clothing, hand gestures, as well as facial expressions. No existing monocular method allows joint tracking of all these components. To this end, we propose HiFECap, a new neural human performance capture approach, which simultaneously captures human pose, clothing, facial expression, and hands just from a single RGB video. We demonstrate that our proposed network architecture, the carefully designed training strategy, and the tight integration of parametric face and hand models to a template mesh enable the capture of all these individual aspects. Importantly, our method also captures high-frequency details, such as deforming wrinkles on the clothes, better than the previous works. Furthermore, we show that HiFECap outperforms the state-of-the-art human performance capture approaches qualitatively and quantitatively while for the first time capturing all aspects of the human.
- Self-Supervised 3D Human Pose Estimation in Static Video Via Neural Rendering. Luca Schmidtke, Benjamin Hou, Athanasios Vlontzos, and Bernhard Kainz. In 2022.
Inferring 3D human pose from 2D images is a challenging and long-standing problem in the field of computer vision with many applications including motion capture, virtual reality, surveillance or gait analysis for sports and medicine. We present preliminary results for a method to estimate 3D pose from 2D video containing a single person and a static background without the need for any manual landmark annotations. We achieve this by formulating a simple yet effective self-supervision task: our model is required to reconstruct a random frame of a video given a frame from another timepoint and a rendered image of a transformed human shape template. Crucially for optimisation, our ray casting based rendering pipeline is fully differentiable, enabling end to end training solely based on the reconstruction task.
- AdaptivePose++: A Powerful Single-Stage Network for Multi-Person Pose Regression. Yabo Xiao, Xiaojuan Wang, Dongdong Yu, Kai Su, Lei Jin, Mei Song, Shuicheng Yan, and Jian Zhao. In 2022.
Multi-person pose estimation generally follows top-down and bottom-up paradigms. Both of them use an extra stage (e.g., human detection in the top-down paradigm or a grouping process in the bottom-up paradigm) to build the relationship between the human instance and corresponding keypoints, thus leading to a high computation cost and a redundant two-stage pipeline. To address the above issue, we propose to represent the human parts as adaptive points and introduce a fine-grained body representation method. The novel body representation is able to sufficiently encode the diverse pose information and effectively model the relationship between the human instance and corresponding keypoints in a single-forward pass. With the proposed body representation, we further deliver a compact single-stage multi-person pose regression network, termed AdaptivePose. During inference, our proposed network only needs a single-step decode operation to form the multi-person pose without complex post-processes and refinements. We employ AdaptivePose for both 2D/3D multi-person pose estimation tasks to verify the effectiveness of AdaptivePose. Without any bells and whistles, we achieve the most competitive performance on MS COCO and CrowdPose in terms of accuracy and speed. Furthermore, the outstanding performance on MuCo-3DHP and MuPoTS-3D further demonstrates the effectiveness and generalizability on 3D scenes. Code is available at https://github.com/buptxyb666/AdaptivePose.
- Contact-aware Human Motion Forecasting. Wei Mao, Miaomiao Liu, Richard Hartley, and Mathieu Salzmann. In 2022.
In this paper, we tackle the task of scene-aware 3D human motion forecasting, which consists of predicting future human poses given a 3D scene and a past human motion. A key challenge of this task is to ensure consistency between the human and the scene, accounting for human-scene interactions. Previous attempts to do so model such interactions only implicitly, and thus tend to produce artifacts such as "ghost motion" because of the lack of explicit constraints between the local poses and the global motion. Here, by contrast, we propose to explicitly model the human-scene contacts. To this end, we introduce distance-based contact maps that capture the contact relationships between every joint and every 3D scene point at each time instant. We then develop a two-stage pipeline that first predicts the future contact maps from the past ones and the scene point cloud, and then forecasts the future human poses by conditioning them on the predicted contact maps. During training, we explicitly encourage consistency between the global motion and the local poses via a prior defined using the contact maps and future poses. Our approach outperforms the state-of-the-art human motion forecasting and human synthesis methods on both synthetic and real datasets. Our code is available at https://github.com/wei-mao-2019/ContAwareMotionPred.
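A minimal sketch of the distance-based contact map described above; the Gaussian mapping from distance to contact score is an assumption, since the paper may use a different distance-to-contact function:

```python
import torch

def contact_map(joints, scene_points, sigma=0.1):
    """Per-frame distance-based contact map between joints and scene points.

    joints:       (J, 3) human joint positions at one time instant.
    scene_points: (P, 3) 3D scene point cloud.
    Returns a (J, P) map in [0, 1]; values near 1 indicate likely contact.
    """
    d = torch.cdist(joints, scene_points)          # (J, P) pairwise distances
    return torch.exp(-d ** 2 / (2 * sigma ** 2))   # soft contact score (assumed form)
```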
- Spatio-temporal Tendency Reasoning for Human Body Pose and Shape Estimation from Videos. Boyang Zhang, SuPing Wu, Hu Cao, Kehua Ma, Pan Li, and Lei Lin. In 2022.
In this paper, we present a spatio-temporal tendency reasoning (STR) network for recovering human body pose and shape from videos. Previous approaches have focused on how to extend 3D human datasets and temporal-based learning to promote accuracy and temporal smoothing. Different from them, our STR aims to learn accurate and natural motion sequences in an unconstrained environment through temporal and spatial tendency and to fully excavate the spatio-temporal features of existing video data. To this end, our STR learns the representation of features in the temporal and spatial dimensions respectively, to concentrate on a more robust representation of spatio-temporal features. More specifically, for efficient temporal modeling, we first propose a temporal tendency reasoning (TTR) module. TTR constructs a time-dimensional hierarchical residual connection representation within a video sequence to effectively reason about temporal sequences' tendencies and retain effective dissemination of human information. Meanwhile, for enhancing the spatial representation, we design a spatial tendency enhancing (STE) module to further learn to excite spatially time-frequency domain sensitive features in human motion information representations. Finally, we introduce integration strategies to integrate and refine the spatio-temporal feature representations. Extensive experimental findings on large-scale publicly available datasets reveal that our STR remains competitive with the state-of-the-art on three datasets. Our code is available at https://github.com/Changboyang/STR.git.
Uses integration strategies to improve performance.
- Capturing and Animation of Body and Clothing from Monocular Video. Yao Feng, Jinlong Yang, Marc Pollefeys, Michael J. Black, and Timo Bolkart. In 2022.
While recent work has shown progress on extracting clothed 3D human avatars from a single image, video, or a set of 3D scans, several limitations remain. Most methods use a holistic representation to jointly model the body and clothing, which means that the clothing and body cannot be separated for applications like virtual try-on. Other methods separately model the body and clothing, but they require training from a large set of 3D clothed human meshes obtained from 3D/4D scanners or physics simulations. Our insight is that the body and clothing have different modeling requirements. While the body is well represented by a mesh-based parametric 3D model, implicit representations and neural radiance fields are better suited to capturing the large variety in shape and appearance present in clothing. Building on this insight, we propose SCARF (Segmented Clothed Avatar Radiance Field), a hybrid model combining a mesh-based body with a neural radiance field. Integrating the mesh into the volumetric rendering in combination with a differentiable rasterizer enables us to optimize SCARF directly from monocular videos, without any 3D supervision. The hybrid modeling enables SCARF to (i) animate the clothed body avatar by changing body poses (including hand articulation and facial expressions), (ii) synthesize novel views of the avatar, and (iii) transfer clothing between avatars in virtual try-on applications. We demonstrate that SCARF reconstructs clothing with higher visual quality than existing methods, that the clothing deforms with changing body pose and body shape, and that clothing can be successfully transferred between avatars of different subjects. The code and models are available at https://github.com/YadiraF/SCARF.
Takes a monocular RGB video and a clothing segmentation as input, and outputs separate, animatable body and clothing layers.
- Multi-view Human Body Mesh Translator. Xiangjian Jiang, Xuecheng Nie, Zitian Wang, Luoqi Liu, and Si Liu. In 2022.
Existing methods for human mesh recovery mainly focus on single-view frameworks, but they often fail to produce accurate results due to the ill-posed setup. Considering the maturity of the multi-view motion capture system, in this paper, we propose to solve the prior ill-posed problem by leveraging multiple images from different views, thus significantly enhancing the quality of recovered meshes. In particular, we present a novel Multi-view human body Mesh Translator (MMT) model for estimating human body mesh with the help of vision transformers. Specifically, MMT takes multi-view images as input and translates them to targeted meshes in a single-forward manner. MMT fuses features of different views in both encoding and decoding phases, leading to representations embedded with global information. Additionally, to ensure the tokens are intensively focused on the human pose and shape, MMT conducts cross-view alignment at the feature level by projecting 3D keypoint positions to each view and enforcing their consistency in geometry constraints. Comprehensive experiments demonstrate that MMT outperforms existing single or multi-view models by a large margin for the human mesh recovery task, notably, a 28.8% improvement in MPVE over the current state-of-the-art method on the challenging HUMBI dataset. Qualitative evaluation also verifies the effectiveness of MMT in reconstructing high-quality human mesh. Codes will be made available upon acceptance.
- SelfNeRF: Fast Training NeRF for Human from Monocular Self-rotating Video. Bo Peng, Jun Hu, Jingtao Zhou, and Juyong Zhang. In 2022.
In this paper, we propose SelfNeRF, an efficient neural radiance field based novel view synthesis method for human performance. Given monocular self-rotating videos of human performers, SelfNeRF can train from scratch and achieve high-fidelity results in about twenty minutes. Some recent works have utilized the neural radiance field for dynamic human reconstruction. However, most of these methods need multi-view inputs and require hours of training, making it still difficult for practical use. To address this challenging problem, we introduce a surface-relative representation based on multi-resolution hash encoding that can greatly improve the training speed and aggregate inter-frame information. Extensive experimental results on several different datasets demonstrate the effectiveness and efficiency of SelfNeRF on challenging monocular videos.
- Heatmap Distribution Matching for Human Pose Estimation. Haoxuan Qu, Li Xu, Yujun Cai, Lin Geng Foo, and Jun Liu. In 2022.
For tackling the task of 2D human pose estimation, the great majority of recent methods regard this task as a heatmap estimation problem, and optimize the heatmap prediction using the Gaussian-smoothed heatmap as the optimization objective and using the pixel-wise loss (e.g. MSE) as the loss function. In this paper, we show that when optimizing the heatmap prediction in such a way, the model performance of body joint localization, which is the intrinsic objective of this task, may not be consistently improved during the optimization process of the heatmap prediction. To address this problem, from a novel perspective, we propose to formulate the optimization of the heatmap prediction as a distribution matching problem between the predicted heatmap and the dot annotation of the body joint directly. By doing so, our proposed method does not need to construct the Gaussian-smoothed heatmap and can achieve a more consistent model performance improvement during the optimization of the heatmap prediction. We show the effectiveness of our proposed method through extensive experiments on the COCO dataset and the MPII dataset.
Highlights a problem with the supervision loss used in heatmap regression.
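For context, a minimal sketch of the conventional supervision the abstract argues against (a Gaussian-smoothed target heatmap plus pixel-wise MSE); the paper's distribution-matching loss replaces this target construction and is not reproduced here:

```python
import torch

def gaussian_target(joint_xy, height, width, sigma=2.0):
    """Conventional Gaussian-smoothed heatmap target for one joint annotation."""
    ys = torch.arange(height).view(-1, 1).float()
    xs = torch.arange(width).view(1, -1).float()
    x, y = joint_xy
    return torch.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))

def heatmap_mse_loss(pred_heatmap, joint_xy, sigma=2.0):
    """Pixel-wise MSE against the Gaussian target (the standard baseline)."""
    target = gaussian_target(joint_xy, *pred_heatmap.shape[-2:], sigma=sigma)
    return torch.nn.functional.mse_loss(pred_heatmap, target)
```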
- MonoNHR: Monocular Neural Human Renderer. Hongsuk Choi, Gyeongsik Moon, Matthieu Armando, Vincent Leroy, Kyoung Mu Lee, and Gregory Rogez. In 2022.
Existing neural human rendering methods struggle with a single image input due to the lack of information in invisible areas and the depth ambiguity of pixels in visible areas. In this regard, we propose the Monocular Neural Human Renderer (MonoNHR), a novel approach that renders robust free-viewpoint images of an arbitrary human given only a single image. MonoNHR is the first method that (i) renders human subjects never seen during training in a monocular setup, and (ii) is trained in a weakly-supervised manner without geometry supervision. First, we propose to disentangle 3D geometry and texture features and to condition the texture inference on the 3D geometry features. Second, we introduce a Mesh Inpainter module that inpaints the occluded parts exploiting human structural priors such as symmetry. Experiments on the ZJU-MoCap, AIST, and HUMBI datasets show that our approach significantly outperforms the recent methods adapted to the monocular case.
- SmartMocap: Joint Estimation of Human and Camera Motion using Uncalibrated RGB Cameras. Nitin Saini, Chun-hao P. Huang, Michael J. Black, and Aamir Ahmad. In 2022.
Markerless human motion capture (mocap) from multiple RGB cameras is a widely studied problem. Existing methods either need calibrated cameras or calibrate them relative to a static camera, which acts as the reference frame for the mocap system. The calibration step has to be done a priori for every capture session, which is a tedious process, and re-calibration is required whenever cameras are intentionally or accidentally moved. In this paper, we propose a mocap method which uses multiple static and moving extrinsically uncalibrated RGB cameras. The key components of our method are as follows. First, since the cameras and the subject can move freely, we select the ground plane as a common reference to represent both the body and the camera motions, unlike existing methods which represent bodies in the camera coordinates. Second, we learn a probability distribution of short human motion sequences (~1 sec) relative to the ground plane and leverage it to disambiguate between the camera and human motion. Third, we use this distribution as a motion prior in a novel multi-stage optimization approach to fit the SMPL human body model and the camera poses to the human body keypoints on the images. Finally, we show that our method can work on a variety of datasets ranging from aerial cameras to smartphones. It also gives more accurate results compared to the state-of-the-art on the task of monocular human mocap with a static camera. Our code is available for research purposes at https://github.com/robot-perception-group/SmartMocap.
- Regularizing Vector Embedding in Bottom-Up Human Pose Estimation. In ECCV 2022.
Uses scale to improve the embedding.
- NDF: Neural Deformable Fields for Dynamic Human Modelling. Ruiqi Zhang, and Jie Chen. In ECCV 2022.
We propose Neural Deformable Fields (NDF), a new representation for dynamic human digitization from a multi-view video. Recent works proposed to represent a dynamic human body with shared canonical neural radiance fields which link to the observation space with deformation field estimations. However, the learned canonical representation is static and the current design of the deformation fields is not able to represent large movements or detailed geometry changes. In this paper, we propose to learn a neural deformable field wrapped around a fitted parametric body model to represent the dynamic human. The NDF is spatially aligned by the underlying reference surface. A neural network is then learned to map pose to the dynamics of NDF. The proposed NDF representation can synthesize the digitized performer with novel views and novel poses with a detailed and reasonable dynamic appearance. Experiments show that our method significantly outperforms recent human synthesis methods.
- 3D Human Pose Estimation Using Möbius Graph Convolutional Networks. Niloofar Azizi, Horst Possegger, Emanuele Rodolà, and Horst Bischof. In ECCV 2022.
3D human pose estimation is fundamental to understanding human behavior. Recently, promising results have been achieved by graph convolutional networks (GCNs), which achieve state-of-the-art performance and provide rather light-weight architectures. However, a major limitation of GCNs is their inability to encode all the transformations between joints explicitly. To address this issue, we propose a novel spectral GCN using the Möbius transformation (MöbiusGCN). In particular, this allows us to directly and explicitly encode the transformation between joints, resulting in a significantly more compact representation. Compared to even the lightest architectures so far, our novel approach requires 90-98% fewer parameters, i.e. our lightest MöbiusGCN uses only 0.042M trainable parameters. Besides the drastic parameter reduction, explicitly encoding the transformation of joints also enables us to achieve state-of-the-art results. We evaluate our approach on the two challenging pose estimation benchmarks, Human3.6M and MPI-INF-3DHP, demonstrating both state-of-the-art results and the generalization capabilities of MöbiusGCN.
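The abstract does not spell out the transformation itself; for reference, a Möbius transformation of the complex plane has the general form below (how MöbiusGCN embeds it into the spectral graph filters is not reproduced here):

```latex
f(z) = \frac{az + b}{cz + d}, \qquad a, b, c, d \in \mathbb{C}, \quad ad - bc \neq 0
```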
- DiffuStereo: High Quality Human Reconstruction via Diffusion-based Stereo Using Sparse Cameras. Ruizhi Shao, Zerong Zheng, Hongwen Zhang, Jingxiang Sun, and Yebin Liu. In ECCV 2022.
We propose DiffuStereo, a novel system using only sparse cameras (8 in this work) for high-quality 3D human reconstruction. At its core is a novel diffusion-based stereo module, which introduces diffusion models, a type of powerful generative models, into the iterative stereo matching network. To this end, we design a new diffusion kernel and additional stereo constraints to facilitate stereo matching and depth estimation in the network. We further present a multi-level stereo network architecture to handle high-resolution (up to 4k) inputs without requiring an unaffordable memory footprint. Given a set of sparse-view color images of a human, the proposed multi-level diffusion-based stereo network can produce highly accurate depth maps, which are then converted into a high-quality 3D human model through an efficient multi-view fusion strategy. Overall, our method enables automatic reconstruction of human models with quality on par with high-end dense-view camera rigs, and this is achieved using a much more light-weight hardware setup. Experiments show that our method outperforms state-of-the-art methods by a large margin both qualitatively and quantitatively.
- Explicit Occlusion Reasoning for Multi-person 3D Human Pose Estimation. Qihao Liu, Yi Zhang, Song Bai, and Alan Yuille. In ECCV 2022.
Occlusion poses a great threat to monocular multi-person 3D human pose estimation due to large variability in terms of the shape, appearance, and position of occluders. While existing methods try to handle occlusion with pose priors/constraints, data augmentation, or implicit reasoning, they still fail to generalize to unseen poses or occlusion cases and may make large mistakes when multiple people are present. Inspired by the remarkable ability of humans to infer occluded joints from visible cues, we develop a method to explicitly model this process that significantly improves bottom-up multi-person human pose estimation with or without occlusions. First, we split the task into two subtasks: visible keypoint detection and occluded keypoint reasoning, and propose a Deeply Supervised Encoder Distillation (DSED) network to solve the second one. To train our model, we propose a Skeleton-guided human Shape Fitting (SSF) approach to generate pseudo occlusion labels on the existing datasets, enabling explicit occlusion reasoning. Experiments show that explicitly learning from occlusions improves human pose estimation. In addition, exploiting feature-level information of visible joints allows us to reason about occluded joints more accurately. Our method outperforms both the state-of-the-art top-down and bottom-up methods on several benchmarks.
Estimates the occluded keypoints first and then performs association.
- Learning Visibility for Robust Dense Human Body Estimation. Chun-Han Yao, Jimei Yang, Duygu Ceylan, Yi Zhou, Yang Zhou, and Ming-Hsuan Yang. In ECCV 2022.
Estimating 3D human pose and shape from 2D images is a crucial yet challenging task. While prior methods with model-based representations can perform reasonably well on whole-body images, they often fail when parts of the body are occluded or outside the frame. Moreover, these results usually do not faithfully capture the human silhouettes due to the limited representation power of deformable models (e.g., representing only the naked body). An alternative approach is to estimate dense vertices of a predefined template body in the image space. Such representations are effective in localizing vertices within an image but cannot handle out-of-frame body parts. In this work, we learn dense human body estimation that is robust to partial observations. We explicitly model the visibility of human joints and vertices in the x, y, and z axes separately. The visibility in the x and y axes helps distinguish out-of-frame cases, and the visibility in the depth axis corresponds to occlusions (either self-occlusions or occlusions by other objects). We obtain pseudo ground-truths of visibility labels from dense UV correspondences and train a neural network to predict visibility along with 3D coordinates. We show that visibility can serve as 1) an additional signal to resolve depth ordering ambiguities of self-occluded vertices and 2) a regularization term when fitting a human body model to the predictions. Extensive experiments on multiple 3D human datasets demonstrate that visibility modeling significantly improves the accuracy of human body estimation, especially for partial-body cases. Our project page with code is at: https://github.com/chhankyao/visdb.
Takes occlusion into account when estimating SMPL.
- Cross-Attention of Disentangled Modalities for 3D Human Mesh Recovery with Transformers. Junhyeong Cho, Kim Youwang, and Tae-Hyun Oh. In ECCV 2022.
Transformer encoder architectures have recently achieved state-of-the-art results on monocular 3D human mesh reconstruction, but they require a substantial number of parameters and expensive computations. Due to the large memory overhead and slow inference speed, it is difficult to deploy such models for practical use. In this paper, we propose a novel transformer encoder-decoder architecture for 3D human mesh reconstruction from a single image, called FastMETRO. We identify that the performance bottleneck in the encoder-based transformers is caused by the token design which introduces high complexity interactions among input tokens. We disentangle the interactions via an encoder-decoder architecture, which allows our model to demand much fewer parameters and shorter inference time. In addition, we impose the prior knowledge of the human body's morphological relationship via attention masking and mesh upsampling operations, which leads to faster convergence with higher accuracy. Our FastMETRO improves the Pareto-front of accuracy and efficiency, and clearly outperforms image-based methods on Human3.6M and 3DPW. Furthermore, we validate its generalizability on FreiHAND.
To learn relations efficiently, attention between unconnected vertices is masked out.
- Self-Constrained Inference Optimization on Structural Groups for Human Pose Estimation. Zhehan Kan, Shuoshuo Chen, Zeng Li, and Zhihai He. In ECCV 2022.
We observe that human poses exhibit strong group-wise structural correlation and spatial coupling between keypoints due to the biological constraints of different body parts. This group-wise structural correlation can be explored to improve the accuracy and robustness of human pose estimation. In this work, we develop a self-constrained prediction-verification network to characterize and learn the structural correlation between keypoints during training. During the inference stage, the feedback information from the verification network allows us to perform further optimization of the pose prediction, which significantly improves the performance of human pose estimation. Specifically, we partition the keypoints into groups according to the biological structure of the human body. Within each group, the keypoints are further partitioned into two subsets, high-confidence base keypoints and low-confidence terminal keypoints. We develop a self-constrained prediction-verification network to perform forward and backward predictions between these keypoint subsets. One fundamental challenge in pose estimation, as well as in generic prediction tasks, is that there is no mechanism for us to verify if the obtained pose estimation or prediction results are accurate or not, since the ground truth is not available. Once successfully learned, the verification network serves as an accuracy verification module for the forward pose prediction. During the inference stage, it can be used to guide the local optimization of the pose estimation results of low-confidence keypoints with the self-constrained loss on high-confidence keypoints as the objective function. Our extensive experimental results on the benchmark MS COCO and CrowdPose datasets demonstrate that the proposed method can significantly improve the pose estimation results.
- Poseur: Direct Human Pose Regression with Transformers. Weian Mao, Yongtao Ge, Chunhua Shen, Zhi Tian, Xinlong Wang, Zhibin Wang, and Anton van den Hengel. In ECCV 2022.
We propose a direct, regression-based approach to 2D human pose estimation from single images. We formulate the problem as a sequence prediction task, which we solve using a Transformer network. This network directly learns a regression mapping from images to the keypoint coordinates, without resorting to intermediate representations such as heatmaps. This approach avoids much of the complexity associated with heatmap-based approaches. To overcome the feature misalignment issues of previous regression-based methods, we propose an attention mechanism that adaptively attends to the features that are most relevant to the target keypoints, considerably improving the accuracy. Importantly, our framework is end-to-end differentiable, and naturally learns to exploit the dependencies between keypoints. Experiments on MS-COCO and MPII, two predominant pose-estimation datasets, demonstrate that our method significantly improves upon the state-of-the-art in regression-based pose estimation. More notably, ours is the first regression-based approach to perform favorably compared to the best heatmap-based pose estimation methods.
- VirtualPose: Learning Generalizable 3D Human Pose Models from Virtual Data. Jiajun Su, Chunyu Wang, Xiaoxuan Ma, Wenjun Zeng, and Yizhou Wang. In ECCV 2022.
While monocular 3D pose estimation seems to have achieved very accurate results on the public datasets, its generalization ability is largely overlooked. In this work, we perform a systematic evaluation of the existing methods and find that they get notably larger errors when tested on different cameras, human poses and appearance. To address the problem, we introduce VirtualPose, a two-stage learning framework to exploit the hidden "free lunch" specific to this task, i.e. generating an infinite number of poses and cameras for training models at no cost. To that end, the first stage transforms images to abstract geometry representations (AGR), and then the second maps them to 3D poses. It addresses the generalization issue from two aspects: (1) the first stage can be trained on diverse 2D datasets to reduce the risk of over-fitting to limited appearance; (2) the second stage can be trained on diverse AGR synthesized from a large number of virtual cameras and poses. It outperforms the SOTA methods without using any paired images and 3D poses from the benchmarks, which paves the way for practical applications. Code is available at https://github.com/wkom/VirtualPose.
Uses root depth to estimate each person's 3D position and a 3D CNN to recover the keypoint locations.
- HULC: 3D Human Motion Capture with Pose Manifold Sampling and Dense Contact Guidance. Soshi Shimada, Vladislav Golyanik, Zhi Li, Patrick Pérez, Weipeng Xu, and Christian Theobalt. In ECCV 2022.
Marker-less monocular 3D human motion capture (MoCap) with scene interactions is a challenging research topic relevant for extended reality, robotics and virtual avatar generation. Due to the inherent depth ambiguity of monocular settings, 3D motions captured with existing methods often contain severe artefacts such as incorrect body-scene inter-penetrations, jitter and body floating. To tackle these issues, we propose HULC, a new approach for 3D human MoCap which is aware of the scene geometry. HULC estimates 3D poses and dense body-environment surface contacts for improved 3D localisations, as well as the absolute scale of the subject. Furthermore, we introduce a 3D pose trajectory optimisation based on a novel pose manifold sampling that resolves erroneous body-environment inter-penetrations. Although the proposed method requires less structured inputs compared to existing scene-aware monocular MoCap algorithms, it produces more physically-plausible poses: HULC significantly and consistently outperforms the existing approaches in various experiments and on different metrics. Project page: https://vcai.mpi-inf.mpg.de/projects/HULC/.
- Neural Capture of Animatable 3D Human from Monocular Video. Gusi Te, Xiu Li, Xiao Li, Jinglu Wang, Wei Hu, and Yan Lu. In ECCV 2022.
We present a novel paradigm of building an animatable 3D human representation from a monocular video input, such that it can be rendered in any unseen poses and views. Our method is based on a dynamic Neural Radiance Field (NeRF) rigged by a mesh-based parametric 3D human model serving as a geometry proxy. Previous methods usually rely on multi-view videos or accurate 3D geometry information as additional inputs; besides, most methods suffer from degraded quality when generalized to unseen poses. We identify that the key to generalization is a good input embedding for querying dynamic NeRF: a good input embedding should define an injective mapping in the full volumetric space, guided by surface mesh deformation under pose variation. Based on this observation, we propose to embed the input query with its relationship to local surface regions spanned by a set of geodesic nearest neighbors on mesh vertices. By including both position and relative distance information, our embedding defines a distance-preserved deformation mapping and generalizes well to unseen poses. To reduce the dependency on additional inputs, we first initialize per-frame 3D meshes using off-the-shelf tools and then propose a pipeline to jointly optimize NeRF and refine the initial mesh. Extensive experiments show our method can synthesize plausible human rendering results under unseen poses and views.
The query embedding encodes both distance and direction information.
- BodySLAM: Joint Camera Localisation, Mapping, and Human Motion Tracking. Dorian F. Henning, Tristan Laidlow, and Stefan Leutenegger. In ECCV 2022.
Estimating human motion from video is an active research area due to its many potential applications. Most state-of-the-art methods predict human shape and posture estimates for individual images and do not leverage the temporal information available in video. Many "in the wild" sequences of human motion are captured by a moving camera, which adds the complication of conflated camera and human motion to the estimation. We therefore present BodySLAM, a monocular SLAM system that jointly estimates the position, shape, and posture of human bodies, as well as the camera trajectory. We also introduce a novel human motion model to constrain sequential body postures and observe the scale of the scene. Through a series of experiments on video sequences of human motion captured by a moving monocular camera, we demonstrate that BodySLAM improves estimates of all human body parameters and camera poses when compared to estimating these separately.
Jointly performs camera localization and human body tracking.
2021
- Learning Temporal 3D Human Pose Estimation with Pseudo-Labels. Arij Bouazizi, Ulrich Kressel, and Vasileios Belagiannis. In 2021.
We present a simple, yet effective, approach for self-supervised 3D human pose estimation. Unlike the prior work, we explore the temporal information next to the multi-view self-supervision. During training, we rely on triangulating 2D body pose estimates of a multiple-view camera system. A temporal convolutional neural network is trained with the generated 3D ground-truth and the geometric multi-view consistency loss, imposing geometrical constraints on the predicted 3D body skeleton. During inference, our model receives a sequence of 2D body pose estimates from a single view to predict the 3D body pose for each of them. An extensive evaluation shows that our method achieves state-of-the-art performance on the Human3.6M and MPI-INF-3DHP benchmarks. Our code and models are publicly available at https://github.com/vru2020/TM_HPE/.
Takes a sequence of 2D keypoints as input and outputs 3D poses, supervised via multi-view consistency.
- Direct Multi-view Multi-person 3D Pose EstimationTao Wang, Jianfeng Zhang, Yujun Cai, Shuicheng Yan, and Jiashi FengIn 2021
We present Multi-view Pose transformer (MvP) for estimating multi-person 3Dposes from multi-view images. Instead of estimating 3D joint locations fromcostly volumetric representation or reconstructing the per-person 3D pose frommultiple detected 2D poses as in previous methods, MvP directly regresses themulti-person 3D poses in a clean and efficient way, without relying onintermediate tasks. Specifically, MvP represents skeleton joints as learnablequery embeddings and let them progressively attend to and reason over themulti-view information from the input images to directly regress the actual 3Djoint locations. To improve the accuracy of such a simple pipeline, MvPpresents a hierarchical scheme to concisely represent query embeddings ofmulti-person skeleton joints and introduces an input-dependent query adaptationapproach. Further, MvP designs a novel geometrically guided attentionmechanism, called projective attention, to more precisely fuse the cross-viewinformation for each joint. MvP also introduces a RayConv operation tointegrate the view-dependent camera geometry into the feature representationsfor augmenting the projective attention. We show experimentally that our MvPmodel outperforms the state-of-the-art methods on several benchmarks whilebeing much more efficient. Notably, it achieves 92.3% AP25 on the challengingPanoptic dataset, improving upon the previous best approach [36] by 9.8%. MvPis general and also extendable to recovering human mesh represented by the SMPLmodel, thus useful for modeling multi-person body shapes. Code and models areavailable at https://github.com/sail-sg/mvp.
Multi-view features are aggregated directly by a transformer.
- Generalizable Human Pose Triangulation. Kristijan Bartol, David Bojanić, Tomislav Petković, and Tomislav Pribanić. In 2021
We address the problem of generalizability for multi-view 3D human pose estimation. The standard approach is to first detect 2D keypoints in images and then apply triangulation from multiple views. Even though the existing methods achieve remarkably accurate 3D pose estimation on public benchmarks, most of them are limited to a single spatial camera arrangement and their number. Several methods address this limitation but demonstrate significantly degraded performance on novel views. We propose a stochastic framework for human pose triangulation and demonstrate a superior generalization across different camera arrangements on two public datasets. In addition, we apply the same approach to the fundamental matrix estimation problem, showing that the proposed method can successfully apply to other computer vision problems. The stochastic framework achieves more than 8.8% improvement on the 3D pose estimation task, compared to the state-of-the-art, and more than 30% improvement for fundamental matrix estimation, compared to a standard algorithm.
Proposes a stochastic framework for triangulation that generalizes across camera arrangements.
- Adaptive Multi-view and Temporal Fusing Transformer for 3D Human Pose Estimation. Hui Shuai, Lele Wu, and Qingshan Liu. In 2021
This paper proposes a unified framework dubbed Multi-view and Temporal Fusing Transformer (MTF-Transformer) to adaptively handle varying view numbers and video length without camera calibration in 3D Human Pose Estimation (HPE). It consists of Feature Extractor, Multi-view Fusing Transformer (MFT), and Temporal Fusing Transformer (TFT). Feature Extractor estimates 2D pose from each image and fuses the prediction according to the confidence. It provides pose-focused feature embedding and makes subsequent modules computationally lightweight. MFT fuses the features of a varying number of views with a novel Relative-Attention block. It adaptively measures the implicit relative relationship between each pair of views and reconstructs more informative features. TFT aggregates the features of the whole sequence and predicts 3D pose via a transformer. It adaptively deals with video of arbitrary length and fully utilizes the temporal information. The migration of transformers enables our model to learn spatial geometry better and preserve robustness for varying application scenarios. We report quantitative and qualitative results on Human3.6M, TotalCapture, and KTH Multiview Football II. Compared with state-of-the-art methods with camera parameters, MTF-Transformer obtains competitive results and generalizes well to dynamic capture with an arbitrary number of unseen views.
A transformer for multi-view feature fusion plus a transformer for temporal fusion.
- Semi-supervised Dense Keypoints using Unlabeled Multiview Images. Zhixuan Yu, Haozheng Yu, Long Sha, Sujoy Ganguly, and Hyun Soo Park. In 2021
This paper presents a new end-to-end semi-supervised framework to learn a dense keypoint detector using unlabeled multiview images. A key challenge lies in finding the exact correspondences between the dense keypoints in multiple views since the inverse of keypoint mapping can be neither analytically derived nor differentiated. This limits applying existing multiview supervision approaches on sparse keypoint detection that rely on the exact correspondences. To address this challenge, we derive a new probabilistic epipolar constraint that encodes the two desired properties. (1) Soft correspondence: we define a matchability, which measures a likelihood of a point matching to the other image's corresponding point, thus relaxing the exact correspondences' requirement. (2) Geometric consistency: every point in the continuous correspondence fields must satisfy the multiview consistency collectively. We formulate a probabilistic epipolar constraint using a weighted average of epipolar errors through the matchability, thereby generalizing the point-to-point geometric error to the field-to-field geometric error. This generalization facilitates learning a geometrically coherent dense keypoint detection model by utilizing a large number of unlabeled multiview images. Additionally, to prevent degenerative cases, we employ a distillation-based regularization by using a pretrained model. Finally, we design a new neural network architecture, made of twin networks, that effectively minimizes the probabilistic epipolar errors of all possible correspondences between two view images by building affinity matrices. Our method shows superior performance compared to existing methods, including non-differentiable bootstrapping, in terms of keypoint accuracy, multiview consistency, and 3D reconstruction accuracy.
Introduces soft correspondences constrained by geometric (epipolar) consistency; a small sketch of the weighted epipolar error is given below.
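A minimal sketch of the "weighted average of epipolar errors through the matchability" idea, under my own assumptions (a known fundamental matrix, descriptor dot-products as similarity, softmax matchability); it is an illustration, not the paper's implementation.

```python
# Minimal sketch (illustrative, not the paper's code): a soft epipolar error.
# For each query point in view 1, point-to-epipolar-line distances of all
# candidate points in view 2 are averaged with matchability weights.
import torch
import torch.nn.functional as F

def soft_epipolar_error(pts1, pts2, feat1, feat2, F_mat):
    """pts1: (N, 2), pts2: (M, 2) pixel coords; feat1: (N, C), feat2: (M, C)
    descriptors; F_mat: (3, 3) fundamental matrix mapping view 1 -> view 2."""
    x1 = torch.cat([pts1, torch.ones(pts1.shape[0], 1)], dim=1)   # (N, 3)
    x2 = torch.cat([pts2, torch.ones(pts2.shape[0], 1)], dim=1)   # (M, 3)
    lines = x1 @ F_mat.T                                          # (N, 3) epipolar lines in view 2
    num = (lines @ x2.T).abs()                                    # (N, M) |l . x2|
    den = lines[:, :2].norm(dim=1, keepdim=True)                  # (N, 1) line normal length
    epi_err = num / (den + 1e-8)                                  # (N, M) pixel distances
    w = F.softmax(feat1 @ feat2.T, dim=1)                         # (N, M) matchability weights
    return (w * epi_err).sum(dim=1).mean()
```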
- Graph-Based 3D Multi-Person Pose Estimation Using Multi-View Images. Size Wu, Sheng Jin, Wentao Liu, Lei Bai, Chen Qian, Dong Liu, and Wanli Ouyang. In 2021
This paper studies the task of estimating the 3D human poses of multiple persons from multiple calibrated camera views. Following the top-down paradigm, we decompose the task into two stages, i.e. person localization and pose estimation. Both stages are processed in coarse-to-fine manners, and we propose three task-specific graph neural networks for effective message passing. For 3D person localization, we first use the Multi-view Matching Graph Module (MMG) to learn the cross-view association and recover coarse human proposals. The Center Refinement Graph Module (CRG) further refines the results via flexible point-based prediction. For 3D pose estimation, the Pose Regression Graph Module (PRG) learns both the multi-view geometry and structural relations between human joints. Our approach achieves state-of-the-art performance on CMU Panoptic and Shelf datasets with significantly lower computation complexity.
Learns multi-person cross-view association through graph neural networks.
- Geometry-Guided Progressive NeRF for Generalizable and Efficient Neural Human Rendering. Mingfei Chen, Jianfeng Zhang, Xiangyu Xu, Lijuan Liu, Yujun Cai, Jiashi Feng, and Shuicheng Yan. In ECCV 2021
In this work we develop a generalizable and efficient Neural Radiance Field (NeRF) pipeline for high-fidelity free-viewpoint human body synthesis under settings with sparse camera views. Though existing NeRF-based methods can synthesize rather realistic details for the human body, they tend to produce poor results when the input has self-occlusion, especially for unseen humans under sparse views. Moreover, these methods often require a large number of sampling points for rendering, which leads to low efficiency and limits their real-world applicability. To address these challenges, we propose a Geometry-guided Progressive NeRF (GP-NeRF). In particular, to better tackle self-occlusion, we devise a geometry-guided multi-view feature integration approach that utilizes the estimated geometry prior to integrate the incomplete information from input views and construct a complete geometry volume for the target human body. Meanwhile, for achieving higher rendering efficiency, we introduce a progressive rendering pipeline through geometry guidance, which leverages the geometric feature volume and the predicted density values to progressively reduce the number of sampling points and speed up the rendering process. Experiments on the ZJU-MoCap and THUman datasets show that our method outperforms the state-of-the-arts significantly across multiple generalization settings, while the time cost is reduced > 70% via applying our efficient progressive rendering pipeline.
Geometry-guided image feature integration yields a density volume, reducing the number of sampling points.
- SimCC: a Simple Coordinate Classification Perspective for Human Pose Estimation. Yanjie Li, Sen Yang, Peidong Liu, Shoukui Zhang, Yunxiao Wang, Zhicheng Wang, Wankou Yang, and Shu-Tao Xia. In ECCV 2021
The 2D heatmap-based approaches have dominated Human Pose Estimation (HPE) for years due to high performance. However, the long-standing quantization error problem in the 2D heatmap-based methods leads to several well-known drawbacks: 1) The performance for low-resolution inputs is limited; 2) To improve the feature map resolution for higher localization precision, multiple costly upsampling layers are required; 3) Extra post-processing is adopted to reduce the quantization error. To address these issues, we aim to explore a brand new scheme, called SimCC, which reformulates HPE as two classification tasks for horizontal and vertical coordinates. The proposed SimCC uniformly divides each pixel into several bins, thus achieving sub-pixel localization precision and low quantization error. Benefiting from that, SimCC can omit additional refinement post-processing and exclude upsampling layers under certain settings, resulting in a more simple and effective pipeline for HPE. Extensive experiments conducted over COCO, CrowdPose, and MPII datasets show that SimCC outperforms heatmap-based counterparts, especially in low-resolution settings, by a large margin.
Recasts 2D human pose estimation as a coordinate classification problem; see the sketch below.
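A minimal sketch of the coordinate-classification idea, with assumed shapes and a simplified pooled-feature input (the class name SimCCHead, the split_ratio value, and the feature interface are my assumptions, not the official implementation): each joint gets two classifiers, one over discretised x bins and one over y bins.

```python
# Minimal sketch (assumed interface): per-joint classification over
# sub-pixel x and y bins instead of heatmap regression.
import torch
import torch.nn as nn

class SimCCHead(nn.Module):
    def __init__(self, in_dim, num_joints, img_w, img_h, split_ratio=2.0):
        super().__init__()
        self.num_joints = num_joints
        self.split_ratio = split_ratio
        self.bins_x = int(img_w * split_ratio)      # sub-pixel bins along x
        self.bins_y = int(img_h * split_ratio)
        self.cls_x = nn.Linear(in_dim, num_joints * self.bins_x)
        self.cls_y = nn.Linear(in_dim, num_joints * self.bins_y)

    def forward(self, feat):                        # feat: (B, in_dim) backbone feature
        b = feat.shape[0]
        logits_x = self.cls_x(feat).view(b, self.num_joints, self.bins_x)
        logits_y = self.cls_y(feat).view(b, self.num_joints, self.bins_y)
        return logits_x, logits_y

    def decode(self, logits_x, logits_y):
        # winning bin index mapped back to (sub-)pixel coordinates
        x = logits_x.argmax(dim=-1).float() / self.split_ratio
        y = logits_y.argmax(dim=-1).float() / self.split_ratio
        return torch.stack([x, y], dim=-1)          # (B, num_joints, 2)
```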
- SmoothNet: A Plug-and-Play Network for Refining Human Poses in Videos. Ailing Zeng, Lei Yang, Xuan Ju, Jiefeng Li, Jianyi Wang, and Qiang Xu. In ECCV 2021
When analyzing human motion videos, the output jitters from existing pose estimators are highly unbalanced, with varied estimation errors across frames. Most frames in a video are relatively easy to estimate and only suffer from slight jitters. In contrast, for rarely seen or occluded actions, the estimated positions of multiple joints largely deviate from the ground truth values for a consecutive sequence of frames, rendering significant jitters on them. To tackle this problem, we propose to attach a dedicated temporal-only refinement network to existing pose estimators for jitter mitigation, named SmoothNet. Unlike existing learning-based solutions that employ spatio-temporal models to co-optimize per-frame precision and temporal smoothness at all the joints, SmoothNet models the natural smoothness characteristics in body movements by learning the long-range temporal relations of every joint without considering the noisy correlations among joints. With a simple yet effective motion-aware fully-connected network, SmoothNet improves the temporal smoothness of existing pose estimators significantly and enhances the estimation accuracy of those challenging frames as a side-effect. Moreover, as a temporal-only model, a unique advantage of SmoothNet is its strong transferability across various types of estimators and datasets. Comprehensive experiments on five datasets with eleven popular backbone networks across 2D and 3D pose estimation and body recovery tasks demonstrate the efficacy of the proposed solution. Code is available at https://github.com/cure-lab/SmoothNet.
- Rethinking Keypoint Representations: Modeling Keypoints and Poses as Objects for Multi-Person Human Pose Estimation. William McNally, Kanav Vats, Alexander Wong, and John McPhee. In ECCV 2021
In keypoint estimation tasks such as human pose estimation, heatmap-based regression is the dominant approach despite possessing notable drawbacks: heatmaps intrinsically suffer from quantization error and require excessive computation to generate and post-process. Motivated to find a more efficient solution, we propose to model individual keypoints and sets of spatially related keypoints (i.e., poses) as objects within a dense single-stage anchor-based detection framework. Hence, we call our method KAPAO (pronounced "Ka-Pow"), for Keypoints And Poses As Objects. KAPAO is applied to the problem of single-stage multi-person human pose estimation by simultaneously detecting human pose and keypoint objects and fusing the detections to exploit the strengths of both object representations. In experiments, we observe that KAPAO is faster and more accurate than previous methods, which suffer greatly from heatmap post-processing. The accuracy-speed trade-off is especially favourable in the practical setting when not using test-time augmentation. Source code: https://github.com/wmcnally/kapao.
Treats keypoints as objects and regresses them directly with a YOLO-style detector.
- FLEX: Extrinsic Parameters-free Multi-view 3D Human Motion Reconstruction. Brian Gordon, Sigal Raab, Guy Azov, Raja Giryes, and Daniel Cohen-Or. In ECCV 2021
The increasing availability of video recordings made by multiple cameras has offered new means for mitigating occlusion and depth ambiguities in pose and motion reconstruction methods. Yet, multi-view algorithms strongly depend on camera parameters; particularly, the relative transformations between the cameras. Such a dependency becomes a hurdle once shifting to dynamic capture in uncontrolled settings. We introduce FLEX (Free muLti-view rEconstruXion), an end-to-end extrinsic parameter-free multi-view model. FLEX is extrinsic parameter-free (dubbed ep-free) in the sense that it does not require extrinsic camera parameters. Our key idea is that the 3D angles between skeletal parts, as well as bone lengths, are invariant to the camera position. Hence, learning 3D rotations and bone lengths rather than locations allows predicting common values for all camera views. Our network takes multiple video streams, learns fused deep features through a novel multi-view fusion layer, and reconstructs a single consistent skeleton with temporally coherent joint rotations. We demonstrate quantitative and qualitative results on three public datasets, and on synthetic multi-person video streams captured by dynamic cameras. We compare our model to state-of-the-art methods that are not ep-free and show that in the absence of camera parameters, we outperform them by a large margin while obtaining comparable results when camera parameters are available. Code, trained models, and other materials are available on our project page.
Takes multi-view sequences of 2D estimates and predicts foot-contact labels, bone lengths, and 3D joint rotations, without requiring the cameras to be given; the invariance it relies on is illustrated below.
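A small worked example of the geometric fact FLEX builds on: bone lengths and the angle between skeletal parts do not change under a rigid camera transform, so they can be predicted as common values across views. Joint indices and the toy transform are my own choices for illustration.

```python
# Minimal sketch (illustrative only): bone lengths and inter-bone angles are
# invariant to a rigid transform (rotation + translation), i.e. to the camera pose.
import numpy as np

def lengths_and_angle(joints, bones):
    """joints: (J, 3); bones: two (parent, child) index pairs."""
    v = np.stack([joints[c] - joints[p] for p, c in bones])
    lengths = np.linalg.norm(v, axis=1)
    cos = v[0] @ v[1] / (lengths[0] * lengths[1])
    return lengths, np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

rng = np.random.default_rng(0)
joints = rng.normal(size=(3, 3))                       # toy hip, knee, ankle
bones = [(0, 1), (1, 2)]

theta = np.pi / 5                                      # arbitrary new camera pose
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([0.3, -1.2, 2.0])
joints_cam2 = joints @ R.T + t

print(lengths_and_angle(joints, bones))
print(lengths_and_angle(joints_cam2, bones))           # same lengths and angle
```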
2020
- Rethinking the Heatmap Regression for Bottom-up Human Pose Estimation. Zhengxiong Luo, Zhicheng Wang, Yan Huang, Tieniu Tan, and Erjin Zhou. In 2020
Heatmap regression has become the most prevalent choice for nowadays human pose estimation methods. The ground-truth heatmaps are usually constructed via covering all skeletal keypoints by 2D Gaussian kernels. The standard deviations of these kernels are fixed. However, for bottom-up methods, which need to handle a large variance of human scales and labeling ambiguities, the current practice seems unreasonable. To better cope with these problems, we propose the scale-adaptive heatmap regression (SAHR) method, which can adaptively adjust the standard deviation for each keypoint. In this way, SAHR is more tolerant of various human scales and labeling ambiguities. However, SAHR may aggravate the imbalance between fore-background samples, which potentially hurts the improvement of SAHR. Thus, we further introduce the weight-adaptive heatmap regression (WAHR) to help balance the fore-background samples. Extensive experiments show that SAHR together with WAHR largely improves the accuracy of bottom-up human pose estimation. As a result, we finally outperform the state-of-the-art model by +1.5AP and achieve 72.0AP on COCO test-dev2017, which is comparable with the performances of most top-down methods. Source codes are available at https://github.com/greatlog/SWAHR-HumanPose.
Adapts the Gaussian kernel size of the ground-truth heatmaps to people at different distances (scales); a sketch follows below.
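A minimal sketch of scale-adaptive ground-truth heatmap generation; the base_sigma value and the way the scale factor is obtained (e.g. from a person's bounding-box size) are assumptions for illustration, not the SAHR formulation itself.

```python
# Minimal sketch (assumed interface): ground-truth Gaussian heatmaps whose
# standard deviation scales with the person/keypoint scale, so that large
# (near) and small (far) people get appropriately sized kernels.
import numpy as np

def gaussian_heatmap(h, w, center, sigma):
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (xs - center[0]) ** 2 + (ys - center[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def scale_adaptive_heatmaps(h, w, keypoints, person_scales, base_sigma=2.0):
    """keypoints: (K, 2) pixel coords; person_scales: (K,) relative scales
    (e.g. person bbox size / reference size)."""
    return np.stack([
        gaussian_heatmap(h, w, kp, base_sigma * s)
        for kp, s in zip(keypoints, person_scales)
    ])
```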
- AdaFuse: Adaptive Multiview Fusion for Accurate Human Pose Estimation in the Wild. Zhe Zhang, Chunyu Wang, Weichao Qiu, Wenhu Qin, and Wenjun Zeng. In 2020
Occlusion is probably the biggest challenge for human pose estimation in the wild. Typical solutions often rely on intrusive sensors such as IMUs to detect occluded joints. To make the task truly unconstrained, we present AdaFuse, an adaptive multiview fusion method, which can enhance the features in occluded views by leveraging those in visible views. The core of AdaFuse is to determine the point-point correspondence between two views, which we solve effectively by exploring the sparsity of the heatmap representation. We also learn an adaptive fusion weight for each camera view to reflect its feature quality, in order to reduce the chance that good features are undesirably corrupted by "bad" views. The fusion model is trained end-to-end with the pose estimation network, and can be directly applied to new camera configurations without additional adaptation. We extensively evaluate the approach on three public datasets including Human3.6M, Total Capture and CMU Panoptic. It outperforms the state-of-the-arts on all of them. We also create a large-scale synthetic dataset, Occlusion-Person, which allows us to perform numerical evaluation on the occluded joints, as it provides occlusion labels for every joint in the images. The dataset and code are released at https://github.com/zhezh/adafuse-3d-human-pose.
Takes multi-view images as input simultaneously, estimates keypoint heatmaps with cross-view fusion, and outputs 2D keypoints; a fusion sketch follows below.
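A minimal sketch of the adaptive-weight fusion step only, under my own assumptions: heatmaps from other views are assumed to be already warped into the current view (e.g. along epipolar geometry), and the per-view quality scores are assumed to come from a learned module. This is an illustration, not the released AdaFuse code.

```python
# Minimal sketch (hypothetical interface): fuse a view's own heatmaps with
# heatmaps warped in from other views, weighted by learned per-view quality.
import torch

def fuse_heatmaps(own_heatmap, warped_heatmaps, view_weights):
    """own_heatmap: (K, H, W); warped_heatmaps: (V-1, K, H, W) heatmaps from
    the other views already warped into this view; view_weights: (V,) learned
    quality scores, index 0 for the current view."""
    w = torch.softmax(view_weights, dim=0)
    fused = w[0] * own_heatmap
    for i, hm in enumerate(warped_heatmaps):
        fused = fused + w[i + 1] * hm
    return fused
```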
2019
- Learnable Triangulation of Human Pose. Karim Iskakov, Egor Burkov, Victor Lempitsky, and Yury Malkov. In 2019
We present two novel solutions for multi-view 3D human pose estimation based on new learnable triangulation methods that combine 3D information from multiple 2D views. The first (baseline) solution is a basic differentiable algebraic triangulation with an addition of confidence weights estimated from the input images. The second solution is based on a novel method of volumetric aggregation from intermediate 2D backbone feature maps. The aggregated volume is then refined via 3D convolutions that produce final 3D joint heatmaps and allow modelling a human pose prior. Crucially, both approaches are end-to-end differentiable, which allows us to directly optimize the target metric. We demonstrate transferability of the solutions across datasets and considerably improve the multi-view state of the art on the Human3.6M dataset. Video demonstration, annotations and additional materials will be posted on our project page (https://saic-violet.github.io/learnable-triangulation).
Features from multiple views are back-projected into 3D space and passed through a 3D network to produce the final output; the paper's algebraic (confidence-weighted triangulation) baseline is sketched below.
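A minimal sketch of confidence-weighted algebraic triangulation in the spirit of the paper's baseline solution (not the authors' release): each view's DLT equations are scaled by a predicted confidence, and the SVD-based solve stays differentiable so the weights can be learned end-to-end.

```python
# Minimal sketch: differentiable DLT triangulation with per-view confidences.
import torch

def weighted_triangulation(points_2d, proj_mats, confidences):
    """points_2d: (V, 2) pixel coords; proj_mats: (V, 3, 4); confidences: (V,)."""
    rows = []
    for (x, y), P, w in zip(points_2d, proj_mats, confidences):
        rows.append(w * (x * P[2] - P[0]))   # standard DLT rows, scaled by confidence
        rows.append(w * (y * P[2] - P[1]))
    A = torch.stack(rows)                    # (2V, 4)
    _, _, vh = torch.linalg.svd(A)
    X = vh[-1]                               # right singular vector, smallest singular value
    return X[:3] / X[3]                      # homogeneous -> Euclidean 3D point
```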
- Cross View Fusion for 3D Human Pose Estimation. Haibo Qiu, Chunyu Wang, Jingdong Wang, Naiyan Wang, and Wenjun Zeng. In 2019
We present an approach to recover absolute 3D human poses from multi-view images by incorporating multi-view geometric priors in our model. It consists of two separate steps: (1) estimating the 2D poses in multi-view images and (2) recovering the 3D poses from the multi-view 2D poses. First, we introduce a cross-view fusion scheme into CNN to jointly estimate 2D poses for multiple views. Consequently, the 2D pose estimation for each view already benefits from other views. Second, we present a recursive Pictorial Structure Model to recover the 3D pose from the multi-view 2D poses. It gradually improves the accuracy of 3D pose with affordable computational cost. We test our method on two public datasets, H36M and Total Capture. The Mean Per Joint Position Errors on the two datasets are 26mm and 29mm, which outperforms the state-of-the-arts remarkably (26mm vs 52mm, 29mm vs 35mm). Our code is released at https://github.com/microsoft/multiview-human-pose-estimation-pytorch.
Direct fusion across multiple views.
2018
- Self-supervised Multi-view Person Association and Its Applications. Minh Vo, Ersin Yumer, Kalyan Sunkavalli, Sunil Hadap, Yaser Sheikh, and Srinivasa Narasimhan. In 2018
Reliable markerless motion tracking of people participating in a complex group activity from multiple moving cameras is challenging due to frequent occlusions, strong viewpoint and appearance variations, and asynchronous video streams. To solve this problem, reliable association of the same person across distant viewpoints and temporal instances is essential. We present a self-supervised framework to adapt a generic person appearance descriptor to the unlabeled videos by exploiting motion tracking, mutual exclusion constraints, and multi-view geometry. The adapted discriminative descriptor is used in a tracking-by-clustering formulation. We validate the effectiveness of our descriptor learning on WILDTRACK [14] and three new complex social scenes captured by multiple cameras with up to 60 people "in the wild". We report significant improvement in association accuracy (up to 18%) and stable and coherent 3D human skeleton tracking (5 to 10 times) over the baseline. Using the reconstructed 3D skeletons, we cut the input videos into a multi-angle video where the image of a specified person is shown from the best visible front-facing camera. Our algorithm detects inter-human occlusion to determine the camera switching moment while still maintaining the flow of the action well.
Self-supervised descriptor learning used to cluster and associate people across views.
2017
- Harvesting Multiple Views for Marker-less 3D Human Pose Annotations. Georgios Pavlakos, Xiaowei Zhou, Konstantinos G. Derpanis, and Kostas Daniilidis. In 2017
Recent advances with Convolutional Networks (ConvNets) have shifted the bottleneck for many computer vision tasks to annotated data collection. In this paper, we present a geometry-driven approach to automatically collect annotations for human pose prediction tasks. Starting from a generic ConvNet for 2D human pose, and assuming a multi-view setup, we describe an automatic way to collect accurate 3D human pose annotations. We capitalize on constraints offered by the 3D geometry of the camera setup and the 3D structure of the human body to probabilistically combine per-view 2D ConvNet predictions into a globally optimal 3D pose. This 3D pose is used as the basis for harvesting annotations. The benefit of the annotations produced automatically with our approach is demonstrated in two challenging settings: (i) fine-tuning a generic ConvNet-based 2D pose predictor to capture the discriminative aspects of a subject's appearance (i.e., "personalization"), and (ii) training a ConvNet from scratch for single-view 3D human pose prediction without leveraging 3D pose ground truth. The proposed multi-view pose estimator achieves state-of-the-art results on standard benchmarks, demonstrating the effectiveness of our method in exploiting the available multi-view information.
Multi-view heatmaps are combined into 3D evidence, and a 3D pictorial structure model recovers the skeleton joint locations.