Efficient 3D Human Pose Estimation on Mobile Devices

A. Shukla


Recent advances in deep convolutional neural networks have produced successful results for human pose and activity classification in videos, including promising results for the classification of diving actions. However, accurately estimating the diving pose in a real-time video stream remains challenging because of the unconstrained environment and the complex maneuvers involved. This thesis focuses primarily on a weakly supervised model for learning 3D human motion dynamics in diving sports. To this end, we use three models to estimate 2D human pose, 3D human pose, and human shape. For the 2D pose, we use the BlazePose model, which first learns a heatmap from the 2D image and then regresses the 2D pose from that heatmap. For the 3D pose, we use the MobileHumanPose model, which estimates the 3D human pose directly from a 2D image using a volumetric heatmap. For shape estimation, we use the I2L-MeshNet model, which first estimates the 3D pose and then uses that pose to estimate the shape of the human. The thesis aims to develop a model that can estimate 2D and 3D human pose on a mobile device in real time. We chose BlazePose and MobileHumanPose because both are lightweight models that do not compromise much on accuracy. Because I2L-MeshNet is a large model, we investigate lightweight backbone networks for shape estimation. The model performed well on the pose estimation task but failed to accurately estimate the shape and global orientation of the diver in the videos. We created a demonstrator tool that integrates BlazePose and MobileHumanPose into mobile devices. With it, we successfully estimated 2D human pose on an Android device using BlazePose and 3D human pose on an iOS device using MobileHumanPose. Although diving involves complex and challenging poses, our mobile models achieve over 95% accuracy with a fine-tuned model and around 85% with a generic model.
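
To illustrate how joint coordinates can be read out of a volumetric heatmap, the following is a minimal sketch of a soft-argmax over a per-joint 3D score volume. It is an assumption made for illustration that the readout is a soft-argmax; the abstract does not state how coordinates are recovered from the volumetric heatmap, and the function name `soft_argmax_3d` and the array layout `(joints, depth, height, width)` are hypothetical choices, not taken from the thesis.

```python
import numpy as np


def soft_argmax_3d(heatmaps):
    """Expected (x, y, z) voxel coordinate per joint.

    heatmaps: array of raw scores with shape (J, D, H, W); this layout
    is an assumption for the sketch. Returns an array of shape (J, 3).
    """
    j, d, h, w = heatmaps.shape

    # Softmax over all voxels of each joint (shifted by the max for stability).
    flat = heatmaps.reshape(j, -1)
    flat = flat - flat.max(axis=1, keepdims=True)
    probs = np.exp(flat)
    probs /= probs.sum(axis=1, keepdims=True)
    probs = probs.reshape(j, d, h, w)

    # Marginalize onto each axis, then take the expected index along it.
    x = (probs.sum(axis=(1, 2)) * np.arange(w)).sum(axis=1)
    y = (probs.sum(axis=(1, 3)) * np.arange(h)).sum(axis=1)
    z = (probs.sum(axis=(2, 3)) * np.arange(d)).sum(axis=1)
    return np.stack([x, y, z], axis=1)
```

Unlike a hard argmax, this readout is differentiable and yields sub-voxel coordinates, which is one reason it is commonly paired with low-resolution heatmaps in lightweight models.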