An Efficient Three-Dimensional Convolutional Neural Network for Inferring Physical Interaction Force from Video
Dongyi Kim, Hyeon Cho, Hochul Shin, Soo-Chul Lim and Wonjun Hwang
ABSTRACT: Interaction forces are traditionally measured with contact-type haptic sensors. In this paper, we propose a novel and practical method for inferring the interaction force between two objects from video data alone, captured by a non-contact camera sensor, without the use of conventional haptic sensors. Specifically, we predict the interaction force by observing how the texture of the target object changes under an external force. Our hypothesis is that a three-dimensional (3D) convolutional neural network (CNN) can learn to predict physical interaction forces from video frames. To this end, we propose a bottleneck-based 3D depthwise separable CNN architecture in which the video is disentangled into spatial and temporal information. By applying the depthwise convolution concept to each video frame, spatial information is learned efficiently; for temporal information, a 3D pointwise convolution learns the linear combination among sequential frames. To train and validate the proposed model, we collected a large dataset of video clips showing physical interactions between two objects under varying conditions (illumination and angle variations), together with the corresponding interaction forces measured by a haptic sensor as the ground truth. Our experimental results confirm this hypothesis: compared with previous models, the proposed model is more accurate and efficient, achieving better accuracy than a conventional 3D CNN architecture despite being 10 times smaller. The experiments also demonstrate that the proposed model remains robust under different conditions and can successfully estimate the interaction force between objects.
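To make the spatial/temporal factorization concrete, the following is a minimal PyTorch sketch of one 3D depthwise separable block of the kind the abstract describes: a depthwise 1x3x3 convolution that processes each frame's spatial content per channel, followed by a 1x1x1 pointwise convolution that forms linear combinations of the resulting per-frame feature maps. The channel counts, kernel sizes, clip length, and the BatchNorm/ReLU placement are illustrative assumptions, not the authors' exact bottleneck configuration.

```python
import torch
import torch.nn as nn

class DepthwiseSeparable3dBlock(nn.Module):
    """Illustrative 3D depthwise separable block (not the paper's exact design).

    The depthwise stage applies one 3x3 spatial filter per channel with a
    temporal kernel of 1, so each video frame is filtered independently.
    The pointwise stage applies a 1x1x1 convolution that learns a linear
    combination of the depthwise outputs.
    """
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # Depthwise spatial convolution: groups=in_channels means each
        # channel is convolved with its own 1x3x3 filter.
        self.depthwise = nn.Conv3d(
            in_channels, in_channels,
            kernel_size=(1, 3, 3), padding=(0, 1, 1),
            groups=in_channels, bias=False)
        # Pointwise convolution: 1x1x1 kernel mixing the filtered channels.
        self.pointwise = nn.Conv3d(
            in_channels, out_channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm3d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        return self.relu(self.bn(self.pointwise(self.depthwise(x))))

# Toy input: 2 clips, 3 channels, 16 frames of 112x112 pixels (assumed sizes).
x = torch.randn(2, 3, 16, 112, 112)
block = DepthwiseSeparable3dBlock(3, 64)
print(block(x).shape)  # torch.Size([2, 64, 16, 112, 112])
```

Compared with a standard 3x3x3 Conv3d, this factorization cuts the parameter count roughly by the spatial kernel size times the channel fan-in, which is consistent with the roughly 10x smaller model size reported in the abstract.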