Forecasting Future Sequence of Actions to Complete an Activity
Future human action forecasting from partial observations of activities is an important problem in many practical applications such as assistive robotics, video surveillance and security. We present a method to forecast actions for the unseen future of the video using a neural machine translation technique that uses encoder-decoder architecture. The input to this model is the observed RGB video, and the target is to generate the future symbolic action sequence. Unlike most methods that predict frame or clip level predictions for some unseen percentage of video, we predict the complete action sequence that is required to accomplish the activity. To cater for two types of uncertainty in the future predictions, we propose a novel loss function. We show a combination of optimal transport and future uncertainty losses help to boost results. We evaluate our model in three challenging video datasets (Charades, MPII cooking and Breakfast). We outperform other state-of-the art techniques for frame based action forecasting task by 5.06% on average across several action forecasting setups.
READ FULL TEXT