Human Action Recognition in Drone Videos using a Few Aerial Training Examples

Publication

Waqas Sultani, Mubarak Shah, Human action recognition in drone videos using a few aerial training examples, Computer Vision and Image Understanding (2021): 103186.

Overview

Drones are enabling new forms of human action surveillance due to their low cost and fast mobility. However, using deep neural networks for automatic aerial action recognition is difficult because it requires a large number of aerial human action training videos, which are costly, time-consuming, and difficult to collect. In this paper, we explore two alternative data sources to improve aerial action classification when only a few aerial training examples are available. As a first data source, we resort to video games: we collect a large number of aerial game action videos using two gaming engines. For the second data source, we leverage conditional Wasserstein Generative Adversarial Networks (GANs) to generate aerial features from ground videos. Both data sources have limitations: game videos are biased towards specific action categories (e.g., fighting and shooting), and it is not easy to generate good discriminative GAN features for all types of actions. We therefore need to efficiently integrate the two data sources with the few available real aerial training videos. To address the heterogeneous nature of the data, we propose a disjoint multitask learning framework: we feed the network real and game data, or real and GAN-generated data, in an alternating fashion to obtain an improved action classifier. We validate the proposed approach on two aerial action datasets and demonstrate that features from aerial game videos, and those generated by a GAN, can be extremely useful for improving action recognition in real aerial videos when only a few real aerial training examples are available.

Problem & Motivation

Automatically recognizing human actions in drone videos is a daunting task. It is challenging due to drone camera motion, small actor size, and, most importantly, the difficulty of collecting large-scale aerial action training videos. Computer vision researchers have studied human action recognition in a variety of videos, including sports videos (Soomro et al., 2013), surveillance CCTV videos (Sultani et al., 2018), and cooking and egocentric videos (Damen et al., 2018). However, despite its practical importance, little research has been done on automatically recognizing human actions in drone videos. Our main contributions are as follows:

  • We propose to tackle the new problem of drone-based human action recognition when only a few aerial training examples are available.
  • To the best of our knowledge, we are the first to demonstrate the feasibility of using game action videos to improve action recognition in real-world aerial videos. Although game imagery has been used before in other computer vision applications, it has not been used for aerial action recognition.
  • We show that game and GAN-generated action examples can help to learn a more accurate action classifier through a disjoint multitask learning framework.
  • We present two new action datasets: 1) an Aerial-Ground game dataset containing seven human actions, with 100 aerial-ground video pairs per action; 2) a real aerial dataset containing actions corresponding to eight UCF101 actions.

Method

Our approach is depicted in the figure below. We propose to utilize game videos and GAN-generated aerial features to improve aerial action classification when only a few real aerial training examples are available. Our approach does not require the same labels for real and game actions; to handle the different action labels in the game and real datasets, we use a disjoint multitask learning framework to efficiently learn a robust action classifier.
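The alternating training described above can be sketched as a shared feature layer with two disjoint classifier heads, updated on real and game batches in turn. This is a minimal NumPy illustration on toy data, not the paper's implementation; the feature dimension, hidden size, class counts, and learning rate are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 32-d video features, 16 hidden units,
# 8 real-action classes and 7 game-action classes (disjoint label sets).
FEAT, H, N_REAL, N_GAME = 32, 16, 8, 7

W_shared = rng.normal(0, 0.1, (FEAT, H))   # shared backbone
W_real   = rng.normal(0, 0.1, (H, N_REAL)) # head for real aerial actions
W_game   = rng.normal(0, 0.1, (H, N_GAME)) # head for game actions

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def step(x, y, W_head, lr=0.1):
    """One gradient step through the shared layer and one task head."""
    global W_shared
    h = np.tanh(x @ W_shared)              # shared representation
    p = softmax(h @ W_head)                # task-specific prediction
    g = p.copy()                           # cross-entropy gradient wrt logits
    g[np.arange(len(y)), y] -= 1
    g /= len(y)
    W_head -= lr * (h.T @ g)               # update this task's head only
    W_shared -= lr * (x.T @ ((g @ W_head.T) * (1 - h**2)))
    return -np.log(p[np.arange(len(y)), y] + 1e-12).mean()

# Toy batches standing in for real aerial features and game-video features.
x_real, y_real = rng.normal(size=(64, FEAT)), rng.integers(0, N_REAL, 64)
x_game, y_game = rng.normal(size=(64, FEAT)), rng.integers(0, N_GAME, 64)

# Alternate sources each iteration, so the shared layer sees both,
# while each head is only ever trained on its own label space.
for epoch in range(200):
    loss_r = step(x_real, y_real, W_real)
    loss_g = step(x_game, y_game, W_game)
print(f"final real-batch loss: {loss_r:.3f}, game-batch loss: {loss_g:.3f}")
```

The key property mirrored from the paper is that the two tasks never need a shared label space: only the backbone parameters are updated by both sources.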

Game Action Dataset

We employed GTA-5 (Grand Theft Auto V) and FIFA (Fédération Internationale de Football Association) for collecting the game action dataset. We asked players to play the games and recorded the same action from multiple views. Note that GTA-5 and FIFA allow users to record actions from multiple angles, with realistic-looking scenes and realistic camera motions.

In total, we collected seven human actions: cycling, fighting, soccer kicking, running, walking, shooting, and skydiving. Because soccer kicking is plentiful in FIFA, we collected kicking from FIFA and the remaining actions from GTA-5. Although our current approach uses only the aerial game videos, to make the dataset more complete we captured both ground and aerial video pairs, i.e., the same action captured from both aerial and ground cameras.

For each action, our dataset contains 200 videos (100 ground and 100 aerial), for a total of 1400 videos across the seven actions. Note that most of the scenes and interactions in video games are biased towards actions such as fighting, shooting, walking, and running. Employing game videos to improve action recognition in real-world videos is therefore not trivial, and in this paper we propose a unified approach that combines game and real videos through disjoint multitask learning.

YouTube Aerial Dataset

We collected this new dataset from drone videos available on YouTube. It contains actions corresponding to eight UCF101 actions: band marching, biking, cliff diving, golf swing, horse riding, kayaking, skateboarding, and surfing. The videos contain large, fast camera motion, and the aerial footage is captured at variable heights. A few example videos from this dataset are shown below. Each action contains 50 videos. Similar to the UCF-ARG dataset, the partition uses 60%, 10%, and 30% of the videos for training, validation, and testing, respectively.
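As an illustration, a 60/10/30 per-action split of the 50 videos (30 train, 5 validation, 15 test) could be produced as follows. This is only a sketch; the function name and seed are hypothetical, and any official split files for the dataset take precedence.

```python
import random

def split_action(video_ids, train_frac=0.6, val_frac=0.1, seed=0):
    """Shuffle one action's video ids and cut a train/val/test partition."""
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)       # deterministic shuffle per seed
    n_train = int(len(ids) * train_frac)   # 30 of 50
    n_val = int(len(ids) * val_frac)       # 5 of 50
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])         # remaining 15 for test

train, val, test = split_action(range(50))
print(len(train), len(val), len(test))  # 30 5 15
```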

BIBTEX

@article{sultani2021human,
title={Human action recognition in drone videos using a few aerial training examples},
author={Sultani, Waqas and Shah, Mubarak},
journal={Computer Vision and Image Understanding},
pages={103186},
year={2021},
publisher={Elsevier}
}