Dogfight: Detecting Drones from Drones Videos


Muhammad Waseem Ashraf, Waqas Sultani, and Mubarak Shah, "Dogfight: Detecting Drones from Drones Videos," Computer Vision and Pattern Recognition (CVPR), 2021.


As airborne vehicles become more autonomous and ubiquitous, it has become vital to develop the capability to detect objects in their surroundings. This paper addresses the problem of detecting drones from other flying drones. The erratic movement of the source and target drones, their small size, arbitrary shape, large intensity variations, and occlusion make this problem quite challenging. In this scenario, region-proposal based methods are not able to capture sufficient discriminative foreground-background information. Also, due to the extremely small size and complex motion of the source and target drones, feature aggregation based methods are unable to perform well. To handle this, instead of using region-proposal based methods, we propose a two-stage segmentation-based approach employing spatio-temporal attention cues. During the first stage, given overlapping frame regions, detailed contextual information is captured over convolution feature maps using pyramid pooling. After that, pixel- and channel-wise attention is enforced on the feature maps to ensure accurate drone localization. In the second stage, first-stage detections are verified and new probable drone locations are explored. To discover new drone locations, motion boundaries are used. This is followed by tracking candidate drone detections for a few frames, cuboid formation, extraction of 3D convolution feature maps, and drone detection within each cuboid.
The proposed approach is evaluated on two publicly available drone detection datasets and outperforms several competitive baselines.
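The pyramid pooling mentioned above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the pooling scales, feature-map sizes, and nearest-neighbour upsampling here are illustrative assumptions, chosen only to show how pooled context at several grid resolutions is concatenated back onto the original feature map.

```python
import numpy as np

def pyramid_pool(feat, scales=(1, 2, 3, 6)):
    """Average-pool a C x H x W feature map into an s x s grid for each
    scale s, upsample each grid back to H x W (nearest neighbour), and
    concatenate the results with the original features along channels."""
    c, h, w = feat.shape
    pooled_maps = [feat]
    for s in scales:
        grid = np.zeros((c, s, s))
        ys = np.linspace(0, h, s + 1).astype(int)
        xs = np.linspace(0, w, s + 1).astype(int)
        for i in range(s):
            for j in range(s):
                # mean over the cell's spatial extent, per channel
                grid[:, i, j] = feat[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]].mean(axis=(1, 2))
        # nearest-neighbour upsample back to H x W
        up = grid[:, np.minimum(np.arange(h) * s // h, s - 1), :]
        up = up[:, :, np.minimum(np.arange(w) * s // w, s - 1)]
        pooled_maps.append(up)
    return np.concatenate(pooled_maps, axis=0)

feat = np.random.rand(4, 12, 12)
out = pyramid_pool(feat)
print(out.shape)  # (20, 12, 12): original 4 channels + 4 per scale
```

Each pooled map summarizes context at a different granularity, so a pixel's output channels carry both local detail and scene-level statistics.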


We have presented a two-stage approach for detecting drones from other flying drones employing spatio-temporal cues. Instead of relying on region proposal-based methods, we have used a segmentation-based approach for accurate drone detection using pixel- and channel-wise attention. In addition to using appearance information, we have also exploited motion information between frames to achieve better recall. We observe that for drone-to-drone videos, the two-stage approach performs better than the one-stage approach.


Our pipeline is divided into two stages. Stage-1 extracts ResNet50* features from the overlapping regions of each frame, followed by pyramid pooling to retain global and local contextual information. Channel-wise and pixel-wise attention help in learning better localization of drones. ResNet50* refers to the modifications that we have applied (see Section 3.1). Stage-2 combines spatial information with the temporal data of the videos. Detections from stage-1, together with regions discovered using motion boundaries, serve as candidate regions where a UAV may exist. All the proposals are tracked for 8 frames, forward and backward, to generate cuboids of size 224 x 224 x 8. Each cuboid is passed through the I3D network followed by the attention network to accurately locate drones within it. In the figure, MD, TP, FP, and MB correspond to missed detection, true positive, false positive, and motion boundaries, respectively.
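The cuboid-formation step can be sketched in a few lines of NumPy. This is only an illustration of the cropping and stacking, not the paper's code: the fixed patch center stands in for the tracker's per-frame position estimate, and the frame size is an assumption.

```python
import numpy as np

def form_cuboid(frames, center, size=224, length=8):
    """Crop a size x size patch around `center` (x, y) from `length`
    consecutive frames and stack them into a size x size x length cuboid.
    A real tracker would update `center` per frame; here it is fixed."""
    h, w = frames[0].shape[:2]
    half = size // 2
    # clamp the center so the crop stays inside the frame
    x = int(np.clip(center[0], half, w - half))
    y = int(np.clip(center[1], half, h - half))
    patches = [f[y - half:y + half, x - half:x + half] for f in frames[:length]]
    return np.stack(patches, axis=-1)

frames = [np.zeros((480, 640), dtype=np.float32) for _ in range(8)]
cuboid = form_cuboid(frames, center=(100, 50))
print(cuboid.shape)  # (224, 224, 8)
```

The resulting 224 x 224 x 8 cuboid is exactly the shape a 3D-convolutional backbone such as I3D expects for a short temporal window.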

Architectural details of (a) the channel-wise and (b) the pixel-wise attention network, where 'FC' denotes a fully connected layer (with its number of units) in (a), and 'C' and 'F' denote convolution and the number of filters, respectively, in (b).
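The two attention mechanisms in the caption above can be sketched roughly in NumPy: a squeeze-and-excitation-style channel attention and a 1x1-convolution pixel attention. The layer sizes and random weights below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """Channel-wise attention: squeeze the C x H x W map spatially,
    pass it through two FC layers, and rescale each channel."""
    squeezed = feat.mean(axis=(1, 2))          # (C,) global average pool
    hidden = np.maximum(w1 @ squeezed, 0.0)    # ReLU
    weights = sigmoid(w2 @ hidden)             # (C,) weights in (0, 1)
    return feat * weights[:, None, None]

def pixel_attention(feat, w):
    """Pixel-wise attention: a 1x1 convolution produces one weight per
    spatial location, which rescales every channel at that pixel."""
    attn = sigmoid(np.tensordot(w, feat, axes=([0], [0])))  # (H, W)
    return feat * attn[None, :, :]

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 6, 6))
out_c = channel_attention(feat, rng.standard_normal((4, 8)), rng.standard_normal((8, 4)))
out_p = pixel_attention(feat, rng.standard_normal(8))
print(out_c.shape, out_p.shape)  # (8, 6, 6) (8, 6, 6)
```

Because both attention weights lie in (0, 1), each mechanism only suppresses uninformative channels or pixels; it never amplifies them, which matches the role of attention as a soft gate on the feature map.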

Qualitative comparison of one-stage versus two-stage detection results. (a) shows the detection results of stage-1 and (b) shows the detection results of the two-stage approach. Red boxes represent detections; blue boxes are enlarged only for better visualization. The first stage misses one drone in each example.

Our method uses a two-stage approach. The stage-1 video shows some missed detections, which are recovered by stage 2 and highlighted by black arrows.



Green box = ground truth
Red box = detection
Bigger boxes are shown for better visualization
Black arrows indicate additional detections in the two-stage approach

The following videos show detections on the FL-drones dataset:

Green box for ground truth
Red box for detection
Bigger boxes are shown for better visualization

The following videos show detections on the NPS-drones dataset:

Green box for ground truth
Red box for detection
Bigger boxes are shown for better visualization


Sample frames from the NPS-drones dataset. The green boxes enclose drones.

Sample frames from the FL-drones dataset. The green boxes enclose drones.

This figure shows the variability of drone shape and size in two datasets: NPS-drones (first two rows) and FL-drones (last two rows). The green boxes represent the ground-truth bounding boxes.


@article{ashraf2021dogfight,
  title={Dogfight: Detecting Drones from Drones Videos},
  author={Ashraf, Muhammad Waseem and Sultani, Waqas and Shah, Mubarak},
  journal={arXiv preprint arXiv:2103.17242},
  year={2021}
}