MLSL: Multi-Level Self-Supervised Learning for Domain Adaptation with Spatially Independent and Semantically Consistent Labeling

Most recent deep semantic segmentation algorithms suffer from large generalization errors, even when powerful hierarchical representation models based on convolutional neural networks are employed. This can be attributed to limited training data and a large distribution gap between the training and test domains. In this paper, we propose a multi-level self-supervised learning model for domain adaptation of semantic segmentation. Exploiting the idea that an object (and most of the stuff, given context) should be labeled consistently regardless of its location, we generate spatially independent and semantically consistent (SISC) pseudo-labels by segmenting multiple sub-images with the base model and designing an aggregation strategy. Image-level pseudo weak-labels (PWL) are computed to guide domain adaptation by capturing global context similarity between the source and target domains at the latent-space level. This helps the latent space learn a representation even when very few pixels belong to a category (a small object, for example) compared to the rest of the image. Our multi-level self-supervised learning (MLSL) approach outperforms existing state-of-the-art (self-supervised or adversarial) methods. Specifically, with all other settings kept the same, MLSL yields an mIoU gain of 5.1% on GTA-V to Cityscapes adaptation and 4.3% on SYNTHIA to Cityscapes adaptation over the existing state-of-the-art.
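The SISC pseudo-labeling idea can be illustrated with a short sketch: run the base model on several sub-image crops of the same frame, accumulate per-pixel class votes, and keep a pseudo-label only where every crop that saw the pixel agreed. The `predict_fn` interface, crop format, and ignore index below are assumptions for illustration, not the paper's exact pipeline.

```python
import numpy as np

def sisc_pseudo_labels(predict_fn, image, crop_boxes, num_classes, ignore_index=255):
    """Aggregate per-pixel class votes from multiple sub-image predictions.

    predict_fn(crop) -> (h, w) array of class ids (hypothetical interface).
    crop_boxes is a list of (y0, y1, x0, x1) windows covering the image.
    """
    H, W = image.shape[:2]
    votes = np.zeros((num_classes, H, W), dtype=np.int32)
    covered = np.zeros((H, W), dtype=np.int32)
    for (y0, y1, x0, x1) in crop_boxes:
        pred = predict_fn(image[y0:y1, x0:x1])
        for c in range(num_classes):
            votes[c, y0:y1, x0:x1] += (pred == c)
        covered[y0:y1, x0:x1] += 1
    labels = votes.argmax(axis=0).astype(np.int64)
    # keep a pixel only if every crop that saw it agreed on its class
    consistent = (votes.max(axis=0) == covered) & (covered > 0)
    labels[~consistent] = ignore_index
    return labels
```

Pixels with disagreeing predictions across crops fall back to the ignore index, so only location-consistent labels survive as self-supervision targets.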

Accepted in WACV 2020


Visual identification of gunmen in a crowd is a challenging problem that requires resolving the association between a person and an object (a firearm). We present a novel approach to this problem based on human-object interaction (and non-interaction) bounding boxes. In a given image, humans and firearms are detected separately. Each detected human is paired with each detected firearm, allowing us to create a paired bounding box that contains both the object and the human. A network is trained to classify each paired bounding box according to whether the human is carrying the identified firearm. Extensive experiments were performed to evaluate the effectiveness of the algorithm, including exploiting the full pose of the human, hand keypoints, and their association with the firearm. Spatially localized features, obtained through multi-size proposals with adaptive average pooling, are key to the success of our method. We have also extended a previous firearm detection dataset by adding more images and annotating the human-firearm pairs (including bounding boxes for firearms and gunmen). The experimental results (AP = 78.5) demonstrate the effectiveness of the proposed method.
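The pairing step can be sketched in a few lines: every detected human box is combined with every detected firearm box into an enclosing union box, which is what the classifier then sees. Box format and the dictionary layout are illustrative assumptions.

```python
def paired_boxes(humans, firearms):
    """Pair every detected human with every detected firearm and return
    the enclosing (union) box for each pair. Boxes are (x0, y0, x1, y1)."""
    pairs = []
    for h in humans:
        for f in firearms:
            union = (min(h[0], f[0]), min(h[1], f[1]),
                     max(h[2], f[2]), max(h[3], f[3]))
            pairs.append({"human": h, "firearm": f, "box": union})
    return pairs
```

Each union box is then cropped and classified as an interacting (carrying) or non-interacting pair.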

Accepted in ICIP 2020

Exploiting Geometric Constraints on Dense Trajectories for Motion Saliency

Existing approaches for salient motion segmentation cannot explicitly learn geometric cues and often produce false detections on prominent static objects. We exploit multiview geometric constraints to avoid such mistakes. To handle non-rigid backgrounds such as the sea, we also propose a robust fusion mechanism between motion- and appearance-based features. We compute dense trajectories covering every pixel in the video and propose trajectory-based epipolar distances to distinguish between background and foreground regions. Trajectory epipolar distances are data-independent and can be readily computed given a few feature correspondences across images. We show that by combining epipolar distances with optical flow, a powerful motion network can be learned. To enable the network to leverage both sources of information, we propose a simple mechanism we call input-dropout. We outperform the previous motion network on the DAVIS-2016 dataset by 5.2% in mean IoU. By robustly fusing our motion network with an appearance network using the proposed input-dropout, we also outperform previous methods on the DAVIS-2016, DAVIS-2017, and SegTrack-v2 datasets.
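The core geometric cue is the point-to-epipolar-line distance: points on rigid background satisfy the epipolar constraint between frames, while independently moving foreground points violate it. A minimal symmetric version is sketched below, assuming a known fundamental matrix; the paper's exact per-trajectory measure may differ.

```python
import numpy as np

def epipolar_distance(F, x1, x2):
    """Symmetric point-to-epipolar-line distance for correspondences.

    F: 3x3 fundamental matrix; x1, x2: (N, 2) points in the two frames.
    Background points yield near-zero distances; points on independently
    moving objects violate the constraint and score high.
    """
    ones = np.ones((x1.shape[0], 1))
    p1 = np.hstack([x1, ones])   # homogeneous coordinates
    p2 = np.hstack([x2, ones])
    l2 = p1 @ F.T                # epipolar lines in image 2 (rows F @ p1_i)
    l1 = p2 @ F                  # epipolar lines in image 1 (rows F^T @ p2_i)
    d2 = np.abs(np.sum(p2 * l2, axis=1)) / np.linalg.norm(l2[:, :2], axis=1)
    d1 = np.abs(np.sum(p1 * l1, axis=1)) / np.linalg.norm(l1[:, :2], axis=1)
    return d1 + d2
```

For a camera translating along the x-axis, motion consistent with the epipolar geometry (horizontal displacement) scores zero, while vertical displacement scores high.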

Muhammad Faisal, Ijaz Akhter, Mohsen Ali and Richard Hartley 

Accepted in WACV 2020

Twin-Net Descriptor: Twin Negative Mining with Quad Loss for Patch Based Matching

Local keypoint matching is an important step for computer vision based tasks.
In recent years, deep convolutional neural network (CNN) based strategies have been employed to learn descriptors that improve keypoint matching accuracy. Recent state-of-the-art work in this direction relies primarily on a triplet-based loss function (and its variations) utilizing three samples: an anchor, a positive, and a negative. In this work, we propose a novel "Twin Negative Mining" sampling strategy coupled with a quad loss function to train a deep neural network pipeline (Twin-Net) for generating a robust descriptor with increased discriminatory power to differentiate between patches that do not correspond to each other. Our sampling strategy and choice of loss function place an upper bound: the descriptors of two patches representing the same location should be, at worst, no more dissimilar than the descriptors of two similar-looking patches that do not belong to the same 3D location. This increases the generalization capability of the network, which outperforms its existing counterparts when trained on the same datasets. Twin-Net outputs a 128-dimensional descriptor and uses L2 distance as the similarity metric, and hence conforms to classical descriptor-matching pipelines such as that of SIFT. Our results on the Brown and HPatches datasets demonstrate Twin-Net's consistently better performance, as well as better discriminatory and generalization capability, compared to the state-of-the-art.
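The upper bound described above suggests a margin loss over four samples: the matching-pair distance should be smaller than the distance between the two mined twin negatives. The formulation below is an assumed sketch of such a quad loss, not the paper's exact objective, and uses numpy in place of a deep learning framework.

```python
import numpy as np

def quad_loss(anchor, positive, neg1, neg2, margin=1.0):
    """Sketch of a quad loss: descriptors of matching patches (anchor,
    positive) should be closer, by a margin, than descriptors of two
    similar-looking non-matching patches (neg1, neg2).
    All inputs are (N, D) descriptor batches."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)   # matching pair
    d_twin = np.linalg.norm(neg1 - neg2, axis=1)        # twin negatives
    return np.maximum(d_pos - d_twin + margin, 0.0).mean()
```

The loss is zero once the twin-negative distance exceeds the matching-pair distance by the margin, which is exactly the worst-case bound the abstract describes.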

Aman Irshad, Rehan Hafiz, Mohsen Ali, Yongju Cho, and Jeongil Seo
IEEE Access

Multi-focus Image Fusion Using Content Adaptive Blurring

Multi-focus image fusion has emerged as an important research area in information fusion. It aims at increasing the depth of field by extracting focused regions from multiple partially focused images and merging them into a composite image in which all objects are in focus. In this paper, a novel multi-focus image fusion algorithm is presented in which the task of detecting the focused regions is achieved using a Content Adaptive Blurring (CAB) algorithm. The proposed algorithm induces non-uniform blur in a multi-focus image depending on its underlying content. In particular, it analyzes the local image quality in a neighborhood and determines whether blur should be induced without losing local image quality. In CAB, pixels belonging to blurred regions receive little or no blur, whereas focused regions receive significant blur. The absolute difference between the original image and the CAB-blurred image yields an initial segmentation map, which is further refined using morphological operators and graph-cut techniques to improve segmentation accuracy. Quantitative and qualitative evaluations, and comparisons with the current state-of-the-art on two publicly available datasets, demonstrate the strength of the proposed algorithm.
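The blur-and-difference idea can be sketched compactly: blur the image, take the absolute difference, and threshold it. Focused regions change substantially under blurring, while already-blurred regions barely change. A uniform box blur stands in here for CAB, whose blur strength is content-adaptive; the kernel size and threshold are illustrative.

```python
import numpy as np

def focus_map(image, ksize=5, thresh=0.02):
    """Initial focus segmentation from |image - blurred(image)|.

    image: 2D float array. Uses a simple box blur as a stand-in for CAB.
    Returns a boolean map: True where the pixel appears in focus."""
    pad = ksize // 2
    padded = np.pad(image.astype(float), pad, mode="edge")
    blurred = np.zeros_like(image, dtype=float)
    for dy in range(ksize):                      # accumulate the box window
        for dx in range(ksize):
            blurred += padded[dy:dy + image.shape[0], dx:dx + image.shape[1]]
    blurred /= ksize * ksize
    return np.abs(image - blurred) > thresh
```

On a synthetic image with a sharp step edge, the map fires near the edge (high-frequency, in-focus content) and stays off in the flat regions.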

Muhammad Shahid Farid, Arif Mahmood, Somaya Ali Al-Maadeed
Information Fusion 2019

Moving Object Detection in Complex Scenes Using Spatiotemporal Structured-Sparse RPCA

Moving object detection is a fundamental step in various computer vision applications. Robust principal component analysis (RPCA)-based methods have often been employed for this task. However, the performance of these methods deteriorates in the presence of dynamic background scenes, camera jitter, camouflaged moving objects, and/or variations in illumination. This is because of an underlying assumption that the elements in the sparse component are mutually independent, whereby the spatiotemporal structure of the moving objects is lost. To address this issue, we propose a spatiotemporal structured-sparse RPCA algorithm for moving object detection, in which we impose spatial and temporal regularization on the sparse component in the form of graph Laplacians. Each Laplacian corresponds to a multi-feature graph constructed over superpixels in the input matrix. We enforce the sparse component to act as eigenvectors of the spatial and temporal graph Laplacians while minimizing the RPCA objective function. These constraints incorporate a spatiotemporal subspace structure within the sparse component. Thus, we obtain a novel objective function for separating moving objects in the presence of complex backgrounds. The proposed objective function is solved using a linearized alternating direction method of multipliers (ADMM) based batch optimization. Moreover, we also propose an online optimization algorithm for real-time applications. We evaluated both the batch and online solutions on six publicly available datasets that include most of the aforementioned challenges. Our experiments demonstrate the superior performance of the proposed algorithms compared with the current state-of-the-art methods.
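The graph Laplacian regularization used above takes the familiar trace form: for a sparse component S and an affinity matrix W over superpixels, the penalty tr(Sᵀ L S) with L = D − W is small when S varies smoothly over the graph. The sketch below shows only this term, with a toy affinity matrix; the paper's multi-feature graph construction and full ADMM solver are omitted.

```python
import numpy as np

def laplacian_reg(S, W):
    """Spatiotemporal regularization term tr(S^T L S), L = D - W.

    S: (n, k) matrix (e.g. sparse component over n superpixels).
    W: (n, n) symmetric non-negative affinity matrix.
    Equals 0.5 * sum_ij W_ij * ||S_i - S_j||^2: zero iff connected
    nodes carry identical rows."""
    L = np.diag(W.sum(axis=1)) - W   # combinatorial graph Laplacian
    return np.trace(S.T @ L @ S)
```

A constant S over a connected graph incurs zero penalty, while differing values on connected nodes are penalized by their squared difference times the edge weight.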

Sajid Javed, Arif Mahmood, Somaya Al-Maadeed, Thierry Bouwmans, and Soon Ki Jung


With recent breakthroughs in commodity 3D imaging solutions such as depth sensing, photogrammetry, stereoscopic vision, and structured light, 3D shape recognition is becoming an increasingly important problem. A longstanding question is what the format of the 3D shape should be (voxels, meshes, point clouds, etc.) and what would make a good generic feature representation for shape recognition. This question is particularly important in the context of convolutional neural networks (CNNs), whose efficacy and complexity depend on the choice of input shape format and the design of the network. Both 3D voxel representations and collections of views rendered as 2D images have produced competitive results. Similarly, networks with a few million parameters and networks with several hundred million parameters have shown similar performance. In this work, we compare these solutions and analyze the factors that increase the parameter count without significantly improving accuracy. Based on this analysis, we propose a representation method (point cloud to 2D grid) and an architecture that requires far fewer CNN parameters while achieving competitive accuracy.

Usama Shafiq, Murtaza Taj, Mohsen Ali
International Conference on Image Processing (ICIP) 2017


Current image transformation and recoloring algorithms introduce artistic effects into photographs based on the user's input of target image(s) or a selection of pre-designed filters. In this paper, we present an automatic image-transformation method that transforms a source image so that it induces the emotional affect desired by the user in the viewer. Our method can handle a much more diverse set of images than previous methods. A discussion and analysis of failure cases is provided, indicating the inherent limitations of color-transfer-based methods for emotion assignment.

Afsheen Rafaqat Ali, Mohsen Ali
British Machine Vision Conference (BMVC) 2017


Deep convolutional neural networks (CNNs) have outperformed existing object recognition and detection algorithms. This paper describes a deep learning approach that analyzes a geo-referenced satellite image and efficiently detects built structures in it. A Fully Convolutional Network (FCN) is trained on low-resolution Google Earth satellite imagery to achieve this result. The detected built communities are then correlated with vaccination activity.

Anza Shakeel, Mohsen Ali
arXiv 2017


This paper aims to bridge the affective gap between image content and the emotional response it elicits in the viewer by using High-Level Concepts (HLCs). In contrast to previous work that relied solely on low-level features or used a convolutional neural network (CNN) as a black box, we use HLCs generated by pre-trained CNNs explicitly to investigate the associations between these HLCs and a (small) set of Ekman's emotional classes. Experimental results demonstrate that our results are comparable to existing methods, while providing a clear view of the association between HLCs and emotional classes that is missing in most existing work.

Afsheen Rafaqat Ali, Usman Shahid, Mohsen Ali, Jeffrey Ho
Winter Conference on Applications of Computer Vision (WACV) 2017


This paper develops the novel notion of deconstructive learning and proposes a practical model for deconstructing a broad class of binary classifiers commonly used in vision applications. Specifically, the problem studied in this paper is: given an image-based binary classifier C as a black-box oracle, how much can we learn of its internal workings by simply querying it? In particular, we demonstrate that it is possible to ascertain the type of kernel function used by the classifier, the number of support vectors, and the unknown feature space using only image queries.

Mohsen Ali, Jeffrey Ho
Asian Conference on Computer Vision (ACCV) 2014