Jingkai Zhou Varun Jampani Zhixiong Pi Qiong Liu Ming-Hsuan Yang
[Paper] [Supplementary] [Poster] [Code]
Convolution is one of the basic building blocks of CNN architectures. Despite its widespread use, standard convolution has two main shortcomings: it is content-agnostic and computation-heavy. Dynamic filters are content-adaptive, but they further increase the computational overhead. Depth-wise convolution is a lightweight variant, but it usually leads to a drop in CNN performance or requires a larger number of channels. In this work, we propose the Decoupled Dynamic Filter (DDF), which tackles both shortcomings simultaneously. Inspired by recent advances in attention, DDF decouples a depth-wise dynamic filter into spatial and channel dynamic filters. This decomposition considerably reduces the number of parameters and limits computational costs to the same level as depth-wise convolution. At the same time, we observe a significant boost in performance when replacing standard convolution with DDF in classification networks: the top-1 accuracy of ResNet50 and ResNet101 improves by 1.9% and 1.3%, respectively, while their computational costs are reduced by nearly half. Experiments on detection and joint upsampling networks also demonstrate the superior performance of the DDF upsampling variant (DDF-Up) compared with standard convolution and specialized content-adaptive layers.
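The core DDF operation can be sketched in a few lines of PyTorch. This is a minimal, readability-first sketch and not the released implementation: the effective per-position, per-channel filter is the elementwise product of a spatial filter predicted per pixel and a channel filter predicted per channel, applied here via `unfold`. Tensor names and shapes are illustrative assumptions, and details such as filter normalization are omitted.

```python
import torch
import torch.nn.functional as F

def ddf_apply(feat, spatial_filter, channel_filter, kernel_size=3):
    """Apply decoupled dynamic filters (naive unfold-based sketch).

    feat:           [B, C, H, W]    input feature map
    spatial_filter: [B, k*k, H, W]  one k x k filter per spatial position
    channel_filter: [B, C, k*k]     one k x k filter per channel
    The effective filter at (c, i, j) is the elementwise product of the two.
    """
    b, c, h, w = feat.shape
    k = kernel_size
    # Unfold local k x k neighbourhoods: [B, C, k*k, H*W]
    patches = F.unfold(feat, k, padding=k // 2).view(b, c, k * k, h * w)
    # Broadcast-multiply the two decoupled filters with the patches.
    sf = spatial_filter.view(b, 1, k * k, h * w)   # shared across channels
    cf = channel_filter.view(b, c, k * k, 1)       # shared across positions
    out = (patches * sf * cf).sum(dim=2)           # [B, C, H*W]
    return out.view(b, c, h, w)

# Shape check with random filters; in the real layer both filters are
# predicted from the input (see the module sketch below).
x = torch.randn(2, 8, 16, 16)
y = ddf_apply(x, torch.randn(2, 9, 16, 16), torch.randn(2, 8, 9))
assert y.shape == x.shape
```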
Here we illustrate the DDF operation and the DDF module. Orange denotes the spatial dynamic filters / branch, and green denotes the channel dynamic filters / branch. "Filter application" means applying the convolution at a single position. ‘GAP’ denotes global average pooling and ‘FC’ denotes a fully connected layer.
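The two filter-prediction branches described above can be sketched as follows, under assumed layer choices: the spatial branch predicts a k×k filter per position with a 1×1 convolution, and the channel branch predicts a k×k filter per channel with GAP followed by FC layers. The hidden size (reduction ratio) and omission of filter normalization are simplifications of this sketch, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DDFFilterBranches(nn.Module):
    """Predict the decoupled filters of one DDF layer (illustrative sketch).

    Spatial branch: 1x1 conv -> k*k values per position.
    Channel branch: global average pooling -> FC -> FC -> k*k values per
    channel (a squeeze-and-excitation-style bottleneck; the reduction ratio
    here is an assumption).
    """

    def __init__(self, channels, kernel_size=3, reduction=4):
        super().__init__()
        k2 = kernel_size * kernel_size
        self.kernel_size = kernel_size
        self.channels = channels
        self.spatial = nn.Conv2d(channels, k2, kernel_size=1)
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                     # 'GAP'
            nn.Flatten(),
            nn.Linear(channels, channels // reduction),  # 'FC'
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels * k2),
        )

    def forward(self, x):
        b = x.size(0)
        k2 = self.kernel_size * self.kernel_size
        spatial_filter = self.spatial(x)                             # [B, k*k, H, W]
        channel_filter = self.channel(x).view(b, self.channels, k2)  # [B, C, k*k]
        return spatial_filter, channel_filter

# The predicted filters then drive the decoupled application sketched earlier:
# sf, cf = DDFFilterBranches(64)(feat); out = ddf_apply(feat, sf, cf)
```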
We also propose the DDF-Up module for typical and joint upsampling scenarios. Here is the structure of the DDF-Up module. When the upsampling scale factor is 2, the DDF-Up module contains 4 branches. DDF-Up generates the high-resolution feature by stacking and pixel-shuffling the branch outputs. For typical upsampling, the guidance feature is predicted from the input features via a depth-wise convolution layer.
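The stacking-and-pixel-shuffling structure can be sketched as below. To stay self-contained, each branch here is a plain depth-wise convolution stand-in; in the actual DDF-Up module each branch is a DDF layer whose filters are predicted from the guidance feature. The class name and the stand-in branches are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class DDFUpSketch(nn.Module):
    """Structural sketch of DDF-Up for an upsampling factor `scale`.

    It runs scale**2 parallel branches, interleaves their outputs along the
    channel dimension, and pixel-shuffles them into a feature map `scale`
    times larger. Each branch below is a depth-wise conv stand-in; in DDF-Up
    proper it is a DDF layer conditioned on the guidance feature.
    """

    def __init__(self, channels, scale=2, kernel_size=3):
        super().__init__()
        self.scale = scale
        # For typical upsampling, the guidance feature is predicted from the
        # input feature with a depth-wise convolution.
        self.guidance = nn.Conv2d(channels, channels, kernel_size,
                                  padding=kernel_size // 2, groups=channels)
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size,
                      padding=kernel_size // 2, groups=channels)
            for _ in range(scale * scale)      # 4 branches when scale == 2
        ])
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, x, guidance=None):
        if guidance is None:                   # typical (non-joint) upsampling
            guidance = self.guidance(x)
        # In DDF-Up each branch's filters come from `guidance`; the depth-wise
        # stand-ins below ignore it.
        outs = torch.stack([branch(x) for branch in self.branches], dim=2)
        b, c, s2, h, w = outs.shape            # [B, C, scale**2, H, W]
        stacked = outs.view(b, c * s2, h, w)   # each branch fills one sub-pixel
        return self.shuffle(stacked)           # [B, C, H*scale, W*scale]

x = torch.randn(1, 16, 32, 32)
assert DDFUpSketch(16, scale=2)(x).shape == (1, 16, 64, 64)
```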
To demonstrate the use of DDFs as basic building blocks, we experiment with the widely used ResNet architectures for image classification. We replace the 3×3 convolution layers in the basic and bottleneck blocks of ResNet with DDF layers and keep the original hyperparameters, in particular the same number of channels.
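Concretely, such a swap can be done by iterating over torchvision's ResNet blocks and replacing each 3×3 convolution with a DDF layer while keeping channel counts unchanged. The `make_ddf` factory and the `DDFLayer` name in the usage comment are placeholders for an actual DDF implementation (e.g. the released code or the sketches above), which is also assumed to handle strided blocks.

```python
import torch.nn as nn
from torchvision.models import resnet50
from torchvision.models.resnet import BasicBlock, Bottleneck

def swap_3x3_with_ddf(model: nn.Module, make_ddf) -> nn.Module:
    """Replace every 3x3 convolution inside ResNet blocks with a DDF layer.

    `make_ddf(channels, stride)` must return a module with a conv-like
    interface (a placeholder for a real DDF implementation). All other
    hyper-parameters, in particular channel counts, are left untouched.
    """
    for block in model.modules():
        if isinstance(block, Bottleneck):
            conv = block.conv2                 # the single 3x3 conv
            block.conv2 = make_ddf(conv.out_channels, conv.stride[0])
        elif isinstance(block, BasicBlock):
            for name in ("conv1", "conv2"):    # two 3x3 convs per block
                conv = getattr(block, name)
                setattr(block, name, make_ddf(conv.out_channels, conv.stride[0]))
    return model

# Example (hypothetical DDF layer class standing in for a real implementation):
# ddf_resnet50 = swap_3x3_with_ddf(resnet50(), lambda c, s: DDFLayer(c, stride=s))
```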
As we can see, DDF-ResNets perform favorably against other filter types. In particular, DDF-ResNet50 and DDF-ResNet101 surpass ResNet50 and ResNet101 in top-1 accuracy by 1.9% and 1.3%, respectively, while reducing computational costs by nearly half.
DDF-ResNets also surpass several related state-of-the-art ResNet variants.
We evaluate DDF-Up on the joint depth upsampling task. DDF-Up surpasses standard CNN-based techniques such as DJF and DJF+ by a large margin. DDF-Up also improves over the dynamic-filtering-based PAC-Net by a relative 10%, while reducing computational costs by an order of magnitude.
Please consider citing the following papers if you make use of this work and/or the corresponding code:
@inproceedings{zhou_ddf_cvpr_2021,
title = {Decoupled Dynamic Filter Networks},
author = {Zhou, Jingkai and Jampani, Varun and Pi, Zhixiong and Liu, Qiong and Yang, Ming-Hsuan},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = jun,
year = {2021}
}