Overview

Abstract

Convolution is one of the basic building blocks of CNN architectures. Despite its common use, standard convolution has two main shortcomings: Content-agnostic and Computation-heavy. Dynamic filters are content-adaptive, while further increasing the computational overhead. Depth-wise convolution is a lightweight variant, but it usually leads to a drop in CNN performance or requires a larger number of channels. In this work, we propose the Decoupled Dynamic Filter (DDF) that can simultaneously tackle both of these shortcomings. Inspired by recent advances in attention, DDF decouples a depth-wise dynamic filter into spatial and channel dynamic filters. This decomposition considerably reduces the number of parameters and limits computational costs to the same level as depth-wise convolution. Meanwhile, we observe a significant boost in performance when replacing standard convolution with DDF in classification networks. ResNet50 / 101 get improved by 1.9% and 1.3% on the top-1 accuracy, while their computational costs are reduced by nearly half. Experiments on the detection and joint upsampling networks also demonstrate the superior performance of the DDF upsampling variant (DDF-Up) in comparison with standard convolution and specialized content-adaptive layers.

Method

Here we illustrate the DDF operation and the DDF module. Orange denotes the spatial dynamic filters / branch, and green denotes the channel dynamic filters / branch. ‘Filter application’ means applying the convolution operation at a single position. ‘GAP’ denotes global average pooling and ‘FC’ denotes a fully connected layer.
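To make the filter application concrete, here is a minimal pure-Python sketch. It assumes that the per-pixel spatial filter and the per-channel filter are combined by element-wise multiplication before being applied to the k×k neighborhood, with zero padding at the borders. The function name `ddf_apply` and the nested-list layout are illustrative choices, not taken from the released code, and the branches that predict the filters (including GAP and FC) are left out.

```python
def ddf_apply(feature, spatial_filters, channel_filters, k=3):
    """Apply decoupled dynamic filters to a feature map (illustrative sketch).

    feature:         [C][H][W] nested lists
    spatial_filters: [H][W][k*k], one k x k filter per pixel, shared by all channels
    channel_filters: [C][k*k], one k x k filter per channel, shared by all pixels
    """
    channels, height, width = len(feature), len(feature[0]), len(feature[0][0])
    pad = k // 2
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for c in range(channels):
        for i in range(height):
            for j in range(width):
                acc = 0.0
                for u in range(k):
                    for v in range(k):
                        y, x = i + u - pad, j + v - pad
                        # Assumption: neighbors outside the map count as zero.
                        if 0 <= y < height and 0 <= x < width:
                            tap = u * k + v
                            # Assumption: the two filters are multiplied tap by tap.
                            weight = spatial_filters[i][j][tap] * channel_filters[c][tap]
                            acc += weight * feature[c][y][x]
                out[c][i][j] = acc
    return out
```

Under this reading, no full filter of size C × H × W × k × k is ever stored: the spatial filters hold H × W × k × k values and the channel filters hold C × k × k values, and the two are only combined while a single position is being filtered.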

We also propose the DDF-Up module for typical and joint upsampling scenarios. Here is the structure of the DDF-Up module. When the upsampling scale factor is set to 2, the DDF-Up module contains 4 branches. DDF-Up generates the high-resolution feature by stacking and pixel-shuffling the branch outputs. For typical upsampling, the guidance feature is predicted from the input features via a depth-wise convolution layer.
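The stacking and pixel-shuffling step can be sketched as follows for a single channel. The helper name `pixel_shuffle` and the branch ordering are assumptions made for illustration: branch `b = dy * scale + dx` fills the output pixel at offset `(dy, dx)` inside each scale × scale cell, which is the convention pixel-shuffle layers commonly use. The actual DDF-Up implementation may order its branches differently.

```python
def pixel_shuffle(branches, scale=2):
    """Interleave scale*scale low-resolution maps into one high-resolution map.

    branches: list of scale*scale maps, each [H][W].
    Assumed ordering: branch b = dy * scale + dx writes to
    output pixel (scale * i + dy, scale * j + dx).
    """
    assert len(branches) == scale * scale
    height, width = len(branches[0]), len(branches[0][0])
    out = [[0.0] * (width * scale) for _ in range(height * scale)]
    for b, branch in enumerate(branches):
        dy, dx = divmod(b, scale)
        for i in range(height):
            for j in range(width):
                out[scale * i + dy][scale * j + dx] = branch[i][j]
    return out
```

With a scale factor of 2, the four branch outputs of size H × W become one map of size 2H × 2W, so each branch is responsible for one of the four sub-pixel positions.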

Experiments

To demonstrate the use of DDFs as basic building blocks, we experiment with the widely used ResNet architectures for image classification. We replace the 3×3 convolution layers in the basic and bottleneck blocks of ResNet with DDF layers and keep the original hyperparameters, in particular the same number of channels.

As we can see, DDF-ResNets perform favorably against other filter types. In particular, DDF-ResNets surpass ResNet50 and ResNet101 by 1.9% and 1.3% in top-1 accuracy, respectively, while reducing computational costs by nearly half.
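A rough count shows why a saving of about half is plausible. The sketch below counts multiply-accumulates per output pixel for a bottleneck block in the first ResNet50 stage (256 → 64 → 64 → 256 channels) and compares a standard 3×3 convolution with one at depth-wise cost. It is only an estimate: it ignores the cost of predicting the dynamic filters, strides, and shortcut connections, and the helper `conv_macs_per_pixel` is a name introduced here for illustration.

```python
def conv_macs_per_pixel(c_in, c_out, k, depthwise=False):
    """Multiply-accumulates needed for one output pixel across all output channels."""
    if depthwise:
        return c_in * k * k      # each channel is filtered on its own
    return c_in * c_out * k * k  # every output channel sees every input channel


# Bottleneck block in the first ResNet50 stage: 256 -> 64 -> 64 -> 256 channels.
reduce_1x1 = conv_macs_per_pixel(256, 64, 1)
conv_3x3 = conv_macs_per_pixel(64, 64, 3)
expand_1x1 = conv_macs_per_pixel(64, 256, 1)
standard_total = reduce_1x1 + conv_3x3 + expand_1x1

# Same block with the 3x3 layer at depth-wise cost (filter prediction not counted).
depthwise_3x3 = conv_macs_per_pixel(64, 64, 3, depthwise=True)
depthwise_level_total = reduce_1x1 + depthwise_3x3 + expand_1x1

share_of_3x3 = conv_3x3 / standard_total
saving = 1 - depthwise_level_total / standard_total
print(f"3x3 share of block cost: {share_of_3x3:.1%}, estimated saving: {saving:.1%}")
```

In this block the standard 3×3 convolution accounts for a little over half of the multiply-accumulates, so bringing it down to depth-wise cost removes roughly that share. The cost of the filter prediction branches comes on top of this estimate.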

DDF-ResNets also surpass several related state-of-the-art variants of ResNets.

We evaluate DDF-Up on the joint depth upsampling task. DDF-Up surpasses standard CNN techniques like DJF and DJF+ by a large margin. DDF-Up also improves over the dynamic-filtering PAC-Net by about 10% in relative terms, while reducing computational costs by an order of magnitude.

Visualization

We visualize 16× joint depth upsampling results, where we can see that DDF-Up-Net recovers more details than PAC-Net and other techniques.

Citation

Please consider citing the following paper if you make use of this work and/or the corresponding code:

@inproceedings{zhou_ddf_cvpr_2021,
    title = {Decoupled Dynamic Filter Networks},
    author = {Zhou, Jingkai and Jampani, Varun and Pi, Zhixiong and Liu, Qiong and Yang, Ming-Hsuan},
    booktitle = {IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR)},
    month = jun,
    year = {2021}
}