Soramichi's blog

Some seek complex solutions to simple problems; it is better to find simple solutions to complex problems

Dataflow Analysis to Semi-Automatically Find Chainer Bugs

Preface

As a system software researcher working at an (you know, one of many) "artificial intelligence research center", I use Chainer to explore what kinds of system characteristics and support real AI applications need. Chainer is really good for this purpose because the framework itself is very simple, so it is easy to hack as you wish.

Although the framework is actively maintained, I sometimes happen to find bugs, especially when I use it in a slightly different way than usual. This post explains a tiny tiny idea I came up with to (kind of) semi-automatically find a certain type of bug in Chainer.

The Idea

So the idea is: "the forward and the backward propagations for the same calculation are supposed to do similar things, especially for preparation". For example, both forward and backward of the linear function convert the first input into a matrix and assign the result to x (x = _as_mat(inputs[0])), and assign the second input to W (W = inputs[1]).

Given this idea, I extract all assignments for each variable and compare the extracted assignments between the forward and backward functions. If a variable with the same name appears in both forward and backward but with different assignments, it might be a potential bug. In the linear example, both x and W have the same assignments in forward and backward.
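
To give a rough idea of what the comparison does, here is a minimal sketch built on Python's standard ast module. It is a simplified illustration rather than the exact chainer_dataflow.py: the FORWARD_NAME/BACKWARD_NAME constants and the output format are just placeholders, and ast.unparse needs Python 3.9 or later.

import ast
import sys
from collections import defaultdict

FORWARD_NAME = "forward"    # or "forward_cpu" / "forward_gpu"
BACKWARD_NAME = "backward"  # or "backward_cpu" / "backward_gpu"

def target_names(target):
    # collect plain variable names from "x = ..." and "a, b = ..." targets;
    # attribute/subscript targets such as self.x are skipped for simplicity
    if isinstance(target, ast.Name):
        return [target.id]
    if isinstance(target, (ast.Tuple, ast.List)):
        return [n for elt in target.elts for n in target_names(elt)]
    return []

def collect_assignments(func_node):
    # map each variable name to (lineno, source) of every assignment to it
    assigns = defaultdict(list)
    for node in ast.walk(func_node):
        if isinstance(node, ast.Assign):
            src = ast.unparse(node)
            for target in node.targets:
                for name in target_names(target):
                    assigns[name].append((node.lineno, src))
    return assigns

def main(path):
    tree = ast.parse(open(path).read())
    funcs = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name in (FORWARD_NAME, BACKWARD_NAME):
            funcs[node.name] = collect_assignments(node)
    fwd = funcs.get(FORWARD_NAME, {})
    bwd = funcs.get(BACKWARD_NAME, {})
    # report variables assigned in both functions but with different assignments
    for var in sorted(set(fwd) & set(bwd)):
        if [s for _, s in fwd[var]] != [s for _, s in bwd[var]]:
            print("different data flow! (", var, ")")
            for label, assigns in (("forward:", fwd[var]), ("backward:", bwd[var])):
                print(label)
                for lineno, src in assigns:
                    print(lineno, src)
            print("-" * 50)

if __name__ == "__main__":
    main(sys.argv[1])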

Bugs It Found

Let's see how it works. Here is the code I wrote to extract the assignments and compare them. You need to set the names of the forward and backward functions by hand (l13 and l15), depending on whether they are the vanilla forward/backward, forward_cpu/backward_cpu, or forward_gpu/backward_gpu.

Clone the Chainer repository and revert it to a commit before the bug I found with this method was fixed. After that, apply my script to chainer/chainer/functions/connection/deconvolution_2d.py.

$ git clone https://github.com/chainer/chainer.git && cd chainer
$ git checkout e6a7ec62773f0df0e3e0
$ ~/src/chainer_dataflow/chainer_dataflow.py chainer/functions/connection/deconvolution_2d.py
different data flow! ( b )
forward:
111 b = inputs[2] if len(inputs) == 3 else None
137 b = cuda.cupy.ascontiguousarray(b)
backward:
228 b = inputs[2] if len(inputs) == 3 else None
--------------------------------------------------
different data flow! ( kh )
forward:
123 kh, kw = W.shape[2:]
backward:
242 _, out_channels, kh, kw = W.shape
--------------------------------------------------
different data flow! ( kw )
forward:
123 kh, kw = W.shape[2:]
backward:
242 _, out_channels, kh, kw = W.shape
--------------------------------------------------
different data flow! ( c )
forward:
125 c = W.shape[1]  # out_c
backward:
243 c, h, w = gy.shape[1:]
--------------------------------------------------
different data flow! ( algo )
forward:
160 algo = libcudnn.getConvolutionBackwardDataAlgorithm(
165 algo = cuda.cupy.cuda.cudnn.CUDNN_CONVOLUTION_BWD_DATA_ALGO_1  # NOQA
backward:
258 algo = libcudnn.getConvolutionForwardAlgorithm(
283 algo = libcudnn.getConvolutionBackwardFilterAlgorithm(
288 algo = cuda.cupy.cuda.cudnn.CUDNN_CONVOLUTION_BWD_FILTER_ALGO_1  # NOQA
--------------------------------------------------

There are many outputs, but (unfortunately) only the first one (b) is relevant here. The output shows that, in forward, b is assigned from inputs[2] in line 111 and converted into c-contiguous form in line 137. However, in backward, b is assigned in line 228 and that's it, with no conversion into c-contiguous form, which is a bug (#2666). In the same way, it can also find a similar bug such as #2582 (do not forget to set l13 and l15 of chainer_dataflow.py to forward and backward, instead of forward_gpu and backward_gpu). This bug fix is actually the one that motivated me to try this idea.
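
For reference, the direction of the fix is simply to mirror in backward the conversion that forward already does. This is only a sketch of the idea, not the patch actually merged for #2666:

# in backward, b should also be made c-contiguous before it is handed to cuDNN,
# just like forward does (line 137 above); sketch only, not the merged fix
b = inputs[2] if len(inputs) == 3 else None
if b is not None:
    b = cuda.cupy.ascontiguousarray(b)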

Here's another example:

$ git checkout e6a7ec62773f0df0 # same commit as the above
$ ~/chainer_dataflow.py chainer/functions/connection/dilated_convolution_2d.py
...
...
--------------------------------------------------
different data flow! ( x_desc )
forward:
133 x_desc = cudnn.create_tensor_descriptor(xji)
backward:
247 x_desc = cudnn.create_tensor_descriptor(x)
--------------------------------------------------

In this case x_desc is assigned tensor descriptors created from different tensors, which was actually not a critical bug but a naming inconsistency (#2665).

Limitations and Potential Extensions

Because both the idea and the script are very simple, there are of course many limitations. The aim of this post is not to be "a research paper that claims super novelty", but to share the idea with other people in the hope that they may come up with a more clever idea based on mine, which will benefit the whole community. One obvious limitation is that it yields a lot of false positives. It might be useful to define a threshold of "relevant difference level".

A possible way to extend the idea I have in mind is to compare the code between forward_cpu and forward_gpu, rather than between forward and backward. This is based on the thought that some preparation code must be shared between the cpu mode and the gpu mode. For example, #2589 fixed a missing assertion in the gpu mode code that already existed in the cpu mode code.
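
With the sketch shown earlier, trying this out only means pointing the two names at the cpu/gpu pair instead of forward and backward:

# compare the cpu and gpu variants of forward with each other,
# instead of comparing forward against backward
FORWARD_NAME = "forward_cpu"
BACKWARD_NAME = "forward_gpu"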