PenguinV: Trying to Optimize Median Filtering

Continuing from my last post, after thinking about the change from -O2 to -O3, I realized just changing the options on the performance test seems dumb. The library is mean’t to be included into other peoples projects as source files and invoked from their own software. Those people can just set their options to -O3 when compiling their software; there’s no need for them to follow the performance tests’ parameters. I will still create a pull request for the change as I have run the performance tests on an x86_64 machine and confirmed performance does not drop on that architecture. However, I want to try to try to improve the performance of the only function that has a significant drop in performance on a aarch64 system with -O3: median filtering.

For those who are unfamiliar with median filtering (such as myself), here’s the wikipedia page for the technique. The following is the source code for the function:

void Median( const Image & in, uint32_t startXIn, uint32_t startYIn, Image & out, uint32_t startXOut, uint32_t startYOut,
                 uint32_t width, uint32_t height, uint32_t kernelSize )
    {
        ParameterValidation( in, startXIn, startYIn, out, startXOut, startYOut, width, height );
        VerifyGrayScaleImage( in, out );

        if( kernelSize < 3 || kernelSize % 2 == 0 || kernelSize >= width || kernelSize >= height )
            throw imageException( "Kernel size for filter is not correct" );

        // Border's problem is well-known problem which can be solved in different ways
        // We just copy parts of original image without applying filtering
        Copy( in, startXIn, startYIn, out, startXOut,
              startYOut, width, kernelSize / 2 );
        Copy( in, startXIn, startYIn + height - kernelSize / 2, out, startXOut,
              startYOut + height - kernelSize / 2, width, kernelSize / 2 );
        Copy( in, startXIn, startYIn + kernelSize / 2, out, startXOut,
              startYOut + kernelSize / 2, kernelSize / 2, height - (kernelSize - 1) );
        Copy( in, startXIn + width - kernelSize / 2, startYIn + kernelSize / 2, out, startXOut + width - kernelSize / 2,
              startYOut + kernelSize / 2, kernelSize / 2, height - (kernelSize - 1) );

        std::vector < uint8_t > data( kernelSize * kernelSize );
        uint8_t * dataFirstValue = data.data();
        uint8_t * medianValue    = dataFirstValue + data.size() / 2;
        uint8_t * dataLastValue  = dataFirstValue + data.size();
        
        const uint32_t rowSizeIn  = in.rowSize();
        const uint32_t rowSizeOut = out.rowSize();

        const uint8_t * inY  = in.data()  + startYIn                     * rowSizeIn  + startXIn;
        uint8_t       * outY = out.data() + (startYOut + kernelSize / 2) * rowSizeOut + startXOut + kernelSize / 2;

        width  = width  - (kernelSize - 1);
        height = height - (kernelSize - 1);

        const uint8_t * outYEnd = outY + height * rowSizeOut;

        for( ; outY != outYEnd; outY += rowSizeOut, inY += rowSizeIn ) {
            const uint8_t * inX = inY;
            uint8_t       * outX = outY;

            const uint8_t * outXEnd = outX + width;

            for( ; outX != outXEnd; ++outX, ++inX ) {
                uint8_t * value = data.data();

                const uint8_t * inYRead    = inX;
                const uint8_t * inYReadEnd = inYRead + kernelSize * rowSizeIn;

                for( ; inYRead != inYReadEnd; inYRead += rowSizeIn ) {
                    const uint8_t * inXRead    = inYRead;
                    const uint8_t * inXReadEnd = inXRead + kernelSize;

                    for( ; inXRead != inXReadEnd; ++inXRead, ++value )
                        *value = *inXRead;
                }

                std::nth_element( dataFirstValue, medianValue, dataLastValue );
                (*outX) = *medianValue;
            }
        }
    }

So what am I looking at here? Well the reference to the input and output images are passed in into the function, as well as pointers to their starting pixels. The image’s dimension and the window size for the filter are also passed. The function includes some parameter validation, some copies to resolve the border issue, and the nested loops to perform the actual filtering. From a glance, that code looks solid.

Now let’s see if there are any hotspots for the function. With perf, the annotation shows me that inner block in the deepest loop is time consuming.

       │      _ZN14Image_Function6MedianERKN14PenguinV_Image13ImageTemplateIhEEjjRS2_jjjjj():                                      
       │                              *value = *inXRead;                                                                           
  5.10 │        ldrb   w9, [x1, x0]                                                                                                 
  2.49 │        strb   w9, [x2, x0]   

This makes sense since the line is nested into 4 loops. There’s not much I can do about that. Every other line was green so honestly, this function doesn’t really have any noticeable hotspots. I tried doing an objdump of the binary but its honestly a mess. I did find the portion with the “hotspot” noted earlier. It

  <_ZN14Image_Function6MedianERKN14PenguinV_Image13ImageTemplateIhEEjjRS2_jjjjj+0x6e8>  // b.none
    1e00:	3940080b 	ldrb	w11, [x0, #2]
    1e04:	91000c0a 	add	x10, x0, #0x3
    1e08:	3900092b 	strb	w11, [x9, #2]
    1e0c:	eb0a007f 	cmp	x3, x10
    1e10:	54000740 	b.eq	1ef8 <_ZN14Image_Function6MedianERKN14PenguinV_Image13ImageTemplateIhEEjjRS2_jjjjj+0x6e8>  // b.none
    1e14:	39400c0b 	ldrb	w11, [x0, #3]
    1e18:	9100100a 	add	x10, x0, #0x4
    1e1c:	39000d2b 	strb	w11, [x9, #3]
    1e20:	eb0a007f 	cmp	x3, x10
    1e24:	540006a0 	b.eq	1ef8 <_ZN14Image_Function6MedianERKN14PenguinV_Image13ImageTemplateIhEEjjRS2_jjjjj+0x6e8>  // b.none
    1e28:	3940100b 	ldrb	w11, [x0, #4]
    1e2c:	9100140a 	add	x10, x0, #0x5
    1e30:	3900112b 	strb	w11, [x9, #4]
    1e34:	eb0a007f 	cmp	x3, x10
    1e38:	54000600 	b.eq	1ef8 <_ZN14Image_Function6MedianERKN14PenguinV_Image13ImageTemplateIhEEjjRS2_jjjjj+0x6e8>  // b.none
    1e3c:	3940140b 	ldrb	w11, [x0, #5]
    1e40:	9100180a 	add	x10, x0, #0x6
    1e44:	3900152b 	strb	w11, [x9, #5]
    1e48:	eb0a007f 	cmp	x3, x10

Looks like the inner most loop was unrolled: the lines repeat about 20 times before continuing to something else. I guess there’s not much to gain from focusing on this hotspot.

Next time, I will try comparing the objdump from the -O2 compilation, to see where the difference lies that cause performance difference.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Create your website at WordPress.com
Get started
%d bloggers like this: