penguinV (S1): Benchmark and Plan

Last time I looked into Google’s HighwayHash as my target repo but failed to get the benchmark working. This time, moving away from hashes, I looked for image processing libraries, specifically a small to medium-sized project that would be simple to follow. Many GitHub searches later, I found a project that fits my criteria. penguinV is “a simple and fast C++ image processing library with focus on heterogeneous systems”. The project is easy to follow, and it looks like there are areas I can improve.

The function I am hoping to improve is BlobDetection::find, which searches for all the possible blobs in an image. The package includes a built-in benchmark that tests the BlobDetection functionality, so I will profile that program first. Compiling the code takes a while, but execution takes about a second. The input and output are as follows:

Looking into the C++ file, I tried modifying the code to use a larger bitmap image to increase the run time, but there wasn’t a noticeable increase.


Using the time command, I get the following average times after 20 runs:

real    0m0.084s
user    0m0.062s
sys     0m0.015s
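
Because a single run is this short, averaging over repeated runs matters. The same idea can be sketched inside the program with std::chrono, which measures only the code of interest and leaves out process start-up. This is a minimal sketch, not part of penguinV: averageMilliseconds is a hypothetical helper, and the work callback stands in for whatever call is being measured.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>

// Hypothetical helper (not part of penguinV): runs `work` `runs` times and
// returns the mean wall-clock time per run in milliseconds.
double averageMilliseconds(const std::function<void()>& work, std::size_t runs)
{
    if (runs == 0)
        return 0.0; // nothing to average

    using Clock = std::chrono::steady_clock; // monotonic, suited to intervals
    std::chrono::duration<double, std::milli> total(0.0);

    for (std::size_t i = 0; i < runs; ++i) {
        const Clock::time_point start = Clock::now();
        work();
        total += Clock::now() - start; // accumulate elapsed time of this run
    }
    return total.count() / static_cast<double>(runs);
}
```

Calling it with a lambda that wraps the blob detection call and a run count of 20 would mirror the 20-run average above, with the caveat that the numbers are not directly comparable to the output of time.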

Perf / Gprof

Using perf, I get the following report:

# To display the header info, please use --header/--header-only options.
# Total Lost Samples: 0
# Samples: 293  of event 'cycles:uppp'
# Event count (approx.): 60337088
# Overhead  Command          Shared Object           Symbol
# ........  ...............  ......................  ....................................................................................................
    27.52%  example_blob_de            [.] _mcount@@GLIBC_2.18
    24.42%  example_blob_de  example_blob_detection  [.] Blob_Detection::BlobDetection::find
    14.80%  example_blob_de  example_blob_detection  [.] std::vector<unsigned int, std::allocator<unsigned int> >::emplace_back<unsigned int>
     6.25%  example_blob_de            [.] __memcpy_generic
     5.19%  example_blob_de  example_blob_detection  [.] Image_Function::ConvertToGrayScale
     3.33%  example_blob_de  example_blob_detection  [.] Image_Function::Histogram
     2.87%  example_blob_de  example_blob_detection  [.] Image_Function::Threshold
     2.26%  example_blob_de            [.] _int_malloc
     1.76%  example_blob_de              [.] _dl_lookup_symbol_x
     1.32%  example_blob_de            [.] cfree@GLIBC_2.17
     1.25%  example_blob_de            [.] _int_free
     1.19%  example_blob_de            [.] malloc
     1.16%  example_blob_de              [.] do_lookup_x
     0.93%  example_blob_de              [.] _dl_relocate_object
     0.74%  example_blob_de            [.] __memmove_generic
     0.56%  example_blob_de            [.] malloc_consolidate
     0.40%  example_blob_de     [.] operator new
     0.40%  example_blob_de  example_blob_detection  [.] _mcount@plt
     0.40%  example_blob_de  example_blob_detection  [.] std::vector<unsigned int, std::allocator<unsigned int> >::_M_realloc_insert<unsigned int const&>
     0.39%  example_blob_de     [.] std::use_facet<std::codecvt<char, char, __mbstate_t> >@plt
     0.38%  example_blob_de  example_blob_detection  [.] Image_Function::SetPixel
     0.37%  example_blob_de  example_blob_detection  [.] Blob_Detection::BlobInfo::contourY
     0.36%  example_blob_de     [.] 0x000000000008d6e8
     0.36%  example_blob_de              [.] strcmp
     0.34%  example_blob_de  example_blob_detection  [.] main
     0.33%  example_blob_de  example_blob_detection  [.] Blob_Detection::BlobInfo::contourX
     0.28%  example_blob_de              [.] _dl_fixup
     0.22%  example_blob_de              [.] check_match
     0.09%  example_blob_de  [unknown]               [k] 0xffff000010096654
     0.05%  example_blob_de              [.] _dl_load_cache_lookup

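A lot of time is taken up by BlobDetection::find. Together, find (24.42%) and the vector::emplace_back calls it makes (14.80%) account for about 39% of the samples, and more than a third of that combined time is spent in emplace_back. The largest entry, _mcount, is the call-counting hook added by the gprof instrumentation, so it is profiling overhead rather than work done by the library. Annotating the Blob_Detection::BlobDetection::find line, I find that no single part of the function takes up a majority of its time. Profiling with gprof shows similar results: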

                0.01    0.01       1/1           Blob_Detection::BlobDetection::find(PenguinV_Image::ImageTemplate<unsigned char> const&, Blob_Detection::BlobParameters, unsigned char) [2]
[1]    100.0    0.01    0.01       1         Blob_Detection::BlobDetection::find(PenguinV_Image::ImageTemplate<unsigned char> const&, unsigned int, unsigned int, unsigned int, unsigned int, Blob_Detection::BlobParameters, unsigned char) [1]
                0.01    0.00  328018/328018      void std::vector<unsigned int, std::allocator<unsigned int> >::emplace_back<unsigned int>(unsigned int&&) [3]
                0.00    0.00    3022/3022        void std::vector<unsigned int, std::allocator<unsigned int> >::_M_realloc_insert<unsigned int>(__gnu_cxx::__normal_iterator<unsigned 

vector::emplace_back is called 328018 times, which is not exactly what I was hoping to see. Outside of the vector functions, there doesn’t seem to be an obvious hotspot for me to fix.
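
One general way to make many emplace_back calls cheaper is to reserve the vector’s storage before filling it, so the vector does not have to grow (the _M_realloc_insert calls in the gprof output). The sketch below only illustrates that effect on a standalone vector. countCapacityChanges is a hypothetical helper, and I have not yet checked whether find knows its sizes in advance, so this may not apply to penguinV as-is.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of penguinV): appends `count` values to a
// vector and returns how many times its capacity changed, which approximates
// the number of reallocations. With `reserveFirst` set, storage is allocated
// once up front.
std::size_t countCapacityChanges(std::size_t count, bool reserveFirst)
{
    std::vector<unsigned int> values;
    if (reserveFirst)
        values.reserve(count);

    std::size_t changes = 0;
    std::size_t lastCapacity = values.capacity();

    for (std::size_t i = 0; i < count; ++i) {
        values.emplace_back(static_cast<unsigned int>(i));
        if (values.capacity() != lastCapacity) { // the vector had to grow
            ++changes;
            lastCapacity = values.capacity();
        }
    }
    return changes;
}
```

Comparing countCapacityChanges(328018, false) with countCapacityChanges(328018, true) shows the difference: with the reservation the capacity never changes during the loop, while without it the vector grows several times (the exact count depends on the standard library’s growth policy).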

Plan of Approach

I will have to do a deeper inspection of the function to see if there are any optimizations I can incorporate. However, one thing I would like to try is experimenting with the g++ optimization flags. The following are the default flags:

-std=c++11 -Wall -Wextra -Wstrict-aliasing -Wpedantic -Wconversion -O2 -march=native

Since it’s using -O2, there might be some additional flags that will further increase performance.
