Lab 5: SIMD

For today’s lab, I will be looking at the impacts that “Single Instruction, Multiple Data” (SIMD) has on software performance. SIMD refers to a set of instructions which perform the same operation on several separate pieces of data in parallel. This includes the related instructions to set up data for SIMD processing, and to summarize results. The lab experiments with the following methods of implementation on an Aarch64 system:

Auto-Vectorization

I will be applying auto-vectorization to the following base code.

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include "vol.h"
 
// Function to scale a sound sample using a volume_factor
// in the range of 0.00 to 1.00.
static inline int16_t scale_sample(int16_t sample, float volume_factor) {
        return (int16_t) (volume_factor * (float) sample);
}
 
int main() {
 
        // Allocate memory for large in and out arrays
        int16_t*        data;
        data = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
 
        int             x;
        int             ttl = 0;
 
        // Seed the pseudo-random number generator
        srand(1);
 
        // Fill the array with random data
        for (x = 0; x < SAMPLES; x++) {
                data[x] = (rand()%65536)-32768;
        }
 
        // ######################################
        // This is the interesting part!
        // Scale the volume of all of the samples
        for (x = 0; x < SAMPLES; x++) {
                data[x] = scale_sample(data[x], 0.75);
        }
        // ######################################
 
        // Sum up the data
        for (x = 0; x < SAMPLES; x++) {
                ttl = (ttl+data[x])%1000;
        }
 
        // Print the sum
        printf("Result: %d\n", ttl);
 
        return 0;
 
}

To display information about vectorization, I compile with the compiler option -fopt-info-vec-all:

gcc -g -O3 -fopt-info-vec-all vol1.c -o vol1

This gives me a long log file. Each loop has their own section in the log similar to the following:

Analyzing loop at vol1.c:38
vol1.c:38:2: note: ===== analyze_loop_nest =====
vol1.c:38:2: note: === vect_analyze_loop_form ===
vol1.c:38:2: note: === get_loop_niters ===
vol1.c:38:2: note: === vect_analyze_data_refs ===
vol1.c:38:2: note: got vectype for stmt: _17 = *_16;
vector(8) short int
vol1.c:38:2: note: === vect_analyze_scalar_cycles ===
vol1.c:38:2: note: Analyze phi: x_53 = PHI <0(7), x_37(9)>
vol1.c:38:2: note: Access function of PHI: {0, +, 1}_3
vol1.c:38:2: note: step: 1,  init: 0
vol1.c:38:2: note: Detected induction.
vol1.c:38:2: note: Analyze phi: ttl_54 = PHI <0(7), ttl_36(9)>
vol1.c:38:2: note: Access function of PHI: ttl_54
vol1.c:38:2: note: Analyze phi: ivtmp_77 = PHI <100000000(7), ivtmp_76(9)>
vol1.c:38:2: note: Access function of PHI: {100000000, +, 4294967295}_3
vol1.c:38:2: note: step: 4294967295,  init: 100000000
vol1.c:38:2: note: Detected induction.
vol1.c:38:2: note: Analyze phi: ttl_54 = PHI <0(7), ttl_36(9)>
vol1.c:38:2: note: reduction: not commutative/associative: ttl_36 = _19 % 1000;
vol1.c:38:2: note: Unknown def-use cycle pattern.
vol1.c:38:2: note: === vect_pattern_recog ===
vol1.c:38:2: note: vect_is_simple_use: operand _14
vol1.c:38:2: note: def_stmt: _14 = (long unsigned int) x_53;
vol1.c:38:2: note: type of def: internal
vol1.c:38:2: note: vect_is_simple_use: operand x_53
vol1.c:38:2: note: def_stmt: x_53 = PHI <0(7), x_37(9)>
vol1.c:38:2: note: type of def: induction
vol1.c:38:2: note: vect_is_simple_use: operand 2
vol1.c:38:2: note: vect_recog_mult_pattern: detected:
vol1.c:38:2: note: patt_63 = _14 << 1;
vol1.c:38:2: note: mult pattern recognized: patt_63 = _14 << 1;
vol1.c:38:2: note: vect_recog_divmod_pattern: detected: patt_47 = _19 - patt_48;
vol1.c:38:2: note: divmod pattern recognized: patt_47 = _19 - patt_48;
vol1.c:38:2: note: === vect_analyze_data_ref_accesses ===
vol1.c:38:2: note: === vect_mark_stmts_to_be_vectorized ===
vol1.c:38:2: note: init: phi relevant? x_53 = PHI <0(7), x_37(9)>
vol1.c:38:2: note: init: phi relevant? ttl_54 = PHI <0(7), ttl_36(9)>
vol1.c:38:2: note: init: phi relevant? ivtmp_77 = PHI <100000000(7), ivtmp_76(9)>
vol1.c:38:2: note: init: stmt relevant? # DEBUG ttl => ttl_54
vol1.c:38:2: note: init: stmt relevant? # DEBUG x => x_53
vol1.c:38:2: note: init: stmt relevant? # DEBUG BEGIN_STMT
vol1.c:38:2: note: init: stmt relevant? _14 = (long unsigned int) x_53;
vol1.c:38:2: note: init: stmt relevant? _15 = _14 * 2;
vol1.c:38:2: note: init: stmt relevant? _16 = data_28 + _15;
vol1.c:38:2: note: init: stmt relevant? _17 = *_16;
vol1.c:38:2: note: init: stmt relevant? _18 = (int) _17;
vol1.c:38:2: note: init: stmt relevant? _19 = _18 + ttl_54;
vol1.c:38:2: note: init: stmt relevant? ttl_36 = _19 % 1000;
vol1.c:38:2: note: vec_stmt_relevant_p: used out of loop.
vol1.c:38:2: note: vect_is_simple_use: operand _19
vol1.c:38:2: note: def_stmt: _19 = _18 + ttl_54;
vol1.c:38:2: note: type of def: internal
vol1.c:38:2: note: vec_stmt_relevant_p: stmt live but not relevant.
vol1.c:38:2: note: mark relevant 1, live 1: ttl_36 = _19 % 1000;
vol1.c:38:2: note: last stmt in pattern. don't mark relevant/live.
vol1.c:38:2: note: init: stmt relevant? # DEBUG ttl => ttl_36
vol1.c:38:2: note: init: stmt relevant? x_37 = x_53 + 1;
vol1.c:38:2: note: init: stmt relevant? # DEBUG x => x_37
vol1.c:38:2: note: init: stmt relevant? # DEBUG ttl => ttl_36
vol1.c:38:2: note: init: stmt relevant? # DEBUG x => x_37
vol1.c:38:2: note: init: stmt relevant? ivtmp_76 = ivtmp_77 - 1;
vol1.c:38:2: note: init: stmt relevant? if (ivtmp_76 != 0)
vol1.c:38:2: note: worklist: examine stmt: patt_47 = _19 - patt_48;
vol1.c:38:2: note: vect_is_simple_use: operand _19
vol1.c:38:2: note: def_stmt: _19 = _18 + ttl_54;
vol1.c:38:2: note: type of def: internal
vol1.c:38:2: note: mark relevant 1, live 0: _19 = _18 + ttl_54;
vol1.c:38:2: note: vect_is_simple_use: operand patt_48
vol1.c:38:2: note: def_stmt: patt_48 = patt_49 * 1000;
vol1.c:38:2: note: type of def: internal
vol1.c:38:2: note: mark relevant 1, live 0: patt_48 = patt_49 * 1000;
vol1.c:38:2: note: worklist: examine stmt: patt_48 = patt_49 * 1000;
vol1.c:38:2: note: vect_is_simple_use: operand patt_49
vol1.c:38:2: note: def_stmt: patt_49 = patt_57 - patt_50;
vol1.c:38:2: note: type of def: internal
vol1.c:38:2: note: mark relevant 1, live 0: patt_49 = patt_57 - patt_50;
vol1.c:38:2: note: worklist: examine stmt: patt_49 = patt_57 - patt_50;
vol1.c:38:2: note: vect_is_simple_use: operand patt_57
vol1.c:38:2: note: def_stmt: patt_57 = patt_58 >> 6;
vol1.c:38:2: note: type of def: internal
vol1.c:38:2: note: mark relevant 1, live 0: patt_57 = patt_58 >> 6;
vol1.c:38:2: note: vect_is_simple_use: operand patt_50
vol1.c:38:2: note: def_stmt: patt_50 = _19 >> 31;
vol1.c:38:2: note: type of def: internal
vol1.c:38:2: note: mark relevant 1, live 0: patt_50 = _19 >> 31;
vol1.c:38:2: note: worklist: examine stmt: patt_50 = _19 >> 31;
vol1.c:38:2: note: vect_is_simple_use: operand _19
vol1.c:38:2: note: def_stmt: _19 = _18 + ttl_54;
vol1.c:38:2: note: type of def: internal
vol1.c:38:2: note: mark relevant 1, live 0: _19 = _18 + ttl_54;
vol1.c:38:2: note: already marked relevant/live.
vol1.c:38:2: note: worklist: examine stmt: patt_57 = patt_58 >> 6;
vol1.c:38:2: note: vect_is_simple_use: operand patt_58
vol1.c:38:2: note: def_stmt: patt_58 = _19 h* 274877907;
vol1.c:38:2: note: type of def: internal
vol1.c:38:2: note: mark relevant 1, live 0: patt_58 = _19 h* 274877907;
vol1.c:38:2: note: worklist: examine stmt: patt_58 = _19 h* 274877907;
vol1.c:38:2: note: vect_is_simple_use: operand _19
vol1.c:38:2: note: def_stmt: _19 = _18 + ttl_54;
vol1.c:38:2: note: type of def: internal
vol1.c:38:2: note: mark relevant 1, live 0: _19 = _18 + ttl_54;
vol1.c:38:2: note: already marked relevant/live.
vol1.c:38:2: note: worklist: examine stmt: _19 = _18 + ttl_54;
vol1.c:38:2: note: vect_is_simple_use: operand _18
vol1.c:38:2: note: def_stmt: _18 = (int) _17;
vol1.c:38:2: note: type of def: internal
vol1.c:38:2: note: mark relevant 1, live 0: _18 = (int) _17;
vol1.c:38:2: note: vect_is_simple_use: operand ttl_54
vol1.c:38:2: note: def_stmt: ttl_54 = PHI <0(7), ttl_36(9)>
vol1.c:38:2: note: type of def: unknown
vol1.c:38:2: note: Unsupported pattern.
vol1.c:38:2: note: not vectorized: unsupported use in stmt.
vol1.c:38:2: note: unexpected pattern.

The beginning of the log tells you which loop the compiler is analyzing and the end will state whether or not the loop has been vectorized. When the compiler fails to vectorize the loop, it also provides the reason. I won’t post the whole log since it is actually very long, but it shows that only the second loop was vectorized:

Analyzing loop at vol1.c:32
vol1.c:32:2: note: ===== analyze_loop_nest =====
...
...
...
vol1.c:32:2: note: LOOP VECTORIZED

For more information about auto-vectorization, refer to this GCC documentation.

Inline Assembler

Next up is inline assembler, which involves embedding assembly code into another language such as C. This method can help improve performance but will make the code less portable as the code will now be bound to a system architecture. Here’s the base code I have to work with along with some questions:

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include "vol.h"

int main() {

        int16_t*                data;           // input array
        int16_t*                limit;          // end of input array

        // these variables will be used in our assembler code, so we're going
        // to hand-allocate which register they are placed in
        // Q: what is an alternate approach?
        register int16_t*       cursor          asm("r20");     // input cursor
        register int16_t        vol_int         asm("r22");     // volume as int16_t

        int                     x;              // array interator
        int                     ttl =0 ;        // array total

        data=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

        srand(-1);
        printf("Generating sample data.\n");
        for (x = 0; x < SAMPLES; x++) {
                data[x] = (rand()%65536)-32768;
        }

// --------------------------------------------------------------------

        cursor = data;
        limit = data+ SAMPLES ;

        // set vol_int to fixed-point representation of 0.75
        // Q: should we use 32767 or 32768 in next line? why?
        vol_int = (int16_t) (0.75 * 32767.0);

        printf("Scaling samples.\n");

        // Q: what does it mean to "duplicate" values in the next line?
        __asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h

        while ( cursor < limit ) {
                __asm__ (
                        "ldr q0, [%[cursor]], #0        \n\t"
                        // load eight samples into q0 (v0.8h)
                        // from in_cursor

                        "sqdmulh v0.8h, v0.8h, v1.8h    \n\t"
                        // multiply each lane in v0 by v1*2
                        // saturate results
                        // store upper 16 bits of results into
                        // the corresponding lane in v0

                        // Q: Why is #16 included in the str line
                        // but not in the ldr line?

                        "str q0, [%[cursor]],#16                \n\t"
                        // store eight samples to [cursor]
                        // post-increment cursor by 16 bytes
                        // and store back into the pointer register

                        // Q: What do these next three lines do?
                        : [cursor]"+r"(cursor)
                        : "r"(cursor)
                        : "memory"
                        );
        }

// --------------------------------------------------------------------

        printf("Summing samples.\n");
        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+data[x])%1000;
        }

        // Q: are the results usable? are they correct?
        printf("Result: %d\n", ttl);

        return 0;

}
What is an alternate approach to?
// these variables will be used in our assembler code, so we're going
// to hand-allocate which register they are placed in
// Q: what is an alternate approach?
register int16_t* cursor asm(“r20”); // input cursor
register int16_t vol_int asm(“r22”); // volume as int16_t 

The compiler is capable of registers, so I can declare those variables in C syntax rather than hand-allocating them.

Should we use 32767 or 32768 in next line? why?
// set vol_int to fixed-point representation of 0.75
// Q: should we use 32767 or 32768 in next line? why?
vol_int = (int16_t) (0.75 * 32767.0);

If 0.75 remains a constant float, setting the number as 32768.0 would only change the value of vol_int. The change won’t cause any unexpected issues because it would multiple two floats before casting it to int16_t. However, the variable name hints that 0.75 is place holder for the volume control which should be a inclusive float between 0 and 1. Therefore, using 32768.0 in that will cause problems as the upper bound of a int16_t value is 32767.

What does it mean to “duplicate” values in the next line?
// Q: what does it mean to "duplicate" values in the next line?
__asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h

It is copying the value of vol_int into the 8 lanes in register 1.

Why is #16 included in the str line but not in the ldr line?
"ldr q0, [%[cursor]], #0        \n\t"
// load eight samples into q0 (v0.8h)
// from in_cursor
 
"sqdmulh v0.8h, v0.8h, v1.8h    \n\t"
// multiply each lane in v0 by v1*2
// saturate results
// store upper 16 bits of results into
// the corresponding lane in v0
 
// Q: Why is #16 included in the str line
// but not in the ldr line?
 
"str q0, [%[cursor]],#16                \n\t"

If the cursor is incremented at ldr, it may write into the wrong lane. #16 is included in str to offset the cursor so that the next ldr is already pointing to the correct lane before executing.

What do these next 3 lines do?
// Q: What do these next three lines do?
: [cursor]"+r"(cursor)
: "r"(cursor)
: "memory"

: [cursor]”+r”(cursor) will be the output value. “+r” indicates that it will be a read/write register.

: “r”(cursor) input value from cursor.

: “memory” is a clobber. It tells the compiler to mistrust values that were loaded from memory before the assembly code was executed.

Are the results usable? Are they correct?

The output for vol1.c is 94 while the vol_inline.c produces 930. Therefore, I do not believe its correct.

Intrinsics

Here’s the base code

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <arm_neon.h>
#include "vol.h"

int main() {

        int16_t*                data;           // data array
        int16_t*                limit;          // end of input array

        register int16_t*       cursor          asm("r20");     // array cursor (pointer)
        register int16_t        vol_int         asm("r22");     // volume as int16_t

        int                     x;              // array interator
        int                     ttl =  0;       // array total

        data=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

        srand(-1);
        printf("Generating sample data.\n");
        for (x = 0; x < SAMPLES; x++) {
                data[x] = (rand()%65536)-32768;
        }

// --------------------------------------------------------------------

        cursor = data; // Pointer to start of array
        limit = data + SAMPLES ;

        vol_int = (int16_t) (0.75 * 32767.0);

        printf("Scaling samples.\n");

        while ( cursor < limit ) {
                // Q: What do these intrinsic functions do?
                // (See gcc intrinsics documentation)
                vst1q_s16(cursor, vqdmulhq_s16(vld1q_s16(cursor), vdupq_n_s16(vol_int)));

                // Q: Why is the increment below 8 instead of 16 or some other value?
                // Q: Why is this line not needed in the inline assembler version
                // of this program?
                cursor += 8;
        }

// --------------------------------------------------------------------

        printf("Summing samples.\n");
        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+data[x])%1000;
        }

        // Q: Are the results usable? Are they accurate?
        printf("Result: %d\n", ttl);

        return 0;

}
What do these intrinsic functions do?
// Q: What do these intrinsic functions do?
// (See gcc intrinsics documentation)
 
vst1q_s16(cursor, vqdmulhq_s16(vld1q_s16(cursor), vdupq_n_s16(vol_int)));

vst1q_s16 stores all lanes or a single lane of a vector. vqdmulhq_s16 multiplies two vector lanes. vld1q_s16 loads and stores a single vector of some type. vdupq_n_s16 sets all lanes to the same value.

Why is the increment below 8 instead of 16 or some other value? Why is this line not needed in the inline assembler version of this program?
// Q: Why is the increment below 8 instead of 16 or some other value?
// Q: Why is this line not needed in the inline assembler version
// of this program?
cursor += 8;

int16_t only has 8 vector lanes and we want the cursor to point to the beginning next int16_t location. Any other number might misplace the cursor and cause misalignment with subsequent instructions. The line is not needed in inline assembler because assembler code already handles the storage of value.

Are the results usable? Are they accurate?

Again, the result is 930 so it is not accurate.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Create your website at WordPress.com
Get started
%d bloggers like this: