Solve the prefix sum problem with SIMDs and Vector APIs

Java, SIMD operations and Vectors

The speed and efficiency of traditionally developed software applications is limited by the fact that CPUs only process one instruction at a time.

Intelligent approaches to algorithm design can be used to improve the efficiency and time complexity of a given problem problem, but in the end, the single-step reality of a CPU is always a limiting factor.

However, modern CPUs capable of single operation, multiple data (SIMD) operations overcome the single-step limitation by allowing more than one piece of data to be manipulated during a single CPU clock-cycle.

While SIMD functions are typically limited to data spaces that are 128, 256 or 512 bits in size, their repeated use can have a significant impact in the time required to run a given algorithm.

Linear vs logarithmic time complexity

To allow Java developers to take advantage of the SIMD space, the Vector API is available in the Java 21 JDK as a preview feature. In this tutorial, we will look at how to solve the prefix sum puzzle in Java with SIMD operations and reduce linear big O complexity down to one that is logarithmic.

A previous post described how to solve the prefix sum problem in Java using a traditional approach that looped through one array and updated values in a second.

That straightforward approach produced the following result:

Original Values  10  20  12  28  10  19  101  799
Prefix Sums  10  30  42 70 80 99 200 999

The problem with this brute-force approach to the prefix sum problem is that it is highly inefficient. Discrete logic must be performed on every element of the array. As arrays become larger, the inefficiency of this approach becomes increasingly more noticeable.

Java SIMD processing with Vectors

However, modern CPUs come with the ability to perform a single logical instruction on multiple elements on a data set at once. The capability is known as single instruction, multiple data, or SIMD.

The ability to perform SIMD operations in Java is enabled through the incubating Vector API, which is a preview feature in JDK 21.

The following code snippet shows a Java Vector formed from an array.

The content of the vector is printed to the console, after which the result of the vector being added to itself is printed to the console.

int numbers[] = { 10, 20, 12, 28, 10, 19, 101, 799 }; 
var vector = IntVector.fromArray(IntVector.SPECIES_PREFERRED, numbers, 0);

System.out.println( vector );
System.out.println( vector.add(vector) );

The output is as follows:

[10, 20, 12, 28, 10, 19, 101, 799]
[20, 40, 24, 56, 20, 38, 202, 1598]

The significance of the vector addition can’t be understated.

In our example above, rather than requiring eight operations if we looped through the array one element at a time, the vector addition happened in a single clock cycle. The logic applies linearly as the size of the data set grows, up to the point that the underlying SIMD architecture can handle the data set’s size.

For the examples given here, the array size is always a power of two for the sake of simplicity.

Prefix sums, SIMDs and the Java Vector API

Given the high performance and efficiency of SIMD operations, developers will increasingly look for ways to solve data-driven problems using this approach. The prefix sum problem demonstrates is a perfect example of this.

To solve the prefix sum problem with our current data set, we will perform multiple Vector add functions.

To achieve the prefix sum, we add the output of each Vector addition to itself. The result shifts by a factor of two on each new summation to take into account previously summed fields.

Here’s a visual representation of the calculations the Java Vector API must perform:

  [10, 20, 12, 28, 10, 19, 101, 799] 
+ [ 0, 10, 20, 12, 28, 10,  19, 101] 
  [10, 30, 32, 40, 38, 29, 120, 900]

  [10, 30, 32, 40, 38, 29, 120, 900]
+ [ 0,  0, 10, 30, 32, 40,  38,  29]
  [10, 30, 42, 70, 70, 69, 158, 929]

  [10, 30, 42, 70, 70, 69, 158, 929]
+ [ 0,  0,  0,  0, 10, 30,  42,  70]
  [10, 30, 42, 70, 80, 99, 200, 999]

To shift data elements in a set to the right, the process uses the Java Vector API’s unslice method. (The Java Vector API’s slice method shifts data fields to the left.)

The code to solve the Java prefix sum problem with the Vector API is as follows:

int array[] = { 10, 20, 12, 28, 10, 19, 101, 799 }; 
var vector = IntVector.fromArray(IntVector.SPECIES_PREFERRED, array, 0);
System.out.println(vector);

vector = vector.add(vector.unslice(1));
vector = vector.add(vector.unslice(2));
vector = vector.add(vector.unslice(4));

System.out.println(vector);

The output of the Java prefix sum solution is as follows:

[10, 20, 12, 28, 10, 19, 101, 799]

[10, 30, 42, 70, 80, 99, 200, 999]

Loops vs Java’s Vector API

With this eight-element array, we end up with only three addition operations on the Vector to calculate the prefix sum. That’s a significant improvement on the eight operations in the brute-force approach with a loop.

Furthermore, if this array was 16 elements in size, the difference would be even more significant, as it would reduce 16 would be reduced down to four.

Note how given the limited number of lanes available to a Vector, a ShortVector is required to deal with 16 digits.

short array [] = { 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
var vector = ShortVector. fromArray (ShortVector. SPECIES_PREFERRED , array , 0);
System. out .println( vector );
vector = vector .add( vector .unslice(1));
vector = vector .add( vector .unslice(2));
vector = vector .add( vector .unslice(4));
vector = vector .add( vector .unslice(8));
System. out .println( vector );

When this prefix sum solution runs, the output is:

[1, 2, 3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13,  14,  15,  16]
[1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66, 78, 91, 105, 120, 136]

Modern applications based on AI, ML and blockchain commonly require single instructions applied to multiple data elements. The addition of the incubating Java Vector API to the JDK to help support highly efficient SIMD operations is a welcome addition to the Java ecosystem.