3 min read

Speeding Up Linear Search on M1 and M2 Macs

The problem: given a sequence of random numbers, find the first occurrence of a known value, if it exists.

Setup

#include <limits>
#include <random>
#include <vector>

const size_t NUM_VALS = 400000; // significance of multiple of 4 will be explained later
std::vector<int> vals(NUM_VALS);

std::random_device dev;
std::mt19937 rng(dev());
std::uniform_int_distribution<> distrib(std::numeric_limits<int>::min(), std::numeric_limits<int>::max());

int val_to_find;
for (size_t i = 0; i < vals.size(); i++) {
	vals[i] = distrib(rng);
	if (i == NUM_VALS / 2) {
		val_to_find = vals[i];
	}
}

Naive Solution

In C++ (my preferred programming language) an idiomatic solution might look like this:

auto it = std::find(vals.begin(), vals.end(), val_to_find); // needs <algorithm>
assert(it != vals.end() && *it == val_to_find);             // needs <cassert>

And we'd assume we can't go much faster than this in practice. Sure, using iterators might have some overhead that a pure C solution doesn't, but we'd expect that to make only a small difference, maybe a few percent, right? That's what I thought too, but it's actually possible to make std::find 2-3x faster, at least in the simple case given here where we're searching for an int.

SIMD Solution

The trick is to use NEON instructions. NEON is ARM's SIMD instruction set, and there are a few ways to use it. First, you can hope the compiler is smart enough to figure out that the code you wrote can be vectorized; in the case given above the compiler was not smart enough to do this at -O3, though maybe it needed some extra flags to do so. Second, you can write pure assembly. This is tricky since you have to handle register allocation yourself, and the resulting mix of assembly and C++ is ugly to look at. Lastly, you can write your code using NEON intrinsics: these look like normal function calls, but they tell the compiler to emit specific NEON instructions. This is the approach I took.

NEON operates on 64-bit or 128-bit vectors. In our case we will only work with 128-bit vectors. Since an int is 32 bits, this means we can process 128 / 32 = 4 ints at a time. The basic idea of what we're trying to do is as follows:

idx = 0
while (idx < length):
	grab 4 elements starting at idx
	compare these elements to the value we're searching for
	if value found:
		return index of first occurrence of value
	else:
		idx += 4

To keep things simple I've limited the problem to the case where the length of the vector we're searching is a multiple of 4. If it's not, it's possible to do some pre- or post-processing of the remaining elements, but this complicates things, so I will skip over it for the sake of keeping this discussion as simple as possible. The implementation of the algorithm described above is fairly straightforward, as the following code and comments demonstrate:

#include <arm_neon.h>
#include <cassert>

template <typename InputIterator>
InputIterator find(InputIterator first, InputIterator last, const int& val) {
	// avoid having to handle edge cases
	assert(std::distance(first, last) % 4 == 0);
	
	// load 4 copies of val into the 4 lanes of a 32bit x 4 vector
	int32x4_t val_to_find = vdupq_n_s32(val);
	while (first != last) {
		// load next 4 values
		int32x4_t vals_to_comp = vld1q_s32(&*first);

		// compare loaded vals with expected value, set all bits in lane to 1 if the two are equal
		uint32x4_t eq_v = vceqq_s32(val_to_find, vals_to_comp);
	
		// if any lane is greater than 0, the value in that lane is the value we're searching for 
		if (vdups_laneq_u32(eq_v, 0) > 0) {
			return first;
		} else if (vdups_laneq_u32(eq_v, 1) > 0) {
			return first + 1;
		} else if (vdups_laneq_u32(eq_v, 2) > 0) {
			return first + 2;
		} else if (vdups_laneq_u32(eq_v, 3) > 0) {
			return first + 3;
		} else {
			first += 4;
		}
	}
	
	return last;
}

The only somewhat tricky part is the if/else chain at the end, which selects the first matching lane when there are one or more matches. It's possible there is a slicker way to do this, but I'm not aware of one. In theory this code should result in a 4x speedup, but in practice I observed a speedup between 2-3x. I assume the difference is because the NEON version has more instructions in the loop, so each iteration takes more time, but more work gets done per iteration.