Modular: Understanding SIMD: Infinite Complexity of Trivial Problems

Brief Summary

This article explores the complexities of Single Instruction, Multiple Data (SIMD) programming and how it can be used to optimize performance for trivial problems like cosine similarity. The author, Ash Vardanian, highlights the challenges of working with SIMD, including unreliable auto-vectorization, complex instructions, unpredictable performance, and difficult debugging. He then demonstrates how to implement cosine similarity using SIMD intrinsics for different CPU targets, including Intel Haswell, AMD Genoa, and Arm CPUs with the NEON and SVE extensions. The article concludes by discussing the importance of CPU-specific code and the challenges of packaging and distributing SIMD libraries.

  • SIMD programming can significantly improve performance for trivial problems like cosine similarity.
  • SIMD programming presents challenges like unreliable auto-vectorization, complex instructions, unpredictable performance, and difficult debugging.
  • CPU-specific code is crucial for achieving optimal performance with SIMD.

Introduction to Cosine Similarity

Cosine similarity is a widely used similarity measure in various fields, including robotics, climate modeling, and Retrieval Augmented Generation (RAG). It quantifies the similarity between two vectors as the cosine of the angle between them. The article provides a simplified example of how RAG uses cosine similarity to determine the semantic similarity of user prompts to entries in an existing knowledge base.

Baseline Implementation for Intel Haswell

This chapter focuses on implementing cosine similarity using SIMD intrinsics for Intel Haswell and newer CPUs, which support the AVX2 instruction set. The author provides a detailed breakdown of the code, explaining the use of _mm_loadu_si128 for loading data, _simsimd_bf16x8_to_f32x8_haswell for converting bfloat16 to float32, and _mm256_fmadd_ps for accumulating the dot product. The chapter also discusses the importance of horizontal accumulation and how to avoid it within the loop.

Reducing Horizontal Accumulation

This chapter delves into the challenges of horizontal accumulation in SIMD programming. The author explains that it is a slow operation and should be avoided within the loop. He provides an example of _simsimd_reduce_f32x8_haswell, which performs horizontal accumulation outside the loop using intra-register shuffling.

Reciprocal Square Root Approximation in AVX2

This chapter focuses on the use of the _mm_rsqrt_ps instruction for calculating the reciprocal square root in AVX2. The author highlights the potential errors introduced by this instruction and discusses the use of Newton-Raphson iteration to improve accuracy.

Partial Loads

This chapter addresses the issue of partial loads in SIMD programming, which arises when dealing with variable-length data. The author explains that almost every SIMD kernel requires a main loop and a tail loop to handle the last few elements. He provides an example of how to handle partial loads manually by adding an if-else statement within the loop.

Baseline Implementation for AMD Genoa

This chapter explores the implementation of cosine similarity using SIMD intrinsics for AMD Genoa CPUs, which support the AVX-512 instruction set. The author highlights the use of the vdpbf16ps instruction for dot products and the use of masking to partially load data.

Masking in x86 AVX-512

This chapter explains the concept of masking in AVX-512, which allows for partial execution of instructions based on a mask. The author discusses the use of KMASK registers and provides examples of mask manipulation instructions.

Dot Products with vdpbf16ps

This chapter focuses on the use of the _mm512_dpbf16_ps function for performing dot products in AVX-512. The author explains how this single instruction multiplies 32 pairs of bfloat16 inputs and accumulates the products into the 16 float32 lanes of the result register.

Reciprocal Square Root Approximation for AVX-512

This chapter discusses the use of the _mm512_rsqrt14_ps intrinsic for calculating the reciprocal square root in AVX-512. The author highlights its improved accuracy: a maximum relative error of 2^-14, compared to roughly 1.5 * 2^-12 for _mm_rsqrt_ps.

Baseline Implementation for Arm NEON

This chapter explores the implementation of cosine similarity using SIMD intrinsics for Arm NEON, whose fixed 128-bit registers make it comparable to SSE on x86. The author explains the use of the vbfmlaltq_f32 and vbfmlalbq_f32 intrinsics for computing the dot product and discusses the challenges of working with the Arm instruction set.

CISC vs RISC, It's Not What People Say

This chapter challenges the common perception of Arm as a RISC architecture and x86 as a CISC architecture. The author argues that Arm can be more complex than x86, especially when dealing with SIMD instructions. He provides examples of complex instructions in Arm NEON and discusses the importance of CPU-specific code.

Reverse Engineering Arm Accuracy

This chapter highlights the lack of documentation for the rsqrt instruction in Arm NEON, particularly regarding its latency and maximum relative error. The author explains how he reverse-engineered the accuracy of the instruction using a benchmark program.

Baseline Implementation for Arm SVE

This chapter introduces Arm SVE, which features variable-width implementation-defined registers. The author explains the concept of variable-width registers and how they can be used to implement SIMD kernels. He provides an example of a cosine similarity kernel for Arm SVE.

Masking in Arm SVE

This chapter discusses the use of progress masks in Arm SVE for handling partial loads. The author explains how progress masks can be used to build a more powerful alternative to _bzhi_u32, which was used for AVX-512 tails.

Adding Up the Performance Gains

This chapter summarizes the performance gains achieved by using architecture-specific SIMD optimizations for cosine similarity. The author highlights the significant improvements in performance compared to naive serial code and stock NumPy implementations.

Packaging and Distributing Libraries

This chapter discusses the challenges of packaging and distributing CPU-specific SIMD libraries. The author explains two approaches: compile-time selection and runtime dispatch. He provides examples of how to implement both approaches using different techniques.

Conclusion

This chapter concludes the article by summarizing the complexities of SIMD programming and the importance of CPU-specific code. The author emphasizes the significant performance gains that can be achieved by harnessing SIMD correctly. He also invites readers to join the Modular Discord server for further discussion and to request a Part 2 of the series.


© 2024 BriefRead