This project focuses on optimizing matrix multiplication by leveraging L1 data cache characteristics, demonstrating how different cache-aware strategies can significantly improve performance.
Naively implemented, matrix multiplication becomes memory-bound and suffers severely from cache misses. This project specifically targets L1 data cache optimization techniques to improve performance:
- Naive implementation: Standard triply-nested loops with poor cache utilization
- Transposed implementation: Improves cache locality by pre-transposing the second matrix for sequential memory access patterns
- Cache-line blocking: Divides matrices into blocks precisely sized to match L1 data cache lines (typically 64 bytes = 16 floats)
- Cache-line with transposition: Combines cache-line blocking with transposed matrix for optimal memory access patterns
- SSE with cache awareness: Combines 128-bit SIMD instructions with cache-line blocking to maximize both L1 cache utilization and parallel computation
- SSE with transposition: Leverages transposed matrices with SIMD for optimal cache coherence
- AVX with cache awareness: Extends to 256-bit SIMD instructions while maintaining optimal cache-line alignment and access patterns
- AVX with transposition: Combines AVX operations with transposed matrix for maximum performance
Each implementation builds upon the previous one to demonstrate the cumulative effect of different cache optimization techniques.
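For reference, the naive baseline is essentially the textbook triple loop. A minimal sketch (the signature is illustrative and may differ from the project's actual `matrix_mul`):

```c
#include <stddef.h>

// Naive baseline: for each output element, walk a row of A and a
// column of B. The column walk strides through B by p floats per
// step, so almost every B access misses the L1 cache.
void matrix_mul_naive(const float *a, const float *b, float *c,
                      size_t m, size_t n, size_t p)
{
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < p; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * p + j];
            c[i * p + j] = sum;
        }
}
```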
- L1 data cache-focused optimizations
- Cache-line aligned memory allocation using `posix_memalign`
- Blocking algorithms precisely tuned to L1 cache line size
- Memory access pattern optimization to reduce cache misses
- SIMD acceleration (SSE and AVX) aligned with cache line boundaries
- Automatic detection of L1 data cache line size at compile time
- Comprehensive benchmarking to measure impact of each cache optimization
- GCC cleanup attribute for automatic resource management
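The cleanup attribute in the last point ties resource release to scope exit. A minimal sketch of the pattern (the `free_float_ptr` helper is the common idiom, not necessarily the project's exact helper):

```c
#include <stdlib.h>

// Invoked automatically when a __attribute__((cleanup)) variable
// leaves scope; receives a pointer to the variable itself.
static void free_float_ptr(float **p)
{
    free(*p);
}

void example(size_t n)
{
    // buf is freed on every exit path from this scope, including
    // early returns, with no explicit free() calls.
    __attribute__((cleanup(free_float_ptr))) float *buf =
        malloc(n * sizeof(float));
    if (!buf)
        return;
    /* ... use buf ... */
}
```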
- GCC compiler with GNU11 support
- POSIX-compliant system
- CPU with SSE and AVX instruction set support
- Python 3 interpreter for visualizations
`./mm -o [output_csv_file]`
- If you provide an output filename, benchmark results will be written to that CSV file
- If you don't provide a filename, results will only be displayed on the console
Before running the Python script, make sure to install its dependencies:

`pip install matplotlib pandas seaborn numpy`

Then run the visualization script:

`python3 visualize.py -i [input_csv_file] -o [output_basename]`
- If you provide an input CSV filename, the script will read data from that file
- Output files will be named based on the input filename (e.g., `mm-data-performance.png` and `mm-data-speedup.png`) or on the basename given with the `-o` argument
The Makefile provides several convenient targets that handle the CSV filename:
```sh
# Just compile the program
make

# Run benchmarks and display results on console
make run

# Run benchmarks and save to default CSV file (mm-data.csv)
make run-csv

# Visualize data from the default CSV file
make visualize

# Run benchmarks, save to CSV, and create visualizations
make plot

# Run with high priority settings (for more accurate benchmarks)
make run-priority

# Run with high priority settings and save to default CSV file
make run-csv-priority

# Run benchmarks with high priority and generate visualizations
make plot-priority

# Remove all generated files
make clean
```
The CSV file should have the following columns:
- `implementation`: Name of the implementation (e.g., "naive", "sse", etc.)
- `matrix_size`: Size of the matrices (the N in an N×N×N multiplication)
- `duration_ns`: Duration per iteration in nanoseconds (automatically converted to milliseconds)
Example:
```csv
implementation,matrix_size,duration_ns
naive,32,12345678
naive,64,98765432
sse,32,1234567
sse,64,9876543
```
The Python script creates two charts based on your data:
- Performance Chart: Shows execution time vs. matrix size for all implementations
  - Saved as `[basename]-performance.png` and `[basename]-performance.pdf`
- Speedup Chart: Shows the speedup factor relative to the naive implementation
  - Saved as `[basename]-speedup.png` and `[basename]-speedup.pdf`
  - Only created if the data includes the "naive" implementation

The `basename` is derived from the input CSV filename (without extension) or taken directly from the `-o` argument.
All matrices are stored in row-major order, which is critical for the cache optimization strategy. The implementation:
- Aligns all matrix memory to the exact L1 data cache line size
- Processes data in chunks matching the size of L1 cache lines (`N_F32_PER_CACHE_LINE` floating-point values)
- Carefully handles matrices whose dimensions are not multiples of the cache line size
- Uses blocking factors derived directly from the L1 cache line size rather than arbitrary values
- Organizes computation to minimize TLB misses alongside cache misses
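A minimal sketch of the aligned-allocation piece (the function name `alloc_matrix` and the fallback value are illustrative; the build normally supplies `CACHE_LINE_SIZE`):

```c
#include <stdlib.h>

#ifndef CACHE_LINE_SIZE
#define CACHE_LINE_SIZE 64  /* fallback; normally detected at build time */
#endif

// Allocate an m x n row-major float matrix whose base address sits
// on an L1 cache line boundary, so blocked accesses start at the
// beginning of a line rather than straddling two.
static float *alloc_matrix(size_t m, size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, CACHE_LINE_SIZE, m * n * sizeof(float)) != 0)
        return NULL;
    return p;
}
```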
- Cache Line Alignment: All matrices are aligned to L1 cache line boundaries (typically 64 bytes) to eliminate split cache line accesses
- Cache Line Blocking: Matrices are processed in blocks precisely sized to match L1 cache lines (`N_F32_PER_CACHE_LINE` float elements)
- Sequential Access Patterns: Matrix traversal order optimized to maximize sequential accesses and minimize cache misses
- Matrix Transposition: Pre-transposed matrices improve spatial locality by converting column-wise accesses to row-wise
- Minimized Cache Pollution: Implementation carefully manages which data remains in cache during computation
- Vector Length Alignment: SIMD vector operations aligned with cache line boundaries for optimal performance
- Optimized Loop Orders: In transposed implementations, the inner loop order is arranged to maximize sequential memory access
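To make the transposition point concrete, here is a sketch of the transposed variant (signatures are illustrative): after B is transposed, both operands of the inner dot product are read with stride 1, so every fetched cache line is fully consumed.

```c
#include <stddef.h>

// Transpose b (n x p) into bt (p x n) once, up front.
static void transpose(const float *b, float *bt, size_t n, size_t p)
{
    for (size_t k = 0; k < n; k++)
        for (size_t j = 0; j < p; j++)
            bt[j * n + k] = b[k * p + j];
}

// Row i of A and "row" j of bt are both contiguous, so the inner
// loop is two sequential streams instead of one stream plus a
// strided column walk.
void matrix_mul_transposed_sketch(const float *a, const float *bt,
                                  float *c, size_t m, size_t n, size_t p)
{
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < p; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * bt[j * n + k];
            c[i * p + j] = sum;
        }
}
```

The one-time O(n·p) transposition cost is quickly amortized by the O(m·n·p) multiply.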
The benchmarking system is designed to accurately measure L1 cache performance:
- Initial warm-up phase preloads caches to measure steady-state performance
- Result matrix reset between runs to avoid contaminating measurements
- High-precision monotonic clock timing for nanosecond-resolution measurements
- Reports in both nanoseconds and milliseconds to highlight magnitude of improvements
- Optional high-priority execution mode to minimize system interference
- CPU pinning to avoid cache coherency effects between cores
- Non-power-of-2 matrix dimensions (257×257×257) to stress edge cases in cache line utilization
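A minimal sketch of the timing pattern (warm-up and reset details are simplified; `reset` is a placeholder for the project's result-matrix reset):

```c
#include <stdint.h>
#include <time.h>

// Monotonic nanosecond timestamp: immune to wall-clock adjustments,
// so differences are safe to use for benchmarking.
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

// Usage sketch:
//   matrix_mul(a, b, c, m, n, p);   // warm-up pass preloads caches
//   reset(c);                       // avoid contaminating the measurement
//   uint64_t t0 = now_ns();
//   matrix_mul(a, b, c, m, n, p);
//   uint64_t duration_ns = now_ns() - t0;
```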
The implementation progresses through increasingly sophisticated L1 cache optimizations:
- `matrix_mul`: Naive matrix multiplication with poor cache utilization
- `matrix_mul_transposed`: Restructures memory access for better spatial locality
- `matrix_mul_cacheline`: Implements blocking precisely matched to the L1 cache line size
- `matrix_mul_transposed_cacheline`: Combines cache-line blocking with a transposed B matrix for perfectly sequential access patterns
- `matrix_mul_sse`: Combines cache-line blocking with 4-wide SIMD operations
- `matrix_mul_transposed_sse`: Optimizes SIMD operations with a transposed matrix and inner loop reordering
- `matrix_mul_avx`: Extends to 8-wide SIMD while maintaining cache-line optimization
- `matrix_mul_transposed_avx`: Maximizes performance with AVX instructions and transposed matrices
Key constants:
- `CACHE_LINE_SIZE`: Automatically detected size of an L1 data cache line (typically 64 bytes)
- `N_F32_PER_CACHE_LINE`: Number of 32-bit floats that fit in one cache line (typically 16)
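One plausible way to wire these up (a sketch, not necessarily the project's exact mechanism) is to inject the detected value at compile time, e.g. `-DCACHE_LINE_SIZE=$(getconf LEVEL1_DCACHE_LINESIZE)` in the Makefile, with a safe fallback:

```c
// CACHE_LINE_SIZE is expected from the build, e.g.
//   gcc -DCACHE_LINE_SIZE=$(getconf LEVEL1_DCACHE_LINESIZE) ...
#ifndef CACHE_LINE_SIZE
#define CACHE_LINE_SIZE 64  /* typical on current x86-64 parts */
#endif

// 64 bytes / 4 bytes per float = 16 floats per cache line.
#define N_F32_PER_CACHE_LINE (CACHE_LINE_SIZE / sizeof(float))
```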
The implementation is carefully structured to:
- Minimize cache thrashing (repeated loading/unloading of same cache lines)
- Maximize spatial locality (accessing memory in sequential patterns)
- Optimize temporal locality (reusing cached values as much as possible)
- Handle edge cases where matrix dimensions aren't multiples of cache lines
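The edge-case handling in the last point amounts to clamping each block to the matrix boundary. A sketch of the blocking skeleton (the function name is illustrative, and C is assumed zero-initialized):

```c
#include <stddef.h>

#ifndef CACHE_LINE_SIZE
#define CACHE_LINE_SIZE 64
#endif
#define N_F32_PER_CACHE_LINE (CACHE_LINE_SIZE / sizeof(float))
#define MIN(a, b) ((a) < (b) ? (a) : (b))

// The j loop is tiled one cache line of floats at a time, so each
// (i, k) pass touches a single line of B and of C. MIN clamps the
// final tile when p is not a multiple of N_F32_PER_CACHE_LINE
// (e.g. 257 columns with 16 floats per line leaves a 1-wide tail).
void matrix_mul_cacheline_sketch(const float *a, const float *b,
                                 float *c, size_t m, size_t n, size_t p)
{
    for (size_t jj = 0; jj < p; jj += N_F32_PER_CACHE_LINE) {
        size_t j_end = MIN(jj + N_F32_PER_CACHE_LINE, p);
        for (size_t i = 0; i < m; i++)
            for (size_t k = 0; k < n; k++) {
                float aik = a[i * n + k];
                for (size_t j = jj; j < j_end; j++)
                    c[i * p + j] += aik * b[k * p + j];
            }
    }
}
```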
It's important to note that even with careful cache line blocking and prefetching, the non-transposed implementations still suffer from suboptimal cache behavior when accessing the B matrix and updating C matrix. This occurs because:
- The access pattern for B is strided with jumps of P elements between consecutive memory accesses
- Hardware prefetchers often struggle to recognize this access pattern
- Software prefetching (`_mm_prefetch`) has limitations in fully addressing this issue
- The C matrix is updated incrementally as B matrix elements are accessed, so each C element is written many times
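A sketch of what software prefetching looks like in the strided case (the prefetch distance of 8 is a guessed tuning parameter): the hint can hide some latency, but only 4 of the 64 bytes in each fetched line are used, so the bandwidth waste remains.

```c
#include <stddef.h>
#include <xmmintrin.h>

// Dot product of row i of A against column j of B, with a software
// prefetch a few iterations ahead. Each B access jumps p floats, so
// each touches a different cache line; _MM_HINT_T0 requests the line
// in L1 but cannot recover the 60 unused bytes per line.
static float dot_col_prefetch(const float *a, const float *b,
                              size_t i, size_t j, size_t n, size_t p)
{
    float sum = 0.0f;
    for (size_t k = 0; k < n; k++) {
        if (k + 8 < n)
            _mm_prefetch((const char *)&b[(k + 8) * p + j], _MM_HINT_T0);
        sum += a[i * n + k] * b[k * p + j];
    }
    return sum;
}
```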
The transposed implementations directly solve this issue by reorganizing the memory layout so that all accesses follow cache-friendly sequential patterns and each C matrix element is computed fully in a single pass, reducing the number of modifications per element. Additionally, the inner loop reordering for SIMD optimization in `matrix_mul_transposed_sse` and `matrix_mul_transposed_avx` maximizes vectorization efficiency by processing vectors of elements along the optimal dimension.
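For illustration, a sketch of the kind of inner kernel the transposed AVX version uses (the intrinsics are standard AVX/SSE3; the project's actual kernel structure may differ):

```c
#include <stddef.h>
#include <immintrin.h>

// Dot product of two contiguous rows (row i of A, row j of B^T),
// 8 floats per AVX step. Both streams are sequential, so the
// hardware prefetcher keeps L1 fed; a scalar tail handles n % 8
// leftovers (e.g. n = 257).
static float dot_avx(const float *x, const float *y, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    size_t k = 0;
    for (; k + 8 <= n; k += 8)
        acc = _mm256_add_ps(acc,
                            _mm256_mul_ps(_mm256_loadu_ps(x + k),
                                          _mm256_loadu_ps(y + k)));

    // Horizontal sum of the 8 partial sums.
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc),
                          _mm256_extractf128_ps(acc, 1));
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    float sum = _mm_cvtss_f32(s);

    for (; k < n; k++)  // scalar tail
        sum += x[k] * y[k];
    return sum;
}
```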
The performance improvements demonstrate the critical impact of L1 data cache optimization. Here are the actual benchmark results for a 257×257×257 multiplication:
| Implementation | Execution Time (ms) | Speedup vs. Naive |
|---|---|---|
| Naive implementation | 9.089 | 1.00× |
| Transposed implementation | 8.064 | 1.13× |
| Cache-line implementation | 5.485 | 1.66× |
| Transposed cache-line | 4.206 | 2.16× |
| SSE with cache awareness | 2.243 | 4.05× |
| Transposed SSE | 1.660 | 5.47× |
| AVX with cache awareness | 2.071 | 4.39× |
| Transposed AVX | 1.526 | 5.95× |
These results clearly demonstrate several key insights:
- Simply transposing the B matrix provides a modest 13% performance improvement
- Cache-line blocking delivers a significant 66% speedup over the naive implementation
- Combining transposition with cache-line blocking yields a 2.16× speedup
- SIMD vectorization with SSE provides an impressive 4× performance boost
- When combining SIMD with transposition, we achieve 5.5-6× speedup over the baseline
- The AVX implementation with transposition delivers the best performance, nearly 6× faster than the naive version
These improvements are most dramatic when matrix dimensions exceed L1 cache size but still fit in L2/L3 cache. The non-power-of-2 dimensions (257×257×257) specifically test how the implementation handles cache line boundary edge cases.
The benchmarks were performed on a system with the following specifications:
- CPU: AMD Ryzen 7 PRO 7840U with Radeon 780M Graphics
- CPU Cores: 8 cores, 16 threads
- L1 Data Cache: 32 KiB per core (256 KiB total)
- L1 Instruction Cache: 32 KiB per core (256 KiB total)
- L2 Cache: 1 MiB per core (8 MiB total)
- L3 Cache: 16 MiB (shared)