Tiling FlashAttention in HLS

FlashAttention's key insight is that you never need the full N×N attention matrix in memory at once. You tile the computation so each tile fits in SRAM. On a CPU or GPU this is clever cache management. On an FPGA it's a design problem.

The seven modules

I split the core into seven pipelined modules, including the Q/K/V Loader, Dot-Product Engine, Online Softmax, Weighted Accumulator, and Output Writeback. Each is its own HLS function inside a DATAFLOW region, and the DATAFLOW pragma tells the tool to run them as concurrent pipeline stages rather than sequentially.
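To make the structure concrete, here is a minimal sketch of three stages chained through FIFOs. It is not the actual core: the stage names (load_k, dot_engine, write_scores), the top-level name scores_top, the head dimension D = 4, and the Fifo class are all assumptions made for illustration. Fifo is a stand-in for hls::stream so the code compiles with an ordinary C++ compiler; in Vitis HLS you would include hls_stream.h and use hls::stream instead. In this sketch the DATAFLOW pragma sits in the top-level function that calls the stages.

```cpp
#include <queue>

// Stand-in for hls::stream<T> so the sketch builds with a normal compiler.
// In Vitis HLS, include <hls_stream.h> and use hls::stream<T> instead.
template <typename T>
class Fifo {
public:
    void write(const T& v) { q_.push(v); }
    T read() { T v = q_.front(); q_.pop(); return v; }
    bool empty() const { return q_.empty(); }
private:
    std::queue<T> q_;
};

constexpr int D = 4;  // head dimension (assumed value for illustration)

struct Vec { float x[D]; };

// Stage 1 (hypothetical): stream K rows from memory into a FIFO.
void load_k(const float* k_mem, int n, Fifo<Vec>& k_out) {
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        Vec v;
        for (int d = 0; d < D; ++d) v.x[d] = k_mem[i * D + d];
        k_out.write(v);
    }
}

// Stage 2 (hypothetical): dot each K row against one fixed query vector.
void dot_engine(const float* q, int n, Fifo<Vec>& k_in, Fifo<float>& s_out) {
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        Vec k = k_in.read();
        float acc = 0.0f;
        for (int d = 0; d < D; ++d) acc += q[d] * k.x[d];
        s_out.write(acc);
    }
}

// Stage 3 (hypothetical): drain the score FIFO back to memory.
void write_scores(float* out, int n, Fifo<float>& s_in) {
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = s_in.read();
    }
}

// Top level: DATAFLOW asks the HLS tool to run the three stages concurrently,
// connected by the FIFOs. A normal compiler ignores the pragmas and simply
// runs the stages one after another, which gives the same result.
void scores_top(const float* q, const float* k_mem, float* out, int n) {
#pragma HLS DATAFLOW
    Fifo<Vec> k_fifo;
    Fifo<float> s_fifo;
    load_k(k_mem, n, k_fifo);
    dot_engine(q, n, k_fifo, s_fifo);
    write_scores(out, n, s_fifo);
}
```

Calling scores_top(q, k, out, n) with n rows of K fills out with n dot products. The PIPELINE pragmas only request an initiation interval of 1; whether the tool achieves it depends on the target and the datatypes.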

DATAFLOW gives pipeline-level parallelism: while the Dot-Product Engine computes one tile, the Loader is fetching the next.
AXI4 for memory, AXI-Lite for control. AXI-Lite register mapping is tedious but the generated RTL is clean.
Online softmax (Milakov & Gimelshein, 2018) is what makes this tractable: you keep a running max and denominator and update them once per tile, so the full row of scores is never stored.
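The online softmax update can be sketched in plain C++ for a single query row. This is a software illustration in float, not the HLS module itself: the names RowState, update_tile, and finalize are hypothetical, and the code assumes every tile contains at least one finite score. Each tile rescales the running denominator and the running weighted sum by exp(old_max - new_max) before adding its own contribution, so only the state for one row is kept between tiles.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Running state for one query row: max score seen so far, softmax
// denominator, and the unnormalized weighted sum of V rows.
struct RowState {
    float m = -std::numeric_limits<float>::infinity();
    float l = 0.0f;
    std::vector<float> acc;
    explicit RowState(std::size_t d) : acc(d, 0.0f) {}
};

// Fold one tile into the running state.
// scores: tile_len attention scores for this row.
// v:      tile_len x d matrix of matching V rows, row-major.
void update_tile(RowState& st, const float* scores, const float* v,
                 std::size_t tile_len, std::size_t d) {
    if (tile_len == 0) return;

    // New running max after seeing this tile.
    float tile_max = scores[0];
    for (std::size_t j = 1; j < tile_len; ++j)
        tile_max = std::max(tile_max, scores[j]);
    const float m_new = std::max(st.m, tile_max);

    // Rescale what was accumulated under the old max.
    // On the first tile st.m is -inf, so the scale is exp(-inf) = 0.
    const float scale = std::exp(st.m - m_new);
    st.l *= scale;
    for (std::size_t k = 0; k < d; ++k) st.acc[k] *= scale;

    // Add this tile's contribution relative to the new max.
    for (std::size_t j = 0; j < tile_len; ++j) {
        const float p = std::exp(scores[j] - m_new);
        st.l += p;
        for (std::size_t k = 0; k < d; ++k) st.acc[k] += p * v[j * d + k];
    }
    st.m = m_new;
}

// Normalize once at the end: output = acc / l.
std::vector<float> finalize(const RowState& st) {
    std::vector<float> out(st.acc.size());
    for (std::size_t k = 0; k < out.size(); ++k) out[k] = st.acc[k] / st.l;
    return out;
}
```

Calling update_tile once per tile and finalize at the end gives the same result as a two-pass softmax over the whole row, up to floating-point rounding. The storage between tiles is one max, one denominator, and d accumulators, regardless of N.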

The tool does what you tell it, not what you mean. The pragma says pipeline. Whether that pipeline meets timing is a different question.

RTL synthesis passed. Timing closure on the target SoC is next.