this is a note on the different types of attention and the different ways to implement them (sliding window, flash). for reference: Q = query, K = key, V = value

multi head attention

the classic one that was explained in the architecture basics and in how everything works in theory: multiple heads, each with its own query, key and value projections. see the sketch below
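
here's a rough sketch of plain multi head attention in pytorch, just to make the shapes concrete. the sizes, weight names and the helper function are made up for illustration, and masking/dropout are left out:

```python
import torch
import torch.nn.functional as F

def multi_head_attention(x, wq, wk, wv, wo, n_heads):
    # x: (batch, seq, d_model); every head gets its own slice of Q, K and V
    b, t, d = x.shape
    d_head = d // n_heads
    q = (x @ wq).view(b, t, n_heads, d_head).transpose(1, 2)  # (b, h, t, d_head)
    k = (x @ wk).view(b, t, n_heads, d_head).transpose(1, 2)
    v = (x @ wv).view(b, t, n_heads, d_head).transpose(1, 2)
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5          # (b, h, t, t)
    out = F.softmax(scores, dim=-1) @ v                       # (b, h, t, d_head)
    out = out.transpose(1, 2).reshape(b, t, d)                # concat the heads back
    return out @ wo

d_model, n_heads = 64, 8  # toy sizes
x = torch.randn(2, 10, d_model)
wq, wk, wv, wo = (torch.randn(d_model, d_model) * 0.02 for _ in range(4))
print(multi_head_attention(x, wq, wk, wv, wo, n_heads).shape)  # (2, 10, 64)
```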

multi query attention

in each block all heads share a single K and V, but each head keeps its own Q. this shrinks the KV cache and speeds up inference, at the cost of a noticeable drop in quality. sketch below
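
a minimal sketch of multi query attention, assuming the same toy sizes as above. the only change from the multi head version is that K and V are projected once and broadcast across heads (names and sizes are illustrative):

```python
import torch
import torch.nn.functional as F

def multi_query_attention(x, wq, wk, wv, wo, n_heads):
    b, t, d = x.shape
    d_head = d // n_heads
    q = (x @ wq).view(b, t, n_heads, d_head).transpose(1, 2)  # per-head Q: (b, h, t, d_head)
    k = (x @ wk).unsqueeze(1)                                  # shared K: (b, 1, t, d_head)
    v = (x @ wv).unsqueeze(1)                                  # shared V: (b, 1, t, d_head)
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5           # K broadcasts over the head dim
    out = F.softmax(scores, dim=-1) @ v
    out = out.transpose(1, 2).reshape(b, t, d)
    return out @ wo

d_model, n_heads = 64, 8
d_head = d_model // n_heads
x = torch.randn(2, 10, d_model)
wq = torch.randn(d_model, d_model) * 0.02
wk = torch.randn(d_model, d_head) * 0.02  # only one K head's worth of params
wv = torch.randn(d_model, d_head) * 0.02  # only one V head's worth of params
wo = torch.randn(d_model, d_model) * 0.02
print(multi_query_attention(x, wq, wk, wv, wo, n_heads).shape)  # (2, 10, 64)
```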

group query attention

each head has its own Q, but K and V are shared within fixed groups of heads. the groups are architectural, not learned on the fly. the quality drop is somewhere in between full multi head and multi query, like a medium size. sketch below
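
a minimal sketch of grouped query attention under the same toy setup: 8 query heads, 2 K/V heads, so each K/V head serves a fixed group of 4 query heads (the sizes, names and helper are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(x, wq, wk, wv, wo, n_heads, n_kv_heads):
    b, t, d = x.shape
    d_head = d // n_heads
    group = n_heads // n_kv_heads                                 # query heads per K/V head
    q = (x @ wq).view(b, t, n_heads, d_head).transpose(1, 2)     # (b, h, t, d_head)
    k = (x @ wk).view(b, t, n_kv_heads, d_head).transpose(1, 2)  # (b, h_kv, t, d_head)
    v = (x @ wv).view(b, t, n_kv_heads, d_head).transpose(1, 2)
    # repeat each K/V head so it lines up with its fixed group of query heads
    k = k.repeat_interleave(group, dim=1)                         # (b, h, t, d_head)
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5
    out = F.softmax(scores, dim=-1) @ v
    out = out.transpose(1, 2).reshape(b, t, d)
    return out @ wo

d_model, n_heads, n_kv_heads = 64, 8, 2
d_head = d_model // n_heads
x = torch.randn(2, 10, d_model)
wq = torch.randn(d_model, d_model) * 0.02
wk = torch.randn(d_model, n_kv_heads * d_head) * 0.02  # only 2 K heads' worth of params
wv = torch.randn(d_model, n_kv_heads * d_head) * 0.02  # only 2 V heads' worth of params
wo = torch.randn(d_model, d_model) * 0.02
print(grouped_query_attention(x, wq, wk, wv, wo, n_heads, n_kv_heads).shape)  # (2, 10, 64)
```

setting n_kv_heads = n_heads recovers plain multi head attention and n_kv_heads = 1 recovers multi query, so grouped query is the knob in between.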

multi latent attention