osc.viz.rollout.self_attn_rollout

self_attn_rollout(attns, head_reduction='mean', adjust_residual=True, global_avg_pool=True)

Self-attention rollout: how much each output token attends to each input token, accumulated across layers.

Parameters

attns (Union[Mapping[str, Tensor], Sequence[Tensor]]) – dict or list of attention tensors, one entry per layer, each of shape [B heads Q K]
head_reduction (Union[str, Callable]) – ‘mean’, ‘max’, or a callable that reduces the head dimension
adjust_residual (bool) – whether to add 0.5 for the self connection, so that each token's residual path to itself is taken into account
global_avg_pool (bool) – whether the output of the final attention layer is average-pooled into a single feature vector

Returns

Rollout, shape [B Q K] if global_avg_pool=False else [B K]
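
Example

The following is a minimal sketch of the computation described above, not the library's implementation. It uses NumPy arrays in place of tensors, and the name rollout_sketch is hypothetical. Three details are assumptions: the residual adjustment is modeled as 0.5 * attention + 0.5 * identity, layers are combined by matrix multiplication in the order given (a dict is assumed to be in layer order), and global average pooling is modeled as a uniform mean over the query dimension.

```python
from collections.abc import Mapping

import numpy as np


def rollout_sketch(attns, head_reduction="mean", adjust_residual=True,
                   global_avg_pool=True):
    """Illustrative attention rollout (hypothetical helper, not osc's code)."""
    # Accept a dict (assumed to be in layer order) or a sequence.
    layers = list(attns.values()) if isinstance(attns, Mapping) else list(attns)

    rollout = None
    for attn in layers:
        attn = np.asarray(attn, dtype=np.float64)  # [B, heads, Q, K]

        # Reduce the head dimension -> [B, Q, K]
        if head_reduction == "mean":
            attn = attn.mean(axis=1)
        elif head_reduction == "max":
            attn = attn.max(axis=1)
        else:
            attn = head_reduction(attn)  # callable: [B, heads, Q, K] -> [B, Q, K]

        if adjust_residual:
            # Assumption: mix in the self connection with weight 0.5.
            # Requires Q == K, which holds for self-attention.
            attn = 0.5 * attn + 0.5 * np.eye(attn.shape[-1])

        # Compose with the rollout of the earlier layers.
        rollout = attn if rollout is None else attn @ rollout

    if global_avg_pool:
        # Assumption: pooling is a uniform mean over output (query) tokens.
        rollout = rollout.mean(axis=1)  # [B, K]
    return rollout


# Two layers of random softmax attention: B=2, heads=3, Q=K=4
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 2, 3, 4, 4))
weights = np.exp(logits)
weights /= weights.sum(axis=-1, keepdims=True)
attns = [weights[0], weights[1]]

pooled = rollout_sketch(attns)
full = rollout_sketch(attns, global_avg_pool=False)
print(pooled.shape)  # (2, 4)
print(full.shape)    # (2, 4, 4)
```

With head_reduction='mean' and row-normalized attention, each row of the result still sums to 1, so it can be read as a distribution over input tokens. That property does not hold in general for 'max'.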