State space models (SSMs) have shown impressive results on tasks that require modeling long-range dependencies, and they scale efficiently to long sequences thanks to their subquadratic runtime complexity. Originally designed for continuous signals, SSMs have delivered strong performance on a wide range of tasks in vision and audio; however, they still lag behind Transformers on language modeling tasks. In this work, we propose a hybrid layer, the Block-State Transformer (BST), which internally combines an SSM sublayer for long-range contextualization with a Block Transformer sublayer for short-term representation of sequences. We study three different, and completely parallelizable, variants that integrate SSMs and block-wise attention. We show ...
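To make the hybrid layer concrete, the following is a minimal NumPy sketch of the general idea: an SSM sublayer produces long-range context states, and a block-wise attention sublayer attends within short fixed-size blocks while also attending to that context. The function names (`ssm_scan`, `block_attention`, `block_state_layer`), the single-head attention, the choice to expose the SSM context as one extra key/value token per block, and the block length are all illustrative assumptions; this is not the paper's exact formulation of any of its three variants.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Simple linear state-space recurrence over a sequence.
    x: (T, d_in), A: (d_state, d_state), B: (d_state, d_in), C: (d_out, d_state)."""
    T = x.shape[0]
    state = np.zeros(A.shape[0])
    outputs = np.zeros((T, C.shape[0]))
    for t in range(T):                      # sequential form; parallel scans/convolutions exist
        state = A @ state + B @ x[t]
        outputs[t] = C @ state
    return outputs

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def block_attention(x, context, block_len):
    """Causal attention restricted to fixed-size blocks; each block additionally
    attends to a context vector taken from the SSM output (an assumption here)."""
    T, d = x.shape
    out = np.zeros_like(x)
    for start in range(0, T, block_len):
        blk = x[start:start + block_len]               # (b, d) tokens of this block
        ctx = context[start:start + block_len][-1:]    # last SSM state of the block
        kv = np.concatenate([ctx, blk], axis=0)        # prepend context as extra key/value
        scores = blk @ kv.T / np.sqrt(d)
        b = blk.shape[0]
        mask = np.tril(np.ones((b, b), dtype=bool))    # causal mask within the block
        full_mask = np.concatenate([np.ones((b, 1), dtype=bool), mask], axis=1)
        scores = np.where(full_mask, scores, -1e9)     # context token is always visible
        out[start:start + block_len] = softmax(scores) @ kv
    return out

def block_state_layer(x, params, block_len=4):
    context = ssm_scan(x, params["A"], params["B"], params["C"])
    return x + block_attention(x, context, block_len)  # residual connection

# Tiny usage example with random weights.
rng = np.random.default_rng(0)
d, d_state, T = 8, 16, 12
params = {
    "A": 0.9 * np.eye(d_state),
    "B": 0.1 * rng.normal(size=(d_state, d)),
    "C": 0.1 * rng.normal(size=(d, d_state)),
}
x = rng.normal(size=(T, d))
y = block_state_layer(x, params, block_len=4)
print(y.shape)  # (12, 8)
```

The key property the sketch illustrates is that attention cost grows with the block length rather than the full sequence length, while the SSM carries information across blocks; both sublayers can, in principle, be computed in parallel over the sequence.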
Transformers have emerged as a powerful tool for a broad range of natural language processing tasks....
LSTMs and other RNN variants have shown strong performance on character-level language modeling. The...
Multi-head attention is a driving force behind state-of-the-art transformers, which achieve remarkab...
Pretrained transformer models have demonstrated remarkable performance across various natural langua...
Transformers have achieved success in both language and vision domains. However, it is prohibitively...
We introduce the Block-Recurrent Transformer, which applies a transformer layer in a recurrent fashi...
We revisit the design choices in Transformers, and propose methods to address their weaknesses in ha...
Transformer models cannot easily scale to long sequences due to their O(N^2) time and space complexi...
Recent work has shown that either (1) increasing the input length or (2) increasing model size can i...
State space models have been shown to be effective at modeling long-range dependencies, especially on sequ...
Sequence models based on linear state spaces (SSMs) have recently emerged as a promising choice of a...
Transformer-based models show their effectiveness across multiple domains and tasks. The self-attent...
Transformer encoder-decoder models have shown impressive performance in dialogue modeling. However, ...
Transformer models have achieved promising results on natural language processing (NLP) tasks includ...
Transformer models achieve state-of-the-art performance on a wide range of NLP tasks. They however s...