Large pretrained language models (PreLMs) are revolutionizing natural language processing across all benchmarks. However, their sheer size is prohibitive for small laboratories or for deployment on mobile devices. Approaches like pruning and distillation reduce the model size but typically retain the same model architecture. In contrast, we explore distilling PreLMs into a different, more efficient architecture, Continual Multiplication of Words (CMOW), which embeds each word as a matrix and uses matrix multiplication to encode sequences. We extend the CMOW architecture and its CMOW/CBOW-Hybrid variant with a bidirectional component for more expressive power, per-token representations for a general (task-agnostic) distillation during pretr...
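To make the matrix-based encoding concrete, below is a minimal sketch of the CMOW idea described in this abstract, together with one possible bidirectional variant. All names, dimensions, and the near-identity initialization are illustrative assumptions, not the authors' implementation.

```python
# Toy sketch of CMOW-style sequence encoding (assumptions, not the paper's code):
# each word id maps to a small d x d matrix, and a sequence is encoded by
# multiplying those matrices left to right.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 1000, 4  # toy sizes (assumption)

# Word-as-matrix embedding table, initialized near the identity so the
# running product neither explodes nor vanishes early on (assumption).
E = np.tile(np.eye(d), (vocab_size, 1, 1)) + 0.1 * rng.standard_normal((vocab_size, d, d))

def cmow_encode(token_ids, bidirectional=False):
    """Encode a token-id sequence by continual matrix multiplication."""
    fwd = np.eye(d)
    for t in token_ids:
        fwd = fwd @ E[t]          # order-sensitive: matrix product is non-commutative
    if not bidirectional:
        return fwd.flatten()
    bwd = np.eye(d)
    for t in reversed(token_ids):
        bwd = bwd @ E[t]          # same table, reversed reading direction
    return np.concatenate([fwd.flatten(), bwd.flatten()])

print(cmow_encode([5, 42, 7], bidirectional=True).shape)  # (32,) = 2 * d * d
```

Because matrix multiplication is non-commutative, such an encoder is sensitive to word order, which is what distinguishes it from the additive CBOW component of the hybrid variant.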
Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional com...
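As a rough illustration of the conditional computation mentioned here: each token is routed to one expert, so only that expert's weights are applied to it. The routing scheme, sizes, and names below are assumptions for illustration only, not any specific library's API.

```python
# Toy top-1 mixture-of-experts layer (illustrative assumptions only).
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts = 8, 4
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts))

def moe_layer(x):
    """Apply one expert per token: compute grows with tokens, not with n_experts."""
    choice = (x @ router).argmax(axis=-1)     # top-1 expert index per token
    out = np.empty_like(x)
    for e in range(n_experts):
        mask = choice == e
        if mask.any():
            out[mask] = x[mask] @ experts[e]  # only the selected expert runs
    return out

tokens = rng.standard_normal((5, d_model))
print(moe_layer(tokens).shape)  # (5, 8)
```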
These improvements open many possibilities for solving downstream Natural Language Processing tasks. ...
Scaling language models with more data, compute and parameters has driven significant progress in na...
Continuous Bag of Words (CBOW) is a powerful text embedding method. Due to its strong capabilities t...
In the natural language processing (NLP) literature, neural networks are becoming increasingly deepe...
Continuous Bag of Words (CBOW) is a powerful text embedding method. Due to its strong capabilities t...
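For contrast with the matrix-based encoders discussed in the other entries, here is a minimal sketch of CBOW-style text embedding: word vectors are simply averaged into an order-invariant representation. Table size and dimensionality are arbitrary assumptions.

```python
# Toy CBOW-style text embedding: the mean of the word vectors (assumed sizes).
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 16
W = rng.standard_normal((vocab_size, dim))   # word embedding table

def cbow_encode(token_ids):
    """Order-invariant sentence embedding: averaging ignores word order entirely."""
    return W[token_ids].mean(axis=0)

print(cbow_encode([5, 42, 7]).shape)  # (16,)
```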
We propose a new neural model for word embeddings, which uses Unitary Matrices as the primary device...
Real-world business applications require a trade-off between language model performance and size. We...
Large language models have become a vital component in modern NLP, achieving state-of-the-art perfor...
Pre-trained language models (PLMs) have demonstrated impressive performance across various downstrea...
Since the first bidirectional deep learning model for natural language understanding, BERT, emerge...
Pretraining multilingual language models from scratch requires considerable computational resources ...
To obtain high-quality sentence embeddings from pretrained language models (PLMs), they must either ...
Large language models (LLMs) based on transformers have made significant strides in recent years, th...
LLMs, or Large Language Models, are machine learning models used to understand and genera...