We propose a novel two-stage, sub-8-bit quantization-aware training algorithm for all components of a 250K-parameter feedforward, streaming, state-free keyword spotting model. In the first stage, we adapt a recently proposed quantization technique using a non-linear transformation with tanh(.) on dense layer weights. In the second stage, we use linear quantization methods on the rest of the network, including the remaining parameters (bias, gain, batchnorm), inputs, and activations. We conduct large-scale experiments, training on 26,000 hours of de-identified production far-field and near-field audio data (evaluating on 4,000 hours of data). We organize our results in two embedded chipset settings: a) with commodity ARM NEON instruction set and 8-bit con...
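The tanh-based first stage described above can be sketched roughly as follows. This is a minimal NumPy illustration under my own assumptions, not the paper's exact formulation: the function name `tanh_fake_quantize` and the choice of a symmetric uniform grid after the tanh squashing are illustrative.

```python
import numpy as np

def tanh_fake_quantize(weights, num_bits=4):
    """Illustrative stage-1 fake quantization: squash dense-layer weights
    with tanh(.), normalize to [-1, 1], snap to a symmetric uniform grid,
    then de-normalize for the forward pass. A sketch, not the authors'
    exact method."""
    squashed = np.tanh(weights)                  # non-linear transform
    scale = np.max(np.abs(squashed)) or 1.0      # guard against all-zero weights
    normalized = squashed / scale                # values now lie in [-1, 1]
    levels = 2 ** (num_bits - 1) - 1             # e.g. 7 positive levels at 4 bits
    quantized = np.round(normalized * levels) / levels
    return quantized * scale                     # de-quantized values used forward

# Example: quantize a tiny weight vector to 4 bits.
w = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
wq = tanh_fake_quantize(w, num_bits=4)
```

In quantization-aware training, a function like this would be applied in the forward pass while gradients flow to the full-precision weights (a straight-through estimator); the tanh squashing compresses outlier weights so the uniform grid spends its levels where most of the mass is.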
One-bit quantization is a general tool to execute a complex model, such as deep neural networks, on a...
Thesis (Master's)--University of Washington, 2021. As more electronic devices have an on-device Keywor...
Post-training quantization (PTQ) is the go-to compression technique for large generative models, suc...
Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quant...
Large language models (LLMs) exhibit excellent performance across a variety of tasks, but they come w...
Thesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Comput...
The introduction of artificial neural networks (ANNs) to speech recognition applications has sparked...
Artificial Neural Networks (NNs) can effectively be used to solve many classification and regression...
The compression of deep learning models is of fundamental importance in deploying such models to edg...
We introduce an Artificial Neural Network (ANN) quantization methodology for platforms without wide ...
In the context of keyword spotting (KWS), the replacement of handcrafted speech features by learnabl...
Neural networks have demonstrably achieved state-of-the-art accuracy using low-bitlength integer qua...
Power consumption in small devices is dominated by off-chip memory accesses, necessitating small mod...
Deep neural networks (DNN) have achieved impressive success in multiple domains. Over the years, the...
Reducing the latency and model size has always been a significant research problem for live Automati...