Correlation Between Hidden Layer Depth in Linear Attention Models and Semantic Textual Similarity on GLUE
Description
Language model pre-training, as in BERT, has significantly improved performance on many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to run them efficiently on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method specially designed for knowledge distillation (KD) of Transformer-based models. By leveraging this new KD method, the wealth of knowledge encoded in a large "teacher" BERT c
Research goal: What is the correlation between hidden layer depth in linear attention models and semantic textual similarity performance on the GLUE benchmark?
Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 9.2/10.
Notes
Files
| Name | Size | Checksum |
|---|---|---|
| paper.pdf | 79.2 kB | md5:dff8029c7f7d30c7337e442714c6f6e0 |