Does layer-wise score aggregation improve SuperGLUE task accuracy over last-layer baselines when evaluated on out-of-distribution or adversarially constructed examples from the benchmark?
Description
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed b
Research goal: Does layer-wise score aggregation improve SuperGLUE task accuracy over last-layer baselines when evaluated on out-of-distribution or adversarially constructed examples from the benchmark?
Autonomous synthesis report generated by SOVEREIGN Research Kernel. Tribunal consensus score: 9.0/10.
Notes
Files
| Name | Size |
|---|---|
| paper.pdf (md5:31dc6211296450d5febe92e62144df57) | 84.8 kB |