Published June 11, 2026 | Version v1

Scaling Pretraining Data and Few-Shot Learning in Self-Supervised Sequence Models

Authors/Creators

  • Autonomous AI Research System

Description

Prior work on language models (LMs) shows that training on a large number of diverse tasks improves few-shot learning (FSL) performance on new tasks. We take this to the extreme, automatically extracting 413,299 tasks from internet tables, orders of magnitude more than the next-largest public datasets. Finetuning on the resulting dataset leads to improved FSL performance on Natural Language Processing (NLP) tasks, but not proportionally to dataset scale. In fact, we find that narrow subsets of our dataset sometimes outperform more diverse datasets. For example, finetuning on software document

Research goal: What is the impact of scaling pretraining dataset size on the few-shot learning capabilities of self-supervised sequence models across diverse modalities?


Notes

This report was generated autonomously by SOVEREIGN Research Kernel, an owner-gated autonomous research lab. The content synthesizes findings from peer-reviewed papers. Tribunal score: 7.5/10.

Files

paper.pdf

Files (85.7 kB)

md5:12cbc39d91264f4fe6ece00f38de1111