Published June 3, 2026 | Version v2
Working paper Open

Representation Before Action: How Dynamics-Aware Perception, Tactile Grounding, and Instruction Granularity Define the Upstream Bottleneck in Robot Generalization

  • 1. Saluca LLC

Description

Version 2 — revised in response to an external structural review and an automated critique pass. See "Response to Review" appendix in the PDF for the change log.

A persistent structural pattern across recent robotics preprints is that generalization failures in robot learning are predominantly *upstream* failures: they originate in how the robot represents the world before any action is computed, not in the action-selection mechanism itself. This paper synthesizes five to seven findings from cs.RO and cs.HC preprints dated May–June 2026, with supporting evidence from eess.SY on sample complexity, to argue that three complementary upstream bottlenecks (dynamics-aware visual representation, physics-grounded tactile encoding, and fine-grained language supervision) each independently constrain downstream policy generalization, and that addressing any one in isolation yields bounded gains. This is offered as a **heuristic reading**, not a formal derivation: the three bottlenecks share a structural pattern (richer upstream signal → more decodable downstream behavior) but are not unified by a single formalism, and the analogies drawn across modalities are structural rather than mechanistic.
Key falsifiable claims include:

(1) dynamics-aware visual encoders trained on image-language-3D flow triplets outperform static encoders by up to +22.5% in out-of-distribution manipulation scenarios, but only under the simulation and limited real-world conditions reported in the abstract [corpus:arxiv:2605.30350];

(2) Center-of-Pressure tactile representations achieve zero-shot sim-to-real transfer on contact-rich tasks where coarse binary-contact baselines fail, evaluated on two tasks with a single multi-fingered hand platform [corpus:arxiv:2605.28812];

(3) fine-grained instruction supervision follows an inverted-U mixing curve, peaking at FG:Raw = 1:2 to 1:1 and reaching 86.8% success in simulation only [corpus:arxiv:2605.27284];

(4) embodied VR feedback reshapes neural representations to yield r = 0.762 motor-imagery decoding correlation versus r = 0.672 for screen feedback, with improvements of 8.9–13.0% across movement dimensions, in a ten-participant human BCI study [corpus:arxiv:2605.29677].

The primary falsification path is: train matched policies on identical downstream tasks with and without each upstream enrichment, controlling for policy architecture and data volume, and test whether gains persist under held-out embodiment transfer.
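The matched-policy comparison described in the falsification path can be laid out as a small on/off grid over the three upstream enrichments. The sketch below is illustrative only and is not taken from the paper: the enrichment labels, the `Condition` type, the helper names, and the placeholder values for policy architecture and data volume are all assumptions made for this example.

```python
from itertools import product
from typing import List, NamedTuple, Tuple

# Hypothetical labels for the three upstream enrichments named in the abstract.
ENRICHMENTS = ("dynamics_aware_vision", "cop_tactile", "fine_grained_language")


class Condition(NamedTuple):
    dynamics_aware_vision: bool
    cop_tactile: bool
    fine_grained_language: bool
    policy_arch: str  # control: identical across all conditions
    num_demos: int    # control: identical across all conditions


def matched_conditions(policy_arch: str = "shared_policy",
                       num_demos: int = 1000) -> List[Condition]:
    """Build the full 2^3 on/off grid with fixed architecture and data volume.

    The default values are placeholders, not settings from the paper.
    """
    return [Condition(*flags, policy_arch, num_demos)
            for flags in product((False, True), repeat=len(ENRICHMENTS))]


def matched_pairs(conditions: List[Condition],
                  enrichment: str) -> List[Tuple[Condition, Condition]]:
    """Return (without, with) pairs that differ only in the given enrichment."""
    available = set(conditions)
    pairs = []
    for cond in conditions:
        if not getattr(cond, enrichment):
            enriched = cond._replace(**{enrichment: True})
            if enriched in available:
                pairs.append((cond, enriched))
    return pairs


if __name__ == "__main__":
    grid = matched_conditions()
    for name in ENRICHMENTS:
        # Each enrichment is isolated in four matched pairs of the 2^3 grid.
        print(name, len(matched_pairs(grid, name)))
```

Under this layout, each pair would be trained on the same downstream task and then compared on held-out embodiment transfer; that evaluation step is deliberately left out because the abstract does not specify its metric or threshold.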

Authorship: Saluca Agentic AI Research Team (Saluca LLC). AI-drafted from an arXiv preprint corpus on the date given in the filename.

Cited arXiv preprints: 2605.01597, 2605.26640, 2605.27284, 2605.28726, 2605.28812, 2605.29091, 2605.29677, 2605.30280, 2605.30326, 2605.30350, 2605.30864, 2606.01478, 2606.01970, 2606.02027, 2606.02562

Notes

This paper was AI-drafted by an internal multi-persona research agent over a curated arXiv corpus. It is not peer-reviewed. All cited works are listed by arXiv ID; readers should follow those links to verify claims against the primary preprints.

Files

20260603_cyborg_upstream-representation-bottleneck-robot-generalization_v2.pdf