Representation over Routing: Overcoming Surrogate Hacking in Multi-Timescale PPO
Description
Temporal credit assignment in reinforcement learning has long been a central challenge. Inspired by the multi-timescale encoding of the dopamine system in neurobiology, recent research has sought to introduce multiple discount factors into Actor-Critic architectures such as Proximal Policy Optimization (PPO), balancing short-term responsiveness with long-term planning. However, this paper reveals that naively fusing multi-timescale signals in complex delayed-reward tasks can induce severe algorithmic pathologies. We systematically demonstrate that exposing a temporal attention routing mechanism to policy gradients results in surrogate objective hacking, while adopting gradient-free uncertainty weighting triggers irreversible myopic degeneration. To address these issues, we propose a Target Decoupling architecture: on the Critic side, we retain multi-timescale value predictions as an auxiliary representation-learning signal, while on the Actor side, we strictly isolate the short-term signals and update the policy solely from long-timescale advantages. Ablation studies on the LunarLander-v2 environment demonstrate that our method avoids the local-optimum traps of single-timescale architectures, completely eliminates the policy collapse caused by routing mechanisms, and achieves the best sample efficiency and asymptotic performance among the compared variants.
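The decoupling described above can be sketched in a few lines of PyTorch. This is a hypothetical minimal illustration, not the paper's implementation: the critic regresses value targets at several discount factors (the auxiliary representation-learning signal), while the PPO clipped surrogate for the actor sees only the long-timescale advantage. All names (`MultiTimescaleActorCritic`, `GAMMAS`, `losses`) and the specific discount values are assumptions for illustration.

```python
import torch
import torch.nn as nn

GAMMAS = [0.9, 0.99, 0.999]  # short -> long timescales (illustrative values)

class MultiTimescaleActorCritic(nn.Module):
    """Shared body, one policy head, one value head per discount factor."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.pi = nn.Linear(hidden, n_actions)       # policy head
        self.v = nn.Linear(hidden, len(GAMMAS))      # one value per timescale

    def forward(self, obs):
        h = self.body(obs)
        return torch.distributions.Categorical(logits=self.pi(h)), self.v(h)

def losses(model, obs, actions, old_logp, returns, adv_long, clip=0.2):
    """Target-Decoupling losses (sketch).

    Critic: regresses ALL timescale returns, shape [batch, len(GAMMAS)],
    shaping the shared representation. Actor: standard PPO clipped
    surrogate driven ONLY by the long-timescale advantage `adv_long`;
    short-timescale signals never touch the policy gradient.
    """
    dist, values = model(obs)
    critic_loss = ((values - returns) ** 2).mean()
    ratio = torch.exp(dist.log_prob(actions) - old_logp)
    unclipped = ratio * adv_long
    clipped = torch.clamp(ratio, 1.0 - clip, 1.0 + clip) * adv_long
    actor_loss = -torch.min(unclipped, clipped).mean()
    return actor_loss, critic_loss
```

Because the short-timescale heads appear only in `critic_loss`, their gradients stop at the shared body: they can sharpen the learned features but cannot be "hacked" by the policy objective, which is the failure mode the paper attributes to gradient-exposed routing.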
Files
article.pdf (1.9 MB) — md5:e07b47e4cc14f984549ebe62cab18a26