TRIPIX: An Experimental Framework for Perceptually Grounded Artificial Intelligence
An Implementation of the Usai Universal Cognitive Model

Luigi Usai
June 20, 2025
Abstract
Modern Artificial Intelligence, particularly Large Language Models (LLMs), exhibits
impressive capabilities in symbolic manipulation but lacks genuine understanding due to
the unresolved "symbol grounding problem." AI’s knowledge is not anchored in perceptual
reality, leading to well-documented issues of hallucination, brittleness, and a lack of commonsense reasoning. This paper argues that a robust solution requires a foundational theoretical
framework that explicitly connects perception to semantics. We posit that this framework is
provided by the Usai Universal Cognitive Model (UCM), which describes how an intelligent
agent processes sensory data into meaningful, structured representations.
This paper introduces TRIPIX, a system designed as the first practical implementation
and experimental validation of the UCM. TRIPIX programmatically constructs a bimodal
dataset where visual scenes are inextricably linked to their semantic ground truth. This is
achieved by generating a visual scene (perception) and, simultaneously, a structured RDF
knowledge graph representing the objects, properties, and relationships within it (semantics). We present a working prototype that successfully applies this process to generate a
synthetic, queryable, and perceptually grounded knowledge base. This work demonstrates
the feasibility of the UCM and charts a clear course toward a new generation of AI systems
whose intelligence is verifiable, contextualized, and fundamentally rooted in a structured
model of reality.
1 Introduction: The Symbol Grounding Problem
The capabilities of modern AI are undeniable, yet they are built on a fragile foundation. Models
like GPT-4 can write poetry and code but can also "hallucinate" facts with complete confidence
[1]. The root of this paradox is the symbol grounding problem: the model’s symbols (words and
tokens) are not connected to the real-world, perceptual phenomena they are meant to represent.
Their meaning is purely relational, derived from statistical co-occurrence in a vast corpus of
ungrounded text.
To build a more robust and trustworthy AI, we must bridge the chasm between symbolic
reasoning and sensory reality. This requires more than just adding images to datasets; it demands
a foundational theory of how perception is structured into meaning. This paper posits that such
a theory is provided by the Usai Universal Cognitive Model (UCM) [6], and introduces TRIPIX,
a system designed to operationalize it.
2 Theoretical Framework: The Usai Universal Cognitive Model (UCM)
The design of TRIPIX is not ad-hoc; it is a direct implementation of the principles laid out in
a series of prior works by the author. This body of research provides the theoretical bedrock
for our system. The foundational theory is presented in "The Universal Cognitive Model" [6],
which proposes a universal process for transforming raw sensory input into a structured semantic
representation.
This framework is then applied specifically to the visual domain in "An innovative system for
image semantics" [3] and further elaborated in "Visual Semantics and Artificial Intelligence" [5].
These works detail how a visual scene can be decomposed into its constituent semantic parts,
directly informing the TRIPIX methodology. Finally, "Perception and Language" [4] provides a
profound justification for using the Subject-Predicate-Object structure of the Resource Description Framework (RDF), arguing it mirrors the agent-action-patient structure we perceive in the
world.
3 The TRIPIX System: A Synthetic-First Approach
To validate the UCM, we developed a system for generating a bimodal dataset of visual scenes
and their corresponding semantic descriptions. Recognizing the limitations and scalability issues
of manually annotating real-world images, we adopted a "synthetic-first" approach using a
2.5D scene composer scripted in Python with the Pillow library.
This composer programmatically arranges 2D assets (e.g., images of objects with transparent
backgrounds) onto a digital canvas. For each generated image, it simultaneously creates a
perfectly synchronized RDF file describing the scene’s ground truth. This knowledge graph
captures:
• Object Identity and Class: Each object is an instance of a class (e.g., ‘inst:dog_0001 rdf:type tpix:Dog‘).
• Object Properties: Intrinsic attributes like position and size are recorded as data literals.
• Spatial Relationships: Inter-object relations like occlusion are explicitly stated (e.g., ‘inst:dog_0001 tpix:isPartiallyOccludedBy inst:tree_0001‘).
An example of the generated semantic data is shown below:
@prefix inst: <http://tripix.org/data/master_dataset/> .
@prefix tpix: <http://tripix.org/schema/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

inst:dog_0001 a tpix:Dog ;
    tpix:hasPosition "(150, 220)" ;
    tpix:isPartiallyOccludedBy inst:tree_0001 .

inst:tree_0001 a tpix:Tree ;
    tpix:hasPosition "(180, 150)" .

inst:scene_0001_rot90_br1.3 a tpix:SceneVariant ;
    tpix:hasRotation "90"^^xsd:integer ;
    tpix:hasBrightness "1.3"^^xsd:float .
Listing 1: Example of a generated RDF triple set for a scene.
This process allows for the rapid generation of a large, complex, and perfectly labeled knowledge base, which we have successfully demonstrated with a prototype.
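To make the pipeline concrete, the following is a minimal sketch of such a composer, assuming hypothetical asset paths, the namespaces of Listing 1, and the rdflib library for graph serialization; it illustrates the approach rather than reproducing the exact TRIPIX implementation.

from PIL import Image
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

TPIX = Namespace("http://tripix.org/schema/")
INST = Namespace("http://tripix.org/data/master_dataset/")

def compose_scene(scene_id, placements, canvas_size=(640, 480)):
    """Paste transparent 2D assets onto a canvas and emit the matching RDF."""
    canvas = Image.new("RGBA", canvas_size, (255, 255, 255, 255))
    g = Graph()
    g.bind("tpix", TPIX)
    g.bind("inst", INST)
    for name, cls, asset_path, (x, y) in placements:
        sprite = Image.open(asset_path).convert("RGBA")
        canvas.alpha_composite(sprite, dest=(x, y))            # perception
        subj = INST[name]
        g.add((subj, RDF.type, TPIX[cls]))                     # semantics
        g.add((subj, TPIX.hasPosition, Literal(f"({x}, {y})")))
    canvas.save(f"{scene_id}.png")
    g.serialize(destination=f"{scene_id}.ttl", format="turtle")
    return g

# Hypothetical usage; asset file names are placeholders:
# compose_scene("scene_0001",
#               [("tree_0001", "Tree", "assets/tree.png", (180, 150)),
#                ("dog_0001", "Dog", "assets/dog.png", (150, 220))])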
4 Interrogating the Knowledge Base
A key feature of our system is that the generated knowledge is immediately queryable. We
developed a separate graphical interface to execute SPARQL queries against the unified RDF
graph. This allows for complex, structured questions about the dataset, such as "Find all scenes
where a dog is overlapping with a rock" or "Count all objects of type Tree." This demonstrates
that our system creates not just data, but an active, analyzable knowledge base, fulfilling a core
tenet of the UCM.
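As an illustration, a query such as "Count all objects of type Tree" can be expressed in SPARQL and run with rdflib against the unified graph; the file name below is a placeholder for the merged dataset, and the graphical interface itself is not shown.

from rdflib import Graph

g = Graph()
g.parse("master_dataset.ttl", format="turtle")  # placeholder for the unified RDF graph

count_trees = """
PREFIX tpix: <http://tripix.org/schema/>
SELECT (COUNT(?obj) AS ?n)
WHERE { ?obj a tpix:Tree . }
"""
for row in g.query(count_trees):
    print("Trees in the dataset:", row.n)

# The occlusion query from the text can be phrased analogously, e.g.
# ?dog a tpix:Dog ; tpix:isPartiallyOccludedBy ?other .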
5 Future Work and Vision: The 4D+T Synthetic Universe
The current prototype is a successful proof-of-concept. The ultimate vision is to evolve this
system into a comprehensive "reality factory" for training AI, based on a 4D+T (Space,
Time + Transformations) synthetic data pipeline, likely using a 3D environment like
Blender.
This pipeline will generate a dataset of unprecedented richness by systematically varying
not just the content of the scenes, but the very laws of their perception. The "T" in 4D+T
represents a vector of transformations, where each transformation becomes a data point in the
knowledge graph, teaching a fundamental concept of perception:
5.1 Systematic Transformation Vectors
• Geometric Transformations: To teach viewpoint and object invariance.
– Rotation: Scenes will be captured from 360 degrees. The RDF will store the camera’s angle (e.g., ‘tpix:hasViewpointRotation "27"‘). An AI trained on this can learn
to recognize a rotated object and estimate its angle of rotation.
– Scale/Zoom: Objects will be rendered at different sizes. The RDF will note the
scale factor (e.g., ‘tpix:hasScale "1.5"‘). This teaches size constancy and depth cues.
– Shear & Distortion: Images will be non-uniformly scaled or sheared. The RDF
will store the distortion matrix (e.g., ‘tpix:hasShear "0.2"‘). This builds resilience to
lens distortions and unusual perspectives.
• Photometric Transformations: To teach resilience to environmental conditions.
– Luminosity & Contrast: Scene lighting will be programmatically altered. The
RDF will record the brightness level (e.g., ‘tpix:hasLuminosity "0.8"‘). This allows
an AI to function in varied lighting conditions (day, night, shadow).
– Color & Hue: The color palette will be modified. The RDF will log the changes
(e.g., ‘tpix:hasHueShift "+15"‘).
• Physical & Conceptual Transformations: To teach higher-order reasoning.
– Occlusion: Objects will be partially or fully hidden. The RDF will explicitly state
the occlusion relationship and percentage of visibility (e.g., ‘tpix:visibility "45"‘).
This is crucial for teaching object permanence.
– Material/Texture Change: The same object (e.g., a sphere) will be rendered
with different materials (wood, glass, metal). The RDF will state the material
(‘tpix:hasMaterial tpix:Wood‘). This separates the concept of "shape" from "material."
By training a model on this 4D+T dataset, we hypothesize it can learn not just to recognize
objects, but to understand the physical and perceptual context in which they exist. It could
answer questions like, "Is this object out of focus?" or "Is this scene viewed from above?"—a
level of understanding far beyond current systems.
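To ground this in code, the sketch below (an illustrative assumption about the eventual pipeline, not its implementation) applies one photometric transformation with Pillow and writes the exact factor back into the knowledge graph, reusing the tpix:hasBrightness predicate of Listing 1.

from PIL import Image, ImageEnhance
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

TPIX = Namespace("http://tripix.org/schema/")
INST = Namespace("http://tripix.org/data/master_dataset/")

def brightness_variant(scene_id, factor):
    """Render a brightness-altered variant and record the transformation."""
    img = Image.open(f"{scene_id}.png")
    variant_img = ImageEnhance.Brightness(img).enhance(factor)   # perception
    variant_id = f"{scene_id}_br{factor}"
    variant_img.save(f"{variant_id}.png")

    g = Graph()
    g.bind("tpix", TPIX)
    variant = INST[variant_id]
    g.add((variant, RDF.type, TPIX.SceneVariant))                 # semantics
    g.add((variant, TPIX.hasBrightness, Literal(factor, datatype=XSD.float)))
    g.serialize(destination=f"{variant_id}.ttl", format="turtle")

# brightness_variant("scene_0001", 1.3) yields a variant analogous to the
# scene_0001_rot90_br1.3 individual of Listing 1 (minus the rotation).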

5.2 From Static Scenes to Dynamic Events: Time, Action, and Causality

The most profound extension of the 4D+T pipeline is the explicit modeling of Time as a dimension of change. This moves TRIPIX from generating static snapshots to creating dynamic "mini-movies" where events unfold, grounded in a causal RDF* graph. Using a 3D engine like Blender, we can generate sequences of frames and, for each step, update the knowledge graph to reflect changes.

This enables the grounding of fundamental concepts beyond object recognition:

  • Action Grounding: An animation of a hand picking up a cup is no longer a series of disconnected images. It is grounded in the RDF as a single, cohesive event. The RDF graph will contain triples like:

    • event:pickup01 rdf:type tpix:Action .

    • event:pickup01 tpix:hasAgent agent:hand01 .

    • event:pickup01 tpix:actsOn object:cup01 .

    • event:pickup01 tpix:hasStartTime "1.2s"^^xsd:float .

    • event:pickup01 tpix:hasEndTime "2.5s"^^xsd:float .

  • Causality and State Change: We can explicitly model the consequences of actions. A Place(Cup, Table) action, for example, causes a state change, which can be grounded in both perception and semantics.

    • Perception: The sound "clink" is generated by the physics engine at the moment of contact.

    • Semantics: The graph is updated with object:cup01 tpix:hasState state:StableOnSurface . and event:clink01 tpix:causedBy action:placing01 .

  • Multimodal Grounding: A 3D environment naturally produces multimodal data. The action of "sipping" can be grounded simultaneously by:

    • Visuals: The trajectory of the cup toward the agent's head.

    • Audio: A "slurp" sound effect triggered by the event.

    • Physics Data: A decrease in the volume of the liquid inside the cup, registered by the engine.

By generating data where visual events, audio cues, and symbolic descriptions of actions and their consequences are perfectly synchronized, we provide the ideal training data for an AI to learn a causal, interactive model of the world. It could answer questions not just like "Where is the cup?" but "What just happened to the cup?" and "Why is the cup now empty?".

Listing 2: Example of a generated RDF graph for a dynamic event.

@prefix inst: <http://tripix.org/data/master_dataset/> .
@prefix tpix: <http://tripix.org/schema/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Scene and Object Initial State (at t=0)
inst:scene_0002 a tpix:Scene ;
    tpix:contains inst:hand_0001, inst:cup_0001, inst:table_0001 .

inst:cup_0001 a tpix:Cup ;
    tpix:hasState inst:state_on_table ;
    tpix:isFilledWith inst:coffee_0001 .

# Definition of the Action Event
inst:action_grasp_01 a tpix:GraspAction ;
    tpix:hasAgent inst:hand_0001 ;
    tpix:actsOn inst:cup_0001 ;
    tpix:hasStartTime "0.5"^^xsd:float ;
    tpix:hasEndTime "1.2"^^xsd:float .

# State change caused by the action
inst:action_grasp_01 tpix:resultsIn inst:state_in_hand .

inst:state_in_hand a tpix:ObjectState ;
    tpix:description "The object is being held firmly by an agent" .
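To show how such an event graph supports questions like "What just happened to the cup?", the following minimal sketch (assuming the graph above has been serialized to a file named scene_0002.ttl) runs a causal SPARQL query with rdflib.

from rdflib import Graph

g = Graph()
g.parse("scene_0002.ttl", format="turtle")  # placeholder for the event graph above

what_happened = """
PREFIX tpix: <http://tripix.org/schema/>
PREFIX inst: <http://tripix.org/data/master_dataset/>
SELECT ?action ?agent ?state
WHERE {
  ?action tpix:actsOn inst:cup_0001 ;
          tpix:hasAgent ?agent ;
          tpix:resultsIn ?state .
}
"""
for row in g.query(what_happened):
    print(f"{row.action} by {row.agent} -> resulting state {row.state}")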

 

5.3 Discussion: Limitations and the Path to Real-World Grounding

We acknowledge that a purely synthetic grounding, such as the one implemented in TRIPIX, can be viewed as a "hermeneutic carousel": a closed system of symbols referring only to other symbols [1]. However, we argue that this is an indispensable intermediate step. Before an agent can tackle the complexity and noise of the real world (the sim-to-real transition), it must first master an internally consistent and causally transparent model of the world. TRIPIX provides precisely this "clean" training environment, in which the foundations of causal reasoning and intuitive physics can be constructed and verified.

A legitimate criticism is that TRIPIX supplies its ontological categories a priori, whereas a true agent should learn such categories through interaction. While our current framework does not address concept formation from scratch, it does provide the ideal dataset for training categorization models. By generating thousands of variants of "cups" (with different shapes, materials, and textures) and "non-cups", we can train a model to perform the discrimination task that Harnad deems essential [1]. Our contribution, therefore, lies in providing perfectly labeled training data at a scale unattainable through manual annotation.

Finally, TRIPIX does not aim to replicate the slow, sensorimotor learning of a child, who acquires semantics through "honest toil". Instead, it aims to engineer a more direct path to equipping AI systems with the same core competencies. By providing an explicit, grounded model of the world, we hypothesize that the acquisition of physical common sense can be drastically accelerated, and subsequently refined and validated through limited, targeted interaction with the real world.
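As a minimal illustration of how the generated data could feed the discrimination task described above, the sketch below pairs each rendered image with the class labels asserted in its companion Turtle file; the directory layout and one-image-per-file naming are assumptions, not part of the current prototype.

from pathlib import Path
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

TPIX = Namespace("http://tripix.org/schema/")

def labelled_examples(dataset_dir):
    """Collect (image_path, class_label) pairs from the bimodal dataset."""
    pairs = []
    for ttl in Path(dataset_dir).glob("*.ttl"):
        g = Graph()
        g.parse(ttl, format="turtle")
        for subj, cls in g.subject_objects(RDF.type):
            if str(cls).startswith(str(TPIX)):                # keep only tpix classes
                label = str(cls).rsplit("/", 1)[-1]           # e.g. "Cup"
                pairs.append((ttl.with_suffix(".png"), label))
    return pairs

# pairs = labelled_examples("master_dataset/")
# Any standard classifier can then be trained on these image/label pairs.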


6 Conclusion
This paper presented TRIPIX, a system that successfully implements the principles of the Usai
Universal Cognitive Model. We have demonstrated the feasibility of generating a large-scale,
perfectly labeled, bimodal dataset through programmatic composition. The vision of a 4D+T
synthetic universe provides a clear and powerful roadmap for creating the data needed to train
the next generation of AI. By grounding symbolic knowledge in a rich, structured, and physically
coherent perceptual model, we are paving the way for artificial intelligences that can truly
understand our world.
References
[1] Harnad, S. (1990). The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1–3), 335–346.

 

 

Appendix: The TRIPIX Ontology (Turtle)

@prefix : <http://tripix.org/schema/> .
@prefix tpix: <http://tripix.org/schema/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

#################################################################
#
#    TRIPIX Ontology (tpix) Metadata
#
#################################################################

tpix: rdf:type owl:Ontology ;
      rdfs:label "TRIPIX Ontology" ;
      rdfs:comment "An ontology for grounding symbolic knowledge in a programmatically generated, dynamic, multimodal reality. It describes entities, properties, states, actions, and causality within a synthetic environment."@en ;
      owl:versionInfo "1.0" .

#################################################################
#
#    1. Fundamental Classes (Upper-Level)
#
#################################################################

tpix:Entity rdf:type owl:Class ;
    rdfs:label "Entity" ;
    rdfs:comment "The base class for anything that can exist in the TRIPIX world. Includes objects, agents, locations, etc."@en .

tpix:Action rdf:type owl:Class ;
    rdfs:label "Action" ;
    rdfs:comment "A process or event initiated by an agent that can affect entities and cause state changes. Actions occur over time."@en .

tpix:State rdf:type owl:Class ;
    rdfs:label "State" ;
    rdfs:comment "A specific condition or status of an entity at a point in time (e.g., 'Broken', 'OnSurface'). Often the result of an action."@en .

tpix:Property rdf:type owl:Class ;
    rdfs:label "Property" ;
    rdfs:comment "A characteristic of an entity, such as its color, mass, or material."@en .


#################################################################
#
#    2. Entity Subclasses (The "things" in the world)
#
#################################################################

tpix:PhysicalObject rdf:type owl:Class ;
    rdfs:subClassOf tpix:Entity ;
    rdfs:label "Physical Object" ;
    rdfs:comment "Any entity with a physical presence, mass, and volume in the simulated world."@en .

tpix:Agent rdf:type owl:Class ;
    rdfs:subClassOf tpix:PhysicalObject ;
    rdfs:label "Agent" ;
    rdfs:comment "A physical object capable of initiating actions."@en .

tpix:Container rdf:type owl:Class ;
    rdfs:subClassOf tpix:PhysicalObject ;
    rdfs:label "Container" ;
    rdfs:comment "An object designed to hold other substances or objects."@en .

# Concrete example object classes
tpix:Hand rdf:type owl:Class ; rdfs:subClassOf tpix:Agent .
tpix:Cup rdf:type owl:Class ; rdfs:subClassOf tpix:Container .
tpix:Table rdf:type owl:Class ; rdfs:subClassOf tpix:PhysicalObject .
tpix:Liquid rdf:type owl:Class ; rdfs:subClassOf tpix:Entity .

#################################################################
#
#    3. Properties (Predicates)
#
#################################################################

# --- Properties linking Entities to Data (Data Properties) ---

tpix:hasPosition3D rdf:type owl:DatatypeProperty ;
    rdfs:domain tpix:PhysicalObject ;
    rdfs:range xsd:string ;
    rdfs:comment "The (x, y, z) coordinates of the object's centroid. E.g., '(1.5, 2.0, 0.8)'."@en .

tpix:hasOrientationQuaternion rdf:type owl:DatatypeProperty ;
    rdfs:domain tpix:PhysicalObject ;
    rdfs:range xsd:string ;
    rdfs:comment "The (w, x, y, z) quaternion representing the object's rotation."@en .

tpix:hasColorRGB rdf:type owl:DatatypeProperty ;
    rdfs:label "has Color (RGB)" ;
    rdfs:domain tpix:PhysicalObject ;
    rdfs:range xsd:string ;
    rdfs:comment "The average RGB color of the object. E.g., '(255, 0, 0)'."@en .

tpix:hasMass rdf:type owl:DatatypeProperty ;
    rdfs:domain tpix:PhysicalObject ;
    rdfs:range xsd:float ;
    rdfs:comment "The mass of the object in kilograms."@en .

tpix:hasStartTime rdf:type owl:DatatypeProperty ;
    rdfs:domain tpix:Action ;
    rdfs:range xsd:float ;
    rdfs:comment "The simulation time (in seconds) when an action begins."@en .

tpix:hasEndTime rdf:type owl:DatatypeProperty ;
    rdfs:domain tpix:Action ;
    rdfs:range xsd:float ;
    rdfs:comment "The simulation time (in seconds) when an action ends."@en .

# --- Properties linking Entities to other Entities (Object Properties) ---

tpix:hasState rdf:type owl:ObjectProperty ;
    rdfs:domain tpix:Entity ;
    rdfs:range tpix:State ;
    rdfs:comment "Links an entity to its current state."@en .

tpix:hasMaterial rdf:type owl:ObjectProperty ;
    rdfs:domain tpix:PhysicalObject ;
    rdfs:range tpix:Property ;
    rdfs:comment "Links an object to its material property (e.g., tpix:Wood, tpix:Glass)."@en .

tpix:contains rdf:type owl:ObjectProperty ;
    rdfs:domain tpix:Container ;
    rdfs:range tpix:Entity ;
    rdfs:comment "Specifies that a container holds another entity."@en .

tpix:hasAgent rdf:type owl:ObjectProperty ;
    rdfs:domain tpix:Action ;
    rdfs:range tpix:Agent ;
    rdfs:comment "The agent who performs the action."@en .

tpix:actsOn rdf:type owl:ObjectProperty ;
    rdfs:domain tpix:Action ;
    rdfs:range tpix:Entity ;
    rdfs:comment "The entity (patient) that is the target of the action."@en .

tpix:usesInstrument rdf:type owl:ObjectProperty ;
    rdfs:domain tpix:Action ;
    rdfs:range tpix:PhysicalObject ;
    rdfs:comment "The instrument used to perform an action (e.g., using a spoon to stir)."@en .

# --- Causal and Relational Properties ---

tpix:resultsIn rdf:type owl:ObjectProperty ;
    rdfs:domain tpix:Action ;
    rdfs:range tpix:State ;
    rdfs:comment "The crucial causal link. This action results in a new state for an entity."@en .

tpix:isCausedBy rdf:type owl:ObjectProperty ;
    rdfs:domain tpix:State ;
    rdfs:range tpix:Action ;
    owl:inverseOf tpix:resultsIn ;
    rdfs:comment "The inverse of resultsIn. This state was caused by an action."@en .

tpix:isOccludedBy rdf:type owl:ObjectProperty ;
    rdfs:domain tpix:PhysicalObject ;
    rdfs:range tpix:PhysicalObject ;
    rdfs:comment "A spatial relationship indicating one object is visually blocked by another from a certain viewpoint."@en .

#################################################################
#
#    4. Action Subclasses (Concrete verbs)
#
#################################################################

tpix:MoveAction rdf:type owl:Class ;
    rdfs:subClassOf tpix:Action ;
    rdfs:label "Move Action" ;
    rdfs:comment "An action of changing the position of an object."@en .

tpix:GraspAction rdf:type owl:Class ; rdfs:subClassOf tpix:Action .
tpix:ReleaseAction rdf:type owl:Class ; rdfs:subClassOf tpix:Action .
tpix:PushAction rdf:type owl:Class ; rdfs:subClassOf tpix:Action .

#################################################################
#
#    5. State Subclasses (Concrete conditions)
#
#################################################################

tpix:OnSurface rdf:type owl:Class ;
    rdfs:subClassOf tpix:State ;
    rdfs:label "On Surface" ;
    rdfs:comment "State of an object resting on a surface."@en .

tpix:HeldByAgent rdf:type owl:Class ; rdfs:subClassOf tpix:State .
tpix:Falling rdf:type owl:Class ; rdfs:subClassOf tpix:State .
tpix:Broken rdf:type owl:Class ; rdfs:subClassOf tpix:State .
