Published May 9, 2026 | Version v1
Preprint Open

Building a Transformer Model for Flight Waypoint Prediction: A Proof-of-Concept Tutorial

  • 1. Sovereign Machine Lab (SOMALA)

Description

This paper presents a proof-of-concept tutorial for building a Transformer-based neural network that predicts flight path waypoints from natural language descriptions. Developed by Frank Morales Aguilera at Sovereign Machine Lab, the project shows how to apply the Transformer architecture, originally built for text, to a structured geographic regression task.

System Architecture and Pipeline

The pipeline has eight stages, running from environment setup to real-world evaluation. Its core data and model components are:

  • Tokenization: Uses a GPT-2 tokenizer. Since GPT-2 lacks a default padding token, a custom <pad> token is added to handle varying input lengths.

  • Data Processing: The model uses the flight_plan_waypoints dataset from Hugging Face. Geographic coordinates (latitude and longitude) are z-score normalized to ensure training stability and are denormalized back to real-world coordinates during inference.

  • Model Design: The architecture features a Transformer encoder with a regression output head. It incorporates learned token and positional embeddings, followed by six encoder layers using a "Pre-LN" pattern (LayerNorm applied before sub-layers).

  • Pooling Strategies: The tutorial explores two ways to collapse sequence data into a single vector: last-token pooling and mean pooling, with the latter noted for providing better stability.
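As a rough illustration of the model design and mean pooling described above, the following PyTorch sketch builds a six-layer Pre-LN encoder with learned token and positional embeddings and a regression head. The class name `WaypointEncoder`, the hidden size, head count, maximum sequence length, and dropout are illustrative assumptions rather than values taken from the notebook; the output is shaped as 20 (latitude, longitude) pairs to match the maximum waypoint count used for padding.

```python
import torch
import torch.nn as nn


class WaypointEncoder(nn.Module):
    """Hypothetical sketch: Pre-LN Transformer encoder with a regression head."""

    def __init__(self, vocab_size, pad_id, d_model=128, n_heads=4,
                 n_layers=6, max_len=64, max_waypoints=20):
        super().__init__()
        self.pad_id = pad_id
        self.max_waypoints = max_waypoints
        # Learned token and positional embeddings.
        self.tok_emb = nn.Embedding(vocab_size, d_model, padding_idx=pad_id)
        self.pos_emb = nn.Embedding(max_len, d_model)
        # norm_first=True applies LayerNorm before each sub-layer (Pre-LN).
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            dropout=0.1, batch_first=True, norm_first=True,
        )
        self.encoder = nn.TransformerEncoder(
            layer, num_layers=n_layers, norm=nn.LayerNorm(d_model),
            enable_nested_tensor=False,
        )
        # Regression head: one (lat, lon) pair per waypoint slot.
        self.head = nn.Linear(d_model, max_waypoints * 2)

    def forward(self, input_ids):
        pad_mask = input_ids.eq(self.pad_id)  # True at padding positions
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        x = self.tok_emb(input_ids) + self.pos_emb(positions)
        x = self.encoder(x, src_key_padding_mask=pad_mask)
        # Mean pooling over real tokens only; padding is excluded.
        keep = (~pad_mask).unsqueeze(-1).to(x.dtype)
        pooled = (x * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1.0)
        return self.head(pooled).view(-1, self.max_waypoints, 2)
```

With a GPT-2 tokenizer, `vocab_size` would be the tokenizer length after the <pad> token has been added, and `pad_id` would be that token's id. Last-token pooling would instead select the hidden state at the final non-padding position of each sequence.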

Training and Optimization

Training relies on three techniques:

  • Masked MSE Loss: Because waypoints are padded to a maximum length of 20, a standard Mean Squared Error loss would be skewed by the "zero" padding. A binary mask is used to ensure the model is only penalized for errors on actual waypoints.

  • Optimization: The system uses the AdamW optimizer with weight decay. A learning rate scheduler (ReduceLROnPlateau) is employed to automatically halve the learning rate if the loss plateaus for 10 epochs.

  • Stability: Gradient clipping is applied to prevent "exploding gradients," a common issue in deep networks.
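The three techniques above could fit together roughly as follows in PyTorch. This is a minimal sketch under stated assumptions: the stand-in linear model, the learning rate, the weight-decay value, and the clipping norm of 1.0 are placeholders, and `masked_mse` and `train_step` are hypothetical helper names. Only the mask, the AdamW optimizer, the halving factor, the 10-epoch patience, and the 20-waypoint padding come from the description.

```python
import torch
import torch.nn as nn

MAX_WAYPOINTS = 20  # waypoints are padded to this length


def masked_mse(pred, target, mask):
    """MSE over real waypoints only.

    pred, target: (batch, MAX_WAYPOINTS, 2); mask: (batch, MAX_WAYPOINTS),
    1 for real waypoints and 0 for padding.
    """
    mask = mask.unsqueeze(-1).to(pred.dtype)
    squared_error = (pred - target) ** 2 * mask
    # Divide by the number of real coordinates, not the padded total.
    n_real = (mask.sum() * pred.size(-1)).clamp(min=1.0)
    return squared_error.sum() / n_real


# Stand-in model (assumption): any module producing MAX_WAYPOINTS * 2 outputs.
model = nn.Linear(8, MAX_WAYPOINTS * 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
# Halve the learning rate when the monitored loss stalls for 10 epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=10
)


def train_step(features, target, mask):
    """Run one optimization step and return the batch loss as a float."""
    optimizer.zero_grad()
    pred = model(features).view(-1, MAX_WAYPOINTS, 2)
    loss = masked_mse(pred, target, mask)
    loss.backward()
    # Gradient clipping guards against exploding gradients.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()
```

At the end of each epoch, `scheduler.step(epoch_loss)` would be called with the monitored loss so the scheduler can detect a plateau.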

Evaluation and Performance

The model's performance is measured by converting coordinate differences into a kilometer-based error metric. This calculation accounts for the convergence of meridians by scaling longitude error based on the latitude.
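One way to compute such a metric is an equirectangular approximation, sketched below in plain Python. The constant of 111.32 km per degree, the use of the mean latitude for the cosine scaling, the longitude wrap-around, and the function names are assumptions made for illustration; the notebook may differ in these details. The `rating` helper applies the tiers listed in this section.

```python
import math

KM_PER_DEGREE = 111.32  # approximate length of one degree of latitude (assumed)


def waypoint_error_km(pred, true):
    """Approximate distance in km between two (lat, lon) points in degrees."""
    pred_lat, pred_lon = pred
    true_lat, true_lon = true
    dlat_km = (pred_lat - true_lat) * KM_PER_DEGREE
    # Wrap the longitude difference into [-180, 180) to handle the date line.
    dlon = (pred_lon - true_lon + 180.0) % 360.0 - 180.0
    # Meridians converge toward the poles, so a degree of longitude shrinks
    # by cos(latitude); the mean latitude of the two points is used here.
    mean_lat = math.radians((pred_lat + true_lat) / 2.0)
    dlon_km = dlon * KM_PER_DEGREE * math.cos(mean_lat)
    return math.hypot(dlat_km, dlon_km)


def mean_error_km(preds, trues):
    """Average per-waypoint error over paired lists of (lat, lon) points."""
    errors = [waypoint_error_km(p, t) for p, t in zip(preds, trues)]
    return sum(errors) / len(errors)


def rating(avg_km):
    """Map an average error in km to the tutorial's rating tiers."""
    if avg_km < 500:
        return "Excellent"
    if avg_km <= 1000:
        return "Good"
    return "Needs Training"
```

The same one-degree longitude error costs about 111 km at the equator but only about 56 km at 60° latitude, which is why the scaling matters for routes at high latitudes.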

  • Excellent: < 500 km average error.

  • Good: 500–1000 km (indicates the model is learning route patterns).

  • Needs Training: > 1000 km.

Full code: https://github.com/frank-morales2020/MLxDL/blob/main/transformer_tutorial_fp.ipynb

Limitations and Future Scope

The primary constraint of this proof-of-concept is the dataset size. With only 2,000 records, the model is prone to geographic bias (skewed toward the Americas) and poor generalization for unseen routes. For a production-grade system, at least 10,000 records are recommended. The paper suggests using synthetic data generation via great-circle path computations to reach the necessary data volume for a more robust, globally-aware model.
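A great-circle path computation of the kind suggested here might look like the following sketch, which interpolates evenly spaced waypoints between two airports on a spherical Earth. The function name `great_circle_waypoints` and its signature are hypothetical, and the paper does not specify how synthetic routes would be sampled or paired with text descriptions.

```python
import math


def great_circle_waypoints(origin, dest, n_points):
    """Return n_points (lat, lon) pairs in degrees, evenly spaced along the
    great circle from origin to dest, endpoints included."""
    if n_points < 2:
        raise ValueError("n_points must be at least 2")
    lat1, lon1 = math.radians(origin[0]), math.radians(origin[1])
    lat2, lon2 = math.radians(dest[0]), math.radians(dest[1])

    # Angular distance between the endpoints (haversine formula).
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    d = 2 * math.asin(min(1.0, math.sqrt(h)))
    if d < 1e-12:
        return [tuple(origin)] * n_points  # identical endpoints
    if math.pi - d < 1e-9:
        raise ValueError("antipodal endpoints: the great circle is not unique")

    points = []
    for i in range(n_points):
        f = i / (n_points - 1)  # fraction of the way along the path
        a = math.sin((1 - f) * d) / math.sin(d)
        b = math.sin(f * d) / math.sin(d)
        # Interpolate in 3D Cartesian coordinates on the unit sphere.
        x = a * math.cos(lat1) * math.cos(lon1) + b * math.cos(lat2) * math.cos(lon2)
        y = a * math.cos(lat1) * math.sin(lon1) + b * math.cos(lat2) * math.sin(lon2)
        z = a * math.sin(lat1) + b * math.sin(lat2)
        lat = math.atan2(z, math.hypot(x, y))
        lon = math.atan2(y, x)
        points.append((math.degrees(lat), math.degrees(lon)))
    return points
```

For example, `great_circle_waypoints((40.64, -73.78), (51.47, -0.45), 20)` would produce a 20-waypoint path between two points near New York and London, matching the maximum waypoint count used in training.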

Files

transformer_poc_paper.pdf (15.9 kB, md5:1595afe00fec68441f2e3015c43ba2b0)