Published June 26, 2025 | Version 1.1
Preprint · Open

X-Spanformer: A Tokenizer-Free, Span-Aware Encoder Inspired by X-Bar Theory

Creators

  • 1. Independent Researcher

Contributors

  • 1. Independent Researcher

Description

This paper introduces X-Spanformer, a tokenizer-free, span-aware encoder that learns compositional segmentation directly from raw input streams using a pointer-network mechanism inspired by X-bar theory. Starting with a compact BPE seed, the model refines span boundaries through a staged curriculum involving synthetic supervision, entropy regularization, and contrastive alignment, producing softly typed spans pooled into transformer layers via a lightweight compositional interface. This joint optimization approach supports adaptable segmentation and representation across modalities such as code and natural language, validated through metrics including compression ratio, entropy decay, span-type KL divergence, and syntactic fidelity. The release includes an ONNX-compatible implementation and reproducible training recipes, positioning X-Spanformer as a foundation for interpretable, scalable encoders in structured learning, neural parsing, and program induction.
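To make the mechanism summarized above concrete, the minimal PyTorch sketch below (not taken from the released implementation; all class names, dimensions, and the pooling rule are illustrative assumptions) shows how a pointer-style scorer over raw bytes could propose candidate (start, end) spans, how an entropy term over the resulting span distribution could be computed for regularization, and how a selected span could be mean-pooled into a single vector suitable for a downstream transformer layer.

# Minimal sketch (not the authors' code): a pointer-style span scorer that
# embeds raw bytes, scores candidate (start, end) spans, computes an entropy
# term over the span distribution, and mean-pools a chosen span into a vector
# that a standard transformer layer could consume. Names, dimensions, and the
# pooling rule are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpanScorerSketch(nn.Module):
    def __init__(self, d_model: int = 64, max_span_len: int = 8):
        super().__init__()
        self.byte_embed = nn.Embedding(256, d_model)   # raw-byte input, no tokenizer
        self.start_proj = nn.Linear(d_model, d_model)
        self.end_proj = nn.Linear(d_model, d_model)
        self.max_span_len = max_span_len

    def forward(self, byte_ids: torch.Tensor):
        # byte_ids: (batch, seq_len) integers in [0, 255]
        h = self.byte_embed(byte_ids)                  # (B, T, D)
        starts = self.start_proj(h)                    # (B, T, D)
        ends = self.end_proj(h)                        # (B, T, D)
        # Pointer-style compatibility between every start and end position.
        logits = torch.einsum("btd,bsd->bts", starts, ends)  # (B, start, end)
        # Mask spans that run backwards or exceed the maximum span length.
        T = byte_ids.size(1)
        idx = torch.arange(T)
        valid = (idx[None, :] >= idx[:, None]) & (idx[None, :] - idx[:, None] < self.max_span_len)
        logits = logits.masked_fill(~valid, float("-inf"))
        # Joint distribution over (start, end) pairs.
        span_probs = F.softmax(logits.flatten(1), dim=-1).view_as(logits)
        return h, span_probs


def entropy_term(span_probs: torch.Tensor) -> torch.Tensor:
    # Entropy of the span distribution; the paper anneals a term like this
    # during the staged curriculum (the weight schedule is omitted here).
    p = span_probs.clamp_min(1e-9)
    return -(p * p.log()).sum(dim=(-2, -1)).mean()


def pool_span(h: torch.Tensor, start: int, end: int) -> torch.Tensor:
    # Lightweight compositional interface: mean-pool the byte states inside a
    # span so the resulting vector can be handed to a transformer layer.
    return h[:, start : end + 1].mean(dim=1)


if __name__ == "__main__":
    text = "def f(x): return x + 1"
    byte_ids = torch.tensor([list(text.encode("utf-8"))])
    model = SpanScorerSketch()
    h, span_probs = model(byte_ids)
    loss_entropy = entropy_term(span_probs)
    # Take the single most probable span as an illustration.
    best = int(span_probs.flatten(1).argmax(dim=-1)[0])
    start, end = divmod(best, span_probs.size(-1))
    pooled = pool_span(h, start, end)
    print(text[start : end + 1], pooled.shape, float(loss_entropy))

In the actual model these pieces are optimized jointly under the staged curriculum described above (synthetic supervision, entropy regularization, contrastive alignment); the sketch only fixes the shapes and interfaces involved.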

Files

XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf

Additional details

Dates

Created
2025-06-26
Created the initial draft.

Software

Repository URL
https://github.com/p3nGu1nZz/x-spanformer
Programming language
Python, C++
Development Status
WIP (work in progress)

References

  • Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural Machine Translation of Rare Words with Subword Units. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany: Association for Computational Linguistics, 2016, pp. 1715–1725. doi: 10.18653/v1/P16-1162. url: https://aclanthology.org/P16-1162.
  • Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Brussels, Belgium: Association for Computational Linguistics, 2018, pp. 66–71. doi: 10.18653/v1/D18-2012. url: https://aclanthology.org/D18-2012.
  • Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer Networks. In: Advances in Neural Information Processing Systems. Vol. 28. 2015, pp. 2692–2700. url: https://arxiv.org/abs/1506.03134.
  • Ashish Vaswani et al. Attention Is All You Need. In: Advances in Neural Information Processing Systems. Vol. 30. 2017, pp. 5998–6008. url: https://arxiv.org/abs/1706.03762.
  • Jacob Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, 2019, pp. 4171–4186. doi: 10.18653/v1/N19-1423. url: https://aclanthology.org/N19-1423.
  • Alec Radford et al. Language Models are Unsupervised Multitask Learners. OpenAI Technical Report. Available at https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf. 2019.
  • Colin Raffel et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. In: Journal of Machine Learning Research 21.140 (2020), pp. 1–67. url: https://jmlr.org/papers/v21/20-074.html.
  • Michiel de Galle, Benoît Sagot, and Djamé Seddah. Respite: A Tokenization-Free Multilingual Language Model. In: Proceedings of EMNLP 2021. 2021, pp. 288–302.
  • Yi Tay et al. Charformer: Fast Character Transformers via Gradient-based Subword Tokenization. In: arXiv preprint arXiv:2106.12672 (2021). url: https://arxiv.org/abs/2106.12672.
  • Jonathan H. Clark et al. CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation. In: Transactions of the Association for Computational Linguistics 9 (2021), pp. 1199–1212.
  • Yinhan Liu et al. Learning Unsupervised Segmentation for Text-to-Text Generation. In: Proceedings of NAACL 2022. 2022, pp. 2736–2750.
  • Yi Liao, Xin Jiang, and Qun Liu. Probabilistically Masked Language Model Capable of Autoregressive Generation in Arbitrary Word Order. In: Proceedings of ACL 2020. 2020, pp. 263–274.
  • Ray Jackendoff. X-bar Syntax: A Study of Phrase Structure. Linguistic Inquiry Monograph 2. Cambridge, MA: MIT Press, 1977. isbn: 9780262600095.
  • Mathias Creutz and Krista Lagus. Unsupervised Morpheme Segmentation and Morphology Induction from Text Corpora Using Morfessor 1.0. Tech. rep. A81. Helsinki University of Technology, 2005. url: http://users.ics.aalto.fi/mcreutz/papers/Creutz05tr.pdf.
  • Mandar Joshi et al. SpanBERT: Improving Pre-training by Representing and Predicting Spans. In: Transactions of the Association for Computational Linguistics 8 (2020), pp. 64–78. doi: 10.1162/tacl_a_00300.
  • Shaoqing Ren et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In: arXiv preprint arXiv:1506.01497 (2015). url: https://arxiv.org/abs/1506.01497.
  • Robin Strudel et al. Segmenter: Transformer for Semantic Segmentation. In: arXiv preprint arXiv:2105.05633 (2021). url: https://arxiv.org/abs/2105.05633.
  • Linting Xue et al. ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models. In: Transactions of the Association for Computational Linguistics 10 (2022), pp. 291–306.
  • Jonathan H. Clark et al. CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation. In: Transactions of the Association for Computational Linguistics 10 (2022), pp. 73–91.
  • Yi Tay et al. Charformer: Fast Character Transformers via Gradient-based Subword Tokenization. In: Advances in Neural Information Processing Systems. Vol. 34. 2021, pp. 15884–15897. doi: 10.48550/arXiv.2106.12672. url: https://arxiv.org/abs/2106.12672.
  • Shuyuan Cao et al. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. 2022. arXiv: 2203.13474 [cs.CL]. url: https://arxiv.org/abs/2203.13474.
  • Haoran Xu et al. Faster and Better: A Dual-Path Framework for Document-Level Relation Extraction. In: arXiv preprint arXiv:2202.05544 (2022).
  • Julia Kreutzer et al. Distilling Structured Knowledge from Large Language Models. In: Findings of the Association for Computational Linguistics: ACL/IJCNLP. 2021, pp. 3844–3853.
  • Jianpeng Liu et al. Table-to-text generation by structure-aware seq2seq learning. In: AAAI. 2018, pp. 4881–4888.
  • Bowen Cheng, Alexander G. Schwing, and Alexander Kirillov. Masked-Attention Mask Transformer for Universal Image Segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2021, pp. 1290–1299.
  • Zi Lin, Sweta Agrawal, and Smaranda Muresan. Learning Cross-lingual Code-switching for Generative Language Models. In: Findings of EMNLP 2021. 2021, pp. 2678–2689.
  • Jai Gupta et al. Molt: Modular Prompt Tuning for Multi-task and Cross-lingual Transfer. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2022.
  • Daniel Khashabi et al. UnifiedQA: Crossing Format Boundaries with a Single QA System. In: Findings of EMNLP 2020. 2020, pp. 1896–1907.
  • Xiang Lisa Li and Percy Liang. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL). 2021, pp. 4582–4597.
  • Zi Lin et al. GLAIVE: Global Context Aware Generation for Code-Mixed Dialogues. In: Findings of ACL 2022. 2022, pp. 672–685.
  • Kenton Lee, Mike Lewis, and Luke Zettlemoyer. End-to-End Neural Coreference Resolution. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2017.
  • Jianpeng Cheng, Michael Kuehl, and Mirella Lapata. Probing What Different NLP Tasks Teach Machines About Function Word Comprehension. In: Findings of EMNLP. 2020.
  • Kenton Lee et al. Higher-Order Coreference Resolution with Coarse-to-Fine Inference. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). 2018.
  • Kelvin Guu et al. REALM: Retrieval-Augmented Language Model Pre-Training. In: Proceedings of the 37th International Conference on Machine Learning (ICML). 2020.
  • Weizhe Zuo et al. Rethinking Insertion for Transformer-Based Language Modeling. In: Findings of ACL. 2022.
  • Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-Attention with Relative Position Representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). 2018.
  • Susan Zhang et al. OPT: Open Pre-trained Transformer Language Models. In: arXiv preprint arXiv:2205.01068 (2022). url: https://arxiv.org/abs/2205.01068.
  • Gautier Izacard and Edouard Grave. Distilling Knowledge from Reader to Retriever for Question Answering. In: Advances in Neural Information Processing Systems (NeurIPS). 2020.
  • Shivangi Arora et al. ExSum: From Local Explanations to Model Understanding. In: Advances in Neural Information Processing Systems (NeurIPS). 2022.
  • Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The Long-Document Transformer. In: arXiv preprint arXiv:2004.05150 (2020). url: https://arxiv.org/abs/2004.05150.
  • Manzil Zaheer et al. Big Bird: Transformers for Longer Sequences. In: Advances in Neural Information Processing Systems 33 (2020), pp. 17283–17297. url: https://arxiv.org/abs/2007.14062.
  • Noam Shazeer et al. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. In: arXiv preprint arXiv:1701.06538 (2017). url: https://arxiv.org/abs/1701.06538.
  • Joshua Ainslie et al. CoLT5: Faster Long-Range Transformers with Conditional Computation. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). Singapore: Association for Computational Linguistics, 2023, pp. 5085–5100. url: https://aclanthology.org/2023.emnlp-main.309/.
  • Junxian He et al. Syntax-Enhanced Transformer for Neural Machine Translation. In: arXiv preprint arXiv:2002.01160 (2020). url: https://arxiv.org/abs/2002.01160.
  • Edward J. Hu et al. LoRA: Low-Rank Adaptation of Large Language Models. In: arXiv preprint arXiv:2106.09685 (2021). url: https://arxiv.org/abs/2106.09685.
  • Mike Lewis et al. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In: Proceedings of ACL. 2020, pp. 7871–7880. url: https://aclanthology.org/2020.acl-main.703.
  • André F. T. Martins et al. Latent Structure Models for Natural Language Processing. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts. 2019, pp. 1–5.
  • Chenchen Ma, Jing Ouyang, and Gongjun Xu. Learning Latent and Hierarchical Structures in Cognitive Diagnosis Models. In: Psychometrika 88.1 (2023), pp. 175–207. doi: 10.1007/s11336-022-09867-5.
  • Yi Tay et al. Efficient Content-Based Sparse Attention with Routing Transformers. In: Transactions of the Association for Computational Linguistics 9 (2021), pp. 53–68. doi: 10.1162/tacl_a_00353.
  • Yves Grandvalet and Yoshua Bengio. Semi-Supervised Learning by Entropy Minimization. In: Advances in Neural Information Processing Systems. 2005, pp. 529–536.
  • Gabriel Pereyra et al. Regularizing Neural Networks by Penalizing Confident Output Distributions. In: International Conference on Learning Representations (ICLR). 2017.
  • Yoshua Bengio et al. Curriculum Learning. In: Proceedings of the 26th Annual International Conference on Machine Learning. 2009, pp. 41–48.
  • Andrew Drozdov et al. Unsupervised Latent Tree Induction with Deep Inside-Outside Recursive Autoencoders. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2019, pp. 1129–1141. url: https://aclanthology.org/N19-1116/.
  • Kevin Clark et al. Semi-Supervised Sequence Modeling with Cross-View Training. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2018, pp. 1914–1925. url: https://aclanthology.org/D18-1217/.
  • David R. So et al. Primer: Searching for Efficient Transformers for Language Modeling. In: Advances in Neural Information Processing Systems (NeurIPS). 2022. url: https://arxiv.org/abs/2109.08668.
  • Ross Taylor et al. Galactica: A Large Language Model for Science. In: arXiv preprint arXiv:2211.09085 (2022). url: https://arxiv.org/abs/2211.09085.
  • Pengfei Liu et al. PADA: Prompting Adaptation for Text Classification with Pretrained Language Models. In: Proceedings of ACL. 2022. url: https://aclanthology.org/2022.acl-long.456.
  • Jack W. Rae et al. Scaling Language Models: Methods, Analysis & Insights from Training Gopher. In: arXiv preprint arXiv:2112.11446 (2021). url: https://arxiv.org/abs/2112.11446.
  • Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. Latent Retrieval for Weakly Supervised Open Domain Question Answering. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL). 2019. url: https://arxiv.org/abs/1906.00300.
  • Peter J. Liu et al. Generating Wikipedia by Summarizing Long Sequences. In: International Conference on Learning Representations (ICLR). 2018. url: https://arxiv.org/abs/1801.10198.
  • Jason Naradowsky, Sharon Goldwater, and Sebastian Riedel. Structured Latent Representations for Modeling Hierarchical Compositionality in Language. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL). 2021. url: https://aclanthology.org/2021.acl-long.123.
  • Yonatan Belinkov. Probing Classifiers: Promises, Shortcomings, and Advances. In: Computational Linguistics 48.1 (2022), pp. 207–219. doi: 10.1162/coli_a_00422. url: https://arxiv.org/abs/2102.12452.
  • Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In: International Conference on Learning Representations (ICLR). 2019. url: https://arxiv.org/abs/1711.05101.
  • John Hewitt and Christopher D. Manning. A Structural Probe for Finding Syntax in Word Representations. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2019, pp. 4129–4138. url: https://aclanthology.org/N19-1419/.
  • Yang Liu and Mirella Lapata. Hierarchical Transformers for Multi-Document Summarization. In: Transactions of the Association for Computational Linguistics 7 (2019), pp. 337–351. doi: 10.1162/tacl_a_00276. url: https://aclanthology.org/Q19-1024.
  • Kara Marie Rawson. Stream-Mix: A Synthetic Benchmark for Compositional Span Induction. Manuscript in preparation. 2025.
  • Stephen Merity et al. Pointer Sentinel Mixture Models. 2016. doi: 10.48550/arXiv.1609.07843. arXiv: 1609.07843 [cs.CL]. url: https://arxiv.org/abs/1609.07843.
  • Alexander M. Rush, Sumit Chopra, and Jason Weston. A Neural Attention Model for Abstractive Sentence Summarization. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2015, pp. 379–389. url: https://aclanthology.org/D15-1044.
  • Jesse Vig et al. Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias. In: arXiv preprint arXiv:2004.12265 (2020). doi: 10.48550/arXiv.2004.12265. url: https://arxiv.org/abs/2004.12265.
  • Nikita Kitaev and Dan Klein. Constituency Parsing with a Self-Attentive Encoder. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics, 2018, pp. 2676–2686. doi: 10.18653/v1/P18-1249. url: https://aclanthology.org/P18-1249.
  • Matthew Honnibal and Ines Montani. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear. 2017. url: https://sentometrics-research.com/publication/72/.
  • Ralph Weischedel et al. OntoNotes Release 5.0. Linguistic Data Consortium, LDC2013T19. Philadelphia: Linguistic Data Consortium. 2013. url: https://catalog.ldc.upenn.edu/LDC2013T19.
  • Yves Grandvalet and Yoshua Bengio. Entropy Regularization. In: Semi-Supervised Learning. Ed. by Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. MIT Press, 2006, pp. 151–168. doi: 10.7551/MITPRESS/9780262033589.003.0009.
  • Zilliz. How do I implement embedding pooling strategies (mean, max, CLS)? Accessed: 2025-06-26. 2023. url: https://zilliz.com/ai-faq/how-do-i-implement-embedding-pooling-strategies-mean-max-cls.
  • Shicheng Liu et al. SUQL: Conversational Search over Structured and Unstructured Data with Large Language Models. In: Findings of the Association for Computational Linguistics: NAACL 2024 (2024), pp. 4535–4555. doi: 10.18653/v1/2024.findings-naacl.283. url: https://aclanthology.org/2024.findings-naacl.283.
  • Xiaoya Li et al. A Unified MRC Framework for Named Entity Recognition. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics, 2020, pp. 5849–5859. doi: 10.18653/v1/2020.acl-main.519. url: https://aclanthology.org/2020.acl-main.519.
  • Ahsaas Bajaj et al. Long Document Summarization in a Low Resource Setting using Pretrained Language Models. In: arXiv preprint arXiv:2103.00751 (2021). doi: 10.48550/arXiv.2103.00751. url: https://arxiv.org/abs/2103.00751.
  • Ingo Ziegler et al. CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation. In: arXiv preprint arXiv:2409.02098 (2024). doi: 10.48550/arXiv.2409.02098. url: https://arxiv.org/abs/2409.02098.
  • Kaustubh D. Dhole. A Multi-Encoder Frozen-Decoder Approach for Fine-Tuning Large Language Models. In: arXiv preprint arXiv:2501.07818 (2025). doi: 10.48550/arXiv.2501.07818. url: https://arxiv.org/abs/2501.07818.
  • Bingfeng Zhang et al. Frozen CLIP: A Strong Backbone for Weakly Supervised Semantic Segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2024. url: https://openaccess.thecvf.com/content/CVPR2024/html/Zhang_Frozen_CLIP_A_Strong_Backbone_for_Weakly_Supervised_Semantic_Segmentation_CVPR_2024_paper.pdf.
  • Jesse Vig and Yonatan Belinkov. Analyzing the Structure of Attention in a Transformer Language Model. In: Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Florence, Italy: Association for Computational Linguistics, 2019, pp. 63–76. doi: 10.18653/v1/W19-4808. url: https://aclanthology.org/W19-4808.
  • Benjamin Hoover, Hendrik Strobelt, and Sebastian Gehrmann. exBERT: A Visual Analysis Tool to Explore Learned Representations in Transformer Models. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Association for Computational Linguistics, 2020, pp. 187–196. doi: 10.18653/v1/2020.acl-demos.21. url: https://aclanthology.org/2020.acl-demos.21.
  • Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies. In: Transactions of the Association for Computational Linguistics 4 (2016), pp. 521–535. doi: 10.1162/tacl_a_00115. url: https://aclanthology.org/Q16-1037.
  • Yonatan Belinkov and James Glass. Analysis Methods in Neural Language Processing: A Survey. In: Transactions of the Association for Computational Linguistics 7 (2019), pp. 49–72. doi: 10.1162/tacl_a_00254. url: https://aclanthology.org/Q19-1004.
  • Chris Olah et al. The Building Blocks of Interpretability. In: Distill (2018). doi: 10.23915/distill.00010. url: https://distill.pub/2018/building-blocks/.
  • Honggang Wang et al. Structured Variational Inference in Bayesian State-Space Models. In: Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS). Vol. 151. Proceedings of Machine Learning Research. PMLR, 2022, pp. 8884–8905. url: https://proceedings.mlr.press/v151/wang22g.html.