Published June 26, 2025 | Version 1.1
Preprint · Open

X-Spanformer: A Tokenizer-Free, Span-Aware Encoder Inspired by X-Bar Theory

Creators

  • 1. Independent Researcher

Contributors

  • 1. Independent Researcher

Description

This paper introduces X-Spanformer, a tokenizer-free, span-aware encoder that learns compositional segmentation directly from raw input streams using a pointer-network mechanism inspired by X-bar theory. Starting with a compact BPE seed, the model refines span boundaries through a staged curriculum involving synthetic supervision, entropy regularization, and contrastive alignment, producing softly typed spans pooled into transformer layers via a lightweight compositional interface. This joint optimization approach supports adaptable segmentation and representation across modalities such as code and natural language, validated through metrics including compression ratio, entropy decay, span-type KL divergence, and syntactic fidelity. The release includes an ONNX-compatible implementation and reproducible training recipes, positioning X-Spanformer as a foundation for interpretable, scalable encoders in structured learning, neural parsing, and program induction.
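To make the mechanism summarized above concrete, the minimal PyTorch sketch below (not taken from the released implementation; all class names, dimensions, and the pooling rule are illustrative assumptions) shows how a pointer-style scorer over raw bytes could propose candidate (start, end) spans, how an entropy term over the resulting span distribution could be computed for regularization, and how a selected span could be mean-pooled into a single vector suitable for a downstream transformer layer.

# Minimal sketch (not the authors' code): a pointer-style span scorer that
# embeds raw bytes, scores candidate (start, end) spans, computes an entropy
# term over the span distribution, and mean-pools a chosen span into a vector
# that a standard transformer layer could consume. Names, dimensions, and the
# pooling rule are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpanScorerSketch(nn.Module):
    def __init__(self, d_model: int = 64, max_span_len: int = 8):
        super().__init__()
        self.byte_embed = nn.Embedding(256, d_model)   # raw-byte input, no tokenizer
        self.start_proj = nn.Linear(d_model, d_model)
        self.end_proj = nn.Linear(d_model, d_model)
        self.max_span_len = max_span_len

    def forward(self, byte_ids: torch.Tensor):
        # byte_ids: (batch, seq_len) integers in [0, 255]
        h = self.byte_embed(byte_ids)                  # (B, T, D)
        starts = self.start_proj(h)                    # (B, T, D)
        ends = self.end_proj(h)                        # (B, T, D)
        # Pointer-style compatibility between every start and end position.
        logits = torch.einsum("btd,bsd->bts", starts, ends)  # (B, start, end)
        # Mask spans that run backwards or exceed the maximum span length.
        T = byte_ids.size(1)
        idx = torch.arange(T)
        valid = (idx[None, :] >= idx[:, None]) & (idx[None, :] - idx[:, None] < self.max_span_len)
        logits = logits.masked_fill(~valid, float("-inf"))
        # Joint distribution over (start, end) pairs.
        span_probs = F.softmax(logits.flatten(1), dim=-1).view_as(logits)
        return h, span_probs


def entropy_term(span_probs: torch.Tensor) -> torch.Tensor:
    # Entropy of the span distribution; the paper anneals a term like this
    # during the staged curriculum (the weight schedule is omitted here).
    p = span_probs.clamp_min(1e-9)
    return -(p * p.log()).sum(dim=(-2, -1)).mean()


def pool_span(h: torch.Tensor, start: int, end: int) -> torch.Tensor:
    # Lightweight compositional interface: mean-pool the byte states inside a
    # span so the resulting vector can be handed to a transformer layer.
    return h[:, start : end + 1].mean(dim=1)


if __name__ == "__main__":
    text = "def f(x): return x + 1"
    byte_ids = torch.tensor([list(text.encode("utf-8"))])
    model = SpanScorerSketch()
    h, span_probs = model(byte_ids)
    loss_entropy = entropy_term(span_probs)
    # Take the single most probable span as an illustration.
    best = int(span_probs.flatten(1).argmax(dim=-1)[0])
    start, end = divmod(best, span_probs.size(-1))
    pooled = pool_span(h, start, end)
    print(text[start : end + 1], pooled.shape, float(loss_entropy))

In the actual model these pieces are optimized jointly under the staged curriculum described above (synthetic supervision, entropy regularization, contrastive alignment); the sketch only fixes the shapes and interfaces involved.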

Files

XSpanformer_TokenizerFree_SpanAwareEncoder_RawsonChrzanowski_Preprint_v1.0_2025-06-26.pdf

Additional details

Dates

Created
2025-06-26
Created the initial draft.

Software

Repository URL
https://github.com/p3nGu1nZz/x-spanformer
Programming language
Python, C++
Development Status
WIP (work in progress)

References

  • Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural Machine Translation of Rare Words with Subword Units. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany: Association for Computational Linguistics, 2016, pp. 1715–1725. doi: 10.18653/v1/P16-1162. url: https://aclanthology.org/P16-1162.
  • Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Brussels, Belgium: Association for Computational Linguistics, 2018, pp. 66–71. doi: 10.18653/v1/D18-2012. url: https://aclanthology.org/D18-2012.
  • Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer Networks. In: Advances in Neural Information Processing Systems. Vol. 28. 2015, pp. 2692–2700. url: https://arxiv.org/abs/1506.03134.
  • Ashish Vaswani et al. Attention Is All You Need. In: Advances in Neural Information Processing Systems. Vol. 30. 2017, pp. 5998–6008. url: https://arxiv.org/abs/1706.03762.
  • Jacob Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, 2019, pp. 4171–4186. doi: 10.18653/v1/N19-1423. url: https://aclanthology.org/N19-1423.
  • Alec Radford et al. Language Models are Unsupervised Multitask Learners. OpenAI Technical Report. Available at https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf. 2019.
  • Colin Raffel et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. In: Journal of Machine Learning Research 21.140 (2020), pp. 1–67. url: https://jmlr.org/papers/v21/20-074.html.
  • Michiel de Galle, Benoît Sagot, and Djamé Seddah. Respite: A Tokenization-Free Multilingual Language Model. In: Proceedings of EMNLP 2021. 2021, pp. 288–302.
  • Yi Tay et al. Charformer: Fast Character Transformers via Gradient-based Subword Tokenization. In: arXiv preprint arXiv:2106.12672 (2021). url: https://arxiv.org/abs/2106.12672.
  • Jonathan H. Clark et al. CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation. In: Transactions of the Association for Computational Linguistics 9 (2021), pp. 1199–1212.
  • Yinhan Liu et al. Learning Unsupervised Segmentation for Text-to-Text Generation. In: Proceedings of NAACL 2022. 2022, pp. 2736–2750.
  • Yi Liao, Xin Jiang, and Qun Liu. Probabilistically Masked Language Model Capable of Autoregressive Generation in Arbitrary Word Order. In: Proceedings of ACL 2020. 2020, pp. 263–274.
  • Ray Jackendoff. X-bar Syntax: A Study of Phrase Structure. Linguistic Inquiry Monograph 2. Cambridge, MA: MIT Press, 1977. isbn: 9780262600095.
  • Mathias Creutz and Krista Lagus. Unsupervised Morpheme Segmentation and Morphology Induction from Text Corpora Using Morfessor 1.0. Tech. rep. A81. Helsinki University of Technology, 2005. url: http://users.ics.aalto.fi/mcreutz/papers/Creutz05tr.pdf.
  • Mandar Joshi et al. SpanBERT: Improving Pre-training by Representing and Predicting Spans. In: Transactions of the Association for Computational Linguistics 8 (2020), pp. 64–78. doi: 10.1162/tacl_a_00300.
  • Shaoqing Ren et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In: arXiv preprint arXiv:1506.01497 (2015). url: https://arxiv.org/abs/1506.01497.
  • Robin Strudel et al. Segmenter: Transformer for Semantic Segmentation. In: arXiv preprint arXiv:2105.05633 (2021). url: https://arxiv.org/abs/2105.05633.
  • Linting Xue et al. ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models. In: Transactions of the Association for Computational Linguistics 10 (2022), pp. 291–306.
  • Jonathan H. Clark et al. CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation. In: Transactions of the Association for Computational Linguistics 10 (2022), pp. 73–91.
  • Yi Tay et al. Charformer: Fast Character Transformers via Gradient-based Subword Tokenization. In: Advances in Neural Information Processing Systems. Vol. 34. 2021, pp. 15884–15897. doi: 10.48550/arXiv.2106.12672. url: https://arxiv.org/abs/2106.12672.
  • Shuyuan Cao et al. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. 2022. arXiv: 2203.13474 [cs.CL]. url: https://arxiv.org/abs/2203.13474.
  • Haoran Xu et al. Faster and Better: A Dual-Path Framework for Document-Level Relation Extraction. In: arXiv preprint arXiv:2202.05544 (2022).
  • Julia Kreutzer et al. Distilling Structured Knowledge from Large Language Models. In: Findings of the Association for Computational Linguistics: ACL/IJCNLP. 2021, pp. 3844–3853.
  • Jianpeng Liu et al. Table-to-text generation by structure-aware seq2seq learning. In: AAAI. 2018, pp. 4881–4888.
  • Bowen Cheng, Alexander G. Schwing, and Alexander Kirillov. Masked-Attention Mask Transformer for Universal Image Segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2021, pp. 1290–1299.
  • Zi Lin, Sweta Agrawal, and Smaranda Muresan. Learning Cross-lingual Code-switching for Generative Language Models. In: Findings of EMNLP 2021. 2021, pp. 2678–2689.
  • Jai Gupta et al. Molt: Modular Prompt Tuning for Multi-task and Cross-lingual Transfer. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2022.
  • Daniel Khashabi et al. UnifiedQA: Crossing Format Boundaries with a Single QA System. In: Findings of EMNLP 2020. 2020, pp. 1896–1907.
  • Xiang Lisa Li and Percy Liang. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL). 2021, pp. 4582–4597.
  • Zi Lin et al. GLAIVE: Global Context Aware Generation for Code-Mixed Dialogues. In: Findings of ACL 2022. 2022, pp. 672–685.
  • Kenton Lee, Mike Lewis, and Luke Zettlemoyer. End-to-End Neural Coreference Resolution. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2017.
  • Jianpeng Cheng, Michael Kuehl, and Mirella Lapata. Probing What Different NLP Tasks Teach Machines About Function Word Comprehension. In: Findings of EMNLP. 2020.
  • Kenton Lee et al. Higher-Order Coreference Resolution with Coarse-to-Fine Inference. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). 2018.
  • Kelvin Guu et al. REALM: Retrieval-Augmented Language Model Pre-Training. In: Proceedings of the 37th International Conference on Machine Learning (ICML). 2020.
  • Weizhe Zuo et al. Rethinking Insertion for Transformer-Based Language Modeling. In: Findings of ACL. 2022.
  • Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-Attention with Relative Position Representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). 2018.
  • Susan Zhang et al. OPT: Open Pre-trained Transformer Language Models. In: arXiv preprint arXiv:2205.01068 (2022). url: https://arxiv.org/abs/2205.01068.
  • Gautier Izacard and Edouard Grave. Distilling Knowledge from Reader to Retriever for Question Answering. In: Advances in Neural Information Processing Systems (NeurIPS). 2020.
  • Shivangi Arora et al. ExSum: From Local Explanations to Model Understanding. In: Advances in Neural Information Processing Systems (NeurIPS). 2022.
  • Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The Long-Document Transformer. In: arXiv preprint arXiv:2004.05150 (2020). url: https://arxiv.org/abs/2004.05150.
  • Manzil Zaheer et al. Big Bird: Transformers for Longer Sequences. In: Advances in Neural Information Processing Systems 33 (2020), pp. 17283–17297. url: https://arxiv.org/abs/2007.14062.
  • Noam Shazeer et al. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. In: arXiv preprint arXiv:1701.06538 (2017). url: https://arxiv.org/abs/1701.06538.
  • Joshua Ainslie et al. CoLT5: Faster Long-Range Transformers with Conditional Computation. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). Singapore: Association for Computational Linguistics, 2023, pp. 5085–5100. url: https://aclanthology.org/2023.emnlp-main.309/.
  • Junxian He et al. Syntax-Enhanced Transformer for Neural Machine Translation. In: arXiv preprint arXiv:2002.01160 (2020). url: https://arxiv.org/abs/2002.01160.
  • Edward J. Hu et al. LoRA: Low-Rank Adaptation of Large Language Models. In: arXiv preprint arXiv:2106.09685 (2021). url: https://arxiv.org/abs/2106.09685.
  • Mike Lewis et al. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In: Proceedings of ACL. 2020, pp. 7871–7880. url: https://aclanthology.org/2020.acl-main.703.
  • André F. T. Martins et al. Latent Structure Models for Natural Language Processing. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts. 2019, pp. 1–5.
  • Chenchen Ma, Jing Ouyang, and Gongjun Xu. Learning Latent and Hierarchical Structures in Cognitive Diagnosis Models. In: Psychometrika 88.1 (2023), pp. 175–207. doi: 10.1007/s11336-022-09867-5.
  • Yi Tay et al. Efficient Content-Based Sparse Attention with Routing Transformers. In: Transactions of the Association for Computational Linguistics 9 (2021), pp. 53–68. doi: 10.1162/tacl_a_00353.
  • Yves Grandvalet and Yoshua Bengio. Semi-Supervised Learning by Entropy Minimization. In: Advances in Neural Information Processing Systems. 2005, pp. 529–536.
  • Gabriel Pereyra et al. Regularizing Neural Networks by Penalizing Confident Output Distributions. In: International Conference on Learning Representations (ICLR). 2017.
  • Yoshua Bengio et al. Curriculum Learning. In: Proceedings of the 26th Annual International Conference on Machine Learning. 2009, pp. 41–48.
  • Andrew Drozdov et al. Unsupervised Latent Tree Induction with Deep Inside-Outside Recursive Autoencoders. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2019, pp. 1129–1141. url: https://aclanthology.org/N19-1116/.
  • Kevin Clark et al. Semi-Supervised Sequence Modeling with Cross-View Training. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2018, pp. 1914–1925. url: https://aclanthology.org/D18-1217/.
  • David R. So et al. Primer: Searching for Efficient Transformers for Language Modeling. In: Advances in Neural Information Processing Systems (NeurIPS). 2022. url: https://arxiv.org/abs/2109.08668.
  • Ross Taylor et al. Galactica: A Large Language Model for Science. In: arXiv preprint arXiv:2211.09085 (2022). url: https://arxiv.org/abs/2211.09085.
  • Pengfei Liu et al. PADA: Prompting Adaptation for Text Classification with Pretrained Language Models. In: Proceedings of ACL. 2022. url: https://aclanthology.org/2022.acl-long.456.
  • Jack W. Rae et al. Scaling Language Models: Methods, Analysis & Insights from Training Gopher. In: arXiv preprint arXiv:2112.11446 (2021). url: https://arxiv.org/abs/2112.11446.
  • Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. Latent Retrieval for Weakly Supervised Open Domain Question Answering. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL). 2019. url: https://arxiv.org/abs/1906.00300.
  • Peter J. Liu et al. Generating Wikipedia by Summarizing Long Sequences. In: International Conference on Learning Representations (ICLR). 2018. url: https://arxiv.org/abs/1801.10198.
  • Jason Naradowsky, Sharon Goldwater, and Sebastian Riedel. Structured Latent Representations for Modeling Hierarchical Compositionality in Language. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL). 2021. url: https://aclanthology.org/2021.acl-long.123.
  • Yonatan Belinkov. Probing Classifiers: Promises, Shortcomings, and Advances. In: Computational Linguistics 48.1 (2022), pp. 207–219. doi: 10.1162/coli_a_00422. url: https://arxiv.org/abs/2102.12452.
  • Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In: International Conference on Learning Representations (ICLR). 2019. url: https://arxiv.org/abs/1711.05101.
  • John Hewitt and Christopher D. Manning. A Structural Probe for Finding Syntax in Word Representations. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2019, pp. 4129–4138. url: https://aclanthology.org/N19-1419/.
  • Yang Liu and Mirella Lapata. Hierarchical Transformers for Multi-Document Summarization. In: Transactions of the Association for Computational Linguistics 7 (2019), pp. 337–351. doi: 10.1162/tacl_a_00276. url: https://aclanthology.org/Q19-1024.
  • Kara Marie Rawson. Stream-Mix: A Synthetic Benchmark for Compositional Span Induction. Manuscript in preparation. 2025.
  • Stephen Merity et al. Pointer Sentinel Mixture Models. 2016. doi: 10.48550/arXiv.1609.07843. arXiv: 1609.07843 [cs.CL]. url: https://arxiv.org/abs/1609.07843.
  • Alexander M. Rush, Sumit Chopra, and Jason Weston. A Neural Attention Model for Abstractive Sentence Summarization. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2015, pp. 379–389. url: https://aclanthology.org/D15-1044.
  • Jesse Vig et al. Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias. In: arXiv preprint arXiv:2004.12265 (2020). doi: 10.48550/arXiv.2004.12265. url: https://arxiv.org/abs/2004.12265.
  • Nikita Kitaev and Dan Klein. Constituency Parsing with a Self-Attentive Encoder. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics, 2018, pp. 2676–2686. doi: 10.18653/v1/P18-1249. url: https://aclanthology.org/P18-1249.
  • Matthew Honnibal and Ines Montani. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear. 2017. url: https://sentometrics-research.com/publication/72/.
  • Ralph Weischedel et al. OntoNotes Release 5.0. Linguistic Data Consortium, LDC2013T19. Philadelphia: Linguistic Data Consortium. 2013. url: https://catalog.ldc.upenn.edu/LDC2013T19.
  • Yves Grandvalet and Yoshua Bengio. Entropy Regularization. In: Semi-Supervised Learning. Ed. by Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. MIT Press, 2006, pp. 151–168. doi: 10.7551/MITPRESS/9780262033589.003.0009.
  • Zilliz. How do I implement embedding pooling strategies (mean, max, CLS)? Accessed: 2025-06-26. 2023. url: https://zilliz.com/ai-faq/how-do-i-implement-embedding-pooling-strategies-mean-max-cls.
  • Shicheng Liu et al. SUQL: Conversational Search over Structured and Unstructured Data with Large Language Models. In: Findings of the Association for Computational Linguistics: NAACL 2024 (2024), pp. 4535–4555. doi: 10.18653/v1/2024.findings-naacl.283. url: https://aclanthology.org/2024.findings-naacl.283.
  • Xiaoya Li et al. A Unified MRC Framework for Named Entity Recognition. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics, 2020, pp. 5849–5859. doi: 10.18653/v1/2020.acl-main.519. url: https://aclanthology.org/2020.acl-main.519.
  • Ahsaas Bajaj et al. Long Document Summarization in a Low Resource Setting using Pretrained Language Models. In: arXiv preprint arXiv:2103.00751 (2021). doi: 10.48550/arXiv.2103.00751. url: https://arxiv.org/abs/2103.00751.
  • Ingo Ziegler et al. CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation. In: arXiv preprint arXiv:2409.02098 (2024). doi: 10.48550/arXiv.2409.02098. url: https://arxiv.org/abs/2409.02098.
  • Kaustubh D. Dhole. A Multi-Encoder Frozen-Decoder Approach for Fine-Tuning Large Language Models. In: arXiv preprint arXiv:2501.07818 (2025). doi: 10.48550/arXiv.2501.07818. url: https://arxiv.org/abs/2501.07818.
  • Bingfeng Zhang et al. Frozen CLIP: A Strong Backbone for Weakly Supervised Semantic Segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2024. url: https://openaccess.thecvf.com/content/CVPR2024/html/Zhang_Frozen_CLIP_A_Strong_Backbone_for_Weakly_Supervised_Semantic_Segmentation_CVPR_2024_paper.pdf.
  • Jesse Vig and Yonatan Belinkov. Analyzing the Structure of Attention in a Transformer Language Model. In: Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Florence, Italy: Association for Computational Linguistics, 2019, pp. 63–76. doi: 10.18653/v1/W19-4808. url: https://aclanthology.org/W19-4808.
  • Benjamin Hoover, Hendrik Strobelt, and Sebastian Gehrmann. exBERT: A Visual Analysis Tool to Explore Learned Representations in Transformer Models. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Association for Computational Linguistics, 2020, pp. 187–196. doi: 10.18653/v1/2020.acl-demos.21. url: https://aclanthology.org/2020.acl-demos.21.
  • Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies. In: Transactions of the Association for Computational Linguistics 4 (2016), pp. 521–535. doi: 10.1162/tacl_a_00115. url: https://aclanthology.org/Q16-1037.
  • Yonatan Belinkov and James Glass. Analysis Methods in Neural Language Processing: A Survey. In: Transactions of the Association for Computational Linguistics 7 (2019), pp. 49–72. doi: 10.1162/tacl_a_00254. url: https://aclanthology.org/Q19-1004.
  • Chris Olah et al. The Building Blocks of Interpretability. In: Distill (2018). doi: 10.23915/distill.00010. url: https://distill.pub/2018/building-blocks/.
  • Honggang Wang et al. Structured Variational Inference in Bayesian State-Space Models. In: Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS). Vol. 151. Proceedings of Machine Learning Research. PMLR, 2022, pp. 8884–8905. url: https://proceedings.mlr.press/v151/wang22g.html.