Published March 8, 2026 | Version v1
Software Open

Extraction of Abstracts from arXiv TEI XML Files

Authors/Creators

  • 1. Universidad Politécnica de Madrid

Description

This repository contains code to extract abstracts from TEI XML files of arXiv papers.

The project includes:
- Python script for parsing XML
- Dockerfile for reproducible execution
- Sample dataset
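
The archive's own script is not reproduced on this page, so the following is only a minimal sketch of how an abstract might be pulled from a TEI XML file. It assumes the standard TEI namespace (`http://www.tei-c.org/ns/1.0`) and that the abstract sits under `teiHeader/profileDesc/abstract`, as in typical GROBID output. The function name `extract_abstract` and the sample document are hypothetical and are not taken from the repository.

```python
import xml.etree.ElementTree as ET

# Assumption: files use the standard TEI namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample document, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <profileDesc>
      <abstract><div><p>We study X.</p><p>Results show Y.</p></div></abstract>
    </profileDesc>
  </teiHeader>
</TEI>"""


def extract_abstract(tei_xml: str) -> str:
    """Return the abstract text of a TEI document, or "" if none is found."""
    root = ET.fromstring(tei_xml)
    # Assumption: the abstract lives under profileDesc in the TEI header.
    abstract = root.find(".//tei:profileDesc/tei:abstract", TEI_NS)
    if abstract is None:
        return ""
    # Collect text from all nested elements, then collapse whitespace.
    text = " ".join(abstract.itertext())
    return " ".join(text.split())


if __name__ == "__main__":
    print(extract_abstract(SAMPLE))
```

To process a file on disk, the same function could be called on the file's contents, for example `extract_abstract(open(path, encoding="utf-8").read())`, where `path` points at one of the TEI XML files.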

Files

text-extraction-analysis-grobid.zip (61.2 MB)
md5:efe1dd9f7bd880f04d9cb4e5c3748804