Published February 10, 2026 | Version 1.0
Software Open

Asset-Level Location Extraction from Corporate Sustainability PDF Reports

  • 1. Technical University of Munich
  • 2. Ludwig-Maximilians-Universität München
  • 3. University of Oxford
  • 4. Deutsche Bundesbank

Description

This repository contains a Python script (reports_NER_for_pdf.py) that extracts location entities from PDF files and classifies them into asset-related locations and other locations.

The script combines spaCy named-entity recognition (NER) and dependency parsing with optional fuzzy matching via RapidFuzz. BERT-based sentence classification is planned as a further optional step but is not yet implemented. All runtime settings are read from a configuration file (cfg.yml).
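
As a rough illustration of the idea described above, the following minimal sketch labels a location mention as asset-related or other and maps it to a canonical name by fuzzy matching. It is not the code in reports_NER_for_pdf.py: the gazetteer entries, the cue words, the threshold, and the function names fuzzy_match and classify are all hypothetical. The standard-library difflib stands in for RapidFuzz, and a simple keyword test stands in for spaCy NER and dependency parsing, so that the example runs without extra packages.

```python
from difflib import SequenceMatcher

# Hypothetical mini-gazetteer (assumption); the real script presumably
# draws canonical names from geonames_filtered.json.
GAZETTEER = ["Munich", "Frankfurt am Main", "Oxford", "Rotterdam"]

# Hypothetical cue words (assumption) that mark a sentence as asset-related.
ASSET_CUES = {"plant", "factory", "mine", "facility", "refinery", "warehouse"}


def fuzzy_match(name, candidates, threshold=0.85):
    """Return the most similar candidate at or above threshold, else None.

    difflib's similarity ratio is used here as a stand-in for RapidFuzz.
    """
    best, best_score = None, 0.0
    for cand in candidates:
        score = SequenceMatcher(None, name.lower(), cand.lower()).ratio()
        if score > best_score:
            best, best_score = cand, score
    return best if best_score >= threshold else None


def classify(sentence, location):
    """Label a location mention using cue words found in its sentence.

    In a full pipeline the mention would come from an NER model; here it
    is passed in directly to keep the sketch self-contained.
    """
    words = {w.strip(".,;:()").lower() for w in sentence.split()}
    label = "asset" if words & ASSET_CUES else "other"
    return {
        "mention": location,
        "canonical": fuzzy_match(location, GAZETTEER),
        "label": label,
    }


if __name__ == "__main__":
    print(classify("The refinery in Rotterdm was expanded.", "Rotterdm"))
    print(classify("The board met in Oxford.", "Oxford"))
```

The first call shows a misspelled mention being resolved against the gazetteer and labeled as asset-related because of the cue word in its sentence. The second shows an exact match in a sentence without cue words, which falls into the other category.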

Files

geonames_filtered.json.zip

Files (25.2 MB)

Checksum Size
md5:89eed8bf0e973576a167133e41e5ec36 3.8 kB
md5:a1297e15388c7108d1e7d139341c59bb 64 Bytes
md5:bfd45bd1a14efa3afe2527c9b36574c2 1.7 kB
md5:4dc88f3152b0259b338c78d4c3d5595e 25.2 MB
md5:2e96cd79a8dc0a092f0c5cc25e85cfde 1.2 kB
md5:f37a4439775ea5453c0b694ba65461e7 15.6 kB
md5:710867d2387d47ddce9acfd12c7955ce 51.8 kB
md5:38c3c78e0a0fd1225feb33e8328be9b9 243 Bytes