Published February 10, 2026 | Version 1.0
Software
Open
Asset-Level Location Extraction from Corporate Sustainability PDF Reports
Authors/Creators
Description
This repository contains a Python script (reports_NER_for_pdf.py) that extracts location entities from PDF files and classifies them into asset-related locations and other locations.
The script combines spaCy named-entity recognition (NER) and dependency parsing with optional fuzzy matching via RapidFuzz. An optional BERT-based sentence classification step is planned but not yet implemented. All runtime settings are read from a configuration file (cfg.yml).
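The following is a minimal sketch of the last two stages described above: normalizing a candidate location name against a gazetteer and labeling the mention as asset-related or other. It is not taken from reports_NER_for_pdf.py. The standard-library `difflib` stands in for RapidFuzz so the sketch runs without extra dependencies, a simple cue-word check replaces the script's dependency parsing, and the names `GAZETTEER`, `ASSET_CUES`, `match_gazetteer`, and `classify_sentence`, along with the cutoff value, are all hypothetical.

```python
import difflib

# Hypothetical gazetteer entries; the actual schema of
# geonames_filtered.json is not described in this record.
GAZETTEER = ["Rotterdam", "Houston", "Singapore", "Antwerp"]

# Hypothetical cue words suggesting a sentence describes a physical asset.
ASSET_CUES = {"plant", "mine", "refinery", "facility", "factory", "site"}


def match_gazetteer(name, gazetteer=GAZETTEER, cutoff=0.8):
    """Return the closest gazetteer entry for a candidate name, or None.

    Uses difflib as a stand-in for RapidFuzz (an assumption of this sketch).
    """
    lowered = {entry.lower(): entry for entry in gazetteer}
    hits = difflib.get_close_matches(
        name.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[hits[0]] if hits else None


def classify_sentence(sentence):
    """Label a sentence 'asset' if it contains a cue word, else 'other'.

    A keyword check only; the real script relies on dependency parsing.
    """
    tokens = {tok.strip(".,;:()").lower() for tok in sentence.split()}
    return "asset" if tokens & ASSET_CUES else "other"


sentence = "Our refinery in Roterdam reduced emissions last year."
canonical = match_gazetteer("Roterdam")  # misspelled mention from the text
label = classify_sentence(sentence)
print(canonical, label)
```

In a real pipeline the candidate names would come from spaCy's `LOC`/`GPE` entities rather than being passed in by hand, and the cutoff would presumably be one of the settings read from cfg.yml.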
Files (25.2 MB)
Preview: geonames_filtered.json.zip
| Name | Checksum | Size |
|---|---|---|
| | md5:89eed8bf0e973576a167133e41e5ec36 | 3.8 kB |
| | md5:a1297e15388c7108d1e7d139341c59bb | 64 Bytes |
| | md5:bfd45bd1a14efa3afe2527c9b36574c2 | 1.7 kB |
| | md5:4dc88f3152b0259b338c78d4c3d5595e | 25.2 MB |
| | md5:2e96cd79a8dc0a092f0c5cc25e85cfde | 1.2 kB |
| | md5:f37a4439775ea5453c0b694ba65461e7 | 15.6 kB |
| | md5:710867d2387d47ddce9acfd12c7955ce | 51.8 kB |
| | md5:38c3c78e0a0fd1225feb33e8328be9b9 | 243 Bytes |