Published March 3, 2024 | Version v1
Dataset Open

Large Oil and Gas industry text dataset from Norwegian, UK and Dutch public oil and gas documents

Description

This is a large dataset of text extracted from public oil and gas documents, prepared in the run-up to the FORCE 2023 Large Language Model Hackathon in Stavanger, Norway.

The dataset is unique in that it contains the largest public collection of text extracted from OCR'ed oil and gas documents currently available. It was created with the aim of embedding more knowledge from oil and gas documents in language models.
Additionally, the extracted pages have been classified as either real text or mostly gibberish.
Personally identifiable information has been removed as far as possible.
A file with 1500 hand-classified pages is included in the upload so that text classifiers can be trained further.
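
The real-text versus gibberish classification can be used to filter pages before they are fed to a language model. The following is a minimal sketch only: the column names `text` and `label` and the label value `gibberish` are assumptions made for illustration and are not taken from the dataset, so check the actual CSV header before using it. Rows are read one at a time, which matters because the largest file is several gigabytes.

```python
import csv
import io
from typing import Iterator, TextIO

# Hypothetical column names and label value: check the real CSV header first.
TEXT_COLUMN = "text"
LABEL_COLUMN = "label"
GIBBERISH_LABEL = "gibberish"


def iter_clean_pages(handle: TextIO) -> Iterator[str]:
    """Yield the page text of every row that is not labelled as gibberish."""
    reader = csv.DictReader(handle)
    for row in reader:  # streams row by row, never loads the whole file
        if row[LABEL_COLUMN].strip().lower() != GIBBERISH_LABEL:
            yield row[TEXT_COLUMN]


# Tiny in-memory stand-in for one of the CSV files (made-up rows).
sample = io.StringIO(
    "text,label\n"
    '"The well was drilled to a total depth of 3200 m.",text\n'
    '"l1| .,;~ ~~ .. iI1l",gibberish\n'
)
clean_pages = list(iter_clean_pages(sample))
print(len(clean_pages))
```

For a downloaded file, the same function could be called with a handle from `open(path, newline="", encoding="utf-8")`; the encoding is also an assumption.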

Files

Netherlands - Netherlands Oil & Gas Portal reports.csv

Files (6.1 GB)

MD5 checksum    Size
md5:d5a40381aaee02acde189520efe8e151    3.6 MB
md5:efc83a69340cb57c692804c5d57f8b24    166.3 MB
md5:9d76003330ff13f869b5b02925d5ca2f    5.7 GB
md5:f6ca32637b9117b2f18b5578c02b55be    24.7 MB
md5:00b02f526545961934b23a085d8a0417    198.7 MB
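
The MD5 checksums listed above can be used to confirm that a download completed without corruption. The sketch below is one way to do that with Python's standard library; the function names `md5_of_file` and `matches_listing` are illustrative, not part of the record. The file is hashed in chunks so that the multi-gigabyte file does not have to fit in memory.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path: Path, listed: str) -> bool:
    """Compare a file against a checksum written as 'md5:<hex digest>'."""
    expected = listed.strip().lower().removeprefix("md5:")
    return md5_of_file(path) == expected
```

To use it, pass the path of a downloaded file together with the checksum shown for that file in the listing; a `False` result suggests the download should be repeated. `str.removeprefix` requires Python 3.9 or later.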

Additional details

Dates

Available
2024-03-03