Large oil and gas industry text dataset from Norwegian, UK and Dutch public oil and gas documents
Authors/Creators
Description
This is a large dataset of text extracted from public oil and gas documents, prepared in the run-up to the FORCE 2023 Large Language Model Hackathon in Stavanger, Norway.
The dataset is unique in that it contains the largest public collection of extracted text from OCR'ed oil and gas documents currently available. It was created with the aim of embedding more knowledge from oil and gas documents in language models.
Additionally, each extracted page has been classified as either real text or mostly gibberish.
Personally identifiable information has been removed as far as possible.
A file with 1500 hand-classified pages is included in the upload so that text classifiers can be trained further.
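As a rough illustration of how the page classification might be used, the sketch below streams a CSV file row by row and keeps only the pages labelled as real text. The column names `text` and `label`, the label value `text`, and the function name `iter_real_text_pages` are assumptions made for this example; the actual column names in the files are not stated here and should be checked before use.

```python
import csv

# OCR'ed pages can exceed the csv module's default field size limit,
# so raise it. Note that this setting is global for the csv module.
csv.field_size_limit(2**31 - 1)


def iter_real_text_pages(csv_path, text_column="text",
                         label_column="label", keep_label="text"):
    """Yield the text of every row whose label equals keep_label.

    The column names and label value are hypothetical placeholders.
    Rows are read one at a time, so the whole file is never held in memory.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            if row.get(label_column) == keep_label:
                yield row[text_column]
```

Streaming avoids loading a multi-gigabyte file at once, which matters for the largest file in the upload.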
Files
Netherlands - Netherlands Oil & Gas Portal reports.csv
Total size: 6.1 GB
| Checksum | Size |
|---|---|
| md5:d5a40381aaee02acde189520efe8e151 | 3.6 MB |
| md5:efc83a69340cb57c692804c5d57f8b24 | 166.3 MB |
| md5:9d76003330ff13f869b5b02925d5ca2f | 5.7 GB |
| md5:f6ca32637b9117b2f18b5578c02b55be | 24.7 MB |
| md5:00b02f526545961934b23a085d8a0417 | 198.7 MB |
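The MD5 checksums listed for the files can be used to confirm that a download completed without corruption. The following is a minimal sketch using only the Python standard library; the function names `md5_of_file` and `matches_listing` are illustrative, and the local file path is whatever the file was saved as.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed_checksum):
    """Compare a file with a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed_checksum.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `matches_listing(local_path, "md5:9d76003330ff13f869b5b02925d5ca2f")` should return `True` for an intact copy of the 5.7 GB file, where `local_path` is the location it was saved to.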
Additional details
Dates
- Available: 2024-03-03