Published August 3, 2022 | Version v1
Dataset Open

Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft

Contributors

  • 1. Humboldt-Universität zu Berlin
  • 2. University of Kassel

Description

  • Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.
  • Data_Cleaning.ipynb – The Jupyter Notebook with Python code for the analysis and cleaning of the original dataset.
  • ger_train.csv – The German training set as CSV file.
  • ger_validation.csv – The German validation set as CSV file.
  • en_test.csv – The English test set as CSV file.
  • en_train.csv – The English training set as CSV file.
  • en_validation.csv – The English validation set as CSV file.
  • splitting.py – The Python code for splitting a dataset into training, test and validation sets.
  • DataSetTrans_de.csv – The final German dataset as a CSV file.
  • DataSetTrans_en.csv – The final English dataset as a CSV file.
  • translation.py – The Python code for translating the cleaned dataset.
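
The record does not say how splitting.py divides the data, so the following is only a minimal sketch of one common way to produce training, validation and test sets. The function name `split_dataset`, the 80/10/10 ratios, the fixed seed and the toy rows are all assumptions made for illustration and are not taken from the published script.

```python
import random


def split_dataset(rows, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle rows and cut them into train, validation and test lists.

    The fractions and the seed are illustrative assumptions, not values
    documented for splitting.py.
    """
    if not 0 < train_frac + val_frac < 1:
        raise ValueError("train_frac + val_frac must lie strictly between 0 and 1")

    # Copy before shuffling so the caller's list is left untouched.
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)

    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains becomes the test set
    return train, validation, test


# Toy stand-in for rows read from one of the CSV files (hypothetical columns).
rows = [{"id": i, "text": f"document {i}"} for i in range(100)]
train, validation, test = split_dataset(rows)
```

A plain random shuffle ignores class balance. If the classification labels are unevenly distributed, a stratified split may be the better choice.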

Files

Files (866.4 MB)

Checksum Size
md5:33c171cb8be54533dc94c4f27d426960 166.7 MB
md5:def2b046b3f4617db27945983817ad30 397.8 kB
md5:284adf9cb77c07cbc51c0e1520ff3894 184.5 MB
md5:c7cff475f4ae44819259f645476e2758 165.0 MB
md5:a4491c0d2fca781eaa93ff66c64bcb01 36.6 MB
md5:db8dbc1005a119f4a5017b9202c54513 92.3 MB
md5:932e2f06114c58fa76341b13e4609464 36.2 MB
md5:8b2a07697b869d3ec74e95ff89576289 40.8 MB
md5:2e7f7d370b1d6a7f24490d2d8f788064 103.2 MB
md5:1924c883a2bd1fee39d72a39f6ea9311 40.6 MB
md5:f7b2240dc5f9a6c92849f94c463e16ca 5.2 kB
md5:f0cbe75a49a888d5804e030504f8dc71 2.0 kB