Dataset Open Access

Extract from the Berlin State Library's Main Catalog

Zellhöfer, David

The data set is based on the main catalog of the library. Currently, the following fields are extracted:

  • title
  • author (+ optional GND ID)
  • publisher
  • place of publication
  • country of publication
  • year of publication

The extract has been created by the processPicaPlus script available here. Note that some special characters might not have been extracted correctly in versions earlier than 1.0.0.

Change Log:

  • 0.2.0: fixes various encoding issues for non-ASCII characters
  • 0.3.0: added year of publication; added separate files for languages: rus, pol, rum, cze, slo, gre with minor encoding issues

Dataset Characteristics

The following languages are available in separate data files:

  • eng
  • ger
  • lat
  • fre
  • ita
  • spa
  • por
  • dut
  • swe
  • dan
  • nor
  • ice
  • fry
  • rus*
  • pol*
  • rum*
  • cze*
  • slo*
  • gre*

*: Language file might be subject to character encoding issues.

The remaining languages are present in the data set but have not been separated into individual files; they are combined in one data file, which contains the following language codes:

'fre', 'rus', 'pol', 'ger', 'eng', 'lit', 'dan', 'dut', 'spa', 'swe', 'ita', 'lat', 'nor', 'ind', 'bul', 'grc', 'fry', 'rum', 'cze', 'slo', 'bel', 'ice', 'fin', 'gre', 'hun', 'tur', 'enm', 'hrv', 'est', 'srp', 'roh', 'syr', 'wen', 'mal', 'afr', 'slv', 'mac', 'smi', 'nds', 'qmw', 'pra', 'oci', 'bre', 'san', 'alb', 'baq', 'non', 'ara', 'chm', 'per', 'cat', 'gmh', 'sla', 'arm', 'ukr', 'por', 'chu', 'heb', 'arc', 'gle', 'tib', 'lav', 'geo', 'crp', 'hin', 'mul', 'chi', 'epo', 'kor', 'kan', 'vot', 'csb', 'glg', 'kaz', 'frm', 'jpn', 'bur', 'srd', 'sal', 'ira', 'bos', 'mol', 'rom', 'tat', 'aze', 'yid', 'mar', 'mak', 'pli', 'rys', 'tgk', 'map', 'vie', 'tuk', 'oss', 'ota', 'tut', 'ben', 'sun', 'tir', 'bak', 'chv', 'ber', 'khm', 'may', 'pan', 'uzb', 'swa', 'kir', 'egy', 'dum', 'nep', 'cop', 'mon', 'tam', 'urd', 'zxx', 'wel', 'mis', 'ng', 'goh', 'dt', 'en', 'fao', 'fro', 'pus', 'kur', 'cus', 'hau', 'uig', 'sit', 'dt.', 'cpf', 'tgl', 'qoj', 'tag', 'raj', 'fiu', 'xal', 'kbd', 'udm', 'scr', 'gag', 'kas', 'scc', 'pro', 'tha', 'dar', 'dr', 'sna', 'ewe', 'de', 'dra', 'ang', 'ine', 'zza', 'und', 'ave', 'amh', 'crh', 'jav', 'cpe', 'akk', 'dsb', 'qce', 'guj', 'ltz', 'got', 'bua', 'peo', 'mdr', 'nob', 'ava', 'che', 'sux', 'kok', 'zap', 'nl', 'inc', 'sah', 'gem', 'law', 'bem', 'sin', 'qdo', 'hsb', 'som', 'lao', 'kam', 'kom', 'abk', 'roa', 'cau', 'ady', 'bat', 'mlt', 'sai', 'xho', 'paa', 'sot', 'bnt', 'lug', 'myn', 'kar', 'qhe', 'kin', 'zul', 'tsn', 'apa', 'nso', 'yao', 'yor', 'bih', 'nog', 'nap', 'loz', 'nbl', 'kon', 'nya', 'snh', 'chn', 'run', 'suk', 'fur', 'osa', 'bra', 'den', 'kpe', 'kal', 'tig', 'wol', 'gla', 'lad', 'mos', 'cre', 'krc', 'ge', 'fr', 'dak', 'fij', 'mad', 'srr', 'kum', 'her', 'nai', 'cel', 'inh', 'kro', 'hit', 'pal', 'tmh', 'tsw', 'bam', 'kab', 'kik', 'kua', 'lub', 'luo', 'nub', 'tem', 'znd', 'mai', 'tai', 'qkr', 'ful', 'man', 'lol', 'sag', 'tog', 'hai', 'arg', 'fat', 'nav', 'niu', 'ibo', 'ido', 'men', 'qju', 'gaa', 'vol', 'nah', 'mlg', 'nic', 'ijo', 'sus', 'orm', 
'smo', 'mag', 'tyv', 'mnc', 'cos', 'mdf', 'kaa', 'dua', 'gez', 'ton', 'ven', 'snd', 'syc', 'nym', 'nia', 'sem', 'chg', 'fan', 'twi', 'mas', 'ina', 'ile', 'art', 'ori', 'qai', 'arw', 'mao', 'bas', 'kmb', 'tiv', 'bal', 'tar', 'tpi', 'abs', 'asm', 'qqa', 'iku', 'min', 'rup', 'tel', 'or', 'tah', 'aka', 'day', 'qqg', 'lah', 'lus', 'sio', 'oto', 'alg', 'shn', 'ndo', 'haw', 'tso', 'mus', 'cai', 'qev', 'new', 'zha', 'grn', 'khi', 'ssw', 'nde', 'bla', 'grb', 'mun', 'din', 'sam', 'mwr', 'cor', 'sat', 'cho', 'ger,', 'que', 'btk', 'glv', 'rar', 'jk', 'nno', 'cmc', 'mga', 'jw', 'iro', 'sog', 'hat', 'dzo', 'mkh', 'bik', 'ban', 'ilo', 'pam', 'ts', 'sme', 'myv', 'qnn', 'jpr', 'qte', 'yap', 'bis', 'sga', 'qkj', 'pap', 'ath', 'ipk', 'phi', 'sco', 'del', 'moh', 'iri', 'gae', 'ryl', 'our', 't--', 'grk', 'ssa', 'awa', 'efi', 'jrb', 'enk', 'kru', 'oji', 'arn', 'car', 'gsw', 'lez', 'war', 'ace', 'qrn', 'wln', 'ceb', 'aar', 'bug', 'kaw', 'chr', 'cpp', 'tet', 'aym', 'ces', 'hmo'

Files (1.9 GB)
Name            Size      MD5
cze_out.txt     7.0 MB    9732d372f8ba9516dc64d988b7311152
dan_out.txt     3.6 MB    e582e959f5f85795ee2e968226491e2d
dut_out.txt     14.2 MB   fc6785edcb4c3e4d8977a7d21c87001a
eng_out.txt     328.2 MB  77574c7d25fe0f202fc104528485d5a2
fre_out.txt     75.5 MB   5c54395a842deff86d2cecf8238ef259
fry_out.txt     67.5 kB   999f2ab3f52c5d249568c2f20d88471e
ger_out.txt     490.4 MB  fcdee9d86f2702a06cc912557aa09a9d
gre_out.txt     1.2 MB    7b85ff65d905369b76b01812d6f588de
ice_out.txt     189.9 kB  4dae71c7fabbbdc8e7d17b9f8c9e9f36
ita_out.txt     30.1 MB   8b57f0be40288cb0008a1871ef23b97c
lat_out.txt     65.9 MB   bf4f5a62fc94384a8162badda0301b39
nor_out.txt     2.0 MB    02fea1cb9c8877e66c8295604a86f5c8
out.txt         813.4 MB  34622f9d23f1550c5546c390c80f51b8
pol_out.txt     13.5 MB   7da7f8dbb963d5429d1da677714d252a
por_out.txt     1.6 MB    b980f399670bac0d7fea24e5995d4eb3
rum_out.txt     1.7 MB    44a4cc16658cd631a164591167c63603
rus_out.txt     39.0 MB   ee805132b692b9d11226614ed85c3c18
slo_out.txt     1.8 MB    56443d3eea99bd2d6db36f02ded6efb8
spa_out.txt     7.9 MB    e30af9a14d1e04ed4899c2039f628674
statistics.txt  6.5 kB    9e55ba7ddf93fe050d138ddefe210426
swe_out.txt     5.8 MB    b3f99c3795809b0e34e6ae4b6545c7d5
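
The MD5 checksums listed for each file can be used to confirm that a download is complete and uncorrupted. The following is a minimal sketch of such a check, not part of the data set or of the processPicaPlus script: the function names `md5_of_file` and `verify` and the `EXPECTED_MD5` table are illustrative assumptions, and only three of the listed files are included. It also assumes the files were saved locally under their original names.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing (subset only; extend as needed).
EXPECTED_MD5 = {
    "fry_out.txt": "999f2ab3f52c5d249568c2f20d88471e",
    "ice_out.txt": "4dae71c7fabbbdc8e7d17b9f8c9e9f36",
    "statistics.txt": "9e55ba7ddf93fe050d138ddefe210426",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks so that
    large files (e.g. several hundred MB) need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Compare a local file against the checksum registered for its name.

    Raises KeyError if the file name has no entry in `expected`."""
    name = Path(path).name
    return md5_of_file(path) == expected[name]
```

Assuming `fry_out.txt` has been downloaded to the working directory, `verify("fry_out.txt")` should return `True` for an intact copy and `False` otherwise.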