Published March 11, 2019 | Version 0.3.0
Dataset Open

Extract from the Berlin State Library's Main Catalog

  • 1. Staatsbibliothek zu Berlin

Description

The data set is based on the main catalog of the library. Currently, the following fields are extracted:

  • title
  • author (+ optional GND ID)
  • publisher
  • place of publication
  • country of publication
  • year of publication

The extract was created with the processPicaPlus script available here. Note that some special characters may not have been extracted correctly in versions earlier than 1.0.0.

Change Log:

0.2.0

fixed various encoding issues for non-ASCII characters

0.3.0

added year of publication; added separate files for the languages rus, pol, rum, cze, slo, and gre, which may contain minor encoding issues

 

Dataset Characteristics

The following languages are available in separate data files:

  • eng
  • ger
  • lat
  • fre
  • ita
  • spa
  • por
  • dut
  • swe
  • dan
  • nor
  • ice
  • fry
  • rus*
  • pol*
  • rum*
  • cze*
  • slo*
  • gre*

*: Language file might be subject to character encoding issues.
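A rough way to gauge how badly a language file is affected is to count the lines that do not decode cleanly. The sketch below assumes the files are meant to be UTF-8 text, which the record does not state; `count_suspect_lines` is an illustrative helper and not part of the processPicaPlus script.

```python
REPLACEMENT_CHAR = "\ufffd"  # substituted by errors="replace" for undecodable bytes


def count_suspect_lines(path, encoding="utf-8"):
    """Return (total_lines, suspect_lines) for a text file.

    A line is suspect if it contains bytes that are invalid in the assumed
    encoding, or an already-stored U+FFFD left over from earlier damage.
    The UTF-8 default is an assumption, not something the data set documents.
    """
    total = 0
    suspect = 0
    with open(path, "rb") as handle:
        for raw in handle:  # binary mode: lines are split on b"\n"
            total += 1
            if REPLACEMENT_CHAR in raw.decode(encoding, errors="replace"):
                suspect += 1
    return total, suspect
```

For example, `count_suspect_lines("cze_out.txt")` would report how many lines of the Czech file need a closer look. A count of zero only means the bytes are valid in the assumed encoding; it does not prove that every character was extracted correctly.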

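The other languages are present in the data set but have not been separated, i.e., they are combined in one data file. The following language codes occur in the data set; the list also contains the separated languages above and some non-standard entries such as 'dt.', 'ger,' and 't--':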

'fre', 'rus', 'pol', 'ger', 'eng', 'lit', 'dan', 'dut', 'spa', 'swe', 'ita', 'lat', 'nor', 'ind', 'bul', 'grc', 'fry', 'rum', 'cze', 'slo', 'bel', 'ice', 'fin', 'gre', 'hun', 'tur', 'enm', 'hrv', 'est', 'srp', 'roh', 'syr', 'wen', 'mal', 'afr', 'slv', 'mac', 'smi', 'nds', 'qmw', 'pra', 'oci', 'bre', 'san', 'alb', 'baq', 'non', 'ara', 'chm', 'per', 'cat', 'gmh', 'sla', 'arm', 'ukr', 'por', 'chu', 'heb', 'arc', 'gle', 'tib', 'lav', 'geo', 'crp', 'hin', 'mul', 'chi', 'epo', 'kor', 'kan', 'vot', 'csb', 'glg', 'kaz', 'frm', 'jpn', 'bur', 'srd', 'sal', 'ira', 'bos', 'mol', 'rom', 'tat', 'aze', 'yid', 'mar', 'mak', 'pli', 'rys', 'tgk', 'map', 'vie', 'tuk', 'oss', 'ota', 'tut', 'ben', 'sun', 'tir', 'bak', 'chv', 'ber', 'khm', 'may', 'pan', 'uzb', 'swa', 'kir', 'egy', 'dum', 'nep', 'cop', 'mon', 'tam', 'urd', 'zxx', 'wel', 'mis', 'ng', 'goh', 'dt', 'en', 'fao', 'fro', 'pus', 'kur', 'cus', 'hau', 'uig', 'sit', 'dt.', 'cpf', 'tgl', 'qoj', 'tag', 'raj', 'fiu', 'xal', 'kbd', 'udm', 'scr', 'gag', 'kas', 'scc', 'pro', 'tha', 'dar', 'dr', 'sna', 'ewe', 'de', 'dra', 'ang', 'ine', 'zza', 'und', 'ave', 'amh', 'crh', 'jav', 'cpe', 'akk', 'dsb', 'qce', 'guj', 'ltz', 'got', 'bua', 'peo', 'mdr', 'nob', 'ava', 'che', 'sux', 'kok', 'zap', 'nl', 'inc', 'sah', 'gem', 'law', 'bem', 'sin', 'qdo', 'hsb', 'som', 'lao', 'kam', 'kom', 'abk', 'roa', 'cau', 'ady', 'bat', 'mlt', 'sai', 'xho', 'paa', 'sot', 'bnt', 'lug', 'myn', 'kar', 'qhe', 'kin', 'zul', 'tsn', 'apa', 'nso', 'yao', 'yor', 'bih', 'nog', 'nap', 'loz', 'nbl', 'kon', 'nya', 'snh', 'chn', 'run', 'suk', 'fur', 'osa', 'bra', 'den', 'kpe', 'kal', 'tig', 'wol', 'gla', 'lad', 'mos', 'cre', 'krc', 'ge', 'fr', 'dak', 'fij', 'mad', 'srr', 'kum', 'her', 'nai', 'cel', 'inh', 'kro', 'hit', 'pal', 'tmh', 'tsw', 'bam', 'kab', 'kik', 'kua', 'lub', 'luo', 'nub', 'tem', 'znd', 'mai', 'tai', 'qkr', 'ful', 'man', 'lol', 'sag', 'tog', 'hai', 'arg', 'fat', 'nav', 'niu', 'ibo', 'ido', 'men', 'qju', 'gaa', 'vol', 'nah', 'mlg', 'nic', 'ijo', 'sus', 'orm', 
'smo', 'mag', 'tyv', 'mnc', 'cos', 'mdf', 'kaa', 'dua', 'gez', 'ton', 'ven', 'snd', 'syc', 'nym', 'nia', 'sem', 'chg', 'fan', 'twi', 'mas', 'ina', 'ile', 'art', 'ori', 'qai', 'arw', 'mao', 'bas', 'kmb', 'tiv', 'bal', 'tar', 'tpi', 'abs', 'asm', 'qqa', 'iku', 'min', 'rup', 'tel', 'or', 'tah', 'aka', 'day', 'qqg', 'lah', 'lus', 'sio', 'oto', 'alg', 'shn', 'ndo', 'haw', 'tso', 'mus', 'cai', 'qev', 'new', 'zha', 'grn', 'khi', 'ssw', 'nde', 'bla', 'grb', 'mun', 'din', 'sam', 'mwr', 'cor', 'sat', 'cho', 'ger,', 'que', 'btk', 'glv', 'rar', 'jk', 'nno', 'cmc', 'mga', 'jw', 'iro', 'sog', 'hat', 'dzo', 'mkh', 'bik', 'ban', 'ilo', 'pam', 'ts', 'sme', 'myv', 'qnn', 'jpr', 'qte', 'yap', 'bis', 'sga', 'qkj', 'pap', 'ath', 'ipk', 'phi', 'sco', 'del', 'moh', 'iri', 'gae', 'ryl', 'our', 't--', 'grk', 'ssa', 'awa', 'efi', 'jrb', 'enk', 'kru', 'oji', 'arn', 'car', 'gsw', 'lez', 'war', 'ace', 'qrn', 'wln', 'ceb', 'aar', 'bug', 'kaw', 'chr', 'cpp', 'tet', 'aym', 'ces', 'hmo'
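The list above mixes three-letter codes with two-letter codes and entries carrying stray punctuation, so it can help to normalize the codes before grouping records by language. The following is a minimal sketch; `normalize_code` and `classify_codes` are hypothetical helper names, and the rule "three lowercase letters" only checks the shape of a code, not whether it is a registered language code.

```python
import re

THREE_LETTER = re.compile(r"^[a-z]{3}$")


def normalize_code(raw):
    """Lowercase a code and strip surrounding whitespace, dots, commas, hyphens."""
    return raw.strip().strip(".,-").lower()


def classify_codes(codes):
    """Split codes into three-letter codes and everything else.

    Both results are sorted and de-duplicated after normalization.
    """
    well_formed, other = set(), set()
    for raw in codes:
        code = normalize_code(raw)
        if THREE_LETTER.match(code):
            well_formed.add(code)
        else:
            other.add(code)
    return sorted(well_formed), sorted(other)


three_letter, needs_review = classify_codes(["ger", "ger,", "dt.", "t--", "en", "cze"])
print(three_letter)   # ['cze', 'ger']
print(needs_review)   # ['dt', 'en', 't']
```

Entries that end up in the second list (for example 'dt' or 'en') would still need a manual mapping decision; the record does not say how they relate to the three-letter codes.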

Files

cze_out.txt

Files (1.9 GB)

Checksum (MD5)                    Size
9732d372f8ba9516dc64d988b7311152  7.0 MB
e582e959f5f85795ee2e968226491e2d  3.6 MB
fc6785edcb4c3e4d8977a7d21c87001a  14.2 MB
77574c7d25fe0f202fc104528485d5a2  328.2 MB
5c54395a842deff86d2cecf8238ef259  75.5 MB
999f2ab3f52c5d249568c2f20d88471e  67.5 kB
fcdee9d86f2702a06cc912557aa09a9d  490.4 MB
7b85ff65d905369b76b01812d6f588de  1.2 MB
4dae71c7fabbbdc8e7d17b9f8c9e9f36  189.9 kB
8b57f0be40288cb0008a1871ef23b97c  30.1 MB
bf4f5a62fc94384a8162badda0301b39  65.9 MB
02fea1cb9c8877e66c8295604a86f5c8  2.0 MB
34622f9d23f1550c5546c390c80f51b8  813.4 MB
7da7f8dbb963d5429d1da677714d252a  13.5 MB
b980f399670bac0d7fea24e5995d4eb3  1.6 MB
44a4cc16658cd631a164591167c63603  1.7 MB
ee805132b692b9d11226614ed85c3c18  39.0 MB
56443d3eea99bd2d6db36f02ded6efb8  1.8 MB
e30af9a14d1e04ed4899c2039f628674  7.9 MB
9e55ba7ddf93fe050d138ddefe210426  6.5 kB
b3f99c3795809b0e34e6ae4b6545c7d5  5.8 MB
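The MD5 checksums can be used to confirm that a download completed without corruption. The sketch below uses Python's standard `hashlib` module; `md5_of_file` and `matches_checksum` are illustrative names, and the file is read in chunks because several of the files are hundreds of megabytes in size.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 digest with a published value.

    Accepts the value with or without a leading "md5:" prefix.
    """
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

A call such as `matches_checksum("cze_out.txt", "md5:<published value>")` returns `True` when the file is intact. The listing above does not preserve which checksum belongs to which file, so the expected value has to be taken from the entry for that file in the original record.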