Published January 11, 2023 | Version 0.0.1
Other
Open
Pinyin - IPA Mapping
Description
This upload contains JSON mappings from pinyin to IPA for all pinyin-romanized Chinese syllables found in a large corpus.
The original Chinese words were taken from the Leipzig Corpora Collection (uni-leipzig), specifically the 1M Wikipedia corpus from 2018. Each syllable was extracted and then converted to pinyin. The pinyin transcription was retrieved with pypinyin (v0.47.1) via dict-from-pypinyin (v0.0.1) and then transcribed to IPA with pinyin-to-ipa (v0.0.1). Only the first possible transcription was included in the mappings.
Note: tone sandhi is not considered since the vocabulary consists only of stand-alone syllables.
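The extraction step described above can be illustrated with a small sketch. It is not the content of script.sh: the real pipeline calls pypinyin through dict-from-pypinyin, whereas here a hypothetical toy lookup (`TOY_READINGS`, `toy_first_reading`) stands in for that library so the example runs with the standard library alone. It shows the two decisions the description mentions: keeping only the first candidate reading, and setting aside symbols that have no reading.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Toy data for illustration only; the published mappings come from pypinyin.
TOY_READINGS: Dict[str, List[str]] = {
    "中": ["zhong1", "zhong4"],
    "文": ["wen2"],
}


def toy_first_reading(char: str) -> Optional[str]:
    """Return the first candidate reading, or None if the symbol has none."""
    candidates = TOY_READINGS.get(char)
    return candidates[0] if candidates else None


def split_vocabulary(
    lines: Iterable[str],
    first_reading: Callable[[str], Optional[str]],
) -> Tuple[Dict[str, str], List[str]]:
    """Collect unique non-space symbols and split them by whether a reading exists."""
    seen = set()
    for line in lines:
        seen.update(ch for ch in line if not ch.isspace())

    mapped: Dict[str, str] = {}
    out_of_vocabulary: List[str] = []
    for ch in sorted(seen):
        reading = first_reading(ch)
        if reading is None:
            out_of_vocabulary.append(ch)
        else:
            mapped[ch] = reading
    return mapped, out_of_vocabulary


if __name__ == "__main__":
    mapped, oov = split_vocabulary(["中文 방"], toy_first_reading)
    print(mapped)
    print(oov)
```

In this sketch the `mapped` dictionary plays the role of hanzi-vocabulary.txt plus its readings, and the second return value plays the role of oov-vocabulary.txt.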
Files:
hanzi-vocabulary.txt
- contains the hanzi vocabulary (Chinese syllables) from which pinyin was transcribed, e.g., 㩳
pinyin-ipa-map-NORMAL.json (418 mappings)
- contains toneless pinyin mapped to IPA in pypinyin-style NORMAL, e.g., beng
pinyin-ipa-map-TONE.json (1400 mappings)
- contains pinyin with tones mapped to IPA in pypinyin-style TONE, e.g., bèng
pinyin-ipa-map-TONE2.json (1400 mappings)
- contains pinyin with tones mapped to IPA in pypinyin-style TONE2, e.g., be4ng
pinyin-ipa-map-TONE3.json (1400 mappings)
- contains pinyin with tones mapped to IPA in pypinyin-style TONE3, e.g., beng4
pinyin-ipa-map-TONE3-all.json (2508 mappings)
- contains all theoretical combinations of pinyin with tones mapped to IPA in pypinyin-style TONE3, e.g., beng4
oov-vocabulary.txt
- contains the vocabulary for which no pinyin could be transcribed (because the symbol is not a Chinese character or has no pinyin representation), e.g., 방 or 㕔
script.sh
- contains the script to reproduce all results
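The four pypinyin styles named in the file list differ only in how the tone is written. As a rough sketch (not taken from script.sh; the function name `convert_tone_style` is made up for illustration), a TONE-style syllable can be rewritten into the other three notations with the standard library alone. How the neutral tone is written in the published files is not stated in this record, so the sketch simply leaves unmarked syllables without a digit.

```python
import unicodedata
from typing import Dict

# Combining diacritics that carry the four pinyin tones.
TONE_MARKS = {
    "\u0304": "1",  # macron
    "\u0301": "2",  # acute
    "\u030c": "3",  # caron
    "\u0300": "4",  # grave
}


def convert_tone_style(syllable: str) -> Dict[str, str]:
    """Rewrite a TONE-style syllable (e.g. 'bèng') in NORMAL, TONE2 and TONE3 notation."""
    # Decompose so that each tone mark becomes a separate combining character.
    decomposed = unicodedata.normalize("NFD", syllable)

    plain = []  # characters without the tone mark
    tone2 = []  # same characters, with the tone digit where the mark was
    tone = ""
    for ch in decomposed:
        if ch in TONE_MARKS and not tone:
            tone = TONE_MARKS[ch]
            tone2.append(tone)
        else:
            plain.append(ch)
            tone2.append(ch)

    def recompose(chars):
        # Recompose remaining diacritics such as the diaeresis in 'ü'.
        return unicodedata.normalize("NFC", "".join(chars))

    normal = recompose(plain)
    return {
        "NORMAL": normal,
        "TONE": syllable,
        "TONE2": recompose(tone2),
        "TONE3": normal + tone,
    }


if __name__ == "__main__":
    print(convert_tone_style("bèng"))
```

For the example syllable used in the file list, this yields beng, bèng, be4ng and beng4, matching the keys shown for the NORMAL, TONE, TONE2 and TONE3 files.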
Files (229.7 kB)
Checksum | Size |
---|---|
md5:8ae3e0e18bdd436d074cbc41aab6790a | 48.0 kB |
md5:4d0d7f727573f6b7b971c90c25736c36 | 6.8 kB |
md5:82b56b91105131a44ec4835af3d12840 | 8.4 kB |
md5:f1d73f58ff409a54a452c2b3dc01738d | 34.5 kB |
md5:65bb66864435d5a891a26626fb4aac83 | 34.6 kB |
md5:f5e0a4e416e481c4da20b0e2b09a1276 | 59.0 kB |
md5:81e7ac22dc9918b738716b140485aa20 | 34.6 kB |
md5:15b0a474d3319a758b41602a63e0ecd6 | 4.0 kB |
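A downloaded file can be checked against one of the checksums listed above. The following is a minimal sketch using Python's `hashlib`; the helper names `md5_of_file` and `matches_listed_md5` are illustrative, and since the listing does not show which checksum belongs to which file name, the pairing in any call is left to the reader.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path: str, listed: str) -> bool:
    """Compare a file against a listed value such as 'md5:8ae3...'."""
    # Accept the value with or without the 'md5:' prefix used in the listing.
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

A call would look like `matches_listed_md5("script.sh", "md5:...")`, with the checksum copied from the row that corresponds to the downloaded file.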
Additional details
References
- D. Goldhahn, T. Eckart & U. Quasthoff: Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. In: Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC'12), 2012