rspeer/langcodes: Version 3.3

Robyn Speer

doi:10.5281/zenodo.7186012

Published October 11, 2022 | Version v3.3.0

Software Open

Robyn Speer¹

1. Elemental Cognition

langcodes knows what languages are. It knows the standardized codes that refer to them, such as en for English, es for Spanish and hi for Hindi.

These are IETF language tags. You may know them by their old name, ISO 639 language codes. IETF has done some important things for backward compatibility and supporting language variations that you won't find in the ISO standard.

It may sound to you like langcodes solves a pretty boring problem. At one level, that's right. Sometimes you have a boring problem, and it's great when a library solves it for you.

But there's an interesting problem hiding in here. How do you work with language codes? How do you know when two different codes represent the same thing? How should your code represent relationships between codes, like the following?

eng is equivalent to en.
fra and fre are both equivalent to fr.
en-GB might be written as en-gb or en_GB, or as en-UK, which is erroneous but should be treated as the same.
en-CA is not exactly equivalent to en-US, but it's really, really close.
en-Latn-US is equivalent to en-US, because written English must be written in the Latin alphabet to be understood.
The difference between ar and arb is the difference between "Arabic" and "Modern Standard Arabic", a difference that may not be relevant to you.
You'll find Mandarin Chinese tagged as cmn on Wiktionary, but many other resources would call the same language zh.
Chinese is written in different scripts in different territories. Some software distinguishes the script. Other software distinguishes the territory. The result is that zh-CN and zh-Hans are used interchangeably, as are zh-TW and zh-Hant, even though occasionally you'll need something different such as zh-HK or zh-Latn-pinyin.
The Indonesian (id) and Malaysian (ms or zsm) languages are mutually intelligible.
jp is not a language code. (The language code for Japanese is ja, but people confuse it with the country code for Japan.)

One way to know is to read IETF standards and Unicode technical reports. Another way is to use a library that implements those standards and guidelines for you, which langcodes does.
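To make the kind of work involved concrete, here is a minimal sketch of tag normalization covering a few of the cases listed above. It is not langcodes' implementation or API: the function name normalize_tag and the three small tables are hypothetical, hand-filled from the examples in this description, whereas a real library draws on the full IANA subtag registry and Unicode CLDR data.

```python
# Toy alias tables with a handful of entries taken from the examples above.
# These are illustrative assumptions, not data shipped with langcodes.
LANGUAGE_ALIASES = {"eng": "en", "fra": "fr", "fre": "fr"}
REGION_ALIASES = {"UK": "GB"}
# Scripts implied by the language, which can therefore be dropped.
DEFAULT_SCRIPTS = {"en": "Latn"}


def normalize_tag(tag: str) -> str:
    """Normalize a simple language[-script][-region] tag (hypothetical helper)."""
    # Accept underscores as separators, as in en_GB.
    parts = tag.replace("_", "-").split("-")

    # Language subtags are lowercase; map three-letter codes to short ones.
    language = parts[0].lower()
    language = LANGUAGE_ALIASES.get(language, language)

    result = [language]
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha():
            # Four letters: a script subtag, written in title case.
            script = part.title()
            if DEFAULT_SCRIPTS.get(language) != script:
                result.append(script)
        elif len(part) == 2 and part.isalpha():
            # Two letters: a region subtag, written in upper case.
            region = part.upper()
            result.append(REGION_ALIASES.get(region, region))
        else:
            # Anything else (such as a variant) is kept in lower case.
            result.append(part.lower())
    return "-".join(result)


for raw in ["eng", "fre", "en_GB", "en-UK", "en-Latn-US", "zh-Hant"]:
    print(raw, "->", normalize_tag(raw))
```

Even this toy version needs three lookup tables and a rule per subtag type, and it still says nothing about how close en-CA is to en-US or whether jp is a valid code. That gap is the reason to rely on a library that implements the standards.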

When you're working with these short language codes, you may want to see what a language is called in a particular language: fr is called "French" in English. The naming language doesn't have to be English: fr is called "français" in French. A supplement to langcodes, language_data, provides this information.

langcodes is maintained by Elia Robyn Lake a.k.a. Robyn Speer, and is released as free software under the MIT license.

Files

Name	Size
rspeer/langcodes-v3.3.0.zip (md5:cd92fb1afe4dea7fdb2493d173114c2c)	187.8 kB

Additional details

Is supplement to: https://github.com/rspeer/langcodes/tree/v3.3.0 (URL)

