
LIFCACH 2.0: Word Frequency List of Chilean Spanish (Lista de Frecuencias de Palabras del Castellano de Chile), version 2.0

Sadowsky, Scott; Martínez-Gamboa, Ricardo

More information, as well as the Spanish version of this document, is available in the included README file.

1. Description

The Word Frequency List of Chilean Spanish (LIFCACH) is a set of 102 frequency lists derived from the sub-corpora of the Corpus Dinámico del Castellano de Chile (Dynamic Corpus of Chilean Spanish, CODICACH), a database of contemporary written [1] Chilean Spanish developed by Sadowsky between 1997 and 2002. This corpus contained approximately 450 million words when the LIFCACH was created [2]. The LIFCACH also contains a non-weighted list of total frequencies (the Total Occurrences column), which is the sum of the frequencies of the 102 individual lists; in other words, it is the frequency list of the entire CODICACH corpus.
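
As a minimal sketch of how the Total Occurrences column relates to the individual lists, the Python below sums per-list frequencies for each lemma. The two sub-corpus codes come from the list of sources, but the lemmas, the counts, the in-memory layout, and the function name total_occurrences are invented for illustration; the actual spreadsheet layout may differ.

```python
from collections import Counter

# Hypothetical per-list lemma frequencies; all counts are invented.
# Keys are (lemma, POS code) pairs.
lists = {
    "DIAR_SAN_Mercurio": Counter({("casa", "N"): 120, ("cantar", "V"): 15}),
    "LIBR_Ficcion": Counter({("casa", "N"): 80, ("cachar", "V"): 7}),
}

def total_occurrences(freq_lists):
    """Return the unweighted sum of all per-list frequencies."""
    total = Counter()
    for counts in freq_lists.values():
        total.update(counts)  # Counter.update adds counts instead of replacing them
    return total

totals = total_occurrences(lists)
print(totals[("casa", "N")])  # 200
```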

The CODICACH is an opportunistic corpus with a bias toward press-based sources; it does not seek to be a BNC-style representative sampling of the overall written language. The modular nature of the CODICACH and of the 102 individual LIFCACH lists, however, allows researchers to use one or more of these lists alone, to combine them as needed, or to create their own frequency lists for Chilean Spanish by weighting each of the LIFCACH’s individual lists as they see fit.
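
One possible way to build a custom weighted list is sketched below. It rests on assumptions that are not part of the LIFCACH itself: each list is first converted to per-million relative frequencies so that large sub-corpora do not dominate, the size of a list is approximated by the sum of its frequencies, and the weights are chosen by the researcher. The counts, the weights, and the function name weighted_frequencies are hypothetical.

```python
def weighted_frequencies(freq_lists, weights):
    """Combine lists as a weighted sum of per-million relative frequencies."""
    combined = {}
    for code, counts in freq_lists.items():
        weight = weights.get(code, 0.0)
        size = sum(counts.values())  # assumption: list size = sum of its counts
        if weight == 0.0 or size == 0:
            continue  # lists without a weight are left out entirely
        for lemma, count in counts.items():
            per_million = count / size * 1_000_000
            combined[lemma] = combined.get(lemma, 0.0) + weight * per_million
    return combined

# Invented counts for two of the sub-corpus codes.
lists = {
    "DIAR_SAN_Mercurio": {"casa": 600, "ministro": 400},
    "ESPER_Usenet": {"casa": 100, "cachar": 100},
}
# Hypothetical weights that favor informal writing over the press.
weights = {"DIAR_SAN_Mercurio": 0.25, "ESPER_Usenet": 0.75}

combined = weighted_frequencies(lists, weights)
for lemma, value in sorted(combined.items(), key=lambda item: -item[1]):
    print(f"{lemma}\t{value:.0f}")
```

Weighting relative rather than raw frequencies is a design choice of this sketch; weighting raw counts would instead preserve the press-heavy balance of the corpus.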

The LIFCACH 2.0 contains 476,776 lemmas [3] derived from the approximately 4.5 million types found in the 450 million running words contained in the CODICACH at the time the lists were created.

2. Creation of the LIFCACH

The steps in creating the LIFCACH were as follows:

  1. Type frequency lists based on the running words of each of the 102 sub-corpora of the CODICACH were generated.

  2. Each type frequency list was lemmatized and POS-tagged using the Universitat Politècnica de Catalunya’s MS-Tools v2.0 [4].

  3. For the …No-Hapax.xlsx version, lemmas with a frequency of 1 (approximately 300,000) were removed. Eliminating these was considered an acceptable trade-off in exchange for a far more manageable file size.

  4. The resulting lemma frequency lists were assembled and total occurrences were calculated.
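
The four steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: TOY_LEXICON is a hypothetical lookup table standing in for the lemmatizer and tagger (it is not the MS-Tools interface), the running words are invented, and the frequency-1 filter is assumed to apply to total occurrences, which the description does not state explicitly.

```python
from collections import Counter

# Hypothetical type -> (lemma, POS) lookup standing in for a real
# lemmatizer and tagger. It is NOT the MS-Tools interface.
TOY_LEXICON = {
    "casa": ("casa", "N"),
    "casas": ("casa", "N"),
    "cantaron": ("cantar", "V"),
    "rápidamente": ("rápidamente", "AV"),
}

def type_frequencies(tokens):
    """Step 1: count each distinct word form (type) in the running words."""
    return Counter(token.lower() for token in tokens)

def lemmatize_types(type_counts, lexicon):
    """Step 2: map each type to a (lemma, POS) pair and pool the counts."""
    lemma_counts = Counter()
    for word, count in type_counts.items():
        lemma_counts[lexicon[word]] += count
    return lemma_counts

def assemble(per_list_counts):
    """Step 4: place the lists side by side and add up total occurrences."""
    table = {}
    for code, counts in per_list_counts.items():
        for key, count in counts.items():
            row = table.setdefault(key, {"Total": 0})
            row[code] = count
            row["Total"] += count
    return table

def drop_hapax(table):
    """Step 3: drop lemmas with a frequency of 1 (assumed here to mean totals)."""
    return {key: row for key, row in table.items() if row["Total"] > 1}

# Invented running words for two of the sub-corpus codes.
sub_corpora = {
    "LIBR_Ficcion": ["casas", "casa", "cantaron"],
    "ORAL_TV": ["Casa", "rápidamente", "casa"],
}

per_list = {
    code: lemmatize_types(type_frequencies(tokens), TOY_LEXICON)
    for code, tokens in sub_corpora.items()
}
table = assemble(per_list)
no_hapax = drop_hapax(table)
print(no_hapax)
# {('casa', 'N'): {'Total': 4, 'LIBR_Ficcion': 2, 'ORAL_TV': 2}}
```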

An important caveat regarding this methodology must be mentioned. The use of type frequency lists instead of running words in the POS tagging and lemmatizing process was a practical necessity, due to the speed of the software used and the computing resources available at the time the LIFCACH was created. However, this reduced the accuracy of the lemmatization process by eliminating context. As a result, the software had to analyze words such as "canto" without the information required to decide whether a given instance of this word is a form of the verb "cantar" or the noun "canto".
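
The effect of this caveat can be illustrated with a small sketch. Everything in it is hypothetical: the candidate analyses, the three invented instances, and in particular the rule of taking the first candidate, since how the software actually resolved ambiguous types is not described here.

```python
from collections import Counter

# Hypothetical candidate analyses for an ambiguous type; not MS-Tools output.
CANDIDATES = {
    "canto": [("cantar", "V"), ("canto", "N")],
}

def lemmatize_type(word, count, candidates):
    """Assign a type's whole count to one analysis, since no context is available."""
    lemma_pos = candidates[word][0]  # assumption: arbitrarily pick the first candidate
    return Counter({lemma_pos: count})

# Correct analyses of three invented instances of "canto" seen in context.
in_context = [("cantar", "V"), ("canto", "N"), ("cantar", "V")]

with_context = Counter(in_context)
without_context = lemmatize_type("canto", len(in_context), CANDIDATES)

print(with_context)     # Counter({('cantar', 'V'): 2, ('canto', 'N'): 1})
print(without_context)  # Counter({('cantar', 'V'): 3})
```

Whichever candidate is chosen, the count for one lemma is inflated and the other is lost, which is the loss of accuracy described above.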

It should also be noted that the lemmatizing and tagging software that was used is based on European Spanish, a national dialect that is rather removed from Chilean Spanish.

3. Part of Speech Categories

The following are the POS codes used in the frequency lists:

  • AJ = Adjective
  • AV = Adverb
  • C = Conjunction
  • D = Determiner
  • I = Interjection
  • N = Noun, Common
  • NG = Noun, Geographic (Toponym)
  • NP = Noun, Proper
  • PN = Pronoun
  • PP = Preposition
  • SG = Abbreviation
  • V = Verb
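
For scripted use, the codes above can be held in a lookup table, as in the sketch below. The dictionary reproduces the list above; the row layout (lemma, POS code, frequency), the sample rows, and the function name filter_by_pos are assumptions made for illustration.

```python
# POS codes as listed in this section.
POS_CODES = {
    "AJ": "Adjective",
    "AV": "Adverb",
    "C": "Conjunction",
    "D": "Determiner",
    "I": "Interjection",
    "N": "Noun, Common",
    "NG": "Noun, Geographic (Toponym)",
    "NP": "Noun, Proper",
    "PN": "Pronoun",
    "PP": "Preposition",
    "SG": "Abbreviation",
    "V": "Verb",
}

def filter_by_pos(rows, codes):
    """Keep the rows whose POS code is one of the requested codes."""
    wanted = set(codes)
    unknown = wanted - set(POS_CODES)
    if unknown:
        raise ValueError(f"Unknown POS codes: {sorted(unknown)}")
    return [row for row in rows if row[1] in wanted]

# Invented rows in an assumed (lemma, POS code, frequency) layout.
rows = [("casa", "N", 200), ("Valparaíso", "NG", 45), ("cantar", "V", 15)]
nouns = filter_by_pos(rows, ["N", "NG", "NP"])
print(nouns)  # [('casa', 'N', 200), ('Valparaíso', 'NG', 45)]
```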

4. List of Sources

Each frequency list in the LIFCACH is derived from a different sub-corpus of the CODICACH. The codes used for these lists are as follows:

  • ACAD_CCAA = Academic Texts – Applied Sciences
  • ACAD_CCNN = Academic Texts – Natural Sciences
  • ACAD_CCSS = Academic Texts – Social Sciences
  • ACAD_Hum = Academic Texts – Humanities
  • DIAR_CEN_Estrella_Valpo = Newspaper – Central Chile – Estrella de Valparaíso
  • DIAR_CEN_Gran_Valpo = Newspaper – Central Chile – Gran Valparaíso
  • DIAR_CEN_Lider_San_Antonio = Newspaper – Central Chile – El Líder, San Antonio
  • DIAR_CEN_Mercurio_Valpo = Newspaper – Central Chile – El Mercurio, Valparaíso
  • DIAR_NOR_Estrella_Arica = Newspaper – North Chile – La Estrella, Arica
  • DIAR_NOR_Estrella_Iquique = Newspaper – North Chile – La Estrella, Iquique
  • DIAR_NOR_Estrella_Loa = Newspaper – North Chile – La Estrella, Loa
  • DIAR_NOR_Estrella_Norte_Antofagasta = Newspaper – North Chile – La Estrella, Antofagasta
  • DIAR_NOR_Mercurio_Antofagasta = Newspaper – North Chile – El Mercurio, Antofagasta
  • DIAR_NOR_Mercurio_Calama = Newspaper – North Chile – El Mercurio, Calama
  • DIAR_NOR_Nortino_Iquique = Newspaper – North Chile – El Nortino, Iquique
  • DIAR_SAN_Cuarta = Newspaper – Santiago – La Cuarta
  • DIAR_SAN_Estrategia = Newspaper – Santiago – Estrategia
  • DIAR_SAN_Firme = Newspaper – Santiago – La Firme
  • DIAR_SAN_Mercurio = Newspaper – Santiago – El Mercurio
  • DIAR_SAN_Metropolitano = Newspaper – Santiago – El Metropolitano
  • DIAR_SAN_Mostrador = Newspaper – Santiago – El Mostrador
  • DIAR_SAN_Primera_Linea = Newspaper – Santiago – Primera Línea
  • DIAR_SAN_Primera_Pagina-El_Area = Newspaper – Santiago – Primera Página / El Área
  • DIAR_SAN_Segunda = Newspaper – Santiago – La Segunda
  • DIAR_SAN_Tercera = Newspaper – Santiago – La Tercera
  • DIAR_SAN_Ultimas_Noticias = Newspaper – Santiago – Las Últimas Noticias
  • DIAR_SUR_Austral_Osorno = Newspaper – South Chile – Austral, Osorno
  • DIAR_SUR_Austral_Temuco = Newspaper – South Chile – Austral, Temuco
  • DIAR_SUR_Austral_Valdivia = Newspaper – South Chile – Austral, Valdivia
  • DIAR_SUR_Cronica = Newspaper – South Chile – Crónica
  • DIAR_SUR_El_Sur = Newspaper – South Chile – El Sur
  • DIAR_SUR_Enc_BioBio = Newspaper – South Chile – Enciclop. Bío-Bío
  • DIAR_SUR_Llanquihue_Pto_Montt = Newspaper – South Chile – El Llanquihue, Pto. Montt
  • ESPER_CartasDirector = Personal Writings – Letters to Editor
  • ESPER_ForosInet = Personal Writings – Internet Site Forums
  • ESPER_Clasificados = Personal Writings – Classified Ads
  • ESPER_ForosMedios = Personal Writings – Media Forums
  • ESPER_Usenet = Personal Writings – Usenet
  • LEX_Jurisprudencia = Legal – Jurisprudence
  • LEX_Leyes = Legal – Laws
  • LEX_Libros = Legal – Law Books
  • LEX_Misc = Legal – Miscellaneous
  • LIBR_Ficcion = Books – Fiction
  • LIBR_NoFiccion = Books – Non-Fiction
  • OBRC_CandiaCares_DicoCoa = Reference Works – Dictionary of Coa
  • OBRC_GonzalezParra_ManualProvrb = Reference Works – Book of Chilean Proverbs
  • ORAL_Entrevistas_Lgtcas = Oral – Linguistic Interviews
  • ORAL_TV = Oral – Television
  • PUB_Misc = Advertising – General 1
  • PUB_Publicidad = Advertising – General 2
  • REV_CMP_ChileTech = Magazine – Computers – ChileTech
  • REV_CMP_CompuChile = Magazine – Computers – CompuChile
  • REV_CMP_ComputerWorld = Magazine – Computers – ComputerWorld
  • REV_CMP_Informatica = Magazine – Computers – Informática
  • REV_CMP_Infoweek = Magazine – Computers – Infoweek
  • REV_CMP_Internet21 = Magazine – Computers – Internet21
  • REV_CMP_Mouse = Magazine – Computers – Mouse
  • REV_DEP_All = Magazine – Sports
  • REV_ESP_Capital = Magazine – Specialty – Capital
  • REV_ESP_CiudadArquitectura = Magazine – Specialty – CiudadArquitectura
  • REV_ESP_Conicyt = Magazine – Specialty – Conicyt Scientific
  • REV_ESP_CopropInmob = Magazine – Specialty – Copropiedad Inmobiliaria
  • REV_ESP_DiarioSocCivil = Magazine – Specialty – Diario de la Sociedad Civil
  • REV_ESP_Educar = Magazine – Specialty – Educar
  • REV_ESP_LemuChile = Magazine – Specialty – LemuChile
  • REV_ESP_Lignum = Magazine – Specialty – Lignum
  • REV_ESP_Mensaje = Magazine – Specialty – Mensaje
  • REV_ESP_Notas_CESAF = Magazine – Specialty – Notas CESAF
  • REV_ESP_Publimark = Magazine – Specialty – Publimark
  • REV_ESP_Rev_Inf_Musical = Magazine – Specialty – Revista Musical
  • REV_ESP_Rev_Scielo = Magazine – Specialty – Scielo Scientific
  • REV_ESP_Rev_Social = Magazine – Specialty – Revista Social
  • REV_ESP_Rev_Trabajo_Social = Magazine – Specialty – Revista de Trabajo Social
  • REV_ESP_RevChil_Cirujia = Magazine – Specialty – Revista Chilena de Cirugía
  • REV_ESP_Revistas_Industriales = Magazine – Specialty – Industrial Magazines
  • REV_ESP_Sidhartha = Magazine – Specialty – Siddhartha
  • REV_GEN_Asuntos_Publicos = Magazine – General – Asuntos Públicos
  • REV_GEN_Cosas = Magazine – General – Cosas
  • REV_GEN_Cultura_Urbana = Magazine – General – Cultura Urbana
  • REV_GEN_El_Siglo = Magazine – General – El Siglo
  • REV_GEN_Ercilla = Magazine – General – Ercilla
  • REV_GEN_Hacer_Familia = Magazine – General – Hacer Familia
  • REV_GEN_Man = Magazine – General – Man
  • REV_GEN_Mujer_a_mujer = Magazine – General – Mujer a mujer
  • REV_GEN_Nos = Magazine – General – Nos
  • REV_GEN_Puerto_Paralelo = Magazine – General – Puerto Paralelo
  • REV_GEN_Punto_Final = Magazine – General – Punto Final
  • REV_GEN_Que_Pasa = Magazine – General – Qué Pasa
  • REV_GEN_Revista_ED = Magazine – General – Revista ED
  • REV_GEN_Rocinante = Magazine – General – Rocinante
  • REV_INF_Dirigible = Magazine – Children’s – Dirigible
  • REV_INF_Icarito = Magazine – Children’s – Icarito
  • REV_INF_Papas_Fritas = Magazine – Children’s – Papas Fritas
  • REV_INF_Volare = Magazine – Children’s – Volare
  • REV_JUV_All = Magazines – Youth
  • REV_LOC_All = Magazines – Local
  • RVDI_ECN_Diario_PyME = Financial Mags & Newspapers – Diario PyME
  • RVDI_ECN_El_Diario = Financial Mags & Newspapers – El Diario
  • RVDI_ECN_Emprendedores = Financial Mags & Newspapers – Emprendedores
  • RVDI_ECN_Negocios_Ambientales = Financial Mags & Newspapers – Negoc. Ambientales
  • SIT_INS_All = Government Sites 1
  • SIT_INS_Old = Government Sites 2
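
Because the codes share underscore-separated prefixes (for example DIAR_SAN_ for Santiago newspapers), related lists can be selected by prefix. The sketch below uses a handful of codes from the list above; the helper names select_codes and group_by_category are hypothetical, and treating the text before the first underscore as a category is an assumption based on the naming pattern.

```python
def select_codes(codes, prefix):
    """Return the codes that start with the given prefix."""
    return [code for code in codes if code.startswith(prefix)]

def group_by_category(codes):
    """Group codes by the part before the first underscore (e.g. DIAR, REV)."""
    groups = {}
    for code in codes:
        category = code.split("_", 1)[0]
        groups.setdefault(category, []).append(code)
    return groups

# A small sample of the codes listed above.
codes = [
    "ACAD_CCSS",
    "DIAR_SAN_Mercurio",
    "DIAR_SAN_Tercera",
    "DIAR_SUR_El_Sur",
    "LEX_Leyes",
    "REV_GEN_Ercilla",
]

santiago_press = select_codes(codes, "DIAR_SAN_")
groups = group_by_category(codes)
print(santiago_press)  # ['DIAR_SAN_Mercurio', 'DIAR_SAN_Tercera']
print(sorted(groups))  # ['ACAD', 'DIAR', 'LEX', 'REV']
```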

NOTES

1 Although the CODICACH does contain two oral corpora, ORAL_Entrevistas_Lgtcas and ORAL_TV, these are of such negligible size that the CODICACH must be considered a corpus of written Spanish.

2 The CODICACH currently contains approximately 850 million words.

3 This is the number of non-hapax lemmas. The total number of lemmas in the LIFCACH, including hapax legomena, is 844,370.

4 MS-Tools was the predecessor of FreeLing.

 

Files: Sadowsky_&_Martinez_-_LIFCACH-2.0.zip (9.0 MB), md5: d9f7c799c883661b8733f1c29168acd2
