268043
doi
10.5281/zenodo.268043
oai:zenodo.org:268043
user-hispanic-linguistics
user-linguistics
Martínez-Gamboa, Ricardo
Universidad de Chile
LIFCACH 2.0: Word Frequency List of Chilean Spanish (Lista de Frecuencias de Palabras del Castellano de Chile), version 2.0
Sadowsky, Scott
Pontificia Universidad Católica de Chile
url:http://sadowsky.cl/lifcach.html
info:eu-repo/semantics/openAccess
Creative Commons Attribution Non Commercial 4.0 International
https://creativecommons.org/licenses/by-nc/4.0/legalcode
Frequency list
Spanish
Chilean Spanish
Corpus
Lexical frequency
<p><strong>LIFCACH 2.0<br>
Word Frequency List of Chilean Spanish (Lista de Frecuencias de Palabras del Castellano de Chile), version 2.0</strong></p>
<p><em>More information, as well as the Spanish version of this document, is available in the included README file.</em></p>
<p><strong>1. Description</strong></p>
<p>The Word Frequency List of Chilean Spanish (LIFCACH) is a set of 102 frequency lists derived from the sub-corpora of the <em>Corpus Dinámico del Castellano de Chile</em> (Dynamic Corpus of Chilean Spanish, CODICACH), a database of contemporary written<sup>1</sup> Chilean Spanish developed by Sadowsky between 1997 and 2002. The corpus contained approximately 450 million words when the LIFCACH was created<sup>2</sup>. The LIFCACH also contains a non-weighted list of total frequencies (the <em>Total Occurrences</em> column), which is the sum of the frequencies of the 102 individual lists; in other words, it is the frequency list of the entire CODICACH corpus.</p>
<p>The CODICACH is an opportunistic corpus with a bias toward press-based sources; it does not seek to be a representative sample of the written language as a whole, in the style of the British National Corpus (BNC). The modular nature of the CODICACH and of the 102 individual LIFCACH lists, however, allows researchers to use one or more of these lists alone, to combine them as needed, or to create their own frequency lists for Chilean Spanish by weighting each of the LIFCACH’s individual lists as they see fit.</p>
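<p>As a rough illustration of the weighting idea described above, the following Python sketch combines several per-source frequency lists using researcher-chosen weights. The function name <code>combine_weighted</code>, the dictionary layout, and all frequencies are assumptions made for this example; they are not taken from the LIFCACH files.</p>

```python
from collections import Counter

def combine_weighted(lists, weights):
    """Return a weighted sum of per-source frequency lists.

    lists:   {source_code: {lemma: raw_frequency}}
    weights: {source_code: weight}; sources without a weight are skipped.
    """
    combined = Counter()
    for source, freqs in lists.items():
        weight = weights.get(source)
        if weight is None:
            continue
        for lemma, count in freqs.items():
            combined[lemma] += weight * count
    return combined

# Toy frequencies, invented for illustration only (not real LIFCACH values).
toy_lists = {
    "DIAR_SAN_Mercurio": {"casa": 120, "gobierno": 300},
    "LIBR_Ficcion": {"casa": 80, "gobierno": 5},
    "ESPER_Usenet": {"casa": 10, "gobierno": 2},
}

# Down-weight the press, up-weight fiction, leave Usenet out entirely.
weights = {"DIAR_SAN_Mercurio": 0.5, "LIBR_Ficcion": 2.0}

weighted = combine_weighted(toy_lists, weights)
print(weighted["casa"], weighted["gobierno"])
```

<p>With a weight of 1 for every source, the same function yields the plain unweighted sum that the <em>Total Occurrences</em> column represents. Because the sub-corpora differ in size, one might prefer to convert each list to relative frequencies before weighting; that choice is left to the researcher.</p>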
<p>The LIFCACH 2.0 contains 476,776 lemmas<sup>3</sup> derived from the approximately 4.5 million types found in the 450 million running words contained in the CODICACH at the time the lists were created.</p>
<p><strong>2. Creation of the LIFCACH</strong></p>
<p>The steps in creating the LIFCACH were as follows:</p>
<ol>
<li>
<p>Type frequency lists based on the running words of each of the 102 sub-corpora of the CODICACH were generated.</p>
</li>
<li>
<p>Each type frequency list was lemmatized and POS-tagged using the Universitat Politècnica de Catalunya’s MS-Tools v2.0<sup>4</sup>.</p>
</li>
<li>
<p>In the …No-Hapax.xlsx version, lemmas with a frequency of 1 (approximately 300,000) were removed. Eliminating these was considered an acceptable trade-off for a far more manageable file size.</p>
</li>
<li>
<p>The resulting lemma frequency lists were assembled and total occurrences were calculated.</p>
</li>
</ol>
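<p>The steps above can be sketched in a few lines of Python. This is a toy reconstruction under stated assumptions, not the authors' actual procedure: the lookup table <code>TOY_LEMMAS</code> is a hypothetical stand-in for MS-Tools, the sample texts are invented, and a hapax is taken here to be a lemma whose <em>total</em> frequency is 1, which the description does not state explicitly.</p>

```python
from collections import Counter

# Hypothetical stand-in for the lemmatizer and POS tagger: maps a word type
# to one (lemma, POS) pair. The real work was done by MS-Tools, not a table.
TOY_LEMMAS = {
    "casas": ("casa", "N"),
    "casa": ("casa", "N"),
    "comieron": ("comer", "V"),
    "come": ("comer", "V"),
    "rápidamente": ("rápidamente", "AV"),
}

def type_frequencies(tokens):
    """Step 1: count word types in the running words of one sub-corpus."""
    return Counter(tokens)

def lemmatize_types(type_freqs):
    """Step 2: map each type to a (lemma, POS) pair and pool the counts."""
    lemma_freqs = Counter()
    for word, count in type_freqs.items():
        # Types missing from the toy table keep their form, with no POS.
        lemma_freqs[TOY_LEMMAS.get(word, (word, None))] += count
    return lemma_freqs

def assemble(per_source):
    """Step 4: sum the per-source lemma lists into total occurrences."""
    totals = Counter()
    for lemma_freqs in per_source.values():
        totals.update(lemma_freqs)
    return totals

def drop_hapax(totals):
    """Step 3 (assumed to apply to totals): keep lemmas occurring more than once."""
    return {lemma: n for lemma, n in totals.items() if n > 1}

# Invented miniature sub-corpora, keyed by real LIFCACH source codes.
toy_corpora = {
    "LIBR_Ficcion": "casas casa come rápidamente".split(),
    "ORAL_TV": "casa comieron".split(),
}

per_source = {
    code: lemmatize_types(type_frequencies(tokens))
    for code, tokens in toy_corpora.items()
}
totals = assemble(per_source)
no_hapax = drop_hapax(totals)
```

<p>In this toy run, <em>rápidamente</em> occurs only once overall and is therefore absent from <code>no_hapax</code>, while <em>casa</em> and <em>comer</em> survive.</p>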
<p>One important caveat applies to this methodology. The use of type frequency lists instead of running words in the POS tagging and lemmatizing process was a practical necessity, due to the speed of the software used and the computing resources available at the time the LIFCACH was created. However, this reduced the accuracy of the lemmatization process by eliminating context. As a result, the software had to analyze words such as <em>canto</em> without the information required to decide whether a given instance of this word is a form of the verb <em>cantar</em> or the noun <em>canto</em>.</p>
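<p>The effect of losing context can be shown with a minimal Python sketch. It assumes, purely for illustration, that a type-level lemmatizer returns a single analysis per word type and simply takes the first candidate; how MS-Tools actually resolved such cases is not described here, and the table <code>CANDIDATES</code> and the count of 1000 are invented.</p>

```python
# Hypothetical analyses for two ambiguous word types; a real lexicon is far larger.
CANDIDATES = {
    "canto": [("cantar", "V"), ("canto", "N")],
    "casas": [("casa", "N"), ("casar", "V")],
}

def lemmatize_type(word, count):
    """Assign every occurrence of a type to its first listed analysis."""
    lemma, pos = CANDIDATES[word][0]
    return {(lemma, pos): count}

# All 1000 invented occurrences of "canto" end up under the verb "cantar",
# even if many of them were the noun in the original texts.
result = lemmatize_type("canto", 1000)
```

<p>Whatever selection rule is used, a context-free approach of this kind cannot divide the occurrences of one type between competing lemmas, which is the loss of accuracy the caveat refers to.</p>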
<p>It should also be noted that the lemmatizing and tagging software that was used is based on European Spanish, a national dialect that is rather removed from Chilean Spanish.</p>
<p><strong>3. Part of Speech Categories</strong></p>
<p>The following are the POS codes used in the frequency lists:</p>
<ul>
<li>AJ = Adjective</li>
<li>AV = Adverb</li>
<li>C = Conjunction</li>
<li>D = Determiner</li>
<li>I = Interjection</li>
<li>N = Noun, Common</li>
<li>NG = Noun, Geographic (Toponym)</li>
<li>NP = Noun, Proper</li>
<li>PN = Pronoun</li>
<li>PP = Preposition</li>
<li>SG = Abbreviation</li>
<li>V = Verb</li>
</ul>
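<p>For scripted use, the codes above can be kept in a simple lookup table. The following Python sketch is one possible way to do so; the names <code>POS_LABELS</code> and <code>describe</code> are assumptions made for this example.</p>

```python
# POS codes as listed in this document, mapped to their descriptions.
POS_LABELS = {
    "AJ": "Adjective",
    "AV": "Adverb",
    "C": "Conjunction",
    "D": "Determiner",
    "I": "Interjection",
    "N": "Noun, Common",
    "NG": "Noun, Geographic (Toponym)",
    "NP": "Noun, Proper",
    "PN": "Pronoun",
    "PP": "Preposition",
    "SG": "Abbreviation",
    "V": "Verb",
}

def describe(lemma, pos):
    """Return a readable label for a lemma and its POS code."""
    return f"{lemma} ({POS_LABELS[pos]})"

print(describe("cantar", "V"))
```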
<p><strong>4. List of Sources</strong></p>
<p>Each frequency list in the LIFCACH is derived from a different sub-corpus of the CODICACH. The codes used for these lists are as follows:</p>
<ul>
<li>ACAD_CCAA = Academic Texts – Applied Sciences</li>
<li>ACAD_CCNN = Academic Texts – Natural Sciences</li>
<li>ACAD_CCSS = Academic Texts – Social Sciences</li>
<li>ACAD_Hum = Academic Texts – Humanities</li>
<li>DIAR_CEN_Estrella_Valpo = Newspaper – Central Chile – Estrella de Valparaíso</li>
<li>DIAR_CEN_Gran_Valpo = Newspaper – Central Chile – Gran Valparaíso</li>
<li>DIAR_CEN_Lider_San_Antonio = Newspaper – Central Chile – El Líder, San Antonio</li>
<li>DIAR_CEN_Mercurio_Valpo = Newspaper – Central Chile – El Mercurio, Valparaíso</li>
<li>DIAR_NOR_Estrella_Arica = Newspaper – North Chile – La Estrella, Arica</li>
<li>DIAR_NOR_Estrella_Iquique = Newspaper – North Chile – La Estrella, Iquique</li>
<li>DIAR_NOR_Estrella_Loa = Newspaper – North Chile – La Estrella, Loa</li>
<li>DIAR_NOR_Estrella_Norte_Antofagasta = Newspaper – North Chile – La Estrella, Antofagasta</li>
<li>DIAR_NOR_Mercurio_Antofagasta = Newspaper – North Chile – El Mercurio, Antofagasta</li>
<li>DIAR_NOR_Mercurio_Calama = Newspaper – North Chile – El Mercurio, Calama</li>
<li>DIAR_NOR_Nortino_Iquique = Newspaper – North Chile – El Nortino, Iquique</li>
<li>DIAR_SAN_Cuarta = Newspaper – Santiago – La Cuarta</li>
<li>DIAR_SAN_Estrategia = Newspaper – Santiago – Estrategia</li>
<li>DIAR_SAN_Firme = Newspaper – Santiago – La Firme</li>
<li>DIAR_SAN_Mercurio = Newspaper – Santiago – El Mercurio</li>
<li>DIAR_SAN_Metropolitano = Newspaper – Santiago – El Metropolitano</li>
<li>DIAR_SAN_Mostrador = Newspaper – Santiago – El Mostrador</li>
<li>DIAR_SAN_Primera_Linea = Newspaper – Santiago – Primera Línea</li>
<li>DIAR_SAN_Primera_Pagina-El_Area = Newspaper – Santiago – Primera Página / El Área</li>
<li>DIAR_SAN_Segunda = Newspaper – Santiago – La Segunda</li>
<li>DIAR_SAN_Tercera = Newspaper – Santiago – La Tercera</li>
<li>DIAR_SAN_Ultimas_Noticias = Newspaper – Santiago – Las Últimas Noticias</li>
<li>DIAR_SUR_Austral_Osorno = Newspaper – South Chile – Austral, Osorno</li>
<li>DIAR_SUR_Austral_Temuco = Newspaper – South Chile – Austral, Temuco</li>
<li>DIAR_SUR_Austral_Valdivia = Newspaper – South Chile – Austral, Valdivia</li>
<li>DIAR_SUR_Cronica = Newspaper – South Chile – Crónica</li>
<li>DIAR_SUR_El_Sur = Newspaper – South Chile – El Sur</li>
<li>DIAR_SUR_Enc_BioBio = Newspaper – South Chile – Enciclop. Bío-Bío</li>
<li>DIAR_SUR_Llanquihue_Pto_Montt = Newspaper – South Chile – El Llanquihue, Pto. Montt</li>
<li>ESPER_CartasDirector = Personal Writings – Letters to Editor</li>
<li>ESPER_ForosInet = Personal Writings – Internet Site Forums</li>
<li>ESPER_Clasificados = Personal Writings – Classified Ads</li>
<li>ESPER_ForosMedios = Personal Writings – Media Forums</li>
<li>ESPER_Usenet = Personal Writings – Usenet</li>
<li>LEX_Jurisprudencia = Legal – Jurisprudence</li>
<li>LEX_Leyes = Legal – Laws</li>
<li>LEX_Libros = Legal – Law Books</li>
<li>LEX_Misc = Legal – Miscellaneous</li>
<li>LIBR_Ficcion = Books – Fiction</li>
<li>LIBR_NoFiccion = Books – Non-Fiction</li>
<li>OBRC_CandiaCares_DicoCoa = Reference Works – Dictionary of Coa</li>
<li>OBRC_GonzalezParra_ManualProvrb = Reference Works – Book of Chilean Proverbs</li>
<li>ORAL_Entrevistas_Lgtcas = Oral – Linguistic Interviews</li>
<li>ORAL_TV = Oral – Television</li>
<li>PUB_Misc = Advertising – General 1</li>
<li>PUB_Publicidad = Advertising – General 2</li>
<li>REV_CMP_ChileTech = Magazine – Computers – ChileTech</li>
<li>REV_CMP_CompuChile = Magazine – Computers – CompuChile</li>
<li>REV_CMP_ComputerWorld = Magazine – Computers – ComputerWorld</li>
<li>REV_CMP_Informatica = Magazine – Computers – Informática</li>
<li>REV_CMP_Infoweek = Magazine – Computers – Infoweek</li>
<li>REV_CMP_Internet21 = Magazine – Computers – Internet21</li>
<li>REV_CMP_Mouse = Magazine – Computers – Mouse</li>
<li>REV_DEP_All = Magazine – Sports</li>
<li>REV_ESP_Capital = Magazine – Specialty – Capital</li>
<li>REV_ESP_CiudadArquitectura = Magazine – Specialty – CiudadArquitectura</li>
<li>REV_ESP_Conicyt = Magazine – Specialty – Conicyt Scientific</li>
<li>REV_ESP_CopropInmob = Magazine – Specialty – Copropiedad Inmobiliaria</li>
<li>REV_ESP_DiarioSocCivil = Magazine – Specialty – Diario de la Sociedad Civil</li>
<li>REV_ESP_Educar = Magazine – Specialty – Educar</li>
<li>REV_ESP_LemuChile = Magazine – Specialty – LemuChile</li>
<li>REV_ESP_Lignum = Magazine – Specialty – Lignum</li>
<li>REV_ESP_Mensaje = Magazine – Specialty – Mensaje</li>
<li>REV_ESP_Notas_CESAF = Magazine – Specialty – Notas CESAF</li>
<li>REV_ESP_Publimark = Magazine – Specialty – Publimark</li>
<li>REV_ESP_Rev_Inf_Musical = Magazine – Specialty – Revista Musical</li>
<li>REV_ESP_Rev_Scielo = Magazine – Specialty – Scielo Scientific</li>
<li>REV_ESP_Rev_Social = Magazine – Specialty – Revista Social</li>
<li>REV_ESP_Rev_Trabajo_Social = Magazine – Specialty – Revista de Trabajo Social</li>
<li>REV_ESP_RevChil_Cirujia = Magazine – Specialty – Revista Chilena de Cirugía</li>
<li>REV_ESP_Revistas_Industriales = Magazine – Specialty – Industrial Magazines</li>
<li>REV_ESP_Sidhartha = Magazine – Specialty – Siddhartha</li>
<li>REV_GEN_Asuntos_Publicos = Magazine – General – Asuntos Públicos</li>
<li>REV_GEN_Cosas = Magazine – General – Cosas</li>
<li>REV_GEN_Cultura_Urbana = Magazine – General – Cultura Urbana</li>
<li>REV_GEN_El_Siglo = Magazine – General – El Siglo</li>
<li>REV_GEN_Ercilla = Magazine – General – Ercilla</li>
<li>REV_GEN_Hacer_Familia = Magazine – General – Hacer Familia</li>
<li>REV_GEN_Man = Magazine – General – Man</li>
<li>REV_GEN_Mujer_a_mujer = Magazine – General – Mujer a mujer</li>
<li>REV_GEN_Nos = Magazine – General – Nos</li>
<li>REV_GEN_Puerto_Paralelo = Magazine – General – Puerto Paralelo</li>
<li>REV_GEN_Punto_Final = Magazine – General – Punto Final</li>
<li>REV_GEN_Que_Pasa = Magazine – General – Qué Pasa</li>
<li>REV_GEN_Revista_ED = Magazine – General – Revista ED</li>
<li>REV_GEN_Rocinante = Magazine – General – Rocinante</li>
<li>REV_INF_Dirigible = Magazine – Children’s – Dirigible</li>
<li>REV_INF_Icarito = Magazine – Children’s – Icarito</li>
<li>REV_INF_Papas_Fritas = Magazine – Children’s – Papas Fritas</li>
<li>REV_INF_Volare = Magazine – Children’s – Volare</li>
<li>REV_JUV_All = Magazines – Youth</li>
<li>REV_LOC_All = Magazines – Local</li>
<li>RVDI_ECN_Diario_PyME = Financial Mags & Newspapers – Diario PyME</li>
<li>RVDI_ECN_El_Diario = Financial Mags & Newspapers – El Diario</li>
<li>RVDI_ECN_Emprendedores = Financial Mags & Newspapers – Emprendedores</li>
<li>RVDI_ECN_Negocios_Ambientales = Financial Mags & Newspapers – Negoc. Ambientales</li>
<li>SIT_INS_All = Government Sites 1</li>
<li>SIT_INS_Old = Government Sites 2</li>
</ul>
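<p>Because the source codes share prefixes (for example <code>DIAR</code> for newspapers and <code>REV</code> for magazines), lists can be selected by family before they are combined. The Python sketch below groups codes by the text before the first underscore; the helper names <code>family</code> and <code>group_by_family</code> are assumptions made for this example, and only a handful of the 102 codes are shown.</p>

```python
from collections import defaultdict

def family(code):
    """Return the prefix before the first underscore, e.g. 'DIAR' or 'REV'."""
    return code.split("_", 1)[0]

def group_by_family(codes):
    """Group source codes by their family prefix, preserving input order."""
    groups = defaultdict(list)
    for code in codes:
        groups[family(code)].append(code)
    return dict(groups)

# A small sample of the codes listed above.
sample = [
    "ACAD_Hum",
    "DIAR_SAN_Mercurio",
    "DIAR_SUR_El_Sur",
    "REV_GEN_Cosas",
    "ORAL_TV",
]

groups = group_by_family(sample)
print(groups["DIAR"])
```

<p>A researcher interested only in the press could, for instance, take <code>groups["DIAR"]</code> as the set of lists to sum or weight.</p>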
<p><strong>NOTES</strong></p>
<p><sup>1</sup> Although the CODICACH does contain two oral corpora, <em>ORAL_Entrevistas_Lgtcas</em> and <em>ORAL_TV</em>, these are of such negligible size that the CODICACH must be considered a corpus of written Spanish.</p>
<p><sup>2</sup> The CODICACH currently contains approximately 850 million words.</p>
<p><sup>3</sup> This is the number of non-hapax lemmas. The total number of lemmas in the LIFCACH, including hapax legomena, is 844,370.</p>
<p><sup>4</sup> MS-Tools was the predecessor of FreeLing.</p>
Zenodo
2012-08-01
info:eu-repo/semantics/other
757640
1579542003.902046
9045932
md5:d9f7c799c883661b8733f1c29168acd2
https://zenodo.org/records/268043/files/Sadowsky_&_Martinez_-_LIFCACH-2.0.zip
public
http://sadowsky.cl/lifcach.html
Is identical to
url
isVersionOf
doi