LIFCACH 2.0: Word Frequency List of Chilean Spanish (Lista de Frecuencias de Palabras del Castellano de Chile), version 2.0

Sadowsky, Scott; Martínez-Gamboa, Ricardo

  "abstract": "<p><strong>LIFCACH 2.0<br>\nWord Frequency List of Chilean Spanish (Lista de Frecuencias de Palabras del Castellano de Chile), version 2.0</strong></p>\n\n<p><em>More information, as well as the Spanish version of this document, is available in the included README file.</em></p>\n\n<p><strong>1. Description</strong></p>\n\n<p>The Word Frequency List of Chilean Spanish (LIFCACH) is a set of 102 frequency lists derived from the sub-corpora of the <em>Corpus Din\u00e1mico del Castellano de Chile</em> (Dynamic Corpus of Chilean Spanish, CODICACH), a database of contemporary written<sup>1</sup> Chilean Spanish developed by Sadowsky between 1997 and 2002; this corpus contained approximately 450 million words when the LIFCACH was created<sup>2</sup>. The LIFCACH also contains a non-weighted list of total frequencies (the <em>Total Occurrences</em> column), which is the sum of the frequencies of the 102 individual lists (in other words, the list of frequencies of the entire CODICACH corpus.)</p>\n\n<p>The CODICACH is an opportunistic corpus with a bias toward press-based sources; it does not seek to be a BNC-style representative sampling of the overall written language. The modular nature of the CODICACH and of the 102 individual LIFCACH lists, however, allows researchers to use one or more of these lists alone, to combine them as needed, or to create their own frequency lists for Chilean Spanish by weighting each of the LIFCACH\u2019s individual lists as they see fit.</p>\n\n<p>The LIFCACH 2.0 contains 476,776 lemmas<sup>3</sup> derived from the approximately 4.5 million types found in the 450 million running words contained in the CODICACH at the time the lists were created.</p>\n\n<p><strong>2. 
Creation of the LIFCACH</strong></p>\n\n<p>The steps in creating the LIFCACH were as follows:</p>\n\n<ol>\n\t<li>\n\t<p>A type frequency list was generated from the running words of each of the 102 sub-corpora of the CODICACH.</p>\n\t</li>\n\t<li>\n\t<p>Each type frequency list was lemmatized and POS-tagged using the Universitat Polit\u00e8cnica de Catalunya\u2019s MS-Tools v2.0<sup>4</sup>.</p>\n\t</li>\n\t<li>\n\t<p>In the \u2026No-Hapax.xlsx version, lemmas with a frequency of 1 (approximately 300,000) were removed. Eliminating them was considered an acceptable trade-off for a far more manageable file size.</p>\n\t</li>\n\t<li>\n\t<p>The resulting lemma frequency lists were assembled and total occurrences were calculated.</p>\n\t</li>\n</ol>\n\n<p>One important caveat applies to this methodology. Using type frequency lists instead of running text for POS tagging and lemmatization was a practical necessity, given the speed of the software and the computing resources available when the LIFCACH was created. However, this removed context and therefore reduced the accuracy of the lemmatization. As a result, the software had to analyze words such as <em>canto</em> without the information needed to decide whether a given instance is a form of the verb <em>cantar</em> or the noun <em>canto</em>.</p>\n\n<p>It should also be noted that the lemmatizing and tagging software used is based on European Spanish, a national dialect that is quite distinct from Chilean Spanish.</p>\n\n<p><strong>3. 
Part of Speech Categories</strong></p>\n\n<p>The following are the POS codes used in the frequency lists:</p>\n\n<ul>\n\t<li>AJ = Adjective</li>\n\t<li>AV = Adverb</li>\n\t<li>C = Conjunction</li>\n\t<li>D = Determiner</li>\n\t<li>I = Interjection</li>\n\t<li>N = Noun, Common</li>\n\t<li>NG = Noun, Geographic (Toponym)</li>\n\t<li>NP = Noun, Proper</li>\n\t<li>PN = Pronoun</li>\n\t<li>PP = Preposition</li>\n\t<li>SG = Abbreviation</li>\n\t<li>V = Verb</li>\n</ul>\n\n<p><strong>4. List of Sources</strong></p>\n\n<p>Each frequency list in the LIFCACH is derived from a different sub-corpus of the CODICACH. The codes used for these lists are as follows:</p>\n\n<ul>\n\t<li>ACAD_CCAA = Academic Texts - Applied Sciences</li>\n\t<li>ACAD_CCNN = Academic Texts - Natural Sciences</li>\n\t<li>ACAD_CCSS = Academic Texts - Social Sciences</li>\n\t<li>ACAD_Hum = Academic Texts - Humanities</li>\n\t<li>DIAR_CEN_Estrella_Valpo = Newspaper \u2013 Central Chile \u2013 Estrella de Valpara\u00edso</li>\n\t<li>DIAR_CEN_Gran_Valpo = Newspaper \u2013 Central Chile \u2013 Gran Valpara\u00edso</li>\n\t<li>DIAR_CEN_Lider_San_Antonio = Newspaper \u2013 Central Chile \u2013 El L\u00edder, San Antonio</li>\n\t<li>DIAR_CEN_Mercurio_Valpo = Newspaper \u2013 Central Chile \u2013 El Mercurio, Valpara\u00edso</li>\n\t<li>DIAR_NOR_Estrella_Arica = Newspaper \u2013 North Chile \u2013 La Estrella, Arica</li>\n\t<li>DIAR_NOR_Estrella_Iquique = Newspaper \u2013 North Chile \u2013 La Estrella, Iquique</li>\n\t<li>DIAR_NOR_Estrella_Loa = Newspaper \u2013 North Chile \u2013 La Estrella, Loa</li>\n\t<li>DIAR_NOR_Estrella_Norte_Antofagasta = Newspaper \u2013 North Chile \u2013 La Estrella, Antofagasta</li>\n\t<li>DIAR_NOR_Mercurio_Antofagasta = Newspaper \u2013 North Chile \u2013 El Mercurio, Antofagasta</li>\n\t<li>DIAR_NOR_Mercurio_Calama = Newspaper \u2013 North Chile \u2013 El Mercurio, Calama</li>\n\t<li>DIAR_NOR_Nortino_Iquique = Newspaper \u2013 North Chile \u2013 El Nortino, 
Iquique</li>\n\t<li>DIAR_SAN_Cuarta = Newspaper \u2013 Santiago \u2013 La Cuarta</li>\n\t<li>DIAR_SAN_Estrategia = Newspaper \u2013 Santiago \u2013 Estrategia</li>\n\t<li>DIAR_SAN_Firme = Newspaper \u2013 Santiago \u2013 La Firme</li>\n\t<li>DIAR_SAN_Mercurio = Newspaper \u2013 Santiago \u2013 El Mercurio</li>\n\t<li>DIAR_SAN_Metropolitano = Newspaper \u2013 Santiago \u2013 El Metropolitano</li>\n\t<li>DIAR_SAN_Mostrador = Newspaper \u2013 Santiago \u2013 El Mostrador</li>\n\t<li>DIAR_SAN_Primera_Linea = Newspaper \u2013 Santiago \u2013 Primera L\u00ednea</li>\n\t<li>DIAR_SAN_Primera_Pagina-El_Area = Newspaper \u2013 Santiago \u2013 Primera P\u00e1gina / El \u00c1rea</li>\n\t<li>DIAR_SAN_Segunda = Newspaper \u2013 Santiago \u2013 La Segunda</li>\n\t<li>DIAR_SAN_Tercera = Newspaper \u2013 Santiago \u2013 La Tercera</li>\n\t<li>DIAR_SAN_Ultimas_Noticias = Newspaper \u2013 Santiago \u2013 Las \u00daltimas Noticias</li>\n\t<li>DIAR_SUR_Austral_Osorno = Newspaper \u2013 South Chile \u2013 Austral, Osorno</li>\n\t<li>DIAR_SUR_Austral_Temuco = Newspaper \u2013 South Chile \u2013 Austral, Temuco</li>\n\t<li>DIAR_SUR_Austral_Valdivia = Newspaper \u2013 South Chile \u2013 Austral, Valdivia</li>\n\t<li>DIAR_SUR_Cronica = Newspaper \u2013 South Chile \u2013 Cr\u00f3nica</li>\n\t<li>DIAR_SUR_El_Sur = Newspaper \u2013 South Chile \u2013 El Sur</li>\n\t<li>DIAR_SUR_Enc_BioBio = Newspaper \u2013 South Chile \u2013 Enciclop. B\u00edo-B\u00edo</li>\n\t<li>DIAR_SUR_Llanquihue_Pto_Montt = Newspaper \u2013 South Chile \u2013 El Llanquihue, Pto. 
Montt</li>\n\t<li>ESPER_CartasDirector = Personal Writings \u2013 Letters to Editor</li>\n\t<li>ESPER_ForosInet = Personal Writings \u2013 Internet Site Forums</li>\n\t<li>ESPER_Clasificados = Personal Writings \u2013 Classified Ads</li>\n\t<li>ESPER_ForosMedios = Personal Writings \u2013 Media Forums</li>\n\t<li>ESPER_Usenet = Personal Writings \u2013 Usenet</li>\n\t<li>LEX_Jurisprudencia = Legal \u2013 Jurisprudence</li>\n\t<li>LEX_Leyes = Legal \u2013 Laws</li>\n\t<li>LEX_Libros = Legal \u2013 Law Books</li>\n\t<li>LEX_Misc = Legal \u2013 Miscellaneous</li>\n\t<li>LIBR_Ficcion = Books \u2013 Fiction</li>\n\t<li>LIBR_NoFiccion = Books \u2013 Non-Fiction</li>\n\t<li>OBRC_CandiaCares_DicoCoa = Reference Works \u2013 Dictionary of Coa</li>\n\t<li>OBRC_GonzalezParra_ManualProvrb = Reference Works \u2013 Book of Chilean Proverbs</li>\n\t<li>ORAL_Entrevistas_Lgtcas = Oral \u2013 Linguistic Interviews</li>\n\t<li>ORAL_TV = Oral \u2013 Television</li>\n\t<li>PUB_Misc = Advertising \u2013 General 1</li>\n\t<li>PUB_Publicidad = Advertising \u2013 General 2</li>\n\t<li>REV_CMP_ChileTech = Magazine \u2013 Computers \u2013 ChileTech</li>\n\t<li>REV_CMP_CompuChile = Magazine \u2013 Computers \u2013 CompuChile</li>\n\t<li>REV_CMP_ComputerWorld = Magazine \u2013 Computers \u2013 ComputerWorld</li>\n\t<li>REV_CMP_Informatica = Magazine \u2013 Computers \u2013 Inform\u00e1tica</li>\n\t<li>REV_CMP_Infoweek = Magazine \u2013 Computers \u2013 Infoweek</li>\n\t<li>REV_CMP_Internet21 = Magazine \u2013 Computers \u2013 Internet21</li>\n\t<li>REV_CMP_Mouse = Magazine \u2013 Computers \u2013 Mouse</li>\n\t<li>REV_DEP_All = Magazine \u2013 Sports</li>\n\t<li>REV_ESP_Capital = Magazine \u2013 Specialty \u2013 Capital</li>\n\t<li>REV_ESP_CiudadArquitectura = Magazine \u2013 Specialty \u2013 CiudadArquitectura</li>\n\t<li>REV_ESP_Conicyt = Magazine \u2013 Specialty \u2013 Conicyt Scientific</li>\n\t<li>REV_ESP_CopropInmob = Magazine \u2013 Specialty \u2013 Copropiedad 
Inmobiliaria</li>\n\t<li>REV_ESP_DiarioSocCivil = Magazine \u2013 Specialty \u2013 Diario de la Sociedad Civil</li>\n\t<li>REV_ESP_Educar = Magazine \u2013 Specialty \u2013 Educar</li>\n\t<li>REV_ESP_LemuChile = Magazine \u2013 Specialty \u2013 LemuChile</li>\n\t<li>REV_ESP_Lignum = Magazine \u2013 Specialty \u2013 Lignum</li>\n\t<li>REV_ESP_Mensaje = Magazine \u2013 Specialty \u2013 Mensaje</li>\n\t<li>REV_ESP_Notas_CESAF = Magazine \u2013 Specialty \u2013 Notas CESAF</li>\n\t<li>REV_ESP_Publimark = Magazine \u2013 Specialty \u2013 Publimark</li>\n\t<li>REV_ESP_Rev_Inf_Musical = Magazine \u2013 Specialty \u2013 Revista Musical</li>\n\t<li>REV_ESP_Rev_Scielo = Magazine \u2013 Specialty \u2013 Scielo Scientific</li>\n\t<li>REV_ESP_Rev_Social = Magazine \u2013 Specialty \u2013 Revista Social</li>\n\t<li>REV_ESP_Rev_Trabajo_Social = Magazine \u2013 Specialty \u2013 Revista de Trabajo Social</li>\n\t<li>REV_ESP_RevChil_Cirujia = Magazine \u2013 Specialty \u2013 Revista Chilena de Cirug\u00eda</li>\n\t<li>REV_ESP_Revistas_Industriales = Magazine \u2013 Specialty \u2013 Industrial Magazines</li>\n\t<li>REV_ESP_Sidhartha = Magazine \u2013 Specialty \u2013 Siddhartha</li>\n\t<li>REV_GEN_Asuntos_Publicos = Magazine \u2013 General \u2013 Asuntos P\u00fablicos</li>\n\t<li>REV_GEN_Cosas = Magazine \u2013 General \u2013 Cosas</li>\n\t<li>REV_GEN_Cultura_Urbana = Magazine \u2013 General \u2013 Cultura Urbana</li>\n\t<li>REV_GEN_El_Siglo = Magazine \u2013 General \u2013 El Siglo</li>\n\t<li>REV_GEN_Ercilla = Magazine \u2013 General \u2013 Ercilla</li>\n\t<li>REV_GEN_Hacer_Familia = Magazine \u2013 General \u2013 Hacer Familia</li>\n\t<li>REV_GEN_Man = Magazine \u2013 General \u2013 Man</li>\n\t<li>REV_GEN_Mujer_a_mujer = Magazine \u2013 General \u2013 Mujer a mujer</li>\n\t<li>REV_GEN_Nos = Magazine \u2013 General \u2013 Nos</li>\n\t<li>REV_GEN_Puerto_Paralelo = Magazine \u2013 General \u2013 Puerto Paralelo</li>\n\t<li>REV_GEN_Punto_Final = Magazine \u2013 General \u2013 Punto 
Final</li>\n\t<li>REV_GEN_Que_Pasa = Magazine \u2013 General \u2013 Qu\u00e9 Pasa</li>\n\t<li>REV_GEN_Revista_ED = Magazine \u2013 General \u2013 Revista ED</li>\n\t<li>REV_GEN_Rocinante = Magazine \u2013 General \u2013 Rocinante</li>\n\t<li>REV_INF_Dirigible = Magazine \u2013 Children\u2019s \u2013 Dirigible</li>\n\t<li>REV_INF_Icarito = Magazine \u2013 Children\u2019s \u2013 Icarito</li>\n\t<li>REV_INF_Papas_Fritas = Magazine \u2013 Children\u2019s \u2013 Papas Fritas</li>\n\t<li>REV_INF_Volare = Magazine \u2013 Children\u2019s \u2013 Volare</li>\n\t<li>REV_JUV_All = Magazines \u2013 Youth</li>\n\t<li>REV_LOC_All = Magazines \u2013 Local</li>\n\t<li>RVDI_ECN_Diario_PyME = Financial Mags &amp; Newspapers \u2013 Diario PyME</li>\n\t<li>RVDI_ECN_El_Diario = Financial Mags &amp; Newspapers \u2013 El Diario</li>\n\t<li>RVDI_ECN_Emprendedores = Financial Mags &amp; Newspapers \u2013 Emprendedores</li>\n\t<li>RVDI_ECN_Negocios_Ambientales = Financial Mags &amp; Newspapers \u2013 Negoc. Ambientales</li>\n\t<li>SIT_INS_All = Government Sites 1</li>\n\t<li>SIT_INS_Old = Government Sites 2</li>\n</ul>\n\n<p><strong>NOTES</strong></p>\n\n<p><sup>1</sup> Although the CODICACH does contain two oral corpora, <em>ORAL_Entrevistas_Lgtcas</em> and <em>ORAL_TV</em>, these are of such negligible size that the CODICACH must be considered a corpus of written Spanish.</p>\n\n<p><sup>2</sup> The CODICACH currently contains approximately 850 million words.</p>\n\n<p><sup>3</sup> This is the number of non-hapax lemmas. The total number of lemmas in the LIFCACH, including hapax legomena, is 844,370.</p>\n\n<p><sup>4</sup> MS-Tools was the predecessor of FreeLing.</p>", 
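The description above notes that researchers can build their own frequency lists by weighting the individual LIFCACH lists as they see fit. The following is a minimal sketch of one way this could be done, not the authors' procedure. It assumes the relevant columns have already been loaded into plain dictionaries keyed by source code; the function name `weighted_frequencies`, the choice to normalize each list to a per-million rate before weighting, and the toy counts are all illustrative assumptions and are not taken from the LIFCACH files.

```python
from collections import defaultdict


def weighted_frequencies(lists, weights):
    """Combine per-source lemma counts into one weighted frequency list.

    lists:   {source_code: {lemma: raw_count}}
    weights: {source_code: non-negative weight}; sources that do not
             appear in `weights` are ignored.
    Returns {lemma: weighted frequency per million words}.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")

    combined = defaultdict(float)
    for source, weight in weights.items():
        counts = lists[source]
        size = sum(counts.values())
        if size == 0:
            continue  # an empty list contributes nothing
        share = weight / total_weight
        for lemma, count in counts.items():
            # Normalize to a per-million rate so that large sub-corpora
            # do not dominate simply because of their size.
            combined[lemma] += share * (count / size) * 1_000_000
    return dict(combined)


# Toy counts invented for illustration only (NOT real LIFCACH data).
toy_lists = {
    "DIAR_SAN_Mercurio": {"casa": 30, "gobierno": 70},
    "LIBR_Ficcion": {"casa": 80, "gobierno": 20},
}
result = weighted_frequencies(
    toy_lists, {"DIAR_SAN_Mercurio": 1, "LIBR_Ficcion": 3}
)
```

Normalizing before weighting is a design choice: summing raw counts would instead reproduce the behavior of the non-weighted Total Occurrences column, where the largest sub-corpora carry the most influence.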
