Software Open Access

Supplementary Materials for "Creating a Frequency Dictionary of Spoken Hebrew"

Juan D. Pinto


MARC21 XML Export

<?xml version='1.0' encoding='UTF-8'?>
<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nmm##2200000uu#4500</leader>
  <datafield tag="041" ind1=" " ind2=" ">
    <subfield code="a">eng</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">corpus data</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">linguistic analysis</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">language learning</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">Python 3</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">OpenSubtitles</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">Internet Movie Database</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">Hebrew</subfield>
  </datafield>
  <controlfield tag="005">20191101071556.0</controlfield>
  <controlfield tag="001">1239886</controlfield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">679926</subfield>
    <subfield code="z">md5:d8d6fdddb7fb53fcec026117925a0380</subfield>
    <subfield code="u">https://zenodo.org/record/1239886/files/juandpinto/frequency-dictionary-v1.0.zip</subfield>
  </datafield>
  <datafield tag="542" ind1=" " ind2=" ">
    <subfield code="l">open</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="c">2018-05-02</subfield>
  </datafield>
  <datafield tag="909" ind1="C" ind2="O">
    <subfield code="o">oai:zenodo.org:1239886</subfield>
  </datafield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="u">University of Texas at Austin</subfield>
    <subfield code="a">Juan D. Pinto</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">Supplementary Materials for "Creating a Frequency Dictionary of Spoken Hebrew"</subfield>
  </datafield>
  <datafield tag="540" ind1=" " ind2=" ">
    <subfield code="u">http://www.opensource.org/licenses/MIT</subfield>
    <subfield code="a">MIT License</subfield>
  </datafield>
  <datafield tag="650" ind1="1" ind2="7">
    <subfield code="a">cc-by</subfield>
    <subfield code="2">opendefinition.org</subfield>
  </datafield>
  <datafield tag="520" ind1=" " ind2=" ">
    <subfield code="a">&lt;p&gt;This repository houses all the scripts and files used to create the &lt;em&gt;Frequency Dictionary of Spoken Hebrew&lt;/em&gt;&amp;nbsp;(FDOSH), along with the dictionary itself. This project was created as part of my MA thesis at the University of Texas at Austin in 2018. The thesis itself describes the creation process&amp;mdash;and the use of each script&amp;mdash;in depth, and can be found in the &lt;a href="https://repositories.lib.utexas.edu/"&gt;University of Texas thesis repository&lt;/a&gt;. A GitHub repository for the thesis manuscript can also be found at &lt;a href="https://github.com/juandpinto/thesis-manuscript"&gt;https://github.com/juandpinto/thesis-manuscript&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The scripts make use of OPUS&amp;#39;s &lt;a href="https://opus.nlpl.eu/OpenSubtitles2018.php"&gt;OpenSubtitles2018&lt;/a&gt;&amp;nbsp;collection, which is a mega-corpus of cleaned, tokenized, and parsed versions of XML files originally obtained from &lt;a href="http://opensubtitles.org"&gt;opensubtitles.org&lt;/a&gt;. The final frequency dictionary consists of Hebrew lemmas ordered by a usage coefficient based on Gries&amp;#39; (2008) deviation of proportions, or &lt;em&gt;U&lt;sub&gt;DP&lt;/sub&gt;&lt;/em&gt;. It also includes frequency and range measures for each entry.&lt;/p&gt;

&lt;p&gt;The most important files in this repository are listed below.&lt;/p&gt;

&lt;p&gt;- The &lt;em&gt;Frequency Dictionary of Spoken Hebrew&lt;/em&gt;&amp;nbsp;(FDOSH): &lt;em&gt;export/frequency-dictionary.tsv&lt;/em&gt;&lt;br&gt;
- The main script used for creating the dictionary: &lt;em&gt;create-freq-list.py&lt;/em&gt;&lt;br&gt;
- The script used to clean the OpenSubtitles2018 corpus: &lt;em&gt;single_file_extract.py&lt;/em&gt;&lt;br&gt;
- The script used to fetch movie metadata for each subtitle file in the corpus: &lt;em&gt;OMDb-fetch.py&lt;/em&gt;&lt;br&gt;
- The script used to find the shared entries in two different frequency lists: &lt;em&gt;list_comparison.py&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The script used to fetch movie metadata (OMDb-fetch.py) uses &lt;a href="https://github.com/dgilland"&gt;Derrick Gilland&lt;/a&gt;&amp;#39;s &lt;a href="https://github.com/dgilland/omdb.py"&gt;omdb.py library&lt;/a&gt;, which is a Python wrapper around the &lt;a href="http://omdbapi.com"&gt;OMDb API (Open Movie Database API)&lt;/a&gt;. OMDb is, in turn, a project that makes use of &lt;a href="http://www.imdb.com"&gt;IMDb (Internet Movie Database)&lt;/a&gt; for its data. For each subtitle file in the corpus, the script finds the IMDb ID, title, year, and original language(s). The &lt;em&gt;movies-info/&lt;/em&gt; folder contains extensive lists of the metadata found for the movies used to create the FDOSH.&lt;/p&gt;

&lt;p&gt;Each script includes detailed comments so that it is clear and easy to customize. This project is licensed under the MIT License, so feel free to clone and use it as you see fit. Suggestions and pull requests are also welcome.&lt;/p&gt;</subfield>
  </datafield>
  <datafield tag="773" ind1=" " ind2=" ">
    <subfield code="n">url</subfield>
    <subfield code="i">isSupplementTo</subfield>
    <subfield code="a">https://github.com/juandpinto/frequency-dictionary/tree/v1.0</subfield>
  </datafield>
  <datafield tag="773" ind1=" " ind2=" ">
    <subfield code="n">doi</subfield>
    <subfield code="i">isVersionOf</subfield>
    <subfield code="a">10.5281/zenodo.1239885</subfield>
  </datafield>
  <datafield tag="024" ind1=" " ind2=" ">
    <subfield code="a">10.5281/zenodo.1239886</subfield>
    <subfield code="2">doi</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">software</subfield>
  </datafield>
</record>
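
The record above can be read programmatically. The following is a minimal sketch, assuming the export has been saved as text and that only Python's standard library is available; the helper names `subfields`, `summarize_record`, and `md5_matches` are hypothetical and are not part of the repository's scripts. It pulls out the DOI (field 024), the keywords (field 653), and the file URL and checksum (field 856), and then compares a downloaded file's MD5 digest with the value stored in the record.

```python
import hashlib
import xml.etree.ElementTree as ET

# Namespace declared on the <record> element of the export.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}


def subfields(record, tag, code):
    """Return the text of every subfield `code` inside datafields with `tag`."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    return [el.text for el in record.findall(path, MARC_NS)]


def summarize_record(xml_text):
    """Collect a few fields of interest from a MARC21 XML record (hypothetical helper)."""
    # Parse from bytes so the encoding declaration in the XML prolog is honored.
    if isinstance(xml_text, str):
        xml_text = xml_text.encode("utf-8")
    record = ET.fromstring(xml_text)

    def first(values):
        return values[0] if values else None

    return {
        "doi": first(subfields(record, "024", "a")),
        "keywords": subfields(record, "653", "a"),
        "file_url": first(subfields(record, "856", "u")),
        "md5": first(subfields(record, "856", "z")),
    }


def md5_matches(data, expected):
    """Compare the MD5 digest of `data` (bytes) with a value such as 'md5:<hex>'."""
    expected_hex = expected.split(":")[-1].lower()
    return hashlib.md5(data).hexdigest() == expected_hex
```

In use, `summarize_record` would receive the contents of the export, and `md5_matches` would receive the bytes of the downloaded archive together with the `md5` value from the returned dictionary.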