There is a newer version of this record available.

Dataset Open Access

Swahili : News Classification Dataset

Davis David


MARC21 XML Export

<?xml version='1.0' encoding='UTF-8'?>
<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nmm##2200000uu#4500</leader>
  <datafield tag="999" ind1="C" ind2="5">
    <subfield code="x">https://www.k4all.org/project/language-dataset-fellowship/</subfield>
  </datafield>
  <datafield tag="041" ind1=" " ind2=" ">
    <subfield code="a">swa</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">swahili</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">news</subfield>
  </datafield>
  <controlfield tag="005">20210918162452.0</controlfield>
  <controlfield tag="001">4300294</controlfield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">52342579</subfield>
    <subfield code="z">md5:95c41cff90efda1d961aa67d17f6a269</subfield>
    <subfield code="u">https://zenodo.org/record/4300294/files/train.csv</subfield>
  </datafield>
  <datafield tag="542" ind1=" " ind2=" ">
    <subfield code="l">open</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="c">2020-12-01</subfield>
  </datafield>
  <datafield tag="909" ind1="C" ind2="O">
    <subfield code="p">openaire_data</subfield>
    <subfield code="p">user-africanlp</subfield>
    <subfield code="o">oai:zenodo.org:4300294</subfield>
  </datafield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="u">TYD Innovation Incubator</subfield>
    <subfield code="a">Davis David</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">Swahili : News Classification Dataset</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">user-africanlp</subfield>
  </datafield>
  <datafield tag="540" ind1=" " ind2=" ">
    <subfield code="u">https://creativecommons.org/licenses/by/4.0/legalcode</subfield>
    <subfield code="a">Creative Commons Attribution 4.0 International</subfield>
  </datafield>
  <datafield tag="650" ind1="1" ind2="7">
    <subfield code="a">cc-by</subfield>
    <subfield code="2">opendefinition.org</subfield>
  </datafield>
  <datafield tag="520" ind1=" " ind2=" ">
    <subfield code="a">&lt;p&gt;Swahili is spoken by 100-150 million people across East Africa. In Tanzania, it is one of two national languages (the other is English) and it is the official language of instruction in all schools. News in Swahili is an important part of the media sphere in Tanzania.&lt;/p&gt;

&lt;p&gt;News contributes to education, technology, and the economic growth of a country, and news in local languages plays an important cultural role in many African countries. In the modern age, African languages in news and other spheres are at risk of being lost as English becomes the dominant language in online spaces.&lt;br&gt;
&lt;br&gt;
The Swahili news dataset was created to narrow the gap in NLP technologies for the Swahili language and to help AI practitioners in Tanzania and across the African continent practice their NLP skills on problems in organizations or societies related to Swahili. The news articles were collected from different websites that provide news in Swahili. I found some websites that publish news in Swahili only and others that publish in several languages including Swahili.&lt;br&gt;
&lt;br&gt;
The dataset was created for the specific task of text classification: each news item falls into one of six topics (Local News, International News, Finance News, Health News, Sports News, and Entertainment News). The dataset comes with a specified train/test split; the train set contains 75% of the dataset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Acknowledgment&lt;/strong&gt;: This project was supported by the&amp;nbsp;&lt;a href="https://www.k4all.org/project/language-dataset-fellowship/"&gt;AI4D language dataset fellowship&lt;/a&gt;&amp;nbsp;through K4All and &lt;a href="https://zindi.africa/"&gt;Zindi Africa&lt;/a&gt;.&lt;/p&gt;</subfield>
  </datafield>
  <datafield tag="773" ind1=" " ind2=" ">
    <subfield code="n">doi</subfield>
    <subfield code="i">isVersionOf</subfield>
    <subfield code="a">10.5281/zenodo.4300293</subfield>
  </datafield>
  <datafield tag="024" ind1=" " ind2=" ">
    <subfield code="a">10.5281/zenodo.4300294</subfield>
    <subfield code="2">doi</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">dataset</subfield>
  </datafield>
</record>
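
The 856 datafield in the record above lists the file URL together with what appear to be its size and checksum. As a minimal sketch, assuming subfield `s` is the size in bytes and subfield `z` is an MD5 digest prefixed with `md5:`, the Python below reads those values from the record and checks a local copy of the file against them. The helper names `file_entries` and `matches_record` are hypothetical and not part of any Zenodo tooling; the embedded XML is an abbreviated copy of the record that keeps only the 856 field.

```python
import hashlib
import xml.etree.ElementTree as ET

# Namespace declared on the <record> element of the MARC21 export.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Abbreviated copy of the record above: only the 856 (file) datafield is kept.
RECORD_XML = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">52342579</subfield>
    <subfield code="z">md5:95c41cff90efda1d961aa67d17f6a269</subfield>
    <subfield code="u">https://zenodo.org/record/4300294/files/train.csv</subfield>
  </datafield>
</record>"""


def file_entries(record_xml):
    """Return one dict per 856 datafield with its url, size, and md5 digest.

    Assumption: subfield 's' is the size in bytes and 'z' is 'md5:<hex>'.
    """
    root = ET.fromstring(record_xml)
    entries = []
    for field in root.findall("marc:datafield[@tag='856']", MARC_NS):
        codes = {
            sf.get("code"): (sf.text or "")
            for sf in field.findall("marc:subfield", MARC_NS)
        }
        entries.append({
            "url": codes.get("u", ""),
            "size": int(codes.get("s", "0")),
            # Drop the 'md5:' prefix, keeping only the hex digest.
            "md5": codes.get("z", "").split(":", 1)[-1],
        })
    return entries


def matches_record(path, entry, chunk_size=1 << 20):
    """Compare a local file's byte size and MD5 with one entry from file_entries()."""
    digest = hashlib.md5()
    size = 0
    with open(path, "rb") as fh:
        # Read in chunks so a file of about 50 MB is never held in memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return size == entry["size"] and digest.hexdigest() == entry["md5"]
```

After downloading `train.csv` from the URL in the record, `matches_record("train.csv", file_entries(RECORD_XML)[0])` should return `True` if the local copy is intact, provided the assumptions about subfields `s` and `z` hold.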
                  All versions   This version
Views             3,904          3,737
Downloads         2,406          2,385
Data volume       125.6 GB       124.8 GB
Unique views      3,514          3,410
Unique downloads  2,152          2,140
