Dataset Open Access

TweetsCOV19 - A Semantically Annotated Corpus of Tweets About the COVID-19 Pandemic (Part 2, May 2020)

Baran, Erdal; Dimitrov, Dimitar


MARC21 XML Export

<?xml version='1.0' encoding='UTF-8'?>
<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nmm##2200000uu#4500</leader>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">twitter</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">tweets</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">linked data</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">microblogging</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">RDF</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">csv</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">covid-19</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">coronavirus</subfield>
  </datafield>
  <controlfield tag="005">20210311002725.0</controlfield>
  <controlfield tag="001">4593502</controlfield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="a">Dimitrov, Dimitar</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">404722462</subfield>
    <subfield code="z">md5:e08e4b873841e737cb8cf1835370af4d</subfield>
    <subfield code="u">https://zenodo.org/record/4593502/files/TweetsCOV19_052020.n3.gz</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">197659685</subfield>
    <subfield code="z">md5:4e8fc16a2bea5cd3421578522fb87f22</subfield>
    <subfield code="u">https://zenodo.org/record/4593502/files/TweetsCOV19_052020.tsv.gz</subfield>
  </datafield>
  <datafield tag="542" ind1=" " ind2=" ">
    <subfield code="l">open</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="c">2021-03-10</subfield>
  </datafield>
  <datafield tag="909" ind1="C" ind2="O">
    <subfield code="p">openaire_data</subfield>
    <subfield code="p">user-covid-19</subfield>
    <subfield code="p">user-twitter-datasets</subfield>
    <subfield code="o">oai:zenodo.org:4593502</subfield>
  </datafield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="a">Baran, Erdal</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">TweetsCOV19 - A Semantically Annotated Corpus of Tweets About the COVID-19 Pandemic (Part 2, May 2020)</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">user-covid-19</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">user-twitter-datasets</subfield>
  </datafield>
  <datafield tag="540" ind1=" " ind2=" ">
    <subfield code="u">https://creativecommons.org/licenses/by/4.0/legalcode</subfield>
    <subfield code="a">Creative Commons Attribution 4.0 International</subfield>
  </datafield>
  <datafield tag="650" ind1="1" ind2="7">
    <subfield code="a">cc-by</subfield>
    <subfield code="2">opendefinition.org</subfield>
  </datafield>
  <datafield tag="520" ind1=" " ind2=" ">
    <subfield code="a">&lt;p&gt;&lt;strong&gt;&lt;a href="https://data.gesis.org/tweetscov19/"&gt;TweetsCOV19&lt;/a&gt;&lt;/strong&gt;&lt;strong&gt; &lt;/strong&gt;is a semantically annotated corpus of Tweets about the COVID-19 pandemic. It is a subset of &lt;a href="https://data.gesis.org/tweetskb"&gt;TweetsKB&lt;/a&gt; and aims at capturing online discourse about various aspects of the pandemic and its societal impact. &lt;strong&gt;Metadata&lt;/strong&gt; information about the tweets as well as extracted &lt;strong&gt;entities&lt;/strong&gt;, &lt;strong&gt;sentiments&lt;/strong&gt;, &lt;strong&gt;hashtags&lt;/strong&gt;, &lt;strong&gt;user mentions&lt;/strong&gt;, and &lt;strong&gt;resolved URLs &lt;/strong&gt;are exposed in RDF using established RDF/S vocabularies*.&lt;/p&gt;

&lt;p&gt;We also provide a &lt;em&gt;&lt;strong&gt;tab-separated values (tsv)&lt;/strong&gt;&lt;/em&gt; version of the dataset. Each line contains the features of one tweet instance. Features are separated by a tab character (&amp;quot;\t&amp;quot;). The following list gives the features in column order:&lt;/p&gt;

&lt;ol&gt;
	&lt;li&gt;Tweet Id: Long.&lt;/li&gt;
	&lt;li&gt;Username: String. Encrypted for privacy reasons*.&lt;/li&gt;
	&lt;li&gt;Timestamp: Format ( &amp;quot;EEE MMM dd HH:mm:ss Z yyyy&amp;quot; ).&lt;/li&gt;
	&lt;li&gt;#Followers: Integer.&lt;/li&gt;
	&lt;li&gt;#Friends: Integer.&lt;/li&gt;
	&lt;li&gt;#Retweets: Integer.&lt;/li&gt;
	&lt;li&gt;#Favorites: Integer.&lt;/li&gt;
	&lt;li&gt;Entities: String. For each entity, we aggregated the original text, the annotated entity, and the score produced by the &lt;a href="https://github.com/yahoo/FEL"&gt;FEL&lt;/a&gt; library. Entities are separated from one another by the char &amp;quot;;&amp;quot;, and the three parts of each entity are separated by the char &amp;quot;:&amp;quot;, so that each entity is stored as &amp;quot;original_text:annotated_entity:score;&amp;quot;. If FEL did not find any entities, we have stored &amp;quot;null;&amp;quot;.&lt;/li&gt;
	&lt;li&gt;Sentiment: String. &lt;a href="http://sentistrength.wlv.ac.uk/"&gt;SentiStrength&lt;/a&gt; produces a score for positive (1 to 5) and negative (-1 to -5) sentiment. We separated these two numbers by the whitespace char &amp;quot; &amp;quot;. Positive sentiment was stored first and then negative sentiment (e.g. &amp;quot;2 -1&amp;quot;).&lt;/li&gt;
	&lt;li&gt;Mentions: String. If the tweet contains mentions, we remove the char &amp;quot;@&amp;quot; and concatenate the mentions with whitespace char &amp;quot; &amp;quot;. If no mentions appear, we have stored &amp;quot;null;&amp;quot;.&lt;/li&gt;
	&lt;li&gt;Hashtags: String. If the tweet contains hashtags, we remove the char &amp;quot;#&amp;quot; and concatenate the hashtags with whitespace char &amp;quot; &amp;quot;. If no hashtags appear, we have stored &amp;quot;null;&amp;quot;.&lt;/li&gt;
	&lt;li&gt;URLs: String. If the tweet contains URLs, we concatenate the URLs using &amp;quot;:-: &amp;quot;. If no URLs appear, we have stored &amp;quot;null;&amp;quot;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To extract the dataset from &lt;a href="https://data.gesis.org/tweetskb"&gt;TweetsKB&lt;/a&gt;, we compiled a seed list of 268 COVID-19-related &lt;a href="https://data.gesis.org/tweetscov19/keywords.txt"&gt;keywords&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;* For the sake of privacy, we anonymize&amp;nbsp;user IDs&amp;nbsp;and we do not provide the text of the tweets.&lt;/em&gt;&lt;/p&gt;</subfield>
  </datafield>
  <datafield tag="773" ind1=" " ind2=" ">
    <subfield code="n">url</subfield>
    <subfield code="i">isDocumentedBy</subfield>
    <subfield code="a">https://data.gesis.org/tweetscov19/</subfield>
  </datafield>
  <datafield tag="773" ind1=" " ind2=" ">
    <subfield code="n">doi</subfield>
    <subfield code="i">isVersionOf</subfield>
    <subfield code="a">10.5281/zenodo.4593501</subfield>
  </datafield>
  <datafield tag="024" ind1=" " ind2=" ">
    <subfield code="a">10.5281/zenodo.4593502</subfield>
    <subfield code="2">doi</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">dataset</subfield>
  </datafield>
</record>
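
The feature list in the description above can be turned into a small parser for one line of the tsv file. The following is a minimal sketch in Python, not code shipped with the dataset: the names `parse_line`, `parse_entities`, `split_tokens`, and `Entity`, the dictionary keys, and the sample line are all hypothetical and were made up for illustration. It assumes every line has exactly 12 tab-separated features in the listed order, that "null;" marks an empty field, and that the last two ":" characters of an entity delimit the annotated entity and the score.

```python
from datetime import datetime
from typing import Any, Dict, List, NamedTuple, Optional

NULL_MARKER = "null;"  # placeholder the description gives for empty fields
# strptime equivalent of the documented "EEE MMM dd HH:mm:ss Z yyyy" pattern;
# %a and %b assume English day and month names (default C locale).
TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


class Entity(NamedTuple):
    original_text: str
    annotated_entity: str
    score: float


def parse_entities(field: str) -> List[Entity]:
    """Parse 'original_text:annotated_entity:score;' groups."""
    if field.strip() == NULL_MARKER:
        return []
    entities = []
    for chunk in field.split(";"):
        if not chunk.strip():
            continue  # skip the empty piece after the trailing ';'
        # Assumption: split from the right so a ':' inside the
        # original text does not break the three-part structure.
        parts = chunk.rsplit(":", 2)
        if len(parts) != 3:
            continue  # malformed group, ignored in this sketch
        text, annotated, score = parts
        entities.append(Entity(text, annotated, float(score)))
    return entities


def split_tokens(field: str, sep: Optional[str] = None) -> List[str]:
    """Split mentions, hashtags, or URLs; 'null;' yields an empty list."""
    if field.strip() == NULL_MARKER:
        return []
    return [tok.strip() for tok in field.split(sep) if tok.strip()]


def parse_line(line: str) -> Dict[str, Any]:
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 12:
        raise ValueError(f"expected 12 features, got {len(cols)}")
    positive, negative = cols[8].split()
    return {
        "tweet_id": int(cols[0]),
        "username": cols[1],
        "timestamp": datetime.strptime(cols[2], TIME_FORMAT),
        "followers": int(cols[3]),
        "friends": int(cols[4]),
        "retweets": int(cols[5]),
        "favorites": int(cols[6]),
        "entities": parse_entities(cols[7]),
        "sentiment": (int(positive), int(negative)),
        "mentions": split_tokens(cols[9]),
        "hashtags": split_tokens(cols[10]),
        "urls": split_tokens(cols[11], ":-:"),
    }


# Invented sample line, not taken from the dataset.
SAMPLE = "\t".join([
    "1255980348229529601", "abc123", "Fri May 01 00:00:05 +0000 2020",
    "10", "20", "3", "4", "covid 19:COVID-19:-1.5;", "2 -1", "null;",
    "covid19 lockdown", "https://example.org/a:-: https://example.org/b",
])
row = parse_line(SAMPLE)
print(row["entities"], row["sentiment"], row["urls"])
```

To process the published file, the same function could be applied to each line read with `gzip.open(path, "rt", encoding="utf-8")`, where `path` points to a local copy of `TweetsCOV19_052020.tsv.gz`. Whether real lines ever deviate from the 12-column layout is not stated in the description, so the strict length check is a deliberate choice to surface surprises early.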