Based and confused: Tracing the political connotations of a memetic phrase across the Web
Description
Datasets for the case study on the spread of the vernacular term "based" across 4chan/pol/, Reddit, and Twitter. Data was gathered in November 2021. All files are anonymised as much as possible. They contain:
- Posts and tweets mentioning "based". False positives were filtered out until the precision reached 0.9 or higher. See filtering steps below.
  - Twitter: Tweets mentioning "based" between 2010-01-01 and 2021-11-01. Collected via the Twitter v2 API. Retweets were excluded.
  - Reddit: Comments mentioning "based" between 2010-01-01 and 2021-11-01. Collected via the Pushshift API. Data from deleted subreddits may be absent in their last month.
  - 4chan/pol/: Posts and comments mentioning "based" between 2013-11-28 and 2021-11-01. Data derived from 4plebs and 4CAT.
- Counts per day and month for the datasets above.
- Sampled and annotated posts from three time slices. Includes the 200 most-liked or highest-scoring posts from Twitter and Reddit, and a random sample for 4chan/pol/. Annotated with general comments, who was deemed based, and whether the post concerns meta-discussion on the term. Time slices:
  - July 2012
  - January 2017
  - September 2021
- The most-common words between "based " and "pilled" on all three platforms.
Table 2 below details the queries we carried out for the collection of the initial datasets. For all platforms, we chose to retain non-English languages since the diffusion of the term in other languages was also deemed relevant.
| source | query | query type |
|---|---|---|
| Twitter | (#based OR (based (pilled OR pill OR redpilled OR redpill OR chad OR virgin OR cringe OR cringy OR triggered OR trigger OR tbh OR lol OR lmao OR wtf OR swag OR nigga OR finna OR bitch OR rare) ) OR " is based" OR "that\'s based" OR "based as fuck" OR "based af" OR "too based" OR "fucking based" "extremely based" OR "totally based" OR "incredibly based" OR "very based" OR "so based" OR "pretty based" OR "quite based" OR "kinda based" OR "kind of based" OR "fairly based" OR "based ngl" OR "as based as" OR "thank you based " OR "stay based" OR "based god") -"based in"-"based off"-"based * off"-"based around"-"based * around"-"based on"-"based * on"-"based out of"-"based upon"-"based * upon"-"based at"-"based from"-"is based by"-"is based of"-"on which * is based"-"upon which * is based"-"which is based there"-"is based all over"-"based more on"-"plant based"-"text based"-"turn based"-"need based"-"evidence based"-"community based" -"web based" -is:retweet -is:nullcast | Twitter v2 API |
| Reddit | based -"based in" -"based off" -"based around" -"based on" -"based them on" -"based it on" -"evidence based" | Pushshift API |
| 4chan/pol/ | lower(body) LIKE '%based%' AND lower(body) NOT SIMILAR TO '%(-based|debased|based in |based off |based around |based on |based them on|based it on|based her on|based him on|based only on|based completely on|based solely on|based purely on|based entirely on|based not on |based not simply on|based entirely around|based out of|based upon |based at |is based by |is based of|on which it is based|on which this is based|which is based there|is based all over|which it is based|is based of |based firmly on|based off |based solely off|based more on|plant based|text based|turn based|need based|evidence based|community based|home based|internet based|web based|physics based)%' | PostgreSQL |
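As a rough illustration of how the 4chan/pol/ query behaves, the sketch below approximates it in Python. It is not the code used for the dataset: `EXCLUSIONS` holds only a small subset of the collocations from the query, and the names `EXCLUSION_RE` and `matches_query` are hypothetical. Like the `NOT SIMILAR TO` clause, it drops a post entirely if any exclusion appears anywhere in the body.

```python
import re

# Small illustrative subset of the exclusion collocations in the 4chan/pol/ query.
EXCLUSIONS = [
    "-based",
    "debased",
    "based in ",
    "based off ",
    "based on ",
    "based upon ",
    "plant based",
    "web based",
]

# One alternation over all (escaped) exclusions, matched on the lowercased body.
EXCLUSION_RE = re.compile("|".join(re.escape(phrase) for phrase in EXCLUSIONS))


def matches_query(body: str) -> bool:
    """Return True if the post mentions 'based' and contains no exclusion."""
    text = body.lower()
    return "based" in text and EXCLUSION_RE.search(text) is None
```

For example, `matches_query("Based and redpilled")` keeps the post, while `matches_query("This is based on a book")` discards it.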
There were some data gaps for 4chan/pol/ and Reddit. /pol/ data was missing because of gaps in the archives (mostly due to outages). The following time periods are incomplete or missing entirely:
- 15 - 16 April 2019
- 14 - 15 December 2019
- 3 - 10 December 2020
- 29 March 2021
- 10 - 12 April 2021
- 16 - 18 August 2021
- 11 October 2021
Moreover, the 4plebs archive only started in November 2013, so the first two years of /pol/’s existence are missing.
The Pushshift API did not return posts for certain dates. We somewhat mitigated this by also retrieving data through the new Beta endpoint. However, the following time periods were still missing data:
- 1 - 30 September 2017
- 1 February - 31 March 2018
- 5 - 6 November 2020
- 23 - 27 March 2021
- 10 - 13 April 2021
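When working with the per-day counts, it may help to mark the gap periods listed above so that they are not read as genuine drops in usage. The following is a minimal sketch under that assumption. The date ranges are copied from the lists above, while the names `POL_GAPS`, `REDDIT_GAPS`, `gap_days`, and `mask_counts` are hypothetical and not part of the dataset.

```python
from datetime import date, timedelta
from typing import Optional

# Inclusive (start, end) ranges, copied from the gap lists above.
POL_GAPS = [
    (date(2019, 4, 15), date(2019, 4, 16)),
    (date(2019, 12, 14), date(2019, 12, 15)),
    (date(2020, 12, 3), date(2020, 12, 10)),
    (date(2021, 3, 29), date(2021, 3, 29)),
    (date(2021, 4, 10), date(2021, 4, 12)),
    (date(2021, 8, 16), date(2021, 8, 18)),
    (date(2021, 10, 11), date(2021, 10, 11)),
]
REDDIT_GAPS = [
    (date(2017, 9, 1), date(2017, 9, 30)),
    (date(2018, 2, 1), date(2018, 3, 31)),
    (date(2020, 11, 5), date(2020, 11, 6)),
    (date(2021, 3, 23), date(2021, 3, 27)),
    (date(2021, 4, 10), date(2021, 4, 13)),
]


def gap_days(gaps: list) -> set:
    """Expand inclusive (start, end) ranges into a set of individual days."""
    days = set()
    for start, end in gaps:
        for offset in range((end - start).days + 1):
            days.add(start + timedelta(days=offset))
    return days


def mask_counts(daily_counts: dict, gaps: list) -> dict:
    """Replace counts on gap days with None so they read as missing, not zero."""
    missing = gap_days(gaps)
    masked: dict = {}
    for day, count in daily_counts.items():
        value: Optional[int] = None if day in missing else count
        masked[day] = value
    return masked
```

Masking with `None` rather than dropping the rows keeps the time axis continuous, which is convenient when plotting.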
After the initial data collection, we carried out several rounds of filtering to remove remaining false positives. For 4chan/pol/, we only needed to do this filtering once (attaining 0.95 precision), while for Twitter we carried out eight rounds (0.92 precision). For Reddit, we formulated nearly 500 exclusions but failed to reach a precision over 0.9. We thus had to filter more rigorously. We observed that longer comments were more likely to be false positives, so we removed all comments over 350 characters long. We settled on this number on the basis of our first sample, in which almost no true positives were over 350 characters long. Furthermore, we removed all comments except those in which “based” was used as a standalone word (thus excluding e.g. “plant-based”), at the start or end of a sentence, in capitals, or in conjunction with certain keywords or in certain phrases (e.g. “kinda based”). We also deleted posts by bot accounts by (rather crudely) removing posts from usernames including ‘bot’ or ‘auto’. This finally led to a precision of 0.9.
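The stricter Reddit filtering described above can be sketched roughly as follows. This is an illustrative simplification, not the code used for the dataset. It implements only the length cut-off, the username check, and one reading of the "standalone word" rule (not joined to another word by a hyphen). The sentence-position, capitals, and keyword rules are omitted, and all names (`MAX_LENGTH`, `keep_comment`, `precision`) are hypothetical.

```python
import re

MAX_LENGTH = 350               # comments longer than this were removed
BOT_MARKERS = ("bot", "auto")  # crude bot detection on usernames

# Assumption: "standalone" means not attached to another word or a hyphen,
# so "plant-based" and "debased" do not match.
STANDALONE_RE = re.compile(r"(?<![\w-])based(?![\w-])", re.IGNORECASE)


def is_bot(username: str) -> bool:
    """Flag usernames containing 'bot' or 'auto' (case-insensitive)."""
    name = username.lower()
    return any(marker in name for marker in BOT_MARKERS)


def keep_comment(body: str, username: str) -> bool:
    """Apply the length, bot, and standalone-word checks in turn."""
    if len(body) > MAX_LENGTH:
        return False
    if is_bot(username):
        return False
    return STANDALONE_RE.search(body) is not None


def precision(annotations: list) -> float:
    """Share of true positives in a manually annotated sample of booleans."""
    return sum(1 for label in annotations if label) / len(annotations)
```

With a sample annotated as nine true positives and one false positive, `precision` returns 0.9, the threshold used for all three platforms.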
| -based|location based |
|
@-mentions with “based” "on which <max. 25 characters> is based" "where <max. 25 characters> is based" "wherever <max. 15 characters> is based" #based #customer| alkaline based| anime based | are based near | astrology based | at the based of| b0Iuip5wnA| based economy| based game | based locally| based my name | based near | based not upon| based points| based purely off| based quite near | based solely off| based soy source| based upstairs| blast based| class based| clearly based of this| combat based| condition based| dos based| emotional based| eth based| fact based| gender based| he based his | he's based in | indian based| is based for fans| is based lies| is based near | is based not around | is based not on | is based once again on | is based there| is based within| issue based| jersey based| listen to 01 we rare| music based| oil based| on which it's based| page based 1000| paper based| park based | pc based| pic based| pill based regimen| puzzle based| sex based | she based her | she's based in | skill based| story based| they based their | they're based in| toronto based| trigger on a new yoga 2| u.s. 
based| universal press| us based| value based| we're based in | where you based?| you're based in |#alkaline #based|#apps #based|#based #acidic|#flash #based|#home #based|#miami #based|#piano #based|#value #based|american based|australia based|australian based|based my decision|based entirely around|based entirely on|based exactly on |based her announcement|based her decision|based her off|based him off|based his announcement|based his decision|based largely on|based less on|based mostly on|based my guess|based only around|based only on|based partly on|based partly upon|based purely on |based solely around|based solely on|based strictly on|based the announcement|based the decision|based their announcement|based their decision|based, not upon|battery based|behavior based|behaviour based|blockchain based|book based series|canon based|character based|cloud based|commision based|component based|computer based|confusion based|content based|depression based|dev based|dnd based|factually based|faith based|fear based|flash based|flintstones based|flour based|home based|homin based|i based my|interaction based|is based circa|is based competely on|is based entirely off|is based here|is based more on|is based outta|is based totally on |is based up here|is based way more on|live conferences with r3|living based of|london based|luck based|malex based|market based|miami based|needs based|nyc based|on which the film is based|opinion based|piano based|point based|potato based|premise is based|region based|religious based|science based|she is based there|slavery based show|softball based|thanks richard clark|u.k. based|uk based|vendor based|vodka based|volunteer based|water based|where he is based|where the disney film is based|where the military is based|who are based there|who is based there|wordpress cms |
|
Allowed all posts:
Excluded all posts with the following patterns:
Excluded all posts with the following collocations: amd based| animal based| app based| arch based| are based inside| are based near| area based| arm based| asset based| based a bit off| based after | based censor | based chips| based consumer| based definition| based dessert| based device| based diet| based entirely on| based food| based gpu| based harassment| based highly around| based machine| based magic| based mostly in| based not on | based of that| based of whether| based only off | based product| based program| based project| based transaction| based upon| belief based| bell based| bleach based| book based| bp based| br based| bsd based| bt based| card based| caustic based| challenge based| chassis based| cis based| class based| cloud based| coconut based| cream based| creature based| csx based| cube based| cuisine based| dare based| debt based| dex based| driver based| earth based| eu based| event based| fact based| faith based| faith based| fat based| fear based| ffmpeg based| flash based| flipswitch| flour based| food based| foundry based| frog based| gcd based| gel based| gig based| glitch based| glue based| grid based| ground based| hit based| hours based| i based it | incentive based| iota based| is based extensively on| is based extremely on| is based heavily| is based inside| is based per | is based solely on| is primarily based| judgement based| latex based| law based| lead based| light based| lightning based| logic based| loot based| lore based| luck based| mac based| map based| match based| meaning based| melt based| meta based| mib based| mobile based| na based | node based| norfolk based| nyc based| oat based| oil based| pc based| pea based| pg based| power based| price based| project based| projectile based| proof based| race based| race based| rage based| ratio based| reaction based| reality based| result based| rhythm based| rights based| round based| rule based| salt based| samba based| samurai based| sap based| sea based| self based| seo 
based| sex based| shadow based| sinus based| soy based| soy based| space based| spr based| stack based| stat based| state based| stealth based| strength based| team based| time based| token based| transformer based| trent based| tuna based| tw based| u.s. based| uk based| uk based| us based| usa based| values based| vg based| war based| ward based| water based| where based?| whiplash based| wine based| wis based| xp based| you based your | zone based|-based|ability based|achievement based|acoustic based|alcohol based|american based|amphetamine based|android based|armor based|artifact based|asia based|assumption based|athletically based|attack based|barter based|based a bit on|based almost entirely on|based and referenced|based around |based at |based entirely around|based entirely off|based entirely on|based exclusively on|based from her |based from my |based from our |based from their |based from your |based her decision|based her opinion|based his decision|based his opinion|based in |based it off |based just around |based just off |based just on |based just outside|based mostly on|based my decision|based my opinion|based of a |based of an |based of of |based of what|based off |based on |based our decision|based our opinion|based out of|based outside|based purely on|based salad|based solely around|based solely on|based strictly on |based their decision|based their opinion|based upon |battery based|bean based|bitcoin based|blockchain based|blueberry based|box based|browser based|camera based|canada based|carbon based|caste based|causality based|cellulose based|chance based|character based|charisma based|chicago based|chrome based|chromium based|church based|city based|classroom based|clinically based|combo based|commision based|commission based|community based|compassion based|computer based|concept based|connection based|consensus based|consumer based|consumption based|cookie based|corn based|cornstarch based|corporate based|country based|crypto based|customer 
based|cutthroath based|cypher based|dairy based|debian based|defence based|democracy based|desktop based|detective based|diamond based|dioxide based|discussion based|domestic based|donation based|element based|elemental based|emotion based|empathy based|employment based|enchantment based|energy based|english based|error based|espresso based|estonia based|europe based|european based|evasion based|evidence based|evidenced based|exploration based|explosive based|factually based|faith based|fetish based|fiction based|fire based|firefox based|frantic based|fundamental based|gender based|github based|gps based|grappling based|gravity based|greed based|ground based|group based|he based his|heat based|home based|horizon based|hotkey based|houston based|i based my |income based|input based|inquiry based|intel based|interest based|internet based|is based all over|is based by |is based of|is based of |israeli based|issue based|japan based|java based|javascript based|jelly based|kent based|knowledge based|land based|latin based|lecture based|level based|linux based|liverpool based|location based|london based|magic based|market based|melee based|merit based|mexico based|military based|mission based|momentum based|monetary based|monster based|movement based|music based|mustard based|myth based|mythology based|méxico based|narrative based|nature based|need based|neelix based|neutral based|number based|objective based|on which it is based|on which this is based|opinion based|outcome based|oxide based|pagan based|party based|pattern based|pcr based|penalty based|percentage based|performance based|peroxide based|personality based|petroleum based|phonk based|physics based|picture based|plant based|pornographic based|potato based|product based|protocol based|ptsd based|recovery based|religion based|religions based|religious based|results based|revenue based|reward based|rfid based|ribbon based|rng based|road based|role-playing based|rpg based|ryzen based|sample based|sativa based|sci 
fi based|science based|scientific based|server based|service based|sexually based|she based her|shellac based|silicone based|simulation based|sketch based|skill based|smear based|snapshot based|source based|spider based|sprite based|sql based|story based|strategically based|strategy based|subscription based|sword based|tech based|text based|they are based there|they based their|thunder based|tomato based|tower based|transaction based|transformation based|triplet based|turn based|u.k. based|uk based|unix based|usa based|usd based|vagina based|vector based|vinegar based|water based|we based our|wealth based|web based|where are you based|where i'm based|where they are based|where they were based|where you based?|which is based there|which it is based|windows based|x4 based|yeast based Excluded all posts with “bot” or “auto” in the username |
Files
based-4chan-final_anonymised.csv
Total size: 3.1 GB

| Checksum | Size |
|---|---|
| md5:e15be6ec687f400018b5f851c9c064d3 | 1.3 GB |
| md5:e5136f9f4b808dd9ce77886b914bab57 | 22.9 kB |
| md5:b90793987032ca564d65fb69c1c46a37 | 757.6 MB |
| md5:2b7ea6295fdb37f572ef519f60fa481e | 1.1 GB |
| md5:d8253789e80f595709510f2cf6dd19b2 | 687.3 kB |
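The md5 checksums listed above can be used to confirm that a download completed intact. Below is a minimal sketch using Python's standard `hashlib` module. The helper names `md5_of_file` and `verify` are hypothetical, and since the listing does not show which checksum belongs to which file, the pairing of file and checksum has to be supplied by the user.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so multi-gigabyte files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected: str) -> bool:
    """Compare a file's digest with an expected value such as 'md5:<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

A call such as `verify("based-4chan-final_anonymised.csv", "md5:<checksum from the listing>")` returns `True` only if the file on disk matches the given checksum.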