Published November 4, 2020 | Version 1.0
Dataset Open

Familiist (pro-natalist) communities in the social network VKontakte

  • 1. Lomonosov Moscow State University
  • 2. Bauman Moscow State Technical University

Description

The database contains an export of Russian-language text comments from the social network VKontakte in .csv format (UTF-8 encoding). The comments were collected from communities that discuss pregnancy, childhood, motherhood, fatherhood, etc. The export contains comments under posts with which users interacted. The absolute number of likes was used as the selection criterion: only comments with 5 or more likes were collected. The text data was processed (stemming and lemmatization). The data are suitable for topic modelling (e.g. LDA, Latent Dirichlet Allocation), for modelling the graph structure of communities (the link_comment variable contains a unique identifier of the post, link_author contains a unique user identifier), for sentiment analysis of statements, and for building a dictionary of demographic connotations.
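As a rough illustration of the graph use case, the sketch below groups comment authors by the post they commented under, using only link_author and link_comment. The three rows and the exact URL shapes (such as wall-1_10) are invented for illustration; only the "?reply=" separator is taken from the column description.

```python
from collections import defaultdict

# Hypothetical rows; real values would come from vk_posts_stem_lemm.csv.
rows = [
    {"link_author": "https://vk.com/id1",
     "link_comment": "https://vk.com/wall-1_10?reply=11"},
    {"link_author": "https://vk.com/id2",
     "link_comment": "https://vk.com/wall-1_10?reply=12"},
    {"link_author": "https://vk.com/id1",
     "link_comment": "https://vk.com/wall-1_20?reply=21"},
]


def post_of(link_comment):
    """Return the post part of a comment link (everything before '?reply=')."""
    return link_comment.split("?reply=", 1)[0]


def build_bipartite(rows):
    """Map each post link to the set of authors who commented under it."""
    post_to_authors = defaultdict(set)
    for row in rows:
        post_to_authors[post_of(row["link_comment"])].add(row["link_author"])
    return post_to_authors


graph = build_bipartite(rows)

# Posts with more than one distinct commenter connect those authors to each other.
co_commenters = {
    post: sorted(authors) for post, authors in graph.items() if len(authors) > 1
}
print(co_commenters)
```

The resulting post-to-authors mapping can be passed to a graph library of choice; projecting it onto authors gives a co-commenting network.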

Sample Information:

- Number of communities: 38

- Content type of communities: communities were selected in which users are mainly positive about childbirth, motherhood, parenthood and having their own family. However, users (or communities) with anti-familist views may also be encountered.

- Only comments with the number of likes >= 5 are collected 

- Comments are collected only from communities (the list of communities below) discussing issues related to childhood, motherhood, pregnancy, etc. 

- The communities in the sample have 309 thousand subscribers on average (maximum: 1,482,303; minimum: 72,570; total number of subscribers excluding intersections: 11,743,295)

- The sample of comments contains 112,900 user comments 
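The sample constraints listed here can be checked after download with a short script. The sketch below is only an outline under stated assumptions: it runs on a tiny invented in-memory table, and it assumes a comma delimiter, which the description does not specify.

```python
import csv
import io

# Tiny in-memory stand-in for vk_posts_stem_lemm.csv (invented rows).
sample = io.StringIO(
    "link_author,likes,text\n"
    "https://vk.com/id1,5,a\n"
    "https://vk.com/id2,17,b\n"
    "https://vk.com/id3,3,c\n"
)


def check_likes(handle, threshold=5):
    """Count rows and collect those whose like count is below the threshold."""
    total, below = 0, []
    for row in csv.DictReader(handle):
        total += 1
        if int(row["likes"]) < threshold:
            below.append(row)
    return total, below


total, below = check_likes(sample)
print(total, len(below))
```

For the real file, the handle would be opened with open("vk_posts_stem_lemm.csv", encoding="utf-8", newline=""). If the description holds, the row count should be 112,900 and no row should fall below the threshold.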

Sample Structure: 

link_author - link to the author of the comment in the form https://vk.com/*author identifier*

gender of author (F - female, M - male, NaN - no data) 

link_comment - link to the comment in the form https://vk.com/*post identifier on the community wall*?reply=*comment id*

date_time - date and time of publication (format “YYYY-MM-DD HH:MM:SS”) 

text - raw comment text 

likes - number of likes the comment has 

text_prep - processed text (punctuation marks removed, words converted to lowercase)

text_stem - processed text (stemming applied to the text_prep column using SnowballStemmer("russian") from the nltk library)

text_sw - processed text (stop words removed from the text_prep column using word_tokenize(text) from the nltk library)

text_lemm - processed text (lemmatization applied to the text_prep column using mystem.lemmatize(text) from the pymystem3 library)
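The first preprocessing step and the date_time format can be approximated with the standard library alone. The sketch below is an assumption about how text_prep might have been produced (the exact punctuation rule is not documented), and it does not reproduce the stemming, stop-word or lemmatization steps, which need nltk and pymystem3. The function names prep and parse_date_time are hypothetical.

```python
import re
from datetime import datetime


def prep(text):
    """Lowercase the text and replace punctuation with spaces.

    Assumption: 'punctuation' means any character that is neither a word
    character nor whitespace; underscores are therefore kept.
    """
    no_punct = re.sub(r"[^\w\s]", " ", text)
    # Collapse repeated whitespace left behind by the removed characters.
    return " ".join(no_punct.lower().split())


def parse_date_time(value):
    """Parse a date_time value in the documented YYYY-MM-DD HH:MM:SS format."""
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")


print(prep("Привет, МИР! Как дела?"))  # привет мир как дела
print(parse_date_time("2020-11-04 12:30:00").year)  # 2020
```

Comparing prep(text) with the published text_prep column on a few rows is a quick way to see how closely this rule matches the original processing.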

List of communities (38 communities):

https://vk.com/club52388302

https://vk.com/club34677924

https://vk.com/club99834596

https://vk.com/club170234932

https://vk.com/club20199180

https://vk.com/club118030893

https://vk.com/club14395935

https://vk.com/club100104267

https://vk.com/club181526404

https://vk.com/club35095382

https://vk.com/club58530763

https://vk.com/club69716165

https://vk.com/club29746763

https://vk.com/club78865067

https://vk.com/club20709572

https://vk.com/club93466205

https://vk.com/club61700163

https://vk.com/club91423062

https://vk.com/club69285929

https://vk.com/club104012302

https://vk.com/club20622108

https://vk.com/club86333616

https://vk.com/club24765

https://vk.com/club87169444

https://vk.com/club86688308

https://vk.com/club93776129

https://vk.com/club47207301

https://vk.com/club39873171

https://vk.com/club59224150

https://vk.com/club7430494

https://vk.com/club37739956

https://vk.com/club59701255

https://vk.com/club27427277

https://vk.com/club126238531

https://vk.com/club127678644

https://vk.com/club57782234

https://vk.com/club51314884

https://vk.com/club134261249

Notes

The dataset was prepared with the financial support of the Faculty of Economics of Lomonosov Moscow State University as part of the research project "Population reproduction in digital society" (by I.E. Kalabikhina).

Files

vk_posts_stem_lemm.csv

Size: 117.0 MB

md5:2e629ab4964a6fbb1e9d22886dc50edb

Additional details

Related works

Is cited by
Journal article: 10.3897/popecon.5.e70786 (DOI)
Journal article: 10.3390/math9090987 (DOI)
Is supplemented by
Dataset: 10.5281/zenodo.4612131 (DOI)