On the quest of scholarly communities of attention: large-scale clustering of Twitter users around scientific publications
- 1. University of Campinas
- 2. Universidad de Granada
- 3. Leiden University
Description
In this study we provide a first account of the methodological workflow for a large-scale clustering of the Twitter communities of attention around scientific publications, as captured in the open database Crossref Event Data. To the best of our knowledge, this is the largest algorithmic clustering of Twitter users and scientific publications performed to date. The availability of this type of clustering opens new analytical possibilities in the study of the Twitter dissemination of scientific publications. For example, it makes it possible to study the diversity of the communities in which publications have been tweeted, to differentiate publications tweeted in smaller or larger communities, and to identify those communities that tweet more superficially or automatically.
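As an illustration of the diversity analysis mentioned above, the following minimal sketch counts, for each publication, the number of distinct communities whose users tweeted it and the Shannon entropy of that distribution. The function name, the input shapes (a list of user and DOI pairs, and a mapping from user to cluster) and the choice of Shannon entropy as the diversity measure are assumptions made for this example, not part of the workflow described here.

```python
from collections import Counter
from math import log


def community_diversity(tweets, user_cluster):
    """Return {doi: (number_of_clusters, shannon_entropy)}.

    tweets: iterable of (user_id, doi) pairs (hypothetical input format).
    user_cluster: dict mapping user_id to a cluster id (hypothetical).
    """
    per_publication = {}
    # Deduplicate so each user counts once per publication.
    for user, doi in set(tweets):
        cluster = user_cluster.get(user)
        if cluster is None:
            continue  # skip users that were not assigned to any cluster
        per_publication.setdefault(doi, Counter())[cluster] += 1

    result = {}
    for doi, counts in per_publication.items():
        total = sum(counts.values())
        # Shannon entropy (natural log) of the cluster shares.
        entropy = -sum((c / total) * log(c / total) for c in counts.values())
        result[doi] = (len(counts), entropy)
    return result


if __name__ == "__main__":
    example_tweets = [("u1", "10.1/a"), ("u2", "10.1/a"), ("u3", "10.1/b")]
    example_clusters = {"u1": 0, "u2": 1, "u3": 0}
    print(community_diversity(example_tweets, example_clusters))
```

A publication tweeted within a single community gets an entropy of zero, while one spread evenly over many communities gets a higher value, which is one simple way to separate narrowly and broadly disseminated publications.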
From a technical point of view, big data tools (Google BigQuery) were used given the large volume of data involved in the clustering. Moreover, the use of the relative weight allowed for the determination of well-connected communities without much skewness in their sizes. The sheer size and availability of open data open the way for several kinds of analysis that demand careful use of file formats and computational resources, usually based on big data tools such as data warehouses and on multiprocessor code.
Future research will necessarily focus on two additional developments: 1) refining the clustering to include the less connected communities in a meaningful manner, making them more balanced as well, and 2) implementing a labelling of the different clusters obtained. For the first, additional clustering (e.g. clustering of clusters) and reclustering of smaller clusters will very likely be the way to go. For the second, we aim to find potentially meaningful information by collecting metadata from papers (e.g. journals, titles, topics) and Twitter users (e.g. profile descriptions, geolocations, URLs). That information, combined with language processing techniques, will potentially allow the labelling of the clusters in order to better characterise the communities and their dynamics in disseminating scientific publications on Twitter.
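The labelling step is described only as future work, so the following is a hedged sketch of one possible approach, not the authors' method. It scores the words in each cluster's profile descriptions with a simple TF-IDF-style weight, treating each cluster as one document, and keeps the highest-scoring words as candidate labels. The function name, the input format and the crude four-letter minimum used in place of a stop-word list are all assumptions made for illustration.

```python
import re
from collections import Counter
from math import log


def label_clusters(descriptions, top_n=3):
    """Return {cluster_id: [candidate label terms]}.

    descriptions: dict mapping a cluster id to a list of profile
    description strings (hypothetical input format).
    """
    # Term counts per cluster; words shorter than 4 letters are dropped
    # as a rough substitute for a real stop-word list.
    term_counts = {
        cid: Counter(
            word
            for text in texts
            for word in re.findall(r"[a-z]{4,}", text.lower())
        )
        for cid, texts in descriptions.items()
    }

    # Number of clusters in which each term appears at least once.
    n_clusters = len(term_counts)
    cluster_freq = Counter()
    for counts in term_counts.values():
        cluster_freq.update(counts.keys())

    labels = {}
    for cid, counts in term_counts.items():
        # Terms shared by every cluster score zero and sink to the bottom.
        scores = {
            term: count * log(n_clusters / cluster_freq[term])
            for term, count in counts.items()
        }
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        labels[cid] = ranked[:top_n]
    return labels


if __name__ == "__main__":
    example = {
        0: ["Cardiology researcher", "Heart surgeon and researcher"],
        1: ["Astronomy researcher", "Astronomy student"],
    }
    print(label_clusters(example, top_n=2))
```

The same scoring could be applied to paper titles or journal names per cluster; a production version would need proper tokenisation, multilingual stop words and handling of very small clusters.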
Files
Name | Size |
---|---|
238.pdf (md5:35d4988f5f4904690817d45727642333) | 547.0 kB |
Additional details
Related works
- Is described by
- Presentation: 10.5281/zenodo.7129562 (DOI)