Published September 7, 2022 | Version v1
Conference paper Open

On the quest of scholarly communities of attention: large-scale clustering of Twitter users around scientific publications

  • 1. University of Campinas
  • 2. Universidad de Granada
  • 3. Leiden University

Description

In this study we provide a first account of the methodological workflow for a large-scale clustering of the Twitter communities of attention around scientific publications, as captured in the open database Crossref Event Data. To the best of our knowledge, this is the largest algorithmic clustering of Twitter users and scientific publications performed to date. The availability of this type of clustering opens new analytical possibilities in the study of the Twitter dissemination of scientific publications. For example, it makes it possible to study the diversity of the communities in which publications have been tweeted, to differentiate publications tweeted in smaller or larger communities, and to identify those communities that tweet more superficially or automatically.
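As an illustration of the kind of analysis such a clustering enables, the following minimal sketch counts, for each publication, the number of distinct communities that tweeted it and the Shannon entropy of its tweets across those communities. The input shapes (a list of `(user, doi)` pairs and a `user_cluster` mapping), the function name, and the choice of entropy as a diversity measure are assumptions made for this example; they are not taken from the paper.

```python
from collections import Counter, defaultdict
from math import log


def community_diversity(tweets, user_cluster):
    """Return {doi: (number_of_communities, shannon_entropy)}.

    tweets       -- iterable of (user, doi) pairs (hypothetical input shape)
    user_cluster -- dict mapping user -> cluster id (hypothetical input shape)
    """
    per_doi = defaultdict(Counter)
    for user, doi in tweets:
        cluster = user_cluster.get(user)
        if cluster is None:
            continue  # users without a cluster assignment are skipped
        per_doi[doi][cluster] += 1

    result = {}
    for doi, counts in per_doi.items():
        total = sum(counts.values())
        # Shannon entropy (natural log) of the tweet distribution over clusters
        entropy = -sum((n / total) * log(n / total) for n in counts.values())
        result[doi] = (len(counts), entropy)
    return result


# Toy data, invented for illustration only
tweets = [
    ("u1", "10.1000/a"), ("u2", "10.1000/a"), ("u3", "10.1000/a"),
    ("u1", "10.1000/b"), ("u4", "10.1000/b"),
]
user_cluster = {"u1": 0, "u2": 0, "u3": 1, "u4": 0}
diversity = community_diversity(tweets, user_cluster)
```

In this toy case, publication `10.1000/a` is tweeted from two communities while `10.1000/b` stays within a single one, so its entropy is zero.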

From a technical point of view, big data tools (Google BigQuery) were used because of the large volume of data involved in the clustering. Moreover, the use of the relative weight allowed for the determination of well-connected communities without much skewness in their sizes. The sheer size and availability of open data open the way for several kinds of analysis that demand careful use of file formats and computational resources, usually based on big data tools such as data warehouses and on running multiprocessor code.
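The description does not define the relative weight. Purely as an assumption for illustration, the sketch below normalises each user-publication tweet count by that user's total number of tweets, so that a highly active account does not dominate the network through raw volume alone. The function name, the input shape, and this particular normalisation are hypothetical and may differ from the weighting used in the paper.

```python
from collections import Counter


def relative_weights(tweets):
    """Return {(user, doi): share of that user's tweets that mention doi}.

    tweets -- list of (user, doi) pairs, one per tweet (hypothetical shape).
    The normalisation by user totals is an assumed definition.
    """
    pair_counts = Counter(tweets)                      # raw (user, doi) counts
    user_totals = Counter(user for user, _ in tweets)  # tweets per user
    return {
        (user, doi): n / user_totals[user]
        for (user, doi), n in pair_counts.items()
    }


# Toy data, invented for illustration only
tweets = [
    ("bot", "a"), ("bot", "a"), ("bot", "b"), ("bot", "c"),
    ("alice", "a"),
]
weights = relative_weights(tweets)
```

Under this assumed definition, the weights of each user's edges sum to one, regardless of how many tweets the user posted.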

Future research will necessarily focus on two additional developments: 1) refining the clustering to include the less connected communities in a meaningful manner, also making them more balanced, and 2) implementing a labelling of the different clusters obtained. For the first, additional clustering (e.g. clustering of clusters) and reclustering of smaller clusters will very likely be the way forward. For the second, we aim to find potentially meaningful information by collecting metadata from papers (e.g. journals, titles, topics) and Twitter users (e.g. profile descriptions, geolocations, URLs). That information, combined with language processing techniques, will potentially allow the labelling of the clusters in order to better characterise the communities and their dynamics in disseminating scientific publications on Twitter.
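One simple way to approach the labelling step could be to surface the most frequent terms in the profile descriptions of each cluster. The sketch below is a deliberately basic stand-in for the language processing techniques mentioned above; the function name, the input shapes, and the tiny stopword list are assumptions made for this example.

```python
import re
from collections import Counter, defaultdict

# Very small illustrative stopword list (an assumption, not a real resource)
STOPWORDS = {"the", "and", "of", "in", "at", "a", "an", "for", "to", "i"}


def label_clusters(profiles, user_cluster, top_n=3):
    """Return {cluster: [most frequent profile terms]}.

    profiles     -- dict mapping user -> profile description (hypothetical)
    user_cluster -- dict mapping user -> cluster id (hypothetical)
    """
    terms = defaultdict(Counter)
    for user, description in profiles.items():
        cluster = user_cluster.get(user)
        if cluster is None:
            continue  # users without a cluster assignment are skipped
        # Count each term once per user so verbose profiles do not dominate
        tokens = set(re.findall(r"[a-z]+", description.lower())) - STOPWORDS
        terms[cluster].update(tokens)
    return {
        cluster: [term for term, _ in counter.most_common(top_n)]
        for cluster, counter in terms.items()
    }


# Toy data, invented for illustration only
profiles = {
    "u1": "Professor of ecology",
    "u2": "Ecology PhD student",
    "u3": "Cardiology nurse",
    "u4": "Cardiology researcher at hospital",
}
user_cluster = {"u1": 0, "u2": 0, "u3": 1, "u4": 1}
labels = label_clusters(profiles, user_cluster, top_n=1)
```

With real data, paper metadata such as journals and titles could be tokenised in the same way and combined with the profile terms.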

Files

238.pdf (547.0 kB)
md5:35d4988f5f4904690817d45727642333

Additional details

Related works

Is described by
Presentation: 10.5281/zenodo.7129562 (DOI)