
Citation network data sets for 'Oxytocin – a social peptide? Deconstructing the evidence'

Leng, Rhodri Ivor


This note describes the data sets used for all analyses contained in the manuscript ‘Oxytocin – a social peptide? Deconstructing the evidence’ [1].

Data Collection

The datasets described here were originally retrieved from Web of Science (WoS) Core Collection via the University of Edinburgh’s library subscription [2]. The aim of the original study for which these data were gathered was to survey peer-reviewed primary studies on oxytocin and social behaviour. To capture relevant papers, we used the following query:

TI = (“oxytocin” OR “pitocin” OR “syntocinon”) AND TS = (“social*” OR “pro$social” OR “anti$social”)

The final search was performed on 13 September 2021. This returned a total of 2,747 records, of which 2,049 were classified by WoS as ‘articles’. Given our interest in primary studies only – articles reporting original data – we excluded all other document types. We further excluded all articles sub-classified as ‘book chapters’ or as ‘proceeding papers’ in order to limit our analysis to primary studies published in peer-reviewed academic journals. This reduced the set to 1,977 articles. All of these were published in English, so no further language refinements were necessary.

All available metadata on these 1,977 articles were exported as plain text ‘flat’ format files in four batches, which we later merged via Notepad++. Upon manual examination, we discovered examples of papers classified as ‘articles’ by WoS that were, in fact, reviews. To further filter our results, we searched all available PMIDs in PubMed (1,903 articles, ~96% of the set, had associated PMIDs). We then filtered the results to identify all records classified as ‘review’, ‘systematic review’, or ‘meta-analysis’, identifying 75 records [3] (thus, ~4% of records classified as ‘articles’ by WoS were classified as reviews in PubMed). After examining a sample and agreeing with the PubMed classification, we removed these from our dataset, leaving a total of 1,902 articles.
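The review-filtering step can be sketched as follows. This is a minimal illustration, not the procedure actually run: the record layout (dictionaries with a 'pmid' key) and the 'pubmed_types' lookup are hypothetical stand-ins for the WoS export and the PubMed results, and the toy data are invented.

```python
# Publication types treated as non-primary, as listed in the text above.
REVIEW_TYPES = {"review", "systematic review", "meta-analysis"}

def drop_reviews(records, pubmed_types):
    """Return the records that PubMed does not classify as a review type.

    records: list of dicts with an optional 'pmid' key (hypothetical layout).
    pubmed_types: dict mapping PMID -> set of lower-case publication types.
    Records without a PMID cannot be checked and are therefore kept.
    """
    kept = []
    for rec in records:
        types = pubmed_types.get(rec.get("pmid"), set())
        if types & REVIEW_TYPES:
            continue  # classified as a review in PubMed: exclude
        kept.append(rec)
    return kept

# Toy data, invented for illustration only.
records = [{"pmid": "1"}, {"pmid": "2"}, {"pmid": None}]
pubmed_types = {"1": {"journal article"}, "2": {"journal article", "review"}}
primary = drop_reviews(records, pubmed_types)
```

Keeping records that lack a PMID mirrors the description above, in which only records with a PMID could be checked against PubMed.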

From these data, we constructed two datasets by parsing out relevant reference data with the Sci2 Tool [4]. First, we constructed a ‘node-attribute-list’ by linking unique reference strings (‘Cite Me As’ column in WoS data files) to unique identifiers. We then parsed into this dataset information on the identity of each paper, including the title of the article, all authors, journal of publication, year of publication, total citations as recorded by WoS, and WoS accession number. Second, we constructed an ‘edge-list’ that records each citation, with the citing paper in the ‘Source’ column and the cited paper in the ‘Target’ column, using the unique identifiers described above to link these data to the node-attribute-list.
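To illustrate how reference strings map onto the two lists, here is a minimal sketch. It does not reproduce what the Sci2 Tool does internally; the input layout (a dictionary from each citing paper's reference string to the strings it cites) and the toy strings are assumptions made for the example.

```python
def build_lists(citing_to_refs):
    """Assign integer ids to unique reference strings and build an edge list.

    citing_to_refs: dict mapping a citing paper's reference string to the
    list of reference strings it cites (hypothetical input layout).
    Returns (ids, edges), where ids maps string -> identifier and edges is
    a list of (Source, Target) identifier pairs.
    """
    ids = {}
    edges = []

    def get_id(ref):
        # The first time a string is seen it receives the next free id.
        return ids.setdefault(ref, len(ids))

    for citing, refs in citing_to_refs.items():
        source = get_id(citing)
        for ref in refs:
            edges.append((source, get_id(ref)))
    return ids, edges

# Toy reference strings, invented for illustration only.
toy = {"A 2010": ["B 2005", "C 2001"], "B 2005": ["C 2001"]}
ids, edges = build_lists(toy)
```

Because identifiers are keyed on the exact reference string, two differently formatted strings for the same paper receive different identifiers, which is why duplicates have to be merged afterwards.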

We then constructed a network in which papers are nodes and citation links are directed edges between nodes. We used Gephi Version 0.9.2 [5] to manually clean these data by merging duplicate references caused by different reference formats or by referencing errors. To do this, we needed to retain all retrieved records (1,902) together with all of their references, whether or not the referenced papers were included in our original search. In total, this produced a network of 46,633 nodes (unique reference strings) and 112,520 edges (citation links). Thus, the average reference list size of these articles is ~59 references. The mean indegree (within-network citations) is 2.4 (median 1) for the entire network, reflecting a great diversity in referencing choices among our 1,902 articles.
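The two averages quoted for the full network follow directly from the counts given. The short check below assumes that only the 1,902 retrieved articles contribute outgoing citations, so the mean reference list size is edges divided by retrieved articles; the variable names are ours.

```python
retrieved = 1_902     # fully retrieved articles (the only citing papers)
nodes_full = 46_633   # unique reference strings in the full network
edges_full = 112_520  # citation links in the full network

# Every edge leaves one retrieved article, so this is the mean list length.
mean_refs = edges_full / retrieved
# Every edge arrives at exactly one node, so this is the mean indegree.
mean_indegree = edges_full / nodes_full

print(round(mean_refs), round(mean_indegree, 1))
```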

After merging duplicates, we restricted the network to include only the fully retrieved articles (1,902), and retained only those that were connected by citation links in a single large interconnected network (i.e. the largest component). In total, 1,892 (99.5%) of our initial set were connected via citation links, meaning that ten papers were removed from the following analysis. These ten were neither connected to the largest component nor connected to one another (i.e. they were ‘isolates’).
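Restricting a network to its largest component can be sketched with a breadth-first search. This sketch assumes that edge direction is ignored when deciding connectivity (a weakly connected component); the original restriction was done in Gephi, and the function name and toy network here are illustrative only.

```python
from collections import defaultdict, deque

def largest_component(nodes, edges):
    """Return the largest weakly connected component as a set of nodes."""
    neighbours = defaultdict(set)
    for source, target in edges:
        # Ignore direction: a citation links both papers to each other.
        neighbours[source].add(target)
        neighbours[target].add(source)

    seen, best = set(), set()
    for start in nodes:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for other in neighbours[node]:
                if other not in component:
                    component.add(other)
                    queue.append(other)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

# Toy network, invented for illustration: papers 4 and 5 have no links.
nodes = [1, 2, 3, 4, 5]
edges = [(1, 2), (3, 2)]
core = largest_component(nodes, edges)
isolates = set(nodes) - core
```

In this toy case everything outside the largest component is an isolate, as was the case for the ten papers removed here; in general the remainder could also contain smaller components.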

This left us with a network of 1,892 nodes connected together by 26,019 edges. It is this network that is described by the ‘node-attribute-list’ and ‘edge-list’ provided here. This network has a mean in-degree of 13.76 (median in-degree of 4). By restricting our analysis in this way, we lose 44,741 unique references (96%) and 86,501 citations (77%) from the full network, but retain a tightly knit set of articles, all of which were fully retrieved because they contain terms related to oxytocin AND social behaviour in their title, abstract, or associated keywords.
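The losses quoted above are the differences between the full and the restricted network, expressed as a share of the full network. A quick check, with variable names of our own choosing:

```python
nodes_full, edges_full = 46_633, 112_520  # full network
nodes_core, edges_core = 1_892, 26_019    # largest component of retrieved papers

lost_nodes = nodes_full - nodes_core
lost_edges = edges_full - edges_core
pct_nodes = 100 * lost_nodes / nodes_full
pct_edges = 100 * lost_edges / edges_full

print(lost_nodes, round(pct_nodes), lost_edges, round(pct_edges))
```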

Before moving on, we calculated indegree for all nodes in this network – this counts the number of citations to a given paper from other papers within this network – and have included this in the node-attribute-list. We further clustered this network by modularity maximisation using the Leiden algorithm [6]. We set the resolution to 1 and allowed the algorithm to run for 100 iterations and 100 restarts. This gave Q=0.43 and identified seven clusters, which we describe in detail within the body of the paper. We have included cluster membership as an attribute in the node-attribute-list.
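Indegree as defined here is a count of how often each identifier appears as the target of an edge. A minimal sketch, assuming edges are held as (source, target) pairs; the function name and toy data are illustrative.

```python
from collections import Counter

def indegree(node_ids, edges):
    """Count within-network citations received by each node.

    Nodes that are never cited get an indegree of 0.
    """
    counts = Counter(target for _, target in edges)
    return {node: counts.get(node, 0) for node in node_ids}

# Toy network, invented for illustration only.
toy_indegree = indegree([1, 2, 3], [(1, 2), (3, 2), (1, 3)])
```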

For additional analysis, we also analysed the full reference list data to examine the references most commonly cited by papers published between 2016 and 2021; the results are described in OTSOC_Cited_2016-2021.csv. This analysis takes the full reference lists of all retrieved papers within the network, including references to papers not contained within the network. These data were cleaned by matching DOIs and by manual cleansing.

Data description

We include here two network datasets: (i) ‘OTSOC-node-attribute-list.csv’ consists of the attributes of 1,892 primary articles retrieved from WoS that include terms indicating a focus on oxytocin and social behaviour; (ii) ‘OTSOC-edge-list.csv’ records the citations between these papers. Together, these can be imported into a range of different software for network analysis; however, we have formatted these for ease of upload into Gephi 0.9.2. Finally, we include (iii) ‘OTSOC_Cited_2016-2021.csv’ that lists all papers cited by >10 papers in the OTSOC network following an analysis of the bibliographies of retrieved papers. Below, we detail their contents:

1. ‘OTSOC-node-attribute-list.csv’ is a comma-separated values file that contains all node attributes for the citation network (n=1,892) analysed in the paper. The columns refer to:

Id, the unique identifier of the paper, used in the ‘Source’ and ‘Target’ columns of the edge list.

Label, the reference string of the paper to which the attributes in this row correspond. This is taken from the ‘Cite Me As’ column from the original WoS download. The reference string is in the following format: last name of first author, publication year, journal, volume, start page, and DOI (if available). 

Wos_id, unique Web of Science (WoS) accession number. These can be used to query WoS to find further data on all papers via the ‘UT= ’ field tag.

Title, paper title.

Authors, all named authors.

Journal, journal of publication.

Pub_year, year of publication.

Wos_citations, total number of citations recorded by WoS Core Collection to a given paper as of 13 September 2021.

Indegree, the number of within-network citations to a given paper, calculated for the network shown in Figure 1 of the manuscript.

Cluster, the cluster membership number as discussed within the manuscript (Figure 1). This was established by modularity maximisation using the Leiden algorithm (resolution 1; Q=0.43; 7 clusters).

2. ‘OTSOC-edge-list.csv’ is a comma-separated values file that contains all citation links between the 1,892 articles (n=26,019). The columns refer to:

Source, the unique identifier of the citing paper.

Target, the unique identifier of the cited paper.

Type, edges are ‘Directed’, and this column tells Gephi to regard all edges as such.

Syr_date, this contains the date of publication of the citing paper.

Tyr_date, this contains the date of publication of the cited paper.
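For readers working outside Gephi, the following sketch shows one way to check that every identifier in the edge list also appears in the node-attribute-list. It assumes the CSV header names match the column names described above exactly (Id, Source, Target); the real headers, delimiter handling, and encoding should be checked against the files. The in-memory toy files stand in for the real ones.

```python
import csv
import io

def dangling_edges(node_file, edge_file):
    """Return (Source, Target) pairs that refer to an Id missing from the node list."""
    node_ids = {row["Id"] for row in csv.DictReader(node_file)}
    return [
        (row["Source"], row["Target"])
        for row in csv.DictReader(edge_file)
        if row["Source"] not in node_ids or row["Target"] not in node_ids
    ]

# Toy stand-ins for the two files, invented for illustration only.
node_csv = io.StringIO("Id,Label\n1,A 2010\n2,B 2005\n")
edge_csv = io.StringIO("Source,Target,Type\n1,2,Directed\n1,3,Directed\n")
dangling = dangling_edges(node_csv, edge_csv)
```

With the real data, the two arguments would be file objects opened with open(..., newline="") on the node and edge files; an empty result would indicate that the two lists are consistent.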

3. ‘OTSOC_Cited_2016-2021.csv’ is a comma-separated values file that lists all references that were cited by at least 10 of the retrieved papers within the OTSOC network published from 2016 onwards. The columns refer to:

Reference, the cited reference string extracted from the bibliographies of retrieved papers.

Publication year, the publication year of the cited reference.

DOI, the DOI of the cited reference. 

indegree_2016, the total number of citations to a cited reference from papers published in 2016 and contained within the OTSOC network. 

indegree_2017, the total number of citations to a cited reference from papers published in 2017 and contained within the OTSOC network. 

indegree_2018, the total number of citations to a cited reference from papers published in 2018 and contained within the OTSOC network. 

indegree_2019, the total number of citations to a cited reference from papers published in 2019 and contained within the OTSOC network. 

indegree_2020, the total number of citations to a cited reference from papers published in 2020 and contained within the OTSOC network. 

indegree_2021, the total number of citations to a cited reference from papers published in 2021 and contained within the OTSOC network. 

total indegree 2016-21, the total number of citations to a cited reference from papers published between 2016 and 2021 and contained within the OTSOC network.
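As described, the total column should equal the sum of the six yearly columns. A minimal sketch of that relationship, assuming each row is read as a dictionary keyed by the column names above; the example row is invented.

```python
YEARS = range(2016, 2022)  # 2016 to 2021 inclusive

def total_indegree(row):
    """Sum the yearly indegree columns of one row (values may be strings)."""
    return sum(int(row[f"indegree_{year}"]) for year in YEARS)

# Toy row, invented for illustration only.
row = {
    "indegree_2016": "3", "indegree_2017": "2", "indegree_2018": "4",
    "indegree_2019": "1", "indegree_2020": "0", "indegree_2021": "2",
}
total = total_indegree(row)
```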

Software recommended for analysis

Gephi version 0.9.2 was used for the visualisations within the manuscript, and both network files (the node-attribute-list and the edge-list) can be read into Gephi without modification.


[1] Leng, G., Leng, R. I., Ludwig, M. (Submitted). Oxytocin – a social peptide? Deconstructing the evidence.

[2] Edinburgh University’s subscription to Web of Science covers the following databases: (i) Science Citation Index Expanded, 1900-present; (ii) Social Sciences Citation Index, 1900-present; (iii) Arts & Humanities Citation Index, 1975-present; (iv) Conference Proceedings Citation Index – Science, 1990-present; (v) Conference Proceedings Citation Index – Social Science & Humanities, 1990-present; (vi) Book Citation Index – Science, 2005-present; (vii) Book Citation Index – Social Sciences & Humanities, 2005-present; (viii) Emerging Sources Citation Index, 2015-present.

[3] For those interested, the following PMIDs were identified as ‘articles’ by WoS, but as ‘reviews’ by PubMed: ‘34502097’ ‘33400920’ ‘32060678’ ‘31925983’ ‘31734142’ ‘30496762’ ‘30253045’ ‘29660735’ ‘29518698’ ‘29065361’ ‘29048602’ ‘28867943’ ‘28586471’ ‘28301323’ ‘27974283’ ‘27626613’ ‘27603523’ ‘27603327’ ‘27513442’ ‘27273834’ ‘27071789’ ‘26940141’ ‘26932552’ ‘26895254’ ‘26869847’ ‘26788924’ ‘26581735’ ‘26548910’ ‘26317636’ ‘26121678’ ‘26094200’ ‘25997760’ ‘25631363’ ‘25526824’ ‘25446893’ ‘25153535’ ‘25092245’ ‘25086828’ ‘24946432’ ‘24637261’ ‘24588761’ ‘24508579’ ‘24486356’ ‘24462936’ ‘24239932’ ‘24239931’ ‘24231551’ ‘24216134’ ‘23955310’ ‘23856187’ ‘23686025’ ‘23589638’ ‘23575742’ ‘23469841’ ‘23055480’ ‘22981649’ ‘22406388’ ‘22373652’ ‘22141469’ ‘21960250’ ‘21881219’ ‘21802859’ ‘21714746’ ‘21618004’ ‘21150165’ ‘20435805’ ‘20173685’ ‘19840865’ ‘19546570’ ‘19309413’ ‘15288368’ ‘12359512’ ‘9401603’ ‘9213136’ ‘7630585’

[4] Sci2 Team. (2009). Science of Science (Sci2) Tool. Indiana University and SciTech Strategies. Stable URL:

[5] Bastian, M., Heymann, S., & Jacomy, M. (2009). Gephi: an open source software for exploring and manipulating networks. International AAAI Conference on Weblogs and Social Media. Gephi is available via

[6] Traag, V. A., Waltman, L., & van Eck, N. J. (2019). From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports, 9(1), 5233.
