Book section Open Access

Linguistic Bias in Crowdsourced Biographies: A Cross-lingual Examination

Jahna Otterbacher; Ioannis Katakis; Pantelis Agathangelou

Biographies make up a significant portion of Wikipedia entries and are a source of information and inspiration for the public. We examine a threat to their objectivity, linguistic biases, which are pervasive in human communication. Linguistic bias, the systematic asymmetry in the language used to describe people as a function of their social groups, plays a role in the perpetuation of stereotypes. Theory predicts that we describe people who are expected – because they are members of our own in-groups or are stereotype-congruent – with more abstract, subjective language, as compared to others. Abstract language has the power to sway our impressions of others as it implies stability over time. Extending our monolingual work, we consider biographies of intellectuals at the English- and Greek-language Wikipedias. We use our recently introduced sentiment analysis tool, DidaxTo, which extracts domain-specific opinion words to build lexicons of subjective words in each language and for each gender, and compare the extent to which abstract language is used. Contrary to expectation, we find evidence of gender-based linguistic bias, with women being described more abstractly as compared to men. However, this is limited to English-language biographies. We discuss the implications of using DidaxTo to monitor linguistic bias in texts produced via crowdsourcing.


This work has been partly supported by the project that has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 739578 (RISE – Call: H2020-WIDESPREAD-01-2016-2017-TeamingPhase2) and the Government of the Republic of Cyprus through the Directorate General for European Programmes, Coordination and Development. Electronic version of a book chapter published in Multilingual Text Analysis: Challenges, Models, and Approaches, 2019, 411–440. © 2019 World Scientific Publishing Company.