Published March 1, 2019 | Version Accepted pre-print
Book chapter Open

Linguistic Bias in Crowdsourced Biographies: A Cross-lingual Examination

  • 1. Research Centre on Interactive Media, Smart Systems and Emerging Technologies & Cyprus Center for Algorithmic Transparency, Open University of Cyprus, Nicosia, Cyprus
  • 2. Computer Science Department, University of Nicosia, Nicosia, Cyprus


Biographies make up a significant portion of Wikipedia entries and are a source of information and inspiration for the public. We examine a threat to their objectivity: linguistic biases, which are pervasive in human communication. Linguistic bias, the systematic asymmetry in the language used to describe people as a function of their social groups, plays a role in the perpetuation of stereotypes. Theory predicts that we describe people who are expected – because they are members of our own in-groups or are stereotype-congruent – with more abstract, subjective language than we use for others. Abstract language has the power to sway our impressions of others, as it implies stability over time. Extending our monolingual work, we consider biographies of intellectuals in the English- and Greek-language Wikipedias. We use our recently introduced sentiment analysis tool, DidaxTo, which extracts domain-specific opinion words, to build lexicons of subjective words in each language and for each gender, and compare the extent to which abstract language is used. Contrary to expectation, we find evidence of gender-based linguistic bias, with women being described more abstractly than men. However, this is limited to English-language biographies. We discuss the implications of using DidaxTo to monitor linguistic bias in texts produced via crowdsourcing.



This work has been partly supported by the project that has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 739578 (RISE – Call: H2020-WIDESPREAD-01-2016-2017-TeamingPhase2) and the Government of the Republic of Cyprus through the Directorate General for European Programmes, Coordination and Development. Electronic version of a book chapter published as Multilingual Text Analysis: Challenges, Models, and Approaches, 2019, 411–440, © 2019 World Scientific Publishing Company.




Additional details


RISE – Research Center on Interactive Media, Smart System and Emerging Technologies 739578
European Commission
CyCAT – Cyprus Center for Algorithmic Transparency 810105
European Commission

