Conference paper Open Access
Human computation is often subject to systematic biases. We consider the case of linguistic biases and their consequences for the words that crowd workers use to describe images of people in an annotation task. Social psychologists explain that when describing others, the subconscious perpetuation of stereotypes is inevitable, as we describe stereotype-congruent people and/or in-group members more abstractly than others. In an MTurk experiment, we show evidence of these biases, which are exacerbated when an image’s “popular tags” are displayed, a common feature used to provide social information to workers. Underscoring recent calls for a deeper examination of the role of training data quality in algorithmic biases, our results suggest that it is rather easy to sway human judgment.