Published November 26, 2018 | Version v1
Dataset Open

Transparency in Keyword Faceted Search: a dataset of Google Shopping html pages

  • 1. Department of Information Engineering, University of Padua, Italy
  • 2. IMT School for Advanced Studies, Lucca, Italy
  • 3. IIT Institute of Informatics and Telematics, National Research Council (CNR), Pisa, Italy

Description

This dataset contains a collection of around 2,000 HTML pages. The pages contain the search results returned for queries on different products, issued by a set of synthetic users browsing Google Shopping (US version) from different locations in July 2016.

The name of each file in the collection indicates the location from which the search was run, the userID, and the searched product:   no_email_LOCATION_USERID.PRODUCT.shopping_testing.#.html

The locations are the Philippines (PHI), the United States (US), and India (IN). The userIDs are 26 to 30 for users searching from the Philippines, 1 to 5 from the US, and 11 to 15 from India.
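The naming scheme above can be unpacked programmatically. The following is a minimal sketch, assuming that the `#` placeholder is a numeric counter and that the product keyword itself may contain dots or spaces; `PageInfo`, `parse_page_name`, and `user_matches_location` are hypothetical helper names, not part of the dataset.

```python
import re
from typing import NamedTuple, Optional

# Pattern for: no_email_LOCATION_USERID.PRODUCT.shopping_testing.#.html
# Assumption: "#" is a numeric counter; the product part is matched greedily.
FILENAME_RE = re.compile(
    r"^no_email_(?P<location>[A-Z]+)_(?P<user_id>\d+)\."
    r"(?P<product>.+)\.shopping_testing\.(?P<index>\d+)\.html$"
)

# userID ranges per location, as listed in the description.
USER_RANGES = {
    "PHI": range(26, 31),  # 26 to 30
    "US": range(1, 6),     # 1 to 5
    "IN": range(11, 16),   # 11 to 15
}


class PageInfo(NamedTuple):
    location: str
    user_id: int
    product: str
    index: int


def parse_page_name(name: str) -> Optional[PageInfo]:
    """Split a file name into its fields, or return None if it does not match."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    return PageInfo(
        location=match["location"],
        user_id=int(match["user_id"]),
        product=match["product"],
        index=int(match["index"]),
    )


def user_matches_location(info: PageInfo) -> bool:
    """Check that the userID falls in the range documented for its location."""
    allowed = USER_RANGES.get(info.location)
    return allowed is not None and info.user_id in allowed
```

For instance, a file named `no_email_US_3.Television.shopping_testing.1.html` (an illustrative name built from the pattern, not taken from the archive) would be parsed as location `US`, userID 3, product `Television`.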

Products were chosen according to 130 keywords (e.g., MP3 player, MP4 Watch, Personal organizer, Television).

In the following, we describe how the search results have been collected.

Each user has a fresh profile. Creating a new profile corresponds to launching a new, isolated web browser client instance and opening the Google Shopping US web page.

To mimic real users, the synthetic users can browse, scroll pages, stay on a page, and click on links.

A fully-fledged web browser is used to get the correct desktop version of the website under investigation. This is because websites can be designed to behave differently depending on the user agent, as witnessed by the differences between the mobile and desktop versions of the same website.

The prices are the retail ones displayed by Google Shopping in US dollars (thus, excluding shipping fees).

Several frameworks have been proposed for interacting with web browsers and analysing results from search engines. This research adopts OpenWPM. OpenWPM is automated with Selenium to efficiently create and manage different users with isolated Firefox and Chrome client instances, each with its own associated cookies.

Each experiment runs for 24 hours on average. In each of them, the software runs on our local server, but the browser's traffic is redirected to the designated remote servers (e.g., to India) via tunneling through SOCKS proxies. This way, all commands are simultaneously distributed over all proxies. The experiments adopt the Mozilla Firefox browser (version 45.0) for the web browsing tasks and run under Ubuntu 14.04. Also, for each query, we consider the first page of results, which contains 40 products. Among them, the focus of the experiments is mostly on the top 10 and top 3 results.
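As an illustration of how a Firefox instance can be pointed at a SOCKS tunnel, the sketch below builds the relevant Firefox preferences. It is not the configuration used in the experiments: the function name `socks_proxy_prefs`, the SOCKS version, and the host and port values are assumptions made for the example.

```python
from typing import Dict, Tuple, Union

PrefValue = Union[int, str, bool]


def socks_proxy_prefs(host: str, port: int) -> Dict[str, PrefValue]:
    """Return Firefox preferences that send browser traffic through a SOCKS proxy."""
    return {
        "network.proxy.type": 1,                 # 1 = manual proxy configuration
        "network.proxy.socks": host,
        "network.proxy.socks_port": port,
        "network.proxy.socks_version": 5,        # assumption: SOCKS5
        "network.proxy.socks_remote_dns": True,  # resolve DNS on the proxy side
    }


# Hypothetical local tunnel endpoints, one per location (placeholder values).
TUNNELS: Dict[str, Tuple[str, int]] = {
    "US": ("127.0.0.1", 1080),
    "IN": ("127.0.0.1", 1081),
    "PHI": ("127.0.0.1", 1082),
}

# One preference set per location, so each browser instance uses its own tunnel.
prefs_by_location = {
    location: socks_proxy_prefs(host, port)
    for location, (host, port) in TUNNELS.items()
}
```

With Selenium, each name and value pair could then be applied to a browser instance through `FirefoxOptions.set_preference(name, value)` before the driver is started.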

Due to connection errors, one of the Philippine profiles has no associated results. Also, for the Philippines, a few keywords did not lead to any results: videocassette recorders, totes, umbrellas. Similarly, for the US, no results were returned for totes and umbrellas.

The search results were analyzed to check whether there was evidence of price steering based on users' location.
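One simple way to look for location-dependent differences is to compare the average price of the top-ranked results for the same keyword across locations. The sketch below is only an illustration under that assumption; it is not necessarily the measure used in the cited paper. It assumes that prices have already been extracted from the HTML pages in ranking order, and the function names are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def top_k_mean(prices: List[float], k: int) -> float:
    """Average price of the first k results (prices are given in ranking order)."""
    top = prices[:k]
    if not top:
        raise ValueError("no prices available for this query")
    return mean(top)


def compare_locations(
    prices_by_location: Dict[str, List[float]], k: int
) -> Dict[str, float]:
    """Compute the top-k average price for each location for one keyword."""
    return {
        location: top_k_mean(prices, k)
        for location, prices in prices_by_location.items()
    }


# Placeholder values for illustration only; they are NOT taken from the dataset.
example = {
    "US": [10.0, 20.0, 30.0, 40.0],
    "IN": [20.0, 30.0, 40.0, 50.0],
}
top3 = compare_locations(example, k=3)
```

Setting `k` to 3 or 10 would mirror the focus on the top 3 and top 10 results described above.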

One term of use applies:

In any research product whose findings are based on this dataset, please cite

@inproceedings{DBLP:conf/ircdl/CozzaHPN19,
  author    = {Vittoria Cozza and
               Van Tien Hoang and
               Marinella Petrocchi and
               Rocco {De Nicola}},
  title     = {Transparency in Keyword Faceted Search: An Investigation on Google
               Shopping},
  booktitle = {Digital Libraries: Supporting Open Science - 15th Italian Research
               Conference on Digital Libraries, {IRCDL} 2019, Pisa, Italy, January
               31 - February 1, 2019, Proceedings},
  pages     = {29--43},
  year      = {2019},
  crossref  = {DBLP:conf/ircdl/2019},
  url       = {https://doi.org/10.1007/978-3-030-11226-4\_3},
  doi       = {10.1007/978-3-030-11226-4\_3},
  timestamp = {Fri, 18 Jan 2019 23:22:50 +0100},
  biburl    = {https://dblp.org/rec/bib/conf/ircdl/CozzaHPN19},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Files

g_shopping_html_pages.zip (1.1 GB)
md5:9cf394bdf06332266e4f0b05a46cac39
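The MD5 checksum listed for the archive can be used to verify a download. The following is a minimal sketch using Python's standard `hashlib`; the local path passed to `md5_of_file` is an assumption about where the archive was saved.

```python
import hashlib

# Checksum published for g_shopping_html_pages.zip.
EXPECTED_MD5 = "9cf394bdf06332266e4f0b05a46cac39"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file's checksum matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

Calling `is_intact("g_shopping_html_pages.zip")` from the download directory returns `True` only when the file matches the published checksum. Reading in chunks keeps memory use low for an archive of this size.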

Additional details

Funding

European Commission
NeCS – European Network for Cyber-security 675320

References

  • Cozza, V., Hoang, V.T., Petrocchi, M., Spognardi, A.: Experimental measures of news personalization in google news. In: Casteleyn, S., Dolog, P., Pautasso, C. (eds.) Current Trends in Web Engineering. Springer International Publishing, Cham (2016)
  • Cozza, V., Hoang, V.T., Petrocchi, M., De Nicola, R.: Online user behavioural modeling with applications to price steering. In: FinRec (2016), online ceur-ws.org/Vol-1606/
  • Englehardt, S., Narayanan, A.: Online tracking: A 1-million-site measurement and analysis. In: Computer and Communications Security. ACM (2016)