Published June 26, 2019 | Version v1
Dataset Restricted

Understanding Brand Consistency from Web Content

  • 1. Indian Institute of Technology Kharagpur, India
  • 2. Adobe Research, Bangalore, India


To obtain this dataset, please fill in the "Request access" form towards the bottom of this page and also send an email to :

Kindly cite the paper :

BibTeX :

 author = {Roy, Soumyadeep and Ganguly, Niloy and Sural, Shamik and Chhaya, Niyati and Natarajan, Anandhavelu},
 title = {Understanding Brand Consistency from Web Content},
 booktitle = {Proceedings of the 10th ACM Conference on Web Science},
 series = {WebSci '19},
 year = {2019},
 isbn = {978-1-4503-6202-3},
 location = {Boston, Massachusetts, USA},
 pages = {245--253},
 numpages = {9},
 url = {},
 doi = {10.1145/3292522.3326048},
 acmid = {3326048},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {affective computing, brand personality, reputation management, text classification},

Abstract :

Brands produce content to engage with the audience continually and tend to maintain a set of human characteristics in their marketing campaigns. In this era of digital marketing, they need to create a lot of content to keep up the engagement with their audiences. However, such kind of content authoring at scale introduces challenges in maintaining consistency in a brand's messaging tone, which is very important from a brand's perspective to ensure a persistent impression for its customers and audiences. In this work, we quantify brand personality and formulate its linguistic features. We score text articles extracted from brand communications on five personality dimensions: sincerity, excitement, competence, ruggedness and sophistication, and show that a linear SVM model achieves a decent F1 score of $0.822$. The linear SVM allows us to annotate a large set of data points free of any annotation error. We utilize this huge annotated dataset to characterize the notion of brand consistency, which is maintaining a company's targeted brand personality across time and over different content categories; we make certain interesting observations. As per our knowledge, this is the first study which investigates brand personality from the company's official websites, and that formulates and analyzes the notion of brand consistency on such a large scale.

Dataset description:
Each file contains the scraped textual content from the official webpages of Fortune 1000 companies. We use the 2017 Fortune 1000 list ranks. Please read the paper for details about data collection and cleaning.

Directory structure : compressed size - 3.7 GB, uncompressed size - 28.9 GB

├── Cleaned MTlarge data
│   ├── final_dynamic_data.csv (1.0 GB) : Dynamic pages per company
│   └── final_static_data.csv (3.8 MB) : Static pages for each company
└── Raw Scrapped Data (27.8 GB)
    ├── first50fortune.csv : contains raw scraped files for Fortune 1000 companies between Rank 1 and 50
    ├── fortune150_300.csv : Between Rank 150 and 300
    ├── fortune300_500.csv : Between Rank 300 and 500
    ├── fortune500_550.csv : Between Rank 500 and 550
    ├── fortune50_150.csv : Between Rank 50 and 150
    ├── fortune550_800.csv : Between Rank 550 and 800
    └── fortune800_1000.csv : Between Rank 800 and 1000
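
Since several of the files listed above are large (the raw data totals 27.8 GB), it may help to stream a file rather than load it into memory at once. The following is a minimal sketch under stated assumptions, not part of the released dataset: the function name `summarize_csv` is hypothetical, the files are assumed to be UTF-8 encoded with a header row, and no column names are assumed because the record does not list them.

```python
import csv
import sys
from pathlib import Path


def summarize_csv(path, max_rows=None):
    """Stream a CSV file and return (header, number_of_data_rows_read).

    Hypothetical helper: assumes UTF-8 text and a header row as the first line.
    """
    # Scraped page text can exceed the csv module's default per-field limit,
    # so raise it (capped to stay within a C long on every platform).
    csv.field_size_limit(min(sys.maxsize, 2**31 - 1))
    with open(path, newline="", encoding="utf-8", errors="replace") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        count = 0
        for _ in reader:  # rows are read one at a time, never all at once
            count += 1
            if max_rows is not None and count >= max_rows:
                break
    return header, count


if __name__ == "__main__":
    # Path taken from the directory listing; adjust to where the archive was extracted.
    sample = Path("Cleaned MTlarge data") / "final_static_data.csv"
    if sample.exists():
        columns, rows = summarize_csv(sample, max_rows=1000)
        print(columns, rows)
```

Passing `max_rows` gives a quick look at the header and the first rows of a file before committing to a full pass over it.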




The record is publicly accessible, but files are restricted to users with access.

Request access

If you would like to request access to these files, please fill out the form below.

You need to satisfy these conditions in order for this request to be accepted:


Kindly send me an email in order to access the dataset.

Thanks and regards,

Soumyadeep Roy

Ph.D. student, IIT Kharagpur, India
