Understanding Brand Consistency from Web Content
- 1. Indian Institute of Technology Kharagpur, India
- 2. Adobe Research, Bangalore, India
Description
To obtain this dataset, please fill in the "Request access" form towards the bottom of this page and also send an email to soumyadeep.roy9@gmail.com.
Please cite the paper: https://dl.acm.org/citation.cfm?id=3326048
BibTeX:
@inproceedings{Roy:2019:UBC:3292522.3326048,
  author = {Roy, Soumyadeep and Ganguly, Niloy and Sural, Shamik and Chhaya, Niyati and Natarajan, Anandhavelu},
  title = {Understanding Brand Consistency from Web Content},
  booktitle = {Proceedings of the 10th ACM Conference on Web Science},
  series = {WebSci '19},
  year = {2019},
  isbn = {978-1-4503-6202-3},
  location = {Boston, Massachusetts, USA},
  pages = {245--253},
  numpages = {9},
  url = {http://doi.acm.org/10.1145/3292522.3326048},
  doi = {10.1145/3292522.3326048},
  acmid = {3326048},
  publisher = {ACM},
  address = {New York, NY, USA},
  keywords = {affective computing, brand personality, reputation management, text classification},
}
Abstract:
Brands produce content to engage with the audience continually and tend to maintain a set of human characteristics in their marketing campaigns. In this era of digital marketing, they need to create a lot of content to keep up the engagement with their audiences. However, such kind of content authoring at scale introduces challenges in maintaining consistency in a brand's messaging tone, which is very important from a brand's perspective to ensure a persistent impression for its customers and audiences. In this work, we quantify brand personality and formulate its linguistic features. We score text articles extracted from brand communications on five personality dimensions: sincerity, excitement, competence, ruggedness and sophistication, and show that a linear SVM model achieves a decent F1 score of $0.822$. The linear SVM allows us to annotate a large set of data points free of any annotation error. We utilize this huge annotated dataset to characterize the notion of brand consistency, which is maintaining a company's targeted brand personality across time and over different content categories; we make certain interesting observations. As per our knowledge, this is the first study which investigates brand personality from the company's official websites, and that formulates and analyzes the notion of brand consistency on such a large scale.
Dataset description:
Each file contains the scraped textual content from the official webpages of Fortune 1000 companies. We use the 2017 Fortune 1000 list ranks. Please read the paper for details about data collection and cleaning.
Directory structure (compressed size: 3.7 GB; uncompressed size: 28.9 GB):
├── Cleaned MTlarge data
│   ├── final_dynamic_data.csv (1.0 GB) : Dynamic pages per company
│   └── final_static_data.csv (3.8 MB) : Static pages for each company
└── Raw Scrapped Data (27.8 GB)
    ├── first50fortune.csv : Raw scraped files for Fortune 1000 companies between ranks 1 and 50
    ├── fortune150_300.csv : Between ranks 150 and 300
    ├── fortune300_500.csv : Between ranks 300 and 500
    ├── fortune500_550.csv : Between ranks 500 and 550
    ├── fortune50_150.csv : Between ranks 50 and 150
    ├── fortune550_800.csv : Between ranks 550 and 800
    └── fortune800_1000.csv : Between ranks 800 and 1000
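Because several of the files listed above are large, it may help to inspect them by streaming rather than loading them fully into memory. The following is a minimal sketch using only the Python standard library. It assumes the files are comma-separated, UTF-8 encoded, and begin with a header row; none of this is stated on this page, so please verify it against the paper or the files themselves. The function name `summarize_csv` and the `max_rows` parameter are hypothetical helpers introduced here for illustration.

```python
import csv
import sys


def summarize_csv(path, max_rows=None):
    """Stream a CSV file and return (header, number_of_data_rows_read).

    Assumption: the file is comma-separated with a header row.
    max_rows (hypothetical parameter) stops early for a quick peek.
    """
    # Scraped page text may exceed the csv module's default field size limit,
    # so raise it to a value that is safe on all platforms.
    csv.field_size_limit(min(sys.maxsize, 2**31 - 1))

    # Assumption: UTF-8 encoding; undecodable bytes are replaced, not fatal.
    with open(path, newline="", encoding="utf-8", errors="replace") as f:
        reader = csv.reader(f)
        header = next(reader, None)  # None if the file is empty
        count = 0
        for _ in reader:
            count += 1
            if max_rows is not None and count >= max_rows:
                break
    return header, count


# Example call (the path follows the directory listing above):
# header, n = summarize_csv("Cleaned MTlarge data/final_static_data.csv", max_rows=1000)
# print(header, n)
```

Starting with the small `final_static_data.csv` and a modest `max_rows` is a quick way to check the column layout before processing the larger files.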