Published January 6, 2022 | Version v1
Software Open

Source Code for Youtube dataset processing

  • 1. INRAE

Description

The file CodeSource.rar is an archive containing the main files that 
process the YouTube corpus on net activism and whistleblowing 
(see https://doi.org/10.5281/zenodo.5824627).

Turenne N. Net activism and whistleblowing on YouTube: a text mining analysis (2022).

The file _pipeline.txt describes the processing steps:
- download,
- storage in a MongoDB database,
- splitting id(s),
- filtering the collection,
- creation of a collection with only text and sentences,
- linguistic feature extraction,
- feature extraction,
- clustering.
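To give a rough idea of how such stages chain together, here is a minimal Python sketch of three of them (filtering, sentence splitting, feature extraction) on in-memory records. It is not taken from the archive: the record layout, the function names, the naive regex sentence splitter, and the bag-of-words features are all assumptions made for illustration, and the download, database, and clustering stages are left out.

```python
import re
from collections import Counter

# Hypothetical records standing in for documents stored in the database.
RECORDS = [
    {"id": "vid1", "text": "Leaks matter. Whistleblowers take risks!"},
    {"id": "vid2", "text": ""},
    {"id": "vid3", "text": "Activism online grows. Activism spreads."},
]


def filter_records(records):
    """Keep only records that carry a non-empty transcript."""
    return [r for r in records if r.get("text", "").strip()]


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def extract_features(sentences):
    """Bag-of-words counts over lowercased alphabetic tokens (an assumption)."""
    tokens = re.findall(r"[a-z]+", " ".join(sentences).lower())
    return Counter(tokens)


def run_pipeline(records):
    """Chain the stages: filter, sentence split, feature extraction."""
    out = {}
    for rec in filter_records(records):
        sentences = split_sentences(rec["text"])
        out[rec["id"]] = {
            "sentences": sentences,
            "features": extract_features(sentences),
        }
    return out


RESULT = run_pipeline(RECORDS)
```

Each stage takes and returns plain Python values, so any one of them could be swapped for a database-backed or more linguistically careful version without touching the others.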

The sub-directory ExtractYoutube contains Java code for retrieving a video's transcription from its id.


Running the source code requires a Hadoop server and many libraries (see the _description.txt file).


Files

_description.txt

Files (31.6 MB)

Checksum                               Size
md5:be68ffd1cf110becc3c2f2cc6f421f51   1.9 kB
md5:b5d54dcfed4c9130c4af952be5d8da58   10.2 kB
md5:251bd6d184738752dda0b9f97742e39f   31.5 MB
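
The MD5 checksums listed here can be used to verify a download. The following is a minimal Python sketch using the standard hashlib module; the function names and the chunk size are assumptions for illustration, and since this listing does not show which file name belongs to which checksum, the pairing of file and checksum is left to the reader.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a listed value such as 'md5:<hex digest>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listing("CodeSource.rar", "md5:<value from the listing>")` returns True only if the local file is byte-for-byte identical to the one that produced that checksum.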