Source Code for YouTube dataset processing
Description
The file CodeSource.rar is an archive with the main code that
processes the YouTube corpus about net-activism and whistleblowing;
see https://doi.org/10.5281/zenodo.5824627
Turenne N. Net activism and whistleblowing on YouTube: a text mining analysis (2022).
The file _pipeline.txt describes the different steps:
- download,
- storage in a Mongo database,
- splitting id(s),
- filtering collection,
- creation of a collection with only text and sentences,
- linguistic feature extraction,
- feature extraction,
- clustering
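
As an illustration of the "text and sentences" step only, here is a minimal Java sketch of splitting one transcript into sentences. It is not taken from CodeSource.rar: the class name SentenceSplitter and the method split are hypothetical, and it assumes the transcript is English text with punctuation. Automatic YouTube captions often lack punctuation, in which case a rule-based splitter like this one returns the whole text as a single sentence.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the archive: splits a transcript into sentences
// with the JDK's rule-based BreakIterator (assumes punctuated English text).
final class SentenceSplitter {

    // Returns the trimmed, non-empty sentences found in the transcript, in order.
    static List<String> split(String transcript) {
        List<String> sentences = new ArrayList<>();
        BreakIterator boundaries = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        boundaries.setText(transcript);
        int start = boundaries.first();
        // Walk consecutive sentence boundaries until the iterator reports DONE.
        for (int end = boundaries.next(); end != BreakIterator.DONE; start = end, end = boundaries.next()) {
            String sentence = transcript.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Made-up sample text, used only to show the call.
        String sample = "Leaks matter. Who protects whistleblowers? Nobody knows.";
        for (String sentence : split(sample)) {
            System.out.println(sentence);
        }
    }
}
```

Each returned sentence could then be stored next to the id of its source video, so that later steps (feature extraction, clustering) can work at sentence level.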
The sub-directory called ExtractYoutube contains Java code for getting a transcription from an id.
Running the source code requires a Hadoop server and many libraries (see the _description file).