The file CodeSource.rar contains an archive with the main files that process the YouTube corpus about net-activism and whistleblowing (see https://doi.org/10.5281/zenodo.5824627).

The file _pipeline.txt describes the different steps:
- download,
- storage in a Mongo database,
- splitting id(s),
- filtering the collection,
- creation of a collection with only text and sentences,
- linguistic feature extraction,
- features extraction,
- clustering

The sub-directory called ExtractYoutube contains Java code for getting a transcription from an id.

Running the source code requires a Hadoop server and a number of tools and libraries, such as:
1. OS Linux
2. MongoDB https://www.mongodb.com/download-center?filter=enterprise#enterprise
3. Java 1.8 or later
4. Python 2.7 or later
5. Maltparser http://www.maltparser.org/download.html
6. Stanford NER http://nlp.stanford.edu/software/CRF-NER.shtml
7. TreeTagger http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
8. SRILM http://www.speech.sri.com/projects/srilm/download.html
9. MorphSegmenter https://www-i6.informatik.rwth-aachen.de/~mansour/MorphSegmenter/

# About Python
1. langdetect-1.0.6 https://pypi.python.org/pypi/langdetect?
2. Morfessor-2.0.2alpha3 https://pypi.python.org/pypi/Morfessor/2.0.2alpha3
3. nltk-3.2.1 https://pypi.python.org/pypi/nltk/3.2.1
4. numpy-1.9.3 https://pypi.python.org/pypi/numpy/1.11.2rc1
5. polyglot-master https://pypi.python.org/pypi/polyglot/16.7.4
6. pycld2-0.31 https://pypi.python.org/pypi/pycld2/0.31
7. PyICU-1.9.3 https://pypi.python.org/pypi/PyICU/1.9.3
8. pymongo-3.3.0 https://pypi.python.org/pypi/pymongo/3.3.0
9. six-1.10.0 https://pypi.python.org/pypi/six/1.10.0
10. wheel-0.29.0 https://pypi.python.org/pypi/wheel/0.30.0a0
11. JIEBA https://pypi.python.org/pypi/jieba/
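
Before running the pipeline, it can help to check that the Python packages listed above are importable. The following is a minimal sketch and is not part of the archive. It assumes a Python 3 interpreter, and the import names in `REQUIRED_MODULES` are assumptions based on each package's usual top-level module (for example, PyICU is normally imported as `icu`). The name `find_missing` is hypothetical and used only for illustration. The check does not verify the versions listed above.

```python
import importlib.util

# Package name (as listed above) -> assumed top-level import name.
# These import names are assumptions, not taken from the archive.
REQUIRED_MODULES = {
    "langdetect": "langdetect",
    "Morfessor": "morfessor",
    "nltk": "nltk",
    "numpy": "numpy",
    "polyglot": "polyglot",
    "pycld2": "pycld2",
    "PyICU": "icu",
    "pymongo": "pymongo",
    "six": "six",
    "wheel": "wheel",
    "jieba": "jieba",
}


def find_missing(modules=REQUIRED_MODULES):
    """Return the sorted package names whose import module cannot be found."""
    missing = []
    for package, module in modules.items():
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module) is None:
            missing.append(package)
    return sorted(missing)


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All listed Python packages are importable.")
```

The external tools (MongoDB, Maltparser, Stanford NER, TreeTagger, SRILM, MorphSegmenter) are not Python packages and would need to be checked separately.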