Software Environment
====================

1. All scripts were written for, and tested with, Python 3.9 (Python Software Foundation, 2020).

2. Required modules:  
* csv  
* geopy (Geopy Contributors, 2023)  
* os  
* pickle  
* re  
* time  
* tkinter

Installation of Modules
-----------------------
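After installing Python 3.9, install any required module that is not yet available: establish an internet connection, enter `pip install [module name]` on a command line with sufficient rights, and press `ENTER`. Of the modules listed above, only `geopy` is a third-party package; the others belong to the Python standard library, although some Linux distributions package `tkinter` separately.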
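
Whether a module still needs to be installed can be checked from Python itself. The following is a minimal sketch, not part of the distributed scripts; the function name `find_missing` is hypothetical. It only tests whether each module can be located, not whether it works correctly.

```python
import importlib.util

# Modules listed under "Required modules" in this document.
REQUIRED_MODULES = ["csv", "geopy", "os", "pickle", "re", "time", "tkinter"]


def find_missing(modules):
    """Return the names of modules that cannot be located for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are available.")
```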

Installation of Scripts
-----------------------
* Place all scripts in one single folder.
* Make them executable.
* In the files `script_s_extract.py` and `script_c_extract.py`, replace `info@dummy.eu` with a valid e-mail address, e.g. that of your research institution.

General Usage
=============
(See also the Usage sections in the descriptions of the individual scripts.)

1. First prepare the data filenames and the first line in each file as described in the Raw Data Preparation section.
2. Run the script `script_analysis-master.py`.
3. Run the script `script_findexternals.py`.
4. Optionally, run the script `script_replace-dots.py` if you are working in an environment where a comma (`,`) is used as the decimal separator.
5. Run the script `script_combine-event-data.py`.

Raw Data Preparation
====================
1. The raw data extracted from Joyclub must be text data, encoded in UTF-8, and stored in separate files as follows.
2. For every location, and then for women, men, and couples, separate text files have to be produced.
3. Each file has to be prepared as follows:
	* A first line must be inserted before the raw data, which follow from the second line onwards.
	* That first line must contain the plain geographic name of the place to which distances are to be calculated (e.g. the venue).
	* The filename must follow the scheme:
		* `[P-]filename [men|women|couples].txt`
		* The prefix `P-` must be added if the raw data relate to a couple.
		* The selector `men`, `women`, or `couples` has to be used if the script `script_combine-event-data.py` will be used to aggregate data.
		* The filename should consist of a string which helps to identify the location or event for later use of the data.
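
The filename scheme can be checked with a regular expression. The sketch below is illustrative only and is not taken from the scripts; the pattern and the function name `parse_filename` are assumptions.

```python
import re

# Hypothetical pattern for the scheme `[P-]filename [men|women|couples].txt`.
# The selector is treated as optional because it is only required when the
# data are later aggregated.
FILENAME_PATTERN = re.compile(
    r"^(?P<prefix>P-)?(?P<name>.+?)(?: (?P<selector>men|women|couples))?\.txt$"
)


def parse_filename(filename):
    """Split a raw data filename into prefix flag, name, and selector."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None  # Filename does not follow the scheme.
    return {
        "is_couple": match.group("prefix") is not None,
        "name": match.group("name"),
        "selector": match.group("selector"),
    }


print(parse_filename("P-Summer Party couples.txt"))
```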

Descriptions of the Scripts
===========================

file: script_analysis-master.py
-------------------------------
### Description ###
* Opens a window for selecting a folder with the raw data from the relevant Joyclub webpage.
* Then invokes the scripts `script_s_extract.py` and `script_c_extract.py` to harvest the relevant data from the raw data.
* The script `script_c_extract.py` is only applied to files whose filenames begin with `P-`.
	
### Usage ###
* Prepare the data files as described in the Raw Data Preparation section.
* Run the script and select the folder in the dialog.

file: script_s_extract.py
-------------------------
### Description ###
* Retrieves, and makes processable, those entries in the raw data that present a single age rather than two (i.e. solo women, solo men, accompanied solos).
* Invoked by the script `script_analysis-master.py`.
* Scans all files in the selected folder.
* In each file, it retrieves the first line as the geographic name of the place to which distances are to be calculated (e.g. the venue).
* In each file, it scans each of the second to the last lines for a string with the scheme `## Jahre`, not preceded by a plus (`+`) sign.
* For each such entry, it retrieves the value `##` as `AGE`.
* Where such entry has been found, it retrieves the plain geographic name of the location related to the entry of the person.
* In the folder where the script is located, it builds or amends a cache containing already retrieved geodata with the filename `location_cache.pkl`.
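
The age matching and the cache described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of `script_s_extract.py`: the regular expression, the function names, and the sample line layout are hypothetical, and `lookup` stands in for whatever geocoding call the script makes (presumably via `geopy`).

```python
import os
import pickle
import re

# Assumed pattern: two digits followed by " Jahre", not preceded by "+"
# (which would mark the second age of a couple) or by another digit.
SOLO_AGE = re.compile(r"(?<![+\d])(\d{2}) Jahre")


def extract_solo_age(line):
    """Return AGE as an int, or None if the line holds no solo age."""
    match = SOLO_AGE.search(line)
    return int(match.group(1)) if match else None


def load_cache(path):
    """Load the pickled cache, or start with an empty one."""
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    return {}


def save_cache(cache, path):
    """Write the cache back to disk."""
    with open(path, "wb") as handle:
        pickle.dump(cache, handle)


def cached_lookup(name, cache, lookup):
    """Return geodata for name, calling lookup(name) only on a cache miss."""
    if name not in cache:
        cache[name] = lookup(name)
    return cache[name]


# Hypothetical sample line; real raw data may be laid out differently.
print(extract_solo_age("Anna, 34 Jahre, Berlin"))
```

Caching avoids repeating a geocoding request for a place name that has already been resolved.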

### Usage ###
* The script operation has to be invoked by the script `script_analysis-master.py`.

### Output ###
* Entry by entry, the script writes all data on such entries into a list in the form:  
	`[AGE]<tab>[distance between first line place and the place related to the entry]<new line>`
* The script saves the result in a UTF-8 encoded file whose name is that of the original file with the suffix `.txt` replaced by `.s-exp.txt`.
* The original file remains intact.
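
The output format can be sketched as below. The function names are hypothetical. The sketch assumes that the result filename ends in `.s-exp.txt` (as the filenames expected by `script_combine-event-data.py` suggest), that distances are written with one decimal place, and that a missing distance is written as an empty field; none of these details is confirmed by the scripts themselves.

```python
def result_filename(filename):
    """Derive the result filename from the raw data filename (assumed rule)."""
    if not filename.endswith(".txt"):
        raise ValueError("expected a .txt file: " + filename)
    return filename[: -len(".txt")] + ".s-exp.txt"


def format_row(age, distance_km):
    """Build one output line: AGE, a tab, the distance, and a newline."""
    # Assumption: one decimal place, empty field if no distance is available.
    distance = "" if distance_km is None else f"{distance_km:.1f}"
    return f"{age}\t{distance}\n"


print(result_filename("Summer Party women.txt"))
print(repr(format_row(34, 12.34)))
```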

file: script_c_extract.py
-------------------------
### Description ###
* Retrieves and makes processable data from the raw data which relate to entries where two ages are presented (couples).
* Invoked by the script `script_analysis-master.py`.
* Scans all files in the selected folder whose filenames begin with `P-`.
* In each file, it retrieves the first line as the geographic name of the place to which distances are to be calculated (e.g. the venue).
* In each file, it scans each of the second to the last lines for a string with the scheme `##+## Jahre`.
* For each such entry, it retrieves the two values `##` as `AGE_F` and `AGE_M`. By convention, in Joyclub, in entries that relate to mixed gender couples, the age of the woman is the first mention, and the age of the man is the second mention.
* Where such entry has been found, it retrieves the plain geographic name of the location related to the entry of the persons.
* In the folder where the script is located, it builds or amends a cache containing already retrieved geodata with the filename `location_cache.pkl`.
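
Matching the `##+## Jahre` scheme can be sketched with a regular expression. The pattern, the function name, and the sample line are assumptions for illustration and are not taken from `script_c_extract.py`.

```python
import re

# Assumed pattern: two two-digit ages joined by "+", followed by " Jahre".
COUPLE_AGES = re.compile(r"(?<!\d)(\d{2})\+(\d{2}) Jahre")


def extract_couple_ages(line):
    """Return (AGE_F, AGE_M) as ints, or None if no couple entry is found."""
    match = COUPLE_AGES.search(line)
    if match is None:
        return None
    # By the convention described above, the first age is the woman's.
    return int(match.group(1)), int(match.group(2))


# Hypothetical sample line; real raw data may be laid out differently.
print(extract_couple_ages("Paar, 34+36 Jahre, Hamburg"))
```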

### Usage ###
* The script operation has to be invoked by the script `script_analysis-master.py`.

### Output ###
* Entry by entry, the script writes all data on entries into a list in the form:  
	`[AGE_F]<tab>[AGE_M]<tab>[distance between first line place and the place related to the entry]<new line>`
* It saves the result in a UTF-8 encoded file whose name is that of the original file with the suffix `.txt` replaced by `.c-exp.txt`.
* The original file remains intact.


file: script_replace-dots.py
----------------------------
### Description ###
* Optional script that replaces dots with commas as the decimal separator in the geodata, for use with software in language versions that require this.
* Only handles files whose filenames end with `-exp.txt`, i.e. the result files of the scripts `script_s_extract.py` and `script_c_extract.py`.
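
The replacement can be sketched as follows. This is an illustration, not the code of `script_replace-dots.py`; the function names are hypothetical, and the sketch assumes that only dots standing between two digits are decimal separators.

```python
import re

# Assumption: a decimal separator is a dot with a digit on either side.
DECIMAL_DOT = re.compile(r"(?<=\d)\.(?=\d)")


def is_result_file(filename):
    """True for result files of the extraction scripts."""
    return filename.endswith("-exp.txt")


def dots_to_commas(text):
    """Replace decimal dots with decimal commas."""
    return DECIMAL_DOT.sub(",", text)


print(repr(dots_to_commas("34\t12.5\n")))
```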

### Usage ###
* Run the script and select the folder in the dialog.

file: script_findexternals.py
-----------------------------
### Description ###
* Extracts external registrations from raw data from Joyclub event participant pages.
* Opens a window for selecting a folder with the raw data from the relevant Joyclub webpages.
* Scans all files in the selected folder and searches for matches with the pattern for external registrations to an event.
	
### Usage ###
* Run the script and select the folder in the dialog.

### Output ###
* Writes a file named `Extern.txt` that lists, for each file in which the pattern was detected, the filename followed by the detected external registrations.


file: script_combine-event-data.py
----------------------------------
### Description ###
* Combines the output of the scripts invoked by the script `script_analysis-master.py` by producing a CSV (comma-separated values) file.

### Usage ###
* Run the script after having processed data in a folder, using the script `script_analysis-master.py`.
* It is important that the filenames of data files relating to processed data of
	* solo women end with `women.s-exp.txt`,
	* solo men end with `men.s-exp.txt`,
	* couples end with `couple.c-exp.txt` or `couples.c-exp.txt`, and
	* accompanied solos (persons registered to an event as a couple on the basis of a solo profile) end with `couple.s-exp.txt`.
* If the instructions in the Raw Data Preparation section have been followed, the scripts automatically produce the appropriate filenames.
* Select the folder in the dialog.
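
How the filename endings map to the `COUPLE` codes defined in the Output section can be sketched as below. The table and the function name `classify` are hypothetical and not taken from the script. The order of the checks matters, because every name ending in `women.s-exp.txt` also ends in `men.s-exp.txt`.

```python
# Filename endings and their COUPLE codes; "women" must be tested before "men".
SUFFIX_TO_CODE = [
    ("women.s-exp.txt", 0),    # solo woman
    ("men.s-exp.txt", 2),      # solo man
    ("couples.c-exp.txt", 1),  # couple
    ("couple.c-exp.txt", 1),   # couple
    ("couple.s-exp.txt", 3),   # accompanied solo
]


def classify(filename):
    """Return the COUPLE code for a result file, or None if it does not match."""
    for suffix, code in SUFFIX_TO_CODE:
        if filename.endswith(suffix):
            return code
    return None


print(classify("Summer Party women.s-exp.txt"))
```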

### Output ###
A CSV file encoded in UTF-8, with a semicolon (`;`) as the separator. The first line contains the column variable names; the columns are:
* `AGEF`: Age of a female person. If a case (row) relates to a solo man or an accompanied solo, the entry has no value.
* `AGEM`: Age of a male person. If a case (row) relates to a solo woman or an accompanied solo, the entry has no value.
* `AGEACC`: Age of an accompanied solo (see Usage section for the definition). If a case (row) relates to a solo woman, a solo man, or a couple, the entry has no value.
* `COUPLE`: A nominal indicator of the type of the case (row):
	* `0` if the case relates to a solo woman,
	* `1` if the case relates to a couple,
	* `2` if the case relates to a solo man, and
	* `3` if the case relates to an accompanied solo.
* `EVENT`: The filename where the data were stored, without the prefix `P-` and without any suffix.
* `NUM_VISITORS`: The number of persons to which the case (row) relates:
	* `1` if the case relates to a solo woman or a solo man, and
	* `2` if the case relates to a couple or an accompanied solo.
* `DIST`: The geographic distance (in kilometers) between the place mentioned in the first line of the raw data set, and the registered place of residence of the case. If a geographic distance could not be calculated, the entry has no value.
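
Writing rows in this layout with the standard `csv` module can be sketched as follows. The helper names `make_row` and `write_rows` and the sample values are hypothetical; the sketch is not the code of `script_combine-event-data.py`.

```python
import csv
import io

COLUMNS = ["AGEF", "AGEM", "AGEACC", "COUPLE", "EVENT", "NUM_VISITORS", "DIST"]


def make_row(code, event, dist, age_f=None, age_m=None, age_acc=None):
    """Build one case (row); fields left as None are written as empty values."""
    return {
        "AGEF": age_f,
        "AGEM": age_m,
        "AGEACC": age_acc,
        "COUPLE": code,
        "EVENT": event,
        # Couples (1) and accompanied solos (3) count as two visitors.
        "NUM_VISITORS": 2 if code in (1, 3) else 1,
        "DIST": dist,
    }


def write_rows(rows, handle):
    """Write the header line and all rows, separated by semicolons."""
    writer = csv.DictWriter(
        handle, fieldnames=COLUMNS, delimiter=";", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


# Usage with an in-memory buffer and made-up values.
buffer = io.StringIO()
write_rows(
    [
        make_row(0, "Summer Party", 12.5, age_f=34),
        make_row(1, "Summer Party", None, age_f=34, age_m=36),
    ],
    buffer,
)
print(buffer.getvalue())
```

When writing to a real file, opening it with `newline=""` and `encoding="utf-8"` is the usual practice for the `csv` module.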

Testing
=======
The operation of the scripts can be tested with the files in the ZIP archive `script_testdata.zip`, which reflect what raw data typically look like. The test files are mock event data and do not reflect any real web content; the scripts can also be used for raw data of residents. A geographic location has already been entered in the first line.

References
==========
* Geopy Contributors. (2023). GeoPy (2.4.0) [Software]. https://geopy.readthedocs.io/en/stable/  
* Python Software Foundation. (2020). Python (3.9.0) [Software]. https://www.python.org/downloads/release/python-390/