
Word Segmentation in Sanskrit Using Energy Based Models

Amrith Krishna; Bishal Santra; Sasi Prasanth Bandaru; Gaurav Sahu; Vishnu Dutt Sharma; Pavankumar Satuluri; Pawan Goyal

This is the repository for word segmentation in Sanskrit using energy based models.


# Word Segmentation in Sanskrit Using Energy Based Models

## Getting Started
Download the two compressed files, '' and 'wordsegmentation.rar', to your working directory and extract them into folders named 'dir' and 'wordsegmentation', respectively.
Your working directory should then look as follows:
* Working Directory
  * wordsegmentation
    * skt_dcs_DS.bz2_4K_bigram_mir_10K
    * skt_dcs_DS.bz2_4K_bigram_mir_heldout
  * dir
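
Before training, it can help to confirm that extraction produced the layout listed above. The following is a minimal sketch, not part of the repository: the helper name `missing_dirs` is hypothetical, and the folder names are copied from the listing above.

```python
from pathlib import Path

# Folder names copied from the layout above; the helper itself is hypothetical.
EXPECTED_DIRS = [
    "dir",
    "wordsegmentation",
    "wordsegmentation/skt_dcs_DS.bz2_4K_bigram_mir_10K",
    "wordsegmentation/skt_dcs_DS.bz2_4K_bigram_mir_heldout",
]


def missing_dirs(working_dir):
    """Return the expected folders that are absent under working_dir."""
    root = Path(working_dir)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs(".")
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("Working directory layout looks complete.")
```

Run it from the working directory; an empty result means all four folders were found.
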
## Prerequisites
* Python3
  * scipy
  * numpy
  * csv, pickle, multiprocessing, bz2 (part of the Python standard library, no separate installation needed)
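
A quick way to check the prerequisites is to ask Python whether each module can be located. This is a minimal sketch using only the standard library; the helper name `unavailable` is an assumption made for illustration. Note that `find_spec` only reports whether a module can be found, not whether importing it will succeed.

```python
import importlib.util

# scipy and numpy are third-party packages; the rest ship with Python 3.
THIRD_PARTY = ["scipy", "numpy"]
STDLIB = ["csv", "pickle", "multiprocessing", "bz2"]


def unavailable(modules):
    """Return the module names that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = unavailable(THIRD_PARTY + STDLIB)
    if missing:
        print("Modules not found:", ", ".join(missing))
    else:
        print("All prerequisite modules were found.")
```
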
## Instructions for Training
Change your current directory to 'dir'.
Then run the training script with the following command:
* python
To train on a different input feature (BM2, BM3, BR2, BR3, PM2, PM3, PR2 or PR3), modify the bz2_input_folder value in the main function before starting the training.
Feature  | bz2_input_folder
------------- | -------------
BM2 | wordsegmentation/skt_dcs_DS.bz2_4K_bigram_mir_10K/
BM3 | wordsegmentation/skt_dcs_DS.bz2_1L_bigram_mir_10K/
BR2 | wordsegmentation/skt_dcs_DS.bz2_4K_bigram_rfe_10K/
BR3 | wordsegmentation/skt_dcs_DS.bz2_1L_bigram_rfe_10K/
PM2 | wordsegmentation/skt_dcs_DS.bz2_4K_pmi_mir_10K/
PM3 | wordsegmentation/skt_dcs_DS.bz2_1L_pmi_mir_10K2/
PR2 | wordsegmentation/skt_dcs_DS.bz2_4K_pmi_rfe_10K/
PR3 | wordsegmentation/skt_dcs_DS.bz2_1L_pmi_rfe_10K/
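
The table above can also be kept as a lookup so that the folder is chosen by feature tag instead of being edited by hand. This is a sketch under assumptions: `FEATURE_FOLDERS` and `input_folder_for` are hypothetical names that do not come from the repository, and the result would still have to be assigned to bz2_input_folder in the main function. Folder paths follow the table; the helper ensures that the returned path ends with a path separator.

```python
import os

# Feature tag -> input folder, following the table above (hypothetical lookup).
FEATURE_FOLDERS = {
    "BM2": "wordsegmentation/skt_dcs_DS.bz2_4K_bigram_mir_10K/",
    "BM3": "wordsegmentation/skt_dcs_DS.bz2_1L_bigram_mir_10K/",
    "BR2": "wordsegmentation/skt_dcs_DS.bz2_4K_bigram_rfe_10K/",
    "BR3": "wordsegmentation/skt_dcs_DS.bz2_1L_bigram_rfe_10K/",
    "PM2": "wordsegmentation/skt_dcs_DS.bz2_4K_pmi_mir_10K/",
    "PM3": "wordsegmentation/skt_dcs_DS.bz2_1L_pmi_mir_10K2/",
    "PR2": "wordsegmentation/skt_dcs_DS.bz2_4K_pmi_rfe_10K/",
    "PR3": "wordsegmentation/skt_dcs_DS.bz2_1L_pmi_rfe_10K/",
}


def input_folder_for(tag):
    """Return the input folder for a feature tag, ending with a separator."""
    try:
        folder = FEATURE_FOLDERS[tag]
    except KeyError:
        known = ", ".join(sorted(FEATURE_FOLDERS))
        raise ValueError(f"Unknown feature tag {tag!r}; expected one of: {known}") from None
    # Joining with an empty component appends a separator only if one is missing.
    return os.path.join(folder, "")
```

For example, `bz2_input_folder = input_folder_for("PM3")` would select the PM3 row.
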
## Instructions for Testing
After training, modify the 'modelList' dictionary in '' with the name of the neural network that was saved during training. When testing a feature, provide the name of the neural net that was trained on that same feature.
We provide a trained model only for BM2, which was our best-performing feature. If the name of the neural net is left unchanged, testing is performed on the pre-trained BM2 model provided in outputs/train_t7978754709018.
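
The pairing of each feature with a model trained on that same feature can be pictured as a small mapping. The actual layout of 'modelList' is not shown here, so everything below is an assumption: the dictionary shape (feature tag to saved network name) and both helper names are hypothetical. Only the BM2 path is taken from the text above.

```python
# Path of the pre-trained BM2 model mentioned above.
PRETRAINED_BM2 = "outputs/train_t7978754709018"

# Assumed shape: feature tag -> name of the network saved during training.
model_list = {"BM2": PRETRAINED_BM2}


def register_model(models, tag, saved_name):
    """Return a copy of models in which tag points at saved_name."""
    updated = dict(models)
    updated[tag] = saved_name
    return updated


def model_for(models, tag):
    """Look up the model for tag, refusing to fall back to another feature."""
    if tag not in models:
        raise KeyError(f"No trained model registered for feature {tag!r}")
    return models[tag]
```

Failing loudly for an unregistered tag avoids accidentally testing one feature with a network trained on another.
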
To test with a particular feature vector, pass the feature's tag when running the test script:
* python -t <tag>
For example:  
  * python -t BM2
After testing finishes, run the following command to see the precision and recall values for both the word and word++ prediction tasks:
* python <tag>
For example:  
  * python BM2
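
For readers unfamiliar with the two reported metrics, the sketch below shows how precision and recall are commonly computed for one sentence from a predicted and a gold list of items. It is an illustration of the standard definitions, not the repository's evaluation script; the function name and the placeholder tokens are assumptions, and the way scores are aggregated over sentences may differ in the actual code.

```python
from collections import Counter


def precision_recall(predicted, gold):
    """Compute precision and recall between two lists of predicted/gold items.

    Items are matched as multisets, so a repeated item only counts as
    correct as many times as it appears in the gold list.
    """
    overlap = sum((Counter(predicted) & Counter(gold)).values())
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Placeholder tokens, not real data from the corpus.
    p, r = precision_recall(["w1", "w2", "w3"], ["w1", "w2", "w4", "w5"])
    print(f"precision={p:.3f} recall={r:.3f}")
```

Here two of the three predictions are correct (precision 2/3) and two of the four gold items are recovered (recall 1/2).
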
