CWB
cwb-scan-corpus.c File Reference
#include "../cl/globals.h"
#include "../cl/corpus.h"
#include "../cl/cl.h"
#include "../cl/special-chars.h"

Data Structures

struct  _Hash
 A specialised hashtable for computing frequency distributions over tuples of lexicon IDs.
 

Macros

#define MAX_N   32
 maximum value of N (makes life a little easier)
 

Functions

void scancorpus_usage (void)
 Prints a usage message and exits the program.
 
int scancorpus_parse_options (int argc, char *argv[])
 Parses the command-line options of the program.
 
int scancorpus_word_is_regular (char *s)
 Check regularity of a token.
 
void scancorpus_add_key (char *key)
 Adds a key to global variable Hash.
 
int get_next_range (int *start, int *end)
 Reads the next range of corpus positions.
 
int main (int argc, char *argv[])
 Main function for cwb-scan-corpus.
 

Variables

struct _Hash Hash
 
Corpus * C
 corpus we're working on
 
char * reg_dir = NULL
 registry directory (NULL -> use default)
 
char * corpname = NULL
 corpus name (command-line)
 
int check_words = 0
 if set, accept only 'regular' words in frequency counts
 
CL_Regex regular_rx = NULL
 regex object for use when check_words is true.
 
char * progname = NULL
 name of this program (from shell command)
 
char * output_file = NULL
 output file name (-o option)
 
int frequency_threshold = 0
 frequency threshold for result table (-f option)
 
char * frequency_att = NULL
 p-attribute with frequency entries for corpus rows (when abusing corpus as frequency database)
 
int global_start = 0
 start scanning at this cpos (defaults to start of corpus)
 
int global_end = -1
 end scanning at this cpos; set up in main() unless changed with the -e switch.
 
char * ranges_file = NULL
 file with ranges to scan (pairs of corpus positions)
 
FILE * ranges_fh = NULL
 file handle corresponding to ranges_file
 
int quiet = 0
 if set, don't show progress information on stderr
 
int n_buckets = 0
 if set, use fixed number of buckets; otherwise, revert to cl_ngram_hash defaults
 
int debug_level = 0
 CL debug level.
 

Macro Definition Documentation

#define MAX_N   32

maximum value of N (makes life a little easier)

Referenced by main(), and scancorpus_usage().

Function Documentation

int get_next_range ( int *  start,
int *  end 
)

Reads the next range of corpus positions.

The ranges of corpus positions are taken either from global settings (-s, -e) or from a specified file (-R).

Parameters
start   Where to put the start of the next range.
end     Where to put the end of the next range.
Returns
FALSE after last range, TRUE otherwise

References CL_MAX_LINE_LENGTH, global_end, global_start, and ranges_fh.

Referenced by main().
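
The two sources of ranges described above can be illustrated with a minimal sketch. It is not the actual implementation: the names `next_range_sketch`, `sketch_start`, `sketch_end`, `sketch_fh` and `sketch_done` are hypothetical stand-ins, the one-pair-per-line file layout is an assumption, and the fixed buffer size stands in for CL_MAX_LINE_LENGTH.

```c
#include <stdio.h>

/* Hypothetical stand-ins for the program's globals. */
static int sketch_start = 0;      /* cf. global_start */
static int sketch_end = -1;       /* cf. global_end */
static FILE *sketch_fh = NULL;    /* cf. ranges_fh; NULL means "no ranges file" */
static int sketch_done = 0;       /* set once the single global range was handed out */

/* Fills *start and *end with the next range.
 * Returns 1 while ranges remain and 0 after the last one. */
int
next_range_sketch(int *start, int *end)
{
  char line[4096];                /* the real code references CL_MAX_LINE_LENGTH */

  if (sketch_fh != NULL) {
    /* File mode: assume one "start end" pair per line. */
    if (fgets(line, sizeof line, sketch_fh) == NULL)
      return 0;                   /* end of file: no more ranges */
    if (sscanf(line, "%d %d", start, end) != 2)
      return 0;                   /* assumption: a malformed line ends the scan */
    return 1;
  }

  /* Global mode: exactly one range, returned on the first call only. */
  if (sketch_done)
    return 0;
  sketch_done = 1;
  *start = sketch_start;
  *end = sketch_end;
  return 1;
}
```

A caller would typically loop with `while (next_range_sketch(&start, &end)) { ... }` and scan each range in turn.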

int main ( int  argc,
char *  argv[] 
)

Main function for cwb-scan-corpus.

void scancorpus_add_key ( char *  key)

Adds a key to global variable Hash.

int scancorpus_parse_options ( int  argc,
char *  argv[] 
)

Parses the command-line options of the program.

Parameters
argc    argc from main()
argv    argv from main()
Returns
The value of global optind after the function has run.

References check_words, debug_level, frequency_att, frequency_threshold, global_end, global_start, n_buckets, output_file, quiet, ranges_file, reg_dir, and scancorpus_usage().

Referenced by main().

void scancorpus_usage ( void  )

Prints a usage message and exits the program.

References MAX_N, and VERSION.

Referenced by main(), and scancorpus_parse_options().

int scancorpus_word_is_regular ( char *  s)

Check regularity of a token.

A token is "regular" if it contains only letters, numbers and dashes (with no dash at the start or end).

"Regularity" is used as a filter on the corpus iff the -C option is specified.

Parameters
s       String containing the token to check.
Returns
True if the token is regular, otherwise false.

References TCorpus::charset, cl_iso_char_is_alphanumeric(), cl_regex_match(), and utf8.

Referenced by main(), and scancorpus_add_key().
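
For plain ASCII input, the regularity rule stated above can be sketched as follows. This is an illustrative approximation only: the name `is_regular_ascii` is hypothetical, the treatment of the empty string is an assumption, and the sketch ignores the corpus charset, cl_iso_char_is_alphanumeric() and cl_regex_match(), which the real function references.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical ASCII-only approximation of the "regular token" rule:
 * only letters, digits and dashes, with no dash at the start or end.
 * Returns 1 if the token is regular, 0 otherwise. */
int
is_regular_ascii(const char *s)
{
  size_t i, len;

  if (s == NULL)
    return 0;
  len = strlen(s);
  if (len == 0)
    return 0;                     /* assumption: an empty token is not regular */
  if (s[0] == '-' || s[len - 1] == '-')
    return 0;                     /* no dash at the start or end */
  for (i = 0; i < len; i++) {
    unsigned char c = (unsigned char) s[i];
    if (!isalnum(c) && c != '-')
      return 0;                   /* anything else makes the token irregular */
  }
  return 1;
}
```

Under this sketch, "well-known" and "abc123" count as regular, while "-abc", "abc-" and "a.b" do not.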

Variable Documentation

Corpus* C

corpus we're working on

Referenced by regex2dfa(), and WriteStates().

int check_words = 0

if set, accept only 'regular' words in frequency counts

Referenced by main(), scancorpus_add_key(), and scancorpus_parse_options().

char* corpname = NULL

corpus name (command-line)

Referenced by main(), and scancorpus_add_key().

int debug_level = 0

CL debug level.

Referenced by main(), and scancorpus_parse_options().

char* frequency_att = NULL

p-attribute with frequency entries for corpus rows (when abusing corpus as frequency database)

Referenced by main(), and scancorpus_parse_options().

int frequency_threshold = 0

frequency threshold for result table (-f option)

Referenced by main(), and scancorpus_parse_options().

int global_end = -1

end scanning at this cpos; set up in main() unless changed with the -e switch.

See also
global_start

Referenced by get_next_range(), main(), and scancorpus_parse_options().

int global_start = 0

start scanning at this cpos (defaults to start of corpus)

Referenced by get_next_range(), main(), and scancorpus_parse_options().

struct _Hash Hash

Referenced by main(), and scancorpus_add_key().
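
The idea of counting frequencies over tuples of lexicon IDs can be illustrated with a minimal chained hash table. This page does not show the fields of struct _Hash, so nothing below reflects its actual layout: every name (`SketchEntry`, `sketch_table`, `sketch_hash`, `sketch_count`), the tuple length and the bucket count are hypothetical choices made for the example.

```c
#include <stdlib.h>
#include <string.h>

#define SKETCH_N 2            /* tuple length (hypothetical; the real limit is MAX_N) */
#define SKETCH_BUCKETS 1021   /* fixed bucket count, chosen arbitrarily */

typedef struct sketch_entry {
  int ids[SKETCH_N];          /* tuple of lexicon IDs */
  int freq;                   /* number of times the tuple was seen */
  struct sketch_entry *next;  /* next entry in the same bucket */
} SketchEntry;

static SketchEntry *sketch_table[SKETCH_BUCKETS];

/* Simple multiplicative hash over the IDs of one tuple. */
static unsigned int
sketch_hash(const int *ids)
{
  unsigned int h = 0;
  int i;

  for (i = 0; i < SKETCH_N; i++)
    h = h * 33u + (unsigned int) ids[i];
  return h % SKETCH_BUCKETS;
}

/* Records one occurrence of the tuple.
 * Returns its new frequency, or -1 if memory allocation fails. */
int
sketch_count(const int *ids)
{
  unsigned int b = sketch_hash(ids);
  SketchEntry *e;

  for (e = sketch_table[b]; e != NULL; e = e->next)
    if (memcmp(e->ids, ids, sizeof e->ids) == 0)
      return ++e->freq;           /* tuple already known: bump its count */

  e = malloc(sizeof *e);
  if (e == NULL)
    return -1;
  memcpy(e->ids, ids, sizeof e->ids);
  e->freq = 1;
  e->next = sketch_table[b];      /* insert at the head of the bucket chain */
  sketch_table[b] = e;
  return 1;
}
```

Because the whole tuple is the key, the tuples (3, 7) and (7, 3) are counted separately.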

int n_buckets = 0

if set, use fixed number of buckets; otherwise, revert to cl_ngram_hash defaults

Referenced by main(), and scancorpus_parse_options().

char* output_file = NULL

output file name (-o option)

Referenced by main(), and scancorpus_parse_options().

char* progname = NULL

name of this program (from shell command)

Referenced by main().

int quiet = 0

if set, don't show progress information on stderr

Referenced by main(), and scancorpus_parse_options().

FILE* ranges_fh = NULL

file handle corresponding to ranges_file

Referenced by get_next_range(), and main().

char* ranges_file = NULL

file with ranges to scan (pairs of corpus positions)

Referenced by main(), and scancorpus_parse_options().

char* reg_dir = NULL

registry directory (NULL -> use default)

Referenced by main(), and scancorpus_parse_options().

CL_Regex regular_rx = NULL

regex object for use when check_words is true.

See also
scancorpus_word_is_regular