CWB
cl.h File Reference

This file contains the API for the CWB "Corpus Library" (CL). More...

#include <stdlib.h>
#include <stdio.h>

Data Structures

struct  ClAutoString
 Underlying structure for the ClAutoString object. More...
 
struct  _DCR
 The DynCallResult object (needed to allocate space for dynamic function arguments) More...
 
struct  TCorpusProperty
 The CorpusProperty object. More...
 
struct  _cl_lexhash_entry
 Underlying structure for the cl_lexhash_entry class. More...
 
struct  _cl_lexhash_entry::_cl_lexhash_entry_data
 This entry's data fields. More...
 
struct  _cl_ngram_hash_entry
 Underlying structure for the cl_ngram_hash_entry class. More...
 

Macros

#define CDA_OK   0
 Error code: everything is fine; actual error values are all less than 0. More...
 
#define CDA_ENULLATT   -1
 Error code: NULL passed as attribute argument. More...
 
#define CDA_EATTTYPE   -2
 Error code: function was called on illegal attribute. More...
 
#define CDA_EIDORNG   -3
 Error code: id out of range. More...
 
#define CDA_EPOSORNG   -4
 Error code: position out of range. More...
 
#define CDA_EIDXORNG   -5
 Error code: index out of range. More...
 
#define CDA_ENOSTRING   -6
 Error code: no such string encoded. More...
 
#define CDA_EPATTERN   -7
 Error code: illegal pattern. More...
 
#define CDA_ESTRUC   -8
 Error code: no structure at position. More...
 
#define CDA_EALIGN   -9
 Error code: no alignment at position. More...
 
#define CDA_EREMOTE   -10
 Error code: error in remote access. More...
 
#define CDA_ENODATA   -11
 Error code: can't load/create necessary data. More...
 
#define CDA_EARGS   -12
 Error code: error in arguments for dynamic call or CL function. More...
 
#define CDA_ENOMEM   -13
 Error code: memory fault [unused]. More...
 
#define CDA_EOTHER   -14
 Error code: other error. More...
 
#define CDA_ENYI   -15
 Error code: not yet implemented. More...
 
#define CDA_EBADREGEX   -16
 Error code: bad regular expression. More...
 
#define CDA_EFSETINV   -17
 Error code: invalid feature set format. More...
 
#define CDA_EBUFFER   -18
 Error code: buffer overflow (hard-coded internal buffer sizes) More...
 
#define CDA_EINTERNAL   -19
 Error code: internal data consistency error (really bad) More...
 
#define CDA_EACCESS   -20
 Error code: insufficient access permissions. More...
 
#define CDA_EPOSIX   -21
 Error code: POSIX-level error: check errno or perror() More...
 
#define CDA_CPOSUNDEF   INT_MIN
 Error code: undefined corpus position (use this code to avoid ambiguity with negative cpos) More...
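
As a rough illustration of how the codes above might be consumed, the sketch below copies a few of the macro values from this list and maps them to their symbolic names. `cda_code_name` and `cda_is_error` are hypothetical helpers written for this example so that it compiles without CWB installed; they are not part of the CL. A real program would include cl.h and use cl_error_string() or cl_error() instead.

```c
#include <limits.h>

/* Values copied from the macro list above (a subset only). */
#define CDA_OK          0
#define CDA_ENULLATT   -1
#define CDA_EPATTERN   -7
#define CDA_EPOSIX     -21
#define CDA_CPOSUNDEF  INT_MIN

/* Hypothetical helper (not a CL function): symbolic name of a code. */
const char *cda_code_name(int code)
{
  switch (code) {
  case CDA_OK:        return "CDA_OK";
  case CDA_ENULLATT:  return "CDA_ENULLATT";
  case CDA_EPATTERN:  return "CDA_EPATTERN";
  case CDA_EPOSIX:    return "CDA_EPOSIX";
  case CDA_CPOSUNDEF: return "CDA_CPOSUNDEF";
  default:            return "code not covered by this sketch";
  }
}

/* Hypothetical helper: every error value is below CDA_OK, so a sign test
 * is enough to tell failure from success. */
int cda_is_error(int code)
{
  return code < CDA_OK;
}
```

Because CDA_CPOSUNDEF is INT_MIN, it also satisfies the sign test while staying distinct from any small negative offset a caller may be working with.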
 
#define cl_free(p)   do { if ((p) != NULL) { free(p); p = NULL; } } while (0)
 Safely frees memory. More...
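
The following minimal sketch shows what the macro definition listed above does in practice. The macro text is copied from this page; `cl_free_demo` is a hypothetical function added only to exercise it.

```c
#include <stdlib.h>

/* Macro definition as listed in this reference. */
#define cl_free(p)   do { if ((p) != NULL) { free(p); p = NULL; } } while (0)

/* Hypothetical demo: returns 1 if the pointer is NULL after cl_free(),
 * 0 if it is not, and -1 if the initial allocation failed. */
int cl_free_demo(void)
{
  char *buffer = malloc(64);
  if (buffer == NULL)
    return -1;

  cl_free(buffer);   /* releases the block and resets buffer to NULL */
  cl_free(buffer);   /* buffer is already NULL, so nothing is freed twice */

  return buffer == NULL;
}
```

Since the macro assigns to its argument, the argument has to be a modifiable pointer variable; an expression such as `ptr + 1` would not compile.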
 
#define CL_MAX_CORPUS_SIZE   2147483647
 Maximum size of a CWB corpus. More...
 
#define CL_MAX_LINE_LENGTH   4096
 General string buffer size constant. More...
 
#define CL_MAX_FILENAME_LENGTH   1024
 String buffer size constant (for filenames). More...
 
#define CL_STREAM_READ   0
 I/O streams with magic for compressed files (.gz, .bz2) and pipes. More...
 
#define CL_STREAM_WRITE   1
 open in write mode More...
 
#define CL_STREAM_APPEND   2
 open in append mode (except for pipe) More...
 
#define CL_STREAM_MAGIC   0
 enable automagic recognition of stream type More...
 
#define CL_STREAM_MAGIC_NOPIPE   1
 enable automagic, but fail on attempt to open pipe (safe mode for filenames from external sources) More...
 
#define CL_STREAM_FILE   2
 read/write plain uncompressed file More...
 
#define CL_STREAM_GZIP   3
 read/write gzip-compressed file More...
 
#define CL_STREAM_BZIP2   4
 read/write bzip2-compressed file More...
 
#define CL_STREAM_PIPE   5
 read/write pipe to shell command More...
 
#define CL_STREAM_STDIO   6
 read from stdin or write to stdout (<filename> is ignored) More...
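
Note that CL_STREAM_READ and CL_STREAM_MAGIC share the value 0 because they belong to different arguments of cl_open_stream(): the first group is passed as `mode`, the second as `type`. The sketch below illustrates the general idea of guessing a stream type from a filename, using only the two extensions named above. `demo_guess_stream_type` is a hypothetical function; it is not the library's detection logic, which may consider other cues (such as pipes) that are not reproduced here.

```c
#include <string.h>

/* Type constants copied from the macro list above (subset only). */
#define CL_STREAM_FILE    2
#define CL_STREAM_GZIP    3
#define CL_STREAM_BZIP2   4

/* Returns non-zero if s ends with suffix. */
static int has_suffix(const char *s, const char *suffix)
{
  size_t n = strlen(s), m = strlen(suffix);
  return n >= m && strcmp(s + (n - m), suffix) == 0;
}

/* Hypothetical illustration only: pick a stream type from the extension. */
int demo_guess_stream_type(const char *filename)
{
  if (has_suffix(filename, ".gz"))
    return CL_STREAM_GZIP;
  if (has_suffix(filename, ".bz2"))
    return CL_STREAM_BZIP2;
  return CL_STREAM_FILE;   /* anything else is treated as a plain file */
}
```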
 
#define cl_xml_is_name_char(c)
 For a given character, say whether it is legal for an XML name. More...
 
#define ATT_NONE   0
 No type of attribute. More...
 
#define ATT_POS   (1<<0)
 Positional attributes, i.e. streams of word tokens, word tags - any "column" that has a value at every corpus position. More...
 
#define ATT_STRUC   (1<<1)
 Structural attributes, i.e. a set of SGML/XML-ish "regions" in the corpus delimited by the same SGML/XML tag. More...
 
#define ATT_ALIGN   (1<<2)
 Alignment attributes, i.e. a set of zones of alignment between a source and target corpus. More...
 
#define ATT_DYN   (1<<6)
 Dynamic attributes, i.e. a deprecated feature whose datatypes are still used for some CQP function parameters and return values. More...
 
#define ATT_ALL   ( ATT_POS | ATT_STRUC | ATT_ALIGN | ATT_DYN )
 shorthand for "any / all types of attribute" More...
 
#define ATT_REAL   ( ATT_POS | ATT_STRUC | ATT_ALIGN )
 shorthand for "any / all types of attribute except dynamic" More...
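
The attribute type macros are single-bit values, so ATT_ALL and ATT_REAL act as bit masks. A small sketch of testing a type against a mask follows; the macro values are copied from this page, while `att_type_in_mask` is a hypothetical helper that is not part of the CL.

```c
/* Values copied from the macro list above. */
#define ATT_NONE   0
#define ATT_POS    (1<<0)
#define ATT_STRUC  (1<<1)
#define ATT_ALIGN  (1<<2)
#define ATT_DYN    (1<<6)
#define ATT_ALL    ( ATT_POS | ATT_STRUC | ATT_ALIGN | ATT_DYN )
#define ATT_REAL   ( ATT_POS | ATT_STRUC | ATT_ALIGN )

/* Hypothetical helper: non-zero if `type` is one of the types in `mask`. */
int att_type_in_mask(int type, int mask)
{
  return (type & mask) != 0;
}
```

With these definitions, a dynamic attribute matches ATT_ALL but not ATT_REAL, and ATT_NONE matches neither mask.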
 
#define cl_new_attribute(c, name, type)   cl_new_attribute_oldstyle(c, name, type, NULL)
 Finds an attribute that matches the specified parameters, if one exists, for the given corpus. More...
 
#define cl_id2cpos(a, id, freq)   cl_id2cpos_oldstyle(a, id, freq, NULL, 0)
 Gets all the corpus positions where the specified item is found on the given P-attribute. More...
 
#define cl_idlist2cpos(a, idlist, idlist_size, sort, size)   cl_idlist2cpos_oldstyle(a, idlist, idlist_size, sort, size, NULL, 0)
 Gets a list of corpus positions matching a list of ids. More...
 
#define STRUC_INSIDE   1
 cl_cpos2boundary() return flag: specified position is WITHIN a region of this s-attribute More...
 
#define STRUC_LBOUND   2
 cl_cpos2boundary() return flag: specified position is AT THE START BOUNDARY OF a region of this s-attribute More...
 
#define STRUC_RBOUND   4
 cl_cpos2boundary() return flag: specified position is AT THE END BOUNDARY OF a region of this s-attribute More...
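
The three flags have power-of-two values, which suggests that they can be combined in a single return value. The sketch below works under that assumption, and under the further assumption that a boundary position also counts as being inside the region. `demo_boundary_flags` is a hypothetical stand-in that computes flags for one region given as [start, end]; it does not call cl_cpos2boundary() and may not match its exact behaviour.

```c
/* Values copied from the macro list above. */
#define STRUC_INSIDE   1
#define STRUC_LBOUND   2
#define STRUC_RBOUND   4

/* Hypothetical stand-in: flags for cpos relative to the region [start, end]. */
int demo_boundary_flags(int cpos, int start, int end)
{
  int flags = 0;
  if (cpos >= start && cpos <= end) {
    flags |= STRUC_INSIDE;
    if (cpos == start)
      flags |= STRUC_LBOUND;   /* first token of the region */
    if (cpos == end)
      flags |= STRUC_RBOUND;   /* last token of the region */
  }
  return flags;
}

/* Callers test individual bits rather than comparing the whole value. */
int demo_is_region_start(int flags)
{
  return (flags & STRUC_LBOUND) != 0;
}
```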
 
#define CL_DYN_STRING_SIZE   2048
 maximum size of 'dynamic' strings More...
 
#define ATTAT_NONE   0
 Dynamic att argument type: none. More...
 
#define ATTAT_POS   1
 Dynamic att argument type: ?? More...
 
#define ATTAT_STRING   2
 Dynamic att argument type: string. More...
 
#define ATTAT_INT   3
 Dynamic att argument type: integer. More...
 
#define ATTAT_VAR   4
 Dynamic att argument type: variable number of string arguments (only in arglist) More...
 
#define ATTAT_FLOAT   5
 Dynamic att argument type: floating point. More...
 
#define ATTAT_PAREF   6
 Dynamic att argument type: ?? More...
 
#define CHARSET_FOR_IDENTIFIERS   ascii
 "Dummy" charset macro for calling cl_string_canonical More...
 
#define IGNORE_CASE   1
 Flag: ignore-case in regular expression engine; fold case in cl_string_canonical. More...
 
#define IGNORE_DIAC   2
 Flag: ignore-diacritics in regular expression engine; fold diacritics in cl_string_canonical. More...
 
#define IGNORE_REGEX   4
 Flag for: don't use regular expression engine - match as a literal string. More...
 
#define REQUIRE_NFC   8
 Flag for: string requires enforcement of pre-composed normal form (NFC), which is standard in CWB indexed corpora. Applies only to UTF-8; all UTF-8 strings passed in from external sources need to be normalised in this way. Applies to the subject string when used with the regex engine, and to the sole argument string when used with cl_string_canonical. More...
 
#define ClosePositionStream(ps)   cl_delete_stream(ps)
 
#define OpenPositionStream(a, id)   cl_new_stream(a, id)
 
#define ReadPositionStream(ps, buf, size)   cl_read_stream(ps, buf, size)
 
#define attr_drop_attribute(a)   cl_delete_attribute(a)
 
#define call_dynamic_attribute(a, dcr, args, nr_args)   cl_dynamic_call(a, dcr, args, nr_args)
 
#define cderrno   cl_errno
 
#define cdperror(message)   cl_error(message)
 
#define cdperror_string(no)   cl_error_string(no)
 
#define central_corpus_directory()   cl_standard_registry()
 
#define collect_matches(a, idlist, idlist_size, sort, size, rl, rls)   cl_idlist2cpos_oldstyle(a, idlist, idlist_size, sort, size, rl, rls)
 
#define collect_matching_ids(a, re, flags, size)   cl_regex2id(a, re, flags, size)
 
#define cumulative_id_frequency(a, list, size)   cl_idlist2freq(a, list, size)
 
#define drop_corpus(c)   cl_delete_corpus(c)
 
#define find_attribute(c, name, type, data)   cl_new_attribute_oldstyle(c, name, type, data)
 
#define get_alg_attribute(a, p, start1, end1, start2, end2)   cl_cpos2alg2cpos_oldstyle(a, p, start1, end1, start2, end2)
 
#define get_attribute_size(a)   cl_max_cpos(a)
 
#define get_bounds_of_nth_struc(a, struc, start, end)   cl_struc2cpos(a, struc, start, end)
 
#define get_id_at_position(a, cpos)   cl_cpos2id(a, cpos)
 
#define get_id_of_string(a, str)   cl_str2id(a, str)
 
#define get_id_frequency(a, id)   cl_id2freq(a, id)
 
#define get_id_from_sortidx(a, sid)   cl_sort2id(a, sid)
 
#define get_id_info(a, sid, freq, len)   cl_id2all(a, sid, freq, len)
 
#define get_id_range(a)   cl_max_id(a)
 
#define get_id_string_len(a, id)   cl_id2strlen(a, id)
 
#define get_nr_of_strucs(a, nr)   cl_max_struc_oldstyle(a, nr)
 
#define get_num_of_struc(a, p, num)   cl_cpos2struc_oldstyle(a, p, num)
 
#define get_positions(a, id, freq, rl, rls)   cl_id2cpos_oldstyle(a, id, freq, rl, rls)
 
#define get_sortidxpos_of_id(a, id)   cl_id2sort(a, id)
 
#define get_string_at_position(a, cpos)   cl_cpos2str(a, cpos)
 
#define get_string_of_id(a, id)   cl_id2str(a, id)
 
#define get_struc_attribute(a, cpos, start, end)   cl_cpos2struc2cpos(a, cpos, start, end)
 
#define inverted_file_is_compressed(a)   cl_index_compressed(a)
 
#define item_sequence_is_compressed(a)   cl_sequence_compressed(a)
 
#define nr_of_arguments(a)   cl_dynamic_numargs(a)
 
#define setup_corpus(reg, name)   cl_new_corpus(reg, name)
 
#define structure_has_values(a)   cl_struc_values(a)
 
#define structure_value(a, struc)   cl_struc2str(a, struc)
 
#define structure_value_at_position(a, cpos)   cl_cpos2struc2str(a, cpos)
 
#define get_path_component   cl_path_get_component
 

Typedefs

typedef struct _cl_int_list * cl_int_list
 Automatically growing list of integers (just what you always need ...) More...
 
typedef struct _cl_string_list * cl_string_list
 Automatically growing list of strings (just what you always need ...) More...
 
typedef struct ClAutoString * ClAutoString
 A single-string object whose memory allocation grows automatically. More...
 
typedef struct TCorpus Corpus
 The Corpus object: contains information on a loaded corpus, including all its attributes. More...
 
typedef union _Attribute Attribute
 The Attribute object: an entire segment of a corpus, such as an annotation field, an XML structure, or a set. More...
 
typedef struct _DCR DynCallResult
 The DynCallResult object (needed to allocate space for dynamic function arguments) More...
 
typedef struct _position_stream_rec_ * PositionStream
 The PositionStream object: gives stream-like reading of an Attribute. More...
 
typedef struct TCorpusProperty * CorpusProperty
 The CorpusProperty object. More...
 
typedef enum ECorpusCharset CorpusCharset
 The CorpusCharset object: an identifier for one of the character sets supported by CWB. More...
 
typedef struct _CL_Regex * CL_Regex
 The CL_Regex object: an optimised regular expression. More...
 
typedef struct _cl_lexhash * cl_lexhash
 The cl_lexhash class (lexicon hashes, with IDs and frequency counts). More...
 
typedef struct _cl_lexhash_entry * cl_lexhash_entry
 Underlying structure for the cl_lexhash_entry class. More...
 
typedef struct _cl_ngram_hash * cl_ngram_hash
 The cl_ngram_hash class (hash-based frequency counts for n-grams, represented by n-tuples of integer type IDs). More...
 
typedef struct _cl_ngram_hash_entry * cl_ngram_hash_entry
 Underlying structure for the cl_ngram_hash_entry class. More...
 

Enumerations

enum  ECorpusCharset {
  ascii = 0, latin1, latin2, latin3,
  latin4, cyrillic, arabic, greek,
  hebrew, latin5, latin6, latin7,
  latin8, latin9, utf8, unknown_charset
}
 The CorpusCharset object: an identifier for one of the character sets supported by CWB. More...
 

Functions

void cl_error (char *message)
 Prints an error message, together with a string identifying the current error number. More...
 
char * cl_error_string (int error_num)
 Gets a string describing the error identified by an error number. More...
 
void * cl_malloc (size_t bytes)
 Safely allocates memory malloc-style. More...
 
void * cl_calloc (size_t nr_of_elements, size_t element_size)
 Safely allocates memory calloc-style. More...
 
void * cl_realloc (void *block, size_t bytes)
 Safely reallocates memory. More...
 
char * cl_strdup (const char *string)
 Safely duplicates a string. More...
 
cl_int_list cl_new_int_list (void)
 Creates a new cl_int_list object. More...
 
void cl_delete_int_list (cl_int_list l)
 Deletes a cl_int_list object. More...
 
void cl_int_list_lumpsize (cl_int_list l, int s)
 Sets the lumpsize of a cl_int_list object. More...
 
int cl_int_list_size (cl_int_list l)
 Gets the current size of a cl_int_list object (number of elements on the list). More...
 
int cl_int_list_get (cl_int_list l, int n)
 Retrieves an element from a cl_int_list object. More...
 
void cl_int_list_set (cl_int_list l, int n, int val)
 Sets an integer on a cl_int_list object. More...
 
void cl_int_list_append (cl_int_list l, int val)
 Appends an integer to the end of a cl_int_list object. More...
 
void cl_int_list_qsort (cl_int_list l)
 Sorts a cl_int_list object. More...
 
cl_string_list cl_new_string_list (void)
 Creates a new cl_string_list object. More...
 
void cl_delete_string_list (cl_string_list l)
 Deletes a cl_string_list object. More...
 
void cl_free_string_list (cl_string_list l)
 Frees all the strings in the cl_string_list object. More...
 
void cl_string_list_lumpsize (cl_string_list l, int s)
 Sets the lumpsize of a cl_string_list object. More...
 
int cl_string_list_size (cl_string_list l)
 Gets the current size of a cl_string_list object (number of elements on the list). More...
 
char * cl_string_list_get (cl_string_list l, int n)
 Retrieves an element from a cl_string_list object. More...
 
void cl_string_list_set (cl_string_list l, int n, char *val)
 Sets a string pointer on a cl_string_list object. More...
 
void cl_string_list_append (cl_string_list l, char *val)
 Appends a string pointer to the end of a cl_string_list object. More...
 
void cl_string_list_qsort (cl_string_list l)
 Sorts a cl_string_list object. More...
 
void cl_set_seed (unsigned int seed)
 Initialises the CL-internal random number generator. More...
 
void cl_randomize (void)
 Initialises the CL-internal random number generator from the current system time. More...
 
void cl_get_rng_state (unsigned int *i1, unsigned int *i2)
 Reads current state of CL-internal random number generator. More...
 
void cl_set_rng_state (unsigned int i1, unsigned int i2)
 Restores the state of the CL-internal random number generator. More...
 
unsigned int cl_random (void)
 Gets a random number. More...
 
double cl_runif (void)
 Gets a random number in the range [0,1] with uniform distribution. More...
 
void cl_set_debug_level (int level)
 Sets the debug level configuration variable. More...
 
void cl_set_optimize (int state)
 Turns optimization on or off. More...
 
void cl_set_memory_limit (int megabytes)
 Sets the memory limit respected by some CL functions. More...
 
ClAutoString cl_autostring_new (const char *data, size_t init_bytes)
 Creates a new autostring object. More...
 
void cl_autostring_delete (ClAutoString string)
 Delete an autostring object. More...
 
void cl_autostring_set_increment (ClAutoString string, size_t new_increment)
 Changes the increment size (measured in bytes). More...
 
char * cl_autostring_ptr (ClAutoString string)
 Get a pointer to the string data inside the AutoString (or NULL if the object is NULL). More...
 
size_t cl_autostring_len (ClAutoString string)
 Get the length of the currently-stored string (or negative value in case NULL object is passed). More...
 
void cl_autostring_reclaim_mem (ClAutoString string)
 Tries to free up unused memory by making the AutoString use only as many increments of size as necessary. More...
 
void cl_autostring_copy (ClAutoString dst, const char *src)
 Copy the string in src into the AutoString in dst, automatically reallocating memory if necessary. More...
 
void cl_autostring_concat (ClAutoString dst, const char *src)
 Concatenate the string src onto the end of the AutoString in dst, automatically reallocating memory if necessary. More...
 
void cl_autostring_truncate (ClAutoString string, int new_length)
 Truncates the AutoString to the length specified. More...
 
void cl_autostring_dump (ClAutoString string)
 Debug function: dumps the contents of an AutoString to stderr. More...
 
FILE * cl_open_stream (const char *filename, int mode, int type)
 Open stream of specified (or guessed) type for reading or writing. More...
 
int cl_close_stream (FILE *stream)
 Close I/O stream. More...
 
char * cl_strcpy (char *buf, const char *src)
 Replacement for strcpy that won't copy more than CL_MAX_LINE_LENGTH characters. More...
 
int cl_strcmp (char *s1, char *s2)
 CL internal string comparison (uses signed char on all platforms). More...
 
char * cl_string_latex2iso (char *str, char *result, int target_len)
 Converts ASCII strings with latex-style backslash escapes for accented characters to ISO-8859-1 (Latin-1). More...
 
char * cl_xml_entity_decode (char *s)
 Decode XML entities in a string. More...
 
void cl_path_adjust_os (char *path)
 Standardises subdirectory-dividers in a string that represents a path, in an OS-sensitive way. More...
 
void cl_path_adjust_independent (char *path)
 Standardises subdirectory-dividers in a string that represents a path into Unix-like form (i.e. with forward-slash), regardless of what OS we are in. More...
 
char * cl_path_registry_quote (char *path)
 Add quotes and escape slashes to a file path if necessary. More...
 
char * cl_path_get_component (char *s)
 Tokenises a string into components split by ':' (or ';' under Win32). More...
 
int cl_id_validate (char *s)
 Checks a string to see if it is a valid CWB identifier. More...
 
void cl_id_toupper (char *s)
 Converts a lowercase corpus name to an equivalent uppercase form. More...
 
void cl_id_tolower (char *s)
 Converts an uppercase corpus name to an equivalent lowercase form. More...
 
char * cl_make_set (char *s, int split)
 Generates a feature-set attribute value. More...
 
int cl_set_size (char *s)
 Counts the number of elements in a set attribute value. More...
 
int cl_set_intersection (char *result, const char *s1, const char *s2)
 Computes the intersection of two set attribute values. More...
 
Corpus * cl_new_corpus (char *registry_dir, char *registry_name)
 Creates a Corpus object to represent a given indexed corpus, located in a given directory accessible to the program. More...
 
int cl_delete_corpus (Corpus *corpus)
 Deletes a Corpus object from memory. More...
 
char * cl_standard_registry ()
 Gets a string containing the path of the default registry directory. More...
 
cl_string_list cl_corpus_list_attributes (Corpus *corpus, int attribute_type)
 Gets a list of the named attributes that this corpus possesses. More...
 
Attribute * cl_new_attribute_oldstyle (Corpus *corpus, char *attribute_name, int type, char *data)
 Finds an attribute that matches the specified parameters, if one exists, for the given corpus. More...
 
int cl_delete_attribute (Attribute *attribute)
 Deletes the specified Attribute object. More...
 
int cl_sequence_compressed (Attribute *attribute)
 Checks whether the item sequence of the given P-attribute is compressed. More...
 
int cl_index_compressed (Attribute *attribute)
 Check whether the reverse-corpus index (inverted file) of the given P-attribute is compressed. More...
 
Corpus * cl_attribute_mother_corpus (Attribute *attribute)
 Accessor function to get the mother corpus of the attribute. More...
 
char * cl_id2str (Attribute *attribute, int id)
 Gets the string that corresponds to the specified item on the given P-attribute. More...
 
int cl_str2id (Attribute *attribute, char *id_string)
 Gets the ID code that corresponds to the specified string on the given P-attribute. More...
 
int cl_id2strlen (Attribute *attribute, int id)
 Calculates the length of the string that corresponds to the specified item on the given P-attribute. More...
 
int cl_sort2id (Attribute *attribute, int sort_index_position)
 Gets the ID code of the item at the specified position in the Attribute's sorted wordlist index. More...
 
int cl_id2sort (Attribute *attribute, int id)
 Gets the position in the Attribute's sorted wordlist index of the item with the specified ID code. More...
 
int cl_max_cpos (Attribute *attribute)
 Gets the maximum position on this P-attribute (i.e. the size of the attribute). More...
 
int cl_max_id (Attribute *attribute)
 Gets the maximum id on this P-attribute (i.e. the range of the attribute's ID codes). More...
 
int cl_id2freq (Attribute *attribute, int id)
 Gets the frequency of an item on this attribute. More...
 
int * cl_id2cpos_oldstyle (Attribute *attribute, int id, int *freq, int *restrictor_list, int restrictor_list_size)
 Gets all the corpus positions where the specified item is found on the given P-attribute. More...
 
int cl_cpos2id (Attribute *attribute, int position)
 Gets the integer ID of the item at the specified position on the given p-attribute. More...
 
char * cl_cpos2str (Attribute *attribute, int position)
 Gets the string of the item at the specified position on the given p-attribute. More...
 
char * cl_id2all (Attribute *attribute, int index, int *freq, int *slen)
 Gets the string of the item with the specified ID on the given p-attribute. More...
 
int * cl_regex2id (Attribute *attribute, char *pattern, int flags, int *number_of_matches)
 Gets a list of the ids of those items on a given Attribute that match a particular regular-expression pattern. More...
 
int cl_idlist2freq (Attribute *attribute, int *ids, int number_of_ids)
 Calculates the total frequency of all items on a list of item IDs. More...
 
int * cl_idlist2cpos_oldstyle (Attribute *attribute, int *ids, int number_of_ids, int sort, int *size_of_table, int *restrictor_list, int restrictor_list_size)
 Gets a list of corpus positions matching a list of ids. More...
 
int cl_cpos2struc2cpos (Attribute *attribute, int position, int *struc_start, int *struc_end)
 Gets the start and end positions of the instance of the given S-attribute found at the specified corpus position. More...
 
int cl_cpos2struc (Attribute *a, int cpos)
 Gets the ID number of a structure (instance of an s-attribute) that is found at the given corpus position. More...
 
int cl_cpos2struc_oldstyle (Attribute *attribute, int position, int *struc_num)
 Gets the ID number of a structure (instance of an s-attribute) that is found at the given corpus position. More...
 
int cl_cpos2boundary (Attribute *a, int cpos)
 Compares the location of a corpus position to the regions of an s-attribute. More...
 
int cl_struc2cpos (Attribute *attribute, int struc_num, int *struc_start, int *struc_end)
 Retrieves the start-and-end corpus positions of a specified structure of the given s-attribute type. More...
 
int cl_max_struc (Attribute *a)
 Gets the maximum for this S-attribute (i.e. the size of the S-attribute). More...
 
int cl_max_struc_oldstyle (Attribute *attribute, int *nr_strucs)
 Gets the number of instances of an s-attribute in the corpus. More...
 
int cl_struc_values (Attribute *attribute)
 Checks whether this s-attribute has attribute values. More...
 
char * cl_struc2str (Attribute *attribute, int struc_num)
 Gets the value that is associated with the specified instance of the given s-attribute. More...
 
char * cl_cpos2struc2str (Attribute *attribute, int position)
 
int cl_has_extended_alignment (Attribute *attribute)
 Checks whether an attribute's XALIGN component exists, that is, whether or not it has extended alignment. More...
 
int cl_max_alg (Attribute *attribute)
 Gets the id number of alignments on this align-attribute. More...
 
int cl_cpos2alg (Attribute *attribute, int cpos)
 Gets the id number of the alignment at the specified corpus position. More...
 
int cl_alg2cpos (Attribute *attribute, int alg, int *source_region_start, int *source_region_end, int *target_region_start, int *target_region_end)
 Gets the corpus positions of an alignment on the given align-attribute. More...
 
int cl_cpos2alg2cpos_oldstyle (Attribute *attribute, int position, int *aligned_start, int *aligned_end, int *aligned_start2, int *aligned_end2)
 Gets the corpus positions of an alignment on the given align-attribute. More...
 
int cl_dynamic_call (Attribute *attribute, DynCallResult *dcr, DynCallResult *args, int nr_args)
 Calls a dynamic attribute. More...
 
int cl_dynamic_numargs (Attribute *attribute)
 Count the number of arguments on a dynamic attribute's argument list. More...
 
PositionStream cl_new_stream (Attribute *attribute, int id)
 Creates a new PositionStream object. More...
 
int cl_delete_stream (PositionStream *ps)
 Deletes a PositionStream object. More...
 
int cl_read_stream (PositionStream ps, int *buffer, int buffer_size)
 Reads corpus positions from a position stream to a buffer. More...
 
CorpusProperty cl_first_corpus_property (Corpus *corpus)
 Gets the first entry in this corpus's list of properties. More...
 
CorpusProperty cl_next_corpus_property (CorpusProperty p)
 Gets the next corpus property on the list of properties. More...
 
char * cl_corpus_property (Corpus *corpus, char *property)
 Gets the value of the specified corpus property. More...
 
CorpusCharset cl_corpus_charset (Corpus *corpus)
 Retrieves the special 'charset' property from a Corpus object. More...
 
char * cl_charset_name (CorpusCharset id)
 Gets a string containing the name of the specified CorpusCharset character set object. More...
 
CorpusCharset cl_charset_from_name (char *name)
 Gets a CorpusCharset enumeration with the id code for the given string. More...
 
char * cl_charset_name_canonical (char *name_to_check)
 Checks whether a string represents a valid charset, and returns a pointer to the name in canonical form (i.e. lacking any non-standard case there may be in the input string). More...
 
size_t cl_charset_strlen (CorpusCharset charset, char *s)
 
void cl_string_canonical (char *s, CorpusCharset charset, int flags)
 Converts a string to canonical form. More...
 
int cl_string_zap_controls (char *s, CorpusCharset charset, char replace, int zap_tabs, int zap_newlines)
 Replaces any invalid control characters in a string. More...
 
int cl_string_utf8_continuation_byte (unsigned char byte)
 Checks whether a given byte is a UTF-8 continuation byte. More...
 
int cl_string_validate_encoding (char *s, CorpusCharset charset, int repair)
 Checks the encoding of a string. More...
 
char * cl_string_reverse (const char *s, CorpusCharset charset)
 Creates a "backwards" version of the specified string. More...
 
int cl_string_qsort_compare (const char *s1, const char *s2, CorpusCharset charset, int flags, int reverse)
 Compares two strings in a qsort-style. More...
 
CL_Regex cl_new_regex (char *regex, int flags, CorpusCharset charset)
 Create a new CL_Regex object (i.e. a regular expression buffer). More...
 
int cl_regex_optimised (CL_Regex rx)
 Finds the level of optimisation of a CL_Regex. More...
 
int cl_regex_match (CL_Regex rx, char *str, int normalize_utf8)
 Matches a regular expression against a string. More...
 
void cl_delete_regex (CL_Regex rx)
 Deletes a CL_Regex object, and frees all resources associated with the pre-compiled regex. More...
 
void cl_regopt_count_reset (void)
 Reset the "success counter" for optimised regexes. More...
 
int cl_regopt_count_get (void)
 Get a reading from the "success counter" for optimised regexes. More...
 
cl_lexhash cl_new_lexhash (int buckets)
 Creates a new cl_lexhash object. More...
 
void cl_delete_lexhash (cl_lexhash lh)
 Deletes a cl_lexhash object. More...
 
void cl_lexhash_set_cleanup_function (cl_lexhash lh, void(*func)(cl_lexhash_entry))
 
void cl_lexhash_auto_grow (cl_lexhash lh, int flag)
 Turns a cl_lexhash's ability to auto-grow on or off. More...
 
void cl_lexhash_auto_grow_fillrate (cl_lexhash lh, double limit, double target)
 Configure auto-grow parameters. More...
 
cl_lexhash_entry cl_lexhash_add (cl_lexhash lh, char *token)
 Adds a token to a cl_lexhash table. More...
 
cl_lexhash_entry cl_lexhash_find (cl_lexhash lh, char *token)
 Finds the entry corresponding to a particular string within a cl_lexhash. More...
 
int cl_lexhash_id (cl_lexhash lh, char *token)
 Gets the ID of a particular string within a lexhash. More...
 
int cl_lexhash_freq (cl_lexhash lh, char *token)
 Gets the frequency of a particular string within a lexhash. More...
 
int cl_lexhash_del (cl_lexhash lh, char *token)
 Deletes a string from a hash. More...
 
int cl_lexhash_size (cl_lexhash lh)
 Gets the number of different strings stored in a lexhash. More...
 
cl_ngram_hash cl_new_ngram_hash (int N, int buckets)
 Creates a new cl_ngram_hash object. More...
 
void cl_delete_ngram_hash (cl_ngram_hash hash)
 Deletes a cl_ngram_hash object. More...
 
void cl_ngram_hash_auto_grow (cl_ngram_hash hash, int flag)
 Turns a cl_ngram_hash's ability to auto-grow on or off. More...
 
void cl_ngram_hash_auto_grow_fillrate (cl_ngram_hash hash, double limit, double target)
 Configure auto-grow parameters. More...
 
cl_ngram_hash_entry cl_ngram_hash_add (cl_ngram_hash hash, int *ngram, unsigned int f)
 Adds an n-gram to a cl_ngram_hash table. More...
 
cl_ngram_hash_entry cl_ngram_hash_find (cl_ngram_hash hash, int *ngram)
 Finds the entry corresponding to a particular n-gram within a cl_ngram_hash. More...
 
int cl_ngram_hash_del (cl_ngram_hash hash, int *ngram)
 Deletes an n-gram from a hash. More...
 
int cl_ngram_hash_freq (cl_ngram_hash hash, int *ngram)
 Gets the frequency of a particular n-gram within a cl_ngram_hash. More...
 
int cl_ngram_hash_size (cl_ngram_hash hash)
 Gets the number of distinct n-grams stored in a cl_ngram_hash. More...
 
cl_ngram_hash_entry * cl_ngram_hash_get_entries (cl_ngram_hash hash, int *ret_size)
 Returns allocated vector of pointers to all entries of the n-gram hash. More...
 
void cl_ngram_hash_iterator_reset (cl_ngram_hash hash)
 Simple iterator for the entries of an n-gram hash. More...
 
cl_ngram_hash_entry cl_ngram_hash_iterator_next (cl_ngram_hash hash)
 Iterate over all entries in an n-gram hash. More...
 
int * cl_ngram_hash_stats (cl_ngram_hash hash, int max_n)
 Statistics on bucket fill rates for debugging purposes. More...
 
void cl_ngram_hash_print_stats (cl_ngram_hash hash, int max_n)
 Display statistics on bucket fill rates (for debugging and optimization). More...
 

Variables

int cl_errno
 Error number for CL: is set after access to any of various corpus-data-access functions. More...
 
int cl_broken_pipe
 This variable will be set to True if a SIGPIPE has been caught and ignored. More...
 
int cl_allow_latex2iso
 Boolean switch enabling/disabling latex-style escapes. More...
 
char cl_regex_error []
 Error messages from (PCRE) regex compilation are placed in this buffer if cl_new_regex() fails. More...
 

Detailed Description

This file contains the API for the CWB "Corpus Library" (CL).

If you are programming against the CL, you should #include ONLY this header file, and make use of ONLY the functions declared here.

Other functions in the CL should ONLY be used within the CWB itself by CWB developers.

The header file is laid out in such a way as to semi-document the API, i.e. function prototypes are given with brief notes on usage, parameters, and return values. You may also wish to refer to CWB's automatically-generated HTML code documentation (created using the Doxygen system; if you're reading this text in a web browser, then the auto-generated documentation is almost certainly what you're looking at). However, please note that the auto-generated documentation ALSO covers (a) functions internal to the CL which should NOT be used when programming against it; (b) functions from the CWB utilities and from the CQP program - neither of which are part of the CL. There is also no distinction in that more extensive documentation between information that is relevant to programming against the CL API and information that is relevant to developers working on the CL itself. Caveat lector.

Note that many functions have two names – one that follows the standardised format "cl_do_something()", and another that follows no particular pattern. The former are the "new API" (in v3.0.0 or higher of CWB) and the latter are the "old-style" API (deprecated, but supported for backward compatibility). The old-style function names SHOULD NOT be used in newly-written code. Such double names mostly exist for the core data-access functions (i.e. for the Corpus and (especially) Attribute objects).

In v3.0 and v3.1 of CWB, the new API was implemented as macros to the old API. As of v3.2, the old API is implemented as macros to the new API.

In a very few cases, the parameter list or return behaviour of a function also changed. In this case, a function with the "old" parameter list is preserved (but deprecated) and has the same name as the new function but with the suffix "_oldstyle". The old names are then re-implemented as macros to the _oldstyle functions. While these functions and the macros to them will remain in the public API for backwards compatibility, they are deprecated and should not be used in new code.
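The aliasing scheme described above can be sketched roughly as follows. All demo_* names are hypothetical stand-ins, not CL functions; only the shape is meant to carry over: an old name becomes a macro over the new function, and a changed parameter list survives in an _oldstyle function that a short macro calls with default arguments.

```c
#include <assert.h>
#include <stddef.h>

/* "New API" name: the real implementation (hypothetical stand-in). */
static int demo_cl_double(int x)
{
  return 2 * x;
}

/* "Old API" name: kept only as a macro alias for backward compatibility. */
#define demo_get_double(x) demo_cl_double(x)

/* A function whose parameter list changed keeps the old list under an
 * "_oldstyle" suffix; the two trailing parameters are the legacy ones. */
static int demo_cl_sum_oldstyle(int a, int b, const int *extra, int n_extra)
{
  int total = a + b;
  int i;
  for (i = 0; extra != NULL && i < n_extra; i++)
    total += extra[i];
  return total;
}

/* The short name supplies neutral defaults for the legacy parameters, in the
 * same way that cl_id2cpos(a, id, freq) expands to
 * cl_id2cpos_oldstyle(a, id, freq, NULL, 0) in this header. */
#define demo_cl_sum(a, b) demo_cl_sum_oldstyle(a, b, NULL, 0)
```

Because the aliases are macros, they cost nothing at run time, but they also cannot be used where a function pointer is required.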

The CL header is organised to reflect the conceptual structure of the library. While it is not fully "object-oriented" in style, most of the functions are organised around a small number of data objects that represent real entities in a CWB-encoded corpus. Each object is defined as an opaque type (usually a structure whose members are PRIVATE and should only be accessed via the functions provided in the CL API).

CONTENTS LIST FOR THIS HEADER FILE:

SECTION 1 CL UTILITIES

1.1 ERROR HANDLING

1.2 MEMORY MANAGEMENT

1.3 DATA LIST CLASSES: cl_string_list AND cl_int_list

1.4 INTERNAL RANDOM NUMBER GENERATOR

1.5 SETTING CL CONFIG VARIABLES

1.6 CONSTANTS

1.7 MISCELLANEOUS UTILITIES

SECTION 2 THE CORE CL LIBRARY (DATA ACCESS)

2.1 THE Corpus OBJECT

2.2 THE Attribute OBJECT

2.3 THE PositionStream OBJECT

SECTION 3 SUPPORT CLASSES

3.1 THE CorpusProperty OBJECT

3.2 THE CorpusCharset OBJECT

3.3 THE CL_Regex OBJECT

3.4 THE cl_lexhash OBJECT

3.5 THE cl_ngram_hash OBJECT

SECTION 4 THE OLD CL API

(If you're looking at the auto-generated HTML documentation, this contents list, which describes the structure of the actual "cl.h" header file, is wrong for you - instead, use the index of links (above) to find the object or function you are interested in.)

We hope you enjoy using the CL!

best regards from

The CWB Development Team

http://cwb.sourceforge.net

Macro Definition Documentation

#define ATT_ALIGN   (1<<2)
#define ATT_ALL   ( ATT_POS | ATT_STRUC | ATT_ALIGN | ATT_DYN )

shorthand for "any / all types of attribute"

#define ATT_DYN   (1<<6)

Dynamic attributes: a deprecated feature, but its datatypes are still used for some CQP function parameters/returns.

Referenced by aid_name(), cl_delete_attribute(), cl_dynamic_call(), cl_dynamic_numargs(), decode_print_token_sequence(), describe_attribute(), and FunctionCall().

#define ATT_NONE   0
#define ATT_POS   (1<<0)
#define ATT_REAL   ( ATT_POS | ATT_STRUC | ATT_ALIGN )

shorthand for "any / all types of attribute except dynamic"
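Since each attribute type is a single bit, ATT_ALL and ATT_REAL work as masks. The sketch below re-declares the values listed in this reference and tests a type against a mask; demo_type_matches() is a hypothetical helper, not part of the CL.

```c
#include <assert.h>

/* Values as listed in this reference. */
#define ATT_NONE   0
#define ATT_POS    (1<<0)
#define ATT_STRUC  (1<<1)
#define ATT_ALIGN  (1<<2)
#define ATT_DYN    (1<<6)
#define ATT_ALL    ( ATT_POS | ATT_STRUC | ATT_ALIGN | ATT_DYN )
#define ATT_REAL   ( ATT_POS | ATT_STRUC | ATT_ALIGN )

/* Hypothetical helper: does the single attribute type `type` fall within
 * the set of types described by the mask `wanted`? */
static int demo_type_matches(int type, int wanted)
{
  return (type & wanted) != 0;
}
```

With these definitions, a dynamic attribute matches ATT_ALL but not ATT_REAL, which is the distinction the two shorthands exist to express.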

#define ATT_STRUC   (1<<1)
#define ATTAT_FLOAT   5

Dynamic att argument type: floating point.

Referenced by argid_name(), attat_name(), cl_dynamic_call(), eval_bool(), get_leaf_value(), and makearg().

#define ATTAT_INT   3

Dynamic att argument type: integer.

Referenced by argid_name(), attat_name(), call_predefined_function(), cl_dynamic_call(), eval_bool(), get_leaf_value(), and makearg().

#define ATTAT_NONE   0

Dynamic att argument type: none.

Referenced by argid_name(), attat_name(), call_predefined_function(), cl_dynamic_call(), eval_bool(), and get_leaf_value().

#define ATTAT_PAREF   6
#define ATTAT_POS   1
#define ATTAT_STRING   2
#define ATTAT_VAR   4

Dynamic att argument type: variable number of string arguments (only in arglist)

Referenced by argid_name(), attat_name(), cl_dynamic_call(), cl_dynamic_numargs(), eval_bool(), and makearg().

#define attr_drop_attribute (   a)    cl_delete_attribute(a)
#define call_dynamic_attribute (   a,
  dcr,
  args,
  nr_args 
)    cl_dynamic_call(a, dcr, args, nr_args)

Referenced by get_leaf_value().

#define CDA_CPOSUNDEF   INT_MIN

Error code: undefined corpus position (use this code to avoid ambiguity with negative cpos)

Referenced by print_tabulation(), and pt_get_anchor_cpos().

#define CDA_EACCESS   -20

Error code: insufficient access permissions.

Referenced by cl_error_string(), and cl_open_stream().

#define CDA_EALIGN   -9

Error code: no alignment at position.

Referenced by cl_cpos2alg(), cl_error_string(), and get_extended_alignment().

#define CDA_EARGS   -12

Error code: error in arguments for dynamic call or CL function.

Referenced by cl_dynamic_call(), cl_error_string(), and cl_open_stream().

#define CDA_EATTTYPE   -2

Error code: function was called on illegal attribute.

Referenced by cl_close_stream(), cl_error_string(), and send_cl_error().

#define CDA_EBADREGEX   -16

Error code: bad regular expression.

Referenced by cl_error_string(), cl_new_regex(), cl_regex2id(), and send_cl_error().

#define CDA_EBUFFER   -18

Error code: buffer overflow (hard-coded internal buffer sizes)

Referenced by cl_error_string(), cl_open_stream(), and cl_set_intersection().

#define CDA_EFSETINV   -17

Error code: invalid feature set format.

Referenced by cl_error_string(), cl_make_set(), cl_set_intersection(), and cl_set_size().

#define CDA_EIDORNG   -3
#define CDA_EIDXORNG   -5
#define CDA_EINTERNAL   -19

Error code: internal data consistency error (really bad)

Referenced by cl_error_string(), and cl_struc2str().

#define CDA_ENODATA   -11
#define CDA_ENOMEM   -13

Error code: memory fault [unused].

Referenced by cl_error_string(), and send_cl_error().

#define CDA_ENOSTRING   -6

Error code: no such string encoded.

Referenced by cl_error_string(), and cl_str2id().

#define CDA_ENULLATT   -1

Error code: NULL passed as attribute argument.

Referenced by cl_error_string().

#define CDA_ENYI   -15

Error code: not yet implemented.

Referenced by cl_error_string(), cl_id2sort(), and send_cl_error().

#define CDA_EOTHER   -14

Error code: other error.

Referenced by cl_error_string(), cl_id2strlen(), cl_str2id(), and send_cl_error().

#define CDA_EPATTERN   -7

Error code: illegal pattern.

Referenced by cl_error_string(), and send_cl_error().

#define CDA_EPOSIX   -21

Error code: POSIX-level error: check errno or perror()

Referenced by cl_close_stream(), cl_error_string(), and cl_open_stream().

#define CDA_EPOSORNG   -4
#define CDA_EREMOTE   -10

Error code: error in remote access.

Referenced by cl_error_string().

#define CDA_ESTRUC   -8

Error code: no structure at position.

Referenced by cl_cpos2boundary(), cl_cpos2struc2cpos(), cl_cpos2struc_oldstyle(), and cl_error_string().
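Because CDA_OK is 0 and every actual error value is negative, a caller can separate success from failure with one comparison. The sketch below re-declares a few of the codes from this reference; demo_is_error() and demo_error_name() are hypothetical stand-ins (the library's own lookup is cl_error_string(), whose exact messages are not reproduced here).

```c
#include <assert.h>
#include <string.h>

/* A subset of the error codes, with the values listed in this reference. */
#define CDA_OK         0
#define CDA_ENULLATT  -1
#define CDA_ESTRUC    -8
#define CDA_EACCESS  -20
#define CDA_EPOSIX   -21

/* Actual error values are all less than 0. */
static int demo_is_error(int code)
{
  return code < 0;
}

/* Hypothetical lookup; the descriptions are the ones given in this reference. */
static const char *demo_error_name(int code)
{
  switch (code) {
    case CDA_OK:       return "everything is fine";
    case CDA_ENULLATT: return "NULL passed as attribute argument";
    case CDA_ESTRUC:   return "no structure at position";
    case CDA_EACCESS:  return "insufficient access permissions";
    case CDA_EPOSIX:   return "POSIX-level error: check errno or perror()";
    default:           return "unknown error code";
  }
}
```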

#define CDA_OK   0
#define cderrno   cl_errno
#define cdperror (   message)    cl_error(message)

Referenced by compute_code_lengths().

#define cdperror_string (   no)    cl_error_string(no)
#define central_corpus_directory ( )    cl_standard_registry()

Referenced by main().

#define CHARSET_FOR_IDENTIFIERS   ascii

"Dummy" charset macro for calling cl_string_canonical

We have a problem - CorpusCharsets are attached to corpora. So what charset do we use with cl_string_canonical if we are calling it on a string that does not (yet) have a corpus?

The answer: CHARSET_FOR_IDENTIFIERS. This should only be used as the 2nd argument to cl_string_canonical when the string is an identifier for a corpus, attribute, or whatever.

Note that it is ASCII in v3.2.x+, breaking backwards compatibility with 2.2.x, where Latin1 was allowed for identifiers.

#define CL_DYN_STRING_SIZE   2048

maximum size of 'dynamic' strings

Referenced by call_predefined_function(), and cl_set_intersection().

#define cl_free (   p)    do { if ((p) != NULL) { free(p); p = NULL; } } while (0)

Safely frees memory.

See also
cl_malloc
Parameters
p   Pointer to memory to be freed.

Referenced by add_hosts_in_subnet_to_list(), after_Query(), assign_temp_to_sub(), attach_subcorpus(), cl_autostring_delete(), cl_close_stream(), cl_delete_attribute(), cl_delete_corpus(), cl_delete_int_list(), cl_delete_lexhash(), cl_delete_lexhash_entry(), cl_delete_ngram_hash(), cl_delete_regex(), cl_delete_string_list(), cl_free_string_list(), cl_id2cpos_oldstyle(), cl_lexhash_check_grow(), cl_make_set(), cl_new_corpus(), cl_new_regex(), cl_ngram_hash_check_grow(), cl_ngram_hash_del(), cl_ngram_hash_print_stats(), cl_regex2id(), cl_string_canonical(), cl_string_qsort_compare(), comp_drop_component(), compute_code_lengths(), context_descriptor_reset_left_context(), context_descriptor_reset_right_context(), creat_rev_corpus(), create_feature_maps(), cwbci_check_line(), delete_interval(), delete_intervals(), DestroyAttributeList(), do_AddSubVariables(), do_cqi_cl_cpos2lbound(), do_cqi_cl_cpos2rbound(), do_cqi_cl_cpos2struc(), do_cqi_cqp_fdist_1(), do_cqi_cqp_fdist_2(), do_flagged_re_variable(), do_IDReference(), do_LabelReference(), do_SearchPattern(), do_StandardQuery(), do_translate(), do_undump(), do_XMLTag(), drop_mapping(), drop_single_mapping(), DropVariable(), encode_add_wattr_line(), encode_generate_registry_file(), encode_scan_directory(), evaltree2searchstr(), evaluate_target(), execute_side_effects(), expand_macro(), free_booltree(), free_environment(), free_group(), free_matchlist(), free_tabulation_list(), FreeIDList(), FreeSortClause(), get_fulllocalpath(), get_fvector(), get_matched_corpus_positions(), initialize_cl(), initialize_cqp(), load_macro_file(), MacroHashDelete(), main(), mallocfile(), matchfirstpattern(), meet_mu(), mfree(), open_input_stream(), open_pager(), open_temporary_file(), OptimizeStringConstraint(), print_tabulation(), printAlignedStrings(), range_close(), range_declare(), range_open(), RangeSetop(), RangeSort(), RecomputeAL(), RemoveNameFromAL(), sencode_check_set(), sencode_parse_line(), set_context_option_value(), 
set_corpus_matchlists(), set_target(), Setop(), SL_delete(), SortExternally(), SortSubcorpus(), SortSubcorpusRandomize(), split_subcorpus_spec(), Unchain(), validate_revcorp(), VariableDeleteItems(), VariableSubtractItem(), VerifyList(), and VerifyVariable().
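The practical effect of cl_free is that the pointer is reset to NULL after the memory is released, so a repeated call is harmless. A small sketch using the macro body shown above; demo_release_twice() is a hypothetical helper for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Macro body as given in this reference. */
#define cl_free(p) do { if ((p) != NULL) { free(p); p = NULL; } } while (0)

/* Hypothetical helper: allocate a block, release it twice, and report
 * whether the pointer ended up NULL. */
static int demo_release_twice(void)
{
  char *buf = malloc(16);
  if (buf == NULL)
    return 0;
  cl_free(buf);   /* frees the block and sets buf to NULL */
  cl_free(buf);   /* no-op: buf is already NULL, so free() is not called */
  return buf == NULL;
}
```

Since the macro assigns to its argument, the argument has to be a modifiable lvalue (a pointer variable or member), not the result of an expression or function call.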

#define cl_id2cpos (   a,
  id,
  freq 
)    cl_id2cpos_oldstyle(a, id, freq, NULL, 0)

Gets all the corpus positions where the specified item is found on the given P-attribute.

See also
cl_id2cpos_oldstyle
Parameters
a   The P-attribute to look on.
id   The id of the item to look for.
freq   The frequency of the specified item is written here. This will be 0 in the case of errors.

Referenced by do_cqi_cl_id2cpos().

#define cl_idlist2cpos (   a,
  idlist,
  idlist_size,
  sort,
  size 
)    cl_idlist2cpos_oldstyle(a, idlist, idlist_size, sort, size, NULL, 0)

Gets a list of corpus positions matching a list of ids.

See also
cl_idlist2cpos_oldstyle
Parameters
a   The P-attribute we are looking in.
idlist   A list of item ids (i.e. id codes for items on this attribute).
idlist_size   The length of this list.
sort   Boolean: return sorted list?
size   The size of the allocated table will be placed here.

Referenced by do_cqi_cl_idlist2cpos(), and get_corpus_positions().

#define CL_MAX_CORPUS_SIZE   2147483647

Maximum size of a CWB corpus.

This is the upper limit on the size of a CWB corpus on 64-bit platforms; for 32-bit versions of CWB, much tighter limits apply. cwb-encode will abort once this limit has been reached, discarding any further input data. The precise value of the limit is 2^31 - 1 tokens, i.e. hex 0x7FFFFFFF and decimal 2147483647.
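As a quick arithmetic check, the decimal value 2147483647 and the hex value 0x7FFFFFFF both equal 2^31 - 1, the largest value of a signed 32-bit integer. The helper name below is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define CL_MAX_CORPUS_SIZE 2147483647

/* Compute 2^31 - 1 in 64-bit arithmetic so the shift cannot overflow. */
static int64_t demo_two_pow_31_minus_1(void)
{
  return ((int64_t)1 << 31) - 1;
}
```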

Referenced by main().

#define CL_MAX_FILENAME_LENGTH   1024

String buffer size constant (for filenames).

This constant can be used for declaring character arrays that will only contain a filename (or path). It is expected that this will be shorter than CL_MAX_LINE_LENGTH.

Referenced by attach_subcorpus(), check_stamp(), cl_open_stream(), compress_reversed_index(), decompress_check_reversed_index(), ensure_corpus_size(), expand_filename(), get_fulllocalpath(), initialize_cqp(), load_corpusnames(), main(), open_file(), and save_subcorpus().

#define CL_MAX_LINE_LENGTH   4096

General string buffer size constant.

This constant is used to determine the maximum length (in bytes) of a line in a CWB input file. It therefore follows that no s-attribute or p-attribute can ever be longer than this. It's also the normal constant to use for (a) a local or global declaration of a character array (b) dynamic memory allocation of a string buffer. The associated function cl_strcpy() will copy this many bytes at most.
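A minimal sketch of the buffer convention described above. demo_bounded_copy() is a hypothetical stand-in written with strncpy(); it is not the library's cl_strcpy(), and whether the real function counts the terminating NUL within the limit in the same way is an assumption.

```c
#include <assert.h>
#include <string.h>

#define CL_MAX_LINE_LENGTH 4096

/* Hypothetical bounded copy: writes at most CL_MAX_LINE_LENGTH bytes into
 * dst (terminating NUL included) and always NUL-terminates the result. */
static char *demo_bounded_copy(char *dst, const char *src)
{
  strncpy(dst, src, CL_MAX_LINE_LENGTH - 1);
  dst[CL_MAX_LINE_LENGTH - 1] = '\0';
  return dst;
}

/* Hypothetical helper: copy into a local buffer declared with the standard
 * size constant and return the length of what was actually stored. */
static size_t demo_copied_length(const char *src)
{
  char line[CL_MAX_LINE_LENGTH];
  demo_bounded_copy(line, src);
  return strlen(line);
}
```

Input longer than the buffer is silently truncated in this sketch; a caller that needs to detect truncation would have to compare lengths itself.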

Referenced by alignshow_print_next_region(), alignshow_skip_next_region(), cl_autostring_new(), cl_dynamic_call(), cl_new_regex(), cl_strcpy(), cl_string_qsort_compare(), component_full_name(), compute_code_lengths(), ComputeGroupExternally(), corpus_info(), create_feature_maps(), decode_check_huff(), decode_string_escape(), do_undump(), encode_add_wattr_line(), expand_filename(), find_corpus_registry(), findcorpus(), get_next_range(), get_position_values(), get_print_attribute_values(), html_convert_string(), latex_convert_string(), lexdecode_show(), load_corpusnames(), main(), ParsePrintOptions(), process_fd(), push_regchr(), range_close(), range_declare(), read_mapping(), regopt_data_copy_to_regex_object(), scancorpus_add_key(), sencode_open_files(), SetVariableValue(), sgml_convert_string(), SortExternally(), and wattr_declare().

#define cl_new_attribute (   c,
  name,
  type 
)    cl_new_attribute_oldstyle(c, name, type, NULL)

Finds an attribute that matches the specified parameters, if one exists, for the given corpus.

Note that although this is a cl_new_* function, and it is the canonical way that we get an Attribute to call Attribute-functions on, it doesn't actually create any kind of object. The Attribute exists already as one of the dependents of the Corpus object; this function simply locates it and returns a pointer to it.

This "function" is implemented as a macro wrapped round the deprecated function, making the means of calling it more in line with the rest of the CL.

See also
cl_new_attribute_oldstyle
Parameters
corpus   The corpus in which to search for the attribute.
attribute_name   The name of the attribute (i.e. the handle it has in the registry file).
type   Type of attribute to be searched for.
Returns
Pointer to Attribute object, or NULL if not found.

Referenced by cqi_lookup_attribute(), describecorpus_show_basic_info(), do_XMLTag(), lexdecode_show(), main(), print_tabulation(), scancorpus_add_key(), and setup_attribute().

#define CL_STREAM_APPEND   2

open in append mode (except for pipe)

Referenced by cl_open_stream(), and open_stream().

#define CL_STREAM_BZIP2   4

read/write bzip2-compressed file

Referenced by cl_close_stream(), and cl_open_stream().

#define CL_STREAM_FILE   2

read/write plain uncompressed file

Referenced by cl_close_stream(), and cl_open_stream().

#define CL_STREAM_GZIP   3

read/write gzip-compressed file

Referenced by cl_close_stream(), and cl_open_stream().

#define CL_STREAM_MAGIC   0
#define CL_STREAM_MAGIC_NOPIPE   1

enable automagic, but fail on attempt to open pipe (safe mode for filenames from external sources)

Referenced by cl_open_stream(), open_input_stream(), open_stream(), and SetVariableValue().

#define CL_STREAM_PIPE   5

read/write pipe to shell command

Referenced by cl_close_stream(), cl_open_stream(), open_input_stream(), and open_pager().

#define CL_STREAM_READ   0

I/O streams with magic for compressed files (.gz, .bz2) and pipes.

These functions can be used to open input and output FILE* streams to compressed files, pipes, and stdin/stdout. The type of stream is either specified directly with one of the constants below, or automagically guessed from the filename, according to the following rules:

  • if filename is "-", the stream reads from stdin or writes to stdout (depending on the mode)
  • if filename starts with "|", it is interpreted as a pipe to/from a shell command (the pipe symbol must always be at the start, even when reading from a pipe)
  • if filename ends in ".gz" or ".bz2", it is read/written as a compressed file (through a pipe to external gzip and bzip2 utilities)
  • otherwise it is read/written as a plain uncompressed file
  • if filename starts with "~/" or "$HOME/", the prefix is expanded to the current user's home directory

Unless automagic type guessing is explicitly enabled, filenames will always be used literally without any normalization. Read or write mode is controlled by a separate flag and cannot be set automatically (unlike Perl's "redirect"-style notation).

Note that automagic opening of pipes to shell commands is a security risk if <filename> comes from an untrusted source. Use stream type CL_STREAM_MAGIC_NOPIPE to disallow pipes; opening the I/O stream will fail in this case.
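The filename-guessing rules listed above can be sketched as a small classifier. This is an illustrative re-implementation, not the code inside cl_open_stream(): demo_guess_stream_type() and demo_ends_with() are hypothetical, home-directory expansion is left out, and returning -1 for a refused pipe is an assumption about how failure might be signalled.

```c
#include <assert.h>
#include <string.h>

/* Stream type values as listed in this reference. */
#define CL_STREAM_MAGIC         0
#define CL_STREAM_MAGIC_NOPIPE  1
#define CL_STREAM_FILE          2
#define CL_STREAM_GZIP          3
#define CL_STREAM_BZIP2         4
#define CL_STREAM_PIPE          5
#define CL_STREAM_STDIO         6

/* Hypothetical helper: true if s ends with the given suffix. */
static int demo_ends_with(const char *s, const char *suffix)
{
  size_t n = strlen(s), m = strlen(suffix);
  return n >= m && strcmp(s + n - m, suffix) == 0;
}

/* Hypothetical classifier: maps a filename to a concrete stream type.
 * `magic` is CL_STREAM_MAGIC or CL_STREAM_MAGIC_NOPIPE; under the latter,
 * a pipe request is refused and -1 is returned. */
static int demo_guess_stream_type(const char *filename, int magic)
{
  if (strcmp(filename, "-") == 0)
    return CL_STREAM_STDIO;                 /* stdin or stdout */
  if (filename[0] == '|')
    return (magic == CL_STREAM_MAGIC_NOPIPE) ? -1 : CL_STREAM_PIPE;
  if (demo_ends_with(filename, ".gz"))
    return CL_STREAM_GZIP;
  if (demo_ends_with(filename, ".bz2"))
    return CL_STREAM_BZIP2;
  return CL_STREAM_FILE;                    /* plain uncompressed file */
}
```

Checking the pipe prefix before the compression suffixes means a name such as "| gzip -cd data.gz" is treated as a shell command rather than as a compressed file.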

While a stream pipe is active (even if implicitly by reading or writing a compressed file), a signal handler is installed on supported platforms to catch and ignore SIGPIPE, which sets the global variable <cl_broken_pipe> to True. Callers writing to a stream might want to check this variable in order to avoid stalling on a broken pipe, even if they did not explicitly open a pipe stream.

Mode and type flags for I/O streams (NB: these are mutually exclusive and must not be combined with "|")

open in read mode
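The SIGPIPE handling described above follows a common pattern: a handler that does nothing except set a flag which the writer polls. A minimal sketch for POSIX platforms; demo_broken_pipe and both functions are hypothetical stand-ins, and the way the CL itself installs its handler may differ.

```c
#include <assert.h>
#include <signal.h>

/* Hypothetical stand-in for the global cl_broken_pipe flag. */
static volatile sig_atomic_t demo_broken_pipe = 0;

/* The handler only records that the signal arrived; the signal itself is
 * otherwise ignored, so the process is not terminated. */
static void demo_sigpipe_handler(int signum)
{
  (void)signum;
  demo_broken_pipe = 1;
}

/* Install the handler; returns 0 on success, -1 on failure. */
static int demo_install_sigpipe_handler(void)
{
  return (signal(SIGPIPE, demo_sigpipe_handler) == SIG_ERR) ? -1 : 0;
}
```

A writer would then test the flag between output operations and stop producing data once it is set.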

Referenced by cl_open_stream(), encode_get_input_line(), lexdecode_show(), main(), open_input_stream(), sencode_parse_options(), and SetVariableValue().

#define CL_STREAM_STDIO   6

read from stdin or write to stdout (<filename> is ignored)

Referenced by cl_close_stream(), cl_open_stream(), open_input_stream(), open_stream(), and sencode_parse_options().

#define CL_STREAM_WRITE   1

open in write mode

Referenced by cl_open_stream(), main(), open_pager(), and open_stream().

#define cl_xml_is_name_char (   c)
Value:
( ( c >= 'A' && c <= 'Z') || \
( c >= 'a' && c <= 'z') || \
( c >= '0' && c <= '9') || \
( (unsigned char) c >= 0x80 \
/* && (unsigned char) c <= 0xff */ \
) || \
( c == '-') || \
( c == '_') \
)

For a given character, say whether it is legal for an XML name.

TODO: Currently, anything in the upper half of the 8-bit range is allowed (in the old Latin1 days this was anything from 0xa0 to 0xff). This will work with any non-ascii character set, but is almost certainly too lax.

Parameters
c   Character to check. (It is expected to be a char, so is typecast to unsigned char for comparison with upper-128 hex values.)
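A short usage sketch for cl_xml_is_name_char, applying it to every byte of a candidate name. The macro is reproduced from the value shown above; demo_is_valid_xml_name() is a hypothetical helper, not a CL function.

```c
#include <assert.h>
#include <stddef.h>

/* Macro as given in this reference. */
#define cl_xml_is_name_char(c) \
  ( ( c >= 'A' && c <= 'Z') || \
    ( c >= 'a' && c <= 'z') || \
    ( c >= '0' && c <= '9') || \
    ( (unsigned char) c >= 0x80 \
      /* && (unsigned char) c <= 0xff */ \
    ) || \
    ( c == '-') || \
    ( c == '_') \
  )

/* Hypothetical helper: true if name is non-empty and every byte in it is
 * accepted by the macro. */
static int demo_is_valid_xml_name(const char *name)
{
  const char *p;
  if (name == NULL || *name == '\0')
    return 0;
  for (p = name; *p != '\0'; p++) {
    char c = *p;  /* plain variable: the macro evaluates its argument repeatedly */
    if (!cl_xml_is_name_char(c))
      return 0;
  }
  return 1;
}
```

The macro expands its argument several times without parentheses, so it is safest to pass a simple variable rather than an expression with side effects such as *p++.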

Referenced by main(), and range_open().

#define ClosePositionStream (   ps)    cl_delete_stream(ps)
#define collect_matches (   a,
  idlist,
  idlist_size,
  sort,
  size,
  rl,
  rls 
)    cl_idlist2cpos_oldstyle(a, idlist, idlist_size, sort, size, rl, rls)
#define collect_matching_ids (   a,
  re,
  flags,
  size 
)    cl_regex2id(a, re, flags, size)
#define cumulative_id_frequency (   a,
  list,
  size 
)    cl_idlist2freq(a, list, size)
#define drop_corpus (   c)    cl_delete_corpus(c)
#define find_attribute (   c,
  name,
  type,
  data 
)    cl_new_attribute_oldstyle(c, name, type, data)
#define get_alg_attribute (   a,
  p,
  start1,
  end1,
  start2,
  end2 
)    cl_cpos2alg2cpos_oldstyle(a, p, start1, end1, start2, end2)
#define get_attribute_size (   a)    cl_max_cpos(a)

Referenced by cl_new_stream(), and SystemCorpusSize().

#define get_bounds_of_nth_struc (   a,
  struc,
  start,
  end 
)    cl_struc2cpos(a, struc, start, end)

Referenced by calculate_ranges().

#define get_id_at_position (   a,
  cpos 
)    cl_cpos2id(a, cpos)
#define get_id_frequency (   a,
  id 
)    cl_id2freq(a, id)
#define get_id_from_sortidx (   a,
  sid 
)    cl_sort2id(a, sid)
#define get_id_info (   a,
  sid,
  freq,
  len 
)    cl_id2all(a, sid, freq, len)
#define get_id_of_string (   a,
  str 
)    cl_str2id(a, str)
#define get_id_range (   a)    cl_max_id(a)
#define get_id_string_len (   a,
  id 
)    cl_id2strlen(a, id)

Referenced by cl_id2all().

#define get_nr_of_strucs (   a,
  nr 
)    cl_max_struc_oldstyle(a, nr)

Referenced by calculate_ranges().

#define get_num_of_struc (   a,
  p,
  num 
)    cl_cpos2struc_oldstyle(a, p, num)
#define get_path_component   cl_path_get_component

Referenced by load_corpusnames().

#define get_positions (   a,
  id,
  freq,
  rl,
  rls 
)    cl_id2cpos_oldstyle(a, id, freq, rl, rls)
#define get_sortidxpos_of_id (   a,
  id 
)    cl_id2sort(a, id)
#define get_string_at_position (   a,
  cpos 
)    cl_cpos2str(a, cpos)

Referenced by get_leaf_value().

#define get_string_of_id (   a,
  id 
)    cl_id2str(a, id)
#define get_struc_attribute (   a,
  cpos,
  start,
  end 
)    cl_cpos2struc2cpos(a, cpos, start, end)
#define IGNORE_CASE   1

Flag: ignore-case in regular expression engine; fold case in cl_string_canonical.

Referenced by cl_new_regex(), cl_string_canonical(), cl_string_maptable(), create_feature_maps(), main(), print_pattern(), regopt_data_copy_to_regex_object(), and scancorpus_add_key().

#define IGNORE_DIAC   2

Flag: ignore-diacritics in regular expression engine; fold diacritics in cl_string_canonical.

Referenced by cl_new_regex(), cl_string_canonical(), cl_string_maptable(), create_feature_maps(), main(), print_pattern(), and scancorpus_add_key().

#define IGNORE_REGEX   4

Flag for: don't use regular expression engine - match as a literal string.

Referenced by do_flagged_re_variable(), do_flagged_string(), do_mval_string(), do_XMLTag(), and print_pattern().

#define inverted_file_is_compressed (   a)    cl_index_compressed(a)
#define item_sequence_is_compressed (   a)    cl_sequence_compressed(a)

Referenced by cl_cpos2id().

#define nr_of_arguments (   a)    cl_dynamic_numargs(a)
#define OpenPositionStream (   a,
  id 
)    cl_new_stream(a, id)
#define ReadPositionStream (   ps,
  buf,
  size 
)    cl_read_stream(ps, buf, size)
#define REQUIRE_NFC   8

Flag for: string requires enforcement of pre-composed normal form (NFC), which is standard in CWB indexed corpora; applies only to UTF-8; all UTF-8 strings passed in from external sources need to be normalised in this way; applies to subject string when used with regex engine, to sole argument string when used with cl_string_canonical.

Referenced by cl_new_regex(), cl_regex_match(), cl_string_canonical(), encode_get_input_line(), sencode_parse_line(), and VerifyVariable().

#define setup_corpus (   reg,
  name 
)    cl_new_corpus(reg, name)

Referenced by GetSystemCorpus().

#define STRUC_INSIDE   1

cl_cpos2boundary() return flag: specified position is WITHIN a region of this s-attribute

Referenced by cl_cpos2boundary().

#define STRUC_LBOUND   2

cl_cpos2boundary() return flag: specified position is AT THE START BOUNDARY OF a region of this s-attribute

Referenced by cl_cpos2boundary().

#define STRUC_RBOUND   4

cl_cpos2boundary() return flag: specified position is AT THE END BOUNDARY OF a region of this s-attribute

Referenced by cl_cpos2boundary().
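The three STRUC_* flags are single bits and can therefore be combined in one return value. The sketch below works on a single region given by explicit start and end positions; demo_boundary_flags() is hypothetical, and treating boundary positions as also being "inside" the region is an assumption about how cl_cpos2boundary() composes its result.

```c
#include <assert.h>

/* Values as listed in this reference. */
#define STRUC_INSIDE  1
#define STRUC_LBOUND  2
#define STRUC_RBOUND  4

/* Hypothetical helper: flags for cpos relative to the region [start, end]
 * (both inclusive). Returns 0 if cpos lies outside the region. */
static int demo_boundary_flags(int cpos, int start, int end)
{
  int flags;
  if (cpos < start || cpos > end)
    return 0;
  flags = STRUC_INSIDE;
  if (cpos == start)
    flags |= STRUC_LBOUND;
  if (cpos == end)
    flags |= STRUC_RBOUND;
  return flags;
}
```

Under this reading, a region consisting of a single token yields all three flags at once, so callers should test individual bits with & rather than compare the whole value.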

#define structure_has_values (   a)    cl_struc_values(a)
#define structure_value (   a,
  struc 
)    cl_struc2str(a, struc)
#define structure_value_at_position (   a,
  cpos 
)    cl_cpos2struc2str(a, cpos)

Typedef Documentation

typedef union _Attribute Attribute

The Attribute object: an entire segment of a corpus, such as an annotation field, an XML structure, or a set.

The attribute can be of any flavour (s, p etc); this information is specified internally.

Note that each Attribute object is associated with a particular corpus. They aren't abstract, i.e. every corpus has a "word" p-attribute but any Attribute object for a "word" refers to the "word" of a specific corpus, not to "word" attributes in general.

typedef struct _cl_int_list* cl_int_list

Automatically growing list of integers (just what you always need ...)

typedef struct _cl_lexhash* cl_lexhash

The cl_lexhash class (lexicon hashes, with IDs and frequency counts).

A "lexicon hash" links strings to integers. Each cl_lexhash object represents an entire table of such things; individual string-to-int links are represented by cl_lexhash_entry objects.

Within the cl_lexhash, the entries are grouped into buckets. A bucket is the term for a "slot" on the hash table. The linked list in a given bucket represents all the different string-keys that map to one particular index value.

Each entry contains the key itself (for search-and-retrieval), the frequency of that type (incremented when a token is added that is already in the lexhash), an ID integer, plus a bundle of "data" associated with that string.
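The bucket-and-entry layout described above can be illustrated with a toy chained hash table. Everything here is hypothetical (names, the fixed bucket count, the hash function) and far simpler than the real cl_lexhash; the "data" payload and memory clean-up are omitted for brevity.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_BUCKETS 8

/* Hypothetical entry: key, frequency, ID, and a link to the next entry
 * that landed in the same bucket. */
typedef struct demo_entry {
  struct demo_entry *next;
  char *key;
  unsigned int freq;
  int id;
} demo_entry;

typedef struct {
  demo_entry *table[DEMO_BUCKETS];  /* one linked list per bucket */
  int next_id;                      /* ID to give to the next new type */
} demo_lexhash;

/* Simple string hash reduced to a bucket index. */
static unsigned int demo_hash(const char *s)
{
  unsigned int h = 5381;
  while (*s != '\0')
    h = h * 33u + (unsigned char)*s++;
  return h % DEMO_BUCKETS;
}

/* Walk the bucket's list until the key is found; NULL if absent. */
static demo_entry *demo_find(demo_lexhash *h, const char *key)
{
  demo_entry *e = h->table[demo_hash(key)];
  while (e != NULL && strcmp(e->key, key) != 0)
    e = e->next;
  return e;
}

/* Add a token: increment freq if the type is already present, otherwise
 * create an entry with the next free ID. Returns NULL if allocation fails. */
static demo_entry *demo_add(demo_lexhash *h, const char *key)
{
  unsigned int b = demo_hash(key);
  demo_entry *e = demo_find(h, key);
  if (e != NULL) {
    e->freq++;
    return e;
  }
  e = malloc(sizeof *e);
  if (e == NULL)
    return NULL;
  e->key = malloc(strlen(key) + 1);
  if (e->key == NULL) {
    free(e);
    return NULL;
  }
  strcpy(e->key, key);
  e->freq = 1;
  e->id = h->next_id++;
  e->next = h->table[b];   /* push onto the front of the bucket's list */
  h->table[b] = e;
  return e;
}
```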

These lexicon hashes are used, notably, in the encoding of corpora to CWB-index-format.

WARNING: cl_lexhash objects are intended for data sets ranging from a few dozen entries to several million entries. Do not try to store more than a billion (distinct) strings in a lexicon hash, otherwise bad (and unpredictable) things will happen. You have been warned!

Underlying structure for the cl_lexhash_entry class.

Unlike most underlying structures, this is public in the CL API. This is done so that applications can access the embedded payload directly (as entry->data->integer, ...).

Such structures MUST NOT be allocated or copied directly by an application! Neither may internal fields, esp. entry->key, be modified. Only read and write access to the payload of entries returned by cl_lexhash_find() and cl_lexhash_add() is allowed.

typedef struct _cl_ngram_hash* cl_ngram_hash

The cl_ngram_hash class (hash-based frequency counts for n-grams, represented by n-tuples of integer type IDs).

A "n-gram hash" is used to collect frequency counts for n-grams, which are represented by n-tuples of integer type IDs. The mapping between types and IDs is not part of a cl_ngram_hash object and must be provided externally.

N-gram hashes encapsulate a central aspect of the cwb-scan-corpus utility, making efficient n-gram frequency counts available to other applications.

The implementation of the cl_ngram_hash class is similar to cl_lexhash. However, at the current time there is no mapping to unique n-gram IDs and no support for user data (a "payload"). The sole purpose of the implementation is to enable fast and memory-efficient frequency counts for very large sets of n-grams.

WARNING: cl_ngram_hash objects cannot store more than 2^32 - 1 entries. Bad things will happen if you try to do so!

Underlying structure for the cl_ngram_hash_entry class.

Unlike most underlying structures, this is public in the CL API, so that applications can iterate through entries, sort them, etc.

Access the frequency count with entry->freq, and the type IDs of the tuple members with entry->ngram[0], entry->ngram[1], ...

Entries MUST NOT be allocated, copied or modified directly by an application!

typedef struct _CL_Regex* CL_Regex

The CL_Regex object: an optimised regular expression.

The CL regex engine wraps around another regex library (v3.1.x: POSIX, will be PCRE in v3.2.0+) to implement CL semantics. These are: (a) the engine always matches the entire string; (b) there is support for case-/diacritic-insensitive matching; (c) certain optimisations are implemented.

Associated with the CL regular expression engine are macros for three flags: IGNORE_CASE, IGNORE_DIAC and IGNORE_REGEX. All three are used by the related cl_regex2id(), but only the first two are used by the CL_Regex object (since it does not support non-regexp search).

See also
cl_regex2id
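Points (a) and (b) of the CL semantics, whole-string matching and optional case-insensitivity, can be imitated on top of a generic regex library. The sketch below uses POSIX regcomp()/regexec() as a stand-in backend; demo_match_whole() is hypothetical, does not reproduce the CL's optimisations or diacritic folding, and is not how CL_Regex is implemented.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if pattern matches the ENTIRE subject,
 * 0 if it does not, and -1 if the pattern is too long or fails to compile. */
static int demo_match_whole(const char *pattern, const char *subject, int ignore_case)
{
  char anchored[512];
  regex_t rx;
  int cflags = REG_EXTENDED | REG_NOSUB;
  int n, result;

  /* Wrap the pattern in a group and anchor it at both ends, so that an
   * alternation such as "a|b" is anchored as a whole. */
  n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
  if (n < 0 || (size_t)n >= sizeof anchored)
    return -1;
  if (ignore_case)
    cflags |= REG_ICASE;
  if (regcomp(&rx, anchored, cflags) != 0)
    return -1;
  result = (regexec(&rx, subject, 0, NULL, 0) == 0);
  regfree(&rx);
  return result;
}
```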

Automatically growing list of strings (just what you always need ...)

typedef struct ClAutoString* ClAutoString

A single-string object whose memory allocation grows automatically.

typedef struct TCorpus Corpus

The Corpus object: contains information on a loaded corpus, including all its attributes.

The CorpusCharset object: an identifier for one of the character sets supported by CWB.

(Note on adding new character sets: add them immediately before unknown_charset. Do not change the order of existing charsets. Remember to update the special-chars module if you do so.)

typedef struct TCorpusProperty * CorpusProperty

The CorpusProperty object.

The underlying structure takes the form of a linked-list entry.

Note that unlike most CL objects, the underlying structure is exposed in the public API.

Each Corpus object has, as one of its members, the head entry on a list of CorpusProperties.

typedef struct _DCR DynCallResult

The DynCallResult object (needed to allocate space for dynamic function arguments)

The PositionStream object: gives stream-like reading of an Attribute.

Enumeration Type Documentation

The CorpusCharset object: an identifier for one of the character sets supported by CWB.

(Note on adding new character sets: add them immediately before unknown_charset. Do not change the order of existing charsets. Remember to update the special-chars module if you do so.)

Enumerator
ascii 
latin1 
latin2 
latin3 
latin4 
cyrillic 
arabic 
greek 
hebrew 
latin5 
latin6 
latin7 
latin8 
latin9 
utf8 
unknown_charset 

Function Documentation

int cl_alg2cpos ( Attribute attribute,
int  alg,
int *  source_region_start,
int *  source_region_end,
int *  target_region_start,
int *  target_region_end 
)

Gets the corpus positions of an alignment on the given align-attribute.

Note that four corpus positions are retrieved, into the addresses given as parameters.

Parameters
attribute   The align-attribute to look on.
alg   The ID of the alignment whose positions are wanted.
source_region_start   Location to put source corpus start position.
source_region_end   Location to put source corpus end position.
target_region_start   Location to put target corpus start position.
target_region_end   Location to put target corpus end position.
Returns
Boolean: true = all OK, false = problem.

References CDA_EIDXORNG, CDA_ENODATA, CDA_OK, cl_errno, cl_has_extended_alignment(), CompAlignData, CompXAlignData, TMblob::data, TComponent::data, ensure_component(), and TComponent::size.

Referenced by check_alignment_constraints(), compose_kwic_line(), decode_print_token_sequence(), do_cqi_cl_alg2cpos(), do_translate(), main(), and printAlignedStrings().

Corpus* cl_attribute_mother_corpus ( Attribute attribute)

Accessor function to get the mother corpus of the attribute.

References _Attribute::any.

Referenced by create_feature_maps().

void cl_autostring_concat ( ClAutoString  dst,
const char *  src 
)

Concatenate the string src onto the end of the AutoString in dst, automatically reallocating memory if necessary.

References ClAutoString::bytes_allocated, cl_realloc(), ClAutoString::data, ClAutoString::increment, and ClAutoString::len.

Referenced by compose_kwic_line(), get_field_separators(), get_position_values(), and get_print_attribute_values().

void cl_autostring_copy ( ClAutoString  dst,
const char *  src 
)

Copy the string in src into the AutoString in dst, automatically reallocating memory if necessary.

References ClAutoString::bytes_allocated, cl_realloc(), ClAutoString::data, ClAutoString::increment, and ClAutoString::len.

void cl_autostring_delete ( ClAutoString  string)

Delete an autostring object.

References cl_free, and ClAutoString::data.

Referenced by cleanup_kwic_line_memory().

void cl_autostring_dump ( ClAutoString  string)

Debug function: dumps the contents of an AutoString to stderr.

References ClAutoString::bytes_allocated, ClAutoString::data, ClAutoString::increment, and ClAutoString::len.

size_t cl_autostring_len ( ClAutoString  string)

Get the length of the currently-stored string (or negative value in case NULL object is passed).

Equivalent to reading the ->len member, except this function checks for a NULL!

ClAutoString cl_autostring_new ( const char *  data,
size_t  init_bytes 
)

Creates a new autostring object.

The string is initialised to data (or to a zero-length string if data is NULL).

Initially, init_bytes is allocated (and the increment step is the same size), unless the string is longer, in which case the length of the string becomes the initial amount of memory allocated.

Use 0 for init_bytes, and the length of the specified string is used as the initial allocation.

References ClAutoString::bytes_allocated, cl_malloc(), CL_MAX_LINE_LENGTH, ClAutoString::data, ClAutoString::increment, and ClAutoString::len.

Referenced by get_field_separators(), and setup_kwic_line_memory().

char* cl_autostring_ptr ( ClAutoString  string)

Get a pointer to the string data inside the AutoString (or NULL if the object is NULL).

Equivalent to reading the ->data member, except this function checks for a NULL!

Referenced by compose_kwic_line().

void cl_autostring_reclaim_mem ( ClAutoString  string)

Tries to free up unused memory by making the AutoString use only as many increments of size as necessary.

References cl_realloc(), ClAutoString::data, ClAutoString::increment, and ClAutoString::len.

void cl_autostring_set_increment ( ClAutoString  string,
size_t  new_increment 
)

Changes the increment size (measured in bytes).

Whenever memory reallocation is necessary, the AutoString will request a multiple of its increment value.
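
The growth policy described here (requesting memory in whole multiples of the increment) can be sketched as follows. This is a self-contained illustration, not the CL implementation: DemoAutoString and the demo_* functions are hypothetical names, and only the member names mirror those listed for ClAutoString.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an AutoString object. */
typedef struct {
  char  *data;            /* NUL-terminated buffer */
  size_t len;             /* length of the stored string, excluding the NUL */
  size_t bytes_allocated; /* current capacity in bytes */
  size_t increment;       /* allocation step in bytes */
} DemoAutoString;

/* Smallest multiple of increment that can hold `needed` bytes. */
size_t demo_round_up(size_t needed, size_t increment)
{
  assert(increment > 0);
  return ((needed + increment - 1) / increment) * increment;
}

/* Append src to s, growing the buffer in whole increments when required.
 * Returns 0 on success, -1 if realloc fails. */
int demo_autostring_concat(DemoAutoString *s, const char *src)
{
  size_t add = strlen(src);
  size_t needed = s->len + add + 1;   /* +1 for the terminating NUL */

  if (needed > s->bytes_allocated) {
    size_t new_size = demo_round_up(needed, s->increment);
    char *p = realloc(s->data, new_size);
    if (p == NULL)
      return -1;
    s->data = p;
    s->bytes_allocated = new_size;
  }
  memcpy(s->data + s->len, src, add + 1);  /* copies the NUL as well */
  s->len += add;
  return 0;
}
```

With an increment of 8, storing a 10-character string needs 11 bytes and therefore leads to an allocation of 16 bytes in this sketch.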

void cl_autostring_truncate ( ClAutoString  string,
int  new_length 
)

Truncates the AutoString to the length specified.

Note: this function does not respect UTF-8 encoding, so if the string is UTF-8 you need to ascertain in advance that the cut-off does not split any UTF-8 character.

This function should be used if the character buffer is modified by direct access (which will not update the internal member of the object that tracks string length).

References ClAutoString::len.

Referenced by compose_kwic_line(), get_field_separators(), get_position_values(), and setup_kwic_line_memory().
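
One way to find a cut-off that does not split a UTF-8 character is to step back while the byte at the cut is a continuation byte (bit pattern 10xxxxxx). The helper below is a hypothetical sketch and not part of the CL; its result could then be used as the new_length argument.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the largest length <= wanted at which the UTF-8 string s can be
 * cut without splitting a multi-byte character. */
size_t demo_utf8_safe_length(const char *s, size_t wanted)
{
  size_t n;

  assert(s != NULL);
  n = strlen(s);
  if (wanted < n)
    n = wanted;
  /* s[n] is the first byte that would be dropped; if it is a continuation
   * byte, the cut falls inside a character, so move the cut to the left. */
  while (n > 0 && ((unsigned char)s[n] & 0xC0) == 0x80)
    n--;
  return n;
}
```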

void* cl_calloc ( size_t  nr_of_elements,
size_t  element_size 
)

Safely allocates memory calloc-style.

See also
cl_malloc
Parameters
nr_of_elements  Number of elements to allocate
element_size  Size of each element
Returns
Pointer to the block of allocated memory

Referenced by alloc_mblob(), cl_new_int_list(), cl_new_lexhash(), cl_new_ngram_hash(), cl_new_string_list(), cl_ngram_hash_stats(), cl_regex2id(), compute_code_lengths(), do_translate(), evaluate_target(), range_declare(), and validate_revcorp().

CorpusCharset cl_charset_from_name ( char *  name)

Gets a CorpusCharset enumeration with the id code for the given string.

References _charset_spec::name, and unknown_charset.

Referenced by add_corpus_property(), cwbci_parse_options(), main(), and sencode_parse_options().

char* cl_charset_name ( CorpusCharset  id)

Gets a string containing the name of the specified CorpusCharset character set object.

Note that the returned string cannot be modified. TODO: it should probably be a const char *.

References _charset_spec::name.

Referenced by corpus_info(), describecorpus_show_basic_info(), and do_cqi_corpus_charset().

char* cl_charset_name_canonical ( char *  name_to_check)

Checks whether a string represents a valid charset, and returns a pointer to the name in canonical form (i.e. lacking any non-standard case there may be in the input string).

Note that the returned string cannot be modified.

Parameters
name_to_check  String containing the character set name to be checked
Returns
Pointer to canonical-form string for that charset's name or NULL if name_to_check cannot be linked to a valid charset.

References _charset_spec::name.

Referenced by cwbci_parse_options(), encode_parse_options(), and sencode_parse_options().

size_t cl_charset_strlen ( CorpusCharset  charset,
char *  s 
)

References utf8.

Referenced by compose_kwic_line().

int cl_close_stream ( FILE *  handle)

Close I/O stream.

This function can only be used for FILE* objects opened with cl_open_stream()!

Parameters
stream  An I/O stream that has been opened with cl_open_stream()
Returns
0 on success, otherwise the error code returned by fclose() or pclose(); <cl_errno> is set accordingly

References CDA_EATTTYPE, CDA_EPOSIX, CDA_OK, cl_broken_pipe, cl_errno, cl_free, CL_STREAM_BZIP2, CL_STREAM_FILE, CL_STREAM_GZIP, CL_STREAM_PIPE, CL_STREAM_STDIO, _CLStream::handle, _CLStream::next, open_streams, STREAM_IS_PIPE, and _CLStream::type.

Referenced by alignshow_goodbye(), close_input_stream(), close_stream(), encode_get_input_line(), lexdecode_show(), main(), open_input_stream(), open_stream(), and SetVariableValue().

CorpusCharset cl_corpus_charset ( Corpus corpus)

Retrieves the special 'charset' property from a Corpus object.

Parameters
corpus  The corpus object from which to retrieve the charset
Returns
The character set (as a CorpusCharset object).

References TCorpus::charset.

Referenced by create_feature_maps(), decode_print_xml_declaration(), main(), scancorpus_add_key(), and sencode_parse_options().

cl_string_list cl_corpus_list_attributes ( Corpus corpus,
int  attribute_type 
)

Gets a list of the named attributes that this corpus possesses.

This function creates a list of strings containing the names of all and only those Attributes in this corpus whose type matches that specified in the second parameter.

Parameters
corpus  The corpus whose attributes are to be listed.
attribute_type  The type of attributes to be listed. This must be one of the attribute type macros: ATT_POS, ATT_STRUC etc. For all attributes, specify ATT_ALL.
Returns
String list containing names of all the corpus's attributes that have the desired type. All the actual character buffers have been newly allocated, so it is safe to call cl_free_string_list on the returned cl_string_list object once you're done with it.

References _Attribute::any, TCorpus::attributes, cl_new_string_list(), cl_strdup(), and cl_string_list_append().

char* cl_corpus_property ( Corpus corpus,
char *  property 
)

Gets the value of the specified corpus property.

Parameters
corpus  Pointer to the Corpus object.
property  Name of the property to retrieve.
Returns
Pointer to string that contains the value of the property, or NULL if the specified property is undefined for this Corpus object.

References cl_first_corpus_property(), cl_next_corpus_property(), TCorpusProperty::property, and TCorpusProperty::value.

Referenced by add_corpus_property(), and corpus_info().

int cl_cpos2alg ( Attribute attribute,
int  cpos 
)

Gets the id number of the alignment at the specified corpus position.

Parameters
attribute  The align-attribute to look on.
cpos  The corpus position to look at.
Returns
The id number of the alignment at this position, or a negative int error code.

References CDA_EALIGN, CDA_ENODATA, CDA_EPOSORNG, CDA_OK, cl_errno, cl_has_extended_alignment(), CompAlignData, CompXAlignData, TMblob::data, TComponent::data, ensure_component(), get_alignment(), get_extended_alignment(), and TComponent::size.

Referenced by check_alignment_constraints(), compose_kwic_line(), decode_print_token_sequence(), do_cqi_cl_cpos2alg(), do_translate(), and printAlignedStrings().

int cl_cpos2alg2cpos_oldstyle ( Attribute attribute,
int  position,
int *  source_corpus_start,
int *  source_corpus_end,
int *  aligned_corpus_start,
int *  aligned_corpus_end 
)

Gets the corpus positions of an alignment on the given align-attribute.

This is for old-style alignments only: it doesn't (can't) deal with extended alignments. Deprecated: use cl_alg2cpos instead (but note its parameters are not identical).

See also
cl_alg2cpos.
Parameters
attribute  The align-attribute to look on.
position  The corpus position {??} of the alignment whose positions are wanted.
source_corpus_start  Location to put source corpus start position.
source_corpus_end  Location to put source corpus end position.
aligned_corpus_start  Location to put target corpus start position.
aligned_corpus_end  Location to put target corpus end position.
Returns
Boolean: true = all OK, false = problem.

References ATT_ALIGN, CDA_ENODATA, CDA_EPOSORNG, CDA_OK, check_arg, cl_errno, CompAlignData, TMblob::data, TComponent::data, ensure_component(), get_alignment(), and TComponent::size.

int cl_cpos2boundary ( Attribute a,
int  cpos 
)

Compares the location of a corpus position to the regions of an s-attribute.

This determines whether the specified corpus position is within a region (i.e. a structure, an instance of that s-attribute) on the given s-attribute; and/or on a boundary; or outside a region.

See also
STRUC_INSIDE
STRUC_LBOUND
STRUC_RBOUND
Parameters
a  The s-attribute on which to search.
cpos  The corpus position to look for.
Returns
0 if this position is outside a region; some combination of flags if it is within a region or on a bound; or a negative number (error code) in case of error.

References CDA_ESTRUC, cl_cpos2struc2cpos(), cl_errno, STRUC_INSIDE, STRUC_LBOUND, and STRUC_RBOUND.
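
The flag-based return value can be illustrated with a small self-contained sketch. The DEMO_* flag values are assumptions chosen for the example (the real STRUC_* macros are defined in cl.h and are not reproduced here), and the sketch assumes that a position on a boundary also counts as inside the region.

```c
#include <assert.h>

/* Hypothetical flag values, for illustration only. */
enum {
  DEMO_STRUC_INSIDE = 1,
  DEMO_STRUC_LBOUND = 2,
  DEMO_STRUC_RBOUND = 4
};

/* Classify cpos against a single region [start, end] (both inclusive).
 * Returns 0 if cpos lies outside the region, otherwise a bitwise OR of
 * the flags that apply. */
int demo_cpos2boundary(int start, int end, int cpos)
{
  int flags = 0;

  assert(start <= end);
  if (cpos < start || cpos > end)
    return 0;
  flags |= DEMO_STRUC_INSIDE;
  if (cpos == start)
    flags |= DEMO_STRUC_LBOUND;
  if (cpos == end)
    flags |= DEMO_STRUC_RBOUND;
  return flags;
}
```

A caller would test individual bits, e.g. `flags & DEMO_STRUC_LBOUND`, rather than comparing the return value against a single constant, because a one-token region sets both boundary flags at once.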

int cl_cpos2id ( Attribute attribute,
int  position 
)
char* cl_cpos2str ( Attribute attribute,
int  position 
)

Gets the string of the item at the specified position on the given p-attribute.

Parameters
attribute  The P-attribute to look on.
position  The corpus position to look at.
Returns
The string of the item at that position on this attribute (pointer to actual data within the attribute, DO NOT FREE!), or NULL if there is an error.

References ATT_POS, CDA_OK, check_arg, cl_cpos2id(), cl_errno, and cl_id2str().

Referenced by alignshow_print_next_region(), decode_print_token_sequence(), do_cqi_cl_cpos2str(), get_position_values(), print_tabulation(), SortExternally(), and SortSubcorpus().

int cl_cpos2struc ( Attribute a,
int  cpos 
)

Gets the ID number of a structure (instance of an s-attribute) that is found at the given corpus position.

This is a wrapper of the "old" function get_num_of_struc() that normalises it to standard return value behaviour.

Parameters
a  The s-attribute on which to search.
cpos  The corpus position to look for.
Returns
The number of the structure that is found.

References cl_cpos2struc_oldstyle(), and cl_errno.

Referenced by compose_kwic_line(), decode_print_surrounding_s_att_values(), decode_print_token_sequence(), do_cqi_cl_cpos2lbound(), do_cqi_cl_cpos2rbound(), do_cqi_cl_cpos2struc(), eval_constraint(), get_position_values(), and main().

int cl_cpos2struc2cpos ( Attribute attribute,
int  position,
int *  struc_start,
int *  struc_end 
)

Gets the start and end positions of the instance of the given S-attribute found at the specified corpus position.

This function finds one particular instance of the S-attribute, and assigns its start and end points to the locations given as arguments.

Parameters
attribute  The s-attribute to search.
position  The corpus position to search for.
struc_start  Location for the start position of the instance.
struc_end  Location for the end position of the instance.

References ATT_STRUC, CDA_ENODATA, CDA_ESTRUC, CDA_OK, check_arg, cl_errno, CompStrucData, TMblob::data, TComponent::data, ensure_component(), get_previous_mark(), and TComponent::size.

Referenced by cl_cpos2boundary(), and decode_print_token_sequence().

char* cl_cpos2struc2str ( Attribute attribute,
int  position 
)
int cl_cpos2struc_oldstyle ( Attribute attribute,
int  position,
int *  struc_num 
)

Gets the ID number of a structure (instance of an s-attribute) that is found at the given corpus position.

Deprecated function: use cl_cpos2struc.

See also
cl_cpos2struc
Parameters
attribute  The s-attribute on which to search.
position  The corpus position to look for.
struc_num  Location where the number of the structure that is found will be put.
Returns
Boolean: true for all OK, false for error.

References ATT_STRUC, CDA_ENODATA, CDA_ESTRUC, CDA_OK, check_arg, cl_errno, CompStrucData, TMblob::data, TComponent::data, ensure_component(), get_previous_mark(), and TComponent::size.

Referenced by cl_cpos2struc().

int cl_delete_attribute ( Attribute attribute)

Deletes the specified Attribute object.

The function also appropriately amends the Corpus object of which this Attribute is a dependent. This means you can call it repeatedly on the first element of a Corpus's Attribute list (as the linked list is automatically adjusted).

Returns
Boolean: true for all OK, false for a problem.

References _Attribute::any, Dynamic_Attribute::arglist, ATT_DYN, ATT_NONE, ATT_POS, TCorpus::attributes, Dynamic_Attribute::call, cl_free, comp_drop_component(), CompDirectory, CompLast, corpus, _Attribute::dyn, POS_Attribute::hc, _DynArg::next, _Attribute::pos, and _Attribute::type.

Referenced by cl_delete_corpus(), cqi_drop_attribute(), and drop_attribute().

int cl_delete_corpus ( Corpus corpus)

Deletes a Corpus object from memory.

A Corpus object keeps track of how many times it has been requested via cl_new_corpus(). When cl_delete_corpus() is called, the object is only actually deleted when there is just one outstanding request. Otherwise, the variable tracking the number of requests is decremented.

Parameters
corpus  The Corpus to delete.
Returns
Always 1.

References TCorpus::admin, TCorpus::attributes, cl_delete_attribute(), cl_free, FreeIDList(), TCorpus::groupAccessList, TCorpus::hostAccessList, TCorpus::id, TCorpus::info_file, loaded_corpora, TCorpus::name, TCorpus::next, TCorpus::nr_of_loads, TCorpus::path, TCorpus::registry_dir, TCorpus::registry_name, and TCorpus::userAccessList.

Referenced by cl_new_corpus(), compressrdx_cleanup(), decode_cleanup(), huffcode_usage(), and main().
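
The request-counting behaviour described above can be sketched in isolation. DemoCorpus and the demo_* functions are hypothetical; only the nr_of_loads member name is taken from the references. Unlike the documented function, which always returns 1, the sketch returns whether the object was actually freed so that the behaviour can be observed.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a Corpus object; only the load counter is modelled. */
typedef struct {
  int nr_of_loads;   /* number of outstanding requests */
} DemoCorpus;

/* Request a corpus: reuse the existing object if there is one, else create it. */
DemoCorpus *demo_new_corpus(DemoCorpus *existing)
{
  DemoCorpus *c;

  if (existing != NULL) {
    existing->nr_of_loads++;
    return existing;
  }
  c = malloc(sizeof *c);
  if (c != NULL)
    c->nr_of_loads = 1;
  return c;
}

/* Release one request. Returns 1 if the object was freed, 0 if only the
 * counter was decremented. */
int demo_delete_corpus(DemoCorpus *c)
{
  assert(c != NULL && c->nr_of_loads > 0);
  if (c->nr_of_loads > 1) {
    c->nr_of_loads--;
    return 0;
  }
  free(c);
  return 1;
}
```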

void cl_delete_int_list ( cl_int_list  l)

Deletes a cl_int_list object.

References cl_free, and _cl_int_list::data.

void cl_delete_lexhash ( cl_lexhash  hash)

Deletes a cl_lexhash object.

This deletes all the entries in all the buckets in the lexhash, plus the cl_lexhash itself.

Parameters
hash  The cl_lexhash to delete.

References _cl_lexhash::buckets, cl_delete_lexhash_entry(), cl_free, _cl_lexhash_entry::next, and _cl_lexhash::table.

Referenced by main().

void cl_delete_ngram_hash ( cl_ngram_hash  hash)

Deletes a cl_ngram_hash object.

This deletes all the entries in all the buckets in the ngram_hash, plus the cl_ngram_hash itself.

Parameters
hash  The cl_ngram_hash to delete.

References _cl_ngram_hash::buckets, cl_free, _cl_ngram_hash_entry::next, and _cl_ngram_hash::table.

Referenced by ComputeGroupInternally().

void cl_delete_regex ( CL_Regex  rx)

Deletes a CL_Regex object, and frees all resources associated with the pre-compiled regex.

Note that we use cl_free to deallocate the internal PCRE buffers, not pcre_free, for the simple reason that pcre_free is just a function pointer that will normally contain free, and thus we miss out on the checking that cl_free provides.

Parameters
rx  The CL_Regex to delete.

References cl_free, _CL_Regex::extra, _CL_Regex::grain, _CL_Regex::grains, _CL_Regex::haystack_buf, _CL_Regex::haystack_casefold, and _CL_Regex::needle.

Referenced by cl_regex2id(), free_booltree(), free_environment(), and main().

int cl_delete_stream ( PositionStream ps)

Deletes a PositionStream object.

References BSclose().

Referenced by compress_reversed_index(), and decompress_check_reversed_index().

void cl_delete_string_list ( cl_string_list  l)

Deletes a cl_string_list object.

References cl_free, and _cl_string_list::data.

Referenced by cl_make_set(), encode_parse_options(), and main().

int cl_dynamic_call ( Attribute attribute,
DynCallResult dcr,
DynCallResult args,
int  nr_args 
)

Calls a dynamic attribute.

This is the attribute access function for dynamic attributes.

Parameters
attribute  The (dynamic) attribute in question.
dcr  Location for the result (*int or *char).
args  Location of the parameters (of *int or *char).
nr_args  Number of parameters.
Returns
Boolean: True for all OK, false for error.

References Dynamic_Attribute::arglist, ATT_DYN, ATTAT_FLOAT, ATTAT_INT, ATTAT_NONE, ATTAT_PAREF, ATTAT_POS, ATTAT_STRING, ATTAT_VAR, Dynamic_Attribute::call, CDA_EARGS, CDA_OK, _DCR::charres, check_arg, cl_errno, CL_MAX_LINE_LENGTH, cl_strdup(), _Attribute::dyn, _DCR::floatres, _DCR::intres, _DynArg::next, Dynamic_Attribute::res_type, _DynArg::type, _DCR::type, and _DCR::value.

int cl_dynamic_numargs ( Attribute attribute)

Count the number of arguments on a dynamic attribute's argument list.

Parameters
attribute  Pointer to the Attribute object to analyse; it must be a dynamic attribute.
Returns
integer specifying the number of arguments; a negative integer is returned if for any argument on dyn.arglist, the type is equal to ATTAT_VAR

References Dynamic_Attribute::arglist, ATT_DYN, ATTAT_VAR, CDA_OK, check_arg, cl_errno, _Attribute::dyn, _DynArg::next, and _DynArg::type.

void cl_error ( char *  message)
char* cl_error_string ( int  error_num)

Gets a string describing the error identified by an error number.

The string is a pointer to an internal constant string, i.e., do not modify or free it!

Parameters
error_num  Error number integer (a CDA_* constant as defined in cl.h)

References CDA_EACCESS, CDA_EALIGN, CDA_EARGS, CDA_EATTTYPE, CDA_EBADREGEX, CDA_EBUFFER, CDA_EFSETINV, CDA_EIDORNG, CDA_EIDXORNG, CDA_EINTERNAL, CDA_ENODATA, CDA_ENOMEM, CDA_ENOSTRING, CDA_ENULLATT, CDA_ENYI, CDA_EOTHER, CDA_EPATTERN, CDA_EPOSIX, CDA_EPOSORNG, CDA_EREMOTE, CDA_ESTRUC, and CDA_OK.

Referenced by cl_error(), open_input_stream(), and open_stream().

CorpusProperty cl_first_corpus_property ( Corpus corpus)

Gets the first entry in this corpus's list of properties.

(The corpus properties iterator / property datatype is public.)

Parameters
corpus  Pointer to the Corpus object.
Returns
The first property.

References TCorpus::properties.

Referenced by cl_corpus_property(), and corpus_info().

void cl_free_string_list ( cl_string_list  l)

Frees all the strings in the cl_string_list object.

References cl_free, _cl_string_list::data, and _cl_string_list::size.

Referenced by main().

void cl_get_rng_state ( unsigned int *  i1,
unsigned int *  i2 
)

Reads current state of CL-internal random number generator.

The (unsigned, 32-bit) integers currently held in RNG_I1 and RNG_I2 are written to the two memory locations supplied as arguments.

Parameters
i1  Target location for the value of RNG_I1
i2  Target location for the value of RNG_I2

References RNG_I1, and RNG_I2.

int cl_has_extended_alignment ( Attribute attribute)

Checks whether an attribute's XALIGN component exists, that is, whether or not it has extended alignment.

Parameters
attribute  An align-attribute.
Returns
Boolean.

References ATT_ALIGN, check_arg, cl_errno, component_state(), ComponentLoaded, ComponentUnloaded, and CompXAlignData.

Referenced by cl_alg2cpos(), cl_cpos2alg(), cl_max_alg(), and describecorpus_show_statistics().

char* cl_id2all ( Attribute attribute,
int  index,
int *  freq,
int *  slen 
)

Gets the string of the item with the specified ID on the given p-attribute.

As well as returning the string, other information about the item is inserted into locations specified by other parameters.

Parameters
attribute  The P-attribute to look on.
index  The ID of the item to look at.
freq  Will be set to the frequency of the item.
slen  Will be set to the string-length of the item.
Returns
The string of the item at that position on this attribute, OR NULL if there is an error.

References ATT_POS, CDA_OK, check_arg, cl_errno, get_id_frequency, get_id_string_len, and get_string_of_id.

Referenced by lexdecode_print_item_info().

int* cl_id2cpos_oldstyle ( Attribute attribute,
int  id,
int *  freq,
int *  restrictor_list,
int  restrictor_list_size 
)

Gets all the corpus positions where the specified item is found on the given P-attribute.

The restrictor list is a set of ranges in which instances of the item MUST occur to be collected by this function. If no restrictor list is specified (i.e. restrictor_list is NULL), then ALL corpus positions where the item occurs are returned.

This restrictor list has the form of a list of ranges {start,end} of size restrictor_list_size, that is, the number of ints in this area is 2 * restrictor_list_size!!!

This function is "oldstyle" because in the "newstyle" function, there is no restrictor list. (And in fact, the newstyle function is implemented as a macro to this one with the last two arguments NULL and 0.)

Parameters
attribute  The P-attribute to look on.
id  The id of the item to look for.
freq  The frequency of the specified item is written here. This will be 0 in the case of errors.
restrictor_list  A list of pairs of integers specifying ranges {start,end} in the corpus
restrictor_list_size  The number of PAIRS of ints in the restrictor list.
Returns
Pointer to the list of corpus positions; or NULL in case of error.

References ATT_POS, BSclose(), BSopen(), BSseek(), CDA_EIDORNG, CDA_ENODATA, CDA_OK, check_arg, cl_errno, cl_free, cl_id2freq(), cl_index_compressed(), cl_malloc(), cl_max_cpos(), cl_max_id(), cl_realloc(), CompCompRF, CompCompRFX, CompRevCorpus, CompRevCorpusIdx, compute_ba(), TMblob::data, TComponent::data, ensure_component(), and read_golomb_code_bs().
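
The layout of the restrictor list (restrictor_list_size pairs, hence 2 * restrictor_list_size ints) can be illustrated with a standalone filter. This is a hypothetical sketch, not the CL code: it assumes that both ends of each range are inclusive and it filters an already-collected array of positions in place.

```c
#include <assert.h>
#include <stddef.h>

/* Keep only those positions that fall inside at least one {start,end} range.
 * ranges holds 2 * n_ranges ints: start0, end0, start1, end1, ...
 * A NULL ranges pointer means "no restriction".
 * The array is filtered in place; the number of positions kept is returned. */
int demo_restrict_positions(int *positions, int n_positions,
                            const int *ranges, int n_ranges)
{
  int kept = 0;
  int i, r;

  if (ranges == NULL)
    return n_positions;

  for (i = 0; i < n_positions; i++) {
    for (r = 0; r < n_ranges; r++) {
      int start = ranges[2 * r];
      int end   = ranges[2 * r + 1];
      if (positions[i] >= start && positions[i] <= end) {
        positions[kept++] = positions[i];
        break;            /* one matching range is enough */
      }
    }
  }
  return kept;
}
```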

int cl_id2freq ( Attribute attribute,
int  id 
)

Gets the frequency of an item on this attribute.

Parameters
attribute  The P-attribute to look on
id  Identifier of an item on this attribute.
Returns
The frequency count of the item specified by id, or an error code (if less than 0)

References ATT_POS, CDA_EIDXORNG, CDA_ENODATA, CDA_OK, check_arg, cl_errno, CompCorpusFreqs, TMblob::data, TComponent::data, and ensure_component().

Referenced by cl_id2cpos_oldstyle(), cl_idlist2freq(), compress_reversed_index(), compute_code_lengths(), creat_rev_corpus(), create_feature_maps(), decompress_check_reversed_index(), do_cqi_cl_id2freq(), and validate_revcorp().

int cl_id2sort ( Attribute attribute,
int  id 
)

Gets the position in the Attribute's sorted wordlist index of the item with the specified ID code.

This function is NOT YET IMPLEMENTED.

See also
get_id_from_sortidx
Parameters
attribute  The (positional) Attribute whose index is to be searched
id  Identifier of an item on this attribute.
Returns
The offset of that item in the sorted wordlist index.

References ATT_POS, CDA_ENODATA, CDA_ENYI, CDA_OK, check_arg, cl_errno, CompLexiconSrt, and ensure_component().

char* cl_id2str ( Attribute attribute,
int  id 
)

Gets the string that corresponds to the specified item on the given P-attribute.

Parameters
attribute  The Attribute to look the item up on
id  Identifier of an item on this attribute.
Returns
The string (pointer to actual data within the attribute, DO NOT FREE!), or NULL if there is an error.

References ATT_POS, CDA_EIDORNG, CDA_ENODATA, CDA_OK, check_arg, cl_errno, CompLexicon, CompLexiconIdx, TMblob::data, TComponent::data, ensure_component(), and TComponent::size.

Referenced by cl_cpos2str(), cl_id2strlen(), create_feature_maps(), do_cqi_cl_id2str(), Group_id2str(), i2compare(), main(), and scancorpus_add_key().

int cl_id2strlen ( Attribute attribute,
int  id 
)

Calculates the length of the string that corresponds to the specified item on the given P-attribute.

Parameters
attribute  The (positional) Attribute to look up the item on
id  Identifier of an item on this attribute.
Returns
The length of the string, or a CDA_ error code

References ATT_POS, CDA_EIDORNG, CDA_ENODATA, CDA_EOTHER, CDA_OK, check_arg, cl_errno, cl_id2str(), CompLexiconIdx, TMblob::data, TComponent::data, ensure_component(), and TComponent::size.

Referenced by create_feature_maps().

void cl_id_tolower ( char *  s)

Converts an uppercase corpus name to an equivalent lowercase form.

String is modified in situ. Only the ASCII characters are changed.

Note, this function doesn't check for what is and is not an allowed CWB-corpus-name character.

Referenced by changecase_string(), changecase_string_no_copy(), cl_new_corpus(), encode_generate_registry_file(), and main().
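
An ASCII-only, in-place conversion of this kind can be sketched as below. demo_id_tolower is a hypothetical name; the point of the sketch is that bytes outside A-Z (including any byte of 0x80 or above) are left untouched, which is why the locale-dependent library tolower is not used.

```c
#include <assert.h>
#include <string.h>

/* Lowercase the ASCII letters of s in place; all other bytes are left as-is. */
void demo_id_tolower(char *s)
{
  assert(s != NULL);
  for (; *s != '\0'; s++) {
    if (*s >= 'A' && *s <= 'Z')
      *s = (char)(*s - 'A' + 'a');
  }
}
```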

void cl_id_toupper ( char *  s)

Converts a lowercase corpus name to an equivalent uppercase form.

String is modified in situ. Only the ASCII characters are changed.

Note, this function doesn't check for what is and is not an allowed CWB-corpus-name character.

The old version of this code was a line in cwb-encode that used the library toupper to cope with Latin1 characters. But these are no longer allowed in identifiers, which must be ASCII only.

Referenced by changecase_string(), changecase_string_no_copy(), encode_generate_registry_file(), and main().

int cl_id_validate ( char *  s)

Checks a string to see if it is a valid CWB identifier.

The rules for these are as follows (see also the CQP lexer):

  • all characters must be ASCII, i.e. less than 0x80;
  • the string must be at least 1 character long;
  • the first character must be an uppercase or lowercase letter or an underscore;
  • the second and subsequent characters may also be digits, hyphens or full stops;
  • mixed case is allowed (upper-only or lower-only is imposed elsewhere, where necessary).

TODO: should the CL registry lexer be amended to reflect these restrictions? (ID there is rather laxer than this)

Parameters
s  The string to check.
Returns
A boolean. True if the string is a valid ID. Otherwise false.

Referenced by cl_new_corpus(), and encode_generate_registry_file().
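
The rules listed above translate fairly directly into a character-by-character check. The following is a minimal sketch written from that list alone (demo_id_validate and its helpers are hypothetical names, and the actual CL function may differ in detail).

```c
#include <assert.h>
#include <stddef.h>

static int demo_is_letter(unsigned char c)
{
  return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

static int demo_is_digit(unsigned char c)
{
  return c >= '0' && c <= '9';
}

/* Returns 1 if s satisfies the identifier rules listed above, 0 otherwise. */
int demo_id_validate(const char *s)
{
  unsigned char c;

  /* must be at least 1 character long */
  if (s == NULL || *s == '\0')
    return 0;

  /* first character: ASCII letter or underscore */
  c = (unsigned char)s[0];
  if (!(demo_is_letter(c) || c == '_'))
    return 0;

  /* later characters: additionally digits, hyphen, full stop.
   * Bytes >= 0x80 match none of the tests, so non-ASCII input is rejected. */
  for (s++; *s != '\0'; s++) {
    c = (unsigned char)*s;
    if (!(demo_is_letter(c) || demo_is_digit(c) ||
          c == '_' || c == '-' || c == '.'))
      return 0;
  }
  return 1;
}
```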

int* cl_idlist2cpos_oldstyle ( Attribute attribute,
int *  word_ids,
int  number_of_words,
int  sort,
int *  size_of_table,
int *  restrictor_list,
int  restrictor_list_size 
)

Gets a list of corpus positions matching a list of ids.

This function returns an (ordered) list of all corpus positions which match one of the ids given in the list of ids. The table is allocated with malloc, so free it when you no longer need it.

The list itself is returned; its size is placed in size_of_table. This size is, of course, the same as the cumulative id frequency of the ids (because each corpus position matching one of the ids is added into the list).

BEWARE: when the id list is rather big or there are highly-frequent ids in the id list (for example, after a call to collect_matching_ids with the pattern ".*") this will give a copy of the corpus – for which you probably don't have enough memory!!! It is therefore a good idea to call cumulative_id_frequency before and to introduce some kind of bias.

This function is DEPRECATED in favour of cl_idlist2cpos().

This function is "oldstyle" because it has the "restrictor list" parameters, which are not available through the "newstyle" function cl_idlist2cpos() (which is currently just a macro to this).

A note on the last two parameters, which are currently unused: restrictor_list is a list of integer pairs [a,b] which means that the returned value only contains positions which fall within at least one of these intervals. The list must be sorted by the start positions, and secondarily by b. restrictor_list_size is the number of integers in this list, NOT THE NUMBER OF PAIRS. WARNING: CURRENTLY UNIMPLEMENTED {NB – this description of restrictor_list_size DOESN'T MATCH the one for get_positions(), which this function calls.}

REMEMBER: this monster returns a list of corpus indices, not a list of ids.

See also
collect_matching_ids
get_positions
cl_idlist2cpos
Parameters
attribute  The P-attribute we are looking in
word_ids  A list of item ids (i.e. id codes for items on this attribute).
number_of_words  The length of this list.
sort  boolean: return sorted list?
size_of_table  The size of the allocated table will be placed here.
restrictor_list  See function description.
restrictor_list_size  See function description.
Returns
Pointer to the list of corpus positions.

References ATT_POS, CDA_EIDORNG, CDA_ENODATA, CDA_OK, check_arg, cl_errno, cl_idlist2freq(), cl_malloc(), CompLexiconIdx, ensure_component(), get_positions, intcompare(), and TComponent::size.

Referenced by get_matched_corpus_positions().

int cl_idlist2freq ( Attribute attribute,
int *  word_ids,
int  number_of_words 
)

Calculates the total frequency of all items on a list of item IDs.

This function returns the sum of the frequencies of the items in word_ids, which is an array of item IDs with length number_of_words.

The result is therefore the number of corpus positions which match one of the words.

Parameters
attribute  P-attribute on which these items are found.
word_ids  An array of item IDs.
number_of_words  Length of the word_ids array.
Returns
Sum of all the frequencies; less than 0 for an error.

References ATT_POS, CDA_ENODATA, CDA_OK, check_arg, cl_errno, and cl_id2freq().

Referenced by cl_idlist2cpos_oldstyle(), and OptimizeStringConstraint().

int cl_index_compressed ( Attribute attribute)

Check whether the reverse-corpus index (inverted file) of the given P-attribute is compressed.

See comments in body of function for what counts as "compressed".

Returns
Boolean.

References ATT_POS, check_arg, cl_errno, CompCompRF, CompCompRFX, component_state(), ComponentLoaded, ComponentUnloaded, CompRevCorpus, and CompRevCorpusIdx.

Referenced by cl_id2cpos_oldstyle(), and cl_new_stream().

void cl_int_list_append ( cl_int_list  l,
int  val 
)

Appends an integer to the end of a cl_int_list object.

References cl_int_list_set(), and _cl_int_list::size.

int cl_int_list_get ( cl_int_list  l,
int  n 
)

Retrieves an element from a cl_int_list object.

Parameters
l  The list to search.
n  The element to retrieve.
Returns
The n'th integer on the list, or 0 if there is no n'th integer.

References _cl_int_list::data, and _cl_int_list::size.

void cl_int_list_lumpsize ( cl_int_list  l,
int  s 
)

Sets the lumpsize of a cl_int_list object.

See also
LUMPSIZE
Parameters
l  The cl_int_list.
s  The new lumpsize.

References _cl_int_list::lumpsize, and LUMPSIZE.

void cl_int_list_qsort ( cl_int_list  l)

Sorts a cl_int_list object.

The list of integers is sorted into ascending order.

References cl_int_list_intcmp(), _cl_int_list::data, and _cl_int_list::size.

void cl_int_list_set ( cl_int_list  l,
int  n,
int  val 
)

Sets an integer on a cl_int_list object.

The n'th element on the list is set to val, and the list is auto-extended if necessary.

References _cl_int_list::allocated, cl_realloc(), _cl_int_list::data, _cl_int_list::lumpsize, and _cl_int_list::size.

Referenced by cl_int_list_append().
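
The auto-extending behaviour of the set operation, together with the "0 if there is no n'th integer" rule of the get operation, can be sketched as follows. DemoIntList and the demo_* functions are hypothetical; the sketch assumes that the list grows in whole multiples of lumpsize and that newly exposed elements are zero-filled, neither of which is stated explicitly above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a cl_int_list; member names follow the references. */
typedef struct {
  int *data;
  int  size;       /* number of elements in use */
  int  allocated;  /* number of elements allocated */
  int  lumpsize;   /* allocation step, in elements */
} DemoIntList;

/* Set element n to val, extending the list if necessary.
 * Returns 0 on success, -1 if realloc fails. */
int demo_int_list_set(DemoIntList *l, int n, int val)
{
  assert(l != NULL && n >= 0 && l->lumpsize > 0);
  if (n >= l->allocated) {
    int new_alloc = (n / l->lumpsize + 1) * l->lumpsize;
    int *p = realloc(l->data, (size_t)new_alloc * sizeof *p);
    if (p == NULL)
      return -1;
    /* zero-fill the newly allocated tail so that gaps read as 0 */
    memset(p + l->allocated, 0,
           (size_t)(new_alloc - l->allocated) * sizeof *p);
    l->data = p;
    l->allocated = new_alloc;
  }
  if (n >= l->size)
    l->size = n + 1;
  l->data[n] = val;
  return 0;
}

/* Return element n, or 0 if n is outside the list. */
int demo_int_list_get(const DemoIntList *l, int n)
{
  return (n >= 0 && n < l->size) ? l->data[n] : 0;
}
```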

int cl_int_list_size ( cl_int_list  l)

Gets the current size of a cl_int_list object (number of elements on the list).

References _cl_int_list::size.

cl_lexhash_entry cl_lexhash_add ( cl_lexhash  hash,
char *  token 
)

Adds a token to a cl_lexhash table.

If the string is already in the hash, its frequency count is increased by 1.

Otherwise, a new entry is created, with an auto-assigned ID; note that the string is duplicated, so the original string that is passed to this function does not need to be kept in memory.

Parameters
hash  The hash table to add to.
token  The string to add.
Returns
A pointer to a (new or existing) entry

References _cl_lexhash::auto_grow, _cl_lexhash::buckets, cl_lexhash_check_grow(), cl_lexhash_find_i(), cl_malloc(), _cl_lexhash_entry::data, _cl_lexhash::entries, _cl_lexhash::fillrate_limit, _cl_lexhash_entry::freq, _cl_lexhash_entry::id, _cl_lexhash_entry::_cl_lexhash_entry_data::integer, _cl_lexhash_entry::key, _cl_lexhash_entry::next, _cl_lexhash::next_id, _cl_lexhash_entry::_cl_lexhash_entry_data::numeric, _cl_lexhash_entry::_cl_lexhash_entry_data::pointer, and _cl_lexhash::table.

Referenced by encode_add_wattr_line(), main(), range_close(), range_declare(), range_open(), and sencode_write_region().
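
The add-or-increment behaviour described above can be sketched with a small chained hash table. Everything here is hypothetical illustration rather than the CL implementation: the Demo* types, the hash function, the assumption that IDs are assigned from 0 upwards, and the assumption that a newly created entry starts with a frequency of 1.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct DemoEntry {
  struct DemoEntry *next;  /* next entry in the same bucket */
  char *key;               /* private copy of the token */
  int   freq;
  int   id;
} DemoEntry;

typedef struct {
  DemoEntry **table;
  unsigned    buckets;
  int         next_id;
  int         entries;
} DemoLexHash;

/* Simple string hash (djb2-style), chosen only for the example. */
unsigned demo_hash(const char *s)
{
  unsigned h = 5381u;
  while (*s != '\0')
    h = h * 33u + (unsigned char)*s++;
  return h;
}

DemoLexHash *demo_new_lexhash(unsigned buckets)
{
  DemoLexHash *h;
  assert(buckets > 0);
  h = malloc(sizeof *h);
  if (h == NULL)
    return NULL;
  h->table = calloc(buckets, sizeof *h->table);
  if (h->table == NULL) {
    free(h);
    return NULL;
  }
  h->buckets = buckets;
  h->next_id = 0;
  h->entries = 0;
  return h;
}

/* Add token: bump the count of an existing entry, or create a new one. */
DemoEntry *demo_lexhash_add(DemoLexHash *h, const char *token)
{
  unsigned b = demo_hash(token) % h->buckets;
  DemoEntry *e;

  for (e = h->table[b]; e != NULL; e = e->next) {
    if (strcmp(e->key, token) == 0) {
      e->freq++;
      return e;
    }
  }
  e = malloc(sizeof *e);
  if (e == NULL)
    return NULL;
  e->key = malloc(strlen(token) + 1);   /* duplicate the caller's string */
  if (e->key == NULL) {
    free(e);
    return NULL;
  }
  strcpy(e->key, token);
  e->freq = 1;
  e->id = h->next_id++;
  e->next = h->table[b];
  h->table[b] = e;
  h->entries++;
  return e;
}

void demo_delete_lexhash(DemoLexHash *h)
{
  unsigned b;
  for (b = 0; b < h->buckets; b++) {
    DemoEntry *e = h->table[b];
    while (e != NULL) {
      DemoEntry *next = e->next;
      free(e->key);
      free(e);
      e = next;
    }
  }
  free(h->table);
  free(h);
}
```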

void cl_lexhash_auto_grow ( cl_lexhash  hash,
int  flag 
)

Turns a cl_lexhash's ability to auto-grow on or off.

When this setting is switched on, the lexhash will grow automatically to avoid performance degradation.

Note the default value for this setting is SWITCHED ON.

See also
cl_lexhash_check_grow
Parameters
hash  The hash that will be affected.
flag  New value for autogrow setting: boolean where true is on and false is off.

References _cl_lexhash::auto_grow.

void cl_lexhash_auto_grow_fillrate ( cl_lexhash  hash,
double  limit,
double  target 
)

Configure auto-grow parameters.

These settings are only relevant if auto-growing is enabled.

The decision to expand the bucket table of a lexhash is based on its fill rate, i.e. the average number of entries in each bucket. Under normal circumstances, this value corresponds to the average number of comparisons required to insert a new entry into the hash (locating an existing value should require roughly half as many comparisons).

Auto-growing is triggered if the fill rate exceeds a specified limit. The new number of buckets is chosen so that the fill rate after expansion corresponds to the specified target value.

The limit should not be set too low in order to reduce memory overhead and avoid frequent reallocation due to expansion in small increments. Good values seem to be in the range 2.0-5.0, depending on whether speed or memory efficiency is more important. A reasonable value for the target fill rate is 0.4, which corresponds to a 42% overhead over the storage required for entry data structures (48 bytes per entry vs. 8 bytes for each bucket).

See also
cl_lexhash_auto_grow, cl_lexhash_check_grow
Parameters
hashThe hash that will be affected.
limitFill rate limit, which triggers expansion of the lexhash
targetTarget fill rate after expansion (determines new number of buckets)

References _cl_lexhash::fillrate_limit, and _cl_lexhash::fillrate_target.
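
The arithmetic behind these recommendations can be checked with a short sketch. The two helper functions are hypothetical and not part of the CL API; the byte sizes (48 bytes per entry, 8 bytes per bucket) are the figures quoted in the description above.

```c
/* Hypothetical helper, not part of the CL API: memory used by the bucket
 * table relative to the memory used by the entries, for a given fill rate
 * (average number of entries per bucket). */
static double sketch_bucket_overhead(double fill_rate,
                                     double bytes_per_entry,
                                     double bytes_per_bucket)
{
  /* each entry needs 1 / fill_rate buckets on average */
  return (bytes_per_bucket / fill_rate) / bytes_per_entry;
}

/* Hypothetical helper: expansion is triggered when the current fill rate
 * exceeds the configured limit. */
static int sketch_should_grow(long entries, long buckets, double limit)
{
  return ((double)entries / (double)buckets) > limit;
}
```

With the figures from the text, `sketch_bucket_overhead(0.4, 48.0, 8.0)` evaluates to 20/48, i.e. roughly 0.42, which matches the 42% overhead stated above.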

int cl_lexhash_del ( cl_lexhash  hash,
char *  token 
)

Deletes a string from a hash.

The entry corresponding to the specified string is removed from the lexhash. If the string is not in the lexhash to begin with, no action is taken.

Parameters
hashThe hash to alter.
tokenThe string to remove.
Returns
The frequency of the deleted entry (0 if the string was not found in the hash).

References cl_delete_lexhash_entry(), cl_lexhash_find_i(), _cl_lexhash::entries, _cl_lexhash_entry::freq, _cl_lexhash_entry::next, and _cl_lexhash::table.

cl_lexhash_entry cl_lexhash_find ( cl_lexhash  hash,
char *  token 
)

Finds the entry corresponding to a particular string within a cl_lexhash.

This function is basically a wrapper around the internal function cl_lexhash_find_i.

See also
cl_lexhash_find_i
Parameters
hashThe hash to search.
tokenThe key-string to look for.
Returns
The entry that is found (or NULL if the string is not in the hash).

References cl_lexhash_find_i().

Referenced by main(), range_close(), range_open(), range_print_registry_line(), and sencode_write_region().

int cl_lexhash_freq ( cl_lexhash  hash,
char *  token 
)

Gets the frequency of a particular string within a lexhash.

Parameters
hashThe hash to look in.
tokenThe string to look for.
Returns
The frequency of that string, or 0 if the string is not in the hash (which is, of course, actually its frequency).

References cl_lexhash_find_i(), and _cl_lexhash_entry::freq.

Referenced by main(), and range_open().

int cl_lexhash_id ( cl_lexhash  hash,
char *  token 
)

Gets the ID of a particular string within a lexhash.

Note this is the ID integer that identifies THAT PARTICULAR STRING, not the hash value of that string - which only identifies the bucket the string is found in!

Parameters
hashThe hash to look in.
tokenThe string to look for.
Returns
The ID code of that string, or -1 if the string is not in the hash.

References cl_lexhash_find_i(), and _cl_lexhash_entry::id.

Referenced by encode_add_wattr_line(), and range_declare().

void cl_lexhash_set_cleanup_function ( cl_lexhash  lh,
void(*)(cl_lexhash_entry)  func 
)

int cl_lexhash_size ( cl_lexhash  hash)

Gets the number of different strings stored in a lexhash.

This returns the total number of entries in all the buckets in the whole hash table.

Parameters
hashThe hash to size up.

References _cl_lexhash::entries.

char* cl_make_set ( char *  s,
int  split 
)

Generates a feature-set attribute value.

Parameters
sThe input string.
splitBoolean; if True, s is split on whitespace. If False, the function expects input in '|'-delimited format.
Returns
The set attribute value in standard syntax ('|' delimited, sorted with cl_strcmp). If there is any syntax error, cl_make_set() returns NULL.

References CDA_EFSETINV, CDA_OK, cl_delete_string_list(), cl_errno, cl_free, cl_malloc(), cl_new_string_list(), cl_strdup(), cl_string_list_append(), cl_string_list_get(), cl_string_list_qsort(), and cl_string_list_size().

Referenced by encode_add_wattr_line(), range_open(), and sencode_check_set().

void* cl_malloc ( size_t  bytes)

Safely allocates memory malloc-style.

This function allocates a block of memory of the requested size, and does a test for malloc() failure which aborts the program and prints an error message if the system is out of memory. So the return value of this function can be used without further testing for malloc() failure.

Parameters
bytesNumber of bytes to allocate
Returns
Pointer to the block of allocated memory

Referenced by accessible(), add_corpus_property(), add_grant_to_last_user(), add_host_to_list(), add_hosts_in_subnet_to_list(), add_tabular_pattern(), add_user_to_list(), AddNameToAL(), alloc_mblob(), attach_subcorpus(), binsert_g(), check_alignment_constraints(), cl_autostring_new(), cl_id2cpos_oldstyle(), cl_idlist2cpos_oldstyle(), cl_lexhash_add(), cl_make_set(), cl_new_int_list(), cl_new_lexhash(), cl_new_ngram_hash(), cl_new_regex(), cl_new_string_list(), cl_ngram_hash_add(), cl_ngram_hash_get_entries(), cl_open_stream(), cl_path_registry_quote(), cl_regex2id(), cl_string_latex2iso(), cl_string_qsort_compare(), combine_subcorpus_spec(), compute_code_lengths(), compute_grouping(), ComputeGroupExternally(), ComputeGroupInternally(), cqi_read_bool_list(), cqi_read_byte_list(), cqi_read_int_list(), cqi_read_string(), cqi_read_string_list(), cqp_run_mu_query(), cqp_run_tab_query(), creat_rev_corpus(), creat_rev_corpus_idx(), create_bitfield(), define_macro(), do_cqi_cqp_query(), do_flagged_re_variable(), do_MeetStatement(), do_mval_string(), do_undump(), do_UnionStatement(), do_XMLTag(), duplicate_corpus(), encode_generate_registry_file(), encode_scan_directory(), evaltree2searchstr(), find_corpus_registry(), FormState(), get_leaf_value(), get_matched_corpus_positions(), GetVariableItems(), GetVariableStrings(), initialize_cqp(), labellookup(), list_macros(), LookUp(), macro_iterator_next_prototype(), MacroAddSegment(), MacroHashAdd(), main(), make_attribute_hash(), make_first_tabular_pattern(), make_temp_corpus(), MakeExp(), MakeMacroHash(), mallocfile(), matchfirstpattern(), meet_mu(), mval_string_conversion(), new_reftab(), new_symbol_table(), new_tabulation_item(), NewAttributeList(), NewContextDescriptor(), NewVariable(), open_input_stream(), OptimizeStringConstraint(), parse_macro_name(), PushInputBuffer(), RangeSetop(), RangeSort(), read_mapping(), ReadHCD(), regex2dfa(), set_corpus_matchlists(), set_target(), Setop(), show_corpora_files1(), simulate_dfa(), 
SL_insert_after_point(), SortExternally(), SortSubcorpus(), SortSubcorpusRandomize(), Store(), strdupto(), try_optimization(), and VariableAddItem().
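
The "allocate or abort" pattern described for cl_malloc() can be sketched as follows. This is an illustration only, not the CL source: `sketch_safe_malloc` is a hypothetical name, and the choice of exit(1) and the wording of the message are assumptions.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for illustration; not the CL implementation.
 * Returns a usable block or terminates the program, so callers do not
 * need to test the result for NULL. */
static void *sketch_safe_malloc(size_t bytes)
{
  void *block = malloc(bytes);
  /* malloc(0) may legitimately return NULL, so only treat NULL as a
   * failure when a non-zero size was requested */
  if (block == NULL && bytes > 0) {
    fprintf(stderr, "sketch_safe_malloc: out of memory (%lu bytes requested)\n",
            (unsigned long)bytes);
    exit(1);
  }
  return block;
}
```

Centralising the check in one wrapper is what allows the many callers listed above to use the returned pointer directly.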

int cl_max_alg ( Attribute attribute)

Gets the id number of alignments on this align-attribute.

This is equal to the maximum alignment on this attribute.

Parameters
attributeAn align-attribute.
Returns
The number of alignments on this attribute.

References CDA_ENODATA, CDA_OK, cl_errno, cl_has_extended_alignment(), CompAlignData, CompXAlignData, ensure_component(), and TComponent::size.

Referenced by describecorpus_show_statistics(), do_cqi_cl_attribute_size(), and main().

int cl_max_cpos ( Attribute attribute)

Gets the maximum position on this P-attribute (ie the size of the attribute).

The result of this function is equal to the number of tokens in the attribute.

If the attribute's item sequence is compressed, this is read from the attribute's Huffman code descriptor block.

Otherwise, it is read from the size member of the Attribute's CompCorpus component.

Returns
The maximum corpus position, or an error code (if less than 0)

References ATT_POS, CDA_ENODATA, CDA_OK, check_arg, cl_errno, cl_sequence_compressed(), CompCorpus, CompHuffCodes, corpus, ensure_component(), POS_Attribute::hc, _huffman_code_descriptor::length, _Attribute::pos, and TComponent::size.

Referenced by cl_id2cpos_oldstyle(), compose_kwic_line(), compress_reversed_index(), compute_code_lengths(), creat_rev_corpus(), decode_check_huff(), decompress_check_reversed_index(), describecorpus_show_basic_info(), describecorpus_show_statistics(), do_cqi_cl_attribute_size(), get_matched_corpus_positions(), lexdecode_show(), main(), OptimizeStringConstraint(), Setop(), SortSubcorpus(), and validate_revcorp().

int cl_max_id ( Attribute attribute)

Gets the maximum id on this P-attribute (ie the range of the attribute's ID codes).

The result of this function is equal to the number of types in this attribute.

See also
get_attribute_size
Returns
The maximum Id, or an error code (if less than 0)

References ATT_POS, CDA_ENODATA, CDA_OK, check_arg, cl_errno, CompLexiconIdx, ensure_component(), and TComponent::size.

Referenced by cl_id2cpos_oldstyle(), compress_reversed_index(), compute_code_lengths(), creat_rev_corpus(), create_feature_maps(), decompress_check_reversed_index(), describecorpus_show_statistics(), do_cqi_cl_lexicon_size(), get_matched_corpus_positions(), lexdecode_show(), main(), and validate_revcorp().

int cl_max_struc ( Attribute a)

Gets the maximum for this S-attribute (ie the size of the S-attribute).

The result of this function is equal to the number of instances of this s-attribute in the corpus.

This function works as a wrapper round cl_max_struc_oldstyle that normalises it to standard return value behaviour.

Parameters
aThe s-attribute to evaluate.
Returns
The maximum corpus position, or an error code (if less than 0)

References cl_errno, and cl_max_struc_oldstyle().

Referenced by compose_kwic_line(), describecorpus_show_statistics(), do_cqi_cl_attribute_size(), main(), matchfirstpattern(), and scancorpus_add_key().

int cl_max_struc_oldstyle ( Attribute attribute,
int *  nr_strucs 
)

Gets the number of instances of an s-attribute in the corpus.

Deprecated: use cl_max_struc instead.

See also
cl_max_struc.
Parameters
attributeThe s-attribute to count.
nr_strucsThe number of instances is put here.
Returns
boolean: true for all OK, false for problem.

References ATT_STRUC, CDA_ENODATA, CDA_OK, check_arg, cl_errno, CompStrucData, ensure_component(), and TComponent::size.

Referenced by cl_max_struc().

Attribute* cl_new_attribute_oldstyle ( Corpus corpus,
char *  attribute_name,
int  type,
char *  data 
)

Finds an attribute that matches the specified parameters, if one exists, for the given corpus.

Note that although this is a cl_new_* function, and it is the canonical way that we get an Attribute to call Attribute-functions on, it doesn't actually create any kind of object. The Attribute exists already as one of the dependents of the Corpus object; this function simply locates it and returns a pointer to it.

This function is DEPRECATED. Use cl_new_attribute() instead (which is actually a macro to this function, but the parameter list is different).

See also
cl_new_attribute
Parameters
corpusThe corpus in which to search for the attribute.
attribute_nameThe name of the attribute (i.e. the handle it has in the registry file).
typeType of attribute to be searched for.
dataNOT USED.
Returns
Pointer to Attribute object, or NULL if not found.

References _Attribute::any, TCorpus::attributes, STREQ, and _Attribute::type.

Referenced by drop_attribute(), get_matched_corpus_positions(), and main().

Corpus* cl_new_corpus ( char *  registry_dir,
char *  registry_name 
)

Creates a Corpus object to represent a given indexed corpus, located in a given directory accessible to the program.

Parameters
registry_dirPath to the CWB registry directory from which the corpus is to be loaded. This may be NULL, in which case the default registry directory is used.
registry_nameThe CWB-name of the indexed corpus to load (in the all-lowercase form)
Returns
Pointer to the resulting Corpus object.

References check_access_conditions(), cl_delete_corpus(), cl_free, cl_id_tolower(), cl_id_validate(), cl_standard_registry(), cl_strdup(), corpus, cregcorpus, cregin, cregin_name, cregin_path, cregparse(), cregrestart(), find_corpus(), find_corpus_registry(), TCorpus::id, loaded_corpora, TCorpus::next, TCorpus::nr_of_loads, TCorpus::registry_dir, and TCorpus::registry_name.

Referenced by main(), printAlignedStrings(), and sencode_parse_options().

cl_int_list cl_new_int_list ( void  )

cl_lexhash cl_new_lexhash ( int  buckets)

Creates a new cl_lexhash object.

Parameters
bucketsThe number of buckets in the newly-created cl_lexhash; set to 0 to use the default number of buckets.
Returns
The new cl_lexhash.

References _cl_lexhash::auto_grow, _cl_lexhash::buckets, cl_calloc(), cl_malloc(), _cl_lexhash::cleanup_func, DEFAULT_FILLRATE_LIMIT, DEFAULT_FILLRATE_TARGET, DEFAULT_NR_OF_BUCKETS, _cl_lexhash::entries, _cl_lexhash::fillrate_limit, _cl_lexhash::fillrate_target, find_prime(), _cl_lexhash::next_id, and _cl_lexhash::table.

Referenced by cl_lexhash_check_grow(), main(), range_declare(), sencode_write_region(), and wattr_declare().

cl_ngram_hash cl_new_ngram_hash ( int  N,
int  buckets 
)

Creates a new cl_ngram_hash object.

Parameters
NN-gram size
bucketsThe number of buckets in the newly-created cl_ngram_hash; set to 0 to use the default number of buckets.
Returns
The new cl_ngram_hash.

References _cl_ngram_hash::auto_grow, _cl_ngram_hash::buckets, cl_calloc(), cl_malloc(), DEFAULT_FILLRATE_LIMIT, DEFAULT_FILLRATE_TARGET, DEFAULT_NR_OF_BUCKETS, _cl_ngram_hash::entries, _cl_ngram_hash::fillrate_limit, _cl_ngram_hash::fillrate_target, find_prime(), _cl_ngram_hash::iter_bucket, _cl_ngram_hash::iter_point, _cl_ngram_hash::N, and _cl_ngram_hash::table.

Referenced by cl_ngram_hash_check_grow(), ComputeGroupInternally(), and main().

CL_Regex cl_new_regex ( char *  regex,
int  flags,
CorpusCharset  charset 
)

Creates a new CL_Regex object (i.e. a regular expression buffer).

This function compiles the regular expression according to the specified flags (IGNORE_CASE and/or IGNORE_DIAC and/or REQUIRE_NFC) and for the specified character encoding. The regex is automatically anchored to the start and end of the string (i.e. wrapped in ^(?:...)$).

The regular expression engine used is PCRE. However, the regex is optimized by scanning it for literal strings ("grains") that must be contained in any match; the grains can be used as a fast pre-filter (using Boyer-Moore search for the grains).

The optimizer only understands a subset of PCRE syntax:

  • literal characters (alphanumeric, safe punctuation, escaped punctuation)
  • numeric character codes ( and )
  • escape sequences for character classes and Unicode properties
  • all repetition operators
  • simple alternatives (...|...|...)
  • nested capturing (...) and non-capturing (?:...) groups

Any regexp that contains other syntactic elements such as

  • character sets [...]
  • named groups, look-ahead and look-behind patterns, etc.
  • backreferences
  • modifiers such as (?i)

cannot be parsed and optimized. Note that even if a regexp is parsed by the optimizer, it might not be able to extract all grains (because grain recognition uses an even more restrictive syntax).

The optimizer is always disabled with IGNORE_DIAC if either PCRE JIT is available or the charset is UTF-8. Testing has shown that in these cases the overhead from case-folding each input string outweighs the benefits of the optimizer.

Parameters
regexString containing the regular expression
flagsIGNORE_CASE, or IGNORE_DIAC, or both, or 0.
charsetThe character set of the regex.
Returns
The new CL_Regex object, or NULL in case of error.

References CDA_EBADREGEX, CDA_OK, _CL_Regex::charset, charset, cl_debug, cl_errno, cl_free, cl_malloc(), CL_MAX_LINE_LENGTH, cl_regex_error, cl_regopt_analyse(), cl_regopt_grain, cl_regopt_grains, cl_regopt_utf8, cl_string_canonical(), cl_string_latex2iso(), _CL_Regex::extra, _CL_Regex::grains, _CL_Regex::haystack_buf, _CL_Regex::haystack_casefold, _CL_Regex::icase, _CL_Regex::idiac, IGNORE_CASE, IGNORE_DIAC, _CL_Regex::needle, PCRE_STUDY_JIT_COMPILE, regopt_data_copy_to_regex_object(), REQUIRE_NFC, and utf8.

Referenced by cl_regex2id(), do_flagged_string(), do_XMLTag(), main(), and scancorpus_add_key().
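
The two ideas described above, automatic anchoring and a literal "grain" used as a cheap pre-filter, can be sketched in a self-contained way. This sketch deliberately swaps in different techniques: it uses the POSIX <regex.h> engine instead of PCRE, a plain group `^(...)$` because POSIX extended syntax has no `(?:...)`, and strstr() instead of Boyer-Moore search. The grain is supplied by the caller rather than extracted by an optimizer. All `sketch_` names are hypothetical.

```c
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Compiles `pattern` so that it must match the whole string.
 * The group is needed so that alternatives are anchored as a whole:
 * "^a|b$" would also match "ab", whereas "^(a|b)$" does not.
 * Returns 0 on success, -1 on error. */
static int sketch_compile_anchored(regex_t *rx, const char *pattern)
{
  char buf[256];                     /* assumption: patterns are short */
  int n = snprintf(buf, sizeof buf, "^(%s)$", pattern);
  if (n < 0 || (size_t)n >= sizeof buf)
    return -1;
  return regcomp(rx, buf, REG_EXTENDED | REG_NOSUB) == 0 ? 0 : -1;
}

/* Matches `str` against the compiled regex. If a grain (a literal substring
 * that every match must contain) is given, strings without the grain are
 * rejected before the regex engine is called. */
static int sketch_match(const regex_t *rx, const char *grain, const char *str)
{
  if (grain != NULL && strstr(str, grain) == NULL)
    return 0;                        /* grain absent: the regex cannot match */
  return regexec(rx, str, 0, NULL, 0) == 0;
}
```

The pre-filter only ever rejects strings early; a string that contains the grain still has to pass the full regex match.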

PositionStream cl_new_stream ( Attribute attribute,
int  id 
)

cl_string_list cl_new_string_list ( void  )

CorpusProperty cl_next_corpus_property ( CorpusProperty  prop)

Gets the next corpus property on the list of properties.

(The corpus properties iterator / property datatype is public.)

Parameters
propThe current property.
Returns
The next property on the list, or NULL if there isn't one.

References TCorpusProperty::next.

Referenced by cl_corpus_property(), and corpus_info().

cl_ngram_hash_entry cl_ngram_hash_add ( cl_ngram_hash  hash,
int *  ngram,
unsigned int  f 
)

Adds an n-gram to a cl_ngram_hash table.

If the n-gram is already in the hash, its frequency count is increased by the specified value f.

Otherwise, a new entry is created and its frequency count is set to f. The n-gram is embedded in the new hash entry, so the original array does not need to be kept in memory.

Parameters
hashThe hash table to add to.
ngramThe n-gram to add.
fFrequency count of the n-gram.
Returns
A pointer to a (new or existing) entry

References _cl_ngram_hash::auto_grow, _cl_ngram_hash::buckets, cl_malloc(), cl_ngram_hash_check_grow(), cl_ngram_hash_find_i(), _cl_ngram_hash::entries, _cl_ngram_hash::fillrate_limit, _cl_ngram_hash_entry::freq, MAX_ENTRIES, _cl_ngram_hash::N, _cl_ngram_hash_entry::next, _cl_ngram_hash_entry::ngram, and _cl_ngram_hash::table.

Referenced by ComputeGroupInternally(), and main().

void cl_ngram_hash_auto_grow ( cl_ngram_hash  hash,
int  flag 
)

Turns a cl_ngram_hash's ability to auto-grow on or off.

When this setting is switched on, the ngram_hash will grow automatically to avoid performance degradation.

Note the default value for this setting is SWITCHED ON.

See also
cl_ngram_hash_auto_grow_fillrate, cl_ngram_hash_check_grow
Parameters
hashThe hash that will be affected.
flagNew value for autogrow setting: boolean where true is on and false is off.

References _cl_ngram_hash::auto_grow.

Referenced by main().

void cl_ngram_hash_auto_grow_fillrate ( cl_ngram_hash  hash,
double  limit,
double  target 
)

Configure auto-grow parameters.

These settings are only relevant if auto-growing is enabled.

The decision to expand the bucket table of a ngram_hash is based on its fill rate, i.e. the average number of entries in each bucket. Under normal circumstances, this value corresponds to the average number of comparisons required to insert a new entry into the hash (locating an existing value should require roughly half as many comparisons).

Auto-growing is triggered if the fill rate exceeds a specified limit. The new number of buckets is chosen so that the fill rate after expansion corresponds to the specified target value.

The two fill rate parameters represent a trade-off between memory overhead (8 bytes for each bucket) and performance (average number of entries that have to be checked for each hash access), which depends crucially on the value of N (i.e. n-gram size).

For N=1, a bucket table with low fill rate incurs a substantial memory overhead, which may even exceed the storage required for the entries themselves. For large N, the relative memory overhead is much smaller, while checking the list of entries in a bucket becomes more expensive (N integer comparisons for each item).

Note that the ratio limit / target determines how often the bucket table has to be reallocated; it should not be smaller than 4.0.

A reasonable value for the fill rate limit seems to be around 5.0; if speed is crucial, N is relatively large, and memory footprint isn't a concern, smaller values down to 2.0 might be chosen. The target fill rate should not be set too low for small N. If N=1, a target fill rate of 0.5 results in 100% memory overhead after expansion of the bucket table (16 bytes per entry vs. 8 bytes each for twice as many buckets as there are entries).

When working on very large data sets, it is recommended to disable auto-grow and initialise the n-gram hash with a sufficiently large number of buckets.

See also
cl_ngram_hash_auto_grow, cl_ngram_hash_check_grow
Parameters
hashThe hash that will be affected.
limitFill rate limit, which triggers expansion of the n-gram hash
targetTarget fill rate after expansion (determines new number of buckets)

References _cl_ngram_hash::fillrate_limit, and _cl_ngram_hash::fillrate_target.
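
The relationship between the target fill rate and the size of the bucket table after expansion can be sketched as follows. Both helpers are hypothetical and not part of the CL API; the sketch simply rounds up, whereas the library may choose a different number of buckets (for example a nearby prime).

```c
/* Hypothetical helper, not part of the CL API: smallest number of buckets
 * for which entries / buckets does not exceed the target fill rate. */
static long sketch_buckets_after_growth(long entries, double target)
{
  double exact = (double)entries / target;
  long buckets = (long)exact;
  if ((double)buckets < exact)
    buckets++;                       /* round up */
  return buckets;
}

/* Hypothetical helper: checks the recommendation that the ratio
 * limit / target should not be smaller than 4.0. */
static int sketch_ratio_ok(double limit, double target)
{
  return limit / target >= 4.0;
}
```

For the N=1 case from the text, 1000 entries with a target of 0.5 give 2000 buckets, i.e. 16000 bytes of bucket table against 16000 bytes of entries, which is the 100% overhead mentioned above.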

int cl_ngram_hash_del ( cl_ngram_hash  hash,
int *  ngram 
)

Deletes an n-gram from a hash.

The entry corresponding to the specified n-gram is removed from the cl_ngram_hash. If the n-gram is not in the hash to begin with, no action is taken.

Parameters
hashThe hash to alter.
ngramThe n-gram to remove.
Returns
The frequency of the deleted entry (0 if not found).

References cl_free, cl_ngram_hash_find_i(), _cl_ngram_hash::entries, _cl_ngram_hash_entry::freq, _cl_ngram_hash_entry::next, and _cl_ngram_hash::table.

cl_ngram_hash_entry cl_ngram_hash_find ( cl_ngram_hash  hash,
int *  ngram 
)

Finds the entry corresponding to a particular n-gram within a cl_ngram_hash.

This function is basically a wrapper around the internal function cl_ngram_hash_find_i.

See also
cl_ngram_hash_find_i
Parameters
hashThe hash to search.
ngramThe n-gram to look for.
Returns
The entry that is found (or NULL if the n-gram is not in the hash).

References cl_ngram_hash_find_i().

int cl_ngram_hash_freq ( cl_ngram_hash  hash,
int *  ngram 
)

Gets the frequency of a particular n-gram within a cl_ngram_hash.

Parameters
hashThe hash to look in.
ngramThe ngram to look for.
Returns
The frequency of that n-gram, or 0 if it is not in the hash

References cl_ngram_hash_find_i(), and _cl_ngram_hash_entry::freq.

Referenced by ComputeGroupInternally().

cl_ngram_hash_entry* cl_ngram_hash_get_entries ( cl_ngram_hash  hash,
int *  ret_size 
)

Returns allocated vector of pointers to all entries of the n-gram hash.

Must be freed by the application and can be modified, e.g. for sorting. Use cl_ngram_hash_size() to find out how many entries there are.

This function returns a newly allocated array of cl_ngram_hash_entry pointers enumerating all entries of the hash in an unspecified order.

Parameters
hashThe n-gram hash to operate on.
ret_sizeIf not NULL, the number of entries in the returned array will be stored in this location.

References _cl_ngram_hash::buckets, cl_malloc(), _cl_ngram_hash::entries, _cl_ngram_hash_entry::next, and _cl_ngram_hash::table.

cl_ngram_hash_entry cl_ngram_hash_iterator_next ( cl_ngram_hash  hash)

Iterate over all entries in an n-gram hash.

Note that there is only a single iterator for each cl_ngram_hash object, so different parts of the application code must not try to iterate through the hash at the same time.

This function returns the next entry from the hash, or NULL if there are no more entries. Keep in mind that the hash is traversed in an unspecified order.

Parameters
hashThe n-gram hash to iterate over.

References _cl_ngram_hash::buckets, _cl_ngram_hash::iter_bucket, _cl_ngram_hash::iter_point, _cl_ngram_hash_entry::next, and _cl_ngram_hash::table.

Referenced by ComputeGroupInternally(), and main().

void cl_ngram_hash_iterator_reset ( cl_ngram_hash  hash)

Simple iterator for the entries of an n-gram hash.

There is only a single iterator for each cl_ngram_hash object. The iterator is invalidated by all updates of the n-gram hash and will need to be reset afterwards.

Note that there is only a single iterator for each cl_ngram_hash object, so different parts of the application code must not try to iterate through the hash at the same time.

This function resets the iterator to the start of the hash.

Parameters
hashThe n-gram hash to iterate over.

References _cl_ngram_hash::buckets, _cl_ngram_hash::iter_bucket, _cl_ngram_hash::iter_point, and _cl_ngram_hash::table.

Referenced by ComputeGroupInternally(), and main().

void cl_ngram_hash_print_stats ( cl_ngram_hash  hash,
int  max_n 
)

Display statistics on bucket fill rates (for debugging and optimization).

This function prints a table showing the distribution of bucket sizes, i.e. how many buckets contain a given number of keys. The table will be printed to STDERR, as all debugging output in CWB.

Parameters
hashThe n-gram hash.
max_nCount buckets with up to max_n entries.

References _cl_ngram_hash::buckets, cl_free, cl_ngram_hash_stats(), and _cl_ngram_hash::entries.

Referenced by cl_ngram_hash_check_grow(), and main().

int cl_ngram_hash_size ( cl_ngram_hash  hash)

Gets the number of distinct n-grams stored in a cl_ngram_hash.

This returns the total number of entries in all the buckets in the whole hash table.

Parameters
hashThe hash to size up.

References _cl_ngram_hash::entries.

Referenced by ComputeGroupInternally(), and main().

int* cl_ngram_hash_stats ( cl_ngram_hash  hash,
int  max_n 
)

Statistics on bucket fill rates for debugging purposes.

This function returns an allocated integer array of length max_n + 1, whose i-th entry specifies the number of buckets containing i keys. For i == 0, this is the number of empty buckets. The last entry (i == max_n) is the cumulative number of buckets containing i or more entries.

Parameters
hashThe n-gram hash.
max_nCount buckets with up to max_n entries.

References _cl_ngram_hash::buckets, cl_calloc(), _cl_ngram_hash_entry::next, and _cl_ngram_hash::table.

Referenced by cl_ngram_hash_print_stats().
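
The layout of the returned array, including the cumulative last slot, can be illustrated with a self-contained sketch. It does not walk a real cl_ngram_hash: the input is a hypothetical array `chain_len` holding the number of keys in each bucket, and `sketch_bucket_stats` is not a CL function.

```c
#include <stdlib.h>

/* Hypothetical sketch: chain_len[b] is the number of keys in bucket b.
 * Returns a newly allocated array of max_n + 1 counters (caller frees),
 * or NULL if allocation fails. */
static int *sketch_bucket_stats(const int *chain_len, int buckets, int max_n)
{
  int *stats = (int *)calloc((size_t)max_n + 1, sizeof(int));
  int b;

  if (stats == NULL)
    return NULL;
  for (b = 0; b < buckets; b++) {
    int n = chain_len[b];
    if (n > max_n)
      n = max_n;                     /* last slot collects max_n or more keys */
    stats[n]++;
  }
  return stats;
}
```

Slot 0 counts the empty buckets, and because of the clamping step the counters always sum to the total number of buckets.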

FILE* cl_open_stream ( const char *  filename,
int  mode,
int  type 
)

Open stream of specified (or guessed) type for reading or writing.

I/O streams opened with this function must always be closed with cl_close_stream()!

Parameters
filenameFilename or shell command
modeOpen for reading (CL_STREAM_READ) or writing (CL_STREAM_WRITE)
typeType of stream (see above), or guess automagically from <filename> (CL_STREAM_MAGIC)
Returns
Standard C stream, or NULL on error (with details from <cl_errno> or cl_error())

References CDA_EACCESS, CDA_EARGS, CDA_EBUFFER, CDA_EPOSIX, CDA_OK, cl_broken_pipe, cl_errno, cl_handle_sigpipe(), cl_malloc(), CL_MAX_FILENAME_LENGTH, CL_STREAM_APPEND, CL_STREAM_BZIP2, CL_STREAM_FILE, CL_STREAM_GZIP, CL_STREAM_MAGIC, CL_STREAM_MAGIC_NOPIPE, CL_STREAM_PIPE, CL_STREAM_READ, CL_STREAM_STDIO, CL_STREAM_WRITE, _CLStream::handle, _CLStream::mode, mode, _CLStream::next, open_streams, STREAM_IS_PIPE, and _CLStream::type.

Referenced by encode_get_input_line(), lexdecode_show(), main(), open_input_stream(), open_pager(), open_stream(), sencode_parse_options(), and SetVariableValue().

void cl_path_adjust_independent ( char *  path)

Standardises subdirectory-dividers in a string that represents a path into Unix-like form (ie with forward-slash), regardless of what OS we are in.

Or, to put it another way, changes backslashes into forward slashes under Windows.

This may be useful because of the need to move corpora between systems, in which case the paths need to be in '/' format; Windows tolerates forward slashes in paths a hell of a lot better than *nix tolerates unescaped backslashes!

Note that the path is modified in place.

Parameters
pathThe path to modify (must be Ascii-compatible)

References SUBDIR_SEPARATOR.
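
The in-place conversion can be sketched in a few lines. This is an illustration, not the CL source: `sketch_path_to_forward_slashes` is a hypothetical name, and the sketch converts backslashes unconditionally, whereas the description above suggests the library only does so under Windows.

```c
/* Hypothetical sketch: replaces every backslash in `path` with a forward
 * slash. The string is modified in place, so it must be writable. */
static void sketch_path_to_forward_slashes(char *path)
{
  for (; *path != '\0'; path++) {
    if (*path == '\\')
      *path = '/';
  }
}
```

Because the string is modified in place, a string literal must not be passed; copy it into a writable buffer first.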

void cl_path_adjust_os ( char *  path)

Standardises subdirectory-dividers in a string that represents a path, in an OS-sensitive way.

If the CL was compiled for Unix, backslash is changed to forwardslash. If the CL was compiled for Windows, forwardslash is changed to backslash.

Note that the path is modified in place.

Parameters
pathThe path to modify (must be Ascii-compatible)

References SUBDIR_SEPARATOR.

char* cl_path_get_component ( char *  s)

Tokenises a string into components split by ':' (or ';' under Win32).

Parameters
sThe string to tokenise; or, NULL if tokenisation has already been initialised.
Returns
The next token from the string.
See also
PATH_SEPARATOR

References last, and PATH_SEPARATOR.
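
The calling convention (pass the string on the first call, then NULL to continue) resembles strtok(). The following sketch shows one way such a tokeniser could work. It is an assumption-laden illustration: `sketch_next_component` and `SKETCH_PATH_SEPARATOR` are hypothetical names, the separator is fixed to ':', and the sketch overwrites separators in the caller's string, which the real function may or may not do.

```c
#include <string.h>

#define SKETCH_PATH_SEPARATOR ':'    /* assumption: Unix-style separator */

/* Hypothetical sketch: returns the next component, or NULL when the
 * string is exhausted. Pass the string on the first call and NULL on
 * subsequent calls. The input string is modified. */
static char *sketch_next_component(char *s)
{
  static char *last = NULL;          /* position saved between calls */
  char *start, *sep;

  if (s != NULL)
    last = s;                        /* (re-)initialise tokenisation */
  if (last == NULL || *last == '\0')
    return NULL;

  start = last;
  sep = strchr(start, SKETCH_PATH_SEPARATOR);
  if (sep != NULL) {
    *sep = '\0';                     /* terminate the current component */
    last = sep + 1;
  }
  else {
    last = NULL;                     /* this was the final component */
  }
  return start;
}
```

The static variable means that only one tokenisation can be in progress at a time, the same restriction that applies to strtok().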

char* cl_path_registry_quote ( char *  path)

Add quotes and escape slashes to a file path if necessary.

This is for the HOME and INFO fields of the registry file.

If either field contains any characters that can't be treated as an "ID" token by the registry parser, then we make sure it is treated as a string (quoted) instead, and make all appropriate substitutions.

For consistency, this function always returns a newly allocated string, regardless of whether changes have been made.

Note that the way the registry parser works, it is quite happy with either "C:\dir\subdir" or "C:\\dir\\subdir" as a path for HOME or INFO.

Parameters
pathString containing the path to quotify.
Returns
The quotified string (newly allocated).

References cl_malloc(), and cl_strdup().

Referenced by encode_generate_registry_file().

unsigned int cl_random ( void  )

Gets a random number.

Part of the CL-internal random number generator.

Returns
The random number, an unsigned 32-bit integer with uniform distribution

References RNG_I1, and RNG_I2.

Referenced by cl_runif(), and SortSubcorpusRandomize().

void cl_randomize ( void  )

Initialises the CL-internal random number generator from the current system time.

References cl_set_seed().

Referenced by initialize_cqp(), and main().

int cl_read_stream ( PositionStream  ps,
int *  buffer,
int  buffer_size 
)

Reads corpus positions from a position stream to a buffer.

Parameters
psThe position stream to read.
bufferLocation to put the resulting item positions.
buffer_sizeMaximum number of item positions to read. (Fewer will be read if fewer are available).
Returns
The number of item positions that have been read. This may be less than buffer_size (and will be 0 if there are no instances of this item left).

References _position_stream_rec_::b, _position_stream_rec_::base, _position_stream_rec_::bs, _position_stream_rec_::id_freq, _position_stream_rec_::is_compressed, _position_stream_rec_::last_pos, _position_stream_rec_::nr_items, and read_golomb_code_bs().

Referenced by compress_reversed_index(), and decompress_check_reversed_index().

void* cl_realloc ( void *  block,
size_t  bytes 
)

Safely reallocates memory.

See also
cl_malloc
Parameters
blockPointer to the block to be reallocated
bytesNumber of bytes to allocate to the resized memory block
Returns
Pointer to the block of reallocated memory

Referenced by AddBuf(), AddEquiv(), AddState(), binsert_g(), cl_autostring_concat(), cl_autostring_copy(), cl_autostring_reclaim_mem(), cl_id2cpos_oldstyle(), cl_int_list_set(), cl_string_list_set(), ComputeGroupExternally(), ComputeGroupInternally(), load_macro_file(), MakeExp(), meet_mu(), NewVariable(), PushQ(), RangeSetop(), read_mapping(), Setop(), and VariableAddItem().

int* cl_regex2id ( Attribute attribute,
char *  pattern,
int  flags,
int *  number_of_matches 
)

Gets a list of the ids of those items on a given Attribute that match a particular regular-expression pattern.

The pattern is interpreted internally with the CL regex engine, q.v.

The function returns a pointer to a sequence of ints of size number_of_matches. The list is allocated with malloc(), so do a cl_free() when you don't need it any more.

See also
cl_new_regex
Parameters
attributeThe p-attribute to look on.
patternString containing the pattern against which to match each item on the attribute. Note: this pattern is a regular expression, but it is passed as a string, not a CL_Regex object. The CL_Regex object is created internally.
flagsFlags for the regular expression system via cl_new_regex.
number_of_matchesThis is set to the number of item ids found, i.e. the size of the returned buffer.
Returns
A pointer to the list of item ids.

References ATT_POS, CDA_EBADREGEX, CDA_ENODATA, CDA_OK, check_arg, cl_calloc(), cl_debug, cl_delete_regex(), cl_errno, cl_free, cl_malloc(), cl_new_regex(), cl_regex_error, cl_regex_match(), cl_regex_optimised(), cl_regopt_count_get(), cl_regopt_count_reset(), CompLexicon, CompLexiconIdx, TMblob::data, TComponent::data, ensure_component(), _Attribute::pos, TComponent::size, and word.

Referenced by do_cqi_cl_regex2id(), get_matched_corpus_positions(), lexdecode_show(), and scancorpus_add_key().

int cl_regex_match ( CL_Regex  rx,
char *  str,
int  normalize_utf8 
)

Matches a regular expression against a string.

The pre-compiled regular expression contained in the CL_Regex is compared to the string. This regex automatically uses the case/accent folding flags and character encoding that were specified when the CL_Regex constructor was called.

If the subject string is a UTF-8 string from an external source, the caller can request that the subject be normalised to canonical NFC form by setting the third argument to true.

See also
cl_new_regex
Parameters
rxThe regular expression to match.
strThe subject (the string to compare the regex to).
normalize_utf8If a UTF-8 string from an external source is passed as the subject, set this parameter to true, and the function will make sure that the comparison is based on the canonical NFC form. For known-NFC strings, this parameter should be false. If the regex is not UTF-8, this parameter is ignored.
Returns
Boolean: true if the regex matched, otherwise false.

References _CL_Regex::anchor_end, _CL_Regex::anchor_start, _CL_Regex::charset, cl_debug, cl_optimize, cl_regopt_successes, cl_string_canonical(), _CL_Regex::extra, _CL_Regex::grain, _CL_Regex::grain_len, _CL_Regex::grains, _CL_Regex::haystack_buf, _CL_Regex::haystack_casefold, _CL_Regex::icase, _CL_Regex::idiac, _CL_Regex::jumptable, _CL_Regex::needle, REQUIRE_NFC, and utf8.

Referenced by cl_regex2id(), eval_bool(), eval_constraint(), main(), matchfirstpattern(), and scancorpus_word_is_regular().

int cl_regex_optimised ( CL_Regex  rx)

Finds the level of optimisation of a CL_Regex.

This function returns the approximate level of optimisation, computed from the ratio of grain length to number of grains (0 = no grains, ergo not optimised at all).

Parameters
rxThe CL_Regex to check.
Returns
0 if rx is not optimised; otherwise an integer indicating optimisation level.

References _CL_Regex::grain_len, and _CL_Regex::grains.

Referenced by cl_regex2id().

int cl_regopt_count_get ( void  )

Get a reading from the "success counter" for optimised regexes.

The counter is incremented by 1 every time the "grain" system is used successfully to avoid calling PCRE. That is, it is incremented every time a string is scrutinised and found to contain none of the grains.

Usage:

cl_regopt_count_reset();

for (i = 0, hits = 0; i < n; i++) if (cl_regex_match(rx, haystacks[i], 0)) hits++;

fprintf(stderr, "Found %d matches; avoided regex matching %d times out of %d trials", hits, cl_regopt_count_get(), n );

See also
cl_regopt_count_reset
Returns
an integer indicating the number of times a regular expression has been matched using the regopt system of "grains", rather than by calling an external regex library.

References cl_regopt_successes.

Referenced by cl_regex2id().
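
The idea behind the success counter can be illustrated with a minimal, self-contained sketch. It is not the CL implementation: the names sketch_match, sketch_full_match, sketch_successes and sketch_full_calls are hypothetical, the "grains" are plain substrings checked with strstr(), and the full matcher is a stand-in for a regex such as "colou?r". It only shows the principle described above: if a string contains none of the grains, it cannot match, so the expensive matcher is skipped and the counter goes up by one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; none of these names exist in the CL. */
int sketch_successes = 0;   /* plays the role of the success counter */
int sketch_full_calls = 0;  /* how often the expensive matcher ran */

/* Stand-in for a full regex match of the pattern "colou?r". */
int sketch_full_match(const char *haystack)
{
    sketch_full_calls++;
    return strstr(haystack, "color") != NULL
        || strstr(haystack, "colour") != NULL;
}

/* Returns 1 on match, 0 otherwise. If grains are available and the
 * haystack contains none of them, the full matcher is skipped and the
 * success counter is incremented. */
int sketch_match(const char *haystack,
                 const char *const *grains, size_t n_grains)
{
    size_t i;

    assert(haystack != NULL);
    for (i = 0; i < n_grains; i++)
        if (strstr(haystack, grains[i]) != NULL)
            return sketch_full_match(haystack);  /* grain found: must check */
    if (n_grains > 0) {
        sketch_successes++;   /* no grain present: cannot match */
        return 0;
    }
    return sketch_full_match(haystack);          /* not optimised at all */
}
```

With the single grain "colo", a string such as "grey" is rejected without calling the full matcher, whereas "colossal" contains the grain and therefore still has to be checked in full.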

void cl_regopt_count_reset ( void  )

Reset the "success counter" for optimised regexes.

References cl_regopt_successes.

Referenced by cl_regex2id().

double cl_runif ( void  )

Gets a random number in the range [0,1] with uniform distribution.

Part of the CL-internal random number generator.

Returns
The generated random number.

References cl_random().

Referenced by do_cqi_cqp_query(), and do_reduce().

int cl_sequence_compressed ( Attribute attribute)

Checks whether the item sequence of the given P-attribute is compressed.

See comments in body of function for what counts as "compressed".

Returns
Boolean.

References ATT_POS, check_arg, cl_errno, CompCorpus, CompHuffCodes, CompHuffSeq, CompHuffSync, component_state(), ComponentLoaded, ComponentUnloaded, POS_Attribute::hc, and _Attribute::pos.

Referenced by cl_max_cpos(), and load_component().

void cl_set_debug_level ( int  level)

Sets the debug level configuration variable.

See also
cl_debug

References cl_debug.

Referenced by execute_side_effects(), main(), parse_options(), and set_default_option_values().

int cl_set_intersection ( char *  result,
const char *  s1,
const char *  s2 
)

Computes the intersection of two set attribute values.

Compute intersection of two set attribute values (in standard syntax, i.e. sorted and '|'-delimited); memory for the result string must be allocated by the caller.

Returns
0 on error, 1 otherwise

References CDA_EBUFFER, CDA_EFSETINV, CDA_OK, CL_DYN_STRING_SIZE, cl_errno, cl_strcmp(), s1, and s2.

Referenced by call_predefined_function().
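
Because both inputs to cl_set_intersection() are sorted, the intersection can be computed in a single merge-style pass. The following is a hedged sketch of that technique, not the CL code. It assumes that the standard syntax wraps every element in '|' (for example "|a|b|c|", with "|" as the empty set), it compares elements with strncmp() rather than cl_strcmp(), and it takes an explicit result_size argument that the real function does not have. The name sketch_set_intersection is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 on success, 0 on error (malformed input or result too small).
 * On error the contents of result are unspecified. */
int sketch_set_intersection(char *result, size_t result_size,
                            const char *s1, const char *s2)
{
    size_t used = 0;

    assert(result != NULL && s1 != NULL && s2 != NULL);
    if (result_size < 2 || s1[0] != '|' || s2[0] != '|')
        return 0;
    result[used++] = '|';
    s1++;
    s2++;
    while (*s1 && *s2) {
        const char *e1 = strchr(s1, '|');   /* end of current element */
        const char *e2 = strchr(s2, '|');
        size_t n1, n2, n;
        int cmp;

        if (e1 == NULL || e2 == NULL)
            return 0;                       /* element without closing '|' */
        n1 = (size_t)(e1 - s1);
        n2 = (size_t)(e2 - s2);
        n = n1 < n2 ? n1 : n2;
        cmp = strncmp(s1, s2, n);
        if (cmp == 0 && n1 != n2)
            cmp = n1 < n2 ? -1 : 1;         /* a prefix sorts first */
        if (cmp == 0) {                     /* element is in both sets */
            if (used + n1 + 2 > result_size)
                return 0;
            memcpy(result + used, s1, n1);
            used += n1;
            result[used++] = '|';
        }
        if (cmp <= 0) s1 = e1 + 1;          /* advance the smaller side(s) */
        if (cmp >= 0) s2 = e2 + 1;
    }
    result[used] = '\0';
    return 1;
}
```

Under these assumptions, intersecting "|a|b|c|" with "|b|c|d|" yields "|b|c|", and two disjoint sets yield the empty set "|".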

void cl_set_memory_limit ( int  megabytes)

Sets the memory limit respected by some CL functions.

See also
cl_memory_limit

References cl_memory_limit.

Referenced by main().

void cl_set_optimize ( int  state)

Turns optimization on or off.

See also
cl_optimize
Parameters
stateBoolean (true turns it on, false turns it off).

References cl_optimize.

Referenced by execute_side_effects(), main(), and set_default_option_values().

void cl_set_rng_state ( unsigned int  i1,
unsigned int  i2 
)

Restores the state of the CL-internal random number generator.

Parameters
i1The value to set the first RNG integer to (if zero, resets it to 1)
i2The value to set the second RNG integer to (if zero, resets it to 1)

References RNG_I1, and RNG_I2.

Referenced by cl_set_seed(), and SortSubcorpusRandomize().

void cl_set_seed ( unsigned int  seed)

Initialises the CL-internal random number generator.

Parameters
seedA single 32bit number to use as the seed

References cl_set_rng_state().

Referenced by cl_randomize().

int cl_set_size ( char *  s)

Counts the number of elements in a set attribute value.

This function counts the number of elements in a set attribute value (using '|'-delimited standard syntax).

Returns
The number of elements in the set, or -1 on error (in particular, if the set is malformed)

References CDA_EFSETINV, CDA_OK, and cl_errno.

Referenced by call_predefined_function().
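
As an illustration of what cl_set_size() computes, here is a small sketch that counts elements by counting delimiters. It assumes that a well-formed set value starts and ends with '|' (so "|" is the empty set and "|a|b|c|" has three elements); that assumption, and the name sketch_set_size, are not taken from the CL source, and the real function may apply stricter checks.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the number of elements, or -1 if the value is malformed. */
int sketch_set_size(const char *s)
{
    size_t len;
    int bars = 0;
    const char *p;

    if (s == NULL)
        return -1;
    len = strlen(s);
    if (len == 0 || s[0] != '|' || s[len - 1] != '|')
        return -1;                  /* must be wrapped in delimiters */
    for (p = s; *p; p++)
        if (*p == '|')
            bars++;
    assert(bars >= 1);              /* guaranteed by the check above */
    return bars - 1;                /* n elements need n + 1 delimiters */
}
```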

int cl_sort2id ( Attribute attribute,
int  sort_index_position 
)

Gets the ID code of the item at the specified position in the Attribute's sorted wordlist index.

That is, given a sort-order position, the actual ID of the corresponding item is generated.

See also
get_sortidxpos_of_id
Parameters
attributeThe (positional) Attribute whose index is to be searched.
sort_index_positionThe offset in the index where the ID code is to be found.
Returns
Either the integer ID, or an error code (if less than 0)

References ATT_POS, CDA_EIDXORNG, CDA_ENODATA, CDA_OK, check_arg, cl_errno, CompLexiconSrt, TMblob::data, TComponent::data, and ensure_component().

Referenced by lexdecode_show().

char* cl_standard_registry ( )

Gets a string containing the path of the default registry directory.

Note this is a pointer to an internal string, and therefore must not be altered or freed.

Returns
The value of the corpus-module-internal variable regdir, which is initialised from the environment variable REGISTRY_ENVVAR or, failing that, the macro REGISTRY_DEFAULT_PATH.
See also
REGISTRY_ENVVAR
REGISTRY_DEFAULT_PATH

References regdir, REGISTRY_DEFAULT_PATH, and REGISTRY_ENVVAR.

Referenced by cl_new_corpus(), find_corpus(), load_corpusnames(), and main().

int cl_str2id ( Attribute attribute,
char *  id_string 
)

Gets the ID code that corresponds to the specified string on the given P-attribute.

Parameters
attributeThe (positional) Attribute to look the string up on
id_stringThe string of an item on this attribute
Returns
Either the integer ID of the item, or an error code (if less than 0). In the latter case, the error code will also be written to cl_errno.

References ATT_POS, CDA_ENODATA, CDA_ENOSTRING, CDA_EOTHER, CDA_OK, check_arg, cl_errno, cl_strcmp(), CompLexicon, CompLexiconIdx, CompLexiconSrt, TMblob::data, TComponent::data, ensure_component(), and TComponent::size.

Referenced by create_feature_maps(), do_cqi_cl_str2id(), get_corpus_positions(), lexdecode_show(), OptimizeStringConstraint(), show_features(), and VerifyVariable().

int cl_strcmp ( char *  s1,
char *  s2 
)

CL internal string comparison (uses signed char on all platforms).

Referenced by cl_set_intersection(), cl_str2id(), cl_string_list_strcmp(), compare_cells(), and scompare().

char* cl_strcpy ( char *  buf,
const char *  src 
)

Replacement for strcpy that won't copy more than CL_MAX_LINE_LENGTH characters.

This is intended to make it easier to evade buffer overflows. But it doesn't protect against the opposite danger of losing important data from the end of a truncated string.

Note, buffer overflow is still possible if buf is a pointer to the middle of a buffer.

So this function is not a panacea, it's just a bit of a help.

It's also implemented in a way that is safe for down-strcpying, that is, if we are erasing a section from the start/middle of the string (cl_strcpy(string, string+3); for instance). The POSIX standard states that the normal strcpy has undefined behaviour if the objects overlap. That's not the case here.

Parameters
bufA string buffer to copy to.
srcThe string pointer to copy from.
Returns
In classic strcpy-stylie, this function uselessly returns buf.

References buf, and CL_MAX_LINE_LENGTH.

Referenced by cl_string_canonical(), create_feature_maps(), encode_get_input_line(), findcorpus(), ParsePrintOptions(), and range_declare().
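
The two properties described for cl_strcpy() (a length cap and safety for overlapping copies) can be sketched as follows. This is only one way to obtain that behaviour, not the CL implementation: it uses memmove(), takes the limit as a hypothetical max_len argument instead of the fixed CL_MAX_LINE_LENGTH, and assumes that the limit includes the terminating NUL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies at most max_len - 1 characters from src to buf and always
 * NUL-terminates. buf and src may overlap. Returns buf. */
char *sketch_bounded_strcpy(char *buf, const char *src, size_t max_len)
{
    size_t n = 0;

    assert(buf != NULL && src != NULL && max_len > 0);
    while (n < max_len - 1 && src[n] != '\0')
        n++;                        /* length to copy, capped */
    memmove(buf, src, n);           /* memmove tolerates overlap */
    buf[n] = '\0';
    return buf;
}
```

Calling sketch_bounded_strcpy(string, string + 3, max_len) erases the first three characters of string in place, which is the "down-strcpying" case mentioned above. As with cl_strcpy(), the caller must still make sure that buf really has room for max_len bytes.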

char* cl_strdup ( const char *  string)

void cl_string_canonical ( char *  s,
CorpusCharset  charset,
int  flags 
)

Converts a string to canonical form.

The "canonical form" of a string is for use in comparisons where case-insensitivity and/or diacritic insensitivity is desired.

Note that the string s is modified in place. This means it must have enough memory to cope with any expansions made in Unicode case folding. Ideally, allocate double the length of the string (since case-folding doesn't include any one -> more-than-two mappings so far as I know).

Note also that the arguments of this function were changed in v3.2.1. Now, a CorpusCharset is needed. This is because string canonicalising works differently in UTF8, where case folding / accent folding is done by calling Unicode-aware functions. By contrast, the process for Latin1 just uses a straightforward mapping table for both sorts of folding.

In UTF8, an additional flag REQUIRE_NFC can be passed to normalize the string into the canonical pre-composed form (NFC) used internally by CWB. All strings that are going to be inserted into or searched for within an indexed corpus should be processed in this way.

Parameters
sThe string.
charsetThe character set in which the string is encoded. If this is utf8, complex accent and/or case folding will be done, as per the Unicode standard. If it is anything else, internal byte mapping tables will be used.
flagsThe flags that specify which conversions are required. Can be IGNORE_CASE | IGNORE_DIAC | REQUIRE_NFC .

References ascii, cl_free, cl_strcpy(), cl_string_maptable(), IGNORE_CASE, IGNORE_DIAC, REQUIRE_NFC, unknown_charset, and utf8.

Referenced by cl_new_regex(), cl_regex_match(), cl_string_qsort_compare(), create_feature_maps(), encode_get_input_line(), print_tabulation(), regopt_data_copy_to_regex_object(), sencode_parse_line(), SortExternally(), SortSubcorpus(), and VerifyVariable().

char* cl_string_latex2iso ( char *  str,
char *  result,
int  target_len 
)

Converts ASCII strings with latex-style backslash escapes for accented characters to ISO-8859-1 (Latin-1).

Syntax:

\"[AaOoUus..] --> corresponding ISO 8859-1 character

\{octal} --> ISO 8859-1 character

Note that if cl_allow_latex2iso is FALSE, this function will simply copy the input to the output. So it is always safe to call this function.

See also
cl_allow_latex2iso
Parameters
strThe string to convert.
resultThe location to put the altered string (which should be shorter, or at least no longer than, the input string). If this parameter is NULL, space is automatically allocated for the output. result is allowed to be the same as str.
target_lenThe maximum length of the target string. If result is NULL, then this is deduced automatically.
Returns
Pointer to the altered string (if result was NULL you need to catch this and free it when no longer needed).

References cl_allow_latex2iso, cl_malloc(), cl_strdup(), popc, and pushc.

Referenced by cl_new_regex(), do_flagged_string(), do_SetVariableValue(), and do_XMLTag().

void cl_string_list_append ( cl_string_list  l,
char *  val 
)

Appends a string pointer to the end of a cl_string_list object.

References cl_string_list_set(), and _cl_string_list::size.

Referenced by cl_corpus_list_attributes(), cl_make_set(), cwbci_check_line(), encode_parse_options(), encode_scan_directory(), and range_declare().

char* cl_string_list_get ( cl_string_list  l,
int  n 
)

Retrieves an element from a cl_string_list object.

Parameters
lThe list to search.
nThe element to retrieve.
Returns
The n'th string on the list, or NULL if there is no n'th string. Note that the returned pointer references the ACTUAL DATA in the list, not a copy; if you want a copy, you must make one yourself.

References _cl_string_list::data, and _cl_string_list::size.

Referenced by cl_make_set(), cwbci_check_line(), encode_get_input_line(), encode_parse_options(), main(), range_close(), range_open(), and range_print_registry_line().

void cl_string_list_lumpsize ( cl_string_list  l,
int  s 
)

Sets the lumpsize of a cl_string_list object.

See also
LUMPSIZE
Parameters
lThe cl_string_list.
sThe new lumpsize.

References LUMPSIZE, and _cl_string_list::lumpsize.

void cl_string_list_qsort ( cl_string_list  l)

Sorts a cl_string_list object.

The list of strings is sorted using cl_strcmp().

See also
cl_strcmp

References cl_string_list_strcmp(), _cl_string_list::data, and _cl_string_list::size.

Referenced by cl_make_set(), and encode_scan_directory().

void cl_string_list_set ( cl_string_list  l,
int  n,
char *  val 
)

Sets a string pointer on a cl_string_list object.

The n'th element on the list is set to val, and the list is auto-extended if necessary.

References _cl_string_list::allocated, cl_realloc(), _cl_string_list::data, _cl_string_list::lumpsize, and _cl_string_list::size.

Referenced by cl_string_list_append().
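
The auto-extension performed by cl_string_list_set() can be pictured with the following sketch of a list that grows in lumps. The struct mirrors the field names listed in the references above (data, size, allocated, lumpsize), but the layout, the growth rule and the names sketch_string_list, sketch_list_set and sketch_list_get are assumptions made for illustration, and plain realloc() is used in place of cl_realloc().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    char **data;     /* array of string pointers (not copies) */
    int size;        /* number of elements in use */
    int allocated;   /* number of slots currently allocated */
    int lumpsize;    /* slots are added in multiples of this */
} sketch_string_list;

/* Sets element n to val, growing the array if needed.
 * Returns 1 on success, 0 on failure. */
int sketch_list_set(sketch_string_list *l, int n, char *val)
{
    assert(l != NULL && l->lumpsize > 0);
    if (n < 0)
        return 0;
    if (n >= l->allocated) {
        int new_alloc = ((n / l->lumpsize) + 1) * l->lumpsize;
        char **p = realloc(l->data, (size_t)new_alloc * sizeof *p);
        int i;

        if (p == NULL)
            return 0;
        for (i = l->allocated; i < new_alloc; i++)
            p[i] = NULL;            /* unused slots read back as NULL */
        l->data = p;
        l->allocated = new_alloc;
    }
    l->data[n] = val;
    if (n >= l->size)
        l->size = n + 1;
    return 1;
}

/* Returns element n, or NULL if there is no such element. */
char *sketch_list_get(const sketch_string_list *l, int n)
{
    assert(l != NULL);
    return (n >= 0 && n < l->size) ? l->data[n] : NULL;
}
```

Growing in lumps keeps the number of reallocations low when elements are appended one at a time; a larger lumpsize trades memory for fewer realloc() calls.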

int cl_string_list_size ( cl_string_list  l)

Gets the current size of a cl_string_list object (number of elements on the list).

References _cl_string_list::size.

Referenced by cl_make_set(), cwbci_check_line(), encode_parse_options(), main(), range_close(), range_open(), and range_print_registry_line().

int cl_string_qsort_compare ( const char *  s1,
const char *  s2,
CorpusCharset  charset,
int  flags,
int  reverse 
)

Compares two strings in a qsort-style.

This function is designed to be suitable for use as a callback with qsort(). As such, its return values are negative if s1 is "less than" s2; zero if the two strings are the same; and positive if s1 is "greater than" s2. But of course you can also use it on its own.

You cannot use it directly with qsort as its parameters are wrong. It needs to be wrapped in another function that (at least) provides the charset, flags and reverse arguments (e.g. from global variables or by calling other functions).

The two strings must be in the same character set. Both will be made canonical in accordance with the flags argument if it is set. Also, the comparison can be done on reverse-order strings.

Note that if either flags or reverse is non-zero, then memory allocation will be necessary. If you are calling this function in a loop, that could quickly get costly. To avoid this, a pair of one-time-allocated buffers are used - but this doesn't dispense with all need for allocation. [Another option would be to allow a buffer to be optionally supplied....]

If charset == utf8 and strings are passed in from external sources, the flag REQUIRE_NFC should always be specified to obtain consistent results.

Parameters
s1First string to compare.
s2Second string to compare.
charsetCharacter set of the two strings.
flagsIGNORE_CASE, IGNORE_DIAC, REQUIRE_NFC
reverseBoolean: if true, strings are compared from end to beginning, rather than beginning to end.
Returns
0 if the strings are the same. 1 if s1 is greater. -1 if s2 is greater.

References cl_free, cl_malloc(), CL_MAX_LINE_LENGTH, cl_string_canonical(), s1, s2, and utf8.

Referenced by i2compare().
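
The wrapper pattern required for qsort() might look like the sketch below. To keep it self-contained, a simplified stand-in (sketch_compare, with only the reverse argument) takes the place of cl_string_qsort_compare(), and the extra argument is supplied from a file-level variable. All names here are hypothetical; in real code the callback would forward charset, flags and reverse to cl_string_qsort_compare() in the same way.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Extra argument that qsort() cannot pass; a hypothetical global. */
static int sketch_reverse = 0;

/* Stand-in for the five-argument comparison (charset and flags omitted). */
static int sketch_compare(const char *s1, const char *s2, int reverse)
{
    size_t i = strlen(s1), j = strlen(s2);
    int c;

    if (!reverse) {
        c = strcmp(s1, s2);
        return (c > 0) - (c < 0);
    }
    while (i > 0 && j > 0) {            /* compare from end to beginning */
        unsigned char a = (unsigned char)s1[--i];
        unsigned char b = (unsigned char)s2[--j];
        if (a != b)
            return a < b ? -1 : 1;
    }
    return (i > 0) - (j > 0);           /* the exhausted string sorts first */
}

/* The actual qsort() callback: unpacks the elements, adds the extras. */
static int sketch_qsort_callback(const void *a, const void *b)
{
    const char *s1 = *(const char *const *)a;
    const char *s2 = *(const char *const *)b;
    return sketch_compare(s1, s2, sketch_reverse);
}

/* Sorts an array of n string pointers in place. */
void sketch_sort_strings(const char **strings, size_t n, int reverse)
{
    assert(strings != NULL || n == 0);
    sketch_reverse = reverse;
    qsort(strings, n, sizeof *strings, sketch_qsort_callback);
}
```

Passing the extra arguments through a global is simple but not re-entrant; code that sorts from several threads would need a different mechanism.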

char* cl_string_reverse ( const char *  s,
CorpusCharset  charset 
)

Creates a "backwards" version of the specified string.

The memory for the reversed string is newly allocated. (This is potentially wasteful, but it occurs in the depths of GLib, so short of reinventing the wheel we have to live with it.)

Parameters
sString to reverse.
charsetThe character set of the string.
Returns
Pointer to the new string.

References cl_strdup(), and utf8.

Referenced by SortExternally(), and SortSubcorpus().
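
The reason cl_string_reverse() needs the character set is that a UTF-8 string cannot be reversed byte by byte without breaking multi-byte characters. The sketch below shows one way to reverse a UTF-8 string character by character using only the C standard library. It is not how the CL does it (the text above says the work is delegated to GLib); the name sketch_utf8_reverse is hypothetical, and the input is assumed to be valid UTF-8.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated, character-wise reversed copy of s,
 * or NULL if allocation fails. The caller frees the result. */
char *sketch_utf8_reverse(const char *s)
{
    size_t len, end, out = 0;
    char *r;

    assert(s != NULL);
    len = strlen(s);
    end = len;
    r = malloc(len + 1);
    if (r == NULL)
        return NULL;
    while (end > 0) {
        size_t start = end - 1;
        /* step back over continuation bytes (10xxxxxx) to the lead byte */
        while (start > 0 && ((unsigned char)s[start] & 0xC0) == 0x80)
            start--;
        memcpy(r + out, s + start, end - start);  /* copy one character */
        out += end - start;
        end = start;
    }
    r[out] = '\0';
    return r;
}
```

For a single-byte character set, every character is one byte long and the same loop reduces to a plain byte reversal.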

int cl_string_utf8_continuation_byte ( unsigned char  byte)

Checks whether a given byte is a UTF-8 continuation byte.

Parameters
byteByte to check.
Returns
Boolean. True iff the byte is a continuation byte. If it is a one-byte character, or a valid start byte, false.

Referenced by compose_kwic_line().
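
In UTF-8, continuation bytes are exactly those whose two highest bits are 10, so the check performed by cl_string_utf8_continuation_byte() comes down to a single mask-and-compare. The sketch below shows that test together with a typical use, counting characters by skipping continuation bytes. Both function names are hypothetical, and the CL function may be written differently.

```c
#include <assert.h>
#include <stddef.h>

/* True iff byte has the bit pattern 10xxxxxx. */
int sketch_is_utf8_continuation(unsigned char byte)
{
    return (byte & 0xC0) == 0x80;
}

/* Counts characters in a valid UTF-8 string: every byte that is not
 * a continuation byte starts a new character. */
size_t sketch_utf8_char_count(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s; s++)
        if (!sketch_is_utf8_continuation((unsigned char)*s))
            count++;
    return count;
}
```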

int cl_string_validate_encoding ( char *  s,
CorpusCharset  charset,
int  repair 
)

Checks the encoding of a string.

This function looks for bad bytes (or byte sequences in the case of UTF8); if any are present, it judges the string invalid.

The string can optionally be "repaired" in-place by replacing bad bytes with '?' characters. If the "repair" is successful, the function returns True.

What counts as "bad" is of course relative to the character set that the string is encoded in - so this must be specified.

Parameters
sNull-terminated string to check.
charsetCorpusCharset of the string's encoding.
repairif True, replace invalid bytes by '?'
Returns
Boolean: true for valid, false for invalid.

References arabic, ascii, cyrillic, greek, hebrew, latin1, latin2, latin3, latin4, latin5, latin6, latin7, latin8, latin9, and utf8.

Referenced by create_feature_maps(), do_flagged_re_variable(), encode_get_input_line(), prepare_Query(), printAlignedStrings(), sencode_parse_line(), and VerifyVariable().

int cl_string_zap_controls ( char *  s,
CorpusCharset  charset,
char  replace,
int  zap_tabs,
int  zap_newlines 
)

Replaces any invalid control characters in a string.

"Invalid" control characters are any below 0x20.

The string is modified in situ. A typical "replace" to use would be '?' to match the action of cl_string_validate_encoding.

Parameters
sThe string to modify.
charsetThe character set of the string.
replaceThe replacement character to use. If this is 0, the character is deleted rather than replaced.
zap_tabsWhether or not tabs should be zapped (boolean).
zap_newlinesWhether or not newline (\n) and carriage-return (\r) characters should be zapped (boolean).
Returns
The number of characters replaced/deleted in the string.

Referenced by encode_get_input_line(), and sencode_parse_line().
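
The behaviour documented for cl_string_zap_controls() can be approximated by the following sketch, which rewrites the string in place with a read pointer and a write pointer. It is an illustration under assumptions, not the CL code: the charset argument is omitted (bytes below 0x20 are treated the same way in every encoding here), "newlines" is taken to mean both '\n' and '\r', and the name sketch_zap_controls is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Replaces (or, if replace is 0, deletes) control characters below 0x20.
 * Tabs and newlines are only touched if the matching flag is true.
 * Returns the number of characters replaced or deleted. */
int sketch_zap_controls(char *s, char replace, int zap_tabs, int zap_newlines)
{
    char *in, *out;
    int count = 0;

    assert(s != NULL);
    for (in = out = s; *in; in++) {
        unsigned char c = (unsigned char)*in;
        int zap = c < 0x20;

        if (c == '\t' && !zap_tabs)
            zap = 0;
        if ((c == '\n' || c == '\r') && !zap_newlines)
            zap = 0;
        if (!zap) {
            *out++ = *in;           /* keep the character */
        } else {
            count++;
            if (replace)
                *out++ = replace;   /* replace; otherwise just drop it */
        }
    }
    *out = '\0';
    return count;
}
```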

int cl_struc2cpos ( Attribute attribute,
int  struc_num,
int *  struc_start,
int *  struc_end 
)

Retrieves the start-and-end corpus positions of a specified structure of the given s-attribute type.

Parameters
attributeAn s-attribute.
struc_numThe instance of that s-attribute to retrieve (i.e. the struc_num'th instance of this s-attribute in the corpus).
struc_startLocation to put the starting corpus position.
struc_endLocation to put the ending corpus position.
Returns
Boolean: true if all is OK, false if there was a problem.

References ATT_STRUC, CDA_EIDXORNG, CDA_ENODATA, CDA_OK, check_arg, cl_errno, CompStrucData, TMblob::data, TComponent::data, ensure_component(), and TComponent::size.

Referenced by align_print_line(), compose_kwic_line(), decode_print_token_sequence(), do_cqi_cl_cpos2lbound(), do_cqi_cl_cpos2rbound(), do_cqi_cl_struc2cpos(), eval_constraint(), feature_match(), get_position_values(), main(), and matchfirstpattern().

char* cl_struc2str ( Attribute attribute,
int  struc_num 
)

Gets the value that is associated with the specified instance of the given s-attribute.

Parameters
attributeAn S-attribute.
struc_numID of the structure whose value is wanted (ie, function gets value of struc_num'th instance of this s-attribute)
Returns
A string; or NULL in case of error. Note that this string is a pointer to the depths of the Attribute object itself, as this function does not strdup() its result – so don't free this return value!

References ATT_STRUC, CDA_EIDXORNG, CDA_EINTERNAL, CDA_ENODATA, CDA_OK, check_arg, cl_errno, cl_struc_values(), CompStrucAVS, CompStrucAVX, TMblob::data, TComponent::data, ensure_component(), s_v_comp(), and TComponent::size.

Referenced by compute_grouping(), decode_print_surrounding_s_att_values(), decode_print_token_sequence(), do_cqi_cl_struc2str(), eval_constraint(), get_position_values(), main(), matchfirstpattern(), scancorpus_add_key(), and structure_value_at_position().

int cl_struc_values ( Attribute attribute)

char* cl_xml_entity_decode ( char *  s)

Decode XML entities in a string.

This function decodes pre-defined XML entities in string s. It overwrites the input string s and also returns s for convenience.

(The entities are &lt; &gt; &amp; &quot; &apos;).

TODO – numeric entities?

If passed NULL, it will not fall over - it will just pass NULL back!

This function is safe for strings in any encoding. The returned string will be at the same memory location and will always be the same length or shorter after the decoding of entities.

Parameters
sA string to decode.
Returns
The string (rewritten in situ).

Referenced by encode_add_wattr_line(), and range_open().
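
In-place decoding works because every entity is longer than the character it stands for, so the write position can never overtake the read position. The sketch below illustrates this for the five predefined entities listed above. It is a stand-in written for this page (the name sketch_xml_entity_decode is hypothetical), it makes a single pass so that "&amp;lt;" becomes "&lt;" rather than "<", and it leaves unknown entities untouched; the real function may differ in such details.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Decodes &lt; &gt; &amp; &quot; &apos; in place; returns s. */
char *sketch_xml_entity_decode(char *s)
{
    static const struct { const char *name; char ch; } ents[] = {
        { "&lt;", '<' }, { "&gt;", '>' }, { "&amp;", '&' },
        { "&quot;", '"' }, { "&apos;", '\'' }
    };
    char *in, *out;
    size_t i;

    if (s == NULL)
        return NULL;                /* NULL in, NULL out */
    for (in = out = s; *in; ) {
        int matched = 0;

        if (*in == '&') {
            for (i = 0; i < sizeof ents / sizeof ents[0]; i++) {
                size_t len = strlen(ents[i].name);
                if (strncmp(in, ents[i].name, len) == 0) {
                    *out++ = ents[i].ch;
                    in += len;
                    matched = 1;
                    break;
                }
            }
        }
        if (!matched)
            *out++ = *in++;         /* ordinary character: copy as is */
        assert(out <= in);          /* output never overtakes input */
    }
    *out = '\0';
    return s;
}
```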

Variable Documentation

int cl_allow_latex2iso

Boolean switch enabling/disabling latex-style escapes.

By default, it is false; if programs wish to allow these escapes they need to offer some means of changing this variable.

Note that enabling this variable may cause scrambling of the string for LatinX strings where X is not 1; and may cause undefined errors for UTF8 strings. In short, you should only activate it when you are working with a corpus whose charset is Latin1.

See also
CorpusCharset

Referenced by cl_string_latex2iso().

int cl_broken_pipe

This variable will be set to True if a SIGPIPE has been caught and ignored.

It is reset to False whenever a stream is opened or closed, so it is safe to check while writing to a plain file stream. If multiple pipes are active, there is no way to indicate which one caused the SIGPIPE.

Referenced by ascii_print_group(), ascii_print_output(), cl_close_stream(), cl_handle_sigpipe(), cl_open_stream(), do_dump(), html_print_group(), html_print_output(), latex_print_group(), latex_print_output(), print_tabulation(), sgml_print_group(), sgml_print_output(), and SortSubcorpus().

int cl_errno

char cl_regex_error[]

Error messages from (PCRE) regex compilation are placed in this buffer if cl_new_regex() fails.

This global variable is part of the CL_Regex object's API.

Referenced by cl_new_regex(), and cl_regex2id().