CWB
special-chars.h File Reference
#include "globals.h"

Functions

unsigned char * cl_string_maptable (CorpusCharset charset, int flags)
 Gets a specified character mapping table for use in regular expressions.
 
int cl_iso_char_is_alphanumeric (unsigned char c, CorpusCharset charset)
 Checks whether a character is alphanumeric in the given ISO-8859 character set.
 

Function Documentation

int cl_iso_char_is_alphanumeric ( unsigned char  c,
CorpusCharset  charset 
)

Checks whether a character is alphanumeric in the given ISO-8859 character set.

This function is exported, but NOT via cl.h: it is intended only for use by CWB utilities and is not part of the standard API.

Returns false if charset is utf8.

Parameters
c        The character to check.
charset  The character set to check against.
Returns
Boolean.

References charset, checktable_is_alphanum, and utf8.

Referenced by scancorpus_word_is_regular().
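The contract described above can be illustrated with a small, self-contained sketch. It does not use the CL headers: SketchCharset and sketch_iso_char_is_alphanumeric() are hypothetical stand-ins, and the Latin-1 rule (0xC0..0xFF minus the multiplication and division signs) is an assumption made for illustration, so the real checktable_is_alphanum data may classify some bytes differently.

```c
/* Hypothetical stand-in for CorpusCharset; the real type comes from the CL headers. */
typedef enum { SKETCH_ASCII, SKETCH_LATIN1, SKETCH_UTF8 } SketchCharset;

/* Returns 1 if c is an ASCII letter or digit, 0 otherwise. */
static int sketch_ascii_alnum(unsigned char c)
{
  return (c >= '0' && c <= '9') ||
         (c >= 'A' && c <= 'Z') ||
         (c >= 'a' && c <= 'z');
}

/* Sketch of a per-charset alphanumeric test for a single byte. */
int sketch_iso_char_is_alphanumeric(unsigned char c, SketchCharset charset)
{
  if (charset == SKETCH_UTF8)
    return 0;                 /* documented behaviour: always false for utf8 */
  if (sketch_ascii_alnum(c))
    return 1;
  if (charset == SKETCH_LATIN1)
    /* Assumption: Latin-1 letters are 0xC0..0xFF except 0xD7 and 0xF7. */
    return c >= 0xC0 && c != 0xD7 && c != 0xF7;
  return 0;                   /* upper half treated as non-alphanumeric */
}
```

Under these assumptions, the byte 0xE9 (e with acute accent) counts as alphanumeric for SKETCH_LATIN1 but not for SKETCH_ASCII, and every byte is rejected for SKETCH_UTF8.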

unsigned char* cl_string_maptable ( CorpusCharset  charset,
int  flags 
)

Gets a specified character mapping table for use in regular expressions.

Returns pointer to static mapping table for given flags (IGNORE_CASE and IGNORE_DIAC) and character set.

Removed from the public API in 3.2.0 because it cannot work when the CorpusCharset is UTF8. The prototype was moved to special-chars.h.

Tables exist for all character sets, but for all except Latin1 and ASCII they are currently identical to the ASCII tables (i.e. awareness of case/accent relationships in the upper half of each character set has not yet been added).

Parameters
charset  The character set of this corpus. Currently ignored.
flags    The flags that specify which table is required. Can be IGNORE_CASE and/or IGNORE_DIAC.
Returns
Pointer to the appropriate mapping table. DO NOT free or modify it; it is a CL-internal data blob.

References ascii, charset, identity_tab, identity_tab_init, IGNORE_CASE, IGNORE_DIAC, maptable_init_both(), maptable_init_identity(), nocase_nodiac_tab, nocase_nodiac_tab_init, nocase_tab, nodiac_tab, and utf8.

Referenced by cl_string_canonical().
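To make the idea of a static, flag-selected mapping table concrete, here is a minimal sketch that does not depend on the CL headers. All names (SKETCH_IGNORE_CASE, SKETCH_IGNORE_DIAC, sketch_maptable(), sketch_canonical()) are hypothetical, the flag values are assumptions, and the diacritic folding covers only the Latin-1 bytes 0xC9 and 0xE9 for illustration; the real tables are more complete.

```c
#include <stddef.h>

/* Hypothetical flag values; the real IGNORE_CASE / IGNORE_DIAC are defined by CL. */
#define SKETCH_IGNORE_CASE 1
#define SKETCH_IGNORE_DIAC 2

/* One 256-entry table per flag combination (0..3), built lazily. */
static unsigned char sketch_tabs[4][256];
static int sketch_tabs_ready = 0;

static void sketch_init_tables(void)
{
  int flags, i;
  for (flags = 0; flags < 4; flags++) {
    for (i = 0; i < 256; i++) {
      unsigned char c = (unsigned char) i;
      if (flags & SKETCH_IGNORE_DIAC) {
        /* Illustration only: fold E-acute in both cases. */
        if (c == 0xC9) c = 'E';
        else if (c == 0xE9) c = 'e';
      }
      if (flags & SKETCH_IGNORE_CASE) {
        if (c >= 'A' && c <= 'Z')
          c = (unsigned char) (c + 0x20);
        else if (c >= 0xC0 && c <= 0xDE && c != 0xD7)
          c = (unsigned char) (c + 0x20);   /* Latin-1 upper-case block */
      }
      sketch_tabs[flags][i] = c;
    }
  }
  sketch_tabs_ready = 1;
}

/* Returns a pointer to static storage: callers must not free or modify it. */
const unsigned char *sketch_maptable(int flags)
{
  if (!sketch_tabs_ready)
    sketch_init_tables();
  return sketch_tabs[flags & 3];
}

/* Applies the selected table to every byte of a NUL-terminated string, in place. */
void sketch_canonical(unsigned char *s, int flags)
{
  const unsigned char *tab = sketch_maptable(flags);
  for (; *s != 0; s++)
    *s = tab[*s];
}
```

With no flags set the sketch returns an identity table, which mirrors the identity_tab entry in the references list. Returning a const pointer is a design choice of this sketch that lets the compiler enforce the "do not modify" rule; the documented function returns a plain unsigned char pointer.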