CWB
|
The CL_Regex object, and the CL Regular Expression Optimiser. More...
Macros | |
#define | PCRE_STUDY_JIT_COMPILE 0 |
Functions | |
int | cl_regopt_analyse (char *regex) |
Analyses a regular expression and tries to find the best set of grains. More... | |
CL_Regex | cl_new_regex (char *regex, int flags, CorpusCharset charset) |
Create a new CL_regex object (ie a regular expression buffer). More... | |
int | cl_regex_optimised (CL_Regex rx) |
Finds the level of optimisation of a CL_Regex. More... | |
int | cl_regex_match (CL_Regex rx, char *str, int normalize_utf8) |
Matches a regular expression against a string. More... | |
void | cl_delete_regex (CL_Regex rx) |
Deletes a CL_Regex object, and frees all resources associated with the pre-compiled regex. More... | |
int | is_safe_char (unsigned char c) |
Is the given character a 'safe' character which will only match itself in a regex? More... | |
int | is_ascii_alnum (unsigned char c) |
Is the given character an ASCII alphanumeric? More... | |
int | is_ascii_punct (unsigned char c) |
Is the given character ASCII punctuation? More... | |
int | is_hexadecimal (unsigned char c) |
Is the given character a valid hexadecimal digit? More... | |
char * | read_escape_seq (char *mark) |
Read in an escape sequence for a character or class - part of the CL Regex Optimiser. More... | |
char * | read_matchall (char *mark) |
Reads in an element matching some character - part of the CL Regex Optimiser. More... | |
char * | read_kleene (char *mark, int *one_or_more) |
Reads in a repetition operator - part of the CL Regex Optimiser. More... | |
char * | read_wildcard (char *mark) |
Reads in a wildcard - part of the CL Regex Optimiser. More... | |
char * | read_grain (char *mark, char *grain, int *len) |
Reads in a literal grain from a regex - part of the CL Regex Optimiser. More... | |
char * | read_disjunction (char *mark, int *align_start, int *align_end, int no_paren) |
Finds grains in a simple disjunction group - part of the CL Regex Optimiser. More... | |
void | update_grain_buffer (int at_start, int at_end) |
Updates the public grain buffer – part of the CL Regex Optimiser. More... | |
void | make_jump_table (CL_Regex rx) |
Computes a jump table for Boyer-Moore searches – part of the CL Regex Optimiser. More... | |
void | regopt_data_copy_to_regex_object (CL_Regex rx) |
Copy grain set from internal global variables to CL_Regex object – part of the CL Regex Optimiser. More... | |
void | cl_regopt_count_reset (void) |
Reset the "success counter" for optimised regexes. More... | |
int | cl_regopt_count_get (void) |
Get a reading from the "success counter" for optimised regexes. More... | |
Variables | |
char * | cl_regopt_grain [MAX_GRAINS] |
list of 'grains' (any matching string must contain one of these) More... | |
int | cl_regopt_grain_len |
length of shortest grain (in characters) More... | |
int | cl_regopt_grains |
number of grains More... | |
int | cl_regopt_anchor_start |
Boolean: whether grains are anchored at beginning of string. More... | |
int | cl_regopt_anchor_end |
Boolean: whether grains are anchored at end of string. More... | |
int | cl_regopt_utf8 |
whether the regular expression is in UTF-8 encoding (set before calling cl_regopt_analyse()) More... | |
char * | grain_buffer [MAX_GRAINS] |
Intermediate buffer for grains. More... | |
int | grain_buffer_len [MAX_GRAINS] |
the length of each grain (in characters) More... | |
int | grain_buffer_grains = 0 |
The number of grains currently in the intermediate buffer. More... | |
char | public_grain_data [CL_MAX_LINE_LENGTH] |
A buffer for grain strings. More... | |
char | local_grain_data [CL_MAX_LINE_LENGTH] |
A buffer for grain strings. More... | |
int | cl_regopt_successes = 0 |
A counter of how many times the "grain" system has allwoed us to avoid calling the regex engine. More... | |
char | cl_regex_error [CL_MAX_LINE_LENGTH] |
The error message from (PCRE) regex compilation are placed in this buffer if cl_new_regex() fails. More... | |
The CL_Regex object, and the CL Regular Expression Optimiser.
This is the CL front-end to POSIX regular expressions with CL semantics (most notably: CL regexes always match the entire string and NOT substrings.)
Note that the optimiser is handled automatically by the CL_Regex object.
All variables / functions containing "regopt" are internal to this module and are not exported in the CL API.
Optimisation is done by means of "grains". The grain array in a CL_Regex object is a list of short strings. Any string which will match the regex must contain at least one of these. Thus, the grains provide a quick way of filtering out strings that definitely WON'T match, and avoiding a time-wasting call to the POSIX regex matching function.
While a regex is being optimised, the grains are stored in non-exported global variables in this module. Subsequently they are transferred to members of the CL_regex object with which they are associated. The use of global variables and a fixed-size buffer for grains is partly due to historical reasons, but it does also serve to reduce memory allocation overhead.
#define PCRE_STUDY_JIT_COMPILE 0 |
Referenced by cl_new_regex().
void cl_delete_regex | ( | CL_Regex | rx | ) |
Deletes a CL_Regex object, and frees all resources associated with the pre-compiled regex.
Note that we use cl_free to deallocate the internal PCRE buffers, not pcre_free, for the simple reason that pcre_free is just a function pointer that will normally contain free, and thus we miss out on the checking that cl_free provides.
rx | The CL_Regex to delete. |
References cl_free, _CL_Regex::extra, _CL_Regex::grain, _CL_Regex::grains, _CL_Regex::haystack_buf, _CL_Regex::haystack_casefold, and _CL_Regex::needle.
Referenced by cl_regex2id(), free_booltree(), free_environment(), and main().
CL_Regex cl_new_regex | ( | char * | regex, |
int | flags, | ||
CorpusCharset | charset | ||
) |
Create a new CL_regex object (ie a regular expression buffer).
This function compiles the regular expression according to the specified flags (IGNORE_CASE and/or IGNORE_DIAC and/or REQUIRE_NFC) and for the specified character encoding. The regex is automatically anchored to the start and end of the string (i.e. wrapped in ^(?:...)$).
The regular expression engine used is PCRE. However, the regex is optimized by scanning it for literal strings ("grains") that must be contained in any match; the grains can be used as a fast pre-filter (using Boyer-Moore search for the grains).
The optimizer only understands a subset of PCRE syntax:
The optimizer is always disabled with IGNORE_DIAC if either PCRE JIT is available or the charset is UTF-8. Testing has showed that in these cases the overhead from case-folding each input string outweighs the benefits of the optimizer.
regex | String containing the regular expression |
flags | IGNORE_CASE, or IGNORE_DIAC, or both, or 0. |
charset | The character set of the regex. |
References CDA_EBADREGEX, CDA_OK, _CL_Regex::charset, charset, cl_debug, cl_errno, cl_free, cl_malloc(), CL_MAX_LINE_LENGTH, cl_regex_error, cl_regopt_analyse(), cl_regopt_grain, cl_regopt_grains, cl_regopt_utf8, cl_string_canonical(), cl_string_latex2iso(), _CL_Regex::extra, _CL_Regex::grains, _CL_Regex::haystack_buf, _CL_Regex::haystack_casefold, _CL_Regex::icase, _CL_Regex::idiac, IGNORE_CASE, IGNORE_DIAC, _CL_Regex::needle, PCRE_STUDY_JIT_COMPILE, regopt_data_copy_to_regex_object(), REQUIRE_NFC, and utf8.
Referenced by cl_regex2id(), do_flagged_string(), do_XMLTag(), main(), and scancorpus_add_key().
int cl_regex_match | ( | CL_Regex | rx, |
char * | str, | ||
int | normalize_utf8 | ||
) |
Matches a regular expression against a string.
The pre-compiled regular expression contained in the CL_Regex is compared to the string. This regex automatically uses the case/accent folding flags and character encoding that were specified when the CL_Regex constructor was called.
If the subject string is a UTF-8 string from an external sources, the caller can request enforcement of the subject to canonical NFC form by setting the third argument to true.
rx | The regular expression to match. |
str | The subject (the string to compare the regex to). |
normalize_utf8 | If a UTF-8 string from an external source is passed as the subject, set to this parameter to true, and the function will make sure that the comparison is based on the canonical NFC form. For known-NFC strings, this parameter should be false. If the regex is not UTF-8, this parameter is ignored. |
References _CL_Regex::anchor_end, _CL_Regex::anchor_start, _CL_Regex::charset, cl_debug, cl_optimize, cl_regopt_successes, cl_string_canonical(), _CL_Regex::extra, _CL_Regex::grain, _CL_Regex::grain_len, _CL_Regex::grains, _CL_Regex::haystack_buf, _CL_Regex::haystack_casefold, _CL_Regex::icase, _CL_Regex::idiac, _CL_Regex::jumptable, _CL_Regex::needle, REQUIRE_NFC, and utf8.
Referenced by cl_regex2id(), eval_bool(), eval_constraint(), main(), matchfirstpattern(), and scancorpus_word_is_regular().
int cl_regex_optimised | ( | CL_Regex | rx | ) |
Finds the level of optimisation of a CL_Regex.
This function returns the approximate level of optimisation, computed from the ratio of grain length to number of grains (0 = no grains, ergo not optimised at all).
rx | The CL_Regex to check. |
References _CL_Regex::grain_len, and _CL_Regex::grains.
Referenced by cl_regex2id().
int cl_regopt_analyse | ( | char * | regex | ) |
Analyses a regular expression and tries to find the best set of grains.
Part of the regex optimiser. For a given regular expression, this function will try to extract a set of grains from regular expression {regex_string}. These grains are then used by the CL regex matcher and cl_regex2id() as a pre-filter for faster regular expression search.
The function only recognizes relatively simple regular expressions without recursive nesting, which roughly correspond to wildcard searches similar to SQL's LIKE operator. Searching for a fixed prefix, suffix or infix (or a small set of alternatives) will see the most noticeable performance improvement. Any more complex expression is passed directly to the standard PCRE evaluation. See cl_new_regex() for an overview of the supported subsets of PCRE syntax.
If successful, this function returns True and stores the grains in the optiomiser's global variables above, from which they then must be copied to a CL_Regex object with regopt_data_copy_to_regex_object(). This step will also casefold grains if necessary and prepare the Boyer-Moore jump tables.
Usage: optimised = cl_regopt_analyse(regex_string);
This is a non-exported function.
regex | String containing the regex to optimise. |
The code below parses a regular expression using the supported PCRE subset and attempts to extract a literal string ("grain") that must occur in every match of the regular expression. In a disjunction group, each alternative must contain a grain in order to be used for optimized matching; we refer to this as a complete grain set.
The algorithm scans the regular expression from left to right, looking for the following supported elements in the specified order:
See the respective functions for further information on these elements. Grains are collected by read_grain and read_disjunction in a local buffer. If a complete grain set has been found, it replaces the previous global grain set if it is considered to be more effective based on the length and number of grains.
As a special case, the optimizer first attempts to match the entire regexp as a simple disjunction (read_disjunction) without parentheses.
References cl_debug, cl_regopt_anchor_end, cl_regopt_anchor_start, cl_regopt_grain_len, cl_regopt_grains, grain_buffer, grain_buffer_grains, grain_buffer_len, local_grain_data, read_disjunction(), read_grain(), read_kleene(), read_wildcard(), and update_grain_buffer().
Referenced by cl_new_regex().
int cl_regopt_count_get | ( | void | ) |
Get a reading from the "success counter" for optimised regexes.
The counter is incremented by 1 every time the "grain" system is used successfully to avoid calling PCRE. That is, it is incremented every time a string is scrutinised and found to contain none of the grains.
Usage:
for (i = 0, hits = 0; i < n; i++) if (cl_regex_match(rx, haystacks[i])) hits++;
fprintf(stderr, "Found %d matches; avoided regex matching %d times out of %d trials", hits, cl_regopt_count_get(), n );
References cl_regopt_successes.
Referenced by cl_regex2id().
void cl_regopt_count_reset | ( | void | ) |
Reset the "success counter" for optimised regexes.
References cl_regopt_successes.
Referenced by cl_regex2id().
int is_ascii_alnum | ( | unsigned char | c | ) |
Is the given character an ASCII alphanumeric?
ASCII alphanumeric characters comprise A-Z, a-z and 0-9; they are the only characters that form special escape sequences in PCRE regular expressions.
c | The character (cast to unsigned for the comparison). |
Referenced by read_escape_seq().
int is_ascii_punct | ( | unsigned char | c | ) |
Is the given character ASCII punctuation?
ASCII punctuation symbols are the only characters that may need to be protected by a \ in regular expressions. They cannot form special escape sequences.
c | The character (cast to unsigned for the comparison). |
Referenced by read_grain().
int is_hexadecimal | ( | unsigned char | c | ) |
Is the given character a valid hexadecimal digit?
c | The character (cast to unsigned for the comparison). |
Referenced by read_escape_seq().
int is_safe_char | ( | unsigned char | c | ) |
Is the given character a 'safe' character which will only match itself in a regex?
What counts as safe: A to Z, a to z, 0 to 9, minus, quote marks, percent, ampersand, slash, excl mark, colon, semi colon, character, underscore, tilde, any values above 0x7f (ISO 8859 extension or UTF-8 non-ASCII character).
What counts as not safe therefore includes: brackets, braces, square brackets; questionmark, plus, and star; circumflex and dollar sign; dot; hash; backslash, etc. (But, in UTF8, Unicode PUNC area equivalents of these characters will be safe.)
A safe character can never be the start of a meta element (even if it might appear as part of one), so it's safe to include in a literal grain.
c | The character (cast to unsigned for the comparison). |
Referenced by read_grain(), and read_matchall().
void make_jump_table | ( | CL_Regex | rx | ) |
Computes a jump table for Boyer-Moore searches – part of the CL Regex Optimiser.
Unlike the textbook version, this jumptable includes the last character of each grain (so we don't have to start the string comparison loops every time).
A non-exported function.
References cl_debug, _CL_Regex::grain, _CL_Regex::grain_len, _CL_Regex::grains, and _CL_Regex::jumptable.
Referenced by regopt_data_copy_to_regex_object().
char* read_disjunction | ( | char * | mark, |
int * | align_start, | ||
int * | align_end, | ||
int | no_paren | ||
) |
Finds grains in a simple disjunction group - part of the CL Regex Optimiser.
This function parses a simple parenthesized disjunction within a regular expression and attempts to extract one grain from each alternative. Grains are written to the local grain buffer. If a complete grain set has been found, the functions returns a pointer to the first character after the disjunction. Otherwise it returns mark and the caller can try to accept the group with read_matchall.
For simplicity, only the first grain in each alternative is considered. This makes it easier to check start/end alignment of the grains. Note that the local grain buffer is always mangled, even if the function is unsuccessful.
The first argument, mark, must point to the '(' at the beginning of the disjunction group (unless no_paren is set).
The booleans align_start and align_end are set to true if the grains from all alternatives are anchored at the start or end of the disjunction group, respectively.
This is a non-exported function.
mark | Pointer to the disjunction group (see also function description). |
align_start | See function description. |
align_end | See function description. |
no_paren | Attempt to read a top-level disjunction without parentheses, which must extend to the end of the string. |
References buf, grain_buffer, grain_buffer_grains, grain_buffer_len, local_grain_data, MAX_GRAINS, read_grain(), and read_wildcard().
Referenced by cl_regopt_analyse().
char* read_escape_seq | ( | char * | mark | ) |
Read in an escape sequence for a character or class - part of the CL Regex Optimiser.
This function reads one of the following escape sequences: ##, {###} ... hexadecimal character code {###} ... octal character code , , , , , ... generic character types #,
{###} ... Unicode properties #, {###} ... negated Unicode properties ... Unicode extended grapheme cluster
References is_ascii_alnum(), and is_hexadecimal().
Referenced by read_matchall().
char* read_grain | ( | char * | mark, |
char * | grain, | ||
int * | len | ||
) |
Reads in a literal grain from a regex - part of the CL Regex Optimiser.
A grain is a string of safe symbols: alphanumeric, safe punctuation and escaped punctuation; numeric character codes (, ) are not supported. The last symbol might be followed by a repetition operator. It is only included in the grain if the repetition count is at least one.
This function finds the longest grain it can starting at the point in the regex indicated by mark. If *grain is not NULL, the grain data are unescaped and copied to the specified buffer. Note that the buffer will be mangled even if no grain is matched; it is guaranteed to contain a NUL-terminated string, though. If *len is not null, the length of the grain (in characters) is stored there.
mark | Pointer to location in the regex string from which to read. |
grain | Optional pointer to a buffer into which the grain data will be copied. Guaranteed to contain a NUL-terminated string even if no grain is found. |
len | Pointer to integer in which length of grain in characters will be stored. |
References cl_regopt_utf8, is_ascii_punct(), is_safe_char(), and read_kleene().
Referenced by cl_regopt_analyse(), and read_disjunction().
char* read_kleene | ( | char * | mark, |
int * | one_or_more | ||
) |
Reads in a repetition operator - part of the CL Regex Optimiser.
This function reads in any repetition operator allowed in PCRE syntax:
mark | Pointer to location in the regex string from which to read. |
one_or_more | Optional pointer to integer, which will be set to True if repetition is not optional. |
Referenced by cl_regopt_analyse(), read_grain(), and read_wildcard().
char* read_matchall | ( | char * | mark | ) |
Reads in an element matching some character - part of the CL Regex Optimiser.
This function reads in an element known to match some character. The following elements are currently recognized:
mark | Pointer to location in the regex string from which to read. |
References cl_regopt_utf8, is_safe_char(), and read_escape_seq().
Referenced by read_wildcard().
char* read_wildcard | ( | char * | mark | ) |
Reads in a wildcard - part of the CL Regex Optimiser.
This function reads in a wildcard segment consisting of an element matching some character (read_matchall) or a capturing or non-capturing group, followed by an optional quantifier (read_kleene). It returns a pointer to the first character after the wildcard segment.
Groups are parsed recursively and must consist of one or more alternatives containing only valid wildcard segments.
mark | Pointer to location in the regex string from which to read. |
References read_kleene(), and read_matchall().
Referenced by cl_regopt_analyse(), and read_disjunction().
void regopt_data_copy_to_regex_object | ( | CL_Regex | rx | ) |
Copy grain set from internal global variables to CL_Regex object – part of the CL Regex Optimiser.
This function carries out four important tasks:
This is a non-exported function.
rx | a pointer to an initialized CL_Regex object |
References _CL_Regex::anchor_end, _CL_Regex::anchor_start, _CL_Regex::charset, cl_debug, CL_MAX_LINE_LENGTH, cl_regopt_anchor_end, cl_regopt_anchor_start, cl_regopt_grain, cl_regopt_grains, cl_strdup(), cl_string_canonical(), _CL_Regex::grain, _CL_Regex::grain_len, _CL_Regex::grains, _CL_Regex::icase, IGNORE_CASE, local_grain_data, and make_jump_table().
Referenced by cl_new_regex().
void update_grain_buffer | ( | int | at_start, |
int | at_end | ||
) |
Updates the public grain buffer – part of the CL Regex Optimiser.
This function copies the local grains to the public buffer, if they are better than the set of grains currently there. The decision is made with a heuristic based on the character length of the shortest grain and the number of different grains.
A non-exported function.
at_start | Boolean: if True, all grains are anchored on the left |
at_end | Boolean: if True, all grains are anchored on the right |
References buf, cl_regopt_anchor_end, cl_regopt_anchor_start, cl_regopt_grain, cl_regopt_grain_len, cl_regopt_grains, grain_buffer, grain_buffer_grains, grain_buffer_len, and public_grain_data.
Referenced by cl_regopt_analyse().
char cl_regex_error[CL_MAX_LINE_LENGTH] |
The error message from (PCRE) regex compilation are placed in this buffer if cl_new_regex() fails.
This global variable is part of the CL_Regex object's API.
Referenced by cl_new_regex(), and cl_regex2id().
int cl_regopt_anchor_end |
Boolean: whether grains are anchored at end of string.
Referenced by cl_regopt_analyse(), regopt_data_copy_to_regex_object(), and update_grain_buffer().
int cl_regopt_anchor_start |
Boolean: whether grains are anchored at beginning of string.
Referenced by cl_regopt_analyse(), regopt_data_copy_to_regex_object(), and update_grain_buffer().
char* cl_regopt_grain[MAX_GRAINS] |
list of 'grains' (any matching string must contain one of these)
Referenced by cl_new_regex(), regopt_data_copy_to_regex_object(), and update_grain_buffer().
int cl_regopt_grain_len |
length of shortest grain (in characters)
Referenced by cl_regopt_analyse(), and update_grain_buffer().
int cl_regopt_grains |
number of grains
Referenced by cl_new_regex(), cl_regopt_analyse(), regopt_data_copy_to_regex_object(), and update_grain_buffer().
int cl_regopt_successes = 0 |
A counter of how many times the "grain" system has allwoed us to avoid calling the regex engine.
Referenced by cl_regex_match(), cl_regopt_count_get(), and cl_regopt_count_reset().
int cl_regopt_utf8 |
whether the regular expression is in UTF-8 encoding (set before calling cl_regopt_analyse())
Referenced by cl_new_regex(), read_grain(), and read_matchall().
char* grain_buffer[MAX_GRAINS] |
Intermediate buffer for grains.
When a regex is parsed, grains for each segment are written to this intermediate buffer; if the new set of grains is better than the current one, it is copied to the cl_regopt_ variables.grains in the local buffer may have different lengths
Referenced by cl_regopt_analyse(), read_disjunction(), and update_grain_buffer().
int grain_buffer_grains = 0 |
The number of grains currently in the intermediate buffer.
Referenced by cl_regopt_analyse(), read_disjunction(), and update_grain_buffer().
int grain_buffer_len[MAX_GRAINS] |
the length of each grain (in characters)
Referenced by cl_regopt_analyse(), read_disjunction(), and update_grain_buffer().
char local_grain_data[CL_MAX_LINE_LENGTH] |
A buffer for grain strings.
Referenced by cl_regopt_analyse(), read_disjunction(), and regopt_data_copy_to_regex_object().
char public_grain_data[CL_MAX_LINE_LENGTH] |