CWB
|
Data Structures | |
struct | _CL_Regex |
Underlying structure for CL_Regex object. More... | |
Macros | |
#define | MAX_GRAINS 12 |
Maximum number of grains of optimisation. More... | |
Functions | |
void | regopt_data_copy_to_regex_object (CL_Regex rx) |
Copy grain set from internal global variables to CL_Regex object – part of the CL Regex Optimiser. More... | |
int | cl_regopt_analyse (char *regex) |
Analyses a regular expression and tries to find the best set of grains. More... | |
#define MAX_GRAINS 12 |
Maximum number of grains of optimisation.
There's no point in scanning for too many grains, but some regexps used to be bloody inefficient.
Referenced by read_disjunction().
int cl_regopt_analyse | ( | char * | regex | ) |
Analyses a regular expression and tries to find the best set of grains.
Part of the regex optimiser. For a given regular expression, this function will try to extract a set of grains from regular expression {regex_string}. These grains are then used by the CL regex matcher and cl_regex2id() as a pre-filter for faster regular expression search.
The function only recognizes relatively simple regular expressions without recursive nesting, which roughly correspond to wildcard searches similar to SQL's LIKE operator. Searching for a fixed prefix, suffix or infix (or a small set of alternatives) will see the most noticeable performance improvement. Any more complex expression is passed directly to the standard PCRE evaluation. See cl_new_regex() for an overview of the supported subsets of PCRE syntax.
If successful, this function returns True and stores the grains in the optiomiser's global variables above, from which they then must be copied to a CL_Regex object with regopt_data_copy_to_regex_object(). This step will also casefold grains if necessary and prepare the Boyer-Moore jump tables.
Usage: optimised = cl_regopt_analyse(regex_string);
This is a non-exported function.
regex | String containing the regex to optimise. |
The code below parses a regular expression using the supported PCRE subset and attempts to extract a literal string ("grain") that must occur in every match of the regular expression. In a disjunction group, each alternative must contain a grain in order to be used for optimized matching; we refer to this as a complete grain set.
The algorithm scans the regular expression from left to right, looking for the following supported elements in the specified order:
See the respective functions for further information on these elements. Grains are collected by read_grain and read_disjunction in a local buffer. If a complete grain set has been found, it replaces the previous global grain set if it is considered to be more effective based on the length and number of grains.
As a special case, the optimizer first attempts to match the entire regexp as a simple disjunction (read_disjunction) without parentheses.
References cl_debug, cl_regopt_anchor_end, cl_regopt_anchor_start, cl_regopt_grain_len, cl_regopt_grains, grain_buffer, grain_buffer_grains, grain_buffer_len, local_grain_data, read_disjunction(), read_grain(), read_kleene(), read_wildcard(), and update_grain_buffer().
Referenced by cl_new_regex().
void regopt_data_copy_to_regex_object | ( | CL_Regex | rx | ) |
Copy grain set from internal global variables to CL_Regex object – part of the CL Regex Optimiser.
This function carries out four important tasks:
This is a non-exported function.
rx | a pointer to an initialized CL_Regex object |
References _CL_Regex::anchor_end, _CL_Regex::anchor_start, _CL_Regex::charset, cl_debug, CL_MAX_LINE_LENGTH, cl_regopt_anchor_end, cl_regopt_anchor_start, cl_regopt_grain, cl_regopt_grains, cl_strdup(), cl_string_canonical(), _CL_Regex::grain, _CL_Regex::grain_len, _CL_Regex::grains, _CL_Regex::icase, IGNORE_CASE, local_grain_data, and make_jump_table().
Referenced by cl_new_regex().