CWB
Data Structures | Macros | Functions
regopt.h File Reference
#include "globals.h"
#include <pcre.h>

Data Structures

struct  _CL_Regex
 Underlying structure for CL_Regex object. More...
 

Macros

#define MAX_GRAINS   12
 Maximum number of grains of optimisation. More...
 

Functions

void regopt_data_copy_to_regex_object (CL_Regex rx)
 Copy grain set from internal global variables to CL_Regex object – part of the CL Regex Optimiser. More...
 
int cl_regopt_analyse (char *regex)
 Analyses a regular expression and tries to find the best set of grains. More...
 

Macro Definition Documentation

#define MAX_GRAINS   12

Maximum number of grains of optimisation.

There's no point in scanning for too many grains, but some regexps used to be bloody inefficient.

Referenced by read_disjunction().

Function Documentation

int cl_regopt_analyse ( char *  regex)

Analyses a regular expression and tries to find the best set of grains.

Part of the regex optimiser. For a given regular expression, this function will try to extract a set of grains from regular expression {regex_string}. These grains are then used by the CL regex matcher and cl_regex2id() as a pre-filter for faster regular expression search.

The function only recognizes relatively simple regular expressions without recursive nesting, which roughly correspond to wildcard searches similar to SQL's LIKE operator. Searching for a fixed prefix, suffix or infix (or a small set of alternatives) will see the most noticeable performance improvement. Any more complex expression is passed directly to the standard PCRE evaluation. See cl_new_regex() for an overview of the supported subsets of PCRE syntax.

If successful, this function returns True and stores the grains in the optiomiser's global variables above, from which they then must be copied to a CL_Regex object with regopt_data_copy_to_regex_object(). This step will also casefold grains if necessary and prepare the Boyer-Moore jump tables.

Usage: optimised = cl_regopt_analyse(regex_string);

This is a non-exported function.

Parameters
regexString containing the regex to optimise.
Returns
Boolean: true = ok, false = couldn't optimise regex.

The code below parses a regular expression using the supported PCRE subset and attempts to extract a literal string ("grain") that must occur in every match of the regular expression. In a disjunction group, each alternative must contain a grain in order to be used for optimized matching; we refer to this as a complete grain set.

The algorithm scans the regular expression from left to right, looking for the following supported elements in the specified order:

  1. a literal grain (read_grain)
  2. a parenthesized disjunction containing grains (read_disjunction)
  3. a segment matching some, possibly empty substring (read_wildcard), including recursively nested capturing or non-capturing groups

See the respective functions for further information on these elements. Grains are collected by read_grain and read_disjunction in a local buffer. If a complete grain set has been found, it replaces the previous global grain set if it is considered to be more effective based on the length and number of grains.

As a special case, the optimizer first attempts to match the entire regexp as a simple disjunction (read_disjunction) without parentheses.

References cl_debug, cl_regopt_anchor_end, cl_regopt_anchor_start, cl_regopt_grain_len, cl_regopt_grains, grain_buffer, grain_buffer_grains, grain_buffer_len, local_grain_data, read_disjunction(), read_grain(), read_kleene(), read_wildcard(), and update_grain_buffer().

Referenced by cl_new_regex().

void regopt_data_copy_to_regex_object ( CL_Regex  rx)

Copy grain set from internal global variables to CL_Regex object – part of the CL Regex Optimiser.

This function carries out four important tasks:

  • It copies the grain set found by cl_regopt_analyse from the internal global buffer to member variables of the specified CL_Regex object.
  • It casefolds the grains if required (i.e. if rx->icase is True).
  • It cuts grains to equal byte length, preserving start/end anchoring if possible. Note that our implementation of Boyer-Moore search requires equally sized grains.
  • It calls make_jump_table to generate a lookup table for the Boyer-Moore search.

This is a non-exported function.

Parameters
rxa pointer to an initialized CL_Regex object

References _CL_Regex::anchor_end, _CL_Regex::anchor_start, _CL_Regex::charset, cl_debug, CL_MAX_LINE_LENGTH, cl_regopt_anchor_end, cl_regopt_anchor_start, cl_regopt_grain, cl_regopt_grains, cl_strdup(), cl_string_canonical(), _CL_Regex::grain, _CL_Regex::grain_len, _CL_Regex::grains, _CL_Regex::icase, IGNORE_CASE, local_grain_data, and make_jump_table().

Referenced by cl_new_regex().