CWB
|
Data Structures | |
struct | _cl_lexhash |
Underlying structure for the cl_lexhash object. More... | |
Macros | |
#define | DEFAULT_NR_OF_BUCKETS 250000 |
Defines the default number of buckets in a lexhash. More... | |
#define | DEFAULT_FILLRATE_LIMIT 2.0 |
Default parameters for auto-growing the table of buckets (. More... | |
#define | DEFAULT_FILLRATE_TARGET 0.4 |
#define | MAX_BUCKETS 1000000007 /* 1 billion (incremented to next prime number) */ |
Maximum number of buckets lexhash will try to allocate when auto-growing. More... | |
Typedefs | |
typedef void(* | cl_lexhash_cleanup_func )(cl_lexhash_entry) |
TODO: consider alternative hash functions. More... | |
Functions | |
int | is_prime (int n) |
Returns True iff n is a prime. More... | |
int | find_prime (int n) |
Returns smallest prime >= n. More... | |
unsigned int | hash_string (char *string) |
Computes 32bit hash value for string. More... | |
cl_lexhash | cl_new_lexhash (int buckets) |
Creates a new cl_lexhash object. More... | |
void | cl_delete_lexhash_entry (cl_lexhash hash, cl_lexhash_entry entry) |
Deallocates a cl_lexhash_entry object and its key string. More... | |
void | cl_delete_lexhash (cl_lexhash hash) |
Deletes a cl_lexhash object. More... | |
void | cl_lexhash_set_cleanup_function (cl_lexhash hash, cl_lexhash_cleanup_func func) |
Sets the cleanup function for a cl_lexhash. More... | |
void | cl_lexhash_auto_grow (cl_lexhash hash, int flag) |
Turns a cl_lexhash's ability to auto-grow on or off. More... | |
void | cl_lexhash_auto_grow_fillrate (cl_lexhash hash, double limit, double target) |
Configure auto-grow parameters. More... | |
int | cl_lexhash_check_grow (cl_lexhash hash) |
Grows a lexhash table, increasing the number of buckets, if necessary. More... | |
cl_lexhash_entry | cl_lexhash_find_i (cl_lexhash hash, char *token, unsigned int *ret_offset) |
Finds the entry corresponding to a particular string in a cl_lexhash. More... | |
cl_lexhash_entry | cl_lexhash_find (cl_lexhash hash, char *token) |
Finds the entry corresponding to a particular string within a cl_lexhash. More... | |
cl_lexhash_entry | cl_lexhash_add (cl_lexhash hash, char *token) |
Adds a token to a cl_lexhash table. More... | |
int | cl_lexhash_id (cl_lexhash hash, char *token) |
Gets the ID of a particular string within a lexhash. More... | |
int | cl_lexhash_freq (cl_lexhash hash, char *token) |
Gets the frequency of a particular string within a lexhash. More... | |
int | cl_lexhash_del (cl_lexhash hash, char *token) |
Deletes a string from a hash. More... | |
int | cl_lexhash_size (cl_lexhash hash) |
Gets the number of different strings stored in a lexhash. More... | |
#define DEFAULT_FILLRATE_LIMIT 2.0 |
Default parameters for auto-growing the table of buckets (.
Referenced by cl_new_lexhash().
#define DEFAULT_FILLRATE_TARGET 0.4 |
Referenced by cl_new_lexhash().
#define DEFAULT_NR_OF_BUCKETS 250000 |
Defines the default number of buckets in a lexhash.
Referenced by cl_new_lexhash().
#define MAX_BUCKETS 1000000007 /* 1 billion (incremented to next prime number) */ |
Maximum number of buckets lexhash will try to allocate when auto-growing.
Referenced by cl_lexhash_check_grow().
typedef void(* cl_lexhash_cleanup_func)(cl_lexhash_entry) |
TODO: consider alternative hash functions.
The hash function above appears to have been purloined from some version of the Perl source code. This claim cannot be confirmed, though. Perl5 has used various hash functions over time, but older versions (at least up to Perl 5.8.1) implement the simple DJB2 algorithm (see below).
According to
http://burtleburtle.net/bob/hash/
the algorithm is recommended in Don Knuth's "Art of Computer Programming" (Vol. 3, Sec. 6.4), but we haven't been able to find the actual reference there.
The URL above also discusses properties of hash functions at length and suggests a number of better algorithms.
Prime-number-sized hash tables are required by Knuth's algorithm, but make hashing more expensive. Growing a hash from 2^n to 2^(n+1) also gives a highly predicatble redistribution of buckets. Good hash functions should not require division by prime number in order to achieve good distribution.
DJB2: unsigned long hash = 5381; int c;
while (c = *str++) hash = ((hash << 5) + hash) + c; // hash * 33 + c
DJB2a: hash = hash * 33 ^ str[i]
MurmurHash: see http://en.wikipedia.org/wiki/MurmurHash
Experimental comparison: http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed A function pointer type defining functions that can be used as the "cleanup" for a deleted cl_lexhash_entry.
void cl_delete_lexhash | ( | cl_lexhash | hash | ) |
Deletes a cl_lexhash object.
This deletes all the entries in all the buckets in the lexhash, plus the cl_lexhash itself.
hash | The cl_lexhash to delete. |
References _cl_lexhash::buckets, cl_delete_lexhash_entry(), cl_free, _cl_lexhash_entry::next, and _cl_lexhash::table.
Referenced by main().
void cl_delete_lexhash_entry | ( | cl_lexhash | hash, |
cl_lexhash_entry | entry | ||
) |
Deallocates a cl_lexhash_entry object and its key string.
Also, the cleanup function is run on the entry.
Usage: cl_delete_lexhash_entry(lexhash, entry);
This is a non-exported function.
hash | The lexhash this entry belongs to (needed to locate the cleanup function, if any). |
entry | The entry to delete. |
References cl_free, and _cl_lexhash::cleanup_func.
Referenced by cl_delete_lexhash(), and cl_lexhash_del().
cl_lexhash_entry cl_lexhash_add | ( | cl_lexhash | hash, |
char * | token | ||
) |
Adds a token to a cl_lexhash table.
If the string is already in the hash, its frequency count is increased by 1.
Otherwise, a new entry is created, with an auto-assigned ID; note that the string is duplicated, so the original string that is passed to this function does not need to be kept in memory.
hash | The hash table to add to. |
token | The string to add. |
References _cl_lexhash::auto_grow, _cl_lexhash::buckets, cl_lexhash_check_grow(), cl_lexhash_find_i(), cl_malloc(), _cl_lexhash_entry::data, _cl_lexhash::entries, _cl_lexhash::fillrate_limit, _cl_lexhash_entry::freq, _cl_lexhash_entry::id, _cl_lexhash_entry::_cl_lexhash_entry_data::integer, _cl_lexhash_entry::key, _cl_lexhash_entry::next, _cl_lexhash::next_id, _cl_lexhash_entry::_cl_lexhash_entry_data::numeric, _cl_lexhash_entry::_cl_lexhash_entry_data::pointer, and _cl_lexhash::table.
Referenced by encode_add_wattr_line(), main(), range_close(), range_declare(), range_open(), and sencode_write_region().
void cl_lexhash_auto_grow | ( | cl_lexhash | hash, |
int | flag | ||
) |
Turns a cl_lexhash's ability to auto-grow on or off.
When this setting is switched on, the lexhash will grow automatically to avoid performance degradation.
Note the default value for this setting is SWITCHED ON.
hash | The hash that will be affected. |
flag | New value for autogrow setting: boolean where true is on and false is off. |
References _cl_lexhash::auto_grow.
void cl_lexhash_auto_grow_fillrate | ( | cl_lexhash | hash, |
double | limit, | ||
double | target | ||
) |
Configure auto-grow parameters.
These settings are only relevant if auto-growing is enabled.
The decision to expand the bucket table of a lexhash is based on its fill rate, i.e. the average number of entries in each bucket. Under normal circumstances, this value corresponds to the average number of comparisons required to insert a new entry into the hash (locating an existing value should require roughly half as many comparisons).
Auto-growing is triggered if the fill rate exceeds a specified limit. The new number of buckets is chosen so that the fill rate after expansion corresponds to the specified target value.
The limit should not be set too low in order to reduce memory overhead and avoid frequent reallocation due to expansion in small increments. Good values seem to be in the range 2.0-5.0; depending on whether speed or memory efficiency is more important. A reasonable value for the target fill rate is 0.4, which corresponds to a 42% overhead over the storage required for entry data structures (48 bytes per entry vs. 8 bytes for each bucket).
hash | The hash that will be affected. |
limit | Fill rate limit, which triggers expansion of the lexhash |
target | Target fill rate after expansion (determines new number of buckets) |
References _cl_lexhash::fillrate_limit, and _cl_lexhash::fillrate_target.
int cl_lexhash_check_grow | ( | cl_lexhash | hash | ) |
Grows a lexhash table, increasing the number of buckets, if necessary.
This functions is called after inserting a new entry into the lexhash. If checks whether the current fill rate exceeds the specified limit. If this is the case, and auto_grow is enabled, then the hash is expanded by increasing the number of buckets, such that the new average fill rate corresponds to the specified target value. This gives the hash better performance and makes it capable of absorbing more keys.
If the bucket table would be expanded to more than MAX_BUCKETS entries, auto-grow is automatically disabled for this lexhash.
Note: this function also implements the hashing algorithm and must be consistent with cl_lexhash_find_i().
Usage: expanded = cl_lexhash_check_grow(cl_lexhash hash);
This is a non-exported function.
hash | The lexhash to autogrow. |
References _cl_lexhash::auto_grow, _cl_lexhash::buckets, cl_debug, cl_free, cl_new_lexhash(), _cl_lexhash::entries, _cl_lexhash::fillrate_limit, _cl_lexhash::fillrate_target, hash_string(), _cl_lexhash_entry::key, MAX_BUCKETS, _cl_lexhash_entry::next, and _cl_lexhash::table.
Referenced by cl_lexhash_add().
int cl_lexhash_del | ( | cl_lexhash | hash, |
char * | token | ||
) |
Deletes a string from a hash.
The entry corresponding to the specified string is removed from the lexhash. If the string is not in the lexhash to begin with, no action is taken.
hash | The hash to alter. |
token | The string to remove. |
References cl_delete_lexhash_entry(), cl_lexhash_find_i(), _cl_lexhash::entries, _cl_lexhash_entry::freq, _cl_lexhash_entry::next, and _cl_lexhash::table.
cl_lexhash_entry cl_lexhash_find | ( | cl_lexhash | hash, |
char * | token | ||
) |
Finds the entry corresponding to a particular string within a cl_lexhash.
This function is basically a wrapper around the internal function cl_lexhash_find_i.
hash | The hash to search. |
token | The key-string to look for. |
References cl_lexhash_find_i().
Referenced by main(), range_close(), range_open(), range_print_registry_line(), and sencode_write_region().
cl_lexhash_entry cl_lexhash_find_i | ( | cl_lexhash | hash, |
char * | token, | ||
unsigned int * | ret_offset | ||
) |
Finds the entry corresponding to a particular string in a cl_lexhash.
This function is the same as cl_lexhash_find(), but *ret_offset is set to the hashtable offset computed for token (i.e. the index of the bucket within the hashtable), unless *ret_offset == NULL.
Note that this function hides the hashing algorithm details from the rest of the lexhash implementation (except cl_lexhash_check_grow, which re-implements the hashing algorithm for performance reasons).
Usage: entry = cl_lexhash_find_i(cl_lexhash hash, char *token, unsigned int *ret_offset);
This is a non-exported function.
hash | The hash to search. |
token | The key-string to look for. |
ret_offset | This integer address will be filled with the token's hashtable offset (can be NULL, in which case, ignored). |
References _cl_lexhash::buckets, hash_string(), _cl_lexhash_entry::key, _cl_lexhash_entry::next, and _cl_lexhash::table.
Referenced by cl_lexhash_add(), cl_lexhash_del(), cl_lexhash_find(), cl_lexhash_freq(), and cl_lexhash_id().
int cl_lexhash_freq | ( | cl_lexhash | hash, |
char * | token | ||
) |
Gets the frequency of a particular string within a lexhash.
hash | The hash to look in. |
token | The string to look for. |
References cl_lexhash_find_i(), and _cl_lexhash_entry::freq.
Referenced by main(), and range_open().
int cl_lexhash_id | ( | cl_lexhash | hash, |
char * | token | ||
) |
Gets the ID of a particular string within a lexhash.
Note this is the ID integer that identifies THAT PARTICULAR STRING, not the hash value of that string - which only identifies the bucket the string is found in!
hash | The hash to look in. |
token | The string to look for. |
References cl_lexhash_find_i(), and _cl_lexhash_entry::id.
Referenced by encode_add_wattr_line(), and range_declare().
void cl_lexhash_set_cleanup_function | ( | cl_lexhash | hash, |
cl_lexhash_cleanup_func | func | ||
) |
Sets the cleanup function for a cl_lexhash.
The cleanup function is called with a cl_lexhash_entry argument; it should delete any objects assocated with the entry's data field.
The cleanup function is initially set to NULL, i.e. run no function.
hash | The cl_lexhash to work with. |
func | Pointer to the function to use for cleanup. |
References _cl_lexhash::cleanup_func, and func.
int cl_lexhash_size | ( | cl_lexhash | hash | ) |
Gets the number of different strings stored in a lexhash.
This returns the total number of entries in all the buckets in the whole hash table.
hash | The hash to size up. |
References _cl_lexhash::entries.
cl_lexhash cl_new_lexhash | ( | int | buckets | ) |
Creates a new cl_lexhash object.
buckets | The number of buckets in the newly-created cl_lexhash; set to 0 to use the default number of buckets. |
References _cl_lexhash::auto_grow, _cl_lexhash::buckets, cl_calloc(), cl_malloc(), _cl_lexhash::cleanup_func, DEFAULT_FILLRATE_LIMIT, DEFAULT_FILLRATE_TARGET, DEFAULT_NR_OF_BUCKETS, _cl_lexhash::entries, _cl_lexhash::fillrate_limit, _cl_lexhash::fillrate_target, find_prime(), _cl_lexhash::next_id, and _cl_lexhash::table.
Referenced by cl_lexhash_check_grow(), main(), range_declare(), sencode_write_region(), and wattr_declare().
int find_prime | ( | int | n | ) |
Returns smallest prime >= n.
Referenced by cl_new_lexhash(), cl_new_ngram_hash(), make_attribute_hash(), and MakeMacroHash().
unsigned int hash_string | ( | char * | string | ) |
Computes 32bit hash value for string.
Referenced by att_hash_lookup(), cl_lexhash_check_grow(), and cl_lexhash_find_i().
int is_prime | ( | int | n | ) |
Returns True iff n is a prime.
Referenced by find_prime().