CWB
Data Structures | Macros | Typedefs | Functions
lexhash.c File Reference
#include "globals.h"
#include "macros.h"
#include "lexhash.h"
#include <math.h>

Data Structures

struct  _cl_lexhash
 Underlying structure for the cl_lexhash object. More...
 

Macros

#define DEFAULT_NR_OF_BUCKETS   250000
 Defines the default number of buckets in a lexhash. More...
 
#define DEFAULT_FILLRATE_LIMIT   2.0
 Default parameters for auto-growing the table of buckets (. More...
 
#define DEFAULT_FILLRATE_TARGET   0.4
 
#define MAX_BUCKETS   1000000007 /* 1 billion (incremented to next prime number) */
 Maximum number of buckets lexhash will try to allocate when auto-growing. More...
 

Typedefs

typedef void(* cl_lexhash_cleanup_func )(cl_lexhash_entry)
 TODO: consider alternative hash functions. More...
 

Functions

int is_prime (int n)
 Returns True iff n is a prime. More...
 
int find_prime (int n)
 Returns smallest prime >= n. More...
 
unsigned int hash_string (char *string)
 Computes 32bit hash value for string. More...
 
cl_lexhash cl_new_lexhash (int buckets)
 Creates a new cl_lexhash object. More...
 
void cl_delete_lexhash_entry (cl_lexhash hash, cl_lexhash_entry entry)
 Deallocates a cl_lexhash_entry object and its key string. More...
 
void cl_delete_lexhash (cl_lexhash hash)
 Deletes a cl_lexhash object. More...
 
void cl_lexhash_set_cleanup_function (cl_lexhash hash, cl_lexhash_cleanup_func func)
 Sets the cleanup function for a cl_lexhash. More...
 
void cl_lexhash_auto_grow (cl_lexhash hash, int flag)
 Turns a cl_lexhash's ability to auto-grow on or off. More...
 
void cl_lexhash_auto_grow_fillrate (cl_lexhash hash, double limit, double target)
 Configure auto-grow parameters. More...
 
int cl_lexhash_check_grow (cl_lexhash hash)
 Grows a lexhash table, increasing the number of buckets, if necessary. More...
 
cl_lexhash_entry cl_lexhash_find_i (cl_lexhash hash, char *token, unsigned int *ret_offset)
 Finds the entry corresponding to a particular string in a cl_lexhash. More...
 
cl_lexhash_entry cl_lexhash_find (cl_lexhash hash, char *token)
 Finds the entry corresponding to a particular string within a cl_lexhash. More...
 
cl_lexhash_entry cl_lexhash_add (cl_lexhash hash, char *token)
 Adds a token to a cl_lexhash table. More...
 
int cl_lexhash_id (cl_lexhash hash, char *token)
 Gets the ID of a particular string within a lexhash. More...
 
int cl_lexhash_freq (cl_lexhash hash, char *token)
 Gets the frequency of a particular string within a lexhash. More...
 
int cl_lexhash_del (cl_lexhash hash, char *token)
 Deletes a string from a hash. More...
 
int cl_lexhash_size (cl_lexhash hash)
 Gets the number of different strings stored in a lexhash. More...
 

Macro Definition Documentation

#define DEFAULT_FILLRATE_LIMIT   2.0

Default parameters for auto-growing the table of buckets (.

See also
cl_lexhash_auto_grow_fillrate for details).

Referenced by cl_new_lexhash().

#define DEFAULT_FILLRATE_TARGET   0.4

Referenced by cl_new_lexhash().

#define DEFAULT_NR_OF_BUCKETS   250000

Defines the default number of buckets in a lexhash.

Referenced by cl_new_lexhash().

#define MAX_BUCKETS   1000000007 /* 1 billion (incremented to next prime number) */

Maximum number of buckets lexhash will try to allocate when auto-growing.

Referenced by cl_lexhash_check_grow().

Typedef Documentation

typedef void(* cl_lexhash_cleanup_func)(cl_lexhash_entry)

TODO: consider alternative hash functions.

The hash function above appears to have been purloined from some version of the Perl source code. This claim cannot be confirmed, though. Perl5 has used various hash functions over time, but older versions (at least up to Perl 5.8.1) implement the simple DJB2 algorithm (see below).

According to

http://burtleburtle.net/bob/hash/

the algorithm is recommended in Don Knuth's "Art of Computer Programming" (Vol. 3, Sec. 6.4), but we haven't been able to find the actual reference there.

The URL above also discusses properties of hash functions at length and suggests a number of better algorithms.

Prime-number-sized hash tables are required by Knuth's algorithm, but make hashing more expensive. Growing a hash from 2^n to 2^(n+1) also gives a highly predicatble redistribution of buckets. Good hash functions should not require division by prime number in order to achieve good distribution.

DJB2: unsigned long hash = 5381; int c;

while (c = *str++) hash = ((hash << 5) + hash) + c; // hash * 33 + c

DJB2a: hash = hash * 33 ^ str[i]

MurmurHash: see http://en.wikipedia.org/wiki/MurmurHash

Experimental comparison: http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed A function pointer type defining functions that can be used as the "cleanup" for a deleted cl_lexhash_entry.

See also
cl_lexhash_set_cleanup_function

Function Documentation

void cl_delete_lexhash ( cl_lexhash  hash)

Deletes a cl_lexhash object.

This deletes all the entries in all the buckets in the lexhash, plus the cl_lexhash itself.

Parameters
hashThe cl_lexhash to delete.

References _cl_lexhash::buckets, cl_delete_lexhash_entry(), cl_free, _cl_lexhash_entry::next, and _cl_lexhash::table.

Referenced by main().

void cl_delete_lexhash_entry ( cl_lexhash  hash,
cl_lexhash_entry  entry 
)

Deallocates a cl_lexhash_entry object and its key string.

Also, the cleanup function is run on the entry.

Usage: cl_delete_lexhash_entry(lexhash, entry);

This is a non-exported function.

See also
cl_lexhash_set_cleanup_function
Parameters
hashThe lexhash this entry belongs to (needed to locate the cleanup function, if any).
entryThe entry to delete.

References cl_free, and _cl_lexhash::cleanup_func.

Referenced by cl_delete_lexhash(), and cl_lexhash_del().

cl_lexhash_entry cl_lexhash_add ( cl_lexhash  hash,
char *  token 
)

Adds a token to a cl_lexhash table.

If the string is already in the hash, its frequency count is increased by 1.

Otherwise, a new entry is created, with an auto-assigned ID; note that the string is duplicated, so the original string that is passed to this function does not need to be kept in memory.

Parameters
hashThe hash table to add to.
tokenThe string to add.
Returns
A pointer to a (new or existing) entry

References _cl_lexhash::auto_grow, _cl_lexhash::buckets, cl_lexhash_check_grow(), cl_lexhash_find_i(), cl_malloc(), _cl_lexhash_entry::data, _cl_lexhash::entries, _cl_lexhash::fillrate_limit, _cl_lexhash_entry::freq, _cl_lexhash_entry::id, _cl_lexhash_entry::_cl_lexhash_entry_data::integer, _cl_lexhash_entry::key, _cl_lexhash_entry::next, _cl_lexhash::next_id, _cl_lexhash_entry::_cl_lexhash_entry_data::numeric, _cl_lexhash_entry::_cl_lexhash_entry_data::pointer, and _cl_lexhash::table.

Referenced by encode_add_wattr_line(), main(), range_close(), range_declare(), range_open(), and sencode_write_region().

void cl_lexhash_auto_grow ( cl_lexhash  hash,
int  flag 
)

Turns a cl_lexhash's ability to auto-grow on or off.

When this setting is switched on, the lexhash will grow automatically to avoid performance degradation.

Note the default value for this setting is SWITCHED ON.

See also
cl_lexhash_check_grow
Parameters
hashThe hash that will be affected.
flagNew value for autogrow setting: boolean where true is on and false is off.

References _cl_lexhash::auto_grow.

void cl_lexhash_auto_grow_fillrate ( cl_lexhash  hash,
double  limit,
double  target 
)

Configure auto-grow parameters.

These settings are only relevant if auto-growing is enabled.

The decision to expand the bucket table of a lexhash is based on its fill rate, i.e. the average number of entries in each bucket. Under normal circumstances, this value corresponds to the average number of comparisons required to insert a new entry into the hash (locating an existing value should require roughly half as many comparisons).

Auto-growing is triggered if the fill rate exceeds a specified limit. The new number of buckets is chosen so that the fill rate after expansion corresponds to the specified target value.

The limit should not be set too low in order to reduce memory overhead and avoid frequent reallocation due to expansion in small increments. Good values seem to be in the range 2.0-5.0; depending on whether speed or memory efficiency is more important. A reasonable value for the target fill rate is 0.4, which corresponds to a 42% overhead over the storage required for entry data structures (48 bytes per entry vs. 8 bytes for each bucket).

See also
cl_lexhash_auto_grow, cl_lexhash_check_grow
Parameters
hashThe hash that will be affected.
limitFill rate limit, which triggers expansion of the lexhash
targetTarget fill rate after expansion (determines new number of buckets)

References _cl_lexhash::fillrate_limit, and _cl_lexhash::fillrate_target.

int cl_lexhash_check_grow ( cl_lexhash  hash)

Grows a lexhash table, increasing the number of buckets, if necessary.

This functions is called after inserting a new entry into the lexhash. If checks whether the current fill rate exceeds the specified limit. If this is the case, and auto_grow is enabled, then the hash is expanded by increasing the number of buckets, such that the new average fill rate corresponds to the specified target value. This gives the hash better performance and makes it capable of absorbing more keys.

If the bucket table would be expanded to more than MAX_BUCKETS entries, auto-grow is automatically disabled for this lexhash.

Note: this function also implements the hashing algorithm and must be consistent with cl_lexhash_find_i().

Usage: expanded = cl_lexhash_check_grow(cl_lexhash hash);

This is a non-exported function.

See also
cl_lexhash_auto_grow, cl_lexhash_auto_grow_fillrate
Parameters
hashThe lexhash to autogrow.
Returns
Always 0.

References _cl_lexhash::auto_grow, _cl_lexhash::buckets, cl_debug, cl_free, cl_new_lexhash(), _cl_lexhash::entries, _cl_lexhash::fillrate_limit, _cl_lexhash::fillrate_target, hash_string(), _cl_lexhash_entry::key, MAX_BUCKETS, _cl_lexhash_entry::next, and _cl_lexhash::table.

Referenced by cl_lexhash_add().

int cl_lexhash_del ( cl_lexhash  hash,
char *  token 
)

Deletes a string from a hash.

The entry corresponding to the specified string is removed from the lexhash. If the string is not in the lexhash to begin with, no action is taken.

Parameters
hashThe hash to alter.
tokenThe string to remove.
Returns
The frequency of the deleted entry (0 if the string was not found in the hash).

References cl_delete_lexhash_entry(), cl_lexhash_find_i(), _cl_lexhash::entries, _cl_lexhash_entry::freq, _cl_lexhash_entry::next, and _cl_lexhash::table.

cl_lexhash_entry cl_lexhash_find ( cl_lexhash  hash,
char *  token 
)

Finds the entry corresponding to a particular string within a cl_lexhash.

This function is basically a wrapper around the internal function cl_lexhash_find_i.

See also
cl_lexhash_find_i
Parameters
hashThe hash to search.
tokenThe key-string to look for.
Returns
The entry that is found (or NULL if the string is not in the hash).

References cl_lexhash_find_i().

Referenced by main(), range_close(), range_open(), range_print_registry_line(), and sencode_write_region().

cl_lexhash_entry cl_lexhash_find_i ( cl_lexhash  hash,
char *  token,
unsigned int *  ret_offset 
)

Finds the entry corresponding to a particular string in a cl_lexhash.

This function is the same as cl_lexhash_find(), but *ret_offset is set to the hashtable offset computed for token (i.e. the index of the bucket within the hashtable), unless *ret_offset == NULL.

Note that this function hides the hashing algorithm details from the rest of the lexhash implementation (except cl_lexhash_check_grow, which re-implements the hashing algorithm for performance reasons).

Usage: entry = cl_lexhash_find_i(cl_lexhash hash, char *token, unsigned int *ret_offset);

This is a non-exported function.

Parameters
hashThe hash to search.
tokenThe key-string to look for.
ret_offsetThis integer address will be filled with the token's hashtable offset (can be NULL, in which case, ignored).
Returns
The entry that is found (or NULL if the string is not in the hash).

References _cl_lexhash::buckets, hash_string(), _cl_lexhash_entry::key, _cl_lexhash_entry::next, and _cl_lexhash::table.

Referenced by cl_lexhash_add(), cl_lexhash_del(), cl_lexhash_find(), cl_lexhash_freq(), and cl_lexhash_id().

int cl_lexhash_freq ( cl_lexhash  hash,
char *  token 
)

Gets the frequency of a particular string within a lexhash.

Parameters
hashThe hash to look in.
tokenThe string to look for.
Returns
The frequency of that string, or 0 if the string is not in the hash (whgich is, of course, actually its frequency).

References cl_lexhash_find_i(), and _cl_lexhash_entry::freq.

Referenced by main(), and range_open().

int cl_lexhash_id ( cl_lexhash  hash,
char *  token 
)

Gets the ID of a particular string within a lexhash.

Note this is the ID integer that identifies THAT PARTICULAR STRING, not the hash value of that string - which only identifies the bucket the string is found in!

Parameters
hashThe hash to look in.
tokenThe string to look for.
Returns
The ID code of that string, or -1 if the string is not in the hash.

References cl_lexhash_find_i(), and _cl_lexhash_entry::id.

Referenced by encode_add_wattr_line(), and range_declare().

void cl_lexhash_set_cleanup_function ( cl_lexhash  hash,
cl_lexhash_cleanup_func  func 
)

Sets the cleanup function for a cl_lexhash.

The cleanup function is called with a cl_lexhash_entry argument; it should delete any objects assocated with the entry's data field.

The cleanup function is initially set to NULL, i.e. run no function.

Parameters
hashThe cl_lexhash to work with.
funcPointer to the function to use for cleanup.

References _cl_lexhash::cleanup_func, and func.

int cl_lexhash_size ( cl_lexhash  hash)

Gets the number of different strings stored in a lexhash.

This returns the total number of entries in all the buckets in the whole hash table.

Parameters
hashThe hash to size up.

References _cl_lexhash::entries.

cl_lexhash cl_new_lexhash ( int  buckets)

Creates a new cl_lexhash object.

Parameters
bucketsThe number of buckets in the newly-created cl_lexhash; set to 0 to use the default number of buckets.
Returns
The new cl_lexhash.

References _cl_lexhash::auto_grow, _cl_lexhash::buckets, cl_calloc(), cl_malloc(), _cl_lexhash::cleanup_func, DEFAULT_FILLRATE_LIMIT, DEFAULT_FILLRATE_TARGET, DEFAULT_NR_OF_BUCKETS, _cl_lexhash::entries, _cl_lexhash::fillrate_limit, _cl_lexhash::fillrate_target, find_prime(), _cl_lexhash::next_id, and _cl_lexhash::table.

Referenced by cl_lexhash_check_grow(), main(), range_declare(), sencode_write_region(), and wattr_declare().

int find_prime ( int  n)

Returns smallest prime >= n.

Referenced by cl_new_lexhash(), cl_new_ngram_hash(), make_attribute_hash(), and MakeMacroHash().

unsigned int hash_string ( char *  string)

Computes 32bit hash value for string.

Referenced by att_hash_lookup(), cl_lexhash_check_grow(), and cl_lexhash_find_i().

int is_prime ( int  n)

Returns True iff n is a prime.

Referenced by find_prime().