CWB
feature_maps.h File Reference
#include "../cl/cl.h"

Data Structures

struct  vstack_t
 Data structure for the vstack member of the FMS object.
 
struct  feature_maps_t
 Underlying structure for the FMS object.
 

Macros

#define MAXBLOCKS   10
 

Typedefs

typedef struct vstack_t vstack_t
 Data structure for the vstack member of the FMS object.
 
typedef struct feature_maps_t feature_maps_t
 Underlying structure for the FMS object.
 
typedef feature_maps_t * FMS
 The FMS object: contains memory space for a feature map between two attributes, used in aligning corpora.
 

Functions

FMS create_feature_maps (char **config, int config_lines, Attribute *w_attr1, Attribute *w_attr2, Attribute *s_attr1, Attribute *s_attr2)
 Methods for the FMS class.
 
int * get_fvector (FMS fms)
 Feature count vector handling (used internally by feature_match).
 
void release_fvector (int *fvector, FMS fms)
 Inserts a new vstack_t at the start of the vstack member of the given FMS.
 
void check_fvectors (FMS fms)
 Prints a message about the vector stack of the given FMS.
 
int feature_match (FMS fms, int f1, int l1, int f2, int l2)
 Compute similarity measure for a pair of regions, source and target, specified by the corpus positions of the first and last sentences in each region.
 
void show_features (FMS fms, int which, char *word)
 Prints the features in an FMS, as applied to a specific lexicon entry, to STDOUT.
 
void best_path (FMS fms, int f1, int l1, int f2, int l2, int beam_width, int verbose, int *steps, int **out1, int **out2, int **out_quality)
 Finds the best alignment path for the given spans of s-attribute instances in the source and target corpus.
 

Macro Definition Documentation

#define MAXBLOCKS   10

Typedef Documentation

typedef struct feature_maps_t feature_maps_t

Underlying structure for the FMS object.

typedef feature_maps_t* FMS

The FMS object: contains memory space for a feature map between two attributes, used in aligning corpora.

The "feature map" is a very large and complex data structure of all the different features we can look at, together with weights.

Basically, it is a "compiled" version of the features defined by the cwb-align configuration flags AS APPLIED TO THIS SPECIFIC CORPUS - a massive list of "things to look for" when comparing any two potentially-corresponding regions from a source/target corpus pair.

typedef struct vstack_t vstack_t

Data structure for the vstack member of the FMS object.

See also
FMS

Function Documentation

void best_path ( FMS  fms,
int  f1,
int  l1,
int  f2,
int  l2,
int  beam_width,
int  verbose,
int *  steps,
int **  out1,
int **  out2,
int **  out_quality 
)

Finds the best alignment path for the given spans of s-attribute instances in the source and target corpus.

This function does a beamed dynamic programming search for the best path aligning the sentence regions (f1,l1) in the source corpus and (f2,l2) in the target corpus.

Allowed alignments are 1:0 0:1 1:1 2:1 1:2.
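The search over these five step shapes can be illustrated with a small dynamic-programming sketch. This is not the code behind best_path(): it fills the whole table instead of pruning with a beam, and the names toy_sim(), toy_best_path(), len1 and len2 are hypothetical stand-ins (toy_sim() plays the role of feature_match() and simply compares made-up sentence lengths).

```c
#include <assert.h>
#include <stdlib.h>

#define N1 4                 /* toy source region: sentences 0..3 */
#define N2 4                 /* toy target region: sentences 0..3 */
#define N_SHAPES 5

/* Allowed step shapes 1:0, 0:1, 1:1, 2:1, 1:2 (source : target). */
static const int shape1[N_SHAPES] = {1, 0, 1, 2, 1};
static const int shape2[N_SHAPES] = {0, 1, 1, 1, 2};

/* Made-up sentence lengths; hypothetical stand-ins for corpus data. */
static const int len1[N1] = {5, 3, 4, 6};
static const int len2[N2] = {5, 7, 6, 1};

/* Hypothetical similarity of n1 source sentences starting at i and
 * n2 target sentences starting at j (stand-in for feature_match()). */
static int toy_sim(int i, int n1, int j, int n2)
{
  int a = 0, b = 0, k;
  if (n1 == 0 || n2 == 0)
    return 0;                        /* 1:0 and 0:1 steps score nothing */
  for (k = 0; k < n1; k++) a += len1[i + k];
  for (k = 0; k < n2; k++) b += len2[j + k];
  return 10 - abs(a - b);
}

/* Writes the alignment points (start of every step, plus the end point
 * N1,N2) to out1/out2 and returns how many points were written.
 * Both arrays need room for N1 + N2 + 1 entries. */
int toy_best_path(int *out1, int *out2)
{
  int best[N1 + 1][N2 + 1], back[N1 + 1][N2 + 1];
  int i, j, s, n = 0;

  /* Forward pass: best[i][j] is the best total score of any path
   * that has consumed i source and j target sentences. */
  for (i = 0; i <= N1; i++)
    for (j = 0; j <= N2; j++) {
      best[i][j] = 0;
      back[i][j] = -1;
      if (i == 0 && j == 0) continue;
      for (s = 0; s < N_SHAPES; s++) {
        int pi = i - shape1[s], pj = j - shape2[s], q;
        if (pi < 0 || pj < 0) continue;
        q = best[pi][pj] + toy_sim(pi, shape1[s], pj, shape2[s]);
        if (back[i][j] < 0 || q > best[i][j]) {
          best[i][j] = q;
          back[i][j] = s;
        }
      }
    }

  /* Backward pass: collect points from the end point to (0,0). */
  i = N1; j = N2;
  out1[n] = i; out2[n] = j; n++;
  while (i > 0 || j > 0) {
    s = back[i][j];
    assert(s >= 0);                  /* every other cell has a predecessor */
    i -= shape1[s]; j -= shape2[s];
    out1[n] = i; out2[n] = j; n++;
  }

  /* Reverse in place so that the points run from start to end. */
  for (i = 0, j = n - 1; i < j; i++, j--) {
    int t = out1[i]; out1[i] = out1[j]; out1[j] = t;
    t = out2[i]; out2[i] = out2[j]; out2[j] = t;
  }
  return n;
}
```

A beamed variant would keep only the best few cells on each diagonal instead of filling the whole table, trading exactness for speed on long regions.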

The results are returned in the vectors out1 and out2, each of which contains as many valid entries (alignment points) as the value stored in steps.

Alignment points are given as sentence numbers and correspond to the start points of the sentences. At the end-of-region alignment point, sentence numbers will be l1 + 1 and l2 + 1, which must be considered by the caller if l1 (or l2) is the last sentence in the corpus!

The similarity measures of aligned regions are returned in the vector out_quality.

Memory allocated for the return vectors (out1, out2, out_quality) is managed by best_path() and must not be freed by the caller. Calling best_path() overwrites the results of the previous search.
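One way a caller might read the result vectors is sketched below. It assumes that the region between two consecutive alignment points forms one aligned pair, which is an interpretation of the description above and not something this page states outright; toy_bead_t and toy_get_bead() are hypothetical names. Computing the last sentence as "next start minus one" means the end-of-region values l1 + 1 and l2 + 1 are never used as sentence numbers themselves.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: one aligned pair of sentence ranges. */
typedef struct {
  int first1, last1;   /* source sentences; last1 < first1 means none */
  int first2, last2;   /* target sentences; last2 < first2 means none */
} toy_bead_t;

/* Fills *b with the ranges between alignment points k and k + 1.
 * Returns 1 on success and 0 if k is not the start of a range
 * (in particular for the final, end-of-region point). */
int toy_get_bead(const int *out1, const int *out2, int steps, int k,
                 toy_bead_t *b)
{
  assert(out1 != NULL && out2 != NULL && b != NULL);
  if (k < 0 || k + 1 >= steps)
    return 0;
  b->first1 = out1[k];
  b->last1  = out1[k + 1] - 1;   /* never touches sentence l1 + 1 */
  b->first2 = out2[k];
  b->last2  = out2[k + 1] - 1;   /* never touches sentence l2 + 1 */
  return 1;
}
```

Because the vectors belong to best_path(), a caller that needs the ranges after a later search would have to copy them out first.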

Example usage:

best_path(FMS, f1, l1, f2, l2, beam_width, 0/1, &steps, &out1, &out2, &out_quality);

Parameters
fms          The FMS to use as comparison criteria.
f1           Index of first sentence in source region.
l1           Index of last sentence in source region.
f2           Index of first sentence in target region.
l2           Index of last sentence in target region.
beam_width   Parameter for the beam search.
verbose      Boolean: iff true, prints progress messages on STDOUT.
steps        Output: receives the number of alignment points (see function description).
out1         Output: receives the vector of alignment points in the source corpus (see function description).
out2         Output: receives the vector of alignment points in the target corpus (see function description).
out_quality  Output: receives the vector of similarity measures of the aligned regions (see function description).

References BAR_delete(), BAR_new(), BAR_read(), BAR_write(), beam_width, and feature_match().

Referenced by align_do_alignment().

void check_fvectors ( FMS  fms)

Prints a message about the vector stack of the given FMS.

If it finds a non-zero count, it prints a message to STDERR. Otherwise, it prints a message to STDOUT with the count of feature vectors.

Parameters
fms          The FMS to check.

References vstack_t::fcount, feature_maps_t::n_features, vstack_t::next, and feature_maps_t::vstack.

FMS create_feature_maps ( char **  config,
int  config_lines,
Attribute *  w_attr1,
Attribute *  w_attr2,
Attribute *  s_attr1,
Attribute *  s_attr2 
)

Methods for the FMS class.

Here is how it works:

FMS = create_feature_maps(config, config_lines, source, target, source_s, target_s);

Input:

  - feature map configuration (ASCII, parsed into separate items)
  - word (or lemma) p-attributes of source and target corpus
  - s-attributes for sentence boundaries in both corpora (source_s, target_s)

Output:

  - set of relevant features
  - mapping from lexicon IDs to feature sets
  - wrapped in FMS struct returned from the function

In order to ensure a maximally compact encoding, feature sets are generated with a two-pass algorithm:

  1. identify relevant features + number of active features for each lexicon ID
  2. generate the actual feature sets

Creates feature maps for a source/target corpus pair.
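The two-pass idea can be sketched as follows: count first, then allocate exactly as much memory as the counts require, then fill. This is only an illustration under assumptions. The feature test (a word has feature f if it contains the letter 'a' + f), the packed layout, and the names toy_map_t, toy_map_new() and toy_map_free() are all hypothetical, and the sketch skips the step of deciding which features are relevant.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define N_FEATURES 26   /* toy feature f: "word contains letter 'a' + f" */

/* Hypothetical compact map: all feature sets packed back to back. */
typedef struct {
  int n_ids;
  int *start;   /* set of ID i is feat[start[i]] .. feat[start[i+1] - 1] */
  int *feat;
} toy_map_t;

static int has_feature(const char *word, int f)
{
  return strchr(word, 'a' + f) != NULL;
}

void toy_map_free(toy_map_t *m)
{
  if (m != NULL) {
    free(m->start);
    free(m->feat);
    free(m);
  }
}

/* Builds the map for a lexicon of n_ids strings; NULL on failure. */
toy_map_t *toy_map_new(const char **lexicon, int n_ids)
{
  int id, f, n = 0;
  toy_map_t *m = malloc(sizeof *m);
  if (m == NULL)
    return NULL;
  m->n_ids = n_ids;
  m->feat = NULL;
  m->start = malloc((size_t)(n_ids + 1) * sizeof *m->start);
  if (m->start == NULL) {
    toy_map_free(m);
    return NULL;
  }

  /* Pass 1: count the active features of every lexicon ID. */
  for (id = 0; id < n_ids; id++) {
    m->start[id] = n;
    for (f = 0; f < N_FEATURES; f++)
      if (has_feature(lexicon[id], f))
        n++;
  }
  m->start[n_ids] = n;

  /* Pass 2: allocate exactly n slots, then write the feature sets. */
  m->feat = malloc((size_t)(n > 0 ? n : 1) * sizeof *m->feat);
  if (m->feat == NULL) {
    toy_map_free(m);
    return NULL;
  }
  n = 0;
  for (id = 0; id < n_ids; id++)
    for (f = 0; f < N_FEATURES; f++)
      if (has_feature(lexicon[id], f))
        m->feat[n++] = f;
  assert(n == m->start[n_ids]);   /* both passes must agree */
  return m;
}
```

The price of the compact layout is that every feature test runs twice; in exchange, no feature set carries unused slack.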

This is the constructor function for the FMS class.

Example usage:

FMS = create_feature_maps(config_data, nr_of_config_lines, source_word, target_word, source_s, target_s);

Parameters
config        Array of strings representing the feature map configuration.
config_lines  The number of configuration items stored in config.
w_attr1       The p-attribute in the first corpus to link.
w_attr2       The p-attribute in the second corpus to link.
s_attr1       The s-attribute in the first corpus to link.
s_attr2       The s-attribute in the second corpus to link.
Returns
the new FMS object.

References feature_maps_t::att1, feature_maps_t::att2, char_map, char_map_range, charset, cl_attribute_mother_corpus(), cl_corpus_charset(), cl_free, cl_id2freq(), cl_id2str(), cl_id2strlen(), cl_max_id(), CL_MAX_LINE_LENGTH, cl_str2id(), cl_strcpy(), cl_string_canonical(), cl_string_validate_encoding(), config_lines, feature_maps_t::fweight, IGNORE_CASE, IGNORE_DIAC, init_char_map(), feature_maps_t::n_features, feature_maps_t::s1, feature_maps_t::s2, utf8, feature_maps_t::vstack, feature_maps_t::w2f1, feature_maps_t::w2f2, word1, and word2.

Referenced by main().

int feature_match ( FMS  fms,
int  f1,
int  l1,
int  f2,
int  l2 
)

Compute similarity measure for a pair of regions, source and target, specified by the corpus positions of the first and last sentences in each region.

(And by "sentences" we mean "instances of whatever it is this s-attribute represents".)

This is, basically, the "apply me" method for the FMS object.
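This page does not spell out how the similarity measure is computed. As a rough illustration only, one plausible scheme is a weighted overlap of feature counts: count the features seen in the source region, then let each feature occurrence in the target region consume one matching count and add that feature's weight. The function toy_feature_overlap(), the weight table, and the flat arrays of feature IDs are assumptions made for this sketch and are not taken from the library.

```c
#include <assert.h>

#define N_FEATURES 4

/* Hypothetical per-feature weights. */
static const int weight[N_FEATURES] = {1, 2, 5, 1};

/* Sums the weights of feature occurrences shared by two regions.
 * src and tgt list the feature IDs found in each region; a source
 * occurrence can be matched by at most one target occurrence. */
int toy_feature_overlap(const int *src, int n_src,
                        const int *tgt, int n_tgt)
{
  int count[N_FEATURES] = {0};   /* plays the role of a count vector */
  int i, sim = 0;

  for (i = 0; i < n_src; i++) {
    assert(src[i] >= 0 && src[i] < N_FEATURES);
    count[src[i]]++;
  }
  for (i = 0; i < n_tgt; i++) {
    assert(tgt[i] >= 0 && tgt[i] < N_FEATURES);
    if (count[tgt[i]] > 0) {
      count[tgt[i]]--;           /* consume one source occurrence */
      sim += weight[tgt[i]];
    }
  }
  return sim;
}
```

With source features {0, 2, 2, 3} and target features {2, 2, 2, 1}, only two occurrences of feature 2 can be paired, so the sketch returns 10.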

Usage:

Sim = feature_match(FMS, source_first, source_last, target_first, target_last);

Note that the best_path() function simply passes the FMS through to this function. best_path() then decides what the best sequence of alignments is, given the results it gets back from this function.

Parameters
fms          The feature map (which contains the s-attributes in question).
f1           Index of first "sentence" (i.e. entry on the s-attribute) of the region to analyse in the source.
l1           Index of last "sentence" of the region to analyse in the source.
f2           Index of first "sentence" (i.e. entry on the s-attribute) of the region to analyse in the target.
l2           Index of last "sentence" of the region to analyse in the target.
Returns
The similarity measurement for the pair of regions.

References feature_maps_t::att1, feature_maps_t::att2, cl_cpos2id(), cl_struc2cpos(), feature_maps_t::fweight, get_fvector(), release_fvector(), feature_maps_t::s1, feature_maps_t::s2, feature_maps_t::w2f1, and feature_maps_t::w2f2.

Referenced by align_do_alignment(), and best_path().

int* get_fvector ( FMS  fms)

Feature count vector handling (used internally by feature_match).

If the vstack of the FMS (head of linked list) does not yet contain anything, then a new integer array is created and a pointer to it is returned.

If the vstack is already set, then the fcount from the element at the top of the linked list is returned, and its record vstack_t is deleted from the linked list stack.

IN OTHER WORDS, a vector of feature counts is provided EITHER by using the top one off the stack, OR by getting a new one.
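The get/release behaviour described here resembles a simple pool of reusable vectors kept on a linked-list stack. The sketch below illustrates that pattern with hypothetical types (toy_vstack_t, toy_fms_t) that mirror only the fcount, next, n_features and vstack members named on this page. It assumes that fresh vectors start out zeroed and that a released vector is handed over to the pool; neither detail is confirmed here.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for vstack_t and the relevant FMS members. */
typedef struct toy_vstack {
  int *fcount;
  struct toy_vstack *next;
} toy_vstack_t;

typedef struct {
  int n_features;
  toy_vstack_t *vstack;      /* head of the stack of spare vectors */
} toy_fms_t;

/* Returns a spare vector if one is on the stack, else a new one. */
int *toy_get_fvector(toy_fms_t *fms)
{
  toy_vstack_t *top = fms->vstack;
  int *v;

  if (top == NULL)           /* stack empty: create a zeroed vector */
    return calloc((size_t)fms->n_features, sizeof(int));
  v = top->fcount;           /* reuse the vector from the top record */
  fms->vstack = top->next;
  free(top);                 /* the record is deleted, not the vector */
  return v;
}

/* Pushes a vector back onto the stack so that it can be reused. */
void toy_release_fvector(int *fvector, toy_fms_t *fms)
{
  toy_vstack_t *node = malloc(sizeof *node);
  if (node == NULL) {        /* cannot pool it: give the memory back */
    free(fvector);
    return;
  }
  node->fcount = fvector;
  node->next = fms->vstack;
  fms->vstack = node;
}

/* Counts the spare vectors currently held on the stack. */
int toy_pool_size(const toy_fms_t *fms)
{
  const toy_vstack_t *p;
  int n = 0;
  for (p = fms->vstack; p != NULL; p = p->next)
    n++;
  return n;
}
```

Seen this way, "release" means returning a vector to the pool, which is implemented as an insertion at the head of the list.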

Parameters
fms          The FMS to get a feature vector for.
Returns
Pointer to array of integers (feature counts) big enough to hold th

References cl_free, vstack_t::fcount, feature_maps_t::n_features, vstack_t::next, and feature_maps_t::vstack.

Referenced by feature_match().

void release_fvector ( int *  fvector,
FMS  fms 
)

Inserts a new vstack_t at the start of the vstack member of the given FMS.

{That's what it looks like it does, not sure how the function name fits with that... ???? - AH}

References feature_maps_t::vstack.

Referenced by feature_match().

void show_features ( FMS  fms,
int  which,
char *  word 
)

Prints the features in an FMS, as applied to a specific lexicon entry, to STDOUT.

Usage: show_features(FMS, 1/2, "word");

This will print all features listed in FMS for the token "word"; "word" is looked up in the source corpus if the 2nd argument == 1, and in the target corpus otherwise.

Parameters
fms          The FMS to print from.
which        Which corpus to look up? (See description)
word         The word-type to look up.

References feature_maps_t::att1, feature_maps_t::att2, cl_str2id(), feature_maps_t::fweight, feature_maps_t::w2f1, and feature_maps_t::w2f2.