CWB
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <string.h>
#include "feature_maps.h"
#include "barlib.h"
Functions | |
void | init_char_map () |
Initialises char_map. | |
FMS | create_feature_maps (char **config, int config_lines, Attribute *w_attr1, Attribute *w_attr2, Attribute *s_attr1, Attribute *s_attr2) |
Methods for the FMS class. | |
int | feature_match (FMS fms, int f1, int l1, int f2, int l2) |
Compute similarity measure for a pair of regions, source and target, specified by the corpus positions of the first and last sentences in each region. | |
int * | get_fvector (FMS fms) |
Feature count vector handling (used internally by feature_match). | |
void | release_fvector (int *fvector, FMS fms) |
Inserts a new vstack_t at the start of the vstack member of the given FMS. | |
void | check_fvectors (FMS fms) |
Prints a message about the vector stack of the given FMS. | |
void | show_features (FMS fms, int which, char *word) |
Prints the features in an FMS, as applied to a specific lexicon entry, to STDOUT. | |
void | best_path (FMS fms, int f1, int l1, int f2, int l2, int beam_width, int verbose, int *steps, int **out1, int **out2, int **out_quality) |
Finds the best alignment path for the given spans of s-attribute instances in the source and target corpus. | |
Variables | |
int | char_map_range = 0 |
The top of the range of char_map's outputs. | |
unsigned char | char_map [256] |
A character map for computing n-gram features. | |
void best_path | ( | FMS | fms, |
int | f1, | ||
int | l1, | ||
int | f2, | ||
int | l2, | ||
int | beam_width, | ||
int | verbose, | ||
int * | steps, | ||
int ** | out1, | ||
int ** | out2, | ||
int ** | out_quality | ||
) |
Finds the best alignment path for the given spans of s-attribute instances in the source and target corpus.
This function does a beamed dynamic programming search for the best path aligning the sentence regions (f1,l1) in the source corpus and (f2,l2) in the target corpus.
Allowed alignments are 1:0 0:1 1:1 2:1 1:2.
The results are returned in the vectors out1 and out2, each of which contains as many valid entries (alignment points) as the value returned in steps.
Alignment points are given as sentence numbers and correspond to the start points of the sentences. At the end-of-region alignment point, sentence numbers will be l1 + 1 and l2 + 1, which must be considered by the caller if l1 (or l2) is the last sentence in the corpus!
The similarity measures of aligned regions are returned in the vector out_quality.
Memory allocated for the return vectors (out1, out2, out_quality) is managed by best_path() and must not be freed by the caller. Calling best_path() overwrites the results of the previous search.
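To make the search space concrete, here is a minimal sketch of a dynamic programming search over the five allowed step shapes. It is not the CWB implementation: it is exhaustive rather than beamed, it works on a small fixed-size grid, and the hypothetical toy_similarity() (which only compares summed sentence lengths) stands in for feature_match(). All names in it are invented for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define MAX_N 16  /* hypothetical upper bound on region size for this sketch */

/* The five allowed step shapes 1:0 0:1 1:1 2:1 1:2 (source : target). */
static const int STEP_DX[5] = {1, 0, 1, 2, 1};
static const int STEP_DY[5] = {0, 1, 1, 1, 2};

/* Toy stand-in for feature_match(): 0 when the summed lengths of the two
   regions agree, increasingly negative as they diverge. */
static int toy_similarity(const int *len1, int f1, int l1,
                          const int *len2, int f2, int l2)
{
  int i, a = 0, b = 0;
  for (i = f1; i <= l1; i++) a += len1[i];
  for (i = f2; i <= l2; i++) b += len2[i];
  return -abs(a - b);
}

/* Exhaustive (unbeamed) search over an n x m grid. Fills pts1/pts2 with the
   alignment points of the best path, including the end-of-region point
   (n, m), and returns the number of points. pts1 and pts2 must have room
   for n + m + 1 entries. */
static int toy_best_path(const int *len1, int n, const int *len2, int m,
                         int *pts1, int *pts2, int *best_score)
{
  static int score[MAX_N + 1][MAX_N + 1];
  static int back[MAX_N + 1][MAX_N + 1];
  int x, y, k, i, count = 0;

  assert(n >= 0 && n <= MAX_N && m >= 0 && m <= MAX_N);

  for (x = 0; x <= n; x++) {
    for (y = 0; y <= m; y++) {
      score[x][y] = (x == 0 && y == 0) ? 0 : INT_MIN;
      back[x][y] = -1;
      for (k = 0; k < 5; k++) {
        int px = x - STEP_DX[k], py = y - STEP_DY[k];
        int s;
        if (px < 0 || py < 0 || score[px][py] == INT_MIN) continue;
        s = score[px][py] + toy_similarity(len1, px, x - 1, len2, py, y - 1);
        if (s > score[x][y]) { score[x][y] = s; back[x][y] = k; }
      }
    }
  }

  /* Walk back from the end point to (0, 0). */
  x = n; y = m;
  for (;;) {
    pts1[count] = x; pts2[count] = y; count++;
    if (x == 0 && y == 0) break;
    k = back[x][y];
    x -= STEP_DX[k]; y -= STEP_DY[k];
  }
  /* Reverse so that the points run from start to end of the region. */
  for (i = 0; i < count / 2; i++) {
    int t1 = pts1[i], t2 = pts2[i];
    pts1[i] = pts1[count - 1 - i]; pts2[i] = pts2[count - 1 - i];
    pts1[count - 1 - i] = t1; pts2[count - 1 - i] = t2;
  }
  *best_score = score[n][m];
  return count;
}
```

A beamed search would presumably keep only the best beam_width candidates at each stage instead of filling the whole grid, trading exactness for speed.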
Example usage:
best_path(FMS, f1, l1, f2, l2, beam_width, 0/1, &steps, &out1, &out2, &out_quality);
fms | The FMS to use as comparison criteria. |
f1 | Index of first sentence in source region. |
l1 | Index of last sentence in source region. |
f2 | Index of first sentence in target region. |
l2 | Index of last sentence in target region. |
beam_width | Parameter for the beam search. |
verbose | Boolean: iff true, prints progress messages on STDOUT. |
steps | Output: receives the number of valid alignment points. |
out1 | Output: receives the vector of alignment points in the source corpus. |
out2 | Output: receives the vector of alignment points in the target corpus. |
out_quality | Output: receives the vector of similarity measures for the aligned regions. |
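The following sketch shows one way a caller might turn the returned alignment points into n:m beads. It assumes, based on the description above, that the last valid entry is the end-of-region point (l1 + 1, l2 + 1), so that consecutive points delimit one aligned region each. bead_sizes() and print_beads() are hypothetical helpers, not part of CWB, and the sketch does not read out_quality because the description does not state how its entries line up with the points.

```c
#include <assert.h>
#include <stdio.h>

/* Computes the number of source and target sentences in each bead from
   consecutive alignment points. Assumes the last point is the end-of-region
   point. Returns the number of beads (steps - 1). */
static int bead_sizes(const int *out1, const int *out2, int steps,
                      int *src_n, int *tgt_n)
{
  int i;
  assert(steps >= 1);
  for (i = 0; i + 1 < steps; i++) {
    src_n[i] = out1[i + 1] - out1[i];  /* source sentences in this bead */
    tgt_n[i] = out2[i + 1] - out2[i];  /* target sentences in this bead */
  }
  return steps - 1;
}

/* Prints each bead as "start1 start2 n:m", one per line. */
static void print_beads(const int *out1, const int *out2, int steps)
{
  int i;
  for (i = 0; i + 1 < steps; i++)
    printf("%d %d %d:%d\n", out1[i], out2[i],
           out1[i + 1] - out1[i], out2[i + 1] - out2[i]);
}
```

Because the vectors belong to best_path(), a caller that needs the points after a later call would have to copy them first.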
References BAR_delete(), BAR_new(), BAR_read(), BAR_write(), beam_width, and feature_match().
Referenced by align_do_alignment().
void check_fvectors | ( | FMS | fms | ) |
Prints a message about the vector stack of the given FMS.
If it finds a non-zero count, it prints a message to STDERR. If it doesn't, it prints a message to STDOUT with the count of feature vectors.
fms | The FMS to check. |
References vstack_t::fcount, feature_maps_t::n_features, vstack_t::next, and feature_maps_t::vstack.
FMS create_feature_maps | ( | char ** | config, |
int | config_lines, | ||
Attribute * | w_attr1, | ||
Attribute * | w_attr2, | ||
Attribute * | s_attr1, | ||
Attribute * | s_attr2 | ||
) |
Methods for the FMS class.
Here is how it works:
FMS = create_feature_maps(config, config_lines, source, target, source_s, target_s);
Input: feature map configuration (ASCII, parsed into separate items); word (or lemma) p-attributes of source and target corpus; s-attributes for sentence boundaries in both corpora (source_s, target_s).
Output: set of relevant features; mapping from lexicon IDs to feature sets; both wrapped in the FMS struct returned from the function.
In order to ensure a maximally compact encoding, feature sets are generated with a two-pass algorithm:
This is the constructor function for the FMS class.
Example usage:
FMS = create_feature_maps(config_data, nr_of_config_lines, source_word, target_word, source_s, target_s);
config | array of strings representing the feature map configuration. |
config_lines | the number of configuration items stored in config. |
w_attr1 | The p-attribute in the first corpus to link. |
w_attr2 | The p-attribute in the second corpus to link. |
s_attr1 | The s-attribute in the first corpus to link. |
s_attr2 | The s-attribute in the second corpus to link. |
References feature_maps_t::att1, feature_maps_t::att2, char_map, char_map_range, charset, cl_attribute_mother_corpus(), cl_corpus_charset(), cl_free, cl_id2freq(), cl_id2str(), cl_id2strlen(), cl_max_id(), CL_MAX_LINE_LENGTH, cl_str2id(), cl_strcpy(), cl_string_canonical(), cl_string_validate_encoding(), config_lines, feature_maps_t::fweight, IGNORE_CASE, IGNORE_DIAC, init_char_map(), feature_maps_t::n_features, feature_maps_t::s1, feature_maps_t::s2, utf8, feature_maps_t::vstack, feature_maps_t::w2f1, feature_maps_t::w2f2, word1, and word2.
Referenced by main().
int feature_match | ( | FMS | fms, |
int | f1, | ||
int | l1, | ||
int | f2, | ||
int | l2 | ||
) |
Compute similarity measure for a pair of regions, source and target, specified by the corpus positions of the first and last sentences in each region.
(And by "sentences" we mean "instances of whatever it is this s-attribute represents".)
This is, basically, the "apply me" method for the FMS object.
Usage:
Sim = feature_match(FMS, source_first, source_last, target_first, target_last);
Note that best_path() simply passes its FMS through to this function. best_path() then decides which sequence of alignments is best, based on the similarity values that this function returns.
fms | The feature map (which contains the s-attributes in question) |
f1 | Index of first "sentence" (i.e. entry on the s-attribute) of the region to analyse in the source. |
l1 | Index of last "sentence" of the region to analyse in the source. |
f2 | Index of first "sentence" (i.e. entry on the s-attribute) of the region to analyse in the target. |
l2 | Index of last "sentence" of the region to analyse in the target. |
References feature_maps_t::att1, feature_maps_t::att2, cl_cpos2id(), cl_struc2cpos(), feature_maps_t::fweight, get_fvector(), release_fvector(), feature_maps_t::s1, feature_maps_t::s2, feature_maps_t::w2f1, and feature_maps_t::w2f2.
Referenced by align_do_alignment(), and best_path().
int* get_fvector | ( | FMS | fms | ) |
Feature count vector handling (used internally by feature_match).
If the vstack of the FMS (head of linked list) does not yet contain anything, then a new integer array is created and a pointer to it is returned.
If the vstack is already set, then the fcount from the element at the top of the linked list is returned, and its vstack_t record is removed from the stack.
In other words, a vector of feature counts is provided either by taking the top one off the stack or by creating a new one.
fms | The FMS to get a feature vector for. |
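The get/release pairing described here behaves like a simple free-list pool. The sketch below illustrates that pattern with hypothetical types and names (vector_pool, pool_node, pool_get, pool_release); only the member names fcount, next, vstack and n_features are taken from this page. It assumes new vectors start zeroed and that callers hand vectors back zeroed, which is what check_fvectors() appears to verify.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for vstack_t: one pooled vector per node. */
typedef struct pool_node {
  int *fcount;
  struct pool_node *next;
} pool_node;

/* Hypothetical stand-in for the relevant FMS members. */
typedef struct {
  int n_features;
  pool_node *vstack;  /* head of the stack of spare vectors */
} vector_pool;

/* Returns a spare vector from the stack if there is one, otherwise
   allocates a new zero-filled vector of n_features ints. */
static int *pool_get(vector_pool *pool)
{
  pool_node *top = pool->vstack;
  int *vec;

  if (top == NULL) {
    vec = calloc((size_t)pool->n_features, sizeof(int));
    assert(vec != NULL);
    return vec;
  }
  vec = top->fcount;
  pool->vstack = top->next;  /* pop the node */
  free(top);
  return vec;
}

/* Pushes a vector back onto the stack so a later pool_get() can reuse it. */
static void pool_release(vector_pool *pool, int *vec)
{
  pool_node *node = malloc(sizeof *node);
  assert(node != NULL);
  node->fcount = vec;
  node->next = pool->vstack;
  pool->vstack = node;
}
```

Reusing vectors this way avoids allocating a fresh array of n_features counters on every similarity computation.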
References cl_free, vstack_t::fcount, feature_maps_t::n_features, vstack_t::next, and feature_maps_t::vstack.
Referenced by feature_match().
void init_char_map | ( | ) |
Initialises char_map.
References char_map, and char_map_range.
Referenced by create_feature_maps().
void release_fvector | ( | int * | fvector, |
FMS | fms | ||
) |
Inserts a new vstack_t at the start of the vstack member of the given FMS.
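(Description inferred from the code by AH, who was unsure how the function name fits. A plausible reading, given get_fvector(), is that "releasing" a vector pushes it back onto the stack so that it can be reused.)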
References feature_maps_t::vstack.
Referenced by feature_match().
void show_features | ( | FMS | fms, |
int | which, | ||
char * | word | ||
) |
Prints the features in an FMS, as applied to a specific lexicon entry, to STDOUT.
Usage: show_features(FMS, 1/2, "word");
This will print all features listed in FMS for the token "word"; "word" is looked up in the source corpus if the 2nd argument == 1, and in the target corpus otherwise.
fms | The FMS to print from. |
which | Which corpus to look up? (See description) |
word | The word-type to look up. |
References feature_maps_t::att1, feature_maps_t::att2, cl_str2id(), feature_maps_t::fweight, feature_maps_t::w2f1, and feature_maps_t::w2f2.
unsigned char char_map[256] |
A character map for computing n-gram features.
After initialisation, this array maps character codes which are to be included in n-grams to a position index without gaps, and all other codes to the index 1.
In effect, every possible byte maps to a number that represents a position in the (unaccented, caseless) Latin alphabet, where (a|A) => 2, and (any punctuation or non-letter) => 1.
This includes, incidentally, UTF-8 component bytes in the upper half of the 8-bit space. So all such component bytes count as "just punctuation" in the character n-gram comparisons. As a consequence, n-gram features are next to useless with non-Latin alphabets.
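As an illustration of such a mapping, the sketch below builds a table for ASCII letters only. It is not the CWB init_char_map(): it ignores accented Latin-1 letters, and the names sketch_map, sketch_range, sketch_init_map, sketch_index and sketch_bigram_slot are hypothetical. Treating the range as "one past the largest index", and combining two indices into a bigram slot, are assumptions made to show why a gap-free index is convenient.

```c
#include <assert.h>

static unsigned char sketch_map[256];
static int sketch_range = 0;  /* assumption: one past the largest index */

/* Maps a..z and A..Z to 2..27 and every other byte to 1. */
static void sketch_init_map(void)
{
  int c;
  for (c = 0; c < 256; c++)
    sketch_map[c] = 1;  /* default: punctuation or non-letter */
  for (c = 'a'; c <= 'z'; c++) {
    unsigned char idx = (unsigned char)(2 + (c - 'a'));
    sketch_map[c] = idx;              /* lower case */
    sketch_map[c - 'a' + 'A'] = idx;  /* upper case folds to the same index */
  }
  sketch_range = 2 + 26;
}

/* Looks up the index for one byte value. */
static int sketch_index(int byte)
{
  assert(byte >= 0 && byte < 256);
  return sketch_map[byte];
}

/* Hypothetical dense slot number for a character bigram. */
static int sketch_bigram_slot(unsigned char c1, unsigned char c2)
{
  return sketch_index(c1) * sketch_range + sketch_index(c2);
}
```

With a gap-free index, an n-gram table needs only sketch_range to the power n slots rather than 256 to the power n.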
Referenced by create_feature_maps(), and init_char_map().
int char_map_range = 0 |
the top of the range of char_map's outputs
Referenced by create_feature_maps(), and init_char_map().