Returns target range only if source index aligns to a single consecutive range of target tokens.
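The check described above can be sketched in plain Python. This is a minimal illustration and not the library's own code: the function name `single_consecutive_range` and the representation of an alignment as a list of target indices are assumptions made for the example.

```python
def single_consecutive_range(target_indices):
    """Return (start, end) if the indices form one unbroken run, else None.

    Hypothetical helper: `target_indices` holds the target token positions
    aligned to one source index.
    """
    if not target_indices:
        return None
    ordered = sorted(set(target_indices))
    start, end = ordered[0], ordered[-1]
    # A run without holes has exactly end - start + 1 distinct members.
    if end - start + 1 != len(ordered):
        return None
    return (start, end)


print(single_consecutive_range([4, 2, 3]))  # (2, 4)
print(single_consecutive_range([2, 5]))     # None
```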
Source to Target alignment: reads source-target.A3.final files, in which each source word may be aligned to multiple target words (adapted from code by Sander Canisius)
Return the aligned target word for a specified index in the source words. Multiple words are concatenated, separated by a space.
Return the aligned target words for a specified index in the source words
Target to Source alignment: reads target-source.A3.final files, in which each source word is aligned to one target word
Return the aligned target word for a specified index in the source words
alias of PTProtocol
This class represents one document/text of the Corpus (read-only)
Extracts paragraphs, returns list of plain-text(!) paragraphs
Iterate over all sentences (sentence_id, sentence) in the document, sentence is a list of 4-tuples (word,id,pos,lemma)
This class represents one document/text of the Corpus, loaded into memory at once and retaining the full structure
Iterate over paragraphs
Iterate over sentences
Checks if the document is valid
Iterate over words
Executes an xpath expression using the correct namespaces
Resolves the namespace identifier to a full URL
Annotation layers for Span Annotation are derived from this abstract base class
Generator over alternatives, either all or only those of a specific annotation type, and possibly also restricted by set.
Will return a single annotation (even if there are multiple). Raises a NoSuchAnnotation exception if none was found
Obtain annotations. Very similar to select() but raises an error if the annotation was not found.
Returns the span element which spans over the specified words or morphemes
Returns an integer indicating whether such an annotation exists, and if so, how many. See annotations() for a description of the parameters.
Returns a RelaxNG definition for this element (as an XML element (lxml.etree) rather than a string)
This is the abstract base class from which all FoLiA elements are derived. This class should not be instantiated directly, but can be useful if you want to check whether a variable is an instance of any FoLiA element: isinstance(x, AbstractElement). It contains methods and variables that are commonly inherited.
High-level function that adds (appends) an annotation to an element. For token annotation elements that fit within the scope, it simply calls append(). For span annotation, it finds or creates the proper annotation layer and inserts the element there
Tests whether a new element of this class can be added to the parent. Returns a boolean or raises ValueError exceptions (unless set to ignore)!
This will use OCCURRENCES, but may be overridden for more customised behaviour.
This method is mostly for internal use.
Makes sure this element (and all subelements) is properly added to the index
Find the most immediate ancestor of the specified type, multiple classes may be specified
Generator yielding all ancestors of this element, effectively back-tracing its path to the root element. A tuple of multiple classes may be specified.
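The two ancestor lookups can be pictured with a toy tree. This is a sketch under stated assumptions, not the FoLiA implementation: `Node`, `Sentence` and `Paragraph` are hypothetical stand-in classes, and returning None when nothing is found is a choice made for this example only.

```python
class Node(object):
    """Toy element with a parent pointer (stand-in for a FoLiA element)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    def ancestors(self, classes=None):
        """Yield each ancestor from the nearest up to the root.

        `classes` may be a class or a tuple of classes; None yields all.
        """
        node = self.parent
        while node is not None:
            if classes is None or isinstance(node, classes):
                yield node
            node = node.parent

    def ancestor(self, *classes):
        """Return the nearest ancestor of any given class (None here if absent)."""
        for node in self.ancestors(classes or None):
            return node
        return None


class Sentence(Node):
    pass


class Paragraph(Node):
    pass


para = Paragraph("p.1")
sent = Sentence("p.1.s.1", para)
word = Node("p.1.s.1.w.1", sent)

print([n.name for n in word.ancestors()])  # ['p.1.s.1', 'p.1']
print(word.ancestor(Paragraph).name)       # p.1
```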
Append a child element. Returns the added element
If an instance is passed as first argument, it will be appended. If a class derived from AbstractElement is passed as first argument, an instance will first be created and then appended.
Generic example, passing a pre-generated instance:
word.append( folia.LemmaAnnotation(doc, cls="house", annotator="proycon", annotatortype=folia.AnnotatorType.MANUAL ) )
Generic example, passing a class to be generated:
word.append( folia.LemmaAnnotation, cls="house", annotator="proycon", annotatortype=folia.AnnotatorType.MANUAL )
Generic example, setting text with a class:
word.append( "house", cls='original' )
Returns this word in context, {size} words to the left, the current word, and {size} words to the right
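A context window of this kind can be sketched on a plain list of tokens. The function below is illustrative only: the name `context`, the list-of-strings input and the padding with a placeholder at the edges are assumptions, not confirmed library behaviour.

```python
def context(words, index, size, placeholder=None):
    """Return `size` items left of words[index], the item itself, and
    `size` items to the right, padding with `placeholder` at the edges."""
    left = words[max(0, index - size):index]
    right = words[index + 1:index + 1 + size]
    left = [placeholder] * (size - len(left)) + left
    right = right + [placeholder] * (size - len(right))
    return left + [words[index]] + right


tokens = ["the", "big", "house", "burned"]
print(context(tokens, 0, 2))  # [None, None, 'the', 'big', 'house']
print(context(tokens, 2, 1))  # ['big', 'house', 'burned']
```

The result always has 2 * size + 1 items, which keeps positions comparable across matches.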
Make a deep copy of this element and all its children. If idsuffix is a string, it is appended to the IDs in the copy; if set to True, a random idsuffix will be generated, including a random 32-bit hash
Generator creating a deep copy of the children of this element. If idsuffix is a string, it is appended to the IDs in the copy; if set to True, a random idsuffix will be generated, including a random 32-bit hash
Like select, but instead of returning the elements, it merely counts them
Obtain the description associated with the element, will raise NoDescription if there is none
Obtain the feature value of the specific subset. If a feature occurs multiple times, the values will be returned in a list.
Example:
sense = word.annotation(folia.Sense)
synset = sense.feat('synset')
Find replaceable elements. Auxiliary function used by replace(). Can be overridden for more fine-grained control. Mostly for internal use.
Returns the index at which an element occurs; recursive by default.
May return a customised text delimiter instead of the default for this class.
Does this element have text (of the specified class)
Is this element part of a correction? If it is, it returns the Correction element (evaluating to True), otherwise it returns None
Insert a child element at specified index. Returns the added element
If an instance is passed as first argument, it will be inserted. If a class derived from AbstractElement is passed as first argument, an instance will first be created and then inserted.
Generic example, passing a pre-generated instance:
word.insert( 3, folia.LemmaAnnotation(doc, cls="house", annotator="proycon", annotatortype=folia.AnnotatorType.MANUAL ) )
Generic example, passing a class to be generated:
word.insert( 3, folia.LemmaAnnotation, cls="house", annotator="proycon", annotatortype=folia.AnnotatorType.MANUAL )
Generic example, setting text:
word.insert( 3, "house" )
Returns a depth-first flat list of all items below this element (not limited to AbstractElement)
Returns the left context for an element, as a list. This method crosses sentence/paragraph boundaries by default, which can be restricted by setting scope
Returns the next element, if it is of the specified type and if it does not cross the boundary of the defined scope. Returns None if no next element is found. Non-authoritative elements are never returned.
Alias for retrieving the original uncorrected text
Internal class method used for turning an XML element into an instance of the Class.
This method will be called after an element is added to another. It can do extra checks and if necessary raise exceptions to prevent addition. By default makes sure the right document is associated.
This method is mostly for internal use.
Returns the previous element, if it is of the specified type and if it does not cross the boundary of the defined scope. Returns None if no previous element is found. Non-authoritative elements are never returned.
Returns a RelaxNG definition for this element (as an XML element (lxml.etree) rather than a string)
Removes the child element
Appends a child element like append(), but replaces any existing child element of the same type and set. If no such child element exists, this will act the same as append()
to be an alternative.
See append() for more information.
Returns the right context for an element, as a list. This method crosses sentence/paragraph boundaries by default, which can be restricted by setting scope
Select child elements of the specified class.
A further restriction can be made based on set. Whether or not to apply recursively (by default enabled) can also be configured, optionally with a list of elements never to recurse into.
Class: The class to select; any Python class subclassed from AbstractElement
set: The set to match against, only elements pertaining to this set will be returned. If set to None (default), all elements regardless of set will be returned.
recursive: Select recursively? Descending into child elements? Boolean defaulting to True.
The fourth argument is a list of classes never to recurse into. If it is set to the boolean True instead of a list, all non-authoritative elements will be skipped (this is the default behaviour). It is common not to want to recurse into the following elements: folia.Alternative, folia.AlternativeLayer, folia.Suggestion, and folia.Original. The elements contained in these are never authoritative, so passing True is equivalent to passing this default list. You may also include the boolean True as a member of a list if you want to skip additional tags along with the non-authoritative ones.
node: Reserved for internal usage, used in recursion.
Example:
text.select(folia.Sense, 'cornetto', True, [folia.Original, folia.Suggestion, folia.Alternative] )
Set a different document, usually no need to call this directly, invoked implicitly by copy()
Associate a document with this element
Correct all parent relations for elements within the scope, usually no need to call this directly, invoked implicitly by copy()
Set the text for this element (and class)
Get the text strictly associated with this element (of the specified class). Does not recurse into children, with the sole exception of Correction/New
Get the text associated with this element (of the specified class), will always be a unicode instance. If no text is directly associated with the element, it will be obtained from the children. If that doesn’t result in any text either, a NoSuchText exception will be raised.
If retaintokenisation is True, the space attribute on words will be ignored, otherwise it will be adhered to and text will be detokenised as much as possible.
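The effect of retaintokenisation can be shown with a small stand-alone sketch. It does not use the library: the function `join_words` and the representation of each word as a (text, space) pair are assumptions, where `space` records whether a space follows the word in the original text.

```python
def join_words(words, retaintokenisation=False):
    """Join (text, space) pairs into one string.

    `space` says whether a space follows the word in the original text.
    With retaintokenisation=True it is ignored and every token is
    separated by a single space.
    """
    parts = []
    for i, (text, space) in enumerate(words):
        parts.append(text)
        is_last = i == len(words) - 1
        if not is_last and (retaintokenisation or space):
            parts.append(" ")
    return "".join(parts)


sentence = [("Hello", False), (",", True), ("world", False), ("!", True)]
print(join_words(sentence))                           # Hello, world!
print(join_words(sentence, retaintokenisation=True))  # Hello , world !
```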
Get the text explicitly associated with this element (of the specified class). Returns the TextContent instance rather than the actual text. Raises NoSuchText exception if not found.
Unlike text(), this method does not recurse into child elements (with the sole exception of the Correction/New element), and it returns the TextContent instance rather than the actual text!
Alias for text with retaintokenisation=True
Internal method, recompute textual value. Only for elements that are a TEXTCONTAINER
Serialises the FoLiA element to XML, by returning an XML Element (in lxml.etree) for this element and all its children. For string output, consider the xmlstring() method instead.
Serialises this FoLiA element to XML, returns a (unicode) string with XML representation for this element and all its children.
Abstract element, all span annotation elements are derived from this class
Will return a single annotation (even if there are multiple). Raises a NoSuchAnnotation exception if none was found
Generator creating a deep copy of the children of this element. If idsuffix is a string, it is appended to the IDs in the copy; if set to True, a random idsuffix will be generated, including a random 32-bit hash
Returns an integer indicating whether such an annotation exists, and if so, how many. See annotations() for a description of the parameters.
Sets the span of the span element anew, erases all data inside
Returns a list of word references, these can be Words but also Morphemes or Phonemes.
Abstract element, all structure elements inherit from this class. Never instantiated directly.
See AbstractElement.append()
Does the specified annotation layer exist?
Returns a list of annotation layers found directly under this element, does not include alternative layers
Returns a generator of Paragraph elements found (recursively) under this element.
Returns a generator of Sentence elements found (recursively) under this element
Returns a generator of Word elements found (recursively) under this element.
Abstract element, all subtoken annotation elements are derived from this class
Obtain the text (unicode instance)
Abstract element, all token annotation elements are derived from this class
See AbstractElement.append()
Actor feature, to be used within Event
Apply a correction (TODO: documentation to be written still)
Classes inherited from this class allow for automatic ID generation, using the convention of adding a period, the name of the element, another period, and a sequence number
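The naming convention can be sketched as follows. This is an illustration, not the library's code: the class `IdGenerator`, the use of short element names such as "s" and "w", and sequence numbers starting at 1 are all assumptions.

```python
from collections import defaultdict


class IdGenerator(object):
    """Hand out IDs of the form <parent id>.<element name>.<sequence number>."""

    def __init__(self):
        # One counter per (parent id, element name) pair.
        self._counters = defaultdict(int)

    def generate(self, parent_id, element_name):
        key = (parent_id, element_name)
        self._counters[key] += 1
        return "%s.%s.%d" % (parent_id, element_name, self._counters[key])


ids = IdGenerator()
print(ids.generate("doc.p.1", "s"))      # doc.p.1.s.1
print(ids.generate("doc.p.1", "s"))      # doc.p.1.s.2
print(ids.generate("doc.p.1.s.1", "w"))  # doc.p.1.s.1.w.1
```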
Elements that allow token annotation (including extended annotation) must inherit from this class
Generator over alternatives, either all or only those of a specific annotation type, and possibly also restricted by set.
Will return a single annotation (even if there are multiple). Raises a NoSuchAnnotation exception if none was found
Obtain annotations. Very similar to select() but raises an error if the annotation was not found.
Returns an integer indicating whether such an annotation exists, and if so, how many. See annotations() for a description of the parameters.
Element grouping alternative token annotation(s). Multiple alternative elements may occur, each denoting a different alternative. Elements grouped inside an alternative block are considered dependent.
Element grouping alternative subtoken annotation(s). Multiple altlayers elements may occur, each denoting a different alternative. Elements grouped inside an alternative block are considered dependent.
Begindatetime feature, to be used within Event
Element used for captions for figures or tables, contains sentences
Chunk element, span annotation element to be used in ChunkingLayer
Chunking Layer: Annotation layer for Chunk span annotation elements
Coreference chain. Consists of coreference links.
Coreference Layer: Annotation layer for CoreferenceChain span annotation elements
Coreference link. Used in coreferencechain.
A corpus of various FoLiA documents. Yields a Document on each iteration. Suitable for sequential processing.
A corpus of various FoLiA documents. Yields the filenames on each iteration.
Processes a corpus of various FoLiA documents using parallel processing. Calls a user-defined function with the three-tuple (filename, args, kwargs) for each file in the corpus. The user-defined function is itself responsible for instantiating a FoLiA document! args and kwargs, as received by the custom function, are set through the run() method, which yields the result of the custom function on each iteration.
See AbstractElement.append()
May return a customised text delimiter instead of the default for this class.
Get the text explicitly associated with this element (of the specified class). Returns the TextContent instance rather than the actual text. Raises NoSuchText exception if not found.
Unlike text(), this method does not recurse into child elements (with the sole exception of the Correction/New element), and it returns the TextContent instance rather than the actual text!
Dependencies Layer: Annotation layer for Dependency span annotation elements. For dependency entities.
Returns the dependent of the dependency relation. Instance of DependencyDependent
Returns the head of the dependency relation. Instance of DependencyHead
Description is an element that can be used to associate a description with almost any other FoLiA element
Structure element representing some kind of division. Divisions may be nested at will, and may include almost all kinds of other structure elements.
This is the FoLiA Document; all elements have to be associated with a FoLiA document. Besides holding elements, the document holds metadata, including declarations, and an index of all IDs.
Add a text to the document:
Example 1:
doc.append(folia.Text)
Create an element associated with this Document. This method may be obsolete and removed later.
No arguments: get the document’s date from metadata. Argument: set the document’s date in metadata.
Returns a depth-first flat list of all items in the document
No arguments: get the document’s language (ISO-639-3) from metadata. Argument: set the document’s language (ISO-639-3) in metadata.
No arguments: get the document’s license from metadata. Argument: set the document’s license in metadata.
Load a FoLiA or D-Coi XML file
Return a generator of all paragraphs found in the document.
If an index is specified, return the n’th paragraph only (starting at 0)
Main XML parser, will invoke class-specific XML parsers. For internal use.
No arguments: get the document’s publisher from metadata. Argument: set the document’s publisher in metadata.
Save the document to FoLiA XML.
Return a generator of all sentences found in the document, except for sentences in quotes.
If an index is specified, return the n’th sentence only (starting at 0)
Returns the text of the entire document (returns a unicode instance)
No arguments: get the document’s title from metadata. Argument: set the document’s title in metadata.
Return a generator of all active words found in the document. Does not descend into annotation layers, alternatives, originals, suggestions.
If an index is specified, return the n’th word only (starting at 0)
Run an XPath expression and parse the resulting elements. Don’t forget to use the FoLiA namespace in your expressions, using folia: or the short form f:
Domain annotation: an extended token annotation element
Exception raised when an identifier that is already in use is assigned again to another element
Enddatetime feature, to be used within Event
Entities Layer: Annotation layer for Entity span annotation elements. For named entities.
Entity element, for named entities, span annotation element to be used in EntitiesLayer
Feature elements can be used to associate subsets and subclasses with almost any annotation element
Element for the representation of a graphical figure. Structure element.
Function feature, to be used with morphemes
Gap element. Represents skipped portions of the text. Contains Content and Desc elements
Head element. A structure element. Acts as the header/title of a division. There may be one per division. Contains sentences.
Head feature, to be used within PosAnnotation
Element used for labels. Mostly used within list items. Contains words.
Language annotation: an extended token annotation element
Lemma annotation: a token annotation element
Level feature, to be used with coreferences
Line break element, signals a line break
Element for enumeration/itemisation. Structure element. Contains ListItem elements.
Single element in a List. Structure element. Contained within List element.
Metric elements allow the annotation of any kind of metric on any kind of annotation element, allowing, for example, statistical measures to be added to elements as annotation.
Modality feature, to be used with coreferences
Morpheme element, represents one morpheme in morphological analysis, subtoken annotation element to be used in MorphologyLayer
Find span annotation of the specified type that includes this word
Morphology Layer: Annotation layer for Morpheme subtoken annotation elements. For morphological analysis.
Exception raised when the requested type of annotation does not exist for the selected element
Exception raised when the requested type of text content does not exist for the selected element
Paragraph element. A structure element. Represents a paragraph and holds all its sentences (and possibly other structure such as Whitespace and Quotes).
This class describes a pattern over words to be searched for. The
Document.findwords() method can subsequently be called with this pattern, and it will return all the words that match. An example will best illustrate this. First, a trivial example of searching for one word::
    for match in doc.findwords( folia.Pattern('house') ):
        for word in match:
            print word.id
        print "----"
The same can be done for a sequence::
    for match in doc.findwords( folia.Pattern('a','big', 'house') ):
        for word in match:
            print word.id
        print "----"
The boolean value ``True`` acts as a wildcard, matching any word::
    for match in doc.findwords( folia.Pattern('a',True,'house') ):
        for word in match:
            print word.id, word.text()
        print "----"
Alternatively, and more constraining, you may also specify a tuple of alternatives::
    for match in doc.findwords( folia.Pattern('a',('big','small'),'house') ):
        for word in match:
            print word.id, word.text()
        print "----"
Or even a regular expression using the ``folia.RegExp`` class::
    for match in doc.findwords( folia.Pattern('a', folia.RegExp('b?g'),'house') ):
        for word in match:
            print word.id, word.text()
        print "----"
Rather than searching on the text content of the words, you can search on the
classes of any kind of token annotation using the keyword argument
``matchannotation=``::
    for match in doc.findwords( folia.Pattern('det','adj','noun',matchannotation=folia.PosAnnotation ) ):
        for word in match:
            print word.id, word.text()
        print "----"
The set can be restricted by adding the additional keyword argument
``matchannotationset=``. Case sensitivity, by default disabled, can be enabled by setting ``casesensitive=True``.
Things become even more interesting when different Patterns are combined. A
match will have to satisfy all patterns::
    for match in doc.findwords( folia.Pattern('a', True, 'house'), folia.Pattern('det','adj','noun',matchannotation=folia.PosAnnotation ) ):
        for word in match:
            print word.id, word.text()
        print "----"
The ``findwords()`` method can be instructed to also return left and/or right context for any match. This is done using the ``leftcontext=`` and ``rightcontext=`` keyword arguments, each taking an integer that specifies the number of context words to include in each match. For instance, we can look for the word house and return its immediate neighbours as follows::
    for match in doc.findwords( folia.Pattern('house') , leftcontext=1, rightcontext=1):
        for word in match:
            print word.id
        print "----"
A match here would thus always consist of three words instead of just one.
Lastly, ``Pattern`` also supports variable-width gaps; the asterisk symbol
has special meaning to this end::
    for match in doc.findwords( folia.Pattern('a','*','house') ):
        for word in match:
            print word.id
        print "----"
Unlike the pattern ``('a',True,'house')``, which by definition is a pattern of
three words, the pattern in the example above will match gaps of any length (up
to a certain built-in maximum), so this might include matches such as *a very
nice house*.
Some remarks on these methods of querying are in order. These searches are
pretty exhaustive and are done by simply iterating over all the words in the
document. The entire document is loaded in memory and no special indices are involved.
For single documents this is okay, but when iterating over a corpus of
thousands of documents, this method is too slow, especially for real-time
applications. For huge corpora, clever indexing and database management systems
will be required. This however is beyond the scope of this library.
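The exhaustive search described above, iterating over all words and testing each window, can be sketched on plain strings. This is a toy model and not the library's matcher: `matches` and `find_matches` are hypothetical names, tokens are bare strings rather than Word elements, and only string, ``True`` and tuple pattern elements are handled.

```python
def matches(element, token):
    """Test one pattern element against one token (case-insensitive)."""
    if element is True:               # wildcard: any word
        return True
    if isinstance(element, tuple):    # tuple: any of the alternatives
        return token.lower() in [alt.lower() for alt in element]
    return token.lower() == element.lower()


def find_matches(tokens, pattern):
    """Yield every window of tokens that satisfies the whole pattern."""
    width = len(pattern)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(matches(e, t) for e, t in zip(pattern, window)):
            yield window


text = "A big house and a small house near a red barn".split()
for match in find_matches(text, ("a", ("big", "small"), "house")):
    print(match)
# ['A', 'big', 'house']
# ['a', 'small', 'house']
for match in find_matches(text, ("a", True, "barn")):
    print(match)
# ['a', 'red', 'barn']
```

Every window is inspected once per pattern, so the cost grows with document length times pattern width, which is why this approach does not scale to large corpora.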
Resolve a variable sized pattern to all patterns of a certain fixed size
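One way to picture this resolution step is to replace each ``'*'`` by a run of ``True`` wildcards so that the total length equals the requested size. The sketch below is an assumption-laden illustration, not the library's algorithm: the function name `resolve` is hypothetical, and it assumes a gap always covers at least one word.

```python
def resolve(pattern, size):
    """Yield every fixed-size expansion of `pattern` with `size` elements.

    Each '*' is replaced by one or more True wildcards (an assumption:
    a gap here covers at least one word).
    """
    gaps = [i for i, element in enumerate(pattern) if element == "*"]
    spare = size - (len(pattern) - len(gaps))  # words to spread over gaps
    if not gaps:
        if size == len(pattern):
            yield tuple(pattern)
        return
    if spare < len(gaps):
        return

    def distribute(total, parts):
        """Yield all ways to split `total` into `parts` positive integers."""
        if parts == 1:
            yield (total,)
            return
        for first in range(1, total - parts + 2):
            for rest in distribute(total - first, parts - 1):
                yield (first,) + rest

    for widths in distribute(spare, len(gaps)):
        remaining = iter(widths)
        expanded = []
        for element in pattern:
            if element == "*":
                expanded.extend([True] * next(remaining))
            else:
                expanded.append(element)
        yield tuple(expanded)


print(list(resolve(("a", "*", "house"), 4)))
# [('a', True, True, 'house')]
```

Each expansion is an ordinary fixed-width pattern, so it can be matched with the same window test used for patterns without gaps.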
Part-of-Speech annotation: a token annotation element
An XPath query on one or more FoLiA documents
Quote: a structure element. For quotes/citations. May hold words, sentences or paragraphs.
Streaming FoLiA reader. The reader allows you to read a FoLiA Document without holding the whole tree structure in memory. The document will be read and the elements you seek returned as they are found. If you are querying a corpus of large FoLiA documents for a specific structure, then it is strongly recommended to use the Reader rather than the standard Document!
Semantic Role
Semantic Roles Layer: Annotation layer for SemanticRole span annotation elements
Sense annotation: a token annotation element
Sentence element. A structure element. Represents a sentence and holds all its words (and possibly other structure such as LineBreaks, Whitespace and Quotes)
Are there corrections in this sentence?
Generic correction method for words. You most likely want to use the helper functions splitword(), mergewords(), deleteword(), insertword() instead
TODO: Write documentation
Obtain the division this sentence is a part of (None otherwise)
TODO: Write documentation
Obtain the paragraph this sentence is a part of (None otherwise)
TODO: Write documentation
String
Subjectivity annotation/Sentiment analysis: a token annotation element
Synset feature, to be used within Sense
Syntactic Unit, span annotation element to be used in SyntaxLayer
Syntax Layer: Annotation layer for SyntacticUnit span annotation elements
A full text. This is a high-level element (not to be confused with TextContent!). This element may contain divisions, paragraphs, sentences, etc.
Text content element (t), holds text to be associated with whatever element the text content element is a child of.
Text content elements on structure elements like Paragraph and Sentence are by definition untokenised. Only on Word level and deeper they are by definition tokenised.
Find the default reference for text offsets: The parent of the current textcontent’s parent (counting only Structure Elements and Subtoken Annotation Elements)
Note: This returns not a TextContent element, but its parent. Whether the textcontent actually exists is checked later/elsewhere
(Method for internal usage, see AbstractElement)
(Method for internal usage, see AbstractElement)
(Method for internal usage, see AbstractElement.postappend())
Obtain the text (unicode instance)
Validates the Text Content’s references. Raises UnresolvableTextContent when invalid
Time feature, to be used with coreferences
alias of TimeSegment
Dependencies Layer: Annotation layer for Dependency span annotation elements. For dependency entities.
Value feature, to be used within Metric
Whitespace element, signals a vertical whitespace
Word (aka token) element. Holds a word/token and all its related token annotations.
Obtain the deepest division this word is a part of, otherwise return None
Shortcut: returns the FoLiA class of the domain annotation (will return only one if there are multiple!)
Find span annotation of the specified type that includes this word
Returns the text delimiter
Shortcut: returns the FoLiA class of the lemma annotation (will return only one if there are multiple!)
Returns a specific morpheme, the n’th morpheme (given the particular set if specified).
Generator yielding all morphemes (in a particular set if specified). For retrieving one specific morpheme by index, use morpheme() instead
Obtain the paragraph this word is a part of, otherwise return None
Shortcut: returns the FoLiA class of the PoS annotation (will return only one if there are multiple!)
Shortcut: returns the FoLiA class of the sense annotation (will return only one if there are multiple!)
Obtain the sentence this word is a part of, otherwise return None
Word reference. Used to refer to words or morphemes from span annotation elements. The Python class will only be used when the word reference cannot be resolved; if it can, Word or Morpheme objects will be used instead
Generator over common ancestors, of the Class specified, of the current element and the other specified elements
Returns (datetime, tz offset in minutes) or (None, None).
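The return shape can be illustrated with a small stand-alone parser. Only the return contract comes from the description above; the function name `parse_datetime`, the accepted ISO 8601-style input format and the choice of offset 0 when no timezone is given are assumptions for this sketch.

```python
import re
from datetime import datetime

# Assumed input format: YYYY-MM-DDTHH:MM:SS with optional Z or +HH:MM / -HH:MM
_ISO = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})"
    r"(Z|[+-]\d{2}:\d{2})?$"
)


def parse_datetime(value):
    """Return (naive datetime, tz offset in minutes), or (None, None)."""
    match = _ISO.match(value or "")
    if not match:
        return (None, None)
    try:
        moment = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
    except ValueError:  # well-formed but impossible, e.g. month 13
        return (None, None)
    zone = match.group(2)
    if zone is None or zone == "Z":
        return (moment, 0)
    sign = -1 if zone[0] == "-" else 1
    hours, minutes = int(zone[1:3]), int(zone[4:6])
    return (moment, sign * (hours * 60 + minutes))


print(parse_datetime("2011-09-27T14:03:00+02:00"))
print(parse_datetime("not a date"))
```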
Internal function, parses common FoLiA attributes and sets up the instance accordingly