This tutorial will introduce the FoLiA Python library, part of PyNLPl. The FoLiA library provides an Application Programming Interface for the reading, creation and manipulation of FoLiA XML documents. The library works under Python 2.7 as well as Python 3, which is the recommended version. The samples in this documentation follow Python 3 conventions.
Prior to reading this document, it is recommended to first read the FoLiA documentation itself and familiarise yourself with the format and underlying paradigm. The FoLiA documentation can be found on the `FoLiA website <https://proycon.github.io/folia>`_. It is especially important to understand the way FoLiA handles sets/classes, declarations, common attributes such as annotator/annotatortype and the distinction between various kinds of annotation categories such as token annotation and span annotation.
This Python library is also the foundation of the FoLiA Tools collection, which consists of various command line utilities to perform common tasks on FoLiA documents. If you’re merely interested in performing a certain common task, such as a single query or conversion, you might want to check whether it already contains a tool that does what you want.
Any script that uses FoLiA starts with the import:
from pynlpl.formats import folia
Subsequently, a document can be read from file as follows:
doc = folia.Document(file="/path/to/document.xml")
This returns an instance that holds the entire document in memory. Note that for large FoLiA documents this may consume quite some memory! If you already have the document content in a string, you can load it as follows:
doc = folia.Document(string="<FoLiA ...")
Once you have loaded a document, all data is available for you to read and manipulate as you see fit. We will first illustrate some simple use cases:
To save a document back to the file it was loaded from, we do:
doc.save()
Or we can specify a specific filename:
doc.save("/tmp/document.xml")
Note
Any content that is in a different XML namespace than the FoLiA namespaces or other supported namespaces (XML, Xlink), will be ignored upon loading and lost when saving.
You may want to simply print all (plain) text contained in the document, which is as easy as:
print(doc)
Alternatively, you can obtain a string representation of all text:
text = str(doc)
For any subelement of the document, you can obtain its text in the same fashion.
Note
In Python 2, both str() and unicode() return a unicode instance. You may need to append .encode('utf-8') for proper output.
A document instance has an index which you can use to grab any of its subelements by ID. Querying the index works much like using a Python dictionary:
word = doc['example.p.3.s.5.w.1']
print(word)
Note
Python 2 users will have to do print word.text().encode('utf-8') instead, to ensure non-ascii characters are printed properly.
Usually you do not know in advance the ID of the element you want, or you want multiple elements. There are some methods of iterating over certain elements using the FoLiA library.
For example, you can iterate over all words:
for word in doc.words():
    print(word)
That, however, gives you one big iteration of words without boundaries. More likely, you will want to seek words within sentences. So we first iterate over all sentences, then over the words therein:
for sentence in doc.sentences():
    for word in sentence.words():
        print(word)
Or including paragraphs, assuming the document has them:
for paragraph in doc.paragraphs():
    for sentence in paragraph.sentences():
        for word in sentence.words():
            print(word)
You can also use this method to obtain a specific word, by passing an index parameter:
word = sentence.words(3) #retrieves the fourth word
If you want to iterate over all of the child elements of a certain element, regardless of what type they are, you can simply do so as follows:
for subelement in element:
    if isinstance(subelement, folia.Sentence):
        print("this is a sentence")
    else:
        print("this is something else")
If applied recursively, this allows you to traverse the entire element tree. There are, however, specialised methods available that do this for you.
There is a generic method available on all elements to select child elements of any desired class. This method is by default applied recursively. Internally, the paragraphs(), words() and sentences() methods seen above are simply shortcuts that make use of the select method:
sentence = doc['example.p.3.s.5']
words = sentence.select(folia.Word)
for word in words:
    print(word)
The select() method has a sibling count(), invoked with the same arguments, which simply counts how many items it finds, without actually returning them:
wordcount = sentence.count(folia.Word)
Advanced Notes:
The select() method and the similar high-level methods derived from it are generators. This implies that the results of the selection are returned one by one during iteration, as opposed to all being stored in memory. It also implies that you can only iterate over the results once: we cannot do another iteration over the words variable in the above example unless we reinvoke the select() method to get a new generator. Likewise, we cannot do len(words), but have to use the count() method instead.
If you want to have all results in memory in a list, you can simply do the following:
words = list(sentence.select(folia.Word))
The select method is recursive by default; set the third argument to False to make it non-recursive. The second argument can be used for restricting matches to a specific set, a tuple of classes. The recursion will not go into any non-authoritative elements such as alternatives or the originals of corrections.
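The generator behaviour described above is a general Python property and can be tried out without a FoLiA document. The following is a minimal sketch on a toy tree; the Node class and its select() and count() methods are hypothetical stand-ins written for illustration (with a simplified signature), not the library's implementation.

```python
class Node:
    """Toy element; a hypothetical stand-in for a FoLiA element."""

    def __init__(self, kind, *children):
        self.kind = kind
        self.children = list(children)

    def select(self, kind, recursive=True):
        """Yield matching descendants one by one (this is a generator)."""
        for child in self.children:
            if child.kind == kind:
                yield child
            if recursive:
                yield from child.select(kind, recursive)

    def count(self, kind, recursive=True):
        """Count matches without keeping them in memory."""
        return sum(1 for _ in self.select(kind, recursive))


sentence = Node("s", Node("w"), Node("w"), Node("w"))
paragraph = Node("p", sentence)

words = paragraph.select("w")
first_pass = [w.kind for w in words]    # consumes the generator
second_pass = [w.kind for w in words]   # the generator is exhausted by now

n_recursive = paragraph.count("w")                  # descends into the sentence
n_direct = paragraph.count("w", recursive=False)    # direct children only
in_memory = list(paragraph.select("w"))             # a list can be reused and measured
```

The second pass comes back empty because the generator was already consumed, which is why a fresh select() call or a list() conversion is needed for repeated iteration.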
The FoLiA library discerns various Python classes for structure annotation; the corresponding FoLiA XML tag is listed too. Sets and classes can be associated with most of these elements to make them more specific, but these are never prescribed by FoLiA. The list of classes is as follows:
The FoLiA documentation explains all of these in detail.
FoLiA and this library enforce explicit rules about what elements are allowed in what others. Exceptions will be raised when this is about to be violated.
The FoLiA paradigm features sets and classes as primary means to represent the actual value (class) of an annotation. A set often corresponds to a tagset, such as a set of part-of-speech tags, and a class is one selected value in such a set.
The paradigm furthermore introduces other common attributes to set on annotation elements, such as an identifier, information on the annotator, and more. A full list is provided below:
element.id (string) - The unique identifier of the element
element.set (string) - The set the element pertains to.
element.cls (string) - The class of the annotation. Classes correspond with the tags of a tagset in the case of many annotation types. Note that since class is already a reserved keyword in Python, the library consistently uses cls
element.annotator (string) - The name or ID of the annotator who added/modified this element
element.annotatortype - The type of annotator, can be either folia.AnnotatorType.MANUAL or folia.AnnotatorType.AUTO
element.confidence (float) - A confidence value expressing
element.datetime (datetime.datetime) - The date and time when the element was added/modified.
element.n (string) - An ordinal label, used for instance in enumerated list contexts, numbered sections, etc.
The following attributes are specific to a speech context:
Attributes that are not available for certain elements, or not set, default to None.
FoLiA is of course a format for linguistic annotation. Accessing annotation is therefore one of the primary functions of this library. This can be done using annotations() or annotation(), which are similar to the select() method, except that they will raise an exception when no such annotation is found. The difference between annotation() and annotations() is that the former will grab only one annotation and raise an exception if there are more between which it can’t disambiguate, whereas the latter is a generator, but will still raise an exception if none is found:
for word in doc.words():
    try:
        pos = word.annotation(folia.PosAnnotation, 'CGN')
        lemma = word.annotation(folia.LemmaAnnotation)
        print("Word: ", word)
        print("ID: ", word.id)
        print("PoS-tag: ", pos.cls)
        print("PoS Annotator: ", pos.annotator)
        print("Lemma-tag: ", lemma.cls)
    except folia.NoSuchAnnotation:
        print("No PoS or Lemma annotation")
Note that the second argument of annotation(), annotations() or select() can be used to restrict your selection to a certain set. In the above example we restrict ourselves to Part-of-Speech tags in the CGN set.
The following token annotation elements are available in FoLiA; they are embedded under a structural element.
The following annotation types are somewhat special, as they are the only elements for which FoLiA assumes a default set and a default class:
FoLiA distinguishes token annotation and span annotation. Token annotation is embedded in-line within a structural element, and the annotation therefore pertains to that structural element, whereas span annotation is stored in a stand-off annotation layer outside the element and refers back to it. Span annotation elements typically span over multiple structural elements.
We will discuss three ways of accessing span annotation. As stated, span annotation is contained within an annotation layer of a certain structure element, often a sentence. In the first way of accessing span annotation, we do everything explicitly. We first obtain the layer, then iterate over the span annotation elements within that layer, and finally iterate over the words to which the span applies. Assume we have a sentence and we want to print all the named entities in it:
for layer in sentence.select(folia.EntitiesLayer):
    for entity in layer.select(folia.Entity):
        print(" Entity class=", entity.cls, " words=")
        for word in entity.wrefs():
            print(word, end="") #print without newline
        print() #print newline
The wrefs() method, available on all span annotation elements, will return a list of all words (as well as morphemes and phonemes) over which a span annotation element spans.
This first way is rather verbose. The second way of accessing span annotation takes another approach, using the findspans() method on Word instances. Here we start from a word and seek span annotations in which that word occurs. Assume we have a word and want to find chunks it occurs in:
for chunk in word.findspans(folia.Chunk):
    print(" Chunk class=", chunk.cls, " words=")
    for word2 in chunk.wrefs(): #print all words in the chunk (of which the word is a part)
        print(word2, end="")
    print()
The findspans() method can be called with either the class of a span annotation element, such as folia.Chunk, or with the class of the layer, such as folia.ChunkingLayer.
The third way allows us to look for span elements given an annotation layer and words. In other words, it checks if one or more words form a span. This is an exact match and not a sub-part match as in the previously described method. To do this, we use the findspan() method on annotation layers:
for span in annotationlayer.findspan(word1,word2):
    print(span.cls)
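The difference between the sub-part match of findspans() and the exact match of findspan() can be illustrated without the library. The sketch below uses toy data only: the spans list and both helper functions are hypothetical and merely mimic the two matching strategies described above.

```python
# Toy data: each span is (class, tuple of word IDs). Hypothetical, for illustration.
spans = [
    ("np", ("w.1", "w.2")),
    ("vp", ("w.3", "w.4", "w.5")),
    ("np", ("w.4", "w.5")),
]


def spans_containing(word_id):
    """Sub-part match: every span in which the word occurs."""
    return [cls for cls, words in spans if word_id in words]


def span_exactly(*word_ids):
    """Exact match: only spans that cover precisely these words."""
    return [cls for cls, words in spans if words == tuple(word_ids)]


containing_w4 = spans_containing("w.4")     # both the vp and the inner np
exact_w4_w5 = span_exactly("w.4", "w.5")    # only the inner np
exact_w4 = span_exactly("w.4")              # no span covers w.4 alone
```

Starting from a word is convenient when you want every annotation that touches it; starting from the layer with an exact word sequence is the better fit when you want to know whether those words form a unit by themselves.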
This section lists the available span annotation elements; the layer that contains them is explicitly mentioned as well.
Some of the span annotation elements are complex and take span role elements as children. These are normal span annotation elements that occur within another span annotation (of a particular type) and cannot be used standalone.
- folia.CoreferenceChain - Requires the roles folia.CoreferenceLink (coreferencelink) pointing to each coreferenced structure in the chain
- folia.Dependency - Requires the roles folia.HeadSpan (hd) and folia.DependencyDependent (dep)
The span role folia.HeadSpan (hd) may actually be used by most span annotation elements, indicating its head part.
Creating a new FoLiA document, rather than loading an existing one from file, is done by explicitly providing the ID for the new document in the constructor:
doc = folia.Document(id='example')
Whenever you add a new type of annotation, or a different set, to a FoLiA document, you have to first declare it. This is done using the declare() method. It takes as arguments the annotation type and the set, and you can optionally pass the keyword arguments annotator= and annotatortype= to set defaults.
An example for Part-of-Speech annotation:
doc.declare(folia.PosAnnotation, 'brown-tagset')
An example with a default annotator:
doc.declare(folia.PosAnnotation, 'brown-tagset', annotator='proycon', annotatortype=folia.AnnotatorType.MANUAL)
Any additional sets for Part-of-Speech would have to be explicitly declared as well. To check if a particular annotation type and set is declared, use the declared(Class, set) method.
Assuming we begin with an empty document, we should first add a Text element. Then we can add paragraphs, sentences, or other structural elements. The add() method adds new children to an element:
text = doc.add(folia.Text)
paragraph = text.add(folia.Paragraph)
sentence = paragraph.add(folia.Sentence)
sentence.add(folia.Word, 'This')
sentence.add(folia.Word, 'is')
sentence.add(folia.Word, 'a')
sentence.add(folia.Word, 'test')
sentence.add(folia.Word, '.')
Note
The add() method is actually a wrapper around append(), which takes the exact same arguments. It performs extra checks and works for both span annotation and token annotation. Using append() will be faster.
Adding annotations, or any elements for that matter, is done using the add() method on the intended parent element. We assume that the annotations we add have already been properly declared, otherwise an exception will be raised as soon as add() is called. Let’s build on the previous example:
#First we grab the fourth word, 'test', from the sentence
word = sentence.words(3)
#Add Part-of-Speech tag
word.add(folia.PosAnnotation, set='brown-tagset', cls='n')
#Add lemma (to the word, just like the Part-of-Speech tag)
word.add(folia.LemmaAnnotation, cls='test')
Note that in the above examples, the add() method takes a class as first argument, and subsequently takes keyword arguments that will be passed to the class’s constructor.
A second way of using add() is by simply passing a fully instantiated child element, thus constructing it prior to adding. The following is equivalent to the above example, as the previous method is merely a shortcut for convenience:
#First we grab the fourth word, 'test', from the sentence
word = sentence.words(3)
#Add Part-of-Speech tag
word.add( folia.PosAnnotation(doc, set='brown-tagset', cls='n') )
#Add lemma (to the word, just like the Part-of-Speech tag)
word.add( folia.LemmaAnnotation(doc, cls='test') )
The add() method always returns that which was added, allowing it to be chained.
In the above example we first explicitly instantiate a folia.PosAnnotation and a folia.LemmaAnnotation. Instantiation of any FoLiA element (always a Python class derived from folia.AbstractElement) follows this pattern:
Class(document, *children, **kwargs)
Note that the document has to be passed explicitly as first argument to the constructor.
The common attributes are set using equally named keyword arguments:
- id=
- cls=
- set=
- annotator=
- annotatortype=
- confidence=
- src=
- speaker=
- begintime=
- endtime=
Not all attributes are allowed for all elements, and certain attributes are required for certain elements. ValueError exceptions will be raised when these constraints are not met.
Instead of setting id, you can also set the keyword argument generate_id_in and pass it another element; an ID will then be automatically generated, based on the ID of the element passed. When you use the first method of adding elements, instantiation with generate_id_in will take place automatically behind the scenes when applicable and when id is not explicitly set.
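To give an impression of what deriving an ID from another element's ID can look like, here is a minimal sketch. It assumes a scheme of parent ID, tag and a running counter, as suggested by IDs such as example.p.3.s.5.w.1 used in this tutorial; the IdGenerator class is hypothetical and is not the library's actual implementation.

```python
from collections import defaultdict


class IdGenerator:
    """Hypothetical helper that derives child IDs from a parent ID."""

    def __init__(self):
        self._counters = defaultdict(int)

    def generate(self, parent_id, tag):
        # One counter per (parent, tag) pair, so numbering restarts in each parent.
        self._counters[(parent_id, tag)] += 1
        return "{}.{}.{}".format(parent_id, tag, self._counters[(parent_id, tag)])


gen = IdGenerator()
first = gen.generate("example.p.3.s.5", "w")
second = gen.generate("example.p.3.s.5", "w")
other = gen.generate("example.p.3.s.6", "w")
```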
Any extra non-keyword arguments should be FoLiA elements and will be appended as the contents of the element, i.e. the children or subelements. Instead of using non-keyword arguments, you can also use the keyword argument contents and pass a list. This is a shortcut made merely for convenience, as Python obliges all non-keyword arguments to come before the keyword arguments, which is often aesthetically unpleasing for our purposes. Examples of this use case will be shown in the next section.
Adding span annotation is easy with the FoLiA library. As you know, span annotation uses a stand-off annotation embedded in annotation layers. These layers are in turn embedded in structural elements such as sentences. However, the add() method abstracts over this. Consider the following example of a named entity:
doc.declare(folia.Entity, "https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/namedentities.foliaset.xml")
sentence = text.add(folia.Sentence)
sentence.add(folia.Word, 'I',id='example.s.1.w.1')
sentence.add(folia.Word, 'saw',id='example.s.1.w.2')
sentence.add(folia.Word, 'the',id='example.s.1.w.3')
word = sentence.add(folia.Word, 'Dalai',id='example.s.1.w.4')
word2 = sentence.add(folia.Word, 'Lama',id='example.s.1.w.5')
sentence.add(folia.Word, '.', id='example.s.1.w.6')
word.add(folia.Entity, word, word2, cls="per")
To make references to the words, we simply pass the word instances and use the document’s index to obtain them. Note also that passing a list using the keyword argument contents is wholly equivalent to passing the non-keyword arguments separately:
word.add(folia.Entity, cls="per", contents=[word,word2])
In the next example we do things more explicitly. We first create a sentence and then add a syntax parse, consisting of nested elements:
doc.declare(folia.SyntaxLayer, 'some-syntax-set')
sentence = text.add(folia.Sentence)
sentence.add(folia.Word, 'The',id='example.s.1.w.1')
sentence.add(folia.Word, 'boy',id='example.s.1.w.2')
sentence.add(folia.Word, 'pets',id='example.s.1.w.3')
sentence.add(folia.Word, 'the',id='example.s.1.w.4')
sentence.add(folia.Word, 'cat',id='example.s.1.w.5')
sentence.add(folia.Word, '.', id='example.s.1.w.6')
#Adding Syntax Layer
layer = sentence.add(folia.SyntaxLayer)
#Adding Syntactic Units
layer.add(
    folia.SyntacticUnit(doc, cls='s', contents=[
        folia.SyntacticUnit(doc, cls='np', contents=[
            folia.SyntacticUnit(doc, doc['example.s.1.w.1'], cls='det'),
            folia.SyntacticUnit(doc, doc['example.s.1.w.2'], cls='n'),
        ]),
        folia.SyntacticUnit(doc, cls='vp', contents=[
            folia.SyntacticUnit(doc, doc['example.s.1.w.3'], cls='v'),
            folia.SyntacticUnit(doc, cls='np', contents=[
                folia.SyntacticUnit(doc, doc['example.s.1.w.4'], cls='det'),
                folia.SyntacticUnit(doc, doc['example.s.1.w.5'], cls='n'),
            ]),
        ]),
        folia.SyntacticUnit(doc, doc['example.s.1.w.6'], cls='fin')
    ])
)
Note
The lower-level append() method would have had the same effect in the above syntax tree sample.
Any element can be deleted by calling the remove() method of its parent. Suppose we want to delete word:
word.parent.remove(word)
A deep copy can be made of any element by calling its copy() method:
word2 = word.copy()
The copy will be without parent and document. If you intend to associate a copy with a new document, then copy as follows instead:
word2 = word.copy(newdoc)
If you intend to attach the copy somewhere in the same document, you may want to add a suffix for any identifiers in its scope, since duplicate identifiers are not allowed and would raise an exception. This can be specified as the second argument:
word2 = word.copy(doc, ".copy")
If you have loaded a FoLiA document into memory, you may want to search for particular annotations. You can of course loop over all structural and annotation elements using select(), annotation() and annotations(). Additionally, Word.findspans() and AbstractAnnotationLayer.findspan() are useful methods for finding span annotations covering particular words, whereas AbstractSpanAnnotation.wrefs() does the reverse and finds the words for a given span annotation element. In addition to these main methods of navigation and selection, there is a higher-level function available for searching, which uses the FoLiA Query Language (FQL) or the Corpus Query Language (CQL).
These two languages are part of separate libraries that need to be imported:
from pynlpl.formats import fql, cql
CQL is the easier language of the two and most suitable for corpus searching. It is, however, less flexible than FQL, which is designed specifically for FoLiA and can not only query, but also manipulate FoLiA documents in great detail.
CQL was developed for the IMS Corpus Workbench at Stuttgart University, and is implemented in Sketch Engine, which provides good CQL documentation.
CQL has to be converted to FQL first, which is then executed on the given document. This is a simple example querying for the word “house”:
doc = folia.Document(file="/path/to/some/document.folia.xml")
query = fql.Query(cql.cql2fql('"house"'))
for word in query(doc):
    print(word) #these will be folia.Word instances (all matching house)
Multiple words can be queried:
query = fql.Query(cql.cql2fql('"the" "big" "house"'))
for word1,word2,word3 in query(doc):
    print(word1, word2,word3)
Queries may contain wildcard expressions to match multiple text patterns. Gaps can be specified using []. The following will match any three-word combination starting with “the” and ending with something that starts with “house”. It will thus match things like “the big house” or “the small household”:
query = fql.Query(cql.cql2fql('"the" [] "house.*"'))
for word1,word2,word3 in query(doc):
    ...
We can make the gap optional with a question mark; it can also be lengthened with + or *, as in regular expressions:
query = fql.Query(cql.cql2fql('"the" []? "house.*"'))
for match in query(doc):
    print("We matched ", len(match), " words")
Querying is not limited to text; all of FoLiA’s annotations can be used. To force our gap to consist of one or more adjectives, we do:
query = fql.Query(cql.cql2fql('"the" [ pos = "a" ]+ "house.*"'))
for match in query(doc):
    ...
The original CQL attribute here is tag rather than pos; this can be used too. In addition, all FoLiA element types can be used! Just use their FoLiA tagname.
Consult the CQL documentation for more. Do note that CQL is very word/token centered; for searching other types of elements, use FQL instead.
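To get a feel for how a token pattern with a gap behaves, the idea can be sketched in plain Python. This is only an approximation under stated assumptions: the match_sequences() function is hypothetical, handles fixed-length patterns only, and assumes that each expression must match a whole token; it does not use or reproduce the CQL or FQL machinery.

```python
import re


def match_sequences(tokens, pattern):
    """Yield every run of tokens that matches a fixed-length pattern.

    Each pattern item is a regular expression that must match a whole
    token, or None for a gap that accepts any single token.
    """
    size = len(pattern)
    for start in range(len(tokens) - size + 1):
        window = tokens[start:start + size]
        if all(p is None or re.fullmatch(p, tok)
               for p, tok in zip(pattern, window)):
            yield tuple(window)


tokens = "we sold the big house and the small household today".split()

# Roughly comparable in spirit to the pattern "the" [] "house.*"
matches = list(match_sequences(tokens, ["the", None, "house.*"]))
```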
FQL is documented at https://github.com/proycon/foliadocserve/blob/master/README.rst; a full overview is beyond the scope of this documentation. We will just introduce some basic selection queries so you can develop an initial impression of the language’s abilities.
Selecting a word with a particular text is done as follows:
query = fql.Query('SELECT w WHERE text = "house"')
for word in query(doc):
    print(word) #this will be an instance of folia.Word
Regular expression matching can be done using the MATCHES operator:
query = fql.Query('SELECT w WHERE text MATCHES "^house.*$"')
for word in query(doc):
    print(word)
The classes of other annotation types can be easily queried as follows:
query = fql.Query('SELECT w WHERE :pos = "v" AND :lemma = "be"')
for word in query(doc):
    print(word)
You can constrain your queries to a particular target selection using the FOR keyword:
query = fql.Query('SELECT w WHERE text MATCHES "^house.*$" FOR s WHERE text CONTAINS "sell"')
for word in query(doc):
    print(word)
This construction also allows you to select the actual annotations. To select all people (a named entity) for words that are not John:
query = fql.Query('SELECT entity WHERE class = "person" FOR w WHERE text != "John"')
for entity in query(doc):
    print(entity) #this will be an instance of folia.Entity
FOR statements may be chained, and explicit IDs can be passed using the ID keyword:
query = fql.Query('SELECT entity WHERE class = "person" FOR w WHERE text != "John" FOR div ID "section.21"')
for entity in query(doc):
    print(entity)
Sets are specified using the OF keyword. It can be omitted if there is only one set for the annotation type, but is required otherwise:
query = fql.Query('SELECT su OF "http://some/syntax/set" WHERE class = "np"')
for su in query(doc):
    print(su) #this will be an instance of folia.SyntacticUnit
We have just covered the SELECT keyword. FQL has other keywords for manipulating documents, such as EDIT, ADD, APPEND and PREPEND.
Note
Consult the FQL documentation at https://github.com/proycon/foliadocserve/blob/master/README.rst for further documentation on the language.
Throughout this tutorial you have seen the folia.Document class as a means of reading FoLiA documents. This class always loads the entire document in memory, which can be a considerable resource demand. The folia.Reader class provides an alternative to loading FoLiA documents. It does not load the entire document in memory but merely returns the elements you are interested in. This results in far less memory usage and also provides a speed-up.
A reader is constructed as follows; the second argument is the class of the element you want:
reader = folia.Reader("my.folia.xml", folia.Word)
for word in reader:
    print(word.id)
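The general idea behind such a reader, handing over elements while parsing instead of building the whole tree first, can be sketched with the standard library alone. The example below is an illustration under stated assumptions: it uses xml.etree.ElementTree.iterparse on a tiny toy document without namespaces, and the stream_ids() function is hypothetical. It says nothing about how folia.Reader is implemented internally.

```python
import io
import xml.etree.ElementTree as ET

# A tiny toy document; real FoLiA files use the FoLiA namespace and a richer structure.
xml_data = b"""<doc>
  <s id="s.1"><w id="s.1.w.1">Hello</w><w id="s.1.w.2">world</w></s>
  <s id="s.2"><w id="s.2.w.1">Bye</w></s>
</doc>"""


def stream_ids(source, tag):
    """Yield the id of each element with the given tag as soon as it is parsed."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem.get("id")
            elem.clear()  # drop the element's content once we are done with it


word_ids = list(stream_ids(io.BytesIO(xml_data), "w"))
```

Because only one element is handled at a time, memory use stays roughly constant regardless of document size, which is the trade-off the reader approach makes against random access to the full document.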
FoLiA has a number of text markup elements, which appear within the folia.TextContent (t) element. Iterating over the elements of a folia.TextContent element will first and foremost produce strings, but also uncover these markup elements when present. The following markup types exist:
Features allow a second-order annotation by adding the ability to assign properties and values to any of the existing annotation elements. They follow the set/class paradigm by adding the notion of a subset and class relative to this subset. The feat() method provides a shortcut that can be used on any annotation element to obtain the class of the feature, given a subset. To illustrate the concept, take a look at part of speech annotation with some features:
pos = word.annotation(folia.PosAnnotation)
if pos.cls == "n":
    if pos.feat('number') == 'plural':
        print("We have a plural noun!")
    elif pos.feat('number') == 'singular':
        print("We have a singular noun!")
The feat() method will raise an exception when the feature does not exist. Note that the actual subset and class values are defined by the set and not FoLiA itself! They are therefore fictitious in the above example.
The Python class for features is folia.Feature. In the following example we add a feature:
pos.add(folia.Feature, subset="gender", cls="f")
Although FoLiA does not define any sets or subsets, some annotation types do come with associated subsets; their use is never mandatory. The advantage is that these associated subsets can be directly used as an XML attribute in the FoLiA document. The FoLiA library provides extra classes for these, all derived from folia.Feature:
A key feature of FoLiA is its ability to make alternative annotations explicit. For token annotations, the folia.Alternative (alt) class is used to this end. Alternative annotations are embedded in this structure. This implies the annotation is not authoritative, but is merely an alternative to the actual annotation (if any). Alternatives may typically occur in larger numbers, representing a distribution, each with a confidence value (not mandatory). Each alternative is wrapped in its own folia.Alternative element, as multiple elements inside a single alternative are considered dependent and part of the same alternative. Combining multiple annotations in one alternative makes sense for mixed annotation types, where for instance a pos tag alternative is tied to a particular lemma:
alt = word.add(folia.Alternative)
alt.add(folia.PosAnnotation, set='brown-tagset',cls='n',confidence=0.5)
alt = word.add(folia.Alternative) #note that we reassign the variable!
alt.add(folia.PosAnnotation, set='brown-tagset',cls='a',confidence=0.3)
alt = word.add(folia.Alternative)
alt.add(folia.PosAnnotation, set='brown-tagset',cls='v',confidence=0.2)
Span annotation elements have a different mechanism for alternatives: for those, the entire annotation layer is embedded in a folia.AlternativeLayers element. This element should be repeated for every type, unless the layers it describes are dependent on each other:
alt = sentence.add(folia.AlternativeLayers)
layer = alt.add(folia.EntitiesLayer)
entity = layer.add(folia.Entity, word1,word2,cls="person", confidence=0.3)
Because the alternative annotations are non-authoritative, normal selection methods such as select() and annotations() will never yield them, unless explicitly told to do so. For this reason, there is an alternatives() method on structure elements, for the first category of alternatives.
Corrections are one of the most complex annotation types in FoLiA. Corrections can be applied not just over text, but over any type of structure annotation, token annotation or span annotation. Corrections explicitly preserve the original, and recursively so if corrections are done over other corrections.
Despite their complexity, the library treats correction transparently. Whenever you query for a particular element, and it is part of a correction, you get the corrected version rather than the original. The original is always non-authoritative and normal selection methods will ignore it.
If you want to deal with corrections, you have to explicitly get a folia.Correction element. If an element is part of a correction, its incorrection() method will give the correction element; if not, it will return None:
pos = word.annotation(folia.PosAnnotation)
correction = pos.incorrection()
if correction:
    if correction.hasoriginal():
        originalpos = correction.original(0) #assuming it's the only element as is customary
        #originalpos will be an instance of folia.PosAnnotation
        print("The original pos was", originalpos.cls)
Corrections themselves carry a class too, indicating the type of correction (defined by the set used and not by FoLiA).
Besides original(), corrections distinguish three other types: new() (the corrected version), current() (the current uncorrected version) and suggestions(i) (a suggestion for correction). original() and new() usually form a pair, as do current() and suggestions(i); current() and new() can never be used together. There may be multiple suggestions, hence the index argument of suggestions(i). These methods return, respectively, instances of folia.Original, folia.New, folia.Current and folia.Suggestion.
Adding a correction can be done explicitly:
wrongpos = word.annotation(folia.PosAnnotation)
word.add(folia.Correction, folia.New(doc, folia.PosAnnotation(doc, cls="n")) , folia.Original(doc, wrongpos), cls="misclassified")
Let’s settle for a suggestion rather than an actual correction:
wrongpos = word.annotation(folia.PosAnnotation)
word.add(folia.Correction, folia.Suggestion(doc, folia.PosAnnotation(doc, cls="n")), cls="misclassified")
In some instances, when correcting text or structural elements, folia.New() may be empty, which would correspond to a deletion. Similarly, folia.Original() may be empty, corresponding to an insertion.
The use of folia.Current() is reserved for use with structure elements, such as words, in combination with suggestions. The structure elements then have to be embedded in folia.Current(). This situation arises for instance when making suggestions for a merge or split.
Annotation layers for Span Annotation are derived from this abstract base class
Generator over alternatives, either all or only of a specific annotation type, and possibly restrained also by set.
Will return a single annotation (even if there are multiple). Raises a NoSuchAnnotation exception if none was found
Obtain annotations. Very similar to select() but raises an error if the annotation was not found.
Returns the span element which spans over the specified words or morphemes
Returns an integer indicating whether such an annotation exists, and if so, how many. See annotations() for a description of the parameters.
Returns a RelaxNG definition for this element (as an XML element (lxml.etree) rather than a string)
This is the abstract base class from which all FoLiA elements are derived. This class should not be instantiated directly, but can be useful if you want to check if a variable is an instance of any FoLiA element: isinstance(x, AbstractElement). It contains methods and variables that are commonly inherited.
High-level function that adds (appends) an annotation to an element. It will simply call append() for token annotation elements that fit within the scope. For span annotation, it will find or create the proper annotation layer and insert the element there.
Tests whether a new element of this class can be added to the parent. Returns a boolean or raises ValueError exceptions (unless set to ignore)!
This will use OCCURRENCES, but may be overridden for more customised behaviour.
This method is mostly for internal use.
Makes sure this element (and all subelements), are properly added to the index
Find the most immediate ancestor of the specified type, multiple classes may be specified
Generator yielding all ancestors of this element, effectively back-tracing its path to the root element. A tuple of multiple classes may be specified.
Append a child element. Returns the added element
If an instance is passed as first argument, it will be appended. If a class derived from AbstractElement is passed as first argument, an instance will first be created and then appended.
Generic example, passing a pre-generated instance:
word.append( folia.LemmaAnnotation(doc, cls="house", annotator="proycon", annotatortype=folia.AnnotatorType.MANUAL ) )
Generic example, passing a class to be generated:
word.append( folia.LemmaAnnotation, cls="house", annotator="proycon", annotatortype=folia.AnnotatorType.MANUAL )
Generic example, setting text with a class:
word.append( "house", cls='original' )
Returns this word in context, {size} words to the left, the current word, and {size} words to the right
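The idea of a context window can be pictured on a plain list of tokens. This is only a sketch in plain Python, not the library's implementation: the function below and its `placeholder` padding at the edges of the list are assumptions made for illustration.

```python
def context(tokens, index, size, placeholder=None):
    """Return `size` tokens to the left, the token itself, and `size` to the right.

    Positions that fall outside the list are filled with `placeholder`
    (an assumption of this sketch).
    """
    left = tokens[max(0, index - size):index]
    right = tokens[index + 1:index + 1 + size]
    # Pad so that the result always has 2 * size + 1 items
    left = [placeholder] * (size - len(left)) + left
    right = right + [placeholder] * (size - len(right))
    return left + [tokens[index]] + right


tokens = ["the", "big", "house", "is", "red"]
window = context(tokens, 2, 1)
```

The result always holds `2 * size + 1` items, which makes it convenient to line up windows taken at different positions.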
Make a deep copy of this element and all its children. If idsuffix is a string, it is appended to the IDs of the copied elements; if set to True, a random idsuffix will be generated, including a random 32-bit hash
Generator creating a deep copy of the children of this element. If idsuffix is a string, it is appended to the IDs of the copied elements; if set to True, a random idsuffix will be generated, including a random 32-bit hash
Like select, but instead of returning the elements, it merely counts them
Obtain the description associated with the element, will raise NoDescription if there is none
Obtain the feature value of the specific subset. If a feature occurs multiple times, the values will be returned in a list.
Example:
sense = word.annotation(folia.Sense)
synset = sense.feat('synset')
Find replaceable elements. Auxiliary function used by replace(). Can be overridden for more fine-grained control. Mostly for internal use.
Returns the index at which an element occurs; recursive by default.
May return a customised text delimiter instead of the default for this class.
Does this element have text (of the specified class)
Is this element part of a correction? If it is, it returns the Correction element (evaluating to True), otherwise it returns None
Insert a child element at specified index. Returns the added element
If an instance is passed as first argument, it will be inserted. If a class derived from AbstractElement is passed as first argument, an instance will first be created and then inserted.
Generic example, passing a pre-generated instance:
word.insert( 3, folia.LemmaAnnotation(doc, cls="house", annotator="proycon", annotatortype=folia.AnnotatorType.MANUAL ) )
Generic example, passing a class to be generated:
word.insert( 3, folia.LemmaAnnotation, cls="house", annotator="proycon", annotatortype=folia.AnnotatorType.MANUAL )
Generic example, setting text:
word.insert( 3, "house" )
Returns a depth-first flat list of all items below this element (not limited to AbstractElement)
Returns the left context for an element, as a list. This method crosses sentence/paragraph boundaries by default, which can be restricted by setting scope
Returns the next element, if it is of the specified type and if it does not cross the boundary of the defined scope. Returns None if no next element is found. Non-authoritative elements are never returned.
Alias for retrieving the original uncorrected text
Internal class method used for turning an XML element into an instance of the Class.
This method will be called after an element is added to another. It can do extra checks and if necessary raise exceptions to prevent addition. By default makes sure the right document is associated.
This method is mostly for internal use.
Returns the previous element, if it is of the specified type and if it does not cross the boundary of the defined scope. Returns None if no previous element is found. Non-authoritative elements are never returned.
Returns a RelaxNG definition for this element (as an XML element (lxml.etree) rather than a string)
Removes the child element
Appends a child element like append(), but replaces any existing child element of the same type and set. If no such child element exists, this will act the same as append()
to be an alternative.
See append() for more information.
Returns the right context for an element, as a list. This method crosses sentence/paragraph boundaries by default, which can be restricted by setting scope
Select child elements of the specified class.
A further restriction can be made based on set. Whether or not to apply recursively (by default enabled) can also be configured, optionally with a list of elements never to recurse into.
Class: The class to select; any Python class derived from AbstractElement
set: The set to match against, only elements pertaining to this set will be returned. If set to None (default), all elements regardless of set will be returned.
recursive: Select recursively? Descending into child elements? Boolean defaulting to True.
ignore: A list of classes to ignore. If set to the boolean True instead of a list, all non-authoritative elements will be skipped (this is the default behaviour). It is common not to want to recurse into the following elements: folia.Alternative, folia.AlternativeLayer, folia.Suggestion, and folia.Original. Elements contained in these are never authoritative, and they make up the default list used when the value is True. You may also include the boolean True as a member of a list, if you want to skip additional tags along with the non-authoritative ones.
node: Reserved for internal usage, used in recursion.
Example:
text.select(folia.Sense, 'cornetto', True, [folia.Original, folia.Suggestion, folia.Alternative] )
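How recursion, the set restriction and the ignore list interact can be illustrated on a toy tree. This is a minimal sketch in plain Python, not the library's code: `Node`, `Word`, `Sense`, `Original` and this `select()` function are hypothetical stand-ins that only mimic the parameters described above.

```python
class Node:
    """Hypothetical stand-in for a FoLiA element: a set name plus children."""
    def __init__(self, *children, set=None):
        self.set = set
        self.children = list(children)

class Word(Node): pass
class Sense(Node): pass
class Original(Node): pass  # stands in for a non-authoritative container


def select(node, cls, set=None, recursive=True, ignore=()):
    """Yield descendants of `node` that are instances of `cls`."""
    for child in node.children:
        # Never yield or descend into ignored element types
        if isinstance(child, tuple(ignore)):
            continue
        if isinstance(child, cls) and (set is None or child.set == set):
            yield child
        if recursive:
            yield from select(child, cls, set, recursive, ignore)


kept = Sense(set="cornetto")
hidden = Sense(set="cornetto")   # lives inside Original
other = Sense(set="other")
tree = Node(Word(kept, other), Original(Word(hidden)))

authoritative = list(select(tree, Sense, "cornetto", True, [Original]))
everything = list(select(tree, Sense, "cornetto"))
```

With `Original` in the ignore list only `kept` is found; without it, the sense nested inside `Original` is returned as well.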
Set a different document, usually no need to call this directly, invoked implicitly by copy()
Associate a document with this element
Correct all parent relations for elements within the scope, usually no need to call this directly, invoked implicitly by copy()
Set the text for this element (and class)
Get the text strictly associated with this element (of the specified class). Does not recurse into children, with the sole exception of Correction/New
Get the text associated with this element (of the specified class), will always be a unicode instance. If no text is directly associated with the element, it will be obtained from the children. If that doesn’t result in any text either, a NoSuchText exception will be raised.
If retaintokenisation is True, the space attribute on words will be ignored, otherwise it will be adhered to and text will be detokenised as much as possible.
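The effect of retaintokenisation can be sketched on a list of tokens that each carry a space flag. This is an illustration in plain Python under stated assumptions, not the library's implementation: the `(text, space)` pairs and the `join_tokens()` helper are hypothetical, with `space=False` meaning that no space follows the token in the original text.

```python
def join_tokens(tokens, retaintokenisation=False):
    """Join (text, space) pairs into a single string."""
    if retaintokenisation:
        # Ignore the space flags: every token is separated by a space
        return " ".join(text for text, _ in tokens)
    parts = []
    for i, (text, space) in enumerate(tokens):
        parts.append(text)
        # Honour the space flag, but never add a trailing space
        if space and i < len(tokens) - 1:
            parts.append(" ")
    return "".join(parts)


tokens = [("Hello", False), (",", True), ("world", False), ("!", True)]
detokenised = join_tokens(tokens)
tokenised = join_tokens(tokens, retaintokenisation=True)
```

Here `detokenised` reads like running text, while `tokenised` keeps every token separated by a space.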
Get the text explicitly associated with this element (of the specified class). Returns the TextContent instance rather than the actual text. Raises NoSuchText exception if not found.
Unlike text(), this method does not recurse into child elements (with the sole exception of the Correction/New element), and it returns the TextContent instance rather than the actual text!
Alias for text with retaintokenisation=True
Internal method, recompute textual value. Only for elements that are a TEXTCONTAINER
Serialises the FoLiA element to XML, by returning an XML Element (in lxml.etree) for this element and all its children. For string output, consider the xmlstring() method instead.
Serialises this FoLiA element to XML, returns a (unicode) string with XML representation for this element and all its children.
Abstract element, all span annotation elements are derived from this class
Will return a single annotation (even if there are multiple). Raises a NoSuchAnnotation exception if none was found
Generator creating a deep copy of the children of this element. If idsuffix is a string, it is appended to the IDs of the copied elements; if set to True, a random idsuffix will be generated, including a random 32-bit hash
Returns an integer indicating whether such an annotation exists, and if so, how many. See annotations() for a description of the parameters.
Sets the span of the span element anew, erases all data inside
Returns a list of word references, these can be Words but also Morphemes or Phonemes.
Abstract element, all structure elements inherit from this class. Never instantiated directly.
See AbstractElement.append()
Does the specified annotation layer exist?
Returns a list of annotation layers found directly under this element, does not include alternative layers
Returns a generator of Paragraph elements found (recursively) under this element.
Returns a generator of Sentence elements found (recursively) under this element
Returns a generator of Word elements found (recursively) under this element.
Abstract element, all subtoken annotation elements are derived from this class
Obtain the text (unicode instance)
Abstract element, all token annotation elements are derived from this class
See AbstractElement.append()
Actor feature, to be used within Event
Apply a correction (TODO: documentation to be written still)
Classes inherited from this class allow for automatic ID generation, using the convention of adding a period, the name of the element, another period, and a sequence number
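The ID convention can be sketched as follows. This is a minimal illustration in plain Python, not the library's code: the `IdGenerator` class is hypothetical, and it assumes that sequence numbers start at 1 and are counted per parent and element name.

```python
from collections import defaultdict


class IdGenerator:
    """Hypothetical helper that builds IDs as <parent id>.<element name>.<sequence number>."""

    def __init__(self):
        # One counter per (parent id, element name) combination
        self._counters = defaultdict(int)

    def generate(self, parent_id, element_name):
        key = (parent_id, element_name)
        self._counters[key] += 1
        return "{}.{}.{}".format(parent_id, element_name, self._counters[key])


gen = IdGenerator()
first = gen.generate("example.p.1.s.1", "w")
second = gen.generate("example.p.1.s.1", "w")
correction = gen.generate("example.p.1.s.1", "correction")
```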
Elements that allow token annotation (including extended annotation) must inherit from this class
Generator over alternatives, either all or only of a specific annotation type, and possibly restrained also by set.
Will return a single annotation (even if there are multiple). Raises a NoSuchAnnotation exception if none was found
Obtain annotations. Very similar to select() but raises an error if the annotation was not found.
Returns an integer indicating whether such an annotation exists, and if so, how many. See annotations() for a description of the parameters.
Element grouping alternative token annotation(s). Multiple alternative elements may occur, each denoting a different alternative. Elements grouped inside an alternative block are considered dependent.
Element grouping alternative subtoken annotation(s). Multiple altlayers elements may occur, each denoting a different alternative. Elements grouped inside an alternative block are considered dependent.
Begindatetime feature, to be used within Event
Element used for captions for figures or tables, contains sentences
Chunk element, span annotation element to be used in ChunkingLayer
Chunking Layer: Annotation layer for Chunk span annotation elements
Coreference chain. Consists of coreference links.
Coreference Layer: Annotation layer for CoreferenceChain span annotation elements
Coreference link. Used in coreferencechain.
A corpus of various FoLiA documents. Yields a Document on each iteration. Suitable for sequential processing.
A corpus of various FoLiA documents. Yields the filenames on each iteration.
Processes a corpus of various FoLiA documents using parallel processing. Calls a user-defined function with the three-tuple (filename, args, kwargs) for each file in the corpus. The user-defined function is itself responsible for instantiating a FoLiA document! args and kwargs, as received by the custom function, are set through the run() method, which yields the result of the custom function on each iteration.
See AbstractElement.append()
May return a customised text delimiter instead of the default for this class.
Get the text explicitly associated with this element (of the specified class). Returns the TextContent instance rather than the actual text. Raises NoSuchText exception if not found.
Unlike text(), this method does not recurse into child elements (with the sole exception of the Correction/New element), and it returns the TextContent instance rather than the actual text!
Dependencies Layer: Annotation layer for Dependency span annotation elements. For dependency entities.
Returns the dependent of the dependency relation. Instance of DependencyDependent
Returns the head of the dependency relation. Instance of DependencyHead
Description is an element that can be used to associate a description with almost any other FoLiA element
Structure element representing some kind of division. Divisions may be nested at will, and may include almost all kinds of other structure elements.
This is the FoLiA Document; all elements have to be associated with a FoLiA document. Besides holding elements, the document holds metadata, including declarations, and an index of all IDs.
Add a text to the document:
Example 1:
doc.append(folia.Text)
Create an element associated with this Document. This method may be obsolete and removed later.
Without arguments, gets the document’s date from metadata. With an argument, sets the document’s date in metadata.
Returns a depth-first flat list of all items in the document
Without arguments, gets the document’s language (ISO-639-3) from metadata. With an argument, sets the document’s language (ISO-639-3) in metadata.
Without arguments, gets the document’s license from metadata. With an argument, sets the document’s license in metadata.
Load a FoLiA or D-Coi XML file
Return a generator of all paragraphs found in the document.
If an index is specified, return the n’th paragraph only (starting at 0)
Main XML parser, will invoke class-specific XML parsers. For internal use.
Without arguments, gets the document’s publisher from metadata. With an argument, sets the document’s publisher in metadata.
Save the document to FoLiA XML.
Return a generator of all sentences found in the document, except for sentences in quotes.
If an index is specified, return the n’th sentence only (starting at 0)
Returns the text of the entire document (returns a unicode instance)
Without arguments, gets the document’s title from metadata. With an argument, sets the document’s title in metadata.
Return a generator of all active words found in the document. Does not descend into annotation layers, alternatives, originals, suggestions.
If an index is specified, return the n’th word only (starting at 0)
Run an XPath expression and parse the resulting elements. Don’t forget to use the FoLiA namespace in your expressions, using folia: or the short form f:
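Why the namespace prefix matters can be shown with the standard library alone. The sketch below deliberately uses `xml.etree.ElementTree` instead of the FoLiA library, so it says nothing about the library's own method signature; the tiny document and the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The FoLiA XML namespace, bound here to the short prefix "f"
FOLIA_NS = "http://ilk.uvt.nl/folia"
ns = {"f": FOLIA_NS}

xml = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="example">
  <text xml:id="example.text">
    <s xml:id="example.s.1">
      <w xml:id="example.s.1.w.1"><t>a</t></w>
      <w xml:id="example.s.1.w.2"><t>house</t></w>
    </s>
  </text>
</FoLiA>"""

root = ET.fromstring(xml)

# With the prefix, the word elements are found
words = root.findall(".//f:w", ns)
texts = [w.find("f:t", ns).text for w in words]

# Without the prefix, nothing matches, because all elements are namespaced
unqualified = root.findall(".//w")
```

Forgetting the prefix does not raise an error; the query simply returns no elements, which makes this mistake easy to overlook.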
Domain annotation: an extended token annotation element
Exception raised when an identifier that is already in use is assigned again to another element
Enddatetime feature, to be used within Event
Entities Layer: Annotation layer for Entity span annotation elements. For named entities.
Entity element, for named entities, span annotation element to be used in EntitiesLayer
Feature elements can be used to associate subsets and subclasses with almost any annotation element
Element for the representation of a graphical figure. Structure element.
Function feature, to be used with morphemes
Gap element. Represents skipped portions of the text. Contains Content and Desc elements
Head element. A structure element. Acts as the header/title of a division. There may be one per division. Contains sentences.
Head feature, to be used within PosAnnotation
Element used for labels. Mostly used within list items. Contains words.
Language annotation: an extended token annotation element
Lemma annotation: a token annotation element
Level feature, to be used with coreferences
Line break element, signals a line break
Element for enumeration/itemisation. Structure element. Contains ListItem elements.
Single element in a List. Structure element. Contained within List element.
Metric elements allow the annotation of any kind of metric with any kind of annotation element, allowing for example statistical measures to be added to elements as annotation.
Modality feature, to be used with coreferences
Morpheme element, represents one morpheme in morphological analysis, subtoken annotation element to be used in MorphologyLayer
Find span annotation of the specified type that includes this word
Morphology Layer: Annotation layer for Morpheme subtoken annotation elements. For morphological analysis.
Exception raised when the requested type of annotation does not exist for the selected element
Exception raised when the requested type of text content does not exist for the selected element
Paragraph element. A structure element. Represents a paragraph and holds all its sentences (and possibly other structure such as Whitespace and Quotes).
This class describes a pattern over words to be searched for. The
Document.findwords() method can subsequently be called with this pattern, and it will return all the words that match. An example will best illustrate this, first a trivial example of searching for one word::

    for match in doc.findwords( folia.Pattern('house') ):
        for word in match:
            print(word.id)
        print("----")

The same can be done for a sequence::

    for match in doc.findwords( folia.Pattern('a','big', 'house') ):
        for word in match:
            print(word.id)
        print("----")

The boolean value ``True`` acts as a wildcard, matching any word::

    for match in doc.findwords( folia.Pattern('a',True,'house') ):
        for word in match:
            print(word.id, word.text())
        print("----")

Alternatively, and more constraining, you may also specify a tuple of alternatives::

    for match in doc.findwords( folia.Pattern('a',('big','small'),'house') ):
        for word in match:
            print(word.id, word.text())
        print("----")

Or even a regular expression using the ``folia.RegExp`` class::

    for match in doc.findwords( folia.Pattern('a', folia.RegExp('b?g'),'house') ):
        for word in match:
            print(word.id, word.text())
        print("----")

Rather than searching on the text content of the words, you can search on the
classes of any kind of token annotation using the keyword argument
``matchannotation=``::

    for match in doc.findwords( folia.Pattern('det','adj','noun',matchannotation=folia.PosAnnotation ) ):
        for word in match:
            print(word.id, word.text())
        print("----")

The set can be restricted by adding the additional keyword argument
``matchannotationset=``. Case sensitivity, by default disabled, can be enabled by setting ``casesensitive=True``.
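The matching rules described so far (literal words, the ``True`` wildcard, tuples of alternatives, regular expressions and optional case sensitivity) can be sketched on a plain list of strings. This is an illustration in plain Python, not the library's implementation: the `matches()` and `findwords()` functions below are hypothetical, and a compiled `re` pattern stands in for ``folia.RegExp``.

```python
import re

REGEX_TYPE = type(re.compile(""))


def matches(token, item, casesensitive=False):
    """Test a single token against a single pattern item."""
    if item is True:                      # wildcard: matches any word
        return True
    if isinstance(item, tuple):           # alternatives: any one may match
        return any(matches(token, alt, casesensitive) for alt in item)
    if isinstance(item, REGEX_TYPE):      # regular expression over the whole token
        return item.fullmatch(token) is not None
    if casesensitive:
        return token == item
    return token.lower() == item.lower()


def findwords(tokens, pattern, casesensitive=False):
    """Yield every window of len(pattern) tokens that satisfies the pattern."""
    n = len(pattern)
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if all(matches(t, p, casesensitive) for t, p in zip(window, pattern)):
            yield window


tokens = "A big house and a small house near a red barn".split()
wildcard = list(findwords(tokens, ("a", True, "house")))
alternatives = list(findwords(tokens, ("a", ("big", "red"), "house")))
regex = list(findwords(tokens, ("a", re.compile("s.*"), "house")))
```

Like the searches described here, this sketch simply slides over all tokens, so its cost grows with the length of the text.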
Things become even more interesting when different Patterns are combined. A
match will have to satisfy all patterns::

    for match in doc.findwords( folia.Pattern('a', True, 'house'), folia.Pattern('det','adj','noun',matchannotation=folia.PosAnnotation ) ):
        for word in match:
            print(word.id, word.text())
        print("----")

The ``findwords()`` method can be instructed to also return left and/or right context for any match. This is done using the ``leftcontext=`` and ``rightcontext=`` keyword arguments, their values being an integer giving the number of context words to include in each match. For instance, we can look for the word house and return its immediate neighbours as follows::

    for match in doc.findwords( folia.Pattern('house') , leftcontext=1, rightcontext=1):
        for word in match:
            print(word.id)
        print("----")

A match here would thus always consist of three words instead of just one.
Last, ``Pattern`` also has support for variable-width gaps; the asterisk symbol
has special meaning to this end::

    for match in doc.findwords( folia.Pattern('a','*','house') ):
        for word in match:
            print(word.id)
        print("----")

Unlike the pattern ``('a',True,'house')``, which by definition is a pattern of
three words, the pattern in the example above will match gaps of any length (up
to a certain built-in maximum), so this might include matches such as *a very
nice house*.
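One way to think about variable-width gaps is to expand them into fixed-size patterns built from ``True`` wildcards. The following is a sketch in plain Python, not the library's implementation: the `resolve()` function and its `min_gap` parameter are hypothetical, and it assumes that every gap spans at least one word.

```python
from itertools import product


def resolve(pattern, size, min_gap=1):
    """Return all variants of `pattern` of exactly `size` items,
    with every '*' replaced by a run of True wildcards."""
    gaps = sum(1 for item in pattern if item == '*')
    spare = size - (len(pattern) - gaps)   # words to distribute over the gaps
    results = []
    for widths in product(range(min_gap, spare + 1), repeat=gaps):
        if sum(widths) != spare:
            continue
        width = iter(widths)
        resolved = []
        for item in pattern:
            if item == '*':
                resolved.extend([True] * next(width))
            else:
                resolved.append(item)
        results.append(tuple(resolved))
    return results


four_words = resolve(('a', '*', 'house'), 4)
two_gaps = resolve(('a', '*', 'b', '*', 'c'), 6)
```

Each resolved variant is an ordinary fixed-size pattern, so it can be matched with the same window-by-window comparison as a pattern without gaps.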
Some remarks on these methods of querying are in order. These searches are
pretty exhaustive and are done by simply iterating over all the words in the
document. The entire document is loaded in memory and no special indices are involved.
For single documents this is okay, but when iterating over a corpus of
thousands of documents, this method is too slow, especially for real-time
applications. For huge corpora, clever indexing and database management systems
will be required. This however is beyond the scope of this library.
Resolve a variable sized pattern to all patterns of a certain fixed size
Part-of-Speech annotation: a token annotation element
An XPath query on one or more FoLiA documents
Quote: a structure element. For quotes/citations. May hold words, sentences or paragraphs.
Streaming FoLiA reader. The reader allows you to read a FoLiA Document without holding the whole tree structure in memory. The document will be read and the elements you seek returned as they are found. If you are querying a corpus of large FoLiA documents for a specific structure, then it is strongly recommended to use the Reader rather than the standard Document!
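The general idea of streaming, as opposed to loading a whole tree, can be illustrated with the standard library. The sketch below uses `xml.etree.ElementTree.iterparse` and is not the Reader class itself; the `stream_elements()` helper and the sample document are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"


def stream_elements(source, tag):
    """Yield elements with the given FoLiA tag one at a time while parsing."""
    qualified = "{%s}%s" % (FOLIA_NS, tag)
    for event, element in ET.iterparse(source, events=("end",)):
        if element.tag == qualified:
            yield element
            # Drop the element's content once the caller is done with it
            element.clear()


data = b"""<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="example">
  <text xml:id="example.text">
    <s xml:id="example.s.1">
      <w xml:id="example.s.1.w.1"><t>a</t></w>
      <w xml:id="example.s.1.w.2"><t>house</t></w>
    </s>
  </text>
</FoLiA>"""

text_tag = "{%s}t" % FOLIA_NS
# Each word is inspected as soon as it has been parsed
texts = [w.find(text_tag).text for w in stream_elements(io.BytesIO(data), "w")]
```

Because each element is cleared after it has been yielded, it has to be inspected inside the loop rather than collected for later use.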
Semantic Role
Semantic Roles Layer: Annotation layer for SemanticRole span annotation elements
Sense annotation: a token annotation element
Sentence element. A structure element. Represents a sentence and holds all its words (and possibly other structure such as LineBreaks, Whitespace and Quotes)
Are there corrections in this sentence?
Generic correction method for words. You most likely want to use the helper functions splitword() , mergewords(), deleteword(), insertword() instead
TODO: Write documentation
Obtain the division this sentence is a part of (None otherwise)
TODO: Write documentation
Obtain the paragraph this sentence is a part of (None otherwise)
TODO: Write documentation
String
Subjectivity annotation/Sentiment analysis: a token annotation element
Synset feature, to be used within Sense
Syntactic Unit, span annotation element to be used in SyntaxLayer
Syntax Layer: Annotation layer for SyntacticUnit span annotation elements
A full text. This is a high-level element (not to be confused with TextContent!). This element may contain divisions, paragraphs, sentences, etc.
Text content element (t), holds text to be associated with whatever element the text content element is a child of.
Text content elements on structure elements like Paragraph and Sentence are by definition untokenised. Only on Word level and deeper they are by definition tokenised.
Find the default reference for text offsets: The parent of the current textcontent’s parent (counting only Structure Elements and Subtoken Annotation Elements)
Note: This returns not a TextContent element, but its parent. Whether the textcontent actually exists is checked later/elsewhere
(Method for internal usage, see AbstractElement)
(Method for internal usage, see AbstractElement)
(Method for internal usage, see AbstractElement.postappend())
Obtain the text (unicode instance)
Validates the Text Content’s references. Raises UnresolvableTextContent when invalid
Time feature, to be used with coreferences
alias of TimeSegment
Timing Layer: Annotation layer for TimeSegment span annotation elements.
Value feature, to be used within Metric
Whitespace element, signals a vertical whitespace
Word (aka token) element. Holds a word/token and all its related token annotations.
Obtain the deepest division this word is a part of, otherwise return None
Shortcut: returns the FoLiA class of the domain annotation (will return only one if there are multiple!)
Find span annotation of the specified type that includes this word
Returns the text delimiter
Shortcut: returns the FoLiA class of the lemma annotation (will return only one if there are multiple!)
Returns a specific morpheme, the n’th morpheme (given the particular set if specified).
Generator yielding all morphemes (in a particular set if specified). For retrieving one specific morpheme by index, use morpheme() instead
Obtain the paragraph this word is a part of, otherwise return None
Shortcut: returns the FoLiA class of the PoS annotation (will return only one if there are multiple!)
Shortcut: returns the FoLiA class of the sense annotation (will return only one if there are multiple!)
Obtain the sentence this word is a part of, otherwise return None
Word reference. Used to refer to words or morphemes from span annotation elements. The Python class is only used when a word reference cannot be resolved; when it can, Word or Morpheme objects are used instead
Generator over common ancestors, of the Class specified, of the current element and the other specified elements
Returns (datetime, tz offset in minutes) or (None, None).
Internal function, parses common FoLiA attributes and sets up the instance accordingly