FoLiA library

This tutorial will introduce the FoLiA Python library, part of PyNLPl. The FoLiA library provides an Application Programming Interface for the reading, creation and manipulation of FoLiA XML documents. The library works under Python 2.7 as well as Python 3, which is the recommended version. The samples in this documentation follow Python 3 conventions.

Prior to reading this document, it is recommended to first read the FoLiA documentation itself and familiarise yourself with the format and underlying paradigm. The FoLiA documentation can be found on the `FoLiA website <https://proycon.github.io/folia>`_. It is especially important to understand the way FoLiA handles sets/classes, declarations, common attributes such as annotator/annotatortype and the distinction between various kinds of annotation categories such as token annotation and span annotation.

This Python library is also the foundation of the FoLiA Tools collection, which consists of various command line utilities to perform common tasks on FoLiA documents. If you’re merely interested in performing a certain common task, such as a single query or conversion, you might want to check first whether that collection already contains a tool that does what you want.

Reading FoLiA

Loading a document

Any script that uses FoLiA starts with the import:

from pynlpl.formats import folia

Subsequently, a document can be read from file as follows:

doc = folia.Document(file="/path/to/document.xml")

This returns an instance that holds the entire document in memory. Note that for large FoLiA documents this may consume a considerable amount of memory! If you already have the document content in a string, you can load it as follows:

doc = folia.Document(string="<FoLiA ...")

Once you have loaded a document, all data is available for you to read and manipulate as you see fit. We will first illustrate some simple use cases:

To save a document back to the file it was loaded from, we do:

doc.save()

Or we can specify a specific filename:

doc.save("/tmp/document.xml")

Note

Any content that is in a different XML namespace than the FoLiA namespaces or other supported namespaces (XML, Xlink), will be ignored upon loading and lost when saving.
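
Because such content is silently dropped, it can be useful to check a file for foreign-namespace elements before loading and saving it. The following is a minimal sketch that uses only Python's standard xml.etree.ElementTree module and not the FoLiA library itself. The helper name foreign_elements and the sample document are made up for illustration, and the sketch assumes that the three namespace URIs listed are the supported ones referred to in the note above.

```python
import xml.etree.ElementTree as ET

# Assumption: these URIs are the FoLiA, XML and XLink namespaces mentioned in the note.
SUPPORTED_NAMESPACES = {
    "http://ilk.uvt.nl/folia",
    "http://www.w3.org/XML/1998/namespace",
    "http://www.w3.org/1999/xlink",
}

def foreign_elements(xml_text):
    """Return the tags of all elements whose namespace is not a supported one (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    foreign = []
    for element in root.iter():
        tag = element.tag
        # ElementTree writes namespaced tags as "{namespace}localname"
        namespace = tag[1:].split("}", 1)[0] if tag.startswith("{") else ""
        if namespace not in SUPPORTED_NAMESPACES:
            foreign.append(tag)
    return foreign

# Made-up sample document containing one element in a custom namespace
sample = (
    '<FoLiA xmlns="http://ilk.uvt.nl/folia" xmlns:my="http://example.org/custom">'
    '<text><my:extra>custom data</my:extra><p/></text>'
    '</FoLiA>'
)
print(foreign_elements(sample))
```

An empty result suggests that nothing would be lost on this account; any tag that is reported deserves a closer look before the document is overwritten.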

Printing text

You may want to simply print all (plain) text contained in the document, which is as easy as:

print(doc)

Alternatively, you can obtain a string representation of all text:

text = str(doc)

For any subelement of the document, you can obtain its text in the same fashion.

Note

In Python 2, both str() as well as unicode() return a unicode instance. You may need to append .encode('utf-8') for proper output.

Index

A document instance has an index which you can use to grab any of its sub-elements by ID. Querying the index works much like using a Python dictionary:

word = doc['example.p.3.s.5.w.1']
print(word)

Note

Python 2 users will have to do print word.text().encode('utf-8') instead, to ensure non-ascii characters are printed properly.

Obtaining list of elements

Usually you do not know in advance the ID of the element you want, or you want multiple elements. There are some methods of iterating over certain elements using the FoLiA library.

For example, you can iterate over all words:

for word in doc.words():
    print(word)

That however gives you one big iteration of words without boundaries. You may more likely want to seek words within sentences. So we first iterate over all sentences, then over the words therein:

for sentence in doc.sentences():
    for word in sentence.words():
        print(word)

Or including paragraphs, assuming the document has them:

for paragraph in doc.paragraphs():
    for sentence in paragraph.sentences():
        for word in sentence.words():
            print(word)

You can also use this method to obtain a specific word, by passing an index parameter:

word = sentence.words(3) #retrieves the fourth word

If you want to iterate over all of the child elements of a certain element, regardless of what type they are, you can simply do so as follows:

for subelement in element:
    if isinstance(subelement, folia.Sentence):
        print("this is a sentence")
    else:
        print("this is something else")

If applied recursively, this allows you to traverse the entire element tree. There are, however, specialised methods available that do this for you.

Select method

There is a generic method available on all elements to select child elements of any desired class. This method is by default applied recursively. Internally, the paragraphs(), words() and sentences() methods seen above are simply shortcuts that make use of the select method:

sentence = doc['example.p.3.s.5']
words = sentence.select(folia.Word)
for word in words:
    print(word)

The select() method has a sibling count(), invoked with the same arguments, which simply counts how many items it finds, without actually returning them:

wordcount = sentence.count(folia.Word)

Advanced Notes:

The select() method, and similar high-level methods derived from it, are generators. This implies that the results of the selection are returned one by one during iteration, as opposed to all being stored in memory. It also implies that you can only iterate over the result once: we cannot do another iteration over the words variable in the above example, unless we reinvoke the select() method to get a new generator. Likewise, we cannot do len(words), but have to use the count() method instead.

If you want to have all results in memory in a list, you can simply do the following:

words = list(sentence.select(folia.Word))
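
The one-pass behaviour described above is a property of Python generators in general. The following sketch demonstrates it with a plain generator function; select_words is a hypothetical stand-in and not part of the FoLiA library.

```python
def select_words(tokens):
    """Hypothetical stand-in for a select()-style generator."""
    for token in tokens:
        yield token

words = select_words(["This", "is", "a", "test"])
first_pass = list(words)    # consumes the generator
second_pass = list(words)   # the generator is exhausted, so this is empty

print(first_pass)
print(second_pass)

# Generators have no length, which is why a separate count() method is needed
try:
    len(select_words(["a"]))
except TypeError as error:
    print("len() is not supported:", error)
```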

The select method is recursive by default; set the third argument to False to make it non-recursive. The second argument can be used for restricting matches to a specific set, a tuple of classes. The recursion will not descend into any non-authoritative elements such as alternatives or the originals of corrections.

Structure Annotation Types

The FoLiA library discerns various Python classes for structure annotation; the corresponding FoLiA XML tag is listed for each. Sets and classes can be associated with most of these elements to make them more specific, but these are never prescribed by FoLiA. The list of classes is as follows:

  • folia.Cell - cell - A cell in a row in a table
  • folia.Division - div - Used for chapters, sections and subsections, for example
  • folia.Event - event - Often in new-media data where a chat message, tweet or forum post is considered an event.
  • folia.Figure - figure - A graphic/image
  • folia.Gap - gap - A gap containing raw un-annotated textual content
  • folia.Head - head - The head/title of a division (div), used for chapter/section/subsection titles etc.
  • folia.Linebreak - br - An explicit linebreak/newline
  • folia.List - list - A list, bulleted or enumerated
  • folia.ListItem - listitem - An item in a list
  • folia.Note - note - A note, such as a footnote or bibliography reference for instance
  • folia.Paragraph - p
  • folia.Part - part - An abstract part of a larger structure
  • folia.Quote - quote - Cited text
  • folia.Reference - ref - A reference to another structural element, used to refer to footnotes (note) for example.
  • folia.Sentence - s
  • folia.Table - table - A table
  • folia.TableHead - tablehead - The head of a table, containing cells (cell) with column labels
  • folia.Row - row - A row in a table
  • folia.Text - text - The root of the document’s content
  • folia.Whitespace - whitespace - Explicit vertical whitespace
  • folia.Word - w

The FoLiA documentation explains all of these in detail.

FoLiA and this library enforce explicit rules about what elements are allowed in what others. Exceptions will be raised when this is about to be violated.

Common attributes

The FoLiA paradigm features sets and classes as primary means to represent the actual value (class) of an annotation. A set often corresponds to a tagset, such as a set of part-of-speech tags, and a class is one selected value in such a set.

The paradigm furthermore introduces other common attributes to set on annotation elements, such as an identifier, information on the annotator, and more. A full list is provided below:

  • element.id (string) - The unique identifier of the element

  • element.set (string) - The set the element pertains to.

  • element.cls (string) - The assigned class, of the set above.

    Classes correspond with tags in the case of many annotation types. Note that since class is already a reserved keyword in Python, the library consistently uses cls.

  • element.annotator (string) - The name or ID of the annotator who added/modified this element

  • element.annotatortype - The type of annotator, can be either folia.AnnotatorType.MANUAL or folia.AnnotatorType.AUTO

  • element.confidence (float) - A confidence value between 0 and 1, expressing how certain the annotator is of the annotation

  • element.datetime (datetime.datetime) - The date and time when the element was added/modified.

  • element.n (string) - An ordinal label, used for instance in enumerated list contexts, numbered sections, etc.

The following attributes are specific to a speech context:

  • element.src (string) - A URL or filename referring to an audio or video file containing the speech. Access this attribute using the element.speech_src() method, as it is inheritable from ancestors.
  • element.speaker (string) - The name or ID of the speaker. Access this attribute using the element.speech_speaker() method, as it is inheritable from ancestors.
  • element.begintime (4-tuple) - The time in the above source fragment when the phonetic content of this element starts, as a (hours, minutes, seconds, milliseconds) tuple.
  • element.endtime (4-tuple) - The time in the above source fragment when the phonetic content of this element ends, as a (hours, minutes, seconds, milliseconds) tuple.

Attributes that are not available for certain elements, or not set, default to None.
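
Since begintime and endtime are plain (hours, minutes, seconds, milliseconds) tuples, computing a duration requires a small conversion. The following is a minimal sketch in plain Python; the helper names time_to_seconds and duration are hypothetical and not part of the library, and the two tuples are made-up values.

```python
def time_to_seconds(timestamp):
    """Convert a (hours, minutes, seconds, milliseconds) tuple to seconds (hypothetical helper)."""
    hours, minutes, seconds, milliseconds = timestamp
    return hours * 3600 + minutes * 60 + seconds + milliseconds / 1000

def duration(begintime, endtime):
    """Length in seconds of the fragment between two such tuples."""
    return time_to_seconds(endtime) - time_to_seconds(begintime)

# Made-up values, as they might be found on an element in a speech context
begintime = (0, 1, 30, 500)
endtime = (0, 1, 32, 0)
print(duration(begintime, endtime))
```

Remember that either attribute may be None when it is not set, so real code should check for that before converting.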

Annotations

FoLiA is of course a format for linguistic annotation, so accessing annotation is one of the primary functions of this library. This can be done using annotations() or annotation(), which are similar to the select() method, except that they raise an exception when no such annotation is found. The difference between the two is that annotation() returns only one annotation and raises an exception if there are several between which it can’t disambiguate, whereas annotations() is a generator, but will still raise an exception if none is found:

for word in doc.words():
    try:
        pos = word.annotation(folia.PosAnnotation, 'CGN')
        lemma = word.annotation(folia.LemmaAnnotation)
        print("Word: ", word)
        print("ID: ", word.id)
        print("PoS-tag: " , pos.cls)
        print("PoS Annotator: ", pos.annotator)
        print("Lemma-tag: " , lemma.cls)
    except folia.NoSuchAnnotation:
        print("No PoS or Lemma annotation")

Note that the second argument of annotation(), annotations() or select() can be used to restrict your selection to a certain set. In the above example we restrict ourselves to Part-of-Speech tags in the CGN set.

Token Annotation Types

The following token annotation elements are available in FoLiA. They are embedded under a structural element.

  • folia.DomainAnnotation - domain - Domain/genre annotation
  • folia.PosAnnotation - pos - Part of Speech Annotation
  • folia.LangAnnotation - lang - Language identification
  • folia.LemmaAnnotation - lemma
  • folia.SenseAnnotation - sense - Lexical semantic sense annotation
  • folia.SubjectivityAnnotation - subjectivity - Sentiment analysis / subjectivity annotation

The following annotation types are somewhat special, as they are the only elements for which FoLiA assumes a default set and a default class:

  • folia.TextContent - t - Text content, this carries the actual text for the structural element in which it is embedded
  • folia.PhonContent - ph - Phonetic content, this carries a phonetic representation

Span Annotation

FoLiA distinguishes token annotation and span annotation. Token annotation is embedded in-line within a structural element, and the annotation therefore pertains to that structural element, whereas span annotation is stored in a stand-off annotation layer outside the element and refers back to it. Span annotation elements typically span over multiple structural elements.
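
To make the in-line versus stand-off distinction concrete, the sketch below parses a small hand-written FoLiA-like fragment with Python's standard xml.etree.ElementTree module rather than with this library. The fragment, its IDs and all variable names are made up for illustration. The pos element sits inside the word it describes, whereas the entity lives in a separate layer and points back to its words through wref references.

```python
import xml.etree.ElementTree as ET

NS = {"f": "http://ilk.uvt.nl/folia"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Made-up fragment: one in-line token annotation (pos) and one stand-off span annotation (entity)
SAMPLE = """
<s xmlns="http://ilk.uvt.nl/folia" xml:id="example.s.1">
  <w xml:id="example.s.1.w.1"><t>the</t><pos class="det"/></w>
  <w xml:id="example.s.1.w.2"><t>Dalai</t></w>
  <w xml:id="example.s.1.w.3"><t>Lama</t></w>
  <entities>
    <entity class="per">
      <wref id="example.s.1.w.2"/>
      <wref id="example.s.1.w.3"/>
    </entity>
  </entities>
</s>
""".strip()

sentence = ET.fromstring(SAMPLE)

# Map each word ID to its text, so that references can be resolved
words = {w.get(XML_ID): w.find("f:t", NS).text for w in sentence.findall("f:w", NS)}

# Resolve the stand-off references of every entity back to the words it spans
entities = []
for entity in sentence.findall("f:entities/f:entity", NS):
    covered = [words[wref.get("id")] for wref in entity.findall("f:wref", NS)]
    entities.append((entity.get("class"), covered))

print(entities)
```

The library methods discussed next perform this kind of reference resolution for you.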

We will discuss three ways of accessing span annotation. As stated, span annotation is contained within an annotation layer of a certain structure element, often a sentence. In the first way of accessing span annotation, we do everything explicitly. We first obtain the layer, then iterate over the span annotation elements within that layer, and finally iterate over the words to which the span applies. Assume we have a sentence and we want to print all the named entities in it:

for layer in sentence.select(folia.EntitiesLayer):
    for entity in layer.select(folia.Entity):
        print(" Entity class=", entity.cls, " words=")
        for word in entity.wrefs():
            print(word, end="")  #print without newline
        print()   #print newline

The wrefs() method, available on all span annotation elements, will return a list of all words (as well as morphemes and phonemes) over which a span annotation element spans.

This first way is rather verbose. The second way of accessing span annotation takes another approach, using the findspans() method on Word instances. Here we start from a word and seek span annotations in which that word occurs. Assume we have a word and want to find chunks it occurs in:

for chunk in word.findspans(folia.Chunk):
    print(" Chunk class=", chunk.cls, " words=")
    for word2 in chunk.wrefs(): #print all words in the chunk (of which the word is a part)
        print(word2, end="")
    print()

The findspans() method can be called with either the class of a span annotation element, such as folia.Chunk, or with the class of the layer, such as folia.ChunkingLayer.

The third way allows us to look for span elements given an annotation layer and words. In other words, it checks if one or more words form a span. This is an exact match and not a sub-part match as in the previously described method. To do this, we use the findspan() method on annotation layers:

for span in annotationlayer.findspan(word1,word2):
    print(span.cls)

Span Annotation Types

This section lists the available span annotation elements. The layer that contains them is explicitly mentioned as well.

Some of the span annotation elements are complex and take span role elements as children. These are normal span annotation elements that occur within another span annotation (of a particular type) and cannot be used standalone.

  • folia.Chunk in folia.ChunkingLayer - chunk in chunks - Shallow parsing. Not nested.
  • folia.CoreferenceChain in folia.CoreferenceLayer - coreferencechain in coreferences - Co-references
    • Requires the role folia.CoreferenceLink (coreferencelink), pointing to each coreferenced structure in the chain
  • folia.Dependency in folia.DependencyLayer - dependency in dependencies - Dependency Relations
    • Requires the roles folia.HeadSpan (hd) and folia.DependencyDependent (dep)
  • folia.Entity in folia.EntitiesLayer - entity in entities - Named entities
  • folia.SyntacticUnit in folia.SyntaxLayer - su in syntax - Syntax. These elements are generally nested to form syntax trees.
  • folia.SemanticRole in folia.SemanticRolesLayer - semrole in semroles - Semantic Roles

The span role folia.HeadSpan (hd) may actually be used by most span annotation elements, indicating their head part.

Editing FoLiA

Creating a new document

Creating a new FoLiA document, rather than loading an existing one from file, is done by explicitly providing the ID for the new document in the constructor:

doc = folia.Document(id='example')

Declarations

Whenever you add a new type of annotation, or a different set, to a FoLiA document, you have to declare it first. This is done using the declare() method. It takes as arguments the annotation type and the set, and you can optionally pass the keyword arguments annotator= and annotatortype= to set defaults.

An example for Part-of-Speech annotation:

doc.declare(folia.PosAnnotation, 'brown-tagset')

An example with a default annotator:

doc.declare(folia.PosAnnotation, 'brown-tagset', annotator='proycon', annotatortype=folia.AnnotatorType.MANUAL)

Any additional sets for Part-of-Speech would have to be explicitly declared as well. To check if a particular annotation type and set is declared, use the declared(Class, set) method.

Adding structure

Assuming we begin with an empty document, we should first add a Text element. Then we can add paragraphs, sentences, or other structural elements. The add() method adds new children to an element:

text = doc.add(folia.Text)
paragraph = text.add(folia.Paragraph)
sentence = paragraph.add(folia.Sentence)
sentence.add(folia.Word, 'This')
sentence.add(folia.Word, 'is')
sentence.add(folia.Word, 'a')
sentence.add(folia.Word, 'test')
sentence.add(folia.Word, '.')

Note

The add() method is actually a wrapper around append(), which takes the exact same arguments. The add() method performs extra checks and works for both span annotation and token annotation. Using append() will be faster.

Adding annotations

Adding annotations, or any elements for that matter, is done using the add() method on the intended parent element. We assume that the annotations we add have already been properly declared, otherwise an exception will be raised as soon as add() is called. Let’s build on the previous example:

#First we grab the fourth word, 'test', from the sentence
word = sentence.words(3)

#Add Part-of-Speech tag
word.add(folia.PosAnnotation, set='brown-tagset', cls='n')

#Add lemma
word.add(folia.LemmaAnnotation, cls='test')

Note that in the above examples, the add() method takes a class as first argument, and subsequently takes keyword arguments that will be passed to the class’s constructor.

A second way of using add() is by simply passing a fully instantiated child element, thus constructing it prior to adding. The following is equivalent to the above example, as the previous method is merely a shortcut for convenience:

#First we grab the fourth word, 'test', from the sentence
word = sentence.words(3)

#Add Part-of-Speech tag
word.add( folia.PosAnnotation(doc, set='brown-tagset', cls='n') )

#Add lemma
word.add( folia.LemmaAnnotation(doc, cls='test') )

The add() method always returns that which was added, allowing it to be chained.

In the above example we first explicitly instantiate a folia.PosAnnotation and a folia.LemmaAnnotation. Instantiation of any FoLiA element (always a Python class derived from folia.AbstractElement) follows this pattern:

Class(document, *children, **kwargs)

Note that the document has to be passed explicitly as first argument to the constructor.

The common attributes are set using equally named keyword arguments:

  • id=
  • cls=
  • set=
  • annotator=
  • annotatortype=
  • confidence=
  • src=
  • speaker=
  • begintime=
  • endtime=

Not all attributes are allowed for all elements, and certain attributes are required for certain elements. ValueError exceptions will be raised when these constraints are not met.

Instead of setting id, you can also set the keyword argument generate_id_in and pass it another element; an ID will then be generated automatically, based on the ID of the element passed. When you use the first method of adding elements, instantiation with generate_id_in will take place automatically behind the scenes when applicable and when id is not explicitly set.
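
The idea of deriving a new ID from the ID of another element can be illustrated with a toy model. The sketch below is not the library's implementation: the class IdGenerator is hypothetical, and the naming scheme (parent ID, then the XML tag, then a sequence number) is an assumption inferred from IDs such as example.p.3.s.5.w.1 used elsewhere in this documentation.

```python
from collections import defaultdict

class IdGenerator:
    """Toy model that hands out '<parent id>.<tag>.<n>' identifiers (hypothetical, for illustration)."""

    def __init__(self, parent_id):
        self.parent_id = parent_id
        self.counters = defaultdict(int)   # one running counter per tag name

    def generate(self, tag):
        self.counters[tag] += 1
        return "%s.%s.%d" % (self.parent_id, tag, self.counters[tag])

sentence_ids = IdGenerator("example.p.1.s.1")
print(sentence_ids.generate("w"))        # first word in the sentence
print(sentence_ids.generate("w"))        # second word in the sentence
print(sentence_ids.generate("entity"))   # counters are kept per tag
```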

Any extra non-keyword arguments should be FoLiA elements and will be appended as the contents of the element, i.e. the children or subelements. Instead of using non-keyword arguments, you can also use the keyword argument contents and pass a list. This is a shortcut made merely for convenience, as Python obliges all non-keyword arguments to come before the keyword arguments, which is often aesthetically unpleasing for our purposes. Examples of this use case will be shown in the next section.

Adding span annotation

Adding span annotation is easy with the FoLiA library. As you know, span annotation uses a stand-off annotation embedded in annotation layers. These layers are in turn embedded in structural elements such as sentences. However, the add() method abstracts over this. Consider the following example of a named entity:

doc.declare(folia.Entity, "https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/namedentities.foliaset.xml")

sentence = text.add(folia.Sentence)
sentence.add(folia.Word, 'I',id='example.s.1.w.1')
sentence.add(folia.Word, 'saw',id='example.s.1.w.2')
sentence.add(folia.Word, 'the',id='example.s.1.w.3')
word = sentence.add(folia.Word, 'Dalai',id='example.s.1.w.4')
word2 = sentence.add(folia.Word, 'Lama',id='example.s.1.w.5')
sentence.add(folia.Word, '.', id='example.s.1.w.6')

word.add(folia.Entity, word, word2, cls="per")

To make references to the words, we simply pass the word instances; if you only have their IDs, you can use the document’s index to obtain them. Note also that passing a list using the keyword argument contents is wholly equivalent to passing the non-keyword arguments separately:

word.add(folia.Entity, cls="per", contents=[word,word2])

In the next example we do things more explicitly. We first create a sentence and then add a syntax parse, consisting of nested elements:

doc.declare(folia.SyntaxLayer, 'some-syntax-set')

sentence = text.add(folia.Sentence)
sentence.add(folia.Word, 'The',id='example.s.1.w.1')
sentence.add(folia.Word, 'boy',id='example.s.1.w.2')
sentence.add(folia.Word, 'pets',id='example.s.1.w.3')
sentence.add(folia.Word, 'the',id='example.s.1.w.4')
sentence.add(folia.Word, 'cat',id='example.s.1.w.5')
sentence.add(folia.Word, '.', id='example.s.1.w.6')

#Adding Syntax Layer
layer = sentence.add(folia.SyntaxLayer)

#Adding Syntactic Units
layer.add(
    folia.SyntacticUnit(doc, cls='s', contents=[
        folia.SyntacticUnit(doc, cls='np', contents=[
            folia.SyntacticUnit(doc, doc['example.s.1.w.1'], cls='det'),
            folia.SyntacticUnit(doc, doc['example.s.1.w.2'], cls='n'),
        ]),
        folia.SyntacticUnit(doc, cls='vp', contents=[
            folia.SyntacticUnit(doc, doc['example.s.1.w.3'], cls='v'),
            folia.SyntacticUnit(doc, cls='np', contents=[
                folia.SyntacticUnit(doc, doc['example.s.1.w.4'], cls='det'),
                folia.SyntacticUnit(doc, doc['example.s.1.w.5'], cls='n'),
            ]),
        ]),
        folia.SyntacticUnit(doc, doc['example.s.1.w.6'], cls='fin')
    ])
)

Note

The lower-level append() method would have had the same effect in the above syntax tree sample.

Deleting annotation

Any element can be deleted by calling the remove() method of its parent. Suppose we want to delete word:

word.parent.remove(word)

Copying annotations

A deep copy can be made of any element by calling its copy() method:

word2 = word.copy()

The copy will be without parent and document. If you intend to associate a copy with a new document, then copy as follows instead:

word2 = word.copy(newdoc)

If you intend to attach the copy somewhere in the same document, you may want to add a suffix for any identifiers in its scope, since duplicate identifiers are not allowed and would raise an exception. This can be specified as the second argument:

word2 = word.copy(doc, ".copy")

Searching in a FoLiA document

If you have loaded a FoLiA document into memory, you may want to search for particular annotations. You can of course loop over all structural and annotation elements using select(), annotation() and annotations(). Additionally, Word.findspans() and AbstractAnnotationLayer.findspan() are useful methods of finding span annotations covering particular words, whereas AbstractSpanAnnotation.wrefs() does the reverse and finds the words for a given span annotation element. In addition to these main methods of navigation and selection, there is a higher-level facility available for searching, which uses the FoLiA Query Language (FQL) or the Corpus Query Language (CQL).

These two languages are part of separate libraries that need to be imported:

from pynlpl.formats import fql, cql

Corpus Query Language (CQL)

CQL is the easier language of the two and most suitable for corpus searching. It is, however, less flexible than FQL, which is designed specifically for FoLiA and can not just query, but also manipulate FoLiA documents in great detail.

CQL was developed for the IMS Corpus Workbench, at Stuttgart University, and is implemented in Sketch Engine, which provides good CQL documentation.

CQL has to be converted to FQL first, which is then executed on the given document. This is a simple example querying for the word “house”:

doc = folia.Document(file="/path/to/some/document.folia.xml")
query = fql.Query(cql.cql2fql('"house"'))
for word in query(doc):
    print(word) #these will be folia.Word instances (all matching house)

Multiple words can be queried:

query = fql.Query(cql.cql2fql('"the" "big" "house"'))
for word1,word2,word3 in query(doc):
    print(word1, word2,word3)

Queries may contain wildcard expressions to match multiple text patterns. Gaps can be specified using []. The following will match any three-word combination starting with “the” and ending with something that starts with “house”. It will thus match things like “the big house” or “the small household”:

query = fql.Query(cql.cql2fql('"the" [] "house.*"'))
for word1,word2,word3 in query(doc):
    ...
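
The token pattern "house.*" behaves like a regular expression applied to a whole token. The following sketch illustrates this with Python's own re module rather than with the CQL implementation. It assumes that a pattern has to match the entire token and that matching is case-sensitive, in line with the FQL pattern "^house.*$" shown further on; the token list is made up.

```python
import re

# Assumption: a CQL token pattern must match the whole token, case-sensitively
pattern = re.compile(r"house.*")

tokens = ["house", "household", "greenhouse", "House"]
matches = [token for token in tokens if pattern.fullmatch(token)]
print(matches)
```

Under these assumptions, "greenhouse" is not matched because the pattern is anchored at the start of the token.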

We can make the gap optional with a question mark. It can also be lengthened with + or *, as in regular expressions:

query = fql.Query(cql.cql2fql('"the" []? "house.*"'))
for match in query(doc):
    print("We matched ", len(match), " words")

Querying is not limited to text: all of FoLiA’s annotations can be used. To force our gap to consist of one or more adjectives, we do:

query = fql.Query(cql.cql2fql('"the" [ pos = "a" ]+ "house.*"'))
for match in query(doc):
    ...

The original CQL attribute here is tag rather than pos; this can be used too. In addition, all FoLiA element types can be used! Just use their FoLiA tagname.

Consult the CQL documentation for more. Do note that CQL is very word/token centered. For searching other types of elements, use FQL instead.

FoLiA Query Language (FQL)

FQL is documented separately (see the note at the end of this section); a full overview is beyond the scope of this documentation. We will just introduce some basic selection queries so you can develop an initial impression of the language’s abilities.

Selecting a word with a particular text is done as follows:

query = fql.Query('SELECT w WHERE text = "house"')
for word in query(doc):
    print(word)  #this will be an instance of folia.Word

Regular expression matching can be done using the MATCHES operator:

query = fql.Query('SELECT w WHERE text MATCHES "^house.*$"')
for word in query(doc):
    print(word)

The classes of other annotation types can be easily queried as follows:

query = fql.Query('SELECT w WHERE :pos = "v" AND :lemma = "be"')
for word in query(doc):
    print(word)

You can constrain your queries to a particular target selection using the FOR keyword:

query = fql.Query('SELECT w WHERE text MATCHES "^house.*$" FOR s WHERE text CONTAINS "sell"')
for word in query(doc):
    print(word)

This construction also allows you to select the actual annotations. To select all people (a named entity) for words that are not John:

query = fql.Query('SELECT entity WHERE class = "person" FOR w WHERE text != "John"')
for entity in query(doc):
    print(entity) #this will be an instance of folia.Entity

FOR statements may be chained, and explicit IDs can be passed using the ID keyword:

query = fql.Query('SELECT entity WHERE class = "person" FOR w WHERE text != "John" FOR div ID "section.21"')
for entity in query(doc):
    print(entity)

Sets are specified using the OF keyword. It can be omitted if there is only one set for the annotation type, but is required otherwise:

query = fql.Query('SELECT su OF "http://some/syntax/set" WHERE class = "np"')
for su in query(doc):
    print(su) #this will be an instance of folia.SyntacticUnit

We have just covered the SELECT keyword; FQL has other keywords for manipulating documents, such as EDIT, ADD, APPEND and PREPEND.

Note

Consult the FQL documentation at https://github.com/proycon/foliadocserve/blob/master/README.rst for further documentation on the language.

Streaming Reader

Throughout this tutorial you have seen the folia.Document class as a means of reading FoLiA documents. This class always loads the entire document in memory, which can be a considerable resource demand. The folia.Reader class provides an alternative to loading FoLiA documents. It does not load the entire document in memory but merely returns the elements you are interested in. This results in far less memory usage and also provides a speed-up.

A reader is constructed as follows. The second argument is the class of the element you want:

reader = folia.Reader("my.folia.xml", folia.Word)
for word in reader:
    print(word.id)
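
The general idea behind streaming can be illustrated with Python's standard xml.etree.ElementTree.iterparse. This is a sketch of the concept only and makes no claim about how folia.Reader is implemented. The function name stream_ids and the sample document are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def stream_ids(source, tagname):
    """Yield the xml:id of each element with the given tag, one at a time (hypothetical helper)."""
    wanted = "{%s}%s" % (FOLIA_NS, tagname)
    for event, element in ET.iterparse(source, events=("end",)):
        if element.tag == wanted:
            yield element.get(XML_ID)
            element.clear()   # drop the element's contents once it has been handled

# Made-up sample document, supplied as a file-like object
SAMPLE = (
    b'<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="example">'
    b'<text xml:id="example.text"><s xml:id="example.s.1">'
    b'<w xml:id="example.s.1.w.1"><t>Hello</t></w>'
    b'<w xml:id="example.s.1.w.2"><t>world</t></w>'
    b'</s></text></FoLiA>'
)

for word_id in stream_ids(io.BytesIO(SAMPLE), "w"):
    print(word_id)
```

Because each element is handled and cleared as soon as its closing tag is read, the handled words do not accumulate in memory the way a fully loaded document does.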

Higher-Order Annotations

Text Markup

FoLiA has a number of text markup elements, which appear within the folia.TextContent (t) element. Iterating over a folia.TextContent element will first and foremost produce strings, but will also uncover these markup elements when present. The following markup types exist:

  • folia.TextMarkupGap (t-gap) - For marking gaps in the text
  • folia.TextMarkupString (t-str) - For marking arbitrary substring
  • folia.TextMarkupStyle (t-style) - For marking style (such as bold, italics, as dictated by the set used)
  • folia.TextMarkupCorrection (t-correction) - Simple in-line corrections
  • folia.TextMarkupError (t-error) - For marking errors

Features

Features allow a second-order annotation by adding the ability to assign properties and values to any of the existing annotation elements. They follow the set/class paradigm by adding the notion of a subset and a class relative to this subset. The feat() method provides a shortcut that can be used on any annotation element to obtain the class of the feature, given a subset. To illustrate the concept, take a look at part-of-speech annotation with some features:

pos = word.annotation(folia.PosAnnotation)
if pos.cls == "n":
    if pos.feat('number') == 'plural':
        print("We have a plural noun!")
    elif pos.feat('number') == 'singular':
        print("We have a singular noun!")

The feat() method will raise an exception when the feature does not exist. Note that the actual subset and class values are defined by the set and not by FoLiA itself! They are therefore fictitious in the above example.

The Python class for features is folia.Feature, in the following example we add a feature:

pos.add(folia.Feature, subset="gender", cls="f")

Although FoLiA does not define any sets or subsets itself, some annotation types do come with associated subsets; their use is never mandatory. The advantage is that these associated subsets can be used directly as an XML attribute in the FoLiA document. The FoLiA library provides extra classes for these, all derived from folia.Feature:

  • folia.SynsetFeature, for use with folia.SenseAnnotation
  • folia.ActorFeature, for use with folia.Event
  • folia.BegindatetimeFeature, for use with folia.Event
  • folia.EnddatetimeFeature, for use with folia.Event

Alternatives

A key feature of FoLiA is its ability to make alternative annotations explicit. For token annotations, the folia.Alternative (alt) class is used to this end, and alternative annotations are embedded in this structure. This implies the annotation is not authoritative, but is merely an alternative to the actual annotation (if any). Alternatives may typically occur in larger numbers, representing a distribution in which each has a confidence value (not mandatory). Each alternative is wrapped in its own folia.Alternative element, as multiple elements inside a single alternative are considered dependent and part of the same alternative. Combining multiple annotations in one alternative makes sense for mixed annotation types, where for instance a pos tag alternative is tied to a particular lemma:

alt = word.add(folia.Alternative)
alt.add(folia.PosAnnotation, set='brown-tagset',cls='n',confidence=0.5)
alt = word.add(folia.Alternative)   #note that we reassign the variable!
alt.add(folia.PosAnnotation, set='brown-tagset',cls='a',confidence=0.3)
alt = word.add(folia.Alternative)
alt.add(folia.PosAnnotation, set='brown-tagset',cls='v',confidence=0.2)
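The repeated pattern above can be folded into a loop over a distribution. This is only a sketch: add_alternatives and the Recorder stand-in are hypothetical names, and the element classes are passed in as parameters so that the example runs without the library installed.

```python
def add_alternatives(word, alternative_class, annotation_class, tagset, distribution):
    """Wrap each (cls, confidence) pair in its own alternative element."""
    added = []
    for cls, confidence in distribution:
        alt = word.add(alternative_class)  # one alternative per candidate
        alt.add(annotation_class, set=tagset, cls=cls, confidence=confidence)
        added.append(alt)
    return added


class Recorder:
    """Stand-in that records add() calls; NOT a FoLiA class."""

    def __init__(self):
        self.calls = []

    def add(self, child, *args, **kwargs):
        self.calls.append((child, kwargs))
        return Recorder()


word = Recorder()
distribution = [("n", 0.5), ("a", 0.3), ("v", 0.2)]
alts = add_alternatives(word, "Alternative", "PosAnnotation", "brown-tagset", distribution)
print(len(word.calls))              # 3
print(alts[1].calls[0][1]["cls"])   # a
```

With the real library, one would presumably pass folia.Alternative and folia.PosAnnotation in place of the placeholder strings.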

Span annotation elements have a different mechanism for alternatives: for those, the entire annotation layer is embedded in a folia.AlternativeLayers element. This element should be repeated for every type, unless the layers it describes are dependent on each other:

alt = sentence.add(folia.AlternativeLayers)
layer = alt.add(folia.Entities)
entity = layer.add(folia.Entity, word1,word2,cls="person", confidence=0.3)

Because the alternative annotations are non-authoritative, normal selection methods such as select() and annotations() will never yield them, unless explicitly told to do so. For this reason, there is an alternatives() method on structure elements, for the first category of alternatives.

Corrections

Corrections are one of the most complex annotation types in FoLiA. Corrections can be applied not just over text, but over any type of structure annotation, token annotation or span annotation. Corrections explicitly preserve the original, and recursively so if corrections are done over other corrections.

Despite their complexity, the library treats correction transparently. Whenever you query for a particular element, and it is part of a correction, you get the corrected version rather than the original. The original is always non-authoritative and normal selection methods will ignore it.

If you want to deal with corrections, you have to explicitly get a folia.Correction element. If an element is part of a correction, its incorrection() method will return the correction element; if not, it will return None:

pos = word.annotation(folia.PosAnnotation)
correction = pos.incorrection()
if correction:
    if correction.hasoriginal():
        originalpos = correction.original(0) #assuming it's the only element as is customary
        #originalpos will be an instance of folia.PosAnnotation
        print("The original pos was", originalpos.cls)
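The lookup above can be wrapped in a small helper that returns None when there is nothing to report. This is a sketch under assumptions: original_class, FakeCorrection and FakeAnnotation are hypothetical names, and the stand-ins only mimic the incorrection(), hasoriginal() and original(index) calls used in the example.

```python
def original_class(annotation):
    """Return the class of the original annotation, or None if not corrected."""
    correction = annotation.incorrection()
    if correction is None or not correction.hasoriginal():
        return None
    # Assumes the original holds a single element, as is customary.
    return correction.original(0).cls


class FakeCorrection:
    """Stand-in for a correction element; NOT part of the FoLiA library."""

    def __init__(self, originals):
        self._originals = originals

    def hasoriginal(self):
        return bool(self._originals)

    def original(self, index=None):
        return self._originals if index is None else self._originals[index]


class FakeAnnotation:
    """Stand-in for an annotation element; NOT part of the FoLiA library."""

    def __init__(self, cls, correction=None):
        self.cls = cls
        self._correction = correction

    def incorrection(self):
        return self._correction


old = FakeAnnotation("v")
new = FakeAnnotation("n", FakeCorrection([old]))
print(original_class(new))  # v
print(original_class(old))  # None
```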

Corrections themselves carry a class too, indicating the type of correction (defined by the set used and not by FoLiA).

Besides original(), corrections distinguish three other types: new() (the corrected version), current() (the current uncorrected version) and suggestions(i) (a suggestion for correction). The former two and the latter two usually form pairs; current() and new() can never be used together. There may be multiple suggestions, hence the index argument to suggestions(i). These methods return, respectively, instances of folia.Original, folia.New, folia.Current and folia.Suggestion.

Adding a correction can be done explicitly:

wrongpos = word.annotation(folia.PosAnnotation)
word.add(folia.Correction, folia.New(doc, folia.PosAnnotation(doc, cls="n")) , folia.Original(doc, wrongpos), cls="misclassified")

Let’s settle for a suggestion rather than an actual correction:

wrongpos = word.annotation(folia.PosAnnotation)
word.add(folia.Correction, folia.Suggestion(doc, folia.PosAnnotation(doc, cls="n")), cls="misclassified")

In some instances, when correcting text or structural elements, folia.New() may be empty, which would correspond to a deletion. Similarly, folia.Original() may be empty, corresponding to an insertion.
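The relation between empty parts and the kind of edit can be summarised in a few lines. This is a plain-Python illustration of the rule stated above, operating on ordinary lists rather than on library elements; classify_edit is a hypothetical name.

```python
def classify_edit(original_items, new_items):
    """Name the kind of edit implied by which side of a correction is empty."""
    if original_items and new_items:
        return "replacement"
    if original_items:
        return "deletion"   # nothing new: the original was removed
    if new_items:
        return "insertion"  # nothing original: something was added
    return "empty"


print(classify_edit(["teh"], ["the"]))  # replacement
print(classify_edit(["very"], []))      # deletion
print(classify_edit([], ["not"]))       # insertion
```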

The use of folia.Current() is reserved for use with structure elements, such as words, in combination with suggestions. The structure elements then have to be embedded in folia.Current(). This situation arises for instance when making suggestions for a merge or split.

API Reference

class pynlpl.formats.folia.AbstractAnnotation(doc, *args, **kwargs)
class pynlpl.formats.folia.AbstractAnnotationLayer(doc, *args, **kwargs)

Annotation layers for Span Annotation are derived from this abstract base class

OPTIONAL_ATTRIBS = (0, 6)
PRINTABLE = False
ROOTELEMENT = False
add(child, *args, **kwargs)
alternatives(Class=None, set=None)

Generator over alternatives, either all or only of a specific annotation type, and possibly restrained also by set.

Arguments:
  • Class - The Class you want to retrieve (e.g. PosAnnotation). Or set to None to select all alternatives regardless of what type they are.
  • set - The set you want to retrieve (defaults to None, which selects regardless of set)
Returns:
Generator over Alternative elements
annotation(type, set=None)

Will return a single annotation (even if there are multiple). Raises a NoSuchAnnotation exception if none was found

annotations(Class, set=None)

Obtain annotations. Very similar to select() but raises an error if the annotation was not found.

Arguments:
  • Class - The Class you want to retrieve (e.g. PosAnnotation)
  • set - The set you want to retrieve (defaults to None, which selects regardless of set)
Returns:
A list of elements
Raises:
NoSuchAnnotation if the specified annotation does not exist.
append(child, *args, **kwargs)
findspan(*words)

Returns the span element which spans over the specified words or morphemes

hasannotation(Class, set=None)

Returns an integer indicating whether such an annotation exists, and if so, how many. See annotations() for a description of the parameters.

classmethod relaxng(includechildren=True, extraattribs=None, extraelements=None, origclass=None)

Returns a RelaxNG definition for this element (as an XML element (lxml.etree) rather than a string)

xml(attribs=None, elements=None, skipchildren=False)
class pynlpl.formats.folia.AbstractCorrectionChild(doc, *args, **kwargs)
ACCEPTED_DATA = (<class 'pynlpl.formats.folia.AbstractTokenAnnotation'>, <class 'pynlpl.formats.folia.AbstractSpanAnnotation'>, <class 'pynlpl.formats.folia.Word'>, <class 'pynlpl.formats.folia.TextContent'>, <class 'pynlpl.formats.folia.String'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Metric'>)
OPTIONAL_ATTRIBS = (2, 3, 5, 4)
PRINTABLE = True
ROOTELEMENT = False
TEXTDELIMITER = None
class pynlpl.formats.folia.AbstractDefinition
class pynlpl.formats.folia.AbstractElement(doc, *args, **kwargs)

This is the abstract base class from which all FoLiA elements are derived. This class should not be instantiated directly, but can be useful if you want to check if a variable is an instance of any FoLiA element: isinstance(x, AbstractElement). It contains methods and variables that are commonly inherited.

ACCEPTED_DATA = ()
ANNOTATIONTYPE = None
AUTH = True
OCCURRENCES = 0
OCCURRENCESPERSET = 1
OPTIONAL_ATTRIBS = ()
PRINTABLE = False
REQUIRED_ATTRIBS = ()
ROOTELEMENT = True
TEXTCONTAINER = False
TEXTDELIMITER = None
XMLTAG = None
add(child, *args, **kwargs)

High-level function that adds (appends) an annotation to an element. It will simply call append() for token annotation elements that fit within the scope. For span annotation, it will find or create the proper annotation layer and insert the element there.

classmethod addable(Class, parent, set=None, raiseexceptions=True)

Tests whether a new element of this class can be added to the parent. Returns a boolean or raises ValueError exceptions (unless set to ignore)!

This will use OCCURRENCES, but may be overridden for more customised behaviour.

This method is mostly for internal use.

addidsuffix(idsuffix, recursive=True)
addtoindex(norecurse=[])

Makes sure this element (and all subelements), are properly added to the index

ancestor(*Classes)

Find the most immediate ancestor of the specified type, multiple classes may be specified

ancestors(Class=None)

Generator yielding all ancestors of this element, effectively back-tracing its path to the root element. A tuple of multiple classes may be specified.

append(child, *args, **kwargs)

Append a child element. Returns the added element

Arguments:
  • child - Instance or class

If an instance is passed as first argument, it will be appended. If a class derived from AbstractElement is passed as first argument, an instance will first be created and then appended.

Keyword arguments:
  • alternative= - If set to True, the element will be made into an alternative.

Generic example, passing a pre-generated instance:

word.append( folia.LemmaAnnotation(doc,  cls="house", annotator="proycon", annotatortype=folia.AnnotatorType.MANUAL ) )

Generic example, passing a class to be generated:

word.append( folia.LemmaAnnotation, cls="house", annotator="proycon", annotatortype=folia.AnnotatorType.MANUAL )

Generic example, setting text with a class:

word.append( "house", cls='original' )
context(size, placeholder=None, scope=None)

Returns this word in context, {size} words to the left, the current word, and {size} words to the right
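To make the shape of such a context window concrete, here is a toy version over a plain list. It is not the library's implementation: context_window is a hypothetical name, and padding short sides with the placeholder is an assumption about what the placeholder argument is for.

```python
def context_window(items, index, size, placeholder=None):
    """Return `size` items left of items[index], the item itself, and `size` to the right."""
    left = items[max(0, index - size):index]
    right = items[index + 1:index + 1 + size]
    # Assumption: missing positions at the edges are filled with the placeholder.
    left = [placeholder] * (size - len(left)) + left
    right = right + [placeholder] * (size - len(right))
    return left + [items[index]] + right


words = ["a", "b", "c", "d"]
print(context_window(words, 0, 2))           # [None, None, 'a', 'b', 'c']
print(context_window(words, 2, 1, "<pad>"))  # ['b', 'c', 'd']
```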

copy(newdoc=None, idsuffix='')

Make a deep copy of this element and all its children. If idsuffix is a string, it is used as the ID suffix; if set to True, a random idsuffix will be generated including a random 32-bit hash

copychildren(newdoc=None, idsuffix='')

Generator creating a deep copy of the children of this element. If idsuffix is a string, it is used as the ID suffix; if set to True, a random idsuffix will be generated including a random 32-bit hash

count(Class, set=None, recursive=True, ignore=True, node=None)

Like select, but instead of returning the elements, it merely counts them

deepvalidation()
description()

Obtain the description associated with the element, will raise NoDescription if there is none

feat(subset)

Obtain the feature value of the specific subset. If a feature occurs multiple times, the values will be returned in a list.

Example:

sense = word.annotation(folia.SenseAnnotation)
synset = sense.feat('synset')
classmethod findreplaceables(Class, parent, set=None, **kwargs)

Find replaceable elements. Auxiliary function used by replace(). Can be overridden for more fine-grained control. Mostly for internal use.

getindex(child, recursive=True, ignore=True)

Returns the index at which an element occurs, recursive by default!

gettextdelimiter(retaintokenisation=False)

May return a customised text delimiter instead of the default for this class.

hastext(cls='current')

Does this element have text (of the specified class)

incorrection()

Is this element part of a correction? If it is, it returns the Correction element (evaluating to True), otherwise it returns None

insert(index, child, *args, **kwargs)

Insert a child element at specified index. Returns the added element

If an instance is passed as first argument, it will be inserted. If a class derived from AbstractElement is passed as first argument, an instance will first be created and then inserted.

Arguments:
  • index
  • child - Instance or class
Keyword arguments:
  • alternative= - If set to True, the element will be made into an alternative.
  • corrected= - Used only when passing strings to be made into TextContent elements.

Generic example, passing a pre-generated instance:

word.insert( 3, folia.LemmaAnnotation(doc,  cls="house", annotator="proycon", annotatortype=folia.AnnotatorType.MANUAL ) )

Generic example, passing a class to be generated:

word.insert( 3, folia.LemmaAnnotation, cls="house", annotator="proycon", annotatortype=folia.AnnotatorType.MANUAL )

Generic example, setting text:

word.insert( 3, "house" )
items(founditems=[])

Returns a depth-first flat list of all items below this element (not limited to AbstractElement)

json(attribs=None, recurse=True)
leftcontext(size, placeholder=None, scope=None)

Returns the left context for an element, as a list. This method crosses sentence/paragraph boundaries by default, which can be restricted by setting scope

next(Class=True, scope=True, reverse=False)

Returns the next element, if it is of the specified type and if it does not cross the boundary of the defined scope. Returns None if no next element is found. Non-authoritative elements are never returned.

Arguments:
  • Class: The class to select; any Python class subclassed off AbstractElement, may also be a tuple of multiple classes. Set to True to constrain to the same class as that of the current instance, set to None to not constrain at all
  • scope: A list of classes which are never crossed looking for a next element. Set to True to constrain to a default list of structure elements (Sentence,Paragraph,Division,Event, ListItem,Caption), set to None to not constrain at all.
originaltext()

Alias for retrieving the original uncorrected text

classmethod parsexml(Class, node, doc)

Internal class method used for turning an XML element into an instance of the Class.

Args:
  • node - XML Element
  • doc - Document
Returns:
An instance of the current Class.
postappend()

This method will be called after an element is added to another. It can do extra checks and if necessary raise exceptions to prevent addition. By default makes sure the right document is associated.

This method is mostly for internal use.

previous(Class=True, scope=True)

Returns the previous element, if it is of the specified type and if it does not cross the boundary of the defined scope. Returns None if no previous element is found. Non-authoritative elements are never returned.

Arguments:
  • Class: The class to select; any Python class subclassed off AbstractElement. Set to True to constrain to the same class as that of the current instance, set to None to not constrain at all
  • scope: A list of classes which are never crossed looking for a previous element. Set to True to constrain to a default list of structure elements (Sentence,Paragraph,Division,Event, ListItem,Caption), set to None to not constrain at all.
classmethod relaxng(includechildren=True, extraattribs=None, extraelements=None, origclass=None)

Returns a RelaxNG definition for this element (as an XML element (lxml.etree) rather than a string)

remove(child)

Removes the child element

replace(child, *args, **kwargs)

Appends a child element like append(), but replaces any existing child element of the same type and set. If no such child element exists, this will act the same as append()

Keyword arguments:
  • alternative - If set to True, the replaced element will be made into an alternative. Simply use append() if you want the added element to be an alternative.

See append() for more information.

resolveword(id)
rightcontext(size, placeholder=None, scope=None)

Returns the right context for an element, as a list. This method crosses sentence/paragraph boundaries by default, which can be restricted by setting scope

select(Class, set=None, recursive=True, ignore=True, node=None)

Select child elements of the specified class.

A further restriction can be made based on set. Whether or not to apply recursively (by default enabled) can also be configured, optionally with a list of elements never to recurse into.

Arguments:
  • Class: The class to select; any Python class subclassed off AbstractElement

  • set: The set to match against, only elements pertaining to this set will be returned. If set to None (default), all elements regardless of set will be returned.

  • recursive: Select recursively? Descending into child elements? Boolean defaulting to True.

  • ignore: A list of Classes to ignore. If set to True instead of a list, all non-authoritative elements will be skipped (this is the default behaviour). It is common not to want to recurse into the following elements: folia.Alternative, folia.AlternativeLayers, folia.Suggestion, and folia.Original. The elements contained in these are never authoritative. You may also include the boolean True as a member of a list, if you want to skip additional tags along with the non-authoritative ones.

  • node: Reserved for internal usage, used in recursion.

Returns:
A generator of elements (instances)

Example:

text.select(folia.SenseAnnotation, 'cornetto', True, [folia.Original, folia.Suggestion, folia.Alternative] )
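The effect of skipping non-authoritative elements can be seen on a toy tree. This is a simplified model and not the library's code: Node, its auth flag and the select function below are stand-ins written for illustration only.

```python
class Node:
    """Toy tree node; a stand-in, not a FoLiA element."""

    def __init__(self, name, auth=True, children=()):
        self.name = name
        self.auth = auth
        self.children = list(children)


def select(node, predicate, recursive=True, skip_nonauth=True):
    """Yield descendants matching `predicate`, depth-first."""
    for child in node.children:
        if skip_nonauth and not child.auth:
            continue  # neither yield nor descend into non-authoritative nodes
        if predicate(child):
            yield child
        if recursive:
            yield from select(child, predicate, recursive, skip_nonauth)


def is_pos(node):
    return node.name.startswith("pos")


word = Node("w", children=[
    Node("pos:n"),
    Node("alt", auth=False, children=[Node("pos:v")]),
])
print([n.name for n in select(word, is_pos)])                      # ['pos:n']
print([n.name for n in select(word, is_pos, skip_nonauth=False)])  # ['pos:n', 'pos:v']
```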
setdoc(newdoc)

Set a different document, usually no need to call this directly, invoked implicitly by copy()

setdocument(doc)

Associate a document with this element

setparents()

Correct all parent relations for elements within the scope, usually no need to call this directly, invoked implicitly by copy()

settext(text, cls='current')

Set the text for this element (and class)

stricttext(cls='current')

Get the text strictly associated with this element (of the specified class). Does not recurse into children, with the sole exception of Correction/New

text(cls='current', retaintokenisation=False, previousdelimiter='')

Get the text associated with this element (of the specified class), will always be a unicode instance. If no text is directly associated with the element, it will be obtained from the children. If that doesn’t result in any text either, a NoSuchText exception will be raised.

If retaintokenisation is True, the space attribute on words will be ignored, otherwise it will be adhered to and text will be detokenised as much as possible.
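The difference between the two modes can be illustrated with plain tuples. This is a toy model, not the library's algorithm: join_tokens is a hypothetical name, and each token is assumed to carry a boolean saying whether a space follows it.

```python
def join_tokens(tokens, retaintokenisation=False):
    """Join (text, space) pairs; `space` says whether a space follows the token."""
    parts = []
    for position, (text, space) in enumerate(tokens):
        parts.append(text)
        is_last = position == len(tokens) - 1
        # When retaining tokenisation, always separate tokens with a space.
        if not is_last and (retaintokenisation or space):
            parts.append(" ")
    return "".join(parts)


tokens = [("Hello", False), (",", True), ("world", False), ("!", True)]
print(join_tokens(tokens))                           # Hello, world!
print(join_tokens(tokens, retaintokenisation=True))  # Hello , world !
```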

textcontent(cls='current')

Get the text explicitly associated with this element (of the specified class). Returns the TextContent instance rather than the actual text. Raises NoSuchText exception if not found.

Unlike text(), this method does not recurse into child elements (with the sole exception of the Correction/New element), and it returns the TextContent instance rather than the actual text!

toktext(cls='current')

Alias for text with retaintokenisation=True

updatetext()

Internal method, recompute textual value. Only for elements that are a TEXTCONTAINER

xml(attribs=None, elements=None, skipchildren=False)

Serialises the FoLiA element to XML, by returning an XML Element (in lxml.etree) for this element and all its children. For string output, consider the xmlstring() method instead.

xmlstring(pretty_print=False)

Serialises this FoLiA element to XML, returns a (unicode) string with XML representation for this element and all its children.

class pynlpl.formats.folia.AbstractExtendedTokenAnnotation(doc, *args, **kwargs)
class pynlpl.formats.folia.AbstractSpanAnnotation(doc, *args, **kwargs)

Abstract element, all span annotation elements are derived from this class

OCCURRENCESPERSET = 0
OPTIONAL_ATTRIBS = (0, 1, 2, 4, 3, 5)
PRINTABLE = True
REQUIRED_ATTRIBS = ()
add(child, *args, **kwargs)
addtoindex(norecurse=None)
annotation(type, set=None)

Will return a single annotation (even if there are multiple). Raises a NoSuchAnnotation exception if none was found

append(child, *args, **kwargs)
copychildren(newdoc=None, idsuffix='')

Generator creating a deep copy of the children of this element. If idsuffix is a string, it is used as the ID suffix; if set to True, a random idsuffix will be generated including a random 32-bit hash

hasannotation(Class, set=None)

Returns an integer indicating whether such an annotation exists, and if so, how many. See annotations() for a description of the parameters.

setspan(*args)

Sets the span of the span element anew, erases all data inside

wrefs(index=None)

Returns a list of word references, these can be Words but also Morphemes or Phonemes.

Arguments:
  • index: If set to an integer, will retrieve and return the n’th element (starting at 0) instead of returning the list of all
xml(attribs=None, elements=None, skipchildren=False)
class pynlpl.formats.folia.AbstractSpanRole(doc, *args, **kwargs)
OPTIONAL_ATTRIBS = (0, 2, 4, 5)
REQUIRED_ATTRIBS = ()
ROOTELEMENT = False
class pynlpl.formats.folia.AbstractStructureElement(doc, *args, **kwargs)

Abstract element, all structure elements inherit from this class. Never instantiated directly.

OCCURRENCESPERSET = 0
OPTIONAL_ATTRIBS = (0, 1, 2, 4, 3, 5)
PRINTABLE = True
REQUIRED_ATTRIBS = (0,)
TEXTDELIMITER = '\n\n'
append(child, *args, **kwargs)

See AbstractElement.append()

hasannotationlayer(annotationtype=None, set=None)

Does the specified annotation layer exist?

layers(annotationtype=None, set=None)

Returns a list of annotation layers found directly under this element, does not include alternative layers

paragraphs(index=None)

Returns a generator of Paragraph elements found (recursively) under this element.

Arguments:
  • index: If set to an integer, will retrieve and return the n’th element (starting at 0) instead of returning the generator of all
resolveword(id)
sentences(index=None)

Returns a generator of Sentence elements found (recursively) under this element

Arguments:
  • index: If set to an integer, will retrieve and return the n’th element (starting at 0) instead of returning a generator of all
words(index=None)

Returns a generator of Word elements found (recursively) under this element.

Arguments:
  • index: If set to an integer, will retrieve and return the n’th element (starting at 0) instead of returning the generator of all
class pynlpl.formats.folia.AbstractSubtokenAnnotation(doc, *args, **kwargs)

Abstract element, all subtoken annotation elements are derived from this class

OCCURRENCESPERSET = 0
OPTIONAL_ATTRIBS = (0, 1, 2, 4, 3, 5)
PRINTABLE = True
REQUIRED_ATTRIBS = ()
class pynlpl.formats.folia.AbstractTextMarkup(doc, *args, **kwargs)
ACCEPTED_DATA = (<class 'pynlpl.formats.folia.AbstractTextMarkup'>,)
OPTIONAL_ATTRIBS = (0, 1, 2, 4, 3, 5)
PRINTABLE = True
REQUIRED_ATTRIBS = ()
ROOTELEMENT = False
TEXTCONTAINER = True
TEXTDELIMITER = ''
json(attribs=None, recurse=True)
classmethod parsexml(Class, node, doc)
classmethod relaxng(includechildren=True, extraattribs=None, extraelements=None)
resolve()
settext(text)
text()

Obtain the text (unicode instance)

xml(attribs=None, elements=None, skipchildren=False)
class pynlpl.formats.folia.AbstractTokenAnnotation(doc, *args, **kwargs)

Abstract element, all token annotation elements are derived from this class

OCCURRENCESPERSET = 1
OPTIONAL_ATTRIBS = (0, 1, 2, 4, 3, 5)
REQUIRED_ATTRIBS = (1,)
append(child, *args, **kwargs)

See AbstractElement.append()

class pynlpl.formats.folia.ActorFeature(doc, *args, **kwargs)

Actor feature, to be used within Event

SUBSET = 'actor'
XMLTAG = None
class pynlpl.formats.folia.AlignReference(doc, *args, **kwargs)
REQUIRED_ATTRIBS = (0,)
XMLTAG = 'aref'
json(attribs=None, recurse=True)
classmethod parsexml(Class, node, doc)
classmethod relaxng(includechildren=True, extraattribs=None, extraelements=None)
resolve(alignmentcontext)
xml(attribs=None, elements=None, skipchildren=False)
class pynlpl.formats.folia.Alignment(doc, *args, **kwargs)
ACCEPTED_DATA = (<class 'pynlpl.formats.folia.AlignReference'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Metric'>)
ANNOTATIONTYPE = 28
OCCURRENCESPERSET = 0
OPTIONAL_ATTRIBS = (0, 1, 2, 4, 3, 5)
PRINTABLE = False
REQUIRED_ATTRIBS = ()
XMLTAG = 'alignment'
json(attribs=None)
resolve()
class pynlpl.formats.folia.AllowCorrections
correct(**kwargs)

Apply a correction (TODO: documentation to be written still)

class pynlpl.formats.folia.AllowGenerateID

Classes inherited from this class allow for automatic ID generation, using the convention of adding a period, the name of the element, another period, and a sequence number

generate_id(cls)
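The naming convention described for this class can be written out in a few lines. This is an illustration of the convention only, not the body of generate_id(); make_child_id is a hypothetical name, and taking the XML tag as "the name of the element" is an assumption.

```python
def make_child_id(parent_id, element_name, sequence_number):
    """Build an ID as parent ID, period, element name, period, sequence number."""
    return "{}.{}.{}".format(parent_id, element_name, sequence_number)


print(make_child_id("example.p.1.s.1", "w", 3))  # example.p.1.s.1.w.3
```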
class pynlpl.formats.folia.AllowTokenAnnotation

Elements that allow token annotation (including extended annotation) must inherit from this class

alternatives(Class=None, set=None)

Generator over alternatives, either all or only of a specific annotation type, and possibly restrained also by set.

Arguments:
  • Class - The Class you want to retrieve (e.g. PosAnnotation). Or set to None to select all alternatives regardless of what type they are.
  • set - The set you want to retrieve (defaults to None, which selects regardless of set)
Returns:
Generator of Alternative elements
annotation(type, set=None)

Will return a single annotation (even if there are multiple). Raises a NoSuchAnnotation exception if none was found

annotations(Class, set=None)

Obtain annotations. Very similar to select() but raises an error if the annotation was not found.

Arguments:
  • Class - The Class you want to retrieve (e.g. PosAnnotation)
  • set - The set you want to retrieve (defaults to None, which selects regardless of set)
Returns:
A generator of elements
Raises:
NoSuchAnnotation if the specified annotation does not exist.
hasannotation(Class, set=None)

Returns an integer indicating whether such an annotation exists, and if so, how many. See annotations() for a description of the parameters.

class pynlpl.formats.folia.Alternative(doc, *args, **kwargs)

Element grouping alternative token annotation(s). Multiple alternative elements may occur, each denoting a different alternative. Elements grouped inside an alternative block are considered dependent.

ACCEPTED_DATA = [<class 'pynlpl.formats.folia.AbstractTokenAnnotation'>, <class 'pynlpl.formats.folia.Correction'>, <class 'pynlpl.formats.folia.MorphologyLayer'>]
ANNOTATIONTYPE = 19
AUTH = False
OPTIONAL_ATTRIBS = (0, 1, 2, 4, 3, 5)
PRINTABLE = False
REQUIRED_ATTRIBS = ()
XMLTAG = 'alt'
class pynlpl.formats.folia.AlternativeLayers(doc, *args, **kwargs)

Element grouping alternative subtoken annotation(s). Multiple altlayers elements may occur, each denoting a different alternative. Elements grouped inside an alternative block are considered dependent.

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.AbstractAnnotationLayer'>,)
AUTH = False
OPTIONAL_ATTRIBS = (0, 1, 2, 4, 3, 5)
PRINTABLE = False
REQUIRED_ATTRIBS = ()
XMLTAG = 'altlayers'
class pynlpl.formats.folia.AnnotationType
ALIGNMENT = 28
ALTERNATIVE = 19
CHUNKING = 14
COMPLEXALIGNMENT = 29
COREFERENCE = 30
CORRECTION = 16
DEPENDENCY = 24
DIVISION = 2
DOMAIN = 11
ENTITY = 15
ERRORDETECTION = 18
EVENT = 23
FIGURE = 5
GAP = 26
LANG = 33
LEMMA = 10
LINEBREAK = 7
LIST = 4
METRIC = 32
MORPHOLOGICAL = 22
NOTE = 27
PARAGRAPH = 3
PART = 37
PHON = 20
POS = 9
SEMROLE = 31
SENSE = 12
SENTENCE = 8
STRING = 34
STYLE = 36
SUBJECTIVITY = 21
SUGGESTION = 17
SYNTAX = 13
TABLE = 35
TEXT = 0
TIMESEGMENT = 25
TOKEN = 1
WHITESPACE = 6
class pynlpl.formats.folia.AnnotatorType
AUTO = 1
MANUAL = 2
UNSET = 0
class pynlpl.formats.folia.Attrib
ALL = (0, 1, 2, 4, 3, 5)
ANNOTATOR = 2
CLASS = 1
CONFIDENCE = 3
DATETIME = 5
ID = 0
N = 4
SETONLY = 6
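The numeric tuples listed under OPTIONAL_ATTRIBS and REQUIRED_ATTRIBS throughout this reference are easier to read once mapped back to the names above. The lookup table below simply copies the values listed for this class; attrib_names is a hypothetical helper and not part of the library.

```python
# Values copied from the Attrib listing above.
ATTRIB_NAMES = {
    0: "ID",
    1: "CLASS",
    2: "ANNOTATOR",
    3: "CONFIDENCE",
    4: "N",
    5: "DATETIME",
    6: "SETONLY",
}


def attrib_names(codes):
    """Translate a tuple of attribute codes into their symbolic names."""
    return [ATTRIB_NAMES[code] for code in codes]


print(attrib_names((0, 6)))        # ['ID', 'SETONLY']
print(attrib_names((2, 3, 5, 4)))  # ['ANNOTATOR', 'CONFIDENCE', 'DATETIME', 'N']
```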
class pynlpl.formats.folia.BegindatetimeFeature(doc, *args, **kwargs)

Begindatetime feature, to be used within Event

SUBSET = 'begindatetime'
XMLTAG = None
class pynlpl.formats.folia.BypassLeakFile
read(n=0)
readline()
class pynlpl.formats.folia.Caption(doc, *args, **kwargs)

Element used for captions for figures or tables, contains sentences

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.Sentence'>, <class 'pynlpl.formats.folia.Reference'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.TextContent'>, <class 'pynlpl.formats.folia.String'>, <class 'pynlpl.formats.folia.Alignment'>, <class 'pynlpl.formats.folia.Gap'>, <class 'pynlpl.formats.folia.Metric'>, <class 'pynlpl.formats.folia.Alternative'>, <class 'pynlpl.formats.folia.Alternative'>, <class 'pynlpl.formats.folia.AlternativeLayers'>, <class 'pynlpl.formats.folia.AbstractAnnotationLayer'>, <class 'pynlpl.formats.folia.Correction'>, <class 'pynlpl.formats.folia.Part'>)
OCCURRENCES = 1
XMLTAG = 'caption'
class pynlpl.formats.folia.Cell(doc, *args, **kwargs)
ACCEPTED_DATA = (<class 'pynlpl.formats.folia.Paragraph'>, <class 'pynlpl.formats.folia.Head'>, <class 'pynlpl.formats.folia.Sentence'>, <class 'pynlpl.formats.folia.Word'>, <class 'pynlpl.formats.folia.Correction'>, <class 'pynlpl.formats.folia.Event'>, <class 'pynlpl.formats.folia.Note'>, <class 'pynlpl.formats.folia.Reference'>, <class 'pynlpl.formats.folia.Linebreak'>, <class 'pynlpl.formats.folia.Whitespace'>, <class 'pynlpl.formats.folia.Gap'>, <class 'pynlpl.formats.folia.AbstractAnnotationLayer'>, <class 'pynlpl.formats.folia.AlternativeLayers'>, <class 'pynlpl.formats.folia.AbstractExtendedTokenAnnotation'>, <class 'pynlpl.formats.folia.Correction'>, <class 'pynlpl.formats.folia.Part'>)
ANNOTATIONTYPE = 35
REQUIRED_ATTRIBS = ((),)
TEXTDELIMITER = ' | '
XMLTAG = 'cell'
class pynlpl.formats.folia.Chunk(doc, *args, **kwargs)

Chunk element, span annotation element to be used in ChunkingLayer

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.WordReference'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Feature'>, <class 'pynlpl.formats.folia.Metric'>)
ANNOTATIONTYPE = 14
REQUIRED_ATTRIBS = ()
XMLTAG = 'chunk'
class pynlpl.formats.folia.ChunkingLayer(doc, *args, **kwargs)

Chunking Layer: Annotation layer for Chunk span annotation elements

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.Chunk'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Correction'>)
ANNOTATIONTYPE = 14
XMLTAG = 'chunking'
class pynlpl.formats.folia.ClassDefinition(id, label, constraints=[], subclasses=[])
json()
classmethod parsexml(Class, node, constraintindex)
class pynlpl.formats.folia.ConstraintDefinition(id, restrictions={}, exceptions={})
json()
classmethod parsexml(Class, node, constraintindex)
class pynlpl.formats.folia.Content(doc, *args, **kwargs)
OCCURRENCES = 1
XMLTAG = 'content'
json(attribs=None, recurse=True)
classmethod parsexml(Class, node, doc)
classmethod relaxng(includechildren=True, extraattribs=None, extraelements=None)
xml(attribs=None, elements=None, skipchildren=False)
class pynlpl.formats.folia.CoreferenceChain(doc, *args, **kwargs)

Coreference chain. Consists of coreference links.

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.CoreferenceLink'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Metric'>)
ANNOTATIONTYPE = 30
REQUIRED_ATTRIBS = ()
XMLTAG = 'coreferencechain'
class pynlpl.formats.folia.CoreferenceLayer(doc, *args, **kwargs)

Coreference Layer: Annotation layer for CoreferenceChain span annotation elements

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.CoreferenceChain'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Correction'>)
ANNOTATIONTYPE = 30
XMLTAG = 'coreferences'

class pynlpl.formats.folia.CoreferenceLink(doc, *args, **kwargs)

Coreference link. Used in coreferencechain.

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.WordReference'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Headspan'>, <class 'pynlpl.formats.folia.Alignment'>, <class 'pynlpl.formats.folia.ModalityFeature'>, <class 'pynlpl.formats.folia.TimeFeature'>, <class 'pynlpl.formats.folia.LevelFeature'>, <class 'pynlpl.formats.folia.Metric'>)
ANNOTATIONTYPE = 30
OPTIONAL_ATTRIBS = (2, 4, 5)
REQUIRED_ATTRIBS = ()
ROOTELEMENT = False
XMLTAG = 'coreferencelink'
class pynlpl.formats.folia.Corpus(corpusdir, extension='xml', restrict_to_collection='', conditionf=<function Corpus.<lambda> at 0x7f4622c7a9d8>, ignoreerrors=False, **kwargs)

A corpus of various FoLiA documents. Yields a Document on each iteration. Suitable for sequential processing.

class pynlpl.formats.folia.CorpusFiles(corpusdir, extension='xml', restrict_to_collection='', conditionf=<function Corpus.<lambda> at 0x7f4622c7a9d8>, ignoreerrors=False, **kwargs)

A corpus of various FoLiA documents. Yields the filenames on each iteration.

class pynlpl.formats.folia.CorpusProcessor(corpusdir, function, threads=None, extension='xml', restrict_to_collection='', conditionf=<function CorpusProcessor.<lambda> at 0x7f4622c7abf8>, maxtasksperchild=100, preindex=False, ordered=True, chunksize=1)

Processes a corpus of various FoLiA documents using parallel processing. Calls a user-defined function with the three-tuple (filename, args, kwargs) for each file in the corpus. The user-defined function is itself responsible for instantiating a FoLiA document! args and kwargs, as received by the custom function, are set through the run() method, which yields the result of the custom function on each iteration.

execute()
run(*args, **kwargs)
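A user-defined function for this class therefore takes one argument, the three-tuple, and unpacks it. The sketch below shows that shape with a sequential stand-in for the dispatch loop; count_bytes and run_sequentially are hypothetical names, and the real processor distributes the calls over parallel workers instead.

```python
import os
import tempfile


def count_bytes(task):
    """Example worker: receives the (filename, args, kwargs) three-tuple."""
    filename, args, kwargs = task
    # A real worker would instantiate the FoLiA document from filename here.
    return filename, os.path.getsize(filename)


def run_sequentially(filenames, function, *args, **kwargs):
    """Sequential stand-in for the processor's dispatch loop (hypothetical)."""
    for filename in filenames:
        yield function((filename, args, kwargs))


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.xml")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("<FoLiA/>")
    results = list(run_sequentially([path], count_bytes))

print(results[0][1])  # 8
```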
class pynlpl.formats.folia.Correction(doc, *args, **kwargs)
ACCEPTED_DATA = (<class 'pynlpl.formats.folia.New'>, <class 'pynlpl.formats.folia.Original'>, <class 'pynlpl.formats.folia.Current'>, <class 'pynlpl.formats.folia.Suggestion'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Metric'>)
ANNOTATIONTYPE = 16
OCCURRENCESPERSET = 0
OPTIONAL_ATTRIBS = (0, 1, 2, 4, 3, 5)
PRINTABLE = True
REQUIRED_ATTRIBS = ()
ROOTELEMENT = True
TEXTDELIMITER = None
XMLTAG = 'correction'
append(child, *args, **kwargs)

See AbstractElement.append()

current(index=None)
gettextdelimiter(retaintokenisation=False)

May return a customised text delimiter instead of the default for this class.

hascurrent()
hasnew()
hasoriginal()
hassuggestions()
new(index=None)
original(index=None)
suggestions(index=None)
text(cls='current', retaintokenisation=False, previousdelimiter='')
textcontent(cls='current')

Get the text explicitly associated with this element (of the specified class). Returns the TextContent instance rather than the actual text. Raises a NoSuchText exception if not found.

Unlike text(), this method does not recurse into child elements (with the sole exception of the Correction/New element).

class pynlpl.formats.folia.Current(doc, *args, **kwargs)
OCCURRENCES = 1
OPTIONAL_ATTRIBS = ((),)
REQUIRED_ATTRIBS = ((),)
XMLTAG = 'current'
classmethod addable(Class, parent, set=None, raiseexceptions=True)
exception pynlpl.formats.folia.DeepValidationError
class pynlpl.formats.folia.DependenciesLayer(doc, *args, **kwargs)

Dependencies Layer: Annotation layer for Dependency span annotation elements. For dependency entities.

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.Dependency'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Correction'>)
ANNOTATIONTYPE = 24
XMLTAG = 'dependencies'
class pynlpl.formats.folia.Dependency(doc, *args, **kwargs)
ACCEPTED_DATA = (<class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Feature'>, <class 'pynlpl.formats.folia.Headspan'>, <class 'pynlpl.formats.folia.DependencyDependent'>, <class 'pynlpl.formats.folia.Alignment'>, <class 'pynlpl.formats.folia.Metric'>)
ANNOTATIONTYPE = 24
REQUIRED_ATTRIBS = ()
XMLTAG = 'dependency'
dependent()

Returns the dependent of the dependency relation, an instance of DependencyDependent.

head()

Returns the head of the dependency relation, an instance of DependencyHead.

class pynlpl.formats.folia.DependencyDependent(doc, *args, **kwargs)
ACCEPTED_DATA = (<class 'pynlpl.formats.folia.WordReference'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Feature'>, <class 'pynlpl.formats.folia.Alignment'>, <class 'pynlpl.formats.folia.Metric'>)
ANNOTATIONTYPE = 24
XMLTAG = 'dep'
pynlpl.formats.folia.DependencyHead

alias of Headspan

class pynlpl.formats.folia.Description(doc, *args, **kwargs)

Description is an element that can be used to associate a description with almost any other FoLiA element

OCCURRENCES = 1
XMLTAG = 'desc'
json(attribs=None, recurse=True)
classmethod parsexml(Class, node, doc)
classmethod relaxng(includechildren=True, extraattribs=None, extraelements=None)
xml(attribs=None, elements=None, skipchildren=False)
class pynlpl.formats.folia.Division(doc, *args, **kwargs)

Structure element representing some kind of division. Divisions may be nested at will, and may include almost all kinds of other structure elements.

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.Division'>, <class 'pynlpl.formats.folia.Quote'>, <class 'pynlpl.formats.folia.Gap'>, <class 'pynlpl.formats.folia.Event'>, <class 'pynlpl.formats.folia.Head'>, <class 'pynlpl.formats.folia.Paragraph'>, <class 'pynlpl.formats.folia.Sentence'>, <class 'pynlpl.formats.folia.List'>, <class 'pynlpl.formats.folia.Figure'>, <class 'pynlpl.formats.folia.Table'>, <class 'pynlpl.formats.folia.Note'>, <class 'pynlpl.formats.folia.Reference'>, <class 'pynlpl.formats.folia.AbstractExtendedTokenAnnotation'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Linebreak'>, <class 'pynlpl.formats.folia.Whitespace'>, <class 'pynlpl.formats.folia.Alternative'>, <class 'pynlpl.formats.folia.AlternativeLayers'>, <class 'pynlpl.formats.folia.AbstractAnnotationLayer'>, <class 'pynlpl.formats.folia.Correction'>, <class 'pynlpl.formats.folia.Part'>)
ANNOTATIONTYPE = 2
OPTIONAL_ATTRIBS = (1, 4)
REQUIRED_ATTRIBS = (0,)
TEXTDELIMITER = '\n\n\n'
XMLTAG = 'div'
head()
class pynlpl.formats.folia.Document(*args, **kwargs)

This is the FoLiA Document. All elements have to be associated with a FoLiA document. Besides holding elements, the document holds metadata, including declarations, and an index of all IDs.

IDSEPARATOR = '.'
append(text)

Add a text to the document:

Example 1::

    doc.append(folia.Text)

Example 2::

    doc.append(folia.Text(doc, id='example.text'))
count(Class, set=None)
create(Class, *args, **kwargs)

Create an element associated with this Document. This method may be obsolete and removed later.

date(value=None)

No arguments: get the document’s date from metadata. Argument: set the document’s date in metadata.

declare(annotationtype, set, **kwargs)
declared(annotationtype, set)
defaultannotator(annotationtype, set=None)
defaultannotatortype(annotationtype, set=None)
defaultdatetime(annotationtype, set=None)
defaultset(annotationtype)
findwords(*args, **kwargs)
items()

Returns a depth-first flat list of all items in the document

json()
jsondeclarations()
language(value=None)

No arguments: get the document’s language (ISO-639-3) from metadata. Argument: set the document’s language (ISO-639-3) in metadata.

license(value=None)

No arguments: get the document’s license from metadata. Argument: set the document’s license in metadata.

load(filename)

Load a FoLiA or D-Coi XML file

paragraphs(index=None)

Return a generator of all paragraphs found in the document.

If an index is specified, return the n’th paragraph only (starting at 0)

parsemetadata(node)
parsexml(node, ParentClass=None)

Main XML parser, will invoke class-specific XML parsers. For internal use.

parsexmldeclarations(node)
publisher(value=None)

No arguments: get the document’s publisher from metadata. Argument: set the document’s publisher in metadata.

save(filename=None)

Save the document to FoLiA XML.

Arguments:
  • filename=: The filename to save to. If not set (None), saves to the same file as loaded from.
select(Class, set=None, recursive=True, ignore=True)
sentences(index=None)

Return a generator of all sentences found in the document, except for sentences in quotes.

If an index is specified, return the n’th sentence only (starting at 0)

setcmdi(filename)
setimdi(node)
text(retaintokenisation=False)

Returns the text of the entire document (returns a unicode instance)

title(value=None)

No arguments: get the document’s title from metadata. Argument: set the document’s title in metadata.

words(index=None)

Return a generator of all active words found in the document. Does not descend into annotation layers, alternatives, originals, suggestions.

If an index is specified, return the n’th word only (starting at 0)

xml()
xmldeclarations()
xmlmetadata()
xmlstring()
xpath(query)

Run an XPath expression and parse the resulting elements. Don’t forget to use the FoLiA namespace in your expressions, using folia: or the short form f:
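
Why the namespace prefix matters can be seen with the standard library alone. The sketch below does not call the xpath() method documented here. It parses a tiny hand-written FoLiA-like snippet with xml.etree.ElementTree and compares an unqualified query with one that binds the prefix f to the FoLiA namespace. The sample document and variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="example">
  <text xml:id="example.text">
    <s xml:id="example.s.1">
      <w xml:id="example.s.1.w.1"><t>Hello</t></w>
      <w xml:id="example.s.1.w.2"><t>world</t></w>
    </s>
  </text>
</FoLiA>"""

root = ET.fromstring(SAMPLE)
namespaces = {"f": FOLIA_NS}

# Without a namespace prefix nothing matches: the elements are namespaced.
unqualified = root.findall(".//w")

# With the prefix bound to the FoLiA namespace, both words are found.
qualified = root.findall(".//f:w", namespaces)
word_ids = [w.get(XML_ID) for w in qualified]
word_texts = [w.find("f:t", namespaces).text for w in qualified]
```

The same principle applies to any XPath engine: every element name in the expression needs the prefix, not just the first step.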

class pynlpl.formats.folia.DomainAnnotation(doc, *args, **kwargs)

Domain annotation: an extended token annotation element

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.Feature'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Metric'>)
ANNOTATIONTYPE = 11
XMLTAG = 'domain'
exception pynlpl.formats.folia.DuplicateAnnotationError
exception pynlpl.formats.folia.DuplicateIDError

Exception raised when an identifier that is already in use is assigned again to another element

class pynlpl.formats.folia.EnddatetimeFeature(doc, *args, **kwargs)

Enddatetime feature, to be used within Event

SUBSET = 'enddatetime'
XMLTAG = None
class pynlpl.formats.folia.EntitiesLayer(doc, *args, **kwargs)

Entities Layer: Annotation layer for Entity span annotation elements. For named entities.

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.Entity'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Correction'>)
ANNOTATIONTYPE = 15
XMLTAG = 'entities'
class pynlpl.formats.folia.Entity(doc, *args, **kwargs)

Entity element, for named entities, span annotation element to be used in EntitiesLayer

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.WordReference'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Feature'>, <class 'pynlpl.formats.folia.Metric'>)
ANNOTATIONTYPE = 15
REQUIRED_ATTRIBS = ()
XMLTAG = 'entity'
class pynlpl.formats.folia.ErrorDetection(doc, *args, **kwargs)
ANNOTATIONTYPE = 18
OCCURRENCESPERSET = 0
ROOTELEMENT = True
XMLTAG = 'errordetection'
class pynlpl.formats.folia.Event(doc, *args, **kwargs)
ACCEPTED_DATA = (<class 'pynlpl.formats.folia.Event'>, <class 'pynlpl.formats.folia.Paragraph'>, <class 'pynlpl.formats.folia.Sentence'>, <class 'pynlpl.formats.folia.Division'>, <class 'pynlpl.formats.folia.Word'>, <class 'pynlpl.formats.folia.Head'>, <class 'pynlpl.formats.folia.List'>, <class 'pynlpl.formats.folia.Figure'>, <class 'pynlpl.formats.folia.Table'>, <class 'pynlpl.formats.folia.Reference'>, <class 'pynlpl.formats.folia.Feature'>, <class 'pynlpl.formats.folia.ActorFeature'>, <class 'pynlpl.formats.folia.BegindatetimeFeature'>, <class 'pynlpl.formats.folia.EnddatetimeFeature'>, <class 'pynlpl.formats.folia.TextContent'>, <class 'pynlpl.formats.folia.String'>, <class 'pynlpl.formats.folia.Metric'>, <class 'pynlpl.formats.folia.AbstractExtendedTokenAnnotation'>, <class 'pynlpl.formats.folia.Correction'>, <class 'pynlpl.formats.folia.Part'>)
ANNOTATIONTYPE = 23
OCCURRENCESPERSET = 0
XMLTAG = 'event'
class pynlpl.formats.folia.External(doc, *args, **kwargs)
ACCEPTED_DATA = []
AUTH = True
OPTIONAL_ATTRIBS = ()
PRINTABLE = True
REQUIRED_ATTRIBS = ()
XMLTAG = 'external'
classmethod parsexml(Class, node, doc)
classmethod relaxng(includechildren=True, extraattribs=None, extraelements=None)
select(Class, set=None, recursive=True, ignore=True, node=None)
xml(attribs=None, elements=None, skipchildren=False)
class pynlpl.formats.folia.Feature(doc, *args, **kwargs)

Feature elements can be used to associate subsets and subclasses with almost any annotation element

OCCURRENCESPERSET = 0
SUBSET = None
XMLTAG = 'feat'
json(attribs=None, recurse=True)
classmethod relaxng(includechildren=True, extraattribs=None, extraelements=None)
xml()
class pynlpl.formats.folia.Figure(doc, *args, **kwargs)

Element for the representation of a graphical figure. Structure element.

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.Sentence'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Caption'>, <class 'pynlpl.formats.folia.TextContent'>, <class 'pynlpl.formats.folia.String'>, <class 'pynlpl.formats.folia.Alignment'>, <class 'pynlpl.formats.folia.Metric'>, <class 'pynlpl.formats.folia.Alternative'>, <class 'pynlpl.formats.folia.Alternative'>, <class 'pynlpl.formats.folia.AlternativeLayers'>, <class 'pynlpl.formats.folia.AbstractAnnotationLayer'>, <class 'pynlpl.formats.folia.Correction'>, <class 'pynlpl.formats.folia.Part'>)
ANNOTATIONTYPE = 5
XMLTAG = 'figure'
caption()
json(attribs=None, recurse=True)
classmethod relaxng(includechildren=True, extraattribs=None, extraelements=None)
xml(attribs=None, elements=None, skipchildren=False)
class pynlpl.formats.folia.FunctionFeature(doc, *args, **kwargs)

Function feature, to be used with morphemes

SUBSET = 'function'
XMLTAG = None
class pynlpl.formats.folia.Gap(doc, *args, **kwargs)

Gap element. Represents skipped portions of the text. Contains Content and Desc elements

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.Content'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Part'>)
ANNOTATIONTYPE = 26
OPTIONAL_ATTRIBS = (0, 1, 2, 3, 4)
XMLTAG = 'gap'
content()
class pynlpl.formats.folia.Head(doc, *args, **kwargs)

Head element. A structure element. Acts as the header/title of a division. There may be one per division. Contains sentences.

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.Sentence'>, <class 'pynlpl.formats.folia.Word'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Event'>, <class 'pynlpl.formats.folia.Reference'>, <class 'pynlpl.formats.folia.TextContent'>, <class 'pynlpl.formats.folia.String'>, <class 'pynlpl.formats.folia.Alignment'>, <class 'pynlpl.formats.folia.Metric'>, <class 'pynlpl.formats.folia.Linebreak'>, <class 'pynlpl.formats.folia.Whitespace'>, <class 'pynlpl.formats.folia.Gap'>, <class 'pynlpl.formats.folia.Alternative'>, <class 'pynlpl.formats.folia.AlternativeLayers'>, <class 'pynlpl.formats.folia.AbstractAnnotationLayer'>, <class 'pynlpl.formats.folia.AbstractExtendedTokenAnnotation'>, <class 'pynlpl.formats.folia.Correction'>, <class 'pynlpl.formats.folia.Part'>)
OCCURRENCES = 1
TEXTDELIMITER = ' '
XMLTAG = 'head'
class pynlpl.formats.folia.HeadFeature(doc, *args, **kwargs)

Head feature, to be used within PosAnnotation

SUBSET = 'head'
XMLTAG = None
class pynlpl.formats.folia.Headspan(doc, *args, **kwargs)
ACCEPTED_DATA = (<class 'pynlpl.formats.folia.WordReference'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Feature'>, <class 'pynlpl.formats.folia.Alignment'>, <class 'pynlpl.formats.folia.Metric'>)
XMLTAG = 'hd'
class pynlpl.formats.folia.Label(doc, *args, **kwargs)

Element used for labels. Mostly used within list items. Contains words.

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.Word'>, <class 'pynlpl.formats.folia.Reference'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.TextContent'>, <class 'pynlpl.formats.folia.String'>, <class 'pynlpl.formats.folia.Alignment'>, <class 'pynlpl.formats.folia.Metric'>, <class 'pynlpl.formats.folia.Alternative'>, <class 'pynlpl.formats.folia.Alternative'>, <class 'pynlpl.formats.folia.AlternativeLayers'>, <class 'pynlpl.formats.folia.AbstractAnnotationLayer'>, <class 'pynlpl.formats.folia.AbstractExtendedTokenAnnotation'>, <class 'pynlpl.formats.folia.Correction'>, <class 'pynlpl.formats.folia.Part'>)
XMLTAG = 'label'
class pynlpl.formats.folia.LangAnnotation(doc, *args, **kwargs)

Language annotation: an extended token annotation element

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.Feature'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Metric'>)
ANNOTATIONTYPE = 33
XMLTAG = 'lang'
class pynlpl.formats.folia.LemmaAnnotation(doc, *args, **kwargs)

Lemma annotation: a token annotation element

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.Feature'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Metric'>)
ANNOTATIONTYPE = 10
XMLTAG = 'lemma'
class pynlpl.formats.folia.LevelFeature(doc, *args, **kwargs)

Level feature, to be used with coreferences

SUBSET = 'level'
XMLTAG = None
class pynlpl.formats.folia.Linebreak(doc, *args, **kwargs)

Line break element, signals a line break

ACCEPTED_DATA = ()
ANNOTATIONTYPE = 7
REQUIRED_ATTRIBS = ()
TEXTDELIMITER = '\n'
XMLTAG = 'br'
class pynlpl.formats.folia.List(doc, *args, **kwargs)

Element for enumeration/itemisation. Structure element. Contains ListItem elements.

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.ListItem'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Caption'>, <class 'pynlpl.formats.folia.Event'>, <class 'pynlpl.formats.folia.Note'>, <class 'pynlpl.formats.folia.Reference'>, <class 'pynlpl.formats.folia.TextContent'>, <class 'pynlpl.formats.folia.String'>, <class 'pynlpl.formats.folia.Alignment'>, <class 'pynlpl.formats.folia.Metric'>, <class 'pynlpl.formats.folia.Alternative'>, <class 'pynlpl.formats.folia.Alternative'>, <class 'pynlpl.formats.folia.AlternativeLayers'>, <class 'pynlpl.formats.folia.AbstractAnnotationLayer'>, <class 'pynlpl.formats.folia.AbstractExtendedTokenAnnotation'>, <class 'pynlpl.formats.folia.Correction'>, <class 'pynlpl.formats.folia.Part'>)
ANNOTATIONTYPE = 4
TEXTDELIMITER = '\n'
XMLTAG = 'list'
class pynlpl.formats.folia.ListItem(doc, *args, **kwargs)

Single element in a List. Structure element. Contained within List element.

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.List'>, <class 'pynlpl.formats.folia.Sentence'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Label'>, <class 'pynlpl.formats.folia.Event'>, <class 'pynlpl.formats.folia.Note'>, <class 'pynlpl.formats.folia.Reference'>, <class 'pynlpl.formats.folia.TextContent'>, <class 'pynlpl.formats.folia.String'>, <class 'pynlpl.formats.folia.Gap'>, <class 'pynlpl.formats.folia.Alignment'>, <class 'pynlpl.formats.folia.Metric'>, <class 'pynlpl.formats.folia.Alternative'>, <class 'pynlpl.formats.folia.AlternativeLayers'>, <class 'pynlpl.formats.folia.AbstractAnnotationLayer'>, <class 'pynlpl.formats.folia.AbstractExtendedTokenAnnotation'>, <class 'pynlpl.formats.folia.Correction'>, <class 'pynlpl.formats.folia.Part'>)
ANNOTATIONTYPE = 4
XMLTAG = 'item'
exception pynlpl.formats.folia.MalformedXMLError
class pynlpl.formats.folia.MetaDataType
CMDI = 1
IMDI = 2
NATIVE = 0
class pynlpl.formats.folia.Metric(doc, *args, **kwargs)

Metric elements allow the annotation of any kind of metric with any kind of annotation element, allowing for example statistical measures to be added to elements as annotation.

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.Feature'>, <class 'pynlpl.formats.folia.ValueFeature'>, <class 'pynlpl.formats.folia.Description'>)
ANNOTATIONTYPE = 32
OPTIONAL_ATTRIBS = (0, 1, 2, 4, 3, 5)
REQUIRED_ATTRIB = (1,)
XMLTAG = 'metric'
class pynlpl.formats.folia.ModalityFeature(doc, *args, **kwargs)

Modality feature, to be used with coreferences

SUBSET = 'modality'
XMLTAG = None
class pynlpl.formats.folia.Mode
ITERATIVE = 2
MEMORY = 0
XPATH = 1
exception pynlpl.formats.folia.ModeError
class pynlpl.formats.folia.Morpheme(doc, *args, **kwargs)

Morpheme element, represents one morpheme in morphological analysis, subtoken annotation element to be used in MorphologyLayer

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.FunctionFeature'>, <class 'pynlpl.formats.folia.Feature'>, <class 'pynlpl.formats.folia.TextContent'>, <class 'pynlpl.formats.folia.String'>, <class 'pynlpl.formats.folia.Metric'>, <class 'pynlpl.formats.folia.Alignment'>, <class 'pynlpl.formats.folia.AbstractTokenAnnotation'>, <class 'pynlpl.formats.folia.Correction'>, <class 'pynlpl.formats.folia.Description'>)
ANNOTATIONTYPE = 22
OPTIONAL_ATTRIBS = (0, 1, 2, 4, 3, 5)
REQUIRED_ATTRIBS = ((),)
XMLTAG = 'morpheme'
findspans(type, set=None)

Find span annotation of the specified type that include this word

class pynlpl.formats.folia.MorphologyLayer(doc, *args, **kwargs)

Morphology Layer: Annotation layer for Morpheme subtoken annotation elements. For morphological analysis.

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.Morpheme'>, <class 'pynlpl.formats.folia.Correction'>)
ANNOTATIONTYPE = 22
XMLTAG = 'morphology'
class pynlpl.formats.folia.NativeMetaData(*args, **kwargs)
items()
class pynlpl.formats.folia.New(doc, *args, **kwargs)
OCCURRENCES = 1
OPTIONAL_ATTRIBS = ((),)
REQUIRED_ATTRIBS = ((),)
XMLTAG = 'new'
classmethod addable(Class, parent, set=None, raiseexceptions=True)
exception pynlpl.formats.folia.NoDefaultError
exception pynlpl.formats.folia.NoDescription
exception pynlpl.formats.folia.NoSuchAnnotation

Exception raised when the requested type of annotation does not exist for the selected element

exception pynlpl.formats.folia.NoSuchText

Exception raised when the requested type of text content does not exist for the selected element

class pynlpl.formats.folia.Note(doc, *args, **kwargs)
ACCEPTED_DATA = (<class 'pynlpl.formats.folia.Paragraph'>, <class 'pynlpl.formats.folia.Sentence'>, <class 'pynlpl.formats.folia.Word'>, <class 'pynlpl.formats.folia.Head'>, <class 'pynlpl.formats.folia.List'>, <class 'pynlpl.formats.folia.Figure'>, <class 'pynlpl.formats.folia.Table'>, <class 'pynlpl.formats.folia.Reference'>, <class 'pynlpl.formats.folia.Feature'>, <class 'pynlpl.formats.folia.TextContent'>, <class 'pynlpl.formats.folia.String'>, <class 'pynlpl.formats.folia.Metric'>, <class 'pynlpl.formats.folia.AbstractExtendedTokenAnnotation'>, <class 'pynlpl.formats.folia.Correction'>, <class 'pynlpl.formats.folia.Part'>)
ANNOTATIONTYPE = 27
OCCURRENCESPERSET = 0
XMLTAG = 'note'
class pynlpl.formats.folia.Original(doc, *args, **kwargs)
ACCEPTED_DATA = (<class 'pynlpl.formats.folia.AbstractTokenAnnotation'>, <class 'pynlpl.formats.folia.AbstractSpanAnnotation'>, <class 'pynlpl.formats.folia.Word'>, <class 'pynlpl.formats.folia.TextContent'>, <class 'pynlpl.formats.folia.String'>, <class 'pynlpl.formats.folia.Correction'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Metric'>)
AUTH = False
OCCURRENCES = 1
OPTIONAL_ATTRIBS = ((),)
REQUIRED_ATTRIBS = ((),)
XMLTAG = 'original'
classmethod addable(Class, parent, set=None, raiseexceptions=True)
class pynlpl.formats.folia.Paragraph(doc, *args, **kwargs)

Paragraph element. A structure element. Represents a paragraph and holds all its sentences (and possibly other structure such as Whitespace and Quotes).

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.Sentence'>, <class 'pynlpl.formats.folia.Quote'>, <class 'pynlpl.formats.folia.AbstractExtendedTokenAnnotation'>, <class 'pynlpl.formats.folia.Correction'>, <class 'pynlpl.formats.folia.TextContent'>, <class 'pynlpl.formats.folia.String'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Linebreak'>, <class 'pynlpl.formats.folia.Whitespace'>, <class 'pynlpl.formats.folia.Gap'>, <class 'pynlpl.formats.folia.List'>, <class 'pynlpl.formats.folia.Figure'>, <class 'pynlpl.formats.folia.Event'>, <class 'pynlpl.formats.folia.Head'>, <class 'pynlpl.formats.folia.Note'>, <class 'pynlpl.formats.folia.Reference'>, <class 'pynlpl.formats.folia.Alignment'>, <class 'pynlpl.formats.folia.Metric'>, <class 'pynlpl.formats.folia.Alternative'>, <class 'pynlpl.formats.folia.AlternativeLayers'>, <class 'pynlpl.formats.folia.AbstractAnnotationLayer'>, <class 'pynlpl.formats.folia.Part'>)
ANNOTATIONTYPE = 3
TEXTDELIMITER = '\n\n'
XMLTAG = 'p'
class pynlpl.formats.folia.Part(doc, *args, **kwargs)
ACCEPTED_DATA = (<class 'pynlpl.formats.folia.AbstractStructureElement'>, <class 'pynlpl.formats.folia.AbstractExtendedTokenAnnotation'>, <class 'pynlpl.formats.folia.AlternativeLayers'>, <class 'pynlpl.formats.folia.AbstractAnnotationLayer'>, <class 'pynlpl.formats.folia.Correction'>)
ANNOTATIONTYPE = 37
XMLTAG = 'part'
class pynlpl.formats.folia.Pattern(*args, **kwargs)
This class describes a pattern over words to be searched for. The Document.findwords() method can subsequently be called with this pattern, and it will return all the words that match. An example will best illustrate this. First, a trivial example of searching for one word::

    for match in doc.findwords( folia.Pattern('house') ):
        for word in match:
            print(word.id)
        print("----")

The same can be done for a sequence::

    for match in doc.findwords( folia.Pattern('a','big', 'house') ):
        for word in match:
            print(word.id)
        print("----")

The boolean value ``True`` acts as a wildcard, matching any word::

    for match in doc.findwords( folia.Pattern('a',True,'house') ):
        for word in match:
            print(word.id, word.text())
        print("----")

Alternatively, and more constraining, you may also specify a tuple of alternatives::

    for match in doc.findwords( folia.Pattern('a',('big','small'),'house') ):
        for word in match:
            print(word.id, word.text())
        print("----")

Or even a regular expression using the ``folia.RegExp`` class::


    for match in doc.findwords( folia.Pattern('a', folia.RegExp('b?g'),'house') ):
        for word in match:
            print(word.id, word.text())
        print("----")


Rather than searching on the text content of the words, you can search on the
classes of any kind of token annotation using the keyword argument
``matchannotation=``::

    for match in doc.findwords( folia.Pattern('det','adj','noun',matchannotation=folia.PosAnnotation ) ):
        for word in match:
            print(word.id, word.text())
        print("----")

The set can be restricted by adding the additional keyword argument
``matchannotationset=``. Case sensitivity, by default disabled, can be enabled by setting ``casesensitive=True``.

Things become even more interesting when different Patterns are combined. A
match will have to satisfy all patterns::

    for match in doc.findwords( folia.Pattern('a', True, 'house'), folia.Pattern('det','adj','noun',matchannotation=folia.PosAnnotation ) ):
        for word in match:
            print(word.id, word.text())
        print("----")


The ``findwords()`` method can be instructed to also return left and/or right context for any match. This is done using the ``leftcontext=`` and ``rightcontext=`` keyword arguments, each of which takes the number of context words to include in each match as an integer. For instance, we can look for the word house and return its immediate neighbours as follows::

    for match in doc.findwords( folia.Pattern('house') , leftcontext=1, rightcontext=1):
        for word in match:
            print(word.id)
        print("----")

A match here would thus always consist of three words instead of just one.

Last, ``Pattern`` also has support for variable-width gaps, the asterisk symbol
has special meaning to this end::


    for match in doc.findwords( folia.Pattern('a','*','house') ):
        for word in match:
            print(word.id)
        print("----")

Unlike the pattern ``('a',True,'house')``, which by definition is a pattern of
three words, the pattern in the example above will match gaps of any length (up
to a certain built-in maximum), so this might include matches such as *a very
nice house*.

Some remarks on these methods of querying are in order. These searches are
pretty exhaustive and are done by simply iterating over all the words in the
document. The entire document is loaded in memory and no special indices are involved.
For single documents this is okay, but when iterating over a corpus of
thousands of documents, this method is too slow, especially for real-time
applications. For huge corpora, clever indexing and database management systems
will be required. This however is beyond the scope of this library.
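
To make the matching rules described above concrete, here is a self-contained sketch of fixed-width matching over a plain list of strings. It is an illustration only, not the library's implementation. The function names are hypothetical, tokens are plain strings rather than word elements, regular expressions are assumed to be compiled ``re`` patterns applied with ``fullmatch``, and matching is case-insensitive to mirror the documented default.

```python
import re


def element_matches(element, token):
    """Check one pattern element against one token (case-insensitive)."""
    if element is True:                      # wildcard: any single word
        return True
    if isinstance(element, tuple):           # tuple of alternatives
        return token.lower() in (alt.lower() for alt in element)
    if isinstance(element, re.Pattern):      # compiled regular expression
        return element.fullmatch(token) is not None
    return token.lower() == element.lower()  # literal word


def find_matches(tokens, pattern):
    """Yield (start, window) for every position where the pattern fits."""
    width = len(pattern)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(element_matches(e, t) for e, t in zip(pattern, window)):
            yield start, window


tokens = "He bought a big house and a small house".split()
wildcard = list(find_matches(tokens, ("a", True, "house")))
alternatives = list(find_matches(tokens, ("a", ("big", "tiny"), "house")))
regex = list(find_matches(tokens, ("a", re.compile(r"s.*l"), "house")))
```

The sliding-window loop also shows why such searches are exhaustive: every start position in the token list is tried once per pattern.
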
resolve(size, distribution)

Resolve a variable sized pattern to all patterns of a certain fixed size
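
One way to picture this kind of resolution is sketched below: each ``*`` is replaced by a run of ``True`` wildcards so that the total length equals the requested size. The function name resolve_gaps, the minimum gap width of 1 and the max_gap limit are assumptions for illustration. The actual method's handling of its distribution argument may differ.

```python
from itertools import product


def resolve_gaps(sequence, size, max_gap=4):
    """Yield every tuple of length `size` obtained by replacing each '*'
    in `sequence` with between 1 and max_gap True wildcards."""
    gaps = sequence.count("*")
    fixed = len(sequence) - gaps  # number of non-gap elements
    for distribution in product(range(1, max_gap + 1), repeat=gaps):
        if fixed + sum(distribution) != size:
            continue  # this combination of gap widths gives the wrong length
        widths = iter(distribution)
        resolved = []
        for element in sequence:
            if element == "*":
                resolved.extend([True] * next(widths))
            else:
                resolved.append(element)
        yield tuple(resolved)


single = list(resolve_gaps(("a", "*", "house"), 4))
double = list(resolve_gaps(("a", "*", "big", "*", "house"), 6))
```

Each resolved tuple is an ordinary fixed-width pattern, so it can be matched with the same logic as a pattern that uses ``True`` directly.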

variablesize()
variablewildcards()
class pynlpl.formats.folia.PosAnnotation(doc, *args, **kwargs)

Part-of-Speech annotation: a token annotation element

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.Feature'>, <class 'pynlpl.formats.folia.HeadFeature'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Metric'>)
ANNOTATIONTYPE = 9
XMLTAG = 'pos'
class pynlpl.formats.folia.Query(files, expression)

An XPath query on one or more FoLiA documents

class pynlpl.formats.folia.Quote(doc, *args, **kwargs)

Quote: a structure element. For quotes/citations. May hold words, sentences or paragraphs.

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.Word'>, <class 'pynlpl.formats.folia.Sentence'>, <class 'pynlpl.formats.folia.Paragraph'>, <class 'pynlpl.formats.folia.Division'>, <class 'pynlpl.formats.folia.Quote'>, <class 'pynlpl.formats.folia.TextContent'>, <class 'pynlpl.formats.folia.String'>, <class 'pynlpl.formats.folia.Gap'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Alignment'>, <class 'pynlpl.formats.folia.Metric'>, <class 'pynlpl.formats.folia.Alternative'>, <class 'pynlpl.formats.folia.AlternativeLayers'>, <class 'pynlpl.formats.folia.AbstractAnnotationLayer'>, <class 'pynlpl.formats.folia.Correction'>, <class 'pynlpl.formats.folia.Part'>)
REQUIRED_ATTRIBS = ()
XMLTAG = 'quote'
append(child, *args, **kwargs)
gettextdelimiter(retaintokenisation=False)
resolveword(id)
class pynlpl.formats.folia.Reader(filename, target, *args, **kwargs)

Streaming FoLiA reader. The reader allows you to read a FoLiA document without holding the whole tree structure in memory. The document is read and the elements you seek are returned as they are found. If you are querying a corpus of large FoLiA documents for a specific structure, it is strongly recommended to use the Reader rather than the standard Document.
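
The general idea of streaming can be sketched with the standard library's incremental parser. The example below does not use the Reader class. It uses xml.etree.ElementTree.iterparse on a small hand-written FoLiA-like document and yields word text as each word element is completed. The function name stream_words and the sample content are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

FOLIA_NS = "{http://ilk.uvt.nl/folia}"


def stream_words(source):
    """Yield the text of each <w> element as soon as it has been parsed."""
    for event, element in ET.iterparse(source, events=("end",)):
        if element.tag == FOLIA_NS + "w":
            text = element.find(FOLIA_NS + "t")
            yield text.text if text is not None else None
            element.clear()  # release the children of the processed word


SAMPLE = b"""<FoLiA xmlns="http://ilk.uvt.nl/folia">
  <text>
    <s>
      <w><t>Streaming</t></w>
      <w><t>works</t></w>
    </s>
  </text>
</FoLiA>"""

words = list(stream_words(io.BytesIO(SAMPLE)))
```

Because results are yielded while parsing is still in progress, a consumer can stop early without reading the rest of the file.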

findwords(*args, **kwargs)
initdoc()
openstream(filename)
class pynlpl.formats.folia.Reference(doc, *args, **kwargs)
ACCEPTED_DATA = (<class 'pynlpl.formats.folia.TextContent'>, <class 'pynlpl.formats.folia.String'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Metric'>)
OPTIONAL_ATTRIBS = (0, 2, 3, 5)
PRINTABLE = True
REQUIRED_ATTRIBS = ()
XMLTAG = 'ref'
classmethod parsexml(Class, node, doc)
classmethod relaxng(includechildren=True, extraattribs=None, extraelements=None)
resolve()
xml(attribs=None, elements=None, skipchildren=False)
class pynlpl.formats.folia.RegExp(regexp)
class pynlpl.formats.folia.Row(doc, *args, **kwargs)
ACCEPTED_DATA = (<class 'pynlpl.formats.folia.Cell'>, <class 'pynlpl.formats.folia.AbstractAnnotationLayer'>, <class 'pynlpl.formats.folia.AlternativeLayers'>, <class 'pynlpl.formats.folia.AbstractExtendedTokenAnnotation'>, <class 'pynlpl.formats.folia.Correction'>, <class 'pynlpl.formats.folia.Part'>)
ANNOTATIONTYPE = 35
REQUIRED_ATTRIBS = ((),)
TEXTDELIMITER = '\n'
XMLTAG = 'row'
class pynlpl.formats.folia.SemanticRole(doc, *args, **kwargs)

Semantic Role

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.WordReference'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Headspan'>, <class 'pynlpl.formats.folia.Alignment'>, <class 'pynlpl.formats.folia.Metric'>)
ANNOTATIONTYPE = 31
REQUIRED_ATTRIBS = (1,)
XMLTAG = 'semrole'
class pynlpl.formats.folia.SemanticRolesLayer(doc, *args, **kwargs)

Semantic Roles Layer: Annotation layer for SemanticRole span annotation elements

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.SemanticRole'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Correction'>)
ANNOTATIONTYPE = 31
XMLTAG = 'semroles'
class pynlpl.formats.folia.SenseAnnotation(doc, *args, **kwargs)

Sense annotation: a token annotation element

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.Feature'>, <class 'pynlpl.formats.folia.SynsetFeature'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Metric'>)
ANNOTATIONTYPE = 12
XMLTAG = 'sense'
class pynlpl.formats.folia.Sentence(doc, *args, **kwargs)

Sentence element. A structure element. Represents a sentence and holds all its words (and possibly other structure such as LineBreaks, Whitespace and Quotes)

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.Word'>, <class 'pynlpl.formats.folia.Quote'>, <class 'pynlpl.formats.folia.AbstractExtendedTokenAnnotation'>, <class 'pynlpl.formats.folia.Correction'>, <class 'pynlpl.formats.folia.TextContent'>, <class 'pynlpl.formats.folia.String'>, <class 'pynlpl.formats.folia.Gap'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Linebreak'>, <class 'pynlpl.formats.folia.Whitespace'>, <class 'pynlpl.formats.folia.Event'>, <class 'pynlpl.formats.folia.Note'>, <class 'pynlpl.formats.folia.Reference'>, <class 'pynlpl.formats.folia.Alignment'>, <class 'pynlpl.formats.folia.Metric'>, <class 'pynlpl.formats.folia.Alternative'>, <class 'pynlpl.formats.folia.AlternativeLayers'>, <class 'pynlpl.formats.folia.AbstractAnnotationLayer'>, <class 'pynlpl.formats.folia.Part'>)
ANNOTATIONTYPE = 8
TEXTDELIMITER = ' '
XMLTAG = 's'
corrections()

Are there corrections in this sentence?

correctwords(originalwords, newwords, **kwargs)

Generic correction method for words. You most likely want to use the helper functions splitword(), mergewords(), deleteword() and insertword() instead.

deleteword(word, **kwargs)

TODO: Write documentation

division()

Obtain the division this sentence is a part of (None otherwise)

insertword(newword, prevword, **kwargs)
insertwordleft(newword, nextword, **kwargs)
mergewords(newword, *originalwords, **kwargs)

TODO: Write documentation

paragraph()

Obtain the paragraph this sentence is a part of (None otherwise)

resolveword(id)
splitword(originalword, *newwords, **kwargs)

TODO: Write documentation

class pynlpl.formats.folia.SetDefinition(id, type, classes=[], subsets=[], constraintindex={})
json()
classmethod parsexml(Class, node)
testclass(cls)
testsubclass(cls, subset, subclass)
exception pynlpl.formats.folia.SetDefinitionError
class pynlpl.formats.folia.SetType
CLOSED = 0
MIXED = 2
OPEN = 1
class pynlpl.formats.folia.String(doc, *args, **kwargs)

String

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.TextContent'>, <class 'pynlpl.formats.folia.Alignment'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Metric'>, <class 'pynlpl.formats.folia.Correction'>, <class 'pynlpl.formats.folia.AbstractExtendedTokenAnnotation'>)
ANNOTATIONTYPE = 34
OCCURRENCES = 0
OCCURRENCESPERSET = 0
OPTIONAL_ATTRIBS = (0, 1, 2, 3, 5)
PRINTABLE = True
REQUIRED_ATTRIBS = ()
XMLTAG = 'str'
class pynlpl.formats.folia.StyleFeature(doc, *args, **kwargs)
SUBSET = 'style'
XMLTAG = None
class pynlpl.formats.folia.SubjectivityAnnotation(doc, *args, **kwargs)

Subjectivity annotation/Sentiment analysis: a token annotation element

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.Feature'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Metric'>)
ANNOTATIONTYPE = 21
XMLTAG = 'subjectivity'
class pynlpl.formats.folia.SubsetDefinition(id, type, classes=[], constraints=[])
json()
classmethod parsexml(Class, node, constraintindex={})
class pynlpl.formats.folia.Suggestion(doc, *args, **kwargs)
ANNOTATIONTYPE = 17
AUTH = False
OCCURRENCES = 0
OCCURRENCESPERSET = 0
XMLTAG = 'suggestion'
class pynlpl.formats.folia.SynsetFeature(doc, *args, **kwargs)

Synset feature, to be used within Sense

SUBSET = 'synset'
XMLTAG = None
class pynlpl.formats.folia.SyntacticUnit(doc, *args, **kwargs)

Syntactic Unit, span annotation element to be used in SyntaxLayer

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.SyntacticUnit'>, <class 'pynlpl.formats.folia.WordReference'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Feature'>, <class 'pynlpl.formats.folia.Metric'>)
ANNOTATIONTYPE = 13
REQUIRED_ATTRIBS = ()
XMLTAG = 'su'
class pynlpl.formats.folia.SyntaxLayer(doc, *args, **kwargs)

Syntax Layer: Annotation layer for SyntacticUnit span annotation elements

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.SyntacticUnit'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Correction'>)
ANNOTATIONTYPE = 13
XMLTAG = 'syntax'
class pynlpl.formats.folia.Table(doc, *args, **kwargs)
ACCEPTED_DATA = (<class 'pynlpl.formats.folia.TableHead'>, <class 'pynlpl.formats.folia.Row'>, <class 'pynlpl.formats.folia.AbstractAnnotationLayer'>, <class 'pynlpl.formats.folia.AlternativeLayers'>, <class 'pynlpl.formats.folia.AbstractExtendedTokenAnnotation'>, <class 'pynlpl.formats.folia.Correction'>, <class 'pynlpl.formats.folia.Part'>)
ANNOTATIONTYPE = 35
XMLTAG = 'table'
class pynlpl.formats.folia.TableHead(doc, *args, **kwargs)
ACCEPTED_DATA = (<class 'pynlpl.formats.folia.Row'>, <class 'pynlpl.formats.folia.AbstractAnnotationLayer'>, <class 'pynlpl.formats.folia.AlternativeLayers'>, <class 'pynlpl.formats.folia.AbstractExtendedTokenAnnotation'>, <class 'pynlpl.formats.folia.Correction'>, <class 'pynlpl.formats.folia.Part'>)
ANNOTATIONTYPE = 35
REQUIRED_ATTRIBS = ((),)
XMLTAG = 'tablehead'
class pynlpl.formats.folia.Text(doc, *args, **kwargs)

A full text. This is a high-level element (not to be confused with TextContent!). This element may contain divisions, paragraphs, sentences, etc.

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.Gap'>, <class 'pynlpl.formats.folia.Event'>, <class 'pynlpl.formats.folia.Division'>, <class 'pynlpl.formats.folia.Paragraph'>, <class 'pynlpl.formats.folia.Quote'>, <class 'pynlpl.formats.folia.Sentence'>, <class 'pynlpl.formats.folia.Word'>, <class 'pynlpl.formats.folia.List'>, <class 'pynlpl.formats.folia.Figure'>, <class 'pynlpl.formats.folia.Table'>, <class 'pynlpl.formats.folia.Note'>, <class 'pynlpl.formats.folia.Reference'>, <class 'pynlpl.formats.folia.AbstractAnnotationLayer'>, <class 'pynlpl.formats.folia.AbstractExtendedTokenAnnotation'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.TextContent'>, <class 'pynlpl.formats.folia.String'>, <class 'pynlpl.formats.folia.Metric'>, <class 'pynlpl.formats.folia.Correction'>)
OPTIONAL_ATTRIBS = (4,)
REQUIRED_ATTRIBS = (0,)
TEXTDELIMITER = '\n\n\n'
XMLTAG = 'text'
class pynlpl.formats.folia.TextContent(doc, *args, **kwargs)

Text content element (t), holds text to be associated with whatever element the text content element is a child of.

Text content elements on structure elements such as Paragraph and Sentence are by definition untokenised. Only at the Word level and deeper are they by definition tokenised.

Text content elements can specify offsets that refer to text at a higher parent level. Use the following keyword arguments:
  • ref=: The instance to point to, this points to the element holding the text content element, not the text content element itself.
  • offset=: The offset where this text is found, offsets start at 0
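
To make the offset semantics concrete, here is a minimal plain-Python sketch that does not use the library at all. It only illustrates what it means for a zero-based offset to be consistent with the text of the referenced parent element. The helper name offset_is_valid and the sample sentence are hypothetical, and the sketch assumes that offsets count characters.

```python
def offset_is_valid(parent_text, child_text, offset):
    """Return True if child_text occurs in parent_text at exactly this offset.

    Hypothetical helper for illustration only; it is not part of the library.
    Assumption: offsets are zero-based character positions.
    """
    if offset < 0:
        return False
    return parent_text[offset:offset + len(child_text)] == child_text


# Text of a (hypothetical) sentence, and words that refer back to it.
sentence_text = "The cat sleeps."

checks = {
    "The": offset_is_valid(sentence_text, "The", 0),        # first character is offset 0
    "cat": offset_is_valid(sentence_text, "cat", 4),        # after "The" and one space
    "cat_wrong": offset_is_valid(sentence_text, "cat", 5),  # off by one, so inconsistent
}
```

A check of this kind is conceptually what has to hold for a reference to be resolvable; the library's own validation is done by validateref().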
ACCEPTED_DATA = (<class 'pynlpl.formats.folia.AbstractTextMarkup'>, <class 'pynlpl.formats.folia.Linebreak'>)
ANNOTATIONTYPE = 0
OCCURRENCES = 0
OCCURRENCESPERSET = 0
OPTIONAL_ATTRIBS = (1, 2, 3, 5)
ROOTELEMENT = True
TEXTCONTAINER = True
XMLTAG = 't'
finddefaultreference()

Find the default reference for text offsets: The parent of the current textcontent’s parent (counting only Structure Elements and Subtoken Annotation Elements)

Note: this returns the parent element rather than a TextContent element. Whether the text content actually exists is checked later, elsewhere.

classmethod findreplaceables(Class, parent, set, **kwargs)

(Method for internal usage, see AbstractElement)

json(attribs=None, recurse=True)
classmethod parsexml(Class, node, doc)

(Method for internal usage, see AbstractElement)

postappend()

(Method for internal usage, see AbstractElement.postappend())

classmethod relaxng(includechildren=True, extraattribs=None, extraelements=None)
settext(text)
text()

Obtain the text (unicode instance)

validateref()

Validates the Text Content’s references. Raises UnresolvableTextContent when invalid

xml(attribs=None, elements=None, skipchildren=False)
class pynlpl.formats.folia.TextCorrectionLevel
CORRECTED = 0
INLINE = 3
ORIGINAL = 2
UNCORRECTED = 1
class pynlpl.formats.folia.TextMarkupCorrection(doc, *args, **kwargs)
ANNOTATIONTYPE = 16
XMLTAG = 't-correction'
json(attribs=None, recurse=True)
classmethod parsexml(Class, node, doc)
classmethod relaxng(includechildren=True, extraattribs=None, extraelements=None)
xml(attribs=None, elements=None, skipchildren=False)
class pynlpl.formats.folia.TextMarkupError(doc, *args, **kwargs)
ANNOTATIONTYPE = 18
XMLTAG = 't-error'
class pynlpl.formats.folia.TextMarkupGap(doc, *args, **kwargs)
ANNOTATIONTYPE = 26
XMLTAG = 't-gap'
class pynlpl.formats.folia.TextMarkupString(doc, *args, **kwargs)
ANNOTATIONTYPE = 34
XMLTAG = 't-str'
class pynlpl.formats.folia.TextMarkupStyle(doc, *args, **kwargs)
ANNOTATIONTYPE = 36
XMLTAG = 't-style'
class pynlpl.formats.folia.TimeFeature(doc, *args, **kwargs)

Time feature, to be used with coreferences

SUBSET = 'time'
XMLTAG = None
class pynlpl.formats.folia.TimeSegment(doc, *args, **kwargs)
ACCEPTED_DATA = (<class 'pynlpl.formats.folia.WordReference'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Feature'>, <class 'pynlpl.formats.folia.ActorFeature'>, <class 'pynlpl.formats.folia.BegindatetimeFeature'>, <class 'pynlpl.formats.folia.EnddatetimeFeature'>, <class 'pynlpl.formats.folia.Metric'>)
ANNOTATIONTYPE = 25
OCCURRENCESPERSET = 0
XMLTAG = 'timesegment'
pynlpl.formats.folia.TimedEvent

alias of TimeSegment

class pynlpl.formats.folia.TimingLayer(doc, *args, **kwargs)

Timing Layer: Annotation layer for TimeSegment span annotation elements.

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.TimeSegment'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.Correction'>)
ANNOTATIONTYPE = 25
XMLTAG = 'timing'
exception pynlpl.formats.folia.UnresolvableTextContent
class pynlpl.formats.folia.ValueFeature(doc, *args, **kwargs)

Value feature, to be used within Metric

SUBSET = 'value'
XMLTAG = None
class pynlpl.formats.folia.Whitespace(doc, *args, **kwargs)

Whitespace element, signals vertical whitespace

ACCEPTED_DATA = ()
ANNOTATIONTYPE = 6
REQUIRED_ATTRIBS = ()
TEXTDELIMITER = '\n\n'
XMLTAG = 'whitespace'
class pynlpl.formats.folia.Word(doc, *args, **kwargs)

Word (aka token) element. Holds a word/token and all its related token annotations.

ACCEPTED_DATA = (<class 'pynlpl.formats.folia.AbstractTokenAnnotation'>, <class 'pynlpl.formats.folia.Correction'>, <class 'pynlpl.formats.folia.TextContent'>, <class 'pynlpl.formats.folia.String'>, <class 'pynlpl.formats.folia.Alternative'>, <class 'pynlpl.formats.folia.AlternativeLayers'>, <class 'pynlpl.formats.folia.Description'>, <class 'pynlpl.formats.folia.AbstractAnnotationLayer'>, <class 'pynlpl.formats.folia.Alignment'>, <class 'pynlpl.formats.folia.Metric'>, <class 'pynlpl.formats.folia.Reference'>)
ANNOTATIONTYPE = 1
XMLTAG = 'w'
division()

Obtain the deepest division this word is a part of, otherwise return None

domain(set=None)

Shortcut: returns the FoLiA class of the domain annotation (will return only one if there are multiple!)

findspans(type, set=None)

Find span annotation of the specified type that includes this word

getcorrection(set=None, cls=None)
getcorrections(set=None, cls=None)
gettextdelimiter(retaintokenisation=False)

Returns the text delimiter

json(attribs=None, recurse=True)
lemma(set=None)

Shortcut: returns the FoLiA class of the lemma annotation (will return only one if there are multiple!)

morpheme(index, set=None)

Returns a specific morpheme, the n’th morpheme (given the particular set if specified).

morphemes(set=None)

Generator yielding all morphemes (in a particular set if specified). For retrieving one specific morpheme by index, use morpheme() instead

paragraph()

Obtain the paragraph this word is a part of, otherwise return None

classmethod parsexml(Class, node, doc)
pos(set=None)

Shortcut: returns the FoLiA class of the PoS annotation (will return only one if there are multiple!)

classmethod relaxng(includechildren=True, extraattribs=None, extraelements=None)
resolveword(id)
sense(set=None)

Shortcut: returns the FoLiA class of the sense annotation (will return only one if there are multiple!)

sentence()

Obtain the sentence this word is a part of, otherwise return None

split(*newwords, **kwargs)
xml(attribs=None, elements=None, skipchildren=False)
class pynlpl.formats.folia.WordReference(doc, *args, **kwargs)

Word reference. Used to refer to words or morphemes from span annotation elements. The Python class is only used when a word reference cannot be resolved; if it can be resolved, Word or Morpheme objects are used instead.

REQUIRED_ATTRIBS = (0,)
XMLTAG = 'wref'
classmethod parsexml(Class, node, doc)
classmethod relaxng(includechildren=True, extraattribs=None, extraelements=None)
pynlpl.formats.folia.c

alias of Division

pynlpl.formats.folia.commonancestors(Class, *args)

Generator over common ancestors, of the Class specified, of the current element and the other specified elements
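
The idea of common ancestors can be sketched independently of the library. The following is an illustration under stated assumptions, not the library's implementation: Node and Div are hypothetical stand-ins for FoLiA elements with a parent link, and common_ancestors is a hypothetical free function that yields every ancestor of the first element that is of the requested class and is also an ancestor of all other elements.

```python
class Node:
    """Hypothetical stand-in for an element: a name plus a link to its parent."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    def ancestors(self):
        # Walk upwards from the direct parent to the root.
        node = self.parent
        while node is not None:
            yield node
            node = node.parent


class Div(Node):
    """Hypothetical stand-in for a division-like element."""


def common_ancestors(cls, first, *others):
    # Collect the ancestors of every other element once, compared by identity.
    other_ancestor_ids = [set(id(a) for a in other.ancestors()) for other in others]
    for ancestor in first.ancestors():
        shared = all(id(ancestor) in ids for ids in other_ancestor_ids)
        if isinstance(ancestor, cls) and shared:
            yield ancestor


# Two words in different divisions that share one top-level division.
root = Div("root")
left = Div("left", parent=root)
right = Div("right", parent=root)
word_a = Node("word_a", parent=left)
word_b = Node("word_b", parent=right)

shared_names = [n.name for n in common_ancestors(Div, word_a, word_b)]
```

In this sketch shared_names contains only "root", because "left" and "right" are each an ancestor of just one of the two words.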

pynlpl.formats.folia.findwords(doc, worditerator, *args, **kwargs)
pynlpl.formats.folia.isncname(name)
pynlpl.formats.folia.loadsetdefinition(filename)
pynlpl.formats.folia.makeelement(E, tagname, **kwargs)
pynlpl.formats.folia.parse_datetime(s)

Returns (datetime, tz offset in minutes) or (None, None).
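
A caller typically has to handle both shapes of this return value. The sketch below uses only the standard library and takes the documented pair as its input. The helper to_aware is hypothetical, and it assumes that the returned datetime is naive and that a positive offset means east of UTC; neither assumption is confirmed by this reference.

```python
from datetime import datetime, timedelta, timezone


def to_aware(dt, offset_minutes):
    """Combine a (datetime, tz offset in minutes) pair into an aware datetime.

    Hypothetical helper for illustration only. Returns None for the
    (None, None) case. Assumes dt is naive and positive offsets are east of UTC.
    """
    if dt is None:
        return None
    tz = timezone(timedelta(minutes=offset_minutes or 0))
    return dt.replace(tzinfo=tz)


# A pair shaped like the documented return value, written out by hand.
parsed = (datetime(2012, 6, 1, 12, 30), 120)
aware = to_aware(*parsed)

# The "nothing could be parsed" case.
missing = to_aware(None, None)
```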

pynlpl.formats.folia.parsecommonarguments(object, doc, annotationtype, required, allowed, **kwargs)

Internal function, parses common FoLiA attributes and sets up the instance accordingly

pynlpl.formats.folia.relaxng(filename=None)
pynlpl.formats.folia.relaxng_declarations()
pynlpl.formats.folia.validate(filename, schema=None, deep=False)
pynlpl.formats.folia.xmltreefromfile(filename, bypassleak=False)
pynlpl.formats.folia.xmltreefromstring(s, bypassleak=False)