Published February 24, 2015 | Version v1
Poster Open

ANNIS3: Towards Generic Corpus Search and Visualization

  • 1. Humboldt-Universität zu Berlin

Description

This poster showcases the latest developments in the generic representation of digital corpora using ANNIS (Zeldes et al. 2009), a browser-based, open source corpus query architecture. Recent advances in computational linguistics tools and corpus formats have led to a rapid growth of multi-layer corpus projects, integrating data from fully manual annotation (e.g. coreference, information structure), fully automatic tools like taggers and parsers for morphology and syntax, as well as semi-automatic annotations combining the two. Adding to these the special requirements of different corpora, such as highly detailed historical corpora with diplomatic and normalized text, multimodal corpora with aligned A/V streams, or parallel corpora with aligned multilingual texts, quickly leads to a combinatorial explosion: each corpus requires unique search and visualization capabilities, and the overhead of designing the query system core and encoding formats is repeated countless times before research can progress.
With ANNIS3, we attempt to move one step closer to a generic solution for the corpus search and representation problem. We model primary linguistic data (transcriptions from any number of simultaneous speakers, aligned multilingual texts, diplomatic and normalized levels...) as nodes in a graph and designate specific layers of information as segmentation layers (cf. Krause et al. 2012). These are treated as "word forms" or "tokens". There can be any number of such layers, they may overlap freely, and each can be used to define adjacency in queries, token distance between search elements, or query hit context. Above and below these segmentations, we represent any and all annotation types as a multi-DAG: an annotation graph which may contain as many subgraphs as needed and may contain cycles overall, as long as each type of annotation is free of cycles within itself.
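As a rough illustration of the multi-DAG constraint described above, the following Python sketch checks that every annotation type is acyclic on its own, while the combined graph may still contain cycles. The function names, the edge-triple representation, and the example node and type names are assumptions made for illustration; they are not taken from the ANNIS3 code base.

```python
from collections import defaultdict


def is_multi_dag(edges):
    """Return True if each annotation type forms an acyclic graph on its own.

    `edges` is an iterable of (source, target, annotation_type) triples.
    This representation is a hypothetical one chosen for the sketch.
    """
    # Group edges into one adjacency list per annotation type.
    by_type = defaultdict(lambda: defaultdict(list))
    for source, target, annotation_type in edges:
        by_type[annotation_type][source].append(target)
    return all(_is_acyclic(adjacency) for adjacency in by_type.values())


def _is_acyclic(adjacency):
    """Depth-first search with three colours; a grey-to-grey edge is a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = defaultdict(int)  # every node starts out WHITE

    def visit(node):
        colour[node] = GREY
        for successor in adjacency.get(node, ()):
            if colour[successor] == GREY:
                return False  # back edge: cycle within this annotation type
            if colour[successor] == WHITE and not visit(successor):
                return False
        colour[node] = BLACK
        return True

    return all(colour[node] != WHITE or visit(node) for node in list(adjacency))


# Invented example: a small syntax tree plus a coreference pointer back to "s".
edges = [
    ("s", "np", "dominance"),
    ("np", "tok1", "dominance"),
    ("tok1", "s", "coreference"),  # closes a cycle only across types
]
print(is_multi_dag(edges))                                  # True
print(is_multi_dag(edges + [("tok1", "s", "dominance")]))   # False
```

In the first call the combined graph contains the cycle s, np, tok1, s, but no single annotation type does, so the structure is accepted. The second call adds a dominance edge that closes a cycle within one type and is rejected.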
The problem of visualizing heterogeneous data is approached from two directions using an extensible plugin-based system. First, the system offers dedicated, highly optimized visualizations for some common data types, such as constituent and dependency trees, coreference, annotation grids, aligned PDF and multimedia, and more. Second, a new module in ANNIS3 constructs annotation-triggered HTML/CSS on the fly, opening and closing HTML elements according to the scope of annotation nodes and filling the attributes and styles of these elements with values from the annotation model.
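The idea of annotation-triggered HTML can be sketched in a few lines of Python. This is a minimal sketch, not the ANNIS3 module: the function `render_html`, the span tuple layout, and the sample tokens and annotation values are all hypothetical, and the sketch assumes that spans are non-empty and either nested or disjoint (no partial overlaps).

```python
from html import escape


def render_html(tokens, spans):
    """Wrap tokens in HTML elements according to the scope of annotation spans.

    `spans` is a list of (start, end, tag, attrs) tuples covering
    tokens[start:end]; `attrs` maps attribute names to annotation values.
    Assumes spans are non-empty and properly nested or disjoint.
    """
    by_start = {}
    for start, end, tag, attrs in spans:
        by_start.setdefault(start, []).append((end, tag, attrs))

    out, stack = [], []
    for i in range(len(tokens) + 1):
        # Close every element whose annotation scope ends before token i.
        while stack and stack[-1][0] == i:
            out.append(f"</{stack.pop()[1]}>")
        if i == len(tokens):
            break
        if i > 0:
            out.append(" ")
        # Open elements starting here, widest scope first so nesting is valid.
        for end, tag, attrs in sorted(by_start.get(i, []), key=lambda s: -s[0]):
            attr_text = "".join(
                f' {name}="{escape(value)}"' for name, value in attrs.items()
            )
            out.append(f"<{tag}{attr_text}>")
            stack.append((end, tag))
        out.append(escape(tokens[i]))
    return "".join(out)


# Invented example: a diplomatic token carrying its normalized form as a tooltip.
tokens = ["Jn", "dem", "anfang"]
spans = [
    (0, 3, "p", {"class": "line"}),
    (0, 1, "span", {"title": "In"}),
]
print(render_html(tokens, spans))
# <p class="line"><span title="In">Jn</span> dem anfang</p>
```

Sorting the spans that start at the same token by descending end position keeps the wider element outside the narrower one, so the stack always closes elements in the reverse of the order in which they were opened.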

Files

DGfS2014_ANNIS_poster.pdf

Files (940.9 kB)

DGfS2014_ANNIS_poster.pdf (940.9 kB), md5:974f23e8575094b78c189ed6a2fc2cea

Additional details

References

  • Dipper, S. 2005. XML-based Stand-off Representation and Exploitation of Multi-Level Linguistic Annotation. Proceedings of Berliner XML Tage 2005 (BXML 2005). Berlin, 39–50.
  • Reznicek, M./Lüdeling, A./Hirschmann, H. 2013. Competing target hypotheses in the Falko corpus: A flexible multi-layer corpus architecture. In Díaz-Negrillo, A./Ballier, N./Thompson, P. (eds.) Automatic Treatment and Analysis of Learner Corpus Data. Amsterdam: John Benjamins, 101–124.
  • Stede, M. 2004. The Potsdam Commentary Corpus. In Webber, B./Byron, D. K. (eds.) Proceedings of the ACL-04 Workshop on Discourse Annotation. Barcelona, Spain, 96–102.
  • Zipser, F./Romary, L. 2010. A Model Oriented Approach to the Mapping of Annotation Formats using Standards. Workshop Language Resource & Language Technology Standards, LREC 2010. Malta, 7–18.