ANNIS3: Towards Generic Corpus Search and Visualization
- 1. Humboldt-Universität zu Berlin
Description
This poster showcases the latest developments in the generic representation of digital corpora using ANNIS (Zeldes et al. 2009), a browser-based, open source corpus query architecture. Recent advances in computational linguistics tools and corpus formats have led to rapid growth in multi-layer corpus projects, which integrate data from fully manual annotation (e.g. coreference, information structure), fully automatic tools such as taggers and parsers for morphology and syntax, and semi-automatic annotations combining the two. Adding the special requirements of different corpora, such as highly detailed historical corpora with diplomatic and normalized text, multimodal corpora with aligned A/V streams, or parallel corpora with aligned multilingual texts, quickly leads to a combinatorial explosion: each corpus requires unique search and visualization capabilities, and the overhead of designing the query system core and encoding formats is repeated countless times before research can progress.
With ANNIS3, we attempt to move one step closer to a generic solution for the corpus search and representation problem. We model primary linguistic data (transcriptions from any number of simultaneous speakers, aligned multilingual texts, diplomatic and normalized levels, etc.) as nodes in a graph and designate specific layers of information as segmentation layers (cf. Krause et al. 2012). These are treated as "word forms" or "tokens"; there can be any number of such layers, they may overlap freely, and they can be used to define adjacency in queries, token distance between search elements, or query hit context. Above and below these segmentations, we represent any and all annotation types as a multi-DAG: an annotation graph which may contain as many subgraphs as needed and may even contain cycles overall, as long as each type of annotation is free of cycles within itself.
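The multi-DAG constraint can be illustrated with a minimal Python sketch. It is not ANNIS code: the function names, the edge-type labels (`dominance`, `coref`, `anaphor_link`) and the `(type, source, target)` edge representation are all assumptions made for this example. It checks that every annotation type is acyclic on its own, even where the combined graph contains a cycle.

```python
from collections import defaultdict

def has_cycle(edges):
    """Return True if the directed graph given as (source, target) pairs has a cycle."""
    graph = defaultdict(list)
    for src, tgt in edges:
        graph[src].append(tgt)

    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current DFS path, finished
    state = defaultdict(int)       # every node starts as WHITE (0)

    def visit(node):
        state[node] = GREY
        for nxt in graph[node]:
            if state[nxt] == GREY:                 # back edge: cycle found
                return True
            if state[nxt] == WHITE and visit(nxt):
                return True
        state[node] = BLACK
        return False

    # list(graph) takes a snapshot, since visit() may add keys to the defaultdict
    return any(state[n] == WHITE and visit(n) for n in list(graph))

def is_multi_dag(typed_edges):
    """Check the (assumed) multi-DAG rule: each edge type alone must be acyclic."""
    by_type = defaultdict(list)
    for edge_type, src, tgt in typed_edges:
        by_type[edge_type].append((src, tgt))
    return all(not has_cycle(edges) for edges in by_type.values())

# Hypothetical annotation graph: one syntax subgraph and two pointing-relation types.
edges = [
    ("dominance", "NP", "tok1"),
    ("dominance", "NP", "tok2"),
    ("coref", "NP", "PRON"),
    ("anaphor_link", "PRON", "NP"),   # closes a cycle only across types
]

print(is_multi_dag(edges))                               # True
print(has_cycle([(s, t) for _, s, t in edges]))          # True: the untyped graph is cyclic
print(is_multi_dag(edges + [("coref", "PRON", "NP")]))   # False: cycle within "coref"
```

Grouping edges by type before the cycle check is what separates this from an ordinary DAG test: the union of all layers is allowed to be cyclic.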
The problem of visualizing heterogeneous data is approached from two directions using an extensible plugin-based system. First, the system offers dedicated, highly optimized visualizations for some common data types, such as constituent and dependency trees, coreference, annotation grids, aligned PDF and multimedia plugins, and more. Second, a new module in ANNIS3 constructs annotation-triggered HTML/CSS on the fly, opening and closing HTML elements according to the scope of annotation nodes and filling the attributes and styles of those elements with values from the annotation model.
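The idea of annotation-triggered HTML can be sketched in a few lines of Python. This is an illustration only, not the ANNIS3 module or its configuration format: `render_html`, the span tuples `(name, value, start, end)` and the `rules` mapping are hypothetical. The sketch assumes that spans are non-empty, lie within the token range and are properly nested (no crossing spans).

```python
from collections import defaultdict
from html import escape

def render_html(tokens, spans, rules):
    """Wrap tokens in HTML elements triggered by annotation spans.

    tokens: list of token strings
    spans:  (name, value, start, end) tuples over token indices, end exclusive
    rules:  annotation name -> (html_tag, attribute_name); hypothetical format
    """
    opens = defaultdict(list)
    for name, value, start, end in spans:
        if name in rules:                      # only configured annotations trigger HTML
            tag, attr = rules[name]
            opens[start].append((end, tag, attr, value))

    out, stack = [], []
    for i in range(len(tokens) + 1):
        # Close every element whose annotation scope ends here (innermost first).
        while stack and stack[-1][0] == i:
            out.append(f"</{stack.pop()[1]}>")
        if i == len(tokens):
            break
        if i > 0:
            out.append(" ")
        # Open elements starting here, longest (outermost) span first.
        for end, tag, attr, value in sorted(opens[i], key=lambda o: -o[0]):
            out.append(f'<{tag} {attr}="{escape(value)}">')
            stack.append((end, tag))
        out.append(escape(tokens[i]))
    return "".join(out)

# Hypothetical diplomatic text with a line annotation and a normalization annotation.
tokens = ["the", "olde", "boke"]
spans = [("line", "1", 0, 3), ("norm", "old", 1, 2)]
rules = {"line": ("div", "data-line"), "norm": ("span", "title")}

print(render_html(tokens, spans, rules))
# <div data-line="1">the <span title="old">olde</span> boke</div>
```

Crossing spans would need extra handling, such as splitting an element at the boundary, which this sketch leaves out.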
Files
DGfS2014_ANNIS_poster.pdf