This page gathers all sorts of documentation on GrETEL, such as tutorials, related tools, and frequently asked questions. If you have any more questions, you can always consult the GrETEL project website or you can contact us.

Please cite the following paper if you have used GrETEL for your research:

Jan Odijk, Martijn van der Klis and Sheean Spoel (2018). “Extensions to the GrETEL treebank query application”. In: Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories. Prague, Czech Republic. pp. 46-55.


Frequently Asked Questions

Why is the output limited to 500 sentences?

GrETEL is free for students and academic research, but the corpora that are accessible via GrETEL are not meant for distribution. In other words, we do not have the rights to give out a corpus as a whole. If a user searched for a structure consisting of only a cat="top" node, the query would match every sentence and they could download the whole corpus, which is not the intention of this project. If you would like to obtain the raw corpus data (for academic or commercial use), you should contact the INT (Instituut voor de Nederlandse Taal).

For whom is GrETEL intended?

GrETEL is designed as a corpus query tool, which means that it is useful for anyone who wants to search the Lassy Small, CGN, or SoNaR treebanks. The tool is especially useful if you want to look for specific linguistic patterns in those corpora.

Where can I find more information about the corpora available in GrETEL?

GrETEL currently provides access to three corpora: Lassy Small, CGN treebank, and SoNaR treebank. More information on these corpora is provided on GrETEL's project page.

  • Lassy Small was the first corpus to be supported in GrETEL. It is a one-million-word treebank that consists of written data. All of its annotations have been manually checked and verified.
  • CGN treebank is a one-million-word treebank that consists of transcribed Dutch speech. All of its annotations have been manually checked and verified. CGN stands for "Corpus Gesproken Nederlands" (Spoken Dutch Corpus). The CGN treebank is the syntactically enriched part of the 10-million-word CGN corpus.
  • SoNaR treebank is the parsed version of the 500-million-word SoNaR-500 corpus, which consists of 25 components of written data. Because of its size, the syntactic annotations have not been manually verified.

How can I contact you?

This website and this tool were originally developed at the Centre for Computational Linguistics (CCL). The current version is developed by the Digital Humanities Lab at Utrecht University. If you have any suggestions, questions, or general feedback, you are welcome to give us a ring or send us an email. You can find contact information on the Digital Humanities Lab's website or in the footer of this website.

Why does the XPath generated for SoNaR have only one leading slash, when the XPath for Lassy Small and CGN has two?

This has to do with how XPath works on the one hand, and how we optimised the SoNaR database on the other. An XPath expression that begins with a double slash searches for the pattern in all descendants of the current node (or implied root), whereas a single slash restricts the search to direct children. Why that difference matters for SoNaR is described in the paper cited below.

Vincent Vandeghinste and Liesbeth Augustinus (2014). "Making Large Treebanks Searchable. The SoNaR case". In: Marc Kupietz, Hanno Biber, Harald Lüngen, Piotr Bański, Evelyn Breiteneder, Karlheinz Mörth, Andreas Witt & Jani Takhsha (eds.), Proceedings of the LREC2014 2nd workshop on Challenges in the management of large corpora (CMLC-2). Reykjavik, Iceland. pp. 15-20.
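
The difference between the two leading-slash forms can be tried out on a toy tree. The following is a minimal sketch in Python using only the standard library, not GrETEL's own code. Python's ElementTree supports only relative paths, so './node' and './/node' stand in for a leading single and double slash, and a hypothetical wrapper element <doc> plays the role of the document root. The tree and its attribute values are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Toy tree (illustrative only); <doc> is a hypothetical wrapper that
# plays the role of the document root.
XML = """
<doc>
  <node cat="top">
    <node cat="np">
      <node pos="det"/>
      <node pos="noun"/>
    </node>
  </node>
</doc>
"""

doc = ET.fromstring(XML)

# './node' stands in for a leading single slash: direct children only.
children = doc.findall("./node")

# './/node' stands in for a leading double slash: descendants at any depth.
descendants = doc.findall(".//node")

print(len(children))     # 1 (only the cat="top" node)
print(len(descendants))  # 4 (top, np, det, noun)
```

The single-slash form inspects far fewer candidate nodes, which hints at why the form of the path can matter for a treebank as large as SoNaR; the actual database optimisation is the one described in the cited paper, not this sketch.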

What is new in version 3?

In addition to an overall design update, major changes include a more intuitive query builder in the example-based search mode and a visualizer for syntax trees that is compatible with all modern browsers. Moreover, the results are presented as soon as they are found, so you can browse the matching sentences before the treebank search is completed. Furthermore, it is possible to query the 500-million-word SoNaR treebank in the same way as the two one-million-word treebanks, CGN and Lassy Small.

What is new in version 4?

Version 4 is a rewrite of the user interface. Links to results can now be shared, and navigation is more flexible, quicker and more reliable. XPath entry now uses an editor with syntax highlighting, live validation and suggestions. It is also possible to show the context of a search result (the preceding and following sentence) for an XPath-based search. We have added an application for uploading new corpora, and results can now be filtered by metadata. Finally, a completely new analysis module allows the properties of syntax nodes to be compared using a graphical interface.

How are properties analysed?

Only the properties of the first node matched by each XPath variable are returned for analysis. For example:

A user searches for //node[node]. Two variables are found in this query: $node1 = //node and $node2 = $node1/node. A sentence with the following structure would match this query:

node[np] (node[det] node[noun])

The node found for $node1 will then be node[np]. The node found for $node2 will then be node[det]. The properties of node[noun] will not be available for analysis using this query. When searching for a more specific structure, this is unlikely to occur.
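
The first-match behaviour in this example can be imitated with a minimal Python sketch using only the standard library. This is not GrETEL's implementation: the wrapper element <sentence> and the attribute names are assumptions made for illustration, and ElementTree's find() is used because it returns only the first match in document order.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper <sentence> plays the role of the document root;
# the attribute names and values are illustrative only.
XML = """
<sentence>
  <node cat="np">
    <node pos="det"/>
    <node pos="noun"/>
  </node>
</sentence>
"""

root = ET.fromstring(XML)

# $node1: the first node that has a node child, here the np.
node1 = root.find(".//node[node]")

# $node2: the first child node of $node1; later siblings are ignored.
node2 = node1.find("./node")

# Only these two nodes would be available for analysis.
analysed = [node1.attrib, node2.attrib]
print(analysed)  # [{'cat': 'np'}, {'pos': 'det'}]
```

The noun node never appears in the output, because only the first match for each variable is kept.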