GrETEL is free for students and academic research, but the corpora that are accessible via GrETEL are not meant for distribution. In other words, we do not have the rights to give out a corpus as a whole. If a user searched for a structure with only a cat="top" node, they could literally download the whole corpus, which is not the intention of this project. If you would like to obtain the raw corpus data (for academic or commercial use), you should contact the INT.
GrETEL is designed as a corpus query tool, which means that it is useful for anyone who is interested in searching through the Lassy Small, CGN, or SoNaR treebanks. The tool is especially useful if you want to look for specific linguistic patterns in those corpora.
GrETEL currently provides access to three corpora: Lassy Small, the CGN treebank, and the SoNaR treebank. More information on these corpora is provided on GrETEL's project page.
- Lassy Small was the first corpus to be supported in GrETEL. It is a one-million-word treebank that consists of written data. All of its annotations have been manually checked and verified.
- The CGN treebank is a one-million-word treebank that consists of transcribed Dutch speech. All the provided annotations have been manually checked and verified. CGN stands for "Corpus Gesproken Nederlands" (Spoken Dutch Corpus). The CGN treebank is a syntactically enriched part of the 10-million-word CGN corpus.
- The SoNaR treebank is the parsed version of the 500-million-word SoNaR-500 corpus, which consists of 25 components of written data. Because of its size, the syntactic annotations have not been manually verified.
This website and this tool were originally developed at the Centre for Computational Linguistics (CCL). The current version is developed by the Digital Humanities Lab at Utrecht University. If you have any suggestions, questions, or general feedback, you are welcome to give us a ring or send us an email. You can find contact information on the Digital Humanities Lab's website or in the footer of this website.
It has to do with how XPath structures work on the one hand, and how we optimised the SoNaR database on the other. An XPath pattern that begins with a double slash is searched for in all descendants of the current node (or implied root), whereas a single slash restricts the search to its direct children. How that difference is relevant for SoNaR is described in the paper cited below.
Vincent Vandeghinste and Liesbeth Augustinus. (2014). "Making Large Treebanks Searchable. The SoNaR case". In: Marc Kupietz, Hanno Biber, Harald Lüngen, Piotr Bański, Evelyn Breiteneder, Karlheinz Mörth, Andreas Witt & Jani Takhsha (eds.), Proceedings of the LREC2014 2nd workshop on Challenges in the management of large corpora (CMLC-2). Reykjavik, Iceland. pp. 15-20.
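The difference between a double and a single slash can be tried out on a small tree. The sketch below uses Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath and requires relative paths (".//node" and "./node"). The tree itself, including the alpino_ds root element and the cat values, is a made-up illustration, and this is not how GrETEL evaluates queries on its treebanks.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified syntax tree (illustration only).
TREE = """
<alpino_ds>
  <node cat="top">
    <node cat="smain">
      <node cat="np"/>
    </node>
  </node>
</alpino_ds>
"""

root = ET.fromstring(TREE.strip())

# Double slash: search all descendants of the current node.
descendants = root.findall(".//node")

# Single slash: search only the direct children of the current node.
children = root.findall("./node")

print([n.get("cat") for n in descendants])
print([n.get("cat") for n in children])
```

The double-slash query visits every level of the tree, while the single-slash query only looks one level down, so it has far fewer candidates to inspect.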
In addition to an overall design update, major changes include a more intuitive query builder in the example-based search mode and a visualizer for syntax trees that is compatible with all modern browsers. Moreover, the results are presented as soon as they are found, so you can browse the matching sentences before the treebank search is completed. Furthermore, it is possible to query the 500-million-word SoNaR treebank in a similar fashion to the two one-million-word treebanks, CGN and Lassy Small.
The user interface has been rewritten. Links to results can now be shared, and navigation is more flexible, quicker and more reliable. XPath entry now uses a highlighted editor that performs live validation and shows suggestions. It is now also possible to show the context of a search (the preceding and following sentence) for an XPath-based search. We have also added an application for uploading new corpora. Furthermore, results can now be filtered by metadata. Finally, a completely new analysis module has been added, which allows the properties of syntax nodes to be compared using a graphical interface.
Only the properties of the first node matched by an XPath variable are returned for analysis. For example: a user searches for //node[node]. Two variables are found in this query: $node1 = //node and $node2 = $node1[node]. A sentence with the following structure would match this query: node[np] (node[det] node[noun]). The node found for $node1 will then be node[np]. The node found for $node2 will then be node[det]. The properties of node[noun] will not be available for analysis using this query. When searching for a more specific structure, this is unlikely to occur.
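The example above can be mimicked in a few lines of Python with the standard xml.etree.ElementTree module. This is only a rough sketch of the "first match per variable" behaviour, not GrETEL's actual implementation. The sentence wrapper element and the cat and pos attribute values are assumptions made for illustration, and $node2 is read here as the child node of $node1.

```python
import xml.etree.ElementTree as ET

# Hypothetical sentence: node[np] (node[det] node[noun]).
SENTENCE = """
<sentence>
  <node cat="np">
    <node pos="det"/>
    <node pos="noun"/>
  </node>
</sentence>
"""

root = ET.fromstring(SENTENCE.strip())

# $node1: the first node that has a child node.
node1 = root.find(".//node[node]")

# $node2: only the first child node of $node1 is kept.
node2 = node1.find("./node")

# Both children match the pattern, but only the first one is kept.
all_children = node1.findall("./node")

print(node1.get("cat"), node2.get("pos"), len(all_children))
```

Both child nodes match the pattern, yet only the first one (the determiner) is bound to the second variable, so the noun's properties never reach the analysis.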