Documents in the collection: badthings.html, apostrophes.html, clues.html, ditty.html, shizukesaya.html, twilight.html, booze.html, decade_year.html, and, of course, search.html, along with the other test set, searchWorthReading.html.
You might also want to look at the build report from the staticSearch build process.
This is the index page for the Static Search test data set. This data set is designed to cover a range of different input document types and configurations in order to test a variety of scenarios the codebase needs to handle, including:
There are three poems, two Victorian and one Japanese (about cicadas). There are also two documents
with tables, one of which deals with liquor consumption in British Columbia, while the other has numbers
such as 1783 and percentages such as 73.58%. One of the poems includes the phrase "summer day—our day",
and if we follow that to the next line we get "our day Was clouded", which should be indexed/found
in this document, but not in the source poem, where lines are boundary contexts. The apostrophes document
is an excerpt from Martin Porter’s site (note the curly apostrophe in his name).
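The boundary-context behaviour described above can be illustrated with a minimal Python sketch. It is not the staticSearch implementation. It assumes that a phrase only matches when all of its tokens occur consecutively inside a single context, and the names tokenize, phrase_in_contexts, poem_contexts and index_contexts are hypothetical. The context strings are just the fragments quoted above, not the full poem.

```python
import re

def tokenize(text):
    # Assumption: lowercase, then keep runs of letters, digits, and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

def phrase_in_contexts(phrase, contexts):
    """Return True if the phrase's tokens occur consecutively within any one context."""
    needle = tokenize(phrase)
    for context in contexts:
        tokens = tokenize(context)
        # Slide a window of len(needle) over this context only; never across contexts.
        for i in range(len(tokens) - len(needle) + 1):
            if tokens[i:i + len(needle)] == needle:
                return True
    return False

# In the poem, each line is its own context, so the phrase is split in two.
poem_contexts = ["summer day—our day", "Was clouded"]
# On this page, the phrase sits inside one running sentence.
index_contexts = ["we get our day Was clouded"]

found_in_poem = phrase_in_contexts("our day Was clouded", poem_contexts)
found_in_index = phrase_in_contexts("our day Was clouded", index_contexts)
```

Under these assumptions, found_in_poem is False and found_in_index is True, which mirrors the expectation that the phrase is found on this page but not in the source poem.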
The word artichoke appears here and in one other document, but it is included in the test stopwords list so it should not be indexed or retrieved.
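As a rough illustration of the stopword behaviour described above, here is a small Python sketch. It is an assumption-laden stand-in rather than the actual staticSearch build: build_index, search, the STOPWORDS set and the two document names and texts are all hypothetical, chosen only to show a stopword being dropped at index time and ignored at query time.

```python
import re

# Hypothetical stand-in for the test stopwords list.
STOPWORDS = {"artichoke"}

def build_index(documents, stopwords):
    """Map each non-stopword token to the set of document names containing it."""
    index = {}
    for name, text in documents.items():
        for token in re.findall(r"[a-z0-9']+", text.lower()):
            if token in stopwords:
                continue  # stopwords never enter the index
            index.setdefault(token, set()).add(name)
    return index

def search(index, term, stopwords):
    """Return matching document names; a stopword query returns nothing."""
    term = term.lower()
    if term in stopwords:
        return set()
    return index.get(term, set())

# Hypothetical documents: two pages that both mention the stopword.
documents = {
    "index.html": "The word artichoke appears here",
    "other.html": "An artichoke and a poem",
}

index = build_index(documents, STOPWORDS)
artichoke_hits = search(index, "artichoke", STOPWORDS)
poem_hits = search(index, "poem", STOPWORDS)
```

In this sketch, artichoke_hits is empty even though both documents contain the word, while poem_hits contains only other.html.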