Schema and guidelines for creating a staticSearch engine for your HTML5 site
Martin Holmes
Joey Takeda
2019-2021

This documentation provides instructions on how to use the Project Endings staticSearch Generator to provide a fully functional search ‘engine’ for your website without any dependency on server-side code such as a database.

7 How does it work?

7.1 Building the index

The indexing process first reads your configuration file and creates an XSLT file with all your settings embedded in it. Next, it processes your document collection using those settings. Each document is tokenized, and a separate JSON file is then created for each distinct token found; this file contains links to each of the documents containing that token, along with keyword-in-context strings showing the token in use. There will most likely be thousands of these files, but most of them are quite small. Together they constitute the textual index.
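The per-token files together form what is conventionally called an inverted index. The following is a minimal sketch of the idea; the property names (`token`, `instances`, `docUri`, `contexts`) are illustrative only, not staticSearch's actual JSON schema:

```javascript
// Sketch of building an inverted index with keyword-in-context strings.
// One Map entry per distinct token; each entry would be written out as
// its own small JSON file. Field names here are illustrative only.
function buildTokenIndex(docs) {
  // docs: [{ uri, text }]
  const index = new Map();
  for (const doc of docs) {
    const words = doc.text.toLowerCase().match(/[a-z]+/g) || [];
    words.forEach((w, i) => {
      if (!index.has(w)) index.set(w, { token: w, instances: [] });
      let entry = index.get(w).instances.find(e => e.docUri === doc.uri);
      if (!entry) {
        entry = { docUri: doc.uri, contexts: [] };
        index.get(w).instances.push(entry);
      }
      // Keyword-in-context: a couple of words either side of the hit.
      entry.contexts.push(words.slice(Math.max(0, i - 2), i + 3).join(' '));
    });
  }
  return index;
}

const idx = buildTokenIndex([
  { uri: 'poem1.html', text: 'We are waiting for the train' }
]);
console.log(JSON.stringify(idx.get('waiting'), null, 2));
```

Because each token's entry is independent of all the others, the index can be split into one tiny file per token, which is what makes the later selective downloading possible.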

In addition, separate JSON files are created for the list of document titles and for your stopword list, if you have specified one. A single text file is also created containing all the unique terms in the collection; this is used when processing wildcard searches.
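The unique-terms file makes wildcard searches possible without downloading the whole index: the wildcard pattern is resolved against the word list first, and only the index files for the matching terms are then fetched. A sketch of the resolution step, assuming a simple `*` wildcard (see the configuration documentation for staticSearch's actual wildcard rules):

```javascript
// Sketch: expanding a wildcard search term against the list of all
// unique terms in the collection. The '*' syntax is assumed here for
// illustration.
function expandWildcard(pattern, wordList) {
  // Escape regex metacharacters (except '*'), then turn '*' into '.*'.
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
  const re = new RegExp('^' + escaped.replace(/\*/g, '.*') + '$');
  return wordList.filter(w => re.test(w));
}

const words = ['wait', 'waiting', 'waiter', 'water'];
console.log(expandWildcard('wait*', words)); // [ 'wait', 'waiting', 'waiter' ]
```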

Next, if you have specified search facets in your document headers, the processor creates a separate JSON file for each facet, listing the document identifiers of all documents matching each of the facet's values. For example, if some of your documents are marked as ‘Illustrated’ (true) and some not (false), a JSON file will be created for the ‘Illustrated’ facet containing one list of documents for which the facet is true, and another list for which it is false.
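A facet file for the ‘Illustrated’ example above might be shaped roughly like this; the property names and the filtering helper are illustrative assumptions, not staticSearch's exact schema:

```javascript
// Illustrative shape of a facet file for a boolean 'Illustrated' facet.
// Property names are assumptions for illustration only.
const illustratedFacet = {
  facet: 'Illustrated',
  true:  ['doc1.html', 'doc4.html'],
  false: ['doc2.html', 'doc3.html']
};

// A document passes the filter if its id appears under the selected value.
function matchesFacet(facetFile, value, docId) {
  return (facetFile[value] || []).includes(docId);
}

console.log(matchesFacet(illustratedFacet, 'true', 'doc1.html')); // true
```

Because each facet file is just lists of document ids, filtering a set of search results against a facet is a cheap set-intersection on the client.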

Finally, the template file you have created for the search page on your site will be processed to add the required search controls and JavaScript to make the search work.

7.2 The search page

In order to provide fast, responsive search results, the search page must download only the information it needs for each specific search. Obviously, if it were to download the entire collection of thousands of token files, the process would take forever. So when you search for the word waiting, the JavaScript stems that word, producing wait, and then downloads only the single file containing indexing information for that specific stem, which is very rapid. (If you are using a different stemmer, the token will of course be stemmed to a different output. With the identity stemmer, the token is unchanged; with the stripDiacritics pseudo-stemmer, all combining diacritics are stripped from the search terms, as they are in the corresponding index.)
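The key point is that the stemmed form names the one index file to fetch. The toy suffix-stripper below is only for illustration (the real search page uses a full Porter-style stemmer, or the identity/stripDiacritics alternatives described above), and the `wait.json` file-naming is likewise an assumption:

```javascript
// Toy suffix-stripping stemmer, for illustration only: a real deployment
// uses a proper Porter-style stemmer matching the one used at index time.
function toyStem(word) {
  return word
    .toLowerCase()
    .replace(/ing$/, '')   // waiting -> wait
    .replace(/e?s$/, '');  // trains -> train
}

// The stemmed token identifies the single index file to fetch:
const token = toyStem('waiting');
const indexFile = token + '.json'; // e.g. 'wait.json' (naming is illustrative)
console.log(indexFile);
```

Note that the stemmer used on the search page must be the same one used when building the index, since otherwise the computed filename would not match any index file.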

However, some information is required for all or many searches. For example, to display any results at all, the list of document titles must be downloaded. A user may also use the search facets alone, not searching for a particular word or phrase but simply wanting a list of all the documents classified as ‘Poems’; this requires that the JSON file describing that facet be downloaded. There is therefore some advantage in having the JavaScript begin downloading the essential files (titles, stopwords and so on) as soon as the page loads, and in downloading the facet files in the background.

At the same time, though, we don't want to clog up the connection downloading these files when the user may perform a simple text search which doesn't depend on them, so they are retrieved using a ‘trickle’ approach, one at a time. If a search is then initiated, any files required for that specific search are downloaded as fast as possible, jumping ahead of the trickle sequence.
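The trickle-plus-priority behaviour can be sketched as a small queue. This is a simplified model, not staticSearch's actual code: the `fetchFn` parameter is injected so the sketch can run without a network, where a real page would pass something like `window.fetch`:

```javascript
// Sketch of a 'trickle' download queue: ancillary files are fetched one
// at a time in the background, but a file needed for the current search
// can jump the queue via prioritize(). Simplified model, not the actual
// staticSearch implementation.
class TrickleQueue {
  constructor(files, fetchFn) {
    this.pending = [...files];  // files still to trickle in
    this.fetchFn = fetchFn;
    this.loaded = {};           // file name -> fetched content
  }
  async start() {
    // Background trickle: strictly one file at a time.
    while (this.pending.length > 0) {
      const file = this.pending.shift();
      this.loaded[file] = await this.fetchFn(file);
    }
  }
  // Called when a search needs a file right away: remove it from the
  // trickle sequence and fetch it immediately.
  async prioritize(file) {
    if (file in this.loaded) return this.loaded[file];
    this.pending = this.pending.filter(f => f !== file);
    this.loaded[file] = await this.fetchFn(file);
    return this.loaded[file];
  }
}
```

A page would call `start()` once on load and `prioritize()` whenever a search needs a particular facet or ancillary file; once everything is in `loaded`, later searches only ever fetch token files.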

Once the user has been on the search page for any length of time, all ancillary files will have been retrieved (assuming they weren't already cached by the browser), so the only files required for any search are those for the actual text search terms; the response should therefore be even faster for later searches than for early ones.
