Schema and guidelines for creating a staticSearch engine for your HTML5 site
Martin Holmes
Joey Takeda
2019-2021
This documentation provides instructions on how to use the Project Endings staticSearch
Generator to provide a fully-functional search ‘engine’ to your website without any
dependency on server-side code such as a database.
First, you will have to make sure your site pages are correctly configured so that
the Generator can parse them. Then, you will have to create a configuration file specifying
what options you want to use. Finally, you run the generator, and the search functionality
should be added to your site.
The generator expects to parse well-formed XHTML5 web pages: that is, web pages which are well-formed XML, using the XHTML namespace. If your
site is just raggedy tag-soup, then you can't use this tool. You can tidy up your
HTML using HTML Tidy.
6.1 Configuring your site: search filters
Next, you will need to decide whether you want search filters or not. If you want
to allow your users to search (for example) only in poems, or only in articles, or
only in blog posts, or any combination of these document types, you will need to add
<meta> tags to the heads of your documents to specify what these filters are. staticSearch supports four filter types.
6.1.1 Description filters
The description (desc) filter is a word or phrase describing or associated with the
document. Here is a simple example:
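Such a tag might look like this (the filter name and value are illustrative; the staticSearch.desc class marks this as a description filter):
<meta name="Document type" class="staticSearch.desc" content="Poems"/>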
This specifies that there is to be a descriptive search filter called ‘Document type’, and one of the types is ‘Poems’; the document containing this <meta> tag is one of the Poems. Another type might be:
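For instance (the value is illustrative):
<meta name="Document type" class="staticSearch.desc" content="Short stories"/>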
If the Generator finds such meta tags when it is indexing, it will create a set of
filter controls on the search page, enabling the user to constrain the search to a
specific set of filter settings.
6.1.1.1 Sort order for description filters
Description filter labels may be plain text such as ‘Short stories’ or ‘Poems’, but they may also be more obscure labels relating to document categories in indexing
systems or archival series identifiers. When the search page is generated, these labels
are turned into a series of labelled checkboxes, sorted in alphabetical order. However,
the strict alphabetical order of items may not be exactly what you want; you may want
to sort ‘305 2’ before ‘305 10’ for example. To deal with cases like this, in addition to the content attribute, you can also supply a custom data-ssfiltersortkey attribute, providing a sort key for each label. Here are a couple of examples:
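For example (the sort-key values shown are illustrative):
<meta name="Document type" class="staticSearch.desc" content="305 2" data-ssfiltersortkey="305_02"/>
<meta name="Document type" class="staticSearch.desc" content="305 10" data-ssfiltersortkey="305_10"/>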
In this case, the first item will sort in the filter list before the second item
based on the sort key; without it, they would sort in reverse order based on the content attribute. Note that the data-ssfiltersortkey attribute name is all-lower-case, to comply with the XHTML5 schema.
6.1.2 Date filters
Another slightly different kind of search control is a document date. If your collection
of documents has items from different dates, you can add a <meta> tag like this:
<meta name="Date of publication" class="staticSearch.date" content="1895-01-05"/>
The date may take any of the following forms:
1895 (year only)
1895-01 (year and month)
1895-01-05 (year, month and day)
For some documents, it may not be possible to specify a single date in this form,
so you can specify a range instead, using a slash to separate the start and end dates
of the range (following ISO 8601):
1895/1897
1903-01-02/1905-05-31
6.1.3 Number filters
You can also configure a range filter based on a numeric value (integer or decimal).
For example, you might want to allow people to filter documents in the search results
based on their word-count:
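For example (the filter name and count are illustrative; the staticSearch.num class marks this as a number filter):
<meta name="Word count" class="staticSearch.num" content="7425"/>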
6.2 Configuring your site: document titles
When the indexing process runs over your document collection, by default it will use
the document title that it finds in the <title> element in the document header; that title will then be shown as a link to the document
when it comes up in search results. However, that may not be the ideal title for this
purpose; for example, all of your documents may include the site title as the first
part of their document title, but it would be pointless to include this in the search
result links. Therefore you can override the document title value by providing another
meta tag, like this:
<meta name="docTitle" class="staticSearch.docTitle" content="What I did in my holidays"/>
6.3 Configuring your site: document sort keys
When a user searches for text on your site, the documents retrieved will be presented
in a sequence based on the ‘hit score’ or ‘relevance score’; documents with the highest
scores will be presented first, and the list will be in descending order of relevance.
However, if you have search filters on your site, it is possible that users will not
enter any search text at all; they may simply select some filters and get a list of
matching documents. In this case, there will be no relevance scores, so the documents
will be presented in a random order. However, you may wish to control the order in
which documents without hit scores, or sequences of documents with the same hit score,
are presented. You can do this by adding a single meta tag to the document providing
a ‘sort key’, which can be used to sort the list of hits. This is an example:
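Such a tag might look like this (the content value, an arbitrary sortable string, is illustrative):
<meta name="docSortKey" class="staticSearch.docSortKey" content="poems_anthology_002"/>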
6.4 Configuring your site: document thumbnails
When a document is returned as a result of a search hit, you may want to include with
it a thumbnail image. This may be for aesthetic reasons, or because the focus of the
document itself is actually an image (perhaps your site is a set of pages dealing
with works of art, for instance). Whatever the reason, you can supply a link to a
thumbnail image like this:
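For example (matching the folder layout described below):
<meta name="docImage" class="staticSearch.docImage" content="images/thisPage.jpg"/>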
The content attribute value should either be the path to an image relative to the document itself
or the URL to an external image; so in the example above, there would be a folder
called images which is a sibling of the HTML file containing the tag, and that folder would contain
a file called thisPage.jpg.
6.5 Creating a configuration file
The configuration file is an XML document which tells the Generator where to find
your site, and what search features you would like to include. The configuration file
conforms to a schema which is documented here.
There are three main sections of the configuration file: <params>, <rules>, and <contexts>.
Only the <params> element is required, but, as we discuss shortly, we highly suggest taking advantage
of the <rules> and <contexts> elements for the best results.
6.5.1 Specifying parameters
6.5.1.1 Required parameters
The <params> element has two required child elements, <searchFile> and <recurse>, for determining the resource collection that you wish to index.
The <searchFile> element is a relative URI (resolved, like all URIs specified in the config file,
against the configuration file location) that points directly to the search page that
will be the primary access point for the search. Since the search file must be at
the root of the directory that you wish to index (i.e. the directory that contains
all of the XHTML you want the search to index), the searchFile parameter provides
the necessary information for knowing what document collection to index and where
to put the output JSON. In other words, in specifying the location of your search
page, you are also specifying the location of your document collection. See Creating a search page for more information on how to configure this file.
Note that all output files will be in a directory that is a sibling to the search
page. For instance, in a document collection that looks something like:
myProject
    novel.html
    poem.html
    shortstory.html
    search.html
The collection of Javascript and JSON files will be in a directory like so:
myProject
    novel.html
    poem.html
    shortstory.html
    search.html
    staticSearch
The <recurse> element is also required, since a document collection may be nested (as is common with
static sites generated by Jekyll or WordPress). The <recurse> element is a boolean (true or false) that determines whether or not to recurse into
the subdirectories of the collection and index those files.
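A minimal <params> section might therefore look like this (the path is illustrative, and is resolved against the location of the configuration file):
<params>
  <searchFile>../mysite/search.html</searchFile>
  <recurse>true</recurse>
</params>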
6.5.1.2 Optional parameters
The following parameters are optional, but most projects will want to specify some
of them:
<versionFile> enables you to specify the path to a plain-text file containing a simple version
number for the project. This might take the form of a software-release-style version
number such as 1.5, or it might be a Subversion revision number or a Git commit id. It should not contain
any spaces or punctuation. If you provide a version file, the version string will
be used as part of the filenames for all the JSON resources created for the search.
This is useful because it allows the browser to cache such resources when users repeatedly
visit the search page, but if the project is rebuilt with a new version, those cached
files will not be used because the new version will have different filenames. The
path specified is relative to the location of the configuration file (or absolute,
if you wish).
<phrasalSearch> is a boolean parameter which specifies whether you want your search engine to support
phrasal searches (quoted strings). Obviously this is a useful feature, but it is also
costly in terms of the size of JSON token files; in order to support this kind of
search, we store contexts for all hits for each token in each document, so if your
site is very large, and your user base is unlikely to use phrasal searching, it may
not be worth the overhead. The default value is true.
<stemmerFolder> is a string parameter specifying the name of a folder that is inside the /stemmers/ folder in the staticSearch repository structure.
The staticSearch project currently has only one real stemmer, an implementation of
the Porter 2 algorithm for modern English. That appears in /stemmers/en/, so the default value for this parameter is en. We will be adding more stemmers as the project develops. However, if your document
collection is not English, you have a couple of options, one hard and one easy.
Hard option: implement your own stemmers. You will need to write two implementations of the stemmer
algorithm, one in XSLT (which must be named ssStemmer.xsl) and one in JavaScript (ssStemmer.js), and confirm that they both generate the same results. The XSLT stemmer is used
in the generation of the index files at build time, and the JavaScript version is
used to stem the user's input in the search page. You can look at the existing implementations
in the /stemmers/en/ folder to see how the stemmers need to be constructed. Place your stemmers in a folder
called /stemmers/[yourlang]/, and specify yourlang in the configuration file.
Easy option: Use the identity stemmer (which is equivalent to turning off stemming completely), and make sure wildcard
searching is turned on. Then your users can search using wildcards instead of having
their search terms automatically stemmed. To do this, specify the value identity in your configuration file.
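For example:
<stemmerFolder>identity</stemmerFolder>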
Another alternative is the stripDiacritics stemmer. Like the identity stemmer, this is not really a stemmer at all; instead, it strips all combining
diacritics from tokens. This is a useful approach if your document collection contains
texts with accents and diacritics, but your users may be unfamiliar with the use of
diacritics and will want to search just with plain unaccented characters. For example,
if a text contains the word élève, but you would like searchers to be able to find the word simply by typing the ascii
string eleve, then this is a good option. Combined with wildcards, it can provide a very flexible
and user-friendly search engine in the absence of a sophisticated stemmer, or for
cases where there are mixed languages so a single stemmer will not do. To use this
option, specify the value stripDiacritics in your configuration file.
<scoringAlgorithm> is an optional element that specifies which scoring algorithm to use when calculating
the score of a term and thus the order in which the results from a search are sorted.
There are currently two options:
raw: This is the default option (and so does not need to be set explicitly). The raw
score is simply the sum of all instances of a term in a document (optionally multiplied by a weight
configured via the weight attribute on a <rule> element). This will usually provide good results for most document collections.
tf-idf: The tf-idf algorithm (term frequency-inverse document frequency) computes the mathematical
relevance of a term within a document relative to the rest of the document collection.
The staticSearch implementation of tf-idf basically follows the textbook definition:
tf-idf = ($instancesOfTerm / $totalTermsInDoc) * log($allDocumentsCount / $docsWithThisTermCount)
This is fairly crude compared to other search engines, like Lucene, but it may provide useful results for document collections of varying lengths or
in instances where the raw score may be insufficient or misleading. There are a number
of resources on tf-idf scoring, including: Wikipedia and Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.
<createContexts> is a boolean parameter that specifies whether you want the indexer to store keyword-in-context
extracts for each of the hits in a document. This increases the size of the index,
but of course it makes for much more user-friendly search results.
<maxKwicsToHarvest> controls the number of keyword-in-context extracts that will be harvested from the
data for each term in a document. If you set this to a low number, the size of the
JSON files will be constrained, but of course the user will only be able to see the
KWICs that have been harvested in their search results.
<maxKwicsToShow> controls the number of keyword-in-context extracts that will be shown in the search
results for each hit document.
<totalKwicLength> is an integer specifying how long a keyword-in-context string should be. Obviously,
the higher this number is, the larger the individual index files will be, but the
more useful the KWICs will be for users looking at the search results.
<kwicTruncateString> is a string containing the character you would like to use at the beginning and/or
the end of a kwic which is not a full sentence. An ellipsis character is the default.
<linkToFragmentId> is a boolean parameter that specifies whether you want the search engine to link
each keyword-in-context extract to the closest element that has an id. If the element containing the extract has an ancestor with an id, then the indexer will associate that keyword-in-context extract with that id; if there is no suitable ancestor element with an id, then the extract is associated with the first preceding element that has an id.
<scrollToTextFragment> (WARNING: experimental feature). Google has proposed a browser feature called Text Fragments, which would support a special kind of link that targets a specific string of text
inside a page. When clicking on such a link, the browser would scroll to, and then
highlight, the target text. This has been implemented in Chrome-based browsers (Chrome,
Chromium and Edge) at the time of writing, but other browser producers are sceptical
with regard to the specification and worried about possible security implications.
The specification is subject to radical change. <scrollToTextFragment> is a boolean parameter that specifies whether you want to turn on this feature for
browsers that support it. It depends on the availability of keyword-in-context strings,
so <createContexts> must also be turned on to make it work. The feature is automatically suppressed for
browsers which do not support it. We recommend only using this feature on sites which
are in steady development, so that if necessary it can be turned off, or the staticSearch
implementation can be updated to take account of changes. For sites intended to remain
unchanged or archived for any length of time, this feature should be left turned off.
It is off by default.
<verbose> is a boolean which turns on/off detailed output messages during the indexing process.
You might set this to true if something is not working as expected and you need to
do some debugging.
<stopwordsFile> is a string parameter containing the relative path (from the config file) to a text
file containing a list of stopwords that you want to use for your site. A stopword
is a word that will not be indexed, because it is too common (the, a, you and so on). The project has a built-in set of common stopwords for English, which
we recommend you start from; you'll find it in xsl/english_stopwords.txt. If your site is all about a person, a place or some other entity, then you might
add their name to the stopwords list, because it will presumably appear on almost
every page and it makes no sense to search for it. One way to find such terms is to
generate your index, then search for the largest JSON index files that are generated,
to see if they might be too common to be useful as search terms.
<dictionaryFile> is the relative path to a file containing an English dictionary (assuming your site
is in English). This is used to check words during the indexing process, and a report
generated at the end will list all the terms in the site which do not appear in the
dictionary. This is a useful way to find typos in your site. Again, there is a default
dictionary in xsl/english_words.txt which you might copy and adapt.
<indentJSON> is a boolean parameter which controls whether the JSON files generated for the index
are indented or not. Indenting makes the files easier for a human to read, if you
need to understand them or debug them, but obviously it adds to their file size.
<outputFolder> is the name of a folder into which you would like to place the JavaScript and JSON
index files for your site search. The default is staticSearch, but if you would prefer something else, you can specify it here. You may also use
this element if you are defining two different searches within the same site, so that
their files are kept in different locations. The value must conform with the XML Name specification.
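Putting some of these optional parameters together, a <params> section might look something like this (all values are illustrative):
<params>
  <searchFile>../mysite/search.html</searchFile>
  <recurse>true</recurse>
  <phrasalSearch>true</phrasalSearch>
  <createContexts>true</createContexts>
  <maxKwicsToShow>5</maxKwicsToShow>
  <totalKwicLength>100</totalKwicLength>
  <outputFolder>staticSearch</outputFolder>
</params>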
6.5.2 Specifying rules (optional)
The <rules> element specifies a list of conditions (using the <rule> element) that tell the parser, using XPath patterns in the match attribute, which weights to assign to particular parts of each document. For instance,
if you wanted all heading elements (<h1>, <h2>, etc) in documents to be given a greater weight and thus receive a higher score in
the results, you can do so using a rule like so:
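Such a rule might look like this:
<rule weight="2" match="h1 | h2 | h3 | h4 | h5 | h6"/>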
(It is worth noting, however, the above example is unnecessary: all heading elements
are given a weight of 2 by default, which is the only preconfigured weight in staticSearch.)
The value of the match attribute is transformed into an XSLT template match attribute, and thus must follow
the same rules (i.e. no complex patterns such as p/ancestor::div). See the W3C XSLT Specification for further details on allowable patterns.
Often, there will be elements that you want the tokenizer to ignore completely; for
instance, if you have the same header in every document, then there's no reason to
index its contents on every page. These elements can be ignored simply by using a
<rule> and setting its weight to 0. For instance, if you want to remove the header and the
footer from the search indexing process, you could write something like:
<rule weight="0" match="footer | header"/>
Or if you want to remove XHTML anchor tags (<a>) whose text is identical to the URL specified in its href, you could do something like:
<rule weight="0" match="a[@href=./text()]"/>
Note that the indexer does not tokenize any content in the <head> of the document (but as noted above, metadata can be configured into filters) and
that all elements in the <body> of a document are considered tokenizable. However, common elements that you might
want to exclude include:
<script>
<style>
<code>
6.5.3 Specifying contexts (optional)
When staticSearch creates the keyword-in-context extracts (the ‘KWICs’ or ‘snippets’)
for each token, it does so by looking for the nearest block-level element that it
can use as its context. Take, for instance, this unordered list:
<ul>
  <li>Keyword-in-context search results. This is also configurable, since including contexts increases the size of the index.</li>
  <li>Search filtering using any metadata you like, allowing users to limit their search to specific document types.</li>
</ul>
Each <li> element is, by default, a context element, meaning that the snippet generated for each token will not extend beyond
the <li> element boundaries; in this case, if the <li> were not a context element, the term ‘search’ would produce a context that looks something like:
"...the size of the index.Search filtering using any metadata you like,..."
Using the <contexts> element, you can control what elements operate as contexts. For instance, say a page
contained a marginal note, encoded as a <span> in your document beside its point of attachment:
<p>About that program I shall have nothing to say here,<span class="sidenote">Some information on this subject can be found in "Second Thoughts"</span> [...] </p>
Using CSS, the note might be displayed alongside the text of the document in the margin, or
made into a clickable object using Javascript. However, since the tokenizer is unaware
of any client-side processing, it understands the <span> as an inline element and assumes the <p> constitutes the context of the element. A search for ‘information’ might then return:
"...nothing to say here,Some information on this subject can be found..."
To tell the tokenizer that the <span> constitutes the context block for any of its tokens, use the <context> element with a match pattern:
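For the marginal note above, that might be (the class value matches the example):
<context match="span[@class='sidenote']"/>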
You can also configure it the other way: if a <div>, which is by default a context block, should not be understood as one,
then you can tell the parser not to consider it as such by setting the context attribute to false:
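For example (the match pattern is illustrative):
<context match="div[@class='banner']" context="false"/>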
A complex site may have two or more search pages targeting specific types of document
or content, each of which may need its own particular search controls and indexes.
This can easily be achieved by specifying a different <searchFile> and <outputFolder> in the configuration file for each search.
For these searches to be different from each other, they will also probably have different
contexts and rules. For example, imagine that you are creating a special search page
that focuses only on the text describing images or figures in your documents. You
might do it like this:
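One possible shape for such a rule set (the XPath here is illustrative, and assumes document titles are in <h1> elements):
<rules>
  <rule weight="0" match="text()[not(ancestor::h1) and not(ancestor::div[@class='figure'])]"/>
</rules>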
This specifies that all text nodes which are not part of the document title or descendants
of <div class="figure"> should be ignored (weight=0), so only your target nodes will be indexed.
However, it's also likely that you will want to exclude certain features or documents
from a specialized search page, and this is done using the <excludes> section and its child <exclude> elements.
Here is an example:
<excludes>
  <!-- We only index files which have illustrations in them. -->
  <exclude type="index" match="html[not(descendant::meta[@name='Has illustration(s)'][@content='true'])]"/>
  <!-- We ignore the document type filter, because we are only indexing one type of document anyway. -->
  <exclude type="filter" match="meta[@name='Document type']"/>
  <!-- We exclude the filter that specifies these documents because it's pointless. -->
  <exclude type="filter" match="meta[@name='Has illustration(s)']"/>
</excludes>
Here we use <exclude type="index"/> to specify that all documents which do not contain <meta name="Has illustration(s)" content="true"/> should be ignored. Then we use two <exclude type="filter"/> tags to specify first that the Document type filter should be ignored (i.e. it should not appear on the search page), and second,
that the boolean filter Has illustration(s) should also be excluded.
Using exclusions, you can create multiple specialized search pages which have customized
form controls within the same document collection. This is at the expense of additional
disk space and build time, of course; each of these searches needs to be built separately.
6.6 Creating a search page
You'll obviously want the search page for your site to conform with the look and feel
of the rest of your site. You can create a complete HTML document (which must of course
also be well-formed XML, so it can be processed), containing all the site components
you need, and then the search build process will insert all the necessary components
into that file. The only requirement is that the page contains one <div> element with the correct id attribute:
<div id="staticSearch"> [...content will be supplied by the build process...] </div>
This <div> will be empty initially. The build process will insert the search controls,
scripts and results <div> into this container. Then whenever you rebuild the search for your site, the contents
will be replaced; there is no need to make sure it's empty every time.
The search process will also add a CSS <style> element to the <head> of the document:
<style id="ssCss"> [...styles for search controls...] </style>
You can customize this CSS by providing your own CSS that overrides it, using <style>, or <link>, placed after it in the <head> element, or by replacing the inserted CSS after the build process.
Note that once your file has been processed and all this content has been added, you
can process it again at any time; there is no need to start every time with a clean,
empty version of the search page.
You can take a look at the test/search.html page for an example of how to configure the search page (although note that since
this page has already been processed, it has the CSS and the search controls embedded
in it; it also has some additional JavaScript which we use for testing the search
build results, which is not necessary for your site).
6.7 Running the search build process
Once you have configured your HTML and your configuration file, you're ready to create
a search index and a search page for your site. This requires that you run ant in
the root folder of the staticSearch project that you have downloaded or cloned.
Before running the search on your own site, you can test that your system is able
to do the build by doing the (very quick) build of the test materials. If you simply
run the ant command, like this:
mholmes@linuxbox:~/Documents/staticSearch$ ant
you should see a build process proceed using the small test collection of documents,
and at the end, a results page should open up giving you a report on what was done.
If this fails, then you'll need to troubleshoot the problem based on any error messages
you see. (Do you have Java, Ant and ant-contrib installed and working on your system?).
If the test succeeds, you can view the results by uploading the test folder and all its contents to a web server, or by running a local webserver on your
machine in that folder, using the Python HTTP server or PHP's built-in web server.
If the tests all work, then you're ready to build a search for your own site. Now
you need to run the same command, but this time, tell the build process where to find
your custom configuration file:
ant -DconfigFile=/home/mholmes/mysite/config_staticSearch.xml
The same process should run, and if it's successful, you should have a modified search.html page as well as a lot of index files in JSON format in your site HTML folder. Now
you can test your own search in the same ways suggested above.
Notes
1
This example taken from Thomas S. Kuhn, The Structure of Scientific Revolutions (50th anniversary edition), University of Chicago Press, 2012: p. 191.