CLiC for developers

Cheshire3

The data-model

CLiC Concordance

CLiC Concordance based on cheshire3 indexes.

class concordance.Concordance[source]

This concordance takes terms, index names, book selections, and search type as input values and returns JSON with the search term, ten words to the left and ten to the right, and location information.

This can be used in an AJAX API.

build_and_run_query(terms, idxName, Materials, selectWords)[source]

Builds a cheshire query and runs it.

Its output is a tuple of which the first element is a resultset and the second element is the number of search terms in the query.

create_concordance(terms, idxName, Materials, selectWords)[source]

Main concordance method. Creates a list of lists in which each entry holds the three contexts (left, node, right), each of them a list of words, followed by two separate lists of metadata: [ [left context - word 1, word 2, etc.], [node - word 1, word 2, etc], [right context - word 1, etc], [chapter metadata], [book metadata] ], etc.
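The nested structure described above can be easier to picture with a small example. The following is only a sketch with made-up data: the words, the book id and the exact layout of the two metadata lists are assumptions, and format_line is a hypothetical helper that is not part of CLiC.

```python
# Hypothetical data illustrating the documented shape; not real CLiC output.
concordance = [
    [
        ["it", "was", "the", "best", "of"],   # left context, one word per item
        ["times"],                             # node (the search term)
        ["it", "was", "the", "worst", "of"],  # right context
        ["TTC", 1],                            # chapter metadata (layout assumed)
        ["A Tale of Two Cities"],              # book metadata (layout assumed)
    ],
]


def format_line(line):
    """Join the left, node and right word lists into one readable string."""
    left, node, right = line[0], line[1], line[2]
    return "{} [{}] {}".format(" ".join(left), " ".join(node), " ".join(right))


lines = [format_line(line) for line in concordance]
```

Keeping each context as a list of words, rather than one string, lets a client sort or align on individual positions without tokenizing again.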

CLiC Clusters

Tool to create wordlists based on the entries in an index.

class clusters.Clusters[source]

Class that does all the heavy lifting. It makes the connection with cheshire3 and uses the input parameters (indexname and subcorpus/Materials) to return a list of words and their total number of occurrences.

For instance,

the 98021

to  78465

...

or

he said  8937

she said 6732

...

list_clusters(idxName, Materials)[source]

Build a list of clusters and their occurrences.

Limit the list to the first 3000 entries.
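The idea of a frequency-ranked, truncated cluster list can be sketched in plain Python. This is not how CLiC does it (CLiC reads the counts from a cheshire3 index); list_clusters_sketch and its n parameter are assumptions made for illustration only.

```python
from collections import Counter


def list_clusters_sketch(tokens, n=1, limit=3000):
    """Count n-word clusters in a token list, most frequent first."""
    clusters = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # most_common() sorts by count and truncates to the first `limit` entries
    return Counter(clusters).most_common(limit)


tokens = "he said that she said that he said no".split()
single_words = list_clusters_sketch(tokens, n=1)
two_word_clusters = list_clusters_sketch(tokens, n=2)
```

With n=1 the result is a plain word list; with n=2 it resembles the "he said" style of output shown above.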

CLiC Keywords

Module to compute keywords (words that are used significantly more frequently in one corpus than they are in a reference corpus).

class keywords.Keywords[source]

Class to compute keywords based on a test index (the corpus of analysis), a reference index (the corpus of reference), and a P value.

list_keywords(testIdxName, testMaterials, refIdxName, refMaterials, pValue)[source]

Return a sorted list of keywords. Limited to the first 5000 items.
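The section does not say which statistic CLiC uses to decide that a word is significantly more frequent. As a sketch only, the following assumes the log-likelihood measure that is common in corpus linguistics, with chi-square critical values (one degree of freedom) standing in for the P value. All function names and the sample counts are hypothetical.

```python
import math

# Chi-square critical values for 1 degree of freedom (assumed cut-offs).
CRITICAL_VALUES = {0.05: 3.84, 0.01: 6.63, 0.001: 10.83}


def log_likelihood(freq_test, size_test, freq_ref, size_ref):
    """Log-likelihood score for one word in a test vs. a reference corpus."""
    combined = freq_test + freq_ref
    total = size_test + size_ref
    expected_test = size_test * combined / total
    expected_ref = size_ref * combined / total
    score = 0.0
    for observed, expected in ((freq_test, expected_test), (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


def list_keywords_sketch(test_counts, ref_counts, p_value=0.05, limit=5000):
    """Return (word, score) pairs that are over-used in the test corpus."""
    size_test = sum(test_counts.values())
    size_ref = sum(ref_counts.values())
    threshold = CRITICAL_VALUES[p_value]
    scored = []
    for word, freq in test_counts.items():
        ref_freq = ref_counts.get(word, 0)
        # Skip words that are not relatively more frequent in the test corpus.
        if freq / size_test <= ref_freq / size_ref:
            continue
        score = log_likelihood(freq, size_test, ref_freq, size_ref)
        if score >= threshold:
            scored.append((word, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


keywords = list_keywords_sketch({"fog": 50, "the": 1000}, {"fog": 5, "the": 10000})
```

The sketch is written for Python 3, where / is true division.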

CLiC Chapter Repository

Display the texts available in the cheshire3 database. Also highlight specific items that were previously retrieved with a concordance.

class chapter_repository.ChapterRepository[source]

Responsible for providing access to chapter resources within Cheshire.

get_book_title(book)[source]

Gets the title of a book from the JSON file booklist.json.

book – string - the book id/acronym e.g. BH

get_chapter(chapter_number, book)[source]

Returns transformed XML for the given chapter & book.

chapter_number – integer

book – string - the book id/acronym e.g. BH

get_chapter_with_highlighted_search_term(chapter_number, book, wid, search_term)[source]

Returns transformed XML for the given chapter & book with the search term highlighted.

We create the transformer directly so that we can pass extra parameters to it at runtime, in this case the search term.

chapter_number – integer

book – string - the book id/acronym e.g. BH

wid – integer - word index

search_term – string - term to highlight

get_raw_chapter(chapter_number, book)[source]

Returns raw chapter XML for the given chapter & book.

chapter_number – integer

book – string - the book id/acronym e.g. BH

CLiC KWICgrouper

A module to look for patterns in concordances.

class kwicgrouper.Concordance(term, text, word_boundaries=True, length=50, keep_punctuation=True, keep_line_breaks=False)[source]

This is a simple concordance for a text file. The input text should be a string that is cleaned, for instance:

text.replace("\n", " ").replace("  ", " ")

It takes two required arguments: the search term and the text to be searched.

The length should be an integer.
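A character-based concordance over a single string can be sketched as follows. This is an illustration under assumptions, not the kwicgrouper implementation: simple_concordance is a hypothetical function, and it is assumed here that length counts characters of context on each side and that word_boundaries restricts matches to whole words.

```python
import re


def simple_concordance(term, text, length=50, word_boundaries=True):
    """Return a [left, node, right] list for each match of term in text."""
    pattern = re.escape(term)
    if word_boundaries:
        pattern = r"\b" + pattern + r"\b"  # match whole words only
    lines = []
    for match in re.finditer(pattern, text):
        left = text[max(0, match.start() - length):match.start()]
        right = text[match.end():match.end() + length]
        lines.append([left, match.group(), right])
    return lines


text = "A loud voice called out. Another voice answered; voices everywhere."
conc = simple_concordance("voice", text, length=10)
```

Because the context is cut by character count, the left and right strings may start or end in the middle of a word.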

classmethod from_multiple_line_file(term, input_li)[source]

Construct a concordance that respects line breaks (rather than one that treats the text as one large string).

TODO

list_concordance()[source]

List the actual concordance.

print_concordance()[source]

Print the lines of a concordance. For debugging purposes.

single_line_conc()[source]

Build a basic concordance based on a single string of text.

class kwicgrouper.KWICgrouper(concordance)[source]

This starts from a concordance and transforms it into a pandas dataframe (here called textframe) that has five words to the left and right of the search term in separate columns. These columns can then be searched for and sorted.

Input: A nested list of lists looking like:

[
['sessed of that very useful appendage a ', 'voice', ' for a much longer space of time than t' ],

...

Each pattern needs its own instantiation of the KWICgrouper object because the self.textframe variable is changed in the filter method.

args_to_dict(L5=None, L4=None, L3=None, L2=None, L1=None, R1=None, R2=None, R3=None, R4=None, R5=None)[source]

Helper function that allows keyword-style arguments such as L1="a".

conc_to_df()[source]

Turns a list of dictionaries with L1-R5 values into a dataframe which can be used as a kwicgrouper.

filter_textframe(kwdict)[source]

Construct a dataframe slice and selector on the fly. This is no longer meta-programming as it does not use the eval function anymore.

This returns None if there is no textframe.

split_nodes()[source]

Splits the words into nodes that can be fed into a dataframe.
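The idea behind splitting and filtering can be sketched without pandas. In this sketch, split_line and filter_rows are hypothetical helpers and the sample lines are made up. Plain dictionaries stand in for the dataframe rows, and the column names L1 to L5 and R1 to R5 follow the keyword arguments of args_to_dict.

```python
def split_line(left, node, right, span=5):
    """Turn one [left, node, right] line into a dict keyed L5..L1, node, R1..R5."""
    row = {"node": node}
    # L1 is the word closest to the node, so walk the left context backwards.
    for offset, word in enumerate(reversed(left.split()[-span:]), start=1):
        row["L{}".format(offset)] = word
    for offset, word in enumerate(right.split()[:span], start=1):
        row["R{}".format(offset)] = word
    return row


def filter_rows(rows, **positions):
    """Keep the rows whose words match every requested position, e.g. L1="a"."""
    return [
        row for row in rows
        if all(row.get(key) == value for key, value in positions.items())
    ]


conc = [
    ["of that very useful appendage a ", "voice", " for a much longer space of time"],
    ["he spoke in a low ", "voice", " to the old man"],
]
rows = [split_line(*line) for line in conc]
matches = filter_rows(rows, L1="a", R1="for")
```

A list of dictionaries like rows can be passed to pandas.DataFrame(rows) to obtain one column per position.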

kwicgrouper.clean_punkt(text)[source]

Delete punctuation from a text.

Problem: turns CAN’T into CA NT
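The source does not show how clean_punkt is implemented, so the following is only a sketch of the general problem and one possible workaround. A naive replacement of every punctuation mark splits contractions in a similar way, while a pattern that keeps apostrophes between two letters leaves them intact. Both function names are hypothetical.

```python
import re


def strip_punctuation_naive(text):
    """Replace every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", " ", text)


def strip_punctuation_keep_apostrophes(text):
    """Remove punctuation, but keep apostrophes that sit between two letters."""
    pattern = r"(?<![A-Za-z])['’]|['’](?![A-Za-z])|[^\w\s'’]"
    return re.sub(pattern, " ", text)


sample = "CAN'T stop, 'now'!"
naive_words = strip_punctuation_naive(sample).split()
kept_words = strip_punctuation_keep_apostrophes(sample).split()
```

The letter test in the second pattern only covers ASCII letters; texts with accented characters would need a wider character class.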

kwicgrouper.clean_text(text)[source]
Clean a text so that it can be used in a concordance. This includes:
  • all text to lowercase
  • deleting line-breaks
  • tokenizing and detokenizing
kwicgrouper.concordance_for_line_by_line_file(input_file, term)[source]

Takes a file whose line breaks cannot be ignored (for instance a file with a list of things) and turns it into a concordance.

kwicgrouper.old_clean_punkt(text)[source]

This ignores apostrophes and punctuation marks attached to the word. An alternative would be to delete the punctuation from the text by replacing it.

CLiC Normalizer

Defines normalizers that can be used in the cheshire3 indexing workflow.

CLiC Query Builder

Future module to handle the construction of cheshire3 CQL queries.

CLiC Web app

Index

This is the most important file for the web app. It contains the various routes that end users can use.

For instance

@app.route('/about/', methods=['GET'])
def about():
    return render_template("info/about.html")

Where /about/ is the link.

API

This file is an extension of index.py. It generates the raw JSON API that the keywords, cluster, and concordances use(d).

It needs to be refactored.

Models

models.py defines the SQL tables that CLiC uses. These classes provide a Python interface to the SQL database so that you can write Python code that automatically queries the database.

This is heavily dependent on Flask-SQLAlchemy and SQLAlchemy.