Aims of EndoMineR

The goal of EndoMineR is to extract as much information as possible from free or semi-structured endoscopy reports and their associated pathology specimens.

Gastroenterology now has many standards against which practice is measured, although many reporting systems offer little more than basic analysis. Much of the data is locked in semi-structured text. However, the nature of semi-structured text means that data can be extracted in a standardised way; it just requires more manipulation. This package provides that manipulation so that complex endoscopic-pathological analyses, in line with recognised standards for these analyses, can be done.

How is the package divided?


The package is divided into three parts. How the functions are connected is shown in the adjoining figure. The import of the raw data is left up to the user, with the overall aim being that all the data is present in one dataframe. The user can either load data so that each row is an endoscopic episode (or a pathology report) in its raw form and then allow the package to extract the relevant parts, or the data can be pre-extracted (i.e. separate columns for the endoscopist, medication given, etc.) so that the extraction step is skipped. The package accepts either form.

  1. Extraction- This applies when the data is provided as full-text reports. You may already have the data in a spreadsheet, in which case this part isn’t necessary. The extraction is provided as one function, Extractor, explained below.

  2. Cleaning- These are a group of functions that allow the user to extract and clean data commonly found in endoscopic and pathology reports. The cleaning functions usually remove common typos or extraneous information and do some reformatting. Some of the functions will also extract derived data into separate columns. The cleaning functions are provided on a per-column basis (so if you have a column containing the endoscopist name, for example, then EndoscEndoscopist will clean this). However, convenience functions are also provided to run several cleaning functions at the same time, as long as the relevant columns are present. EndoscAll, for example, will run several of the cleaning functions as long as the columns are properly named so that the subfunctions are run on the correct columns. This is also true for HistolAll.

  3. Analyses- The analyses provide graphing functions as well as analyses according to the cornerstone questions in gastroenterology- namely surveillance, patient tracking, quality of endoscopy and pathology reporting and diagnostic yield questions as explained in the EndoMineR principles pages. The analyses are separated into generic analyses that are relevant to any endo-pathological dataset, as well as specific analyses for adenoma detection rates and Barrett’s surveillance and therapy. Further disease specific datasets will be included in future iterations.
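The convenience wrappers described in item 2 run each cleaning step only when its column is present. That pattern can be sketched in base R as follows. Note that apply_cleaners, the cleaner functions and the column names here are hypothetical stand-ins for illustration and are not EndoMineR's own code.

```r
# Hypothetical helper (not part of EndoMineR): run each cleaner only on
# columns that actually exist in the data frame.
apply_cleaners <- function(df, cleaners) {
  for (col in intersect(names(cleaners), names(df))) {
    df[[col]] <- cleaners[[col]](df[[col]])
  }
  df
}

# Illustrative cleaners keyed by column name (assumed names).
cleaners <- list(
  Endoscopist = function(x) trimws(sub("^Dr\\.?\\s+", "", x, perl = TRUE)),
  Instrument  = toupper
)

# This data frame has no Instrument column, so only Endoscopist is cleaned.
reports <- data.frame(
  Endoscopist = c("Dr Ives, Rashiah", "Dr. Ives, Rashiah"),
  stringsAsFactors = FALSE
)
cleaned_reports <- apply_cleaners(reports, cleaners)
```

Keying the cleaners by column name is what makes proper column naming matter: a cleaner whose name matches no column is simply skipped.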

The extractor function

Endoscopic and pathological data will come in one of two forms: either as a collection of whole-text reports, or as spreadsheets in which the various aspects of each report (e.g. who the endoscopist was, the patient’s unique identifier, etc.) have already been separated into different columns. For the latter, the package user will not need to extract information, as it is already extracted, and so can go straight to cleaning the data. For the former, the Extractor function has been provided:



Different hospitals use different software with different headings for endoscopic reports. The Extractor function allows the user to define the separations in a report so that all reports can be automatically placed into a meaningful dataframe for further cleaning. Here we use the in-built datasets provided as part of the package.


A list of keywords is then constructed. This list is made up of the words that will be used to split the document. It is very common for individual departments in both gastroenterology and pathology to use semi-structured reporting, so that the same headers are used between patient reports. The list is therefore populated with these headers as defined by the user. The Extractor then does the splitting for each pair of keywords: it places the text between the two delimiters in a new column, names that column after the first keyword in the pair (with some cleaning up) and returns the new dataframe. Here we use an example dataset (which has not had separate columns selected already) as the input:

PathReportWhole
Hospital: Random NHS Foundation Trust Hospital Number: H2890235 Patient Name: al-Bilal, Widdad DOB: 1922-05-04 General Practitioner: Dr. Mondragon, Amber Date received: 2002-11-10 Clinical Details: Previous had serrated lesions ?,If looks more like UC, please provide Nancy severity index 3 specimen. Nature of specimen: Nature of specimen as stated on pot = ‘Ascending colon x2’|,Nature of specimen as stated on request form = ‘rectum’|,Nature of specimen as stated on pot = ‘4X LOWER, 4X UPPER OESOPHAGUS’|,Nature of specimen as stated on pot = ‘rectal polyp’| Macroscopic description: 1 specimens collected the largest measuring 3 x 5 x 2 mm and the smallest 3 x 5 x 5 mm Histology: The appearances are of a hyperplastic polyp.,8 pieces of tissue, the largest measuring 4 x 36 x 2 mm and the smallest 3 x 3.,Completeness of excision is uncertain as the base is not clearly visualised.,There is no ulceration.,Kikuchi level: sm2. Diagnosis: Colon, biopsy - Normal.,- Focal granulomatous inflammation, non-necrotising.,Duodenum, biopsy - within normal histological limits.,Sigmoid colon, polypectomy: - Tubular adenoma with moderate dysplasia.,- Hyperplastic polyp .,Caecum polyp biopsies:- tubular adenoma, low grade dysplasia.,- Mild chronic inflammation within the oesophageal mucosa.,Sigmoid colon biopsies:- normal mucosa.,Sigmoid polyp excision:- tubular adenoma.


We can then define the list of delimiters that will split this text into separate columns, title the columns according to the delimiters and return a dataframe. Each column simply contains the text between the delimiters that the user has defined. These columns are then ready for the more refined cleaning provided by subsequent functions.


mywords <- c("Hospital Number", "Patient Name:", "DOB:", "General Practitioner:",
             "Date received:", "Clinical Details:", "Macroscopic description:",
             "Histology:", "Diagnosis:")
PathDataFrameFinalColon2 <- Extractor(PathDataFrameFinalColon2, "PathReportWhole", mywords)
HospitalNumber PatientName DOB GeneralPractitioner Datereceived ClinicalDetails Macroscopicdescription Histology Diagnosis
H2890235 al-Bilal, Widdad 1922-05-04 Dr Mondragon, Amber 2002-11-10 Previous had serrated lesions ?,If looks more like UC, please provide Nancy severity index 3 specimen Nature of specimen: Nature of specimen as stated on pot = ‘Ascending colon x2’|,Nature of specimen as stated on request form = ‘rectum’|,Nature of specimen as stated on pot = ‘4X LOWER, 4X UPPER OESOPHAGUS’|,Nature of specimen as stated on pot = ‘rectal polyp’| 1 specimens collected the largest measuring 3 x 5 x 2 mm and the smallest 3 x 5 x 5 mm The appearances are of a hyperplastic polyp ,8 pieces of tissue, the largest measuring 4 x 36 x 2 mm and the smallest 3 x 3 ,Completeness of excision is uncertain as the base is not clearly visualised ,There is no ulceration ,Kikuchi level: sm2 Colon, biopsy - Normal ,- Focal granulomatous inflammation, non-necrotising ,Duodenum, biopsy - within normal histological limits ,Sigmoid colon, polypectomy: - Tubular adenoma with moderate dysplasia ,- Hyperplastic polyp ,Caecum polyp biopsies:- tubular adenoma, low grade dysplasia ,- Mild chronic inflammation within the oesophageal mucosa ,Sigmoid colon biopsies:- normal mucosa ,Sigmoid polyp excision:- tubular adenoma
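To make the splitting step concrete, here is a minimal base R sketch of the idea: find each delimiter, take the text up to the next one and name the column after the delimiter. The function split_report and the example report are hypothetical; this is not the Extractor implementation, and it assumes every delimiter occurs once, in the order given.

```r
# Hypothetical helper (not EndoMineR's Extractor): split one report on
# consecutive delimiters. Assumes each delimiter is present, in order.
split_report <- function(text, delims) {
  # Position of the first literal occurrence of each delimiter.
  starts <- vapply(delims, function(d) regexpr(d, text, fixed = TRUE)[1],
                   integer(1))
  ends  <- starts + nchar(delims)             # first character after each delimiter
  stops <- c(starts[-1] - 1L, nchar(text))    # last character before the next one
  out <- trimws(substring(text, ends, stops))
  # Column names: the delimiter with spaces and punctuation stripped.
  names(out) <- gsub("[^A-Za-z]", "", delims)
  as.data.frame(as.list(out), stringsAsFactors = FALSE)
}

example_report <- paste(
  "Hospital Number: H123 Patient Name: Smith, Jane",
  "DOB: 1950-01-01 Diagnosis: Normal."
)
example_delims <- c("Hospital Number:", "Patient Name:", "DOB:", "Diagnosis:")
parsed <- split_report(example_report, example_delims)
```

Here parsed is a one-row data frame with the columns HospitalNumber, PatientName, DOB and Diagnosis.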

Endoscopic cleaning

Once the extraction into separate columns has been done, various cleaning functions can be used for individual columns. This is illustrated in the figure below. No particular column has to be present in the final dataframe once extraction has happened. The functions are defined according to the columns most likely to result from extraction of a typical dataset. If endoscopy reports are being extracted, then the functions concentrate on these.



Endoscopist cleaning

For example, if the endoscopist name has been pulled out, the EndoscEndoscopist function can be used. It returns the submitted dataframe with the Endoscopist column cleaned up.


The Endoscopist column might initially look like this (as the last column in this dataframe):


HospitalNumber  PatientName           GeneralPractitioner   Dateofprocedure  Endoscopist
J6044658        Jargon, Victoria      Dr Martin, Marche     2009-11-11       Dr Sullivan, Shelby
Y6417773        Powell, Destiny       Dr al-Safi, Lutfiyya  2008-06-15       Dr Kekich, Annabelle
B6072011        Martinez-Santos, Ana  Dr Rogers, Monica     2007-10-27       Dr Sullivan, Shelby
G1449886        Lopez, Maria          Dr Heilman, Lisa      2002-03-17       Dr Avitia-Ramirez, Alondra
V1607560        al-Rahimi, Rif’a      Dr Krumland, Lisa     2011-12-05       Dr Greimann, Phoua
I8031481        Forrest, Dazheea      Dr Millman, Arianna   2014-09-19       Dr Avitia-Ramirez, Alondra
W2120051        Naperola, Breanna     Dr Vigil, Lidia       2002-05-28       Dr Martinez, Maegen
O7163832        Zuni, Shannon         Dr Merced, Essence    2009-09-19       Dr Anderson, Alana
P6620949        Gomez Barron, Erin    Dr Ursery, Dezire     2003-10-02       Dr Anderson, Alana
L4378217        Hamm, Shebra          Dr Bauman, Caitlin    2016-11-22       Dr Ives, Rashiah
Myendo2<-EndoscEndoscopist(Myendo,'Endoscopist')

This function cleans up common features of the text that may cause confusion, such as removing the titles ahead of the endoscopist’s name, removing whitespace, etc. This is important to prevent the same endoscopist appearing twice in the output because of, for example, the presence or absence of a ‘.’ after Dr, amongst other variations. The result is as follows:


HospitalNumber  PatientName           GeneralPractitioner   Dateofprocedure  Endoscopist
J6044658        Jargon, Victoria      Dr Martin, Marche     2009-11-11       Sullivan, Shelby
Y6417773        Powell, Destiny       Dr al-Safi, Lutfiyya  2008-06-15       Kekich, Annabelle
B6072011        Martinez-Santos, Ana  Dr Rogers, Monica     2007-10-27       Sullivan, Shelby
G1449886        Lopez, Maria          Dr Heilman, Lisa      2002-03-17       Avitia-Ramirez, Alondra
V1607560        al-Rahimi, Rif’a      Dr Krumland, Lisa     2011-12-05       Greimann, Phoua
I8031481        Forrest, Dazheea      Dr Millman, Arianna   2014-09-19       Avitia-Ramirez, Alondra
W2120051        Naperola, Breanna     Dr Vigil, Lidia       2002-05-28       Martinez, Maegen
O7163832        Zuni, Shannon         Dr Merced, Essence    2009-09-19       Anderson, Alana
P6620949        Gomez Barron, Erin    Dr Ursery, Dezire     2003-10-02       Anderson, Alana
L4378217        Hamm, Shebra          Dr Bauman, Caitlin    2016-11-22       Ives, Rashiah
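The kind of normalisation involved can be illustrated with a short base R sketch. The function clean_names and its list of titles are assumptions made for illustration; the rules that EndoscEndoscopist actually applies may differ.

```r
# Hypothetical sketch (not the EndoscEndoscopist source): strip common
# titles and normalise whitespace so name variants collapse to one value.
clean_names <- function(x) {
  # Remove a leading title with or without a trailing full stop.
  x <- gsub("\\b(Dr|Mr|Mrs|Ms|Prof)\\.?\\s+", "", x, perl = TRUE)
  # Collapse runs of whitespace and trim both ends.
  x <- gsub("\\s+", " ", x, perl = TRUE)
  trimws(x)
}

raw_names <- c("Dr Sullivan, Shelby", "Dr. Sullivan, Shelby",
               "  Dr  Sullivan,  Shelby ")
cleaned_names <- clean_names(raw_names)
```

All three variants in raw_names reduce to the single value "Sullivan, Shelby", so a later tally per endoscopist counts them together.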


Medication cleaning

The EndoscMeds function currently extracts Fentanyl, Pethidine, Midazolam and Propofol doses into separate columns and reformats them as numeric columns so that further calculations can be done.

Several other similar clean-up functions are available for endoscopy, as follows. We will extract from the raw endoscopy data first:

library(pander)  # provides pander() for printing the table
mywords <- c("Hospital:", "Hospital Number:", "Patient Name:", "General Practitioner:",
             "Date of procedure:", "Endoscopist:", "Second Endoscopist",
             "Medications:", "Instrument:", "Extent of Exam:", "Indications:",
             "Procedure Performed:", "Findings:", "Diagnosis:")
TheOGDReportFinal2 <- Extractor(TheOGDReportFinal, "OGDReportWhole", mywords)
# Keep only the columns used in the cleaning examples
TheOGDReportFinal2df <- data.frame(TheOGDReportFinal2["HospitalNumber"],
                                   TheOGDReportFinal2["Instrument"],
                                   TheOGDReportFinal2["Indications"],
                                   TheOGDReportFinal2["Medications"],
                                   TheOGDReportFinal2["ProcedurePerformed"])
pander(head(TheOGDReportFinal2df, 10))
HospitalNumber  Instrument  Indications                  Medications                     ProcedurePerformed
J6044658        FG5         Follow-up ULCER HEALING      Fentanyl 12 5mcg Midazolam 6mg  Gastroscopy (OGD)
Y6417773        FG6         Weight Loss                  Fentanyl 125mcg Midazolam 7mg   Gastroscopy (OGD)
B6072011        FG2         Follow-up ULCER HEALING      Fentanyl 125mcg Midazolam 6mg   Gastroscopy (OGD)
G1449886        FG1         Other-                       Fentanyl 12 5mcg Midazolam 2mg  Gastroscopy (OGD)
V1607560        FG2         Previous OGD ? 8 months ago  Fentanyl 75mcg Midazolam 6mg    Gastroscopy (OGD)
I8031481        FG6         Surveillance-Barrett’s       Fentanyl 150mcg Midazolam 3mg   Gastroscopy (OGD)
W2120051        FG5         Dyspepsia                    Fentanyl 125mcg Midazolam 5mg   Gastroscopy (OGD)
O7163832        FG2         Oesophagus- Dysplasia        Fentanyl 75mcg Midazolam 3mg    Gastroscopy (OGD)
P6620949        FG4         Oesophagus- Dysplasia        Fentanyl 25mcg Midazolam 1mg    Gastroscopy (OGD)
L4378217        FG7         Therapeutic- Dilatation      Fentanyl 150mcg Midazolam 1mg   Gastroscopy (OGD)
v<-EndoscMeds(TheOGDReportFinal2df,'Medications')
HospitalNumber  Instrument  Indications                  Medications                     ProcedurePerformed  Fent  Midaz  Peth  Prop
J6044658        FG5         Follow-up ULCER HEALING      Fentanyl 12 5mcg Midazolam 6mg  Gastroscopy (OGD)   5     6      5     5
Y6417773        FG6         Weight Loss                  Fentanyl 125mcg Midazolam 7mg   Gastroscopy (OGD)   125   7      125   125
B6072011        FG2         Follow-up ULCER HEALING      Fentanyl 125mcg Midazolam 6mg   Gastroscopy (OGD)   125   6      125   125
G1449886        FG1         Other-                       Fentanyl 12 5mcg Midazolam 2mg  Gastroscopy (OGD)   5     2      5     5
V1607560        FG2         Previous OGD ? 8 months ago  Fentanyl 75mcg Midazolam 6mg    Gastroscopy (OGD)   75    6      75    75
I8031481        FG6         Surveillance-Barrett’s       Fentanyl 150mcg Midazolam 3mg   Gastroscopy (OGD)   150   3      150   150
W2120051        FG5         Dyspepsia                    Fentanyl 125mcg Midazolam 5mg   Gastroscopy (OGD)   125   5      125   125
O7163832        FG2         Oesophagus- Dysplasia        Fentanyl 75mcg Midazolam 3mg    Gastroscopy (OGD)   75    3      75    75
P6620949        FG4         Oesophagus- Dysplasia        Fentanyl 25mcg Midazolam 1mg    Gastroscopy (OGD)   25    1      25    25
L4378217        FG7         Therapeutic- Dilatation      Fentanyl 150mcg Midazolam 1mg   Gastroscopy (OGD)   150   1      150   150
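As an illustration of how a dose can be pulled out of a free-text medication string and made numeric, here is a minimal base R sketch. The function extract_dose and its pattern are assumptions for illustration only; they are not the code behind EndoscMeds, and they return NA when the named drug is absent or its dose is not written as a single number.

```r
# Hypothetical sketch (not the EndoscMeds source): extract the numeric
# dose that directly follows a drug name, or NA when none is found.
extract_dose <- function(x, drug) {
  # Drug name, optional spaces, a number (optionally decimal), then a unit.
  pattern <- paste0(drug, "\\s*([0-9]+(?:\\.[0-9]+)?)\\s*(?:mcg|mg)")
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  vapply(m, function(hit) {
    if (length(hit) >= 2) as.numeric(hit[2]) else NA_real_
  }, numeric(1))
}

meds <- c("Fentanyl 125mcg Midazolam 7mg", "Midazolam 2mg")
fent_dose  <- extract_dose(meds, "Fentanyl")
midaz_dose <- extract_dose(meds, "Midazolam")
```

Returning NA rather than 0 for a drug that was not given keeps "not recorded" distinct from a genuine zero in later calculations.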

Instrument, Indications and Procedure type cleaning

EndoscInstrument, EndoscIndications and EndoscProcPerformed all perform similar cleaning functions with the endoscope number, the indication for the investigation and the actual procedure performed respectively. Future iterations will try to make these cleaning functions more generic and applicable to a wider number of use cases.

Histological cleaning

The cleaning functions for histology are a little more difficult, as histology reports often have a greater degree of free-text reporting. In general, each histology report contains a macroscopic description of the specimens, which states how many specimens there are for each sample sent (a sample can be a pot that includes several specimens) and how big each specimen is. The report will often then give a detailed description of what is actually seen and provide an overall diagnosis.



Basic Histology Cleaning

The histology cleaning functions are based around this structure. For example, the HistolHistol function cleans the Histology text if present.


The original input example can be seen here:

## [1] "  Two biopsies consist of small bowel mucosa and are within normal histological limits\n\n"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
## [2] "  modified giemsa stain\n,These are biopsies of gastric mucosa ,There is no evidence of coeliac disease\n,The nuclei are hyperchromatic,\n,There is no granulomatous inflammation\n,The appearances are in keeping with a reactive/chemical gastritis,features including basal layer hyperplasia and reactive nucelar changes with underlying\n,These are two biopsies of squamous epithelium within normal limits,fibromuscularisation of the lamina propria and mild chronic inflammation\n,These biopsies of columnar mucosa show focal acute inflammation, moderate chronic inflammation\n\n"

And once the function is run the result is here:

t<-HistolHistol(Mypath,'Histology')
## [1] ""                                                                                                                                                                                                                                                                                                                                                                    
## [2] "  modified giemsa stain ,These are biopsies of gastric mucosa .\nThe nuclei are hyperchromatic,\n.\n,The appearances are in keeping with a reactive/chemical gastritis,features including basal layer hyperplasia and reactive nucelar changes with underlying\n.\n,These biopsies of columnar mucosa show focal acute inflammation, moderate chronic inflammation\n"


Extraction of Diagnosis

Some pathology reports also provide an overall impression or a list of diagnoses interpreted from the description of the pathological text. This can also be extracted. The diagnoses may also include the absence of features, and a function is provided both to clean up the Diagnosis column and to exclude negative diagnoses. If a Diagnosis column is present, the function can be run on input such as:

## [1] "  Distal transverse colon polyp excision:- tubular adenoma, low grade dysplasia\n,Ileo-caecal valve, biopsies:\n,Stomach antrum biopsies:- normal mucosa\n,- Up to 34 eosinophils per high power field,Stomach, biopsy - Mild chronic inflammation\n"                                                                                                                                                                                
## [2] "  Rectum, polyp biopsy: - Tubular adenoma with mild dysplasia,- Raised intra-epithelial lymphocytes ,Duodenum, biopsies - within normal histological limits\n,B GI biopsy - DISTAL OESOPHAGUS X2, MID OESO X3, PROX OESO X2\n,Oesophagus, biopsies : - Minimal chronic inflammation,Sigmoid colon, polypectomy: - Tubular adenoma with moderate dysplasia,Oesophagus polyps biopsies:- 2 x papillomas\n,Duodenum biopsies:- normal\n"


with the following result:


t<-HistolDx(Mypath,'Diagnosis')
## [1] "  Distal transverse colon polyp excision:\ntubular adenoma, low grade dysplasia\n,\nUp to 34 eosinophils per high power field,Stomach, biopsy \nMild chronic inflammation\n"                                                                                           
## [2] "B GI biopsy \nDISTAL OESOPHAGUS X2, MID OESO X3, PROX OESO X2\n,Oesophagus, biopsies : \nMinimal chronic inflammation,Sigmoid colon, polypectomy: \nTubular adenoma with moderate dysplasia,Oesophagus polyps biopsies:\n2 x papillomas\n,Duodenum biopsies:\nnormal\n"

Extraction of elements from the Macroscopic Description

Because the information from the Macroscopic Description is based around numbers, a further function called HistolNumbOfBx has been provided to extract the number of biopsies taken.

To extract the numbers, the user supplies the word that marks the end of the selection. The function collects everything matching the regex [0-9]{1,2}.{0,3} up to that boundary word. For example, if the report usually says:


Mypath.HospitalNumber  Mypath.PatientName    Mypath.Macroscopicdescription
J6044658               Jargon, Victoria      3 specimens collected the largest measuring 3 x 2 x 1 mm and the smallest 2 x 1 x 5 mm
Y6417773               Powell, Destiny       4 specimens collected the largest measuring 4 x 4 x 4 mm and the smallest 5 x 3 x 1 mm
B6072011               Martinez-Santos, Ana  9 specimens collected the largest measuring 2 x 5 x 2 mm and the smallest 1 x 1 x 4 mm
G1449886               Lopez, Maria          4 specimens collected the largest measuring 5 x 4 x 1 mm and the smallest 1 x 3 x 3 mm
V1607560               al-Rahimi, Rif’a      5 specimens collected the largest measuring 2 x 2 x 1 mm and the smallest 3 x 4 x 3 mm


Based on this, the word that limits the number you are interested in is ‘specimen’, so the function and its output are:


v<-HistolNumbOfBx(Mypath,'Macroscopicdescription','specimen')
v.HospitalNumber  v.PatientName         v.NumbOfBx
J6044658          Jargon, Victoria      3
Y6417773          Powell, Destiny       4
B6072011          Martinez-Santos, Ana  9
G1449886          Lopez, Maria          4
V1607560          al-Rahimi, Rif’a      5
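The role of the boundary word can be seen in a small base R sketch built on the same regex fragment. The function count_before_word is hypothetical and only illustrates the idea; it is not the HistolNumbOfBx source.

```r
# Hypothetical sketch (not the HistolNumbOfBx source): capture a one- or
# two-digit number that sits at most three characters before a boundary word.
count_before_word <- function(x, boundary) {
  pattern <- paste0("([0-9]{1,2}).{0,3}", boundary)
  m <- regmatches(x, regexec(pattern, x))
  vapply(m, function(hit) {
    if (length(hit) >= 2) as.integer(hit[2]) else NA_integer_
  }, integer(1))
}

macro <- c(
  "3 specimens collected the largest measuring 3 x 2 x 1 mm and the smallest 2 x 1 x 5 mm",
  "No measurements recorded"
)
n_bx <- count_before_word(macro, "specimen")
```

Anchoring on the boundary word is what stops the measurement digits later in the sentence (3 x 2 x 1 mm) from being picked up as the biopsy count.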

Extraction of Specific Disease Entities

The user may want to extract specific diseases from a histology dataset. This can be done using the function HistolExtrapolDx, which simply takes the Diagnosis column and looks up the presence or absence of certain diseases. The function has been hard coded to look for dysplasia, cancer or GIST but will also take user-defined words. These have to be supplied as a regular expression, or can be left as an empty string, as in the example:

Mypath3<-data.frame(Mypath["HospitalNumber"],Mypath["Diagnosis"])
HospitalNumber Diagnosis
J6044658 Distal transverse colon polyp excision:- tubular adenoma, low grade dysplasia ,Ileo-caecal valve, biopsies: ,Stomach antrum biopsies:- normal mucosa ,- Up to 34 eosinophils per high power field,Stomach, biopsy - Mild chronic inflammation
Y6417773 Rectum, polyp biopsy: - Tubular adenoma with mild dysplasia,- Raised intra-epithelial lymphocytes ,Duodenum, biopsies - within normal histological limits ,B GI biopsy - DISTAL OESOPHAGUS X2, MID OESO X3, PROX OESO X2 ,Oesophagus, biopsies : - Minimal chronic inflammation,Sigmoid colon, polypectomy: - Tubular adenoma with moderate dysplasia,Oesophagus polyps biopsies:- 2 x papillomas ,Duodenum biopsies:- normal
B6072011 - Background Barrett ‘s oesophagus,Sigmoid colon, biopsy - Adenocarcinoma ,- Gastric metaplasia,Oesophagus 36cm ’papilloma’ biopsy:- normal squamous mucosa ,- Chronic active inflammation,Oesophagus, biopsy - Barrett ’s oesophagus with moderate chronic inflammation ,- Minimal chronic inflammation
G1449886 Stomach, biopsy - Mild chronic inflammation and reactive changes ,- Normal,- note: biopsies put into the wrong pots ,Oesophagus, biopsy - Poorly differentiated tumour ,Rectum, polyp biopsy: - Tubular adenoma with mild dysplasia,- Mild chronic inflammation and oedema,-Inflammatory fibroid polyp,- within normal histological limits,- Negative for HLO
V1607560 Nodule GOJ, biopsies:- acute and chronic inflammation with Helicobacter ,Stomach, biopsy -Mild acute and chronic inflammation ,Oesophagus polyps biopsies:- 2 x papillomas ,- <1 mm from lateral margin,Duodenum biopsies:- patchy increase in IELs ,Duodenum, biopsy - Normal
I8031481 Duodenum and stomach, polyp biopsies - Consistent with hamartomatous polyps ,Gastric oesophageal junction, biopsies : - Chronic inflammation,Stomach, biopsy - Mild chronic inflammation ,- Gastric HER2 negative,- Minimal chronic inflammation,Oesophagus, biopsy - Acute inflammation in presumed proximal biopsies only ,Gastro-osophageal junction, biopsy - Squamocolumnar mucosa ,Descending colon biopsies:- normal mucosa ,- Within normalhistological limits,The biopsies of gastric oesophageal junction type squamo-columnar mucosa show mild chronic
W2120051 - Negative for helicobacter,- Intestinal metaplasia ,Oesophagus biopsies:- normal ,Ileum and colon biopsies:- normal mucosa ,Oesophagus, EMR 43P - Barrett ’s oesophagus without intestinal metaplasia ,Sigmoid colon, biopsy - Adenocarcinoma ,- Chronic active inflammation,Duodenum biopsies:- normal mucosa ,- Low grade dysplasia
O7163832 - possible eosinophilic oesophagitis,- Mild chronic gastritis,A -E) Stomach, polyps, biopsies: ,- Mild chronic inflammation and oedema,- Focal mild chronic inflammation,- Gastric HER2 negative,Ileum and colon biopsies:- normal mucosa ,Adjacent mucosa, biopsy - Normal small bowel mucosa
P6620949 - Chronic active gastritis,Stomach, biopsy - Chronic, moderately active Helicobacter associated gastritis ,Oesophaguas biopsies:- normal mucosa ,- Acute inflammatory exudate,Right and left colon, biopsies: - Within normal histological limits
L4378217 Stomach, biopsy - Chronic, moderately active Helicobacter associated gastritis ,- tubular adenoma, low grade dysplasia x 1 ,- Tubular adenoma,- Tubular adenoma,Sigmoid colon, polyp biopsy - Hyperplastic polyp ,Stomach, biopsy - Reactive gastritis and intestinal metaplasia ,- Chronic inflammation,- Mild chronic inflammation
Mypath3<-HistolExtrapolDx(Mypath3,"Diagnosis","")
HospitalNumber Diagnosis Extracted
J6044658 Distal transverse colon polyp excision:- tubular adenoma, low grade dysplasia ,Ileo-caecal valve, biopsies: ,Stomach antrum biopsies:- normal mucosa ,- Up to 34 eosinophils per high power field,Stomach, biopsy - Mild chronic inflammation dyspla
Y6417773 Rectum, polyp biopsy: - Tubular adenoma with mild dysplasia,- Raised intra-epithelial lymphocytes ,Duodenum, biopsies - within normal histological limits ,B GI biopsy - DISTAL OESOPHAGUS X2, MID OESO X3, PROX OESO X2 ,Oesophagus, biopsies : - Minimal chronic inflammation,Sigmoid colon, polypectomy: - Tubular adenoma with moderate dysplasia,Oesophagus polyps biopsies:- 2 x papillomas ,Duodenum biopsies:- normal dyspla, dyspla
B6072011 - Background Barrett ‘s oesophagus,Sigmoid colon, biopsy - Adenocarcinoma ,- Gastric metaplasia,Oesophagus 36cm ’papilloma’ biopsy:- normal squamous mucosa ,- Chronic active inflammation,Oesophagus, biopsy - Barrett ’s oesophagus with moderate chronic inflammation ,- Minimal chronic inflammation carcin
G1449886 Stomach, biopsy - Mild chronic inflammation and reactive changes ,- Normal,- note: biopsies put into the wrong pots ,Oesophagus, biopsy - Poorly differentiated tumour ,Rectum, polyp biopsy: - Tubular adenoma with mild dysplasia,- Mild chronic inflammation and oedema,-Inflammatory fibroid polyp,- within normal histological limits,- Negative for HLO tumour, dyspla
V1607560 Nodule GOJ, biopsies:- acute and chronic inflammation with Helicobacter ,Stomach, biopsy -Mild acute and chronic inflammation ,Oesophagus polyps biopsies:- 2 x papillomas ,- <1 mm from lateral margin,Duodenum biopsies:- patchy increase in IELs ,Duodenum, biopsy - Normal
I8031481 Duodenum and stomach, polyp biopsies - Consistent with hamartomatous polyps ,Gastric oesophageal junction, biopsies : - Chronic inflammation,Stomach, biopsy - Mild chronic inflammation ,- Gastric HER2 negative,- Minimal chronic inflammation,Oesophagus, biopsy - Acute inflammation in presumed proximal biopsies only ,Gastro-osophageal junction, biopsy - Squamocolumnar mucosa ,Descending colon biopsies:- normal mucosa ,- Within normalhistological limits,The biopsies of gastric oesophageal junction type squamo-columnar mucosa show mild chronic
W2120051 - Negative for helicobacter,- Intestinal metaplasia ,Oesophagus biopsies:- normal ,Ileum and colon biopsies:- normal mucosa ,Oesophagus, EMR 43P - Barrett ’s oesophagus without intestinal metaplasia ,Sigmoid colon, biopsy - Adenocarcinoma ,- Chronic active inflammation,Duodenum biopsies:- normal mucosa ,- Low grade dysplasia carcin, dyspla
O7163832 - possible eosinophilic oesophagitis,- Mild chronic gastritis,A -E) Stomach, polyps, biopsies: ,- Mild chronic inflammation and oedema,- Focal mild chronic inflammation,- Gastric HER2 negative,Ileum and colon biopsies:- normal mucosa ,Adjacent mucosa, biopsy - Normal small bowel mucosa
P6620949 - Chronic active gastritis,Stomach, biopsy - Chronic, moderately active Helicobacter associated gastritis ,Oesophaguas biopsies:- normal mucosa ,- Acute inflammatory exudate,Right and left colon, biopsies: - Within normal histological limits
L4378217 Stomach, biopsy - Chronic, moderately active Helicobacter associated gastritis ,- tubular adenoma, low grade dysplasia x 1 ,- Tubular adenoma,- Tubular adenoma,Sigmoid colon, polyp biopsy - Hyperplastic polyp ,Stomach, biopsy - Reactive gastritis and intestinal metaplasia ,- Chronic inflammation,- Mild chronic inflammation dyspla
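The lookup can be pictured as a regular expression search that lists every hit found in each diagnosis. The base R sketch below uses a hypothetical function, extract_terms, with an assumed built-in pattern chosen to resemble the values in the Extracted column above; the pattern that HistolExtrapolDx actually uses may differ.

```r
# Hypothetical sketch (not the HistolExtrapolDx source): list every match
# of a built-in pattern, optionally extended with a user-supplied regex.
extract_terms <- function(x, user_pattern = "") {
  pattern <- "dyspla|carcin|tumour"   # assumed built-in terms
  if (nzchar(user_pattern)) {
    pattern <- paste(pattern, user_pattern, sep = "|")
  }
  hits <- regmatches(x, gregexpr(pattern, x, ignore.case = TRUE))
  # Join the matches for each report; no match gives an empty string.
  vapply(hits, function(h) paste(tolower(h), collapse = ", "), character(1))
}

dx <- c(
  "Rectum, polyp biopsy: - Tubular adenoma with mild dysplasia, Sigmoid colon, biopsy - Adenocarcinoma",
  "Duodenum, biopsy - Normal"
)
default_terms <- extract_terms(dx)
with_adenoma  <- extract_terms(dx, "adenoma")
```

With the empty default, the first report yields "dyspla, carcin" and the second an empty string. Passing "adenoma" as the user pattern adds that term to the first result.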

Other functions, which are less generally useful but may help in certain hospitals and certain situations, include:

  1. HistolAccessionNumber, which extracts Accession Number data from the report where one is present. The Accession Number relates to the actual specimen number as ascribed by the pathology service.



Removal of negatives

In addition, if there is a need to remove all sentences that give negative diagnoses (e.g. “There is no evidence of…”) so that false positive diagnoses are not made during the analysis stage, a further function called NegativeRemove can be applied. It can be used as a standalone function but is also implemented within the HistolDx function, which extracts and cleans the diagnosis from the Histology text to provide a Simplified Diagnosis column.


The original input example can be seen here:

## [1] "  - Negative for helicobacter,- Intestinal metaplasia ,Oesophagus biopsies:- normal\n,Ileum and colon biopsies:- normal mucosa\n,Oesophagus, EMR 43P - Barrett 's oesophagus without intestinal metaplasia\n,Sigmoid colon, biopsy - Adenocarcinoma\n,- Chronic active inflammation,Duodenum biopsies:- normal mucosa\n,- Low grade dysplasia"

If we apply the function NegativeRemove we see this changes to:

MypathNegRem<-NegativeRemove(Mypath,"Diagnosis")
## [1] ",Ileum and colon biopsies:- normal mucosa\n,Oesophagus, EMR 43P - Barrett 's oesophagus without intestinal metaplasia\n,Sigmoid colon, biopsy - Adenocarcinoma\n,- Chronic active inflammation,Duodenum biopsies:- normal mucosa\n,- Low grade dysplasia"
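The principle of dropping negative statements can be sketched in base R by splitting the text into lines and discarding any line that contains a negative phrase. The function remove_negatives and its phrase list are assumptions for illustration; NegativeRemove itself recognises its own, probably larger, set of phrases.

```r
# Hypothetical sketch (not the NegativeRemove source): drop every line
# that contains one of a few assumed negative phrases.
remove_negatives <- function(x,
                             negative_pattern = "no evidence of|negative for|there is no ") {
  vapply(strsplit(x, "\n", fixed = TRUE), function(sentences) {
    keep <- !grepl(negative_pattern, sentences, ignore.case = TRUE)
    paste(sentences[keep], collapse = "\n")
  }, character(1))
}

histology <- paste(
  "These are biopsies of gastric mucosa",
  "There is no evidence of coeliac disease",
  "The nuclei are hyperchromatic",
  sep = "\n"
)
positive_only <- remove_negatives(histology)
```

After this step a later search for "coeliac" finds nothing in positive_only, which is the false positive the removal is meant to prevent.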