Recognize, Annotate, and Visualize Parallel Content Structures in XML Documents

Abstract—We present a four-phase approach for capturing, annotating, and visualizing parallel structures in XML documents. We designed a highlighting strategy that first decomposes XML documents into various data streams, including plain text, formulae, and images. Second, these streams are processed with external algorithms and tools optimized for specific tasks, such as analyzing similarities or differences in the respective formats. Third, we compute comparison metadata such as annotations and highlighting marks. Fourth, the processed streams are recombined based on the computed positions in the original XML document. The resulting comparison can then be visualized or processed further while keeping the reference to the source documents intact. While our algorithm has been developed for visualizing similarities as part of plagiarism detection tasks, we expect that many applications will benefit from a well-designed, integrative method that separates addressing match locations from inserting highlight marks. For example, our algorithm also enables adding comments via XML-unaware plaintext editors. Our approach also handles edge cases, overlapping matches, and multiple matches.


I. INTRODUCTION
XML is an important and widely used format (e.g., HTML, ODT, and DOCX are all based on XML) for representing, storing, and exchanging data in the form of documents for numerous use cases. In science, TEI (digital humanities) and JATS (STEM disciplines) are important XML-based document formats. The Text Encoding Initiative (TEI) has become a de facto standard within the humanities [1], where it is used, for example, to encode printed works (scholarly editing) or to mark up linguistic information (linguistics) in texts. Examining documents for similarities or differences, storing the corresponding results, and visualizing them for users are common requirements in document processing. Important use cases are, for example, tracking changes in collaboratively edited documents (Microsoft Word / OpenOffice) or the detection of plagiarism in scientific publications. Particularly in the area of academic plagiarism [2], which ranges from copying to the highly concealed reuse of content, e.g., by paraphrasing or translating text, and to the reuse of data or ideas without proper attribution [3], investigating similarities in documents is essential. Detecting hidden academic plagiarism in research publications is therefore an urgent problem that concerns many stakeholders, including academic publishers, research institutions, funding agencies, and, of course, other researchers.
Another area of data analysis relates to digital editions, a core area of the digital humanities: investigations of similarities and general comparisons in complex textual and semistructured datasets support further research questions and reuse [4]. Moreover, (semi-)automatic processing steps (e.g., text recognition in manuscripts and inscriptions, heuristic and inferential statistical detection of structural relationships in, and empirical analyses of, language and text corpora) and their systematic evaluation (e.g., image analysis, metadata enrichment, directed information, graphical models, word embeddings, interaction and social networks) are of central importance in digital editions. The approach of Rosselli Del Turco et al. with the open-source tool EVT (Edition Visualization Technology) uses the Digital Vercelli Book as an example to show how digital editions can be searched, explored, and studied [5].
The XML format offers the advantage that hierarchical structures and user-defined tags allow flexible data representation [6]. However, this advantage becomes a challenge when comparing for similarities or differences [7], [8], [9]: different content types in XML documents must be analyzed using different algorithms. A second challenge arises when something within an XML document needs to be annotated, modified, or deleted while the relation to the originals must be preserved. This is the case, for example, when comparing two XML documents for similarities and then highlighting the similarities.

II. COMPARISON TO THE STATE OF THE ART
Many tools support the highlighting of differences between documents (so-called diff tools). Mostly, these tools focus on comparing plain source code and managing code changes, and they can be integrated into common source code management systems. As a rule, such tools presuppose that the documents are essentially similar and not fundamentally different, since the comparison result otherwise becomes very difficult to interpret. There are three standard layouts for diff tools (unified, two-pane, and three-pane), and the positions of the differing text fragments are typically also highlighted in the scrollbars of the individual documents. Specialized solutions also exist that compare files in the same format, such as images or plain text files. Still, no tools can highlight similarities across different content types (such as image and text), which is an important prerequisite for visualizing multimodal diffs.
However, there are three main differences between tools that merge texts, in particular source code, and our approach of highlighting parallel structures in XML documents. First, XML documents contain structural and textual information as well as predefined presentation standards such as SVG images or MathML formulae. To highlight such non-textual elements, or fractions of them, special treatment is necessary. Second, in source code outputs, overlapping highlighting is less problematic because there are only three types of annotations: insert, delete, and move. For XML documents, corresponding sections might be determined by several analysis algorithms, so the overlapping of highlights is more complex. In particular, a highlight can span across different formatting instructions.
Third, the analytical tasks are more diverse. While for source code the task is usually to resolve conflicting change proposals or to understand a change, investigations of similarities in documents serve a wider range of purposes. The overview-first paradigm is therefore more important in this context.

III. OUR APPROACH
We present a four-stage approach for identifying, annotating, and highlighting parallel content structures in XML documents. Our approach addresses scenarios in which two documents shall be analyzed for similar content, specifically similar text, images, and formulae. This scenario is particularly relevant for scientific document processing use cases, such as plagiarism detection or scholarly editing.
In the first stage, an XML or HTML document is decomposed into plain text, formulae, and images. These elements are processed with appropriate external algorithms for text similarity or difference analysis. The results can then be recombined with the positions of the original XML document to form minimal, non-overlapping elements and to insert, for example, type-specific highlight markers. To avoid highlighting conflicts, match groups are introduced, the corresponding tags are moved, and the part to be highlighted is split accordingly. Many applications will benefit from a well-designed method that separates addressing match locations from inserting highlights. Our approach also considers edge cases, overlapping matches, and multiple matches. Our approach and tool are a further development of HyPlag [10], [11]; we provide a template and command-line tool for extracting the different data streams from an XML document, annotating them, and reassembling the plain text, images, and formulae with the original XML tags using our algorithm.

Phase 1: Decompose the document and record tag positions.
The XML/HTML document is decomposed into plain text, mathematical expressions, and images in the preprocessing phase. All XML/HTML formatting instructions, links, and tags are removed; however, we store the position in the resulting plaintext string at which each XML/HTML tag was removed during this extraction. After doing this for both XML/HTML documents, we have two plaintext files and, for each, a list of the removed tags together with their locations in the original XML document.

Phase 2: Process the extracted streams.
The plaintext can be compared using a standard text similarity or difference algorithm (e.g., Encoplot [12], Boyer-Moore [13]), and the formulae and images can be processed using specialized, partially XML-unaware tools (e.g., Frequency Histograms of Mathematical Identifiers (Histo), Longest Common Subsequence of Identifiers (LCIS) [14], perceptual hashing (pHash), positional text matching [10], Bibliographic Coupling (BC), Longest Common Citation Sequence (LCCS), and Citation Chunking (CC) [15], [16]). Alternatively, characters or words can be added, modified, or deleted manually in the extracted plain text.

Phase 3: Map matches back and form match groups.
After processing the respective elements with external algorithms or manual modifications, we evaluate the positions returned by the algorithms as integer offsets into the input strings. Using the position list generated in Phase 1, we can relate these matches to the positions of the respective XML/HTML tags in the original documents. However, to implement highlighting correctly, we need to ensure that no XML tags lie within a matched span; otherwise, the highlighting could interfere with other formatting instructions, such as <b>bold</b>. Therefore, we introduce match groups and split the range to be highlighted into as many fractions as leaf nodes are affected in the XML/HTML document, assigning the same group tag to all splits. We repeat this procedure for all algorithms and all content types to be compared. In addition to the split-tag problem, we then face the challenge of overlapping groups in leaf elements. To resolve this, we split the groups into non-overlapping highlights and adjust the match group information accordingly; note that identical highlights are thereby merged into one. Contextual information, such as the highlighting style or additional comments, is stored in the group and not in the highlight itself.
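
To make this concrete, the following Python sketch shows one possible implementation of the Phase 1 position bookkeeping and the Phase 3 check for tags inside a matched span. It is a minimal illustration under simplifying assumptions (regex-based tag stripping instead of a real XML parser); the function names are ours and not the API of the published tool.

import re

TAG_RE = re.compile(r"<[^>]+>")

def extract_plaintext(xml):
    """Phase 1: strip all tags, remembering for each removed tag the offset in
    the resulting plaintext at which it has to be re-inserted later."""
    parts = []                 # plaintext fragments between tags
    removed_tags = []          # list of (plaintext_offset, tag_string)
    plain_len = 0
    last_end = 0
    for m in TAG_RE.finditer(xml):
        text = xml[last_end:m.start()]
        parts.append(text)
        plain_len += len(text)
        removed_tags.append((plain_len, m.group()))
        last_end = m.end()
    parts.append(xml[last_end:])
    return "".join(parts), removed_tags

def reinsert_tags(plain, removed_tags):
    """Inverse of extract_plaintext: rebuild the XML string from the (possibly
    annotated) plaintext and the stored tag positions."""
    out, last = [], 0
    for offset, tag in removed_tags:
        out.append(plain[last:offset])
        out.append(tag)
        last = offset
    out.append(plain[last:])
    return "".join(out)

def tags_inside(span_start, span_end, removed_tags):
    """Phase 3 check: tag offsets that fall strictly inside a matched span;
    a non-empty result means the match must be split into per-leaf fractions
    that share one match group."""
    return [(o, t) for o, t in removed_tags if span_start < o < span_end]

xml = '<p>Hello <b>bold</b> world</p>'
plain, tags = extract_plaintext(xml)
assert plain == "Hello bold world"
assert reinsert_tags(plain, tags) == xml
assert tags_inside(6, 16, tags) == [(10, "</b>")]   # match spans the </b> tag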

Phase 4: Recombine match groups and insert highlight tags.
Based on the group information, we now add the highlighting tags to the leaf elements. The highlighting depends on the content type of the leaf. We keep an extensible list that assigns a highlighting method to each leaf type. For example, for the HTML tags head, title, base, link, meta, style, body, article, header, footer, div, figure, data, ruby, and span, we use the em tag for highlighting. The sketch below shows this recombination step.
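
The following Python sketch is a minimal illustration of this step under simplifying assumptions: matches arrive as (start, end, group) triples in plaintext offsets, a single em tag is used for highlighting, and group membership is stored in a hypothetical data-groups attribute. The function names are ours, not the API of the published tool.

def split_into_highlights(spans):
    """Cut (possibly overlapping) match-group spans at every span boundary so
    that each resulting highlight belongs to a fixed set of groups; identical
    highlights thereby merge into one."""
    cuts = sorted({p for s, e, _ in spans for p in (s, e)})
    highlights = []
    for a, b in zip(cuts, cuts[1:]):
        groups = {g for s, e, g in spans if s <= a and b <= e}
        if groups:
            highlights.append((a, b, frozenset(groups)))
    return highlights

def insert_highlight_tags(plain, highlights, tag="em"):
    """Wrap each non-overlapping highlight in the type-specific tag and store
    the group membership in an attribute instead of the highlight itself."""
    out, last = [], 0
    for a, b, groups in highlights:
        out.append(plain[last:a])
        gid = ",".join(sorted(groups))
        out.append(f'<{tag} data-groups="{gid}">{plain[a:b]}</{tag}>')
        last = b
    out.append(plain[last:])
    return "".join(out)

# Two overlapping match groups over the same leaf text:
spans = [(0, 10, "g1"), (5, 15, "g2")]
hl = split_into_highlights(spans)   # [(0,5,{g1}), (5,10,{g1,g2}), (10,15,{g2})]
print(insert_highlight_tags("abcdefghijklmno", hl))

Cutting the spans at every boundary guarantees that each emitted highlight belongs to a fixed set of groups, which is exactly the non-overlapping property required before the tags can be inserted.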

IV. RESULTS AND DISCUSSION
Our results show that it is possible to extract the plain text, images, and mathematical expressions separately from an XML document. These separate types can then be compared using various external tools, and the comparison results can be merged back with reference to the original document. In addition to pure comparison, this makes it possible to modify the text in human-readable form, to delete and add characters, and then to re-insert the XML tags and formatting instructions into the modified plain text. Moreover, our suggested approach allows for a much better informed analysis of XML documents, since the analysis can now be performed at the level of individual elements, such as plain text, images, or mathematical expressions.
One challenge in extracting and recompiling the plain text and XML tags is dealing with whitespace. By default, no whitespace is inserted between text and the respective XML tags in an XML document. So if we remove the XML tags as part of the suggested approach, there will be no whitespace between the adjacent text fragments in the plain-text output file.
The following example, <postCode>66123</postCode><settlement>Saarbrücken</settlement><country key="DE">Germany</country>, illustrates that as soon as the XML tags are removed, the plain text runs together as continuous text:

66123SaarbrückenGermany
Of course, a space can be inserted during extraction, but then the calculated positions would no longer be correct when the modified text and the XML tags are later reassembled. It is therefore necessary to record the position at which each space is inserted during extraction so that it can be identified again during composition; the positions of the XML tags are then corrected for the inserted spaces. Even if the algorithm inserts blanks appropriately, a problem can still occur if an XML tag is set within a word, because the word is then pulled apart by the inserted blank.
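
The following hypothetical helpers sketch this bookkeeping on top of the Phase 1 output from the earlier sketch: a space is inserted at every plaintext offset at which a tag was removed, the positions of the inserted spaces are recorded, and the spaces are removed again before the stored tag offsets are reused.

def add_separator_spaces(plain, removed_tags):
    """Insert a space at every offset where a tag was removed; return the
    spaced text and the offsets of the inserted spaces in that text."""
    offsets = sorted({o for o, _ in removed_tags if 0 < o < len(plain)})
    out, inserted, last, shift = [], [], 0, 0
    for o in offsets:
        out.append(plain[last:o])
        inserted.append(o + shift)   # where the space sits in the output
        out.append(" ")
        shift += 1
        last = o
    out.append(plain[last:])
    return "".join(out), inserted

def remove_separator_spaces(spaced, inserted):
    """Undo add_separator_spaces so the stored tag offsets are valid again."""
    out, last = [], 0
    for pos in inserted:
        out.append(spaced[last:pos])
        last = pos + 1               # skip the inserted space
    out.append(spaced[last:])
    return "".join(out)

# For the address example, the spaced text is "66123 Saarbrücken Germany" with
# recorded positions [5, 17]; removing the spaces restores the original string,
# so the tag offsets stored in Phase 1 remain correct.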
Another challenge arises when words are inserted in the plain text at the edge of an XML tag: the algorithm so far cannot detect into which XML tag the new or changed word should be inserted. Further considerations are required at these points.

V. CONCLUSION
We have presented an approach for capturing, annotating, and visualizing parallel structures in XML documents. An XML document can first be decomposed into plain text, formulae, and images and then processed and analyzed with external algorithms for text similarity or difference analysis. After processing the elements with external algorithms and inserting annotations, such as plain text changes or highlight marks, these elements can be recombined based on the positions in the original XML document. To this end, we have also proposed a solution for highlighting constraints, edge cases, overlaps, and multiple matches. The standalone software package we have created can be integrated into various applications, for example, to split XML documents into different data streams for analysis and then reassemble the modified data streams into one XML document using the calculated positions in case of changes or highlighting.
Our code is available as open source at: https://github.com/ag-gipp/parallelXmlHighlighting

ACKNOWLEDGMENT
This work was partially supported by the German Research Foundation (DFG), grant no. GI 1259/3-1.