Recognize, Annotate, and Visualize Parallel Content Structures in XML Documents

Marco Beck; Moritz Schubotz; Vincent Stange; Norman Meuschke; Bela Gipp

doi:10.5281/zenodo.5224572

Published August 19, 2021 | Version v1

Conference paper | Open access

1. University of Wuppertal
2. FIZ Karlsruhe
3. OriginStamp AG

We present a four-phase parallel approach for capturing, annotating, and visualizing parallel structures in XML documents. First, our highlighting strategy decomposes XML documents into various data streams, including plain text, formulae, and images. Second, those streams are processed with external algorithms and tools optimized for specific tasks, such as analyzing similarities or differences in the respective formats. Third, we compute comparison metadata such as annotations and highlighting marks. Fourth, the position information is concatenated based on the computed positions in the original XML document. The resulting comparison can then be visualized or processed further while keeping the reference to the source documents intact. While our algorithm has been developed for visualizing similarities as part of plagiarism detection tasks, we expect that many applications will benefit from a well-designed and integrative method that separates addressing the match locations from inserting highlight marks. For example, our algorithm can also add comments in XML-unaware plaintext editors. Our approach also handles the edge cases of overlapping matches and multiple matches.
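
The four phases described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`decompose`, `find_matches`, `merge`, `highlight`), the regex-based tag splitting, the `<mark>` element, and the use of `difflib.SequenceMatcher` as a stand-in for the external comparison tool are all assumptions made for illustration. The sketch only handles the plain-text stream and ignores comments, CDATA sections, and entities.

```python
import re
from difflib import SequenceMatcher

TAG = re.compile(r"<[^>]+>")  # simplification: no comments, CDATA, or '>' in attributes


def decompose(xml):
    """Phase 1: extract the plain-text stream and remember where each
    text chunk sits, as (text_offset, xml_offset, length) triples."""
    chunks, segments, pos, text_len = [], [], 0, 0
    for tag in list(TAG.finditer(xml)) + [None]:
        end = tag.start() if tag else len(xml)
        if end > pos:
            segments.append((text_len, pos, end - pos))
            chunks.append(xml[pos:end])
            text_len += end - pos
        pos = tag.end() if tag else end
    return "".join(chunks), segments


def find_matches(text_a, text_b, min_len=5):
    """Phase 2: stand-in for an external similarity tool; returns
    (start, end) ranges in text_a that also occur in text_b."""
    sm = SequenceMatcher(None, text_a, text_b, autojunk=False)
    return [(b.a, b.a + b.size) for b in sm.get_matching_blocks() if b.size >= min_len]


def merge(ranges):
    """Phase 3: turn raw matches into annotations, merging overlapping
    or touching ranges so that multi-matches yield one mark."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def highlight(xml, segments, ranges, open_mark="<mark>", close_mark="</mark>"):
    """Phase 4: map text ranges back to XML offsets and insert marks.
    A range that crosses a tag is split so every mark stays inside one text chunk."""
    inserts = []
    for start, end in ranges:
        for t_off, x_off, length in segments:
            lo, hi = max(start, t_off), min(end, t_off + length)
            if lo < hi:
                inserts.append((x_off + lo - t_off, 1, open_mark))
                inserts.append((x_off + hi - t_off, 0, close_mark))
    out, pos = [], 0
    for offset, _, mark in sorted(inserts):  # closing marks sort before opening ones
        out.append(xml[pos:offset])
        out.append(mark)
        pos = offset
    out.append(xml[pos:])
    return "".join(out)


xml_a = "<p>The quick <b>brown fox</b> jumps</p>"
xml_b = "<p>A quick brown fox sleeps</p>"
text_a, segments_a = decompose(xml_a)
text_b, _ = decompose(xml_b)
annotations = merge(find_matches(text_a, text_b))
highlighted = highlight(xml_a, segments_a, annotations)
```

Because match locations are computed on the text stream and the marks are inserted in a separate step, removing the inserted marks from `highlighted` yields the original `xml_a` again, which mirrors the abstract's point about keeping the reference to the source document intact.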

Files

Name	Size
beck2021.pdf md5:cfa731368cade93f1bcf5ad61f865650	469.9 kB
