MarkupMnA

Rao, Sukrit; Islam, Pranab; Bollineni, Rohith; Khosla, Shaan; Fei, Tingyi; Wu, Qian; Cho, Kyunghyun; Kobzar, Vladimir

doi:10.5281/zenodo.8034853

Published June 7, 2023 | Version 1.0

Dataset Open

1. New York University
2. Columbia University

The MarkupMnA dataset is a corpus of 151 merger and acquisition agreements annotated with section titles, section numbers, page numbers, and more. The agreements are based on HTML filings by US public companies retrieved from the SEC EDGAR database. We treat section title annotation as a sequence labeling task and therefore use the BEIOS tagging scheme when generating our annotations. The dataset contains over 70,000 labels excluding outside labels and over 465,000 labels including them.
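
As a rough illustration of what a BEIOS-style sequence labeling looks like, the sketch below tags a short sequence of units from annotated spans. It assumes the usual reading of the letters (Begin, End, Inside, Outside, Single); the function name `beios_tags`, the label names `SECTION_NUMBER` and `SECTION_TITLE`, and the `B-`/`I-`/`E-`/`S-` string format are hypothetical and may not match the label strings actually stored in the dataset files.

```python
from typing import List, Tuple


def beios_tags(num_units: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Assign BEIOS-style tags given (start, end_exclusive, label) spans.

    Units not covered by any span keep the outside tag "O".
    The tag string format here is an assumption made for illustration.
    """
    tags = ["O"] * num_units
    for start, end, label in spans:
        if end - start == 1:
            # A span covering a single unit gets the "S" (single) tag.
            tags[start] = f"S-{label}"
        else:
            tags[start] = f"B-{label}"        # first unit of the span
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"        # interior units
            tags[end - 1] = f"E-{label}"      # last unit of the span
    return tags


# Hypothetical example: a section number spanning two units, then a one-unit title.
units = ["Section", "1.1", "Definitions", ".", "As", "used", "herein"]
spans = [(0, 2, "SECTION_NUMBER"), (2, 3, "SECTION_TITLE")]
tags = beios_tags(len(units), spans)

for unit, tag in zip(units, tags):
    print(f"{unit}\t{tag}")
```

Under this reading, counts "excluding outside labels" would cover every tag other than `O`, while counts "including outside labels" would cover all tagged units.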

We add annotations to the contracts in MAUD, an already widely used, expert-annotated reading comprehension dataset. The broad objective of our work is to make progress toward developing computationally efficient hierarchical representations of long documents, specifically legal contracts. We hope that our annotations can be used in conjunction with MAUD to advance legal NLP research.

Files

Files (16.6 MB)

Name	MD5	Size
Contract Name to HTML Link Mapping.csv	9bb05c14358829da4fd01870b0021432	24.0 kB
MarkupMnA.zip	141b5ee35b8772be84355d936022de13	16.6 MB
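
The MD5 checksums listed for the two files can be used to check a download for corruption. The following is a minimal sketch using only the Python standard library; the helper names `md5_of` and `verify`, and the assumption that both files were saved under their original names in one directory, are illustrative and not part of the dataset.

```python
import hashlib
from pathlib import Path

# Checksums as listed on the dataset record.
EXPECTED_MD5 = {
    "Contract Name to HTML Link Mapping.csv": "9bb05c14358829da4fd01870b0021432",
    "MarkupMnA.zip": "141b5ee35b8772be84355d936022de13",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.exists() and md5_of(path) == expected
    return results
```

Calling `verify("downloads")`, where `downloads` is a hypothetical directory holding the files, returns a dictionary with one boolean per file; a `False` value suggests a missing or incomplete download.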
