Sentence/Table Pair Data from Wikipedia for Pre-training with Distant-Supervision

Xiang Deng; Yu Su; Alyssa Lees; You Wu; Cong Yu; Huan Sun

doi:10.5281/zenodo.5612316

Published October 29, 2021 | Version 0.0

Dataset Open

1. The Ohio State University
2. Google Research

This is the dataset used for pre-training in "ReasonBERT: Pre-trained to Reason with Distant Supervision", EMNLP'21.

There are two files:

sentence_pairs_for_pretrain_no_tokenization.tar.gz -> Contains only sentences as evidence (Text-only)

table_pairs_for_pretrain_no_tokenization.tar.gz -> At least one piece of evidence is a table (Hybrid)

The data is chunked into multiple tar files for easy loading. We use WebDataset, a PyTorch Dataset (IterableDataset) implementation providing efficient sequential/streaming data access.

For pre-training code, or if you have any questions, please check our GitHub repo https://github.com/sunlab-osu/ReasonBERT

Below is a sample code snippet to load the data:

import webdataset as wds

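# Path to the uncompressed files: a directory containing a set of tar shards.
# The shard range uses two dots, the brace-expansion syntax that WebDataset expands.
url = './sentence_multi_pairs_for_pretrain_no_tokenization/{000000..000763}.tar'
dataset = (
    wds.WebDataset(url) # recent releases use wds.WebDataset; early releases named this class wds.Dataset
    .shuffle(1000) # buffer 1000 samples and shuffle within the buffer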
    .decode()
    .to_tuple("json")
    .batched(20) # group every 20 examples into a batch
)

# See the WebDataset documentation for details on using it as a data loader for PyTorch.
# You can also iterate through all examples and dump them in your preferred data format.
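As a minimal sketch of the "dump them in your preferred data format" option, the helper below writes examples to JSON Lines using only the standard library. The names `dump_jsonl` and `load_jsonl` are hypothetical, and the sketch assumes `examples` is any iterable that yields one decoded example dict at a time (for instance, the pipeline above without batching and with each 1-tuple unpacked); that iteration shape is an assumption, not something the dataset description specifies.

```python
import json


def dump_jsonl(examples, path):
    """Write each example dict as one JSON object per line; return the count."""
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        for example in examples:
            # ensure_ascii=False keeps non-ASCII Wikipedia text readable on disk.
            out.write(json.dumps(example, ensure_ascii=False) + "\n")
            count += 1
    return count


def load_jsonl(path):
    """Read a JSON Lines file back into a list of dicts."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Because the writer consumes its input lazily, it should work on a streaming dataset without holding all examples in memory.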

Below we show how the data is organized with two examples.

Text-only

{'s1_text': 'Sils is a municipality in the comarca of Selva, in Catalonia, Spain.', # query sentence
 's1_all_links': {
    'Sils,_Girona': [[0, 4]],
    'municipality': [[10, 22]],
    'Comarques_of_Catalonia': [[30, 37]],
    'Selva': [[41, 46]],
    'Catalonia': [[51, 60]]
  }, # list of entities and their mentions in the sentence (start, end location)
 'pairs': [ # other sentences that share a common entity pair with the query, grouped by shared entity pair
    {
       'pair': ['Comarques_of_Catalonia', 'Selva'], # the common entity pair
       's1_pair_locs': [[[30, 37]], [[41, 46]]], # mention of the entity pair in the query
       's2s': [ # list of other sentences that contain the common entity pair, or evidence
          {
             'md5': '2777e32bddd6ec414f0bc7a0b7fea331',
             'text': 'Selva is a coastal comarque (county) in Catalonia, Spain, located between the mountain range known as the Serralada Transversal or  Puigsacalm and the Costa Brava (part of the Mediterranean coast). Unusually, it is divided between the provinces of Girona and Barcelona, with Fogars de la Selva being part of Barcelona province and all other municipalities falling inside Girona province. Also unusually, its capital, Santa Coloma de Farners, is no longer among its larger municipalities, with the coastal towns of Blanes and Lloret de Mar having far surpassed it in size.',
             's_loc': [0, 27], # in addition to the sentence containing the common entity pair, we also keep its surrounding context. 's_loc' is the start/end location of the actual evidence sentence
             'pair_locs': [ # mentions of the entity pair in the evidence
                [[19, 27]], # mentions of entity 1
                [[0, 5], [288, 293]] # mentions of entity 2
              ],
             'all_links': {
                'Selva': [[0, 5], [288, 293]],
                'Comarques_of_Catalonia': [[19, 27]],
                'Catalonia': [[40, 49]]
              }
           }
        ,...] # there are multiple evidence sentences
     },
  ...] # there are multiple entity pairs in the query
}
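The offsets in the example above appear to be character positions with an exclusive end, so they line up with Python slices. The sketch below works under that assumption and shows how mention strings and evidence passages could be pulled out of one example. The helper names `mention_texts` and `iter_evidence` are hypothetical, and the sample dict is abridged from the example above (the evidence text is cut after its first clause, so only the offsets that fall inside the kept text are included).

```python
def mention_texts(text, links):
    """Map each entity to the surface strings of its mentions in `text`."""
    return {
        entity: [text[start:end] for start, end in spans]
        for entity, spans in links.items()
    }


def iter_evidence(example):
    """Yield (entity_pair, evidence_dict) for every evidence passage of a query."""
    for group in example["pairs"]:
        for evidence in group["s2s"]:
            yield tuple(group["pair"]), evidence


# Abridged copy of the text-only example shown above.
example = {
    "s1_text": "Sils is a municipality in the comarca of Selva, in Catalonia, Spain.",
    "s1_all_links": {
        "Sils,_Girona": [[0, 4]],
        "municipality": [[10, 22]],
        "Comarques_of_Catalonia": [[30, 37]],
        "Selva": [[41, 46]],
        "Catalonia": [[51, 60]],
    },
    "pairs": [
        {
            "pair": ["Comarques_of_Catalonia", "Selva"],
            "s1_pair_locs": [[[30, 37]], [[41, 46]]],
            "s2s": [
                {
                    "text": "Selva is a coastal comarque (county) in Catalonia, Spain",
                    "all_links": {
                        "Selva": [[0, 5]],
                        "Comarques_of_Catalonia": [[19, 27]],
                        "Catalonia": [[40, 49]],
                    },
                }
            ],
        }
    ],
}

query_mentions = mention_texts(example["s1_text"], example["s1_all_links"])
# query_mentions["Comarques_of_Catalonia"] == ["comarca"]

for pair, evidence in iter_evidence(example):
    evidence_mentions = mention_texts(evidence["text"], evidence["all_links"])
    # evidence_mentions["Comarques_of_Catalonia"] == ["comarque"]
```

The same slicing applies to `s1_pair_locs` and `pair_locs`, which hold one list of spans per entity in the pair.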

Hybrid

{'s1_text': 'The 2006 Major League Baseball All-Star Game was the 77th playing of the midseason exhibition baseball game between the all-stars of the American League (AL) and National League (NL), the two leagues comprising Major League Baseball.',
 's1_all_links': {...}, # same as text-only
 'sentence_pairs': [{'pair': ..., 's1_pair_locs': ..., 's2s': [...]}], # same as text-only
 'table_pairs': [
   {
    'tid': 'Major_League_Baseball-1',
    'text': [
       ['World Series Records', 'World Series Records', ...],
       ['Team', 'Number of Series won', ...],
       ['St. Louis Cardinals (NL)', '11', ...],
    ...], # table content, list of rows
    'index': [
       [[0, 0], [0, 1], ...],
       [[1, 0], [1, 1], ...],
    ...], # index of each cell [row_id, col_id]. We keep only a table snippet, but the index here is from the original table.
    'value_ranks': [
       [0, 0, ...],
       [0, 0, ...],
       [0, 10, ...],
    ...], # if a cell contains a numeric value/date, this is its rank, ordered from small to large, following TAPAS
    'value_inv_ranks': [], # inverse rank
    'all_links': {
       'St._Louis_Cardinals': {
          '2': [
           [[2, 0], [0, 19]], # [[row_id, col_id], [start, end]]
          ] # list of mentions in row 2; the key is the row_id
       },
       'CARDINAL:11': {'2': [[[2, 1], [0, 2]]], '8': [[[8, 3], [0, 2]]]},
    },
    'name': '', # table name, if exists
    'pairs': {
       'pair': ['American_League', 'National_League'],
       's1_pair_locs': [[[137, 152]], [[162, 177]]], # mention in the query
       'table_pair_locs': {
          '17': [ # mentions of the entity pair in row 17
             [
               [[17, 0], [3, 18]],
               [[17, 1], [3, 18]],
               [[17, 2], [3, 18]],
               [[17, 3], [3, 18]]
             ], # mentions of the first entity
             [
               [[17, 0], [21, 36]],
               [[17, 1], [21, 36]],
             ] # mentions of the second entity
          ]
       }
     }
   }
 ]
}
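Table mentions are addressed as `[[row_id, col_id], [start, end]]`, and the description states that cell indices come from the original table rather than from the kept snippet. The sketch below is one way to resolve such mentions to strings under that reading: it builds a lookup from original cell index to cell text via `'index'`, then slices. The helper names `cell_lookup` and `table_mention_texts` are hypothetical, the toy table is a two-column cut of the example above, and the index entries for row 2 are assumed (they are elided in the example).

```python
def cell_lookup(table):
    """Map original (row_id, col_id) to cell text for the kept table snippet."""
    lookup = {}
    for row_cells, row_index in zip(table["text"], table["index"]):
        for cell, (row_id, col_id) in zip(row_cells, row_index):
            lookup[(row_id, col_id)] = cell
    return lookup


def table_mention_texts(table):
    """Map each linked entity to the surface strings of its in-snippet mentions."""
    lookup = cell_lookup(table)
    result = {}
    for entity, rows in table["all_links"].items():
        surfaces = []
        for mentions in rows.values():
            for (row_id, col_id), (start, end) in mentions:
                cell = lookup.get((row_id, col_id))
                if cell is not None:  # skip cells outside the kept snippet
                    surfaces.append(cell[start:end])
        result[entity] = surfaces
    return result


# Two-column cut of the hybrid example; row 2 index values are assumed.
table = {
    "text": [
        ["World Series Records", "World Series Records"],
        ["Team", "Number of Series won"],
        ["St. Louis Cardinals (NL)", "11"],
    ],
    "index": [
        [[0, 0], [0, 1]],
        [[1, 0], [1, 1]],
        [[2, 0], [2, 1]],
    ],
    "all_links": {
        "St._Louis_Cardinals": {"2": [[[2, 0], [0, 19]]]},
        "CARDINAL:11": {"2": [[[2, 1], [0, 2]]], "8": [[[8, 3], [0, 2]]]},
    },
}

surfaces = table_mention_texts(table)
# surfaces["St._Louis_Cardinals"] == ["St. Louis Cardinals"]
```

In this toy table the row 8 mention of `CARDINAL:11` is skipped because that row is not part of the snippet; whether real records can contain such out-of-snippet links is not stated in the description, so the guard is a precaution.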

Notes

We would like to thank the anonymous reviewers for their helpful comments. Authors at The Ohio State University were sponsored in part by Google Faculty Award, the Army Research Office under cooperative agreements W911NF-17-1-0412, NSF Grant IIS1815674, NSF CAREER #1942980, Fujitsu gift grant, and Ohio Supercomputer Center (Center, 1987). The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notice herein. Research was also supported with Cloud TPUs from Google's TPU Research Cloud (TRC).

Files

Files (33.2 GB)

Name	MD5	Size
sentence_pairs_for_pretrain_no_tokenization.tar.gz	b3021c7e8d538f70a337c58bd683adc5	12.3 GB
table_pairs_for_pretrain_no_tokenization.tar.gz	642b203242ae335841c5b6b991f3ab03	20.9 GB
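Given the size of the archives, it may be worth checking a download against the MD5 digests listed above before unpacking. The following is a minimal sketch using only the standard library; the function names `file_md5` and `verify_download` are hypothetical, and it assumes the downloaded files keep their original names.

```python
import hashlib
from pathlib import Path

# MD5 digests copied from the file listing above.
EXPECTED_MD5 = {
    "sentence_pairs_for_pretrain_no_tokenization.tar.gz": "b3021c7e8d538f70a337c58bd683adc5",
    "table_pairs_for_pretrain_no_tokenization.tar.gz": "642b203242ae335841c5b6b991f3ab03",
}


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the file's MD5 matches the digest listed for its name."""
    path = Path(path)
    return file_md5(path) == EXPECTED_MD5[path.name]
```

Calling `verify_download("table_pairs_for_pretrain_no_tokenization.tar.gz")` on a complete download should return `True`; a `KeyError` means the file name is not one of the two listed archives.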

Additional details

Deng, Xiang, et al. "ReasonBERT: Pre-trained to Reason with Distant Supervision." EMNLP (2021).

