Sentence/Table Pair Data from Wikipedia for Pre-training with Distant-Supervision
Creators
- 1. The Ohio State University
- 2. Google Research
Description
This is the dataset used for pre-training in "ReasonBERT: Pre-trained to Reason with Distant Supervision", EMNLP'21.
There are two files:
sentence_pairs_for_pretrain_no_tokenization.tar.gz -> Text-only: all evidence consists of sentences
table_pairs_for_pretrain_no_tokenization.tar.gz -> Hybrid: at least one piece of evidence is a table
The data is chunked into multiple tar files for easy loading. We use WebDataset, a PyTorch Dataset (IterableDataset) implementation providing efficient sequential/streaming data access.
For pre-training code, or if you have any questions, please check our GitHub repo https://github.com/sunlab-osu/ReasonBERT
Below is a sample code snippet to load the data:
import webdataset as wds

# Path to the uncompressed files: a directory containing a set of tar shards.
# The brace pattern uses two dots ("..") so that it expands to the full shard range.
url = './sentence_multi_pairs_for_pretrain_no_tokenization/{000000..000763}.tar'
dataset = (
    wds.WebDataset(url)  # named wds.Dataset in older WebDataset releases
    .shuffle(1000)  # buffer 1000 samples and shuffle within the buffer
    .decode()
    .to_tuple("json")
    .batched(20)  # group every 20 examples into a batch
)
# See the WebDataset documentation for details on using it as a data loader for PyTorch.
# You can also iterate through all examples and dump them in your preferred data format.
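As a minimal sketch of the "iterate and dump" option mentioned above, the helper below writes samples to a JSON Lines file. The name `dump_jsonl` and the stand-in samples are hypothetical; the sketch assumes each sample is either a decoded dict or a 1-tuple holding one (which is what `.to_tuple("json")` yields when `.batched(...)` is left out of the pipeline).

```python
import json
import os
import tempfile


def dump_jsonl(samples, path):
    """Write every sample from an iterable to `path`, one JSON object per line."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for sample in samples:
            # Unwrap 1-tuples such as those produced by .to_tuple("json").
            record = sample[0] if isinstance(sample, tuple) else sample
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


# Stand-in samples (hypothetical); with the real data, pass the unbatched
# WebDataset pipeline as `samples` instead.
demo_samples = [
    ({"s1_text": "first query", "pairs": []},),
    ({"s1_text": "second query", "pairs": []},),
]
out_path = os.path.join(tempfile.mkdtemp(), "pairs.jsonl")
written = dump_jsonl(demo_samples, out_path)

# Read the file back to check the round trip.
with open(out_path, encoding="utf-8") as f:
    reloaded = [json.loads(line) for line in f]
print(written, reloaded[0]["s1_text"])
```

Because the helper only needs an iterable, it streams through the shards without holding the whole dataset in memory.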
Below we show how the data is organized with two examples.
Text-only
{
    's1_text': 'Sils is a municipality in the comarca of Selva, in Catalonia, Spain.',  # query sentence
    's1_all_links': {
        'Sils,_Girona': [[0, 4]],
        'municipality': [[10, 22]],
        'Comarques_of_Catalonia': [[30, 37]],
        'Selva': [[41, 46]],
        'Catalonia': [[51, 60]]
    },  # entities linked in the sentence, each with its mentions as [start, end] character offsets
    'pairs': [  # other sentences that share an entity pair with the query, grouped by the shared entity pair
        {
            'pair': ['Comarques_of_Catalonia', 'Selva'],  # the shared entity pair
            's1_pair_locs': [[[30, 37]], [[41, 46]]],  # mentions of the entity pair in the query
            's2s': [  # evidence: other sentences that contain the shared entity pair
                {
                    'md5': '2777e32bddd6ec414f0bc7a0b7fea331',
                    'text': 'Selva is a coastal comarque (county) in Catalonia, Spain, located between the mountain range known as the Serralada Transversal or Puigsacalm and the Costa Brava (part of the Mediterranean coast). Unusually, it is divided between the provinces of Girona and Barcelona, with Fogars de la Selva being part of Barcelona province and all other municipalities falling inside Girona province. Also unusually, its capital, Santa Coloma de Farners, is no longer among its larger municipalities, with the coastal towns of Blanes and Lloret de Mar having far surpassed it in size.',
                    's_loc': [0, 27],  # in addition to the sentence containing the shared entity pair, we also keep its surrounding context. 's_loc' is the start/end location of the actual evidence sentence
                    'pair_locs': [  # mentions of the entity pair in the evidence
                        [[19, 27]],  # mentions of entity 1
                        [[0, 5], [288, 293]]  # mentions of entity 2
                    ],
                    'all_links': {
                        'Selva': [[0, 5], [288, 293]],
                        'Comarques_of_Catalonia': [[19, 27]],
                        'Catalonia': [[40, 49]]
                    }
                },
                ...  # there are multiple evidence sentences
            ]
        },
        ...  # there are multiple entity pairs in the query
    ]
}
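The mention locations in the text-only example read as end-exclusive character offsets, so they can be used directly as Python slices. The sketch below checks this on the example record; the helper names `mention_strings` and `iter_evidence` are hypothetical, and the evidence text is trimmed to its first sentence (so the second 'Selva' mention at [288, 293] is left out).

```python
def mention_strings(text, links):
    """Map each linked entity to the surface strings of its mentions in `text`."""
    return {
        entity: [text[start:end] for start, end in spans]
        for entity, spans in links.items()
    }


def iter_evidence(record):
    """Yield (entity_pair, evidence) for every evidence item of a text-only record."""
    for group in record["pairs"]:
        for evidence in group["s2s"]:
            yield group["pair"], evidence


# Record copied from the text-only example, with the evidence text trimmed.
record = {
    "s1_text": "Sils is a municipality in the comarca of Selva, in Catalonia, Spain.",
    "s1_all_links": {
        "Sils,_Girona": [[0, 4]],
        "municipality": [[10, 22]],
        "Comarques_of_Catalonia": [[30, 37]],
        "Selva": [[41, 46]],
        "Catalonia": [[51, 60]],
    },
    "pairs": [
        {
            "pair": ["Comarques_of_Catalonia", "Selva"],
            "s1_pair_locs": [[[30, 37]], [[41, 46]]],
            "s2s": [
                {
                    "text": "Selva is a coastal comarque (county) in Catalonia, Spain.",
                    "pair_locs": [[[19, 27]], [[0, 5]]],
                }
            ],
        }
    ],
}

query_mentions = mention_strings(record["s1_text"], record["s1_all_links"])
print(query_mentions["Comarques_of_Catalonia"])  # ['comarca']

for pair, evidence in iter_evidence(record):
    # pair_locs holds one span list per entity of the pair, in the same order.
    surfaces = [
        [evidence["text"][start:end] for start, end in spans]
        for spans in evidence["pair_locs"]
    ]
    print(pair, surfaces)
```

Note that an entity's surface form can differ from its title: 'Comarques_of_Catalonia' is mentioned as 'comarca' in the query and as 'comarque' in the evidence.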
Hybrid
{
    's1_text': 'The 2006 Major League Baseball All-Star Game was the 77th playing of the midseason exhibition baseball game between the all-stars of the American League (AL) and National League (NL), the two leagues comprising Major League Baseball.',
    's1_all_links': {...},  # same as text-only
    'sentence_pairs': [{'pair': ..., 's1_pair_locs': ..., 's2s': [...]}],  # same as text-only
    'table_pairs': [
        {
            'tid': 'Major_League_Baseball-1',
            'text': [
                ['World Series Records', 'World Series Records', ...],
                ['Team', 'Number of Series won', ...],
                ['St. Louis Cardinals (NL)', '11', ...],
                ...
            ],  # table content, list of rows
            'index': [
                [[0, 0], [0, 1], ...],
                [[1, 0], [1, 1], ...],
                ...
            ],  # index of each cell [row_id, col_id]. We keep only a table snippet, but the index here is from the original table.
            'value_ranks': [
                [0, 0, ...],
                [0, 0, ...],
                [0, 10, ...],
                ...
            ],  # if the cell contains a numeric value or date, this is its rank ordered from small to large, following TAPAS
            'value_inv_ranks': [],  # inverse rank
            'all_links': {
                'St._Louis_Cardinals': {
                    '2': [
                        [[2, 0], [0, 19]],  # [[row_id, col_id], [start, end]]
                    ]  # list of mentions in row 2; the key is the row_id
                },
                'CARDINAL:11': {'2': [[[2, 1], [0, 2]]], '8': [[[8, 3], [0, 2]]]},
            },
            'name': '',  # table name, if it exists
            'pairs': {
                'pair': ['American_League', 'National_League'],
                's1_pair_locs': [[[137, 152]], [[162, 177]]],  # mentions in the query
                'table_pair_locs': {
                    '17': [  # mentions of the entity pair in row 17
                        [
                            [[17, 0], [3, 18]],
                            [[17, 1], [3, 18]],
                            [[17, 2], [3, 18]],
                            [[17, 3], [3, 18]]
                        ],  # mentions of the first entity
                        [
                            [[17, 0], [21, 36]],
                            [[17, 1], [21, 36]],
                        ]  # mentions of the second entity
                    ]
                }
            }
        }
    ]
}
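Since table mentions are given as [[row_id, col_id], [start, end]] with indices from the original table, resolving them against the stored snippet requires going through 'index'. The sketch below is one way to do that, assuming each entry of 'index' lines up with the cell at the same position in 'text'; the helper names are hypothetical and the toy table keeps only the three rows and two columns visible in the hybrid example.

```python
def build_cell_lookup(table):
    """Map original-table (row_id, col_id) to the cell string kept in the snippet."""
    lookup = {}
    for row_cells, row_index in zip(table["text"], table["index"]):
        for cell, (row_id, col_id) in zip(row_cells, row_index):
            lookup[(row_id, col_id)] = cell
    return lookup


def resolve_table_mentions(table):
    """Return the surface strings of every linked entity found in the snippet."""
    lookup = build_cell_lookup(table)
    resolved = {}
    for entity, rows in table["all_links"].items():
        surfaces = []
        for mentions in rows.values():  # keys are row ids stored as strings
            for (row_id, col_id), (start, end) in mentions:
                cell = lookup.get((row_id, col_id))
                if cell is not None:  # skip cells that are not in this snippet
                    surfaces.append(cell[start:end])
        resolved[entity] = surfaces
    return resolved


# Toy table built from the rows and columns shown in the hybrid example.
table = {
    "text": [
        ["World Series Records", "World Series Records"],
        ["Team", "Number of Series won"],
        ["St. Louis Cardinals (NL)", "11"],
    ],
    "index": [
        [[0, 0], [0, 1]],
        [[1, 0], [1, 1]],
        [[2, 0], [2, 1]],
    ],
    "all_links": {
        "St._Louis_Cardinals": {"2": [[[2, 0], [0, 19]]]},
        "CARDINAL:11": {"2": [[[2, 1], [0, 2]]], "8": [[[8, 3], [0, 2]]]},
    },
}

resolved = resolve_table_mentions(table)
print(resolved)
```

Looking cells up by their original index, rather than by position in the snippet, keeps the code correct when the snippet does not start at row 0 of the original table.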
Files (33.2 GB)
Checksum | Size |
---|---|
md5:b3021c7e8d538f70a337c58bd683adc5 | 12.3 GB |
md5:642b203242ae335841c5b6b991f3ab03 | 20.9 GB |
Additional details
References
- Deng, Xiang, et al. "ReasonBERT: Pre-trained to Reason with Distant Supervision." EMNLP (2021).