---
annotations_creators:
- crowdsourced
language_creators:
- found
languages:
- en
licenses:
- unknown
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- extended|trivia_qa
task_categories:
- question-answering
task_ids:
- open-domain-qa
paperswithcode_id: freebaseqa
---
 
# Dataset Card for FreebaseQA
 
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description
 
- **Homepage:**
- **Repository:** [FreebaseQA repository](https://github.com/kelvin-jiang/FreebaseQA)
- **Paper:** [FreebaseQA NAACL 2019 paper](https://www.aclweb.org/anthology/N19-1028.pdf)
- **Leaderboard:**
- **Point of Contact:** [Kelvin Jiang](https://github.com/kelvin-jiang)
 
### Dataset Summary
 
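FreebaseQA is a data set for open-domain factoid question answering (QA) over structured knowledge bases such as Freebase. It contains about 28K unique trivia-style questions, each matched to subject-predicate-object triples in Freebase.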
 
### Supported Tasks and Leaderboards
 
[More Information Needed]
 
### Languages
 
English
 
## Dataset Structure
 
### Data Instances

Here is an example from the dataset:
 
```
{
  'Parses': {
    'Answers': [
      {'AnswersMid': ['m.01npcx'], 'AnswersName': [['goldeneye']]},
      {'AnswersMid': ['m.01npcx'], 'AnswersName': [['goldeneye']]}
    ],
    'InferentialChain': [
      'film.film_character.portrayed_in_films..film.performance.film',
      'film.actor.film..film.performance.film'
    ],
    'Parse-Id': ['FreebaseQA-train-0.P0', 'FreebaseQA-train-0.P1'],
    'PotentialTopicEntityMention': ['007', 'pierce brosnan'],
    'TopicEntityMid': ['m.0clpml', 'm.018p4y'],
    'TopicEntityName': ['james bond', 'pierce brosnan']
  },
  'ProcessedQuestion': "what was pierce brosnan's first outing as 007",
  'Question-ID': 'FreebaseQA-train-0',
  'RawQuestion': "What was Pierce Brosnan's first outing as 007?"
}
```
 
### Data Fields
- `Question-ID`: a `string` feature representing the ID of each question.
- `RawQuestion`: a `string` feature representing the original question collected from the data sources.
- `ProcessedQuestion`: a `string` feature representing the question after preprocessing, such as removal of the trailing question mark and lowercasing.
- `Parses`: a dictionary feature representing the semantic parse(s) for the question, containing:
  - `Parse-Id`: a `string` feature representing the ID of each semantic parse.
  - `PotentialTopicEntityMention`: a `string` feature representing the potential topic entity mention in the question.
  - `TopicEntityName`: a `string` feature representing the name or alias of the topic entity in the question, taken from Freebase.
  - `TopicEntityMid`: a `string` feature representing the Freebase MID of the topic entity in the question.
  - `InferentialChain`: a `string` feature representing the path from the topic entity node to the answer node in Freebase, labeled as a predicate.
  - `Answers`: a dictionary feature representing the answer found from this parse, containing:
    - `AnswersMid`: a `string` feature representing the Freebase MID of the answer.
    - `AnswersName`: a `list` of `string` features representing the answer string from the original question-answer pair.
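
As the instance in Data Instances shows, the fields inside `Parses` are stored as parallel lists with one entry per semantic parse. The following is a minimal sketch of turning that layout into one record per parse. The helper name `iter_parses` and the output keys are illustrative assumptions, not part of the dataset or of any library, and the `example` dictionary is copied from the instance shown in this card.

```python
def iter_parses(example):
    """Yield one flat record per semantic parse of a question (hypothetical helper)."""
    parses = example["Parses"]
    for i, parse_id in enumerate(parses["Parse-Id"]):
        answer = parses["Answers"][i]
        yield {
            "parse_id": parse_id,
            "topic_entity_mid": parses["TopicEntityMid"][i],
            "topic_entity_name": parses["TopicEntityName"][i],
            "inferential_chain": parses["InferentialChain"][i],
            "answer_mids": answer["AnswersMid"],
            # AnswersName is a list of lists in the sample instance, so flatten it.
            "answer_names": [n for names in answer["AnswersName"] for n in names],
        }


# The instance from the "Data Instances" section of this card.
example = {
    "Parses": {
        "Answers": [
            {"AnswersMid": ["m.01npcx"], "AnswersName": [["goldeneye"]]},
            {"AnswersMid": ["m.01npcx"], "AnswersName": [["goldeneye"]]},
        ],
        "InferentialChain": [
            "film.film_character.portrayed_in_films..film.performance.film",
            "film.actor.film..film.performance.film",
        ],
        "Parse-Id": ["FreebaseQA-train-0.P0", "FreebaseQA-train-0.P1"],
        "PotentialTopicEntityMention": ["007", "pierce brosnan"],
        "TopicEntityMid": ["m.0clpml", "m.018p4y"],
        "TopicEntityName": ["james bond", "pierce brosnan"],
    },
    "ProcessedQuestion": "what was pierce brosnan's first outing as 007",
    "Question-ID": "FreebaseQA-train-0",
    "RawQuestion": "What was Pierce Brosnan's first outing as 007?",
}

records = list(iter_parses(example))
for record in records:
    print(record["parse_id"], record["topic_entity_name"], "->", record["answer_names"])
```

For this instance the sketch produces two records, one per parse, both pointing at the same answer MID through different topic entities.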

### Data Splits

This data set contains 28,348 unique questions that are divided into three subsets: train (20,358), dev (3,994) and eval (3,996), formatted as JSON files: `FreebaseQA-[train|dev|eval].json`.
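
As a quick sanity check on the split sizes above, the short sketch below expands the file name pattern and confirms that the three subsets sum to the stated total. The variable names are illustrative assumptions, and no files are read.

```python
# Split sizes as stated in this card.
SPLIT_SIZES = {"train": 20358, "dev": 3994, "eval": 3996}

total = sum(SPLIT_SIZES.values())

for split, size in SPLIT_SIZES.items():
    # File names follow the FreebaseQA-[train|dev|eval].json pattern.
    filename = f"FreebaseQA-{split}.json"
    print(f"{filename}: {size:,} questions ({size / total:.1%} of the data)")

print(f"total: {total:,} questions")
```

The split is roughly 72% train, 14% dev and 14% eval.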

## Dataset Creation
 
### Curation Rationale
 
[More Information Needed]
 
### Source Data
 
#### Initial Data Collection and Normalization
 
The data set is generated by matching trivia-type question-answer pairs with subject-predicate-object triples in Freebase. For each collected question-answer pair, we first tag all entities in each question and search for relevant predicates that bridge a tagged entity with the answer in Freebase. Finally, human annotation is used to remove false positives in these matched triples.
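
The bridging step described above can be sketched on a toy knowledge base. This is not the authors' pipeline: the triple list, the function names `build_index` and `candidate_matches`, and the nationality triple with its placeholder MID are assumptions made for illustration, and each chained predicate from the sample instance is treated as a single edge label. Entity tagging and the final human verification are outside the scope of the sketch.

```python
from collections import defaultdict

# Toy (subject MID, predicate, object MID) triples. The first two reuse MIDs and
# predicates from the sample instance in this card; the third is a made-up distractor.
TRIPLES = [
    ("m.018p4y", "film.actor.film..film.performance.film", "m.01npcx"),
    ("m.0clpml", "film.film_character.portrayed_in_films..film.performance.film", "m.01npcx"),
    ("m.018p4y", "people.person.nationality", "m.hypothetical_country"),
]


def build_index(triples):
    """Index triples by subject so outgoing edges can be looked up quickly."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[subject].append((predicate, obj))
    return index


def candidate_matches(tagged_entity_mids, answer_mid, index):
    """Return every triple whose predicate links a tagged entity to the answer."""
    matches = []
    for mid in tagged_entity_mids:
        for predicate, obj in index.get(mid, []):
            if obj == answer_mid:
                matches.append((mid, predicate, answer_mid))
    return matches


index = build_index(TRIPLES)
# Entities tagged in "what was pierce brosnan's first outing as 007": james bond, pierce brosnan.
matches = candidate_matches(["m.0clpml", "m.018p4y"], "m.01npcx", index)
for subject, predicate, obj in matches:
    print(subject, predicate, obj)
```

Each candidate found this way would still need the human annotation step to confirm that the predicate actually expresses what the question asks.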
 
#### Who are the source language producers?
 
[More Information Needed]
 
### Annotations
 
#### Annotation process
 
[More Information Needed]
 
#### Who are the annotators?
 
[More Information Needed]
 
### Personal and Sensitive Information
 
[More Information Needed]
 
## Considerations for Using the Data
 
### Social Impact of Dataset
 
[More Information Needed]
 
### Discussion of Biases
 
[More Information Needed]
 
### Other Known Limitations
 
[More Information Needed]
 
## Additional Information
 
### Dataset Curators
 
Kelvin Jiang, currently at the University of Waterloo. The work was done at York University.
 
### Licensing Information
 
[More Information Needed]
 
### Citation Information
 
```
@inproceedings{jiang-etal-2019-freebaseqa,
    title = "{F}reebase{QA}: A New Factoid {QA} Data Set Matching Trivia-Style Question-Answer Pairs with {F}reebase",
    author = "Jiang, Kelvin  and
      Wu, Dekun  and
      Jiang, Hui",
    booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/N19-1028",
    doi = "10.18653/v1/N19-1028",
    pages = "318--323",
    abstract = "In this paper, we present a new data set, named FreebaseQA, for open-domain factoid question answering (QA) tasks over structured knowledge bases, like Freebase. The data set is generated by matching trivia-type question-answer pairs with subject-predicate-object triples in Freebase. For each collected question-answer pair, we first tag all entities in each question and search for relevant predicates that bridge a tagged entity with the answer in Freebase. Finally, human annotation is used to remove any false positive in these matched triples. Using this method, we are able to efficiently generate over 54K matches from about 28K unique questions with minimal cost. Our analysis shows that this data set is suitable for model training in factoid QA tasks beyond simpler questions since FreebaseQA provides more linguistically sophisticated questions than other existing data sets.",
}
```

### Contributions

Thanks to [@gchhablani](https://github.com/gchhablani) and [@anaerobeth](https://github.com/anaerobeth) for adding this dataset.
