SOTAB V2 for SemTab 2023
Description
SOTAB V2 for SemTab 2023 includes the datasets used to evaluate Column Type Annotation (CTA) and Column Property Annotation (CPA) systems in the 2023 edition of the SemTab challenge. The datasets for both rounds of the challenge were down-sampled from the full training, validation and test splits of the SOTAB V2 (WDC Schema.org Table Annotation Benchmark version 2) benchmark. The first-round datasets use a smaller vocabulary of 40 labels for CTA and 50 labels for CPA, corresponding to easier, more general domains. The second-round datasets use the full vocabulary of 80 labels for CTA and 105 labels for CPA and are therefore considered harder to annotate. The columns and the relationships between columns are annotated using the Schema.org and DBpedia vocabularies.
SOTAB V2 for SemTab 2023 contains the splits used in Round 1 and Round 2 of the challenge. Each round includes a training, validation and test split, together with the ground truth for the test split and the vocabulary list. The ground truth of the test sets of both rounds is manually verified.
Files contained in SOTAB V2 for SemTab 2023:
- Round1-SOTAB-CPA-DatasetsAndGroundTruth = This file contains the CSV files of the first round of the challenge for the CPA task (training, validation and test sets), as well as the label set and ground truth in the "gt" folder. The columns in these files are annotated with Schema.org.
- Round1-SOTAB-CTA-DatasetsAndGroundTruth = This file contains the CSV files of the first round of the challenge for the CTA task (training, validation and test sets), as well as the label set and ground truth in the "gt" folder. The columns in these files are annotated with Schema.org.
- Round2-SOTAB-CPA-SCH-DatasetsAndGroundTruth = This file contains the CSV files of the second round of the challenge for the CPA task (training, validation and test sets), as well as the label set and ground truth in the "gt" folder. The columns in these files are annotated with Schema.org.
- Round2-SOTAB-CTA-SCH-DatasetsAndGroundTruth = This file contains the CSV files of the second round of the challenge for the CTA task (training, validation and test sets), as well as the label set and ground truth in the "gt" folder. The columns in these files are annotated with Schema.org.
- Round2-SOTAB-CPA-DBP-DatasetsAndGroundTruth = This file contains the CSV files of the second round of the challenge for the CPA task (training, validation and test sets), as well as the label set and ground truth in the "gt" folder. The columns in these files are annotated with DBpedia.
- Round2-SOTAB-CTA-DBP-DatasetsAndGroundTruth = This file contains the CSV files of the second round of the challenge for the CTA task (training, validation and test sets), as well as the label set and ground truth in the "gt" folder. The columns in these files are annotated with DBpedia.
All the corresponding tables can be found in the "Tables" zip folders.
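The file names listed above follow a regular pattern (round, task, optional vocabulary suffix), so they can be mapped to their round, task and annotation vocabulary programmatically. The following is a minimal sketch in Python; the `ArchiveInfo` type and the `parse_archive_name` helper are hypothetical names introduced for illustration and are not part of the dataset. It assumes that a missing suffix (Round 1) means Schema.org, as stated in the descriptions above.

```python
import re
from typing import NamedTuple


class ArchiveInfo(NamedTuple):
    """Hypothetical container for the parts encoded in an archive name."""
    challenge_round: int
    task: str          # "CTA" or "CPA"
    vocabulary: str    # "Schema.org" or "DBpedia"


# Pattern derived from the six file names listed in the description.
_NAME_PATTERN = re.compile(
    r"^Round(?P<round>[12])-SOTAB-(?P<task>CPA|CTA)"
    r"(?:-(?P<vocab>SCH|DBP))?-DatasetsAndGroundTruth$"
)

_VOCABULARIES = {"SCH": "Schema.org", "DBP": "DBpedia"}


def parse_archive_name(name: str) -> ArchiveInfo:
    """Split an archive name into round, task and vocabulary."""
    match = _NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unrecognized archive name: {name!r}")
    # Assumption: Round 1 names carry no suffix and are annotated
    # with Schema.org, as the file descriptions state.
    vocab_code = match.group("vocab") or "SCH"
    return ArchiveInfo(
        challenge_round=int(match.group("round")),
        task=match.group("task"),
        vocabulary=_VOCABULARIES[vocab_code],
    )
```

For example, `parse_archive_name("Round2-SOTAB-CTA-DBP-DatasetsAndGroundTruth")` yields round 2, task CTA and vocabulary DBpedia.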
Note on License: This data includes data from the following sources. Refer to each source for license details:
- CommonCrawl https://commoncrawl.org/
THIS DATA IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Files
Name | Size | MD5 |
---|---|---|
SOTAB V2 for SemTab 2023.zip | 4.0 GB | 60fc78da45812a9741d0204ace200660 |
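Because the archive is about 4.0 GB, it can be worth checking a download against the MD5 checksum listed above before unpacking it. The following is a minimal sketch using only the Python standard library; the function names `md5_of_file` and `verify_download` are hypothetical, and the local path in the usage comment is an assumption about where the file was saved.

```python
import hashlib

# MD5 checksum as listed for "SOTAB V2 for SemTab 2023.zip".
EXPECTED_MD5 = "60fc78da45812a9741d0204ace200660"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks
    so that a multi-gigabyte archive is never held in memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 matches the expected checksum."""
    return md5_of_file(path) == expected.lower()


# Usage (assumes the archive was saved in the current directory):
# ok = verify_download("SOTAB V2 for SemTab 2023.zip")
# print("checksum OK" if ok else "checksum mismatch, re-download the file")
```

A mismatch usually indicates an incomplete or corrupted download; MD5 is used here only as an integrity check, not as a security guarantee.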