Petar Ristoski
Oliver Lehmberg
Heiko Paulheim
Robert Meusel
Christian Bizer
Alexander Diete
Nicolas Heist
Sascha Krstanovic
Thorsten Andre Knöller

This page describes the data format used to represent table data and contains the download instructions for the WDC Web Table Corpora 2012. General information about the WDC Web Tables Corpus 2012 can be found on the overview page.

1. Data Formats and Download

The main corpus of Web tables is divided into 854,083 gzip files, which are packed into 885 tar archives. Each tar archive in the complete corpus contains 1,000 gzip files, and each gzip file contains the Web tables extracted from a couple thousand Web pages. For each Web page that contains at least one content Web table, we provide the corresponding HTML file, the set of extracted Web tables in CSV format, and a JSON file that contains metadata about the extraction of the Web tables. Each JSON file contains the URL of the Web page, a reference to the corresponding HTML file in the gzip file, and information about each of the extracted Web tables. All files that refer to the same Web page share the same file name prefix. For example, a JSON file with the name 71657325_XXXXXXX.json would refer to the HTML file 71657325_YYYYYY and to a list of CSV files: 71657325_0_ZZZZZZZ.csv, 71657325_1_ZZZZZZZ.csv, etc. For each of the extracted Web tables, the JSON file contains the position of the table inside the HTML file and basic statistics about the data in the table. The complete JSON Schema can be found here.
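The shared file name prefix makes it possible to regroup the files of one Web page after unpacking. The following is a minimal sketch, not part of the official tooling: the function name group_by_page is hypothetical, and the sketch assumes that the prefix is everything before the first underscore, that the second component of a CSV file name is the table index, and that any file that is neither .json nor .csv is the HTML file.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def table_index(name):
    """Return the table index of a CSV name such as 71657325_0_ZZZZZZZ.csv.

    Assumption: the index is the second underscore-separated component.
    Names that do not follow this pattern sort first (index -1).
    """
    parts = PurePosixPath(name).name.split("_")
    if len(parts) > 1 and parts[1].isdigit():
        return int(parts[1])
    return -1


def group_by_page(file_names):
    """Group file names by the prefix before the first underscore."""
    pages = defaultdict(lambda: {"json": None, "html": None, "csv": []})
    for name in file_names:
        base = PurePosixPath(name).name
        prefix = base.split("_", 1)[0]
        if base.endswith(".json"):
            pages[prefix]["json"] = name
        elif base.endswith(".csv"):
            pages[prefix]["csv"].append(name)
        else:
            # Assumption: the remaining file with this prefix is the HTML page.
            pages[prefix]["html"] = name
    for entry in pages.values():
        entry["csv"].sort(key=table_index)
    return dict(pages)


# File names taken from the example in the text above.
example = [
    "71657325_1_ZZZZZZZ.csv",
    "71657325_XXXXXXX.json",
    "71657325_YYYYYY",
    "71657325_0_ZZZZZZZ.csv",
]
pages = group_by_page(example)
```

Here pages["71657325"] holds the JSON file, the HTML file, and the CSV files ordered by table index. The authoritative link between a JSON file and its HTML file is the reference stored inside the JSON file itself, so the prefix grouping is best treated as a convenience.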

To download the Web table corpora, use the following links:

Data Set | Size | #Files
Relational Corpus 2012 | 1,020 GB | 885 (.tar)
English-Language Relational Web Tables | 697 GB | 180 (.tar)

You can download data samples from the following links:

2. Feedback

Please send questions and feedback to the Web Data Commons mailing list or post them in our Web Data Commons Google Group.

3. Credits

The extraction of the Web Table Corpus was supported by the German Research Foundation (DFG) under grant number PA 2373/1-1 (Mine@LOD), by an Amazon Web Services in Education Grant award, and by the EU FP7 research project PlanetData.

