Download Instructions for the WDC RDFa, Microdata, Embedded JSON-LD, and Microformats Data Sets (November 2017)

This document contains instructions on how to download the November 2017 version of the Web Data Commons RDFa, Microdata, Embedded JSON-LD, and Microformats data sets.

Download the Extracted Data

The extracted RDFa, Microdata, Microformats, and Embedded JSON-LD data is provided for download as N-Quads. Files are compressed using GZIP, and each file is around 100 MB in size. Overall, 8,433 files with a total size of 858 GB are provided.
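Since the files are gzip-compressed N-Quads, they can be streamed line by line without unpacking them to disk first. The following is a minimal sketch, not a full N-Quads parser: it assumes that every statement sits on one line, has four terms, and carries an IRI in angle brackets as its fourth term (taken here to be the graph label). The function names `split_graph` and `count_quads_per_graph` are hypothetical and chosen for illustration only.

```python
import gzip
from collections import Counter


def split_graph(line):
    """Split one N-Quads line into (triple_part, graph_iri).

    Simplified on purpose: assumes the statement ends with '.', has
    four terms, and that the fourth term is an IRI in angle brackets.
    Returns None for blank lines, comments, and lines that do not fit.
    """
    body = line.strip()
    if not body or body.startswith("#") or not body.endswith("."):
        return None
    body = body[:-1].rstrip()          # drop the terminating '.'
    parts = body.rsplit(None, 1)       # last whitespace-separated term
    if len(parts) != 2:
        return None
    triple, graph = parts
    if not (graph.startswith("<") and graph.endswith(">")):
        return None
    return triple, graph[1:-1]


def count_quads_per_graph(path):
    """Stream a gzip-compressed N-Quads file and count quads per graph."""
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            parsed = split_graph(line)
            if parsed is not None:
                counts[parsed[1]] += 1
    return counts
```

Streaming keeps memory use independent of the file size, which matters when working through several hundred files of around 100 MB each. For anything beyond counting, a dedicated RDF library is the safer choice.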

List of download URLs for RDF from the November 2017 corpus (Example Content)

The extracted RDF data can be downloaded using wget with the command wget -i followed by a file list. The file lists for the individual formats, each pointing to the files that contain the quads for that format, can be found in the table below, together with more detailed statistics about the number of files and their sizes.

Format                Number of Files  Approx. Total File Size  File List
html-rdfa             473              47 GB                    html-rdfa.list
html-microdata        6,100            627 GB                   html-microdata.list
html-embedded-jsonld  610              61 GB                    html-embedded-jsonld.list
html-mf-geo           5                476 MB                   html-mf-geo.list
html-mf-hcalendar     12               1 GB                     html-mf-hcalendar.list
html-mf-hcard         1,124            111 GB                   html-mf-hcard.list
html-mf-adr           28               2.7 GB                   html-mf-adr.list
html-mf-hrecipe       5                434 MB                   html-mf-hrecipe.list
html-mf-hlisting      5                455 MB                   html-mf-hlisting.list
html-mf-hresume       1                1.6 MB                   html-mf-hresume.list
html-mf-hreview       25               2.4 GB                   html-mf-hreview.list
html-mf-species       1                14 MB                    html-mf-species.list
html-mf-xfn           44               4.3 GB                   html-mf-xfn.list
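Where wget is not available, the same list-driven download can be approximated in a few lines of Python. This is a sketch under stated assumptions rather than an official client: it assumes a file list has already been saved locally and contains one download URL per line (the format wget -i reads), and the function names `read_url_list` and `download_all` are hypothetical.

```python
import os
import shutil
import urllib.request
from urllib.parse import urlparse


def read_url_list(list_path):
    """Return the non-empty lines of a file list (assumed: one URL per line)."""
    with open(list_path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def download_all(urls, target_dir):
    """Download each URL into target_dir, skipping files already present."""
    os.makedirs(target_dir, exist_ok=True)
    saved = []
    for url in urls:
        name = os.path.basename(urlparse(url).path)
        destination = os.path.join(target_dir, name)
        if not os.path.exists(destination):
            # Write to a temporary name first so that an interrupted
            # transfer never leaves a truncated file under the final name.
            partial = destination + ".part"
            with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
                shutil.copyfileobj(response, out)
            os.replace(partial, destination)
        saved.append(destination)
    return saved
```

A call such as `download_all(read_url_list("html-rdfa.list"), "html-rdfa")` would fetch the RDFa files into a local directory. Because existing files are skipped, the call can be repeated after an interruption; given the sizes in the table, checking the available disk space beforehand is advisable.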

Get the Code

The source code can be checked out from our GitHub repository. For more information about the framework and a detailed description of how to run your own extraction, visit the framework page.
The code for the analysis of the quads can be checked out from the StructuredDataProfiler GitHub repository.

Get Support

Please send questions and feedback to the Web Data Commons mailing list or post them in our Web Data Commons Google Group.