This page shows the extraction results for the pre-release of the 2012 corpus, which Common Crawl published in February. The pages in the pre-release are a subset of the pages in the August 2012 Common Crawl Corpus. We also extracted the structured data from this pre-release. The results for the complete 2012 corpus can be found here.

The February 2012 Common Crawl Corpus is part of the August 2012 Common Crawl Corpus and is no longer available as a separate download.

Extraction Statistics

Crawl Date: Feb 2012
Total Data: 20.9 Terabytes (compressed)
Total URLs: 1,700,611,442
Parsed HTML URLs: 1,486,186,868
Domains with Triples: 65,408,946
URLs with Triples: 188,821,015
Typed Entities: 1,222,563,749
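
The figures above can be put in relation to each other with a few divisions. The following is a minimal sketch in Python; the variable names are chosen for illustration and are not part of the extraction framework, and the ratios are derived here rather than reported by the original page.

```python
# Figures copied from the extraction statistics above.
total_urls = 1_700_611_442
parsed_html_urls = 1_486_186_868
urls_with_triples = 188_821_015
typed_entities = 1_222_563_749

# Share of all crawled URLs that could be parsed as HTML.
parsed_share = parsed_html_urls / total_urls

# Share of the parsed HTML URLs from which at least one triple was extracted.
triple_share = urls_with_triples / parsed_html_urls

# Average number of typed entities per URL that contains triples.
entities_per_url = typed_entities / urls_with_triples

print(f"Parsed as HTML:           {parsed_share:.1%} of all URLs")
print(f"Containing triples:       {triple_share:.1%} of parsed HTML URLs")
print(f"Typed entities per URL:   {entities_per_url:.1f}")
```

Under these numbers, roughly 87% of the crawled URLs were parsed as HTML, and roughly 13% of those yielded triples.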

Format Breakdown

Extraction Costs

The costs for parsing the 20.9 Terabytes of compressed input data of the February 2012 Common Crawl corpus, extracting the RDF data, and storing the extracted data on S3 totaled 523 EUR (excluding VAT) in Amazon EC2 fees. We used 100 spot instances of type c1.xlarge for the extraction, which altogether required 3,007 machine hours.
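
To make the cost figures easier to compare with other extraction runs, they can be broken down per machine hour and per Terabyte. The sketch below is illustrative only: the variable names are invented here, and the per-instance runtime assumes that the 3,007 machine hours were spread evenly over the 100 instances, which the text above does not state.

```python
# Figures copied from the extraction cost paragraph above.
cost_eur = 523.0        # total EC2 fees, excluding VAT
machine_hours = 3007    # summed over all instances
instances = 100         # c1.xlarge spot instances
input_tb = 20.9         # compressed input size in Terabytes

# Average price actually paid per machine hour.
cost_per_machine_hour = cost_eur / machine_hours

# Cost of processing one Terabyte of compressed input.
cost_per_tb = cost_eur / input_tb

# Average runtime per instance (assumption: even distribution of work).
hours_per_instance = machine_hours / instances

print(f"Cost per machine hour: {cost_per_machine_hour:.3f} EUR")
print(f"Cost per Terabyte:     {cost_per_tb:.2f} EUR")
print(f"Hours per instance:    {hours_per_instance:.2f}")
```

This works out to about 0.17 EUR per machine hour and about 25 EUR per Terabyte of compressed input.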