Anna Primpeli
Christian Bizer
Helene Bechtold

This page provides the WDC-25 Gold Standard for Product Categorization for public download. The gold standard consists of more than 24,000 manually labelled product offers from different e-shops. The offers are assigned to a flat categorization schema consisting of 25 product categories. We also present the results of a baseline categorization experiment in which we train and evaluate an ensemble of one-vs-rest classifiers. We apply the learned classifier to the WDC Training Dataset for Large-scale Product Matching in order to obtain a consistent categorization of all 26 million product offers contained in the dataset, which we also provide for download.


1. Motivation

Categorizing product offers from different e-shops into a single categorization schema is a task faced by many aggregators in e-commerce. The WDC Categorization Gold Standard allows the comparison of learning-based categorization methods on this task. The gold standard is based on the WDC Training Set for Large-scale Product Matching which contains offers from 79 thousand websites. The offers are marked up on the websites using the product vocabulary. We created the WDC Categorization Gold Standard by manually categorizing a subset of the offers in the training set into a flat, non-overlapping schema of 25 top-level product categories. We labelled at least 500 offers for each category.

For researchers who are interested in hierarchical product categorization, we also offer the WDC-222 Gold Standard for Hierarchical Product Categorization, which consists of 2,984 product offers from different e-shops that are assigned to 222 leaf node categories of the Icecat product categorization hierarchy.

In the following, we describe the methodology that was used to create the WDC-25 gold standard, as well as statistics about the dataset and baseline experiments. Finally, we present statistics about the results of applying the learned classifier to the whole WDC product data corpus.

2. Gold Standard Creation

To create a WDC-25 gold standard that contains a sufficient number of offers per category to train and evaluate a classification model that performs effectively on all categories, we manually labelled offers from the English Training Corpus.

First, we defined a set of categories with the goal of creating a taxonomy that is comparable to other relevant e-commerce category taxonomies. In order to create such a category set, the Amazon, Google and UNSPSC taxonomies were compared and the overlapping first-level categories were identified. Additionally, some second-level categories were used in the electronics and clothing domains to create a categorization that is not too broad, resulting in a set of 25 representative categories.

To find an equal number of different offers for each category, we manually verified the results of the transfer learning categorization, which was used as a first categorization experiment on the initial gold standard (for details about the transfer learning approach please refer to the WDC Product Matching website). More specifically, for each category, the offers assigned to that category by the transfer learning approach, as well as the offers in the same cluster, were reviewed and, if they were correctly classified, all offers in the cluster were annotated with the class label. The properties name, title, description, brand and manufacturer were used to identify the correct category. Offers that did not fit into any of the defined categories were labelled 'Others', and offers for which the category was unclear due to too little information or non-English attributes were labelled 'not found'.

For some categories, only a few offers were labelled by the transfer learning approach. To obtain more offers belonging to these categories, a keyword search was applied using words specific to the respective domain. Only clusters that contain fewer than 80 offers were selected to reduce noise. Furthermore, clusters containing more than one offer were preferred, and it was ensured that each offer contained at least a title or a name.

Finally, the categories in the categorization gold standard that was created for the first experiments were adapted to the defined set of categories, and its offers were added to the labelled data. This initial gold standard contains 985 offers categorized into 24 categories, with a very imbalanced distribution of categories. Details about the distribution of the offers over the categories in the initial gold standard can be found on the WDC Product Matching website.
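The cluster-based annotation step can be sketched in a few lines of Python. This is only an illustration of the idea, not the tooling that was actually used: the field names (`cluster_id`, `title`, `name`), the helper names and the decision to skip individual offers without a title or name are assumptions made for the example.

```python
from collections import defaultdict

MAX_CLUSTER_SIZE = 80  # only clusters with fewer than 80 offers are considered


def group_by_cluster(offers):
    """Map cluster id -> list of offers in that cluster (field names are hypothetical)."""
    clusters = defaultdict(list)
    for offer in offers:
        clusters[offer["cluster_id"]].append(offer)
    return clusters


def has_title_or_name(offer):
    """An offer is usable only if it carries at least a title or a name."""
    return bool(offer.get("title") or offer.get("name"))


def propagate_labels(offers, verified_labels):
    """Copy a manually verified cluster label to every usable offer of that cluster.

    verified_labels maps cluster id -> category confirmed by an annotator.
    """
    labelled = []
    for cluster_id, members in group_by_cluster(offers).items():
        if cluster_id not in verified_labels:
            continue  # cluster was never reviewed
        if len(members) >= MAX_CLUSTER_SIZE:
            continue  # large clusters are skipped to reduce noise
        for offer in members:
            if has_title_or_name(offer):
                labelled.append({**offer, "category": verified_labels[cluster_id]})
    return labelled


offers = [
    {"id": 1, "cluster_id": "c1", "title": "USB-C charging cable"},
    {"id": 2, "cluster_id": "c1", "name": "USB C cable 1m"},
    {"id": 3, "cluster_id": "c2", "title": "Running shoes"},
    {"id": 4, "cluster_id": "c3"},  # neither title nor name
]
verified = {"c1": "Cellphones_and_Accessories", "c3": "Others"}
labelled = propagate_labels(offers, verified)
```

In this toy run only offers 1 and 2 receive a label: cluster c2 was not reviewed, and offer 4 has neither a title nor a name.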

3. Gold Standard Statistics

The final categorization gold standard contains at least 50 clusters for each category and 2115 clusters in total. The clusters consist of 24,689 product offers. Table 1 shows the number of offers and clusters per category, as well as the average cluster size. Further, the coverage of the properties per category and in total (Figure 1) is given. The title property refers to the concatenated name and title for each offer. The distribution of the cluster sizes (number of offers per cluster) is depicted in Figure 2.

Table 1: Gold Standard Statistics per category

Category # offers # clusters Average size of clusters title description brand manufacturer
Automotive 1446 78 18.5 100% 66% 63% 5%
Baby 918 89 10.3 100% 69% 47% 12%
Books 656 89 7.4 100% 35% 62% 7%
CDs_and_Vinyl 604 90 6.7 95% 36% 19% 5%
Camera_and_Photo 968 91 10.6 100% 86% 35% 19%
Cellphones_and_Accessories 1377 90 15.3 100% 67% 70% 2%
Clothing 3242 232 14.0 100% 14% 1% 0%
Computers_and_Accessories 4753 162 29.3 100% 97% 94% 1%
Grocery_and_Gourmet_Food 561 76 7.6 100% 80% 20% 7%
Health_and_Beauty 506 73 6.9 99% 52% 23% 10%
Home_and_Garden 554 78 7.1 100% 69% 25% 4%
Jewelry 767 56 13.7 100% 79% 3% 1%
Luggage_and_Travel_Gear 812 72 11.3 99% 68% 31% 6%
Movies_and_TV 643 75 8.6 98% 91% 11% 9%
Musical_Instruments 570 83 6.9 99% 94% 35% 11%
Office_Products 659 57 11.6 100% 51% 6% 7%
Other_Electronics 687 87 7.9 100% 81% 33% 8%
Others 10 7 1.4 100% 70% 40% 10%
Pet_Supplies 610 77 7.9 100% 97% 3% 6%
Shoes 555 68 8.2 100% 48% 23% 2%
Sports_and_Outdoors 818 71 11.5 100% 85% 86% 1%
Tools_and_Home_Improvement 783 85 9.2 100% 68% 57% 17%
Toys_and_Games 586 89 6.6 100% 34% 21% 9%
Video_Games 584 82 7.1 100% 86% 48% 7%
not found 1020 58 17.6 97% 26% 3% 3%
Total 24689 2115 11.7 99.58% 65.48% 42.78% 4.87%
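The columns of Table 1 can be recomputed from labelled offers with a short script. The sketch below assumes each offer is a dictionary with hypothetical `category` and `cluster_id` fields plus the four properties; it is meant to make the column definitions explicit rather than to reproduce the original statistics code.

```python
from collections import defaultdict

PROPERTIES = ("title", "description", "brand", "manufacturer")


def category_statistics(offers):
    """Per category: offer count, cluster count, mean cluster size, property coverage."""
    by_category = defaultdict(list)
    for offer in offers:
        by_category[offer["category"]].append(offer)

    stats = {}
    for category, members in by_category.items():
        n_offers = len(members)
        n_clusters = len({o["cluster_id"] for o in members})
        stats[category] = {
            "offers": n_offers,
            "clusters": n_clusters,
            "avg_cluster_size": n_offers / n_clusters,
            # share of offers with a non-empty value for each property
            "coverage": {
                prop: sum(1 for o in members if o.get(prop)) / n_offers
                for prop in PROPERTIES
            },
        }
    return stats


toy_offers = [
    {"category": "Books", "cluster_id": "b1", "title": "Novel", "description": "A story"},
    {"category": "Books", "cluster_id": "b1", "title": "Novel (paperback)"},
    {"category": "Books", "cluster_id": "b2", "title": "Cookbook"},
]
stats = category_statistics(toy_offers)

# The same ratio applied to the Automotive and Total rows of Table 1:
automotive_avg = round(1446 / 78, 1)
total_avg = round(24689 / 2115, 1)
```

Applied to the counts in Table 1, the ratio gives 18.5 for Automotive and 11.7 for the total, matching the reported averages.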

4. Baseline Experiments

In order to set a baseline for the comparison of different categorization algorithms, we split the gold standard into a training and a test set and train a one-vs-rest ensemble of logistic regression classifiers. This ensemble reaches an F1 score of 85% on the test set. In the following, we provide details about the experiment.

Experimental Setup

First, a set of features was created from each offer. The properties name, title, description, brand and manufacturer were used. The title and name attributes were concatenated into a final title attribute. For each offer that did not have one of the properties itself, the respective parent property was used. So if an offer did not have its own manufacturer, the parent entity's manufacturer was assigned to it. Specification table keys and values, if available, were added as a further attribute. In order to extract useful information from the specification table values, only the values belonging to the keys Model, Type, Category, Sub-Category and Manufacturer were used. Additionally, the content of the HTML page of each offer was extracted by removing all HTML tags and code. The specification tables and HTML pages for the offers in the corpus can be downloaded from the WDC Product Matching website.
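A minimal sketch of this feature assembly is shown below. The offer layout (a `parent` dictionary and a `spec_table` dictionary) and the function name are assumptions for illustration, and the sketch keeps only the values of the selected specification keys; how the keys themselves were encoded is not reproduced here.

```python
ATTRIBUTES = ("description", "brand", "manufacturer")
SPEC_KEYS = ("Model", "Type", "Category", "Sub-Category", "Manufacturer")


def build_features(offer):
    """Assemble the text attributes of one offer, falling back to the parent entity."""
    parent = offer.get("parent", {})

    def own_or_parent(prop):
        # use the offer's own value, otherwise the parent's, otherwise an empty string
        return offer.get(prop) or parent.get(prop) or ""

    # name and title are concatenated into a single title attribute
    title_parts = (own_or_parent("name"), own_or_parent("title"))
    features = {"title": " ".join(part for part in title_parts if part)}

    for prop in ATTRIBUTES:
        features[prop] = own_or_parent(prop)

    # keep only the specification values that belong to the selected keys
    spec = offer.get("spec_table", {})
    features["spec_table"] = " ".join(str(spec[key]) for key in SPEC_KEYS if key in spec)
    return features


offer = {
    "name": "ThinkPad X1",
    "brand": "",
    "parent": {"brand": "Lenovo", "manufacturer": "Lenovo"},
    "spec_table": {"Model": "X1 Carbon", "Weight": "1.1 kg", "Type": "Laptop"},
}
features = build_features(offer)
```

Here the empty brand and the missing manufacturer are both filled from the parent entity, and the Weight entry of the specification table is ignored.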

The terms of each attribute were lowercased, and all punctuation characters and single letters or numbers were removed. Stop words were removed from the descriptions and HTML content using the stop word list of the Python Natural Language Toolkit (NLTK).
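The normalization can be illustrated with the standard library alone. In the sketch below, the small `STOP_WORDS` set is a stand-in for the NLTK English stop word list (available as `nltk.corpus.stopwords.words("english")` once the corpus is downloaded), and "single letters or numbers" is interpreted as single-character tokens; both are assumptions of this example.

```python
import string

# Stand-in for the NLTK English stop word list (assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "for", "with", "in", "on", "to", "is"}

# Replace every punctuation character with a space so that "eos-5d" splits into two tokens.
_PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def normalize(text, remove_stop_words=False):
    """Lowercase, strip punctuation, drop single-character tokens and optionally stop words."""
    tokens = text.lower().translate(_PUNCT_TABLE).split()
    tokens = [tok for tok in tokens if len(tok) > 1]
    if remove_stop_words:
        tokens = [tok for tok in tokens if tok not in STOP_WORDS]
    return " ".join(tokens)


title = normalize("The Canon EOS-5D, a 30 MP camera (Mark 4)!")
description = normalize("The Canon EOS-5D, a 30 MP camera (Mark 4)!", remove_stop_words=True)
```

Stop word removal is switched on only for the description-like attributes, mirroring the text above; titles keep their stop words.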

The training set was built by grouping the offers in the manually labelled gold standard by their ID clusters, concatenating the values of each attribute in a cluster. The resulting dataset was split into a training and a test set by assigning all clusters containing only one offer to the training set and splitting the remaining clusters by stratified sampling into 80% training and 20% test data. The training set was highly imbalanced regarding the class distribution. The clusters of each category were therefore up- and downsampled to the median number of 72 clusters per category.
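The split and the resampling can be sketched as follows. The cluster representation (dictionaries with `category` and `size`), the rounding of the test share and upsampling by drawing with replacement are assumptions of this example; the original experiments may differ in these details.

```python
import random
from collections import defaultdict
from statistics import median


def split_clusters(clusters, test_share=0.2, seed=0):
    """Singleton clusters go to training; the rest is split per category (stratified)."""
    rng = random.Random(seed)
    train = [c for c in clusters if c["size"] == 1]
    by_category = defaultdict(list)
    for cluster in clusters:
        if cluster["size"] > 1:
            by_category[cluster["category"]].append(cluster)

    test = []
    for members in by_category.values():
        members = members[:]
        rng.shuffle(members)
        n_test = round(len(members) * test_share)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def resample_to_median(train, seed=0):
    """Down- or upsample every category to the median number of clusters per category."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for cluster in train:
        by_category[cluster["category"]].append(cluster)

    target = int(median(len(members) for members in by_category.values()))
    balanced = []
    for members in by_category.values():
        if len(members) >= target:
            balanced.extend(rng.sample(members, target))  # downsample without replacement
        else:
            balanced.extend(members)
            # upsample by drawing additional clusters with replacement
            balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


clusters = (
    [{"category": "A", "size": 3} for _ in range(10)]
    + [{"category": "A", "size": 1} for _ in range(2)]
    + [{"category": "B", "size": 2} for _ in range(5)]
    + [{"category": "C", "size": 4} for _ in range(20)]
)
train, test = split_clusters(clusters)
balanced = resample_to_median(train)
```

In the toy data the training set holds 10, 4 and 16 clusters for the three categories, so every category is resampled to the median of 10.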

In addition to the training set derived from the WDC Categorization Gold Standard, we also use a subset of the UCSD Amazon Product Dataset for training. This subset was created by randomly sampling 1000 offers per category that contain a title and a description. We also provide this subset for download at the end of the page.
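The per-category sampling can be expressed compactly. The sketch below assumes offers are dictionaries with hypothetical `category`, `title` and `description` fields; the fixed seed and the handling of categories with fewer eligible offers than requested are choices made for this example only.

```python
import random
from collections import defaultdict


def sample_per_category(offers, per_category=1000, seed=0):
    """Randomly draw up to `per_category` offers per category that have title and description."""
    rng = random.Random(seed)
    eligible = defaultdict(list)
    for offer in offers:
        if offer.get("title") and offer.get("description"):
            eligible[offer["category"]].append(offer)

    sample = []
    for members in eligible.values():
        # categories with too few eligible offers contribute all of them
        sample.extend(rng.sample(members, min(per_category, len(members))))
    return sample


toy_offers = (
    [{"category": "Books", "title": f"Book {i}", "description": "text"} for i in range(4)]
    + [{"category": "Books", "title": "No description"}]
    + [{"category": "Toys", "title": f"Toy {i}", "description": "text"} for i in range(2)]
)
subset = sample_per_category(toy_offers, per_category=3)
```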

The classification experiments were done using scikit-learn. We created feature vectors by computing tf-idf vectors for each attribute separately in the training set and the corpus. The parameters for the vector creation were optimized using the training set and grid search with 5-fold cross-validation. For the title attribute bigrams were used to create the tf-idf vectors, for the remaining attributes unigrams were used. The number of features was restricted to 10,000 for the manufacturer and specification tables attributes.

For classification a Logistic Regression Classifier was optimized with grid search and 5-fold cross-validation on the training set. The resulting logistic regression model uses stochastic average gradient descent with a one-vs-rest approach. Thus, the multi-class classification problem was reduced to 25 binary classification problems. The model was applied to the offers in the corpus grouped in their ID clusters. Thus, all offers in a cluster were assigned to one category.
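A setup of this kind can be sketched with scikit-learn as follows. This is not the original experiment code: the toy data, the attribute subset, the `(1, 2)` n-gram range for titles, `max_iter` and the helper names `attribute` and `tfidf_for` are assumptions, and the grid search (for example with `GridSearchCV` and `cv=5`) is left out because the toy data is too small for it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer


def attribute(name):
    """Select one text attribute from a list of cluster dictionaries."""
    return FunctionTransformer(lambda rows: [row.get(name, "") for row in rows])


def tfidf_for(name, ngram_range=(1, 1), max_features=None):
    """Separate tf-idf vector space for a single attribute."""
    return Pipeline([
        ("select", attribute(name)),
        ("tfidf", TfidfVectorizer(ngram_range=ngram_range, max_features=max_features)),
    ])


features = FeatureUnion([
    ("title", tfidf_for("title", ngram_range=(1, 2))),        # bigrams for titles
    ("description", tfidf_for("description")),                # unigrams elsewhere
    ("manufacturer", tfidf_for("manufacturer", max_features=10_000)),
])

# One binary logistic regression per category, fitted with the SAG solver.
model = Pipeline([
    ("features", features),
    ("clf", OneVsRestClassifier(
        LogisticRegression(solver="sag", max_iter=1000, random_state=0))),
])

train = [
    {"title": "wireless gaming mouse", "description": "ergonomic usb mouse for pc", "manufacturer": "logitech"},
    {"title": "mechanical keyboard", "description": "usb keyboard for pc", "manufacturer": "logitech"},
    {"title": "running shoes men", "description": "lightweight shoes for running", "manufacturer": "nike"},
    {"title": "leather boots women", "description": "waterproof winter boots", "manufacturer": "timberland"},
    {"title": "fantasy novel paperback", "description": "epic fantasy book", "manufacturer": "penguin"},
    {"title": "cookbook hardcover", "description": "italian recipes book", "manufacturer": "penguin"},
]
labels = ["Computers_and_Accessories"] * 2 + ["Shoes"] * 2 + ["Books"] * 2

model.fit(train, labels)
predictions = model.predict(train)
```

Because each input row represents one cluster with concatenated attribute values, a single prediction is made per cluster, and every offer in that cluster inherits it.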

Results of the Experiments

The classifier achieves a micro-averaged F1 score of 85% on the test set when trained using the training set derived from the WDC Categorization Gold Standard as well as the Amazon training data. The classifier achieves a micro-averaged F1 score of 82% when trained without the Amazon data. Table 2 shows the category-specific performance of the classifier trained on the WDC and Amazon data.

Table 2: Results per category

Category Precision Recall F1 # clusters
Automotive 1.0 1.0 1.0 10
Baby 1.0 1.0 1.0 10
Books 0.90 0.90 0.90 10
CDs_and_Vinyl 1.0 1.0 1.0 10
Camera_and_Photo 0.79 0.92 0.85 12
Cellphones_and_Accessories 0.90 0.90 0.90 10
Clothing 0.96 0.91 0.93 47
Computers_and_Accessories 0.94 0.91 0.92 32
Grocery_and_Gourmet_Food 0.82 0.90 0.86 10
Health_and_Beauty 0.73 0.80 0.76 10
Home_and_Garden 0.73 0.73 0.73 11
Jewelry 0.83 1.0 0.91 10
Luggage_and_Travel_Gear 0.80 0.80 0.80 10
Movies_and_TV 0.60 0.30 0.40 10
Musical_Instruments 1.0 0.90 0.95 10
Office_Products 0.77 1.0 0.87 10
Other_Electronics 0.47 0.64 0.54 11
Others 0.0 0.0 0.0 1
Pet_Supplies 1.0 0.90 0.95 10
Shoes 0.82 0.82 0.82 11
Sports_and_Outdoors 0.67 0.80 0.73 10
Tools_and_Home_Improvement 0.78 0.70 0.74 10
Toys_and_Games 0.69 0.90 0.78 10
Video_Games 1.0 0.90 0.95 10
not found 0.83 0.45 0.59 11
Micro Avg 0.85 0.85 0.85 306
Macro Avg 0.80 0.80 0.79
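The difference between the two averages in Table 2 can be made concrete with a small, self-contained computation. The labels below are toy data, and the macro F1 is taken as the unweighted mean of the per-category F1 scores, which is an assumption about how the reported value was derived.

```python
from collections import Counter


def f1_report(y_true, y_pred):
    """Return per-class, micro-averaged and macro-averaged (precision, recall, F1)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def prf(tp_, fp_, fn_):
        precision = tp_ / (tp_ + fp_) if tp_ + fp_ else 0.0
        recall = tp_ / (tp_ + fn_) if tp_ + fn_ else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    per_class = {label: prf(tp[label], fp[label], fn[label]) for label in labels}
    # micro: pool the counts of all classes before computing the scores
    micro = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # macro: average the per-class scores, so every class has equal weight
    macro = tuple(sum(scores[i] for scores in per_class.values()) / len(labels) for i in range(3))
    return per_class, micro, macro


y_true = ["Books", "Books", "Shoes", "Shoes", "Shoes", "Others"]
y_pred = ["Books", "Shoes", "Shoes", "Shoes", "Shoes", "Books"]
per_class, micro, macro = f1_report(y_true, y_pred)
```

When every cluster has exactly one true and one predicted category, micro-averaged precision, recall and F1 all coincide with accuracy, which is consistent with the three identical values in the Micro Avg row. The macro average gives small categories such as Others the same weight as large ones, which is why it is lower.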

In addition to the test set that we derived from the WDC Categorization Gold Standard, we also evaluate the classifier that we learned using the training set and the Amazon data on the initial WDC Gold Standard (consisting of 985 offers) as the test set. The classifier achieves 84% micro-averaged F1 on the initial gold standard.

5. Corpus Categorization

In order to obtain a consistent categorization for all 26 million product offers contained in the WDC product data corpus, we apply the classifier that we learned using the WDC and Amazon data to all offers in the corpus. Figure 3 shows the resulting distribution of categories over the individual offers, and Figure 4 shows the distribution over the clusters in the corpus.
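Since the classifier assigns one category per cluster, the offer-level and cluster-level distributions follow from two simple counts. The sketch below uses hypothetical mappings from offers to clusters and from clusters to predicted categories.

```python
from collections import Counter


def category_distributions(offer_to_cluster, cluster_to_category):
    """Count categories once per offer and once per cluster."""
    # every offer inherits the category predicted for its cluster
    offer_counts = Counter(cluster_to_category[cluster] for cluster in offer_to_cluster.values())
    cluster_counts = Counter(cluster_to_category.values())
    return offer_counts, cluster_counts


offer_to_cluster = {"o1": "c1", "o2": "c1", "o3": "c1", "o4": "c2", "o5": "c3"}
cluster_to_category = {"c1": "Clothing", "c2": "Books", "c3": "Clothing"}
offer_counts, cluster_counts = category_distributions(offer_to_cluster, cluster_to_category)
```

The two distributions differ whenever cluster sizes vary between categories, because a large cluster contributes many offers but only one cluster.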

In order to verify the performance of the classifier on the whole corpus, we manually checked a sample of 10 offers per category, excluding non-English offers as well as offers without any information. The results per category are shown in Figure 5. The percentages and colors indicate the number of category offers in the corpus, dark blue representing a high number of offers and light blue a low number in the corpus.

6. Download

Below, we provide the WDC-25 Gold Standard for Product Categorization for public download. The gold standard is a subset of the English training corpus and contains the same properties for each offer as described here. We further provide the subsets of the gold standard that were used for training and testing, aggregated in clusters, as well as the Amazon sample that was used as additional training data. The training and test sets contain the offers' own as well as the parent properties name, title, description, brand and manufacturer. Additionally, specification table keys and values and the HTML content are included. Each attribute is concatenated within a cluster. The Amazon training offers contain a title, brand and description along with a category label and an identifier (asin). Finally, we offer the Categorized Training Dataset for Large-scale Product Matching for download. It contains cluster ids and category labels in CSV format.

File Sample Size Download
Categorization Gold Standard categories_gold_standard_offers_sample.json 26MB categories_gold_standard_offers.json.gzip
Categorization Training Set categories_clusters_training_sample.json 303.4MB categories_clusters_training.json.gzip
Categorization Test Set categories_clusters_testing_sample.json 68.3MB categories_clusters_testing.json.gzip
Amazon Training Data amazon_training_sample.json 25.4MB amazon_training.json.gzip
Categorized Matching Corpus (English) categorized_clusters_english_sample.csv 247MB categories_offers_en_clusters.csv.gzip

7. References

  1. Primpeli, A., Peeters, R., & Bizer, C.: The WDC training dataset and gold standard for large-scale product matching. In: Companion Proceedings of the 2019 World Wide Web Conference. pp. 381-386. ACM (2019).

8. Feedback

Please send questions and feedback to the Web Data Commons Google Group.

More information about Web Data Commons is found here.