José Luis Garza Garza
Ralph Peeters
Christian Bizer

This page provides the WDC-24 Gold Standard for Product Categorization, as well as a categorization of all product offers in the WDC PDC2020-C Products Corpus, for public download. The gold standard consists of over 17,000 manually labeled product offers from e-commerce websites. The offers are assigned to a non-hierarchical categorization schema consisting of 24 product categories. The gold standard is a result of the master's thesis of José Luis Garza Garza. Below, José explains how he built the gold standard and used it to compare the performance of different machine learning techniques on the product categorization task. The best-performing model (RoBERTa Base) is used to categorize all offers in the PDC2020-C Products Corpus.


1. Introduction

Product categorization is an important task in e-commerce: it influences search and recommendation systems, customer experience, and revenue generation. However, assigning an enormous volume of products to the correct categories is challenging because of the inherent complexity of the task and the lack of labeled data. Product descriptions are highly variable, unstructured text, and different vendors may describe the same product in different ways, so the ML model must categorize products accurately regardless of these discrepancies. Additionally, multiclass classification with a high number of product categories significantly increases the complexity of the task.

We therefore focus on addressing these challenges by developing scalable and efficient deep learning models for product categorization that can handle the large volume and diversity of products, manage the variability in product descriptions, and outperform traditional machine learning methods. We explore the application of transformer-based deep learning models, such as BERT and RoBERTa, to this task and develop a strategy for creating the WDC-24 Gold Standard. The goal is to accurately and consistently categorize the WDC Products Corpus using transformer-based models.

2. Gold Standard Creation

To create the WDC-24 Gold Standard, we use a semi-automatic annotation strategy. A baseline model first predicts category labels for the products in the corpus, and its outputs act as preliminary category assignments. These initial labels should not be accepted without further review, so a manual verification step follows the model’s prediction: we take a subset of clusters and verify whether the model’s predicted category aligns with the product descriptions.

First, we established a baseline model by following the experimental procedure and the training and testing data introduced in the WDC-25 Gold Standard for Product Categorization. The experiment uses a non-hierarchical product categorization and Logistic Regression with a One-vs-Rest (OvR) approach, a widely used and easily comprehensible ML model. This technique is well suited to multiclass classification because it fits a separate model for each category, treating each as a binary classification problem. Furthermore, this method is computationally efficient, which was crucial because of the large datasets used in our task.
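
The One-vs-Rest scheme described above can be written down compactly. The formulation below is the standard textbook one and is only a sketch: the symbols x (a feature vector for an offer), w_k and b_k (the parameters of the model for category k), and K (the number of categories) are notation introduced here, and the page does not state which feature representation was used.

```latex
% One binary logistic regression model per category k = 1, ..., K
p_k(x) = \sigma\!\left(w_k^\top x + b_k\right), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}

% Model k is trained on binary targets: y_k = 1 if the offer belongs to category k, else 0.
% The predicted category is the one whose binary model is most confident:
\hat{y}(x) = \arg\max_{k \in \{1, \dots, K\}} p_k(x)
```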

After the baseline model has predicted the product categories for the WDC Products Corpus, we apply majority voting as a post-processing step that exploits the underlying structure of the data. This approach reinforces the model’s predictions and increases confidence in the final category assignment. The WDC Products Corpus groups products into clusters based on similarity. These clusters are derived from product co-occurrences across various e-commerce sites, which means the products within the same cluster typically belong to the same category. However, due to the diversity of e-commerce offerings and their descriptions, certain products within the same cluster may be classified into different categories. Majority voting sets a democratic rule within each cluster: the category predicted most frequently within a cluster is deemed the “majority vote” and assigned as the final category for all products within that cluster. This method leverages the data’s inherent structure (clusters) to rectify inconsistencies in the model’s predictions. The step improves the quality of our model’s output in two ways. First, it reduces noise in the predicted categories, as the majority vote within each cluster overrules misclassifications. Second, by aligning the products’ category assignment with the underlying structure of the data, it keeps the model’s outputs consistent within each cluster.
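
The majority-voting step can be illustrated with a minimal Python sketch. The function name, the tuple layout `(offer_id, cluster_id, predicted_category)`, and the sample data are assumptions made for illustration; the page does not describe how ties were broken, so the sketch simply keeps the category that was seen first among tied ones.

```python
from collections import Counter, defaultdict


def majority_vote(predictions):
    """Return {cluster_id: most frequent predicted category}.

    predictions: iterable of (offer_id, cluster_id, predicted_category).
    """
    votes = defaultdict(Counter)
    for _offer_id, cluster_id, category in predictions:
        votes[cluster_id][category] += 1
    # most_common(1) yields the top category; among equal counts, the
    # category encountered first wins (an assumption of this sketch).
    return {cid: counts.most_common(1)[0][0] for cid, counts in votes.items()}


# Hypothetical per-offer predictions from a classifier.
predictions = [
    ("o1", "c1", "Shoes"),
    ("o2", "c1", "Shoes"),
    ("o3", "c1", "Clothing"),  # outlier inside cluster c1
    ("o4", "c2", "Books"),
]

cluster_labels = majority_vote(predictions)
# Every offer inherits the label of its cluster.
final_labels = {
    offer_id: cluster_labels[cluster_id]
    for offer_id, cluster_id, _category in predictions
}
```

In this toy input, offer `o3` was predicted as Clothing but is relabeled to Shoes because two of the three offers in cluster `c1` were predicted as Shoes.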

Our next step is to define a strategy for the manual verification. The selection for this verification stage is not arbitrary: we focus on clusters that have between 5 and 30 products per category, as these clusters are narrow enough that they are more likely to contain products that are indeed related. These clusters are a subset of those populated by the baseline model’s predictions. During the analysis and categorization of the data, we made several assumptions about the classification of certain products. These assumptions were necessary to create a unified structure of categories and to help in interpreting the data. One consequence of the manual verification is an imbalanced dataset: since some categories may have more correct predictions than others, the number of verified clusters per category can vary significantly. To mitigate this imbalance and ensure the models are well trained across all categories, we perform additional checks on more clusters from the minority classes.
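
As a rough illustration of the size-based selection, the sketch below keeps only clusters whose size falls within the stated bounds of 5 and 30 products. The function name, the `offer_to_cluster` mapping, and the sample data are hypothetical; the actual selection procedure may have applied further criteria.

```python
from collections import Counter

MIN_SIZE, MAX_SIZE = 5, 30  # size bounds stated in the text


def select_clusters_for_review(offer_to_cluster, min_size=MIN_SIZE, max_size=MAX_SIZE):
    """Return the sorted IDs of clusters with min_size <= size <= max_size."""
    sizes = Counter(offer_to_cluster.values())
    return sorted(cid for cid, n in sizes.items() if min_size <= n <= max_size)


# Hypothetical corpus: three clusters with 3, 7, and 40 offers.
offer_to_cluster = {}
offer_to_cluster.update({f"a{i}": "c_small" for i in range(3)})
offer_to_cluster.update({f"b{i}": "c_mid" for i in range(7)})
offer_to_cluster.update({f"c{i}": "c_large" for i in range(40)})

selected = select_clusters_for_review(offer_to_cluster)
```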

3. Gold Standard Statistics

The WDC-24 Gold Standard has at least 70 clusters per category, contributing to a more balanced dataset that contains 2,620 unique clusters, 17,580 products, and an average cluster size of 6.67 products. Table 1 presents a complete overview of the product categories distribution within the WDC-24 Gold Standard after the manual verification. This analysis is useful because it helps us understand the labeled corpus’s size and class distribution. It allows us to define the best sampling approach to create a good training, validation, and testing set for our task. The WDC-24 Gold Standard contains 24 categories, as the ’Not Found’ category from the original taxonomy was not included because of the low number of predicted labels.

| Category Label | # Offers | # Clusters | Avg Cluster Size | Max Cluster Size |
|---|---|---|---|---|
| Home and Garden | 1,389 | 181 | 7.67 | 25 |
| Toys and Games | 1,318 | 194 | 6.79 | 30 |
| Computers and Accessories | 1,305 | 161 | 8.11 | 25 |
| Sports and Outdoors | 1,144 | 172 | 6.65 | 22 |
| Other Electronics | 915 | 118 | 7.75 | 28 |
| Shoes | 735 | 122 | 6.02 | 13 |
| Clothing | 733 | 116 | 6.32 | 15 |
| CDs and Vinyl | 716 | 114 | 6.28 | 13 |
| Cell Phones and Accessories | 694 | 98 | 7.08 | 29 |
| Automotive | 667 | 92 | 7.25 | 21 |
| Pet Supplies | 664 | 99 | 6.71 | 20 |
| Books | 654 | 105 | 6.23 | 16 |
| Jewelry | 642 | 103 | 6.23 | 16 |
| Office Products | 613 | 98 | 6.26 | 11 |
| Camera and Photo | 604 | 82 | 7.36 | 20 |
| Health and Beauty | 588 | 99 | 5.94 | 11 |
| Video Games | 583 | 91 | 6.41 | 16 |
| Movies and TV | 556 | 90 | 6.18 | 16 |
| Tools and Home Improvement | 555 | 92 | 6.03 | 20 |
| Musical Instruments | 546 | 90 | 6.07 | 11 |
| Grocery and Gourmet Food | 546 | 90 | 6.07 | 14 |
| Baby | 498 | 72 | 6.92 | 22 |
| Luggage and Travel Gear | 486 | 82 | 5.93 | 19 |
| Others | 429 | 72 | 5.96 | 16 |
| Totals | 17,580 | 2,633 | 6.67 | - |

Table 1: WDC-24 Gold Standard Statistics

Dividing data into training, validation, and testing sets is crucial in building any machine learning model. It allows us to evaluate the model’s performance and its ability to generalize to unseen data, mitigating the risk of overfitting. We split our labeled corpus 80%-10%-10%: 80% of the data is used for training, 10% for validation, and 10% for testing our models. Importantly, we split the data by cluster, not by single offer, so that all offers of one cluster appear in exactly one set. Together with the labeling strategy discussed in the previous section, this split lays a solid foundation for developing and validating our product categorization models. We also provide these subsets for download at the end of the page.
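
A cluster-level 80%-10%-10% split can be sketched as follows. This is a minimal illustration rather than the procedure actually used: the function name, the `(offer_id, cluster_id)` layout, the fixed seed, and the absence of stratification by category are all assumptions. Because whole clusters are assigned to a set, the offer-level proportions are only approximately 80/10/10 when cluster sizes differ.

```python
import random


def split_by_cluster(offers, train=0.8, valid=0.1, seed=42):
    """Split offers into train/valid/test so that no cluster spans two sets.

    offers: list of (offer_id, cluster_id) tuples.
    """
    cluster_ids = sorted({cluster_id for _offer_id, cluster_id in offers})
    random.Random(seed).shuffle(cluster_ids)  # deterministic shuffle

    n_train = int(len(cluster_ids) * train)
    n_valid = int(len(cluster_ids) * valid)

    assignment = {}
    for i, cluster_id in enumerate(cluster_ids):
        if i < n_train:
            assignment[cluster_id] = "train"
        elif i < n_train + n_valid:
            assignment[cluster_id] = "valid"
        else:
            assignment[cluster_id] = "test"

    splits = {"train": [], "valid": [], "test": []}
    for offer in offers:
        splits[assignment[offer[1]]].append(offer)
    return splits


# Hypothetical data: 20 clusters with 3 offers each.
offers = [(f"o{c}_{i}", f"c{c}") for c in range(20) for i in range(3)]
splits = split_by_cluster(offers)
```

Splitting by cluster prevents near-duplicate offers of the same product from appearing in both the training and the testing set, which would otherwise inflate the measured performance.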

4. Benchmark Experiments and Results

This section presents the approach we followed to train and evaluate the selected models using the WDC-24 Gold Standard. We compare a diverse range of models to gain a thorough understanding of how performance varies across architectures. This analysis helps us determine which model best suits our large-scale experiment on the latest WDC Product Corpus (2020).

The performance results for each model, after being trained with their respective configurations, are presented in Table 2.

| Model | Micro F1-Score | Macro F1-Score |
|---|---|---|
| Logistic Regression with OvR | 0.85 | 0.85 |
| BERT Base Uncased | 0.88 | 0.88 |
| BERT Large Uncased | 0.88 | 0.88 |
| BERT Base Multilingual | 0.86 | 0.85 |
| RoBERTa Base | 0.88 | 0.89 |
| RoBERTa Large | 0.88 | 0.88 |
| XLM-RoBERTa | 0.88 | 0.87 |

Table 2: Results Comparison
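
For readers unfamiliar with the two metrics in Table 2, the sketch below computes micro and macro F1 from scratch for a single-label multiclass task. The function name and the toy labels are illustrative and unrelated to the actual experiments. Micro F1 pools true positives, false positives, and false negatives over all categories (for single-label classification it equals accuracy), while macro F1 averages the per-category F1 scores so that small categories weigh as much as large ones.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted category gains a false positive
            fn[truth] += 1  # true category gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    labels = sorted(set(y_true) | set(y_pred))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


# Toy example with two categories.
y_true = ["Shoes", "Shoes", "Shoes", "Books"]
y_pred = ["Shoes", "Shoes", "Books", "Books"]
micro_f1, macro_f1 = f1_scores(y_true, y_pred)
```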

Selection of the Best Performing Model

Factors such as model complexity, generalization capability, efficiency, and scalability all play a role in selecting a model. Based on the results in Table 2, we choose RoBERTa Base: it achieves the highest macro F1-score (0.89) and matches the best micro F1-score (0.88), while being smaller than the Large variants. Following Occam's razor, we favor the simpler model when its predictions are equally accurate. RoBERTa Base generalizes well across the diverse categories, and its computational efficiency and scalability matter as the data volume increases. Considering these factors, RoBERTa Base is the preferred choice for the large-scale experiment.

5. Categorization of all Offers in the PDC2020-C Corpus

We use the best-performing model to categorize all 22 million offers in the WDC PDC2020-C Products Corpus. Table 3 and Figure 1 show the outcome of this large-scale experiment, which applies the RoBERTa Base model followed by majority voting for the final product classification.

| Category | # Offers | # Clusters | Avg Cluster Size | % Clusters with > 1 Product |
|---|---|---|---|---|
| Clothing | 3,716,240 | 3,617,411 | 1.03 | 2.21 |
| Home and Garden | 2,953,284 | 2,789,280 | 1.06 | 4.10 |
| Sports and Outdoors | 1,884,730 | 1,779,077 | 1.06 | 4.01 |
| Office Products | 1,336,911 | 1,266,785 | 1.06 | 3.64 |
| Jewelry | 1,310,126 | 1,246,801 | 1.05 | 3.55 |
| Tools and Home Improvement | 1,190,917 | 1,147,093 | 1.04 | 2.79 |
| Health and Beauty | 1,048,407 | 904,705 | 1.16 | 9.27 |
| Automotive | 937,906 | 886,365 | 1.06 | 4.40 |
| Toys and Games | 925,174 | 845,964 | 1.09 | 4.61 |
| Books | 866,078 | 802,832 | 1.08 | 5.85 |
| Shoes | 817,529 | 777,374 | 1.05 | 3.36 |
| Grocery and Gourmet Food | 799,108 | 724,006 | 1.10 | 7.71 |
| Other Electronics | 605,783 | 557,814 | 1.09 | 5.25 |
| Computers and Accessories | 609,701 | 533,220 | 1.14 | 7.79 |
| Pet Supplies | 477,353 | 438,148 | 1.09 | 5.40 |
| Luggage and Travel Gear | 434,211 | 417,249 | 1.04 | 3.11 |
| Musical Instruments | 423,967 | 393,570 | 1.08 | 5.23 |
| Baby | 354,002 | 332,050 | 1.07 | 4.84 |
| Cell Phones and Accessories | 332,424 | 310,285 | 1.07 | 4.75 |
| CDs and Vinyl | 315,851 | 289,000 | 1.09 | 6.53 |
| Camera and Photo | 260,459 | 233,486 | 1.12 | 7.15 |
| Others | 183,658 | 156,559 | 1.17 | 8.42 |
| Video Games | 88,780 | 84,024 | 1.06 | 3.55 |
| Movies and TV | 56,611 | 52,143 | 1.09 | 5.52 |
| Totals | 21,929,210 | 20,585,241 | 1.06 | - |

Table 3: Large Scale Experiment Statistics

In order to verify the performance of the RoBERTa Base model on the whole corpus, we manually checked a sample of 10 offers per category (240 records), considering only products that contain information from the following three features: Title, Brand, and Description. Overall, the experiment has an accuracy of 72.04%. However, the performance per class varies significantly across different categories, highlighting the strengths and potential improvement areas in the model’s predictive capability. Figure 2 shows the results per category in descending order.
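
The per-category breakdown of such a manual check can be computed as in the sketch below. The function name and the `(predicted_category, is_correct)` layout are assumptions, and the sample judgments are invented purely for illustration; they are not the results of the actual evaluation.

```python
from collections import defaultdict


def accuracy_report(checked):
    """Return (overall_accuracy, [(category, accuracy), ...] sorted descending).

    checked: iterable of (predicted_category, is_correct) pairs, where
    is_correct is the reviewer's verdict on the predicted category.
    """
    totals, correct = defaultdict(int), defaultdict(int)
    for category, is_correct in checked:
        totals[category] += 1
        correct[category] += int(is_correct)

    per_category = {c: correct[c] / totals[c] for c in totals}
    overall = sum(correct.values()) / sum(totals.values())
    ranked = sorted(per_category.items(), key=lambda kv: kv[1], reverse=True)
    return overall, ranked


# Invented sample: 10 checked offers for each of two categories.
checked = (
    [("Shoes", True)] * 9 + [("Shoes", False)] * 1
    + [("Books", True)] * 6 + [("Books", False)] * 4
)
overall, ranked = accuracy_report(checked)
```

With only 10 offers per category, each per-category accuracy moves in steps of 10 percentage points, so differences between categories should be read with caution.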

6. Download

Below, we provide the WDC-24 Gold Standard for Product Categorization for public download. We also offer the training, validation, and testing sets used for all the experiments. Running our best-performing model, RoBERTa Base, on the WDC Product Corpus produced two separate output files. One file contains the results obtained using majority voting, listing each cluster ID with its category label. The other file contains the results without majority voting, listing each offer ID with its category label.

| File | Size | Download |
|---|---|---|
| WDC-24 Gold Standard | 11.7MB | WDC_24_GoldStandard.csv |
| WDC-24 Gold Standard Training Set | 9.3MB | WDC_24_GoldStandard_TrainingSet.csv |
| WDC-24 Gold Standard Validation Set | 1.1MB | WDC_24_GoldStandard_ValidationSet.csv |
| WDC-24 Gold Standard Testing Set | 1.2MB | WDC_24_GoldStandard_TestingSet.csv |
| Large Scale Experiment on WDC Product Corpus with Majority Voting Using RoBERTa Base Model | 133.6MB | WDC_Corpus_LargeScaleExperiment_MajorityVoting.json.gz |
| Large Scale Experiment on WDC Product Corpus without Majority Voting Using RoBERTa Base Model | 82.8MB | WDC_Corpus_LargeScaleExperiment_WithoutMajorityVoting.json.gz |

7. References

[1] Anna Primpeli, Ralph Peeters, and Christian Bizer. The WDC Training Dataset and Gold Standard for Large-Scale Product Matching. In Companion Proceedings of The 2019 World Wide Web Conference, WWW ’19, pages 381–386, New York, NY, USA, May 2019. Association for Computing Machinery.

[2] Ralph Peeters, Reng Chiz Der, and Christian Bizer. WDC Products: A Multi-Dimensional Entity Matching Benchmark. January 2023. arXiv:2301.09521 [cs].

8. Feedback

Please send questions and feedback to the Web Data Commons Google Group.

More information about Web Data Commons is found here.