CADC: Encoding User-Item Interactions for Compressing Recommendation Model Training Data

Hossein Entezari Zarch (entezari@usc.edu), Abdulla Alshabanah (aalshaba@usc.edu), Chaoyi Jiang (chaoyij@usc.edu), and Murali Annavaram (annavara@usc.edu)
University of Southern California, Los Angeles, California, USA
Abstract.

Deep learning recommendation models (DLRMs) are at the heart of the current e-commerce industry. However, the amount of data used to train these large models is growing exponentially, leading to substantial training hurdles. The training dataset contains two primary types of information: content-based information (features of users and items) and collaborative information (interactions between users and items). One approach to reducing the training dataset is to remove user-item interactions. But that significantly diminishes collaborative information, which is crucial for maintaining accuracy because it carries the interaction histories. This loss profoundly impacts DLRM performance. This paper makes the important observation that if one can capture the user-item interaction history to enrich the user and item embeddings, then the interaction history can be compressed without losing model accuracy. Thus, this work, Collaborative Aware Data Compression (CADC), takes a two-step approach to training dataset compression. In the first step, we use matrix factorization of the user-item interaction matrix to create a novel embedding representation for both the users and items. Once the user and item embeddings are enriched by the interaction history information, the approach applies uniform random sampling to drastically reduce the training dataset size while minimizing the drop in model accuracy. The source code of CADC is available at https://anonymous.4open.science/r/DSS-RM-8C1D/README.md.

Keywords: Deep Learning Recommendation Models, Large-scale Recommender Systems, Data Compression
CCS Concepts: Information systems → Recommender systems; Information systems → Learning to rank

1. Introduction

Deep learning recommendation models (DLRM) play a pivotal role in enhancing user experience, by suggesting new and pertinent content, across numerous online platforms. Companies like Meta, Google, Microsoft, Netflix, and Alibaba employ these sophisticated models for a range of services, including personalizing and ranking Instagram stories (Medvedev et al., 2019), video suggestions on YouTube (Covington et al., 2016), mobile app recommendations on Google Play (Cheng et al., 2016), personalized news and entertainment options (Elkahky et al., 2015; Steck et al., 2021), and tailored product recommendations (Zhou et al., 2019). Additionally, tasks such as Newsfeed Ranking and Search are also built upon DNNs (Gupta et al., 2020a, b; Naumov et al., 2019; Song et al., 2020), further exemplifying the critical role these models play in content discovery and user engagement.

The essential role of DLRMs in generating revenue for many internet companies has led to a marked increase in their complexity and size. From 2017 to 2021, Meta's DLRM model size grew 16-fold, requiring terabytes of model weights (Mudigere et al., 2021). Additionally, the memory bandwidth needed to manage these models increased almost 30-fold (Sethi et al., 2022). This growth translates to recommendation models consuming over 50% of training and 80% of AI inference cycles (Acun et al., 2021; Gupta et al., 2020b; Naumov et al., 2020; Lui et al., 2021; Zhao et al., 2020). As model size grows, the training dataset, which consists of user-item interactions, has also exploded in size. Thus the system infrastructure needed to support these models, such as GPU and CPU count and total system memory, has grown by up to 2.9 times in just a few years (Wu et al., 2022; Mudigere et al., 2022).

One approach to reducing training costs is to reduce the training dataset size. There are also orthogonal approaches that reduce the model size, but this work focuses on the training dataset. DLRMs derive value from two primary types of information within large datasets: content-based information (features of users and items) and collaborative information (interactions between users and items). Removing interactions from large datasets significantly diminishes collaborative information, which is crucial for maintaining accuracy because it carries the interaction histories. This loss profoundly impacts DLRM performance.

To address this challenge, we propose Collaborative Aware Data Compression (CADC), a strategy that harnesses Matrix Factorization (MF), a computationally efficient model, to capture and compress the entire collaborative spectrum of a dataset into a set of user and item embeddings. This method ensures that the essential collaborative information is preserved, even when the dataset undergoes substantial reduction. By using these pre-trained embeddings in DLRMs, the models become less sensitive to the loss of collaborative interaction data when the dataset is filtered. We test our approach on MovieLens 1M, MovieLens 10M, and Epinions, showing that it can keep models accurate even after many of the user-item interactions have been removed.

2. Collaborative Aware Data Compression

Let $\mathcal{D}_{\text{full}}$ denote the entire training dataset, consisting of users $\mathcal{U}$ and items $\mathcal{V}$. Our objective is to train a base two-tower neural network (TTNN) model, represented by $\mathcal{M}_{\text{TTNN}}$, on $\mathcal{D}_{\text{full}}$. Due to the large size of $\mathcal{D}_{\text{full}}$, direct training is impractical. To address this, we create $\mathcal{D}_{\text{sel}}$, a subset in which interactions are randomly selected from $\mathcal{D}_{\text{full}}$ to ensure that it includes a representative sample of the original $\mathcal{U}$ and $\mathcal{V}$. This selection process is designed to preserve the statistical properties and data distribution of $\mathcal{D}_{\text{full}}$, thereby reducing the computational demands of training $\mathcal{M}_{\text{TTNN}}$ without compromising the integrity of the dataset's inherent structure.

We build on the observation that collaborative information present in user-item interactions must be captured to ensure model accuracy when reducing dataset size. To address this, we introduce the CADC technique. This method involves training a compact collaborative filtering model, specifically Matrix Factorization (MF), on the collaborative information residing in $\mathcal{D}_{\text{full}}$ to generate pre-trained embeddings for $\mathcal{U}$ and $\mathcal{V}$ based on their complete interaction profiles. These embeddings are then integrated into $\mathcal{M}_{\text{TTNN}}$. Incorporating these pre-trained weights, which encapsulate the entire collaborative information from $\mathcal{D}_{\text{full}}$, allows $\mathcal{M}_{\text{TTNN}}$ to access comprehensive interaction data while only being trained on a significantly smaller subset, $\mathcal{D}_{\text{sel}}$. This strategic integration dramatically mitigates the adverse effects of training data filtering and preserves model accuracy by maintaining vital collaborative information inside the DLRM. The ensuing sections detail the CADC methodology.

2.1. Pre-training Embedding Vectors

To encapsulate the entire collaborative information within $\mathcal{D}_{\text{full}}$ into a set of embedding vectors, we employ MF, a well-regarded and computationally efficient method in collaborative filtering. Given the massive size of $\mathcal{D}_{\text{full}}$, the training methodology must not only be computationally efficient but also capable of capturing the dynamics of user-item interactions with high fidelity. MF meets both requirements by reducing the high-dimensional interaction space into a lower-dimensional, continuous feature space.

In MF, we construct two separate embedding tables: one for users and another for items. Each user and item is represented as a vector, $\mathbf{u}_{i}$ and $\mathbf{v}_{j}$, in this latent feature space. To enhance the model's ability to capture individual preferences and item qualities, we incorporate a bias term for each user and item into their respective vectors. The last element of each vector, $u_{i,\text{bias}}$ and $v_{j,\text{bias}}$, serves as this bias term. The interaction between user $i$ and item $j$ is formulated as follows:

$$\hat{y}_{\text{MF}}(i,j)=\sigma((\mathbf{u}_{i}^{\prime}\cdot\mathbf{v}_{j}^{\prime})+u_{i,\text{bias}}+v_{j,\text{bias}}+b)$$

where $\mathbf{u}_{i}^{\prime}$ and $\mathbf{v}_{j}^{\prime}$ represent the vectors $\mathbf{u}_{i}$ and $\mathbf{v}_{j}$ excluding their respective bias terms $u_{i,\text{bias}}$ and $v_{j,\text{bias}}$, $b$ represents the global bias term, and $\sigma$ is the sigmoid function.

To optimize these embeddings, we employ binary cross-entropy loss. Due to the implicit feedback nature of the datasets and the scarcity of positive interactions, we implement negative sampling to balance the label distribution. Specifically, we generate $\mathcal{D}_{\text{neg}}$, a subset of negative interactions, to counteract the sparsity of the data where most labels are implicitly zero. The loss formulation, which incorporates both sets of interactions, is as follows:

$$\mathcal{L}_{\text{MF}}=-\sum_{(i,j)\in\mathcal{D}_{\text{full}}}\log(\hat{y}_{\text{MF}}(i,j))-\sum_{(i,j)\in\mathcal{D}_{\text{neg}}}\log(1-\hat{y}_{\text{MF}}(i,j))$$
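The prediction and loss above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the toy sizes, the Gaussian initialization, and the choice of two negatives per positive are all assumptions made for the example. The sampler also assumes every user has at least one item they have not interacted with.

```python
import math
import random

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def mf_predict(u, v, b):
    """Score one (user, item) pair; the last element of u and v is the bias term."""
    dot = sum(a * c for a, c in zip(u[:-1], v[:-1]))
    return sigmoid(dot + u[-1] + v[-1] + b)

def sample_negatives(positives, num_items, per_positive, rng):
    """For each positive (i, j), draw items that never appear with user i."""
    seen = set(positives)
    negatives = []
    for i, _ in positives:
        drawn = 0
        while drawn < per_positive:
            j = rng.randrange(num_items)
            if (i, j) not in seen:
                negatives.append((i, j))
                drawn += 1
    return negatives

def mf_loss(U, V, b, positives, negatives, eps=1e-12):
    """Binary cross-entropy: observed pairs have label 1, sampled negatives label 0."""
    loss = 0.0
    for i, j in positives:
        loss -= math.log(mf_predict(U[i], V[j], b) + eps)
    for i, j in negatives:
        loss -= math.log(1.0 - mf_predict(U[i], V[j], b) + eps)
    return loss

# Toy data (hypothetical): 3 users, 5 items, 3 latent dimensions plus 1 bias slot.
rng = random.Random(0)
num_users, num_items, dim = 3, 5, 4
U = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(num_users)]
V = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(num_items)]
positives = [(0, 1), (1, 2), (2, 4)]
negatives = sample_negatives(positives, num_items, per_positive=2, rng=rng)
loss = mf_loss(U, V, 0.0, positives, negatives)
print(loss)
```

Storing the bias as the last vector element, as the text describes, means a single table lookup returns everything needed to score a pair.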

2.2. Integrating Enhanced Embedding Vectors into DLRM

In our approach, we employ the TTNN as a variant of the DLRM, which processes user and item features through distinct pathways. Each pathway comprises a Multi-Layer Perceptron (MLP): the user tower, $\mathcal{T}_{\text{user}}$, and the item tower, $\mathcal{T}_{\text{item}}$. Specifically, $\mathcal{T}_{\text{user}}$ processes a concatenated vector of the corresponding user identifier and its features, $\mathcal{F}_{\text{user},i}=[\text{id}_{\text{user},i};\text{features}_{\text{user},i}]$, while $\mathcal{T}_{\text{item}}$ handles a similar vector for items, $\mathcal{F}_{\text{item},j}=[\text{id}_{\text{item},j};\text{features}_{\text{item},j}]$.

The interaction between a user $i$ and an item $j$ is modeled by the dot product of the embedding vectors output by each tower, followed by a sigmoid transformation to compute the prediction score. This is formally expressed as:

$$\hat{y}_{\text{TTNN}}(i,j)=\sigma(\mathcal{T}_{\text{user}}(\mathcal{F}_{\text{user},i})^{T}\cdot\mathcal{T}_{\text{item}}(\mathcal{F}_{\text{item},j}))$$

where $\sigma$ denotes the sigmoid function.

The enriched embeddings obtained from our MF pre-training step, $\mathbf{u}_{i}$ and $\mathbf{v}_{j}$, are used to initialize the identifier embeddings $\text{id}_{\text{user}}$ and $\text{id}_{\text{item}}$ within $\mathcal{F}_{\text{user},i}$ and $\mathcal{F}_{\text{item},j}$, which are then frozen. Consequently, the TTNN's identifier embedding tables become non-trainable, and $\mathcal{T}_{\text{user}}$ and $\mathcal{T}_{\text{item}}$ now process the inputs $\mathcal{F}_{\text{user},i}=[\mathbf{u}_{i};\text{features}_{\text{user},i}]$ and $\mathcal{F}_{\text{item},j}=[\mathbf{v}_{j};\text{features}_{\text{item},j}]$ respectively. $\mathcal{M}_{\text{TTNN}}$ is then trained on $\mathcal{D}_{\text{sel}}$, leveraging the comprehensive interaction dynamics encapsulated within $\mathcal{D}_{\text{full}}$. This integration not only leverages the depth of neural networks but also harnesses the breadth of collaborative filtering, ensuring a robust and accurate prediction mechanism.
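The forward pass with frozen identifier embeddings can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the paper's code: the ReLU activation, the layer sizes, the initialization, and all function and variable names are hypothetical, and the frozen tables are hand-written stand-ins for MF output. Only the tower weights would be updated during training; the identifier vectors are stored as tuples to make the freeze explicit.

```python
import math
import random

def relu(x):
    return [max(0.0, a) for a in x]

def linear(W, b, x):
    """y = W x + b, with W stored as a list of rows."""
    return [sum(w * a for w, a in zip(row, x)) + bias for row, bias in zip(W, b)]

def init_tower(sizes, rng):
    """One (W, b) pair per layer; sizes lists the input, hidden, and output widths."""
    return [
        ([[rng.gauss(0, 0.1) for _ in range(n_in)] for _ in range(n_out)], [0.0] * n_out)
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]

def tower(params, x):
    """MLP tower: ReLU between layers (an assumption), none after the last layer."""
    for k, (W, b) in enumerate(params):
        x = linear(W, b, x)
        if k < len(params) - 1:
            x = relu(x)
    return x

def ttnn_predict(user_tower, item_tower, u_id, user_feats, v_id, item_feats):
    """sigma(T_user([u; feats])^T T_item([v; feats])) with frozen u_id and v_id."""
    zu = tower(user_tower, list(u_id) + list(user_feats))
    zv = tower(item_tower, list(v_id) + list(item_feats))
    score = sum(a * c for a, c in zip(zu, zv))
    return 1.0 / (1.0 + math.exp(-score))

rng = random.Random(0)
id_dim, feat_dim, out_dim = 4, 3, 2
# Frozen identifier tables (stand-ins for MF vectors); never passed to an optimizer.
user_table = {0: (0.2, -0.1, 0.05, 0.3)}
item_table = {7: (-0.4, 0.1, 0.25, -0.2)}
user_tower = init_tower([id_dim + feat_dim, 8, out_dim], rng)
item_tower = init_tower([id_dim + feat_dim, 8, out_dim], rng)
p = ttnn_predict(user_tower, item_tower,
                 user_table[0], [1.0, 0.0, 0.5],
                 item_table[7], [0.0, 1.0, 0.0])
print(p)
```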

Findings detailed in Section 5 reveal that using non-trainable embeddings for user and item identifiers in $\mathcal{M}_{\text{TTNN}}$ is the most effective approach, achieving the highest model accuracy among evaluated methods. Freezing the embeddings also simplifies training, since the large user and item ID embedding tables, typically the most substantial component in a DLRM, no longer need to be updated. When the TTNN is trained on the reduced dataset $\mathcal{D}_{\text{sel}}$, it effectively utilizes the comprehensive interaction dynamics originally captured within $\mathcal{D}_{\text{full}}$. This approach not only exploits the depth of neural networks but also incorporates the extensive capabilities of collaborative filtering, offering a robust and precise mechanism adept at managing the challenges posed by large-scale industry datasets.

Compression: In the last step of our approach, we randomly sample a subset of the user-item interactions to create a compressed training dataset. Since our pre-training captures the interaction history efficiently, we do not need to explore any complex compression schemes, as our results show next.
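The compression step amounts to uniform sampling without replacement. A minimal sketch, assuming interactions are held as a list of (user, item) pairs; the function name, the seed handling, and the toy data are illustrative choices, not taken from the paper:

```python
import random

def compress(interactions, keep_fraction, seed=0):
    """Uniformly sample a fraction of the interactions without replacement."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    k = max(1, round(len(interactions) * keep_fraction))
    return random.Random(seed).sample(interactions, k)

# Toy interaction log (hypothetical): 1000 (user, item) pairs, keep 10%.
full = [(user, item) for user in range(100) for item in range(10)]
selected = compress(full, keep_fraction=0.10)
print(len(selected))
```

Fixing the seed makes the selected subset reproducible across runs, which helps when comparing methods on the same sample.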

3. Experimental Setup

Our experiments were designed to assess the efficacy of CADC across three distinct datasets, each with unique characteristics: MovieLens 1M (https://grouplens.org/datasets/movielens/1m/), MovieLens 10M (https://grouplens.org/datasets/movielens/10m/), and Epinions (https://alchemy.cs.washington.edu/data/epinions/). For evaluation, the last two interactions of each user were reserved for validation and testing. On each dataset, the TTNN was trained for 100 epochs on 10% of the interaction data, with an embedding size of 96. Before training the TTNN, an MF model was trained on all interactions within each dataset for 100 epochs, with an embedding size of 95. With the bias term appended to each user and item vector, this matches the TTNN embedding size of 96. The optimization of MF employed an alternating scheme using the Adam optimizer: item embeddings were fixed while updating user embeddings, and vice versa.
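The hold-out protocol can be sketched as below. The text does not say which of the two reserved interactions serves validation and which serves testing, nor how users with very short histories are handled, so this sketch assumes the second-to-last interaction is used for validation, the last for testing, and users with fewer than three interactions stay entirely in training. The function name and the (user, item, timestamp) layout are also assumptions.

```python
from collections import defaultdict

def leave_last_two_out(interactions):
    """Split (user, item, timestamp) records into train pairs and per-user hold-outs."""
    by_user = defaultdict(list)
    for user, item, ts in interactions:
        by_user[user].append((ts, item))
    train, valid, test = [], {}, {}
    for user, events in by_user.items():
        events.sort()  # chronological order
        items = [item for _, item in events]
        if len(items) < 3:
            # Too short to hold anything out (assumed policy).
            train.extend((user, item) for item in items)
            continue
        *history, second_last, last = items
        train.extend((user, item) for item in history)
        valid[user] = second_last
        test[user] = last
    return train, valid, test

log = [(1, "a", 10), (1, "b", 30), (1, "c", 20), (1, "d", 40), (2, "x", 5), (2, "y", 6)]
train, valid, test = leave_last_two_out(log)
print(train, valid, test)
```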

Performance was evaluated using Hit Rate at 10 (HR@10) and Normalized Discounted Cumulative Gain at 10 (NDCG@10), metrics that assess accuracy and ranking quality. Additionally, the training time for each dataset was recorded in seconds to evaluate the time efficiency of the method. In all scenarios utilizing CADC, a small model was first trained on the entire dataset, after which the embeddings were transferred to the TTNN and frozen, as detailed in Section 2.
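With a single held-out item per user, both metrics reduce to simple per-user quantities that are then averaged. The paper does not spell out the formulas, so the sketch below uses the common leave-one-out definitions as an assumption: HR@k is 1 when the held-out item appears in the top k, and NDCG@k is 1/log2(rank + 2) for a zero-based rank inside the top k, and 0 otherwise.

```python
import math

def hit_rate_at_k(ranked_items, target, k=10):
    """1.0 if the held-out item is among the top-k ranked items, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target, k=10):
    """Single-relevant-item NDCG: the ideal DCG is 1, so only the hit position matters."""
    top = ranked_items[:k]
    if target not in top:
        return 0.0
    return 1.0 / math.log2(top.index(target) + 2)

ranking = list(range(20))  # hypothetical ranked item ids, best first
print(hit_rate_at_k(ranking, 3), ndcg_at_k(ranking, 3))
```

Averaging these values over all test users gives the dataset-level HR@10 and NDCG@10.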

3.1. Baselines

To evaluate the effectiveness of the CADC, it was benchmarked against various methods:

  • Random: This baseline trains the TTNN on the filtered dataset without any sophisticated data compression or embedding optimization techniques, serving as a naive control.

  • Long-Tail Item Recommendation Techniques (Over-Sampling, Under-Sampling, LogQ) (Zhang et al., 2021): These methods are incorporated as baselines to address challenges posed by data filtering, which often exacerbates the long-tail problem in recommendation systems.

  • CADC-MLP (He et al., 2017): This variant employs an MLP as the interaction function between user and item embeddings, replacing the traditional dot-product approach used in MF. It offers a more complex interaction model, making it computationally more intensive than traditional MF.

4. Results

| Method | MovieLens 1M HR@10 | MovieLens 1M NDCG@10 | MovieLens 1M Time (s) | MovieLens 10M HR@10 | MovieLens 10M NDCG@10 | MovieLens 10M Time (s) | Epinions HR@10 | Epinions NDCG@10 | Epinions Time (s) |
|---|---|---|---|---|---|---|---|---|---|
| Random | 3.76 (45.7%) | 1.71 (49.6%) | 142 | 5.25 (42.9%) | 2.62 (41.6%) | 848 | 1.53 (60.3%) | 0.70 (63.7%) | 15 |
| Over-Sampling | 0.70 (89.9%) | 0.32 (90.6%) | 142 | 0.09 (99.0%) | 0.04 (99.1%) | 848 | 0.63 (83.6%) | 0.26 (86.5%) | 15 |
| Under-Sampling | 1.84 (73.4%) | 0.84 (75.2%) | 142 | 1.74 (81.1%) | 0.80 (82.2%) | 848 | 0.98 (74.5%) | 0.42 (78.2%) | 15 |
| LogQ | 4.21 (39.2%) | 1.99 (41.3%) | 142 | 5.63 (38.7%) | 2.73 (39.2%) | 848 | 1.69 (56.1%) | 0.90 (53.4%) | 15 |
| CADC-MLP | 6.23 (10.0%) | 2.95 (13.0%) | 302+136 | 8.48 (7.7%) | 4.17 (7.1%) | 3007+753 | 3.55 (7.8%) | 1.82 (5.7%) | 44+14 |
| CADC | 6.57 (5.1%) | 3.28 (3.2%) | 18+136 | 8.54 (7.1%) | 4.18 (6.9%) | 184+755 | 3.79 (1.6%) | 1.89 (2.1%) | 12+14 |
| GS | 6.92 | 3.39 | 1464 | 9.19 | 4.49 | 8299 | 3.85 | 1.93 | 142 |
Table 1. Performance comparison of different data handling strategies using the MovieLens 1M, MovieLens 10M, and Epinions datasets. Metrics evaluated include HR@10 and NDCG@10 for recommendation accuracy and computational time in seconds. The percentages in parentheses indicate the performance reduction from the Gold Standard (GS).

Table 1 compares recommendation systems trained with CADC against other approaches across three datasets: MovieLens 1M, MovieLens 10M, and Epinions. Performance metrics include HR@10, NDCG@10, and computational time (seconds). The values in parentheses indicate the percentage performance degradation relative to the Gold Standard (GS).

CADC achieved notable success in maintaining high recommendation quality while training on only 10% of the data, significantly reducing the training time compared to models trained on the entire dataset. Specifically, for MovieLens 1M, CADC showed superior performance, achieving an HR@10 of 6.57 and an NDCG@10 of 3.28, with only a 5.1% and 3.2% degradation in performance compared to the GS, respectively. These results were obtained with significantly reduced computational time (154.6 seconds compared to 1464.4 seconds for the GS).
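The percentages in Table 1 are consistent with a relative drop of the form (GS − value) / GS. The short check below recomputes three MovieLens 1M entries and the time ratio quoted above; the helper name is illustrative, and the formula is inferred from the reported numbers rather than stated explicitly in the paper.

```python
def degradation_pct(value, gold):
    """Relative drop from the Gold Standard, in percent."""
    return 100.0 * (gold - value) / gold

cadc_hr = degradation_pct(6.57, 6.92)     # CADC HR@10 on MovieLens 1M
cadc_ndcg = degradation_pct(3.28, 3.39)   # CADC NDCG@10 on MovieLens 1M
random_hr = degradation_pct(3.76, 6.92)   # Random baseline HR@10 on MovieLens 1M
speedup = 1464.4 / 154.6                  # GS time divided by CADC total time

print(round(cadc_hr, 1), round(cadc_ndcg, 1), round(random_hr, 1), round(speedup, 1))
# prints: 5.1 3.2 45.7 9.5
```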

In the larger MovieLens 10M dataset, CADC continued to outperform other methods with an HR@10 of 8.54 and an NDCG@10 of 4.18, marking a 7.1% and 6.9% performance decline relative to the GS. The computational time was notably lower (939.6 seconds) compared to the GS, which required 8298.8 seconds. For the Epinions dataset, CADC came closest to the GS, with an HR@10 of 3.79 and an NDCG@10 of 1.89, a degradation of only 1.6% and 2.1% respectively, achieved in significantly less time (26.4 seconds).

The Random baseline exhibited substantial performance declines across all datasets, with the most significant reductions observed on Epinions, where HR@10 was only 1.53 and NDCG@10 was 0.70. The CADC-MLP variant, although performing better than the Random baseline, was still consistently outperformed by CADC in terms of both ranking quality and time efficiency. The Over-Sampling and Under-Sampling methods performed worse than Random on every dataset. The LogQ method achieved results that were marginally better than those of the naive Random approach but remained substantially inferior to the GS.

5. Sensitivity Analysis

This section examines the impact of various factors on the performance of the CADC method, specifically focusing on the data filtering ratio and different embedding integration techniques. All experiments were conducted using the MovieLens-1M dataset.

  • Data Filtering Ratio: Defined as the ratio of the number of interactions in $\mathcal{D}_{\text{full}}$ to those in $\mathcal{D}_{\text{sel}}$. For instance, a data filtering ratio of 50 means that the DLRM is trained on 2% of the complete dataset. Figure 1 illustrates that as the data filtering ratio increases, indicating more substantial data reduction, the decline in ranking performance becomes progressively less pronounced. This demonstrates CADC's capability to maintain model accuracy effectively, even with significantly reduced datasets.

  • Embedding Integration Techniques: Analysis of different integration techniques within the CADC framework reveals varied impacts on performance:

    • Hybrid: Employs a combination of pre-trained and trainable elements where two-thirds of the id embedding elements are derived from MF and the remaining third are regular trainable parameters.

    • Init: Initializes the id embeddings entirely with pre-trained vectors, which remain updatable during training, offering flexibility in adaptation.

    • Init-Frz: Currently used in CADC, this method involves initializing the id embeddings with pre-trained vectors and subsequently freezing them, enhancing stability.

    • Linear: Similar to Hybrid, but incorporates a trainable linear layer that processes the pre-trained embeddings before their integration.

    • MLP: Builds on the Linear approach by applying an MLP to the pre-trained embeddings, introducing a higher level of model complexity.
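The difference between Init, Init-Frz, and Hybrid comes down to which elements of an id embedding receive gradient updates. The sketch below expresses that as a boolean mask applied during a plain SGD step. It is a simplified illustration: the function names are hypothetical, plain SGD stands in for whatever optimizer is used, the Hybrid variant is assumed to keep its MF-derived two-thirds frozen (the text does not state this), and the Linear and MLP variants are omitted because they add extra layers rather than change the mask.

```python
def trainable_mask(variant, dim):
    """True marks an id-embedding element that receives gradient updates."""
    if variant == "init_frz":
        return [False] * dim              # all elements pre-trained and frozen
    if variant == "init":
        return [True] * dim               # pre-trained values, all updatable
    if variant == "hybrid":
        frozen = (2 * dim) // 3           # two-thirds from MF (assumed frozen)
        return [False] * frozen + [True] * (dim - frozen)
    raise ValueError(f"unsupported variant: {variant}")

def sgd_step(vector, grad, mask, lr=0.1):
    """Apply one SGD update, leaving frozen elements untouched."""
    return [x - lr * g if m else x for x, g, m in zip(vector, grad, mask)]

# With the embedding size of 96 used in the experiments:
for variant in ("init_frz", "init", "hybrid"):
    print(variant, sum(trainable_mask(variant, 96)))
```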

Figure 1. Impact of Data Filtering Ratios on HR@10 for CADC. This plot demonstrates how the performance of CADC is influenced by varying levels of data reduction.
| Method | HR@10 | NDCG@10 |
|---|---|---|
| Hybrid | 6.03 (12.9%) | 3.08 (9.1%) |
| Init | 6.23 (10.0%) | 3.12 (8.0%) |
| Init-Frz | 6.57 (5.1%) | 3.28 (3.2%) |
| Linear | 6.56 (5.2%) | 3.19 (5.9%) |
| MLP | 6.59 (4.8%) | 3.22 (5.0%) |
| GS | 6.92 | 3.39 |
Table 2. Comparative performance of CADC embedding integration techniques on MovieLens 1M, measured by HR@10 and NDCG@10. Percentages reflect deviation from the Gold Standard (GS).

The empirical results, displayed in Table 2, indicate that Init-Frz not only simplifies computational demands with the fewest trainable parameters but also achieves superior outcomes compared to the alternatives. Specifically, it records the highest performance in NDCG@10 and closely competes with the more computationally demanding MLP and Linear methods in HR@10. Notably, allowing embeddings to be updated in the Init method reduces accuracy, likely due to the noisy nature of TTNN gradients at the start of training and the limited data scope, which can compromise the integrity of pre-trained embeddings.

6. Related Works

Sampling Interaction Data. Data sampling is crucial in recommendation systems for extracting hard negative samples and assessing algorithms. It has been employed through methods such as random sampling, leveraging the underlying graph structure (Mittal et al., 2021; Ying et al., 2018), and specific techniques such as similarity search (Jain et al., 2019) and stratified sampling (Chen et al., 2017). Sampling also plays a central role in evaluating recommendation algorithms (Cañamares and Castells, 2020; Krichene and Rendle, 2020). Additionally, it is useful for creating smaller subsets of large datasets for purposes such as quick testing and algorithmic comparisons (Sachdeva et al., 2022).

Coreset Selection. Coreset selection identifies data subsets that represent the full dataset's quality. Methods include score-based approaches, which select data based on criteria like forgetting frequency (Toneva et al., 2018), loss value (Kawaguchi and Lu, 2020; Jiang et al., 2019), and prediction uncertainty (Coleman et al., 2019), and gradient-based approaches, which estimate the dataset's gradient (Mirzasoleiman et al., 2020; Pooladzandi et al., 2022; Yang et al., 2023). These model-specific methods require computation at each data point, making them impractical for large datasets due to high computational demands. Our approach simplifies this by training a very basic computational model once, irrespective of the DLRM and content features to be used. This one-time training embeds collaborative information efficiently, circumventing the computational challenges of traditional coreset selection methods and facilitating scalable training for large datasets.

Data Distillation. Data distillation synthesizes compact data summaries, distilling the essential knowledge of an entire dataset into a significantly smaller, synthetic summary (Zhao and Bilen, 2021; Nguyen et al., 2021). These techniques have predominantly focused on continuous data like images. A recent approach extended them to synthesize fake graphs, but it assumes pre-existing node representations, which limits its applicability to recommendation data (Jin et al., 2021). Sachdeva et al. (Sachdeva et al., 2022) adapt data distillation for collaborative filtering by generating high-fidelity, compressed data summaries specifically for use with infinitely-wide autoencoders ($\infty$-AE). This method is designed exclusively for $\infty$-AE applications in collaborative filtering and does not integrate content-based features. In contrast, our proposed method is developed for DLRMs like TTNN, incorporating both user-item interactions and content-based features into the training process.

7. Conclusion

This study introduces CADC, a pioneering approach designed to efficiently train DLRMs on large-scale datasets without substantially impacting model accuracy. Our findings demonstrate that by employing pre-trained embeddings that encapsulate comprehensive interaction data, CADC can significantly reduce the volume of data needed for training while preserving the collaborative information essential for maintaining high prediction quality. Tested across datasets like MovieLens 1M, MovieLens 10M, and Epinions, CADC not only outperforms traditional training methods in terms of efficiency and scalability but also maintains a high level of accuracy, closely approximating the performance of models trained on full datasets. Specifically, CADC has proven to mitigate the impact of data reduction on model performance, reducing training times dramatically without corresponding losses in effectiveness. Our research contributes to the broader field of recommendation systems by providing a scalable solution that addresses the twin challenges of maintaining high data throughput and model accuracy in the face of exponentially growing data sizes. It opens new avenues for future research, particularly in exploring more complex models and integration techniques that could further enhance the efficiency and effectiveness of recommendation systems.

References

  • Acun et al. (2021) Bilge Acun, Matthew Murphy, Xiaodong Wang, Jade Nie, Carole-Jean Wu, and Kim Hazelwood. 2021. Understanding training efficiency of deep learning recommendation models at scale. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 802–814.
  • Cañamares and Castells (2020) Rocío Cañamares and Pablo Castells. 2020. On target item sampling in offline recommender system evaluation. In Proceedings of the 14th ACM Conference on Recommender Systems. 259–268.
  • Chen et al. (2017) Ting Chen, Yizhou Sun, Yue Shi, and Liangjie Hong. 2017. On sampling strategies for neural network-based collaborative filtering. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 767–776.
  • Cheng et al. (2016) Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st workshop on deep learning for recommender systems. 7–10.
  • Coleman et al. (2019) Cody Coleman, Christopher Yeh, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, and Matei Zaharia. 2019. Selection via proxy: Efficient data selection for deep learning. arXiv preprint arXiv:1906.11829 (2019).
  • Covington et al. (2016) Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM conference on recommender systems. 191–198.
  • Elkahky et al. (2015) Ali Mamdouh Elkahky, Yang Song, and Xiaodong He. 2015. A multi-view deep learning approach for cross domain user modeling in recommendation systems. In Proceedings of the 24th international conference on world wide web. 278–288.
  • Gupta et al. (2020a) Udit Gupta, Samuel Hsia, Vikram Saraph, Xiaodong Wang, Brandon Reagen, Gu-Yeon Wei, Hsien-Hsin S Lee, David Brooks, and Carole-Jean Wu. 2020a. DeepRecSys: A system for optimizing end-to-end at-scale neural recommendation inference. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 982–995.
  • Gupta et al. (2020b) Udit Gupta, Carole-Jean Wu, Xiaodong Wang, Maxim Naumov, Brandon Reagen, David Brooks, Bradford Cottel, Kim Hazelwood, Mark Hempstead, Bill Jia, et al. 2020b. The architectural implications of Facebook’s DNN-based personalized recommendation. In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 488–501.
  • He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th international conference on world wide web. 173–182.
  • Jain et al. (2019) Himanshu Jain, Venkatesh Balasubramanian, Bhanu Chunduri, and Manik Varma. 2019. Slice: Scalable linear extreme classifiers trained on 100 million labels for related searches. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. 528–536.
  • Jiang et al. (2019) Angela H Jiang, Daniel L-K Wong, Giulio Zhou, David G Andersen, Jeffrey Dean, Gregory R Ganger, Gauri Joshi, Michael Kaminsky, Michael Kozuch, Zachary C Lipton, et al. 2019. Accelerating deep learning by focusing on the biggest losers. arXiv preprint arXiv:1910.00762 (2019).
  • Jin et al. (2021) Wei Jin, Lingxiao Zhao, Shichang Zhang, Yozen Liu, Jiliang Tang, and Neil Shah. 2021. Graph condensation for graph neural networks. arXiv preprint arXiv:2110.07580 (2021).
  • Kawaguchi and Lu (2020) Kenji Kawaguchi and Haihao Lu. 2020. Ordered SGD: A new stochastic optimization framework for empirical risk minimization. In International Conference on Artificial Intelligence and Statistics. PMLR, 669–679.
  • Krichene and Rendle (2020) Walid Krichene and Steffen Rendle. 2020. On sampled metrics for item recommendation. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. 1748–1757.
  • Lui et al. (2021) Michael Lui, Yavuz Yetim, Özgür Özkan, Zhuoran Zhao, Shin-Yeh Tsai, Carole-Jean Wu, and Mark Hempstead. 2021. Understanding capacity-driven scale-out neural recommendation inference. In 2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 162–171.
  • Medvedev et al. (2019) Ivan Medvedev, Haotian Wu, and Taylor Gordon. 2019. Powered by AI: Instagram’s Explore recommender system. Retrieved June 17 (2019), 2022.
  • Mirzasoleiman et al. (2020) Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. 2020. Coresets for data-efficient training of machine learning models. In International Conference on Machine Learning. PMLR, 6950–6960.
  • Mittal et al. (2021) Anshul Mittal, Noveen Sachdeva, Sheshansh Agrawal, Sumeet Agarwal, Purushottam Kar, and Manik Varma. 2021. ECLARE: Extreme classification with label graph correlations. In Proceedings of the Web Conference 2021. 3721–3732.
  • Mudigere et al. (2022) Dheevatsa Mudigere, Yuchen Hao, Jianyu Huang, Zhihao Jia, Andrew Tulloch, Srinivas Sridharan, Xing Liu, Mustafa Ozdal, Jade Nie, Jongsoo Park, et al. 2022. Software-hardware co-design for fast and scalable training of deep learning recommendation models. In Proceedings of the 49th Annual International Symposium on Computer Architecture. 993–1011.
  • Mudigere et al. (2021) Dheevatsa Mudigere, Yuchen Hao, Jianyu Huang, Andrew Tulloch, Srinivas Sridharan, Xing Liu, Mustafa Ozdal, Jade Nie, Jongsoo Park, Liang Luo, et al. 2021. High-performance, distributed training of large-scale deep learning recommendation models. arXiv preprint arXiv:2104.05158 (2021).
  • Naumov et al. (2020) Maxim Naumov, John Kim, Dheevatsa Mudigere, Srinivas Sridharan, Xiaodong Wang, Whitney Zhao, Serhat Yilmaz, Changkyu Kim, Hector Yuen, Mustafa Ozdal, et al. 2020. Deep learning training in facebook data centers: Design of scale-up and scale-out systems. arXiv preprint arXiv:2003.09518 (2020).
  • Naumov et al. (2019) Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang, Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang, Udit Gupta, Carole-Jean Wu, Alisson G Azzolini, et al. 2019. Deep learning recommendation model for personalization and recommendation systems. arXiv preprint arXiv:1906.00091 (2019).
  • Nguyen et al. (2021) Timothy Nguyen, Roman Novak, Lechao Xiao, and Jaehoon Lee. 2021. Dataset distillation with infinitely wide convolutional networks. Advances in Neural Information Processing Systems 34 (2021), 5186–5198.
  • Pooladzandi et al. (2022) Omead Pooladzandi, David Davini, and Baharan Mirzasoleiman. 2022. Adaptive second order coresets for data-efficient machine learning. In International Conference on Machine Learning. PMLR, 17848–17869.
  • Sachdeva et al. (2022) Noveen Sachdeva, Carole-Jean Wu, and Julian McAuley. 2022. On sampling collaborative filtering datasets. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 842–850.
  • Sethi et al. (2022) Geet Sethi, Bilge Acun, Niket Agarwal, Christos Kozyrakis, Caroline Trippel, and Carole-Jean Wu. 2022. RecShard: statistical feature-based memory optimization for industry-scale neural recommendation. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 344–358.
  • Song et al. (2020) Qingquan Song, Dehua Cheng, Hanning Zhou, Jiyan Yang, Yuandong Tian, and Xia Hu. 2020. Towards automated neural interaction discovery for click-through rate prediction. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 945–955.
  • Steck et al. (2021) Harald Steck, Linas Baltrunas, Ehtsham Elahi, Dawen Liang, Yves Raimond, and Justin Basilico. 2021. Deep learning for recommender systems: A Netflix case study. AI Magazine 42, 3 (2021), 7–18.
  • Toneva et al. (2018) Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Geoffrey J Gordon. 2018. An empirical study of example forgetting during deep neural network learning. arXiv preprint arXiv:1812.05159 (2018).
  • Wu et al. (2022) Carole-Jean Wu, Ramya Raghavendra, Udit Gupta, Bilge Acun, Newsha Ardalani, Kiwan Maeng, Gloria Chang, Fiona Aga, Jinshi Huang, Charles Bai, et al. 2022. Sustainable AI: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems 4 (2022), 795–813.
  • Yang et al. (2023) Yu Yang, Hao Kang, and Baharan Mirzasoleiman. 2023. Towards sustainable learning: coresets for data-efficient deep learning. In Proceedings of the 40th International Conference on Machine Learning (Honolulu, Hawaii, USA) (ICML’23). JMLR.org, Article 1640, 17 pages.
  • Ying et al. (2018) Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. 2018. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 974–983.
  • Zhang et al. (2021) Yin Zhang, Derek Zhiyuan Cheng, Tiansheng Yao, Xinyang Yi, Lichan Hong, and Ed H Chi. 2021. A model of two tales: Dual transfer learning framework for improved long-tail item recommendation. In Proceedings of the web conference 2021. 2220–2231.
  • Zhao and Bilen (2021) Bo Zhao and Hakan Bilen. 2021. Dataset condensation with differentiable siamese augmentation. In International Conference on Machine Learning. PMLR, 12674–12685.
  • Zhao et al. (2020) Weijie Zhao, Deping Xie, Ronglai Jia, Yulei Qian, Ruiquan Ding, Mingming Sun, and Ping Li. 2020. Distributed hierarchical GPU parameter server for massive scale deep learning ads systems. Proceedings of Machine Learning and Systems 2 (2020), 412–428.
  • Zhou et al. (2019) Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Deep interest evolution network for click-through rate prediction. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33. 5941–5948.