This work takes a critical stance on previous studies concerning fairness evaluation in Large-Language Model (LLM)-based recommender systems, which have primarily assessed consumer fairness by comparing recommendation lists generated with and without sensitive user attributes. Such approaches implicitly treat discrepancies in recommended items as biases, overlooking whether these changes might stem from genuine personalization aligned with true preferences of users. Moreover, these earlier studies typically address single sensitive attributes in isolation, neglecting the complex interplay of intersectional identities. In response to these shortcomings, we introduce CFaiRLLM, an enhanced evaluation framework that not only incorporates true preference alignment but also rigorously examines intersectional fairness by considering overlapping sensitive attributes. Additionally, CFaiRLLM introduces diverse user profile sampling strategies—random, top-rated, and recency-focused—to better understand the impact of profile generation fed to LLMs in light of inherent token limitations in these systems. Given that fairness depends on accurately understanding users’ tastes and preferences, these strategies provide a more realistic assessment of fairness within RecLLMs. To validate the efficacy of CFaiRLLM, we conducted extensive experiments using MovieLens and LastFM datasets, applying various sampling strategies and sensitive attribute configurations. The evaluation metrics include both item similarity measures and true preference alignment considering both hit and ranking (Jaccard Similarity and PRAG), thereby conducting a multi-faceted analysis of recommendation fairness. The results demonstrated that true preference alignment offers a more personalized and fair assessment compared to similarity-based measures, revealing significant disparities when sensitive and intersectional attributes are incorporated. 
Notably, our study finds that intersectional attributes amplify fairness gaps more prominently, especially in less structured domains such as music recommendations in LastFM. These findings suggest that future fairness evaluations in RecLLMs should incorporate true preference alignment to ensure equitable and genuinely personalized recommendations.
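As a rough illustration of two ingredients named in the abstract, the sketch below shows how the three profile sampling strategies (random, top-rated, recency-focused) and a Jaccard comparison of two recommendation lists might look. It is not the authors' implementation: the `(item_id, rating, timestamp)` layout, the function names `sample_profile` and `jaccard`, and the strategy labels are assumptions made for this example. PRAG is omitted because its definition is not given here.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical interaction record: (item_id, rating, timestamp).
Interaction = Tuple[str, float, int]


def sample_profile(history: Sequence[Interaction], k: int,
                   strategy: str, seed: int = 0) -> List[str]:
    """Pick at most k items from a user's history to fit a prompt budget."""
    if strategy == "random":
        rng = random.Random(seed)  # seeded so the sample is reproducible
        chosen = rng.sample(list(history), min(k, len(history)))
    elif strategy == "top_rated":
        # Highest ratings first.
        chosen = sorted(history, key=lambda x: x[1], reverse=True)[:k]
    elif strategy == "recent":
        # Most recent timestamps first.
        chosen = sorted(history, key=lambda x: x[2], reverse=True)[:k]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [item_id for item_id, _, _ in chosen]


def jaccard(list_a: Sequence[str], list_b: Sequence[str]) -> float:
    """Set overlap between two recommendation lists (order is ignored)."""
    a, b = set(list_a), set(list_b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty lists are identical
    return len(a & b) / len(a | b)


history = [("a", 5.0, 1), ("b", 3.0, 4), ("c", 4.0, 3), ("d", 2.0, 2)]
top_rated = sample_profile(history, 2, "top_rated")
recent = sample_profile(history, 2, "recent")

# Lists produced without and with a sensitive attribute (made-up values).
neutral_recs = ["a", "b", "c"]
sensitive_recs = ["b", "c", "d"]
overlap = jaccard(neutral_recs, sensitive_recs)
```

A low Jaccard score only says that the two lists differ. As the abstract argues, whether that difference is unfair depends on how well each list matches the user's true preferences, which a set-overlap measure cannot capture on its own.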
CFaiRLLM: Consumer Fairness Evaluation in Large-Language Model Recommender System / Deldjoo, Y., Di Noia, T. - In: ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY. - ISSN 2157-6904. - 16:6(2025), pp. 142.1-142.31. [10.1145/3725853]
CFaiRLLM: Consumer Fairness Evaluation in Large-Language Model Recommender System
Deldjoo, Yashar; Di Noia, Tommaso
2025

