At a Glance
LLM-based feed rankers consistently favor more polarizing, negative, and sometimes toxic posts across providers and prompt styles; simple prompt tweaks change surface content but do not remove demographic skew.
What They Found
Polarization is the strongest signal driving recommendations: models favor more polarized content across platforms, providers, and prompt styles. Toxic content and negative sentiment are amplified under engagement-oriented prompts but reduced under informative framing, so prompt goals strongly shape safety trade-offs. Provider behavior differs: one provider stayed most stable across prompts, another adapted most by prompt for toxicity, and a third showed the strongest negative sentiment preference. On Twitter/X, left-leaning authors are over-represented in recommendations despite being a minority of the available posts.
By the Numbers
1. Polarization explains the most variance in selection (R² = 0.055) and is significant in 98.1% of experimental conditions.
2. Controversial and informative prompts produce an average variance explained of ~0.015–0.016, versus ~0.004 for 'popular' prompts (about a 4× difference in bias strength).
3. On Twitter/X the candidate pool was 17.9% left-leaning and 43.4% right-leaning, yet models systematically over-represent left-leaning authors in recommendations.
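To make "variance explained" concrete, here is a minimal sketch of how an R² value for one feature could be computed. It assumes a one-variable least-squares fit of a binary "was selected" indicator on a feature score; the summary does not state the study's exact estimator, so `r_squared` and the toy data are illustrative only. The last lines check the "about 4×" ratio from the reported averages.

```python
from statistics import mean


def r_squared(feature, selected):
    """R² of a one-variable least-squares fit of `selected` on `feature`.

    For a single predictor this equals the squared Pearson correlation.
    Assumption: this is an illustrative estimator, not the study's own.
    """
    mx, my = mean(feature), mean(selected)
    sxy = sum((x - mx) * (y - my) for x, y in zip(feature, selected))
    sxx = sum((x - mx) ** 2 for x in feature)
    syy = sum((y - my) ** 2 for y in selected)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant feature or outcome explains no variance
    return (sxy * sxy) / (sxx * syy)


# Toy data (hypothetical): polarization scores and whether each post was picked.
toy_r2 = r_squared([0, 1, 2, 3], [0, 0, 1, 1])

# Ratio of the reported average R² values: midpoint of 0.015-0.016 versus 0.004.
ratio = ((0.015 + 0.016) / 2) / 0.004
```

An R² of 0.055 on this scale is small in absolute terms; the summary's claim is about relative ranking, namely that polarization explains more selection variance than any other measured feature.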
What This Means
Engineers building feed or agent-driven curation systems should audit for polarization and test prompt objectives, since engagement goals can push negative and toxic content. Product and safety leads at social platforms can use these findings to design ranking constraints, monitoring, and human oversight before deploying LLM-based curation. Researchers studying recommendation fairness can use the provider- and prompt-level contrasts to separate structural model biases from design choices.
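One way such an audit could look in practice is sketched below. It compares the mean polarization of recommended posts against the candidate pool for each prompt objective. Everything here is hypothetical: `rank_fn` stands in for whatever LLM ranking call a team uses, the `polarization` field is assumed to be precomputed by a separate classifier, and the `threshold` of 0.05 is an arbitrary placeholder rather than a recommended value.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence

Post = dict  # hypothetical: each post carries a precomputed "polarization" score


def audit_polarization(
    pool: Sequence[Post],
    rank_fn: Callable[[Sequence[Post], str], List[Post]],
    prompts: Sequence[str],
    threshold: float = 0.05,
) -> Dict[str, dict]:
    """Report, per prompt, how far recommended posts shift from the pool mean."""
    pool_mean = mean(p["polarization"] for p in pool)
    report = {}
    for prompt in prompts:
        picked = rank_fn(pool, prompt)
        shift = mean(p["polarization"] for p in picked) - pool_mean
        report[prompt] = {"shift": shift, "flagged": shift > threshold}
    return report


def stub_ranker(pool, prompt):
    """Stand-in for an LLM ranker: 'engaging' picks the most polarized posts."""
    if prompt == "engaging":
        return sorted(pool, key=lambda p: p["polarization"], reverse=True)[:3]
    return list(pool[3:6])


toy_pool = [{"polarization": i / 10} for i in range(10)]
report = audit_polarization(toy_pool, stub_ranker, ["engaging", "neutral"])
```

The same loop could be repeated for toxicity or sentiment scores, which would surface the prompt-dependent trade-offs described above.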
Key Figures

Fig 1: R² (variance explained) for each of the 13 features across six prompt strategies, averaged over three providers and three platforms (demographic features: Twitter/X only). Rows ordered by average effect size; the average column summarizes overall bias strength. Significance markers: * = p < 0.05 in > 50% of conditions, ** = > 60%, *** = > 75%.

Fig 2: Content and safety directional bias by model and prompt style. Three heatmaps show polarization, sentiment polarity, and toxicity directional bias averaged across Bluesky, Reddit, and Twitter/X. Positive values (red) indicate preference for higher values (respectively, more polarized, more positive, or more toxic content); negative values (blue) indicate preference for lower values (respectively, less polarized, less positive, or less toxic content).

Fig 3: Directional bias in sensitive demographic attributes for Twitter/X (demographic inference restricted to this platform due to bio availability; see Section 3). Rows correspond to LLM providers (Claude, OpenAI, Gemini) plus an average across models; columns to demographic categories, with candidate pool proportions in parentheses. Cell annotations report mean ± s.d.; for individual model rows the standard deviation is computed across prompt styles and runs, while for the Average row it reflects variability across models.
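A simple way to quantify over-representation of a demographic group is the ratio of its share among recommendations to its share in the candidate pool. The sketch below is an assumption about how this could be measured, not the directional-bias metric used in the figure. The pool shares for left- and right-leaning authors come from the summary (17.9% and 43.4%); the "other" remainder and all recommended counts are made up for illustration and are not study results.

```python
from collections import Counter


def representation_ratio(pool_labels, recommended_labels):
    """Share among recommendations divided by share in the pool, per label.

    A ratio above 1 means the group is over-represented in recommendations.
    """
    if not pool_labels or not recommended_labels:
        raise ValueError("both label lists must be non-empty")
    pool_counts = Counter(pool_labels)
    rec_counts = Counter(recommended_labels)
    n_pool, n_rec = len(pool_labels), len(recommended_labels)
    return {
        label: (rec_counts[label] / n_rec) / (count / n_pool)
        for label, count in pool_counts.items()
    }


# Pool shares from the summary: 17.9% left, 43.4% right; the rest is lumped as "other".
pool = ["left"] * 179 + ["right"] * 434 + ["other"] * 387

# Hypothetical recommended set, invented purely to exercise the function.
recommended = ["left"] * 30 + ["right"] * 35 + ["other"] * 35

ratios = representation_ratio(pool, recommended)
```

Because the demographic labels themselves are inferred and noisy, any ratio computed this way inherits that uncertainty.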

Fig 4: Average R² (variance explained) for each feature aggregated across all 54 experimental conditions (3 datasets × 3 models × 6 prompts), ordered by effect size. Demographic attributes (author gender, political leaning, minority status) are computed for Twitter/X only; all other features average across all three platforms. Significance markers: * = p < 0.05 in > 50% of conditions, ** = > 60%, *** = > 75%.
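The significance markers in the captions follow a simple rule: count the fraction of conditions with p < 0.05 and map it to stars. A minimal sketch of that mapping is below; the function name and the example p-values are hypothetical, while the thresholds are taken from the captions.

```python
def significance_marker(p_values, alpha=0.05):
    """Map per-condition p-values to the caption's star notation.

    '*'   : significant in more than 50% of conditions
    '**'  : significant in more than 60% of conditions
    '***' : significant in more than 75% of conditions
    """
    if not p_values:
        return ""
    frac = sum(p < alpha for p in p_values) / len(p_values)
    if frac > 0.75:
        return "***"
    if frac > 0.60:
        return "**"
    if frac > 0.50:
        return "*"
    return ""


# Hypothetical p-values: 53 of 54 conditions significant (about 98.1%).
marker = significance_marker([0.01] * 53 + [0.5])
```

Note that the thresholds are strict: a feature significant in exactly 60% of conditions would receive one star, not two.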
Considerations
Demographic signals were inferred from profile text and are noisy (48.4% unknown for minority status), so conclusions about gender and minority exposure are exploratory. Recommendations were non-personalized and based only on post text, excluding real-world signals like prior engagement or user history that could amplify or alter biases. Models tested reflect versions from late 2025–early 2026; provider updates or different model families could change the magnitudes observed.
Methodology & More
The study ran a large audit of LLM-based content curation across 54 conditions: three model providers, three social platforms (Twitter/X, Bluesky, Reddit), and six prompt styles (for example, engaging, informative, popular, controversial). For each condition the authors sampled 5,000 posts per platform, performed 100 recommendation trials (100 posts sampled per trial), and collected the top-10 recommendations from each trial, yielding 1,000 recommended posts per condition and roughly 54,000 recommendations overall (54 conditions × 1,000). Recommendations were non-personalized, a design intended to isolate model-level biases from user personalization.
Results show a clear pattern: polarization is the dominant factor in what models pick, outperforming topic, toxicity, and sentiment as predictors of selection. Prompt objective strongly modulates toxicity and sentiment (engagement-style prompts prefer more toxic and negative content, while informative prompts suppress toxicity), but prompt framing has limited power to correct demographic skew. Provider differences matter: one provider had the most stable behavior across prompts, another adapted more depending on prompt for toxicity handling, and a third amplified negative sentiment most consistently. On Twitter/X, inferred political leaning produced the clearest demographic bias: left-leaning authors were over-represented in recommendations despite being a smaller share of the candidate pool.
Practical implications include the need for ranking constraints, curated training data, adversarial debiasing, and human-in-the-loop oversight when deploying LLM-based curation at scale.
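The sampling protocol described in this section can be sketched as a short simulation, which also makes the implied counts easy to check. This is a sketch under stated assumptions: `score_fn` is a hypothetical stand-in for the LLM ranker (the study prompts a model rather than sorting by a score), and the integer "posts" are placeholders. By this arithmetic, 54 conditions × 1,000 gives 54,000 top-10 recommendations, while 540,000 is the number of candidate posts shown across all trials (54 × 100 × 100).

```python
import random


def run_condition(pool, score_fn, n_trials=100, sample_size=100, top_k=10, seed=0):
    """Simulate one condition: sample candidates, rank them, keep the top-k."""
    rng = random.Random(seed)
    recommended = []
    for _ in range(n_trials):
        candidates = rng.sample(pool, sample_size)  # sampled without replacement
        ranked = sorted(candidates, key=score_fn, reverse=True)
        recommended.extend(ranked[:top_k])
    return recommended


# Placeholder pool of 5,000 "posts"; real posts would be text with feature scores.
pool = list(range(5000))
recs = run_condition(pool, score_fn=lambda post: post)  # hypothetical ranker

per_condition = len(recs)                        # 100 trials x top-10
n_conditions = 3 * 3 * 6                         # providers x platforms x prompts
total_recommendations = n_conditions * per_condition
total_candidates_shown = n_conditions * 100 * 100  # trials x posts per trial
```

Fixing the seed keeps a condition reproducible, which matters when comparing providers or prompts on the same sampled candidates.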
Credibility Assessment:
Authors have low h-indices and no stated affiliations; arXiv preprint with no citations. Lacks signals of top institutions or established researchers.