
DeepPersona

A Depth-First Synthetic-Persona Engine for Highly Personalized Language Models

1University of California San Diego, 2KU Leuven, 3Shanghai Jiao Tong University,
4University of Michigan, 5Denison University, 6Meta
* Equal Contribution † Corresponding Author
Domain Distribution

Domain coverage of DeepPersona's human attribute taxonomy.

A Toolkit, Not Just a Dataset

DeepPersona is a generative engine powered by the largest and continuously extensible human attribute taxonomy to date. Researchers can:

  • Control anchor traits to synthesize targeted cohorts
  • Bias depth toward specific subtrees
  • Enhance existing shallow personas with rich details
  • Scale to billions of richly detailed profiles

Brief Introduction

Simulating human profiles by instilling personas into large language models (LLMs) is rapidly transforming research in personalization, social simulation, and human-AI alignment. However, most existing synthetic personas remain shallow and simplistic, capturing minimal attributes and failing to reflect the rich complexity and diversity of real human identities.

We introduce DeepPersona, a scalable generative engine for synthesizing narrative-complete synthetic personas through a two-stage, taxonomy-guided method. First, we algorithmically construct the largest human-attribute taxonomy to date, comprising hundreds of hierarchically organized attributes, by systematically mining thousands of real user-ChatGPT conversations. Second, we progressively sample attributes from this taxonomy, conditionally generating coherent and realistic personas that average hundreds of structured attributes and roughly 1 MB of narrative text each, two orders of magnitude deeper than prior work.

Intrinsic evaluations confirm significant improvements in attribute diversity (32% higher coverage) and profile uniqueness (44% greater) over state-of-the-art baselines. Extrinsically, our personas improve GPT-4.1-mini's personalized Q&A accuracy by an average of 11.6% across ten metrics, and narrow the gap between simulated LLM "citizens" and authentic human responses in social surveys by 32%.

DeepPersona thus provides a rigorous, scalable, and privacy-free platform for high-fidelity human simulation and personalized AI research.

DeepPersona Framework

DeepPersona Framework. Our two-stage approach: (1) constructing a comprehensive human-attribute taxonomy from real conversations, and (2) progressively sampling attributes to generate deep, coherent personas.

Stage 1: Human-Attribute Taxonomy Construction. We systematically mine 3,000 real-world user-ChatGPT dialogues from the Puffin dataset, identifying 1,224 high-quality personalized Q&A pairs. Using GPT-4o, we extract and hierarchically organize fine-grained attributes (e.g., Lifestyle → Food Preference → Vegan), then merge semantically similar branches. The resulting taxonomy contains 4,676 unique nodes organized under 12 broad categories (Demographics, Health, Core Values, etc.).
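To make the assembly step concrete, here is a minimal sketch of inserting extracted attribute paths into a taxonomy trie. This is a hypothetical illustration, not the released pipeline: the actual system uses GPT-4o to extract attributes and to merge semantically similar branches, whereas this sketch approximates merging with case-insensitive name matching.

```python
# Hypothetical sketch of Stage 1 taxonomy assembly: extracted attribute
# paths (e.g. Lifestyle -> Food Preference -> Vegan) are inserted into a
# trie. Semantic merging (GPT-4o in the real pipeline) is approximated
# here by case-insensitive label matching.

class TaxonomyNode:
    def __init__(self, name):
        self.name = name
        self.children = {}  # normalized label -> TaxonomyNode

    def insert(self, path):
        node = self
        for label in path:
            key = label.strip().lower()
            if key not in node.children:
                node.children[key] = TaxonomyNode(label.strip())
            node = node.children[key]

    def count_nodes(self):
        """Total nodes in this subtree, including this node."""
        return 1 + sum(c.count_nodes() for c in self.children.values())

root = TaxonomyNode("Human Attributes")
root.insert(["Lifestyle", "Food Preference", "Vegan"])
root.insert(["Lifestyle", "Food Preference", "Vegetarian"])
root.insert(["lifestyle", "Exercise Habit", "Running"])  # merges into "Lifestyle"
print(root.count_nodes())  # -> 7
```

Branches that differ only in surface form collapse into one node, which is the trie-level analogue of the semantic merge the paper describes.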

Stage 2: Progressive Attribute Sampling. With the comprehensive taxonomy \(T\) in place, persona generation reduces to iteratively sampling from the structured distribution \(\mathbb{P}_{\theta,T}(P \mid S,k) = \prod_{i=1}^{k} \Pr(a_i \mid S, P_{\lt i}, T)\, \Pr_{\theta}(v_i \mid a_i, S, P_{\lt i}, T)\), where \(P = \{(a_i, v_i)\}_{i=1}^{k}\) and \(P_{\lt i} = \{(a_j, v_j)\}_{j \lt i}\). The attribute selector chooses the next node \(a_i\) from \(T\) and the LLM \(\theta\) generates its value \(v_i\). To achieve realistic depth and diversity, we first anchor a stable core of non-negotiable attributes (age, location, career, personal values, life attitude, hobbies), then use bias-free value assignment from predefined tables for demographics to avoid majority-culture defaults. The selector performs stochastic breadth-first traversal of \(T\), biasing toward long-tail branches to maximize diversity, while the LLM progressively fills each selected attribute conditioned on the evolving profile \(P_{\lt i}\) to ensure global coherence. A narrative summary \(\operatorname{Narr}(P)\) is produced as a byproduct of this sampling process.
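The sampling loop above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: `taxonomy` is a hypothetical dict mapping each attribute to its children, and `generate_value` stands in for the LLM \(\theta\) conditioned on the evolving profile \(P_{\lt i}\).

```python
import random

def sample_persona(taxonomy, generate_value, anchors, k, rng=None):
    """Sketch of Stage 2: stochastic breadth-first traversal of the
    taxonomy, biased toward long-tail branches, filling each selected
    attribute with a value conditioned on the profile so far."""
    rng = rng or random.Random(0)
    profile = dict(anchors)            # stable anchor core (age, location, ...)
    frontier = list(taxonomy["ROOT"])  # BFS frontier over the taxonomy
    while frontier and len(profile) < k:
        # Bias toward long-tail branches: fewer children -> higher weight.
        weights = [1.0 / (1 + len(taxonomy.get(a, []))) for a in frontier]
        attr = rng.choices(frontier, weights=weights)[0]
        frontier.remove(attr)
        frontier.extend(taxonomy.get(attr, []))
        if attr not in profile:
            # Value generation conditioned on the evolving profile (P_{<i}).
            profile[attr] = generate_value(attr, profile)
    return profile

# Toy taxonomy and a deterministic stand-in for the LLM value generator.
toy = {
    "ROOT": ["lifestyle", "health"],
    "lifestyle": ["food preference", "exercise habit"],
    "health": [],
    "food preference": [],
    "exercise habit": [],
}
persona = sample_persona(toy, lambda a, p: f"<{a}>", {"age": "34"}, k=4)
print(persona)
```

Anchoring the core attributes before traversal mirrors the paper's design: the non-negotiable attributes fix the persona's identity, and the stochastic traversal then diversifies everything downstream of them.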

Try DeepPersona

Experience DeepPersona firsthand! Generate personalized profiles and explore diverse personas through our interactive demo.


Profile Examples

Explore diverse persona profiles generated across different countries and demographics


Experimental Results

Main Results

We benchmark DeepPersona on three complementary axes to verify that profiles are deep, distinct, and useful: (i) intrinsic quality, (ii) LLM personalization, and (iii) social simulation. Together, these evaluations assess whether DeepPersona truly advances synthetic users from verbose text to research-ready human proxies.

Contact

If you have any questions regarding DeepPersona, feel free to reach out to us via email at zhouyufan365@gmail.com or directly submit a GitHub issue.

BibTeX

@article{wang2024deeppersona,
    title={DeepPersona: A Depth-First Synthetic-Persona Engine for Highly Personalized Language Models},
    author={Wang, Zhen and Zhou, Yufan and Luo, Zhongyan and Yao, Adam Wood Man and Pan, Liushang},
    journal={arXiv preprint},
    year={2024}
}