Introduction
Overview of the Dataset on Hugging Face
The dataset is publicly available on Hugging Face under the repository ID LLM-Digital-Twin/Twin-2K-500.
Each entry in the dataset represents a unique persona and consists of four fields: pid, persona_text, persona_summary, and persona_json.
pid: a unique participant identifier.
persona_text: the complete text of the survey along with the corresponding responses.
persona_summary: a textual summary of the individual based on the survey content.
persona_json: a structured JSON object containing the organized survey data and responses, which can be used for downstream processing or analysis.
The following code snippet demonstrates how to load and parse the persona_json field from the dataset:
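from datasets import load_dataset
import json

# Load the 'full_persona' configuration of the dataset
ds = load_dataset('LLM-Digital-Twin/Twin-2K-500', 'full_persona')
# persona_json is decoded with json.loads to obtain a Python object
first_person = json.loads(ds['data'][0]['persona_json'])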
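When iterating over many entries, it can be convenient to wrap the decoding step in a small helper that also checks that the four documented fields are present. The following is a minimal sketch, not part of the dataset's tooling: the helper name parse_persona, the returned key names, and the fake_record used for the demonstration are all hypothetical, and the record is made-up data that only mimics the four documented fields.

```python
import json

# The four fields documented for each dataset entry.
EXPECTED_FIELDS = ("pid", "persona_text", "persona_summary", "persona_json")


def parse_persona(record):
    """Validate one entry and decode its persona_json field.

    Hypothetical helper: `record` is any mapping with the four
    documented fields, e.g. one row of the loaded dataset.
    """
    missing = [name for name in EXPECTED_FIELDS if name not in record]
    if missing:
        raise KeyError(f"record is missing fields: {missing}")

    raw = record["persona_json"]
    # Decode only if the value is still a string; pass through otherwise.
    parsed = json.loads(raw) if isinstance(raw, str) else raw

    return {
        "pid": record["pid"],
        "summary": record["persona_summary"],
        "persona": parsed,
    }


# Made-up record for illustration only; real entries come from the dataset.
fake_record = {
    "pid": "p-0",
    "persona_text": "survey text",
    "persona_summary": "short summary",
    "persona_json": '{"k": 1}',
}
result = parse_persona(fake_record)
print(result["pid"], result["persona"])
```

With the real dataset, the same helper could be applied to ds['data'][0] from the snippet above, assuming each row exposes the four fields as described.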
The next section describes the structure of persona_json in detail.
Structure of the persona_json file
This section describes the hierarchical structure of the persona_json file, which stores each individual's responses in the study. The structure is organized into two primary levels: blocks and, within each block, a set of questions and corresponding answers.
The survey is conducted across four distinct waves, and each block is associated with one of these waves. Blocks serve as thematic groupings of questions, which may vary in number and content. Blocks and questions can be presented in a fixed or randomized order. In some cases, blocks or individual questions may be randomly selected for inclusion based on experimental conditions or display logic (in the JSON file, this is especially true of the blocks in the fourth wave).
This section provides a comprehensive listing of all possible question blocks and their contents. However, the actual set of questions encountered by each participant (or digital twin) may differ due to such randomization and conditional display mechanisms.
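Because the set of blocks and questions can differ from one participant to another, it may help to inspect the nesting of a decoded persona_json object before relying on any particular key. The sketch below makes no assumption about the actual key names in the dataset: describe_structure is a hypothetical helper that walks any decoded JSON value, and the toy object is invented purely to show the output shape.

```python
import json


def describe_structure(node, max_depth=2, depth=0):
    """Return indented lines summarizing nested dicts and lists.

    Hypothetical helper: works on any decoded JSON value and stops
    descending once `max_depth` levels have been listed.
    """
    indent = "  " * depth
    lines = []
    if isinstance(node, dict):
        for key, value in node.items():
            lines.append(f"{indent}{key}: {type(value).__name__}")
            if depth + 1 < max_depth:
                lines.extend(describe_structure(value, max_depth, depth + 1))
    elif isinstance(node, list):
        lines.append(f"{indent}[list of {len(node)} items]")
        # Only the first item is inspected, as a representative sample.
        if node and depth + 1 < max_depth:
            lines.extend(describe_structure(node[0], max_depth, depth + 1))
    return lines


# Invented toy object; the key names here are NOT taken from the dataset.
toy = json.loads('{"BlockA": {"Questions": [{"Q1": "text", "Answer": "yes"}]}}')
summary = describe_structure(toy, max_depth=3)
print("\n".join(summary))
```

Applied to a decoded persona (for example first_person from the loading snippet), the same call would list whatever blocks that particular participant actually encountered.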
Element 0:
Element 1:
Element 2:
Element 3:
Element 4:
Element 5:
Element 6:
Element 7:
Element 8:
Element 9:
Element 10:
Element 11:
Element 12:
Element 13:
Element 14:
Randomization Group of 2 blocks
Variation 1 of the Group:
Variation 2 of the Group:
Element 15:
Randomization Group of 2 blocks
Variation 1 of the Group:
Variation 2 of the Group:
Element 16:
Randomization Group of 2 blocks
Variation 1 of the Group:
Variation 2 of the Group:
Element 17:
Randomization Group of 2 blocks
Variation 1 of the Group:
Variation 2 of the Group:
Element 18:
Randomization Group of 2 blocks
Variation 1 of the Group:
Variation 2 of the Group:
Element 19:
Randomization Group of 2 blocks
Variation 1 of the Group:
Variation 2 of the Group:
Element 20:
Randomization Group of 3 blocks
Variation 1 of the Group:
Variation 2 of the Group:
Variation 3 of the Group:
Element 21:
Randomization Group of 3 blocks
Variation 1 of the Group:
Variation 2 of the Group:
Variation 3 of the Group:
Element 22:
Randomization Group of 3 blocks
Variation 1 of the Group:
Variation 2 of the Group:
Variation 3 of the Group:
Element 23:
Randomization Group of 2 blocks
Variation 1 of the Group:
Variation 2 of the Group:
Element 24:
Randomization Group of 2 blocks
Variation 1 of the Group:
Variation 2 of the Group:
Element 25:
Randomization Group of 3 blocks
Variation 1 of the Group:
Variation 2 of the Group:
Variation 3 of the Group:
Element 26:
Randomization Group of 2 blocks
Variation 1 of the Group:
Variation 2 of the Group:
Element 27:
Randomization Group of 2 blocks
Variation 1 of the Group:
Variation 2 of the Group:
Element 28:
Randomization Group of 2 blocks
Variation 1 of the Group:
Variation 2 of the Group: