Data Overview & Exploration

This notebook provides an initial exploration of the synthetic TCR repertoire dataset. The data simulates a cell therapy manufacturing workflow where peripheral blood mononuclear cells (PBMCs) from the apheresis collection are processed into a final drug product. We examine 6 patients with paired Apheresis/Product samples.

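# Load dependencies
library(tidyverse)   # read_csv(), dplyr verbs, ggplot2
library(immunarch)   # immune repertoire analysis toolkit
# vegan is called below as vegan::diversity(), so it must be installed but need not be attached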

# Read synthetic data
tcr_data <- read_csv("data/raw/synthetic_tcr_repertoires.csv")
patient_meta <- read_csv("data/raw/patient_metadata.csv")

glimpse(tcr_data)

Rows: 81,593
Columns: 9
$ patient_id        <chr> "PT-001", "PT-001", "PT-001", "PT-001", …
$ clinical_response <chr> "CR", "CR", "CR", "CR", …
$ sample_type       <chr> "Apheresis", "Apheresis", "Apheresis", …
$ clonotype_id      <chr> "PT-001_CLN_00001", "PT-001_CLN_00002", …
$ cdr3_aa           <chr> "CDWVSYQFTKRRF", "CMGDVHRMPPGLMF", …
$ v_gene            <chr> "TRBV7-9", "TRBV29-1", "TRBV29-1", …
$ j_gene            <chr> "TRBJ1-1", "TRBJ2-1", "TRBJ2-1", …
$ clone_count       <dbl> 8242, 3590, 2207, 1568, …
$ clone_fraction    <dbl> 0.08689, 0.03785, 0.02327, 0.01653, …

02 Patient Cohort

The cohort includes 6 AML patients with varying clinical responses to cell therapy: 2 complete responders (CR), 2 partial responders (PR), and 2 with progressive disease (PD).

Patient	Response	Age	Sex	Disease	Prior Lines
PT-001	CR	60	F	AML	2
PT-002	CR	46	M	AML	2
PT-003	PR	43	M	AML	3
PT-004	PR	59	M	AML	4
PT-005	PD	47	F	AML	2
PT-006	PD	58	F	AML	2

03 Per-Sample Summary

For each patient, we have paired Apheresis (starting material) and Product (manufactured drug product) TCR repertoire samples. Key diversity metrics are computed below.

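# Compute diversity metrics per sample
diversity_summary <- tcr_data %>%
  group_by(patient_id, sample_type, clinical_response) %>%
  summarise(
    # number of unique clonotypes in the sample
    n_clonotypes    = n_distinct(clonotype_id),
    # total reads across all clonotypes
    total_reads     = sum(clone_count),
    # Shannon H; vegan::diversity() uses the natural log by default
    shannon_entropy = vegan::diversity(clone_count, index = "shannon"),
    # 1 - normalized entropy (0 = perfectly even repertoire)
    clonality       = 1 - (shannon_entropy / log(n_clonotypes)),
    .groups = "drop"
  )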

Patient	Sample	Response	Clonotypes	Reads	Shannon H	Clonality	D50
PT-001	Apheresis	CR	13,238	94,857	7.9996	0.1571	1816
PT-001	Product	CR	2,029	72,517	4.4388	0.4171	3
PT-002	Apheresis	CR	14,132	106,246	8.1066	0.1517	2046
PT-002	Product	CR	940	79,693	3.0637	0.5525	1
PT-003	Apheresis	PR	11,658	120,934	7.7771	0.1694	1379
PT-003	Product	PR	1,123	99,121	3.4408	0.5101	2
PT-004	Apheresis	PR	12,786	126,170	7.9188	0.1626	1639
PT-004	Product	PR	2,474	51,822	4.8667	0.3771	5
PT-005	Apheresis	PD	9,771	144,336	7.4659	0.1874	858
PT-005	Product	PD	1,913	60,767	4.6019	0.3910	5
PT-006	Apheresis	PD	8,676	113,231	7.2834	0.1968	614
PT-006	Product	PD	2,853	55,484	5.2527	0.3398	11
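
The D50 column above is not produced by the summarise() call, and the notebook does not show how it was computed. The sketch below is one plausible base-R reconstruction of the table's metrics. It assumes D50 means the smallest number of top-ranked clonotypes whose reads together reach at least 50% of the sample total. The helper names shannon_h, clonality_index and d50 are hypothetical, and the toy count vectors are made up for illustration.

```r
# Shannon entropy with natural log, the default base in vegan::diversity()
shannon_h <- function(counts) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log(p))
}

# Clonality as 1 - normalized entropy; undefined for fewer than two clonotypes
clonality_index <- function(counts) {
  n <- sum(counts > 0)
  if (n < 2) return(NA_real_)   # log(1) = 0 would give 0 / 0
  1 - shannon_h(counts) / log(n)
}

# ASSUMED definition: smallest number of top clonotypes covering >= 50% of reads
d50 <- function(counts) {
  sorted   <- sort(counts, decreasing = TRUE)
  cum_frac <- cumsum(sorted) / sum(sorted)
  which(cum_frac >= 0.5)[1]
}

even   <- rep(100, 4)          # hypothetical repertoire, four equal clones
skewed <- c(970, 10, 10, 10)   # hypothetical repertoire, one dominant clone

shannon_h(even)           # equals log(4)
clonality_index(even)     # 0 up to floating-point error
clonality_index(skewed)   # much closer to 1
d50(even)                 # 2
d50(skewed)               # 1

# Cross-check against the PT-001 Apheresis row (H = 7.9996, 13,238 clonotypes)
pt001_clonality <- round(1 - 7.9996 / log(13238), 4)   # 0.1571
```

Under that assumed definition, adding d50 = d50(clone_count) inside summarise() would yield one D50 value per sample. The NA guard also covers a case the original formula does not: a sample with a single clonotype, where log(n_clonotypes) is zero.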

Key Observation

Apheresis samples show high diversity (Shannon H 7.3–8.1, clonality 0.15–0.20) with 8,676 to 14,132 unique clonotypes, while Product samples are far more oligoclonal (Shannon H 3.1–5.3, clonality 0.34–0.55, D50 ≤ 11 versus 614–2,046 in Apheresis). This is consistent with the expected biology: manufacturing selects and expands a subset of T cell clones from the starting material.
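
To make the drop in diversity explicit per patient, the paired values can be reshaped to one row per patient and differenced. The sketch below copies the Shannon H values from the per-sample table; the object names shannon_tbl and shannon_change and the column delta_h are hypothetical. In the notebook itself the same reshaping could presumably be applied to diversity_summary.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Shannon H values copied from the per-sample summary table
shannon_tbl <- tibble(
  patient_id      = rep(sprintf("PT-%03d", 1:6), each = 2),
  sample_type     = rep(c("Apheresis", "Product"), times = 6),
  shannon_entropy = c(7.9996, 4.4388, 8.1066, 3.0637, 7.7771, 3.4408,
                      7.9188, 4.8667, 7.4659, 4.6019, 7.2834, 5.2527)
)

# One row per patient, with the Product-minus-Apheresis change in Shannon H
shannon_change <- shannon_tbl %>%
  pivot_wider(names_from = sample_type, values_from = shannon_entropy) %>%
  mutate(delta_h = Product - Apheresis)

shannon_change
```

Every delta_h is negative, which is the per-patient form of the cohort-level observation. With only two patients per response group, any apparent difference between CR, PR and PD should be treated as descriptive only.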

04 V Gene Usage

We examine TRBV gene segment usage across samples to identify any preferential gene usage patterns between Apheresis and Product.

# V gene usage summary
vgene_summary <- tcr_data %>%
  group_by(sample_type, v_gene) %>%
  summarise(total_count = sum(clone_count), .groups = "drop") %>%
  group_by(sample_type) %>%
  mutate(fraction = total_count / sum(total_count)) %>%
  arrange(sample_type, desc(fraction))

Figure 1. Top 10 TRBV gene segments by read fraction, shown for Apheresis (cyan) and Product (purple) samples aggregated across all patients.
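
The plotting code for Figure 1 is not shown. The following is a minimal sketch of how such a figure might be drawn with ggplot2, not the notebook's actual code. It runs on a made-up stand-in for vgene_summary (toy_vgene, with placeholder gene labels and random counts), and the rule for choosing the top 10 genes (highest mean fraction across the two sample types) is an assumption.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for vgene_summary; labels and counts are not real usage data
set.seed(1)
genes <- paste0("TRBV", 1:15)
toy_vgene <- expand.grid(sample_type = c("Apheresis", "Product"),
                         v_gene = genes, stringsAsFactors = FALSE) %>%
  mutate(total_count = rpois(n(), lambda = 500) + 1) %>%
  group_by(sample_type) %>%
  mutate(fraction = total_count / sum(total_count)) %>%
  ungroup()

# ASSUMED selection rule: rank genes by mean fraction across both sample types
top_genes <- toy_vgene %>%
  group_by(v_gene) %>%
  summarise(mean_fraction = mean(fraction), .groups = "drop") %>%
  slice_max(mean_fraction, n = 10, with_ties = FALSE) %>%
  pull(v_gene)

# Dodged bars, one pair per gene, colored as described in the caption
p <- toy_vgene %>%
  filter(v_gene %in% top_genes) %>%
  ggplot(aes(x = reorder(v_gene, -fraction), y = fraction, fill = sample_type)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c(Apheresis = "cyan3", Product = "purple")) +
  labs(x = "TRBV gene", y = "Read fraction", fill = "Sample type")
```

Printing p renders the chart. Because vgene_summary pools reads across patients, samples with more reads weigh more heavily in each fraction; averaging per-patient fractions would be an alternative if equal patient weighting is preferred.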

05 Data Quality Notes

Synthetic Data Disclaimer

This dataset is entirely synthetic, generated to demonstrate analytical methods and pipeline architecture. CDR3 sequences are randomly generated and do not correspond to real TCR rearrangements. Clone size distributions follow power-law models calibrated to approximate real repertoire characteristics.
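
The generator itself is not part of this notebook. As a rough illustration of what a power-law clone size model can look like, the sketch below draws read counts from a Zipf-like rank distribution. The values of n_clones, total_reads and alpha are hypothetical and are not the calibration used for this dataset.

```r
set.seed(42)
n_clones    <- 1000    # hypothetical number of clonotypes
total_reads <- 50000   # hypothetical sequencing depth
alpha       <- 1.2     # hypothetical power-law exponent

# Probability of each clone rank is proportional to rank^(-alpha)
rank_prob <- (1:n_clones)^(-alpha)
rank_prob <- rank_prob / sum(rank_prob)

# Distribute the reads across clones with a single multinomial draw
clone_count    <- as.vector(rmultinom(1, size = total_reads, prob = rank_prob))
clone_fraction <- clone_count / sum(clone_count)

head(clone_count)
```

A larger alpha concentrates reads in the top-ranked clones, which moves a simulated repertoire toward the oligoclonal Product profile; a smaller alpha gives the flatter distribution seen in Apheresis.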