{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "92092038", "metadata": {}, "source": [ "# Multiple samples" ] }, { "attachments": {}, "cell_type": "markdown", "id": "95cde18a", "metadata": {}, "source": [ "In this notebook, we will work with multiple samples and discuss batch effects." ] }, { "cell_type": "markdown", "id": "1c772fb6", "metadata": {}, "source": [ "# Load packages" ] }, { "cell_type": "code", "execution_count": 1, "id": "69e629d8", "metadata": {}, "outputs": [], "source": [ "import scanpy as sc\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import scipy\n", "import seaborn as sns" ] }, { "cell_type": "markdown", "id": "2970f5d6", "metadata": {}, "source": [ "# Loading multiple samples into Scanpy" ] }, { "cell_type": "markdown", "id": "cc3321af", "metadata": {}, "source": [ "All of the computation we have done so far has been on a single sample. However, most real-world datasets contain more than one sample, so we often need to merge several datasets. We will use Scanpy to concatenate multiple datasets into one unified AnnData object, and then discuss technical (batch) effects and their correction." ] }, { "cell_type": "markdown", "id": "4962811a", "metadata": {}, "source": [ "For the purposes of illustration, we will use the PBMC data (dataset 5) used in this publication: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1850-9. The authors provide the data in an easily usable format (https://github.com/JinmiaoChenLab/Batch-effect-removal-benchmarking)." 
] }, { "cell_type": "code", "execution_count": 2, "id": "7ad2fbbf", "metadata": {}, "outputs": [], "source": [ "input_path = '/Users/sara.jimenez/Documents/scWorkshop/data/'" ] }, { "cell_type": "markdown", "id": "8a79917b", "metadata": {}, "source": [ "## batch-1" ] }, { "cell_type": "code", "execution_count": 3, "id": "82135380", "metadata": {}, "outputs": [], "source": [ "data_batch1 = pd.read_csv(input_path + 'b1_exprs.txt', sep = '\\t', index_col = 0)\n", "celltype_batch1 = pd.read_csv(input_path + 'b1_celltype.txt', sep = '\\t', index_col = 0)" ] }, { "cell_type": "code", "execution_count": 4, "id": "dece40c0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
 | data_3p-AAACCTGAGCATCATC-0 | data_3p-AAACCTGAGCTAGTGG-0 | data_3p-AAACCTGCACATTAGC-0 | data_3p-AAACCTGCACTGTTAG-0 | data_3p-AAACCTGCATAGTAAG-0 | data_3p-AAACCTGCATGAACCT-0 | data_3p-AAACCTGGTAAGAGGA-0 | data_3p-AAACCTGGTAGAAGGA-0 | data_3p-AAACCTGGTCCAGTGC-0 | data_3p-AAACCTGGTGTCTGAT-0 | ... | data_3p-TTTGTCACAGGGATTG-0 | data_3p-TTTGTCAGTAGCAAAT-0 | data_3p-TTTGTCAGTCAGATAA-0 | data_3p-TTTGTCAGTCGCGTGT-0 | data_3p-TTTGTCAGTTACCGAT-0 | data_3p-TTTGTCATCATGTCCC-0 | data_3p-TTTGTCATCCGATATG-0 | data_3p-TTTGTCATCGTCTGAA-0 | data_3p-TTTGTCATCTCGAGTA-0 | data_3p-TTTGTCATCTGCTTGC-0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RP11-34P13.3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
FAM138A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
OR4F5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
RP11-34P13.7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
RP11-34P13.8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 8098 columns

 | Sample | n_counts | n_genes | batch | louvain | anno | Method | CellType |
---|---|---|---|---|---|---|---|---|
data_3p-AAACCTGAGCATCATC-0 | data_3p | 2394 | 871 | 0 | 9 | B cell | 10X_3prime | B cell |
data_3p-AAACCTGAGCTAGTGG-0 | data_3p | 4520 | 1316 | 0 | 5 | CD4 T cell | 10X_3prime | CD4 T cell |
data_3p-AAACCTGCACATTAGC-0 | data_3p | 2788 | 898 | 0 | 1 | CD4 T cell | 10X_3prime | CD4 T cell |
data_3p-AAACCTGCACTGTTAG-0 | data_3p | 4667 | 1526 | 0 | 0 | Monocyte_CD14 | 10X_3prime | Monocyte_CD14 |
data_3p-AAACCTGCATAGTAAG-0 | data_3p | 4440 | 1495 | 0 | 0 | Monocyte_CD14 | 10X_3prime | Monocyte_CD14 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
data_3p-TTTGTCATCATGTCCC-0 | data_3p | 3141 | 1176 | 0 | 4 | CD8 T cell | 10X_3prime | CD8 T cell |
data_3p-TTTGTCATCCGATATG-0 | data_3p | 5401 | 1379 | 0 | 4 | CD8 T cell | 10X_3prime | CD8 T cell |
data_3p-TTTGTCATCGTCTGAA-0 | data_3p | 6081 | 1802 | 0 | 0 | Monocyte_CD14 | 10X_3prime | Monocyte_CD14 |
data_3p-TTTGTCATCTCGAGTA-0 | data_3p | 3970 | 1317 | 0 | 7 | CD8 T cell | 10X_3prime | CD8 T cell |
data_3p-TTTGTCATCTGCTTGC-0 | data_3p | 4027 | 1259 | 0 | 4 | CD8 T cell | 10X_3prime | CD8 T cell |
8098 rows × 8 columns

 | data_5p-AAACCTGAGCGATAGC-1 | data_5p-AAACCTGAGCTAAACA-1 | data_5p-AAACCTGAGGGAGTAA-1 | data_5p-AAACCTGAGTCTTGCA-1 | data_5p-AAACCTGAGTTCGATC-1 | data_5p-AAACCTGCACACTGCG-1 | data_5p-AAACCTGCACGGTGTC-1 | data_5p-AAACCTGCAGATGGGT-1 | data_5p-AAACCTGCAGGTGGAT-1 | data_5p-AAACCTGGTAAGCACG-1 | ... | data_5p-TTTGTCACAGCTGGCT-1 | data_5p-TTTGTCACAGGTGGAT-1 | data_5p-TTTGTCAGTCCGAAGA-1 | data_5p-TTTGTCAGTTGATTGC-1 | data_5p-TTTGTCATCACAAACC-1 | data_5p-TTTGTCATCCACGTTC-1 | data_5p-TTTGTCATCGCGTAGC-1 | data_5p-TTTGTCATCTTAACCT-1 | data_5p-TTTGTCATCTTACCGC-1 | data_5p-TTTGTCATCTTGTTTG-1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RP11-34P13.3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
FAM138A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
OR4F5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
RP11-34P13.7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
RP11-34P13.8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 7378 columns

 | Sample | n_counts | n_genes | batch | louvain | anno | Method | CellType |
---|---|---|---|---|---|---|---|---|
data_5p-AAACCTGAGCGATAGC-1 | data_5p | 2712 | 1318 | 1 | 18 | NK cell | 10X_5prime | NK cell |
data_5p-AAACCTGAGCTAAACA-1 | data_5p | 6561 | 2164 | 1 | 3 | Monocyte_CD14 | 10X_5prime | Monocyte_CD14 |
data_5p-AAACCTGAGGGAGTAA-1 | data_5p | 6322 | 2112 | 1 | 8 | Monocyte_CD14 | 10X_5prime | Monocyte_CD14 |
data_5p-AAACCTGAGTCTTGCA-1 | data_5p | 4528 | 1526 | 1 | 16 | CD8 T cell | 10X_5prime | CD8 T cell |
data_5p-AAACCTGAGTTCGATC-1 | data_5p | 3426 | 1332 | 1 | 3 | Monocyte_CD14 | 10X_5prime | Monocyte_CD14 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
data_5p-TTTGTCATCCACGTTC-1 | data_5p | 6547 | 2044 | 1 | 3 | Monocyte_CD14 | 10X_5prime | Monocyte_CD14 |
data_5p-TTTGTCATCGCGTAGC-1 | data_5p | 3615 | 1397 | 1 | 10 | B cell | 10X_5prime | B cell |
data_5p-TTTGTCATCTTAACCT-1 | data_5p | 3828 | 1480 | 1 | 16 | CD8 T cell | 10X_5prime | CD8 T cell |
data_5p-TTTGTCATCTTACCGC-1 | data_5p | 6444 | 2388 | 1 | 28 | Plasmacytoid dendritic cell | 10X_5prime | Plasmacytoid dendritic cell |
data_5p-TTTGTCATCTTGTTTG-1 | data_5p | 4457 | 1662 | 1 | 11 | CD8 T cell | 10X_5prime | CD8 T cell |
7378 rows × 8 columns
\n", "