{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "92092038", "metadata": {}, "source": [ "# Pre-processing" ] }, { "attachments": {}, "cell_type": "markdown", "id": "95cde18a", "metadata": {}, "source": [ "In this notebook, we will look at basic pre-processing of scRNA-seq data. We'll cover two topics: \n", "- Quality control (QC) \n", "- Data normalization " ] }, { "attachments": {}, "cell_type": "markdown", "id": "1c772fb6", "metadata": {}, "source": [ "# Library imports" ] }, { "attachments": {}, "cell_type": "markdown", "id": "0cc7ef24", "metadata": {}, "source": [ "Import all packages used in the tutorial" ] }, { "cell_type": "code", "execution_count": 37, "id": "69e629d8", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import scipy\n", "import os\n", "\n", "#single cell library\n", "import scanpy as sc\n", "\n", "#doublet detection\n", "import scrublet as scr\n" ] }, { "cell_type": "markdown", "id": "2970f5d6", "metadata": {}, "source": [ "# Load data" ] }, { "attachments": {}, "cell_type": "markdown", "id": "cc3321af", "metadata": {}, "source": [ "10k human peripheral blood mononuclear cells (PBMCs) from a healthy female donor aged 25-30 were obtained by 10x Genomics.\n", "\n", "The data is a directory containing an `mtx` file (*i.e.* the count matrix) and two `tsv` files with barcodes (*i.e.* cell indices) and features (*i.e.* gene symbols). 
The `read_10x_mtx` function in `Scanpy` reads these files (including `gz`-compressed ones) and returns an `AnnData` object.\n", "\n", "The data used here was downloaded from: https://www.10xgenomics.com/datasets/10k-human-pbmcs-3-v3-1-chromium-x-with-intronic-reads-3-1-high" ] }, { "cell_type": "code", "execution_count": 2, "id": "091ece7f", "metadata": {}, "outputs": [], "source": [ "input_path = '/Users/sara.jimenez/Documents/scWorkshop/data/'" ] }, { "cell_type": "code", "execution_count": 3, "id": "5b5e9b52", "metadata": {}, "outputs": [], "source": [ "adata = sc.read_10x_mtx(input_path + 'filtered_feature_bc_matrix/')" ] }, { "cell_type": "code", "execution_count": 4, "id": "1045dca3", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "AnnData object with n_obs × n_vars = 11984 × 36601\n", " var: 'gene_ids', 'feature_types'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adata" ] }, { "cell_type": "code", "execution_count": 5, "id": "b652c992", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| |
|---|
| AAACCCAAGGCCCAAA-1 |
| AAACCCAAGTAATACG-1 |
| AAACCCAAGTCACACT-1 |
| AAACCCACAAAGCGTG-1 |
| AAACCCACAATCGAAA-1 |
| ... |
| TTTGTTGGTTGGATCT-1 |
| TTTGTTGGTTTCTTAC-1 |
| TTTGTTGTCCATTTCA-1 |
| TTTGTTGTCTACACAG-1 |
| TTTGTTGTCTCATTAC-1 |

11984 rows × 0 columns

| | gene_ids | feature_types |
|---|---|---|
| MIR1302-2HG | ENSG00000243485 | Gene Expression |
| FAM138A | ENSG00000237613 | Gene Expression |
| OR4F5 | ENSG00000186092 | Gene Expression |
| AL627309.1 | ENSG00000238009 | Gene Expression |
| AL627309.3 | ENSG00000239945 | Gene Expression |
| ... | ... | ... |
| AC141272.1 | ENSG00000277836 | Gene Expression |
| AC023491.2 | ENSG00000278633 | Gene Expression |
| AC007325.1 | ENSG00000276017 | Gene Expression |
| AC007325.4 | ENSG00000278817 | Gene Expression |
| AC007325.2 | ENSG00000277196 | Gene Expression |

36601 rows × 2 columns