Problem Set #3

The last problem set is free form and gives you a chance to design a script using the data files on RNA-seq quantification and GENCODE from last week, or use some other data that you find interesting. There are four requirements for your analysis:

  1. Use Python to process the data. The point is to use Python, not necessarily to conduct a significant scientific analysis.

  2. Create at least one graphic with matplotlib, seaborn or plotly.

  3. Use pandas for part of your data processing, more than just reading in a TSV.

  4. Do all of this in a Jupyter notebook, although if you need to use a command line tool for part of the data processing that is okay.

Using docstring at the beginning of your script to provide a short description of the logic of your script as we’ll not necessarily know the data you are using and the point of the script. Include inline comments as appropriate. Include a description of the information you graphed as a comment near the graph generation code.

Submit three files: your .ipynb file with inline output, HTML version of the notebook, and upload at least one image your script created. Upload an image file created from the script, not a screen capture. We DO NOT want the data files, these are often very large. The notebook file will show us enough so we can see the intermediate processing.

-Mike

In [1]:
'''
I just want to know the different expression genes between sigmoid colon and stomach.
'''
import os as os
os.getcwd()
Out[1]:
'D:\\my github\\shen\\content\\en\\post\\2019-12-15-python-fro-genomics-final-set'
In [2]:
os.listdir()
Out[2]:
['.ipynb_checkpoints',
 'BIOS274-2019-PS2-description.pdf',
 'ENCFF016CBS.tsv',
 'example_URL_code.ipynb',
 'final_problem_xiaotao_shen.ipynb',
 'output_file',
 'Problem_Set2_Part1_Xiaotao_Shen.ipynb',
 'ps2_gencode_url.tsv',
 'ps2_gene_quant_URLs.tsv',
 'Xiaotao Shen code for problem set2.ipynb']
In [3]:
import numpy as np
import pandas as pd
In [4]:
ps2_gene_quant_url = pd.read_csv("ps2_gene_quant_URLs.tsv", sep='\t', header = None)
ps2_gene_quant_url
Out[4]:
0 1 2
0 ENCFF016CBS omental fat pad https://www.encodeproject.org/files/ENCFF016CB...
1 ENCFF365ZMW sigmoid colon https://www.encodeproject.org/files/ENCFF365ZM...
2 ENCFF408GFZ subcutaneous adipose tissue https://www.encodeproject.org/files/ENCFF408GF...
3 ENCFF505TUS prostate gland https://www.encodeproject.org/files/ENCFF505TU...
4 ENCFF633OSJ suprapubic skin https://www.encodeproject.org/files/ENCFF633OS...
5 ENCFF862LZL heart left ventricle https://www.encodeproject.org/files/ENCFF862LZ...
6 ENCFF863ERP testis https://www.encodeproject.org/files/ENCFF863ER...
7 ENCFF916ODF vagina https://www.encodeproject.org/files/ENCFF916OD...
8 ENCFF918KPC stomach https://www.encodeproject.org/files/ENCFF918KP...

Read sigmoid colon and stomach gene expression data

In [5]:
sigmoid_colon_data = pd.read_csv(list(ps2_gene_quant_url.iloc[1])[2], sep = '\t')
In [6]:
stomach_data = pd.read_csv(list(ps2_gene_quant_url.iloc[8])[2], sep = '\t')
In [7]:
sigmoid_colon_data.head(5)
Out[7]:
gene_id transcript_id(s) length effective_length expected_count TPM FPKM posterior_mean_count posterior_standard_deviation_of_count pme_TPM pme_FPKM TPM_ci_lower_bound TPM_ci_upper_bound FPKM_ci_lower_bound FPKM_ci_upper_bound
0 10904 10904 93.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 12954 12954 94.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 12956 12956 72.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 12958 12958 82.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 12960 12960 73.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
In [8]:
stomach_data.head(5)
Out[8]:
gene_id transcript_id(s) length effective_length expected_count TPM FPKM posterior_mean_count posterior_standard_deviation_of_count pme_TPM pme_FPKM TPM_ci_lower_bound TPM_ci_upper_bound FPKM_ci_lower_bound FPKM_ci_upper_bound
0 10904 10904 93.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 12954 12954 94.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 12956 12956 72.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 12958 12958 82.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 12960 12960 73.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

Clean data

In [9]:
##only remain gene id and TPM and remove the genes with TPM = 0
sigmoid_colon_data = sigmoid_colon_data[['gene_id', 'TPM']]
sigmoid_colon_data = sigmoid_colon_data[sigmoid_colon_data.TPM > 0]
In [10]:
print(sigmoid_colon_data.shape)
sigmoid_colon_data.head(5)
(20750, 2)
Out[10]:
gene_id TPM
649 ENSG00000000003.14 0.46
650 ENSG00000000005.5 0.01
651 ENSG00000000419.12 4.51
652 ENSG00000000457.13 0.78
653 ENSG00000000460.16 0.36
In [11]:
##only remain gene id and TPM and remove the genes with TPM = 0
stomach_data = stomach_data[['gene_id', 'TPM']]
stomach_data = stomach_data[stomach_data.TPM > 0]
In [12]:
print(stomach_data.shape)
stomach_data.head(5)
(16461, 2)
Out[12]:
gene_id TPM
649 ENSG00000000003.14 0.49
651 ENSG00000000419.12 0.24
652 ENSG00000000457.13 0.28
653 ENSG00000000460.16 2.18
654 ENSG00000000938.12 0.15

Align stomach_data and sigmoid_colon_data

In [13]:
stomach_data.columns = ['gene_id', 'TPM_1']
print(stomach_data.columns)
sigmoid_colon_data.columns = ['gene_id', 'TPM_2']
print(sigmoid_colon_data.columns)
Index(['gene_id', 'TPM_1'], dtype='object')
Index(['gene_id', 'TPM_2'], dtype='object')
In [14]:
gene_data = stomach_data.join(sigmoid_colon_data.set_index('gene_id'), on='gene_id')
In [15]:
print(gene_data.shape)
gene_data.head(5)
(16461, 3)
Out[15]:
gene_id TPM_1 TPM_2
649 ENSG00000000003.14 0.49 0.46
651 ENSG00000000419.12 0.24 4.51
652 ENSG00000000457.13 0.28 0.78
653 ENSG00000000460.16 2.18 0.36
654 ENSG00000000938.12 0.15 0.40

Find different expression genes

In [16]:
import math as math
fold_change = gene_data.TPM_2/gene_data.TPM_1
log2_fold_change = [math.log2(x) for x in fold_change]
In [17]:
gene_data['fold_change'] = fold_change
gene_data['log2_fold_change'] = log2_fold_change
In [18]:
gene_data.head(5)
Out[18]:
gene_id TPM_1 TPM_2 fold_change log2_fold_change
649 ENSG00000000003.14 0.49 0.46 0.938776 -0.091148
651 ENSG00000000419.12 0.24 4.51 18.791667 4.232021
652 ENSG00000000457.13 0.28 0.78 2.785714 1.478047
653 ENSG00000000460.16 2.18 0.36 0.165138 -2.598259
654 ENSG00000000938.12 0.15 0.40 2.666667 1.415037
In [21]:
from seaborn as sns
  File "<ipython-input-21-886ad395f1de>", line 1
    from seaborn as sns
                  ^
SyntaxError: invalid syntax
In [20]:
sns.scatterplot(x="gene_id", y="log2_fold_change", data=gene_data)
d:\software\python\lib\site-packages\plotnine\scales\scale.py:93: MatplotlibDeprecationWarning: 
The iterable function was deprecated in Matplotlib 3.1 and will be removed in 3.3. Use np.iterable instead.
  if cbook.iterable(self.breaks) and cbook.iterable(self.labels):
d:\software\python\lib\site-packages\plotnine\layer.py:449: UserWarning: geom_point : Removed 1071 rows containing missing values.
  self.data = self.geom.handle_na(self.data)
d:\software\python\lib\site-packages\plotnine\utils.py:553: MatplotlibDeprecationWarning: 
The iterable function was deprecated in Matplotlib 3.1 and will be removed in 3.3. Use np.iterable instead.
  return cbook.iterable(var) and not is_string(var)
Out[20]:
<ggplot: (-9223371844339231456)>
In [ ]: