Problem Set #3
The last problem set is free form and gives you a chance to design a script using the data files on RNA-seq quantification and GENCODE from last week, or use some other data that you find interesting. There are four requirements for your analysis:
Use Python to process the data. The point is to use Python, not necessarily to conduct a significant scientific analysis.
Create at least one graphic with matplotlib, seaborn or plotly.
Use pandas for part of your data processing, more than just reading in a TSV.
Do all of this in a Jupyter notebook, although if you need to use a command line tool for part of the data processing that is okay.
Use a docstring at the beginning of your script to briefly describe its logic, since we will not necessarily know the data you are using or the point of the script. Include inline comments as appropriate. Include a description of the information you graphed as a comment near the graph generation code.
Submit three files: your .ipynb file with inline output, an HTML version of the notebook, and at least one image your script created. The image must be a file written by the script, not a screen capture. We DO NOT want the data files, as these are often very large. The notebook file will show us enough to see the intermediate processing.
-Mike
'''
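Find the genes that are differentially expressed between sigmoid colon and stomach.
Load the gene quantification tables for both tissues, keep genes with TPM > 0,
join the tables on gene_id, and compute the log2 fold change (sigmoid colon / stomach).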
'''
import os
os.getcwd()
os.listdir()
import numpy as np
import pandas as pd
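# Table of quantification files from problem set 2 (no header row)
ps2_gene_quant_url = pd.read_csv("ps2_gene_quant_URLs.tsv", sep='\t', header=None)
ps2_gene_quant_url
# Row 1 is used for sigmoid colon and row 8 for stomach; column 2 holds the file location
sigmoid_colon_data = pd.read_csv(ps2_gene_quant_url.iloc[1, 2], sep='\t')
stomach_data = pd.read_csv(ps2_gene_quant_url.iloc[8, 2], sep='\t')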
sigmoid_colon_data.head(5)
stomach_data.head(5)
## Keep only gene_id and TPM, and drop genes with TPM = 0
sigmoid_colon_data = sigmoid_colon_data[['gene_id', 'TPM']]
sigmoid_colon_data = sigmoid_colon_data[sigmoid_colon_data.TPM > 0]
print(sigmoid_colon_data.shape)
sigmoid_colon_data.head(5)
## Keep only gene_id and TPM, and drop genes with TPM = 0
stomach_data = stomach_data[['gene_id', 'TPM']]
stomach_data = stomach_data[stomach_data.TPM > 0]
print(stomach_data.shape)
stomach_data.head(5)
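## Rename the TPM columns so they stay distinct after joining: TPM_1 = stomach, TPM_2 = sigmoid colon
stomach_data.columns = ['gene_id', 'TPM_1']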
print(stomach_data.columns)
sigmoid_colon_data.columns = ['gene_id', 'TPM_2']
print(sigmoid_colon_data.columns)
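## Inner join keeps only genes expressed (TPM > 0) in both tissues; the default left join
## would leave NaN in TPM_2 for genes found only in stomach
gene_data = stomach_data.join(sigmoid_colon_data.set_index('gene_id'), on='gene_id', how='inner')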
print(gene_data.shape)
gene_data.head(5)
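How the two tables are combined decides which genes survive into the comparison. `DataFrame.join` defaults to a left join, while `merge(..., how="inner")` keeps only keys present in both tables. The following is a minimal sketch on made-up toy data (the gene IDs and TPM values are hypothetical, not taken from the quantification files) that shows the difference:

```python
import pandas as pd

# Hypothetical toy tables standing in for the two filtered tissue tables.
toy_stomach = pd.DataFrame({"gene_id": ["g1", "g2", "g3"], "TPM_1": [1.0, 4.0, 2.0]})
toy_colon = pd.DataFrame({"gene_id": ["g2", "g3", "g4"], "TPM_2": [8.0, 1.0, 5.0]})

# Left join (the default for DataFrame.join): every stomach gene is kept,
# and genes missing from the colon table get NaN in TPM_2.
left_joined = toy_stomach.join(toy_colon.set_index("gene_id"), on="gene_id")

# Inner merge: only genes present in both tables are kept.
inner_merged = toy_stomach.merge(toy_colon, on="gene_id", how="inner")

print(left_joined)
print(inner_merged)
```

With a left join, the rows that have NaN in one tissue produce NaN fold changes downstream, so it is worth deciding explicitly whether such genes belong in the analysis.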
## Fold change of sigmoid colon (TPM_2) relative to stomach (TPM_1)
gene_data['fold_change'] = gene_data.TPM_2 / gene_data.TPM_1
## The log2 scale makes up- and down-regulation symmetric around 0
gene_data['log2_fold_change'] = np.log2(gene_data['fold_change'])
gene_data.head(5)
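To make the log2 fold change easier to read: a value of +1 means the numerator tissue has twice the TPM, 0 means no change, and -2 means one quarter of the TPM. A small sketch with hypothetical toy values (not real measurements), using the vectorized `np.log2` on a pandas column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy values chosen so the ratios are exact powers of two.
toy = pd.DataFrame({
    "gene_id": ["g1", "g2", "g3"],
    "TPM_1": [2.0, 4.0, 8.0],   # reference tissue
    "TPM_2": [4.0, 4.0, 2.0],   # comparison tissue
})

# Ratio of comparison to reference, then log2 applied to the whole column at once.
toy["fold_change"] = toy["TPM_2"] / toy["TPM_1"]
toy["log2_fold_change"] = np.log2(toy["fold_change"])

print(toy)
```

Because both tables were filtered to TPM > 0 before the ratio is taken, no division by zero or log of zero occurs here; without that filter a pseudocount would be one common workaround.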
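import seaborn as sns
## Graph: one point per gene, showing its log2 fold change (sigmoid colon / stomach);
## points above 0 are higher in sigmoid colon, points below 0 are higher in stomach
sns.scatterplot(x="gene_id", y="log2_fold_change", data=gene_data)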
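The assignment asks for an image file written by the script rather than a screen capture. One possible way to do that is sketched below with matplotlib's `savefig`. The data are synthetic random numbers standing in for the log2 fold changes, and the output file name `log2_fold_change_hist.png` is a hypothetical choice. A histogram is used here only as an alternative view of the distribution, not as the notebook's own plot.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line inside Jupyter to keep inline output
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for gene_data['log2_fold_change'] (not real data).
rng = np.random.default_rng(0)
toy_log2_fc = rng.normal(loc=0.0, scale=1.5, size=1000)

fig, ax = plt.subplots(figsize=(6, 4))
# Histogram of log2 fold changes: how many genes fall in each fold-change bin.
ax.hist(toy_log2_fc, bins=50)
ax.axvline(0, color="black", linewidth=1)  # 0 marks equal expression in both tissues
ax.set_xlabel("log2 fold change (sigmoid colon / stomach)")
ax.set_ylabel("Number of genes")
ax.set_title("Distribution of log2 fold change (synthetic data)")

# Write the figure to disk so it can be uploaded as an image file.
out_path = Path("log2_fold_change_hist.png")  # hypothetical file name
fig.savefig(out_path, dpi=150, bbox_inches="tight")
plt.close(fig)
```

To apply this to the real table, the synthetic array would be replaced by `gene_data['log2_fold_change'].dropna()`.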