Final review from the Big Data Cancer Class

This summer class started with what cancer actually is: not just in general terms, but how it forms at the molecular level and how it behaves differently depending on where it starts and how far it has progressed. In the first week we talked about how cancer is basically cells that keep growing and dividing when they shouldn’t, which leads to tumors. Some of those cells can break off, travel through the body, and cause problems somewhere else, which means the cancer has metastasized (we learned more about this in the final lesson). We looked at breast cancer in particular: how it’s staged from 0 to 4, with stage 4 meaning it has metastasized, and how it can be detected with mammograms, MRIs, and biopsies.

In the second lesson we learned about the Human Genome Project and how sequencing used to be very expensive and slow. New technology has made sequencing genomes much faster and more affordable, which made projects like The Cancer Genome Atlas (TCGA) possible. We used Jupyter notebooks to calculate sequencing costs over time and to do basic coding such as reading files and plotting graphs. This part was helpful for background, but the math and repetition made it one of my least favorite activities. After that we learned about the central dogma (DNA to RNA to protein) and how only a small part of our genome actually codes for proteins. We also looked at integrin gene expression across different organs using violin plots and t-SNE plots, which helped us see tissue-specific patterns.

One of the most interesting lessons for me was when we got into AI and machine learning and how they’re being used in medicine, for example to detect cancer in pathology slides or to predict drug responses. We used Python and the integrin gene data to build simple models that classify organs, and then evaluated them using confusion matrices and ROC curves. We also learned that accuracy isn’t always the best metric when the data is imbalanced, which helped me understand how a model’s results can be misleading.

We also spent time learning about mutations in cancer like SNPs, insertions, deletions, and copy number changes, and how a mutation is called a driver or a passenger depending on whether it actually affects cancer growth. Using TCGA BRCA data, we looked at which genes are commonly mutated and used tools like the Xena Browser to find somatic variants, which helped show why understanding mutations is so important for targeted therapy.

Then we talked about cancer cell lines and how HeLa cells were the first immortalized human cell line. We looked at breast cancer subtypes like Luminal A, HER2+, and triple negative, and how tools like PAM50 and Prosigna are used to classify them in clinical settings.
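The point about accuracy on imbalanced data can be shown with a small sketch. This is not the model or data from class; it is a made-up two-class example (95 "healthy" samples and 5 "tumor" samples, with hypothetical function names) where a model that always predicts "healthy" still gets high accuracy while missing every tumor.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def recall(tp, fn):
    """Fraction of the actual positives that the model found."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


# Hypothetical, made-up labels: 0 = healthy, 1 = tumor.
y_true = [0] * 95 + [1] * 5
# A "lazy" model that predicts healthy for every sample.
y_pred = [0] * 100

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, fp, fn, tn)
rec = recall(tp, fn)
print(f"accuracy = {acc:.2f}, recall = {rec:.2f}")
```

The accuracy looks good only because most samples are healthy. The recall, which comes from the same confusion matrix, shows that the model never finds a tumor.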

In the final lesson we learned about liquid biopsy and how cancer spreads in steps: local invasion, entering the bloodstream, traveling through the body, leaving the blood, and starting new tumors in other places. We talked about circulating tumor cells (CTCs) and how the CellSearch system can find and count them using antibodies and magnets, then use fluorescence markers to confirm they’re not just white blood cells. What was really interesting was how the number of CTCs relates to patient survival: patients with fewer than 5 CTCs had better outcomes than those with 5 or more. This was shown using Kaplan-Meier plots, which we also made earlier in class. For example, I used TCGA data to make survival plots for breast cancer and acute myeloid leukemia (AML). The breast cancer curve declined slowly with longer survival times, while the AML curve dropped really fast, showing lower survival rates. That really showed how much variation there is between different cancer types.

Making the Kaplan-Meier plots using TCGA data from UCSC Xena was my favorite activity this summer. I focused on two cancer types, breast cancer (TCGA BRCA) and acute myeloid leukemia (TCGA LAML), and using the overall survival data we plotted the probability of survival over time for each one. It was really cool to see the actual curves and be able to interpret what they mean. The activity combined biology with coding and real-world data, and it showed how these tools can make a real impact on people’s lives.
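To show what goes into a Kaplan-Meier curve, here is a minimal sketch of the estimator in plain Python. It is not the notebook code from class, and the five patients below are made up for illustration (they are not TCGA data). The function name and inputs are my own assumptions: each patient has a follow-up time and a flag that is 1 if the patient died at that time and 0 if they were censored (still alive when follow-up ended).

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each time with a death.

    durations: follow-up time for each patient
    events:    1 if the patient died at that time, 0 if censored
    """
    data = sorted(zip(durations, events))
    n_at_risk = len(data)  # patients still being followed
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group together all patients who share the same time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            # Multiply by the fraction of at-risk patients who survived t.
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the at-risk group.
        n_at_risk -= leaving
    return curve


# Hypothetical, made-up follow-up times (in months) and event flags.
durations = [1, 2, 3, 4, 5]
events = [1, 0, 1, 1, 0]

km_curve = kaplan_meier(durations, events)
for time, prob in km_curve:
    print(f"t = {time}: S(t) = {prob:.3f}")
```

The curve only steps down at times when someone dies. Censored patients do not cause a drop, but they shrink the at-risk group, which makes later drops bigger. A curve that falls quickly, like the AML one, means many deaths happen early relative to the number of patients still being followed.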

If I had to pick a least favorite activity, I’d probably say it was the one where we calculated sequencing costs over time. It wasn’t that the activity was bad. It just felt more like plugging numbers into formulas than analyzing data, and I liked the other activities more because they felt more connected to the content.

Overall I think this class was really interesting, and I learned a lot about how cancer works, how big data plays a large role in the research and development of new tools, and how AI and machine learning can help make sense of all that data.

