Logistic Regression and Machine Learning!
Machine learning is a branch of artificial intelligence that lets computers learn from data and make predictions or decisions without being explicitly programmed. Its three main types are supervised learning, unsupervised learning, and reinforcement learning. In supervised learning, models are trained on labeled data and are commonly used for classification (for example, predicting whether a tissue sample comes from the brain or the lung) and regression (for example, predicting continuous outcomes such as age or disease severity). Unsupervised learning finds patterns in unlabeled data, such as clusters of similar samples, while reinforcement learning learns through trial and error using feedback from the environment. In our project, we used a Jupyter Notebook to apply logistic regression, a supervised learning method commonly used for binary classification. Specifically, we built a model to distinguish lung tissue from brain tissue based on the expression levels of genes such as integrins. To evaluate the model, we used a Receiver Operating Characteristic (ROC) curve, which shows the trade-off between the true positive rate and the false positive rate as the classification threshold varies. The area under the ROC curve (AUC) summarizes how well the model separates the two classes: a value near 1.0 indicates strong performance, while 0.5 is no better than random guessing. This exercise demonstrated how machine learning can be applied to real biological data to uncover meaningful patterns and make accurate predictions.
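To make the ROC and AUC description concrete, here is a minimal sketch in plain Python. It is not the notebook's code: the function names `roc_points` and `auc_trapezoid`, the toy labels and scores, and the choice of lung as the positive class are all assumptions made for illustration.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points for every threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    # Sweep the threshold from the highest score down to the lowest.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear ROC curve (trapezoid rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical toy data (assumption): 1 = lung, 0 = brain.
# The scores stand in for predicted probabilities from a classifier.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
auc = auc_trapezoid(points)
print(points)
print(round(auc, 3))
```

Each point corresponds to one threshold: lowering the threshold moves the curve up when a positive sample is picked up and to the right when a negative sample is picked up. In practice a library routine would normally replace these hand-written helpers.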
This ROC curve shows the performance of a logistic regression model that uses ITGB4 expression alone to classify brain vs. lung samples. The area under the curve (AUC) is 0.47, slightly below the 0.5 expected from random guessing. In other words, ITGB4 expression alone does not provide useful discriminatory power for distinguishing between brain and lung tissues in this dataset. The curve stays close to the diagonal reference line, which means the model cannot effectively separate the two classes. This result highlights the need to explore alternative features or combine multiple genes to improve classification accuracy.
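One way to read an AUC just under 0.5 is through its ranking interpretation: the AUC equals the fraction of (positive, negative) sample pairs in which the positive sample receives the higher score. The sketch below uses made-up scores (not the ITGB4 data) and a hypothetical helper `auc_by_ranking` to show that reversing the scores turns an AUC of `a` into `1 - a`, so a value slightly below 0.5 is about as uninformative as one slightly above it.

```python
def auc_by_ranking(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy scores (assumption) that rank the classes slightly worse than chance.
labels = [0, 1, 0, 1, 0, 1, 1, 0]
scores = [0.2, 0.3, 0.7, 0.6, 0.4, 0.5, 0.1, 0.45]

auc = auc_by_ranking(labels, scores)

# Negating the scores reverses the ranking, which mirrors the AUC around 0.5.
auc_flipped = auc_by_ranking(labels, [-s for s in scores])

print(auc, auc_flipped)
```

Both values sit close to 0.5, which is why a result in this range is best read as "no usable signal" rather than as a model that is reliably wrong.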
This ROC curve evaluates the model’s ability to distinguish between brain and lung tissues using both ITGA3 and ITGB4 expression. The curve hugs the top-left corner of the plot, which indicates near-perfect classification: the model produces almost no false positives while capturing nearly all true positives. The combination of these two integrins therefore provides strong discriminatory power between the two tissue types.
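The effect of adding a second feature can be sketched with a small logistic regression fitted by gradient descent in plain Python. This is an illustration under stated assumptions, not the notebook's implementation: the data are synthetic (two made-up features, not real ITGA3 or ITGB4 measurements), and `fit_logistic`, `predict_proba`, and `auc_by_ranking` are hypothetical helper names. A real analysis would more likely rely on a library such as scikit-learn.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this sample: predicted probability minus label.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict_proba(X, w, b):
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]


def auc_by_ranking(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic data (assumption): each row is [feature_a, feature_b].
# The classes overlap on either feature alone but separate on their difference.
X = [[1, 2], [2, 3], [3, 4], [4, 5], [2, 1], [3, 2], [4, 3], [5, 4]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Model using only the first feature.
X_one = [[row[0]] for row in X]
w_one, b_one = fit_logistic(X_one, y)
auc_one = auc_by_ranking(y, predict_proba(X_one, w_one, b_one))

# Model using both features.
w_two, b_two = fit_logistic(X, y)
auc_two = auc_by_ranking(y, predict_proba(X, w_two, b_two))

print(auc_one, auc_two)
```

In this toy setup the single-feature model leaves the classes partly overlapping, while the two-feature model learns weights of opposite sign and ranks every positive sample above every negative one. Evaluating on the training data, as done here for brevity, tends to overstate performance, so a held-out test set would be preferable for real data.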
[Logistic regression Jupyter Notebook file](https://rwang08.github.io/logisticregression.html)