A MACHINE LEARNING APPROACH TO QUANTIFYING INTRA-TUMOR HETEROGENEITY USING GENOMIC, EPIGENOMIC AND PROTEOMIC DATA

Thesis Type: Master's

Institution Where the Thesis Was Conducted: Istanbul University-Cerrahpaşa, Faculty of Engineering, Department of Computer Engineering, Turkey

Thesis Advisor: Özgür Can Turna

Thesis Approval Date: 2021

Thesis Language: English

Abstract:

Cancer is the collective name for diseases characterized by uncontrolled cell proliferation in a tissue or organ. It stems from molecular alterations within cells that cause intracellular mechanisms to deviate from their normal functioning. Intracellular functions are carried out by proteins, and molecular changes may cause some proteins to be over- or under-synthesized. Proteins produced in abnormal amounts may disrupt many cellular functions and cause cells to proliferate aberrantly, leading to tumor formation. Cancerous cells may undergo successive molecular alterations, so several distinct groups of cancerous cells emerge within the same tumor. Intra-Tumor Heterogeneity (ITH) refers to the presence of these distinct cell groups within a single tumor. ITH has been found to be associated with numerous prognostic factors, including survival, tumor advancement, metastasis, immunity, therapeutic resistance, and drug response. It is therefore essential to quantify ITH in order to draw inferences about disease prognosis. Previously, ITH was determined by visual examination of tumor samples. However, Next Generation Sequencing, a recent sequencing technology that yields various types of genomic, epigenomic and proteomic data on patients, has enabled many researchers to study the determination of ITH through data science. Existing studies evaluate ITH using only gene expression and DNA mutation data, and they are limited to a few cancer types. This study proposes a novel approach that uses genomic, epigenomic and proteomic data sets to establish relationships with ITH-associated features. Because survival is strongly associated with ITH, survival analysis is conducted on data sets that have been transformed to represent the overall aberrancy level of the tumor samples.
This study aims to encompass various molecular data sets, including gene expression, DNA methylation, protein synthesis, copy number variation (CNV) and single nucleotide variation (SNV) data. Because it is based on multi-omics data and is a pan-cancer study, it is expected to contribute significantly to the literature by covering data types and cancer types that have not previously been the focus of such work.

Furthermore, machine learning models are developed to predict pre-calculated subclone numbers from the transformed values of the data sets. Subclone numbers are determined from tumor image data or mutation data. Approaches that estimate subclone numbers from mutation data yield significantly differing results. For this reason, including more comprehensive data sets is suggested in order to produce better estimates. Moreover, other data types such as DNA methylation and protein synthesis data have not so far been used to infer subclone numbers. Multi-omics approaches are therefore considered potentially more informative for estimating subclone numbers than single molecular data sets. Because it predicts subclone numbers from gene expression, DNA methylation, protein synthesis, CNV and SNV data, this study is expected to be a significant contribution to the literature.

The results demonstrate that the features calculated by the proposed method are strongly associated with overall survival in several cancer types and at the pan-cancer scale. In addition, ensemble methods predict the subclone numbers with an R-squared score greater than 0.8. Further studies are encouraged to validate the transformation technique by applying it to different cancer data sets.
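
For readers unfamiliar with the metric, the R-squared score reported for the subclone predictions is the coefficient of determination: one minus the ratio of the residual sum of squares to the total sum of squares. The following is a minimal sketch in plain Python. The function name `r_squared` and the sample values are hypothetical and chosen only for illustration; they are not data or code from the thesis.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: squared error of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared deviation of the true values from their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when all true values are identical")
    return 1.0 - ss_res / ss_tot


# Made-up subclone counts, for illustration only (not results from the thesis).
observed = [1, 2, 3, 4, 5]
predicted = [1.2, 1.9, 3.3, 3.8, 4.9]
score = r_squared(observed, predicted)
print(f"R-squared: {score:.3f}")
```

A score of 1 means the predictions match the observed values exactly, while a score of 0 means the model does no better than always predicting the mean of the observed values.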