Learning Bayesian Networks in the Presence of Missing Values

Monireh Navaei Lavasani

Thesis title

Learning Bayesian Networks in the Presence of Missing Values

Student: Monireh Navaei Lavasani

Supervisor: Vahid Rezaei Tabar

Advisor: Reza Pourtaheri

Examiner: Farzad Eskandari

Field of study: Data Science

Degree level: Master's

Defense time

Abstract (translated from Persian)

Today, Bayesian networks are a powerful tool for modeling static and dynamic phenomena and systems, with applications in fields including disease diagnosis, decision making, and classification. A Bayesian network is a probabilistic graphical model that represents cause-and-effect relationships between variables and consists of a directed acyclic graph and a set of conditional probabilities. Two important issues in modeling a dataset with a Bayesian network are structure learning and parameter learning. Structure learning means finding the structure that best fits the data, and is carried out with constraint-based, score-based (search-and-score), and hybrid methods. Parameter learning means finding the entries of the conditional probability table for discrete variables and computing the regression coefficients for continuous variables. In effect, the goal of parameter learning is to estimate the probability of each node given its parents. These steps are feasible with complete data. When the data contain missing values, however, the problem is no longer simple. One way to deal with it is to delete the missing data, which is not a suitable approach because it causes loss of information and biased estimators. Another approach is data imputation, which performs better than deletion. Structural expectation maximization is one of the methods used to learn Bayesian networks in the presence of missing values. This algorithm is an advanced statistical learning method for modeling when part of the data is unobserved. Unlike the $EM$ algorithm, which only optimizes the model parameters within a predetermined structure, the $SEM$ algorithm makes it possible to learn the structure and the parameters of the model simultaneously. This property has made $SEM$ a very effective tool for learning probabilistic graphical models such as Bayesian networks when the structure of the model is also unknown. In this algorithm, each iteration begins with the expectation step, in which the probability distribution of the missing values is computed given the observed data and the current model structure.
Then, in the maximization step, using the expectation obtained, the model structure is updated first and the parameters of the new model are then estimated by maximization. This iterative process continues until convergence. The algorithm is applied in areas such as data mining, machine learning, bioinformatics, and artificial intelligence, and it performs notably well when the data are incomplete, the structure of the system is complex, or the model search space is large.
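
The iteration described above can be written compactly as follows. This is only a sketch under stated assumptions: the symbols $D_{obs}$ (observed data), $D_{mis}$ (missing values), $(G_t, \theta_t)$ (structure and parameters at iteration $t$), and the complexity penalty $\mathrm{Pen}(G)$ (for example a BIC-type term) are notation introduced here for illustration and are not fixed by the abstract.

$$
Q(G, \theta \mid G_t, \theta_t) = \mathbb{E}_{D_{mis} \sim P(\cdot \mid D_{obs}, G_t, \theta_t)} \left[ \log P(D_{obs}, D_{mis} \mid G, \theta) \right] - \mathrm{Pen}(G)
$$

$$
G_{t+1} = \arg\max_{G} \max_{\theta} Q(G, \theta \mid G_t, \theta_t), \qquad \theta_{t+1} = \arg\max_{\theta} Q(G_{t+1}, \theta \mid G_t, \theta_t)
$$

Holding the structure fixed ($G_{t+1} = G_t$) and dropping the penalty reduces these updates to the ordinary $EM$ parameter update.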

Abstract

Bayesian networks are a powerful tool for modeling static and dynamic systems and are applied in fields such as diagnosis, decision making, and classification. A Bayesian network is a probabilistic graphical model that represents the causal relationships between variables. It consists of a directed acyclic graph and a set of conditional probabilities. Two central tasks in modeling a dataset with a Bayesian network are structure learning and parameter learning. Structure learning seeks the structure that best fits the data and can be carried out with constraint-based, score-based, or hybrid methods. Parameter learning finds the entries of the conditional probability tables for categorical variables and the regression coefficients for continuous variables. In other words, the purpose of parameter learning is to estimate the probability of each node given its parents. These steps are feasible when the dataset is complete. However, if the dataset contains missing values, the problem is no longer easy to solve. One way to handle missing data is deletion, which is not a suitable method because it discards information and biases the estimators. Imputation is another technique, which can give more accurate results than deletion. Structural expectation maximization (SEM) is one of the powerful methods for learning a Bayesian network in the presence of missing data. This algorithm is an advanced statistical learning technique designed for modeling in the presence of incomplete, missing, or hidden data. Unlike the expectation maximization (EM) algorithm, which focuses solely on optimizing model parameters within a fixed structure, SEM allows for the simultaneous learning of both the structure and the parameters of probabilistic models. This makes SEM particularly effective for learning probabilistic graphical models, such as Bayesian networks, when the model structure is unknown.
In each iteration of the SEM algorithm, the E-step computes the expected sufficient statistics of the complete data given the observed data and the current model structure and parameters. In the M-step, the algorithm first updates the model structure and then re-estimates the model parameters to maximize the expected complete-data log-likelihood. This iterative process continues until convergence. SEM has wide applications in fields such as data mining, machine learning, bioinformatics, and artificial intelligence. It is especially useful when the data are partially observed, the underlying system is complex, or the model search space is large. By offering a principled approach to jointly learning structure and parameters, SEM provides a powerful framework for statistical modeling under uncertainty.
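
To make the E-step and M-step concrete, the following is a minimal sketch of one possible structural EM loop on a toy problem. Everything in it is an assumption made for illustration and not the method of the thesis: two binary variables A and B, only two candidate structures ("independent" and "A->B"), a BIC-style penalty as the score, a small smoothing constant, and the hypothetical function names `expected_counts`, `fit`, `expected_bic`, and `structural_em`. `None` marks a missing value.

```python
import math
from itertools import product

# Hypothetical toy setting: two binary variables (A, B) and two candidate
# structures. None marks a missing value in a data row.
CELLS = list(product((0, 1), repeat=2))
N_PARAMS = {"independent": 2, "A->B": 3}  # free parameters per structure


def expected_counts(data, joint, smooth=1e-3):
    """E-step: spread each row over the cells compatible with its observed values."""
    counts = {cell: smooth for cell in CELLS}  # tiny smoothing keeps all cells positive
    for a, b in data:
        compatible = [(x, y) for x, y in CELLS
                      if a in (None, x) and b in (None, y)]
        total = sum(joint[c] for c in compatible)
        for c in compatible:
            counts[c] += joint[c] / total
    return counts


def fit(structure, counts):
    """Maximum-likelihood joint table for one structure, from expected counts."""
    n = sum(counts.values())
    if structure == "A->B":
        # P(A) * P(B | A) is saturated here, so it reproduces the expected frequencies.
        return {c: counts[c] / n for c in CELLS}
    # Independent structure: P(A) * P(B).
    p_a = {x: (counts[(x, 0)] + counts[(x, 1)]) / n for x in (0, 1)}
    p_b = {y: (counts[(0, y)] + counts[(1, y)]) / n for y in (0, 1)}
    return {(x, y): p_a[x] * p_b[y] for x, y in CELLS}


def expected_bic(structure, joint, counts):
    """Expected complete-data log-likelihood minus a BIC-style complexity penalty."""
    n = sum(counts.values())
    loglik = sum(counts[c] * math.log(joint[c]) for c in CELLS)
    return loglik - 0.5 * N_PARAMS[structure] * math.log(n)


def structural_em(data, max_iter=100, tol=1e-8):
    structure = "independent"
    joint = {c: 0.25 for c in CELLS}  # uniform starting point
    for _ in range(max_iter):
        counts = expected_counts(data, joint)          # E-step
        fits = {s: fit(s, counts) for s in N_PARAMS}   # parameters for each candidate
        # M-step: pick the structure with the best expected score, keep its parameters.
        best = max(fits, key=lambda s: expected_bic(s, fits[s], counts))
        change = max(abs(fits[best][c] - joint[c]) for c in CELLS)
        converged = best == structure and change < tol
        structure, joint = best, fits[best]
        if converged:
            break
    return structure, joint


# Toy sample in which B strongly follows A, with some values missing.
dependent = ([(0, 0)] * 40 + [(0, 1)] * 10 + [(1, 0)] * 10 + [(1, 1)] * 40
             + [(0, None)] * 10 + [(None, 1)] * 10)
structure, joint = structural_em(dependent)
print(structure)  # A->B
```

With only two variables the structure search is a comparison of two candidates; a realistic implementation would instead search over directed acyclic graphs (for example by hill climbing) and compute the expected counts by inference in the current network.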