{ "metadata": { "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.6" }, "orig_nbformat": 2, "kernelspec": { "name": "magisterka", "display_name": "magisterka", "language": "python" } }, "nbformat": 4, "nbformat_minor": 2, "cells": [ { "source": [ "# Expert Rules" ], "cell_type": "markdown", "metadata": {} }, { "source": [ "This notebook presents example usage of user-guided rule induction which follows the scheme introduced by the [GuideR](https://www.sciencedirect.com/science/article/abs/pii/S0950705119300802?dgcid=coauthor) algorithm (Sikora et al, 2019). \n", "Each problem (classification, regression, survival) in addition to the basic class has an expert class, i.e. RuleClassifier and ExpertRuleClassifier. Expert classes allow you to define set of initial rules, preferred conditions and forbidden conditions. \n", "This tutorial will show you how to define rules and conditions\n" ], "cell_type": "markdown", "metadata": {} }, { "source": [ "## Import and init RuleKit" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from rulekit import RuleKit\n", "from rulekit.classification import RuleClassifier\n", "from rulekit.params import Measures\n", "\n", "\n", "RuleKit.init()" ] }, { "source": [ "## Classification" ], "cell_type": "markdown", "metadata": {} }, { "source": [ "### Prepare dataset" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "from scipy.io import arff\n", "import pandas as pd\n", "\n", "datasets_path = \"\" \n", "file_name = \"seismic-bumps.arff\"\n", "\n", "data_df = pd.DataFrame(arff.loadarff(datasets_path + file_name)[0])\n", "data_df['class'] = data_df['class'].astype(int)\n", "\n", "x = data_df.drop(['class'], axis=1)\n", "y = data_df['class']" ] }, { "source": [ "### Define rules and conditions" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "expert_rules = [('rule-0', 'IF [[gimpuls = (-inf, 750)]] THEN class = {0}'), ('rule-1', 'IF [[gimpuls = <750, inf)]] THEN class = {1}')]\n", "\n", "expert_preferred_conditions = [('preferred-condition-0', '1: IF [[seismic = {a}]] THEN class = {0}'), ('preferred-attribute-0', '1: IF [[gimpuls = Any]] THEN class = {1}')]\n", "\n", "expert_forbidden_conditions = [('forb-attribute-0', '1: IF [[seismoacoustic = Any]] THEN class = {0}'), ('forb-attribute-1', 'inf: IF [[ghazard = Any]] THEN class = {1}')]" ] }, { "source": [ "### Rule induction" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "from rulekit.classification import ExpertRuleClassifier\n", "\n", "clf = ExpertRuleClassifier(\n", " min_rule_covered = 8,\n", " max_growing= 0,\n", " extend_using_preferred=True,\n", " extend_using_automatic=True,\n", " induce_using_preferred=True,\n", " induce_using_automatic=True\n", ")\n", "clf.fit(values = x, labels = y, expert_rules = expert_rules, expert_preferred_conditions = expert_preferred_conditions, expert_forbidden_conditions= expert_forbidden_conditions)\n", "ruleset = clf.model" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "IF [[gimpuls = (-inf, 750)]] AND [seismic = {a}] AND nbumps4 = (-inf, 0.50) AND nbumps = (-inf, 1.50) THEN class = {0}\nIF gimpuls = (-inf, 1252.50) AND nbumps = (-inf, 1.50) THEN class = {0}\nIF gimpuls = (-inf, 1342.50) AND goimpuls = (-inf, 312) AND nbumps = (-inf, 1.50) THEN class = {0}\nIF gimpuls = (-inf, 1427.50) AND nbumps = (-inf, 1.50) THEN class = {0}\nIF gimpuls = (-inf, 1653.50) AND genergy = (-inf, 1006585) AND goimpuls = (-inf, 312) AND nbumps = (-inf, 1.50) THEN class = {0}\nIF gimpuls = (-inf, 1752) AND nbumps = (-inf, 1.50) THEN class = {0}\nIF gimpuls = (-inf, 2733) AND goimpuls = (-inf, 312) AND nbumps = (-inf, 1.50) THEN class = {0}\nIF gimpuls = <2965, inf) AND genergy = <634250, inf) AND nbumps = (-inf, 1.50) THEN class = {0}\nIF gimpuls = (-inf, 1331) AND nbumps = (-inf, 2.50) THEN class = {0}\nIF gimpuls = (-inf, 1655.50) AND genergy = (-inf, 386010) AND nbumps = (-inf, 2.50) THEN class = {0}\nIF gimpuls = (-inf, 1686) AND goimpuls = (-inf, 312) AND nbumps5 = (-inf, 0.50) AND nbumps = (-inf, 2.50) AND nbumps2 = (-inf, 1.50) THEN class = {0}\nIF gimpuls = (-inf, 2892) AND genergy = (-inf, 386010) AND goimpuls = (-inf, 312) AND nbumps = (-inf, 2.50) THEN class = {0}\nIF gimpuls = (-inf, 2068.50) AND goimpuls = (-inf, 312) AND genergy = (-inf, 1004565) AND nbumps = (-inf, 2.50) AND nbumps2 = (-inf, 1.50) THEN class = {0}\nIF gimpuls = (-inf, 2184.50) AND nbumps = (-inf, 2.50) THEN class = {0}\nIF gimpuls = (-inf, 1253.50) AND nbumps3 = (-inf, 1.50) AND nbumps2 = (-inf, 2.50) THEN class = {0}\nIF gimpuls = (-inf, 901) AND goimpuls = (-inf, 96.50) AND senergy = (-inf, 3850) AND nbumps = (-inf, 3.50) THEN class = {0}\nIF gimpuls = (-inf, 1253.50) AND nbumps3 = (-inf, 1.50) THEN class = {0}\nIF gimpuls = (-inf, 1253.50) AND goimpuls = (-inf, 312) AND senergy = (-inf, 9600) AND nbumps3 = (-inf, 2.50) AND nbumps2 = (-inf, 1.50) THEN class = {0}\nIF gimpuls = (-inf, 1253.50) AND nbumps3 = (-inf, 2.50) AND nbumps2 = (-inf, 1.50) THEN class = {0}\nIF gimpuls = (-inf, 1253.50) AND goimpuls = (-inf, 312) AND senergy = (-inf, 8100) AND nbumps2 = (-inf, 2.50) THEN class = {0}\nIF ghazard = {a} AND goenergy = <-40.50, 68.50) AND maxenergy = (-inf, 5500) AND gimpuls = (-inf, 901) AND goimpuls = <-39.50, inf) AND senergy = <1150, inf) AND nbumps2 = <1.50, inf) THEN class = {0}\nIF goenergy = <-48.50, inf) AND gimpuls = (-inf, 695.50) AND maxenergy = <2500, inf) AND goimpuls = <-54.50, inf) AND genergy = <10915, inf) AND nbumps3 = (-inf, 3.50) AND senergy = <3950, inf) AND nbumps2 = (-inf, 1.50) AND nbumps = (-inf, 6.50) THEN class = {0}\nIF gimpuls = (-inf, 1253.50) AND nbumps = (-inf, 4.50) AND nbumps2 = (-inf, 2.50) THEN class = {0}\nIF gimpuls = (-inf, 1253.50) AND nbumps3 = (-inf, 2.50) AND nbumps = (-inf, 5.50) THEN class = {0}\nIF maxenergy = (-inf, 75000) AND gimpuls = (-inf, 901) AND genergy = (-inf, 378500) AND nbumps3 = (-inf, 3.50) AND nbumps4 = (-inf, 2.50) THEN class = {0}\nIF gimpuls = (-inf, 1139.50) AND goimpuls = (-inf, 312) AND senergy = (-inf, 85450) THEN class = {0}\nIF gimpuls = <1150.50, inf) AND goimpuls = <-35.50, inf) AND nbumps3 = (-inf, 2.50) AND nbumps2 = (-inf, 0.50) AND nbumps = <1.50, inf) THEN class = {0}\nIF goenergy = <-18.50, inf) AND gimpuls = <927, inf) AND genergy = (-inf, 508210) AND senergy = (-inf, 5750) AND nbumps2 = <1.50, inf) THEN class = {0}\nIF senergy = (-inf, 5750) THEN class = {0}\nIF gimpuls = (-inf, 2489.50) AND genergy = (-inf, 318735) AND nbumps3 = (-inf, 2.50) AND nbumps2 = (-inf, 2.50) THEN class = {0}\nIF goenergy = <-36.50, inf) AND goimpuls = (-inf, 6.50) AND genergy = <392530, inf) AND senergy = <6750, inf) AND nbumps2 = (-inf, 1.50) THEN class = {0}\nIF gimpuls = (-inf, 3881.50) AND nbumps = (-inf, 4.50) AND nbumps2 = (-inf, 2.50) THEN class = {0}\nIF [gimpuls = <1253.50, inf)] AND goenergy = <-40.50, 87) AND maxenergy = (-inf, 7500) AND genergy = <96260, 673155) AND seismic = {b} AND seismoacoustic = {a} AND senergy = (-inf, 10000) AND nbumps = (-inf, 3.50) THEN class = {1}\nIF goenergy = (-inf, 96) AND maxenergy = <1500, inf) AND gimpuls = <605, 1959) AND goimpuls = <-55, 95) AND genergy = <61250, 662435) AND senergy = (-inf, 36050) AND nbumps3 = <0.50, inf) AND nbumps2 = <0.50, inf) AND nbumps = (-inf, 6.50) THEN class = {1}\nIF goenergy = (-inf, 186) AND maxenergy = <1500, inf) AND gimpuls = <538.50, inf) AND genergy = <58310, 934630) AND goimpuls = <-55, inf) AND senergy = (-inf, 40650) AND nbumps2 = <0.50, inf) THEN class = {1}\nIF gimpuls = <521.50, inf) AND genergy = <58310, inf) AND goimpuls = <-71, inf) AND senergy = <650, inf) AND nbumps = <1.50, inf) AND nbumps2 = <0.50, inf) THEN class = {1}\nIF goenergy = (-inf, 97) AND gimpuls = <378, 2132) AND maxenergy = <2500, inf) AND genergy = <34880, 587745) AND goimpuls = (-inf, 95) AND senergy = <3150, 36050) AND nbumps3 = <0.50, inf) AND nbumps2 = <0.50, inf) AND nbumps = (-inf, 6.50) THEN class = {1}\nIF goenergy = (-inf, 135.50) AND gimpuls = <306, inf) AND genergy = <19245, inf) AND senergy = <550, inf) AND nbumps = <1.50, inf) THEN class = {1}\nIF goenergy = (-inf, -1.50) AND gimpuls = <153.50, 289) AND genergy = <17405, 37085) AND goimpuls = <-60.50, inf) AND senergy = (-inf, 40500) AND nbumps3 = (-inf, 3.50) AND nbumps = <1.50, inf) AND nbumps2 = <0.50, inf) THEN class = {1}\nIF goenergy = (-inf, 131.50) AND gimpuls = <1253.50, inf) AND genergy = <54930, 1062020) AND goimpuls = <-60.50, 109) AND shift = {W} AND senergy = (-inf, 36050) AND nbumps2 = (-inf, 2.50) THEN class = {1}\nIF gimpuls = <98.50, inf) AND senergy = <650, inf) AND nbumps2 = <0.50, inf) THEN class = {1}\nIF goenergy = <-78.50, inf) AND gimpuls = <66, inf) AND goimpuls = <-74.50, inf) AND genergy = <3065, inf) AND senergy = <550, inf) THEN class = {1}\nIF goenergy = (-inf, 176.50) AND gimpuls = <131, inf) AND genergy = <48545, inf) THEN class = {1}\nIF goenergy = <-4, inf) AND gimpuls = <396, 1445.50) AND genergy = <32795, 49585) AND goimpuls = <-19, inf) AND shift = {W} AND senergy = (-inf, 350) THEN class = {1}\nIF goenergy = <-37.50, inf) AND gimpuls = <537.50, 796) AND genergy = <16805, 32020) AND goimpuls = <-36.50, inf) AND senergy = (-inf, 250) THEN class = {1}\nIF goenergy = <-37.50, 181) AND gimpuls = <240, 470.50) AND genergy = <19670, 40735) AND goimpuls = <-42.50, inf) AND shift = {W} THEN class = {1}\nIF gimpuls = <54.50, inf) AND goimpuls = <-74.50, inf) AND genergy = <1510, inf) AND senergy = (-inf, 115450) THEN class = {1}\n" ] } ], "source": [ "for rule in ruleset.rules:\n", " print(rule)" ] }, { "source": [ "## Regression" ], "cell_type": "markdown", "metadata": {} }, { "source": [ "### Prepare dataset" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "from scipy.io import arff\n", "import pandas as pd\n", "\n", "datasets_path = \"\" \n", "file_name = \"methane-train.arff\"\n", "\n", "data_df = pd.DataFrame(arff.loadarff(datasets_path + file_name)[0])\n", "\n", "x = data_df.drop(['MM116_pred'], axis=1)\n", "y = data_df['MM116_pred']" ] }, { "source": [ "### Define rules and conditions" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "expert_rules = None\n", "\n", "expert_preferred_conditions = [('preferred-condition-0', '3: IF PD = <0.5, inf) THEN MM116_pred = {NaN}'), ('preferred-condition-1', '5: IF PD = <0.5, inf) AND MM116 = (-inf, 1.0) THEN MM116_pred = {NaN}')]\n", "\n", "expert_forbidden_conditions = [('forb-attribute-0', 'inf: IF DMM116 = Any THEN MM116_pred = {NaN}')]" ] }, { "source": [ "### Rule induction" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "from rulekit.regression import ExpertRuleRegressor\n", "\n", "reg = ExpertRuleRegressor(\n", " min_rule_covered = 5,\n", " max_growing= 0,\n", " extend_using_preferred=True,\n", " extend_using_automatic=False,\n", " induce_using_preferred=True,\n", " induce_using_automatic=True\n", ")\n", "reg.fit(values = x, labels = y, expert_rules = expert_rules, expert_preferred_conditions = expert_preferred_conditions, expert_forbidden_conditions= expert_forbidden_conditions)\n", "ruleset = reg.model" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "IF MM116 = (-inf, 0.45) AND MM31 = <0.18, 0.24) THEN MM116_pred = {0.40} [0.38,0.42]\nIF MM31 = (-inf, 0.25) THEN MM116_pred = {0.40} [0.33,0.47]\nIF AS038 = (-inf, 2.45) AND MM31 = (-inf, 0.26) THEN MM116_pred = {0.40} [0.31,0.49]\nIF MM31 = <0.18, 0.28) THEN MM116_pred = {0.50} [0.38,0.62]\nIF MM116 = <0.25, 0.45) AND MM31 = <0.18, inf) AND BA13 = (-inf, 1077.50) THEN MM116_pred = {0.40} [0.38,0.42]\nIF MM116 = (-inf, 0.25) THEN MM116_pred = {0.20} [0.15,0.25]\nIF MM116 = <0.45, 0.55) AND PG072 = <1.65, inf) THEN MM116_pred = {0.50} [0.47,0.53]\nIF MM116 = <0.55, 0.65) AND MM31 = <0.24, inf) THEN MM116_pred = {0.60} [0.56,0.64]\nIF MM116 = <0.45, 0.75) AND AS038 = (-inf, 2.45) AND MM31 = (-inf, 0.29) AND BA13 = <1072.50, 1077.50) THEN MM116_pred = {0.50} [0.47,0.53]\nIF PD = (-inf, 0.50) AND MM116 = <0.45, 0.85) AND AS038 = (-inf, 2.45) AND MM31 = (-inf, 0.29) AND BA13 = <1072.50, 1077.50) THEN MM116_pred = {0.50} [0.44,0.56]\nIF MM116 = (-inf, 0.75) THEN MM116_pred = {0.50} [0.37,0.63]\nIF MM116 = <0.45, 0.85) THEN MM116_pred = {0.70} [0.55,0.85]\nIF MM116 = <0.45, inf) AND MM31 = <0.30, inf) THEN MM116_pred = {0.90} [0.68,1.12]\nIF MM116 = <0.70, inf) THEN MM116_pred = {0.90} [0.70,1.10]\n" ] } ], "source": [ "for rule in ruleset.rules:\n", " print(rule)" ] }, { "source": [ "## Survival" ], "cell_type": "markdown", "metadata": {} }, { "source": [ "### Prepare dataset" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "from scipy.io import arff\n", "import pandas as pd\n", "\n", "datasets_path = \"\" \n", "file_name = 'bmt.arff'\n", "\n", "data_df = pd.DataFrame(arff.loadarff(open(datasets_path + file_name, 'r', encoding=\"cp1252\"))[0])\n", "\n", "# code to fix the problem with encoding of the file\n", "tmp_df = data_df.select_dtypes([object]) \n", "tmp_df = tmp_df.stack().str.decode(\"cp1252\").unstack()\n", "for col in tmp_df:\n", " data_df[col] = tmp_df[col]\n", " \n", "data_df = data_df.replace({'?': None})\n", "\n", "x = data_df.drop(['survival_status'], axis=1)\n", "y = data_df['survival_status']" ] }, { "source": [ "### Define rules and conditions" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "expert_rules = [('rule-0', 'IF [[CD34kgx10d6 = (-inf, 10.0)]] AND [[extcGvHD = {0}]] THEN survival_status = {NaN}')]\n", "\n", "expert_preferred_conditions = [('attr-preferred-0', 'inf: IF [CD34kgx10d6 = Any] THEN survival_status = {NaN}')]\n", "\n", "expert_forbidden_conditions = [('attr-forbidden-0', 'IF [ANCrecovery = Any] THEN survival_status = {NaN}')]" ] }, { "source": [ "### Rule induction" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "from rulekit.survival import ExpertSurvivalRules\n", "\n", "srv = ExpertSurvivalRules(\n", " survival_time_attr = 'survival_time',\n", " min_rule_covered = 5,\n", " max_growing= 0,\n", " extend_using_preferred=False,\n", " extend_using_automatic=False,\n", " induce_using_preferred=True,\n", " induce_using_automatic=True\n", ")\n", "srv.fit(values = x, labels = y, expert_rules = expert_rules, expert_preferred_conditions = expert_preferred_conditions, expert_forbidden_conditions= expert_forbidden_conditions)\n", "ruleset = srv.model" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "IF [[CD34kgx10d6 = (-inf, 10)]] AND [[extcGvHD = {0}]] THEN survival_status = {NaN}\nIF [CD34kgx10d6 = (-inf, 11.86)] AND PLTrecovery = <500142.50, inf) THEN survival_status = {NaN}\nIF [CD34kgx10d6 = (-inf, 11.86)] AND RecipientRh = {1} AND Recipientage = <17.85, inf) THEN survival_status = {NaN}\nIF [CD34kgx10d6 = (-inf, 11.86)] AND Relapse = {0} AND PLTrecovery = <26, inf) AND Recipientage = <14.30, inf) THEN survival_status = {NaN}\nIF [CD34kgx10d6 = (-inf, 11.86)] AND Donorage = (-inf, 40.64) AND Gendermatch = {0} AND PLTrecovery = <26, inf) AND Recipientage = <12, 18.85) THEN survival_status = {NaN}\nIF [CD34kgx10d6 = (-inf, 11.86)] AND RecipientRh = {1} AND CD3dCD34 = <6.64, inf) THEN survival_status = {NaN}\nIF [CD34kgx10d6 = <11.86, inf)] AND Relapse = {0} THEN survival_status = {NaN}\nIF [CD34kgx10d6 = (-inf, 11.86)] AND Relapse = {0} AND extcGvHD = {1} THEN survival_status = {NaN}\nIF [CD34kgx10d6 = (-inf, 11.86)] AND Donorage35 = {1} AND Rbodymass = <37.65, inf) THEN survival_status = {NaN}\nIF [CD34kgx10d6 = (-inf, 11.86)] AND PLTrecovery = <19.50, inf) AND Donorage35 = {0} AND CD3dkgx10d8 = <0.92, inf) AND CD3dCD34 = <0.97, inf) AND Rbodymass = (-inf, 43.50) AND Recipientage = (-inf, 12.95) THEN survival_status = {NaN}\nIF [CD34kgx10d6 = <11.86, inf)] THEN survival_status = {NaN}\n" ] } ], "source": [ "for rule in ruleset.rules:\n", " print(rule)" ] } ] }