Practice Goals
Using the wine dataset, we train and compare DecisionTreeClassifier, RandomForestClassifier, AdaBoostClassifier, and GradientBoostingClassifier.
Loading the Wine Dataset (sklearn.datasets)
from sklearn.datasets import load_wine
wine = load_wine()
sklearn.datasets ships a variety of sample datasets for machine learning. The wine dataset is a classification dataset and is easily loaded with the load_wine() function. The loaded object is not a pandas.DataFrame but an instance of sklearn.utils.Bunch, which makes it awkward to index the wine object directly during preprocessing.
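Since direct indexing of the Bunch is inconvenient, a minimal workaround (using pandas; this conversion is an addition, not part of the original code) is to build a DataFrame yourself:
# Wrap the Bunch arrays in a DataFrame so features can be selected by name
import pandas as pd
df = pd.DataFrame(wine.data, columns=wine.feature_names)
df["target"] = wine.target
print(df.head())
Alternatively, load_wine(as_frame=True) returns the data already as a DataFrame under wine.frame (the 'frame' key shown below).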
print(wine.keys())
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names'])
Calling the .keys() method on the wine object lists the attributes it exposes:
'data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names'
The DESCR key holds a detailed description of the dataset.
print(wine["DESCR"])
▼ Details of the wine dataset printed via DESCR
.. _wine_dataset:
Wine recognition dataset
------------------------
**Data Set Characteristics:**
:Number of Instances: 178
:Number of Attributes: 13 numeric, predictive attributes and the class
:Attribute Information:
- Alcohol
- Malic acid
- Ash
- Alcalinity of ash
- Magnesium
- Total phenols
- Flavanoids
- Nonflavanoid phenols
- Proanthocyanins
- Color intensity
- Hue
- OD280/OD315 of diluted wines
- Proline
- class:
- class_0
- class_1
- class_2
:Summary Statistics:
============================= ==== ===== ======= =====
Min Max Mean SD
============================= ==== ===== ======= =====
Alcohol: 11.0 14.8 13.0 0.8
Malic Acid: 0.74 5.80 2.34 1.12
Ash: 1.36 3.23 2.36 0.27
Alcalinity of Ash: 10.6 30.0 19.5 3.3
Magnesium: 70.0 162.0 99.7 14.3
Total Phenols: 0.98 3.88 2.29 0.63
Flavanoids: 0.34 5.08 2.03 1.00
Nonflavanoid Phenols: 0.13 0.66 0.36 0.12
Proanthocyanins: 0.41 3.58 1.59 0.57
Colour Intensity: 1.3 13.0 5.1 2.3
Hue: 0.48 1.71 0.96 0.23
OD280/OD315 of diluted wines: 1.27 4.00 2.61 0.71
Proline: 278 1680 746 315
============================= ==== ===== ======= =====
:Missing Attribute Values: None
:Class Distribution: class_0 (59), class_1 (71), class_2 (48)
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988
This is a copy of UCI ML Wine recognition datasets.
https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data
The data is the results of a chemical analysis of wines grown in the same
region in Italy by three different cultivators. There are thirteen different
measurements taken for different constituents found in the three types of
wine.
Original Owners:
Forina, M. et al, PARVUS -
An Extendible Package for Data Exploration, Classification and Correlation.
Institute of Pharmaceutical and Food Analysis and Technologies,
Via Brigata Salerno, 16147 Genoa, Italy.
Citation:
Lichman, M. (2013). UCI Machine Learning Repository
[https://archive.ics.uci.edu/ml]. Irvine, CA: University of California,
School of Information and Computer Science.
.. topic:: References
(1) S. Aeberhard, D. Coomans and O. de Vel,
Comparison of Classifiers in High Dimensional Settings,
Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of
Mathematics and Statistics, James Cook University of North Queensland.
(Also submitted to Technometrics).
The data was used with many others for comparing various
classifiers. The classes are separable, though only RDA
has achieved 100% correct classification.
(RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data))
(All results using the leave-one-out technique)
(2) S. Aeberhard, D. Coomans and O. de Vel,
"THE CLASSIFICATION PERFORMANCE OF RDA"
Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. of
Mathematics and Statistics, James Cook University of North Queensland.
(Also submitted to Journal of Chemometrics).
Prepare the dataset for model training: the .data and .target attributes supply X and y.
X = wine.data
y = wine.target
Use train_test_split from sklearn.model_selection to split the data into a train set and a test set.
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2)
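One caveat worth noting (this variant is an addition, not in the original notebook): without random_state, the split and therefore every score printed below changes on each run. Fixing the seed and stratifying on y makes the run reproducible and preserves the class ratios of the three cultivars:
# Reproducible, class-balanced split (the random_state value is arbitrary)
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)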
Building a Decision Tree Classifier (DecisionTree)
from sklearn import tree
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
clf = tree.DecisionTreeClassifier()
clf.fit(Xtrain, ytrain)
pred = clf.predict(Xtest)
print("DecisionTree의 Accuracy는 {}".format(accuracy_score(ytest,pred,)))
print("DecisionTree의 Recall은 {}".format(recall_score(ytest,pred,average='macro')))
print("DecisionTree의 Precision은 {}".format(precision_score(ytest,pred,average='macro')))
print("DecisionTree의 F1 score은 {}".format(f1_score(ytest,pred,average='macro')))
DecisionTree의 Accuracy는 0.9166666666666666
DecisionTree의 Recall은 0.9
DecisionTree의 Precision은 0.9330065359477123
DecisionTree의 F1 score은 0.9058503836317136
If average is left at its default, these metrics raise:
ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].
→ For a multiclass (rather than binary) problem, the average parameter must be set explicitly, e.g. to 'macro'.
average : string, [None, ‘binary’ (default), ‘micro’, ‘macro’, ‘samples’, ‘weighted’]
This parameter is required for multiclass/multilabel targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:
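A quick way to see how the settings differ on the predictions above (a sketch; average=None returns the per-class scores as an array):
# 'macro' averages per-class F1 equally, 'weighted' weights each class by
# its support, and 'micro' pools all TP/FP/FN into one global score.
for avg in (None, "micro", "macro", "weighted"):
    print(avg, f1_score(ytest, pred, average=avg))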
Visualizing the Decision Tree
The fitted tree can be visualized with sklearn.tree.plot_tree.
import matplotlib.pyplot as plt
plt.figure(figsize=(18, 15))  # set plot size (denoted in inches)
tree.plot_tree(clf, filled=True, feature_names=wine.feature_names, fontsize=15)
plt.show()
https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html#sklearn.tree.plot_tree
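For a quick look without a figure, sklearn.tree.export_text prints the same fitted tree as indented text:
# Text rendering of the tree: each line is a split rule, leaves show the class
from sklearn.tree import export_text
print(export_text(clf, feature_names=wine.feature_names))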
# Tree-based estimators expose feature_importances_
for pair in zip(wine.feature_names, clf.feature_importances_):
    print(pair)
('alcohol', 0.0)
('malic_acid', 0.0)
('ash', 0.041016273015055164)
('alcalinity_of_ash', 0.0)
('magnesium', 0.0)
('total_phenols', 0.0)
('flavanoids', 0.39455285391008443)
('nonflavanoid_phenols', 0.0)
('proanthocyanins', 0.0)
('color_intensity', 0.4338830533989614)
('hue', 0.0)
('od280/od315_of_diluted_wines', 0.0)
('proline', 0.13054781967589899)
Feature Importance
Feature importance measures how much each feature contributes to the model's decisions. For a tree, it is determined by how much the splits on that feature reduce impurity. The importances always sum to 1, and each value lies between 0 and 1:
- At every internal (splitting) node, compute (fraction of all samples reaching the node) × (impurity decrease achieved by the split).
- Sum these contributions per split feature.
- Normalize the per-feature sums so they add up to 1 (see the sketch below).
Here flavanoids and color_intensity carry the largest importances.
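The computation can be verified by hand against the fitted tree's internals (a sketch; it relies on the public tree_ attribute and reuses clf from above):
import numpy as np
# Sum the weighted impurity decrease of every split, per feature, then normalize
t = clf.tree_
n = t.weighted_n_node_samples
importances = np.zeros(clf.n_features_in_)
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf node: no split, no contribution
        continue
    decrease = (n[node] * t.impurity[node]
                - n[left] * t.impurity[left]
                - n[right] * t.impurity[right]) / n[0]
    importances[t.feature[node]] += decrease
importances /= importances.sum()  # normalize so the importances sum to 1
print(np.allclose(importances, clf.feature_importances_))  # expect True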
# Importing necessary libraries
import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.tree import DecisionTreeClassifier
# Selecting two high-importance columns (flavanoids, color_intensity) for the plot
X_train_cols = Xtrain[:, [6, 9]]
# Creating and fitting the tree classifier
classifier = DecisionTreeClassifier(max_depth=4).fit(X_train_cols, ytrain)
# Plotting the tree boundaries
disp = DecisionBoundaryDisplay.from_estimator(
    classifier, X_train_cols, response_method="predict",
    xlabel=wine.feature_names[6], ylabel=wine.feature_names[9],
    alpha=0.5, cmap="ocean")
# Plotting the data points
disp.ax_.scatter(Xtrain[:, 6], Xtrain[:, 9],
                 c=ytrain, edgecolor="k", cmap="ocean")
plt.title(f"Decision surface for tree trained on {wine.feature_names[6]} and {wine.feature_names[9]}")
plt.show()
from sklearn.ensemble import RandomForestClassifier
rnd_clf = RandomForestClassifier()
rnd_clf.fit(Xtrain, ytrain)
RandomForestClassifier()
for pair in zip(rnd_clf.feature_importances_, wine.feature_names):
    print(pair)
(0.1660149900439666, 'alcohol')
(0.03137663630256123, 'malic_acid')
(0.015461780538221183, 'ash')
(0.03334800276246116, 'alcalinity_of_ash')
(0.03529333577787161, 'magnesium')
(0.05700757202604788, 'total_phenols')
(0.13496023575286983, 'flavanoids')
(0.014063067808984569, 'nonflavanoid_phenols')
(0.01861776100716496, 'proanthocyanins')
(0.15997156347248778, 'color_intensity')
(0.07816403224965565, 'hue')
(0.11961923401206669, 'od280/od315_of_diluted_wines')
(0.13610178824564087, 'proline')
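Compared with the single tree, the forest spreads importance over many features. Sorting makes this easier to read (a small helper, not in the original):
# Features ranked by the forest's importance, highest first
for imp, name in sorted(zip(rnd_clf.feature_importances_, wine.feature_names), reverse=True):
    print(f"{name}: {imp:.3f}")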
Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
rnd_clf = RandomForestClassifier(n_estimators=5, oob_score=True)
rnd_clf.fit(Xtrain[:, [6, 9]], ytrain)
/usr/local/lib/python3.10/dist-packages/sklearn/ensemble/_forest.py:583: UserWarning: Some inputs do not have OOB scores. This probably means too few trees were used to compute any reliable OOB estimates.
warn(
RandomForestClassifier(n_estimators=5, oob_score=True)
rnd_clf.oob_score_
0.8661971830985915
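Each tree in the forest is fit on a bootstrap sample, which leaves roughly a third of the training rows "out of bag" for that tree; the OOB score evaluates every sample using only the trees that never saw it. With just 5 trees, some samples can be in-bag for all of them, hence the warning above. A sketch (the larger n_estimators value is an assumption):
# With more trees, every sample is out-of-bag for at least one tree, so the
# warning disappears and the OOB estimate stabilizes.
rnd_clf_100 = RandomForestClassifier(n_estimators=100, oob_score=True)
rnd_clf_100.fit(Xtrain[:, [6, 9]], ytrain)
print(rnd_clf_100.oob_score_)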
rnd_clf.estimators_
[DecisionTreeClassifier(max_features='sqrt', random_state=455926434),
DecisionTreeClassifier(max_features='sqrt', random_state=1909490868),
DecisionTreeClassifier(max_features='sqrt', random_state=1572180743),
DecisionTreeClassifier(max_features='sqrt', random_state=2036201724),
DecisionTreeClassifier(max_features='sqrt', random_state=1962566239)]
type(rnd_clf.estimators_)
list
rnd_clf.estimators_[0]
DecisionTreeClassifier(max_features='sqrt', random_state=455926434)
sklearn.inspection.DecisionBoundaryDisplay
Internally, DecisionBoundaryDisplay evaluates the classifier on a meshgrid spanning the two feature ranges. The array below appears to be one such grid axis; the cell that defined feature_1 was not shown, so this reconstruction is an assumption:
import numpy as np
feature_1, feature_2 = np.meshgrid(
    np.linspace(wine.data[:, 6].min(), wine.data[:, 6].max()),
    np.linspace(wine.data[:, 9].min(), wine.data[:, 9].max()),
)
feature_1.ravel()
array([0.34      , 0.43673469, 0.53346939, ..., 4.88653061, 4.98326531,
       5.08      ])
import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay
X = Xtrain[:, [6, 9]]
classifier = rnd_clf
disp = DecisionBoundaryDisplay.from_estimator(
    classifier, X, response_method="predict",
    xlabel=wine.feature_names[6], ylabel=wine.feature_names[9],
    alpha=0.5, cmap="ocean"
)
scatter = disp.ax_.scatter(X[:, 0], X[:, 1], c=ytrain, edgecolor="k", cmap="ocean")
plt.legend(*scatter.legend_elements())
plt.title("RandomForestClassifier Decision Boundary")
plt.show()
pred = classifier.predict(Xtest[:, [6, 9]])
print("F1 score: {}".format(f1_score(ytest, pred, average="macro")))
F1 score: 0.8774509803921569
i = 0
for est in rnd_clf.estimators_:
    i += 1
    X = Xtrain[:, [6, 9]]
    disp = DecisionBoundaryDisplay.from_estimator(
        est, X, response_method="predict",
        xlabel=wine.feature_names[6], ylabel=wine.feature_names[9],
        alpha=0.5, cmap="ocean")
    scatter = disp.ax_.scatter(X[:, 0], X[:, 1], c=ytrain, edgecolor="k", cmap="ocean")
    plt.legend(*scatter.legend_elements())
    plt.title("Decision Tree {}".format(i))
    pred = est.predict(Xtest[:, [6, 9]])
    print("F1 score: {}".format(f1_score(ytest, pred, average="macro")))
    plt.show()
F1 score: 0.8842592592592592
F1 score: 0.7283643892339544
F1 score: 0.9058503836317136
F1 score: 0.7701418108150225
F1 score: 0.8514557338086749
The full forest's F1 of 0.877 beats the majority of its individual trees, illustrating how averaging many high-variance trees reduces variance.
AdaBoost (Adaptive Boosting)
from sklearn.ensemble import AdaBoostClassifier
ada_clf = AdaBoostClassifier(n_estimators=150, learning_rate=1.0)
ada_clf.fit(Xtrain[:, [6, 9]], ytrain)
AdaBoostClassifier(n_estimators=150)
ada_clf.estimators_
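Beyond the raw list of weak learners, the fitted booster records each stage's contribution (a quick inspection sketch):
# estimator_weights_ is each weak learner's voting weight in the final
# ensemble; estimator_errors_ is its weighted training error. High-error
# stages get low weight, which is the core of the AdaBoost update.
print(ada_clf.estimator_weights_[:5])
print(ada_clf.estimator_errors_[:5])
print(len(ada_clf.estimators_))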
X = Xtrain[:, [6, 9]]
classifier = ada_clf
disp = DecisionBoundaryDisplay.from_estimator(
    classifier, X, response_method="predict",
    xlabel=wine.feature_names[6], ylabel=wine.feature_names[9],
    alpha=0.5, cmap="ocean"
)
scatter = disp.ax_.scatter(X[:, 0], X[:, 1], c=ytrain, edgecolor="k", cmap="ocean")
plt.legend(*scatter.legend_elements())
plt.title("AdaBoostClassifier Decision Boundary")
plt.show()
pred = classifier.predict(Xtest[:, [6, 9]])
print("F1 score: {}".format(f1_score(ytest, pred, average="macro")))
F1 score: 0.7020814479638009
The same boundary can also be drawn over the full dataset to see where all 178 samples fall relative to it:
X = wine.data[:, [6, 9]]
classifier = ada_clf
disp = DecisionBoundaryDisplay.from_estimator(
    classifier, X, response_method="predict",
    xlabel=wine.feature_names[6], ylabel=wine.feature_names[9],
    alpha=0.5, cmap="ocean"
)
scatter = disp.ax_.scatter(X[:, 0], X[:, 1], c=wine.target, edgecolor="k", cmap="ocean")
plt.legend(*scatter.legend_elements())
plt.title("AdaBoostClassifier Decision Boundary")
plt.show()
Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier
gbm_clf = GradientBoostingClassifier(learning_rate=0.1, n_estimators=100, n_iter_no_change=10)
gbm_clf.fit(Xtrain[:, [6, 9]], ytrain)
GradientBoostingClassifier(n_iter_no_change=10)
len(gbm_clf.estimators_)
100
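Note that len(gbm_clf.estimators_) counts boosting stages; for this 3-class problem, each stage internally fits one regression tree per class. Because n_iter_no_change=10 enables early stopping on an internal validation split, fewer than 100 stages may survive on other runs. staged_predict shows how the test score evolves as stages are added (a sketch):
# Macro F1 on the test set after every 20th boosting stage
for i, staged in enumerate(gbm_clf.staged_predict(Xtest[:, [6, 9]]), start=1):
    if i % 20 == 0:
        print(i, f1_score(ytest, staged, average="macro"))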
classifier = gbm_clf
disp = DecisionBoundaryDisplay.from_estimator(
    classifier, X, response_method="predict",
    xlabel=wine.feature_names[6], ylabel=wine.feature_names[9],
    alpha=0.5, cmap="ocean"
)
scatter = disp.ax_.scatter(X[:, 0], X[:, 1], c=wine.target, edgecolor="k", cmap="ocean")
plt.legend(*scatter.legend_elements())
plt.title("GradientBoostingClassifier Decision Boundary")
plt.show()
pred = classifier.predict(Xtest[:, [6, 9]])
print("F1 score: {}".format(f1_score(ytest, pred, average="macro")))
F1 score: 0.8596352519502424