Standard (z-score) scaling first shifts each feature to its center (the mean) and then divides by its standard deviation. This method is suitable for most continuous features with an approximately Gaussian distribution.
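$$x'_{ij} = \mathrm{zscore}(x_{ij}) = \frac{x_{ij} - \mu_j}{\sigma_j}$$
where $x_{ij}$ is the value of feature $j$ for sample $i$, and $\mu_j$ and $\sigma_j$ are the mean and standard deviation of feature $j$.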
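As a numerical check of the z-score transformation described above, the following minimal sketch computes it by hand on a small made-up matrix and compares the result with scikit-learn's `StandardScaler`. The data values and variable names are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up toy data: 5 samples (rows) x 2 features (columns).
X_toy = np.array([[1.0, 200.0],
                  [2.0, 300.0],
                  [3.0, 400.0],
                  [4.0, 500.0],
                  [5.0, 600.0]])

# Per-feature mean and population standard deviation (ddof=0).
mu = X_toy.mean(axis=0)
sigma = X_toy.std(axis=0)

# Manual z-score: shift each column to its mean, then divide by its std.
X_manual = (X_toy - mu) / sigma

# StandardScaler applies the same computation column by column.
X_sklearn = StandardScaler().fit_transform(X_toy)

print(np.allclose(X_manual, X_sklearn))
print(X_manual.mean(axis=0), X_manual.std(axis=0))
```

After scaling, each column has mean 0 and standard deviation 1, regardless of its original units.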
min-max scaling
The min-max scaling method scales data into the range [0, 1]. This method is suitable for data concentrated within a bounded range, and it preserves zero values in sparse data when the feature minimum is zero. Min-max scaling is sensitive to outliers, so try removing outliers or clipping the data to a range before scaling.
The max-abs scaling method is similar to min-max scaling, but scales data into the range [-1, 1]. It does not shift or center the data and thus preserves the signs (positive/negative) of features. Like min-max, max-abs is sensitive to outliers.
The robust scaling method uses robust statistics (median and interquartile range, IQR) instead of the mean and standard deviation. The median and IQR are less sensitive to outliers. For features that contain many outliers or deviate strongly from a normal distribution, robust scaling is recommended.
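The three range-based and robust scalers can be written out in a few lines. The sketch below is a hedged illustration on a made-up feature containing one outlier; it computes each transformation by hand and checks it against the corresponding scikit-learn class with default settings. The data and variable names are assumptions for the example only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler, RobustScaler

# Made-up single feature with one large outlier (100.0).
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0]).reshape(-1, 1)

# Min-max: (x - min) / (max - min), result lies in [0, 1].
x_minmax = (x_toy - x_toy.min()) / (x_toy.max() - x_toy.min())

# Max-abs: x / max(|x|), no shift, result lies in [-1, 1].
x_maxabs = x_toy / np.abs(x_toy).max()

# Robust: (x - median) / IQR, with IQR = 75th percentile - 25th percentile.
q25, q50, q75 = np.percentile(x_toy, [25, 50, 75])
x_robust = (x_toy - q50) / (q75 - q25)

# Compare with the scikit-learn implementations (default parameters).
print(np.allclose(x_minmax, MinMaxScaler().fit_transform(x_toy)))
print(np.allclose(x_maxabs, MaxAbsScaler().fit_transform(x_toy)))
print(np.allclose(x_robust, RobustScaler().fit_transform(x_toy)))

# Effect of the outlier: spread of the five ordinary points after scaling.
print(np.ptp(x_minmax[:5]), np.ptp(x_robust[:5]))
```

Under min-max scaling the single outlier squeezes the five ordinary points into a narrow interval near zero, whereas the robust scaler keeps them spread out because the median and IQR barely move.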
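import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler
# random_state is assumed to be an np.random.RandomState instance created earlier in the notebook
# Generate simulated data: 1000 points with mean 10 and standard deviation 2
x = random_state.normal(10, 2, size=1000)
fig, ax = plt.subplots(1, 2, figsize=(16, 6))
# sns.histplot is used here because sns.distplot is deprecated in recent seaborn releases
sns.histplot(np.ravel(x), kde=True, stat='density', ax=ax[0])
sns.histplot(np.ravel(StandardScaler().fit_transform(x.reshape((-1, 1)))), kde=True, stat='density', ax=ax[1])
ax[0].set_title('original data distribution', fontsize=20)
ax[1].set_title('scaled data distribution by standard scaling', fontsize=20)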
# Generate simulated data for each scaler: 1000 points with mean 10 and standard deviation 2
# (sns.histplot is used because sns.distplot is deprecated in recent seaborn releases)
fig, ax = plt.subplots(2, 2, figsize=(16, 16))
x = random_state.normal(10, 2, size=1000)
sns.histplot(np.ravel(StandardScaler().fit_transform(x.reshape((-1, 1)))), kde=True, stat='density', ax=ax[0, 0])
x = random_state.normal(10, 2, size=1000)
sns.histplot(np.ravel(MinMaxScaler().fit_transform(x.reshape((-1, 1)))), kde=True, stat='density', ax=ax[0, 1])
x = random_state.normal(10, 2, size=1000)
sns.histplot(np.ravel(MaxAbsScaler().fit_transform(x.reshape((-1, 1)))), kde=True, stat='density', ax=ax[1, 0])
x = random_state.normal(10, 2, size=1000)
sns.histplot(np.ravel(RobustScaler().fit_transform(x.reshape((-1, 1)))), kde=True, stat='density', ax=ax[1, 1])
ax[0, 0].set_title('scaled data distribution by standard scaling', fontsize=20)
ax[0, 1].set_title('scaled data distribution by min-max scaling', fontsize=20)
ax[1, 0].set_title('scaled data distribution by max-abs scaling', fontsize=20)
ax[1, 1].set_title('scaled data distribution by robust scaling', fontsize=20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_state)
print('number of training samples: {}, test samples: {}'.format(X_train.shape[0], X_test.shape[0]))
The Breast Cancer dataset is available from the machine learning repository maintained by the University of California, Irvine. The dataset contains 569 samples of malignant and benign tumor cells.
The first two columns in the dataset store the unique ID numbers of the samples and the corresponding diagnosis (M=malignant, B=benign), respectively.
Columns 3-32 contain 30 real-valued features computed from digitized images of the cell nuclei, which can be used to build a model that predicts whether a tumor is benign or malignant.
Ten real-valued features are computed for each cell nucleus:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)
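The relation between the ten base measurements and the 30 feature columns can be inspected directly. The sketch below is an illustration that assumes scikit-learn's bundled copy of the dataset (`load_breast_cancer`) as a stand-in for `data/data.csv`; its column names differ from the CSV used in this walkthrough. In that copy, each of the ten measurements appears three times: as a mean, a standard error, and a "worst" value.

```python
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled copy stands in for data/data.csv.
bc = load_breast_cancer()

print(bc.data.shape)            # (samples, features)
print(list(bc.target_names))    # class names in target-code order

# Group the 30 columns by the statistic they report.
names = [str(n) for n in bc.feature_names]
mean_cols = [n for n in names if n.startswith('mean ')]
error_cols = [n for n in names if n.endswith(' error')]
worst_cols = [n for n in names if n.startswith('worst ')]

print(len(mean_cols), len(error_cols), len(worst_cols))
print(mean_cols)
```

Note that the bundled copy encodes malignant as 0 and benign as 1, which is the opposite of the `LabelEncoder` result used later in this walkthrough (B=0, M=1).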
# read the data
all_df = pd.read_csv('data/data.csv', index_col=False)
all_df.head()
all_df.columns
# The id column is redundant and not useful, so we drop it
all_df.drop('id', axis=1, inplace=True)
1.4 Quick Glance on the Data
# The info() method is useful to get a quick description of the data, in particular the total number of rows,
# and each attribute's type and number of non-null values
all_df.info()
all_df.describe()
Note that, apart from the diagnosis, all other columns are already numeric.
# check the categorical attribute's distribution
all_df['diagnosis'].value_counts()
fig, axes = plt.subplots(nrows=3, ncols=4, figsize=(15, 10))
fig.subplots_adjust(hspace=.2, wspace=.3)
axes = axes.ravel()
for i, col in enumerate(all_df.columns[21:]):
    _ = sns.boxplot(y=col, x='diagnosis', data=all_df, ax=axes[i])
Correlation Matrix
# Compute the correlation matrix on the numeric columns only
# ('diagnosis' is still a string column here; numeric_only requires pandas >= 1.5)
corrMatt = all_df.corr(numeric_only=True)
# Generate a mask for the upper triangle
mask = np.zeros_like(corrMatt, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
fig, ax = plt.subplots(figsize=(20, 12))
plt.title('Breast Cancer Feature Correlation')
# Generate a custom diverging colormap
cmap = sns.diverging_palette(260, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corrMatt, vmax=1.2, square=False, cmap=cmap, mask=mask, ax=ax, annot=True, fmt='.2g', linewidths=1);
# sns.heatmap(corrMatt, mask=mask, vmax=1.2, square=True, annot=True, fmt='.2g', ax=ax);
Observation:
The mean-value features are strongly correlated with one another, with r between 0.75 and 1.
The mean area of the tissue nucleus has a strong positive correlation with the mean values of radius and perimeter.
Some parameters are moderately positively correlated (r between 0.5 and 0.75), for example concavity and area, or concavity and perimeter.
Likewise, we see some strong negative correlation between fractal_dimension and the mean values of radius, texture and perimeter.
Mean values of cell radius, perimeter, area, compactness, concavity and concave points can be used in classification of the cancer. Larger values of these parameters tend to correlate with malignant tumors.
Mean values of texture, smoothness, symmetry or fractal dimension do not show a particular preference for one diagnosis over the other.
None of the histograms shows noticeably large outliers that would warrant further cleanup.
Here, we transform the class labels from their original string representation (M and B) into integers, that is, we binarize the labels.
If the label has more than two classes, one-hot encoding can be used instead.
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
diagnosis_encoded = encoder.fit_transform(all_df['diagnosis'])
diagnosis_encoded
print(encoder.classes_)
Malignant: 1, benign: 0
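For the multi-class case mentioned above, a minimal sketch of one-hot encoding is shown here. The three-class labels are hypothetical and not taken from the breast cancer data; the example only illustrates how integer codes and one-hot columns relate.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical three-class labels (not from the breast cancer data).
labels = np.array(['low', 'high', 'medium', 'low'])

# LabelEncoder assigns integer codes in sorted (alphabetical) class order.
label_enc = LabelEncoder()
codes = label_enc.fit_transform(labels)
print(list(label_enc.classes_), codes)

# OneHotEncoder expects 2-D input: one column per categorical variable.
# The default output is sparse, so convert it to a dense array for display.
onehot = OneHotEncoder().fit_transform(labels.reshape(-1, 1)).toarray()
print(onehot)
```

Each row of `onehot` contains a single 1 in the column of its class, so no artificial ordering between classes is implied. Many scikit-learn classifiers also accept integer-coded multi-class targets directly, so whether one-hot encoding is needed depends on the model.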
3.3 Standardizing the Features
X = all_df.drop('diagnosis', axis=1)
y = all_df['diagnosis']
from sklearn.preprocessing import StandardScaler
# Standardize the data (center around 0 and scale to unit variance).
scaler = StandardScaler()
Xs = scaler.fit_transform(X)
Predictive model using Support Vector Machine (SVM)
The support vector machine (SVM) learning algorithm will be used to build the predictive model. SVMs are one of the most popular classification algorithms, and have an elegant way of transforming nonlinear data so that one can use a linear algorithm to fit a linear model to the data (Cortes and Vapnik 1995).
Important Parameters
The important parameters in kernel SVMs are:
the regularization parameter C;
the choice of the kernel (linear, radial basis function (RBF) or polynomial);
kernel-specific parameters.
gamma (the RBF kernel coefficient) and C both control the complexity of the model, with large values of either resulting in a more complex model. Therefore, good settings for the two parameters are usually strongly correlated, and C and gamma should be adjusted together.
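The effect of C and gamma on model complexity can be seen by comparing training accuracy for a few settings. The sketch below is an illustration only: it assumes scikit-learn's bundled copy of the dataset as a stand-in for `data/data.csv`, and the three (C, gamma) pairs are arbitrary choices, not values from the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the bundled dataset stands in for the CSV used in this walkthrough.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
Xs_demo = StandardScaler().fit_transform(X_demo)

# Arbitrary settings from "very simple" to "very complex".
settings = [(0.01, 0.0001), (1.0, 0.01), (100.0, 10.0)]

train_acc = {}
for C, gamma in settings:
    model = SVC(kernel='rbf', C=C, gamma=gamma).fit(Xs_demo, y_demo)
    # Accuracy on the same data the model was fitted on.
    train_acc[(C, gamma)] = model.score(Xs_demo, y_demo)
    print('C={:<6} gamma={:<7} training accuracy={:.3f}'.format(
        C, gamma, train_acc[(C, gamma)]))
```

High training accuracy at large C and gamma reflects a model flexible enough to memorize the training set, which is why these settings should be judged on held-out data or by cross-validation rather than on training accuracy.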
Splitting the Dataset
from sklearn.preprocessing import LabelEncoder
# transform the class labels from their original string representation (M and B) into integers
le = LabelEncoder()
all_df['diagnosis'] = le.fit_transform(all_df['diagnosis'])
X = all_df.drop('diagnosis', axis=1) # drop labels for training set
y = all_df['diagnosis']
# Stratified sampling: divide the records into training and test sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=random_state, stratify=y)
# Standardize the data (center around 0 and scale to unit variance).
scaler = StandardScaler()
Xs_train = scaler.fit_transform(X_train)
# Create an SVM classifier and train it on 70% of the data set.
from sklearn.svm import SVC
clf = SVC(C=1.0, kernel='rbf', degree=3, gamma='auto', probability=True)
clf.fit(Xs_train, y_train)
Xs_test = scaler.transform(X_test)
classifier_score = clf.score(Xs_test, y_test)
Accuracy on the test set
print('The classifier accuracy score is {:03.2f}'.format(classifier_score))
Cross-Validation
# Get the average of the 10-fold cross-validation scores using an SVC estimator.
from sklearn.model_selection import cross_val_score
n_folds = 10
clf_cv = SVC(C=1.0, kernel='rbf', degree=3, gamma='auto')
cv_error = np.average(cross_val_score(clf_cv, Xs_train, y_train, cv=n_folds))
print('The {}-fold cross-validation accuracy score for this classifier is {:.2f}'.format(n_folds, cv_error))
Classification with Feature Selection & cross-validation
# The confusion matrix helps visualize the performance of the algorithm.
from sklearn.metrics import confusion_matrix, classification_report
y_pred = clf.fit(Xs_train, y_train).predict(Xs_test)
cm = confusion_matrix(y_test, y_pred)
# lengthy way to plot the confusion matrix; a shorter way using seaborn is also shown further down
fig, ax = plt.subplots(figsize=(3, 3))
ax.matshow(cm, cmap=plt.cm.Blues, alpha=0.3)
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(x=j, y=i,
                s=cm[i, j],
                va='center', ha='center')
classes = ["Benign", "Malignant"]
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=45)
plt.yticks(tick_marks, classes)
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values');
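`classification_report` is imported above but not called in the surviving code. The sketch below shows, on made-up labels, how precision, recall and accuracy follow from the four cells of a binary confusion matrix, and prints the corresponding report. The label arrays are invented for illustration and are not model output from this walkthrough.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# Made-up labels for illustration: 0 = benign, 1 = malignant.
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred_demo = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# Rows are actual classes, columns are predicted classes.
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm_demo.ravel()

precision = tp / (tp + fp)        # of predicted malignant, fraction truly malignant
recall = tp / (tp + fn)           # of truly malignant, fraction detected
accuracy = (tp + tn) / cm_demo.sum()

print(cm_demo)
print('precision={:.2f} recall={:.2f} accuracy={:.2f}'.format(precision, recall, accuracy))
print(classification_report(y_true_demo, y_pred_demo,
                            target_names=['Benign', 'Malignant']))
```

For a tumor classifier, recall on the malignant class is often the number to watch, since a false negative (a missed malignant tumor) is usually costlier than a false positive.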
The default for SVM (the SVC class) is to use the Radial Basis Function (RBF) kernel with a C value set to 1.0. We will perform a grid search using 5-fold cross validation with a standardized copy of the training dataset. We will try a number of simpler kernel types and C values with less bias and more bias (less than and more than 1.0 respectively).
Python scikit-learn provides two simple methods for algorithm parameter tuning: grid search (GridSearchCV) and random search (RandomizedSearchCV).
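A minimal sketch of both tuning approaches, applied to the kernel and C search described above, might look as follows. It assumes scikit-learn's bundled copy of the dataset as a stand-in for the training data, and the candidate values in `param_grid` are hypothetical choices rather than the grid used by the original author.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the bundled dataset stands in for the training data.
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Scaling sits inside the pipeline, so each CV fold is scaled
# using only its own training portion.
pipe = make_pipeline(StandardScaler(), SVC())

# Hypothetical search space: several kernels, C below and above 1.0.
# make_pipeline names the SVC step 'svc', hence the 'svc__' prefix.
param_grid = {
    'svc__kernel': ['linear', 'rbf', 'poly', 'sigmoid'],
    'svc__C': [0.1, 0.5, 1.0, 2.0, 10.0],
}

# Grid search: evaluates all 20 combinations with 5-fold CV.
grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
grid.fit(X_demo, y_demo)
print('grid search  :', grid.best_params_, round(grid.best_score_, 3))

# Random search: evaluates only n_iter sampled combinations.
rand = RandomizedSearchCV(pipe, param_grid, n_iter=8, cv=5,
                          scoring='accuracy', random_state=0)
rand.fit(X_demo, y_demo)
print('random search:', rand.best_params_, round(rand.best_score_, 3))
```

Grid search is exhaustive and therefore grows quickly with the number of parameters; random search trades completeness for a fixed evaluation budget.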
Based on the best classifier from the optimization process, we now try to visualize the decision boundary of the SVM. To do so, we need to reduce the multi-dimensional data to two dimensions. We apply a linear PCA transformation that projects the data onto a lower-dimensional subspace (from 30D to 2D in this case).
# Apply PCA by fitting the scaled data with only two dimensions
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
# Transform the original data using the PCA fit above
Xs_train_pca = pca.fit_transform(Xs_train)
# Take the first two PCA features. We could avoid this by using a two-dim dataset
X = Xs_train_pca
y = y_train
# http://scikit-learn.org/stable/auto_examples/svm/plot_iris.html
def make_meshgrid(x, y, h=.02):
    """Create a mesh of points to plot in

    Parameters
    ----------
    x: data to base x-axis meshgrid on
    y: data to base y-axis meshgrid on
    h: stepsize for meshgrid, optional

    Returns
    -------
    xx, yy : ndarray
    """
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    return xx, yy
# http://scikit-learn.org/stable/auto_examples/svm/plot_iris.html
def plot_contours(ax, clf, xx, yy, **params):
    """Plot the decision boundaries for a classifier.

    Parameters
    ----------
    ax: matplotlib axes object
    clf: a classifier
    xx: meshgrid ndarray
    yy: meshgrid ndarray
    params: dictionary of params to pass to contourf, optional
    """
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out
# create a mesh of values from the 1st two PCA components
X0, X1 = X[:, 0], X[:, 1]
xx, yy = make_meshgrid(X0, X1)
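The code above builds the mesh but the plotting step itself is not shown. The following is a self-contained sketch of how the decision boundary could be drawn in the 2D PCA space. It assumes scikit-learn's bundled copy of the dataset, and the SVC hyperparameters are placeholders rather than the tuned values from the grid search.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the bundled dataset stands in for data/data.csv.
X_raw, y_demo = load_breast_cancer(return_X_y=True)

# Standardize, then project from 30 dimensions onto the first two components.
X2 = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X_raw))

# Placeholder hyperparameters; substitute the best ones found by the search.
clf2 = SVC(kernel='rbf', C=1.0, gamma='scale').fit(X2, y_demo)

# Mesh over the two principal components (same idea as make_meshgrid).
h = 0.1
xx, yy = np.meshgrid(np.arange(X2[:, 0].min() - 1, X2[:, 0].max() + 1, h),
                     np.arange(X2[:, 1].min() - 1, X2[:, 1].max() + 1, h))

# Predict a class for every mesh point and reshape back to the grid.
Z = clf2.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

fig, ax = plt.subplots(figsize=(8, 6))
ax.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)
ax.scatter(X2[:, 0], X2[:, 1], c=y_demo, cmap=plt.cm.coolwarm,
           s=20, edgecolors='k')
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_title('SVC decision boundary in PCA space')
```

Because the classifier here is trained on the two PCA components only, the boundary shown is that of a 2D model and only approximates the behavior of a classifier trained on all 30 features.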
Now it is time to create some models of the data and estimate their accuracy on unseen data. Here is what we are going to cover in this step:
1. Separate out a validation dataset.
2. Set up the test harness to use 10-fold cross-validation.
3. Build several different models.
4. Select the best model.
Validation Dataset
# read the data
all_df = pd.read_csv('data/data.csv', index_col=False)
all_df.head()
all_df.columns
# The id column is redundant and not useful, so we drop it
all_df.drop('id', axis=1, inplace=True)
from sklearn.preprocessing import LabelEncoder
# transform the class labels from their original string representation (M and B) into integers
le = LabelEncoder()
all_df['diagnosis'] = le.fit_transform(all_df['diagnosis'])
X = all_df.drop('diagnosis', axis=1) # drop labels for training set
y = all_df['diagnosis']
# Divide records in training and testing sets: stratified sampling
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7, stratify=y)
# Standardize the data (center around 0 and scale to unit variance).
scaler = StandardScaler()
Xs_train = scaler.fit_transform(X_train)
6.2 Evaluate Algorithms: Baseline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
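from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
# Additional imports needed by the code below
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from tqdm import tqdm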
# Spot-Check Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
# Test options and evaluation metric
num_folds = 10
num_instances = len(X_train)
scoring = 'accuracy'
results = []
names = []
for name, model in tqdm(models):
    # shuffle=True is required when random_state is set in current scikit-learn
    kf = KFold(n_splits=num_folds, shuffle=True, random_state=random_state)
    cv_results = cross_val_score(model, X_train, y_train, cv=kf, scoring=scoring, n_jobs=-1)
    results.append(cv_results)
    names.append(name)
print('10-Fold cross-validation accuracy score for the training data for all the classifiers')
for name, cv_results in zip(names, results):
    print("%-10s: %.6f (%.6f)" % (name, cv_results.mean(), cv_results.std()))
In this section we investigate tuning the parameters for three algorithms that show promise from the spot-checking in the previous section: LR, LDA and SVM.
Tuning hyper-parameters - SVC estimator
# Make Support Vector Classifier Pipeline
pipe_svc = Pipeline([('scl', StandardScaler()),
                     ('pca', PCA(n_components=2)),
                     ('clf', SVC(probability=True, verbose=False))])
# Fit Pipeline to training data and score
scores = cross_val_score(estimator=pipe_svc, X=X_train, y=y_train, cv=10, n_jobs=-1, verbose=0)
print('SVC Model Training Accuracy: %.3f +/- %.3f' %(np.mean(scores), np.std(scores)))
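The `gs_svc` object used in the print statements that follow is not defined in the surviving code. A minimal sketch of how such a grid search over the SVC pipeline might be set up is given here. It assumes scikit-learn's bundled dataset as stand-in data, and the parameter ranges are hypothetical; the original author's grid is not known. In the walkthrough itself, `X_tr` and `y_tr` would be replaced by `X_train` and `y_train`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data; in the walkthrough use X_train / y_train instead.
X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=7, stratify=y_all)

pipe_svc = Pipeline([('scl', StandardScaler()),
                     ('pca', PCA(n_components=2)),
                     ('clf', SVC(probability=True, verbose=False))])

# Hypothetical grid. The 'clf__' prefix routes each value to the SVC step.
param_range = [0.01, 0.1, 1.0, 10.0, 100.0]
param_grid = [
    {'clf__C': param_range, 'clf__kernel': ['linear']},
    {'clf__C': param_range, 'clf__gamma': param_range, 'clf__kernel': ['rbf']},
]

gs_svc = GridSearchCV(estimator=pipe_svc, param_grid=param_grid,
                      scoring='accuracy', cv=10)
gs_svc = gs_svc.fit(X_tr, y_tr)
```

After fitting, `gs_svc.best_score_`, `gs_svc.best_params_` and `gs_svc.best_estimator_` are available, which is what the remaining code relies on.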
print('SVC Model Tuned Parameters Best Score: ', gs_svc.best_score_)
print('SVC Model Best Parameters: ', gs_svc.best_params_)
Tuning the hyper-parameters - k-NN hyperparameters
For our standard k-NN implementation, there are two primary hyperparameters that we’ll want to tune:
The number of neighbors k.
The distance metric/similarity function.
Both of these values can dramatically affect the accuracy of our k-NN classifier. The grid object is set up to do 10-fold cross-validation on a KNN model, using classification accuracy as the evaluation metric. In addition, the parameter grid repeats the 10-fold cross-validation process 30 times, each time giving the n_neighbors parameter a different value from the list.
We can't give GridSearchCV just a list of values.
We have to specify that n_neighbors should take on the values 1 through 30.
We can set n_jobs = -1 to run computations in parallel (if supported by your computer and OS)
from sklearn.neighbors import KNeighborsClassifier
pipe_knn = Pipeline([('scl', StandardScaler()),
                     ('pca', PCA(n_components=2)),
                     ('clf', KNeighborsClassifier())])
# Fit Pipeline to training data and score
scores = cross_val_score(estimator=pipe_knn,
                         X=X_train,
                         y=y_train,
                         cv=10,
                         n_jobs=-1)
print('Knn Model Training Accuracy: %.3f +/- %.3f' %(np.mean(scores), np.std(scores)))
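The `gs_knn` object printed in the next lines is likewise not defined in the surviving code. Following the description above (10-fold cross-validation, accuracy as the metric, `n_neighbors` from 1 through 30), a sketch of the grid search might look like this. The stand-in data and the variable names `X_tr` and `y_tr` are assumptions; in the walkthrough they correspond to `X_train` and `y_train`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data; in the walkthrough use X_train / y_train instead.
X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=7, stratify=y_all)

pipe_knn = Pipeline([('scl', StandardScaler()),
                     ('pca', PCA(n_components=2)),
                     ('clf', KNeighborsClassifier())])

# GridSearchCV needs a dict mapping the parameter name to its candidate
# values, not a bare list. 'clf__' addresses the k-NN step of the pipeline.
param_grid = {'clf__n_neighbors': list(range(1, 31))}

gs_knn = GridSearchCV(estimator=pipe_knn, param_grid=param_grid,
                      scoring='accuracy', cv=10)
gs_knn = gs_knn.fit(X_tr, y_tr)
```

The distance metric could be tuned in the same way by adding a second key to `param_grid`, at the cost of multiplying the number of fits.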
print('Knn Model Tuned Parameters Best Score: ', gs_knn.best_score_)
print('Knn Model Best Parameters: ', gs_knn.best_params_)
6.5 Final SVC Model
# Use best parameters
final_clf_svc = gs_svc.best_estimator_
# Get Final Scores
scores = cross_val_score(estimator=final_clf_svc,
                         X=X_train,
                         y=y_train,
                         cv=10,
                         n_jobs=-1)
print('Final Model Training Accuracy: %.3f +/- %.3f' %(np.mean(scores), np.std(scores)))
print('Final Accuracy on Test set: %.5f' % final_clf_svc.score(X_test, y_test))