Principal Component Analysis | Python
Principal component analysis (PCA) is a mathematical algorithm that reduces the dimensionality of the data while retaining most of the variation in the data set. It accomplishes this reduction by identifying directions, called principal components, along which the variation in the data is maximum.
Below is the list of steps we will follow throughout the tutorial.
- Normalize data
- Know how to select number of components
- Perform Principal component analysis (PCA)
- Compute the correlations between the original data and each principal component
- Explain the components observed
- Scatter plot all the data on PC1 vs PC2 or PC1 vs PC3
- Scatter plot all the original dimensions in the space of PC1 and PC2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
Read and View Dataset
df = pd.read_csv('placesCleanedUp.txt', sep="\s+", header=None,index_col=False)
df.columns = ["CommunityState","Climate","HousingCost","HlthCare", "Crime","Transp", "Educ","Arts","Recreat","Econ","CaseNum","lat","lon","pop","statenum"]
df[['Community','State']] = df.CommunityState.apply(
lambda x: pd.Series(str(x).split(",")))
df=df.drop(['CommunityState'],axis=1)
df
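As a quick optional check on the loaded data (a small sketch, not part of the original notebook), we can confirm the shape of the frame and that the CommunityState split worked:
# Optional check: 329 rows; the combined CommunityState field has been split out
print(df.shape)
print(df[['Community', 'State']].head())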
Step 1: Normalize the data (StandardScaler standardizes each column to mean 0 and standard deviation 1)
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

columns = ['Climate', 'HousingCost', 'HlthCare', 'Crime', 'Transp',
           'Educ', 'Arts', 'Recreat', 'Econ']  # ,'CaseNum','lat','lon','pop','statenum'
dfScaler = df[columns]

# Initialize, fit (train) and transform the data to normalize it
scaler = StandardScaler()
scaler.fit(dfScaler)
scaled_data = scaler.transform(dfScaler)

# View in a DataFrame
df_scaled = pd.DataFrame(scaled_data, columns=columns)
df_scaled = df_scaled.round(2)
df_scaled
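As a quick sanity check that the scaling did what we expect, the small sketch below (not part of the original notebook) verifies that every standardized column has mean 0 and standard deviation 1, and that the same result can be reproduced by hand:
# Sanity check: StandardScaler should give each column mean ~0 and std ~1.
# Note it uses the population standard deviation (ddof=0).
print(np.allclose(scaled_data.mean(axis=0), 0, atol=1e-9))  # True
print(np.allclose(scaled_data.std(axis=0), 1))              # True

# Equivalent manual standardization
manual = (df[columns] - df[columns].mean()) / df[columns].std(ddof=0)
print(np.allclose(manual.values, scaled_data))               # True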
How to select # of components?
Method 1: pass the fraction of variance to keep
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)
pca.fit(scaled_data)
reduced = pca.transform(scaled_data)

print('Original Dimensions: ', scaled_data.shape)  # (329, 9)
print('Reduced Dimensions: ', reduced.shape)       # (329, 7)
Original Dimensions: (329, 9)
Reduced Dimensions: (329, 7)
95% of the variance is explained by 7 components.
Method 2: Select the number of components by looking at the cumulative explained variance plot
pca = PCA().fit(scaled_data)

plt.rcParams["figure.figsize"] = (12, 6)
fig, ax = plt.subplots()
xi = np.arange(1, 10, step=1)
y = np.cumsum(pca.explained_variance_ratio_)

plt.ylim(0.0, 1.1)
plt.plot(xi, y, marker='o', linestyle='--', color='b')
plt.xlabel('Fig 1. Number of Components')
plt.xticks(np.arange(0, 10, step=1))  # change from 0-based array index to 1-based human-readable label
plt.ylabel('Cumulative variance (%)')
plt.title('The number of components needed to explain variance')
plt.axhline(y=0.95, color='r', linestyle='-')
plt.text(0.5, 0.85, '95% cut-off threshold', color='red', fontsize=16)
ax.grid(axis='x')
plt.show()
95% of the variance is explained by 7 components.
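To read the exact crossover point off the numbers rather than the plot, here is a small sketch using the full 9-component fit above:
# Cumulative explained variance per number of components
for k, v in enumerate(np.cumsum(pca.explained_variance_ratio_), start=1):
    print(f'{k} components: {v:.3f}')  # crosses 0.95 at 7 components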
Step 2: Perform PCA
(internally this uses SVD, which is equivalent to the eigenvalue decomposition of the covariance matrix)
pca=PCA(n_components=7)
pca.fit(scaled_data)
PCA(copy=True, iterated_power='auto', n_components=7, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)
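For intuition, the sketch below (an illustration, not scikit-learn's actual implementation) reproduces the same explained variances by eigendecomposing the covariance matrix of the standardized data:
# PCA "by hand": eigenvalue decomposition of the covariance matrix.
# scikit-learn uses SVD of the centered data, which is mathematically equivalent.
cov = np.cov(scaled_data, rowvar=False)        # 9 x 9 covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(cov)       # eigh: the matrix is symmetric
order = np.argsort(eig_vals)[::-1]             # largest eigenvalue first
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

print(np.allclose(eig_vals[:7], pca.explained_variance_))  # True: matches sklearn
scores = scaled_data @ eig_vecs[:, :7]         # same scores as pca.transform
                                               # (up to the sign of each component)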
Step 3: Scree plot and Variance Explained
A scree plot displays how much variation each principal component captures from the data and helps us decide how many principal components to keep. If the first two or three PCs capture most of the information, we can ignore the rest without losing anything important. The y-axis shows the eigenvalues (the explained variance), which essentially represent the amount of variation. An ideal curve is steep at first, bends at an "elbow" (the cut-off point), and then flattens out. In Figure 2 below, PC1, PC2 and PC3 seem enough to describe most of the data.
A higher explained variance captures more of the variability in the dataset, which can lead to better performance when the retained components are used to train a model.
pca.explained_variance_
array([3.41868293, 1.21767731, 1.14495927, 0.9237255 , 0.75558148,
       0.63248434, 0.49455091])

pca.explained_variance_ratio_
array([0.37869909, 0.13488624, 0.12683102, 0.1023242 , 0.08369832,
       0.07006243, 0.05478308])

import seaborn as sn

dfScree = pd.DataFrame({'var': pca.explained_variance_ratio_,
                        'PC': ['PC1','PC2','PC3','PC4','PC5','PC6','PC7']})
sn.barplot(x='PC', y='var', data=dfScree, color='c').set_title('Fig 2. Component Variance');
x_pca = pca.transform(scaled_data)

scaled_data.shape
(329, 9)

x_pca.shape
(329, 7)
Compute the correlations between the original data and each principal component
import seaborn as sn

df_pc = pd.DataFrame(data=x_pca, columns=['pc1','pc2','pc3','pc4','pc5','pc6','pc7'])
df_pc = df_pc.drop(['pc4', 'pc5', 'pc6', 'pc7'], axis=1)  # keep only the first three PCs
df_col = pd.concat([df_pc,df_scaled], axis=1)
df_col
covMatrix = pd.DataFrame.cov(df_col)
sn.set(rc={'figure.figsize': (14, 6)})
sn.heatmap(covMatrix, annot=True, fmt='g')
plt.show()
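Note that the heatmap above shows covariances; because the PC columns are not standardized, the values are not bounded by 1. As a small optional sketch, the correlation matrix can be plotted instead so every cell is directly comparable:
# Optional: correlations instead of covariances (all values in [-1, 1])
corrMatrix = df_col.corr()
sn.heatmap(corrMatrix, annot=True, fmt='.2f', vmin=-1, vmax=1, cmap='coolwarm')
plt.show()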
Explain the Components observed
PC1: The first principal component is strongly correlated with five of the original variables. It increases with increasing Arts, HlthCare, Transp, HousingCost and Recreat scores. Communities with high PC1 values tend to have a lot of arts available, in terms of theaters, orchestras, etc.
PC2: The second principal component increases with decreasing Educ and HlthCare scores. It can be viewed as a measure of how uneducated and unhealthy a location is in terms of education (available schools and universities) and health care (doctors, hospitals, etc.).
PC3: The third principal component moves with only one of the variables: it decreases with decreasing Econ. It can be viewed as a measure of how poor a location is in terms of its business environment, job market and growth.
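These interpretations can be checked against the loadings themselves. The sketch below (an optional check; the signs of the loadings may flip between runs or solvers) tabulates the weights of the first three components on the original columns:
# Each row of pca.components_ is one PC expressed as weights on the original columns
loadings_df = pd.DataFrame(pca.components_,
                           columns=columns,
                           index=['PC1','PC2','PC3','PC4','PC5','PC6','PC7'])
print(loadings_df.loc[['PC1', 'PC2', 'PC3']].round(2))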
Step 4: Scatter plot all communities along two of the PCs (PC1 vs PC2 or PC1 vs PC3)
fig = plt.figure()
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
ax1.scatter(x_pca[:,0],x_pca[:,1])
ax1.set_xlabel("PC1")
ax1.set_ylabel("PC2")
ax1.set_frame_on(False)
ax1.grid(True)
ax1.set_title('PC1 vs PC2')
ax2.scatter(x_pca[:,0],x_pca[:,2])
ax2.set_xlabel("PC1")
ax2.set_ylabel("PC3")
ax2.set_title('PC1 vs PC3')
ax2.grid(True)
ax2.set_frame_on(False)
PCA biplot = PCA score plot + loading plot
The left and bottom axes belong to the PCA score plot; they show the PCA scores of the samples (dots).
The top and right axes belong to the loading plot; it shows how strongly each characteristic (vector) influences the principal components.
import plotly.express as px

features = ['Climate', 'HousingCost', 'HlthCare', 'Crime', 'Transp',
            'Educ', 'Arts', 'Recreat', 'Econ']  # ,'CaseNum','lat','lon','pop','statenum'
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

fig = px.scatter(x_pca, x=0, y=1)
for i, feature in enumerate(features):
    fig.add_shape(
        type='line',
        x0=0, y0=0,
        x1=loadings[i, 0]*4.5,
        y1=loadings[i, 1]*4.5
    )
    fig.add_annotation(
        x=loadings[i, 0]*5.5,
        y=loadings[i, 1]*5.5,
        ax=0, ay=0,
        xanchor="center",
        yanchor="bottom",
        text=feature,
    )
fig.show()
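The 4.5 and 5.5 multipliers are only visual scaling so the loading arrows and their labels are readable on the same axes as the scores. As an optional sketch, the scale can instead be derived from the data so it adapts to any dataset:
# Optional: derive the arrow scale from the spread of the scores vs. the loadings
scale = np.abs(x_pca[:, :2]).max() / np.abs(loadings[:, :2]).max()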
Plot states on principal components
def plotPCA(c1, c2):
    # color a few selected states; all others keep the default annotation color
    colors = {'NY':'g', 'FL':'y', 'TX':'b', 'CA':'r', 'PA':'navy', 'WA':'turquoise', 'IL':'darkorange'}
    fig, ax = plt.subplots(1, 1, figsize=(14, 8))
    ax.scatter(x_pca[:, c1-1], x_pca[:, c2-1])
    for i, (Community, u) in enumerate(zip(np.array(df), x_pca)):
        if df['State'][i] in colors:
            ax.annotate(df['State'][i], (u[c1-1], u[c2-1]), color=colors[df['State'][i]])
        else:
            ax.annotate(df['State'][i], (u[c1-1], u[c2-1]))
    ax.set_xlabel(f"PC{c1}")
    ax.set_ylabel(f"PC{c2}")
    ax.grid(True)
    ax.set_frame_on(False)
    title = 'States By ' + f"PC{c1}" + ' and ' + f"PC{c2}"
    ax.set_title(title)
    fig.savefig(title + '.png', dpi=75)
    return

plotPCA(2, 1)  # PC2 on the x-axis, PC1 on the y-axis
The angles between the vectors tell us how the characteristics correlate with one another.
- When two vectors are close, forming a small angle, the two variables they represent are positively correlated. Example: TX and FL; similarly CA and WA.
- If they meet at about 90°, they are not likely to be correlated. Example: TX and PA; similarly TX and CA.
- When they diverge and form a large angle (close to 180°), they are negatively correlated. Example: TX and OR; similarly PA and FL.
Plot communities on principal components
c1, c2 = 0, 1  # 0-based column indices into x_pca
fig, ax = plt.subplots(1, 1, figsize=(18, 8))
ax.scatter(x_pca[:, c1], x_pca[:, c2])
for i, (Community, u) in enumerate(zip(np.array(df), x_pca)):
    ax.annotate(df['Community'][i], (u[c1], u[c2]))
ax.grid(True)
ax.set_xlabel(f"PC{c1+1}")
ax.set_ylabel(f"PC{c2+1}")
ax.set_frame_on(True)
fig.savefig('Communities By PC1 PC2.png', dpi=75)
Step 5: Scatter plot all original dimensions in the space of the first principal components (PC1 and PC2).
# with regression
sn.pairplot(df_col, kind="reg")
plt.show()
3D chart
from mpl_toolkits import mplot3d
A 3D scatter plot allows us to compare three characteristics of the dataset instead of two.
%matplotlib notebook
import matplotlib.pyplot as plt

ax = plt.axes(projection='3d')
xline = x_pca[:, 0]
yline = x_pca[:, 1]
zline = x_pca[:, 2]
ax.scatter3D(xline, yline, zline, c=zline, cmap='BrBG_r')
ax.set_xlabel('PC 1')
ax.set_ylabel('PC 2')
ax.set_zlabel('PC 3')
plt.show()
The transparency of the scatter points gives a sense of depth in the figure. PCA is not a clustering method, but by reducing dimensionality it can help visualize patterns, such as the group of variables that load together on PC1 (Arts, HlthCare, Transp, HousingCost and Recreat). These patterns might not be visible on a 2D PCA plot, but show up more clearly in 3D.