Principal Component Analysis | Python

Cinni Patel
6 min read · Feb 14, 2021

Principal component analysis (PCA) is a mathematical algorithm that reduces the dimensionality of the data while retaining most of the variation in the data set. It accomplishes this reduction by identifying directions, called principal components, along which the variation in the data is maximum.
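
To make the idea concrete, here is a minimal NumPy sketch (on a small random matrix standing in for real data, not the dataset used below): the principal components are the eigenvectors of the data's covariance matrix, ordered by their eigenvalues, which measure the variance captured along each direction.

import numpy as np

# Toy illustration: principal components = eigenvectors of the covariance matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # stand-in for a real dataset
X = X - X.mean(axis=0)                  # center each column

cov = np.cov(X, rowvar=False)           # 3x3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigendecomposition of a symmetric matrix

order = np.argsort(eigvals)[::-1]       # largest variance first
components = eigvecs[:, order].T        # each row is one principal component
explained_variance = eigvals[order]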

Below is the list of steps we will follow throughout the tutorial.

  • Normalize data
  • Know how to select number of components
  • Perform Principal component analysis (PCA)
  • Compute the correlations between the original data and each principal component
  • Explain the components observed
  • Scatter plot all the data on PC1 vs PC2 or PC1 vs PC3
  • Scatter plot all the original dimensions in the space of PC1 and PC2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Read and View Dataset

df = pd.read_csv('placesCleanedUp.txt', sep="\s+", header=None,index_col=False)
df.columns = ["CommunityState","Climate","HousingCost","HlthCare", "Crime","Transp", "Educ","Arts","Recreat","Econ","CaseNum","lat","lon","pop","statenum"]
df[['Community','State']] = df.CommunityState.apply(
    lambda x: pd.Series(str(x).split(",")))
df=df.drop(['CommunityState'],axis=1)
df

Step 1: Normalize the data (StandardScaler standardizes each feature to mean = 0 and standard deviation = 1)

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
columns=['Climate', 'HousingCost','HlthCare','Crime','Transp','Educ','Arts','Recreat','Econ'] #,'CaseNum','lat','lon','pop','statenum'
dfScaler = df[columns]
#Initialize, fit(train) and transform data to normalize
scaler=StandardScaler()
scaler.fit(dfScaler)
scaled_data=scaler.transform(dfScaler)
#View in Frame
df_scaled = pd.DataFrame(scaled_data, columns=columns)
df_scaled = df_scaled.round(2)
df_scaled
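
A quick sanity check (not part of the original output): after standardization every column should have a mean of roughly 0 and a standard deviation of roughly 1.

# Each standardized column should have mean ~0 and std ~1
print(np.round(scaled_data.mean(axis=0), 2))
print(np.round(scaled_data.std(axis=0), 2))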

How to select the number of components?

Method 1: Pass the desired explained-variance percentage

from sklearn.decomposition import PCA
pca = PCA(n_components = 0.95)
pca.fit(scaled_data)
reduced = pca.transform(scaled_data)
print('Original Dimensions: ',scaled_data.shape) # (329, 9)
print('Reduced Dimensions: ',reduced.shape) # (329, 7)
>>
Original Dimensions: (329, 9)
Reduced Dimensions: (329, 7)

95% of the variance is captured by 7 components.

Method 2: Select the number of components by looking at the cumulative explained-variance plot

pca = PCA().fit(scaled_data)
plt.rcParams["figure.figsize"] = (12,6)
fig, ax = plt.subplots()
xi = np.arange(1, 10, step=1)
y = np.cumsum(pca.explained_variance_ratio_)
plt.ylim(0.0,1.1)
plt.plot(xi, y, marker='o', linestyle='--', color='b')
plt.xlabel('Fig 1 Number of Components')
plt.xticks(np.arange(0, 10, step=1)) #change from 0-based array index to 1-based human-readable label
plt.ylabel('Cumulative variance (%)')
plt.title('The number of components needed to explain variance')
plt.axhline(y=0.95, color='r', linestyle='-')
plt.text(0.5, 0.85, '95% cut-off threshold', color = 'red', fontsize=16)
ax.grid(axis='x')
plt.show()

The plot confirms that 7 components explain 95% of the variance.
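
The same cut-off can also be computed programmatically instead of read off the plot; a short sketch using the cumulative explained-variance ratio of the full PCA fit above:

# Smallest number of components whose cumulative variance ratio reaches 95%
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components_95 = np.argmax(cum_var >= 0.95) + 1
print(n_components_95)  # 7 for this dataset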

Step 2: Perform PCA

(PCA can be computed via the SVD or via the eigenvalue decomposition of the covariance matrix.)

pca=PCA(n_components=7)
pca.fit(scaled_data)
>>
PCA(copy=True, iterated_power='auto', n_components=7, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)
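
For intuition, the same result can be reproduced by hand with NumPy's SVD of the standardized data; this sketch is only illustrative, and the recovered components match sklearn's up to the sign of each row.

# PCA "by hand": SVD of the (already centered) standardized data
U, S, Vt = np.linalg.svd(scaled_data, full_matrices=False)
components_manual = Vt[:7]                         # rows = principal directions
explained_variance_manual = S**2 / (len(scaled_data) - 1)
print(np.round(explained_variance_manual[:7], 4))  # compare with pca.explained_variance_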

Step 3: Scree Plot and Explained Variance

A scree plot displays how much variation each principal component captures from the data and helps us decide which principal components to keep. If the first two or three PCs capture most of the information, we can ignore the rest without losing anything important. The y axis shows the eigenvalues, which essentially represent the amount of variation captured. An ideal curve is steep at first, bends at an "elbow" (the cut-off point), and then flattens out. In Figure 2 below, PC1, PC2, and PC3 alone seem enough to describe the data.

A higher explained variance captures more of the variability in the dataset, which could potentially lead to better performance when training your model.

pca.explained_variance_
>>
array([3.41868293, 1.21767731, 1.14495927, 0.9237255 , 0.75558148,
       0.63248434, 0.49455091])

pca.explained_variance_ratio_
>>
array([0.37869909, 0.13488624, 0.12683102, 0.1023242 , 0.08369832,
       0.07006243, 0.05478308])
import seaborn as sn
dfScree = pd.DataFrame({'var':pca.explained_variance_ratio_,'PC':['PC1','PC2','PC3','PC4','PC5','PC6','PC7']})
sn.barplot(x='PC',y="var",data=dfScree, color="c").set_title('Fig 2. Component Variance');
x_pca = pca.transform(scaled_data)
scaled_data.shape
>>
(329, 9)

x_pca.shape
>>
(329, 7)

Compute the correlations between the original data and each principal component

import seaborn as sn
df_pc = pd.DataFrame(data = x_pca, columns = ['pc1','pc2','pc3','pc4','pc5','pc6','pc7'])
df_pc = df_pc.drop(['pc4','pc5','pc6','pc7'], axis=1)

df_col = pd.concat([df_pc,df_scaled], axis=1)
df_col
covMatrix = df_col.cov()
sn.set(rc={'figure.figsize':(14,6)})
sn.heatmap(covMatrix, annot=True, fmt='g')
plt.show()
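
Note that the heatmap above shows covariances; since the heading asks for correlations, here is a small sketch that computes the Pearson correlations between the scaled variables and the three retained PCs, reusing df_col, columns and sn from the code above.

# Pearson correlations between each original (scaled) variable and PC1-PC3
df_corr = df_col.corr().loc[columns, ['pc1', 'pc2', 'pc3']]
sn.heatmap(df_corr, annot=True, fmt='.2f')
plt.show()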

Explain the Components observed

PC1: The first principal component is strongly correlated with five of the original variables. It increases with increasing Arts, Health, Transportation, Housing and Recreation scores. Communities with high values on this component tend to have a lot of arts available, in terms of theaters, orchestras, etc.

PC2: The second principal component increases with decreasing Education and Health. It can be viewed as a measure of how uneducated and unhealthy a location is in terms of education (available schools and universities) and health care (doctors, hospitals, etc.).

PC3: The third principal component moves with only one of the variables, decreasing as the Economy score decreases. It can be viewed as a measure of how a location fares in terms of business environment, job market and growth.
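
These interpretations can be checked directly against the component loadings; a short sketch (reusing pca and columns from above) that tabulates each variable's weight on the first three components:

# Loadings: weight of each original variable on PC1-PC3
df_loadings = pd.DataFrame(pca.components_[:3].T,
                           index=columns,
                           columns=['PC1', 'PC2', 'PC3'])
print(df_loadings.round(2))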

Step 4: Scatter plot all communities along two of the PCs (PC1 vs PC2 and PC1 vs PC3)

fig = plt.figure()
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)

ax1.scatter(x_pca[:,0],x_pca[:,1])
ax1.set_xlabel("PC1")
ax1.set_ylabel("PC2")
ax1.set_frame_on(False)
ax1.grid(True)
ax1.set_title('PC1 vs PC2')

ax2.scatter(x_pca[:,0],x_pca[:,2])
ax2.set_xlabel("PC1")
ax2.set_ylabel("PC3")
ax2.set_title('PC1 vs PC3')
ax2.grid(True)
ax2.set_frame_on(False)

PCA biplot = PCA score plot + loading plot

The left and bottom axes belong to the PCA score plot; they show the PCA scores of the samples (dots).

The top and right axes belong to the loading plot; they show how strongly each characteristic (vector) influences the principal components.

import plotly.express as px
features = ['Climate', 'HousingCost','HlthCare','Crime','Transp','Educ','Arts','Recreat','Econ'] #,'CaseNum','lat','lon','pop','statenum'
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
fig = px.scatter(x_pca, x=0, y=1)
for i, feature in enumerate(features):
    fig.add_shape(
        type='line',
        x0=0, y0=0,
        x1=loadings[i, 0]*4.5,
        y1=loadings[i, 1]*4.5
    )
    fig.add_annotation(
        x=loadings[i, 0]*5.5,
        y=loadings[i, 1]*5.5,
        ax=0, ay=0,
        xanchor="center",
        yanchor="bottom",
        text=feature,
    )
fig.show()

Plot STATES on principal components

def plotPCA(c1, c2):

    colors = {'NY': 'g', 'FL': 'y', 'TX': 'b', 'CA': 'r', 'PA': 'navy', 'WA': 'turquoise', 'IL': 'darkorange'}
    fig, ax = plt.subplots(1, 1, figsize=(14, 8))
    ax.scatter(x_pca[:, c1-1], x_pca[:, c2-1])
    for i, (Community, u) in enumerate(zip(np.array(df), x_pca)):
        if df['State'][i] in colors:
            ax.annotate(df['State'][i], (u[c1-1], u[c2-1]), color=colors[df['State'][i]])
        else:
            ax.annotate(df['State'][i], (u[c1-1], u[c2-1]))
    ax.set_xlabel(f"PC{c1}")
    ax.set_ylabel(f"PC{c2}")
    ax.grid(True)
    ax.set_frame_on(False)
    title = 'States By ' + f"PC{c1}" + ' and ' + f"PC{c2}"
    ax.set_title(title)
    fig.savefig(title + '.png', dpi=75)

    return

plotPCA(2, 1)

The angles between the vectors tell us how characteristics correlate with one another (a numeric check follows after this list).

  • When two vectors are close, forming a small angle, the two variables they represent are positively correlated. Example: TX and FL; similarly CA and WA.
  • If they meet at roughly 90°, they are not likely to be correlated. Example: TX and PA; similarly TX and CA.
  • When they diverge and form a large angle (close to 180°), they are negatively correlated. Example: TX and OR; similarly PA and FL.
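
The angle/correlation relationship can be checked numerically for the feature vectors drawn in the biplot. This is only a rough sketch (it reuses features, loadings and df_scaled from earlier): with most of the variance retained, the cosine of the angle between two loading vectors approximates the correlation between the corresponding variables.

# Cosine of the angle between two feature loading vectors vs. their correlation
def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

i, j = features.index('HlthCare'), features.index('Arts')
print(round(cosine(loadings[i], loadings[j]), 2))               # angle-based estimate
print(round(df_scaled['HlthCare'].corr(df_scaled['Arts']), 2))  # actual correlation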

Plot Communities on principal components

c1, c2 = 0, 1
fig, ax = plt.subplots(1, 1, figsize=(18, 8))
ax.scatter(x_pca[:,c1],x_pca[:,c2])
for i, (Community, u) in enumerate(zip(np.array(df), x_pca)):
    ax.annotate(df['Community'][i], (u[c1], u[c2]))
ax.grid(True)
ax.set_xlabel(f"PC{c1+1}")
ax.set_ylabel(f"PC{c2+1}")
ax.set_frame_on(True)
fig.savefig('Communities By PC1 PC2.png', dpi=75)

Step 5: Scatter plot all original dimensions in the space of PC1 and PC2.

# with regression
sn.pairplot(df_col, kind="reg")
plt.show()

3D chart

from mpl_toolkits import mplot3d

3D scatter plots allow us to compare three characteristics of the dataset instead of two.

%matplotlib notebook
import matplotlib.pyplot as plt
ax = plt.axes(projection='3d')
xline=x_pca[:,0]
yline=x_pca[:,1]
zline=x_pca[:,2]
ax.scatter3D(xline, yline, zline,c=zline,cmap='BrBG_r')
ax.set_xlabel('PC 1')
ax.set_ylabel('PC 2')
ax.set_zlabel('PC 3')
plt.show()

The transparency of the scatter points gives a sense of depth in the figure. Although PCA is not a clustering method, by reducing dimensionality it can help visualize patterns, such as groups of related dimensions like PC1 for Arts, Health, Transportation, Housing and Recreation. These patterns might not be visible on a 2D PCA plot but show up more clearly in 3D.
