import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Principal Component Analysis and Loadings¶
Principal Component Analysis (PCA) is one of the most popular data analysis and dimensionality reduction techniques to this day. As someone coming from an ML background, I have mostly seen it used for dimensionality reduction, and for some reason I never paid much attention to the “analysis” part of the name. Then, while reading a paper, I saw the authors use PCA to justify their groupings and analyze the data, and they did so with something called loadings. After a little research, it turns out loadings are just a fancy name for something simple, but I wanted to write this post as my first “official” blog post on the topic.
I will derive PCA from the maximum variance point of view, then explain what loadings are and how they describe the relationships between features. Finally, I will apply PCA to air quality measurements from Mersin, Turkey, and we will see whether the loadings capture which stations behave similarly.
Let’s start.
First, let’s derive PCA ourselves, from a maximum variance point of view.
We are interested in the dataset $X \in \mathbb{R}^{n\times d}$, where each row represents a datapoint and each column represents a feature.
n = 30
d = 5
X = np.random.rand(n, d)
Before going further, let us normalize our data to have a mean of 0.
$$ \begin{aligned} X_c = X - \bar{X} \end{aligned} $$where $\bar{X}$ represents the column-wise (per-feature) means.
X_c = (X - X.mean(axis=0))
This is our data represented in the standard basis, i.e. $\begin{pmatrix}1&0&0&0&0\end{pmatrix}$, $\begin{pmatrix}0&1&0&0&0\end{pmatrix}$, $\dots$
We are interested in other basis vectors, such that when projected using those vectors, variation is maximized. Let’s check the current variance first:
$$ \begin{aligned} S &= \frac{1}{N} X_c^TX_c \end{aligned} $$
S = (X_c.T @ X_c) / n
S
array([[ 0.07165141,  0.01730077, -0.00888918, -0.00484992, -0.03748958],
       [ 0.01730077,  0.07553007, -0.02827698,  0.00382616, -0.00305851],
       [-0.00888918, -0.02827698,  0.0690027 ,  0.0068945 , -0.0025763 ],
       [-0.00484992,  0.00382616,  0.0068945 ,  0.0601852 ,  0.00653085],
       [-0.03748958, -0.00305851, -0.0025763 ,  0.00653085,  0.08807457]])
This is our current covariance matrix. When we project our data to a new vector, our new covariance matrix will be:
$$ \begin{aligned} \frac{1}{N}(X_c u_1)^T(X_c u_1) &= \frac{1}{N} u_1^T X_c^T X_c u_1 = u_1^TSu_1 \end{aligned} $$

We want to maximize this variance. Of course, it grows without bound as $\lVert u_1 \rVert \to \infty$, but we are only interested in the direction we project onto, so the magnitude of the vector is irrelevant. To make our life easier, we can fix its length to 1, i.e. $u_1^Tu_1 = 1$, which turns our problem into a constrained optimization problem.
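As a quick numerical illustration (using an arbitrary unit direction picked just for this check), the variance of the projected data indeed equals $u^TSu$:
# An arbitrary unit-length direction, only for illustration
u = np.ones(d) / np.sqrt(d)
projected = X_c @ u                       # one scalar per data point
np.allclose(projected.var(), u @ S @ u)   # projected variance equals u^T S u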
We can introduce a Lagrange multiplier $\lambda_1$ and optimize the objective:
$$ \begin{aligned} & u_1^TSu_1 + \lambda_1(1 - u_1^Tu_1) & \\ & 2 S u_1 - 2\lambda_1 u_1 = 0 & \text{(take the derivative wrt. $u_1$ and set it to 0)}\\ & Su_1 = \lambda_1 u_1 \end{aligned} $$

From the final equation, we can see that $u_1$ needs to be an eigenvector of the covariance matrix $S$, and the variance it attains is $u_1^TSu_1 = \lambda_1$. Therefore, maximizing the variance corresponds to picking the eigenvector with the largest eigenvalue. That vector $u_1$ is the first principal component (depending on your field you might see it called a principal axis, but in the ML community it is usually the principal component). Repeating the argument, each time requiring the new direction to be orthogonal to the previous ones, gives the remaining principal components.
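We can check this conclusion numerically: the top eigenvector of $S$ should give a projected variance at least as large as any random unit direction. A small sketch (using `np.linalg.eigh` and a random generator, neither of which is part of the derivation itself):
# Eigendecomposition of the covariance matrix (eigh returns eigenvalues in ascending order)
eigvals, eigvecs = np.linalg.eigh(S)
u1 = eigvecs[:, -1]                      # eigenvector with the largest eigenvalue
best = u1 @ S @ u1                       # variance along the top eigenvector

# No randomly chosen unit direction should beat the top eigenvector
rng = np.random.default_rng(0)
for _ in range(5):
    r = rng.normal(size=d)
    r /= np.linalg.norm(r)
    assert r @ S @ r <= best + 1e-12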
But how do we actually compute the PCA? The answer is the Singular Value Decomposition (SVD), which is a closely related concept. Every matrix has an SVD, so our centered data matrix does too: $X_c = U\Sigma V^T$. Using this relationship, we can find the eigenvectors and eigenvalues of $S$:
$$ \begin{aligned} X_c^T X_c &= (U\Sigma V^T)^T(U\Sigma V^T) = V \Sigma^T U^T U \Sigma V^T = V \Sigma^T\Sigma V^T & \text{($U$ is a unitary matrix, s.t. $U^TU = I$)}\\ \implies S &= \frac{1}{N} V \Sigma^T\Sigma V^T \end{aligned} $$

Here $\Sigma^T\Sigma$ is a diagonal matrix holding the squared singular values $\sigma_i^2$. The factor $\frac{1}{N}$ only rescales the eigenvalues; it does not change the eigenvectors (if $v_1$ is an eigenvector, so is $5v_1$), so the columns of $V$ are exactly the eigenvectors of $S$, with eigenvalues $\sigma_i^2 / N$. From here, we see two things:
- The right singular vectors of the centered data matrix are the eigenvectors of the covariance matrix.
- The squared singular values of the centered data matrix, divided by $N$, are the eigenvalues of the covariance matrix.
Then, we can simply calculate the PCA by using SVD:
u, s, vt = np.linalg.svd(X_c)
explained_variance = s**2
principal_comps = vt
explained_variance
array([3.78865876, 2.92541508, 1.88357554, 1.21211713, 1.12355213])
Recovering the exact eigenvalues:
explained_variance / n
array([0.12628863, 0.09751384, 0.06278585, 0.0404039 , 0.03745174])
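As a quick sanity check, the rows of `vt` should satisfy the eigenvector equation $Sv = \lambda v$ with exactly these eigenvalues:
# Check S v = lambda v for every principal component / eigenvalue pair
for lam, v in zip(explained_variance / n, principal_comps):
    assert np.allclose(S @ v, lam * v)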
Each row is an eigenvector:
principal_comps
array([[-0.6077345 , -0.38737752,  0.27080286,  0.11247385,  0.62818221],
       [ 0.12534338, -0.6098027 ,  0.59748364, -0.05631033, -0.50226701],
       [-0.08517668, -0.26555398, -0.3000831 , -0.91106   ,  0.04632308],
       [-0.29740359, -0.46054634, -0.66702274,  0.36398383, -0.34934883],
       [ 0.72058577, -0.44210572, -0.18630596,  0.14718774,  0.47846059]])
Now, we have the corresponding eigenvalues and eigenvectors, i.e. variances and principal components. We can project our data to each of these vectors:
(X_c @ principal_comps.T)
array([[ 1.44345229e-01,  1.23210030e-01, -2.32913581e-01, -1.34922878e-01,  3.82387259e-01],
       [ 2.01270098e-01, -6.30671274e-02,  9.20910764e-02,  3.89603746e-02,  2.26175645e-02],
       [-6.71514565e-01,  4.24139781e-02,  2.56046078e-01, -6.26196766e-02, -2.71142633e-01],
       [-6.48606252e-01,  1.26157225e-01, -3.28288631e-01,  1.87717016e-01,  1.67646008e-01],
       [-5.37772752e-01,  1.22889678e-01,  5.23503062e-01,  1.16742975e-01,  2.17190535e-01],
       [ 5.01003168e-01, -6.93549991e-02,  3.72218915e-01,  1.29689293e-01, -6.42936333e-03],
       [-2.52104378e-01, -4.48483730e-01, -2.54459317e-03, -1.82790865e-01,  6.86092463e-02],
       [ 4.16717452e-03, -3.86821163e-01, -2.92641326e-01, -2.46712038e-02, -2.15309593e-01],
       [ 2.60372260e-01,  4.08116233e-02, -2.47717350e-01, -4.04772824e-01,  8.64188579e-02],
       [ 1.85345727e-01, -4.93955548e-03,  3.80749879e-02,  2.99092051e-01,  4.62490002e-02],
       [-1.77288433e-01,  5.49053231e-01, -4.04181029e-02,  4.22681085e-01, -1.30534198e-01],
       [ 3.78642627e-01, -6.70545110e-02,  8.33749415e-02, -2.20884883e-01, -1.70968508e-01],
       [-2.39720574e-01, -2.64882202e-01, -3.75357360e-01, -3.22916257e-02, -4.14765097e-02],
       [ 3.52868333e-01,  5.47959179e-01, -1.68551497e-01,  3.02250494e-02, -3.59056191e-01],
       [ 4.01120289e-01, -9.74536950e-02,  5.41869301e-01,  1.72739676e-01,  3.34669668e-02],
       [ 1.90333594e-02,  1.07620876e-01,  3.98149597e-02,  5.49144390e-02, -9.63115349e-02],
       [ 8.16690087e-02,  5.64584642e-01, -3.36891735e-02, -1.38776985e-01, -2.84805388e-01],
       [-7.16766311e-01,  1.53091293e-01,  2.22334839e-01, -3.18940354e-01, -2.26900805e-01],
       [ 5.33840361e-01,  3.37675389e-02, -7.05777843e-04,  2.30184731e-01,  6.53656897e-02],
       [-1.31071865e-01, -1.92748159e-01,  1.33001359e-01, -1.51408167e-02,  3.37036892e-01],
       [-2.58252162e-02, -6.08901641e-01,  4.10510084e-01, -1.95143211e-01, -7.78338399e-02],
       [ 7.19396153e-03,  4.41689903e-01,  1.02252335e-01, -3.19045108e-01,  1.77972962e-01],
       [-4.95505334e-02,  4.34163043e-01,  3.50184154e-02, -4.49963366e-02,  3.24417845e-01],
       [-9.19071784e-02, -2.82897037e-01, -4.37203169e-01,  6.74872321e-02,  1.83016962e-01],
       [-2.44979893e-01,  9.34807905e-02,  3.55730775e-02, -6.07611214e-02,  1.23933467e-01],
       [ 3.49704093e-01, -4.75232275e-01, -6.24772156e-02, -1.12222932e-01, -2.81856192e-01],
       [ 2.75189454e-01, -4.37583494e-01,  1.77583296e-03,  2.60555779e-01,  8.67047938e-02],
       [ 6.27335818e-01,  3.25872898e-01, -1.82674408e-01, -1.84914736e-01,  9.68839653e-02],
       [-2.34216310e-01, -4.33034543e-02, -1.43890915e-01,  3.56145647e-01, -1.92906008e-01],
       [-3.01776703e-01, -2.64042883e-01, -3.38386166e-01,  8.57602081e-02, -6.43872485e-02]])
And for dimensionality reduction, we can simply keep the top two components:
X_c @ principal_comps[:2, :].T
array([[ 0.14434523,  0.12321003],
       [ 0.2012701 , -0.06306713],
       [-0.67151456,  0.04241398],
       [-0.64860625,  0.12615722],
       [-0.53777275,  0.12288968],
       [ 0.50100317, -0.069355  ],
       [-0.25210438, -0.44848373],
       [ 0.00416717, -0.38682116],
       [ 0.26037226,  0.04081162],
       [ 0.18534573, -0.00493956],
       [-0.17728843,  0.54905323],
       [ 0.37864263, -0.06705451],
       [-0.23972057, -0.2648822 ],
       [ 0.35286833,  0.54795918],
       [ 0.40112029, -0.0974537 ],
       [ 0.01903336,  0.10762088],
       [ 0.08166901,  0.56458464],
       [-0.71676631,  0.15309129],
       [ 0.53384036,  0.03376754],
       [-0.13107186, -0.19274816],
       [-0.02582522, -0.60890164],
       [ 0.00719396,  0.4416899 ],
       [-0.04955053,  0.43416304],
       [-0.09190718, -0.28289704],
       [-0.24497989,  0.09348079],
       [ 0.34970409, -0.47523228],
       [ 0.27518945, -0.43758349],
       [ 0.62733582,  0.3258729 ],
       [-0.23421631, -0.04330345],
       [-0.3017767 , -0.26404288]])
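For reference, scikit-learn’s `PCA` (assuming it is installed; it is not used anywhere else in this post) should produce the same two-dimensional projection, up to a sign flip per component. A minimal comparison sketch:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
Z = pca.fit_transform(X)    # scikit-learn centers the data internally
# Each column may differ from our result only by a sign flip; note that
# pca.explained_variance_ divides by (n - 1) instead of n.
np.allclose(np.abs(Z), np.abs(X_c @ principal_comps[:2, :].T))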
Loadings¶
In some texts, it is possible to see something called “loadings”, which are used to understand the relationship between the original features and the PCA transformation. It sounds fancy, but loadings are just the components of the eigenvectors: each entry is called the contribution of the corresponding original feature to that eigenvector.
Here is the idea behind it. Let’s think about what projecting actually does. Consider the first principal component:
principal_comps[0, :]
array([-0.6077345 , -0.38737752, 0.27080286, 0.11247385, 0.62818221])
And consider our first data point:
X_c[0]
array([ 0.26322816, -0.17611613, 0.20135411, 0.22866828, 0.24809388])
To project $x_1$ to $u_1$, we take their dot product: $$ \begin{aligned} \langle u_1, x_1 \rangle = u_{11}x_{11} + u_{12}x_{12} + u_{13}x_{13} + u_{14}x_{14} + u_{15}x_{15} \end{aligned} $$
Then, we see that the contribution of the first feature is weighted by $u_{11}$, because it is multiplied by that coefficient; the contribution of the second feature is weighted by $u_{12}$, and so on. So the signs and magnitudes of these coefficients tell us how the features relate to each other: if two coefficients have the same sign, they push the data point towards the same side of the principal component, and if they have opposite signs they work against each other. The relative magnitudes tell us how strongly each feature influences that component, so features with coefficients of similar sign and size behave similarly along it.
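To make this concrete, here is a tiny sketch that multiplies the first data point elementwise with the first principal component; each term is one feature’s contribution, and their sum is exactly the first projected coordinate shown earlier:
# Per-feature contributions to the projection of the first data point onto u_1
contributions = principal_comps[0, :] * X_c[0]
contributions.sum()    # equals (X_c @ principal_comps.T)[0, 0]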
Case Study: Air Pollution in Mersin, Turkey¶
Let’s see if we can actually use PCA to detect which air quality stations are similar. We have the daily mean measurements of 7 different stations in the same city, covering roughly a year.
def typecast_float(value):
    # The spreadsheet uses a decimal comma (e.g. "12,5"); convert such strings to floats
    try:
        return float(value.replace(',', '.'))
    except (AttributeError, ValueError):
        # Leave anything that is not a comma-formatted number (e.g. "-") untouched
        return value
df = pd.read_excel("mersin_stations.xlsx", converters={'Mersin - Akdeniz': typecast_float, 'Mersin - Huzurkent': typecast_float,
'Mersin - İstiklal Cad.': typecast_float, 'Mersin - Tarsus': typecast_float, 'Mersin - Tasucu': typecast_float,
'Mersin - Toroslar': typecast_float, 'Mersin - Yenişehir': typecast_float})
df = df.drop(index=0)
cols = list(df.columns)
cols.remove("Tarih")
df = df.set_index("Tarih")
# Replace the "-" placeholders with NaN and cast every station column to float
for col in cols:
    df.loc[df[col] == "-", col] = np.nan
    df[col] = df[col].astype(np.float32)
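We can quickly see how many daily values are missing per station (an optional check; the count row of the describe output below carries the same information):
# Count the missing measurements per station after the "-" placeholders became NaN
df[cols].isna().sum()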
df.describe()
|  | Mersin - Akdeniz | Mersin - Huzurkent | Mersin - İstiklal Cad. | Mersin - Tarsus | Mersin - Tasucu | Mersin - Toroslar | Mersin - Yenişehir |
| --- | --- | --- | --- | --- | --- | --- | --- |
| count | 321.000000 | 310.000000 | 334.000000 | 332.000000 | 302.000000 | 316.000000 | 335.000000 |
| mean | 81.926414 | 50.694416 | 74.172096 | 60.954185 | 30.675200 | 40.514462 | 54.070477 |
| std | 36.366062 | 14.367387 | 38.317322 | 22.139832 | 11.958918 | 17.441872 | 19.098902 |
| min | 26.870001 | 18.760000 | 28.080000 | 15.350000 | 6.190000 | 7.310000 | 11.250000 |
| 25% | 54.419998 | 41.462502 | 49.062500 | 45.632500 | 22.375000 | 29.082500 | 41.264999 |
| 50% | 71.620003 | 49.610001 | 61.994999 | 59.064999 | 29.529999 | 37.400002 | 51.169998 |
| 75% | 101.750000 | 59.330002 | 82.845001 | 72.497498 | 38.439999 | 49.227501 | 66.555000 |
| max | 217.050003 | 100.919998 | 287.739990 | 165.800003 | 79.589996 | 106.820000 | 115.500000 |
From this description, we have a rough idea of the stations. But let’s use PCA to actually analyze the data:
df.mean()
Mersin - Akdeniz          81.926414
Mersin - Huzurkent        50.694416
Mersin - İstiklal Cad.    74.172096
Mersin - Tarsus           60.954185
Mersin - Tasucu           30.675200
Mersin - Toroslar         40.514462
Mersin - Yenişehir        54.070477
dtype: float32
# 1. Normalize the data, then fill in the missing values by the mean (0 now)
df_norm = df - df.mean()
df_norm = df_norm.fillna(0)
# 2. Use SVD to get PCA
u, s, vt = np.linalg.svd(df_norm)
explained_variance = s**2
principal_comps = vt
# 3. Print the results:
print(f"Explained Variance: \n{list(explained_variance)}")
print(f"Explained Variance Ratio: \n{list(explained_variance / explained_variance.sum())}")
print(f"Principal Components: \n{principal_comps}")
Explained Variance: 
[970372.7, 161876.78, 99340.76, 73822.26, 44803.496, 25845.836, 22792.111]
Explained Variance Ratio: 
[0.6936912, 0.115720995, 0.07101581, 0.052773383, 0.032028716, 0.018476436, 0.016293418]
Principal Components: 
[[ 0.6225126   0.16223446  0.66419894  0.24866222 -0.00622524  0.11959113  0.26234362]
 [-0.02710083  0.2685964  -0.45266798  0.6661215   0.345759    0.05868038  0.3943383 ]
 [ 0.57136565 -0.05261079 -0.38114944 -0.1719964  -0.09751546 -0.68772197  0.11595257]
 [-0.3109046  -0.15181339  0.36948395  0.50480044  0.04580177 -0.6335116  -0.29242852]
 [ 0.37802657 -0.37931138 -0.2600934   0.41760823 -0.24467331  0.32418633 -0.55336404]
 [-0.09289647  0.69212395 -0.06681351  0.11723612 -0.6863987  -0.04926331 -0.14337459]
 [ 0.19258615  0.5028002  -0.0050598  -0.14799356  0.5812028  -0.02079634 -0.59156203]]
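To see how quickly the variance accumulates across components, we can also look at the cumulative ratio (a small optional check):
# Cumulative share of the total variance explained by the first k components
np.cumsum(explained_variance) / explained_variance.sum()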
Okay, here we see that the stations behave quite similarly overall, as about 69% of the variance can be explained by a single principal component.
Returning to our loadings and actually understanding the relationships between features: our first principal component is

$$ \begin{aligned} \begin{pmatrix} 0.62 & 0.16 & 0.66 & 0.25 & -0.01 & 0.12 & 0.26 \end{pmatrix} \end{aligned} $$

From here, we observe the following groupings (visualized as a bar chart after the list):
- Stations 1, 3 ($\sim 0.6$ to $0.7$)
- Stations 4, 7 ($\sim 0.2$)
- Stations 2, 6 ($\sim 0.1$)
- Station 5 ($\sim 0$)
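Since matplotlib is already imported, we can also draw these loadings as a simple bar chart (a small visualization sketch; the station order follows the dataframe columns):
# Bar chart of the first principal component's loadings, one bar per station
plt.figure(figsize=(8, 3))
plt.bar(cols, principal_comps[0, :])
plt.xticks(rotation=45, ha="right")
plt.ylabel("Loading on PC1")
plt.tight_layout()
plt.show()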
We can also see that station 5 is different from the others: its loading is essentially zero and even slightly negative, so it barely moves with this shared component. Let’s refresh our memories by checking the description:
df.describe()
|  | Mersin - Akdeniz | Mersin - Huzurkent | Mersin - İstiklal Cad. | Mersin - Tarsus | Mersin - Tasucu | Mersin - Toroslar | Mersin - Yenişehir |
| --- | --- | --- | --- | --- | --- | --- | --- |
| count | 321.000000 | 310.000000 | 334.000000 | 332.000000 | 302.000000 | 316.000000 | 335.000000 |
| mean | 81.926414 | 50.694416 | 74.172096 | 60.954185 | 30.675200 | 40.514462 | 54.070477 |
| std | 36.366062 | 14.367387 | 38.317322 | 22.139832 | 11.958918 | 17.441872 | 19.098902 |
| min | 26.870001 | 18.760000 | 28.080000 | 15.350000 | 6.190000 | 7.310000 | 11.250000 |
| 25% | 54.419998 | 41.462502 | 49.062500 | 45.632500 | 22.375000 | 29.082500 | 41.264999 |
| 50% | 71.620003 | 49.610001 | 61.994999 | 59.064999 | 29.529999 | 37.400002 | 51.169998 |
| 75% | 101.750000 | 59.330002 | 82.845001 | 72.497498 | 38.439999 | 49.227501 | 66.555000 |
| max | 217.050003 | 100.919998 | 287.739990 | 165.800003 | 79.589996 | 106.820000 | 115.500000 |
We see that station 5 is the cleanest of the bunch and usually behaves differently from the others. Stations 1 and 3 are very similar, as we expected, and so on.
We could have judged these similarities by eye, but now we have a formal point to stand on. It also scales: with hundreds of stations, eyeballing would no longer be an option.
References¶
[1] Bishop, Christopher M. Pattern Recognition and Machine Learning. New York: Springer, 2006.
[2] “Hava Kalitesi – İstasyon Veri İndirme.” T.C. Çevre, Şehircilik ve İklim Değişikliği Bakanlığı, accessed 16 Apr. 2023.