import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Principal Component Analysis and Loadings¶
Principal Component Analysis (PCA) is one of the most popular data analysis and dimensionality reduction techniques to this day. As someone coming from an ML background, I have mostly seen it used for dimensionality reduction, and for some reason I never paid much attention to the “analysis” part of the name. Then, while reading a paper, I saw the authors use PCA to justify their groupings and analyze the data, and they did so with something called loadings. After a little research, it turns out loadings are just a fancy name for something simple, but I wanted to write this post as my first “official” blog post on the topic.
I will derive PCA from the maximum variance point of view, then explain what loadings are and how they describe the relationships between features. Finally, I will apply PCA to air quality measurements from Mersin, Turkey, and we will see whether the loadings capture which stations behave similarly.
Let’s start.
First, let’s derive PCA ourselves, from a maximum variance point of view.
We are interested in the dataset $X \in \mathbb{R}^{n\times d}$, where each row represents a datapoint and each column represents a feature.
n = 30
d = 5
X = np.random.rand(n, d)
Before going further, let us normalize our data to have a mean of 0.
$$ \begin{aligned} X_c = X - \bar{X} \end{aligned} $$where $\bar{X}$ represents the column-wise (per-feature) means.
X_c = (X - X.mean(axis=0))
This is our data represented in the standard basis, i.e. $\begin{pmatrix}1&0&0&0&0\end{pmatrix}$, $\begin{pmatrix}0&1&0&0&0\end{pmatrix}$, $\dots$
We are interested in other basis vectors, such that when projected using those vectors, variation is maximized. Let’s check the current variance first:
$$ \begin{aligned} S &= \frac{1}{N} X_c^TX_c \end{aligned} $$
S = (X_c.T @ X_c) / n
S
array([[ 0.07165141,  0.01730077, -0.00888918, -0.00484992, -0.03748958],
       [ 0.01730077,  0.07553007, -0.02827698,  0.00382616, -0.00305851],
       [-0.00888918, -0.02827698,  0.0690027 ,  0.0068945 , -0.0025763 ],
       [-0.00484992,  0.00382616,  0.0068945 ,  0.0601852 ,  0.00653085],
       [-0.03748958, -0.00305851, -0.0025763 ,  0.00653085,  0.08807457]])
This is our current covariance matrix. When we project our data to a new vector, our new covariance matrix will be:
$$ \begin{aligned} \frac{1}{N}(X_c u_1)^T(X_c u_1) &= \frac{1}{N} u_1^T X_c^T X_c u_1 = u_1^TSu_1 \end{aligned} $$

We want to maximize this variance. Of course, it grows without bound as $\lVert u_1 \rVert \to \infty$, but we are only interested in the direction we project onto, so the magnitude of the vector is irrelevant. To make our life easier, we can fix its length to 1, i.e. $u_1^Tu_1 = 1$, which turns our problem into a constrained optimization problem.
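As a quick numerical illustration (using an arbitrary unit direction picked just for this check), the variance of the projected data indeed equals $u^TSu$:
# An arbitrary unit-length direction, only for illustration
u = np.ones(d) / np.sqrt(d)
projected = X_c @ u                       # one scalar per data point
np.allclose(projected.var(), u @ S @ u)   # projected variance equals u^T S u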
We can introduce a Lagrange multiplier $\lambda_1$ and optimize the objective:
$$ \begin{aligned} & u_1^TSu_1 + \lambda_1(1 - u_1^Tu_1) & \\ & 2 S u_1 - 2\lambda_1 u_1 = 0 & \text{(take the derivative wrt. $u_1$ and set it to 0)}\\ & Su_1 = \lambda_1 u_1 \end{aligned} $$

From the final equation, we can see that $u_1$ needs to be an eigenvector of the covariance matrix $S$, and the variance it attains is $u_1^TSu_1 = \lambda_1$. Therefore, maximizing the variance corresponds to picking the eigenvector with the largest eigenvalue. That vector $u_1$ is the first principal component (depending on your field you might see it called a principal axis, but in the ML community it is usually the principal component). Repeating the argument, each time requiring the new direction to be orthogonal to the previous ones, gives the remaining principal components.
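We can check this conclusion numerically: the top eigenvector of $S$ should give a projected variance at least as large as any random unit direction. A small sketch (using `np.linalg.eigh` and a random generator, neither of which is part of the derivation itself):
# Eigendecomposition of the covariance matrix (eigh returns eigenvalues in ascending order)
eigvals, eigvecs = np.linalg.eigh(S)
u1 = eigvecs[:, -1]                      # eigenvector with the largest eigenvalue
best = u1 @ S @ u1                       # variance along the top eigenvector

# No randomly chosen unit direction should beat the top eigenvector
rng = np.random.default_rng(0)
for _ in range(5):
    r = rng.normal(size=d)
    r /= np.linalg.norm(r)
    assert r @ S @ r <= best + 1e-12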
But how do we actually compute the PCA? The answer is the Singular Value Decomposition (SVD), which is a closely related concept. Every matrix has an SVD, so our centered data matrix does too: $X_c = U\Sigma V^T$. Using this relationship, we can find the eigenvectors and eigenvalues of $S$:
$$ \begin{aligned} X_c^T X_c &= (U\Sigma V^T)^T(U\Sigma V^T) = V \Sigma^T U^T U \Sigma V^T = V \Sigma^T\Sigma V^T & \text{($U$ is a unitary matrix, s.t. $U^TU = I$)}\\ \implies S &= \frac{1}{N} V \Sigma^T\Sigma V^T \end{aligned} $$

Here $\Sigma^T\Sigma$ is a diagonal matrix holding the squared singular values $\sigma_i^2$. The factor $\frac{1}{N}$ only rescales the eigenvalues; it does not change the eigenvectors (if $v_1$ is an eigenvector, so is $5v_1$), so the columns of $V$ are exactly the eigenvectors of $S$, with eigenvalues $\sigma_i^2 / N$. From here, we see two things:
- The right singular vectors of the centered data matrix are the eigenvectors of the covariance matrix.
- The squared singular values of the centered data matrix, divided by $N$, are the eigenvalues of the covariance matrix.
Then, we can simply calculate the PCA by using SVD:
u, s, vt = np.linalg.svd(X_c)
explained_variance = s**2
principal_comps = vt
explained_variance
array([3.78865876, 2.92541508, 1.88357554, 1.21211713, 1.12355213])
Recovering the exact eigenvalues:
explained_variance / n
array([0.12628863, 0.09751384, 0.06278585, 0.0404039 , 0.03745174])
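As a quick sanity check, the rows of `vt` should satisfy the eigenvector equation $Sv = \lambda v$ with exactly these eigenvalues:
# Check S v = lambda v for every principal component / eigenvalue pair
for lam, v in zip(explained_variance / n, principal_comps):
    assert np.allclose(S @ v, lam * v)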
Each row is an eigenvector:
principal_comps
array([[-0.6077345 , -0.38737752,  0.27080286,  0.11247385,  0.62818221],
       [ 0.12534338, -0.6098027 ,  0.59748364, -0.05631033, -0.50226701],
       [-0.08517668, -0.26555398, -0.3000831 , -0.91106   ,  0.04632308],
       [-0.29740359, -0.46054634, -0.66702274,  0.36398383, -0.34934883],
       [ 0.72058577, -0.44210572, -0.18630596,  0.14718774,  0.47846059]])
Now, we have the corresponding eigenvalues and eigenvectors, i.e. variances and principal components. We can project our data to each of these vectors:
(X_c @ principal_comps.T)
array([[ 1.44345229e-01,  1.23210030e-01, -2.32913581e-01, -1.34922878e-01,  3.82387259e-01],
       [ 2.01270098e-01, -6.30671274e-02,  9.20910764e-02,  3.89603746e-02,  2.26175645e-02],
       [-6.71514565e-01,  4.24139781e-02,  2.56046078e-01, -6.26196766e-02, -2.71142633e-01],
       [-6.48606252e-01,  1.26157225e-01, -3.28288631e-01,  1.87717016e-01,  1.67646008e-01],
       [-5.37772752e-01,  1.22889678e-01,  5.23503062e-01,  1.16742975e-01,  2.17190535e-01],
       [ 5.01003168e-01, -6.93549991e-02,  3.72218915e-01,  1.29689293e-01, -6.42936333e-03],
       [-2.52104378e-01, -4.48483730e-01, -2.54459317e-03, -1.82790865e-01,  6.86092463e-02],
       [ 4.16717452e-03, -3.86821163e-01, -2.92641326e-01, -2.46712038e-02, -2.15309593e-01],
       [ 2.60372260e-01,  4.08116233e-02, -2.47717350e-01, -4.04772824e-01,  8.64188579e-02],
       [ 1.85345727e-01, -4.93955548e-03,  3.80749879e-02,  2.99092051e-01,  4.62490002e-02],
       [-1.77288433e-01,  5.49053231e-01, -4.04181029e-02,  4.22681085e-01, -1.30534198e-01],
       [ 3.78642627e-01, -6.70545110e-02,  8.33749415e-02, -2.20884883e-01, -1.70968508e-01],
       [-2.39720574e-01, -2.64882202e-01, -3.75357360e-01, -3.22916257e-02, -4.14765097e-02],
       [ 3.52868333e-01,  5.47959179e-01, -1.68551497e-01,  3.02250494e-02, -3.59056191e-01],
       [ 4.01120289e-01, -9.74536950e-02,  5.41869301e-01,  1.72739676e-01,  3.34669668e-02],
       [ 1.90333594e-02,  1.07620876e-01,  3.98149597e-02,  5.49144390e-02, -9.63115349e-02],
       [ 8.16690087e-02,  5.64584642e-01, -3.36891735e-02, -1.38776985e-01, -2.84805388e-01],
       [-7.16766311e-01,  1.53091293e-01,  2.22334839e-01, -3.18940354e-01, -2.26900805e-01],
       [ 5.33840361e-01,  3.37675389e-02, -7.05777843e-04,  2.30184731e-01,  6.53656897e-02],
       [-1.31071865e-01, -1.92748159e-01,  1.33001359e-01, -1.51408167e-02,  3.37036892e-01],
       [-2.58252162e-02, -6.08901641e-01,  4.10510084e-01, -1.95143211e-01, -7.78338399e-02],
       [ 7.19396153e-03,  4.41689903e-01,  1.02252335e-01, -3.19045108e-01,  1.77972962e-01],
       [-4.95505334e-02,  4.34163043e-01,  3.50184154e-02, -4.49963366e-02,  3.24417845e-01],
       [-9.19071784e-02, -2.82897037e-01, -4.37203169e-01,  6.74872321e-02,  1.83016962e-01],
       [-2.44979893e-01,  9.34807905e-02,  3.55730775e-02, -6.07611214e-02,  1.23933467e-01],
       [ 3.49704093e-01, -4.75232275e-01, -6.24772156e-02, -1.12222932e-01, -2.81856192e-01],
       [ 2.75189454e-01, -4.37583494e-01,  1.77583296e-03,  2.60555779e-01,  8.67047938e-02],
       [ 6.27335818e-01,  3.25872898e-01, -1.82674408e-01, -1.84914736e-01,  9.68839653e-02],
       [-2.34216310e-01, -4.33034543e-02, -1.43890915e-01,  3.56145647e-01, -1.92906008e-01],
       [-3.01776703e-01, -2.64042883e-01, -3.38386166e-01,  8.57602081e-02, -6.43872485e-02]])
And for dimensionality reduction, we can simply keep the top two components:
X_c @ principal_comps[:2, :].T
array([[ 0.14434523,  0.12321003],
       [ 0.2012701 , -0.06306713],
       [-0.67151456,  0.04241398],
       [-0.64860625,  0.12615722],
       [-0.53777275,  0.12288968],
       [ 0.50100317, -0.069355  ],
       [-0.25210438, -0.44848373],
       [ 0.00416717, -0.38682116],
       [ 0.26037226,  0.04081162],
       [ 0.18534573, -0.00493956],
       [-0.17728843,  0.54905323],
       [ 0.37864263, -0.06705451],
       [-0.23972057, -0.2648822 ],
       [ 0.35286833,  0.54795918],
       [ 0.40112029, -0.0974537 ],
       [ 0.01903336,  0.10762088],
       [ 0.08166901,  0.56458464],
       [-0.71676631,  0.15309129],
       [ 0.53384036,  0.03376754],
       [-0.13107186, -0.19274816],
       [-0.02582522, -0.60890164],
       [ 0.00719396,  0.4416899 ],
       [-0.04955053,  0.43416304],
       [-0.09190718, -0.28289704],
       [-0.24497989,  0.09348079],
       [ 0.34970409, -0.47523228],
       [ 0.27518945, -0.43758349],
       [ 0.62733582,  0.3258729 ],
       [-0.23421631, -0.04330345],
       [-0.3017767 , -0.26404288]])
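For reference, scikit-learn’s `PCA` (assuming it is installed; it is not used anywhere else in this post) should produce the same two-dimensional projection, up to a sign flip per component. A minimal comparison sketch:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
Z = pca.fit_transform(X)    # scikit-learn centers the data internally
# Each column may differ from our result only by a sign flip; note that
# pca.explained_variance_ divides by (n - 1) instead of n.
np.allclose(np.abs(Z), np.abs(X_c @ principal_comps[:2, :].T))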
Loadings¶
In some texts, it is possible to see something called “loadings”, which are used to understand the relationship between the original features and the PCA transformation. It sounds fancy, but loadings are just the components of the eigenvectors: each entry is called the contribution of the corresponding original feature to that eigenvector.
Here is the idea behind it. Let’s think about what projecting actually does. Consider the first principal component:
principal_comps[0, :]
array([-0.6077345 , -0.38737752, 0.27080286, 0.11247385, 0.62818221])
And consider our first data point:
X_c[0]
array([ 0.26322816, -0.17611613, 0.20135411, 0.22866828, 0.24809388])
To project $x_1$ to $u_1$, we take their dot product: $$ \begin{aligned} \langle u_1, x_1 \rangle = u_{11}x_{11} + u_{12}x_{12} + u_{13}x_{13} + u_{14}x_{14} + u_{15}x_{15} \end{aligned} $$
Then, we see that the contribution of the first feature is weighted by $u_{11}$, because it is multiplied by that coefficient; the contribution of the second feature is weighted by $u_{12}$, and so on. So the signs and magnitudes of these coefficients tell us how the features relate to each other: if two coefficients have the same sign, they push the data point towards the same side of the principal component, and if they have opposite signs they work against each other. The relative magnitudes tell us how strongly each feature influences that component, so features with coefficients of similar sign and size behave similarly along it.
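To make this concrete, here is a tiny sketch that multiplies the first data point elementwise with the first principal component; each term is one feature’s contribution, and their sum is exactly the first projected coordinate shown earlier:
# Per-feature contributions to the projection of the first data point onto u_1
contributions = principal_comps[0, :] * X_c[0]
contributions.sum()    # equals (X_c @ principal_comps.T)[0, 0]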
Case Study: Air Pollution in Mersin, Turkey¶
Let’s see if we can actually use PCA to detect which air quality stations are similar. We have the daily mean measurements of 7 different stations in the same city, covering roughly a year.
def typecast_float(value):
    # The spreadsheet uses a decimal comma (e.g. "12,5"); convert such strings to floats
    try:
        return float(value.replace(',', '.'))
    except (AttributeError, ValueError):
        # Leave anything that is not a comma-formatted number (e.g. "-") untouched
        return value
df = pd.read_excel("mersin_stations.xlsx", converters={'Mersin - Akdeniz': typecast_float, 'Mersin - Huzurkent': typecast_float,
'Mersin - İstiklal Cad.': typecast_float, 'Mersin - Tarsus': typecast_float, 'Mersin - Tasucu': typecast_float,
'Mersin - Toroslar': typecast_float, 'Mersin - Yenişehir': typecast_float})
df = df.drop(index=0)
cols = list(df.columns)
cols.remove("Tarih")
df = df.set_index("Tarih")
# Replace the "-" placeholders with NaN and cast every station column to float
for col in cols:
    df.loc[df[col] == "-", col] = np.nan
    df[col] = df[col].astype(np.float32)
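We can quickly see how many daily values are missing per station (an optional check; the count row of the describe output below carries the same information):
# Count the missing measurements per station after the "-" placeholders became NaN
df[cols].isna().sum()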
df.describe()
|  | Mersin - Akdeniz | Mersin - Huzurkent | Mersin - İstiklal Cad. | Mersin - Tarsus | Mersin - Tasucu | Mersin - Toroslar | Mersin - Yenişehir |
| --- | --- | --- | --- | --- | --- | --- | --- |
| count | 321.000000 | 310.000000 | 334.000000 | 332.000000 | 302.000000 | 316.000000 | 335.000000 |
| mean | 81.926414 | 50.694416 | 74.172096 | 60.954185 | 30.675200 | 40.514462 | 54.070477 |
| std | 36.366062 | 14.367387 | 38.317322 | 22.139832 | 11.958918 | 17.441872 | 19.098902 |
| min | 26.870001 | 18.760000 | 28.080000 | 15.350000 | 6.190000 | 7.310000 | 11.250000 |
| 25% | 54.419998 | 41.462502 | 49.062500 | 45.632500 | 22.375000 | 29.082500 | 41.264999 |
| 50% | 71.620003 | 49.610001 | 61.994999 | 59.064999 | 29.529999 | 37.400002 | 51.169998 |
| 75% | 101.750000 | 59.330002 | 82.845001 | 72.497498 | 38.439999 | 49.227501 | 66.555000 |
| max | 217.050003 | 100.919998 | 287.739990 | 165.800003 | 79.589996 | 106.820000 | 115.500000 |
From this description, we have a rough idea of the stations. But let’s use PCA to actually analyze the data:
df.mean()
Mersin - Akdeniz          81.926414
Mersin - Huzurkent        50.694416
Mersin - İstiklal Cad.    74.172096
Mersin - Tarsus           60.954185
Mersin - Tasucu           30.675200
Mersin - Toroslar         40.514462
Mersin - Yenişehir        54.070477
dtype: float32
# 1. Normalize the data, then fill in the missing values by the mean (0 now)
df_norm = df - df.mean()
df_norm = df_norm.fillna(0)
# 2. Use SVD to get PCA
u, s, vt = np.linalg.svd(df_norm)
explained_variance = s**2
principal_comps = vt
# 3. Print the results:
print(f"Explained Variance: \n{list(explained_variance)}")
print(f"Explained Variance Ratio: \n{list(explained_variance / explained_variance.sum())}")
print(f"Principal Components: \n{principal_comps}")
Explained Variance: 
[970372.7, 161876.78, 99340.76, 73822.26, 44803.496, 25845.836, 22792.111]
Explained Variance Ratio: 
[0.6936912, 0.115720995, 0.07101581, 0.052773383, 0.032028716, 0.018476436, 0.016293418]
Principal Components: 
[[ 0.6225126   0.16223446  0.66419894  0.24866222 -0.00622524  0.11959113  0.26234362]
 [-0.02710083  0.2685964  -0.45266798  0.6661215   0.345759    0.05868038  0.3943383 ]
 [ 0.57136565 -0.05261079 -0.38114944 -0.1719964  -0.09751546 -0.68772197  0.11595257]
 [-0.3109046  -0.15181339  0.36948395  0.50480044  0.04580177 -0.6335116  -0.29242852]
 [ 0.37802657 -0.37931138 -0.2600934   0.41760823 -0.24467331  0.32418633 -0.55336404]
 [-0.09289647  0.69212395 -0.06681351  0.11723612 -0.6863987  -0.04926331 -0.14337459]
 [ 0.19258615  0.5028002  -0.0050598  -0.14799356  0.5812028  -0.02079634 -0.59156203]]
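To see how quickly the variance accumulates across components, we can also look at the cumulative ratio (a small optional check):
# Cumulative share of the total variance explained by the first k components
np.cumsum(explained_variance) / explained_variance.sum()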
Okay, here we see that the stations behave quite similarly overall, as about 69% of the variance can be explained by a single principal component.
Returning to our loadings and actually understanding the relationships between features: our first principal component is

$$ \begin{aligned} \begin{pmatrix} 0.62 & 0.16 & 0.66 & 0.25 & -0.01 & 0.12 & 0.26 \end{pmatrix} \end{aligned} $$

From here, we observe the following groupings (visualized as a bar chart after the list):
- Stations 1, 3 ($\sim 0.6$ to $0.7$)
- Stations 4, 7 ($\sim 0.2$)
- Stations 2, 6 ($\sim 0.1$)
- Station 5 ($\sim 0$)
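Since matplotlib is already imported, we can also draw these loadings as a simple bar chart (a small visualization sketch; the station order follows the dataframe columns):
# Bar chart of the first principal component's loadings, one bar per station
plt.figure(figsize=(8, 3))
plt.bar(cols, principal_comps[0, :])
plt.xticks(rotation=45, ha="right")
plt.ylabel("Loading on PC1")
plt.tight_layout()
plt.show()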
We can also see that station 5 is different from the others: its loading is essentially zero and even slightly negative, so it barely moves with this shared component. Let’s refresh our memories by checking the description:
df.describe()
|  | Mersin - Akdeniz | Mersin - Huzurkent | Mersin - İstiklal Cad. | Mersin - Tarsus | Mersin - Tasucu | Mersin - Toroslar | Mersin - Yenişehir |
| --- | --- | --- | --- | --- | --- | --- | --- |
| count | 321.000000 | 310.000000 | 334.000000 | 332.000000 | 302.000000 | 316.000000 | 335.000000 |
| mean | 81.926414 | 50.694416 | 74.172096 | 60.954185 | 30.675200 | 40.514462 | 54.070477 |
| std | 36.366062 | 14.367387 | 38.317322 | 22.139832 | 11.958918 | 17.441872 | 19.098902 |
| min | 26.870001 | 18.760000 | 28.080000 | 15.350000 | 6.190000 | 7.310000 | 11.250000 |
| 25% | 54.419998 | 41.462502 | 49.062500 | 45.632500 | 22.375000 | 29.082500 | 41.264999 |
| 50% | 71.620003 | 49.610001 | 61.994999 | 59.064999 | 29.529999 | 37.400002 | 51.169998 |
| 75% | 101.750000 | 59.330002 | 82.845001 | 72.497498 | 38.439999 | 49.227501 | 66.555000 |
| max | 217.050003 | 100.919998 | 287.739990 | 165.800003 | 79.589996 | 106.820000 | 115.500000 |
We see that station 5 is the cleanest of the bunch and usually behaves differently from the others. Stations 1 and 3 are very similar, as we expected, and so on.
We could have judged these similarities by eye, but now we have a formal point to stand on. It also scales: with hundreds of stations, eyeballing would no longer be an option.
References¶
[1] Bishop, Christopher M. Pattern Recognition and Machine Learning. New York: Springer, 2006.
[2] “Hava Kalitesi – İstasyon Veri İndirme.” T.C. Çevre, Şehircilik ve İklim Değişikliği Bakanlığı, accessed 16 Apr. 2023.