{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"_uuid": "989b76b2cb6cb90eef00ceb72b900b48b068d6d1",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"# 02-Statistical Link Between Variables"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"_kg_hide-input": true,
"_kg_hide-output": false,
"_uuid": "f45ba6a511f519d35d488b71d3d8ab3189c0b175",
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Dependencies\n",
"\n",
"# Standard Dependencies\n",
"import os\n",
"import numpy as np\n",
"import pandas as pd\n",
"from math import sqrt\n",
"\n",
"# Visualization\n",
"from pylab import *\n",
"import matplotlib.mlab as mlab\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"# Statistics\n",
"from statistics import median\n",
"from scipy import signal\n",
"from math import factorial\n",
"import scipy.stats as stats\n",
"from scipy.stats import sem, binom, lognorm, poisson, bernoulli, spearmanr\n",
"from scipy.fftpack import fft, fftshift\n",
"\n",
"# Scikit-learn for Machine Learning models\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Seed for reproducibility\n",
"seed = 12345\n",
"np.random.seed(seed)\n",
"\n",
"\n",
"# Read in csv of Toy Dataset\n",
"# We will use this dataset throughout the tutorial\n",
"df = pd.read_csv('../data/toy_dataset.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "7e44e6d3b18b6da66ae052463a2e8a1afc9bf008",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Table of contents"
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "564fcdf42264765f9e9cb0021393ff225da36356",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"- [Covariance](#Covariance)\n",
"- [Correlation](#Correlation)\n",
"- [Linear regression](#Linear-regression)\n",
"- [Bias, MSE and SE](#Bias,-MSE-and-SE)"
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "d6e2d0bd983e68e86af13af1ad4e5f3e973c0193",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Covariance "
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "1d182cf7e089fccdc808b0c0938bdf1938a7cfb7",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Covariance measures how much two random variables vary together. Variance is the special case of a single variable: it is the covariance of a variable with itself, which is why the diagonal of a covariance matrix holds the variances.\n",
"\n",
"If two variables are independent, their covariance is 0. However, a covariance of 0 does not imply that the variables are independent, because covariance only captures linear association."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"_kg_hide-input": true,
"_uuid": "7a795570fae61547decb79d331479d3098701aaa",
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Covariance between Age and Income: \n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>Age</th>\n",
"      <th>Income</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>Age</th>\n",
"      <td>133.922426</td>\n",
"      <td>-3.811863e+02</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>Income</th>\n",
"      <td>-381.186341</td>\n",
"      <td>6.244752e+08</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"               Age        Income\n",
"Age     133.922426 -3.811863e+02\n",
"Income -381.186341  6.244752e+08"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Covariance between Age and Income\n",
"print('Covariance between Age and Income: ')\n",
"\n",
"df[['Age', 'Income']].cov()"
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "45baa36045e88dbab22e850e4df6b4420fff747f",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Correlation "
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "118f9e24a69065a48efb5c78ac03c6c475a1dad9",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Correlation is a standardized version of covariance: it is rescaled to lie between -1 and 1, so its size no longer depends on the units of the variables. On this scale it becomes clear that Age and Income are essentially uncorrelated in our dataset."
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "634cdc7963624e398ab1841ccbbf00be08c12948",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Pearson's correlation coefficient is the covariance between the two random variables divided by the product of their standard deviations.\n",
"\n",
"Formula for Pearson's correlation coefficient:\n",
"\n",
"$$\\rho_{X,Y} = \\frac{\\operatorname{cov}(X, Y)}{\\sigma_X \\sigma_Y}$$"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"_kg_hide-input": true,
"_uuid": "b14ff3e57239585b05b34a01f731dc52eec8ea94",
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Pearson: \n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>Age</th>\n",
"      <th>Income</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>Age</th>\n",
"      <td>1.000000</td>\n",
"      <td>-0.001318</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>Income</th>\n",
"      <td>-0.001318</td>\n",
"      <td>1.000000</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"             Age    Income\n",
"Age     1.000000 -0.001318\n",
"Income -0.001318  1.000000"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Correlation between two normal distributions\n",
"# Using Pearson's correlation\n",
"print('Pearson: ')\n",
"df[['Age', 'Income']].corr(method='pearson')"
]
},
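{
"cell_type": "markdown",
"metadata": {
"_uuid": "81ebd4c6e6be66529a181dcf5e3946ab051d04b6",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Another correlation coefficient is Spearman's rho. It is Pearson's correlation computed on the ranks of the values, so it measures monotonic rather than strictly linear association. For roughly linear relationships the two methods give similar results. In this example we see almost no difference, partly because the Age and Income columns in our dataset are essentially uncorrelated.\n",
"\n",
"Formula for Spearman's rho (valid when there are no tied ranks), where $d_i$ is the difference between the two ranks of observation $i$ and $n$ is the number of observations:\n",
"\n",
"$$r_s = 1 - \\frac{6 \\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$"
]
},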
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"_kg_hide-input": true,
"_uuid": "9f94042e40978a8bd3df01a828d02dd25ca0db70",
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Spearman: \n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>Age</th>\n",
"      <th>Income</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>Age</th>\n",
"      <td>1.000000</td>\n",
"      <td>-0.001452</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>Income</th>\n",
"      <td>-0.001452</td>\n",
"      <td>1.000000</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"             Age    Income\n",
"Age     1.000000 -0.001452\n",
"Income -0.001452  1.000000"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Using Spearman's rho correlation\n",
"print('Spearman: ')\n",
"df[['Age', 'Income']].corr(method='spearman')"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"_uuid": "c4b58d1312d3e4a28244dc4d5640426359859161",
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Generate quadratic data with Gaussian noise\n",
"x = np.random.uniform(low=20, high=260, size=100)\n",
"y = 50000 + 2000*x - 4.5 * x**2 + np.random.normal(size=100, loc=0, scale=10000)\n",
"\n",
"# Plot data with Linear Regression\n",
"plt.figure(figsize=(16,5))\n",
"plt.title('Well fitted but not well fitting: Linear regression plot on quadratic data', fontsize='xx-large')\n",
"# Pass x and y as keyword arguments; positional use is deprecated in seaborn\n",
"sns.regplot(x=x, y=y)"
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "79a0d6a9333121c08351ceaa76bf661b661f124a",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Linear regression "
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "742e13d2528cff6398f9016b80d01483d2bf7c2c",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Linear regression can be fitted with Ordinary Least Squares (OLS) or with Maximum Likelihood Estimation (MLE). OLS chooses the coefficients that minimize the sum of squared residuals; when the errors are assumed to be normally distributed, MLE yields the same coefficient estimates.\n",
"\n",
"Most Python libraries use OLS to fit linear models."
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "5b360ab8eefb8268db7430cfc45364fd700d1aa0",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Bias, MSE and SE "
]
},
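{
"cell_type": "markdown",
"metadata": {
"_uuid": "de3f98a8a8c46e22678e26c0c5847c9e4503479c",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"**Bias** measures how far the **expected value** of an estimator lies from the true value it estimates. When estimating a mean, the estimator is the sample mean $\\bar{x}$ and the target is the population mean $\\mu$.\n",
"\n",
"Formula for Bias:\n",
"\n",
"$$\\text{Bias}(\\bar{x}) = E[\\bar{x}] - \\mu$$\n",
"\n",
"The expected value cannot be observed directly, so the code in this section approximates it with the mean of a single sample, $\\bar{x} = \\frac{1}{n}\\sum_{i=1}^{n} x_i$. The difference $\\bar{x} - \\mu$ that it prints is therefore the error of that one sample. The sample mean itself is an unbiased estimator, so this error averages out to zero over repeated samples."
]
},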
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"_uuid": "7de089fa73a9b42aeab7a1d78e4b2856a6d9527f",
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Generate Normal Distribution\n",
"normal_dist = np.random.randn(10000)\n",
"normal_df = pd.DataFrame({'value' : normal_dist})\n",
"# Take a random sample of 100 observations\n",
"normal_df_sample = normal_df.sample(100)\n",
"\n",
"# Sample mean (used here as a stand-in for the expected value), population mean and their difference\n",
"# Select the column by label; positional indexing of the resulting Series is deprecated in pandas\n",
"ev = normal_df_sample['value'].mean()\n",
"pop_mean = normal_df['value'].mean()\n",
"bias = ev - pop_mean"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"_kg_hide-input": true,
"_uuid": "268ea18ed84402103384034862ed14bbe43c6e19",
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Sample mean (Expected Value): -0.11906267796745086\n",
"Population mean: -0.01073582444747704\n",
"Bias: -0.10832685351997381\n"
]
}
],
"source": [
"print('Sample mean (Expected Value): ', ev)\n",
"print('Population mean: ', pop_mean)\n",
"print('Bias: ', bias)"
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "dcb382e5b8587d277a88269f9a1e4760ffbc21d8",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"**MSE (Mean Squared Error)** measures how far predictions deviate from the actual values: it is the average of the squared errors. It is commonly used, for example, to evaluate regression models.\n",
"\n",
"$$\\text{MSE} = \\frac{1}{n} \\sum_{i=1}^{n} (Y_i - \\hat{Y}_i)^2$$\n",
"\n",
"**RMSE (Root Mean Squared Error)** is the square root of the MSE, which puts the error back in the units of the target variable.\n",
"\n",
"$$\\text{RMSE} = \\sqrt{\\text{MSE}}$$"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"_uuid": "839b15a284e146ae9f582c28ab6e30af07eecd86",
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MSE: 36.0\n",
"RMSE: 6.0\n"
]
}
],
"source": [
"from math import sqrt\n",
"\n",
"Y = 100 # Actual Value\n",
"YH = 94 # Predicted Value\n",
"\n",
"# MSE: mean of the squared differences between predicted and actual values\n",
"def MSE(Y, YH):\n",
"    return np.square(YH - Y).mean()\n",
"\n",
"# RMSE: square root of the MSE\n",
"def RMSE(Y, YH):\n",
"    return sqrt(MSE(Y, YH))\n",
"\n",
"\n",
"print('MSE: ', MSE(Y, YH))\n",
"print('RMSE: ', RMSE(Y, YH))"
]
},
{
"cell_type": "markdown",
"metadata": {
"_uuid": "f51ce7f24700f38a575a0236e9f40964d16a01f5",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"The **Standard Error (SE)** of the mean measures how much the sample mean is expected to vary from one sample to another; it is the standard deviation of the sampling distribution of the mean.\n",
"\n",
"It is estimated as the sample standard deviation $s$ divided by the square root of the sample size $n$:\n",
"\n",
"$$SE = \\frac{s}{\\sqrt{n}}$$"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"# Generate two independent standard normal populations\n",
"normal_dist = np.random.randn(10000)\n",
"normal_df = pd.DataFrame({'value' : normal_dist})\n",
"# Create a Pandas Series for easy sample function\n",
"normal_dist = pd.Series(normal_dist)\n",
"\n",
"normal_dist2 = np.random.randn(10000)\n",
"normal_df2 = pd.DataFrame({'value' : normal_dist2})\n",
"# Create a Pandas Series for easy sample function\n",
"normal_dist2 = pd.Series(normal_dist2)\n",
"\n",
"normal_df_total = pd.DataFrame({'value1' : normal_dist, \n",
"                                'value2' : normal_dist2})"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"_kg_hide-input": true,
"_uuid": "9b07a5750071eaba369eb243fa4f8227886cf9e4",
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Standard Error of uniform sample: 0.029383241532640426\n",
"Standard Error of normal sample: 0.09801666115089963\n"
]
}
],
"source": [
"# Standard Error (SE)\n",
"# Uniform distribution (between 0 and 1)\n",
"uniform_dist = np.random.random(1000)\n",
"uniform_df = pd.DataFrame({'value' : uniform_dist})\n",
"uniform_dist = pd.Series(uniform_dist)\n",
"\n",
"# Draw 100 observations from each population\n",
"uni_sample = uniform_dist.sample(100)\n",
"norm_sample = normal_dist.sample(100)\n",
"\n",
"print('Standard Error of uniform sample: ', sem(uni_sample))\n",
"print('Standard Error of normal sample: ', sem(norm_sample))\n",
"\n",
"# The normal sample should have the higher standard error: a standard normal has standard deviation 1,\n",
"# while a uniform distribution on [0, 1) has standard deviation sqrt(1/12), about 0.29"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 4
}