{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"is_executing": true,
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"from shared.DescriptiveStats import create_stats\n",
"\n",
"path = '../data/insurance.csv'\n",
"df = pd.read_csv(path)\n",
"\n",
"descriptiveData = {}\n",
"df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": []
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" age | \n",
" sex | \n",
" bmi | \n",
" children | \n",
" smoker | \n",
" region | \n",
" charges | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 1338.000000 | \n",
" 1338 | \n",
" 1338.000000 | \n",
" 1338.000000 | \n",
" 1338 | \n",
" 1338 | \n",
" 1338.000000 | \n",
"
\n",
" \n",
" unique | \n",
" NaN | \n",
" 2 | \n",
" NaN | \n",
" NaN | \n",
" 2 | \n",
" 4 | \n",
" NaN | \n",
"
\n",
" \n",
" top | \n",
" NaN | \n",
" male | \n",
" NaN | \n",
" NaN | \n",
" no | \n",
" southeast | \n",
" NaN | \n",
"
\n",
" \n",
" freq | \n",
" NaN | \n",
" 676 | \n",
" NaN | \n",
" NaN | \n",
" 1064 | \n",
" 364 | \n",
" NaN | \n",
"
\n",
" \n",
" mean | \n",
" 39.207025 | \n",
" NaN | \n",
" 30.663397 | \n",
" 1.094918 | \n",
" NaN | \n",
" NaN | \n",
" 13270.422265 | \n",
"
\n",
" \n",
" std | \n",
" 14.049960 | \n",
" NaN | \n",
" 6.098187 | \n",
" 1.205493 | \n",
" NaN | \n",
" NaN | \n",
" 12110.011237 | \n",
"
\n",
" \n",
" min | \n",
" 18.000000 | \n",
" NaN | \n",
" 15.960000 | \n",
" 0.000000 | \n",
" NaN | \n",
" NaN | \n",
" 1121.873900 | \n",
"
\n",
" \n",
" 25% | \n",
" 27.000000 | \n",
" NaN | \n",
" 26.296250 | \n",
" 0.000000 | \n",
" NaN | \n",
" NaN | \n",
" 4740.287150 | \n",
"
\n",
" \n",
" 50% | \n",
" 39.000000 | \n",
" NaN | \n",
" 30.400000 | \n",
" 1.000000 | \n",
" NaN | \n",
" NaN | \n",
" 9382.033000 | \n",
"
\n",
" \n",
" 75% | \n",
" 51.000000 | \n",
" NaN | \n",
" 34.693750 | \n",
" 2.000000 | \n",
" NaN | \n",
" NaN | \n",
" 16639.912515 | \n",
"
\n",
" \n",
" max | \n",
" 64.000000 | \n",
" NaN | \n",
" 53.130000 | \n",
" 5.000000 | \n",
" NaN | \n",
" NaN | \n",
" 63770.428010 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" age sex bmi children smoker region \\\n",
"count 1338.000000 1338 1338.000000 1338.000000 1338 1338 \n",
"unique NaN 2 NaN NaN 2 4 \n",
"top NaN male NaN NaN no southeast \n",
"freq NaN 676 NaN NaN 1064 364 \n",
"mean 39.207025 NaN 30.663397 1.094918 NaN NaN \n",
"std 14.049960 NaN 6.098187 1.205493 NaN NaN \n",
"min 18.000000 NaN 15.960000 0.000000 NaN NaN \n",
"25% 27.000000 NaN 26.296250 0.000000 NaN NaN \n",
"50% 39.000000 NaN 30.400000 1.000000 NaN NaN \n",
"75% 51.000000 NaN 34.693750 2.000000 NaN NaN \n",
"max 64.000000 NaN 53.130000 5.000000 NaN NaN \n",
"\n",
" charges \n",
"count 1338.000000 \n",
"unique NaN \n",
"top NaN \n",
"freq NaN \n",
"mean 13270.422265 \n",
"std 12110.011237 \n",
"min 1121.873900 \n",
"25% 4740.287150 \n",
"50% 9382.033000 \n",
"75% 16639.912515 \n",
"max 63770.428010 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.describe(include='all')"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"# Descriptive Analysis\n",
"This type of analysis describes and summarises the data\n",
"\n",
"## Measures of central tendency\n",
"These types of calculations look at the averages of the data set\n",
"- Mean - The average. Add all the values and divide by the number of values.\n",
"- Median - The middle number.\n",
"- Mode - The value which appears the most in the set.\n",
"\n",
"## Measures of Spread\n",
"How similar or different the data in the data set is.\n",
"- Range - The difference between the highest and lowest value.\n",
"- Interquartile Range - The spread of the middle half of the distribution, when the set has been ordered from lowest to highest.\n",
"- Standard Deviation - Measures the amount of variation of dispersion in a set of values.\n",
"- Variance - measure of how different/ variable values are from the average and from each other.\n"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" mean | \n",
" median | \n",
" mode | \n",
" range | \n",
" IQR | \n",
" std | \n",
" var | \n",
"
\n",
" \n",
" \n",
" \n",
" age | \n",
" 39.207025 | \n",
" 39.000 | \n",
" 0 18\n",
"Name: age, dtype: int64 | \n",
" 46.00000 | \n",
" 24.000000 | \n",
" 14.044709 | \n",
" 1.972539e+02 | \n",
"
\n",
" \n",
" bmi | \n",
" 30.663397 | \n",
" 30.400 | \n",
" 0 32.3\n",
"Name: bmi, dtype: float64 | \n",
" 37.17000 | \n",
" 8.397500 | \n",
" 6.095908 | \n",
" 3.716009e+01 | \n",
"
\n",
" \n",
" children | \n",
" 1.094918 | \n",
" 1.000 | \n",
" 0 0\n",
"Name: children, dtype: int64 | \n",
" 5.00000 | \n",
" 2.000000 | \n",
" 1.205042 | \n",
" 1.452127e+00 | \n",
"
\n",
" \n",
" charges | \n",
" 13270.422265 | \n",
" 9382.033 | \n",
" 0 1639.5631\n",
"Name: charges, dtype: float64 | \n",
" 62648.55411 | \n",
" 11899.625365 | \n",
" 12105.484976 | \n",
" 1.465428e+08 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" mean median \\\n",
"age 39.207025 39.000 \n",
"bmi 30.663397 30.400 \n",
"children 1.094918 1.000 \n",
"charges 13270.422265 9382.033 \n",
"\n",
" mode range \\\n",
"age 0 18\n",
"Name: age, dtype: int64 46.00000 \n",
"bmi 0 32.3\n",
"Name: bmi, dtype: float64 37.17000 \n",
"children 0 0\n",
"Name: children, dtype: int64 5.00000 \n",
"charges 0 1639.5631\n",
"Name: charges, dtype: float64 62648.55411 \n",
"\n",
" IQR std var \n",
"age 24.000000 14.044709 1.972539e+02 \n",
"bmi 8.397500 6.095908 3.716009e+01 \n",
"children 2.000000 1.205042 1.452127e+00 \n",
"charges 11899.625365 12105.484976 1.465428e+08 "
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"number_columns = df.select_dtypes(include='number')\n",
"for column in number_columns:\n",
" descriptiveData.update(create_stats(column, df[column]))\n",
"pd.DataFrame.from_dict(descriptiveData, orient = 'index')"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.7"
}
},
"nbformat": 4,
"nbformat_minor": 1
}