Descriptive Analysis
import pandas as pd
from shared.DescriptiveStats import create_stats
path = '../data/insurance.csv'
df = pd.read_csv(path)
descriptiveData = {}
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       1338 non-null   int64
 1   sex       1338 non-null   object
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64
 4   smoker    1338 non-null   object
 5   region    1338 non-null   object
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
df.describe(include='all')
| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| count | 1338.000000 | 1338 | 1338.000000 | 1338.000000 | 1338 | 1338 | 1338.000000 |
| unique | NaN | 2 | NaN | NaN | 2 | 4 | NaN |
| top | NaN | male | NaN | NaN | no | southeast | NaN |
| freq | NaN | 676 | NaN | NaN | 1064 | 364 | NaN |
| mean | 39.207025 | NaN | 30.663397 | 1.094918 | NaN | NaN | 13270.422265 |
| std | 14.049960 | NaN | 6.098187 | 1.205493 | NaN | NaN | 12110.011237 |
| min | 18.000000 | NaN | 15.960000 | 0.000000 | NaN | NaN | 1121.873900 |
| 25% | 27.000000 | NaN | 26.296250 | 0.000000 | NaN | NaN | 4740.287150 |
| 50% | 39.000000 | NaN | 30.400000 | 1.000000 | NaN | NaN | 9382.033000 |
| 75% | 51.000000 | NaN | 34.693750 | 2.000000 | NaN | NaN | 16639.912515 |
| max | 64.000000 | NaN | 53.130000 | 5.000000 | NaN | NaN | 63770.428010 |
Descriptive Analysis
This type of analysis describes and summarises the data set.
Measures of central tendency
These measures summarise the data set with a single typical, or average, value.
Mean - The average. Add all the values and divide by the number of values.
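Median - The middle value when the values are ordered from lowest to highest. With an even number of values, it is the mean of the two middle values.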
Mode - The value which appears the most in the set.
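The three measures above can be sketched on a small made-up sample. The numbers are illustrative only and are not taken from the insurance data. Note that `Series.mode()` returns a Series, because a set can have more than one mode, so the sketch takes the first entry to get a single value.

```python
import pandas as pd

# Made-up sample, used only to illustrate the three measures.
values = pd.Series([18, 18, 19, 23, 31, 45, 64])

mean = values.mean()      # sum of the values divided by the number of values
median = values.median()  # middle value of the ordered sample
modes = values.mode()     # a Series, since several values can tie for most frequent
mode = modes.iloc[0]      # first (smallest) mode as a single value

print(mean, median, mode)
```

Here the mean is pulled above the median by the two largest values, which is the same pattern the `charges` column shows in the summary tables.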
Measures of Spread
These measures describe how similar or different the values in the data set are.
Range - The difference between the highest and lowest value.
Interquartile Range - The spread of the middle half of the distribution, when the set has been ordered from lowest to highest: the 75th percentile minus the 25th percentile.
Standard Deviation - Measures the amount of variation or dispersion in a set of values. It is the square root of the variance, so it is in the same units as the data.
Variance - Measures how far the values are spread from the average: the mean of the squared differences from the mean.
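A minimal sketch of the four spread measures, again on made-up numbers. It also shows one detail worth knowing: pandas' `std()` and `var()` default to the sample formulas (`ddof=1`, dividing by n - 1), while `ddof=0` gives the population formulas (dividing by n). The two summary tables in this notebook report slightly different standard deviations for `age` (14.049960 and 14.044709), which is consistent with that difference, although `create_stats` is not shown here, so this is an inference rather than a confirmed fact.

```python
import pandas as pd

# Made-up sample, used only to illustrate the spread measures.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

value_range = values.max() - values.min()  # highest minus lowest

q1 = values.quantile(0.25)                 # 25th percentile
q3 = values.quantile(0.75)                 # 75th percentile
iqr = q3 - q1                              # spread of the middle half

sample_var = values.var()                  # pandas default, ddof=1 (divide by n - 1)
sample_std = values.std()                  # square root of sample_var
population_var = values.var(ddof=0)        # divide by n
population_std = values.std(ddof=0)        # square root of population_var

print(value_range, iqr, sample_std, population_std)
```

For large samples such as the 1338 rows here the two versions differ only slightly, but it helps to state which one a table reports.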
number_columns = df.select_dtypes(include='number')
for column in number_columns:
    descriptiveData.update(create_stats(column, df[column]))
pd.DataFrame.from_dict(descriptiveData, orient='index')
| | mean | median | mode | range | IQR | std | var |
|---|---|---|---|---|---|---|---|
| age | 39.207025 | 39.000 | 18 | 46.00000 | 24.000000 | 14.044709 | 1.972539e+02 |
| bmi | 30.663397 | 30.400 | 32.3 | 37.17000 | 8.397500 | 6.095908 | 3.716009e+01 |
| children | 1.094918 | 1.000 | 0 | 5.00000 | 2.000000 | 1.205042 | 1.452127e+00 |
| charges | 13270.422265 | 9382.033 | 1639.5631 | 62648.55411 | 11899.625365 | 12105.484976 | 1.465428e+08 |
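The `create_stats` helper is imported from `shared.DescriptiveStats` and its source is not shown in this notebook. As a rough sketch only, a helper of that shape might look like the following. The name `create_stats_sketch`, the choice of `ddof=0`, and the use of the first mode are all assumptions made for illustration, not the author's implementation. The sample frame is made up.

```python
import pandas as pd


def create_stats_sketch(name, series):
    """Hypothetical stand-in for create_stats: returns {name: {statistic: value}}."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    return {
        name: {
            "mean": series.mean(),
            "median": series.median(),
            # Assumption: keep only the first mode so the cell holds one value.
            "mode": series.mode().iloc[0],
            "range": series.max() - series.min(),
            "IQR": q3 - q1,
            # Assumption: population formulas (ddof=0).
            "std": series.std(ddof=0),
            "var": series.var(ddof=0),
        }
    }


# Made-up data with one non-numeric column, which select_dtypes filters out.
sample = pd.DataFrame(
    {
        "age": [18, 18, 19, 23, 31, 45, 64],
        "sex": ["male", "female", "male", "female", "male", "female", "male"],
        "children": [0, 0, 1, 0, 2, 3, 0],
    }
)

stats = {}
for column in sample.select_dtypes(include="number"):
    stats.update(create_stats_sketch(column, sample[column]))

summary = pd.DataFrame.from_dict(stats, orient="index")
print(summary)
```

Returning a one-key dictionary per column lets the loop merge results with `dict.update`, and `orient='index'` then turns each column name into a row of the summary table.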