import pandas as pd
from shared.DescriptiveStats import create_stats

path = '../data/insurance.csv'
df = pd.read_csv(path)

descriptiveData = {}
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
df.describe(include='all')
                age   sex          bmi     children  smoker     region       charges
count   1338.000000  1338  1338.000000  1338.000000    1338       1338   1338.000000
unique          NaN     2          NaN          NaN       2          4           NaN
top             NaN  male          NaN          NaN      no  southeast           NaN
freq            NaN   676          NaN          NaN    1064        364           NaN
mean      39.207025   NaN    30.663397     1.094918     NaN        NaN  13270.422265
std       14.049960   NaN     6.098187     1.205493     NaN        NaN  12110.011237
min       18.000000   NaN    15.960000     0.000000     NaN        NaN   1121.873900
25%       27.000000   NaN    26.296250     0.000000     NaN        NaN   4740.287150
50%       39.000000   NaN    30.400000     1.000000     NaN        NaN   9382.033000
75%       51.000000   NaN    34.693750     2.000000     NaN        NaN  16639.912515
max       64.000000   NaN    53.130000     5.000000     NaN        NaN  63770.428010

Descriptive Analysis

Descriptive analysis summarises the main features of a data set, such as its typical values and how widely the values vary.

Measures of Central Tendency

These measures each describe a typical or central value of the data set.

  • Mean - The average. Add all the values and divide by the number of values.

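  • Median - The middle value once the data has been ordered from lowest to highest. With an even number of values, it is the average of the two middle values.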

  • Mode - The value which appears the most in the set.
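As a minimal sketch, the three measures above can be computed with pandas on a small made-up sample (the values are illustrative and are not taken from insurance.csv). Note that `Series.mode()` returns a Series rather than a single number, because more than one value can tie for most frequent.

```python
import pandas as pd

# Small made-up sample, not taken from insurance.csv.
values = pd.Series([2, 3, 3, 5, 7, 10])

mean = values.mean()      # (2 + 3 + 3 + 5 + 7 + 10) / 6
median = values.median()  # even count: average of the two middle values, 3 and 5
modes = values.mode()     # a Series holding every most-frequent value, sorted
mode = modes.iloc[0]      # take the first (smallest) mode as a single value

print(mean, median, mode)
```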

Measures of Spread

These measures describe how similar or different the values in the data set are.

  • Range - The difference between the highest and lowest value.

  • Interquartile Range - The spread of the middle half of the distribution, when the set has been ordered from lowest to highest.

  • Standard Deviation - Measures the amount of variation or dispersion in a set of values. It is the square root of the variance, so it is in the same units as the data.

  • Variance - The average of the squared differences between each value and the mean.
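The measures of spread can be sketched the same way on a small made-up sample. One assumption to be aware of: pandas defaults to the sample formulas (`ddof=1`, dividing by n - 1), so `ddof=0` is passed here to get the population versions that match the definition of variance as an average of squared differences.

```python
import pandas as pd

# Small made-up sample, not taken from insurance.csv. Its mean is 5.
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

value_range = values.max() - values.min()            # highest minus lowest
iqr = values.quantile(0.75) - values.quantile(0.25)  # Q3 - Q1, the middle half
pop_var = values.var(ddof=0)   # population variance: mean squared deviation
pop_std = values.std(ddof=0)   # population standard deviation: sqrt(variance)
sample_std = values.std()      # pandas default, ddof=1 (divides by n - 1)

print(value_range, iqr, pop_var, pop_std, sample_std)
```

The sample standard deviation is always slightly larger than the population one for the same data, and the gap shrinks as the number of values grows.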

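# Keep only the numeric columns (age, bmi, children, charges)
number_columns = df.select_dtypes(include='number')
# Iterating over a DataFrame yields its column names
for column in number_columns:
    descriptiveData.update(create_stats(column, df[column]))
pd.DataFrame.from_dict(descriptiveData, orient='index')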
                  mean    median       mode        range           IQR           std           var
age          39.207025    39.000         18     46.00000     24.000000     14.044709  1.972539e+02
bmi          30.663397    30.400       32.3     37.17000      8.397500      6.095908  3.716009e+01
children      1.094918     1.000          0      5.00000      2.000000      1.205042  1.452127e+00
charges   13270.422265  9382.033  1639.5631  62648.55411  11899.625365  12105.484976  1.465428e+08
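The `create_stats` helper is imported from `shared.DescriptiveStats` and its source is not shown here. Purely as an illustration, a hypothetical stand-in with the same call shape (a column name and a Series in, a one-entry dict out) might look like the sketch below. The name `create_stats_sketch`, the choice to reduce the mode to a single value, and the use of the population formulas (`ddof=0`) are all assumptions. The last one is at least consistent with the `std` values above being slightly smaller than those reported by `df.describe()`, which uses the sample formula.

```python
import pandas as pd

def create_stats_sketch(name, series):
    """Hypothetical stand-in for create_stats; not the code in shared.DescriptiveStats."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    return {
        name: {
            'mean': series.mean(),
            'median': series.median(),
            'mode': series.mode().iloc[0],  # first mode as a scalar (assumption)
            'range': series.max() - series.min(),
            'IQR': q3 - q1,
            'std': series.std(ddof=0),      # population formula (assumption)
            'var': series.var(ddof=0),
        }
    }

# Made-up data for the usage example, not taken from insurance.csv.
sample = pd.DataFrame({'age': [18, 18, 25, 40, 64],
                       'children': [0, 0, 1, 2, 5]})

stats = {}
for column in sample.select_dtypes(include='number'):
    stats.update(create_stats_sketch(column, sample[column]))

summary = pd.DataFrame.from_dict(stats, orient='index')
print(summary)
```

Storing the mode as a scalar keeps each table cell to a single number; if ties matter, the full `series.mode()` result could be kept as a list instead.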