import pandas as pd
from shared.DescriptiveStats import create_stats

path = '../data/insurance.csv'
df = pd.read_csv(path)

descriptiveData = {}
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB

df.describe(include='all')

	age	sex	bmi	children	smoker	region	charges
count	1338.000000	1338	1338.000000	1338.000000	1338	1338	1338.000000
unique	NaN	2	NaN	NaN	2	4	NaN
top	NaN	male	NaN	NaN	no	southeast	NaN
freq	NaN	676	NaN	NaN	1064	364	NaN
mean	39.207025	NaN	30.663397	1.094918	NaN	NaN	13270.422265
std	14.049960	NaN	6.098187	1.205493	NaN	NaN	12110.011237
min	18.000000	NaN	15.960000	0.000000	NaN	NaN	1121.873900
25%	27.000000	NaN	26.296250	0.000000	NaN	NaN	4740.287150
50%	39.000000	NaN	30.400000	1.000000	NaN	NaN	9382.033000
75%	51.000000	NaN	34.693750	2.000000	NaN	NaN	16639.912515
max	64.000000	NaN	53.130000	5.000000	NaN	NaN	63770.428010

Descriptive Analysis

Descriptive analysis describes and summarises the main features of a data set.

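Measures of central tendency

These measures identify a typical or central value of the data set.

Mean - The arithmetic average: add all the values and divide by the number of values.
Median - The middle value when the data are sorted; with an even number of values, the average of the two middle values.
Mode - The value that appears most often in the set. A set can have more than one mode.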
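
As a minimal sketch, the three measures can be computed with pandas on a small made-up Series. The values are illustrative only and are not taken from insurance.csv.

```python
import pandas as pd

# Illustrative values only; not taken from insurance.csv.
ages = pd.Series([18, 18, 25, 40, 64])

mean_age = ages.mean()      # sum of the values divided by their count
median_age = ages.median()  # middle value of the sorted data

# Series.mode() returns a Series because several values can tie for
# most frequent; .iloc[0] picks the smallest of the tied values.
mode_age = ages.mode().iloc[0]

print(mean_age, median_age, mode_age)
```

Taking the first element of the mode is a simplification: if the data have several modes, the others are discarded.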

Measures of Spread

These measures describe how similar or different the values in the data set are.

Range - The difference between the highest and lowest values.
Interquartile Range - The spread of the middle half of the distribution: the difference between the 75th and 25th percentiles once the values have been ordered from lowest to highest.
Standard Deviation - Measures the amount of variation or dispersion in a set of values; it is the square root of the variance.
Variance - The average squared deviation of the values from the mean.
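
The measures of spread can be sketched the same way. The Series below is made up for illustration. Note that pandas defaults to the sample formulas (ddof=1) for standard deviation and variance, so the population versions need ddof=0 passed explicitly.

```python
import pandas as pd

# Illustrative values only; not taken from insurance.csv.
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

# Range: highest value minus lowest value.
value_range = values.max() - values.min()

# Interquartile range: 75th percentile minus 25th percentile.
iqr = values.quantile(0.75) - values.quantile(0.25)

# Population formulas divide by n (ddof=0).
pop_std = values.std(ddof=0)
pop_var = values.var(ddof=0)

# pandas' default is the sample formula, which divides by n - 1 (ddof=1).
sample_std = values.std()

print(value_range, iqr, pop_std, pop_var, sample_std)
```

The sample standard deviation is always slightly larger than the population one for the same data, which matters when comparing results from different tools.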

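# Collect the descriptive statistics for every numeric column.
number_columns = df.select_dtypes(include='number')
for column in number_columns:
    # create_stats returns a dict that is merged into descriptiveData.
    descriptiveData.update(create_stats(column, df[column]))
# One row per column, one table column per statistic.
pd.DataFrame.from_dict(descriptiveData, orient='index')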

	mean	median	mode	range	IQR	std	var
age	39.207025	39.000	0 18 Name: age, dtype: int64	46.00000	24.000000	14.044709	1.972539e+02
bmi	30.663397	30.400	0 32.3 Name: bmi, dtype: float64	37.17000	8.397500	6.095908	3.716009e+01
children	1.094918	1.000	0 0 Name: children, dtype: int64	5.00000	2.000000	1.205042	1.452127e+00
charges	13270.422265	9382.033	0 1639.5631 Name: charges, dtype: float64	62648.55411	11899.625365	12105.484976	1.465428e+08
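
The implementation of create_stats is not shown here, so the following is only a hypothetical sketch of what such a helper might look like. It is named create_stats_sketch to make clear that it is not the real function. Two details are assumptions drawn from the output above: the mode cells contain a whole Series printout, so the sketch takes the first mode as a scalar instead, and the std values are slightly smaller than those from df.describe() (14.044709 versus 14.049960 for age), which is consistent with the population formulas (ddof=0).

```python
import pandas as pd


def create_stats_sketch(name, series):
    """Hypothetical stand-in for create_stats; the real helper may differ."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    return {
        name: {
            'mean': series.mean(),
            'median': series.median(),
            # Scalar mode (first of any ties) instead of a whole Series.
            'mode': series.mode().iloc[0],
            'range': series.max() - series.min(),
            'IQR': q3 - q1,
            # Assumption: population formulas, to match the table above.
            'std': series.std(ddof=0),
            'var': series.var(ddof=0),
        }
    }


# Made-up demo data; not taken from insurance.csv.
demo = pd.DataFrame({'age': [18, 18, 25, 40, 64],
                     'children': [0, 0, 1, 2, 5]})

stats = {}
for column in demo.select_dtypes(include='number'):
    stats.update(create_stats_sketch(column, demo[column]))

summary = pd.DataFrame.from_dict(stats, orient='index')
print(summary)
```

Returning a scalar for the mode keeps every cell of the summary table a plain number, at the cost of hiding any additional modes.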
