Model Design and Logistic Regression in Python

I recently modeled customer churn in Julia with a logistic regression model. It was interesting, to be sure, but I want to extend my analysis skill set by modeling biostatistics data. In this post, I design a logistic regression model on health predictors.

Imports

# load some default Python modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-whitegrid')

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

Data

Data Description

Chinese Longitudinal Healthy Longevity Survey (CLHLS), Biomarkers Datasets, 2009, 2012, 2014 (ICPSR 37226). Principal Investigators: Yi Zeng, Duke University and Peking University; James W. Vaupel, Max Planck Institutes and Duke University.

filename = "/Users/jnapolitano/Projects/biostatistics/data/37226-0003-Data.tsv"
# read data into a pandas dataframe
df = pd.read_csv(filename, sep='\t')


# list first few rows (datapoints)
df.head()

df.describe()

The float columns were not interpreted correctly by pandas. I'll fix that.

df.columns
Index(['ID', 'TRUEAGE', 'A1', 'ALB', 'GLU', 'BUN', 'CREA', 'CHO', 'TG', 'GSP',
       'CRPHS', 'UA', 'HDLC', 'SOD', 'MDA', 'VD3', 'VITB12', 'UALB', 'UCR',
       'UALBBYUCR', 'WBC', 'LYMPH', 'LYMPH_A', 'RBC', 'HGB', 'HCT', 'MCV',
       'MCH', 'MCHC', 'PLT', 'MPV', 'PDW', 'PCT'],
      dtype='object')
# check datatypes
df.dtypes
ID            int64
TRUEAGE      object
A1           object
ALB          object
GLU          object
BUN          object
CREA         object
CHO          object
TG           object
GSP          object
CRPHS        object
UA           object
HDLC         object
SOD          object
MDA          object
VD3          object
VITB12       object
UALB         object
UCR          object
UALBBYUCR    object
WBC          object
LYMPH        object
LYMPH_A      object
RBC          object
HGB          object
HCT          object
MCV          object
MCH          object
MCHC         object
PLT          object
MPV          object
PDW          object
PCT          object
dtype: object

Everything was read as an object, so I'll cast everything to numeric… Thank you, numpy.

# replace empty space with na
df = df.replace(" ", np.nan)

Just to be safe, all blank spaces are replaced with np.nan before the numeric conversion.

# convert numeric objects to numeric data types; I checked the codebook, so there should be no false positives
df = df.apply(pd.to_numeric, errors='raise')
# recheck dtypes
df.dtypes
ID             int64
TRUEAGE      float64
A1           float64
ALB          float64
GLU          float64
BUN          float64
CREA         float64
CHO          float64
TG           float64
GSP          float64
CRPHS        float64
UA           float64
HDLC         float64
SOD          float64
MDA          float64
VD3          float64
VITB12       float64
UALB         float64
UCR          float64
UALBBYUCR    float64
WBC          float64
LYMPH        float64
LYMPH_A      float64
RBC          float64
HGB          float64
HCT          float64
MCV          float64
MCH          float64
MCHC         float64
PLT          float64
MPV          float64
PDW          float64
PCT          float64
dtype: object
# check statistics of the features
df.describe()

It is odd that some columns have higher counts than others, which means there are missing values. I'll remove all NA rows.

Checking for negative values and anything else I missed in the initial SQL clean:

df = df.dropna()
df.describe()

Dropping the NA rows removes about four-fifths of the data. The counts are now equal across columns, and everything is in the correct data type.
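To double-check that figure, a quick before-and-after row count works. A small sanity check that re-reads the file; it is not part of the original workflow:

# count how many rows survive the dropna
raw = pd.read_csv(filename, sep='\t').replace(" ", np.nan).apply(pd.to_numeric, errors='raise')
kept = raw.dropna().shape[0]
print(f"complete rows: {kept} of {raw.shape[0]} ({kept / raw.shape[0]:.1%})")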

Visualizing Age Distribution

I am curious what the age distribution looks like. An even spread would make it easier to relate the biomarkers to health outcomes across age groups.

# plot histogram of age
df.TRUEAGE.hist(figsize=(14,3))
plt.xlabel('Age')
plt.title('Histogram');

[figure: histogram of TRUEAGE]

Unfortunately, the ages are not evenly distributed.

Visualizing Age vs. Triglyceride Levels

A predictive model relating health factors to longevity is probably possible. Certain statistical assumptions would need to hold, but I'll assume they do for the sake of this mockup.

plt.scatter(df.TRUEAGE, df.TG)
plt.xlabel('True Age')
plt.ylabel('Triglycerides, mmol/L')
plt.show()

[figure: scatter plot of age vs. triglycerides]

Filter Examples

The data above doesn't really need to be filtered. To demonstrate how it could be, and to match the conditions described in the training video, I add some randomized columns and then filter the rows against specific eligibility conditions.

# add randomized categorical columns for the filtering demo
n = df.shape[0]

# column name -> exclusive upper bound for np.random.randint(0, high, size=n)
random_columns = {
    "EMERGENCY": 2,      # 0 = no, 1 = yes (patient seeking care in the emergency department)
    "CANCER_TYPE": 100,  # arbitrary cancer-type codes
    "ICPI_HIST": 3,      # 0 = neither, 1 = ICPI, 2 = mono
    "LANG": 2,           # 0 = Spanish, 1 = English (arbitrarily chosen)
    "FOLLOW_UP": 2,      # 0 = no, 1 = yes
    "CONSENT": 2,        # 0 = no, 1 = yes
    "PREGNANT": 2,       # 0 = no, 1 = yes
}

for column, upper_bound in random_columns.items():
    df[column] = np.random.randint(low=0, high=upper_bound, size=n)

df

Writing the Filter

Writing a quick filter to ensure eligibility. This could, and probably should, be written as a function (a sketch follows the masks below), but so it goes.

# Older than 18
# Cancer type is not a non-melanoma skin cancer (code 5, arbitrarily chosen)
# Patient is seeking care in the emergency department
# No ICPI or mono history (ICPI_HIST == 0, i.e., not receiving either treatment)
# Language is either Spanish or English
# Patient agrees to follow up
# Patient consents
# Patient is not pregnant

idx_spanish = (df.TRUEAGE > 18) & (df.CANCER_TYPE != 5) & (df.EMERGENCY == 1) & \
        (df.ICPI_HIST == 0) & (df.LANG == 0) & (df.FOLLOW_UP == 1) & (df.CONSENT == 1) & (df.PREGNANT == 0)

# Ideally the English and Spanish speakers would have been split before this point, but for the sake of exploration this works.
idx_english = (df.TRUEAGE > 18) & (df.CANCER_TYPE != 5) & (df.EMERGENCY == 1) & \
        (df.ICPI_HIST == 0) & (df.LANG == 1) & (df.FOLLOW_UP == 1) & (df.CONSENT == 1) & (df.PREGNANT == 0)
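As noted above, the same eligibility check can be written as a function so that only the language code changes between the two masks. A rough sketch, using the same hypothetical column codes:

def eligibility_mask(frame, lang_code):
    # boolean mask of eligible patients for a given language code (0 = Spanish, 1 = English)
    return ((frame.TRUEAGE > 18)
            & (frame.CANCER_TYPE != 5)    # not the non-melanoma skin cancer code
            & (frame.EMERGENCY == 1)      # seeking care in the emergency department
            & (frame.ICPI_HIST == 0)      # no ICPI or mono history
            & (frame.LANG == lang_code)
            & (frame.FOLLOW_UP == 1)
            & (frame.CONSENT == 1)
            & (frame.PREGNANT == 0))

idx_spanish = eligibility_mask(df, lang_code=0)
idx_english = eligibility_mask(df, lang_code=1)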

I created Spanish and English masks for the sake of data manipulation. It is not really necessary, but it would permit modifying and recoding the data if it were formatted differently.


filtered_df = pd.concat([df[idx_english], df[idx_spanish]], ignore_index=True)

The filtered dataframe is a concatenation of the English- and Spanish-filtered data.

filtered_df.describe()

filtered_df.shape[0]
# only 33 left following the filter
33

Following the filter, only 33 records are left in the set. A workflow similar to this could be used to identify possible survey recruits from aggregated chart data.

Logistic Regression Sample

I am surprised by how few samples are left after the filter. To avoid a small n, I will use the initial dataset.

filename = "/Users/jnapolitano/Projects/biostatistics/data/37226-0003-Data.tsv"
df = pd.read_csv(filename, sep='\t')

# replace empty space with na
df = df.replace(" ", np.nan)

# convert numeric objects to numeric data types; I checked the codebook, so there should be no false positives
df = df.apply(pd.to_numeric, errors='raise')
df = df.dropna()
df

The Model

There is a strong suspicion that biomarkers can indicate whether a patient should be admitted for emergency care. In this simplified model, I randomly assign the proper-disposition label across the dataset.

n = df.shape[0]
lower_bound = 0 #inclusive
upper_bound = 2 # exclusive
# 0 = no
# 1 = Yes
tmp =  np.random.randint(low=lower_bound , high = upper_bound, size = n)
df["PROP_DISPOSITION"] = tmp

Create Test and Train Set

This could be randomly sampled as well…

Random Sample

# copy in memory to avoid modifying the original. This could be done from files or in other ways if memory is limited.
master_table = df.copy()

Test Sample Set with 20,000 Rows Randomly Selected from the Master with Replacement

test_sample = master_table.sample(n=20000,replace=True)
targets = test_sample.pop("PROP_DISPOSITION")

Separate Train and Test Sets

x_train, x_test, y_train, y_test = train_test_split(test_sample, targets, test_size=0.2, random_state=0)

Data Standardization

Standardization calculates the mean and standard deviation of each column, subtracts the corresponding mean from each element, and divides the difference by the corresponding standard deviation.

Thankfully this is built into scikit-learn.

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
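Under the hood the scaler is just the column-wise z-score described above. A small illustration on a toy array (not part of the original workflow) confirms the equivalence; note that StandardScaler uses the population standard deviation (ddof=0):

# hand-rolled z-score vs. StandardScaler on a toy array
demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
demo_scaled = (demo - demo.mean(axis=0)) / demo.std(axis=0)
print(np.allclose(demo_scaled, StandardScaler().fit_transform(demo)))  # True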

Create the Model

model = LogisticRegression(solver='liblinear', C=0.05, multi_class='ovr',
                           random_state=0)
model.fit(x_train, y_train)
LogisticRegression(C=0.05, multi_class='ovr', random_state=0,
                   solver='liblinear')
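If you want to see which features the fitted model leans on, the coefficients can be mapped back to the column names of test_sample. This is purely illustrative here, since the target is random and the coefficients are on the standardized scale:

# pair each feature with its fitted coefficient
coef_table = pd.Series(model.coef_[0], index=test_sample.columns).sort_values()
print(coef_table)
print("intercept:", model.intercept_[0])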

Evaluate Model

x_test = scaler.transform(x_test)
y_pred = model.predict(x_test)

Model Scoring

With completely randomized target values the score should be about 50%. If it were significantly higher, there would probably be a problem with the model, such as target leakage.

model.score(x_train, y_train)
0.5676875
model.score(x_test, y_test)
0.55825

The results are as expected.
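To ground the roughly-50% expectation, a majority-class baseline makes a handy comparison. A quick sketch using scikit-learn's DummyClassifier, which is not part of the original workflow:

from sklearn.dummy import DummyClassifier

# always predicts the most frequent class; with random labels it should also land near 50%
baseline = DummyClassifier(strategy='most_frequent').fit(x_train, y_train)
print("baseline accuracy:", baseline.score(x_test, y_test))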

Confusion matrix

cm = confusion_matrix(y_test, y_pred)
font_size = 10

fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(cm)
ax.grid(False)
ax.set_xlabel('Predicted outputs', color='purple')
ax.set_ylabel('Actual outputs', color='purple')
ax.xaxis.set(ticks=range(len(cm)))
ax.yaxis.set(ticks=range(len(cm)))
#ax.set_ylim(0, 1)
for i in range(len(cm)):
    for j in range(len(cm)):
        ax.text(j, i, cm[i, j], ha='center', va='center', color='purple')
plt.show()

[figure: confusion matrix heatmap]

Because the data is randomized, the model is only accurate about 50% of the time.

Printing the Classification Report

print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.55      0.47      0.51      1927
           1       0.56      0.64      0.60      2073

    accuracy                           0.56      4000
   macro avg       0.56      0.56      0.55      4000
weighted avg       0.56      0.56      0.55      4000