Model Design and Logistic Regression in Python
Designing a logistic regression model from randomized bioinformatics data.
julia, numerical-computing, logistic regression, statistics
7 Minutes, 19 Seconds
2022-06-17 13:20 +0000
Model Design and Logistic Regression in Python
I recently modeled customer churn in Julia with a logistic regression model. It was interesting, to be sure, but I want to extend my analysis skill set by modeling biostatistics data. In this post, I design a logistic regression model of health predictors.
Imports
# load some default Python modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-whitegrid')
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
Data
Data Description
Chinese Longitudinal Healthy Longevity Survey (CLHLS), Biomarkers Datasets, 2009, 2012, 2014 (ICPSR 37226) Principal Investigator(s): Yi Zeng, Duke University, and Peking University; James W. Vaupel, Max Planck Institutes, and Duke University
filename = "/Users/jnapolitano/Projects/biostatistics/data/37226-0003-Data.tsv"
# read data into a pandas dataframe
df = pd.read_csv(filename, sep='\t')
# list first few rows (datapoints)
df.head()
df.describe()
The float columns were not interpreted correctly by pandas. I'll fix that.
df.columns
Index(['ID', 'TRUEAGE', 'A1', 'ALB', 'GLU', 'BUN', 'CREA', 'CHO', 'TG', 'GSP',
'CRPHS', 'UA', 'HDLC', 'SOD', 'MDA', 'VD3', 'VITB12', 'UALB', 'UCR',
'UALBBYUCR', 'WBC', 'LYMPH', 'LYMPH_A', 'RBC', 'HGB', 'HCT', 'MCV',
'MCH', 'MCHC', 'PLT', 'MPV', 'PDW', 'PCT'],
dtype='object')
# check datatypes
df.dtypes
ID int64
TRUEAGE object
A1 object
ALB object
GLU object
BUN object
CREA object
CHO object
TG object
GSP object
CRPHS object
UA object
HDLC object
SOD object
MDA object
VD3 object
VITB12 object
UALB object
UCR object
UALBBYUCR object
WBC object
LYMPH object
LYMPH_A object
RBC object
HGB object
HCT object
MCV object
MCH object
MCHC object
PLT object
MPV object
PDW object
PCT object
dtype: object
Everything was read in as an object. I'll cast everything to numeric. Thank you, pandas.
# replace empty space with na
df = df.replace(" ", np.nan)
Just to be safe, I replace all blank spaces with np.nan before converting.
# convert numeric objects to numeric data types. I checked in the code book there will not be any false positives
df = df.apply(pd.to_numeric, errors='raise')
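If the strict conversion ever raises on an unexpected string, a coerce-and-inspect pass on the raw frame can locate the offending cells. A small sketch, reusing the filename defined above:
# run against the raw frame: coerce bad cells to NaN instead of raising,
# then count how many new NaNs each column picked up during conversion
raw = pd.read_csv(filename, sep='\t').replace(" ", np.nan)
coerced = raw.apply(pd.to_numeric, errors='coerce')
new_missing = coerced.isna().sum() - raw.isna().sum()
print(new_missing[new_missing > 0])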
# recheck dtypes
df.dtypes
ID int64
TRUEAGE float64
A1 float64
ALB float64
GLU float64
BUN float64
CREA float64
CHO float64
TG float64
GSP float64
CRPHS float64
UA float64
HDLC float64
SOD float64
MDA float64
VD3 float64
VITB12 float64
UALB float64
UCR float64
UALBBYUCR float64
WBC float64
LYMPH float64
LYMPH_A float64
RBC float64
HGB float64
HCT float64
MCV float64
MCH float64
MCHC float64
PLT float64
MPV float64
PDW float64
PCT float64
dtype: object
# check statistics of the features
df.describe()
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
It is odd that some columns have higher counts than others in the describe output; those columns are missing values. I'll drop every row that contains an NA:
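Before dropping anything, it is worth seeing which columns drive the loss. A quick sketch using only pandas built-ins:
# fraction of missing values per column, worst offenders first
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share.head(10))
# fraction of rows that would survive a full dropna
print(f"rows kept after dropna: {df.dropna().shape[0] / df.shape[0]:.1%}")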
df = df.dropna()
df.describe()
Dropping NA removes about four-fifths of the data. The counts are now equal across columns, and everything is in the correct data type.
Visualizing Age Distribution
I am curious what the age spread looks like. An even spread across ages would make it easier to relate the biomarkers to health outcomes.
# plot histogram of age
df.TRUEAGE.hist(figsize=(14,3))
plt.xlabel('Age')
plt.title('Histogram');
Unfortunately, the ages are not evenly distributed.
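To put a number on the skew, pandas can report the summary statistics and skewness of the age column directly. A small sketch:
# summary statistics and skewness of the age distribution
print(df.TRUEAGE.describe())
print(f"skewness: {df.TRUEAGE.skew():.2f}")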
Visualizing Age to Triglyceride Levels
A predictive model relating health factors to longevity is probably possible. Certain statistical assumptions would have to hold, but I'll assume they do for the sake of this mockup.
plt.scatter(df.TRUEAGE, df.TG)
plt.xlabel('True Age')
plt.ylabel('Triglyceride, mmol/L')
plt.show()
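The scatter plot only hints at whether a relationship exists. A quick correlation check, a small sketch not part of the original analysis, puts a number on it:
# Pearson correlation between age and triglyceride level
print(df[['TRUEAGE', 'TG']].corr())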
Filter Examples
The data above doesn't really need to be filtered. To demonstrate how it could be, I add some randomized columns and then filter them against specific eligibility conditions.
n = df.shape[0]
# randomized demonstration columns: name -> exclusive upper bound for np.random.randint
# EMERGENCY:   0 = no, 1 = yes (seeking care in the emergency department)
# CANCER_TYPE: arbitrary cancer-type code from 0 to 99
# ICPI_HIST:   0 = none, 1 = ICPI, 2 = mono therapy
# LANG:        0 = Spanish, 1 = English (arbitrarily chosen)
# FOLLOW_UP:   0 = no, 1 = yes
# CONSENT:     0 = no, 1 = yes
# PREGNANT:    0 = no, 1 = yes
random_columns = {
    "EMERGENCY": 2,
    "CANCER_TYPE": 100,
    "ICPI_HIST": 3,
    "LANG": 2,
    "FOLLOW_UP": 2,
    "CONSENT": 2,
    "PREGNANT": 2,
}
for column, upper_bound in random_columns.items():
    # the lower bound 0 is inclusive, the upper bound is exclusive
    df[column] = np.random.randint(low=0, high=upper_bound, size=n)
df
Writing the Filter
Writing a quick filter to check eligibility. This could, and probably should, be written as a function; a functional sketch follows the concatenation below.
# Age is greater than 18
# CANCER_TYPE is not a non-melanoma skin cancer (i.e. 5, arbitrarily chosen)
# Patient is seeking care in the emergency department
# ICPI history equals 0, i.e. the patient is not receiving either therapy (depending on how the history is coded, this could instead be written against 1 or 2)
# Language is either Spanish or English
# Patient agrees to follow up
# Patient consents
# Patient is not pregnant
idx_spanish = (df.TRUEAGE > 18) & (df.CANCER_TYPE != 5) & (df.EMERGENCY == 1) & \
(df.ICPI_HIST == 0) & (df.LANG == 0) & (df.FOLLOW_UP == 1) & (df.CONSENT == 1) & (df.PREGNANT == 0)
# Ideally the english and spanish speakers would have been filtered prior to this, but for the sake of exploration this will work.
idx_english = (df.TRUEAGE > 18) & (df.CANCER_TYPE != 5) & (df.EMERGENCY == 1) & \
(df.ICPI_HIST == 0) & (df.LANG == 1) & (df.FOLLOW_UP == 1) & (df.CONSENT == 1) & (df.PREGNANT == 0)
I created Spanish and English dataframes for the sake of data manipulation. It is not really necessary, but it would permit modifying and recoding the data if the two language groups were formatted differently.
filtered_df = pd.concat([df[idx_english], df[idx_spanish]], ignore_index=True)
The filtered dataframe is a concatenation of the English and Spanish filtered data.
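As mentioned above, the filter is a natural candidate for a function. A sketch of what that might look like; the eligible function and its langs parameter are hypothetical, the criteria simply mirror the masks above, and the result matches the concatenation up to row order:
def eligible(frame, langs=(0, 1)):
    """Return rows meeting the eligibility criteria for the given language codes."""
    mask = (
        (frame.TRUEAGE > 18)
        & (frame.CANCER_TYPE != 5)   # exclude the non-melanoma skin cancer code
        & (frame.EMERGENCY == 1)     # seeking care in the emergency department
        & (frame.ICPI_HIST == 0)     # no ICPI or mono therapy history
        & (frame.LANG.isin(langs))   # restrict to the requested languages
        & (frame.FOLLOW_UP == 1)
        & (frame.CONSENT == 1)
        & (frame.PREGNANT == 0)
    )
    return frame[mask]

# equivalent to the concatenation above, in a single pass
filtered_df = eligible(df).reset_index(drop=True)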
filtered_df.describe()
filtered_df.shape[0]
#only 33 left following the filter.
33
Following the filter, only 33 rows are left in the set. A workflow similar to this could be used to identify possible survey recruits from aggregated chart data.
Logistic Regression Sample
I am surprised by how few samples are left after the filter. To avoid a small n, I will use the initial dataset.
filename = "/Users/jnapolitano/Projects/biostatistics/data/37226-0003-Data.tsv"
df = pd.read_csv(filename, sep='\t')
# replace empty space with na
df = df.replace(" ", np.nan)
# convert numeric objects to numeric data types. I checked in the code book there will not be any false positives
df = df.apply(pd.to_numeric, errors='raise')
df = df.dropna()
df
The Model
There is a strong suspicion that biomarkers can help determine whether a patient should be admitted for emergency care. In this simplified model, I randomly distribute the proper disposition across the dataset.
n = df.shape[0]
lower_bound = 0 #inclusive
upper_bound = 2 # exclusive
# 0 = no
# 1 = Yes
tmp = np.random.randint(low=lower_bound , high = upper_bound, size = n)
df["PROP_DISPOSITION"] = tmp
Create Test and Train Set
The data could be randomly sampled as well.
Random Sample
# copy in memory to avoid errors. This could be done from files or in other ways if memory is limited.
master_table = df.copy()
Test Sample Set with 20,000 Rows Randomly Selected from the Master with Replacement
test_sample = master_table.sample(n=20000,replace=True)
targets = test_sample.pop("PROP_DISPOSITION")
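The draw above is unseeded, so the sample changes from run to run. If reproducibility matters, the same call accepts a seed; a small variant, not part of the original run:
# seeded variant of the same bootstrap draw, so the sample is reproducible
test_sample = master_table.sample(n=20000, replace=True, random_state=0)
targets = test_sample.pop("PROP_DISPOSITION")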
Separate Train and Test Sets
x_train, x_test, y_train, y_test = train_test_split(test_sample, targets, test_size=0.2, random_state=0)
Data Standardization
Calculate the mean and standard deviation for each column. Subtract the corresponding mean from each element. Divide the obtained difference by the corresponding standard deviation.
Thankfully, this is built into scikit-learn.
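For reference, the same transformation can be written out by hand. A minimal sketch of what StandardScaler computes; the actual scaling below still uses scikit-learn:
# column-wise z-score: subtract each column's mean, divide by its standard deviation
# (StandardScaler uses the population standard deviation, hence ddof=0)
means = x_train.mean(axis=0)
stds = x_train.std(axis=0, ddof=0)
x_train_manual = (x_train - means) / stds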
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
Create the Model
model = LogisticRegression(solver='liblinear', C=0.05, multi_class='ovr',
random_state=0)
model.fit(x_train, y_train)
LogisticRegression(C=0.05, multi_class='ovr', random_state=0,
solver='liblinear')
Evaluate Model
x_test = scaler.transform(x_test)
y_pred = model.predict(x_test)
Model Scoring
With a completely randomized target, the score should be about 50%. If it is significantly higher than that, there is probably a problem with the model, such as data leakage.
model.score(x_train, y_train)
0.5676875
model.score(x_test, y_test)
0.55825
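To make the baseline explicit, a dummy classifier that ignores the features entirely should land in the same neighborhood. A quick sketch, not part of the original notebook:
from sklearn.dummy import DummyClassifier

# predicts the most frequent class for every sample, ignoring the features
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(x_train, y_train)
print(baseline.score(x_test, y_test))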
The results are as expected.
Confusion matrix
cm = confusion_matrix(y_test, y_pred)
font_size = 10
fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(cm)
ax.grid(False)
ax.set_xlabel('Predicted outputs', color='purple')
ax.set_ylabel('Actual outputs', color='purple')
ax.xaxis.set(ticks=range(len(cm)))
ax.yaxis.set(ticks=range(len(cm)))
#ax.set_ylim(0, 1)
for i in range(len(cm)):
for j in range(len(cm)):
ax.text(j, i, cm[i, j], ha='center', va='center', color='purple')
plt.show()
Because the target is randomized, the model is accurate only about 50% of the time.
Printing the Classification Report
print(classification_report(y_test, y_pred))
precision recall f1-score support
0 0.55 0.47 0.51 1927
1 0.56 0.64 0.60 2073
accuracy 0.56 4000
macro avg 0.56 0.56 0.55 4000
weighted avg 0.56 0.56 0.55 4000