Woodruff Health Sciences Center
Rollins School of Public Health, Nell Hodgson Woodruff School of Nursing
Emory University
Sepsis, defined as a systemic bacterial infection, is estimated to be responsible for as much as one quarter of neonatal deaths globally. Approximately 30–40% of infections resulting in neonatal sepsis are transmitted at the time of birth. Early-onset infections are often acquired around the time of delivery through vertical mother-to-infant transmission. Late-onset infections present after delivery (typically after 3 to 7 days of age) and are attributed to organisms acquired from the neonate’s interactions with the hospital, home, or community environment. Low birth weight is an established risk factor for sepsis, with preterm, low birth weight infants having a 3–10 times higher incidence than full-term, normal birth weight infants.
However, little is known about the incidence, risk factors, or etiology of neonatal sepsis in sub-Saharan Africa (SSA). The WHO reports “newborns are at higher risk of acquiring health care-associated infection in developing countries, with infection rates three to twenty times higher than in high-income countries.” A study in Nigeria reported 6.5 cases of neonatal sepsis per 1,000 live births in a referral hospital, and 21 cases of neonatal sepsis per 1,000 live births were reported from a referral hospital in Zimbabwe. There are no population-level estimates of neonatal sepsis in Ethiopia.
The study at hand will be the first known work to rigorously examine the burden of facility-acquired infections among facility-born neonates at healthcare facilities with limited WASH infrastructure and practices in a low-income country.
Source:
WHSC Synergy Award Proposal, Christine L. Moe, PhD & John N. Cranmer, DNP, MPH, MSN, ANP-BC
The primary goals of my work in this research were to describe the (1) incidence and (2) etiologic agents of neonatal sepsis in normal and low birth-weight study subjects. Two specific challenges lent themselves to solutions using data science tools and methods:
Data wrangling. The research team gathered data from the summer of 2018 through summer 2019 in support of the research aims defined above. The data were collected on paper forms and later manually entered as digital records. The dataset needed cleaning, validation, and pre-processing in advance of analysis.
Modeling. The research proposed analysis of several aspects of the data, including the incidence and covariates (such as birth weight) of neonatal sepsis. I planned to use models implemented in common data science libraries like scikit-learn or statsmodels for this analysis, even though the purpose was more inferential than predictive.
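To illustrate the inferential angle (this is not code from the proposal), a formula-style logistic regression in statsmodels reports coefficients, p-values, and confidence intervals rather than just predictions. The column and file names below are hypothetical.

```python
# Hypothetical sketch: inferential logistic regression with statsmodels.
# The column names (sepsis, low_birth_weight, facility) and file name are
# assumptions, not the study's actual variable names.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("neonatal_records.csv")  # hypothetical de-identified extract

# Logit gives coefficients, standard errors, and confidence intervals,
# which suit an inferential question better than a bare accuracy score.
model = smf.logit("sepsis ~ low_birth_weight + facility", data=df).fit()
print(model.summary())    # coefficient table with p-values
print(model.conf_int())   # 95% confidence intervals
```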
The research team collected data over a one-year period at two healthcare facilities in the Amhara region of Ethiopia (Felegehiwot Hospital and Debretabor Hospital). The data were then manually transcribed into REDCap, a research data capture tool, and exported to Excel.
Once the data were exported from REDCap, our process included the following steps:
De-identification. Because of time constraints, I was not able to complete the necessary training with Emory’s IRB to handle personally identifiable information, so the research team de-identified the dataset before handing it to me.
EDA and Cleaning. Because the data were collected on paper forms and transcribed into REDCap, many data fields contained inconsistent types and formats that needed cleaning. Many categorical variables also needed to be converted into indicator variables (dummy variables); a short sketch of this step follows the list.
Reconciliation. The most challenging aspect of the data cleaning was the lack of unique, consistent identifiers for each study subject (mothers and children). I developed several tools to help the team in Ethiopia identify and reconcile these IDs so we could link all the data sources together (a hedged example of this matching also follows the list). Unfortunately, the time needed to complete this task exceeded the time available for the Capstone, so the fully combined dataset was not available for Capstone analysis.
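For the indicator-variable step, here is a minimal sketch using pandas; the column names (infant_sex, delivery_mode) and the export file name are hypothetical.

```python
import pandas as pd

# Hypothetical raw export: categorical fields arrive as inconsistent strings.
df = pd.read_excel("redcap_export.xlsx")  # hypothetical file name

# Normalize free-text categories before encoding (e.g., "Male" / "male " / "M").
df["infant_sex"] = df["infant_sex"].str.strip().str.lower()

# One-hot encode categorical variables into indicator (dummy) columns.
# drop_first=True avoids perfect collinearity in downstream regression.
df = pd.get_dummies(df, columns=["infant_sex", "delivery_mode"], drop_first=True)
```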
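The actual reconciliation tools are not reproduced in this write-up. As a hedged sketch of the general approach, fuzzy string matching with Python's standard-library difflib can surface candidate ID pairs for human review; all field names, file names, and the cutoff below are assumptions.

```python
import difflib
import pandas as pd

mothers = pd.read_csv("mother_forms.csv")      # hypothetical source files
followups = pd.read_csv("followup_forms.csv")

def candidate_matches(raw_id, known_ids, cutoff=0.85):
    """Return known IDs that closely resemble a possibly mistyped ID."""
    return difflib.get_close_matches(raw_id, known_ids, n=3, cutoff=cutoff)

known = mothers["study_id"].astype(str).tolist()
for raw in followups["study_id"].astype(str):
    matches = candidate_matches(raw, known)
    if matches and matches[0] != raw:
        # Flag for the field team to confirm, rather than auto-linking.
        print(f"{raw} -> candidates: {matches}")
```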
Because the fully consolidated dataset was not available during the Capstone window, I fell back to analyzing segments of the dataset. Specifically, I completed the following analyses, which I describe briefly below.
All of these models were implemented using scikit-learn.
I hoped unsupervised learning might reveal latent patterns in the dataset, so I ran both K-Means clustering and DBSCAN models on both the Follow-up and Clinical Sample data. The best silhouette score for each model and dataset, along with the corresponding hyperparameters, appears in the table below; a sketch of the scoring loop follows it.
| Dataset | K-Means | DBSCAN |
|---|---|---|
| Follow-up Interviews | 0.87 (k=2) | 0.79 (ε=0.68, 4 clusters) |
| Etiologic Agents & AMR | 0.36 (k=9) | 0.32 (ε=0.1, 5 clusters) |
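Scores like these could be produced with a loop like the following; this is a minimal sketch assuming a preprocessed feature matrix, where the file name and hyperparameter ranges are placeholders rather than the actual search grid.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import silhouette_score

# X: the preprocessed (largely binary-indicator) feature matrix; placeholder file.
X = np.load("followup_features.npy")

best = {}
for k in range(2, 12):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best.get("kmeans", (-1, None))[0]:
        best["kmeans"] = (score, k)

for eps in np.arange(0.1, 1.0, 0.02):
    labels = DBSCAN(eps=eps).fit_predict(X)
    # Silhouette needs at least 2 clusters (ignoring noise points labeled -1).
    if len(set(labels) - {-1}) >= 2:
        mask = labels != -1
        score = silhouette_score(X[mask], labels[mask])
        if score > best.get("dbscan", (-1, None))[0]:
            best["dbscan"] = (score, eps)

print(best)
```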
Because so much of the data was either categorical (and had been split into binary indicators) or dichotomous, neither clustering technique revealed anything meaningful. Even in the follow-up interview clustering, where silhouette scores were relatively high, the clusters themselves reflected little beyond where the interview happened and whether or not the subject had died.
More interesting was classification of an infant’s birthweight status. Each baby in the study was classified as either “normal” birthweight (recorded birthweight of 2,000 g or more) or “low” birthweight (otherwise). Of our 615 observations, 19.4% of subjects were classified as low birthweight, so the baseline accuracy the model needed to outperform was about 81%.
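A quick sketch of the label construction and the majority-class baseline, assuming a hypothetical cleaned dataset with a birthweight_g column:

```python
import pandas as pd

df = pd.read_csv("followup_clean.csv")  # hypothetical cleaned dataset

# Binarize birth weight at the study's 2,000 g cutoff (column name assumed).
df["low_birth_weight"] = (df["birthweight_g"] < 2000).astype(int)

# Majority-class baseline: always predict "normal" birth weight.
baseline = 1 - df["low_birth_weight"].mean()
print(f"Baseline accuracy: {baseline:.1%}")  # ~80.6% in this dataset
```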
The Follow-up interview dataset contained 615 observations, 156 of which contained at least one null value. I experimented with a Python package called autoimpute to perform Predictive Mean Matching (PMM) imputation for those records. The imputation proved fragile due to non-convergence (it threw runtime errors on all but two runs), so I instead omitted the columns that contained null values and ran two classification models: K-Nearest Neighbors and cross-validated logistic regression.
Grid-searching over 18 combinations of hyperparameters produced a “best” KNN model with 91% accuracy, noticeably above the 81% baseline. The confusion matrix and summary metrics follow, then a sketch of the pipeline.
|  | Actual + | Actual - |
|---|---|---|
| Pred + | 18 | 1 |
| Pred - | 13 | 122 |
| Metric | Value |
|---|---|
| Baseline Accuracy | 80.65% |
| Model Accuracy | 90.91% |
| Specificity | 99.19% |
| Sensitivity | 58.06% |
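As referenced above, here is a hedged sketch of the fallback pipeline (drop null-containing columns, then grid-search a KNN classifier). The particular 18-combination grid and the column and file names are assumptions.

```python
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv("followup_clean.csv")     # hypothetical cleaned dataset
df = df.dropna(axis=1)                     # omit columns containing any nulls

X = df.drop(columns=["low_birth_weight"])  # target column name assumed
y = df["low_birth_weight"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# 3 x 2 x 3 = 18 hyperparameter combinations (the grid itself is an assumption).
grid = {
    "n_neighbors": [3, 5, 7],
    "weights": ["uniform", "distance"],
    "p": [1, 2, 3],
}
search = GridSearchCV(KNeighborsClassifier(), grid, cv=5).fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```

With scikit-learn's default 25% hold-out, a split of 615 rows yields a test set of 154 observations, consistent with the confusion matrices reported here.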
Subsequent modeling with logistic regression outperformed this, however, so I did not iterate on the KNN model further.
Logistic regression performed even better than KNN, with an initial model reaching 96.8% accuracy. Examining this model more closely revealed that a small subset of features drove most of its explanatory power, so I ran three additional iterations that excluded progressively more features. The final model, which used just three features (number of previous low-birthweight births, number of previous pre-term births, and number of previous live births), accurately predicted birthweight status over 95% of the time, and its AUC-ROC slightly exceeded that of the initial regression model (see the nearby figure).
This could be a valuable tool for predicting birthweight status for non-first-time mothers. It’s worth noting that because we did not yet have a fully consolidated dataset, this finding – while promising – is only preliminary.
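A sketch of what the final reduced model could look like in scikit-learn; the three feature column names are hypothetical stand-ins for the variables named above, and the file name is assumed.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("followup_clean.csv")  # hypothetical cleaned dataset

# Hypothetical column names for the three retained features.
features = ["prev_low_bw_births", "prev_preterm_births", "prev_live_births"]
X, y = df[features], df["low_birth_weight"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Cross-validated logistic regression over the reduced feature set.
model = LogisticRegressionCV(cv=5, max_iter=1000).fit(X_train, y_train)
print("Accuracy:", model.score(X_test, y_test))
print("AUC-ROC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
print(confusion_matrix(y_test, model.predict(X_test)))
```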
The confusion matrix on the test dataset (n=154) and summary model metrics follow.
|  | Actual + | Actual - |
|---|---|---|
| Pred + | 120 | 3 |
| Pred - | 4 | 27 |
| Metric | Value |
|---|---|
| Baseline Accuracy | 80.65% |
| Model Accuracy | 95.45% |
| Specificity | 97.56% |
| Sensitivity | 87.10% |