
**Modeling Bronchopneumonia Status in Infants Using Discriminant and Logistic Regression Analysis**

**ABSTRACT**

This work applies Discriminant Analysis and Logistic Regression models to predict Bronchopneumonia (BPn) status in infants. The data used in this study were collected from two tertiary health institutions in the North Central Zone: University Teaching Hospital (UTH), Abuja and Federal Medical Centre (FMC), Keffi, Nassarawa State. Five predictors that are well recognized for characterizing Bronchopneumonia in infants (baby’s weight at birth, baby’s weight four weeks after birth, sex, mother’s age and mother’s occupation) were considered. One hundred and eighty (180) and two hundred and fifty-three (253) infants with Low Birth Weight (LBW) were selected by simple random sampling from UTH, Abuja and FMC, Keffi respectively to build the models. Both Linear Discriminant and Logistic Regression Models were fitted to the data for the two groups, and the better model was identified. Ten different samples of size 10 each were randomly drawn from the dataset using the SPSS package, and these new datasets were used to validate the two models. The Discriminant Model was found to perform better in the zone than the Logistic Regression Model. Baby’s weight at birth was also found to be the best predictor for discriminating between the two groups, since it has the smallest value of Wilks’ Lambda among the predictor variables.

**Acronyms**

- **LRA:** Logistic Regression Analysis
- **BPn:** Bronchopneumonia
- **BPD:** Bronchopulmonary Dysplasia
- **LBW:** Low Birth Weight
- **VLBW:** Very Low Birth Weight
- **ELBW:** Extremely Low Birth Weight
- **LDA:** Linear Discriminant Analysis
- **OR:** Odds Ratio
- **NBI:** Narrow Band Imaging
- **ASD:** Angiogenic Squamous Dysplasia
- **SVM:** Support Vector Machine
- **TLC:** Total Lymphocyte Count
- **PCA:** Principal Component Analysis
- **VAT:** Visceral Adipose Tissue
- **NLR:** Neutrophil-to-Lymphocyte Ratio
- **UTH:** University Teaching Hospital
- **FMC:** Federal Medical Centre

**CHAPTER ONE**

**1.0 INTRODUCTION**

**1.1 Background of the Study**

Discriminant analysis is a procedure that can be used to build Discriminant functions which are linear functions of p-variables that can be used to describe or elucidate the differences among two or more groups. The goals of discriminant analysis include identifying the relative contribution of the p variables to separation of the groups and finding the optimal plane on which the points can be projected to best illustrate the configuration of the groups. Another use of discriminant analysis is the prediction or allocation of observations to groups, in which linear functions of the variables are employed to assign an individual sampling unit to one of the groups. The measured values in the observation vector for an individual or object are evaluated by the classification function to find the particular group to which the individual most likely belongs.
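As a worked illustration of the linear function described above, the two-group case can be written out explicitly. This is a sketch of the standard Fisher form, not necessarily the exact specification used later in this study. The symbols are assumptions introduced here: $\mathbf{y}$ is the observation vector of the $p$ variables, $\bar{\mathbf{y}}_1$ and $\bar{\mathbf{y}}_2$ are the group mean vectors, and $\mathbf{S}_{pl}$ is the pooled sample covariance matrix. The allocation rule shown assumes equal covariance matrices and equal prior probabilities for the two groups.

```latex
z = \mathbf{a}'\mathbf{y} = a_1 y_1 + a_2 y_2 + \cdots + a_p y_p,
\qquad
\mathbf{a} = \mathbf{S}_{pl}^{-1}\,(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)

\text{assign } \mathbf{y} \text{ to group 1 if }
\mathbf{a}'\mathbf{y} > \tfrac{1}{2}\,\mathbf{a}'(\bar{\mathbf{y}}_1 + \bar{\mathbf{y}}_2),
\text{ otherwise to group 2}
```

The coefficients $a_1, \ldots, a_p$ indicate the relative contribution of each variable to group separation, and the cut-off is the midpoint between the two projected group means.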

Interest in human development before birth is widespread because of curiosity about our beginnings and the desire to improve the quality of life. The intricate process by which a baby develops from a single cell is miraculous, and few events are more exciting than a mother’s viewing of her embryo during an ultrasound examination. Human development is a continuous process that begins when an oocyte (ovum) from a female is fertilized by a sperm (spermatozoon) from a male. By accepting the shelter of the uterus, the fetus also takes the risk of disease or malnutrition and of biochemical, immunological and hormonal adjustment.

Until the beginning of the nineteenth century, far more attention was paid to the collection and presentation of data than to their interpretation. Large volumes of data were usually collected and were frequently misinterpreted, if interpretation was attempted at all. Since that time, however, the importance of a scientific approach to the interpretation of data has been realized, and great strides have been made in the development of appropriate methods.

In modern times, statistics has played a significant role in the Biological, Pharmaceutical and Medical Sciences (Cornfield, 1952). The application of multivariate statistical techniques to biological and medical data has dominated the areas of evidence-based medicine. Multivariate methods are relevant in virtually every branch of applied medicine, pharmacy and public health. They come into play either when we have a medical theory to test or when we have a relationship in mind that has some importance for medical decisions or policy analysis in public health. Multivariate methods are also used in other disciplines.

Multivariate methods are prominently used on data to test a theory or to estimate a relationship in different disciplines. In some cases, especially those that involve the testing of medical theories, a formal multivariate model is constructed. The model consists of a multivariate technique that describes various relationships. In most cases, the model is used to make predictions, either in the testing of a medical theory or in the study of a policy impact in pharmacy and public health.

Kirkwood and Stern (2008) defined discriminant analysis and classification as the multivariate techniques concerned with separating sets of objects or observations and with allocating new objects or observations to previously defined groups. As a separation procedure, it is often employed on a one-time basis in order to investigate observed differences when causal relationships are not well understood. The immediate goals of discriminant analysis and classification are to describe the differential features of objects, so as to find discriminant functions whose numerical values separate the collections as much as possible, and to sort new objects or observations into two or more classes or groups.

In clinical situations, the status of a patient is assessed by the presence or absence of a disease. There are many factors to consider which may or may not correlate with the incidence of the disease. Numerous retrospective medical research studies are published each year that review past medical records and charts of former patients to help determine some of the risk factors (or causal agents) of the diseases of interest. Finding the risk factors and the potential risk factors can help to prevent the development of the disease. For all of these diseases, most of the risk factors considered are categorical variables, i.e. variables taking on two or more possible values. Hosmer and Lemeshow (1989), two prominent statisticians, stated that “the logistic regression model has become the standard method of analysis in this situation.”

Logistic regression analysis is also called “Binary Logistic Regression Analysis”, “Multinomial Logistic Regression Analysis” or “Ordinal Logistic Regression Analysis”, depending on the scale on which the dependent variable is measured and the number of categories of the dependent variable. Logistic regression is divided into two types: Univariate Logistic Regression and Multivariate Logistic Regression (Stephenson, 2006).

Like any other model-building technique, the goal of logistic regression analysis is “to find the best fitting and most parsimonious, yet biologically reasonable model to describe the relationship between an outcome (dependent or response) variable and a set of independent (predictor or explanatory) variables” (Hosmer and Lemeshow, 1989). This statement motivates the purpose of this study, which is to identify risk factors for low birth weight (LBW) in newborn infants using the statistical tools of logistic regression analysis.

The use of logistic regression dates back to 1845. It first appeared in mathematical studies of population growth at that time. The term logistic regression analysis comes from the logit transformation, which is applied to the dependent variable. This transformation also leads to certain differences in both estimation and interpretation (Anderson, 2008).

In many application areas, such as epidemiological and biomedical studies, where outcomes may be occurrence or nonoccurrence, mortality (dead or alive), and so forth, logistic regression is the standard approach for the analysis of binary and categorical outcome data.

Logistic Regression Analysis (LRA) extends the techniques of multiple regression analysis to research situations in which the outcome variable is categorical. In practice, situations involving categorical outcomes are quite common. In the setting of evaluating an educational program, for example, predictions may be made for dichotomous outcomes such as success/failure or improved/not-improved. Similarly, in a medical setting, an outcome might be the presence or absence of disease. The focus of this study is on situations in which the outcome variable is dichotomous, although extension of the techniques of LRA to outcomes with three or more categories (e.g., improved, same, or worse) is possible.
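To make the dichotomous case concrete, the following minimal Python sketch shows the logit transformation and how a linear predictor is mapped back to a probability. The function names, the intercept, the coefficients and the example infant are all hypothetical and chosen only for illustration; they are not estimates from this study.

```python
import math


def logit(p: float) -> float:
    """Return the log-odds of a probability p in the open interval (0, 1)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def inverse_logit(z: float) -> float:
    """Map a linear predictor z back to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predicted_probability(intercept, coefficients, predictors):
    """Probability of the outcome coded 1 under a binary logistic model."""
    z = intercept + sum(b * x for b, x in zip(coefficients, predictors))
    return inverse_logit(z)


# Hypothetical values for illustration only (not fitted to any data).
intercept = 1.0
coefficients = [-0.8, 0.02]  # birth weight (kg), mother's age (years)
infant = [2.0, 25.0]         # a made-up infant

p = predicted_probability(intercept, coefficients, infant)
status = 1 if p >= 0.5 else 0  # classify with a 0.5 cut-off
print(f"predicted probability = {p:.3f}, predicted status = {status}")
```

The 0.5 cut-off is a common default for a dichotomous outcome; a different threshold may be more appropriate when the two groups differ greatly in size or when misclassification costs are unequal.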

In the twenty-first century, statistics plays an important role in many simulation, modeling and decision-making processes. This implies the need for statistical research in every facet of medicine, especially evidence-based medicine. Anderson (2008) mentioned that the critical factor that separates statistical research from other ways of knowing the medical world is that statistical research is purely scientific in nature. In this sense, science refers to both a system for producing medical knowledge and the knowledge produced. Science is also a combination of an orientation towards a set of procedures, techniques, knowledge and instruments for gaining knowledge.

**1.2 Bronchopneumonia**

Pneumonia is an illness, usually caused by infection, in which the lungs become inflamed and congested, reducing oxygen exchange and leading to cough and breathlessness. It affects individuals of all ages but occurs most frequently in children and the elderly. Historically, in developed countries, deaths from pneumonia have been reduced by improvements in living conditions, air quality, and nutrition. In the developing world today, many deaths from pneumonia are also preventable by immunization or access to simple, effective treatments (Anthony, 2010).

Pneumonia can be caused by bacteria, viruses and fungi. *Streptococcus pneumoniae* and *Haemophilus influenzae* type b (Hib) are the most common causes of bacterial pneumonia, while respiratory syncytial virus is the most common viral cause of pneumonia. A yeast-like fungus, *Pneumocystis jiroveci*, is often responsible for pneumonia deaths in HIV-infected infants.

Bronchopneumonia, also called bronchial pneumonia or bronchogenic pneumonia, is a type of pneumonia characterized by multiple foci of isolated, acute consolidation, affecting one or more pulmonary lobules. It is one of two types of bacterial pneumonia as classified by the gross anatomic distribution of consolidation (solidification), the other being lobar pneumonia. Bronchopneumonia is less likely than lobar pneumonia to be associated with *Streptococcus*.

The bronchopneumonia pattern has been associated with hospital-acquired pneumonia and with specific organisms such as *Staphylococcus aureus*, *Klebsiella*, *E. coli* and *Pseudomonas*. In bacterial pneumonia, invasion of the lung parenchyma by bacteria produces an inflammatory immune response. This response leads to a filling of the alveolar sacs with exudates. The loss of air space and its replacement with fluid is called consolidation.

Broncho-Pulmonary Dysplasia (BPD) is a serious, chronic type of lung disease prevalent among infants. This disease, if present in a pregnant mother, leads to low birth weight of infants at birth. It mostly affects premature infants who need oxygen therapy (oxygen given through nasal prongs, a mask or a breathing tube). Most infants who develop BPD are born more than ten weeks before their due dates, weigh less than 2 pounds (about 1kg) at birth, and have breathing problems (Jobe, 2001).

**1.3 Low Birth Weight**

Low Birth Weight (LBW) is described as a birth weight of a live-born infant of less than 2.5kg regardless of gestational age. Subcategories include Very Low Birth Weight (VLBW), which is less than 1.5kg, and Extremely Low Birth Weight (ELBW), which is less than 1.0kg. Normal weight at term of delivery is 2.5kg to 4.2kg. Most normal babies weigh above 2.5kg by 37 weeks of gestation. Intrauterine growth restriction refers to delayed growth within the uterus, which then leads to low birth weight. Some babies are just small and happen to weigh less than 2.5kg at birth, just as some adults are smaller than others. Though this is considered low birth weight, in these cases it is neither abnormal nor a cause for concern.
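The thresholds above can be expressed as a small helper. This is only an illustrative sketch: the function name and the returned labels are hypothetical, and because the subcategories are nested (an ELBW infant is also VLBW and LBW), the function returns the most specific category that applies.

```python
def birth_weight_category(weight_kg: float) -> str:
    """Return the most specific birth-weight category for a weight in kg.

    Thresholds follow the definitions in the text:
    LBW < 2.5 kg, VLBW < 1.5 kg, ELBW < 1.0 kg.
    """
    if weight_kg <= 0:
        raise ValueError("weight_kg must be positive")
    if weight_kg < 1.0:
        return "ELBW"
    if weight_kg < 1.5:
        return "VLBW"
    if weight_kg < 2.5:
        return "LBW"
    return "not LBW"


# Made-up weights, for illustration only.
for w in (0.9, 1.2, 2.4, 3.1):
    print(w, birth_weight_category(w))
```

A weight of exactly 2.5kg is treated as not low, since the definition uses "less than 2.5kg".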

The use of discriminant analysis and logistic regression is of interest to this study. We will use a sample of at least 400 infants drawn from an underlying population of children with low birth weight (kg). These children were confined to a neonatal intensive care unit, required incubation during the first 12 hours of life, survived for at least 28 days, and had their weights measured four weeks after birth. Healthy infants are denoted by (0), while infected infants are denoted by (1). Factors that contribute to the risk of Broncho-Pulmonary Dysplasia (BPD) include high blood pressure in mothers, hypercholesterolemia in mothers and family history of tobacco smoking, among others (Eneh, 2011).

In strict terms, the application of statistical techniques to biological and medical data is called Biostatistics. Generally speaking, biostatistical methods are relevant in virtually every branch of applied medicine, pharmacy, nutrition and public health. They come into play either when we have a medical theory to test or when we have a relationship in mind that has some importance for medical decisions or policy analysis in public health. Biostatistical methods in medicine are essentially empirical analyses that use data to test a theory or to estimate a relationship in medicine, pharmacy, public health and other areas.

In some cases, especially those that involve the testing of medical theories, a formal statistical model is constructed. The model consists of statistical equations that describe various relationships. A biostatistical analysis begins by specifying a statistical model. Once a statistical model has been specified, various hypotheses of interest can be stated and empirically tested in terms of the unknown biological or medical parameters. An empirical analysis requires data which are used to estimate model parameters and to formally test hypotheses of interest. In most cases, the model is used to make predictions in either the testing of a medical theory or the study of a policy’s impact in pharmacy and public health (Rencher, 2002).

Some statistical models in medical research may contain dichotomous factors; for example, a person is male or female, or a person does or does not have the disease in question. In all of these examples, the relevant information can be captured by defining a classification discriminant model.

**1.4 Statement of the Problem**

A birth weight of less than 2.5kg is categorized as Low Birth Weight (LBW). It remains a significant public health problem in both developed and developing countries. Infants with LBW encounter greater neonatal morbidity and mortality and significantly higher rates of physical and mental handicaps later in life (Pope, 2010). Considering the infant population globally, the proportion of babies with LBW is an indicator of a multifaceted public-health problem that includes the sex of an infant as well as the birth weight and the weight four weeks after birth. The mother’s age and mother’s occupation are also important variables that could predict the LBW of the infants considered in the study. Therefore, the main problem addressed in this study is how to construct linear discriminant and logistic regression models that are capable of predicting the Bronchopneumonia (BPn) status of an infant using mother’s age, mother’s occupation, baby’s sex, baby’s weight at birth and baby’s weight four weeks after birth as predictor variables. The core research issue is therefore to explore the predictive powers of both the Linear Discriminant Model and the Logistic Regression Model as regards statistical modeling.

Since the models involve both discrimination and classification, it is in the interest of the researcher to classify infants as affected or unaffected by Bronchopneumonia using Linear Discriminant and Logistic Regression Models. Hence, a suitable prediction model will be developed that satisfies the best methods of validation as well as the diagnostics of statistical decisions. Moreover, the Linear Discriminant Model and Logistic Regression Model could be used to predict the BPn status of new cases of infants.

**1.5 Aim and Objectives of the Study**

The aim of this study is to investigate the bronchopneumonia status of infants using linear discriminant and logistic regression models. This will be achieved through the following objectives:

- Constructing linear discriminant and logistic regression models that are capable of predicting the Bronchopneumonia status of infants;
- Predicting the Bronchopneumonia status of some infants (randomly selected cases) using the developed models;
- Comparing the predictive powers of the two models for Bronchopneumonia;
- Determining the predictor that has the most discriminating ability among the predictors.

**1.6 Significance of the Study**

The Linear Discriminant and Logistic Regression Models built in this study will provide an effective guide in evidence-based medicine, that is, for achieving useful projections of the BPn status of infants so as to isolate the factors responsible for it. In addition, the study will assist medical researchers in ascertaining the prevalence of BPn using the developed models.

**1.7 Scope of the Study**

The purpose of this study is to develop the models based on five predictors: weight at birth, weight four weeks after birth, sex, mother’s age and mother’s occupation. The five independent variables are incorporated in both the linear discriminant and logistic regression models as the most relevant factors considered and captured by the study.

The study also focuses on the North Central Zone, one of the six geo-political zones of the Federation. The sample taken from two tertiary health institutions will be used for the analysis of the prevalence of BPn among infants, which leads to Low Birth Weight (LBW) in infants.

**1.8 Definition of Terms**

**Bronchopneumonia (BPn):** A type of pneumonia characterized by multiple foci of isolated, acute consolidation, affecting one or more pulmonary lobules.

**Bronchopulmonary Dysplasia (BPD):** A chronic type of lung disease prevalent among infants. This disease, if present in a pregnant mother, leads to low birth weight of infants at birth.

**Low Birth Weight (LBW):** A birth weight of a live-born infant of less than 2,500g (5 pounds 8 ounces) regardless of gestational age.

**Discriminant Function:** A multivariate technique concerned with separating distinct sets of objects (or observations); it gives the rule for allocating objects (or observations) to previously defined groups.

**Logistic Regression (Logit):** A regression technique that deals with cases where the response variable takes two or more categorical values.