Application of Item Response Theory in the Development and Validation of Multiple Choice Test in Economics


The study applied item response theory (IRT) in the development and validation of a multiple choice test in Economics. An instrumentation research design was used. A sample of 1005 senior secondary school II (SS II) Economics students was randomly selected from 46 government co-educational schools. Six research questions were posed and two hypotheses were formulated to guide the study. A 50-item Economics multiple choice test developed by the researcher was used for data collection. To ensure its validity, the instrument was subjected to face and content validation by three experts, two from the Department of Science Education and one from the Department of Economics. A reliability index of 0.89 was obtained. The data were analyzed using the maximum likelihood estimation technique of the BILOG-MG computer program. The analysis revealed that all 50 test items survived; therefore, the final instrument developed for assessing students’ ability in Economics contained 50 items with appropriate indices. The results showed that 49 items of the multiple choice test were reliable based on the three-parameter logistic (3PL) model. The findings also showed that thirty-one (31) items of the test were difficult. The findings further revealed that items functioned differentially between male and female students. Based on the findings, it was recommended, among other things, that examination bodies and teachers should adopt IRT in developing test items used to measure students’ ability in Economics.



Background to the Study

Economics is one of the senior secondary school subjects that require assessment to ascertain students’ basic knowledge, skills and understanding of the concepts and the nature of economic problems in any society. Economics has been defined variously by many authorities. These different definitions arise because Economics studies human behavior, and people behave differently. Mankiw (2001) defined Economics as the study of how society manages its scarce resources. Egunjobi and Egwakhide (2010) opined that Economics is the study of human endeavors in respect of production, distribution, exchange and consumption. Economics, according to Orji (2002), is the science of scarcity and choice. This implies that when resources are limited in quantity relative to their uses, they are scarce, and scarcity forces the individual to choose among the alternatives. In Nigeria, Economics came into the secondary school curriculum in 1966 (Obemeata, 1991). The objectives of studying Economics, according to Asadu (2001), are:

· To enable students to acquire knowledge for the practical solution of the economic problem of Nigerian societies, developing countries and the world at large.

· To prepare and encourage students to be cautious and effective in the management of scarce resources.

· To equip students with the basic principles of economics necessary for useful living.

· To increase students’ respect for the dignity of labour and their appreciation of the economic, cultural and social values of the society.

The objectives discussed suggest that the study of Economics is a form of learning in which the knowledge, skills and habits of a group of people are transferred from one generation to the next through teaching, training or research. Learning is simply described as a change in behavior as a result of experience (Maduewesi, 1999). According to Black and William (2009), learning is tied to effective assessment, which involves monitoring students’ progress and feeding that information back to students. Because learning is unpredictable, assessment is necessary to make adaptive adjustments to instruction, but assessment processes themselves affect the learner’s willingness, desire and capacity to learn (Harlen & Deakin-Crick, 2002). Assessment is the systematic collection, review and use of information about educational programs to improve student learning. In the view of Huba and Freed (2000), assessment is the process of gathering and discussing information from multiple and diverse sources in order to develop a deep understanding of what students know, understand and can do with their knowledge as a result of their educational experiences. This idea can be seen in the Federal Republic of Nigeria (FRN) policy on education concerning continuous assessment, which is supposed to be implemented at all levels of the educational system for both adult and young learners (FRN, 2004). This type of assessment can be effected through the use of achievement tests. Malcolm (2003) viewed an achievement test as an examination designed to assess how much knowledge a person has in a certain area or set of areas. The following are some objectives of achievement tests:

· To measure whether students possess the pre-requisite skills needed to succeed in any unit or whether the students have achieved the objectives of the planned instruction.

· To monitor students’ learning and to provide on-going feedback to both students and teachers during the teaching-learning process.

· To identify students’ learning difficulties, whether persistent or recurring.

· To assign grades.

These objectives can be achieved by using different assessment instruments, such as essay tests and objective tests, which the teacher selects depending on the aims of the measurement. The focus of this study is on objective tests. The objective test is one of the assessment instruments used in assessing students’ academic achievement after any given instruction. In objective tests, such as multiple choice questions, respondents are required to select the best possible answer (or answers) from a list of choices (Okoro, 2006). Multiple choice items consist of a stem and a set of options. The stem is the beginning part of the item that presents the problem to be solved, a question asked of the respondent, or an incomplete statement to be completed, as well as any other relevant information. The options are the possible answers that the examinee can choose from, with the correct answer called the key and the incorrect answers called distractors. Test scores obtained from multiple choice questions are used to assess the competence of the students. Several advantages of multiple choice questions are reported in the literature. Multiple choice test items can be used to measure both the lower and higher levels of the cognitive domain (Onunkwo, 2002). Multiple choice tests, unlike essay tests, allow the teacher to ask a large number of questions that adequately cover the course content (Okoro, 2006). Bush (2001) noted that multiple choice questions can increase the test-taker’s probability of guessing the right answer to a question by eliminating unlikely choices. Multiple choice tests are generally much more objective, because they are mostly self-administered and scorers can apply a scoring key which allows them to agree perfectly (Meredith, Joyce & Walter, 2007). However, all assessment instruments must satisfy the criteria of reliability, validity, objectivity and usability (Anene & Ndubisi, 2003).
Reliability is conceived in relation to the extent of consistency or dependability of a measuring instrument (Abonyi, 2011). This implies that if any test in Economics were applied an infinite number of times, it would be expected to generate responses that vary a little from trial to trial as a result of measurement error. Therefore, for any measuring instrument, the smaller the error, the greater the reliability, and the greater the error, the smaller the reliability. Individual scores on a test can be viewed as the combined result of the true score and measurement error. The type of measurement error that is used in interpreting individual scores is called the standard error of measurement. The standard error of measurement, according to Onunkwo (2002), provides the standard deviation of a series of measurements taken on the same individual. Validity refers to the extent to which an instrument measures what it is designed to measure (Nworgu, 2006). A test with high validity will measure accurately the particular qualities it is supposed to measure. The objectivity of a test refers to whether its scores are undistorted by the biases of the individuals who administer and score it, while the usability of a test is the extent to which it provides the teacher or test administrator with clear instructions that can be put into practice without much difficulty or confusion. In other words, a test in Economics is usable if it does not force students to waste time working out how to record their answers. Nevertheless, instrument development in Economics requires more than the determination of the reliability, validity, objectivity and usability of the items. Other indices, such as item difficulty, item discrimination and distractor performance, are also required to determine the quality of the instrument.
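
The link between reliability and the standard error of measurement described above can be illustrated with a short sketch. It assumes the standard classical relationship SEM = SD × sqrt(1 − reliability), which is not stated in this paper. The reliability of 0.89 is the index reported for the instrument; the score standard deviation of 8.0 is a hypothetical value chosen only for illustration.

```python
import math


def standard_error_of_measurement(score_sd: float, reliability: float) -> float:
    """Classical SEM: the expected spread of observed scores around a true score.

    score_sd    -- standard deviation of observed test scores
    reliability -- reliability coefficient of the test, between 0 and 1
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return score_sd * math.sqrt(1.0 - reliability)


# Reliability taken from the study; the SD of 8.0 is a hypothetical value.
sem = standard_error_of_measurement(score_sd=8.0, reliability=0.89)

# A perfectly reliable test would have no measurement error.
sem_perfect = standard_error_of_measurement(score_sd=8.0, reliability=1.0)

print(f"SEM with reliability 0.89: {sem:.2f} score points")
print(f"SEM with reliability 1.00: {sem_perfect:.2f} score points")
```

The sketch shows the inverse relationship noted in the text: as reliability rises toward 1, the standard error of measurement shrinks toward zero.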

Unfortunately, teachers of Economics do not usually determine these qualities for the tests they set. The reason may be that the questions are assumed not to require these qualities, or that teachers lack the knowledge needed to set quality tests. This may contribute to students’ failure in West African Examinations Council (WAEC) examinations. However, the procedures for determining these indices or parameters of the items of an instrument depend on the measurement theory used. The two distinct measurement theories are Classical Test Theory (CTT) and Item Response Theory (IRT). Classical test theory is based on the true score theory, which views the observed score (X) as a combination of the true score (T) and an error component (E) (Adedoyin, 2010). The observed score of a test-taker is usually seen as an estimate of the true score of the test-taker plus or minus some unobservable measurement error (Crocker & Algina, 2008). An advantage of classical test theory is that it is relatively simple and easy to interpret. CTT does not have a complex theoretical model relating an examinee’s ability to success on a particular item. Instead, CTT considers a pool of examinees collectively and examines empirically their success on a particular item. However, CTT can be criticized because the item difficulty may vary depending on the sample of test-takers. It is therefore difficult to compare test-takers’ results between different tests. Secondly, Npkone (2001) asserted that the proportion of examinees in a sample who get an item correct changes from a sample whose mean ability is high to one whose mean ability is low.
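
The sample dependence of the CTT item difficulty index can be illustrated with a minimal sketch. The function name and the two response vectors are hypothetical and are not data from this study; each vector records whether ten examinees answered the same item correctly (1) or incorrectly (0).

```python
def ctt_item_difficulty(item_responses):
    """CTT difficulty index (p-value): proportion of examinees answering correctly."""
    if not item_responses:
        raise ValueError("at least one response is required")
    return sum(item_responses) / len(item_responses)


# Hypothetical responses to the SAME item from two different samples.
high_ability_sample = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]  # 8 of 10 correct
low_ability_sample = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]   # 3 of 10 correct

p_high = ctt_item_difficulty(high_ability_sample)
p_low = ctt_item_difficulty(low_ability_sample)

print(f"p-value in the high-ability sample: {p_high:.2f}")
print(f"p-value in the low-ability sample:  {p_low:.2f}")
```

The same item appears easy in one sample and difficult in the other, which is the group dependence of CTT item statistics that the text describes.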

Despite these limitations, CTT is still being used to describe the estimates of achievement tests in secondary schools. For instance, students’ achievements in Economics are often summarized with statistical measures such as the mean and standard deviation. These statistics change for a test when another sample from the same population of students is used. The estimates or indices obtained depend on which samples were chosen from the student population. In other words, there is heavy dependence on students’ total (aggregate) scores in a test, while achievement on individual items is not determined. Therefore, to ensure effective teaching and learning of Economics in schools, an achievement test that focuses on attainment on individual items will have better utility than one that focuses on students’ aggregate scores. An educational measurement scale that has ratio scale properties, sample-independent attributes and students’ ability reported at both the item and total instrument levels can be developed with the measurement theory called Item Response Theory (IRT), otherwise known as modern test theory. Item Response Theory (IRT) is, for some researchers, the answer to the limitations of classical test theory (Troy-Gerard, 2004). Item response theory is a modeling technique that tries to describe the relationship between an examinee’s test performance and the latent trait underlying the performance (Henard, 2000). Reeve (2002) describes item response theory as a body of theory describing the application of mathematical models to data from questionnaires and tests as a basis for measuring things such as abilities and attitudes. Item Response Theory (IRT) looks at the examinee’s performance by using item distributions based on the examinee’s probability of success on a latent variable. In IRT, item statistics, also referred to as parameters, are estimated and interpreted.
Under IRT, the parameters of the persons are invariant across items, and the parameters of the items are invariant across different populations of persons. IRT brings greater flexibility and provides more sophisticated information, which allows for the improvement of the reliability of an assessment.

According to Nenty (2004), invariance is the bedrock of objectivity in physical measurement, and the lack of it raises a lot of questions about the scientific nature of psychological measurement. Item response theory is a collection of different models showing the relationship between a participant’s response to an item and the underlying latent trait (Ercikan & Koh, 2005). These models were originally developed for items that are scored dichotomously (correct or incorrect), but the concepts and methods of IRT extend to a wide variety of polytomous models for all types of psychological variables that are measured by rating scales of various kinds (Vander & Hambleton, 1997). An IRT model assumes that the performance of an examinee can be completely predicted or explained from one or more abilities. IRT models the probability of a correct answer using three logistic functions. The one-parameter logistic (1PL) model addresses the probability of a correct answer by allowing each question on an achievement test to have an independent difficulty parameter. The two-parameter logistic (2PL) model additionally models each item’s level of discrimination between high and low ability students, while the three-parameter logistic (3PL) model adds a third item parameter, called the pseudo-guessing parameter, that reflects the probability that an examinee with a very low trait level will correctly answer an item solely by guessing. This implies that students can correctly answer an item in an achievement test by guessing.
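
The three logistic models described above can be written as one function, sketched below under stated assumptions. The formula P(θ) = c + (1 − c) / (1 + exp(−a(θ − b))) is the standard form of the 3PL model; the symbols theta (ability), a (discrimination), b (difficulty) and c (pseudo-guessing), the function name and the parameter values are illustrative and are not estimates from this study. The optional scaling constant D = 1.7 used by some programs is omitted here.

```python
import math


def irf_3pl(theta, a=1.0, b=0.0, c=0.0):
    """Probability of a correct response under the 3PL model.

    theta -- examinee ability (latent trait level)
    a     -- item discrimination
    b     -- item difficulty
    c     -- pseudo-guessing parameter (lower asymptote)

    With c = 0 this reduces to the 2PL model; with c = 0 and a common,
    fixed discrimination for all items it reduces to the 1PL model.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# 1PL: an examinee whose ability equals the item difficulty succeeds half the time.
p_1pl = irf_3pl(theta=0.0, b=0.0)  # 0.5

# 3PL: at theta == b the probability lies halfway between c and 1.
p_3pl_at_b = irf_3pl(theta=0.5, a=1.2, b=0.5, c=0.2)  # 0.6

# 3PL: a very low-ability examinee still succeeds at about the guessing rate c.
p_3pl_low = irf_3pl(theta=-6.0, a=1.2, b=0.5, c=0.2)

print(p_1pl, p_3pl_at_b, round(p_3pl_low, 3))
```

The last value shows why the third parameter matters for multiple choice items: the curve does not fall to zero at low ability but levels off near the guessing parameter.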

Obinne (2012) observed that guessing is giving an answer or making a judgment about something without being sure of all the facts. The three-parameter (guessing) model gives the probability that an individual with a given ability responds correctly to an item with a given difficulty index, discrimination index and guessing index. The model assumes that the three parameters (difficulty, discrimination and guessing) are necessary for an accurate and valid estimate of the relationship between the probability of a correct response to an item and the trait level (ability) of an individual. Within the latent trait test model, the internal validity of a test is assessed in terms of the statistical fit of each item to the model. Fit to the model also implies that item discriminations are uniform and substantial and that there are no errors in item scoring. It also indicates that guessing has had a negligible effect on test scores. IRT models are extremely helpful for assessment instruments such as an Economics achievement test when trying to understand students’ abilities by examining their test performance. An Economics achievement test should also be fair to all examinees. A test instrument is said to be fair when two groups of equal ability with respect to the construct measured by the test earn the same score on each item of the test. Comparing the results of subgroups indicates which items are functioning differently for different groups of students. If the test is not fair, or yields different scores for subgroups (for instance, by gender), it is said to suffer from Differential Item Functioning (DIF). Differential item functioning analysis is a collection of statistical methods that indicates which items are functioning differently for different groups of students (Madu, 2012). This implies that differential item functioning would occur in an Economics achievement test if the Item Response Function (IRF) for an item is different for two groups.
In the view of Meredith, Joyce and Walter (2007), differential item functioning means that individuals of equal ability but from different subgroups (e.g., males and females) do not have the same probability of earning the same score. Gender is a broad analytic concept which highlights women’s roles and responsibilities in relation to those of men. Gender relates to the difference in sex (that is, either male or female) and how this quality affects dispositions and perceptions toward life and academic activities (Okoh, 2007). Hence, instruments developed for measuring achievement in Economics may suffer from differential item functioning if they lack the basic qualities that a test instrument should have. Moreover, even when they have some of these qualities, they are based on the CTT framework, where a large p-value difference or an item-by-group interaction may label an item as biased when in fact no bias exists. The type of measurement theory that analyzes performance at the item level rather than the aggregate level in an Economics achievement test is therefore the concern of this study.
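
The idea that DIF occurs when an item’s response function differs between two groups can be sketched as follows. This is an illustrative unsigned-area comparison of two 3PL curves, not the procedure implemented in BILOG-MG or used in this study. The function names, the integration range of −4 to 4 and the male and female parameter values are all hypothetical.

```python
import math


def irf_3pl(theta, a, b, c):
    """3PL item response function: probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def unsigned_area_between_irfs(params_ref, params_focal, lo=-4.0, hi=4.0, steps=800):
    """Approximate the area between two IRFs over [lo, hi] (midpoint rule).

    params_ref, params_focal -- (a, b, c) tuples for the same item,
    estimated separately in the reference and focal groups.
    An area of 0 means equally able examinees in both groups have the
    same probability of success, i.e. no DIF on this item.
    """
    width = (hi - lo) / steps
    area = 0.0
    for i in range(steps):
        theta = lo + (i + 0.5) * width
        gap = irf_3pl(theta, *params_ref) - irf_3pl(theta, *params_focal)
        area += abs(gap) * width
    return area


# Hypothetical parameters (a, b, c) for one item in two gender groups:
# same discrimination and guessing, but the item is harder for one group.
male_params = (1.0, 0.0, 0.2)
female_params = (1.0, 0.5, 0.2)

area_dif = unsigned_area_between_irfs(male_params, female_params)
area_no_dif = unsigned_area_between_irfs(male_params, male_params)

print(f"Area when difficulty differs by 0.5: {area_dif:.3f}")
print(f"Area when parameters are identical:  {area_no_dif:.3f}")
```

Because the comparison is made at matched ability levels, a non-zero area reflects a difference in how the item functions, not a difference in the overall ability of the two groups.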

Statement of the Problem

The Federal Republic of Nigeria (FRN) Policy on Education (2004) places strong emphasis on continuous assessment, which is required at all levels of education. Under this policy, teachers assess the knowledge, skills and abilities of students in Economics at senior secondary school. Every assessment is expected to treat test-takers equally, but the instruments that teachers develop under classical test theory hardly accomplish this purpose. This is because such instruments are group dependent, as are their item statistics, such as item difficulty and item discrimination.

Based on these limitations of the instrument developed under classical test theory, the researcher designed this study using a modern measurement theory to ensure objectivity in measurement of the students’ scores in analyzing Economics multiple choice test items. Therefore, the question addressed is: would item response theory influence the instrument development and validation of multiple choice test in Economics?

Purpose of the Study

The main purpose of this study was to apply item response theory in the development and validation of a multiple choice test in Economics. Specifically, the study determined the:

1. Standard errors of measurement of the test items of the multiple choice test in Economics.

2. Fit of the items of the Economics multiple choice test using three-parameter logistic (3PL) model.

3. Difficulty parameter of the test items of the multiple choice test in Economics.

4. Discrimination parameter of the test items of the multiple choice test in Economics.

5. Guessing parameter of the test items of the multiple choice test in Economics.

6. Differential item functioning of the test items of the multiple choice test in Economics with respect to gender.

Significance of the Study

The results of this study have both theoretical and practical significance. Theoretically, item response theory, a paradigm for the design, analysis and scoring of tests, questionnaires and similar instruments measuring abilities, attitudes or other variables, was used to show the relationship between a student’s test performance and the latent trait underlying the performance. The theory also provides a better view of the information each question provides about a student.

In practical terms, this study is expected to benefit teachers, curriculum planners, students and guidance counselors.

This study should help teachers to understand the steps involved in test development. This would enable teachers to set quality questions in school that have similar qualities to external examination questions. It may also show teachers that students’ performance in external examinations depends on the quality of the questions or assessments set in school. Teachers should find this study useful because it supports fuller reporting of examinees’ achievement by offering a basis for meaningful interpretation of examinees’ results through the person-by-item encounter (latent trait model) during examination. The study would report examinees’ achievement by classifying the examinees into ability levels on each of the items based on item response theory (IRT), using the item response function (IRF). Economics teachers can use the instrument to predict the probability of examinees correctly answering any given item if the examinees’ ability levels are known.

For curriculum planners, this study provides a basis for reviewing curricular goals and objectives. The usefulness of this study lies in providing empirical data to enable them to plan a functional curriculum that takes into consideration the development and validation of achievement tests in subjects such as Economics. This should encourage and guide teachers to develop and set quality questions in school.

For students, the study would clarify the interpretation of their performance in Economics when assessed using the developed instrument. The study should enable them to understand the relationship between their performance on each question they answered and the underlying latent trait.

For guidance counselors, the findings of this study would help them to understand the performance of students on each question as reported by the teachers. This should enable them to determine the strengths and weaknesses of each student. It would also help them to advise students from time to time on the factors that affect their performance, their academic life and their choice of career.

Scope of the Study

The application of item response theory in the development and validation of a multiple choice test in Economics was limited to SS2 Economics students in senior secondary schools in Nsukka Education Zone of Enugu State. The SS2 students were chosen because the topics used in the instrument of this study are contained in the SS2 scheme of work. The content scope includes: demand and supply, financial institutions, public finance, labour force, alternative economic system, theory of cost, and inflation. These topics were selected from the SS2 Economics syllabus because students often find them difficult to understand during classroom teaching and learning.

Research Questions

The following research questions were posed to guide this study.

1. What are the standard errors of measurement of the test items of the multiple choice test in Economics?

2. How do the items of the Economics multiple choice test fit the three-parameter logistic (3PL) model?

3. What are the difficulty parameters of the test items of the multiple choice test in Economics?

4. What are the discrimination parameters of the test items of the multiple choice test in Economics?

5. What are the guessing parameters of the test items of the multiple choice test in Economics?

6. What is the differential item functioning of the test items of the multiple choice test in Economics with respect to gender?

Research Hypotheses

The following null hypotheses (H0) were formulated and were tested at .05 level of significance.

1. H01: The items of the Economics multiple choice test do not significantly fit the three-parameter logistic (3PL) model.

2. H02: The test items of the multiple choice test in Economics do not function differentially between male and female SS II Economics students.


Abonyi, O. S. (2011). Instrumentation in behavioral research: A practical approach. Enugu: TIMEX Publishing Company.

Adedoyin, O. O. (2010). Investigating the invariance of person parameter estimates based on classical test and item response theories. International journal of educational science. Retrieved November 30, 2012, from http://www.uniBotswana./journal/ education/science

Adeyegbe, S. (2004). History of West African Examinations Council. Retrieved October 12, 2012, from http://www.waecnigeria.org/home.htm

Akindele, B. P. (2003). The development of an item bank for selection tests into Nigerian universities: an exploratory study. Unpublished doctoral dissertation, University of Ibadan, Nigeria.

Ali, A. (2006). Conducting research in education and the social sciences. Enugu: Tashiwa Networks Ltd.

Anaekwe, M. C. (2007). Basic research methods and statistics in education and social sciences (2nd ed.). Onitsha: Sofie Publicity and Printry Limited.

Anastasi, A., & Urbina, S. (2002). Psychological testing. New York: Prentice Hall.

Anene, G. U., & Ndubisi, O. G. (2003). Test development process. In B. G. Nworgu (Ed.), Educational measurement and evaluation: Theory and practice (pp. 110-122). Nsukka: University Trust Publishers.

Anikweze, C. M. (2010). Measurement and evaluation: For teacher education (2nd ed.). Enugu: SNAAP Press Ltd.