How to Determine a Reliable Sample Size for Your Research Work
Sampling is part of our daily lives, even when we do not realize it. We generalize, hypothesize, and measure all the time, from judging whether today’s weather is hotter or colder than normal to guessing whether a certain candidate is leading in public opinion.
The mathematical statistician Taro Yamane devised a formula for estimating or determining the sample size of the population under study so that the inferences and conclusions drawn from the survey can be applied to the complete population from which the sample was drawn.
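As a rough illustration, Yamane's simplified formula is commonly written as n = N / (1 + N e²), where N is the population size and e is the margin of error. It is usually presented as assuming a 95% confidence level and maximum variability (p = 0.5). The function name and the example numbers below are only illustrative assumptions.

```python
import math


def yamane_sample_size(population_size: int, margin_of_error: float) -> int:
    """Sample size from Yamane's simplified formula: n = N / (1 + N * e^2)."""
    if population_size <= 0:
        raise ValueError("population_size must be positive")
    if not 0 < margin_of_error < 1:
        raise ValueError("margin_of_error must be a fraction between 0 and 1")
    n = population_size / (1 + population_size * margin_of_error ** 2)
    # Round up: a fraction of a respondent still requires one more respondent.
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical population of 10,000 people and a 5% margin of error.
    print(yamane_sample_size(10_000, 0.05))
```

Rounding up rather than to the nearest integer is a conservative choice, so the achieved margin of error is never worse than the one requested.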
What is a statistical population, and why does it matter in the sampling context?
A statistical population is the set of items, people, or elements in general that contains all the information available to make a certain type of inference. Understanding the concept of population is essential for calculating the sample size. Take elections as an example: what is the best way to find out voters’ intentions for a given candidate? A generic answer would be to talk to the entire voting population. But a few seconds of thought is enough to conclude that this would not be an easy task. So, how can we draw conclusions about a topic scientifically, with a coherent methodology that brings results close to what the population really looks like? That’s where sampling comes into play.
Be careful not to confuse sampling with data clipping – In this article, we talk more about this issue.
What is sampling?
Sampling is a process that follows defined techniques for choosing members of a population so that it is possible to make inferences about the entire population. In other words, sampling allows us to draw conclusions about the whole by analysing only a part. To make rational use of resources, we generate a sample that can represent the population we are interested in. For that, we must think about some important issues. The concept of sampling assumes that we want to study characteristics of individuals and populations. Since a sample only seeks to represent an entire population, it will inherently contain deviations from reality, measurement errors, and other imperfections, largely due to chance.
When trying to estimate exercise habits among Nigerians, for example, our sample may be skewed if we have selected more elderly people than young people, more children than adults, more people from one region than another, and so on. This brings us to some important concepts:
The margin of error
It is the maximum expected difference between the value found in the sample and the true value in the population. The margin of error is one of the parameters entered into the sample size calculation. There is an inverse relationship between the margin of error and the sample size: the smaller the maximum desired margin of error, the larger the sample will have to be. For example, suppose a master’s thesis that wants to find the proportion of Nigerians who own their home works with a margin of error of 5%. If the survey returns the result that 45% of the population owns their home, the margin of error indicates that the actual result can vary by 5 percentage points in either direction, that is, the proportion of Nigerians who own their home is between 40% and 50% of the population.
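The inverse relationship between margin of error and sample size can be sketched numerically. The snippet below uses the standard normal-approximation expression for a proportion, e = z · sqrt(p(1 − p) / n), and assumes a population large enough that no finite-population correction is needed. The function name and the sample sizes are illustrative assumptions, and z = 1.96 corresponds to a 95% confidence level.

```python
import math


def margin_of_error(sample_size: int, proportion: float = 0.5, z: float = 1.96) -> float:
    """Approximate margin of error for a proportion (large-population assumption)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


if __name__ == "__main__":
    # Each step quadruples the sample, which roughly halves the margin of error.
    for n in (100, 400, 1600):
        print(f"n = {n:>4}: margin of error = {margin_of_error(n):.4f}")
```

The default p = 0.5 is the most conservative choice, because p(1 − p) is largest at that value.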
To get results closest to the true population, the selection of our sample must be completely random. How can perfect randomness be guaranteed? That is a topic for another time. What we can say is that the less tied our sample is to a particular group or category, the better it will represent the population.
A population, in statistical terms, is nothing more than the totality of the elements that we want to analyse, whether that is the total number of people living in the region of interest or the total number of organisms that live in an ecosystem.
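As a minimal sketch of random selection, the snippet below draws a simple random sample without replacement, so that every member of a list has the same chance of being chosen. It assumes a complete list of the population (a sampling frame) is available, which is often not the case in practice. The respondent names and the function name are hypothetical.

```python
import random


def simple_random_sample(population, sample_size, seed=None):
    """Draw sample_size distinct members, each with equal probability."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(population, sample_size)


if __name__ == "__main__":
    # Hypothetical sampling frame of 1,000 respondents.
    respondents = [f"person_{i}" for i in range(1, 1001)]
    sample = simple_random_sample(respondents, 50, seed=42)
    print(len(sample), "respondents selected")
```

Recording the seed alongside the results is one way to let others reproduce the exact draw.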
Degree of confidence
The term confidence, within sampling techniques, expresses how much “certainty” we are willing to give up to have a more efficient sample. We can think of confidence as a range of probabilities: the greater the degree of confidence required, the wider the range of possible outcomes we must allow around the sample result. We delimit this interval in standard deviations, that is, how far our sample can deviate from the true population mean at a given degree of confidence. For example, a doctoral thesis with a confidence level of 95% means that if you repeated the research 100 times, in about 95 of them the result would fall within the margin of error of the true population value.
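To make the link between confidence and standard deviations concrete, the sketch below converts a confidence level into the corresponding number of standard deviations (the z-score) of a standard normal distribution. It uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the function name is an illustrative assumption.

```python
from statistics import NormalDist


def z_score(confidence: float) -> float:
    """Two-sided z-score for a confidence level given as a fraction (e.g. 0.95)."""
    if not 0 < confidence < 1:
        raise ValueError("confidence must be a fraction between 0 and 1")
    # Split the remaining probability equally between the two tails.
    return NormalDist().inv_cdf((1 + confidence) / 2)


if __name__ == "__main__":
    for level in (0.90, 0.95, 0.99):
        print(f"{level:.0%} confidence -> z = {z_score(level):.3f}")
```

A higher confidence level gives a larger z-score, which is why demanding more confidence widens the interval or requires a larger sample.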
How does the sample calculation work?
Quantitative research requires reliable numerical indicators so that data analysis can return meaningful results for developing scientific knowledge. This means that the research must work with strict statistical criteria; otherwise, the research results become questionable. It also means choosing the optimal sample size.
To arrive at the ideal sample size, a statistical model is used that tells us the number of people or events that must be collected for the results to be reliable. This model is also called the sample calculation and is expressed as a specific formula.
Sample Calculation Formula
To find the optimal sample size, we rely on the central limit theorem. It provides mathematical support for the idea that the mean of a random sample from a large population tends to be close to the mean of the entire population. The sample size formula built on it is:
$$ n = \frac{N \, Z^2 \, p \, (1 - p)}{(N - 1) \, e^2 + Z^2 \, p \, (1 - p)} $$
n = the sample size we want to calculate
N = the size of the universe (i.e., the population)
Z = the number of standard deviations from the mean that corresponds to the desired level of confidence (for example, about 1.96 for 95%)
e = the maximum allowable margin of error, expressed as a fraction (for example, 0.05 for 5%)
p = the proportion expected to be found (0.5 is commonly used when it is unknown, since it gives the largest sample)
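The calculation with these five quantities can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function and parameter names are hypothetical, the defaults (95% confidence, 5% margin of error, p = 0.5) are common conventions rather than requirements, and Z is derived from the confidence level with the standard library's `statistics.NormalDist`.

```python
import math
from statistics import NormalDist


def sample_size(population_size: int,
                confidence: float = 0.95,
                margin_of_error: float = 0.05,
                proportion: float = 0.5) -> int:
    """n = N Z^2 p(1-p) / ((N-1) e^2 + Z^2 p(1-p)), rounded up."""
    if population_size <= 0:
        raise ValueError("population_size must be positive")
    if not (0 < confidence < 1 and 0 < margin_of_error < 1 and 0 < proportion < 1):
        raise ValueError("confidence, margin_of_error and proportion must be in (0, 1)")

    z = NormalDist().inv_cdf((1 + confidence) / 2)  # Z for a two-sided interval
    variability = proportion * (1 - proportion)     # p(1 - p)

    numerator = population_size * z ** 2 * variability
    denominator = (population_size - 1) * margin_of_error ** 2 + z ** 2 * variability
    return math.ceil(numerator / denominator)


if __name__ == "__main__":
    # Hypothetical populations, all with the default 95% / 5% / p = 0.5 settings.
    for n_population in (1_000, 10_000, 1_000_000):
        print(f"N = {n_population:>9,}: n = {sample_size(n_population)}")
```

Running it for increasingly large populations shows that the required sample grows very slowly once N is large, which is why surveys of very different populations often end up with similar sample sizes.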
The result obtained from the sample is the value most likely to be found in the total universe of the research as well. As we move away from this value (up or down), the values become less and less likely. Because the probability decreases as we move away from the mean, it is possible to create a range around the most likely value. The probability that this range contains the true value is the confidence level, and the distance allowed from the most likely value determines the margin of error.
In sum, to ensure high data quality, the sample should be as representative as possible. Ideally, participants are selected at random from the entire population. Sometimes, however, this is not realistic, and it must then be decided case by case whether the informative value of the data is at risk. If necessary, the sample can be selected deliberately, for example using quota or cut-off procedures. It is also important to determine the required sample size: a sample that is too small to detect an effect wastes resources just as much as an excessively large one.