Quandaries and Queries


If I have a set of data points (14 to be exact) of unknown pedigree from a large population, what tests can I apply to see whether they constitute a random sample from that population?

In a related question: if I knew a priori that the larger population fell along a general distribution (say lognormal) and if a lognormal line plot of this sub-data (14 points, unknown pedigree) fit rather snugly, could I safely assume that my 14 data points were randomly selected? Or would I be making a potentially damaging assumption?




Hi Stu,

Randomness of the sample is a question to ask at the stage of COLLECTING the data, not after the data have been collected. Can you tell us how you designed the experiment for collecting the data?

Andrei and Penny

Stu wrote back


The best way I can put this is to say that the process of assembly, in a real specific sense, is not known. Let me describe it in terms of widgets:

We have a population of 4 million widgets. They are sent, atypically, overseas in large lots. At some point (and we don't know whether the integrity of the lots was maintained) these widgets are commingled. The mingling/mixing is not done with the aim of producing a random sample, and we do not know how thorough it was. We do know that following the mixing process the widgets were repackaged in boxes of 20, sent back, and then haphazardly distributed around the United States.

Now someone within our company wants to try to make strong inferences about the characteristics of our overall widget population. So they buy a box of widgets from each of the four respective manufacturing lots at a store in Colorado, blindly choose four widgets out of each box (but not by way of any random number table), and start measuring certain things, like elemental composition.

What I'm trying to figure out is whether one can assume that these 16 widgets represent a random sample from which we can make those inferences to the population. Obviously, there isn't quite enough information to tell. But since there were some elements of a random process (a mixing process of unknown depth, blind picking), I'm wondering if we can back-track. For instance, if we knew something a priori about the general population of widgets (say, that certain chemical elements should follow a lognormal distribution), and these 16 widgets fit a lognormal probability plot, could we safely assume that we had a true random sample? Are there other tests we can perform after the fact? Or should one simply acknowledge the lack of background information and refrain from making strong inferences to the population for fear of biased results?


The procedure described seems sufficient to state that the sample is really random, so we can safely draw some statistical inferences, although a sample of size 16 is small for a population of 4 million. The results derived may not be very precise, and I would have serious concerns about that, but not about randomness or bias. A small sample can still support an inference, but the result may be imprecise and statistically insignificant.
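To see why precision, rather than randomness, is the main worry with 16 points, here is a small sketch (the measurements below are invented, not real widget data) of a 95% confidence interval for a population mean:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical measurements for 16 widgets
sample = rng.normal(loc=100.0, scale=15.0, size=16)

n = sample.size
mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(n)   # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)  # two-sided 95% t critical value

half_width = t_crit * se
print(f"mean = {mean:.1f}, 95% CI half-width = {half_width:.1f}")
```

Even with a perfectly random sample, 16 observations give a fairly wide interval; the half-width shrinks only like 1/sqrt(n), so a sample this small limits how sharp any inference can be.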

Yes, if we knew something a priori about the general population of widgets (say, that it should follow a lognormal distribution), then these 16 widgets should fit a lognormal probability plot. But I do not know of any results on how one can back-track: a good fit may be an indication that the sample was really random, but I have never heard that it PROVES randomness.
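As a sketch of what "fitting a lognormal probability plot" can look like in practice (the data here are simulated, not real widget measurements), one common approach is to take logs and check the log-values for normality:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical: 16 measurements drawn from a lognormal population
sample = rng.lognormal(mean=1.0, sigma=0.5, size=16)

# If the data are lognormal, their logs are normal; a probability plot
# of the log-values against normal quantiles should be close to a line.
(osm, osr), (slope, intercept, r) = stats.probplot(np.log(sample), dist="norm")
print(f"probability-plot correlation: r = {r:.3f}")

# A formal check of the same assumption: Shapiro-Wilk on the log-values.
stat, p = stats.shapiro(np.log(sample))
print(f"Shapiro-Wilk p-value: {p:.3f}")
```

As noted above, a snug fit supports the distributional assumption about the population, but it cannot prove that the 16 points were randomly selected.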

Once again, from our point of view this is a question about how the experiment is designed. After the sample has been collected, nothing more can be done.

Andrei and Penny