Statistical analysis – PRO EXPERT WRITINGS

Statistical analysis
Introduction
In science, we often repeat an experiment many times. In doing so, we are able to compute
an average value, which would be the best value to report. We make the assumption, however, that
both accuracy and precision are maximized. In performing experiments, we utilize properly
calibrated instruments and we use them to their fullest capability when reporting values. An
experiment performed with an N-number of trials would generally have a significant population
clustered about the mean value of all trials when there are random outcomes. When examining the
results of a large number of N-trials, there are typically outliers that appear in many cases for
experiments such as these. Rolling dice for instance can yield such a case. If a pair of dice were
rolled, say, 100 times, in many instances the number seven would be obtained when adding the
two dice. However, there would also be outliers such as two and twelve present in the data set.
Why would we expect seven frequently? In simple terms, there are several ways seven could be
the sum of the two rolled dice. In terms of this lab, we say the probability for achieving a seven is
larger than the probability of achieving the outliers two and twelve.
Lab objectives
To study the statistical nature in science by investigating many rolls of four pieces of
dice. These “rolls” will be simulated using Excel.
Introduction to basic probability for rolling dice
Roll a single die. A die has six faces. It is equally likely that when the die is rolled, it will
be either 1, 2, 3, 4, 5 or 6. In a word, the probability that it would be any of those numbers is the
same (1/6). Thus, since all outcomes are equally likely, then the probability equation for this
scenario is simply
( ) =

.
Note that the above equation is assuming that each event scenario is equally likely as the other.
It gets much more interesting when there is a pair of dice. For example, the number of
outcomes increases to 36 (each dice has a total of six faces and each dice is unique to the other).
Let us call the first die ( ) and the second ( ). Each die is unique. We will call the “roll”
scenario as ( , ). [If it is easier to imagine, call one die the “red” die and the other the “white”
die.] For instance, if you roll the dice and the sum total + = 3. There are two scenarios
which can give us this sum; (2,1) or (1,2), since both die are unique. We call our sample space
{(2,1), (1,2)}. Therefore, the probability of obtaining a 3 from the roll based on our sample space
is simply
( = 3) =
1
36 +
1
36 =
2
36 =
1
18.
The entire sample space is seen below. Note again that there are 36 different (unique) outcomes.
Importantly, if all of the probabilities were added from this entire sample space, the total would
add up to one.
Basic probability for rolling dice
The table below shows all of the possible scenarios listed in the sample space (36 total).
Sum of two
dice
# of possible
ways to get
sum
2 1
3 2
4 3
5 4
6 5
7 6
8 5
9 4
10 3
11 2
12 1
Notice that there are 6 ways to obtain the sum of 7 from a pair of dice. Thus we expect the
probability to achieve a sum of 7 to be the greatest. What would be the probability of rolling a
sum of 2?
Let us assume there is a large sample of data collected by (many) rolls. The mean value is
simply the center of the distribution of values within the sample. Let there be N trials of a
measurement of a quantity x. The mean value of the measured quantity x over N trials can be
written as
.
1
1


N
i
i
x
N
x (1)
This is an equation that simply states: For the “best” value of a measurement of some quantity x,
it is simply the arithmetic average
x
of an N number of trials. Suppose you were rolling a single
piece of die. There are 6 possible outcomes. When rolling two or more dice many times, the
mean value from all N-rolls, and all possible outcomes ( ) that have a frequency of occurrence
can be written as
( ) .
1
1


N
i
i i
f x x
N
x (2)
Two other important quantities are variance and standard deviation. Variance is defined as the
arithmetic average of the square of the difference between each trial and the mean. That is, it is a
measure as to how far a random number is from the mean. The standard deviation ( ) or the
square root of the variance (
2
) is expressed as:
 





N
i
i
N
x i
x x
N
d
N 1
2
1
2
( )
1
1
( )
1
1

. (3)
For the above equation, the quantity xi is the value for the i
th trial. The deviation or the residual is
defined as the difference between the i
th trial and
x , di = xi – x .
Let us assume that there is some fluctuation within an N number of trials. If N is
sufficiently large, then the fluctuations (the spread) would follow a Gaussian (Normal)
distribution. The figure below shows what this
would look like with a large N-number of trials.
The width of this curve is simply the standard
deviation. Notice the percentages of a large data
set that fall within the inflection points. For
example, 68.2 % of the data fall within one
standard deviation from the mean value.
The equation describing the normal distribution is written as
( ) =
1
√2

[−
( − ̅)
2
2 2
].
(4)
For a large collection of data points, the values typically cluster around the mean value, if the
precision of the test is well within acceptable means. In a word, since the standard deviation is a
measure of the width of the above distribution curve, a large standard deviation correlates to
measurements of low precision.
Rolling dice is a measurement, which is random in nature. If there are two dice, one
expects the average to float around the sum of seven. The values will of course fluctuate above
and below seven. Due to these fluctuations, random uncertainty is present. This is an important
point: Many measurements should be undertaken to result in a mean value close to what one
might expect. Note that random uncertainties are assumed to follow the above distribution if
there are a (large) number of rolls and several dice.
Experimental set-up (Excel)
1. Open up Excel and pair with one other student.
2. Create a tab on the spreadsheet and call it “dice-rolling”.
3. Refer to the sample spreadsheet at the end of the experiment for labeling purposes. Please
label your spreadsheet accordingly.
4. In cell A3, type “=RANDBETWEEN(1,6)” and then press enter. This random number
generator is simulating a rolled die. Hit F9 to make another roll. Copy this formula in
cells B3 through D3. This represents a simulation of four rolled dice.
5. Highlight the cells A3 through D3. Right click EDIT/COPY. With the cells highlighted,
click on A3 and with the mouse button pressed, scroll down to A2002. Now, right-click
EDIT/PASTE.
6. In cell E3, type”=SUM(A3:D3)”. Click on E3, and then right-click to EDIT/COPY.
Scroll down to E2002 and with the E-column highlighted, right-click to EDIT/PASTE
the formula.
7. This next step is crucial for visual appearance. Go down and click on any cell (say,
D30). With the cell chosen, and with a left-click mouse, scroll down (in the same
column) to row 1998. With these cells highlighted, go to the top tool bar and chose
format/Hide & Unhide/Hide rows. Note: This DOES NOT affect your calculations
whatsoever. It merely allows a single-page snapshot of the data and analysis. If you were
going to print, it would not print many pages of data, but just this singular snap-shot.
8. In column I, starting in I3, enter “4”. This is the minimum sum with four dice. Add one
down until you reach 24 in I23.
9. Left-click on cell J3 and scroll down to J23. With the cells highlighted, type
“=FREQUENCY(E3:E2002,I3:I23), BUT DO NOT PRESS ENTER! This is an array.
After you type, simply press both the Control and the Shift key simultaneously and then
press enter. Now, to verify, in cell J24 type “=SUM(J3:J23). Press enter. Did you get
what you were expected? Stop and think.
10. Now let us set-up a visual of our “collected data”. Highlight I3 through J23 (columns I
and J with the appropriate cells should be highlighted). Go to Insert/Recommended
Charts/All Charts. You should see the vertical bar chart that resembles a “Histogram”,
with the “x-series” being 4 through 24 (sum of dice). The y-series is frequency. Please
label these series on your chart. Title the chart as you see fit.
11. It is now time to calculate other relevant statistical data.
12. In cell G3, type “=SUM(E3:E2002)/2000”. This is equation 1.
13. To verify that we should also get this mean from equation 2, in cell K3 type “=I3*J3” and
copy down to K23. In K24, type”=SUM(K3:K24)”. Finally, we need the sum of all f(x)
(2000). Thus, in K25 type “=K24/J24”. Compare this mean with the mean you previously
calculated. [Again, do not worry if every operation you do generates another roll and
numbers slightly change. Statistically, this is meaningless to what we are trying to do
here.].
14. To calculate the variance (see above), we need the deviations squared. Namely, ( − ̅)
2
.
15. In cell F3, type in the formula “=(E3-$G$3)^2” and press enter. Copy this formula down
to F2002. Notice the dollar signs around “G” in the formula. Why are those needed?
16. To calculate the standard deviation, we take the square-root of the variance. In H3, type
the formula “=SQRT(SUM(F3:F2002)/1999)”. Hit enter. Why is it 1999?
17. You are now ready to calculate the normal distribution.
18. In cell L3, type “=2000*(1/((SQRT(2*PI())*$H$3)))*EXP(-((I3-
$G$3)^2/(2*$H$3^2)))”. Please be careful, as this is a very long formula. You may also
use the Excel function: “=2000*NORM.DIST(x, mean, stdev, FALSE)” as a back-up to
compare.
19. Copy the formula down to L23.
20. Next, sum cells L3 through L23. To do that, in L24, type “=SUM(L3:L23)”. Is the
answer what you expected? Mention that in your summary.
21. Lastly, in M3 we will calculate the standard deviation of the mean (SDOM). To do that,
in M3 type “=H3/SQRT(J24)”. Hit enter. This calculation is simply the standard
deviation divided by the square-root of the number of measurements. Notice that if the
number of measurements gets even larger, the SDOM gets smaller. We will mention
SDOM later in the course.
22. Save the spreadsheet
23. Right click on the graph. Go to Select Data and then click Add. Go to Series Values.
Clear the information and then go and highlight cells L3 through L23. Click OK and then
OK again.
24. Finally, right click on one of the bars in the graph. Go to Change Series Chart type. For
series 2, which is the theoretical Gaussian, select line graph.
25. Your graph should look like the model.
26. Right click on the graph and go up to Add Chart Element on the top toolbar. Label the
horizontal and vertical series.
Summary
1. Create a fourth tab called “summary”.
2. In the tab, address the following:
a. Can the 2000 rolls of four dice be properly described by a normal distribution? What
do your calculations tell you about this?
b. Did you get roughly 68 % of the data within one standard deviation?
c. In your own words, how would you define “Probability” based on your experimental
work in this lab?
d. Identify an experiment in YOUR area that this analysis might be useful.
3. Save the file.
4. Upload it in the link provided before the due date.

Newsletter Updates

Leave a ReplyCancel Reply

Related Posts

Communications Question

Intercultural Communication

Week 3 Discussion – Open, Honest, and Transparent Communication