Saturday, February 22, 2014

            I’ve recently posted online the research papers written by my EAP students (English for Academic Purposes) on misleading statistics.

 The Tuesday-Thursday Class is here, and the Monday-Wednesday-Friday Class is here. 
(Update: I've since added 3 more classes to the blog--here, here, and here).
            The motivation for doing this was an idea I got from I.S.P. Nation’s book Teaching ESL/EFL Reading and Writing.  Nation mentions that one way to increase students’ motivation for writing is to provide opportunities to publish their writing.
            I was also thinking about the examples that Lourdes Ortega mentions about students finding their second language identity through writing on-line.  (I’m stretching this somewhat, because Ortega cites examples of students writing about things they are actually interested in, like creating websites about Anime.  A paper on misleading statistics that the students are forced to write is probably not the same, but nevertheless I hope it will give them some sense of using their English abilities to engage with the wider world.)
            My students, mostly teenagers and digital natives, were not overly amazed at the novelty of something being published online.  But I think they were moderately pleased, and that’s the most you could realistically hope for.

            [There was also an ulterior motive for this which I didn’t tell my students about.  We’re having a big problem with plagiarism at my school.  It’s possible, using sites like and to check the student papers against the Internet for plagiarism, but, since the essay topic remains the same from term to term, a bigger problem is that essays are often getting passed down from one class to another. And since the school does not keep a database of past essays, this is harder to check for.  So, I figured, why not start posting the essays online?]

A couple other notes:
            In addition to publishing these papers online, I also tried out a couple more ideas from Nation’s book last term.
            One idea was to have the students write their papers in groups instead of individually.  The hope is that in collaborative writing the students share their knowledge and learn from each other.  (The danger is that one student just writes the whole thing, but I tried to guard against this by giving the students time to collaborate in class.)
            The other idea, also from Nation, is to put the student papers into a larger book.  So instead of having each group write generally about “misleading statistics” (as the curriculum states, and as I’ve done in the past), I had each group write about an aspect of misleading statistics, and then at the end of term we arranged the various papers together into one booklet.  I took this booklet down to the printer’s shop and had bound copies made up, and on the last day of class we had a book signing party.
            (Nation suggests this, but it was also an idea one of my Calvin professors used.  This History 101 paper I wrote was actually part of a much larger book on religion in the ancient world.)
            The introduction and conclusion were composed by the class as a whole—one student was nominated to write down the class’s ideas, and then the other students suggested sentences.  At the end I corrected it.  (Another idea from Nation.)

Another note:
            I’ve been teaching this EAP course on misleading statistics for a couple years now, and a few years back, to my surprise, I saw a student actually cite Calvin College, my alma mater, as a source on her research paper.
            It turns out a Calvin professor had set up a site on misleading statistics that had been attracting the attention of my students all the way over here in Cambodia.
            (This is one of my favorite “randomly-running-into-Calvin-College-in-distant-areas-of-the world” stories.  My other favorite is finding Howard Van Till’s book The Fourth Day (A) at the check-out counter at Melbourne University library—which meant that not only did Melbourne University library have this book, but that someone was actually reading it!)

            Anyway, after I discovered this website, I started assigning it to my students to read to help them prepare for the paper.  (The Website is located HERE.)
            After a few terms, I ended up re-writing the first couple chapters in order to make it more accessible to ESL Cambodian students.  (Or I tried anyway—I’m not sure I’m the best writer, but my goal was to make it simpler and easier to understand.)  I only re-wrote the first couple chapters, and then after that the students had to read the authentic text for the rest of the book.

            Below are my re-writes that I’ve been using in my classes.
Chapter 1: Our Treacherous Tendency
(Source: Shady Statistics)
1.      Discreet Deceit
Human beings have a strong tendency to introduce bias, and the strangest thing about this tendency is that it sometimes occurs without us even realizing it. One major contributor to this bias is our pride. We all desire to look good for other people, to look like we've got it together, and even to twist the truth in order to preserve our reputation and successful appearance. The results of some surveys and statistics simply cannot be trusted because of the kind of information they seek to gather.  Sometimes people lie consciously (knowingly) and sometimes people lie unconsciously (unknowingly), but the fact is people will often adjust the truth to make themselves look better, even if they are talking to a complete stranger.
For example, imagine you were doing a study on whether people washed their hands after using the toilet.  Should you try and do this study by survey?  What would be the problem with this survey?  Are you likely to get honest results?
"Studies show that people wash their hands 4.67 times a day." In a scenario like this, we should ask ourselves, "How in the world did they get that figure?" Will a person, no matter how randomly selected, ever admit to a complete stranger that they occasionally do not wash their hands? These kinds of statistics are only useful in determining what people say about washing their hands. We can hardly draw any other conclusions.
2.      Sample Problems
It is also important to note that sometimes the samples of people who participate in surveys and statistics can result in misleading statistics. For example, imagine if I wanted to find out what percentage of Cambodian people enjoyed shopping at Soriya shopping center.  I went to Soriya shopping center and did my survey there.  What would be the problem with this survey?
If I wanted to gather information about whether customers enjoyed shopping at a particular mall, I would not gather my sample from the people that are already inside that mall. Chances are, if they are shopping there, they like it.
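To see how far off such a survey could be, here is a minimal Python sketch. Every number in it is invented for illustration (the population size, how many people like the mall, and how many of each group happen to be inside it); none of it is real data about Soriya or any other mall.

```python
# Hypothetical population: all counts below are made up for illustration.
POPULATION = 1000          # everyone we would like to know about
LIKES_MALL = 300           # people who truly enjoy shopping at the mall

# Assumption: people who like the mall are far more likely to be inside it.
likers_inside = 200        # of the 300 who like it
others_inside = 20         # of the 700 who do not


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# What we want to measure: the share of the whole population.
true_rate = percent(LIKES_MALL, POPULATION)

# What a survey taken inside the mall would measure instead.
mall_survey_rate = percent(likers_inside, likers_inside + others_inside)

print(f"True share who like the mall: {true_rate}%")
print(f"Share found by surveying inside the mall: {mall_survey_rate}%")
```

Under these made-up numbers the survey inside the mall reports roughly three times the true figure, simply because of where the sample was taken.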
Surveyors must be extremely cautious when it comes to how a survey is set up and how the results are gathered.  For example, what would be the problems of conducting a survey on the Internet?  Or using text messaging from cell phones?
A survey of teenagers' opinions that gathers its information via text messaging leaves out those teens who a) don't have a cell phone and b) don't have text messaging. Furthermore, the survey only obtains results from those who choose to participate! This is sometimes only a small fraction of those who were asked, and results like these can do dangerous things to any statistics that are calculated.  (This is also the same reason why Internet surveys and polls are never reliable.  They collect information only from the people that choose to participate.)
In order to truly get an accurate statistic, I would need either a random sample or a stratified sample.
A random sample was once described as a sample selected by pure chance from the population.  (When statisticians use the word “population” they mean the whole of whatever they are studying, and the “sample” is just a small part of it.)
However, stratified sampling is the best kind of sampling. It allows the sample to consist of the same proportions of things as they exist in reality. For example, imagine we were doing a survey on whether Cambodians like Angry Birds.  I would try and get a sample that reflects (or mirrors) the general population.  In Cambodia now, 31.9% of the people are 14 or younger, 64.3% of people are between 15 and 64 years old, and 3.8% are 65 or older.  To get a truly accurate statistic, I would want to make sure these same percentages are reflected in my sample.  I would also want to take into account the percentage of males and females.  Instead of simply selecting males and females at random, I would need to find out what percentage of Cambodia’s population is female, and then work that into my sample.
What other factors would I need to consider to get an accurate stratified sample?
You can see perhaps that it actually takes a lot of hard work and preparation to get good stratified samples.  Not surprisingly then, many surveys don’t bother with this.  Therefore, many surveys are unreliable.
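As a rough illustration of the arithmetic behind a stratified sample, the Python sketch below turns the age percentages quoted above into the number of people to survey in each age group. The sample size of 1,000 and the function name `stratified_quota` are my own assumptions, chosen only for the example.

```python
# Age shares quoted in the text above (percent of Cambodia's population).
AGE_SHARES = {
    "0-14": 31.9,
    "15-64": 64.3,
    "65+": 3.8,
}


def stratified_quota(shares, sample_size):
    """Return how many people to survey in each group so that the
    sample has the same proportions as the population."""
    return {
        group: round(sample_size * pct / 100)
        for group, pct in shares.items()
    }


# Hypothetical sample size of 1,000 people.
quota = stratified_quota(AGE_SHARES, 1000)
for group, count in quota.items():
    print(f"{group}: survey {count} people")
```

With other percentages or sample sizes, rounding can leave the quotas a person or two above or below the target, so a real survey would need to adjust for that. Adding gender or other factors means splitting each age group again in the same way.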
3.      Biased Questions
Often with polls, the questions lead people to answer one way or another.  Occasionally, the questions are intentionally designed to get people to answer a certain way.
For example, imagine a survey question that said:
What do you think of Labor Unions?
            a). a terrible idea
            b). inefficient
            c). okay, but not great
            d). evil

            Imagine as a result of this survey, we found that 70% of Americans thought labor unions were only okay, and not great.  What would be the problem with this statistic?

            How could we design a survey that would be more accurate?
In fact, the best, most reliable types of survey questions are open-ended survey questions.   This strategy allows any kind of answer and thus does not leave out any opinions that someone may have. Each person is free to answer how he or she likes.
For example:
What words or phrases come to mind when you think of labor unions?
Afterwards, the information can be collected in a survey that would give the most frequent answers to the questions.   The strategies of asking an open-ended question and not giving options for an answer give an unbiased approach to getting the opinions about labor unions. By putting the information in a table that clearly organizes national, Republican, and Democratic results, it is easy to see the opinions of each group. There is no confusion in how to read the answers, and there is no answer not listed as an option. Clearly, this is the best approach to finding out the nation's opinions on labor unions.

Let’s look at some examples of real life misleading statistics.  This is a true story.  A lady was listening to the radio and heard about a poll taken that said that "11% of Americans don’t believe that the Obama Administration cut taxes last year."

This surprised her.  It had been big news last year that Obama had cut taxes.  It had made him very popular in America.  So she decided to research the statistic.  Where did this statistic come from?  How did they know this?
She did some research on it, and it led her to find out that the statistic was found under a link to "Poll: Who are the tea partiers?"   The Tea Party is a political group in America that is opposed to Barack Obama.  So the poll was only from a selected group (the Tea Party), which is a poor example of random sampling.  All the people in the survey opposed Barack Obama, so it did not accurately represent the American population as a whole.
            Look at the categories we studied above.  Which category does this statistic fit under?

Chapter 2: Deceptively Mean
          Have you ever heard the word “average” before?  What does this word mean?  Does it have the same meaning in all situations?
          In normal everyday conversation, average often means normal, or usual.  However, in statistics, average has a much more mathematical meaning.
          Here’s a question to think about: how could a dishonest person use a mathematical average to create a misleading statistic, that is, a statistic which is technically true but creates a misleading perception?
Part of the problem is that in statistics average has 3 different meanings: mean, median, and mode.
Each type of average is easy to use appropriately, but first it is important to know the differences between the three. The mean average takes the sum total of all the collected data and then divides that total by the number of participants in the study. This type of average can be used for figuring out things such as the average grade on a quiz for students in a math class. The next type of average is the median average. The median average is determined by putting a set of values in order and finding which one falls directly in the middle. The final type of average is the mode average. This average is the most frequently occurring item in a data set. It helps to show how common a particular value is across the group of subjects being studied. Once you know each type of average, it is very important that those displaying statistical data use the information ethically, to inform their audience and not just to influence the audience’s perception with deception.
Confused?  Let’s go over the same thing again a little bit slower.
A mean average is created by adding all the numbers together, and then dividing by the size of the sample.  So for example, imagine Sue has $1, Tom has $1, Sam has $20 and Jason has $30.  On average, how much money do they have?  Well, we add up all the numbers (1 +1+20+30) and then divide by the number of our sample (4 people).  So the average is (1 +1+20+30) /4=$13.
The median is the middle number.  Put all the numbers in a row with the least number on the left, and the greatest number on the right, and the number in the middle is the median.
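Let’s use our example from above.  Put the numbers in a row:
$1, $1, $20, $30

There are four numbers here, so there is no single middle number.  When that happens, we take the two middle numbers ($1 and $20) and find their mean: (1 + 20) / 2 = $10.50.  So $10.50 is our median.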
And the mode is the number that occurs most frequently.  Again, using the same example:
$1 (2), $20 (1), $30 (1)
The mode here is $1, because that number is the most frequent.
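The three averages can also be checked with Python's built-in statistics module. This is only a small sketch for readers who like to verify the arithmetic. One thing to note: because the example has four values, Python's median averages the two middle values ($1 and $20).

```python
from statistics import mean, median, mode

# Money held by Sue, Tom, Sam and Jason, from the example above.
money = [1, 1, 20, 30]

mean_money = mean(money)      # add everything up, then divide by 4
median_money = median(money)  # middle of the sorted list; with an even
                              # count, the two middle values are averaged
mode_money = mode(money)      # the value that appears most often

print("mean:", mean_money)
print("median:", median_money)
print("mode:", mode_money)
```

Three different "averages" come out of the same four numbers, which is exactly why the word "average" on its own tells you so little.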
The important thing to remember is when you see the word “average” written down in a statistic, you often have no idea if the researcher is referring to the mean average, the median average, or the mode average.  How could dishonest people abuse this to mislead someone?
We often think of the word average as being the same as normal, but this is not always the case.   For example, if 9 people in a company earn $10,000 a year, and one person earns a billion dollars ($1,000,000,000) a year, what is the mean average?  What is the mode average?

 In this case, either is correct, but one can be slightly misleading.  For example, if a company wanted to recruit new employees by advertising a high wage, how could they be dishonest about their average employee wage?
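Readers who want to check their answers to the company example can run the short Python sketch below. It uses only the salaries given in the text (nine people on $10,000 and one on $1,000,000,000); the variable names are my own.

```python
from statistics import mean, median, mode

# Nine employees earning $10,000 and one earning $1,000,000,000.
salaries = [10_000] * 9 + [1_000_000_000]

mean_salary = mean(salaries)      # pulled far upward by the one huge salary
median_salary = median(salaries)  # the middle of the sorted list
mode_salary = mode(salaries)      # the most common salary

print(f"mean:   ${mean_salary:,.0f}")
print(f"median: ${median_salary:,.0f}")
print(f"mode:   ${mode_salary:,.0f}")
```

The mean comes out at just over $100 million a year, even though nine of the ten employees earn $10,000. A company advertising its "average" wage could quote the mean without telling a lie.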
