How to Lie to Yourself and Others Using Statistics
Misusing statistics is one of the most powerful ways to lie. We usually teach you how to avoid misinterpreting statistics, but knowing how numbers get manipulated can help you spot it when it happens. To that end, we’re going to show you how to make data say whatever the hell you want to back up any wrong idea you might have.
Collect sample data that will add bias to your conclusions
The first step to building statistics is determining what you want to analyze. Statisticians call this the “population.” You then define a subset of that data to collect which, when analyzed, should be representative of the population as a whole. The larger and more representative the sample, the more accurate your conclusions can be.
Of course, there are some serious ways to accidentally or intentionally spoil this type of statistical sampling. If the data samples you collect are bad, you will end up drawing false conclusions no matter what. There are many ways to mess up your data, but here are a few of them:
- Self-selection bias: This type of bias occurs when the people or data you study voluntarily put themselves into a group that is not representative of your entire population. For example, when we ask our readers questions such as “What’s your favorite texting app?” we only get answers from people who choose to read Lifehacker. The results of such an informal survey are unlikely to be representative of the general population, because all of our readers are smarter, funnier, and more attractive than the average person.
- Convenience sampling: This bias occurs when a study analyzes whatever data is readily available instead of trying to find representative data. For example, a cable news network might poll its viewers about a political candidate. Without polling people who watch other networks (or don’t watch TV at all), it is impossible to say that the survey results reflect reality.
- Non-response bias: This occurs when some people in a selected group do not respond to a statistical survey, skewing the results. For example, if a survey about sexual activity asks, “Have you ever cheated on your spouse?” some people may be reluctant to admit to being unfaithful, which makes it look like cheating is less common than it actually is.
- Open-access polls: These types of polls allow anyone to submit responses and, in many cases, do not even verify that people only respond once. While they are common, they are fundamentally biased because they do not attempt to control the input in any meaningful way. For example, online polls that simply ask you to click your preferred option fall under this bias. While they can be fun and interesting, they are no good for objectively proving a point.
These are just a few of the many ways to bias a sample. If you want to create a misleading impression, pick your poison. For example, open-access polls on websites can be used to “prove” that whatever candidate you like won the debate, or that Undertale is the best game of all time. The beauty of sampling bias is that someone, somewhere, is running an unscientific survey that says whatever you want. So just google until you find an unscientific survey you like, or heck, create your own.
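To see how much an open-access poll can drift from reality, here is a minimal simulation sketch. Every number in it is an assumption made up for illustration (a population where 30% back a hypothetical candidate A, with A’s fans five times more likely to bother voting in the poll); none of it comes from a real survey.

```python
import random


def simulate_poll(seed=42, population_size=100_000, true_share=0.30,
                  fan_response_rate=0.50, other_response_rate=0.10,
                  random_sample_size=2_000):
    """Compare a random sample with a self-selected open-access poll.

    All parameters are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)

    # True means this person prefers the hypothetical candidate A.
    population = [rng.random() < true_share for _ in range(population_size)]
    actual_share = sum(population) / population_size

    # A simple random sample: everyone is equally likely to be picked.
    sample = rng.sample(population, random_sample_size)
    random_estimate = sum(sample) / random_sample_size

    # An open-access poll: each person decides for themselves whether to
    # respond, and fans of A are assumed to be far more motivated.
    respondents = [
        prefers_a for prefers_a in population
        if rng.random() < (fan_response_rate if prefers_a else other_response_rate)
    ]
    open_poll_estimate = sum(respondents) / len(respondents)

    return actual_share, random_estimate, open_poll_estimate


actual, random_est, open_est = simulate_poll()
print(f"Actual support for A:     {actual:.1%}")
print(f"Random sample estimate:   {random_est:.1%}")
print(f"Open-access poll result:  {open_est:.1%}")
```

Under these assumed response rates, the random sample should land close to the true figure, while the open-access poll should report a comfortable majority for a candidate most people do not support. The poll also has far more respondents than the random sample, which shows that a bigger sample does not rescue a biased one.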
Choose an analysis that supports your ideas
Since numbers are used in statistics, it is easy to assume that they are irrefutable evidence of the ideas they claim to support. In fact, the mathematics behind statistics is complex, and incorrect analysis can lead to different or even completely contradictory conclusions. If you want to change the statistics to suit your needs, tweak the math.
To illustrate the flaws in data analysis, statistician Francis Anscombe created Anscombe’s quartet (chart above). It consists of four datasets that show completely different trends when graphed. The X1 chart shows a basic upward-trending scatter plot. X2 shows a curved trend that rises and then falls. X3 shows a gentler upward trend, but with one outlier high on the Y-axis. X4 shows data that sits at a single value on the X-axis, except for one outlier that is very high on both axes.
This is where it gets crazy. For all four diagrams, the following statements are true:
- The mean of x is 9 for each dataset.
- The mean of y is 7.50 for each dataset.
- The variance for x is 11 and the variance for y is 4.12.
- The correlation between x and y is 0.816 for each dataset.
If you only saw this data in text form, you might think that all four situations are identical. For example, suppose you have a chart like X1 that shows the wages of men at your company over time, and another chart like X2 that shows the wages of women at the same company over the same period. If you only show the summary numbers, people will see that both groups have the same average salary! However, if you show the graphs, people will see that women’s salaries are trending downward for some reason.
Anscombe suggested that, in order not to mislead people, you should always visualize your data before drawing conclusions and be aware of how outliers affect your analysis. Outliers are hard to miss on a well-made chart, but they can have a huge yet invisible impact on numbers presented as text. Of course, if your goal is to mislead people, you can just skip this step.
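If you want to check the shared statistics yourself, the sketch below recomputes them using only Python’s standard library. The data values are the published Anscombe’s quartet figures; the `pearson_r` and `summarize` helpers are names introduced here for illustration, not part of any library.

```python
from math import sqrt
from statistics import mean, variance


def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed by hand."""
    mx, my = mean(xs), mean(ys)
    covariance_sum = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mx) ** 2 for x in xs))
    spread_y = sqrt(sum((y - my) ** 2 for y in ys))
    return covariance_sum / (spread_x * spread_y)


# The first three datasets share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]

QUARTET = {
    "X1": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "X2": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "X3": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "X4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return the summary statistics that all four datasets share."""
    return {
        "mean_x": mean(xs),
        "mean_y": mean(ys),
        "var_x": variance(xs),  # sample variance (divides by n - 1)
        "var_y": variance(ys),
        "corr": pearson_r(xs, ys),
    }


for name, (xs, ys) in QUARTET.items():
    s = summarize(xs, ys)
    print(f"{name}: mean x={s['mean_x']:.2f}  mean y={s['mean_y']:.2f}  "
          f"var x={s['var_x']:.2f}  var y={s['var_y']:.3f}  r={s['corr']:.3f}")
```

The four rows should be nearly indistinguishable, with the y variances differing only around the third decimal place. Nothing in this output hints at the curve in X2 or the outliers in X3 and X4, which is exactly why you need the plot.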
Make charts that highlight your biased conclusion
Most people don’t have time to do their own statistical analysis, so they rely on you to show them charts summarizing your findings. If you design your charts properly, they should suggest ideas that match reality. If you want to mess things up, you can emphasize only the data you like best.
One of the most famous and ridiculously inaccurate charts in recent memory was presented by a member of Congress at a hearing on Planned Parenthood. During this hearing, Rep. Jason Chaffetz (R-Utah) tried to prove that PP’s abortion services have grown since 2006, while its cancer screening services have declined over the same period. This is the chart he used to demonstrate this:
On the face of it, the number of abortions has skyrocketed and the number of cancer screening services has fallen sharply. For this impression, we can thank several flaws in the chart:
- There is no label on the Y-axis. While the X-axis along the bottom is labeled with years, the Y-axis has no label at all. Is this the number of procedures? How much money was spent on procedures? Who knows! You don’t need to.
- The Y-axis scale is incorrect. Apart from the missing label, the Y-axis is not scaled correctly. The end point of the red line is 327,000, which is inexplicably higher on the chart than the 935,573 end point of the pink line. Technically, each line goes in the right direction, but the scaling is wrong.
- It lacks context. These data points (such as they are) only suggest what is happening, not why. For example, in 2009 the U.S. Preventive Services Task Force updated its recommendation, advising a mammogram every two years instead of every year. This could explain the decline in the number of cancer screenings.
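The scaling flaw in the list above is easy to check with a little arithmetic. The sketch below places both end points on one shared linear axis. The two values come from the chart as described above; the axis range of 0 to 1,000,000 and the `axis_position` helper are assumptions chosen for illustration.

```python
def axis_position(value, axis_min, axis_max):
    """Return where a value sits on a linear axis, from 0.0 (bottom) to 1.0 (top)."""
    if axis_max <= axis_min:
        raise ValueError("axis_max must be greater than axis_min")
    return (value - axis_min) / (axis_max - axis_min)


# End points of the two lines, as quoted in the text.
ABORTIONS_END = 327_000
SCREENINGS_END = 935_573

# Assumed shared axis range (hypothetical; the original chart had no Y-axis at all).
AXIS_MIN, AXIS_MAX = 0, 1_000_000

abortion_height = axis_position(ABORTIONS_END, AXIS_MIN, AXIS_MAX)
screening_height = axis_position(SCREENINGS_END, AXIS_MIN, AXIS_MAX)

print(f"Abortions end point:  {abortion_height:.1%} of the way up the axis")
print(f"Screenings end point: {screening_height:.1%} of the way up the axis")
print(f"Ratio of heights:     {abortion_height / screening_height:.2f}")
```

On any honest shared axis that starts at zero, the 327,000 point sits at roughly a third of the height of the 935,573 point, so the red line can never end above the pink one.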
Most charts aren’t this glaringly wrong, but this one is a great example of how you can be misled simply by leaving out a few key elements of a chart. The news site Quartz showed what this chart would look like if presented properly (note that data for 2008 was not provided and is therefore missing from the chart):
On this scale, the number of abortions is relatively low, while the number of cancer screenings has decreased. However, since the individual data points are shown, we can see that the decline started around 2009, as the change in screening recommendations would predict. This is how you accurately present information in its correct context! So if you want to mislead people, all you need is a little chart fakery. Ditch the labels, manipulate the axis a little, and you too can trick people into thinking you have a better point than you do.
Hide your sources at all costs
The easier it is to see your sources, the easier it will be for other people to confirm or disprove your findings. If your findings are verifiable, then be sure to let people see your data and how you arrived at it. However, if your goal is to mislead people, never let anyone know how you arrived at your conclusions.
With proper sourcing, everyone who ever mentions a piece of data would include a link to the source. News sites should link to the studies or research they cite (and not to articles about the research). Researchers may not share their entire dataset, but the source should answer a few basic questions:
- How was the data collected? Did you call people on the phone? Stop them outside the mall? Was this a Twitter poll? The method you use to collect your data can reveal (or rule out) sampling bias.
- When was the data collected? When did you collect the data and how long did it take to collect it? Reports can quickly become out of date and trends can change over time. Including the time frame from which the data is obtained can say a lot about the conclusions you draw.
- Who collected the data? The person or group that collects the data can provide insight into how reliable the data is. A tobacco company study claiming cigarettes are safe may not be true unless confirmed by someone else.
- Whom did you ask? Particularly in the case of polls and surveys, it is important to know who was being asked. If a politician only polls those who are already friendly to him, he will not receive data representing the population as a whole.
Sourcing is used not only to avoid bias, but also to give others the opportunity to verify your claims. It opens your data, your methods, and your findings to criticism. This allows others to try to poke holes in your ideas. If your findings don’t hold water, they fall apart. The most accurate statistics are those that others can see and confirm with their own research. However, if your goal is to mislead yourself or someone else, don’t share your sources. In fact, your best defense is simply to say, “Look it up!” and walk away. Nobody can refute that.