Contributed by:

This PDF covers the following topics:

Frequency Distributions
EPAGAS
Frequency Table
Relative Frequency
Choosing Categories
Bar Graphs
Representing Quantitative Data Using a Histogram, and more.

1.
Frequency Distributions

In this section, we look at ways to organize data in order to make it user friendly. We begin by presenting two data sets from which, because of how the data are presented, it is difficult to obtain meaningful information. We then present ways to organize and display the data so that meaningful summary information can be derived at a glance.

Data Set 1: A random sample of 20 students was asked to estimate the average number of hours they spent per week studying outside of class. Their eye color and the number of pets they owned were also recorded. The results are given on the next page.

2.
Frequency Distributions

Student #     Hours Studying   Eye Color   # Pets
Student 1     10               blue        1
Student 2     7                brown       0
Student 3     15               brown       3
Student 4     20               green       1
Student 5     40               blue        2
Student 6     25               green       1
Student 7     22               hazel       0
Student 8     13               brown       5
Student 9     12               gray        4
Student 10    21               hazel       3
Student 11    16               blue        1
Student 12    22               green       1
Student 13    25               brown       1
Student 14    30               green       2
Student 15    29               brown       0
Student 16    25               green       4
Student 17    27               gray        0
Student 18    15               hazel       1
Student 19    14               blue        2
Student 20    17               brown       2

3.
Frequency Distributions

Data Set 2: EPAGAS. The Environmental Protection Agency (EPA) performs extensive tests on all new car models to determine their mileage ratings. The 25 measurements given below are the results of the tests on a sample of 25 cars of a new model.

EPA mileage ratings on 25 cars

36.3   41.0   36.9   37.1   44.9
40.5   36.5   37.6   33.9   40.2
38.5   39.0   35.5   34.8   38.6
41.0   31.8   37.3   33.1   37.0
37.1   40.3   36.7   37.0   33.9

4.
Frequency Table or Frequency Distribution

To construct a frequency table, we divide the observations into classes or categories. The number of observations in each category is called the frequency of that category. A Frequency Table or Frequency Distribution is a table showing the categories next to their frequencies.

When dealing with quantitative data (data that is numerical in nature), the categories into which we group the data may be defined as a range or an interval of numbers, such as 0 − 10, or they may be single outcomes (depending on the nature of the data). When dealing with qualitative data (non-numerical data), the categories may be single outcomes or groups of outcomes.

When grouping the data into categories, make sure that the categories are disjoint (to ensure that no observation falls into more than one category) and that every observation falls into one of the categories.
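
As a minimal sketch of the construction just described, the following Python snippet tallies the number of pets per student from Data Set 1, treating each single outcome as its own category. The names `pets` and `frequency_table` are illustrative choices, not part of the original slides.

```python
from collections import Counter

# Number of pets owned by each of the 20 students in Data Set 1.
pets = [1, 0, 3, 1, 2, 1, 0, 5, 4, 3, 1, 1, 1, 2, 0, 4, 0, 1, 2, 2]

def frequency_table(observations):
    """Return {category: frequency}, one category per distinct outcome."""
    return dict(sorted(Counter(observations).items()))

pet_frequencies = frequency_table(pets)
for category, frequency in pet_frequencies.items():
    print(f"{category} pets: {frequency} students")

# Single outcomes are automatically disjoint and cover every observation,
# so the frequencies add up to the number of observations.
print(sum(pet_frequencies.values()) == len(pets))
```

Using single outcomes as categories works here because the variable takes only a handful of distinct values; interval categories need an explicit rule for the end points.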

5.
Frequency Table or Frequency Distribution

Example: Data Set 1. Here are frequency distributions for the data on eye color and number of pets owned. (Note that we lose some information from our original data set by separating the data.)

Eye Color    # of Students        # Pets       # of Students
(Category)   (Frequency)          (Category)   (Frequency)
Blue         4                    0            4
Brown        6                    1            7
Gray         2                    2            4
Hazel        3                    3            2
Green        5                    4            2
                                  5            1
Total        20                   Total        20

Note that the sum of the frequencies equals the total number of observations, in this case the number of students in our sample.

6.
Relative Frequency

The relative frequency of a category is the frequency of that category (the number of observations that fall into the category) divided by the total number of observations:

$$\text{Relative frequency of category } i = \frac{\text{frequency of category } i}{\text{total number of observations}}$$

We may wish to record the relative frequencies of the classes (or outcomes) in our table in addition to, or instead of, the frequencies.
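
The definition can be sketched in a few lines of Python, again using the number of pets from Data Set 1. The function name `relative_frequencies` is an illustrative choice.

```python
from collections import Counter

# Number of pets owned by each of the 20 students in Data Set 1.
pets = [1, 0, 3, 1, 2, 1, 0, 5, 4, 3, 1, 1, 1, 2, 0, 4, 0, 1, 2, 2]

def relative_frequencies(observations):
    """Return {category: frequency / total number of observations}."""
    total = len(observations)
    counts = Counter(observations)
    return {category: counts[category] / total for category in sorted(counts)}

pet_rel_freq = relative_frequencies(pets)
for category, proportion in pet_rel_freq.items():
    print(f"{category} pets: {proportion:.2f}")
```

Because every observation is counted exactly once, the relative frequencies always sum to 1 (up to floating-point rounding).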

7.
Relative Frequency

Eye Color    Proportion of Students        # Pets       Proportion of Students
(Category)   (Rel. Frequency)              (Category)   (Rel. Frequency)
Blue         0.20                          0            0.20
Brown        0.30                          1            0.35
Gray         0.10                          2            0.20
Hazel        0.15                          3            0.10
Green        0.25                          4            0.10
                                           5            0.05
Total        1.0                           Total        1.0

8.
Choosing Categories

- When choosing categories, the categories should cover the entire range of observations, but should not overlap. If the categories chosen are intervals, one should specify what happens to data at the end points of the intervals.

- For example, if the categories are the intervals 0-10, 10-20, 20-30, 30-40, 40-50, one should specify which interval 10 goes into, which interval 20 goes into, etc. It is usual to use different brackets in interval notation to indicate whether an endpoint is included or not. The notation [0, 10) denotes the interval from 0 to 10 where 0 is included in the interval but 10 is not.

9.
Choosing Categories

- Common sense should be used in forming categories. Somewhere between 5 and 15 categories usually gives a meaningful picture that is easily processed. However, if there are only 3 candidates in a presidential election and you conduct a poll to determine whom those polled will vote for, then it is natural to choose 3 categories.

10.
Choosing Categories

- To choose intervals as categories for quantitative data, one might subtract the smallest observation from the largest and divide by the desired number of intervals. This gives a rough idea of the interval length. Then adjust it to a simpler (larger) number that is relatively close to it, and make intervals of that length, where the first starts at a natural point lower than the minimum observation and the last ends at a natural point greater than the maximum observation.

11.
Choosing Categories

- For example, suppose your data ranged from 1 to 29 and you wanted to create 6 categories as intervals of equal length. The length of each should be approximately

$$\frac{29 - 1}{6} \approx 4.667.$$

It is natural to use 6 intervals of length 5 in this case, with the first starting at 0 and the last ending at 30. If we decide to include the right end point and exclude the left end point for each interval, our intervals are:

(0, 5], (5, 10], (10, 15], (15, 20], (20, 25], (25, 30].
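
The recipe above can be sketched in Python as follows. The function names are hypothetical, and each pair `(a, b)` stands for the interval (a, b], with the left end point excluded and the right end point included.

```python
def rough_width(smallest, largest, n_intervals):
    """Range of the data divided by the desired number of intervals."""
    return (largest - smallest) / n_intervals

def right_closed_intervals(start, width, n_intervals):
    """Return [(a, b), ...], each pair standing for the interval (a, b]."""
    return [(start + k * width, start + (k + 1) * width)
            for k in range(n_intervals)]

def find_interval(x, intervals):
    """Return the (a, b] interval containing x, or None if x is not covered."""
    for a, b in intervals:
        if a < x <= b:
            return (a, b)
    return None

print(round(rough_width(1, 29, 6), 3))   # 4.667, adjusted by hand to 5
intervals = right_closed_intervals(start=0, width=5, n_intervals=6)
print(intervals)
print(find_interval(5, intervals))       # (0, 5): 5 belongs to (0, 5], not (5, 10]
```

Rounding the rough width up to a simpler number is left as a manual step here, since the choice of a "natural" length and starting point is a judgment call.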

12.
Choosing Categories

Example: Data Set 2. Make a frequency distribution (table) for the data on mileage ratings using 5 intervals of equal length. Include the left end point of each interval and omit the right end point.

EPA mileage ratings on 25 cars

36.3   41.0   36.9   37.1   44.9
40.5   36.5   37.6   33.9   40.2
38.5   39.0   35.5   34.8   38.6
41.0   31.8   37.3   33.1   37.0
37.1   40.3   36.7   37.0   33.9

Mileage      # of cars
(Category)   (Frequency)
[   ,   )
[   ,   )
[   ,   )
[   ,   )
[   ,   )
Total

13.
Choosing Categories

We are told to divide the data into 5 intervals of equal length. The smallest value in the data is 31.8 and the largest is 44.9, and

$$\frac{44.9 - 31.8}{5} = 2.62.$$

If we start at 30.0 and use intervals of length 3, the fifth interval ends at 45.0, so we cover the data.

Mileage      # of cars
(Category)   (Frequency)
[30, 33)     1
[33, 36)     5
[36, 39)     12
[39, 42)     6
[42, 45)     1
Total        25

The value 39.0 goes in the interval [39, 42), NOT the interval [36, 39).
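
As a check on the counts, here is a small Python sketch that bins the 25 mileage ratings into left-closed intervals of length 3 starting at 30. The function name `left_closed_counts` is an illustrative choice.

```python
# EPA mileage ratings on 25 cars (Data Set 2).
mileage = [36.3, 41.0, 36.9, 37.1, 44.9,
           40.5, 36.5, 37.6, 33.9, 40.2,
           38.5, 39.0, 35.5, 34.8, 38.6,
           41.0, 31.8, 37.3, 33.1, 37.0,
           37.1, 40.3, 36.7, 37.0, 33.9]

def left_closed_counts(data, start, width, n_intervals):
    """Count observations in [start + k*width, start + (k+1)*width) for each k."""
    counts = [0] * n_intervals
    for x in data:
        k = int((x - start) // width)   # index of the interval containing x
        if not 0 <= k < n_intervals:
            raise ValueError(f"{x} is not covered by the intervals")
        counts[k] += 1
    return counts

counts = left_closed_counts(mileage, start=30, width=3, n_intervals=5)
for k, count in enumerate(counts):
    print(f"[{30 + 3 * k}, {33 + 3 * k})  {count}")
print("Total", sum(counts))   # Total 25
```

Floor division places a value that sits exactly on a boundary, such as 39.0, in the interval that starts there, which matches the left-closed convention.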

14.
Choosing Categories

Example: Data Set 1. Make a frequency distribution (table) for the data on the estimated average number of hours spent studying in Data Set 1, using 7 intervals of equal length. Include the left end point of each interval and omit the right end point.

We are told to divide the data into 7 intervals of equal length. The smallest value in the data is 7 and the largest is 40. Since

$$\frac{40 - 7}{7} \approx 4.7,$$

it makes sense to use intervals of length 5. Starting at 5, we would end at 40. Since we have a value of 40 and we have agreed to omit right-hand end points, this does not quite work. If we start at 6, we will be OK.
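
The coverage argument can be checked mechanically. The sketch below uses a hypothetical helper `covers` that tests whether every observation lies in the half-open range spanned by the intervals.

```python
# Hours spent studying per week, Data Set 1.
hours = [10, 7, 15, 20, 40, 25, 22, 13, 12, 21,
         16, 22, 25, 30, 29, 25, 27, 15, 14, 17]

def covers(data, start, width, n_intervals):
    """True if every value lies in [start, start + n_intervals * width)."""
    end = start + n_intervals * width
    return all(start <= x < end for x in data)

print(covers(hours, start=5, width=5, n_intervals=7))   # False: 40 is an excluded right end point
print(covers(hours, start=6, width=5, n_intervals=7))   # True: the last interval is [36, 41)
print(covers(hours, start=5, width=5, n_intervals=8))   # True: an eighth interval [40, 45) is added
```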

15.
Choosing Categories

Hours Studying   # of students
(Category)       (Frequency)
[6, 11)          2
[11, 16)         5
[16, 21)         3
[21, 26)         6
[26, 31)         3
[31, 36)         0
[36, 41)         1
Total            20

16.
If we started with 5 and used 8 intervals:

Hours Studying   # of students
(Category)       (Frequency)
[5, 10)          1
[10, 15)         4
[15, 20)         4
[20, 25)         4
[25, 30)         5
[30, 35)         1
[35, 40)         0
[40, 45)         1
Total            20

17.
Representing Qualitative data graphically

Pie Chart: One way to present our qualitative data graphically is using a pie chart. The pie is represented by a circle (spanning 360°). The size of the pie slice representing each category is proportional to the relative frequency of the category. The angle that the slice makes at the center is also proportional to the relative frequency of the category; in fact the angle for a given category is given by:

$$\text{category angle at the center} = \text{relative frequency of category} \times 360^\circ.$$

The pie chart should always adhere to the area principle. That is, the proportion of the area of the pie devoted to any category is the same as the proportion of the data that lies in that category. This principle is commonly violated to alter perception and subtly promote a particular point of view (see end of slides).
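
As an illustration of the angle formula, the following sketch computes the central angle of each slice directly from the eye colors recorded for Students 1 to 20 in Data Set 1. The function name `pie_angles` is an illustrative choice.

```python
from collections import Counter

# Eye color of Students 1 to 20, in order, from Data Set 1.
eye_colors = ["blue", "brown", "brown", "green", "blue",
              "green", "hazel", "brown", "gray", "hazel",
              "blue", "green", "brown", "green", "brown",
              "green", "gray", "hazel", "blue", "brown"]

def pie_angles(observations):
    """Central angle in degrees per category: relative frequency x 360."""
    total = len(observations)
    return {category: count * 360 / total
            for category, count in Counter(observations).items()}

angles = pie_angles(eye_colors)
for category, angle in angles.items():
    print(f"{category}: {angle:.0f} degrees")
print(sum(angles.values()))   # the slices together span the full circle
```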

18.
Representing Qualitative data graphically

Example 1: Here is the data on eye color from Data Set 1 in a pie chart.

19.
My favourite pie chart

20.
Bar Graphs

We can also represent our data graphically on a bar chart or bar graph. Here the categories of the qualitative variable are represented by bars, where the height of each bar is either the category frequency, the category relative frequency, or the category percentage.

The bases of all bars should be equal in width. Having equal bases ensures that the bar graph adheres to the area principle, which in this case means that the proportion of the total area of the bars devoted to a category (= area of the bar above a category divided by the sum of the areas of all bars) should be the same as the proportion of the data in the category. This principle is often violated to promote a particular point of view (see end of slides).
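
To see why equal bases matter, here is a short sketch comparing area shares with data shares. The poll counts in `votes` and the bar widths are invented purely for illustration.

```python
def area_shares(heights, widths):
    """Fraction of the total bar area occupied by each bar."""
    areas = [h * w for h, w in zip(heights, widths)]
    total_area = sum(areas)
    return [area / total_area for area in areas]

# Hypothetical poll with three candidates (counts invented for illustration).
votes = [50, 30, 20]
data_shares = [v / sum(votes) for v in votes]

equal_bases = area_shares(votes, widths=[1, 1, 1])
unequal_bases = area_shares(votes, widths=[1, 1, 3])   # third bar drawn 3 times as wide

print(equal_bases == data_shares)    # True: area shares match the data
print(round(unequal_bases[2], 3))    # 0.429, although only 20% voted for the third candidate
```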

21.
Bar Graphs

22.
Representing Quantitative data using a Histogram

Histograms: A histogram is a bar chart in which each bar represents a category and its height represents either the frequency, relative frequency (proportion) or percentage in that category.

If a variable can only take on a finite number of values (or the values can be listed in an infinite sequence), the variable is said to be discrete.

For example, the number of pets in Data Set 1 is a discrete variable and each value formed a category of its own. In this case, each bar in the histogram is centered over the number corresponding to the category and all bars have equal width of 1 unit (see below).

23.
Representing Quantitative data using a Histogram

24.
Representing Quantitative data using a Histogram

If a variable can take all values in some interval, it is called a continuous variable. If our data consist of observations of a continuous variable, such as those in Data Set 2, the categories used for our histogram should be intervals of equal length (to adhere to the area principle), formed in a manner similar to that described above for frequency tables. The bases of the bars in our histogram are these categories of equal length, and their heights represent either the frequency, relative frequency or percentage in each category. Because it is difficult to tell from the histogram alone which endpoints are included in the categories, we adopt the convention that the categories (intervals) include the left endpoint but not the right endpoint.

25.
Representing Quantitative data using a Histogram

Example: Construct a histogram for the data in Data Set 2 on EPA mileage ratings, using the categories used above in the frequency table. Use the frequency of observations in each category to define the height of the bars.

Mileage      # of cars
(Category)   (Frequency)
[   ,   )
[   ,   )
[   ,   )
[   ,   )
[   ,   )
Total
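
One rough way to preview such a histogram is a text rendering in which each row is a category and the length of the bar is its frequency. The sketch below assumes the intervals of length 3 starting at 30 from the earlier frequency table; `histogram_rows` is a hypothetical helper, and the bars run sideways rather than upward.

```python
# EPA mileage ratings on 25 cars (Data Set 2).
mileage = [36.3, 41.0, 36.9, 37.1, 44.9,
           40.5, 36.5, 37.6, 33.9, 40.2,
           38.5, 39.0, 35.5, 34.8, 38.6,
           41.0, 31.8, 37.3, 33.1, 37.0,
           37.1, 40.3, 36.7, 37.0, 33.9]

def histogram_rows(data, start, width, n_intervals):
    """One text row per interval [a, b); the bar length is the frequency."""
    counts = [0] * n_intervals
    for x in data:
        k = int((x - start) // width)
        if not 0 <= k < n_intervals:
            raise ValueError(f"{x} is not covered by the intervals")
        counts[k] += 1
    rows = []
    for k, count in enumerate(counts):
        a = start + k * width
        rows.append(f"[{a}, {a + width})  " + "#" * count)
    return rows

rows = histogram_rows(mileage, start=30, width=3, n_intervals=5)
for row in rows:
    print(row)
```

Every row has the same thickness, so bar length alone carries the frequency, which is the text analogue of using bases of equal width.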

26.
Representing Quantitative data using a Histogram

On the left is the frequency data from above.

Hours Studying   # of students
(Category)       (Frequency)
[6, 11)          2
[11, 16)         5
[16, 21)         3
[21, 26)         6
[26, 31)         3
[31, 36)         0
[36, 41)         1
Total            20

27.
Changing the width of the categories

For large data sets one can get a finer description of the data by decreasing the width of the class intervals on the histogram. The following histograms are for the same set of data, recording the duration (in minutes) of eruptions of the Old Faithful Geyser in Yellowstone National Park.

28.
Stem and Leaf Display

Another graphical display presenting a compact picture of the data is given by a stem and leaf plot.

To construct a stem and leaf plot:

- Separate each measurement into a stem and a leaf. Generally the leaf consists of exactly one digit (the last one) and the stem consists of 1 or more digits.
  e.g.: 734: stem = 73, leaf = 4
        2.345: stem = 2.34, leaf = 5.

Sometimes the decimal point is left out of the stem but a note is added on how to read each value. For the 2.345 example we would state that 234|5 should be read as 2.345.

29.
Stem and Leaf Display

Sometimes, when the observed values have many digits, it may be helpful either to round the numbers (round 2.345 to 2.35, with stem = 2.3, leaf = 5) or to truncate (drop) digits (truncate 2.345 to 2.34, with stem = 2.3, leaf = 4).

- Write out the stems in increasing order vertically (from top to bottom) and draw a line to the right of the stems.
- Attach each leaf to the appropriate stem.
- Arrange the leaves in increasing order (from left to right).

30.
Stem and Leaf Display

Example: Make a stem and leaf plot for the data on the average number of hours spent studying per week given in Data Set 1.

10, 7, 15, 20, 40, 25, 22, 13, 12, 21
16, 22, 25, 30, 29, 25, 27, 15, 14, 17

All data points are integers with at most two digits. Writing 7 as 07, the tens digit (the stem) goes from 0 to 4.

0 | 7
1 | 0 2 3 4 5 5 6 7
2 | 0 1 2 2 5 5 5 7 9
3 | 0
4 | 0
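
The construction steps can be sketched in Python for two-digit integer data such as the hours in Data Set 1. The function name `stem_and_leaf` is an illustrative choice, and the sketch assumes non-negative integers with the tens digit (and above) as the stem and the units digit as the leaf.

```python
from collections import defaultdict

# Hours spent studying per week, Data Set 1.
hours = [10, 7, 15, 20, 40, 25, 22, 13, 12, 21,
         16, 22, 25, 30, 29, 25, 27, 15, 14, 17]

def stem_and_leaf(values):
    """Map each stem (value // 10) to its sorted list of leaves (value % 10)."""
    plot = defaultdict(list)
    for value in sorted(values):          # sorting first keeps leaves in order
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    # List every stem from smallest to largest so gaps in the data stay visible.
    return {stem: plot.get(stem, []) for stem in range(min(plot), max(plot) + 1)}

plot = stem_and_leaf(hours)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Each observation contributes exactly one leaf, so the total number of leaves equals the number of data points, which is a quick way to check a plot drawn by hand.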

31.
Extras : How to Lie with statistics

Example: This (faux) pie chart shows the needs of a cat and comes from a box containing a cat toy. Note that the “categories” are not distinct, and that the chart uses an exploding slice to distort the area for Hunting, which is the need of your cat that this particular toy is supposed to fulfill.

32.
Extras : How to Lie with statistics

33.
Extras : How to Lie with statistics

A subtle way to lie with statistics is to violate the area principle. The pie chart below is distorted to make the areas of regions devoted to some categories proportionally larger than they should be, by stretching the pie into an oval shape and adding a third dimension.

34.
Extras : How to Lie with statistics

Example: Both of the following graphs represent the same information. The graph on the left violates the area principle by making the bases of the bars (banknotes) of unequal width.

[Figure: "Purchasing Power of the Diminishing Dollar". Values shown: $1.00 (1958, Eisenhower), 94c (1963, Kennedy), 83c (1968, Johnson), 64c (1973, Nixon), 44c (1978, Carter). Vertical axis of the bar graph runs from 0.0 to 1.0.]

Is the bottom dollar note roughly half the size of the top one?
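
A small calculation shows how much a picture can understate a value when more than one dimension is scaled. The sketch below assumes, purely for illustration, that a banknote is drawn with both its width and its height proportional to the value it represents; the slide does not state the exact scaling used in the original graphic.

```python
def apparent_area_ratio(value, reference, scaled_dimensions):
    """Area of a picture relative to the reference picture when
    `scaled_dimensions` of its dimensions (1 or 2) are scaled by value / reference."""
    return (value / reference) ** scaled_dimensions

# 1978 purchasing power (44c) compared with 1958 ($1.00).
ratio_bar = apparent_area_ratio(0.44, 1.00, scaled_dimensions=1)    # bars of equal width
ratio_note = apparent_area_ratio(0.44, 1.00, scaled_dimensions=2)   # hypothetical scaled note

print(ratio_bar)              # 0.44: the bar area matches the data
print(round(ratio_note, 4))   # 0.1936: the note looks less than a fifth of the size
```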

35.
Extras : How to Lie with statistics

Example: All of the following graphs violate the area principle by replacing the bars by irregular objects, in addition to making the bases of unequal length.

Example 4.4 How to Lie with Statistics

The bar graph that follows presents the total sales figures for three realtors. When the bars are replaced with pictures, often related to the topic of the graph, the graph is called a pictogram.

[Figure: pictogram of total sales. No. 1 Realtor: $2.05 million; No. 2 Realtor: $1.41 million; No. 3 Realtor: $0.9 million.]

(a) How does the height of the home for Realtor 1 compare to that for Realtor 3?
(b) How does the area of the home for Realtor 1 compare to that for Realtor 3?

Solution
(a) The height for Realtor 1 is just slightly over twice that of Realtor 3. The heights are at the correct total sales levels.
(b) To avoid distortion of the pictures, the area of the home for Realtor 1 is more than four times the area of the home for Realtor 3.

What We’ve Learned: When you see a pictogram, be careful to interpret the results appropriately, and do not allow the area of the pictures to mislead you.

Image source: http://lilt.ilstu.edu/gmklass/pos138/datadisplay/sections/charts/3%20graphic%20data_files/image016.jpg

a number (actually, in the case of a scatterplot, two numbers). It is the job of the chart’s text to tell the reader just what each of those numbers represents.

Designing good charts, however, presents more challenges than tabular display as it draws on the talents of both the scientist and the artist. You have to know and understand your data, but you also need a good sense of how the reader will visualize the chart’s graphical elements.

Two problems arise in charting that are less common when data are displayed in tables. Poor choices, or deliberately deceptive choices, in graphic design can provide a distorted picture of the numbers and relationships they represent. A more common problem is that charts are often designed in ways that hide what the data might tell us, or that distract the reader from quickly discerning the meaning of the evidence presented in the chart. Each of these problems is illustrated in the two classic texts on data presentation: Darrell Huff’s How to Lie with Statistics (1994) and Edward Tufte’s The Visual Display of Quantitative Information (1983).

Huff’s little paperback, first published in 1954 and reissued many times thereafter, condemned graphical representations of data that “lied”. Here, the two numbers, one 3 times the magnitude of the other, are represented by two cows, one 27 times larger than the other, resulting in a Lie Factor of 9.

Figure 1: Graphical distortion of data
SOURCE: Darrell Huff. 1993. How to Lie with Statistics. WW Norton & Co, 72.

Here the figure depicts the increase in the number of milk cows in the United States, from 8 million in 1860 to twenty five million in 1936. The larger cow is thus represented as three times the height the 1860 cow. But she is also three times as wide, thus taking up nine times the
