A guide to drawing multivariate plots

Author

MSc. Nguyễn Tấn Đức | www.tuhocr.com

Updated

2024 June 06

1 A common scenario

Suppose you have a dataset with many columns (continuous variables, quantitative variables, counts). How can you show as many of these variables as possible on a single 2D plot? A multivariate plot packs several pieces of information into one figure; alternatively, you can split it into subplots so that readers can take in the message more easily. This example uses the state.x77 dataset, which describes the 50 US states in the 1970s.

A detailed video walkthrough and related materials are available in the student account.

2 Preparing the dataset

state.x77 -> df

df_1 <- as.data.frame(df)

df_1$State <- row.names(df_1)

row.names(df_1) <- NULL

df_1 <- df_1[, c(9, 1:4, 6, 8)]

names(df_1)[5] <- "Life_Exp"

names(df_1)[6] <- "HS_Grad"

df_1 # by default this dataset lists the US states in alphabetical order

            State Population Income Illiteracy Life_Exp HS_Grad   Area
1         Alabama       3615   3624        2.1    69.05    41.3  50708
2          Alaska        365   6315        1.5    69.31    66.7 566432
3         Arizona       2212   4530        1.8    70.55    58.1 113417
4        Arkansas       2110   3378        1.9    70.66    39.9  51945
5      California      21198   5114        1.1    71.71    62.6 156361
6        Colorado       2541   4884        0.7    72.06    63.9 103766
7     Connecticut       3100   5348        1.1    72.48    56.0   4862
8        Delaware        579   4809        0.9    70.06    54.6   1982
9         Florida       8277   4815        1.3    70.66    52.6  54090
10        Georgia       4931   4091        2.0    68.54    40.6  58073
11         Hawaii        868   4963        1.9    73.60    61.9   6425
12          Idaho        813   4119        0.6    71.87    59.5  82677
13       Illinois      11197   5107        0.9    70.14    52.6  55748
14        Indiana       5313   4458        0.7    70.88    52.9  36097
15           Iowa       2861   4628        0.5    72.56    59.0  55941
16         Kansas       2280   4669        0.6    72.58    59.9  81787
17       Kentucky       3387   3712        1.6    70.10    38.5  39650
18      Louisiana       3806   3545        2.8    68.76    42.2  44930
19          Maine       1058   3694        0.7    70.39    54.7  30920
20       Maryland       4122   5299        0.9    70.22    52.3   9891
21  Massachusetts       5814   4755        1.1    71.83    58.5   7826
22       Michigan       9111   4751        0.9    70.63    52.8  56817
23      Minnesota       3921   4675        0.6    72.96    57.6  79289
24    Mississippi       2341   3098        2.4    68.09    41.0  47296
25       Missouri       4767   4254        0.8    70.69    48.8  68995
26        Montana        746   4347        0.6    70.56    59.2 145587
27       Nebraska       1544   4508        0.6    72.60    59.3  76483
28         Nevada        590   5149        0.5    69.03    65.2 109889
29  New Hampshire        812   4281        0.7    71.23    57.6   9027
30     New Jersey       7333   5237        1.1    70.93    52.5   7521
31     New Mexico       1144   3601        2.2    70.32    55.2 121412
32       New York      18076   4903        1.4    70.55    52.7  47831
33 North Carolina       5441   3875        1.8    69.21    38.5  48798
34   North Dakota        637   5087        0.8    72.78    50.3  69273
35           Ohio      10735   4561        0.8    70.82    53.2  40975
36       Oklahoma       2715   3983        1.1    71.42    51.6  68782
37         Oregon       2284   4660        0.6    72.13    60.0  96184
38   Pennsylvania      11860   4449        1.0    70.43    50.2  44966
39   Rhode Island        931   4558        1.3    71.90    46.4   1049
40 South Carolina       2816   3635        2.3    67.96    37.8  30225
41   South Dakota        681   4167        0.5    72.08    53.3  75955
42      Tennessee       4173   3821        1.7    70.11    41.8  41328
43          Texas      12237   4188        2.2    70.90    47.4 262134
44           Utah       1203   4022        0.6    72.90    67.3  82096
45        Vermont        472   3907        0.6    71.64    57.1   9267
46       Virginia       4981   4701        1.4    70.08    47.8  39780
47     Washington       3559   4864        0.6    71.72    63.5  66570
48  West Virginia       1799   3617        1.4    69.48    41.6  24070
49      Wisconsin       4589   4468        0.7    72.48    54.5  54464
50        Wyoming        376   4566        0.6    70.29    62.9  97203

The columns are as follows:

Population: population estimate as of July 1, 1975 (in thousands)
Income: per capita income in 1974 (USD)
Illiteracy: illiteracy rate in 1970 (percent of population)
Life_Exp: life expectancy in years (1969–71)
HS_Grad: percent high-school graduates in 1970
Area: land area in square miles
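
The same table can also be built by selecting columns by name rather than by position, which is a little more robust if the column order ever changes. This is only a sketch; the names df_alt and keep are introduced here for illustration and are not part of the original workflow.

```r
# Build the working data frame by column name (df_alt is an illustrative name)
df_alt <- as.data.frame(state.x77)
df_alt$State <- row.names(df_alt)
row.names(df_alt) <- NULL

# Columns to keep, in display order
keep <- c("State", "Population", "Income", "Illiteracy",
          "Life Exp", "HS Grad", "Area")
df_alt <- df_alt[, keep]

# Replace the spaces in the two original column names
names(df_alt)[names(df_alt) == "Life Exp"] <- "Life_Exp"
names(df_alt)[names(df_alt) == "HS Grad"]  <- "HS_Grad"

str(df_alt)
```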

3 Drawing a two-variable scatter plot

We can use the Income and Life_Exp columns to show how the US states differ on these two variables.

plot(formula = Life_Exp ~ Income,
     data = df_1,
     pch = 19,
     col = "darkgreen")

text(x = df_1$Income,
     y = df_1$Life_Exp,
     cex = 0.7,
     labels = df_1$State,
     pos = 3)

4 Adding population information

Use cut() to bin the population variable into groups; this converts a quantitative variable into a categorical one. The col argument (color) then represents the population group.

df_1$Population_group <- cut(x = df_1$Population,
                       breaks = c(0, 500, 1000, 5000, 10000, 30000),
                       labels = c("≤ 500",
                                  "500 < population ≤ 1000",
                                  "1000 < population ≤ 5000",
                                  "5000 < population ≤ 10000",
                                  "> 10000"))

df_1 |> dplyr::arrange(desc(Population_group), desc(Income)) -> df_2

df_2

            State Population Income Illiteracy Life_Exp HS_Grad   Area          Population_group
1      California      21198   5114        1.1    71.71    62.6 156361                   > 10000
2        Illinois      11197   5107        0.9    70.14    52.6  55748                   > 10000
3        New York      18076   4903        1.4    70.55    52.7  47831                   > 10000
4            Ohio      10735   4561        0.8    70.82    53.2  40975                   > 10000
5    Pennsylvania      11860   4449        1.0    70.43    50.2  44966                   > 10000
6           Texas      12237   4188        2.2    70.90    47.4 262134                   > 10000
7      New Jersey       7333   5237        1.1    70.93    52.5   7521 5000 < population ≤ 10000
8         Florida       8277   4815        1.3    70.66    52.6  54090 5000 < population ≤ 10000
9   Massachusetts       5814   4755        1.1    71.83    58.5   7826 5000 < population ≤ 10000
10       Michigan       9111   4751        0.9    70.63    52.8  56817 5000 < population ≤ 10000
11        Indiana       5313   4458        0.7    70.88    52.9  36097 5000 < population ≤ 10000
12 North Carolina       5441   3875        1.8    69.21    38.5  48798 5000 < population ≤ 10000
13    Connecticut       3100   5348        1.1    72.48    56.0   4862  1000 < population ≤ 5000
14       Maryland       4122   5299        0.9    70.22    52.3   9891  1000 < population ≤ 5000
15       Colorado       2541   4884        0.7    72.06    63.9 103766  1000 < population ≤ 5000
16     Washington       3559   4864        0.6    71.72    63.5  66570  1000 < population ≤ 5000
17       Virginia       4981   4701        1.4    70.08    47.8  39780  1000 < population ≤ 5000
18      Minnesota       3921   4675        0.6    72.96    57.6  79289  1000 < population ≤ 5000
19         Kansas       2280   4669        0.6    72.58    59.9  81787  1000 < population ≤ 5000
20         Oregon       2284   4660        0.6    72.13    60.0  96184  1000 < population ≤ 5000
21           Iowa       2861   4628        0.5    72.56    59.0  55941  1000 < population ≤ 5000
22        Arizona       2212   4530        1.8    70.55    58.1 113417  1000 < population ≤ 5000
23       Nebraska       1544   4508        0.6    72.60    59.3  76483  1000 < population ≤ 5000
24      Wisconsin       4589   4468        0.7    72.48    54.5  54464  1000 < population ≤ 5000
25       Missouri       4767   4254        0.8    70.69    48.8  68995  1000 < population ≤ 5000
26        Georgia       4931   4091        2.0    68.54    40.6  58073  1000 < population ≤ 5000
27           Utah       1203   4022        0.6    72.90    67.3  82096  1000 < population ≤ 5000
28       Oklahoma       2715   3983        1.1    71.42    51.6  68782  1000 < population ≤ 5000
29      Tennessee       4173   3821        1.7    70.11    41.8  41328  1000 < population ≤ 5000
30       Kentucky       3387   3712        1.6    70.10    38.5  39650  1000 < population ≤ 5000
31          Maine       1058   3694        0.7    70.39    54.7  30920  1000 < population ≤ 5000
32 South Carolina       2816   3635        2.3    67.96    37.8  30225  1000 < population ≤ 5000
33        Alabama       3615   3624        2.1    69.05    41.3  50708  1000 < population ≤ 5000
34  West Virginia       1799   3617        1.4    69.48    41.6  24070  1000 < population ≤ 5000
35     New Mexico       1144   3601        2.2    70.32    55.2 121412  1000 < population ≤ 5000
36      Louisiana       3806   3545        2.8    68.76    42.2  44930  1000 < population ≤ 5000
37       Arkansas       2110   3378        1.9    70.66    39.9  51945  1000 < population ≤ 5000
38    Mississippi       2341   3098        2.4    68.09    41.0  47296  1000 < population ≤ 5000
39         Nevada        590   5149        0.5    69.03    65.2 109889   500 < population ≤ 1000
40   North Dakota        637   5087        0.8    72.78    50.3  69273   500 < population ≤ 1000
41         Hawaii        868   4963        1.9    73.60    61.9   6425   500 < population ≤ 1000
42       Delaware        579   4809        0.9    70.06    54.6   1982   500 < population ≤ 1000
43   Rhode Island        931   4558        1.3    71.90    46.4   1049   500 < population ≤ 1000
44        Montana        746   4347        0.6    70.56    59.2 145587   500 < population ≤ 1000
45  New Hampshire        812   4281        0.7    71.23    57.6   9027   500 < population ≤ 1000
46   South Dakota        681   4167        0.5    72.08    53.3  75955   500 < population ≤ 1000
47          Idaho        813   4119        0.6    71.87    59.5  82677   500 < population ≤ 1000
48         Alaska        365   6315        1.5    69.31    66.7 566432                     ≤ 500
49        Wyoming        376   4566        0.6    70.29    62.9  97203                     ≤ 500
50        Vermont        472   3907        0.6    71.64    57.1   9267                     ≤ 500
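
To check how cut() assigned the states, the groups can be tabulated. The sketch below recomputes the bins directly from state.x77 so it runs on its own; the plain-ASCII labels and the name pop_group are illustrative choices, not the labels used above.

```r
# Population is in thousands; cut() intervals are right-closed by default:
# (0, 500], (500, 1000], (1000, 5000], (5000, 10000], (10000, 30000]
pop <- state.x77[, "Population"]

pop_group <- cut(pop,
                 breaks = c(0, 500, 1000, 5000, 10000, 30000),
                 labels = c("<=500", "500-1000", "1000-5000",
                            "5000-10000", ">10000"))

# Number of states in each population group
table(pop_group)

# A value sitting exactly on a break falls into the lower interval
cut(c(500, 501), breaks = c(0, 500, 1000))
```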

Index a vector by a factor to build a vector containing the color that corresponds to each level of Population_group.

levels(df_2$Population_group)

[1] "≤ 500"                     "500 < population ≤ 1000"   "1000 < population ≤ 5000"  "5000 < population ≤ 10000" "> 10000"

color_area_group <- c("#ff99e6", # lowest population level
                      "#C17EFB",
                      "#7900cc",
                      "#cc0000",
                      "#ff0000") # highest population level

color_area_group[df_2$Population_group]

 [1] "#ff0000" "#ff0000" "#ff0000" "#ff0000" "#ff0000" "#ff0000" "#cc0000" "#cc0000" "#cc0000" "#cc0000" "#cc0000" "#cc0000" "#7900cc" "#7900cc" "#7900cc" "#7900cc" "#7900cc" "#7900cc" "#7900cc" "#7900cc" "#7900cc" "#7900cc" "#7900cc" "#7900cc" "#7900cc" "#7900cc" "#7900cc" "#7900cc" "#7900cc" "#7900cc" "#7900cc" "#7900cc" "#7900cc" "#7900cc" "#7900cc" "#7900cc" "#7900cc" "#7900cc" "#C17EFB" "#C17EFB" "#C17EFB" "#C17EFB" "#C17EFB" "#C17EFB" "#C17EFB" "#C17EFB" "#C17EFB" "#ff99e6" "#ff99e6" "#ff99e6"
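
Indexing by a factor works because R uses the factor's integer codes, not its printed labels, as positions. A minimal sketch with a toy factor (sizes and palette_demo are made-up names for illustration):

```r
# A toy factor whose level order is low < mid < high
sizes <- factor(c("mid", "low", "high", "low"),
                levels = c("low", "mid", "high"))

# One color per level, in the same order as the levels
palette_demo <- c("#ff99e6", "#7900cc", "#ff0000")

as.integer(sizes)     # integer codes: 2 1 3 1
palette_demo[sizes]   # "#7900cc" "#ff99e6" "#ff0000" "#ff99e6"
```

The order of the color vector therefore has to match the order of levels(), not the order in which values appear in the data.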

plot(formula = Life_Exp ~ Income,
     data = df_2,
     pch = 19,
     col = color_area_group[df_2$Population_group])

text(x = df_2$Income,
     y = df_2$Life_Exp,
     cex = 0.7,
     labels = df_2$State,
     pos = 3)

5 Adding area information

Approach 1: Represent each state's area by the size of its point character through the cex argument. To do this, convert the areas to proportions across states and map them to suitable cex values (this is how a bubble chart is drawn).

Approach 2: Bin the data in the same way as the population variable, then assign a suitable cex value to each bin. We use approach 2 here.

### approach 1
# prop.table(df_2$Area) -> df_2$cex_area
# 
# library(car)
# df_2$cex_area_ok <- car::recode(df_2$cex_area, 
#                                "0.0001:0.0005 = 1; 
#                                 0.0005:0.01  = 1.5; 
#                                 0.01:0.05 = 2;
#                                 0.05:0.1 = 2.5;
#                                 else = 3")

### approach 2
df_2$Area_group <- cut(x = df_2$Area,
                       breaks = c(0, 5000, 10000, 30000, 
                                  100000, 300000, 600000),
                       labels = c("≤ 5000",
                                  "5000 < area ≤ 10000",
                                  "10000 < area ≤ 30000",
                                  "30000 < area ≤ 100000",
                                  "100000 < area ≤ 300000",
                                  "> 300000"))

df_2$cex_area_ok <- car::recode(df_2$Area_group, 
                               " '≤ 5000' = 1; 
                                '5000 < area ≤ 10000'  = 1.25; 
                                '10000 < area ≤ 30000' = 1.5;
                                '30000 < area ≤ 100000' = 2;
                                '100000 < area ≤ 300000' = 2.75;
                                else = 3")

df_2$cex_area_ok <- as.character(df_2$cex_area_ok)
df_2$cex_area_ok <- as.numeric(df_2$cex_area_ok)

df_2

            State Population Income Illiteracy Life_Exp HS_Grad   Area          Population_group             Area_group cex_area_ok
1      California      21198   5114        1.1    71.71    62.6 156361                   > 10000 100000 < area ≤ 300000        2.75
2        Illinois      11197   5107        0.9    70.14    52.6  55748                   > 10000  30000 < area ≤ 100000        2.00
3        New York      18076   4903        1.4    70.55    52.7  47831                   > 10000  30000 < area ≤ 100000        2.00
4            Ohio      10735   4561        0.8    70.82    53.2  40975                   > 10000  30000 < area ≤ 100000        2.00
5    Pennsylvania      11860   4449        1.0    70.43    50.2  44966                   > 10000  30000 < area ≤ 100000        2.00
6           Texas      12237   4188        2.2    70.90    47.4 262134                   > 10000 100000 < area ≤ 300000        2.75
7      New Jersey       7333   5237        1.1    70.93    52.5   7521 5000 < population ≤ 10000    5000 < area ≤ 10000        1.25
8         Florida       8277   4815        1.3    70.66    52.6  54090 5000 < population ≤ 10000  30000 < area ≤ 100000        2.00
9   Massachusetts       5814   4755        1.1    71.83    58.5   7826 5000 < population ≤ 10000    5000 < area ≤ 10000        1.25
10       Michigan       9111   4751        0.9    70.63    52.8  56817 5000 < population ≤ 10000  30000 < area ≤ 100000        2.00
11        Indiana       5313   4458        0.7    70.88    52.9  36097 5000 < population ≤ 10000  30000 < area ≤ 100000        2.00
12 North Carolina       5441   3875        1.8    69.21    38.5  48798 5000 < population ≤ 10000  30000 < area ≤ 100000        2.00
13    Connecticut       3100   5348        1.1    72.48    56.0   4862  1000 < population ≤ 5000                 ≤ 5000        1.00
14       Maryland       4122   5299        0.9    70.22    52.3   9891  1000 < population ≤ 5000    5000 < area ≤ 10000        1.25
15       Colorado       2541   4884        0.7    72.06    63.9 103766  1000 < population ≤ 5000 100000 < area ≤ 300000        2.75
16     Washington       3559   4864        0.6    71.72    63.5  66570  1000 < population ≤ 5000  30000 < area ≤ 100000        2.00
17       Virginia       4981   4701        1.4    70.08    47.8  39780  1000 < population ≤ 5000  30000 < area ≤ 100000        2.00
18      Minnesota       3921   4675        0.6    72.96    57.6  79289  1000 < population ≤ 5000  30000 < area ≤ 100000        2.00
19         Kansas       2280   4669        0.6    72.58    59.9  81787  1000 < population ≤ 5000  30000 < area ≤ 100000        2.00
20         Oregon       2284   4660        0.6    72.13    60.0  96184  1000 < population ≤ 5000  30000 < area ≤ 100000        2.00
21           Iowa       2861   4628        0.5    72.56    59.0  55941  1000 < population ≤ 5000  30000 < area ≤ 100000        2.00
22        Arizona       2212   4530        1.8    70.55    58.1 113417  1000 < population ≤ 5000 100000 < area ≤ 300000        2.75
23       Nebraska       1544   4508        0.6    72.60    59.3  76483  1000 < population ≤ 5000  30000 < area ≤ 100000        2.00
24      Wisconsin       4589   4468        0.7    72.48    54.5  54464  1000 < population ≤ 5000  30000 < area ≤ 100000        2.00
25       Missouri       4767   4254        0.8    70.69    48.8  68995  1000 < population ≤ 5000  30000 < area ≤ 100000        2.00
26        Georgia       4931   4091        2.0    68.54    40.6  58073  1000 < population ≤ 5000  30000 < area ≤ 100000        2.00
27           Utah       1203   4022        0.6    72.90    67.3  82096  1000 < population ≤ 5000  30000 < area ≤ 100000        2.00
28       Oklahoma       2715   3983        1.1    71.42    51.6  68782  1000 < population ≤ 5000  30000 < area ≤ 100000        2.00
29      Tennessee       4173   3821        1.7    70.11    41.8  41328  1000 < population ≤ 5000  30000 < area ≤ 100000        2.00
30       Kentucky       3387   3712        1.6    70.10    38.5  39650  1000 < population ≤ 5000  30000 < area ≤ 100000        2.00
31          Maine       1058   3694        0.7    70.39    54.7  30920  1000 < population ≤ 5000  30000 < area ≤ 100000        2.00
32 South Carolina       2816   3635        2.3    67.96    37.8  30225  1000 < population ≤ 5000  30000 < area ≤ 100000        2.00
33        Alabama       3615   3624        2.1    69.05    41.3  50708  1000 < population ≤ 5000  30000 < area ≤ 100000        2.00
34  West Virginia       1799   3617        1.4    69.48    41.6  24070  1000 < population ≤ 5000   10000 < area ≤ 30000        1.50
35     New Mexico       1144   3601        2.2    70.32    55.2 121412  1000 < population ≤ 5000 100000 < area ≤ 300000        2.75
36      Louisiana       3806   3545        2.8    68.76    42.2  44930  1000 < population ≤ 5000  30000 < area ≤ 100000        2.00
37       Arkansas       2110   3378        1.9    70.66    39.9  51945  1000 < population ≤ 5000  30000 < area ≤ 100000        2.00
38    Mississippi       2341   3098        2.4    68.09    41.0  47296  1000 < population ≤ 5000  30000 < area ≤ 100000        2.00
39         Nevada        590   5149        0.5    69.03    65.2 109889   500 < population ≤ 1000 100000 < area ≤ 300000        2.75
40   North Dakota        637   5087        0.8    72.78    50.3  69273   500 < population ≤ 1000  30000 < area ≤ 100000        2.00
41         Hawaii        868   4963        1.9    73.60    61.9   6425   500 < population ≤ 1000    5000 < area ≤ 10000        1.25
42       Delaware        579   4809        0.9    70.06    54.6   1982   500 < population ≤ 1000                 ≤ 5000        1.00
43   Rhode Island        931   4558        1.3    71.90    46.4   1049   500 < population ≤ 1000                 ≤ 5000        1.00
44        Montana        746   4347        0.6    70.56    59.2 145587   500 < population ≤ 1000 100000 < area ≤ 300000        2.75
45  New Hampshire        812   4281        0.7    71.23    57.6   9027   500 < population ≤ 1000    5000 < area ≤ 10000        1.25
46   South Dakota        681   4167        0.5    72.08    53.3  75955   500 < population ≤ 1000  30000 < area ≤ 100000        2.00
47          Idaho        813   4119        0.6    71.87    59.5  82677   500 < population ≤ 1000  30000 < area ≤ 100000        2.00
48         Alaska        365   6315        1.5    69.31    66.7 566432                     ≤ 500               > 300000        3.00
49        Wyoming        376   4566        0.6    70.29    62.9  97203                     ≤ 500  30000 < area ≤ 100000        2.00
50        Vermont        472   3907        0.6    71.64    57.1   9267                     ≤ 500    5000 < area ≤ 10000        1.25

plot(formula = Life_Exp ~ Income,
     data = df_2,
     pch = 19,
     cex = df_2$cex_area_ok,
     col = color_area_group[df_2$Population_group])

text(x = df_2$Income,
     y = df_2$Life_Exp,
     cex = 0.7,
     labels = df_2$State,
     pos = 3)
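
For comparison, a bubble chart in the spirit of approach 1 can also use a continuous size scale instead of bins. The sketch below is not the commented-out approach 1 code: the square-root scaling and the target cex range of 1 to 3 are assumptions chosen here so that symbol area, rather than radius, grows with land area.

```r
area <- state.x77[, "Area"]

# Symbol area grows with cex^2, so scale cex with sqrt(area)
r <- sqrt(area)

# Rescale linearly to the range [1, 3] (an arbitrary, illustrative choice)
cex_cont <- 1 + 2 * (r - min(r)) / (max(r) - min(r))

plot(x = state.x77[, "Income"],
     y = state.x77[, "Life Exp"],
     pch = 19,
     cex = cex_cont,
     col = adjustcolor("darkgreen", alpha.f = 0.5),
     xlab = "Income",
     ylab = "Life_Exp")
```

A continuous scale keeps more information, but binned sizes, as used above, are easier to explain in a legend.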

6 Adding a life-expectancy condition

Point characters 21 through 25 have both a border color (col) and a fill color (bg). We use these two properties to encode a life-expectancy condition in the plot: states with Life_Exp ≤ 70 get a cyan border. To keep overlapping points readable, we also make the fill colors slightly transparent.

plot(formula = Life_Exp ~ Income,
     data = df_2,
     pch = 21,
     cex = df_2$cex_area_ok,
     bg = adjustcolor(color_area_group[df_2$Population_group], alpha.f = 0.8),
     lwd = 1,
     col = ifelse(df_2$Life_Exp <= 70,
                  yes = adjustcolor("cyan", alpha.f = 1),
                  no = adjustcolor("transparent", alpha.f = 1)))

text(x = df_2$Income,
     y = df_2$Life_Exp,
     cex = 0.7,
     labels = df_2$State,
     pos = 3)

abline(h = 70, lty = 2, lwd = 2, col = "darkgreen")
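
The behaviour of the fillable symbols is easy to inspect in isolation. The sketch below draws pch 21 to 25 with a separate border and fill, using arbitrary coordinates and colors picked only for illustration.

```r
# The five fillable symbols: col sets the border, bg sets the fill
plot(x = 1:5, y = rep(1, 5),
     pch = 21:25,
     cex = 3,
     lwd = 2,
     col = "cyan",                                 # border color
     bg = adjustcolor("#7900cc", alpha.f = 0.8),   # fill color at 80% opacity
     xlim = c(0, 6), ylim = c(0.5, 1.5),
     xlab = "", ylab = "", yaxt = "n")

text(x = 1:5, y = 0.8, labels = paste("pch", 21:25))

# adjustcolor() returns a hex color with an alpha channel appended
adjustcolor("red", alpha.f = 0.5)   # "#FF000080"
```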

7 Combining two existing variables into a new one

The dataset contains the variable Illiteracy, the illiteracy rate as a percentage of the population, so we can derive Literacy (100 - Illiteracy), the percentage of the population that is literate.

Next, HS_Grad is the percentage of high-school graduates in the population. We could plot it directly, but to illustrate how to get more out of the data we create a new variable, Edu_index: the share of high-school graduates among the literate population, used to compare educational attainment across states. This combines the two original variables Illiteracy and HS_Grad into one new variable, Edu_index, and adds information to the plot.

View the final dataset

df_2$Literacy <- 100 - df_2$Illiteracy

df_2$Edu_index <- df_2$HS_Grad / df_2$Literacy

## create groups for the variable `Edu_index`

df_2$Edu_index_group <- car::recode(df_2$Edu_index, 
                                    "0:0.4 = 'low'; 
                                     0.4:0.6  = 'medium'; 
                                     else = 'high'")

df_2$Edu_index_group <- factor(df_2$Edu_index_group,
                               levels = c("low", "medium", "high"),
                               ordered = TRUE)
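
The index can be checked by hand for a single state, and the same three groups can be produced with base R's cut() in place of car::recode. This is a sketch under the assumption that right-closed intervals, (0, 0.4], (0.4, 0.6] and above 0.6, are the intended grouping; edu_index and edu_group are illustrative names.

```r
hs_grad  <- state.x77[, "HS Grad"]
literacy <- 100 - state.x77[, "Illiteracy"]   # both columns are percentages

# Share of high-school graduates among the literate population
edu_index <- hs_grad / literacy

# Alabama: 41.3 / (100 - 2.1) = 41.3 / 97.9
round(edu_index["Alabama"], 3)

# Base-R grouping with right-closed intervals
edu_group <- cut(edu_index,
                 breaks = c(0, 0.4, 0.6, Inf),
                 labels = c("low", "medium", "high"),
                 ordered_result = TRUE)

table(edu_group)
```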

library(kableExtra)
df_2 %>% kbl(format = "html") %>%
  kable_styling(bootstrap_options = c("striped", 
                                      "hover", 
                                      "condensed", 
                                      "bordered", 
                                      "responsive")) %>%
  row_spec(0, bold = TRUE, align = "c", color = "white", background = "#1d6c00") %>% 
  kable_classic(full_width = TRUE, html_font = "arial") -> output

save_kable(output, file = "output.html")

Draw the plot with the pch argument representing Edu_index for each state.

plot(formula = Life_Exp ~ Income,
     data = df_2,
     pch = c(22, 24, 21)[df_2$Edu_index_group],
     cex = df_2$cex_area_ok,
     bg = adjustcolor(color_area_group[df_2$Population_group], alpha.f = 0.8),
     lwd = 1,
     col = ifelse(df_2$Life_Exp <= 70,
                  yes = adjustcolor("cyan", alpha.f = 1),
                  no = adjustcolor("transparent", alpha.f = 1)))

text(x = df_2$Income,
     y = df_2$Life_Exp,
     cex = 0.7,
     labels = df_2$State,
     pos = 3)

abline(h = 70, lty = 2, lwd = 2, col = "darkgreen")

8 Adjusting the labels to avoid overlap

plot(formula = Life_Exp ~ Income,
     data = df_2,
     pch = c(22, 24, 21)[df_2$Edu_index_group],
     cex = df_2$cex_area_ok,
     bg = adjustcolor(color_area_group[df_2$Population_group], alpha.f = 0.8),
     lwd = 1,
     col = ifelse(df_2$Life_Exp <= 70,
                  yes = adjustcolor("cyan", alpha.f = 1),
                  no = adjustcolor("transparent", alpha.f = 1)))

# text(x = df_2$Income,
#      y = df_2$Life_Exp,
#      cex = 0.7,
#      labels = df_2$State,
#      pos = 3)

abline(h = 70, lty = 2, lwd = 2, col = "darkgreen")

library(basicPlotteR)
basicPlotteR::addTextLabels(xCoords = df_2$Income,
                            yCoords = df_2$Life_Exp,
                            labels = df_2$State,
                            keepLabelsInside = TRUE,
                            # border = "black",
                            # col.background = "lightyellow",
                            avoidPoints = TRUE,
                            col.label = "black",
                            col.line = NA,
                            cex.label = 0.8,
                            cex.pt = 0.9)
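
If basicPlotteR is not available, a rough base-R fallback is to alternate the label side with a pos vector. This only reduces overlap and does not remove it; the rank-based rule below is an assumption made for illustration, not part of the original tutorial.

```r
income   <- state.x77[, "Income"]
life_exp <- state.x77[, "Life Exp"]

# Alternate the label side by rank along the x-axis: 3 = above, 1 = below
pos_alt <- ifelse(rank(income, ties.method = "first") %% 2 == 0, 3, 1)

plot(income, life_exp,
     pch = 19,
     col = "darkgreen",
     xlab = "Income",
     ylab = "Life_Exp")

# text() accepts one pos value per label
text(x = income,
     y = life_exp,
     labels = rownames(state.x77),
     cex = 0.7,
     pos = pos_alt)
```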

9 Adding contours that show the density of data points

When a scatter plot has densely clustered points, we can show the 2D distribution of x and y by drawing contours (in effect, a variable z representing the density of points per unit area of the original scatter plot). Here the contours are data ellipses, which treat the two variables as roughly bivariate normal. See here for more details.¹

dataEllipse superimposes the normal-probability contours over a scatterplot of the data

par(mar = c(5, 5, 5, 2))

plot(formula = Life_Exp ~ Income,
     data = df_2,
     type = "n",
     pch = c(22, 24, 21)[df_2$Edu_index_group],
     cex = df_2$cex_area_ok,
     bg = adjustcolor(color_area_group[df_2$Population_group], alpha.f = 0.8),
     lwd = 1,
     col = ifelse(df_2$Life_Exp <= 70,
                  yes = adjustcolor("cyan", alpha.f = 1),
                  no = adjustcolor("transparent", alpha.f = 1)),
     xlim = c(2000, 7000),
     ylim = c(67, 74),
     xaxs = "i",
     yaxs = "i",
     xlab = "Income (USD)",
     ylab = "Life expectancy (years)",
     las = 1)

car::dataEllipse(x = df_2$Income,
                 y = df_2$Life_Exp,
                 plot.points = FALSE,
                 col = "lightgreen",
                 center.pch = FALSE,
                 fill = TRUE,
                 levels = c(0.5, 0.9),
                 fill.alpha = 0.3,
                 grid = TRUE,
                 lty = 2)

abline(h = 70, lty = 3, lwd = 2, col = adjustcolor("gray", alpha.f = 0.8))

# points() adds to the existing plot, so the axis settings from plot()
# (xlim, ylim, xaxs, yaxs, las) are not repeated here
points(formula = Life_Exp ~ Income,
     data = df_2,
     type = "p",
     pch = c(22, 24, 21)[df_2$Edu_index_group],
     cex = df_2$cex_area_ok,
     bg = adjustcolor(color_area_group[df_2$Population_group], alpha.f = 0.8),
     lwd = 1,
     col = ifelse(df_2$Life_Exp <= 70,
                  yes = adjustcolor("cyan", alpha.f = 1),
                  no = adjustcolor("transparent", alpha.f = 1)))

library(basicPlotteR)
basicPlotteR::addTextLabels(xCoords = df_2$Income,
                            yCoords = df_2$Life_Exp,
                            labels = df_2$State,
                            keepLabelsInside = TRUE,
                            # border = "black",
                            # col.background = "lightyellow",
                            avoidPoints = TRUE,
                            col.label = "black",
                            col.line = NA,
                            cex.label = 0.8,
                            cex.pt = 0.9)
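
To see what such an ellipse represents, it can be built by hand from the sample mean and covariance matrix, following the covariance-ellipse idea in the reference listed at the end. The sketch below uses a chi-square radius, which is an assumption; car::dataEllipse may compute its radius slightly differently, so the curves need not coincide exactly with its output. The helper ellipse_points is a hypothetical name.

```r
xy <- cbind(Income = state.x77[, "Income"],
            Life_Exp = state.x77[, "Life Exp"])
center <- colMeans(xy)
S <- cov(xy)

# Boundary of the ellipse {x : (x - center)' S^-1 (x - center) = qchisq(level, 2)}
ellipse_points <- function(center, S, level, n = 200) {
  theta <- seq(0, 2 * pi, length.out = n)
  circle <- cbind(cos(theta), sin(theta))   # points on the unit circle
  radius <- sqrt(qchisq(level, df = 2))
  # chol(S) is the upper-triangular R with t(R) %*% R == S
  sweep(radius * circle %*% chol(S), 2, center, "+")
}

plot(xy, pch = 19, col = "darkgreen")
for (lv in c(0.5, 0.9)) {
  lines(ellipse_points(center, S, lv), lty = 2, col = "gray40")
}
```

Every point on a given curve has the same Mahalanobis distance from the center, which is why the ellipse stretches along the direction of greatest joint variation.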

10 Adding legends and finishing the plot

par(mar = c(5, 5, 5, 2))
par(font.lab = 2)
par(font.axis = 2)

plot(formula = Life_Exp ~ Income,
     data = df_2,
     type = "n",
     pch = c(22, 24, 21)[df_2$Edu_index_group],
     cex = df_2$cex_area_ok,
     bg = adjustcolor(color_area_group[df_2$Population_group], alpha.f = 0.8),
     lwd = 1,
     col = ifelse(df_2$Life_Exp <= 70,
                  yes = adjustcolor("cyan", alpha.f = 1),
                  no = adjustcolor("transparent", alpha.f = 1)),
     xlim = c(2000, 7000),
     ylim = c(67, 74),
     xaxs = "i",
     yaxs = "i",
     xlab = "Per capita income (USD)",
     ylab = "Life expectancy (years)",
     las = 1)

car::dataEllipse(x = df_2$Income,
                 y = df_2$Life_Exp,
                 plot.points = FALSE,
                 col = "lightgreen",
                 center.pch = FALSE,
                 fill = TRUE,
                 levels = c(0.5, 0.9),
                 fill.alpha = 0.3,
                 grid = TRUE,
                 lty = 2)

abline(h = 70, lty = 3, lwd = 2, col = adjustcolor("gray", alpha.f = 0.8))

# points() adds to the existing plot, so the axis settings from plot()
# (xlim, ylim, xaxs, yaxs, las) are not repeated here
points(formula = Life_Exp ~ Income,
     data = df_2,
     type = "p",
     pch = c(22, 24, 21)[df_2$Edu_index_group],
     cex = df_2$cex_area_ok,
     bg = adjustcolor(color_area_group[df_2$Population_group], alpha.f = 0.8),
     lwd = 1,
     col = ifelse(df_2$Life_Exp <= 70,
                  yes = adjustcolor("cyan", alpha.f = 1),
                  no = adjustcolor("transparent", alpha.f = 1)))

library(basicPlotteR)
basicPlotteR::addTextLabels(xCoords = df_2$Income,
                            yCoords = df_2$Life_Exp,
                            labels = df_2$State,
                            keepLabelsInside = TRUE,
                            # border = "black",
                            # col.background = "lightyellow",
                            avoidPoints = TRUE,
                            col.label = "black",
                            col.line = NA,
                            cex.label = 0.8,
                            cex.pt = 0.9)

### population legend

legend(x = "topright",
       y = NULL,
       title = "Population (thousands)",
       title.font = 2,
       legend = levels(df_2$Population_group),
       col = color_area_group,
       pt.cex = 1.5,
       y.intersp = 1.25,
       x.intersp = 1.25,
       inset = 0.01,
       bty = "n",
       pch = 19)

### area legend

leg <- legend(x = "bottomright",
       y = NULL,
       title = "Area (square miles)",
       legend = c("≤ 5000",
                  "5000 < area ≤ 10000",
                  "10000 < area ≤ 30000",
                  "30000 < area ≤ 100000",
                  "100000 < area ≤ 300000",
                  "> 300000"),
       col = "black",
       pch = 1,
       y.intersp = c(1, 1.25, 1.25, 1.25, 1.5, 1.5),
       pt.cex = c(1, 1.25, 1.5, 2, 2.75, 3),
       bty = "n",
       plot = FALSE)

legend(x = leg$rect$left - 400,
       y = leg$rect$top,
       title = "",
       legend = c("",
                  "",
                  "",
                  "",
                  "",
                  ""),
       col = "black",
       pch = 0,
       y.intersp = c(1, 1.25, 1.25, 1.25, 1.5, 1.5),
       pt.cex = c(1, 1.25, 1.5, 2, 2.75, 3),
       bty = "n")

legend(x = leg$rect$left - 200,
       y = leg$rect$top,
       title = "",
       legend = c("",
                  "",
                  "",
                  "",
                  "",
                  ""),
       col = "black",
       pch = 2,
       y.intersp = c(1, 1.25, 1.25, 1.25, 1.5, 1.5),
       pt.cex = c(1, 1.25, 1.5, 2, 2.75, 3),
       bty = "n")

legend(x = "bottomright",
       y = NULL,
       title.font = 2,
       title = "Area (square miles)",
       legend = c("≤ 5000",
                  "5000 < area ≤ 10000",
                  "10000 < area ≤ 30000",
                  "30000 < area ≤ 100000",
                  "100000 < area ≤ 300000",
                  "> 300000"),
       col = "black",
       pch = 1,
       x.intersp = 1.25,
       y.intersp = c(1, 1.25, 1.25, 1.25, 1.5, 1.5),
       pt.cex = c(1, 1.25, 1.5, 2, 2.75, 3),
       plot = TRUE,
       bty = "n")

### legend edu index

legend(x = "topleft",
       y = NULL,
       title = "High-school graduates (%)\namong the literate population",
       title.font = 2,
       legend = c("≤ 40 ~ Low",
                  "40–60 ~ Medium",
                  "> 60 ~ High"),
       col = "black",
       pt.bg = "gray",
       pt.cex = 1.5,
       y.intersp = 1.25,
       x.intersp = 1.25,
       inset = 0.01,
       bty = "n",
       pch = c(22, 24, 21))

### legend other

legend(x = "bottomleft",
       y = NULL,
       title = "Notes",
       title.font = 2,
       legend = c("Outlined point (Life_Exp ≤ 70)",
                  "Point without outline (Life_Exp > 70)"),
       col = c("cyan", "purple"),
       lwd = 2,
       lty = 0,
       pt.bg = "purple",
       merge = FALSE,
       pt.cex = 1.5,
       horiz = FALSE,
       bty = "n",
       pch = c(21, 19))

title(main = "US states in the 1970s | A guide to drawing multivariate plots",
      cex.main = 1.5,
      col.main = "darkblue")
mtext(text = "Source: Dataset state.x77\nThis plot is only for training R",
      side = 1,
      col = "blue",
      font = 3,
      line = 3.5,
      adj = 0,
      xpd = NA)

box()

library(png)
library(grid)
logor <- readPNG("logor.png")

scale_logo <- 0.08

grid.raster(logor, 
            x = 0.9, 
            y = 0.70, 
            width = scale_logo)
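
The area legend above relies on the fact that legend() returns its bounding box, and that plot = FALSE measures the legend without drawing it. A minimal sketch on a toy plot; the entries and the horizontal offset of 1 user unit are arbitrary values chosen for illustration:

```r
plot(1:10, type = "n", xlab = "", ylab = "")

# Measure first: nothing is drawn, but the bounding box is returned
leg <- legend("bottomright",
              legend = c("small", "medium", "large"),
              pch = 1,
              pt.cex = c(1, 2, 3),
              bty = "n",
              plot = FALSE)

str(leg$rect)   # list with w, h, left, top in user coordinates

# Draw the real legend, then a second symbol column shifted to its left
legend("bottomright",
       legend = c("small", "medium", "large"),
       pch = 1,
       pt.cex = c(1, 2, 3),
       bty = "n")

legend(x = leg$rect$left - 1,
       y = leg$rect$top,
       legend = rep("", 3),
       pch = 0,
       pt.cex = c(1, 2, 3),
       bty = "n")
```

Because the offsets are in user coordinates, they depend on the x-axis scale: the tutorial's plot uses 200 and 400 because Income is measured in USD.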

11 References

https://www.visiondummy.com/2014/04/draw-error-ellipse-representing-covariance-matrix/

Footnotes

¹ https://statisticsbyjim.com/graphs/contour-plots/