--- title: "Các lệnh R thường dùng khi thực hiện thống kê mô tả" author-title: Biên soạn author: Duc Nguyen | Founder of tuhocr.com site-url: https://www.tuhocr.com/ published-title: Cập nhật date: last-modified date-format: "YYYY MMMM DD" cover-image: cover.png favicon: favicon.png bibliography: reborn_1.bib format: html: toc: true toc-title: "Mục lục" toc-expand: 6 toc-location: left number-sections: true number-depth: 6 theme: [style.scss] page-layout: full # code-annotations: below # không thuận tiện cho 2 code chunk embed-resources: true anchor-sections: true smooth-scroll: true link-external-newwindow: true code-tools: source: https://applyr.netlify.app/codebase/quarto-descriptive-statistics/quarto-descriptive-statistics.txt # grid: # sidebar-width: 0px # body-width: 2000px # margin-width: 0px highlight-style: solarized # execute: # keep-md: true engine: knitr knitr: opts_chunk: R.options: width: 1000 editor_options: chunk_output_type: console --- ```{r, echo=FALSE, results='hide'} knitr::opts_chunk$set(error = TRUE, # suppress errors message = FALSE, # suppress messages warning = FALSE, # suppress warnings # results = 'hide', # suppress code output echo = TRUE, # suppress code # fig.show = 'hide', # suppress plots cache = TRUE # enable caching ) # library(ggfortify) # autoplot(lm(cars$dist ~ cars$speed)) file <- list.files(pattern="*.qmd") newfile <- gsub("\\.qmd", ".txt", file) file.copy(from = file, to = newfile, overwrite = TRUE) zip(zipfile = "quarto-descriptive-statistics", files = c("quarto-descriptive-statistics.qmd", "style.scss", "reborn_1.bib", "students.csv", "run_quarto.R", "quarto-descriptive-statistics.Rproj", "_extensions/") ) ``` {{< downloadthis quarto-descriptive-statistics.zip dname="quarto-descriptive-statistics" label="RStudio project" icon=file-zip type=primary >}} text web ‍ ‍ ‍ # Tình huống thường gặp Bạn có file dữ liệu `df` gồm rất nhiều hàng và cột (trong bài này ta sử dụng dummy dataset để làm ví dụ minh họa). Việc đầu tiên cần làm là kiểm tra đặc điểm dữ liệu để đánh giá tổng quát toàn bộ dataset như thế nào nhằm có một hình dung cụ thể về dữ liệu trước khi thực hiện các bước phân tích sâu hơn. Để tạo ra 1 file HTML table show full dataset, bạn sử dụng package `kableExtra`[^1]. [**View**](https://applyr.netlify.app/codebase/quarto-descriptive-statistics/output.html) ```{r, eval = FALSE} df <- read.csv("students.csv") library(kableExtra) df %>% kbl(format = "html") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "bordered", "responsive")) %>% row_spec(0, bold = TRUE, align = "c", color = "white", background = "#1d6c00") %>% kable_classic(full_width = TRUE, html_font = "arial") -> output save_kable(output, file = "output.html") ``` ```{r} df <- read.csv("students.csv") head(df, n = 30) ### xem 30 dòng đầu của df ``` # Các lệnh kiểm tra dataset thường dùng ```{r} dim(df) ### Kiểm tra dataset gồm bao nhiêu hàng và cột ``` ```{r} str(df) ### Xem tổng quát cấu trúc dataset ``` ```{r} summary(df) ### Tóm tắt đặc điểm từng cột trong dataset ``` Vì giá trị `NA` missing values chỉ thể hiện ở numeric vector mà không thể hiện ở character vector. Vì vậy kết quả trong lệnh `summary()` chỉ để tham khảo sơ bộ khi đánh giá về tình trạng `NA` của bộ dữ liệu. ```{r} ### Lệnh này xác nhận là toàn bộ dataset có `NA` hay không. ### Nếu FALSE là chắc chắn không có `NA`, còn TRUE thì chắc chắn là có `NA`. any(is.na(df)) ``` ## Kiểm tra kỹ `class` từng cột ```{r} sapply(df, class) ### Kiểm tra class từng cột. ``` Ta chuyển toàn bộ các cột ở class `integer` sang class `numeric` (nếu cần thiết). Ở đây mình làm ví dụ minh họa để có code mẫu áp dụng cho các trường hợp tương tự. ```{r} ### Tách ra vector chứa kết quả kiểm tra class từng cột class_tung_cot <- sapply(df, class, simplify = TRUE, USE.NAMES = TRUE) ### Tìm vị trí index của các cột có class là `integer` integer_index_cot <- grep(pattern = "integer", ignore.case = FALSE, fixed = TRUE, x = class_tung_cot) ### Tách ra tên các cột có class `integer` ten_cot_integer <- names(class_tung_cot)[integer_index_cot] ### Chuyển toàn bộ các cột `integer` này về dạng `numeric` df[, ten_cot_integer] <- lapply(df[, ten_cot_integer], as.numeric) ### Kiểm tra lại thì thấy toàn bộ các cột `integer` đã chuyển qua dạng `numeric` sapply(df, class, simplify = TRUE, USE.NAMES = TRUE) ``` ## Xác nhận mỗi dòng (observation, row) là một quan sát riêng biệt Để tăng mức độ tự tin khi thao tác với dữ liệu, bạn cần xác nhận là mỗi dòng (ở đây là mỗi sinh viên) thì dữ liệu là duy nhất, không có lặp lại. Việc này rất quan trọng để tránh bị trùng lắp (trong trường hợp dataset ở dạng semi-long) để ta tìm cách trải dữ liệu ra làm sao thu được dạng true long hoặc true wide, tức là mỗi một dòng là một quan sát riêng biệt. ```{r} ### Thông thường trong các dataset luôn có cột ID để làm cơ sở theo dõi thông tin, ### nếu không có cột ID thì bạn cần tạo ra cột ID để thuận tiện xử lý sau này (nếu cần) ### Kiểm tra thông tin ở cột ID sinh viên xem có trùng lắp hay không, ### Lệnh unique() trả về kết quả của toàn bộ các giá trị xuất hiện duy nhất trong cột ID ### Lệnh length() giúp đếm tổng chiều dài của vector length(unique(df$stud.id)) ``` Như vậy ta xác nhận xem tổng số hàng của dataset `df` có bằng với tổng số giá trị ID riêng biệt ở cột `stud.id` hay không. Nếu là `TRUE` thì chắc chắn là mỗi một dòng là một sinh viên riêng biệt. ```{r} identical(dim(df)[1], length(unique(df$stud.id))) ``` Cách 2 là ta dùng lệnh `duplicated()` ```{r} ### Lệnh này trả về vector logical check cho mỗi giá trị trong vector stud.id ### Nếu có trùng lặp thì sẽ có TRUE, còn nếu không có trùng lặp sẽ là FALSE check_duplicated <- duplicated(df$stud.id) ### Lệnh all này sẽ trả về kết quả TRUE nếu toàn bộ giá trị trong vector !check_duplicated ### là TRUE, nghĩa là không có trùng lặp all(!check_duplicated) ``` Cách 3 là ta dùng lệnh `anyDuplicated()` ```{r} ### Để thuận tiện thì ta dùng lệnh này, nếu kết quả trả về 0 nghĩa là không có trùng lặp anyDuplicated(df$stud.id) ``` ## Dùng lệnh `table()` để kiểm tra giá trị `NA` ở từng cột ### Tham số `gender` ở vị trí hàng, tham số `major` ở vị trí cột Lệnh `table()` là một lệnh rất mạnh trong R. Ta nên dùng tham số `useNA = "ifany"` nhằm thể hiện luôn số lượng `NA` nếu có trong vector đang kiểm tra. ```{r} options(width = 120) check_na <- lapply(X = df[, !(names(df) %in% c("no", "stud.id", "name"))], ### Bỏ 3 cột đầu vì không cần thiết phải check `NA` FUN = table, ### Tham số `gender` ở vị trí hàng, tham số `major` ở vị trí cột useNA = "ifany") check_na ``` Kết quả trả về là bảng tần số ở từng cột, nếu có `NA` như ở trường hợp cột `salaray` thì sẽ hiện ra ở dòng cuối. Trong trường hợp kết quả này quá nhiều cột, ta chỉ quan tâm tách ra những cột nào có giá trị `NA` và muốn biết có bao nhiêu `NA` trong cột đó (dù là `character` hay `numeric`) thì ta sẽ làm thêm một bước kiểm tra sau. ```{r} ### Lệnh này sẽ kiểm từng thành phần trong list `check_na` sau đó trả về kết quả ### Những thành phần nào (cột nào trong df) có bao nhiêu giá trị `NA` check_1 <- lapply(check_na, function(x) { x[which(is.na(names(x)))] } ) ### Build function kiểm tra ruột của từng thành phần trong list có giá trị hay không is.integer0 <- function(x) { is.integer(x) && length(x) == 0L } ### Kiểm tra từng thành phần trong check_1 có giá trị hay không check_2 <- lapply(check_1, is.integer0) ### Trích xuất những cột có giá trị missing value check_1[!unlist(check_2)] ``` ## Thể hiện giá trị `NA` trên toàn bộ dataset qua heatmap ### Cách 1: Sử dụng package `mice` ```{r, fig.width=12, fig.height=6} library(mice) mice::md.pattern(df, plot = TRUE, rotate.names = TRUE) ``` Kết quả này đọc là có 1753 sinh viên có đủ các thông tin ở tất cả các cột, 3139 sinh viên có đủ thông tin (chỉ trừ cột `salary` là có missing value), 3347 sinh viên có đủ thông tin (chỉ trừ các cột `score1`, `score2`, và `salary`). Có tổng cộng 13180 giá trị `NA` trong toàn bộ dataset. Cụ thể cách phân tích kết quả missing value theo package `mice` được trình bày [**ở đây**](https://thongkesinhhoc.com/ky-thuat-kiem-tra-missing-value) ### Cách 2: Sử dụng package `VIM` Sử dụng đồ thị này để thể hiện tỷ lệ % giá trị `NA` trên toàn dataset. ```{r, fig.width=12, fig.height=6} library(VIM) aggr_plot <- VIM::aggr(df, col = c("navyblue", "red"), numbers = TRUE, sortVars = TRUE, labels = names(data), cex.axis = 1, gap = 3, ylab = c("Histogram of missing data", "Pattern")) ``` Ta có thể so sánh từng cặp chỉ tiêu với nhau thông qua đồ thị này, với các biện luận tham khảo ở đây[^2]. ```{r, fig.width=12, fig.height=6} VIM::marginplot(df[, c("salary", "score1")]) VIM::marginplot(df[, c("salary", "score2")]) VIM::marginplot(df[, c("score1", "score2")]) ``` # Các lệnh group và summary dữ liệu Ở thời điểm này, ta đã có cái nhìn tổng quát về bộ dữ liệu. Lúc này việc phân tích thống kê mô tả sẽ đi theo câu hỏi nghiên cứu/chủ đề mà bạn quan tâm để làm cơ sở chọn ra những biến/cột/variable phân tích mô tả cụ thể hơn. Mình chọn quan tâm về số lượng nam và nữ ở cột giới tính `gender` theo học các ngành `major` trong bộ dữ liệu này. ## Áp dụng các lệnh tạo bảng summary ### Sử dụng lệnh `table()` cho hai tham số ```{r} ### Tham số `gender` ở vị trí hàng, tham số `major` ở vị trí cột table(df$gender, df$major, useNA = "ifany") ``` Nếu dùng lệnh `table()` cho ba tham số sẽ tạo ra dạng array 3 chiều, chẻ ra thành từng matrix 2 chiều. ```{r} ### Tham số `religion` ở vị trí thứ 3 sẽ được tách ra tương ứng các matrix giữa `gender` và `major` table(df$gender, df$major, df$religion, useNA = "ifany") ``` ### Add margins vào table Tạo bảng summary cho hai tham số `gender` và `major` ```{r} gender_major <- table(df$gender, df$major, useNA = "ifany") names(dimnames(gender_major)) <- c("Gender", "Major") addmargins(A = gender_major, margin = seq_along(dim(gender_major)), FUN = sum, quiet = TRUE) ``` Tạo bảng summary cho ba tham số `gender` và `major` và `religion` ```{r} gender_major_religion <- table(df$gender, df$major, df$religion, useNA = "ifany") names(dimnames(gender_major_religion)) <- c("Gender", "Major", "Religion") addmargins(A = gender_major_religion, margin = seq_along(dim(gender_major_religion)), FUN = sum, quiet = TRUE) ``` ### Sử dụng lệnh `prop.table()` ```{r} ### Lệnh này trả về tỷ lệ giữa nam và nữ trong cùng vector `gender` prop.table(table(df$gender), margin = NULL) ### Nếu có 2 tham số trở lên thì các bạn chú ý tham số margin nhé prop.table(table(df$gender, df$major, useNA = "ifany"), margin = NULL) addmargins(A = prop.table(table(df$gender, df$major, useNA = "ifany"), margin = NULL), margin = seq_along(dim(prop.table(table(df$gender, df$major, useNA = "ifany"), margin = NULL))), FUN = sum, quiet = TRUE) addmargins(A = prop.table(table(df$gender, df$major, useNA = "ifany"), margin = 1), margin = seq_along(dim(prop.table(table(df$gender, df$major, useNA = "ifany"), margin = 1))), FUN = sum, quiet = TRUE) addmargins(A = prop.table(table(df$gender, df$major, useNA = "ifany"), margin = 2), margin = seq_along(dim(prop.table(table(df$gender, df$major, useNA = "ifany"), margin = 2))), FUN = sum, quiet = TRUE) ``` ## Áp dụng các lệnh summary trong package `dplyr` ```{r} library(dplyr) df |> dplyr::group_by(gender, religion) |> dplyr::summarise(trung_binh_age = mean(age), trung_vi_age = median(age), do_lech_chuan_age = sd(age), so_luong = n(), max_age = max(age), min_age = min(age), tong_so_tuoi = sum(age)) -> summary_all summary_all ``` ```{r} library(dplyr) df |> dplyr::group_by(major) |> dplyr::summarise(so_luong_mon = n_distinct(major), so_nguoi_theo_hoc = n()) |> print(n = Inf) ``` ```{r} library(dplyr) df |> dplyr::group_by(major, gender) |> dplyr::summarise(so_nguoi_theo_hoc = n()) -> df_major_gender df_major_gender |> print(n = Inf) # df |> dplyr::group_by(major, minor) |> # dplyr::summarise(so_luong_mon = n_distinct(major), # so_nguoi_theo_hoc = n()) |> # print(n = Inf) # # df |> dplyr::group_by(major, minor) |> # dplyr::summarise(so_luong_mon = n_distinct(minor), # so_nguoi_theo_hoc = n())|> # print(n = Inf) ``` # Vẽ đồ thị mô tả dữ liệu ## Vẽ đồ thị cột ### Đồ thị cột side-by-side **Nếu dataset ở dạng như sau thì khi ta vẽ đồ thị cột sẽ dùng theo dạng formula `y ~ x1 + x2`** ```{r} df_major_gender ``` ```{r, fig.width=14, fig.height=6} barplot(so_nguoi_theo_hoc ~ gender + major, data = df_major_gender, beside = TRUE, col = c("blue", "red"), angle = c(45, 135), density = 20, xlab = "", ylab = "Sinh viên", yaxs = "i", ylim = c(0, 1000), xaxs = "i", xlim = c(0.5, 18.5), # xaxt = "n", main = "Số lượng sinh viên theo học ở các chuyên ngành khác nhau", width = 1, las = 1) legend(x = "topright", y = NULL, title = "Chú thích", legend = c("Nữ", "Nam"), col = c("blue", "red"), fill = c("blue", "red"), angle = c(45, 135), density = 20) box() ``` **Nếu dataset ở dạng `table` hay `matrix` thì ta vẽ trực tiếp bằng lệnh `barplot()` với chính table đó.** ```{r} student_gender_major <- table(df$gender, df$major) student_gender_major ``` ```{r, fig.width=14, fig.height=6} barplot(height = student_gender_major, beside = TRUE, col = c("blue", "red"), angle = c(45, 135), density = 20, xlab = "", ylab = "Sinh viên", yaxs = "i", ylim = c(0, 1000), xaxs = "i", xlim = c(0.5, 18.5), # xaxt = "n", main = "Số lượng sinh viên theo học ở các chuyên ngành khác nhau", width = 1, las = 1) legend(x = "topright", y = NULL, title = "Chú thích", legend = c("Nữ", "Nam"), col = c("blue", "red"), fill = c("blue", "red"), angle = c(45, 135), density = 20) box() ``` ### Đồ thị cột stacked barchart ```{r, fig.width=14, fig.height=6} barplot(height = student_gender_major, beside = FALSE, col = c("blue", "red"), angle = c(45, 135), density = 20, xlab = "", ylab = "Sinh viên", yaxs = "i", ylim = c(0, 2000), xaxs = "i", xlim = c(0, 7.4), # xaxt = "n", main = "Số lượng sinh viên theo học ở các chuyên ngành khác nhau", width = 1, space = 0.2, las = 1) par("usr") legend(x = "topright", y = NULL, title = "Chú thích", legend = c("Nữ", "Nam"), col = c("blue", "red"), fill = c("blue", "red"), angle = c(45, 135), density = 20) box() ``` ### Đồ thị cột percent stacked barchart ```{r, fig.width=14, fig.height=6} percent_1 <- prop.table(table(df$gender, df$major), margin = 2) percent_1 barplot(height = percent_1, beside = FALSE, col = c("blue", "red"), angle = c(45, 135), density = 20, xlab = "", ylab = "Sinh viên (%)", yaxs = "i", ylim = c(0, 1), xaxs = "i", xlim = c(0, 7.4), # xaxt = "n", main = "Tỷ lệ sinh viên theo học ở các chuyên ngành khác nhau", width = 1, space = 0.2, las = 1) # par("usr") legend(x = 6.461, y = 1.16, title = "Chú thích", legend = c("Nữ", "Nam"), horiz = TRUE, col = c("blue", "red"), fill = c("blue", "red"), angle = c(45, 135), density = 20, xpd = TRUE) box(which = "plot", col = "black") # box(which = "figure", col = "red") ``` ## Vẽ đồ thị đường Ta quan tâm về `salary` và `age` xem có mối tương quan như thế nào. ```{r, fig.width=14, fig.height=6} ### tách ra dataset clean df_salary <- df[, c("age", "gender", "salary")] df_salary <- na.omit(df_salary) df_salary |> dplyr::arrange(age, salary) -> df_salary head(df_salary, n = 30) ``` ```{r, fig.width=14, fig.height=6, results='hide'} options(scipen = 1e9) oldpar <- par(no.readonly = TRUE) par(mar = c(6, 8, 4, 4)) par(mgp = c(4, 1, 0)) plot(x = df_salary$age, y = df_salary$salary, type = "o", col = "darkblue", xlim = c(0, 80), ylim = c(0, 80000), las = 1, xaxs = "i", yaxs = "i", xlab = "Age", ylab = "Salary", main = paste0("Tương quan giữa tuổi và tiền lương (n = ", dim(df_salary)[1], " sinh viên)"), lwd = 1, lty = 1, bty = "o") par(oldpar) ``` ## Vẽ đồ thị hộp Ta cắt vector `age` theo các độ tuổi khác nhau để thuận tiện vẽ đồ thị hộp. ```{r} df_salary$group_age <- cut(x = df_salary$age, breaks = c(0, 20, 30, 40, 50, 60, 70), labels = c("≤ 20", "20 < age ≤ 30", "30 < age ≤ 40", "40 < age ≤ 50", "50 < age ≤ 60", "> 60")) df_salary |> dplyr::arrange(age, salary) -> df_salary as.data.frame(table(df_salary$group_age)) # sample kiểu base R # sample_x <- sample(1:nrow(df_salary), size = 30) # df_salary[sample_x, ] -> df_sample # sample kiểu dplyr (ngẫu nhiên) # dplyr::sample_n(tbl = df_salary, size = 30, replace = FALSE) |> dplyr::arrange(age, salary) set.seed(1) # sample kiểu dplyr (ngẫu nhiên theo từng nhóm factor) df_salary |> dplyr::group_by(group_age) |> dplyr::sample_frac(size = 0.05, replace = FALSE) |> dplyr::arrange(age, salary)-> sample_df as.data.frame(sample_df) ``` ```{r, fig.width=14, fig.height=6} oldpar <- par(no.readonly = TRUE) par(mar = c(6, 8, 4, 4)) par(mgp = c(4, 1, 0)) boxplot(formula = salary ~ group_age, data = df_salary, col = rainbow(n = 6), las = 1, xlab = "Nhóm tuổi", ylab = "Mức lương", main = "Đồ thị thể hiện mức lương theo các nhóm tuổi") par(oldpar) ``` ```{r, fig.width=14, fig.height=6} oldpar <- par(no.readonly = TRUE) par(mar = c(6, 12, 4, 4)) par(mgp = c(4, 1, 0)) boxplot(formula = salary ~ group_age + gender, data = df_salary, col = c(rep("cyan", 6), rep("coral", 6)), las = 1, xlab = "Mức lương", ylab = "", horizontal = TRUE, sep = "-", lex.order = FALSE, # names = c(letters[1:12]), main = "Đồ thị thể hiện mức lương theo các nhóm tuổi và giới tính") par(oldpar) ``` ```{r, fig.width=14, fig.height=6} oldpar <- par(no.readonly = TRUE) par(mar = c(6, 12, 4, 4)) par(mgp = c(4, 1, 0)) boxplot(formula = salary ~ group_age + gender, data = df_salary, col = rep(c("cyan", "coral"), 6), las = 1, xlab = "Mức lương", ylab = "", horizontal = TRUE, sep = "-", lex.order = TRUE, # names = c(letters[1:12]), main = "Đồ thị thể hiện mức lương theo các nhóm tuổi và giới tính") par(oldpar) ``` ```{r} df_salary |> dplyr::group_by(gender, group_age) |> dplyr::summarise(trung_binh_salary = mean(salary), trung_vi_salary = median(salary), do_lech_chuan_salary = sd(salary), max_salary = max(salary), min_salary = min(salary), so_luong = n()) -> summary_salary as.data.frame(summary_salary) ``` ## Vẽ nhiều đồ thị con trong một hình Vẽ đồ thị giữa `height` và `weight` ```{r, fig.width=14, fig.height=6} df_height <- df[, c("age", "gender", "major", "height", "weight", "salary")] head(df_height, n = 30) oldpar <- par(no.readonly = TRUE) par(mar = c(6, 8, 4, 4)) par(mgp = c(4, 1, 0)) plot(x = df_height$weight, y = df_height$height, type = "p", col = "darkgreen", pch = 1, cex = 0.5, xlim = c(0, 120), ylim = c(0, 250), las = 1, xaxs = "i", yaxs = "i", xlab = "Cân nặng (kg)", ylab = "Chiều cao (cm)", main = paste0("Tương quan giữa chiều cao và cân nặng (n = ", dim(df_height)[1], " sinh viên)"), lwd = 1, lty = 1, bty = "o") par(oldpar) ``` Tô màu theo `gender` ```{r, fig.width=14, fig.height=6} df_height <- df[, c("age", "gender", "religion", "height", "weight", "salary")] col_1 <- factor(df_height$gender) oldpar <- par(no.readonly = TRUE) par(mar = c(6, 8, 4, 4)) # par(mgp = c(4, 1, 0)) plot(x = df_height$weight, y = df_height$height, type = "p", col = c("blue", "red")[col_1], pch = 1, cex = 0.5, xlim = c(0, 120), ylim = c(0, 250), las = 1, xaxs = "i", yaxs = "i", xlab = "Cân nặng (kg)", ylab = "Chiều cao (cm)", main = paste0("Tương quan giữa chiều cao và cân nặng (n = ", dim(df_height)[1], " sinh viên)"), lwd = 1, lty = 1, bty = "o") legend(x = "bottomleft", y = NULL, title = "Chú thích", legend = c("Nữ", "Nam"), # horiz = TRUE, col = c("blue", "red"), # fill = c("blue", "red"), pch = 1, xpd = TRUE) par(oldpar) ``` ### Sử dụng package `lattice` vẽ đồ thị scatterplot ```{r, fig.width=14, fig.height=10} library(lattice) lattice::xyplot(x = height ~ weight | gender + religion, groups = gender, col = c("blue", "red"), origin = 0, xlim = c(30, 130), ylim = c(50, 250), xlab = "Cân nặng (kg)", ylab = "Chiều cao (cm)", main = paste0("Tương quan giữa chiều cao và cân nặng theo giới tính và tôn giáo (n = ", dim(df_height)[1], " sinh viên)"), data = df_height) ``` ### Vẽ đồ thị nhiều biến ```{r, fig.width=14, fig.height=10} df_clean <- df[, c("age", "gender", "religion", "height", "weight", "salary")] df_clean <- na.omit(df_clean) df_clean$group_age <- cut(x = df_clean$age, breaks = c(0, 20, 30, 40, 50, 60, 70), labels = c("≤ 20", "20 < age ≤ 30", "30 < age ≤ 40", "40 < age ≤ 50", "50 < age ≤ 60", "> 60")) df_clean |> dplyr::arrange(age) -> df_clean as.data.frame(table(df_clean$group_age)) library(lattice) head(df_clean) options(scipen = 10) df_clean$bmi <- df_clean$weight / (df_clean$height * 0.01)^2 df_clean$religion <- factor(df_clean$religion) labels <- levels(df_clean$religion) lattice::xyplot(x = bmi ~ salary | group_age + gender, groups = religion, col = adjustcolor(col = c("red", "black", "darkgreen", "blue", "purple"), alpha.f = 0.5), pch = c(15, 17, 18, 19, 8), origin = 0, ylim = c(15, 31), xlim = c(0, 80000), xlab = "Salary (USD)", ylab = "BMI index", main = paste0("Phân bố mức lương theo chỉ số BMI, giới tính và nhóm tuổi (n = ", dim(df_clean)[1], " sinh viên)"), key = list(space = "top", columns = 5, title = "Chú thích", points=list(pch = c(15, 17, 18, 19, 8), col = adjustcolor(col = c("red", "black", "darkgreen", "blue", "purple"), alpha.f = 1)), text = list(labels)), scale = list(alternating = 3, rot = 0), panel = function(...) { panel.abline(h = 30, col = "red", lty = 2) panel.abline(h = 18.5, col = "red", lty = 2) panel.abline(v = 20000, col = "blue", lty = 2) panel.abline(v = 60000, col = "blue", lty = 2) panel.xyplot(...) }, data = df_clean) ``` ### Vẽ đồ thị histogram ```{r, fig.width=10, fig.height=6} height_data <- hist(df$height, col = "lightyellow", main = "Histogram of students' height", xlab = "Height (cm)", ylab = "Density", xlim = c(120, 220), ylim = c(0, 0.04), probability = TRUE) lines(density(df$height), col = "red", lty = 1, lwd = 2) day_so <- seq(from = min(df$height, na.rm = TRUE), to = max(df$height, na.rm = TRUE), length = 1000) curve(expr = dnorm(day_so, mean = mean(df$height, na.rm = TRUE), sd = sd(df$height, na.rm = TRUE)), type = "l", add = TRUE, lwd = 2, xname = "day_so", col = "blue", lty = "dotted") legend(x = "topright", y = NULL, legend = c("kernel density", "normal curve"), col = c("red", "blue"), lwd = 2, lty = c(1, 2)) height_data ``` # Tài liệu tham khảo 1. 2. 3. 4. 5. 6. 7. 8. 9. [^1]: [^2]: