Descriptive analysis and data processing in R

General guide

Installing the packages

install.packages("readr")

install.packages("haven")

install.packages("readxl")

install.packages("psych")

install.packages("Hmisc")

install.packages("gmodels")

Loading the packages

library("readr")

library("haven")

library("readxl")

library("psych")

library("Hmisc")

library("gmodels")

Operations

Basic operations

log(8,2)

## [1] 3

2^3

## [1] 8

sin(pi/2)

## [1] 1

cos(pi)

## [1] -1

Creating a dataset

x = c(1,5,80,90,91) 
id= c(1, 2, 3, 4, 5, 6) 
Y = c(10, 16, 34,40, 50,26)
Data = data.frame(id,Y) 
Data

Statistical operations

Do the means of two samples differ?

x = rnorm(20)
y = rnorm(20)
t.test(x,y)

## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = 0.39174, df = 37.994, p-value = 0.6974
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.4929458  0.7295030
## sample estimates:
##   mean of x   mean of y 
## -0.05843635 -0.17671498

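x1 = c(1, 2, 3, 4, 5) # not defined in the source at this point; values inferred from the outputs printed below
print(mean(x1))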

## [1] 3

median(x1)

## [1] 3

sd(x1)

## [1] 1.581139

var(x1)

## [1] 2.5

mad(x1)

## [1] 1.4826

range(x1)

## [1] 1 5

which.max(x1)

## [1] 5

which.min(x1)

## [1] 1

length(x1)

## [1] 5

cov(x,y)

## [1] -0.0699699

Array data (vectors) in R

x1 <- c(1,9,3,4,8) # create a vector (a sequence of numbers)
x1

## [1] 1 9 3 4 8

y1 <- x1*2 # double every value of x1

y1

## [1]  2 18  6  8 16

x2 <- c(3,5,7,9,11) # not defined in the source; values inferred from the sum printed below
Data = data.frame(x1,x2)
Data

sum = x1+x2 
sum

## [1]  4 14 10 13 19

Data$sum =Data$x1+Data$x2 
Data

Basic programming in R

for loops

for(i in 1:5) 
{ print(i) }

## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

x <- c("a", "b", "c", "d")

x

## [1] "a" "b" "c" "d"

for(i in 1:4) { ## Print out each element of 'x'
  print(x[i]) 
  }

## [1] "a"
## [1] "b"
## [1] "c"
## [1] "d"

## Generate a sequence based on length of 'x' 
for(i in seq_along(x)) 
{ 
  print(x[i]) 
  }

## [1] "a"
## [1] "b"
## [1] "c"
## [1] "d"

for(letter in x) 
{ 
  print(letter) 
}

## [1] "a"
## [1] "b"
## [1] "c"
## [1] "d"

for(i in 1:4) print(x[i])

## [1] "a"
## [1] "b"
## [1] "c"
## [1] "d"
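
The loops above index `x` either with a fixed range (`1:4`) or with `seq_along(x)`. A minimal sketch of why `seq_along()` is usually the safer choice, assuming a hypothetical empty vector `empty` that is not part of the original examples:

```r
# Hypothetical empty vector, used only to show the edge case
empty <- character(0)

# 1:length(empty) evaluates to 1:0, i.e. the two values 1 and 0,
# so a loop over it would run twice on a vector with no elements
bad_index <- 1:length(empty)

# seq_along(empty) is an empty integer vector, so the loop body never runs
good_index <- seq_along(empty)

runs <- 0
for (i in good_index) {
  runs <- runs + 1
}

print(bad_index)
print(runs)
```

With a non-empty vector both forms give the same indices, so the difference only shows up when the input can be empty.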

while loops

count <- 0 
while(count < 5) 
{ 
  print(count) 
  count <- count + 2 
}

## [1] 0
## [1] 2
## [1] 4

for(i in 1:10) 
{ 
  if(i <= 5) 
  { ## Skip the first 5 iterations 
    next 
  } 
  print(i) 
}

## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
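
The example above uses `next` to skip iterations. Its counterpart `break` leaves the loop entirely. The following sketch combines both in a `while` loop; the variable names `evens` and `i` are illustrative and not taken from the original.

```r
# Collect the first three even numbers, then stop the loop early
evens <- c()
i <- 0
while (TRUE) {
  i <- i + 1
  if (i %% 2 != 0) next          # skip odd numbers
  evens <- c(evens, i)           # keep the even number
  if (length(evens) == 3) break  # leave the loop once three are collected
}
print(evens)
```

Because the counter is incremented before `next` is reached, the loop always makes progress and cannot get stuck on an odd value.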

Basic functions in R

# Function in R
f = function() { 
  cat("Hello, world! Vietnam Banking Academy\n") 
  for(i in 1:10) 
    print(i) } 
f()

## Hello, world! Vietnam Banking Academy
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

f <- function(num) { 
  for(i in seq_len(num)) { 
    cat("Hello, world!\n") 
    } 
  } 
f(6)

## Hello, world!
## Hello, world!
## Hello, world!
## Hello, world!
## Hello, world!
## Hello, world!

facto <- function(n) { 
  if ((n ==0) || (n == 1)) return (1) 
  else return (n*facto(n-1)) } 
facto(8)

## [1] 40320

fi <- function(n) { 
  if ((n ==0) || (n == 1)) return (1) 
  else return (fi(n-1) + fi(n-2)) 
  }
fi(8)

## [1] 34
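
The recursive `fi` recomputes the same values many times, so its running time grows quickly with `n`. As a sketch of one alternative (not the author's version), the same sequence, with the same convention that positions 0 and 1 both return 1, can be computed with a loop. The name `fi_iter` is hypothetical.

```r
# Iterative version of the sequence defined by fi():
# fi(0) = fi(1) = 1, fi(n) = fi(n-1) + fi(n-2)
fi_iter <- function(n) {
  if (n < 2) return(1)
  prev <- 1   # value at position k - 2
  curr <- 1   # value at position k - 1
  for (k in 2:n) {
    nxt <- prev + curr
    prev <- curr
    curr <- nxt
  }
  curr
}

fi_iter(8)
```

For `n = 8` this gives 34, the same value the recursive version printed, while doing only one addition per step.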

Grouping data

id = c(1,2,3,4,5)
gender = c("male","female","female","male","male") 
dat = data.frame (id,gender) 
dat

# ################################ 
dat$sex[gender == "male"] = 1 
dat$sex[gender =="female"]= 0 
dat

dat$group[id >=1 & id<=3] = "A" 
dat$group[id >=4 & id<=5] = "B" 
dat
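
The recoding above assigns each group with a separate indexed assignment. A possible alternative, shown here only as a sketch, builds the same columns in one step each with `ifelse()` and `cut()`. The data are the same `id` and `gender` vectors as above; the break points `c(0, 3, 5)` are an assumption chosen to reproduce groups A (id 1 to 3) and B (id 4 to 5).

```r
id <- c(1, 2, 3, 4, 5)
gender <- c("male", "female", "female", "male", "male")
dat <- data.frame(id, gender)

# 1 for male, 0 for female, in a single vectorised call
dat$sex <- ifelse(dat$gender == "male", 1, 0)

# cut() bins a numeric variable: (0, 3] becomes "A", (3, 5] becomes "B"
dat$group <- cut(dat$id, breaks = c(0, 3, 5), labels = c("A", "B"))

dat
```

Note that `cut()` returns a factor, whereas the indexed assignments above produce a character column.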

######################################## 
x = c(1,5, 80,90,91,10) 
id = c(1, 2, 3, 4, 5, 6) 
Y = c(10, 16, 34,40, 50,26) 
Data = data.frame(id,x,Y) 
attach(Data) # attach the data frame so its columns can be used by name

## The following objects are masked _by_ .GlobalEnv:
## 
##     id, x, Y

Data

setwd("D://R-Code") # set the working directory 
getwd() # check the working directory

## [1] "D:/R-Code"

Working with data and data-processing operations

Read from a local file:
1. Use "Save As" in Excel to save the data as a ".csv" file.
2. Run dt = read.csv(file path, header = TRUE) to load the data into an object.
3. Tell R to use the data for analysis: attach(dt)

dt = read.csv("D://Datasets/Data-Analysis/salary.csv",header = T) 
attach(dt)

## The following object is masked _by_ .GlobalEnv:
## 
##     id

## The following object is masked from Data:
## 
##     id
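
The messages above show the main risk of `attach()`: a global variable named `id` already exists, so the attached column of the same name is masked. A sketch of two ways to avoid the ambiguity, using a small made-up data frame `dt_demo` as a stand-in for the salary file (its values are hypothetical):

```r
# Hypothetical stand-in for the salary data; the values are made up
dt_demo <- data.frame(
  id     = 1:4,
  salbeg = c(8400, 24000, 10200, 8700),
  salnow = c(16080, 41400, 21960, 19200)
)

# A global 'id' like the one created in the earlier sections
id <- c(1, 2, 3, 4, 5, 6)

# Option 1: name the data frame explicitly with $
mean_dollar <- mean(dt_demo$salnow)

# Option 2: evaluate an expression inside the data frame with with()
mean_with <- with(dt_demo, mean(salnow))

# The column is reached unambiguously, regardless of the global 'id'
length(dt_demo$id)
length(id)
```

Both options leave the search path untouched, so nothing needs to be detached afterwards.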

#dt

head(dt)

library(psych)
describe(dt)

Basic charts: hist, plot, boxplot

Hist:

hist(var, xlab, ylab, main, xlim, ylim, col, border, prob)

hist(salbeg)

hist(salnow, col= "green", border = "white")

hist(salnow, col= "blue", border = "white", xlab = "Current salary ($)", ylab = "Density", prob = T)

Boxplot: boxplot(var ~ group, xlab, ylab, main, xlim, ylim, col, border, horizontal)

plot(salbeg)

boxplot (salnow, xlab ="Current salary", main = "Current salary", col = "red")

boxplot (salnow, notch = T, xlab ="Current salary", main = "Current salary", col = "blue")

boxplot (salnow, notch = T, xlab ="Current salary", main = "Current salary", col = "yellow", horizontal = T)

boxplot (salnow~sex, notch = T, xlab ="Current salary", main = "Current salary", col = "blue", horizontal=T)

boxplot (salnow~sex, notch = T, xlab ="Current salary", main = "Current salary", col = c("blue","red"), horizontal =T )

The barplot function

f = table(salnow) 
barplot(f)

f = table(sex) 
barplot(f)

means1 = with(dt,tapply(salnow, jobcat, mean)) 
#means = tapply(salnow,jobcat, sum) 
barplot(means1, horiz = T, xlab = "Mean salary", ylab = "Job category")

counts = with(dt,tapply(jobcat,jobcat, length)) # length counts the employees in each group; sum would add up the jobcat codes instead
#means = tapply(salnow,jobcat, sum) 
barplot(counts, horiz = T, xlab = "Number of employees", ylab = "Job category")

Pie chart

sa = c(sum(salbeg),sum(salnow)) 
pie(sa)

library(gmodels)
CrossTable(sex, digits = 3)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  474 
## 
##  
##           |         0 |         1 | 
##           |-----------|-----------|
##           |       258 |       216 | 
##           |     0.544 |     0.456 | 
##           |-----------|-----------|
## 
## 
## 
##

CrossTable(sex, jobcat,digits = 3)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  474 
## 
##  
##              | jobcat 
##          sex |         1 |         2 |         3 |         4 |         5 |         6 |         7 | Row Total | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##            0 |       110 |        47 |        27 |        34 |        30 |         4 |         6 |       258 | 
##              |     1.488 |     9.866 |    10.301 |     6.117 |     9.089 |     0.601 |     2.289 |           | 
##              |     0.426 |     0.182 |     0.105 |     0.132 |     0.116 |     0.016 |     0.023 |     0.544 | 
##              |     0.485 |     0.346 |     1.000 |     0.829 |     0.938 |     0.800 |     1.000 |           | 
##              |     0.232 |     0.099 |     0.057 |     0.072 |     0.063 |     0.008 |     0.013 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
##            1 |       117 |        89 |         0 |         7 |         2 |         1 |         0 |       216 | 
##              |     1.777 |    11.785 |    12.304 |     7.306 |    10.857 |     0.717 |     2.734 |           | 
##              |     0.542 |     0.412 |     0.000 |     0.032 |     0.009 |     0.005 |     0.000 |     0.456 | 
##              |     0.515 |     0.654 |     0.000 |     0.171 |     0.062 |     0.200 |     0.000 |           | 
##              |     0.247 |     0.188 |     0.000 |     0.015 |     0.004 |     0.002 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total |       227 |       136 |        27 |        41 |        32 |         5 |         6 |       474 | 
##              |     0.479 |     0.287 |     0.057 |     0.086 |     0.068 |     0.011 |     0.013 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
## 
##

DESCRIPTIVE STATISTICAL ANALYSIS

Population and sample

In this chapter we use R for descriptive statistical analysis. Descriptive statistics means describing data with the familiar calculations and summary statistics from secondary school, such as the mean, the median, the variance, and the standard deviation for continuous variables, and proportions for non-continuous variables. Before working through descriptive statistics, however, the reader should distinguish two concepts: the population and the sample.

heigh = c(162,160,157,155,167,160,161,153,149,157,159,164,150,162,168,165,156,157,154,157)
sample5 = sample(heigh,5)
sample5

## [1] 160 162 149 157 164

Mean: \[\overline{x} = \frac{\sum_{i=1}^{n} x_i}{n}\]

mean(sample5)

## [1] 158.4

Measures of Central Tendency

The mean is a descriptive statistic that looks at the average value of a data set.

Variance

Population:

\[\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}\]

Sample:

\[s^2 = \frac{\sum_{i=1}^{n}(x_i - \overline{x})^2}{n-1}\]

Standard deviation

Population:

\[\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}}\]

Sample:

\[s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \overline{x})^2}{n-1}}\]
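
To connect the formulas to R: `var()` and `sd()` use the sample versions, with n - 1 in the denominator. The sketch below checks this by hand on the `heigh` vector from above; the names `ss`, `var_sample` and `var_pop` are introduced here for illustration.

```r
heigh <- c(162,160,157,155,167,160,161,153,149,157,
           159,164,150,162,168,165,156,157,154,157)

n  <- length(heigh)
ss <- sum((heigh - mean(heigh))^2)   # sum of squared deviations from the mean

var_sample <- ss / (n - 1)           # sample variance (what var() returns)
var_pop    <- ss / n                 # population variance (divide by n)

all.equal(var_sample, var(heigh))        # TRUE
all.equal(sqrt(var_sample), sd(heigh))   # TRUE
```

If the data really are the whole population, the population variance can be obtained from the built-in function as `var(heigh) * (n - 1) / n`.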

Hypothesis testing in R

Comparing the means of two samples

x = rnorm(10)
y = rnorm(10)
t.test(x,y)

## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = 0.52818, df = 17.966, p-value = 0.6038
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6317068  1.0559228
## sample estimates:
##   mean of x   mean of y 
## -0.04232802 -0.25443603

x

##  [1] -0.06435635 -0.50867787  0.57368143 -1.13249269 -1.41870116  1.09659289
##  [7] -0.53368394  1.22863973  0.26961367  0.06610406

y

##  [1]  0.2764979 -0.4733287 -1.5307298 -1.4920381 -0.8515796 -0.2552806
##  [7]  1.4117951  0.6238134 -0.3950166  0.1415066

TT = t.test(x,y)
TT

## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = 0.52818, df = 17.966, p-value = 0.6038
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6317068  1.0559228
## sample estimates:
##   mean of x   mean of y 
## -0.04232802 -0.25443603

summary(TT)

##             Length Class  Mode     
## statistic   1      -none- numeric  
## parameter   1      -none- numeric  
## p.value     1      -none- numeric  
## conf.int    2      -none- numeric  
## estimate    2      -none- numeric  
## null.value  1      -none- numeric  
## stderr      1      -none- numeric  
## alternative 1      -none- character
## method      1      -none- character
## data.name   1      -none- character

ttest = t.test(x,y)
ttest

## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = 0.52818, df = 17.966, p-value = 0.6038
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6317068  1.0559228
## sample estimates:
##   mean of x   mean of y 
## -0.04232802 -0.25443603

names(ttest)

##  [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
##  [6] "null.value"  "stderr"      "alternative" "method"      "data.name"
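
The names listed above are the components of the test object, and each can be read with `$`. A short sketch, assuming freshly simulated data (the seed is arbitrary, so the numbers differ from the output shown earlier):

```r
set.seed(1)            # arbitrary seed, only to make the sketch reproducible
x <- rnorm(10)
y <- rnorm(10)
ttest <- t.test(x, y)

ttest$statistic        # the t value
ttest$parameter        # degrees of freedom
ttest$p.value          # the p-value as a plain number
ttest$conf.int         # lower and upper confidence limits
ttest$estimate         # the two sample means

# Example decision at the 5% level (the threshold is a choice, not part of t.test)
reject <- ttest$p.value < 0.05
reject
```

Reading the components this way is convenient when the result feeds into further code rather than being printed.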

Known variance
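
The source leaves this case without an example. As a sketch only: when both population standard deviations are known, the usual choice is a two-sample z-test, which base R does not provide as a ready-made function, so it is computed by hand here. The values `sigma_x = 1` and `sigma_y = 1` and the simulated data are assumptions made for illustration.

```r
set.seed(42)                 # arbitrary seed
sigma_x <- 1                 # assumed known population sd of x
sigma_y <- 1                 # assumed known population sd of y
x <- rnorm(20, mean = 0, sd = sigma_x)
y <- rnorm(20, mean = 0, sd = sigma_y)

# z statistic for H0: the two population means are equal
se <- sqrt(sigma_x^2 / length(x) + sigma_y^2 / length(y))
z  <- (mean(x) - mean(y)) / se

# two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z))

z
p_value
```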

Unknown variance
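
The source gives no example for this case either. When the variances are unknown they are estimated from the samples, which is what `t.test()` does. The sketch below, on simulated data with an arbitrary seed, contrasts the default Welch test (variances not assumed equal) with the pooled test requested through `var.equal = TRUE`.

```r
set.seed(42)                                # arbitrary seed
x <- rnorm(20)
y <- rnorm(20)

welch  <- t.test(x, y)                      # default: Welch test, unequal variances
pooled <- t.test(x, y, var.equal = TRUE)    # pooled-variance two-sample t-test

welch$parameter    # Welch degrees of freedom, generally not an integer
pooled$parameter   # n1 + n2 - 2 = 38
```

The Welch form is the safer default when there is no reason to believe the two variances are equal.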

Building a regression model in R

library(foreign)
# read the data from disk
dt = read.csv("D://Datasets/Data-Analysis/salary.csv")
#attach(dt) # make the columns of dt available by name
#dt

Linear Regression (Y~X)

fit = lm(salnow~salbeg)

fit = lm(salnow~salbeg, data = dt)
fit

## 
## Call:
## lm(formula = salnow ~ salbeg, data = dt)
## 
## Coefficients:
## (Intercept)       salbeg  
##     771.282        1.909

summary(fit)

## 
## Call:
## lm(formula = salnow ~ salbeg, data = dt)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14169.7  -1612.2   -461.7   1033.7  19717.1 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 771.28230  355.47194    2.17   0.0305 *  
## salbeg        1.90945    0.04741   40.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3246 on 472 degrees of freedom
## Multiple R-squared:  0.7746, Adjusted R-squared:  0.7741 
## F-statistic:  1622 on 1 and 472 DF,  p-value: < 2.2e-16

a <- data.frame(salbeg = 300)
b = data.frame(salbeg=400)
predict(fit,a)

##        1 
## 1344.117

predict(fit,b)

##        1 
## 1535.062
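
For a simple linear model, `predict()` returns the intercept plus the slope times the new value; with the coefficients printed above, 771.2823 + 1.90945 * 300 gives the 1344.117 shown. Since the salary file is not available here, the sketch below checks the same identity on simulated data. The data frame `demo`, the model `fit_demo` and the numbers used to generate them are hypothetical.

```r
set.seed(123)                                   # arbitrary seed
salbeg <- runif(50, min = 4000, max = 30000)    # made-up starting salaries
salnow <- 800 + 1.9 * salbeg + rnorm(50, sd = 3000)
demo   <- data.frame(salbeg, salnow)

fit_demo <- lm(salnow ~ salbeg, data = demo)

# Several new values can be predicted in one call
new_data   <- data.frame(salbeg = c(300, 400))
by_predict <- predict(fit_demo, new_data)

# The same numbers computed directly from the coefficients
by_hand <- coef(fit_demo)[1] + coef(fit_demo)[2] * new_data$salbeg

all.equal(unname(by_predict), unname(by_hand))  # TRUE
```

Note that 300 and 400 lie far below the smallest observed `salbeg` in the salary data (3600), so those two predictions are extrapolations.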

Correlation between starting salary and current salary:

cov(salnow,salbeg) # covariance between salnow and salbeg

## [1] 18925532

cor(salnow,salbeg)  # correlation between salnow and salbeg

## [1] 0.8801175
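
The two numbers above are linked: the correlation is the covariance divided by the product of the two standard deviations. In a simple regression its square also equals R-squared, and 0.8801175 squared is about 0.7746, the value reported by `summary(fit)`. A sketch on simulated data, where the vectors `a` and `b` are hypothetical:

```r
set.seed(7)                  # arbitrary seed
a <- rnorm(30)
b <- 2 * a + rnorm(30)

# correlation rebuilt from covariance and standard deviations
r_manual  <- cov(a, b) / (sd(a) * sd(b))
r_builtin <- cor(a, b)

# R-squared of the simple regression of b on a
r_squared <- summary(lm(b ~ a))$r.squared

all.equal(r_manual, r_builtin)       # TRUE
all.equal(r_builtin^2, r_squared)    # TRUE
```

Unlike the covariance, the correlation does not depend on the units of the two variables, which is why it is easier to interpret.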

range(salbeg)

## [1]  3600 31992

mad(salbeg)

## [1] 1494.461

Plotting the model (Plot the chart)

# Plot the chart.
#plot(y,x,col = "blue",main = "Height & Weight Regression", abline(lm(x~y)),cex = 1.5,pch = 16,xlab = "Weight in Kg",ylab = "Height in cm")
plot(salbeg,salnow,col = "red",main = "SalBegin & SalNow Regression", cex = 1.1,pch = 16,xlab = "SalBegin",ylab = "SalNow")
abline(fit,lwd =4, col = "green")
arrows(5000,50000,16000, predict(fit, data.frame(salbeg=16000)))
text(10000,50000,"Line of the best fit", pos=3) # pos = 3 places the text above the point (10000,50000)

#png(file = "linearregression.png")

scatter.smooth(x= salbeg, y = salnow, main = "salnow ~ salbeg", col = "red")  # scatterplot

# Other useful functions

coefficients(fit) # model coefficients

## (Intercept)      salbeg 
##   771.28230     1.90945

confint(fit, level=0.95) # CIs for model parameters

##                2.5 %     97.5 %
## (Intercept) 72.77899 1469.78562
## salbeg       1.81629    2.00261

#fitted(fit) # predicted values

# residuals(fit) # residuals

anova(fit) # anova table

#vcov(fit) # covariance matrix for model parameters 
#influence(fit) # regression diagnostics

# diagnostic plots 
layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page 
plot(fit)
