r | Dr. Hui Lin

Learning ggmap

These are my own notes for reviewing the ggmap package and getting started with it quickly.

ggmap is a powerful map-plotting package for R. Compared with the maps package, plots produced with ggmap look more elegant.

Plotting a map takes two steps: download a base map raster, then decorate the base map with your own data.

Step 1: download your base map

You need to know the location you want to plot. A location can be defined in the two ways shown below:

library(ggmap)
# Option 1: describe the location with an address or place name
Location1 <- "University of Wisconsin, Milwaukee"
# Option 2: give longitude and latitude as a named numeric vector (not a quoted string)
Location2 <- c(lon = -95.3632715, lat = 29.7632836)

Use the get_map function to download the raster map for your location. There are three map "sources" from which to obtain a map raster, and each source offers several "map types":

stamen: “watercolor”, “toner”, “terrain”

googlemap:

osm: (its servers are sometimes unavailable)

myMap <- get_map(location = Location1, source = "stamen", maptype = "watercolor", crop = FALSE)
ggmap(myMap)
## zoom = integer from 3 to 21
## 3 = continent, 10 = city, 21 = building
## (OpenStreetMap has a limit of 18)

Step 2: decorate your map with data

In addition, one R package that is still under development is worth keeping an eye on: rMap.


Learning ggplot2

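This course introduces three plotting tools in R: plot, qplot, and ggplot. The three differ in both capabilities and syntax. plot is the plotting command built into R; its output is plain, which suits quick plots made during data analysis. qplot is the entry-level plotting function of ggplot2; it produces simple plots of publication quality that look clean and attractive.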

Lesson: Regression Models Introduction

Plot the data: plot(jitter(child, 4) ~ parent, galton)

Fit the regression with the lm (linear model) function, for example:

regrline <- lm(child~parent, galton)

After fitting the regression, use the abline function (add straight lines to a plot) to draw the fitted line on the plot, for example:

abline(regrline, lwd = 3, col = "red")
#lwd = line width, col = line color

After drawing the line, use the summary function to inspect the fitted model, including the residuals, the coefficients, R-squared, and so on.
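The steps above can be tried end to end without the galton data. The following is a minimal sketch that assumes the built-in cars data set (speed and stopping distance) as a stand-in for galton; the names fit and fit_summary are only illustrative.

```r
# Scatter plot with a little vertical jitter, as in the lesson
plot(jitter(dist, 4) ~ speed, data = cars)

# Fit a linear model: stopping distance as a function of speed
fit <- lm(dist ~ speed, data = cars)

# Draw the fitted line on top of the existing plot
abline(fit, lwd = 3, col = "red")

# Inspect residuals, coefficients, and R-squared
fit_summary <- summary(fit)
print(fit_summary)
```

The slope and intercept can also be pulled out directly with coef(fit).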

qplot

qplot is a basic function in the ggplot2 package. It provides a few basic plot types (e.g. points, smooth, boxplot) that give users a quick overview of their data.

qplot(hwy, displ, data = mpg)

qplot has many adjustable arguments, as shown below:

qplot(x, y, data=, color=, shape=, size=, alpha=, geom=, method=, formula=, facets=, xlim=, ylim=, xlab=, ylab=, main=, sub=)

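x, y, data, color, shape, and size are self-explanatory. Mapping them to different variables changes the plot accordingly.

alpha controls transparency: 0 is fully transparent and 1 is fully opaque.

geom sets the plot type. The options are "point", "smooth", "boxplot", "line", "histogram", "density", "bar", and "jitter".

point draws a scatter plot.

smooth draws a fitted curve.

boxplot draws a box plot. You do not supply your own maximum, minimum, and average values; instead, use fill to define the groups, and the summary statistics are computed automatically.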

qplot(year,averTLO, data = gt_sum, xlab = "Year", ylab = "Land&Ocean Avg Temp", geom = c("boxplot","jitter"),fill=decade)

![boxplot](/image/Land&Ocean Avg Temp vs year_boxplot_jitter.jpeg)

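line draws a line chart.

histogram applies to a single variable only: it draws the distribution as bars, with count on the vertical axis and the variable on the horizontal axis.

density also applies to a single variable only: it draws the density curve of that variable, with density on the vertical axis and the variable on the horizontal axis.

bar draws a regular bar chart. With two variables defined, mapping one of them to fill allows many variations.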

qplot(factor(cyl), data=mtcars, geom="bar", fill=factor(gear))

jitter adds a small amount of random noise to the point positions so that overlapping points no longer hide each other. For example:

p <- ggplot(mpg, aes(displ, hwy))
p + geom_point()
The scatter plot without jitter.

p + geom_point(position = "jitter")
The scatter plot with jitter.

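The method and formula arguments apply to the smooth geom. When smooth is used, the default fitting method for small data sets is loess. Other fitting methods can be requested, such as "lm" (linear fit), "gam" (generalized additive models), and "rlm" (robust regression).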

For example, to add simple linear regression lines, you'd specify geom="smooth", method="lm", formula=y~x. Changing the formula to y~poly(x,2) would produce a quadratic fit. Note that the formula uses the letters x and y, not the names of the variables.

For method="gam", be sure to load the mgcv package. For method="rlm", load the MASS package.

cited from Quick-R
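The linear and quadratic fits described in the quotation can be sketched with the layered ggplot syntax. This is only an illustration under the assumption that ggplot2 is installed; it uses geom_smooth rather than qplot, and the object names are made up for the example.

```r
library(ggplot2)

# Scatter plot of engine displacement against highway mileage (mpg ships with ggplot2)
p <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# Straight-line fit: the formula refers to the aesthetics x and y, not to displ and hwy
p_linear <- p + geom_smooth(method = "lm", formula = y ~ x)

# Quadratic fit: only the formula changes
p_quadratic <- p + geom_smooth(method = "lm", formula = y ~ poly(x, 2))

print(p_linear)
print(p_quadratic)
```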

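The facets argument splits the plot into panels according to a variable. Its syntax is facets = rowvar ~ colvar; if rowvar or colvar needs no variable, put "." in its place, for example facets = . ~ colvar.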
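The rowvar ~ colvar notation can be tried on the mpg data. The sketch below assumes ggplot2 is installed and uses facet_grid, the layered counterpart of qplot's facets argument; the choice of drv and cyl as faceting variables is only for illustration.

```r
library(ggplot2)

base_plot <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# One column of panels per drive type: "." on the row side means no row variable
by_column <- base_plot + facet_grid(. ~ drv)

# Rows by drive type and columns by number of cylinders
by_row_and_column <- base_plot + facet_grid(drv ~ cyl)

print(by_column)
print(by_row_and_column)
```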

Use coord_flip() to flip the plot, swapping the x and y axes.

ggplot(diamonds, aes(color, fill=cut)) + geom_bar() + coord_flip()

To overlay different plot types, combine them with c(), e.g. c("point", "smooth").

qplot(year,averT, data = gt_sum, xlab = "Year", ylab = "Land Avg Temp", geom = c("point","smooth"))

![combine](/image/land avg temp vs year_point_smooth.jpeg)

xlim, ylim, xlab, and ylab are self-explanatory. main and sub set the main title and the subtitle.

ggplot2

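ggplot builds plots in layers: you add whichever layers you want, which makes it well suited to complex figures. Below is an example of the ggplot syntax, in which mpg is a data set bundled with ggplot2 that describes car makes and their characteristics. hwy and displ are fields in mpg giving the highway miles per gallon and the engine displacement.

g <- ggplot(mpg, aes(hwy, displ))
g + geom_point() + geom_smooth()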

How to export a table.

write.table(dataframe, "pathway")
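A slightly fuller sketch of the export, assuming a small made-up data frame and a temporary file in place of a real path. The sep, row.names, and quote settings shown are one common choice, not the only one.

```r
# A small data frame to export (hypothetical example data)
scores <- data.frame(name = c("Ann", "Bob"), score = c(90.5, 82),
                     stringsAsFactors = FALSE)

# Write to a temporary file; replace out_path with a real path in practice
out_path <- tempfile(fileext = ".txt")
write.table(scores, out_path, sep = "\t", row.names = FALSE, quote = FALSE)

# Read the file back to confirm the round trip
scores_back <- read.delim(out_path, stringsAsFactors = FALSE)
print(scores_back)
```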

Learning R in Kaggle

I'm learning data analysis and exploration in R by following a tutorial posted on Kaggle.

Here are some statements I found useful.

train <- read.csv('../input/train.csv', stringsAsFactors = F)

This is how to read CSV files. To read tab-delimited text files, use read.delim() or read.delim2(); see the read documentation.

stringsAsFactors is a useful argument. Setting it to FALSE keeps character columns as plain strings instead of converting them to factors.

Use str(dataframe) to check the structure of the data. If you use RStudio, View(dataframe) (with a capital V) opens the data in a viewer.
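The effect of stringsAsFactors is easiest to see side by side. The sketch below assumes a tiny made-up CSV written to a temporary file so that it runs without the Kaggle input files.

```r
# Write a tiny CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,age", "Ann,31", "Bob,27"), csv_path)

# Character columns stay as character
as_text <- read.csv(csv_path, stringsAsFactors = FALSE)

# Character columns are converted to factors
as_factor <- read.csv(csv_path, stringsAsFactors = TRUE)

# str() shows the column types of each result
str(as_text)
str(as_factor)
```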

To inspect a data frame, we can also use the tbl_df function from dplyr (recent dplyr versions deprecate it in favor of as_tibble()).

full_df <- tbl_df(full)
full_df

Learning R using Swirl

Learning R with swirl: Learn R, in R.

How to import data (the read functions) and how to manipulate data with dplyr.

Lesson 1: Getting and Cleaning Data

We can import data with the read.csv function. See ?read.csv for details. For example:

#set a path to csv file
path2csv <-"E:/R-3.3.2/library/swirl/Courses/Getting_and_Cleaning_Data/Manipulating_Data_with_dplyr/2014-07-08.csv"

#use read.csv to read the csv file
mydf<-read.csv(path2csv,stringsAsFactors = FALSE)
#stringsAsFactors: logical: should character vectors be converted to factors?

Use dim() to check the number of rows and columns of the data.

dim(mydf)

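To process data with the dplyr package, first call library(dplyr), then wrap the data frame with the tbl_df() (tibble) function. The lesson treats this step as a prerequisite for the functions below, although the dplyr verbs also accept plain data frames. Use rm("what_you_want_to_delete") (remove) to delete the original data frame.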

library(dplyr)
cran<-tbl_df(mydf)
rm("mydf")

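The five most basic and most commonly used functions are select(), filter(), arrange(), mutate(), and summarize().

select picks any columns of the data frame without the $ operator. Use : to select a range of consecutive columns and - to drop unwanted columns, for example: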

select(cran,country:r_arch)
select(cran, -(time:size))

filter picks the rows that satisfy the given conditions.

filter(cran, package == "swirl")
filter(cran, country == "IN", r_version <= "3.0.2")  # a comma means AND
filter(cran, country == "US" | country == "IN")      # | means OR
filter(cran, !is.na(r_version))                      # keep rows where r_version is not missing

is.na() returns TRUE when a value is missing (NA) and FALSE otherwise.
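The filter conditions above can be checked on a small table. This sketch assumes dplyr is installed and uses a made-up downloads data frame in place of the cran data from the lesson.

```r
library(dplyr)

# Hypothetical download log standing in for the cran data
downloads <- data.frame(
  package = c("swirl", "dplyr", "swirl", "ggplot2"),
  country = c("US", "IN", "IN", NA),
  stringsAsFactors = FALSE
)

# is.na() flags the missing country in the last row
missing_country <- is.na(downloads$country)

swirl_rows <- filter(downloads, package == "swirl")
swirl_in   <- filter(downloads, package == "swirl", country == "IN")  # comma = AND
us_or_in   <- filter(downloads, country == "US" | country == "IN")    # | = OR
known      <- filter(downloads, !is.na(country))                      # drop missing
```

Note that filter drops rows where the condition evaluates to NA, so the row with a missing country does not appear in us_or_in.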

arrange reorders the rows as requested.

arrange(cran, ip_id)
arrange(cran, desc(ip_id))

mutate adds a new column derived from existing ones.

mutate(cran3, correct_size = size + 1000)
mutate(cran3, temperature = mean(AvgTemp, na.rm = TRUE))  # mean of AvgTemp, ignoring missing values
mutate(cran3, decade = trunc(year / 10))  # trunc() drops the fractional part; related functions are floor(), round(), and signif()
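The derived columns and the rounding functions mentioned in the comments can be tried on a few rows. This is a sketch assuming dplyr is installed; the temps data frame and its values are made up for illustration.

```r
library(dplyr)

# Hypothetical yearly temperatures, with one missing value
temps <- data.frame(year = c(1989, 1990, 1995, 2001),
                    AvgTemp = c(8.1, NA, 8.7, 9.3))

temps2 <- mutate(temps,
  decade  = trunc(year / 10),            # 1989 becomes 198
  overall = mean(AvgTemp, na.rm = TRUE)  # one mean, repeated on every row
)

# The rounding family differs in direction and in the digits kept
t1 <- trunc(-2.7)        # toward zero: -2
t2 <- floor(-2.7)        # toward minus infinity: -3
t3 <- round(2.567, 1)    # one decimal place: 2.6
t4 <- signif(123456, 2)  # two significant digits: 120000
```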

Lesson 2: Grouping and Chaining with dplyr

summarize is an important function. For example:

by_package <- group_by(cran, package)
pack_sum <- summarize(by_package,
 count = n(),
 unique = n_distinct(ip_id),
 countries = n_distinct(country),
 avg_bytes = mean(size))

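n() takes no arguments and returns the number of rows in each group, while n_distinct(x) returns the number of distinct values in x. n_distinct is a simpler and faster version of length(unique(x)).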
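The difference between n() and n_distinct() shows up clearly on a tiny table. The sketch below assumes dplyr is installed; log_df is a made-up stand-in for the cran data.

```r
library(dplyr)

# Hypothetical download log: three swirl downloads from two IPs, one dplyr download
log_df <- data.frame(
  package = c("swirl", "swirl", "swirl", "dplyr"),
  ip_id   = c(1, 1, 2, 3),
  size    = c(100, 200, 300, 400),
  stringsAsFactors = FALSE
)

by_pkg <- group_by(log_df, package)
pkg_summary <- summarize(by_pkg,
  count     = n(),               # rows per package
  unique    = n_distinct(ip_id), # distinct IPs per package
  avg_bytes = mean(size)
)
print(pkg_summary)
```

For swirl this gives a count of 3 but only 2 unique IPs, because one IP downloaded it twice.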

When functions need to be applied one after another, chain them with the %>% operator. For example:

cran %>%
 select(ip_id, country, package, size) %>%
 mutate(size_mb = size / 2^20) %>%
 filter(size_mb <= 0.5) %>%
 arrange(desc(size_mb))

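In the example above, select takes the cran data frame, mutate takes the data frame returned by select, and so on. The %>% at the end of each line is what passes each result on to the next function.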
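To make the hand-off concrete, the sketch below runs the same chain on a made-up three-row table and compares it with the equivalent nested calls. It assumes dplyr is installed; log_df and its values are hypothetical.

```r
library(dplyr)

# Hypothetical log with sizes of 1 MB, 0.25 MB, and 0.5 MB
log_df <- data.frame(
  ip_id   = c(1, 2, 3),
  package = c("a", "b", "c"),
  size    = c(2^20, 2^18, 2^19),
  stringsAsFactors = FALSE
)

# Chained version: read top to bottom
chained <- log_df %>%
  select(ip_id, package, size) %>%
  mutate(size_mb = size / 2^20) %>%
  filter(size_mb <= 0.5) %>%
  arrange(desc(size_mb))

# Nested version of the same computation: read inside-out
nested <- arrange(
  filter(
    mutate(select(log_df, ip_id, package, size), size_mb = size / 2^20),
    size_mb <= 0.5
  ),
  desc(size_mb)
)
```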

R basics

R study notes

Here are some reminders about R syntax. Learning Website

Unit 1 Assignment and basic calculation

myapples = 3

myapples <- 3

+, -, *, /, ^

%% gives the remainder of a division (modulo).
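A few quick checks of these operators follow. The variable names are only illustrative, and %/% (integer division) is added here as the natural companion of %%.

```r
myapples <- 3

remainder <- 7 %% 3         # 1
quotient  <- 7 %/% 3        # 2
power     <- 2 ^ 10         # 1024
neg_mod   <- -7 %% 3        # 2: the result takes the sign of the divisor
total     <- myapples * 2 + 1  # 7
```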

Unit 2 Vectors

combine function: c()

numeric_vector = c(1,2,3) #Or c(1:3)

sum() calculates the sum of all elements of a vector. mean() calculates the average of all elements of a vector. Selection by comparison uses the logical comparison operators <, >, <=, >=, ==, and !=.

poker_vector[selection_vector]
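Selection by comparison can be sketched as follows. The poker_vector values here are made up for illustration and are not taken from the course.

```r
# Hypothetical daily winnings, named by weekday
poker_vector <- c(Monday = 140, Tuesday = -50, Wednesday = 20,
                  Thursday = -120, Friday = 240)

# A logical vector: TRUE on the days with a positive result
selection_vector <- poker_vector > 0

# Keep only the elements where the selection is TRUE
winning_days <- poker_vector[selection_vector]

total_won <- sum(winning_days)   # 400
daily_avg <- mean(poker_vector)  # 46
```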

Unit 3 Matrices

Useful functions: matrix(), colnames(), rownames(), rbind(), cbind(), rowSums(), colSums(), e.g.

> matrix(1:9, byrow = TRUE, nrow = 3)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

>new_hope <- c(460.998, 314.4)
>empire_strikes <- c(290.475, 247.900)
>return_jedi <- c(309.306, 165.8)
>star_wars_matrix <- matrix(c(new_hope, empire_strikes, return_jedi), nrow = 3, byrow = TRUE)
>region <- c("US", "non-US")
>titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")

#Usage of colnames and rownames

>colnames(star_wars_matrix)<-region
>rownames(star_wars_matrix)<-titles

                             US non-US
A New Hope              460.998  314.4
The Empire Strikes Back 290.475  247.9
Return of the Jedi      309.306  165.8

# Usage of cbind and rbind

big_matrix <- cbind(matrix1, matrix2, vector1 ...)
big_matrix = rbind(matrix1, ...)

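Note the difference from MATLAB syntax here. MATLAB indexes a matrix with parentheses and uses : to select an entire row or column, e.g. school(1,:), whereas R indexes with square brackets and simply leaves the position empty, e.g. school[1,].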
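The empty-index convention, together with the sum and bind helpers listed earlier, can be sketched on a small matrix. The school matrix and its dimension names are made up for the example.

```r
# A 2 x 3 matrix filled row by row, with hypothetical row and column names
school <- matrix(1:6, nrow = 2, byrow = TRUE,
                 dimnames = list(c("r1", "r2"), c("a", "b", "c")))

first_row  <- school[1, ]  # whole first row: 1 2 3
second_col <- school[, 2]  # whole second column: 2 5

row_totals <- rowSums(school)  # 6 15
col_totals <- colSums(school)  # 5 7 9

# rbind appends the column totals as a new row named "total"
combined <- rbind(school, total = col_totals)
```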

Unit 4 Factor

The term factor refers to a statistical data type used to store categorical variables. The difference between a categorical variable and a continuous variable is that a categorical variable can belong to a limited number of categories. A continuous variable, on the other hand, can correspond to an infinite number of values.

factor_speed_vector <-factor(speed_vector, ordered=TRUE, levels=c("slow","fast","insane"))
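Because the factor is ordered, its values can be compared. A minimal sketch, assuming a made-up speed_vector:

```r
# Hypothetical observations
speed_vector <- c("fast", "slow", "slow", "fast", "insane")

factor_speed_vector <- factor(speed_vector, ordered = TRUE,
                              levels = c("slow", "fast", "insane"))

# Ordered factors support comparison: "fast" ranks above "slow"
first_is_faster <- factor_speed_vector[1] > factor_speed_vector[2]  # TRUE

# Count how often each category occurs
speed_counts <- table(factor_speed_vector)
print(speed_counts)
```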

Unit 5 Data Frame

Useful functions are shown below:

head(variables) shows the first observations of a data frame

tail(variables) shows the last observations of a data frame

str() gives a quick overview of the data

data.frame(vectors1, vectors2, ...) combines vectors into one data frame

the $ sign selects a column by name, e.g. planets_df$diameter

subset(my_df, subset = some_condition) keeps the rows that meet a condition, e.g. subset(planets_df, diameter < 1)

order() is an interesting function: it returns the positions that would sort a vector, e.g.

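> a <- c(100, 10, 1000)
> order(a)
[1] 2 1 3

> a[order(a)]
[1] 10 100 1000
# A vector takes no comma inside the square brackets. When sorting the rows of a data frame, the comma is crucial, e.g. my_df[order(my_df$a), ].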
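The same idea carries over to data frames, where the comma selects whole rows. This is a sketch with a made-up planets_df; the diameters are illustrative values relative to Earth.

```r
# Hypothetical planets table
planets_df <- data.frame(
  name     = c("Earth", "Mercury", "Jupiter"),
  diameter = c(1, 0.382, 11.209),
  stringsAsFactors = FALSE
)

# order() gives the row positions in ascending diameter
positions <- order(planets_df$diameter)  # 2 1 3

# Rows reordered: note the comma after the row index
sorted_df <- planets_df[positions, ]

# subset() keeps the rows that meet the condition
small_planets <- subset(planets_df, diameter < 1)
```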

Unit 6 List

A list can hold different kinds of components: vectors, matrices, and data frames.

my_list = list(my_vector, my_matrix, my_df)

Name the components of a list, either after creating it or while creating it:

names(my_list)=c("vec", "mat", "df")

my_list = list(my_vec=vec, my_matrix=mat,...)

To conveniently add elements to lists you can use the c() function, that you also used to build vectors:

ext_list <- c(my_list , my_val)

This will simply extend the original list, my_list, with the component my_val. This component gets appended to the end of the list. If you want to give the new list item a name, you just add the name as you did before:

ext_list <- c(my_list, my_name = my_val)
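A short sketch of naming and extending a list follows. The component values and the name year are made up for the example.

```r
# A list with two named components
my_list <- list(vec = 1:3, mat = matrix(1:4, nrow = 2))

# Append a single named value to the end of the list
ext_list <- c(my_list, year = 1980)

component_names <- names(ext_list)  # "vec" "mat" "year"
year_value <- ext_list$year         # access by name with $
vec_value  <- ext_list[["vec"]]     # or with double brackets
```

One caveat: if the new value is itself a vector with several elements, c() appends each element as a separate component, so wrap it in list() first, e.g. c(my_list, list(years = c(1980, 1983))).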