Editing Text Variables
First we get some data.
if(!file.exists("./data")){dir.create("./data")}
fileURL <- "https://data.baltimorecity.gov/api/views/dz54-2aru/rows.csv?accessType=DOWNLOAD"
download.file(fileURL, destfile="./data/cameras.csv", method="curl")
cameraData <- read.csv("./data/cameras.csv")
The names of the variables could be a little better.
names(cameraData)
## [1] "address" "direction" "street" "crossStreet"
## [5] "intersection" "Location.1"
We can convert the names to lowercase or uppercase.
tolower(names(cameraData))
## [1] "address" "direction" "street" "crossstreet"
## [5] "intersection" "location.1"
toupper(names(cameraData))
## [1] "ADDRESS" "DIRECTION" "STREET" "CROSSSTREET"
## [5] "INTERSECTION" "LOCATION.1"
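The calls above only print the converted names. To keep them, the result has to be assigned back to names(). A minimal sketch, using a small stand-in data frame (demo is hypothetical, not the camera data):

```r
# Hypothetical stand-in with a few of the cameraData column names
demo <- data.frame(address = "a", crossStreet = "b", Location.1 = "c")

# tolower() returns a new character vector; assigning it back renames the columns
names(demo) <- tolower(names(demo))
names(demo)
```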
Splitting a text string
We can use strsplit to split the names at the period.
strsplit treats its second argument as a regular expression, in which an unescaped period is a reserved character that matches any character. To split at a literal period, the period must be escaped with a backslash, and the backslash must be written twice inside an R string.
This doesn't work, because every character is treated as a separator:
splitNames <- strsplit(names(cameraData), ".")
This works:
splitNames <- strsplit(names(cameraData), "\\.")
splitNames
## [[1]]
## [1] "address"
##
## [[2]]
## [1] "direction"
##
## [[3]]
## [1] "street"
##
## [[4]]
## [1] "crossStreet"
##
## [[5]]
## [1] "intersection"
##
## [[6]]
## [1] "Location" "1"
When a name is split, its list element becomes a character vector with one entry per piece. If the character that tells strsplit where to split isn't found, the text string isn't changed.
# No periods in the original name so the name isn't split
splitNames[[5]]
## [1] "intersection"
# This name contained a period, so it now has two elements
splitNames[[6]]
## [1] "Location" "1"
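As a small sketch of the escaping rule described above, here are a few ways to split on a literal period in base R. The variable names are only for illustration.

```r
x <- "Location.1"

# An unescaped "." matches any character, so the result holds only empty strings
wildcard <- strsplit(x, ".")[[1]]

# Three equivalent ways to split on a literal period
escaped <- strsplit(x, "\\.")[[1]]               # escape the period
bracket <- strsplit(x, "[.]")[[1]]               # put it in a character class
literal <- strsplit(x, ".", fixed = TRUE)[[1]]   # switch off regular expressions

escaped
```

fixed = TRUE is often the easiest option to read when the separator is a plain character and no pattern matching is needed.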
Using Lists
A little refresher on using lists.
A list can hold elements of different classes, such as character vectors, numeric vectors and matrices.
mylist <- list(letters = c("A", "B", "C"), numbers = 1:3, matrix(1:25, ncol = 5))
mylist
## $letters
## [1] "A" "B" "C"
##
## $numbers
## [1] 1 2 3
##
## [[3]]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 6 11 16 21
## [2,] 2 7 12 17 22
## [3,] 3 8 13 18 23
## [4,] 4 9 14 19 24
## [5,] 5 10 15 20 25
We can select elements of the list by position using []. Single brackets return a list that contains the selected element.
mylist[1]
## $letters
## [1] "A" "B" "C"
Or by selecting the list element name using $
mylist$numbers
## [1] 1 2 3
And we can select the matrix itself using [[]]
mylist[[3]]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 6 11 16 21
## [2,] 2 7 12 17 22
## [3,] 3 8 13 18 23
## [4,] 4 9 14 19 24
## [5,] 5 10 15 20 25
and elements within the matrix
mylist[[3]][, 1]
## [1] 1 2 3 4 5
mylist[[3]][2, ]
## [1] 2 7 12 17 22
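The difference between single and double brackets is easy to forget, so here is a short sketch that checks what each one returns. It rebuilds the same mylist so that it runs on its own.

```r
mylist <- list(letters = c("A", "B", "C"), numbers = 1:3, matrix(1:25, ncol = 5))

class(mylist[1])       # "list": single brackets keep the list wrapper
class(mylist[[1]])     # "character": double brackets return the element itself

# Named elements can be reached with [[ ]] and a string, just like $
mylist[["numbers"]]

# The unnamed matrix can only be reached by position
m <- mylist[[3]]
m[2, ]                 # second row
m[, 1]                 # first column
```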
Extracting the first part of the column name
splitNames[[6]][1]
## [1] "Location"
This is a little function that returns the first element of a name that has been split.
firstElement <- function(x){x[1]}
sapply(splitNames, firstElement)
## [1] "address" "direction" "street" "crossStreet"
## [5] "intersection" "Location"
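The helper function is not strictly needed. As a sketch of two alternatives, the bracket operator can be passed to sapply directly, or vapply can be used to state the expected result type. The shortened splitNames below is rebuilt from three of the column names so the example is self-contained.

```r
# Rebuild a shortened splitNames from three of the column names
splitNames <- strsplit(c("address", "crossStreet", "Location.1"), "\\.")

# "[" is itself a function, so it can be applied with the index 1
sapply(splitNames, "[", 1)

# vapply does the same but checks that each result is a single character string
firstParts <- vapply(splitNames, function(x) x[1], character(1))
firstParts
```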
Using gsub and sub
Download the peer review study data.
fileUrl1 <- "https://dl.dropboxusercontent.com/u/7710864/data/reviews-apr29.csv"
fileUrl2 <- "https://dl.dropboxusercontent.com/u/7710864/data/solutions-apr29.csv"
download.file(fileUrl1, destfile = "./data/reviews.csv", method = "curl")
download.file(fileUrl2, destfile = "./data/solutions.csv", method = "curl")
reviews <- read.csv("./data/reviews.csv")
solutions <- read.csv("./data/solutions.csv")
head(reviews,2)
## id solution_id reviewer_id start stop time_left accept
## 1 1 3 27 1304095698 1304095758 1754 1
## 2 2 4 22 1304095188 1304095206 2306 1
You might want to remove the underscores. sub looks for a pattern in each string and replaces the first match with something else.
sub("_", "", names(reviews))
## [1] "id" "solutionid" "reviewerid" "start" "stop"
## [6] "timeleft" "accept"
sub only replaces the first match in each string, so it doesn't remove every underscore when a string contains more than one.
testName <- "This_is_a_test"
sub("_", "", testName)
## [1] "Thisis_a_test"
gsub replaces every match, so it will work.
gsub("_", "", testName)
## [1] "Thisisatest"
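Both functions work on whole vectors, which is how they are normally used on column names. A short sketch with made-up names (cols is hypothetical) shows the difference side by side:

```r
# Hypothetical column names, one of them with several underscores
cols <- c("reviewer_id", "time_left", "start_of_review_window")

firstOnly <- sub("_", "", cols)     # removes only the first underscore in each name
allGone   <- gsub("_", "", cols)    # removes every underscore in each name

# The replacement does not have to be empty
dotted <- gsub("_", ".", cols, fixed = TRUE)

firstOnly
allGone
dotted
```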
Searching for specific values in variables using grep and grepl
grep returns the positions of the elements in which the search string is found
grep("Alameda", cameraData$intersection)
## [1] 4 5 36
grepl returns a logical vector with TRUE or FALSE for each element, indicating whether the search string was found in it.
grepl("Alameda", cameraData$intersection)
## [1] FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [78] FALSE FALSE FALSE
Using table on the grepl result tells us how many elements contain the search string.
table(grepl("Alameda", cameraData$intersection))
##
## FALSE TRUE
## 77 3
Now I can print out the three intersection values where Alameda shows up.
cameraData$intersection[grep("Alameda", cameraData$intersection)]
## [1] The Alameda & 33rd St E 33rd & The Alameda
## [3] Harford \n & The Alameda
## 74 Levels: & Caton Ave & Benson Ave ... York Rd \n & Gitting Ave
Although this is more concise and returns the same matches, as character strings
grep("Alameda", cameraData$intersection, value=TRUE)
## [1] "The Alameda & 33rd St" "E 33rd & The Alameda"
## [3] "Harford \n & The Alameda"
Using ! with grepl returns a subset of every row except the rows with "Alameda" in the intersection data.
subset_cameraData <- cameraData[!grepl("Alameda", cameraData$intersection), ]
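To see what the subsetting does without downloading anything, here is a sketch on a tiny made-up data frame. demo and its values are hypothetical and only mimic the shape of the camera data.

```r
# Hypothetical miniature version of the camera data
demo <- data.frame(
  intersection = c("The Alameda & 33rd St", "Charles & Lake Ave", "E 33rd & The Alameda"),
  direction = c("N/B", "S/B", "E/B"),
  stringsAsFactors = FALSE
)

hits <- grepl("Alameda", demo$intersection)   # one TRUE/FALSE per row
withAlameda    <- demo[hits, ]                # rows that match
withoutAlameda <- demo[!hits, ]               # rows that do not match

# ignore.case = TRUE makes the search case-insensitive
anyCase <- grepl("alameda", demo$intersection, ignore.case = TRUE)
```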
If a value that you search for doesn't appear, grep will return integer(0)
grep("JeffStreet", cameraData$intersection)
## integer(0)
The length of that value will be zero if the search string can’t be found.
length(grep("JeffStreet", cameraData$intersection))
## [1] 0
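The empty result matters in practice. The sketch below, on a made-up vector (streets is hypothetical), shows a more direct way to ask whether anything matched, and a common pitfall when an empty grep result is used for negative indexing.

```r
streets <- c("The Alameda & 33rd St", "Charles & Lake Ave")   # hypothetical values

# length() of the grep() result is 0 when nothing matches
found <- length(grep("JeffStreet", streets)) > 0

# any(grepl()) asks the same question more directly
foundToo   <- any(grepl("JeffStreet", streets))
hasAlameda <- any(grepl("Alameda", streets))

# Caution: negative indexing with an empty grep() result drops everything
dropped <- streets[-grep("JeffStreet", streets)]    # empty vector, not the original
kept    <- streets[!grepl("JeffStreet", streets)]   # both values, as intended
```

This is one reason to prefer !grepl() over -grep() when excluding matches.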
Getting the length of character vectors
nchar returns the number of characters in a text string. The length function returns the number of elements in the vector instead, which is 1 for a single string.
nchar("These are words")
## [1] 15
length("These are words")
## [1] 1
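The distinction is clearer on a vector with more than one element. A short sketch (words is just an example vector):

```r
words <- c("These", "are", "words")   # example vector

charsPerWord <- nchar(words)    # characters in each element
nWords       <- length(words)   # number of elements in the vector
totalChars   <- sum(charsPerWord)

charsPerWord
```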
Modifying the length of a text string
substr returns the characters from the start position (first number) to the stop position (second number), both inclusive
substr("These words", 1, 5)
## [1] "These"
substr("These words", 2, 5)
## [1] "hese"
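substr is vectorised and can be combined with nchar to count from the end of a string. It can also be used on the left side of an assignment. A minimal sketch with example strings:

```r
x <- c("These words", "Other text")   # example strings

firstFive <- substr(x, 1, 5)          # first five characters of each string

# Last four characters, using nchar() to find where each string ends
n <- nchar(x)
lastFour <- substr(x, n - 3, n)

# Assigning to substr() overwrites characters in place
y <- "These words"
substr(y, 1, 5) <- "Those"
y
```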
paste joins words together, separated by a space by default.
paste("These", "are", "words")
## [1] "These are words"
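The separator can be changed, and a single vector can be collapsed into one string. A brief sketch of the common variations, all in base R:

```r
underscored <- paste("These", "are", "words", sep = "_")         # custom separator
numbered    <- paste0("camera", 1:3)                             # no separator at all
collapsed   <- paste(c("These", "are", "words"), collapse = " ") # one vector into one string

underscored
numbered
collapsed
```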