Editing Text Variables

First we get some data.

if(!file.exists("./data")){dir.create("./data")}
fileURL <- "https://data.baltimorecity.gov/api/views/dz54-2aru/rows.csv?accessType=DOWNLOAD"
download.file(fileURL, destfile="./data/cameras.csv", method="curl")
cameraData <- read.csv("./data/cameras.csv")

The names of the variables could be a little better.

names(cameraData)
## [1] "address"      "direction"    "street"       "crossStreet" 
## [5] "intersection" "Location.1"

We can convert the names to all lowercase or all uppercase.

tolower(names(cameraData))
## [1] "address"      "direction"    "street"       "crossstreet" 
## [5] "intersection" "location.1"
toupper(names(cameraData))
## [1] "ADDRESS"      "DIRECTION"    "STREET"       "CROSSSTREET" 
## [5] "INTERSECTION" "LOCATION.1"
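
The calls above only print the converted names; the data frame itself is unchanged. A minimal sketch of assigning the result back, using a small made-up data frame (df and its columns are illustrative, not the real camera data):

```r
# Hypothetical data frame standing in for cameraData
df <- data.frame(crossStreet = c("A", "B"), Location.1 = c(1, 2))

# Assign the lowercased names back so the change persists
names(df) <- tolower(names(df))
names(df)
```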


Splitting a text string

We can use strsplit to split the names where the period is.

To split a string at a period you must escape it with two backslashes (\\), because the period is a reserved character in regular expressions: on its own it matches any single character.

This doesn’t work, because the unescaped period matches every character:

splitNames <- strsplit(names(cameraData), ".")

Adding \\ before the period does work:

splitNames <- strsplit(names(cameraData), "\\.")
splitNames
## [[1]]
## [1] "address"
## 
## [[2]]
## [1] "direction"
## 
## [[3]]
## [1] "street"
## 
## [[4]]
## [1] "crossStreet"
## 
## [[5]]
## [1] "intersection"
## 
## [[6]]
## [1] "Location" "1"
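
To see what goes wrong with an unescaped period, and two other ways to match a literal one, here is a small sketch on a single made-up string (x is illustrative):

```r
x <- "Location.1"  # illustrative input

# Unescaped "." is a regex wildcard: every character is treated as a separator,
# so only empty strings are left
wildcard <- strsplit(x, ".")[[1]]

# Three equivalent ways to split on a literal period
escaped   <- strsplit(x, "\\.")[[1]]
bracketed <- strsplit(x, "[.]")[[1]]
literal   <- strsplit(x, ".", fixed = TRUE)[[1]]

wildcard
escaped
```

fixed = TRUE turns off regular expression matching altogether, which can be easier to read when the pattern is a plain character.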

When a name is split, its list element becomes a character vector with one entry per piece. If the character that tells strsplit where to split isn’t found, the text string isn’t changed.

# No periods in the original name so the name isn't split
splitNames[[5]]
## [1] "intersection"
# Now the name has been split into two elements
splitNames[[6]]
## [1] "Location" "1"


Using Lists

A little refresher on using lists.

A list can hold elements of different classes, such as character vectors, numeric vectors and matrices.

mylist <- list(letters = c("A", "B", "C"), numbers = 1:3, matrix(1:25, ncol = 5))
mylist
## $letters
## [1] "A" "B" "C"
## 
## $numbers
## [1] 1 2 3
## 
## [[3]]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    6   11   16   21
## [2,]    2    7   12   17   22
## [3,]    3    8   13   18   23
## [4,]    4    9   14   19   24
## [5,]    5   10   15   20   25

We can select elements within the list by indexing them using []. The result is still a list.

mylist[1]
## $letters
## [1] "A" "B" "C"

Or by selecting the list element name using $

mylist$numbers
## [1] 1 2 3

And we can select the matrix itself using [[]]

mylist[[3]]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    6   11   16   21
## [2,]    2    7   12   17   22
## [3,]    3    8   13   18   23
## [4,]    4    9   14   19   24
## [5,]    5   10   15   20   25

and elements within the matrix

mylist[[3]][,1]
## [1] 1 2 3 4 5
mylist[[3]][2,]
## [1]  2  7 12 17 22
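
The difference between single and double brackets is easy to miss, so here is a short sketch on a made-up list (lst and key are illustrative names):

```r
# Illustrative list, similar in shape to mylist above
lst <- list(letters = c("A", "B", "C"), numbers = 1:3)

sub_list <- lst[1]    # single brackets keep the list wrapper
element  <- lst[[1]]  # double brackets return the element itself

class(sub_list)
class(element)

# Double brackets also accept a name, which helps when the name is stored in a variable
key <- "numbers"
lst[[key]]
```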


Extracting the first part of the column name

splitNames[[6]][1]
## [1] "Location"

This is a little function that returns the first element of a name that has been split.

firstElement <- function(x){x[1]}
sapply(splitNames, firstElement)
## [1] "address"      "direction"    "street"       "crossStreet" 
## [5] "intersection" "Location"
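
The same extraction can be written without defining a helper. A sketch on a small made-up list (parts is illustrative and only mimics the shape of splitNames):

```r
# Illustrative split result, shaped like splitNames above
parts <- list("address", c("Location", "1"))

# `[` is itself a function, so it can be passed to sapply with the index as an extra argument
first_sapply <- sapply(parts, `[`, 1)

# vapply does the same but checks that each result is a single character string
first_vapply <- vapply(parts, function(x) x[1], character(1))

first_sapply
```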


Using gsub and sub

Download the peer review study data.

fileUrl1 <- "https://dl.dropboxusercontent.com/u/7710864/data/reviews-apr29.csv"
fileUrl2 <- "https://dl.dropboxusercontent.com/u/7710864/data/solutions-apr29.csv"
download.file(fileUrl1,destfile="./data/reviews.csv",method="curl")
download.file(fileUrl2,destfile="./data/solutions.csv",method="curl")
reviews <- read.csv("./data/reviews.csv"); solutions <- read.csv("./data/solutions.csv")
head(reviews,2)
##   id solution_id reviewer_id      start       stop time_left accept
## 1  1           3          27 1304095698 1304095758      1754      1
## 2  2           4          22 1304095188 1304095206      2306      1


You might want to remove the underscores. sub looks for a pattern and replaces it with something else.

sub("_", "", names(reviews))
## [1] "id"         "solutionid" "reviewerid" "start"      "stop"      
## [6] "timeleft"   "accept"

sub replaces only the first match in each string, so it doesn’t work when a string contains several characters that should be replaced.

testName <- "This_is_a_test"
sub("_", "", testName)
## [1] "Thisis_a_test"

gsub will work though, because it replaces every match.

gsub("_", "", testName)
## [1] "Thisisatest"
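
Both functions work element by element on a vector. A side-by-side sketch on made-up column names (cols is illustrative, not taken from the reviews data):

```r
# Made-up column names, one of them with two underscores
cols <- c("time_left_sec", "reviewer_id", "accept")

sub_result  <- sub("_", "", cols)   # removes only the first underscore in each string
gsub_result <- gsub("_", "", cols)  # removes every underscore

sub_result
gsub_result
```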


Searching for specific values in variable names using grep and grepl

grep returns the positions (indices) of the elements where the search string is found

grep("Alameda", cameraData$intersection)
## [1]  4  5 36

grepl returns TRUE or FALSE for each element, indicating whether it contains the search string.

grepl("Alameda", cameraData$intersection)
##  [1] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [78] FALSE FALSE FALSE

Using table tells us how many elements do and do not contain the search string.

table(grepl("Alameda", cameraData$intersection))
## 
## FALSE  TRUE 
##    77     3
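
If only the number of matches is needed, summing the logical vector is a common shortcut. A sketch on a made-up vector (streets is illustrative, not the real intersection column):

```r
# Illustrative vector of intersections
streets <- c("The Alameda & 33rd St", "Caton Ave & Benson Ave", "E 33rd & The Alameda")

# TRUE counts as 1 and FALSE as 0, so the sum is the number of matches
n_matches <- sum(grepl("Alameda", streets))

# ignore.case = TRUE makes the search case-insensitive
n_ignore_case <- sum(grepl("alameda", streets, ignore.case = TRUE))

n_matches
```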

Now I can print out the three values where Alameda shows up.

cameraData$intersection[grep("Alameda", cameraData$intersection)]
## [1] The Alameda  & 33rd St   E 33rd  & The Alameda   
## [3] Harford \n & The Alameda
## 74 Levels:  & Caton Ave & Benson Ave ... York Rd \n & Gitting Ave

Although this is shorter and returns the same matches, as a character vector rather than a factor

grep("Alameda", cameraData$intersection, value=TRUE)
## [1] "The Alameda  & 33rd St"   "E 33rd  & The Alameda"   
## [3] "Harford \n & The Alameda"

Negating grepl with ! returns every row except the rows with “Alameda” in the intersection data.

subset_cameraData <- cameraData[!grepl("Alameda", cameraData$intersection), ]

If the value that you search for doesn’t appear, grep returns integer(0)

grep("JeffStreet", cameraData$intersection)
## integer(0)

The length of that value will be zero if the search string can’t be found.

length(grep("JeffStreet", cameraData$intersection))
## [1] 0
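
Because a failed search returns integer(0), negative indexing with grep can silently drop every row, while negation with grepl does not. A small sketch on a made-up data frame (df and its values are illustrative stand-ins for cameraData):

```r
# Small stand-in for cameraData; the values are made up
df <- data.frame(intersection = c("The Alameda & 33rd St", "Caton Ave & Benson Ave"),
                 stringsAsFactors = FALSE)

# Check up front whether the pattern occurs at all
found <- any(grepl("JeffStreet", df$intersection))

# -integer(0) is still integer(0), so this selects zero rows
dropped_all <- df[-grep("JeffStreet", df$intersection), , drop = FALSE]

# Logical negation keeps every row that does not match
kept <- df[!grepl("JeffStreet", df$intersection), , drop = FALSE]

nrow(dropped_all)
nrow(kept)
```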


Getting the length of character vectors

nchar returns the number of characters in a text string, whereas length returns the number of elements in the vector.

nchar("These are words")
## [1] 15
length("These are words")
## [1] 1
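
The difference is clearer on a vector with more than one element. A sketch with a made-up vector (words is illustrative):

```r
words <- c("These", "are", "words")  # illustrative vector

n_elements <- length(words)  # number of elements in the vector
n_chars    <- nchar(words)   # number of characters in each element

n_elements
n_chars
```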


Modifying the length of a text string

substr returns the characters from the first position to the second position, inclusive

substr("These words", 1, 5)
## [1] "These"
substr("These words", 2, 5)
## [1] "hese"
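
substr also works on whole vectors, and the related substring function can read to the end of the string when no stop position is given. A sketch on made-up strings (x is illustrative):

```r
x <- c("These words", "Other text")  # illustrative strings

# Same start and stop positions applied to every element
first_five <- substr(x, 1, 5)

# substring defaults to reading through the end of each string
rest <- substring(x, 7)

first_five
rest
```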

Pasting words together. By default paste separates them with a space.

paste("These", "are", "words")
## [1] "These are words"
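
The separator can be changed, and a vector can be collapsed into one string. A short sketch of these options (the inputs are illustrative):

```r
# sep sets the separator placed between the arguments
with_sep <- paste("These", "are", "words", sep = "_")

# paste0 is paste with no separator; it recycles the shorter argument
no_sep <- paste0("var", 1:3)

# collapse joins the elements of a single vector into one string
collapsed <- paste(c("These", "are", "words"), collapse = " ")

with_sep
no_sep
collapsed
```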