Differences

This shows you the differences between two versions of the page.

rstudio:finding_things_out [2013/04/03 15:38]
amelia
rstudio:finding_things_out [2016/05/13 13:45] (current)
Line 4: Line 4:
 ==== Sorting and ordering ==== ==== Sorting and ordering ====
  
-Because R is a statistical programming language, not a spreadsheet program like Excel, it isn’t as natural to sort a data set. If you think about it, why do we like sorting datasets? Well, it allows us to know the maximum and minimum value of a variable. But, summary() will tell us the maximum and minimum values. Sorting also allows us to scroll through and see the distribution of values, but it’s hard to hold enough information in our head to really grasp the distribution from scrolling (people can usually only remember 7 numbers at once, which is why we have 7-digit phone numbers!). A better way to see a distribution of values would be making a histogram, hist().\\+Because R is a statistical programming language, not a spreadsheet program like Excel, it isn’t as natural to sort a data set. If you think about it, why do we like sorting datasets? Well, it allows us to know the maximum and minimum value of a variable. But, ''summary()'' will tell us the maximum and minimum values. Sorting also allows us to scroll through and see the distribution of values, but it’s hard to hold enough information in our head to really grasp the distribution from scrolling (people can usually only remember 7 numbers at once, which is why we have 7-digit phone numbers!). A better way to see a distribution of values would be making a histogram, ''hist()''.\\
  
-With that said, if you want to sort your dataset, there is a way to do it. We use the order() command. But, if we use the command alone, it doesn’t give us what we want: +With that said, if you want to sort your dataset, there is a way to do it. We use the ''order()'' command. But, if we use the command alone, it doesn’t give us what we want: 
 <code r> <code r>
 order(labike$bike_count_pm) order(labike$bike_count_pm)
Line 12: Line 12:
 {{:rstudio:order_labike_bike_count_pm_.png?direct|}} {{:rstudio:order_labike_bike_count_pm_.png?direct|}}
  
-What are these values? Well, look through the data in the viewing pane– what row has the smallest number in the bike_count_pm column? It’s a tie between row [21] and row [30], which both have 35 in that column. Notice that the first two numbers that R printed out were 21 and 30, so it’s just giving us the list of indices, ordered by the values in bike_count_pm.\\ +What are these values? Well, look through the data in the viewing pane– what row has the smallest number in the ''bike_count_pm'' column? It’s a tie between row ''[21]'' and row ''[30]'', which both have 35 in that column. Notice that the first two numbers that R printed out were 21 and 30, so it’s just giving us the list of indices, ordered by the values in ''bike_count_pm''.\\ 
  
 So to get the results we really want, we need to apply it to the dataset So to get the results we really want, we need to apply it to the dataset
Line 18: Line 18:
 labike[order(labike$bike_count_pm),] labike[order(labike$bike_count_pm),]
 </code>  </code> 
- +{{ :rstudio:orderprintout.jpg?direct&700 |ordering part 1}} 
-MISSING IMAGE\\  +{{ :rstudio:orderprintout2.jpg?direct&700 |ordering part 2}}
  
 This is what we wanted, right? Our Console window isn’t wide enough to see all the columns at once, so R is printing out the last two columns after the rest of the dataset, but we can see that the data is sorted by bike_count_pm.\\ ​ This is what we wanted, right? Our Console window isn’t wide enough to see all the columns at once, so R is printing out the last two columns after the rest of the dataset, but we can see that the data is sorted by bike_count_pm.\\ ​
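To see the two steps side by side without the full ''labike'' data, here is a minimal sketch. The ''counts'' data frame and its values are made up for illustration; only ''order()'' and the square-bracket indexing come from the text above.

<code r>
# Hypothetical stand-in for labike: four sites, with two tied for the smallest count
counts <- data.frame(
  site = c("A", "B", "C", "D"),
  bike_count_pm = c(120, 35, 80, 35)
)

# order() returns row indices, not the sorted values themselves
idx <- order(counts$bike_count_pm)
idx
## [1] 2 4 3 1

# Use those indices as the row selector; the trailing comma keeps every column
sorted <- counts[idx, ]

# decreasing = TRUE puts the largest values first
sorted_desc <- counts[order(counts$bike_count_pm, decreasing = TRUE), ]
</code>

Rows 2 and 4 tie at 35, and ''order()'' lists them in their original order, which matches the tie between rows 21 and 30 described above.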
Line 57: Line 57:
 Notice the only tricky thing here– including na.rm=TRUE. If you don’t include that option, all these functions will return NA, because they are trying to compute a number and are encountering NA values. By passing na.rm=TRUE, you are telling the function to remove NA values, hence the name na.rm.\\  Notice the only tricky thing here– including na.rm=TRUE. If you don’t include that option, all these functions will return NA, because they are trying to compute a number and are encountering NA values. By passing na.rm=TRUE, you are telling the function to remove NA values, hence the name na.rm.\\ 
  
-==== Transforming data==== +This is true for number of other descriptive statistics functions,
-One of the great things about R is that it acts like a calculator, and it can be used to transform data very easily. For example, say you wanted to transform the height variable in the cdc data set from height in meters to height in inches. We know the formula is //inches = meters// × 39.37, and that’s easy to do in R, +
 <code r> <code r>
-head(cdc$height)+median(cdc$weight, na.rm = TRUE) 
 +## [1] 65.32 
 +min(cdc$weight, na.rm = TRUE) 
 +## [1] 34.47 
 +max(cdc$weight, na.rm = TRUE) 
 +## [1] 181 
 +</code>
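If you want to watch ''na.rm'' make a difference without loading the cdc data, the sketch below uses a small made-up vector. The name ''weights'' and its values are hypothetical; the functions and the ''na.rm'' argument are the ones discussed above.

<code r>
# Hypothetical weights with one missing value
weights <- c(60, 35, NA, 180, 70)

# Without na.rm, the missing value carries through to the result
mean(weights)
## [1] NA

# With na.rm = TRUE, the NA is dropped before computing
mean(weights, na.rm = TRUE)
## [1] 86.25
median(weights, na.rm = TRUE)
## [1] 65
min(weights, na.rm = TRUE)
## [1] 35
max(weights, na.rm = TRUE)
## [1] 180

# Count how many values are missing before deciding to drop them
sum(is.na(weights))
## [1] 1
</code>

Checking ''sum(is.na(...))'' first is a reasonable habit, since ''na.rm = TRUE'' silently ignores however many values happen to be missing.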
  
-## [1] 1.70 1.75 1.80 1.47 1.83 1.68 
- 
-heightInches = cdc$height * 39.37  
-head(heightInches) 
- 
-## [1] 66.93 68.90 70.87 57.87 72.05 66.14 
-</​code>​ 