Ed Küpfer
Joined: 30 Dec 2004
Posts: 783
Location: Toronto
Posted: Fri Jan 21, 2011 10:13 am Post subject: Learn R
Most of what we do comes down to comparing groups or individuals. This is pretty simple to do in practice (50% of my work is done on a hand-held calculator -- no joke). Excel works well for cleaning up data and performing arithmetic calculations and generating simple graphs. But there are some problems where you need a more powerful system: you can do linear regression in Excel, but not very easily, and logistic regression is nearly impossible. All manner of multivariate statistics are impossible in Excel. Complex graphs are difficult to implement. For these problems you need a dedicated statistical package.
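To make the regression point concrete, here is a minimal sketch in R. The data frame, its column names (minutes, points, win) and the simulated numbers are all made up for illustration; only lm() and glm() are standard R functions.

```r
# Hypothetical toy data: minutes played, points scored, and whether the team won.
set.seed(1)
games <- data.frame(minutes = round(runif(50, 10, 40)))
games$points <- 0.5 * games$minutes + rnorm(50, sd = 3)
games$win <- rbinom(50, 1, plogis((games$points - 12) / 4))

# Linear regression: points as a function of minutes.
linear_fit <- lm(points ~ minutes, data = games)
print(coef(linear_fit))

# Logistic regression: probability of a win as a function of points.
logistic_fit <- glm(win ~ points, data = games, family = binomial)
print(coef(logistic_fit))
```

Both fits are one line each, and summary(linear_fit) or summary(logistic_fit) prints the usual coefficient table.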
R is an open source statistical environment, a kind of stats-oriented programming language. If you want to leave the spreadsheet world, I'm not going to say that R is the easiest road to statistical packages, but I do know that today there are many resources to make the process easier. In particular, Millsy, who you might know as a frequent commenter at THE BOOK blog, has a series of tutorials on learning the basics of R at his blog. There are many introductory tutorials, but these have the benefit of being baseball-oriented, which should make it easier to apply mentally to the types of things apbrmetricians have in their heads.
R is not easy to learn, particularly if, like me, you don't have a proper statistics or programming background. But years ago, when I decided I wanted to do this for a living, I dedicated myself to learning R and probably within a few months I was able to use it comfortably. And that was before the explosion of learning resources emerged (when I started learning R, the only resource was the entirely unhelpful R-help listserv).
Like I said, you don't need a statistical package to do good work in this field, but there are some things you simply cannot do any other way. If you want to learn R, I encourage you to check out Millsy's posts to see if it's for you.
_________________
ed
DSMok1
Joined: 05 Aug 2009
Posts: 547
Location: Where the wind comes sweeping down the plains
Posted: Fri Jan 21, 2011 10:26 am
Wonderful! This is just what I need. I can do just about anything you can do with Excel, but R is the next frontier.
_________________
GodismyJudgeOK.com/DStats
EvanZ
Joined: 22 Nov 2010
Posts: 199
Posted: Fri Jan 21, 2011 11:20 am
I've been diving into R the last couple of months. It's fun! Laughing
(Actually two R's, R and Ruby!)
We should have an R gallery. Check out this awesome plot I made a while back (doesn't really matter what the data are):
You'd have a tough time doing that in Excel.
_________________
http://www.thecity2.com
http://www.ibb.gatech.edu/evan-zamir
Jon Nichols
Joined: 18 Aug 2005
Posts: 369
Posted: Fri Jan 21, 2011 11:28 am
Agreed completely, especially with the first line. Wink
I'd also like to add that even though I assume most people who care this much about basketball statistics are quite familiar with Excel, maybe not everyone is. Excel obviously has some limitations (as you mentioned, Ed), but I would also recommend that everyone truly learn Excel as well. Once you really dive in, and learn something even as simple as a VLOOKUP or IF formula, it's amazing what you can do. Plus, in terms of sharing files with non-APBRmetricians, Excel is pretty much essential.
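For anyone moving between the two, here is a rough sketch of how a VLOOKUP and an IF formula translate to R. The two tables and their column names are invented for the example; merge() and ifelse() are base R.

```r
# Hypothetical tables: player scoring lines and a team lookup table.
players <- data.frame(
  player = c("A", "B", "C"),
  team = c("TOR", "OKC", "TOR"),
  pts = c(12, 25, 8),
  stringsAsFactors = FALSE
)
teams <- data.frame(
  team = c("OKC", "TOR"),
  conference = c("West", "East"),
  stringsAsFactors = FALSE
)

# VLOOKUP equivalent: pull each player's conference from the lookup table.
# all.x = TRUE keeps players whose team is missing from the lookup.
joined <- merge(players, teams, by = "team", all.x = TRUE)

# IF equivalent: flag double-digit scorers.
joined$double_digits <- ifelse(joined$pts >= 10, "yes", "no")
print(joined)
```

Note that merge() sorts the result by the key column, so row order can differ from the original table.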
Mogilny
Joined: 05 Aug 2010
Posts: 12
Posted: Fri Jan 21, 2011 11:54 am
Cool. I've got some statistics background and a little programming from my engineering degree, but as a Swede I'm pretty new to advanced statistics in sports (we have soccer and hockey, and the statistics used here among the polar bears don't go further than goals, yellow cards and assists), so I'm trying to get into it. I've been lurking on this forum for quite a while and following a bunch of blogs, so I'm starting to get the basics. To really get it, though, I feel I need to get my hands dirty, so I will try out the program and the tutorials. Thanks for the url.
Ilardi
Joined: 15 May 2008
Posts: 257
Location: Lawrence, KS
Posted: Fri Jan 21, 2011 3:00 pm
Has anyone here ever picked up R after learning SAS? If so, what is that learning curve like? And is there any advantage in switching to R for a SAS user like myself?
acollard
Joined: 22 Sep 2010
Posts: 49
Location: MA
Posted: Fri Jan 21, 2011 3:25 pm
Excellent! I've been working with some things in R, and having a hell of a time with the help files, which are detailed but completely impractical from a how-to standpoint. Tutorials are ideal, and these look good.
As far as R vs. SAS and other programs go, I'm learning R mostly because of the graphical output capabilities, as well as some of the scripting and add-on packages. However, if you're trying to just process data and get numbers back out, I haven't found R to be better than any other program.
gabefarkas
Joined: 31 Dec 2004
Posts: 1311
Location: Durham, NC
Posted: Sun Jan 23, 2011 1:18 pm
Ilardi wrote:
Has anyone here ever picked up R after learning SAS? If so, what is that learning curve like? And is there any advantage in switching to R for a SAS user like myself?
I'm a SAS user, and tried to learn R in grad school (and have tried it out sporadically a few times since then). It's tough because it's a totally different interface and mentality, in my opinion.
R is command-line based. With SAS, I'll write a bunch of code (say, 20-25 lines), highlight it, and run it to test it out. Then I can go back and tweak the code with different options in the PROC's, or add lines to the DATA statements bit by bit to make sure things work the way I want, or comment lines out to figure out where the logic error is. I've never been sure how to do the same kind of iterative programming in R.
Also, I've felt that the way R names and handles datasets is unintuitive.
gabefarkas
Joined: 31 Dec 2004
Posts: 1311
Location: Durham, NC
Posted: Sun Jan 23, 2011 1:21 pm
acollard wrote:
As far as R vs. SAS and other programs go, I'm learning R mostly because of the graphical output capabilities, as well as some of the scripting and add-on packages. However, if you're trying to just process data and get numbers back out, I haven't found R to be better than any other program.
I think it's easier to quickly get neat-looking graphics in R than in SAS, definitely. The capabilities are there with SAS and its ODS environment, but it's definitely more difficult to finesse things and make them look pretty.
But, as I sort of alluded to in my previous post, I think SAS is much more powerful overall. More robust, too.
gabefarkas
Joined: 31 Dec 2004
Posts: 1311
Location: Durham, NC
Posted: Mon Jan 24, 2011 6:06 am
Also, for those who use R, I just came across something that might be a helpful tool:
http://realizationsinbiostatistics.blog ... -to-r.html
mtamada
Joined: 28 Jan 2005
Posts: 375
Posted: Mon Jan 24, 2011 7:50 pm
gabefarkas wrote:
Ilardi wrote:
Has anyone here ever picked up R after learning SAS? If so, what is that learning curve like?
[...]
R is command-line based. With SAS, I'll write a bunch of code (say, 20-25 lines), highlight it, and run it to test it out. Then I can go back and tweak the code with different options in the PROC's, or add lines to the DATA statements bit by bit to make sure things work the way I want, or comment lines out to figure out where the logic error is. I've never been sure how to do the same kind of iterative programming in R.
My experience with both is limited and from years ago: I've used SAS only a little, and have not become "fluent" with it. I used R's proprietary predecessor, S, more intensively, so I have experience with S but not R. (And that was S that I used, not its newer cousin S-Plus.)
The thing that I found annoying about SAS was the inadequacy of the documentation; all too often I could not figure out how to do a certain command, and would have to either buy a third party book or serendipitously find an example which was illustrating something else but fortuitously gave me a hint about how to do what I was trying to do. I didn't have that problem with S.
As for the programming, I was okay with S. But I actually prefer command line environments (my favorite text editor is vi). I'd do the kind of iterative programming that GabeF describes by creating ... I forget what they're called ... basically S scripts, and running and debugging them, doing that stuff he describes such as adding lines or commenting them out. Pretty fast and easy to do, and I didn't have to take my hands off the keyboard (which is the reason why I hate to use mice, touchpads, trackballs, etc.). In fact that's still how I do SQL programming, partly because I don't have access to a GUI for our database, just the Unix command line, partly because I don't like GUI's in the first place.
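The same script-based workflow carries over to R through source(). Below is a small sketch; the script contents and variable names are invented, and the temporary file only stands in for a .R file you would normally keep open in your editor.

```r
# Write a small script to a temporary file (hypothetical contents).
script_path <- tempfile(fileext = ".R")
writeLines(c(
  "x <- c(3, 5, 8)",
  "# x <- x * 2   # commented out while debugging",
  "total <- sum(x)"
), script_path)

# Run the whole script; echo = TRUE prints each statement as it is evaluated.
source(script_path, echo = TRUE)
print(total)
```

Editing the file, uncommenting a line, and calling source() again is the edit-run-tweak loop described above, and it never requires leaving the keyboard.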
There are undoubtedly nifty integrated development environments out there (Stata e.g. makes it easy to edit and debug one command at a time) but when I last used SAS for Windows, maybe 8 years ago, I found its programming environment to be fine but nothing special. E.g. it lacked a nice facility that SPSS has, to use the mouse and menus to execute a command -- and then to look up (and copy and paste) the code that was generated by those mouse clicks. (SPSS however is far weaker than any of the stat packages we're discussing here.)
basketballvalue
Joined: 07 Mar 2006
Posts: 204
Posted: Mon Jan 24, 2011 10:30 pm
gabefarkas wrote:
R is command-line based. With SAS, I'll write a bunch of code (say, 20-25 lines), highlight it, and run it to test it out. Then I can go back and tweak the code with different options in the PROC's, or add lines to the DATA statements bit by bit to make sure things work the way I want, or comment lines out to figure out where the logic error is. I've never been sure how to do the same kind of iterative programming in R.
Also, I've felt that the way R names and handles datasets is unintuitive.
I actually use R for doing the regression involved in the adjusted +/- dataset. I use Tinn-R for the editing that you describe above; it's not perfect but it's definitely workable. It reminds me of my MATLAB days back in grad school.....
Thanks,
Aaron
PS Thanks for the pointer to the tutorials, Ed.
_________________
http://www.basketballvalue.com
gabefarkas
Joined: 31 Dec 2004
Posts: 1311
Location: Durham, NC
Posted: Tue Jan 25, 2011 12:22 pm
mtamada wrote:
The thing that I found annoying about SAS was the inadequacy of the documentation; all too often I could not figure out how to do a certain command, and would have to either buy a third party book or serendipitously find an example which was illustrating something else but fortuitously gave me a hint about how to do what I was trying to do. I didn't have that problem with S.
...
There are undoubtedly nifty integrated development environments out there (Stata e.g. makes it easy to edit and debug one command at a time) but when I last used SAS for Windows, maybe 8 years ago, I found its programming environment to be fine but nothing special. E.g. it lacked a nice facility that SPSS has, to use the mouse and menus to execute a command -- and then to look up (and copy and paste) the code that was generated by those mouse clicks. (SPSS however is far weaker than any of the stat packages we're discussing here.)
From what I've seen, there was a pretty big leap between SAS v8 and SAS v9 (and then a decent-sized one between v9.1.3 and v9.2), in terms of documentation, examples, and ease-of-reading. Thinking about SAS v8 (and previous versions), I'd probably agree with you.
Also, SAS offers different add-on modules with customized GUIs; these are more menu-driven and resemble the kind of interface you'd find with SPSS.
acollard
Joined: 22 Sep 2010
Posts: 49
Location: MA
Posted: Tue Jan 25, 2011 12:50 pm
Just posted in another thread, but thought it might be relevant here. I used R to make rough network graphs of statistical similarities between teams. I struggled doing all the data processing in R, though, so I did most of that in Excel. Still need to work out some formatting issues. The scripts were pretty easy to write and modify. Network analysis seems pretty common in R, so there's a lot of documentation and how-to's if you look hard enough and know where to look.
Gold = Champ, Green = Runner-up, Red = 2011 team, Blue = Playoff, Beige = Lottery. Size is proportional to W-L%.
Spotlight Heat '11
Spotlight OKC '11
I'd be happy to post the code when I get home (even though the graph isn't very polished yet), in case anyone wants to try it with other similarities they have generated, as I know they're a pretty common exercise to do.
Ed Küpfer
Joined: 30 Dec 2004
Posts: 783
Location: Toronto
Posted: Tue Jan 25, 2011 2:21 pm
acollard wrote:
I'd be happy to post the code when I get home
Please do.
_________________
ed
acollard
Joined: 22 Sep 2010
Posts: 56
Location: MA
Posted: Wed Jan 26, 2011 2:53 am
Hi all, here's the code I've used for the network plotting. I hope it helps. I tried to annotate so it's clear what I've done along the way. I'm sure there are more elegant ways to write this code, but it seems to work for me.
Code:
library(network)  # provides network()
library(sna)      # provides as.sociomatrix.sna() and gplot()

# Excel-generated .csv edge list, an easy format to list your network data in:
# first column one team (vertex), second column a connected team (vertex).
m <- read.csv("C:/Users/AC/Desktop/edgelist2.csv")
# Sizes for the vertices, if you want them to change. I used win%.
size <- read.csv("C:/Users/AC/Desktop/wlsizes.csv")
# Colors. In R you can use rgb colors, assigned numbers, or just name the colors.
colors <- read.csv("C:/Users/AC/Desktop/colors.csv")
# Shorter team names, read in as strings rather than factors.
shortnames <- read.csv("C:/Users/AC/Desktop/Shortnames.csv", stringsAsFactors=FALSE)
shortnamesv <- as.character(shortnames[,1])  # makes sure the names are strings
palette(colors())  # allows color by numbers
net1 <- network(m)  # turns the edge list into a network dataset
net2 <- as.sociomatrix.sna(net1)  # turns net1 into an "adjacency matrix"
sizem <- data.matrix(size)  # needed to get sizes in the correct data format
colorsm <- data.matrix(colors)  # same with colors
colorsv <- c(colorsm[1:835])
sizes <- c(sizem[1:835])
# Makes the plot print to a .pdf file, great because it's vector based.
pdf(file="C:/Users/AC/Desktop/network11.pdf", width=30, height=30)
# Meat of the program: graphs the network diagram. Use help(gplot) for all the terms.
gplot(net2, gmode="digraph", mode="fruchtermanreingold", label=shortnamesv, boxed.labels=FALSE, displaylabel=TRUE, label.pos=5, label.cex=1.5*sizes, arrowhead.cex=.05, vertex.cex=sizes, edge.col="lightgrey", vertex.col=colorsv, label.pad=8, label.col="black", vertex.border="black")
dev.off()  # closes the .pdf file
Credit is owed to David Sparks and his arbitrarian blog. I didn't really use a lot of his coding, because he did his data processing in R, but his graphs and presentation were definitely something to strive for. His network graphs and his code can be found at http://arbitrarian.wordpress.com/2008/0 ... mple-code/
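For anyone unsure what the edge list and adjacency matrix in the code above look like, here is a small base-R sketch with an invented three-team edge list. It does not use the network or sna packages; it only illustrates the two data shapes, and the team pairs are hypothetical.

```r
# Hypothetical edge list: each row connects one team to a similar team.
edges <- data.frame(
  from = c("MIA", "MIA", "OKC"),
  to = c("BOS", "OKC", "BOS"),
  stringsAsFactors = FALSE
)

# Collect every team that appears in either column.
teams <- sort(unique(c(edges$from, edges$to)))

# Start from an all-zero square matrix and mark each directed edge with a 1.
adjacency <- matrix(0, nrow = length(teams), ncol = length(teams),
                    dimnames = list(teams, teams))
adjacency[cbind(edges$from, edges$to)] <- 1
print(adjacency)
```

Rows are the "from" team and columns the "to" team, so the matrix is not symmetric unless every edge is listed in both directions.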
EvanZ
Joined: 22 Nov 2010
Posts: 272
Posted: Mon Feb 28, 2011 4:25 pm
RStudio looks interesting:
http://www.rstudio.org/
_________________
http://www.thecity2.com
http://www.ibb.gatech.edu/evan-zamir
DSMok1
Joined: 05 Aug 2009
Posts: 608
Location: Where the wind comes sweeping down the plains
Posted: Mon Feb 28, 2011 7:25 pm
RStudio seems really helpful to me!
_________________
GodismyJudgeOK.com/DStats
acollard
Joined: 22 Sep 2010
Posts: 56
Location: MA
Posted: Wed Mar 02, 2011 2:47 am
Very cool. Be sure to post what you guys come up with!
DSMok1
Joined: 05 Aug 2009
Posts: 608
Location: Where the wind comes sweeping down the plains
Posted: Wed Mar 02, 2011 9:09 am
acollard wrote:
Very cool. Be sure to post what you guys come up with!
I'm going to try to nail down the exact rest-day adjustments using lm(). The first time I tried it (1 year only) there was collinearity between Washington and B2B (I'm sure that will go away with more years). So I figured out how to use lm.ridge(). Seems cool! Too bad you can't get standard errors on the coefficients using ridge regression.
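A rough sketch of that situation, with simulated data rather than real schedules: the variable names (b2b, was, margin), the effect sizes and the lambda value are all assumptions made for illustration. lm.ridge() comes from the MASS package.

```r
library(MASS)  # lm.ridge() lives in MASS, which ships with standard R installs

# Hypothetical data: a team indicator that nearly coincides with back-to-backs.
set.seed(42)
n <- 60
b2b <- rbinom(n, 1, 0.3)
was <- b2b                    # start as an exact copy of b2b ...
was[1:3] <- 1 - was[1:3]      # ... then flip a few games so it is only nearly collinear
margin <- -2 * b2b - 3 * was + rnorm(n, sd = 5)
games <- data.frame(margin, b2b, was)

# Ordinary least squares: estimates tend to be unstable when predictors overlap this much.
ols_fit <- lm(margin ~ b2b + was, data = games)
print(summary(ols_fit)$coefficients)

# Ridge regression: lambda > 0 shrinks the coefficients toward zero.
ridge_fit <- lm.ridge(margin ~ b2b + was, data = games, lambda = 5)
print(coef(ridge_fit))
```

Passing a vector such as lambda = seq(0, 20, by = 1) fits the whole path at once, and MASS::select() on the result reports suggested lambda values.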
_________________
GodismyJudgeOK.com/DStats
Mogilny
Joined: 05 Aug 2010
Posts: 23
Posted: Thu Mar 03, 2011 2:27 pm
I've just downloaded R and RStudio and been playing around for a few hours and I've got a noob question - do you guys have any csv-files with NBA stats to recommend and if so, where do I get them?
DSMok1
Joined: 05 Aug 2009
Posts: 608
Location: Where the wind comes sweeping down the plains
Posted: Thu Mar 03, 2011 2:37 pm
Mogilny wrote:
I've just downloaded R and RStudio and been playing around for a few hours and I've got a noob question - do you guys have any csv-files with NBA stats to recommend and if so, where do I get them?
Mostly, you've got to assemble CSV files in Excel or the like, from data off web tables. Basketball Reference has an option to convert each table to CSV form, which you could then paste into a text document and save that way.
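Once you have CSV text, getting it into R is short. A minimal sketch follows; the players, column names and numbers are placeholders, not real stats, and in practice you would pass a file path to read.csv() instead of the text argument.

```r
# Hypothetical CSV text, standing in for a table copied from a stats site.
csv_text <- "Player,Tm,G,PTS
Player One,TOR,40,812
Player Two,OKC,38,1010"

# read.csv() can parse text directly; normally you would give it a file path.
stats <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Derived column: points per game.
stats$PPG <- stats$PTS / stats$G
print(stats)
```

str(stats) is a quick way to check that numeric columns were actually read as numbers and not as text.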
_________________
GodismyJudgeOK.com/DStats
Mogilny
Joined: 05 Aug 2010
Posts: 23
Posted: Thu Mar 03, 2011 2:43 pm
DSMok1 wrote:
Mogilny wrote:
I've just downloaded R and RStudio and been playing around for a few hours and I've got a noob question - do you guys have any csv-files with NBA stats to recommend and if so, where do I get them?
Mostly, you've got to assemble CSV files in Excel or the like, from data off web tables. Basketball Reference has an option to convert each table to CSV form, which you could then paste into a text document and save that way.
Thanks! I'll give it a try at once. Smile
EvanZ
Joined: 22 Nov 2010
Posts: 272
Posted: Thu Mar 03, 2011 4:26 pm
My ezpm spreadsheets are available as .csv files and contain many individual and team stats derived from pbp data:
2008-2009
2009-2010
2010-2011 (through Feb. 13)
_________________
http://www.thecity2.com
http://www.ibb.gatech.edu/evan-zamir
DSMok1
Joined: 05 Aug 2009
Posts: 608
Location: Where the wind comes sweeping down the plains
Posted: Thu Mar 03, 2011 5:33 pm
EvanZ wrote:
My ezpm spreadsheets are available as .csv files and contain many individual and team stats derived from pbp data:
2008-2009
2009-2010
2010-2011 (through Feb. 13)
You could still use a glossary!
_________________
GodismyJudgeOK.com/DStats
EvanZ
Joined: 22 Nov 2010
Posts: 272
Posted: Thu Mar 03, 2011 6:20 pm
DSMok1 wrote:
EvanZ wrote:
My ezpm spreadsheets are available as .csv files and contain many individual and team stats derived from pbp data:
2008-2009
2009-2010
2010-2011 (through Feb. 13)
You could still use a glossary!
http://thecity2.com/glossary/
_________________
http://www.thecity2.com
http://www.ibb.gatech.edu/evan-zamir
DSMok1
Joined: 05 Aug 2009
Posts: 608
Location: Where the wind comes sweeping down the plains
Posted: Thu Mar 03, 2011 6:46 pm
Embarassed
_________________
GodismyJudgeOK.com/DStats
Mogilny
Joined: 05 Aug 2010
Posts: 23
Posted: Thu Mar 03, 2011 7:00 pm
Awesome Evan!
EvanZ
Joined: 22 Nov 2010
Posts: 272
Posted: Fri Mar 04, 2011 10:05 am
I've converted dougstats 10-11 stats from .txt to .csv:
https://spreadsheets0.google.com/pub?ke ... output=csv
_________________
http://www.thecity2.com
http://www.ibb.gatech.edu/evan-zamir
Learn R (Ed Küpfer, 2011)
Re: Learn R (Ed Küpfer, 2011)
Bumping this because I've had a couple of folks ask me about learning R over the last couple of weeks, and also because Millsy keeps posting great sports-based tutorials that may be easier for apbrmetricians to grasp. I want to highlight a recent post on basic graphing techniques that may help some people get started. Feel free to ask questions here.