Re: Talking about data acquisition

Posted: Sun May 19, 2013 1:26 pm
by Statman
mystic wrote:
kpascual wrote: If it exists and you know where it is, why can't that someone be you? I'll quote DJ Patil in saying "80% of the work in any data project is in cleaning the data." You're basically asking someone to do the 80% of the work for you, i.e. most of the work.
Unfortunately, that is true. Even with some programming skills, it is still not easy to get a clean datafile. I learnt that lesson myself. My matchupfile is not good at all, and I really admire those who can get a matchupfile with a low error rate.


Regarding the data acquisition: I guess the only way to get freely accessible raw data would be a project in which multiple people work together in their spare time and a provider volunteers the necessary storage and download volume.
On a related note - I am currently compiling the last 15 seasons of NCAA player data - and SOME/MUCH of the data acquisition is actually even copy/paste team by team.

"Cleaning" the data is a MASSIVE endeavor - I cross reference cbb-reference, ESPN, and statsheet with team totals to attempt to get the most accurate data - sometimes all three sites disagree with each other (albeit often slightly) on a player's stats, as well as player class (Fr, So, Jr, Sr). It's taking forever - which is why I have set everything aside and not made any updates to my site or done any NBA playoff stuff (or posted here). It's like I missed the playoffs and already am preparing for the NBA draft - well, that is exactly what I'm doing.
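
As a rough illustration of the cross-referencing described above, here is a minimal Python sketch. The numbers are made up, and the majority-vote rule is only an assumption for the example, not necessarily the procedure used for the actual dataset.

```python
from collections import Counter

def reconcile(values_by_source):
    """Return (consensus value, unanimous?) for one stat reported by several sources.

    Majority vote is an assumed rule for illustration; ties return None
    so the row can be checked by hand.
    """
    counts = Counter(values_by_source.values()).most_common()
    top_value, top_count = counts[0]
    if len(counts) > 1 and counts[1][1] == top_count:
        return None, False          # no majority: needs manual review
    return top_value, len(counts) == 1

def team_total_gap(player_values, team_total):
    """Difference between the listed team total and the sum of player rows."""
    return team_total - sum(player_values)

# Hypothetical numbers, not real data.
points = {"cbb-reference": 412, "ESPN": 412, "statsheet": 410}
consensus, unanimous = reconcile(points)
gap = team_total_gap([412, 305, 298], 1017)
```

A nonzero gap against the team total is a hint that at least one player row is still off, even when the sources agree with each other.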

I am PRAYING to be done enough before the NBA draft to do all the pre draft work ups with possibly NBA career projections before the draft takes place. No matter when I'm done - eventually I'll be offering NBA career projections for current NBA players and college players entering the NBA.

That being said, when I'm finally done - I'm not sure if I can offer all my 15 seasons of raw college data to the public - since I didn't buy it. I fixed it - it won't exactly match the data from ANY site (complete data for the 1999 and 2000 seasons can't even be found anywhere - I often have to use the Wayback Machine). Well, I'm hoping to make the full data available to the public someday, since I'll have it, and it's not found anywhere.

I also will be ATTEMPTING to best "fix" older NBA seasons - since I'll be doing NBA player ratings (completely adjusted for era and missing data of that era - turnovers, O & D rebounds, etc) all the way back to when player minutes were compiled (early 50s). The NBA team totals often don't match compiled player stats - so I have to cross reference multiple sites along with my old basketball encyclopedias and Total Basketball encyclopedias for the best guess of good player data. Maybe I'll be able to someday make that data available to everyone also - I don't really know.

Well, we'll see. First things first.

Re: Talking about data acquisition

Posted: Tue May 21, 2013 8:15 am
by AcrossTheCourt
Here are my thoughts on the subject of all the hard work in data collection....

Yes, it's true that most of the work that goes into adjusted +/- is working with play by play data and organizing into something workable. It takes skill, time, and patience to do that. The results are extremely valuable. Back in school, when I did research, the data I had was quite valuable, and you protect it. You don't want someone to have the data before you publish. And collecting or even accessing the data is the crucial link.

However, this is a place where we talk about basketball analytically. We want to further the cause and push basketball analysis into higher places. Clean, accessible play-by-play data and matchup files are essential to the process. We should be on the same team.

Each of us individually could work on compiling matchup data for a season. That would mean a huge output of hours working on this per individual. It would be more ideal to learn about the league if we pooled our efforts together and published play by play data for everyone to use. This also minimizes error and uncertainty. Like what basketballvalue did, for instance, and I wish that was done for every year with play-by-play data.

I know that you can have a huge advantage over the field with data no one else has, but given all the work that goes into it if we did that collectively then there would be more time/effort to work on the analysis, instead of dealing with data.

Re: Talking about data acquisition

Posted: Sun Jun 09, 2013 11:37 am
by wilq
Does anybody know why ESPN doesn't have play-by-play data for two games from the last regular season?
Do you think it's an error and they don't know about it? I mean those...
http://scores.espn.go.com/nba/boxscore?gameId=400277763
http://scores.espn.go.com/nba/boxscore?gameId=400277730

Re: Talking about data acquisition

Posted: Sun Jun 09, 2013 6:03 pm
by kpascual
AcrossTheCourt wrote:Here are my thoughts on the subject of all the hard work in data collection....

Yes, it's true that most of the work that goes into adjusted +/- is working with play by play data and organizing into something workable. It takes skill, time, and patience to do that. The results are extremely valuable. Back in school, when I did research, the data I had was quite valuable, and you protect it. You don't want someone to have the data before you publish. And collecting or even accessing the data is the crucial link.

However, this is a place where we talk about basketball analytically. We want to further the cause and push basketball analysis into higher places. Clean, accessible play-by-play data and matchup files are essential to the process. We should be on the same team.

Each of us individually could work on compiling matchup data for a season. That would mean a huge output of hours working on this per individual. It would be more ideal to learn about the league if we pooled our efforts together and published play by play data for everyone to use. This also minimizes error and uncertainty. Like what basketballvalue did, for instance, and I wish that was done for every year with play-by-play data.

I know that you can have a huge advantage over the field with data no one else has, but given all the work that goes into it if we did that collectively then there would be more time/effort to work on the analysis, instead of dealing with data.
I agree with all of what you said. We should be working together to provide clean, accessible data, because on the whole it will save everyone time and effort, and increase the speed at which we can provide solid analytical knowledge.

However, my concern is that the tone of the conversation is "can you get the data for me" as opposed to "how can we work together to get the data". To me analytics isn't really about statistics and algorithms, it's really about getting clean data. In school, data comes to you pre-cleaned, but in the real world you spend most of your time cleaning data, again at least 80%.

I just don't think there should be this expectation that clean data will be handed to you. That would be a disservice to the analyst, especially if you want to do analytics for a living.

I'm all for working together. So how should we proceed? Should we share our algorithms that produce the units? Share our code? Are people comfortable sharing code?

Re: Talking about data acquisition

Posted: Mon Jun 10, 2013 12:37 am
by SportsOps
Hi all,

I'm a long-time casual reader, but this is my first post. Looking forward to joining the online community.

kpascual, you gave this link for http://stats.nba.com/stats/playbyplay?G ... dPeriod=10. Can you explain how you accessed the page in that format? The typical nba.com/stats play-by-play links are of this form: http://stats.nba.com/gameDetail.html?Ga ... playbyplay, and it's not obvious to me how you got it in the other form. (Of course, one could just change the GameID in the link above, but I'm sure there's a better and more systematic way to do it). Thanks.

Re: Talking about data acquisition

Posted: Mon Jun 10, 2013 5:54 am
by kpascual
SportsOps wrote:Hi all,

I'm a long-time casual reader, but this is my first post. Looking forward to joining the online community.

kpascual, you gave this link for http://stats.nba.com/stats/playbyplay?G ... dPeriod=10. Can you explain how you accessed the page in that format? The typical nba.com/stats play-by-play links are of this form: http://stats.nba.com/gameDetail.html?Ga ... playbyplay, and it's not obvious to me how you got it in the other form. (Of course, one could just change the GameID in the link above, but I'm sure there's a better and more systematic way to do it). Thanks.
When you hit that play by play page, you'll notice the actual play by play text doesn't appear until a little bit after page load. That suggests that data is being loaded asynchronously. I used Chrome's Developer Tools to confirm this: if you enable XMLHttpRequest logging and refresh the page, you see this particular URL being loaded.

And if you go to that page, you see all the raw JSON data. Then by playing around with the parameters in the URL, you realize you can do a lot of very interesting things with it.
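
A minimal Python sketch of reading such a JSON response once you have the URL from the network log. The `resultSets` / `headers` / `rowSet` layout and the column names in the sample are assumptions for illustration, so check them against what the endpoint actually returns. The URL is left as a parameter because the one quoted in this thread is truncated.

```python
import json
from urllib.request import Request, urlopen

def fetch_json(url):
    """Download a URL and decode the body as JSON (not called in this sketch)."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))

def rows_as_dicts(payload, index=0):
    """Turn one table in the assumed headers/rowSet layout into a list of dicts."""
    table = payload["resultSets"][index]
    return [dict(zip(table["headers"], row)) for row in table["rowSet"]]

# Made-up sample in the assumed layout, so the parsing step can run offline.
sample = json.loads("""
{"resultSets": [{"headers": ["PERIOD", "CLOCK", "DESCRIPTION"],
                 "rowSet": [[1, "12:00", "Jump ball"],
                            [1, "11:41", "Made shot"]]}]}
""")
plays = rows_as_dicts(sample)
```

With a real response you would call `rows_as_dicts(fetch_json(url))` and loop over games by changing the game ID in the URL.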


Re: Talking about data acquisition

Posted: Mon Jun 10, 2013 3:28 pm
by SportsOps
kpascual wrote:
SportsOps wrote:Hi all,

I'm a long-time casual reader, but this is my first post. Looking forward to joining the online community.

kpascual, you gave this link for http://stats.nba.com/stats/playbyplay?G ... dPeriod=10. Can you explain how you accessed the page in that format? The typical nba.com/stats play-by-play links are of this form: http://stats.nba.com/gameDetail.html?Ga ... playbyplay, and it's not obvious to me how you got it in the other form. (Of course, one could just change the GameID in the link above, but I'm sure there's a better and more systematic way to do it). Thanks.
When you hit that play by play page, you'll notice the actual play by play text doesn't appear until a little bit after page load. That suggests that data is being loaded asynchronously. I used Chrome's Developer Tools to confirm this: if you enable XMLHttpRequest logging and refresh the page, you see this particular URL being loaded.

And if you go to that page, you see all the raw JSON data. Then by playing around with the parameters in the URL, you realize you can do a lot of very interesting things with it.

Awesome, thanks. I don't have much familiarity with JavaScript or web scraping, but I was able to figure out that bit relatively easily. I'll play around with it a bit more this week.

Another comment: it appears that the starting line-ups are omitted from the page (substitutions are included...though it only indicates last name, no first initial, which has the potential to be problematic). Someone might have addressed this in another post somewhere, but any recommendations on procuring the starting line-ups from nba.com?

Re: Talking about data acquisition

Posted: Mon Jul 08, 2013 7:21 pm
by AcrossTheCourt
EvanZ wrote:
AcrossTheCourt wrote:No, not blaming them for joining a team. The problem is once they do the website is toast. So you need a website with a large group of people or some system where certain roles can be filled once the person is gone.
You need someone like me who is happy with his current job and not looking to be hired by the NBA (although I can't say the opportunity hasn't been presented).

And if I did take a job, I'd make it a condition of being hired that the site would have to stay up.

FWIW, I'm actually planning to get the 90's data (as far back as I can go) this summer and put it on nbawowy. Look for it.
I would really love to work with the 90's data as well and help you form/parse it. I also want to have some fun with the lineup data: rebounding, +/-, usage/efficiency, etc. Let me know if I can help.

Re: Talking about data acquisition

Posted: Wed Sep 04, 2013 7:36 pm
by archilochusColubris
kpascual wrote:
AcrossTheCourt wrote:Here are my thoughts on the subject of all the hard work in data collection....

Yes, it's true that most of the work that goes into adjusted +/- is working with play by play data and organizing into something workable. It takes skill, time, and patience to do that. The results are extremely valuable. Back in school, when I did research, the data I had was quite valuable, and you protect it. You don't want someone to have the data before you publish. And collecting or even accessing the data is the crucial link.

However, this is a place where we talk about basketball analytically. We want to further the cause and push basketball analysis into higher places. Clean, accessible play-by-play data and matchup files are essential to the process. We should be on the same team.

Each of us individually could work on compiling matchup data for a season. That would mean a huge output of hours working on this per individual. It would be more ideal to learn about the league if we pooled our efforts together and published play by play data for everyone to use. This also minimizes error and uncertainty. Like what basketballvalue did, for instance, and I wish that was done for every year with play-by-play data.

I know that you can have a huge advantage over the field with data no one else has, but given all the work that goes into it if we did that collectively then there would be more time/effort to work on the analysis, instead of dealing with data.
I agree with all of what you said. We should be working together to provide clean, accessible data, because on the whole it will save everyone time and effort, and increase the speed at which we can provide solid analytical knowledge.

However, my concern is that the tone of the conversation is "can you get the data for me" as opposed to "how can we work together to get the data". To me analytics isn't really about statistics and algorithms, it's really about getting clean data. In school, data comes to you pre-cleaned, but in the real world you spend most of your time cleaning data, again at least 80%.

I just don't think there should be this expectation that clean data will be handed to you. That would be a disservice to the analyst, especially if you want to do analytics for a living.

I'm all for working together. So how should we proceed? Should we share our algorithms that produce the units? Share our code? Are people comfortable sharing code?
So it's been a while but I'd like to bring this thread back to the surface. It seems like it died out just when people started talking about how to move forward.

I'm new to the basketball analytics game and would love to jump in and start doing some research, but I'm a bit frustrated with how to even go about acquiring the necessary data. I'm not familiar with web scraping, so I don't even know how to start. I hear well that I'll have to devote some time to compiling the data, and I'm willing to devote time to the project if someone could help point me in the right direction.

So suppose I wanted to go about recreating the data files Aaron Barzilai posted at http://basketballvalue.com/downloads.php. Would anyone be able to help light my way?

Re: Talking about data acquisition

Posted: Thu Sep 05, 2013 9:12 am
by J.E.
Python with urllib and/or http://en.wikipedia.org/wiki/Beautiful_Soup works pretty well
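
For anyone starting from zero, a minimal sketch of that approach. It uses the standard library's `html.parser` in place of Beautiful Soup so it runs with no extra installs. The HTML snippet and player name are made up, and real box score pages will need their own handling.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

def fetch_html(url):
    """Download a page as text (not called here)."""
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")

def parse_rows(html):
    """Return all table rows found in an HTML string as lists of cell text."""
    parser = TableRows()
    parser.feed(html)
    return parser.rows

# Made-up snippet standing in for a downloaded page.
html = ("<table><tr><th>Player</th><th>PTS</th></tr>"
        "<tr><td>A. Example</td><td>12</td></tr></table>")
rows = parse_rows(html)
```

From there it is mostly a matter of writing `rows` out with the `csv` module and repeating per game.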

Re: Talking about data acquisition

Posted: Wed Jan 08, 2014 10:17 pm
by kohanz
kpascual wrote:What is the biggest pain point with regards to data acquisition or manipulation? Is it that people don't have the raw data accessible (i.e. play by play)? Or are the tools to access the raw or processed data not flexible/powerful/easy enough? Or is it in processing the data so data can be compared reliably?

The contents of this thread suggest all three are problems to some degree, but which one is the most painful?

I went about exposing my own data, but then realized I didn't know what the real problem was. Raw play by play data is basically solved by NBA.com/stats (http://stats.nba.com/stats/playbyplay?G ... dPeriod=10), and I can help on the tools/data processing piece. But it's hard to know exactly how to help without better understanding of what problem needs to be solved.
Hi there! I'm a big fan of Vorped (and NBAWowy for that matter). I've been working on my own side-project for a (too long) time that I'm hoping to make public sooner rather than later. It also relies on parsing PBP data, the code for which I wrote myself. I use a different source for the PBP data (feel free to PM) and, as you mentioned, a lot of the work has gone into cleaning the data. Substitutions are the biggest pain point. As I'm sure you know, substitutions between quarters are regularly omitted (but can be deduced fairly easily), but sometimes in-game substitutions are missed (bad data) and have to be estimated. I just spent hours last week tracking down a bug in my substitution cleaning code because there was a game where Kevin Love got a technical while on the bench - and I was assuming that only players on the court would get mentioned in the PBP :)

Anyway, I'm curious about the NBA pbp data. Are you saying it is cleaner than most? How do you get around the last-name only issue?

Re: Talking about data acquisition

Posted: Wed Jan 08, 2014 10:54 pm
by AcrossTheCourt
kohanz wrote:
kpascual wrote:What is the biggest pain point with regards to data acquisition or manipulation? Is it that people don't have the raw data accessible (i.e. play by play)? Or are the tools to access the raw or processed data not flexible/powerful/easy enough? Or is it in processing the data so data can be compared reliably?

The contents of this thread suggest all three are problems to some degree, but which one is the most painful?

I went about exposing my own data, but then realized I didn't know what the real problem was. Raw play by play data is basically solved by NBA.com/stats (http://stats.nba.com/stats/playbyplay?G ... dPeriod=10), and I can help on the tools/data processing piece. But it's hard to know exactly how to help without better understanding of what problem needs to be solved.
Hi there! I'm a big fan of Vorped (and NBAWowy for that matter). I've been working on my own side-project for a (too long) time that I'm hoping to make public sooner rather than later. It also relies on parsing PBP data, the code for which I wrote myself. I use a different source for the PBP data (feel free to PM) and, as you mentioned, a lot of the work has gone into cleaning the data. Substitutions are the biggest pain point. As I'm sure you know, substitutions between quarters are regularly omitted (but can be deduced fairly easily), but sometimes in-game substitutions are missed (bad data) and have to be estimated. I just spent hours last week tracking down a bug in my substitution cleaning code because there was a game where Kevin Love got a technical while on the bench - and I was assuming that only players on the court would get mentioned in the PBP :)

Anyway, I'm curious about the NBA pbp data. Are you saying it is cleaner than most? How do you get around the last-name only issue?
Yeah, you can't assume being mentioned in the pbp means you're in the game. I just ignore technical fouls for lineup substitutions.

There shouldn't be much estimating with creating the lineups though. The first thing I do is sweep through the period looking for substitutions. If you go both ways, you can figure out who was in before and after the substitution. After that check to see if you have five players all throughout the period. If not, you have a player or players who have played the entire period and thus need to be added through the entire pbp. You can figure this out by just looking for which players are in the pbp (excluding technical fouls or anything like that) but not in the substitution lines. Some pbp data includes who's starting or who starts a period/half, and that helps a lot.
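
A rough Python sketch of the sweep described above, for one team in one period. The event tuple layout is hypothetical and this is only one way to code the idea, not anyone's actual parser.

```python
def period_starters(events):
    """Infer who started a period for one team from its play-by-play events.

    events: list of tuples, either ("sub", player_in, player_out),
    ("play", player) or ("tech", player). This layout is hypothetical.
    """
    starters, seen = [], set()
    for ev in events:
        if ev[0] == "tech":
            continue                      # bench players can get technicals
        if ev[0] == "sub":
            _, p_in, p_out = ev
            if p_out not in seen:         # subbed out before ever coming in
                starters.append(p_out)
            seen.update((p_in, p_out))
        else:
            player = ev[1]
            if player not in seen:        # on court without a sub-in line
                starters.append(player)
            seen.add(player)
    return starters

def lineups(events):
    """Return the on-court set after each event, or None if starters are incomplete."""
    on_court = set(period_starters(events))
    if len(on_court) != 5:
        return None                       # someone never appears: must estimate
    out = []
    for ev in events:
        if ev[0] == "sub":
            on_court = (on_court - {ev[2]}) | {ev[1]}
        out.append(frozenset(on_court))
    return out

# Made-up period: Z only appears via a technical, F replaces C.
events = [("play", "A"), ("play", "B"), ("tech", "Z"), ("sub", "F", "C"),
          ("play", "D"), ("play", "F"), ("play", "E")]
result = lineups(events)
```

When `lineups` returns None, that is the case where a player has to be filled in from other information such as box score minutes.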

Finally, there is an instance where you need to estimate: a player who never comes out/in via a substitution for an entire period but doesn't show up in the pbp lines (i.e. doesn't take a shot, rebound, steal, etc.) This doesn't happen often, but it's almost always during overtime, and (at least from what I've seen) it's usually an away player.

Last names aren't too big of an issue. If you know the team and you have the last name, you probably know which exact player it is. But there are exceptions when you have teammates who share a name. When you eliminate games where one of those players didn't play, you need to estimate lineup patterns using things like who started, minutes totals, box score totals (points, rebounds, etc.), box score stats one player had and the other didn't, and finally if you're stuck positions can help (making sure there's at least one point guard on the court, for example.) You can figure out most of these patterns pretty easily, but there were a few instances where it's not 100%.
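
A small Python sketch of the first two steps (team plus last name, then dropping namesakes who did not play in that game). The roster layout and the "Pat Example" entry are hypothetical, and the harder pattern-based estimation is left out.

```python
from collections import defaultdict

def build_index(roster):
    """Map (team, last name) to the list of full names that match.

    roster: iterable of (team, first_name, last_name) tuples (hypothetical layout).
    """
    index = defaultdict(list)
    for team, first, last in roster:
        index[(team, last)].append(f"{first} {last}")
    return index

def resolve(index, team, last, played_in_game=None):
    """Return the single matching player, or None if still ambiguous."""
    candidates = index.get((team, last), [])
    if played_in_game is not None:
        candidates = [c for c in candidates if c in played_in_game]
    return candidates[0] if len(candidates) == 1 else None

roster = [("PHX", "Marcus", "Morris"), ("PHX", "Markieff", "Morris"),
          ("PHX", "Pat", "Example")]
idx = build_index(roster)
unique = resolve(idx, "PHX", "Example")
ambiguous = resolve(idx, "PHX", "Morris")
narrowed = resolve(idx, "PHX", "Morris", played_in_game={"Markieff Morris"})
```

Anything that still comes back as None is what needs the starter, minutes, and box score checks.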

Re: Talking about data acquisition

Posted: Thu Jan 09, 2014 3:50 am
by kohanz
AcrossTheCourt wrote:Yeah, you can't assume being mentioned in the pbp means you're in the game. I just ignore technical fouls for lineup substitutions.
Technicals are an exception, but are there other mentions that can't be assumed to mean that the player is in the game?
AcrossTheCourt wrote:There shouldn't be much estimating with creating the lineups though. The first thing I do is sweep through the period looking for substitutions. If you go both ways, you can figure out who was in before and after the substitution. After that check to see if you have five players all throughout the period. If not, you have a player or players who have played the entire period and thus need to be added through the entire pbp. You can figure this out by just looking for which players are in the pbp (excluding technical fouls or anything like that) but not in the substitution lines. Some pbp data includes who's starting or who starts a period/half, and that helps a lot.
Interesting to hear someone else's take on it. I definitely do some of this. The only time I have to make an assumption is when an in-game (during a quarter) substitution is completely omitted. The substitution had to occur between the last play involving the player subbed out and the first play involving the player subbed in, and likely during a stoppage where another substitution was made, but that's what I mean by assumption - there's not always an exact answer to that. For my purposes, not a huge deal, and I've only seen it happen in one game so far this season.
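
To make the bounding idea concrete, a minimal Python sketch. The event layout is hypothetical, and it assumes a single period in which the outgoing player does not return and both players appear at least once.

```python
def missing_sub_window(events, player_out, player_in):
    """Bound where an unrecorded substitution must have happened.

    events: list of (kind, players) tuples, where players is a tuple of
    names (hypothetical layout). Returns the last index involving
    player_out, the first later index involving player_in, and the
    indices of recorded substitutions in between (likely stoppages).
    """
    last_out = max(i for i, (_, ps) in enumerate(events) if player_out in ps)
    first_in = min(i for i, (_, ps) in enumerate(events)
                   if player_in in ps and i > last_out)
    stoppages = [i for i in range(last_out + 1, first_in)
                 if events[i][0] == "sub"]
    return last_out, first_in, stoppages

# Made-up sequence: X is last seen at index 1, Y first seen at index 4,
# and another substitution was recorded at index 2.
events = [("play", ("A",)), ("play", ("X",)), ("sub", ("G", "H")),
          ("play", ("B",)), ("play", ("Y",))]
window = missing_sub_window(events, "X", "Y")
```

If the stoppage list has exactly one entry, that is the natural guess. With none or several, it stays an assumption.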
AcrossTheCourt wrote:Finally, there is an instance where you need to estimate: a player who never comes out/in via a substitution for an entire period but doesn't show up in the pbp lines (i.e. doesn't take a shot, rebound, steal, etc.) This doesn't happen often, but it's almost always during overtime, and (at least from what I've seen) it's usually an away player.
Good point - overtime can be a special case for this.
AcrossTheCourt wrote:Last names aren't too big of an issue. If you know the team and you have the last name, you probably know which exact player it is. But there are exceptions when you have teammates who share a name. When you eliminate games where one of those players didn't play, you need to estimate lineup patterns using things like who started, minutes totals, box score totals (points, rebounds, etc.), box score stats one player had and the other didn't, and finally if you're stuck positions can help (making sure there's at least one point guard on the court, for example.) You can figure out most of these patterns pretty easily, but there were a few instances where it's not 100%.
I think there was one year where the Nets had 3 Williamses on the same team. More recently the Jazz had 2 Williamses. What about the Morris twins on Phoenix? If vorped is using the NBA pbp, there must be a way to determine, with certainty, which player each play comes from.

The PBP data I use (CNNSI) has first names, so this is not an issue.

Re: Talking about data acquisition

Posted: Thu Jan 23, 2014 4:36 pm
by kpascual
Most of the time it's easy to figure out who's on the court. For the times you don't know, or if you have conflicts, you can figure out pretty well who was playing at a given moment in the game using the stats.nba.com API. Just tweak the StartRange and EndRange to the appropriate time (it's based on seconds elapsed in the game x 10, i.e. tenths of a second). Whoever shows up in the player stats is probably on the court.

http://stats.nba.com/stats/boxscore?Gam ... angeType=2
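
A quick Python sketch of the StartRange/EndRange arithmetic, assuming 12-minute quarters, 5-minute overtimes, and a clock that counts down. The helper names and the one-second default window are my own choices for illustration.

```python
def range_value(period, clock):
    """Convert a period number and an 'MM:SS' game clock to tenths of seconds elapsed.

    Assumes 12-minute quarters (periods 1-4) and 5-minute overtimes (5+).
    """
    minutes, seconds = (int(x) for x in clock.split(":"))
    remaining = minutes * 60 + seconds
    if period <= 4:
        elapsed = (period - 1) * 720 + (720 - remaining)
    else:
        elapsed = 4 * 720 + (period - 5) * 300 + (300 - remaining)
    return elapsed * 10

def range_params(period, clock, width_seconds=1):
    """StartRange/EndRange pair covering a short window starting at that moment."""
    start = range_value(period, clock)
    return {"StartRange": start, "EndRange": start + width_seconds * 10}

# Halfway through the second quarter.
params = range_params(2, "6:00")
```

These two values would then replace the StartRange and EndRange parameters in the box score URL.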