Talking about data acquisition

colts18
Posts: 313
Joined: Fri Aug 31, 2012 1:52 am

Re: Talking about data acquisition

Post by colts18 »

EvanZ wrote:
FWIW, I'm actually planning to get the 90's data (as far back as I can go) this summer and put it on nbawowy. Look for it.
Are you going to put the pbp data online, or will you do the WOWY analysis for that data like you did for the 12-13 season?
kpascual
Posts: 50
Joined: Thu Mar 01, 2012 7:02 pm

Re: Talking about data acquisition

Post by kpascual »

What is the biggest pain point with regards to data acquisition or manipulation? Is it that people don't have the raw data accessible (i.e. play by play)? Or are the tools to access the raw or processed data not flexible/powerful/easy enough? Or is it in processing the data so data can be compared reliably?

The contents of this thread suggest all three are problems to some degree, but which one is the most painful?

I went about exposing my own data, but then realized I didn't know what the real problem was. Raw play by play data is basically solved by NBA.com/stats (http://stats.nba.com/stats/playbyplay?G ... dPeriod=10), and I can help on the tools/data processing piece. But it's hard to know exactly how to help without a better understanding of what problem needs to be solved.
colts18
Posts: 313
Joined: Fri Aug 31, 2012 1:52 am

Re: Talking about data acquisition

Post by colts18 »

kpascual wrote:What is the biggest pain point with regards to data acquisition or manipulation? Is it that people don't have the raw data accessible (i.e. play by play)? Or are the tools to access the raw or processed data not flexible/powerful/easy enough? Or is it in processing the data so data can be compared reliably?

Raw play by play data for the 1997-2000 seasons would be awesome. I think if the pbp and lineup/matchup data is out there, someone would be able to do some kind of APM or RAPM analysis.
kpascual
Posts: 50
Joined: Thu Mar 01, 2012 7:02 pm

Re: Talking about data acquisition

Post by kpascual »

colts18 wrote: Raw play by play data for the 1997-2000 seasons would be awesome. I think if the pbp and lineup/matchup data is out there, someone would be able to do some kind of APM or RAPM analysis.
So your problem is just having play by play data that might not exist, and might not have been recorded at all? I can't help with that.

I can help somewhat with accessing boxscore or play by play data (that exists). Would something like these URLs be helpful? It's the NBA.com boxscore and play by play for some game in 2010, in CSV form. This also exists in JSON by replacing ".csv" with ".json".

http://api.vorped.com/sample/boxscore.csv
http://api.vorped.com/sample/playbyplay.csv

I would imagine this would be super helpful, since you wouldn't need a version of the CSV file on your local machine, like, ever. With Excel you can import a file and enter this URL, which will create a data table for you. With R, you can do a pretty simple command to load the data:

a <- read.table("http://api.vorped.com/sample/playbyplay.csv", header=TRUE, sep=",")
It's a pretty stupid, trivial example. But imagine you can pass parameters, like player name or season, into the URL. Then updating your models/spreadsheets/visualizations becomes trivial.
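
A minimal Python sketch of the parameter idea, offered as an illustration and not as a description of how the vorped API actually behaves. The `player` and `season` query parameters are assumptions (nothing in this thread confirms the sample URL accepts them), and `build_url`, `parse_csv`, and `load_rows` are hypothetical helper names.

```python
import csv
import io
from urllib.parse import urlencode
from urllib.request import urlopen

# Sample URL quoted in this post; whether it accepts query parameters is an assumption.
BASE_URL = "http://api.vorped.com/sample/playbyplay.csv"


def build_url(base, **params):
    """Append query parameters to a base URL, skipping any that are None."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{base}?{query}" if query else base


def parse_csv(text):
    """Turn CSV text with a header row into a list of dicts (one per row)."""
    return list(csv.DictReader(io.StringIO(text)))


def load_rows(url):
    """Download a CSV over HTTP and parse it. Needs network access."""
    with urlopen(url) as response:
        return parse_csv(response.read().decode("utf-8"))
```

With something like this, refreshing a model would come down to calling `load_rows(build_url(BASE_URL, season="1997-98"))` again, provided the server really supports such a parameter.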
colts18
Posts: 313
Joined: Fri Aug 31, 2012 1:52 am

Re: Talking about data acquisition

Post by colts18 »

kpascual wrote:
colts18 wrote: Raw play by play data for the 1997-2000 seasons would be awesome. I think if the pbp and lineup/matchup data is out there, someone would be able to do some kind of APM or RAPM analysis.
So your problem is just having play by play data that might not exist, and might not have been recorded at all? I can't help with that.

The play by play data does exist. NBA.com has play by play data for 1997-2000. But someone has to parse the pbp data so that an APM/RAPM analysis can be done for the 97-2000 seasons.
kpascual
Posts: 50
Joined: Thu Mar 01, 2012 7:02 pm

Re: Talking about data acquisition

Post by kpascual »

colts18 wrote: The play by play data does exist. NBA.com has play by play data for 1997-2000. But someone has to parse the pbp data so that an APM/RAPM analysis can be done for the 97-2000 seasons.
If it exists and you know where it is, why can't that someone be you? I'll quote DJ Patil in saying "80% of the work in any data project is in cleaning the data." You're basically asking someone to do 80% of the work for you, i.e. most of the work.

http://radar.oreilly.com/2012/07/data-jujitsu.html
colts18
Posts: 313
Joined: Fri Aug 31, 2012 1:52 am

Re: Talking about data acquisition

Post by colts18 »

kpascual wrote:
colts18 wrote: The play by play data does exist. NBA.com has play by play data for 1997-2000. But someone has to parse the pbp data so that an APM/RAPM analysis can be done for the 97-2000 seasons.
If it exists and you know where it is, why can't that someone be you? I'll quote DJ Patil in saying "80% of the work in any data project is in cleaning the data." You're basically asking someone to do 80% of the work for you, i.e. most of the work.

http://radar.oreilly.com/2012/07/data-jujitsu.html
NBA.com does have the play by play data for 97-00. But I don't have the programming experience to convert the play by play into lineup data.

For example, here is a random Bulls-Jazz game in 1998.

http://stats.nba.com/stats/playbyplay?G ... dPeriod=10

I don't have the programming experience to make that into something usable.
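
For anyone wondering what that conversion roughly involves, here is a minimal Python sketch of the core step: walk one team's events in order and cut a new "stint" every time a substitution changes the five players on the floor. The event schema (`type`, `in`, `out`, `points`) is invented for illustration and is not the NBA.com format, `lineup_stints` is a hypothetical name, and the single-letter players are placeholders, not real data.

```python
def lineup_stints(starters, events):
    """Split one team's events into stints of (lineup, points scored).

    starters: the five players on the floor at the start.
    events:   ordered dicts in a made-up schema, either
              {"type": "sub", "in": name, "out": name} or
              {"type": "score", "points": int}.
    """
    on_court = set(starters)
    points = 0
    stints = []
    for event in events:
        if event["type"] == "sub":
            # Close the current stint before the lineup changes.
            stints.append((frozenset(on_court), points))
            on_court.discard(event["out"])
            on_court.add(event["in"])
            points = 0
        elif event["type"] == "score":
            points += event["points"]
    # The last lineup stays on the floor until the events run out.
    stints.append((frozenset(on_court), points))
    return stints


# Placeholder example: A is replaced by F after one basket.
example = lineup_stints(
    ["A", "B", "C", "D", "E"],
    [
        {"type": "score", "points": 2},
        {"type": "sub", "in": "F", "out": "A"},
        {"type": "score", "points": 3},
    ],
)
```

This only covers one team and ignores the clock. A usable matchup file would presumably also need the opponent's lineup, possessions or time per stint, and a way to work out who is on the floor at the start of each period, which is likely where much of the cleaning effort goes.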
DSMok1
Posts: 1119
Joined: Thu Apr 14, 2011 11:18 pm
Location: Maine

Re: Talking about data acquisition

Post by DSMok1 »

FWIW, that NBA 97-00 data will be really hard to use since it has only last names (not even a first initial).
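
A minimal Python sketch of why last names alone are a problem, assuming you have a roster of full names for the team in question. `resolve_last_name` is a hypothetical helper and the roster below is made up. A name can only be resolved when exactly one player on the roster carries it.

```python
def resolve_last_name(last_name, roster):
    """Return the one full name on `roster` with this last name, else None.

    None means the last name is either missing from the roster or shared
    by several players, so the event cannot be attributed automatically.
    """
    wanted = last_name.lower()
    matches = [full for full in roster if full.split()[-1].lower() == wanted]
    return matches[0] if len(matches) == 1 else None


# Made-up roster with two players sharing a last name.
roster = ["Al Smith", "Bo Smith", "Cy Jones"]
unique = resolve_last_name("Jones", roster)
ambiguous = resolve_last_name("Smith", roster)
```

Here `unique` is "Cy Jones" while `ambiguous` is None. Taking the last whitespace-separated token as the last name is itself a simplification that would break on suffixes such as "Jr.".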
Developer of Box Plus/Minus
APBRmetrics Forum Administrator
Twitter.com/DSMok1
fpliii
Posts: 85
Joined: Fri May 10, 2013 1:38 pm

Re: Talking about data acquisition

Post by fpliii »

EvanZ wrote:
AcrossTheCourt wrote:No, not blaming them for joining a team. The problem is once they do the website is toast. So you need a website with a large group of people or some system where certain roles can be filled once the person is gone.
You need someone like me who is happy with his current job and not looking to be hired by the NBA (although I can't say the opportunity hasn't been presented).

And if I did take a job, I'd make it a condition of being hired that the site would have to stay up.

FWIW, I'm actually planning to get the 90's data (as far back as I can go) this summer and put it on nbawowy. Look for it.
Love your site, thanks for your hard work. I'm definitely looking forward to having the complete PbP dataset available on wowy for analysis.

In general, I'd agree with the OP. I have a strong background in math/stats, and have done some programming (minor projects in Java, C++, C, but also some stuff in R, Matlab), but am by no means confident in handling large datasets with variable formats. I should probably set aside some time after this season and look at some of the threads on this forum (and reposted from the archives) to figure out the mechanics of data acquisition and handling.
wilq
Posts: 80
Joined: Fri Apr 15, 2011 4:05 pm
Location: Poland

Re: Talking about data acquisition

Post by wilq »

Speaking of data acquisition, I have a question: is it even legal to publish files with processed data from hundreds of pages? Recently someone asked me for all players' gamelogs, which I do have, but can I publish them in one downloadable file, or would that be unfair to the sources? The second question is probably for Neil Paine or someone else who works at the source.
kpascual wrote:What is the biggest pain point with regards to data acquisition or manipulation? Is it that people don't have the raw data accessible (i.e. play by play)? Or are the tools to access the raw or processed data not flexible/powerful/easy enough? Or is it in processing the data so data can be compared reliably?

The contents of this thread suggest all three are problems to some degree, but which one is the most painful?
That probably depends entirely on individual skill set, so I don't expect there to be a general rule. Though I guess there are common stages: 1) is there data, and if not, can I create it, 2) can I process it, and 3) can I use it/present it in a desirable way.
DSMok1
Posts: 1119
Joined: Thu Apr 14, 2011 11:18 pm
Location: Maine

Re: Talking about data acquisition

Post by DSMok1 »

wilq wrote:Speaking of data acquisition, I have a question: is it even legal to publish files with processed data from hundreds of pages? Recently someone asked me for all players' gamelogs, which I do have, but can I publish them in one downloadable file, or would that be unfair to the sources? The second question is probably for Neil Paine or someone else who works at the source.
If you have not significantly altered the data in some way (value-added) then no, I don't think it's ethical--the website had to pay good money to get that same data. I don't think it's an issue to share with a few folks on request, though--just not publish it publicly.

If you HAVE significantly altered the data in some way--like you've calculated some stat with the data and are not publishing the original source data, no problem--that stat is yours.

It's a bit of a fuzzy topic legally. From what I understand, many/most sites claim more rights over the use of their data than they could actually enforce legally. But if you want to be ethical, I would always ask the source site (i.e. Justin Kubatko).
PD123
Posts: 32
Joined: Wed Jan 30, 2013 9:32 pm

Re: Talking about data acquisition

Post by PD123 »

wilq wrote:Speaking of data acquisition, I have a question: is it even legal to publish files with processed data from hundreds of pages? Recently someone asked me for all players' gamelogs, which I do have, but can I publish them in one downloadable file, or would that be unfair to the sources?
IANAL (I say as I proceed to talk like I know what I'm talking about).

Legally speaking, you're more or less in the clear to publish it as long as it doesn't include relatively specialized statistics. People can't easily claim ownership of facts in a legal sense, and statistics are simply another way of recording things like who had the ball or who missed a shot. Where it could get a little fuzzy is if you published something like another person's rating metric, which is arguably more a statement of opinion than of fact. But the bottom line is that nobody has really ever successfully enforced any claim of ownership over generic and common sports statistics.

As for the ethical question, as much as I'd like to have easily downloadable sources of information, I'm compelled to agree with DSMok1 that posting information that you've gotten from someone else, without their permission and/or without wholesale changes, is probably not the most ethical thing to do.
wilq
Posts: 80
Joined: Fri Apr 15, 2011 4:05 pm
Location: Poland

Re: Talking about data acquisition

Post by wilq »

DSMok1 wrote:If you have not significantly altered the data in some way (value-added) then no, I don't think it's ethical--the website had to pay good money to get that same data. I don't think it's an issue to share with a few folks on request, though--just not publish it publicly.
PD123 wrote:As for the ethical question, as much as I'd like to have easily downloadable sources of information, I'm compelled to agree with DSMok1 that posting information that you've gotten from someone else, without their permission and/or without wholesale changes, is probably not the most ethical thing to do.
I agree with your point of view... so maybe THIS is the biggest problem with talking about data acquisition? Even if someone has already collected the data, it really doesn't help anybody else interested in the same data.

Frankly, I'm puzzled why there aren't more projects like this: http://www.basketballreference.com/stats_download.htm.
Isn't the business of "one page per player" totally different than "all pages in one file"? And aren't they targeted at different people? Am I wrong to think basketball-reference.com or nba.com could sell files such as "gamelogs from 1985-86 to today" or "playbyplay9798"? I don't know how profitable it would be but the cost of this idea is basically "only bandwidth" because they already have the data!
DSMok1
Posts: 1119
Joined: Thu Apr 14, 2011 11:18 pm
Location: Maine

Re: Talking about data acquisition

Post by DSMok1 »

wilq wrote:Am I wrong to think basketball-reference.com or nba.com could sell files such as "gamelogs from 1985-86 to today" or "playbyplay9798"? I don't know how profitable it would be but the cost of this idea is basically "only bandwidth" because they already have the data!
Basketball Reference buys their data from a provider, and it is, I believe, outside of their terms of use to provide the data for wholesale download (after all, that data is what the provider is selling!).
mystic
Posts: 470
Joined: Mon Apr 18, 2011 10:09 am

Re: Talking about data acquisition

Post by mystic »

kpascual wrote: If it exists and you know where it is, why can't that someone be you? I'll quote DJ Patil in saying "80% of the work in any data project is in cleaning the data." You're basically asking someone to do 80% of the work for you, i.e. most of the work.
Unfortunately, that is true. Even with some programming skills, it is still not easy to get a clean datafile. I learnt that lesson myself. My matchupfile is not good at all, and I really admire those who can get a matchupfile with a low error rate.


Regarding data acquisition: I guess the only way to get freely accessible raw data would be a project in which multiple people work together in their spare time and a provider volunteers the necessary storage and download volume.