Re: Talking about data acquisition

Posted: Mon Apr 29, 2013 11:41 pm
by colts18
EvanZ wrote:
FWIW, I'm actually planning to get the 90's data (as far back as I can go) this summer and put it on nbawowy. Look for it.
Are you going to put the pbp data online, or will you do the WOWY analysis for that data like you did for the 12-13 season?

Re: Talking about data acquisition

Posted: Wed May 01, 2013 4:44 pm
by kpascual
What is the biggest pain point with regards to data acquisition or manipulation? Is it that people don't have the raw data accessible (i.e. play by play)? Or are the tools to access the raw or processed data not flexible/powerful/easy enough? Or is it in processing the data so data can be compared reliably?

The contents of this thread suggest all three are problems to some degree, but which one is the most painful?

I went about exposing my own data, but then realized I didn't know what the real problem was. Raw play by play data is basically solved by NBA.com/stats (http://stats.nba.com/stats/playbyplay?G ... dPeriod=10), and I can help on the tools/data processing piece. But it's hard to know exactly how to help without better understanding of what problem needs to be solved.

Re: Talking about data acquisition

Posted: Wed May 01, 2013 6:38 pm
by colts18
kpascual wrote:What is the biggest pain point with regards to data acquisition or manipulation? Is it that people don't have the raw data accessible (i.e. play by play)? Or are the tools to access the raw or processed data not flexible/powerful/easy enough? Or is it in processing the data so data can be compared reliably?

The contents of this thread suggest all three are problems to some degree, but which one is the most painful?

I went about exposing my own data, but then realized I didn't know what the real problem was. Raw play by play data is basically solved by NBA.com/stats (http://stats.nba.com/stats/playbyplay?G ... dPeriod=10), and I can help on the tools/data processing piece. But it's hard to know exactly how to help without better understanding of what problem needs to be solved.
Raw play by play data for the 1997-2000 seasons would be awesome. I think if the pbp and lineup/matchup data is out there, someone would be able to do some kind of APM or RAPM analysis.

Re: Talking about data acquisition

Posted: Sat May 04, 2013 8:22 pm
by kpascual
colts18 wrote: Raw play by play data for the 1997-2000 seasons would be awesome. I think if the pbp and lineup/matchup data is out there, someone would be able to do some kind of APM or RAPM analysis.
So your problem is just having play by play data that might not exist, and might not have been recorded at all? I can't help with that.

I can help somewhat with accessing boxscore or play by play data (that exists). Would something like these URLs be helpful? It's the NBA.com boxscore and play by play for some game in 2010, in CSV form. This also exists in JSON by replacing ".csv" with ".json".

http://api.vorped.com/sample/boxscore.csv
http://api.vorped.com/sample/playbyplay.csv

I would imagine this would be super helpful, since you wouldn't need a version of the CSV file on your local machine, like, ever. With Excel you can import a file and enter this URL, which will create a data table for you. With R, you can do a pretty simple command to load the data:

a <- read.table("http://api.vorped.com/sample/playbyplay.csv", header=TRUE, sep=",")
It's a pretty stupid, trivial example. But imagine you can pass parameters, like player name or season, into the URL. Then updating your models/spreadsheets/visualizations becomes trivial.
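
To make the "pass parameters into the URL" idea concrete, here is a minimal sketch in R. The parameter names `player` and `season` and the helper `build_query_url` are assumptions made up for illustration; the sample endpoint above is not known to accept any parameters, so the actual download is left commented out.

```r
# Build a query URL from a base endpoint and a named list of parameters.
# The parameter names used below are hypothetical.
build_query_url <- function(base_url, params = list()) {
  if (length(params) == 0) {
    return(base_url)
  }
  # Percent-encode each value so that names containing spaces stay valid
  encoded <- vapply(
    params,
    function(v) utils::URLencode(as.character(v), reserved = TRUE),
    character(1)
  )
  query <- paste(names(params), encoded, sep = "=", collapse = "&")
  paste0(base_url, "?", query)
}

url <- build_query_url(
  "http://api.vorped.com/sample/playbyplay.csv",
  list(player = "Michael Jordan", season = "1997-98")
)
# pbp <- read.csv(url)  # only meaningful if such an endpoint really exists
```

With something like this, a model or spreadsheet refresh would only need a different `params` list rather than a new local file.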

Re: Talking about data acquisition

Posted: Sun May 05, 2013 12:15 am
by colts18
kpascual wrote:
colts18 wrote: Raw play by play data for the 1997-2000 seasons would be awesome. I think if the pbp and lineup/matchup data is out there, someone would be able to do some kind of APM or RAPM analysis.
So your problem is just having play by play data that might not exist, and might not have been recorded at all? I can't help with that.

I can help somewhat with accessing boxscore or play by play data (that exists). Would something like these URLs be helpful? It's the NBA.com boxscore and play by play for some game in 2010, in CSV form. This also exists in JSON by replacing ".csv" with ".json".

http://api.vorped.com/sample/boxscore.csv
http://api.vorped.com/sample/playbyplay.csv

I would imagine this would be super helpful, since you wouldn't need a version of the CSV file on your local machine, like, ever. With Excel you can import a file and enter this URL, which will create a data table for you. With R, you can do a pretty simple command to load the data:

a <- read.table("http://api.vorped.com/sample/playbyplay.csv", header=TRUE, sep=",")
It's a pretty stupid, trivial example. But imagine you can pass parameters, like player name or season, into the URL. Then updating your models/spreadsheets/visualizations becomes trivial.
The play by play data does exist. NBA.com has play by play data for 1997-2000. But someone has to parse out the pbp data so that an APM/RAPM analysis can be done for the 97-2000 seasons.

Re: Talking about data acquisition

Posted: Sun May 05, 2013 1:03 am
by kpascual
colts18 wrote: The play by play data does exist. NBA.com has play by play data for 1997-2000. But someone has to parse out the pbp data so that an APM/RAPM analysis can be done for the 97-2000 seasons.
If it exists and you know where it is, why can't that someone be you? I'll quote DJ Patil in saying "80% of the work in any data project is in cleaning the data." You're basically asking someone to do the 80% of the work for you, i.e. most of the work.

http://radar.oreilly.com/2012/07/data-jujitsu.html

Re: Talking about data acquisition

Posted: Sun May 05, 2013 9:53 pm
by colts18
kpascual wrote:
colts18 wrote: The play by play data does exist. NBA.com has play by play data for 1997-2000. But someone has to parse out the pbp data so that an APM/RAPM analysis can be done for the 97-2000 seasons.
If it exists and you know where it is, why can't that someone be you? I'll quote DJ Patil in saying "80% of the work in any data project is in cleaning the data." You're basically asking someone to do the 80% of the work for you, i.e. most of the work.

http://radar.oreilly.com/2012/07/data-jujitsu.html
NBA.com does have the play by play data for 97-00. But I don't have the programming experience to convert the play by play into lineup data.

For example, here is a random Bulls-Jazz game in 1998.

http://stats.nba.com/stats/playbyplay?G ... dPeriod=10

I don't have the programming experience to make that into something usable.
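
For anyone wondering what "converting play by play into lineup data" involves, the core step can be sketched roughly as below: start from the five starters and swap players as substitution events come in. This is only a sketch under assumptions. The column names `player_out` and `player_in`, the function `track_lineups`, and the placeholder player labels are all made up, and the real NBA.com feed would first need its substitution rows parsed into this shape.

```r
# Walk a table of substitution events and record the five players on the
# floor after each one. Column names are hypothetical, not NBA.com's.
track_lineups <- function(starters, subs) {
  on_court <- starters
  lineups <- vector("list", nrow(subs))
  for (i in seq_len(nrow(subs))) {
    out_player <- subs$player_out[i]
    in_player <- subs$player_in[i]
    # A substitution for someone who is not on the floor means the data
    # (or the parsing) is wrong, so fail loudly instead of guessing
    if (!(out_player %in% on_court)) {
      stop("row ", i, ": ", out_player, " is not on the court")
    }
    on_court[on_court == out_player] <- in_player
    lineups[[i]] <- sort(on_court)
  }
  lineups
}

# Placeholder players and a made-up substitution sequence
starters <- c("P1", "P2", "P3", "P4", "P5")
subs <- data.frame(
  player_out = c("P5", "P1"),
  player_in = c("P6", "P7"),
  stringsAsFactors = FALSE
)
lineups <- track_lineups(starters, subs)
```

The check inside the loop is where most of the cleaning effort tends to surface: missing substitutions and period-start lineup changes would both trip it.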

Re: Talking about data acquisition

Posted: Mon May 06, 2013 1:58 am
by DSMok1
FWIW, that NBA 97-00 data will be really hard to use since it only has last names (not even a first initial).
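
One way to see how much of a problem this is would be to flag, for each roster, the last names shared by more than one player. A rough sketch in R, where the roster, its column names, and the function `ambiguous_last_names` are all invented for illustration:

```r
# Return the last names that belong to more than one player on a roster.
# The roster below is made up; a real one would come from a roster source.
ambiguous_last_names <- function(roster) {
  unique(roster$last_name[duplicated(roster$last_name)])
}

roster <- data.frame(
  first_name = c("Alan", "Brian", "Carl"),
  last_name = c("Williams", "Williams", "Stone"),
  stringsAsFactors = FALSE
)
shared <- ambiguous_last_names(roster)
```

Any play by play row that mentions one of the flagged names would need extra context (who was on the floor at that point, for example) before it can be assigned to a player.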

Re: Talking about data acquisition

Posted: Fri May 10, 2013 6:47 pm
by fpliii
EvanZ wrote:
AcrossTheCourt wrote:No, not blaming them for joining a team. The problem is once they do the website is toast. So you need a website with a large group of people or some system where certain roles can be filled once the person is gone.
You need someone like me who is happy with his current job and not looking to be hired by the NBA (although I can't say the opportunity hasn't been presented).

And if I did take a job, I'd make it a condition of being hired that the site would have to stay up.

FWIW, I'm actually planning to get the 90's data (as far back as I can go) this summer and put it on nbawowy. Look for it.
Love your site, thanks for your hard work. I'm definitely looking forward to having the complete PbP dataset available on wowy for analysis.

In general, I'd agree with the OP. I have a strong background in math/stats, and have done some programming (minor projects in Java, C++, C, but also some stuff in R, Matlab), but am by no means confident in handling large datasets with variable formats. I should probably set aside some time after this season and look at some of the threads on this forum (and reposted from the archives) to figure out the mechanics of data acquisition/handling.

Re: Talking about data acquisition

Posted: Sun May 12, 2013 10:55 am
by wilq
Speaking of data acquisition, I have a question: is it even legal to publish files with processed data from hundreds of pages? Recently someone asked me for all players' gamelogs, which I do have, but can I publish them in one downloadable file or would it be unfair to the sources? The second question is probably for Neil Paine or someone else who works at the source.
kpascual wrote:What is the biggest pain point with regards to data acquisition or manipulation? Is it that people don't have the raw data accessible (i.e. play by play)? Or are the tools to access the raw or processed data not flexible/powerful/easy enough? Or is it in processing the data so data can be compared reliably?

The contents of this thread suggest all three are problems to some degree, but which one is the most painful?
That probably depends entirely on individual skill-set so I don't expect there to be a general rule. Though I guess there are common stages: 1) is there data, and if not, can I create it, 2) can I process it and 3) can I use it/present it in a desirable way.

Re: Talking about data acquisition

Posted: Sun May 12, 2013 11:59 am
by DSMok1
wilq wrote:Speaking of data acquisition, I have a question: is it even legal to publish files with processed data from hundreds of pages? Recently someone asked me for all players' gamelogs, which I do have, but can I publish them in one downloadable file or would it be unfair to the sources? The second question is probably for Neil Paine or someone else who works at the source.
If you have not significantly altered the data in some way (value-added) then no, I don't think it's ethical--the website had to pay good money to get that same data. I don't think it's an issue to share with a few folks on request, though--just not publish it publicly.

If you HAVE significantly altered the data in some way--like you've calculated some stat with the data and are not publishing the original source data, no problem--that stat is yours.

It's a bit of a fuzzy topic legally. From what I understand, many/most sites claim more rights over their data in terms of use than they actually really could enforce legally. But if you want to be ethical, I would always ask the source site (i.e. Justin Kubatko).

Re: Talking about data acquisition

Posted: Sun May 12, 2013 12:46 pm
by PD123
wilq wrote:Speaking of data acquisition, I have a question: is it even legal to publish files with processed data from hundreds of pages? Recently someone asked me for all players' gamelogs, which I do have, but can I publish them in one downloadable file or would it be unfair to the sources?
IANAL (I say as I proceed to talk like I know what I'm talking about).

Legally speaking, you're more or less in the clear to publish it as long as it doesn't include relatively specialized statistics. People can't easily claim ownership of facts in a legal sense, and statistics are simply another way of recording things like who had the ball or who missed a shot. Where it could get a little fuzzy is if you published something like another person's rating metric, which, it could be argued, is more a statement of opinion than of fact. But the bottom line is that nobody has really ever successfully enforced any claim of ownership over generic and common sports statistics.

As for the ethical question, as much as I'd like to have easily downloadable sources of information, I'm compelled to agree with DSMok1 that posting information that you've gotten from someone else, without their permission and/or without wholesale changes, is probably not the most ethical thing to do.

Re: Talking about data acquisition

Posted: Sun May 12, 2013 1:32 pm
by wilq
DSMok1 wrote:If you have not significantly altered the data in some way (value-added) then no, I don't think it's ethical--the website had to pay good money to get that same data. I don't think it's an issue to share with a few folks on request, though--just not publish it publicly.
PD123 wrote:As for the ethical question, as much as I'd like to have easily downloadable sources of information, I'm compelled to agree with DSMok1 that posting information that you've gotten from someone else, without their permission and/or without wholesale changes, is probably not the most ethical thing to do.
I agree with your point of view... so maybe THIS is the biggest problem with talking about data acquisition? Even if someone has already collected the data it really doesn't help everybody else interested in the same data.

Frankly, I'm puzzled why there aren't more projects like this: http://www.basketballreference.com/stats_download.htm.
Isn't the business of "one page per player" totally different than "all pages in one file"? And aren't they targeted at different people? Am I wrong to think basketball-reference.com or nba.com could sell files such as "gamelogs from 1985-86 to today" or "playbyplay9798"? I don't know how profitable it would be but the cost of this idea is basically "only bandwidth" because they already have the data!

Re: Talking about data acquisition

Posted: Mon May 13, 2013 1:46 am
by DSMok1
wilq wrote:Am I wrong to think basketball-reference.com or nba.com could sell files such as "gamelogs from 1985-86 to today" or "playbyplay9798"? I don't know how profitable it would be but the cost of this idea is basically "only bandwidth" because they already have the data!
Basketball Reference buys their data from a provider, and it is, I believe, outside of their terms of use to provide the data for wholesale download (after all, that data is what the provider is selling!).

Re: Talking about data acquisition

Posted: Tue May 14, 2013 4:10 pm
by mystic
kpascual wrote: If it exists and you know where it is, why can't that someone be you? I'll quote DJ Patil in saying "80% of the work in any data project is in cleaning the data." You're basically asking someone to do the 80% of the work for you, i.e. most of the work.
Unfortunately, that is true. Even with some programming skills, it is still not easy to get a clean datafile. I learnt that lesson myself. My matchupfile is not good at all, and I really admire those who can get a matchupfile with a low error rate.


Regarding the data acquisition: I guess the only way to get freely accessible raw data would be a project in which multiple people work together in their spare time and a provider volunteers the necessary storage and download volume.