On a related note - I am currently compiling the last 15 seasons of NCAA player data - and SOME/MUCH of the data acquisition is actually even copy/paste team by team.
mystic wrote: Unfortunately, that is true. Even with some programming skills, it is still not easy to get a clean datafile. I learnt that lesson myself. My matchupfile is not good at all, and I really admire those who can get a matchupfile with a low error rate.
kpascual wrote: If it exists and you know where it is, why can't that someone be you? I'll quote DJ Patil in saying "80% of the work in any data project is in cleaning the data." You're basically asking someone to do the 80% of the work for you, i.e. most of the work.
Regarding the data acquisition: I guess the only way to get freely accessible raw data would be a project in which multiple people work together in their spare time and a provider volunteers the necessary storage and download volume.
"Cleaning" the data is a MASSIVE endeavor - I cross reference cbb-reference, ESPN, and statsheet with team totals to attempt to get the most accurate data - sometimes all three sites disagree with each other (albeit often slightly) on a player's stats, as well as player class (Fr, So, Jr, Sr). It's taking forever - which is why I have set everything aside and not made any updates to my site or done any NBA playoff stuff (or posted here). It's like I missed the playoffs and already am preparing for the NBA draft - well, that is exactly what I'm doing.
I am PRAYING to be done early enough before the NBA draft to do all the pre-draft workups, possibly with NBA career projections, before the draft takes place. No matter when I'm done - eventually I'll be offering NBA career projections for current NBA players and college players entering the NBA.
That being said, when I'm finally done - I'm not sure if I can offer all my 15 seasons of raw college data to the public - since I didn't buy it. I fixed it - it won't exactly match the data from ANY site (complete data for the 1999 and 2000 seasons can't even be found anywhere - I often even have to use the Wayback Machine). Well, I'm hoping to make the full data available to the public someday, since I'll have it, and it's not found anywhere.
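As a rough illustration of the cross-referencing described above (not the actual workflow used here), the following Python sketch picks one value for a stat from several sources. The function name `reconcile`, the source labels, and the numbers are all hypothetical. The rule it applies is an assumption too: take the value at least two sources agree on, otherwise fall back to the median and flag the stat for a manual check.

```python
from collections import Counter
from statistics import median


def reconcile(values_by_source):
    """Pick one value for a stat reported by several sources.

    values_by_source maps a source label to the number it reports.
    Returns (value, needs_review). needs_review is True when no two
    sources agree, so the median is only a placeholder guess.
    """
    values = list(values_by_source.values())
    most_common, count = Counter(values).most_common(1)[0]
    if count >= 2:
        # At least two sources agree: accept their value.
        return most_common, False
    # Every source disagrees (or only one exists): use the median, flag it.
    return median(values), True


# Made-up numbers, purely for illustration.
agreed = reconcile({"site_a": 212, "site_b": 212, "site_c": 210})
disputed = reconcile({"site_a": 101, "site_b": 99, "site_c": 104})
print(agreed, disputed)
```

Anything that comes back flagged would still need a human to look at box scores or team totals; the sketch only narrows down where to look.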
I also will be ATTEMPTING to "fix" older NBA seasons as best I can - since I'll be doing NBA player ratings (completely adjusted for era and for the missing data of that era - turnovers, O & D rebounds, etc) all the way back to when player minutes were first compiled (early 50s). The NBA team totals often don't match the compiled player stats - so I have to cross-reference multiple sites along with my old basketball encyclopedias and Total Basketball encyclopedias for the best guess at good player data. Maybe I'll be able to someday make that data available to everyone also - I don't really know.
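A mismatch between team totals and compiled player stats can at least be located automatically. The sketch below is a hypothetical Python helper (the name `team_total_gaps`, the stat keys, and the numbers are assumptions, not taken from any site) that sums each stat over a team's players and reports the difference from the published team total.

```python
def team_total_gaps(player_rows, team_totals):
    """Return {stat: team_total - sum_of_player_values} for mismatches.

    player_rows is a list of dicts, one per player. A stat missing from
    a player's row counts as 0, so stats that were not recorded in a
    given era will show up as a gap equal to the full team total.
    """
    gaps = {}
    for stat, total in team_totals.items():
        summed = sum(row.get(stat, 0) for row in player_rows)
        if summed != total:
            gaps[stat] = total - summed
    return gaps


# Made-up numbers, purely for illustration.
players = [{"pts": 500, "reb": 200}, {"pts": 300, "reb": 150}]
totals = {"pts": 800, "reb": 355}
gaps = team_total_gaps(players, totals)
print(gaps)
```

A positive gap means the team total is larger than the players' sum, which is where a second or third source would need to be consulted.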
Well, we'll see. First things first.