Parsing play-by-play data
Posted: Sun Nov 08, 2015 6:21 am
Hi, I have been working on basketball analytics for quite a while. Some of the data I need has to be parsed directly from the play-by-play.
A few people have written code to extract the play-by-play from statsnba.com or ESPN, but I noticed that they do not usually include tools to parse the play-by-play into a usable CSV for statistics. So I have been implementing my own parser to do that job, and I hope to share the codebase as open source with this community to make the process easier.
I noticed that different sources use different formats for the pbp, so what people usually do is write regular expressions for each site, which I believe can be tedious. Furthermore, there may be some outliers. I hope to implement a universal parser that can be quickly adapted to different websites. To do that, I did some very simple natural language processing and tokenization of the text, and after that I will classify the plays via machine learning.
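To illustrate the idea of tokenizing and then classifying, here is a minimal sketch in Python. It is not the actual parser code: the event labels, the example lines, and the choice of a small Naive Bayes classifier are all assumptions made for illustration, and a real version would train on many more lines.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(line):
    """Lowercase a play-by-play line, split it into word and number tokens,
    and replace every number with a <num> placeholder."""
    tokens = re.findall(r"[a-z]+|\d+", line.lower())
    return ["<num>" if t.isdigit() else t for t in tokens]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.label_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, lines, labels):
        for line, label in zip(lines, labels):
            self.label_counts[label] += 1
            for tok in tokenize(line):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, line):
        total = sum(self.label_counts.values())
        tokens = tokenize(line)
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(count / total)
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up lines and hypothetical labels, only to show the shape of the data.
train = [
    ("Smith makes 3-pt jump shot from 25 ft", "made_shot"),
    ("Jones makes driving layup", "made_shot"),
    ("Brown misses 18-foot jumper", "missed_shot"),
    ("Davis misses 3-pt jump shot", "missed_shot"),
    ("Defensive rebound by Smith", "rebound"),
    ("Jones offensive rebound", "rebound"),
    ("Turnover by Brown (bad pass)", "turnover"),
    ("Davis lost ball turnover", "turnover"),
]

model = NaiveBayes().fit([t for t, _ in train], [l for _, l in train])
print(model.predict("Miller misses 20-foot jumper"))  # missed_shot
print(model.predict("Offensive rebound by Miller"))   # rebound
```

The point of the sketch is that nothing in it depends on the word order or punctuation of a particular site, so in principle the same code could be retrained on labeled lines from another source instead of rewriting regular expressions.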
It works okay at the moment, but I definitely need some help. To assess the reliability of the parser and to finish it, I need prepared (labeled) data. I noticed that NBAStuffer has the kind of data I want for training the parser. But to complete the parser for websites such as ESPN, I will probably need someone to manually prepare data in a format similar to NBAStuffer's. I am not sure whether someone already has it.
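For the reliability check, one simple option (a sketch under my own assumptions, with hypothetical label names) is to compare the parser's predicted event type for each line against a hand-prepared reference label and report accuracy overall and per event type:

```python
from collections import Counter


def accuracy_by_label(predicted, reference):
    """Return (overall accuracy, {label: accuracy}) for two equal-length
    lists of event labels, where `reference` holds the hand-prepared labels."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    total = Counter(reference)
    correct = Counter(ref for pred, ref in zip(predicted, reference) if pred == ref)
    overall = sum(correct.values()) / len(reference) if reference else 0.0
    per_label = {label: correct[label] / n for label, n in total.items()}
    return overall, per_label


# Hypothetical labels for four lines; one missed shot is misclassified.
reference = ["made_shot", "missed_shot", "rebound", "rebound"]
predicted = ["made_shot", "made_shot", "rebound", "rebound"]
overall, per_label = accuracy_by_label(predicted, reference)
print(overall)    # 0.75
print(per_label)  # {'made_shot': 1.0, 'missed_shot': 0.0, 'rebound': 1.0}
```

The per-label numbers matter because rare event types can be badly parsed even when the overall accuracy looks fine.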
Does anyone have any idea how I should proceed from here?