Hacked

Home for all your discussion of basketball statistical analysis.
Crow
Posts: 6246
Joined: Thu Apr 14, 2011 11:10 pm

Re: Hacked

Post by Crow » Mon Apr 18, 2011 6:41 pm

Thanks.

If you are trying to read an old thread and find it is missing a key graphic, mention it and I can try to retrieve it. I may also proactively backfill graphics in threads that I grabbed earlier, before I realized the graphs needed formatting, but only as I notice which ones need it and as time and interest allow.

Justin
Posts: 12
Joined: Sun Apr 17, 2011 6:25 pm
Location: Portland, OR

Re: Hacked

Post by Justin » Mon Apr 18, 2011 7:30 pm

Wow, this is heartbreaking! I've only lurked since I discovered this board a couple months ago, but I've been inspired by the quality of discussion. I feel if those discussions were left to fade into Internet Oblivion, a little bit of history would be lost. I want to be proactive and offer my help as a programmer.

A couple of obvious questions: are you sure they actually deleted the data from your database? If so, are you absolutely positive that neither you nor your host ever created any backups? Do you have SQL logs? It's possible to restore from those.

If there really are no backups or logs to restore from, then the most important thing is to recover as many pages as possible from internet caches before they are removed. I tried to run Warrick (http://warrick.cs.odu.edu/warrick.html) on Google results, but Google quickly banned my IP, so I wrote a script that pulls from the Wayback Machine instead. I'm in the process of downloading about 2200 pages. There is no guarantee they are unique or complete, but at least it's something. I'll upload them when the download completes.

Then it gets to the hard part. It'll be a challenge to restore the posts on this forum because the user ids have all changed since the upgrade. You could possibly create a bot user, post all the topics as that user, and just include the original author's name in the body of each post. You could automate this, but it is still a nontrivial task.

Let me know how I can help. Maybe you have other ideas.

DSMok1
Posts: 905
Joined: Thu Apr 14, 2011 11:18 pm
Location: Maine

Re: Hacked

Post by DSMok1 » Mon Apr 18, 2011 7:38 pm

Yeah, Justin, I discovered also that Google Webcache doesn't like mass downloading!

It is too bad that this new forum seems to have been installed over the old one--if it were placed elsewhere, there'd be a better chance of data recovery (this from my programmer brother).

Sounds like you know a lot more about this than I do.

Kevin, you could have asked around for some programmers' help before installing the new forum!
Developer of Box Plus/Minus
APBRmetrics Forum Administrator
GodismyJudgeOK.com/DStats/
Twitter.com/DSMok1

Crow
Posts: 6246
Joined: Thu Apr 14, 2011 11:10 pm

Re: Hacked

Post by Crow » Mon Apr 18, 2011 7:45 pm

Justin,

Thanks for looking at a mass recovery.

I'll pause my ad hoc work until you have a chance to make your data grab and you or admin or we can see what it has recovered.

My guess is that it will be single pages of threads, like the ones I am working with, and one would have to cut and paste them together. But maybe you can run a script on titles or URLs?

Unless sorting can be automated, having 2200 pages at once to sort through each time might be part help and part hindrance.

If the name of the thread (or its subject or author) is known, one can run a Google cache query for that name and then do the cut, sort and paste on only a small set of pages.
Last edited by Crow on Mon Apr 18, 2011 8:28 pm, edited 1 time in total.

Justin
Posts: 12
Joined: Sun Apr 17, 2011 6:25 pm
Location: Portland, OR

Re: Hacked

Post by Justin » Mon Apr 18, 2011 8:18 pm

Happy to do what I can to help.

Here is a topic list of what I recovered - http://dl.dropbox.com/u/602885/sonicscentral-topics.txt
Here is the actual data (30mb) - http://dl.dropbox.com/u/602885/sonicscentral.tar.gz

Unfortunately, the Wayback Machine only had about 430 unique topics. It looks like each file is one page of a single thread, so if a thread had more than one page, each additional page is in a separate file. I can automate either one of the processes below with one known caveat: the BBCode won't be retained because it has already been transformed to HTML.

The easiest way is to create a single bot user that posts an entire thread into one post. The slightly more difficult way is to create a bot user that reconstructs threads by individual post.
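
Either approach first needs the saved page files grouped by thread and put in reading order. Here is a minimal sketch of that step, assuming each file name keeps the original `viewtopic.php?t=...&start=...` query string. The helper names `page_key` and `group_pages` are made up for illustration and are not taken from the actual recovery script. Pages that share the same topic id and offset are treated as duplicates, even when the query parameters appear in a different order.

```ruby
require "uri"

# Hypothetical helper: pull the topic id (t) and post offset (start) out of a
# saved page name such as "viewtopic.php?t=877&start=15&postdays=0".
# A missing start= parameter is treated as offset 0 (the first page).
def page_key(path)
  query = path.split("?", 2)[1].to_s
  params = URI.decode_www_form(query).to_h
  [params["t"].to_i, params.fetch("start", "0").to_i]
end

# Group page files by topic id, drop duplicates that share the same
# topic and offset, and return each topic's pages in reading order.
def group_pages(paths)
  paths
    .uniq { |path| page_key(path) }
    .group_by { |path| page_key(path).first }
    .transform_values { |pages| pages.sort_by { |path| page_key(path).last } }
end
```

The result is a hash from topic id to an ordered list of page files, which could then be concatenated into one post per thread or split back into individual posts.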
Last edited by Justin on Mon Apr 18, 2011 8:35 pm, edited 1 time in total.

EvanZ
Posts: 912
Joined: Thu Apr 14, 2011 10:41 pm
Location: The City

Re: Hacked

Post by EvanZ » Mon Apr 18, 2011 8:23 pm

At a minimum, we should probably have a sticky post with that zipped file, so people can search through it.

Crow
Posts: 6246
Joined: Thu Apr 14, 2011 11:10 pm

Re: Hacked

Post by Crow » Mon Apr 18, 2011 8:35 pm

Entire thread into one post by bot user seems fine to me.

I'd say you make the call.

Duplicate pages may be an issue, to be dealt with either in advance or after the fact.

I wouldn't mess with user IDs or try to match them up, as many old users are not here at this time.

Justin
Posts: 12
Joined: Sun Apr 17, 2011 6:25 pm
Location: Portland, OR

Re: Hacked

Post by Justin » Mon Apr 18, 2011 8:43 pm

I'll go with the single post method and see how that works out. Threads might show up and then get deleted while I work out the kinks. I'll do a trial run and update this thread.

Crow
Posts: 6246
Joined: Thu Apr 14, 2011 11:10 pm

Re: Hacked

Post by Crow » Mon Apr 18, 2011 9:08 pm

Sounds good.


There were 2000+ threads in the history of the forum on this platform (100+ pages * 20 threads per page). (There was an earlier Yahoo group for years with thousands of threads as well.)

If your mass recovery goes well that would put us at about 30% recovered from this epoch.

Graphics may require special treatment or after-the-fact recovery from the file you posted.

DSMok1
Posts: 905
Joined: Thu Apr 14, 2011 11:18 pm
Location: Maine

Re: Hacked

Post by DSMok1 » Mon Apr 18, 2011 9:10 pm

If I remember correctly, there were some 60-65 pages in the previous forum at, I believe, 20 threads/page. So something like 1200-1300 threads.

Google has a bunch more threads cached, I think. It's just a matter of setting up Warrick to scrape at intervals rather than all at once? I don't know how the Google webcache determines inappropriate activity.

Crow
Posts: 6246
Joined: Thu Apr 14, 2011 11:10 pm

Re: Hacked

Post by Crow » Mon Apr 18, 2011 9:17 pm

The lost forum definitely had 100+ pages as I noticed when it went over.

If Google, Yahoo or Bing can be scraped incrementally for even more threads, that would be good.

Google cache has 5,200 items listed at the moment, though not all are thread pages. Some are index pages, search, member profiles, etc. and there is plenty of duplication.

Threads recovered from the Wayback Machine archive will mostly or entirely date from at least 18 months ago.

Justin
Posts: 12
Joined: Sun Apr 17, 2011 6:25 pm
Location: Portland, OR

Re: Hacked

Post by Justin » Mon Apr 18, 2011 9:25 pm

Google's cache would be the ideal place to recover the most topics, but I'm not sure it even let me get to 100 requests before it blacklisted me. I didn't want to incur the full wrath of Google and be permanently blacklisted, so I stopped the script. I mean, how else will I find cat pictures if that were to happen?!? :D

Here are the warrick docs regarding Google:
Warrick accesses cached pages by scraping results from http://www.google.com. Be careful when running Warrick: Google monitors traffic through http://www.google.com, and if they suspect you are making automated requests, they will "blacklist" your IP address and will not respond to queries for as long as 12 hours. If Warrick detects that it has been blacklisted, it will sleep for 12 hours and then pick up where it left off. In my experiments, Google has detected me after about 100-150 requests. We cannot be held responsible if Google blacklists your IP address.

To avoid using Google, use the switch "-wr ia,y,b" which tells Warrick to use Internet Archive, Yahoo, and Bing only.
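
For anyone who wants to try scraping at intervals rather than all at once, here is a minimal sketch of a throttled fetch loop. Everything in it is an assumption for illustration: `fetch_politely` is a made-up name, and the actual HTTP call is passed in as `fetch` rather than shown. The request cap and delay are guesses that the caller has to pick, and nothing here is known to keep Google from blacklisting an IP.

```ruby
# Fetch URLs one at a time, pausing between requests and stopping at a
# per-run cap. `fetch` and `pause` are injected so that the real HTTP call
# and the real sleep can be swapped out, for example in a dry run.
def fetch_politely(urls, max_requests:, delay_seconds:, fetch:, pause: ->(s) { sleep(s) })
  results = {}
  urls.first(max_requests).each_with_index do |url, index|
    pause.call(delay_seconds) if index > 0 # no pause before the first request
    results[url] = fetch.call(url)
  end
  results
end
```

A run might use something like `max_requests: 50` and `delay_seconds: 60` to stay well under the 100-150 requests mentioned in the docs above, then pick up the remaining URLs in a later run.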

Crow
Posts: 6246
Joined: Thu Apr 14, 2011 11:10 pm

Re: Hacked

Post by Crow » Mon Apr 18, 2011 9:30 pm

Do what you are comfortable with and have the time for.

If multiple users run Warrick (if anyone else is comfortable with the software & process), can the years or months be divided in some fashion to avoid duplicate work?

Justin
Posts: 12
Joined: Sun Apr 17, 2011 6:25 pm
Location: Portland, OR

Re: Hacked

Post by Justin » Mon Apr 18, 2011 10:35 pm

It looks like Warrick only supports time spans when using the Internet Archive, so I don't think we could divide the work that way. I'm still banned from Google's cache, but if anyone is familiar with Ruby (the programming language) and with running a script, I could write one that fetches a given range of pages.

On a related note, there is an issue with missing pages in the data I was able to get from the Internet Archive. In the listing below, for example, pages 1-4 are present, but then it skips all the way to page 10.

Code:

"Wins Produced - Wages of Wins (Berri, Schmidt, and Brook)"
{
    :"11" => "/Users/justin/dev/lrr/ruby/ia_scrape/sonicscentral.com/viewtopic.php?t=877&start=150&postdays=0&postorder=asc&highlight=",
    :"12" => "/Users/justin/dev/lrr/ruby/ia_scrape/sonicscentral.com/viewtopic.php?t=877&start=165&postdays=0&postorder=asc&highlight=",
    :"13" => "/Users/justin/dev/lrr/ruby/ia_scrape/sonicscentral.com/viewtopic.php?t=877&start=180&postdays=0&postorder=asc&highlight=",
     :"1" => "/Users/justin/dev/lrr/ruby/ia_scrape/sonicscentral.com/viewtopic.php?t=877&start=0&postdays=0&postorder=asc&highlight=",
     :"2" => "/Users/justin/dev/lrr/ruby/ia_scrape/sonicscentral.com/viewtopic.php?t=877&start=15&postdays=0&postorder=asc&highlight=",
     :"3" => "/Users/justin/dev/lrr/ruby/ia_scrape/sonicscentral.com/viewtopic.php?t=877&start=30&postdays=0&postorder=asc&highlight=",
    :"10" => "/Users/justin/dev/lrr/ruby/ia_scrape/sonicscentral.com/viewtopic.php?t=877&postdays=0&postorder=asc&start=135",
     :"4" => "/Users/justin/dev/lrr/ruby/ia_scrape/sonicscentral.com/viewtopic.php?t=877&postdays=0&postorder=asc&start=45"
}
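
The gaps can also be found automatically from the `start=` offsets. Here is a minimal sketch, assuming 15 posts per page (inferred from the offsets above stepping by 15) and using made-up helper names `page_number` and `missing_pages`.

```ruby
require "uri"

# Assumption: 15 posts per page, inferred from start= values of 0, 15, 30, ...
POSTS_PER_PAGE = 15

# Convert a saved page path into a 1-based page number via its start= offset.
# A missing start= parameter is treated as the first page.
def page_number(path)
  query = path.split("?", 2)[1].to_s
  start = URI.decode_www_form(query).to_h.fetch("start", "0").to_i
  start / POSTS_PER_PAGE + 1
end

# Return the page numbers missing between page 1 and the highest page found.
def missing_pages(paths)
  found = paths.map { |path| page_number(path) }
  return [] if found.empty?
  (1..found.max).to_a - found
end
```

For the eight files listed above (offsets 0, 15, 30, 45, 135, 150, 165 and 180), `missing_pages` returns `[5, 6, 7, 8, 9]`. It cannot tell whether the thread originally continued past the last recovered page.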

