
how could I hack this?


Christoph

JAPH Senior
Joined
Oct 8, 2001
Location
Redmond, WA
(By hack, I mean program, of course.)
I'd like to write a script that looks at a certain site and downloads any MP3 from it that I don't already have.
Among other things, I need a piece of code that will turn
2000-09-14 <b>Mega Man 3</b> '<a href='detailmix.php?mixid=OCR00115'>It's Boss Time</a>'<br>
into Mega_Man_3_Its_Boss_Time_OC_ReMix and that can put the year and month into $YEAR and $MONTH. I don't want people doing my thinking for me, but I'm not sure of the best tools to use. What should I use, and where can I find a good tutorial on it?
 
If you didn't need the date info in the filename, I'd suggest mixing wget and the standard shell scripting tools like sed and awk. Wget has nice features for stripping urls out of pages and manipulating them and then downloading them.
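A rough sketch of the "only fetch what I don't already have" part, assuming the expected filenames arrive one per line on stdin. The names `missing_files`, `names.txt`, and `BASE_URL` are hypothetical placeholders, not anything from the site in question, and the wget call is left commented out because the real download URLs aren't given here.

```shell
#!/bin/sh
# Print every name read from stdin that does not yet exist in directory $1.
missing_files() {
    dir=$1
    while IFS= read -r name; do
        [ -n "$name" ] || continue                    # skip blank lines
        [ -e "$dir/$name" ] || printf '%s\n' "$name"  # report only absent files
    done
}

# Hypothetical usage (BASE_URL and names.txt are assumptions):
# missing_files "$HOME/mp3" < names.txt | while IFS= read -r f; do
#     wget -q -O "$HOME/mp3/$f" "$BASE_URL/$f"
# done
```

Keeping the "what is missing" check separate from the download step makes it easy to dry-run the list before letting wget loose.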
 
Perl and LWP::UserAgent would work perfectly for this. Try the lwpcook. There are many good HTML modules that may or may not make parsing the particular page you're thinking about easier. Probably a well-crafted regex would suffice from which you'd create file names, compare with the local file system, and download any that weren't already present.
 
I know that perl is great for this kind of thing, but I also don't know perl and I don't want to learn a tiny subset of it just for this script. However, I did discover that my Linux book has a section on awk and an appendix on regular expressions. From what I've read so far, this won't be too hard to do. Thanks for the recommendations!
 
Lately I've been writing all my scripts in PHP. I like it because it allows me to be sloppy. hehehe


It's pretty simple. I would just have the script look at 2 or 3 characters of the string at a time, and when it finds a key pattern that acts as a marker, do something about it.

For example, if every title looks like <b>your title</b>, just look at 3 chars at a time, and once it finds <b>, start dumping all the characters into a different variable until it gets to </b>.
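The same marker idea can be sketched in plain shell. This is not PHP and it doesn't walk the string a few characters at a time; it swaps in the shell's prefix/suffix stripping to cut at the <b> and </b> markers. The function name `between_bold` is made up for illustration.

```shell
#!/bin/sh
# Print the text between the first <b> and the next </b> in $1.
between_bold() {
    rest=${1#*"<b>"}                   # drop everything up to and including the first <b>
    [ "$rest" != "$1" ] || return 1    # nothing was stripped, so there was no <b>
    printf '%s\n' "${rest%%"</b>"*}"   # keep what precedes the first </b> (all of it if none)
}

line="2000-09-14 <b>Mega Man 3</b> '<a href='detailmix.php?mixid=OCR00115'>It's Boss Time</a>'<br>"
between_bold "$line"    # prints: Mega Man 3
```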

If you need more help I'm willing.

**edit**

I'd also use a combination of fopen("site", "r") and system("wget blah") to retrieve the files.
 
System scripting in php just strikes me as wrong for some reason. I know php pretty well, use it for web stuff, but system scripting with it makes baby jesus cry!
 
No way, php is a very powerful language. It's more like perl than anything.

Just put #!/usr/bin/php4 -q at the top of the script and it's no different than running perl.

Writing an mp3 player in a low level programming language is what makes baby jebus cry.
 
As it turns out, all I needed to filter the HTML code to the filenames was two simple commands.
echo "{if (\$2 != \"\") printf(\"%s%s%s\",\$1,\"\'N_\",\$2); if (\$2 != \"\") print \"\"; else print \$0;}">prog
awk -F'<[^>]*>' '/200[0-3]-[0-9]*/ {if ($1 == "") print $3 " " $5 ; if ($1 != "") print $2 " " $4; }' $1 | awk '{for(a=1;a<NF;a++) printf("%s_",$a);print $a;}' | awk -F'[.:]' '{for(a=1;a<=NF;a++) printf("%s",$a);print "_OC_ReMix.mp3"}' | awk -F'[^a-zA-Z]n_' -f prog

where $1 is the file. It could be one line if bash would let me deal with filenames containing "'n ". Alternately, it could be several lines of read-write code (as opposed to write-only), but I'll save that for debugging. For now, I'm just happy that it works on everything I've thrown at it.
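For anyone puzzling over the -F'<[^>]*>' part: it tells awk to treat every HTML tag as a field separator, so the text between tags lands in numbered fields. Here is a small sketch using the sample line from the first post (the function name `title_fields` is only for illustration):

```shell
#!/bin/sh
# Split a line on HTML tags and print the game title and mix title.
# For a line that starts with the date, the fields come out as:
#   $1 = "2000-09-14 "   $2 = game title   $3 = " '"   $4 = mix title   $5 = "'"
title_fields() {
    printf '%s\n' "$1" | awk -F'<[^>]*>' '{ print $2 " " $4 }'
}

line="2000-09-14 <b>Mega Man 3</b> '<a href='detailmix.php?mixid=OCR00115'>It's Boss Time</a>'<br>"
title_fields "$line"    # prints: Mega Man 3 It's Boss Time
```

When a line begins with a tag instead of the date, $1 is empty and everything shifts over by one field, which is what the `$1 == ""` test in the command above accounts for.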

Getting the dates is even easier:
awk -F'<[^>]*>' '/200[0-3]-[0-9]*/ {if ($1 == "") print $2 ; else print $1; }' $1
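Once the date is in hand, getting it into $YEAR and $MONTH takes only a little shell string trimming. A minimal sketch, assuming the date always looks like YYYY-MM-DD (`split_date` is a hypothetical helper name):

```shell
#!/bin/sh
# Split a YYYY-MM-DD date into the YEAR and MONTH variables.
split_date() {
    YEAR=${1%%-*}        # everything before the first '-'
    rest=${1#*-}         # everything after the first '-'
    MONTH=${rest%%-*}    # everything before the next '-'
}

split_date "2000-09-14"
echo "$YEAR $MONTH"      # prints: 2000 09
```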

BTW, here is an example of what I'm parsing.

Edit: I'm going to go insane now. I just found two more special cases (one of which is inconsistent within the site). I guess this is what it's like working with an artist.

edit2: I'm glad I decided to look into sed. It's made my code *much* more readable and less hackish. Happily, I also figured out a way to avoid all those nasty special cases, which allowed me to write a functional beta!
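The sed version itself isn't posted, so the following is only a guess at what such a pipeline could look like, not the actual script. It handles the sample line from the first post and makes no attempt at the site's special cases. The function name `to_filename` is hypothetical.

```shell
#!/bin/sh
# Turn one listing line into a name like Mega_Man_3_Its_Boss_Time_OC_ReMix.
to_filename() {
    printf '%s\n' "$1" |
    sed -e 's/^[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\} *//' \
        -e 's/<[^>]*>//g' \
        -e 's/[^A-Za-z0-9 ]//g' \
        -e 's/^ *//' -e 's/ *$//' \
        -e 's/  */_/g' \
        -e 's/$/_OC_ReMix/'
}
# Steps, in order: drop the leading date, strip HTML tags, drop punctuation,
# trim leading/trailing spaces, turn runs of spaces into '_', add the suffix.

line="2000-09-14 <b>Mega Man 3</b> '<a href='detailmix.php?mixid=OCR00115'>It's Boss Time</a>'<br>"
to_filename "$line"    # prints: Mega_Man_3_Its_Boss_Time_OC_ReMix
```

One expression per step keeps it read-write rather than write-only: each -e can be commented out or reordered while debugging.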
 