
How can you download all files of just one single type from a web site?


c627627

c(n*199780) Senior Member
Joined
Feb 18, 2002
Let's say you had a basic web site that had 100 pages with each page displaying 5 embedded pictures hosted elsewhere.


How can you download only the .jpg files from all 100 pages and nothing else [so just the 500 .jpg files], preferably using open-source software?
 
A spider downloader. I have not used one in quite a while, but they will go down a certain number of "links" or levels, and can be set to download everything or filtered to download specific items.
The problem is that many sites where something like that was useful for a picture grab have since added JavaScript and other tricks to block robot downloading.

As each one became useless or problematic, I switched methods and used a simple mouse-and-keyboard macro instead, which goes page by page with a repeating pattern of mouse moves. Like GhostMouse, which will repeat the mouse moves after I do them the first time.

What can still work in freeware is "offline" browsing tools - IE's own offline browser and others - where you then sort out the Temporary Internet Files and DIY-filter the items you want out of there, or out of the pool set aside for offlining. In IE it was Tools/Synchronize, then set it up.

I think being nice means restricting the number of simultaneous downloads and the maximum speed, so that while scrounging around their site with a machine you don't pound it with requests and end up hurting it, or just get blocked either temporarily or permanently. There's no use acting like a hacker when you just want to avoid a few clicks.
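(For example, with a command-line downloader like wget - just a sketch, with a made-up site and made-up numbers - throttling looks like this:
wget -r -np -w 5 --limit-rate=200k http://example.web.com/page/1
Here -w 5 waits five seconds between retrievals and --limit-rate=200k caps the bandwidth, which keeps the crawl from hammering the server.)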
 
Yes - the pictures can be hidden like you point out, but using GhostMouse is an excellent idea. Plus I forgot about the old IE temp files.
 
That was the first hit on my search as well; however, the downloads when using it (as well as other downloaders) appear to be empty 0.1 MB files, I think for the reasons described here:

A spider downloader...
The problem is that many sites where something like that was useful for a picture grab have since added JavaScript and other tricks to block robot downloading.

As each one became useless or problematic...

So I don't think it works.
 
I will private message it to anyone who wants to take a shot at figuring it out. It is a straightforward page with a bunch of pictures, plus at the bottom of every page you have the standard two links to go back one page or forward one page.

The pictures themselves are hosted elsewhere and just embedded on each page.
 
As far as I can tell... nothing too strange is going on with the site in question. The only slightly tricky part is that the images are not in the same "directory" structure and are linked from another server. Plenty of options should work, including things like DownloadThemAll run on each page manually.

Anyway, using wget:
wget.exe -p -o c627627.txt <website> grabs everything required for that one page to display, including images.
wget.exe -r -p -np -o c627627.txt <website> goes through every page on that site and grabs everything required to display each page

Explanation of arguments:
-p = grab everything needed to display a page
-o c627627.txt = output log to file
-r = follow links (recursive, breadth-first) and grab whatever is there (and thus potentially far more than you intended)
-np = do not ascend to the parent "directory"; combined with the recursive crawl staying on the starting host by default, this keeps wget from wandering off to other sites.
For example, a recursive retrieval of http://example.web.com/page/1 would only visit pages that start with http://example.web.com/page/

Finally, for the site in question, images are downloaded, but have a really screwed up extension that you'll probably want to change.
 
Thank you. Extensions can be easily renamed using mass File Renaming programs.
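(If the mangled names turn out to be just the real file name with a ?query string tacked on the end - that is only a guess on my part at what the "screwed up extension" looks like - a one-liner in a Unix-style shell such as Git Bash, run inside the folder that holds the images, can strip it:
# assumes names like photo.jpg?XXXX - hypothetical, adjust to what you actually see on disk
for f in *.jpg\?*; do mv "$f" "${f%%\?*}"; done
It cuts each name at the first "?" and renames the file to what is left; a mass renaming program does the same thing with a GUI.)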


OK, so in this specific case you know about, I definitely want to use the /page/1 approach and have it just get the images from *.com/page/1 through *.com/page/99, for example.

Now, since we are not interested in displaying pages or anything else other than JPEGs, what is the command that would retrieve only the .jpeg and .jpg files (whatever extension they end up with) from addresses *.com/page/1 to *.com/page/99?
 
Yes, getting just the images should be theoretically possible, with the arguments:
-r -np -A jpg <webpage>
wget should parse all the HTML pages and keep only the jpg files. But something about the construction of the pages in question short-circuits this on the first page - not sure why (so much for "nothing too strange is going on"). Another monkey wrench in this approach is the fact that the file extensions are mangled. Hence the necessity of using -p.
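(Spelled out with a placeholder URL, and on a site that doesn't fight back, that approach would look like:
wget.exe -r -np -l2000 -w5 -A jpg,jpeg http://example.web.com/page/1
-A takes a comma-separated list of accepted suffixes, so jpg,jpeg covers both spellings; the HTML pages are still fetched so their links can be followed, but they get deleted afterwards since they don't match the accept list.)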

One could write an HTML-parsing script that looks for <img> tags and grabs the images that way (I did try HTTrack as well, and it has similar problems)...
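(A quick-and-dirty version of that idea - grep standing in for a real HTML parser, placeholder URL, Unix-style shell assumed - might be:
# pull one page, pick out the src of every <img> tag, then fetch those URLs with a polite pause
wget -q -O- http://example.web.com/page/1 | grep -o -i '<img[^>]*src="[^"]*"' | grep -o 'http[^"]*' > imgurls.txt
wget -w 5 -i imgurls.txt
It only catches absolute http/https URLs and misses anything inserted by JavaScript, so treat it as a starting point rather than a solution.)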

But otherwise, I'd just have to recommend -r -p -np -l2000 <website>/page/1. I forgot that the default depth is 5, thus the presence of -l2000.

What you will end up with is multiple folders, one for each domain that ends up being visited, with the associated files inside (the majority of the images will be under s3.amazonaws.com).
 
It appears to be working!
Much obliged.
 
Whoops, spoke too soon - it stopped after downloading about 100 MB. I believe some kind of protection against browsing too fast kicked in, which would have happened even with manual browsing if it were fast enough.


Clearly I need a command to make it:

download for X number of minutes
pause for X number of minutes
resume download from where it left off for X number of minutes.
pause again, etc.


 

Attachments: wget.jpg (33.6 KB)
All right, then we would have
wget.exe -r -p -np -l2000 -w180 *.com/page/1

What does that do? Pauses for 3 minutes after what, each download?



 
Try -w 3. Many times a three-second pause between downloads is all you need.
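(So the full line, reusing the thread's placeholder URL, would be something like:
wget.exe -r -p -np -l2000 -w3 *.com/page/1
and if the site keeps cutting the crawl off, wget's --random-wait option - which varies the pause around the -w value - might help disguise the pattern; that last part is my guess, not something tested against this site.)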


HTTrack is what I'd recommend if you can't get wget to work in the end. I've used it many times for grabbing websites or files (most recently when creating the redirects from the old front page to the new front page).
 
3 seconds resulted in only 25 MB more being downloaded.

I'll try longer. I hope it works; otherwise the time delay may not be the reason...
 
wget.exe -r -p -np -l2000 -w31 *.com/page/1

Appears to be working, but it is taking forever on account of the 31-second delay between each download.

I think I should have tested a value between 3 and 31 first, but :shrug:



EDIT: I interrupted the download with Ctrl+C to test a shorter pause.

I wanted to keep the already-downloaded files, so I tried the -nc (no-clobber) option, but it gave me "already there; not retrieving" for page 1 and then stopped.


I then tried the -m option
wget.exe -r -p -m -np -l2000 -w5 *.com/page/1

I don't really know what it does, but it appears to have continued. I may be wrong about -m, though -
how do you continue an interrupted download without overwriting existing files? -nc doesn't work. I hope I did the right thing by redoing it with -m and -w5 (to see if a 5-second pause works).


-m didn't do it. I'm trying -c next.


 
I aborted it, but for whoever is reading this - I think it's important to figure out how to resume a site download [not a file download].


I noticed that the folder GnuWin32\bin\*.com\page contains the page numbers of the download, and since pages 1-60 were already in there, I simply put in
wget.exe -r -p -np -l2000 -w5 *.com/page/61

and so now it is downloading pages 61, 62, 63 etc.


EDIT: Didn't work. Trying -w10

EDIT #2: 10 seconds works and so does 7 seconds.


Since 5 seconds did not work, the minimum is either 6 seconds [untested] or definitely 7 seconds [tested] between each individual download, and with that it works.
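(For anyone landing here later, the combination that ended up doing the job, again with the thread's placeholder URL, was:
wget.exe -r -p -np -l2000 -w7 *.com/page/1
i.e. recursive, page requisites included, no ascending to the parent, depth 2000, and a 7-second pause between downloads.)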
 