
[C++] HTTP GET Request (Sockets)


ShadowPho

Member
Joined
Jun 8, 2005
Location
I am in your stack, SUBbing your registers!
Right now I am trying to create a web spider. After all, if Google, Yahoo, AOL and even Microsoft are doing it, then why not me? :D
But I seem to have hit a wall. For some reason "GET / HTTP/1.0" works perfectly with woot.com and thedailywtf.com, but not with evga.com and google.com.

Maybe there is something wrong with my code? I am entering the same command for the second pair, but it still doesn't work with google :/

http://pastebin.com/m6dd6f2ca

Code:
#include <cstdlib>
#include <cstring>    // strlen
#include <iostream>
#include <cstdio>
#include <cmath>
#include <winsock2.h> // must come before windows.h, or the old winsock.h gets pulled in
#include <windows.h>

SOCKET sock;

using namespace std;

int main(int argc, char *argv[])
{
    if(argc>9999) cout <<argv[0];
    

    
    WSADATA wsaData;
    sockaddr_in addr;
    
    if(WSAStartup(0x101,&wsaData)!=0){printf("WSA FAIL");system("pause");return 0;}
      sock = socket(AF_INET, SOCK_STREAM, 0);
    if(sock==INVALID_SOCKET){printf("SOCK ERROR");system("pause");return 0;}  // socket() returns INVALID_SOCKET on failure, not SOCKET_ERROR
    
    char *temp = (char*) malloc(32000*sizeof(char));
    if(temp==NULL){cout<<"FAILED TO ALLOCATE 32MB of RAM."; system("pause");return 0;}
    printf("Please enter the website. (max 256 chars)\n");
    printf("Do not enter spaces or http.\n");
    printf("Ex: www.google.com \n");
    fgets(temp,256,stdin);   // BIG PROBLEM: FGETS STORES THE NEWLINE CHAR
   
   
   for(int i=0;i<255;i++)
   if(temp[i]=='\n')         // SO WE LOOK FOR THE \n AND TURN IT INTO THE END
   {temp[i]='\0';break;}          
   
    LPHOSTENT host;
    host = gethostbyname(temp);
    if(host==NULL)
    {printf("DNS failure: error %i on %s",WSAGetLastError(),temp);system("pause");return 0;}
    
      addr.sin_family = AF_INET;
      addr.sin_port = htons(80);
      addr.sin_addr = *((LPIN_ADDR)*host->h_addr_list);

  if(connect(sock,(LPSOCKADDR)&addr,sizeof(struct sockaddr))==SOCKET_ERROR)
  {printf("CRIT ERROR %i : Unable to connect to %s",WSAGetLastError(),temp);system("pause");return 0;}
  printf("Type in the command\n");
  fgets(temp,256,stdin);
  send(sock,temp,strlen(temp),0);
  
  int n = recv(sock,temp,32000-1,0);  // leave room for the terminator
  if(n>0) temp[n]='\0';               // recv() does not null-terminate
  else    temp[0]='\0';

  cout << temp;

  free(temp);  
  closesocket(sock);
  WSACleanup();
    system("PAUSE");
    return EXIT_SUCCESS;
}
 
Code:
    char *temp = (char*) malloc(32000*sizeof(char));
    if(temp==NULL){cout<<"FAILED TO ALLOCATE 32MB of RAM."; system("pause");return 0;}
Well, first of all, isn't 32000*sizeof(char) 32KB, not 32MB? Also, the thread title says this is a C++ program, but C++ prefers not to use malloc. Learn to use the "new" keyword (which was not available in C, but is in C++); it's much easier to use and easier to debug. Second, why are we using fgets()? I don't know what that does, but judging from the comment about the problem with it storing '\n' instead of '\0', it seems like it would be easier to use scanf() or even cin.

As for the actual problem... well, what is the problem? You say it doesn't work with google, but how does it not work? Does it say anything specific or does it just not do anything?
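For what it's worth, here is a sketch of how the input step could look with std::string and std::getline instead of malloc/fgets. This sidesteps the trailing-newline problem entirely (the function name is just for illustration):

```cpp
#include <istream>
#include <sstream>   // lets read_host work on string streams too, not just cin
#include <string>

// Read a hostname without the trailing-newline headache fgets has:
// std::getline discards the '\n' for us, and std::string grows as needed,
// so there is no fixed buffer to malloc/free and no 256-char limit.
std::string read_host(std::istream& in) {
    std::string host;
    std::getline(in, host);
    // strip a stray '\r' in case the input came through a Windows pipe
    if (!host.empty() && host.back() == '\r')
        host.pop_back();
    return host;
}
```

In main you would just call `std::string host = read_host(std::cin);` and pass `host.c_str()` to gethostbyname().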
 
Nice idea, I thought about trying this but had no idea where to start... I would be interested to see where this project goes...

Just a thought: does your program check the robots.txt file, to know whether the site allows web crawlers or not? If so, this could be stopping you crawling google... (I know it's a long shot, but...)

Good luck with the project.
 
I decided to go with a C++ approach, and now I am encapsulating it all in a C++ class. I am also using a std::string to hold the data.

asusradeon said:
Nice idea, I thought about trying this but had no idea where to start... I would be interested to see where this project goes...

Just a thought: does your program check the robots.txt file, to know whether the site allows web crawlers or not? If so, this could be stopping you crawling google... (I know it's a long shot, but...)

Good luck with the project.

Good news is that I figured out that Google wants HTTP requests to be.... complete. So I am almost there. If you are interested, I will post my header/C++ spider class. It is quite a simple class once you understand the HTTP standard.
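I don't know exactly which part Google objected to, but a "complete" request includes CRLF line endings, a Host header, and, crucially, the blank line that terminates the headers. Typing "GET / HTTP/1.0" through fgets sends a lone '\n' and no terminating blank line, which stricter servers won't answer. A sketch of building the full request (the function name is mine, not from the attached class):

```cpp
#include <string>

// Build a complete HTTP/1.0 GET request. HTTP requires "\r\n" line
// endings and an empty line after the last header; lenient servers
// may accept less, but google.com and friends will not.
std::string build_request(const std::string& host, const std::string& path) {
    return "GET " + path + " HTTP/1.0\r\n"
           "Host: " + host + "\r\n"
           "Connection: close\r\n"
           "\r\n";                  // empty line = end of headers
}
```

The resulting string can be handed straight to send() via `.c_str()` and `.size()`.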


EDIT: It's good to know that everything is progressing as it should be!


Code:
SOCK ERROR
CRIT ERROR 10038 : Unable to connect to www.google.com
send failed

(the rest of the paste was the executable's own string table -- "recv failed", "WSA FAIL", "Malloc failed.", gmon.out/monstartup symbols, basic_string::at -- not actual program output)
 
ShadowPho said:
If you are interested, I will post my header/C++ spider class. It is quite a simple class once you understand the HTTP standard.

Yeah, if you could, I'd like to see it...
 
asusradeon said:
Yeah, if you could, I'd like to see it...

Alright, here is the class for the actual conversation with the server. This is the second class I have ever tried (I just did it the C way before), and it feels... elegant!

I also included main to show how to use the class, and badmain to show how I attempted to start the web spider. Only look at badmain if you want to see how best to segfault the program. I won't have much time in the upcoming days to work on this, so I am just releasing it now.
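One thing worth handling inside such a class: recv() is not guaranteed to return the whole response in one call, so the read side normally loops until the peer closes the connection. Here is a sketch of that loop, factored so the I/O call is pluggable; the std::function parameter stands in for recv() on a connected socket (this is my illustration, not code from the attachment):

```cpp
#include <functional>
#include <string>

// Collect a whole response by calling read_fn until it reports EOF (0)
// or an error (<0). recv() on a connected TCP socket has exactly this
// contract, so in the real class read_fn would wrap
// recv(sock, buf, len, 0).
std::string read_all(const std::function<int(char*, int)>& read_fn) {
    std::string out;
    char buf[4096];
    for (;;) {
        int n = read_fn(buf, sizeof buf);
        if (n <= 0) break;          // 0 = connection closed, <0 = error
        out.append(buf, n);         // append exactly the bytes received
    }
    return out;
}
```

A single recv() into a 32000-byte buffer works for small pages but silently truncates anything larger, so a loop like this matters for a spider.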
 

Attachments

  • Spider.zip
    489.2 KB · Views: 375
Captain Newbie said:
By the way: debugging network apps: Wireshark is your friend :cool:
Already had it from the time I was *COUGH* debugging *COUGH* that M$ game.
Thanks for the suggestion! I will try that instead of logging all the packets.


If someone downloads the class, please remove my email from it :)
 
I think here you only check the first address returned from gethostbyname().
Better to iterate over them all.
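For context: gethostbyname() can return several A records in h_addr_list (a NULL-terminated array), and trying only `*host->h_addr_list` fails if that particular address happens to be unreachable. A sketch of walking the list until one connection succeeds; try_connect here stands in for "fill sockaddr_in with this address and call connect()" and is my own illustration:

```cpp
#include <functional>

// Walk a NULL-terminated address list (the shape of hostent::h_addr_list)
// and return the first entry that try_connect accepts, or nullptr if
// every address fails. In real code try_connect would copy the address
// into addr.sin_addr and attempt connect() on a fresh socket.
const char* connect_any(char** addr_list,
                        const std::function<bool(const char*)>& try_connect) {
    for (char** p = addr_list; p && *p; ++p)
        if (try_connect(*p))
            return *p;              // first address that accepted
    return nullptr;                 // all failed (or empty list)
}
```

This mirrors what modern code gets for free from getaddrinfo(), which hands back a linked list meant to be tried in order.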
 