C++ retrieve HTML page and parse

petteyg359

Likes Popcorn
Joined
Jul 31, 2004
I'm attempting to write a C++ program to retrieve a web page (or several) using curl. I can grab the source of a page with no problem, but I can't find a good example of how to parse pieces of it with regex or XPath. I've looked at Xalan/Xerces and QDom, but the Xalan/Xerces examples make no sense to me, all the Qt examples are GUI-based, and my gcc (4.3.4) refuses to build anything with Qt for some reason.
 

Omsion

Member
Joined
Mar 6, 2006
XML parsers tend to fail on all but the simplest pages, because real-world HTML is much less strict than XML. For example, my experience parsing HTML is through Python, and Python's built-in XML parser fails even on Google search results pages (BeautifulSoup is the go-to parser for HTML there).

http://htmlcxx.sourceforge.net/ perhaps? Doesn't appear to have been updated lately.

Perhaps run the page through HTML Tidy first, then use an XML parser that you know how to use?
http://tidy.sourceforge.net/
 

da_spork

Member
Joined
Jul 31, 2004
Location
Rolla, Missouri
Have you looked at expat? There are a lot of examples floating around the net for XML parsing with it. I worked with someone who used a combination of that and strtok to parse HTML in C.

If you're interested in branching out a little and using Python, take a look at BeautifulSoup. It'd be much easier to parse with.
 
petteyg359

Likes Popcorn
Joined
Jul 31, 2004
The pages I'll be grabbing are mostly valid, but I'll definitely be running them through Tidy first. The problem is there isn't an XML parser that I know how to use. That's why I want examples, so I can see how to use one :) Something like
Code:
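#include "someParser.hpp"  // hypothetical header standing in for a real parser library
#include <iostream>

int main()
{
    // A small, well-formed XHTML document to parse.
    const char* XHTML = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"><html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title>Title</title></head><body><p id=\"blah\">BlahContent</p></body></html>";
    someParser P(XHTML);
    // Look up the <p> element with id "blah" and print its text.
    std::cout << P.p("blah").textContent << std::endl;
}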
or something like that. I haven't figured out OO C++, so I don't really understand the examples that require fifty different functions in main.cpp.

I don't know any Python at all. I know basic C++, and I can work from examples, which is why I'm attempting this with C++.
 

bLack0ut

Member
Joined
Dec 21, 2004
I would actually say learning Python from scratch is faster and easier than learning C++ from an amateur level, as Python is more forgiving.

I know both, and Python is more intuitive (but C is faster!).
 

Omsion

Member
Joined
Mar 6, 2006
py2exe bundles dependencies together and creates a single exe.
 