
C++ retrieve HTML page and parse


petteyg359

Likes Popcorn
Joined
Jul 31, 2004
I'm attempting to write a C++ program to retrieve a web page (or several) using curl. I can grab the source of a page with no problem, but I can't find a good example of how to parse pieces of it with regex or XPath. I've looked at Xalan/Xerces and QDom, but the Xalan/Xerces examples make no sense to me, all the Qt examples are GUI-based, and my gcc (4.3.4) refuses to build anything with Qt for some reason.
 
XML parsers tend to fail on all but the simplest pages, because real-world HTML is much less strict than XML. For example, my experience parsing HTML is through Python, and Python's built-in XML parser fails even on Google search results pages (BeautifulSoup is the go-to parser for HTML there).

http://htmlcxx.sourceforge.net/ perhaps? Doesn't appear to have been updated lately.

Perhaps run HTML Tidy first, then an XML parser that you know how to use?
http://tidy.sourceforge.net/
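To make the strictness point concrete, here is a minimal sketch of the kind of check that trips XML parsers on ordinary HTML. It is not a real XML parser: `tags_balanced` is a hypothetical helper invented for this example, and it only verifies that every opening tag is closed in order. HTML allows a bare `<br>`; XML wants `<br/>`, which is the sort of thing Tidy's XHTML output is meant to fix.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical strictness check, far simpler than a real XML parser.
// Returns true only if every opening tag is closed in the right order.
// Assumptions: no comments or CDATA, and no '>' inside attribute values.
bool tags_balanced(const std::string& markup)
{
    std::vector<std::string> open;  // stack of currently open element names
    std::size_t pos = 0;
    while ((pos = markup.find('<', pos)) != std::string::npos) {
        const std::size_t end = markup.find('>', pos);
        if (end == std::string::npos) return false;  // unterminated tag
        assert(end > pos);  // the search started on the '<' itself
        const std::string tag = markup.substr(pos + 1, end - pos - 1);
        pos = end + 1;
        if (tag.empty()) return false;                 // "<>" is not a tag
        if (tag[0] == '!' || tag[0] == '?') continue;  // DOCTYPE, <?xml ...?>
        if (tag[tag.size() - 1] == '/') continue;      // self-closing: <br/>
        if (tag[0] == '/') {
            // Closing tag: must match the most recently opened element.
            if (open.empty() || open.back() != tag.substr(1)) return false;
            open.pop_back();
        } else {
            // Opening tag: keep only the element name, drop attributes.
            open.push_back(tag.substr(0, tag.find_first_of(" \t\r\n")));
        }
    }
    return open.empty();
}
```

With this sketch, `tags_balanced("<p>one<br/>two</p>")` is true, while the HTML-style `tags_balanced("<p>one<br>two</p>")` is false because `</p>` arrives while `br` is still open.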
 
Have you looked at expat? There are a lot of examples floating around the net for XML parsing with it. I worked with someone who used a combination of that and strtok to parse HTML in C.

If you're interested in branching out a little and using Python, take a look at BeautifulSoup. It'd be much easier to parse with.
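For a rough idea of what the strtok half of that approach could look like, here is a small sketch in C++ using `std::strtok`. The function name `split_markup` is made up for this example, and this is a guess at the general technique, not the code that person actually used. It splits on `<` and `>` so that tags and text come out as separate tokens.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: splits markup on '<' and '>' with std::strtok.
// Limitations: it cannot tell a tag token from a text token, and it breaks
// if an attribute value contains '<' or '>'.
std::vector<std::string> split_markup(const std::string& markup)
{
    // strtok treats the first NUL as the end of the input.
    assert(markup.find('\0') == std::string::npos);

    // strtok writes into its input, so work on a NUL-terminated copy.
    std::vector<char> buffer(markup.begin(), markup.end());
    buffer.push_back('\0');

    std::vector<std::string> tokens;
    for (char* tok = std::strtok(buffer.data(), "<>");
         tok != nullptr;
         tok = std::strtok(nullptr, "<>")) {
        tokens.push_back(tok);
    }
    return tokens;
}
```

For the input `<p id="blah">BlahContent</p>` this yields three tokens: `p id="blah"`, `BlahContent`, and `/p`. Note that `std::strtok` keeps hidden global state, so it is not safe to call from several threads or to nest.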
 

The pages I'll be grabbing are mostly valid, but I'll definitely be running them through Tidy first. The problem is there isn't an XML parser that I know how to use. That's why I want examples, so I can see how to use one :) Something like
Code:
#include "someParser.hpp" // placeholder for whatever parser library fits
#include <iostream>
int main()
{
 // XHTML document to query
 const char* XHTML="<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"><html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title>Title</title></head><body><p id=\"blah\">BlahContent</p></body></html>";
 someParser P(XHTML);
 // Look up the <p> with id "blah" and print its text
 std::cout << P.p("blah").textContent << std::endl;
}
or something like that. I haven't figured out OO C++, so I don't really understand the examples that require fifty different functions in main.cpp.
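Since regex was mentioned as an option, here is a minimal sketch of a single free function that gets close to the `P.p("blah").textContent` idea without any classes. `text_of_p_with_id` is a hypothetical name, not part of any library. It assumes tidied markup (double-quoted attributes, no nested tags inside the `<p>`), an id made of plain letters and digits, and a C++11 compiler with a working `<regex>` (GCC 4.9 or later, so not gcc 4.3.4).

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: returns the text inside the first <p> element whose
// id attribute equals `id`, or an empty string if there is no such element.
// Assumptions: double-quoted attributes, no tags nested inside the <p>,
// and an id without regex metacharacters.
std::string text_of_p_with_id(const std::string& xhtml, const std::string& id)
{
    assert(!id.empty());  // an empty id cannot identify an element

    // Matches <p ... id="ID" ...>TEXT</p> and captures TEXT as group 1.
    // "<p\b" keeps the pattern from matching other tags such as <pre>.
    const std::regex pattern(
        "<p\\b[^>]*\\bid=\"" + id + "\"[^>]*>([^<]*)</p>");

    std::smatch match;
    if (std::regex_search(xhtml, match, pattern)) {
        return match[1].str();
    }
    return std::string();
}
```

Called on the XHTML string from the post, `text_of_p_with_id(XHTML, "blah")` returns `BlahContent`. A regex like this is brittle on anything but clean, predictable markup, which is why a real parser is still the better long-term answer.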


I don't know any Python at all. I know basic C++, and I can work from examples, which is why I'm attempting this with C++.
 

I would actually say learning Python from scratch is faster and easier than learning C++ from an amateur level, as Python is more forgiving.

I know both, and Python is more intuitive (but C is faster!).
 
py2exe bundles dependencies together and creates a single exe.
 