
Gathering web site data. JAVA


Xantom

Member
Joined
Apr 2, 2007
Location
MN
I am attempting to build a fantasy football web site, but I am not having much luck finding a way to extract football stats from another web site. I can pull in a page and get to the point where the QB stats start. My question: is there a way to strip all the HTML tags and be left with just the players' names and numbers (stats)? I skip down 366 lines to get to the example below. I was thinking of tokenizing with > as the delimiter, then tokenizing again with < as the delimiter to strip away the rest of the tag. Any suggestions are much appreciated, thanks.

CODE
---------------------------------------------------------------------
import java.net.*;
import java.io.*;

public class WebRipper
{

    public static void main(String[] argv)
    {
        int count = 0;
        try
        {
            // Create a URL for the desired page
            URL url = new URL("http://www.fftoday.com/stats/playerstats.php?Season=2006&GameWeek=1&PosID=10&LeagueID=1");
            // Read all the text returned by the server; the reader is closed automatically
            try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream())))
            {
                String str;
                while ((str = in.readLine()) != null)
                {
                    // Only print once we are past the page header (count above 366)
                    if (count > 366)
                    {
                        System.out.println(str);
                    }// end if
                    count++;
                }// end while
            }
        }// end try
        catch (MalformedURLException e)
        {
            // Report the problem instead of silently swallowing it
            System.err.println("Bad URL: " + e.getMessage());
        }
        catch (IOException e)
        {
            System.err.println("Could not read page: " + e.getMessage());
        }
    }// end main

}// end WebRipper
---------------------------------------------------------------------

OUTPUT (this is just an example of the first few lines of output)
<TD CLASS="sort1" ALIGN="LEFT" BGCOLOR="#ffffff"> 1. <A HREF="playerprofile.php?PlayerID=1607&LeagueID=1">Donovan McNabb</A></TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">PHI</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">1</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">1</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">24</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">35</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">314</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">3</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">1</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">4</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">7</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">0</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#e0e0e0">28.4</TD>
<TD CLASS="sort1" ALIGN="center" BGCOLOR="#ffffff">28.4</TD>
</TR>
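
The tokenizing idea described above can also be done in one step with a regular expression. This is only a rough sketch, not a tested scraper: the class name TagStripper and the method stripTags are made up for illustration, and it assumes every tag opens and closes on the same line, as in the sample output.

```java
import java.util.ArrayList;
import java.util.List;

public class TagStripper
{
    // Hypothetical helper: delete anything that looks like <...>, then trim.
    // Assumes a tag never spans more than one line.
    public static String stripTags(String line)
    {
        return line.replaceAll("<[^>]*>", "").trim();
    }

    public static void main(String[] argv)
    {
        // A few lines copied from the sample output above
        String[] sample = {
            "<TD CLASS=\"sort1\" ALIGN=\"LEFT\" BGCOLOR=\"#ffffff\"> 1. <A HREF=\"playerprofile.php?PlayerID=1607&LeagueID=1\">Donovan McNabb</A></TD>",
            "<TD CLASS=\"sort1\" ALIGN=\"center\" BGCOLOR=\"#ffffff\">PHI</TD>",
            "<TD CLASS=\"sort1\" ALIGN=\"center\" BGCOLOR=\"#ffffff\">314</TD>",
            "</TR>"
        };

        List<String> fields = new ArrayList<String>();
        for (String line : sample)
        {
            String text = stripTags(line);
            // Lines that held only tags (such as </TR>) leave nothing behind
            if (!text.isEmpty())
            {
                fields.add(text);
            }
        }
        System.out.println(fields); // prints [1. Donovan McNabb, PHI, 314]
    }
}
```

In WebRipper, the same call could replace the plain println, so each line is stripped before it is printed. An empty result after a </TR> line would mark the end of one player's row.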
 
I do not know much about Java's namespaces, but is there an XML DOM class? You can parse HTML files just like XML, provided the markup is well-formed, and use the InnerText property (as it is called in .NET) to get the information.

I did this to inject the BODY tag of an HTML file into an HTML-enabled body property of an email sent to my client's customers. I do this in C# though.
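
On the Java side, the XML DOM route might look roughly like the sketch below. It uses the standard javax.xml.parsers classes, where getTextContent() plays the role of InnerText. The class name DomText, the method cellText, and the hand-cleaned sample row are assumptions for illustration only. The big caveat is that the parser insists on well-formed XML, so the raw page would probably be rejected as-is (for example, the bare & in the HREF has to be written as &amp;).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomText
{
    // Hypothetical helper: parse a well-formed fragment and return the
    // text inside each TD element, with the nested tags dropped.
    public static String[] cellText(String xml)
    {
        try
        {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            // XML tag names are case-sensitive, so this matches TD but not td
            NodeList cells = doc.getElementsByTagName("TD");
            String[] out = new String[cells.getLength()];
            for (int i = 0; i < cells.getLength(); i++)
            {
                out[i] = cells.item(i).getTextContent().trim();
            }
            return out;
        }
        catch (Exception e)
        {
            // Malformed markup ends up here
            throw new IllegalArgumentException("Not well-formed XML: " + e.getMessage(), e);
        }
    }

    public static void main(String[] argv)
    {
        // Assumed input: one table row, cleaned by hand so it is valid XML
        String row = "<TR>"
            + "<TD CLASS=\"sort1\"> 1. <A HREF=\"playerprofile.php?PlayerID=1607&amp;LeagueID=1\">Donovan McNabb</A></TD>"
            + "<TD CLASS=\"sort1\">PHI</TD>"
            + "<TD CLASS=\"sort1\">314</TD>"
            + "</TR>";
        for (String cell : cellText(row))
        {
            System.out.println(cell);
        }
    }
}
```

If the page cannot be made well-formed, a lenient HTML parser or plain string handling is likely the more practical choice.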
 