I'm writing a C# console application to retrieve table data from an external HTML web page.
Example web page: (chessnuts.org)
I want to extract all the <td> values for data, match, opponent, result, etc. (23 rows in the example link above).
I have no control over this web page, which unfortunately isn't well formed, so the options I've tried, such as HtmlAgilityPack and XML parsing, simply fail. I have also tried a number of regexes, but my knowledge of them is extremely poor. An example I tried is below:
    string[] trs = Regex.Matches(html,
            @"<tr[^>]*>(?<content>.*)</tr>",
            RegexOptions.Multiline)
        .Cast<Match>()
        .Select(t => t.Groups["content"].Value)
        .ToArray();
This returns a complete list of all the <tr> elements (including many rows I don't need), but I'm then unable to extract the cell data from them.
UPDATE
Here is the HtmlAgilityPack code I tried:
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
    foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table"))
    {
        foreach (HtmlNode row in table.SelectNodes("tr"))
        {
            foreach (HtmlNode cell in row.SelectNodes("td"))
            {
                Console.WriteLine(cell.InnerText);
            }
        }
    }