Question
User-774506321 posted
The HTML page that needs to be parsed is structured without tables or divs. I need to pull each hyperlink from this page along with the last-modified text next to it (e.g. h2_eh.xml 23-Feb-2011 05:05). What would be the best approach? I want to pull only the hyperlinks whose filenames have an .xml extension.
---------------------------------------------------
Expected Results (that will be saved to DB):
---------------------------------------------------
h2_eh.xml 23-Feb-2011
h2_ih.xml 12-Feb-2011
h2_pcs.xml 03-Mar-2011
----------------------------------------------------
HTML Page Source:
----------------------------------------------------
Index of /home/gpoxmlc112
Name              Last modified      Size  Description
Parent Directory                     -
billres.xsl       23-Jun-2010 09:43  574K
dot_line1.gif     03-May-2004 09:57  806
h2_eh.xml         23-Feb-2011 05:05  735K
h2_ih.xml         12-Feb-2011 05:05  719K
h2_pcs.xml        03-Mar-2011 05:05  736K
h2_eh.xml         20-Jan-2011 05:05  4.1K
h2_ih.xml         06-Jan-2011 05:05  12K
h2_pcs.xml        29-Jan-2011 05:05  4.6K
h3_ih.xml         21-Jan-2011 05:05  24K
h4_eh.xml         04-Mar-2011 05:05  7.4K
h4_ih.xml         14-Jan-2011 05:05  17K
h4_pcs.xml        08-Mar-2011 05:05  8.0K
h4_rh.xml         23-Feb-2011 05:05  20K
............ ETC..........
Answers
User2130758966 posted
Hey,
This code works and doesn't blow up with the test case; YMMV when used in the field. It could certainly do with some more checks, such as whether the regex matched anything, whether the next sibling is the right node type, etc.
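One of the checks mentioned above, whether the regex matched anything, could be sketched roughly as follows. This is a minimal, self-contained sketch and not part of the posted code: the class name `ListingDateParser` and the helper `TryExtractDate` are hypothetical, and it assumes the text after each link looks like ` 23-Feb-2011 05:05 735K`, as in the sample listing. It also parses the match into a `DateTime`, which may be more convenient than a string when saving to the DB.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class ListingDateParser
{
    // Matches dates such as 23-Feb-2011 (day, English month abbreviation, year).
    private static readonly Regex DatePattern =
        new Regex(@"\d{2}-[A-Z][a-z]{2}-\d{4}");

    // Hypothetical helper: returns true and sets 'date' only when the text
    // contains a date in the expected format; otherwise returns false.
    public static bool TryExtractDate(string text, out DateTime date)
    {
        date = default(DateTime);
        if (string.IsNullOrEmpty(text))
        {
            return false;
        }

        Match match = DatePattern.Match(text);
        if (!match.Success)
        {
            return false;
        }

        // InvariantCulture keeps month names English regardless of the server locale.
        return DateTime.TryParseExact(
            match.Value,
            "dd-MMM-yyyy",
            CultureInfo.InvariantCulture,
            DateTimeStyles.None,
            out date);
    }
}
```

A caller could then skip any row for which `TryExtractDate` returns false instead of storing an empty date.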
Markup:
Code behind:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Web.UI;
using System.Web.UI.WebControls;
using HtmlAgilityPack;
using System.IO;
using System.Text.RegularExpressions;

// Simple holder for one row of output: the linked file name and its date.
public class FileAndDate
{
    public string File { get; set; }
    public string Date { get; set; }
}

namespace ExtractAllHrefFromHtmlSnippet
{
    public partial class ParseExampleHtml : System.Web.UI.Page
    {
        protected void Page_Load(object sender, EventArgs e)
        {
            // load snippet
            HtmlDocument htmlSnippet = LoadHtmlSnippetFromFile();

            // extract hrefs
            List<FileAndDate> hrefTags = ExtractAllAHrefTags(htmlSnippet);

            // bind to gridview
            GridViewHrefs.DataSource = hrefTags;
            GridViewHrefs.DataBind();
        }

        /// <summary>
        /// Load the html snippet from the txt file
        /// </summary>
        private HtmlDocument LoadHtmlSnippetFromFile()
        {
            HtmlDocument doc = new HtmlDocument();
            // using ensures the reader is closed even if Load throws
            using (TextReader reader = File.OpenText(Server.MapPath("~/App_Data/Sample.html")))
            {
                doc.Load(reader);
            }
            return doc;
        }

        /// <summary>
        /// Extract all anchor tags using HtmlAgilityPack
        /// </summary>
        /// <param name="htmlSnippet"></param>
        /// <returns></returns>
        private List<FileAndDate> ExtractAllAHrefTags(HtmlDocument htmlSnippet)
        {
            List<FileAndDate> hrefTags = new List<FileAndDate>();
            // dd-Mmm-yyyy, e.g. 23-Feb-2011
            Regex r = new Regex(@"[\d]{2}-[A-Z][a-z]{2}-[\d]{4}");

            foreach (HtmlNode link in htmlSnippet.DocumentNode.SelectNodes("//a[@href]"))
            {
                HtmlAttribute att = link.Attributes["href"];
                if (att.Value.ToLower().EndsWith(".xml"))
                {
                    // the date is in the text that follows the anchor
                    string nextSiblingText = link.NextSibling.InnerText;
                    Match match = r.Match(nextSiblingText);
                    hrefTags.Add(new FileAndDate() { File = att.Value, Date = match.Value });
                }
            }
            return hrefTags;
        }
    }
}
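The other checks mentioned above (no matching links at all, a missing or non-text next sibling, a regex that matches nothing) might look something like the sketch below. It is only an illustration under stated assumptions: the class `ListingParser` and the method `ExtractXmlLinks` are hypothetical names, the HTML is passed in as a string rather than loaded from `Sample.html`, and each date is assumed to sit in the text node directly after its anchor.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

public static class ListingParser
{
    private static readonly Regex DatePattern =
        new Regex(@"\d{2}-[A-Z][a-z]{2}-\d{4}");

    // Hypothetical helper: returns (file, date) pairs for every .xml link
    // that is directly followed by a text node containing a date.
    public static List<KeyValuePair<string, string>> ExtractXmlLinks(string html)
    {
        var results = new List<KeyValuePair<string, string>>();

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes can return null (not an empty collection) when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
        {
            return results;
        }

        foreach (HtmlNode link in links)
        {
            string href = link.GetAttributeValue("href", string.Empty);
            if (!href.EndsWith(".xml", StringComparison.OrdinalIgnoreCase))
            {
                continue;
            }

            // Skip links that are not followed by a plain text node.
            HtmlNode next = link.NextSibling;
            if (next == null || next.NodeType != HtmlNodeType.Text)
            {
                continue;
            }

            // Only keep rows where the date pattern actually matched.
            Match match = DatePattern.Match(next.InnerText);
            if (match.Success)
            {
                results.Add(new KeyValuePair<string, string>(href, match.Value));
            }
        }

        return results;
    }
}
```

Rows that fail any check are skipped silently here; logging them instead may be preferable if the listing format is expected to change.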
Sample.html
Index of /home/gpoxmlc112
Name              Last modified      Size  Description
Parent Directory                     -
billres.xsl       23-Jun-2010 09:43  574K
dot_line1.gif     03-May-2004 09:57  806
h2_eh.xml         23-Feb-2011 05:05  735K
h2_ih.xml         12-Feb-2011 05:05  719K
h2_pcs.xml        03-Mar-2011 05:05  736K
h2_eh.xml         20-Jan-2011 05:05  4.1K
h2_ih.xml         06-Jan-2011 05:05  12K
h2_pcs.xml        29-Jan-2011 05:05  4.6K
h3_ih.xml         21-Jan-2011 05:05  24K
h4_eh.xml         04-Mar-2011 05:05  7.4K
h4_ih.xml         14-Jan-2011 05:05  17K
h4_pcs.xml        08-Mar-2011 05:05  8.0K
h4_rh.xml         23-Feb-2011 05:05  20K
- Marked as answer by Anonymous Thursday, October 7, 2021 12:00 AM