HTML Agility Pack: get href value

  • Question

  • User-774506321 posted

    The HTML page that needs to be parsed is structured without tables or divs. I need to pull each hyperlink from this page, along with the last-modified text next to it (e.g. h2_eh.xml               23-Feb-2011 05:05). What would be the best approach? I only want the hyperlinks whose filenames have the .xml extension.

    ---------------------------------------------------

    Expected Results (that will be saved to DB):

    ---------------------------------------------------

    h2_eh.xml          23-Feb-2011

    h2_ih.xml           12-Feb-2011

    h2_pcs.xml          03-Mar-2011

    ----------------------------------------------------

    HTML Page Source:

    ----------------------------------------------------

     
     
      
    Index of /home/gpoxmlc112

     Name                    Last modified      Size  Description
     Parent Directory                             -   
     billres.xsl             23-Jun-2010 09:43  574K  
     dot_line1.gif           03-May-2004 09:57  806   
     h2_eh.xml               23-Feb-2011 05:05  735K  
     h2_ih.xml               12-Feb-2011 05:05  719K  
     h2_pcs.xml              03-Mar-2011 05:05  736K  
     h2_eh.xml               20-Jan-2011 05:05  4.1K  
     h2_ih.xml               06-Jan-2011 05:05   12K  
     h2_pcs.xml              29-Jan-2011 05:05  4.6K  
     h3_ih.xml               21-Jan-2011 05:05   24K  
     h4_eh.xml               04-Mar-2011 05:05  7.4K  
     h4_ih.xml               14-Jan-2011 05:05   17K  
     h4_pcs.xml              08-Mar-2011 05:05  8.0K  
     h4_rh.xml               23-Feb-2011 05:05   20K  
    
    ............ ETC..........

Answers

  • User2130758966 posted

    Hey,

    This code works and doesn't blow up with the test case. YMMV when used in the field: it could certainly do with some more checks, such as whether the regex matched anything, whether the next sibling is the right node type, etc.

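    Markup: a page containing a GridView control with the ID GridViewHrefs, which the code-behind below binds to.
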
    Code behind:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Web;
    using System.Web.UI;
    using System.Web.UI.WebControls;
    
    using HtmlAgilityPack;
    using System.IO;
    using System.Text.RegularExpressions;
    
    // One row of the result: the linked file name and its last-modified date
    public class FileAndDate
    {
        public string File { get; set; }
        public string Date { get; set; }
    }
    
    namespace ExtractAllHrefFromHtmlSnippet
    {
        public partial class ParseExampleHtml : System.Web.UI.Page
        {
            protected void Page_Load(object sender, EventArgs e)
            {
                // load snippet
                HtmlDocument htmlSnippet = LoadHtmlSnippetFromFile();
    
                // extract hrefs
                List<FileAndDate> hrefTags = ExtractAllAHrefTags(htmlSnippet);
    
                // bind to gridview
                GridViewHrefs.DataSource = hrefTags;
                GridViewHrefs.DataBind();
            }
    
            /// <summary>
            /// Load the html snippet from the txt file
            /// </summary>
            private HtmlDocument LoadHtmlSnippetFromFile()
            {
                HtmlDocument doc = new HtmlDocument();
    
                // "using" closes the reader even if Load throws
                using (TextReader reader = File.OpenText(Server.MapPath("~/App_Data/Sample.html")))
                {
                    doc.Load(reader);
                }
    
                return doc;
            }
    
            /// <summary>
            /// Extract all anchor tags using HtmlAgilityPack
            /// </summary>
            /// <param name="htmlSnippet">The parsed directory listing</param>
            /// <returns>File name and date for every link ending in .xml</returns>
            private List<FileAndDate> ExtractAllAHrefTags(HtmlDocument htmlSnippet)
            {
                List<FileAndDate> hrefTags = new List<FileAndDate>();
    
                // SelectNodes returns null (not an empty collection) when nothing matches
                HtmlNodeCollection links = htmlSnippet.DocumentNode.SelectNodes("//a[@href]");
                if (links == null)
                {
                    return hrefTags;
                }
    
                foreach (HtmlNode link in links)
                {
                    HtmlAttribute att = link.Attributes["href"];
    
                    // skip non-.xml links and links with no text after them
                    if (att.Value.ToLower().EndsWith(".xml") && link.NextSibling != null)
                    {
                        // the date is in the text that follows the anchor
                        string siblingText = link.NextSibling.InnerText;
                        Regex r = new Regex(@"[\d]{2}-[A-Z][a-z]{2}-[\d]{4}");
                        Match match = r.Match(siblingText);
    
                        hrefTags.Add(new FileAndDate() { File = att.Value, Date = match.Value });
                    }
                }
    
                return hrefTags;
            }
        }
    }
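
    The answer notes that the code could use more checks, such as whether the regex matched anything. A minimal sketch of what that might look like, using only the .NET base class library so it can be tried without HtmlAgilityPack. The names ListingDateParser, TryExtractDate and IsXmlLink are hypothetical and not part of the original answer, and the "dd-MMM-yyyy" format is an assumption based on the dates shown in the listing:

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class ListingDateParser
{
    // Same pattern as the answer: two digits, a capitalised three-letter
    // month, four digits (for example "23-Feb-2011").
    private static readonly Regex DatePattern =
        new Regex(@"\d{2}-[A-Z][a-z]{2}-\d{4}");

    // Returns true only when the text contains a date that also parses
    // cleanly; otherwise returns false and leaves 'date' at its default.
    public static bool TryExtractDate(string siblingText, out DateTime date)
    {
        date = default(DateTime);

        if (string.IsNullOrWhiteSpace(siblingText))
        {
            return false;
        }

        Match match = DatePattern.Match(siblingText);
        if (!match.Success)
        {
            return false;
        }

        // Assumed format: English month abbreviations, so parse with the
        // invariant culture rather than the server's current culture.
        return DateTime.TryParseExact(
            match.Value,
            "dd-MMM-yyyy",
            CultureInfo.InvariantCulture,
            DateTimeStyles.None,
            out date);
    }

    // Case-insensitive extension check without allocating a lowered copy.
    public static bool IsXmlLink(string href)
    {
        return href != null
            && href.EndsWith(".xml", StringComparison.OrdinalIgnoreCase);
    }
}
```

    In the loop above, the text passed to TryExtractDate would be link.NextSibling.InnerText, and rows for which it returns false could be skipped instead of being added with an empty date. Returning a DateTime rather than a string may also make the later save to the database simpler, though that depends on the column type.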

    Sample.html

     
     
      
    Index of /home/gpoxmlc112

     Name                    Last modified      Size  Description
     Parent Directory                             -   
     billres.xsl             23-Jun-2010 09:43  574K  
     dot_line1.gif           03-May-2004 09:57  806   
     h2_eh.xml               23-Feb-2011 05:05  735K  
     h2_ih.xml               12-Feb-2011 05:05  719K  
     h2_pcs.xml              03-Mar-2011 05:05  736K  
     h2_eh.xml               20-Jan-2011 05:05  4.1K  
     h2_ih.xml               06-Jan-2011 05:05   12K  
     h2_pcs.xml              29-Jan-2011 05:05  4.6K  
     h3_ih.xml               21-Jan-2011 05:05   24K  
     h4_eh.xml               04-Mar-2011 05:05  7.4K  
     h4_ih.xml               14-Jan-2011 05:05   17K  
     h4_pcs.xml              08-Mar-2011 05:05  8.0K  
     h4_rh.xml               23-Feb-2011 05:05   20K  
    
    

    • Marked as answer by Anonymous Thursday, October 7, 2021 12:00 AM
