Html agility pack get href value

  • Question

  • User-774506321 posted

    The HTML page that needs to be parsed is structured without tables or divs. I need to pull each hyperlink from this page along with the last-modified text next to it (e.g. h2_eh.xml               23-Feb-2011 05:05). What would be the best approach? I only want the hyperlinks whose filenames have an .xml extension.

    ---------------------------------------------------

    Expected Results (that will be saved to DB):

    ---------------------------------------------------

    h2_eh.xml          23-Feb-2011

    h2_ih.xml           12-Feb-2011

    h2_pcs.xml          03-Mar-2011

    ----------------------------------------------------

    HTML Page Source:

    ----------------------------------------------------

    Index of /home/gpoxmlc112

    Name               Last modified      Size  Description
    Parent Directory                      -
    billres.xsl        23-Jun-2010 09:43  574K
    dot_line1.gif      03-May-2004 09:57  806
    h2_eh.xml          23-Feb-2011 05:05  735K
    h2_ih.xml          12-Feb-2011 05:05  719K
    h2_pcs.xml         03-Mar-2011 05:05  736K
    h2_eh.xml          20-Jan-2011 05:05  4.1K
    h2_ih.xml          06-Jan-2011 05:05  12K
    h2_pcs.xml         29-Jan-2011 05:05  4.6K
    h3_ih.xml          21-Jan-2011 05:05  24K
    h4_eh.xml          04-Mar-2011 05:05  7.4K
    h4_ih.xml          14-Jan-2011 05:05  17K
    h4_pcs.xml         08-Mar-2011 05:05  8.0K
    h4_rh.xml          23-Feb-2011 05:05  20K
    ... etc.

Answers

  • User2130758966 posted

    Hey,

    This code works and doesn't blow up with the test case; YMMV when used in the field. It could certainly do with some more checks, such as whether the regex matched anything, whether the next sibling is the right node type, etc.

    Markup:

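    <%@ Page Language="C#" AutoEventWireup="true" CodeBehind="ParseExampleHtml.aspx.cs" Inherits="ExtractAllHrefFromHtmlSnippet.ParseExampleHtml" %>
    
    <!DOCTYPE html>
    <html>
    <head runat="server">
        <title></title>
    </head>
    <body>
        <form id="form1" runat="server">
            <asp:GridView ID="GridViewHrefs" runat="server">
            </asp:GridView>
        </form>
    </body>
    </html>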

    Code behind:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Web;
    using System.Web.UI;
    using System.Web.UI.WebControls;
    
    using HtmlAgilityPack;
    using System.IO;
    using System.Text.RegularExpressions;
    
    public class FileAndDate
    {
        public string File { get; set; }
        public string Date { get; set; }
    }
    
    namespace ExtractAllHrefFromHtmlSnippet
    {
        public partial class ParseExampleHtml : System.Web.UI.Page
        {
            protected void Page_Load(object sender, EventArgs e)
            {
                // load snippet
                HtmlDocument htmlSnippet = LoadHtmlSnippetFromFile();
    
                // extract hrefs
                List<FileAndDate> hrefTags = ExtractAllAHrefTags(htmlSnippet);
    
                // bind to gridview
                GridViewHrefs.DataSource = hrefTags;
                GridViewHrefs.DataBind();
            }
    
            /// <summary>
            /// Load the html snippet from the sample file
            /// </summary>
            private HtmlDocument LoadHtmlSnippetFromFile()
            {
                HtmlDocument doc = new HtmlDocument();
    
                // using disposes the reader even if Load throws
                using (TextReader reader = File.OpenText(Server.MapPath("~/App_Data/Sample.html")))
                {
                    doc.Load(reader);
                }
    
                return doc;
            }
    
            /// <summary>
            /// Extract all anchor tags using HtmlAgilityPack
            /// </summary>
            /// <param name="htmlSnippet">The parsed document to search</param>
            /// <returns>One entry per .xml link, with the date found next to it</returns>
            private List<FileAndDate> ExtractAllAHrefTags(HtmlDocument htmlSnippet)
            {
                List<FileAndDate> hrefTags = new List<FileAndDate>();
                Regex r = new Regex(@"[\d]{2}-[A-Z][a-z]{2}-[\d]{4}");
    
                // SelectNodes returns null (not an empty collection) when nothing matches
                HtmlNodeCollection links = htmlSnippet.DocumentNode.SelectNodes("//a[@href]");
                if (links == null)
                {
                    return hrefTags;
                }
    
                foreach (HtmlNode link in links)
                {
                    HtmlAttribute att = link.Attributes["href"];
    
                    if (att.Value.EndsWith(".xml", StringComparison.OrdinalIgnoreCase))
                    {
                        // the date sits in the text that follows the anchor
                        string nextSibling = link.NextSibling != null ? link.NextSibling.InnerText : string.Empty;
                        Match match = r.Match(nextSibling);
    
                        // match.Value is an empty string when no date was found
                        hrefTags.Add(new FileAndDate() { File = att.Value, Date = match.Value });
                    }
                }
    
                return hrefTags;
            }
        }
    }
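
Since the results are meant to be saved to a DB, it may be worth turning the matched text into a real `DateTime` and reporting when the regex finds nothing. The following is only a minimal sketch and not part of the code above: the `ListingDate` class and its `TryParse` method are made-up names, and it assumes the listing always uses English month abbreviations in `dd-MMM-yyyy` form.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper (not from the original post): pulls a dd-MMM-yyyy date
// out of the text that follows a link in the directory listing.
public static class ListingDate
{
    private static readonly Regex DatePattern =
        new Regex(@"\d{2}-[A-Z][a-z]{2}-\d{4}");

    // Returns false when no date-like text is found or it cannot be parsed,
    // so callers can skip the row instead of storing an empty value.
    public static bool TryParse(string siblingText, out DateTime date)
    {
        date = default(DateTime);
        if (string.IsNullOrEmpty(siblingText))
        {
            return false;
        }

        Match match = DatePattern.Match(siblingText);
        if (!match.Success)
        {
            return false;
        }

        // Invariant culture so "Feb" parses regardless of the server locale.
        return DateTime.TryParseExact(
            match.Value,
            "dd-MMM-yyyy",
            CultureInfo.InvariantCulture,
            DateTimeStyles.None,
            out date);
    }
}
```

Inside the loop, something like `ListingDate.TryParse(link.NextSibling.InnerText, out date)` could stand in for the bare regex match, with rows that return false being skipped or logged.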

    Sample.html

    Index of /home/gpoxmlc112

    Name               Last modified      Size  Description
    Parent Directory                      -
    billres.xsl        23-Jun-2010 09:43  574K
    dot_line1.gif      03-May-2004 09:57  806
    h2_eh.xml          23-Feb-2011 05:05  735K
    h2_ih.xml          12-Feb-2011 05:05  719K
    h2_pcs.xml         03-Mar-2011 05:05  736K
    h2_eh.xml          20-Jan-2011 05:05  4.1K
    h2_ih.xml          06-Jan-2011 05:05  12K
    h2_pcs.xml         29-Jan-2011 05:05  4.6K
    h3_ih.xml          21-Jan-2011 05:05  24K
    h4_eh.xml          04-Mar-2011 05:05  7.4K
    h4_ih.xml          14-Jan-2011 05:05  17K
    h4_pcs.xml         08-Mar-2011 05:05  8.0K
    h4_rh.xml          23-Feb-2011 05:05  20K
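
The tags of Sample.html did not survive in this post, so it is not obvious from the listing why the date ends up in `link.NextSibling`. The sketch below uses an assumed markup, modelled on how Apache-style directory listings are commonly rendered (each `<a>` followed by a plain text node inside a `<pre>`); the real file may differ. The `ListingDemo` class, its `SampleHtml` constant and its `Extract` method are hypothetical names. It also adds the node-type check mentioned in the answer, and it needs the HtmlAgilityPack package.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

public static class ListingDemo
{
    // Assumed markup; the real Sample.html was not preserved in the post.
    public const string SampleHtml =
        "<html><body><h1>Index of /home/gpoxmlc112</h1><pre>" +
        "<a href=\"billres.xsl\">billres.xsl</a>   23-Jun-2010 09:43  574K\n" +
        "<a href=\"h2_eh.xml\">h2_eh.xml</a>     23-Feb-2011 05:05  735K\n" +
        "<a href=\"h2_ih.xml\">h2_ih.xml</a>     12-Feb-2011 05:05  719K\n" +
        "</pre></body></html>";

    // Returns (file name, date text) pairs for every .xml link in the markup.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var results = new List<KeyValuePair<string, string>>();

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
        {
            return results;
        }

        var datePattern = new Regex(@"\d{2}-[A-Z][a-z]{2}-\d{4}");
        foreach (HtmlNode link in links)
        {
            string href = link.GetAttributeValue("href", string.Empty);
            if (!href.EndsWith(".xml", StringComparison.OrdinalIgnoreCase))
            {
                continue;
            }

            // Only trust the sibling when it exists and is a text node.
            HtmlNode next = link.NextSibling;
            if (next == null || next.NodeType != HtmlNodeType.Text)
            {
                continue;
            }

            // Skip rows where the regex finds no date at all.
            Match match = datePattern.Match(next.InnerText);
            if (match.Success)
            {
                results.Add(new KeyValuePair<string, string>(href, match.Value));
            }
        }

        return results;
    }
}
```

Calling `ListingDemo.Extract(ListingDemo.SampleHtml)` from a console app or test is a quick way to try the XPath and regex without the web page and GridView in the way.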

    • Marked as answer by Anonymous Thursday, October 7, 2021 12:00 AM