I have the following code; however, when I run it, it only ever picks up a few of the URLs.
    while (stopFlag != true)
    {
        WebRequest request = WebRequest.Create(urlList[i]);
        using (WebResponse response = request.GetResponse())
        {
            using (StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
            {
                string sitecontent = reader.ReadToEnd();

                // Process the content and add the links to the list
                //Regex urlRx = new Regex(@"((http|https|ftp|file)\://|www.)[A-Za-z0-9\-\.]+(/[A-Za-z0-9\?\&\=\+!'\(\)\*\-\._~%]*)*", RegexOptions.IgnoreCase);
                Regex urlRx = new Regex(@"(?<url>(http:[/][/]|www.)([a-z]|[A-Z]|[0-9]|[/.]|[~])*)", RegexOptions.IgnoreCase);
                MatchCollection matches = urlRx.Matches(sitecontent);
                foreach (Match match in matches)
                {
                    string cleanMatch = cleanUP(match.Value);
                    urlList.Add(cleanMatch);
                    updateResults(result, "\"" + cleanMatch + "\",\n");
                }
            }
        }
    }

I think the error is within the regex.
What I'm trying to achieve: pull down a webpage, extract all the links from that page and add them to the list, then fetch the page for each list item and repeat the process.

Instead of trying to parse the HTML with a regex, I suggest using a good HTML parser - the HTML Agility Pack.
What is exactly the HTML Agility Pack (HAP)? It is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT (you actually don't have to understand XPath or XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to the one System.Xml proposes, but for HTML documents (or streams).
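As a rough sketch of how the crawler's link-extraction step could look with the HTML Agility Pack (assuming the HtmlAgilityPack NuGet package is installed; the start URL is a placeholder, and urlList is carried over from the question's code):

    using System;
    using System.Collections.Generic;
    using HtmlAgilityPack;

    class LinkExtractor
    {
        static void Main()
        {
            var urlList = new List<string> { "http://example.com/" };

            // HtmlWeb downloads the page and parses it into a DOM in one step.
            var web = new HtmlWeb();
            HtmlDocument doc = web.Load(urlList[0]);

            // Select every <a> element that has an href attribute.
            var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
            if (anchors != null)   // SelectNodes returns null when nothing matches
            {
                foreach (HtmlNode anchor in anchors)
                {
                    urlList.Add(anchor.GetAttributeValue("href", string.Empty));
                }
            }
        }
    }

Because the parser builds a real DOM, this picks up every anchor regardless of quoting, attribute order, or broken markup, which is exactly where a URL regex tends to miss matches.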