I am trying to read title tags from HTML pages using HTMLParser, but I am getting the error mentioned above. My class looks like this:
readtitle.py

    from HTMLParser import HTMLParser
    import urllib

    class MyHTMLParser(HTMLParser):
        def __init__(self, url):
            HTMLParser.__init__(self)
            self.url = url
            self.data = urllib.urlopen(url).read()
            self.feed(self.data)
            self.intitle = ""
            self.mytitle = ""

        def handle_starttag(self, tag, attrs):
            self.intitle = tag == "title"

        def handle_data(self, data):
            if self.intitle:
                self.mytitle = data
                return self.mytitle

I ran the code using the following commands and got the error:
    >>> import urllib
    >>> import readtitle
    >>> parser = readtitle.MyHTMLParser("http://docs.python.org/tutorial/classes.html")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "readtitle.py", line 10, in __init__
        self.feed(self.data)
      File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
        self.goahead(0)
      File "/usr/lib/python2.6/HTMLParser.py", line 142, in goahead
        if i < j: self.handle_data(rawdata[i:j])
      File "readtitle.py", line 18, in handle_data
        if self.intitle:
    AttributeError: MyHTMLParser instance has no attribute 'intitle'
You call self.feed(), which in turn calls handle_data() (as the traceback shows), before you run self.intitle = "". At that point the attribute does not exist yet, hence the AttributeError. Fix: set the attributes first and call feed() last.

    self.url = url
    self.data = urllib.urlopen(url).read()  # Maybe a decode() should be here?
    self.intitle = False
    self.mytitle = ""
    self.feed(self.data)

---------------------------------------
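To illustrate the initialization-order problem from the first answer, here is a minimal sketch. It is a Python 3 port (`html.parser` instead of the Python 2 `HTMLParser` module used in the question), and it feeds a literal string instead of downloading a page. The names `SAMPLE`, `FeedFirstParser` and `InitFirstParser` are made up for this example and are not from the original code.

```python
from html.parser import HTMLParser

# Hypothetical sample page. The newline after the doctype is text that
# reaches handle_data() before any start tag has been seen.
SAMPLE = "<!DOCTYPE html>\n<html><head><title>Example</title></head></html>"


class FeedFirstParser(HTMLParser):
    """Same ordering as the question: feed() runs before the attributes exist."""

    def __init__(self, html):
        super().__init__()
        self.feed(html)        # handle_data() is called from inside feed() ...
        self.intitle = False   # ... before this assignment has run
        self.mytitle = ""

    def handle_starttag(self, tag, attrs):
        self.intitle = (tag == "title")

    def handle_data(self, data):
        if self.intitle:
            self.mytitle = data


class InitFirstParser(FeedFirstParser):
    """Same handlers, but the attributes are set before feed() is called."""

    def __init__(self, html):
        HTMLParser.__init__(self)
        self.intitle = False
        self.mytitle = ""
        self.feed(html)


if __name__ == "__main__":
    try:
        FeedFirstParser(SAMPLE)
    except AttributeError as exc:
        print("feed() before init:", exc)
    print("init before feed():", repr(InitFirstParser(SAMPLE).mytitle))
```

Note that the error only shows up because some text (here the newline after the doctype) arrives before the first start tag; once `handle_starttag()` has run, it creates the attribute itself, which can hide the mistake on other pages.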
Debugging is always the most important part of coding. Run this code and see what it prints.

    from HTMLParser import HTMLParser
    import urllib, sys

    class MyHTMLParser(HTMLParser):
        def __init__(self, url):
            HTMLParser.__init__(self)
            self.url = url
            self.data = urllib.urlopen(url).read()
            self.in_title = False
            self.title = ''
            self.feed(self.data)

        def handle_starttag(self, tag, attrs):
            # Stop as soon as the body starts; the title lives in the head.
            if tag == 'body': sys.exit('found <body>')
            self.in_title = tag == 'title'  # self.in_title = (tag == 'title') is easier to read
            print 'handle start of', tag, 'in_title is', self.in_title

        def handle_endtag(self, tag):
            print 'handling end of', tag

        def handle_data(self, data):
            print "handling data:", repr(data)
            if self.in_title:
                print "Apparently, we have a <title> tag. self.title now", repr(data)
                self.title = data
                return self.title

    parser = MyHTMLParser("http://www...")

For convenience, here is the HTML for the page in question:
    <HTML>
    <HEAD>
    <TITLE>Webpage1</TITLE>
    </HEAD>
    <BODY BGCOLOR="FFFFFf" LINK="006666" ALINK="8B4513" VLINK="006666">
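One thing the debug output is likely to reveal on this page: the flag is only updated in handle_starttag(), so it stays true after the closing title tag, and the whitespace that follows can overwrite the title. The sketch below is a Python 3 port (`html.parser`) run against the HTML above as a literal string. Resetting the flag in handle_endtag() is a suggestion added here, not something either answer states, and `PAGE`, `NaiveTitleParser`, `TitleParser` and `extract_title` are hypothetical names.

```python
from html.parser import HTMLParser

# The page from the question, embedded as a string so no download is needed.
PAGE = """<HTML>
<HEAD>
<TITLE>Webpage1</TITLE>
</HEAD>
<BODY BGCOLOR="FFFFFf" LINK="006666" ALINK="8B4513" VLINK="006666">
"""


class NaiveTitleParser(HTMLParser):
    """Updates the flag only on start tags, as in the code above."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <TITLE> matches "title".
        self.in_title = (tag == "title")

    def handle_data(self, data):
        if self.in_title:
            self.title = data


class TitleParser(NaiveTitleParser):
    """Also clears the flag when the title element is closed."""

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False


def extract_title(parser_cls, html):
    """Feed html to a fresh parser of the given class and return its title."""
    parser = parser_cls()
    parser.feed(html)
    parser.close()
    return parser.title


if __name__ == "__main__":
    print("start-tag flag only:", repr(extract_title(NaiveTitleParser, PAGE)))
    print("with end-tag reset: ", repr(extract_title(TitleParser, PAGE)))
```

With the end-tag reset, only text between the opening and closing title tags is stored, so trailing newlines in the head no longer replace the real title.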