NCFPTeam's Blog

A gateway to the world... by nineclouds


[python] Fetching the information you need from a web page (unfinished)

*** Python 3.1.1 (r311:74483, Aug 17 2009, 17:02:12) [MSC v.1500 32 bit (Intel)] on win32. ***
>>> import urllib, re
>>> url='http://www.census.gov/cgi-bin/ipc/popclockw'
>>> text=urllib.urlopen(url).read()
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
AttributeError: 'module' object has no attribute 'urlopen'
>>> help(urllib)
Help on package urllib:

NAME
    urllib

FILE
    c:\program files\python311\lib\urllib\__init__.py

PACKAGE CONTENTS
    error
    parse
    request
    response
    robotparser


>>> text=urllib.__package__
>>> import urllib.request
>>> text=urllib.request.urlopen(url).read()
>>> text
b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\n<head>\n<title>International Programs</title>\n<meta name="author" content="Demographic Internet Staff">\n<link rel="stylesheet" href="/main/.in/style.css" type="text/css" />\n\n<link rel="stylesheet" type="text/css" href="/population/css/poptemplate.css" />\n\n<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />\n<style type="text/css">\n<!--\n.frame {\tbackground-color: #adc8e6;\n}\n.equal {\twidth: 30%;\n\ttext-align: left;\n}\n.style1 {font-size: medium}\n.style3 {font-size: small}\n-->\n</style>\n</head>\n<body>\n\n<!-- top banner and navigation bar -->\n<div id="header"> <!-- START CB Header 2:24 PM 7/1/2009 -->\n<!-- 44px height -->\n<!-- newline whitespace removed -->\n<div id="cb_header"><a href="#SKIP_HDR" title="Skip Header"\nstyle="visibility:hidden; display:none;">Skip header section</a><div\nstyle="height:40px; min-width: 782px;\n_width:expression((this.parentNode.offsetWidth>782)?\'auto\':\'782px\');\ntext-align:right; white-space:nowrap; overflow:hidden; margin:0; padding:0;\nbackground:#036 url(/main/www/m-img/cb_head_gradient.png) no-repeat top\nright; border-bottom:4px solid #09C;"><a href="/" style="float:left;"><img\nsrc="/main/www/img/home/cblogo.jpg" alt="US Census Bureau"\nstyle="width:275px; height:40px; padding:0; margin:0; border:none;"\n/></a><div style="display:inline;"><div style="padding:10px 0;\nwhite-space:nowrap;"><a\nhref="/population/www/" title="People &amp; Households"\nstyle="color:#FFF; background:none; font:bold 80% Arial,\nHelvetica, sans-serif; text-decoration:none; white-space:nowrap;"><span\nstyle="padding:10px 4px; margin:0px; text-align:center;\nbackground:none;">People</span></a><span style="width:1px; overflow:hidden;\nborder-left:1px solid #E6EFF6;">&nbsp;</span><a\nhref="/econ/" title="Business &amp; 
Industry"\nstyle="color:#FFF; background:none; font:bold 80% Arial, Helvetica,\nsans-serif; text-decoration:none; white-space:nowrap;"><span\nstyle="padding:10px 4px; margin:0px; text-align:center;\nbackground:none;">Business</span></a><span style="width:1px; overflow:hidden;\nborder-left:1px solid #E6EFF6;">&nbsp;</span><a\nhref="/geo/www/" title="Geography"\nstyle="color:#FFF; background:none; font:bold 80% Arial, Helvetica,\nsans-serif; text-decoration:none; white-space:nowrap;"><span\nstyle="padding:10px 4px; margin:0px; text-align:center;\nbackground:none;">Geography</span></a><span style="width:1px;\noverflow:hidden; border-left:1px solid #E6EFF6;">&nbsp;</span><a\nhref="/pubinfo/www/news/" title="Newsroom"\nstyle="color:#FFF; background:none; font:bold 80% Arial, Helvetica,\nsans-serif; text-decoration:none; white-space:nowrap;"><span\nstyle="padding:10px 4px; margin:0px; text-align:center;\nbackground:none;">Newsroom</span></a><span style="width:1px; overflow:hidden;\nborder-left:1px solid #E6EFF6;">&nbsp;</span><a\nhref="/main/www/a2z/" title="Subjects A to Z"\nstyle="color:#FFF; background:none; font:bold 80% Arial, Helvetica,\nsans-serif; text-decoration:none; white-space:nowrap;"><span\nstyle="padding:10px 4px; margin:0px; text-align:center;\nbackground:none;">Subjects A to Z</span></a><span style="width:1px;\noverflow:hidden; border-left:1px solid #E6EFF6;">&nbsp;</span><a\nhref="/main/www/srchtool.html" title="search at census"\nstyle="color:#FFF; background:none; font:bold 80% Arial, Helvetica,\nsans-serif; text-decoration:none; white-space:nowrap;"><span\nstyle="padding:10px 4px; margin:0px; text-align:center;\nbackground:none;">Search@Census</span></a></div></div><div\nstyle="width:782px; height:1px; line-height:1px;">&nbsp;</div></div><a\nname="SKIP_HDR" style="visibility:hidden; display:none;"></a></div>\n<!-- END CB Header 2:24 PM 7/1/2009 -->\r\n  <div id="local_header">\r\n    <h1>International Programs</h1>\r\n    <form name="gs" id="search" 
method="get" action="http://search.census.gov/search">\r\n      \r\n      <!--original location for the image: /hhes/www/img/-->\r\n      <input type="text" name="q" size="32" maxlength="256" value="" />\r\n      <input type="submit" name="btnG" value="Search This Site" />\r\n      <br />\r\n      <br />\r\n      <input type="hidden" name="filter" value="0" />\r\n      <input type="hidden" name="entqr" value="0" />\r\n      <input type="hidden" name="output" value="xml_no_dtd" />\r\n      <input type="hidden" name="ud" value="1" />\r\n      <input type="hidden" name="ie" value="UTF-8" />\r\n      <input type="hidden" name="client" value="subsite" />\r\n      <input type="hidden" name="proxystylesheet" value="subsite" />\r\n      <input type="hidden" name="hq" value="inurl:www.census.gov/ipc/www/cendates/" />\r\n      <input type="hidden" name="subtitle" value="hhes-xyz" />\r\n    </form>\r\n    <ul id="top_nav">\r\n      <li><a href="index.html">International Programs Main</a></li>\r\n\t  <li><a href="aboutintl.html">Overview</a>\r\n      <li><a href="intldata.html">Products and Services</a></li>\r\n      <li><a href="intlsoftapps.html">FAQ</a></li> \r\n<li><a href="intlrelated.html">Related Sites</a></li>\r\n\r\n\t  \r\n\t  \r\n    </ul>\r\n  </div>\r\n</div>\n<!-- end top banner and navigation bar -->\n<!--end of local_header div-->\n<!--end of header div-->\n<!-- end footer and bottom navigation bar -->\n<!-- main body of page (below banner and above hard rule) -->\n<!-- **************************** -->\n<!-- BEGIN 1st COLUMN (LEFT) HERE -->\n<!-- **************************** -->\n<div id="container">\n  <h2 id="int_page_header">World POPClock Projection </h2>\n  <p></p>\nAccording to the <a href="/ipc/www/">International Programs Center</a>, U.S. 
Census Bureau, the total population of the World, projected to 10/13/09\n at 13:25 GMT\n (EST+5) is<br /><br /><div id="worldnumber">6,790,219,427</div><p></p>\n<hr />\n<h3>Monthly World population figures:</h3>\n<pre>\n07/01/09    6,768,167,712\n08/01/09    6,774,705,647\n09/01/09    6,781,243,583\n10/01/09    6,787,570,618\n11/01/09    6,794,108,554\n12/01/09    6,800,435,588\n01/01/10    6,806,973,524\n02/01/10    6,813,511,460\n03/01/10    6,819,416,692\n04/01/10    6,825,954,628\n05/01/10    6,832,281,663\n06/01/10    6,838,819,599\n07/01/10    6,845,146,634\n</pre>\n<hr />\n<p>\n<a href="/ipc/www/popwnote.html">World POPClock notes</a><br /></p>\n<p>\nSource: U.S. Census Bureau, <a href="/ipc/www/idb/">International Data Base</a>.</p>\n<p>&nbsp;</p>\n<hr />\n<p>\n<a href="/main/www/popclock.html">More POPClocks.</a></p>\n\n<h3><a href="/ipc/www/idb/worldpopinfo.php">World Population Information</a></h3>\n</div>\n<!--end container div-->\n<!-- end main body of page (below banner and above horizontal rule) -->\n<div id="foot">\n  \n<hr />\nSource:  U.S. 
Census Bureau, Population Division<br />\r\n<a href="http://ask.census.gov">Questions?</a> / 1-866-758-1060<br />\r\n\r\n\r\n\n<hr />\n  <div id="bottom_nav">\r\n    <ul>\r\n      <li><a href="index.html">International Programs Main</a></li>\r\n\t  <li><a href="aboutintl.html">Overview</a>\r\n      <li><a href="intldata.html">Products and Services</a></li>\r\n      <li><a href="intlsoftapps.html">FAQ</a></li> \r\n<li><a href="intlrelated.html">Related Sites</a></li>\r\n    </ul>\r\n  </div>\n  <br />\n  <!-- START FOOTER 37px height 2:22 PM 7/1/2009 -->\n<!-- wordwrap at col 78 -->\n<div id="cb_footer"><div style="height:37px; min-width:782px;\n_width:expression((this.parentNode.offsetWidth>782)?\'auto\':\'782px\');\ntext-align:right; margin:0; padding:0; background:#036\nurl(/main/www/m-img/cb_head_gradient.png) no-repeat top right; border-top:3px\nsolid #09C;"><a href="#SKIP_FTR" title="Skip Footer" style="visibility:hidden;\ndisplay:none;">Skip footer section</a><a href="/" style="float:left;\npadding:0; margin:0;"><img src="/main/www/img/home/wordmark2.gif" alt="US\nCensus Bureau" style="width:210px; height:25px; padding:0; margin:0;\nborder:none;" /></a><div style="display:inline;"><div style="padding:7px\n0;"><a href="/privacy/" title="Privacy Policy"\nstyle="color:#FFF; background:none; font:bold 80% Arial, Helvetica,\nsans-serif; text-decoration:none; white-space:nowrap;"><span\nstyle="padding:10px 4px; margin:0px; text-align:center;\nbackground:none;">Privacy Policy</span></a><span style="width:1px;\noverflow:hidden; border-left:1px solid #E6EFF6;">&nbsp;</span><a\nhref="/2010census/" title="2010 Census"\nstyle="color:#FFF; background:none; font:bold 80% Arial, Helvetica,\nsans-serif; text-decoration:none; white-space:nowrap;"><span\nstyle="padding:10px 4px; margin:0px; text-align:center; background:none;">2010\nCensus</span></a><span style="width:1px; overflow:hidden; border-left:1px\nsolid #E6EFF6;">&nbsp;</span><a href="/main/www/access.html" 
title="Data Tools"\nstyle="color:#FFF; background:none; font:bold 80% Arial, Helvetica,\nsans-serif; text-decoration:none; white-space:nowrap;"><span\nstyle="padding:10px 4px; margin:0px; text-align:center; background:none;">Data\nTools</span></a><span style="width:1px; overflow:hidden; border-left:1px solid\n#E6EFF6;">&nbsp;</span><a href="/quality/" title="Information Quality"\nstyle="color:#FFF; background:none; font:bold 80%\nArial, Helvetica, sans-serif; text-decoration:none; white-space:nowrap;"><span\nstyle="padding:10px 4px; margin:0px; text-align:center;\nbackground:none;">Information Quality</span></a><span style="width:1px;\noverflow:hidden; border-left:1px solid #E6EFF6;">&nbsp;</span><a\nhref="/mp/www/cat/" title="Product Catalog"\nstyle="color:#FFF; background:none; font:bold 80% Arial, Helvetica,\nsans-serif; text-decoration:none; white-space:nowrap;"><span\nstyle="padding:10px 4px; margin:0px; text-align:center;\nbackground:none;">Product Catalog</span></a><span style="width:1px;\noverflow:hidden; border-left:1px solid #E6EFF6;">&nbsp;</span><a\nhref="/aboutus/contacts.html" title="Contact Us"\nstyle="color:#FFF; background:none; font:bold 80% Arial, Helvetica,\nsans-serif; text-decoration:none; white-space:nowrap;"><span\nstyle="padding:10px 4px; margin:0px; text-align:center;\nbackground:none;">Contact Us</span></a><span style="width:1px;\noverflow:hidden; border-left:1px solid #E6EFF6;">&nbsp;</span><a\nhref="/" title="Home" style="color:#FFF; background:none;\nfont:bold 80% Arial, Helvetica, sans-serif; text-decoration:none;\nwhite-space:nowrap;"><span style="padding:10px 4px; margin:0px;\ntext-align:center; background:none;">Home</span></a></div></div><a\nname="SKIP_FTR" style="visibility:hidden; display:none;"></a></div></div>\n\n<!-- Foresee Survey -->\n<script type="text/javascript" src="/fsrscripts/foresee-trigger.js"></script>\n<!-- End Foresee Survey -->\n<!-- END FOOTER 2:22 PM 7/1/2009 -->\n\n\n\n\n</div>\n</body>\n</html>\n\n'
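The `b'...'` prefix on the output above shows that `urlopen(...).read()` returns `bytes` in Python 3, not `str`. A minimal sketch of fetching and decoding follows. To keep it runnable offline it opens a `data:` URL (supported by `urllib.request` in Python 3.4 and later) built from a fragment of the page above, as a stand-in for the real census.gov address. `fetch_text` and `sample_url` are hypothetical names, and the ISO-8859-1 fallback is an assumption taken from the page's own `<meta>` tag.

```python
import base64
import urllib.request

def fetch_text(url, fallback_charset="iso-8859-1"):
    """Fetch url and return the body decoded to str."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()  # bytes in Python 3
        # Use the charset from the Content-Type header when there is one.
        charset = response.headers.get_content_charset() or fallback_charset
    return raw.decode(charset)

# Offline stand-in for the real page: a data: URL that carries a fragment
# of the HTML shown above.
fragment = b'<div id="worldnumber">6,790,219,427</div>'
sample_url = ("data:text/html;charset=iso-8859-1;base64,"
              + base64.b64encode(fragment).decode("ascii"))

page = fetch_text(sample_url)
print(type(page).__name__, page)
```

With network access, the same helper could presumably be called with the `url` from the session instead of `sample_url`.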
>>> pattern='<h1>([0-9,]+)</h1>'
>>> match=re.search(pattern, text)
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
  File "C:\Program Files\Python311\Lib\re.py", line 157, in search
    return _compile(pattern, flags).search(string)
TypeError: can't use a string pattern on a bytes-like object
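The TypeError above arises because `pattern` is a `str` while `text` is `bytes`; the `re` module requires the pattern and the searched text to be of the same type. A minimal sketch of the two usual ways out, run on a short fragment of the page rather than a live download (the variable names are illustrative, and ISO-8859-1 is assumed from the page's `<meta>` tag):

```python
import re

# A fragment of the downloaded page, as bytes (what urlopen().read() returns).
text = b'<div id="worldnumber">6,790,219,427</div><p></p>'

# Option 1: decode the bytes to str and keep the str pattern.
str_pattern = r'<div id="worldnumber">([0-9,]+)</div>'
match = re.search(str_pattern, text.decode("iso-8859-1"))
print(match.group(1))  # a str

# Option 2: keep the bytes and make the pattern bytes as well.
bytes_pattern = rb'<div id="worldnumber">([0-9,]+)</div>'
bmatch = re.search(bytes_pattern, text)
print(bmatch.group(1))  # a bytes object
```

Even with matching types, the `<h1>([0-9,]+)</h1>` pattern would find nothing on this page, because the only `<h1>` holds the title and the number sits in the `worldnumber` div.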
>>> pattern
'<h1>([0-9,]+)</h1>'
>>> match=re.compile(pattern, 0).search(text)
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
TypeError: can't use a string pattern on a bytes-like object
>>> pattern = '<h1>([0-9,]+)</h1>'
>>> match=re.compile(pattern, 0).search(text)
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
TypeError: can't use a string pattern on a bytes-like object
>>> match=re.compile(pattern).search(text)
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
TypeError: can't use a string pattern on a bytes-like object
>>> match=re.compile(pattern, flags).search(text)
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
NameError: name 'flags' is not defined
>>> match=re.compile(pattern, '0').search(text)
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
  File "C:\Program Files\Python311\Lib\re.py", line 205, in compile
    return _compile(pattern, flags)
  File "C:\Program Files\Python311\Lib\re.py", line 273, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Program Files\Python311\Lib\sre_compile.py", line 491, in compile
    p = sre_parse.parse(p, flags)
  File "C:\Program Files\Python311\Lib\sre_parse.py", line 692, in parse
    p = _parse_sub(source, pattern, 0)
  File "C:\Program Files\Python311\Lib\sre_parse.py", line 315, in _parse_sub
    itemsappend(_parse(source, state))
  File "C:\Program Files\Python311\Lib\sre_parse.py", line 408, in _parse
    if state.flags & SRE_FLAG_VERBOSE:
TypeError: unsupported operand type(s) for &: 'str' and 'int'
>>> match=re.compile(pattern, '').search(text)
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
  File "C:\Program Files\Python311\Lib\re.py", line 205, in compile
    return _compile(pattern, flags)
  File "C:\Program Files\Python311\Lib\re.py", line 273, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Program Files\Python311\Lib\sre_compile.py", line 491, in compile
    p = sre_parse.parse(p, flags)
  File "C:\Program Files\Python311\Lib\sre_parse.py", line 692, in parse
    p = _parse_sub(source, pattern, 0)
  File "C:\Program Files\Python311\Lib\sre_parse.py", line 315, in _parse_sub
    itemsappend(_parse(source, state))
  File "C:\Program Files\Python311\Lib\sre_parse.py", line 408, in _parse
    if state.flags & SRE_FLAG_VERBOSE:
TypeError: unsupported operand type(s) for &: 'str' and 'int'
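The two tracebacks above come from passing a string (`'0'` or `''`) as the `flags` argument. `re.compile` expects an integer there, normally built from constants such as `re.IGNORECASE`, and the argument can simply be left out when no flags are needed. A small sketch, using an uppercase variant of the tag purely for illustration:

```python
import re

html = '<DIV ID="worldnumber">6,790,219,427</DIV>'
pattern = r'<div id="worldnumber">([0-9,]+)</div>'

# No flags: the match is case-sensitive, so the uppercase tag is not found.
plain = re.compile(pattern)
print(plain.search(html))  # None

# Flags are integer constants; combine several with the | operator.
relaxed = re.compile(pattern, re.IGNORECASE | re.DOTALL)
print(relaxed.search(html).group(1))
```

Flags do not affect the bytes-versus-str mismatch, which is why every variant in the session still ends in the same TypeError.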
>>> match=re.compile(pattern, 0).search(text)
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
TypeError: can't use a string pattern on a bytes-like object
>>> pattern='<div id="worldnumber">([0-9,]+)</div>'
>>> match=re.compile(pattern, 0).search(text)
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
TypeError: can't use a string pattern on a bytes-like object
>>> pattern='<div id=""worldnumber"">([0-9,]+)</div>'
>>> match=re.compile(pattern, 0).search(text)
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
TypeError: can't use a string pattern on a bytes-like object
>>> pattern='<div([0-9,]+)</div>'
>>> match=re.search(pattern, text)
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
  File "C:\Program Files\Python311\Lib\re.py", line 157, in search
    return _compile(pattern, flags).search(string)
TypeError: can't use a string pattern on a bytes-like object
>>> pattern='<div([0-9,]+)</div>'
>>> match=re.search(pattern, text)
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
  File "C:\Program Files\Python311\Lib\re.py", line 157, in search
    return _compile(pattern, flags).search(string)
TypeError: can't use a string pattern on a bytes-like object
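Once the text is decoded, the values the session was after can be pulled out and converted to integers. The sketch below works on a shortened copy of the page body from the output above. `parse_popclock` and `to_int` are hypothetical helpers, and the patterns assume the page keeps the `worldnumber` div and the date-then-number table layout shown there. Inside a single-quoted Python string the double quotes around `worldnumber` need neither doubling nor escaping.

```python
import re

# Shortened copy of the relevant part of the page (already decoded to str).
page = (
    '<div id="worldnumber">6,790,219,427</div><p></p>\n'
    '<pre>\n'
    '07/01/09    6,768,167,712\n'
    '08/01/09    6,774,705,647\n'
    '09/01/09    6,781,243,583\n'
    '</pre>\n'
)

def to_int(number_text):
    """Turn '6,790,219,427' into 6790219427."""
    return int(number_text.replace(",", ""))

def parse_popclock(html):
    """Return (current_estimate, {date: monthly_figure}) from the page text."""
    current = re.search(r'<div id="worldnumber">([0-9,]+)</div>', html)
    if current is None:
        raise ValueError("worldnumber div not found")
    # Each table row is a date, some spaces, then a comma-grouped number.
    rows = re.findall(r'^(\d{2}/\d{2}/\d{2})\s+([0-9,]+)\s*$', html, re.MULTILINE)
    monthly = {date: to_int(value) for date, value in rows}
    return to_int(current.group(1)), monthly

current, monthly = parse_popclock(page)
print(current)
print(monthly["08/01/09"])
```

Regular expressions are adequate for a page this regular; for less predictable markup an HTML parser such as the standard library's `html.parser` would likely be more robust.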
>>> del pattern, re, text, url, urllib
>>> 
>>> import httplib, re
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
ImportError: No module named httplib
>>> import httplib
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
ImportError: No module named httplib
>>> import http, re
>>> host='www.census.gov''
  File "<string>", line None
SyntaxError: EOL while scanning string literal (<interactive input>, line 1)
>>> import http.__package__
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
ImportError: No module named __package__
>>> import http.__file__
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
ImportError: No module named __file__
>>> import http.request
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
ImportError: No module named request
>>> import http.client
>>> h=http.client.HTTPConnection(host)
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
NameError: name 'host' is not defined
>>> host
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
NameError: name 'host' is not defined
>>> host='www.census.gov'
>>> h=http.client.HTTPConnection(host)
>>> h
<http.client.HTTPConnection object at 0x04DB48D0>
>>> h.putrequest('GET', '/cgi-bin/ipc/popclockw')
>>> h.putheader('Host', host)
>>> h.putheader('Accept', 'text/html')
>>> h.putheader('Cache-Control', 'no-cache')
>>> h.endheaders()
>>> errcode, errmsg, headers=h.getresponse()
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
ValueError: need more than 0 values to unpack
>>> errcode
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
NameError: name 'errcode' is not defined
>>> h
<http.client.HTTPConnection object at 0x04DB48D0>
>>> headers=h.getresponse()
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
  File "C:\Program Files\Python311\Lib\http\client.py", line 992, in getresponse
    raise ResponseNotReady(self.__state)
http.client.ResponseNotReady: Idle
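`getresponse()` in `http.client` returns a single `HTTPResponse` object rather than an `(errcode, errmsg, headers)` tuple, which is why the unpacking above fails. That first call had most likely already taken the response and left the connection idle, so the second `getresponse()` raises `ResponseNotReady`. A minimal sketch of the intended flow follows. So that it runs without network access it talks to a throwaway server on 127.0.0.1 instead of www.census.gov; `PopClockHandler` and its canned body are a hypothetical stand-in for the real page.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = b'<div id="worldnumber">6,790,219,427</div>'

class PopClockHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the census.gov page: always serves BODY."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=iso-8859-1")
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()
        self.wfile.write(BODY)

    def log_message(self, format, *args):
        pass  # keep the example quiet

# HTTPServer needs an (address, port) tuple and a handler class.
# Port 0 asks the OS for any free port.
server = HTTPServer(("127.0.0.1", 0), PopClockHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
try:
    # request() sends the request line, the Host header and any extra headers.
    conn.request("GET", "/cgi-bin/ipc/popclockw",
                 headers={"Accept": "text/html", "Cache-Control": "no-cache"})
    response = conn.getresponse()  # one HTTPResponse object, once per request
    status, reason = response.status, response.reason
    content_type = response.getheader("Content-Type")
    body = response.read()  # bytes
finally:
    conn.close()
    server.shutdown()
    server.server_close()

print(status, reason, content_type)
print(body.decode("iso-8859-1"))
```

`request()` replaces the separate `putrequest`, `putheader` and `endheaders` calls from the session; the lower-level calls work as well, as long as `getresponse()` is called exactly once afterwards and its result is kept.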
>>> 
>>> del h, host, http, re
>>> import http.server
>>> import re
>>> host = 'www.census.gov'
>>> h=http.server.HTTPServer(host)
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
TypeError: __init__() takes at least 3 positional arguments (2 given)
>>> host
'www.census.gov'
>>> h=http.server.http(host)
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
TypeError: 'module' object is not callable
>>> h=http.client.HTTPConnection(host)
>>> h.putrequest('GET', '/cgi-bin/ipc/popclockw')
>>> h.putheader('Host', host)
>>> h.putheader('Accept', 'text/html')
>>> h.putheader('Cache-Control', 'no-cache')
>>> h.endheaders()
>>> errcode
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
NameError: name 'errcode' is not defined
>>> errmsg
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
NameError: name 'errmsg' is not defined
>>> headers=h.getresponse
>>> headers=h.getresponse()
>>> headers
<http.client.HTTPResponse object at 0x04E65E50>
>>> f=h.getresponse()
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
  File "C:\Program Files\Python311\Lib\http\client.py", line 992, in getresponse
    raise ResponseNotReady(self.__state)
http.client.ResponseNotReady: Idle
>>> f=h.getresponse()
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
  File "C:\Program Files\Python311\Lib\http\client.py", line 992, in getresponse
    raise ResponseNotReady(self.__state)
http.client.ResponseNotReady: Idle
>>> f=h.getresponse()
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
  File "C:\Program Files\Python311\Lib\http\client.py", line 992, in getresponse
    raise ResponseNotReady(self.__state)
http.client.ResponseNotReady: Idle
>>> f=h.getresponse
>>> f
<bound method HTTPConnection.getresponse of <http.client.HTTPConnection object at 0x04DB48D0>>
>>> f.read()
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
AttributeError: 'function' object has no attribute 'read'
>>> f=h.response_class
>>> f
<class 'http.client.HTTPResponse'>
>>> f.read()
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
TypeError: read() takes at least 1 positional argument (0 given)
>>> f.read
<function read at 0x0500DC48>
>>> text=f.read
>>> text
<function read at 0x0500DC48>
>>> f.readall()
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
TypeError: descriptor 'readall' of '_io._RawIOBase' object needs an argument
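The last few errors share one cause: `read()` has to be called on the `HTTPResponse` instance that `getresponse()` returns. `h.getresponse` without parentheses is only the bound method, and `h.response_class` is the class itself, so neither holds any data. A short sketch of the distinction; creating the connection object does not contact the server, so this runs offline:

```python
import http.client
import inspect

h = http.client.HTTPConnection("www.census.gov")  # no network traffic yet

method = h.getresponse    # missing (): a bound method, not a response
klass = h.response_class  # the HTTPResponse class, not an instance

print(inspect.ismethod(method))           # True
print(hasattr(method, "read"))            # False: hence the AttributeError
print(inspect.isclass(klass))             # True
print(klass is http.client.HTTPResponse)  # True

# read exists on the class, but it needs an instance to work on;
# that instance is what h.getresponse() returns after a request is sent.
print(callable(klass.read))               # True
```

In the session the one successful call, `headers=h.getresponse()`, did return such an instance, so `headers.read()` at that point would have been the place to get the page body.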
>>> 