Python module BeautifulSoup: extracting data from HTML or XML files

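1. Installation

Beautiful Soup is an HTML/XML parser whose main job is parsing HTML/XML and extracting data from it.

lxml only traverses the document partially, whereas Beautiful Soup is based on the HTML DOM: it loads the whole document and parses the entire DOM tree. Its time and memory overhead are therefore much higher, and its performance is lower than that of lxml.

BeautifulSoup makes parsing HTML fairly simple and its API is very user-friendly. It supports CSS selectors and the HTML parser in the Python standard library, and it also supports the lxml XML parser.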

pip install beautifulsoup4
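
Once the package is installed, the CSS selector support mentioned above can be tried on a small inline document. The sketch below is only an illustration: the markup, class names, and variable names are invented for this example, and it uses the built-in html.parser so that no additional parser package is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example.
html = """
<ul class="listing">
  <li class="info"><a class="title" href="/a">Flat A</a><span class="price">100</span></li>
  <li class="info"><a class="title" href="/b">Flat B</a><span class="price">200</span></li>
</ul>
"""

# html.parser ships with Python, so no extra parser package is needed here.
soup = BeautifulSoup(html, "html.parser")

# select() returns a list of all matches; select_one() returns the first match or None.
titles = [a.get_text(strip=True) for a in soup.select(".info .title")]
first_price = soup.select_one(".info .price").get_text()

print(titles, first_price)
```

Because select() always returns a list, indexing it with [0] raises IndexError when nothing matches; select_one() returns None in that case, which is easier to check.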

2. Usage example

from bs4 import BeautifulSoup
import requests
import asyncio
import functools
import re

house_info = []

# Asynchronously fetch one page of Lianjia listings
async def get_page(page_index):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36'
    }
    request = functools.partial(requests.get, f'https://sh.lianjia.com/ershoufang/pudong/pg{page_index}/',
                                headers=headers)
    loop = asyncio.get_running_loop()
    response = await loop.run_in_executor(None, request)
    return response


# Extract listing information with CSS selectors
def get_house_info(soup):
    house_info_list = soup.select('.info')  # one element per listing
    reg = re.compile(r'\s')  # \s already matches newlines
    for html in house_info_list:

        house_info.append({
            'title': re.sub(reg,'',html.select('.title a')[0].getText()),
            'house_pattern': re.sub(reg,'',html.select('.houseInfo')[0].getText()),
            'price': re.sub(reg,'',html.select('.unitPrice')[0].getText()),
            'location': re.sub(reg,'',html.select('.positionInfo')[0].getText()),
            'total': re.sub(reg,'',html.select('.totalPrice')[0].getText())
        })

# Fetch the first page asynchronously, collect its listings, and print them
async def get_first_page():
    response = await get_page(1)
    soup = BeautifulSoup(response.text, 'lxml')
    get_house_info(soup)
    print(house_info)


if __name__ == '__main__':
    asyncio.run(get_first_page())
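
The example above wraps the blocking requests.get call with functools.partial and hands it to run_in_executor. The sketch below isolates that pattern without any network access. The function blocking_fetch is a hypothetical stand-in for requests.get, and the page contents are fake.

```python
import asyncio
import functools
import time


def blocking_fetch(page_index, delay=0.1):
    # Hypothetical stand-in for requests.get(): blocks, then returns fake content.
    time.sleep(delay)
    return f"<html>page {page_index}</html>"


async def fetch_page(page_index):
    loop = asyncio.get_running_loop()
    # partial() binds the keyword argument, because run_in_executor()
    # only forwards positional arguments to the callable.
    call = functools.partial(blocking_fetch, page_index, delay=0.1)
    # None selects the event loop's default thread pool executor.
    return await loop.run_in_executor(None, call)


async def fetch_pages(page_indexes):
    # gather() runs the coroutines concurrently and preserves input order.
    return await asyncio.gather(*(fetch_page(i) for i in page_indexes))


pages = asyncio.run(fetch_pages([1, 2, 3]))
print(pages)
```

With this shape, several pages can be requested at the same time while the parsing code stays ordinary synchronous code.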

3. Creating the soup object

soup = BeautifulSoup(markup="", features=None, builder=None, parse_only=None, from_encoding=None, exclude_encodings=None, element_classes=None)

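markup: the HTML or XML document to parse. It can be a string or an open file object.
features: the name or type of the parser to use, such as "html.parser", "lxml", "xml", or "html5lib". The default is None, in which case Beautiful Soup picks the best parser that is installed and issues a warning, so it is better to name the parser explicitly.
builder: a tree builder class or instance to use instead of looking one up through features. The default is None, and it rarely needs to be set.
parse_only: a SoupStrainer object. Only the parts of the document that match it are parsed.
from_encoding: the character encoding of the document. The default is None, which means the encoding is detected automatically.
exclude_encodings: a list of encodings to rule out during automatic encoding detection.
element_classes: a dictionary that maps the default element classes (such as Tag) to the classes that should be used instead. The default is None, which means the default classes are used.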
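
As a small illustration of the parse_only parameter, the sketch below parses only the a tags of an inline document. The markup and variable names are invented for this example, and html.parser is chosen so that no extra parser package is assumed.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical markup, invented for this example.
html = '<html><body><p>Intro</p><a href="/one">One</a><a href="/two">Two</a></body></html>'

# The strainer tells the parser to keep only <a> tags.
only_links = SoupStrainer("a")
soup = BeautifulSoup(html, "html.parser", parse_only=only_links)

hrefs = [a["href"] for a in soup.find_all("a")]
has_paragraph = soup.find("p") is not None

print(hrefs, has_paragraph)
```

Everything outside the matching tags is skipped while parsing, which is why the p element cannot be found afterwards. The Beautiful Soup documentation notes that parse_only does not work with the html5lib parser.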

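Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup,"html.parser")	Built into Python; decent speed; lenient with malformed documents	Less lenient in Python versions before 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup,"lxml")	Very fast; lenient with malformed documents	Requires an external C library
lxml XML parser	BeautifulSoup(markup,"xml")	Very fast; the only XML parser supported	Requires an external C library
html5lib	BeautifulSoup(markup,"html5lib")	Most lenient; parses documents the way a web browser does; produces valid HTML5	Very slow; requires the external html5lib Python package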

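4. The soup object

soup.prettify(encoding=None, formatter="minimal"): returns the HTML or XML document as a formatted string. The content is indented with each tag on its own line, which makes it easier to read.
soup.title: returns the document's <title> tag.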
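
A minimal sketch of these two members, assuming a tiny inline document made up for this example and the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example.
html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title           # the <title> Tag object
title_text = soup.title.string   # the text inside the tag
pretty = soup.prettify()         # indented, one tag per line

print(title_text)
print(pretty)
```

soup.title returns a Tag object rather than plain text, so .string (or .get_text()) is needed to obtain the title text itself.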

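Problem

We need to access various services over HTTP as a client, for example to download data or to interact with a REST-based API.

Solution

For simple tasks, the urllib.request module is usually enough. For example, to send a simple HTTP GET request to a remote server, do this: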

from urllib import request, parse
# Base URL being accessed
url = 'http://httpbin.org/get'
# Dictionary of query parameters (if any)
parms = {
 'name1' : 'value1',
 'name2' : 'value2'
}
# Encode the query string
querystring = parse.urlencode(parms)
# Make a GET request and read the response
u = request.urlopen(url+'?' + querystring)
resp = u.read()
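
To see what parse.urlencode actually produces without contacting a server, the following sketch builds the query string and then decodes it again. The parameter values are invented for this example.

```python
from urllib import parse

# Hypothetical parameters; the second value contains characters that need escaping.
params = {"name1": "value1", "q": "a b&c"}

# urlencode() escapes each value and joins the pairs with '&'.
querystring = parse.urlencode(params)
full_url = "http://httpbin.org/get" + "?" + querystring

# parse_qs() reverses the encoding; each key maps to a list of values.
decoded = parse.parse_qs(parse.urlsplit(full_url).query)

print(querystring)
print(decoded)
```

Spaces become + and the literal & inside a value becomes %26, so the value cannot be confused with the separator between parameters.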

If you need to send the query parameters in the request body using the POST method, encode them and pass them as an optional argument to urlopen(), like this:

from urllib import request, parse
# Base URL being accessed
url = 'http://httpbin.org/post'
# Dictionary of query parameters (if any)
parms = {
 'name1' : 'value1',
 'name2' : 'value2'
}
# Encode the query string
querystring = parse.urlencode(parms)
# Make a POST request and read the response
u = request.urlopen(url, querystring.encode('ascii'))
resp = u.read()

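If you need to supply custom HTTP headers in the outgoing request, such as a different user-agent field, create a dictionary containing the field values, create a Request instance, and pass it to urlopen(). For example: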

from urllib import request, parse
...
# Extra headers
headers = {
 'User-agent' : 'none/ofyourbusiness',
 'Spam' : 'Eggs'
}
req = request.Request(url, querystring.encode('ascii'), headers=headers)
# Make a request and read the response
u = request.urlopen(req)
resp = u.read()
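
A Request object can be inspected before it is sent, which is a convenient way to check the method, headers, and body. The sketch below reuses the URL and header values from the text and never opens a connection. The single form parameter is an assumption made for this example.

```python
from urllib import parse, request

url = "http://httpbin.org/post"
# The body must be bytes, so the encoded query string is converted explicitly.
body = parse.urlencode({"name1": "value1"}).encode("ascii")
headers = {"User-agent": "none/ofyourbusiness", "Spam": "Eggs"}

req = request.Request(url, body, headers=headers)

# Supplying a body makes urllib use POST instead of GET.
method = req.get_method()
user_agent = req.get_header("User-agent")

print(method, req.full_url, user_agent, req.data)
```

Nothing is transmitted until the object is passed to urlopen(), so this kind of check is safe to run offline.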

If the service you need to interact with is more complex than the examples above, you should probably look at the requests library (http://pypi.python.org/pypi/requests). For example, here is the same operation reimplemented with requests:

import requests
# Base URL being accessed
url = 'http://httpbin.org/post'
# Dictionary of query parameters (if any)
parms = {
 'name1' : 'value1',
 'name2' : 'value2'
}
# Extra headers
headers = {
 'User-agent' : 'none/ofyourbusiness',
 'Spam' : 'Eggs'
}
resp = requests.post(url, data=parms, headers=headers)
# Decoded text returned by the request
text = resp.text

One notable feature of requests is that it can return the response content in several forms. In the code above, resp.text gives the response text decoded to Unicode. Accessing resp.content instead gives the raw binary data, and calling resp.json() gives the response content parsed as JSON.
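
The relationship between these three views can be imitated with the standard library alone. The sketch below does not use requests. It only mirrors the idea with a hand-written payload, and it assumes UTF-8 as the encoding, whereas requests works the encoding out from the response itself.

```python
import json

# Hypothetical response body as raw bytes (comparable to resp.content).
raw = '{"city": "Zürich", "n": 37}'.encode("utf-8")

# Decoding the bytes yields text (comparable to resp.text).
text = raw.decode("utf-8")

# Parsing the text yields Python objects (comparable to resp.json()).
data = json.loads(text)

print(type(raw).__name__, type(text).__name__, type(data).__name__)
```

The byte string is one element longer than the text here because the character ü occupies two bytes in UTF-8.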

The following example uses requests to make a HEAD request and extract a few HTTP header fields from the response:

import requests
resp = requests.head('http://www.python.org/index.html')
status = resp.status_code
last_modified = resp.headers['last-modified']
content_type = resp.headers['content-type']
content_length = resp.headers['content-length']

The following example uses requests to log in to the Python Package Index (PyPI) with basic authentication:

import requests
resp = requests.get('http://pypi.python.org/pypi?:action=login',
 auth=('user','password'))

The following example uses requests to pass the HTTP cookies received from the first request on to the next request:

import requests
# First request
resp1 = requests.get(url)
...
# Second requests with cookies received on first requests
resp2 = requests.get(url, cookies=resp1.cookies)

Last but not least, the following example uses requests to upload content:

import requests
url = 'http://httpbin.org/post'
files = { 'file': ('data.csv', open('data.csv', 'rb')) }
r = requests.post(url, files=files)

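Discussion

For really simple HTTP client code, the built-in urllib module is usually enough. But if you need to do more than simple GET or POST requests, you really cannot rely on it any longer. That is where third-party modules such as requests come into their own.

For example, if we decided to stick with the standard library rather than a third-party library such as requests, we might have to write our code with the low-level http.client module. The following code shows how to perform a HEAD request: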

from http.client import HTTPConnection
from urllib import parse
c = HTTPConnection('www.python.org', 80)
c.request('HEAD', '/index.html')
resp = c.getresponse()
print('Status', resp.status)
for name, value in resp.getheaders():
 print(name, value)
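
The same HEAD request can be tried without depending on an external site by pointing http.client at a throwaway local server. This is only a sketch: HeadHandler and head_request are hypothetical names introduced for the example, and the server answers every HEAD request with a fixed set of headers.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class HeadHandler(BaseHTTPRequestHandler):
    # Hypothetical handler that answers every HEAD request with fixed headers.
    def do_HEAD(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def head_request(host, port, path):
    # Issue a HEAD request and return the status code and headers.
    conn = HTTPConnection(host, port, timeout=5)
    try:
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, dict(resp.getheaders())
    finally:
        conn.close()


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), HeadHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

status, headers = head_request("127.0.0.1", server.server_address[1], "/index.html")

server.shutdown()
server.server_close()

print(status, headers.get("Content-Type"))
```

A HEAD response carries headers only, so there is no body to read after getresponse().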

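Similarly, if you have to write code involving proxies, authentication, cookies, and other such details, using urllib is awkward and verbose. For example, the following code implements authentication against the Python Package Index: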

import urllib.request
auth = urllib.request.HTTPBasicAuthHandler()
auth.add_password('pypi','http://pypi.python.org','username','password')
opener = urllib.request.build_opener(auth)
r = urllib.request.Request('http://pypi.python.org/pypi?:action=login')
u = opener.open(r)
resp = u.read()
# From here. You can access more pages using opener
...

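Frankly, all of these operations are much simpler with requests.

Testing HTTP client code during development is often frustrating because of all the tricky details that need to be considered (cookies, authentication, HTTP headers, encodings, and so on). For this, consider using the httpbin service (http://httpbin.org). The site receives the requests you send and echoes the information back as a JSON response. Here is an interactive example: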

>>> import requests
>>> r = requests.get('http://httpbin.org/get?name=Dave&n=37',
... headers = { 'User-agent': 'goaway/1.0' })
>>> resp = r.json()
>>> resp['headers']
{'User-Agent': 'goaway/1.0', 'Content-Length': '', 'Content-Type': '',
'Accept-Encoding': 'gzip, deflate, compress', 'Connection':
'keep-alive', 'Host': 'httpbin.org', 'Accept': '*/*'}
>>> resp['args']
{'name': 'Dave', 'n': '37'}
>>>

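Before interacting with a real site, it is often a good idea to experiment on a site such as httpbin.org first. This is especially useful when you face risks such as an account being locked after three failed logins (do not try to write your own HTTP authentication client to log in to your bank account).

Although it is not covered in this section, requests also supports many advanced HTTP client protocols, such as OAuth. The requests documentation (http://docs.python-requests.org) is of high quality (frankly, better than anything that could be provided in this short section). Consult it for more information.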

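Calling a Python script from PHP to convert Word to HTML, and handling call failures

Background: our company ran into a publishing problem. Many people like to write their articles in Word and then publish them to the website. PHP has some usable libraries, but none of them handled images well, so I thought of Python. After some research, converting Word to HTML with Python turned out to be really convenient. But how can it be integrated with the server, which is developed in PHP?

1: Python script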

#!/usr/bin/python
# -*- coding: UTF-8 -*-
import sys
from pydocx import PyDocX

# Python 2 only: force UTF-8 as the default string encoding
reload(sys)
sys.setdefaultencoding('utf8')

FileName = sys.argv[1]   # file name argument: the Word file to convert
ShortName = sys.argv[2]  # short file name argument
html = PyDocX.to_html(FileName)
# f = open("/www/wwwroot/micuer.com/pythoncode/runtime/99.txt", 'w')  # full path on the server
# f.write(html)
# f.close()
print(html)

2: PHP handler script

public function uploadword(){
    try {
        $file = request()->file("file");
        // Upload to the local server
        $savename = \think\facade\Filesystem::disk('upload')->putFile( 'word', $file);
        $shotrname = time().".txt"; // short name
        $savename = "/www/wwwroot/micuer.com/data/upload/".$savename; //Request::domain().
        $python_file_name = "/www/wwwroot/micuer.com/pythoncode/WordToHtml.py";
        // Build the command
        $cmd = "python {$python_file_name} ".$savename." {$shotrname}  2>error.txt 2>&1";
        $res = exec($cmd,$array, $ret);
        return json(["code"=>200,"msg"=>"成功","data"=>$savename,"cmd"=>$cmd,"array"=>$array]);
    } catch (think\exception\ValidateException $e) {
        return json(["code"=>40000,"msg"=>$e->getMessage()]);
    }
}

[Screenshot: the upload page]

The feature works by using PHP's exec function to call the Python script and return the HTML code to the front end.

[Screenshot: the returned data]

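While working on this solution I ran into quite a few problems. For example, the command succeeded on the command line but failed when it was run through the exec function.
I consulted this reference: https://my.oschina.net/u/4427610/blog/3155816
The key is the following call: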

exec("python python_test.py 2>error.txt 2>&1", $array, $ret);

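In bash, the numbers 0, 1, and 2 stand for STDIN_FILENO, STDOUT_FILENO, and STDERR_FILENO: standard input (usually the keyboard), standard output (usually the screen, or more precisely the user's terminal), and standard error (error messages).
The 2>&1 in the call above redirects standard error to standard output, so error messages are also saved in $array.
After printing $array, I found that the failure was a permissions problem. So I switched to direct output, that is, the print(html) call in the Python script.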
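
The effect of 2>&1 can be reproduced from Python with the subprocess module, which makes it easy to experiment with outside PHP. This is only an analogy to the exec call above, not the article's code: the child command and both messages are invented for the example.

```python
import subprocess
import sys

# Hypothetical child process that writes one line to stdout and one to stderr.
child_code = "import sys; print('html output'); sys.stderr.write('permission denied\\n')"

result = subprocess.run(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,  # same effect as 2>&1 in the shell
    text=True,
)

# Comparable to $array in PHP: one entry per line of combined output.
lines = result.stdout.splitlines()
print(lines)
```

Because both streams end up in the same pipe, an error message from the child appears next to its normal output, which is how the permissions problem became visible. The relative order of the two lines is not guaranteed.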

Points to note:
1: Execute permissions
2: In exec("python python_test.py 2>error.txt 2>&1", $array, $ret); the $array variable receives the value printed by print(html)
3: Use full paths for every script wherever possible
