Text Classification with NLP: Comparing Three Models (Tf-Idf, Word2Vec, and BERT)

English original: Text Classification with NLP: Tf-Idf vs Word2Vec vs BERT

Translation: 雷鋒字幕組 (關山, wiige)

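Summary

In this article, I use NLP and Python to explain three different strategies for multiclass text classification: the old-fashioned Bag-of-Words (with Tf-Idf), the famous word embeddings (with Word2Vec), and the state-of-the-art language models (with BERT).

NLP (Natural Language Processing) is the field of artificial intelligence that studies the interactions between computers and human languages, in particular how to program computers to process and analyze large amounts of natural language data. NLP is often applied to classifying text data. Text classification is the problem of assigning categories to text according to its content.

There are several techniques for extracting information from raw text and using it to train a classification model. This tutorial compares the traditional Bag-of-Words approach (used with a simple machine learning algorithm), the popular word embedding model (used with a deep learning neural network), and the state-of-the-art language model (used with transfer learning from attention-based transformers), which has reshaped the NLP landscape.

I will present some useful Python code that can easily be applied to similar cases (just copy, paste, run) and comment every line so that you can replicate the example (the full code is linked below).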

mdipietro09/DataScience_ArtificialIntelligence_Utils

I will use the "News Category Dataset", which provides news headlines from 2012 to 2018 obtained from HuffPost. The task is to assign each headline to the right category, which makes this a multiclass classification problem (the dataset is linked below).

News Category Dataset

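In particular, I will go through:

Setup: import packages, read data, preprocessing, partitioning.
Bag-of-Words: feature engineering, feature selection, and machine learning with scikit-learn, testing and evaluation, explainability with lime.
Word Embedding: fitting a Word2Vec with gensim, feature engineering and deep learning with tensorflow/keras, testing and evaluation, explainability with the Attention mechanism.
Language model: feature engineering with transformers, transfer learning from a pre-trained BERT with transformers and tensorflow/keras, testing and evaluation.

Setup

First of all, we need to import the following libraries: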

## for data
import json
import re
import pandas as pd
import numpy as np
## for plotting
import matplotlib.pyplot as plt
import seaborn as sns
## for text preprocessing (used below; needs the nltk "stopwords" and "wordnet" data)
import nltk
## for bag-of-words
from sklearn import feature_extraction, feature_selection, model_selection, naive_bayes, pipeline, manifold, preprocessing, metrics
## for explainer
from lime import lime_text
## for word embedding
import gensim
import gensim.downloader as gensim_api
## for deep learning
from tensorflow.keras import models, layers, preprocessing as kprocessing
from tensorflow.keras import backend as K
## for bert language model
import transformers

The dataset is stored in a JSON file, so I first read it into a list of dictionaries with json and then convert it into a pandas DataFrame.

lst_dics = []
with open('data.json', mode='r', errors='ignore') as json_file:
    for dic in json_file:
        lst_dics.append( json.loads(dic) )
## print the first one
lst_dics[0]

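The original dataset contains over 30 categories, but for the purposes of this tutorial I will work with a subset of 3: Entertainment, Politics, and Tech.

## create dtf
dtf = pd.DataFrame(lst_dics)
## filter categories
dtf = dtf[ dtf["category"].isin(['ENTERTAINMENT','POLITICS','TECH']) ][["category","headline"]]
## rename columns
dtf = dtf.rename(columns={"category":"y", "headline":"text"})
## print 5 random rows
dtf.sample(5)

The class distribution shows that the dataset is imbalanced: the share of Tech news is very small compared with the other categories, which makes Tech news hard for a model to recognize.

Before explaining and building the models, I will give an example of preprocessing: cleaning the text, removing stop words, and applying lemmatization. I write a function and apply it to the whole dataset.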

def utils_preprocess_text(text, flg_stemm=False, flg_lemm=True, lst_stopwords=None):
    '''
    Preprocess a string.
    :parameter
        :param text: string - text to preprocess
        :param lst_stopwords: list - list of stopwords to remove
        :param flg_stemm: bool - whether stemming is to be applied
        :param flg_lemm: bool - whether lemmatisation is to be applied
    :return
        cleaned text
    '''
    ## clean (convert to lowercase, remove punctuation and special characters, then strip)
    text = re.sub(r'[^\w\s]', '', str(text).lower().strip())

    ## Tokenize (convert from string to list)
    lst_text = text.split()

    ## remove Stopwords
    if lst_stopwords is not None:
        lst_text = [word for word in lst_text if word not in lst_stopwords]

    ## Stemming (remove -ing, -ly, ...)
    if flg_stemm == True:
        ps = nltk.stem.porter.PorterStemmer()
        lst_text = [ps.stem(word) for word in lst_text]

    ## Lemmatisation (convert the word into root word)
    if flg_lemm == True:
        lem = nltk.stem.wordnet.WordNetLemmatizer()
        lst_text = [lem.lemmatize(word) for word in lst_text]

    ## back to string from list
    text = " ".join(lst_text)
    return text

The function removes a set of words from the corpus if one is given. With nltk we can create a list of generic stop words for the English vocabulary (the list can be edited by adding or removing words).

lst_stopwords = nltk.corpus.stopwords.words("english")
lst_stopwords

Now I apply the function to the whole dataset and store the result in a new column named "text_clean", so that you can choose to work with either the raw corpus or the preprocessed text.

dtf["text_clean"] = dtf["text"].apply(lambda x: utils_preprocess_text(x, flg_stemm=False, flg_lemm=True, lst_stopwords=lst_stopwords))
dtf.head()

If you are interested in a deeper text analysis and preprocessing, you can check this article. I split the dataset into a training set (70%) and a test set (30%) in order to evaluate the models' performance.

## split dataset
dtf_train, dtf_test = model_selection.train_test_split(dtf, test_size=0.3)
## get target
y_train = dtf_train["y"].values
y_test = dtf_test["y"].values

Let's get started!

Bag-of-Words

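The Bag-of-Words model is simple: it builds a vocabulary from a corpus of documents and counts how many times each word appears in each document. In other words, each word in the vocabulary becomes a feature, and a document is represented by a vector with the same length as the vocabulary (a "bag of words"). For example, take 3 sentences and represent them with this approach:

Shape of the feature matrix: number of documents x length of the vocabulary

As you can imagine, this approach causes a serious dimensionality problem: the more documents you have, the larger the vocabulary, so the feature matrix becomes a huge sparse matrix. For this reason, the Bag-of-Words model is usually preceded by substantial preprocessing (word cleaning, stop word removal, stemming/lemmatization) aimed at reducing the dimensionality.

Term frequency is not necessarily the best representation for text. In fact, some common words occur very frequently in the corpus but have little predictive power over the target variable. To address this, there is an advanced variant of Bag-of-Words that uses term frequency-inverse document frequency (Tf-Idf) instead of simple counts. Basically, the value of a word increases in proportion to its count in a document but is inversely related to how frequently the word appears across the corpus.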
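To make the weighting concrete, here is a minimal sketch of the textbook formulation tf * log(N / df) in plain Python. The function name tf_idf and the three toy sentences are illustrative and do not come from the article. scikit-learn's TfidfVectorizer uses a smoothed idf and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = ["the cat sat", "the dog sat", "the bird flew"]  # toy corpus (assumption)
scores = tf_idf(docs)
print(scores[0])
```

A word that appears in every document ("the") gets weight 0, while a word unique to one document ("cat") gets the highest weight in that document.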

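Let's start with feature engineering, the process of creating features by extracting information from the data. I use the Tf-Idf vectorizer with a limit of 10,000 words (so the vocabulary length is 10,000), capturing unigrams (e.g. "new" and "york") and bigrams (e.g. "new york"). The code for the classic count vectorizer is given as well:

## Count (classic BoW); this line is reconstructed from the truncated fragment in the source
vectorizer = feature_extraction.text.CountVectorizer(max_features=10000, ngram_range=(1,2))
## Tf-Idf (advanced variant of BoW)
vectorizer = feature_extraction.text.TfidfVectorizer(max_features=10000, ngram_range=(1,2))

Now I fit the vectorizer on the preprocessed corpus of the training set to extract the vocabulary and create the feature matrix.

corpus = dtf_train["text_clean"]
vectorizer.fit(corpus)
X_train = vectorizer.transform(corpus)
dic_vocabulary = vectorizer.vocabulary_

The feature matrix X_train has shape 34,265 (number of documents in the training set) x 10,000 (vocabulary length), and it is very sparse:

sns.heatmap(X_train.todense()[:,np.random.randint(0,X_train.shape[1],100)]==0, vmin=0, vmax=1, cbar=False).set_title('Sparse Matrix Sample')

Random sample from the feature matrix (non-zero values in black)

To find the position of a given word, look it up in the vocabulary:

word = "new york"
dic_vocabulary[word]

If the word exists in the vocabulary, this prints a number N, meaning that the Nth feature of the matrix is that word.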

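To reduce the dimensionality of the matrix by dropping some columns, we can carry out feature selection, the process of selecting a subset of relevant variables. I proceed as follows:

treat each category as binary (for example, the "Tech" category is 1 for Tech news and 0 otherwise);
perform a chi-square test to determine whether a feature and the (binary) target are independent;
keep only the features with a certain p-value from the chi-square test.

y = dtf_train["y"]
## get_feature_names_out() replaces get_feature_names(), which the source used (scikit-learn < 1.0)
X_names = vectorizer.get_feature_names_out()
p_value_limit = 0.95
lst_features = []
for cat in np.unique(y):
    chi2, p = feature_selection.chi2(X_train, y==cat)
    lst_features.append(pd.DataFrame({"feature":X_names, "score":1-p, "y":cat}))
## pd.concat replaces DataFrame.append, which was removed in pandas 2.0
dtf_features = pd.concat(lst_features)
dtf_features = dtf_features.sort_values(["y","score"], ascending=[True,False])
dtf_features = dtf_features[dtf_features["score"]>p_value_limit]
X_names = dtf_features["feature"].unique().tolist()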

This reduces the number of features from 10,000 to 3,152 by keeping the most statistically relevant ones. Printing a few of them gives:

for cat in np.unique(y):
    print("# {}:".format(cat))
    print("  . selected features:", len(dtf_features[dtf_features["y"]==cat]))
    print("  . top features:", ",".join(dtf_features[dtf_features["y"]==cat]["feature"].values[:10]))
    print(" ")

We can refit the vectorizer on the corpus, giving it this new set of words as input. This produces a smaller feature matrix and a shorter vocabulary.

vectorizer = feature_extraction.text.TfidfVectorizer(vocabulary=X_names)
vectorizer.fit(corpus)
X_train = vectorizer.transform(corpus)
dic_vocabulary = vectorizer.vocabulary_

The new feature matrix X_train has shape 34,265 (number of documents in training) x 3,152 (length of the given vocabulary). Let's see whether the matrix is less sparse:

Random sample from the new feature matrix (non-zero values in black)

Now it's time to train a machine learning model. I suggest Naive Bayes: a probabilistic classifier based on Bayes' theorem, which makes predictions from prior knowledge of conditions that might be related. It suits a large dataset like this one because it considers each feature independently, computes the probability of each category, and then predicts the category with the highest probability.

classifier = naive_bayes.MultinomialNB()

I train this classifier on the feature matrix and then test it on the transformed test set. For that I need a scikit-learn pipeline: a sequence of transformations followed by a final estimator. Putting the Tf-Idf vectorizer and the Naive Bayes classifier in a pipeline makes it easy to transform and predict the test data.

## pipeline
model = pipeline.Pipeline([("vectorizer", vectorizer), ("classifier", classifier)])
## train classifier
model["classifier"].fit(X_train, y_train)
## test
X_test = dtf_test["text_clean"].values
predicted = model.predict(X_test)
predicted_prob = model.predict_proba(X_test)

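We can now evaluate the Bag-of-Words model with the following metrics:

Accuracy: the fraction of predictions the model got right.
Confusion matrix: a summary table that breaks down the number of correct and incorrect predictions by class.
ROC: a plot of the true positive rate against the false positive rate at various thresholds. The area under the curve (AUC) is the probability that the classifier ranks a randomly chosen positive observation higher than a randomly chosen negative one.
Precision: the number of correctly retrieved samples (TP) as a fraction of all samples actually retrieved (TP+FP).
Recall: the number of correctly retrieved samples (TP) as a fraction of all samples that should have been retrieved (TP+FN).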
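The definitions above can be checked by hand on a handful of labels. The sketch below computes accuracy over all classes and precision and recall for one class treated as the positive one; the helper name binary_metrics and the six toy labels are illustrative assumptions, not data from the article.

```python
def binary_metrics(y_true, y_pred, positive):
    """Accuracy over all classes, plus precision/recall for one 'positive' class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# toy labels (assumption), with Tech as the hard class
y_true = ["TECH", "TECH", "POLITICS", "POLITICS", "ENTERTAINMENT", "TECH"]
y_pred = ["TECH", "POLITICS", "POLITICS", "POLITICS", "ENTERTAINMENT", "ENTERTAINMENT"]
acc, prec, rec = binary_metrics(y_true, y_pred, positive="TECH")
print(acc, prec, rec)
```

Here every headline predicted as Tech really is Tech (high precision), yet most Tech headlines are missed (low recall), which is the pattern an imbalanced class tends to produce.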

classes = np.unique(y_test)
y_test_array = pd.get_dummies(y_test, drop_first=False).values

## Accuracy, Precision, Recall
accuracy = metrics.accuracy_score(y_test, predicted)
## the keyword was truncated to "multi_" in the source; multi_class="ovr" is assumed
auc = metrics.roc_auc_score(y_test, predicted_prob, multi_class="ovr")
print("Accuracy:", round(accuracy,2))
print("Auc:", round(auc,2))
print("Detail:")
print(metrics.classification_report(y_test, predicted))

## Plot confusion matrix
cm = metrics.confusion_matrix(y_test, predicted)
fig, ax = plt.subplots()
sns.heatmap(cm, annot=True, fmt='d', ax=ax, cmap=plt.cm.Blues, cbar=False)
ax.set(xlabel="Pred", ylabel="True", xticklabels=classes, yticklabels=classes, title="Confusion matrix")
plt.yticks(rotation=0)

fig, ax = plt.subplots(nrows=1, ncols=2)
## Plot roc
for i in range(len(classes)):
    fpr, tpr, thresholds = metrics.roc_curve(y_test_array[:,i], predicted_prob[:,i])
    ax[0].plot(fpr, tpr, lw=3, label='{0} (area={1:0.2f})'.format(classes[i], metrics.auc(fpr, tpr)))
## the last argument was truncated to "line" in the source; linestyle='--' is assumed
ax[0].plot([0,1], [0,1], color='navy', lw=3, linestyle='--')
ax[0].set(xlim=[-0.05,1.0], ylim=[0.0,1.05], xlabel='False Positive Rate', ylabel="True Positive Rate (Recall)", title="Receiver operating characteristic")
ax[0].legend(loc="lower right")
ax[0].grid(True)

## Plot precision-recall curve
for i in range(len(classes)):
    precision, recall, thresholds = metrics.precision_recall_curve(y_test_array[:,i], predicted_prob[:,i])
    ax[1].plot(recall, precision, lw=3, label='{0} (area={1:0.2f})'.format(classes[i], metrics.auc(recall, precision)))
ax[1].set(xlim=[0.0,1.05], ylim=[0.0,1.05], xlabel='Recall', ylabel="Precision", title="Precision-Recall curve")
ax[1].legend(loc="best")
ax[1].grid(True)
plt.show()

The Bag-of-Words model classified 85% of the test set correctly (accuracy 0.85) but struggles to recognize Tech news (only 252 predicted correctly).

Let's look into why the model assigns a headline to a given category and whether the predictions can be explained. The lime package helps build an explainer. To illustrate, I take one observation from the test set and see what it shows:

## select observation
i = 0
txt_instance = dtf_test["text"].iloc[i]
## check true value and predicted value
print("True:", y_test[i], "--> Pred:", predicted[i], "| Prob:", round(np.max(predicted_prob[i]),2))
## show explanation
explainer = lime_text.LimeTextExplainer(class_names=np.unique(y_train))
explained = explainer.explain_instance(txt_instance, model.predict_proba, num_features=3)
explained.show_in_notebook(text=txt_instance, predict_proba=False)

That makes sense: the words "Clinton" and "GOP" pointed the model in the right direction (Politics news), even though the word "stage" is more common among Entertainment headlines.

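Word Embedding

Word embedding is the collective name for feature learning techniques that map the words of a vocabulary to vectors of real numbers. The vectors are computed from the probability distribution of each word appearing before or after another. Put differently, words that share a context usually appear together in the corpus, so they also end up close to each other in the vector space. For example, take the 3 sentences from the previous example:

Word embeddings in a 2D vector space

In this tutorial I use the pioneer of this family of models: Google's Word2Vec (2013). Other popular word embedding models are Stanford's GloVe (2014) and Facebook's FastText (2016).

Word2Vec produces a vector space, typically of several hundred dimensions, with one vector for each unique word in the corpus, such that words sharing common contexts in the corpus are located close to one another in the space. The embeddings can be generated in two different ways: by predicting the context from a given word (Skip-gram) or by predicting a word from its context (Continuous Bag-of-Words).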
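The difference between the two training schemes is easiest to see in the training examples each one builds from a sentence. The sketch below only generates those examples; it does not train anything. The function names skipgram_pairs and cbow_examples and the window of 1 are illustrative assumptions.

```python
def skipgram_pairs(tokens, window):
    """Skip-gram: one (center word, context word) pair per neighbor within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """CBOW: one (list of context words, target word) example per position."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples

tokens = ["i", "like", "this", "article"]
pairs = skipgram_pairs(tokens, window=1)
examples = cbow_examples(tokens, window=1)
print(pairs)
print(examples)
```

Skip-gram turns each word into several prediction tasks (one per neighbor), while CBOW produces a single task per word with all neighbors as input.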

In Python, a pre-trained word embedding model can be loaded from gensim-data like this:

nlp = gensim_api.load("word2vec-google-news-300")

Instead of using a pre-trained model, I fit my own Word2Vec on the training data with gensim. Before fitting the model, the corpus has to be transformed into a list of lists of n-grams. Specifically, I try to capture unigrams ("york"), bigrams ("new york"), and trigrams ("new york city").

corpus = dtf_train["text_clean"]
## create list of lists of unigrams
lst_corpus = []
for string in corpus:
    lst_words = string.split()
    lst_grams = [" ".join(lst_words[i:i+1]) for i in range(0, len(lst_words), 1)]
    lst_corpus.append(lst_grams)
## detect bigrams and trigrams
## (gensim 3.x API: delimiter is bytes; gensim 4.x expects a str, i.e. delimiter=" ")
bigrams_detector = gensim.models.phrases.Phrases(lst_corpus, delimiter=" ".encode(), min_count=5, threshold=10)
bigrams_detector = gensim.models.phrases.Phraser(bigrams_detector)
trigrams_detector = gensim.models.phrases.Phrases(bigrams_detector[lst_corpus], delimiter=" ".encode(), min_count=5, threshold=10)
trigrams_detector = gensim.models.phrases.Phraser(trigrams_detector)

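When fitting the Word2Vec, a few parameters need to be set:

the size of the word vectors, set to 300;
the window, i.e. the maximum distance between the current and the predicted word within a sentence; here I use the average length of the texts in the corpus;
the training algorithm, skip-gram (sg=1), since in general it gives better results.

## fit w2v (gensim 3.x argument names; gensim 4.x renamed size to vector_size and iter to epochs)
nlp = gensim.models.word2vec.Word2Vec(lst_corpus, size=300, window=8, min_count=1, sg=1, iter=30)

We now have the embedding model, so any word from the corpus can be turned into a 300-dimensional vector.

word = "data"
## in gensim 4.x the lookup is nlp.wv[word]
nlp[word].shape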

We can even visualize a word and its context in a lower-dimensional space (2D or 3D) by applying a dimensionality reduction algorithm such as TSNE.

word = "data"
fig = plt.figure()
## word embedding
tot_words = [word] + [tupla[0] for tupla in nlp.most_similar(word, topn=20)]
X = nlp[tot_words]
## TSNE (initialized with pca) to reduce dimensionality from 300 to 3
## note: recent scikit-learn versions require perplexity to be smaller than the number of samples (21 here)
pca = manifold.TSNE(perplexity=40, n_components=3, init='pca')
X = pca.fit_transform(X)
## create dtf
dtf_ = pd.DataFrame(X, index=tot_words, columns=["x","y","z"])
dtf_["input"] = 0
dtf_.loc[dtf_.index[0], "input"] = 1
## plot 3d
from mpl_toolkits.mplot3d import Axes3D
ax = fig.add_subplot(111, projection='3d')
ax.scatter(dtf_[dtf_["input"]==0]['x'], dtf_[dtf_["input"]==0]['y'], dtf_[dtf_["input"]==0]['z'], c="black")
ax.scatter(dtf_[dtf_["input"]==1]['x'], dtf_[dtf_["input"]==1]['y'], dtf_[dtf_["input"]==1]['z'], c="red")
ax.set(xlabel=None, ylabel=None, zlabel=None, xticklabels=[], yticklabels=[], zticklabels=[])
for label, row in dtf_[["x","y","z"]].iterrows():
    x, y, z = row
    ax.text(x, y, z, s=label)

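This is pretty cool, but how does word embedding help with a task like predicting news categories? The word vectors can be used as weights in a neural network. Here is how:

First, transform the corpus into padded sequences of word ids to get a feature matrix.
Then, create an embedding matrix so that the vector of the word with id N sits in the Nth row.
Finally, build a neural network with an embedding layer that weights every word in the sequences with the corresponding vector.

Starting again with feature engineering, I use tensorflow/keras to turn the same preprocessed corpus given to Word2Vec (the lists of n-grams) into a list of sequences:

## tokenize text
## the filter string was damaged in the source; the Keras default filter set is used here
tokenizer = kprocessing.text.Tokenizer(lower=True, split=' ', oov_token="NaN", filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')
tokenizer.fit_on_texts(lst_corpus)
dic_vocabulary = tokenizer.word_index
## create sequence
lst_text2seq = tokenizer.texts_to_sequences(lst_corpus)
## padding sequence
X_train = kprocessing.sequence.pad_sequences(lst_text2seq, maxlen=15, padding="post", truncating="post")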

The feature matrix X_train has shape 34,265 x 15 (number of sequences x maximum sequence length). Let's visualize it:

sns.heatmap(X_train==0, vmin=0, vmax=1, cbar=False)
plt.show()

Feature matrix (34,265 x 15)

Every text in the corpus is now an id sequence of length 15. For instance, if a text has 10 tokens, the sequence consists of 10 ids followed by 5 zeros, the padding element (while words that are not in the vocabulary get id 1). Let's print how a text from the training set is transformed into a padded sequence:

i = 0
## list of text: ["I like this", ...]
len_txt = len(dtf_train["text_clean"].iloc[i].split())
print("from: ", dtf_train["text_clean"].iloc[i], "| len:", len_txt)
## sequence of token ids: [[1, 2, 3], ...]
len_tokens = len(X_train[i])
print("to: ", X_train[i], "| len:", len(X_train[i]))
## vocabulary: {"I":1, "like":2, "this":3, ...}
print("check: ", dtf_train["text_clean"].iloc[i].split()[0], " -- idx in vocabulary -->", dic_vocabulary[dtf_train["text_clean"].iloc[i].split()[0]])
print("vocabulary: ", dict(list(dic_vocabulary.items())[0:5]), "... (padding element, 0)")
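The id mapping and post-padding described above can be reproduced in a few lines of plain Python, which may help when checking what the Keras utilities return. The helpers texts_to_ids and pad_post and the tiny vocabulary are illustrative stand-ins, not the Keras implementation.

```python
def texts_to_ids(tokens, vocabulary, oov_id=1):
    """Map tokens to ids; tokens missing from the vocabulary get the out-of-vocabulary id."""
    return [vocabulary.get(tok, oov_id) for tok in tokens]

def pad_post(seq, maxlen, pad_id=0):
    """Keep the first maxlen ids (post-truncation), then append zeros (post-padding)."""
    seq = list(seq)[:maxlen]
    return seq + [pad_id] * (maxlen - len(seq))

# toy vocabulary (assumption): id 0 is reserved for padding, id 1 for unknown words
vocabulary = {"<oov>": 1, "i": 2, "like": 3, "this": 4}
ids = texts_to_ids(["i", "like", "this", "article"], vocabulary)
padded = pad_post(ids, maxlen=6)
truncated = pad_post(ids, maxlen=3)
print(ids, padded, truncated)
```

The unknown word "article" becomes id 1, short sequences are filled with zeros on the right, and long ones lose their tail.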

Remember to apply the same feature engineering to the test set:

corpus = dtf_test["text_clean"]
## create list of n-grams
lst_corpus = []
for string in corpus:
    lst_words = string.split()
    lst_grams = [" ".join(lst_words[i:i+1]) for i in range(0, len(lst_words), 1)]
    lst_corpus.append(lst_grams)
## detect common bigrams and trigrams using the fitted detectors
lst_corpus = list(bigrams_detector[lst_corpus])
lst_corpus = list(trigrams_detector[lst_corpus])
## text to sequence with the fitted tokenizer
lst_text2seq = tokenizer.texts_to_sequences(lst_corpus)
## padding sequence
X_test = kprocessing.sequence.pad_sequences(lst_text2seq, maxlen=15, padding="post", truncating="post")

X_test (14,697 x 15)

We now have X_train and X_test, and need to create the embedding matrix that will serve as the weight matrix of the neural network classifier.

## start the matrix (length of vocabulary x vector size) with all 0s
embeddings = np.zeros((len(dic_vocabulary)+1, 300))
for word,idx in dic_vocabulary.items():
    ## update the row with vector
    try:
        embeddings[idx] = nlp[word]
    ## if word not in model then skip and the row stays all 0s
    except KeyError:
        pass

This code generates a matrix of shape 22,338 x 300 (length of the vocabulary extracted from the corpus x vector size). It can be navigated by word id, which is obtained from the vocabulary.

word = "data"
print("dic[word]:", dic_vocabulary[word], "|idx")
print("embeddings[idx]:", embeddings[dic_vocabulary[word]].shape, "|vector")

It's finally time to build a deep learning model. I use the embedding matrix in the first Embedding layer of the network, which is then trained to classify the news. Each id in the input sequence is used as an index into the embedding matrix. The output of the embedding layer is a 2D matrix holding one word vector per id in the input sequence (sequence length x vector size). Take the sentence "I like this article" as an example:
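A frozen embedding layer is essentially a row lookup, which the following sketch shows with plain lists. The 3-dimensional vectors, the ids, and the function name embed are made up for illustration; the article's matrix has 300 columns and 22,338 rows.

```python
# toy embedding matrix (assumption): row N holds the vector of the word with id N
embeddings = [
    [0.0, 0.0, 0.0],   # id 0: padding, stays all zeros
    [0.1, 0.2, 0.3],   # id 1: "i"
    [0.4, 0.5, 0.6],   # id 2: "like"
    [0.7, 0.8, 0.9],   # id 3: "this"
    [1.0, 1.1, 1.2],   # id 4: "article"
]

def embed(sequence, matrix):
    """Replace every id in the sequence with its row of the embedding matrix."""
    return [matrix[idx] for idx in sequence]

sequence = [1, 2, 3, 4, 0, 0]          # "i like this article" padded to length 6
output = embed(sequence, embeddings)   # shape: sequence length x vector size
print(len(output), len(output[0]))
```

The output has one row per position in the sequence, and padded positions map to the all-zero row.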

My neural network is structured as follows:

an Embedding layer that, as described above, takes the sequences as input and the word vectors as weights;
a simple Attention layer that captures the weights of each instance so that a nice explainer can be built (it is not required for the predictions, only for explainability, so you can skip it). The Attention mechanism for sequence models (such as LSTM) was presented in a 2014 paper to work out which parts of a long text are actually relevant;
two layers of bidirectional LSTM to model the order of the words in a sequence in both directions;
two final dense layers that predict the probability of each news category.

## code attention layer
def attention_layer(inputs, neurons):
    x = layers.Permute((2,1))(inputs)
    x = layers.Dense(neurons, activation="softmax")(x)
    x = layers.Permute((2,1), name="attention")(x)
    x = layers.multiply([inputs, x])
    return x

## input
x_in = layers.Input(shape=(15,))
## embedding
x = layers.Embedding(input_dim=embeddings.shape[0], output_dim=embeddings.shape[1], weights=[embeddings], input_length=15, trainable=False)(x_in)
## apply attention
x = attention_layer(x, neurons=15)
## 2 layers of bidirectional lstm
x = layers.Bidirectional(layers.LSTM(units=15, dropout=0.2, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(units=15, dropout=0.2))(x)
## final dense layers
x = layers.Dense(64, activation='relu')(x)
y_out = layers.Dense(3, activation='softmax')(x)
## compile
model = models.Model(x_in, y_out)
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Now we can train the model. Before testing it on the actual test set, I hold out a small part of the training set as a validation set to check the performance.

## encode y
dic_y_mapping = {n:label for n,label in enumerate(np.unique(y_train))}
inverse_dic = {v:k for k,v in dic_y_mapping.items()}
y_train = np.array([inverse_dic[y] for y in y_train])
## train
training = model.fit(x=X_train, y=y_train, batch_size=256, epochs=10, shuffle=True, verbose=0, validation_split=0.3)
## plot loss and accuracy
## (named metric_names so that it does not shadow sklearn's metrics module)
metric_names = [k for k in training.history.keys() if ("loss" not in k) and ("val" not in k)]
fig, ax = plt.subplots(nrows=1, ncols=2, sharey=True)
ax[0].set(title="Training")
ax11 = ax[0].twinx()
ax[0].plot(training.history['loss'], color='black')
ax[0].set_xlabel('Epochs')
ax[0].set_ylabel('Loss', color='black')
for metric in metric_names:
    ax11.plot(training.history[metric], label=metric)
ax11.set_ylabel("Score", color='steelblue')
ax11.legend()
ax[1].set(title="Validation")
ax22 = ax[1].twinx()
ax[1].plot(training.history['val_loss'], color='black')
ax[1].set_xlabel('Epochs')
ax[1].set_ylabel('Loss', color='black')
for metric in metric_names:
    ax22.plot(training.history['val_'+metric], label=metric)
ax22.set_ylabel("Score", color="steelblue")
plt.show()

Nice! In some epochs the accuracy reached 0.89. To complete the evaluation of the word embedding model, let's predict the test set and compare the same metrics used before (the evaluation code is the same as before).

## test
predicted_prob = model.predict(X_test)
predicted = [dic_y_mapping[np.argmax(pred)] for pred in predicted_prob]

The model performs about as well as the previous one. Indeed, it also struggles to classify Tech news.

But is it explainable as well? Yes, it is. I put an Attention layer in the neural network to extract the weight of each word and understand how much each one contributed to classifying an instance. So I try to use the Attention weights to build an explainer (similar to the one in the previous section):

## select observation
i = 0
txt_instance = dtf_test["text"].iloc[i]
## check true value and predicted value
print("True:", y_test[i], "--> Pred:", predicted[i], "| Prob:", round(np.max(predicted_prob[i]),2))

## show explanation
### 1. preprocess input
lst_corpus = []
for string in [re.sub(r'[^\w\s]','', txt_instance.lower().strip())]:
    lst_words = string.split()
    lst_grams = [" ".join(lst_words[i:i+1]) for i in range(0, len(lst_words), 1)]
    lst_corpus.append(lst_grams)
lst_corpus = list(bigrams_detector[lst_corpus])
lst_corpus = list(trigrams_detector[lst_corpus])
## the source passed `corpus` here; lst_corpus (the single preprocessed instance) is what this step needs
X_instance = kprocessing.sequence.pad_sequences(tokenizer.texts_to_sequences(lst_corpus), maxlen=15, padding="post", truncating="post")

### 2. get attention weights
layer = [layer for layer in model.layers if "attention" in layer.name][0]
func = K.function([model.input], [layer.output])
weights = func(X_instance)[0]
weights = np.mean(weights, axis=2).flatten()

### 3. rescale weights, remove null vector, map word-weight
weights = preprocessing.MinMaxScaler(feature_range=(0,1)).fit_transform(np.array(weights).reshape(-1,1)).reshape(-1)
weights = [weights[n] for n,idx in enumerate(X_instance[0]) if idx != 0]
dic_word_weigth = {word:weights[n] for n,word in enumerate(lst_corpus[0]) if word in tokenizer.word_index.keys()}

### 4. barplot
top = 10   ## `top` was undefined in the source; 10 is an assumed value
if len(dic_word_weigth) > 0:
    ## named dtf_weights so that the dataset in dtf is not overwritten
    dtf_weights = pd.DataFrame.from_dict(dic_word_weigth, orient='index', columns=["score"])
    dtf_weights.sort_values(by="score", ascending=True).tail(top).plot(kind="barh", legend=False).grid(axis='x')
    plt.show()
else:
    print("--- No word recognized ---")

### 5. produce html visualization
text = []
for word in lst_corpus[0]:
    weight = dic_word_weigth.get(word)
    if weight is not None:
        ## the highlighting markup was stripped from the source; <b> is an assumed stand-in
        text.append('<b>' + word + '</b>')
    else:
        text.append(word)
text = ' '.join(text)

### 6. visualize on notebook
print("\033[1m"+"Text with highlighted words")
from IPython.display import display, HTML
display(HTML(text))

Just like before, the words "clinton" and "gop" activated the neurons of the model, and this time the words "high" and "benghazi" were also found to be slightly relevant for the prediction.

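Language Models

Language models, or contextualized/dynamic word embeddings, overcome the biggest limitation of the classic word embedding approach: polysemy disambiguation. With classic embeddings, a word with several meanings (such as "bank" or "stick") is represented by a single vector. One of the first popular ones was ELMo (2018), which does not apply a fixed embedding; instead, it uses a bidirectional LSTM to look at the entire sentence and then assigns an embedding to each word.

Enter Transformers: a new modeling technique presented in Google's paper Attention is All You Need (2017), which showed that sequence models (such as LSTM) can be fully replaced by Attention mechanisms, even obtaining better performance.

Google's BERT (Bidirectional Encoder Representations from Transformers, 2018) then combined ELMo's contextual embeddings with several Transformers, and it is bidirectional (a major novelty for Transformers). The vector BERT assigns to a word is a function of the entire sentence, so a word can have different vectors depending on its context. Let's try it with "bank river":

txt = "bank river"
## bert tokenizer
tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
## bert model
nlp = transformers.TFBertModel.from_pretrained('bert-base-uncased')
## return hidden layer with embeddings
input_ids = np.array(tokenizer.encode(txt))[None,:]
embedding = nlp(input_ids)
embedding[0][0]

If the input text is changed to "bank money", we get this instead: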

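To complete a text classification task, BERT can be used in 3 different ways:

train it from scratch and use it as a classifier;
extract the word embeddings and use them in an embedding layer (as I did with Word2Vec);
fine-tune the pre-trained model (transfer learning).

I go with the third option and do transfer learning from a pre-trained lighter version of BERT called Distil-BERT (66 million parameters instead of 110 million).

## distil-bert tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained('distilbert-base-uncased', do_lower_case=True)

Before fitting the model, some feature engineering is needed as usual, but this time it is trickier. To show what has to be done, take the sentence "I like this article" again: it must be transformed into 3 vectors (Ids, Mask, Segment):

Shape: 3 x sequence length

First of all, we need to choose the maximum sequence length. This time I pick a much larger number (e.g. 50) because BERT splits unknown words into sub-tokens until it finds known pieces. For example, given a made-up word like "zzdata", BERT splits it into ["z", "##z", "##data"]. In addition, we have to insert special tokens into the input text and then generate the mask and segment vectors. Finally, we put them all together in a tensor to get the feature matrix, which has shape 3 (ids, masks, segments) x number of documents in the corpus x sequence length.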
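The three vectors can be sketched without any tokenizer, starting from a list of tokens that is assumed to be already split. The function name bert_features is illustrative. The segment rule mirrors the one used in this article's feature-engineering code, where the segment id increases after each [SEP], so padding positions end up with segment 1.

```python
def bert_features(tokens, maxlen):
    """Build padded tokens, attention mask, and segment ids for one sentence."""
    # special tokens: [CLS] opens the sequence, [SEP] closes the sentence
    tokens = ["[CLS]"] + tokens[:maxlen - 2] + ["[SEP]"]
    n_real = len(tokens)
    padded = tokens + ["[PAD]"] * (maxlen - n_real)
    # mask: 1 for real tokens, 0 for padding
    mask = [1] * n_real + [0] * (maxlen - n_real)
    # segments: the id increases after each [SEP]
    segments, seg = [], 0
    for tok in padded:
        segments.append(seg)
        if tok == "[SEP]":
            seg += 1
    return padded, mask, segments

padded, mask, segments = bert_features(["i", "like", "this", "article"], maxlen=8)
print(padded)
print(mask)
print(segments)
```

A real pipeline would then convert the padded tokens to vocabulary ids with the model's tokenizer; that step is omitted here because it depends on the pre-trained vocabulary.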

Here I use the raw text as the corpus (so far I have been using the text_clean column).

corpus = dtf_train["text"]
maxlen = 50

## add special tokens
maxqnans = int((maxlen-20)/2)   ## np.int was removed from NumPy; the built-in int is equivalent
corpus_tokenized = ["[CLS] "+ " ".join(tokenizer.tokenize(re.sub(r'[^\w\s]+|\n', '', str(txt).lower().strip()))[:maxqnans])+ " [SEP] " for txt in corpus]

## generate masks
masks = [[1]*len(txt.split(" ")) + [0]*(maxlen - len(txt.split(" "))) for txt in corpus_tokenized]

## padding
txt2seq = [txt + " [PAD]"*(maxlen-len(txt.split(" "))) if len(txt.split(" ")) != maxlen else txt for txt in corpus_tokenized]

## generate idx
idx = [tokenizer.encode(seq.split(" ")) for seq in txt2seq]

## generate segments
segments = []
for seq in txt2seq:
    temp, i = [], 0
    for token in seq.split(" "):
        temp.append(i)
        if token == "[SEP]":
            i += 1
    segments.append(temp)

## feature matrix
X_train = [np.asarray(idx, dtype='int32'), np.asarray(masks, dtype='int32'), np.asarray(segments, dtype='int32')]

The feature matrix X_train has shape 3 x 34,265 x 50. We can inspect one example from the feature matrix:

i = 0
print("txt: ", dtf_train["text"].iloc[0])
print("tokenized:", [tokenizer.convert_ids_to_tokens(idx) for idx in X_train[0][i].tolist()])
print("idx: ", X_train[0][i])
print("mask: ", X_train[1][i])
print("segment: ", X_train[2][i])

Running the same code on dtf_test["text"] gives X_test.

Now I build the deep learning model with transfer learning from the pre-trained BERT. Specifically, I summarize BERT's output into one vector with average pooling and then add two final Dense layers to predict the probability of each news category.

Here is the code for the original version of BERT (remember to redo the feature engineering with the right tokenizer):

## inputs
idx = layers.Input((50,), dtype="int32", name="input_idx")
masks = layers.Input((50,), dtype="int32", name="input_masks")
segments = layers.Input((50,), dtype="int32", name="input_segments")
## pre-trained bert
nlp = transformers.TFBertModel.from_pretrained("bert-base-uncased")
## tuple unpacking follows the older transformers API used in the source
bert_out, _ = nlp([idx, masks, segments])
## fine-tuning
x = layers.GlobalAveragePooling1D()(bert_out)
x = layers.Dense(64, activation="relu")(x)
y_out = layers.Dense(len(np.unique(y_train)), activation='softmax')(x)
## compile
model = models.Model([idx, masks, segments], y_out)
for layer in model.layers[:4]:
    layer.trainable = False
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Here I use the lighter Distil-BERT instead of BERT:

## inputs
idx = layers.Input((50,), dtype="int32", name="input_idx")
masks = layers.Input((50,), dtype="int32", name="input_masks")
## pre-trained bert with config
config = transformers.DistilBertConfig(dropout=0.2, attention_dropout=0.2)
config.output_hidden_states = False
nlp = transformers.TFDistilBertModel.from_pretrained('distilbert-base-uncased', config=config)
bert_out = nlp(idx, attention_mask=masks)[0]
## fine-tuning
x = layers.GlobalAveragePooling1D()(bert_out)
x = layers.Dense(64, activation="relu")(x)
y_out = layers.Dense(len(np.unique(y_train)), activation='softmax')(x)
## compile
model = models.Model([idx, masks], y_out)
for layer in model.layers[:3]:
    layer.trainable = False
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Finally, let's train, test, and evaluate the model (the evaluation code is the same as before):

## encode y
dic_y_mapping = {n:label for n,label in enumerate(np.unique(y_train))}
inverse_dic = {v:k for k,v in dic_y_mapping.items()}
y_train = np.array([inverse_dic[y] for y in y_train])
## train
training = model.fit(x=X_train, y=y_train, batch_size=64, epochs=1, shuffle=True, verbose=1, validation_split=0.3)
## test
predicted_prob = model.predict(X_test)
predicted = [dic_y_mapping[np.argmax(pred)] for pred in predicted_prob]

BERT performs slightly better than the previous models, and it recognizes somewhat more Tech news than the others.

Conclusion

This article is a tutorial showing how to apply different NLP models to a multiclass classification task. I compared 3 popular approaches: Bag-of-Words with Tf-Idf, word embedding with Word2Vec, and a language model with BERT. For each model I went through feature engineering and feature selection, model design and testing, and evaluation and explainability, comparing the 3 models at each step where possible.


Foreword

Text is one of the most basic elements of a web page. Making it easy to read, so that readers can take in more information in less time, is the goal of every page author. This article introduces how to use the various text formatting tags.

This is a tutorial aimed at beginners; if you already know HTML well, feel free to skip it.

Heading text

Web pages often use headings to mark the division into sections. Headings are displayed at fixed font sizes, and there are 6 levels in total, from h1 to h6 in decreasing size, as shown below:

HTML code:

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Headings</title>
</head>
<body>
<h1>This is heading 1</h1>
<h2>This is heading 2</h2>
<h3>This is heading 3</h3>
<h4>This is heading 4</h4>
<h5>This is heading 5</h5>
<h6>This is heading 6</h6>
</body>
</html>

Heading alignment can be set with the align attribute, which takes three values (the attribute is obsolete in HTML5, where the CSS text-align property is used instead):

left: left-aligned
center: centered
right: right-aligned

As shown below:

HTML code:

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Headings</title>
</head>
<body>
<h1>This is heading 1</h1>
<h2 align="left">This is heading 2</h2>
<h3 align="center">This is heading 3</h3>
<h4 align="right">This is heading 4</h4>
<h5>This is heading 5</h5>
<h6>This is heading 6</h6>
</body>
</html>

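Text formatting tags

Besides headings, ordinary text is also indispensable in a web page, and text effects can make a page look better.

Any text typed between <body> and </body> is displayed directly on the page. The formatting of that text is controlled with tags, and the sections below go through each of them in turn.

1. Setting font, size, and color: the <font> tag

The <font> tag is used in HTML 4 to specify the font, font size, and text color, but it is not supported in HTML5.

face attribute: font family
size attribute: font size
color attribute: font color

HTML code:

<html>
<body>
<div><font face="宋體">Font face</font></div>
<div><font size="5">Size 5 text</font></div>
<div><font color="red">Color</font></div>
<div><font size="5" face="arial" color="blue">All together</font></div>
</body>
</html>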

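Its use is discouraged in HTML5; use CSS styles instead.

2. Bold, italic, underline, strikethrough: strong, em, u, del

The result looks like this:

HTML code:

<!DOCTYPE html>
<html>
<body>
<p>This is normal text - <strong>this is bold text</strong>.</p>
<p>This is normal text - <em>this is italic text</em>.</p>
<p>This is normal text - <u>this is underlined text</u>.</p>
<p>This is normal text - <del>this is strikethrough text</del>.</p>
</body>
</html>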

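Note: HTML5 and HTML 4 treat some of these tags differently. Pairs such as strong and b, del and s, and em and i render the same way by default. The b, s, and i tags are still valid in HTML5 but carry different semantics from their counterparts, so strong, em, and del are preferred when the meaning (importance, emphasis, deleted text) matters. The remaining differences are worth looking up on your own.

3. Superscript and subscript: sup, sub

The result looks like this:

HTML code:

<html>
<body>
<p>
Normal text <sup>superscript</sup>
</p>
<p>
Normal text <sub>subscript</sub>
</p>
<p>
Formula X<sup>3</sup> + 5X<sup>2</sup> - 5=0
</p>
<p>
Formula X<sub>1</sub> - 2X<sub>1</sub>=0
</p>
</body>
</html>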

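4. Spaces: &nbsp;

When typing text for a web page, extra spaces added inside a paragraph often do not show up on the page. This is because, in HTML, the browser treats any run of consecutive half-width whitespace as a single space. The &nbsp; entity is used instead: each one stands for one space, and it can be repeated for several spaces.

HTML code: (shown as an image in the original because the publishing platform does not display the space entity)

Result: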
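The collapsing rule can be imitated in a few lines of Python, which makes it easy to see why &nbsp; survives where ordinary spaces do not. This is only a rough model of what a browser does with inline text; the function name collapse_whitespace and the sample strings are illustrative.

```python
import html
import re

def collapse_whitespace(text):
    """Replace every run of spaces, tabs, and line breaks with a single space."""
    return re.sub(r"[ \t\n\r\f]+", " ", text)

source = "Hello     world\n   again"
rendered = collapse_whitespace(source)

# &nbsp; decodes to U+00A0 (no-break space), which is not an ordinary space,
# so the collapsing rule leaves it alone
with_nbsp = collapse_whitespace(html.unescape("Hello&nbsp;&nbsp;&nbsp;world"))
print(repr(rendered))
print(repr(with_nbsp))
```

The first string ends up with single spaces between the words, while the second keeps all three no-break spaces.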

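5. Other special characters

Besides the space entity, some other special characters in a web page also have to be written with codes. In general, a special character is written as the prefix "&", followed by a character name and the suffix ";", just like the space entity. See the table below.

There are many special characters; only a few examples are listed here, and the rest can be looked up as needed.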
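When HTML is generated from a program, the conversion to and from these entities does not have to be done by hand. As a small illustration, Python's standard html module provides escape and unescape; the sample strings below are made up.

```python
import html

# escape: replace the characters that have a special meaning in HTML with entities
original = '<p> 5 > 3 & "quoted" </p>'
escaped = html.escape(original)

# unescape: turn named entities back into the characters they stand for
restored = html.unescape("&copy; 2020 &lt;b&gt;")

print(escaped)
print(restored)
```

Escaping and then unescaping returns the original string, which is a quick way to check that nothing was lost.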

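Paragraphs

To present text in an orderly way on a web page, paragraph tags are needed. The following sections introduce the tags related to paragraphs.

Paragraph tag: p

In a web page, a paragraph is defined with the <p> tag.

HTML code:

<html>
<body>
<p>This is a paragraph.</p>
<p>This is a paragraph.</p>
<p>This is a paragraph.</p>
<p>Paragraph elements are defined by the p tag.</p>
</body>
</html>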

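Result:

Line break tag: br

Besides automatic wrapping, the <br> tag can be used to force a line break. This is different from the p tag: paragraphs are separated by a blank line, whereas br adds none, which keeps the two lines of text closer together.

HTML code:

<html>
<body>
<p>
First paragraph<br />Line 1<br />Line 2<br />Line 3<br />Last line.
</p>
<p>
Second paragraph <br />Line 1<br />Line 2<br />Line 3<br />Last line.
</p>
</body>
</html>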

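The result looks like this:

If you do not want the browser to wrap the text automatically, a no-wrap tag can be used (the tag name was lost in the source; the non-standard <nobr> element and the CSS rule white-space: nowrap both behave this way), as shown below:

That line of text is not wrapped automatically, and a horizontal scroll bar appears.

Preserving the original layout: pre

Sometimes a page needs to preserve a special layout, and controlling that with other tags is tedious. The <pre> tag preserves the formatting and layout of the text, as shown below:

HTML code:

<html>
<body>
<pre>
This is
preformatted text.
It preserves      spaces
and line breaks.
</pre>
<p>The pre tag is well suited to displaying computer code:</p>
<pre>
for i=1 to 10
     print i
next i
</pre>
<p>This is an "OK" drawn with characters</p>
<pre>
  O O    k  K
 O   O   K K
  O O    K  K
</pre>
</body>
</html>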

Other tags

Indentation: blockquote

<blockquote> indents a paragraph of text. Each use indents the paragraph by one level, and the tag can be nested.

Example code:

<html>
<body>
Here comes a long quotation:
<blockquote>
This is a long quotation. This is a long quotation. This is a long quotation. This is a long quotation. This is a long quotation.
</blockquote>
Note that the browser adds line breaks before and after the blockquote element and increases the margins.
</body>
</html>

[Figure: rendered result]

Horizontal rule: hr

A horizontal rule placed between paragraphs separates them.

[Figure: rendered result]

HTML code:

<html>
<body>
<p>The hr tag defines a horizontal rule:</p>
<hr />
<p>This is a paragraph.</p>
<hr />
<p>This is a paragraph.</p>
<hr />
<p>This is a paragraph.</p>
</body>
</html>

Text annotation: ruby

On a web page you can attach an annotation to a piece of text to explain it.

[Figure: rendered result]

HTML code:

<!DOCTYPE HTML>
<html>
<body>
<p>ruby usage syntax:</p>
<ruby>
 annotated text <rt> annotation </rt>
</ruby>
</body>
</html>

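Other tags: var, code, kbd, and others

<dfn>	Defines a definition term.
<code>	Defines computer code text.
<samp>	Defines sample text.
<kbd>	Defines keyboard text, that is, text typed on the keyboard. It is often used in computer-related documents and manuals.
<var>	Defines a variable. You can combine this tag with the <pre> and <code> tags.
<cite>	Defines a citation. Use this tag for references to sources, such as the title of a book or magazine.

Summary

This article covered most of the commonly used text formatting tags, which come up often when building web pages. Mastering them is simple: use a text editor, or an online tool with an editable preview such as w3cschool, and write them out by hand until you know what each tag does. There is no need for rote memorization; the key is understanding.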


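Preface

I previously wrote about short-text classification with pytorch, where the data handling was rather crude. Natural language processing covers many tasks, and handling much of that data the same way is tedious and time-consuming. In pytorch the well-known data processing package is torchvision, which handles images; packages for text are mentioned less often, but one for processing text data quickly does exist: torchtext[1]. Below I continue with the previous case study, "[Deep Learning] textCNN paper and principles: short-text classification (based on pytorch)"[2], using torchtext to preprocess the text data and then classifying with the textCNN model.

For basic torchtext usage, besides the official documentation, you can also read this article: "TorchText usage examples and complete code"[3].

Let's see how the processing is done.

1 Data processing

First import the package: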

# Legacy torchtext API: moved to torchtext.legacy in 0.9 and removed in 0.12
from torchtext import data

The corpus we process involves two things: the text, and the category each text belongs to. Build these two fields with torchtext:

# Text: use the custom tokenizer, lowercase the content, fix the sequence length, put the batch dimension first
TEXT = data.Field(tokenize=utils.en_seg, lower=True, fix_length=config.MAX_SENTENCE_SIZE, batch_first=True)
# Label that belongs to each text
LABEL = data.LabelField(dtype=torch.float)
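To make the effect of lower and fix_length concrete, here is a rough pure-Python sketch of the token-level processing such a field performs before words are mapped to indices. The function preprocess and the sample sentences are hypothetical, and padding at the end with a '<pad>' token is an assumption about the default settings, not a copy of torchtext's code.

```python
def en_seg(sentence):
    """Whitespace tokenizer, same behavior as utils.en_seg in the text."""
    return sentence.split()


def preprocess(sentence, fix_length, pad_token="<pad>"):
    """Tokenize, lowercase, then pad or truncate to exactly fix_length tokens."""
    tokens = [token.lower() for token in en_seg(sentence)]
    tokens = tokens[:fix_length]  # truncate long sentences
    return tokens + [pad_token] * (fix_length - len(tokens))  # pad short ones


short = preprocess("A Gripping Movie", fix_length=5)
long = preprocess("one two three four five six seven", fix_length=5)
print(short)
print(long)
```

Every sentence ends up with the same number of tokens, which is what lets a batch be stacked into one tensor of shape (batch, MAX_SENTENCE_SIZE).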

Some of these parameters live in a config.py file:

# Model parameters
RANDOM_SEED = 1000  # random seed
BATCH_SIZE = 128    # batch size
LEARNING_RATE = 1e-3   # learning rate
EMBEDDING_SIZE = 200   # word embedding dimension
MAX_SENTENCE_SIZE = 50  # maximum sentence length
EPOCH = 20            # number of training epochs

# Corpus paths
NEG_CORPUS_PATH = './corpus/neg.txt'
POS_CORPUS_PATH = './corpus/pos.txt'

utils.en_seg is a custom tokenizer function:

def en_seg(sentence):
    """
    Simple English tokenizer that splits on whitespace.
    :param sentence: the sentence to tokenize
    :return: the list of tokens
    """
    return sentence.split()
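A whitespace split is enough when the corpus is already tokenized. For raw text, a slightly more careful tokenizer might separate punctuation from words. The sketch below is one possible variant; en_seg_regex and its pattern are my own illustration, not part of the original project.

```python
import re

# One token is either a run of word characters (optionally with an inner
# apostrophe, e.g. "don't") or a single non-space punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def en_seg_regex(sentence):
    """Split a sentence into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(sentence)


tokens = en_seg_regex("It's never engaging, utterly predictable.")
print(tokens)
```

It can be passed to the field in the same way as en_seg, through the tokenize argument.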

You could of course write something more elaborate, or use spacy. The next step is to read the text data into torchtext objects so that torchtext's methods can be applied to it:

def get_dataset(corpus_path, text_field, label_field, datatype):
    """
    Build the examples for a torchtext dataset
    :param corpus_path: path to the data
    :param text_field: the torchtext field for the text
    :param label_field: the torchtext field for the label
    :param datatype: category of the texts ('pos' or anything else for negative)
    :return: the list of torchtext examples and the field definitions
    """
    fields = [('text', text_field), ('label', label_field)]
    examples = []
    with open(corpus_path, encoding='utf8') as reader:
        for line in reader:
            content = line.rstrip()
            if datatype == 'pos':
                label = 1
            else:
                label = 0
            # content[:-2] drops the last two characters, a space and '.', that end each line of the raw corpus, and binds the data to the fields
            examples.append(data.Example.fromlist([content[:-2], label], fields))

    return examples, fields
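The content[:-2] slice assumes every line ends with a space and a period. As a hedged alternative, the sketch below strips that suffix only where it is present and skips blank lines. It uses plain tuples instead of torchtext examples so that it runs on its own; read_labeled_lines and the throwaway corpus are hypothetical.

```python
import os
import tempfile


def read_labeled_lines(corpus_path, label, suffix=" ."):
    """Return (text, label) pairs, dropping `suffix` only where a line ends with it."""
    pairs = []
    with open(corpus_path, encoding="utf8") as reader:
        for line in reader:
            content = line.rstrip()
            if not content:
                continue  # skip blank lines
            if content.endswith(suffix):
                content = content[: -len(suffix)]
            pairs.append((content, label))
    return pairs


# Tiny throwaway corpus so the sketch runs without the real data files.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pos.txt")
    with open(path, "w", encoding="utf8") as writer:
        writer.write("a gripping movie . \nno trailing period\n\n")
    pairs = read_labeled_lines(path, label=1)

print(pairs)
```

With the unconditional slice, the second line would lose its last two letters.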

Now the data can be obtained in torchtext format:

# Build the examples
pos_examples, pos_fields = dataloader.get_dataset(config.POS_CORPUS_PATH, TEXT, LABEL, 'pos')
neg_examples, neg_fields = dataloader.get_dataset(config.NEG_CORPUS_PATH, TEXT, LABEL, 'neg')
# Both calls return the same field definitions, so one copy is enough
all_examples, all_fields = pos_examples + neg_examples, pos_fields

# Build the torchtext dataset
total_data = data.Dataset(all_examples, all_fields)

With this dataset in hand, we can quickly prepare what the model needs, such as splitting the data, building batches, and obtaining the vocabulary:


import random

# Split the dataset. random.seed() returns None, so seed the RNG first and pass its state explicitly.
random.seed(config.RANDOM_SEED)
train_data, test_data = total_data.split(split_ratio=0.8, random_state=random.getstate())

# Inspect the split data
# # Check the sizes
print('len of train data: %r' % len(train_data))  # len of train data: 8530
print('len of test data: %r' % len(test_data))  # len of test data: 2132

# # Look at one example
print(train_data.examples[100].text)
# ['never', 'engaging', ',', 'utterly', 'predictable', 'and', 'completely', 'void', 'of', 'anything', 'remotely',
# 'interesting', 'or', 'suspenseful']
print(train_data.examples[100].label)
# 0

# Build the vocabulary from the training data, mapping each word to an index
TEXT.build_vocab(train_data)
LABEL.build_vocab(train_data)

# Check the vocabulary size
print(len(TEXT.vocab))  # 19206
# Look at the first 10 words in the vocabulary
print(TEXT.vocab.itos[:10])  # ['<unk>', '<pad>', ',', 'the', 'a', 'and', 'of', 'to', '.', 'is']
# Look up the index of the word 'name'; stoi is essentially a dict
print(TEXT.vocab.stoi['name'])  # 2063

# Build the batch iterators
train_iterator, test_iterator = data.BucketIterator.splits((train_data, test_data),
                                                           batch_size=config.BATCH_SIZE,
                                                           sort=False)
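The itos and stoi lookups printed above can be pictured with a small hand-rolled vocabulary. This is only a sketch of the idea, assuming special tokens come first and the remaining words are ordered by descending frequency; build_vocab and numericalize here are hypothetical helpers, not torchtext functions.

```python
from collections import Counter


def build_vocab(token_lists, specials=("<unk>", "<pad>")):
    """Return (itos, stoi): specials first, then words by descending frequency."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    # Sort by frequency (high to low), breaking ties alphabetically.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    itos = list(specials) + [word for word, _ in ordered]
    stoi = {word: index for index, word in enumerate(itos)}
    return itos, stoi


def numericalize(tokens, stoi):
    """Map tokens to indices; unseen words fall back to the <unk> index."""
    return [stoi.get(token, stoi["<unk>"]) for token in tokens]


train_tokens = [["the", "movie", "is", "good"], ["the", "plot", "is", "thin"]]
itos, stoi = build_vocab(train_tokens)
ids = numericalize(["the", "ending", "is", "good", "<pad>"], stoi)
print(itos)
print(ids)
```

Building the vocabulary from the training split only, as the code above does, means words that occur only in the test split are mapped to '<unk>' at evaluation time.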

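Seen this way, torchtext saves a good deal of hand-written code. What remains is the familiar routine of running the model and checking how well it does.

2 Building and training the model

The theory behind the model was covered in the earlier article; look back at it if you have forgotten. The model is unchanged: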

import torch
from torch import nn

import config


class TextCNN(nn.Module):
    # output_size is the number of output classes (2 classes, 0 and 1); three kernel sizes, 3, 4 and 5, with 100 filters each
    def __init__(self, vocab_size, embedding_dim, output_size, filter_num=100, kernel_list=(3, 4, 5), dropout=0.5):
        super(TextCNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # 1 is channel_num, filter_num is the number of output channels, and each kernel has size (kernel, embedding_dim)
        self.convs = nn.ModuleList([
            nn.Sequential(nn.Conv2d(1, filter_num, (kernel, embedding_dim)),
                          nn.LeakyReLU(),
                          nn.MaxPool2d((config.MAX_SENTENCE_SIZE - kernel + 1, 1)))
            for kernel in kernel_list
        ])
        self.fc = nn.Linear(filter_num * len(kernel_list), output_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        x = self.embedding(x)  # [128, 50, 200] (batch, seq_len, embedding_dim)
        x = x.unsqueeze(1)  # [128, 1, 50, 200], i.e. (batch, channel_num, seq_len, embedding_dim)
        out = [conv(x) for conv in self.convs]
        out = torch.cat(out, dim=1)  # [128, 300, 1, 1], the channels of all branches concatenated
        out = out.view(x.size(0), -1)  # flatten to [128, 300]
        out = self.dropout(out)  # apply dropout
        logits = self.fc(out)  # output [128, 2]
        return logits
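The shapes in the comments follow from simple arithmetic: an unpadded, stride-1 convolution of height k over a sequence of length 50 leaves 50 - k + 1 positions, and the max-pool window of that same height reduces them to one. The sketch below traces this with plain integers; the helper names are hypothetical.

```python
def conv_output_len(seq_len, kernel):
    """Length left after a stride-1, unpadded convolution of height `kernel`."""
    return seq_len - kernel + 1


def textcnn_feature_size(seq_len, kernel_list, filter_num):
    """Trace each branch through conv and max-pool; return the flattened size."""
    total = 0
    for kernel in kernel_list:
        conv_len = conv_output_len(seq_len, kernel)  # e.g. 50 - 3 + 1 = 48
        pooled_len = conv_len // conv_len  # pool window spans the whole length, so 1
        total += filter_num * pooled_len
    return total


conv_lens = [conv_output_len(50, kernel) for kernel in (3, 4, 5)]
features = textcnn_feature_size(seq_len=50, kernel_list=(3, 4, 5), filter_num=100)
print(conv_lens)
print(features)
```

This is why the linear layer takes filter_num * len(kernel_list) inputs, and also why the model depends on every input having exactly MAX_SENTENCE_SIZE tokens.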

To make training and testing convenient I wrote two functions, train and evaluate, plus an accuracy helper. They are the same as before:

import numpy as np
import torch


def binary_acc(pred, y):
    """
    Compute the model's accuracy
    :param pred: predicted values
    :param y: ground-truth values
    :return: the accuracy
    """
    correct = torch.eq(pred, y).float()
    acc = correct.sum() / len(correct)
    return acc


def train(model, train_data, optimizer, criterion):
    """
    Train the model for one epoch
    :param model: the model to train
    :param train_data: training data
    :param optimizer: optimizer
    :param criterion: loss function
    :return: mean of the per-batch accuracies for this epoch
    """
    avg_acc = []
    model.train()       # switch to training mode
    for i, batch in enumerate(train_data):
        pred = model(batch.text)
        loss = criterion(pred, batch.label.long())
        acc = binary_acc(torch.max(pred, dim=1)[1], batch.label)
        avg_acc.append(acc.item())  # store a plain float rather than a tensor
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Average the results over all batches
    avg_acc = np.array(avg_acc).mean()
    return avg_acc


def evaluate(model, test_data):
    """
    Evaluate the model on the test data
    :param model: the model
    :param test_data: test data
    :return: mean of the per-batch accuracies of the current model on the test data
    """
    avg_acc = []
    model.eval()  # switch to evaluation mode
    with torch.no_grad():
        for i, batch in enumerate(test_data):
            pred = model(batch.text)
            acc = binary_acc(torch.max(pred, dim=1)[1], batch.label)
            avg_acc.append(acc.item())
    return np.array(avg_acc).mean()
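Both functions average the per-batch accuracies, which gives a smaller final batch the same weight as a full one. The difference from a per-sample accuracy is usually small, as the sketch below suggests. The (correct, batch_size) counts are made up for illustration; only the 2132 test samples and the batch size of 128 come from the text.

```python
def mean_of_batch_accuracies(batches):
    """Average per-batch accuracies, giving every batch the same weight."""
    return sum(correct / size for correct, size in batches) / len(batches)


def sample_weighted_accuracy(batches):
    """Total correct predictions divided by total samples."""
    total_correct = sum(correct for correct, _ in batches)
    total_size = sum(size for _, size in batches)
    return total_correct / total_size


# Hypothetical (correct, batch_size) counts: 2132 test samples at batch size 128
# give 16 full batches and one final batch of 84.
batches = [(96, 128)] * 16 + [(42, 84)]
unweighted = mean_of_batch_accuracies(batches)
weighted = sample_weighted_accuracy(batches)
print(round(unweighted, 4), round(weighted, 4))
```

If exact per-sample accuracy matters, accumulating the number of correct predictions and the number of samples across batches is the safer choice.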

Import whatever packages these functions need yourself. Next come creating the model, training it, and testing it. It is the nerve-racking part once again.

import matplotlib.pyplot as plt
from torch import nn, optim

# Create the model
text_cnn = model.TextCNN(len(TEXT.vocab), config.EMBEDDING_SIZE, len(LABEL.vocab))
# Choose the optimizer
optimizer = optim.Adam(text_cnn.parameters(), lr=config.LEARNING_RATE)
# Choose the loss function
criterion = nn.CrossEntropyLoss()

# Accuracy history, kept for plotting
model_train_acc, model_test_acc = [], []

# Train the model
for epoch in range(config.EPOCH):
    train_acc = utils.train(text_cnn, train_iterator, optimizer, criterion)
    print("epoch = {}, train accuracy={}".format(epoch + 1, train_acc))

    test_acc = utils.evaluate(text_cnn, test_iterator)
    print("epoch = {}, test accuracy={}".format(epoch + 1, test_acc))

    model_train_acc.append(train_acc)
    model_test_acc.append(test_acc)

# Plot the training process
plt.plot(model_train_acc)
plt.plot(model_test_acc)
plt.ylim(bottom=0.5, top=1.01)  # the older ymin/ymax keywords are no longer accepted by recent matplotlib
plt.title("The accuracy of textCNN model")
plt.legend(['train', 'test'])
plt.show()

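The final results of the model:

[Figure: model training process]

The results differ little from the earlier ones, but the data processing takes much less time and is more standardized. So it is worth taking the time to learn torchtext.

3 Summary

torchtext supports quite a few natural language processing tasks and also ships with some datasets. I have recently been working on named entity recognition with a bi-lstm+crf model. That task is essentially sequence labeling, and torchtext supports processing that kind of data as well; I will write about it when time allows. All the code and the corpus for this article are on github: https://github.com/Htring/NLP_Applications[4]. Code for other related applications will be added there over time, and stars and feedback are welcome.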

References

[1] torchtext: https://pytorch.org/text/stable/index.html

[2] [Deep Learning] textCNN paper and principles: short-text classification (based on pytorch): https://piqiandong.blog.csdn.net/article/details/110149143

[3] TorchText usage examples and complete code: https://blog.csdn.net/nlpuser/article/details/88067167

[4] Htring/NLP_Applications: https://github.com/Htring/NLP_Applications
