小言_互联网's Blog

Natural Language Processing (NLP) Hands-On Programming 1.3: Predicting Countries with Word Vectors


Content index: https://blog.csdn.net/weixin_43093481/article/details/114989382?spm=1001.2014.3001.5501
Course notes: 1.3 Vector Space Models
Code: https://github.com/Ogmx/Natural-Language-Processing-Specialization
——————————————————————————————————————————

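Assignment 3: Getting to Know Word Vectors

Learning objectives:
 This lab explores word vectors. In NLP tasks, words are usually represented as word vectors, which encode the meaning of each word.
 Word vectors can be trained with many different machine learning methods. This lab does not cover how to generate word vectors; it focuses on how to use them, because in real applications one typically uses pre-trained word vectors rather than training them from scratch.

Specifically, you will learn to:

  • Predict analogies between words.
  • Use PCA to reduce the dimensionality of word embeddings and visualize them.
  • Compare word embeddings with a similarity measure (cosine similarity).
  • Understand how vector space models work.

1.0 Predicting the Country from the Capital

 Given the name of a capital city, predict the country it belongs to.

1.1 Importing the Data

As before, first import the necessary Python libraries and the dataset. This time the dataset is loaded as a Pandas DataFrame, a data type that is very common in data science. Because the data is fairly large, loading it may take a while.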

# Run this cell to import packages.
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from utils import get_vectors
data = pd.read_csv('capitals.txt', delimiter=' ')
data.columns = ['city1', 'country1', 'city2', 'country2']

# print first five elements in the DataFrame
data.head(5)
    city1 country1    city2     country2
0  Athens   Greece  Bangkok     Thailand
1  Athens   Greece  Beijing        China
2  Athens   Greece   Berlin      Germany
3  Athens   Greece     Bern  Switzerland
4  Athens   Greece    Cairo        Egypt
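
One thing worth checking when a headerless file is loaded this way: by default `pd.read_csv` treats the first line of the file as the header, so assigning `data.columns` afterwards would silently drop the first record. The sketch below uses a small made-up in-memory sample (not the real `capitals.txt`) to contrast the two behaviors. Whether this matters for the actual file depends on whether it starts with a header line, which is not stated here.

```python
import io
import pandas as pd

# Hypothetical three-line sample in the same space-separated layout, with no header line.
sample = (
    "Athens Greece Baghdad Iraq\n"
    "Athens Greece Bangkok Thailand\n"
    "Athens Greece Beijing China\n"
)
cols = ['city1', 'country1', 'city2', 'country2']

# Default behavior: the first line is consumed as the header, then overwritten by the rename.
df_default = pd.read_csv(io.StringIO(sample), delimiter=' ')
df_default.columns = cols

# Declaring that there is no header keeps every line as data.
df_no_header = pd.read_csv(io.StringIO(sample), delimiter=' ', header=None, names=cols)

print(len(df_default), len(df_no_header))
```

If the first record turns out to be missing, passing `header=None, names=[...]` as in the second call is one way to keep it.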

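Downloading the full word embedding dataset

The full Google News word embedding dataset is about 3.64 GB, so this lab uses only a small subset of it, stored in the file word_embeddings_capitals.p (note that the loading code below opens word_embeddings_subset.p).

If you want to download the full dataset for your own tasks, proceed as follows:

  • Open the download page.
  • Search the page for 'GoogleNews-vectors-negative300.bin.gz' and click the link to download it.

Now load the pre-trained word embeddings into a Python dictionary.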

# open the pickle file in a context manager so the file handle is closed afterwards
with open("word_embeddings_subset.p", "rb") as f:
    word_embeddings = pickle.load(f)
len(word_embeddings)  # there should be 243 words that will be used in this assignment

243

Each word embedding is a 300-dimensional vector.

print("dimension: {}".format(word_embeddings['Spain'].shape[0]))

dimension: 300

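Predicting relationships between words

Next, implement a function that uses word embeddings to predict relationships between words.

  • The function takes three words as input.
  • The first two words are related to each other.
  • Based on the relationship between the first two words, it uses the third word to predict a fourth word that has the same relationship.
  • For example, "Athens is to Greece as Bangkok is to ______"?

[Figure: illustration of the word-analogy method; image not preserved]

Implement a function that predicts the country a given capital belongs to.
Following the method illustrated in the figure, this requires computing either the cosine similarity or the Euclidean distance.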
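
To make the idea concrete, here is a minimal sketch on made-up 2-D vectors (real embeddings have 300 dimensions, and the numbers here are purely illustrative). The offset from a capital to its country is added to another capital, and the nearest remaining word is taken as the prediction.

```python
import numpy as np

# Made-up 2-D vectors: the first axis loosely encodes "which country",
# the second encodes "city vs. country".
toy = {
    'Athens':   np.array([1.0, 0.0]),
    'Greece':   np.array([1.0, 1.0]),
    'Bangkok':  np.array([3.0, 0.0]),
    'Thailand': np.array([3.0, 1.0]),
    'Cairo':    np.array([5.0, 0.0]),
    'Egypt':    np.array([5.0, 1.0]),
}

# "Athens is to Greece as Bangkok is to ?":
# add the capital-to-country offset to Bangkok.
target = toy['Greece'] - toy['Athens'] + toy['Bangkok']

# Pick the closest word (Euclidean distance here), skipping the three input words.
candidates = [w for w in toy if w not in {'Athens', 'Greece', 'Bangkok'}]
prediction = min(candidates, key=lambda w: np.linalg.norm(toy[w] - target))
print(prediction)
```

In this toy space the offset is identical for every capital/country pair, so the target lands exactly on 'Thailand'. With real embeddings the match is only approximate, which is why a similarity measure is needed.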

1.2 Cosine Similarity

The cosine similarity is defined as:

$$\cos(\theta)=\frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\|\|\mathbf{B}\|}=\frac{\sum_{i=1}^{n} A_{i} B_{i}}{\sqrt{\sum_{i=1}^{n} A_{i}^{2}} \sqrt{\sum_{i=1}^{n} B_{i}^{2}}}\tag{1}$$

$\mathbf{A}$ and $\mathbf{B}$ are word vectors, and $A_i$ and $B_i$ denote the $i$-th element of each vector.

  • When $\mathbf{A}$ and $\mathbf{B}$ are identical, the cosine similarity is $\cos(\theta) = 1$.
  • Conversely, when they point in exactly opposite directions, i.e. $\mathbf{A} = -\mathbf{B}$, the cosine similarity is $\cos(\theta) = -1$.
  • If $\cos(\theta) = 0$, the vectors are orthogonal (perpendicular).
  • Values between 0 and 1 indicate a degree of similarity.
  • Values between -1 and 0 indicate a degree of dissimilarity.
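
The three boundary cases listed here can be checked numerically. The following is a small sketch with arbitrary example vectors; the helper name `cos_sim` is only an illustration and is not part of the assignment code.

```python
import numpy as np

def cos_sim(a, b):
    # Dot product divided by the product of the two vector norms.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])

same = cos_sim(a, a)        # identical vectors
opposite = cos_sim(a, -a)   # A = -B
orthogonal = cos_sim(np.array([1.0, 0.0]), np.array([0.0, 1.0]))  # perpendicular

print(same, opposite, orthogonal)
```

Because of floating-point rounding, the first two results may differ from 1 and -1 in the last few digits, so comparisons are best done with `np.isclose`.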

Exercise: implement a function that takes two word vectors and returns their cosine similarity.


# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def cosine_similarity(A, B):
    '''
    Input:
        A: a numpy array which corresponds to a word vector
        B: A numpy array which corresponds to a word vector
    Output:
        cos: numerical number representing the cosine similarity between A and B.
    '''

    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    
    dot = np.dot(A,B)
    norma = np.linalg.norm(A)
    normb = np.linalg.norm(B)
    cos = dot / (norma*normb)

    ### END CODE HERE ###
    return cos
# feel free to try different words
king = word_embeddings['king']
queen = word_embeddings['queen']

cosine_similarity(king, queen)

0.6510956

1.3 Euclidean Distance

Implement a function that uses the Euclidean distance to measure how similar two vectors are.

The Euclidean distance is defined as:

$$d(\mathbf{A}, \mathbf{B})=d(\mathbf{B}, \mathbf{A})=\sqrt{(A_{1}-B_{1})^{2}+(A_{2}-B_{2})^{2}+\cdots+(A_{n}-B_{n})^{2}}=\sqrt{\sum_{i=1}^{n}(A_{i}-B_{i})^{2}}$$

  • $n$ is the number of elements in the vector.
  • $\mathbf{A}$ and $\mathbf{B}$ are word vectors.
  • The more similar two words are, the closer their Euclidean distance is to 0.
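
One practical difference between the two measures is that the Euclidean distance depends on vector length, whereas the cosine similarity depends only on direction. The sketch below uses an arbitrary example vector and a scaled copy of it to show this.

```python
import numpy as np

a = np.array([1.0, 2.0, 2.0])
b = 10 * a  # same direction, ten times the length

# Euclidean distance grows with the difference in length.
euclid = np.linalg.norm(a - b)

# Cosine similarity ignores length and only compares direction.
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclid)
print(cosine)
```

Here `a` has norm 3 and `a - b` equals `-9 * a`, so the distance is 27 even though the two vectors point in exactly the same direction and their cosine similarity is 1.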

Exercise: implement a function that computes the Euclidean distance between two vectors.

# UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def euclidean(A, B):
    """
    Input:
        A: a numpy array which corresponds to a word vector
        B: A numpy array which corresponds to a word vector
    Output:
        d: numerical number representing the Euclidean distance between A and B.
    """

    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###

    # euclidean distance

    d = np.linalg.norm(A-B)

    ### END CODE HERE ###

    return d

# Test your function
euclidean(king, queen)

2.4796925

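1.4 Predicting the Country from the Capital

In this section, the similarity functions completed above are used to solve the task. Implement a function that takes three words and the embedding dictionary as input and returns the corresponding country. For example, given the input:

  • 1: Athens 2: Greece 3: Baghdad,

the model should predict the corresponding country 4: Iraq.

Exercise

  1. To predict the country, refer to the King - Man + Woman = Queen example above and apply the same mechanism, combining the word embeddings with a similarity function.
  2. Iterate over the embedding dictionary and compute the cosine similarity between your vector and the embedding of the current word.
  3. Finally, make sure the output differs from all of the input words, and return the highest-scoring entry.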
# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def get_country(city1, country1, city2, embeddings):
    """
    Input:
        city1: a string (the capital city of country1)
        country1: a string (the country of capital1)
        city2: a string (the capital city of country2)
        embeddings: a dictionary where the keys are words and values are their embeddings
    Output:
        country: a tuple (word, similarity) with the most likely country and its similarity score
    """
    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###

    # store the city1, country 1, and city 2 in a set called group
    group = set((city1,country1, city2))

    # get embeddings of city 1
    city1_emb = embeddings[city1]

    # get embedding of country 1
    country1_emb = embeddings[country1]

    # get embedding of city 2
    city2_emb = embeddings[city2]

    # get embedding of country 2 (it's a combination of the embeddings of country 1, city 1 and city 2)
    # Remember: King - Man + Woman = Queen
    vec = country1_emb - city1_emb + city2_emb

    # Initialize the similarity to -1 (it will be replaced by similarities that are closer to +1)
    similarity = -1

    # initialize country to an empty string
    country = ''

    # loop through all words in the embeddings dictionary
    for word in embeddings.keys():

        # first check that the word is not already in the 'group'
        if word not in group:

            # get the word embedding
            word_emb = embeddings[word]

            # calculate cosine similarity between embedding of country 2 and the word in the embeddings dictionary
            cur_similarity = cosine_similarity(word_emb,vec)

            # if the cosine similarity is more similar than the previously best similarity...
            if cur_similarity > similarity:

                # update the similarity to the new, better similarity
                similarity = cur_similarity

                # store the country as a tuple, which contains the word and the similarity
                country = (word, similarity)

    ### END CODE HERE ###

    return country
# Testing your function, note to make it more robust you can return the 5 most similar words.
get_country('Athens', 'Greece', 'Cairo', word_embeddings)

('Egypt', 0.7626821)
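
The comment in the test cell suggests returning the five most similar words for robustness. A minimal sketch of such a variant follows; the function name `get_top_k_countries` and the toy embeddings are assumptions made for illustration and are not part of the original assignment.

```python
import numpy as np

def get_top_k_countries(city1, country1, city2, embeddings, k=5):
    """Hypothetical variant of get_country: return the k best (word, score) pairs."""
    group = {city1, country1, city2}
    vec = embeddings[country1] - embeddings[city1] + embeddings[city2]
    scores = []
    for word, emb in embeddings.items():
        if word in group:
            continue  # never return one of the input words
        cos = np.dot(emb, vec) / (np.linalg.norm(emb) * np.linalg.norm(vec))
        scores.append((word, float(cos)))
    # Highest cosine similarity first.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]

# Tiny made-up embeddings, only so the sketch runs on its own.
toy = {
    'Athens': np.array([1.0, 0.0]),  'Greece': np.array([1.0, 1.0]),
    'Cairo': np.array([5.0, 0.0]),   'Egypt': np.array([5.0, 1.0]),
    'Bangkok': np.array([3.0, 0.0]), 'Thailand': np.array([3.0, 1.0]),
}
top = get_top_k_countries('Athens', 'Greece', 'Cairo', toy, k=3)
print(top)
```

With the real data the call would presumably look like `get_top_k_countries('Athens', 'Greece', 'Cairo', word_embeddings)`. Inspecting the runners-up shows how close the competing candidates are to the top answer.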

1.5 Model Accuracy

This section uses the dataset to test the accuracy of the model:

$$\text{Accuracy}=\frac{\text{Correct \# of predictions}}{\text{Total \# of predictions}}$$

Exercise: implement a function that computes the model's accuracy on the dataset. Iterate over every row to get the corresponding words and pass them to the get_country function above.

Hints

for i, row in data.iterrows():
     print(row['city1'])

Athens
Athens
Athens
Athens
Athens
Athens
Athens
Athens
Athens

# UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def get_accuracy(word_embeddings, data):
    '''
    Input:
        word_embeddings: a dictionary where the key is a word and the value is its embedding
        data: a pandas dataframe containing all the country and capital city pairs
    
    Output:
        accuracy: the accuracy of the model
    '''

    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    # initialize num correct to zero
    num_correct = 0

    # loop through the rows of the dataframe
    for i, row in data.iterrows():

        # get city1
        city1 = row['city1']

        # get country1
        country1 = row['country1']

        # get city2
        city2 =  row['city2']

        # get country2
        country2 = row['country2']

        # use get_country to find the predicted country2
        predicted_country2, _ = get_country(city1,country1,city2,word_embeddings)

        # if the predicted country2 is the same as the actual country2...
        if predicted_country2 == country2:
            # increment the number of correct by 1
            num_correct += 1

    # get the number of rows in the data dataframe (length of dataframe)
    m = len(data)

    # calculate the accuracy by dividing the number correct by m
    accuracy = num_correct / m

    ### END CODE HERE ###
    return accuracy

accuracy = get_accuracy(word_embeddings, data)
print(f"Accuracy is {accuracy:.2f}")

Accuracy is 0.92
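
A single accuracy figure does not show which analogies fail. As a rough sketch, the same loop can also collect the misclassified rows. The helper name `collect_errors`, the stand-in predictor, and the sample rows are all made up for illustration.

```python
def collect_errors(rows, predict):
    """Return (accuracy, errors) for rows of (city1, country1, city2, country2).

    `predict` is any callable taking (city1, country1, city2) and returning a country name.
    """
    errors = []
    total = 0
    for city1, country1, city2, country2 in rows:
        total += 1
        guess = predict(city1, country1, city2)
        if guess != country2:
            # remember the query city, the expected country, and the wrong guess
            errors.append((city2, country2, guess))
    accuracy = (total - len(errors)) / total if total else 0.0
    return accuracy, errors

# Stand-in predictor: a made-up lookup table with one deliberate mistake (Bern).
lookup = {'Bangkok': 'Thailand', 'Cairo': 'Egypt', 'Bern': 'Germany', 'Berlin': 'Germany'}
rows = [
    ('Athens', 'Greece', 'Bangkok', 'Thailand'),
    ('Athens', 'Greece', 'Cairo', 'Egypt'),
    ('Athens', 'Greece', 'Bern', 'Switzerland'),
    ('Athens', 'Greece', 'Berlin', 'Germany'),
]
accuracy, errors = collect_errors(rows, lambda c1, k1, c2: lookup[c2])
print(accuracy, errors)
```

With the objects from this lab, one could presumably pass `data.itertuples(index=False)` as `rows` and `lambda c1, k1, c2: get_country(c1, k1, c2, word_embeddings)[0]` as `predict`.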

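2.0 Visualization with PCA

This section examines the distances between vectors after their dimensionality has been reduced with PCA; for details see: principal component analysis (PCA)

This lab uses a 300-dimensional vector space. Although this works well computationally, 300-dimensional data cannot be visualized or understood intuitively, so PCA is used to reduce the dimensionality.

PCA projects high-dimensional vectors into a lower-dimensional space while retaining as much information as possible. Here, maximum information means that the Euclidean distance between the original vectors and their projections is minimized, so vectors that are close to each other in the high-dimensional space remain close after the reduction.

Similar words cluster together. For example, 'sad', 'happy', and 'joyful' all describe emotions and therefore lie close to each other; likewise 'oil', 'gas', and 'petroleum' all describe natural resources, and 'city', 'village', and 'town' can be regarded as synonyms.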

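Before visualizing the words, first reduce the word vectors to 2 dimensions with PCA, as follows:

  1. Mean-normalize the data.
  2. Compute the covariance matrix ($\Sigma$).
  3. Compute the eigenvalues and eigenvectors of the covariance matrix.
  4. Multiply the first K eigenvectors by the normalized data to obtain the transformed data.

[Figure: result of the transformation; image not preserved]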
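
The eigenvalues from step 3 also indicate how much of the data's variance each principal component captures, which is one common way to judge how much information a K-dimensional projection keeps. The sketch below illustrates this on randomly generated data rather than the lab's word embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: 200 points in 3-D whose spread is mostly along the first axis.
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.2])

# Steps 1 and 2: mean-normalize, then compute the covariance matrix.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# Step 3: eigh returns eigenvalues in ascending order, so reverse them.
eigen_vals, _ = np.linalg.eigh(cov)
eigen_vals = eigen_vals[::-1]

# Fraction of the total variance captured by each principal component.
explained = eigen_vals / eigen_vals.sum()
print(explained)
print("kept by the first 2 components:", explained[:2].sum())
```

When the first few fractions sum to a value close to 1, little is lost by dropping the remaining components.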

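Exercise:

Implement a function that takes a dataset in which each row is a word vector and returns the dimensionality-reduced result.

  • The word vectors are 300-dimensional.
  • Use PCA to compress the 300 dimensions down to n_components.
  • The resulting matrix has shape (m, n_components).
  • First, de-mean the data.
  • Use linalg.eigh to get the eigenvalues and eigenvectors. Use eigh rather than eig because the covariance matrix R is symmetric; eigh gives a substantial performance gain over eig.
  • Sort the eigenvectors and eigenvalues in decreasing order of the eigenvalues.
  • Take a subset of the eigenvectors (n_components selects how many principal components to use).
  • Return the transformed data by multiplying the eigenvectors with the de-meaned data.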
# UNQ_C5 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def compute_pca(X, n_components=2):
    """
    Input:
        X: of dimension (m,n) where each row corresponds to a word vector
        n_components: Number of components you want to keep.
    Output:
        X_reduced: data projected onto the top n_components principal components, of dimension (m, n_components)
    """

    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    # mean center the data
    X_demeaned = X - X.mean(axis=0,keepdims=True)

    # calculate the covariance matrix
    covariance_matrix = np.cov(X_demeaned,rowvar=False)

    # calculate eigenvectors & eigenvalues of the covariance matrix
    eigen_vals, eigen_vecs = np.linalg.eigh(covariance_matrix)

    # sort eigenvalue in increasing order (get the indices from the sort)
    idx_sorted = np.argsort(eigen_vals)
    
    # reverse the order so that it's from highest to lowest.
    idx_sorted_decreasing = idx_sorted[::-1]

    # sort the eigen values by idx_sorted_decreasing
    eigen_vals_sorted = eigen_vals[idx_sorted_decreasing]

    # sort eigenvectors using the idx_sorted_decreasing indices
    eigen_vecs_sorted = eigen_vecs[:,idx_sorted_decreasing]

    # select the first n eigenvectors (n is desired dimension
    # of rescaled data array, or dims_rescaled_data)
    eigen_vecs_subset = eigen_vecs_sorted[:,:n_components]

    # transform the data by multiplying the transpose of the eigenvectors 
    # with the transpose of the de-meaned data
    # Then take the transpose of that product.
    X_reduced = np.matmul(eigen_vecs_subset.T,X_demeaned.T).T

    ### END CODE HERE ###

    return X_reduced

# Testing your function
np.random.seed(1)
X = np.random.rand(3, 10)
X_reduced = compute_pca(X, n_components=2)
print("Your original matrix was " + str(X.shape) + " and it became:")
print(X_reduced)

Your original matrix was (3, 10) and it became:
[[ 0.43437323  0.49820384]
 [ 0.42077249 -0.50351448]
 [-0.85514571  0.00531064]]
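
As a sanity check, the eigendecomposition route can be compared with a singular value decomposition of the centered data, which should yield the same projection up to the sign of each component. This is a sketch of that cross-check on the same random test matrix; it recomputes the projection inline instead of calling `compute_pca` so that it runs on its own.

```python
import numpy as np

np.random.seed(1)
X = np.random.rand(3, 10)
X_centered = X - X.mean(axis=0, keepdims=True)

# Route 1: eigendecomposition of the covariance matrix, as in compute_pca.
vals, vecs = np.linalg.eigh(np.cov(X_centered, rowvar=False))
top2 = vecs[:, np.argsort(vals)[::-1][:2]]
proj_eigh = X_centered @ top2

# Route 2: SVD of the centered data. The rows of Vt are the principal
# directions, already sorted by decreasing singular value.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
proj_svd = X_centered @ Vt[:2].T

# Each component is only defined up to a sign flip, so compare absolute values.
same_up_to_sign = np.allclose(np.abs(proj_eigh), np.abs(proj_svd))
print(same_up_to_sign)
```

The sign ambiguity also explains why a plot produced on another machine or library version may look mirrored while the relative positions of the points stay the same.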

Finally, use the PCA function implemented above for visualization. You will see that words with similar meanings end up next to each other.

words = ['oil', 'gas', 'happy', 'sad', 'city', 'town',
         'village', 'country', 'continent', 'petroleum', 'joyful']

# given a list of words and the embeddings, it returns a matrix with all the embeddings
X = get_vectors(word_embeddings, words)

print('You have 11 words each of 300 dimensions thus X.shape is:', X.shape)

You have 11 words each of 300 dimensions thus X.shape is: (11, 300)

# We have done the plotting for you. Just run this cell.
result = compute_pca(X, 2)
plt.scatter(result[:, 0], result[:, 1])
for i, word in enumerate(words):
    plt.annotate(word, xy=(result[i, 0] - 0.05, result[i, 1] + 0.1))

plt.show()

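What do you notice?

The words 'gas', 'oil', and 'petroleum' lie close together because their vectors are close to each other. Likewise, 'sad', 'joyful', and 'happy' all express emotions and are therefore also close to each other.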


Reposted from: https://blog.csdn.net/weixin_43093481/article/details/116406130