
Natural Language Processing (NLP) Programming in Practice - 1.3 Predicting Countries with Word Vectors


Content index: https://blog.csdn.net/weixin_43093481/article/details/114989382?spm=1001.2014.3001.5501
Course notes: 1.3 Vector Space Models
Code: https://github.com/Ogmx/Natural-Language-Processing-Specialization
——————————————————————————————————————————

Assignment 3: Getting to Know Word Vectors

Learning objectives:
 This lab explores word vectors. In NLP tasks, words are usually represented as word vectors (word embeddings), which encode the meaning of each word.
 Word vectors can be trained with a variety of machine learning methods. This lab does not cover how to train word vectors; it focuses on how to use them, because in real applications you will usually load pre-trained embeddings rather than train your own.

Specifically, you will learn how to:

  • Predict analogies between words.
  • Use PCA to reduce the dimensionality of word embeddings and visualize them
  • Compare word embeddings using a similarity measure (cosine similarity)
  • Understand how vector space models work

1.0 Predicting Countries from Their Capitals

 Given the name of a capital city, predict which country it belongs to.

1.1 Importing the Data

As before, start by importing the necessary Python libraries and the dataset. The dataset is a Pandas DataFrame, a data type that is very common in data science. Because the data is fairly large, loading it may take a while.

# Run this cell to import packages.
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from utils import get_vectors
data = pd.read_csv('capitals.txt', delimiter=' ')
data.columns = ['city1', 'country1', 'city2', 'country2']

# print first five elements in the DataFrame
data.head(5)
    city1   country1  city2    country2
0   Athens  Greece    Bangkok  Thailand
1   Athens  Greece    Beijing  China
2   Athens  Greece    Berlin   Germany
3   Athens  Greece    Bern     Switzerland
4   Athens  Greece    Cairo    Egypt

Downloading the full word-embedding dataset

The full Google News word-embedding dataset is 3.64 GB, so this lab uses only a small subset of it, stored in the file word_embeddings_capitals.p.

If you want to download the full dataset for your own tasks, do the following (a loading sketch follows this list):

  • Go to the download page
  • Search the page for 'GoogleNews-vectors-negative300.bin.gz' and click it to download
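
If you do download the full file, one common way to read it is with gensim's KeyedVectors. The sketch below is not part of the assignment; it assumes gensim is installed and the .bin.gz file sits in the working directory, and it builds a small {word: vector} dictionary in the same format as the subset used here.

# A minimal sketch (assumes gensim is installed and the full
# GoogleNews-vectors-negative300.bin.gz file has been downloaded).
from gensim.models import KeyedVectors

full_embeddings = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin.gz', binary=True)

# Keep only the words needed for a given task, in the same
# {word: 300-d numpy array} format as the pickled subset.
words_needed = ['Athens', 'Greece', 'Bangkok', 'Thailand']  # illustrative subset
subset = {w: full_embeddings[w] for w in words_needed if w in full_embeddings}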

Now load the pre-trained word embeddings as a Python dictionary:

word_embeddings = pickle.load(open("word_embeddings_subset.p", "rb"))
len(word_embeddings)  # there should be 243 words that will be used in this assignment

243

Each word embedding is a 300-dimensional vector:

print("dimension: {}".format(word_embeddings['Spain'].shape[0]))

dimension: 300

Predicting relationships between words

Next, implement a function that uses word embeddings to predict relationships between words:

  • The function takes three words as input.
  • The first two words are related to each other.
  • Based on the relationship between the first two words, it predicts a fourth word that has the same relationship to the third.
  • For example, "Athens is to Greece as Bangkok is to ______"?

For example: implement a function that predicts the country a given capital city belongs to.
Following the analogy approach described above, this is done by computing cosine similarity or Euclidean distance; a short sketch of the vector arithmetic follows.
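
To make the analogy idea concrete, here is a minimal sketch (not the graded function; that is implemented in section 1.4): the difference between a country vector and its capital vector captures the "capital of" relationship, so adding that difference to another capital should land near its country. It assumes the listed words are present in the loaded subset.

# A minimal sketch of the analogy idea; the graded get_country()
# function in section 1.4 searches the whole vocabulary instead.
greece  = word_embeddings['Greece']
athens  = word_embeddings['Athens']
bangkok = word_embeddings['Bangkok']

# "Athens is to Greece as Bangkok is to ?"
candidate = greece - athens + bangkok

# Compare the candidate vector against a few words with cosine similarity
# (the cosine_similarity function itself is implemented in section 1.2).
for w in ['Thailand', 'Germany']:
    v = word_embeddings[w]
    sim = np.dot(candidate, v) / (np.linalg.norm(candidate) * np.linalg.norm(v))
    print(w, sim)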

1.2 Cosine Similarity

Cosine similarity is defined as:

\cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\|\,\|\mathbf{B}\|} = \frac{\sum_{i=1}^{n} A_{i} B_{i}}{\sqrt{\sum_{i=1}^{n} A_{i}^{2}} \sqrt{\sum_{i=1}^{n} B_{i}^{2}}} \tag{1}

$A$ and $B$ are word vectors, and $A_i$, $B_i$ are the $i$-th elements of those vectors.

  • If A and B are identical, the cosine similarity is $\cos(\theta) = 1$.
  • Conversely, if they are complete opposites, i.e. $A = -B$, then $\cos(\theta) = -1$.
  • If $\cos(\theta) = 0$, the vectors are orthogonal (perpendicular).
  • Values between 0 and 1 indicate similarity.
  • Values between -1 and 0 indicate dissimilarity (a quick numeric check of these cases follows this list).
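
As a quick sanity check of these properties, the three boundary cases can be reproduced with small toy vectors. This is a minimal sketch using plain NumPy and is not part of the graded assignment.

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([-2.0, 1.0])   # orthogonal to a (their dot product is 0)

def cos_sim(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cos_sim(a, a))    #  1.0 -> identical vectors
print(cos_sim(a, -a))   # -1.0 -> opposite vectors
print(cos_sim(a, b))    #  0.0 -> orthogonal vectors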

Practice: implement a function that takes two word vectors as input and returns their cosine similarity.

Hints

# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def cosine_similarity(A, B):
    '''
    Input:
        A: a numpy array which corresponds to a word vector
        B: A numpy array which corresponds to a word vector
    Output:
        cos: numerical number representing the cosine similarity between A and B.
    '''

    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    
    dot = np.dot(A,B)
    norma = np.linalg.norm(A)
    normb = np.linalg.norm(B)
    cos = dot / (norma*normb)

    ### END CODE HERE ###
    return cos
# feel free to try different words
king = word_embeddings['king']
queen = word_embeddings['queen']

cosine_similarity(king, queen)

0.6510956

1.3 Euclidean Distance

Now implement a function that measures the similarity between two vectors using the Euclidean distance.

The Euclidean distance is defined as:

d(\mathbf{A}, \mathbf{B}) = d(\mathbf{B}, \mathbf{A}) = \sqrt{(A_1 - B_1)^2 + (A_2 - B_2)^2 + \cdots + (A_n - B_n)^2} = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}

  • $n$ is the number of elements in each vector
  • $A$ and $B$ are word vectors
  • The more similar two words are, the closer their Euclidean distance is to 0 (a brief note on how this relates to cosine similarity follows this list)
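
A small side note, not required by the assignment: for vectors normalized to unit length, the two measures are directly related, which is why they often rank nearest neighbors similarly:

\|\mathbf{A} - \mathbf{B}\|^2 = \|\mathbf{A}\|^2 + \|\mathbf{B}\|^2 - 2\,\mathbf{A}\cdot\mathbf{B} = 2 - 2\cos(\theta) \quad \text{when } \|\mathbf{A}\| = \|\mathbf{B}\| = 1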

Practice: implement a function that computes the Euclidean distance between two vectors.

# UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def euclidean(A, B):
    """
    Input:
        A: a numpy array which corresponds to a word vector
        B: A numpy array which corresponds to a word vector
    Output:
        d: numerical number representing the Euclidean distance between A and B.
    """

    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###

    # euclidean distance

    d = np.linalg.norm(A-B)

    ### END CODE HERE ###

    return d

# Test your function
euclidean(king, queen)

2.4796925

1.4 Predicting Countries from Their Capitals

In this section, use the similarity functions implemented above to complete the task. Write a function that takes three words plus the embeddings dictionary and returns the corresponding country. For example, given the inputs:

  • 1: Athens 2: Greece 3: Baghdad,

the model should predict the country 4: Iraq.

Practice

  1. To predict the country, follow the King - Man + Woman = Queen example above and use the same mechanism, combining the word embeddings with a similarity function.
  2. Iterate over the embeddings dictionary and compute the cosine similarity between your candidate vector and each word's embedding.
  3. Finally, make sure the returned word differs from all of the input words, and return the highest-scoring match.
# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def get_country(city1, country1, city2, embeddings):
    """
    Input:
        city1: a string (the capital city of country1)
        country1: a string (the country of capital1)
        city2: a string (the capital city of country2)
        embeddings: a dictionary where the keys are words and values are their embeddings
    Output:
        countries: a dictionary with the most likely country and its similarity score
    """
    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###

    # store the city1, country 1, and city 2 in a set called group
    group = set((city1,country1, city2))

    # get embeddings of city 1
    city1_emb = embeddings[city1]

    # get embedding of country 1
    country1_emb = embeddings[country1]

    # get embedding of city 2
    city2_emb = embeddings[city2]

    # get embedding of country 2 (it's a combination of the embeddings of country 1, city 1 and city 2)
    # Remember: King - Man + Woman = Queen
    vec = country1_emb - city1_emb + city2_emb

    # Initialize the similarity to -1 (it will be replaced by a similarities that are closer to +1)
    similarity = -1

    # initialize country to an empty string
    country = ''

    # loop through all words in the embeddings dictionary
    for word in embeddings.keys():

        # first check that the word is not already in the 'group'
        if word not in group:

            # get the word embedding
            word_emb = embeddings[word]

            # calculate cosine similarity between embedding of country 2 and the word in the embeddings dictionary
            cur_similarity = cosine_similarity(word_emb,vec)

            # if the cosine similarity is more similar than the previously best similarity...
            if cur_similarity > similarity:

                # update the similarity to the new, better similarity
                similarity = cur_similarity

                # store the country as a tuple, which contains the word and the similarity
                country = (word, similarity)

    ### END CODE HERE ###

    return country
# Testing your function, note to make it more robust you can return the 5 most similar words.
get_country('Athens', 'Greece', 'Cairo', word_embeddings)

('Egypt', 0.7626821)
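
As the comment in the test cell suggests, a more robust variant could return the five most similar candidates instead of only the single best one. Below is a minimal sketch of such a variant; get_country_topk is a hypothetical helper, not part of the graded assignment.

# A sketch of a top-k variant (hypothetical helper, not graded).
def get_country_topk(city1, country1, city2, embeddings, k=5):
    group = {city1, country1, city2}
    vec = embeddings[country1] - embeddings[city1] + embeddings[city2]

    # score every candidate word that is not one of the inputs
    scores = [(word, cosine_similarity(embeddings[word], vec))
              for word in embeddings if word not in group]

    # sort by similarity, highest first, and keep the top k
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:k]

# Example usage:
get_country_topk('Athens', 'Greece', 'Cairo', word_embeddings)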

1.5 Model Accuracy

In this section, use the dataset to measure the accuracy of the model:

\text{Accuracy} = \frac{\text{Correct \# of predictions}}{\text{Total \# of predictions}}

Practice: implement a function that computes the model's accuracy on the dataset. Iterate over every row, get the corresponding words, and feed them into the get_country function above.

Hints

for i, row in data.iterrows():
     print(row['city1'])

Athens
Athens
Athens
Athens
Athens
Athens
Athens
Athens
Athens

# UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def get_accuracy(word_embeddings, data):
    '''
    Input:
        word_embeddings: a dictionary where the key is a word and the value is its embedding
        data: a pandas dataframe containing all the country and capital city pairs
    
    Output:
        accuracy: the accuracy of the model
    '''

    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    # initialize num correct to zero
    num_correct = 0

    # loop through the rows of the dataframe
    for i, row in data.iterrows():

        # get city1
        city1 = row['city1']

        # get country1
        country1 = row['country1']

        # get city2
        city2 =  row['city2']

        # get country2
        country2 = row['country2']

        # use get_country to find the predicted country2
        predicted_country2, _ = get_country(city1,country1,city2,word_embeddings)

        # if the predicted country2 is the same as the actual country2...
        if predicted_country2 == country2:
            # increment the number of correct by 1
            num_correct += 1

    # get the number of rows in the data dataframe (length of dataframe)
    m = len(data)

    # calculate the accuracy by dividing the number correct by m
    accuracy = num_correct / m

    ### END CODE HERE ###
    return accuracy

accuracy = get_accuracy(word_embeddings, data)
print(f"Accuracy is {accuracy:.2f}")

Accuracy is 0.92
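
Looping over every candidate word in pure Python for every row works, but it becomes slow for larger vocabularies. A common speed-up, sketched below under the assumption that all embeddings fit in memory (not part of the graded assignment), is to stack the embeddings into one matrix and compute every cosine similarity with a single matrix-vector product.

# A sketch of a vectorized nearest-neighbour lookup (not graded).
words = list(word_embeddings.keys())
emb_matrix = np.vstack([word_embeddings[w] for w in words])      # shape (V, 300)
emb_norms = np.linalg.norm(emb_matrix, axis=1)                   # shape (V,)

def get_country_fast(city1, country1, city2, embeddings=word_embeddings):
    vec = embeddings[country1] - embeddings[city1] + embeddings[city2]
    sims = emb_matrix @ vec / (emb_norms * np.linalg.norm(vec))  # all cosine similarities
    # pick the best-scoring word that is not one of the inputs
    for idx in np.argsort(-sims):
        if words[idx] not in (city1, country1, city2):
            return words[idx], sims[idx]

get_country_fast('Athens', 'Greece', 'Cairo')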

2.0 Visualization with PCA

This section looks at how the distances between vectors behave after dimensionality reduction. The dimensionality is reduced with PCA; for background see principal component analysis (PCA).

This lab uses a 300-dimensional vector space. Although it works well computationally, 300-dimensional data cannot be visualized or understood directly, which is why PCA is used to reduce the dimensionality.

PCA projects high-dimensional vectors into a lower-dimensional space while preserving as much information as possible. "Maximum information" here means that the Euclidean distance between the original vectors and their projections is minimized, so vectors that are close to each other in the high-dimensional space remain close after the projection.

Similar words will be clustered together: 'sad', 'happy', and 'joyful' all describe emotions, so they should appear close to one another; likewise 'oil', 'gas', and 'petroleum' all describe natural resources, and 'city', 'village', and 'town' are synonyms.

Before visualizing the words, first use PCA to reduce the word vectors to two dimensions. The steps are:

  1. Mean-normalize the data
  2. Compute the covariance matrix ($\Sigma$).
  3. Compute the eigenvalues and eigenvectors of the covariance matrix
  4. Multiply the first K eigenvectors with the mean-normalized data. The resulting transformation is shown below:
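
In matrix form (a sketch of the transformation the steps above describe; the symbols $\tilde{X}$ and $E_k$ are notation introduced here, not from the original text): with $\tilde{X} \in \mathbb{R}^{m \times n}$ the mean-centered data and $E_k \in \mathbb{R}^{n \times k}$ the matrix whose columns are the eigenvectors with the $k$ largest eigenvalues,

X_{\text{reduced}} = \tilde{X} \, E_k \in \mathbb{R}^{m \times k}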

Practice:

Implement a function whose input is a dataset in which each row is a word vector, and whose output is the dimension-reduced result.

  • The word vectors are 300-dimensional
  • Use PCA to compress the 300 dimensions down to n_components
  • The resulting matrix has dimensions (m, n_components)
  • First de-mean the data
  • Use linalg.eigh to get the eigenvalues. Use eigh rather than eig because R (the covariance matrix) is symmetric; using eigh instead of eig also gives a large performance gain
  • Sort the eigenvectors and eigenvalues in order of decreasing eigenvalue
  • Take a subset of the eigenvectors (use n_components to choose how many principal components to keep)
  • Return the transformed data, obtained by multiplying the eigenvectors with the de-meaned data
# UNQ_C5 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def compute_pca(X, n_components=2):
    """
    Input:
        X: of dimension (m,n) where each row corresponds to a word vector
        n_components: Number of components you want to keep.
    Output:
        X_reduced: data transformed in 2 dims/columns + regenerated original data
    """

    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    # mean center the data
    X_demeaned = X - X.mean(axis=0,keepdims=True)

    # calculate the covariance matrix
    covariance_matrix = np.cov(X_demeaned,rowvar=False)

    # calculate eigenvectors & eigenvalues of the covariance matrix
    eigen_vals, eigen_vecs = np.linalg.eigh(covariance_matrix)

    # sort eigenvalue in increasing order (get the indices from the sort)
    idx_sorted = np.argsort(eigen_vals)
    
    # reverse the order so that it's from highest to lowest.
    idx_sorted_decreasing = idx_sorted[::-1]

    # sort the eigen values by idx_sorted_decreasing
    eigen_vals_sorted = eigen_vals[idx_sorted_decreasing]

    # sort eigenvectors using the idx_sorted_decreasing indices
    eigen_vecs_sorted = eigen_vecs[:,idx_sorted_decreasing]

    # select the first n eigenvectors (n is desired dimension
    # of rescaled data array, or dims_rescaled_data)
    eigen_vecs_subset = eigen_vecs_sorted[:,:n_components]

    # transform the data by multiplying the transpose of the eigenvectors 
    # with the transpose of the de-meaned data
    # Then take the transpose of that product.
    X_reduced = np.matmul(eigen_vecs_subset.T,X_demeaned.T).T

    ### END CODE HERE ###

    return X_reduced

# Testing your function
np.random.seed(1)
X = np.random.rand(3, 10)
X_reduced = compute_pca(X, n_components=2)
print("Your original matrix was " + str(X.shape) + " and it became:")
print(X_reduced)

Your original matrix was (3, 10) and it became:
[[ 0.43437323  0.49820384]
 [ 0.42077249 -0.50351448]
 [-0.85514571  0.00531064]]
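
As an optional cross-check (a sketch that assumes scikit-learn is available; not part of the graded assignment), the same projection can be computed with sklearn.decomposition.PCA. The columns may differ from compute_pca by a sign flip per component, since eigenvector signs are arbitrary.

# Optional sanity check against scikit-learn's PCA (assumes sklearn is installed).
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_reduced_sklearn = pca.fit_transform(X)

# Should match compute_pca(X, 2) up to a possible sign flip in each column.
print(X_reduced_sklearn)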

Finally, use the PCA function implemented above for visualization. You should find that words with similar meanings end up next to each other.

words = ['oil', 'gas', 'happy', 'sad', 'city', 'town',
         'village', 'country', 'continent', 'petroleum', 'joyful']

# given a list of words and the embeddings, it returns a matrix with all the embeddings
X = get_vectors(word_embeddings, words)

print('You have 11 words each of 300 dimensions thus X.shape is:', X.shape)

You have 11 words each of 300 dimensions thus X.shape is: (11, 300)

# We have done the plotting for you. Just run this cell.
result = compute_pca(X, 2)
plt.scatter(result[:, 0], result[:, 1])
for i, word in enumerate(words):
    plt.annotate(word, xy=(result[i, 0] - 0.05, result[i, 1] + 0.1))

plt.show()

What do you notice?

'gas', 'oil', and 'petroleum' are close to each other because their vectors are close to each other. Similarly, 'sad', 'joyful', and 'happy' all describe emotions and also end up close together.


Reprinted from: https://blog.csdn.net/weixin_43093481/article/details/116406130