Notes index: https://blog.csdn.net/weixin_43093481/article/details/114989382?spm=1001.2014.3001.5501
Course notes: 1.3 Vector Space Models
Code: https://github.com/Ogmx/Natural-Language-Processing-Specialization
——————————————————————————————————————————
Assignment 3: Getting to Know Word Vectors
Learning objectives:
In this lab we explore word vectors (word embeddings). In NLP tasks, words are usually represented as word vectors, which encode the meaning of each word.
Word vectors can be trained with many different machine-learning methods. This lab does not cover how to generate them; instead, it focuses on how to use them, because in real applications you usually load pre-trained embeddings rather than training your own.
Specifically, you will learn how to:
- Predict analogy relationships between words.
- Reduce the dimensionality of word embeddings with PCA and visualize them.
- Compare word embeddings with a similarity measure (cosine similarity).
- Understand how vector space models work.
1.0 Predicting the country from its capital
Given the name of a capital city, predict the country it belongs to.
1.1 Importing the data
As before, start by importing the necessary Python packages and the dataset. The dataset is a Pandas DataFrame, a data type that is very common in data science. Because the data is fairly large, loading it may take a while.
# Run this cell to import packages.
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from utils import get_vectors
data = pd.read_csv('capitals.txt', delimiter=' ')
data.columns = ['city1', 'country1', 'city2', 'country2']
# print first five elements in the DataFrame
data.head(5)
city1 country1 city2 country2
0 Athens Greece Bangkok Thailand
1 Athens Greece Beijing China
2 Athens Greece Berlin Germany
3 Athens Greece Bern Switzerland
4 Athens Greece Cairo Egypt
Downloading the full word-embedding dataset
Because the Google News word-embedding dataset is 3.64 GB, this lab uses only a small portion of it, stored in the file word_embeddings_capitals.p.
If you want to download the full dataset for your own tasks, do the following:
- Go to the download page
- Search the page for 'GoogleNews-vectors-negative300.bin.gz' and click the link to download it
Now load the pre-trained embeddings into a Python dictionary:
word_embeddings = pickle.load(open("word_embeddings_subset.p", "rb"))
len(word_embeddings) # there should be 243 words that will be used in this assignment
243
Each word embedding is a 300-dimensional vector.
print("dimension: {}".format(word_embeddings['Spain'].shape[0]))
dimension: 300
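Before going further, note that word_embeddings is just an ordinary Python dictionary mapping a word (string) to a 300-dimensional NumPy array. A quick sketch for inspecting it (not part of the graded assignment):
spain = word_embeddings['Spain']
print(type(spain))            # <class 'numpy.ndarray'>
print(spain[:5])              # first five of the 300 components
print(np.linalg.norm(spain))  # L2 norm (length) of the vector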
Predicting relationships among words
We will now implement a function that uses word embeddings to predict relationships between words:
- The function takes three words as input.
- The first two words are related to each other.
- Based on the relationship between the first two words, it predicts a fourth word that has the same relationship to the third word.
- For example, "Athens is to Greece as Bangkok is to ______"?
As a concrete example, you will write a function that predicts a country from its capital city. Following the approach illustrated in the figure above (the diagram from the original notebook is not reproduced here), this is done by computing cosine similarity or Euclidean distance. A short sketch of the underlying vector arithmetic is shown below.
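This is only a rough illustration of the arithmetic, assuming 'Athens', 'Greece' and 'Bangkok' are all present in the loaded subset of embeddings (the graded function below builds this vector properly and then searches for the nearest word):
# Build a candidate vector for the missing country.
# Intuition: Greece - Athens captures the "capital -> country" direction;
# adding Bangkok applies that direction to the new capital.
candidate = word_embeddings['Greece'] - word_embeddings['Athens'] + word_embeddings['Bangkok']
print(candidate.shape)  # still a 300-dimensional vector; the nearest word should ideally be 'Thailand'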
1.2 Cosine Similarity
Cosine similarity is defined as:
$$\cos(\theta)=\frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\|\|\mathbf{B}\|}=\frac{\sum_{i=1}^{n} A_{i} B_{i}}{\sqrt{\sum_{i=1}^{n} A_{i}^{2}} \sqrt{\sum_{i=1}^{n} B_{i}^{2}}}\tag{1}$$
$\mathbf{A}$ and $\mathbf{B}$ are word vectors, and $A_i$ and $B_i$ are the $i$-th elements of those vectors.
- If $\mathbf{A}$ and $\mathbf{B}$ are identical, their cosine similarity is $\cos(\theta) = 1$.
- Conversely, if they are completely opposite, i.e. $\mathbf{A} = -\mathbf{B}$, their cosine similarity is $\cos(\theta) = -1$.
- If $\cos(\theta) = 0$, the two vectors are orthogonal (perpendicular).
- Values between 0 and 1 indicate similarity.
- Values between -1 and 0 indicate dissimilarity.
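These properties are easy to verify on small toy vectors with plain NumPy (a quick sketch, not part of the graded exercise):
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([-2.0, 1.0, 0.0])   # chosen so that the dot product with a is zero
print(np.dot(a, a) / (np.linalg.norm(a) * np.linalg.norm(a)))    # identical vectors  -> 1.0
print(np.dot(a, -a) / (np.linalg.norm(a) * np.linalg.norm(-a)))  # opposite vectors   -> -1.0
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))    # orthogonal vectors -> 0.0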
Exercise: implement a function that takes two word vectors and returns their cosine similarity.
Hints
- Python's NumPy library supports linear algebra operations (e.g., dot product, vector norm, ...).
- Use numpy.dot.
- Use numpy.linalg.norm.
# UNQ_C1 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def cosine_similarity(A, B):
    '''
    Input:
        A: a numpy array which corresponds to a word vector
        B: A numpy array which corresponds to a word vector
    Output:
        cos: numerical number representing the cosine similarity between A and B.
    '''
    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    dot = np.dot(A, B)
    norma = np.linalg.norm(A)
    normb = np.linalg.norm(B)
    cos = dot / (norma * normb)
    ### END CODE HERE ###
    return cos
# feel free to try different words
king = word_embeddings['king']
queen = word_embeddings['queen']
cosine_similarity(king, queen)
0.6510956
1.3 Euclidean Distance
Implement a function that measures the similarity between two vectors using the Euclidean distance.
The Euclidean distance is defined as:
$$d(\mathbf{A},\mathbf{B}) = d(\mathbf{B},\mathbf{A}) = \sqrt{(A_1-B_1)^2+(A_2-B_2)^2+\cdots+(A_n-B_n)^2} = \sqrt{\sum_{i=1}^{n}(A_i-B_i)^2}$$
- $n$ is the number of elements in the vectors.
- $\mathbf{A}$ and $\mathbf{B}$ are the word vectors.
- The more similar two words are, the closer their Euclidean distance is to 0.
Exercise: implement a function that computes the Euclidean distance between two vectors.
# UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def euclidean(A, B):
    """
    Input:
        A: a numpy array which corresponds to a word vector
        B: A numpy array which corresponds to a word vector
    Output:
        d: numerical number representing the Euclidean distance between A and B.
    """
    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    # euclidean distance
    d = np.linalg.norm(A - B)
    ### END CODE HERE ###
    return d
# Test your function
euclidean(king, queen)
2.4796925
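As a quick comparison of the two metrics (a sketch; it assumes 'oil' is in the loaded subset, which it is, since it appears again in the PCA section), a related pair such as king/queen should show a higher cosine similarity and a smaller Euclidean distance than an unrelated pair:
oil = word_embeddings['oil']
print(cosine_similarity(king, queen), euclidean(king, queen))  # related pair
print(cosine_similarity(king, oil), euclidean(king, oil))      # unrelated pair: lower cosine, larger distance expected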
1.4 Predicting the country from its capital
In this section, we use the similarity functions implemented above to complete the task. Write a function that takes three words and the embeddings dictionary as input and returns the corresponding country. For example, given the inputs:
- 1: Athens 2: Greece 3: Baghdad,
the model should predict the country 4: Iraq.
Exercise:
- To predict the country, follow the King - Man + Woman = Queen example above and use the same mechanism, combining word embeddings with a similarity function.
- Iterate over the embeddings dictionary and compute the cosine similarity between your candidate vector and each word embedding.
- Finally, make sure the returned word is different from every input word; return the highest-scoring candidate.
# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def get_country(city1, country1, city2, embeddings):
    """
    Input:
        city1: a string (the capital city of country1)
        country1: a string (the country of capital1)
        city2: a string (the capital city of country2)
        embeddings: a dictionary where the keys are words and values are their embeddings
    Output:
        countries: a dictionary with the most likely country and its similarity score
    """
    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    # store the city1, country 1, and city 2 in a set called group
    group = set((city1, country1, city2))
    # get embeddings of city 1
    city1_emb = embeddings[city1]
    # get embedding of country 1
    country1_emb = embeddings[country1]
    # get embedding of city 2
    city2_emb = embeddings[city2]
    # get embedding of country 2 (it's a combination of the embeddings of country 1, city 1 and city 2)
    # Remember: King - Man + Woman = Queen
    vec = country1_emb - city1_emb + city2_emb
    # Initialize the similarity to -1 (it will be replaced by similarities that are closer to +1)
    similarity = -1
    # initialize country to an empty string
    country = ''
    # loop through all words in the embeddings dictionary
    for word in embeddings.keys():
        # first check that the word is not already in the 'group'
        if word not in group:
            # get the word embedding
            word_emb = embeddings[word]
            # calculate cosine similarity between embedding of country 2 and the word in the embeddings dictionary
            cur_similarity = cosine_similarity(word_emb, vec)
            # if the cosine similarity is more similar than the previously best similarity...
            if cur_similarity > similarity:
                # update the similarity to the new, better similarity
                similarity = cur_similarity
                # store the country as a tuple, which contains the word and the similarity
                country = (word, similarity)
    ### END CODE HERE ###
    return country
# Testing your function, note to make it more robust you can return the 5 most similar words.
get_country('Athens', 'Greece', 'Cairo', word_embeddings)
('Egypt', 0.7626821)
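The comment in the cell above suggests returning the 5 most similar words to make the function more robust. One possible sketch of such a variant (the name get_country_top_k is illustrative, not part of the graded assignment):
def get_country_top_k(city1, country1, city2, embeddings, k=5):
    """Return the k candidate words closest (by cosine similarity) to
    country1 - city1 + city2, excluding the three input words."""
    group = {city1, country1, city2}
    vec = embeddings[country1] - embeddings[city1] + embeddings[city2]
    # score every word in the vocabulary that is not one of the inputs
    scores = [(word, cosine_similarity(embeddings[word], vec))
              for word in embeddings if word not in group]
    # sort by similarity, highest first, and keep the top k
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

get_country_top_k('Athens', 'Greece', 'Cairo', word_embeddings)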
1.5 Model accuracy
In this section we evaluate the model's accuracy on the dataset:
$$\text{Accuracy}=\frac{\text{Correct \# of predictions}}{\text{Total \# of predictions}}$$
Exercise: implement a function that computes the model's accuracy on the dataset. Iterate over every row to get the corresponding words, and feed them into the get_country function defined above.
for i, row in data.iterrows():
    print(row['city1'])
Athens
Athens
Athens
Athens
Athens
Athens
Athens
Athens
Athens
…
# UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def get_accuracy(word_embeddings, data):
    '''
    Input:
        word_embeddings: a dictionary where the key is a word and the value is its embedding
        data: a pandas dataframe containing all the country and capital city pairs
    Output:
        accuracy: the accuracy of the model
    '''
    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    # initialize num correct to zero
    num_correct = 0
    # loop through the rows of the dataframe
    for i, row in data.iterrows():
        # get city1
        city1 = row['city1']
        # get country1
        country1 = row['country1']
        # get city2
        city2 = row['city2']
        # get country2
        country2 = row['country2']
        # use get_country to find the predicted country2
        predicted_country2, _ = get_country(city1, country1, city2, word_embeddings)
        # if the predicted country2 is the same as the actual country2...
        if predicted_country2 == country2:
            # increment the number of correct by 1
            num_correct += 1
    # get the number of rows in the data dataframe (length of dataframe)
    m = len(data)
    # calculate the accuracy by dividing the number correct by m
    accuracy = num_correct / m
    ### END CODE HERE ###
    return accuracy
accuracy = get_accuracy(word_embeddings, data)
print(f"Accuracy is {accuracy:.2f}")
Accuracy is 0.92
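With about 92% accuracy, it can be instructive to inspect the rows the model gets wrong. A small sketch for collecting them (it simply re-runs get_country on every row, so it takes about as long as get_accuracy):
# Collect the rows where the predicted country differs from the true one.
errors = []
for i, row in data.iterrows():
    predicted, score = get_country(row['city1'], row['country1'], row['city2'], word_embeddings)
    if predicted != row['country2']:
        errors.append((row['city2'], row['country2'], predicted))

print(len(errors))   # number of wrong predictions
print(errors[:5])    # a few examples: (capital, true country, predicted country)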
2.0 Visualization with PCA
In this section we explore the distances between vectors after reducing their dimensionality, using principal component analysis (PCA).
This lab uses a 300-dimensional vector space. Although that works well computationally, 300 dimensions cannot be visualized or interpreted directly, which is why we reduce the dimensionality with PCA.
PCA projects high-dimensional vectors into a lower-dimensional space while preserving as much information as possible. 'Maximum information' here means that the Euclidean distances between the original vectors and their projected counterparts are kept as small as possible, so vectors that are close together in the high-dimensional space remain close together after the reduction.
Words with similar meanings will cluster together: 'sad', 'happy' and 'joyful' all describe emotions, so they end up near each other; likewise 'oil', 'gas' and 'petroleum' all describe natural resources, and 'city', 'village' and 'town' are near-synonyms.
Before visualizing the words, we first reduce the word vectors to 2 dimensions with PCA, following these steps:
- Mean-normalize the data.
- Compute the covariance matrix ($\Sigma$).
- Compute the eigenvalues and eigenvectors of the covariance matrix.
- Multiply the first K eigenvectors by the normalized data. (The original notebook shows a figure of the resulting transformation; it is not reproduced here.)
Exercise:
Implement a function that takes a dataset in which each row is a word vector and returns the dimensionality-reduced result.
- The word vectors are 300-dimensional.
- Use PCA to compress the 300 dimensions down to n_components dimensions.
- The new matrix should have shape (m, n_components).
- First de-mean the data.
- Use linalg.eigh to get the eigenvalues. Use eigh rather than eig because R is symmetric; eigh is also considerably faster than eig here.
- Sort the eigenvectors and eigenvalues in decreasing order of the eigenvalues.
- Take the subset of eigenvectors you need (use n_components to choose how many principal components to keep).
- Return the transformed data, obtained by multiplying the eigenvectors with the (de-meaned) data.
# UNQ_C5 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
def compute_pca(X, n_components=2):
    """
    Input:
        X: of dimension (m,n) where each row corresponds to a word vector
        n_components: Number of components you want to keep.
    Output:
        X_reduced: data transformed in 2 dims/columns + regenerated original data
    """
    ### START CODE HERE (REPLACE INSTANCES OF 'None' with your code) ###
    # mean center the data
    X_demeaned = X - X.mean(axis=0, keepdims=True)
    # calculate the covariance matrix
    covariance_matrix = np.cov(X_demeaned, rowvar=False)
    # calculate eigenvectors & eigenvalues of the covariance matrix
    eigen_vals, eigen_vecs = np.linalg.eigh(covariance_matrix)
    # sort eigenvalue in increasing order (get the indices from the sort)
    idx_sorted = np.argsort(eigen_vals)
    # reverse the order so that it's from highest to lowest.
    idx_sorted_decreasing = idx_sorted[::-1]
    # sort the eigen values by idx_sorted_decreasing
    eigen_vals_sorted = eigen_vals[idx_sorted_decreasing]
    # sort eigenvectors using the idx_sorted_decreasing indices
    eigen_vecs_sorted = eigen_vecs[:, idx_sorted_decreasing]
    # select the first n eigenvectors (n is desired dimension
    # of rescaled data array, or dims_rescaled_data)
    eigen_vecs_subset = eigen_vecs_sorted[:, :n_components]
    # transform the data by multiplying the transpose of the eigenvectors
    # with the transpose of the de-meaned data
    # Then take the transpose of that product.
    X_reduced = np.matmul(eigen_vecs_subset.T, X_demeaned.T).T
    ### END CODE HERE ###
    return X_reduced
# Testing your function
np.random.seed(1)
X = np.random.rand(3, 10)
X_reduced = compute_pca(X, n_components=2)
print("Your original matrix was " + str(X.shape) + " and it became:")
print(X_reduced)
Your original matrix was (3, 10) and it became:
[[ 0.43437323 0.49820384]
[ 0.42077249 -0.50351448]
[-0.85514571 0.00531064]]
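As an optional sanity check (assuming scikit-learn is installed, which the assignment itself does not require), the result can be compared with sklearn.decomposition.PCA. The two projections should agree up to the sign of each column, because eigenvectors are only defined up to sign:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_sklearn = pca.fit_transform(X)
# Column signs may be flipped relative to compute_pca, so compare absolute values.
print(np.allclose(np.abs(X_reduced), np.abs(X_sklearn), atol=1e-5))  # expected: True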
Finally, use the PCA function implemented above to visualize the words. You will see that words with similar meanings appear next to each other.
words = ['oil', 'gas', 'happy', 'sad', 'city', 'town',
'village', 'country', 'continent', 'petroleum', 'joyful']
# given a list of words and the embeddings, it returns a matrix with all the embeddings
X = get_vectors(word_embeddings, words)
print('You have 11 words each of 300 dimensions thus X.shape is:', X.shape)
You have 11 words each of 300 dimensions thus X.shape is: (11, 300)
# We have done the plotting for you. Just run this cell.
result = compute_pca(X, 2)
plt.scatter(result[:, 0], result[:, 1])
for i, word in enumerate(words):
    plt.annotate(word, xy=(result[i, 0] - 0.05, result[i, 1] + 0.1))
plt.show()
What do you notice?
'gas', 'oil' and 'petroleum' are close to each other because their vectors are close to each other. Similarly, 'sad', 'joyful' and 'happy' all describe emotions and are also close together.
Reposted from: https://blog.csdn.net/weixin_43093481/article/details/116406130