Reference:
1. Bidirectional LSTM-CRF Models for Sequence Tagging
2. Neural Architectures for Named Entity Recognition
3. Character-based BiLSTM-CRF Incorporating POS and Dictionaries for Chinese Opinion Target Extraction
4. CRF Layer on the Top of BiLSTM
5. Bi-LSTM + CRF with character embeddings for NER and POS
Tagging Scheme
IOBES: Inside, Outside, Beginning, End, Single
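For example (an illustration of the scheme, not taken from the references): under IOBES the three-token entity "New York City" would be tagged B-LOC I-LOC E-LOC, a one-token entity such as "Paris" would be tagged S-LOC, and every non-entity word is tagged O.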
Bidirectional LSTM Networks
By utilizing a bidirectional LSTM, we can efficiently make use of past features (via the forward states) and future features (via the backward states) for a specific time frame.
We train the bidirectional LSTM networks using BPTT (back-propagation through time), with special treatment at the beginning and the end of each data point, such as resetting the hidden states to 0 at the beginning of each sentence.
At time step $t$ for the NER task:
- the input layer represents the features at time $t$; these could be one-hot encodings of words, dense vector features, or sparse features.
- the output layer represents a probability distribution (computed by a softmax) over labels at time $t$. It has the same dimensionality as the number of labels, and the label with the highest probability is taken as the output of time step $t$ (see the sketch below).
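As a concrete illustration, here is a minimal PyTorch sketch (with hypothetical names and dimensions, not the code from the references) of a BiLSTM tagger that makes an independent softmax decision at every time step:

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Minimal BiLSTM tagger: one independent softmax decision per time step."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_tags):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # bidirectional=True gives forward (past) and backward (future) states
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, word_ids):                  # (batch, seq_len)
        h, _ = self.bilstm(self.embed(word_ids))  # (batch, seq_len, 2*hidden_dim)
        logits = self.fc(h)                       # (batch, seq_len, num_tags)
        return logits.softmax(dim=-1)             # label distribution per time step

# Example: 5-word sentence, tag of each word = argmax of its distribution
tagger = BiLSTMTagger(vocab_size=10000, embed_dim=100, hidden_dim=128, num_tags=9)
probs = tagger(torch.randint(0, 10000, (1, 5)))
print(probs.argmax(dim=-1))   # predicted tag index for each of the 5 time steps
```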
Why use CRF Networks?
Although using the BiLSTM hidden states $h_t$ as features to make independent tagging decisions for each output $y_t$ is successful in simple tasks such as POS tagging, the independent classification decisions are limiting when there are strong dependencies across output labels.
In the NER task, the "grammar" that characterizes interpretable sequences of tags imposes several hard constraints (e.g., I-PER cannot follow B-LOC) that would be impossible to model with independence assumptions.
The first way is to predict a distribution over tags at each time step and then use beam-like decoding to find the optimal tag sequence, as in the maximum entropy classifier (Ratnaparkhi, 1996) and Maximum Entropy Markov Models (McCallum et al., 2000).
The second way is to model the sentence as a whole instead of individual positions, leading to Conditional Random Field (CRF) models, in which the inputs and outputs are directly connected, as opposed to LSTM networks where memory cells / recurrent components are employed.
As shown in the rightmost part of the figure below, ignoring the dependencies between tags at different positions leads to incorrect taggings.
For sequence tagging tasks, the input word sequence is the observation sequence and the tag sequence is the hidden state sequence. Sequence tagging models built on a state-independence assumption cannot handle hard constraints between different states; MEMM and CRF are well suited to this kind of problem.
MEMM assumes that the current state depends only on the previous state (the Markov assumption); CRF drops this assumption and takes global state information into account when predicting each state.
HMM (Markov assumption, observation independence assumption) -> MEMM (Markov assumption) -> CRF
CRF Networks
BiLSTM-CRF networks
Given $X = (x_1, x_2, \dots, x_n)$ and $y = (y_1, y_2, \dots, y_n)$ to represent an input sequence and a sequence of predicted tags respectively, where $n$ is the length of the input sentence.
Emission score
We consider $P$ to be the $n \times k$ matrix of scores output by the BiLSTM network, where $k$ is the number of distinct tags; the element $P_{i,j}$ corresponds to the score of the $j$-th tag of the $i$-th word in a sentence.
In BiLSTM-CRF networks, emission scores come from the BiLSTM layer. For instance, according to the above figure, the score of the first word labeled as B-Person is 1.5.
For a sentence of length $n$, the emission matrix $P$ contains $n$ vectors of dimension $k$, one per hidden state.
Transition score
The CRF introduces a transition matrix $A$, which is position independent and measures the score of transitioning from the $i$-th tag to the $j$-th tag by the element $A_{i,j}$.
In order to make the transition score matrix more robust, we add the START and END tags of a sentence to the set of possible tags. $A$ is therefore a square matrix of size $(k+2) \times (k+2)$.
Here is an example of the transition matrix score including the extra added START and END labels.
Transition Matrix | START | B-Person | I-Person | B-Organization | I-Organization | O | END |
---|---|---|---|---|---|---|---|
START | 0 | 0.8 | 0.007 | 0.7 | 0.0008 | 0.9 | 0.08 |
B-Person | 0 | 0.6 | 0.9 | 0.2 | 0.0006 | 0.6 | 0.009 |
I-Person | -1 | 0.5 | 0.53 | 0.55 | 0.0003 | 0.85 | 0.008 |
B-Organization | 0.9 | 0.5 | 0.0003 | 0.25 | 0.8 | 0.77 | 0.006 |
I-Organization | -0.9 | 0.45 | 0.007 | 0.7 | 0.65 | 0.76 | 0.2 |
O | 0 | 0.65 | 0.0007 | 0.7 | 0.0008 | 0.9 | 0.08 |
END | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
As shown in the table above, we can see that the transition matrix has learned some useful constraints:
- The label of the first word in a sentence should start with "B-" or "O", not "I-".
- etc.
Where do we get the transition matrix?
The transition matrix is a parameter of the CRF layer. It is initialized with random values, which gradually become more and more reasonable as training iterations increase.
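For instance, in a PyTorch-style sketch (class and attribute names are my own), the transition matrix is simply a randomly initialized learnable parameter of the CRF layer:

```python
import torch
import torch.nn as nn

NUM_TAGS = 5                           # e.g. B-Person, I-Person, B-Organization, I-Organization, O
START, END = NUM_TAGS, NUM_TAGS + 1    # two extra tags appended to the tag set

class CRFLayer(nn.Module):
    def __init__(self, num_tags):
        super().__init__()
        # A[i, j] = score of transitioning from tag i to tag j; randomly
        # initialized and updated by backpropagation like any other weight
        self.transitions = nn.Parameter(torch.randn(num_tags + 2, num_tags + 2))
```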
Decoding
For a sentence $X$ that has 5 words, $(w_1, w_2, w_3, w_4, w_5)$, the real tags are $y = (y_1, y_2, y_3, y_4, y_5)$.
Here, we add two more extra "words" which denote the start and the end of the sentence, $(w_0, w_1, w_2, w_3, w_4, w_5, w_6)$, with tags $y_0 = \text{START}$ and $y_6 = \text{END}$.
A linear-chain CRF defines a global score $s(X, y)$ that consists of 2 parts, such that:

$$s(X, y) = \text{EmissionScore} + \text{TransitionScore}$$

Emission Score

$$\text{EmissionScore} = \sum_{i=0}^{6} P_{i, y_i}$$

where $P_{0, \text{START}}$ and $P_{6, \text{END}}$ are just set to zero; the other $P_{i, y_i}$ are from the previous BiLSTM.

Transition Score

$$\text{TransitionScore} = \sum_{i=0}^{5} A_{y_i, y_{i+1}}$$

these scores are actually the parameters of the CRF layer.
Illustration of the scoring of a sentence with a linear-chain CRF, for two candidate tag sequences of the same sentence:
- 1 + 10 + 4 + 3 + 2 + 11 + 0 = 31
- 1 + 10 + 2 + 4 - 2 + 11 + 0 = 26
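As a sketch of how such a path score could be computed from the emission matrix $P$ and the transition matrix $A$ defined above (plain NumPy; the helper name and argument layout are my own):

```python
import numpy as np

def path_score(emissions, tags, transitions, start, end):
    """s(X, y) = sum of emission scores + sum of transition scores.

    emissions:   (n, k) matrix P from the BiLSTM (row i = scores of word i)
    tags:        list of n tag indices (the path y)
    transitions: (k+2, k+2) matrix A, including the START/END rows and columns
    """
    score = transitions[start, tags[0]]               # START -> y_1
    for i, tag in enumerate(tags):
        score += emissions[i, tag]                    # emission P[i, y_i]
        if i + 1 < len(tags):
            score += transitions[tag, tags[i + 1]]    # transition A[y_i, y_{i+1}]
    score += transitions[tags[-1], end]               # y_n -> END
    return score
```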
Now that we understand the scoring function of the CRF, we need to do 2 things:
- Find the sequence of tags with the best score.
- Compute a probability distribution over all the sequences of tags (total score).
The simplest way to measure the total score is to enumerate all the possible paths and sum their scores. However, this is very inefficient, with $O(k^n)$ complexity. The recurrent nature of our formula makes it a perfect candidate for dynamic programming.
Find the sequence of tags with the best score (best path)
Let's suppose that $\delta_t(y_t)$ is the best score over time steps $t, t+1, \dots, n$ for sequences that start with tag $y_t$:

$$\delta_t(y_t) = P_{t, y_t} + \max_{y_{t+1}} \left( A_{y_t, y_{t+1}} + \delta_{t+1}(y_{t+1}) \right), \qquad \delta_n(y_n) = P_{n, y_n} + A_{y_n, \text{END}}$$

The best score and path are:

$$s^{*} = \max_{y_1} \left( A_{\text{START}, y_1} + \delta_1(y_1) \right),$$

with the best path recovered by following the argmax at each step. As we perform $k$ maximizations over $k$ candidate tags at each of the $n$ time steps, the final cost is $O(n k^2)$, much less than $O(k^n)$.
This is exactly the Viterbi algorithm used for CRF decoding (given the model and the observations, find the most likely state sequence).
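A minimal NumPy sketch of Viterbi decoding under the notation above (implemented in the forward direction, which is equivalent to the backward recursion in the text; function and variable names are my own):

```python
import numpy as np

def viterbi_decode(emissions, transitions, start, end):
    """Find the tag sequence with the best score in O(n * k^2) time.

    emissions:   (n, k) emission matrix P
    transitions: (k+2, k+2) transition matrix A (with START/END)
    Returns (best_score, best_path).
    """
    n, k = emissions.shape
    # delta[t, j]: best score of any partial path ending at word t with tag j
    delta = np.zeros((n, k))
    backptr = np.zeros((n, k), dtype=int)

    delta[0] = transitions[start, :k] + emissions[0]
    for t in range(1, n):
        # candidate[i, j] = delta[t-1, i] + A[i, j] + P[t, j]
        candidate = delta[t - 1][:, None] + transitions[:k, :k] + emissions[t][None, :]
        backptr[t] = candidate.argmax(axis=0)
        delta[t] = candidate.max(axis=0)

    final = delta[-1] + transitions[:k, end]
    best_path = [int(final.argmax())]
    for t in range(n - 1, 0, -1):          # follow the back-pointers
        best_path.append(int(backptr[t, best_path[-1]]))
    best_path.reverse()
    return float(final.max()), best_path
```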
Probability distribution over the sequence of tags (total path)
The final step of a linear-chain CRF is to apply a softmax over the scores of all possible sequences to get the probability of a given sequence of tags $y$. To do that, we need to compute the partition factor:

$$Z = \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}$$

Let's call $Z_t(y_t)$ the log of the summed exponentiated scores of all sequences that start at time step $t$ with tag $y_t$. Then, $Z_t$ verifies:

$$Z_t(y_t) = P_{t, y_t} + \log \sum_{y_{t+1}} \exp\left( A_{y_t, y_{t+1}} + Z_{t+1}(y_{t+1}) \right), \qquad Z_n(y_n) = P_{n, y_n} + A_{y_n, \text{END}}$$

and finally $\log Z = \log \sum_{y_1} \exp\left( A_{\text{START}, y_1} + Z_1(y_1) \right)$. Working in log space keeps the computation numerically stable, and the cost is again $O(n k^2)$.
Then, we can easily define the probability of a given sequence of tags as:

$$p(y \mid X) = \frac{e^{s(X, y)}}{Z} = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}}$$
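A matching NumPy sketch of computing $\log Z$ with this dynamic program, run in the forward direction and in log space (helper names are my own):

```python
import numpy as np

def log_sum_exp(x, axis=None):
    """Numerically stable log(sum(exp(x)))."""
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def log_partition(emissions, transitions, start, end):
    """log Z = log of the sum over all tag sequences of exp(s(X, y)), in O(n * k^2)."""
    n, k = emissions.shape
    # alpha[j]: log-sum-exp of scores of all partial paths ending at the current word with tag j
    alpha = transitions[start, :k] + emissions[0]
    for t in range(1, n):
        # alpha'[j] = logsumexp_i( alpha[i] + A[i, j] ) + P[t, j]
        alpha = log_sum_exp(alpha[:, None] + transitions[:k, :k], axis=0) + emissions[t]
    return float(log_sum_exp(alpha + transitions[:k, end]))
```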
Loss function
The CRF loss function is composed of the real path score and the total score of all possible paths. The real path should have the highest score among all possible paths.
The loss function of the CRF model is as follows (we minimize it):

$$\mathcal{L} = -\log p(y \mid X) = \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})} - s(X, y) = \log Z - s(X, y)$$

where the score of the real path should be the one with the largest share of all the scores, and $Y_X$ represents all possible tag sequences for a sentence $X$, even those that do not respect the IOB format.
During the training process, the parameter values of our BiLSTM-CRF model are updated again and again to keep increasing the share of the real path's score among the scores of all possible paths.
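Putting the two sketches above together, the loss for one sentence could be computed as follows (this relies on the hypothetical path_score and log_partition helpers defined earlier, not on any library API):

```python
def crf_nll(emissions, real_tags, transitions, start, end):
    """Negative log-likelihood: -log p(y|X) = log Z - s(X, y_real).

    Minimizing this pushes up the share of the real path's score
    among the scores of all possible paths.
    """
    return log_partition(emissions, transitions, start, end) \
        - path_score(emissions, real_tags, transitions, start, end)
```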
Bi-LSTM-CRF Networks
We combine a Bi-LSTM network and a CRF network to form a Bi-LSTM-CRF model. This network can efficiently use past and future input features via the Bi-LSTM layer and sentence-level tag information via the CRF layer.
We construct our neural network by using the hidden states of the Bi-LSTM layer as the input sequence of the CRF layer. We can use the concatenated hidden state (forward + backward), denoted $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$, of word $w_t$ to predict the tag of $w_t$ directly, since it contains the context information.
We feed those hidden states $h_t$ into the CRF layer, after mapping them to emission scores:

$$P_t = W h_t + b$$

where $W$ and $b$ are the parameters.
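A minimal PyTorch sketch of this wiring (class, method, and parameter names are my own; the transition matrix is included here so the sketch is self-contained):

```python
import torch
import torch.nn as nn

class BiLSTM_CRF(nn.Module):
    """Sketch: BiLSTM hidden states -> linear projection -> emission scores for the CRF."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_tags):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        # P_t = W h_t + b : maps each concatenated hidden state h_t to k emission scores
        self.hidden2emission = nn.Linear(2 * hidden_dim, num_tags)
        # transition matrix A of the CRF layer (START/END included), learned jointly
        self.transitions = nn.Parameter(torch.randn(num_tags + 2, num_tags + 2))

    def emissions(self, word_ids):                    # (batch, seq_len)
        h, _ = self.bilstm(self.embed(word_ids))      # h_t = [forward_t ; backward_t]
        return self.hidden2emission(h)                # (batch, seq_len, num_tags)
```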
Reposted from: https://blog.csdn.net/sinat_34072381/article/details/105869963