飞道的博客 (Feidao's Blog)

Elasticsearch: How to Search for Emoji


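Elasticsearch is a very widely used search engine. It tokenizes text to make full-text search possible. In practice, some text contains emoji, such as smiley faces or animals. How can we search for those emoji?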


  
  🏻 => 🏻, light skin tone, skin tone, type 1–2
  🏼 => 🏼, medium-light skin tone, skin tone, type 3
  🏽 => 🏽, medium skin tone, skin tone, type 4
  🏾 => 🏾, medium-dark skin tone, skin tone, type 5
  🏿 => 🏿, dark skin tone, skin tone, type 6
  ♪ => ♪, eighth, music, note
  ♭ => ♭, bemolle, flat, music, note
  ♯ => ♯, dièse, diesis, music, note, sharp
  😀 => 😀, face, grin, grinning face
  😃 => 😃, face, grinning face with big eyes, mouth, open, smile
  😄 => 😄, eye, face, grinning face with smiling eyes, mouth, open, smile
  😁 => 😁, beaming face with smiling eyes, eye, face, grin, smile
  😆 => 😆, face, grinning squinting face, laugh, mouth, satisfied, smile
  😅 => 😅, cold, face, grinning face with sweat, open, smile, sweat
  🤣 => 🤣, face, floor, laugh, rofl, rolling, rolling on the floor laughing, rotfl
  😂 => 😂, face, face with tears of joy, joy, laugh, tear
  🙂 => 🙂, face, slightly smiling face, smile
  🙃 => 🙃, face, upside-down
  😉 => 😉, face, wink, winking face
  🐅 => 🐅, tiger
  🐆 => 🐆, leopard
  🐴 => 🐴, face, horse
  🐎 => 🐎, equestrian, horse, racehorse, racing
  🦄 => 🦄, face, unicorn
  🦓 => 🦓, stripe, zebra
  🦌 => 🦌, deer

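The list above shows a variety of emoji together with their keywords. For example, when we search for grin, documents containing the 😀 emoji should be found as well. This article shows how to make emoji searchable in this way.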

 

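Installation

If you have not yet installed Elasticsearch and Kibana, see the earlier article “Elastic:菜鸟上手指南” (Elastic: A Beginner's Guide). We also need the ICU analyzer; for details on installing it, see the earlier article “Elasticsearch:ICU 分词器介绍” (Elasticsearch: Introduction to the ICU Analyzer). From the root of the Elasticsearch installation directory, run: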

./bin/elasticsearch-plugin install analysis-icu

Once the plugin is installed, Elasticsearch must be restarted for it to take effect. To confirm the installation, run:

./bin/elasticsearch-plugin list

The two commands above print:


  
  $ ./bin/elasticsearch-plugin install analysis-icu
  -> Installing analysis-icu
  -> Downloading analysis-icu from elastic
  [=================================================] 100%
  -> Installed analysis-icu
  $ ./bin/elasticsearch-plugin list
  analysis-icu

Remember that Elasticsearch must be restarted after the ICU analyzer is installed.

 

Searching for emoji

Let's start with a simple experiment:


  
  GET /_analyze
  {
    "tokenizer": "icu_tokenizer",
    "text": "I live in 🇨🇳 and I'm 👩‍🚀"
  }

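The request above uses icu_tokenizer to tokenize “I live in 🇨🇳  and I'm 👩‍🚀”. The 👩‍🚀 emoji is unusual because it is a combination of the more classic 👩 and 🚀 emoji, joined by a zero-width joiner (U+200D). The flag of China is special too: it is a combination of 🇨 and 🇳. So this is not only a matter of splitting Unicode code points correctly; the tokenizer has to understand emoji sequences.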
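To see why these two emoji are special, it can help to look at the code points behind them. The following is a minimal Python sketch using only the standard library; the helper name `describe` is made up for this illustration.

```python
import unicodedata


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


# The flag of China is a pair of regional indicator symbols.
flag = "\U0001F1E8\U0001F1F3"
# The woman astronaut is WOMAN + ZERO WIDTH JOINER + ROCKET.
astronaut = "\U0001F469\u200D\U0001F680"

for label, emoji in (("flag", flag), ("astronaut", astronaut)):
    print(label)
    for line in describe(emoji):
        print("  ", line)
```

A tokenizer that split on individual code points would break each of these into two or three separate tokens, which is the failure a tokenizer aware of emoji sequences avoids.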

The _analyze request above returns:


  
  {
    "tokens" : [
      {
        "token" : "I",
        "start_offset" : 0,
        "end_offset" : 1,
        "type" : "<ALPHANUM>",
        "position" : 0
      },
      {
        "token" : "live",
        "start_offset" : 2,
        "end_offset" : 6,
        "type" : "<ALPHANUM>",
        "position" : 1
      },
      {
        "token" : "in",
        "start_offset" : 7,
        "end_offset" : 9,
        "type" : "<ALPHANUM>",
        "position" : 2
      },
      {
        "token" : "🇨🇳",
        "start_offset" : 10,
        "end_offset" : 14,
        "type" : "<EMOJI>",
        "position" : 3
      },
      {
        "token" : "and",
        "start_offset" : 16,
        "end_offset" : 19,
        "type" : "<ALPHANUM>",
        "position" : 4
      },
      {
        "token" : "I'm",
        "start_offset" : 20,
        "end_offset" : 23,
        "type" : "<ALPHANUM>",
        "position" : 5
      },
      {
        "token" : "👩‍🚀",
        "start_offset" : 24,
        "end_offset" : 29,
        "type" : "<EMOJI>",
        "position" : 6
      }
    ]
  }

Clearly, each emoji is kept together as a single token of type <EMOJI>, so it can be searched.
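The offsets in the response are larger than the number of visible characters: the flag spans 10 to 14 and the astronaut 24 to 29. This is consistent with offsets being counted in UTF-16 code units, as in Java strings; that reading is an assumption here, not something the response itself states. A small Python check, with a hypothetical helper `utf16_units`:

```python
def utf16_units(text):
    """Number of UTF-16 code units needed to encode text."""
    return len(text.encode("utf-16-le")) // 2


flag = "\U0001F1E8\U0001F1F3"             # two regional indicator symbols
astronaut = "\U0001F469\u200D\U0001F680"  # WOMAN + ZERO WIDTH JOINER + ROCKET

print(utf16_units(flag))       # 4 units, the width of offsets 10 to 14
print(utf16_units(astronaut))  # 5 units, the width of offsets 24 to 29
```

This matters when using the offsets for highlighting in a language whose strings are indexed by code point rather than by UTF-16 unit.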

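In practice, we may not want to be limited to searching for the emoji characters themselves. Suppose we want to search a document like the following (shown here for illustration; the index with its custom analyzer is created further below):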


  
  PUT emoji-capable/_doc/1
  {
    "content": "I like 🐅"
  }

The document above contains a 🐅, that is, a tiger. We would like a search for tiger to find this document as well. How can we do that?

On GitHub there is a project at https://github.com/jolicode/emoji-search/. It contains a directory, https://github.com/jolicode/emoji-search/tree/master/synonyms, which holds synonym files. We download one of them, https://github.com/jolicode/emoji-search/blob/master/synonyms/cldr-emoji-annotation-synonyms-en.txt, into the config/analysis directory of the local Elasticsearch installation:


  
  config
  ├── analysis
  │   ├── cldr-emoji-annotation-synonyms-en.txt
  │   └── emoticons.txt
  ├── elasticsearch.yml
  ...

On my machine:


  
  $ pwd
  /Users/liuxg/elastic1/elasticsearch-7.11.0/config
  $ tree -L 3
  .
  ├── analysis
  │   └── cldr-emoji-annotation-synonyms-en.txt
  ├── elasticsearch.keystore
  ├── elasticsearch.yml
  ├── jvm.options
  ├── jvm.options.d
  ├── log4j2.properties
  ├── role_mapping.yml
  ├── roles.yml
  ├── users
  └── users_roles
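The download step can also be scripted. The sketch below is one possible approach in Python, not part of the original workflow: it assumes that the file is still served from GitHub's raw.githubusercontent.com host under the same branch and path, and the function names `raw_url` and `download_synonyms` are made up for this example.

```python
import pathlib
import urllib.request

BLOB_URL = ("https://github.com/jolicode/emoji-search/blob/master/"
            "synonyms/cldr-emoji-annotation-synonyms-en.txt")


def raw_url(blob_url):
    """Turn a github.com 'blob' page URL into its raw-content URL."""
    prefix = "https://github.com/"
    if not blob_url.startswith(prefix) or "/blob/" not in blob_url:
        raise ValueError(f"not a GitHub blob URL: {blob_url}")
    owner_repo, path = blob_url[len(prefix):].split("/blob/", 1)
    return f"https://raw.githubusercontent.com/{owner_repo}/{path}"


def download_synonyms(config_dir, blob_url=BLOB_URL):
    """Save the synonym file under <config_dir>/analysis and return its path."""
    target_dir = pathlib.Path(config_dir) / "analysis"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / blob_url.rsplit("/", 1)[-1]
    with urllib.request.urlopen(raw_url(blob_url)) as response:
        target.write_bytes(response.read())
    return target
```

Calling `download_synonyms` with the path of the Elasticsearch config directory would then place the file where the directory listing above shows it.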

The cldr-emoji-annotation-synonyms-en.txt file above contains synonyms for common emoji. For example:


  
  😀 => 😀, face, grin, grinning face
  😃 => 😃, face, grinning face with big eyes, mouth, open, smile
  😄 => 😄, eye, face, grinning face with smiling eyes, mouth, open, smile
  😁 => 😁, beaming face with smiling eyes, eye, face, grin, smile
  😆 => 😆, face, grinning squinting face, laugh, mouth, satisfied, smile
  😅 => 😅, cold, face, grinning face with sweat, open, smile, sweat
  ....
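Each line maps an emoji on the left of `=>` to a comma-separated list of replacements on the right, which includes the emoji itself. As a rough illustration of that format, here is a small Python parser; `parse_rule` is a hypothetical helper, and it only handles the explicit `=>` form seen in the excerpt.

```python
def parse_rule(line):
    """Parse 'lhs => a, b, c' into (lhs, [a, b, c]).

    Returns None for blank lines, '#' comments, and lines without '=>'.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    lhs, sep, rhs = line.partition("=>")
    if not sep:
        return None  # this sketch skips rules without an explicit mapping
    return lhs.strip(), [term.strip() for term in rhs.split(",") if term.strip()]


sample = [
    "\U0001F600 => \U0001F600, face, grin, grinning face",  # grinning face
    "\U0001F405 => \U0001F405, tiger",                      # tiger
]
rules = dict(rule for rule in map(parse_rule, sample) if rule)
print(len(rules))
```

Because the emoji appears on both sides of the rule, the original emoji token is kept and the keywords are added alongside it.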

To make use of it, let's run the following experiment:


  
  PUT /emoji-capable
  {
    "settings": {
      "analysis": {
        "filter": {
          "english_emoji": {
            "type": "synonym",
            "synonyms_path": "analysis/cldr-emoji-annotation-synonyms-en.txt"
          }
        },
        "analyzer": {
          "english_with_emoji": {
            "tokenizer": "icu_tokenizer",
            "filter": [
              "english_emoji"
            ]
          }
        }
      }
    },
    "mappings": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "english_with_emoji"
        }
      }
    }
  }

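Above, we defined the english_with_emoji analyzer, which combines icu_tokenizer with the english_emoji synonym filter, and assigned that same analyzer to the content field in the mapping. We can try it out with the _analyze API: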


  
  GET emoji-capable/_analyze
  {
    "analyzer": "english_with_emoji",
    "text": "I like 🐅"
  }

The command above returns:


  
  {
    "tokens" : [
      {
        "token" : "I",
        "start_offset" : 0,
        "end_offset" : 1,
        "type" : "<ALPHANUM>",
        "position" : 0
      },
      {
        "token" : "like",
        "start_offset" : 2,
        "end_offset" : 6,
        "type" : "<ALPHANUM>",
        "position" : 1
      },
      {
        "token" : "🐅",
        "start_offset" : 7,
        "end_offset" : 9,
        "type" : "SYNONYM",
        "position" : 2
      },
      {
        "token" : "tiger",
        "start_offset" : 7,
        "end_offset" : 9,
        "type" : "SYNONYM",
        "position" : 2
      }
    ]
  }

Clearly, in addition to 🐅, it also returns the token tiger at the same position. In other words, searching for either one will find this document. Likewise:


  
  GET emoji-capable/_analyze
  {
    "analyzer": "english_with_emoji",
    "text": "😀 means happy"
  }

It returns:


  
  {
    "tokens" : [
      {
        "token" : "😀",
        "start_offset" : 0,
        "end_offset" : 2,
        "type" : "SYNONYM",
        "position" : 0
      },
      {
        "token" : "face",
        "start_offset" : 0,
        "end_offset" : 2,
        "type" : "SYNONYM",
        "position" : 0
      },
      {
        "token" : "grin",
        "start_offset" : 0,
        "end_offset" : 2,
        "type" : "SYNONYM",
        "position" : 0
      },
      {
        "token" : "grinning",
        "start_offset" : 0,
        "end_offset" : 2,
        "type" : "SYNONYM",
        "position" : 0
      },
      {
        "token" : "means",
        "start_offset" : 3,
        "end_offset" : 8,
        "type" : "<ALPHANUM>",
        "position" : 1
      },
      {
        "token" : "face",
        "start_offset" : 3,
        "end_offset" : 8,
        "type" : "SYNONYM",
        "position" : 1
      },
      {
        "token" : "happy",
        "start_offset" : 9,
        "end_offset" : 14,
        "type" : "<ALPHANUM>",
        "position" : 2
      }
    ]
  }

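This shows that searching for face, grinning, or grin will also return this document. The extra face token at position 1 appears to come from the multi-word synonym grinning face, whose second word lands on the following position.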
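The way synonyms are stacked by position can be mimicked with a toy model. The Python sketch below is a deliberate simplification and not how Lucene implements its synonym filter: it splits on whitespace instead of using icu_tokenizer, ignores offsets, and the function name `expand` is invented for this example.

```python
def expand(text, synonyms):
    """Toy synonym expansion: return (token, position) pairs.

    Every replacement for a token is placed at that token's position;
    the words of a multi-word replacement spill onto later positions.
    """
    by_position = {}
    for pos, token in enumerate(text.split()):
        for phrase in synonyms.get(token, [token]):
            for offset, word in enumerate(phrase.split()):
                by_position.setdefault(pos + offset, []).append(word)
    return [(word, pos) for pos in sorted(by_position) for word in by_position[pos]]


GRIN = "\U0001F600"   # grinning face emoji
TIGER = "\U0001F405"  # tiger emoji
synonyms = {
    GRIN: [GRIN, "face", "grin", "grinning face"],
    TIGER: [TIGER, "tiger"],
}

for token, position in expand(f"{GRIN} means happy", synonyms):
    print(position, ascii(token))
```

In this model, as in the response above, "grinning" shares position 0 with the emoji while the trailing "face" shares position 1 with "means". The order of tokens within a position may differ from what Elasticsearch reports.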

Now let's index the following two documents:


  
  PUT emoji-capable/_doc/1
  {
    "content": "I like 🐅"
  }
  PUT emoji-capable/_doc/2
  {
    "content": "😀 means happy"
  }

We then search the documents as follows:


  
  GET emoji-capable/_search
  {
    "query": {
      "match": {
        "content": "🐅"
      }
    }
  }

Or:


  
  GET emoji-capable/_search
  {
    "query": {
      "match": {
        "content": "tiger"
      }
    }
  }

Both queries return the first document:


  
  {
    "took" : 2,
    "timed_out" : false,
    "_shards" : {
      "total" : 1,
      "successful" : 1,
      "skipped" : 0,
      "failed" : 0
    },
    "hits" : {
      "total" : {
        "value" : 1,
        "relation" : "eq"
      },
      "max_score" : 0.8514803,
      "hits" : [
        {
          "_index" : "emoji-capable",
          "_type" : "_doc",
          "_id" : "1",
          "_score" : 0.8514803,
          "_source" : {
            "content" : "I like 🐅"
          }
        }
      ]
    }
  }

Similarly, we can run the following search:


  
  GET emoji-capable/_search
  {
    "query": {
      "match": {
        "content": "😀"
      }
    }
  }

Or:


  
  GET emoji-capable/_search
  {
    "query": {
      "match": {
        "content": "grin"
      }
    }
  }

Both queries return the second document:


  
  {
    "took" : 1,
    "timed_out" : false,
    "_shards" : {
      "total" : 1,
      "successful" : 1,
      "skipped" : 0,
      "failed" : 0
    },
    "hits" : {
      "total" : {
        "value" : 1,
        "relation" : "eq"
      },
      "max_score" : 0.8514803,
      "hits" : [
        {
          "_index" : "emoji-capable",
          "_type" : "_doc",
          "_id" : "2",
          "_score" : 0.8514803,
          "_source" : {
            "content" : "😀 means happy"
          }
        }
      ]
    }
  }
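The same match queries can be sent from a script instead of the Kibana console. The following Python sketch uses only the standard library and assumes an Elasticsearch node without security enabled at http://localhost:9200; that address and the helper names `match_query` and `search` are assumptions made for this example.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node, no authentication


def match_query(field, text):
    """Build the body of a simple match query."""
    return {"query": {"match": {field: text}}}


def search(index, field, text, base_url=ES_URL):
    """Run a match query and return the _id of every hit."""
    request = urllib.request.Request(
        f"{base_url}/{index}/_search",
        data=json.dumps(match_query(field, text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return [hit["_id"] for hit in body["hits"]["hits"]]
```

With the two documents indexed as above, `search("emoji-capable", "content", "tiger")` would be expected to return the id of the first document, and `search("emoji-capable", "content", "grin")` the id of the second, in line with the responses shown.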

 


Reposted from: https://blog.csdn.net/UbuntuTouch/article/details/114261636