Deploying ElasticSearch with Docker

Posted on 2024-12-09 | Updated on 2024-12-08 | Categories: Software, ElasticSearch | Word count: 12k | Reading time ≈ 10 min

This post briefly records how to deploy ElasticSearch with Docker and install the hanlp analyzer to improve Chinese word segmentation.

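What is ElasticSearch

The following is the official description:

Elasticsearch is an open source, distributed, RESTful search and analytics engine, scalable data store, and vector database capable of addressing a growing number of use cases. As the heart of the Elastic Stack, Elasticsearch centrally stores your data for lightning-fast search, fine-tuned relevancy, and powerful analytics that scale with ease.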

Docker deployment

Below is the complete docker compose file, which includes:

ElasticSearch
Kibana

The docker compose file is as follows:

services:
  elastic-search:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.12.0
    container_name: elastic-search
    restart: always
    ports:
      - 9200:9200
      - 9300:9300
    volumes:
      # - ./data/config:/usr/share/elasticsearch/config
      - ./data/data:/usr/share/elasticsearch/data # data files
      - ./data/plugins:/usr/share/elasticsearch/plugins # plugin files
    environment:
      - "cluster.name=elasticsearch" # set the cluster name to elasticsearch
      - "ES_JAVA_OPTS=-Xms1G -Xmx1G" # JVM heap size (initial and maximum); without a cap, ES may request too much memory and fail to start
      - "discovery.type=single-node" # single node
    user: "1000:0"

  kibana:
    image: kibana:8.12.0 # keep in sync with the elastic-search version
    container_name: kibana
    restart: always
    depends_on:
      - elastic-search
    ports:
      - 9301:5601
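
The comment on the Kibana image asks for the same version as Elasticsearch. As a minimal sketch, that check can be done mechanically; the helper name `image_tag` is hypothetical, and the two image strings are copied from the compose file above.

```python
def image_tag(image: str) -> str:
    """Return the tag of a Docker image reference, or 'latest' if none is given."""
    # Split on the last ':' and accept it as a tag separator only when no '/'
    # follows it, so a registry port such as 'localhost:5000/es' is not
    # mistaken for a tag.
    name, sep, tag = image.rpartition(":")
    if sep and "/" not in tag:
        return tag
    return "latest"


es_image = "docker.elastic.co/elasticsearch/elasticsearch:8.12.0"
kibana_image = "kibana:8.12.0"

# Both services should run the same release.
versions_match = image_tag(es_image) == image_tag(kibana_image)
print(versions_match)
```

Image references pinned by digest (`@sha256:...`) are not handled by this sketch.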

Create the default configuration file ./data/config/elasticsearch.yml with the following content:

cluster.name: "docker-cluster"
network.host: 0.0.0.0

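Installation steps

Start the docker containers

# Set permissions on the mounted directories. This step is essential; otherwise ES fails to start.
# 1000:0 is the UID and GID that the elasticsearch user runs with inside the Elasticsearch container
mkdir -p data/{data,config,plugins}
sudo chown -R 1000:0 ./data
sudo chmod -R 777 ./data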

# Start
docker compose up -d
# Open the ports in the firewall
sudo ufw allow 9200,9300,9301/tcp

# Copy the config files that need to be mounted out of the container
docker cp -a elastic-search:/usr/share/elasticsearch/config data

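# Enable (uncomment) the config directory mount in the compose file, then start again
# The config directory is mounted only now because mounting it on the first start breaks ES initialization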
docker compose up -d
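
After `docker compose up -d` the container needs some time before port 9200 accepts connections. The following is a minimal sketch of waiting for that from Python; the function names `wait_until_ready` and `port_open` are hypothetical, and an open port only shows that the process is listening, not that the cluster is healthy.

```python
import socket
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool], attempts: int = 30, delay: float = 2.0) -> bool:
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        try:
            if probe():
                return True
        except OSError:
            # Connection refused or reset while the container is still starting.
            pass
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


def port_open(host: str = "127.0.0.1", port: int = 9200, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.create_connection((host, port), timeout=timeout):
        return True
```

On the docker host, `wait_until_ready(port_open)` returns `True` once Elasticsearch is listening, or `False` after roughly a minute of retries.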

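Set the username and password

Set the username and password as follows:

# This command sets passwords for the built-in users in turn (elastic, kibana, logstash_system, beats_system, and others)
# Note: elasticsearch-setup-passwords is deprecated in 8.x in favor of elasticsearch-reset-password
docker exec -it elastic-search bin/elasticsearch-setup-passwords interactive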

There is also a simpler way: reset the password of the elastic user directly:

docker exec -it elastic-search bin/elasticsearch-reset-password -u elastic -s

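Be sure to store this password safely.

To reset it to a password of your choice, use:

docker exec -it elastic-search bin/elasticsearch-reset-password -u elastic -i

The full command help can be printed with the --help option.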

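Configure Kibana

Open Kibana

Open Kibana in a browser.

Address: http://docker-host-ip:9301

A prompt in the lower-right corner says that publicBaseUrl needs to be configured. Since this deployment is for intranet use, it can be ignored.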

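Generate an enrollment token

Log in to the docker host and run the following command to obtain the token:

docker exec -it elastic-search bin/elasticsearch-create-enrollment-token --scope kibana

The command prints the enrollment token. Copy the token into the input box and click confirm.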

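Get the verification code

After the previous step, a verification-code dialog pops up.

Use the following command to get the verification code:

docker exec -it kibana bin/kibana-verification-code

After entering the code, wait for the configuration to complete.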

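Configure the analyzer plugin

Comparison of Chinese analyzers

Reference: a comparison of common Elasticsearch analyzers

In the end, hankcs/HanLP (Chinese word segmentation) was chosen as the analyzer.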

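HanLP plugin installation

This section mainly follows: p3psi-boo/elasticsearch-analysis-hanlp-8.x

Configuration and testing can be done in the Dev Tools console at /app/dev_tools#/console

Download the analyzer

# Change into the plugin directory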
cd data/plugins

# Download the analyzer plugin
# If needed, configure a proxy for wget with: export https_proxy=http://127.0.0.1:8087
wget https://github.com/p3psi-boo/elasticsearch-analysis-hanlp-8.x/releases/download/v1.0.0/elasticsearch-analysis-hanlp.zip -O analysis-hanlp.zip

# Unzip, then delete the archive
unzip analysis-hanlp.zip && rm analysis-hanlp.zip

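Download the segmentation model

After the analysis-hanlp plugin is downloaded, the segmentation model must also be downloaded. Continuing from the previous step: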

cd analysis-hanlp

# Download the model
wget https://file.hankcs.com/hanlp/data-for-1.7.5.zip
# Unzip into the data directory, then delete the archive
unzip data-for-1.7.5.zip && rm data-for-1.7.5.zip

# Restart the elastic-search container
docker restart elastic-search

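Change the default analyzer

Open the Kibana console (Console - Dev Tools - Elastic) and run the commands there, or call the HTTP API directly.

This section mainly follows: Specify an analyzer

Create the index

The following request creates a search-iepc-document index and sets hanlp as the default analyzer.

# Create the index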
PUT /search-iepc-document
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1,
    "analysis": {
      # custom tokenizer
      "tokenizer": {
        "hanlp_tokenizer": {
          "type": "hanlp",
          "enable_custom_config": true,
          "enable_custom_dictionary": true,
          "enable_number_quantifier_recognize": true,
          "enable_place_recognize": true,
          "enable_organization_recognize": true,
          "enable_stop_dictionary": true
        }
      },
      "analyzer": {
        "default":{
          "filter": ["lowercase"],
          "tokenizer": "hanlp_tokenizer"
        },
        "hanlp_analyzer": {
          "tokenizer": "hanlp_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    # contains two fields
    "properties": {
      "fileId": {
        "type": "text"
      },
      "content": {
        "type": "text"
      }
    }
  }
}
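
The same request body is repeated several times in this post, so when creating the index from a script it can help to build it in one place. The following is a minimal sketch; the function names `build_analysis` and `build_index_body` are hypothetical, and the option values are copied from the request above.

```python
import json

# Tokenizer options copied from the PUT request above.
HANLP_TOKENIZER = {
    "type": "hanlp",
    "enable_custom_config": True,
    "enable_custom_dictionary": True,
    "enable_number_quantifier_recognize": True,
    "enable_place_recognize": True,
    "enable_organization_recognize": True,
    "enable_stop_dictionary": True,
}


def _hanlp_analyzer() -> dict:
    """Return a fresh analyzer definition that uses the hanlp tokenizer."""
    return {"tokenizer": "hanlp_tokenizer", "filter": ["lowercase"]}


def build_analysis() -> dict:
    """Build the 'analysis' section: one tokenizer and two analyzers."""
    return {
        "tokenizer": {"hanlp_tokenizer": dict(HANLP_TOKENIZER)},
        "analyzer": {"default": _hanlp_analyzer(), "hanlp_analyzer": _hanlp_analyzer()},
    }


def build_index_body(shards: int = 1, replicas: int = 1) -> dict:
    """Build the full create-index body with the two text fields."""
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
            "analysis": build_analysis(),
        },
        "mappings": {
            "properties": {"fileId": {"type": "text"}, "content": {"type": "text"}}
        },
    }


print(json.dumps(build_index_body(), indent=2))
```

The printed JSON can be pasted into the Kibana console after `PUT /search-iepc-document`, or sent as the request body over HTTP.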

Set the default analyzer for an existing index

To modify the settings, close the index first, then update the settings.

# Close the index (analysis settings can only be changed while it is closed)
POST search-iepc-document/_close

# Set the default analyzer
PUT search-iepc-document/_settings
{
  "settings": {
    "analysis": {
      # define the tokenizer
      "tokenizer": {
        "hanlp_tokenizer": {
          "type": "hanlp",
          "enable_custom_config": true,
          "enable_custom_dictionary": true,
          "enable_number_quantifier_recognize": true,
          "enable_place_recognize": true,
          "enable_organization_recognize": true,
          "enable_stop_dictionary": true
        }
      },
      # define the analyzers
      "analyzer": {
        "default":{
          "filter": ["lowercase"],
          "tokenizer": "hanlp_tokenizer"
        },
        "hanlp_analyzer": {
          "tokenizer": "hanlp_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

# Reopen the index
POST search-iepc-document/_open
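
Analysis settings can only be changed on a closed index, so the order of operations is close, update, reopen. Here is a minimal sketch of that sequence in Python. It assumes `client` behaves like an elasticsearch-py 8.x `Elasticsearch` instance (`indices.close`, `indices.put_settings` with a `settings` argument, `indices.open`); the function name `update_analysis` is hypothetical, and the stub at the bottom only records the call order so the sketch runs without a cluster.

```python
from types import SimpleNamespace


def update_analysis(client, index: str, analysis: dict) -> None:
    """Close the index, apply the analysis settings, and always reopen it."""
    client.indices.close(index=index)
    try:
        client.indices.put_settings(index=index, settings={"analysis": analysis})
    finally:
        # Reopen even if the update fails, so the index does not stay closed.
        client.indices.open(index=index)


# Dry run against a stub that records which calls were made, in order.
calls = []
dry_run = SimpleNamespace(
    indices=SimpleNamespace(
        close=lambda index: calls.append("close"),
        put_settings=lambda index, settings: calls.append("put_settings"),
        open=lambda index: calls.append("open"),
    )
)
update_analysis(dry_run, "search-iepc-document", {"analyzer": {}})
print(calls)  # ['close', 'put_settings', 'open']
```

With a real client, the `try`/`finally` keeps a failed update from leaving the index closed.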

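Generate an API_KEY

To call Elasticsearch through its API, an API_KEY must be generated.

Generate it on this Kibana page: http://host:9301/app/management/security/api_keys

For more fine-grained permission control, see: Create API key API

When configuring the key, enabling restricted privileges is recommended.

Connecting from Python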

from elasticsearch import Elasticsearch

client = Elasticsearch(
    "https://192.168.23.30:9200",
    api_key="T0xWJRucldVOHM6MwTUJVLWJfS33lJeSkZQcE1Z3xVFBDUQ==",
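    # verify_certs=False, # disables certificate verification; can be enabled during development
    ca_certs="certs/elastic-search.crt", # certificate copied from config/certs inside the elastic-search container
    ssl_assert_hostname=False,  # disable hostname verification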
)
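
The snippet above hardcodes the API key. A common alternative is to read the connection settings from environment variables. The following is a minimal sketch; the variable names `ES_URL`, `ES_API_KEY`, and `ES_CA_CERTS` and the function name `client_kwargs` are assumptions made for this example, while the default values are copied from the snippet above.

```python
import os
from typing import Mapping


def client_kwargs(env: Mapping[str, str] = os.environ) -> dict:
    """Build keyword arguments for Elasticsearch(...) from environment variables."""
    return {
        # Hypothetical variable names; defaults mirror the snippet above.
        "hosts": env.get("ES_URL", "https://192.168.23.30:9200"),
        "api_key": env["ES_API_KEY"],  # required: raises KeyError if unset
        "ca_certs": env.get("ES_CA_CERTS", "certs/elastic-search.crt"),
        "ssl_assert_hostname": False,  # same choice as in the snippet above
    }
```

With the elasticsearch package installed, the client could then be created with `Elasticsearch(**client_kwargs())`.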

Common Kibana console commands

The following commands can be used directly in the Kibana console at /app/dev_tools#/console

Details:

# Inspect the segmentation result
GET _analyze
{
  "analyzer": "hanlp",
  "text": "受弯构件在进行正截面抗弯承载力计算。美国,|=阿拉斯加州发生8.0级地震"
}

# Create the index
PUT /search-iepc-document
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "tokenizer": {
        "hanlp_tokenizer": {
          "type": "hanlp",
          "enable_custom_config": true,
          "enable_custom_dictionary": true,
          "enable_number_quantifier_recognize": true,
          "enable_place_recognize": true,
          "enable_organization_recognize": true,
          "enable_stop_dictionary": true
        }
      },
      "analyzer": {
        "default":{
          "filter": ["lowercase"],
          "tokenizer": "hanlp_tokenizer"
        },
        "hanlp_analyzer": {
          "tokenizer": "hanlp_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "fileId": {
        "type": "text"
      },
      "content": {
        "type": "text"
      }
    }
  }
}

# Set the default analyzer (the index must be closed first)
PUT search-iepc-document/_settings
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "hanlp_tokenizer": {
          "type": "hanlp",
          "enable_custom_config": true,
          "enable_custom_dictionary": true,
          "enable_number_quantifier_recognize": true,
          "enable_place_recognize": true,
          "enable_organization_recognize": true,
          "enable_stop_dictionary": true
        }
      },
      "analyzer": {
        "default":{
          "filter": ["lowercase"],
          "tokenizer": "hanlp_tokenizer"
        },
        "hanlp_analyzer": {
          "tokenizer": "hanlp_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

# Test
POST search-iepc-document/_analyze
{
  "text": "受弯构件在进行正截面抗弯承载力计算。美国,|=阿拉斯加州发生8.0级地震",
  "analyzer": "hanlp_analyzer"
}

# Close the index
POST search-iepc-document/_close

# Open the index
POST search-iepc-document/_open

POST search-iepc-document/_analyze
{
    "text": "HanLP是面向生产环境的自然语言处理工具包。2021年HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。HanLP只做我们认为正确、先进的事情，而不一定是流行、权威的事情。晓美焰来到北京立方庭参观自然语义科技公司。当下雨天地面积水分外严重。总统普京与特朗普通电话讨论美国太空探索技术公司。采用优等生鲜肉，欢迎新老师生前来就餐。",
    "analyzer": "hanlp_analyzer"
}

# Search by content
POST search-iepc/_search
{
  "query":{
    "match": {
      "content": "桥梁 支座 要求"
    }
  },
  "size": 3
}

# Complex query
# Filter by fileId, then search by content
POST search-iepc/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "fileId": "1713859954114"
          }
        }
      ], 
      "must": [
        {
          "match": {
            "content": "混凝土压杆承载力设计值如何计算"
          }
        }
      ]
    }
  },
  "highlight": {
    "fields":{
      "content":{
          "type":"plain",
          "order": "score",
          "number_of_fragments": 3
      }
    }
  }
}

# General-purpose query
POST search-iepc/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "content": "混凝土压杆承载力设计值如何计算"
          }
        }
      ]
    }
  },
  "highlight": {
    "fields":{
      "content":{
          "type":"plain",
          "order": "score",
          "number_of_fragments": 3
      }
    }
  }
}


# Phrase search: the terms must appear in the given order
POST search-iepc/_search
{
  "query": {
    "match_phrase": {
      "content": "抗裂验算 全预应力 混凝土"
    }
  }
}

# Require all terms to match (operator "and"); unlike match_phrase, term order is not enforced
POST search-iepc/_search
{
  "query":{
    "match": {
      "content": {
        "query": "抗裂验算 全预应力 混凝土",
        "operator": "and"
      }
    }
  }
}

# Look up a document by id
GET search-iepc/_doc/66150b2166d66170065f78d2

# Delete the document with the given id
POST search-iepc/_delete_by_query
{
  "query": {
    "bool": {
      "filter": {
        "ids": {
          "values": ["661f97cb2d1b78567953f85e"]
        }
      }
    }
  }
}

# Search without scoring
GET search-iepc/_search
{
  "query":{
    "constant_score": {
      "filter":{
         "term": {
          "fileId": 1713488916377
        }
      }
    }
  }
}

# Find and delete documents by fileId
POST search-iepc/_delete_by_query
{
  "query":{
    "constant_score": {
      "filter":{
         "term": {
          "fileId": 1713431812754
        }
      }
    }
  }
}

# Delete everything
POST search-iepc/_delete_by_query
{
  "query": {
    "match_all":{}
  }
}
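
When the searches above are issued from application code, the request body can be assembled with a small helper. The following is a minimal sketch that mirrors the "complex query" and "general-purpose query" shapes; the function name `build_content_query` and its parameters are hypothetical.

```python
from typing import Optional


def build_content_query(
    text: str,
    file_id: Optional[str] = None,
    fragments: int = 3,
    size: int = 10,
) -> dict:
    """Build a bool query on 'content', optionally filtered by 'fileId'."""
    bool_query = {"must": [{"match": {"content": text}}]}
    if file_id is not None:
        # Filter clauses restrict the result set without affecting the score.
        bool_query["filter"] = [{"term": {"fileId": file_id}}]
    return {
        "query": {"bool": bool_query},
        "highlight": {
            "fields": {
                "content": {
                    "type": "plain",
                    "order": "score",
                    "number_of_fragments": fragments,
                }
            }
        },
        "size": size,
    }


body = build_content_query("混凝土压杆承载力设计值如何计算", file_id="1713859954114")
```

Assuming an elasticsearch-py 8.x client, the result could be passed as `client.search(index="search-iepc", **body)`, since `query`, `highlight`, and `size` are accepted as keyword arguments there.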

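Closing remarks

The installation above uses elasticsearch-analysis-hanlp. This approach has one drawback: the analyzer cannot be iterated on quickly.

If an upgrade is needed later, hanlp can be split out into a standalone service, with a network analyzer plugin built to call it.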
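
To make the "standalone service" idea a little more concrete, here is a rough sketch of what such a tokenization endpoint could look like, using only the Python standard library. Everything in it is an assumption for illustration: the JSON shape (`{"text": ...}` in, `{"tokens": [...]}` out), the port, and the names `tokenize`, `handle_payload`, and `TokenizeHandler`. The `tokenize` function is a whitespace-splitting placeholder and does not call HanLP.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def tokenize(text: str) -> list:
    """Placeholder tokenizer: splits on whitespace. A real service would call HanLP here."""
    return text.split()


def handle_payload(raw: bytes) -> bytes:
    """Decode a JSON request body, tokenize its 'text' field, and encode the reply."""
    request = json.loads(raw.decode("utf-8"))
    tokens = tokenize(request["text"])
    return json.dumps({"tokens": tokens}, ensure_ascii=False).encode("utf-8")


class TokenizeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes announced by the client.
        length = int(self.headers.get("Content-Length", 0))
        body = handle_payload(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Run the service until interrupted (hypothetical port)."""
    HTTPServer(("0.0.0.0", port), TokenizeHandler).serve_forever()
```

The point of the split is that the tokenizer can then be upgraded and redeployed on its own, without touching the Elasticsearch plugin directory or restarting the cluster.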