🔥🔥探索人工智能的世界：构建智能问答系统之实战篇

技术分享 2年前 (2023-11-18) 0 999+

引言

前面我们已经做好了必要的准备工作，包括对相关知识点的了解以及环境的安装。今天我们将重点关注代码方面的内容。如果你已经具备了Java编程基础，那么理解Python语法应该不会成为问题，毕竟只是语法的差异而已。随着时间的推移，你自然会逐渐熟悉和掌握这门语言。现在让我们开始吧！

环境安装命令

在使用之前，我们需要先进行一些必要的准备工作，其中包括执行一些命令。如果你已经仔细阅读了Milvus的官方文档，你应该已经了解到了这一点。下面是需要执行的一些命令示例：

pip3 install langchain  pip3 install openai  pip3 install protobuf==3.20.0  pip3 install grpcio-tools  python3 -m pip install pymilvus==2.3.2  python3 -c "from pymilvus import Collection"

快速入门

现在，我们来尝试使用官方示例，看看在没有集成LangChain的情况下，我们需要编写多少代码才能完成插入、查询等操作。官方示例已经在前面的注释中详细讲解了所有的流程。总体流程如下：

连接到数据库
创建集合（这里还有分区的概念，我们不深入讨论）
插入向量数据（我看官方文档就简单插入了一些数字...）
创建索引（根据官方文档的说法，通常在一定数据量下是不会经常创建索引的）
查询数据
删除数据
断开与数据库的连接

通过以上步骤，你会发现与连接MySQL数据库的操作非常相似。

# hello_milvus.py demonstrates the basic operations of PyMilvus, a Python SDK of Milvus. # 1. connect to Milvus # 2. create collection # 3. insert data # 4. create index # 5. search, query, and hybrid search on entities # 6. delete entities by PK # 7. drop collection import time  import numpy as np from pymilvus import (     connections,     utility,     FieldSchema, CollectionSchema, DataType,     Collection, )  fmt = "n=== {:30} ===n" search_latency_fmt = "search latency = {:.4f}s" num_entities, dim = 3000, 8  ################################################################################# # 1. connect to Milvus # Add a new connection alias `default` for Milvus server in `localhost:19530` # Actually the "default" alias is a buildin in PyMilvus. # If the address of Milvus is the same as `localhost:19530`, you can omit all # parameters and call the method as: `connections.connect()`. # # Note: the `using` parameter of the following methods is default to "default". print(fmt.format("start connecting to Milvus")) connections.connect("default", host="localhost", port="19530")  has = utility.has_collection("hello_milvus") print(f"Does collection hello_milvus exist in Milvus: {has}")  ################################################################################# # 2. create collection # We're going to create a collection with 3 fields. # +-+------------+------------+------------------+------------------------------+ # | | field name | field type | other attributes |       field description      | # +-+------------+------------+------------------+------------------------------+ # |1|    "pk"    |   VarChar  |  is_primary=True |      "primary field"         | # | |            |            |   auto_id=False  |                              | # +-+------------+------------+------------------+------------------------------+ # |2|  "random"  |    Double  |                  |      "a double field"        | # +-+------------+------------+------------------+------------------------------+ # |3|"embeddings"| FloatVector|     dim=8        |  "float vector with dim 8"   | # +-+------------+------------+------------------+------------------------------+ fields = [     FieldSchema(name="pk", dtype=DataType.VARCHAR, is_primary=True, auto_id=False, max_length=100),     FieldSchema(name="random", dtype=DataType.DOUBLE),     FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=dim) ]  schema = CollectionSchema(fields, "hello_milvus is the simplest demo to introduce the APIs")  print(fmt.format("Create collection `hello_milvus`")) hello_milvus = Collection("hello_milvus", schema, consistency_level="Strong")  ################################################################################ # 3. insert data # We are going to insert 3000 rows of data into `hello_milvus` # Data to be inserted must be organized in fields. # # The insert() method returns: # - either automatically generated primary keys by Milvus if auto_id=True in the schema; # - or the existing primary key field from the entities if auto_id=False in the schema.  print(fmt.format("Start inserting entities")) rng = np.random.default_rng(seed=19530) entities = [     # provide the pk field because `auto_id` is set to False     [str(i) for i in range(num_entities)],     rng.random(num_entities).tolist(),  # field random, only supports list     rng.random((num_entities, dim)),  # field embeddings, supports numpy.ndarray and list ]  insert_result = hello_milvus.insert(entities)  hello_milvus.flush() print(f"Number of entities in Milvus: {hello_milvus.num_entities}")  # check the num_entities  ################################################################################ # 4. create index # We are going to create an IVF_FLAT index for hello_milvus collection. # create_index() can only be applied to `FloatVector` and `BinaryVector` fields. print(fmt.format("Start Creating index IVF_FLAT")) index = {     "index_type": "IVF_FLAT",     "metric_type": "L2",     "params": {"nlist": 128}, }  hello_milvus.create_index("embeddings", index)  ################################################################################ # 5. search, query, and hybrid search # After data were inserted into Milvus and indexed, you can perform: # - search based on vector similarity # - query based on scalar filtering(boolean, int, etc.) # - hybrid search based on vector similarity and scalar filtering. #  # Before conducting a search or a query, you need to load the data in `hello_milvus` into memory. print(fmt.format("Start loading")) hello_milvus.load()  # ----------------------------------------------------------------------------- # search based on vector similarity print(fmt.format("Start searching based on vector similarity")) vectors_to_search = entities[-1][-2:] search_params = {     "metric_type": "L2",     "params": {"nprobe": 10}, }  start_time = time.time() result = hello_milvus.search(vectors_to_search, "embeddings", search_params, limit=3, output_fields=["random"]) end_time = time.time()  for hits in result:     for hit in hits:         print(f"hit: {hit}, random field: {hit.entity.get('random')}") print(search_latency_fmt.format(end_time - start_time))  # ----------------------------------------------------------------------------- # query based on scalar filtering(boolean, int, etc.) print(fmt.format("Start querying with `random > 0.5`"))  start_time = time.time() result = hello_milvus.query(expr="random > 0.5", output_fields=["random", "embeddings"]) end_time = time.time()  print(f"query result:n-{result[0]}") print(search_latency_fmt.format(end_time - start_time))  # ----------------------------------------------------------------------------- # pagination r1 = hello_milvus.query(expr="random > 0.5", limit=4, output_fields=["random"]) r2 = hello_milvus.query(expr="random > 0.5", offset=1, limit=3, output_fields=["random"]) print(f"query pagination(limit=4):nt{r1}") print(f"query pagination(offset=1, limit=3):nt{r2}")  # ----------------------------------------------------------------------------- # hybrid search print(fmt.format("Start hybrid searching with `random > 0.5`"))  start_time = time.time() result = hello_milvus.search(vectors_to_search, "embeddings", search_params, limit=3, expr="random > 0.5",                              output_fields=["random"]) end_time = time.time()  for hits in result:     for hit in hits:         print(f"hit: {hit}, random field: {hit.entity.get('random')}") print(search_latency_fmt.format(end_time - start_time))  ############################################################################### # 6. delete entities by PK # You can delete entities by their PK values using boolean expressions. ids = insert_result.primary_keys  expr = f'pk in ["{ids[0]}" , "{ids[1]}"]' print(fmt.format(f"Start deleting with expr `{expr}`"))  result = hello_milvus.query(expr=expr, output_fields=["random", "embeddings"]) print(f"query before delete by expr=`{expr}` -> result: n-{result[0]}n-{result[1]}n")  hello_milvus.delete(expr)  result = hello_milvus.query(expr=expr, output_fields=["random", "embeddings"]) print(f"query after delete by expr=`{expr}` -> result: {result}n")  ############################################################################### # 7. drop collection # Finally, drop the hello_milvus collection print(fmt.format("Drop collection `hello_milvus`")) utility.drop_collection("hello_milvus")

升级版

现在，让我们来看一下使用LangChain版本的代码。由于我们使用的是封装好的Milvus，所以我们需要一个嵌入模型。在这里，我们选择了HuggingFaceEmbeddings中的sensenova/piccolo-base-zh模型作为示例，当然你也可以选择其他模型，这里没有限制。只要能将其作为一个变量传递给LangChain定义的函数调用即可。

下面是一个简单的示例，包括数据库连接、插入数据、查询以及得分情况的定义：

from langchain.embeddings import HuggingFaceEmbeddings from langchain.vectorstores import Milvus    model_name = "sensenova/piccolo-base-zh" embeddings = HuggingFaceEmbeddings(model_name=model_name)   print("链接数据库") vector_db = Milvus(     embeddings,     connection_args={"host": "localhost", "port": "19530"},     collection_name="hello_milvus", )  print("简单传入几个值") vector_db.add_texts(["12345678","789","努力的小雨是一个知名博主，其名下有公众号【灵墨AI探索室】，博客：稀土掘金、博客园、51CTO及腾讯云等","你好啊","我不好"])  print("查询前3个最相似的结果") docs = vector_db.similarity_search_with_score("你好啊",3)  print("查看其得分情况，分值越低越接近") for text in docs:     print('文本:%s,得分:%s'%(text[0].page_content,text[1]))

注意，以上代码只是一个简单示例，具体的实现可能会根据你的具体需求进行调整和优化。

在langchain版本的代码中，如果你想要执行除了自己需要开启docker中的milvus容器之外的操作，还需要确保你拥有网络代理。这里不多赘述，因为HuggingFace社区并不在国内。

个人定制版

接下来，我们将详细了解如何调用openai模型来回答问题！

from dotenv import load_dotenv from langchain.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate;  from langchain import PromptTemplate from langchain.chains import LLMChain  from langchain.chat_models.openai import ChatOpenAI  from langchain.schema import BaseOutputParser  # 加载env环境变量里的key值 load_dotenv() # 格式化输出 class CommaSeparatedListOutputParser(BaseOutputParser):     """Parse the output of an LLM call to a comma-separated list."""      def parse(self, text: str):         """Parse the output of an LLM call."""         return text.strip().split(", ") # 先从数据库查询问题解 docs = vector_db.similarity_search("努力的小雨是谁？") doc = docs[0].page_content  chat = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0) template = "请根据我提供的资料回答问题，资料： {input_docs}" system_message_prompt = SystemMessagePromptTemplate.from_template(template) human_template = "{text}" human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)  chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt, human_message_prompt])  # chat_prompt.format_messages(input_docs=doc, text="努力的小雨是谁？") chain = LLMChain(     llm=chat,     prompt=chat_prompt,     output_parser=CommaSeparatedListOutputParser() ) chain.run(input_docs=doc, text="努力的小雨是谁？")

当你成功运行完代码后，你将会得到你所期望的答案。如下图所示，这些答案将会展示在你的屏幕上。不然，如果系统不知道这些问题的答案，那它又如何能够给出正确的回答呢？

总结

通过本系列文章的学习，我们已经对个人或企业知识库有了一定的了解。尽管OpenAI已经提供了私有知识库的部署选项，但是其高昂的成本对于一些企业来说可能是难以承受的。无论将来国内企业是否会提供个人或企业知识库的解决方案，我们都需要对其原理有一些了解。无论我们的预算多少，都可以找到适合自己的玩法，因为不同预算的玩法也会有所不同。

发表评论