Vector Databases Explained in One Article (Part 2)

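Figure: Vector database architecture
The architecture diagram above has two key parts: embedding and the vector database. Embedding encodes unstructured data, such as text and images, into vectors, and the core of this step is the embedding model. The process works as follows: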

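Use an embedding model to create vectors for the content you want to index. This content can be text, images, video, and so on.
Insert the vectors into the vector database, together with the original content.
When the application issues a query, use the same embedding model to create a vector for the query, then use that vector to look up similar vectors in the database. Because each stored vector is kept with the content it was created from, the matching vectors lead back to the original content.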
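The three steps above can be sketched in a few lines of Python. This is only a minimal illustration, not how any real vector database is implemented: `toy_embed` is a hypothetical stand-in for a real embedding model (it simply counts words into hashed buckets), and `TinyVectorStore` is a made-up in-memory class that ranks by brute-force cosine similarity.

```python
import math
import zlib
from collections import Counter

DIM = 1024  # toy vector size, chosen arbitrarily for this sketch


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag of words."""
    vec = [0.0] * DIM
    for word, count in Counter(text.lower().split()).items():
        vec[zlib.crc32(word.encode("utf-8")) % DIM] += count
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]  # unit length, so dot product equals cosine


class TinyVectorStore:
    """Made-up in-memory store that keeps each vector next to its content."""

    def __init__(self):
        self.rows = []  # list of (vector, original_content)

    def add(self, content: str) -> None:
        # Steps 1 and 2: embed the content, then store vector and content together.
        self.rows.append((toy_embed(content), content))

    def query(self, text: str, n_results: int = 2) -> list[tuple[float, str]]:
        # Step 3: embed the query with the same model, rank by cosine similarity.
        q = toy_embed(text)
        scored = [(sum(a * b for a, b in zip(q, v)), c) for v, c in self.rows]
        return sorted(scored, key=lambda pair: pair[0], reverse=True)[:n_results]


store = TinyVectorStore()
for doc in [
    "the cat sat on the mat",
    "stock prices fell sharply today",
    "a cat chased the mouse",
]:
    store.add(doc)

top = store.query("where is the cat", n_results=2)
for score, content in top:
    print(f"{score:.3f}  {content}")
```

A real system would swap `toy_embed` for an actual embedding model and replace the linear scan with an approximate nearest-neighbor index, but the data flow stays the same.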

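In the spring when ChatGPT took off, vector databases sprang up like mushrooms after rain; see https://github.com/topics/vector-database. To understand them better, I tried out two open-source vector databases: Chroma and Milvus.
Chroma: the open-source embedding database.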

GitHub: https://github.com/chroma-core/chroma

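Chroma is an open-source embedding database built for storing and retrieving vector embeddings. It supports similarity search and large-scale vector data processing. Its architecture is designed to be flexible and scalable, so it can handle growing data volumes and query loads. Developers can customize and extend Chroma to fit their needs and use vector embeddings to build semantic search and analysis features.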

Figure: How an application interacts with Chroma

import chromadb

# Set up Chroma in-memory, for easy prototyping. Can add persistence easily!
client = chromadb.Client()

# Create collection. get_collection, get_or_create_collection, delete_collection also available!
collection = client.create_collection("all-my-documents")

# Add docs to the collection. Can also update and delete. Row-based API coming soon!
collection.add(
    # We handle tokenization, embedding, and indexing automatically.
    # You can skip that and add your own embeddings as well.
    documents=["This is document1", "This is document2"],
    metadatas=[{"source": "notion"}, {"source": "google-docs"}],  # filter on these!
    ids=["doc1", "doc2"],  # unique for each doc
)

# Query/search 2 most similar results. You can also .get by id
results = collection.query(
    query_texts=["This is a query document"],
    n_results=2,
    # where={"metadata_field": "is_equal_to_this"},  # optional filter
    # where_document={"$contains": "search_string"},  # optional filter
)
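`collection.query` returns a dictionary whose `ids`, `documents`, `metadatas`, and `distances` entries each hold one inner list per query text. The sketch below shows one way to flatten that shape into per-hit records. The `sample_results` dictionary is hand-written for illustration (the distance values are invented, not real Chroma output), and `iter_hits` is a hypothetical helper, not part of the Chroma API.

```python
from typing import Iterator

# Hand-written stand-in for the value returned by collection.query(...).
# The distances are invented for illustration; real values depend on the model.
sample_results = {
    "ids": [["doc1", "doc2"]],
    "documents": [["This is document1", "This is document2"]],
    "metadatas": [[{"source": "notion"}, {"source": "google-docs"}]],
    "distances": [[0.9, 1.1]],
}


def iter_hits(results: dict, query_index: int = 0) -> Iterator[dict]:
    """Hypothetical helper: yield one record per hit for a single query text."""
    columns = ("ids", "documents", "metadatas", "distances")
    rows = zip(*(results[name][query_index] for name in columns))
    for doc_id, document, metadata, distance in rows:
        yield {
            "id": doc_id,
            "document": document,
            "metadata": metadata,
            "distance": distance,
        }


hits = list(iter_hits(sample_results))
for hit in hits:
    print(hit["id"], hit["distance"], hit["metadata"]["source"])
```

With several query texts, pass a different `query_index` to read the hits for each one.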

Embedding models supported by Chroma:

all-MiniLM-L6-v2

from chromadb.utils import embedding_functions
default_ef = embedding_functions.DefaultEmbeddingFunction()

Sentence Transformers

sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")

OpenAI

openai_ef = embedding_functions.OpenAIEmbeddingFunction(api_key="YOUR_API_KEY",model_name="text-embedding-ada-002")

Cohere

cohere_ef = embedding_functions.CohereEmbeddingFunction(api_key="YOUR_API_KEY", model_name="multilingual-22-12")
multilingual_texts = ['Hello from Cohere!', '您好，来自 Cohere！']
cohere_ef(texts=multilingual_texts)

Instructor models

ef = embedding_functions.InstructorEmbeddingFunction(model_name="hkunlp/instructor-xl", device="cuda")

Google PaLM API models

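palm_embedding = embedding_functions.GooglePalmEmbeddingFunction(api_key=api_key, model=model_name)

Milvus: the most-starred open-source vector database on GitHub

Milvus is built for fast storage, retrieval, and analysis of large-scale vector data. It offers a high-performance, scalable architecture and supports multiple vector index algorithms and query modes. Milvus is used in recommendation systems, image search, natural language processing, machine learning, and similar areas, where it helps users find and analyze similar vectors quickly.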

GitHub: https://github.com/milvus-io/milvus

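Milvus is also available as a managed cloud service, Zilliz Cloud. To keep the walkthrough short, I signed up for that service directly to try out the vector database; it comes with a free trial credit of $100: https://cloud.zilliz.com/.
I will skip the Zilliz Cloud registration steps and go straight to creating a database and a collection. You can upload your own data (in which case you must generate the embedding vectors yourself) or create a collection from the provided sample data, then use data preview and run search queries. The later part shows examples of calling the service through the Python API to run searches and other tasks.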