Text Embeddings
To perform vector search, OramaCore uses text embeddings to convert text into numerical vectors. This page explains how to customize text embeddings in OramaCore.
Text embeddings are a way to represent words, phrases, or entire documents as numerical vectors, placing them in a dense space where similar meanings are positioned closer together.
This transformation enables machines to understand and compare text based on semantics rather than just raw characters or keywords.
Because of this, text embeddings play a crucial role in vector search, where finding relevant information requires measuring the similarity between different pieces of text.
By converting text into vectors, we can efficiently compare them using mathematical distance metrics like cosine similarity or Euclidean distance, allowing for fast and accurate retrieval of the most relevant results.
This approach is widely used in search engines, natural language processing, and various machine learning applications.
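To make the similarity comparison concrete, here is a minimal sketch of cosine similarity between two toy, two-dimensional vectors. Real embeddings have hundreds of dimensions, and OramaCore performs this comparison for you internally:

```js
// Cosine similarity: cos(a, b) = (a · b) / (||a|| * ||b||)
// Vectors pointing in similar directions score close to 1.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([0.2, 0.8], [0.25, 0.75])); // ≈ 0.997 (very similar)
console.log(cosineSimilarity([0.2, 0.8], [0.9, -0.1]));  // ≈ 0.134 (dissimilar)
```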
The Problem with Text Embeddings
Generating text embeddings for vector databases presents challenges in computational cost, latency, and model selection, as transformer-based models require significant processing power and memory.
High-dimensional embeddings improve expressiveness but slow down search efficiency and increase storage costs, often requiring dimensionality reduction.
Privacy concerns arise when using third-party APIs for embedding generation, pushing some applications toward self-hosted models with added operational complexity.
Real-time indexing and streaming data further demand fast, efficient embedding generation without performance degradation.
Additionally, general-purpose embeddings may not capture nuances in specialized domains, requiring fine-tuned models with high-quality training data, adding another layer of complexity.
How OramaCore Solves These Challenges
OramaCore addresses these challenges by providing a flexible, scalable, and customizable text embedding pipeline that integrates seamlessly with its vector database.
At the time of writing, OramaCore only supports locally hosted text embeddings. We selected the best-performing models and optimized them for speed and memory usage.
The supported models are:
| Model Name | Description | Language | Embedding Size |
|---|---|---|---|
| MultilingualE5Small | Multilingual model by Microsoft | Multilingual | 384 |
| MultilingualE5Base | Multilingual model by Microsoft | Multilingual | 768 |
| MultilingualE5Large | Multilingual model by Microsoft | Multilingual | 1024 |
| BGESmall | English model by BAAI | English | 384 |
| BGEBase | English model by BAAI | English | 768 |
| BGELarge | English model by BAAI | English | 1024 |
These models are optimized for speed and memory usage, making them suitable for real-time applications and large-scale deployments.
About E5 Models: If you have used the E5 models before, you may know that you need to specify the intent of each embedding generation request: the E5 authors recommend prepending the string `passage: ` to text being embedded for storage in a vector database, and `query: ` to the text of a search query. This is not needed in OramaCore, as it automatically handles this for you. You can simply provide the text you want to generate embeddings for, and OramaCore will take care of the rest.
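Conceptually, this is what OramaCore does for you under the hood (an illustrative sketch only; the prefixing happens inside OramaCore, never in your code):

```js
// Illustrative only: OramaCore prepends the E5 intent prefixes internally.
const textToIndex = "passage: Comfortable over-ear headphones with active noise cancellation.";
const searchQuery = "query: noise cancelling headphones";
```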
The current OramaCore implementation is designed to support multiple text embedding models, allowing users to choose the best model for their specific use case. Therefore, different collections can use different models.
Defaults and Configuration
By default, OramaCore uses the `MultilingualE5Small` model for text embeddings. This is a good starting point for most use cases, as it provides a balance between speed, memory usage, and accuracy. Most importantly, it supports multiple languages, allowing users from different regions to benefit from vector search.
When configuring OramaCore, you can specify what models or model groups to use for text embeddings:
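For example, the embedding settings might look like this (a sketch assuming a YAML configuration file; adapt the file name and nesting to your deployment):

```yaml
# Embedding-related settings. Key names follow the options described below;
# the exact file location and nesting may differ in your deployment.
embeddings:
  default_model_group: multilingual   # which models to preload at startup
  dynamically_load_models: false      # load models on demand instead of at startup
  execution_providers:                # execution providers, in priority order
    - CUDAExecutionProvider
    - CPUExecutionProvider
  total_threads: 8                    # CPU threads available to the models
```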
Let's break down the configuration options.
default_model_group
At startup, OramaCore will preload the models specified in the `default_model_group`. This group can be one of the following:
| Model Group | Description | Included Models |
|---|---|---|
| small | Small-sized models for text embeddings | MultilingualE5Small, BGESmall |
| multilingual | All multilingual models (default group) | MultilingualE5Small, MultilingualE5Base, MultilingualE5Large |
| en | English-only models | BGESmall, BGEBase, BGELarge |
| all | Complete set of available models | MultilingualE5Small, MultilingualE5Base, MultilingualE5Large, BGESmall, BGEBase, BGELarge |
By default, OramaCore uses the `multilingual` group, which includes all multilingual models. Unless specified otherwise, OramaCore will use the `MultilingualE5Small` model for text embeddings.
dynamically_load_models
When set to `true`, OramaCore will load models on demand, reducing startup time and memory usage. However, this may introduce latency when a model is loaded for the first time.
execution_providers
Specifies the execution providers to use for running the models. By default, OramaCore uses the `CUDAExecutionProvider` and the `CPUExecutionProvider` for GPU and CPU acceleration, respectively.
Please note: the `CUDAExecutionProvider` requires an NVIDIA GPU with CUDA support and the corresponding drivers installed. It is highly recommended for production deployments.
total_threads
The total number of threads to use for running the models. This value should be adjusted based on the number of available CPU cores and the expected workload.
Generating Text Embeddings on Your Data
When creating a collection in OramaCore, you can specify both the model used to generate embeddings and the fields to extract text from.
Default Behavior: Concatenate All `string` Fields
By default, OramaCore will use the default embedding model (`MultilingualE5Small`) and will concatenate all the strings found in the input document to generate the embeddings.
So, for example, if you have documents that look like this:
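```json
{
  "productName": "Wireless Headphones",
  "description": "Comfortable over-ear headphones with active noise cancellation.",
  "brand": "Acme Audio"
}
```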
the default behavior will be to concatenate the `productName`, `description`, and `brand` fields to generate the embeddings:
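```text
Wireless Headphones Comfortable over-ear headphones with active noise cancellation. Acme Audio
```

The exact separator used between fields is an internal implementation detail; the point is that every string field contributes to the embedding.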
Pros of this approach:
- Since OramaCore is schemaless, concatenating all string fields is a simple and effective way to generate embeddings for most use cases.
Cons of this approach:
- It may not capture the nuances of the text, as it treats all fields equally.
- It may ingest irrelevant information, such as IDs or metadata fields.
Customizing Text Extraction During Collection Creation
To customize text extraction, you can specify the fields to use and the model to generate embeddings when creating a collection:
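```json
{
  "id": "products",
  "embeddings": {
    "model": "BGESmall",
    "document_fields": ["productName", "description", "brand"]
  }
}
```

The payload above is an illustrative sketch; property names such as `document_fields` are assumptions here, so check the collection-creation API reference for the exact shape.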
In this example, we create a collection named `products` that uses the `BGESmall` model and extracts text from the `productName`, `description`, and `brand` fields.
Pros of this approach:
- Allows for fine-grained control over text extraction.
- Enables the use of specialized models for a specific collection.
Cons of this approach:
- Each document must contain the specified fields, which is not guaranteed in a schemaless system.
Using JavaScript Hooks
Since OramaCore ships with an integrated JavaScript runtime, you can use JavaScript hooks to customize text extraction and transformation.
As this is a more advanced topic, we have dedicated an entire section to it. Please refer to the JavaScript Hooks documentation for more information.
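As a quick teaser, a hook that selects which fields to embed might look like this (a hypothetical sketch; the actual hook names and signatures are documented in the JavaScript Hooks section):

```js
// Hypothetical hook: return the exact text that should be embedded
// for each document, instead of the default "concatenate all strings".
function selectEmbeddingsProperties(document) {
  return `${document.productName}. ${document.description}`;
}
```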