Text Embeddings
To perform vector search, OramaCore uses text embeddings to convert text into numerical vectors. This page explains how to customize text embeddings in OramaCore.
Text embeddings are a way to represent words, phrases, or entire documents as numerical vectors, placing them in a dense space where similar meanings are positioned closer together.
This transformation enables machines to understand and compare text based on semantics rather than just raw characters or keywords.
Because of this, text embeddings play a crucial role in vector search, where finding relevant information requires measuring the similarity between different pieces of text.
By converting text into vectors, we can efficiently compare them using mathematical distance metrics like cosine similarity or Euclidean distance, allowing for fast and accurate retrieval of the most relevant results.
This approach is widely used in search engines, natural language processing, and various machine learning applications.
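To make the similarity comparison concrete, here is a minimal sketch of cosine similarity between two toy, two-dimensional vectors. Real embeddings have hundreds of dimensions, and OramaCore performs this comparison for you internally:

```js
// Cosine similarity: cos(a, b) = (a · b) / (||a|| * ||b||)
// Vectors pointing in similar directions score close to 1.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([0.2, 0.8], [0.25, 0.75])); // ≈ 0.997 (very similar)
console.log(cosineSimilarity([0.2, 0.8], [0.9, -0.1]));  // ≈ 0.134 (dissimilar)
```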
The Problem with Text Embeddings
Generating text embeddings for vector databases presents challenges in computational cost, latency, and model selection, as transformer-based models require significant processing power and memory.
High-dimensional embeddings improve expressiveness but slow down search efficiency and increase storage costs, often requiring dimensionality reduction.
Privacy concerns arise when using third-party APIs for embedding generation, pushing some applications toward self-hosted models with added operational complexity.
Real-time indexing and streaming data further demand fast, efficient embedding generation without performance degradation.
Additionally, general-purpose embeddings may not capture nuances in specialized domains, requiring fine-tuned models with high-quality training data, adding another layer of complexity.
How OramaCore Solves These Challenges
OramaCore addresses these challenges by providing a flexible, scalable, and customizable text embedding pipeline that integrates seamlessly with its vector database.
At the time of writing, OramaCore only supports locally hosted text embeddings. We selected the best-performing models and optimized them for speed and memory usage.
The supported models are:
| Model Name | Description | Language | Embedding Size |
|---|---|---|---|
| MultilingualE5Small | Multilingual model by Microsoft | Multilingual | 384 |
| MultilingualE5Base | Multilingual model by Microsoft | Multilingual | 768 |
| MultilingualE5Large | Multilingual model by Microsoft | Multilingual | 1024 |
| BGESmall | English model by BAAI | English | 384 |
| BGEBase | English model by BAAI | English | 768 |
| BGELarge | English model by BAAI | English | 1024 |
These models are optimized for speed and memory usage, making them suitable for real-time applications and large-scale deployments.
About E5 Models: If you have used the E5 models before, you may know that you need to specify the intent of each embedding generation request: the E5 authors recommend prepending the string `passage: ` to text being embedded for storage in a vector database, and `query: ` to the text of a search query. This is not needed in OramaCore, as it automatically handles this for you. You can simply provide the text you want to generate embeddings for, and OramaCore will take care of the rest.
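Conceptually, this is what OramaCore does for you under the hood (an illustrative sketch only; the prefixing happens inside OramaCore, never in your code):

```js
// Illustrative only: OramaCore prepends the E5 intent prefixes internally.
const textToIndex = "passage: Comfortable over-ear headphones with active noise cancellation.";
const searchQuery = "query: noise cancelling headphones";
```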
The current OramaCore implementation is designed to support multiple text embedding models, allowing users to choose the best model for their specific use case. Therefore, different collections can use different models.
Defaults and Configuration
By default, OramaCore uses the `MultilingualE5Small` model for text embeddings. This is a good starting point for most use cases, as it provides a balance between speed, memory usage, and accuracy. Most importantly, it supports multiple languages, allowing users from different regions to benefit from vector search.
When configuring OramaCore, you can specify what models or model groups to use for text embeddings:
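For example, the embedding settings might look like this (a sketch assuming a YAML configuration file; adapt the file name and nesting to your deployment):

```yaml
# Embedding-related settings. Key names follow the options described below;
# the exact file location and nesting may differ in your deployment.
embeddings:
  default_model_group: multilingual   # which models to preload at startup
  dynamically_load_models: false      # load models on demand instead of at startup
  execution_providers:                # execution providers, in priority order
    - CUDAExecutionProvider
    - CPUExecutionProvider
  total_threads: 8                    # CPU threads available to the models
```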
Let's break down the configuration options.
default_model_group
At startup, OramaCore will preload the models specified in the `default_model_group`. This group can be one of the following:
| Model Group | Description | Included Models |
|---|---|---|
| small | Small-sized models for text embeddings | MultilingualE5Small, BGESmall |
| multilingual | All multilingual models (default group) | MultilingualE5Small, MultilingualE5Base, MultilingualE5Large |
| en | English-only models | BGESmall, BGEBase, BGELarge |
| all | Complete set of available models | MultilingualE5Small, MultilingualE5Base, MultilingualE5Large, BGESmall, BGEBase, BGELarge |
By default, OramaCore uses the `multilingual` group, which includes all multilingual models. Unless specified otherwise, OramaCore will use the `MultilingualE5Small` model for text embeddings.
dynamically_load_models
When set to `true`, OramaCore will load models on demand, reducing startup time and memory usage. However, this may introduce latency when a model is loaded for the first time.
execution_providers
Specifies the execution providers to use for running the models. By default, OramaCore uses the `CUDAExecutionProvider` and the `CPUExecutionProvider` for GPU and CPU acceleration, respectively.
Please note: the `CUDAExecutionProvider` requires an NVIDIA GPU with CUDA support and the corresponding drivers installed. It is highly recommended for production deployments.
total_threads
The total number of threads to use for running the models. This value should be adjusted based on the number of available CPU cores and the expected workload.
Generating Text Embeddings on Your Data
When creating a collection in OramaCore, you can specify both the model used to generate embeddings and the fields to extract text from.
Default Behavior: Concatenate All `string` Fields
By default, OramaCore will use the default embedding model (`MultilingualE5Small`) and will concatenate all the strings found in the input document to generate the embeddings.
So, for example, if you have documents that look like this:
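```json
{
  "productName": "Wireless Headphones",
  "description": "Comfortable over-ear headphones with active noise cancellation.",
  "brand": "Acme Audio"
}
```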
the default behavior will be to concatenate the `productName`, `description`, and `brand` fields to generate the embeddings:
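```text
Wireless Headphones Comfortable over-ear headphones with active noise cancellation. Acme Audio
```

The exact separator used between fields is an internal implementation detail; the point is that every string field contributes to the embedding.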
Pros of this approach:
- Since OramaCore is schemaless, concatenating all string fields is a simple and effective way to generate embeddings for most use cases.
Cons of this approach:
- It may not capture the nuances of the text, as it treats all fields equally.
- It may ingest irrelevant information, such as IDs or metadata fields.
Customizing Text Extraction During Collection Creation
To customize text extraction, you can specify the fields to use and the model to generate embeddings when creating a collection:
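```json
{
  "id": "products",
  "embeddings": {
    "model": "BGESmall",
    "document_fields": ["productName", "description", "brand"]
  }
}
```

The payload above is an illustrative sketch; property names such as `document_fields` are assumptions here, so check the collection-creation API reference for the exact shape.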
In this example, we create a collection named `products` that uses the `BGESmall` model and extracts text from the `productName`, `description`, and `brand` fields.
Pros of this approach:
- Allows for fine-grained control over text extraction.
- Enables the use of specialized models for a specific collection.
Cons of this approach:
- Each document must contain the specified fields, which is not guaranteed in a schemaless system.
Using JavaScript Hooks
Since OramaCore ships with an integrated JavaScript runtime, you can use JavaScript hooks to customize text extraction and transformation.
As this is a more advanced topic, we have dedicated an entire section to it. Please refer to the JavaScript Hooks documentation for more information.
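As a quick teaser, a hook that selects which fields to embed might look like this (a hypothetical sketch; the actual hook names and signatures are documented in the JavaScript Hooks section):

```js
// Hypothetical hook: return the exact text that should be embedded
// for each document, instead of the default "concatenate all strings".
function selectEmbeddingsProperties(document) {
  return `${document.productName}. ${document.description}`;
}
```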