What is Gen AI?
Generative artificial intelligence, also known as generative AI or gen AI for short, is a type of AI that can create new content and ideas, including conversations, stories, images, videos and music. It can learn human language, programming languages, art, chemistry, biology or any complex subject matter. It reuses what it knows to solve new problems.
For example, it can learn English vocabulary and create a poem from the words it processes.
Your organization can use generative AI for various purposes, like chatbots, media creation, product development and design.
Major Gen AI LLM Providers:
- OpenAI
- Anthropic
- Google
- DeepSeek
- Alibaba
What is Tokenization?
Tokenization is the process of converting input text (such as a sentence or paragraph) into smaller pieces called tokens, which are the fundamental units that AI models process.
What is a Token?
A token is a word, part of a word, or a single character, depending on which tokenizer the LLM uses.
For example,
input_text = "I love coding and write code."
output_tokens = ["I", " love", " coding", " and", " write", " code", "."]
Why Tokenization Matters in GenAI
Generative AI models (like GPT, LLaMA, Claude) do not understand raw text directly. Instead, they:
- Tokenize the text.
- Convert tokens to numeric IDs.
- Process these numbers with a deep learning model.
- Generate output tokens, which are then converted back to readable text (detokenized).
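The steps above can be sketched with a toy word-level tokenizer. Real LLM tokenizers use subword schemes such as BPE, so this simplified example (the `build_vocab`, `encode`, and `decode` helpers are illustrative, not a real library API) only shows the encode/decode round trip:

```python
# Toy illustration of the tokenize -> encode -> process -> decode pipeline.
# Real tokenizers split subwords; here each whole word is one token.

def build_vocab(text):
    # Assign each unique word a numeric ID in order of first appearance.
    return {tok: i for i, tok in enumerate(dict.fromkeys(text.split()))}

def encode(text, vocab):
    # Text -> list of numeric IDs (what the model actually processes).
    return [vocab[tok] for tok in text.split()]

def decode(ids, vocab):
    # Numeric IDs -> readable text (detokenization).
    inv = {i: tok for tok, i in vocab.items()}
    return " ".join(inv[i] for i in ids)

text = "I love coding and I love writing code"
vocab = build_vocab(text)
ids = encode(text, vocab)
print(ids)                  # repeated words reuse the same ID
print(decode(ids, vocab))   # round-trips back to the original text
```

Note how "I" and "love" appear twice in the text but map to the same IDs both times; a real tokenizer's vocabulary works the same way, just at subword granularity.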
What Are Vector Embeddings?
A vector embedding is a dense, continuous, fixed-size vector (e.g., 512 or 768 dimensions) that captures the semantic meaning of the input.
For example:
- The word "king" might be represented as a 768-dimensional vector: [0.12, -0.84, ..., 0.09]
Words like "queen", "royalty" and "monarch" would have similar vectors, meaning they are close together in the vector space.
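"Close together" is usually measured with cosine similarity. Here is a minimal sketch using made-up 3-dimensional vectors in place of real 768-dimensional embeddings; the numbers are purely illustrative:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 = same direction,
    # values near 0 = unrelated directions.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for real embeddings of these words.
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.82, 0.12]
banana = [0.1, 0.05, 0.95]

print(cosine_similarity(king, queen))   # high: related meanings
print(cosine_similarity(king, banana))  # low: unrelated meanings
```

With real model-generated embeddings the same comparison works identically, just over hundreds of dimensions instead of three.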
Why Are Embeddings Important?
Embeddings let AI systems:
- Understand similarity between words or sentences.
- Perform semantic search.
- Enable recommendation systems.
How to Create Vector Embeddings in Python Using an OpenAI Embedding Model:
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # reads OPENAI_API_KEY from a .env file
client = OpenAI()

text = "java better than javascript"

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=text
)

print("Vector Embeddings:", response.data[0].embedding)
print(len(response.data[0].embedding))  # number of dimensions
Quick Summary
This article covered tokenization, the process of breaking text into tokens for AI processing, and vector embeddings, which capture semantic meaning in dense vectors. These concepts are crucial for understanding word similarity and semantic search. The article also showed how to create vector embeddings using an OpenAI embedding model.
Thank you