Knowledge

1. What are Tokens?

Tokens are the basic units that AI models use to process text; they can be understood as the smallest unit of the model's "thinking". A token is not exactly equivalent to a character or a word, as the examples below (and the short sketch after the list) show.

  • Chinese Tokenization: A Chinese character is usually encoded as 1-2 tokens (e.g., "你好" ≈ 2-4 tokens).
  • English Tokenization: Common words are usually 1 token, and longer or uncommon words are broken down into multiple tokens.
  • Special Characters: Spaces, punctuation marks, newlines, etc. also occupy tokens.
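
To make this concrete, here is a minimal token-counting sketch. It assumes the open-source tiktoken library (`pip install tiktoken`); other models ship their own tokenizers, so exact counts will vary.

```python
# Minimal token-counting sketch; assumes the open-source `tiktoken` library.
import tiktoken

# cl100k_base is the encoding used by many recent OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["hello", "unbelievable", "你好", "Hello, world!\n"]:
    ids = enc.encode(text)
    print(f"{text!r} -> {len(ids)} tokens: {ids}")
```

Exact counts depend on the encoding, but the output follows the patterns above: a common English word encodes to a single token, a rarer word to several, and Chinese text to roughly one to two tokens per character.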

2. What is a Tokenizer?

A tokenizer is the tool an AI model uses to convert text into tokens. Different models may use different tokenizers because they differ in training data, tokenization algorithms (such as BPE or WordPiece), and optimization goals; the sketch below shows how two such algorithms split the same sentence differently.
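
The following sketch compares a BPE tokenizer with a WordPiece tokenizer. It assumes the Hugging Face transformers library is installed; the model names are just examples of models that use those algorithms.

```python
# Compare how two tokenization algorithms split the same sentence.
# Assumes `pip install transformers`; the first run downloads tokenizer files.
from transformers import AutoTokenizer

bpe = AutoTokenizer.from_pretrained("gpt2")                     # byte-level BPE
wordpiece = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece

text = "Tokenization is model-specific."
print("GPT-2 (BPE):     ", bpe.tokenize(text))
print("BERT (WordPiece):", wordpiece.tokenize(text))
```

The two outputs differ (for example, WordPiece marks word-internal pieces with a "##" prefix), which is why token counts for the same text vary across models.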

3. What is an Embedding Model?

An embedding model converts high-dimensional discrete data (text, images, etc.) into low-dimensional continuous vectors. It acts as a "translator", turning information humans understand into a numeric form that AI can compute with.

  • Working Principle: Words are mapped into a vector space, where semantically similar words automatically cluster together (see the sketch after this list).
  • Application Scenarios: Text analysis, recommendation systems, image processing, semantic search.
  • Core Advantages: Effective dimensionality reduction, strong semantic preservation, high computational efficiency.
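
As a sketch of these ideas, the snippet below embeds three sentences and compares them with cosine similarity. It assumes the sentence-transformers library; the model name "all-MiniLM-L6-v2" is just one commonly used example.

```python
# Embed sentences and measure semantic similarity with cosine similarity.
# Assumes `pip install sentence-transformers numpy`.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "A cat sits on the mat.",
    "A kitten rests on a rug.",
    "Stock prices fell sharply today.",
]
vecs = model.encode(sentences)  # one continuous vector per sentence

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vecs[0], vecs[1]))  # semantically close sentences score higher
print(cosine(vecs[0], vecs[2]))  # an unrelated sentence scores lower
```

The first pair should score noticeably higher than the second, which is the "clustering" behavior that powers semantic search and recommendations.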

4. What is MCP (Model Context Protocol)?

MCP is an open protocol that standardizes how applications provide context to Large Language Models (LLMs).

  • Analogy: You can think of MCP as a "USB flash drive" for the AI field: various "plugins" that provide context can be "plugged into" an MCP server, and an LLM can call on them as needed (a minimal server sketch follows this list).
  • Core Advantages: Standardized interface, modular management, flexible selection, high scalability.
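
To ground the analogy, here is a minimal MCP server sketch. It assumes the official Python SDK (`pip install mcp`) and its FastMCP helper; the word_count tool is a hypothetical example, not part of the protocol itself.

```python
# Minimal MCP server sketch; assumes the official `mcp` Python SDK.
# The `word_count` tool is a hypothetical example for illustration.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-server")

@mcp.tool()
def word_count(text: str) -> int:
    """Count the words in a piece of text."""
    return len(text.split())

if __name__ == "__main__":
    mcp.run()  # serves over stdio so an MCP client (and its LLM) can call the tool
```

An MCP-aware client connects to this server, discovers word_count through the standardized interface, and lets the LLM invoke it on demand, which is exactly the "plug in and request as needed" pattern described above.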