Mi Mouse: things I read up on while slacking off during my Xiaomi internship

LLM fundamentals

ybq puts it quite well

God-tier code

gpt-fast
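It covers everything; all I can say is that it is extremely well written! It even includes tensor parallel.
Note, however, that the transformer part seems to support inference only: it only runs correctly once placeholder space for the kv_cache has been allocated, which is unnecessary during training.
It looks easy to change, though. In theory, commenting out that line should be enough?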

Sampling explained (greedy, beam search, top-K, top-P, temperature, combined strategies)

https://hengsblog.top/2023/11/01/decoder/
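Temperature only controls how steep the distribution curve is; it does not change the relative ordering of the values (although, because of limited precision, some values may be truncated and end up equal to others).
Temperature does change the absolute differences between the values.
So when one of the final candidate tokens A, B, C is chosen, the candidates' probabilities are renormalized before selection, and because temperature has changed the absolute probabilities, the chance of picking each token changes completely.
The final selection code works roughly like this: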

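import random

# Renormalized probabilities of the candidate tokens
probs = {'A': 0.5, 'B': 0.4, 'C': 0.1}

r = random.random()  # uniform draw in [0, 1)
cumulative = 0.0
tgt_token = None
for token, p in probs.items():
    cumulative += p  # walk the cumulative distribution
    if r < cumulative:
        tgt_token = token
        break
if tgt_token is None:
    # Floating-point round-off can leave the sum just below 1
    tgt_token = list(probs)[-1]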
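
To make the temperature point concrete, here is a minimal sketch of a temperature-scaled softmax. The function name `softmax_with_temperature` and the example logits are my own assumptions, not taken from the linked post. It shows that the ordering of the candidates is preserved while the actual probabilities shift.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max before exp to avoid overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for tokens A, B, C
sharp = softmax_with_temperature(logits, 0.5)  # low temperature: steeper
flat = softmax_with_temperature(logits, 2.0)   # high temperature: flatter
```

In both cases A stays ahead of B, which stays ahead of C, but the top token receives a much larger share at the lower temperature, so the sampling outcome differs.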

Activation functions

relu

import numpy as np

def relu(x):
    return np.maximum(0, x)

swish/SiLU

def swish(x, beta):
    # Swish: x * sigmoid(beta * x)
    return x / (1 + np.exp(-beta * x))

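SwiGLU
$$
\text{SwiGLU}(x) = \text{Swish}_\beta(xW_1) \otimes (xW_3)
$$
See: https://blog.csdn.net/yjw123456/article/details/138441972
SwiGLU is usually used together with an FFN, because the FFN has an up proj and a down proj. The full block is:
$$
\text{FFN}(x) = \left(\text{Swish}_\beta(xW_1) \otimes xW_3\right) W_2
$$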

import torch
from torch import nn
import torch.nn.functional as F

class FeedForward(nn.Module):
    def __init__(self, hidden_size: int, intermediate_size: int) -> None:
        super().__init__()

        self.w1 = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.w2 = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.w3 = nn.Linear(hidden_size, intermediate_size, bias=False)
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch_size, seq_len, hidden_size)
        # w1(x) -> (batch_size, seq_len, intermediate_size)
        # w3(x) -> (batch_size, seq_len, intermediate_size)
        # w2(*) -> (batch_size, seq_len, hidden_size)
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

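Why is SiLU used here? Because SiLU is exactly Swish with beta = 1.
w1 and w3 are both up projections. The actual computation path is:
x is projected up by w1, passed through SiLU to become gate values (not true probabilities, since SiLU is unbounded), then x is projected up by w3 and multiplied elementwise by the gate.
The result of this path plays the same role as the output of applying ReLU (or similar) right after the up proj in a standard FFN.
w2 then projects back down, much like in a ReLU FFN.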
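
The gating path can be checked on plain numbers. This is a small sketch with hypothetical helper names (`silu`, `swish`, `gated_hidden`) that stand in for the tensor operations; the inputs are the already-projected values `w1(x)` and `w3(x)`, not real model activations.

```python
import math

def swish(x, beta=1.0):
    # Swish: x * sigmoid(beta * x)
    return x / (1.0 + math.exp(-beta * x))

def silu(x):
    # SiLU: x * sigmoid(x), i.e. Swish with beta = 1
    return x * (1.0 / (1.0 + math.exp(-x)))

def gated_hidden(gate_pre, up_pre):
    """Elementwise silu(w1(x)) * w3(x), given the two projected vectors."""
    return [silu(g) * u for g, u in zip(gate_pre, up_pre)]

# Hypothetical pre-activations for one token with intermediate_size = 3
hidden = gated_hidden([0.0, 2.0, -2.0], [3.0, 1.0, 1.0])
```

A gate pre-activation of 0 closes the gate entirely, a large positive one passes the `w3` branch almost unchanged, and a negative one lets through a small negative value, which is why the gate is not a probability.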

In SVM kernels, why can the Gaussian/RBF kernel map the original samples into an infinite-dimensional space?

https://www.cnblogs.com/jiading/p/11695870.html

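Kernel function: directly computes the inner product (a similarity measure) of two points in some high-dimensional space, without first applying a mapping function to lift the points and then computing on the lifted points.

The radial basis function (RBF) kernel is widely used in machine learning and support vector machines (SVM). The main reasons it can be said to map into an infinite-dimensional space are as follows:

The nature of the feature map:
The RBF kernel works by computing the similarity of two data points in a high-dimensional space. For any two input vectors \( x \) and \( x' \), it has the form:
$$ K(x, x') = \exp\left(-\gamma \| x - x' \|^2 \right) $$
where \( \gamma \) is a parameter controlling the width of the kernel. The kernel in effect uses infinitely many high-dimensional features to express the similarity between the inputs.
An infinite-dimensional feature space:
The feature map of the RBF kernel can be viewed as projecting the input data into an infinite-dimensional feature space. Points that are not linearly separable in the original low-dimensional space can become linearly separable after this mapping.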
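
The "infinitely many features" claim can be made explicit with a short derivation. Expanding the squared norm and then the exponential as a Taylor series gives:

$$
K(x, x') = \exp\left(-\gamma \|x\|^2\right) \exp\left(-\gamma \|x'\|^2\right) \exp\left(2\gamma\, x^\top x'\right)
= \exp\left(-\gamma \|x\|^2\right) \exp\left(-\gamma \|x'\|^2\right) \sum_{n=0}^{\infty} \frac{(2\gamma)^n}{n!} \left(x^\top x'\right)^n
$$

Each term \( (x^\top x')^n \) is a polynomial kernel of degree \( n \), that is, an inner product of all degree-\( n \) monomials of the input features. Since the sum runs over every \( n \), the implied feature vector has infinitely many coordinates.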

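In practice, the RBF kernel just computes a function of the distance between two points.
Why does an SVM use a kernel? Its goal is to separate a batch of points with a boundary, so how is that boundary determined? From the existing points.
Distance and hinge loss
OK, so we obtain the boundary points (the support vectors) and the boundary/hyperplane they define.
Then this point, or these few points, serve as the fixed one of the two vectors fed into the RBF, and are used to classify the other points and every later point.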
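
The idea that the support vectors are the fixed argument of the kernel can be sketched as a decision function. The support vectors, coefficients, bias, and `gamma` below are made-up illustrative values, not the result of training, and `svm_decision` is a hypothetical name.

```python
import math

def rbf_kernel(x, y, gamma):
    """exp(-gamma * ||x - y||^2) for two equal-length sequences."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def svm_decision(x, support_vectors, coeffs, bias, gamma):
    """Sign of sum_i coeffs[i] * K(sv_i, x) + bias, where coeffs[i] = alpha_i * y_i."""
    score = sum(c * rbf_kernel(sv, x, gamma) for sv, c in zip(support_vectors, coeffs))
    return 1 if score + bias >= 0 else -1

# Assumed values: one support vector per class
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
coeffs = [1.0, -1.0]
bias = 0.0
gamma = 0.5

near_first = svm_decision((0.1, -0.1), support_vectors, coeffs, bias, gamma)
near_second = svm_decision((1.9, 2.2), support_vectors, coeffs, bias, gamma)
```

A new point is labelled by how close it is to each support vector, weighted by the coefficients, which is why the procedure can feel similar to assigning points to the nearest cluster centre.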

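Put this way, it feels a bit like clustering...

Why say the original samples are mapped into an infinite-dimensional space? Because the RBF kernel only computes the final similarity under the high-dimensional mapping and does not care about the mapped points themselves. The function looks simple, but its value is in fact determined by all kinds of combinations of the points' original features: expanding the exponential yields polynomial feature terms of every degree. The features of the points therefore all feed into the mapping, and this shows up in the mapped distance.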
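
As a numeric check of this, the following sketch builds a truncated explicit feature map for one-dimensional inputs and compares its inner product with the kernel value. The function names and the chosen numbers are assumptions for illustration; the map comes from the Taylor expansion of the exponential.

```python
import math

def rbf_kernel_1d(x, y, gamma):
    return math.exp(-gamma * (x - y) ** 2)

def truncated_feature_map(x, gamma, n_terms):
    """First n_terms coordinates of the explicit RBF feature map for scalar x."""
    scale = math.exp(-gamma * x * x)
    return [
        scale * math.sqrt((2 * gamma) ** n / math.factorial(n)) * x ** n
        for n in range(n_terms)
    ]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

gamma, x, y = 0.5, 0.3, -0.7
exact = rbf_kernel_1d(x, y, gamma)
approx_2 = dot(truncated_feature_map(x, gamma, 2), truncated_feature_map(y, gamma, 2))
approx_20 = dot(truncated_feature_map(x, gamma, 20), truncated_feature_map(y, gamma, 20))
```

With only two coordinates the inner product is visibly off, and adding more coordinates closes the gap; the kernel value is only reached exactly in the limit of infinitely many coordinates.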

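Qwen2-VL training strategy

At inference time, all images and videos are packed into a single sequence.
Model architecture: ViT + LLM
The ViT splits an image into patches, flattens them, maps them to tokens through an MLP, and then adds positional encoding.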
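
The split-and-flatten step can be sketched on a tiny single-channel image. This is a generic ViT-style illustration with a hypothetical `patchify` helper; the actual patch size used by Qwen2-VL, the MLP projection, and the positional encoding are not shown here.

```python
def patchify(image, patch_size):
    """Split an H x W image (list of rows) into flattened patch_size x patch_size patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch in row-major order
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches

# A 4 x 4 toy image whose pixel values are 0..15
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = patchify(image, 2)
```

Each flattened patch would then be projected to the model's hidden size to become one visual token.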

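First train only the ViT, then ViT + LLM, and finally only the LLM.
Following Qwen-VL, Qwen2-VL also uses a 3-stage training process: ViT training, then full-parameter training, then LLM instruction fine-tuning.

The LLM is initialized from Qwen2.
The vision encoder uses the encoder of a ViT model trained on the DFN dataset.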
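
The three stages can be written down as a small lookup of which part is trainable when. This only restates the notes above; the keys `"vit"` and `"llm"` and the helper name are hypothetical, and a real training script would instead toggle `requires_grad` on the corresponding parameters.

```python
# Stage number -> which components receive gradient updates (per the notes above)
STAGES = {
    1: {"vit": True, "llm": False},   # ViT training
    2: {"vit": True, "llm": True},    # full-parameter training
    3: {"vit": False, "llm": True},   # LLM instruction fine-tuning
}

def trainable_components(stage):
    """Return the sorted names of the components trained in the given stage."""
    return sorted(name for name, trainable in STAGES[stage].items() if trainable)
```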

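For the same optimizer, DeepSpeed's custom implementation seems to use more GPU memory than the Trainer's.

Notes on some bugs hit while adapting Qwen2VL to TRL PPO

About the generation config setting: in the file inference_test.py I have: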

from transformers import Qwen2VLForConditionalGeneration, GenerationConfig, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

model_dir = './Qwen2VL/Qwen2-VL-2B-Instruct'

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_dir,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cuda",  # changed to cuda here (flash attn raised an error)
)
processor = AutoProcessor.from_pretrained(model_dir)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": "./Qwen2VL/data/test1.jpg"},
            {"type": "image_url", "image_url": "./Qwen2VL/data/test2.jpg"},
            {"type": "text", "text": "网友第一次按摩，问这是不是正规的操作？[允悲]"},
            {"type": "text", "text": "之前提供的图文的内容是一个微博博文，请分析这条博文的内容、情感等信息"}
        ],
    }
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")
generation_config = GenerationConfig(
            max_new_tokens=512,
            temperature=1.0,
            top_k=0.0,
            top_p=1.0,
            do_sample=True,
            return_dict_in_generate=True,
            output_scores=True,
            output_attentions = False,
            output_hidden_states = False
        )
print(inputs)
generated_ids = model.generate(**inputs, generation_config=generation_config)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids.sequences)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

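With this setup, the decoded output is normal.
But inside the trainer, defining the same config and running batch decode, generation kept going until max tokens before stopping. In other words, generate was not running correctly.
Looking directly at Qwen2VL's generation config: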

{
    "bos_token_id": 151643,
    "pad_token_id": 151643,
    "do_sample": true,
    "eos_token_id": [
      151645,
      151643
    ],
    "repetition_penalty": 1.0,
    "temperature": 0.01,
    "top_p": 0.001,
    "top_k": 1,
    "transformers_version": "4.37.0"
  }

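This file specifies the end-of-sequence tokens, the pad token, and a series of other generate settings.
After adding the parameters from this file directly to the config inside the trainer and running generation again, the results were normal.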
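
To illustrate why the stop-token settings matter, here is a toy decode loop. It is not the transformers implementation and not a confirmed diagnosis of the bug; `toy_generate` and the token stream are invented for the example, and only the two EOS ids are taken from the config above.

```python
def toy_generate(next_token_stream, max_new_tokens, eos_token_ids=None):
    """Consume tokens until an EOS id appears or the token budget runs out."""
    stop_ids = set(eos_token_ids or [])
    generated = []
    for token in next_token_stream:
        if len(generated) >= max_new_tokens:
            break
        generated.append(token)
        if token in stop_ids:
            break  # a known EOS id ends generation early
    return generated

# Hypothetical stream in which the model emits 151645 as its third token
stream = [11, 12, 151645, 13, 14, 15, 16, 17]
with_eos = toy_generate(stream, 6, eos_token_ids=[151645, 151643])
without_eos = toy_generate(stream, 6)
```

Without any EOS ids the loop has nothing to stop on and runs to the budget, which matches the symptom of decoding until max tokens.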

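The cause is not yet known. For now, the main difference between the two pieces of code appears to be:

inference_test.py imports the official qwen-vl-utils package

A reasonable suspicion: does the package carry its own generation config that gets initialized automatically on import?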

def generate(
    lm_backbone: torch.nn.Module, 
    one_inputs, 
    pad_token_id: int, 
    generation_config: GenerationConfig
) -> tuple[torch.Tensor, torch.Tensor, int]:
    for key in one_inputs:
        if key == 'input_ids' or key == 'image_grid_thw' or key == 'attention_mask':
            one_inputs[key] = torch.LongTensor(one_inputs[key]).to('cuda')
        else:
            one_inputs[key] = torch.Tensor(one_inputs[key]).to('cuda')
    input_shape = one_inputs['input_ids'].shape
    context_length = input_shape[1]
    attention_mask = one_inputs['input_ids'] != pad_token_id
    one_inputs['input_ids'] = torch.masked_fill(one_inputs['input_ids'], ~attention_mask, 0)
    output = lm_backbone.generate(
        **one_inputs,
        generation_config=generation_config,
    )
    logits = torch.stack(output.scores, 1)
    return torch.cat((one_inputs['input_ids'], output.sequences[:, context_length:]), dim=1), logits, context_length

def pad(tensors: list[torch.Tensor], padding_value: int = 151643, padding_side: str = "right") -> torch.Tensor:
    output_shape = np.max([t.shape for t in tensors], 0).tolist()
    output = torch.full((len(tensors), *output_shape), padding_value, dtype=tensors[0].dtype, device=tensors[0].device)

    for i, t in enumerate(tensors):
        if padding_side == "left":
            seq_slice = slice(output_shape[0] - t.shape[0], output_shape[0])
        elif padding_side == "right":
            seq_slice = slice(0, t.shape[0])
        else:
            raise ValueError("padding_side must be 'left' or 'right'")

        slices = (seq_slice,) + tuple(slice(0, s) for s in t.shape[1:])
        output[i][slices] = t

    return output

@torch.no_grad()
def batch_generation(
    model: torch.nn.Module,
    inputs: list[dict],
    local_rollout_forward_batch_size: int,
    pad_token_id: int,
    generation_config: GenerationConfig,
):
    query_responses = []
    logitss = []
    prompt_length = []
    batch_size = len(inputs)
    for i in range(0, batch_size):
        micro_inputs = inputs[i]
        query_response, logits, context_length = generate(
            model,
            micro_inputs,
            pad_token_id,
            generation_config,
        )
        query_responses.append(query_response)
        logitss.append(logits)
        prompt_length.append(context_length)
    padded_query_responses = pad(query_responses, padding_value=pad_token_id, padding_side="right")
    padded_logitss = pad(logitss, padding_value=0, padding_side="right")
    tmp_1 = padded_query_responses.shape
    tmp_2 = padded_logitss.shape
    padded_query_responses = padded_query_responses.view(-1, padded_query_responses.shape[-1])[:batch_size]
    padded_logitss = padded_logitss.view(-1, *padded_logitss.shape[2:])[:batch_size]
    return padded_query_responses, padded_logitss, prompt_length

with unwrap_model_for_generation(model, self.accelerator) as unwrapped_model:
    # print
    query_responses, logitss, prompt_length = batch_generation(
        unwrapped_model.policy,
        data,
        args.local_rollout_forward_batch_size,
        pad_token_id=151643,
        generation_config=generation_config,
    )
for i in range(0, len(data)):
    context_length = prompt_length[i]
    query_response = query_responses[i]
    response = query_response[context_length:]
    logits = logitss[i]
    all_logprob = F.log_softmax(logits, dim=-1)
    logprob = torch.gather(all_logprob, 1, response.unsqueeze(-1)).squeeze(-1)
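
The last two lines of the loop above take a log-softmax over the vocabulary at every generated step and then pick out the entry of the token that was actually sampled. A list-based sketch of the same computation, with hypothetical names and a two-token toy vocabulary, looks like this:

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one step's logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]

def gather_logprobs(step_logits, response_ids):
    """Log-probability of each sampled token, one value per generation step."""
    return [log_softmax(row)[tok] for row, tok in zip(step_logits, response_ids)]

# Two generation steps over a vocabulary of size 2 (made-up numbers)
step_logits = [[0.0, 0.0], [math.log(3.0), 0.0]]
response_ids = [1, 0]
logprobs = gather_logprobs(step_logits, response_ids)
```

The first step is a 50/50 choice and the second assigns probability 3/4 to the sampled token, so the result is one log-probability per response token, the quantity PPO compares between policies.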

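The attention mask comes out with shape 1 * seq len; once it is fed into the model, a method converts it to 1 * seq len * seq len.
Meanwhile, when one sample goes through the model, the QKV computation produces attn weights of shape 1 * seq len * seq len (per head).
The mask is added to these weights, and after softmax, multiplying by V gives an attn output of shape 1 * seq len * hidden dim.
In other words, a single sample really does need a seq len * seq len attn mask.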
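
The expansion from a length-seq-len mask to a seq len * seq len one can be sketched as follows. This assumes a causal decoder and an additive mask convention (0 to keep, negative infinity to block); `expand_mask` is a hypothetical helper, not the method the library actually calls, and the batch and head dimensions are left out.

```python
NEG_INF = float("-inf")

def expand_mask(padding_mask):
    """Turn a 0/1 padding mask of length n into an n x n additive attention mask.

    Entry [q][k] is 0.0 when query position q may attend to key position k
    (k is not padding and k <= q), and -inf otherwise.
    """
    n = len(padding_mask)
    return [
        [0.0 if (k <= q and padding_mask[k] == 1) else NEG_INF for k in range(n)]
        for q in range(n)
    ]

# Three positions, the last one is padding
mask = expand_mask([1, 1, 0])
```

Adding this matrix to the seq len * seq len attention scores before the softmax zeroes out the blocked positions, which is why the per-sample mask has to be two-dimensional.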