Lucidrains Project Source Code Walkthrough (Part 113)

.\lucidrains\voicebox-pytorch\voicebox_pytorch\__init__.py

# Import Transformer, EncodecVoco, VoiceBox, DurationPredictor and ConditionalFlowMatcherWrapper from voicebox_pytorch.voicebox_pytorch
from voicebox_pytorch.voicebox_pytorch import (
    Transformer,
    EncodecVoco,
    VoiceBox,
    DurationPredictor,
    ConditionalFlowMatcherWrapper,
)

# Import VoiceBoxTrainer from voicebox_pytorch.trainer
from voicebox_pytorch.trainer import (
    VoiceBoxTrainer
)

# Import TextToSemantic from spear_tts_pytorch
from spear_tts_pytorch import TextToSemantic

# Import HubertWithKmeans from audiolm_pytorch
from audiolm_pytorch import HubertWithKmeans

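x-clip

A concise but complete implementation of CLIP, with various experimental improvements from recent papers.

Install

$ pip install x-clip

Usage

import torch
from x_clip import CLIP

clip = CLIP(
    dim_text = 512,
    dim_image = 512,
    dim_latent = 512,
    num_text_tokens = 10000,
    text_enc_depth = 6,
    text_seq_len = 256,
    text_heads = 8,
    visual_enc_depth = 6,
    visual_image_size = 256,
    visual_patch_size = 32,
    visual_heads = 8,
    visual_patch_dropout = 0.5,             # patch dropout probability, used in Kaiming He's FLIP to save compute and improve end results; 0.5 is a good value, 0.75 on the high end is tolerable
    use_all_token_embeds = False,           # whether to use fine-grained contrastive learning (FILIP)
    decoupled_contrastive_learning = True,  # use the decoupled contrastive learning (DCL) objective, removing positive pairs from the denominator of the InfoNCE loss (CLOOB + DCL)
    extra_latent_projection = True,         # whether to use separate projections for text-to-image and image-to-text comparisons (CLOOB)
    use_visual_ssl = True,                  # whether to do self-supervised learning on images
    use_mlm = False,                        # use masked language modeling (MLM) on text (DeCLIP)
    text_ssl_loss_weight = 0.05,            # weight for the text MLM loss
    image_ssl_loss_weight = 0.05            # weight for the image self-supervised learning loss
)

# mock data

text = torch.randint(0, 10000, (4, 256))
images = torch.randn(4, 3, 256, 256)

# train

loss = clip(
    text,
    images,
    freeze_image_encoder = False,   # whether to freeze the image encoder when using a pretrained image net, as proposed in the LiT paper
    return_loss = True              # must be set to True to return the contrastive loss
)

loss.backward()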

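You can also pass in an external vision transformer or residual network. You only need to make sure that your image encoder returns a set of embeddings of shape batch x seq x dim, and that dim_image is set to the dimension of the returned embeddings. Below is an example using the vision transformer from vit_pytorch.

$ pip install 'vit_pytorch>=0.25.6'

import torch
from x_clip import CLIP

from vit_pytorch import ViT
from vit_pytorch.extractor import Extractor

base_vit = ViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 512,
    depth = 6,
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1
)

vit = Extractor(
    base_vit,
    return_embeddings_only = True
)

clip = CLIP(
    image_encoder = vit,
    dim_image = 512,           # must match the dimension of the vision transformer above
    dim_text = 512,
    dim_latent = 512,
    num_text_tokens = 10000,
    text_enc_depth = 6,
    text_seq_len = 256,
    text_heads = 8
)

text = torch.randint(0, 10000, (4, 256))
images = torch.randn(4, 3, 256, 256)

loss = clip(text, images, return_loss = True)
loss.backward()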

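Finally, the text transformer can also be defined externally. For now, it must return embeddings that include the CLS token.

import torch
from x_clip import CLIP, TextTransformer

from vit_pytorch import ViT
from vit_pytorch.extractor import Extractor

base_vit = ViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 512,
    depth = 6,
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1
)

image_encoder = Extractor(
    base_vit,
    return_embeddings_only = True
)

text_encoder = TextTransformer(
    dim = 512,
    num_tokens = 10000,
    max_seq_len = 256,
    dim_head = 64,             # required keyword in the TextTransformer signature listed in x_clip.py below; 64 is an assumed value
    depth = 6,
    heads = 8
)

clip = CLIP(
    image_encoder = image_encoder,
    text_encoder = text_encoder,
    dim_image = 512,
    dim_text = 512,
    dim_latent = 512
)

text = torch.randint(0, 10000, (4, 256))
images = torch.randn(4, 3, 256, 256)

loss = clip(text, images, return_loss = True)
loss.backward()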

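Multiview CL Losses

This repository also supports the multiview contrastive learning loss proposed in DeCLIP. Just pass in the augmented text and/or augmented images, and the loss is computed automatically, weighted by the multiview_loss_weight set at initialization.

For example:

import torch
from x_clip import CLIP, TextTransformer

from vit_pytorch import ViT
from vit_pytorch.extractor import Extractor

base_vit = ViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 512,
    depth = 6,
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1
)

image_encoder = Extractor(
    base_vit,
    return_embeddings_only = True
)

text_encoder = TextTransformer(
    dim = 512,
    num_tokens = 10000,
    max_seq_len = 256 + 1,
    dim_head = 64,             # required keyword in the TextTransformer signature listed in x_clip.py below; 64 is an assumed value
    depth = 6,
    heads = 8
)

clip = CLIP(
    image_encoder = image_encoder,
    text_encoder = text_encoder,
    dim_image = 512,
    dim_text = 512,
    dim_latent = 512,
    extra_latent_projection = True,
    multiview_loss_weight = 0.1         # weight the multiview contrastive loss by 0.1
)

text = torch.randint(0, 10000, (4, 256))
images = torch.randn(4, 3, 256, 256)

aug_text = torch.randint(0, 10000, (4, 256))  # augmented text (back-translation or EDA), same shape as text
aug_images = torch.randn(4, 3, 256, 256)      # augmented images, same shape as images above

loss = clip(
    text,
    images,
    aug_text = aug_text,           # pass in the augmented text
    aug_image = aug_images,        # pass in the augmented images
    return_loss = True,
    freeze_image_encoder = True
)

loss.backward()

You can also pass in more than one augmented text or image:

# ...

aug_texts = (
    torch.randint(0, 10000, (4, 256)),
    torch.randint(0, 10000, (4, 256)),
)

aug_images = (
    torch.randn(4, 3, 256, 256),
    torch.randn(4, 3, 256, 256),
)

loss = clip(
    text,
    images,
    aug_text = aug_texts,
    aug_image = aug_images,
    return_loss = True,
    freeze_image_encoder = True
)

loss.backward()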

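Custom Vision Self-supervised Learning Module

You can pass in your own vision self-supervised learning module through the visual_ssl keyword.

import torch
from x_clip import CLIP
from x_clip.visual_ssl import SimSiam

from vit_pytorch import ViT
from vit_pytorch.extractor import Extractor

base_vit = ViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 512,
    depth = 6,
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1
)

image_encoder = Extractor(
    base_vit,
    return_embeddings_only = True
)

visual_ssl = SimSiam(                 # SimSiam is defined externally; it must be a module that accepts images of the same dimensions as CLIP and returns a scalar loss
    image_encoder,
    image_size = 256,
    hidden_layer = -1
)

clip = CLIP(
    image_encoder = image_encoder,
    dim_image = 512,
    dim_text = 512,
    dim_latent = 512,
    use_mlm = True,
    visual_ssl = visual_ssl,           # SSL module passed into CLIP
    use_all_token_embeds = False,
    extra_latent_projection = False,
    mlm_random_token_prob = 0.1
)

text = torch.randint(0, 10000, (4, 256))
images = torch.randn(4, 3, 256, 256)

loss = clip(text, images, return_loss = True)
loss.backward()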

Citations

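@misc{radford2021learning,
title = {Learning Transferable Visual Models From Natural Language Supervision},
author = {Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
year = {2021},
eprint = {2103.00020},
archivePrefix = {arXiv},
primaryClass = {cs.CV}
}

@misc{yao2021filip,
title = {FILIP: Fine-grained Interactive Language-Image Pre-Training},
author = {Lewei Yao and Runhui Huang and Lu Hou and Guansong Lu and Minzhe Niu and Hang Xu and Xiaodan Liang and Zhenguo Li and Xin Jiang and Chunjing Xu},
year = {2021},
eprint = {2111.07783},
archivePrefix = {arXiv},
primaryClass = {cs.CV}
}

@misc{furst2021cloob,
title = {CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP},
author = {Andreas Fürst and Elisabeth Rumetshofer and Viet Tran and Hubert Ramsauer and Fei Tang and Johannes Lehner and David Kreil and Michael Kopp and Günter Klambauer and Angela Bitto-Nemling and Sepp Hochreiter},
year = {2021},
eprint = {2110.11316},
archivePrefix = {arXiv},
primaryClass = {cs.LG}
}

@misc{yeh2021decoupled,
title = {Decoupled Contrastive Learning},
author = {Chun-Hsiao Yeh and Cheng-Yao Hong and Yen-Chi Hsu and Tyng-Luh Liu and Yubei Chen and Yann LeCun},
year = {2021},
eprint = {2110.06848},
archivePrefix = {arXiv},
primaryClass = {cs.LG}
}

@misc{zhai2021lit,
title = {LiT: Zero-Shot Transfer with Locked-image Text Tuning},
author = {Xiaohua Zhai and Xiao Wang and Basil Mustafa and Andreas Steiner and Daniel Keysers and Alexander Kolesnikov and Lucas Beyer},
year = {2021},
eprint = {2111.07991},
archivePrefix = {arXiv},
primaryClass = {cs.CV}
}

@misc{li2021supervision,
title = {Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm},
author = {Yangguang Li and Feng Liang and Lichen Zhao and Yufeng Cui and Wanli Ouyang and Jing Shao and Fengwei Yu and Junjie Yan},
year = {2021},
eprint = {2110.05208},
archivePrefix = {arXiv},
primaryClass = {cs.CV}
}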

@Article{mu2021slip,
author = {Norman Mu and Alexander Kirillov and David Wagner and Saining Xie},
title = {SLIP: Self-supervision meets Language-Image Pre-training},
journal = {arXiv preprint arXiv:2112.12750},
year = {2021},
}

@misc{su2021roformer,
title = {RoFormer: Enhanced Transformer with Rotary Position Embedding},
author = {Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu},
year = {2021},
eprint = {2104.09864},
archivePrefix = {arXiv},
primaryClass = {cs.CL}
}

@inproceedings{anonymous2022normformer,
title = {NormFormer: Improved Transformer Pretraining with Extra Normalization},
author = {Anonymous},
booktitle = {Submitted to The Tenth International Conference on Learning Representations },
year = {2022},
url = {https://openreview.net/forum?id=GMYWzWztDx5},
note = {under review}
}

@inproceedings{Li2022ScalingLP,
title = {Scaling Language-Image Pre-training via Masking},
author = {Yanghao Li and Haoqi Fan and Ronghang Hu and Christoph Feichtenhofer and Kaiming He},
year = {2022}
}

@article{Liu2022PatchDropoutEV,
title = {PatchDropout: Economizing Vision Transformers Using Patch Dropout},
author = {Yue Liu and Christos Matsoukas and Fredrik Strand and Hossein Azizpour and Kevin Smith},
journal = {ArXiv},
year = {2022},
volume = {abs/2208.07220}
}

@misc{shi2023enhance,
title = {Enhance audio generation controllability through representation similarity regularization},
author = {Yangyang Shi and Gael Le Lan and Varun Nagaraja and Zhaoheng Ni and Xinhao Mei and Ernie Chang and Forrest Iandola and Yang Liu and Vikas Chandra},
year = {2023},
eprint = {2309.08773},
archivePrefix = {arXiv},
primaryClass = {cs.SD}
}

.\lucidrains\x-clip\setup.py

# Import the setup and package-discovery helpers
from setuptools import setup, find_packages

# Package metadata
setup(
    # package name
    name = 'x-clip',
    # find all packages, excluding none
    packages = find_packages(exclude=[]),
    # include all data files
    include_package_data = True,
    # version number
    version = '0.14.4',
    # license
    license = 'MIT',
    # short description
    description = 'X-CLIP',
    # author
    author = 'Phil Wang',
    # author email
    author_email = 'lucidrains@gmail.com',
    # project URL
    url = 'https://github.com/lucidrains/x-clip',
    # content type of the long description
    long_description_content_type = 'text/markdown',
    # keywords
    keywords = [
        'artificial intelligence',
        'deep learning',
        'contrastive learning',
        'CLIP',
    ],
    # install-time dependencies
    install_requires = [
        'beartype',
        'einops>=0.6',
        'ftfy',
        'regex',
        'torch>=1.6',
        'torchvision'
    ],
    # trove classifiers
    classifiers = [
        'Development Status :: 4 - Beta',
        'Intended Audience :: Developers',
        'Topic :: Scientific/Engineering :: Artificial Intelligence',
        'License :: OSI Approved :: MIT License',
        'Programming Language :: Python :: 3.6',
    ],
)

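.\lucidrains\x-clip\x_clip\distributed.py

# Import torch
import torch
# Import torch.nn.functional; pad_dim_to below calls F.pad (this import is missing from the listing as published)
import torch.nn.functional as F
# Import the Function base class for custom autograd operations
from torch.autograd import Function
# Import torch.distributed
import torch.distributed as distributed
# Import rearrange from einops
from einops import rearrange

# Return True when a value is not None; all_gather_variable_dim below relies on this helper
def exists(val):
    return val is not None

# Right-pad tensor t along `dim` until that dimension has size `length`
def pad_dim_to(t, length, dim = 0):
    # amount of padding needed
    pad_length = length - t.shape[dim]
    # number of (0, 0) pairs to emit for the dimensions that come after `dim`
    zero_pairs = (-dim - 1) if dim < 0 else (t.ndim - dim - 1)
    # pad only on the right side of `dim`
    return F.pad(t, (*((0, 0) * zero_pairs), 0, pad_length))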
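The pad tuple built in pad_dim_to can be hard to read at a glance, because F.pad lists its (left, right) pairs starting from the last dimension. The following is a minimal pure-Python sketch of just that tuple arithmetic; `pad_spec` is a hypothetical helper name introduced for illustration and works on a shape tuple rather than a tensor.

```python
def pad_spec(shape, length, dim=0):
    """Return the F.pad-style tuple that right-pads `dim` of `shape` up to `length`."""
    ndim = len(shape)
    pad_length = length - shape[dim]
    # F.pad lists (left, right) pairs starting from the LAST dimension,
    # so every dimension after `dim` needs an explicit (0, 0) pair first.
    zero_pairs = (-dim - 1) if dim < 0 else (ndim - dim - 1)
    return (*((0, 0) * zero_pairs), 0, pad_length)


# Pad the middle dimension of a (2, 3, 4) shape from 3 up to 5.
print(pad_spec((2, 3, 4), length=5, dim=1))   # (0, 0, 0, 2)
# A negative index that refers to the same dimension gives the same tuple.
print(pad_spec((2, 3, 4), length=5, dim=-2))  # (0, 0, 0, 2)
# Padding the last dimension needs no leading (0, 0) pairs.
print(pad_spec((2, 3, 4), length=6, dim=2))   # (0, 2)
```

Positive and negative `dim` values take different branches but produce the same result for the same dimension.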
# distributed helpers

# Gather tensors from all ranks when their size along `dim` may differ per rank
def all_gather_variable_dim(t, dim = 0, sizes = None):
    # current device, process rank and world size
    device, rank, world_size = t.device, distributed.get_rank(), distributed.get_world_size()

    # if the per-rank sizes were not supplied, gather them first
    if not exists(sizes):
        # size of this rank's tensor along `dim`
        size = torch.tensor(t.shape[dim], device = device, dtype = torch.long)
        # one placeholder per rank for the gathered sizes
        sizes = [torch.empty_like(size, device = device, dtype = torch.long) for i in range(world_size)]
        # gather the sizes from every rank
        distributed.all_gather(sizes, size)
        # stack them into a single tensor
        sizes = torch.stack(sizes)

    # largest size across ranks
    max_size = sizes.amax().item()
    # pad the local tensor up to that size, since all_gather needs equal shapes
    padded_t = pad_dim_to(t, max_size, dim = dim)

    # one placeholder per rank for the gathered tensors
    gathered_tensors = [torch.empty(padded_t.shape, device = device, dtype = padded_t.dtype) for i in range(world_size)]
    # gather the padded tensors
    distributed.all_gather(gathered_tensors, padded_t)

    # concatenate along `dim`
    gathered_tensor = torch.cat(gathered_tensors, dim = dim)
    # positions 0 .. max_size - 1 within each rank's chunk
    seq = torch.arange(max_size, device = device)

    # mask that is True for real entries and False for padding
    mask = rearrange(seq, 'j -> 1 j') < rearrange(sizes, 'i -> i 1')
    mask = rearrange(mask, 'i j -> (i j)')
    seq = torch.arange(mask.shape[-1], device = device)
    indices = seq[mask]

    # keep only the real entries
    gathered_tensor = gathered_tensor.index_select(dim, indices)

    return gathered_tensor, sizes
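The last step of all_gather_variable_dim, which strips the padding after concatenation, can be checked without a distributed setup. The sketch below reproduces only the mask and index arithmetic in plain Python; `valid_indices` is a hypothetical helper, and the per-rank sizes are made-up values.

```python
def valid_indices(sizes):
    """Indices of real (non-padding) rows after concatenating padded per-rank chunks."""
    max_size = max(sizes)
    # mask[i][j] is True when slot j of rank i holds real data
    mask = [[j < size for j in range(max_size)] for size in sizes]
    # flatten rank-major, matching rearrange(mask, 'i j -> (i j)')
    flat_mask = [flag for row in mask for flag in row]
    return [idx for idx, flag in enumerate(flat_mask) if flag]


sizes = [2, 3, 1]  # hypothetical per-rank batch sizes
print(valid_indices(sizes))  # [0, 1, 3, 4, 5, 6]
```

With three ranks padded to length 3, slots 2, 7 and 8 are padding and are dropped by the final index_select.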
# AllGather: a custom autograd Function wrapping the gather above
class AllGather(Function):
    @staticmethod
    def forward(ctx, x, dim, sizes):
        # the distributed environment must be initialized with more than one process
        assert distributed.is_initialized() and distributed.get_world_size() > 1
        # gather the data from all ranks
        x, batch_sizes = all_gather_variable_dim(x, dim = dim, sizes = sizes)
        ctx.batch_sizes = batch_sizes.tolist()
        ctx.dim = dim
        return x, batch_sizes

    @staticmethod
    def backward(ctx, grads, _):
        # per-rank batch sizes and the current rank
        batch_sizes, rank = ctx.batch_sizes, distributed.get_rank()
        # split the gradient back into per-rank chunks
        grads_by_rank = grads.split(batch_sizes, dim = ctx.dim)
        # return only the chunk that belongs to this rank
        return grads_by_rank[rank], None, None

# Expose AllGather as a plain function
all_gather = AllGather.apply

.\lucidrains\x-clip\x_clip\mlm.py

import math
from functools import reduce

import torch
from torch import nn
import torch.nn.functional as F

# helper functions

# Boolean mask with the shape of t, where each entry is True with probability `prob`
def prob_mask_like(t, prob):
    return torch.zeros_like(t).float().uniform_(0, 1) < prob

# Boolean mask that is True wherever t equals one of `token_ids`
def mask_with_tokens(t, token_ids):
    init_no_mask = torch.full_like(t, False, dtype=torch.bool)
    mask = reduce(lambda acc, el: acc | (t == el), token_ids, init_no_mask)
    return mask

# From the True positions of `mask`, randomly select roughly a `prob` fraction per row
def get_mask_subset_with_prob(mask, prob):
    batch, seq_len, device = *mask.shape, mask.device
    max_masked = math.ceil(prob * seq_len)

    num_tokens = mask.sum(dim=-1, keepdim=True)
    mask_excess = (mask.cumsum(dim=-1) > (num_tokens * prob).ceil())
    mask_excess = mask_excess[:, :max_masked]

    rand = torch.rand((batch, seq_len), device=device).masked_fill(~mask, -1e9)
    _, sampled_indices = rand.topk(max_masked, dim=-1)
    sampled_indices = (sampled_indices + 1).masked_fill_(mask_excess, 0)

    new_mask = torch.zeros((batch, seq_len + 1), device=device)
    new_mask.scatter_(-1, sampled_indices, 1)
    return new_mask[:, 1:].bool()
# main class
class MLM(nn.Module):
    def __init__(
        self,
        transformer,
        *,
        dim,
        num_tokens,
        mask_prob = 0.15,
        replace_prob = 0.9,
        random_token_prob = 0.,
        mask_token_id = 2,
        pad_token_id = 0,
        mask_ignore_token_ids = []):
        super().__init__()

        self.transformer = transformer

        # MLM probabilities
        self.mask_prob = mask_prob
        self.replace_prob = replace_prob

        self.num_tokens = num_tokens
        self.random_token_prob = random_token_prob

        # token ids
        self.pad_token_id = pad_token_id
        self.mask_token_id = mask_token_id
        self.mask_ignore_token_ids = set([*mask_ignore_token_ids, pad_token_id])

        # projection to text logits
        self.to_logits = nn.Linear(dim, num_tokens)

    def forward(self, seq, **kwargs):
        # do not mask [pad] tokens, or any other excluded tokens ([cls], [sep])
        # these special tokens are also never chosen as random replacements
        no_mask = mask_with_tokens(seq, self.mask_ignore_token_ids)
        mask = get_mask_subset_with_prob(~no_mask, self.mask_prob)

        # positions not selected for masking get the pad id as label, so the loss ignores them
        labels = seq.masked_fill(~mask, self.pad_token_id)

        # mask seq with the mask token with probability `replace_prob` (tokens stay unchanged with probability 1 - replace_prob)
        masked_seq = seq.clone().detach()

        # if a random-token probability > 0 is used for MLM
        if self.random_token_prob > 0:
            assert self.num_tokens is not None, 'num_tokens keyword must be supplied when instantiating MLM if using random token replacement'

            random_token_prob = prob_mask_like(seq, self.random_token_prob)
            random_tokens = torch.randint(0, self.num_tokens, seq.shape, device = seq.device)
            random_no_mask = mask_with_tokens(random_tokens, self.mask_ignore_token_ids)
            random_token_prob &= ~random_no_mask
            masked_seq = torch.where(random_token_prob, random_tokens, masked_seq)

            # remove the randomly replaced positions from the mask
            mask = mask & ~random_token_prob

        # [mask] seq
        replace_prob = prob_mask_like(seq, self.replace_prob)
        masked_seq = masked_seq.masked_fill(mask * replace_prob, self.mask_token_id)

        # run the transformer on the corrupted sequence
        embedding = self.transformer(masked_seq, **kwargs)

        # project to logits and drop the CLS position
        logits = self.to_logits(embedding)
        logits = logits[:, 1:]

        # MLM loss; positions labelled with the pad id are ignored
        mlm_loss = F.cross_entropy(
            logits.transpose(1, 2),
            labels,
            ignore_index = self.pad_token_id
        )

        return mlm_loss
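The way MLM.forward builds its labels relies on the pad id doing double duty as the "ignore" label. The sketch below isolates that step with a fixed, hand-written mask instead of a sampled one; the 10-token vocabulary, the token ids, and the uniform logits are all illustrative assumptions, and the CLS removal step is left out.

```python
import math

import torch
import torch.nn.functional as F

pad_token_id = 0
seq = torch.tensor([[5, 6, 7, 0]])                   # last position is padding
mask = torch.tensor([[False, True, False, False]])   # assume only position 1 was selected

# Positions that were not selected receive the pad id, which cross_entropy then ignores.
labels = seq.masked_fill(~mask, pad_token_id)

# Uniform logits over a hypothetical 10-token vocabulary, shaped (batch, seq, vocab).
logits = torch.zeros(1, 4, 10)

# cross_entropy expects the class dimension second, hence the transpose.
loss = F.cross_entropy(logits.transpose(1, 2), labels, ignore_index=pad_token_id)

print(labels.tolist())
print(loss.item(), math.log(10))
```

Because only one position contributes and the logits are uniform, the loss should match log(10) up to floating-point error.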

.\lucidrains\x-clip\x_clip\visual_ssl.py

# Imports
import copy
import random
from functools import wraps

import torch
from torch import nn
import torch.nn.functional as F

from torchvision import transforms as T

from einops import rearrange

# augmentation utilities

# Apply `fn` with probability p, otherwise return the input unchanged
class RandomApply(nn.Module):
    def __init__(self, fn, p):
        super().__init__()
        self.fn = fn
        self.p = p

    def forward(self, x):
        if random.random() > self.p:
            return x
        return self.fn(x)

# Build the default augmentation pipeline
def get_default_aug(image_size, channels = 3):
    is_rgb = channels == 3
    is_greyscale = channels == 1
    rgb_or_greyscale = is_rgb or is_greyscale

    return torch.nn.Sequential(
        RandomApply(
            T.ColorJitter(0.8, 0.8, 0.8, 0.2),
            p = 0.3
        ) if rgb_or_greyscale else nn.Identity(),
        T.RandomGrayscale(p = 0.2) if is_rgb else nn.Identity(),
        T.RandomHorizontalFlip(),
        RandomApply(
            T.GaussianBlur((3, 3), (1.0, 2.0)),
            p = 0.2
        ),
        T.RandomResizedCrop((image_size, image_size)),
        T.Normalize(
            mean=torch.tensor([0.485, 0.456, 0.406]),
            std=torch.tensor([0.229, 0.224, 0.225])
        ) if is_rgb else nn.Identity(),
    )
# helper functions

def default(val, def_val):
    return def_val if val is None else val

def flatten(t):
    return t.reshape(t.shape[0], -1)

# Decorator that caches the result of a method on the instance attribute `cache_key`
def singleton(cache_key):
    def inner_fn(fn):
        @wraps(fn)
        def wrapper(self, *args, **kwargs):
            instance = getattr(self, cache_key)
            if instance is not None:
                return instance

            instance = fn(self, *args, **kwargs)
            setattr(self, cache_key, instance)
            return instance
        return wrapper
    return inner_fn

def get_module_device(module):
    return next(module.parameters()).device

def set_requires_grad(model, val):
    for p in model.parameters():
        p.requires_grad = val

# Normalize to unit L2 norm along the last dimension
def l2norm(t):
    return F.normalize(t, p = 2, dim = -1)
# SimCLR loss functions

def contrastive_loss(queries, keys, temperature = 0.1):
    b, device = queries.shape[0], queries.device
    logits = queries @ keys.t()
    # subtract the row-wise maximum for numerical stability
    logits = logits - logits.max(dim=-1, keepdim=True).values
    logits /= temperature
    return F.cross_entropy(logits, torch.arange(b, device=device))

def nt_xent_loss(queries, keys, temperature = 0.1):
    b, device = queries.shape[0], queries.device

    n = b * 2
    projs = torch.cat((queries, keys))
    logits = projs @ projs.t()

    # drop the diagonal (self-similarity), leaving n - 1 candidates per row
    mask = torch.eye(n, device=device).bool()
    logits = logits[~mask].reshape(n, n - 1)
    logits /= temperature

    # column of each row's positive pair once the diagonal is removed
    labels = torch.cat(((torch.arange(b, device = device) + b - 1), torch.arange(b, device=device)), dim=0)
    loss = F.cross_entropy(logits, labels, reduction = 'sum')
    loss /= n
    return loss
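The label expression in nt_xent_loss is easy to misread, since the positive pair's column shifts once the diagonal is removed. As a rough check, the sketch below derives the labels row by row in plain Python and compares them with the closed-form expression; `positive_columns` is a hypothetical helper written for this comparison.

```python
def positive_columns(b):
    """Label for each of the 2b rows once the diagonal has been removed."""
    n = 2 * b
    labels = []
    for row in range(n):
        # column of the positive pair in the full n x n similarity matrix
        partner = row + b if row < b else row - b
        # removing the diagonal shifts every column to its right one step left
        labels.append(partner - 1 if partner > row else partner)
    return labels


b = 3
# the same expression nt_xent_loss uses, written with plain lists
formula = [i + b - 1 for i in range(b)] + list(range(b))

print(positive_columns(b))  # [2, 3, 4, 0, 1, 2]
print(formula)              # [2, 3, 4, 0, 1, 2]
```

For the first b rows the positive sits to the right of the diagonal and moves left by one; for the last b rows it sits to the left and keeps its index.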
# loss function: 2 - 2 * cosine similarity of x and y

def loss_fn(x, y):
    x = l2norm(x)
    y = l2norm(y)
    return 2 - 2 * (x * y).sum(dim=-1)
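For unit vectors, the quantity 2 - 2 * (x . y) computed by loss_fn equals the squared Euclidean distance between them, since ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x . y. The following pure-Python sketch checks that identity on small hand-picked vectors; it uses lists instead of tensors, and the input values are arbitrary.

```python
import math


def l2norm(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def loss_fn(x, y):
    # 2 - 2 * cosine similarity, as in the listing above
    x, y = l2norm(x), l2norm(y)
    return 2 - 2 * sum(a * b for a, b in zip(x, y))


def squared_distance(x, y):
    # squared Euclidean distance between the normalized vectors
    x, y = l2norm(x), l2norm(y)
    return sum((a - b) ** 2 for a, b in zip(x, y))


x, y = [3.0, 4.0], [4.0, 3.0]
print(loss_fn(x, y), squared_distance(x, y))  # both approximately 0.08
print(loss_fn(x, x))                          # approximately 0 for identical directions
print(loss_fn(x, [-3.0, -4.0]))               # approximately 4 for opposite directions
```

The loss therefore ranges from 0 (same direction) to 4 (opposite direction).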
# MLP builders for the projector and predictor

def MLP(dim, projection_size, hidden_size = None):
    hidden_size = default(hidden_size, dim)

    return nn.Sequential(
        nn.Linear(dim, hidden_size),
        nn.BatchNorm1d(hidden_size),
        nn.ReLU(inplace = True),
        nn.Linear(hidden_size, projection_size)
    )

def SimSiamMLP(dim, projection_size, hidden_size = 4096):
    hidden_size = default(hidden_size, projection_size * 2)

    return nn.Sequential(
        nn.Linear(dim, hidden_size, bias = False),
        nn.BatchNorm1d(hidden_size),
        nn.ReLU(inplace = True),
        nn.Linear(hidden_size, hidden_size, bias = False),
        nn.BatchNorm1d(hidden_size),
        nn.ReLU(inplace = True),
        nn.Linear(hidden_size, projection_size, bias = False),
        nn.BatchNorm1d(projection_size, affine = False)
    )
# Wrapper class for the base neural network
# It captures the hidden layer output and feeds it to the projector network
class NetWrapper(nn.Module):
    # store the network, projection size, projector hidden size and layer index
    def __init__(self, net, projection_size, projection_hidden_size = 4096, layer = -2):
        super().__init__()
        self.net = net
        self.layer = layer

        self.projector = None
        self.projection_size = projection_size
        self.projection_hidden_size = projection_hidden_size

        self.hidden = {}
        self.hook_registered = False

    # look up the requested layer, by name (str) or by child index (int)
    def _find_layer(self):
        if type(self.layer) == str:
            modules = dict([*self.net.named_modules()])
            return modules.get(self.layer, None)
        elif type(self.layer) == int:
            children = [*self.net.children()]
            return children[self.layer]
        return None

    # forward hook that stores the flattened hidden output, keyed by device
    def _hook(self, _, input, output):
        device = input[0].device
        self.hidden[device] = flatten(output)

    # register the forward hook on the requested layer
    def _register_hook(self):
        layer = self._find_layer()
        assert layer is not None, f'hidden layer ({self.layer}) not found'
        handle = layer.register_forward_hook(self._hook)
        self.hook_registered = True

    # lazily build the projector on first use and cache it on self.projector
    @singleton('projector')
    def _get_projector(self, hidden):
        _, dim = hidden.shape
        projector = SimSiamMLP(dim, self.projection_size, self.projection_hidden_size)
        return projector.to(hidden)

    # get the representation, either the network output or the hooked hidden layer
    def get_representation(self, x):
        if self.layer == -1:
            return self.net(x)

        if not self.hook_registered:
            self._register_hook()

        self.hidden.clear()
        _ = self.net(x)
        hidden = self.hidden[x.device]
        self.hidden.clear()

        assert hidden is not None, f'hidden layer {self.layer} never emitted an output'
        return hidden

    # forward pass: returns (projection, representation), or only the representation
    def forward(self, x, return_projection = True):
        representation = self.get_representation(x)

        if not return_projection:
            return representation

        flattened_representation = rearrange(representation, '... d -> (...) d')
        projector = self._get_projector(flattened_representation)
        projection = projector(flattened_representation)
        return projection, representation
# main class
class SimSiam(nn.Module):
    def __init__(
        self,
        net,
        image_size,
        channels = 3,
        hidden_layer = -2,
        projection_size = 256,
        projection_hidden_size = 4096,
        augment_fn = None,
        augment_fn2 = None
    ):
        super().__init__()
        self.net = net

        # default SimCLR augmentation
        self.augment1 = default(augment_fn, get_default_aug(image_size, channels))
        self.augment2 = default(augment_fn2, self.augment1)

        self.online_encoder = NetWrapper(net, projection_size, projection_hidden_size, layer=hidden_layer)
        self.online_predictor = MLP(projection_size, projection_size, projection_hidden_size)

        # get the device of the network and move the wrapper to the same device
        device = get_module_device(net)
        self.to(device)

        # send a mock image tensor to instantiate the singleton parameters
        self.forward(torch.randn(2, channels, image_size, image_size, device=device))

    def forward(self, x):
        assert not (self.training and x.shape[0] == 1), 'you must have greater than 1 sample when training, due to the batchnorm in the projection layer'

        image_one, image_two = self.augment1(x), self.augment2(x)

        online_proj_one, _ = self.online_encoder(image_one)
        online_proj_two, _ = self.online_encoder(image_two)

        online_pred_one = self.online_predictor(online_proj_one)
        online_pred_two = self.online_predictor(online_proj_two)

        # the target branch reuses the online encoder, but without gradients
        with torch.no_grad():
            target_encoder = self.online_encoder
            target_proj_one, _ = target_encoder(image_one)
            target_proj_two, _ = target_encoder(image_two)
            target_proj_one.detach_()
            target_proj_two.detach_()

        loss_one = loss_fn(online_pred_one, target_proj_two)
        loss_two = loss_fn(online_pred_two, target_proj_one)

        loss = loss_one + loss_two
        return loss.mean()
# Identity transform; SimCLR.forward falls back to it when augment_both is False
# (the listing as published references `noop` without defining it)
def noop(t):
    return t

# SimCLR class
class SimCLR(nn.Module):
    def __init__(
        self,
        net,
        image_size,
        channels = 3,
        hidden_layer = -2,
        project_hidden = True,
        project_dim = 128,
        augment_both = True,
        use_nt_xent_loss = False,
        augment_fn = None,
        temperature = 0.1
    ):
        super().__init__()
        self.net = NetWrapper(net, project_dim, layer = hidden_layer)
        self.augment = default(augment_fn, get_default_aug(image_size, channels))
        self.augment_both = augment_both
        self.temperature = temperature

        # get the device of the network and move the wrapper to the same device
        device = get_module_device(net)
        self.to(device)

        # send a mock image tensor to instantiate the parameters
        # (two samples on the network's device, mirroring SimSiam above: the published listing sends a
        # single CPU sample, which BatchNorm1d in the projector can reject in training mode)
        self.forward(torch.randn(2, channels, image_size, image_size, device = device))

    def forward(self, x):
        b, c, h, w, device = *x.shape, x.device
        transform_fn = self.augment if self.augment_both else noop

        queries, _ = self.net(transform_fn(x))
        keys, _ = self.net(self.augment(x))

        queries, keys = map(flatten, (queries, keys))
        loss = nt_xent_loss(queries, keys, temperature = self.temperature)
        return loss

.\lucidrains\x-clip\x_clip\x_clip.py

# Import the math library
import math
# Import the copy library
import copy
# Import the context manager decorator
from contextlib import contextmanager
# Import partial and wraps
from functools import partial, wraps

# Import PyTorch
import torch
import torch.nn.functional as F
import torch.distributed as distributed
from torch import nn, einsum
from torch.utils.checkpoint import checkpoint

# Import einops
from einops import rearrange, repeat, reduce
from einops.layers.torch import Rearrange, Reduce

# Import the package's own modules
from x_clip.mlm import MLM
from x_clip.visual_ssl import SimSiam, SimCLR
from x_clip.distributed import all_gather

# helper functions

# Return the input unchanged
def identity(t, *args, **kwargs):
    return t

# Check whether a value is not None
def exists(val):
    return val is not None

# Return val if it exists, otherwise the default d
def default(val, d):
    return val if exists(val) else d

# Context manager that does nothing
@contextmanager
def null_context():
    yield

# Most negative finite value representable in the given dtype
def max_neg_value(dtype):
    return -torch.finfo(dtype).max

# Wrap the input in a tuple unless it is already a tuple or list
def cast_tuple(t):
    return t if isinstance(t, (tuple, list)) else (t,)

# Mean over `dim`, counting only positions where mask is True
def masked_mean(t, mask, dim = 1, eps = 1e-6):
    t = t.masked_fill(~mask, 0.)
    numer = t.sum(dim = dim)
    denom = mask.sum(dim = dim).clamp(min = eps)
    return numer / denom

# Right-pad tensor t along `dim` until that dimension has size `length`
def pad_dim_to(t, length, dim = 0):
    pad_length = length - t.shape[dim]
    zero_pairs = (-dim - 1) if dim < 0 else (t.ndim - dim - 1)
    return F.pad(t, (*((0, 0) * zero_pairs), 0, pad_length))

# Numerically safe logarithm
def log(t, eps = 1e-20):
    return torch.log(t + eps)

# Normalize to unit L2 norm along the last dimension
def l2norm(t):
    return F.normalize(t, dim = -1)

# Extract the diagonal elements of the last two dimensions
def matrix_diag(t):
    device = t.device
    i, j = t.shape[-2:]
    num_diag_el = min(i, j)
    i_range = torch.arange(i, device = device)
    j_range = torch.arange(j, device = device)
    diag_mask = rearrange(i_range, 'i -> i 1') == rearrange(j_range, 'j -> 1 j')
    diag_el = t.masked_select(diag_mask)
    return rearrange(diag_el, '(b d) -> b d', d = num_diag_el)
# checkpointing helper functions

# Wrap fn so that it runs under gradient checkpointing when any tensor input requires grad
def make_checkpointable(fn):
    @wraps(fn)
    def inner(*args):
        input_needs_grad = any([isinstance(el, torch.Tensor) and el.requires_grad for el in args])

        if not input_needs_grad:
            return fn(*args)

        return checkpoint(fn, *args)

    return inner

# keyword argument helpers

# Pop the given keys from the dict and return them as a new dict
def pick_and_pop(keys, d):
    values = list(map(lambda key: d.pop(key), keys))
    return dict(zip(keys, values))

# Split a dict into two dicts: keys that satisfy cond, and keys that do not
def group_dict_by_key(cond, d):
    return_val = [dict(),dict()]
    for key in d.keys():
        match = bool(cond(key))
        ind = int(not match)
        return_val[ind][key] = d[key]
    return (*return_val,)

# Check whether a string starts with the given prefix
def string_begins_with(prefix, str):
    return str.startswith(prefix)

# Split a dict by key prefix
def group_by_key_prefix(prefix, d):
    return group_dict_by_key(partial(string_begins_with, prefix), d)

# Split a dict by key prefix and strip the prefix from the matching keys
def groupby_prefix_and_trim(prefix, d):
    kwargs_with_prefix, kwargs = group_dict_by_key(partial(string_begins_with, prefix), d)
    kwargs_without_prefix = dict(map(lambda x: (x[0][len(prefix):], x[1]), tuple(kwargs_with_prefix.items())))
    return kwargs_without_prefix, kwargs
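To see what the prefix helpers return, here is a compact pure-Python sketch with the same behavior as groupby_prefix_and_trim, rewritten with dict comprehensions so it runs on its own. The keyword names in the example dictionary are hypothetical and chosen only for illustration.

```python
def groupby_prefix_and_trim(prefix, d):
    """Split `d` into (entries whose key starts with prefix, prefix removed) and (the rest)."""
    with_prefix = {k[len(prefix):]: v for k, v in d.items() if k.startswith(prefix)}
    rest = {k: v for k, v in d.items() if not k.startswith(prefix)}
    return with_prefix, rest


# Hypothetical keyword arguments, for illustration only.
kwargs = dict(visual_heads=8, visual_patch_size=32, text_heads=4)

visual_kwargs, remaining = groupby_prefix_and_trim('visual_', kwargs)
print(visual_kwargs)  # {'heads': 8, 'patch_size': 32}
print(remaining)      # {'text_heads': 4}
```

This pattern lets a constructor accept flat, prefixed keyword arguments and forward the trimmed group to a sub-module.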
# helper classes

# Rearrange a sequence of patch embeddings back into an image-like layout
class RearrangeImage(nn.Module):
    def forward(self, x):
        return rearrange(x, 'b (h w) c -> b c h w', h = int(math.sqrt(x.shape[1])))

# Layer normalization with a learned gain and no bias
class LayerNorm(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.g = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # use a larger epsilon for reduced-precision dtypes
        eps = 1e-5 if x.dtype == torch.float32 else 1e-3
        var = torch.var(x, dim = -1, unbiased = False, keepdim = True)
        mean = torch.mean(x, dim = -1, keepdim = True)
        return (x - mean) * (var + eps).rsqrt() * self.g

# Apply layer normalization before the wrapped function
class PreNorm(nn.Module):
    def __init__(self, dim, fn):
        super().__init__()
        self.norm = LayerNorm(dim)
        self.fn = fn

    def forward(self, x, *args, **kwargs):
        return self.fn(self.norm(x), *args, **kwargs)

# Patch dropout
class PatchDropout(nn.Module):
    def __init__(self, prob):
        super().__init__()
        assert 0 <= prob < 1.
        self.prob = prob

    # forward pass: randomly drop a fraction of the patches during training
    def forward(self, x, force_keep_all = False):
        # return the input unchanged outside training, when prob is 0, or when forced to keep all patches
        if not self.training or self.prob == 0. or force_keep_all:
            return x

        # shape and device of the input
        b, n, _, device = *x.shape, x.device

        # indices 0 .. b - 1, one per sample
        batch_indices = torch.arange(b, device = device)
        # add a trailing dimension so it broadcasts against the patch indices
        batch_indices = rearrange(batch_indices, '... -> ... 1')
        # number of patches to keep, at least one
        num_patches_keep = max(1, int(n * (1 - self.prob)))
        # draw standard normal noise per patch and keep the patches with the top-k values
        patch_indices_keep = torch.randn(b, n, device = device).topk(num_patches_keep, dim = -1).indices

        # return the kept patches
        return x[batch_indices, patch_indices_keep]
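As a quick arithmetic check of PatchDropout, the sketch below computes how many patches survive for the image and patch sizes used in the README example (256 and 32). `num_patches_kept` is a hypothetical helper that only mirrors the counting logic, not the random selection.

```python
def num_patches_kept(image_size, patch_size, prob):
    """Return (total patches, patches kept) for a square image and a dropout probability."""
    n = (image_size // patch_size) ** 2
    # same expression as PatchDropout.forward: truncate, but always keep at least one patch
    return n, max(1, int(n * (1 - prob)))


print(num_patches_kept(256, 32, 0.5))   # (64, 32)
print(num_patches_kept(256, 32, 0.75))  # (64, 16)
print(num_patches_kept(256, 32, 0.99))  # (64, 1)
```

With a dropout probability of 0.5 the transformer processes half as many patch tokens per image during training, which is where the compute saving comes from.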
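# Rotary positional embedding
class RotaryEmbedding(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # inverse frequencies
        inv_freq = 1. / (10000 ** (torch.arange(0, dim, 2).float() / dim))
        # register them as a buffer
        self.register_buffer('inv_freq', inv_freq)

    def forward(self, seq_len, device):
        # inverse frequencies
        inv_freq = self.inv_freq
        # positions 0 .. seq_len - 1
        t = torch.arange(seq_len, device=device).type_as(inv_freq)
        # outer product of positions and inverse frequencies gives the rotation angles
        freqs = torch.einsum('i , j -> i j', t, inv_freq)
        # duplicate the angles so they cover both halves of the rotated dimensions
        return torch.cat((freqs, freqs), dim=-1)

# Rotate the two halves of the last dimension: (x1, x2) -> (-x2, x1)
def rotate_half(x):
    # split the last dimension into two halves
    x = rearrange(x, '... (j d) -> ... j d', j=2)
    # unpack the halves
    x1, x2 = x.unbind(dim=-2)
    # concatenate the rotated halves
    return torch.cat((-x2, x1), dim=-1)

# Apply the rotary positional embedding
def apply_rotary_pos_emb(freqs, t):
    # number of dimensions to rotate
    rot_dim = freqs.shape[-1]
    # split into the part that is rotated and the part that is passed through
    t, t_pass = t[..., :rot_dim], t[..., rot_dim:]
    # apply the rotation
    t = (t * freqs.cos()) + (rotate_half(t) * freqs.sin())
    # concatenate the result
    return torch.cat((t, t_pass), dim=-1)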
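The expression t * cos + rotate_half(t) * sin is a plain 2D rotation for each pair of dimensions, which is why the query-key dot product ends up depending only on the distance between positions. The sketch below checks this for a single 2-dimensional pair in pure Python; the vectors and the `freq` value are arbitrary assumptions.

```python
import math


def rotate_half(x):
    # for a 2-dim vector each half is a single number: (x1, x2) -> (-x2, x1)
    x1, x2 = x
    return (-x2, x1)


def apply_rotary(theta, x):
    # x * cos(theta) + rotate_half(x) * sin(theta), i.e. a rotation by theta
    rx = rotate_half(x)
    return tuple(a * math.cos(theta) + b * math.sin(theta) for a, b in zip(x, rx))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


q, k = (1.0, 2.0), (0.5, -1.5)
freq = 0.3  # hypothetical inverse frequency

# Same relative offset (3), different absolute positions.
score_a = dot(apply_rotary(5 * freq, q), apply_rotary(2 * freq, k))
score_b = dot(apply_rotary(13 * freq, q), apply_rotary(10 * freq, k))

print(score_a, score_b)
```

The two scores agree up to floating-point error, and the rotation leaves each vector's norm unchanged.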
# GEGLU module
class GEGLU(nn.Module):
    def forward(self, x):
        # split into the value half and the gate half
        x, gate = x.chunk(2, dim=-1)
        # gate the values with GELU
        return x * F.gelu(gate)

# Feedforward module
class FeedForward(nn.Module):
    def __init__(self, dim, mult=4, dropout=0.):
        super().__init__()
        inner_dim = int(dim * mult)

        self.net = nn.Sequential(
            nn.Linear(dim, inner_dim * 2, bias=False),
            GEGLU(),
            LayerNorm(inner_dim),
            nn.Dropout(dropout),
            nn.Linear(inner_dim, dim, bias=False)
        )

    def forward(self, x):
        return self.net(x)

# Attention module
class Attention(nn.Module):
    def __init__(self, dim, dim_head=64, heads=8, causal=False, dropout=0.):
        super().__init__()
        self.heads = heads
        self.causal = causal
        self.scale = dim_head ** -0.5
        inner_dim = dim_head * heads

        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias=False)
        self.to_out = nn.Sequential(nn.Linear(inner_dim, dim, bias=False), LayerNorm(dim))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None, rotary_pos_emb=None):
        h, device, scale = self.heads, x.device, self.scale

        # project to queries, keys and values, then split out the heads
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h=h), (q, k, v))

        q = q * self.scale

        # apply rotary positional embeddings if given
        if exists(rotary_pos_emb):
            apply_rotary = partial(apply_rotary_pos_emb, rotary_pos_emb)
            q, k, v = map(apply_rotary, (q, k, v))

        # similarity scores
        sim = einsum('b h i d, b h j d -> b h i j', q, k)

        mask_value = -torch.finfo(sim.dtype).max

        # key padding mask
        if exists(mask):
            mask = rearrange(mask, 'b j -> b 1 1 j')
            sim = sim.masked_fill(~mask, mask_value)

        # causal mask: block attention to future positions
        if self.causal:
            i, j = sim.shape[-2:]
            causal_mask = torch.ones((i, j), dtype=torch.bool, device=device).triu(j - i + 1)
            sim = sim.masked_fill(causal_mask, mask_value)

        # softmax in float32 for stability, then cast back
        attn = sim.softmax(dim=-1, dtype=torch.float32)
        attn = attn.type(sim.dtype)

        attn = self.dropout(attn)

        # aggregate values and merge the heads
        out = einsum('b h i j, b h j d -> b h i d', attn, v)
        out = rearrange(out, 'b h n d -> b n (h d)')
        return self.to_out(out)
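The causal mask in Attention.forward is built with triu(j - i + 1), where the offset handles the case in which there are more keys than queries. As an illustration, the sketch below rebuilds the same boolean pattern with nested lists; `causal_mask` is a hypothetical helper, and True marks positions that get masked out.

```python
def causal_mask(i, j):
    """True marks key positions a query may NOT attend to, mirroring ones(i, j).triu(j - i + 1)."""
    offset = j - i + 1
    # triu(diagonal=offset) keeps entries where col - row >= offset
    return [[col - row >= offset for col in range(j)] for row in range(i)]


# Square case: each query sees itself and everything before it.
for row in causal_mask(3, 3):
    print(row)
# [False, True, True]
# [False, False, True]
# [False, False, False]

# Two queries aligned to the END of four keys.
for row in causal_mask(2, 4):
    print(row)
# [False, False, False, True]
# [False, False, False, False]
```

In the rectangular case the queries are treated as the last i positions of the key sequence, so the final query can see every key.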
# Transformer module
class Transformer(nn.Module):
    def __init__(
        self,
        dim,
        *,
        depth,
        dim_head=64,
        heads=8,
        causal=False,
        attn_dropout=0.,
        ff_dropout=0.,
        ff_mult=4,
        checkpoint_during_training=False
    ):
        super().__init__()
        self.checkpoint_during_training = checkpoint_during_training

        self.layers = nn.ModuleList([])
        for _ in range(depth):
            # note: ff_dropout is accepted above but is not forwarded to FeedForward in this listing
            self.layers.append(nn.ModuleList([
                PreNorm(dim, Attention(dim=dim, dim_head=dim_head, heads=heads, causal=causal, dropout=attn_dropout)),
                PreNorm(dim, FeedForward(dim=dim, mult=ff_mult)),
            ]))

        self.norm_in = LayerNorm(dim)
        self.norm_out = LayerNorm(dim)

    # forward pass: takes the input x, optional rotary positional embeddings and an optional mask
    def forward(
        self,
        x,
        rotary_pos_emb = None,
        mask = None
    ):
        # checkpoint only while training and when enabled; otherwise wrap the layers with identity
        can_checkpoint = self.training and self.checkpoint_during_training
        checkpoint_fn = make_checkpointable if can_checkpoint else identity

        # normalize the input
        x = self.norm_in(x)

        # iterate over the attention and feedforward layers
        for attn, ff in self.layers:
            # wrap both layers with the checkpoint function
            attn, ff = map(checkpoint_fn, (attn, ff))

            # attention with a residual connection
            x = attn(x, mask, rotary_pos_emb) + x
            # feedforward with a residual connection
            x = ff(x) + x

        # normalize the output
        return self.norm_out(x)
# Text transformer, a subclass of nn.Module
class TextTransformer(nn.Module):
    def __init__(
        self,
        dim,
        *,
        num_tokens,
        max_seq_len,
        dim_head,
        rotary_pos_emb = None,
        causal = False,
        **kwargs
    ):
        super().__init__()
        # token embedding: maps token ids to vectors of size dim
        self.token_emb = nn.Embedding(num_tokens, dim)

        # absolute positional embedding, used only when rotary embeddings are disabled
        self.abs_pos_emb = nn.Embedding(max_seq_len, dim) if not rotary_pos_emb else None
        # rotary positional embedding, used only when enabled
        self.rotary_pos_emb = RotaryEmbedding(min(dim_head, 32)) if rotary_pos_emb else None

        # learned CLS token, created only for the non-causal case
        self.cls_token = nn.Parameter(torch.randn(dim)) if not causal else None

        # the underlying Transformer
        self.transformer = Transformer(dim, dim_head = dim_head, causal = causal, **kwargs)

    # forward pass: takes token ids x and an optional mask
    def forward(self, x, mask = None):
        # shape and device of the input
        b, n, device = *x.shape, x.device

        # embed the tokens
        x = self.token_emb(x)

        # add absolute positional embeddings if present
        if exists(self.abs_pos_emb):
            pos_emb = self.abs_pos_emb(torch.arange(n, device = device))
            x = x + rearrange(pos_emb, 'n d -> 1 n d')

        rotary_pos_emb = None
        # compute rotary embeddings if present; n + 1 positions, one extra for the CLS token
        if exists(self.rotary_pos_emb):
            rotary_pos_emb = self.rotary_pos_emb(n + 1, device = device)

        # prepend the CLS token if present
        if exists(self.cls_token):
            cls_tokens = repeat(self.cls_token, 'd -> b 1 d', b = b)
            x = torch.cat((cls_tokens, x), dim = 1)

            # extend the mask with a leading True for the CLS position
            if exists(mask):
                mask = F.pad(mask, (1, 0), value = True)

        # run the Transformer
        out = self.transformer(x, mask = mask, rotary_pos_emb = rotary_pos_emb)
        return out
# Vision transformer: patch embedding, positional embedding, patch dropout
class VisionTransformer(nn.Module):
    def __init__(
        self,
        dim,
        *,
        image_size,
        patch_size,
        channels,
        patch_dropout = 0.5,
        **kwargs
    ):
        super().__init__()
        # The image must split evenly into patches
        assert image_size % patch_size == 0, 'Image dimensions must be divisible by the patch size.'
        num_patches = (image_size // patch_size) ** 2
        patch_dim = channels * patch_size ** 2
        # Split the image into flattened patches and project each one to dim
        self.to_tokens = nn.Sequential(
            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_size, p2 = patch_size),
            nn.Linear(patch_dim, dim)
        )
        # Learned positional embedding, one vector per patch
        self.pos_emb = nn.Embedding(num_patches, dim)
        # Randomly drops patch tokens
        self.patch_dropout = PatchDropout(patch_dropout)
        # Transformer stack that encodes the patch tokens
        self.transformer = Transformer(dim, **kwargs)
        # Builds a CLS-like token by mean-pooling the outputs and projecting the result
        self.to_cls_tokens = nn.Sequential(
            Reduce('b n d -> b d', 'mean'),
            nn.Linear(dim, dim, bias = False),
            Rearrange('b d -> b 1 d')
        )

    def forward(
        self,
        x,
        keep_all_patches = False
    ):
        device = x.device
        # Convert the image into a sequence of patch tokens
        x = self.to_tokens(x)
        b, n, _ = x.shape
        # Add positional embeddings
        pos_emb = self.pos_emb(torch.arange(n, device = device))
        x = x + rearrange(pos_emb, 'n d -> 1 n d')
        # Apply patch dropout unless the caller asks to keep every patch
        x = self.patch_dropout(x, force_keep_all = keep_all_patches)
        # Run the transformer stack
        out = self.transformer(x)
        # Prepend the pooled CLS-like token to the patch outputs
        cls_tokens = self.to_cls_tokens(out)
        return torch.cat((cls_tokens, out), dim = 1)
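
To make the patch arithmetic in `VisionTransformer.__init__` concrete, here is a minimal sketch in plain Python. The helper name `patch_grid` is hypothetical and not part of x-clip; it only mirrors the two formulas for `num_patches` and `patch_dim` from the listing above.

```python
def patch_grid(image_size, patch_size, channels):
    """Return (num_patches, patch_dim) for a square image cut into square patches."""
    # Same divisibility requirement as the assert in VisionTransformer
    if image_size % patch_size != 0:
        raise ValueError("image size must be divisible by the patch size")
    num_patches = (image_size // patch_size) ** 2
    patch_dim = channels * patch_size ** 2
    return num_patches, patch_dim


# Default CLIP settings from the listing: 256 px images, 32 px patches, 3 channels
print(patch_grid(256, 32, 3))  # (64, 3072)
```

So each 256 x 256 RGB image becomes 64 tokens, and the linear layer in `to_tokens` maps 3072 raw values per patch down to `dim`.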
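# Runs an encoder forward pass, optionally without gradients
def model_forward_with_context(
    *,
    fn,
    args,
    freeze,
):
    # Use torch.no_grad when the encoder is frozen, otherwise a no-op context
    encoding_context = null_context if not freeze else torch.no_grad
    # Run the forward pass inside the chosen context
    with encoding_context():
        enc = fn(*args)
        # When frozen, also detach the output from the autograd graph
        if freeze:
            enc.detach_()
    return enc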
# Main CLIP class
class CLIP(nn.Module):
    def __init__(
        self,
        *,
        image_encoder = None,                    # optional external image encoder
        text_encoder = None,                     # optional external text encoder
        dim_text = 512,                          # text embedding dimension
        dim_image = 512,                         # image embedding dimension
        dim_latent = 512,                        # shared latent dimension
        num_text_tokens = 10000,                 # text vocabulary size
        text_enc_depth = 6,                      # depth of the text encoder
        text_seq_len = 256,                      # maximum text sequence length
        text_heads = 8,                          # attention heads in the text encoder
        text_dim_head = 64,                      # dimension per text attention head
        text_has_cls_token = True,               # whether the text encoder output has a CLS token
        text_pad_id = 0,                         # padding token id
        text_rotary_pos_emb = False,             # whether to use rotary positional embeddings
        text_causal_mask = False,                # whether to use a causal mask
        text_eos_id = None,                      # end-of-sequence token id
        text_encode_without_mask = False,        # whether to encode text without a mask
        visual_enc_depth = 6,                    # depth of the image encoder
        visual_heads = 8,                        # attention heads in the image encoder
        visual_dim_head = 64,                    # dimension per image attention head
        visual_image_size = 256,                 # image size
        visual_patch_size = 32,                  # patch size
        visual_patch_dropout = 0.5,              # patch dropout probability
        visual_has_cls_token = True,             # whether the image encoder output has a CLS token
        channels = 3,                            # number of image channels
        use_all_token_embeds = False,            # whether to use all token embeddings
        downsample_image_embeds = False,         # whether to downsample the image embeddings
        decoupled_contrastive_learning = False,  # whether to use decoupled contrastive learning
        extra_latent_projection = False,         # whether to use an extra latent projection
        use_mlm = False,                         # whether to use masked language modeling
        text_ssl_loss_weight = 0.05,             # weight of the text SSL loss
        use_visual_ssl = False,                  # whether to use visual self-supervised learning
        visual_ssl = None,                       # optional visual SSL module
        visual_ssl_type = 'simsiam',             # type of visual SSL
        visual_ssl_hidden_layer = -1,            # hidden layer used for visual SSL
        simclr_temperature = 0.1,                # SimCLR temperature
        image_ssl_loss_weight = 0.05,            # weight of the image SSL loss
        multiview_loss_weight = 0.1,             # weight of the multiview loss
        checkpoint_during_training = False,      # whether to use gradient checkpointing during training
        sim_reg_loss_weight = 0.,                # weight of the similarity regularization loss
        **kwargs
    ):
        ...  # constructor body is not included in this excerpt

    def forward(
        self,
        text,
        image,
        return_loss = False,            # whether to return the loss
        return_encodings = False,       # whether to return the encodings
        return_latents = False,         # whether to return the latents
        freeze_image_encoder = False,   # if True, the image encoder is not trained, as proposed in the LiT paper
        freeze_text_encoder = False,    # if True, the text encoder is not trained
        text_to_image = True,           # with the extra projection on, returns a different similarity depending on the modality direction
        aug_text = None,                # augmented text (for multiview)
        aug_image = None                # augmented images (for multiview)
    ):
        ...  # forward body is not included in this excerpt

.\\lucidrains\\x-clip\\x_clip\\__init__.py

# Import the CLIP and TextTransformer classes from the x_clip.x_clip module
from x_clip.x_clip import CLIP, TextTransformer

Data source

The enwik8 data was downloaded from the Hutter prize page: http://prize.hutter1.net/

.\\lucidrains\\x-transformers\\examples\\enwik8_simple\\train.py

# Imports
from x_transformers import TransformerWrapper, Decoder
from x_transformers.autoregressive_wrapper import AutoregressiveWrapper
import random
import tqdm
import gzip
import numpy as np
import torch
import torch.optim as optim
from torch.nn import functional as F
from torch.utils.data import DataLoader, Dataset

# Constants
NUM_BATCHES = int(1e5)
BATCH_SIZE = 4
GRADIENT_ACCUMULATE_EVERY = 4
LEARNING_RATE = 1e-4
VALIDATE_EVERY = 100
GENERATE_EVERY = 500
GENERATE_LENGTH = 1024
SEQ_LEN = 1024

# Helper functions
def cycle(loader):
    # Yield batches from the loader forever
    while True:
        for data in loader:
            yield data

def decode_token(token):
    # Map a byte value to a character, clamping control characters to a space
    return str(chr(max(32, token)))

def decode_tokens(tokens):
    # Decode a sequence of byte values into a string
    return ''.join(list(map(decode_token, tokens)))

# Instantiate a GPT-like decoder model
model = TransformerWrapper(
    num_tokens = 256,
    max_seq_len = SEQ_LEN,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        rotary_pos_emb = True
    )
)
model = AutoregressiveWrapper(model)
model.cuda()

# Prepare the enwik8 data: first 90M bytes for training, next 5M for validation
with gzip.open('./data/enwik8.gz') as file:
    data = np.frombuffer(file.read(int(95e6)), dtype=np.uint8).copy()
    train_x, valid_x = np.split(data, [int(90e6)])
    data_train, data_val = torch.from_numpy(train_x), torch.from_numpy(valid_x)

# Dataset that returns a random window of seq_len + 1 bytes
class TextSamplerDataset(Dataset):
    def __init__(self, data, seq_len):
        super().__init__()
        self.data = data
        self.seq_len = seq_len

    def __getitem__(self, index):
        rand_start = torch.randint(0, self.data.size(0) - self.seq_len - 1, (1,))
        full_seq = self.data[rand_start: rand_start + self.seq_len + 1].long()
        return full_seq.cuda()

    def __len__(self):
        return self.data.size(0) // self.seq_len

train_dataset = TextSamplerDataset(data_train, SEQ_LEN)
val_dataset = TextSamplerDataset(data_val, SEQ_LEN)
train_loader = cycle(DataLoader(train_dataset, batch_size = BATCH_SIZE, drop_last = True))
val_loader = cycle(DataLoader(val_dataset, batch_size = BATCH_SIZE, drop_last = True))

# Optimizer
optim = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

# Training loop
for i in tqdm.tqdm(range(NUM_BATCHES), mininterval=10., desc='training'):
    model.train()

    # Accumulate gradients over several micro-batches before stepping
    for __ in range(GRADIENT_ACCUMULATE_EVERY):
        loss = model(next(train_loader))
        (loss / GRADIENT_ACCUMULATE_EVERY).backward()

    print(f'training loss: {loss.item()}')
    torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
    optim.step()
    optim.zero_grad()

    if i % VALIDATE_EVERY == 0:
        model.eval()
        with torch.no_grad():
            loss = model(next(val_loader))
            print(f'validation loss: {loss.item()}')

    if i % GENERATE_EVERY == 0:
        model.eval()
        inp = random.choice(val_dataset)[:-1]
        prime = decode_tokens(inp)
        # Print the prompt followed by a separator line
        print(f"{prime} \n\n {'*' * 100}")

        sample = model.generate(
            prompts = inp,
            seq_len = GENERATE_LENGTH,
            cache_kv = True
        )

        output_str = decode_tokens(sample)
        print(output_str)
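
The `decode_token` helper in the script above clamps every byte value below 32 to 32 before converting it to a character, so control characters such as newlines print as spaces. A minimal standalone sketch of that behavior, using plain Python integers instead of tensors (the sample bytes are made up for illustration):

```python
def decode_token(token):
    # Byte values below 32 are control characters; clamp them to 32 (a space)
    return chr(max(32, token))


def decode_tokens(tokens):
    return "".join(map(decode_token, tokens))


raw = list(b"Hi\n[[enwik8]]")   # the newline has byte value 10
print(decode_tokens(raw))       # Hi [[enwik8]]
```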

.\\lucidrains\\x-transformers\\examples\\enwik8_simple\\train_nar.py

# Import TransformerWrapper, Encoder and NonAutoregressiveWrapper from x_transformers
from x_transformers import (
    TransformerWrapper,
    Encoder,
    NonAutoregressiveWrapper
)

# Other imports
import random
import tqdm
import gzip
import numpy as np
import torch
import torch.optim as optim
from torch.nn import functional as F
from torch.utils.data import DataLoader, Dataset

# Constants
NUM_BATCHES = int(1e8)
BATCH_SIZE = 4
GRADIENT_ACCUMULATE_EVERY = 4
LEARNING_RATE = 2e-4
VALIDATE_EVERY = 100
GENERATE_EVERY = 250
SEQ_LEN = 256

# Yield batches from the loader forever
def cycle(loader):
    while True:
        for data in loader:
            yield data

# Decode a single token
def decode_token(token):
    return str(chr(max(32, token)))

# Decode a sequence of tokens
def decode_tokens(tokens):
    return ''.join(list(map(decode_token, tokens)))

# Build the encoder model
model = TransformerWrapper(
    num_tokens = 256 + 1,
    logits_dim = 256,
    max_seq_len = SEQ_LEN,
    attn_layers = Encoder(
        dim = 512,
        depth = 8,
        heads = 8,
        dynamic_pos_bias = True
    )
)

# Wrap it for non-autoregressive training and generation
model = NonAutoregressiveWrapper(
    model,
    steps = 18,
    schedule = 'cosine',
    mask_id = 256, # mask id is last token, which is why num_tokens above has a +1 (special token)
    self_token_critic = True
)

# Move the model to the GPU
model.cuda()

# Prepare the enwik8 data: first 90M bytes for training, next 5M for validation
with gzip.open('./data/enwik8.gz') as file:
    data = np.frombuffer(file.read(int(95e6)), dtype=np.uint8).copy()
    train_x, valid_x = np.split(data, [int(90e6)])
    data_train, data_val = torch.from_numpy(train_x), torch.from_numpy(valid_x)

# Dataset that returns a random window of seq_len bytes
class TextSamplerDataset(Dataset):
    def __init__(self, data, seq_len):
        super().__init__()
        self.data = data
        self.seq_len = seq_len

    def __getitem__(self, index):
        rand_start = torch.randint(0, self.data.size(0) - self.seq_len, (1,))
        full_seq = self.data[rand_start: rand_start + self.seq_len].long()
        return full_seq.cuda()

    def __len__(self):
        return self.data.size(0) // self.seq_len

# Training and validation datasets and loaders
train_dataset = TextSamplerDataset(data_train, SEQ_LEN)
val_dataset = TextSamplerDataset(data_val, SEQ_LEN)
train_loader = cycle(DataLoader(train_dataset, batch_size = BATCH_SIZE))
val_loader = cycle(DataLoader(val_dataset, batch_size = BATCH_SIZE))

# Optimizer
optim = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

# Training loop
for i in tqdm.tqdm(range(NUM_BATCHES), mininterval=10., desc='training'):
    model.train()

    # Accumulate gradients over several micro-batches before stepping
    for __ in range(GRADIENT_ACCUMULATE_EVERY):
        loss = model(next(train_loader)).loss
        (loss / GRADIENT_ACCUMULATE_EVERY).backward()

    print(f'training loss: {loss.item()}')
    torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
    optim.step()
    optim.zero_grad()

    if i % VALIDATE_EVERY == 0:
        model.eval()
        with torch.no_grad():
            val_data = next(val_loader)
            loss = model(val_data).loss
            print(f'validation loss: {loss.item()}')

    if i % GENERATE_EVERY == 0:
        model.eval()
        sample = model.generate()
        output_str = decode_tokens(sample)
        print(output_str)
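
Both enwik8 scripts divide each micro-batch loss by `GRADIENT_ACCUMULATE_EVERY` before calling `backward()`, which gives an effective batch of `BATCH_SIZE * GRADIENT_ACCUMULATE_EVERY = 16` sequences per optimizer step. The sketch below checks, on a made-up one-parameter regression problem in plain Python, that accumulating scaled micro-batch gradients matches the gradient of one large batch. The function `grad_mse` and the data are illustrative assumptions, not part of x-transformers.

```python
def grad_mse(w, batch):
    """Gradient of mean((w * x - y) ** 2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


w = 0.5
micro_batches = [
    [(1.0, 2.0), (2.0, 3.0)],
    [(3.0, 5.0), (4.0, 9.0)],
]
accumulate_every = len(micro_batches)

# Each micro-batch gradient is scaled by 1 / accumulate_every,
# mirroring (loss / GRADIENT_ACCUMULATE_EVERY).backward()
accumulated = 0.0
for batch in micro_batches:
    accumulated += grad_mse(w, batch) / accumulate_every

# Reference: a single large batch containing every sample
full_batch = [pair for batch in micro_batches for pair in batch]
reference = grad_mse(w, full_batch)

print(accumulated, reference)
```

The equality holds exactly only when all micro-batches have the same size, which is the case in the scripts above.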

.\\lucidrains\\x-transformers\\examples\\toy_tasks\\enc_dec_copy.py

# Imports
import tqdm
import torch
import torch.optim as optim
from x_transformers import XTransformer

# Constants
NUM_BATCHES = int(1e5)   # total number of training batches
BATCH_SIZE = 32          # samples per batch
LEARNING_RATE = 3e-4     # learning rate
GENERATE_EVERY = 100     # generate a sample every this many batches
NUM_TOKENS = 16 + 2      # vocabulary size
ENC_SEQ_LEN = 32         # encoder sequence length
DEC_SEQ_LEN = 64 + 1     # decoder sequence length

# Helper functions
def cycle():
    # Generator that yields random copy-task batches forever
    while True:
        prefix = torch.ones((BATCH_SIZE, 1)).long().cuda()
        src = torch.randint(2, NUM_TOKENS, (BATCH_SIZE, ENC_SEQ_LEN)).long().cuda()
        tgt = torch.cat((prefix, src, src), 1)
        src_mask = torch.ones(BATCH_SIZE, src.shape[1]).bool().cuda()
        yield (src, tgt, src_mask)

# Instantiate the model
model = XTransformer(
    dim = 512,
    tie_token_emb = True,
    return_tgt_loss = True,
    enc_num_tokens = NUM_TOKENS,
    enc_depth = 3,
    enc_heads = 8,
    enc_max_seq_len = ENC_SEQ_LEN,
    dec_num_tokens = NUM_TOKENS,
    dec_depth = 3,
    dec_heads = 8,
    dec_max_seq_len = DEC_SEQ_LEN
).cuda()

# Optimizer
optim = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

# Training loop
for i in tqdm.tqdm(range(NUM_BATCHES), mininterval=10., desc='training'):
    model.train()

    src, tgt, src_mask = next(cycle())

    # Compute the loss and backpropagate
    loss = model(src, tgt, mask=src_mask)
    loss.backward()
    print(f'{i}: {loss.item()}')

    optim.step()
    optim.zero_grad()

    # Periodically generate a sample and count the mistakes
    if i != 0 and i % GENERATE_EVERY == 0:
        model.eval()
        src, _, src_mask = next(cycle())
        src, src_mask = src[:1], src_mask[:1]
        start_tokens = (torch.ones((1, 1)) * 1).long().cuda()

        sample = model.generate(src, start_tokens, ENC_SEQ_LEN, mask = src_mask)
        # Number of positions where the generated tokens differ from the source
        incorrects = (src != sample).sum()

        print("input: ", src)
        print("predicted output: ", sample)
        print(f"incorrects: {incorrects}")
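
The data layout of this copy task is easy to lose in the tensor code, so here is a minimal sketch of one example in plain Python lists. The names `START_TOKEN`, `make_copy_example` and `count_incorrect` are hypothetical and only mirror what `cycle()` and the `incorrects` line do in the script above.

```python
import random

NUM_TOKENS = 16 + 2   # data symbols are drawn from 2..17; 0 and 1 are not used as data
ENC_SEQ_LEN = 32
START_TOKEN = 1       # hypothetical name for the prefix value used in the script


def make_copy_example(rng):
    # Source symbols come from [2, NUM_TOKENS), like torch.randint(2, NUM_TOKENS, ...)
    src = [rng.randrange(2, NUM_TOKENS) for _ in range(ENC_SEQ_LEN)]
    # The target is the start token followed by the source repeated twice
    tgt = [START_TOKEN] + src + src
    return src, tgt


def count_incorrect(src, predicted):
    # Number of positions where the prediction differs from the source
    return sum(s != p for s, p in zip(src, predicted))


rng = random.Random(0)
src, tgt = make_copy_example(rng)
print(len(src), len(tgt))                             # 32 65
print(count_incorrect(src, tgt[1:1 + ENC_SEQ_LEN]))   # 0 for a perfect copy
```

The target length of 65 is why `DEC_SEQ_LEN` is set to `64 + 1`, and generating only `ENC_SEQ_LEN` tokens lets the script compare the sample directly against `src`.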

x-transformers

A concise but fully-featured transformer, complete with a set of promising experimental features from various papers.

Install

$ pip install x-transformers

Usage

Full encoder / decoder

import torch
from x_transformers import XTransformer
model = XTransformer(
dim = 512,
enc_num_tokens = 256,
enc_depth = 6,
enc_heads = 8,
enc_max_seq_len = 1024,
dec_num_tokens = 256,
dec_depth = 6,
dec_heads = 8,
dec_max_seq_len = 1024,
tie_token_emb = True # tie embeddings of encoder and decoder
)
src = torch.randint(0, 256, (1, 1024))
src_mask = torch.ones_like(src).bool()
tgt = torch.randint(0, 256, (1, 1024))
loss = model(src, tgt, mask = src_mask) # scalar training loss
loss.backward()

Decoder-only (GPT-like)

import torch
from x_transformers import TransformerWrapper, Decoder
model = TransformerWrapper(
num_tokens = 20000,
max_seq_len = 1024,
attn_layers = Decoder(
dim = 512,
depth = 12,
heads = 8
)
).cuda()
x = torch.randint(0, 256, (1, 1024)).cuda()
model(x) # (1, 1024, 20000)

GPT3 would be approximately the following (but you wouldn’t be able to run it anyways)

gpt3 = TransformerWrapper(
num_tokens = 50000,
max_seq_len = 2048,
attn_layers = Decoder(
dim = 12288,
depth = 96,
heads = 96,
attn_dim_head = 128
)
).cuda()

Encoder-only (BERT-like)

import torch
from x_transformers import TransformerWrapper, Encoder
model = TransformerWrapper(
num_tokens = 20000,
max_seq_len = 1024,
attn_layers = Encoder(
dim = 512,
depth = 12,
heads = 8
)
).cuda()
x = torch.randint(0, 256, (1, 1024)).cuda()
mask = torch.ones_like(x).bool()
model(x, mask = mask) # (1, 1024, 20000)

State of the art image classification (SimpleViT)

import torch
from x_transformers import ViTransformerWrapper, Encoder
model = ViTransformerWrapper(
image_size = 256,
patch_size = 32,
num_classes = 1000,
attn_layers = Encoder(
dim = 512,
depth = 6,
heads = 8,
)
)
img = torch.randn(1, 3, 256, 256)
model(img) # (1, 1000)

Image -> caption

import torch
from x_transformers import ViTransformerWrapper, TransformerWrapper, Encoder, Decoder
encoder = ViTransformerWrapper(
image_size = 256,
patch_size = 32,
attn_layers = Encoder(
dim = 512,
depth = 6,
heads = 8
)
)
decoder = TransformerWrapper(
num_tokens = 20000,
max_seq_len = 1024,
attn_layers = Decoder(
dim = 512,
depth = 6,
heads = 8,
cross_attend = True
)
)
img = torch.randn(1, 3, 256, 256)
caption = torch.randint(0, 20000, (1, 1024))
encoded = encoder(img, return_embeddings = True)
decoder(caption, context = encoded) # (1, 1024, 20000)

PaLI, state of the art language-vision model

import torch
from x_transformers import ViTransformerWrapper, XTransformer, Encoder
# PaLI is composed of
# 1. vision transformer (ViTransformerWrapper) +
# 2. encoder-decoder transformer (XTransformer)
vit = ViTransformerWrapper(
image_size = 256,
patch_size = 32,
attn_layers = Encoder(
dim = 512,
depth = 6,
heads = 8
)
)
pali = XTransformer(
dim = 512,
enc_num_tokens = 256,
enc_depth = 6,
enc_heads = 8,
enc_max_seq_len = 1024,
dec_num_tokens = 256,
dec_depth = 6,
dec_heads = 8,
dec_max_seq_len = 1024
)
# training data
img = torch.randn(1, 3, 256, 256) # images
prompt = torch.randint(0, 256, (1, 1024)) # prompt
prompt_mask = torch.ones(1, 1024).bool() # prompt text mask
output_text = torch.randint(0, 256, (1, 1024)) # target output text
# train
img_embeds = vit(
img,
return_embeddings = True
)
loss = pali(
prompt,
output_text,
mask = prompt_mask,
src_prepend_embeds = img_embeds # will prepend image embeddings to encoder text embeddings before attention
)
loss.backward()
# do the above for many steps on a 17B parameter model
# attention is all you need

Dropouts

import torch
from x_transformers import TransformerWrapper, Decoder, Encoder
model = TransformerWrapper(
num_tokens = 20000,
max_seq_len = 1024,
emb_dropout = 0.1, # dropout after embedding
attn_layers = Decoder(
dim = 512,
depth = 6,
heads = 8,
layer_dropout = 0.1, # stochastic depth – dropout entire layer
attn_dropout = 0.1, # dropout post-attention
ff_dropout = 0.1 # feedforward dropout
)
)
x = torch.randint(0, 20000, (1, 1024))
model(x)

Features

Flash Attention

What originally started off as a short paper from Markus Rabe culminated as a practical fused attention CUDA kernel, named Flash Attention by Tri Dao.

The technique processes the attention matrix in tiles, only keeping track of the running softmax and exponentiated weighted sums. By recomputing on the backwards pass in a tiled fashion, one is able to keep the memory linear with respect to sequence length. This allows a lot of recent models to be able to reach for longer context lengths without worrying about the memory bottleneck.

Other engineering decisions made by Tri Dao led to its enormous success, namely minimizing HBM accesses so that both the forwards and backwards outperform naive attention. In other words, flash attention is not only more memory efficient, but faster as well, making it a necessity for training transformers.

MetaAI has recently added the ability to use Tri Dao’s CUDA kernel through the scaled_dot_product_attention function in Pytorch 2.0. (They also have a mem_efficient attention, which is identical to flash attention design, just that the tiles are traversed differently)

Llama was trained using Flash Attention. The only reason to avoid it is if you require operating on the attention matrix (dynamic positional bias, talking heads, residual attention).

You can use it in this repository by setting attn_flash to True and enjoy the immediate memory savings and increase in speed.

ex.

import torch
from x_transformers import TransformerWrapper, Decoder, Encoder
model = TransformerWrapper(
num_tokens = 20000,
max_seq_len = 1024,
attn_layers = Decoder(
dim = 512,
depth = 6,
heads = 8,
attn_flash = True # just set this to True if you have pytorch 2.0 installed
)
)
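
The tiling idea described above (keeping only a running softmax normalizer and a running weighted sum) can be sketched for a single query row in plain Python. This is an illustrative toy under simplifying assumptions, not Tri Dao's kernel and not the code path used by x-transformers; the function names and the scalar values are made up.

```python
import math


def naive_attention_row(scores, values):
    # Materializes all softmax weights at once
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def tiled_attention_row(scores, values, tile_size=2):
    # Only three running scalars are kept, regardless of sequence length
    running_max = float("-inf")
    running_sum = 0.0   # running softmax denominator
    running_out = 0.0   # running exponentiated weighted sum of values
    for start in range(0, len(scores), tile_size):
        tile_scores = scores[start:start + tile_size]
        tile_values = values[start:start + tile_size]
        new_max = max(running_max, max(tile_scores))
        # Rescale what was accumulated under the old maximum (0.0 on the first tile)
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        running_out *= scale
        for s, v in zip(tile_scores, tile_values):
            w = math.exp(s - new_max)
            running_sum += w
            running_out += w * v
        running_max = new_max
    return running_out / running_sum


scores = [0.3, 2.0, -1.2, 4.5, 0.0]
values = [1.0, -2.0, 0.5, 3.0, 7.0]
print(naive_attention_row(scores, values), tiled_attention_row(scores, values))
```

Both functions return the same value up to floating point error, but the tiled version never holds more than one tile of weights at a time.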

Augmenting Self-attention with Persistent Memory

https://arxiv.org/abs/1907.01470

Proposes adding learned memory key / values prior to attention. They were able to remove feedforwards altogether and attain similar performance to the original transformers. I have found that keeping the feedforwards and adding the memory key / values leads to even better performance.

from x_transformers import Decoder, Encoder
enc = Encoder(
dim = 512,
depth = 6,
heads = 8,
attn_num_mem_kv = 16 # 16 memory key / values
)

Memory Transformers

https://arxiv.org/abs/2006.11527

Proposes adding learned tokens, akin to CLS tokens, named memory tokens, that are passed through the attention layers alongside the input tokens. This setting is compatible with both encoder and decoder training.

import torch
from x_transformers import TransformerWrapper, Decoder, Encoder
model = TransformerWrapper(
num_tokens = 20000,
max_seq_len = 1024,
num_memory_tokens = 20, # 20 memory tokens
attn_layers = Encoder(
dim = 512,
depth = 6,
heads = 8
)
)

Update: MetaAI researchers have found that adding memory tokens (they call them register tokens) alleviates outliers (which are now suspected to be a pathology of attention networks being unable to attend to nothing).

Transformers Without Tears

https://arxiv.org/abs/1910.05895

They experiment with alternatives to layer normalization and find one that is both effective and simpler. Researchers have told me that it leads to faster convergence.

import torch
from x_transformers import TransformerWrapper, Decoder, Encoder
model = TransformerWrapper(
num_tokens = 20000,
max_seq_len = 1024,
attn_layers = Decoder(
dim = 512,
depth = 6,
heads = 8,
use_scalenorm = True # set to True to use for all layers
)
)
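
As a rough illustration of the normalization proposed in the paper, ScaleNorm replaces the per-feature gain and bias of layer normalization with a single learned scalar `g` and rescales each vector to have L2 norm `g`. The sketch below is plain Python on one vector; the `eps` guard and the function name are assumptions for illustration, and the module inside x-transformers may differ in details such as initialization.

```python
import math


def scale_norm(x, g, eps=1e-5):
    """ScaleNorm as described in the paper: g * x / ||x||."""
    norm = math.sqrt(sum(v * v for v in x))
    # eps only guards against division by zero for an all-zero vector
    return [g * v / max(norm, eps) for v in x]


out = scale_norm([3.0, 4.0], g=2.0)
print(out)  # [1.2, 1.6]
```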

You can also use the l2 normalized embeddings proposed as part of fixnorm. I have found that it leads to improved convergence when paired with small initialization (proposed by BlinkDL). The small initialization is taken care of as long as l2norm_embed is set to True.

import torch
from x_transformers import TransformerWrapper, Decoder, Encoder
model = TransformerWrapper(
num_tokens = 20000,
max_seq_len = 1024,
l2norm_embed = True, # set this to True for l2 normalized embedding + small init
attn_layers = Decoder(
dim = 512,
depth = 6,
heads = 8
)
)

Along the same lines of l2 normalized embeddings, Huggingface’s 175B parameter BLOOM also places a layernorm right after the embeddings and just before the tokens enter the attention layers. This was corroborated by Yandex’s 100B parameter YaLM to stabilize training.

It is recommended that you set either l2norm_embed or post_emb_norm to True, but not both, as they probably serve the same purpose.

import torch
from x_transformers import TransformerWrapper, Decoder, Encoder
model = TransformerWrapper(
num_tokens = 20000,
max_seq_len = 1024,
post_emb_norm = True, # set this to True to layernorm summed token + pos embeddings
attn_layers = Decoder(
dim = 512,
depth = 6,
heads = 8
)
)

Root Mean Square Layer Normalization

https://arxiv.org/abs/1910.07467

The authors propose to replace layer normalization with a simpler alternative, without mean centering and the learned bias. An investigative paper found this to be the best performing normalization variant. It was also used in Deepmind’s latest large language models, Retro and Gopher.

import torch
from x_transformers import TransformerWrapper, Decoder, Encoder
model = TransformerWrapper(
num_tokens = 20000,
max_seq_len = 1024,
attn_layers = Decoder(
dim = 512,
depth = 6,
heads = 8,
use_rmsnorm = True # set to true to use for all layers
)
)

July 2023: A linear attention paper reports experiments showing that removing the learned multiplicative gamma led to no performance degradation. This simplifies the RMS normalization to a satisfying l2norm(x) * sqrt(dim).

import torch
from x_transformers import TransformerWrapper, Decoder, Encoder
model = TransformerWrapper(
num_tokens = 20000,
max_seq_len = 1024,
attn_layers = Decoder(
dim = 512,
depth = 6,
heads = 8,
use_simple_rmsnorm = True # set to true to use for all layers
)
)
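
The identity behind the simplified form is that dividing by the root mean square is the same as l2-normalizing and multiplying by sqrt(dim), since rms(x) = ||x|| / sqrt(dim). A minimal numeric check in plain Python (function names and the `eps` guard are illustrative assumptions, not the library's code):

```python
import math


def rms_norm(x, gamma=None, eps=1e-8):
    # RMSNorm: divide by the root mean square, then apply an optional learned gain
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    gamma = gamma if gamma is not None else [1.0] * len(x)
    return [g * v / max(rms, eps) for g, v in zip(gamma, x)]


def simple_rms_norm(x, eps=1e-8):
    # Gain-free variant: l2norm(x) * sqrt(dim)
    l2 = math.sqrt(sum(v * v for v in x))
    return [v / max(l2, eps) * math.sqrt(len(x)) for v in x]


x = [0.5, -1.5, 2.0, 4.0]
print(rms_norm(x))
print(simple_rms_norm(x))
```

With the gain left at ones, the two functions agree up to floating point error.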

GLU Variants Improve Transformer

https://arxiv.org/abs/2002.05202

A Noam Shazeer paper that explores gating in the feedforward, finding that simple gating with GELU leads to significant improvements. This variant also showed up in the latest mT5 architecture. You should always turn this on (I may eventually turn it on by default).

import torch
from x_transformers import TransformerWrapper, Decoder, Encoder
model = TransformerWrapper(
num_tokens = 20000,
max_seq_len = 1024,
attn_layers = Decoder(
dim = 512,
depth = 6,
heads = 8,
ff_glu = True # set to true to use for all feedforwards
)
)
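
For intuition, the GELU-gated unit from the paper computes two linear projections of the input and multiplies one by the GELU of the other. The sketch below does this for a single vector in plain Python with biases omitted; the helper names and the tiny weight matrices are made up for illustration and are not taken from x-transformers.

```python
import math


def gelu(x):
    # Exact GELU using the Gaussian error function
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight):
    # weight is a list of rows, one row per output feature
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def geglu(x, w_value, w_gate):
    # Elementwise product of the value projection and the GELU of the gate projection
    values = linear(x, w_value)
    gates = linear(x, w_gate)
    return [v * gelu(g) for v, g in zip(values, gates)]


x = [1.0, 2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
print(geglu(x, identity, identity))
```

When the gate projection outputs zero the unit outputs zero, and for large positive gates `gelu(g)` approaches `g`, so the gate acts as a smooth, input-dependent multiplier on the value path.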

The PaLM language model also chose to use the Swish GLU variant. You can turn this on by setting two flags

import torch
from x_transformers import TransformerWrapper, Decoder, Encoder
model = TransformerWrapper(
num_tokens = 20000,
max_seq_len = 1024,
attn_layers = Decoder(
dim = 512,
depth = 6,
heads = 8,
ff_swish = True, # set this to True
ff_glu = True # set to true to use for all feedforwards
)
)
### No Bias in Feedforward

Starting with <a href="https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html">PaLM</a>, there began a trend to remove biases from the transformer altogether. <a href="https://github.com/borisdayma">Boris Dayma</a> has run a number of experiments that showed removing biases from feedforwards led to increased throughput without any loss of accuracy. This was corroborated by <a href="https://arxiv.org/abs/2212.14034">yet another paper</a> investigating transformer architecture variants.

You can turn off the feedforward bias as follows

```py
import torch
from x_transformers import TransformerWrapper, Decoder, Encoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        ff_no_bias = True  # set this to True
    )
)
```
### ReLU²

https://arxiv.org/abs/2109.08668

This paper used neural architecture search and found an activation, Relu Squared, that is both simpler and performs better than GELU, in the autoregressive language model setting. I have confirmed this in my independent experiments. However, if one were using the GLU variant from above, GELU still performs better. Pending further corroboration.

```py
import torch
from x_transformers import TransformerWrapper, Decoder, Encoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        ff_relu_squared = True
    )
)
```
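
For reference, the activation itself is just the square of a ReLU. A one-function sketch in plain Python (the name `relu_squared` is chosen here for illustration):

```python
def relu_squared(x):
    # Zero for negative inputs, x squared otherwise
    return max(0.0, x) ** 2


print([relu_squared(v) for v in (-2.0, 0.5, 3.0)])  # [0.0, 0.25, 9.0]
```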
### Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection

<img src="./images/topk-attention.png" width="500px"></img>

https://arxiv.org/abs/1912.11637

This paper proposes an efficient way to sparsify attention by zeroing all dot-product query/key values not within the top k values. They show that this cheap method is as effective as other more expensive operations like sparsemax or entmax15. This technique comes with the cost of an extra hyperparameter (the top k values to keep). The paper recommends a value of `k = 8`.

```py
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        attn_sparse_topk = 8  # keep only the top 8 values before attention (softmax)
    )
)
```
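
The selection step can be sketched for one row of attention scores in plain Python: everything below the k-th largest score is set to negative infinity before the softmax, so it receives exactly zero weight. The function name and the tie handling (ties at the threshold are all kept) are assumptions made for this illustration, not a description of the library's implementation.

```python
import math


def topk_softmax(scores, k):
    # Threshold is the k-th largest score; ties at the threshold are all kept
    threshold = sorted(scores, reverse=True)[k - 1]
    masked = [s if s >= threshold else float("-inf") for s in scores]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


weights = topk_softmax([2.0, 0.1, 1.5, -1.0, 0.7], k=2)
print(weights)
```

Only the two largest scores (2.0 and 1.5) end up with non-zero weight, and the weights still sum to one.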
### Talking-Heads Attention

<img src="./images/talking-heads.png" width="500px"></img>

https://arxiv.org/abs/2003.02436

A Noam Shazeer paper that proposes mixing information between heads pre and post attention (softmax). This comes with the cost of extra memory and compute.

```py
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        attn_talking_heads = True  # turn on information exchange between attention heads
    )
)
```
### One Write-Head Is All You Need

https://arxiv.org/abs/1911.02150

Yet another Noam Shazeer paper (he's a legend) that proposes to only have one head for the key / values, but multi-headed queries. This paper was largely ignored for a while, but recently validated at scale in <a href="https://arxiv.org/abs/2203.07814">AlphaCode</a> as well as <a href="https://arxiv.org/abs/2204.02311">PaLM</a>. It has the property of being memory efficient when decoding extremely large language models. You can use it with one keyword argument as shown below.

```py
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        attn_one_kv_head = True
    )
)
```
This has been further generalized in <a href="https://arxiv.org/abs/2305.13245">a recent paper</a> to allow for groups of query heads to attend to a single key / value head. You can use this by specifying `attn_kv_heads`.

```py
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 12,
        heads = 8,
        attn_kv_heads = 2  # with 8 query heads, each group of 4 query heads attends to 1 key / value head
    )
)
```
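
To see what the grouping means for `heads = 8` and `attn_kv_heads = 2`, the sketch below assigns each query head to a key / value head in plain Python. The contiguous assignment and the function name are assumptions for illustration; the library may order the groups differently internally.

```python
def kv_head_for_query_head(query_head, heads, kv_heads):
    # Each key / value head serves heads // kv_heads query heads
    if heads % kv_heads != 0:
        raise ValueError("heads must be divisible by kv_heads")
    group_size = heads // kv_heads
    return query_head // group_size


mapping = [kv_head_for_query_head(h, heads=8, kv_heads=2) for h in range(8)]
print(mapping)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

Since only the key / value heads are cached during decoding, this configuration stores 2 / 8 = 25% of the keys and values that full multi-head attention would, and `kv_heads = 1` recovers the one write-head case.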
### Attention on Attention for Image Captioning

<img src="./images/attention-on-attention.png"></img>

https://arxiv.org/abs/1908.06954

This paper proposes to add a gated linear unit at the end of the attention layer, further gated by the original queries. Although this is not widely used outside of visual question / answering, I suspect it should lead to improvements after seeing the success of the feedforward GLU variant.

Update: After some experimentation, I found this variant actually performs worse, but if it were to be modified to not concatenate the queries before gating, it performs much better. That is what we will be using in this repository.

```py
import torch
from x_transformers import TransformerWrapper, Encoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Encoder(
        dim = 512,
        depth = 6,
        heads = 8,
        attn_on_attn = True  # gate output of attention layer, by queries
    )
)
```
### Intra-attention Gating on Values

<img src="./images/gate_values.png" width="400px"></img>

<a href="https://github.com/deepmind/alphafold">Alphafold2</a> had a peculiar variant of attention where they gate the aggregated values with the input, presumably to have the block have more control over the update.

A quick test shows a small but noticeable improvement, on about the same order as attention on attention.

```py
import torch
from x_transformers import TransformerWrapper, Encoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Encoder(
        dim = 512,
        depth = 6,
        heads = 8,
        attn_gate_values = True  # gate aggregated values with the input
    )
)
```
### Improving Transformer Models by Reordering their Sublayers

<img src="./images/sandwich.png"></img>

<img src="./images/sandwich-2.png"></img>

https://arxiv.org/abs/1911.03864

This paper proposes to break from the normal fixed pattern of alternating attention and feedforwards, and instead to have blocks of only attention at the beginning followed by blocks of only feedforwards at the end. This was further corroborated by a paper by Nvidia that reduces the number of attention layers to be 1/3rd of the feedforwards without loss in performance.

The amount of interleaving is controlled by a "sandwich coefficient", which they found to be optimal at a value of `6`.

You can experiment with this feature as shown below

```py
import torch
from x_transformers import TransformerWrapper, Encoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Encoder(
        dim = 512,
        depth = 6,
        heads = 8,
        sandwich_coef = 6  # interleave attention and feedforwards with sandwich coefficient of 6
    )
)
```
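
The reordering is easiest to see by printing the sublayer sequence. Following the pattern in the paper, a model with `depth` attention / feedforward pairs and sandwich coefficient `k` has `k` attention sublayers first, then `depth - k` interleaved pairs, then `k` feedforward sublayers. The sketch below is plain Python; the function name and the letters `a` and `f` are illustrative choices.

```python
def sandwich_layer_order(depth, sandwich_coef):
    # 'a' stands for an attention sublayer, 'f' for a feedforward sublayer
    if not 0 <= sandwich_coef <= depth:
        raise ValueError("sandwich_coef must be between 0 and depth")
    k = sandwich_coef
    return ("a",) * k + ("a", "f") * (depth - k) + ("f",) * k


print("".join(sandwich_layer_order(depth=6, sandwich_coef=6)))  # aaaaaaffffff
print("".join(sandwich_layer_order(depth=6, sandwich_coef=2)))  # aaafafafafff
print("".join(sandwich_layer_order(depth=6, sandwich_coef=0)))  # afafafafafaf
```

The total number of sublayers stays at `2 * depth` in every case; only their order changes. With `depth = 6` and a coefficient of `6`, as in the example above, there is no interleaving left at all.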
### Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View

<img src="./images/macaron-1.png"></img>

<img src="./images/macaron-2.png"></img>

https://arxiv.org/abs/1906.02762

The authors propose to view the success of transformers from a dynamical systems point of view, and then propose an improvement based on the mathematics of that point of view. Specifically, they propose to place the attention layer in between two feedforward layers. This was adopted by a paper using transformers for speech recognition, the <a href="https://arxiv.org/abs/2005.08100">Conformer</a>.

```py
import torch
from x_transformers import TransformerWrapper, Encoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Encoder(
        dim = 512,
        depth = 6,
        heads = 8,
        macaron = True  # use macaron configuration
    )
)
```
### T5's Simplified Relative Positional Encoding

https://arxiv.org/abs/1910.10683

T5 is one of the most successful encoder / decoder transformer architectures trained to date. They invented a new simplified relative positional encoding based on learned bias values that are added to the attention matrix pre-softmax. This bias is shared and injected into each attention layer. I have decided to include this because it offers a cheap way to have relative positional encoding (superior to absolute positional), and I have read papers that suggest having positional encoding added to each layer (vs only before the first) is beneficial.

```py
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        rel_pos_bias = True  # adds relative positional bias to all attention layers, a la T5
    )
)
```
### Residual Attention

<img src="./images/residual_attn.png" width="500px"></img>

https://arxiv.org/abs/2012.11747

This paper from Google proposes residualizing the pre-attention scores across all layers. At the cost of no extra parameters, they show improvement on top of regular attention networks. If you turn on this setting, be aware that the best results in the paper used post-normalization, in which case a learning warmup will be needed. The authors also reported that they could use a higher learning rate and get even better gains in the same amount of steps. (In the paper they use `2e-4` vs `1e-4` for vanilla transformer)

```py
import torch
from x_transformers import TransformerWrapper, Encoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Encoder(
        dim = 512,
        depth = 6,
        heads = 8,
        pre_norm = False,      # in the paper, residual attention had best results with post-layernorm
        residual_attn = True   # add residual attention
    )
)
```
I also tried residualizing cross attention and may have noticed an improvement in convergence. You can try it by setting the `cross_residual_attn` keyword to `True`

```py
import torch
from x_transformers import XTransformer

model = XTransformer(
    dim = 512,
    enc_num_tokens = 256,
    enc_depth = 6,
    enc_heads = 8,
    enc_max_seq_len = 1024,
    dec_num_tokens = 256,
    dec_depth = 6,
    dec_heads = 8,
    dec_max_seq_len = 1024,
    dec_cross_residual_attn = True  # residualize cross attention
)
```
### Transformer-XL recurrence
You can also do Transformer-XL recurrence, by simply passing in a `max_mem_len` in the `TransformerWrapper` class, and then making sure your `Decoder` has `rel_pos_bias` (or `rotary_pos_emb`) set to `True`.
Then, you can retrieve the memories at each step with the `return_mems` keyword and pass it to the next iteration.
```py
import torch
from x_transformers import TransformerWrapper, Decoder

model_xl = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 512,
    max_mem_len = 2048,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        rel_pos_bias = True
    )
)

seg1 = torch.randint(0, 20000, (1, 512))
seg2 = torch.randint(0, 20000, (1, 512))
seg3 = torch.randint(0, 20000, (1, 512))

logits1, mems1 = model_xl(seg1, return_mems = True)
logits2, mems2 = model_xl(seg2, mems = mems1, return_mems = True)
logits3, mems3 = model_xl(seg3, mems = mems2, return_mems = True)
```
Setting up the logic for training and sampling from Transformer-XL can be a bit overwhelming. This repository offers a simple wrapper that should make this easy, with the `XLAutoregressiveWrapper`.
```py
from x_transformers import XLAutoregressiveWrapper

# pass in the above model_xl
xl_wrapper = XLAutoregressiveWrapper(model_xl)

# kept on the same device as model_xl (move both with .cuda() if you train on GPU)
seg = torch.randint(0, 20000, (1, 4096)) # sequence exceeding max length, automatically segmented and memory managed

loss = xl_wrapper(seg)
loss.backward()

# then, after much training
prime = seg[:, :1024] # if prime exceeds max length, memory will be caught up before generating
generated = xl_wrapper.generate(prime, 4096) # (1, 4096)
```
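The bookkeeping behind the recurrence can be pictured with a toy example. This is only a sketch in plain Python, with integers standing in for hidden states and a hypothetical `update_memory` helper. In the real model the memories are per-layer tensors and the wrapper handles them for you.

```python
def update_memory(mems, hiddens, max_mem_len):
    """Append the new segment's hidden states and keep only the newest max_mem_len of them."""
    combined = mems + hiddens
    return combined[-max_mem_len:]

max_mem_len = 4
mems = []

for segment in ([1, 2, 3], [4, 5, 6], [7, 8, 9]):
    # in a real model, attention for `segment` would see `mems` as extra keys / values
    context = mems + segment
    mems = update_memory(mems, segment, max_mem_len)
```

After the third segment, the model attends over the current segment plus the four most recent cached states, which is how the effective context grows beyond `max_seq_len`.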
### Enhanced recurrence
<img src="./images/enhanced-recurrence.png" width="400px"/>
<a href="https://arxiv.org/abs/2012.15688">This paper</a> proposes a simple technique to enhance the range of Transformer-XL. They simply route the memory segment of a layer to the layer below it, for the next recurrent step. You can enable this by setting `shift_mem_down = 1`. You can also shift down by an arbitrary number of layers by setting this value to `> 1`.
```py
import torch
from x_transformers import TransformerWrapper, Decoder

model_xl = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 512,
    max_mem_len = 2048,
    shift_mem_down = 1,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        rotary_pos_emb = True
    )
)

seg1 = torch.randint(0, 20000, (1, 512))
seg2 = torch.randint(0, 20000, (1, 512))
seg3 = torch.randint(0, 20000, (1, 512))

logits1, mems1 = model_xl(seg1, return_mems = True)
logits2, mems2 = model_xl(seg2, mems = mems1, return_mems = True) # mems1 of layer N are automatically routed to the layer N-1
```
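The routing itself amounts to a rotation of the per-layer memory list. The sketch below uses strings as stand-ins for the memory tensors and a hypothetical `shift_mems_down` helper. How the topmost layers are treated (here they wrap around and receive the lowest layers' memories) is an assumption of this sketch, not something the text above specifies.

```python
def shift_mems_down(mems, shift):
    """Rotate per-layer memories so that layer i receives the memory produced by layer i + shift."""
    return mems[shift:] + mems[:shift]

# one memory per layer, as returned by the previous recurrent step
per_layer_mems = ['mem_L0', 'mem_L1', 'mem_L2', 'mem_L3']

routed = shift_mems_down(per_layer_mems, 1)
# layer 0 now reads what layer 1 wrote on the previous step, and so on
```

Since each layer reads memories written one layer higher, information can travel further back in time per recurrent step than with plain Transformer-XL.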
### Gated residual
<img src="./images/gating.png" width="500px"></img>
https://arxiv.org/abs/1910.06764
The authors propose gating the residual connections in the transformer network and demonstrate increased stability and performance for Transformer-XL in a variety of reinforcement learning tasks.
```py
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    max_mem_len = 2048,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 16,
        gate_residual = True
    )
)
```
### Rotary Positional Embeddings
<img src="./images/rotary.png" width="500px"></img>
Developed in Beijing, this new technique quickly gained interest in NLP circles. In short, it allows you to endow the transformer with relative positional embeddings at the cost of no learned parameters. You apply a rotary operation to the queries and keys prior to their dot product in attention. The big idea is injecting positions through rotations.
I highly recommend turning this on whenever you are working with an ordered sequence.
```py
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        rotary_pos_emb = True # turns on rotary positional embeddings
    )
)
```
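To see why rotating queries and keys yields a relative encoding, here is a small dependency-free sketch. It pairs consecutive features as in the RoFormer paper (implementations often pair the two halves of the vector instead), and the function names are illustrative only.

```python
import math

def rotate(vec, position, base = 10000):
    """Rotate each consecutive pair of features by an angle proportional to the position."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)   # lower pairs rotate faster
        x, y = vec[i], vec[i + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# same relative offset (3), very different absolute positions
score_a = dot(rotate(q, 5), rotate(k, 2))
score_b = dot(rotate(q, 105), rotate(k, 102))
```

`score_a` and `score_b` agree up to floating point error: the attention score depends only on the offset between the query and key positions, and nothing had to be learned to get there.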
Update (12/2022): Rotary embedding has since been hugely successful, widely adopted in many large language models, including the largest in the world, PaLM. However, the ALiBi paper showed that rotary embeddings do not length extrapolate well. This was recently addressed in <a href="https://arxiv.org/abs/2212.10554v1">a Microsoft research paper</a>. They propose a way to unobtrusively add the same decay as in ALiBi, and found that this resolves the extrapolation problem. You can use it in this repository by setting `rotary_xpos = True`. Like ALiBi, it enforces the attention to be local. You can set the receptive field with the `rotary_xpos_scale_base` value, which defaults to `512`.
```py
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        rotary_xpos = True # modified rotary to extrapolate well beyond length at which it was trained
    )
)
```
### Dynamic Positional Bias
<img src="./images/dynamic-pos-bias.png" width="150px"></img>
This technique has its roots in the field of vision transformers, where researchers are trying to have relative positions generalize to larger resolutions (without having to retrain the entire network). It was used in two recent papers, <a href="https://arxiv.org/abs/2108.00154">CrossFormer</a>, as well as <a href="https://arxiv.org/abs/2111.09883">SwinV2</a>.
<a href="https://github.com/cfoster0">Charles Foster</a> first tried this for a language model, and found that it works. Later on <a href="https://github.com/bob80333">Eric Engelhart</a> produced experimental results that show the same type of extrapolation holds, even for 1d sequences.
Eric trained at sequence lengths of 128, and showed that it generalized well to 1024. In addition, he showed that linear positions were better than log (used in SwinV2), for language.
Linear distances
<img src="./images/dynamic-pos-bias-linear.png" width="600px"></img>
Log distances
<img src="./images/dynamic-pos-bias-log.png" width="600px"></img>
Negative control - Sinusoidal
<img src="./images/dynamic-pos-bias-sinusoidal.png" width="600px"></img>
More of Eric's experimental results can be found <a href="https://github.com/bob80333/investigating_extrapolation">here</a>.
You can use this type of relative position if you wish to train at smaller sequence lengths and have it generalize to longer ones, for both autoregressive and bidirectional models.
Update: <a href="https://www.kaggle.com/competitions/stanford-ribonanza-rna-folding/discussion/460121">First place RNA folding using dynamic positional bias</a>
```py
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 256,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        dynamic_pos_bias = True, # set this to True
        dynamic_pos_bias_log_distance = False # whether to use log distance, as in SwinV2
    )
)
```
### ALiBi Positional Embedding
<a href="https://ofir.io/train_short_test_long.pdf">This paper</a> proposes to simply apply a static linear bias to the attention matrix. The authors show this is not only effective as a relative positional encoding, but also allows the attention net to extrapolate to greater sequence lengths than what it was trained on, for autoregressive language models.
This repository also offers a bidirectional variant (nonsymmetric), proposed by the authors <a href="https://github.com/ofirpress/attention_with_linear_biases/issues/5">here</a>. However, this is untested. If you need bidirectional length extrapolation, the safest option would be Dynamic Position Bias.
Update: It may be that ALiBi enforces a strong local attention across the heads, and may hinder it from attending at distances greater than 1k. To avoid any issues with global message passing, I've decided to introduce another hyperparameter `alibi_num_heads`, so one can specify fewer heads for the ALiBi bias.
Update: There are reports that ALiBi outperforms Rotary embeddings for pretraining and downstream fine-tuning.
Update: <a href="https://arxiv.org/abs/2305.19466">New paper</a> shows that using no positional embedding can length extrapolate even better than explicit ones.
```py
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        alibi_pos_bias = True, # turns on ALiBi positional embedding
        alibi_num_heads = 4 # only use ALiBi for 4 out of the 8 heads, so other 4 heads can still attend far distances
    )
)
```
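For intuition, the static bias can be written down in a few lines. The sketch below uses the geometric slope schedule from the ALiBi paper for a power-of-two head count and folds the causal mask into the bias as `-inf`. The helper names are hypothetical, and giving the non-ALiBi heads a slope of zero is an assumption about what `alibi_num_heads` amounts to, not a description of the library's code.

```python
def alibi_slopes(num_heads):
    """Geometric sequence of per-head slopes (num_heads assumed to be a power of 2)."""
    start = 2 ** (-8 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    """Causal bias: 0 on the diagonal, -slope * distance for earlier tokens, -inf for future ones."""
    return [[-slope * (i - j) if j <= i else float('-inf') for j in range(seq_len)]
            for i in range(seq_len)]

slopes = alibi_slopes(8)             # 1/2, 1/4, ..., 1/256
bias = alibi_bias(4, slopes[0])      # added to the attention logits of the first head

# only 4 of 8 heads get a distance penalty; the rest stay free to attend far away
per_head = alibi_slopes(4) + [0.0] * 4
```

Heads with a steep slope focus on nearby tokens, while heads with a shallow (or zero) slope keep a longer reach.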
### Shifted Tokens
An <a href="https://github.com/BlinkDL">independent researcher</a> has found that shifting a subset of the feature dimension along the sequence dimension by 1 token helps with convergence (<a href="https://zhuanlan.zhihu.com/p/191393788">Time-mixing</a>). I have tested this for the autoregressive case and can confirm that it leads to greatly improved convergence. This also lines up with <a href="https://arxiv.org/abs/2106.07477">the results</a> of some papers in the vision domain.
To use it, simply set `shift_tokens = 1` (or to whatever number of shifts you desire). The feature dimension will be divided by `shift_tokens + 1` and then each chunk will be shifted `[0, shift_tokens]` respectively.
Update: New experiments by @sdtblck suggest this may only work for character-level training.
Update: After more experiments, it seems that in the context of BPE encoding, with rotary turned on, there is no benefit to shifting. For character-level training, shifting may still improve things a tiny bit.
Update: When doing BPE encoded tokens, it seems that a shift of 2 will bottleneck the dimensions (divided by 5). It is recommended you always do a shift of 1, unless you are working at the character level.
```py
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        shift_tokens = 1
    )
)
```
If you want finer control over how much is shifted per block (whether attention or feedforward), simply pass in a tuple whose size equals the number of blocks, with attention and feedforward blocks counted separately.
```py
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        shift_tokens = (1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0) # 12 blocks, attention and feedforward alternating, with progressively less shifting
    )
)
```
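What the shift does to the features can be shown on a tiny example. This is a plain-Python sketch for the causal case only, with a hypothetical `shift_features` helper. With `shift_tokens = 1` the features split into two chunks: the first stays put and the second is taken from the previous token.

```python
def shift_features(x, shifts):
    """x: list of per-token feature vectors. Split the features into len(shifts) chunks;
    chunk c at position t takes its values from position t - shifts[c] (zeros if out of range)."""
    seq_len, dim = len(x), len(x[0])
    chunk = dim // len(shifts)
    out = [[0.0] * dim for _ in range(seq_len)]
    for c, shift in enumerate(shifts):
        lo = c * chunk
        hi = dim if c == len(shifts) - 1 else lo + chunk   # last chunk takes any remainder
        for t in range(seq_len):
            src = t - shift
            if 0 <= src < seq_len:
                out[t][lo:hi] = x[src][lo:hi]
    return out

x = [[1.0, 1.0, 1.0, 1.0],
     [2.0, 2.0, 2.0, 2.0],
     [3.0, 3.0, 3.0, 3.0]]

shifted = shift_features(x, shifts = (0, 1))   # corresponds to shift_tokens = 1
```

Each token now carries half of its own features and half of its predecessor's, which gives every block a cheap view of the previous position before attention even runs.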
### Sandwich Norm
<img src="./images/sandwich_norm.png" width="400px"/>
This technique first made an appearance in <a href="https://arxiv.org/abs/2105.13290">the CogView paper</a>, a Chinese version of the famous text-to-image transformer DALL-E. They propose, when using pre-layernorm, to add an extra layernorm to all the branch outputs. I have found this to be very effective for a number of projects, when facing instability during training.
```py
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        sandwich_norm = True # set this to True
    )
)

x = torch.randint(0, 20000, (1, 1024))
model(x)
```
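The difference from plain pre-layernorm is one extra normalization on the branch output before it joins the residual stream. A minimal sketch in plain Python follows; the layernorm has no learned gain or bias, and `exploding_branch` is a made-up stand-in for an attention or feedforward branch that has become unstable.

```python
import math

def layer_norm(x, eps = 1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_norm_block(x, branch):
    # x + f(LN(x))
    return [a + b for a, b in zip(x, branch(layer_norm(x)))]

def sandwich_norm_block(x, branch):
    # x + LN(f(LN(x))): the branch is normalized on the way in and on the way out
    return [a + b for a, b in zip(x, layer_norm(branch(layer_norm(x))))]

def exploding_branch(h):
    # stand-in for a branch whose output magnitude has grown very large
    return [1000.0 * v for v in h]

x = [1.0, 2.0, 3.0, 4.0]
pre = pre_norm_block(x, exploding_branch)
sandwich = sandwich_norm_block(x, exploding_branch)
```

With pre-layernorm alone the large branch output lands directly on the residual stream. With the sandwich, what gets added is always at unit scale, which is one plausible reading of why it helps with instability.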
### ResiDual
<img src="./images/resi_dual.png" width="400px"/>
<a href="https://arxiv.org/abs/2304.14802">This Microsoft paper</a> proposes yet another normalization configuration, combining both pre and post layernorm. They claim this hybridization reduces representation collapse (known to be an issue with pre-layernorm with increasing depth), while maintaining stability and reducing vanishing gradients (issues with post-layernorm). Initial experiments on my end show it to work no worse than pre-layernorm or sandwich norm. More study needed by the public to see if this is actually a winning technique.
```py
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        resi_dual = True, # set this to True
        resi_dual_scale = 0.1 # in appendix, they said on fp16 the prenorm residual is prone to overflow. they claim by scaling it at each layer by a factor, it would prevent the overflow, and keep results the same (as layernorms are invariant to scaling of the input)
    )
)

x = torch.randint(0, 20000, (1, 1024))
model(x)
```
### Normformer
<img src="./images/normformer.png" width="400px"/>
This <a href="https://openreview.net/forum?id=GMYWzWztDx5">paper</a> uncovers an issue with pre-norm transformers where gradients are mismatched between the early and later layers. They propose 4 changes, of which I will be offering 3.
The first change is to offer per head scaling after aggregating the values in attention. My experiments show a slight improvement in convergence.
```py
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        attn_head_scale = True # set this to True
    )
)

x = torch.randint(0, 20000, (1, 1024))
model(x)
```
The second change is an extra layernorm right after the activation in the feedforward. I have also verified a slight improvement, at the cost of extra compute.
```py
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        ff_post_act_ln = True # set this to True
    )
)

x = torch.randint(0, 20000, (1, 1024))
model(x)
```
For the residual scaling, you simply have to set `scale_residual = True`. I have noticed slight improvements, but occasional instability as well, so use with caution.
```py
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        scale_residual = True # set this to True
    )
)

x = torch.randint(0, 20000, (1, 1024))
model(x)
```
The last change is a layernorm right after the outwards projection in attention. This is actually identical to the sandwich norm proposed by the CogView paper, so you can use this by simply setting `sandwich_norm = True`, although it would also add it to the feedforward layer.
### Cosine Sim Attention
<img src="./images/cosine-sim-attention.png" width="400px"></img>
This <a href="https://arxiv.org/abs/2010.04245">paper</a> proposes to l2 normalize the queries and keys along the head dimension before the dot product (cosine similarity), with the additional change of the scale being learned rather than static. The normalization prevents the attention operation from overflowing, and removes any need for numerical stability measures prior to softmax. Both are perennial problems when training transformers.
This was validated at scale recently by the training of <a href="https://arxiv.org/abs/2111.09883">a 3B parameter vision transformer</a>. The SwinV2 paper also proposes to change the pre-layernorm to a post-layernorm for further stability.
I have validated that this works just as well as dot product attention in an autoregressive setting, if one were to initialize the temperature as proposed in the QK-norm paper (as a function of the sequence length).
This flavor of attention also has <a href="https://arxiv.org/abs/2111.05498">a connection</a> to sparse distributed memory. <a href="https://www.youtube.com/watch?v=THIIk7LR9_8">[youtube talk]</a>
Update: I have discovered a way to remove the learned temperature altogether, by grouping the feature dimension and doing l2-normalization on each group. This allows the queries and keys to have a similarity that is upper bounded by the number of groups. A group size of 8 or 16 was sufficient in my tests. Decided to name this technique "Grouped QK Normalization". The drawback is that I believe an attention head dimension of 32 is too small to use this tactic (a dimension often used in vision).
Update 2: Tero Karras has successfully used cosine sim attention in <a href="https://arxiv.org/abs/2312.02696">a new paper</a>.
You can use it as follows:
```py
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        attn_qk_norm = True, # set this to True
        attn_qk_norm_groups = 8 # number of groups in the feature dimension for l2norm, similarity scores will be bounded between [-group, group]. determines how sharp the attention can be
    )
)

x = torch.randint(0, 20000, (1, 1024))
model(x)
```
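The bounds mentioned above are easy to check by hand. Below is a small plain-Python sketch of cosine similarity with optional grouping and a fixed scale. The function names are illustrative, and this is not the library's implementation.

```python
import math

def l2norm(v):
    norm = math.sqrt(sum(a * a for a in v)) or 1.0   # guard against an all-zero vector
    return [a / norm for a in v]

def grouped_l2norm(v, groups):
    """Split the feature dimension into groups and l2-normalize each group separately."""
    size = len(v) // groups
    out = []
    for g in range(groups):
        out += l2norm(v[g * size:(g + 1) * size])
    return out

def similarity(q, k, groups = 1, scale = 1.0):
    qn, kn = grouped_l2norm(q, groups), grouped_l2norm(k, groups)
    return scale * sum(a * b for a, b in zip(qn, kn))

q = [100.0, -250.0, 30.0, 400.0, 5.0, -60.0, 700.0, 8.0]   # deliberately huge values
k = q                                                        # same direction, so similarity is maximal

sim_plain   = similarity(q, k)               # bounded by 1
sim_grouped = similarity(q, k, groups = 4)   # bounded by the number of groups
sim_scaled  = similarity(q, k, scale = 10)   # a fixed scale instead of groups
```

However large the raw activations get, the logits stay within a known range, which is why no overflow guard is needed before the softmax. The number of groups (or the scale) then sets how sharp the attention can become.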
Another update: Simply scaling the cosine similarity (group of 1) with a fixed constant (10) may work too.
```py
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        attn_qk_norm = True, # set to True
        attn_qk_norm_scale = 10 # new scale on the similarity, with groups of 1
    )
)

x = torch.randint(0, 20000, (1, 1024))
model(x)
```
### QK RMSNorm
<img src="./images/qknorm-analysis.png" width="450px"></img>
Update: Google Brain has proven out something similar to cosine sim attention in <a href="https://arxiv.org/abs/2302.05442">a 22B parameter model</a>. In their paper, they have analysis showing that the normalization resulted in not only extra stability, but also better results in the end (due to less need to adjust learning rate when increasing parameter count).
We are nearing the point of wiping out a source of transformer training instability with one simple intervention, in my opinion. The only slight difference in the paper is that they still have a learned scale across the feature dimension (per use of rmsnorm). Not sure how critical this is, but just to make sure we don't miss anything, I will include this here. You can use this by setting `qk_norm_dim_scale = True` (passed to the `Decoder` with the `attn_` prefix, as `attn_qk_norm_dim_scale = True`).
Update: <a href="https://twitter.com/Tim_Dettmers/status/1625531080513306627">Counterpoint from Tim Dettmers</a>
Update 2: <a href="https://arxiv.org/abs/2305.19268">Counter</a> to Tim's assertion that outliers are needed, and potentially even <a href="https://arxiv.org/abs/2306.12929">some solutions</a>
Update 3: Used by <a href="https://www.adept.ai/blog/persimmon-8b">8B parameter LLM</a> successfully
Update 4: A MetaAI group found that they can <a href="https://arxiv.org/abs/2309.16588">alleviate outliers</a> by adding `register tokens`, also known as `memory tokens` from earlier literature (Burtsev et al). Perhaps what should be tried next is to see if qk norm can be improved in the presence of memory tokens.
```py
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 12,
        heads = 8,
        attn_qk_norm = True,
        attn_qk_norm_dim_scale = True # set this to True, in addition to `attn_qk_norm = True`
    )
)

x = torch.randint(0, 256, (1, 1024))
model(x)
```
### Turning off absolute positional embedding
A number of papers have hinted that causal transformers (`Decoder`) can learn absolute positions in the absence of added embeddings of any sort. This was recently thoroughly investigated <a href="https://arxiv.org/abs/2203.16634">here</a>. You can turn off the absolute positional embedding by setting `use_abs_pos_emb = False` in the `TransformerWrapper`.
Given <a href="https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html">PaLM</a>, the trend going forward may be to forgo absolute positional embedding (again, for causal transformers only), and add relative positional embeddings with RoPE, ALiBi, etc.
Update: <a href="https://arxiv.org/abs/2305.19466">This paper</a> shows that in the absence of any engineered absolute or relative positional embeddings, decoders can generate implicit positions, and even length generalize better than solutions of the past. They were unaware of dynamic positional bias, however.
```py
import torch
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    use_abs_pos_emb = False, # set this to False
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
    )
)

x = torch.randint(0, 20000, (1, 1024))
model(x)
```
### Forgetful Causal Mask
<img src="./images/fcm.png" width="450px"></img>
<a href="https://arxiv.org/abs/2210.13432">This paper</a> shows convincing results that one can combine masking (from masked language modeling) with autoregressive training, leading to significantly better results.
You can use this by setting the `mask_prob` on the `AutoregressiveWrapper` class.
```py
import torch
from x_transformers import TransformerWrapper, Decoder, AutoregressiveWrapper

model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 12,
        heads = 8
    )
)

model = AutoregressiveWrapper(
    model,
    mask_prob = 0.15 # in paper, they use 15%, same as BERT
).cuda()

# mock data
x = torch.randint(0, 20000, (1, 1024)).cuda()

# derive cross entropy loss, masking all taken care of
loss = model(x)
loss.backward()
```
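One way to picture the combined mask is as a causal mask in which a random subset of past tokens is additionally hidden. The sketch below is plain Python with a hypothetical `forgetful_causal_mask` helper. Sampling each token independently and always letting a token see itself are assumptions of this sketch, not a description of the wrapper's internals.

```python
import random

def forgetful_causal_mask(seq_len, mask_prob, rng):
    """allowed[i][j] is True when query i may attend to key j."""
    # choose which tokens are 'forgotten' as keys for this training step
    forgotten = [rng.random() < mask_prob for _ in range(seq_len)]
    allowed = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i
            # a token always sees itself, so every row keeps at least one key
            keep = (j == i) or not forgotten[j]
            row.append(causal and keep)
        allowed.append(row)
    return allowed

rng = random.Random(0)
mask = forgetful_causal_mask(8, mask_prob = 0.15, rng = rng)
```

The mask is resampled on every training step, so the model cannot rely on any particular past token always being visible. At inference time no tokens are dropped.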
## Miscellaneous
### Cross Attention
```py
import torch
from x_transformers import Encoder, CrossAttender

enc = Encoder(dim = 512, depth = 6)
model = CrossAttender(dim = 512, depth = 6)

nodes = torch.randn(1, 1, 512)
node_masks = torch.ones(1, 1).bool()

neighbors = torch.randn(1, 5, 512)
neighbor_masks = torch.ones(1, 5).bool()

encoded_neighbors = enc(neighbors, mask = neighbor_masks)
model(nodes, context = encoded_neighbors, mask = node_masks, context_mask = neighbor_masks) # (1, 1, 512)
```
### Continuous Embeddings
```py
import torch
from x_transformers import ContinuousTransformerWrapper, Decoder

model = ContinuousTransformerWrapper(
    dim_in = 32,
    dim_out = 100,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 12,
        heads = 8
    )
)

x = torch.randn((1, 1024, 32))
model(x) # (1, 1024, 100)
```
You can also easily train a transformer that accepts continuous values autoregressively, in the same scheme as done successfully in <a href="https://arxiv.org/abs/2112.05329">this paper</a>.
```py
import torch
from x_transformers import ContinuousTransformerWrapper, Decoder
from x_transformers import ContinuousAutoregressiveWrapper

model = ContinuousTransformerWrapper(
    dim_in = 777,
    dim_out = 777,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 12,
        heads = 8
    )
)

# wrap it with the continuous autoregressive wrapper
model = ContinuousAutoregressiveWrapper(model)

# mock data
x = torch.randn((1, 1024, 777))
mask = torch.ones(1, 1024).bool()

# train on a lot of data above
loss = model(x, mask = mask)
loss.backward() # the call parentheses were missing, so no gradients were computed

# then generate
start_emb = torch.randn(1, 777)
generated = model.generate(start_emb, 17) # (17, 777)
```
### xVal – Continuous and Discrete
<img src="./images/xval.png" width="400px"></img>
This is promising work that resulted from the collaboration across many institutes (collectively known as Polymathic AI). They found that by offering a continuously scaled number token to the transformer, the transformer was able to generalize arithmetic and forecasting tasks better than the alternative encoding schemes.
This is corroborated by some [prior work](https://github.com/lucidrains/tab-transformer-pytorch#ft-transformer).
```py
import torch
from x_transformers import (
    Decoder,
    XValTransformerWrapper,
    XValAutoregressiveWrapper
)

model = XValTransformerWrapper(
    num_tokens = 4,
    numerical_token_id = 3,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 12,
        heads = 8
    )
)

# wrap it with the xval autoregressive wrapper
model = XValAutoregressiveWrapper(model)

# mock data
ids = torch.randint(0, 4, (1, 777))
nums = torch.randn(1, 777)
mask = torch.ones(1, 777).bool()

# train on a lot of data above
loss = model(ids, nums, mask = mask)
loss.backward()

# then generate
start_ids = torch.randint(0, 4, (1, 1))
start_nums = torch.randn(1, 1)

ids_out, num_out, is_number_mask = model.generate(start_ids, start_nums, 17)

# (1, 17), (1, 17), (1, 17)
# discrete, continuous, mask for discrete / continuous
```
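The "continuously scaled number token" can be sketched as follows: every number in the input maps to the same token id, and that token's embedding is multiplied by the numeric value. This is a toy illustration in plain Python with a made-up embedding table and a hypothetical `xval_embed` helper. The wrapper above takes care of this for you, along with predicting numeric values at the output.

```python
def xval_embed(token_ids, numbers, embedding, numerical_token_id):
    """Look up each token embedding; scale the number token's embedding by its value."""
    out = []
    for tok, num in zip(token_ids, numbers):
        vec = embedding[tok]
        scale = num if tok == numerical_token_id else 1.0
        out.append([scale * v for v in vec])
    return out

# hypothetical 4-token vocabulary with 2-dimensional embeddings; id 3 is the number token
embedding = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [1.0, -1.0]]

ids  = [0, 3, 2, 3]
nums = [0.0, 2.5, 0.0, -0.5]   # only read where ids == 3

embedded = xval_embed(ids, nums, embedding, numerical_token_id = 3)
```

Since magnitude is carried by the scale rather than by separate digit tokens, nearby numbers get nearby embeddings, which is one intuition for the better generalization reported above.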
## Citations
```bibtex
@misc{vaswani2017attention,
    title = {Attention Is All You Need},
    author = {Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin},
    year = {2017},
    eprint = {1706.03762},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}
```
```bibtex
@article{DBLP:journals/corr/abs-1907-01470,
    author = {Sainbayar Sukhbaatar and
              Edouard Grave and
              Guillaume Lample and
              Herv{\'{e}} J{\'{e}}gou and
              Armand Joulin},
    title = {Augmenting Self-attention with Persistent Memory},
    journal = {CoRR},
    volume = {abs/1907.01470},
    year = {2019},
    url = {http://arxiv.org/abs/1907.01470}
}
```
```bibtex
@article{1910.05895,
    author = {Toan Q. Nguyen and Julian Salazar},
    title = {Transformers without Tears: Improving the Normalization of Self-Attention},
    year = {2019},
    eprint = {arXiv:1910.05895},
    doi = {10.5281/zenodo.3525484},
}
```
```bibtex
@misc{shazeer2020glu,
    title = {GLU Variants Improve Transformer},
    author = {Noam Shazeer},
    year = {2020},
    url = {https://arxiv.org/abs/2002.05202}
}
```
```bibtex
@inproceedings{Zoph2022STMoEDS,
    title = {ST-MoE: Designing Stable and Transferable Sparse Expert Models},
    author = {Barret Zoph and Irwan Bello and Sameer Kumar and Nan Du and Yanping Huang and Jeff Dean and Noam M. Shazeer and William Fedus},
    year = {2022}
}
```
```bibtex
@misc{bhojanapalli2020lowrank,
    title = {Low-Rank Bottleneck in Multi-head Attention Models},
    author = {Srinadh Bhojanapalli and Chulhee Yun and Ankit Singh Rawat and Sashank J. Reddi and Sanjiv Kumar},
    year = {2020},
    eprint = {2002.07028}
}
```
```bibtex
@misc{burtsev2020memory,
    title = {Memory Transformer},
    author = {Mikhail S. Burtsev and Grigory V. Sapunov},
    year = {2020},
    eprint = {2006.11527},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}
```
```bibtex
@misc{zhao2019explicit,
    title = {Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection},
    author = {Guangxiang Zhao and Junyang Lin and Zhiyuan Zhang and Xuancheng Ren and Qi Su and Xu Sun},
    year = {2019},
    eprint = {1912.11637},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}
```
```bibtex
@misc{correia2019adaptively,
    title = {Adaptively Sparse Transformers},
    author = {Gonçalo M. Correia and Vlad Niculae and André F. T. Martins},
    year = {2019},
    eprint = {1909.00015},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}
```
```bibtex
@misc{shazeer2020talkingheads,
    title = {Talking-Heads Attention},
    author = {Noam Shazeer and Zhenzhong Lan and Youlong Cheng and Nan Ding and Le Hou},
    year = {2020},
    eprint = {2003.02436},
    archivePrefix = {arXiv},
    primaryClass = {cs.LG}
}
```
```bibtex
@misc{press2020improving,
    title = {Improving Transformer Models by Reordering their Sublayers},
    author = {Ofir Press and Noah A. Smith and Omer Levy},
    year = {2020},
    eprint = {1911.03864},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}
```
```bibtex
@misc{lu2019understanding,
    title = {Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View},
    author = {Yiping Lu and Zhuohan Li and Di He and Zhiqing Sun and Bin Dong and Tao Qin and Liwei Wang and Tie-Yan Liu},
    year = {2019},
    eprint = {1906.02762},
    archivePrefix = {arXiv},
    primaryClass = {cs.LG}
}
```
```bibtex
@misc{ke2020rethinking,
    title = {Rethinking Positional Encoding in Language Pre-training},
    author = {Guolin Ke and Di He and Tie-Yan Liu},
    year = {2020},
    eprint = {2006.15595},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}
```
```bibtex
@misc{dosovitskiy2020image,
    title = {An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
    author = {Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby},
    year = {2020},
    eprint = {2010.11929},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}
```
```bibtex
@misc{huang2019attention,
    title = {Attention on Attention for Image Captioning},
    author = {Lun Huang and Wenmin Wang and Jie Chen and Xiao-Yong Wei},
    year = {2019},
    eprint = {1908.06954},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}
```
```bibtex
@misc{raffel2020exploring,
    title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
    author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
    year = {2020},
    eprint = {1910.10683},
    archivePrefix = {arXiv},
    primaryClass = {cs.LG}
}
```
```bibtex
@inproceedings{martins-etal-2020-sparse,
    title = "Sparse Text Generation",
    author = "Martins, Pedro Henrique and
        Marinho, Zita and
        Martins, Andr{\'e} F. T.",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.348"
}
```
```bibtex
@misc{he2020realformer,
    title = {RealFormer: Transformer Likes Residual Attention},
    author = {Ruining He and Anirudh Ravula and Bhargav Kanagal and Joshua Ainslie},
    year = {2020},
    eprint = {2012.11747},
    archivePrefix = {arXiv},
    primaryClass = {cs.LG}
}
```
```bibtex
@misc{carion2020endtoend,
    title = {End-to-End Object Detection with Transformers},
    author = {Nicolas Carion and Francisco Massa and Gabriel Synnaeve and Nicolas Usunier and Alexander Kirillov and Sergey Zagoruyko},
    year = {2020},
    eprint = {2005.12872},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}
```
```bibtex
@misc{press2021ALiBi,
    title = {Train Short, Test Long: Attention with Linear Biases Enable Input Length Extrapolation},
    author = {Ofir Press and Noah A. Smith and Mike Lewis},
    year = {2021},
    url = {https://ofir.io/train_short_test_long.pdf}
}
```
```bibtex
@misc{parisotto2019stabilizing,
    title = {Stabilizing Transformers for Reinforcement Learning},
    author = {Emilio Parisotto and H. Francis Song and Jack W. Rae and Razvan Pascanu and Caglar Gulcehre and Siddhant M. Jayakumar and Max Jaderberg and Raphael Lopez Kaufman and Aidan Clark and Seb Noury and Matthew M. Botvinick and Nicolas Heess and Raia Hadsell},
    year = {2019},
    eprint = {1910.06764},
    archivePrefix = {arXiv},
    primaryClass = {cs.LG}
}
```
“`py
@misc{narang2021transformer,
title = {Do Transformer Modifications Transfer Across Implementations and Applications?},
author = {Sharan Narang and Hyung Won Chung and Yi Tay and William Fedus and Thibault Fevry and Michael Matena and Karishma Malkan and Noah Fiedel and Noam Shazeer and Zhenzhong Lan and Yanqi Zhou and Wei Li and Nan Ding and Jake Marcus and Adam Roberts and Colin Raffel},
year = {2021},
eprint = {2102.11972},
archivePrefix = {arXiv},
primaryClass = {cs.LG}
}
“`py
“`py
@misc{zhang2019root,
title = {Root Mean Square Layer Normalization},
author = {Biao Zhang and Rico Sennrich},
year = {2019},
eprint = {1910.07467},
archivePrefix = {arXiv},
primaryClass = {cs.LG}
}
“`py
“`py
@inproceedings{Qin2023ScalingTT,
title = {Scaling TransNormer to 175 Billion Parameters},
author = {Zhen Qin and Dong Li and Weigao Sun and Weixuan Sun and Xuyang Shen and Xiaodong Han and Yunshen Wei and Baohong Lv and Fei Yuan and Xiao Luo and Y. Qiao and Yiran Zhong},
year = {2023},
url = {https://api.semanticscholar.org/CorpusID:260203124}
}
“`py
“`py
@misc{su2021roformer,
title = {RoFormer: Enhanced Transformer with Rotary Position Embedding},
author = {Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu},
year = {2021},
eprint = {2104.09864},
archivePrefix = {arXiv},
primaryClass = {cs.CL}
}
“`py
“`py
@inproceedings{Chen2023ExtendingCW,
title = {Extending Context Window of Large Language Models via Positional Interpolation},
author = {Shouyuan Chen and Sherman Wong and Liangjian Chen and Yuandong Tian},
year = {2023}
}
```
```bibtex
@inproceedings{Sun2022ALT,
title = {A Length-Extrapolatable Transformer},
author = {Yutao Sun and Li Dong and Barun Patra and Shuming Ma and Shaohan Huang and Alon Benhaim and Vishrav Chaudhary and Xia Song and Furu Wei},
year = {2022}
}
```
```bibtex
@Article{AlphaFold2021,
author = {Jumper, John and Evans, Richard and Pritzel, Alexander and Green, Tim and Figurnov, Michael and Ronneberger, Olaf and Tunyasuvunakool, Kathryn and Bates, Russ and {\v{Z}}{\'\i}dek, Augustin and Potapenko, Anna and Bridgland, Alex and Meyer, Clemens and Kohl, Simon A A and Ballard, Andrew J and Cowie, Andrew and Romera-Paredes, Bernardino and Nikolov, Stanislav and Jain, Rishub and Adler, Jonas and Back, Trevor and Petersen, Stig and Reiman, David and Clancy, Ellen and Zielinski, Michal and Steinegger, Martin and Pacholska, Michalina and Berghammer, Tamas and Bodenstein, Sebastian and Silver, David and Vinyals, Oriol and Senior, Andrew W and Kavukcuoglu, Koray and Kohli, Pushmeet and Hassabis, Demis},
journal = {Nature},
title = {Highly accurate protein structure prediction with {AlphaFold}},
year = {2021},
doi = {10.1038/s41586-021-03819-2},
note = {(Accelerated article preview)},
}
```
```bibtex
@software{peng_bo_2021_5196578,
author = {PENG Bo},
title = {BlinkDL/RWKV-LM: 0.01},
month = {aug},
year = {2021},
publisher = {Zenodo},
version = {0.01},
doi = {10.5281/zenodo.5196578},
url = {https://doi.org/10.5281/zenodo.5196578}
}
```
```bibtex
@misc{csordás2021devil,
title = {The Devil is in the Detail: Simple Tricks Improve Systematic Generalization of Transformers},
author = {Róbert Csordás and Kazuki Irie and Jürgen Schmidhuber},
year = {2021},
eprint = {2108.12284},
archivePrefix = {arXiv},
primaryClass = {cs.LG}
}
```
```bibtex
@misc{so2021primer,
title = {Primer: Searching for Efficient Transformers for Language Modeling},
author = {David R. So and Wojciech Mańke and Hanxiao Liu and Zihang Dai and Noam Shazeer and Quoc V. Le},
year = {2021},
eprint = {2109.08668},
archivePrefix = {arXiv},
primaryClass = {cs.LG}
}
```
```bibtex
@misc{ding2021erniedoc,
title = {ERNIE-Doc: A Retrospective Long-Document Modeling Transformer},
author = {Siyu Ding and Junyuan Shang and Shuohuan Wang and Yu Sun and Hao Tian and Hua Wu and Haifeng Wang},
year = {2021},
eprint = {2012.15688},
archivePrefix = {arXiv},
primaryClass = {cs.CL}
}
```
```bibtex
@misc{ding2021cogview,
title = {CogView: Mastering Text-to-Image Generation via Transformers},
author = {Ming Ding and Zhuoyi Yang and Wenyi Hong and Wendi Zheng and Chang Zhou and Da Yin and Junyang Lin and Xu Zou and Zhou Shao and Hongxia Yang and Jie Tang},
year = {2021},
eprint = {2105.13290},
archivePrefix = {arXiv},
primaryClass = {cs.CV}
}
```
```bibtex
@inproceedings{anonymous2022normformer,
title = {NormFormer: Improved Transformer Pretraining with Extra Normalization},
author = {Anonymous},
booktitle = {Submitted to The Tenth International Conference on Learning Representations },
year = {2022},
url = {https://openreview.net/forum?id=GMYWzWztDx5},
note = {under review}
}
```
```bibtex
@misc{henry2020querykey,
title = {Query-Key Normalization for Transformers},
author = {Alex Henry and Prudhvi Raj Dachapally and Shubham Pawar and Yuxuan Chen},
year = {2020},
eprint = {2010.04245},
archivePrefix = {arXiv},
primaryClass = {cs.CL}
}
```
```bibtex
@misc{liu2021swin,
title = {Swin Transformer V2: Scaling Up Capacity and Resolution},
author = {Ze Liu and Han Hu and Yutong Lin and Zhuliang Yao and Zhenda Xie and Yixuan Wei and Jia Ning and Yue Cao and Zheng Zhang and Li Dong and Furu Wei and Baining Guo},
year = {2021},
eprint = {2111.09883},
archivePrefix = {arXiv},
primaryClass = {cs.CV}
}
```
```bibtex
@article{Haviv2022TransformerLM,
title = {Transformer Language Models without Positional Encodings Still Learn Positional Information},
author = {Adi Haviv and Ori Ram and Ofir Press and Peter Izsak and Omer Levy},
journal = {ArXiv},
year = {2022},
volume = {abs/2203.16634}
}
```
```bibtex
@article{chowdhery2022PaLM,
title = {PaLM: Scaling Language Modeling with Pathways},
author = {Chowdhery, Aakanksha and others},
year = {2022}
}
```
```bibtex
@article{Shazeer2019FastTD,
title = {Fast Transformer Decoding: One Write-Head is All You Need},
author = {Noam M. Shazeer},
journal = {ArXiv},
year = {2019},
volume = {abs/1911.02150}
}
```
```bibtex
@article{Ainslie2023GQATG,
title = {GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints},
author = {Joshua Ainslie and James Lee-Thorp and Michiel de Jong and Yury Zemlyanskiy and Federico Lebr\'on and Sumit K. Sanghai},
journal = {ArXiv},
year = {2023},
volume = {abs/2305.13245},
url = {https://api.semanticscholar.org/CorpusID:258833177}
}
```
```bibtex
@misc{schlag2020enhancing,
title = {Enhancing the Transformer with explicit relational encoding for math problem solving},
author = {Imanol Schlag and Paul Smolensky and Roland Fernandez and Nebojsa Jojic and J{\"u}rgen Schmidhuber and Jianfeng Gao},
year = {2020},
url = {https://openreview.net/forum?id=B1xfElrKPr}
}
```
```bibtex
@article{Liu2022FCMFC,
title = {FCM: Forgetful Causal Masking Makes Causal Language Models Better Zero-Shot Learners},
author = {Hao Liu and Xinyang Geng and Lisa Lee and Igor Mordatch and Sergey Levine and Sharan Narang and P. Abbeel},
journal = {ArXiv},
year = {2022},
volume = {abs/2210.13432}
}
```
```bibtex
@inproceedings{Huang2016DeepNW,
title = {Deep Networks with Stochastic Depth},
author = {Gao Huang and Yu Sun and Zhuang Liu and Daniel Sedra and Kilian Q. Weinberger},
booktitle = {European Conference on Computer Vision},
year = {2016}
}
```
```bibtex
@inproceedings{Hua2022TransformerQI,
title = {Transformer Quality in Linear Time},
author = {Weizhe Hua and Zihang Dai and Hanxiao Liu and Quoc V. Le},
booktitle = {International Conference on Machine Learning},
year = {2022}
}
```
```bibtex
@article{Chang2022MaskGITMG,
title = {MaskGIT: Masked Generative Image Transformer},
author = {Huiwen Chang and Han Zhang and Lu Jiang and Ce Liu and William T. Freeman},
journal = {2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2022},
pages = {11305-11315}
}
```
```bibtex
@article{Lezama2022ImprovedMI,
title = {Improved Masked Image Generation with Token-Critic},
author = {Jos{\'e} Lezama and Huiwen Chang and Lu Jiang and Irfan Essa},
journal = {ArXiv},
year = {2022},
volume = {abs/2209.04439}
}
```
```bibtex
@misc{https://doi.org/10.48550/arxiv.2302.01327,
doi = {10.48550/ARXIV.2302.01327},
url = {https://arxiv.org/abs/2302.01327},
author = {Kumar, Manoj and Dehghani, Mostafa and Houlsby, Neil},
title = {Dual PatchNorm},
publisher = {arXiv},
year = {2023},
copyright = {Creative Commons Attribution 4.0 International}
}
```
```bibtex
@inproceedings{dao2022flashattention,
title = {Flash{A}ttention: Fast and Memory-Efficient Exact Attention with {IO}-Awareness},
author = {Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and R{\'e}, Christopher},
booktitle = {Advances in Neural Information Processing Systems},
year = {2022}
}
```
```bibtex
@article{Xie2023ResiDualTW,
title = {ResiDual: Transformer with Dual Residual Connections},
author = {Shufang Xie and Huishuai Zhang and Junliang Guo and Xu Tan and Jiang Bian and Hany Hassan Awadalla and Arul Menezes and Tao Qin and Rui Yan},
journal = {ArXiv},
year = {2023},
volume = {abs/2304.14802}
}
```
```bibtex
@inproceedings{Dehghani2023ScalingVT,
title = {Scaling Vision Transformers to 22 Billion Parameters},
author = {Mostafa Dehghani and Josip Djolonga and Basil Mustafa and Piotr Padlewski and Jonathan Heek and Justin Gilmer and Andreas Steiner and Mathilde Caron and Robert Geirhos and Ibrahim M. Alabdulmohsin and Rodolphe Jenatton and Lucas Beyer and Michael Tschannen and Anurag Arnab and Xiao Wang and Carlos Riquelme and Matthias Minderer and Joan Puigcerver and Utku Evci and Manoj Kumar and Sjoerd van Steenkiste and Gamaleldin F. Elsayed and Aravindh Mahendran and Fisher Yu and Avital Oliver and Fantine Huot and Jasmijn Bastings and Mark Collier and Alexey A. Gritsenko and Vighnesh Birodkar and Cristina Nader Vasconcelos and Yi Tay and Thomas Mensink and Alexander Kolesnikov and Filip Paveti\'c and Dustin Tran and Thomas Kipf and Mario Lu\v{c}i\'c and Xiaohua Zhai and Daniel Keysers and Jeremiah Harmsen and Neil Houlsby},
year = {2023}
}
```
```bibtex
@article{Beyer2022BetterPV,
title = {Better plain ViT baselines for ImageNet-1k},
author = {Lucas Beyer and Xiaohua Zhai and Alexander Kolesnikov},
journal = {ArXiv},
year = {2022},
volume = {abs/2205.01580}
}
```
```bibtex
@article{Kazemnejad2023TheIO,
title = {The Impact of Positional Encoding on Length Generalization in Transformers},
author = {Amirhossein Kazemnejad and Inkit Padhi and Karthikeyan Natesan Ramamurthy and Payel Das and Siva Reddy},
journal = {ArXiv},
year = {2023},
volume = {abs/2305.19466}
}
```
```bibtex
@misc{bloc97-2023,
title = {NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation.},
author = {/u/bloc97},
url = {https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/}
}
```
```bibtex
@inproceedings{Zoph2022STMoEDS,
title = {ST-MoE: Designing Stable and Transferable Sparse Expert Models},
author = {Barret Zoph and Irwan Bello and Sameer Kumar and Nan Du and Yanping Huang and Jeff Dean and Noam M. Shazeer and William Fedus},
year = {2022}
}
```
```bibtex
@article{Lan2019ALBERTAL,
title = {ALBERT: A Lite BERT for Self-supervised Learning of Language Representations},
author = {Zhenzhong Lan and Mingda Chen and Sebastian Goodman and Kevin Gimpel and Piyush Sharma and Radu Soricut},
journal = {ArXiv},
year = {2019},
volume = {abs/1909.11942},
url = {https://api.semanticscholar.org/CorpusID:202888986}
}
```
```bibtex
@inproceedings{Li2022ContrastiveDO,
title = {Contrastive Decoding: Open-ended Text Generation as Optimization},
author = {Xiang Lisa Li and Ari Holtzman and Daniel Fried and Percy Liang and Jason Eisner and Tatsunori Hashimoto and Luke Zettlemoyer and Mike Lewis},
booktitle = {Annual Meeting of the Association for Computational Linguistics},
year = {2022},
url = {https://api.semanticscholar.org/CorpusID:253157949}
}
```
```bibtex
@inproceedings{OBrien2023ContrastiveDI,
title = {Contrastive Decoding Improves Reasoning in Large Language Models},
author = {Sean O'Brien and Mike Lewis},
year = {2023},
url = {https://api.semanticscholar.org/CorpusID:261884427}
}
```
```bibtex
@inproceedings{Darcet2023VisionTN,
title = {Vision Transformers Need Registers},
author = {Timoth\'ee Darcet and Maxime Oquab and Julien Mairal and Piotr Bojanowski},
year = {2023},
url = {https://api.semanticscholar.org/CorpusID:263134283}
}
```
```bibtex
@article{Bondarenko2023QuantizableTR,
title = {Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing},
author = {Yelysei Bondarenko and Markus Nagel and Tijmen Blankevoort},
journal = {ArXiv},
year = {2023},
volume = {abs/2306.12929},
url = {https://api.semanticscholar.org/CorpusID:259224568}
}
```
```bibtex
@inproceedings{Golkar2023xValAC,
title = {xVal: A Continuous Number Encoding for Large Language Models},
author = {Siavash Golkar and Mariel Pettee and Michael Eickenberg and Alberto Bietti and M. Cranmer and G{\'e}raud Krawezik and Francois Lanusse and Michael McCabe and Ruben Ohana and Liam Parker and Bruno R{\'e}galdo-Saint Blancard and Tiberiu Teşileanu and Kyunghyun Cho and Shirley Ho},
year = {2023},
url = {https://api.semanticscholar.org/CorpusID:263622222}
}
```
```bibtex
@article{Rafailov2023DirectPO,
title = {Direct Preference Optimization: Your Language Model is Secretly a Reward Model},
author = {Rafael Rafailov and Archit Sharma and Eric Mitchell and Stefano Ermon and Christopher D. Manning and Chelsea Finn},
journal = {ArXiv},
year = {2023},
volume = {abs/2305.18290},
url = {https://api.semanticscholar.org/CorpusID:258959321}
}
```

solve intelligence… then use that to solve everything else. – Demis Hassabis

Original article by CSDN. If you repost it, please cite the source: https://www.sudun.com/ask/92775.html