The goal of this blog post is to recreate the Transformer from the “Attention Is All You Need” paper and train it on a toy task.

Resources:

  1. https://jalammar.github.io/illustrated-transformer/
  2. https://nlp.seas.harvard.edu/annotated-transformer/
  3. https://www.youtube.com/watch?v=ISNdQcPhsts&list=PLbFaFS2XKvIbugaadbmQTPdambWz6a5-1&index=2
!pip install datasets tokenizers tqdm pandas
# imports
import torch
import math
import torch.nn as nn
from datasets import load_dataset
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace
from torch.utils.data import Dataset, DataLoader, random_split
from pathlib import Path
from torch.utils.tensorboard import SummaryWriter
from tqdm import tqdm
import warnings
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
import pandas as pd

Part 1: The model’s architecture

Input Embedding

class InputEmbeddings(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.d_model = d_model
        self.vocab_size = vocab_size
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, x):
        return self.embedding(x) * math.sqrt(self.d_model)

Positional Encoding

class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, seq_len: int, dropout: float):
        super().__init__()
        self.d_model = d_model
        self.seq_len = seq_len
        self.dropout = nn.Dropout(dropout)

        # Create a matrix of shape (seq_len, d_model)
        pe = torch.zeros(seq_len, d_model)
        # Create a vector of shape (seq_len)
        position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1) # (seq_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)) # (d_model/2)

        #Apply the sin to even positions, cos to odd positions
        pe[:, 0::2] = torch.sin(position * div_term) # (seq_len, d_model/2)
        pe[:, 1::2] = torch.cos(position * div_term) # (seq_len, d_model/2)

        pe = pe.unsqueeze(0) # (1, seq_len, d_model)

        self.register_buffer('pe', pe)

    def forward(self,x):
        x = x + (self.pe[:, :x.shape[1],:]).requires_grad_(False)
        return self.dropout(x)
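To make the vectorized sin/cos assignment above concrete, here is a scalar sketch of the same formula (pure Python, no torch): PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).

```python
import math

def pe_value(pos: int, i: int, d_model: int) -> float:
    # Scalar version of the sinusoidal encoding: sin on even indices,
    # cos on odd indices; odd index 2i+1 reuses the even exponent 2i.
    angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
    return math.sin(angle) if i % 2 == 0 else math.cos(angle)

d_model = 8
row0 = [pe_value(0, i, d_model) for i in range(d_model)]
print(row0)  # position 0: all sin entries are 0.0, all cos entries are 1.0
```

Position 0 is the easiest sanity check: every sin term is 0 and every cos term is 1, exactly what the `pe` buffer above holds in its first row.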

Layer Normalization - the “Norm” in the Transformer’s Add & Norm step

class LayerNormalization(nn.Module):
    def __init__(self, eps: float = 10**-6) -> None:
        super().__init__()
        self.eps = eps
        self.alpha = nn.Parameter(torch.ones(1)) # Multiplied
        self.bias = nn.Parameter(torch.zeros(1)) # Added

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        std = x.std(dim=-1, keepdim=True)
        return self.alpha * (x - mean) / (std + self.eps) + self.bias
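As a sanity check on the forward pass: layer normalization centers and rescales each feature vector, with the learnable `alpha` multiplying the normalized values and `bias` shifting them. A minimal pure-Python sketch (using the sample standard deviation, matching `torch.std`):

```python
import math

def layer_norm(xs, eps=1e-6, alpha=1.0, bias=0.0):
    # Normalize one feature vector: subtract the mean, divide by the
    # (sample) standard deviation, then apply the learnable scale/shift.
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)
    std = math.sqrt(var)
    return [alpha * (x - mean) / (std + eps) + bias for x in xs]

out = layer_norm([1.0, 2.0, 3.0, 4.0])
print(out)  # roughly symmetric around 0 with unit spread
```

With the default `alpha=1, bias=0` the output sums to (essentially) zero, which is the property the residual stream relies on.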

FeedForward Layer - Used in both encoder and decoder

class FeedForwardBlock(nn.Module):
    def __init__(self, d_model:int, d_ff:int, dropout:float) -> None:
        super().__init__()
        self.linear_1 = nn.Linear(d_model, d_ff) # W1, B1
        self.dropout = nn.Dropout(dropout)
        self.linear_2 = nn.Linear(d_ff, d_model) # W2, B2

    def forward(self, x):
        # Input: (batch, seq_len, d_model)
        return self.linear_2(self.dropout(torch.relu(self.linear_1(x))))

Multi-Head Attention Block

# Seq = sequence length
# d_model = size of the embedding vector
# h = number of heads
# dk = dv = d_model/h
class MultiHeadAttentionBlock(nn.Module):
    def __init__(self, d_model: int, h: int, dropout:float) -> None:
        super().__init__()
        self.d_model = d_model
        self.h = h
        assert d_model % h == 0, "d_model is not divisible by h"
        self.d_k = d_model // h

        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)

    @staticmethod
    def attention(query, key, value, mask, dropout: nn.Dropout):
        d_k = query.shape[-1]
        # (Batch, h, seq_len, d_k) -> (Batch, h, seq_len, seq_len)
        attention_scores = (query @ key.transpose(-2,-1)) / math.sqrt(d_k)
        if mask is not None:
            attention_scores = attention_scores.masked_fill(mask == 0, -1e9)

        attention_scores = attention_scores.softmax(dim=-1) # (Batch, h, seq_len, seq_len)

        if dropout is not None:
            attention_scores = dropout(attention_scores)
        return (attention_scores @ value), attention_scores # attention_scores is used for visualization



    def forward(self, q, k, v, mask):
        query = self.w_q(q) # (batch, seq_len, d_model) -> (batch, seq_len, d_model)
        key = self.w_k(k) # (batch, seq_len, d_model) -> (batch, seq_len, d_model)
        value = self.w_v(v) # (batch, seq_len, d_model) -> (batch, seq_len, d_model)

        # (batch, seq_len, d_model) -> (Batch, seq_len, h, d_k) -> (Batch, h, seq_len, d_k)
        query = query.view(query.shape[0], query.shape[1], self.h, self.d_k).transpose(1,2)
        key = key.view(key.shape[0], key.shape[1], self.h, self.d_k).transpose(1,2)
        value = value.view(value.shape[0], value.shape[1], self.h, self.d_k).transpose(1,2)

        x, self.attention_scores = MultiHeadAttentionBlock.attention(query, key, value, mask, self.dropout)

        # (Batch, h, seq_len, d_k) -> (Batch, seq_len, h, d_k) -> (Batch, seq_len, d_model)
        x =  x.transpose(1, 2).contiguous().view(x.shape[0], -1, self.h * self.d_k)

        # (Batch, seq_len, d_model)  -> (Batch, seq_len, d_model)
        return self.w_o(x)
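The core of the `attention()` staticmethod is scaled dot-product attention. A single-head, pure-Python sketch (no batching, no mask) makes the computation explicit — each query scores every key, the scores are softmaxed, and the values are averaged with those weights:

```python
import math

def softmax(xs):
    # Numerically stable softmax over one row of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V are lists of d_k-dimensional vectors (one row per position).
    d_k = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)  # one attention row, sums to 1
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(scaled_dot_product_attention(Q, K, V))
```

Here each query aligns with its matching key, so the first output row leans toward the first value vector and the second toward the second — the tensorized version above does exactly this, just over `(Batch, h, seq_len, d_k)` at once.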

Residual Connection

class ResidualConnection(nn.Module):
    def __init__(self, dropout:float) -> None:
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.norm = LayerNormalization()

    def forward(self, x, sublayer):
        return x + self.dropout(sublayer(self.norm(x)))

Encoder

class EncoderBlock(nn.Module):
    def __init__(self, self_attention_block: MultiHeadAttentionBlock, feed_forward_block: FeedForwardBlock, dropout: float) -> None:
        super().__init__()
        self.self_attention_block = self_attention_block
        self.feed_forward_block = feed_forward_block
        self.residual_connections = nn.ModuleList([ResidualConnection(dropout) for _ in range(2)])

    def forward(self, x, src_mask):
        x = self.residual_connections[0](x, lambda x: self.self_attention_block(x, x, x, src_mask))
        x = self.residual_connections[1](x, self.feed_forward_block)
        return x
class Encoder(nn.Module):
    def __init__(self, layers: nn.ModuleList) -> None:
        super().__init__()
        self.layers = layers
        self.norm = LayerNormalization()

    def forward(self, x, mask):
        for layer in self.layers:
            x = layer(x, mask)

        return self.norm(x)

Decoder

class DecoderBlock(nn.Module):
    def __init__(self, self_attention_block: MultiHeadAttentionBlock, cross_attention_block: MultiHeadAttentionBlock, feed_forward_block: FeedForwardBlock, dropout: float) -> None:
        super().__init__()
        self.self_attention_block = self_attention_block
        self.cross_attention_block = cross_attention_block
        self.feed_forward_block = feed_forward_block
        self.residual_connections = nn.ModuleList([ResidualConnection(dropout) for _ in range(3)])


    def forward(self, x, encoder_output, src_mask, tgt_mask):
        x = self.residual_connections[0](x, lambda x: self.self_attention_block(x,x,x, tgt_mask))
        x = self.residual_connections[1](x, lambda x: self.cross_attention_block(x, encoder_output, encoder_output, src_mask))
        x = self.residual_connections[2](x, self.feed_forward_block)
        return x
class Decoder(nn.Module):
    def __init__(self, layers: nn.ModuleList) -> None:
        super().__init__()
        self.layers = layers
        self.norm = LayerNormalization()

    def forward(self, x, encoder_output, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, encoder_output, src_mask, tgt_mask)
        return self.norm(x)

Projection (Linear) Layer

class ProjectionLayer(nn.Module):
    def __init__(self, d_model: int, vocab_size: int) -> None:
        super().__init__()
        self.proj = nn.Linear(d_model,vocab_size)

    def forward(self, x):
        # (batch, seq_len, d_model) -> (batch, seq_len, vocab_size)
        return torch.log_softmax(self.proj(x), dim=-1)
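`log_softmax` turns the logits into log-probabilities along the vocabulary dimension: exponentiating any row gives a distribution that sums to 1. A scalar sketch of the same operation:

```python
import math

def log_softmax(xs):
    # log softmax via the log-sum-exp trick (numerically stable).
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]

logits = [2.0, 1.0, 0.1]
logp = log_softmax(logits)
print(logp)  # exponentiating these gives probabilities summing to 1
```

All entries are negative (log-probabilities), and the largest logit keeps the largest log-probability, which is what greedy decoding later exploits.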

Transformer

class Transformer(nn.Module):
    def __init__(self, encoder: Encoder, decoder: Decoder, src_embed: InputEmbeddings, tgt_embed: InputEmbeddings, src_pos: PositionalEncoding, tgt_pos: PositionalEncoding, projection_layer: ProjectionLayer):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.src_pos = src_pos
        self.tgt_pos = tgt_pos
        self.projection_layer = projection_layer

    def encode(self, src, src_mask):
        src = self.src_embed(src)
        src = self.src_pos(src)
        return self.encoder(src, src_mask)

    def decode(self, encoder_output, src_mask, tgt, tgt_mask):
        tgt = self.tgt_embed(tgt)
        tgt = self.tgt_pos(tgt)
        return self.decoder(tgt, encoder_output, src_mask, tgt_mask)

    def project(self, x):
        return self.projection_layer(x)

Build the full transformer

def build_transformer(src_vocab_size: int, tgt_vocab_size: int, src_seq_len: int, tgt_seq_len: int, d_model=512, N: int=6, h:int=8, dropout:float = 0.1, d_ff=2048) -> Transformer:
    # Create embedding layers
    src_embed = InputEmbeddings(d_model, src_vocab_size)
    tgt_embed = InputEmbeddings(d_model, tgt_vocab_size)

    # Create positional Encoding layers
    src_pos = PositionalEncoding(d_model, src_seq_len, dropout)
    tgt_pos = PositionalEncoding(d_model, tgt_seq_len, dropout)

    # Create Encoder blocks
    encoder_blocks = []
    for _ in range(N):
        encoder_self_attention_block = MultiHeadAttentionBlock(d_model, h, dropout)
        feed_forward_block = FeedForwardBlock(d_model, d_ff, dropout)
        encoder_block = EncoderBlock(encoder_self_attention_block, feed_forward_block, dropout)
        encoder_blocks.append(encoder_block)

    # Create Decoder blocks
    decoder_blocks = []
    for _ in range(N):
        decoder_self_attention_block = MultiHeadAttentionBlock(d_model, h, dropout)
        decoder_cross_attention_block = MultiHeadAttentionBlock(d_model, h, dropout)
        feed_forward_block = FeedForwardBlock(d_model, d_ff, dropout)
        decoder_block = DecoderBlock(decoder_self_attention_block, decoder_cross_attention_block, feed_forward_block, dropout)
        decoder_blocks.append(decoder_block)

    # Create encoder and decoder
    encoder = Encoder(nn.ModuleList(encoder_blocks))
    decoder = Decoder(nn.ModuleList(decoder_blocks))

    # Create the projection layer
    projection_layer = ProjectionLayer(d_model, tgt_vocab_size)

    transformer = Transformer(encoder, decoder, src_embed, tgt_embed, src_pos, tgt_pos, projection_layer)

    #Initialize the params
    for p in transformer.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)

    return transformer

Part 2: The dataset

import torch
from torch.utils.data import Dataset

def causal_mask(size):
    mask = torch.triu(torch.ones(1, size, size), diagonal=1).type(torch.int)
    return mask == 0
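`causal_mask` produces a lower-triangular boolean pattern: position i may attend to positions 0..i (True) but is blocked from all future positions (False). The equivalent pure-Python logic:

```python
def causal_mask_rows(size):
    # Row i is True for columns 0..i and False afterwards,
    # matching (mask == 0) on the upper-triangular matrix above.
    return [[col <= row for col in range(size)] for row in range(size)]

for row in causal_mask_rows(4):
    print(row)
```

During decoder self-attention this mask zeroes out (via the -1e9 fill) every score where a position would peek at a later token.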

class BilingualDataset(Dataset):
    def __init__(self, ds, tokenizer_src, tokenizer_tgt, src_lang, tgt_lang, seq_len) -> None:
        super().__init__()
        self.ds = ds
        self.seq_len = seq_len
        self.tokenizer_src = tokenizer_src
        self.tokenizer_tgt = tokenizer_tgt
        self.src_lang = src_lang
        self.tgt_lang = tgt_lang

        # Pre-calculate special token IDs
        self.sos_token = torch.tensor([tokenizer_tgt.token_to_id("[SOS]")], dtype=torch.int64)
        self.eos_token = torch.tensor([tokenizer_tgt.token_to_id("[EOS]")], dtype=torch.int64)
        self.pad_token = torch.tensor([tokenizer_tgt.token_to_id("[PAD]")], dtype=torch.int64)

    def __len__(self):
        return len(self.ds)

    def __getitem__(self, index):
        src_target_pair = self.ds[index]
        src_text = src_target_pair['translation'][self.src_lang]
        tgt_text = src_target_pair['translation'][self.tgt_lang]

        # Transform the text into tokens
        enc_input_tokens = self.tokenizer_src.encode(src_text).ids
        dec_input_tokens = self.tokenizer_tgt.encode(tgt_text).ids

        # Add sos, eos and padding to each sentence
        enc_num_padding_tokens = self.seq_len - len(enc_input_tokens) - 2  # We add 2 tokens ([SOS], [EOS])
        # We only add 1 token ([SOS]) to the decoder input
        dec_num_padding_tokens = self.seq_len - len(dec_input_tokens) - 1

        # Make sure the number of padding tokens is not negative. If it is, the sentence is too long
        if enc_num_padding_tokens < 0 or dec_num_padding_tokens < 0:
            raise ValueError("Sentence is too long")

        # Add SOS and EOS token
        encoder_input = torch.cat(
            [
                self.sos_token,
                torch.tensor(enc_input_tokens, dtype=torch.int64),
                self.eos_token,
                torch.full((enc_num_padding_tokens,), self.pad_token.item(), dtype=torch.int64),
            ],
            dim=0,
        )

        # Add only SOS token
        decoder_input = torch.cat(
            [
                self.sos_token,
                torch.tensor(dec_input_tokens, dtype=torch.int64),
                torch.full((dec_num_padding_tokens,), self.pad_token.item(), dtype=torch.int64),
            ],
            dim=0,
        )

        # Add only EOS token (Label is what we expect as output from the decoder)
        label = torch.cat(
            [
                torch.tensor(dec_input_tokens, dtype=torch.int64),
                self.eos_token,
                torch.full((dec_num_padding_tokens,), self.pad_token.item(), dtype=torch.int64),
            ],
            dim=0,
        )

        # Double check the size of the tensors to make sure they are all seq_len long
        assert encoder_input.size(0) == self.seq_len
        assert decoder_input.size(0) == self.seq_len
        assert label.size(0) == self.seq_len

        return {
            "encoder_input": encoder_input,  # (seq_len)
            "decoder_input": decoder_input,  # (seq_len)
            "encoder_mask": (encoder_input != self.pad_token).unsqueeze(0).unsqueeze(0).int(), # (1, 1, seq_len)
            "decoder_mask": (decoder_input != self.pad_token).unsqueeze(0).int() & causal_mask(decoder_input.size(0)), # (1, seq_len) & (1, seq_len, seq_len)
            "label": label,  # (seq_len)
            "src_text": src_text,
            "tgt_text": tgt_text,
        }
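The padding arithmetic in `__getitem__` is easy to verify by hand. With hypothetical lengths (`seq_len = 10`, a 5-token source sentence, a 6-token target sentence):

```python
# Hypothetical lengths to illustrate the padding arithmetic above.
seq_len = 10
enc_tokens = 5   # source sentence tokenized into 5 ids
dec_tokens = 6   # target sentence tokenized into 6 ids

# Encoder input holds [SOS] + tokens + [EOS] + padding.
enc_padding = seq_len - enc_tokens - 2
# Decoder input holds [SOS] + tokens + padding ([EOS] goes in the label).
dec_padding = seq_len - dec_tokens - 1

print(enc_padding, dec_padding)  # 3 3
```

Both sequences come out exactly `seq_len` long, which is what the three asserts at the end of `__getitem__` check.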
def get_config():
    return {
        "batch_size": 8,
        "num_epochs": 20,
        "lr": 10**-4,
        "seq_len": 350,
        "d_model": 512,
        "lang_src": "en",  # Must match CSV header
        "lang_tgt": "de",  # Must match CSV header
        "model_folder": "weights",
        "model_filename": "tmodel",
        "preload": None,
        "tokenizer_file": "tokenizer_{0}.json",
        "experiment_name": "runs/tmodel"
    }

def get_all_sentences(df, lang):
    for text in df[lang]:
        yield str(text)

def get_or_build_tokenizer(config, df, lang):
    tokenizer_path = Path(config['tokenizer_file'].format(lang))
    if not Path.exists(tokenizer_path):
        # Using BPE as per the original paper's philosophy
        tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
        tokenizer.pre_tokenizer = Whitespace()
        trainer = BpeTrainer(special_tokens=["[UNK]", "[PAD]", "[SOS]", "[EOS]"], min_frequency=2)
        tokenizer.train_from_iterator(get_all_sentences(df, lang), trainer=trainer)
        tokenizer.save(str(tokenizer_path))
    else:
        tokenizer = Tokenizer.from_file(str(tokenizer_path))
    return tokenizer

def get_ds(config):
    # Load your local CSV files
    train_df = pd.read_csv('../datasets/wmt14_translate_de-en_train.csv',
                           nrows=500, # Start with 500 rows for quick testing
                           on_bad_lines='skip',
                           engine='python')

    val_df = pd.read_csv('../datasets/wmt14_translate_de-en_validation.csv',
                         nrows=500,
                         on_bad_lines='skip',
                         engine='python')

    # Build tokenizers using the training data
    # Note: 'en' and 'de' are standard column names for WMT14
    tokenizer_src = get_or_build_tokenizer(config, train_df, config['lang_src'])
    tokenizer_tgt = get_or_build_tokenizer(config, train_df, config['lang_tgt'])

    # Format data for BilingualDataset (creating the 'translation' dict structure)
    train_ds_raw = train_df.to_dict('records')
    val_ds_raw = val_df.to_dict('records')

    # Wrap each row so items expose item['translation'][lang],
    # matching the indexing BilingualDataset expects
    train_ds_formatted = [{'translation': r} for r in train_ds_raw]
    val_ds_formatted = [{'translation': r} for r in val_ds_raw]

    train_ds = BilingualDataset(train_ds_formatted, tokenizer_src, tokenizer_tgt, config['lang_src'], config['lang_tgt'], config['seq_len'])
    val_ds = BilingualDataset(val_ds_formatted, tokenizer_src, tokenizer_tgt, config['lang_src'], config['lang_tgt'], config['seq_len'])

    train_dataloader = DataLoader(train_ds, batch_size=config['batch_size'], shuffle=True)
    val_dataloader = DataLoader(val_ds, batch_size=1, shuffle=True)

    return train_dataloader, val_dataloader, tokenizer_src, tokenizer_tgt
def get_model(config, vocab_src_len, vocab_tgt_len):
    # Calls the build_transformer function defined in Part 1
    model = build_transformer(
        vocab_src_len,
        vocab_tgt_len,
        config['seq_len'],
        config['seq_len'],
        config['d_model']
    )
    return model
def greedy_decode(model, source, source_mask, tokenizer_src, tokenizer_tgt, max_len, device):
    sos_idx = tokenizer_tgt.token_to_id('[SOS]')
    eos_idx = tokenizer_tgt.token_to_id('[EOS]')

    # Precompute the encoder output and reuse it for every token we get from the decoder
    encoder_output = model.encode(source, source_mask)
    # Initialize the decoder input with the start-of-sentence token
    decoder_input = torch.empty(1, 1).fill_(sos_idx).type_as(source).to(device)

    while True:
        if decoder_input.size(1) == max_len:
            break

        # Build mask for the target
        decoder_mask = causal_mask(decoder_input.size(1)).type_as(source_mask).to(device)

        # Calculate output
        out = model.decode(encoder_output, source_mask, decoder_input, decoder_mask)

        # Get next token
        prob = model.project(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        decoder_input = torch.cat(
            [decoder_input, torch.empty(1, 1).type_as(source).fill_(next_word.item()).to(device)], dim=1
        )

        if next_word == eos_idx:
            break

    return decoder_input.squeeze(0)
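`greedy_decode` always takes the argmax token and stops at `[EOS]` or `max_len`. Stripped of tensors, the loop reduces to this toy sketch (the `prob_rows` list stands in for the model's per-step output distributions; ids 0 and 1 play `[SOS]` and `[EOS]` here):

```python
def greedy_pick(prob_rows, sos_id=0, eos_id=1, max_len=10):
    # Toy greedy decoding: start from SOS, append the argmax token at
    # each step, stop on EOS or when max_len tokens have been emitted.
    out = [sos_id]
    for probs in prob_rows:
        next_id = max(range(len(probs)), key=lambda i: probs[i])
        out.append(next_id)
        if next_id == eos_id or len(out) == max_len:
            break
    return out

# Hypothetical per-step distributions over a 4-token vocabulary.
steps = [[0.1, 0.0, 0.8, 0.1],
         [0.0, 0.1, 0.2, 0.7],
         [0.1, 0.8, 0.1, 0.0]]
print(greedy_pick(steps))  # [0, 2, 3, 1] — stops after emitting EOS (id 1)
```

The real function does the same thing, except each "row" comes from re-running the decoder on the tokens generated so far.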
def run_validation(model, validation_ds, tokenizer_src, tokenizer_tgt, max_len, device, print_msg, global_step, writer, num_examples=2):
    model.eval()
    count = 0

    with torch.no_grad():
        for batch in validation_ds:
            count += 1
            encoder_input = batch['encoder_input'].to(device) # (b, seq_len)
            encoder_mask = batch['encoder_mask'].to(device) # (b, 1, 1, seq_len)

            # Check that the batch size is 1
            assert encoder_input.size(0) == 1, "Validation batch size must be 1"

            model_out = greedy_decode(model, encoder_input, encoder_mask, tokenizer_src, tokenizer_tgt, max_len, device)

            source_text = batch['src_text'][0]
            target_text = batch['tgt_text'][0]
            model_out_text = tokenizer_tgt.decode(model_out.detach().cpu().numpy())

            # Print the results to the console
            print_msg('-'*80)
            print_msg(f"{'SOURCE: ':>12}{source_text}")
            print_msg(f"{'TARGET: ':>12}{target_text}")
            print_msg(f"{'PREDICTED: ':>12}{model_out_text}")

            if count == num_examples:
                print_msg('-'*80)
                break

Part 3: The training loop

def train_model(config):
    # Define the device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device {device}")

    # Make sure the weights folder exists
    Path(config['model_folder']).mkdir(parents=True, exist_ok=True)

    train_dataloader, val_dataloader, tokenizer_src, tokenizer_tgt = get_ds(config)
    model = get_model(config, tokenizer_src.get_vocab_size(), tokenizer_tgt.get_vocab_size()).to(device)

    # Tensorboard
    writer = SummaryWriter(config['experiment_name'])

    # Adam Optimizer (Paper uses beta1=0.9, beta2=0.98, epsilon=10^-9)
    optimizer = torch.optim.Adam(model.parameters(), lr=config['lr'], eps=1e-9)

    # Initializing loss function with Label Smoothing (Paper uses 0.1)
    # ignore_index=tokenizer_tgt.token_to_id('[PAD]') ensures we don't learn to predict padding
    loss_fn = nn.CrossEntropyLoss(ignore_index=tokenizer_tgt.token_to_id('[PAD]'), label_smoothing=0.1).to(device)

    initial_epoch = 0
    global_step = 0

    for epoch in range(initial_epoch, config['num_epochs']):
        model.train()
        batch_iterator = tqdm(train_dataloader, desc=f"Processing Epoch {epoch:02d}")

        for batch in batch_iterator:
            encoder_input = batch['encoder_input'].to(device) # (B, seq_len)
            decoder_input = batch['decoder_input'].to(device) # (B, seq_len)
            encoder_mask = batch['encoder_mask'].to(device)   # (B, 1, 1, seq_len)
            decoder_mask = batch['decoder_mask'].to(device)   # (B, 1, seq_len, seq_len)

            # Run the tensors through the transformer
            encoder_output = model.encode(encoder_input, encoder_mask) # (B, seq_len, d_model)
            decoder_output = model.decode(encoder_output, encoder_mask, decoder_input, decoder_mask) # (B, seq_len, d_model)
            proj_output = model.project(decoder_output) # (B, seq_len, tgt_vocab_size)

            # Compare the output with the label
            label = batch['label'].to(device) # (B, seq_len)

            # (B, seq_len, tgt_vocab_size) -> (B * seq_len, tgt_vocab_size)
            loss = loss_fn(proj_output.view(-1, tokenizer_tgt.get_vocab_size()), label.view(-1))
            batch_iterator.set_postfix({"loss": f"{loss.item():6.3f}"})

            # Log the loss
            writer.add_scalar('train loss', loss.item(), global_step)
            writer.flush()

            # Backpropagate the loss
            loss.backward()

            # Update weights
            optimizer.step()
            optimizer.zero_grad()

            global_step += 1

        # Run validation at the end of every epoch
        run_validation(model, val_dataloader, tokenizer_src, tokenizer_tgt, config['seq_len'], device, lambda msg: batch_iterator.write(msg), global_step, writer)

        # Save the model
        model_filename = f"{config['model_folder']}/{config['model_filename']}{epoch:02d}.pt"
        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'global_step': global_step
        }, model_filename)
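One simplification in the loop above: it trains with a fixed `config['lr']`, whereas the paper pairs Adam with a warmup-then-decay schedule, lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5). A minimal sketch of that schedule using `torch.optim.lr_scheduler.LambdaLR` (the `d_model=512` and `warmup_steps=4000` values are the paper's defaults, and the `nn.Linear` is just a stand-in for the transformer):

```python
import torch
import torch.nn as nn

d_model, warmup_steps = 512, 4000

def noam_lr(step: int) -> float:
    # lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    step = max(step, 1)  # avoid division by zero on the scheduler's initial call
    return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)

model = nn.Linear(4, 4)  # stand-in for the transformer
# base lr=1.0 so the lambda's value IS the effective learning rate
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)
# ... then call scheduler.step() once per batch, after optimizer.step()
```

The learning rate rises linearly for the first `warmup_steps` batches, peaks at `d_model^-0.5 * warmup_steps^-0.5` (about 2.5e-4 here), then decays as step^-0.5.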
import gc

# 1. Force Python's garbage collector to drop unreferenced tensors first,
#    so their memory actually lands back in the caching allocator
gc.collect()

# 2. Clear the GPU cache, returning the freed blocks to the driver
torch.cuda.empty_cache()

# 3. Optional: reset peak memory statistics
torch.cuda.reset_peak_memory_stats()
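To confirm the cleanup actually freed memory, PyTorch exposes allocator statistics; a small check (guarded so it also runs on CPU-only machines):

```python
import torch

if torch.cuda.is_available():
    # Memory currently held by live tensors vs. the allocator's high-water mark
    print(f"Allocated: {torch.cuda.memory_allocated() / 1e6:.1f} MB")
    print(f"Peak:      {torch.cuda.max_memory_allocated() / 1e6:.1f} MB")
else:
    print("CUDA not available; nothing to inspect")
```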
# 1. Get the configuration
config = get_config()

# 2. Run the training
train_model(config)
Using device cuda


Processing Epoch 00: 100%|██████████| 63/63 [00:32<00:00,  1.96it/s, loss=7.749]


--------------------------------------------------------------------------------
    SOURCE: They refer in particular to article 220, which grants Al-Azhar University an advisory role, with particular reference to verifying the conformity of the laws with sharia.
    TARGET: Dabei haben sie besonders Artikel 220 im Auge, der der Universität Al-Azhar eine beratende Funktion zuerkennt, insbesondere in Bezug auf die Überprüfung der Konformität mit den Gesetzen der Scharia.
 PREDICTED: , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
--------------------------------------------------------------------------------
    SOURCE: The bitcoin exchange rate, after reaching a peak of 30 dollars (23 euros) in June 2011, fell to 2 dollars five months later, returning today to around a dozen dollars (rates are listed on the bitcoincharts.com site).
    TARGET: Nachdem der Kurs von Bitcoin im Juni 2011 einen Höchststand von 30 Dollar (23 Euro) erreicht hatte, fiel er fünf Monate später auf 2 Dollar, bevor er sich heute auf rund zehn Dollar erholt hat (die Kurse sind auf der Webseite bitcoincharts.com aufgeführt).
 PREDICTED: , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
--------------------------------------------------------------------------------


Processing Epoch 01: 100%|██████████| 63/63 [01:30<00:00,  1.43s/it, loss=7.384]


--------------------------------------------------------------------------------
    SOURCE: In the Carlsbad region, the roads have been usable this morning, though in some places they were icy and snowy.
    TARGET: Im Bezirk Karlsbad waren die Straßen heute Morgen gut befahrbar, vereinzelt herrschte Schnee- und Eisglätte.
 PREDICTED: , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
--------------------------------------------------------------------------------
    SOURCE: They refer in particular to article 220, which grants Al-Azhar University an advisory role, with particular reference to verifying the conformity of the laws with sharia.
    TARGET: Dabei haben sie besonders Artikel 220 im Auge, der der Universität Al-Azhar eine beratende Funktion zuerkennt, insbesondere in Bezug auf die Überprüfung der Konformität mit den Gesetzen der Scharia.
 PREDICTED: , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,
--------------------------------------------------------------------------------


Processing Epoch 02: 100%|██████████| 63/63 [01:26<00:00,  1.37s/it, loss=7.876]


--------------------------------------------------------------------------------
    SOURCE: We need to make art fashionable, as it was in the beginning of last century.
    TARGET: Kunst muss wieder modern sein, so wie zu Beginn des vorigen Jahrhunderts.
 PREDICTED:
--------------------------------------------------------------------------------
    SOURCE: - If we talk about your work with Xbox Kinect sensors, what are your complaints about modern cameras?
    TARGET: - Welche Ansprüche stellen Sie ausgehend von Ihrer Arbeit mit den Xbox Kinect-Sensoren an moderne Kameras?
 PREDICTED:
--------------------------------------------------------------------------------


Processing Epoch 03: 100%|██████████| 63/63 [01:27<00:00,  1.39s/it, loss=7.399]


--------------------------------------------------------------------------------
    SOURCE: At the time, he wrote two columns in the Murdoch press each week.
    TARGET: Zu jener Zeit schrieb er pro Woche zwei Kolumnen in der Murdoch-Presse.
 PREDICTED: . . . . . . . . . .
--------------------------------------------------------------------------------
    SOURCE: From the start, while the trial judge notes that creditors should be governed by the United States Bankruptcy Code, the Court of Appeal for the Fifth Circuit, based in New Orleans, states that the main action is the insolvency action handled in Mexico.
    TARGET: Während der Richter der Sache darauf hinweist, dass die Gläubiger dem Konkursrecht der Vereinigten Staaten unterliegen müssen, sagt das Berufungsgericht des Fifth Circuit mit Sitz in New Orleans zunächst, dass es sich bei dem Hauptverfahren um den in Mexiko behandelten Handelskonkurs handelt.
 PREDICTED: . . . . . . . . . .
--------------------------------------------------------------------------------


Processing Epoch 04: 100%|██████████| 63/63 [01:23<00:00,  1.33s/it, loss=7.395]


--------------------------------------------------------------------------------
    SOURCE: However, images are not to be expected until 2018 at the earliest.
    TARGET: Mit Bildern ist aber frühestens 2018 zu rechnen.
 PREDICTED:
--------------------------------------------------------------------------------
    SOURCE: "I work, I take part in a lot of activities, I travel, I have an active but responsible sex life, I take care of myself and the other person" said Fabrizio, who agreed to share his intimate secrets with MILENIO JALISCO, to motivate those people with his story who today, in the context of World AIDS Day, are afraid.
    TARGET: "Ich arbeite, unternehme viel, habe ein aktives, aber verantwortliches Sexleben, ich achte auf meine Gesundheit und die der anderen Person", zählt Fabrizio auf, der zustimmte, seine intimen Details mit MILENIO JALISCO zu teilen, um mit seinem Zeugnis jene zu animieren, die heute im Rahmen des weltweiten AIDS-Tages Angst haben.
 PREDICTED:
--------------------------------------------------------------------------------


Processing Epoch 05: 100%|██████████| 63/63 [01:23<00:00,  1.33s/it, loss=7.253]


--------------------------------------------------------------------------------
    SOURCE: Using such scales, researchers have discovered that boys tend to be bored more often than girls, said Stephen Vodanovich, a professor of psychology at the University of West Florida, especially when it comes needing more, and a variety of, external stimulation.
    TARGET: Mithilfe solcher Skalen haben Forscher herausgefunden, dass Jungen öfter langweilig wird als Mädchen, so Stephen Vodanovich, Professor für Psychologie an der Universität Westflorida - insbesondere, was den Bedarf nach mehr und einer größeren Vielfalt externer Reize betrifft.
 PREDICTED: .
--------------------------------------------------------------------------------
    SOURCE: "I work, I take part in a lot of activities, I travel, I have an active but responsible sex life, I take care of myself and the other person" said Fabrizio, who agreed to share his intimate secrets with MILENIO JALISCO, to motivate those people with his story who today, in the context of World AIDS Day, are afraid.
    TARGET: "Ich arbeite, unternehme viel, habe ein aktives, aber verantwortliches Sexleben, ich achte auf meine Gesundheit und die der anderen Person", zählt Fabrizio auf, der zustimmte, seine intimen Details mit MILENIO JALISCO zu teilen, um mit seinem Zeugnis jene zu animieren, die heute im Rahmen des weltweiten AIDS-Tages Angst haben.
 PREDICTED: .
--------------------------------------------------------------------------------


Processing Epoch 06: 100%|██████████| 63/63 [01:23<00:00,  1.32s/it, loss=7.282]


--------------------------------------------------------------------------------
    SOURCE: A king with 14 wives
    TARGET: Ein König mit 14 Ehefrauen
 PREDICTED:
--------------------------------------------------------------------------------
    SOURCE: "Baku became famous in May due to the Eurovision Song Contest, and next year we are having a festival to celebrate the oldest people in the world," said Fatulayeva.
    TARGET: "Baku wurde im Mai mit dem Eurovision-Song-Festival berühmt - und im nächsten Jahr haben wir hier ein Festival der ältesten Menschen der Welt", sagt Fatulayeva.
 PREDICTED:
--------------------------------------------------------------------------------


Processing Epoch 07: 100%|██████████| 63/63 [01:21<00:00,  1.30s/it, loss=6.644]


--------------------------------------------------------------------------------
    SOURCE: This star birth was captured by the Hubble telescope in the M83 spiral galaxy.
    TARGET: Diese Sternen-Geburt zeichnete das Hubble-Teleskop in der Spiralgalaxie M83 auf.
 PREDICTED:
--------------------------------------------------------------------------------
    SOURCE: Is the Svarc System prohibited or allowed?
    TARGET: Ist das Schwarz-System nun verboten oder erlaubt?
 PREDICTED:
--------------------------------------------------------------------------------


Processing Epoch 08: 100%|██████████| 63/63 [01:23<00:00,  1.33s/it, loss=6.832]


--------------------------------------------------------------------------------
    SOURCE: In addition, new data unveiled this week at a large physics Congress in Kyoto seem to confirm this, but there are still insufficient data to be perfectly sure.
    TARGET: Darüber hinaus scheint dies durch neue Daten bestätigt zu werden, die in dieser Woche auf einem großen Physik-Kongress in Kyoto vorgestellt wurden, aber es fehlen noch Daten, um darüber vollkommen sicher zu sein.
 PREDICTED:
--------------------------------------------------------------------------------
    SOURCE: This is a nationwide fund-raising event, which we have been planning for the past two years.
    TARGET: Es handelt sich um eine landesweite Benefizveranstaltung, die wir schon zwei Jahre planen.
 PREDICTED:
--------------------------------------------------------------------------------


Processing Epoch 09: 100%|██████████| 63/63 [01:21<00:00,  1.30s/it, loss=6.565]


--------------------------------------------------------------------------------
    SOURCE: Or make it worse, by increasing the required volume of calculations.
    TARGET: Oder das Problem verstärken, da mehr Berechnungen notwendig wären.
 PREDICTED:
--------------------------------------------------------------------------------
    SOURCE: Germany is still not used to it, the country continues to be uncomfortable and, for obvious reasons, still finds it difficult to play a more prominent role.
    TARGET: Deutschland ist noch nicht daran gewöhnt, das ist noch immer unangenehm und, aus offensichtlichen Gründen, fällt es dem Land noch immer schwer, eine größere Rolle zu spielen.
 PREDICTED:
--------------------------------------------------------------------------------


Processing Epoch 10: 100%|██████████| 63/63 [01:22<00:00,  1.30s/it, loss=6.869]


--------------------------------------------------------------------------------
    SOURCE: Incidentally, it is the most unusual sea in the world, located in the lowest point of the planet - 417 m below sea level.
    TARGET: Übrigens ist dieses ungewöhnliche Meer mit einer Lage von 417 Metern unter dem Meeresspiegel das am tiefsten gelegene Meer der Welt.
 PREDICTED:
--------------------------------------------------------------------------------
    SOURCE: For example, Xbox and Kinect sensors are a step forward. Almost no Xbox is sold without Kinect today, because everyone likes control by gestures.
    TARGET: Beispielsweise sind die Xbox-Konsole und die Kinect-Sensoren ein Fortschritt, und heutzutage wird keine Xbox mehr ohne Kinect verkauft, da sich alle für Bewegungssteuerung interessieren.
 PREDICTED:
--------------------------------------------------------------------------------


Processing Epoch 11: 100%|██████████| 63/63 [01:21<00:00,  1.30s/it, loss=6.268]


--------------------------------------------------------------------------------
    SOURCE: A problem, which may arise here, is that the law becomes "too constrained" and will not be enforceable.
    TARGET: Es kann jedoch das Problem auftreten, dass die Regeln "strenger" werden und nicht mehr eingehalten werden können.
 PREDICTED:
--------------------------------------------------------------------------------
    SOURCE: Now, however, due to their special relationship with Israel, Germany must be cautious.
    TARGET: Nun muss Deutschland aufgrund seiner besonderen Beziehung mit Israel allerdings vorsichtig sein.
 PREDICTED:
--------------------------------------------------------------------------------


Processing Epoch 12: 100%|██████████| 63/63 [01:21<00:00,  1.30s/it, loss=6.753]


--------------------------------------------------------------------------------
    SOURCE: Each week, students explore apocalyptic themes such as nuclear war, zombies, viruses and germs, and global warming.
    TARGET: Jede Woche untersuchen die Studenten apokalyptische Themen wie z. B. den Atomkrieg, Zombies, Virus und Keime und die globale Erwärmung.
 PREDICTED:
--------------------------------------------------------------------------------
    SOURCE: Everyone has the Internet, an iPad and eBooks.
    TARGET: Alle haben Internet, iPad und E-Books.
 PREDICTED:
--------------------------------------------------------------------------------


Processing Epoch 13: 100%|██████████| 63/63 [01:21<00:00,  1.30s/it, loss=5.982]


--------------------------------------------------------------------------------
    SOURCE: Amid the usual futile arguments over who started it, scores of buildings have been reduced to rubble; more than 140 Palestinians, most of them civilians, and six Israelis have been killed; and, for the first time, missiles from Gaza have landed near Tel Aviv, Israel's metropolis, and the holy city of Jerusalem.
    TARGET: Inmitten der üblichen Schuldzuweisungen wurden zahllose Gebäude in Schutt und Asche gelegt, mehr als 140 Palästinenser, die meisten davon Zivilisten, und sechs Israelis getötet - und zum ersten Mal sind Raketen aus Gaza in der Nähe der israelischen Metropole Tel Aviv und der heiligen Stadt Jerusalem eingeschlagen.
 PREDICTED:
--------------------------------------------------------------------------------
    SOURCE: In Ligue 1, would not winning the title, like last season, be a big failure?
    TARGET: Wäre es ein großer Misserfolg, nicht den Titel in der Ligue 1 zu gewinnen, wie dies in der letzten Saison der Fall war?
 PREDICTED:
--------------------------------------------------------------------------------


Processing Epoch 14: 100%|██████████| 63/63 [01:21<00:00,  1.30s/it, loss=6.288]


--------------------------------------------------------------------------------
    SOURCE: In reality, all types of interaction with computers are good, but each in their own niche.
    TARGET: Eigentlich ist jede Art von Interaktion mit Computern gut, allerdings jede in ihrer bestimmten Nische.
 PREDICTED:
--------------------------------------------------------------------------------
    SOURCE: There are people for whom work is hell, and others who - literally - work in hell.
    TARGET: Es gibt Menschen, für die ihre Arbeit die Hölle ist, und andere arbeiten wortwörtlich in der Hölle.
 PREDICTED:
--------------------------------------------------------------------------------


Processing Epoch 15: 100%|██████████| 63/63 [01:23<00:00,  1.32s/it, loss=6.050]


--------------------------------------------------------------------------------
    SOURCE: As you know, Azcarraga Andrade is the main shareholder of the Posadas hotel chain.
    TARGET: Wie Sie wissen, ist Azcárraga Andrade der Hauptaktionär der Hotelkette Posadas.
 PREDICTED:
--------------------------------------------------------------------------------
    SOURCE: How do you explain this progression?
    TARGET: Wie erklären Sie diesen Fortschritt?
 PREDICTED:
--------------------------------------------------------------------------------


Processing Epoch 16: 100%|██████████| 63/63 [01:22<00:00,  1.32s/it, loss=5.685]


--------------------------------------------------------------------------------
    SOURCE: I will have a programme exhibit, "Russian museum in clowns."
    TARGET: Ich werde eine Ausstellung mit dem Thema "Das russische Museum als Clowns" haben.
 PREDICTED:
--------------------------------------------------------------------------------
    SOURCE: What will we leave as a souvenir, after two billion euros of military spending?
    TARGET: Welches Souvenir lassen wir nach Militärausgaben von zwei Milliarden Euro zurück?
 PREDICTED:
--------------------------------------------------------------------------------


Processing Epoch 17: 100%|██████████| 63/63 [01:23<00:00,  1.32s/it, loss=5.637]


--------------------------------------------------------------------------------
    SOURCE: However, Berlin subsequently abstained from voting.
    TARGET: Dann aber enthielt sich Berlin.
 PREDICTED:
--------------------------------------------------------------------------------
    SOURCE: There were twenty of us, including Sharon Stone and John Kennedy Jr.
    TARGET: Wir waren ungefähr 20 Personen, darunter auch Sharon Stone und John Kennedy Jr.
 PREDICTED:
--------------------------------------------------------------------------------


Processing Epoch 18: 100%|██████████| 63/63 [01:23<00:00,  1.32s/it, loss=5.574]


--------------------------------------------------------------------------------
    SOURCE: Moreover, Hamas's leaders may well conclude that time is on their side.
    TARGET: Und die Hamas-Führer könnten zum Schluss kommen, dass sie die Zeit auf ihrer Seite haben.
 PREDICTED:
--------------------------------------------------------------------------------
    SOURCE: Give it a chance.
    TARGET: Gib dem Ganzen eine Chance.
 PREDICTED:
--------------------------------------------------------------------------------


Processing Epoch 19: 100%|██████████| 63/63 [01:21<00:00,  1.30s/it, loss=5.526]


--------------------------------------------------------------------------------
    SOURCE: Representatives of human rights, religious minorities or civil society had done likewise.
    TARGET: Vertreter von Menschenrechtsorganisationen, religiösen Minderheiten oder der Zivilgesellschaft haben gleichermaßen gehandelt.
 PREDICTED:
--------------------------------------------------------------------------------
    SOURCE: I think that the eighth will probably be the addition of tactile sensations.
    TARGET: Ich denke, der achte Durchbruch könnte durchaus die Integration der taktilen Wahrnehmung werden.
 PREDICTED:
--------------------------------------------------------------------------------
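The early validation output is expected: an undertrained model under greedy decoding collapses onto the most frequent target tokens (commas, periods, then the empty string) before it learns any real alignment. Since every epoch's checkpoint stores the epoch, model state, optimizer state, and global step, a crashed run can be resumed rather than restarted. A minimal sketch of the resume logic (the `nn.Linear` stands in for the real model, and the temp-file path is purely illustrative; in the notebook, `initial_epoch` and `global_step` would replace the hard-coded zeros in `train_model`):

```python
import os
import tempfile

import torch
import torch.nn as nn

# Stand-ins; in the notebook these come from get_model / get_config
model = nn.Linear(4, 4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, eps=1e-9)
model_filename = os.path.join(tempfile.gettempdir(), "tmodel_04.pt")

# Save a checkpoint with the same keys the training loop uses
torch.save({
    'epoch': 4,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'global_step': 315,
}, model_filename)

# Resume: restore both states and continue from the next epoch
state = torch.load(model_filename)
model.load_state_dict(state['model_state_dict'])
optimizer.load_state_dict(state['optimizer_state_dict'])
initial_epoch = state['epoch'] + 1
global_step = state['global_step']
print(initial_epoch, global_step)  # 5 315
```

Restoring the optimizer state matters for Adam: its per-parameter moment estimates are part of the checkpoint, so resuming without them would briefly destabilize training.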

Part 4: The extras