A Gentle Introduction to Transformer Architecture and Relevance to Generative AI
This post is about the Transformer architecture, its relevance to Generative AI, and tips and guidance on customizing your interaction with large language models.
- Language Translation - translating from one language to another. A use case here would be helping a non-English speaker interact with a critical government service over the internet.
- Text Generation - generating stories, scripts and any text-based content that could be utilized in a social media campaign.
- Sentiment Analysis - determining the sentiment or emotion of a piece of text or speech, to help understand why a customer may not be using a business service or product.
- Question Answering - generating the most relevant answer, based on input, to help reinforce the learnings within a large organisation as part of a cultural and/or organisational change initiative.
- Text Summarisation - providing a concise summary of a long document, to help explain a government policy document to citizens on a public-facing website.
- Speech Recognition - processing the acoustic features of audio data to transcribe speech into text, to help hearing-impaired employees in government organisations.
- Music Production - creating new music, based on original input content, that can be woven into a film, TV or YouTube video by a professional musician or social media content creator.

- Input Sequence: This consists of conveying the party requirements such as theme, venue, activities, etc.
- Encoder: Each person represents a single encoder, and specialises in an aspect of the party planning: decorations, food, music, games, etc. All the people, and therefore the stack of encoders, represent a party committee.
- Self-Attention: Each person pays attention to everyone else's ideas. They consider the relevance and importance of each party idea and how the ideas all relate to each other. This happens as part of a brainstorming and collaboration session (see the code sketch after this list).
- Decoder: The party committee then takes all of the information and ideas, weighs the importance of each, and determines how to assemble them into a party plan.
- Output: The output sequence generated by the transformer is the final party plan.
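To ground the analogy, here is a minimal sketch of the scaled dot-product self-attention step that each encoder performs; the tensor sizes and random values are purely illustrative.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: a batch of 1 sequence with 4 tokens, each an 8-dim embedding.
x = torch.randn(1, 4, 8)

# Learned projections produce queries, keys and values from the same input,
# which is what makes this "self"-attention.
w_q, w_k, w_v = torch.nn.Linear(8, 8), torch.nn.Linear(8, 8), torch.nn.Linear(8, 8)
q, k, v = w_q(x), w_k(x), w_v(x)

# Each token scores every other token (how relevant is their "party idea"?),
# and the scores are scaled and softmax-normalised into attention weights.
scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)
weights = F.softmax(scores, dim=-1)

# The output for each token is a weighted blend of everyone else's values.
output = weights @ v
print(output.shape)  # torch.Size([1, 4, 8])
```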
- The number of encoder or decoder layers, or, in the analogy, the number of people with a skill that can assist with planning the party.
- The number of parallel self-attention mechanisms (attention heads); more heads can better capture diverse patterns in the data, but require more computational power.
- The size of the feed-forward neural network inside each layer (see the sketch below for how these map to code).
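As an illustration, PyTorch's built-in nn.Transformer exposes these hyperparameters directly; the values below are arbitrary, not recommendations.

```python
import torch.nn as nn

# Each argument maps to one of the hyperparameters above:
#   num_encoder_layers / num_decoder_layers - the size of the "party committee"
#   nhead                                   - number of parallel self-attention heads
#   dim_feedforward                         - size of the feed-forward network in each layer
model = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=2048,
)
```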
- LLMs are powerful but also general-purpose. They may not fit your specific use case well enough.
- You may or may not have direct access to the LLM.
- Your organization may have a massive amount of domain or industry data, and you may not be sure how to make better use of this knowledge with GenAI capabilities.
It's like asking a chatbot a question and getting an answer.
It's simple and easy to start with, but it may not always produce accurate, relevant or factual responses.
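For example, a zero-shot prompt contains only the task description and the input, with no demonstrations; the wording below is illustrative.

```python
# Zero-shot: just the instruction and the input, no examples.
zero_shot_prompt = (
    "Classify the sentiment of the following customer review as Positive or Negative.\n"
    "Review: The support team resolved my issue within minutes.\n"
    "Sentiment:"
)
```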
If zero-shot prompting doesn't work well, consider using few-shot prompting.
- Using demonstrations (examples) as labels (ground-truth).
- Specifying the overall format is crucial; e.g. when the label space is unknown, using random English words as labels is significantly better than using no labels at all.
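A few-shot version of the same task prepends a handful of demonstrations, whose labels act as the ground truth and also pin down the expected output format; again, the examples are illustrative.

```python
# Few-shot: demonstrations act as labels and fix the expected output format.
few_shot_prompt = (
    "Review: The checkout page kept timing out.\n"
    "Sentiment: Negative\n\n"
    "Review: Delivery arrived a day early, great service.\n"
    "Sentiment: Positive\n\n"
    "Review: The support team resolved my issue within minutes.\n"
    "Sentiment:"
)
```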
- Define your objective, ideate on and understand your specific use cases, and define what good output would look like.
- Work backwards from your specific use case.
- If you are considering prompt engineering (over a fine-tuning approach, which will be covered in the next section), start with a simple approach. If few-shot prompting will achieve your goal, don't bother using RAG.
- Be specific in your prompt - be as clear as possible on the following (see the sketch after this list):
- input questions
- context (additional information)
- examples (output format)
- instructions (e.g. step-by-step instructions)
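Putting those four pieces together, a prompt template might look like the sketch below; the wording and the build_prompt helper are hypothetical, not a prescribed format.

```python
def build_prompt(question: str, context: str, examples: str) -> str:
    """Hypothetical helper that assembles the four parts of a specific prompt."""
    return (
        "Instructions: Answer the question using only the context below. "
        "Think step by step, then give a one-sentence answer.\n\n"
        f"Context: {context}\n\n"
        f"Examples of the expected output format:\n{examples}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```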
Therefore, here we just list some of the popular ones to give you an idea of where to start.
- Improving the performance of common tasks from a pre-trained model.
- Improving the performance of specific tasks from a pre-trained model.
Fine Tuning II updates the parameters of all layers. Compared with Fine Tuning I, it can result in better performance, but it is also more expensive.
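One way to picture the difference: Fine Tuning I trains only a small task-specific head on top of a frozen pre-trained body, while Fine Tuning II leaves every parameter trainable. The sketch below uses a Hugging Face model purely for illustration; the model name and task are assumptions.

```python
from transformers import AutoModelForSequenceClassification

# Illustrative model and task - substitute your own.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Fine Tuning I: freeze the pre-trained body, train only the new classification head.
for param in model.base_model.parameters():
    param.requires_grad = False

# Fine Tuning II: leave every parameter trainable - potentially better results, higher cost.
for param in model.parameters():
    param.requires_grad = True
```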
This approach can improve the zero-shot performance of an LLM on unseen tasks.
In addition, storing and running inference with large fine-tuned models (of similar size to the pre-trained LLMs) can also be expensive.
Parameter-Efficient Finetuning (PEFT) can help to address these challenges.
In this way, it significantly reduces resource consumption during training, and also results in a much smaller fine-tuned model with performance comparable to a fully fine-tuned model. With PEFT, training and inference may even fit on a single GPU.
This approach freezes the pre-trained model parameters and limits the trainable parameters in each individual layer of the model (the Transformer architecture). In doing so, it greatly reduces the number of trainable parameters and hence decreases the training time.
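A minimal sketch with the Hugging Face peft library illustrates the idea: the base weights stay frozen and only small low-rank adapter matrices (LoRA) are trained. The base model and LoRA settings below are assumptions, not recommendations.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative base model - substitute your own.
model = AutoModelForCausalLM.from_pretrained("gpt2")

# LoRA freezes the pre-trained weights and injects small trainable
# low-rank matrices into selected layers.
config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)

# Reports the share of trainable parameters (typically well under 1%).
model.print_trainable_parameters()
```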
- Less resource consumption: less compute for training, less storage for hosting the model
- Faster training
- Better performance (less prone to overfitting)
There are existing evaluation metrics, such as ROUGE, that measure model performance, but on their own they are still not good enough.
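For instance, the Hugging Face evaluate library computes ROUGE from the n-gram overlap between a generated summary and a reference; the sentences below are illustrative.

```python
import evaluate

rouge = evaluate.load("rouge")

predictions = ["the committee approved the new policy on Friday"]
references = ["the new policy was approved by the committee on Friday"]

# Returns ROUGE-1/2/L scores based on n-gram overlap - useful, but it cannot
# tell whether the output is actually faithful or helpful to a human reader.
results = rouge.compute(predictions=predictions, references=references)
print(results)
```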
The Reinforcement Learning with Human Feedback (RLHF) approach can help here.
- Supervised fine tuning (SFT).
- Reward Model training, based on human feedback. This basically defines what "good" looks like.
- Reinforcement Learning against this Reward Model. This step further fine-tunes the model (a conceptual sketch of the full flow follows).
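The three stages can be outlined as follows; every function here is a stub so the flow is runnable, and none of it is a real training implementation or library API.

```python
# Conceptual outline of RLHF - all three stages are stubbed placeholders.

def supervised_fine_tune(model, demonstrations):
    """Stage 1: SFT on human-written demonstrations (stubbed)."""
    return model

def train_reward_model(model, preference_rankings):
    """Stage 2: learn a scoring function from human preferences, i.e. what good looks like (stubbed)."""
    return lambda response: 0.0

def reinforcement_learning(model, reward_model):
    """Stage 3: optimise the model (e.g. with PPO) against the reward model (stubbed)."""
    return model

base_llm = "pre-trained-llm"                    # placeholder for a real model object
demonstrations = ["prompt -> ideal response"]   # human-written examples
rankings = [("response A", "response B")]       # human prefers A over B

sft_model = supervised_fine_tune(base_llm, demonstrations)
reward_model = train_reward_model(sft_model, rankings)
aligned_model = reinforcement_learning(sft_model, reward_model)
```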
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.