A Summer of Nuggets

Charles F Vardeman II

2023-08-28

Where we started…

Where we’re going…

This is meant to provide something of a “Roadmap”!

“Old School AI” and the Web…

A vision of “Ontologies”, “Linked Data”, and “Software Agents”…

Old School “Cool” AI and the Web (2001)…

The Semantic Web had a vision of Agents with Shared Understanding through Ontologies, the ability to Use Tools Like the Web, and to Consume a Web of Linked Data as Distributed Knowledge Graphs.

What’s an Ontology?

Ontology Design Patterns

Google Search and the “Semantic Web”

Knowledge Graphs

Knowledge Graphs

What’s a Knowledge Graph?

AI in 2023…

The “AI Social Disruption”

Pretrained Foundation Models…

Evolution of Large Language Models

Transformer: “Attention Is All You Need”

“Explain the transformer neural network architecture to a College Junior Computer Science student. Use the style of a good science communicator like Richard Feynman.”

Transformers explained by ChatGPT

Alright, let’s get into it! Imagine you’re at a party and you’re trying to catch snippets of multiple conversations. You’re not just listening to the words, but also noticing who’s talking to whom, the tone, the context, and you’re making judgments about what’s important or not. That’s kinda what the Transformer architecture does, but for sequences of data like sentences or time-series data.

At the core, a Transformer has two main parts: the Encoder and the Decoder. Each has multiple identical layers stacked on top of each other.

  1. Encoder: Takes in the sequence (like a sentence “Hello, how are you?”) and processes it as a whole. It outputs a sort of “summary” or “context” for each word in the sequence.
    • Self-Attention: Each word checks out all the other words and figures out which ones are important to it. The magic formula called attention scores helps in this.
    • Feed-forward neural networks: After attention, each word passes through a mini neural network that’s the same for all words, transforming it yet again.
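
To make the “attention scores” above concrete, here is a minimal single-head sketch of scaled dot-product self-attention in plain NumPy. The matrix names, toy shapes, and the single-head simplification are illustrative assumptions, not the exact setup of any particular model.

```python
# Minimal single-head self-attention sketch (illustrative shapes, NumPy only).
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_k) learned projections."""
    Q = X @ Wq                                   # queries: what each word is asking about
    K = X @ Wk                                   # keys: what each word offers
    V = X @ Wv                                   # values: the content that gets mixed
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # attention scores between every pair of words
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ V                           # each word becomes a weighted blend of the others

# Toy usage: a 4-"word" sequence with model width 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
context = self_attention(X, Wq, Wk, Wv)          # (4, 8) contextualized representations
```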

Transformers explained by ChatGPT

  2. Decoder: Takes these encodings and starts generating the output sequence (say, a translation of the original sentence into another language).
    • Masked Self-Attention: Similar to the encoder, but in a way that it can’t cheat and look at future words in the output sequence.
    • Encoder-Decoder Attention: The decoder takes a peek at the encoder’s output to better understand the context and produce the correct next word.
    • Feed-forward neural networks: Same drill as the encoder, another mini network for transforming words.
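
The “can’t cheat” part is just a mask applied to those same attention scores. A sketch, reusing the shapes from the self-attention example above (again an illustrative simplification):

```python
# Masked (causal) self-attention sketch: positions above the diagonal are
# future words, so their scores are set to -inf and vanish after the softmax.
import numpy as np

def masked_self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)  # True = a future position
    scores = np.where(future, -np.inf, scores)                # future words contribute nothing
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```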

Transformers explained by ChatGPT

So the magic here is mainly in how words interact with each other through the attention mechanism. Each word looks at all the other words to decide how much focus it should get, like how you tune in to different voices at a party. It does this not once, but multiple times, diving deeper and deeper into the context each time.

It’s like a super-efficient assembly line for sequences, enabling models like me to generate text or translate languages at scale. And this architecture has been the basis for a lot of breakthroughs in natural language processing. It’s versatile, parallelizable, and mighty powerful!

“Some LLM Neural Networks use encoder-decoder networks. Some LLM neural networks are encoder only and some are decoder only like GPT. Explain that to a College Junior Computer Science student. Use the style of a good science communicator like Richard Feynman.”

Transformers explained by ChatGPT

Ah, good question! Think of it like this: you’ve got three types of orchestras. One plays both classical and jazz seamlessly (encoder-decoder), another specializes only in classical (encoder), and the last one is all about that jazz (decoder).

  1. Encoder-Decoder Networks: These are like your all-rounder orchestras. In the machine learning world, these models are often used for tasks like machine translation. The encoder part processes the input language, let’s say, English, and the decoder part spits out the output in another language, maybe Spanish. They work together like a well-conducted orchestra, each contributing to the final masterpiece.

  2. Encoder-Only Networks: Imagine an orchestra that’s focused entirely on classical music, where each musician feeds into the other to create a beautiful symphony. In tech, these models are good for tasks like text classification or sentiment analysis. The encoder processes the text and summarizes it in a way that lets you say, “Ah, this review is positive!”

Transformers explained by ChatGPT

  3. Decoder-Only Networks: Now, these are your jazz specialists, all about improvisation and solos. GPT models, like the one you’re talking to, belong to this category. The decoder starts with a seed (your prompt) and keeps generating one word at a time, predicting the next best word until it forms a coherent reply. It doesn’t need to look back at an original input sentence like in translation; it just jams out text based on what it has learned during its training.

So, you’ve got different setups for different gigs. If you need a two-way conversation between languages, you go for the encoder-decoder. If you want to understand the essence of a piece of text, you’d lean on the encoder. And if you want to generate text, like writing or even code, the decoder is your go-to musician. Each has its own style, its own instruments, and its own unique approach to making beautiful “music”—or in our case, useful computations.
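
A hedged way to see the three families side by side is through the Hugging Face transformers library, where each family has its own loader class (the model names below are just common public examples, not anything used in this talk):

```python
# Three Transformer families, three loader classes (Hugging Face transformers).
from transformers import AutoModel, AutoModelForSeq2SeqLM, AutoModelForCausalLM

seq2seq = AutoModelForSeq2SeqLM.from_pretrained("t5-small")    # encoder-decoder: translation, summarization
encoder = AutoModel.from_pretrained("bert-base-uncased")       # encoder-only: classification, embeddings
decoder = AutoModelForCausalLM.from_pretrained("gpt2")         # decoder-only: open-ended text generation
```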

“How do these relate to the Transformer architecture”

Transformers explained by ChatGPT

Great follow-up! Alright, think of the Transformer architecture as a fancy new instrument—a versatile electric guitar, let’s say, that can produce a whole range of sounds from classical to rock to jazz. Now, just like you can use that electric guitar in different types of bands, you can use parts of the Transformer architecture in different types of models.

  1. Encoder-Decoder Networks with Transformers: When people initially talked about Transformers, they were often referring to the full shebang—both encoder and decoder. In a language translation model, the Transformer’s encoder reads an English sentence and compresses its essence. The decoder then takes this essence and generates a Spanish sentence. Both sides are using self-attention, layer normalization, and all those Transformer goodies to do their jobs.

  2. Encoder-Only Transformers: In some tasks, you don’t need the full band; a guitar solo will do. For instance, BERT (Bidirectional Encoder Representations from Transformers) uses just the encoder part of the Transformer architecture. It processes a sentence and spits out a rich, contextual representation of each word, which can then be used for tasks like text classification or filling in blanks.

  3. Decoder-Only Transformers: Now, what if we only want to jam and improvise? That’s where models like GPT (Generative Pre-trained Transformer) come in. These models use only the decoder part of the Transformer architecture to generate new text. They still use self-attention and feed-forward layers, but they are more about predicting the next note—or in our case, the next word—in a sequence.

Transformers explained by ChatGPT

So there you go! Just like our versatile electric guitar can be part of a full orchestra, a rock band, or a solo act, different pieces of the Transformer architecture can be used to construct different kinds of neural network models. Each specialized version takes the core principles of the Transformer and applies them in a way that’s tailored to specific tasks.
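
To see the decoder-only “improvisation” in code: a greedy next-token loop with GPT-2, assuming the Hugging Face transformers library. In practice you would call model.generate(); the explicit loop is only there to show the one-token-at-a-time idea.

```python
# Greedy decoding sketch: predict the most likely next token, append, repeat.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tokenizer("The Transformer architecture is", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(20):                                        # generate 20 new tokens
        logits = model(ids).logits                             # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # most likely next token
        ids = torch.cat([ids, next_id], dim=-1)

print(tokenizer.decode(ids[0]))
```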

How to Build a ChatGPT?

“The state of GPT” – You should watch this!

“The state of GPT”

Large Language Models from a more general view…

Text to numbers…
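
Assuming this slide refers to tokenization, a minimal illustration with the tiktoken library (the byte-pair encoding used by recent OpenAI models):

```python
# "Text to numbers": tokenization turns a string into the integer IDs a model sees.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")           # GPT-3.5/GPT-4-era encoding
ids = enc.encode("A Summer of Nuggets")
print(ids)                                           # list of integer token IDs
print([enc.decode([i]) for i in ids])                # the text piece behind each ID
```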

Base models create general representations through “Pre-Training” (GPT)

Base models create general representations through “Pre-Training” (GPT)

“Pre-Training Datasets?”

LLaMA: Open and Efficient Foundation Language Models

LLaMA: Open and Efficient Foundation Language Models

Gao et al. “The Pile”?

Gao et al. “The Pile”?

What could go wrong?

What could go wrong?

What could go wrong?

“Would I forbid the teaching (if that is the word) of my stories to computers? Not even if I could. I might as well be King Canute, forbidding the tide to come in. Or a Luddite trying to stop industrial progress by hammering a steam loom to pieces.” – Stephen King

(GPT-3) Language Models are Few-Shot Learners (2020)

(GPT-3) In-Context Learning

(GPT-3) InstructGPT Reinforcement Learning from Human Feedback

(GPT-3) InstructGPT Reinforcement Learning from Human Feedback

(GPT-4) “Sparks of AGI”?

(GPT-4) “Sparks of AGI”?

(GPT-4) “Sparks of AGI”?

(GPT-4) “Sparks of AGI”?

LLM Behavior Changes with Time!

(GPT-4) OpenAI Reinforcement Learning – “Towards TruthGPT”

(GPT-4) OpenAI Reinforcement Learning – “Conceptual Models”

(Claude) “Constitutional AI”

(Claude) “Foundation Model in AWS Bedrock”

(Llama 2) “Open-License” Large Language Models

“The state of GPT” Recommendations

Reasoning…

(GPT-3) Large Language Models are Zero-Shot Reasoners (Chain-of-Thought Reasoning)

(GPT-3) Large Language Models are Zero-Shot Reasoners (Chain-of-Thought Reasoning)
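
The zero-shot chain-of-thought trick from this paper (Kojima et al., 2022) is literally one added sentence. A prompt-only sketch (the question and the expected behavior are illustrative):

```python
# Zero-shot chain-of-thought: append "Let's think step by step." to the prompt.
question = ("A juggler can juggle 16 balls. Half of the balls are golf balls, "
            "and half of the golf balls are blue. How many blue golf balls are there?")

direct_prompt = f"Q: {question}\nA:"
cot_prompt    = f"Q: {question}\nA: Let's think step by step."
# The second prompt typically elicits intermediate reasoning
# ("16 / 2 = 8 golf balls, 8 / 2 = 4 blue golf balls") before the final answer.
```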

ReAct: Synergizing Reasoning and Acting in Language Models

ReAct: Synergizing Reasoning and Acting in Language Models

ReAct: Synergizing Reasoning and Acting in Language Models
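
ReAct interleaves free-text Thought steps with tool-calling Action steps and feeds each tool result back as an Observation. The trace below is a hand-written illustration of the prompt format, not actual model output; Search and Finish follow the action style of the paper’s question-answering examples.

```python
# Illustrative ReAct-style trace: Thought -> Action -> Observation, repeated.
react_trace = """\
Question: In what year was the Transformer architecture introduced, and in which paper?
Thought 1: I should find the paper that introduced the Transformer.
Action 1: Search[Transformer neural network architecture original paper]
Observation 1: The Transformer was introduced in "Attention Is All You Need" (Vaswani et al.).
Thought 2: Now I need the publication year of that paper.
Action 2: Search[Attention Is All You Need publication year]
Observation 2: The paper was published in 2017.
Thought 3: I have both pieces of information.
Action 3: Finish["Attention Is All You Need", 2017]
"""
print(react_trace)
```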

Prompt Engineering

Large Language Models are Semantic Reasoners

Large Language Models are Semantic Reasoners

Our Research…

Retrieval-Augmented Generation
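
A minimal retrieval-augmented generation sketch, assuming the sentence-transformers library for embeddings; the documents and prompt template are illustrative, and a real pipeline would finish by sending the prompt to an LLM.

```python
# RAG sketch: embed documents and query, retrieve the closest documents,
# and build a grounded prompt from them.
import numpy as np
from sentence_transformers import SentenceTransformer

docs = [
    "The Transformer architecture was introduced in 'Attention Is All You Need' (2017).",
    "BERT is an encoder-only Transformer used for classification and embeddings.",
    "GPT models are decoder-only Transformers that generate text one token at a time.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = embedder.encode(docs, normalize_embeddings=True)       # (n_docs, dim)

question = "Which Transformer models are decoder-only?"
q_vec = embedder.encode([question], normalize_embeddings=True)    # (1, dim)

scores = (doc_vecs @ q_vec.T).ravel()                             # cosine similarity per document
top_k = np.argsort(-scores)[:2]                                   # indices of the 2 best matches

context = "\n".join(docs[i] for i in top_k)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# `prompt` would now go to an LLM; grounding the answer in retrieved text is
# what ties the generation back to a trusted source of knowledge.
```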

Knowledge Graphs and LLMs – Must Read!

Knowledge Graphs and Frameworks

What About Trusted AI?

How to talk about LLMs? Must Read!