Exercises

Exploring the TensorFlow Playground

You've been introduced to neural networks. To deepen your understanding of how they work, you've been tasked with investigating the TensorFlow Playground—an interactive visualisation tool that allows you to experiment with small neural networks directly in your browser. Your goal is to explore how different configurations of neural networks affect their ability to learn and solve classification problems.

Open the TensorFlow Playground:
Go to TensorFlow Playground in your web browser.

Task 1: Build a Simple Neural Network

  • Start with the default dataset (two circles) and network settings.
  • Use one hidden layer with one neuron to classify the points.
  • Observe the decision boundary and how the network performs.

Task 2: Add More Neurons

  • Increase the number of neurons in the hidden layer to three neurons.
  • Observe the changes in the decision boundary and how well the network classifies the data.
  • How does the performance change with more neurons?

Task 3: Add More Layers

  • Add a second hidden layer with two neurons.
  • Compare the performance to the previous network with just one hidden layer. How does adding more layers affect the learning?

Submission:
Take a screenshot of each configuration (one neuron, three neurons, two layers). Briefly describe (2-3 sentences) how each change affected the decision boundary and model performance.

After trying out the TensorFlow Playground, it is time for a slightly more complex task. This will be useful in gaining an intuitive understanding of how machine learning models work.

Task 1: Investigate Different Datasets

  • Change the dataset to "Spiral" using the dataset selector.
  • Experiment with different network architectures to successfully classify the data:
  • Start with one hidden layer and four neurons.
  • Gradually increase the number of neurons and layers until the network correctly classifies the spirals.
  • What is the minimum number of layers and neurons needed to classify the spirals accurately?

Task 2: Adjusting Activation Functions and Learning Rate

  • Change the activation function from "ReLU" to "tanh" and observe how it affects the network's ability to learn.
  • Now, adjust the learning rate slider and observe how different learning rates (e.g., 0.01, 0.1, 3) impact the training speed and model accuracy.

Task 3: Experiment with Regularization

  • Add L2 regularization to your network and observe how it impacts the decision boundary and prevents overfitting.
  • Try increasing the regularization strength to see its effects on training.

Submission:
Share screenshots of your experiments with the "Spirals" dataset and different network configurations. Briefly describe (2-3 sentences) how changing the activation function, learning rate, and regularization influenced the network's performance.

Digital Signal Processing

The topics of this week include two special kinds of data that require somewhat specific techniques: digital signals (such as audio or images), and natural language.

Even though we try to avoid mentioning the Terminator (because people think about it way too much when we say 'AI'), at this point of the course, it's probably safe to use it as an example of some signal processing tasks. You can watch this clip before and after this part in the course to see if you'll be able to recognize some familiar features in it:

Learning objectives of Part 5
Objectives (after the course, you ...), by theme:

Digital signal processing
  • can discuss generative AI for images (GANs, VAEs)
  • can recognize transformer and diffusion models
  • can recognize AI-generated images

Natural language processing
  • can discuss generative AI in NLP
  • can implement and work with large language models (NanoGPT)

Digital signal representations

Digital image and audio signals represent recordings of natural signals in numerical form. For images, the values represent pixel intensities stored in a two-dimensional array. A third dimension can be used to store different bands for, e.g., red, green, and blue (RGB). We can also think of the values as functions of the x and y coordinates, F(x,y).

Fun fact (but not required for the exam): In the case of audio, the signal can be thought of as a function of time, F(t), but other representations, such as functions of frequency, F(f), are commonly used. The frequency representation is particularly well suited for compact representation, since frequency bands outside the audible range of 20 Hz to 20 kHz can be left out without observable (to humans) effect.

Signal processing covers a wide variety of different tasks and techniques. See the lecture slides for a number of examples of both audio and image/video processing tasks, from ''photoshopping'' to face recognition.

Pattern recognition and invariant features in images

Pattern recognition is a good example of an AI problem involving digital signals. Generally, it means recognizing an object in a signal (typically an image). The task is made hard because signals vary greatly depending on external conditions such as camera angle, distance, lighting conditions, echo, noise, and differences between recording devices. As a result, the signal is practically always different even if it comes from the same object.

Based on these facts, an essential requirement for successful pattern recognition is the extraction of invariant features in the signal. Such features are insensitive (or as insensitive as possible) to variation in the external conditions, and tend to be similar in recordings of the same object. In the lecture slides there is a famous photograph of an Afghan girl by Steve McCurry. She was identified based on the unique patterns in the iris of the eye. Here the iris is the invariant feature, which remains unchanged when a person grows older even if their appearance may otherwise change. The iris is an example of invariance, and a metaphor for more general invariance with respect to external conditions.

More generally, the feature-based approach to image recognition attempts to identify features that are invariant with respect to (meaning: not affected too much by)

  • scale, which depends on the distance between the camera and the object
  • orientation, which depends on the angle of the object relative to the camera
  • brightness, which depends on the lighting conditions and the camera settings (aperture and exposure time)
Moreover, we'll prefer features that can be computed
  • efficiently
Finally, the object should be
  • identifiable
based on the features.

The starting point to construct such features will be a set of points of interest, each of which is located at specific x, y coordinates in the image. The above criteria, especially the requirement of identifiability, lead us to prefer points like corners and ''blobs'' (we won't define this in more detail): a point on a flat surface, on the other hand, would be a poor feature because its only identifiable characteristic is its color, which won't usually be specific enough to distinguish between features in objects of the same color.

Deep learning and computer vision

The latest trend in computer vision and other digital signal processing domains in AI is deep learning. We already briefly discussed the idea of deep learning in the previous part.

The main promise of deep learning for computer vision is the ability to sidestep the feature engineering step, so that we won't necessarily have to apply the feature-based approach described above at all. The deep neural network architecture is designed so that the first layers (the layers closest to the input data) automatically learn to extract useful features from raw pixel data. The next layers gradually refine these features to obtain higher-level features. Finally, the last layers (the layers furthest from the input data) solve the actual learning task, be it classification, regression, or something else.

We didn't go into deep learning for image processing in this course, and it is not included in the learning objectives: don't expect a question about it in the exam. There are other courses like Advanced course in machine learning and Deep learning where you can learn more. You should also remember that you can apply for permission to take courses at Aalto University where they also have relevant courses.

Generative AI in Image Processing

[Image: VAEs and GANs in image generation]

Generative AI in image processing focuses on creating new images or enhancing existing ones by learning patterns from large datasets. Using techniques such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), AI systems can generate high-quality images, reconstruct missing parts, and create entirely new visuals. These models have become instrumental in industries like entertainment, advertising, and healthcare, enabling tasks such as image synthesis, inpainting, and super-resolution.

Auto Regression (AR)

Auto Regression (AR) is a statistical modeling technique widely used in time series analysis to predict future values based on past observations of the same variable. The AR model assumes that the value of a variable at any given time depends linearly on its past values and a stochastic (random) error term. The general form of an AR model of order p (AR(p)) is:

y_t = c + φ_1 y_{t-1} + φ_2 y_{t-2} + ... + φ_p y_{t-p} + ε_t

Where:

  • y_t: The value of the time series at time t.
  • c: A constant term.
  • φ_i: The coefficient for the past value y_{t-i}.
  • p: The order of the model, indicating how many past values are used.
  • ε_t: The error term, representing random noise.
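
To make this concrete, here is a minimal NumPy sketch that fits the coefficients by ordinary least squares and checks them on a simulated AR(2) series; the function name and parameter values are illustrative, and in practice you would typically reach for a library such as statsmodels:

import numpy as np

def fit_ar(y, p):
    """Fit y_t = c + phi_1*y_{t-1} + ... + phi_p*y_{t-p} by least squares."""
    n = len(y)
    # Column i holds the lagged values y_{t-i} for t = p, ..., n-1
    X = np.column_stack([y[p - i : n - i] for i in range(1, p + 1)])
    X = np.column_stack([np.ones(n - p), X])  # prepend the constant term c
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef[0], coef[1:]  # c, (phi_1, ..., phi_p)

# Simulate a stationary AR(2) process and recover its parameters
rng = np.random.default_rng(0)
n, c_true, phi_true = 2000, 0.5, (0.6, -0.3)
y = np.zeros(n)
for t in range(2, n):
    y[t] = c_true + phi_true[0] * y[t - 1] + phi_true[1] * y[t - 2] + rng.normal(scale=0.1)
c_hat, phi_hat = fit_ar(y, p=2)
print(c_hat, phi_hat)  # should come out close to 0.5 and (0.6, -0.3)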

Applications of Auto Regression

Auto Regression is used in various domains, including:

  • Economic Forecasting: Predicting stock prices, GDP growth, or unemployment rates.
  • Weather Modeling: Estimating temperature, rainfall, or other weather-related variables.
  • Signal Processing: Analyzing trends in audio, communication, or seismic signals.

Assumptions of the AR Model

To effectively use AR models, the following assumptions should hold:

  • The time series is stationary, meaning its statistical properties (mean, variance) do not change over time.
  • The errors (εt) are independently and identically distributed (i.i.d.) with a mean of zero.

Choosing the Order (p)

The order of the AR model, p, can significantly impact its performance. It is typically chosen using methods like:

  • Akaike Information Criterion (AIC): Balances model complexity and goodness of fit.
  • Bayesian Information Criterion (BIC): Penalizes complexity more heavily than AIC.
  • Partial Autocorrelation Function (PACF): Determines the most significant lags for the model.

Limitations

While AR models are powerful for stationary time series, they have limitations:

  • Poor performance on non-stationary data unless transformed (e.g., through differencing).
  • Limited to linear relationships and may not capture non-linear dynamics.

Extension to ARIMA

Auto Regression is a core component of more complex time series models like ARIMA (Auto Regressive Integrated Moving Average), which combines AR with moving averages and differencing to handle non-stationary data.

Overall, Auto Regression remains a foundational technique in time series analysis, providing a straightforward yet effective way to model and forecast data based on its own past behavior.

GAN

GANs are among the most popular models for image generation. They are a type of generative model that can create new, synthetic data similar to a given dataset; in image generation, for example, GANs can produce realistic-looking images from random noise. GANs consist of two neural networks, the generator and the discriminator, working in opposition.

  • The Generator: This network tries to generate realistic images from random noise.
  • The Discriminator: This network tries to distinguish between real images (from the dataset) and fake images (generated by the generator).

The generator's goal is to fool the discriminator, while the discriminator's goal is to correctly classify images as real or fake. Over time, both networks improve: the generator gets better at creating realistic images, and the discriminator becomes more accurate at distinguishing real from fake ones. The setup is called adversarial because the two networks are in competition, similar to a game, and it is precisely this competition that gradually drives the generator toward highly realistic outputs.

VAE

VAEs work by compressing images into a latent space representation, which allows for smooth interpolation between images. They are useful for generating new images and for filling in missing data by sampling from this latent space.
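
As a rough sketch of this idea, here is a minimal PyTorch VAE; the layer sizes and latent dimension are arbitrary illustrative choices, not a reference implementation:

import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal VAE sketch: 28x28 image -> small latent vector -> 28x28 image."""
    def __init__(self, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 28 * 28), nn.Sigmoid(),  # pixel values in (0, 1)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample z while keeping gradients flowing
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z).view(-1, 1, 28, 28), mu, logvar

# Training would minimize reconstruction error plus a KL term that keeps
# the latent distribution close to a standard normal.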

Hands-On: GANs and VAEs

Generator: The generator will take a vector of random noise (a latent vector, usually of size 100) and generate a 28x28 image (since we are using the MNIST dataset).

Discriminator: The discriminator will take an image (either real or generated) and output a single scalar that represents whether the image is real (1) or fake (0).

Generator Architecture

  • Latent Vector (z): Random noise vector (size 100).
  • Output Image: A 28x28 grayscale image (MNIST size).

Code for the generator can be found here: Generator Code.
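
For a rough idea of what such a generator can look like, here is a minimal PyTorch sketch; the hidden-layer sizes are illustrative choices and the linked reference code may differ:

import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps a latent vector (size 100) to a 28x28 grayscale image."""
    def __init__(self, latent_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, 28 * 28), nn.Tanh(),  # outputs in [-1, 1]
        )

    def forward(self, z):
        return self.net(z).view(-1, 1, 28, 28)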

Discriminator Architecture

  • Input Image: A 28x28 grayscale image.
  • Output: A single value between 0 and 1 (real or fake).

Code for the discriminator can be found here: Discriminator Code.
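
A matching sketch of the discriminator, again with illustrative layer sizes:

import torch.nn as nn

class Discriminator(nn.Module):
    """Maps a 28x28 image to a single value in (0, 1): real vs. fake."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)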

Full code of training on the MNIST dataset can be accessed by the link: MNIST Training Code.
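
The adversarial training loop itself can be sketched as follows, assuming the Generator and Discriminator classes sketched above; a full script would also load MNIST batches and normalize the pixels to [-1, 1]:

import torch
import torch.nn as nn
import torch.optim as optim

G, D = Generator(), Discriminator()
opt_g = optim.Adam(G.parameters(), lr=2e-4)
opt_d = optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_images):
    batch = real_images.size(0)
    fake_images = G(torch.randn(batch, 100))

    # Discriminator step: push real images toward 1 and fakes toward 0
    opt_d.zero_grad()
    loss_d = (bce(D(real_images), torch.ones(batch, 1)) +
              bce(D(fake_images.detach()), torch.zeros(batch, 1)))
    loss_d.backward()
    opt_d.step()

    # Generator step: try to make the discriminator output 1 on fakes
    opt_g.zero_grad()
    loss_g = bce(D(fake_images), torch.ones(batch, 1))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()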

Further Reading:

Transition from RNN-based Approaches to CNN-based

Historically, Recurrent Neural Networks (RNNs) were applied to sequential tasks, including image generation. However, RNNs struggled with scalability and capturing spatial hierarchies in image data. As a result, Convolutional Neural Networks (CNNs) became the dominant architecture for image-related tasks, particularly for tasks involving spatial relationships such as object detection and image classification.

CNNs are designed to process data with a grid-like topology (e.g., images). They employ convolutional layers that filter small regions of input data, capturing features such as edges, textures, and shapes. This enables CNNs to model spatial hierarchies efficiently, making them more effective for image processing compared to RNNs, which are better suited for sequential data (like text or time-series data).

The shift from RNN-based to CNN-based approaches in tasks like image recognition, object detection, and image generation was driven more by the scalability of CNNs and their higher degree of parallelization, rather than an inherent superiority. While both architectures can ultimately achieve similar outcomes, CNNs are generally easier to train and more efficient on large-scale datasets due to their ability to process data in parallel.

Transformers and Attention Mechanisms


The introduction of Transformers marked a breakthrough in deep learning, primarily due to their attention mechanisms, which enable models to focus on important parts of the input data simultaneously. Initially designed for Natural Language Processing (NLP), transformers are now being applied in computer vision as well, most notably through Vision Transformers (ViTs).

Self-Attention Mechanism

This is the core of the transformer architecture. Each input token (in NLP, a word; in images, a patch) can attend to every other token, allowing the model to capture long-range dependencies. In image processing, this means understanding relationships between distant regions of an image.
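
As a minimal sketch (the dimensions below are arbitrary illustrative choices), scaled dot-product self-attention can be written in a few lines of PyTorch:

import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head) projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / k.shape[-1] ** 0.5  # every token scores every other token
    weights = F.softmax(scores, dim=-1)    # attention weights sum to 1 per token
    return weights @ v                     # weighted mixture of value vectors

seq_len, d_model, d_head = 10, 64, 32
x = torch.randn(seq_len, d_model)           # 10 tokens (words or image patches)
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)      # shape: (10, 32)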

Multi-head Attention

By using multiple attention heads, transformers learn various relationships simultaneously, improving the model's ability to focus on different features of the image.

Transformers have outperformed traditional CNNs in certain vision tasks, especially those requiring understanding of global image contexts. This advancement improved such fields as image classification, object segmentation, and image generation.

Further Reading:

Latent Variable and Diffusion Models

Latent Variable Models are a class of generative models that introduce hidden variables to capture the underlying structure of the data. These models are often used to represent complex distributions in a simpler, compressed form, making them useful for generating data that resembles the training set.

Latent Variables

These are not directly observed but inferred from data. Variational Autoencoders (VAEs) use latent variables to represent the underlying data in a lower-dimensional space. This allows VAEs to generate new data samples by sampling from the latent space.

Diffusion Models

Diffusion Models are a newer class of generative models that represent data as a gradual transformation of noise into structured data. These models work by simulating a reverse diffusion process, starting with noise and iteratively refining it to produce clear and realistic outputs.

Diffusion Process

In this approach, data is corrupted by progressively adding noise, and the model learns to reverse this process to recover the original data. The iterative denoising allows for the generation of high-quality images from random noise.
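
Here is a minimal PyTorch sketch of the forward (noising) half of this process, using an assumed linear noise schedule; during training, a neural network learns to predict the added noise so that it can be removed step by step at generation time:

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # per-step noise levels (assumed schedule)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative fraction of signal kept

def q_sample(x0, t):
    """Produce the noised version x_t of a clean image x0 at step t."""
    noise = torch.randn_like(x0)
    x_t = alphas_bar[t].sqrt() * x0 + (1 - alphas_bar[t]).sqrt() * noise
    return x_t, noise  # the model's training target is to recover `noise` from x_t

x0 = torch.rand(1, 28, 28)       # toy "image"
x_mid, _ = q_sample(x0, t=100)   # lightly corrupted
x_late, _ = q_sample(x0, t=900)  # almost pure noise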

Diffusion models are highly effective for image synthesis, denoising, and image inpainting, and they have become competitive alternatives to GANs in various image generation tasks.

Further Reading:

As most diffusion models are quite big and difficult to run on local hardware, we will use open-source, free diffusion models for image generation. As it turns out, there are quite a lot of websites hosting diffusion models that can generate any image given a textual prompt; here is a Medium article about this. Though most of these websites allow only a limited number of tries for generating images, we will use one that has unlimited use for learning purposes.

Open the Stable Diffusion 2.1 Demo from Hugging Face:
Go to Stable Diffusion 2.1 Demo in your web browser.

Task 1: Test out the image generation ability of the model.

  • Start with the default example exercises at the bottom of the website.
  • Change the prompt and negative prompt next to the "Generate image" box.
  • Observe how it generates images.

Task 2: Use advanced settings.

  • Change the "Guidance Scale" slider under "Advanced settings". (It goes up from 0 to 50)
  • Observe how changing the "Guidance Scale" changes the results.
  • Why does this happen? Find out the idea behind prompt and negative prompt in diffusion models.

Task 3: Generate realistic images of people holding things.

  • Write this sentence into the prompt section: "A person holding an apple in their hand." (Without the quotes.)
  • Change the negative prompt and "Guidance Scale" however you see fit to create the most realistic looking picture possible. Try out at least 5 combinations.
  • Notice that for some reason, diffusion models are really bad at generating hands and fingers (For now, at least!). Try to find out why that is the case.

Submission:
Take a screenshot of each configuration of prompt and negative prompt with the corresponding generated pictures (at least 5!). Write short answers to the questions provided above. Add these to a file and present it during the exercise session.

The model that we used above is severely underpowered and a bit old at this point. Newer diffusion models like DALL-E-3 or Midjourney 6.1 can already produce images free of silly mistakes like weird hand positions. Now, we need to be more aware of the latest capabilities of these image generation models.

Task: Methods of identifying AI generated images.

  • Try out the DALL-E-3 model by writing this prompt on the ChatGPT website: "Create an image of a person holding an apple in their hand." (Without the quotes.)
  • Observe the differences between the pictures generated by this model and the Stable Diffusion model that you tried in the previous exercise.
  • Try out different combinations of prompts (at least 5) in ChatGPT, just like in the previous exercise. Do these pictures still show the same level of artifacts and visible mistakes as the previously generated pictures?

Submission:
Search for ways to detect whether an image has been created using generative AI. Try to find at least 3 articles about it on the internet. Write a small report (no more than 1 page) about the methods used to detect AI-generated images and how reliable they are. Add the articles you have read as references.

Applications of Generative AI

Generative AI has a wide range of applications across multiple industries, driving innovation in content creation, design, and data generation. Some of the key applications include:

Art and Design

Generative AI helps create original artworks, music, and designs by leveraging algorithms to blend creativity with machine learning. Tools like DeepArt allow artists to use AI as a creative partner.

Medical Imaging

AI-generated synthetic medical images are used to augment training datasets for healthcare applications, improving diagnostic models while protecting patient privacy.

Gaming and Virtual Worlds

Procedurally generated game environments and characters are becoming more common, offering endless variations and creativity in gaming.

Marketing and Advertising

AI models generate personalized content such as tailored advertisements, social media posts, and product designs, enhancing customer engagement through personalized experiences.

Natural Language Processing

Language is another example of an area where AI has a hard time, considering how easy it feels to us, as humans, to understand natural scenes or language. This is probably in part because we don't appreciate the hardness of tasks that even a child can perform without much apparent practice. (Of course, a child actually practices these skills very intensively, and in fact, evolution has trained us for millions of years, preconditioning our "neural networks" to process information in this form.)

The features that make language distinctive and different from many other data sources for which we may apply AI techniques include:

  1. Even though language follows an obvious linear structure where a piece of text has a beginning and an end, and the words in between are in a given order, it also has hierarchical structures including, e.g., text–paragraph–sentence–word relations, where each sentence, for example, forms a separate unit.
  2. Grammar constrains the possible (allowed) expressions in natural languages but still leaves room for ambiguity (unlike formal languages such as programming languages which must be unambiguous).
  3. Compared to digital signals such as audio or image, natural language data is much closer to semantics: the idea of a chair has a direct correspondence with the word 'chair' but any particular image of a chair will necessarily carry plenty of additional information such as the color, size, material of the chair and even the lighting conditions, the camera angle, etc, which are completely independent of the chair itself.

Generative AI in NLP

Generative AI in Natural Language Processing (NLP) has changed how machines understand and generate human language. NLP tasks, such as text generation, summarization, translation, and conversation, have significantly improved with the introduction of deep learning models that can generate coherent and contextually appropriate language.

At the heart of this transformation are transformer-based models, such as GPT (Generative Pre-trained Transformer), which leverage massive datasets and powerful neural networks to create natural language outputs. By predicting the next word or phrase based on preceding text, these models are able to produce human-like language with impressive accuracy and fluency.

Key generative models in NLP include GPT, BERT, and T5, each of which has advanced specific NLP tasks. These models rely heavily on attention mechanisms and self-supervision (where the training data is created automatically from unlabeled text), making them highly scalable.

Brief History of GPT: From GPT-1 to GPT-4

The Generative Pre-trained Transformer (GPT) series represents a breakthrough in language generation, with each iteration improving upon its predecessor in terms of scale, complexity, and performance. Here's a brief history of the evolution from GPT-1 to today's advanced models:

  • GPT-1 (2018): The original GPT model was based on the transformer architecture introduced by Vaswani et al. in 2017. GPT-1 was trained on a large corpus of unsupervised text data and could generate text by predicting the next word in a sequence. It had approximately 117 million parameters, which made it significantly more powerful than earlier recurrent neural network (RNN) models. While impressive, GPT-1 had limitations in terms of contextual understanding and coherence over long texts.
  • GPT-2 (2019): OpenAI introduced GPT-2, which scaled up the number of parameters to 1.5 billion, marking a significant jump in capability. GPT-2 could generate more coherent and contextually relevant paragraphs of text. Its ability to complete sentences and generate long-form content was a major improvement over GPT-1. However, concerns over the potential misuse of GPT-2 for disinformation led OpenAI to initially limit access to the full model.
  • GPT-3 (2020): GPT-3 represented another massive leap in scale, with 175 billion parameters. Its capabilities in understanding and generating natural language far surpassed previous models. GPT-3 was able to perform a variety of tasks, including answering questions, translating languages, summarizing text, and even generating code—all from a single, pre-trained model. GPT-3 demonstrated the power of few-shot learning, where it could adapt to new tasks with minimal examples.
  • GPT-4 (2023): Released in March 2023, GPT-4 further improved reliability and complex reasoning, making fewer mistakes than its predecessors, and added support for image inputs alongside text. OpenAI has not disclosed its parameter count; the model builds on advancements in architecture, dataset size, and optimization techniques that continue to push the boundaries of what language models can achieve.

Large Language Models (LLMs)

Large Language Models (LLMs), such as GPT-3 and beyond, have become a foundational tool in generative AI, thanks to their ability to process and generate natural language at an unprecedented scale. LLMs are pre-trained on vast corpora of text data, often drawn from books, websites, articles, and other sources. This pre-training allows these models to learn the statistical properties of language, enabling them to generate meaningful text and understand complex language patterns.

Key Characteristics of LLMs:

  • Scale: LLMs like GPT-3 contain billions of parameters, making them capable of handling a wide range of NLP tasks with minimal fine-tuning. The sheer size of these models allows them to generalize well across tasks and domains.
  • Pre-training and Fine-tuning: LLMs are pre-trained on general-purpose text data, which gives them broad language comprehension. They can be fine-tuned on specific tasks, such as sentiment analysis, summarization, or translation, by exposing them to smaller, task-specific datasets.
  • Few-shot and Zero-shot Learning: One of the most remarkable capabilities of LLMs is their ability to perform few-shot or even zero-shot learning. In few-shot learning, the model can perform a task after seeing only a few examples. In zero-shot learning, the model can tackle a task without any specific training, relying solely on its pre-trained knowledge.

Applications:

  • Chatbots and virtual assistants (e.g., OpenAI's ChatGPT, Microsoft's Cortana)
  • Automated content generation for blogs, news articles, and social media posts
  • Translation services, providing high-quality translations between multiple languages
  • Code generation, where the model can write software code from natural language descriptions (e.g., GitHub's Copilot)

LLMs represent a significant step toward more general-purpose AI, capable of understanding and generating text across various contexts and industries. However, their scale also raises ethical and practical concerns, including bias, misinformation, and the environmental costs of training such large models.

Further Reading:

Exploring NanoGPT for Text Generation

In this section, you'll get hands-on experience running the LLM models you've been learning about. For this task, we'll use a custom, lightweight local LLM model that's small enough to run on a personal laptop (though you're welcome to try out larger models if you have the hardware!). This approach gives you a clearer understanding of how models like ChatGPT generate answers and highlights the significant computing cost associated with running them.

You've been approached by a group of colleagues in your AI research group who are investigating the performance and practicality of locally run GPT models. They're specifically interested in models that are fully transparent and understandable from a code perspective. To help with this, they've recommended using the NanoGPT repository by Andrej Karpathy, which is a small but complete GPT implementation, written from scratch. Your mission is to explore this model by running it locally on your machine and experimenting with it to better understand its capabilities.

  • Clone the NanoGPT Repository: Follow the instructions on the NanoGPT GitHub repository to clone the code to your local machine.
  • Run the Pre-trained Model: The repository comes with a small pre-trained GPT model on a Shakespearean dataset.
  • Your task is to generate text samples using this pre-trained model. You can do this by running the provided script that generates text from the Shakespeare model.

Submission:
Share the generated text output from your run (minimum 100 characters).

Tips:

  • Make sure you have all necessary dependencies installed, such as Python and PyTorch, as described in the repo.

In the previous exercise you were introduced to NanoGPT and got a chance to get familiar with it. In this exercise, you will fine-tune the NanoGPT model with your own data.

  • Prepare Your Own Text: Select a text dataset of your choice. It can be a personal collection of articles, blog posts, stories, or any form of coherent text (minimum 1,000 words).
  • Fine-Tune the NanoGPT Model: Follow the instructions in the repository to fine-tune the GPT model using your own dataset. Adjust the hyperparameters, if necessary, to ensure that the model learns effectively from your data.
  • Generate New Text with a Prompt: Once the model is fine-tuned, generate new text by providing a custom prompt related to your dataset.

Submission:
Submit the prompt you used and the new text generated by your fine-tuned model (minimum 100 characters).

Tips:

  • Make sure you have all necessary dependencies installed, such as Python and PyTorch, as described in the repo.
  • Fine-tuning might take time, so choose your dataset wisely for faster experimentation.
  • Feel free to adjust the parameters during fine-tuning to see how the model's performance changes.

Historical Section

To understand how we got to today's powerful AI methods, let's look back at some earlier techniques in pattern recognition for images and language processing. These approaches laid the groundwork for the AI models we now use.

  • Image Pattern Recognition with SIFT and SURF: Before deep learning became popular, methods like SIFT (Scale-Invariant Feature Transform) and SURF (Speeded Up Robust Features) were widely used to identify features in images. These methods helped detect unique parts of an image that could be used to compare it to other images. For example, they could pick out edges, corners, and textures that stand out, making it possible to match similar images or recognize objects across different images. Even though CNNs (Convolutional Neural Networks) are now the go-to for image recognition, SIFT and SURF remain valuable tools for understanding why certain features are chosen during matching, which can be useful in certain applications.
  • Natural Language Processing (NLP) with Formal Language Algorithms: In natural language processing, early methods relied on formal language algorithms, like the CYK (Cocke-Younger-Kasami) algorithm, to parse sentences and understand language structure. The CYK algorithm, for instance, helped break down sentences into their grammatical parts, giving insights into sentence structure and meaning. This approach, known as parsing, still influences the design of modern language models (LLMs) by helping these models understand syntax and grammar, even though the methods have evolved.

These foundational methods offered more interpretable results, which is helpful in understanding how patterns are identified. Although they have largely been replaced by newer techniques, their legacy lives on in modern AI, where transparency and interpretability are increasingly valued. In this section, we give a short description of these historical methods.

SIFT and SURF

Good examples of commonly used feature extraction techniques for image data include Scale-Invariant Feature Transform (SIFT) and Speeded-Up Robust Features (SURF). We take a closer look at the SURF technique. It can be broken into two or three stages:

  1. Choosing interest points: We identify a set of (x,y) points by maximizing the determinant of a so-called Hessian matrix that characterizes the local intensity variation around the point. The interest points tend to be located at "blobs" in the image. (You are not required to decode the details of this -- a rough idea is enough.)
  2. Feature descriptors: The intensity variation around each interest point is described by a feature descriptor which is a 64-dimensional vector (array of 64 real values). In addition, the feature has a scale (size) and an orientation (dominant direction).
  3. Feature matching: If the problem is to match objects between two images, then both of the above tasks are performed on both images. We then find pairs of feature descriptors (one extracted from image A and the other from image B) that are sufficiently close to each other in Euclidean distance.



Above is an example of detected features: interest point location, scale, and orientation shown by the circles, and the type of contrast shown by the color, i.e., dark on light background (red) or light on dark background (blue). Image credit: NASA/Jet Propulsion Laboratory/University of Arizona. Image from the edge of the south polar residual cap on Mars.
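
If you want to try the pipeline yourself, OpenCV's freely available SIFT implementation follows the same interest point / descriptor / matching stages (SURF itself is patented and not shipped in default OpenCV builds). The filenames below are placeholders, and the 0.75 ratio-test threshold is a common but arbitrary choice:

import cv2

img_a = cv2.imread("imageA.png", cv2.IMREAD_GRAYSCALE)  # placeholder filenames
img_b = cv2.imread("imageB.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp_a, desc_a = sift.detectAndCompute(img_a, None)  # interest points + descriptors
kp_b, desc_b = sift.detectAndCompute(img_b, None)

# Match descriptors by Euclidean distance; Lowe's ratio test keeps a match
# only if it is clearly better than the second-best candidate
matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(desc_a, desc_b, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(len(good), "plausible matches")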

Formal languages and grammars

Natural language processing (NLP) applies many concepts from theoretical computer science which are applicable to the study of formal languages. A formal language is generally a set of strings. For example, the set of valid mathematical expressions is a formal language that includes the string (1+2)×6–3 but not the string 1+(+×5(. A (formal) grammar is a set of symbol manipulation rules that enables the generation of the valid strings (or "sentences") in the language.

An important subclass of formal languages is context-free languages. A context-free language is defined by a context-free grammar, which is a grammar with rules of the form V⟶w, where V is a non-terminal symbol, and w is a sequence of non-terminal or terminal symbols. The context-free nature of the rules means that such a rule can be applied independently of the surrounding context, i.e., the symbols before or after V.

An example will hopefully make the matter clear. Let S be a start symbol, which is also the only non-terminal symbol in the grammar. Let the set of terminal symbols be ()[]. The set of rules is given by

S ⟶ SS
S ⟶
S ⟶ (S)
S ⟶ [S]

The language defined by such a grammar is the set of strings that can be generated by repeated application of any of the above rules, starting with the start symbol S.

For example, the string () can be generated by first applying the rule S ⟶ (S) to the start symbol, whereupon we obtain the intermediate string (S). Applying the empty rule S ⟶ to the symbol S in the intermediate string then yields the final outcome ().

As another example, the string (())[] also belongs to the language. Can you figure out a sequence of rules to generate it from the start symbol S? Notice that each application of a rule only applies to a single symbol in the intermediate string. So for example, if you apply the rule S ⟶ (S) to the intermediate string SS, only one of the symbols gets parentheses around it (you are free to choose which one): S(S) or (S)S. If you want to apply the same rule to both symbols, you can, you just have to apply the same rule twice, which gives (S)(S).
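
To see rule application in action, here is a tiny Python sketch that generates random strings of this language by repeatedly rewriting the leftmost S with a randomly chosen rule; the final replace simply applies the empty rule S ⟶ to any remaining symbols:

import random

RULES = ["SS", "", "(S)", "[S]"]  # the right-hand sides of the four rules for S

def generate(max_steps=20):
    s = "S"
    for _ in range(max_steps):
        if "S" not in s:
            break
        s = s.replace("S", random.choice(RULES), 1)  # rewrite the leftmost S
    return s.replace("S", "")  # apply the empty rule to any remaining symbols

print([generate() for _ in range(5)])  # e.g. ['()', '', '[[]]()', ...]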

Try working out a sequence of rules to obtain (())[]. To check the correct answer, you can look at the example solution below but don't look before you've tried it by yourself.

Parsing

Efficient algorithms for deciding whether a given string belongs to a given context-free language exist. They are based on parsing, which means the construction of a parse tree. The parse tree represents the hierarchical structure of the string. An example parse tree is shown below.



In the above tree, the sentence "she eats a fish with a fork" has two main components: "she" and the rest of the sentence. The rest of the sentence likewise has two parts: "eats" and "a fish with a fork". The interpretation of this parse tree is that the object of eating is a fish that has a fork. (A possible but rather unusual scenario as fish rarely have forks.)

The above parse tree demonstrates the ambiguity of natural language, which is a challenge for AI applications. A potential solution in such scenarios is to use the semantics of the words and attach word-dependent probabilities to different rules. This could, for instance, tell the parser that eating something with a fork is a more probable construction than a fish that has a fork. This method is called lexicalized parsing.

CYK algorithm

The Cocke–Younger–Kasami (CYK) algorithm is an O(mn³) time algorithm for parsing context-free languages. Here n refers to the number of words in the sentence, and m is the number of rules in the grammar. It is based on dynamic programming, where the sentence is parsed by first processing short substrings (sub-sentences) that form a part of the whole, and combining the results to obtain a parse tree for the whole sentence.

To present the idea of the CYK algorithm, we'll start with an example. In our naive example grammar, we have the following set of rules:

S ⟶ NP VP
VP ⟶ V NP
VP ⟶ VP ADV
VP ⟶ V
NP ⟶ N
V ⟶ fish
N ⟶ Robots
N ⟶ fish
ADV ⟶ today

The starting symbol is again S, which stands for both 'start' as well as 'sentence'. NP stands for noun-phrase, VP for verb-phrase, V and N are of course verb and noun respectively, and finally ADV stands for adverb (a specifier that says how or when something is done).

The terminal symbols — note that here we consider words as atomic symbols, and not sequences of characters — are 'fish', 'Robots', and 'today'. Also note that the word 'fish' is both a verb (V) and a noun (N).

CYK table

A fundamental data structure in CYK is a triangular table that is usually depicted as a lower-triangular (or upper-triangular) matrix of size n × n. The completed table for our small example is below. This would be the result of running the CYK algorithm with the sentence "Robots fish fish today" as input.

(1,4): S
(1,3): S           (2,4): S,VP
(1,2): S           (2,3): S,VP           (3,4): VP
(1,1): N,NP        (2,2): V,N,VP,NP      (3,3): V,N,VP,NP      (4,4): ADV
Robots             fish                  fish                  today

First of all, pay attention to the numbering scheme of the table cells. Each cell is labeled with a pair such as (1,4) or (1,3). The numbering may be slightly confusing since it doesn't follow the more usual (row, column) numbering. Instead, the numbers give the start and the end of the sub-sentence that the cell corresponds to. For example, the cell (1,4) refers to the whole sentence, i.e., words 1–4, while the cell (2,3) refers to the words fish fish, i.e., words 2–3.

After the algorithm is completed, the contents of the cells in the CYK table correspond to the possible interpretations (if any) of the words in question in terms of different non-terminals. For example, the words fish fish are seen to have two interpretations: as a sentence (S) and as a verb-phrase (VP). This is because we can generate these two words by starting either from S or VP.

Just to keep you on your toes, let's make sure you understand what we mean when we say that fish fish can be generated from S and VP. That is, give a list of rules to generate these two words starting from S, and then do the same starting from VP.

Starting from S, we first apply the only rule that has S on the left, i.e., S ⟶ NP VP. Then apply NP ⟶ N to obtain N VP, and VP ⟶ V, to obtain N V. Finally, apply N ⟶ fish and V ⟶ fish to obtain "fish fish".

Starting from VP, apply VP ⟶ V NP. Then, apply V ⟶ fish, NP ⟶ N, and finally N ⟶ fish.

The interpretations of "fish fish" as a sentence (S) and a verb-phrase (VP) are quite different. In the former, the first 'fish' is generated through NP and N, i.e., the first word is considered a noun, while the second 'fish' is generated through VP and V, i.e., it is considered a verb. So there are some sea creatures (the first 'fish') that try to catch other sea creatures. On the other hand, in the VP interpretation, the first 'fish' is generated through V, so it is a verb, and the second 'fish' is a noun. This time the interpretation is that what is going on is an attempt to catch some sea creatures but it is not explicated who is fishing since the words only form a verb-phrase that can be combined with a subject noun.

The other elements in the table have similar interpretations. The above example is actually somewhat unusual in the sense that all sub-sentences can be parsed and therefore, all cells in the table contain at least one non-terminal symbol. This is by no means a general property. For example, "Robots Robots Robots" wouldn't have an interpretation as a valid sub-sentence in a grammatical sentence, and the table for a sentence containing these words (consecutively) would have empty cells.

Operation of the algorithm

The CYK algorithm operates by processing one row of the table at a time, starting from the bottom row which corresponds to one-word sub-sentences, and working its way to the top node that corresponds to the whole sentence. This shows the dynamic programming nature of the algorithm: the short sub-sentences are parsed first and used in parsing the longer sub-sentences and whole sentence.

More specifically, the algorithm tries to find on each row any rules that could have generated the corresponding symbols. On the bottom row, we can start by finding rules that could have generated each individual word. So for example, we notice that the word 'Robots' could have been generated (only) by the rule N ⟶ Robots, and so we add N into the cell (1,1). Likewise we add both V and N into the cells (2,2) and (3,3) because the word 'fish' could have been generated from either V or N. At this stage, the CYK table would look as follows:

(1,4):
(1,3):             (2,4):
(1,2):             (2,3):                (3,4):
(1,1): N           (2,2): V,N            (3,3): V,N            (4,4): ADV
Robots             fish                  fish                  today

The same procedure is repeated as long as there are rules that we haven't applied yet in a given cell. This implies that we will also apply the rule NP ⟶ N in the cell (1,1) after having added the symbol N in it, etc.

Only after we have made sure that no other rules can be applied on the bottom row, we move one row up. Now we will also consider binary rules where the right side has two symbols. A binary rule can be applied in a given cell if the two symbols on the right side of the rule are found in cells that correspond to splitting the sub-sentence in the current cell into two consecutive parts. This means that in cell (1,2) for example, we can split the sub-sentence 'Robots fish' into two parts that correspond to cell (1,1) and (2,2) respectively. As these two cells contain the symbols NP and VP, and the rule S ⟶ NP VP contains exactly these symbols on the right side, we can add the symbol S on the left side of the rule into the cell (1,2).

(1,4):
(1,3):             (2,4):
(1,2): S           (2,3):                (3,4):
(1,1): N,NP        (2,2): V,N,VP,NP      (3,3): V,N,VP,NP      (4,4): ADV
Robots             fish                  fish                  today

This process is iterated until again we can't apply more rules on the second row from the bottom, after which we move one level up. This is repeated on all levels until we reach the top node and the algorithm terminates.

Pseudocode for the CYK algorithm is as follows:

1:  T = n x n array of empty lists
2:  for i in 1,...,n:
3:     T[i,i] = [word[i]]
4:  for len = 0,...,n-1 and i = 1,...,n-len:
5:     while rule can be applied:
6:        if x ⟶ T[i,i+len] for some x:
7:           add x in T[i,i+len]
8:        if x ⟶ T[i,i+k] T[i+k+1,i+len] for some k,x:
9:           add x in T[i,i+len]

The variable len refers to the length of the sub-sentence being processed, in terms of the distance between its start and end. So for example, in the first iteration of the loop on lines 4–9, we process the individual words in cells (1,1), (2,2), ..., (n,n), and so len = 0.

The if statements on lines 6 and 8 check whether there exists a rule where the symbol(s) on the right side are found in the specific cells given in the pseudocode (T[i,i+len] on line 6, or T[i,i+k] and T[i+k+1,i+len] on line 8). If this is the case, then the symbol x on the left side of the rule is added into cell T[i,i+len]. This means that the sub-sentence (i,i+len) can be generated from symbol x.
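
To connect the pseudocode to something executable, here is a direct Python implementation (a teaching sketch, not an optimized parser) for the toy grammar above, with cells indexed (i, j) exactly as in the CYK table:

RULES = [
    ("S", ("NP", "VP")),
    ("VP", ("V", "NP")),
    ("VP", ("VP", "ADV")),
    ("VP", ("V",)),
    ("NP", ("N",)),
    ("V", ("fish",)),
    ("N", ("Robots",)),
    ("N", ("fish",)),
    ("ADV", ("today",)),
]

def cyk(words):
    n = len(words)
    # T[(i, j)] = symbols that can generate words i..j (1-indexed, inclusive)
    T = {(i, j): set() for i in range(1, n + 1) for j in range(i, n + 1)}
    for i, w in enumerate(words, start=1):
        T[(i, i)].add(w)  # the word itself acts as a terminal symbol
    for length in range(0, n):          # bottom row first, then longer spans
        for i in range(1, n - length + 1):
            j = i + length
            changed = True
            while changed:              # apply rules until nothing new appears
                changed = False
                for left, right in RULES:
                    if len(right) == 1 and right[0] in T[(i, j)] and left not in T[(i, j)]:
                        T[(i, j)].add(left)
                        changed = True
                    elif len(right) == 2:
                        for k in range(i, j):   # all splits into (i..k), (k+1..j)
                            if (right[0] in T[(i, k)] and right[1] in T[(k + 1, j)]
                                    and left not in T[(i, j)]):
                                T[(i, j)].add(left)
                                changed = True
    return T

table = cyk("Robots fish fish today".split())
print(sorted(table[(1, 4)]))  # ['S']: the whole sentence is grammatical
print(sorted(table[(2, 3)]))  # ['S', 'VP']: 'fish fish' has two interpretations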

Constructing parse trees from the CYK table

What is not shown in the pseudocode is that we should also keep track of how we obtained each entry in the CYK table. Each time we apply a rule, we should therefore store the symbol(s) in the other cell(s) where the symbols on the right side of the rule were found. For example, in the very last step of the algorithm in the above example, we would store that the symbol S was obtained because we found NP in cell (1,1) and VP in cell (2,4), etc. Note that it is important to store references not only to the cell but also to the individual symbols that were used.

If there are multiple ways to obtain a given symbol in a given cell, then we store information about each possible way as a separate set of references. These different ways will correspond to different ways to parse the sentence.

Having stored the references to the symbols that enabled us to obtain the symbols in the higher rows of the table, we can reconstruct parse trees by traversing the references between the symbols, starting from the symbol S in the top node (if one exists; if there is no S in the top node, the sentence is invalid/not grammatical, and it doesn't belong to the language at all).

In the above example there should be only one possible parse tree, but the lecture slides and the next exercise show cases where there is more than one.
