Understanding Few-Shot Learning in Computer Vision: What You Need to Know

Since the first convolutional neural network (CNN) algorithms were created, they have drastically improved deep learning performance on computer vision (CV) tasks.

In 2015, Microsoft reported that their model was actually better than humans at classifying images from the ImageNet dataset.[1]

Nowadays, computers are unmatched when it comes to using billions of images to solve a specific task. Still, in the real world, you can rarely build or find a dataset with that many samples.

How do we overcome this problem? If we’re talking about a CV task, we can use data augmentation (DA), or collect and label additional data.

DA is a powerful tool and might be a big part of the solution. Labeling additional samples is a time-consuming and expensive task, but it does deliver better results.

If the dataset is really small, neither of these techniques may help us. Imagine a task where we need to build a classifier with only one or two samples per class, and each sample is super difficult to find.

This would call for innovative approaches. Few-Shot Learning (FSL) is one of them.

In this article we’ll cover:

What Few-Shot Learning is – definition, purpose, and FSL problem example
Few-Shot Learning variations – N-Shot Learning, Few-Shot Learning, One-Shot Learning, Zero-Shot Learning
Few-Shot Learning approaches – Meta-Learning, Data-level, Parameter-level
Meta-Learning algorithm – definition, Metric-Learning, Gradient-Based Meta-Learning
Algorithms for Few-Shot image classification – Model-Agnostic Meta-Learning, Matching, Prototypical and Relation Networks
Few-Shot Object Detection – YOLOMAML

What is Few-Shot learning?

Few-Shot Learning is a sub-area of machine learning. It’s about classifying new data when you have only a few training samples with supervised information.

FSL is a rather young area that needs more research and refinement. As of today, you can use it in CV tasks. A computer vision model can work quite well with relatively few training samples. Throughout this article, we’ll be focusing on FSL in computer vision.

For example: say we work in healthcare and have a problem with categorizing bone illnesses via x-ray photos.

Some rare pathologies might lack enough images to be used in the training set. This is exactly the type of problem that can be solved by building an FSL classifier.

Few-Shot variations

Let’s take a look at different variations and extreme cases of FSL. In general, researchers identify four types:

N-Shot Learning (NSL)
Few-Shot Learning
One-Shot Learning (OSL)
Less than one or Zero-Shot Learning (ZSL)

When we’re talking about FSL, we usually mean N-way-K-Shot-classification.

N stands for the number of classes, and K for the number of samples from each class to train on.

N-Shot Learning is seen as a broader concept than all the others. It means that Few-Shot, One-Shot, and Zero-Shot Learning are sub-fields of NSL.

Zero-Shot

To me, ZSL is the most interesting. The goal of Zero-Shot Learning is to classify unseen classes without any training examples.

It may seem a little crazy, but think about it this way: can you classify an object without even seeing it? If you have the general idea of an object, its appearance, properties, and functionality, it shouldn’t be a problem.

This is the approach that you use when doing ZSL and, according to current trends, Zero-Shot Learning will soon become more effective.

One-Shot and Few-Shot

By this point, you probably see the general concept, so it’ll be no surprise that in One-Shot Learning, we only have a single sample of each class. Few-Shot Learning has two to five samples per class, making it just a more flexible version of OSL.

When we talk about the overall concept, we use the Few-Shot Learning term. But this area is quite young, so people will use these terms differently. Keep that in mind when you’re reading articles.

Few-Shot learning approaches

All right, time to move to a more practical field and talk about different Few-Shot Learning problem approaches.

First of all, let’s define an N-way-K-Shot-classification problem. Imagine that we have:

A training (support) set that consists of:
1. N class labels
2. K labeled images for each class (a small amount, less than ten samples per class)
Q query images

We want to classify Q query images among the N classes. The N * K samples in the training set are the only examples that we have. The main problem here is not enough training data. [1]

[Figure: the Few-Shot classification problem. Source: “Few-Shot Image Classification with Meta-Learning”]

The first and most obvious step in an FSL task is to gain experience from other, similar problems. This is why Few-Shot Learning is characterized as a Meta-Learning problem.

Let’s make this clear: in a traditional classification problem, we try to learn how to classify from the training data, and evaluate using test data.

In Meta-Learning, we learn how to learn to classify given a set of training data. We use what was learned on one set of classification problems to solve other, unrelated ones.

Generally, there are two approaches that you should consider when solving FSL problems:

Data-level approach (DLA)
Parameter-level approach (PLA)

Data-level approach

This approach is really simple. It’s based on the concept that if you don’t have enough data to build a reliable model and avoid overfitting and underfitting, you should simply add more data.

That is why many FSL problems are solved by using additional information from a large base-dataset. The key feature of the base-dataset is that it doesn’t have classes that we have in our support-set for the Few-Shot task. For example, if we want to classify a specific bird species, the base-dataset can have images of many other birds.

We can also produce more data ourselves. To reach this goal, we can use data augmentation, or even generative adversarial networks (GANs).
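As a minimal sketch of the data augmentation idea, assume images are small grayscale grids stored as nested lists (a stand-in for real image tensors) and that flipping or shifting an image does not change its label. The function names below are illustrative and not taken from any library.

```python
import random

def horizontal_flip(image):
    """Mirror each row of a 2D image (list of rows)."""
    return [list(reversed(row)) for row in image]

def shift_right(image, pixels=1, fill=0):
    """Shift the image to the right, padding the left edge with `fill`."""
    width = len(image[0])
    pixels = min(pixels, width)
    return [[fill] * pixels + row[: width - pixels] for row in image]

def augment(image, n_variants=4, seed=0):
    """Return `n_variants` randomly transformed copies of `image`."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n_variants):
        out = [list(row) for row in image]
        if rng.random() < 0.5:
            out = horizontal_flip(out)
        # Shift by zero or one pixel, chosen at random.
        out = shift_right(out, pixels=rng.randint(0, 1))
        variants.append(out)
    return variants

image = [
    [0, 1, 2],
    [3, 4, 5],
]
augmented = augment(image)  # four variants with the same shape as `image`
```

Each variant would be added to the support set under the label of the original image. Which transformations are safe depends on the task: a flip that is harmless for bird photos might not be for x-rays where left and right matter.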

Parameter-level approach

From the parameter-level point of view, it’s quite easy to overfit on Few-Shot Learning samples, because the few available samples often lie in extensive, high-dimensional spaces.

To overcome this problem, we should limit the parameter space and use regularization and proper loss functions. This helps the model generalize from the limited number of training samples.

On the other hand, we can enhance model performance by directing it to the extensive parameter space. If we use a standard optimization algorithm, it might not give reliable results because of the small amount of training data.

That is why on the parameter-level we train our model to find the best route in the parameter space to give optimal prediction results. As we have already mentioned above, this technique is called Meta-Learning.

Meta-Learning algorithm

In the classic paradigm, when we have a specific task, an algorithm is learning if its task performance improves with experience. In the Meta-Learning paradigm, we have a set of tasks. An algorithm is learning to learn if its performance at each task improves with experience and with the number of tasks. This algorithm is called a Meta-Learning algorithm.

Imagine that we have a test task TEST. We will train our Meta-Learning algorithm on a batch of training tasks TRAIN. Training experience gained from attempting to solve TRAIN tasks will be used to solve the TEST task.

Solving an FSL task has a set sequence of steps. Imagine we have a classification problem as we mentioned before. To start, we need to choose a base-dataset. Choosing a base-dataset is crucial. You want to pick a good one, so be careful.

Right now we have the N-way-K-Shot-classification problem (let’s name it TEST) and a large base-dataset that we’ll use as a Meta-Learning training set (TRAIN).

The whole Meta-Training process will have a finite number of episodes. We form an episode like this:

From TRAIN, we sample N classes and K support-set images per class, along with Q query images. This way, we form a classification task that’s similar to our ultimate TEST task.

At the end of each episode, the parameters of the model are updated to maximize the accuracy on the Q images from the query set. This is where our model learns the ability to solve an unseen classification problem. [1]
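Forming an episode can be sketched as follows. This is an illustration under a few assumptions: the base-dataset is a dictionary mapping each class label to a list of samples, the samples are placeholder strings rather than images, and `n_query` counts query samples per class (the text above uses Q for the total). The name `sample_episode` is hypothetical.

```python
import random

def sample_episode(dataset, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot episode from a base-dataset.

    `dataset` maps a class label to a list of samples.
    Returns (support, query) as lists of (sample, label) pairs,
    where `n_query` is the number of query samples per class.
    """
    classes = rng.sample(sorted(dataset), n_way)
    support, query = [], []
    for label in classes:
        # Draw support and query samples together so they never overlap.
        picked = rng.sample(dataset[label], k_shot + n_query)
        support += [(s, label) for s in picked[:k_shot]]
        query += [(s, label) for s in picked[k_shot:]]
    return support, query

# Toy base-dataset: 10 classes with 20 placeholder "images" each.
base_dataset = {
    f"class_{c}": [f"img_{c}_{i}" for i in range(20)] for c in range(10)
}
rng = random.Random(0)
support, query = sample_episode(
    base_dataset, n_way=5, k_shot=2, n_query=3, rng=rng
)
```

Here a 5-way 2-shot episode yields 10 support samples and 15 query samples. A Meta-Training loop would call `sample_episode` once per episode, so the model sees a different classification task every time.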

The overall efficiency of the model is measured by its accuracy on the TEST classification task.

[Figure: the Meta-Learning training process. Source: “Few-Shot Image Classification with Meta-Learning”]

In recent years, researchers published many Meta-Learning algorithms for solving FSL classification problems. All of them can be divided into two large groups: Metric-Learning and Gradient-Based Meta-Learning algorithms.

Metric-Learning

When we talk about Metric-Learning, we usually refer to the technique of learning a distance function over objects.

In general, Metric-Learning algorithms learn to compare data samples. In the case of a Few-Shot classification problem, they classify query samples based on their similarity to the support samples.

As you might have already guessed, if we’re working with images, we basically train a convolutional neural network to output an image embedding vector, which is later compared to other embeddings to predict the class.

Gradient-Based Meta-Learning

For the Gradient-Based approach, you need to build a meta-learner and a base-learner.

Meta-learner is a model that learns across episodes, whereas a base-learner is a model that is initialized and trained inside each episode by the meta-learner.

Imagine an episode of Meta-training with some classification task defined by a support-set of N * K images and a query set of Q images:

We choose a meta-learner model,
Episode is started,
We initialize the base-learner (typically a CNN classifier),
We train it on the support-set (the exact algorithm used to train the base-learner is defined by the meta-learner),
Base-learner predicts the classes on the query set,
Meta-learner parameters are trained on the loss resulting from the classification error,
From this point, the pipeline may differ based on your choice of meta-learner. [1]

Algorithms for Few-Shot image classification

This section comes from “Few-Shot Image Classification with Meta-Learning”, written by Etienne Bennequin.

From the general picture, let’s move on to the specific Meta-Learning algorithms that are used to solve Few-Shot Learning image classification problems.

In this section we’ll cover:

Model-Agnostic Meta-Learning (MAML)
Matching Networks
Prototypical Networks
Relation Network

Model-Agnostic Meta-Learning

MAML is based on the Gradient-Based Meta-Learning (GBML) concept. As we’ve already figured out, GBML is about the meta-learner acquiring prior experience from training the base-model and learning the common feature representations of all tasks.

Whenever there is a new task to learn, the meta-learner with its prior experience will be fine-tuned a little bit using the small amount of new training data brought by the new task.

Still, we don’t want to start from a random parameter initialization. If we do so, our algorithm will not converge to good performance after a few updates.

MAML aims to solve this problem.

MAML provides a good initialization of a meta-learner’s parameters to achieve optimal fast learning on a new task with only a small number of gradient steps while avoiding overfitting that may happen when using a small dataset.

Here is how it’s done:

The meta-learner creates a copy of itself (C) at the beginning of each episode,
C is trained on the episode (just as we have previously discussed, with the help of base-model),
C makes predictions on the query set,
The loss computed from these predictions is used to update the meta-learner itself (the original parameters, not the copy C),
This continues until you’ve trained on all episodes.
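To make the two nested updates concrete, here is a toy sketch of MAML on a deliberately tiny problem. It assumes scalar regression tasks of the form y = a * x and a one-parameter model y = w * x instead of an image classifier, so that the gradient through the inner update can be written out by hand. The learning rates, task distribution, and function names are all illustrative choices.

```python
import random

def grad(w, data):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return 2 * sum(x * (w * x - y) for x, y in data) / len(data)

def hessian(data):
    """Second derivative of the same loss (constant in w for this model)."""
    return 2 * sum(x * x for x, _ in data) / len(data)

def maml_step(w, tasks, inner_lr=0.05, outer_lr=0.05):
    """One meta-update over a batch of (support, query) tasks."""
    meta_grad = 0.0
    for support, query in tasks:
        # Inner loop: a copy of the parameter is adapted on the support set.
        w_copy = w - inner_lr * grad(w, support)
        # Outer loop: query loss of the adapted copy, differentiated with
        # respect to the ORIGINAL parameter via the chain rule.
        meta_grad += grad(w_copy, query) * (1 - inner_lr * hessian(support))
    return w - outer_lr * meta_grad / len(tasks)

def make_task(rng, n_support=5, n_query=5):
    """A task is a random slope a; samples follow y = a * x."""
    a = rng.uniform(1.0, 3.0)
    xs = [rng.uniform(-1, 1) for _ in range(n_support + n_query)]
    data = [(x, a * x) for x in xs]
    return data[:n_support], data[n_support:]

rng = random.Random(0)
w = 0.0  # meta-learned initialization
for _ in range(500):
    w = maml_step(w, [make_task(rng) for _ in range(4)])
```

After meta-training, `w` settles near the middle of the slope range, which is an initialization from which a single gradient step on any task's support set lands close to that task's slope. With a neural network, the hand-written `hessian` factor would be replaced by automatic differentiation through the inner update.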

[Figure: the MAML algorithm]

The greatest advantage of this technique is that it’s conceived to be agnostic to the choice of model: it can be applied to any model trained with gradient descent. Thus, the MAML method is widely used with many machine learning algorithms that need fast adaptation, especially Deep Neural Networks.

Matching Networks

Matching Networks (MN) was one of the first Metric-Learning algorithms designed to solve FSL problems.

For the Matching Networks algorithm, you need to use a large base-dataset to solve a Few-Shot Learning task. As shown above, this dataset is split into episodes. After that, for each episode, Matching Networks apply the following procedure:

Each image from the support and the query set is fed to a CNN that outputs embeddings for them,
Each query image is classified using a softmax over the cosine similarities between its embedding and the support-set embeddings,
The Cross-Entropy Loss on the resulting classification is backpropagated through the CNN.

This way, Matching Networks learn to compute image embeddings. This approach allows MN to classify images with no specific prior knowledge of classes. Everything is done simply by comparing different instances of the classes.
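The classification step of this procedure can be sketched in a few lines. The sketch assumes the embeddings have already been produced by a CNN (here they are hand-written 2D vectors), and it leaves out backpropagation, which a deep learning framework would handle. The function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matching_probs(query_emb, support):
    """Class probabilities for one query embedding.

    `support` is a list of (embedding, label) pairs.
    """
    attention = softmax(
        [cosine_similarity(query_emb, emb) for emb, _ in support]
    )
    probs = {}
    for weight, (_, label) in zip(attention, support):
        # Each support sample votes for its own class.
        probs[label] = probs.get(label, 0.0) + weight
    return probs

def cross_entropy(probs, true_label):
    return -math.log(probs[true_label])

# Placeholder embeddings for a 2-way 2-shot episode.
support = [
    ([1.0, 0.1], "cat"),
    ([0.9, 0.2], "cat"),
    ([0.1, 1.0], "dog"),
    ([0.2, 0.8], "dog"),
]
query_emb, query_label = [0.8, 0.3], "cat"
probs = matching_probs(query_emb, support)
loss = cross_entropy(probs, query_label)
```

In training, `loss` would be averaged over all query images of the episode and used to update the CNN that produced the embeddings.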

Since the classes are different in every episode, Matching Networks compute features of the images that are relevant to discriminate between classes. On the contrary, in the case of a standard classification, the algorithm learns the features that are specific to each class.

It’s worth mentioning that the authors actually proposed some improvements to the initial algorithm. For example, they augmented it with a bidirectional LSTM, so that the embedding of each image depends on the embeddings of the others.

All improvement proposals may be found in their initial article. Still, you must remember that improving the performance of the algorithm might make the computational time longer.

Prototypical Networks

Prototypical Networks (PN) are similar to Matching Networks. Still, there are small differences that help to enhance the algorithm’s performance. PN actually obtains better results than MN.

The PN process is essentially the same, but the query image embeddings are not compared to every image embedding from the support set. Instead, Prototypical Networks propose an alternative approach.

In PN, you need to form class prototypes. They are basically class embeddings formed by averaging the embeddings of images from this class. The query image embeddings are then compared only to these class prototypes.

It’s worth mentioning that in the case of a One-Shot Learning problem, the algorithm is similar to Matching Networks.

Also, PN uses Euclidean distance instead of cosine distance. It’s seen as a major part of the algorithm’s improvements.
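A minimal sketch of the prototype idea, again assuming precomputed embeddings given as plain lists. It uses the squared Euclidean distance and turns negative distances into probabilities with a softmax, which is one common way to do it; the function names are illustrative.

```python
import math

def class_prototypes(support):
    """Average the support embeddings of each class."""
    sums, counts = {}, {}
    for emb, label in support:
        if label not in sums:
            sums[label] = [0.0] * len(emb)
            counts[label] = 0
        sums[label] = [s + e for s, e in zip(sums[label], emb)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in sums[label]] for label in sums
    }

def squared_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def prototypical_probs(query_emb, prototypes):
    """Softmax over negative squared distances to each prototype."""
    labels = list(prototypes)
    logits = [-squared_euclidean(query_emb, prototypes[l]) for l in labels]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return {l: e / total for l, e in zip(labels, exps)}

# Placeholder embeddings for a 2-way 2-shot episode.
support = [
    ([1.0, 0.1], "cat"),
    ([0.9, 0.2], "cat"),
    ([0.1, 1.0], "dog"),
    ([0.2, 0.8], "dog"),
]
prototypes = class_prototypes(support)
probs = prototypical_probs([0.8, 0.3], prototypes)
```

The query is compared with one prototype per class instead of every support image, so the number of comparisons per query is N rather than N * K. With K = 1, each prototype is just the single support embedding of its class.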

Relation Network

All experiments carried out to build Matching and Prototypical Networks actually led to the creation of the Relation Network (RN). RN was built on the PN concept but added big changes to the algorithm.

The distance function was not defined in advance but learned by the algorithm. RN has its own relation module that does this. If you want to learn more, check out the initial article.

The overall structure is as follows. The relation module is placed on top of the embedding module, which is the part that computes embeddings and class prototypes from input images.

The relation module is fed the concatenation of the embedding of a query image with each class prototype, and it outputs a relation score for each pair. Applying a Softmax to the relation scores, we get a prediction.
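The data flow described above can be sketched as follows. This is only an illustration of the wiring: the relation module is assumed to be a tiny one-hidden-layer network, its weights are random stand-ins for parameters that would be learned during Meta-Training, and the prototypes are hand-written. None of the names come from the original implementation.

```python
import math
import random

def relation_module(pair, params):
    """Tiny MLP: one ReLU hidden layer, scalar relation score."""
    w1, b1, w2, b2 = params
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, pair)) + b)
        for row, b in zip(w1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden)) + b2

def init_params(input_dim, hidden_dim, rng):
    """Random weights standing in for parameters that would be learned."""
    w1 = [
        [rng.uniform(-1, 1) for _ in range(input_dim)]
        for _ in range(hidden_dim)
    ]
    b1 = [0.0] * hidden_dim
    w2 = [rng.uniform(-1, 1) for _ in range(hidden_dim)]
    return w1, b1, w2, 0.0

def relation_probs(query_emb, prototypes, params):
    labels = list(prototypes)
    # Concatenate the query embedding with each class prototype.
    scores = [
        relation_module(query_emb + prototypes[l], params) for l in labels
    ]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return {l: e / total for l, e in zip(labels, exps)}

rng = random.Random(0)
params = init_params(input_dim=4, hidden_dim=8, rng=rng)
prototypes = {"cat": [0.95, 0.15], "dog": [0.15, 0.9]}
probs = relation_probs([0.8, 0.3], prototypes, params)
```

Because the weights are untrained, the resulting probabilities carry no meaning here. The point is that the comparison function is itself a small network, so it can be trained together with the embedding module instead of being fixed to cosine or Euclidean distance.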

[Figure: Relation Network architecture]

Few-Shot Object Detection

This section comes from “Meta-learning algorithms for Few-Shot Computer Vision”, written by Etienne Bennequin.

It’s quite obvious that we may encounter FSL problems in all Computer Vision tasks. We have considered Few-Shot image classification, now it’s time to tackle the Few-Shot Object Detection problem.

Let’s define an Object Detection task. Imagine that we have a list of object types and an image, and the goal is to detect all objects from the list on the image. We’ll say that the object is detected if:

we localized it by drawing the smallest bounding box possible that contains it,
we classified the object.

Let’s move on and define the N-way-K-Shot Object Detection task. Imagine:

A support-set composed of:
1. N class labels,
2. For each class, K labeled images containing at least one object belonging to this class,
Q query images.

Our goal is to detect objects belonging to one of the N given classes in the query images.

Note that there is a key difference with the Few-Shot image classification problem, as one image can contain multiple objects belonging to one or several of the N classes. We might face a class imbalance problem, as our algorithm trains on at least K example objects of each class.
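A small made-up example shows how this imbalance appears. Assume a 3-way 2-shot support set in which each image was selected for one class but is annotated with every object it contains; the field names `chosen_for` and `objects` are hypothetical.

```python
from collections import Counter

K = 2  # images selected per class

# Each image is chosen because it contains its class at least once,
# but it may also contain objects of the other classes.
support_images = [
    {"chosen_for": "dog", "objects": ["dog", "person", "person"]},
    {"chosen_for": "dog", "objects": ["dog", "dog"]},
    {"chosen_for": "person", "objects": ["person", "person", "person"]},
    {"chosen_for": "person", "objects": ["person", "dog"]},
    {"chosen_for": "bike", "objects": ["bike", "person"]},
    {"chosen_for": "bike", "objects": ["bike"]},
]

# Count annotated objects per class across the whole support set.
objects_per_class = Counter(
    label for image in support_images for label in image["objects"]
)
```

Every class still has at least K example objects, but in this toy set `person` appears 7 times, `dog` 4 times, and `bike` only twice, so the detector sees far more examples of some classes than of others.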

YOLOMAML

The Few-Shot Object Detection sphere is quickly developing, but there aren’t many efficient solutions. The most stable solution to this problem is the YOLOMAML algorithm.

YOLOMAML has two blended pieces: YOLOv3 Object Detection architecture, and the MAML algorithm.

As mentioned earlier, MAML can be applied to a wide variety of Deep Neural Networks, which is why it was easy for developers to combine these two pieces.

YOLOMAML is a straightforward application of the MAML algorithm to the YOLO detector. If you want to learn more, check out the official GitHub repository.

Final thoughts

In this article, we have figured out what Few-Shot Learning is, what FSL variations and problem approaches there are, and what algorithms you can use to solve image classification and Object Detection FSL tasks.

To my knowledge, Few-Shot Learning is a quickly developing and promising field, but still quite challenging and under-researched. There is a lot more to be done, researched, and developed.

Hopefully, with this information, you will have no problems starting to experiment in the Few-Shot Learning field.
