alex boeglin mid

Uploaded from authorPOINTLite
Views:
 
Category: Entertainment
     
 

Presentation Description

No description available.

Comments

Presentation Transcript

Design and Implementation of Bag-of-Keypoints Image Classification : 

Design and Implementation of Bag-of-Keypoints Image Classification By Alex Boeglin ENGN 161 Fall 2007 Research based on: Visual Categorization with Bags of Keypoints, by Gabriella Csurka, Christopher R. Dance, Lixin Fan, Jutta Willamowski, and Cédric Bray, Xerox Research Centre Europe, 2004

Problem: 

Problem With the ever increasing size of digital photo libraries, due to the proliferation of digital cameras, comes the complexity of managing these huge libraries Use high level information about objects contained within the images to search/sort through these large databases iPhoto and similar applications Google Image Search

Approach - Bag of Keypoints: 

Approach - Bag of Keypoints “Visual Categorization with Bags of Keypoints,” Csurka,Dance, Fan, Willamowski, and Bray, Xerox Research Centre Europe, 2004 Similar idea to text categorization, use small image squares known as keypoints to identify features in the image Relate descriptors discovered in query image to descriptors “seen” in training Classify the image by identifying the majority of descriptors relating to a certain class

Overview of Steps Involved: 

Overview of Steps Involved Training: Create a training set of multiple images from several defined classes Use SIFT algorithm to extract features and descriptors from the training images (make sure to keep different classes separate) Cluster the descriptors, creating k cluster centers, using k-means clustering For each class, set the descriptor matrix values to the cluster descriptor with the closest Euclidean distance Use descriptors to calculate the probability of a keyword given a class Testing: Extract features and descriptors from test image Relate these descriptors to the cluster by setting the descriptors from the test image to the cluster descriptor with the closest Euclidean distance Use Naïve Bayes to calculate probabilities and classify image

Get Images!: 

Get Images! The first step was to create a training database of images For the initial project results, two classes were chosen: Cars and Faces Cars represents a frontal/side view - Drivers side view with front end rotated approximately 30 degrees towards the camera Faces represent a frontal view 20 Images were collected from each class to compose a training set of 40 images total Next, features and descriptors will have to be extracted from each training set class

Extracting Features and Descriptors Using SIFT: 

Extracting Features and Descriptors Using SIFT SIFT: Shift Invariant Feature Transformation Matlab code provided by David Lowe, University of British Colombia Note: Make sure to have a C-compiler installed and setup to operate with Matlab in order to run his code Multi-step process to identify regions that stand out compared to their neighboring regions Regions are invariant to image scaling, translation, rotation, and to a lesser degree to illumination changes and 3D projection

SIFT’ing - A Simplified Overview: 

SIFT’ing - A Simplified Overview Stage One: Key localization Identify key locations by “looking for locations that are maxima or minima of a difference of Gaussian function” Convolve 1-D Gaussian with image, repeat on convolved image, resample the resulting image to a pixel spacing of 1.5, iterate to create a pyramid Compare each pixel in pyramid to its 8-neighbors, determine if it’s a max of min; if so determine pixel location at next lowest level and repeat Maxima / Minima represent initial keys

SIFT’ing: 

SIFT’ing Stage Two: Key Stability At each key at each level of the pyramid, compute the gradient magnitude and orientation Let Aij represent the key at each level of the pyramid, Mij the magnitude gradient, and Rij the orientation: “Each key location assigned a canonical orientation so that the image descriptors are invariant to image rotation” “Orientation is determined by the peak in a histogram of local image gradient orientations”

SIFT’ing: 

SIFT’ing Stage 3: Local Image Description “Gaussian derivatives computed at 8 orientations over a 4x4 grid, giving a 128-dimension vector”

Interest Points and Descriptors: 

Interest Points and Descriptors SIFT is run on each training image Descriptors for the classes are entered into a giant matrix holding all the descriptors from all classes in the training set, and are also entered into a matrix containing only descriptors from other training images in its own class (which will be used later for probability calculations) After extracting all features and descriptors, the training set descriptors are ready to be clustered using the k-means algorithm

K-means Clustering - Square Error Partitioning: 

K-means Clustering - Square Error Partitioning Create a partition with k clusters Assign points to closest cluster center Recalculate position of the centroids, minimizing the square error in sum of Euclidean distances between points and center point Iterate until centroids no longer move within convergence criteria Remove small clusters http://cmp.felk.cvut.cz/cmp/software/stprtool/examples/kmeans_example.gif

K-means Clustering: 

K-means Clustering Initially start with 1000 cluster centers On a trial training set of 5 images, this took 8 hours to run with Matlab’s built in algorithm Once training set expanded to 40 images, Matlab was left clustering for over 24 hours with no results Was able to find a fast k-means clustering Matlab m-file on the web which claimed a 10 fold performance increase over Matlab’s built in k-means clustering This algorithm was able to compute the cluster centers for k = 1000 in approximately 10 hours Still, for even my comparatively small database of images, the clustering process is VERY slow

Clustering: 

Clustering Clusters were created with different numbers of center points: k = 20, 70, 100, 250, 500, and 1000 were all computed for experimental results These clusters were saved for easy loading into the main test program For the descriptors stored into the class specific matrices of descriptors, each descriptor was assigned to its closest Euclidean distance cluster center The final step for the Training set was to compute an intermediate probability, which will be described next, along with the overall probability for the actual Test image

Probability Calculation: 

Probability Calculation Naïve Bayes Method “Category is selected according to class prior probabilities” Keywords chosen independently What we’re after: The probability of being of a category given an image By Bayes, probability of observing a category Cj given a set of images Ii is proportional to the prior probability of the class, Cj, multiplied by the probability of observing an image Ii for a specified class Cj:

Probability Calculation: 

Probability Calculation Training images represent the image set Ij Training set: What we’re really interested in is the keywords Vocabulary: Since Ii is a set of images, the probability P(Ii | Cj) can be written out as: Since the images are defined by the words they contain, this can be rewritten as the probability of a word from training image I occurring in the image given Cj:

Probability Calculation: 

Probability Calculation Since words can occur more than once in an image, group similar words from the vocab set and rewrite in our standard notation for the vocab set of non-repeated words: N(t,i) represents the number of times a given keyword occurs in the test image Thus, as given by the paper and our results:

Probability Calculation: 

Probability Calculation How do we determine the probability of a word given a specified class? This quantity is determined by the training set of images (the intermediate probability mentioned earlier) It is computed by dividing the number of times a certain keyword occurs in the training image set of a specified class by the number of times all the keywords occur in the image set of that specified class Thus, each class has its own quantity computed This quantity must be normalized first however!

Probability Calculation: 

Probability Calculation As given in paper, to avoid probabilities of zero, use Laplace Smoothing: So for each class, this quantity is being computed from the training set descriptors (which have been assigned to the cluster centers) For this calculation, it is important to note that this N(t,i) represents the number of times a specified vocabulary word occurs within the training set of the specified class

Probability Calculation: 

Probability Calculation As a final step, must be able to classify the image based on the resulting probability Still have a term out in front of the multiplication however: In order to remove this term, we introduce the probability of image not being of the specified class:

Probability Calculation: 

Probability Calculation By dividing these two probabilities, we see the the two terms out front are a constant, which can be disregarded for classification purposes This results in a Z term for each image class Use this result to classify images Classify image based on the largest Z

Probability Calculation - Does it make sense?: 

Probability Calculation - Does it make sense? Consider the term N(t,i) N(t,i) can be zero Makes that term go to 1, thus this is okay since multiplying by one does nothing Otherwise, N(t,i) is a number greater than or equal to one By raising P(Vt | Cj), a probability which by definition is a quantity less than one, to a power larger than one will create an even smaller number Since this gives decimals raised to powers greater than 1, multiplying these ever smaller numbers by one another forces the result to go to zero!

Probability Fix: 

Probability Fix The paper did not discuss how to handle this, and merely left the multiplication raised to the power N(t,i) In examining another, separate, Naïve Bayes classification algorithm, it is evident that the calculation MUST be done in terms of logarithms, and left in term of logarithms, in order to avoid numerical problems in terms of multiplications by zero and sums of zero And thus we obtain:

Intermediate Results: 

Intermediate Results Given the slow clustering times, checked to make sure the cluster centers actually make sense! Plot the three most predominant cluster centers on two of the training images to see if they’re “lining up”

Intermediate Results: 

Intermediate Results

Testing: 

Testing Used 40 test images: 20 faces and 20 cars Also included 3 challenge images, with undeterminable outcomes Face and car together Shadow of head “Binoculars at the beach” with layout similarities to human face Test images run on all cluster sizes: k = 20, 70, 100, 250, 500, and 1000

Testing: 

Testing Each test image must be loaded by hand in order to run through the program Only tests the image using one cluster at a time Again, a very slow process just to test the images On average, takes about 1 minute to setup and run the program per test image In the future, would like to be able to load the images into an array to loop the testing, if this is possible

Testing Results: 

Testing Results Confusion Matrices: Recall, in Paper: In house training set consisted of 1776 images in 7 classes Attained an error rate of 28% for k = 1000

k versus Error Rate: 

k versus Error Rate Optimal k for my training set (on left): k = 100 Their larger training set (on right) optimized for k = 1000

True Examples: 

True Examples

False Classifications Examples: 

False Classifications Examples Only for faces, no cars were classified as faces

ROC Curves: 

ROC Curves The ROC curve shows the False Positive Rate versus the True Positive Rate

Challenge Image 1: 

Challenge Image 1 Face and Car are both clearly visible Car is much more dominant feature All cluster sizes picked Car as category

Challenge Image 2: 

Challenge Image 2 Lacks any facial features other than outline Expected that it would get classified as a face since ears and hair are still visible features, and lacks any of car features Only k = 100 failed to classify the image as a face

Challenge Image 3: 

Challenge Image 3 Can see similarities to facial structure, yet it is a man made object like a car All cluster sizes identified as Car Even though layout similarities to face, lacks facial qualities Hypothesizing that the man made qualities of the instrument - metallic materials, screws, bolts, etc, caused its descriptors to be closer to those determined from cars

Conclusions: 

Conclusions Initially training set for cars consisted of various views of cars - from the front, rear, side, etc Discovered that the program was not able to classify the car images SIFT algorithm is limited to 20 degrees of rotation of the object from the camera Given training set size of 20 images, it was very crucial to produce similar descriptors such that cluster centers are representative of the class A larger training set would allow more variation in training images since anomalies will affect the cluster centers much less when more descriptors are present

Conclusions: 

Conclusions Algorithm has a very easy time identifying cars, has much more trouble identifying faces correctly Part of this comes from only having two classes Faces vary so greatly so creating a meaningful training set is very difficult - identifying similar descriptors is very challenging when facial descriptors vary The number of descriptors that are picked up from face to face varies greatly as well, impacts comparison of descriptors between images Lighting conditions, color, and contrast vary greatly within portraits, much more so than images of cars Not only are portraits taken under various lighting conditions, but due to differences in facial structure, even the way shadows fall on people’s faces varies greatly Could be remedied by a larger training set Again, having more facial descriptors overall would help reduce the effects of variation since outliers would have much less impact on the cluster centers

Future Expandability Ideas: 

Future Expandability Ideas Add a third class Choose something fairly standard, such as phones Having another class based on a fairly standard object (in contrast to faces), will either show that classifying cars as overly dominant, or help to prove the point that classifying faces is extremely difficult due to the wide variation of not only facial structure and features but other side effects of ambient conditions in portraits Expanding the training set size

Future Expandability Ideas: 

Future Expandability Ideas Create a training set using the ETH-80 database A standard database containing 80 images from 8 classes showing

The End: 

The End Questions?

References: 

References Visual Categorization with Bags of Keypoints, Gabriella Csurka, Christopher Dance, Lixin Fan, Jutta Willamowski, Cédric Bray Object Recognition from Local Scale-Invariant Features, David G. Lowe Several definitions in notes cited from www.wikipedia.com