logging in or signing up NNTopicsPres10 Wanderer Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 93 Category: Entertainment License: All Rights Reserved Like it (1) Dislike it (0) Added: October 09, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Slide1: Optimizing Pattern Recognition Systems Professor Mike Manry University of Texas at ArlingtonSlide2: Introduction Feature Selection Complexity Minimization Recent Classification Technologies Neural Nets Support Vector Machines Boosting A Maximal Margin Classifier Growing and Pruning Software Examples Conclusions OutlineSlide3: Intro -- Problems In Pattern Recognition Systems ZSlide4: Intro -- Problems In Pattern Recognition Systems Problems: We usually don’t know which block(s) are bad Leftmost blocks may be more expensive and difficult to change Solution: Using the system illustrated on the next slide, we can quickly optimize the rightmost blocks and narrow the list of potential problems. Slide5: Intro -- IPNNL Optimization System ( www-ee.uta.edu/eeweb/ip/ ) Pattern Recognition Application IPNNL Optimization System Dim(z) = N’ Dim(x) = N N << N’ Slide6: Intro -- Presentation Goals Examine blocks in the Optimization System: Feature Selection Recent Classification Technologies Growing and Pruning of Neural Net Classifiers Demonstrate the Optimization System on: Data from Bell Helicopter Textron Data from UTA Bioengineering Data from UTA Civil EngineeringExample SystemText Reader: Example System Text ReaderSlide8: Feature Selection - Combinatorial Explosion Number of size-N subsets of N’ features is N’ N NS = ( ) Scanning: Generate candidate subsets Subset Evaluation: Calculate subset goodness, and save the best onesSlide9: Given data for a classification problem with N’ = 90 candidate features, there are 9.344 x 1017 subsets of size N=17 and a total of 290 = 1.2379 x 1027 subsets Feature Selection Example of Combinatorial ExplosionSlide10: Feature Selection - Scanning Methods Available Methods : Brute Force (BF) Scanning: Examine every subset (See previous slide) Branch and Bound (BB) [1]: Avoids examining subsets known to be poor. Plus L minus R (L-R) [1]: Adds L good features and eliminates the R worst features Floating Search (FS) [2]: Faster than BB. Feature Ordering (FO) [1]: Given the empty subset, repeatedly add the best additional feature to the subset. Also called forward selection. Ordering Based Upon Individual Feature Goodness (FG) Let G measure the goodness of a subset scanning method. Then G(FG) < G(FO) < G(L-R) < G(FS) < G(BB) = G(BF)Feature Selection Scanning Methods: Feature Goodness vs. Brute Force: Feature Selection Scanning Methods: Feature Goodness vs. Brute Force Suppose that available features are: z1 = x + n, z2 = x + n, z3 = y + 2n, and z4 = n where x and y are useful and n is noise. FG: Nested subsets {z1}, {z1, z2 }, {z1, z2 , z3 } BF: Optimal subsets {z2}, {z1, z4 }, {z1, z3 , z4 } since x = z1 - z4 and y = z3 - 2z4 An optimal subset may include features that are useless, according to the FG approach Slide12: Feature Selection - Subset Evaluation Metrics (SEMs) Let x be a feature subset, and let xk be a candidate feature for possible inclusion into x. Requirements for SEM f() f ( x U xk ) < f (x), ( U denotes union ) f() related to classification success (Pe for example ) Example SEMs Brute Force Subset Evaluation: Design a good classifier and measure Pe Scatter Matrices [1] Slide13: Feature Selection A New Approach [3] Scanning Approach : Floating Search SEM : Pe for piecewise linear classifier At the output, the absence of groups 1, 3, and 5 reveals problemsFeature SelectionExample 1: Feature Selection Example 1 Classification of Numeral Images: N’ = 16 features and Nc = 10 Chosen Subsets {6, } {6, 9, } {6, 9, 14, } {6, 9, 14, 13, } {6, 9, 14, 13, 3, } {6, 9, 14, 13, 3, 15, } {6, 9, 14, 13, 11, 15, 16, } {6, 9, 14, 13, 11, 15, 16, 3, } {6, 9, 14, 13, 11, 15, 16, 3, 4, } {6, 9, 14, 13, 11, 15, 16, 3, 4, 1, } {6, 9, 14, 13, 11, 15, 16, 3, 4, 1, 12, } {6, 9, 14, 13, 11, 15, 16, 3, 4, 1, 12, 7, } {6, 9, 14, 13, 11, 15, 16, 3, 4, 1, 12, 7, 8, } {6, 9, 14, 13, 11, 15, 16, 3, 4, 1, 12, 7, 8, 5, } {6, 9, 14, 13, 11, 15, 16, 3, 4, 1, 12, 7, 8, 5, 2, } {6, 9, 14, 13, 11, 15, 16, 3, 4, 1, 12, 7, 8, 5, 2, 10, }Feature SelectionExample 2: Feature Selection Example 2 Classification of Sleep Apnea Data (Mohammad Al-Abed and Khosrow Behbehani): N’ = 90 features and Nc = 2 Chosen Subsets {11, } {11, 56, } {11, 56, 61, } {11, 28, 55, 63, } {11, 28, 55, 63, 27, } {11, 28, 55, 63, 62, 17, } {11, 28, 55, 63, 53, 17, 20, } {11, 28, 55, 63, 53, 17, 20, 62, } {11, 28, 55, 63, 53, 17, 20, 4, 40, } {11, 28, 55, 63, 53, 26, 20, 4, 40, 8, } {11, 28, 22, 19, 53, 26, 20, 4, 40, 30, 80, } {11, 28, 22, 19, 53, 26, 20, 4, 40, 30, 80, 85, } {11, 28, 22, 19, 53, 26, 20, 4, 40, 30, 80, 85, 8, } {11, 28, 22, 19, 53, 26, 20, 4, 40, 30, 80, 85, 8, 48, } {11, 28, 22, 19, 53, 26, 20, 4, 40, 30, 80, 85, 8, 48, 38, } {11, 28, 22, 19, 53, 26, 20, 4, 40, 30, 80, 85, 8, 48, 45, 87, } {11, 28, 13, 19, 53, 26, 65, 4, 40, 30, 80, 85, 8, 48, 75, 87, 18, }Slide16: Complexity Minimization Complexity: Defined as the number of free parameters or coefficients in a processor Complexity Minimization (CM) Procedure Minimize training set Pe for each classifier size, with respect to all weights or coefficients. Nw is the number of weights or coefficients. Measure Pe for a validation data set. Choose that network that minimizes the validation set’s Pe CM leads to smaller, better classifiers, if it can be performed. It is related to SRM [4,5]. Implication: To perform SRM, we need methods for quickly varying network size during training (growing) or after training (pruning) Complexity Minimization: Complexity Minimization E Nw validation Vary Nw until validation error is minimized. Slide18: Recent Classification Technologies Neural Nets Minimize MSE E(w), where tpi = 1 for the correct class (i = ic ) and 0 for an incorrect class (i = id ) . yi(x) is the ith output discriminant of the trained classifier. Neural net support vector x satisfies: yic(x) – max{ yid (x)} = 1 The MSE between yi(x) and bi(x) = P(i|x) is Theorem 1 [5,6]: As the number of training patterns Nv increases, the training error E(w) approaches e(w)+C, where C is a constant. Slide19: Recent Classification Technologies Neural Nets Advantages Neural net outputs approximate Bayes discriminant P(i | x) Training modifies all network weights CM easily performed via growing and pruning methods Accommodates any size training data file Problems E(w) and e(w) are not proportional to Pe. From Theorem 1, yi = bi + εi, where εi is random zero-mean noise. Noise degrades performance, leaving room for improvement via SVM training and boosting. Recent Classification Technologies Support Vector Machines [4,5]: Recent Classification Technologies Support Vector Machines [4,5] SVM: A neural net structure with these properties: Output weights form hyperplane decision boundaries Input vectors xp satisfying yp(ic) = +b and max{ y (id) } = -b are called support vectors Correctly classified input vectors xp outside the decision margins do not adversely affect training Incorrectly classified xp do not strongly affect training In some SVMs, Nh (number of hidden units) initially equals Nv (the number of training patterns) Training may involve quadratic programming Slide21: b b SVM discriminant LMS discriminant x2 x1 SV SV SV SV Correctly classified patterns distort the LMS discriminant Recent Classification Technologies Support Vector MachinesRecent Classification TechnologiesSupport Vector Machines: Recent Classification Technologies Support Vector Machines Advantage: Good classifiers from small data sets Problems SVM methods are practical only for small data sets Training difficult when there are many classes Kernel parameter found by hit or miss Fails to minimize complexity (far too many hidden units required, hidden unit weights don’t adapt) Current Work Developing SVM pruning approach Developing Regression-based maximal margin classifiersRecent Classification TechnologiesBoosting [5]: Recent Classification Technologies Boosting [5] yik(x): ith class discriminant for the kth classifier. K is the number of classifiers being fused. In discriminant fusion, we calculate the ak so that the weighted average discriminant, Adaboost [5] sequentially picks a training subset {xp, tp}k, from the available data and designs yik(x) and ak so they are functions of the previous (k-1) classifiers. has better performance than the yik(x)Recent Classification TechnologiesBoosting: Recent Classification Technologies Boosting Advantage Final classifier can be used to process video signals in real time when step function activations are used Problems Works best for the two-class case, uses huge data files. Final classifier is a large, highly redundant neural net. Training can take days, CM not performed. Future Work Pruning should be tried to reduce redundancy. Feature selection should be tried to speed up trainingRecent Classification TechnologiesProblems with Adaboost: Recent Classification Technologies Problems with Adaboost Future Work : Prune ADA boosted networks N = 3, Nc = 2, K = 3 (number of classifiers being fused ) A Maximal Margin ClassifierProblems With MSE Type Training: A Maximal Margin Classifier Problems With MSE Type Training The ith class neural net discriminant can be modeled as yp(i)= tp(i) + εp(i) This additive noise εp(i) degrades the performances of regression-based classifiers, as mentioned earlier. Correctly classified patterns contribute to MSE, and can adversely affect training (See following page). In other words, regression-based training tries to force all training patterns to be support vectors A Maximal Margin Classifier Problems With MSE Type Training: A Maximal Margin Classifier Problems With MSE Type Training (1) Standard regression approach tries to force all training vectors to be support vectors (2) Red lines are counted as errors, even though those patterns are classified more correctly than desired (3) Outliers can distort decision boundary locationsA Maximal Margin Classifier[5]Existence of Regression-Based Optimal Classifiers: A Maximal Margin Classifier[5] Existence of Regression-Based Optimal Classifiers Let X be the basis vector (hidden units + inputs) of a neural net. The output vector is y = W·X The minimum MSE is found by solving R·W = C (1) where R = E[X·XT ] and C = E[X·tT ] (2) If an “optimal” coefficient matrix Wopt exists, Copt = R·Wopt from (1), so Copt exists. From (2), we can find Copt if the desired output vector t is defined correctly. Regression-based training can mimic other approaches. Slide29: A Maximal Margin Classifier Regression Based Classifier Design [7] Consider the Empirical Risk (MSE) If yp(ic) > tp(ic) (Correct class discriminant large) or yp(id) < tp(id) (Incorrect class discriminant small) Classification error decreases but E increases.Slide30: If yp(ic) > tp(ic ) set tp‘(ic) = yp(ic) If yp(id) < tp(id ) for an incorrect class id, set tp‘(id) = yp(id). In both cases, tp‘(i) is set equal to yp(i) so no error is counted. This algorithm partially mimics SVM training since correctly classified patterns do not affect the MSE.( Related to Ho-Kashyap procedures, [5] pp.249-256 ) The discrepancy is fixed by re-defining the empirical risk as A Maximal Margin Classifier Regression Based Classifier Design-ContinuedA Maximal Margin Classifier Problems With MSE Type Training: A Maximal Margin Classifier Problems With MSE Type Training Recall that E counts all non-support vectors as erroneousA Maximal Margin Classifier Errors Contributing to E’: A Maximal Margin Classifier Errors Contributing to E’ In E’, only errors (green lines) inside the margins are minimized. Outliers can be eliminated.Slide33: A Maximal Margin Classifier Comments The proposed algorithm: Adapts to any number of training patterns Allows for any number of hidden units Makes CM straightforward (4) Is used to train the MLP. The resulting classifier is called a maximal margin classifier (MMC) Questions: Does this really work ? How do the MMC and SVM approaches compare ? A Maximal Margin Classifier Two-Class Example: A Maximal Margin Classifier Two-Class Example Numeral Classification: N = 16, Nc = 2, Nv = 600 Goal: Discriminate numerals 4 and 9. SVM: Nh > 150, Et = 4.33% and Ev = 5.33 % MMC: Nh = 1 , Et = 1.67% and Ev = 5.83 % Comments: The 2-Class SVM seems better, but the price is too steep.A Maximal Margin Classifier Multi-Class Examples: A Maximal Margin Classifier Multi-Class Examples Numeral Classification: N = 16, Nc = 10, Nv = 3,000 SVM: Nh > 608, Ev = 14.53 % MMC: Nh = 32 , Ev = 8.1 % Bell Flight Condition Recognition: N = 24, Nc = 39, Nv = 3,109 SVM: Training fails MMC: Nh = 20 , Ev = 6.97 % Nh - number of hidden units Nv - number of training patterns Ev – validation error percentage Conclusion: SVMs may not work for medium and large size multi-class problems.Slide36: Growing and Pruning Candidate MLP Training Block If complexity minimization (CM) is used, the resulting Ef(Nh) curve is monotonic Practical ways of approximating CM, are growing and pruning. Slide37: Growing and Pruning Growing Ef(w) Nw validation Growing: Starting with no hidden units, repeatedly add Na units and train the network some more. Advantages: Creates a monotonic Ef(Nh) curve. Usefulness is concentrated in the first few units added. Disadvantage: Hidden units are not optimally ordered.Slide38: Growing and Pruning Pruning [8] Ef(w) Nw validation Pruning: Train a large network. Then repeatedly remove a less useful unit using OLS. Advantages: Creates a monotonic Ef(Nh) curve. Hidden units are optimally ordered Disadvantage: Usefulness is not concentrated in the first few unitsSlide39: Growing and Pruning Pruning a Grown Network [9] Training error for Inversion of Radar Scattering Data set is for inversion of radar scattering from bare soil surfaces. It has 20 inputs and 3 outputs Slide40: Growing and Pruning Pruning a Grown Network Validation error for Radar Scattering Inversion Slide41: Growing and Pruning Pruning a Grown Network Prognostics data set for onboard flight load synthesis (FLS) in helicopters, where we estimate mechanical loads on critical parts, using measurements available in the cockpit. There are 17 inputs and 9 outputs. Training error for Prognostics data Slide42: Growing and Pruning Pruning a Grown Network Validation error plots for Prognostics data Slide43: Growing and Pruning Pruning a Grown Network Data set for estimating phoneme likelihood functions in speech, has 39 inputs and 117 outputs Training error for Speech data Slide44: Growing and Pruning Pruning a Grown Network Validation error for Speech data Growing and Pruning : Remaining Work: Insert into the IPNNL Optimization System Growing and Pruning Slide46: IPNNL Software Motivation Theorem 2 (No Free Lunch Theorem [5] ) : In the absence of assumptions concerning the training data, no training algorithm is inherently better than another. Implication: We usually make some assumptions. However, given training data, several classifiers should be tried after feature selection. Slide47: IPNNL Software Block Diagram Size Train Prune & Validate Feature Selection MLP PLN FLN LVQ SOM SVM RBF Data Final Network Analyze Your Data Select Network Type Produce Final Network IPNNL SoftwareExamples: IPNNL Software Examples The IPNNL Optimization system is demonstrated on: Flight condition recognition data from Bell Helicopter Textron (Prognostics Problem) Sleep apnea data from UTA Bioengineering (Prof. Khosrow Behbehani and Mohammad Al-Abed) Traveler characteristics data from UTA Civil Engineering (Prof. Steve Mattingly and Isaradatta Rasmidatta) ExamplesBell Helicopter Textron: Examples Bell Helicopter Textron Flight condition recognition (prognostics) data from Bell Helicopter Textron Features: N’ = 24 cockpit measurements Patterns: 4,745 Classes: Nc = 39 helicopter flight categories Run Feature selection, and save new training and validation files with only 18 features: Run Feature selection, and save new training and validation files with only 18 featuresRun MLP sizing, decide upon 12 hidden units: Run MLP sizing, decide upon 12 hidden unitsRun MLP training, save the network: Run MLP training, save the networkRun MLP pruning, with validation. Final network has 10 hidden units.: Run MLP pruning, with validation. Final network has 10 hidden units.ExamplesBehbehani and Al-Abed: Classification of Sleep Apnea Data (Mohammad Al-Abed and Khosrow Behbehani): Features: N’ = 90 features from Co-occurrence features applied to STDFT Patterns: 136 Classes: Nc = 2 ( Yes/No ) Previous Software: Matlab Neural Net Toolbox Examples Behbehani and Al-AbedSlide55: Run Feature selection, and save new training and validation files with only 17 features. The curve is ragged because of the small number of patterns.Run MLP sizing, decide upon 5 hidden units: Run MLP sizing, decide upon 5 hidden unitsRun MLP training, save the network: Run MLP training, save the networkRun MLP pruning, with validation. Final network has 3 hidden units.: Run MLP pruning, with validation. Final network has 3 hidden units.Slide59: Examples - Mattingly and Rasmidatta Classification of traveler characteristics data (Isaradatta Rasmidatta and Steve Mattingly): Features: N’ = 22 features Patterns: 7,325 Classes: Nc = 3 (car, air, bus/train ) Previous Software: NeuroSolutions by NeuroDimensionRun Feature selection, and save new training and validation files with only 4 features. The flat curve means few features are needed.: Run Feature selection, and save new training and validation files with only 4 features. The flat curve means few features are needed. Run MLP sizing, decide upon 2 hidden units. Flat curve means few hidden units, if any, are needed.: Run MLP sizing, decide upon 2 hidden units. Flat curve means few hidden units, if any, are needed.Run MLP training, save the network: Run MLP training, save the networkRun MLP pruning, with validation. Final network has 1 hidden unit.: Run MLP pruning, with validation. Final network has 1 hidden unit.Conclusions: Conclusions An effective feature selection algorithm has been developed Regression-based networks are compatible with CM Regression-based training can extend maximal margin concepts to many nonlinear networks Several existing and potential blocks in the IPNNL Optimization System have been discussed The system has been demonstrated on three pattern recognition applications A similar Optimization System is available for approximation/regression applications References: References [1] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd edition, Academic Press, 1990. [2] P. Pudil, J. Novovicova, J. Kittler, " Floating Search Methods in Feature Selection " Pattern Recognition Letters, vol 15 , pp 1119-1125, 1994 [3] Jiang Li, Michael T. Manry, Pramod Lakshmi Narasimha, and Changhua Yu, “Feature Selection Using a Piecewise Linear Network”, accepted by IEEE Trans. on Neural Networks. [4] Vladimir N. Vapnik, Statistical Learning Theory, John Wiley & Sons, 1998. [5] Duda, Hart and Stork, Pattern Classification, 2nd edition, John Wiley and Sons, 2001. [6] Dennis W. Ruck et al., "The Mulitlayer Perceptron as an Approximation to a Bayes Optimal Discriminant Function," IEEE Trans. on Neural Networks, Vol. 1, No. 4, 1990. [7] R.G. Gore, Jiang Li, Michael T. Manry, Li-Min Liu, Changhua Yu, and John Wei, "Iterative Design of Neural Network Classifiers through Regression". International Journal on Artificial Intelligence Tools, Vol. 14, Nos. 1&2 (2005) pp. 281-301. [8] F. J. Maldonado and M.T. Manry, "Optimal Pruning of Feed Forward Neural Networks Using the Schmidt Procedure", Conference Record of the Thirty Sixth Annual Asilomar Conference on Signals, Systems, and Computers., November 2002, pp. 1024-1028. [9] Pramod Lakshmi Narasimha, Walter Delashmit, Michael Manry, Jiang Li and Fransisco Maldonado, “Fast Generation of a Sequence of Trained and Validated Feed-Forward Networks”, accepted by the Nineteenth International Conference of the Florida AI Research Society, May 2006. You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
NNTopicsPres10 Wanderer Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 93 Category: Entertainment License: All Rights Reserved Like it (1) Dislike it (0) Added: October 09, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Slide1: Optimizing Pattern Recognition Systems Professor Mike Manry University of Texas at ArlingtonSlide2: Introduction Feature Selection Complexity Minimization Recent Classification Technologies Neural Nets Support Vector Machines Boosting A Maximal Margin Classifier Growing and Pruning Software Examples Conclusions OutlineSlide3: Intro -- Problems In Pattern Recognition Systems ZSlide4: Intro -- Problems In Pattern Recognition Systems Problems: We usually don’t know which block(s) are bad Leftmost blocks may be more expensive and difficult to change Solution: Using the system illustrated on the next slide, we can quickly optimize the rightmost blocks and narrow the list of potential problems. Slide5: Intro -- IPNNL Optimization System ( www-ee.uta.edu/eeweb/ip/ ) Pattern Recognition Application IPNNL Optimization System Dim(z) = N’ Dim(x) = N N << N’ Slide6: Intro -- Presentation Goals Examine blocks in the Optimization System: Feature Selection Recent Classification Technologies Growing and Pruning of Neural Net Classifiers Demonstrate the Optimization System on: Data from Bell Helicopter Textron Data from UTA Bioengineering Data from UTA Civil EngineeringExample SystemText Reader: Example System Text ReaderSlide8: Feature Selection - Combinatorial Explosion Number of size-N subsets of N’ features is N’ N NS = ( ) Scanning: Generate candidate subsets Subset Evaluation: Calculate subset goodness, and save the best onesSlide9: Given data for a classification problem with N’ = 90 candidate features, there are 9.344 x 1017 subsets of size N=17 and a total of 290 = 1.2379 x 1027 subsets Feature Selection Example of Combinatorial ExplosionSlide10: Feature Selection - Scanning Methods Available Methods : Brute Force (BF) Scanning: Examine every subset (See previous slide) Branch and Bound (BB) [1]: Avoids examining subsets known to be poor. Plus L minus R (L-R) [1]: Adds L good features and eliminates the R worst features Floating Search (FS) [2]: Faster than BB. Feature Ordering (FO) [1]: Given the empty subset, repeatedly add the best additional feature to the subset. Also called forward selection. Ordering Based Upon Individual Feature Goodness (FG) Let G measure the goodness of a subset scanning method. Then G(FG) < G(FO) < G(L-R) < G(FS) < G(BB) = G(BF)Feature Selection Scanning Methods: Feature Goodness vs. Brute Force: Feature Selection Scanning Methods: Feature Goodness vs. Brute Force Suppose that available features are: z1 = x + n, z2 = x + n, z3 = y + 2n, and z4 = n where x and y are useful and n is noise. FG: Nested subsets {z1}, {z1, z2 }, {z1, z2 , z3 } BF: Optimal subsets {z2}, {z1, z4 }, {z1, z3 , z4 } since x = z1 - z4 and y = z3 - 2z4 An optimal subset may include features that are useless, according to the FG approach Slide12: Feature Selection - Subset Evaluation Metrics (SEMs) Let x be a feature subset, and let xk be a candidate feature for possible inclusion into x. Requirements for SEM f() f ( x U xk ) < f (x), ( U denotes union ) f() related to classification success (Pe for example ) Example SEMs Brute Force Subset Evaluation: Design a good classifier and measure Pe Scatter Matrices [1] Slide13: Feature Selection A New Approach [3] Scanning Approach : Floating Search SEM : Pe for piecewise linear classifier At the output, the absence of groups 1, 3, and 5 reveals problemsFeature SelectionExample 1: Feature Selection Example 1 Classification of Numeral Images: N’ = 16 features and Nc = 10 Chosen Subsets {6, } {6, 9, } {6, 9, 14, } {6, 9, 14, 13, } {6, 9, 14, 13, 3, } {6, 9, 14, 13, 3, 15, } {6, 9, 14, 13, 11, 15, 16, } {6, 9, 14, 13, 11, 15, 16, 3, } {6, 9, 14, 13, 11, 15, 16, 3, 4, } {6, 9, 14, 13, 11, 15, 16, 3, 4, 1, } {6, 9, 14, 13, 11, 15, 16, 3, 4, 1, 12, } {6, 9, 14, 13, 11, 15, 16, 3, 4, 1, 12, 7, } {6, 9, 14, 13, 11, 15, 16, 3, 4, 1, 12, 7, 8, } {6, 9, 14, 13, 11, 15, 16, 3, 4, 1, 12, 7, 8, 5, } {6, 9, 14, 13, 11, 15, 16, 3, 4, 1, 12, 7, 8, 5, 2, } {6, 9, 14, 13, 11, 15, 16, 3, 4, 1, 12, 7, 8, 5, 2, 10, }Feature SelectionExample 2: Feature Selection Example 2 Classification of Sleep Apnea Data (Mohammad Al-Abed and Khosrow Behbehani): N’ = 90 features and Nc = 2 Chosen Subsets {11, } {11, 56, } {11, 56, 61, } {11, 28, 55, 63, } {11, 28, 55, 63, 27, } {11, 28, 55, 63, 62, 17, } {11, 28, 55, 63, 53, 17, 20, } {11, 28, 55, 63, 53, 17, 20, 62, } {11, 28, 55, 63, 53, 17, 20, 4, 40, } {11, 28, 55, 63, 53, 26, 20, 4, 40, 8, } {11, 28, 22, 19, 53, 26, 20, 4, 40, 30, 80, } {11, 28, 22, 19, 53, 26, 20, 4, 40, 30, 80, 85, } {11, 28, 22, 19, 53, 26, 20, 4, 40, 30, 80, 85, 8, } {11, 28, 22, 19, 53, 26, 20, 4, 40, 30, 80, 85, 8, 48, } {11, 28, 22, 19, 53, 26, 20, 4, 40, 30, 80, 85, 8, 48, 38, } {11, 28, 22, 19, 53, 26, 20, 4, 40, 30, 80, 85, 8, 48, 45, 87, } {11, 28, 13, 19, 53, 26, 65, 4, 40, 30, 80, 85, 8, 48, 75, 87, 18, }Slide16: Complexity Minimization Complexity: Defined as the number of free parameters or coefficients in a processor Complexity Minimization (CM) Procedure Minimize training set Pe for each classifier size, with respect to all weights or coefficients. Nw is the number of weights or coefficients. Measure Pe for a validation data set. Choose that network that minimizes the validation set’s Pe CM leads to smaller, better classifiers, if it can be performed. It is related to SRM [4,5]. Implication: To perform SRM, we need methods for quickly varying network size during training (growing) or after training (pruning) Complexity Minimization: Complexity Minimization E Nw validation Vary Nw until validation error is minimized. Slide18: Recent Classification Technologies Neural Nets Minimize MSE E(w), where tpi = 1 for the correct class (i = ic ) and 0 for an incorrect class (i = id ) . yi(x) is the ith output discriminant of the trained classifier. Neural net support vector x satisfies: yic(x) – max{ yid (x)} = 1 The MSE between yi(x) and bi(x) = P(i|x) is Theorem 1 [5,6]: As the number of training patterns Nv increases, the training error E(w) approaches e(w)+C, where C is a constant. Slide19: Recent Classification Technologies Neural Nets Advantages Neural net outputs approximate Bayes discriminant P(i | x) Training modifies all network weights CM easily performed via growing and pruning methods Accommodates any size training data file Problems E(w) and e(w) are not proportional to Pe. From Theorem 1, yi = bi + εi, where εi is random zero-mean noise. Noise degrades performance, leaving room for improvement via SVM training and boosting. Recent Classification Technologies Support Vector Machines [4,5]: Recent Classification Technologies Support Vector Machines [4,5] SVM: A neural net structure with these properties: Output weights form hyperplane decision boundaries Input vectors xp satisfying yp(ic) = +b and max{ y (id) } = -b are called support vectors Correctly classified input vectors xp outside the decision margins do not adversely affect training Incorrectly classified xp do not strongly affect training In some SVMs, Nh (number of hidden units) initially equals Nv (the number of training patterns) Training may involve quadratic programming Slide21: b b SVM discriminant LMS discriminant x2 x1 SV SV SV SV Correctly classified patterns distort the LMS discriminant Recent Classification Technologies Support Vector MachinesRecent Classification TechnologiesSupport Vector Machines: Recent Classification Technologies Support Vector Machines Advantage: Good classifiers from small data sets Problems SVM methods are practical only for small data sets Training difficult when there are many classes Kernel parameter found by hit or miss Fails to minimize complexity (far too many hidden units required, hidden unit weights don’t adapt) Current Work Developing SVM pruning approach Developing Regression-based maximal margin classifiersRecent Classification TechnologiesBoosting [5]: Recent Classification Technologies Boosting [5] yik(x): ith class discriminant for the kth classifier. K is the number of classifiers being fused. In discriminant fusion, we calculate the ak so that the weighted average discriminant, Adaboost [5] sequentially picks a training subset {xp, tp}k, from the available data and designs yik(x) and ak so they are functions of the previous (k-1) classifiers. has better performance than the yik(x)Recent Classification TechnologiesBoosting: Recent Classification Technologies Boosting Advantage Final classifier can be used to process video signals in real time when step function activations are used Problems Works best for the two-class case, uses huge data files. Final classifier is a large, highly redundant neural net. Training can take days, CM not performed. Future Work Pruning should be tried to reduce redundancy. Feature selection should be tried to speed up trainingRecent Classification TechnologiesProblems with Adaboost: Recent Classification Technologies Problems with Adaboost Future Work : Prune ADA boosted networks N = 3, Nc = 2, K = 3 (number of classifiers being fused ) A Maximal Margin ClassifierProblems With MSE Type Training: A Maximal Margin Classifier Problems With MSE Type Training The ith class neural net discriminant can be modeled as yp(i)= tp(i) + εp(i) This additive noise εp(i) degrades the performances of regression-based classifiers, as mentioned earlier. Correctly classified patterns contribute to MSE, and can adversely affect training (See following page). In other words, regression-based training tries to force all training patterns to be support vectors A Maximal Margin Classifier Problems With MSE Type Training: A Maximal Margin Classifier Problems With MSE Type Training (1) Standard regression approach tries to force all training vectors to be support vectors (2) Red lines are counted as errors, even though those patterns are classified more correctly than desired (3) Outliers can distort decision boundary locationsA Maximal Margin Classifier[5]Existence of Regression-Based Optimal Classifiers: A Maximal Margin Classifier[5] Existence of Regression-Based Optimal Classifiers Let X be the basis vector (hidden units + inputs) of a neural net. The output vector is y = W·X The minimum MSE is found by solving R·W = C (1) where R = E[X·XT ] and C = E[X·tT ] (2) If an “optimal” coefficient matrix Wopt exists, Copt = R·Wopt from (1), so Copt exists. From (2), we can find Copt if the desired output vector t is defined correctly. Regression-based training can mimic other approaches. Slide29: A Maximal Margin Classifier Regression Based Classifier Design [7] Consider the Empirical Risk (MSE) If yp(ic) > tp(ic) (Correct class discriminant large) or yp(id) < tp(id) (Incorrect class discriminant small) Classification error decreases but E increases.Slide30: If yp(ic) > tp(ic ) set tp‘(ic) = yp(ic) If yp(id) < tp(id ) for an incorrect class id, set tp‘(id) = yp(id). In both cases, tp‘(i) is set equal to yp(i) so no error is counted. This algorithm partially mimics SVM training since correctly classified patterns do not affect the MSE.( Related to Ho-Kashyap procedures, [5] pp.249-256 ) The discrepancy is fixed by re-defining the empirical risk as A Maximal Margin Classifier Regression Based Classifier Design-ContinuedA Maximal Margin Classifier Problems With MSE Type Training: A Maximal Margin Classifier Problems With MSE Type Training Recall that E counts all non-support vectors as erroneousA Maximal Margin Classifier Errors Contributing to E’: A Maximal Margin Classifier Errors Contributing to E’ In E’, only errors (green lines) inside the margins are minimized. Outliers can be eliminated.Slide33: A Maximal Margin Classifier Comments The proposed algorithm: Adapts to any number of training patterns Allows for any number of hidden units Makes CM straightforward (4) Is used to train the MLP. The resulting classifier is called a maximal margin classifier (MMC) Questions: Does this really work ? How do the MMC and SVM approaches compare ? A Maximal Margin Classifier Two-Class Example: A Maximal Margin Classifier Two-Class Example Numeral Classification: N = 16, Nc = 2, Nv = 600 Goal: Discriminate numerals 4 and 9. SVM: Nh > 150, Et = 4.33% and Ev = 5.33 % MMC: Nh = 1 , Et = 1.67% and Ev = 5.83 % Comments: The 2-Class SVM seems better, but the price is too steep.A Maximal Margin Classifier Multi-Class Examples: A Maximal Margin Classifier Multi-Class Examples Numeral Classification: N = 16, Nc = 10, Nv = 3,000 SVM: Nh > 608, Ev = 14.53 % MMC: Nh = 32 , Ev = 8.1 % Bell Flight Condition Recognition: N = 24, Nc = 39, Nv = 3,109 SVM: Training fails MMC: Nh = 20 , Ev = 6.97 % Nh - number of hidden units Nv - number of training patterns Ev – validation error percentage Conclusion: SVMs may not work for medium and large size multi-class problems.Slide36: Growing and Pruning Candidate MLP Training Block If complexity minimization (CM) is used, the resulting Ef(Nh) curve is monotonic Practical ways of approximating CM, are growing and pruning. Slide37: Growing and Pruning Growing Ef(w) Nw validation Growing: Starting with no hidden units, repeatedly add Na units and train the network some more. Advantages: Creates a monotonic Ef(Nh) curve. Usefulness is concentrated in the first few units added. Disadvantage: Hidden units are not optimally ordered.Slide38: Growing and Pruning Pruning [8] Ef(w) Nw validation Pruning: Train a large network. Then repeatedly remove a less useful unit using OLS. Advantages: Creates a monotonic Ef(Nh) curve. Hidden units are optimally ordered Disadvantage: Usefulness is not concentrated in the first few unitsSlide39: Growing and Pruning Pruning a Grown Network [9] Training error for Inversion of Radar Scattering Data set is for inversion of radar scattering from bare soil surfaces. It has 20 inputs and 3 outputs Slide40: Growing and Pruning Pruning a Grown Network Validation error for Radar Scattering Inversion Slide41: Growing and Pruning Pruning a Grown Network Prognostics data set for onboard flight load synthesis (FLS) in helicopters, where we estimate mechanical loads on critical parts, using measurements available in the cockpit. There are 17 inputs and 9 outputs. Training error for Prognostics data Slide42: Growing and Pruning Pruning a Grown Network Validation error plots for Prognostics data Slide43: Growing and Pruning Pruning a Grown Network Data set for estimating phoneme likelihood functions in speech, has 39 inputs and 117 outputs Training error for Speech data Slide44: Growing and Pruning Pruning a Grown Network Validation error for Speech data Growing and Pruning : Remaining Work: Insert into the IPNNL Optimization System Growing and Pruning Slide46: IPNNL Software Motivation Theorem 2 (No Free Lunch Theorem [5] ) : In the absence of assumptions concerning the training data, no training algorithm is inherently better than another. Implication: We usually make some assumptions. However, given training data, several classifiers should be tried after feature selection. Slide47: IPNNL Software Block Diagram Size Train Prune & Validate Feature Selection MLP PLN FLN LVQ SOM SVM RBF Data Final Network Analyze Your Data Select Network Type Produce Final Network IPNNL SoftwareExamples: IPNNL Software Examples The IPNNL Optimization system is demonstrated on: Flight condition recognition data from Bell Helicopter Textron (Prognostics Problem) Sleep apnea data from UTA Bioengineering (Prof. Khosrow Behbehani and Mohammad Al-Abed) Traveler characteristics data from UTA Civil Engineering (Prof. Steve Mattingly and Isaradatta Rasmidatta) ExamplesBell Helicopter Textron: Examples Bell Helicopter Textron Flight condition recognition (prognostics) data from Bell Helicopter Textron Features: N’ = 24 cockpit measurements Patterns: 4,745 Classes: Nc = 39 helicopter flight categories Run Feature selection, and save new training and validation files with only 18 features: Run Feature selection, and save new training and validation files with only 18 featuresRun MLP sizing, decide upon 12 hidden units: Run MLP sizing, decide upon 12 hidden unitsRun MLP training, save the network: Run MLP training, save the networkRun MLP pruning, with validation. Final network has 10 hidden units.: Run MLP pruning, with validation. Final network has 10 hidden units.ExamplesBehbehani and Al-Abed: Classification of Sleep Apnea Data (Mohammad Al-Abed and Khosrow Behbehani): Features: N’ = 90 features from Co-occurrence features applied to STDFT Patterns: 136 Classes: Nc = 2 ( Yes/No ) Previous Software: Matlab Neural Net Toolbox Examples Behbehani and Al-AbedSlide55: Run Feature selection, and save new training and validation files with only 17 features. The curve is ragged because of the small number of patterns.Run MLP sizing, decide upon 5 hidden units: Run MLP sizing, decide upon 5 hidden unitsRun MLP training, save the network: Run MLP training, save the networkRun MLP pruning, with validation. Final network has 3 hidden units.: Run MLP pruning, with validation. Final network has 3 hidden units.Slide59: Examples - Mattingly and Rasmidatta Classification of traveler characteristics data (Isaradatta Rasmidatta and Steve Mattingly): Features: N’ = 22 features Patterns: 7,325 Classes: Nc = 3 (car, air, bus/train ) Previous Software: NeuroSolutions by NeuroDimensionRun Feature selection, and save new training and validation files with only 4 features. The flat curve means few features are needed.: Run Feature selection, and save new training and validation files with only 4 features. The flat curve means few features are needed. Run MLP sizing, decide upon 2 hidden units. Flat curve means few hidden units, if any, are needed.: Run MLP sizing, decide upon 2 hidden units. Flat curve means few hidden units, if any, are needed.Run MLP training, save the network: Run MLP training, save the networkRun MLP pruning, with validation. Final network has 1 hidden unit.: Run MLP pruning, with validation. Final network has 1 hidden unit.Conclusions: Conclusions An effective feature selection algorithm has been developed Regression-based networks are compatible with CM Regression-based training can extend maximal margin concepts to many nonlinear networks Several existing and potential blocks in the IPNNL Optimization System have been discussed The system has been demonstrated on three pattern recognition applications A similar Optimization System is available for approximation/regression applications References: References [1] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd edition, Academic Press, 1990. [2] P. Pudil, J. Novovicova, J. Kittler, " Floating Search Methods in Feature Selection " Pattern Recognition Letters, vol 15 , pp 1119-1125, 1994 [3] Jiang Li, Michael T. Manry, Pramod Lakshmi Narasimha, and Changhua Yu, “Feature Selection Using a Piecewise Linear Network”, accepted by IEEE Trans. on Neural Networks. [4] Vladimir N. Vapnik, Statistical Learning Theory, John Wiley & Sons, 1998. [5] Duda, Hart and Stork, Pattern Classification, 2nd edition, John Wiley and Sons, 2001. [6] Dennis W. Ruck et al., "The Mulitlayer Perceptron as an Approximation to a Bayes Optimal Discriminant Function," IEEE Trans. on Neural Networks, Vol. 1, No. 4, 1990. [7] R.G. Gore, Jiang Li, Michael T. Manry, Li-Min Liu, Changhua Yu, and John Wei, "Iterative Design of Neural Network Classifiers through Regression". International Journal on Artificial Intelligence Tools, Vol. 14, Nos. 1&2 (2005) pp. 281-301. [8] F. J. Maldonado and M.T. Manry, "Optimal Pruning of Feed Forward Neural Networks Using the Schmidt Procedure", Conference Record of the Thirty Sixth Annual Asilomar Conference on Signals, Systems, and Computers., November 2002, pp. 1024-1028. [9] Pramod Lakshmi Narasimha, Walter Delashmit, Michael Manry, Jiang Li and Fransisco Maldonado, “Fast Generation of a Sequence of Trained and Validated Feed-Forward Networks”, accepted by the Nineteenth International Conference of the Florida AI Research Society, May 2006.