# Experimental Methodology


## Presentation Transcript

### Experimental Methodology:

COSC 4550/5550, Prof. D. Spears

### Comparing Agents/Algorithms:

A dialogue between Dr. Spears (in a Halloween costume) and a student:

Spears: My agent is better than your agent!

Student: I don’t believe it. Our agents are stochastic and we both only ran for one trial!

Spears: OK. Let’s run them for 5 trials and compare.

Student: Done. When we take the two averages, your agent wins again. But I’m still not convinced, because 5 is so few that it could be a fluke of the random numbers that were generated.

Spears: Fine. Let’s run them for 20 trials and compare averages.

Student: Done. When we take the two averages, your agent wins again. But I am still not convinced. Even though your average is better, on some trials my agent is better. So it could still be a statistical fluke that your agent’s average is better. I propose that we call in an impartial judge to settle the dispute.

### The Umpire:

Dr. Spears, your student has the right idea, but even he is not strict enough. To prove that your agent is better, you not only need to take an average over many trials, but you also need to run a statistical significance test to show that the difference between the averages of the two agents’ performances is meaningful. Statistical significance is greater if the averages/means differ more, if the variances are lower, and if the averages are taken over a larger number of trials. Normally, a statistical significance test will tell you that, with a certain confidence (e.g., 95% confidence), one agent/algorithm is “better” than another. Furthermore, it does not imply that the winner is a winner on all the problems in the universe. It only says that the winner is better on problems like the ones the agents were tested on.
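The umpire’s test can be sketched in code. Below is a minimal, stdlib-only illustration of Welch’s t-test, using a normal approximation for the p-value (adequate when both samples have roughly 30 or more trials); the two score lists are made-up example data, not results from any real agent.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    return (mean_a - mean_b) / math.sqrt(var_a / len(a) + var_b / len(b))

def approx_two_sided_p(t):
    """Two-sided p-value via the normal approximation to the
    t-distribution (reasonable when both samples have ~30+ trials)."""
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(t) / math.sqrt(2.0))))

# Made-up per-trial performance scores for two agents, 30 trials each.
agent_a = [12.1, 11.8, 12.4, 12.0, 11.9, 12.2, 12.3, 11.7, 12.1, 12.0] * 3
agent_b = [11.2, 11.5, 11.0, 11.4, 11.3, 11.1, 11.6, 11.2, 11.4, 11.3] * 3

t = welch_t(agent_a, agent_b)
p = approx_two_sided_p(t)
# p < 0.05 would let the umpire say, with 95% confidence,
# that agent A's mean performance really is higher.
```

In practice you would use a proper t-distribution rather than this normal approximation, e.g. `scipy.stats.ttest_ind(a, b, equal_var=False)`.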

### Experimental Methodology (cont’d):

- Try to identify the weaknesses, as well as the strengths, of your algorithm.
- If your agent/algorithm has parameters, methodically vary them to perform a sensitivity study.
- When comparing two agents/algorithms, be sure the comparisons are as fair as possible. For example, as much as possible they should see the same data, be tested under the same conditions, have similar knowledge, etc. If you vary as little as possible between two algorithms, you have the best chance of understanding why one is better than the other.
- When comparing two agents/algorithms, show the mean performance of both and the statistical significance of any comparisons.
- You should at least compare your algorithm with a baseline algorithm.
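The sensitivity-study advice can be sketched as a parameter sweep. Everything here is hypothetical: `run_agent` stands in for one trial of some agent whose performance happens to peak near `learning_rate = 0.1`, and the fixed seeds keep the comparison across parameter settings fair (each setting sees the same random draws).

```python
import random
import statistics

def run_agent(learning_rate, seed):
    """One trial of a hypothetical agent; returns a performance score.
    (Made-up model: performance peaks at learning_rate = 0.1.)"""
    rng = random.Random(seed)
    return 100.0 - 500.0 * (learning_rate - 0.1) ** 2 + rng.gauss(0, 1)

# Sensitivity study: vary ONE parameter methodically, holding
# everything else (including the random seeds) constant.
results = {}
for lr in (0.01, 0.05, 0.1, 0.2, 0.5):
    scores = [run_agent(lr, seed) for seed in range(20)]
    results[lr] = (statistics.mean(scores), statistics.stdev(scores))
# results maps each parameter value to (mean, stdev) over 20 trials,
# showing how sensitive performance is to the parameter.
```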

### Experimental methodology and statistics:

To understand experimental methodology, one needs to understand some basic statistics. The foundation for statistics is probability. So you’re already well-prepared by now!

### Statistics:

What is statistics? Unlike probability, which is a purely mathematical technique, statistics has more of an empirical nature. It deals with populations and samples from those populations.

Descriptive statistics originated with government censuses. It involves summarizing, describing, and presenting data in the form of tables and charts.

Statistical inference is more popular these days, and it is what’s used in experimental methodology. Statistical inference is a method for drawing generalizations based on samples. These generalizations go beyond the data. They are a form of conjecture for predicting trends, and they can be used to test hypotheses about the data. Statistical inference is also at the heart of much modern machine learning, because much of machine learning is inferring general conclusions from specific data. These conclusions are predictive.

### Review: Performance measure of an agent:

Performance measure: an external measure of how well the agent has done, or is doing, at the task.

### Performance Comparisons:

How do we compare the performance of two or more agents?

A purely deterministic situation: execute the agents in the environment once, then compare their performance.

IMPORTANT QUESTION: If your environment is deterministic, and your agent’s algorithm is deterministic, but your agent starts in a randomly chosen location every trial/episode, can we consider this a purely deterministic situation?

A situation with some element of non-determinism: execute the agents in the environment multiple times, then compare their performance. How? Statistical inference about the means of the performance.

### Expected Value:

(Review.) Let X be a discrete-valued random variable, for which each x in the range of X has a value v(x). If x is a number, then v(x) is typically just x. Then the expected value of X, i.e., E[X], is defined by:

E[X] = Σₓ v(x) · P(X = x)

If X is numerically-valued, then E[X] is also called the mean, and it is typically denoted by μ.
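A quick worked instance of the definition, using a fair six-sided die (so v(x) = x and P(X = x) = 1/6 for each face):

```python
from fractions import Fraction

# E[X] = sum over x of v(x) * P(X = x), with v(x) = x for a die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}
expected = sum(x * p for x, p in pmf.items())
# expected == Fraction(7, 2), i.e. the familiar mean of 3.5
```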

### Estimation of the mean:

For statistical inferences about means, we first estimate the true mean of the population from sample data. The average of the sample is the estimate of the mean, called the sample mean x̄.

(Diagram: a sample drawn from the entire population.)

### Estimating mean performance of agents:

For agents, this implies executing the agent in the environment for a number of trials (at least 10 trials; fewer than 30 trials is considered a “small sample”) and calculating the average performance. Recall that the Law of Large Numbers states that with many, many trials the average will approach the true mean.
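The Law of Large Numbers is easy to see numerically. This sketch draws samples from Uniform(0, 1), whose true mean is 0.5, and compares the error of a 10-trial average against a 100,000-trial average (the fixed seed just makes the run reproducible):

```python
import random
import statistics

rng = random.Random(0)  # fixed seed for reproducibility
true_mean = 0.5         # mean of Uniform(0, 1)

small_sample = [rng.random() for _ in range(10)]
large_sample = [rng.random() for _ in range(100_000)]

err_small = abs(statistics.mean(small_sample) - true_mean)
err_large = abs(statistics.mean(large_sample) - true_mean)
# With many, many trials the sample mean approaches the true mean,
# so err_large ends up very close to zero.
```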

### Machine Learning Evaluation:

(Figure: a required learning curve, plotting average performance against time t.)

### Requirements for the Term Project:

Run your agent in the environment for 10 or more trials if there is any aspect of the situation that is non-deterministic. (Some aspect should be random, even if just the start location.) Do likewise for a “baseline agent” (e.g., a random guesser). Report the following:

- Mean (average) performance over the trials
- Number of trials over which this has been averaged
- The standard deviation or variance

If you are doing learning, then you can perform the comparison(s) at the end of learning. Optionally, you may also compare during learning.
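A minimal sketch of the required report, using only the standard library. `run_trial` and its scoring rule are hypothetical stand-ins for your agent; the random start location is exactly the kind of non-determinism that makes repeated trials necessary.

```python
import random
import statistics

def run_trial(rng):
    """One trial of a hypothetical agent: the start location is random,
    so each trial's score differs and must be averaged."""
    start = rng.random()                 # random start location in [0, 1)
    return 1.0 - abs(start - 0.5)        # made-up performance score

rng = random.Random(42)                  # fixed seed so the report is reproducible
scores = [run_trial(rng) for _ in range(30)]

report = {
    "mean": statistics.mean(scores),     # mean performance over the trials
    "n_trials": len(scores),             # number of trials averaged
    "stdev": statistics.stdev(scores),   # spread of performance across trials
}
```

The same three numbers would then be reported for the baseline agent, and the two means compared with a significance test.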