Experimental Methodology
COSC 4550/5550
Prof. D. Spears
Comparing Agents/Algorithms
Spears (in a Halloween costume): My agent is better than your agent!
Student: I don’t believe it. Our agents are stochastic and we both only ran for one trial!
Spears: OK. Let’s run them for 5 trials and compare.
Student: Done. When we take the two averages, your agent wins again. But I’m still not convinced, because 5 is so few that it could be a fluke of the random numbers that were generated.
Spears: Fine. Let’s run them for 20 trials and compare averages.
Student: Done. When we take the two averages, your agent wins again. But I am still not convinced. Even though your average is better, on some trials my agent is better. So it could still be a statistical fluke that your agent’s average is better. I propose that we call in an impartial judge to settle the dispute.
The Umpire
Dr. Spears, your student has the right idea, but even he is not strict enough. To prove
that your agent is better, you not only need to take an average over lots of trials, but you
also need to run a statistical significance test to show that the difference between the
averages of the two agents’ performances is meaningful. Statistical significance is greater
if the averages/means differ more, if the variances are lower, and if the averages are
taken over a larger number of trials. Normally, a statistical significance test will tell you
that, with a certain percent confidence, e.g., 95% confidence, one agent/algorithm is “better”
than another. Furthermore, it does not imply that the winner is a winner on all the
problems in the universe. It only says that the winner is better on problems that are
like the ones that the agents were tested on.
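The umpire’s advice can be sketched in code. The following is a minimal illustration, assuming each agent’s per-trial scores are collected in a list; the `welch_t` helper, the scores themselves, and the rough cutoff of 2 for ~95% confidence are illustrative assumptions, not part of the lecture material.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples of trial scores."""
    na, nb = len(sample_a), len(sample_b)
    ma, mb = statistics.mean(sample_a), statistics.mean(sample_b)
    va, vb = statistics.variance(sample_a), statistics.variance(sample_b)
    se = math.sqrt(va / na + vb / nb)  # standard error of the difference
    return (ma - mb) / se

# Hypothetical per-trial scores for two agents over 20 trials each.
agent_a = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2, 11.7, 12.4, 12.0,
           12.1, 12.6, 11.9, 12.2, 12.0, 12.3, 11.8, 12.5, 12.1, 12.2]
agent_b = [11.2, 11.0, 11.5, 10.9, 11.3, 11.1, 11.4, 10.8, 11.2, 11.0,
           11.3, 11.1, 10.9, 11.4, 11.2, 11.0, 11.3, 11.1, 11.2, 11.4]

t = welch_t(agent_a, agent_b)
# For samples of this size, |t| greater than roughly 2 corresponds to
# ~95% confidence that the difference in means is not a fluke.
print(f"t = {t:.2f}")
```

Note how the statistic matches the umpire’s three factors: it grows when the means differ more, when the variances are lower, and when the number of trials is larger.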
Experimental Methodology
Don’t say “My agent/algorithm is better than your agent/algorithm.”
Do say “My agent/algorithm is better than your agent/algorithm for this class of problems.”
Methodically vary the characteristics of your problem to see what variations cause significant degradation in performance. Try scaling up the difficulty of your problem to see how the algorithm scales.
Be able to explain why your agent/algorithm performs better/well. This could require a very detailed analysis. One useful technique is an ablation study. An ablation study consists of methodically removing portions of the agent/algorithm to see which ablation(s) cause significant performance degradation. Perturb one aspect at a time. Ablation studies can identify the source of strength/power of your agent/algorithm. Try other methodical variations of your problem/algorithm.
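An ablation study can be organized as a simple loop, as in the sketch below; `run_trials`, the component names, and the scores are hypothetical stand-ins for your own agent and evaluator.

```python
import statistics

def run_trials(disabled, n_trials=20):
    """Stand-in evaluator: returns per-trial scores for the agent with the
    given set of components disabled. Replace with real trials of your agent."""
    contribution = {"heuristic": 3.0, "memory": 1.5, "lookahead": 0.5}
    score = 10.0 - sum(contribution[c] for c in disabled)
    return [score] * n_trials  # deterministic stand-in scores

# Perturb one aspect at a time: remove exactly one component per run.
full = statistics.mean(run_trials(disabled=set()))
for component in ["heuristic", "memory", "lookahead"]:
    ablated = statistics.mean(run_trials(disabled={component}))
    print(f"without {component}: mean drops by {full - ablated:.1f}")
```

The component whose removal causes the largest significant drop is a likely source of the agent’s power.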
Develop clear, well-motivated hypotheses about your algorithm and test them carefully.
Experimental Methodology (cont’d)
Try to identify the weaknesses, as well as the strengths, of your algorithm.
If your agent/algorithm has parameters, methodically vary them to perform a sensitivity study.
When comparing two agents/algorithms, be sure the comparisons are as fair as possible. For example, as much as possible they should see the same data, be tested under the same conditions, have similar knowledge, etc.
If you vary as little as possible between two algorithms, you have the best chance of understanding why one is better than the other.
When comparing two agents/algorithms, show the mean performance of both and the statistical significance of any comparisons.
You should at least compare your algorithm with a baseline algorithm.
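A sensitivity study over parameters can follow the same one-aspect-at-a-time pattern. In this hypothetical sketch, `eval_agent` stands in for running the agent over many trials at one parameter setting; the parameter names and the performance surface are invented for illustration.

```python
import statistics

def eval_agent(learning_rate=0.1, epsilon=0.1, n_trials=20):
    """Stand-in: returns per-trial scores at one parameter setting.
    Here performance happens to peak near learning_rate=0.1, epsilon=0.1."""
    score = (10.0
             - 50 * (learning_rate - 0.1) ** 2
             - 30 * (epsilon - 0.1) ** 2)
    return [score] * n_trials  # deterministic stand-in scores

# Sweep one parameter while holding the other fixed at its default.
for lr in (0.01, 0.1, 0.5):
    m = statistics.mean(eval_agent(learning_rate=lr))
    print(f"learning_rate={lr}: mean={m:.2f}")
```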
Experimental Methodology and Statistics
To understand experimental methodology, one needs to understand some basic statistics.
The foundation for statistics is probability. So you’re already well-prepared by now!
Statistics
What is statistics?
Unlike probability, which is a purely mathematical technique, statistics has more of an empirical nature. It deals with populations and samples from the populations.
Descriptive statistics originated with government censuses. It involves summarizing, describing, and presenting data in the form of tables and charts.
Statistical inference is more popular these days, and it is what’s used in experimental methodology. Statistical inference is a method for drawing generalizations based on samples. These generalizations go beyond the data. They are a form of conjecture for predicting trends. They can be used to test hypotheses about the data. Statistical inference is also at the heart of much modern machine learning – because much of machine learning is inferring general conclusions from specific data. These conclusions are predictive.
Review: Performance Measure of an Agent
Performance measure: An external measure of how well the agent has done or is doing at the task.
Performance Comparisons
How do we compare the performance of two or more agents?
A purely deterministic situation:
Execute the agents in the environment once.
Compare their performance.
IMPORTANT QUESTION: If your environment is deterministic, and your agent’s algorithm is deterministic, but your agent starts in a randomly chosen location every trial/episode, can we consider this a purely deterministic situation?
A situation with some element of non-determinism:
Execute the agents in the environment multiple times.
Compare their performance. How? Statistical inference about the means of the performance.
Expected Value (Review)
Let X be a discrete-valued random variable, for which each x in the range of X has a value v(x). If x is a number, then v(x) is typically just x. Then the expected value of X, i.e., E[X], is defined by:

E[X] = Σ_x v(x) · P(X = x)

where the sum is over all x in the range of X. If X is numerically-valued, then E[X] is also called the mean, and it is typically denoted by μ.
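The definition of expected value can be checked with a tiny worked example: a fair six-sided die, where v(x) = x and each outcome has probability 1/6.

```python
def expected_value(values, probs):
    """E[X] = sum over x of v(x) * P(X = x)."""
    return sum(v * p for v, p in zip(values, probs))

die = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6
mean = expected_value(die, probs)
print(mean)  # approximately 3.5
```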
Estimation of the Mean
For statistical inferences about means, we first estimate the true mean of the population from sample data. The average of a sample drawn from the entire population is the estimate of the mean, called the sample mean x̄.
Estimating Mean Performance of Agents
For agents, this implies executing the agent in the environment for a number of trials (at least 10 trials; fewer than 30 trials is considered a “small sample”) and calculating average performance. Recall that the Law of Large Numbers states that with many, many trials the average will approach the mean.
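The Law of Large Numbers can be illustrated with a quick simulation; die rolls are used here because their true mean, 3.5, is known. The fixed seed is an assumption added only to make the run reproducible.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

# The sample mean drifts toward the true mean (3.5) as trials increase.
for n in (10, 100, 10000):
    rolls = [random.randint(1, 6) for _ in range(n)]
    print(n, statistics.mean(rolls))
```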
Machine Learning Evaluation
[Figure: a required learning curve, plotting average performance against time t.]
Requirements for the Term Project
Run your agent in the environment for 10 or more trials if there is any aspect of the situation that is non-deterministic.
(Some aspect should be random, even if just the start location.)
Do likewise for a “baseline agent” (e.g., a random guesser).
Report the following:
Mean (average) performance over the trials
Number of trials over which this has been averaged
The standard deviation or variance
If you are doing learning, then you can perform the comparison(s) at the end of learning. Optionally, you may also compare during learning.
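The required report numbers can be computed in a few lines; the scores below are hypothetical per-trial performances, standing in for your agent’s actual results.

```python
import statistics

# Hypothetical per-trial performance scores from 10 trials.
scores = [8.2, 7.9, 8.5, 8.1, 7.7, 8.4, 8.0, 8.3, 7.8, 8.2]

print("trials:          ", len(scores))
print("mean performance:", statistics.mean(scores))
print("sample std dev:  ", statistics.stdev(scores))
print("sample variance: ", statistics.variance(scores))
```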