IF hours-before-takeoff >= 252 AND price >= 2223 AND route = LAX-BOS THEN wait
IF airline = United AND price >= 360 AND hours-before-takeoff >= 438 THEN wait
Figure 5: Sample Ripper rules.
of misclassifying a ‘buy’ as a ‘wait’ compared with the cost
of misclassifying a ‘wait’ as a ‘buy’. We implemented MetaCost
with mixed results: it improves Ripper’s performance by 14
percent but hurts Hamlet’s overall performance by 29 percent.
As a result, we did not use MetaCost in Hamlet.
4.2 Q-learning
As our next step we considered Q-learning, a species of
reinforcement learning [25]. Reinforcement learning seems
like a natural fit because, after each new price observation,
Hamlet has to decide whether to buy or to wait.
Yet the reward (or penalty) associated with the decision is
only determined later, when Hamlet determines whether it
saved or lost money through its buying policy. Reinforcement
learning is also a popular technique in computational
finance [20, 21, 22].
The standard Q-learning formula is:
$$Q(a, s) = R(s, a) + \gamma \max_{a'} Q(a', s')$$
Here, R(s, a) is the immediate reward, γ is the discount
factor for future rewards, and s' is the state resulting from
taking action a in state s. We use the notion of state to
model the state of the world after each price observation
(represented by the price, flight number, departure date,
and number of hours prior to takeoff). Thus, there are two
possible actions in each state: b for ‘buy’ and w for ‘wait’.
Of course, the particular reward function used is critical
to the success (or failure) of Q-learning. In our study, the
reward associated with b is the negative of the ticket price
at that state, and the state resulting from b is a terminal
state so there is no future reward. The immediate reward
associated with w is zero as long as economy tickets on the
flight do not sell out in the next time step. We set γ = 1,
so we do not discount future rewards.
To discourage the algorithm from learning a model that
waits until flights sell out, we introduce a “penalty” for such
flights in the reward function. Specifically, in the case where
the flight does sell out at the next time point, we make
the immediate reward for waiting a negative constant whose
absolute value is substantially greater than the price for any
flight. We set the reward for reaching a sold-out state to be
−300,000. The rationale for this choice is best explained below, after
we introduce a notion of equivalence classes among states.
In short, we define the Q function by
$$Q(b, s) = -\mathit{price}(s)$$
$$Q(w, s) = \begin{cases} -300{,}000 & \text{if the flight sells out after } s,\\ \max\bigl(Q(b, s'),\, Q(w, s')\bigr) & \text{otherwise.} \end{cases}$$
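As a concrete illustration of this recursion, the Q values for a single flight can be computed backward from departure. The following minimal Python sketch is ours, not Hamlet's code; in particular, we assume a forced purchase at the final observation when the flight never sells out:

    SELLOUT_PENALTY = -300_000  # large penalty for waiting until a sell-out

    def q_values(prices, sells_out):
        """Return (Q(b, s), Q(w, s)) for each price observation of one
        flight, earliest first, following the definition above (gamma = 1)."""
        q = [None] * len(prices)
        for i in reversed(range(len(prices))):
            q_b = -prices[i]                    # buying is terminal: pay the fare
            if i == len(prices) - 1:            # last observation before takeoff
                q_w = SELLOUT_PENALTY if sells_out else q_b  # forced buy (assumed)
            else:
                q_w = max(q[i + 1])             # immediate reward 0, best successor
            q[i] = (q_b, q_w)
        return q

For example, q_values([2500, 2000], sells_out=False) returns [(-2500, -2000), (-2000, -2000)]: waiting in the first state is worth −2,000 because the fare later drops.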
To generalize from the training data we used a variant
of the averaging step described in [18]. More specifically,
we defined an equivalence class over states, which enabled
the algorithm to train on a limited set of observations of the
class and then use the learned model to generate predictions
for other states in the class.
To define our equivalence class we need to introduce some
notation. Airlines typically use the same flight number (e.g.,
UA 168) to refer to multiple flights with the same route
that depart at the same time on different dates. Thus,
United flight 168 departs once daily from LAX to Boston at
10:15pm. We refer to a particular flight by a combination of
its flight number and date. For example, UA168-Jan7 refers
to flight 168 which departs on January 7th, 2003. Since we
observe the price of each flight eight times in every 24-hour
period, there are many price observations for each flight. We
distinguish among them by recording the time (number of
hours) until the flight departs. Thus, UA168-Jan7-120 is the
price observation for flight UA168, departing on January 7,
which was recorded on January 2nd (120 hours before the
flight departs on the 7th). Our equivalence class is the set of
states with the same flight number and the same hours before
takeoff, but different departure dates. Thus, the states
denoted UA168-Jan7-120 and UA168-Jan10-120 are in the
same equivalence class, but the state UA168-Jan7-117 is not.
We denote that s and s∗ are in the same equivalence class
by s ∼ s∗.
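Concretely, the equivalence relation amounts to a key function that ignores the departure date. A tiny illustration (the record type and names are our own, not Hamlet's):

    from typing import NamedTuple

    class Obs(NamedTuple):        # a hypothetical price-observation record
        flight_number: str        # e.g., "UA168"
        departure_date: str       # e.g., "Jan7"
        hours_left: int           # hours before takeoff, e.g., 120

    def eq_class(o: Obs):
        """States agreeing on flight number and hours-before-takeoff are
        equivalent, whatever their departure date."""
        return (o.flight_number, o.hours_left)

    assert eq_class(Obs("UA168", "Jan7", 120)) == eq_class(Obs("UA168", "Jan10", 120))
    assert eq_class(Obs("UA168", "Jan7", 120)) != eq_class(Obs("UA168", "Jan7", 117))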
Thus, our revised Q-learning formula is:
$$Q(a, s) = \operatorname*{Avg}_{s^* \sim s} \Bigl( R(s^*, a) + \max_{a'} Q(a', s') \Bigr)$$
The reason for choosing −300,000 is now more apparent:
the large penalty can tilt the average toward a low value,
even when many Q values are being averaged together.
Suppose, for example, that there are ten training examples
in the same equivalence class, each with a current price of
$2,500. Suppose further that in nine of the ten examples the
price drops to $2,000 at some point in the future, but the
flight in the tenth example sells out in the next state. The Q
value for waiting in any state in this equivalence class will be
(−300,000 − 9 × 2,000)/10 = −31,800, still much less than
the Q value for any equivalence class in which no flight sells
out in the next state. Thus the choice of reward for a flight
that sells out determines how willing the Q-learning algorithm
will be to risk waiting when there is a chance a flight may
sell out. Using a hill-climbing search in the space of penalties,
we found −300,000 to be locally optimal.
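The arithmetic in this example, and the buying decision it implies, are easy to check directly (values as in the text):

    # Nine flights whose price later drops to $2,000, one that sells out.
    q_wait = (9 * -2_000 + -300_000) / 10   # -31800.0
    q_buy = -2_500                          # buying now at the $2,500 fare
    assert q_wait < q_buy                   # the learned policy prefers to buy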
Q-learning can be very slow, but we were able to exploit
the structure of the problem and the close relationship
between dynamic programming and reinforcement learning
(see [25]) to complete the learning in one pass over the
training set. Specifically, the reinforcement learning problem
we face has a particularly nice structure, in which the value
of Q(b, s) depends only on the price in state s, and the
value of Q(w, s) depends only on the Q values of exactly
one other state: the state with the same flight number and
departure date but three hours less time left until departure.
Applying dynamic programming is thus straightforward, and
the initial training step requires only a single pass over the
data. To compute averages over states in the same equivalence
class, we keep a running total and a count of the Q values in
each class. Thus, the reinforcement learning algorithm makes
just a single pass over the training data, which bodes well for
scaling the algorithm to much larger data sets.
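Putting the pieces together, the whole training step can be sketched as one backward pass per flight feeding running totals per equivalence class. As before, the names and the input format are our own assumptions, not Hamlet's implementation:

    from collections import defaultdict

    SELLOUT_PENALTY = -300_000  # reward for waiting into a sell-out

    def train(flights):
        """Single-pass Q-learning via dynamic programming (a sketch).

        flights maps (flight_number, departure_date) to (obs, sells_out),
        where obs is a chronological list of (hours_before_takeoff, price)
        pairs and sells_out marks flights whose economy seats sold out
        after the last observation. Returns the averaged (Q(b), Q(w)) for
        each equivalence class (flight_number, hours_before_takeoff).
        """
        sums = defaultdict(lambda: [0.0, 0.0, 0])  # class -> [sum Qb, sum Qw, n]
        for (flight_number, _date), (obs, sells_out) in flights.items():
            q_next = None                           # (Qb, Qw) of the successor state
            for hours, price in reversed(obs):      # backward from departure
                q_b = -price                        # buying is terminal
                if q_next is None:                  # last observation before takeoff
                    q_w = SELLOUT_PENALTY if sells_out else q_b
                else:
                    q_w = max(q_next)               # immediate reward 0, gamma = 1
                acc = sums[(flight_number, hours)]  # running total for the class
                acc[0] += q_b
                acc[1] += q_w
                acc[2] += 1
                q_next = (q_b, q_w)
        return {c: (qb / n, qw / n) for c, (qb, qw, n) in sums.items()}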
The output of Q-learning is the learned policy, which
determines whether to buy or wait in unseen states by mapping
them to the appropriate equivalence class and choosing the