
Which of the following is true for a beta-balanced portfolio?


a) The beta values of all stocks in the portfolio are equal.


b) The beta values of all stocks in the portfolio sum to zero.


c) All stocks in the portfolio have equal weight.


d) The weighted beta values of all stocks in the portfolio sum to zero.

d) as described in Chapter 7 of the text: "Most hedge funds seek beta-balanced portfolios so that they are precisely protected against market-wide moves. That means, essentially Sum(beta_i*w_i) = 0, and Sum(|w_i|)=1.0"
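
A quick numeric check of those two conditions (a minimal sketch using numpy; the betas and weights below are hypothetical, matching the later example card):

import numpy as np

betas = np.array([0.75, 0.25, 1.0])     # hypothetical stock betas
weights = np.array([-0.5, -0.1, 0.4])   # hypothetical portfolio weights

weighted_beta = np.sum(betas * weights)    # should be ~0 for a beta-balanced portfolio
gross_exposure = np.sum(np.abs(weights))   # should be ~1.0 (fully invested, long + short)
print(np.isclose(weighted_beta, 0.0), np.isclose(gross_exposure, 1.0))  # True True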

How does Bootstrap Aggregating (Bagging) differ from Boosting?




a) It doesn’t, they are two terms for the same thing.


b) Bagging samples with replacement, while boosting samples without replacement.


c) Boosting weights instances which the model got wrong such that they are more likely to get selected in a subsequent sampling.


d) Bagging weights instances which the model got wrong such that they are less likely to get selected in a subsequent sampling.

c) as described in lecture 03-04 Ensemble learners, bagging, and boosting.

Which of the following is true with respect to overfitting? D is degrees of freedom in a parametric model and k is the number of nearest neighbors in KNN.


a) Overfitting occurs as both D and k increase


b) Overfitting occurs as both D and k decrease


c) Overfitting occurs as k increases and D decreases


d) Overfitting occurs as k decreases and D increases

d) as described in lecture 03-03 Assessing a learning algorithm: For parametric models, as D increases, overfitting is more likely to occur. For instance-based models, as k decreases, overfitting is more likely to occur; k = 1 is the most overfit.

How is the data that fills up each bag chosen from the training data?


a) Randomly, without replacement


b) Randomly, with replacement


c) Each bag gets a copy of the full training data set


d) Systematic sampling where each kth element is put into the bag

b) as described in lecture 03-04 Ensemble learners, bootstrap aggregating, and boosting: Data is chosen from the training data randomly, with replacement, for each bag until it reaches n'. Giving every bag the entire training data set would not be helpful, because your models would then produce the same output each time. Systematic sampling wasn't discussed in this class.
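
A minimal sketch of that sampling step (numpy assumed; the helper name make_bag is just for illustration):

import numpy as np

def make_bag(X, Y, n_prime, rng=np.random.default_rng(0)):
    # draw n' row indices uniformly at random, WITH replacement
    idx = rng.choice(len(X), size=n_prime, replace=True)
    return X[idx], Y[idx]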

What is the main way in which Kernel Regression differs from KNN?


a) In Kernel Regression the contributions of the nearest data points are weighted, while in KNN each point is given equal weight.


b) In KNN the contributions of the nearest data points are weighted, while in Kernel Regression each point is given equal weight.


c) Kernel Regression is highly dependent on the data following the Corn distribution.


d) Kernel Regression is much faster than KNN because it executes in the kernel rather than in user space.

Correct answer is a) as described in lecture 03-02 Regression.

Rank the following in terms of training time




KNN - Lin Reg - Decision Tree - Forest

KNN < Lin Reg < Decision Tree < Forest

Which is not a component of a Markov Decision Problem?


a) Set of states (S) and actions (A)


b) A transition function (T)


c) A reward function (R)


d) An optimal policy (π*)

Correct answer is d) as described in lecture 03-05 Reinforcement Learning.

Rank the following in terms of query time




Lin Reg - Decision Tree - Forest - KNN

Lin Reg < Decision Tree < Forest < KNN

Rank the following based on space required (after training): Lin Reg - Decision Tree - Forest - KNN

Lin Reg < Decision Tree < Forest < KNN

Why were we not able to employ value iteration and/or policy iteration in the context of our strategy learner?


a) Policy iteration and value iteration are too slow to converge on large sets of data


b) Policy iteration and value iteration require a transition function (T) and a reward function (R) which we did not have


c) We could have used either, but they are too difficult to implement and Professor Tucker wanted to be nice


d) Policy iteration and value iteration are limited to binary actions (Take Action or do not) and we needed a robust algorithm which could recommend more actions.

Correct answer is b) as described in lecture 03-05 Reinforcement Learning.

Rank the following in terms of Ease of incorporating new training data: Forest - Lin Reg - Decision Tree - KNN

Forest < Lin Reg < Decision Tree < KNN

You are the manager of an ETF that tracks the performance of the S&P 500 (i.e., your ETF is just like SPY). What are the alpha and beta numbers for your ETF?




A) alpha = 0, beta = 0


B) alpha = 0, beta = 1


C) alpha = 1, beta = 0


D) alpha = 1, beta = 1

Answer: B. Explanation: Your ETF/SPY tracks the overall market's performance. Therefore its alpha = 0 and beta = 1, because its performance is equivalent to that of the market.

Given the following symbols and their associated betas, what would be a beta-balanced portfolio? ABC: .75, DEF: .25, GHI: 1




a) [ABC: -.5, DEF: -.1, GHI: .4]


b) [ABC: .5, DEF: .1, GHI: .4]


c) [ABC: .5, DEF: .25, GHI: .25]


d) [ABC: -.5, DEF: .25, GHI: .25]

Correct answer: a. Reason: It is the only one that satisfies both properties of a beta-balanced portfolio (page 54):

>>> (.75 * -.5) + (.25 * -.1) + (1 * .4)
0.0
>>> .5 + .1 + .4
1.0

Which of the following metric is most suitable in determining whether prediction quality linearly matches up with actual data?




Pearson Correlation


RMSE


Spearman Correlation


MSE

Pearson correlation measures the linear relationship between two variables.
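
For illustration (numpy/scipy assumed; the arrays are made up), Pearson is what np.corrcoef reports, while Spearman is rank-based:

import numpy as np
from scipy.stats import spearmanr

pred = np.array([1.0, 2.1, 2.9, 4.2])     # hypothetical predictions
actual = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical actual values

pearson = np.corrcoef(pred, actual)[0, 1]  # strength of the linear relationship
spearman, _ = spearmanr(pred, actual)      # strength of the monotonic (rank) relationship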

What is the fundamental feature of supervised learning that distinguishes it from unsupervised learning? For all following choices (A through D), X is an m x n matrix where each column corresponds to one feature/factor and each row represents one data instance. Y is a vector of size m in which all prediction values are stored. For supervised learning:




Both X and Y are provided when building the predictive model using the ML algorithms. Each row of X and each value of Y are given as a data pair. In this data pair, the Y value is associated with the row in X.




Both X and Y are provided when building the predictive model using the ML algorithms. Each row of X and each value of Y are given as a data pair. In this data pair, the Y value is not necessarily associated with the row in X.




Only one of X and Y is provided when building the predictive model using ML algorithms.




Neither X nor Y is necessary when building the predictive model using ML algorithms.

For supervised learning, the feature/factor data X and its corresponding prediction values Y are both provided. Each Y value must be associated with its corresponding row in X.

Consider a regression decision tree modeled with 1000 data points. Which combination of leaf size and bootstrap aggregating will result in the LEAST amount of overfitting?




leaf size = 1, bootstrap aggregating = false


leaf size = 50, bootstrap aggregating = false


leaf size = 1, bootstrap aggregating = true


leaf size = 50, bootstrap aggregating = true

D. Larger leaf sizes rather than small ones prevent overfitting, and bagging reduces overfitting.

Your company is investigating two different models for use in their company, kNN and linear regression. Which of the below is a false claim made by the investigator?




Linear regression is better for lots of queries because it’s faster than kNN.




kNN is better when you frequently want to add more training data.




Linear regression is better at predicting the future.




Linear regression is faster at training since it’s just a linear equation.

Linear regression is not faster at training than kNN.

What is the purpose of using entropy?




To assess errors.




To determine relevant features.




To reduce variance.




To reduce bias.

To determine relevant features.




Information gain is the amount of information acquired that increases certainty in prediction: the more information we get, the more certain we are about the outcome. Predictors (features) serve as sources of such information, and each usually provides a different amount of it. We can measure the usefulness of a predictor by how much information it provides. For this we can calculate entropy, a measure of uncertainty: the higher the entropy, the more uncertain we are in our prediction. Conversely, the lowest entropy indicates the highest information gain. Entropy changes when different features are included in or excluded from the model, and the difference between the former and current entropies is the information gain, which helps determine the most relevant features.
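
A minimal sketch (not course-provided code) of computing entropy and the information gain of a split:

import numpy as np
from collections import Counter

def entropy(labels):
    # Shannon entropy of a set of class labels
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    # parent entropy minus the size-weighted entropy of the two child splits
    n = len(parent)
    weighted_child_entropy = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_child_entropy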

Which statement below is true about Random Forests and overfitting? (Random Forests = bagging Random Trees.)




Random Forests cannot reduce overfitting associated with the base learners, in this case Random Trees.



Increasing the number of bags in a Random Forest improves predictive performance, but also increases overfitting.



Random Forests do not overfit data.



Answers A and B are both true.

Correct answer is (C). Overfitting refers to high variance between training and test data. Bagging reduces variance by averaging across base learners. While Random Tree base learners might overfit, e.g., leaf size=1 (one unique sample per branch), bagging across trees with randomly selected features will average out the variance. So (A) is false. (B) is false since the performance gains from increasing the number of bags imply lower variance, which cannot cause increased overfitting. Averaging smoothes; it cannot inject external variance and so only (C) is true.

Which option among the following best describes features of an instance based learning model such as KNN?




Faster learning, slower recall, less memory required compared to parameterized models.




Slower learning, faster recall, more memory required compared to parameterized models.


Faster learning, slower recall, more memory required compared to parameterized models.




Slower learning, slower querying, less memory required compared to parameterized models.





Faster learning, slower recall, more memory required compared to parameterized models.

As the number of bags increases, why is AdaBoost more likely to overfit than simple bagging?

As the number of bags increases, AdaBoost tries to assign more and more specific data points to subsequent learners, trying to model all the difficult examples. Thus, compared to simple bagging, it may result in more overfitting

Pros of Parametric Regression

Querying is fast, and we don't need to store the original data, so the model is space-efficient after training.

Cons of Parametric Regression

We can't easily update the model as more data is gathered; usually we have to do a complete rerun of the learning algorithm to update the model. Thus, for parametric approaches, training is slow.

Pros of Non-Parametric Regression

New evidence can be added easily, since no parameters need to be learned and adding new data points doesn't consume any additional time, so training is fast.




We avoid having to assume a certain type of model (linear, quadratic, and so on), so non-parametric approaches are suitable for fitting complex patterns where we don't really know what the underlying model looks like.

Cons of Non-Parametric Regression

We have to store all the data points so it’s hard to apply when we have a huge data set




Querying is potentially slow

Which of the following are considered "states" in the context of ML4T? (Q-Learning)




Buy


Sell


holding long


bollinger value


return from trade


daily return

holding long


bollinger value


daily return

Which of the following are considered "actions" in the context of ML4T? (Q-Learning)


Buy


Sell


holding long


bollinger value


return from trade


daily return

Buy


Sell

What are the key components of a Markov Decision Problem?

States, Actions, Transition Function, Reward Function

Explain the Transition function

The T within the environment - it’s a 3D object and it records in each of its cells the probability that if we are in state s and we take action a, we will end up in state s’




Something to note about this transition function: suppose we're in a particular state s and we take a particular action a; the probabilities of all the next states we might end up in have to sum to one.




In other words, with probability one, we’re going to end up in some new state, but the distribution of probabilities across these different states is what makes this informative and revealing
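
A small illustration (numpy assumed; the sizes are arbitrary) of T as a 3-D array whose slices are probability distributions:

import numpy as np

num_states, num_actions = 100, 2
T = np.full((num_states, num_actions, num_states), 1.0 / num_states)  # uniform placeholder model

# For every (s, a), the probabilities over next states s' must sum to one.
assert np.allclose(T.sum(axis=2), 1.0)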

Explain the reward function

An important component: if we're in a particular state s and we take an action a, it gives us a particular reward.

Q-Learning is Model-based or Model-Free

Model-Free

Why do we use discounted reward in Q-Learning?

The math turns out to be very handy and it provides nice convergence properties. It also gives the learner an incentive to use fewer steps, depending on what the discount is set to.

True/False: Q-learning is guaranteed to provide an optimal policy

True

How does the Q Learner find out which action to take in state S

All we need to do is look across all the potential actions and find out which value of Q[s,a] is maximized - so we don’t change s, we just step through each value of a and the one that is the largest is the action we should take
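
In code this is just an argmax over the row of the Q table for state s (a minimal sketch, numpy assumed):

import numpy as np

def best_action(Q, s):
    return int(np.argmax(Q[s, :]))  # the action a with the largest Q[s, a]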

Explain the Q Learning Procedure

Select Training Data


Iterate over time:


compute s
select a
observe r, s'
update Q

Test policy
Repeat until convergence

For discounted rewards, what does a low and high value of gamma mean

Low gamma: We value future rewards less


High gamma: We value future rewards more




A Gamma of 1.0 is the same as infinite horizon

For learning rate, what does a low value for alpha mean?

So a low value of alpha means that in the update rule, the previous value for Q[s,a] is more strongly preserved

Summarize the update rule for Q-Learning

Our new Q value for state s and action a, Q[s,a], is the old value multiplied by (1 - alpha), so depending on how large alpha is we weight that old value more or less, plus alpha times our new best estimate. The new best estimate is, again, our immediate reward r plus the discounted reward for all of our future actions: Q'[s,a] = (1 - alpha) * Q[s,a] + alpha * (r + gamma * max over a' of Q[s',a']).
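
As a minimal sketch of that update (tabular Q assumed to be a 2-D numpy array; the alpha and gamma values are illustrative):

def q_update(Q, s, a, r, s_prime, alpha=0.2, gamma=0.9):
    improved_estimate = r + gamma * Q[s_prime, :].max()           # immediate reward + discounted future reward
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * improved_estimate   # blend the old value with the new estimate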

The success of Q-Learning depends on exploration. How do you accomplish this?

One way to accomplish this is with randomness. We can inject it fairly easily in the step of Q-learning where we are selecting an action: we flip a coin and randomly decide whether we're going to choose a random action.




A typical way to implement this randomness is to set the probability at about 0.3 at the beginning of learning, and then over each iteration slowly make it smaller and smaller until we essentially don't choose random actions at all.
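
A minimal sketch of that selection with a decaying random-action rate (the 0.3 starting value mirrors the card; the 0.999 decay factor and the function shape are assumptions):

import numpy as np

rng = np.random.default_rng(0)

def choose_action(Q, s, num_actions, rand_action_rate):
    if rng.random() < rand_action_rate:
        return int(rng.integers(num_actions))   # explore: pick a random action
    return int(np.argmax(Q[s, :]))              # exploit: pick the greedy action

# the caller shrinks the rate each iteration, e.g. rand_action_rate *= 0.999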

What happens if you set the random action probability to 0?

You don't explore enough of the state and action space, so your QLearner may recommend less-than-ideal actions based on inadequate exploration.

Which results in faster convergence for a trading Q-Learner:




r = daily return




or




r = 0 until exit, then cumulative return that we gained across that whole trade

r = daily return. A reward at each step allows the learning agent to get feedback on each individual action it takes (including doing nothing). The other method is called a delayed reward.

Which of these factors make sense to be in a state?




Adjusted Close


Simple Moving Average SMA


Adjusted Close/SMA


Bollinger Band Value


P/E ratio


Holding Stock


Return Since Entry

Adjusted Close (NO): It is not a good factor for learning because you're not able to generalize across different price regimes, from when the stock was low to when it was high. Also, if you're trying to learn a model for several stocks at once and they each trade at very different prices, adjusted close doesn't help you generalize.




Simple Moving Average SMA (NO): The same reasoning as for adjusted close applies.




Adjusted Close/SMA (YES): If you combine AC/SMA together into a ratio that makes a good factor to use in state




Bollinger Band Value (YES)




P/E ratio (YES)




Holding Stock (YES): This is important for RL. If you're holding the stock, it may be advantageous to get rid of it; if you're not holding it, you might not necessarily want to sell. So this additional feature about what your situation is, is useful.




Return Since Entry (YES): This might help us set exit points. For instance, maybe we've made 10% on the stock since we bought it and we should take our winnings while we can.

How do you discretize each factor as part of creating a state?

Sort each factor's data and divide it into a fixed number of bins (steps), each holding roughly the same number of data points.

How to you combine multiple discretized factors to create a state?

Combine all discretized integers into a single number. Ex:




(Factor 1 * 10^0) +


(Factor 2 * 10^1) +


(Factor 3 * 10^2)
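
A minimal sketch (not the course template) covering this card and the previous one: discretize each factor into 10 bins, with thresholds chosen so each bin holds roughly the same number of points (which matches the sparse/dense threshold behavior described in the next card), then pack the three integers into one state number.

import numpy as np

def compute_thresholds(values, steps=10):
    # record the value at the end of each bin; each bin holds roughly len(values) / steps points
    sorted_vals = np.sort(values)
    step_size = len(values) // steps
    return np.array([sorted_vals[(i + 1) * step_size - 1] for i in range(steps - 1)])

def discretize(value, thresholds):
    return int(np.searchsorted(thresholds, value))  # integer in 0 .. steps-1

def to_state(f1, f2, f3):
    # combine three discretized factors (each 0-9) into a single state number
    return f1 * 10**0 + f2 * 10**1 + f3 * 10**2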

When discretizing factors, what happens to your bin thresholds when your data is sparse? What happens when your data is not sparse?

When the data is sort of sparse, our thresholds are set far apart. When the data is not sparse, these thresholds end up being closer together

What is the main advantage of a model-free approach such as QLearning?

The main advantage of a model-free approach like Q-Learning over model-based techniques is that it can easily be applied to domains where all states and/or transitions are not fully defined.

What are some major issues with using QLearning for trading?

The biggest challenge is that the reward (e.g. for buying a stock) often comes in the future - representing that properly requires look-ahead and careful weighting.




Another problem is that taking random actions (such as trades) just to learn a good strategy is not really feasible (you'll end up losing a lot of money!).

How would you write pseudocode for the QLearning process?

1. Set the gamma parameter, and environment rewards in matrix R.




2. Initialize matrix Q to zero.




3. For each episode:




Select a random initial state.




* Do While the goal state hasn't been reached.




* Select one among all possible actions for the current state.


* Using this possible action, consider going to the next state.


* Get maximum Q value for this next state based on all possible actions.


* Compute: Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]


* Set the next state as the current state.


* End Do




End For
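
A runnable rendering of the pseudocode above (a sketch, not the course-provided learner; it assumes the common toy formulation where R is a state-by-state matrix, invalid moves are marked with -1, every non-goal state has at least one valid move, and taking action a simply moves you to state a):

import numpy as np

def q_learn(R, gamma=0.8, episodes=1000, goal_state=None, rng=np.random.default_rng(0)):
    num_states = R.shape[0]
    Q = np.zeros_like(R, dtype=float)                    # 2. initialize Q to zero
    for _ in range(episodes):                            # 3. for each episode
        s = int(rng.integers(num_states))                #    select a random initial state
        while s != goal_state:                           #    do while the goal state hasn't been reached
            valid = np.where(R[s] >= 0)[0]               #    possible actions from the current state
            a = int(rng.choice(valid))                   #    select one possible action
            s_next = a                                   #    consider going to the next state
            Q[s, a] = R[s, a] + gamma * Q[s_next].max()  #    the update rule from the pseudocode
            s = s_next                                   #    next state becomes the current state
    return Q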

How does Dyna work

Dyna works by building models of T, the transition matrix, and R the reward matrix




Then after each real interaction with the world, we hallucinate many additional interactions, usually a few hundred that are used then to update the Q table

What does it mean when we say that QLearning is "model free"

it does not rely on T or R

How does Dyna augment the QLeaner?

We add some logic that enables us to learn models of T and R

Why would Dyna be desirable in a trading environment

Dyna-Q operations are very cheap compared to interacting with the real world: we can leverage the experience we gain in the Q-learning part from interacting with the real world, then update our model more completely before we step out and interact with the real world again.




Learning from hallucinations is cheaper than learning from making real-world trades.

When using Dyna, at what point does the QLearner go back to interacting with the "real world"

After we have iterated through Dyna enough times (e.g., 100 or 200 times), we return to interacting with the real world at the "observe s" step.

How does Dyna hallucination work?

First we randomly select an s




Second we randomly select an a




Then we infer our new state s’ by looking at T




And we infer our immediate reward r by looking at big R or the R table
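
A minimal sketch of one hallucinated experience, assuming learned model arrays T (transition probabilities) and R (expected rewards) and a tabular Q; alpha and gamma are illustrative:

import numpy as np

def hallucinate(Q, T, R, rng, alpha=0.2, gamma=0.9):
    num_states, num_actions = R.shape
    s = int(rng.integers(num_states))                   # randomly select s
    a = int(rng.integers(num_actions))                  # randomly select a
    s_prime = int(rng.choice(num_states, p=T[s, a]))    # infer s' from the model T
    r = R[s, a]                                         # infer r from the model R
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * Q[s_prime].max())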

How do we learn T in Dyna?

Remember that T[s,a,s’] represents the probability that if we are in state s, take action a, we will end up in state s’




To learn a model of T we’re going to observe how these transitions occur




So in other words, we'll have experience with the real world, we'll get back an s, a, s', and we'll just count how many times it happened




We introduce a new table called Tcount, or Tc. We initialize all of our Tc values to a very, very small number; if we don't do this, we can end up in a divide-by-zero situation




Then we begin executing Q learning and each time we interact with the real world we observe s, a and s’ and then we just increment that location in our Tcount matrix




So every time we see it transition from s to s’ with action a, boom we add one and that’s pretty simple

In Dyna, how do you evaluate T in terms of Tc?

You simply need to normalize the observed count Tc[s,a,s'] of landing in next state s' by the total count of all transitions from state s on action a, i.e. summed over all possible next states.




Remember Tc is just the number of times each of these has occurred

In Dyna, how do you model R?

Remember when we execute an action a in state s, we get an immediate reward, r




R[s,a] is our expected reward if we’re in state s and we execute action a (this is our model)




r is our immediate reward when we experience this in the real world


We want to update this model every time we have a real experience and it’s a simple equation, very much like the Q table update equation:




R'[s,a] = (1 - alpha) * R[s,a] + alpha * r
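
Putting the last few Dyna cards together, a minimal sketch of the model updates (array shapes and the alpha value are assumptions):

import numpy as np

num_states, num_actions = 100, 2
Tc = np.full((num_states, num_actions, num_states), 0.00001)  # tiny initial counts avoid divide-by-zero
R = np.zeros((num_states, num_actions))
alpha = 0.2

def update_model(s, a, s_prime, r):
    Tc[s, a, s_prime] += 1                           # count the observed transition
    R[s, a] = (1 - alpha) * R[s, a] + alpha * r      # R'[s,a] = (1 - alpha) * R[s,a] + alpha * r

def T(s, a):
    return Tc[s, a] / Tc[s, a].sum()                 # normalize the counts into probabilities over s'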

What is a trading option?

An option is a legal contract which gives you, as the buyer, the right but not the obligation to buy (or, for a put, to sell) the underlying stock at a specific price on or before a specific expiration date.

What is a call option?

buying the right to buy the stock

What is a put option?

buying the right to sell the stock

What is the strike price?

The price at which we can reserve the right to buy/sell the specific stock

How many shares will 1 option control?

100 shares

Whats the main advantage of buying an option over buying just regular shares?

You're not committing as much of your money up front.

What are some downsides of buying an option?

First, the premium is lost money; it's gone. You pay it immediately to another person when you acquire the option contract, and you don't get it back no matter what happens to the stock




Second, options have expiration dates; this is a big one. You're adding another layer to the bet that you're making in the stock market




With options you don’t own the stock





What is the intrinsic value of the stock when it comes to options trading?

The difference between the option strike price and the underlying spot price for an in-the-money option is called the intrinsic value of the stock

What is Theta?

Theta, or time decay, is the rate at which an option loses its time value; that is, it's the first derivative of the time value with respect to time.

How does time affect Theta?

The time value changes more rapidly as the expiration date approaches, and the time decay goes up




Then when you get to the expiration date, the time value of course approaches zero, and all of the option prices approach their intrinsic value, because at that point you would have to buy the option, exercise it, and sell immediately, just as it's about to expire




The last two weeks is when time decay goes the fastest.

How does a Profit/Loss curve look different between a Call (buy) and a Put (sell)?

In a call, the curve stays flat until the strike price, then goes up. In a put, the curve goes down until the strike price, then goes flat

Whats the difference between a Call and a Put in terms of Maximum Profit?

In a Put, maximum profit is no longer unlimited, because the lowest that the stock could go is zero dollars a share.

What is a covered call?

A “covered call” is an income-producing strategy where you sell, or “write”, call options against shares of stock you already own. Typically, you’ll sell one contract for every 100 shares of stock. In exchange for selling the call options, you collect an option premium. But that premium comes with an obligation. If the call option you sold is exercised by the buyer, you may be obligated to deliver your shares of the underlying stock.




Fortunately, you already own the underlying stock, so your potential obligation is “covered” – hence this strategy’s name, “covered call” writing

What are the possible outcomes of a covered call?

1) Stock ends up above strike price before expiration: We're forced to sell the stock but we can still make a bit of money




2) Stock rises but stays below the strike price: We make money and the option is not exercised, so we're not forced to sell the stock




3) Stock goes down: We lose money but option is not exercised.

What is a married put?

When you buy a stock and simultaneously buy a put

What are the possible outcomes of a married put?

1) If the stock goes up, then we lose from our profit the amount of the premium we paid, basically for insurance




2) If the stock goes down, then our loss during that downturn is going to be capped because once we hit the strike price, dollar-for-dollar your put will gain at the same rate that the stock loses value.





Why would you do a married put instead of just selling your stock?

One of the main reasons is that there can be a lot of tax implications from closing your stock position. Actually selling your stock can trigger taxable gains or losses, a wash sale, or other situations we may not want to be in.

What's the benefit of doing the "Butterfly" strategy of options trading?

You profit the most when the stock doesn't move very far from the current price

What's the advantage of using a Random Forest over a decision tree built using information based construction?

Random Forests prevent overfitting

In MC3P1, why did you need to shuffle the data?

testlearner.py splits the data set up with the first 60% as the training set and the last 40% as the test set. This is generally not a good practice unless you know that the data points are not in any kind of special order to begin with. Usually, what you would do is shuffle the data points (i.e., the rows) before doing the training/test split.
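
A minimal sketch of doing that shuffle before the split (numpy assumed; this is not the provided testlearner.py):

import numpy as np

def shuffled_split(X, Y, train_frac=0.6, rng=np.random.default_rng(0)):
    idx = rng.permutation(len(X))     # shuffle the row order
    cut = int(train_frac * len(X))
    train, test = idx[:cut], idx[cut:]
    return X[train], Y[train], X[test], Y[test]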

What do we need to do in order to convert a regression decision tree to a classification decision tree?

Take mode of values instead of mean of values
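
For illustration (the leaf values below are made up): a regression leaf returns the mean of its Y values, while a classification leaf returns the mode.

import numpy as np
from collections import Counter

leaf_y = np.array([1, 1, -1, 1, 0])                                # hypothetical labels at one leaf
regression_prediction = leaf_y.mean()                              # mean -> 0.4
classification_prediction = Counter(leaf_y).most_common(1)[0][0]   # mode -> 1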