ECON 575 – Data Analysis
Problem Set
10 Points
All answers to problem set questions must be typed so they can be reviewed by Turnitin.
Problem 1
Part of the challenge of data mining text is that the sequence and context of words matter in communication. Consider the use of the word “good” in a movie review. Briefly explain how the word “good” could be used to convey both positive and negative feelings about a movie, why this highlights the importance of context, and whether you believe there is a way to work around this problem.
Problem 2
This module provided an overview of a handful of other commonly used data mining techniques.
Consider a problem from your current or a past job, a hobby, or an interest that would make for a good application of one of the following techniques:
• Text-based data mining
• Co-occurrence grouping and associations
• Profiling
• Link prediction
Describe why this would be an appropriate example of a problem that can be solved with one of the methods above and what the use of the results of this analysis would be.
Please do not choose a hypothetical example, such as something from the textbook or an example from the slides. It should be something with which you have personal experience (yes, this problem is like Problem 2 from Problem Set 2).
Problem 3
Remember that American hotel chain you were working for back in Problem Set 2? Well, despite all your job hopping since then, you have been rehired by the hotel chain to take another crack at improving their booking and profitability. Armed with more data mining knowledge than ever before, you decide to once again create a classification decision tree model to predict cancelations, only this time you are bringing in the big guns: ensemble methods.
Start by uploading the hotel_bookings.csv data to BigML if you deleted it. It is the same data set from week 2. As a reminder, the data set contains the following information:
Target variable:
• is_canceled: whether the reservation was canceled
Attributes:
• hotel_type: whether the hotel is a “resort” or “city” hotel
• summer: whether the reservation was made for the summer season or not
• children: whether children are listed on the reservation
• previous_cancelations: whether the person who made the reservation has canceled before
Start by creating an 80/20 training/test split (it can be random this time), then use the training set to create 3 different tree induction models:
1. A regular single decision tree under the model option
2. An ensemble of trees using random forests (which BigML calls “decision forests”)
3. An ensemble of trees using boosting (which BigML calls “boosted trees”)
You can leave the default options enabled for each model (number of nodes, models, iterations, etc.). After you have run your 3 models, evaluate each model on the test set.
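For intuition only, a random 80/20 training/test split can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how BigML implements its split: the function name train_test_split, the seed value, and the stand-in rows are all hypothetical, and the real rows would come from hotel_bookings.csv.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle rows reproducibly and split them into (train, test)."""
    rng = random.Random(seed)   # fixed seed so the split can be repeated
    shuffled = list(rows)       # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical stand-in rows; the real data come from hotel_bookings.csv.
bookings = [{"booking_id": i} for i in range(10)]
train, test = train_test_split(bookings)
print(len(train), len(test))  # 8 2
```

Because every row lands in exactly one of the two sets, the models are evaluated only on reservations they never saw during training.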
Finally, using what you have learned in the class these past couple of weeks, describe and compare the performance of each model and comment on whether their relative performance met your expectations. (Note: I am intentionally not telling you exactly what to report and compare. Think about what you would want to communicate if you were choosing from among these models and presenting this information.)
Instructions:
This problem set is worth 10 points. Answer each part of every question in complete detail. If a question asks you to provide an explanation for your answer, make sure you provide a full explanation. You can receive partial credit on questions, so err on the side of more detail rather than less.
This seventh problem set is a mixture of case-study-type questions and a BigML problem that introduces you to ensemble methods applied to classification trees. It uses the same hotel_bookings.csv data from Module 2, but I’m providing a link to the data below as well.
Submission Instructions:
All answers must be typed so that they can go through a Turnitin review process if needed. Your submitted assignment must be either a Word document (doc or docx) or a PDF; otherwise it is too difficult to leave comments and feedback, and it might be rejected by Turnitin.