Week 3 Variables and the Modelling Process

One of my colleagues replied to me on Twitter, after I commented that I'm not the quickest of readers and was struggling a little to get through so many academic journals this week, with "If only you could download them to a spreadsheet Tom. You can read, interpret and absorb a Microsoft Excel in seconds… it's like a superpower you have", and that sums me up perfectly. In fact, I think academic journals like these are my Kryptonite at times.

This really is me: I can absorb Python code without too much issue, but I find academic journals, and at times a lot of other written material, can be overly wordy, and I have never been particularly great at reading between the lines in these types of documents.

This week started with looking at KDD (Knowledge Discovery in Databases), and specifically at the nine steps it involves, with "data mining" being just one of those steps, despite many people believing that data mining is the main element. The following open-access journal article goes into far more detail than I can, but be warned, it is a bit of a read.

Article link: Fayyad, Usama, Gregory Piatetsky-Shapiro, and Padhraic Smyth (1996), 'From data mining to knowledge discovery in databases', AI Magazine 17 (3), pp. 37–54. doi: 10.1609/aimag.v17i3.1230.

The nine steps are:

  • First – developing an understanding of the application domain and the relevant prior knowledge
  • Second – creating a target data set: selecting a data set
  • Third – data cleaning and pre-processing
  • Fourth – data reduction and projection
  • Fifth – matching the goals of the KDD process (step 1) to a particular data-mining method
  • Sixth – exploratory analysis and model and hypothesis selection
  • Seventh – data mining
  • Eighth – interpreting mined patterns
  • Ninth – acting on the discovered knowledge

The first use of this was a short knowledge-check quiz, followed by a return to the Edinburgh Castle example, asking how these steps could be implemented there. My answers are below.

First – developing an understanding of the application domain and the relevant prior knowledge

Look for any previous studies of Edinburgh Castle that could be used as a basis for this one. It is also worth looking through any studies or market research the castle has already done; there may be links and elements in each of these that give us something to start with. I would also suggest speaking to the current staff: it is usually the case that staff can see trends and patterns, and while they may not have hard evidence for them, those trends and patterns give a good understanding of the data.

Second – creating a target data set: selecting a data set

We are looking to predict visitor numbers, so any ancillary data, such as shop sales data, is not needed at the moment. We would look to reduce our working data to just the data that helps us predict visitors: daily sales logs, visitor profiles, weather information, local holiday dates for visitors' home countries, and so on.

Third – data cleaning and pre-processing

Here we should ensure that only the data we need is present, and that it is clean: each visitor log is appended with data on the weather and on whether the day of the visit was a local holiday in the visitor's home country, and all the characteristics of the visitor (date of birth, gender, country, etc.) are present and in the expected format. Data should be connected correctly and verified where possible.
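A minimal sketch of that kind of cleaning in pandas might look like the following. The column names and values here are entirely made up for illustration, not the castle's real data:

```python
import pandas as pd

# Hypothetical raw visitor log -- column names are illustrative only
raw = pd.DataFrame({
    "date": ["2023-07-01", "2023-07-01", "2023-07-02", None],
    "country": ["Germany", "france", "germany", "Spain"],
    "dob": ["1990-05-12", "bad-value", "1985-01-30", "1978-11-02"],
})

clean = raw.dropna(subset=["date"]).copy()           # drop rows missing the visit date
clean["date"] = pd.to_datetime(clean["date"])        # enforce the expected date format
clean["country"] = clean["country"].str.title()      # normalise country spelling
clean["dob"] = pd.to_datetime(clean["dob"], errors="coerce")  # invalid dates become NaT
```

The `errors="coerce"` choice keeps the row but flags the bad value as missing, so it can be reviewed rather than silently dropped.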

Fourth- data reduction and projection

At this point we should look at each of the variables, individually or in groups, to see whether they correlate with the number of visitors or are just nice to see. Plotting each variable separately against visitor numbers helps with the manual process of spotting patterns in each individual variable.
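That per-variable plotting loop is simple to sketch in matplotlib. The variables and daily figures below are invented for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # render to file, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical daily figures -- values are made up for illustration
df = pd.DataFrame({
    "visitors":        [1200, 950, 1400, 1100, 1600],
    "rainfall_mm":     [0.0, 5.2, 0.0, 2.1, 0.5],
    "tickets_presold": [300, 250, 420, 280, 500],
})

# One scatter plot per candidate variable, each against visitor numbers
for column in ["rainfall_mm", "tickets_presold"]:
    fig, ax = plt.subplots()
    ax.scatter(df[column], df["visitors"])
    ax.set_xlabel(column)
    ax.set_ylabel("visitors")
    fig.savefig(f"{column}_vs_visitors.png")
    plt.close(fig)
```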

Fifth is matching the goals of the KDD process (step 1) to a particular data-mining method

Based on the type of data we are looking at, a linear regression approach to data mining would appear beneficial, as we are looking to predict visitor numbers from sales, weather and so on.
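As a rough sketch of the idea, a linear regression can be fitted with a plain least-squares solve in NumPy (libraries like scikit-learn wrap the same idea). The features and visitor counts here are toy numbers I've made up:

```python
import numpy as np

# Hypothetical training data: [rainfall_mm, is_local_holiday] per day
X = np.array([
    [0.0, 1],
    [5.0, 0],
    [2.0, 1],
    [8.0, 0],
    [1.0, 0],
], dtype=float)
visitors = np.array([1500, 900, 1300, 700, 1000], dtype=float)

# Add an intercept column and solve the least-squares problem
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, visitors, rcond=None)

# Predict visitor numbers for a dry local-holiday day
pred = coef @ [1.0, 0.0, 1.0]  # roughly 1446 for this toy data
```

On this toy data the fitted coefficients behave as you would hope: more rain pushes the prediction down, a local holiday pushes it up.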

Sixth – exploratory analysis and model and hypothesis selection

At this point we should look at the correlation between the variables and the visitor numbers to find ones with strong correlations (e.g. rainy weather with high numbers, a local holiday in Germany with higher numbers) to allow a hypothesis to be built from this analysis.
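In pandas, that correlation check is a one-liner once the data is in a DataFrame. Again, these values are invented purely to show the shape of the step:

```python
import pandas as pd

# Hypothetical daily data (illustrative values only)
df = pd.DataFrame({
    "visitors":    [1200, 900, 1500, 1100, 1400],
    "rainfall_mm": [1.0, 6.0, 0.0, 3.0, 0.5],
    "shop_sales":  [510, 400, 620, 480, 590],
})

# Pearson correlation of every variable against visitor numbers
corr = df.corr()["visitors"].drop("visitors")
```

Variables with correlations near +1 or -1 are candidates for the hypothesis; those near 0 are probably just nice to see.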

Seventh – data mining

Use the processes we have landed on in the previous steps to mine the data, revisiting any that require it, and always being willing to take a second approach to, or a second view of, any data.

Eighth – interpreting mined patterns

As in the last step, look at the patterns and whether they are viable for this purpose, and look at other ways of verifying the data through different processes to check the validity of the hypothesis.

Ninth – acting on the discovered knowledge

Talk through the patterns and visualisations with the key staff at Edinburgh Castle, exploring their viewpoints and seeing whether the results match the patterns and trends the staff already believed existed, but which have now been verified.

I found this element tough; taking an academic journal and using it to apply a methodology to an example in this way is not my strong point, but I hope I've managed to capture the elements needed.

Following on from this was the first assessed element of the course: three further academic journals were provided for reading and absorption, and then an assessment of the knowledge gained from them.

I had a bit of difficulty with one question here, as my reading of the journal had me strongly believing that a particular method had been used. Getting the answer wrong and then seeing the explanation has at least got me seeing the way forward better, and I'm happy with 80% for my first assessment.

But wait, that's already been a tough week, hasn't it? Surely there can't be more? But there is: this time some additional training and coding in Python, looking at variables and plotting visualisations with bar charts, pie charts, scatter graphs, box plots and more to test my skills again. This is an area I enjoy, though, and I use visualisations often in my day job at the college, mainly using Power BI, so I'm pretty comfortable here.
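For anyone curious what that kind of Python plotting exercise looks like, here is a small sketch producing a bar chart and a box plot side by side, using monthly figures I've made up:

```python
import matplotlib
matplotlib.use("Agg")  # render to file, no display needed
import matplotlib.pyplot as plt

# Made-up monthly visitor counts for illustration
months = ["May", "Jun", "Jul", "Aug"]
visitors = [30000, 42000, 55000, 58000]

fig, (bar_ax, box_ax) = plt.subplots(1, 2, figsize=(8, 3))
bar_ax.bar(months, visitors)
bar_ax.set_title("Visitors per month")
box_ax.boxplot(visitors)
box_ax.set_title("Spread of monthly visitors")
fig.savefig("week3_charts.png")
plt.close(fig)
```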

Now on to week 4, which comes loaded with THREE further assessments! I'm on leave from Friday the 11th for a week, so next week's instalment might be a day or two into the following week, as I certainly don't want to rush through the assessments.
