Teaching Data Science

This fall is my third time teaching CS533 (Introduction to Data Science). I co-developed the class with Casey Kennington and taught the first offering in 2017.

This intro class is one I had long wanted to create (when I was on the job market the first time around, it was my answer to ‘what class would you like to create?’), but I haven't yet really been able to achieve what I wanted with it. This fall I am taking a step back and rebuilding it. I hope it's successful.

The syllabus, class materials, etc. are all publicly available.1

This post discusses my design goals and parameters for the class, and its overall structure. I hope to expand on some specific aspects of it in future posts.

Goals

One of the major goals of this class is to lay a foundation for students' further data science education that is grounded in goals, questions, and principles, rather than tools, while simultaneously giving them the technical foundation in commonly-used tools that they need in order to do further data science work.

This creates an interesting dance, because students need to learn the tools (in our case, the PyData stack), but do so in a way that makes it clear that ‘learning data science’ and ‘learning Pandas/SciKit/TensorFlow’ are different things. I try to do that by leading with talking about defining questions and connecting them to data, with the following working definition of data science:

Data science is the use of data to provide quantitative insights on questions of scientific, business, or social interest.

I have deliberately crafted this definition to try to make clear two key points:

  • Data science is in service of other interests, and should be evaluated on its ability to further the goals for which it is employed. Doing things with data for no particular purpose is just a fun exercise.

  • Data science provides quantitative insight, not the final, definitive answer to the exclusion of all other sources of knowledge. Later in the semester, I plan to contextualize data science specifically, and quantitative knowledge generally, in a broader space of sources of knowledge, and make it clear that while this class is focused on quantitative methods, they are by no means the only methods, and that qualitative, analytical, and critical methods provide much value.

I also want students to think pervasively about issues of ethics, representation, bias, and justice in their data science work. I am planning a module specifically on bias and social effects, but if it is just one module added on to everything else, this goal will completely fail. I am therefore building in concern for the people represented and affected by data from the very beginning of the course.

Flipping

This class is a prime candidate for a flipped-classroom design. The classroom setting is perfect to practice the critical thinking and iterative refinement needed to turn goals into questions into insights, while lectures on how to write split/apply/combine operations, estimate parameters, and train models can be consumed independently. I tried to flip it the first time I taught, but it did not work because I didn't put enough time into creating and curating the out-of-class resources needed to effectively support the flip.

Under COVID, though, video lectures have become my go-to strategy. Due to my experience with the RecSys MOOC, I'm already familiar with pre-recording lectures, and it has good properties for enabling students to more easily catch up on material they miss if they're out with the roundboi for a couple of weeks. When we suddenly went online in the spring, I used asynchronous video + open office hours on Zoom during class time for my databases class.

COVID is therefore a forcing function to make me do the asynchronous content delivery that I should have done the first time I tried to flip the class. I'm designing the material with the hope that I can keep using most of it when we return to normal classrooms. This fall, I'm doing a ‘remote/online flip’: content delivery is through the videos, Tuesday class time will have structured discussion and activities via Zoom, like we would in the classroom, and Thursdays are for open Q&A.

COVID Design Parameters

Mild cases of COVID seem to usually take the form of ‘I cannot do anything for two weeks’. I take a universal design approach to my class structures as much as possible, so I adopted the design parameter that a student needs to be able to miss 2–3 weeks of class and assignments with negligible impact on grade and learning outcomes, without needing any special accommodations. If they have a more severe case, or other health, family, or personal issues arise that remove them from the class for longer, then I expect to need to work with them individually on a plan to either complete a sufficient subset of the class or take an incomplete.

‘I got COVID’ is, of course, not the only way the pandemic will impact my life and my students’ lives. General stress, disruption, Internet access, family needs, etc. all impact us in normal times, and those effects will be exacerbated this year. I usually take a universal design approach to those as well, through late work policies that provide accommodation for students without demanding proof.

I document COVID-relevant policy design in my syllabus, but in summary, here are the things that I'm trying to use to provide a baseline level of flexibility:

  • Pre-recorded videos so they can catch up on content at their own pace.
  • Two options for the synchronous class time.
  • A refined version of my late-day policy, to give flexibility when something comes up that prevents assignment submission.
  • A makeup exam for a missed midterm.
  • Drop-lowest-grade policies for assignments and synchronous participation (2/8 assignments, and 5/15 weekly participation checkpoints).
  • A steady, predictable workload, to the extent I can accurately estimate required effort, so it's easier to plan and negotiate class and the rest of life.
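
To make the arithmetic of the drop-lowest policies concrete, here is a minimal Python sketch. It assumes equal weighting, that a missed item scores zero, and that the average is taken over whatever remains after the drops; the function name and the scores are hypothetical, and the actual grade computation in the syllabus may differ.

```python
def average_after_drops(scores, n_drop):
    """Mean of the scores that remain after discarding the n_drop lowest.

    Assumption for illustration: every item carries equal weight.
    """
    if n_drop < 0 or n_drop >= len(scores):
        raise ValueError("n_drop must be at least 0 and less than len(scores)")
    kept = sorted(scores)[n_drop:]  # ascending sort, so the lowest come first
    return sum(kept) / len(kept)


# Hypothetical student who missed two of the eight assignments (scored as 0).
assignments = [90, 85, 0, 95, 0, 88, 92, 80]
assignment_avg = average_after_drops(assignments, n_drop=2)

# Hypothetical participation record: present for 10 of 15 weekly checkpoints.
participation = [1] * 10 + [0] * 5
participation_avg = average_after_drops(participation, n_drop=5)

print(f"assignments: {assignment_avg:.1f}, participation: {participation_avg:.2f}")
```

Under these assumptions, both missed assignments and all five missed checkpoints fall inside the drops, so a student who is out for a couple of weeks keeps the average they earned on the work they did complete.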

Structure

The first week is all built out. A given week will have several video lectures, broken down into small-ish pieces and totalling around 90–100 minutes per week. It will also have some reading; sometimes a paper, sometimes online resources. For some material, there will be multiple different resources that students can pick from.

Each week is going to have a page that has all of that week's content in one place, with the material needed for Tuesday's class clearly marked.

I'm also providing a Resources section that contains software instructions and links to general reference readings, and will include Jupyter notebooks that provide info on various programming techniques we will be using.

I am forgoing a project in favor of regularly-paced assignments. The first assignment is a warm-up with no original intellectual work, so that when they do need to figure out course concepts, they aren't also fighting with how to run and submit a notebook.

Technical Infrastructure

I'm using Piazza for discussions and Blackboard for submitting assignments. I looked at OKpy for assignments, but it didn't do the one thing that would push me off of Blackboard: provide a good interface for annotating Jupyter notebooks.

I'm building out course content in a static site hosted on Netlify, with video on Panopto (provided by Boise State) and slides hosted on OneDrive. I considered self-hosting video and playing with Video.js, but I couldn't get it to work very reliably and it was going to be quite a bit of effort to provide videos at multiple quality/bandwidth tradeoff points. YouTube won't let me embed videos without some form of recommendations in them. I'm recording and editing videos with Camtasia, using a Blue Yeticaster for audio.

Privilege

There are a couple of things that make this substantially more workable than it might otherwise be.

First, I'm navigating COVID from a maximally privileged position.

Second, Boise State is giving departments a good deal of autonomy for how they meet their students' needs, and in my department that means individual faculty get to decide for the most part. This means that I have been able to plan on this course design from the beginning of the summer, rather than trying to prepare in a holding pattern while others decided how I would offer my class. The immigration stuff in July did throw a wrinkle into my plans, but that meant I was trying to find a way to have a meaningful, minimal-risk in-person component as a hybrid design, not change the whole plan.

Wrapup

COVID is forcing me to make some changes to this class that I think are going to make it better even if we are able to return to some kind of normalcy in the next year or two.

I expect that I'll need to adjust and adapt as the semester progresses, but it's a plan, and I think it's a pretty workable plan, all things considered. We'll see how it goes!


  1. They're currently under an all-rights-reserved license, but I am considering opening them up as an OER if the design works out.