Yay Reproducibility
One of my projects this year has been to refactor and upgrade our Book Data Tools to be easier to work with. Along the way I found some bugs in the original data integration, and the process of finding these bugs and assessing their impact forms a useful case study for some of the principles I teach in my data science class.

Current State

The currently published version of the tools uses PostgreSQL for data storage and integration, which is pretty good in principle but has a significant impedance mismatch with my use of DVC to automate the data processing and integration pipeline. It has an ugly set of hacks involving status files and two DVC stages for almost every logical stage of the data integration (because my other egregious hack of monkey-patching DVC to support a custom URI scheme, tracking PostgreSQL data status as if it were external data, broke with DVC upgrades). The result is a repository layout and set of program interactions that are very non-obvious, so it’s harder than it needs to be for others to extend the tools. It’s also harder than it needs to be to manage different versions of the data: we have to maintain separate PostgreSQL databases, keep them straight (and synchronized with Git branches), and it all takes a lot of space. PostgreSQL has served us well, but it’s time to move on.

Upgrade Pathway

In the time since I built the original data integration, the Apache Arrow project has made significant progress on the Rust implementation of Arrow, along with the DataFusion query engine that supports SQL queries (with a lot of PostgreSQL syntax) over Parquet files. I have used these to rebuild the integration primarily in Rust and SQL, using custom Rust code for data processing and DataFusion to implement many of the integrations. This fits much better with DVC’s file-based input/output model: each stage reads input files (CSV, compressed JSON, MARC, an intermediate Parquet file, or whatever) and produces output in CSV or Parquet format. We can use DVC as intended, without needing to document our own weird pipeline quirks on top of the data structures and integration logic. The result is much faster (2 hours instead of 12), takes less space, and removes both the impedance mismatch and some of the installation complexity, since users no longer need a PostgreSQL installation. Extending the tools usually requires writing some new Rust, but we’re generally using our dependencies as they’re intended to be used, which should help a lot with that learning curve.
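To give a flavor of the new model, here is a minimal sketch of what a DataFusion-backed stage can look like. This is not the actual Book Data Tools code; the table and file names are made up for illustration, and the exact write_parquet signature varies a little between DataFusion versions:

    // Sketch of a file-in/file-out stage (illustrative names, not the real pipeline).
    // Assumes the `datafusion` and `tokio` crates.
    use datafusion::dataframe::DataFrameWriteOptions;
    use datafusion::error::Result;
    use datafusion::prelude::*;

    #[tokio::main]
    async fn main() -> Result<()> {
        let ctx = SessionContext::new();

        // Register the stage's input file as a queryable table.
        ctx.register_parquet("gr_books", "goodreads/gr-book-ids.parquet",
                             ParquetReadOptions::default()).await?;

        // The integration logic itself is ordinary SQL.
        let df = ctx
            .sql("SELECT book_id, work_id FROM gr_books WHERE work_id IS NOT NULL")
            .await?;

        // Write the stage's output file where DVC expects to find it.
        df.write_parquet("goodreads/gr-book-works.parquet",
                         DataFrameWriteOptions::new(), None).await?;
        Ok(())
    }

Because each such program just maps input files to output files, DVC can track the stage directly, with no extra bookkeeping on our side.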
Spotting Problems

One of the things I teach my students about data integration is to always track the coverage, or success rate, of each link in their integration pipeline. I do that in this project: for each source of book or rating data, I measure the fraction of records that fail to link at each stage of the pipeline. This is all reported in the statistics notebook, and allows us to have a good view of the data coverage. I watched these statistics closely while I was reworking the code, and ensuring we had (approximately) the same coverage was one of the major criteria for making sure I hadn’t broken the logic.
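As a purely illustrative example, a check of this kind boils down to counting how many records at a given stage failed to link. With DataFusion it is a small SQL aggregate, along the lines of the sketch below (the table and column names are hypothetical, not the project’s real schema):

    use datafusion::error::Result;
    use datafusion::prelude::*;

    // Hypothetical coverage check: what fraction of interaction records failed
    // to link to a known book cluster? Names are illustrative only.
    async fn report_unlinked(ctx: &SessionContext) -> Result<()> {
        let df = ctx
            .sql(
                "SELECT COUNT(*) AS n_total, \
                        SUM(CASE WHEN cluster IS NULL THEN 1 ELSE 0 END) AS n_unlinked, \
                        AVG(CASE WHEN cluster IS NULL THEN 1.0 ELSE 0.0 END) AS frac_unlinked \
                 FROM gr_linked_interactions",
            )
            .await?;
        df.show().await // print the coverage summary
    }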
When I was pretty much done, though, I had a problem: the GoodReads coverage was broken. All my other coverages (Library of Congress, BookCrossing, and Amazon) were fine — slightly better, actually, because I had improved some of the ISBN parsing and resolution logic, and had updated the source data to use more current versions of OpenLibrary and VIAF — but GoodReads had gone from about 58% unresolved books to 68%.

Finding the Problem

I fixed two or three other bugs (including working around an annoying bug in the Rust implementation of Parquet) to get there, but the GoodReads coverage wouldn’t budge. After digging around for a while (the details are rather boring), I finally found the problem: the original PostgreSQL integration code had a bug in the GoodReads interaction data integration, where I used an inner join where I should have used a left join. The result is that if a GoodReads book did not have an ISBN, its ratings and interactions were dropped entirely, instead of being included as interactions with unknown books.
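To make the difference concrete, here is an illustrative reconstruction of the two join shapes in DataFusion-style SQL. The table and column names are invented for the example and are not the project’s actual schema:

    use datafusion::dataframe::DataFrame;
    use datafusion::error::Result;
    use datafusion::prelude::*;

    // Illustrative reconstruction of the fix (not the original code). An inner
    // join drops every interaction whose book never got an ISBN-based link;
    // a left join keeps those rows, with a NULL cluster marking an unknown book.
    async fn link_interactions(ctx: &SessionContext) -> Result<DataFrame> {
        // Buggy shape (dropped interactions for books with no ISBN link):
        //   SELECT ix.user_id, bl.cluster, ix.rating
        //   FROM gr_interactions ix
        //   JOIN gr_book_link bl ON ix.book_id = bl.book_id
        //
        // Corrected shape:
        ctx.sql(
            "SELECT ix.user_id, bl.cluster, ix.rating \
             FROM gr_interactions ix \
             LEFT JOIN gr_book_link bl ON ix.book_id = bl.book_id",
        )
        .await
    }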
A lot of these are either older books or Kindle exclusives (or editions that haven’t been linked with their works yet); it was a rather large number of books, but they don’t seem to be among the more active ones. I verified this was the cause by replicating the erroneous join logic in the new code, and my coverage jumped back up to where it was supposed to be. I’m glad I found the problem, but disappointingly, it was due to a bug in the version that supported the published results. Once I had verified the bug, I reverted the change, because including all the records is the correct decision for our purposes.
How Big is the Problem?

This data integration is the data set backing our signature book gender paper. Since the bug only affected books that didn’t have links elsewhere in the database, and most of the analysis focuses on books for which we have data on the author’s gender, I expected it wouldn’t have much impact on the published results: it would affect our GoodReads coverage statistics, and the collaborative filters would be trained on a larger data set (one that includes interactions with these unknown books). But we needed to make sure.
The paper is supported by reproduction scripts that use DVC to automate the entire experiment process: run dvc repro on a large machine, and a day or three later you’ll have the results. This has had quite a few advantages, including allowing us to do a clean re-run and report the compute resources needed to reproduce the experiments in the paper. My team is also using this repository as the basis for further research, so I have been working on upgrading it to use the new version of the data integration instead of pulling from the PostgreSQL database.

So, I imported the new data into the book gender experiment and re-ran it. As I hoped, the results stayed the same. The coverage plots changed to reflect the new data coverage, but nothing else changed beyond the expected variance due to randomization. Sigh of relief.

Lessons Learned

I spend a lot of time polishing data and experiment pipelines — sometimes too much time — but the result here was finding an interesting bug, correcting it, and being able to verify that it didn’t invalidate any of our experimental results. There are many reasons why reproducible pipelines are important, and this is one of them. It takes time and effort to achieve, but the resulting improvements in robustness and confidence in results are worth significant investment, in my opinion.
Next Steps

I still have some more polishing and documentation to do before releasing the new code, but I plan to update the public version of the Book Data Tools to use the new system shortly. I’ll also be pushing a new version of the scripts for the paper that reproduce its results using the current integration, as an example of how to use it. Stay tuned!